import PULS_20160108

Stricted
2018-03-13 20:29:02 +01:00
parent a8d97b1bd0
commit 6fa3eb70c0
6257 changed files with 2910927 additions and 5489 deletions
-1
@@ -1,6 +1,5 @@
*.xml
*.ps
*.pdf
*.html
*.9.gz
*.9
+121
@@ -0,0 +1,121 @@
=============
A N D R O I D
=============
Copyright (C) 2009 Google, Inc.
Written by Mike Chan <mike@android.com>
CONTENTS:
---------
1. Android
1.1 Required enabled config options
1.2 Required disabled config options
1.3 Recommended enabled config options
2. Contact
1. Android
==========
Android (www.android.com) is an open source operating system for mobile devices.
This document describes configurations needed to run the Android framework on
top of the Linux kernel.
To see a working defconfig, look at msm_defconfig or goldfish_defconfig,
which can be found at http://android.git.kernel.org in kernel/common.git
and kernel/msm.git.
1.1 Required enabled config options
-----------------------------------
After building a standard defconfig, ensure that these options are enabled in
your .config or defconfig if they are not already. This list is based on
msm_defconfig.
You should keep the rest of the default options enabled in the defconfig
unless you know what you are doing.
ANDROID_PARANOID_NETWORK
ASHMEM
CONFIG_FB_MODE_HELPERS
CONFIG_FONT_8x16
CONFIG_FONT_8x8
CONFIG_YAFFS_SHORT_NAMES_IN_RAM
DAB
EARLYSUSPEND
FB
FB_CFB_COPYAREA
FB_CFB_FILLRECT
FB_CFB_IMAGEBLIT
FB_DEFERRED_IO
FB_TILEBLITTING
HIGH_RES_TIMERS
INOTIFY
INOTIFY_USER
INPUT_EVDEV
INPUT_GPIO
INPUT_MISC
LEDS_CLASS
LEDS_GPIO
LOCK_KERNEL
LOGGER
LOW_MEMORY_KILLER
MISC_DEVICES
NEW_LEDS
NO_HZ
POWER_SUPPLY
PREEMPT
RAMFS
RTC_CLASS
RTC_LIB
SWITCH
SWITCH_GPIO
TMPFS
UID_STAT
UID16
USB_FUNCTION
USB_FUNCTION_ADB
USER_WAKELOCK
VIDEO_OUTPUT_CONTROL
WAKELOCK
YAFFS_AUTO_YAFFS2
YAFFS_FS
YAFFS_YAFFS1
YAFFS_YAFFS2
1.2 Required disabled config options
------------------------------------
CONFIG_YAFFS_DISABLE_LAZY_LOAD
DNOTIFY
1.3 Recommended enabled config options
--------------------------------------
ANDROID_PMEM
PSTORE_CONSOLE
PSTORE_RAM
SCHEDSTATS
DEBUG_PREEMPT
DEBUG_MUTEXES
DEBUG_SPINLOCK_SLEEP
DEBUG_INFO
FRAME_POINTER
CPU_FREQ
CPU_FREQ_TABLE
CPU_FREQ_DEFAULT_GOV_ONDEMAND
CPU_FREQ_GOV_ONDEMAND
CRC_CCITT
EMBEDDED
INPUT_TOUCHSCREEN
I2C
I2C_BOARDINFO
LOG_BUF_SHIFT=17
SERIAL_CORE
SERIAL_CORE_CONSOLE
2. Contact
==========
website: http://android.git.kernel.org
mailing-lists: android-kernel@googlegroups.com
+34
@@ -0,0 +1,34 @@
Tagged virtual addresses in AArch64 Linux
=========================================
Author: Will Deacon <will.deacon@arm.com>
Date : 12 June 2013
This document briefly describes the provision of tagged virtual
addresses in the AArch64 translation system and their potential uses
in AArch64 Linux.
The kernel configures the translation tables so that translations made
via TTBR0 (i.e. userspace mappings) have the top byte (bits 63:56) of
the virtual address ignored by the translation hardware. This frees up
this byte for application use, with the following caveats:
(1) The kernel requires that all user addresses passed to EL1
are tagged with tag 0x00. This means that any syscall
parameters containing user virtual addresses *must* have
their top byte cleared before trapping to the kernel.
(2) Non-zero tags are not preserved when delivering signals.
This means that signal handlers in applications making use
of tags cannot rely on the tag information for user virtual
addresses being maintained for fields inside siginfo_t.
One exception to this rule is for signals raised in response
to watchpoint debug exceptions, where the tag information
will be preserved.
(3) Special care should be taken when using tagged pointers,
since it is likely that C compilers will not hazard two
virtual addresses differing only in the upper byte.
The architecture prevents the use of a tagged PC, so the upper byte will
be set to a sign-extension of bit 55 on exception return.
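The tag handling described above can be sketched in plain C. This is a minimal illustration operating on 64-bit integer values, not kernel code; the helper names (tag_address, address_tag, untag_address, sign_extend_bit55) are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stdint.h>

#define TAG_SHIFT 56
#define TAG_MASK  (UINT64_C(0xff) << TAG_SHIFT)

/* Place an 8-bit tag in bits 63:56 of a user virtual address. */
static uint64_t tag_address(uint64_t addr, uint8_t tag)
{
	return (addr & ~TAG_MASK) | ((uint64_t)tag << TAG_SHIFT);
}

/* Read back the tag held in the top byte. */
static uint8_t address_tag(uint64_t addr)
{
	return (uint8_t)(addr >> TAG_SHIFT);
}

/* Clear the top byte, as caveat (1) requires before entering the kernel. */
static uint64_t untag_address(uint64_t addr)
{
	return addr & ~TAG_MASK;
}

/* Replace the top byte with a sign-extension of bit 55, which is what
 * the text says happens to the PC on exception return. */
static uint64_t sign_extend_bit55(uint64_t addr)
{
	if (addr & (UINT64_C(1) << 55))
		return addr | TAG_MASK;
	return addr & ~TAG_MASK;
}

/* Usage sketch: tag an address, then strip the tag again. */
static void tagged_address_example(void)
{
	uint64_t addr = UINT64_C(0x0000007fdeadb000);
	uint64_t tagged = tag_address(addr, 0x5a);

	assert(tagged == UINT64_C(0x5a00007fdeadb000));
	assert(address_tag(tagged) == 0x5a);
	assert(untag_address(tagged) == addr);
	/* Bit 55 is clear here, so the top byte becomes 0x00. */
	assert(sign_extend_bit55(tagged) == addr);
}
```

An application using tags would call something like untag_address() on every pointer argument immediately before a syscall and keep the tagged value only for its own bookkeeping.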
+9
@@ -598,6 +598,15 @@ is completely unused; @cgrp->parent is still valid. (Note - can also
be called for a newly-created cgroup if an error occurs after this
subsystem's create() method has been called for the new cgroup).
int allow_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
(cgroup_mutex held by caller)
Called prior to moving a task into a cgroup; if the subsystem
returns an error, this will abort the attach operation. Used
to extend the permission checks - if all subsystems in a cgroup
return 0, the attach will be allowed to proceed, even if the
default permission check (root or same user) fails.
int can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
(cgroup_mutex held by caller)
+85
@@ -28,6 +28,7 @@ Contents:
2.3 Userspace
2.4 Ondemand
2.5 Conservative
2.6 Interactive
3. The Governor Interface in the CPUfreq Core
@@ -218,6 +219,90 @@ a decision on when to decrease the frequency while running in any
speed. Load for frequency increase is still evaluated every
sampling rate.
2.6 Interactive
---------------
The CPUfreq governor "interactive" is designed for latency-sensitive,
interactive workloads. This governor sets the CPU speed depending on
usage, similar to "ondemand" and "conservative" governors, but with a
different set of configurable behaviors.
The tuneable values for this governor are:
target_loads: CPU load values used to adjust speed to influence the
current CPU load toward that value. In general, the lower the target
load, the more often the governor will raise CPU speeds to bring load
below the target. The format is a single target load, optionally
followed by pairs of CPU speeds and CPU loads to target at or above
those speeds. Colons can be used between the speeds and associated
target loads for readability. For example:
85 1000000:90 1700000:99
targets CPU load 85% below speed 1GHz, 90% at or above 1GHz, until
1.7GHz and above, at which load 99% is targeted. If speeds are
specified these must appear in ascending order. Higher target load
values are typically specified for higher speeds, that is, target load
values also usually appear in an ascending order. The default is
target load 90% for all speeds.
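To make the target_loads format concrete, the following sketch parses such a string and looks up the load targeted at a given speed. It is a userspace illustration under stated assumptions, not the governor's code: parse_target_loads, freq_to_target_load and MAX_TOKENS are hypothetical names, and rejecting non-ascending speeds is this sketch's reading of the rule above.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_TOKENS 32

/* Parse "load [freq:load ...]" into vals[] as load, freq, load, freq, ...
 * Spaces, colons and newlines all act as separators.
 * Returns the number of values stored (always odd), or -1 on error. */
static int parse_target_loads(const char *s, unsigned int *vals, int max)
{
	int n = 0;

	while (*s) {
		char *end;
		unsigned long v;

		while (*s == ' ' || *s == ':' || *s == '\n')
			s++;
		if (!*s)
			break;
		if (n == max)
			return -1;
		v = strtoul(s, &end, 10);
		if (end == s)
			return -1;		/* not a number */
		vals[n++] = (unsigned int)v;
		s = end;
	}
	/* One load, then zero or more (speed, load) pairs. */
	if (n % 2 == 0)
		return -1;
	/* Speeds sit at odd indices and must ascend. */
	for (int i = 3; i < n; i += 2)
		if (vals[i] <= vals[i - 2])
			return -1;
	return n;
}

/* Return the load targeted at speed freq (same unit as in the string). */
static unsigned int freq_to_target_load(const unsigned int *vals, int n,
					unsigned int freq)
{
	int i;

	assert(vals && n > 0 && n % 2 == 1);
	for (i = 0; i < n - 1 && freq >= vals[i + 1]; i += 2)
		;
	return vals[i];
}
```

With the example string above, a speed of 800000 maps to load 85, 1000000 to load 90 and 1700000 to load 99.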
min_sample_time: The minimum amount of time to spend at the current
frequency before ramping down. Default is 80000 uS.
hispeed_freq: An intermediate "hi speed" at which to initially ramp
when CPU load hits the value specified in go_hispeed_load. If load
stays high for the amount of time specified in above_hispeed_delay,
then speed may be bumped higher. Default is the maximum speed
allowed by the policy at governor initialization time.
go_hispeed_load: The CPU load at which to ramp to hispeed_freq.
Default is 99%.
above_hispeed_delay: When speed is at or above hispeed_freq, wait for
this long before raising speed in response to continued high load.
The format is a single delay value, optionally followed by pairs of
CPU speeds and the delay to use at or above those speeds. Colons can
be used between the speeds and associated delays for readability. For
example:
80000 1300000:200000 1500000:40000
uses delay 80000 uS until CPU speed 1.3 GHz, at which speed delay
200000 uS is used until speed 1.5 GHz, at which speed (and above)
delay 40000 uS is used. If speeds are specified these must appear in
ascending order. Default is 20000 uS.
timer_rate: Sample rate for reevaluating CPU load when the CPU is not
idle. A deferrable timer is used, such that the CPU will not be woken
from idle to service this timer until something else needs to run.
(The maximum time to allow deferring this timer when not running at
minimum speed is configurable via timer_slack.) Default is 20000 uS.
timer_slack: Maximum additional time to defer handling the governor
sampling timer beyond timer_rate when running at speeds above the
minimum. For platforms that consume additional power at idle when
CPUs are running at speeds greater than minimum, this places an upper
bound on how long the timer will be deferred prior to re-evaluating
load and dropping speed. For example, if timer_rate is 20000uS and
timer_slack is 10000uS then timers will be deferred for up to 30msec
when not at lowest speed. A value of -1 means defer timers
indefinitely at all speeds. Default is 80000 uS.
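The interaction between timer_rate and timer_slack reduces to a small piece of arithmetic. The sketch below only restates the rule given above; max_timer_deferral_us and DEFER_INDEFINITELY are hypothetical names, not governor symbols.

```c
#include <assert.h>

#define DEFER_INDEFINITELY (-1L)

/* Upper bound, in microseconds, on how long the sampling timer may be
 * deferred while running above minimum speed. A negative slack means
 * the timer may be deferred indefinitely. */
static long max_timer_deferral_us(long timer_rate_us, long timer_slack_us)
{
	assert(timer_rate_us > 0);
	if (timer_slack_us < 0)
		return DEFER_INDEFINITELY;
	return timer_rate_us + timer_slack_us;
}
```

For the example in the text, max_timer_deferral_us(20000, 10000) gives 30000 uS, that is 30 msec; with the documented defaults (20000 and 80000) the bound is 100000 uS.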
boost: If non-zero, immediately boost speed of all CPUs to at least
hispeed_freq until zero is written to this attribute. If zero, allow
CPU speeds to drop below hispeed_freq according to load as usual.
Default is zero.
boostpulse: On each write, immediately boost speed of all CPUs to
hispeed_freq for at least the period of time specified by
boostpulse_duration, after which speeds are allowed to drop below
hispeed_freq according to load as usual.
boostpulse_duration: Length of time to hold CPU speed at hispeed_freq
on a write to boostpulse, before allowing speed to drop according to
load as usual. Default is 80000 uS.
3. The Governor Interface in the CPUfreq Core
=============================================
@@ -0,0 +1,172 @@
=======================================================
ARM CCI cache coherent interconnect binding description
=======================================================
ARM multi-cluster systems maintain intra-cluster coherency through a
cache coherent interconnect (CCI) that is capable of monitoring bus
transactions and managing coherency, TLB invalidations and memory barriers.
It allows snooping and distributed virtual memory message broadcast across
clusters, through a memory-mapped interface, with a global control register
space and multiple sets of interface control registers, one per slave
interface.
Bindings for the CCI node follow the ePAPR standard, available from:
www.power.org/documentation/epapr-version-1-1/
with the addition of the bindings described in this document which are
specific to ARM.
* CCI interconnect node
Description: Describes a CCI cache coherent Interconnect component
Node name must be "cci".
Node's parent must be the root node /, and the address space visible
through the CCI interconnect is the same as the one seen from the
root node (i.e. from the CPUs' perspective, as per the DT standard).
Every CCI node has to define the following properties:
- compatible
Usage: required
Value type: <string>
Definition: must be set to
"arm,cci-400"
- reg
Usage: required
Value type: <prop-encoded-array>
Definition: A standard property. Specifies base physical
address of CCI control registers common to all
interfaces.
- ranges:
Usage: required
Value type: <prop-encoded-array>
Definition: A standard property. Follow rules in the ePAPR for
hierarchical bus addressing. CCI interface
addresses refer to the parent node addressing
scheme to declare their register bases.
CCI interconnect node can define the following child nodes:
- CCI control interface nodes
Node name must be "slave-if".
Parent node must be CCI interconnect node.
A CCI control interface node must contain the following
properties:
- compatible
Usage: required
Value type: <string>
Definition: must be set to
"arm,cci-400-ctrl-if"
- interface-type:
Usage: required
Value type: <string>
Definition: must be set to one of {"ace", "ace-lite"}
depending on the interface type the node
represents.
- reg:
Usage: required
Value type: <prop-encoded-array>
Definition: the base address and size of the
corresponding interface programming
registers.
* CCI interconnect bus masters
Description: masters in the device tree connected to a CCI port
(inclusive of CPUs and their cpu nodes).
A CCI interconnect bus master node must contain the following
properties:
- cci-control-port:
Usage: required
Value type: <phandle>
Definition: a phandle containing the CCI control interface node
the master is connected to.
Example:
cpus {
#size-cells = <0>;
#address-cells = <1>;
CPU0: cpu@0 {
device_type = "cpu";
compatible = "arm,cortex-a15";
cci-control-port = <&cci_control1>;
reg = <0x0>;
};
CPU1: cpu@1 {
device_type = "cpu";
compatible = "arm,cortex-a15";
cci-control-port = <&cci_control1>;
reg = <0x1>;
};
CPU2: cpu@100 {
device_type = "cpu";
compatible = "arm,cortex-a7";
cci-control-port = <&cci_control2>;
reg = <0x100>;
};
CPU3: cpu@101 {
device_type = "cpu";
compatible = "arm,cortex-a7";
cci-control-port = <&cci_control2>;
reg = <0x101>;
};
};
dma0: dma@3000000 {
compatible = "arm,pl330", "arm,primecell";
cci-control-port = <&cci_control0>;
reg = <0x0 0x3000000 0x0 0x1000>;
interrupts = <10>;
#dma-cells = <1>;
#dma-channels = <8>;
#dma-requests = <32>;
};
cci@2c090000 {
compatible = "arm,cci-400";
#address-cells = <1>;
#size-cells = <1>;
reg = <0x0 0x2c090000 0 0x1000>;
ranges = <0x0 0x0 0x2c090000 0x6000>;
cci_control0: slave-if@1000 {
compatible = "arm,cci-400-ctrl-if";
interface-type = "ace-lite";
reg = <0x1000 0x1000>;
};
cci_control1: slave-if@4000 {
compatible = "arm,cci-400-ctrl-if";
interface-type = "ace";
reg = <0x4000 0x1000>;
};
cci_control2: slave-if@5000 {
compatible = "arm,cci-400-ctrl-if";
interface-type = "ace";
reg = <0x5000 0x1000>;
};
};
This CCI node corresponds to a CCI component whose control registers sit
at address 0x000000002c090000.
CCI slave interface @0x000000002c091000 is connected to dma controller dma0.
CCI slave interface @0x000000002c094000 is connected to CPUs {CPU0, CPU1}.
CCI slave interface @0x000000002c095000 is connected to CPUs {CPU2, CPU3}.
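The slave interface addresses quoted above follow from translating each child "reg" base through the CCI node's "ranges" property. A minimal sketch of that arithmetic, assuming a single decoded ranges entry (struct dt_range and dt_translate are hypothetical names, not a kernel or libfdt API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One decoded "ranges" entry: child base, parent base, window length. */
struct dt_range {
	uint64_t child_base;
	uint64_t parent_base;
	uint64_t size;
};

/* Translate a child bus address into the parent address space.
 * Returns false if the address lies outside the window. */
static bool dt_translate(const struct dt_range *r, uint64_t child_addr,
			 uint64_t *parent_addr)
{
	if (child_addr < r->child_base ||
	    child_addr - r->child_base >= r->size)
		return false;
	*parent_addr = r->parent_base + (child_addr - r->child_base);
	return true;
}

/* Usage sketch with the values from the example node. */
static void cci_example(void)
{
	/* ranges = <0x0 0x0 0x2c090000 0x6000> */
	const struct dt_range cci = { 0x0, 0x2c090000, 0x6000 };
	uint64_t pa;

	assert(dt_translate(&cci, 0x1000, &pa) && pa == 0x2c091000);
	assert(dt_translate(&cci, 0x4000, &pa) && pa == 0x2c094000);
	assert(dt_translate(&cci, 0x5000, &pa) && pa == 0x2c095000);
	/* 0x6000 is the first address past the window. */
	assert(!dt_translate(&cci, 0x6000, &pa));
}
```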
@@ -16,6 +16,9 @@ Required properties:
"arm,arm1176-pmu"
"arm,arm1136-pmu"
- interrupts : 1 combined interrupt or 1 per core.
- cluster : a phandle to the cluster to which it belongs
If there is more than one cluster with the same CPU type,
then there should be separate PMU nodes per cluster.
Example:
+6
@@ -369,6 +369,8 @@ is not associated with a file:
[stack:1001] = the stack of the thread with tid 1001
[vdso] = the "virtual dynamic shared object",
the kernel system call handler
[anon:<name>] = an anonymous mapping that has been
named by userspace
or if empty, the mapping is anonymous.
@@ -419,6 +421,7 @@ KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 374 kB
VmFlags: rd ex mr mw me de
Name: name from userspace
the first of these lines shows the same information as is displayed for the
mapping in /proc/PID/maps. The remaining lines show the size of the mapping
@@ -469,6 +472,9 @@ Note that there is no guarantee that every flag and associated mnemonic will
be present in all future kernel releases. Things change: flags may
vanish or, conversely, new ones may be added.
The "Name" field will only be present on a mapping that has been named by
userspace, and will show the name passed in by userspace.
This file is only present if the CONFIG_MMU kernel configuration option is
enabled.
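A tool reading /proc/PID/maps can recover the userspace-assigned name from the "[anon:<name>]" field. The sketch below is one way to do so; anon_mapping_name is a hypothetical helper and the sample line is made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Copy <name> out of a maps line containing "[anon:<name>]".
 * Returns false if the field is absent or the name does not fit. */
static bool anon_mapping_name(const char *maps_line, char *name,
			      size_t name_len)
{
	static const char prefix[] = "[anon:";
	const char *start = strstr(maps_line, prefix);
	const char *end;
	size_t len;

	if (!start)
		return false;
	start += sizeof(prefix) - 1;
	/* Use the last ']' so a name containing ']' is kept whole. */
	end = strrchr(start, ']');
	if (!end)
		return false;
	len = (size_t)(end - start);
	if (len >= name_len)
		return false;
	memcpy(name, start, len);
	name[len] = '\0';
	return true;
}

/* Usage sketch on a made-up line in /proc/PID/maps layout. */
static void anon_name_example(void)
{
	const char *line =
		"7f2c1000-7f2c5000 rw-p 00000000 00:00 0 [anon:libc_malloc]";
	char name[32];

	assert(anon_mapping_name(line, name, sizeof(name)));
	assert(strcmp(name, "libc_malloc") == 0);
}
```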
Regular → Executable
+3
@@ -3217,6 +3217,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
HIGHMEM regardless of setting
of CONFIG_HIGHPTE.
uuid_debug= (Boolean) whether to enable debugging of TuxOnIce's
uuid support.
vdso= [X86,SH]
vdso=2: enable compat VDSO (default with COMPAT_VDSO)
vdso=1: enable VDSO (default)
+28
@@ -22,6 +22,15 @@ ip_no_pmtu_disc - BOOLEAN
min_pmtu - INTEGER
default 552 - minimum discovered Path MTU
fwmark_reflect - BOOLEAN
Controls the fwmark of kernel-generated IPv4 reply packets that are not
associated with a socket (for example, TCP RSTs or ICMP echo replies).
If unset, these packets have a fwmark of zero. If set, they have the
fwmark of the packet they are replying to. Similarly affects the fwmark
used by internal routing lookups triggered by incoming packets, such as
the ones used for Path MTU Discovery.
Default: 0
route/max_size - INTEGER
Maximum number of routes allowed in the kernel. Increase
this when using large numbers of interfaces and/or routes.
@@ -468,6 +477,16 @@ tcp_fastopen - INTEGER
See include/net/tcp.h and the code for more details.
tcp_fwmark_accept - BOOLEAN
If set, incoming connections to listening sockets that do not have a
socket mark will set the mark of the accepting socket to the fwmark of
the incoming SYN packet. This will cause all packets on that connection
(starting from the first SYNACK) to be sent with that fwmark. The
listening socket's mark is unchanged. Listening sockets that already
have a fwmark set via setsockopt(SOL_SOCKET, SO_MARK, ...) are
unaffected.
Default: 0
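The mark selection rules for fwmark_reflect and tcp_fwmark_accept can be modelled as two pure functions. This is a sketch of the behaviour as described in this file, not the kernel's implementation; reply_fwmark and accepted_socket_mark are hypothetical names, and a mark of zero is taken to mean "no mark set".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* fwmark_reflect: mark carried by a kernel-generated reply that is not
 * associated with a socket. */
static uint32_t reply_fwmark(bool fwmark_reflect, uint32_t incoming_mark)
{
	return fwmark_reflect ? incoming_mark : 0;
}

/* tcp_fwmark_accept: mark given to the socket created for an incoming
 * connection. A listener that already has a mark is unaffected. */
static uint32_t accepted_socket_mark(bool tcp_fwmark_accept,
				     uint32_t listener_mark,
				     uint32_t syn_mark)
{
	if (tcp_fwmark_accept && listener_mark == 0)
		return syn_mark;
	return listener_mark;
}

/* Usage sketch covering the cases named in the text. */
static void fwmark_example(void)
{
	assert(reply_fwmark(false, 0x20) == 0);
	assert(reply_fwmark(true, 0x20) == 0x20);
	/* Unmarked listener, sysctl set: connection takes the SYN's mark. */
	assert(accepted_socket_mark(true, 0, 0x20) == 0x20);
	/* Listener marked via SO_MARK: its mark wins. */
	assert(accepted_socket_mark(true, 0x10, 0x20) == 0x10);
}
```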
tcp_syn_retries - INTEGER
Number of times initial SYNs for an active TCP connection attempt
will be retransmitted. Should not be higher than 255. Default value
@@ -1093,6 +1112,15 @@ conf/all/forwarding - BOOLEAN
proxy_ndp - BOOLEAN
Do proxy ndp.
fwmark_reflect - BOOLEAN
Controls the fwmark of kernel-generated IPv6 reply packets that are not
associated with a socket (for example, TCP RSTs or ICMPv6 echo replies).
If unset, these packets have a fwmark of zero. If set, they have the
fwmark of the packet they are replying to. Similarly affects the fwmark
used by internal routing lookups triggered by incoming packets, such as
the ones used for Path MTU Discovery.
Default: 0
conf/interface/*:
Change special settings per interface.
+477
@@ -0,0 +1,477 @@
TuxOnIce 3.0 Internal Documentation.
Updated to 26 March 2009
1. Introduction.
TuxOnIce 3.0 is an addition to the Linux Kernel, designed to
allow the user to quickly shutdown and quickly boot a computer, without
needing to close documents or programs. It is equivalent to the
hibernate facility in some laptops. This implementation, however,
requires no special BIOS or hardware support.
The code in these files is based upon the original implementation
prepared by Gabor Kuti and additional work by Pavel Machek and a
host of others. This code has been substantially reworked by Nigel
Cunningham, again with the help and testing of many others, not the
least of whom is Michael Frank. At its heart, however, the operation is
essentially the same as Gabor's version.
2. Overview of operation.
The basic sequence of operations is as follows:
a. Quiesce all other activity.
b. Ensure enough memory and storage space are available, and attempt
to free memory/storage if necessary.
c. Allocate the required memory and storage space.
d. Write the image.
e. Power down.
There are a number of complicating factors which mean that things are
not as simple as the above would imply, however...
o The activity of each process must be stopped at a point where it will
not be holding locks necessary for saving the image, or unexpectedly
restart operations due to something like a timeout and thereby make
our image inconsistent.
o It is desirable that we sync outstanding I/O to disk before calculating
image statistics. This reduces corruption if one should suspend but
then not resume, and also makes later parts of the operation safer (see
below).
o We need to get as close as we can to an atomic copy of the data.
Inconsistencies in the image will result in inconsistent memory contents at
resume time, and thus in instability of the system and/or file system
corruption. This would appear to imply a maximum image size of one half of
the amount of RAM, but we have a solution... (again, below).
o In 2.6, we choose to play nicely with the other suspend-to-disk
implementations.
3. Detailed description of internals.
a. Quiescing activity.
Safely quiescing the system is achieved using three separate but related
aspects.
First, we note that the vast majority of processes don't need to run during
suspend. They can be 'frozen'. We therefore implement a refrigerator
routine, which processes enter and in which they remain until the cycle is
complete. Processes enter the refrigerator via try_to_freeze() invocations
at appropriate places. A process cannot be frozen in any old place. It
must not be holding locks that will be needed for writing the image or
freezing other processes. For this reason, userspace processes generally
enter the refrigerator via the signal handling code, and kernel threads at
the place in their event loops where they drop locks and yield to other
processes or sleep.
The task of freezing processes is complicated by the fact that there can be
interdependencies between processes. Freezing process A before process B may
mean that process B cannot be frozen, because it stops while waiting for
process A rather than in the refrigerator. This issue is seen where
userspace waits on freezeable kernel threads or fuse filesystem threads. To
address this issue, we implement the following algorithm for quiescing
activity:
- Freeze filesystems (including fuse - userspace programs starting
new requests are immediately frozen; programs already running
requests complete their work before being frozen in the next
step)
- Freeze userspace
- Thaw filesystems (this is safe now that userspace is frozen and no
fuse requests are outstanding).
- Invoke sys_sync (noop on fuse).
- Freeze filesystems
- Freeze kernel threads
If we need to free memory, we thaw kernel threads and filesystems, but not
userspace. We can then free caches without worrying about deadlocks due to
swap files being on frozen filesystems or such like.
b. Ensure enough memory & storage are available.
We have a number of constraints to meet in order to be able to successfully
suspend and resume.
First, the image will be written in two parts, described below. One of these
parts needs to have an atomic copy made, which of course implies a maximum
size of one half of the amount of system memory. The other part ('pageset')
is not atomically copied, and can therefore be as large or small as desired.
Second, we have constraints on the amount of storage available. In these
calculations, we may also consider any compression that will be done. The
cryptoapi module allows the user to configure an expected compression ratio.
Third, the user can specify an arbitrary limit on the image size, in
megabytes. This limit is treated as a soft limit, so that we don't fail the
attempt to suspend if we cannot meet this constraint.
c. Allocate the required memory and storage space.
Having done the initial freeze, we determine whether the above constraints
are met, and seek to allocate the metadata for the image. If the constraints
are not met, or we fail to allocate the required space for the metadata, we
seek to free the amount of memory that we calculate is needed and try again.
We allow up to four iterations of this loop before aborting the cycle. If we
do fail, it should only be because of a bug in TuxOnIce's calculations.
These steps are merged together in the prepare_image function, found in
prepare_image.c. The functions are merged because of the cyclical nature
of the problem of calculating how much memory and storage is needed. Since
the data structures containing the information about the image must
themselves take memory and use storage, the amount of memory and storage
required changes as we prepare the image. Since the changes are not large,
only one or two iterations will be required to achieve a solution.
The recursive nature of the algorithm is minimised by keeping user space
frozen while preparing the image, and by the fact that our records of which
pages are to be saved and which pageset they are saved in use bitmaps (so
that changes in number or fragmentation of the pages to be saved don't
feed back via changes in the amount of memory needed for metadata). The
recursiveness is thus limited to any extra slab pages allocated to store the
extents that record storage used, and the effects of seeking to free memory.
d. Write the image.
We previously mentioned the need to create an atomic copy of the data, and
the half-of-memory limitation that is implied in this. This limitation is
circumvented by dividing the memory to be saved into two parts, called
pagesets.
Pageset2 contains most of the page cache - the pages on the active and
inactive LRU lists that aren't needed or modified while TuxOnIce is
running, so they can be safely written without an atomic copy. They are
therefore saved first and reloaded last. While saving these pages,
TuxOnIce carefully ensures that the work of writing the pages doesn't make
the image inconsistent. With the support for Kernel (Video) Mode Setting
going into the kernel at the time of writing, we need to check for pages
on the LRU that are used by KMS, and exclude them from pageset2. They are
atomically copied as part of pageset 1.
Once pageset2 has been saved, we prepare to do the atomic copy of remaining
memory. As part of the preparation, we power down drivers, thereby providing
them with the opportunity to have their state recorded in the image. The
amount of memory allocated by drivers for this is usually negligible, but if
DRI is in use, video drivers may require significant amounts. Ideally we
would be able to query drivers while preparing the image as to the amount of
memory they will need. Unfortunately no such mechanism exists at the time of
writing. For this reason, TuxOnIce allows the user to set an
'extra_pages_allowance', which is used to seek to ensure sufficient memory
is available for drivers at this point. TuxOnIce also lets the user set this
value to 0. In this case, a test driver suspend is done while preparing the
image, and the difference (plus a margin) used instead. TuxOnIce will also
automatically restart the hibernation process (twice at most) if it finds
that the extra pages allowance is not sufficient. It will then use what was
actually needed (plus a margin, again). Failure to hibernate should thus
be an extremely rare occurrence.
Having suspended the drivers, we save the CPU context before making an
atomic copy of pageset1, resuming the drivers and saving the atomic copy.
After saving the two pagesets, we just need to save our metadata before
powering down.
As we mentioned earlier, the contents of pageset2 pages aren't needed once
they've been saved. We therefore use them as the destination of our atomic
copy. In the unlikely event that pageset1 is larger, extra pages are
allocated while the image is being prepared. This is normally only a real
possibility when the system has just been booted and the page cache is
small.
This is where we need to be careful about syncing, however. Pageset2 will
probably contain filesystem metadata. If this is overwritten with pageset1
and then a sync occurs, the filesystem will be corrupted - at least until
resume time and another sync of the restored data. Since there is a
possibility that the user might not resume or (may it never be!) that
TuxOnIce might oops, we do our utmost to avoid syncing filesystems after
copying pageset1.
e. Power down.
Powering down uses standard kernel routines. TuxOnIce supports powering down
using the ACPI S3, S4 and S5 methods or the kernel's non-ACPI power-off.
Supporting suspend to ram (S3) as a power off option might sound strange,
but it allows the user to quickly get their system up and running again if
the battery doesn't run out (we just need to re-read the overwritten pages)
and if the battery does run out (or the user removes power), they can still
resume.
4. Data Structures.
TuxOnIce uses three main structures to store its metadata and configuration
information:
a) Pageflags bitmaps.
TuxOnIce records which pages will be in pageset1, pageset2, the destination
of the atomic copy and the source of the atomically restored image using
bitmaps. The code used is that written for swsusp, with small improvements
to match TuxOnIce's requirements.
The pageset1 bitmap is thus easily stored in the image header for use at
resume time.
As mentioned above, using bitmaps also means that the amount of memory and
storage required for recording the above information is constant. This
greatly simplifies the work of preparing the image. In earlier versions of
TuxOnIce, extents were used to record which pages would be stored. In that
case, however, eating memory could result in greater fragmentation of the
lists of pages, which in turn required more memory to store the extents and
more storage in the image header. These could in turn require further
freeing of memory, and another iteration. All of this complexity is removed
by having bitmaps.
Bitmaps also make a lot of sense because TuxOnIce only ever iterates
through the lists. There is therefore no cost to not being able to find the
nth page in order 0 time. We only need to worry about the cost of finding
the n+1th page, given the location of the nth page. Bitwise optimisations
help here.
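The "find the n+1th page given the nth" operation can be sketched as a next-set-bit search that skips empty words. This is a generic userspace illustration, not the swsusp or TuxOnIce bitmap code; all names below are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Mark page pfn as belonging to the set. */
static void set_page_bit(unsigned long *map, size_t pfn)
{
	map[pfn / BITS_PER_WORD] |= 1UL << (pfn % BITS_PER_WORD);
}

/* Return 1 if page pfn is in the set, 0 otherwise. */
static int test_page_bit(const unsigned long *map, size_t pfn)
{
	return (int)((map[pfn / BITS_PER_WORD] >> (pfn % BITS_PER_WORD)) & 1UL);
}

/* Return the index of the first set bit at or after start, or nbits if
 * there is none. Words with no remaining set bits are skipped whole. */
static size_t next_set_page(const unsigned long *map, size_t nbits,
			    size_t start)
{
	size_t i = start;

	assert(map != NULL);
	while (i < nbits) {
		unsigned long word = map[i / BITS_PER_WORD] >> (i % BITS_PER_WORD);

		if (word == 0) {
			/* Nothing left in this word: jump to the next one. */
			i = (i / BITS_PER_WORD + 1) * BITS_PER_WORD;
			continue;
		}
		while (!(word & 1UL)) {
			word >>= 1;
			i++;
		}
		return i < nbits ? i : nbits;
	}
	return nbits;
}
```

Iterating over a pageset then becomes a loop of the form "pfn = next_set_page(map, nbits, pfn + 1)" until nbits is returned.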
b) Extents for block data.
TuxOnIce supports writing the image to multiple block devices. In the case
of swap, multiple partitions and/or files may be in use, and we happily use
them all (with the exception of compcache pages, which we allocate but do
not use). This use of multiple block devices is accomplished as follows:
Whatever the actual source of the allocated storage, the destination of the
image can be viewed in terms of one or more block devices, and on each
device, a list of sectors. To simplify matters, we only use contiguous,
PAGE_SIZE aligned sectors, like the swap code does.
Since sector numbers on each bdev may well not start at 0, it makes much
more sense to use extents here. Contiguous ranges of pages can thus be
represented in the extents by contiguous values.
Variations in block size are taken account of in transforming this data
into the parameters for bio submission.
We can thus implement a layer of abstraction wherein the core of TuxOnIce
doesn't have to worry about which device we're currently writing to or
where in the device we are. It simply requests that the next page in the
pageset or header be written, leaving the details to this lower layer.
The lower layer remembers where in the sequence of devices and blocks each
pageset starts. The header always starts at the beginning of the allocated
storage.
So extents are:
struct extent {
unsigned long minimum, maximum;
struct extent *next;
};
These are combined into chains of extents for a device:
struct extent_chain {
int size; /* size of the extent ie sum (max-min+1) */
int allocs, frees;
char *name;
struct extent *first, *last_touched;
};
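The size field is described as the sum of (max - min + 1) over the chain. A minimal sketch of that computation, reusing the struct extent layout shown above (extent_chain_blocks is a hypothetical helper, not a TuxOnIce function):

```c
#include <assert.h>
#include <stddef.h>

struct extent {
	unsigned long minimum, maximum;
	struct extent *next;
};

/* Total number of values covered by a chain of extents, i.e. the sum
 * of (maximum - minimum + 1) over every extent in the list. */
static unsigned long extent_chain_blocks(const struct extent *first)
{
	unsigned long total = 0;

	for (const struct extent *e = first; e; e = e->next) {
		assert(e->minimum <= e->maximum);
		total += e->maximum - e->minimum + 1;
	}
	return total;
}
```

For a chain holding the extents 10..19 and 40..44, the result is 10 + 5 = 15.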
For each bdev, we need to store a little more info:
struct suspend_bdev_info {
struct block_device *bdev;
dev_t dev_t;
int bmap_shift;
int blocks_per_page;
};
The dev_t is used to identify the device in the stored image. As a result,
we expect devices at resume time to have the same major and minor numbers
as they had while suspending. This is primarily a concern where the user
utilises LVM for storage, as they will need to dmsetup their partitions in
such a way as to maintain this consistency at resume time.
bmap_shift and blocks_per_page apply the effects of variations in blocks
per page settings for the filesystem and underlying bdev. For most
filesystems, these are the same, but for xfs, they can have independent
values.
Combining these two structures together, we have everything we need to
record what devices and what blocks on each device are being used to
store the image, and to submit i/o using bio_submit.
The last elements in the picture are a means of recording how the storage
is being used.
We do this first and foremost by implementing a layer of abstraction on
top of the devices and extent chains which allows us to view however many
devices there might be as one long storage tape, with a single 'head' that
tracks a 'current position' on the tape:
struct extent_iterate_state {
struct extent_chain *chains;
int num_chains;
int current_chain;
struct extent *current_extent;
unsigned long current_offset;
};
That is, *chains points to an array of size num_chains of extent chains.
For the filewriter, this is always a single chain. For the swapwriter, the
array is of size MAX_SWAPFILES.
current_chain, current_extent and current_offset thus point to the current
index in the chains array (and into a matching array of struct
suspend_bdev_info), the current extent in that chain (to optimise access),
and the current value in the offset.
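The "tape head" idea can be sketched as two operations on the iterate state: rewind to the first block, and advance by one block, stepping to the next extent or chain when the current one is exhausted. This is an illustration under stated assumptions, not TuxOnIce's code: tape_rewind and tape_advance are hypothetical names, and struct extent_chain is trimmed to the one field the sketch uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct extent {
	unsigned long minimum, maximum;
	struct extent *next;
};

struct extent_chain {
	struct extent *first;	/* other fields omitted in this sketch */
};

struct extent_iterate_state {
	struct extent_chain *chains;
	int num_chains;
	int current_chain;
	struct extent *current_extent;
	unsigned long current_offset;
};

/* Put the head on the first block of the first non-empty chain. */
static bool tape_rewind(struct extent_iterate_state *st)
{
	assert(st && st->chains);
	for (st->current_chain = 0; st->current_chain < st->num_chains;
	     st->current_chain++) {
		st->current_extent = st->chains[st->current_chain].first;
		if (st->current_extent) {
			st->current_offset = st->current_extent->minimum;
			return true;
		}
	}
	st->current_extent = NULL;
	return false;
}

/* Move the head forward one block. Returns false at the end of tape. */
static bool tape_advance(struct extent_iterate_state *st)
{
	if (!st->current_extent)
		return false;
	if (st->current_offset < st->current_extent->maximum) {
		st->current_offset++;
		return true;
	}
	/* Extent exhausted: try the next extent in this chain. */
	st->current_extent = st->current_extent->next;
	/* Chain exhausted: move on to the next chain that has extents. */
	while (!st->current_extent && ++st->current_chain < st->num_chains)
		st->current_extent = st->chains[st->current_chain].first;
	if (!st->current_extent)
		return false;
	st->current_offset = st->current_extent->minimum;
	return true;
}
```

A caller sees one continuous sequence of (current_chain, current_offset) pairs and never needs to know where an extent or device ends.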
The image is divided into three parts:
- The header
- Pageset 1
- Pageset 2
The header always starts at the first device and first block. We know its
size before we begin to save the image because we carefully account for
everything that will be stored in it.
The second pageset (LRU) is stored first. It begins on the next page after
the end of the header.
The first pageset is stored second. Its start location is only known once
pageset2 has been saved, since pageset2 may be compressed as it is written.
This location is thus recorded at the end of saving pageset2. It is page
aligned also.
Since this information is needed at resume time, and the location of extents
in memory will differ at resume time, this needs to be stored in a portable
way:
struct extent_iterate_saved_state {
int chain_num;
int extent_num;
unsigned long offset;
};
We can thus implement a layer of abstraction wherein the core of TuxOnIce
doesn't have to worry about which device we're currently writing to or
where in the device we are. It simply requests that the next page in the
pageset or header be written, leaving the details to this layer, and
invokes the routines to remember and restore the position, without having
to worry about the details of how the data is arranged on disk or such like.
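A minimal sketch of remembering and restoring a position might look like
the following. It assumes that extent_num is the zero-based index of the
current extent within its chain, and it reuses simplified, hypothetical
layouts for struct extent and struct extent_chain; tape_save() and
tape_restore() are illustrative names, not TuxOnIce routines.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified layouts used only for this sketch. */
struct extent {
	unsigned long start, end;	/* inclusive range of blocks */
	struct extent *next;
};

struct extent_chain {
	struct extent *first;
};

struct extent_iterate_state {
	struct extent_chain *chains;
	int num_chains;
	int current_chain;
	struct extent *current_extent;
	unsigned long current_offset;
};

struct extent_iterate_saved_state {
	int chain_num;
	int extent_num;
	unsigned long offset;
};

/* Replace the extent pointer with its index, which survives a reboot. */
static void tape_save(const struct extent_iterate_state *s,
		      struct extent_iterate_saved_state *out)
{
	const struct extent *e;

	assert(s->current_extent != NULL);
	e = s->chains[s->current_chain].first;
	out->chain_num = s->current_chain;
	out->extent_num = 0;
	while (e && e != s->current_extent) {
		out->extent_num++;
		e = e->next;
	}
	out->offset = s->current_offset;
}

/* Walk the (reloaded) chain to turn the index back into a pointer. */
static int tape_restore(struct extent_iterate_state *s,
			const struct extent_iterate_saved_state *in)
{
	struct extent *e;
	int i;

	if (in->chain_num < 0 || in->chain_num >= s->num_chains)
		return -1;
	e = s->chains[in->chain_num].first;
	for (i = 0; e && i < in->extent_num; i++)
		e = e->next;
	if (!e || in->offset < e->start || in->offset > e->end)
		return -1;
	s->current_chain = in->chain_num;
	s->current_extent = e;
	s->current_offset = in->offset;
	return 0;
}
```

The saved form contains no pointers, so it can be written into the image
header and interpreted against chains that live at different addresses at
resume time.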
c) Modules
One aim in designing TuxOnIce was to make it flexible. We wanted to allow
for the implementation of different methods of transforming a page to be
written to disk and different methods of getting the pages stored.
In early versions (the betas and perhaps Suspend1), compression support was
inlined in the image writing code, and the data structures and code for
managing swap were intertwined with the rest of the code. A number of people
had expressed interest in implementing image encryption, and alternative
methods of storing the image.
In order to achieve this, TuxOnIce was given a modular design.
A module is a single file which encapsulates the functionality needed
to transform a pageset of data (encryption or compression, for example),
or to write the pageset to a device. The former type of module is called
a 'page-transformer', the latter a 'writer'.
Modules are linked together in pipeline fashion. There may be zero or more
page transformers in a pipeline, and there is always exactly one writer.
The pipeline follows this pattern:
---------------------------------
| TuxOnIce Core |
---------------------------------
|
|
---------------------------------
| Page transformer 1 |
---------------------------------
|
|
---------------------------------
| Page transformer 2 |
---------------------------------
|
|
---------------------------------
| Writer |
---------------------------------
During the writing of an image, the core code feeds pages one at a time
to the first module. This module performs whatever transformations it
implements on the incoming data, completely consuming the incoming data and
feeding output in a similar manner to the next module.
All routines are SMP safe, and the final result of the transformations is
written with an index (provided by the core) and size of the output by the
writer. As a result, we can have multithreaded I/O without needing to
worry about the sequence in which pages are written (or read).
During reading, the pipeline works in the reverse direction. The core code
calls the first module with the address of a buffer which should be filled.
(Note that the buffer size is always PAGE_SIZE at this time). This module
will in turn request data from the next module and so on down until the
writer is made to read from the stored image.
Part of the definition of the structure of a module thus looks like this:
int (*rw_init) (int rw, int stream_number);
int (*rw_cleanup) (int rw);
int (*write_chunk) (struct page *buffer_page);
int (*read_chunk) (struct page *buffer_page, int sync);
It should be noted that the _cleanup routine may be called before the
full stream of data has been read or written. While writing the image,
the user may (depending upon settings) choose to abort suspending, and
if we are in the midst of writing the last portion of the image, a portion
of the second pageset may be reread. This may also happen if an error
occurs and we seek to abort the process of writing the image.
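The pipeline can be pictured with a small userspace sketch. Everything in
it is an assumption made for illustration: the struct module_ops layout,
the extra 'self' argument, the XOR "transformer" and the in-memory "writer"
are stand-ins, and plain byte buffers replace struct page. It only shows
how a chunk travels down the chain when writing and back up when reading.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 16		/* stand-in for PAGE_SIZE */
#define MAX_CHUNKS 4

/* Hypothetical module descriptor; 'next' is the module below this one. */
struct module_ops {
	const char *name;
	struct module_ops *next;
	int (*write_chunk)(struct module_ops *self, const unsigned char *buf);
	int (*read_chunk)(struct module_ops *self, unsigned char *buf);
};

/* The "writer": stores chunks in memory instead of on a device. */
static unsigned char store[MAX_CHUNKS][CHUNK];
static int stored, fetched;

static int writer_write(struct module_ops *self, const unsigned char *buf)
{
	(void)self;
	if (stored >= MAX_CHUNKS)
		return -1;
	memcpy(store[stored++], buf, CHUNK);
	return 0;
}

static int writer_read(struct module_ops *self, unsigned char *buf)
{
	(void)self;
	if (fetched >= stored)
		return -1;
	memcpy(buf, store[fetched++], CHUNK);
	return 0;
}

/* A toy page transformer: consumes its input and feeds the next module. */
static int xor_write(struct module_ops *self, const unsigned char *buf)
{
	unsigned char tmp[CHUNK];
	size_t i;

	assert(self->next != NULL);
	for (i = 0; i < CHUNK; i++)
		tmp[i] = buf[i] ^ 0x5a;
	return self->next->write_chunk(self->next, tmp);
}

/* Reading runs the other way: fetch from below, then undo the transform. */
static int xor_read(struct module_ops *self, unsigned char *buf)
{
	size_t i;
	int ret = self->next->read_chunk(self->next, buf);

	if (ret)
		return ret;
	for (i = 0; i < CHUNK; i++)
		buf[i] ^= 0x5a;
	return 0;
}

static struct module_ops writer = { "writer", NULL, writer_write, writer_read };
static struct module_ops xor_transformer = {
	"xor", &writer, xor_write, xor_read
};
```

The core would only ever call the first module in the chain (here,
xor_transformer); it never needs to know how many transformers sit between
it and the writer.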
The modular design is also useful in a number of other ways. It provides
a means whereby we can add support for:
- providing overall initialisation and cleanup routines;
- serialising configuration information in the image header;
- providing debugging information to the user;
- determining memory and image storage requirements;
- dis/enabling components at run-time;
- configuring the module (see below);
...and routines for writers specific to their work:
- Parsing a resume= location;
- Determining whether an image exists;
- Marking a resume as having been attempted;
- Invalidating an image;
Since some parts of the core - the user interface and storage manager
support - have use for some of these functions, they are registered as
'miscellaneous' modules as well.
d) Sysfs data structures.
This brings us naturally to support for configuring TuxOnIce. We desired to
provide a way to make TuxOnIce as flexible and configurable as possible.
The user shouldn't have to reboot just because they want to now hibernate to
a file instead of a partition, for example.
To accomplish this, TuxOnIce implements a very generic means whereby the
core and modules can register new sysfs entries. All TuxOnIce entries use
a single _store and _show routine, both of which are found in
tuxonice_sysfs.c in the kernel/power directory. These routines handle the
most common operations - getting and setting the values of bits, integers,
longs, unsigned longs and strings in one place, and allow overrides for
customised get and set options as well as side-effect routines for all
reads and writes.
When combined with some simple macros, a new sysfs entry can then be defined
in just a couple of lines:
SYSFS_INT("progress_granularity", SYSFS_RW, &progress_granularity, 1,
2048, 0, NULL),
This defines a sysfs entry named "progress_granularity" which is rw and
allows the user to access an integer stored at &progress_granularity, giving
it a value between 1 and 2048 inclusive.
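To illustrate what a generic, range-checked integer entry involves, here
is a small userspace sketch. It is not the code in tuxonice_sysfs.c: the
struct int_entry type and int_entry_store() are hypothetical names, and
rejecting out-of-range input with -EINVAL is an assumption (the real
handler may treat such input differently).

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical description of one integer entry and its allowed range. */
struct int_entry {
	const char *name;
	int *value;
	int min, max;		/* inclusive bounds */
};

/* Parse the text written by the user and store it if it is in range. */
static int int_entry_store(const struct int_entry *e, const char *buf)
{
	char *end;
	long v;

	assert(e && e->value && buf);
	errno = 0;
	v = strtol(buf, &end, 0);
	if (end == buf || errno == ERANGE)
		return -EINVAL;		/* not a number at all */
	if (v < e->min || v > e->max)
		return -EINVAL;		/* outside the declared range */
	*e->value = (int)v;
	return 0;
}
```

With one such routine shared by every entry, adding a new setting only
needs a name, a pointer and the two bounds, which is what the macro above
packs into a single line.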
Sysfs entries are registered under /sys/power/tuxonice, and entries for
modules are located in a subdirectory named after the module.
+948
View File
@@ -0,0 +1,948 @@
--- TuxOnIce, version 3.0 ---
1. What is it?
2. Why would you want it?
3. What do you need to use it?
4. Why not just use the version already in the kernel?
5. How do you use it?
6. What do all those entries in /sys/power/tuxonice do?
7. How do you get support?
8. I think I've found a bug. What should I do?
9. When will XXX be supported?
10. How does it work?
11. Who wrote TuxOnIce?
1. What is it?
Imagine you're sitting at your computer, working away. For some reason, you
need to turn off your computer for a while - perhaps it's time to go home
for the day. When you come back to your computer next, you're going to want
to carry on where you left off. Now imagine that you could push a button and
have your computer store the contents of its memory to disk and power down.
Then, when you next start up your computer, it loads that image back into
memory and you can carry on from where you were, just as if you'd never
turned the computer off. You have far less time to start up, no reopening of
applications or finding what directory you put that file in yesterday.
That's what TuxOnIce does.
TuxOnIce has a long heritage. It began life as work by Gabor Kuti, who,
with some help from Pavel Machek, got an early version going in 1999. The
project was then taken over by Florent Chabaud while still in alpha version
numbers. Nigel Cunningham came on the scene when Florent was unable to
continue, moving the project into betas, then 1.0, 2.0 and so on up to
the present series. During the 2.0 series, the name was contracted to
Suspend2 and the website suspend2.net created. Beginning around July 2007,
a transition to calling the software TuxOnIce was made, to seek to help
make it clear that TuxOnIce is more concerned with hibernation than suspend
to ram.
Pavel Machek's swsusp code, which was merged around 2.5.17, retains the
original name, and was essentially a fork of the beta code until Rafael
Wysocki came on the scene in 2005 and began to improve it further.
2. Why would you want it?
Why wouldn't you want it?
Being able to save the state of your system and quickly restore it improves
your productivity - you get a useful system in far less time than through
the normal boot process. You also get to be completely 'green', using zero
power, or as close to that as possible (the computer may still provide
minimal power to some devices, so they can initiate a power on, but that
will be the same amount of power as would be used if you told the computer
to shut down).
3. What do you need to use it?
a. Kernel Support.
i) The TuxOnIce patch.
TuxOnIce is part of the Linux Kernel. This version is not part of Linus's
2.6 tree at the moment, so you will need to download the kernel source and
apply the latest patch. Having done that, enable the appropriate options in
make [menu|x]config (under Power Management Options - look for "Enhanced
Hibernation"), compile and install your kernel. TuxOnIce works with SMP,
Highmem, preemption, fuse filesystems, x86-32, PPC and x86_64.
TuxOnIce patches are available from http://tuxonice.net.
ii) Compression support.
Compression support is implemented via the cryptoapi. You will therefore want
to select any Cryptoapi transforms that you want to use on your image from
the Cryptoapi menu while configuring your kernel. We recommend the use of the
LZO compression method - it is very fast and still achieves good compression.
You can also tell TuxOnIce to write its image to an encrypted and/or
compressed filesystem/swap partition. In that case, you don't need to do
anything special for TuxOnIce when it comes to kernel configuration.
iii) Configuring other options.
While you're configuring your kernel, try to configure as much as possible
to build as modules. We recommend this because there are a number of drivers
that are still in the process of implementing proper power management
support. In those cases, the best way to work around their current lack is
to build them as modules and remove the modules while hibernating. You might
also bug the driver authors to get their support up to speed, or even help!
b. Storage.
i) Swap.
TuxOnIce can store the hibernation image in your swap partition, a swap file or
a combination thereof. Whichever combination you choose, you will probably
want to create enough swap space to store the largest image you could have,
plus the space you'd normally use for swap. A good rule of thumb would be
to calculate the amount of swap you'd want without using TuxOnIce, and then
add the amount of memory you have. This swapspace can be arranged in any way
you'd like. It can be in one partition or file, or spread over a number. The
only requirement is that they be active when you start a hibernation cycle.
There is one exception to this requirement. TuxOnIce has the ability to turn
on one swap file or partition at the start of hibernating and turn it back off
at the end. If you want to ensure you have enough memory to store an image
when your memory is fully used, you might want to make one swap partition or
file for 'normal' use, and another for TuxOnIce to activate & deactivate
automatically. (Further details below).
ii) Normal files.
TuxOnIce includes a 'file allocator'. The file allocator can store your
image in a simple file. Since Linux has the concept of everything being a
file, this is more powerful than it initially sounds. If, for example, you
were to set up a network block device file, you could hibernate to a network
server. This has been tested and works to a point, but nbd itself isn't
stateless enough for our purposes.
Take extra care when setting up the file allocator. If you just type
commands without thinking and then try to hibernate, you could cause
irreversible corruption on your filesystems! Make sure you have backups.
Most people will only want to hibernate to a local file. To achieve that, do
something along the lines of:
echo "TuxOnIce" > /hibernation-file
dd if=/dev/zero bs=1M count=512 >> /hibernation-file
This will create a 512MB file called /hibernation-file. To get TuxOnIce to use
it:
echo /hibernation-file > /sys/power/tuxonice/file/target
Then
cat /sys/power/tuxonice/resume
Put the results of this into your bootloader's configuration (see also step
C, below):
---EXAMPLE-ONLY-DON'T-COPY-AND-PASTE---
# cat /sys/power/tuxonice/resume
file:/dev/hda2:0x1e001
In this example, we would edit the append= line of our lilo.conf|menu.lst
so that it included:
resume=file:/dev/hda2:0x1e001
---EXAMPLE-ONLY-DON'T-COPY-AND-PASTE---
For those who are thinking 'Could I make the file sparse?', the answer is
'No!'. At the moment, there is no way for TuxOnIce to fill in the holes in
a sparse file while hibernating. In the longer term (post merge!), I'd like
to change things so that the file could be dynamically resized and have
holes filled as needed. Right now, however, that's not possible and not a
priority.
c. Bootloader configuration.
Using TuxOnIce also requires that you add an extra parameter to
your lilo.conf or equivalent. Here's an example for a swap partition:
append="resume=swap:/dev/hda1"
This would tell TuxOnIce that /dev/hda1 is a swap partition you
have. TuxOnIce will use the swap signature of this partition as a
pointer to your data when you hibernate. This means that (in this example)
/dev/hda1 doesn't need to be _the_ swap partition where all of your data
is actually stored. It just needs to be a swap partition that has a
valid signature.
You don't need to have a swap partition for this purpose. TuxOnIce
can also use a swap file, but usage is a little more complex. Having made
your swap file, turn it on and do
cat /sys/power/tuxonice/swap/headerlocations
(this assumes you've already compiled your kernel with TuxOnIce
support and booted it). The results of the cat command will tell you
what you need to put in lilo.conf:
For swap partitions like /dev/hda1, simply use resume=/dev/hda1.
For swapfile `swapfile`, use resume=swap:/dev/hda2:0x242d.
If the swapfile changes for any reason (it is moved to a different
location, it is deleted and recreated, or the filesystem is
defragmented) then you will have to check
/sys/power/tuxonice/swap/headerlocations for a new resume_block value.
Once you've compiled and installed the kernel and adjusted your bootloader
configuration, you should only need to reboot for the most basic part
of TuxOnIce to be ready.
If you only compile in the swap allocator, or only compile in the file
allocator, you don't need to add the "swap:" part of the resume=
parameters above. resume=/dev/hda2:0x242d will work just as well. If you
have compiled both and your storage is on swap, you can also use this
format (the swap allocator is the default allocator).
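The shape of the resume= values shown above (an optional "swap:" or "file:"
prefix, a device, and an optional block offset) can be made explicit with a
small sketch. This is only a split of the example strings, written for
userspace; it is not the kernel's parser, and struct resume_spec and
parse_resume() are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result of splitting the text that follows "resume=". */
struct resume_spec {
	char allocator[8];	/* "swap", "file", or "" for the default */
	char device[64];
	unsigned long block;	/* only meaningful if has_block is set */
	int has_block;
};

static int parse_resume(const char *arg, struct resume_spec *out)
{
	const char *p = arg, *colon;
	size_t len;
	char *end;

	assert(arg && out);
	memset(out, 0, sizeof(*out));

	/* Optional allocator prefix. */
	if (!strncmp(p, "swap:", 5) || !strncmp(p, "file:", 5)) {
		memcpy(out->allocator, p, 4);
		p += 5;
	}

	/* Device name, up to the next colon if there is one. */
	colon = strchr(p, ':');
	len = colon ? (size_t)(colon - p) : strlen(p);
	if (len == 0 || len >= sizeof(out->device))
		return -1;
	memcpy(out->device, p, len);

	/* Optional block offset, as in 0x242d. */
	if (colon) {
		out->block = strtoul(colon + 1, &end, 0);
		if (end == colon + 1 || *end != '\0')
			return -1;
		out->has_block = 1;
	}
	return 0;
}
```

For example, "swap:/dev/hda2:0x242d" splits into allocator "swap", device
"/dev/hda2" and block 0x242d, while "/dev/hda1" has neither a prefix nor a
block offset.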
When compiling your kernel, one of the options in the 'Power Management
Support' menu, just above the 'Enhanced Hibernation (TuxOnIce)' entry is
called 'Default resume partition'. This can be used to set a default value
for the resume= parameter.
d. The hibernate script.
Since the driver model in 2.6 kernels is still being developed, you may need
to do more than just configure TuxOnIce. Users of TuxOnIce usually start the
process via a script which prepares for the hibernation cycle, tells the
kernel to do its stuff and then restores things afterwards. This script might
involve:
- Switching to a text console and back if X doesn't like the video card
status on resume.
- Un/reloading drivers that don't play well with hibernation.
Note that you might not be able to unload some drivers if there are
processes using them. You might have to kill off processes that hold
devices open. Hint: if your X server accesses a USB mouse, doing a
'chvt' to a text console releases the device and you can unload the
module.
Check out the latest script (available on tuxonice.net).
e. The userspace user interface.
TuxOnIce has very limited support for displaying status if you only apply
the kernel patch - it can printk messages, but that is all. In addition,
some of the functions mentioned in this document (such as cancelling a cycle
or performing interactive debugging) are unavailable. To utilise these
functions, or simply get a nice display, you need the 'userui' component.
Userui comes in three flavours, usplash, fbsplash and text. Text should
work on any console. Usplash and fbsplash require the appropriate
(distro specific?) support.
To utilise a userui, TuxOnIce just needs to be told where to find the
userspace binary:
echo "/usr/local/sbin/tuxoniceui_fbsplash" > /sys/power/tuxonice/user_interface/program
The hibernate script can do this for you, and a default value for this
setting can be configured when compiling the kernel. This path is also
stored in the image header, so if you have an initrd or initramfs, you can
use the userui during the first part of resuming (prior to the atomic
restore) by putting the binary in the same path in your initrd/ramfs.
Alternatively, you can put it in a different location and do an echo
similar to the above prior to the echo > do_resume. The value saved in the
image header will then be ignored.
4. Why not just use the version already in the kernel?
The version in the vanilla kernel has a number of drawbacks. The most
serious of these are:
- it has a maximum image size of 1/2 total memory;
- it doesn't allocate storage until after it has snapshotted memory.
This means that you can't be sure hibernating will work until you
see it start to write the image;
- it does not allow you to press escape to cancel a cycle;
- it does not allow you to press escape to cancel resuming;
- it does not allow you to automatically swapon a file when
starting a cycle;
- it does not allow you to use multiple swap partitions or files;
- it does not allow you to use ordinary files;
- it just invalidates an image and continues to boot if you
accidentally boot the wrong kernel after hibernating;
- it doesn't support any sort of nice display while hibernating;
- it is moving toward requiring that you have an initrd/initramfs
to ever have a hope of resuming (uswsusp). While uswsusp will
address some of the concerns above, it won't address all of them,
and will be more complicated to get set up;
- it doesn't have support for suspend-to-both (write a hibernation
image, then suspend to ram; I think this is known as ReadySafe
under M$).
5. How do you use it?
A hibernation cycle can be started directly by doing:
echo > /sys/power/tuxonice/do_hibernate
In practice, though, you'll probably want to use the hibernate script
to unload modules, configure the kernel the way you like it and so on.
In that case, you'd do (as root):
hibernate
See the hibernate script's man page for more details on the options it
takes.
If you're using the text or splash user interface modules, one feature of
TuxOnIce that you might find useful is that you can press Escape at any time
during hibernating, and the process will be aborted.
Due to the way hibernation works, this means you'll have your system back and
perfectly usable almost instantly. The only exception is when it's at the
very end of writing the image. Then it will need to reload a small (usually
4-50MBs, depending upon the image characteristics) portion first.
Likewise, when resuming, you can press escape and resuming will be aborted.
The computer will then powerdown again according to settings at that time for
the powerdown method or rebooting.
You can change the settings for powering down while the image is being
written by pressing 'R' to toggle rebooting and 'O' to toggle between
suspending to ram and powering down completely.
If you run into problems with resuming, adding the "noresume" option to
the kernel command line will let you skip the resume step and recover your
system. This option shouldn't normally be needed, because TuxOnIce modifies
the image header prior to the atomic restore, and will thus prompt you
if it detects that you've tried to resume an image before (this flag is
removed if you press Escape to cancel a resume, so you won't be prompted
then).
Recent kernels (2.6.24 onwards) add support for resuming from a different
kernel to the one that was hibernated (thanks to Rafael for his work on
this - I've just embraced and enhanced the support for TuxOnIce). This
should further reduce the need for you to use the noresume option.
6. What do all those entries in /sys/power/tuxonice do?
/sys/power/tuxonice is the directory which contains files you can use to
tune and configure TuxOnIce to your liking. The exact contents of
the directory will depend upon the version of TuxOnIce you're
running and the options you selected at compile time. In the following
descriptions, names in brackets refer to compile time options.
(Note that they're all dependent upon you having selected CONFIG_TUXONICE
in the first place!).
Since the values of these settings can open potential security risks, the
writeable ones are accessible only to the root user. You may want to
configure sudo to allow you to invoke your hibernate script as an ordinary
user.
- alloc/failure_test
This debugging option provides a way of testing TuxOnIce's handling of
memory allocation failures. Each allocation type that TuxOnIce makes has
been given a unique number (see the source code). Echo the appropriate
number into this entry, and when TuxOnIce attempts to do that allocation,
it will pretend there was a failure and act accordingly.
- alloc/find_max_mem_allocated
This debugging option will cause TuxOnIce to find the maximum amount of
memory it used during a cycle, and report that information in debugging
information at the end of the cycle.
- alt_resume_param
Instead of powering down after writing a hibernation image, TuxOnIce
supports resuming from a different image. This entry lets you set the
location of the signature for that image (the resume= value you'd use
for it). Using an alternate image and keep_image mode, you can do things
like using an alternate image to power down an uninterruptible power
supply.
- block_io/target_outstanding_io
This value controls the amount of memory that the block I/O code says it
needs when the core code is calculating how much memory is needed for
hibernating and for resuming. It doesn't directly control the amount of
I/O that is submitted at any one time - that depends on the amount of
available memory (we may have more available than we asked for), the
throughput that is being achieved and the ability of the CPU to keep up
with disk throughput (particularly where we're compressing pages).
- checksum/enabled
Use cryptoapi hashing routines to verify that Pageset2 pages don't change
while we're saving the first part of the image, and to get any pages that
do change resaved in the atomic copy. This should normally not be needed,
but if you're seeing issues, please enable this. If your issues stop you
being able to resume, enable this option, hibernate and cancel the cycle
after the atomic copy is done. If the debugging info shows a non-zero
number of pages resaved, please report this to Nigel.
- compression/algorithm
Set the cryptoapi algorithm used for compressing the image.
- compression/expected_compression
These values allow you to set an expected compression ratio, which TuxOnIce
will use in calculating whether it meets constraints on the image size. If
this expected compression ratio is not attained, the hibernation cycle will
abort, so it is wise to allow some spare. You can see what compression
ratio is achieved in the logs after hibernating.
- debug_info:
This file returns information about your configuration that may be helpful
in diagnosing problems with hibernating.
- did_suspend_to_both:
This file can be used when you hibernate with powerdown method 3 (ie suspend
to ram after writing the image). There can be two outcomes in this case. We
can resume from the suspend-to-ram before the battery runs out, or we can run
out of juice and end up resuming like normal. This entry lets you find out,
post resume, which way we went. If the value is 1, we resumed from suspend
to ram. This can be useful when actions need to be run post suspend-to-ram
that don't need to be run if we did the normal resume from power off.
- do_hibernate:
When anything is written to this file, the kernel side of TuxOnIce will
begin to attempt to write an image to disk and power down. You'll normally
want to run the hibernate script instead, to get modules unloaded first.
- do_resume:
When anything is written to this file TuxOnIce will attempt to read and
restore an image. If there is no image, it will return almost immediately.
If an image exists, the echo > will never return. Instead, the original
kernel context will be restored and the original echo > do_hibernate will
return.
- */enabled
These options can be used to temporarily disable various parts of TuxOnIce.
- extra_pages_allowance
When TuxOnIce does its atomic copy, it calls the driver model suspend
and resume methods. If you have DRI enabled with a driver such as fglrx,
this can result in the driver allocating a substantial amount of memory
for storing its state. Extra_pages_allowance tells TuxOnIce how much
extra memory it should ensure is available for those allocations. If
your attempts at hibernating end with a message in dmesg indicating that
insufficient extra pages were allowed, you need to increase this value.
- file/target:
Read this value to get the current setting. Write to it to point TuxOnIce
at a new storage location for the file allocator. See section 3.b.ii above
for details of how to set up the file allocator.
- freezer_test
This entry can be used to get TuxOnIce to just test the freezer and prepare
an image without actually doing a hibernation cycle. It is useful for
diagnosing freezing and image preparation issues.
- full_pageset2
TuxOnIce divides the pages that are stored in an image into two sets. The
difference between the two sets is that pages in pageset 1 are atomically
copied, and pages in pageset 2 are written to disk without being copied
first. A page CAN be written to disk without being copied first if and only
if its contents will not be modified or used at any time after userspace
processes are frozen. A page MUST be in pageset 1 if its contents are
modified or used at any time after userspace processes have been frozen.
Normally (ie if this option is enabled), TuxOnIce will put all pages on the
per-zone LRUs in pageset2, then remove those pages used by any userspace
user interface helper and TuxOnIce storage manager that are running,
together with pages used by the GEM memory manager introduced around 2.6.28
kernels.
If this option is disabled, a much more conservative approach will be taken.
The only pages in pageset2 will be those belonging to userspace processes,
with the exclusion of those belonging to the TuxOnIce userspace helpers
mentioned above. This will result in a much smaller pageset2, and will
therefore result in smaller images than are possible with this option
enabled.
- ignore_rootfs
TuxOnIce records which device is mounted as the root filesystem when
writing the hibernation image. It will normally check at resume time that
this device isn't already mounted - that would be a cause of filesystem
corruption. In some particular cases (RAM based root filesystems), you
might want to disable this check. This option allows you to do that.
- image_exists:
Can be used in a script to determine whether a valid image exists at the
location currently pointed to by resume=. Returns up to three lines.
The first is whether an image exists (-1 for unsure, otherwise 0 or 1).
If an image exists, additional lines will return the machine and version.
Echoing anything to this entry removes any current image.
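A script or helper reading image_exists has to cope with one to three
lines. The sketch below parses such text once it has been read into a
buffer. The exact wording of the machine and version lines is not specified
above, so the sample strings are made up; struct image_info and
parse_image_exists() are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical parsed form of the image_exists entry. */
struct image_info {
	int exists;		/* -1 unsure, 0 no image, 1 image present */
	char machine[64];
	char version[64];
};

static int parse_image_exists(const char *text, struct image_info *out)
{
	assert(text && out);
	memset(out, 0, sizeof(*out));

	/* First line: the tri-state answer. */
	if (sscanf(text, "%d", &out->exists) != 1)
		return -1;
	if (out->exists < -1 || out->exists > 1)
		return -1;

	/* Only when an image exists are the extra lines expected. */
	if (out->exists == 1)
		(void)sscanf(text, "%*d %63[^\n] %63[^\n]",
			     out->machine, out->version);
	return 0;
}
```

A caller should treat -1 differently from 0: "unsure" does not mean that
it is safe to assume no image is present.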
- image_size_limit:
The maximum size of hibernation image written to disk, measured in megabytes
(1024*1024).
- last_result:
The result of the last hibernation cycle, as defined in
include/linux/suspend-debug.h with the values SUSPEND_ABORTED to
SUSPEND_KEPT_IMAGE. This is a bitmask.
- late_cpu_hotplug:
This sysfs entry controls whether cpu hotplugging is done - as normal - just
before (unplug) and after (replug) the atomic copy/restore (so that all
CPUs/cores are available for multithreaded I/O). The alternative is to
unplug all secondary CPUs/cores at the start of hibernating/resuming, and
replug them at the end of resuming. No multithreaded I/O will be possible in
this configuration, but the odd machine has been reported to require it.
- lid_file:
This determines which ACPI button file we look in to determine whether the
lid is open or closed after resuming from suspend to disk or power off.
If the entry is set to "lid/LID", we'll open /proc/acpi/button/lid/LID/state
and check its contents at the appropriate moment. See post_wake_state below
for more details on how this entry is used.
- log_everything (CONFIG_PM_DEBUG):
Setting this option results in all messages printed being logged. Normally,
only a subset are logged, so as to not slow the process and not clutter the
logs. Useful for debugging. It can be toggled during a cycle by pressing
'L'.
- no_load_direct:
This is a debugging option. If, when loading the atomically copied pages of
an image, TuxOnIce finds that the destination address for a page is free,
it will normally allocate the page, load the data directly into that
address and skip it in the atomic restore. If this option is disabled, the
page will be loaded somewhere else and atomically restored like other pages.
- no_flusher_thread:
When doing multithreaded I/O (see below), the first online CPU can be used
to _just_ submit compressed pages when writing the image, rather than
compressing and submitting data. This option is normally disabled, but has
been included because Nigel would like to see whether it will be more useful
as the number of cores/cpus in computers increases.
- no_multithreaded_io:
TuxOnIce will normally create one thread per cpu/core on your computer,
each of which will then perform I/O. This will generally result in
throughput that's the maximum the storage medium can handle. There
shouldn't be any reason to disable multithreaded I/O now, but this option
has been retained for debugging purposes.
- no_pageset2
See the entry for full_pageset2 above for an explanation of pagesets.
Enabling this option causes TuxOnIce to do an atomic copy of all pages,
thereby limiting the maximum image size to 1/2 of memory, as swsusp does.
- no_pageset2_if_unneeded
See the entry for full_pageset2 above for an explanation of pagesets.
Enabling this option causes TuxOnIce to act like no_pageset2 was enabled
if and only if it isn't needed anyway. This option may still make TuxOnIce
less reliable because pageset2 pages are normally used to store the
atomic copy - drivers that want to do allocations of larger amounts of
memory in one shot will be more likely to find that those amounts aren't
available if this option is enabled.
- pause_between_steps (CONFIG_PM_DEBUG):
This option is used during debugging, to make TuxOnIce pause between
each step of the process. It is ignored when the nice display is on.
- post_wake_state:
TuxOnIce provides support for automatically waking after a user-selected
delay, and using a different powerdown method if the lid is still closed.
(Yes, we're assuming a laptop). This entry lets you choose what state
should be entered next. The values are those described under
powerdown_method, below. It can be used to suspend to RAM after hibernating,
then powerdown properly (say) 20 minutes later. It can also be used to power down
properly, then wake at (say) 6.30am and suspend to RAM until you're ready
to use the machine.
- powerdown_method:
Used to select a method by which TuxOnIce should powerdown after writing the
image. Currently:
0: Don't use ACPI to power off.
3: Attempt to enter Suspend-to-ram.
4: Attempt to enter ACPI S4 mode.
5: Attempt to power down via ACPI S5 mode.
Note that these options are highly dependent upon your hardware & software:
3: When successful, your machine suspends to ram instead of powering off.
The advantage of using this mode is that it doesn't matter whether your
battery has enough charge to make it through to your next resume. If it
lasts, you will simply resume from suspend to ram (and the image on disk
will be discarded). If the battery runs out, you will resume from disk
instead. The disadvantage is that it takes longer than a normal
suspend-to-ram to enter the state, since the suspend-to-disk image needs
to be written first.
4/5: When successful, your machine will be off and consume (almost) no power.
But it might still react to some external events like opening the lid or
traffic on a network or usb device. For the bios, resume is then the same
as warm boot, similar to a situation where you used the command `reboot'
to reboot your machine. If your machine has problems on warm boot or if
you want to protect your machine with the bios password, this is probably
not the right choice. Mode 4 may be necessary on some machines where ACPI
wake up methods need to be run to properly reinitialise hardware after a
hibernation cycle.
0: Switch the machine completely off. The only possible wakeup is the power
button. For the bios, resume is then the same as a cold boot, in
particular you would have to provide your bios boot password if your
machine uses that feature for booting.
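As a minimal sketch of how a script might select one of the methods listed above: the values 0, 3, 4 and 5 come from the list, while the helper name and the assumption that the entry is a file called powerdown_method under /sys/power/tuxonice are illustrative only.

```python
import os

# Values documented above; the descriptions are paraphrased.
POWERDOWN_METHODS = {
    0: "power off without ACPI",
    3: "suspend to RAM after writing the image",
    4: "enter ACPI S4",
    5: "power down via ACPI S5",
}


def set_powerdown_method(method, root="/sys/power/tuxonice"):
    """Validate `method` and write it to the (assumed) sysfs entry."""
    if method not in POWERDOWN_METHODS:
        raise ValueError(f"unsupported powerdown_method: {method}")
    # Assumed location of the entry; writing normally requires root.
    with open(os.path.join(root, "powerdown_method"), "w") as f:
        f.write(f"{method}\n")
    return POWERDOWN_METHODS[method]
```

Rejecting unknown values in the script avoids writing a number whose meaning the list above does not define.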
- progressbar_granularity_limit:
This option can be used to limit the granularity of the progress bar
displayed with a bootsplash screen. The value is the maximum number of
steps. That is, 10 will make the progress bar jump in 10% increments.
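The arithmetic can be illustrated as follows. This is not TuxOnIce's drawing code, only a sketch of what "at most N steps" means; the function name and the rounding-down behaviour are assumptions.

```python
def displayed_percent(done, total, granularity_limit):
    """Quantise progress so the bar moves in at most `granularity_limit` steps."""
    if total <= 0 or granularity_limit <= 0:
        raise ValueError("total and granularity_limit must be positive")
    # Number of whole steps completed so far (integer arithmetic only).
    steps_done = (done * granularity_limit) // total
    # Convert the step count back into a percentage of the full bar.
    return steps_done * 100 // granularity_limit
```

With a limit of 10, progress of 37 out of 100 is shown as 30%, and the bar next moves when 40% is reached.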
- reboot:
This option causes TuxOnIce to reboot rather than powering down
at the end of saving an image. It can be toggled during a cycle by pressing
'R'.
- resume:
This sysfs entry can be used to read and set the location in which TuxOnIce
will look for the signature of an image - the value set using resume= at
boot time or CONFIG_PM_STD_PARTITION ("Default resume partition"). By
writing to this file as well as modifying your bootloader's configuration
file (eg menu.lst), you can set or reset the location of your image or the
method of storing the image without rebooting.
- replace_swsusp (CONFIG_TOI_REPLACE_SWSUSP):
This option makes
echo disk > /sys/power/state
activate TuxOnIce instead of swsusp. Regardless of whether this option is
enabled, any invocation of swsusp's resume time trigger will cause TuxOnIce
to check for an image too. This is due to the fact that at resume time, we
can't know whether this option was enabled until we see if an image is there
for us to resume from. (And when an image exists, we don't care whether we
did replace swsusp anyway - we just want to resume).
- resume_commandline:
This entry can be read after resuming to see the commandline that was used
when resuming began. You might use this to set up two bootloader entries
that are the same apart from the fact that one includes an extra append=
argument "at_work=1". You could then grep resume_commandline in your
post-resume scripts and configure networking (for example) differently
depending upon whether you're at home or work. resume_commandline can be
set to arbitrary text if you wish to remove sensitive contents.
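A post-resume script could check for the extra argument along these lines. This is a sketch only: the helper names are hypothetical, and the path assumes the entry lives under /sys/power/tuxonice like the other entries described here.

```python
def has_kernel_arg(cmdline, name, value):
    """Return True if `name=value` appears as a whole argument in cmdline."""
    # Splitting on whitespace avoids matching at_work=10 when looking for at_work=1.
    return f"{name}={value}" in cmdline.split()


def read_resume_commandline(path="/sys/power/tuxonice/resume_commandline"):
    # Assumed location; adjust if your kernel exposes the entry elsewhere.
    with open(path) as f:
        return f.read().strip()
```

A script would then call has_kernel_arg(read_resume_commandline(), "at_work", "1") and pick the network configuration accordingly.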
- swap/swapfilename:
This entry is used to specify the swapfile or partition that
TuxOnIce will attempt to swapon/swapoff automatically. Thus, if
I normally use /dev/hda1 for swap, and want to use /dev/hda2 specifically
for my hibernation image, I would
echo /dev/hda2 > /sys/power/tuxonice/swap/swapfilename
/dev/hda2 would then be automatically swapon'd and swapoff'd. Note that the
swapon and swapoff occur while other processes are frozen (including kswapd)
so this swap file will not be used up when attempting to free memory. The
partition/file is also given the highest priority, so other swapfiles/partitions
will only be used to save the image when this one is filled.
The value of this file is used by headerlocations along with any currently
activated swapfiles/partitions.
- swap/headerlocations:
This option tells you the resume= options to use for swap devices you
currently have activated. It is particularly useful when you only want to
use a swap file to store your image. See above for further details.
- test_bio
This is a debugging option. When enabled, TuxOnIce will not hibernate.
Instead, when asked to write an image, it will skip the atomic copy,
just doing the writing of the image and then returning control to the
user at the point where it would have powered off. This is useful for
testing throughput in different configurations.
- test_filter_speed
This is a debugging option. When enabled, TuxOnIce will not hibernate.
Instead, when asked to write an image, it will not write anything or do
an atomic copy, but will only run any enabled compression algorithm on the
data that would have been written (the source pages of the atomic copy in
the case of pageset 1). This is useful for comparing the performance of
compression algorithms and for determining the extent to which an upgrade
to your storage method would improve hibernation speed.
- user_interface/debug_sections (CONFIG_PM_DEBUG):
This value, together with the console log level, controls what debugging
information is displayed. The console log level determines the level of
detail, and this value determines what detail is displayed. This value is
a bit vector, and the meaning of the bits can be found in the kernel tree
in include/linux/tuxonice.h. It can be overridden using the kernel's
command line option suspend_dbg.
- user_interface/default_console_level (CONFIG_PM_DEBUG):
This determines the value of the console log level at the start of a
hibernation cycle. If debugging is compiled in, the console log level can be
changed during a cycle by pressing the digit keys. Meanings are:
0: Nice display.
1: Nice display plus numerical progress.
2: Errors only.
3: Low level debugging info.
4: Medium level debugging info.
5: High level debugging info.
6: Verbose debugging info.
- user_interface/enable_escape:
Setting this to "1" will enable you to abort a hibernation cycle or resume by
pressing escape; "0" (default) disables this feature. Note that enabling
this option means that you cannot initiate a hibernation cycle and then walk
away from your computer, expecting it to be secure. With this feature disabled,
you can validly have this expectation once TuxOnIce begins to write the
image to disk. (Prior to this point, it is possible that TuxOnIce might
abort because of failure to freeze all processes or because constraints
on its ability to save the image are not met).
- user_interface/program
This entry is used to tell TuxOnIce what userspace program to use for
providing a user interface while hibernating. The program uses a netlink
socket to pass messages back and forth to the kernel, and provides all of the
functions formerly implemented in the kernel user interface components.
- version:
The version of TuxOnIce you have compiled into the currently running kernel.
- wake_alarm_dir:
As mentioned above (post_wake_state), TuxOnIce supports automatically waking
after some delay. This entry allows you to select which wake alarm to use.
It should contain the value "rtc0" if you're wanting to use
/sys/class/rtc/rtc0.
- wake_delay:
This value determines the delay from the end of writing the image until the
wake alarm is triggered. You can set an absolute time by writing the desired
time into /sys/class/rtc/<wake_alarm_dir>/wakealarm and leaving these values
empty.
Note that for the wakeup to actually occur, you may need to modify entries
in /proc/acpi/wakeup. This is done by echoing the name of the button in the
first column (eg PBTN) into the file.
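The wake-related entries (wake_alarm_dir, wake_delay and post_wake_state) might be set together from a script as sketched below. The helper name, the /sys/power/tuxonice location of the three entries, and the unit of wake_delay (taken to be seconds here) are all assumptions; check them against your kernel before relying on this.

```python
import os


def configure_wake(alarm_dir, delay, post_wake_state,
                   root="/sys/power/tuxonice"):
    """Write the three wake-related entries under `root`."""
    settings = {
        "wake_alarm_dir": alarm_dir,              # e.g. "rtc0"
        "wake_delay": str(delay),                 # unit assumed to be seconds
        "post_wake_state": str(post_wake_state),  # a powerdown_method value
    }
    for name, value in settings.items():
        with open(os.path.join(root, name), "w") as f:
            f.write(value + "\n")
    return settings
```

For the "suspend to RAM, then power down properly 20 minutes later" case, this would be called as configure_wake("rtc0", 1200, 5) with powerdown_method set to 3.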
7. How do you get support?
Glad you asked. TuxOnIce is being actively maintained and supported
by Nigel (the guy doing most of the kernel coding at the moment), Bernard
(who maintains the hibernate script and userspace user interface components)
and its users.
Resources available include HowTos, FAQs and a Wiki, all available via
tuxonice.net. You can find the mailing lists there.
8. I think I've found a bug. What should I do?
By far and away, the most common problems people have with TuxOnIce are
related to drivers not having adequate power management support. In this
case, it is not a bug with TuxOnIce, but we can still help you. As we
mentioned above, such issues can usually be worked around by building the
functionality as modules and unloading them while hibernating. Please visit
the Wiki for up-to-date lists of known issues and workarounds.
If this information doesn't help, try running:
hibernate --bug-report
..and sending the output to the users mailing list.
Good information on how to provide us with useful information from an
oops is found in the file REPORTING-BUGS, in the top level directory
of the kernel tree. If you get an oops, please especially note the
information about running what is printed on the screen through ksymoops.
The raw information is useless.
9. When will XXX be supported?
If there's a feature missing from TuxOnIce that you'd like, feel free to
ask. We try to be obliging, within reason.
Patches are welcome. Please send to the list.
10. How does it work?
TuxOnIce does its work in a number of steps.
a. Freezing system activity.
The first main stage in hibernating is to stop all other activity. This is
achieved in stages. Processes are considered in four groups, which we will
describe in reverse order for clarity's sake: Threads with the PF_NOFREEZE
flag, kernel threads without this flag, userspace processes with the
PF_SYNCTHREAD flag and all other processes. The first set (PF_NOFREEZE) are
untouched by the refrigerator code. They are allowed to run during hibernating
and resuming, and are used to support user interaction, storage access or the
like. Other kernel threads (those unneeded while hibernating) are frozen last.
This leaves us with userspace processes that need to be frozen. When a
process enters one of the *_sync system calls, we set a PF_SYNCTHREAD flag on
that process for the duration of that call. Processes that have this flag are
frozen after processes without it, so that we can seek to ensure that dirty
data is synced to disk as quickly as possible in a situation where other
processes may be submitting writes at the same time. Freezing the processes
that are submitting data stops new I/O from being submitted. Syncthreads can
then cleanly finish their work. So the order is:
- Userspace processes without PF_SYNCTHREAD or PF_NOFREEZE;
- Userspace processes with PF_SYNCTHREAD (they won't have NOFREEZE);
- Kernel processes without PF_NOFREEZE.
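The ordering above can be modelled in a few lines. This is a toy model, not the kernel's refrigerator code: the Task class and its boolean fields merely stand in for the PF_NOFREEZE and PF_SYNCTHREAD flags, and the task names in the usage note are made up.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    kernel_thread: bool = False
    nofreeze: bool = False    # models PF_NOFREEZE
    syncthread: bool = False  # models PF_SYNCTHREAD


def freeze_order(tasks):
    """Return task names in freezing order; NOFREEZE tasks are never frozen."""
    def stage(t):
        if t.kernel_thread:
            return 2                     # kernel threads are frozen last
        return 1 if t.syncthread else 0  # syncing userspace after the rest
    freezable = [t for t in tasks if not t.nofreeze]
    # sorted() is stable, so tasks keep their relative order within a stage.
    return [t.name for t in sorted(freezable, key=stage)]
```

Given a kernel thread "kworker", a userspace task "sync" with the syncthread flag, a plain userspace task "bash" and a NOFREEZE kernel thread "toi_io", the result is ["bash", "sync", "kworker"].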
b. Eating memory.
For a successful hibernation cycle, you need to have enough disk space to store the
image and enough memory for the various limitations of TuxOnIce's
algorithm. You can also specify a maximum image size. In order to meet
those constraints, TuxOnIce may 'eat' memory. If, after freezing
processes, the constraints aren't met, TuxOnIce will thaw all the
other processes and begin to eat memory until its calculations indicate
the constraints are met. It will then freeze processes again and recheck
its calculations.
c. Allocation of storage.
Next, TuxOnIce allocates the storage that will be used to save
the image.
The core of TuxOnIce knows nothing about how or where pages are stored. We
therefore request the active allocator (remember you might have compiled in
more than one!) to allocate enough storage for our expected image size. If
this request cannot be fulfilled, we eat more memory and try again. If it
is fulfilled, we seek to allocate additional storage, just in case our
expected compression ratio (if any) isn't achieved. This time, however, we
just continue if we can't allocate enough storage.
If these calls to our allocator change the characteristics of the image
such that we haven't allocated enough memory, we also loop. (The allocator
may well need to allocate space for its storage information).
d. Write the first part of the image.
TuxOnIce stores the image in two sets of pages called 'pagesets'.
Pageset 2 contains pages on the active and inactive lists; essentially
the page cache. Pageset 1 contains all other pages, including the kernel.
We use two pagesets for one important reason: We need to make an atomic copy
of the kernel to ensure consistency of the image. Without a second pageset,
that would limit us to an image that was at most half the amount of memory
available. Using two pagesets allows us to store a full image. Since pageset
2 pages won't be needed in saving pageset 1, we first save pageset 2 pages.
We can then make our atomic copy of the remaining pages using both pageset 2
pages and any other pages that are free. While saving both pagesets, we are
careful not to corrupt the image. Among other things, we use lowlevel block
I/O routines that don't change the pagecache contents.
The next step, then, is writing pageset 2.
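The space argument for two pagesets can be checked with simple arithmetic. The sketch below counts pages only and ignores the extra memory TuxOnIce needs for its own bookkeeping; the function names are hypothetical.

```python
def max_image_single_copy(total_pages):
    """Without pageset 2, every saved page needs a free page for its copy."""
    return total_pages // 2


def atomic_copy_fits(pageset1, pageset2, free):
    """Pageset 1 is copied into pageset 2 pages (already saved) plus free pages."""
    return pageset1 <= pageset2 + free
```

On a machine with 1000 pages, a single atomic copy limits the image to 500 pages. With 300 pages in pageset 1, 600 in pageset 2 and 100 free, atomic_copy_fits(300, 600, 100) holds, so a 900-page image can be saved.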
e. Suspending drivers and storing processor context.
Having written pageset2, TuxOnIce calls the power management functions to
notify drivers of the hibernation, and saves the processor state in preparation
for the atomic copy of memory we are about to make.
f. Atomic copy.
At this stage, everything else but the TuxOnIce code is halted. Processes
are frozen or idling, drivers are quiesced and have stored (ideally and where
necessary) their configuration in memory we are about to atomically copy.
In our lowlevel architecture specific code, we have saved the CPU state.
We can therefore now do our atomic copy before resuming drivers etc.
g. Save the atomic copy (pageset 1).
TuxOnIce can then write the atomic copy of the remaining pages. Since we
have copied the pages into other locations, we can continue to use the
normal block I/O routines without fear of corrupting our image.
h. Save the image header.
Nearly there! We save our settings and other parameters needed for
reloading pageset 1 in an 'image header'. We also tell our allocator to
serialise its data at this stage, so that it can reread the image at resume
time.
i. Set the image header.
Finally, we edit the header at our resume= location. The signature is
changed by the allocator to reflect the fact that an image exists, and to
point to the start of that data if necessary (swap allocator).
j. Power down.
Or reboot if we're debugging and the appropriate option is selected.
Whew!
Reloading the image.
--------------------
Reloading the image is essentially the reverse of all the above. We load
our copy of pageset 1, being careful to choose locations that aren't going
to be overwritten as we copy it back (We start very early in the boot
process, so there are no other processes to quiesce here). We then copy
pageset 1 back to its original location in memory and restore the processor
context. We are now running with the original kernel. Next, we reload the
pageset 2 pages, free the memory and swap used by TuxOnIce, restore
the pageset header and restart processes. Sounds easy in comparison to
hibernating, doesn't it!
There is of course more to TuxOnIce than this, but this explanation
should be a good start. If there's interest, I'll write further
documentation on range pages and the low level I/O.
11. Who wrote TuxOnIce?
(Answer based on the writings of Florent Chabaud, credits in files and
Nigel's limited knowledge; apologies to anyone missed out!)
The main developers of TuxOnIce have been...
Gabor Kuti
Pavel Machek
Florent Chabaud
Bernard Blackham
Nigel Cunningham
Significant portions of swsusp, the code in the vanilla kernel which
TuxOnIce enhances, have been worked on by Rafael Wysocki. Thanks should
also be expressed to him.
The above mentioned developers have been aided in their efforts by a host
of hundreds, if not thousands of testers and people who have submitted bug
fixes & suggestions. Of special note are the efforts of Michael Frank, who
had his computers repetitively hibernate and resume for literally tens of
thousands of cycles and developed scripts to stress the system and test
TuxOnIce far beyond the point most of us (Nigel included!) would consider
testing. His efforts have contributed as much to TuxOnIce as any of the
names above.
+75
@@ -0,0 +1,75 @@
Motivation:
In complicated DMA pipelines such as graphics (multimedia, camera, gpu, display)
a consumer of a buffer needs to know when the producer has finished producing
it. Likewise the producer needs to know when the consumer is finished with the
buffer so it can reuse it. A particular buffer may be consumed by multiple
consumers which will retain the buffer for different amounts of time. In
addition, a consumer may consume multiple buffers atomically.
The sync framework adds an API which allows synchronization between the
producers and consumers in a generic way while also allowing platforms which
have shared hardware synchronization primitives to exploit them.
Goals:
* provide a generic API for expressing synchronization dependencies
* allow drivers to exploit hardware synchronization between hardware
blocks
* provide a userspace API that allows a compositor to manage
dependencies.
* provide rich telemetry data to allow debugging slowdowns and stalls of
the graphics pipeline.
Objects:
* sync_timeline
* sync_pt
* sync_fence
sync_timeline:
A sync_timeline is an abstract monotonically increasing counter. In general,
each driver/hardware block context will have one of these. They can be backed
by the appropriate hardware or rely on the generic sw_sync implementation.
Timelines are only ever created through their specific implementations
(e.g. sw_sync).
sync_pt:
A sync_pt is an abstract value which marks a point on a sync_timeline. Sync_pts
have a single timeline parent. They have 3 states: active, signaled, and error.
They start in active state and transition, once, to either signaled (when the
timeline counter advances beyond the sync_pt's value) or error state.
sync_fence:
Sync_fences are the primary primitives used by drivers to coordinate
synchronization of their buffers. They are a collection of sync_pts which may
or may not have the same timeline parent. A sync_pt can only exist in one fence
and the fence's list of sync_pts is immutable once created. Fences can be
waited on synchronously or asynchronously. Two fences can also be merged to
create a third fence containing a copy of the two fences' sync_pts. Fences are
backed by file descriptors to allow userspace to coordinate the display pipeline
dependencies.
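The relationship between the three objects can be sketched as a small userspace model. This is not the kernel API: the class and method names are invented for illustration, the error state is left out, and a point is assumed to count as signaled once the counter reaches or passes its value.

```python
class Timeline:
    """Monotonically increasing counter (models sync_timeline)."""
    def __init__(self):
        self.value = 0

    def advance(self, steps=1):
        if steps < 0:
            raise ValueError("a timeline never moves backwards")
        self.value += steps


class SyncPt:
    """A point on exactly one timeline (models sync_pt)."""
    def __init__(self, timeline, value):
        self.timeline = timeline
        self.value = value

    def signaled(self):
        # Assumption: signaled once the counter reaches or passes the value.
        return self.timeline.value >= self.value


class Fence:
    """Immutable collection of sync points (models sync_fence)."""
    def __init__(self, pts):
        self.pts = tuple(pts)

    def signaled(self):
        return all(pt.signaled() for pt in self.pts)

    @staticmethod
    def merge(a, b):
        # Copy the points: a sync_pt may belong to only one fence.
        copies = [SyncPt(pt.timeline, pt.value) for pt in a.pts + b.pts]
        return Fence(copies)
```

Merging a fence on a GPU timeline with one on a camera timeline gives a fence that signals only after both timelines have advanced far enough, which is the dependency a compositor needs to express.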
Use:
A driver implementing sync support should have a work submission function which:
* takes a fence argument specifying when to begin work
* asynchronously queues that work to kick off when the fence is signaled
* returns a fence to indicate when its work will be done.
* signals the returned fence once the work is completed.
Consider an imaginary display driver that has the following API:
/*
* assumes buf is ready to be displayed.
* blocks until the buffer is on screen.
*/
void display_buffer(struct dma_buf *buf);
The new API will become:
/*
* will display buf when fence is signaled.
* returns immediately with a fence that will signal when buf
* is no longer displayed.
*/
struct sync_fence* display_buffer(struct dma_buf *buf,
struct sync_fence *fence);
+16
@@ -29,6 +29,7 @@ Currently, these files are in /proc/sys/vm:
- dirty_writeback_centisecs
- drop_caches
- extfrag_threshold
- extra_free_kbytes
- hugepages_treat_as_movable
- hugetlb_shm_group
- laptop_mode
@@ -198,6 +199,21 @@ fragmentation index is <= extfrag_threshold. The default value is 500.
==============================================================
extra_free_kbytes
This parameter tells the VM to keep extra free memory between the threshold
where background reclaim (kswapd) kicks in, and the threshold where direct
reclaim (by allocating processes) kicks in.
This is useful for workloads that require low latency memory allocations
and have a bounded burstiness in memory allocations. For example, a
realtime application that receives and transmits network traffic
(causing in-kernel memory allocations) with a maximum total message burst
size of 200MB may need 200MB of extra free memory to avoid direct reclaim
related latencies.
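Because the parameter is expressed in kilobytes, the 200MB burst from the example has to be converted before it is written. The sketch below assumes 1MB = 1024 kilobytes and that the file sits in /proc/sys/vm alongside the other entries; the helper names are hypothetical.

```python
def burst_to_extra_free_kbytes(burst_mb):
    """Convert a worst-case burst size in MB to a value in kilobytes."""
    return burst_mb * 1024


def set_extra_free_kbytes(kbytes, path="/proc/sys/vm/extra_free_kbytes"):
    # Writing requires root; the path is assumed from the list of vm files.
    with open(path, "w") as f:
        f.write(f"{kbytes}\n")
```

For the 200MB burst this gives 204800, the value that would be written to the file.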
==============================================================
hugepages_treat_as_movable
This parameter is only useful when kernelcore= is specified at boot time to
+1 -4
@@ -358,11 +358,8 @@ Every arch has an init callback function. If you need to do something early on
to initialize some state, this is the time to do that. Otherwise, this simple
function below should be sufficient for most people:
int __init ftrace_dyn_arch_init(void *data)
int __init ftrace_dyn_arch_init(void)
{
/* return value is done indirectly via data */
*(unsigned long *)data = 0;
return 0;
}
+29
@@ -2013,6 +2013,35 @@ will produce:
1) 1.449 us | }
You can disable the hierarchical function call formatting and instead print a
flat list of function entry and return events. This uses the format described
in the Output Formatting section and respects all the trace options that
control that formatting. Hierarchical formatting is the default.
hierarchical: echo nofuncgraph-flat > trace_options
flat: echo funcgraph-flat > trace_options
ie:
# tracer: function_graph
#
# entries-in-buffer/entries-written: 68355/68355 #P:2
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
sh-1806 [001] d... 198.843443: graph_ent: func=_raw_spin_lock
sh-1806 [001] d... 198.843445: graph_ent: func=__raw_spin_lock
sh-1806 [001] d..1 198.843447: graph_ret: func=__raw_spin_lock
sh-1806 [001] d..1 198.843449: graph_ret: func=_raw_spin_lock
sh-1806 [001] d..1 198.843451: graph_ent: func=_raw_spin_unlock_irqrestore
sh-1806 [001] d... 198.843453: graph_ret: func=_raw_spin_unlock_irqrestore
You might find other useful features for this tracer in the
following "dynamic ftrace" section such as tracing only specific
functions or tasks.