import PULS_20160108
@@ -1,6 +1,5 @@
|
||||
*.xml
|
||||
*.ps
|
||||
*.pdf
|
||||
*.html
|
||||
*.9.gz
|
||||
*.9
|
||||
|
||||
@@ -0,0 +1,121 @@
|
||||
=============
|
||||
A N D R O I D
|
||||
=============
|
||||
|
||||
Copyright (C) 2009 Google, Inc.
|
||||
Written by Mike Chan <mike@android.com>
|
||||
|
||||
CONTENTS:
|
||||
---------
|
||||
|
||||
1. Android
|
||||
1.1 Required enabled config options
|
||||
1.2 Required disabled config options
|
||||
1.3 Recommended enabled config options
|
||||
2. Contact
|
||||
|
||||
|
||||
1. Android
|
||||
==========
|
||||
|
||||
Android (www.android.com) is an open source operating system for mobile devices.
|
||||
This document describes configurations needed to run the Android framework on
|
||||
top of the Linux kernel.
|
||||
|
||||
To see a working defconfig look at msm_defconfig or goldfish_defconfig
|
||||
which can be found at http://android.git.kernel.org in kernel/common.git
|
||||
and kernel/msm.git
|
||||
|
||||
|
||||
1.1 Required enabled config options
|
||||
-----------------------------------
|
||||
After building a standard defconfig, ensure that these options are enabled in
|
||||
your .config or defconfig if they are not already. The list is based on msm_defconfig.
|
||||
You should keep the rest of the default options enabled in the defconfig
|
||||
unless you know what you are doing.
|
||||
|
||||
ANDROID_PARANOID_NETWORK
|
||||
ASHMEM
|
||||
CONFIG_FB_MODE_HELPERS
|
||||
CONFIG_FONT_8x16
|
||||
CONFIG_FONT_8x8
|
||||
CONFIG_YAFFS_SHORT_NAMES_IN_RAM
|
||||
DAB
|
||||
EARLYSUSPEND
|
||||
FB
|
||||
FB_CFB_COPYAREA
|
||||
FB_CFB_FILLRECT
|
||||
FB_CFB_IMAGEBLIT
|
||||
FB_DEFERRED_IO
|
||||
FB_TILEBLITTING
|
||||
HIGH_RES_TIMERS
|
||||
INOTIFY
|
||||
INOTIFY_USER
|
||||
INPUT_EVDEV
|
||||
INPUT_GPIO
|
||||
INPUT_MISC
|
||||
LEDS_CLASS
|
||||
LEDS_GPIO
|
||||
LOCK_KERNEL
|
||||
LOGGER
|
||||
LOW_MEMORY_KILLER
|
||||
MISC_DEVICES
|
||||
NEW_LEDS
|
||||
NO_HZ
|
||||
POWER_SUPPLY
|
||||
PREEMPT
|
||||
RAMFS
|
||||
RTC_CLASS
|
||||
RTC_LIB
|
||||
SWITCH
|
||||
SWITCH_GPIO
|
||||
TMPFS
|
||||
UID_STAT
|
||||
UID16
|
||||
USB_FUNCTION
|
||||
USB_FUNCTION_ADB
|
||||
USER_WAKELOCK
|
||||
VIDEO_OUTPUT_CONTROL
|
||||
WAKELOCK
|
||||
YAFFS_AUTO_YAFFS2
|
||||
YAFFS_FS
|
||||
YAFFS_YAFFS1
|
||||
YAFFS_YAFFS2
|
||||
|
||||
|
||||
1.2 Required disabled config options
|
||||
------------------------------------
|
||||
CONFIG_YAFFS_DISABLE_LAZY_LOAD
|
||||
DNOTIFY
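As a rough aid for the checks in sections 1.1 and 1.2, the following Python sketch scans the text of a .config file and reports which required options are missing and which options that must be disabled are still on. The function names and the short option lists in the usage note are illustrative assumptions, not part of any Android tooling. It assumes an option is enabled when it appears as CONFIG_<NAME>=<value> with a value other than "n"; lines such as "# CONFIG_DNOTIFY is not set" count as disabled.

```python
import re

# Matches lines such as "CONFIG_ASHMEM=y" or "CONFIG_LOG_BUF_SHIFT=17".
_CONFIG_LINE = re.compile(r"^CONFIG_(\w+)=(.*)$")


def parse_config(text):
    """Return {option name without CONFIG_ prefix: value} for every set option."""
    options = {}
    for line in text.splitlines():
        match = _CONFIG_LINE.match(line.strip())
        if match:
            options[match.group(1)] = match.group(2)
    return options


def _bare(name):
    """Accept both "ASHMEM" and "CONFIG_ASHMEM" spellings."""
    return name[len("CONFIG_"):] if name.startswith("CONFIG_") else name


def check_config(text, must_enable, must_disable):
    """Return (missing, unwanted): options that should be on but are not,
    and options that should be off but are on."""
    options = parse_config(text)

    def is_on(name):
        return options.get(_bare(name), "n") != "n"

    missing = [_bare(n) for n in must_enable if not is_on(n)]
    unwanted = [_bare(n) for n in must_disable if is_on(n)]
    return missing, unwanted
```

Typical use would be check_config(open(".config").read(), ["ASHMEM", "WAKELOCK"], ["DNOTIFY"]), where both returned lists should be empty for a suitable configuration.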
|
||||
|
||||
|
||||
1.3 Recommended enabled config options
|
||||
--------------------------------------
|
||||
ANDROID_PMEM
|
||||
PSTORE_CONSOLE
|
||||
PSTORE_RAM
|
||||
SCHEDSTATS
|
||||
DEBUG_PREEMPT
|
||||
DEBUG_MUTEXES
|
||||
DEBUG_SPINLOCK_SLEEP
|
||||
DEBUG_INFO
|
||||
FRAME_POINTER
|
||||
CPU_FREQ
|
||||
CPU_FREQ_TABLE
|
||||
CPU_FREQ_DEFAULT_GOV_ONDEMAND
|
||||
CPU_FREQ_GOV_ONDEMAND
|
||||
CRC_CCITT
|
||||
EMBEDDED
|
||||
INPUT_TOUCHSCREEN
|
||||
I2C
|
||||
I2C_BOARDINFO
|
||||
LOG_BUF_SHIFT=17
|
||||
SERIAL_CORE
|
||||
SERIAL_CORE_CONSOLE
|
||||
|
||||
|
||||
2. Contact
|
||||
==========
|
||||
website: http://android.git.kernel.org
|
||||
|
||||
mailing-lists: android-kernel@googlegroups.com
|
||||
@@ -0,0 +1,34 @@
|
||||
Tagged virtual addresses in AArch64 Linux
|
||||
=========================================
|
||||
|
||||
Author: Will Deacon <will.deacon@arm.com>
|
||||
Date : 12 June 2013
|
||||
|
||||
This document briefly describes the provision of tagged virtual
|
||||
addresses in the AArch64 translation system and their potential uses
|
||||
in AArch64 Linux.
|
||||
|
||||
The kernel configures the translation tables so that translations made
|
||||
via TTBR0 (i.e. userspace mappings) have the top byte (bits 63:56) of
|
||||
the virtual address ignored by the translation hardware. This frees up
|
||||
this byte for application use, with the following caveats:
|
||||
|
||||
(1) The kernel requires that all user addresses passed to EL1
|
||||
are tagged with tag 0x00. This means that any syscall
|
||||
parameters containing user virtual addresses *must* have
|
||||
their top byte cleared before trapping to the kernel.
|
||||
|
||||
(2) Non-zero tags are not preserved when delivering signals.
|
||||
This means that signal handlers in applications making use
|
||||
of tags cannot rely on the tag information for user virtual
|
||||
addresses being maintained for fields inside siginfo_t.
|
||||
One exception to this rule is for signals raised in response
|
||||
to watchpoint debug exceptions, where the tag information
|
||||
will be preserved.
|
||||
|
||||
(3) Special care should be taken when using tagged pointers,
|
||||
since it is likely that C compilers will not hazard two
|
||||
virtual addresses differing only in the upper byte.
|
||||
|
||||
The architecture prevents the use of a tagged PC, so the upper byte will
|
||||
be set to a sign-extension of bit 55 on exception return.
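The bit manipulation implied by the text above can be sketched in Python as follows. This is only an illustration of the arithmetic on 64-bit values: the helper names are hypothetical, and real code would perform these operations on pointers in C or assembly. It shows placing a tag in bits 63:56, clearing it before an address is handed to the kernel (caveat 1), and the sign-extension of bit 55 that the text describes for the PC.

```python
TAG_SHIFT = 56                       # the tag occupies bits 63:56
ADDR_MASK = (1 << TAG_SHIFT) - 1     # bits 55:0, used for translation
U64_MASK = (1 << 64) - 1             # Python ints are unbounded, so clamp to 64 bits


def set_tag(addr, tag):
    """Return addr with its top byte replaced by tag (0..255)."""
    if not 0 <= tag <= 0xFF:
        raise ValueError("tag must fit in one byte")
    return ((tag << TAG_SHIFT) | (addr & ADDR_MASK)) & U64_MASK


def get_tag(addr):
    """Return the top byte of addr."""
    return (addr >> TAG_SHIFT) & 0xFF


def clear_tag(addr):
    """Return addr with tag 0x00, as required for user addresses passed in syscalls."""
    return addr & ADDR_MASK


def pc_after_exception_return(addr):
    """Model the PC rule: the upper byte becomes a sign-extension of bit 55."""
    low = addr & ADDR_MASK
    if (addr >> 55) & 1:
        return low | (0xFF << TAG_SHIFT)
    return low
```

An application that tags its pointers would apply the equivalent of clear_tag() to every user address it places in a syscall argument.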
|
||||
@@ -598,6 +598,15 @@ is completely unused; @cgrp->parent is still valid. (Note - can also
|
||||
be called for a newly-created cgroup if an error occurs after this
|
||||
subsystem's create() method has been called for the new cgroup).
|
||||
|
||||
int allow_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
|
||||
(cgroup_mutex held by caller)
|
||||
|
||||
Called prior to moving a task into a cgroup; if the subsystem
|
||||
returns an error, this will abort the attach operation. Used
|
||||
to extend the permission checks - if all subsystems in a cgroup
|
||||
return 0, the attach will be allowed to proceed, even if the
|
||||
default permission check (root or same user) fails.
|
||||
|
||||
int can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
|
||||
(cgroup_mutex held by caller)
|
||||
|
||||
|
||||
@@ -28,6 +28,7 @@ Contents:
|
||||
2.3 Userspace
|
||||
2.4 Ondemand
|
||||
2.5 Conservative
|
||||
2.6 Interactive
|
||||
|
||||
3. The Governor Interface in the CPUfreq Core
|
||||
|
||||
@@ -218,6 +219,90 @@ a decision on when to decrease the frequency while running in any
|
||||
speed. Load for frequency increase is still evaluated every
|
||||
sampling rate.
|
||||
|
||||
2.6 Interactive
|
||||
---------------
|
||||
|
||||
The CPUfreq governor "interactive" is designed for latency-sensitive,
|
||||
interactive workloads. This governor sets the CPU speed depending on
|
||||
usage, similar to "ondemand" and "conservative" governors, but with a
|
||||
different set of configurable behaviors.
|
||||
|
||||
The tuneable values for this governor are:
|
||||
|
||||
target_loads: CPU load values used to adjust speed to influence the
|
||||
current CPU load toward that value. In general, the lower the target
|
||||
load, the more often the governor will raise CPU speeds to bring load
|
||||
below the target. The format is a single target load, optionally
|
||||
followed by pairs of CPU speeds and CPU loads to target at or above
|
||||
those speeds. Colons can be used between the speeds and associated
|
||||
target loads for readability. For example:
|
||||
|
||||
85 1000000:90 1700000:99
|
||||
|
||||
targets CPU load 85% below speed 1GHz, 90% at or above 1GHz, until
|
||||
1.7GHz and above, at which load 99% is targeted. If speeds are
|
||||
specified these must appear in ascending order. Higher target load
|
||||
values are typically specified for higher speeds, that is, target load
|
||||
values also usually appear in an ascending order. The default is
|
||||
target load 90% for all speeds.
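To make the format concrete, here is a small Python sketch that parses a target_loads string and looks up the load targeted at a given speed. It is a user-space illustration only, not the governor's own parser; the function names are hypothetical, and speeds are assumed to be in kHz because the example above equates 1000000 with 1GHz.

```python
import re
from bisect import bisect_right


def parse_speed_table(spec):
    """Parse "default [speed:value ...]" into (default, speeds, values).

    Spaces and colons are both accepted as separators.
    """
    tokens = [int(tok) for tok in re.split(r"[ :]+", spec.strip())]
    if len(tokens) % 2 == 0:
        raise ValueError("expected one default value followed by speed/value pairs")
    default, speeds, values = tokens[0], tokens[1::2], tokens[2::2]
    if any(later <= earlier for earlier, later in zip(speeds, speeds[1:])):
        raise ValueError("speeds must appear in ascending order")
    return default, speeds, values


def lookup(table, speed_khz):
    """Return the value that applies at speed_khz (at or above the last listed speed)."""
    default, speeds, values = table
    index = bisect_right(speeds, speed_khz)
    return default if index == 0 else values[index - 1]
```

With the example string, lookup(parse_speed_table("85 1000000:90 1700000:99"), 1200000) yields 90. The above_hispeed_delay tunable uses the same layout, so the same two helpers apply to it with delays in place of loads.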
|
||||
|
||||
min_sample_time: The minimum amount of time to spend at the current
|
||||
frequency before ramping down. Default is 80000 uS.
|
||||
|
||||
hispeed_freq: An intermediate "hi speed" at which to initially ramp
|
||||
when CPU load hits the value specified in go_hispeed_load. If load
|
||||
stays high for the amount of time specified in above_hispeed_delay,
|
||||
then speed may be bumped higher. Default is the maximum speed
|
||||
allowed by the policy at governor initialization time.
|
||||
|
||||
go_hispeed_load: The CPU load at which to ramp to hispeed_freq.
|
||||
Default is 99%.
|
||||
|
||||
above_hispeed_delay: When speed is at or above hispeed_freq, wait for
|
||||
this long before raising speed in response to continued high load.
|
||||
The format is a single delay value, optionally followed by pairs of
|
||||
CPU speeds and the delay to use at or above those speeds. Colons can
|
||||
be used between the speeds and associated delays for readability. For
|
||||
example:
|
||||
|
||||
80000 1300000:200000 1500000:40000
|
||||
|
||||
uses delay 80000 uS until CPU speed 1.3 GHz, at which speed delay
|
||||
200000 uS is used until speed 1.5 GHz, at which speed (and above)
|
||||
delay 40000 uS is used. If speeds are specified these must appear in
|
||||
ascending order. Default is 20000 uS.
|
||||
|
||||
timer_rate: Sample rate for reevaluating CPU load when the CPU is not
|
||||
idle. A deferrable timer is used, such that the CPU will not be woken
|
||||
from idle to service this timer until something else needs to run.
|
||||
(The maximum time to allow deferring this timer when not running at
|
||||
minimum speed is configurable via timer_slack.) Default is 20000 uS.
|
||||
|
||||
timer_slack: Maximum additional time to defer handling the governor
|
||||
sampling timer beyond timer_rate when running at speeds above the
|
||||
minimum. For platforms that consume additional power at idle when
|
||||
CPUs are running at speeds greater than minimum, this places an upper
|
||||
bound on how long the timer will be deferred prior to re-evaluating
|
||||
load and dropping speed. For example, if timer_rate is 20000uS and
|
||||
timer_slack is 10000uS then timers will be deferred for up to 30msec
|
||||
when not at lowest speed. A value of -1 means defer timers
|
||||
indefinitely at all speeds. Default is 80000 uS.
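The arithmetic in the timer_slack example can be written out as a short Python sketch. The function name is hypothetical; it only models the bound described above for speeds above the minimum, returning None when the deferral is unbounded.

```python
def max_timer_deferral_us(timer_rate_us, timer_slack_us):
    """Upper bound, in microseconds, on how long the sampling timer may be
    deferred when running above the minimum speed.

    Returns None when timer_slack is -1, meaning timers may be deferred
    indefinitely.
    """
    if timer_slack_us == -1:
        return None
    return timer_rate_us + timer_slack_us
```

For timer_rate 20000 and timer_slack 10000 this gives 30000 uS, matching the 30msec figure in the text.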
|
||||
|
||||
boost: If non-zero, immediately boost speed of all CPUs to at least
|
||||
hispeed_freq until zero is written to this attribute. If zero, allow
|
||||
CPU speeds to drop below hispeed_freq according to load as usual.
|
||||
Default is zero.
|
||||
|
||||
boostpulse: On each write, immediately boost speed of all CPUs to
|
||||
hispeed_freq for at least the period of time specified by
|
||||
boostpulse_duration, after which speeds are allowed to drop below
|
||||
hispeed_freq according to load as usual.
|
||||
|
||||
boostpulse_duration: Length of time to hold CPU speed at hispeed_freq
|
||||
on a write to boostpulse, before allowing speed to drop according to
|
||||
load as usual. Default is 80000 uS.
|
||||
|
||||
|
||||
3. The Governor Interface in the CPUfreq Core
|
||||
=============================================
|
||||
|
||||
|
||||
@@ -0,0 +1,172 @@
|
||||
=======================================================
|
||||
ARM CCI cache coherent interconnect binding description
|
||||
=======================================================
|
||||
|
||||
ARM multi-cluster systems maintain intra-cluster coherency through a
|
||||
cache coherent interconnect (CCI) that is capable of monitoring bus
|
||||
transactions and managing coherency, TLB invalidations and memory barriers.
|
||||
|
||||
It allows snooping and distributed virtual memory message broadcast across
|
||||
clusters, through a memory-mapped interface, with a global control register
|
||||
space and multiple sets of interface control registers, one per slave
|
||||
interface.
|
||||
|
||||
Bindings for the CCI node follow the ePAPR standard, available from:
|
||||
|
||||
www.power.org/documentation/epapr-version-1-1/
|
||||
|
||||
with the addition of the bindings described in this document which are
|
||||
specific to ARM.
|
||||
|
||||
* CCI interconnect node
|
||||
|
||||
Description: Describes a CCI cache coherent Interconnect component
|
||||
|
||||
Node name must be "cci".
|
||||
Node's parent must be the root node /, and the address space visible
|
||||
through the CCI interconnect is the same as the one seen from the
|
||||
root node (i.e. from the CPUs' perspective, as per the DT standard).
|
||||
Every CCI node has to define the following properties:
|
||||
|
||||
- compatible
|
||||
Usage: required
|
||||
Value type: <string>
|
||||
Definition: must be set to
|
||||
"arm,cci-400"
|
||||
|
||||
- reg
|
||||
Usage: required
|
||||
Value type: <prop-encoded-array>
|
||||
Definition: A standard property. Specifies base physical
|
||||
address of CCI control registers common to all
|
||||
interfaces.
|
||||
|
||||
- ranges:
|
||||
Usage: required
|
||||
Value type: <prop-encoded-array>
|
||||
Definition: A standard property. Follow rules in the ePAPR for
|
||||
hierarchical bus addressing. CCI interfaces
|
||||
addresses refer to the parent node addressing
|
||||
scheme to declare their register bases.
|
||||
|
||||
CCI interconnect node can define the following child nodes:
|
||||
|
||||
- CCI control interface nodes
|
||||
|
||||
Node name must be "slave-if".
|
||||
Parent node must be CCI interconnect node.
|
||||
|
||||
A CCI control interface node must contain the following
|
||||
properties:
|
||||
|
||||
- compatible
|
||||
Usage: required
|
||||
Value type: <string>
|
||||
Definition: must be set to
|
||||
"arm,cci-400-ctrl-if"
|
||||
|
||||
- interface-type:
|
||||
Usage: required
|
||||
Value type: <string>
|
||||
Definition: must be set to one of {"ace", "ace-lite"}
|
||||
depending on the interface type the node
|
||||
represents.
|
||||
|
||||
- reg:
|
||||
Usage: required
|
||||
Value type: <prop-encoded-array>
|
||||
Definition: the base address and size of the
|
||||
corresponding interface programming
|
||||
registers.
|
||||
|
||||
* CCI interconnect bus masters
|
||||
|
||||
Description: masters in the device tree connected to a CCI port
|
||||
(inclusive of CPUs and their cpu nodes).
|
||||
|
||||
A CCI interconnect bus master node must contain the following
|
||||
properties:
|
||||
|
||||
- cci-control-port:
|
||||
Usage: required
|
||||
Value type: <phandle>
|
||||
Definition: a phandle containing the CCI control interface node
|
||||
the master is connected to.
|
||||
|
||||
Example:
|
||||
|
||||
cpus {
	#size-cells = <0>;
	#address-cells = <1>;

	CPU0: cpu@0 {
		device_type = "cpu";
		compatible = "arm,cortex-a15";
		cci-control-port = <&cci_control1>;
		reg = <0x0>;
	};

	CPU1: cpu@1 {
		device_type = "cpu";
		compatible = "arm,cortex-a15";
		cci-control-port = <&cci_control1>;
		reg = <0x1>;
	};

	CPU2: cpu@100 {
		device_type = "cpu";
		compatible = "arm,cortex-a7";
		cci-control-port = <&cci_control2>;
		reg = <0x100>;
	};

	CPU3: cpu@101 {
		device_type = "cpu";
		compatible = "arm,cortex-a7";
		cci-control-port = <&cci_control2>;
		reg = <0x101>;
	};

};

dma0: dma@3000000 {
	compatible = "arm,pl330", "arm,primecell";
	cci-control-port = <&cci_control0>;
	reg = <0x0 0x3000000 0x0 0x1000>;
	interrupts = <10>;
	#dma-cells = <1>;
	#dma-channels = <8>;
	#dma-requests = <32>;
};

cci@2c090000 {
	compatible = "arm,cci-400";
	#address-cells = <1>;
	#size-cells = <1>;
	reg = <0x0 0x2c090000 0 0x1000>;
	ranges = <0x0 0x0 0x2c090000 0x6000>;

	cci_control0: slave-if@1000 {
		compatible = "arm,cci-400-ctrl-if";
		interface-type = "ace-lite";
		reg = <0x1000 0x1000>;
	};

	cci_control1: slave-if@4000 {
		compatible = "arm,cci-400-ctrl-if";
		interface-type = "ace";
		reg = <0x4000 0x1000>;
	};

	cci_control2: slave-if@5000 {
		compatible = "arm,cci-400-ctrl-if";
		interface-type = "ace";
		reg = <0x5000 0x1000>;
	};
};
|
||||
|
||||
This CCI node corresponds to a CCI component whose control registers sit
|
||||
at address 0x000000002c090000.
|
||||
CCI slave interface @0x000000002c091000 is connected to dma controller dma0.
|
||||
CCI slave interface @0x000000002c094000 is connected to CPUs {CPU0, CPU1};
|
||||
CCI slave interface @0x000000002c095000 is connected to CPUs {CPU2, CPU3};
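The absolute addresses quoted above follow from applying the "ranges" property of the cci node to the "reg" base of each slave-if child. The Python sketch below reproduces that translation for the values in the example. The function and constant names are hypothetical, and the ranges entry is written as a (child base, parent base, size) tuple rather than as raw cells.

```python
def translate(child_addr, ranges):
    """Translate an address in the child address space to the parent's.

    ranges is a list of (child_base, parent_base, size) tuples, one per
    entry of the device tree "ranges" property.
    """
    for child_base, parent_base, size in ranges:
        if child_base <= child_addr < child_base + size:
            return parent_base + (child_addr - child_base)
    raise ValueError("address 0x%x is not covered by ranges" % child_addr)


# ranges = <0x0 0x0 0x2c090000 0x6000>; from the example: child address 0x0
# maps to parent address 0x0_2c090000, for a window of 0x6000 bytes.
CCI_RANGES = [(0x0, 0x2C090000, 0x6000)]
```

For instance, translate(0x4000, CCI_RANGES) gives 0x2c094000, the interface that CPU0 and CPU1 reference through cci_control1.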
|
||||
@@ -16,6 +16,9 @@ Required properties:
|
||||
"arm,arm1176-pmu"
|
||||
"arm,arm1136-pmu"
|
||||
- interrupts : 1 combined interrupt or 1 per core.
|
||||
- cluster : a phandle to the cluster to which it belongs
|
||||
If there is more than one cluster with the same CPU type
|
||||
then there should be separate PMU nodes per cluster.
|
||||
|
||||
Example:
|
||||
|
||||
|
||||
Regular → Executable
@@ -369,6 +369,8 @@ is not associated with a file:
|
||||
[stack:1001] = the stack of the thread with tid 1001
|
||||
[vdso] = the "virtual dynamic shared object",
|
||||
the kernel system call handler
|
||||
[anon:<name>] = an anonymous mapping that has been
|
||||
named by userspace
|
||||
|
||||
or if empty, the mapping is anonymous.
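A reader of /proc/PID/maps can pick out named anonymous mappings by looking at the last column. The Python sketch below shows one way to do that; the function name and the sample line in the usage note are made up for illustration, and it assumes the usual six-column layout (address, perms, offset, dev, inode, pathname).

```python
import re

_ANON_NAME = re.compile(r"^\[anon:(.*)\]$")


def anon_mapping_name(maps_line):
    """Return the userspace-assigned name of an anonymous mapping, or None.

    None is returned for file-backed mappings, for pseudo-paths such as
    [heap] or [stack:1001], and for anonymous mappings without a name.
    """
    fields = maps_line.split(None, 5)
    if len(fields) < 6:
        return None          # empty pathname column: plain anonymous mapping
    match = _ANON_NAME.match(fields[5].strip())
    return match.group(1) if match else None
```

For a line such as "7f0c2a000000-7f0c2a021000 rw-p 00000000 00:00 0    [anon:my_heap]" the function returns "my_heap".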
|
||||
|
||||
@@ -419,6 +421,7 @@ KernelPageSize: 4 kB
|
||||
MMUPageSize: 4 kB
|
||||
Locked: 374 kB
|
||||
VmFlags: rd ex mr mw me de
|
||||
Name: name from userspace
|
||||
|
||||
the first of these lines shows the same information as is displayed for the
|
||||
mapping in /proc/PID/maps. The remaining lines show the size of the mapping
|
||||
@@ -469,6 +472,9 @@ Note that there is no guarantee that every flag and associated mnemonic will
|
||||
be present in all further kernel releases. Things change: existing flags may
vanish and new ones may be added.
|
||||
|
||||
The "Name" field will only be present on a mapping that has been named by
|
||||
userspace, and will show the name passed in by userspace.
|
||||
|
||||
This file is only present if the CONFIG_MMU kernel configuration option is
|
||||
enabled.
|
||||
|
||||
|
||||
Regular → Executable
@@ -3217,6 +3217,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
|
||||
HIGHMEM regardless of setting
|
||||
of CONFIG_HIGHPTE.
|
||||
|
||||
uuid_debug= (Boolean) whether to enable debugging of TuxOnIce's
|
||||
uuid support.
|
||||
|
||||
vdso= [X86,SH]
|
||||
vdso=2: enable compat VDSO (default with COMPAT_VDSO)
|
||||
vdso=1: enable VDSO (default)
|
||||
|
||||
@@ -22,6 +22,15 @@ ip_no_pmtu_disc - BOOLEAN
|
||||
min_pmtu - INTEGER
|
||||
default 552 - minimum discovered Path MTU
|
||||
|
||||
fwmark_reflect - BOOLEAN
|
||||
Controls the fwmark of kernel-generated IPv4 reply packets that are not
|
||||
associated with a socket (for example, TCP RSTs or ICMP echo replies).
|
||||
If unset, these packets have a fwmark of zero. If set, they have the
|
||||
fwmark of the packet they are replying to. Similarly affects the fwmark
|
||||
used by internal routing lookups triggered by incoming packets, such as
|
||||
the ones used for Path MTU Discovery.
|
||||
Default: 0
|
||||
|
||||
route/max_size - INTEGER
|
||||
Maximum number of routes allowed in the kernel. Increase
|
||||
this when using large numbers of interfaces and/or routes.
|
||||
@@ -468,6 +477,16 @@ tcp_fastopen - INTEGER
|
||||
|
||||
See include/net/tcp.h and the code for more details.
|
||||
|
||||
tcp_fwmark_accept - BOOLEAN
|
||||
If set, incoming connections to listening sockets that do not have a
|
||||
socket mark will set the mark of the accepting socket to the fwmark of
|
||||
the incoming SYN packet. This will cause all packets on that connection
|
||||
(starting from the first SYNACK) to be sent with that fwmark. The
|
||||
listening socket's mark is unchanged. Listening sockets that already
|
||||
have a fwmark set via setsockopt(SOL_SOCKET, SO_MARK, ...) are
|
||||
unaffected.
|
||||
Default: 0
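The rule for tcp_fwmark_accept can be summarised as a small decision function. The Python sketch below is a model of the behaviour described above, not kernel code; the function name is hypothetical, and it assumes that "no socket mark" is represented by a mark of zero.

```python
def accepted_socket_mark(listener_mark, syn_fwmark, tcp_fwmark_accept):
    """Return the mark given to a socket created by accepting a connection.

    listener_mark      mark set on the listening socket (0 if none)
    syn_fwmark         fwmark carried by the incoming SYN packet
    tcp_fwmark_accept  value of the sysctl (True if set)
    """
    if listener_mark == 0 and tcp_fwmark_accept:
        return syn_fwmark    # unmarked listener: inherit the SYN's fwmark
    return listener_mark     # marked listener, or sysctl unset: unchanged behaviour
```

In every case the listening socket itself keeps its own mark; only the accepting socket is affected.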
|
||||
|
||||
tcp_syn_retries - INTEGER
|
||||
Number of times initial SYNs for an active TCP connection attempt
|
||||
will be retransmitted. Should not be higher than 255. Default value
|
||||
@@ -1093,6 +1112,15 @@ conf/all/forwarding - BOOLEAN
|
||||
proxy_ndp - BOOLEAN
|
||||
Do proxy ndp.
|
||||
|
||||
fwmark_reflect - BOOLEAN
|
||||
Controls the fwmark of kernel-generated IPv6 reply packets that are not
|
||||
associated with a socket (for example, TCP RSTs or ICMPv6 echo replies).
|
||||
If unset, these packets have a fwmark of zero. If set, they have the
|
||||
fwmark of the packet they are replying to. Similarly affects the fwmark
|
||||
used by internal routing lookups triggered by incoming packets, such as
|
||||
the ones used for Path MTU Discovery.
|
||||
Default: 0
|
||||
|
||||
conf/interface/*:
|
||||
Change special settings per interface.
|
||||
|
||||
|
||||
@@ -0,0 +1,477 @@
|
||||
TuxOnIce 3.0 Internal Documentation.
|
||||
Updated to 26 March 2009
|
||||
|
||||
1. Introduction.
|
||||
|
||||
TuxOnIce 3.0 is an addition to the Linux Kernel, designed to
|
||||
allow the user to quickly shutdown and quickly boot a computer, without
|
||||
needing to close documents or programs. It is equivalent to the
|
||||
hibernate facility in some laptops. This implementation, however,
|
||||
requires no special BIOS or hardware support.
|
||||
|
||||
The code in these files is based upon the original implementation
|
||||
prepared by Gabor Kuti and additional work by Pavel Machek and a
|
||||
host of others. This code has been substantially reworked by Nigel
|
||||
Cunningham, again with the help and testing of many others, not the
|
||||
least of whom is Michael Frank. At its heart, however, the operation is
|
||||
essentially the same as Gabor's version.
|
||||
|
||||
2. Overview of operation.
|
||||
|
||||
The basic sequence of operations is as follows:
|
||||
|
||||
a. Quiesce all other activity.
|
||||
b. Ensure enough memory and storage space are available, and attempt
|
||||
to free memory/storage if necessary.
|
||||
c. Allocate the required memory and storage space.
|
||||
d. Write the image.
|
||||
e. Power down.
|
||||
|
||||
There are a number of complicating factors which mean that things are
|
||||
not as simple as the above would imply, however...
|
||||
|
||||
o The activity of each process must be stopped at a point where it will
|
||||
not be holding locks necessary for saving the image, or unexpectedly
|
||||
restart operations due to something like a timeout and thereby make
|
||||
our image inconsistent.
|
||||
|
||||
o It is desirous that we sync outstanding I/O to disk before calculating
|
||||
image statistics. This reduces corruption if one should suspend but
|
||||
then not resume, and also makes later parts of the operation safer (see
|
||||
below).
|
||||
|
||||
o We need to get as close as we can to an atomic copy of the data.
|
||||
Inconsistencies in the image will result in inconsistent memory contents at
|
||||
resume time, and thus in instability of the system and/or file system
|
||||
corruption. This would appear to imply a maximum image size of one half of
|
||||
the amount of RAM, but we have a solution... (again, below).
|
||||
|
||||
o In 2.6, we choose to play nicely with the other suspend-to-disk
|
||||
implementations.
|
||||
|
||||
3. Detailed description of internals.
|
||||
|
||||
a. Quiescing activity.
|
||||
|
||||
Safely quiescing the system is achieved using three separate but related
|
||||
aspects.
|
||||
|
||||
First, we note that the vast majority of processes don't need to run during
|
||||
suspend. They can be 'frozen'. We therefore implement a refrigerator
|
||||
routine, which processes enter and in which they remain until the cycle is
|
||||
complete. Processes enter the refrigerator via try_to_freeze() invocations
|
||||
at appropriate places. A process cannot be frozen in any old place. It
|
||||
must not be holding locks that will be needed for writing the image or
|
||||
freezing other processes. For this reason, userspace processes generally
|
||||
enter the refrigerator via the signal handling code, and kernel threads at
|
||||
the place in their event loops where they drop locks and yield to other
|
||||
processes or sleep.
|
||||
|
||||
The task of freezing processes is complicated by the fact that there can be
|
||||
interdependencies between processes. Freezing process A before process B may
|
||||
mean that process B cannot be frozen, because it stops while waiting for
|
||||
process A rather than in the refrigerator. This issue is seen where
|
||||
userspace waits on freezeable kernel threads or fuse filesystem threads. To
|
||||
address this issue, we implement the following algorithm for quiescing
|
||||
activity:
|
||||
|
||||
- Freeze filesystems (including fuse - userspace programs starting
|
||||
new requests are immediately frozen; programs already running
|
||||
requests complete their work before being frozen in the next
|
||||
step)
|
||||
- Freeze userspace
|
||||
- Thaw filesystems (this is safe now that userspace is frozen and no
|
||||
fuse requests are outstanding).
|
||||
- Invoke sys_sync (noop on fuse).
|
||||
- Freeze filesystems
|
||||
- Freeze kernel threads
|
||||
|
||||
If we need to free memory, we thaw kernel threads and filesystems, but not
|
||||
userspace. We can then free caches without worrying about deadlocks due to
|
||||
swap files being on frozen filesystems or such like.
|
||||
|
||||
b. Ensure enough memory & storage are available.
|
||||
|
||||
We have a number of constraints to meet in order to be able to successfully
|
||||
suspend and resume.
|
||||
|
||||
First, the image will be written in two parts, described below. One of these
|
||||
parts needs to have an atomic copy made, which of course implies a maximum
|
||||
size of one half of the amount of system memory. The other part ('pageset2')
|
||||
is not atomically copied, and can therefore be as large or small as desired.
|
||||
|
||||
Second, we have constraints on the amount of storage available. In these
|
||||
calculations, we may also consider any compression that will be done. The
|
||||
cryptoapi module allows the user to configure an expected compression ratio.
|
||||
|
||||
Third, the user can specify an arbitrary limit on the image size, in
|
||||
megabytes. This limit is treated as a soft limit, so that we don't fail the
|
||||
attempt to suspend if we cannot meet this constraint.
|
||||
|
||||
c. Allocate the required memory and storage space.
|
||||
|
||||
Having done the initial freeze, we determine whether the above constraints
|
||||
are met, and seek to allocate the metadata for the image. If the constraints
|
||||
are not met, or we fail to allocate the required space for the metadata, we
|
||||
seek to free the amount of memory that we calculate is needed and try again.
|
||||
We allow up to four iterations of this loop before aborting the cycle. If we
|
||||
do fail, it should only be because of a bug in TuxOnIce's calculations.
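The retry loop described above can be sketched in Python as follows. This is a schematic model, not the code in prepare_image.c: the three callables are hypothetical stand-ins for the constraint check, the metadata allocation and the memory-freeing step, and whether memory is freed again after the final failed attempt is an assumption of the sketch (here it is not).

```python
MAX_ATTEMPTS = 4    # "up to four iterations of this loop"


def prepare_image(constraints_met, allocate_metadata, free_memory,
                  max_attempts=MAX_ATTEMPTS):
    """Return the attempt number on which preparation succeeded, or None
    if the cycle should be aborted.

    constraints_met()    -> True when the memory/storage constraints hold
    allocate_metadata()  -> True when the image metadata could be allocated
    free_memory()        -> tries to free the amount calculated as needed
    """
    for attempt in range(1, max_attempts + 1):
        if constraints_met() and allocate_metadata():
            return attempt
        if attempt < max_attempts:
            free_memory()
    return None
```

Because the amount needed changes only slightly between passes, a run would normally return 1 or 2.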
|
||||
|
||||
These steps are merged together in the prepare_image function, found in
|
||||
prepare_image.c. The functions are merged because of the cyclical nature
|
||||
of the problem of calculating how much memory and storage is needed. Since
|
||||
the data structures containing the information about the image must
|
||||
themselves take memory and use storage, the amount of memory and storage
|
||||
required changes as we prepare the image. Since the changes are not large,
|
||||
only one or two iterations will be required to achieve a solution.
|
||||
|
||||
The recursive nature of the algorithm is minimised by keeping user space
|
||||
frozen while preparing the image, and by the fact that our records of which
|
||||
pages are to be saved and which pageset they are saved in use bitmaps (so
|
||||
that changes in number or fragmentation of the pages to be saved don't
|
||||
feed back via changes in the amount of memory needed for metadata). The
|
||||
recursiveness is thus limited to any extra slab pages allocated to store the
|
||||
extents that record storage used, and the effects of seeking to free memory.
|
||||
|
||||
d. Write the image.
|
||||
|
||||
We previously mentioned the need to create an atomic copy of the data, and
|
||||
the half-of-memory limitation that is implied in this. This limitation is
|
||||
circumvented by dividing the memory to be saved into two parts, called
|
||||
pagesets.
|
||||
|
||||
Pageset2 contains most of the page cache - the pages on the active and
|
||||
inactive LRU lists that aren't needed or modified while TuxOnIce is
|
||||
running, so they can be safely written without an atomic copy. They are
|
||||
therefore saved first and reloaded last. While saving these pages,
|
||||
TuxOnIce carefully ensures that the work of writing the pages doesn't make
|
||||
the image inconsistent. With the support for Kernel (Video) Mode Setting
|
||||
going into the kernel at the time of writing, we need to check for pages
|
||||
on the LRU that are used by KMS, and exclude them from pageset2. They are
|
||||
atomically copied as part of pageset1.
|
||||
|
||||
Once pageset2 has been saved, we prepare to do the atomic copy of remaining
|
||||
memory. As part of the preparation, we power down drivers, thereby providing
|
||||
them with the opportunity to have their state recorded in the image. The
|
||||
amount of memory allocated by drivers for this is usually negligible, but if
|
||||
DRI is in use, video drivers may require significant amounts. Ideally we
|
||||
would be able to query drivers while preparing the image as to the amount of
|
||||
memory they will need. Unfortunately no such mechanism exists at the time of
|
||||
writing. For this reason, TuxOnIce allows the user to set an
|
||||
'extra_pages_allowance', which is used to seek to ensure sufficient memory
|
||||
is available for drivers at this point. TuxOnIce also lets the user set this
|
||||
value to 0. In this case, a test driver suspend is done while preparing the
|
||||
image, and the difference (plus a margin) used instead. TuxOnIce will also
|
||||
automatically restart the hibernation process (twice at most) if it finds
|
||||
that the extra pages allowance is not sufficient. It will then use what was
|
||||
actually needed (plus a margin, again). Failure to hibernate should thus
|
||||
be an extremely rare occurrence.
|
||||
|
||||
Having suspended the drivers, we save the CPU context before making an
|
||||
atomic copy of pageset1, resuming the drivers and saving the atomic copy.
|
||||
After saving the two pagesets, we just need to save our metadata before
|
||||
powering down.
|
||||
|
||||
As we mentioned earlier, the contents of pageset2 pages aren't needed once
|
||||
they've been saved. We therefore use them as the destination of our atomic
|
||||
copy. In the unlikely event that pageset1 is larger, extra pages are
|
||||
allocated while the image is being prepared. This is normally only a real
|
||||
possibility when the system has just been booted and the page cache is
|
||||
small.
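In terms of page counts, reusing pageset2 as the destination of the atomic copy works out as in the Python sketch below. The function name is hypothetical and the calculation ignores any other allowances; it only restates the rule from the paragraph above.

```python
def extra_pages_for_atomic_copy(pageset1_pages, pageset2_pages):
    """Number of additional pages to allocate so that every pageset1 page
    has somewhere to be copied to.

    Pageset2 pages are reused as the copy destination once they have been
    written out, so extra pages are only needed when pageset1 is larger.
    """
    return max(0, pageset1_pages - pageset2_pages)
```

On a freshly booted system with a small page cache, for example 5000 pageset1 pages and 4000 pageset2 pages, 1000 extra pages would be needed; in the more usual case where pageset2 is the larger set, none are.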
|
||||
|
||||
This is where we need to be careful about syncing, however. Pageset2 will
|
||||
probably contain filesystem metadata. If this is overwritten with pageset1
|
||||
and then a sync occurs, the filesystem will be corrupted - at least until
|
||||
resume time and another sync of the restored data. Since there is a
|
||||
possibility that the user might not resume or (may it never be!) that
|
||||
TuxOnIce might oops, we do our utmost to avoid syncing filesystems after
|
||||
copying pageset1.
|
||||
|
||||
e. Power down.
|
||||
|
||||
Powering down uses standard kernel routines. TuxOnIce supports powering down
|
||||
using the ACPI S3, S4 and S5 methods or the kernel's non-ACPI power-off.
|
||||
Supporting suspend to ram (S3) as a power off option might sound strange,
|
||||
but it allows the user to quickly get their system up and running again if
|
||||
the battery doesn't run out (we just need to re-read the overwritten pages)
|
||||
and if the battery does run out (or the user removes power), they can still
|
||||
resume.
|
||||
|
||||
4. Data Structures.
|
||||
|
||||
TuxOnIce uses three main structures to store its metadata and configuration
|
||||
information:
|
||||
|
||||
a) Pageflags bitmaps.
|
||||
|
||||
TuxOnIce records which pages will be in pageset1, pageset2, the destination
|
||||
of the atomic copy and the source of the atomically restored image using
|
||||
bitmaps. The code used is that written for swsusp, with small improvements
|
||||
to match TuxOnIce's requirements.
|
||||
|
||||
The pageset1 bitmap is thus easily stored in the image header for use at
|
||||
resume time.
|
||||
|
||||
As mentioned above, using bitmaps also means that the amount of memory and
|
||||
storage required for recording the above information is constant. This
|
||||
greatly simplifies the work of preparing the image. In earlier versions of
|
||||
TuxOnIce, extents were used to record which pages would be stored. In that
|
||||
case, however, eating memory could result in greater fragmentation of the
|
||||
lists of pages, which in turn required more memory to store the extents and
|
||||
more storage in the image header. These could in turn require further
|
||||
freeing of memory, and another iteration. All of this complexity is removed
|
||||
by having bitmaps.
|
||||
|
||||
Bitmaps also make a lot of sense because TuxOnIce only ever iterates
|
||||
through the lists. There is therefore no cost to not being able to find the
|
||||
nth page in constant time. We only need to worry about the cost of finding
|
||||
the n+1th page, given the location of the nth page. Bitwise optimisations
|
||||
help here.
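
As a rough illustration of the iteration described above, here is a minimal
userspace sketch. It is not the swsusp bitmap code; next_set_bit and
count_set_bits are hypothetical names, and the bitmap is assumed to be a
plain array of unsigned longs. The point is only that stepping from the nth
page to the n+1th is cheap, because a word with no bits set can be skipped
in one step.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/*
 * Return the index of the first set bit at or after 'start', or 'nbits'
 * if there is none. 'map' must hold at least 'nbits' bits.
 */
static size_t next_set_bit(const unsigned long *map, size_t nbits, size_t start)
{
    size_t i = start;

    assert(map != NULL);
    while (i < nbits) {
        unsigned long word = map[i / BITS_PER_WORD];

        /* Skip a whole word at once when none of its bits are set. */
        if (word == 0 && i % BITS_PER_WORD == 0) {
            i += BITS_PER_WORD;
            continue;
        }
        if (word & (1UL << (i % BITS_PER_WORD)))
            return i;
        i++;
    }
    return nbits;
}

/* Walk the bitmap the way the text describes: always "next", never "nth". */
static size_t count_set_bits(const unsigned long *map, size_t nbits)
{
    size_t i, count = 0;

    for (i = next_set_bit(map, nbits, 0); i < nbits;
         i = next_set_bit(map, nbits, i + 1))
        count++;
    return count;
}
```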
|
||||
|
||||
b) Extents for block data.
|
||||
|
||||
TuxOnIce supports writing the image to multiple block devices. In the case
|
||||
of swap, multiple partitions and/or files may be in use, and we happily use
|
||||
them all (with the exception of compcache pages, which we allocate but do
|
||||
not use). This use of multiple block devices is accomplished as follows:
|
||||
|
||||
Whatever the actual source of the allocated storage, the destination of the
|
||||
image can be viewed in terms of one or more block devices, and on each
|
||||
device, a list of sectors. To simplify matters, we only use contiguous,
|
||||
PAGE_SIZE aligned sectors, like the swap code does.
|
||||
|
||||
Since sector numbers on each bdev may well not start at 0, it makes much
|
||||
more sense to use extents here. Contiguous ranges of pages can thus be
|
||||
represented in the extents by contiguous values.
|
||||
|
||||
Variations in block size are taken into account when transforming this data
|
||||
into the parameters for bio submission.
|
||||
|
||||
We can thus implement a layer of abstraction wherein the core of TuxOnIce
|
||||
doesn't have to worry about which device we're currently writing to or
|
||||
where in the device we are. It simply requests that the next page in the
|
||||
pageset or header be written, leaving the details to this lower layer.
|
||||
The lower layer remembers where in the sequence of devices and blocks each
|
||||
pageset starts. The header always starts at the beginning of the allocated
|
||||
storage.
|
||||
|
||||
So extents are:
|
||||
|
||||
struct extent {
|
||||
unsigned long minimum, maximum;
|
||||
struct extent *next;
|
||||
};
|
||||
|
||||
These are combined into chains of extents for a device:
|
||||
|
||||
struct extent_chain {
|
||||
int size; /* size of the extent ie sum (max-min+1) */
|
||||
int allocs, frees;
|
||||
char *name;
|
||||
struct extent *first, *last_touched;
|
||||
};
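
To show how contiguous values collapse into a single extent, here is a
minimal userspace sketch of building a chain. chain_append and
chain_num_extents are hypothetical helpers, malloc stands in for the kernel
allocator, and we assume for simplicity that values arrive in increasing
order and that last_touched always points at the tail of the chain.

```c
#include <assert.h>
#include <stdlib.h>

struct extent {
    unsigned long minimum, maximum;
    struct extent *next;
};

struct extent_chain {
    int size;               /* sum of (maximum - minimum + 1) */
    int allocs, frees;
    char *name;
    struct extent *first, *last_touched;
};

/* Add one value to the chain; returns 0 on success, -1 if out of memory. */
static int chain_append(struct extent_chain *chain, unsigned long value)
{
    struct extent *last, *ext;

    assert(chain != NULL);
    last = chain->last_touched;

    /* Contiguous with the tail extent: just grow it. */
    if (last && value == last->maximum + 1) {
        last->maximum = value;
        chain->size++;
        return 0;
    }

    /* Otherwise start a new extent holding this single value. */
    ext = malloc(sizeof(*ext));
    if (!ext)
        return -1;
    ext->minimum = ext->maximum = value;
    ext->next = NULL;
    if (last)
        last->next = ext;
    else
        chain->first = ext;
    chain->last_touched = ext;
    chain->allocs++;
    chain->size++;
    return 0;
}

/* Count the extents in a chain. */
static int chain_num_extents(const struct extent_chain *chain)
{
    const struct extent *ext;
    int n = 0;

    for (ext = chain->first; ext; ext = ext->next)
        n++;
    return n;
}
```

Appending 10, 11, 12, 40 and 41 in that order gives a chain of size 5 that
needs only two extents, 10-12 and 40-41.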
|
||||
|
||||
For each bdev, we need to store a little more info:
|
||||
|
||||
struct suspend_bdev_info {
|
||||
struct block_device *bdev;
|
||||
dev_t dev_t;
|
||||
int bmap_shift;
|
||||
int blocks_per_page;
|
||||
};
|
||||
|
||||
The dev_t is used to identify the device in the stored image. As a result,
|
||||
we expect devices at resume time to have the same major and minor numbers
|
||||
as they had while suspending. This is primarily a concern where the user
|
||||
utilises LVM for storage, as they will need to dmsetup their partitions in
|
||||
such a way as to maintain this consistency at resume time.
|
||||
|
||||
bmap_shift and blocks_per_page apply the effects of variations in blocks
|
||||
per page settings for the filesystem and underlying bdev. For most
|
||||
filesystems, these are the same, but for xfs, they can have independent
|
||||
values.
|
||||
|
||||
Combining these two structures together, we have everything we need to
|
||||
record what devices and what blocks on each device are being used to
|
||||
store the image, and to submit I/O using submit_bio.
|
||||
|
||||
The last elements in the picture are a means of recording how the storage
|
||||
is being used.
|
||||
|
||||
We do this first and foremost by implementing a layer of abstraction on
|
||||
top of the devices and extent chains which allows us to view however many
|
||||
devices there might be as one long storage tape, with a single 'head' that
|
||||
tracks a 'current position' on the tape:
|
||||
|
||||
struct extent_iterate_state {
|
||||
struct extent_chain *chains;
|
||||
int num_chains;
|
||||
int current_chain;
|
||||
struct extent *current_extent;
|
||||
unsigned long current_offset;
|
||||
};
|
||||
|
||||
That is, *chains points to an array of size num_chains of extent chains.
|
||||
For the filewriter, this is always a single chain. For the swapwriter, the
|
||||
array is of size MAX_SWAPFILES.
|
||||
|
||||
current_chain, current_extent and current_offset thus point to the current
|
||||
index in the chains array (and into a matching array of struct
|
||||
suspend_bdev_info), the current extent in that chain (to optimise access),
|
||||
and the current value in the offset.
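
The 'tape head' idea can be sketched as follows. This is a simplified
userspace illustration rather than the TuxOnIce iterator itself:
seek_chain, iterate_start and iterate_next are hypothetical names, and we
assume that an empty chain (a device with no storage allocated) is simply
skipped.

```c
#include <assert.h>
#include <stddef.h>

struct extent {
    unsigned long minimum, maximum;
    struct extent *next;
};

struct extent_chain {
    int size;
    int allocs, frees;
    char *name;
    struct extent *first, *last_touched;
};

struct extent_iterate_state {
    struct extent_chain *chains;
    int num_chains;
    int current_chain;
    struct extent *current_extent;
    unsigned long current_offset;
};

/* Put the head on the first value of the first non-empty chain >= 'chain'. */
static int seek_chain(struct extent_iterate_state *state, int chain)
{
    for (; chain < state->num_chains; chain++) {
        struct extent *first = state->chains[chain].first;

        if (first) {
            state->current_chain = chain;
            state->current_extent = first;
            state->current_offset = first->minimum;
            return 0;
        }
    }
    state->current_extent = NULL;   /* ran off the end of the tape */
    return -1;
}

/* Rewind the head to the very start of the storage. */
static int iterate_start(struct extent_iterate_state *state)
{
    return seek_chain(state, 0);
}

/* Advance the head by one value; returns -1 once storage is exhausted. */
static int iterate_next(struct extent_iterate_state *state)
{
    struct extent *cur = state->current_extent;

    assert(cur != NULL);
    if (state->current_offset < cur->maximum) {
        state->current_offset++;            /* same extent */
        return 0;
    }
    if (cur->next) {
        state->current_extent = cur->next;  /* next extent, same device */
        state->current_offset = cur->next->minimum;
        return 0;
    }
    return seek_chain(state, state->current_chain + 1);  /* next device */
}
```

The caller never sees where one extent or device ends and the next begins;
it only asks for the next position.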
|
||||
|
||||
The image is divided into three parts:
|
||||
- The header
|
||||
- Pageset 1
|
||||
- Pageset 2
|
||||
|
||||
The header always starts at the first device and first block. We know its
|
||||
size before we begin to save the image because we carefully account for
|
||||
everything that will be stored in it.
|
||||
|
||||
The second pageset (LRU) is stored first. It begins on the next page after
|
||||
the end of the header.
|
||||
|
||||
The first pageset is stored second. Its start location is only known once
|
||||
pageset2 has been saved, since pageset2 may be compressed as it is written.
|
||||
This location is thus recorded at the end of saving pageset2. It is page
|
||||
aligned also.
|
||||
|
||||
Since this information is needed at resume time, and the location of extents
|
||||
in memory will differ at resume time, this needs to be stored in a portable
|
||||
way:
|
||||
|
||||
struct extent_iterate_saved_state {
|
||||
int chain_num;
|
||||
int extent_num;
|
||||
unsigned long offset;
|
||||
};
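
A sketch of how a pointer-based position might be converted to this
portable form and back is given below. save_state and restore_state are
hypothetical names, and we assume extent_num is a zero-based index into the
chain; the real routines may count differently.

```c
#include <assert.h>
#include <stddef.h>

struct extent {
    unsigned long minimum, maximum;
    struct extent *next;
};

struct extent_chain {
    int size;
    int allocs, frees;
    char *name;
    struct extent *first, *last_touched;
};

struct extent_iterate_state {
    struct extent_chain *chains;
    int num_chains;
    int current_chain;
    struct extent *current_extent;
    unsigned long current_offset;
};

struct extent_iterate_saved_state {
    int chain_num;
    int extent_num;
    unsigned long offset;
};

/* Replace the extent pointer with its index within the current chain. */
static void save_state(const struct extent_iterate_state *state,
                       struct extent_iterate_saved_state *saved)
{
    const struct extent *ext;

    assert(state->current_extent != NULL);
    saved->chain_num = state->current_chain;
    saved->extent_num = 0;
    saved->offset = state->current_offset;
    for (ext = state->chains[state->current_chain].first;
         ext && ext != state->current_extent; ext = ext->next)
        saved->extent_num++;
}

/* Walk the (rebuilt) chain to turn the index back into a pointer. */
static int restore_state(struct extent_iterate_state *state,
                         const struct extent_iterate_saved_state *saved)
{
    struct extent *ext;
    int i;

    if (saved->chain_num < 0 || saved->chain_num >= state->num_chains)
        return -1;
    ext = state->chains[saved->chain_num].first;
    for (i = 0; ext && i < saved->extent_num; i++)
        ext = ext->next;
    if (!ext || saved->offset < ext->minimum || saved->offset > ext->maximum)
        return -1;
    state->current_chain = saved->chain_num;
    state->current_extent = ext;
    state->current_offset = saved->offset;
    return 0;
}
```

Because only indices and an offset are stored, the saved position remains
valid even though the extents are reallocated at different addresses when
resuming.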
|
||||
|
||||
We can thus implement a layer of abstraction wherein the core of TuxOnIce
|
||||
doesn't have to worry about which device we're currently writing to or
|
||||
where in the device we are. It simply requests that the next page in the
|
||||
pageset or header be written, leaving the details to this layer, and
|
||||
invokes the routines to remember and restore the position, without having
|
||||
to worry about the details of how the data is arranged on disk or such like.
|
||||
|
||||
c) Modules
|
||||
|
||||
One aim in designing TuxOnIce was to make it flexible. We wanted to allow
|
||||
for the implementation of different methods of transforming a page to be
|
||||
written to disk and different methods of getting the pages stored.
|
||||
|
||||
In early versions (the betas and perhaps Suspend1), compression support was
|
||||
inlined in the image writing code, and the data structures and code for
|
||||
managing swap were intertwined with the rest of the code. A number of people
|
||||
had expressed interest in implementing image encryption, and alternative
|
||||
methods of storing the image.
|
||||
|
||||
In order to achieve this, TuxOnIce was given a modular design.
|
||||
|
||||
A module is a single file which encapsulates the functionality needed
|
||||
to transform a pageset of data (encryption or compression, for example),
|
||||
or to write the pageset to a device. The former type of module is called
|
||||
a 'page-transformer', the latter a 'writer'.
|
||||
|
||||
Modules are linked together in pipeline fashion. There may be zero or more
|
||||
page transformers in a pipeline, and there is always exactly one writer.
|
||||
The pipeline follows this pattern:
|
||||
|
||||
---------------------------------
|
||||
| TuxOnIce Core |
|
||||
---------------------------------
|
||||
|
|
||||
|
|
||||
---------------------------------
|
||||
| Page transformer 1 |
|
||||
---------------------------------
|
||||
|
|
||||
|
|
||||
---------------------------------
|
||||
| Page transformer 2 |
|
||||
---------------------------------
|
||||
|
|
||||
|
|
||||
---------------------------------
|
||||
| Writer |
|
||||
---------------------------------
|
||||
|
||||
During the writing of an image, the core code feeds pages one at a time
|
||||
to the first module. This module performs whatever transformations it
|
||||
implements on the incoming data, completely consuming the incoming data and
|
||||
feeding output in a similar manner to the next module.
|
||||
|
||||
All routines are SMP safe, and the final result of the transformations is
|
||||
written with an index (provided by the core) and size of the output by the
|
||||
writer. As a result, we can have multithreaded I/O without needing to
|
||||
worry about the sequence in which pages are written (or read).
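
To see why storing an index with each page removes any dependence on
ordering, consider this small userspace sketch. struct stored_chunk and
place_chunk are hypothetical, CHUNK_SIZE stands in for PAGE_SIZE, and the
data is left untransformed to keep the example short.

```c
#include <assert.h>
#include <string.h>

#define CHUNK_SIZE 8    /* stand-in for PAGE_SIZE */

struct stored_chunk {
    unsigned long index;            /* position of the page in the pageset */
    unsigned int size;              /* bytes of data stored for this page */
    unsigned char data[CHUNK_SIZE];
};

/* Put one chunk read from storage back where its index says it belongs. */
static int place_chunk(unsigned char *pageset, unsigned long num_pages,
                       const struct stored_chunk *chunk)
{
    assert(pageset != NULL && chunk != NULL);
    if (chunk->index >= num_pages || chunk->size > CHUNK_SIZE)
        return -1;
    memcpy(pageset + chunk->index * CHUNK_SIZE, chunk->data, chunk->size);
    return 0;
}
```

Chunks with indices 2, 0 and 1 can be placed in that order, or any other,
and the reassembled pageset is the same.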
|
||||
|
||||
During reading, the pipeline works in the reverse direction. The core code
|
||||
calls the first module with the address of a buffer which should be filled.
|
||||
(Note that the buffer size is always PAGE_SIZE at this time). This module
|
||||
will in turn request data from the next module and so on down until the
|
||||
writer is made to read from the stored image.
|
||||
|
||||
Part of the definition of the structure of a module thus looks like this:
|
||||
|
||||
int (*rw_init) (int rw, int stream_number);
|
||||
int (*rw_cleanup) (int rw);
|
||||
int (*write_chunk) (struct page *buffer_page);
|
||||
int (*read_chunk) (struct page *buffer_page, int sync);
|
||||
|
||||
It should be noted that the _cleanup routine may be called before the
|
||||
full stream of data has been read or written. While writing the image,
|
||||
the user may (depending upon settings) choose to abort suspending, and
|
||||
if we are in the midst of writing the last portion of the image, a portion
|
||||
of the second pageset may be reread. This may also happen if an error
|
||||
occurs and we seek to abort the process of writing the image.
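
The way modules chain together can be sketched in userspace as follows.
This is an illustration of the pipeline idea only: struct sketch_module,
the XOR 'transformer' and the in-memory 'writer' are all hypothetical, the
function signatures are simplified (a plain buffer instead of a struct
page), and CHUNK_SIZE stands in for PAGE_SIZE.

```c
#include <assert.h>
#include <string.h>

#define CHUNK_SIZE 16   /* stand-in for PAGE_SIZE */
#define MAX_CHUNKS 8

struct sketch_module {
    const char *name;
    struct sketch_module *next;     /* next module down, NULL for the writer */
    int (*write_chunk)(struct sketch_module *self, const unsigned char *buf);
    int (*read_chunk)(struct sketch_module *self, unsigned char *buf);
};

/* The 'writer': keeps chunks in memory in place of a block device. */
static unsigned char storage[MAX_CHUNKS][CHUNK_SIZE];
static int chunks_written, chunks_read;

static int writer_write(struct sketch_module *self, const unsigned char *buf)
{
    (void)self;
    if (chunks_written == MAX_CHUNKS)
        return -1;
    memcpy(storage[chunks_written++], buf, CHUNK_SIZE);
    return 0;
}

static int writer_read(struct sketch_module *self, unsigned char *buf)
{
    (void)self;
    if (chunks_read == chunks_written)
        return -1;
    memcpy(buf, storage[chunks_read++], CHUNK_SIZE);
    return 0;
}

/* A 'page transformer': XOR stands in for compression or encryption. */
static int xor_write(struct sketch_module *self, const unsigned char *buf)
{
    unsigned char out[CHUNK_SIZE];
    int i;

    assert(self->next != NULL);
    for (i = 0; i < CHUNK_SIZE; i++)
        out[i] = buf[i] ^ 0x5a;
    /* Consume the input completely, then feed the next module. */
    return self->next->write_chunk(self->next, out);
}

static int xor_read(struct sketch_module *self, unsigned char *buf)
{
    int i, ret;

    assert(self->next != NULL);
    /* Reading runs in reverse: ask the next module to fill the buffer. */
    ret = self->next->read_chunk(self->next, buf);
    if (ret)
        return ret;
    for (i = 0; i < CHUNK_SIZE; i++)
        buf[i] ^= 0x5a;
    return 0;
}

static struct sketch_module writer = {
    "writer", NULL, writer_write, writer_read
};
static struct sketch_module xor_transformer = {
    "xor", &writer, xor_write, xor_read
};
```

The core only ever talks to the first module (xor_transformer here); each
module knows nothing beyond the next one in the chain.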
|
||||
|
||||
The modular design is also useful in a number of other ways. It provides
|
||||
a means whereby we can add support for:
|
||||
|
||||
- providing overall initialisation and cleanup routines;
|
||||
- serialising configuration information in the image header;
|
||||
- providing debugging information to the user;
|
||||
- determining memory and image storage requirements;
|
||||
- dis/enabling components at run-time;
|
||||
- configuring the module (see below);
|
||||
|
||||
...and routines for writers specific to their work:
|
||||
- Parsing a resume= location;
|
||||
- Determining whether an image exists;
|
||||
- Marking a resume as having been attempted;
|
||||
- Invalidating an image;
|
||||
|
||||
Since some parts of the core - the user interface and storage manager
|
||||
support - have use for some of these functions, they are registered as
|
||||
'miscellaneous' modules as well.
|
||||
|
||||
d) Sysfs data structures.
|
||||
|
||||
This brings us naturally to support for configuring TuxOnIce. We desired to
|
||||
provide a way to make TuxOnIce as flexible and configurable as possible.
|
||||
The user shouldn't have to reboot just because they want to now hibernate to
|
||||
a file instead of a partition, for example.
|
||||
|
||||
To accomplish this, TuxOnIce implements a very generic means whereby the
|
||||
core and modules can register new sysfs entries. All TuxOnIce entries use
|
||||
a single _store and _show routine, both of which are found in
|
||||
tuxonice_sysfs.c in the kernel/power directory. These routines handle the
|
||||
most common operations - getting and setting the values of bits, integers,
|
||||
longs, unsigned longs and strings in one place, and allow overrides for
|
||||
customised get and set options as well as side-effect routines for all
|
||||
reads and writes.
|
||||
|
||||
When combined with some simple macros, a new sysfs entry can then be defined
|
||||
in just a couple of lines:
|
||||
|
||||
SYSFS_INT("progress_granularity", SYSFS_RW, &progress_granularity, 1,
|
||||
2048, 0, NULL),
|
||||
|
||||
This defines a sysfs entry named "progress_granularity" which is rw and
|
||||
allows the user to access an integer stored at &progress_granularity, giving
|
||||
it a value between 1 and 2048 inclusive.
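
To illustrate what a shared _store routine has to do for an integer entry
such as this one, here is a minimal userspace sketch. struct int_entry and
int_entry_store are hypothetical names, not the contents of
tuxonice_sysfs.c, and we assume that out-of-range values are rejected; the
real routine may handle them differently (for example by clamping).

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct int_entry {
    const char *name;
    int *variable;          /* where the value lives */
    int minimum, maximum;   /* inclusive bounds */
};

/* Parse 'text' and store it if it lies within the entry's bounds. */
static int int_entry_store(const struct int_entry *entry, const char *text)
{
    char *end;
    long value;

    assert(entry != NULL && entry->variable != NULL);
    errno = 0;
    value = strtol(text, &end, 0);
    if (errno || end == text)
        return -EINVAL;     /* not a number at all */
    if (value < entry->minimum || value > entry->maximum)
        return -EINVAL;     /* outside the permitted range */
    *entry->variable = (int)value;
    return 0;
}
```

With bounds of 1 and 2048, writing "64" succeeds, while "0" and "4096"
leave the variable unchanged.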
|
||||
|
||||
Sysfs entries are registered under /sys/power/tuxonice, and entries for
|
||||
modules are located in a subdirectory named after the module.
|
||||
|
||||
@@ -0,0 +1,948 @@
|
||||
--- TuxOnIce, version 3.0 ---
|
||||
|
||||
1. What is it?
|
||||
2. Why would you want it?
|
||||
3. What do you need to use it?
|
||||
4. Why not just use the version already in the kernel?
|
||||
5. How do you use it?
|
||||
6. What do all those entries in /sys/power/tuxonice do?
|
||||
7. How do you get support?
|
||||
8. I think I've found a bug. What should I do?
|
||||
9. When will XXX be supported?
|
||||
10. How does it work?
|
||||
11. Who wrote TuxOnIce?
|
||||
|
||||
1. What is it?
|
||||
|
||||
Imagine you're sitting at your computer, working away. For some reason, you
|
||||
need to turn off your computer for a while - perhaps it's time to go home
|
||||
for the day. When you come back to your computer next, you're going to want
|
||||
to carry on where you left off. Now imagine that you could push a button and
|
||||
have your computer store the contents of its memory to disk and power down.
|
||||
Then, when you next start up your computer, it loads that image back into
|
||||
memory and you can carry on from where you were, just as if you'd never
|
||||
turned the computer off. You have far less time to start up, no reopening of
|
||||
applications or finding what directory you put that file in yesterday.
|
||||
That's what TuxOnIce does.
|
||||
|
||||
TuxOnIce has a long heritage. It began life as work by Gabor Kuti, who,
|
||||
with some help from Pavel Machek, got an early version going in 1999. The
|
||||
project was then taken over by Florent Chabaud while still in alpha version
|
||||
numbers. Nigel Cunningham came on the scene when Florent was unable to
|
||||
continue, moving the project into betas, then 1.0, 2.0 and so on up to
|
||||
the present series. During the 2.0 series, the name was contracted to
|
||||
Suspend2 and the website suspend2.net created. Beginning around July 2007,
|
||||
a transition to calling the software TuxOnIce was made, to seek to help
|
||||
make it clear that TuxOnIce is more concerned with hibernation than suspend
|
||||
to ram.
|
||||
|
||||
Pavel Machek's swsusp code, which was merged around 2.5.17, retains the
|
||||
original name, and was essentially a fork of the beta code until Rafael
|
||||
Wysocki came on the scene in 2005 and began to improve it further.
|
||||
|
||||
2. Why would you want it?
|
||||
|
||||
Why wouldn't you want it?
|
||||
|
||||
Being able to save the state of your system and quickly restore it improves
|
||||
your productivity - you get a useful system in far less time than through
|
||||
the normal boot process. You also get to be completely 'green', using zero
|
||||
power, or as close to that as possible (the computer may still provide
|
||||
minimal power to some devices, so they can initiate a power on, but that
|
||||
will be the same amount of power as would be used if you told the computer
|
||||
to shut down).
|
||||
|
||||
3. What do you need to use it?
|
||||
|
||||
a. Kernel Support.
|
||||
|
||||
i) The TuxOnIce patch.
|
||||
|
||||
TuxOnIce is part of the Linux Kernel. This version is not part of Linus's
|
||||
2.6 tree at the moment, so you will need to download the kernel source and
|
||||
apply the latest patch. Having done that, enable the appropriate options in
|
||||
make [menu|x]config (under Power Management Options - look for "Enhanced
|
||||
Hibernation"), compile and install your kernel. TuxOnIce works with SMP,
|
||||
Highmem, preemption, fuse filesystems, x86-32, PPC and x86_64.
|
||||
|
||||
TuxOnIce patches are available from http://tuxonice.net.
|
||||
|
||||
ii) Compression support.
|
||||
|
||||
Compression support is implemented via the cryptoapi. You will therefore want
|
||||
to select any Cryptoapi transforms that you want to use on your image from
|
||||
the Cryptoapi menu while configuring your kernel. We recommend the use of the
|
||||
LZO compression method - it is very fast and still achieves good compression.
|
||||
|
||||
You can also tell TuxOnIce to write its image to an encrypted and/or
|
||||
compressed filesystem/swap partition. In that case, you don't need to do
|
||||
anything special for TuxOnIce when it comes to kernel configuration.
|
||||
|
||||
iii) Configuring other options.
|
||||
|
||||
While you're configuring your kernel, try to configure as much as possible
|
||||
to build as modules. We recommend this because there are a number of drivers
|
||||
that are still in the process of implementing proper power management
|
||||
support. In those cases, the best way to work around their current lack is
|
||||
to build them as modules and remove the modules while hibernating. You might
|
||||
also bug the driver authors to get their support up to speed, or even help!
|
||||
|
||||
b. Storage.
|
||||
|
||||
i) Swap.
|
||||
|
||||
TuxOnIce can store the hibernation image in your swap partition, a swap file or
|
||||
a combination thereof. Whichever combination you choose, you will probably
|
||||
want to create enough swap space to store the largest image you could have,
|
||||
plus the space you'd normally use for swap. A good rule of thumb would be
|
||||
to calculate the amount of swap you'd want without using TuxOnIce, and then
|
||||
add the amount of memory you have. This swapspace can be arranged in any way
|
||||
you'd like. It can be in one partition or file, or spread over a number. The
|
||||
only requirement is that they be active when you start a hibernation cycle.
|
||||
|
||||
There is one exception to this requirement. TuxOnIce has the ability to turn
|
||||
on one swap file or partition at the start of hibernating and turn it back off
|
||||
at the end. If you want to ensure you have enough memory to store an image
|
||||
when your memory is fully used, you might want to make one swap partition or
|
||||
file for 'normal' use, and another for TuxOnIce to activate & deactivate
|
||||
automatically. (Further details below).
|
||||
|
||||
ii) Normal files.
|
||||
|
||||
TuxOnIce includes a 'file allocator'. The file allocator can store your
|
||||
image in a simple file. Since Linux has the concept of everything being a
|
||||
file, this is more powerful than it initially sounds. If, for example, you
|
||||
were to set up a network block device file, you could hibernate to a network
|
||||
server. This has been tested and works to a point, but nbd itself isn't
|
||||
stateless enough for our purposes.
|
||||
|
||||
Take extra care when setting up the file allocator. If you just type
|
||||
commands without thinking and then try to hibernate, you could cause
|
||||
irreversible corruption on your filesystems! Make sure you have backups.
|
||||
|
||||
Most people will only want to hibernate to a local file. To achieve that, do
|
||||
something along the lines of:
|
||||
|
||||
echo "TuxOnIce" > /hibernation-file
|
||||
dd if=/dev/zero bs=1M count=512 >> /hibernation-file
|
||||
|
||||
This will create a 512MB file called /hibernation-file. To get TuxOnIce to use
|
||||
it:
|
||||
|
||||
echo /hibernation-file > /sys/power/tuxonice/file/target
|
||||
|
||||
Then
|
||||
|
||||
cat /sys/power/tuxonice/resume
|
||||
|
||||
Put the results of this into your bootloader's configuration (see also step
|
||||
C, below):
|
||||
|
||||
---EXAMPLE-ONLY-DON'T-COPY-AND-PASTE---
|
||||
# cat /sys/power/tuxonice/resume
|
||||
file:/dev/hda2:0x1e001
|
||||
|
||||
In this example, we would edit the append= line of our lilo.conf|menu.lst
|
||||
so that it included:
|
||||
|
||||
resume=file:/dev/hda2:0x1e001
|
||||
---EXAMPLE-ONLY-DON'T-COPY-AND-PASTE---
|
||||
|
||||
For those who are thinking 'Could I make the file sparse?', the answer is
|
||||
'No!'. At the moment, there is no way for TuxOnIce to fill in the holes in
|
||||
a sparse file while hibernating. In the longer term (post merge!), I'd like
|
||||
to change things so that the file could be dynamically resized and have
|
||||
holes filled as needed. Right now, however, that's not possible and not a
|
||||
priority.
|
||||
|
||||
c. Bootloader configuration.
|
||||
|
||||
Using TuxOnIce also requires that you add an extra parameter to
|
||||
your lilo.conf or equivalent. Here's an example for a swap partition:
|
||||
|
||||
append="resume=swap:/dev/hda1"
|
||||
|
||||
This would tell TuxOnIce that /dev/hda1 is a swap partition you
|
||||
have. TuxOnIce will use the swap signature of this partition as a
|
||||
pointer to your data when you hibernate. This means that (in this example)
|
||||
/dev/hda1 doesn't need to be _the_ swap partition where all of your data
|
||||
is actually stored. It just needs to be a swap partition that has a
|
||||
valid signature.
|
||||
|
||||
You don't need to have a swap partition for this purpose. TuxOnIce
|
||||
can also use a swap file, but usage is a little more complex. Having made
|
||||
your swap file, turn it on and do
|
||||
|
||||
cat /sys/power/tuxonice/swap/headerlocations
|
||||
|
||||
(this assumes you've already compiled your kernel with TuxOnIce
|
||||
support and booted it). The results of the cat command will tell you
|
||||
what you need to put in lilo.conf:
|
||||
|
||||
For swap partitions like /dev/hda1, simply use resume=/dev/hda1.
|
||||
For swapfile `swapfile`, use resume=swap:/dev/hda2:0x242d.
|
||||
|
||||
If the swapfile changes for any reason (it is moved to a different
|
||||
location, it is deleted and recreated, or the filesystem is
|
||||
defragmented) then you will have to check
|
||||
/sys/power/tuxonice/swap/headerlocations for a new resume_block value.
|
||||
|
||||
Once you've compiled and installed the kernel and adjusted your bootloader
|
||||
configuration, you should only need to reboot for the most basic part
|
||||
of TuxOnIce to be ready.
|
||||
|
||||
If you only compile in the swap allocator, or only compile in the file
|
||||
allocator, you don't need to add the "swap:" part of the resume=
|
||||
parameters above. resume=/dev/hda2:0x242d will work just as well. If you
|
||||
have compiled both and your storage is on swap, you can also use this
|
||||
format (the swap allocator is the default allocator).
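
The resume= formats shown in this section can be summarised by a small
parser. This is only a sketch, under the assumption that both allocators
are compiled in (so a missing prefix means swap); struct resume_location
and parse_resume are hypothetical names, and the kernel's own parsing may
well differ in its details.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct resume_location {
    char allocator[8];      /* "swap" or "file" */
    char device[64];        /* for example "/dev/hda2" */
    unsigned long block;    /* header location, 0 if none was given */
};

/* Parse "[swap:|file:]device[:0xblock]"; returns 0 on success, else -1. */
static int parse_resume(const char *value, struct resume_location *loc)
{
    const char *colon;
    char *end;
    size_t len;

    assert(value != NULL && loc != NULL);

    /* No prefix means the default allocator, which is swap. */
    strcpy(loc->allocator, "swap");
    if (strncmp(value, "swap:", 5) == 0) {
        value += 5;
    } else if (strncmp(value, "file:", 5) == 0) {
        strcpy(loc->allocator, "file");
        value += 5;
    }

    /* The device name runs up to the next colon, if there is one. */
    colon = strchr(value, ':');
    len = colon ? (size_t)(colon - value) : strlen(value);
    if (len == 0 || len >= sizeof(loc->device))
        return -1;
    memcpy(loc->device, value, len);
    loc->device[len] = '\0';

    /* An optional hexadecimal block number follows the device. */
    loc->block = 0;
    if (colon) {
        loc->block = strtoul(colon + 1, &end, 16);
        if (end == colon + 1 || *end != '\0')
            return -1;
    }
    return 0;
}
```

For instance, "/dev/hda2:0x242d" and "swap:/dev/hda2:0x242d" parse to the
same location in this sketch.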
|
||||
|
||||
When compiling your kernel, one of the options in the 'Power Management
|
||||
Support' menu, just above the 'Enhanced Hibernation (TuxOnIce)' entry is
|
||||
called 'Default resume partition'. This can be used to set a default value
|
||||
for the resume= parameter.
|
||||
|
||||
d. The hibernate script.
|
||||
|
||||
Since the driver model in 2.6 kernels is still being developed, you may need
|
||||
to do more than just configure TuxOnIce. Users of TuxOnIce usually start the
|
||||
process via a script which prepares for the hibernation cycle, tells the
|
||||
kernel to do its stuff and then restores things afterwards. This script might
|
||||
involve:
|
||||
|
||||
- Switching to a text console and back if X doesn't like the video card
|
||||
status on resume.
|
||||
- Un/reloading drivers that don't play well with hibernation.
|
||||
|
||||
Note that you might not be able to unload some drivers if there are
|
||||
processes using them. You might have to kill off processes that hold
|
||||
devices open. Hint: if your X server accesses a USB mouse, doing a
|
||||
'chvt' to a text console releases the device and you can unload the
|
||||
module.
|
||||
|
||||
Check out the latest script (available on tuxonice.net).
|
||||
|
||||
e. The userspace user interface.
|
||||
|
||||
TuxOnIce has very limited support for displaying status if you only apply
|
||||
the kernel patch - it can printk messages, but that is all. In addition,
|
||||
some of the functions mentioned in this document (such as cancelling a cycle
|
||||
or performing interactive debugging) are unavailable. To utilise these
|
||||
functions, or simply get a nice display, you need the 'userui' component.
|
||||
Userui comes in three flavours: usplash, fbsplash and text. Text should
|
||||
work on any console. Usplash and fbsplash require the appropriate
|
||||
(distro specific?) support.
|
||||
|
||||
To utilise a userui, TuxOnIce just needs to be told where to find the
|
||||
userspace binary:
|
||||
|
||||
echo "/usr/local/sbin/tuxoniceui_fbsplash" > /sys/power/tuxonice/user_interface/program
|
||||
|
||||
The hibernate script can do this for you, and a default value for this
|
||||
setting can be configured when compiling the kernel. This path is also
|
||||
stored in the image header, so if you have an initrd or initramfs, you can
|
||||
use the userui during the first part of resuming (prior to the atomic
|
||||
restore) by putting the binary in the same path in your initrd/ramfs.
|
||||
Alternatively, you can put it in a different location and do an echo
|
||||
similar to the above prior to the echo > do_resume. The value saved in the
|
||||
image header will then be ignored.
|
||||
|
||||
4. Why not just use the version already in the kernel?
|
||||
|
||||
The version in the vanilla kernel has a number of drawbacks. The most
|
||||
serious of these are:
|
||||
- it has a maximum image size of 1/2 total memory;
|
||||
- it doesn't allocate storage until after it has snapshotted memory.
|
||||
This means that you can't be sure hibernating will work until you
|
||||
see it start to write the image;
|
||||
- it does not allow you to press escape to cancel a cycle;
|
||||
- it does not allow you to press escape to cancel resuming;
|
||||
- it does not allow you to automatically swapon a file when
|
||||
starting a cycle;
|
||||
- it does not allow you to use multiple swap partitions or files;
|
||||
- it does not allow you to use ordinary files;
|
||||
- it just invalidates an image and continues to boot if you
|
||||
accidentally boot the wrong kernel after hibernating;
|
||||
- it doesn't support any sort of nice display while hibernating;
|
||||
- it is moving toward requiring that you have an initrd/initramfs
|
||||
to ever have a hope of resuming (uswsusp). While uswsusp will
|
||||
address some of the concerns above, it won't address all of them,
|
||||
and will be more complicated to get set up;
|
||||
- it doesn't have support for suspend-to-both (write a hibernation
|
||||
image, then suspend to ram; I think this is known as ReadySafe
|
||||
under M$).
|
||||
|
||||
5. How do you use it?
|
||||
|
||||
A hibernation cycle can be started directly by doing:
|
||||
|
||||
echo > /sys/power/tuxonice/do_hibernate
|
||||
|
||||
In practice, though, you'll probably want to use the hibernate script
|
||||
to unload modules, configure the kernel the way you like it and so on.
|
||||
In that case, you'd do (as root):
|
||||
|
||||
hibernate
|
||||
|
||||
See the hibernate script's man page for more details on the options it
|
||||
takes.
|
||||
|
||||
If you're using the text or splash user interface modules, one feature of
|
||||
TuxOnIce that you might find useful is that you can press Escape at any time
|
||||
during hibernating, and the process will be aborted.
|
||||
|
||||
Due to the way hibernation works, this means you'll have your system back and
|
||||
perfectly usable almost instantly. The only exception is when it's at the
|
||||
very end of writing the image. Then it will need to reload a small (usually
|
||||
4-50MB, depending upon the image characteristics) portion first.
|
||||
|
||||
Likewise, when resuming, you can press escape and resuming will be aborted.
|
||||
The computer will then powerdown again according to settings at that time for
|
||||
the powerdown method or rebooting.
|
||||
|
||||
You can change the settings for powering down while the image is being
|
||||
written by pressing 'R' to toggle rebooting and 'O' to toggle between
|
||||
suspending to ram and powering down completely.
|
||||
|
||||
If you run into problems with resuming, adding the "noresume" option to
|
||||
the kernel command line will let you skip the resume step and recover your
|
||||
system. This option shouldn't normally be needed, because TuxOnIce modifies
|
||||
the image header prior to the atomic restore, and will thus prompt you
|
||||
if it detects that you've tried to resume an image before (this flag is
|
||||
removed if you press Escape to cancel a resume, so you won't be prompted
|
||||
then).
|
||||
|
||||
Recent kernels (2.6.24 onwards) add support for resuming from a different
|
||||
kernel to the one that was hibernated (thanks to Rafael for his work on
|
||||
this - I've just embraced and enhanced the support for TuxOnIce). This
|
||||
should further reduce the need for you to use the noresume option.
|
||||
|
||||
6. What do all those entries in /sys/power/tuxonice do?
|
||||
|
||||
/sys/power/tuxonice is the directory which contains files you can use to
|
||||
tune and configure TuxOnIce to your liking. The exact contents of
|
||||
the directory will depend upon the version of TuxOnIce you're
|
||||
running and the options you selected at compile time. In the following
|
||||
descriptions, names in brackets refer to compile time options.
|
||||
(Note that they're all dependent upon you having selected CONFIG_TUXONICE
|
||||
in the first place!).
|
||||
|
||||
Since the values of these settings can open potential security risks, the
|
||||
writeable ones are accessible only to the root user. You may want to
|
||||
configure sudo to allow you to invoke your hibernate script as an ordinary
|
||||
user.
|
||||
|
||||
- alloc/failure_test
|
||||
|
||||
This debugging option provides a way of testing TuxOnIce's handling of
|
||||
memory allocation failures. Each allocation type that TuxOnIce makes has
|
||||
been given a unique number (see the source code). Echo the appropriate
|
||||
number into this entry, and when TuxOnIce attempts to do that allocation,
|
||||
it will pretend there was a failure and act accordingly.
|
||||
|
||||
- alloc/find_max_mem_allocated
|
||||
|
||||
This debugging option will cause TuxOnIce to find the maximum amount of
|
||||
memory it used during a cycle, and report that information in debugging
|
||||
information at the end of the cycle.
|
||||
|
||||
- alt_resume_param
|
||||
|
||||
Instead of powering down after writing a hibernation image, TuxOnIce
|
||||
supports resuming from a different image. This entry lets you set the
|
||||
location of the signature for that image (the resume= value you'd use
|
||||
for it). Using an alternate image and keep_image mode, you can do things
|
||||
like using an alternate image to power down an uninterruptible power
|
||||
supply.
|
||||
|
||||
- block_io/target_outstanding_io
|
||||
|
||||
This value controls the amount of memory that the block I/O code says it
|
||||
needs when the core code is calculating how much memory is needed for
|
||||
hibernating and for resuming. It doesn't directly control the amount of
|
||||
I/O that is submitted at any one time - that depends on the amount of
|
||||
available memory (we may have more available than we asked for), the
|
||||
throughput that is being achieved and the ability of the CPU to keep up
|
||||
with disk throughput (particularly where we're compressing pages).
|
||||
|
||||
- checksum/enabled
|
||||
|
||||
Use cryptoapi hashing routines to verify that Pageset2 pages don't change
|
||||
while we're saving the first part of the image, and to get any pages that
|
||||
do change resaved in the atomic copy. This should normally not be needed,
|
||||
but if you're seeing issues, please enable this. If your issues stop you
|
||||
being able to resume, enable this option, hibernate and cancel the cycle
|
||||
after the atomic copy is done. If the debugging info shows a non-zero
|
||||
number of pages resaved, please report this to Nigel.
|
||||
|
||||
- compression/algorithm
|
||||
|
||||
Set the cryptoapi algorithm used for compressing the image.
|
||||
|
||||
- compression/expected_compression
|
||||
|
||||
These values allow you to set an expected compression ratio, which TuxOnIce
|
||||
will use in calculating whether it meets constraints on the image size. If
|
||||
this expected compression ratio is not attained, the hibernation cycle will
|
||||
abort, so it is wise to allow some spare. You can see what compression
|
||||
ratio is achieved in the logs after hibernating.
|
||||
|
||||
- debug_info:
|
||||
|
||||
This file returns information about your configuration that may be helpful
|
||||
in diagnosing problems with hibernating.
|
||||
|
||||
- did_suspend_to_both:
|
||||
|
||||
This file can be used when you hibernate with powerdown method 3 (ie suspend
|
||||
to ram after writing the image). There can be two outcomes in this case. We
|
||||
can resume from the suspend-to-ram before the battery runs out, or we can run
|
||||
out of juice and end up resuming like normal. This entry lets you find out,
|
||||
post resume, which way we went. If the value is 1, we resumed from suspend
|
||||
to ram. This can be useful when actions need to be run post suspend-to-ram
|
||||
that don't need to be run if we did the normal resume from power off.
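
A post-resume script might test this entry along the following lines. This
is only a sketch: the default path is an assumption based on the
/sys/power/tuxonice directory used elsewhere in this document, and the
helper name resumed_from_ram is made up for illustration.

```shell
# Assumed location of the TuxOnIce sysfs entries; override TOI_DIR if yours differs.
TOI_DIR="${TOI_DIR:-/sys/power/tuxonice}"

# Succeeds (exit status 0) if the given file contains "1", i.e. we woke from
# suspend-to-ram rather than resuming from the image on disk.
# An optional first argument replaces the default file (handy for testing).
resumed_from_ram() {
    file="${1:-$TOI_DIR/did_suspend_to_both}"
    [ -r "$file" ] || return 1
    [ "$(cat "$file")" = "1" ]
}
```

You could then write "if resumed_from_ram; then ...; fi" around the actions
that only make sense after a suspend-to-ram wakeup.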
|
||||
|
||||
- do_hibernate:
|
||||
|
||||
When anything is written to this file, the kernel side of TuxOnIce will
|
||||
begin to attempt to write an image to disk and power down. You'll normally
|
||||
want to run the hibernate script instead, to get modules unloaded first.
|
||||
|
||||
- do_resume:
|
||||
|
||||
When anything is written to this file TuxOnIce will attempt to read and
|
||||
restore an image. If there is no image, it will return almost immediately.
|
||||
If an image exists, the echo > will never return. Instead, the original
|
||||
kernel context will be restored and the original echo > do_hibernate will
|
||||
return.
|
||||
|
||||
- */enabled
|
||||
|
||||
These options can be used to temporarily disable various parts of TuxOnIce.
|
||||
|
||||
- extra_pages_allowance
|
||||
|
||||
When TuxOnIce does its atomic copy, it calls the driver model suspend
|
||||
and resume methods. If you have DRI enabled with a driver such as fglrx,
|
||||
this can result in the driver allocating a substantial amount of memory
|
||||
for storing its state. Extra_pages_allowance tells TuxOnIce how much
|
||||
extra memory it should ensure is available for those allocations. If
|
||||
your attempts at hibernating end with a message in dmesg indicating that
|
||||
insufficient extra pages were allowed, you need to increase this value.
|
||||
|
||||
- file/target:
|
||||
|
||||
Read this value to get the current setting. Write to it to point TuxOnIce
|
||||
at a new storage location for the file allocator. See section 3.b.ii above
|
||||
for details of how to set up the file allocator.
|
||||
|
||||
- freezer_test
|
||||
|
||||
This entry can be used to get TuxOnIce to just test the freezer and prepare
|
||||
an image without actually doing a hibernation cycle. It is useful for
|
||||
diagnosing freezing and image preparation issues.
|
||||
|
||||
- full_pageset2
|
||||
|
||||
TuxOnIce divides the pages that are stored in an image into two sets. The
|
||||
difference between the two sets is that pages in pageset 1 are atomically
|
||||
copied, and pages in pageset 2 are written to disk without being copied
|
||||
first. A page CAN be written to disk without being copied first if and only
|
||||
if its contents will not be modified or used at any time after userspace
|
||||
processes are frozen. A page MUST be in pageset 1 if its contents are
|
||||
modified or used at any time after userspace processes have been frozen.
|
||||
|
||||
Normally (ie if this option is enabled), TuxOnIce will put all pages on the
|
||||
per-zone LRUs in pageset2, then remove those pages used by any userspace
|
||||
user interface helper and TuxOnIce storage manager that are running,
|
||||
together with pages used by the GEM memory manager introduced around 2.6.28
|
||||
kernels.
|
||||
|
||||
If this option is disabled, a much more conservative approach will be taken.
|
||||
The only pages in pageset2 will be those belonging to userspace processes,
|
||||
with the exclusion of those belonging to the TuxOnIce userspace helpers
|
||||
mentioned above. This will result in a much smaller pageset2, and will
|
||||
therefore result in smaller images than are possible with this option
|
||||
enabled.
|
||||
|
||||
- ignore_rootfs
|
||||
|
||||
TuxOnIce records which device is mounted as the root filesystem when
|
||||
writing the hibernation image. It will normally check at resume time that
|
||||
this device isn't already mounted - that would be a cause of filesystem
|
||||
corruption. In some particular cases (RAM based root filesystems), you
|
||||
might want to disable this check. This option allows you to do that.
|
||||
|
||||
- image_exists:
|
||||
|
||||
Can be used in a script to determine whether a valid image exists at the
|
||||
location currently pointed to by resume=. Returns up to three lines.
|
||||
The first is whether an image exists (-1 for unsure, otherwise 0 or 1).
|
||||
If an image exists, additional lines will return the machine and version.
|
||||
Echoing anything to this entry removes any current image.
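
As a rough illustration of using this entry from a script, the helper below
turns the first line into a word. It assumes the first line holds nothing
but the number, and that the entry lives under /sys/power/tuxonice; the
function name image_status is hypothetical.

```shell
# Prints "present", "absent" or "unsure" according to the first line of an
# image_exists-style file. Anything unexpected is reported as "unknown".
# An optional first argument replaces the assumed default path.
image_status() {
    file="${1:-/sys/power/tuxonice/image_exists}"
    first_line=$(head -n 1 "$file") || return 1
    case "$first_line" in
        1)  echo present ;;
        0)  echo absent ;;
        -1) echo unsure ;;
        *)  echo unknown ;;
    esac
}
```

Only read the entry in such a helper. Remember that writing anything to it
removes the current image.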
|
||||
|
||||
- image_size_limit:
|
||||
|
||||
The maximum size of hibernation image written to disk, measured in megabytes
|
||||
(1024*1024).
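
Since other entries (such as extra_pages_allowance) deal in pages rather
than megabytes, it can help to convert between the two. The sketch below
assumes 4096-byte pages unless told otherwise; check "getconf PAGESIZE" on
your machine. The name mb_to_pages is ours.

```shell
# Converts a size in megabytes (1024*1024 bytes) to a number of pages.
# $1 = megabytes, $2 = page size in bytes (assumed to be 4096 if omitted).
mb_to_pages() {
    mb=$1
    page_size=${2:-4096}
    echo $(( mb * 1024 * 1024 / page_size ))
}
```

For example, "mb_to_pages 100" prints 25600.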
|
||||
|
||||
- last_result:
|
||||
|
||||
The result of the last hibernation cycle, as defined in
|
||||
include/linux/suspend-debug.h with the values SUSPEND_ABORTED to
|
||||
SUSPEND_KEPT_IMAGE. This is a bitmask.
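
Because the value is a bitmask, a script needs to test individual bits
rather than compare the whole number. The helpers below are a generic
sketch; the bit numbers themselves must be taken from the header mentioned
above, and nothing here assumes particular positions. Both function names
are made up.

```shell
# Succeeds if bit number $2 (0 = least significant) is set in value $1.
result_bit_set() {
    [ $(( ($1 >> $2) & 1 )) -eq 1 ]
}

# Prints the numbers of all set bits in $1, lowest first, space separated.
result_bits() {
    value=$1
    bit=0
    out=""
    while [ "$value" -gt 0 ]; do
        if [ $(( value & 1 )) -eq 1 ]; then
            out="${out:+$out }$bit"
        fi
        value=$(( value >> 1 ))
        bit=$(( bit + 1 ))
    done
    echo "$out"
}
```

A value of 5, for instance, has bits 0 and 2 set.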
|
||||
|
||||
- late_cpu_hotplug:
|
||||
|
||||
This sysfs entry controls whether cpu hotplugging is done - as normal - just
|
||||
before (unplug) and after (replug) the atomic copy/restore (so that all
|
||||
CPUs/cores are available for multithreaded I/O). The alternative is to
|
||||
unplug all secondary CPUs/cores at the start of hibernating/resuming, and
|
||||
replug them at the end of resuming. No multithreaded I/O will be possible in
|
||||
this configuration, but the odd machine has been reported to require it.
|
||||
|
||||
- lid_file:
|
||||
|
||||
This determines which ACPI button file we look in to determine whether the
|
||||
lid is open or closed after resuming from suspend to disk or power off.
|
||||
If the entry is set to "lid/LID", we'll open /proc/acpi/button/lid/LID/state
|
||||
and check its contents at the appropriate moment. See post_wake_state below
|
||||
for more details on how this entry is used.
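
To check from userspace what TuxOnIce will see, you can build the same path
yourself. This is a sketch only: it assumes the state file contains the
word "open" or "closed" somewhere in its text, and the helper names are
hypothetical.

```shell
# Prints the ACPI state file corresponding to a lid_file value such as "lid/LID".
lid_state_path() {
    echo "/proc/acpi/button/$1/state"
}

# Prints "open", "closed" or "unknown" based on the contents of the state
# file given as $1.
lid_state() {
    case "$(cat "$1")" in
        *open*)   echo open ;;
        *closed*) echo closed ;;
        *)        echo unknown ;;
    esac
}
```

Typical use would be: lid_state "$(lid_state_path lid/LID)"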
|
||||
|
||||
- log_everything (CONFIG_PM_DEBUG):
|
||||
|
||||
Setting this option results in all messages printed being logged. Normally,
|
||||
only a subset are logged, so as to not slow the process and not clutter the
|
||||
logs. Useful for debugging. It can be toggled during a cycle by pressing
|
||||
'L'.
|
||||
|
||||
- no_load_direct:
|
||||
|
||||
This is a debugging option. If, when loading the atomically copied pages of
|
||||
an image, TuxOnIce finds that the destination address for a page is free,
|
||||
it will normally allocate the page, load the data directly into that
|
||||
address and skip it in the atomic restore. If this option is disabled, the
|
||||
page will be loaded somewhere else and atomically restored like other pages.
|
||||
|
||||
- no_flusher_thread:
|
||||
|
||||
When doing multithreaded I/O (see below), the first online CPU can be used
|
||||
to _just_ submit compressed pages when writing the image, rather than
|
||||
compressing and submitting data. This option is normally disabled, but has
|
||||
been included because Nigel would like to see whether it will be more useful
|
||||
as the number of cores/cpus in computers increases.
|
||||
|
||||
- no_multithreaded_io:
|
||||
|
||||
TuxOnIce will normally create one thread per cpu/core on your computer,
|
||||
each of which will then perform I/O. This will generally result in
|
||||
throughput that's the maximum the storage medium can handle. There
|
||||
shouldn't be any reason to disable multithreaded I/O now, but this option
|
||||
has been retained for debugging purposes.
|
||||
|
||||
- no_pageset2
|
||||
|
||||
See the entry for full_pageset2 above for an explanation of pagesets.
|
||||
Enabling this option causes TuxOnIce to do an atomic copy of all pages,
|
||||
thereby limiting the maximum image size to 1/2 of memory, as swsusp does.
|
||||
|
||||
- no_pageset2_if_unneeded
|
||||
|
||||
See the entry for full_pageset2 above for an explanation of pagesets.
|
||||
Enabling this option causes TuxOnIce to act like no_pageset2 was enabled
|
||||
if and only if it isn't needed anyway. This option may still make TuxOnIce
|
||||
less reliable because pageset2 pages are normally used to store the
|
||||
atomic copy - drivers that want to do allocations of larger amounts of
|
||||
memory in one shot will be more likely to find that those amounts aren't
|
||||
available if this option is enabled.
|
||||
|
||||
- pause_between_steps (CONFIG_PM_DEBUG):
|
||||
|
||||
This option is used during debugging, to make TuxOnIce pause between
|
||||
each step of the process. It is ignored when the nice display is on.
|
||||
|
||||
- post_wake_state:
|
||||
|
||||
TuxOnIce provides support for automatically waking after a user-selected
|
||||
delay, and using a different powerdown method if the lid is still closed.
|
||||
(Yes, we're assuming a laptop). This entry lets you choose what state
|
||||
should be entered next. The values are those described under
|
||||
powerdown_method, below. It can be used to suspend to RAM after hibernating,
|
||||
then power down properly (say) 20 minutes later. It can also be used to power down
|
||||
properly, then wake at (say) 6.30am and suspend to RAM until you're ready
|
||||
to use the machine.
|
||||
|
||||
- powerdown_method:
|
||||
|
||||
Used to select a method by which TuxOnIce should powerdown after writing the
|
||||
image. Currently:
|
||||
|
||||
0: Don't use ACPI to power off.
|
||||
3: Attempt to enter Suspend-to-ram.
|
||||
4: Attempt to enter ACPI S4 mode.
|
||||
5: Attempt to power down via ACPI S5 mode.
|
||||
|
||||
Note that these options are highly dependent upon your hardware & software:
|
||||
|
||||
3: When successful, your machine suspends to ram instead of powering off.
|
||||
The advantage of using this mode is that it doesn't matter whether your
|
||||
battery has enough charge to make it through to your next resume. If it
|
||||
lasts, you will simply resume from suspend to ram (and the image on disk
|
||||
will be discarded). If the battery runs out, you will resume from disk
|
||||
instead. The disadvantage is that it takes longer than a normal
|
||||
suspend-to-ram to enter the state, since the suspend-to-disk image needs
|
||||
to be written first.
|
||||
4/5: When successful, your machine will be off and consume (almost) no power.
|
||||
But it might still react to some external events like opening the lid or
|
||||
traffic on a network or usb device. For the bios, resume is then the same
|
||||
as warm boot, similar to a situation where you used the command `reboot'
|
||||
to reboot your machine. If your machine has problems on warm boot or if
|
||||
you want to protect your machine with the bios password, this is probably
|
||||
not the right choice. Mode 4 may be necessary on some machines where ACPI
|
||||
wake up methods need to be run to properly reinitialise hardware after a
|
||||
hibernation cycle.
|
||||
0: Switch the machine completely off. The only possible wakeup is the power
|
||||
button. For the bios, resume is then the same as a cold boot, in
|
||||
particular you would have to provide your bios boot password if your
|
||||
machine uses that feature for booting.
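
A wrapper script may want to check a value before writing it to
powerdown_method. The sketch below simply restates the list above; the
function name is made up and the descriptions are our paraphrase.

```shell
# Prints a short description of a powerdown_method value and fails
# (non-zero exit status) for values not listed in this document.
describe_powerdown_method() {
    case "$1" in
        0) echo "power off without ACPI" ;;
        3) echo "suspend to RAM" ;;
        4) echo "ACPI S4" ;;
        5) echo "ACPI S5" ;;
        *) echo "unsupported"; return 1 ;;
    esac
}
```

Calling "describe_powerdown_method 3" prints "suspend to RAM", while an
unlisted value such as 2 prints "unsupported" and returns failure.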
|
||||
|
||||
- progressbar_granularity_limit:
|
||||
|
||||
This option can be used to limit the granularity of the progress bar
|
||||
displayed with a bootsplash screen. The value is the maximum number of
|
||||
steps. That is, 10 will make the progress bar jump in 10% increments.
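
The effect of the limit can be modelled with a little integer arithmetic.
This is our own illustration of the idea, not the code the kernel or the
userspace helper uses, and the function name is hypothetical.

```shell
# Prints the percentage a bar limited to $3 steps would display after
# $1 out of $2 units of work, rounding down to the previous step.
quantised_progress() {
    done_units=$1
    total_units=$2
    steps=$3
    step_index=$(( done_units * steps / total_units ))
    echo $(( step_index * 100 / steps ))
}
```

With a limit of 10, 37 units out of 100 display as 30%, and the bar only
moves again once 40 units are done.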
|
||||
|
||||
- reboot:
|
||||
|
||||
This option causes TuxOnIce to reboot rather than powering down
|
||||
at the end of saving an image. It can be toggled during a cycle by pressing
|
||||
'R'.
|
||||
|
||||
- resume:
|
||||
|
||||
This sysfs entry can be used to read and set the location in which TuxOnIce
|
||||
will look for the signature of an image - the value set using resume= at
|
||||
boot time or CONFIG_PM_STD_PARTITION ("Default resume partition"). By
|
||||
writing to this file as well as modifying your bootloader's configuration
|
||||
file (eg menu.lst), you can set or reset the location of your image or the
|
||||
method of storing the image without rebooting.
|
||||
|
||||
- replace_swsusp (CONFIG_TOI_REPLACE_SWSUSP):
|
||||
|
||||
This option makes
|
||||
|
||||
echo disk > /sys/power/state
|
||||
|
||||
activate TuxOnIce instead of swsusp. Regardless of whether this option is
|
||||
enabled, any invocation of swsusp's resume time trigger will cause TuxOnIce
|
||||
to check for an image too. This is due to the fact that at resume time, we
|
||||
can't know whether this option was enabled until we see if an image is there
|
||||
for us to resume from. (And when an image exists, we don't care whether we
|
||||
did replace swsusp anyway - we just want to resume).
|
||||
|
||||
- resume_commandline:
|
||||
|
||||
This entry can be read after resuming to see the commandline that was used
|
||||
when resuming began. You might use this to set up two bootloader entries
|
||||
that are the same apart from the fact that one includes an extra append=
|
||||
argument "at_work=1". You could then grep resume_commandline in your
|
||||
post-resume scripts and configure networking (for example) differently
|
||||
depending upon whether you're at home or work. resume_commandline can be
|
||||
set to arbitrary text if you wish to remove sensitive contents.
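
The at_work=1 idea might look like this in a post-resume script. It is a
sketch under the assumption that the entry lives under /sys/power/tuxonice;
the function name resumed_at_work is made up.

```shell
# Succeeds if the saved command line contains at_work=1 as a whole word
# (so at_work=10 does not match). An optional first argument replaces the
# assumed default path.
resumed_at_work() {
    file="${1:-/sys/power/tuxonice/resume_commandline}"
    grep -q -E '(^|[[:space:]])at_work=1([[:space:]]|$)' "$file"
}
```

Your script can then branch on "if resumed_at_work; then ...; else ...; fi"
to pick the right network configuration.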
|
||||
|
||||
- swap/swapfilename:
|
||||
|
||||
This entry is used to specify the swapfile or partition that
|
||||
TuxOnIce will attempt to swapon/swapoff automatically. Thus, if
|
||||
I normally use /dev/hda1 for swap, and want to use /dev/hda2 specifically
|
||||
for my hibernation image, I would
|
||||
|
||||
echo /dev/hda2 > /sys/power/tuxonice/swap/swapfilename
|
||||
|
||||
/dev/hda2 would then be automatically swapon'd and swapoff'd. Note that the
|
||||
swapon and swapoff occur while other processes are frozen (including kswapd)
|
||||
so this swap file will not be used up when attempting to free memory. The
|
||||
partition/file is also given the highest priority, so other swapfiles/partitions
|
||||
will only be used to save the image when this one is filled.
|
||||
|
||||
The value of this file is used by headerlocations along with any currently
|
||||
activated swapfiles/partitions.
|
||||
|
||||
- swap/headerlocations:
|
||||
|
||||
This option tells you the resume= options to use for swap devices you
|
||||
currently have activated. It is particularly useful when you only want to
|
||||
use a swap file to store your image. See above for further details.
|
||||
|
||||
- test_bio
|
||||
|
||||
This is a debugging option. When enabled, TuxOnIce will not hibernate.
|
||||
Instead, when asked to write an image, it will skip the atomic copy,
|
||||
just doing the writing of the image and then returning control to the
|
||||
user at the point where it would have powered off. This is useful for
|
||||
testing throughput in different configurations.
|
||||
|
||||
- test_filter_speed
|
||||
|
||||
This is a debugging option. When enabled, TuxOnIce will not hibernate.
|
||||
Instead, when asked to write an image, it will not write anything or do
|
||||
an atomic copy, but will only run any enabled compression algorithm on the
|
||||
data that would have been written (the source pages of the atomic copy in
|
||||
the case of pageset 1). This is useful for comparing the performance of
|
||||
compression algorithms and for determining the extent to which an upgrade
|
||||
to your storage method would improve hibernation speed.
|
||||
|
||||
- user_interface/debug_sections (CONFIG_PM_DEBUG):
|
||||
|
||||
This value, together with the console log level, controls what debugging
|
||||
information is displayed. The console log level determines the level of
|
||||
detail, and this value determines what detail is displayed. This value is
|
||||
a bit vector, and the meaning of the bits can be found in the kernel tree
|
||||
in include/linux/tuxonice.h. It can be overridden using the kernel's
|
||||
command line option suspend_dbg.
|
||||
|
||||
- user_interface/default_console_level (CONFIG_PM_DEBUG):
|
||||
|
||||
This determines the value of the console log level at the start of a
|
||||
hibernation cycle. If debugging is compiled in, the console log level can be
|
||||
changed during a cycle by pressing the digit keys. Meanings are:
|
||||
|
||||
0: Nice display.
|
||||
1: Nice display plus numerical progress.
|
||||
2: Errors only.
|
||||
3: Low level debugging info.
|
||||
4: Medium level debugging info.
|
||||
5: High level debugging info.
|
||||
6: Verbose debugging info.
|
||||
|
||||
- user_interface/enable_escape:
|
||||
|
||||
Setting this to "1" will enable you to abort a hibernation cycle or resuming by
|
||||
pressing escape, "0" (default) disables this feature. Note that enabling
|
||||
this option means that you cannot initiate a hibernation cycle and then walk
|
||||
away from your computer, expecting it to be secure. With this feature disabled,
|
||||
you can validly have this expectation once TuxOnIce begins to write the
|
||||
image to disk. (Prior to this point, it is possible that TuxOnIce might
|
||||
abort because of failure to freeze all processes or because constraints
|
||||
on its ability to save the image are not met).
|
||||
|
||||
- user_interface/program
|
||||
|
||||
This entry is used to tell TuxOnIce what userspace program to use for
|
||||
providing a user interface while hibernating. The program uses a netlink
|
||||
socket to pass messages back and forward to the kernel, allowing all of the
|
||||
functions formerly implemented in the kernel user interface components.
|
||||
|
||||
- version:
|
||||
|
||||
The version of TuxOnIce you have compiled into the currently running kernel.
|
||||
|
||||
- wake_alarm_dir:
|
||||
|
||||
As mentioned above (post_wake_state), TuxOnIce supports automatically waking
|
||||
after some delay. This entry allows you to select which wake alarm to use.
|
||||
It should contain the value "rtc0" if you're wanting to use
|
||||
/sys/class/rtc/rtc0.
|
||||
|
||||
- wake_delay:
|
||||
|
||||
This value determines the delay from the end of writing the image until the
|
||||
wake alarm is triggered. You can set an absolute time by writing the desired
|
||||
time into /sys/class/rtc/<wake_alarm_dir>/wakealarm and leaving these values
|
||||
empty.
|
||||
|
||||
Note that for the wakeup to actually occur, you may need to modify entries
|
||||
in /proc/acpi/wakeup. This is done by echoing the name of the button in the
|
||||
first column (eg PBTN) into the file.
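
If you prefer to set an absolute time yourself, the arithmetic is simple.
The sketch below assumes the wakealarm file takes seconds since the epoch
(as the kernel's RTC class does), and both helper names are ours.

```shell
# Prints the wakealarm file for a wake_alarm_dir value such as "rtc0".
wakealarm_path() {
    echo "/sys/class/rtc/$1/wakealarm"
}

# Prints the absolute alarm time, in seconds since the epoch, for an alarm
# $2 minutes after the time $1 (itself in seconds since the epoch).
alarm_time() {
    echo $(( $1 + $2 * 60 ))
}
```

As root, something like
alarm_time "$(date +%s)" 20 > "$(wakealarm_path rtc0)"
would then request a wakeup 20 minutes from now.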
|
||||
|
||||
7. How do you get support?
|
||||
|
||||
Glad you asked. TuxOnIce is being actively maintained and supported
|
||||
by Nigel (the guy doing most of the kernel coding at the moment), Bernard
|
||||
(who maintains the hibernate script and userspace user interface components)
|
||||
and its users.
|
||||
|
||||
Resources available include HowTos, FAQs and a Wiki, all available via
|
||||
tuxonice.net. You can find the mailing lists there.
|
||||
|
||||
8. I think I've found a bug. What should I do?
|
||||
|
||||
By far and away, the most common problems people have with TuxOnIce
|
||||
relate to drivers not having adequate power management support. In this
|
||||
case, it is not a bug with TuxOnIce, but we can still help you. As we
|
||||
mentioned above, such issues can usually be worked around by building the
|
||||
functionality as modules and unloading them while hibernating. Please visit
|
||||
the Wiki for up-to-date lists of known issues and work arounds.
|
||||
|
||||
If this information doesn't help, try running:
|
||||
|
||||
hibernate --bug-report
|
||||
|
||||
..and sending the output to the users mailing list.
|
||||
|
||||
Good information on how to provide us with useful information from an
|
||||
oops is found in the file REPORTING-BUGS, in the top level directory
|
||||
of the kernel tree. If you get an oops, please especially note the
|
||||
information about running what is printed on the screen through ksymoops.
|
||||
The raw information is useless.
|
||||
|
||||
9. When will XXX be supported?
|
||||
|
||||
If there's a feature missing from TuxOnIce that you'd like, feel free to
|
||||
ask. We try to be obliging, within reason.
|
||||
|
||||
Patches are welcome. Please send to the list.
|
||||
|
||||
10. How does it work?
|
||||
|
||||
TuxOnIce does its work in a number of steps.
|
||||
|
||||
a. Freezing system activity.
|
||||
|
||||
The first main stage in hibernating is to stop all other activity. This is
|
||||
achieved in stages. Processes are considered in four groups, which we will
|
||||
describe in reverse order for clarity's sake: Threads with the PF_NOFREEZE
|
||||
flag, kernel threads without this flag, userspace processes with the
|
||||
PF_SYNCTHREAD flag and all other processes. The first set (PF_NOFREEZE) are
|
||||
untouched by the refrigerator code. They are allowed to run during hibernating
|
||||
and resuming, and are used to support user interaction, storage access or the
|
||||
like. Other kernel threads (those unneeded while hibernating) are frozen last.
|
||||
This leaves us with userspace processes that need to be frozen. When a
|
||||
process enters one of the *_sync system calls, we set a PF_SYNCTHREAD flag on
|
||||
that process for the duration of that call. Processes that have this flag are
|
||||
frozen after processes without it, so that we can seek to ensure that dirty
|
||||
data is synced to disk as quickly as possible in a situation where other
|
||||
processes may be submitting writes at the same time. Freezing the processes
|
||||
that are submitting data stops new I/O from being submitted. Syncthreads can
|
||||
then cleanly finish their work. So the order is:
|
||||
|
||||
- Userspace processes without PF_SYNCTHREAD or PF_NOFREEZE;
|
||||
- Userspace processes with PF_SYNCTHREAD (they won't have NOFREEZE);
|
||||
- Kernel processes without PF_NOFREEZE.
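
The ordering above can be restated as a small classification function. This
is a toy model of the rules as described here, not the refrigerator code
itself; the function name and its arguments are made up for illustration.

```shell
# Prints the stage (1, 2 or 3) in which a task would be frozen, or "never".
# $1 = "user" or "kernel"
# $2 = 1 if the task has PF_NOFREEZE, else 0
# $3 = 1 if the task has PF_SYNCTHREAD, else 0
freeze_stage() {
    kind=$1
    nofreeze=$2
    syncthread=$3
    if [ "$nofreeze" -eq 1 ]; then
        echo never          # untouched by the refrigerator
    elif [ "$kind" = "kernel" ]; then
        echo 3              # other kernel threads are frozen last
    elif [ "$syncthread" -eq 1 ]; then
        echo 2              # syncing processes get to finish their writes
    else
        echo 1              # ordinary userspace is frozen first
    fi
}
```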
|
||||
|
||||
b. Eating memory.
|
||||
|
||||
For a successful hibernation cycle, you need to have enough disk space to store the
|
||||
image and enough memory for the various limitations of TuxOnIce's
|
||||
algorithm. You can also specify a maximum image size. In order to conform
|
||||
to those constraints, TuxOnIce may 'eat' memory. If, after freezing
|
||||
processes, the constraints aren't met, TuxOnIce will thaw all the
|
||||
other processes and begin to eat memory until its calculations indicate
|
||||
the constraints are met. It will then freeze processes again and recheck
|
||||
its calculations.
|
||||
|
||||
c. Allocation of storage.
|
||||
|
||||
Next, TuxOnIce allocates the storage that will be used to save
|
||||
the image.
|
||||
|
||||
The core of TuxOnIce knows nothing about how or where pages are stored. We
|
||||
therefore request the active allocator (remember you might have compiled in
|
||||
more than one!) to allocate enough storage for our expected image size. If
|
||||
this request cannot be fulfilled, we eat more memory and try again. If it
|
||||
is fulfilled, we seek to allocate additional storage, just in case our
|
||||
expected compression ratio (if any) isn't achieved. This time, however, we
|
||||
just continue if we can't allocate enough storage.
|
||||
|
||||
If these calls to our allocator change the characteristics of the image
|
||||
such that we haven't allocated enough memory, we also loop. (The allocator
|
||||
may well need to allocate space for its storage information).
|
||||
|
||||
d. Write the first part of the image.
|
||||
|
||||
TuxOnIce stores the image in two sets of pages called 'pagesets'.
|
||||
Pageset 2 contains pages on the active and inactive lists; essentially
|
||||
the page cache. Pageset 1 contains all other pages, including the kernel.
|
||||
We use two pagesets for one important reason: We need to make an atomic copy
|
||||
of the kernel to ensure consistency of the image. Without a second pageset,
|
||||
that would limit us to an image that was at most half the amount of memory
|
||||
available. Using two pagesets allows us to store a full image. Since pageset
|
||||
2 pages won't be needed in saving pageset 1, we first save pageset 2 pages.
|
||||
We can then make our atomic copy of the remaining pages using both pageset 2
|
||||
pages and any other pages that are free. While saving both pagesets, we are
|
||||
careful not to corrupt the image. Among other things, we use lowlevel block
|
||||
I/O routines that don't change the pagecache contents.
|
||||
|
||||
The next step, then, is writing pageset 2.
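
The space argument for two pagesets can be put in numbers. The following is
a simplified model of the reasoning above (it ignores TuxOnIce's own
overheads), with all sizes in pages and both function names made up.

```shell
# Succeeds if an atomic copy of pageset 1 ($1 pages) fits in the pages
# occupied by pageset 2 ($2) plus the pages that are free ($3).
atomic_copy_fits() {
    [ "$1" -le $(( $2 + $3 )) ]
}

# Prints the largest image possible when every page must be atomically
# copied (no pageset 2): half of the $1 pages of memory.
max_image_without_pageset2() {
    echo $(( $1 / 2 ))
}
```

On a machine with 1000 pages, a single-pageset scheme tops out at a
500-page image. With 300 pages in pageset 1, 500 in pageset 2 and 200 free,
the copy fits and all 800 pages in use can be saved.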
|
||||
|
||||
e. Suspending drivers and storing processor context.
|
||||
|
||||
Having written pageset2, TuxOnIce calls the power management functions to
|
||||
notify drivers of the hibernation, and saves the processor state in preparation
|
||||
for the atomic copy of memory we are about to make.
|
||||
|
||||
f. Atomic copy.
|
||||
|
||||
At this stage, everything else but the TuxOnIce code is halted. Processes
|
||||
are frozen or idling, drivers are quiesced and have stored (ideally and where
|
||||
necessary) their configuration in memory we are about to atomically copy.
|
||||
In our lowlevel architecture specific code, we have saved the CPU state.
|
||||
We can therefore now do our atomic copy before resuming drivers etc.
|
||||
|
||||
g. Save the atomic copy (pageset 1).
|
||||
|
||||
TuxOnIce can then write the atomic copy of the remaining pages. Since we
|
||||
have copied the pages into other locations, we can continue to use the
|
||||
normal block I/O routines without fear of corrupting our image.
|
||||
|
||||
h. Save the image header.
|
||||
|
||||
Nearly there! We save our settings and other parameters needed for
|
||||
reloading pageset 1 in an 'image header'. We also tell our allocator to
|
||||
serialise its data at this stage, so that it can reread the image at resume
|
||||
time.
|
||||
|
||||
i. Set the image header.
|
||||
|
||||
Finally, we edit the header at our resume= location. The signature is
|
||||
changed by the allocator to reflect the fact that an image exists, and to
|
||||
point to the start of that data if necessary (swap allocator).
|
||||
|
||||
j. Power down.
|
||||
|
||||
Or reboot if we're debugging and the appropriate option is selected.
|
||||
|
||||
Whew!
|
||||
|
||||
Reloading the image.
|
||||
--------------------
|
||||
|
||||
Reloading the image is essentially the reverse of all the above. We load
|
||||
our copy of pageset 1, being careful to choose locations that aren't going
|
||||
to be overwritten as we copy it back (We start very early in the boot
|
||||
process, so there are no other processes to quiesce here). We then copy
|
||||
pageset 1 back to its original location in memory and restore the process
|
||||
context. We are now running with the original kernel. Next, we reload the
|
||||
pageset 2 pages, free the memory and swap used by TuxOnIce, restore
|
||||
the pageset header and restart processes. Sounds easy in comparison to
|
||||
hibernating, doesn't it!
|
||||
|
||||
There is of course more to TuxOnIce than this, but this explanation
|
||||
should be a good start. If there's interest, I'll write further
|
||||
documentation on range pages and the low level I/O.
|
||||
|
||||
11. Who wrote TuxOnIce?
|
||||
|
||||
(Answer based on the writings of Florent Chabaud, credits in files and
|
||||
Nigel's limited knowledge; apologies to anyone missed out!)
|
||||
|
||||
The main developers of TuxOnIce have been...
|
||||
|
||||
Gabor Kuti
|
||||
Pavel Machek
|
||||
Florent Chabaud
|
||||
Bernard Blackham
|
||||
Nigel Cunningham
|
||||
|
||||
Significant portions of swsusp, the code in the vanilla kernel which
|
||||
TuxOnIce enhances, have been worked on by Rafael Wysocki. Thanks should
|
||||
also be expressed to him.
|
||||
|
||||
The above mentioned developers have been aided in their efforts by a host
|
||||
of hundreds, if not thousands of testers and people who have submitted bug
|
||||
fixes & suggestions. Of special note are the efforts of Michael Frank, who
|
||||
had his computers repetitively hibernate and resume for literally tens of
|
||||
thousands of cycles and developed scripts to stress the system and test
|
||||
TuxOnIce far beyond the point most of us (Nigel included!) would consider
|
||||
testing. His efforts have contributed as much to TuxOnIce as any of the
|
||||
names above.
|
||||
@@ -0,0 +1,75 @@
|
||||
Motivation:
|
||||
|
||||
In complicated DMA pipelines such as graphics (multimedia, camera, gpu, display)
|
||||
a consumer of a buffer needs to know when the producer has finished producing
|
||||
it. Likewise the producer needs to know when the consumer is finished with the
|
||||
buffer so it can reuse it. A particular buffer may be consumed by multiple
|
||||
consumers which will retain the buffer for different amounts of time. In
|
||||
addition, a consumer may consume multiple buffers atomically.
|
||||
The sync framework adds an API which allows synchronization between the
|
||||
producers and consumers in a generic way while also allowing platforms which
|
||||
have shared hardware synchronization primitives to exploit them.
|
||||
|
||||
Goals:
|
||||
* provide a generic API for expressing synchronization dependencies
|
||||
* allow drivers to exploit hardware synchronization between hardware
|
||||
blocks
|
||||
* provide a userspace API that allows a compositor to manage
|
||||
dependencies.
|
||||
* provide rich telemetry data to allow debugging slowdowns and stalls of
|
||||
the graphics pipeline.
|
||||
|
||||
Objects:
|
||||
* sync_timeline
|
||||
* sync_pt
|
||||
* sync_fence
|
||||
|
||||
sync_timeline:
|
||||
|
||||
A sync_timeline is an abstract monotonically increasing counter. In general,
|
||||
each driver/hardware block context will have one of these. They can be backed
|
||||
by the appropriate hardware or rely on the generic sw_sync implementation.
|
||||
Timelines are only ever created through their specific implementations
|
||||
(e.g. sw_sync).
|
||||
|
||||
sync_pt:
|
||||
|
||||
A sync_pt is an abstract value which marks a point on a sync_timeline. Sync_pts
|
||||
have a single timeline parent. They have 3 states: active, signaled, and error.
|
||||
They start in active state and transition, once, to either signaled (when the
|
||||
timeline counter advances beyond the sync_pt’s value) or error state.
|
||||
|
||||
sync_fence:
|
||||
|
||||
Sync_fences are the primary primitives used by drivers to coordinate
|
||||
synchronization of their buffers. They are a collection of sync_pts which may
|
||||
or may not have the same timeline parent. A sync_pt can only exist in one fence
|
||||
and the fence's list of sync_pts is immutable once created. Fences can be
|
||||
waited on synchronously or asynchronously. Two fences can also be merged to
|
||||
create a third fence containing a copy of the two fences’ sync_pts. Fences are
|
||||
backed by file descriptors to allow userspace to coordinate the display pipeline
|
||||
dependencies.
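
The merge rule described above (a third fence holding copies of both inputs'
sync_pts, with each fence's list fixed at creation) might be modeled in
userspace roughly as follows. The demo_* names are hypothetical and the
sync_pts are reduced to (timeline id, value) pairs; this is a sketch of the
idea, not the framework's merge code.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a sync_pt: which timeline, and what value. */
struct demo_pt {
	int timeline_id;
	unsigned int value;
};

/* Hypothetical stand-in for a fence: a pt list fixed at creation time. */
struct demo_fence {
	size_t num_pts;
	struct demo_pt *pts;
};

/* Create a fence holding its own copy of the given points. */
static struct demo_fence *demo_fence_create(const struct demo_pt *pts, size_t n)
{
	struct demo_fence *f;

	if (n == 0)
		return NULL;
	f = malloc(sizeof(*f));
	if (!f)
		return NULL;
	f->pts = malloc(n * sizeof(*f->pts));
	if (!f->pts) {
		free(f);
		return NULL;
	}
	memcpy(f->pts, pts, n * sizeof(*f->pts));
	f->num_pts = n;
	return f;
}

/* Build a third fence from copies of a's and b's points; a and b are untouched. */
static struct demo_fence *demo_fence_merge(const struct demo_fence *a,
					   const struct demo_fence *b)
{
	struct demo_fence *f = malloc(sizeof(*f));

	if (!f)
		return NULL;
	f->num_pts = a->num_pts + b->num_pts;
	f->pts = malloc(f->num_pts * sizeof(*f->pts));
	if (!f->pts) {
		free(f);
		return NULL;
	}
	memcpy(f->pts, a->pts, a->num_pts * sizeof(*f->pts));
	memcpy(f->pts + a->num_pts, b->pts, b->num_pts * sizeof(*f->pts));
	return f;
}

static void demo_fence_free(struct demo_fence *f)
{
	if (f) {
		free(f->pts);
		free(f);
	}
}
```

Because the merged fence owns copies, either input can be freed afterwards
without affecting it, which matches the rule that a sync_pt lives in exactly
one fence.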
|
||||
|
||||
Use:
|
||||
|
||||
A driver implementing sync support should have a work submission function which:
|
||||
* takes a fence argument specifying when to begin work
|
||||
* asynchronously queues that work to kick off when the fence is signaled
|
||||
* returns a fence to indicate when its work will be done.
|
||||
* signals the returned fence once the work is completed.
|
||||
|
||||
Consider an imaginary display driver that has the following API:
|
||||
/*
|
||||
* assumes buf is ready to be displayed.
|
||||
* blocks until the buffer is on screen.
|
||||
*/
|
||||
void display_buffer(struct dma_buf *buf);
|
||||
|
||||
The new API will become:
|
||||
/*
|
||||
* will display buf when fence is signaled.
|
||||
* returns immediately with a fence that will signal when buf
|
||||
* is no longer displayed.
|
||||
*/
|
||||
struct sync_fence* display_buffer(struct dma_buf *buf,
|
||||
struct sync_fence *fence);
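
The submission pattern listed under "Use" can be sketched as a single-threaded
userspace model. Everything here is hypothetical (the toy_* names, the boolean
fence, the explicit poll call standing in for the driver's asynchronous
worker); it only illustrates the ordering of "take a fence, return a fence,
signal when done", not how a real driver would queue work.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fence: just a flag that flips once. */
struct toy_fence {
	bool signaled;
};

/* Hypothetical unit of queued work. */
struct toy_job {
	struct toy_fence *wait;	/* begin once this signals; NULL means "now" */
	struct toy_fence done;	/* signaled when the work has completed */
	bool started;
	int buf_id;
};

/*
 * Submission: record the dependency and return immediately with the
 * fence that will signal on completion. No work is done here.
 */
static struct toy_fence *toy_submit(struct toy_job *job, int buf_id,
				    struct toy_fence *wait)
{
	job->wait = wait;
	job->buf_id = buf_id;
	job->started = false;
	job->done.signaled = false;
	return &job->done;
}

/*
 * Stand-in for the asynchronous worker: call whenever a fence may have
 * signaled. Returns true if the job ran during this call.
 */
static bool toy_poll(struct toy_job *job)
{
	if (job->started)
		return false;
	if (job->wait && !job->wait->signaled)
		return false;
	job->started = true;
	/* ... the actual work on job->buf_id would happen here ... */
	job->done.signaled = true;
	return true;
}
```

Chaining works by passing one job's returned fence as the next job's wait
fence, so a consumer never starts before its producer has signaled.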
|
||||
@@ -29,6 +29,7 @@ Currently, these files are in /proc/sys/vm:
|
||||
- dirty_writeback_centisecs
|
||||
- drop_caches
|
||||
- extfrag_threshold
|
||||
- extra_free_kbytes
|
||||
- hugepages_treat_as_movable
|
||||
- hugetlb_shm_group
|
||||
- laptop_mode
|
||||
@@ -198,6 +199,21 @@ fragmentation index is <= extfrag_threshold. The default value is 500.
|
||||
|
||||
==============================================================
|
||||
|
||||
extra_free_kbytes
|
||||
|
||||
This parameter tells the VM to keep extra free memory between the threshold
|
||||
where background reclaim (kswapd) kicks in, and the threshold where direct
|
||||
reclaim (by allocating processes) kicks in.
|
||||
|
||||
This is useful for workloads that require low latency memory allocations
|
||||
and have a bounded burstiness in memory allocations. For example, a
|
||||
realtime application that receives and transmits network traffic
|
||||
(causing in-kernel memory allocations) with a maximum total message burst
|
||||
size of 200MB may need 200MB of extra free memory to avoid direct reclaim
|
||||
related latencies.
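
As a rough illustration of the 200MB example, the value written to the tunable
is expressed in kilobytes. The helpers below are hypothetical, assume
1MB = 1024kB, and assume the tunable is exposed as
/proc/sys/vm/extra_free_kbytes on a kernel carrying this change; writing it
normally requires root.

```c
#include <stdio.h>

/* Convert a burst size in MB to the kB unit used by the tunable
 * (assumes 1 MB = 1024 kB). */
static unsigned long burst_mb_to_kbytes(unsigned long burst_mb)
{
	return burst_mb * 1024UL;
}

/* Write a kB value to the given sysctl file. Returns 0 on success, -1 on
 * failure (for example, missing file or insufficient privileges). */
static int write_kbytes(const char *path, unsigned long kbytes)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%lu\n", kbytes) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A caller would combine the two as
write_kbytes("/proc/sys/vm/extra_free_kbytes", burst_mb_to_kbytes(200)),
which attempts to write 204800.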
|
||||
|
||||
==============================================================
|
||||
|
||||
hugepages_treat_as_movable
|
||||
|
||||
This parameter is only useful when kernelcore= is specified at boot time to
|
||||
|
||||
@@ -358,11 +358,8 @@ Every arch has an init callback function. If you need to do something early on
|
||||
to initialize some state, this is the time to do that. Otherwise, this simple
|
||||
function below should be sufficient for most people:
|
||||
|
||||
int __init ftrace_dyn_arch_init(void *data)
|
||||
int __init ftrace_dyn_arch_init(void)
|
||||
{
|
||||
/* return value is done indirectly via data */
|
||||
*(unsigned long *)data = 0;
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
|
||||
@@ -2013,6 +2013,35 @@ will produce:
|
||||
1) 1.449 us | }
|
||||
|
||||
|
||||
You can disable the hierarchical function call formatting and instead print a
|
||||
flat list of function entry and return events. This uses the format described
|
||||
in the Output Formatting section and respects all the trace options that
|
||||
control that formatting. Hierarchical formatting is the default.
|
||||
|
||||
hierarchical: echo nofuncgraph-flat > trace_options
|
||||
flat: echo funcgraph-flat > trace_options
|
||||
|
||||
e.g.:
|
||||
|
||||
# tracer: function_graph
|
||||
#
|
||||
# entries-in-buffer/entries-written: 68355/68355 #P:2
|
||||
#
|
||||
# _-----=> irqs-off
|
||||
# / _----=> need-resched
|
||||
# | / _---=> hardirq/softirq
|
||||
# || / _--=> preempt-depth
|
||||
# ||| / delay
|
||||
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
|
||||
# | | | |||| | |
|
||||
sh-1806 [001] d... 198.843443: graph_ent: func=_raw_spin_lock
|
||||
sh-1806 [001] d... 198.843445: graph_ent: func=__raw_spin_lock
|
||||
sh-1806 [001] d..1 198.843447: graph_ret: func=__raw_spin_lock
|
||||
sh-1806 [001] d..1 198.843449: graph_ret: func=_raw_spin_lock
|
||||
sh-1806 [001] d..1 198.843451: graph_ent: func=_raw_spin_unlock_irqrestore
|
||||
sh-1806 [001] d... 198.843453: graph_ret: func=_raw_spin_unlock_irqrestore
|
||||
|
||||
|
||||
You might find other useful features for this tracer in the
|
||||
following "dynamic ftrace" section such as tracing only specific
|
||||
functions or tasks.
|
||||
|
||||