	The cgroup freezer is useful to batch job management systems which start
and stop sets of tasks in order to schedule the resources of a machine
according to the desires of a system administrator. This sort of program
is often used on HPC clusters to schedule access to the cluster as a
whole. The cgroup freezer uses cgroups to describe the set of tasks to
be started/stopped by the batch job management system. It also provides
a means to start and stop the tasks composing the job.

	The cgroup freezer will also be useful for checkpointing running groups
of tasks. The freezer allows the checkpoint code to obtain a consistent
image of the tasks by attempting to force the tasks in a cgroup into a
quiescent state. Once the tasks are quiescent, another task can
walk /proc or invoke a kernel interface to gather information about the
quiesced tasks. Checkpointed tasks can be restarted later should a
recoverable error occur. This also allows the checkpointed tasks to be
migrated between nodes in a cluster by copying the gathered information
to another node and restarting the tasks there.
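
	As a rough sketch of the "walk /proc" step, the function below prints the
scheduler state of every task listed in a cgroup's tasks file. The name
task_states is hypothetical and not part of any kernel tool; the sketch only
assumes the tasks file described later in this document and the "State:"
line of /proc/<pid>/status.

```shell
# task_states DIR: hypothetical helper, not a kernel-provided tool.
# Prints "PID STATE" for every task listed in DIR/tasks, where STATE is
# the one-letter code from the "State:" line of /proc/PID/status.
task_states() {
    local dir="$1"
    local pid state
    while read -r pid; do
        # A task may exit between reading the list and reading /proc.
        [ -r "/proc/$pid/status" ] || continue
        state=$(awk '/^State:/ {print $2}' "/proc/$pid/status" 2>/dev/null)
        printf '%s %s\n' "$pid" "$state"
    done < "$dir/tasks"
}
```

	A checkpoint tool would gather far more than the state letter; the point
is only that the information is read from outside the quiesced tasks.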

	Sequences of SIGSTOP and SIGCONT are not always sufficient for stopping
and resuming tasks in userspace. Both of these signals are observable
from within the tasks we wish to freeze. While SIGSTOP cannot be caught,
blocked, or ignored, it can be seen by waiting or ptracing parent tasks.
SIGCONT is especially unsuitable since it can be caught by the task. Any
programs designed to watch for SIGSTOP and SIGCONT could be broken by
attempting to use SIGSTOP and SIGCONT to stop and resume tasks. We can
demonstrate this problem using nested bash shells:

	$ echo $$
	16644
	$ bash
	$ echo $$
	16690

	From a second, unrelated bash shell:
	$ kill -SIGSTOP 16690
	$ kill -SIGCONT 16690

	<at this point 16690 exits and causes 16644 to exit too>

	This happens because bash can observe both signals and choose how it
responds to them.
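
	That SIGCONT is visible to the task receiving it can be checked with a
small sketch. The function name observe_sigcont and the temporary log file
are illustrative assumptions; the sketch also assumes bash is installed and
that sleep accepts fractional seconds. It starts a child shell that traps
SIGCONT, stops and resumes it with signals, and prints how many times the
child's handler ran.

```shell
# observe_sigcont: hypothetical helper for illustration only.
observe_sigcont() {
    local log pid
    log=$(mktemp)
    # The child installs a SIGCONT handler, reports readiness, then idles.
    bash -c 'trap "echo saw-SIGCONT >> \"$1\"" CONT
             echo ready >> "$1"
             while :; do sleep 0.1; done' _ "$log" &
    pid=$!
    # Wait until the handler is installed before sending any signal.
    until grep -q ready "$log"; do sleep 0.1; done
    kill -STOP "$pid"
    sleep 0.3
    kill -CONT "$pid"
    # Give the child time to run its handler, then clean up.
    sleep 0.5
    kill -TERM "$pid"
    wait "$pid" 2>/dev/null || true
    # Print the number of times the child saw SIGCONT.
    grep -c saw-SIGCONT "$log" || true
    rm -f "$log"
}
```

	A non-zero count means the stop/resume cycle was observed from inside the
task, which is exactly what a transparent freeze must avoid.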

	Another example of a program which catches and responds to these
signals is gdb. In fact any program designed to use ptrace is likely to
have a problem with this method of stopping and resuming tasks.

	In contrast, the cgroup freezer uses the kernel freezer code to
prevent the freeze/unfreeze cycle from becoming visible to the tasks
being frozen. This allows the bash example above and gdb to run as
expected.

	The freezer subsystem in the container filesystem defines a file named
freezer.state. Writing "FROZEN" to the state file will freeze all tasks in the
cgroup. Subsequently writing "THAWED" will unfreeze the tasks in the cgroup.
Reading will return the current state.

* Examples of usage :

   # mkdir /containers
   # mount -t cgroup -ofreezer freezer  /containers
   # mkdir /containers/0
   # echo $some_pid > /containers/0/tasks

to get status of the freezer subsystem :

   # cat /containers/0/freezer.state
   THAWED

to freeze all tasks in the container :

   # echo FROZEN > /containers/0/freezer.state
   # cat /containers/0/freezer.state
   FREEZING
   # cat /containers/0/freezer.state
   FROZEN

to unfreeze all tasks in the container :

   # echo THAWED > /containers/0/freezer.state
   # cat /containers/0/freezer.state
   THAWED

This is the basic mechanism which should do the right thing for user space tasks
in a simple scenario.

It's important to note that freezing can be incomplete. In that case the write
to freezer.state returns EBUSY. This means that some tasks in the cgroup are
busy doing something that prevents the cgroup from being completely frozen at
this time. After EBUSY, the cgroup remains partially frozen, which is reflected
by freezer.state reporting "FREEZING" when read. The state remains "FREEZING"
until one of these things happens:

	1) Userspace cancels the freezing operation by writing "THAWED" to
		the freezer.state file
	2) Userspace retries the freezing operation by writing "FROZEN" to
		the freezer.state file (writing "FREEZING" is not legal
		and returns EIO)
	3) The tasks that blocked the cgroup from entering the "FROZEN"
		state disappear from the cgroup's set of tasks.
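
The first two cases above can be combined into a retry loop. The sketch below
is one possible approach, not a kernel-provided tool: the name freeze_cgroup,
the default of 10 attempts and the 0.1 second delay are all assumptions chosen
for illustration, and sleep is assumed to accept fractional seconds.

```shell
# freeze_cgroup DIR [ATTEMPTS]: hypothetical helper.
# Returns 0 once DIR/freezer.state reads FROZEN, 1 after giving up and
# cancelling with THAWED, 2 if the state file is not writable.
freeze_cgroup() {
    local file="$1/freezer.state"
    local attempts="${2:-10}"
    local i=0
    [ -w "$file" ] || { echo "cannot write $file" >&2; return 2; }
    while [ "$i" -lt "$attempts" ]; do
        # Case 2: (re)try by writing FROZEN. The write may fail with
        # EBUSY, so its status is ignored and the state is read back.
        { echo FROZEN > "$file"; } 2>/dev/null || true
        if [ "$(cat "$file")" = "FROZEN" ]; then
            return 0
        fi
        sleep 0.1
        i=$((i + 1))
    done
    # Case 1: cancel, so the cgroup is not left partially frozen.
    echo THAWED > "$file"
    return 1
}
```

With the mount point used in the examples, a caller might run
"freeze_cgroup /containers/0 20" and treat a return value of 1 as a
cancelled freeze.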