================
Control Group v2
================

:Date: October, 2015
:Author: Tejun Heo <tj@kernel.org>

This is the authoritative documentation on the design, interface and
conventions of cgroup v2. It describes all userland-visible aspects
of cgroup including core and specific controller behaviors. All
future changes must be reflected in this document. Documentation for
v1 is available under Documentation/cgroup-v1/.

.. CONTENTS

   1. Introduction
     1-1. Terminology
     1-2. What is cgroup?
   2. Basic Operations
     2-1. Mounting
     2-2. Organizing Processes
     2-3. [Un]populated Notification
     2-4. Controlling Controllers
       2-4-1. Enabling and Disabling
       2-4-2. Top-down Constraint
       2-4-3. No Internal Process Constraint
     2-5. Delegation
       2-5-1. Model of Delegation
       2-5-2. Delegation Containment
     2-6. Guidelines
       2-6-1. Organize Once and Control
       2-6-2. Avoid Name Collisions
   3. Resource Distribution Models
     3-1. Weights
     3-2. Limits
     3-3. Protections
     3-4. Allocations
   4. Interface Files
     4-1. Format
     4-2. Conventions
     4-3. Core Interface Files
   5. Controllers
     5-1. CPU
       5-1-1. CPU Interface Files
     5-2. Memory
       5-2-1. Memory Interface Files
       5-2-2. Usage Guidelines
       5-2-3. Memory Ownership
     5-3. IO
       5-3-1. IO Interface Files
       5-3-2. Writeback
     5-4. PID
       5-4-1. PID Interface Files
     5-5. RDMA
       5-5-1. RDMA Interface Files
     5-6. Misc
       5-6-1. perf_event
   6. Namespace
     6-1. Basics
     6-2. The Root and Views
     6-3. Migration and setns(2)
     6-4. Interaction with Other Namespaces
   P. Information on Kernel Programming
     P-1. Filesystem Support for Writeback
   D. Deprecated v1 Core Features
   R. Issues with v1 and Rationales for v2
     R-1. Multiple Hierarchies
     R-2. Thread Granularity
     R-3. Competition Between Inner Nodes and Threads
     R-4. Other Interface Issues
     R-5. Controller Issues and Remedies
       R-5-1. Memory


Introduction
============

Terminology
-----------

"cgroup" stands for "control group" and is never capitalized. The
singular form is used to designate the whole feature and also as a
qualifier as in "cgroup controllers". When explicitly referring to
multiple individual control groups, the plural form "cgroups" is used.


What is cgroup?
---------------

cgroup is a mechanism to organize processes hierarchically and
distribute system resources along the hierarchy in a controlled and
configurable manner.

cgroup is largely composed of two parts - the core and controllers.
cgroup core is primarily responsible for hierarchically organizing
processes. A cgroup controller is usually responsible for
distributing a specific type of system resource along the hierarchy
although there are utility controllers which serve purposes other than
resource distribution.

cgroups form a tree structure and every process in the system belongs
to one and only one cgroup. All threads of a process belong to the
same cgroup. On creation, all processes are put in the cgroup that
the parent process belongs to at the time. A process can be migrated
to another cgroup. Migration of a process doesn't affect already
existing descendant processes.

Following certain structural constraints, controllers may be enabled or
disabled selectively on a cgroup. All controller behaviors are
hierarchical - if a controller is enabled on a cgroup, it affects all
processes which belong to the cgroups comprising the inclusive
sub-hierarchy of the cgroup. When a controller is enabled on a nested
cgroup, it always restricts the resource distribution further. The
restrictions set closer to the root in the hierarchy can not be
overridden from further away.


Basic Operations
================

Mounting
--------

Unlike v1, cgroup v2 has only a single hierarchy. The cgroup v2
hierarchy can be mounted with the following mount command::

  # mount -t cgroup2 none $MOUNT_POINT

cgroup2 filesystem has the magic number 0x63677270 ("cgrp"). All
controllers which support v2 and are not bound to a v1 hierarchy are
automatically bound to the v2 hierarchy and show up at the root.
Controllers which are not in active use in the v2 hierarchy can be
bound to other hierarchies. This allows mixing v2 hierarchy with the
legacy v1 multiple hierarchies in a fully backward compatible way.

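As a rough sketch of how a script might locate the v2 hierarchy once it
has been mounted, the function below scans mount-table text in the
/proc/mounts layout for entries whose filesystem type is "cgroup2".
The function name and the sample input in the usage comment are
illustrative assumptions, not part of any cgroup interface.

```shell
# find_cgroup2_mounts (hypothetical helper): print the mount point of
# every cgroup2 entry found in mount-table text read from stdin.
# Expected field order per line, as in /proc/mounts:
#   device  mount-point  filesystem-type  options  ...
find_cgroup2_mounts() {
    awk '$3 == "cgroup2" { print $2 }'
}

# Typical use on a live system:
#   find_cgroup2_mounts < /proc/mounts
```

If nothing is printed, no cgroup2 filesystem is mounted in the
inspected table.
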
A controller can be moved across hierarchies only after the controller
is no longer referenced in its current hierarchy. Because per-cgroup
controller states are destroyed asynchronously and controllers may
have lingering references, a controller may not show up immediately on
the v2 hierarchy after the final umount of the previous hierarchy.
Similarly, a controller should be fully disabled to be moved out of
the unified hierarchy and it may take some time for the disabled
controller to become available for other hierarchies; furthermore, due
to inter-controller dependencies, other controllers may need to be
disabled too.

While useful for development and manual configurations, moving
controllers dynamically between the v2 and other hierarchies is
strongly discouraged for production use. It is recommended to decide
the hierarchies and controller associations before starting to use the
controllers after system boot.

During transition to v2, system management software might still
automount the v1 cgroup filesystem and so hijack all controllers
during boot, before manual intervention is possible. To make testing
and experimenting easier, the kernel parameter cgroup_no_v1= allows
disabling controllers in v1 and makes them always available in v2.

cgroup v2 currently supports the following mount options.

  nsdelegate

        Consider cgroup namespaces as delegation boundaries. This
        option is system wide and can only be set on mount or modified
        through remount from the init namespace. The mount option is
        ignored on non-init namespace mounts. Please refer to the
        Delegation section for details.


Organizing Processes
--------------------

Initially, only the root cgroup exists to which all processes belong.
A child cgroup can be created by creating a sub-directory::

  # mkdir $CGROUP_NAME

A given cgroup may have multiple child cgroups forming a tree
structure. Each cgroup has a read-writable interface file
"cgroup.procs". When read, it lists the PIDs of all processes which
belong to the cgroup one-per-line. The PIDs are not ordered and the
same PID may show up more than once if the process got moved to
another cgroup and then back or the PID got recycled while reading.

A process can be migrated into a cgroup by writing its PID to the
target cgroup's "cgroup.procs" file. Only one process can be migrated
on a single write(2) call. If a process is composed of multiple
threads, writing the PID of any thread migrates all threads of the
process.

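Because only one process can be migrated per write(2) call, a script
that moves several processes has to issue one write per PID. The
following is a minimal sketch of that loop; the function name is a
hypothetical helper and the cgroup directory is passed in by the
caller.

```shell
# migrate_pids CGROUP_DIR PID...   (hypothetical helper)
# Writes each PID to CGROUP_DIR/cgroup.procs with its own open/write,
# so every migration is a separate write(2) call.
# Stops and returns non-zero at the first write that fails.
migrate_pids() {
    cgdir=$1
    shift
    for pid in "$@"; do
        printf '%s\n' "$pid" > "$cgdir/cgroup.procs" || return 1
    done
}

# Example (requires write access to the target cgroup):
#   migrate_pids "$MOUNT_POINT/$CGROUP_NAME" 842 843
```
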
When a process forks a child process, the new process is born into the
cgroup that the forking process belongs to at the time of the
operation. After exit, a process stays associated with the cgroup
that it belonged to at the time of exit until it's reaped; however, a
zombie process does not appear in "cgroup.procs" and thus can't be
moved to another cgroup.

A cgroup which doesn't have any children or live processes can be
destroyed by removing the directory. Note that a cgroup which doesn't
have any children and is associated only with zombie processes is
considered empty and can be removed::

  # rmdir $CGROUP_NAME

"/proc/$PID/cgroup" lists a process's cgroup membership. If legacy
cgroup is in use in the system, this file may contain multiple lines,
one for each hierarchy. The entry for cgroup v2 is always in the
format "0::$PATH"::

  # cat /proc/842/cgroup
  ...
  0::/test-cgroup/test-cgroup-nested

If the process becomes a zombie and the cgroup it was associated with
is removed subsequently, " (deleted)" is appended to the path::

  # cat /proc/842/cgroup
  ...
  0::/test-cgroup/test-cgroup-nested (deleted)


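The "0::$PATH" format makes the v2 entry easy to pick out of
"/proc/$PID/cgroup" even when legacy hierarchies add further lines.
The sketch below extracts the path and drops a trailing " (deleted)"
marker. The function name is a hypothetical helper, and the sketch
assumes that no cgroup name itself ends in " (deleted)".

```shell
# cgroup_v2_path (hypothetical helper): read /proc/$PID/cgroup-style
# text on stdin and print the cgroup v2 path, i.e. the text following
# the "0::" prefix, without a trailing " (deleted)" marker.
cgroup_v2_path() {
    awk '/^0::/ {
        sub(/^0::/, "")            # strip the v2 hierarchy prefix
        sub(/ \(deleted\)$/, "")   # strip the removed-cgroup marker
        print
    }'
}

# Example:
#   cgroup_v2_path < /proc/842/cgroup
```
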
[Un]populated Notification
--------------------------

Each non-root cgroup has a "cgroup.events" file which contains a
"populated" field indicating whether the cgroup's sub-hierarchy has
live processes in it. Its value is 0 if there is no live process in
the cgroup and its descendants; otherwise, 1. poll and [id]notify
events are triggered when the value changes. This can be used, for
example, to start a clean-up operation after all processes of a given
sub-hierarchy have exited. The populated state updates and
notifications are recursive. Consider the following sub-hierarchy
where the numbers in the parentheses represent the numbers of processes
in each cgroup::

  A(4) - B(0) - C(1)
              \ D(0)

A, B and C's "populated" fields would be 1 while D's 0. After the one
process in C exits, B and C's "populated" fields would flip to "0" and
file modified events will be generated on the "cgroup.events" files of
both cgroups.


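A clean-up script can test the "populated" field before acting on a
sub-hierarchy. The following is a minimal sketch that only reads the
current value; waiting for the change through poll or inotify is left
out. The function name is a hypothetical helper.

```shell
# is_populated CGROUP_DIR   (hypothetical helper)
# Succeeds (exit status 0) if the "populated" field in
# CGROUP_DIR/cgroup.events reads 1, i.e. the cgroup or one of its
# descendants still has live processes.  Fails otherwise.
is_populated() {
    [ "$(awk '$1 == "populated" { print $2 }' "$1/cgroup.events")" = "1" ]
}

# Example: remove a cgroup only once its sub-hierarchy is empty.
#   is_populated "$CGROUP_NAME" || rmdir "$CGROUP_NAME"
```
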
Controlling Controllers
-----------------------

Enabling and Disabling
~~~~~~~~~~~~~~~~~~~~~~

Each cgroup has a "cgroup.controllers" file which lists all
controllers available for the cgroup to enable::

  # cat cgroup.controllers
  cpu io memory

No controller is enabled by default. Controllers can be enabled and
disabled by writing to the "cgroup.subtree_control" file::

  # echo "+cpu +memory -io" > cgroup.subtree_control

Only controllers which are listed in "cgroup.controllers" can be
enabled. When multiple operations are specified as above, either they
all succeed or they all fail. If multiple operations on the same
controller are specified, the last one is effective.

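The two rules above (only listed controllers can be enabled, and a
multi-operation write succeeds or fails as a whole) suggest checking
"cgroup.controllers" first and then issuing a single write. The sketch
below does that; both function names are hypothetical helpers, and the
pre-check is a convenience rather than something the kernel requires.

```shell
# controller_available CGROUP_DIR NAME   (hypothetical helper)
# Succeeds if NAME is listed in CGROUP_DIR/cgroup.controllers.
controller_available() {
    for c in $(cat "$1/cgroup.controllers"); do
        [ "$c" = "$2" ] && return 0
    done
    return 1
}

# enable_controllers CGROUP_DIR NAME...   (hypothetical helper)
# Verifies every NAME, then enables them all with one write to
# cgroup.subtree_control so the request is applied as a whole.
enable_controllers() {
    cgdir=$1
    shift
    ops=
    for name in "$@"; do
        controller_available "$cgdir" "$name" || return 1
        ops="$ops${ops:+ }+$name"     # builds e.g. "+cpu +memory"
    done
    printf '%s\n' "$ops" > "$cgdir/cgroup.subtree_control"
}

# Example:
#   enable_controllers "$CGROUP_NAME" cpu memory
```
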
Enabling a controller in a cgroup indicates that the distribution of
the target resource across its immediate children will be controlled.
Consider the following sub-hierarchy. The enabled controllers are
listed in parentheses::

  A(cpu,memory) - B(memory) - C()
                            \ D()

As A has "cpu" and "memory" enabled, A will control the distribution
of CPU cycles and memory to its children, in this case, B. As B has
"memory" enabled but not "cpu", C and D will compete freely on CPU
cycles but their division of memory available to B will be controlled.

As a controller regulates the distribution of the target resource to
the cgroup's children, enabling it creates the controller's interface
files in the child cgroups. In the above example, enabling "cpu" on B
would create the "cpu." prefixed controller interface files in C and
D. Likewise, disabling "memory" from B would remove the "memory."
prefixed controller interface files from C and D. This means that the
controller interface files - anything which doesn't start with
"cgroup." - are owned by the parent rather than the cgroup itself.


Top-down Constraint
~~~~~~~~~~~~~~~~~~~

Resources are distributed top-down and a cgroup can further distribute
a resource only if the resource has been distributed to it from the
parent. This means that all non-root "cgroup.subtree_control" files
can only contain controllers which are enabled in the parent's
"cgroup.subtree_control" file. A controller can be enabled only if
the parent has the controller enabled and a controller can't be
disabled if one or more children have it enabled.


No Internal Process Constraint
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Non-root cgroups can only distribute resources to their children when
they don't have any processes of their own. In other words, only
cgroups which don't contain any processes can have controllers enabled
in their "cgroup.subtree_control" files.

This guarantees that, when a controller is looking at the part of the
hierarchy which has it enabled, processes are always only on the
leaves. This rules out situations where child cgroups compete against
internal processes of the parent.

The root cgroup is exempt from this restriction. Root contains
processes and anonymous resource consumption which can't be associated
with any other cgroups and requires special treatment from most
controllers. How resource consumption in the root cgroup is governed
is up to each controller.

Note that the restriction doesn't get in the way if there is no
enabled controller in the cgroup's "cgroup.subtree_control". This is
important as otherwise it wouldn't be possible to create children of a
populated cgroup. To control resource distribution of a cgroup, the
cgroup must create children and transfer all its processes to the
children before enabling controllers in its "cgroup.subtree_control"
file.


Delegation
----------

Model of Delegation
~~~~~~~~~~~~~~~~~~~

A cgroup can be delegated in two ways. First, to a less privileged
user by granting write access of the directory and its "cgroup.procs"
and "cgroup.subtree_control" files to the user. Second, if the
"nsdelegate" mount option is set, automatically to a cgroup namespace
on namespace creation.

Because the resource control interface files in a given directory
control the distribution of the parent's resources, the delegatee
shouldn't be allowed to write to them. For the first method, this is
achieved by not granting access to these files. For the second, the
kernel rejects writes to all files other than "cgroup.procs" and
"cgroup.subtree_control" on a namespace root from inside the
namespace.

The end results are equivalent for both delegation types. Once
delegated, the user can build a sub-hierarchy under the directory,
organize processes inside it as it sees fit and further distribute the
resources it received from the parent. The limits and other settings
of all resource controllers are hierarchical and regardless of what
happens in the delegated sub-hierarchy, nothing can escape the
resource restrictions imposed by the parent.

Currently, cgroup doesn't impose any restrictions on the number of
cgroups in or nesting depth of a delegated sub-hierarchy; however,
this may be limited explicitly in the future.


Delegation Containment
~~~~~~~~~~~~~~~~~~~~~~

A delegated sub-hierarchy is contained in the sense that processes
can't be moved into or out of the sub-hierarchy by the delegatee.

For delegations to a less privileged user, this is achieved by
requiring the following conditions for a process with a non-root euid
to migrate a target process into a cgroup by writing its PID to the
"cgroup.procs" file.

- The writer must have write access to the "cgroup.procs" file.

- The writer must have write access to the "cgroup.procs" file of the
  common ancestor of the source and destination cgroups.

The above two constraints ensure that while a delegatee may migrate
processes around freely in the delegated sub-hierarchy it can't pull
in from or push out to outside the sub-hierarchy.

For example, let's assume cgroups C0 and C1 have been delegated to
user U0 who created C00, C01 under C0 and C10 under C1 as follows and
all processes under C0 and C1 belong to U0::

  ~~~~~~~~~~~~~ - C0 - C00
  ~ cgroup    ~      \ C01
  ~ hierarchy ~
  ~~~~~~~~~~~~~ - C1 - C10

Let's also say U0 wants to write the PID of a process which is
currently in C10 into "C00/cgroup.procs". U0 has write access to the
file; however, the common ancestor of the source cgroup C10 and the
destination cgroup C00 is above the points of delegation and U0 would
not have write access to its "cgroup.procs" file and thus the write
will be denied with -EACCES.

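To see which "cgroup.procs" file the second condition refers to, it can
help to compute the common ancestor of the two cgroup paths. The sketch
below does this for paths written relative to the cgroup root. The
function name is a hypothetical helper, and the example paths assume
that C0 and C1 sit directly under the root, which the diagram above
does not specify.

```shell
# common_ancestor PATH_A PATH_B   (hypothetical helper)
# Prints the deepest cgroup path that is an ancestor of, or equal to,
# both arguments.  Paths are relative to the cgroup root and start
# with "/".
common_ancestor() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, pa, "/")
        nb = split(b, pb, "/")
        out = ""
        # Element 1 is the empty string in front of the leading "/".
        for (i = 2; i <= na && i <= nb; i++) {
            if (pa[i] != pb[i] || pa[i] == "")
                break
            out = out "/" pa[i]
        }
        print (out == "" ? "/" : out)
    }'
}

# Example:
#   common_ancestor /C0/C00 /C0/C01    prints /C0
#   common_ancestor /C1/C10 /C0/C00    prints /
```

Under that assumption, a move between C00 and C01 only needs write
access to "cgroup.procs" in C0, which U0 has, while a move from C10 to
C00 needs it at a level above the delegation points.
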
For delegations to namespaces, containment is achieved by requiring
that both the source and destination cgroups are reachable from the
namespace of the process which is attempting the migration. If either
is not reachable, the migration is rejected with -ENOENT.


Guidelines
----------

Organize Once and Control
~~~~~~~~~~~~~~~~~~~~~~~~~

Migrating a process across cgroups is a relatively expensive operation
and stateful resources such as memory are not moved together with the
process. This is an explicit design decision as there often exist
inherent trade-offs between migration and various hot paths in terms
of synchronization cost.

As such, migrating processes across cgroups frequently as a means to
apply different resource restrictions is discouraged. A workload
should be assigned to a cgroup according to the system's logical and
resource structure once on start-up. Dynamic adjustments to resource
distribution can be made by changing controller configuration through
the interface files.


Avoid Name Collisions
~~~~~~~~~~~~~~~~~~~~~

Interface files for a cgroup and its children cgroups occupy the same
directory and it is possible to create children cgroups which collide
with interface files.

All cgroup core interface files are prefixed with "cgroup." and each
controller's interface files are prefixed with the controller name and
a dot. A controller's name is composed of lowercase letters and
'_'s but never begins with an '_' so it can be used as the prefix
character for collision avoidance. Also, interface file names won't
start or end with terms which are often used in categorizing workloads
such as job, service, slice, unit or workload.

cgroup doesn't do anything to prevent name collisions and it's the
user's responsibility to avoid them.


Resource Distribution Models
============================

cgroup controllers implement several resource distribution schemes
depending on the resource type and expected use cases. This section
describes major schemes in use along with their expected behaviors.


Weights
-------

A parent's resource is distributed by adding up the weights of all
active children and giving each the fraction matching the ratio of its
weight against the sum. As only children which can make use of the
resource at the moment participate in the distribution, this is
work-conserving. Due to the dynamic nature, this model is usually
used for stateless resources.

All weights are in the range [1, 10000] with the default at 100. This
allows symmetric multiplicative biases in both directions at fine
enough granularity while staying in the intuitive range.

As long as the weight is in range, all configuration combinations are
valid and there is no reason to reject configuration changes or
process migrations.

"cpu.weight" proportionally distributes CPU cycles to active children
and is an example of this type.


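The weight arithmetic can be worked through with a few lines of shell.
The sketch below takes the weights of the currently active children
and prints each child's share as a percentage of the parent's
resource. The function name and the sample weights are illustrative
assumptions.

```shell
# weight_shares WEIGHT...   (hypothetical helper)
# Prints, one per line and in argument order, the percentage of the
# parent's resource each active child would receive: its weight
# divided by the sum of all given weights.
weight_shares() {
    [ "$#" -gt 0 ] || return 1
    printf '%s\n' "$@" | awk '
        { w[NR] = $1; sum += $1 }
        END {
            for (i = 1; i <= NR; i++)
                printf "%.1f\n", 100 * w[i] / sum
        }'
}

# Three active children with weights 100, 200 and 100:
#   weight_shares 100 200 100     prints 25.0, 50.0 and 25.0
# If the third child goes idle, only the first two take part:
#   weight_shares 100 200         prints 33.3 and 66.7
```

The second call illustrates the work-conserving property: the idle
child's share is redistributed to the active ones instead of being
left unused.
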
Limits
------

A child can only consume up to the configured amount of the resource.
Limits can be over-committed - the sum of the limits of children can
exceed the amount of resource available to the parent.

Limits are in the range [0, max] and default to "max", which is a
no-op.

As limits can be over-committed, all configuration combinations are
valid and there is no reason to reject configuration changes or
process migrations.

"io.max" limits the maximum BPS and/or IOPS that a cgroup can consume
on an IO device and is an example of this type.


Protections
-----------

A cgroup is protected to be allocated up to the configured amount of
the resource if the usages of all its ancestors are under their
protected levels. Protections can be hard guarantees or best effort
soft boundaries. Protections can also be over-committed in which case
only up to the amount available to the parent is protected among
children.

Protections are in the range [0, max] and default to 0, which is a
no-op.

As protections can be over-committed, all configuration combinations
are valid and there is no reason to reject configuration changes or
process migrations.

"memory.low" implements best-effort memory protection and is an
example of this type.


Allocations
-----------

A cgroup is exclusively allocated a certain amount of a finite
resource. Allocations can't be over-committed - the sum of the
allocations of children can not exceed the amount of resource
available to the parent.

Allocations are in the range [0, max] and default to 0, which is no
resource.

As allocations can't be over-committed, some configuration
combinations are invalid and should be rejected. Also, if the
resource is mandatory for execution of processes, process migrations
may be rejected.

"cpu.rt.max" hard-allocates realtime slices and is an example of this
type.


Interface Files
===============

Format
------

All interface files should be in one of the following formats whenever
possible::

  New-line separated values
  (when only one value can be written at once)

        VAL0\n
        VAL1\n
        ...

  Space separated values
  (when read-only or multiple values can be written at once)

        VAL0 VAL1 ...\n

  Flat keyed

        KEY0 VAL0\n
        KEY1 VAL1\n
        ...

  Nested keyed

        KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01...
        KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11...
        ...

For a writable file, the format for writing should generally match
reading; however, controllers may allow omitting later fields or
implement restricted shortcuts for most common use cases.

For both flat and nested keyed files, only the values for a single key
can be written at a time. For nested keyed files, the sub key pairs
may be specified in any order and not all pairs have to be specified.


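Since the sub key pairs of a nested keyed file may appear in any order,
a reader should look them up by name rather than by position. The
following is a minimal sketch of such a lookup; the function name, the
file name and the sample keys in the usage comment are made up for
illustration and do not describe a real interface file.

```shell
# nested_keyed_get KEY SUB_KEY   (hypothetical helper)
# Reads a nested keyed file on stdin and prints the value of SUB_KEY on
# the line whose first field is KEY.  Prints nothing if the key or the
# sub key is absent.
nested_keyed_get() {
    awk -v key="$1" -v sub_key="$2" '
        $1 == key {
            for (i = 2; i <= NF; i++) {
                eq = index($i, "=")          # position of the first "="
                if (eq > 0 && substr($i, 1, eq - 1) == sub_key)
                    print substr($i, eq + 1)
            }
        }'
}

# Example with an invented file containing the line "KEY1 a=10 b=max":
#   nested_keyed_get KEY1 b < example-file     prints max
```
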
Conventions
-----------

- Settings for a single feature should be contained in a single file.

- The root cgroup should be exempt from resource control and thus
  shouldn't have resource control interface files. Also,
  informational files on the root cgroup which end up showing global
  information available elsewhere shouldn't exist.

- If a controller implements weight based resource distribution, its
  interface file should be named "weight" and have the range [1,
  10000] with 100 as the default. The values are chosen to allow
  enough and symmetric bias in both directions while keeping it
  intuitive (the default is 100%).

- If a controller implements an absolute resource guarantee and/or
  limit, the interface files should be named "min" and "max"
  respectively. If a controller implements best effort resource
  guarantee and/or limit, the interface files should be named "low"
  and "high" respectively.

  In the above four control files, the special token "max" should be
  used to represent upward infinity for both reading and writing.

- If a setting has a configurable default value and keyed specific
  overrides, the default entry should be keyed with "default" and
  appear as the first entry in the file.

  The default value can be updated by writing either "default $VAL" or
  "$VAL".

  When writing to update a specific override, "default" can be used as
  the value to indicate removal of the override. Override entries
  with "default" as the value must not appear when read.

  For example, a setting which is keyed by major:minor device numbers
  with integer values may look like the following::

    # cat cgroup-example-interface-file
    default 150
    8:0 300

  The default value can be updated by::

    # echo 125 > cgroup-example-interface-file

  or::

    # echo "default 125" > cgroup-example-interface-file

  An override can be set by::

    # echo "8:16 170" > cgroup-example-interface-file

  and cleared by::

    # echo "8:0 default" > cgroup-example-interface-file
    # cat cgroup-example-interface-file
    default 125
    8:16 170

- For events which are not very high frequency, an interface file
  "events" should be created which lists event key value pairs.
  Whenever a notifiable event happens, a file modified event should be
  generated on the file.


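For files that follow the "default" plus override convention described
above, the value that applies to a given key is its override if one is
listed and the default otherwise. The sketch below spells out that
lookup, reusing the values from the cgroup-example-interface-file
example. The function name is a hypothetical helper.

```shell
# effective_value KEY   (hypothetical helper)
# Reads a flat keyed file with a "default" entry on stdin and prints
# the value that applies to KEY: the override for KEY if one is listed,
# otherwise the value of the "default" entry.
effective_value() {
    awk -v key="$1" '
        $1 == "default" { def = $2 }
        $1 == key       { val = $2; found = 1 }
        END { print (found ? val : def) }'
}

# With the file contents "default 150" and "8:0 300":
#   effective_value 8:0  < cgroup-example-interface-file    prints 300
#   effective_value 8:16 < cgroup-example-interface-file    prints 150
```
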
Core Interface Files
--------------------

All cgroup core files are prefixed with "cgroup."

  cgroup.procs
        A read-write new-line separated values file which exists on
        all cgroups.

        When read, it lists the PIDs of all processes which belong to
        the cgroup one-per-line. The PIDs are not ordered and the
        same PID may show up more than once if the process got moved
        to another cgroup and then back or the PID got recycled while
        reading.

        A PID can be written to migrate the process associated with
        the PID to the cgroup. The writer should match all of the
        following conditions.

        - It must have write access to the "cgroup.procs" file.

        - It must have write access to the "cgroup.procs" file of the
          common ancestor of the source and destination cgroups.

        When delegating a sub-hierarchy, write access to this file
        should be granted along with the containing directory.

  cgroup.controllers
        A read-only space separated values file which exists on all
        cgroups.

        It shows a space separated list of all controllers available
        to the cgroup. The controllers are not ordered.

  cgroup.subtree_control
        A read-write space separated values file which exists on all
        cgroups. Starts out empty.

        When read, it shows a space separated list of the controllers
        which are enabled to control resource distribution from the
        cgroup to its children.

        A space separated list of controllers prefixed with '+' or '-'
        can be written to enable or disable controllers. A controller
        name prefixed with '+' enables the controller and '-'
        disables. If a controller appears more than once on the list,
        the last one is effective. When multiple enable and disable
        operations are specified, either all succeed or all fail.

  cgroup.events
        A read-only flat-keyed file which exists on non-root cgroups.
        The following entries are defined. Unless specified
        otherwise, a value change in this file generates a file
        modified event.

          populated
                1 if the cgroup or its descendants contain any live
                processes; otherwise, 0.


Controllers
===========

CPU
---

.. note::

        The interface for the cpu controller hasn't been merged yet

The "cpu" controller regulates distribution of CPU cycles. This
controller implements weight and absolute bandwidth limit models for
normal scheduling policy and absolute bandwidth allocation model for
realtime scheduling policy.


CPU Interface Files
~~~~~~~~~~~~~~~~~~~

All time durations are in microseconds.

  cpu.stat
        A read-only flat-keyed file which exists on non-root cgroups.

        It reports the following six stats:

        - usage_usec
        - user_usec
        - system_usec
        - nr_periods
        - nr_throttled
        - throttled_usec

  cpu.weight
        A read-write single value file which exists on non-root
        cgroups. The default is "100".

        The weight in the range [1, 10000].

  cpu.max
        A read-write two value file which exists on non-root cgroups.
        The default is "max 100000".

        The maximum bandwidth limit. It's in the following format::

          $MAX $PERIOD

        which indicates that the group may consume up to $MAX in each
        $PERIOD duration. "max" for $MAX indicates no limit. If only
        one number is written, $MAX is updated.

  cpu.rt.max
        .. note::

           The semantics of this file is still under discussion and the
           interface hasn't been merged yet

        A read-write two value file which exists on all cgroups.
        The default is "0 100000".

        The maximum realtime runtime allocation. Over-committing
        configurations are disallowed and process migrations are
        rejected if not enough bandwidth is available. It's in the
        following format::

          $MAX $PERIOD

        which indicates that the group may consume up to $MAX in each
        $PERIOD duration. If only one number is written, $MAX is
        updated.


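The "$MAX $PERIOD" pair used by "cpu.max" can be read as a ratio: the
group may run for at most $MAX microseconds out of every $PERIOD
microseconds. The sketch below turns the contents of such a file into
that ratio. The function name is a hypothetical helper and the sample
limit of 50000 is an invented value; only the default "max 100000"
comes from the text above.

```shell
# cpu_max_fraction "MAX PERIOD"   (hypothetical helper)
# Prints MAX / PERIOD with two decimals, or "unlimited" when MAX is
# the special token "max".
cpu_max_fraction() {
    printf '%s\n' "$1" | awk '{
        if ($1 == "max")
            print "unlimited"
        else
            printf "%.2f\n", $1 / $2
    }'
}

# Examples:
#   cpu_max_fraction "max 100000"      prints unlimited
#   cpu_max_fraction "50000 100000"    prints 0.50
# On a live system: cpu_max_fraction "$(cat cpu.max)"
```
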
Memory
------

The "memory" controller regulates distribution of memory. Memory is
stateful and implements both limit and protection models. Due to the
intertwining between memory usage and reclaim pressure and the
stateful nature of memory, the distribution model is relatively
complex.

While not completely water-tight, all major memory usages by a given
cgroup are tracked so that the total memory consumption can be
accounted and controlled to a reasonable extent. Currently, the
following types of memory usages are tracked.

- Userland memory - page cache and anonymous memory.

- Kernel data structures such as dentries and inodes.

- TCP socket buffers.

The above list may expand in the future for better coverage.


Memory Interface Files
~~~~~~~~~~~~~~~~~~~~~~

All memory amounts are in bytes. If a value which is not aligned to
PAGE_SIZE is written, the value may be rounded up to the closest
PAGE_SIZE multiple when read back.

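The rounding can be worked out ahead of time so that a value read back
from a memory interface file is not mistaken for a failed write. The
following is a minimal sketch of the arithmetic. The function name is a
hypothetical helper and the 4096-byte default is an assumption; the
actual page size of a system can be queried with "getconf PAGESIZE".

```shell
# round_up_to_page BYTES [PAGE_SIZE]   (hypothetical helper)
# Prints BYTES rounded up to the next multiple of PAGE_SIZE.
# PAGE_SIZE defaults to 4096, which is an assumption, not a guarantee.
round_up_to_page() {
    bytes=$1
    page=${2:-4096}
    echo $(( (bytes + page - 1) / page * page ))
}

# Examples with a 4096-byte page:
#   round_up_to_page 5000     prints 8192
#   round_up_to_page 4096     prints 4096
# Using the running system's page size:
#   round_up_to_page 5000 "$(getconf PAGESIZE)"
```
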
804 memory.current
6c292092
TH
805 A read-only single value file which exists on non-root
806 cgroups.
807
808 The total amount of memory currently being used by the cgroup
809 and its descendants.
810
811 memory.low
6c292092
TH
812 A read-write single value file which exists on non-root
813 cgroups. The default is "0".
814
815 Best-effort memory protection. If the memory usages of a
816 cgroup and all its ancestors are below their low boundaries,
817 the cgroup's memory won't be reclaimed unless memory can be
818 reclaimed from unprotected cgroups.
819
820 Putting more memory than generally available under this
821 protection is discouraged.
822
823 memory.high
824 A read-write single value file which exists on non-root
825 cgroups. The default is "max".
826
827 Memory usage throttle limit. This is the main mechanism to
828 control memory usage of a cgroup. If a cgroup's usage goes
829 over the high boundary, the processes of the cgroup are
830 throttled and put under heavy reclaim pressure.
831
832 Going over the high limit never invokes the OOM killer and
833 under extreme conditions the limit may be breached.
834
835 memory.max
836 A read-write single value file which exists on non-root
837 cgroups. The default is "max".
838
839 Memory usage hard limit. This is the final protection
840 mechanism. If a cgroup's memory usage reaches this limit and
841 can't be reduced, the OOM killer is invoked in the cgroup.
842 Under certain circumstances, the usage may go over the limit
843 temporarily.
844
845 This is the ultimate protection mechanism. As long as the
846 high limit is used and monitored properly, this limit's
847 utility is limited to providing the final safety net.
848
849 memory.events
850 A read-only flat-keyed file which exists on non-root cgroups.
851 The following entries are defined. Unless specified
852 otherwise, a value change in this file generates a file
853 modified event.
854
855 low
856 The number of times the cgroup is reclaimed due to
857 high memory pressure even though its usage is under
858 the low boundary. This usually indicates that the low
859 boundary is over-committed.
860
861 high
862 The number of times processes of the cgroup are
863 throttled and routed to perform direct memory reclaim
864 because the high memory boundary was exceeded. For a
865 cgroup whose memory usage is capped by the high limit
866 rather than global memory pressure, this event's
867 occurrences are expected.
868
869 max
870 The number of times the cgroup's memory usage was
871 about to go over the max boundary. If direct reclaim
872 fails to bring it down, the cgroup goes to OOM state.
873
874 oom
875 The number of times the cgroup's memory usage
876 reached the limit and an allocation was about to fail.
877
878 Depending on context, the result could be an invocation of the
879 OOM killer followed by a retry of the allocation, or a failed allocation.
880
881 A failed allocation could in turn be returned to
882 userspace as -ENOMEM or silently ignored in cases like
883 disk readahead. For now, OOM in a memory cgroup kills
884 tasks if and only if the shortage happened inside a page fault.
885
886 oom_kill
887 The number of processes belonging to this cgroup
888 killed by any kind of OOM killer.
889
890 memory.stat
891 A read-only flat-keyed file which exists on non-root cgroups.
892
893 This breaks down the cgroup's memory footprint into different
894 types of memory, type-specific details, and other information
895 on the state and past events of the memory management system.
896
897 All memory amounts are in bytes.
898
899 The entries are ordered to be human readable, and new entries
900 can show up in the middle. Don't rely on items remaining in a
901 fixed position; use the keys to look up specific values!
902
903 anon
904 Amount of memory used in anonymous mappings such as
905 brk(), sbrk(), and mmap(MAP_ANONYMOUS)
906
907 file
908 Amount of memory used to cache filesystem data,
909 including tmpfs and shared memory.
910
911 kernel_stack
912 Amount of memory allocated to kernel stacks.
913
914 slab
915 Amount of memory used for storing in-kernel data
916 structures.
917
918 sock
919 Amount of memory used in network transmission buffers
920
921 shmem
922 Amount of cached filesystem data that is swap-backed,
923 such as tmpfs, shm segments, shared anonymous mmap()s
924
925 file_mapped
926 Amount of cached filesystem data mapped with mmap()
927
928 file_dirty
929 Amount of cached filesystem data that was modified but
930 not yet written back to disk
931
932 file_writeback
933 Amount of cached filesystem data that was modified and
934 is currently being written back to disk
935
936 inactive_anon, active_anon, inactive_file, active_file, unevictable
937 Amount of memory, swap-backed and filesystem-backed,
938 on the internal memory management lists used by the
939 page reclaim algorithm
940
941 slab_reclaimable
942 Part of "slab" that might be reclaimed, such as
943 dentries and inodes.
944
945 slab_unreclaimable
946 Part of "slab" that cannot be reclaimed on memory
947 pressure.
948
949 pgfault
950 Total number of page faults incurred
951
952 pgmajfault
953 Number of major page faults incurred
954
955 workingset_refault
956
957 Number of refaults of previously evicted pages
958
959 workingset_activate
960
961 Number of refaulted pages that were immediately activated
962
963 workingset_nodereclaim
964
965 Number of times a shadow node has been reclaimed
966
967 pgrefill
968
969 Amount of scanned pages (in an active LRU list)
970
971 pgscan
972
973 Amount of scanned pages (in an inactive LRU list)
974
975 pgsteal
976
977 Amount of reclaimed pages
978
979 pgactivate
980
981 Amount of pages moved to the active LRU list
982
983 pgdeactivate
984
985 Amount of pages moved to the inactive LRU list
986
987 pglazyfree
988
989 Amount of pages postponed to be freed under memory pressure
990
991 pglazyfreed
992
993 Amount of reclaimed lazyfree pages
994
995 memory.swap.current
996 A read-only single value file which exists on non-root
997 cgroups.
998
999 The total amount of swap currently being used by the cgroup
1000 and its descendants.
1001
1002 memory.swap.max
1003 A read-write single value file which exists on non-root
1004 cgroups. The default is "max".
1005
1006 Swap usage hard limit. If a cgroup's swap usage reaches this
1007 limit, anonymous memory of the cgroup will not be swapped out.
1008
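Since memory.events and memory.stat are flat-keyed files whose entries
may be reordered or extended, a reader should look values up by key
rather than by position. A minimal Python sketch of that approach
follows; the helper name is hypothetical and the sample text uses
made-up counter values, not real kernel output.

```python
def parse_flat_keyed(text):
    """Parse "KEY VALUE" lines into a dict of integer values."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        key, value = line.split(None, 1)
        entries[key] = int(value)
    return entries


# Made-up sample in the shape of memory.events.
sample = "low 0\nhigh 12\nmax 3\noom 1\noom_kill 1\n"
events = parse_flat_keyed(sample)
```

Looking up events["high"] keeps working even if new keys are later
added in the middle of the file.
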
1009
1010Usage Guidelines
1011~~~~~~~~~~~~~~~~
1012
1013"memory.high" is the main mechanism to control memory usage.
1014Over-committing on high limit (sum of high limits > available memory)
1015and letting global memory pressure distribute memory according to
1016usage is a viable strategy.
1017
1018Because breach of the high limit doesn't trigger the OOM killer but
1019throttles the offending cgroup, a management agent has ample
1020opportunities to monitor and take appropriate actions such as granting
1021more memory or terminating the workload.
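
One way a management agent might notice that a cgroup is being
throttled is to sample memory.events periodically and compare the
"high" counter between samples. The Python sketch below shows only that
comparison; the function names are hypothetical, and how the agent
obtains the two snapshots and what it does in response are left open.

```python
def counter(events_text, key):
    """Return the integer value stored under key in flat-keyed text."""
    for line in events_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == key:
            return int(fields[1])
    return 0  # key absent: treat as never incremented


def new_high_events(previous_text, current_text):
    """Number of "high" events recorded between two snapshots."""
    return counter(current_text, "high") - counter(previous_text, "high")
```

A non-zero result means the high boundary was exceeded since the
previous sample, which the agent could take as a cue to grant more
memory or to terminate the workload.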
1022
1023Determining whether a cgroup has enough memory is not trivial as
1024memory usage doesn't indicate whether the workload can benefit from
1025more memory. For example, a workload which writes data received from
1026network to a file can use all available memory but can also perform
1027just as well with a small amount of memory. A measure of memory
1028pressure - how much the workload is being impacted due to lack of
1029memory - is necessary to determine whether a workload needs more
1030memory; unfortunately, a memory pressure monitoring mechanism isn't
1031implemented yet.
1032
1033
1034Memory Ownership
1035~~~~~~~~~~~~~~~~
1036
1037A memory area is charged to the cgroup which instantiated it and stays
1038charged to the cgroup until the area is released. Migrating a process
1039to a different cgroup doesn't move the memory usages that it
1040instantiated while in the previous cgroup to the new cgroup.
1041
1042A memory area may be used by processes belonging to different cgroups.
1043Which cgroup the area will be charged to is indeterminate; however,
1044over time, the memory area is likely to end up in a cgroup which has
1045enough memory allowance to avoid high reclaim pressure.
1046
1047If a cgroup sweeps a considerable amount of memory which is expected
1048to be accessed repeatedly by other cgroups, it may make sense to use
1049POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
1050belonging to the affected files to ensure correct memory ownership.
1051
1052
1053IO
1054--
1055
1056The "io" controller regulates the distribution of IO resources. This
1057controller implements both weight based and absolute bandwidth or IOPS
1058limit distribution; however, weight based distribution is available
1059only if cfq-iosched is in use and neither scheme is available for
1060blk-mq devices.
1061
1062
1063IO Interface Files
1064~~~~~~~~~~~~~~~~~~
1065
1066 io.stat
1067 A read-only nested-keyed file which exists on non-root
1068 cgroups.
1069
1070 Lines are keyed by $MAJ:$MIN device numbers and not ordered.
1071 The following nested keys are defined.
1072
1073 ====== ===================
1074 rbytes Bytes read
1075 wbytes Bytes written
1076 rios Number of read IOs
1077 wios Number of write IOs
1078 ====== ===================
1079
1080 An example read output follows::
1081
1082 8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353
1083 8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252
1084
1085 io.weight
1086 A read-write flat-keyed file which exists on non-root cgroups.
1087 The default is "default 100".
1088
1089 The first line is the default weight applied to devices
1090 without specific override. The rest are overrides keyed by
1091 $MAJ:$MIN device numbers and not ordered. The weights are in
1092 the range [1, 10000] and specify the relative amount of IO time
1093 the cgroup can use in relation to its siblings.
1094
1095 The default weight can be updated by writing either "default
1096 $WEIGHT" or simply "$WEIGHT". Overrides can be set by writing
1097 "$MAJ:$MIN $WEIGHT" and unset by writing "$MAJ:$MIN default".
1098
1099 An example read output follows::
1100
1101 default 100
1102 8:16 200
1103 8:0 50
1104
1105 io.max
1106 A read-write nested-keyed file which exists on non-root
1107 cgroups.
1108
1109 BPS and IOPS based IO limit. Lines are keyed by $MAJ:$MIN
1110 device numbers and not ordered. The following nested keys are
1111 defined.
1112
1113 ===== ==================================
1114 rbps Max read bytes per second
1115 wbps Max write bytes per second
1116 riops Max read IO operations per second
1117 wiops Max write IO operations per second
1118 ===== ==================================
1119
1120 When writing, any number of nested key-value pairs can be
1121 specified in any order. "max" can be specified as the value
1122 to remove a specific limit. If the same key is specified
1123 multiple times, the outcome is undefined.
1124
1125 BPS and IOPS are measured in each IO direction and IOs are
1126 delayed if limit is reached. Temporary bursts are allowed.
1127
1128 Setting read limit at 2M BPS and write at 120 IOPS for 8:16::
1129
1130 echo "8:16 rbps=2097152 wiops=120" > io.max
1131
1132 Reading returns the following::
1133
1134 8:16 rbps=2097152 wbps=max riops=max wiops=120
1135
1136 Write IOPS limit can be removed by writing the following::
1137
1138 echo "8:16 wiops=max" > io.max
1139
1140 Reading now returns the following::
1141
1142 8:16 rbps=2097152 wbps=max riops=max wiops=max
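
The nested-keyed files above (io.stat and io.max) share one line
layout: a $MAJ:$MIN key followed by "name=value" pairs. The following
Python sketch, with hypothetical helper names, parses such text and
builds a line suitable for writing to io.max. Values are kept as strings
so that "max" survives unchanged.

```python
def parse_nested_keyed(text):
    """Map each device key to a dict of its nested key/value strings."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        device, pairs = fields[0], fields[1:]
        result[device] = dict(pair.split("=", 1) for pair in pairs)
    return result


def format_io_max(device, **limits):
    """Build an io.max line such as "8:16 rbps=2097152 wiops=120"."""
    pairs = " ".join(f"{key}={value}" for key, value in limits.items())
    return f"{device} {pairs}"
```

For instance, format_io_max("8:16", wiops="max") yields the string used
above to remove the write IOPS limit.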
1143
1144
1145Writeback
1146~~~~~~~~~
1147
1148Page cache is dirtied through buffered writes and shared mmaps and
1149written asynchronously to the backing filesystem by the writeback
1150mechanism. Writeback sits between the memory and IO domains and
1151regulates the proportion of dirty memory by balancing dirtying and
1152write IOs.
1153
1154The io controller, in conjunction with the memory controller,
1155implements control of page cache writeback IOs. The memory controller
1156defines the memory domain that dirty memory ratio is calculated and
1157maintained for and the io controller defines the io domain which
1158writes out dirty pages for the memory domain. Both system-wide and
1159per-cgroup dirty memory states are examined and the more restrictive
1160of the two is enforced.
1161
1162cgroup writeback requires explicit support from the underlying
1163filesystem. Currently, cgroup writeback is implemented on ext2, ext4
1164and btrfs. On other filesystems, all writeback IOs are attributed to
1165the root cgroup.
1166
1167There are inherent differences in memory and writeback management
1168which affects how cgroup ownership is tracked. Memory is tracked per
1169page while writeback per inode. For the purpose of writeback, an
1170inode is assigned to a cgroup and all IO requests to write dirty pages
1171from the inode are attributed to that cgroup.
1172
1173As cgroup ownership for memory is tracked per page, there can be pages
1174which are associated with different cgroups than the one the inode is
1175associated with. These are called foreign pages. The writeback
1176constantly keeps track of foreign pages and, if a particular foreign
1177cgroup becomes the majority over a certain period of time, switches
1178the ownership of the inode to that cgroup.
1179
1180While this model is enough for most use cases where a given inode is
1181mostly dirtied by a single cgroup even when the main writing cgroup
1182changes over time, use cases where multiple cgroups write to a single
1183inode simultaneously are not supported well. In such circumstances, a
1184significant portion of IOs are likely to be attributed incorrectly.
1185As the memory controller assigns page ownership on the first use and
1186doesn't update it until the page is released, even if writeback
1187strictly follows page ownership, multiple cgroups dirtying overlapping
1188areas wouldn't work as expected. It's recommended to avoid such usage
1189patterns.
1190
1191The sysctl knobs which affect writeback behavior are applied to cgroup
1192writeback as follows.
1193
1194 vm.dirty_background_ratio, vm.dirty_ratio
1195 These ratios apply the same to cgroup writeback with the
1196 amount of available memory capped by limits imposed by the
1197 memory controller and system-wide clean memory.
1198
1199 vm.dirty_background_bytes, vm.dirty_bytes
1200 For cgroup writeback, these are converted into a ratio against
1201 total available memory and applied the same way as
1202 vm.dirty[_background]_ratio.
1203
1204
1205PID
1206---
1207
1208The process number controller is used to allow a cgroup to stop any
1209new tasks from being fork()'d or clone()'d after a specified limit is
1210reached.
1211
1212The number of tasks in a cgroup can be exhausted in ways which other
1213controllers cannot prevent, thus warranting its own controller. For
1214example, a fork bomb is likely to exhaust the number of tasks before
1215hitting memory restrictions.
1216
1217Note that PIDs used in this controller refer to TIDs, process IDs as
1218used by the kernel.
1219
1220
1221PID Interface Files
1222~~~~~~~~~~~~~~~~~~~
1223
1224 pids.max
1225 A read-write single value file which exists on non-root
1226 cgroups. The default is "max".
1227
1228 Hard limit of number of processes.
1229
1230 pids.current
1231 A read-only single value file which exists on all cgroups.
1232
1233 The number of processes currently in the cgroup and its
1234 descendants.
1235
1236Organisational operations are not blocked by cgroup policies, so it is
1237possible to have pids.current > pids.max. This can be done by either
1238setting the limit to be smaller than pids.current, or attaching enough
1239processes to the cgroup such that pids.current is larger than
1240pids.max. However, it is not possible to violate a cgroup PID policy
1241through fork() or clone(). These will return -EAGAIN if the creation
1242of a new process would cause a cgroup policy to be violated.
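
The relationship between pids.current and pids.max can be modelled with
a short Python sketch. It is a simplified, single-cgroup model with
hypothetical function names and does not account for limits set on
ancestor cgroups.

```python
import math


def parse_limit(text):
    """Convert a pids.max style value; "max" means no limit."""
    text = text.strip()
    return math.inf if text == "max" else int(text)


def fork_would_be_rejected(pids_current, pids_max_text):
    """True if one more process would push the count past the limit."""
    return pids_current + 1 > parse_limit(pids_max_text)
```

Note that the model also reports a rejection when pids.current already
exceeds pids.max, which is the state that organisational operations can
produce.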
1243
1244
1245RDMA
1246----
1247
1248The "rdma" controller regulates the distribution and accounting of
1249RDMA resources.
1250
1251RDMA Interface Files
1252~~~~~~~~~~~~~~~~~~~~
1253
1254 rdma.max
1255 A read-write nested-keyed file that exists for all the cgroups
1256 except root that describes the currently configured resource limit
1257 for an RDMA/IB device.
1258
1259 Lines are keyed by device name and are not ordered.
1260 Each line contains space separated resource name and its configured
1261 limit that can be distributed.
1262
1263 The following nested keys are defined.
1264
1265 ========== =============================
1266 hca_handle Maximum number of HCA Handles
1267 hca_object Maximum number of HCA Objects
1268 ========== =============================
1269
1270 An example for mlx4 and ocrdma device follows::
1271
1272 mlx4_0 hca_handle=2 hca_object=2000
1273 ocrdma1 hca_handle=3 hca_object=max
1274
1275 rdma.current
1276 A read-only file that describes current resource usage.
1277 It exists for all the cgroups except root.
1278
1279 An example for mlx4 and ocrdma device follows::
1280
1281 mlx4_0 hca_handle=1 hca_object=20
1282 ocrdma1 hca_handle=1 hca_object=23
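
Combining the two files gives the remaining headroom per device. The
Python sketch below is only an illustration with hypothetical function
names; it treats a limit of "max" as unlimited and reports it as None.

```python
def parse_device_lines(text):
    """Map each device name to a dict of resource name -> value string."""
    devices = {}
    for line in text.splitlines():
        fields = line.split()
        if fields:
            devices[fields[0]] = dict(p.split("=", 1) for p in fields[1:])
    return devices


def remaining(max_text, current_text):
    """Per-device headroom: limit minus usage, or None when unlimited."""
    limits = parse_device_lines(max_text)
    usage = parse_device_lines(current_text)
    headroom = {}
    for device, resources in limits.items():
        headroom[device] = {}
        for name, limit in resources.items():
            used = int(usage.get(device, {}).get(name, 0))
            headroom[device][name] = None if limit == "max" else int(limit) - used
    return headroom
```

Applied to the two example outputs above, mlx4_0 has one HCA handle and
1980 HCA objects left, while ocrdma1 has two HCA handles left and no
limit on HCA objects.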
1283
1284
1285Misc
1286----
1287
1288perf_event
1289~~~~~~~~~~
1290
1291perf_event controller, if not mounted on a legacy hierarchy, is
1292automatically enabled on the v2 hierarchy so that perf events can
1293always be filtered by cgroup v2 path. The controller can still be
1294moved to a legacy hierarchy after v2 hierarchy is populated.
1295
1296
1297Namespace
1298=========
1299
1300Basics
1301------
1302
1303cgroup namespace provides a mechanism to virtualize the view of the
1304"/proc/$PID/cgroup" file and cgroup mounts. The CLONE_NEWCGROUP clone
1305flag can be used with clone(2) and unshare(2) to create a new cgroup
1306namespace. The process running inside the cgroup namespace will have
1307its "/proc/$PID/cgroup" output restricted to cgroupns root. The
1308cgroupns root is the cgroup of the process at the time of creation of
1309the cgroup namespace.
1310
1311Without cgroup namespace, the "/proc/$PID/cgroup" file shows the
1312complete path of the cgroup of a process. In a container setup where
1313a set of cgroups and namespaces are intended to isolate processes the
1314"/proc/$PID/cgroup" file may leak potential system level information
1315to the isolated processes. For example::
1316
1317 # cat /proc/self/cgroup
1318 0::/batchjobs/container_id1
1319
1320The path '/batchjobs/container_id1' can be considered as system-data
1321and undesirable to expose to the isolated processes. cgroup namespace
1322can be used to restrict visibility of this path. For example, before
1323creating a cgroup namespace, one would see::
1324
1325 # ls -l /proc/self/ns/cgroup
1326 lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
1327 # cat /proc/self/cgroup
1328 0::/batchjobs/container_id1
1329
1330After unsharing a new namespace, the view changes::
1331
1332 # ls -l /proc/self/ns/cgroup
1333 lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
1334 # cat /proc/self/cgroup
1335 0::/
1336
1337When some thread from a multi-threaded process unshares its cgroup
1338namespace, the new cgroupns gets applied to the entire process (all
1339the threads). This is natural for the v2 hierarchy; however, for the
1340legacy hierarchies, this may be unexpected.
1341
1342A cgroup namespace is alive as long as there are processes inside or
1343mounts pinning it. When the last usage goes away, the cgroup
1344namespace is destroyed. The cgroupns root and the actual cgroups
1345remain.
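
The "/proc/$PID/cgroup" lines shown above consist of three
colon-separated fields: a hierarchy ID, a controller list and a path.
The v2 entry has hierarchy ID 0 and an empty controller list. The
Python sketch below, with hypothetical function names, extracts the v2
path from such text.

```python
def parse_proc_cgroup(text):
    """Return a list of (hierarchy_id, controllers, path) tuples."""
    entries = []
    for line in text.splitlines():
        if not line:
            continue
        hierarchy_id, controllers, path = line.split(":", 2)
        names = controllers.split(",") if controllers else []
        entries.append((int(hierarchy_id), names, path))
    return entries


def v2_path(text):
    """Return the cgroup v2 path, or None if there is no v2 entry."""
    for hierarchy_id, controllers, path in parse_proc_cgroup(text):
        if hierarchy_id == 0 and not controllers:
            return path
    return None
```

Inside a fresh cgroup namespace the same code would simply return "/",
as in the second example above.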
1346
1347
1348The Root and Views
1349------------------
1350
1351The 'cgroupns root' for a cgroup namespace is the cgroup in which the
1352process calling unshare(2) is running. For example, if a process in
1353/batchjobs/container_id1 cgroup calls unshare, cgroup
1354/batchjobs/container_id1 becomes the cgroupns root. For the
1355init_cgroup_ns, this is the real root ('/') cgroup.
1356
1357The cgroupns root cgroup does not change even if the namespace creator
1358process later moves to a different cgroup::
1359
1360 # ~/unshare -c # unshare cgroupns in some cgroup
1361 # cat /proc/self/cgroup
1362 0::/
1363 # mkdir sub_cgrp_1
1364 # echo 0 > sub_cgrp_1/cgroup.procs
1365 # cat /proc/self/cgroup
1366 0::/sub_cgrp_1
1367
1368Each process gets its namespace-specific view of "/proc/$PID/cgroup".
1369
1370Processes running inside the cgroup namespace will be able to see
1371cgroup paths (in /proc/self/cgroup) only inside their root cgroup.
1372From within an unshared cgroupns::
1373
1374 # sleep 100000 &
1375 [1] 7353
1376 # echo 7353 > sub_cgrp_1/cgroup.procs
1377 # cat /proc/7353/cgroup
1378 0::/sub_cgrp_1
1379
1380From the initial cgroup namespace, the real cgroup path will be
1381visible::
1382
1383 $ cat /proc/7353/cgroup
1384 0::/batchjobs/container_id1/sub_cgrp_1
1385
1386From a sibling cgroup namespace (that is, a namespace rooted at a
1387different cgroup), the cgroup path relative to its own cgroup
1388namespace root will be shown. For instance, if PID 7353's cgroup
1389namespace root is at '/batchjobs/container_id2', then it will see::
1390
1391 # cat /proc/7353/cgroup
1392 0::/../container_id2/sub_cgrp_1
1393
1394Note that the relative path always starts with '/' to indicate that
1395it is relative to the cgroup namespace root of the caller.
1396
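The way a path is displayed relative to a cgroupns root can be modelled
in a few lines of Python. This is an illustrative model of the output
format only, not the kernel's implementation, and the function name is
hypothetical.

```python
import posixpath


def cgroup_view(cgroup_path, ns_root):
    """Show an absolute cgroup path as seen from the given cgroupns root."""
    relative = posixpath.relpath(cgroup_path, ns_root)
    # The displayed path always starts with '/', even when it climbs
    # out of the namespace root with '..' components.
    return "/" if relative == "." else "/" + relative
```

With this model, "/batchjobs/container_id1/sub_cgrp_1" viewed from the
root "/batchjobs/container_id1" is shown as "/sub_cgrp_1", and
"/batchjobs/container_id2" viewed from the same root is shown as
"/../container_id2".
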
1397
1398Migration and setns(2)
1399----------------------
1400
1401Processes inside a cgroup namespace can move into and out of the
1402namespace root if they have proper access to external cgroups. For
1403example, from inside a namespace with cgroupns root at
1404/batchjobs/container_id1, and assuming that the global hierarchy is
1405still accessible inside cgroupns::
1406
1407 # cat /proc/7353/cgroup
1408 0::/sub_cgrp_1
1409 # echo 7353 > batchjobs/container_id2/cgroup.procs
1410 # cat /proc/7353/cgroup
1411 0::/../container_id2
1412
1413Note that this kind of setup is not encouraged. A task inside cgroup
1414namespace should only be exposed to its own cgroupns hierarchy.
1415
1416setns(2) to another cgroup namespace is allowed when:
1417
1418(a) the process has CAP_SYS_ADMIN against its current user namespace
1419(b) the process has CAP_SYS_ADMIN against the target cgroup
1420 namespace's userns
1421
1422No implicit cgroup changes happen with attaching to another cgroup
1423namespace. It is expected that someone moves the attaching
1424process under the target cgroup namespace root.
1425
1426
1427Interaction with Other Namespaces
1428---------------------------------
1429
1430Namespace specific cgroup hierarchy can be mounted by a process
1431running inside a non-init cgroup namespace::
1432
1433 # mount -t cgroup2 none $MOUNT_POINT
1434
1435This will mount the unified cgroup hierarchy with cgroupns root as the
1436filesystem root. The process needs CAP_SYS_ADMIN against its user and
1437mount namespaces.
1438
1439The virtualization of /proc/self/cgroup file combined with restricting
1440the view of cgroup hierarchy by namespace-private cgroupfs mount
1441provides a properly isolated cgroup view inside the container.
1442
1443
1444Information on Kernel Programming
1445=================================
1446
1447This section contains kernel programming information in the areas
1448where interacting with cgroup is necessary. cgroup core and
1449controllers are not covered.
1450
1451
1452Filesystem Support for Writeback
1453--------------------------------
1454
1455A filesystem can support cgroup writeback by updating
1456address_space_operations->writepage[s]() to annotate bio's using the
1457following two functions.
1458
1459 wbc_init_bio(@wbc, @bio)
1460 Should be called for each bio carrying writeback data and
1461 associates the bio with the inode's owner cgroup. Can be
1462 called anytime between bio allocation and submission.
1463
1464 wbc_account_io(@wbc, @page, @bytes)
1465 Should be called for each data segment being written out.
1466 While this function doesn't care exactly when it's called
1467 during the writeback session, it's the easiest and most
1468 natural to call it as data segments are added to a bio.
1469
1470With writeback bio's annotated, cgroup support can be enabled per
1471super_block by setting SB_I_CGROUPWB in ->s_iflags. This allows for
1472selective disabling of cgroup writeback support which is helpful when
1473certain filesystem features, e.g. journaled data mode, are
1474incompatible.
1475
1476wbc_init_bio() binds the specified bio to its cgroup. Depending on
1477the configuration, the bio may be executed at a lower priority and if
1478the writeback session is holding shared resources, e.g. a journal
1479entry, may lead to priority inversion. There is no one easy solution
1480for the problem. Filesystems can try to work around specific problem
1481cases by skipping wbc_init_bio() or using bio_associate_blkcg()
1482directly.
1483
1484
1485Deprecated v1 Core Features
1486===========================
1487
1488- Multiple hierarchies including named ones are not supported.
1489
1490- None of the v1 mount options are supported.
1491
1492- The "tasks" file is removed and "cgroup.procs" is not sorted.
1493
1494- "cgroup.clone_children" is removed.
1495
1496- /proc/cgroups is meaningless for v2. Use "cgroup.controllers" file
1497 at the root instead.
1498
1499
1500Issues with v1 and Rationales for v2
1501====================================
1502
1503Multiple Hierarchies
1504--------------------
1505
1506cgroup v1 allowed an arbitrary number of hierarchies and each
1507hierarchy could host any number of controllers. While this seemed to
1508provide a high level of flexibility, it wasn't useful in practice.
1509
1510For example, as there is only one instance of each controller, utility
1511type controllers such as freezer which can be useful in all
1512hierarchies could only be used in one. The issue is exacerbated by
1513the fact that controllers couldn't be moved to another hierarchy once
1514hierarchies were populated. Another issue was that all controllers
1515bound to a hierarchy were forced to have exactly the same view of the
1516hierarchy. It wasn't possible to vary the granularity depending on
1517the specific controller.
1518
1519In practice, these issues heavily limited which controllers could be
1520put on the same hierarchy and most configurations resorted to putting
1521each controller on its own hierarchy. Only closely related ones, such
1522as the cpu and cpuacct controllers, made sense to be put on the same
1523hierarchy. This often meant that userland ended up managing multiple
1524similar hierarchies repeating the same steps on each hierarchy
1525whenever a hierarchy management operation was necessary.
1526
1527Furthermore, support for multiple hierarchies came at a steep cost.
1528It greatly complicated cgroup core implementation but more importantly
1529the support for multiple hierarchies restricted how cgroup could be
1530used in general and what controllers were able to do.
1531
1532There was no limit on how many hierarchies there might be, which meant
1533that a thread's cgroup membership couldn't be described in finite
1534length. The key might contain any number of entries and was unlimited
1535in length, which made it highly awkward to manipulate and led to
1536addition of controllers which existed only to identify membership,
1537which in turn exacerbated the original problem of proliferating number
1538of hierarchies.
1539
1540Also, as a controller couldn't have any expectation regarding the
1541topologies of hierarchies other controllers might be on, each
1542controller had to assume that all other controllers were attached to
1543completely orthogonal hierarchies. This made it impossible, or at
1544least very cumbersome, for controllers to cooperate with each other.
1545
1546In most use cases, putting controllers on hierarchies which are
1547completely orthogonal to each other isn't necessary. What usually is
1548called for is the ability to have differing levels of granularity
1549depending on the specific controller. In other words, hierarchy may
1550be collapsed from leaf towards root when viewed from specific
1551controllers. For example, a given configuration might not care about
1552how memory is distributed beyond a certain level while still wanting
1553to control how CPU cycles are distributed.
1554
1555
1556Thread Granularity
1557------------------
1558
1559cgroup v1 allowed threads of a process to belong to different cgroups.
1560This didn't make sense for some controllers and those controllers
1561ended up implementing different ways to ignore such situations but
1562much more importantly it blurred the line between API exposed to
1563individual applications and system management interface.
1564
1565Generally, in-process knowledge is available only to the process
1566itself; thus, unlike service-level organization of processes,
1567categorizing threads of a process requires active participation from
1568the application which owns the target process.
1569
1570cgroup v1 had an ambiguously defined delegation model which got abused
1571in combination with thread granularity. cgroups were delegated to
1572individual applications so that they can create and manage their own
1573sub-hierarchies and control resource distributions along them. This
1574effectively raised cgroup to the status of a syscall-like API exposed
1575to lay programs.
1576
1577First of all, cgroup has a fundamentally inadequate interface to be
1578exposed this way. For a process to access its own knobs, it has to
1579extract the path on the target hierarchy from /proc/self/cgroup,
1580construct the path by appending the name of the knob to the path, open
1581and then read and/or write to it. This is not only extremely clunky
1582and unusual but also inherently racy. There is no conventional way to
1583define transaction across the required steps and nothing can guarantee
1584that the process would actually be operating on its own sub-hierarchy.
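
The steps described above can be sketched in Python to show how
indirect the procedure is. The function name, the mount point and the
sample line are assumptions made for this illustration only.

```python
import posixpath


def own_knob_path(proc_self_cgroup, controller, mount_point, knob):
    """Build the path of a knob from /proc/self/cgroup text.

    mount_point is wherever the target hierarchy happens to be
    mounted; the caller has to know it by some other means.
    """
    for line in proc_self_cgroup.splitlines():
        hierarchy_id, controllers, path = line.split(":", 2)
        if controller in controllers.split(","):
            # Nothing ties this lookup to the later open(): the process
            # may be migrated to another cgroup in between.
            return posixpath.join(mount_point, path.lstrip("/"), knob)
    raise LookupError("controller not found in /proc/self/cgroup")
```

Even after the path has been constructed, opening and accessing the
file are further separate steps, so the result can already be stale by
the time it is used.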
1585
1586cgroup controllers implemented a number of knobs which would never be
1587accepted as public APIs because they were just adding control knobs to
1588system-management pseudo filesystem. cgroup ended up with interface
1589knobs which were not properly abstracted or refined and directly
1590revealed kernel internal details. These knobs got exposed to
1591individual applications through the ill-defined delegation mechanism
1592effectively abusing cgroup as a shortcut to implementing public APIs
1593without going through the required scrutiny.
1594
1595This was painful for both userland and kernel. Userland ended up with
1596misbehaving and poorly abstracted interfaces and the kernel ended up
1597exposing and being locked into constructs inadvertently.
1598
1599
1600Competition Between Inner Nodes and Threads
1601-------------------------------------------
6c292092
TH
1602
1603cgroup v1 allowed threads to be in any cgroups which created an
1604interesting problem where threads belonging to a parent cgroup and its
1605children cgroups competed for resources. This was nasty as two
1606different types of entities competed and there was no obvious way to
1607settle it. Different controllers did different things.
1608
1609The cpu controller considered threads and cgroups as equivalents and
1610mapped nice levels to cgroup weights. This worked for some cases but
1611fell flat when children wanted to be allocated specific ratios of CPU
1612cycles and the number of internal threads fluctuated - the ratios
1613constantly changed as the number of competing entities fluctuated.
1614There also were other issues. The mapping from nice level to weight
1615wasn't obvious or universal, and there were various other knobs which
1616simply weren't available for threads.
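A small worked example may make the fluctuating ratios concrete. The
sketch below is a simplified model, not scheduler code: it assumes a
single child cgroup competing with the parent's internal threads, that
every entity is continuously runnable, and that a nice-0 thread weighs
the same as the default cpu.shares value of 1024. The function name
``child_cpu_fraction`` is made up for the example.

```python
NICE_0_WEIGHT = 1024  # weight of a nice-0 thread, same as default cpu.shares


def child_cpu_fraction(child_shares, internal_threads,
                       thread_weight=NICE_0_WEIGHT):
    """Fraction of the parent's CPU time that one child cgroup receives
    when it competes with `internal_threads` threads of the parent.

    Simplified model: all entities are assumed to be always runnable.
    """
    total = child_shares + internal_threads * thread_weight
    return child_shares / total


# The child's configuration never changes, yet its share of CPU does.
for n in (0, 1, 3):
    print(n, child_cpu_fraction(1024, n))
```

Under these assumptions the child receives all of the parent's CPU
time with no internal threads, half with one and a quarter with three,
even though its own weight was never touched.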

The io controller implicitly created a hidden leaf node for each
cgroup to host the threads. The hidden leaf had its own copies of all
the knobs with ``leaf_`` prefixed. While this allowed equivalent
control over internal threads, it came with serious drawbacks. It
always added an extra layer of nesting which wouldn't be necessary
otherwise, made the interface messy and significantly complicated the
implementation.

The memory controller didn't have a way to control what happened
between internal tasks and child cgroups and the behavior was not
clearly defined. There were attempts to add ad-hoc behaviors and
knobs to tailor the behavior to specific workloads which would have
led to problems extremely difficult to resolve in the long term.

Multiple controllers struggled with internal tasks and came up with
different ways to deal with them; unfortunately, all the approaches
were severely flawed and, furthermore, the widely different behaviors
made cgroup as a whole highly inconsistent.

This clearly is a problem which needs to be addressed from cgroup core
in a uniform way.


Other Interface Issues
----------------------

cgroup v1 grew without oversight and developed a large number of
idiosyncrasies and inconsistencies. One issue on the cgroup core side
was how an empty cgroup was notified - a userland helper binary was
forked and executed for each event. The event delivery wasn't
recursive or delegatable. The limitations of the mechanism also led
to an in-kernel event delivery filtering mechanism, further
complicating the interface.

Controller interfaces were problematic too. An extreme example is
controllers completely ignoring hierarchical organization and treating
all cgroups as if they were all located directly under the root
cgroup. Some controllers exposed a large amount of inconsistent
implementation details to userland.

There also was no consistency across controllers. When a new cgroup
was created, some controllers defaulted to not imposing extra
restrictions while others disallowed any resource usage until
explicitly configured. Configuration knobs for the same type of
control used widely differing naming schemes and formats. Statistics
and information knobs were named arbitrarily and used different
formats and units even in the same controller.

cgroup v2 establishes common conventions where appropriate and updates
controllers so that they expose minimal and consistent interfaces.


Controller Issues and Remedies
------------------------------

Memory
~~~~~~

The original lower boundary, the soft limit, is defined as a limit
that is unset by default. As a result, the set of cgroups that
global reclaim prefers is opt-in, rather than opt-out. The costs for
optimizing these mostly negative lookups are so high that the
implementation, despite its enormous size, does not even provide the
basic desirable behavior. First off, the soft limit has no
hierarchical meaning. All configured groups are organized in a global
rbtree and treated like equal peers, regardless of where they are
located in the hierarchy. This makes subtree delegation impossible.
Second, the soft limit reclaim pass is so aggressive that it not only
introduces high allocation latencies into the system, but also impacts
system performance due to overreclaim, to the point where the feature
becomes self-defeating.

The memory.low boundary on the other hand is a top-down allocated
reserve. A cgroup enjoys reclaim protection when it and all its
ancestors are below their low boundaries, which makes delegation of
subtrees possible. Secondly, new cgroups have no reserve by default
and in the common case most cgroups are eligible for the preferred
reclaim pass. This allows the new low boundary to be efficiently
implemented with just a minor addition to the generic reclaim code,
without the need for out-of-band data structures and reclaim passes.
Because the generic reclaim code considers all cgroups except for the
ones running low in the preferred first reclaim pass, overreclaim of
individual groups is eliminated as well, resulting in much better
overall workload performance.
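The protection rule stated above (a cgroup is protected only while it
and all of its ancestors are below their low boundaries) can be
modelled in a few lines. This is a toy model for illustration only,
not the kernel's reclaim logic; the ``Cgroup`` class, its fields and
the numbers used are assumptions, and the walk simply stops at the
topmost modelled ancestor.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cgroup:
    """Toy stand-in for a cgroup; units are arbitrary (say, MiB)."""
    name: str
    usage: int
    low: int = 0                       # no reserve unless configured
    parent: Optional["Cgroup"] = None


def is_protected(cg):
    """True if `cg` and every modelled ancestor is below its low boundary."""
    node = cg
    while node is not None:
        if node.usage >= node.low:
            return False
        node = node.parent
    return True


subtree = Cgroup("delegated", usage=300, low=512)
a = Cgroup("a", usage=100, low=200, parent=subtree)
b = Cgroup("b", usage=250, low=200, parent=subtree)
fresh = Cgroup("fresh", usage=10, parent=subtree)

print(is_protected(a), is_protected(b), is_protected(fresh))
```

In this model ``a`` is protected, ``b`` is not because it exceeds its
own boundary, and ``fresh`` is not because it has no reserve. Raising
the usage of ``delegated`` above 512 would remove the protection from
``a`` as well, which is what keeps a delegated subtree within the
reserve handed down from above.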

The original high boundary, the hard limit, is defined as a strict
limit that cannot budge, even if the OOM killer has to be called.
But this generally goes against the goal of making the most out of the
available memory. The memory consumption of workloads varies during
runtime, and that requires users to overcommit. But doing that with a
strict upper limit requires either a fairly accurate prediction of the
working set size or adding slack to the limit. Since working set size
estimation is hard and error prone, and getting it wrong results in
OOM kills, most users tend to err on the side of a looser limit and
end up wasting precious resources.

The memory.high boundary on the other hand can be set much more
conservatively. When hit, it throttles allocations by forcing them
into direct reclaim to work off the excess, but it never invokes the
OOM killer. As a result, a high boundary that is chosen too
aggressively will not terminate the processes, but instead it will
lead to gradual performance degradation. The user can monitor this
and make corrections until the minimal memory footprint that still
gives acceptable performance is found.

In extreme cases, with many concurrent allocations and a complete
breakdown of reclaim progress within the group, the high boundary can
be exceeded. But even then it's mostly better to satisfy the
allocation from the slack available in other groups or the rest of the
system than killing the group. Otherwise, memory.max is there to
limit this type of spillover and ultimately contain buggy or even
malicious applications.

Setting the original memory.limit_in_bytes below the current usage was
subject to a race condition, where concurrent charges could cause the
limit setting to fail. memory.max on the other hand will first set the
limit to prevent new charges, and then reclaim and OOM kill until the
new limit is met - or the task writing to memory.max is killed.

The combined memory+swap accounting and limiting is replaced by real
control over swap space.

The main argument for a combined memory+swap facility in the original
cgroup design was that global or parental pressure would always be
able to swap all anonymous memory of a child group, regardless of the
child's own (possibly untrusted) configuration. However, untrusted
groups can sabotage swapping by other means - such as referencing
their anonymous memory in a tight loop - and an admin cannot assume
full swappability when overcommitting untrusted jobs.

For trusted jobs, on the other hand, a combined counter is not an
intuitive userspace interface, and it flies in the face of the idea
that cgroup controllers should account and limit specific physical
resources. Swap space is a resource like all others in the system,
and that's why unified hierarchy allows distributing it separately.