# Control Group APIs and Delegation

*Intended audience: hackers working on userspace subsystems that require direct
cgroup access, such as container managers and similar.*

So you are wondering about resource management with systemd, you know Linux
control groups (cgroups) a bit and are trying to integrate your software with
what systemd has to offer there. Here's a bit of documentation about the
concepts and interfaces involved with this.

What's described here has been part of systemd and documented since around
v205. However, it has been updated and improved substantially since then, even
though the concepts stayed mostly the same. This is an attempt to provide more
comprehensive up-to-date information about all this, particularly in light of
the poor implementations of the components interfacing with systemd in current
container managers.

Before you read on, please make sure you read the low-level [kernel
documentation about
cgroupsv2](https://www.kernel.org/doc/Documentation/cgroup-v2.txt). This
documentation then adds in the higher-level view from systemd.

This document augments the existing documentation we already have:

* [The New Control Group Interfaces](https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/)
* [Writing VM and Container Managers](https://www.freedesktop.org/wiki/Software/systemd/writing-vm-managers/)

These wiki documents are not as up to date as they should be, currently, but
the basic concepts still fully apply. You should read them too, if you do something
with cgroups and systemd, in particular as they shine more light on the various
D-Bus APIs provided. (That said, sooner or later we should probably fold that
wiki documentation into this very document, too.)

## Two Key Design Rules

Much of the philosophy behind these concepts is based on a couple of basic
design ideas of cgroupsv2 (which we however try to adapt as far as we can to
cgroupsv1 too). Specifically two cgroupsv2 rules are the most relevant:

1. The **no-processes-in-inner-nodes** rule: this means that it's not permitted
to have processes directly attached to a cgroup that also has child cgroups and
vice versa. A cgroup is either an inner node or a leaf node of the tree, and if
it's an inner node it may not contain processes directly, and if it's a leaf
node then it may not have child cgroups. (Note that there are some minor
exceptions to this rule, though. E.g. the root cgroup is special and allows
both processes and children, which is used in particular to maintain kernel
threads.)

2. The **single-writer** rule: this means that each cgroup only has a single
writer, i.e. a single process managing it. It's OK if different cgroups have
different processes managing them. However, only a single process should own a
specific cgroup, and when it does that ownership is exclusive, and nothing else
should manipulate it at the same time. This rule ensures that various pieces of
software don't step on each other's toes constantly.

These two rules have various effects. For example, one corollary of this is: if
your container manager creates and manages cgroups in the system's root cgroup
you violate rule #2, as the root cgroup is managed by systemd and hence off
limits to everybody else.

Note that rule #1 is generally enforced by the kernel if cgroupsv2 is used: as
soon as you add a process to a cgroup it is ensured the rule is not
violated. On cgroupsv1 this rule didn't exist, and hence isn't enforced, even
though it's a good thing to follow it then too. Rule #2 is not enforced on
either cgroupsv1 or cgroupsv2 (this is UNIX after all, in the general case
root can do anything, modulo SELinux and friends), but if you ignore it you'll
be in constant pain as various pieces of software will fight over cgroup
ownership.

Note that cgroupsv1 is currently the most deployed implementation, even though
it's semantically broken in many ways, and in many cases doesn't actually do
what people think it does. cgroupsv2 is where things are going, and most new
kernel features in this area are only added to cgroupsv2, and not cgroupsv1
anymore. For example cgroupsv2 provides proper cgroup-empty notifications, has
support for all kinds of per-cgroup BPF magic, supports secure delegation of
cgroup trees to less privileged processes and so on, which all are not
available on cgroupsv1.

## Three Different Tree Setups 🌳

systemd supports three different modes of setting up cgroups. Specifically:

1. **Unified**: this is the simplest mode, and exposes a pure cgroupsv2
logic. In this mode `/sys/fs/cgroup` is the only mounted cgroup API file system
and all available controllers are exclusively exposed through it.

2. **Legacy**: this is the traditional cgroupsv1 mode. In this mode the
various controllers each get their own cgroup file system mounted to
`/sys/fs/cgroup/<controller>/`. On top of that systemd manages its own cgroup
hierarchy for managing purposes as `/sys/fs/cgroup/systemd/`.

3. **Hybrid**: this is a hybrid between the unified and legacy mode. It's set
up mostly like legacy, except that there's also an additional hierarchy
`/sys/fs/cgroup/unified/` that contains the cgroupsv2 hierarchy. In this mode
compatibility with cgroupsv1 is retained while some cgroupsv2 features are
available too. This mode is a stopgap. Don't bother with this too much unless
you have too much free time.

To say this clearly, legacy and hybrid modes have no future. If you develop
software today and don't focus on the unified mode, then you are writing
software for yesterday, not tomorrow. They are primarily supported for
compatibility reasons and will not receive new features. Sorry.

Superficially, in legacy and hybrid modes it might appear that the parallel
cgroup hierarchies for each controller are orthogonal to each other. In
systemd they are not: the hierarchies of all controllers are always kept in
sync (at least mostly: sub-trees might be suppressed in certain hierarchies if
no controller usage is required for them). The fact that systemd keeps these
hierarchies in sync means that the legacy and hybrid hierarchies are
conceptually very close to the unified hierarchy. In particular this allows us
to talk of one specific cgroup and actually mean the same cgroup in all
available controller hierarchies. E.g. if we talk about the cgroup `/foo/bar/`
then we actually mean `/sys/fs/cgroup/cpu/foo/bar/` as well as
`/sys/fs/cgroup/memory/foo/bar/`, `/sys/fs/cgroup/pids/foo/bar/`, and so on.
Note that in cgroupsv2 the controller hierarchies aren't orthogonal, hence
thinking about them as orthogonal won't help you in the long run anyway.

If you wonder how to detect which of these three modes is currently used, use
`statfs()` on `/sys/fs/cgroup/`. If it reports `CGROUP2_SUPER_MAGIC` in its
`.f_type` field, then you are in unified mode. If it reports `TMPFS_MAGIC` then
you are either in legacy or hybrid mode. To distinguish these two cases, run
`statfs()` again on `/sys/fs/cgroup/unified/`. If that succeeds and reports
`CGROUP2_SUPER_MAGIC` you are in hybrid mode, otherwise you are in legacy mode.

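The decision logic above can be sketched in a few lines of Python. This is only
an illustration, not systemd code: Python's standard library does not expose
`.f_type` (`os.statvfs()` does not report it), so the actual `statfs()` call is
left to a caller-supplied function. The names `detect_cgroup_mode` and
`fs_type` are made up for this sketch. The two magic values are the ones
defined in the kernel's `<linux/magic.h>`.

```python
from typing import Callable, Optional

# Magic numbers as defined in the kernel's <linux/magic.h>.
CGROUP2_SUPER_MAGIC = 0x63677270
TMPFS_MAGIC = 0x01021994


def detect_cgroup_mode(fs_type: Callable[[str], Optional[int]]) -> str:
    """Classify the cgroup setup as 'unified', 'hybrid', 'legacy' or 'unknown'.

    `fs_type(path)` is a hypothetical helper that returns the `f_type` value
    statfs() reports for `path`, or None if the call fails.
    """
    root = fs_type("/sys/fs/cgroup/")
    if root == CGROUP2_SUPER_MAGIC:
        return "unified"
    if root == TMPFS_MAGIC:
        # Legacy and hybrid both mount a tmpfs here; only hybrid has an
        # additional cgroupsv2 hierarchy below unified/.
        if fs_type("/sys/fs/cgroup/unified/") == CGROUP2_SUPER_MAGIC:
            return "hybrid"
        return "legacy"
    # Neither magic value: not one of the three setups described above.
    return "unknown"
```
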
## systemd's Unit Types

The low-level kernel cgroups feature is exposed in systemd in three different
"unit" types. Specifically:

1. 💼 The `.service` unit type. This unit type is for units encapsulating
   processes systemd itself starts. Units of these types have cgroups that are
   the leaves of the cgroup tree the systemd instance manages (though possibly
   they might contain a sub-tree of their own managed by something else, made
   possible by the concept of delegation, see below). Service units are usually
   instantiated based on a unit file on disk that describes the command line to
   invoke and other properties of the service. However, service units may also
   be declared and started programmatically at runtime through a D-Bus API
   (which is called *transient* services).

2. 👓 The `.scope` unit type. This is very similar to `.service`. The main
   difference: the processes the units of this type encapsulate are forked off
   by some unrelated manager process, and that manager asked systemd to expose
   them as a unit. Unlike services, scopes can only be declared and started
   programmatically, i.e. are always transient. That's because they encapsulate
   processes forked off by something else, i.e. existing runtime objects, and
   hence cannot really be defined fully in 'offline' concepts such as unit
   files.

3. 🔪 The `.slice` unit type. Units of this type do not directly contain any
   processes. Units of this type are the inner nodes of part of the cgroup tree
   the systemd instance manages. Much like services, slices can be defined
   either on disk with unit files or programmatically as transient units.

Slices expose the trunk and branches of a tree, and scopes and services are
attached to those branches as leaves. The idea is that scopes and services can
be moved around though, i.e. assigned to a different slice if needed.

The naming of slice units directly maps to the cgroup tree path. This is not
the case for service and scope units however. A slice named `foo-bar-baz.slice`
maps to a cgroup `/foo.slice/foo-bar.slice/foo-bar-baz.slice/`. A service
`quux.service` which is attached to the slice `foo-bar-baz.slice` maps to the
cgroup `/foo.slice/foo-bar.slice/foo-bar-baz.slice/quux.service/`.

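To make the naming rule concrete, here is a small Python sketch that reproduces
the two examples above. It is for illustration only: the function names are
made up, name escaping is ignored, and (as the Don'ts section of this document
points out) this mapping is not public API, so real programs should ask systemd
for the path rather than compute it.

```python
def slice_to_cgroup_path(slice_name: str) -> str:
    """Expand a slice unit name into its cgroup path, one level per dash."""
    suffix = ".slice"
    if not slice_name.endswith(suffix):
        raise ValueError(f"not a slice unit name: {slice_name!r}")
    stem = slice_name[: -len(suffix)]
    if stem == "-":
        # "-.slice" is the root slice and maps to the top of the tree.
        return "/"
    parts = stem.split("-")
    path = "/"
    # "foo-bar-baz" expands to foo.slice, foo-bar.slice, foo-bar-baz.slice.
    for depth in range(1, len(parts) + 1):
        path += "-".join(parts[:depth]) + suffix + "/"
    return path


def unit_cgroup_path(unit_name: str, slice_name: str) -> str:
    """Path of a service or scope unit attached to the given slice."""
    return slice_to_cgroup_path(slice_name) + unit_name + "/"
```
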
By default systemd sets up four slice units:

1. `-.slice` is the root slice, i.e. the parent of everything else. On the host
   system it maps directly to the top-level directory of cgroupsv2.

2. `system.slice` is where system services are by default placed, unless
   configured otherwise.

3. `user.slice` is where user sessions are placed. Each user gets a slice of
   its own below that.

4. `machines.slice` is where VMs and containers are supposed to be
   placed. `systemd-nspawn` makes use of this by default, and you're very welcome
   to place your containers and VMs there too if you hack on managers for those.

Users may define any number of additional slices they like though, the four
above are just the defaults.

## Delegation

Container managers and suchlike often want to control cgroups directly using
the raw kernel APIs. That's entirely fine and supported, as long as proper
*delegation* is followed. Delegation is a concept we inherited from cgroupsv2,
but we expose it on cgroupsv1 too. Delegation means that some parts of the
cgroup tree may be managed by different managers than others. As long as it is
clear which manager manages which part of the tree each one can do within its
sub-graph of the tree whatever it wants.

Only sub-trees can be delegated (though whoever decides to request a sub-tree
can delegate sub-sub-trees further to somebody else if they like). Delegation
takes place at a specific cgroup: in systemd there's a `Delegate=` property you
can set for a service or scope unit. If you do, it's the cut-off point for
systemd's cgroup management: the unit itself is managed by systemd, i.e. all
its attributes are managed exclusively by systemd, however your program may
create/remove sub-cgroups inside it freely, and those then become exclusive
property of your program, systemd won't touch them. All attributes of *those*
sub-cgroups can be manipulated freely and exclusively by your program.

By turning on the `Delegate=` property for a scope or service you get a few
guarantees:

1. systemd won't fiddle with your sub-tree of the cgroup tree anymore. It won't
   change attributes of any cgroups below it, nor will it create or remove any
   cgroups thereunder, nor migrate processes across the boundaries of that
   sub-tree as it deems useful anymore.

2. If your service makes use of the `User=` functionality, then the sub-tree
   will be `chown()`ed to the indicated user so that it can correctly create
   cgroups below it. Note however that systemd will do that only in the unified
   hierarchy (in unified and hybrid mode) as well as on systemd's own private
   hierarchy (in legacy and hybrid mode). It won't pass ownership of the legacy
   controller hierarchies. Delegation to less privileged processes is not safe
   in cgroupsv1 (as a limitation of the kernel), hence systemd won't facilitate
   access to it.

3. Any BPF IP filter programs systemd installs will be installed with
   `BPF_F_ALLOW_MULTI` so that your program can install additional ones.

In unit files the `Delegate=` property is superficially exposed as
boolean. However, since v236 it optionally takes a list of controller names
instead. If so, delegation is requested for the listed controllers
specifically. Note that this only encodes a request. Depending on various
parameters it might happen that your service actually will get fewer
controllers delegated (for example, because the controller is not available on
the current kernel or was turned off) or more. If no list is specified
(i.e. the property simply set to `yes`) then all available controllers are
delegated.

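Since the controller list is only a request, a program may want to check what
it actually got. On the unified hierarchy the kernel lists the controllers
available to a cgroup in its `cgroup.controllers` file, as a space-separated
list. A minimal Python sketch of such a check follows. The function names are
made up for illustration, and `cgroup_dir` is assumed to be the file system
path of your delegated cgroup.

```python
from pathlib import Path
from typing import Iterable, Set


def parse_controllers(text: str) -> Set[str]:
    """Parse the space-separated content of a cgroup.controllers file."""
    return set(text.split())


def missing_controllers(cgroup_dir: str, wanted: Iterable[str]) -> Set[str]:
    """Return the requested controllers that are not available in cgroup_dir."""
    available = parse_controllers(
        Path(cgroup_dir, "cgroup.controllers").read_text()
    )
    return set(wanted) - available
```
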
Let's stress one thing: delegation is available on scope and service units
only. It's expressly not available on slice units. Why? Because slice units are
our *inner* nodes of the cgroup trees and we freely attach services and scopes
to them. If we allowed delegation on slice units then this would mean that
both systemd and your own manager would create/delete cgroups below the slice
unit and that conflicts with the single-writer rule.

So, if you want to do your own raw cgroups kernel level access, then allocate a
scope unit, or a service unit (or just use the service unit you already have
for your service code), and turn on delegation for it.

## Three Scenarios

Let's say you write a container manager, and you wonder what to do regarding
cgroups for it, as you want your manager to be able to run on systemd systems.

You basically have three options:

1. 😊 The *integration-is-good* option. For this, you register each container
   you have either as a systemd service (i.e. let systemd invoke the executor
   binary for you) or a systemd scope (i.e. your manager executes the binary
   directly, but then tells systemd about it). In this mode the administrator
   can use the usual systemd resource management and reporting commands
   individually on those containers. By turning on `Delegate=` for these scopes
   or services you make it possible to run cgroup-enabled programs in your
   containers, for example a nested systemd instance. This option has two
   sub-options:

   a. You transiently register the service or scope by directly contacting
      systemd via D-Bus. In this case systemd will just manage the unit for you
      and nothing else.

   b. Instead you register the service or scope through `systemd-machined`
      (also via D-Bus). This mini-daemon is basically just a proxy for the same
      operations as in a. The main benefit of this: this way you let the system
      know that what you are registering is a container, and this opens up
      certain additional integration points. For example, `journalctl -M` can
      then be used to directly look into any container's journal logs (should
      the container run systemd inside), or `systemctl -M` can be used to
      directly invoke systemd operations inside the containers. Moreover tools
      like "ps" can then show you to which container a process belongs (`ps -eo
      pid,comm,machine`), and even gnome-system-monitor supports it.

2. 🙁 The *i-like-islands* option. If all you care about is your own cgroup tree,
   and you want to have as little to do with systemd as possible and have no
   interest in integration with the rest of the system, then this is a valid
   option. For this all you have to do is turn on `Delegate=` for your main
   manager daemon. Then figure out the cgroup systemd placed your daemon in:
   you can now freely create sub-cgroups beneath it. Don't forget the
   *no-processes-in-inner-nodes* rule however: you have to move your main
   daemon process out of that cgroup (and into a sub-cgroup) before you can
   start further processes in any of your sub-cgroups.

3. 🙁 The *i-like-continents* option. In this option you'd leave your manager
   daemon where it is, and would not turn on delegation on its unit. However,
   as first thing you register a new scope unit with systemd, and that scope
   unit would have `Delegate=` turned on, and then you place all your
   containers underneath it. From systemd's PoV there'd be two units: your
   manager service and the big scope that contains all your containers in one.

BTW: if for whatever reason you say "I hate D-Bus, I'll never call any D-Bus
API, kthxbye", then options #1 and #3 are not available, as they generally
involve talking to systemd from your program code, via D-Bus. You still have
option #2 in that case however, as you can simply set `Delegate=` in your
service's unit file and you are done and have your own sub-tree. In fact, #2 is
the one option that allows you to completely ignore systemd's existence: you
can entirely generically follow the single rule that you just use the cgroup
you are started in, and everything below it, whatever that might be. That said,
maybe if you dislike D-Bus and systemd that much, the better approach might be
to work on that, and widen your horizon a bit. You are welcome.

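For option #2, "figure out the cgroup you are started in" usually boils down to
reading `/proc/self/cgroup`, where each line has the form
`hierarchy-ID:controller-list:cgroup-path`. Here is a hedged Python sketch of
that step. The function names are invented for this example, and the choice to
prefer the cgroupsv2 entry and fall back to systemd's own `name=systemd`
hierarchy is an assumption of the sketch, not a documented systemd algorithm.

```python
from typing import Optional


def own_cgroup_path(proc_cgroup: str) -> Optional[str]:
    """Extract the cgroup path from the text of /proc/<pid>/cgroup.

    The cgroupsv2 entry ("0::<path>") wins if present; otherwise the entry
    of systemd's private cgroupsv1 hierarchy ("name=systemd") is used.
    """
    fallback = None
    for line in proc_cgroup.splitlines():
        fields = line.split(":", 2)
        if len(fields) != 3:
            continue  # skip anything that is not ID:controllers:path
        hierarchy_id, controllers, path = fields
        if hierarchy_id == "0" and controllers == "":
            return path
        if controllers == "name=systemd":
            fallback = path
    return fallback


def cgroup_fs_dir(cgroup_path: str, mode: str) -> str:
    """Map a cgroup path to a directory, given 'unified', 'hybrid' or 'legacy'."""
    base = {
        "unified": "/sys/fs/cgroup",
        "hybrid": "/sys/fs/cgroup/unified",
        "legacy": "/sys/fs/cgroup/systemd",
    }[mode]
    return base + cgroup_path
```

A daemon would typically call `own_cgroup_path(open("/proc/self/cgroup").read())`
once at startup and create its sub-cgroups below the resulting directory.
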
## Controller Support

systemd supports a number of controllers (but not all). Specifically, supported
are:

* on cgroupsv1: `cpu`, `cpuacct`, `blkio`, `memory`, `devices`, `pids`
* on cgroupsv2: `cpu`, `io`, `memory`, `pids`

It is our intention to natively support all cgroupsv2 controllers as they are
added to the kernel. However, regarding cgroupsv1: at this point we will not
add support for any other controllers anymore. This means systemd currently
does not and will never manage the following controllers on cgroupsv1:
`freezer`, `cpuset`, `net_cls`, `perf_event`, `net_prio`, `hugetlb`. Why not?
Depending on the case, either their API semantics or implementations aren't
really usable, or it's very clear they have no future on cgroupsv2, and we
won't add new code for stuff that clearly has no future.

Effectively this means that all those mentioned cgroupsv1 controllers are up
for grabs: systemd won't manage them, and hence won't delegate them to your
code (however, systemd will still mount their hierarchies, simply because it
mounts all controller hierarchies it finds available in the kernel). If you
decide to use them, then that's fine, but systemd won't help you with it (but
also not interfere with it). To be nice to other tenants it might be wise to
replicate the cgroup hierarchies of the other controllers in them too however,
but of course that's between you and those other tenants, and systemd won't
care. Replicating the cgroup hierarchies in those unsupported controllers would
mean replicating the full cgroup paths in them, and hence the prefixing
`.slice` components too, otherwise the hierarchies will start being orthogonal
after all, and that's not really desirable. One more thing: systemd will clean
up after you in the hierarchies it manages: if your daemon goes down, its
cgroups will be removed too. You basically get the guarantee that you start
with a pristine cgroup sub-tree for your service or scope whenever it is
started. This is not the case however in the hierarchies systemd doesn't
manage. This means that your programs should be ready to deal with left-over
cgroups in them from previous runs, and be extra careful with them as they
might still carry settings that might not be valid anymore.

Note a particular asymmetry here: if your systemd version doesn't support a
specific controller on cgroupsv1 you can still make use of it for delegation,
by directly fiddling with its hierarchy and replicating the cgroup tree there
as necessary (as suggested above). However, on cgroupsv2 this is different:
separately mounted hierarchies are not available, and delegation always has to
happen through systemd itself. This means: when you update your kernel and it
adds a new, so far unseen controller, and you want to use it for delegation,
then you also need to update systemd to a version that groks it.

## systemd as Container Payload

systemd can happily run as a container payload's PID 1. Note that systemd
unconditionally needs write access to the cgroup tree however, hence you need
to delegate a sub-tree to it. Note that there's nothing too special you have to
do beyond that: just invoke systemd as PID 1 inside the root of the delegated
cgroup sub-tree, and it will figure out the rest: it will determine the cgroup
it is running in and take possession of it. It won't interfere with any cgroup
outside of the sub-tree it was invoked in. Use of `CLONE_NEWCGROUP` is hence
optional (but of course wise).

Note one particular asymmetry here though: systemd will try to take possession
of the root cgroup you pass to it *in* *full*, i.e. it will not only
create/remove child cgroups below it, it will also attempt to manage the
attributes of it. OTOH as mentioned above, when delegating a cgroup tree to
somebody else it only passes the rights to create/remove sub-cgroups, but will
insist on managing the delegated cgroup tree's top-level attributes. Or in
other words: systemd is *greedy* when accepting delegated cgroup trees and also
*greedy* when delegating them to others: it insists on managing attributes on
the specific cgroup in both cases. A container manager that is itself a payload
of a host systemd which wants to run a systemd as its own container payload
instead hence needs to insert an extra level in the hierarchy in between, so
that the systemd on the host and the one in the container won't fight for the
attributes. That said, you likely should do that anyway, due to the
no-processes-in-inner-nodes rule, see below.

When systemd runs as container payload it will make use of all hierarchies it
has write access to. For legacy mode you need to make at least
`/sys/fs/cgroup/systemd/` available, all other hierarchies are optional. For
hybrid mode you need to add `/sys/fs/cgroup/unified/`. Finally, for fully
unified you (of course, I guess) need to provide only `/sys/fs/cgroup/` itself.

## Some Dos

1. ⚡ If you go for implementation option 1a or 1b (as in the list above), then
   each of your containers will have its own systemd-managed unit and hence
   cgroup with possibly further sub-cgroups below. Typically the first process
   running in that unit will be some kind of executor program, which will in
   turn fork off the payload processes of the container. In this case don't
   forget that there are two levels of delegation involved: first, systemd
   delegates a cgroup sub-tree to your executor. And then your executor should
   delegate a sub-tree further down to the container payload. Oh, and because
   of the no-processes-in-inner-nodes rule, your executor needs to migrate itself
   to a sub-cgroup of the cgroup it got delegated, too. Most likely you hence
   want a two-pronged approach: below the cgroup you got started in, you want
   one cgroup maybe called `supervisor/` where your manager runs in and then
   for each container a sibling cgroup of that maybe called `payload-xyz/`.

2. ⚡ Don't forget that the cgroups you create have to have names that are
   suitable as UNIX file names, and that they live in the same namespace as the
   various kernel attribute files. Hence, when you want to allow the user
   arbitrary naming, you might need to escape some of the names (for example,
   you really don't want to create a cgroup named `tasks`, just because the
   user created a container by that name, because `tasks` after all is a magic
   attribute in cgroupsv1, and your `mkdir()` will hence fail with `EEXIST`). In
   systemd we do escaping by prefixing names that might collide with a kernel
   attribute name with an underscore. You might want to do the same, but this
   is really up to you how you do it. Just do it, and be careful.

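Both points can be sketched together in Python. Treat this as an illustration
under stated assumptions rather than as systemd's algorithm: the collision
lists are deliberately small and made up for the example, the function names
are hypothetical, and `delegated_dir` is assumed to be the file system path of
the cgroup you got delegated. The sketch relies on the kernel convention that
writing a PID to a cgroup's `cgroup.procs` file migrates that process there.

```python
import os
from typing import Iterable, List

# Illustrative, incomplete collision lists (an assumption of this sketch).
_RESERVED_NAMES = {"tasks", "notify_on_release", "release_agent"}
_ATTRIBUTE_PREFIXES = ("cgroup.", "cpu.", "cpuacct.", "blkio.", "io.",
                       "memory.", "devices.", "pids.")


def escape_cgroup_name(name: str) -> str:
    """Prefix names that could collide with kernel attribute files with '_'."""
    if not name or "/" in name:
        raise ValueError(f"not usable as a cgroup name: {name!r}")
    # Names already starting with '_' are escaped too, so that escaping stays
    # reversible; names starting with '.' would be hidden or special.
    if (name in _RESERVED_NAMES
            or name.startswith(_ATTRIBUTE_PREFIXES)
            or name.startswith(("_", "."))):
        return "_" + name
    return name


def create_layout(delegated_dir: str, containers: Iterable[str],
                  supervisor_pid: int) -> List[str]:
    """Create supervisor/ plus one payload-<name>/ sibling per container."""
    supervisor = os.path.join(delegated_dir, "supervisor")
    os.makedirs(supervisor, exist_ok=True)
    # Move the manager out of the delegated cgroup itself, which is about
    # to become an inner node of the tree.
    with open(os.path.join(supervisor, "cgroup.procs"), "w") as procs:
        procs.write(str(supervisor_pid))
    payload_dirs = []
    for name in containers:
        # With the "payload-" prefix the escape call only validates the name,
        # but it keeps this call site safe if the naming scheme ever changes.
        directory = os.path.join(delegated_dir,
                                 escape_cgroup_name("payload-" + name))
        os.makedirs(directory, exist_ok=True)
        payload_dirs.append(directory)
    return payload_dirs
```

A manager started in a delegated cgroup would call
`create_layout(delegated_dir, names, os.getpid())` before starting any payload
process.
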
## Some Don'ts

1. 🚫 Never create your own cgroups below arbitrary cgroups systemd manages, i.e.
   cgroups you haven't set `Delegate=` in. Specifically: 🔥 don't create your
   own cgroups below the root cgroup 🔥. That's owned by systemd, and you will
   step on systemd's toes if you ignore that, and systemd will step on
   yours. Get your own delegated sub-tree, you may create as many cgroups there
   as you like. Seriously, if you create cgroups directly in the cgroup root,
   then all you do is ask for trouble.

2. 🚫 Don't attempt to set `Delegate=` in slice units, and in particular not in
   `-.slice`. It's not supported, and will generate an error.

3. 🚫 Never *write* to any of the attributes of a cgroup systemd created for
   you. It's systemd's private property. You are welcome to manipulate the
   attributes of cgroups you created in your own delegated sub-tree, but the
   cgroup tree of systemd itself is off limits for you. It's fine to *read*
   from any attribute you like however. That's totally OK and welcome.

4. 🚫 If you don't use `CLONE_NEWCGROUP` when delegating a sub-tree to a
   container payload running systemd, then don't get the idea that you can bind
   mount only a sub-tree of the host's cgroup tree into the container. Part of
   the cgroup API is that `/proc/$PID/cgroup` reports the cgroup path of every
   process, and hence any path below `/sys/fs/cgroup/` needs to match what
   `/proc/$PID/cgroup` of the payload processes reports. What you can do safely
   however, is mount the upper parts of the cgroup tree read-only (or even
   replace the middle bits with an intermediary `tmpfs`, but be careful not to
   break the `statfs()` detection logic discussed above), as long as the path
   to the delegated sub-tree remains accessible as-is.

5. ⚡ Currently, the algorithm for mapping between slice/scope/service unit
   naming and their cgroup paths is not considered public API of systemd, and
   may change in future versions. This means: it's best to avoid implementing a
   local logic of translating cgroup paths to slice/scope/service names in your
   program, or vice versa, since it's likely going to break sooner or later. Use
   the appropriate D-Bus API calls for that instead, so that systemd translates
   this for you. (Specifically: each Unit object has a `ControlGroup` property
   to get the cgroup for a unit. The method `GetUnitByControlGroup()` may be
   used to get the unit for a cgroup.)

6. ⚡ Think twice before delegating cgroupsv1 controllers to less privileged
   containers. It's not safe, you basically allow your containers to freeze the
   system with that and worse. Delegation is a strong point of cgroupsv2 though,
   and there it's safe to treat delegation boundaries as privilege boundaries.

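As an illustration of point 5, here is one way to let systemd do the
translation from Python without pulling in a D-Bus library. Note the swap: this
sketch shells out to `systemctl show --property=ControlGroup`, which prints the
unit's `ControlGroup` property as a `Key=Value` line, rather than calling the
D-Bus API directly. The function names are hypothetical.

```python
import subprocess
from typing import Optional


def parse_control_group(show_output: str) -> Optional[str]:
    """Pick the ControlGroup value out of `systemctl show` output."""
    for line in show_output.splitlines():
        key, separator, value = line.partition("=")
        if separator and key == "ControlGroup":
            return value
    return None


def unit_cgroup(unit: str) -> Optional[str]:
    """Ask systemd which cgroup belongs to the given unit."""
    result = subprocess.run(
        ["systemctl", "show", "--property=ControlGroup", unit],
        check=True, capture_output=True, text=True,
    )
    return parse_control_group(result.stdout)
```
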
And that's it for now. If you have further questions, refer to the systemd
mailing list.

Berlin, 2018-04-20