pct: clarify needed systemd-versions for cgroupv2 support

[pve-docs.git] / pct.adoc
diff --git a/pct.adoc b/pct.adoc

index c52f45e7876209f916936b14ccd302ffbc094d92..42826bdd4329822a1f4d91d138f85c34a396c872 100644
--- a/pct.adoc
+++ b/pct.adoc
@@ -36,28 +36,33 @@ the host system directly.
  The runtime costs for containers is low, usually negligible. However, there are
  some drawbacks that need be considered:
  
-* Only Linux distributions can be run in containers.It is not possible to run
-  other Operating Systems like, for example, FreeBSD or Microsoft Windows
+* Only Linux distributions can be run in Proxmox Containers. It is not possible to run
+  other operating systems like, for example, FreeBSD or Microsoft Windows
    inside a container.
  
  * For security reasons, access to host resources needs to be restricted.
-  Containers run in their own separate namespaces. Additionally some syscalls
-  are not allowed within containers.
+  Therefore, containers run in their own separate namespaces. Additionally, some
+  syscalls (user space requests to the Linux kernel) are not allowed within containers.
  
-{pve} uses https://linuxcontainers.org/[Linux Containers (LXC)] as underlying
+{pve} uses https://linuxcontainers.org/lxc/introduction/[Linux Containers (LXC)] as its underlying
  container technology. The ``Proxmox Container Toolkit'' (`pct`) simplifies the
-usage and management of LXC containers.
+usage and management of LXC by providing an interface that abstracts
+complex tasks.
  
  Containers are tightly integrated with {pve}. This means that they are aware of
  the cluster setup, and they can use the same network and storage resources as
  virtual machines. You can also use the {pve} firewall, or manage containers
  using the HA framework.
  
-Our primary goal is to offer an environment as one would get from a VM, but
-without the additional overhead. We call this ``System Containers''.
+Our primary goal is to offer an environment that provides the benefits of using a
+VM, but without the additional overhead. This means that Proxmox Containers can
+be categorized as ``System Containers'', rather than ``Application Containers''.
  
-NOTE: If you want to run micro-containers, for example, 'Docker' or 'rkt', it
-is best to run them inside a VM.
+NOTE: If you want to run application containers, for example, 'Docker' images, it
+is recommended that you run them inside a Proxmox Qemu VM. This will give you
+all the advantages of application containerization, while also providing the
+benefits that VMs offer, such as strong isolation from the host and the ability
+to live-migrate, which otherwise isn't possible with containers. 
  
  
  Technology Overview
@@ -446,7 +451,7 @@ Security Considerations
  
  Containers use the kernel of the host system. This exposes an attack surface
  for malicious users. In general, full virtual machines provide better
-isolation. This should be considered if containers are provided to unkown or
+isolation. This should be considered if containers are provided to unknown or
  untrusted people.
  
  To reduce the attack surface, LXC uses many security features like AppArmor,
@@ -473,13 +478,69 @@ To disable AppArmor for a container, add the following line to the container
  configuration file located at `/etc/pve/lxc/CTID.conf`:
  
  ----
-lxc.apparmor_profile = unconfined
+lxc.apparmor.profile = unconfined
  ----
  
  WARNING: Please note that this is not recommended for production use.
  
  
-// TODO: describe cgroups + seccomp a bit more.
+[[pct_cgroup]]
+Control Groups ('cgroup')
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+'cgroup' is a kernel
+mechanism used to hierarchically organize processes and distribute system
+resources.
+
+The main resources controlled via 'cgroups' are CPU time, memory and swap
+limits, and access to device nodes. 'cgroups' are also used to "freeze" a
+container before taking snapshots.
+
+There are 2 versions of 'cgroups' currently available,
+https://www.kernel.org/doc/html/v5.11/admin-guide/cgroup-v1/index.html[legacy]
+and
+https://www.kernel.org/doc/html/v5.11/admin-guide/cgroup-v2.html['cgroupv2'].
+
+Since {pve} 7.0, the default is a pure 'cgroupv2' environment. Previously a
+"hybrid" setup was used, where resource control was mainly done in 'cgroupv1'
+with an additional 'cgroupv2' controller which could take over some subsystems
+via the 'cgroup_no_v1' kernel command line parameter. (See the
+https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html[kernel
+parameter documentation] for details.)
+
+The main difference between pure 'cgroupv2' and the old hybrid environments
+regarding {pve} is that with 'cgroupv2' memory and swap are now controlled
+independently. The memory and swap settings for containers can map directly to
+these values, whereas previously only the memory limit and the limit on the
+*sum* of memory and swap could be set.
+
+Another important difference is that the 'devices' controller is configured in a
+completely different way. Because of this, file system quotas are currently not
+supported in a pure 'cgroupv2' environment.
+
+'cgroupv2' support by the container's OS is needed to run in a pure 'cgroupv2'
+environment. Containers running 'systemd' version 231 or newer support
+'cgroupv2' footnote:[this includes all newest major versions of container
+templates shipped by {pve}], as do containers not using 'systemd' as init
+system footnote:[for example Alpine Linux].
+
+NOTE: CentOS 7 and Ubuntu 16.10 are two prominent Linux distribution releases that
+have a 'systemd' version that is too old to run in a 'cgroupv2' environment.
+
+If file system quotas are not required and the containers support 'cgroupv2',
+it is recommended to stick to the new default.
+
+To switch back to the previous version, the following kernel command line
+parameter can be used:
+
+----
+systemd.unified_cgroup_hierarchy=0
+----
+
+See xref:sysboot_edit_kernel_cmdline[this section] on editing the kernel boot
+command line for where to add the parameter.
+
+// TODO: seccomp a bit more.
  // TODO: pve-lxc-syscalld
  
  
@@ -598,6 +659,8 @@ Using Quotas Inside Containers
  Quotas allow to set limits inside a container for the amount of disk space that
  each user can use.
  
+NOTE: This currently requires the use of legacy 'cgroups'.
+
  NOTE: This only works on ext4 image based storage types and currently only
  works with privileged containers.
  
@@ -786,22 +849,39 @@ Reduce the memory of the container to 512MB
  # pct set 100 -memory 512
  ----
  
+Destroying a container always removes it from Access Control Lists and always
+removes the firewall configuration of the container. You have to pass
+'--purge' if you want to additionally remove the container from replication jobs,
+backup jobs and HA resource configurations.
+
+----
+# pct destroy 100 --purge
+----
+
+
  
  Obtaining Debugging Logs
  ~~~~~~~~~~~~~~~~~~~~~~~~
  
  In case `pct start` is unable to start a specific container, it might be
-helpful to collect debugging output by running `lxc-start` (replace `ID` with
-the container's ID):
+helpful to collect debugging output by passing the `--debug` flag (replace `CTID` with
+the container's CTID):
+
+----
+# pct start CTID --debug
+----
+
+Alternatively, you can use the following `lxc-start` command, which will save
+the debug log to the file specified by the `-o` output option:
  
  ----
-# lxc-start -n ID -F -l DEBUG -o /tmp/lxc-ID.log
+# lxc-start -n CTID -F -l DEBUG -o /tmp/lxc-CTID.log
  ----
  
  This command will attempt to start the container in foreground mode, to stop
-the container run `pct shutdown ID` or `pct stop ID` in a second terminal.
+the container run `pct shutdown CTID` or `pct stop CTID` in a second terminal.
  
-The collected debug log is written to `/tmp/lxc-ID.log`.
+The collected debug log is written to `/tmp/lxc-CTID.log`.
  
  NOTE: If you have changed the container's configuration since the last start
  attempt with `pct start`, you need to run `pct start` at least once to also
@@ -821,7 +901,7 @@ This works as long as your Container is offline. If it has local volumes or
  mount points defined, the migration will copy the content over the network to
  the target host if the same storage is defined there.
  
-Running containers cannot live-migrated due to techincal limitations. You can
+Running containers cannot be live-migrated due to technical limitations. You can
  do a restart migration, which shuts down, moves and then starts a container
  again on the target node. As containers are very lightweight, this results
  normally only in a downtime of some hundreds of milliseconds.
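
As a companion to the Control Groups section added by this patch, the following is a minimal shell sketch (not part of the patch itself) for checking the two conditions the text describes: whether a host runs a pure 'cgroupv2' layout, and whether a container's 'systemd' meets the version 231 threshold stated above. The helper names `cgroup_mode` and `systemd_ok` are hypothetical. The sketch assumes cgroups are mounted at `/sys/fs/cgroup` and relies on `stat -fc %T` reporting `cgroup2fs` for a unified hierarchy and `tmpfs` for a legacy or hybrid one.

```shell
#!/bin/sh
# Hypothetical helper: map the filesystem type of the cgroup mount point
# to the cgroup layout it indicates.
#   cgroup2fs -> pure cgroupv2 (unified hierarchy)
#   tmpfs     -> legacy or hybrid layout
cgroup_mode() {
    case "$1" in
        cgroup2fs) echo "cgroupv2" ;;
        tmpfs)     echo "legacy-or-hybrid" ;;
        *)         echo "unknown" ;;
    esac
}

# Hypothetical helper: succeed if the given systemd version number meets
# the minimum of 231 that the documentation text states for cgroupv2.
systemd_ok() {
    [ "$1" -ge 231 ]
}

# Report the layout of the current host. Prints "unknown" if stat fails
# or the mount point does not exist.
cgroup_mode "$(stat -fc %T /sys/fs/cgroup 2>/dev/null)"
```

Inside a container, the installed version can be read from the first line of `systemctl --version` (for example with `systemctl --version | awk 'NR==1 {print $2}'`) and passed to `systemd_ok`.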