btrfs: document df weirdness and how to better get usage

[pve-docs.git] / pct.adoc
diff --git a/pct.adoc b/pct.adoc

index 9bedab7754edeae4484483745c784d256fc7705f..b88569f42f1e9ab75d3ae2b79989f5839d174880 100644
--- a/pct.adoc
+++ b/pct.adoc
@@ -36,28 +36,33 @@ the host system directly.
  The runtime costs for containers is low, usually negligible. However, there are
  some drawbacks that need be considered:
  
-* Only Linux distributions can be run in containers.It is not possible to run
-  other Operating Systems like, for example, FreeBSD or Microsoft Windows
+* Only Linux distributions can be run in Proxmox Containers. It is not possible to run
+  other operating systems like, for example, FreeBSD or Microsoft Windows
    inside a container.
  
  * For security reasons, access to host resources needs to be restricted.
-  Containers run in their own separate namespaces. Additionally some syscalls
-  are not allowed within containers.
+  Therefore, containers run in their own separate namespaces. Additionally, some
+  syscalls (user space requests to the Linux kernel) are not allowed within containers.
  
-{pve} uses https://linuxcontainers.org/[Linux Containers (LXC)] as underlying
+{pve} uses https://linuxcontainers.org/lxc/introduction/[Linux Containers (LXC)] as its underlying
  container technology. The ``Proxmox Container Toolkit'' (`pct`) simplifies the
-usage and management of LXC containers.
+usage and management of LXC by providing an interface that abstracts
+complex tasks.
  
  Containers are tightly integrated with {pve}. This means that they are aware of
  the cluster setup, and they can use the same network and storage resources as
  virtual machines. You can also use the {pve} firewall, or manage containers
  using the HA framework.
  
-Our primary goal is to offer an environment as one would get from a VM, but
-without the additional overhead. We call this ``System Containers''.
+Our primary goal is to offer an environment that provides the benefits of using a
+VM, but without the additional overhead. This means that Proxmox Containers can
+be categorized as ``System Containers'', rather than ``Application Containers''.
  
-NOTE: If you want to run micro-containers, for example, 'Docker' or 'rkt', it
-is best to run them inside a VM.
+NOTE: If you want to run application containers, for example, 'Docker' images, it
+is recommended that you run them inside a Proxmox QEMU VM. This will give you
+all the advantages of application containerization, while also providing the
+benefits that VMs offer, such as strong isolation from the host and the ability
+to live-migrate, which otherwise isn't possible with containers.
  
  
  Technology Overview
@@ -86,37 +91,12 @@ Technology Overview
  * Container setup from host (network, DNS, storage, etc.)
  
  
-Security Considerations
------------------------
-
-Containers use the kernel of the host system. This creates a big attack surface
-for malicious users. This should be considered if containers are provided to
-untrustworthy people. In general, full virtual machines provide better
-isolation.
-
-However, LXC uses many security features like AppArmor, CGroups and kernel
-namespaces to reduce the attack surface.
-
-AppArmor profiles are used to restrict access to possibly dangerous actions.
-Some system calls, i.e. `mount`, are prohibited from execution.
-
-To trace AppArmor activity, use:
-
-----
-# dmesg | grep apparmor
-----
-
  [[pct_container_images]]
  Container Images
  ----------------
  
  Container images, sometimes also referred to as ``templates'' or
  ``appliances'', are `tar` archives which contain everything to run a container.
-`pct` uses them to create a new container, for example:
-
-----
-# pct create 999 local:vztmpl/debian-10.0-standard_10.0-1_amd64.tar.gz
-----
  
  {pve} itself provides a variety of basic templates for the most common Linux
  distributions. They can be downloaded using the GUI or the `pveam` (short for
@@ -124,8 +104,8 @@ distributions. They can be downloaded using the GUI or the `pveam` (short for
  Additionally, https://www.turnkeylinux.org/[TurnKey Linux] container templates
  are also available to download.
  
-The list of available templates is updated daily via cron. To trigger it
-manually:
+The list of available templates is updated daily through the 'pve-daily-update'
+timer. You can also trigger an update manually by executing:
  
  ----
  # pveam update
@@ -143,30 +123,30 @@ interested in, for example basic `system` images:
  .List available system images
  ----
  # pveam available --section system
-system          alpine-3.10-default_20190626_amd64.tar.xz
-system          alpine-3.9-default_20190224_amd64.tar.xz
-system          archlinux-base_20190924-1_amd64.tar.gz
-system          centos-6-default_20191016_amd64.tar.xz
+system          alpine-3.12-default_20200823_amd64.tar.xz
+system          alpine-3.13-default_20210419_amd64.tar.xz
+system          alpine-3.14-default_20210623_amd64.tar.xz
+system          archlinux-base_20210420-1_amd64.tar.gz
  system          centos-7-default_20190926_amd64.tar.xz
-system          centos-8-default_20191016_amd64.tar.xz
-system          debian-10.0-standard_10.0-1_amd64.tar.gz
-system          debian-8.0-standard_8.11-1_amd64.tar.gz
+system          centos-8-default_20201210_amd64.tar.xz
  system          debian-9.0-standard_9.7-1_amd64.tar.gz
-system          fedora-30-default_20190718_amd64.tar.xz
-system          fedora-31-default_20191029_amd64.tar.xz
-system          gentoo-current-default_20190718_amd64.tar.xz
-system          opensuse-15.0-default_20180907_amd64.tar.xz
-system          opensuse-15.1-default_20190719_amd64.tar.xz
+system          debian-10-standard_10.7-1_amd64.tar.gz
+system          devuan-3.0-standard_3.0_amd64.tar.gz
+system          fedora-33-default_20201115_amd64.tar.xz
+system          fedora-34-default_20210427_amd64.tar.xz
+system          gentoo-current-default_20200310_amd64.tar.xz
+system          opensuse-15.2-default_20200824_amd64.tar.xz
  system          ubuntu-16.04-standard_16.04.5-1_amd64.tar.gz
  system          ubuntu-18.04-standard_18.04.1-1_amd64.tar.gz
-system          ubuntu-19.04-standard_19.04-1_amd64.tar.gz
-system          ubuntu-19.10-standard_19.10-1_amd64.tar.gz
+system          ubuntu-20.04-standard_20.04-1_amd64.tar.gz
+system          ubuntu-20.10-standard_20.10-1_amd64.tar.gz
+system          ubuntu-21.04-standard_21.04-1_amd64.tar.gz
  ----
  
  Before you can use such a template, you need to download them into one of your
-storages. You can simply use storage `local` for that purpose. For clustered
-installations, it is preferred to use a shared storage so that all nodes can
-access those images.
+storages. If you are unsure which one to use, you can simply use the storage
+named `local` for that purpose. For clustered installations, it is preferred to
+use a shared storage so that all nodes can access those images.
  
  ----
  # pveam download local debian-10.0-standard_10.0-1_amd64.tar.gz
@@ -180,119 +160,23 @@ downloaded images on storage `local` with:
  local:vztmpl/debian-10.0-standard_10.0-1_amd64.tar.gz  219.95MB
  ----
  
-The above command shows you the full {pve} volume identifiers. They include the
-storage name, and most other {pve} commands can use them. For example you can
-delete that image later with:
-
-----
-# pveam remove local:vztmpl/debian-10.0-standard_10.0-1_amd64.tar.gz
-----
+TIP: You can also use the {pve} web interface to download, list and delete
+container templates.
  
-[[pct_container_storage]]
-Container Storage
------------------
-
-The {pve} LXC container storage model is more flexible than traditional
-container storage models. A container can have multiple mount points. This
-makes it possible to use the best suited storage for each application.
-
-For example the root file system of the container can be on slow and cheap
-storage while the database can be on fast and distributed storage via a second
-mount point. See section <<pct_mount_points, Mount Points>> for further
-details.
-
-Any storage type supported by the {pve} storage library can be used. This means
-that containers can be stored on local (for example `lvm`, `zfs` or directory),
-shared external (like `iSCSI`, `NFS`) or even distributed storage systems like
-Ceph. Advanced storage features like snapshots or clones can be used if the
-underlying storage supports them. The `vzdump` backup tool can use snapshots to
-provide consistent container backups.
-
-Furthermore, local devices or local directories can be mounted directly using
-'bind mounts'. This gives access to local resources inside a container with
-practically zero overhead. Bind mounts can be used as an easy way to share data
-between containers.
-
-
-FUSE Mounts
-~~~~~~~~~~~
-
-WARNING: Because of existing issues in the Linux kernel's freezer subsystem the
-usage of FUSE mounts inside a container is strongly advised against, as
-containers need to be frozen for suspend or snapshot mode backups.
-
-If FUSE mounts cannot be replaced by other mounting mechanisms or storage
-technologies, it is possible to establish the FUSE mount on the Proxmox host
-and use a bind mount point to make it accessible inside the container.
-
-
-Using Quotas Inside Containers
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Quotas allow to set limits inside a container for the amount of disk space that
-each user can use.
-
-NOTE: This only works on ext4 image based storage types and currently only
-works with privileged containers.
-
-Activating the `quota` option causes the following mount options to be used for
-a mount point:
-`usrjquota=aquota.user,grpjquota=aquota.group,jqfmt=vfsv0`
-
-This allows quotas to be used like on any other system. You can initialize the
-`/aquota.user` and `/aquota.group` files by running:
-
-----
-# quotacheck -cmug /
-# quotaon /
-----
-
-Then edit the quotas using the `edquota` command. Refer to the documentation of
-the distribution running inside the container for details.
-
-NOTE: You need to run the above commands for every mount point by passing the
-mount point's path instead of just `/`.
-
-
-Using ACLs Inside Containers
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The standard Posix **A**ccess **C**ontrol **L**ists are also available inside
-containers. ACLs allow you to set more detailed file ownership than the
-traditional user/group/others model.
-
-
-Backup of Container mount points
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-To include a mount point in backups, enable the `backup` option for it in the
-container configuration. For an existing mount point `mp0`
+`pct` uses them to create a new container, for example:
  
  ----
-mp0: guests:subvol-100-disk-1,mp=/root/files,size=8G
+# pct create 999 local:vztmpl/debian-10.0-standard_10.0-1_amd64.tar.gz
  ----
  
-add `backup=1` to enable it.
+The listing of downloaded images above shows the full {pve} volume identifiers.
+They include the storage name, and most other {pve} commands can use them. For
+example, you can delete that image later with:
  
  ----
-mp0: guests:subvol-100-disk-1,mp=/root/files,size=8G,backup=1
+# pveam remove local:vztmpl/debian-10.0-standard_10.0-1_amd64.tar.gz
  ----
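
The volume identifiers mentioned above follow a `STORAGE:PATH` pattern. As a minimal sketch (the helper names `volid_storage` and `volid_path` are hypothetical and not part of any {pve} tool), the two parts can be separated with plain shell parameter expansion:

```shell
#!/bin/sh
# Split a volume identifier of the form STORAGE:PATH into its two parts.
# Both helpers are illustrative only; they do pure string handling.

# Everything before the first colon is the storage name.
volid_storage() { printf '%s\n' "${1%%:*}"; }

# Everything after the first colon is the volume path on that storage.
volid_path() { printf '%s\n' "${1#*:}"; }

volid='local:vztmpl/debian-10.0-standard_10.0-1_amd64.tar.gz'
echo "storage: $(volid_storage "$volid")"
echo "volume:  $(volid_path "$volid")"
```

Splitting on the first colon only keeps the path intact even if it should contain further colons.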
  
-NOTE: When creating a new mount point in the GUI, this option is enabled by
-default.
-
-To disable backups for a mount point, add `backup=0` in the way described
-above, or uncheck the *Backup* checkbox on the GUI.
-
-Replication of Containers mount points
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-By default, additional mount points are replicated when the Root Disk is
-replicated. If you want the {pve} storage replication mechanism to skip a mount
-point, you can set the *Skip replication* option for that mount point.
-As of {pve} 5.0, replication requires a storage of type `zfspool`. Adding a
-mount point to a different type of storage when the container has replication
-configured requires to have *Skip replication* enabled for that mount point.
  
  [[pct_settings]]
  Container Settings
@@ -336,25 +220,11 @@ systemd version running inside the container should be equal to or greater than
  Privileged Containers
  ^^^^^^^^^^^^^^^^^^^^^
  
-Security in containers is achieved by using mandatory access control
-('AppArmor'), 'seccomp' filters and namespaces. The LXC team considers this
-kind of container as unsafe, and they will not consider new container escape
-exploits to be security issues worthy of a CVE and quick fix.  That's why
-privileged containers should only be used in trusted environments.
-
-Although it is not recommended, AppArmor can be disabled for a container. This
-brings security risks with it. Some syscalls can lead to privilege escalation
-when executed within a container if the system is misconfigured or if a LXC or
-Linux Kernel vulnerability exists.
-
-To disable AppArmor for a container, add the following line to the container
-configuration file located at `/etc/pve/lxc/CTID.conf`:
-
-----
-lxc.apparmor_profile = unconfined
-----
-
-WARNING: Please note that this is not recommended for production use.
+Security in containers is achieved by using mandatory access control 'AppArmor'
+restrictions, 'seccomp' filters and Linux kernel namespaces. The LXC team
+considers this kind of container unsafe, and they will not consider new
+container escape exploits to be security issues worthy of a CVE and quick fix.
+That's why privileged containers should only be used in trusted environments.
  
  
  [[pct_cpu]]
@@ -576,6 +446,132 @@ It will be called during various phases of the guests lifetime.  For an example
  and documentation see the example script under
  `/usr/share/pve-docs/examples/guest-example-hookscript.pl`.
  
+Security Considerations
+-----------------------
+
+Containers use the kernel of the host system. This exposes an attack surface
+for malicious users. In general, full virtual machines provide better
+isolation. This should be considered if containers are provided to unknown or
+untrusted people.
+
+To reduce the attack surface, LXC uses many security features like AppArmor,
+CGroups and kernel namespaces.
+
+AppArmor
+~~~~~~~~
+
+AppArmor profiles are used to restrict access to possibly dangerous actions.
+Some system calls, for example `mount`, are prohibited from execution.
+
+To trace AppArmor activity, use:
+
+----
+# dmesg | grep apparmor
+----
+
+Although it is not recommended, AppArmor can be disabled for a container. This
+brings security risks with it. Some syscalls can lead to privilege escalation
+when executed within a container if the system is misconfigured or if an LXC or
+Linux kernel vulnerability exists.
+
+To disable AppArmor for a container, add the following line to the container
+configuration file located at `/etc/pve/lxc/CTID.conf`:
+
+----
+lxc.apparmor.profile = unconfined
+----
+
+WARNING: Please note that this is not recommended for production use.
+
+
+[[pct_cgroup]]
+Control Groups ('cgroup')
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+'cgroup' is a kernel mechanism used to hierarchically organize processes and
+distribute system resources.
+
+The main resources controlled via 'cgroups' are CPU time, memory and swap
+limits, and access to device nodes. 'cgroups' are also used to "freeze" a
+container before taking snapshots.
+
+There are 2 versions of 'cgroups' currently available,
+https://www.kernel.org/doc/html/v5.11/admin-guide/cgroup-v1/index.html[legacy]
+and
+https://www.kernel.org/doc/html/v5.11/admin-guide/cgroup-v2.html['cgroupv2'].
+
+Since {pve} 7.0, the default is a pure 'cgroupv2' environment. Previously, a
+"hybrid" setup was used, where resource control was mainly done in 'cgroupv1'
+with an additional 'cgroupv2' controller which could take over some subsystems
+via the 'cgroup_no_v1' kernel command line parameter. (See the
+https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html[kernel
+parameter documentation] for details.)
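
To see which of these layouts a host is actually running, one can look for the `cgroup.controllers` file that the kernel exposes at the root of a 'cgroupv2' tree. The following is a rough sketch, not a {pve} tool: the function name `cgroup_mode` is hypothetical, and the `unified/` sub-directory for the hybrid case is the location systemd conventionally uses, which is an assumption about the host setup.

```shell
#!/bin/sh
# Report which cgroup layout is mounted below a given root directory.
# The root defaults to /sys/fs/cgroup; it is a parameter only so that the
# function can be tried against a scratch directory.
cgroup_mode() {
    root="${1:-/sys/fs/cgroup}"
    if [ -f "$root/cgroup.controllers" ]; then
        # cgroupv2 is mounted directly at the root: pure unified hierarchy.
        echo "cgroupv2"
    elif [ -f "$root/unified/cgroup.controllers" ]; then
        # v1 controllers at the root plus a v2 tree under unified/.
        echo "hybrid"
    else
        # No v2 tree found: legacy layout (or no cgroup mount at all).
        echo "legacy"
    fi
}

cgroup_mode
```

The check relies only on file existence, so it needs no special privileges.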
+
+[[pct_cgroup_compat]]
+CGroup Version Compatibility
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+The main difference between pure 'cgroupv2' and the old hybrid environments
+regarding {pve} is that with 'cgroupv2' memory and swap are now controlled
+independently. The memory and swap settings for containers can map directly to
+these values, whereas previously only the memory limit and the limit of the
+*sum* of memory and swap could be set.
+
+Another important difference is that the 'devices' controller is configured in a
+completely different way. Because of this, file system quotas are currently not
+supported in a pure 'cgroupv2' environment.
+
+'cgroupv2' support by the container's OS is needed to run in a pure 'cgroupv2'
+environment. Containers running 'systemd' version 231 or newer support
+'cgroupv2' footnote:[this includes all newest major versions of container
+templates shipped by {pve}], as do containers not using 'systemd' as their init
+system footnote:[for example Alpine Linux].
+
+[NOTE]
+====
+CentOS 7 and Ubuntu 16.10 are two prominent Linux distribution releases that
+have a 'systemd' version too old to run in a 'cgroupv2' environment. For such
+containers, you can do one of the following:
+
+* Upgrade the whole distribution to a newer release. For the examples above, that
+  could be Ubuntu 18.04 or 20.04, and CentOS 8 (or RHEL/CentOS derivatives like
+  AlmaLinux or Rocky Linux). This has the benefit of providing the newest bug and
+  security fixes, often also new features, and of moving the EOL date into the
+  future.
+
+* Upgrade the container's systemd version. If the distribution provides a
+  backports repository, this can be an easy and quick stop-gap measure.
+
+* Move the container, or its services, to a Virtual Machine. Virtual Machines
+  have much less interaction with the host, which is why decades-old OS
+  versions can be installed there just fine.
+
+* Switch back to the legacy 'cgroup' controller. Note that while it can be a
+  valid solution, it's not a permanent one. There's a high likelihood that a
+  future {pve} major release, for example 8.0, will no longer support the legacy
+  controller.
+====
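
Whether a given container falls under this note can be checked by comparing its 'systemd' version against the minimum of 231 stated above. The sketch below is illustrative only: both function names are hypothetical, the parser assumes the usual first line of `systemctl --version` output (`systemd <number> ...`), and the sample input is made up for the demonstration.

```shell
#!/bin/sh
# Succeeds if the given systemd version is new enough for a pure cgroupv2
# host. 231 is the minimum version named in the text; the argument must be
# a plain integer.
systemd_supports_cgroupv2() {
    [ "$1" -ge 231 ]
}

# Read `systemctl --version` style output on stdin and print the numeric
# version, which is the second field of the first line.
parse_systemd_version() {
    awk 'NR == 1 { print $2 }'
}

# Sample input standing in for the output of a container with an old systemd.
version=$(printf 'systemd 219\n+PAM +AUDIT\n' | parse_systemd_version)
if systemd_supports_cgroupv2 "$version"; then
    echo "systemd $version: cgroupv2 capable"
else
    echo "systemd $version: too old for cgroupv2"
fi
```

In practice the input would come from the container itself, for example by piping the output of `pct exec CTID -- systemctl --version` into `parse_systemd_version`.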
+
+[[pct_cgroup_change_version]]
+Changing CGroup Version
+^^^^^^^^^^^^^^^^^^^^^^^
+
+TIP: If file system quotas are not required and all containers support 'cgroupv2',
+it is recommended to stick to the new default.
+
+To switch back to the previous version, the following kernel command line
+parameter can be used:
+
+----
+systemd.unified_cgroup_hierarchy=0
+----
+
+See xref:sysboot_edit_kernel_cmdline[this section] on editing the kernel boot
+command line for where to add the parameter.
+
+// TODO: seccomp a bit more.
+// TODO: pve-lxc-syscalld
+
+
  Guest Operating System Configuration
  ------------------------------------
  
@@ -647,6 +643,115 @@ NOTE: Container start fails if the configured `ostype` differs from the auto
  detected type.
  
  
+[[pct_container_storage]]
+Container Storage
+-----------------
+
+The {pve} LXC container storage model is more flexible than traditional
+container storage models. A container can have multiple mount points. This
+makes it possible to use the best suited storage for each application.
+
+For example the root file system of the container can be on slow and cheap
+storage while the database can be on fast and distributed storage via a second
+mount point. See section <<pct_mount_points, Mount Points>> for further
+details.
+
+Any storage type supported by the {pve} storage library can be used. This means
+that containers can be stored on local (for example `lvm`, `zfs` or directory),
+shared external (like `iSCSI`, `NFS`) or even distributed storage systems like
+Ceph. Advanced storage features like snapshots or clones can be used if the
+underlying storage supports them. The `vzdump` backup tool can use snapshots to
+provide consistent container backups.
+
+Furthermore, local devices or local directories can be mounted directly using
+'bind mounts'. This gives access to local resources inside a container with
+practically zero overhead. Bind mounts can be used as an easy way to share data
+between containers.
+
+
+FUSE Mounts
+~~~~~~~~~~~
+
+WARNING: Because of existing issues in the Linux kernel's freezer subsystem the
+usage of FUSE mounts inside a container is strongly advised against, as
+containers need to be frozen for suspend or snapshot mode backups.
+
+If FUSE mounts cannot be replaced by other mounting mechanisms or storage
+technologies, it is possible to establish the FUSE mount on the Proxmox host
+and use a bind mount point to make it accessible inside the container.
+
+
+Using Quotas Inside Containers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Quotas allow you to set limits inside a container for the amount of disk space
+that each user can use.
+
+NOTE: This currently requires the use of legacy 'cgroups'.
+
+NOTE: This only works on ext4 image based storage types and currently only
+works with privileged containers.
+
+Activating the `quota` option causes the following mount options to be used for
+a mount point:
+`usrjquota=aquota.user,grpjquota=aquota.group,jqfmt=vfsv0`
+
+This allows quotas to be used like on any other system. You can initialize the
+`/aquota.user` and `/aquota.group` files by running:
+
+----
+# quotacheck -cmug /
+# quotaon /
+----
+
+Then edit the quotas using the `edquota` command. Refer to the documentation of
+the distribution running inside the container for details.
+
+NOTE: You need to run the above commands for every mount point by passing the
+mount point's path instead of just `/`.
+
+
+Using ACLs Inside Containers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The standard POSIX **A**ccess **C**ontrol **L**ists are also available inside
+containers. ACLs allow you to set more detailed file ownership than the
+traditional user/group/others model.
+
+
+Backup of Container mount points
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To include a mount point in backups, enable the `backup` option for it in the
+container configuration. For an existing mount point `mp0`
+
+----
+mp0: guests:subvol-100-disk-1,mp=/root/files,size=8G
+----
+
+add `backup=1` to enable it.
+
+----
+mp0: guests:subvol-100-disk-1,mp=/root/files,size=8G,backup=1
+----
+
+NOTE: When creating a new mount point in the GUI, this option is enabled by
+default.
+
+To disable backups for a mount point, add `backup=0` in the way described
+above, or uncheck the *Backup* checkbox in the GUI.
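
The effect of the `backup` option on a mount point line can be illustrated with a small string helper. This is only a sketch of the resulting configuration text: `set_mp_backup` is a hypothetical name, it does not touch any configuration file, and on a real system the option would normally be changed through the GUI or `pct set`.

```shell
#!/bin/sh
# Set or replace the backup flag on a mount point configuration line.
# Usage: set_mp_backup LINE 0|1
set_mp_backup() {
    # Drop any existing ",backup=N" option, then append the requested value
    # to the end of the line.
    printf '%s\n' "$1" | sed -E 's/,backup=[01]//g' | sed "s/\$/,backup=$2/"
}

set_mp_backup 'mp0: guests:subvol-100-disk-1,mp=/root/files,size=8G' 1
```

Removing an existing flag first means the helper can be applied repeatedly without the option piling up on the line.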
+
+Replication of Containers mount points
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+By default, additional mount points are replicated when the Root Disk is
+replicated. If you want the {pve} storage replication mechanism to skip a mount
+point, you can set the *Skip replication* option for that mount point.
+As of {pve} 5.0, replication requires a storage of type `zfspool`. Adding a
+mount point to a different type of storage when the container has replication
+configured requires *Skip replication* to be enabled for that mount point.
+
+
  Backup and Restore
  ------------------
  
@@ -772,22 +877,39 @@ Reduce the memory of the container to 512MB
  # pct set 100 -memory 512
  ----
  
+Destroying a container always removes it from Access Control Lists and always
+removes the firewall configuration of the container. You have to pass
+`--purge` if you additionally want to remove the container from replication jobs,
+backup jobs and HA resource configurations.
+
+----
+# pct destroy 100 --purge
+----
+
+
  
  Obtaining Debugging Logs
  ~~~~~~~~~~~~~~~~~~~~~~~~
  
  In case `pct start` is unable to start a specific container, it might be
-helpful to collect debugging output by running `lxc-start` (replace `ID` with
-the container's ID):
+helpful to collect debugging output by passing the `--debug` flag (replace `CTID` with
+the container's ID):
+
+----
+# pct start CTID --debug
+----
+
+Alternatively, you can use the following `lxc-start` command, which will save
+the debug log to the file specified by the `-o` output option:
  
  ----
-# lxc-start -n ID -F -l DEBUG -o /tmp/lxc-ID.log
+# lxc-start -n CTID -F -l DEBUG -o /tmp/lxc-CTID.log
  ----
  
  This command will attempt to start the container in foreground mode, to stop
-the container run `pct shutdown ID` or `pct stop ID` in a second terminal.
+the container run `pct shutdown CTID` or `pct stop CTID` in a second terminal.
  
-The collected debug log is written to `/tmp/lxc-ID.log`.
+The collected debug log is written to `/tmp/lxc-CTID.log`.
  
  NOTE: If you have changed the container's configuration since the last start
  attempt with `pct start`, you need to run `pct start` at least once to also
@@ -807,7 +929,7 @@ This works as long as your Container is offline. If it has local volumes or
  mount points defined, the migration will copy the content over the network to
  the target host if the same storage is defined there.
  
-Running containers cannot live-migrated due to techincal limitations. You can
+Running containers cannot be live-migrated due to technical limitations. You can
  do a restart migration, which shuts down, moves and then starts a container
  again on the target node. As containers are very lightweight, this results
  normally only in a downtime of some hundreds of milliseconds.