X-Git-Url: https://git.proxmox.com/?p=pve-docs.git;a=blobdiff_plain;f=pct.adoc;h=2d52ad2b5fa6d92385c4e4ad712a8f1b603eeb47;hp=611ff484b9c7bf0272e2dcdc97373cbf788340ed;hb=04c569f66d62b615840b27f791481b8bc3e5696e;hpb=38fd0958719a329859b3d0d719c37d5df15a2d8d

diff --git a/pct.adoc b/pct.adoc
index 611ff48..2d52ad2 100644
--- a/pct.adoc
+++ b/pct.adoc
@@ -24,17 +24,389 @@ Proxmox Container Toolkit
 include::attributes.txt[]
 endif::manvolnum[]

-'pct' is a tool to manages Linux Containers (LXC). You can create and
-destroy containers, and control execution
-(start/stop/suspend/resume). Besides that, you can use pct to set
-parameters in the associated config file, like network configuration
-or memory.
+
+Containers are a lightweight alternative to fully virtualized
+VMs. Instead of emulating a complete Operating System (OS), containers
+simply use the OS of the host they run on. This implies that all
+containers use the same kernel, and that they can access resources
+from the host directly.
+
+This is great because containers waste neither CPU power nor memory on
+kernel emulation. Container run-time costs are close to zero and
+usually negligible. But there are also some drawbacks you need to
+consider:
+
+* You can only run Linux based operating systems inside containers,
+  i.e. it is not possible to run FreeBSD or MS Windows inside a
+  container.
+
+* For security reasons, access to host resources needs to be
+  restricted. This is done with AppArmor, SecComp filters and other
+  kernel features. Be aware that some syscalls are not allowed
+  inside containers.
+
+{pve} uses https://linuxcontainers.org/[LXC] as its underlying
+container technology. We consider LXC a low-level library which
+provides countless options. Using those tools directly would be too
+complicated, so instead we provide a small wrapper called `pct`, the
+"Proxmox Container Toolkit".
+
+The toolkit is tightly coupled with {pve}. That means that it is aware
+of the cluster setup, and it can use the same network and storage
+resources as fully virtualized VMs. You can even use the {pve}
+firewall, or manage containers using the HA framework.
+
+Our primary goal is to offer an environment like the one you would get
+from a VM, but without the additional overhead. We call these "System
+Containers".
+
+NOTE: If you want to run micro-containers (with Docker, rkt, ...), it
+is best to run them inside a VM.
+
+
+Security Considerations
+-----------------------
+
+Containers use the same kernel as the host, so they expose a big
+attack surface to malicious users. You should consider this fact if
+you provide containers to entirely untrusted people. In general, fully
+virtualized VMs provide better isolation.
+
+The good news is that LXC uses many kernel security features like
+AppArmor, CGroups and PID and user namespaces, which makes container
+usage quite secure. We distinguish two types of containers:
+
+Privileged containers
+~~~~~~~~~~~~~~~~~~~~~
+
+Security is achieved by dropping capabilities, using mandatory access
+control (AppArmor), SecComp filters and namespaces. The LXC team
+considers this kind of container unsafe, and will not consider new
+container escape exploits to be security issues worthy of a CVE and
+quick fix. So you should only use this kind of container inside a
+trusted environment, or when no untrusted task is running as root in
+the container.
+
+Unprivileged containers
+~~~~~~~~~~~~~~~~~~~~~~~
+
+This kind of container uses a new kernel feature called user
+namespaces. The root UID 0 inside the container is mapped to an
+unprivileged user outside the container. This means that most security
+issues (container escape, resource abuse, ...) in those containers
+will affect a random unprivileged user, and would therefore be a
+generic kernel security bug rather than an LXC issue. The LXC team
+thinks unprivileged containers are safe by design.
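+
+If your {pve} version already exposes the `unprivileged` option of
+'pct', creating such a container is a single command. A minimal
+sketch (the CTID 200 and the Debian template shown later in this
+document are example choices):
+
+ pct create 200 local:vztmpl/debian-8.0-standard_8.0-1_amd64.tar.gz -unprivileged 1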
+
+
+Configuration
+-------------
+
+The '/etc/pve/lxc/<CTID>.conf' file stores container configuration,
+where '<CTID>' is the numeric ID of the given container. Like all
+other files stored inside '/etc/pve/', they get automatically
+replicated to all other cluster nodes.
+
+NOTE: CTIDs < 100 are reserved for internal purposes, and CTIDs need
+to be unique cluster-wide.
+
+.Example Container Configuration
+----
+ostype: debian
+arch: amd64
+hostname: www
+memory: 512
+swap: 512
+net0: bridge=vmbr0,hwaddr=66:64:66:64:64:36,ip=dhcp,name=eth0,type=veth
+rootfs: local:107/vm-107-disk-1.raw,size=7G
+----
+
+Those configuration files are simple text files, and you can edit them
+using a normal text editor ('vi', 'nano', ...). This is sometimes
+useful to make small corrections, but keep in mind that you need to
+restart the container to apply such changes.
+
+For that reason, it is usually better to use the 'pct' command to
+generate and modify those files, or do the whole thing using the GUI.
+Our toolkit is smart enough to instantaneously apply most changes to
+running containers. This feature is called "hot plug", and there is no
+need to restart the container in that case.
+
+File Format
+~~~~~~~~~~~
+
+Container configuration files use a simple colon separated key/value
+format. Each line has the following format:
+
+ # this is a comment
+ OPTION: value
+
+Blank lines in those files are ignored, and lines starting with a '#'
+character are treated as comments and are also ignored.
+
+It is possible to add low-level, LXC style configuration directly, for
+example:
+
+ lxc.init_cmd: /sbin/my_own_init
+
+or
+
+ lxc.init_cmd = /sbin/my_own_init
+
+Those settings are directly passed to the LXC low-level tools.
+
+Snapshots
+~~~~~~~~~
+
+When you create a snapshot, 'pct' stores the configuration at snapshot
+time in a separate snapshot section within the same configuration
+file. For example, after creating a snapshot called 'testsnapshot',
+your configuration file will look like this:
+
+.Container Configuration with Snapshot
+----
+memory: 512
+swap: 512
+parent: testsnapshot
+...
+
+[testsnapshot]
+memory: 512
+swap: 512
+snaptime: 1457170803
+...
+----
+
+There are a few snapshot related properties like 'parent' and
+'snaptime'. The 'parent' property is used to store the parent/child
+relationship between snapshots. 'snaptime' is the snapshot creation
+time stamp (unix epoch).
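+
+Snapshots are usually managed with 'pct' subcommands rather than by
+editing the configuration file. A minimal sketch, assuming your 'pct'
+version ships the snapshot subcommands (the CTID and the snapshot name
+are examples):
+
+ pct snapshot 100 testsnapshot
+ pct rollback 100 testsnapshot
+ pct delsnapshot 100 testsnapshot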
+
+Guest Operating System Configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We normally try to detect the operating system type inside the
+container, and then modify some files inside the container to make
+them work as expected. Here is a short list of things we do at
+container startup:
+
+set /etc/hostname:: to set the container name
+
+modify /etc/hosts:: to allow lookup of the local hostname
+
+network setup:: pass the complete network setup to the container
+
+configure DNS:: pass information about DNS servers
+
+adapt the init system:: for example, fix the number of spawned getty processes
+
+set the root password:: when creating a new container
+
+rewrite ssh_host_keys:: so that each container has unique keys
+
+randomize crontab:: so that cron does not start at the same time on all containers
+
+The above tasks depend on the OS type, so the implementation differs
+for each OS type. You can also disable any modifications by manually
+setting the 'ostype' to 'unmanaged'.
+
+OS type detection is done by testing for certain files inside the
+container:
+
+Ubuntu:: inspect /etc/lsb-release ('DISTRIB_ID=Ubuntu')
+
+Debian:: test /etc/debian_version
+
+Fedora:: test /etc/fedora-release
+
+RedHat or CentOS:: test /etc/redhat-release
+
+ArchLinux:: test /etc/arch-release
+
+Alpine:: test /etc/alpine-release
+
+NOTE: Container start fails if the configured 'ostype' differs from
+the auto-detected type.
+
+
+Container Images
+----------------
+
+Container images, sometimes also referred to as "templates" or
+"appliances", are 'tar' archives which contain everything needed to
+run a container. You can think of them as tidy container backups. Like
+most modern container toolkits, 'pct' uses those images when you
+create a new container, for example:
+
+ pct create 999 local:vztmpl/debian-8.0-standard_8.0-1_amd64.tar.gz
+
+Proxmox itself ships a set of basic templates for the most common
+operating systems, and you can download them using the 'pveam' (short
+for {pve} Appliance Manager) command line utility. You can also
+download https://www.turnkeylinux.org/[TurnKey Linux] containers using
+that tool (or the graphical user interface).
+
+Our image repositories contain a list of available images, and there
+is a cron job that runs each day to download that list. You can
+trigger that update manually with:
+
+ pveam update
+
+After that you can view the list of available images using:
+
+ pveam available
+
+You can restrict this large list by specifying the 'section' you are
+interested in, for example basic 'system' images:
+
+.List available system images
+----
+# pveam available --section system
+system          archlinux-base_2015-24-29-1_x86_64.tar.gz
+system          centos-7-default_20160205_amd64.tar.xz
+system          debian-6.0-standard_6.0-7_amd64.tar.gz
+system          debian-7.0-standard_7.0-3_amd64.tar.gz
+system          debian-8.0-standard_8.0-1_amd64.tar.gz
+system          ubuntu-12.04-standard_12.04-1_amd64.tar.gz
+system          ubuntu-14.04-standard_14.04-1_amd64.tar.gz
+system          ubuntu-15.04-standard_15.04-1_amd64.tar.gz
+system          ubuntu-15.10-standard_15.10-1_amd64.tar.gz
+----
+
+Before you can use such a template, you need to download it into one
+of your storages. You can simply use the storage 'local' for that
+purpose. For clustered installations, it is preferable to use a shared
+storage so that all nodes can access those images.
+
+ pveam download local debian-8.0-standard_8.0-1_amd64.tar.gz
+
+You are now ready to create containers using that image, and you can
+list all downloaded images on storage 'local' with:
+
+----
+# pveam list local
+local:vztmpl/debian-8.0-standard_8.0-1_amd64.tar.gz  190.20MB
+----
+
+The above command shows you the full {pve} volume identifiers. They
+include the storage name, and most other {pve} commands can use them.
+For example you can delete that image later with:
+
+ pveam remove local:vztmpl/debian-8.0-standard_8.0-1_amd64.tar.gz
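+
+Putting these commands together, fetching an image and creating a
+container from it might look like this (a sketch; the CTID 999 and the
+`storage` option for placing the root file system are example
+choices):
+
+ pveam update
+ pveam download local debian-8.0-standard_8.0-1_amd64.tar.gz
+ pct create 999 local:vztmpl/debian-8.0-standard_8.0-1_amd64.tar.gz -storage local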
For +examply you can delete that image later with: + + pveam remove local:vztmpl/debian-8.0-standard_8.0-1_amd64.tar.gz + + +Container Storage +----------------- + +Traditional containers use a very simple storage model, only allowing +a single mount point, the root file system. This was further +restricted to specific file system types like 'ext4' and 'nfs'. +Additional mounts are often done by user provided scripts. This turend +out to be complex and error prone, so we try to avoid that now. + +Our new LXC based container model is more flexible regarding +storage. First, you can have more than a single mount point. This +allows you to choose a suitable storage for each application. For +example, you can use a relatively slow (and thus cheap) storage for +the container root file system. Then you can use a second mount point +to mount a very fast, distributed storage for your database +application. + +The second big improvement is that you can use any storage type +supported by the {pve} storage library. That means that you can store +your containers on local 'lvmthin' or 'zfs', shared 'iSCSI' storage, +or even on distributed storage systems like 'ceph'. It also enables us +to use advanced storage features like snapshots and clones. 'vzdump' +can also use the snapshot feature to provide consistent container +backups. + +Last but not least, you can also mount local devices directly, or +mount local directories using bind mounts. That way you can access +local storage inside containers with zero overhead. Such bind mounts +also provide an easy way to share data between different containers. + +Container Mountpoints +--------------------- + +Beside the root directory the container can also have additional mountpoints. +Currently there are basically three types of mountpoints: storage backed +mountpoints, bind mounts and device mounts. + +Storage backed mountpoints are managed by the {pve} storage subsystem and come +in three different flavors: + +- Image based: These are raw images containing a single ext4 formatted file + system. +- ZFS Subvolumes: These are technically bind mounts, but with managed storage, + and thus allow resizing and snapshotting. +- Directories: passing `size=0` triggers a special case where instead of a raw + image a directory is created. + +Bind mounts are considered to not be managed by the storage subsystem, so you +cannot make snapshots or deal with quotas from inside the container, and with +unprivileged containers you might run into permission problems caused by the +user mapping, and cannot use ACLs from inside an unprivileged container. + +Similarly device mounts are not managed by the storage, but for these the +`quota` and `acl` options will be honored. + +WARNING: Because of existing issues in the Linux kernel's freezer +subsystem the usage of FUSE mounts inside a container is strongly +advised against, as containers need to be frozen for suspend or +snapshot mode backups. If FUSE mounts cannot be replaced by other +mounting mechanisms or storage technologies, it is possible to +establish the FUSE mount on the Proxmox host and use a bind +mountpoint to make it accessible inside the container. + +Using quotas inside containers +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Quotas allow to set limits inside a container for the amount of disk space +that each user can use. +This only works on ext4 image based storage types and currently does not work +with unprivileged containers. 
+
+Using quotas inside containers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Quotas allow you to set limits inside a container for the amount of
+disk space that each user can use. This currently works only on ext4
+image based storage types, and does not work with unprivileged
+containers.
+
+Activating the `quota` option causes the following mount options to be
+used for a mountpoint:
+`usrjquota=aquota.user,grpjquota=aquota.group,jqfmt=vfsv0`
+
+This allows quotas to be used just as you would on any other system.
+You can initialize the `/aquota.user` and `/aquota.group` files by
+running
+
+ quotacheck -cmug /
+ quotaon /
+
+and edit the quotas via the `edquota` command. Refer to the
+documentation of the distribution running inside the container for
+details.
+
+NOTE: You need to run the above commands for every mountpoint by
+passing the mountpoint's path instead of just `/`.
+
+Using ACLs inside containers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The standard POSIX Access Control Lists are also available inside
+containers. ACLs allow you to set more detailed file ownership than
+the traditional user/group/others model.
+
+
+Container Network
+-----------------
+
+TODO
+
+
+Managing Containers with 'pct'
+------------------------------
+
+'pct' is the tool to manage Linux Containers on {pve}. You can create
+and destroy containers, and control their execution (start, stop,
+migrate, ...). You can also use pct to set parameters in the
+associated config file, like network configuration or memory limits.

 CLI Usage Examples
-------------------
+~~~~~~~~~~~~~~~~~~

-Create a container based on a Debian template (provided you downloaded
-the template via the webgui before)
+Create a container based on a Debian template (provided you have
+already downloaded the template via the webgui)

  pct create 100 /var/lib/vz/template/cache/debian-8.0-standard_8.0-1_amd64.tar.gz

@@ -66,9 +438,9 @@ Reduce the memory of the container to 512MB
 Files
 ------

-'/etc/pve/lxc/<VMID>.conf'::
+'/etc/pve/lxc/<CTID>.conf'::

-Configuration file for the container <VMID>
+Configuration file for the container '<CTID>'.

 Container Advantages
 --------------------
@@ -109,9 +481,9 @@ Technology Overview

 - CRIU: for live migration (planned)

-- We use latest available kernels (4.2.X)
+- We use latest available kernels (4.4.X)

-- image based deployment (templates)
+- Image based deployment (templates)

 - Container setup from host (Network, DNS, Storage, ...)
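+
+To round off this overview, here is a compact session that exercises
+most of the commands described in this document (a sketch; the CTID
+100 and the template name are examples):
+
+ pveam download local debian-8.0-standard_8.0-1_amd64.tar.gz
+ pct create 100 local:vztmpl/debian-8.0-standard_8.0-1_amd64.tar.gz
+ pct start 100
+ pct list
+ pct stop 100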