pcie-passthrough: add short note about iommu passthrough mode

diff --git a/qm.adoc b/qm.adoc

index c291cb0c2774906e5d20a02cbee77754274cf79d..e7d0c07d9605172b1f98c3f42649db59398c41c2 100644
--- a/qm.adoc
+++ b/qm.adoc
@@ -279,12 +279,13 @@ execution on the host system. If you're not sure about the workload of your VM,
  it is usually a safe bet to set the number of *Total cores* to 2.
  
  NOTE: It is perfectly safe if the _overall_ number of cores of all your VMs
-is greater than the number of cores on the server (e.g., 4 VMs with each 4
-cores on a machine with only 8 cores). In that case the host system will
-balance the Qemu execution threads between your server cores, just like if you
-were running a standard multi-threaded application. However, {pve} will prevent
-you from starting VMs with more virtual CPU cores than physically available, as
-this will only bring the performance down due to the cost of context switches.
+is greater than the number of cores on the server (for example, 4 VMs each with
+4 cores (= total 16) on a machine with only 8 cores). In that case the host
+system will balance the QEMU execution threads between your server cores, just
+like if you were running a standard multi-threaded application. However, {pve}
+will prevent you from starting VMs with more virtual CPU cores than physically
+available, as this will only bring the performance down due to the cost of
+context switches.
  
  [[qm_cpu_resource_limits]]
  Resource Limits
@@ -310,24 +311,39 @@ other VMs and CTs would get to less CPU. So, we set the *cpulimit* limit to
  real host cores CPU time. But, if only 4 would do work they could still get
  almost 100% of a real core each.
  
-NOTE: VMs can, depending on their configuration, use additional threads e.g.,
-for networking or IO operations but also live migration. Thus a VM can show up
-to use more CPU time than just its virtual CPUs could use. To ensure that a VM
-never uses more CPU time than virtual CPUs assigned set the *cpulimit* setting
-to the same value as the total core count.
+NOTE: VMs can, depending on their configuration, use additional threads, such
+as for networking or IO operations but also live migration. Thus a VM can
+appear to use more CPU time than just its virtual CPUs could use. To ensure that
+a VM never uses more CPU time than its assigned virtual CPUs, set the *cpulimit*
+setting to the same value as the total core count.
  
  The second CPU resource limiting setting, *cpuunits* (nowadays often called CPU
-shares or CPU weight), controls how much CPU time a VM gets in regards to other
-VMs running.  It is a relative weight which defaults to `1024`, if you increase
-this for a VM it will be prioritized by the scheduler in comparison to other
-VMs with lower weight. E.g., if VM 100 has set the default 1024 and VM 200 was
-changed to `2048`, the latter VM 200 would receive twice the CPU bandwidth than
-the first VM 100.
+shares or CPU weight), controls how much CPU time a VM gets compared to other
+running VMs. It is a relative weight which defaults to `100` (or `1024` if the
+host uses legacy cgroup v1). If you increase this for a VM, it will be
+prioritized by the scheduler in comparison to other VMs with lower weight. For
+example, if VM 100 uses the default `100` and VM 200 was changed to `200`,
+the latter VM 200 would receive twice the CPU bandwidth of the first VM 100.
  
  For more information see `man systemd.resource-control`, here `CPUQuota`
-corresponds to `cpulimit` and `CPUShares` corresponds to our `cpuunits`
+corresponds to `cpulimit` and `CPUWeight` corresponds to our `cpuunits`
  setting, visit its Notes section for references and implementation details.
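
As a rough illustration of how relative weights translate into CPU time, the following Python sketch divides contended CPU time in proportion to each VM's `cpuunits` value. It is a simplified model, not the scheduler's actual algorithm: it assumes all listed VMs are CPU-bound and compete for the same cores, and the helper name `cpu_shares` is hypothetical.

```python
from fractions import Fraction


def cpu_shares(weights):
    """Return each VM's fraction of contended CPU time.

    weights maps a VMID to its cpuunits value. Simplified model: weights
    only matter while the VMs actually compete for the same host cores.
    """
    total = sum(weights.values())
    return {vmid: Fraction(weight, total) for vmid, weight in weights.items()}


# VM 100 keeps the cgroup v2 default of 100, VM 200 was raised to 200.
shares = cpu_shares({100: 100, 200: 200})
for vmid, share in shares.items():
    print(f"VM {vmid}: {share} of contended CPU time")
```

Under this model VM 200 ends up with twice the share of VM 100. When the host has idle cores, both VMs can still use as much CPU time as their other limits allow.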
  
+The third CPU resource limiting setting, *affinity*, controls what host cores
+the virtual machine will be permitted to execute on. For example, if an affinity
+value of `0-3,8-11` is provided, the virtual machine will be restricted to using
+the host cores `0,1,2,3,8,9,10`, and `11`. Valid *affinity* values are written in
+cpuset `List Format`. List Format is a comma-separated list of CPU numbers and
+ranges of numbers, in ASCII decimal.
+
+NOTE: CPU *affinity* uses the `taskset` command to restrict virtual machines to
+a given set of cores. This restriction will not take effect for some types of
+processes that may be created for IO. *CPU affinity is not a security feature.*
+
+For more information regarding *affinity* see `man cpuset`. Here the
+`List Format` corresponds to valid *affinity* values. Visit its `Formats`
+section for more examples.
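
To make the List Format concrete, here is a minimal Python sketch that expands an *affinity* value into the individual host core numbers. It only handles plain numbers and `first-last` ranges as described above. The function name `parse_cpu_list` is hypothetical and not part of any {pve} tooling.

```python
def parse_cpu_list(value):
    """Expand a cpuset List Format string such as "0-3,8-11" into CPU numbers."""
    cpus = set()
    for part in value.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty entry in {value!r}")
        if "-" in part:
            # A range like "8-11" includes both end points.
            first, last = (int(x) for x in part.split("-", 1))
            if first > last:
                raise ValueError(f"descending range {part!r}")
            cpus.update(range(first, last + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


print(parse_cpu_list("0-3,8-11"))  # [0, 1, 2, 3, 8, 9, 10, 11]
```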
+
  CPU Type
  ^^^^^^^^
  
@@ -516,10 +532,10 @@ SUBSYSTEM=="cpu", ACTION=="add", TEST=="online", ATTR{online}=="0", ATTR{online}
  
  Save this under /etc/udev/rules.d/ as a file ending in `.rules`.
  
-Note: CPU hot-remove is machine dependent and requires guest cooperation.
-The deletion command does not guarantee CPU removal to actually happen,
-typically it's a request forwarded to guest using target dependent mechanism,
-e.g., ACPI on x86/amd64.
+Note: CPU hot-remove is machine dependent and requires guest cooperation. The
+deletion command does not guarantee that CPU removal actually happens; typically
+it's a request forwarded to the guest OS using a target-dependent mechanism, such
+as ACPI on x86/amd64.
  
  
  [[qm_memory]]
@@ -540,8 +556,7 @@ Even when using a fixed memory size, the ballooning device gets added to the
  VM, because it delivers useful information such as how much memory the guest
  really uses.
  In general, you should leave *ballooning* enabled, but if you want to disable
-it (e.g. for debugging purposes), simply uncheck
-*Ballooning Device* or set
+it (for example, for debugging purposes), simply uncheck *Ballooning Device* or set
  
   balloon: 0
  
@@ -659,18 +674,28 @@ QEMU can virtualize a few types of VGA hardware. Some examples are:
  * *cirrus*, this was once the default, it emulates a very old hardware module
  with all its problems. This display type should only be used if really
  necessary footnote:[https://www.kraxel.org/blog/2014/10/qemu-using-cirrus-considered-harmful/
-qemu: using cirrus considered harmful], e.g., if using Windows XP or earlier
+qemu: using cirrus considered harmful], for example, if using Windows XP or
+earlier
  * *vmware*, is a VMWare SVGA-II compatible adapter.
  * *qxl*, is the QXL paravirtualized graphics card. Selecting this also
  enables https://www.spice-space.org/[SPICE] (a remote viewer protocol) for the
  VM.
+* *virtio-gl*, often named VirGL, is a virtual 3D GPU for use inside VMs that
+  can offload workloads to the host GPU without requiring special (expensive)
+  models and drivers and without binding the host GPU completely, allowing
+  reuse between multiple guests and/or the host.
++
+NOTE: VirGL support needs some extra libraries that aren't installed by
+default due to being relatively big and also not available as open source for
+all GPU models/vendors. For most setups you'll just need to do:
+`apt install libgl1 libegl1`
  
  You can edit the amount of memory given to the virtual GPU, by setting
  the 'memory' option. This can enable higher resolutions inside the VM,
  especially with SPICE/QXL.
  
  As the memory is reserved by display device, selecting Multi-Monitor mode
-for SPICE (e.g., `qxl2` for dual monitors) has some implications:
+for SPICE (such as `qxl2` for dual monitors) has some implications:
  
  * Windows needs a device for each monitor, so if your 'ostype' is some
  version of Windows, {pve} gives the VM an extra device per monitor.
@@ -733,10 +758,14 @@ the operating system. By default QEMU uses *SeaBIOS* for this, which is an
  open-source, x86 BIOS implementation. SeaBIOS is a good choice for most
  standard setups.
  
-There are, however, some scenarios in which a BIOS is not a good firmware
-to boot from, e.g. if you want to do VGA passthrough. footnote:[Alex Williamson has a very good blog entry about this.
+Some operating systems (such as Windows 11) may require use of a UEFI
+compatible implementation. In such cases, you must use *OVMF* instead,
+which is an open-source UEFI implementation. footnote:[See the OVMF Project https://github.com/tianocore/tianocore.github.io/wiki/OVMF]
+
+There are other scenarios in which the SeaBIOS may not be the ideal firmware to
+boot from, for example if you want to do VGA passthrough. footnote:[Alex
+Williamson has a good blog entry about this
  https://vfio.blogspot.co.at/2014/08/primary-graphics-assignment-without-vga.html]
-In such cases, you should rather use *OVMF*, which is an open-source UEFI implementation. footnote:[See the OVMF Project https://github.com/tianocore/tianocore.github.io/wiki/OVMF]
  
  If you want to use OVMF, there are several things to consider:
  
@@ -745,18 +774,67 @@ This disk will be included in backups and snapshots, and there can only be one.
  
  You can create such a disk with the following command:
  
- qm set <vmid> -efidisk0 <storage>:1,format=<format>
+----
+# qm set <vmid> -efidisk0 <storage>:1,format=<format>,efitype=4m,pre-enrolled-keys=1
+----
  
  Where *<storage>* is the storage where you want to have the disk, and
  *<format>* is a format which the storage supports. Alternatively, you can
  create such a disk through the web interface with 'Add' -> 'EFI Disk' in the
  hardware section of a VM.
  
+The *efitype* option specifies which version of the OVMF firmware should be
+used. For new VMs, this should always be '4m', as it supports Secure Boot and
+has more space allocated to support future development (this is the default in
+the GUI).
+
+*pre-enrolled-keys* specifies if the efidisk should come pre-loaded with
+distribution-specific and Microsoft Standard Secure Boot keys. It also enables
+Secure Boot by default (though it can still be disabled in the OVMF menu within
+the VM).
+
+NOTE: If you want to start using Secure Boot in an existing VM (that still uses
+a '2m' efidisk), you need to recreate the efidisk. To do so, delete the old one
+(`qm set <vmid> -delete efidisk0`) and add a new one as described above. This
+will reset any custom configurations you have made in the OVMF menu!
+
  When using OVMF with a virtual display (without VGA passthrough),
-you need to set the client resolution in the OVMF menu(which you can reach
+you need to set the client resolution in the OVMF menu (which you can reach
  with a press of the ESC button during boot), or you have to choose
  SPICE as the display type.
  
+[[qm_tpm]]
+Trusted Platform Module (TPM)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A *Trusted Platform Module* is a device which stores secret data - such as
+encryption keys - securely and provides tamper-resistance functions for
+validating system boot.
+
+Certain operating systems (such as Windows 11) require such a device to be
+attached to a machine (be it physical or virtual).
+
+A TPM is added by specifying a *tpmstate* volume. This works similarly to an
+efidisk, in that it cannot be changed (only removed) once created. You can add
+one via the following command:
+
+----
+# qm set <vmid> -tpmstate0 <storage>:1,version=<version>
+----
+
+Where *<storage>* is the storage you want to put the state on, and *<version>*
+is either 'v1.2' or 'v2.0'. You can also add one via the web interface, by
+choosing 'Add' -> 'TPM State' in the hardware section of a VM.
+
+The 'v2.0' TPM spec is newer and better supported, so unless you have a specific
+implementation that requires a 'v1.2' TPM, it should be preferred.
+
+NOTE: Compared to a physical TPM, an emulated one does *not* provide any real
+security benefits. The point of a TPM is that the data on it cannot be modified
+easily, except via commands specified as part of the TPM spec. Since with an
+emulated device the data storage happens on a regular volume, it can potentially
+be edited by anyone with access to it.
+
  [[qm_ivshmem]]
  Inter-VM shared memory
  ~~~~~~~~~~~~~~~~~~~~~~
@@ -766,7 +844,9 @@ share memory between the host and a guest, or also between multiple guests.
  
  To add such a device, you can use `qm`:
  
-  qm set <vmid> -ivshmem size=32,name=foo
+----
+# qm set <vmid> -ivshmem size=32,name=foo
+----
  
  Where the size is in MiB. The file will be located under
  `/dev/shm/pve-shm-$name` (the default name is the vmid).
@@ -851,7 +931,7 @@ Device Boot Order
  ~~~~~~~~~~~~~~~~~
  
  QEMU can tell the guest which devices it should boot from, and in which order.
-This can be specified in the config via the `boot` property, e.g.:
+This can be specified in the config via the `boot` property, for example:
  
  ----
  boot: order=scsi0;net0;hostpci0
@@ -888,7 +968,9 @@ when the host system boots. For this you need to select the option 'Start at
  boot' from the 'Options' Tab of your VM in the web interface, or set it with
  the following command:
  
- qm set <vmid> -onboot 1
+----
+# qm set <vmid> -onboot 1
+----
  
  .Start and Shutdown Order
  
@@ -899,19 +981,20 @@ VMs, for instance if one of your VM is providing firewalling or DHCP
  to other guest systems.  For this you can use the following
  parameters:
  
-* *Start/Shutdown order*: Defines the start order priority. E.g. set it to 1 if
+* *Start/Shutdown order*: Defines the start order priority. For example, set it
+to 1 if
  you want the VM to be the first to be started. (We use the reverse startup
  order for shutdown, so a machine with a start order of 1 would be the last to
  be shut down). If multiple VMs have the same order defined on a host, they will
  additionally be ordered by 'VMID' in ascending order.
  * *Startup delay*: Defines the interval between this VM start and subsequent
-VMs starts . E.g. set it to 240 if you want to wait 240 seconds before starting
-other VMs.
+VMs starts. For example, set it to 240 if you want to wait 240 seconds before
+starting other VMs.
  * *Shutdown timeout*: Defines the duration in seconds {pve} should wait
-for the VM to be offline after issuing a shutdown command.
-By default this value is set to 180, which means that {pve} will issue a
-shutdown request and wait 180 seconds for the machine to be offline. If
-the machine is still online after the timeout it will be stopped forcefully.
+for the VM to be offline after issuing a shutdown command. By default this
+value is set to 180, which means that {pve} will issue a shutdown request and
+wait 180 seconds for the machine to be offline. If the machine is still online
+after the timeout it will be stopped forcefully.
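
The ordering rules above can be sketched in a few lines of Python. This is an illustrative model only, not how {pve} implements it: it covers only VMs that have a start order set, the function names are hypothetical, and it assumes that shutdown simply reverses the complete startup list.

```python
def startup_sequence(orders):
    """Return VMIDs in startup order.

    orders maps a VMID to its Start/Shutdown order value. VMs with the
    same order value are additionally sorted by VMID in ascending order.
    """
    return [vmid for _, vmid in sorted((order, vmid) for vmid, order in orders.items())]


def shutdown_sequence(orders):
    """Assumption: shutdown is the startup sequence reversed as a whole."""
    return list(reversed(startup_sequence(orders)))


orders = {101: 2, 100: 1, 105: 2}
print(startup_sequence(orders))   # [100, 101, 105]
print(shutdown_sequence(orders))  # [105, 101, 100]
```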
  
  NOTE: VMs managed by the HA stack do not follow the 'start on boot' and
  'boot order' options currently. Those VMs will be skipped by the startup and
@@ -923,6 +1006,9 @@ start after those where the parameter is set. Further, this parameter can only
  be enforced between virtual machines running on the same host, not
  cluster-wide.
  
+If you require a delay between the host boot and the booting of the first VM,
+see the section on xref:first_guest_boot_delay[Proxmox VE Node Management].
+
  
  [[qm_qemu_agent]]
  Qemu Guest Agent
@@ -1053,7 +1139,9 @@ Migration
  
  If you have a cluster, you can migrate your VM to another host with
  
- qm migrate <vmid> <target>
+----
+# qm migrate <vmid> <target>
+----
  
  There are generally two mechanisms for this
  
@@ -1063,43 +1151,62 @@ There are generally two mechanisms for this
  Online Migration
  ~~~~~~~~~~~~~~~~
  
-When your VM is running and it has no local resources defined (such as disks
-on local storage, passed through devices, etc.) you can initiate a live
-migration with the -online flag.
+If your VM is running and no locally bound resources are configured (such as
+passed-through devices), you can initiate a live migration with the `--online`
+flag in the `qm migrate` command invocation. The web interface defaults to
+live migration when the VM is running.
  
  How it works
  ^^^^^^^^^^^^
  
-This starts a Qemu Process on the target host with the 'incoming' flag, which
-means that the process starts and waits for the memory data and device states
-from the source Virtual Machine (since all other resources, e.g. disks,
-are shared, the memory content and device state are the only things left
-to transmit).
-
-Once this connection is established, the source begins to send the memory
-content asynchronously to the target. If the memory on the source changes,
-those sections are marked dirty and there will be another pass of sending data.
-This happens until the amount of data to send is so small that it can
-pause the VM on the source, send the remaining data to the target and start
-the VM on the target in under a second.
+Online migration first starts a new QEMU process on the target host with the
+'incoming' flag, which performs only basic initialization with the guest vCPUs
+still paused and then waits for the guest memory and device state data streams
+of the source Virtual Machine.
+All other resources, such as disks, are either shared or have already been sent
+before runtime state migration of the VM begins; so only the memory content
+and device state remain to be transferred.
+
+Once this connection is established, the source begins asynchronously sending
+the memory content to the target. If the guest memory on the source changes,
+those sections are marked dirty and another pass is made to send the guest
+memory data.
+This loop is repeated until the data difference between the running source VM
+and the incoming target VM is small enough to be sent in a few milliseconds.
+At that point, the source VM can be paused completely without a user or program
+noticing the pause, the remaining data is sent to the target, and the target
+VM's CPUs are unpaused to make it the new running VM, all in well under a
+second.
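
The iterative copy loop described above can be modelled with a small Python simulation. This is a toy sketch under strong assumptions, not QEMU's actual algorithm: the guest dirties memory at a constant rate, the network bandwidth is constant, and all names and parameters (`precopy_rounds`, `max_downtime_s`, and so on) are hypothetical.

```python
def precopy_rounds(memory_mib, dirty_rate_mib_s, bandwidth_mib_s,
                   max_downtime_s=0.3, max_rounds=50):
    """Simulate pre-copy passes; return (MiB sent per pass, final pause in seconds)."""
    if dirty_rate_mib_s >= bandwidth_mib_s:
        # Memory is dirtied faster than it can be sent, so the loop never converges.
        raise ValueError("dirty rate must be lower than the bandwidth")
    remaining = float(memory_mib)
    rounds = []
    for _ in range(max_rounds):
        transfer_time = remaining / bandwidth_mib_s
        if transfer_time <= max_downtime_s:
            break  # small enough: pause the source and send the rest
        rounds.append(remaining)
        # While this pass was being sent, the guest dirtied more memory.
        remaining = dirty_rate_mib_s * transfer_time
    return rounds, remaining / bandwidth_mib_s


rounds, downtime = precopy_rounds(memory_mib=4096, dirty_rate_mib_s=100,
                                  bandwidth_mib_s=1000)
print(f"{len(rounds)} live passes, final pause of about {downtime * 1000:.0f} ms")
```

In this model each pass shrinks the remaining data by the ratio of dirty rate to bandwidth, which is why a guest that writes memory faster than the network can carry it would never reach the final pause.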
  
  Requirements
  ^^^^^^^^^^^^
  
  For Live Migration to work, there are some things required:
  
-* The VM has no local resources (e.g. passed through devices, local disks, etc.)
-* The hosts are in the same {pve} cluster.
-* The hosts have a working (and reliable) network connection.
-* The target host must have the same or higher versions of the
-  {pve} packages. (It *might* work the other way, but this is never guaranteed)
+* The VM has no local resources that cannot be migrated. For example,
+  PCI or USB devices that are passed through currently block live-migration.
+  Local Disks, on the other hand, can be migrated by sending them to the target
+  just fine.
+* The hosts are located in the same {pve} cluster.
+* The hosts have a working (and reliable) network connection between them.
+* The target host must have the same, or higher versions of the
+  {pve} packages. Although it can sometimes work the other way around, this
+  cannot be guaranteed.
+* The hosts have CPUs from the same vendor with similar capabilities. Different
+  vendors *might* work depending on the actual models and the VM's configured
+  CPU type, but it cannot be guaranteed - so please test before deploying
+  such a setup in production.
  
  Offline Migration
  ~~~~~~~~~~~~~~~~~
  
-If you have local resources, you can still offline migrate your VMs,
-as long as all disk are on storages, which are defined on both hosts.
-Then the migration will copy the disk over the network to the target host.
+If you have local resources, you can still migrate your VMs offline as long as
+all disks are on storage defined on both hosts.
+Migration then copies the disks to the target host over the network, as with
+online migration. Note that any hardware pass-through configuration may need to
+be adapted to the device location on the target host.
+
+// TODO: mention hardware map IDs as better way to solve that, once available
  
  [[qm_copy_and_clone]]
  Copies and Clones
@@ -1197,12 +1304,12 @@ in its configuration file.
  
  To create and add a 'vmgenid' to an already existing VM one can pass the
  special value `1' to let {pve} autogenerate one or manually set the 'UUID'
-footnote:[Online GUID generator http://guid.one/] by using it as value,
-e.g.:
+footnote:[Online GUID generator http://guid.one/] by using it as value, for
+example:
  
  ----
- qm set VMID -vmgenid 1
- qm set VMID -vmgenid 00000000-0000-0000-0000-000000000000
+# qm set VMID -vmgenid 1
+# qm set VMID -vmgenid 00000000-0000-0000-0000-000000000000
  ----
  
  NOTE: The initial addition of a 'vmgenid' device to an existing VM, may result
@@ -1214,12 +1321,12 @@ its value on VM creation, or retroactively delete the property in the
  configuration with:
  
  ----
- qm set VMID -delete vmgenid
+# qm set VMID -delete vmgenid
  ----
  
  The most prominent use case for 'vmgenid' are newer Microsoft Windows
  operating systems, which use it to avoid problems in time sensitive or
-replicate services (e.g., databases, domain controller
+replicate services (such as databases or domain controller
  footnote:[https://docs.microsoft.com/en-us/windows-server/identity/ad-ds/get-started/virtual-dc/virtualized-domain-controller-architecture])
  on snapshot rollback, backup restore or a whole VM clone operation.
  
@@ -1281,7 +1388,9 @@ This will create a new virtual machine, using cores, memory and
  VM name as read from the OVF manifest, and import the disks to the +local-lvm+
   storage. You have to configure the network manually.
  
- qm importovf 999 WinDev1709Eval.ovf local-lvm
+----
+# qm importovf 999 WinDev1709Eval.ovf local-lvm
+----
  
  The VM is ready to be started.
  
@@ -1303,18 +1412,14 @@ Suppose you created a Debian/Ubuntu disk image with the 'vmdebootstrap' tool:
    --customize=./copy_pub_ssh.sh \
    --sparse --image vm600.raw
  
-You can now create a new target VM for this image.
-
- qm create 600 --net0 virtio,bridge=vmbr0 --name vm600 --serial0 socket \
-   --bootdisk scsi0 --scsihw virtio-scsi-pci --ostype l26
-
-Add the disk image as +unused0+ to the VM, using the storage +pvedir+:
-
- qm importdisk 600 vm600.raw pvedir
+You can now create a new target VM, importing the image to the storage `pvedir`
+and attaching it to the VM's SCSI controller:
  
-Finally attach the unused disk to the SCSI controller of the VM:
-
- qm set 600 --scsi0 pvedir:600/vm-600-disk-1.raw
+----
+# qm create 600 --net0 virtio,bridge=vmbr0 --name vm600 --serial0 socket \
+   --boot order=scsi0 --scsihw virtio-scsi-pci --ostype l26 \
+   --scsi0 pvedir:0,import-from=/path/to/dir/vm600.raw
+----
  
  The VM is ready to be started.
  
@@ -1332,7 +1437,9 @@ Hookscripts
  
  You can add a hook script to VMs with the config property `hookscript`.
  
- qm set 100 --hookscript local:snippets/hookscript.pl
+----
+# qm set 100 --hookscript local:snippets/hookscript.pl
+----
  
  It will be called during various phases of the guests lifetime.
  For an example and documentation see the example script under
@@ -1344,7 +1451,9 @@ Hibernation
  
  You can suspend a VM to disk with the GUI option `Hibernate` or with
  
- qm suspend ID --todisk
+----
+# qm suspend ID --todisk
+----
  
  That means that the current content of the memory will be saved onto disk
  and the VM gets stopped. On the next start, the memory content will be
@@ -1375,27 +1484,50 @@ CLI Usage Examples
  Using an iso file uploaded on the 'local' storage, create a VM
  with a 4 GB IDE disk on the 'local-lvm' storage
  
- qm create 300 -ide0 local-lvm:4 -net0 e1000 -cdrom local:iso/proxmox-mailgateway_2.1.iso
+----
+# qm create 300 -ide0 local-lvm:4 -net0 e1000 -cdrom local:iso/proxmox-mailgateway_2.1.iso
+----
  
  Start the new VM
  
- qm start 300
+----
+# qm start 300
+----
  
  Send a shutdown request, then wait until the VM is stopped.
  
- qm shutdown 300 && qm wait 300
+----
+# qm shutdown 300 && qm wait 300
+----
  
  Same as above, but only wait for 40 seconds.
  
- qm shutdown 300 && qm wait 300 -timeout 40
+----
+# qm shutdown 300 && qm wait 300 -timeout 40
+----
  
  Destroying a VM always removes it from Access Control Lists and it always
  removes the firewall configuration of the VM. You have to activate
  '--purge', if you want to additionally remove the VM from replication jobs,
  backup jobs and HA resource configurations.
  
- qm destroy 300 --purge
+----
+# qm destroy 300 --purge
+----
  
+Move a disk image to a different storage.
+
+----
+# qm move-disk 300 scsi0 other-storage
+----
+
+Reassign a disk image to a different VM. This will remove the disk `scsi1` from
+the source VM and attach it as `scsi3` to the target VM. In the background
+the disk image is being renamed so that the name matches the new owner.
+
+----
+# qm move-disk 300 scsi1 --target-vmid 400 --target-disk scsi3
+----
  
  
  [[qm_configuration]]
@@ -1492,11 +1624,13 @@ include::qm.conf.5-opts.adoc[]
  Locks
  -----
  
-Online migrations, snapshots and backups (`vzdump`) set a lock to
-prevent incompatible concurrent actions on the affected VMs. Sometimes
-you need to remove such a lock manually (e.g., after a power failure).
+Online migrations, snapshots and backups (`vzdump`) set a lock to prevent
+incompatible concurrent actions on the affected VMs. Sometimes you need to
+remove such a lock manually (for example, after a power failure).
  
- qm unlock <vmid>
+----
+# qm unlock <vmid>
+----
  
  CAUTION: Only do that if you are sure the action which set the lock is
  no longer running.