X-Git-Url: https://git.proxmox.com/?p=pve-docs.git;a=blobdiff_plain;f=pvecm.adoc;h=4b4747faedd6530cccef51073bfedf99c14951cd;hp=c3acc840d7dde567a5e397c822267de75bb885b8;hb=HEAD;hpb=082ea7d907cabf9b05f1439814cbedf97e5e8413 diff --git a/pvecm.adoc b/pvecm.adoc index c3acc84..5117eaa 100644 --- a/pvecm.adoc +++ b/pvecm.adoc @@ -1,3 +1,4 @@ +[[chapter_pvecm]] ifdef::manvolnum[] pvecm(1) ======== @@ -23,26 +24,28 @@ Cluster Manager :pve-toplevel: endif::manvolnum[] -The {PVE} cluster manager `pvecm` is a tool to create a group of +The {pve} cluster manager `pvecm` is a tool to create a group of physical servers. Such a group is called a *cluster*. We use the http://www.corosync.org[Corosync Cluster Engine] for reliable group -communication, and such clusters can consist of up to 32 physical nodes -(probably more, dependent on network latency). +communication. There's no explicit limit for the number of nodes in a cluster. +In practice, the actual possible node count may be limited by the host and +network performance. Currently (2021), there are reports of clusters (using +high-end enterprise hardware) with over 50 nodes in production. `pvecm` can be used to create a new cluster, join nodes to a cluster, -leave the cluster, get status information and do various other cluster -related tasks. The **P**rox**m**o**x** **C**luster **F**ile **S**ystem (``pmxcfs'') +leave the cluster, get status information, and do various other cluster-related +tasks. The **P**rox**m**o**x** **C**luster **F**ile **S**ystem (``pmxcfs'') is used to transparently distribute the cluster configuration to all cluster nodes. Grouping nodes into a cluster has the following advantages: -* Centralized, web based management +* Centralized, web-based management -* Multi-master clusters: each node can do all management task +* Multi-master clusters: each node can do all management tasks -* `pmxcfs`: database-driven file system for storing configuration files, - replicated in real-time on all nodes using `corosync`. +* Use of `pmxcfs`, a database-driven file system, for storing configuration + files, replicated in real-time on all nodes using `corosync` * Easy migration of virtual machines and containers between physical hosts @@ -55,17 +58,12 @@ Grouping nodes into a cluster has the following advantages: Requirements ------------ -* All nodes must be in the same network as `corosync` uses IP Multicast - to communicate between nodes (also see - http://www.corosync.org[Corosync Cluster Engine]). Corosync uses UDP - ports 5404 and 5405 for cluster communication. -+ -NOTE: Some switches do not support IP multicast by default and must be -manually enabled first. +* All nodes must be able to connect to each other via UDP ports 5405-5412 + for corosync to work. -* Date and time have to be synchronized. +* Date and time must be synchronized. -* SSH tunnel on TCP port 22 between nodes is used. +* An SSH tunnel on TCP port 22 between nodes is required. * If you are interested in High Availability, you need to have at least three nodes for reliable quorum. All nodes should have the @@ -74,66 +72,182 @@ manually enabled first. * We recommend a dedicated NIC for the cluster traffic, especially if you use shared storage. -NOTE: It is not possible to mix Proxmox VE 3.x and earlier with -Proxmox VE 4.0 cluster nodes. +* The root password of a cluster node is required for adding nodes. + +* Online migration of virtual machines is only supported when nodes have CPUs + from the same vendor. It might work otherwise, but this is never guaranteed. 
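
Before actually creating or joining a cluster, it can be worth verifying the
time synchronization and basic connectivity requirements listed above. The
following is only a rough sketch; the peer address `10.10.10.2` is a
placeholder for another prospective cluster node, and the exact output depends
on your setup:

[source,bash]
----
# confirm that the system clock is NTP-synchronized on this node
timedatectl status | grep 'System clock synchronized'

# basic reachability test towards another prospective cluster node
ping -c 3 10.10.10.2

# TCP port 22 must be reachable for the SSH tunnel between nodes
ssh root@10.10.10.2 true
----

Note that a plain `ping` does not prove that the corosync UDP ports (5405-5412)
are open; if a firewall sits between the nodes, make sure these ports are
allowed explicitly.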
+ +NOTE: It is not possible to mix {pve} 3.x and earlier with {pve} 4.X cluster +nodes. + +NOTE: While it's possible to mix {pve} 4.4 and {pve} 5.0 nodes, doing so is +not supported as a production configuration and should only be done temporarily, +during an upgrade of the whole cluster from one major version to another. + +NOTE: Running a cluster of {pve} 6.x with earlier versions is not possible. The +cluster protocol (corosync) between {pve} 6.x and earlier versions changed +fundamentally. The corosync 3 packages for {pve} 5.4 are only intended for the +upgrade procedure to {pve} 6.0. Preparing Nodes --------------- -First, install {PVE} on all nodes. Make sure that each node is +First, install {pve} on all nodes. Make sure that each node is installed with the final hostname and IP configuration. Changing the hostname and IP is not possible after cluster creation. -Currently the cluster creation has to be done on the console, so you -need to login via `ssh`. +While it's common to reference all node names and their IPs in `/etc/hosts` (or +make their names resolvable through other means), this is not necessary for a +cluster to work. It may be useful however, as you can then connect from one node +to another via SSH, using the easier to remember node name (see also +xref:pvecm_corosync_addresses[Link Address Types]). Note that we always +recommend referencing nodes by their IP addresses in the cluster configuration. -Create the Cluster ------------------- -Login via `ssh` to the first {pve} node. Use a unique name for your cluster. -This name cannot be changed later. +[[pvecm_create_cluster]] +Create a Cluster +---------------- - hp1# pvecm create YOUR-CLUSTER-NAME +You can either create a cluster on the console (login via `ssh`), or through +the API using the {pve} web interface (__Datacenter -> Cluster__). -CAUTION: The cluster name is used to compute the default multicast -address. Please use unique cluster names if you run more than one -cluster inside your network. +NOTE: Use a unique name for your cluster. This name cannot be changed later. +The cluster name follows the same rules as node names. + +[[pvecm_cluster_create_via_gui]] +Create via Web GUI +~~~~~~~~~~~~~~~~~~ + +[thumbnail="screenshot/gui-cluster-create.png"] + +Under __Datacenter -> Cluster__, click on *Create Cluster*. Enter the cluster +name and select a network connection from the drop-down list to serve as the +main cluster network (Link 0). It defaults to the IP resolved via the node's +hostname. + +As of {pve} 6.2, up to 8 fallback links can be added to a cluster. To add a +redundant link, click the 'Add' button and select a link number and IP address +from the respective fields. Prior to {pve} 6.2, to add a second link as +fallback, you can select the 'Advanced' checkbox and choose an additional +network interface (Link 1, see also xref:pvecm_redundancy[Corosync Redundancy]). + +NOTE: Ensure that the network selected for cluster communication is not used for +any high traffic purposes, like network storage or live-migration. +While the cluster network itself produces small amounts of data, it is very +sensitive to latency. Check out full +xref:pvecm_cluster_network_requirements[cluster network requirements]. 
+ +[[pvecm_cluster_create_via_cli]] +Create via the Command Line +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Login via `ssh` to the first {pve} node and run the following command: + +---- + hp1# pvecm create CLUSTERNAME +---- -To check the state of your cluster use: +To check the state of the new cluster use: +---- hp1# pvecm status +---- + +Multiple Clusters in the Same Network +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +It is possible to create multiple clusters in the same physical or logical +network. In this case, each cluster must have a unique name to avoid possible +clashes in the cluster communication stack. Furthermore, this helps avoid human +confusion by making clusters clearly distinguishable. +While the bandwidth requirement of a corosync cluster is relatively low, the +latency of packages and the package per second (PPS) rate is the limiting +factor. Different clusters in the same network can compete with each other for +these resources, so it may still make sense to use separate physical network +infrastructure for bigger clusters. +[[pvecm_join_node_to_cluster]] Adding Nodes to the Cluster --------------------------- -Login via `ssh` to the node you want to add. +CAUTION: All existing configuration in `/etc/pve` is overwritten when joining a +cluster. In particular, a joining node cannot hold any guests, since guest IDs +could otherwise conflict, and the node will inherit the cluster's storage +configuration. To join a node with existing guest, as a workaround, you can +create a backup of each guest (using `vzdump`) and restore it under a different +ID after joining. If the node's storage layout differs, you will need to re-add +the node's storages, and adapt each storage's node restriction to reflect on +which nodes the storage is actually available. + +Join Node to Cluster via GUI +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +[thumbnail="screenshot/gui-cluster-join-information.png"] + +Log in to the web interface on an existing cluster node. Under __Datacenter -> +Cluster__, click the *Join Information* button at the top. Then, click on the +button *Copy Information*. Alternatively, copy the string from the 'Information' +field manually. + +[thumbnail="screenshot/gui-cluster-join.png"] - hp2# pvecm add IP-ADDRESS-CLUSTER +Next, log in to the web interface on the node you want to add. +Under __Datacenter -> Cluster__, click on *Join Cluster*. Fill in the +'Information' field with the 'Join Information' text you copied earlier. +Most settings required for joining the cluster will be filled out +automatically. For security reasons, the cluster password has to be entered +manually. + +NOTE: To enter all required data manually, you can disable the 'Assisted Join' +checkbox. + +After clicking the *Join* button, the cluster join process will start +immediately. After the node has joined the cluster, its current node certificate +will be replaced by one signed from the cluster certificate authority (CA). +This means that the current session will stop working after a few seconds. You +then might need to force-reload the web interface and log in again with the +cluster credentials. + +Now your node should be visible under __Datacenter -> Cluster__. + +Join Node to Cluster via Command Line +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Log in to the node you want to join into an existing cluster via `ssh`. + +---- + # pvecm add IP-ADDRESS-CLUSTER +---- -For `IP-ADDRESS-CLUSTER` use the IP from an existing cluster node. +For `IP-ADDRESS-CLUSTER`, use the IP or hostname of an existing cluster node. 
+An IP address is recommended (see xref:pvecm_corosync_addresses[Link Address Types]). -CAUTION: A new node cannot hold any VMs, because you would get -conflicts about identical VM IDs. Also, all existing configuration in -`/etc/pve` is overwritten when you join a new node to the cluster. To -workaround, use `vzdump` to backup and restore to a different VMID after -adding the node to the cluster. -To check the state of cluster: +To check the state of the cluster use: +---- # pvecm status +---- .Cluster status after adding 4 nodes ---- -hp2# pvecm status + # pvecm status +Cluster information +~~~~~~~~~~~~~~~~~~~ +Name: prod-central +Config Version: 3 +Transport: knet +Secure auth: on + Quorum information ~~~~~~~~~~~~~~~~~~ -Date: Mon Apr 20 12:30:13 2015 +Date: Tue Sep 14 11:06:47 2021 Quorum provider: corosync_votequorum Nodes: 4 Node ID: 0x00000001 -Ring ID: 1928 +Ring ID: 1.1a8 Quorate: Yes Votequorum information @@ -141,7 +255,7 @@ Votequorum information Expected votes: 4 Highest expected: 4 Total votes: 4 -Quorum: 2 +Quorum: 3 Flags: Quorate Membership information @@ -153,13 +267,15 @@ Membership information 0x00000004 1 192.168.15.94 ---- -If you only want the list of all nodes use: +If you only want a list of all nodes, use: +---- # pvecm nodes +---- .List nodes in a cluster ---- -hp2# pvecm nodes + # pvecm nodes Membership information ~~~~~~~~~~~~~~~~~~~~~~ @@ -170,68 +286,47 @@ Membership information 4 1 hp4 ---- -Adding Nodes With Separated Cluster Network +[[pvecm_adding_nodes_with_separated_cluster_network]] +Adding Nodes with Separated Cluster Network ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -When adding a node to a cluster with a separated cluster network you need to -use the 'ringX_addr' parameters to set the nodes address on those networks: +When adding a node to a cluster with a separated cluster network, you need to +use the 'link0' parameter to set the nodes address on that network: [source,bash] ---- -pvecm add IP-ADDRESS-CLUSTER -ring0_addr IP-ADDRESS-RING0 +# pvecm add IP-ADDRESS-CLUSTER --link0 LOCAL-IP-ADDRESS-LINK0 ---- -If you want to use the Redundant Ring Protocol you will also want to pass the -'ring1_addr' parameter. +If you want to use the built-in xref:pvecm_redundancy[redundancy] of the +Kronosnet transport layer, also use the 'link1' parameter. +Using the GUI, you can select the correct interface from the corresponding +'Link X' fields in the *Cluster Join* dialog. Remove a Cluster Node --------------------- -CAUTION: Read carefully the procedure before proceeding, as it could +CAUTION: Read the procedure carefully before proceeding, as it may not be what you want or need. -Move all virtual machines from the node. Make sure you have no local -data or backups you want to keep, or save them accordingly. - -Log in to one remaining node via ssh. Issue a `pvecm nodes` command to -identify the node ID: - ----- -hp1# pvecm status - -Quorum information -~~~~~~~~~~~~~~~~~~ -Date: Mon Apr 20 12:30:13 2015 -Quorum provider: corosync_votequorum -Nodes: 4 -Node ID: 0x00000001 -Ring ID: 1928 -Quorate: Yes +Move all virtual machines from the node. Ensure that you have made copies of any +local data or backups that you want to keep. In addition, make sure to remove +any scheduled replication jobs to the node to be removed. 
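
As a rough sketch of how this could look, you can list the configured
replication jobs with the `pvesr` tool and delete any job whose target is the
node that is about to be removed (the job ID `100-4` below is only a
placeholder):

[source,bash]
----
# show all replication jobs and their target nodes
pvesr list

# delete a job that replicates a guest to the node being removed
pvesr delete 100-4
----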
-Votequorum information -~~~~~~~~~~~~~~~~~~~~~~ -Expected votes: 4 -Highest expected: 4 -Total votes: 4 -Quorum: 2 -Flags: Quorate +CAUTION: Failure to remove replication jobs to a node before removing said node +will result in the replication job becoming irremovable. Especially note that +replication automatically switches direction if a replicated VM is migrated, so +by migrating a replicated VM from a node to be deleted, replication jobs will be +set up to that node automatically. -Membership information -~~~~~~~~~~~~~~~~~~~~~~ - Nodeid Votes Name -0x00000001 1 192.168.15.91 (local) -0x00000002 1 192.168.15.92 -0x00000003 1 192.168.15.93 -0x00000004 1 192.168.15.94 ----- +In the following example, we will remove the node hp4 from the cluster. -IMPORTANT: at this point you must power off the node to be removed and -make sure that it will not power on again (in the network) as it -is. +Log in to a *different* cluster node (not hp4), and issue a `pvecm nodes` +command to identify the node ID to remove: ---- -hp1# pvecm nodes + hp1# pvecm nodes Membership information ~~~~~~~~~~~~~~~~~~~~~~ @@ -242,33 +337,42 @@ Membership information 4 1 hp4 ---- -Log in to one remaining node via ssh. Issue the delete command (here -deleting node `hp4`): +At this point, you must power off hp4 and ensure that it will not power on +again (in the network) with its current configuration. + +IMPORTANT: As mentioned above, it is critical to power off the node +*before* removal, and make sure that it will *not* power on again +(in the existing cluster network) with its current configuration. +If you power on the node as it is, the cluster could end up broken, +and it could be difficult to restore it to a functioning state. + +After powering off the node hp4, we can safely remove it from the cluster. + +---- hp1# pvecm delnode hp4 + Killing node 4 +---- + +NOTE: At this point, it is possible that you will receive an error message +stating `Could not kill node (error = CS_ERR_NOT_EXIST)`. This does not +signify an actual failure in the deletion of the node, but rather a failure in +corosync trying to kill an offline node. Thus, it can be safely ignored. -If the operation succeeds no output is returned, just check the node -list again with `pvecm nodes` or `pvecm status`. You should see -something like: +Use `pvecm nodes` or `pvecm status` to check the node list again. It should +look something like: ---- hp1# pvecm status -Quorum information -~~~~~~~~~~~~~~~~~~ -Date: Mon Apr 20 12:44:28 2015 -Quorum provider: corosync_votequorum -Nodes: 3 -Node ID: 0x00000001 -Ring ID: 1992 -Quorate: Yes +... Votequorum information ~~~~~~~~~~~~~~~~~~~~~~ Expected votes: 3 Highest expected: 3 Total votes: 3 -Quorum: 3 +Quorum: 2 Flags: Quorate Membership information @@ -279,51 +383,54 @@ Membership information 0x00000003 1 192.168.15.92 ---- -IMPORTANT: as said above, it is very important to power off the node -*before* removal, and make sure that it will *never* power on again -(in the existing cluster network) as it is. +If, for whatever reason, you want this server to join the same cluster again, +you have to: -If you power on the node as it is, your cluster will be screwed up and -it could be difficult to restore a clean cluster state. +* do a fresh install of {pve} on it, -If, for whatever reason, you want that this server joins the same -cluster again, you have to +* then join it, as explained in the previous section. 
-* reinstall {pve} on it from scratch +The configuration files for the removed node will still reside in +'/etc/pve/nodes/hp4'. Recover any configuration you still need and remove the +directory afterwards. -* then join it, as explained in the previous section. +NOTE: After removal of the node, its SSH fingerprint will still reside in the +'known_hosts' of the other nodes. If you receive an SSH error after rejoining +a node with the same IP or hostname, run `pvecm updatecerts` once on the +re-added node to update its fingerprint cluster wide. [[pvecm_separate_node_without_reinstall]] -Separate A Node Without Reinstalling +Separate a Node Without Reinstalling ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ CAUTION: This is *not* the recommended method, proceed with caution. Use the -above mentioned method if you're unsure. +previous method if you're unsure. You can also separate a node from a cluster without reinstalling it from -scratch. But after removing the node from the cluster it will still have -access to the shared storages! This must be resolved before you start removing +scratch. But after removing the node from the cluster, it will still have +access to any shared storage. This must be resolved before you start removing the node from the cluster. A {pve} cluster cannot share the exact same -storage with another cluster, as it leads to VMID conflicts. - -Its suggested that you create a new storage where only the node which you want -to separate has access. This can be an new export on your NFS or a new Ceph -pool, to name a few examples. Its just important that the exact same storage -does not gets accessed by multiple clusters. After setting this storage up move -all data from the node and its VMs to it. Then you are ready to separate the +storage with another cluster, as storage locking doesn't work over the cluster +boundary. Furthermore, it may also lead to VMID conflicts. + +It's suggested that you create a new storage, where only the node which you want +to separate has access. This can be a new export on your NFS or a new Ceph +pool, to name a few examples. It's just important that the exact same storage +does not get accessed by multiple clusters. After setting up this storage, move +all data and VMs from the node to it. Then you are ready to separate the node from the cluster. -WARNING: Ensure all shared resources are cleanly separated! You will run into -conflicts and problems else. +WARNING: Ensure that all shared resources are cleanly separated! Otherwise you +will run into conflicts and problems. -First stop the corosync and the pve-cluster services on the node: +First, stop the corosync and pve-cluster services on the node: [source,bash] ---- systemctl stop pve-cluster systemctl stop corosync ---- -Start the cluster filesystem again in local mode: +Start the cluster file system again in local mode: [source,bash] ---- pmxcfs -l @@ -333,35 +440,35 @@ Delete the corosync configuration files: [source,bash] ---- rm /etc/pve/corosync.conf -rm /etc/corosync/* +rm -r /etc/corosync/* ---- -You can now start the filesystem again as normal service: +You can now start the file system again as a normal service: [source,bash] ---- killall pmxcfs systemctl start pve-cluster ---- -The node is now separated from the cluster. You can deleted it from a remaining -node of the cluster with: +The node is now separated from the cluster. 
You can deleted it from any +remaining node of the cluster with: [source,bash] ---- pvecm delnode oldnode ---- -If the command failed, because the remaining node in the cluster lost quorum -when the now separate node exited, you may set the expected votes to 1 as a workaround: +If the command fails due to a loss of quorum in the remaining node, you can set +the expected votes to 1 as a workaround: [source,bash] ---- pvecm expected 1 ---- -And the repeat the 'pvecm delnode' command. +And then repeat the 'pvecm delnode' command. -Now switch back to the separated node, here delete all remaining files left -from the old cluster. This ensures that the node can be added to another -cluster again without problems. +Now switch back to the separated node and delete all the remaining cluster +files on it. This ensures that the node can be added to another cluster again +without problems. [source,bash] ---- @@ -369,15 +476,16 @@ rm /var/lib/corosync/* ---- As the configuration files from the other nodes are still in the cluster -filesystem you may want to clean those up too. Remove simply the whole -directory recursive from '/etc/pve/nodes/NODENAME', but check three times that -you used the correct one before deleting it. +file system, you may want to clean those up too. After making absolutely sure +that you have the correct node name, you can simply remove the entire +directory recursively from '/etc/pve/nodes/NODENAME'. -CAUTION: The nodes SSH keys are still in the 'authorized_key' file, this means -the nodes can still connect to each other with public key authentication. This -should be fixed by removing the respective keys from the +CAUTION: The node's SSH keys will remain in the 'authorized_key' file. This +means that the nodes can still connect to each other with public key +authentication. You should fix this by removing the respective keys from the '/etc/pve/priv/authorized_keys' file. + Quorum ------ @@ -397,105 +505,100 @@ if it loses quorum. NOTE: {pve} assigns a single vote to each node by default. + Cluster Network --------------- The cluster network is the core of a cluster. All messages sent over it have to -be delivered reliable to all nodes in their respective order. In {pve} this -part is done by corosync, an implementation of a high performance low overhead -high availability development toolkit. It serves our decentralized -configuration file system (`pmxcfs`). +be delivered reliably to all nodes in their respective order. In {pve} this +part is done by corosync, an implementation of a high performance, low overhead, +high availability development toolkit. It serves our decentralized configuration +file system (`pmxcfs`). -[[cluster-network-requirements]] +[[pvecm_cluster_network_requirements]] Network Requirements ~~~~~~~~~~~~~~~~~~~~ -This needs a reliable network with latencies under 2 milliseconds (LAN -performance) to work properly. While corosync can also use unicast for -communication between nodes its **highly recommended** to have a multicast -capable network. The network should not be used heavily by other members, -ideally corosync runs on its own network. -*never* share it with network where storage communicates too. -Before setting up a cluster it is good practice to check if the network is fit -for that purpose. +The {pve} cluster stack requires a reliable network with latencies under 5 +milliseconds (LAN performance) between all nodes to operate stably. 
While on +setups with a small node count a network with higher latencies _may_ work, this +is not guaranteed and gets rather unlikely with more than three nodes and +latencies above around 10 ms. -* Ensure that all nodes are in the same subnet. This must only be true for the - network interfaces used for cluster communication (corosync). +The network should not be used heavily by other members, as while corosync does +not uses much bandwidth it is sensitive to latency jitters; ideally corosync +runs on its own physically separated network. Especially do not use a shared +network for corosync and storage (except as a potential low-priority fallback +in a xref:pvecm_redundancy[redundant] configuration). -* Ensure all nodes can reach each other over those interfaces, using `ping` is - enough for a basic test. +Before setting up a cluster, it is good practice to check if the network is fit +for that purpose. To ensure that the nodes can connect to each other on the +cluster network, you can test the connectivity between them with the `ping` +tool. -* Ensure that multicast works in general and a high package rates. This can be - done with the `omping` tool. The final "%loss" number should be < 1%. -[source,bash] ----- -omping -c 10000 -i 0.001 -F -q NODE1-IP NODE2-IP ... ----- +If the {pve} firewall is enabled, ACCEPT rules for corosync will automatically +be generated - no manual action is required. -* Ensure that multicast communication works over an extended period of time. - This covers up problems where IGMP snooping is activated on the network but - no multicast querier is active. This test has a duration of around 10 - minutes. -[source,bash] ----- -omping -c 600 -i 1 -q NODE1-IP NODE2-IP ... ----- - -Your network is not ready for clustering if any of these test fails. Recheck -your network configuration. Especially switches are notorious for having -multicast disabled by default or IGMP snooping enabled with no IGMP querier -active. +NOTE: Corosync used Multicast before version 3.0 (introduced in {pve} 6.0). +Modern versions rely on https://kronosnet.org/[Kronosnet] for cluster +communication, which, for now, only supports regular UDP unicast. -In smaller cluster its also an option to use unicast if you really cannot get -multicast to work. +CAUTION: You can still enable Multicast or legacy unicast by setting your +transport to `udp` or `udpu` in your xref:pvecm_edit_corosync_conf[corosync.conf], +but keep in mind that this will disable all cryptography and redundancy support. +This is therefore not recommended. Separate Cluster Network ~~~~~~~~~~~~~~~~~~~~~~~~ -When creating a cluster without any parameters the cluster network is generally -shared with the Web UI and the VMs and its traffic. Depending on your setup -even storage traffic may get sent over the same network. Its recommended to -change that, as corosync is a time critical real time application. +When creating a cluster without any parameters, the corosync cluster network is +generally shared with the web interface and the VMs' network. Depending on +your setup, even storage traffic may get sent over the same network. It's +recommended to change that, as corosync is a time-critical, real-time +application. -Setting Up A New Network +Setting Up a New Network ^^^^^^^^^^^^^^^^^^^^^^^^ -First you have to setup a new network interface. It should be on a physical +First, you have to set up a new network interface. It should be on a physically separate network. Ensure that your network fulfills the -<>. 
+xref:pvecm_cluster_network_requirements[cluster network requirements]. Separate On Cluster Creation ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -This is possible through the 'ring0_addr' and 'bindnet0_addr' parameter of -the 'pvecm create' command used for creating a new cluster. +This is possible via the 'linkX' parameters of the 'pvecm create' +command, used for creating a new cluster. -If you have setup a additional NIC with a static address on 10.10.10.1/25 -and want to send and receive all cluster communication over this interface +If you have set up an additional NIC with a static address on 10.10.10.1/25, +and want to send and receive all cluster communication over this interface, you would execute: [source,bash] ---- -pvecm create test --ring0_addr 10.10.10.1 --bindnet0_addr 10.10.10.0 +pvecm create test --link0 10.10.10.1 ---- -To check if everything is working properly execute: +To check if everything is working properly, execute: [source,bash] ---- systemctl status corosync ---- -[[separate-cluster-net-after-creation]] +Afterwards, proceed as described above to +xref:pvecm_adding_nodes_with_separated_cluster_network[add nodes with a separated cluster network]. + +[[pvecm_separate_cluster_net_after_creation]] Separate After Cluster Creation ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -You can do this also if you have already created a cluster and want to switch +You can do this if you have already created a cluster and want to switch its communication to another network, without rebuilding the whole cluster. -This change may lead to short durations of quorum loss in the cluster, as nodes +This change may lead to short periods of quorum loss in the cluster, as nodes have to restart corosync and come up one after the other on the new network. -Check how to <> first. -The open it and you should see a file similar to: +Check how to xref:pvecm_edit_corosync_conf[edit the corosync.conf file] first. +Then, open it and you should see a file similar to: ---- logging { @@ -533,36 +636,41 @@ quorum { } totem { - cluster_name: thomas-testcluster + cluster_name: testcluster config_version: 3 - ip_version: ipv4 + ip_version: ipv4-6 secauth: on version: 2 interface { - bindnetaddr: 192.168.30.50 - ringnumber: 0 + linknumber: 0 } } ---- -The first you want to do is add the 'name' properties in the node entries if -you do not see them already. Those *must* match the node name. +NOTE: `ringX_addr` actually specifies a corosync *link address*. The name "ring" +is a remnant of older corosync versions that is kept for backwards +compatibility. + +The first thing you want to do is add the 'name' properties in the node entries, +if you do not see them already. Those *must* match the node name. -Then replace the address from the 'ring0_addr' properties with the new -addresses. You may use plain IP addresses or also hostnames here. If you use -hostnames ensure that they are resolvable from all nodes. +Then replace all addresses from the 'ring0_addr' properties of all nodes with +the new addresses. You may use plain IP addresses or hostnames here. If you use +hostnames, ensure that they are resolvable from all nodes (see also +xref:pvecm_corosync_addresses[Link Address Types]). -In my example I want to switch my cluster communication to the 10.10.10.1/25 -network. So I replace all 'ring0_addr' respectively. I also set the bindetaddr -in the totem section of the config to an address of the new network. It can be -any address from the subnet configured on the new network interface. 
+In this example, we want to switch cluster communication to the +10.10.10.0/25 network, so we change the 'ring0_addr' of each node respectively. -After you increased the 'config_version' property the new configuration file +NOTE: The exact same procedure can be used to change other 'ringX_addr' values +as well. However, we recommend only changing one link address at a time, so +that it's easier to recover if something goes wrong. + +After we increase the 'config_version' property, the new configuration file should look like: ---- - logging { debug: off to_syslog: yes @@ -598,209 +706,536 @@ quorum { } totem { - cluster_name: thomas-testcluster + cluster_name: testcluster config_version: 4 - ip_version: ipv4 + ip_version: ipv4-6 secauth: on version: 2 interface { - bindnetaddr: 10.10.10.1 - ringnumber: 0 + linknumber: 0 } } ---- -Now after a final check whether all changed information is correct we save it -and see again the <> section to -learn how to bring it in effect. +Then, after a final check to see that all changed information is correct, we +save it and once again follow the +xref:pvecm_edit_corosync_conf[edit corosync.conf file] section to bring it into +effect. -As our change cannot be enforced live from corosync we have to do an restart. +The changes will be applied live, so restarting corosync is not strictly +necessary. If you changed other settings as well, or notice corosync +complaining, you can optionally trigger a restart. On a single node execute: + [source,bash] ---- systemctl restart corosync ---- -Now check if everything is fine: +Now check if everything is okay: [source,bash] ---- systemctl status corosync ---- -If corosync runs again correct restart corosync also on all other nodes. +If corosync begins to work again, restart it on all other nodes too. They will then join the cluster membership one by one on the new network. -Redundant Ring Protocol -~~~~~~~~~~~~~~~~~~~~~~~ -To avoid a single point of failure you should implement counter measurements. -This can be on the hardware and operating system level through network bonding. +[[pvecm_corosync_addresses]] +Corosync Addresses +~~~~~~~~~~~~~~~~~~ + +A corosync link address (for backwards compatibility denoted by 'ringX_addr' in +`corosync.conf`) can be specified in two ways: -Corosync itself offers also a possibility to add redundancy through the so -called 'Redundant Ring Protocol'. This protocol allows running a second totem -ring on another network, this network should be physically separated from the -other rings network to actually increase availability. +* **IPv4/v6 addresses** can be used directly. They are recommended, since they +are static and usually not changed carelessly. -RRP On Cluster Creation -~~~~~~~~~~~~~~~~~~~~~~~ +* **Hostnames** will be resolved using `getaddrinfo`, which means that by +default, IPv6 addresses will be used first, if available (see also +`man gai.conf`). Keep this in mind, especially when upgrading an existing +cluster to IPv6. -The 'pvecm create' command provides the additional parameters 'bindnetX_addr', -'ringX_addr' and 'rrp_mode', can be used for RRP configuration. +CAUTION: Hostnames should be used with care, since the addresses they +resolve to can be changed without touching corosync or the node it runs on - +which may lead to a situation where an address is changed without thinking +about implications for corosync. -NOTE: See the <> if you do not know what each parameter means. 
+A separate, static hostname specifically for corosync is recommended, if +hostnames are preferred. Also, make sure that every node in the cluster can +resolve all hostnames correctly. -So if you have two networks, one on the 10.10.10.1/24 and the other on the -10.10.20.1/24 subnet you would execute: +Since {pve} 5.1, while supported, hostnames will be resolved at the time of +entry. Only the resolved IP is saved to the configuration. + +Nodes that joined the cluster on earlier versions likely still use their +unresolved hostname in `corosync.conf`. It might be a good idea to replace +them with IPs or a separate hostname, as mentioned above. + + +[[pvecm_redundancy]] +Corosync Redundancy +------------------- + +Corosync supports redundant networking via its integrated Kronosnet layer by +default (it is not supported on the legacy udp/udpu transports). It can be +enabled by specifying more than one link address, either via the '--linkX' +parameters of `pvecm`, in the GUI as **Link 1** (while creating a cluster or +adding a new node) or by specifying more than one 'ringX_addr' in +`corosync.conf`. + +NOTE: To provide useful failover, every link should be on its own +physical network connection. + +Links are used according to a priority setting. You can configure this priority +by setting 'knet_link_priority' in the corresponding interface section in +`corosync.conf`, or, preferably, using the 'priority' parameter when creating +your cluster with `pvecm`: -[source,bash] ---- -pvecm create CLUSTERNAME -bindnet0_addr 10.10.10.1 -ring0_addr 10.10.10.1 \ --bindnet1_addr 10.10.20.1 -ring1_addr 10.10.20.1 + # pvecm create CLUSTERNAME --link0 10.10.10.1,priority=15 --link1 10.20.20.1,priority=20 ---- -RRP On A Created Cluster -~~~~~~~~~~~~~~~~~~~~~~~~ +This would cause 'link1' to be used first, since it has the higher priority. + +If no priorities are configured manually (or two links have the same priority), +links will be used in order of their number, with the lower number having higher +priority. + +Even if all links are working, only the one with the highest priority will see +corosync traffic. Link priorities cannot be mixed, meaning that links with +different priorities will not be able to communicate with each other. + +Since lower priority links will not see traffic unless all higher priorities +have failed, it becomes a useful strategy to specify networks used for +other tasks (VMs, storage, etc.) as low-priority links. If worst comes to +worst, a higher latency or more congested connection might be better than no +connection at all. + +Adding Redundant Links To An Existing Cluster +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -When enabling an already running cluster to use RRP you will take similar steps -as describe in -<>. You -just do it on another ring. +To add a new link to a running configuration, first check how to +xref:pvecm_edit_corosync_conf[edit the corosync.conf file]. -First add a new `interface` subsection in the `totem` section, set its -`ringnumber` property to `1`. Set the interfaces `bindnetaddr` property to an -address of the subnet you have configured for your new ring. -Further set the `rrp_mode` to `passive`, this is the only stable mode. +Then, add a new 'ringX_addr' to every node in the `nodelist` section. Make +sure that your 'X' is the same for every node you add it to, and that it is +unique for each node. -Then add to each node entry in the `nodelist` section its new `ring1_addr` -property with the nodes additional ring address. 
+Lastly, add a new 'interface', as shown below, to your `totem` +section, replacing 'X' with the link number chosen above. -So if you have two networks, one on the 10.10.10.1/24 and the other on the -10.10.20.1/24 subnet, the final configuration file should look like: +Assuming you added a link with number 1, the new configuration file could look +like this: ---- -totem { - cluster_name: tweak - config_version: 9 - ip_version: ipv4 - rrp_mode: passive - secauth: on - version: 2 - interface { - bindnetaddr: 10.10.10.1 - ringnumber: 0 - } - interface { - bindnetaddr: 10.10.20.1 - ringnumber: 1 - } +logging { + debug: off + to_syslog: yes } nodelist { + node { - name: pvecm1 - nodeid: 1 + name: due + nodeid: 2 quorum_votes: 1 - ring0_addr: 10.10.10.1 - ring1_addr: 10.10.20.1 + ring0_addr: 10.10.10.2 + ring1_addr: 10.20.20.2 } - node { - name: pvecm2 - nodeid: 2 + node { + name: tre + nodeid: 3 quorum_votes: 1 - ring0_addr: 10.10.10.2 - ring1_addr: 10.10.20.2 + ring0_addr: 10.10.10.3 + ring1_addr: 10.20.20.3 } - [...] # other cluster nodes here + node { + name: uno + nodeid: 1 + quorum_votes: 1 + ring0_addr: 10.10.10.1 + ring1_addr: 10.20.20.1 + } + +} + +quorum { + provider: corosync_votequorum +} + +totem { + cluster_name: testcluster + config_version: 4 + ip_version: ipv4-6 + secauth: on + version: 2 + interface { + linknumber: 0 + } + interface { + linknumber: 1 + } } +---- + +The new link will be enabled as soon as you follow the last steps to +xref:pvecm_edit_corosync_conf[edit the corosync.conf file]. A restart should not +be necessary. You can check that corosync loaded the new link using: + +---- +journalctl -b -u corosync +---- + +It might be a good idea to test the new link by temporarily disconnecting the +old link on one node and making sure that its status remains online while +disconnected: + +---- +pvecm status +---- + +If you see a healthy cluster state, it means that your new link is being used. + + +Role of SSH in {pve} Clusters +----------------------------- + +{pve} utilizes SSH tunnels for various features. + +* Proxying console/shell sessions (node and guests) ++ +When using the shell for node B while being connected to node A, connects to a +terminal proxy on node A, which is in turn connected to the login shell on node +B via a non-interactive SSH tunnel. + +* VM and CT memory and local-storage migration in 'secure' mode. ++ +During the migration, one or more SSH tunnel(s) are established between the +source and target nodes, in order to exchange migration information and +transfer memory and disk contents. + +* Storage replication + +SSH setup +~~~~~~~~~ + +On {pve} systems, the following changes are made to the SSH configuration/setup: + +* the `root` user's SSH client config gets setup to prefer `AES` over `ChaCha20` + +* the `root` user's `authorized_keys` file gets linked to + `/etc/pve/priv/authorized_keys`, merging all authorized keys within a cluster + +* `sshd` is configured to allow logging in as root with a password + +NOTE: Older systems might also have `/etc/ssh/ssh_known_hosts` set up as symlink +pointing to `/etc/pve/priv/known_hosts`, containing a merged version of all +node host keys. This system was replaced with explicit host key pinning in +`pve-cluster <>`, the symlink can be deconfigured if still in +place by running `pvecm updatecerts --unmerge-known-hosts`. 
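
If you want to verify this setup on a node, the following sketch shows one way
to inspect it; the paths are those a standard {pve} installation is expected to
use, and the exact output may differ on your system:

[source,bash]
----
# the root user's authorized_keys should resolve into the cluster file system
readlink -f /root/.ssh/authorized_keys

# the merged file contains the public keys of all cluster nodes
cat /etc/pve/priv/authorized_keys
----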
+ +Pitfalls due to automatic execution of `.bashrc` and siblings +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +In case you have a custom `.bashrc`, or similar files that get executed on +login by the configured shell, `ssh` will automatically run it once the session +is established successfully. This can cause some unexpected behavior, as those +commands may be executed with root permissions on any of the operations +described above. This can cause possible problematic side-effects! + +In order to avoid such complications, it's recommended to add a check in +`/root/.bashrc` to make sure the session is interactive, and only then run +`.bashrc` commands. + +You can add this snippet at the beginning of your `.bashrc` file: + +---- +# Early exit if not running interactively to avoid side-effects! +case $- in + *i*) ;; + *) return;; +esac +---- + +Corosync External Vote Support +------------------------------ + +This section describes a way to deploy an external voter in a {pve} cluster. +When configured, the cluster can sustain more node failures without +violating safety properties of the cluster communication. + +For this to work, there are two services involved: + +* A QDevice daemon which runs on each {pve} node + +* An external vote daemon which runs on an independent server + +As a result, you can achieve higher availability, even in smaller setups (for +example 2+1 nodes). + +QDevice Technical Overview +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The Corosync Quorum Device (QDevice) is a daemon which runs on each cluster +node. It provides a configured number of votes to the cluster's quorum +subsystem, based on an externally running third-party arbitrator's decision. +Its primary use is to allow a cluster to sustain more node failures than +standard quorum rules allow. This can be done safely as the external device +can see all nodes and thus choose only one set of nodes to give its vote. +This will only be done if said set of nodes can have quorum (again) after +receiving the third-party vote. + +Currently, only 'QDevice Net' is supported as a third-party arbitrator. This is +a daemon which provides a vote to a cluster partition, if it can reach the +partition members over the network. It will only give votes to one partition +of a cluster at any time. +It's designed to support multiple clusters and is almost configuration and +state free. New clusters are handled dynamically and no configuration file +is needed on the host running a QDevice. + +The only requirements for the external host are that it needs network access to +the cluster and to have a corosync-qnetd package available. We provide a package +for Debian based hosts, and other Linux distributions should also have a package +available through their respective package manager. + +NOTE: Unlike corosync itself, a QDevice connects to the cluster over TCP/IP. +The daemon can also run outside the LAN of the cluster and isn't limited to the +low latencies requirements of corosync. + +Supported Setups +~~~~~~~~~~~~~~~~ + +We support QDevices for clusters with an even number of nodes and recommend +it for 2 node clusters, if they should provide higher availability. +For clusters with an odd node count, we currently discourage the use of +QDevices. The reason for this is the difference in the votes which the QDevice +provides for each cluster type. Even numbered clusters get a single additional +vote, which only increases availability, because if the QDevice +itself fails, you are in the same position as with no QDevice at all. 
+ +On the other hand, with an odd numbered cluster size, the QDevice provides +'(N-1)' votes -- where 'N' corresponds to the cluster node count. This +alternative behavior makes sense; if it had only one additional vote, the +cluster could get into a split-brain situation. This algorithm allows for all +nodes but one (and naturally the QDevice itself) to fail. However, there are two +drawbacks to this: + +* If the QNet daemon itself fails, no other node may fail or the cluster + immediately loses quorum. For example, in a cluster with 15 nodes, 7 + could fail before the cluster becomes inquorate. But, if a QDevice is + configured here and it itself fails, **no single node** of the 15 may fail. + The QDevice acts almost as a single point of failure in this case. + +* The fact that all but one node plus QDevice may fail sounds promising at + first, but this may result in a mass recovery of HA services, which could + overload the single remaining node. Furthermore, a Ceph server will stop + providing services if only '((N-1)/2)' nodes or less remain online. + +If you understand the drawbacks and implications, you can decide yourself if +you want to use this technology in an odd numbered cluster setup. -[...] # other remaining config sections here +QDevice-Net Setup +~~~~~~~~~~~~~~~~~ + +We recommend running any daemon which provides votes to corosync-qdevice as an +unprivileged user. {pve} and Debian provide a package which is already +configured to do so. +The traffic between the daemon and the cluster must be encrypted to ensure a +safe and secure integration of the QDevice in {pve}. + +First, install the 'corosync-qnetd' package on your external server + +---- +external# apt install corosync-qnetd +---- + +and the 'corosync-qdevice' package on all cluster nodes + +---- +pve# apt install corosync-qdevice +---- + +After doing this, ensure that all the nodes in the cluster are online. + +You can now set up your QDevice by running the following command on one +of the {pve} nodes: + +---- +pve# pvecm qdevice setup +---- + +The SSH key from the cluster will be automatically copied to the QDevice. + +NOTE: Make sure to setup key-based access for the root user on your external +server, or temporarily allow root login with password during the setup phase. +If you receive an error such as 'Host key verification failed.' at this +stage, running `pvecm updatecerts` could fix the issue. + +After all the steps have successfully completed, you will see "Done". You can +verify that the QDevice has been set up with: ---- +pve# pvecm status + +... + +Votequorum information +~~~~~~~~~~~~~~~~~~~~~ +Expected votes: 3 +Highest expected: 3 +Total votes: 3 +Quorum: 2 +Flags: Quorate Qdevice + +Membership information +~~~~~~~~~~~~~~~~~~~~~~ + Nodeid Votes Qdevice Name + 0x00000001 1 A,V,NMW 192.168.22.180 (local) + 0x00000002 1 A,V,NMW 192.168.22.181 + 0x00000000 1 Qdevice + +---- + +[[pvecm_qdevice_status_flags]] +QDevice Status Flags +^^^^^^^^^^^^^^^^^^^^ + +The status output of the QDevice, as seen above, will usually contain three +columns: + +* `A` / `NA`: Alive or Not Alive. Indicates if the communication to the external + `corosync-qnetd` daemon works. +* `V` / `NV`: If the QDevice will cast a vote for the node. In a split-brain + situation, where the corosync connection between the nodes is down, but they + both can still communicate with the external `corosync-qnetd` daemon, + only one node will get the vote. +* `MW` / `NMW`: Master wins (`MV`) or not (`NMW`). 
Default is `NMW`, see + footnote:[`votequorum_qdevice_master_wins` manual page + https://manpages.debian.org/bookworm/libvotequorum-dev/votequorum_qdevice_master_wins.3.en.html]. +* `NR`: QDevice is not registered. + +NOTE: If your QDevice is listed as `Not Alive` (`NA` in the output above), +ensure that port `5403` (the default port of the qnetd server) of your external +server is reachable via TCP/IP! -Bring it in effect like described in the -<> section. -This is a change which cannot take live in effect and needs at least a restart -of corosync. Recommended is a restart of the whole cluster. +Frequently Asked Questions +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Tie Breaking +^^^^^^^^^^^^ + +In case of a tie, where two same-sized cluster partitions cannot see each other +but can see the QDevice, the QDevice chooses one of those partitions randomly +and provides a vote to it. + +Possible Negative Implications +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +For clusters with an even node count, there are no negative implications when +using a QDevice. If it fails to work, it is the same as not having a QDevice +at all. + +Adding/Deleting Nodes After QDevice Setup +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +If you want to add a new node or remove an existing one from a cluster with a +QDevice setup, you need to remove the QDevice first. After that, you can add or +remove nodes normally. Once you have a cluster with an even node count again, +you can set up the QDevice again as described previously. + +Removing the QDevice +^^^^^^^^^^^^^^^^^^^^ + +If you used the official `pvecm` tool to add the QDevice, you can remove it +by running: + +---- +pve# pvecm qdevice remove +---- + +//Still TODO +//^^^^^^^^^^ +//There is still stuff to add here -If you cannot reboot the whole cluster ensure no High Availability services are -configured and the stop the corosync service on all nodes. After corosync is -stopped on all nodes start it one after the other again. Corosync Configuration ---------------------- -The `/ect/pve/corosync.conf` file plays a central role in {pve} cluster. It -controls the cluster member ship and its network. -For reading more about it check the corosync.conf man page: +The `/etc/pve/corosync.conf` file plays a central role in a {pve} cluster. It +controls the cluster membership and its network. +For further information about it, check the corosync.conf man page: [source,bash] ---- man corosync.conf ---- -For node membership you should always use the `pvecm` tool provided by {pve}. +For node membership, you should always use the `pvecm` tool provided by {pve}. You may have to edit the configuration file manually for other changes. Here are a few best practice tips for doing this. -[[edit-corosync-conf]] +[[pvecm_edit_corosync_conf]] Edit corosync.conf ~~~~~~~~~~~~~~~~~~ -Editing the corosync.conf file can be not always straight forward. There are -two on each cluster, one in `/etc/pve/corosync.conf` and the other in +Editing the corosync.conf file is not always very straightforward. There are +two on each cluster node, one in `/etc/pve/corosync.conf` and the other in `/etc/corosync/corosync.conf`. Editing the one in our cluster file system will propagate the changes to the local one, but not vice versa. -The configuration will get updated automatically as soon as the file changes. -This means changes which can be integrated in a running corosync will take -instantly effect. So you should always make a copy and edit that instead, to -avoid triggering some unwanted changes by an in between safe. 
+The configuration will get updated automatically, as soon as the file changes. +This means that changes which can be integrated in a running corosync will take +effect immediately. Thus, you should always make a copy and edit that instead, +to avoid triggering unintended changes when saving the file while editing. [source,bash] ---- cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new ---- -Then open the Config file with your favorite editor, `nano` and `vim.tiny` are -preinstalled on {pve} for example. +Then, open the config file with your favorite editor, such as `nano` or +`vim.tiny`, which come pre-installed on every {pve} node. -NOTE: Always increment the 'config_version' number on configuration changes, +NOTE: Always increment the 'config_version' number after configuration changes; omitting this can lead to problems. -After making the necessary changes create another copy of the current working +After making the necessary changes, create another copy of the current working configuration file. This serves as a backup if the new configuration fails to -apply or makes problems in other ways. +apply or causes other issues. [source,bash] ---- cp /etc/pve/corosync.conf /etc/pve/corosync.conf.bak ---- -Then move the new configuration file over the old one: +Then replace the old configuration file with the new one: [source,bash] ---- mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf ---- -You may check with the commands +You can check if the changes could be applied automatically, using the following +commands: [source,bash] ---- systemctl status corosync journalctl -b -u corosync ---- -If the change could applied automatically. If not you may have to restart the +If the changes could not be applied automatically, you may have to restart the corosync service via: [source,bash] ---- systemctl restart corosync ---- -On errors check the troubleshooting section below. +On errors, check the troubleshooting section below. Troubleshooting ~~~~~~~~~~~~~~~ @@ -818,48 +1253,36 @@ corosync[1647]: [SERV ] Service engine 'corosync_quorum' failed to load for re [...] ---- -It means that the hostname you set for corosync 'ringX_addr' in the +It means that the hostname you set for a corosync 'ringX_addr' in the configuration could not be resolved. - Write Configuration When Not Quorate ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -If you need to change '/etc/pve/corosync.conf' on an node with no quorum, and you -know what you do, use: +If you need to change '/etc/pve/corosync.conf' on a node with no quorum, and you +understand what you are doing, use: [source,bash] ---- pvecm expected 1 ---- This sets the expected vote count to 1 and makes the cluster quorate. You can -now fix your configuration, or revert it back to the last working backup. +then fix your configuration, or revert it back to the last working backup. -This is not enough if corosync cannot start anymore. Here its best to edit the -local copy of the corosync configuration in '/etc/corosync/corosync.conf' so -that corosync can start again. Ensure that on all nodes this configuration has -the same content to avoid split brains. If you are not sure what went wrong -it's best to ask the Proxmox Community to help you. +This is not enough if corosync cannot start anymore. In that case, it is best to +edit the local copy of the corosync configuration in +'/etc/corosync/corosync.conf', so that corosync can start again. Ensure that on +all nodes, this configuration has the same content to avoid split-brain +situations. 
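
One possible way to check that all nodes really ended up with identical local
configurations, and that corosync comes back up afterwards, is sketched below;
the peer address `10.10.10.2` is just an example for another cluster node:

[source,bash]
----
# compare the local corosync configuration between nodes
md5sum /etc/corosync/corosync.conf
ssh root@10.10.10.2 md5sum /etc/corosync/corosync.conf

# start corosync again and watch its log output
systemctl restart corosync
journalctl -b -u corosync
----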
-[[corosync-conf-glossary]] +[[pvecm_corosync_conf_glossary]] Corosync Configuration Glossary ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ringX_addr:: -This names the different ring addresses for the corosync totem rings used for -the cluster communication. - -bindnetaddr:: -Defines to which interface the ring should bind to. It may be any address of -the subnet configured on the interface we want to use. In general its the -recommended to just use an address a node uses on this interface. - -rrp_mode:: -Specifies the mode of the redundant ring protocol and may be passive, active or -none. Note that use of active is highly experimental and not official -supported. Passive is the preferred mode, it may double the cluster -communication throughput and increases availability. +This names the different link addresses for the Kronosnet connections between +nodes. Cluster Cold Start @@ -872,112 +1295,148 @@ NOTE: It is always a good idea to use an uninterruptible power supply (``UPS'', also called ``battery backup'') to avoid this state, especially if you want HA. -On node startup, service `pve-manager` is started and waits for +On node startup, the `pve-guests` service is started and waits for quorum. Once quorate, it starts all guests which have the `onboot` flag set. When you turn on nodes, or when power comes back after power failure, -it is likely that some nodes boots faster than others. Please keep in +it is likely that some nodes will boot faster than others. Please keep in mind that guest startup is delayed until you reach quorum. + +[[pvecm_next_id_range]] +Guest VMID Auto-Selection +------------------------ + +When creating new guests the web interface will ask the backend for a free VMID +automatically. The default range for searching is `100` to `1000000` (lower +than the maximal allowed VMID enforced by the schema). + +Sometimes admins either want to allocate new VMIDs in a separate range, for +example to easily separate temporary VMs with ones that choose a VMID manually. +Other times its just desired to provided a stable length VMID, for which +setting the lower boundary to, for example, `100000` gives much more room for. + +To accommodate this use case one can set either lower, upper or both boundaries +via the `datacenter.cfg` configuration file, which can be edited in the web +interface under 'Datacenter' -> 'Options'. + +NOTE: The range is only used for the next-id API call, so it isn't a hard +limit. + Guest Migration --------------- -Migrating Virtual Guests (live) to other nodes is a useful feature in a -cluster. There exist settings to control the behavior of such migrations. -This can be done cluster wide via the 'datacenter.cfg' configuration file or -also for a single migration through API or command line tool parameters. +Migrating virtual guests to other nodes is a useful feature in a +cluster. There are settings to control the behavior of such +migrations. This can be done via the configuration file +`datacenter.cfg` or for a specific migration via API or command-line +parameters. + +It makes a difference if a guest is online or offline, or if it has +local resources (like a local disk). + +For details about virtual machine migration, see the +xref:qm_migration[QEMU/KVM Migration Chapter]. + +For details about container migration, see the +xref:pct_migration[Container Migration Chapter]. Migration Type ~~~~~~~~~~~~~~ -The migration type defines if the migration data should be sent over a -encrypted ('secure') channel or an unencrypted ('insecure') one. 
-Setting the migration type to insecure means that the RAM content of a
-Virtual Guest gets also transfered unencrypted, which can lead to
-information disclosure of critical data from inside the guest for example
-passwords or encryption keys.
-Thus we strongly recommend to use the secure channel if you have not full
-control over the network and cannot guarantee that no one is eavesdropping
-on it.
+The migration type defines if the migration data should be sent over an
+encrypted (`secure`) channel or an unencrypted (`insecure`) one.
+Setting the migration type to `insecure` means that the RAM content of a
+virtual guest is also transferred unencrypted, which can lead to
+information disclosure of critical data from inside the guest (for
+example, passwords or encryption keys).

-Note that storage migration do not obey this setting, they will always send
-the content over an secure channel currently.
+Therefore, we strongly recommend using the secure channel if you do
+not have full control over the network and cannot guarantee that no
+one is eavesdropping on it.

-While this setting is often changed to 'insecure' in favor of gaining better
-performance on migrations it may actually have an small impact on systems
-with AES encryption hardware support in the CPU. This impact can get bigger
-if the network link can transmit 10Gbps or more.
+NOTE: Storage migration does not follow this setting. Currently, it
+always sends the storage content over a secure channel.
+
+Encryption requires a lot of computing power, so this setting is often
+changed to `insecure` to achieve better performance. The impact on
+modern systems is lower because they implement AES encryption in
+hardware. The performance impact is particularly evident in fast
+networks, where you can transfer 10 Gbps or more.


Migration Network
~~~~~~~~~~~~~~~~~

-By default {pve} uses the network where the cluster communication happens
-for sending the migration traffic. This is may be suboptimal, for one the
-sensible cluster traffic can be disturbed and on the other hand it may not
-have the best bandwidth available from all network interfaces on the node.
-Setting the migration network parameter allows using a dedicated network for
-sending all the migration traffic when migrating a guest system. This
-includes the traffic for offline storage migrations.
-
-The migration network is represented as a network in 'CIDR' notation. This
-has the advantage that you do not need to set a IP for each node, {pve} is
-able to figure out the real address from the given CIDR denoted network and
-the networks configured on the target node.
-To let this work the network must be specific enough, i.e. each node must
-have one and only one IP configured in the given network.
+By default, {pve} uses the network in which cluster communication
+takes place to send the migration traffic. This is not optimal, both because
+sensitive cluster traffic can be disrupted and because this network may not
+have the best bandwidth available on the node.
+
+Setting the migration network parameter allows the use of a dedicated
+network for all migration traffic. In addition to the guest's memory,
+this also affects the storage traffic for offline migrations.
+
+The migration network is set as a network using CIDR notation. This
+has the advantage that you don't have to set individual IP addresses
+for each node. {pve} can determine the real address on the
+destination node from the network specified in the CIDR form. To
+enable this, the network must be specified so that each node has exactly one
+IP in the respective network.

Example
^^^^^^^

-Lets assume that we have a three node setup with three networks, one for the
-public communication with the Internet, one for the cluster communication
-and one very fast one, which we want to use as an dedicated migration
-network. A network configuration for such a setup could look like:
+We assume that we have a three-node setup, with three separate
+networks: one for public communication with the Internet, one for
+cluster communication, and a very fast one, which we want to use as a
+dedicated network for migration.
+
+A network configuration for such a setup might look as follows:

----
-iface eth0 inet manual
+iface eno1 inet manual

# public network
auto vmbr0
iface vmbr0 inet static
-        address 192.X.Y.57
-        netmask 255.255.250.0
+        address 192.X.Y.57/24
        gateway 192.X.Y.1
-        bridge_ports eth0
-        bridge_stp off
-        bridge_fd 0
+        bridge-ports eno1
+        bridge-stp off
+        bridge-fd 0

# cluster network
-auto eth1
-iface eth1 inet static
-        address 10.1.1.1
-        netmask 255.255.255.0
+auto eno2
+iface eno2 inet static
+        address 10.1.1.1/24

# fast network
-auto eth2
-iface eth2 inet static
-        address 10.1.2.1
-        netmask 255.255.255.0
-
-# [...]
+auto eno3
+iface eno3 inet static
+        address 10.1.2.1/24
----

-Here we want to use the 10.1.2.1/24 network as migration network.
-For a single migration you can achieve this by using the 'migration_network'
-parameter:
+Here, we will use the network 10.1.2.0/24 as a migration network. For
+a single migration, you can do this using the `migration_network`
+parameter of the command-line tool:
+
----
-# qm migrate 106 tre --online --migration_network 10.1.2.1/24
+# qm migrate 106 tre --online --migration_network 10.1.2.0/24
----

-To set this up as default network for all migrations cluster wide you can use
-the migration property in '/etc/pve/datacenter.cfg':
+To configure this as the default network for all migrations in the
+cluster, set the `migration` property of the `/etc/pve/datacenter.cfg`
+file:
+
----
-# [...]
-migration: secure,network=10.1.2.1/24
+# use dedicated migration network
+migration: secure,network=10.1.2.0/24
----

-Note that the migration type must be always set if the network gets set.
+NOTE: The migration type must always be set when the migration network
+is set in `/etc/pve/datacenter.cfg`.
+

ifdef::manvolnum[]
include::pve-copyright.adoc[]
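+
+As a minimal sketch, assuming the node name `tre` and the example network
+10.1.2.0/24 from this chapter, the settings above could be applied from the
+shell like this:
+
+[source,bash]
+----
+# edit the cluster-wide options; /etc/pve is shared via pmxcfs, so any node works
+nano /etc/pve/datacenter.cfg
+# add or adjust the line:
+#   migration: secure,network=10.1.2.0/24
+
+# a single migration can still select the network explicitly on the command line
+qm migrate 106 tre --online --migration_network 10.1.2.0/24
+----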