1 [[chapter_pvecm]]
2 ifdef::manvolnum[]
3 pvecm(1)
4 ========
5 :pve-toplevel:
6
7 NAME
8 ----
9
10 pvecm - Proxmox VE Cluster Manager
11
12 SYNOPSIS
13 --------
14
15 include::pvecm.1-synopsis.adoc[]
16
17 DESCRIPTION
18 -----------
19 endif::manvolnum[]
20
21 ifndef::manvolnum[]
22 Cluster Manager
23 ===============
24 :pve-toplevel:
25 endif::manvolnum[]
26
27 The {PVE} cluster manager `pvecm` is a tool to create a group of
28 physical servers. Such a group is called a *cluster*. We use the
29 http://www.corosync.org[Corosync Cluster Engine] for reliable group
30 communication, and such clusters can consist of up to 32 physical nodes
31 (probably more, depending on network latency).
32
33 `pvecm` can be used to create a new cluster, join nodes to a cluster,
34 leave the cluster, get status information and do various other cluster
35 related tasks. The **P**rox**m**o**x** **C**luster **F**ile **S**ystem (``pmxcfs'')
36 is used to transparently distribute the cluster configuration to all cluster
37 nodes.
38
39 Grouping nodes into a cluster has the following advantages:
40
41 * Centralized, web-based management
42
43 * Multi-master clusters: each node can do all management tasks
44
45 * `pmxcfs`: database-driven file system for storing configuration files,
46 replicated in real-time on all nodes using `corosync`.
47
48 * Easy migration of virtual machines and containers between physical
49 hosts
50
51 * Fast deployment
52
53 * Cluster-wide services like firewall and HA
54
55
56 Requirements
57 ------------
58
59 * All nodes must be able to connect to each other via UDP ports 5404 and 5405
60 for corosync to work.
61
62 * Date and time have to be synchronized.
63
64 * SSH tunnel on TCP port 22 between nodes is used.
65
66 * If you are interested in High Availability, you need to have at
67 least three nodes for reliable quorum. All nodes should have the
68 same version.
69
70 * We recommend a dedicated NIC for the cluster traffic, especially if
71 you use shared storage.
72
73 * The root password of a cluster node is required for adding nodes.
74
75 NOTE: It is not possible to mix {pve} 3.x and earlier with {pve} 4.X cluster
76 nodes.
77
78 NOTE: While it's possible to mix {pve} 4.4 and {pve} 5.0 nodes, this is not supported as
79 a production configuration and should only be done temporarily, while upgrading the
80 whole cluster from one major version to another.
81
82 NOTE: Running a cluster of {pve} 6.x with earlier versions is not possible. The
83 cluster protocol (corosync) between {pve} 6.x and earlier versions changed
84 fundamentally. The corosync 3 packages for {pve} 5.4 are only intended for the
85 upgrade procedure to {pve} 6.0.
86
87
88 Preparing Nodes
89 ---------------
90
91 First, install {PVE} on all nodes. Make sure that each node is
92 installed with the final hostname and IP configuration. Changing the
93 hostname and IP is not possible after cluster creation.
94
95 Currently, cluster creation can either be done on the console (login via
96 `ssh`) or through the API, for which we have a GUI implementation (__Datacenter ->
97 Cluster__).
98
99 While it's common to reference all node names and their IPs in `/etc/hosts` (or
100 make their names resolvable through other means), this is not necessary for a
101 cluster to work. It may be useful however, as you can then connect from one node
102 to the other with SSH via the easier-to-remember node name (see also
103 xref:pvecm_corosync_addresses[Link Address Types]). Note that we always
104 recommend referencing nodes by their IP addresses in the cluster configuration.
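
For example, with three hypothetical nodes, the `/etc/hosts` entries might look
like this (names and addresses are purely illustrative):

----
10.10.10.1   pve-node1.example.com pve-node1
10.10.10.2   pve-node2.example.com pve-node2
10.10.10.3   pve-node3.example.com pve-node3
----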
105
106
107 [[pvecm_create_cluster]]
108 Create the Cluster
109 ------------------
110
111 Use a unique name for your cluster. This name cannot be changed later. The
112 cluster name follows the same rules as node names.
113
114 Create via Web GUI
115 ~~~~~~~~~~~~~~~~~~
116
117 [thumbnail="screenshot/gui-cluster-create.png"]
118
119 Under __Datacenter -> Cluster__, click on *Create Cluster*. Enter the cluster
120 name and select a network connection from the dropdown to serve as the main
121 cluster network (Link 0). It defaults to the IP resolved via the node's
122 hostname.
123
124 To add a second link as fallback, you can select the 'Advanced' checkbox and
125 choose an additional network interface (Link 1, see also
126 xref:pvecm_redundancy[Corosync Redundancy]).
127
128 Create via Command Line
129 ~~~~~~~~~~~~~~~~~~~~~~~
130
131 Login via `ssh` to the first {pve} node and run the following command:
132
133 ----
134 hp1# pvecm create CLUSTERNAME
135 ----
136
137 To check the state of the new cluster use:
138
139 ----
140 hp1# pvecm status
141 ----
142
143 Multiple Clusters In Same Network
144 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
145
146 It is possible to create multiple clusters in the same physical or logical
147 network. Each such cluster must have a unique name to avoid possible clashes in
148 the cluster communication stack. This also helps avoid human confusion by making
149 clusters clearly distinguishable.
150
151 While the bandwidth requirement of a corosync cluster is relatively low, the
152 latency of packets and the packets per second (PPS) rate are the limiting
153 factors. Different clusters in the same network can compete with each other for
154 these resources, so it may still make sense to use separate physical network
155 infrastructure for bigger clusters.
156
157 [[pvecm_join_node_to_cluster]]
158 Adding Nodes to the Cluster
159 ---------------------------
160
161 CAUTION: A node that is about to be added to the cluster cannot hold any guests.
162 All existing configuration in `/etc/pve` is overwritten when joining a cluster,
163 since guest IDs could otherwise conflict. As a workaround, create a backup of the
164 guest (`vzdump`) and restore it under a different ID after the node has been added
165 to the cluster.
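
The following is a minimal sketch of this workaround for a QEMU/KVM guest,
assuming a hypothetical guest with ID 100 on the joining node and a free ID 120
in the cluster (the archive file name depends on the actual backup):

[source,bash]
----
# on the node that is going to join, before joining the cluster
vzdump 100 --mode stop

# after the node has been added to the cluster, restore under a new ID
qmrestore /var/lib/vz/dump/vzdump-qemu-100-<timestamp>.vma 120
----

For containers, `pct restore` would be used instead of `qmrestore`.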
166
167 Add Node via GUI
168 ~~~~~~~~~~~~~~~~
169
170 [thumbnail="screenshot/gui-cluster-join-information.png"]
171
172 Login to the web interface on an existing cluster node. Under __Datacenter ->
173 Cluster__, click the button *Join Information* at the top. Then, click on the
174 button *Copy Information*. Alternatively, copy the string from the 'Information'
175 field manually.
176
177 [thumbnail="screenshot/gui-cluster-join.png"]
178
179 Next, login to the web interface on the node you want to add.
180 Under __Datacenter -> Cluster__, click on *Join Cluster*. Fill in the
181 'Information' field with the text you copied earlier.
182
183 For security reasons, the cluster password has to be entered manually.
184
185 NOTE: To enter all required data manually, you can disable the 'Assisted Join'
186 checkbox.
187
188 After clicking on *Join* the node will immediately be added to the cluster. You
189 might need to reload the web page and re-login with the cluster credentials.
190
191 Confirm that your node is visible under __Datacenter -> Cluster__.
192
193 Add Node via Command Line
194 ~~~~~~~~~~~~~~~~~~~~~~~~~
195
196 Login via `ssh` to the node you want to add.
197
198 ----
199 hp2# pvecm add IP-ADDRESS-CLUSTER
200 ----
201
202 For `IP-ADDRESS-CLUSTER` use the IP or hostname of an existing cluster node.
203 An IP address is recommended (see xref:pvecm_corosync_addresses[Link Address Types]).
204
205
206 To check the state of the cluster use:
207
208 ----
209 # pvecm status
210 ----
211
212 .Cluster status after adding 4 nodes
213 ----
214 hp2# pvecm status
215 Quorum information
216 ~~~~~~~~~~~~~~~~~~
217 Date: Mon Apr 20 12:30:13 2015
218 Quorum provider: corosync_votequorum
219 Nodes: 4
220 Node ID: 0x00000001
221 Ring ID: 1/8
222 Quorate: Yes
223
224 Votequorum information
225 ~~~~~~~~~~~~~~~~~~~~~~
226 Expected votes: 4
227 Highest expected: 4
228 Total votes: 4
229 Quorum: 3
230 Flags: Quorate
231
232 Membership information
233 ~~~~~~~~~~~~~~~~~~~~~~
234 Nodeid Votes Name
235 0x00000001 1 192.168.15.91
236 0x00000002 1 192.168.15.92 (local)
237 0x00000003 1 192.168.15.93
238 0x00000004 1 192.168.15.94
239 ----
240
241 If you only want the list of all nodes use:
242
243 ----
244 # pvecm nodes
245 ----
246
247 .List nodes in a cluster
248 ----
249 hp2# pvecm nodes
250
251 Membership information
252 ~~~~~~~~~~~~~~~~~~~~~~
253 Nodeid Votes Name
254 1 1 hp1
255 2 1 hp2 (local)
256 3 1 hp3
257 4 1 hp4
258 ----
259
260 [[pvecm_adding_nodes_with_separated_cluster_network]]
261 Adding Nodes With Separated Cluster Network
262 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
263
264 When adding a node to a cluster with a separated cluster network, you need to
265 use the 'link0' parameter to set the node's address on that network:
266
267 [source,bash]
268 ----
269 pvecm add IP-ADDRESS-CLUSTER -link0 LOCAL-IP-ADDRESS-LINK0
270 ----
271
272 If you want to use the built-in xref:pvecm_redundancy[redundancy] of the
273 kronosnet transport layer, also use the 'link1' parameter.
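
For example, assuming the existing cluster is reachable at 192.168.15.91 and the
joining node uses 10.10.10.4 on the first and 10.20.20.4 on the second cluster
network (all addresses are illustrative), the command might look like:

[source,bash]
----
pvecm add 192.168.15.91 -link0 10.10.10.4 -link1 10.20.20.4
----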
274
275 Using the GUI, you can select the correct interface from the corresponding 'Link 0'
276 and 'Link 1' fields in the *Cluster Join* dialog.
277
278 Remove a Cluster Node
279 ---------------------
280
281 CAUTION: Read the procedure carefully before proceeding, as it may not be
282 what you want or need.
283
284 Move all virtual machines from the node. Make sure you have no local
285 data or backups you want to keep, or save them accordingly.
286 In the following example we will remove the node hp4 from the cluster.
287
288 Log in to a *different* cluster node (not hp4), and issue a `pvecm nodes`
289 command to identify the node ID to remove:
290
291 ----
292 hp1# pvecm nodes
293
294 Membership information
295 ~~~~~~~~~~~~~~~~~~~~~~
296 Nodeid Votes Name
297 1 1 hp1 (local)
298 2 1 hp2
299 3 1 hp3
300 4 1 hp4
301 ----
302
303
304 At this point, you must power off hp4 and make sure that it will not power on
305 again (in the existing cluster network) as it is.
307
308 IMPORTANT: As mentioned above, it is critical to power off the node
309 *before* removal, and make sure that it will *never* power on again
310 (in the existing cluster network) as it is.
311 If you power on the node as it is, the cluster could end up broken, and
312 it may be difficult to restore a clean cluster state.
313
314 After powering off the node hp4, we can safely remove it from the cluster.
315
316 ----
317 hp1# pvecm delnode hp4
318 ----
319
320 If the operation succeeds, no output is returned. Just check the node
321 list again with `pvecm nodes` or `pvecm status`. You should see
322 something like:
323
324 ----
325 hp1# pvecm status
326
327 Quorum information
328 ~~~~~~~~~~~~~~~~~~
329 Date: Mon Apr 20 12:44:28 2015
330 Quorum provider: corosync_votequorum
331 Nodes: 3
332 Node ID: 0x00000001
333 Ring ID: 1/8
334 Quorate: Yes
335
336 Votequorum information
337 ~~~~~~~~~~~~~~~~~~~~~~
338 Expected votes: 3
339 Highest expected: 3
340 Total votes: 3
341 Quorum: 2
342 Flags: Quorate
343
344 Membership information
345 ~~~~~~~~~~~~~~~~~~~~~~
346 Nodeid Votes Name
347 0x00000001 1 192.168.15.90 (local)
348 0x00000002 1 192.168.15.91
349 0x00000003 1 192.168.15.92
350 ----
351
352 If, for whatever reason, you want this server to join the same cluster again,
353 you have to
354
355 * reinstall {pve} on it from scratch
356
357 * then join it, as explained in the previous section.
358
359 NOTE: After removal of the node, its SSH fingerprint will still reside in the
360 'known_hosts' of the other nodes. If you receive an SSH error after rejoining
361 a node with the same IP or hostname, run `pvecm updatecerts` once on the
362 re-added node to update its fingerprint cluster wide.
363
364 [[pvecm_separate_node_without_reinstall]]
365 Separate A Node Without Reinstalling
366 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
367
368 CAUTION: This is *not* the recommended method; proceed with caution. Use the
369 method mentioned above if you're unsure.
370
371 You can also separate a node from a cluster without reinstalling it from
372 scratch. But after removing the node from the cluster it will still have
373 access to the shared storages! This must be resolved before you start removing
374 the node from the cluster. A {pve} cluster cannot share the exact same
375 storage with another cluster, as storage locking doesn't work across the cluster
376 boundary. Furthermore, it may also lead to VMID conflicts.
377
378 It is suggested that you create a new storage, to which only the node that you want
379 to separate has access. This can be a new export on your NFS or a new Ceph
380 pool, to name a few examples. It is just important that the exact same storage
381 is not accessed by multiple clusters. After setting up this storage, move
382 all data from the node and its VMs to it. Then you are ready to separate the
383 node from the cluster.
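
As a sketch, a new NFS export that only the node to be separated (here assumed to
be named 'hp4', with a hypothetical NFS server and export path) may access could
be added like this:

[source,bash]
----
pvesm add nfs separate-nfs --server 192.168.15.200 --export /export/separate \
    --content images,rootdir --nodes hp4
----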
384
385 WARNING: Ensure all shared resources are cleanly separated! Otherwise you will
386 run into conflicts and problems.
387
388 First stop the corosync and the pve-cluster services on the node:
389 [source,bash]
390 ----
391 systemctl stop pve-cluster
392 systemctl stop corosync
393 ----
394
395 Start the cluster filesystem again in local mode:
396 [source,bash]
397 ----
398 pmxcfs -l
399 ----
400
401 Delete the corosync configuration files:
402 [source,bash]
403 ----
404 rm /etc/pve/corosync.conf
405 rm /etc/corosync/*
406 ----
407
408 You can now start the filesystem again as normal service:
409 [source,bash]
410 ----
411 killall pmxcfs
412 systemctl start pve-cluster
413 ----
414
415 The node is now separated from the cluster. You can delete it from a remaining
416 node of the cluster with:
417 [source,bash]
418 ----
419 pvecm delnode oldnode
420 ----
421
422 If the command fails because the remaining node in the cluster lost quorum
423 when the now separated node exited, you may set the expected votes to 1 as a workaround:
424 [source,bash]
425 ----
426 pvecm expected 1
427 ----
428
429 And then repeat the 'pvecm delnode' command.
430
431 Now switch back to the separated node and delete all remaining files left
432 over from the old cluster. This ensures that the node can be added to another
433 cluster again without problems.
434
435 [source,bash]
436 ----
437 rm /var/lib/corosync/*
438 ----
439
440 As the configuration files from the other nodes are still in the cluster
441 filesystem, you may want to clean those up too. Simply remove the whole
442 directory recursively under '/etc/pve/nodes/NODENAME', but check three times that
443 you used the correct one before deleting it.
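
For example, if one of the former cluster members was named 'hp3', its leftover
configuration directory could be removed like this (double-check the name first):

[source,bash]
----
rm -r /etc/pve/nodes/hp3
----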
444
445 CAUTION: The node's SSH keys are still in the 'authorized_keys' file. This means
446 that the nodes can still connect to each other with public key authentication. This
447 should be fixed by removing the respective keys from the
448 '/etc/pve/priv/authorized_keys' file.
449
450
451 Quorum
452 ------
453
454 {pve} uses a quorum-based technique to provide a consistent state among
455 all cluster nodes.
456
457 [quote, from Wikipedia, Quorum (distributed computing)]
458 ____
459 A quorum is the minimum number of votes that a distributed transaction
460 has to obtain in order to be allowed to perform an operation in a
461 distributed system.
462 ____
463
464 In case of network partitioning, state changes require that a
465 majority of nodes is online (for example, at least three out of five nodes).
466 The cluster switches to read-only mode if it loses quorum.
467
468 NOTE: {pve} assigns a single vote to each node by default.
469
470
471 Cluster Network
472 ---------------
473
474 The cluster network is the core of a cluster. All messages sent over it have to
475 be delivered reliably to all nodes in their respective order. In {pve} this
476 part is done by corosync, an implementation of a high-performance, low-overhead,
477 high-availability development toolkit. It serves our decentralized
478 configuration file system (`pmxcfs`).
479
480 [[pvecm_cluster_network_requirements]]
481 Network Requirements
482 ~~~~~~~~~~~~~~~~~~~~
483 Corosync needs a reliable network with latencies under 2 milliseconds (LAN
484 performance) to work properly. The network should not be used heavily by other
485 members; ideally, corosync runs on its own network. Do not use a shared network
486 for corosync and storage (except as a potential low-priority fallback in a
487 xref:pvecm_redundancy[redundant] configuration).
488
489 Before setting up a cluster, it is good practice to check if the network is fit
490 for that purpose. To make sure the nodes can connect to each other on the
491 cluster network, you can test the connectivity between them with the `ping`
492 tool.
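
A simple sketch of such a test, assuming illustrative cluster network addresses,
run from each node against every other node:

[source,bash]
----
ping -c 4 10.10.10.2
ping -c 4 10.10.10.3
----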
493
494 If the {pve} firewall is enabled, ACCEPT rules for corosync will automatically
495 be generated - no manual action is required.
496
497 NOTE: Corosync used Multicast before version 3.0 (introduced in {pve} 6.0).
498 Modern versions rely on https://kronosnet.org/[Kronosnet] for cluster
499 communication, which, for now, only supports regular UDP unicast.
500
501 CAUTION: You can still enable Multicast or legacy unicast by setting your
502 transport to `udp` or `udpu` in your xref:pvecm_edit_corosync_conf[corosync.conf],
503 but keep in mind that this will disable all cryptography and redundancy support.
504 This is therefore not recommended.
505
506 Separate Cluster Network
507 ~~~~~~~~~~~~~~~~~~~~~~~~
508
509 When creating a cluster without any parameters, the corosync cluster network is
510 generally shared with the web UI, the VMs, and their traffic. Depending on
511 your setup, even storage traffic may get sent over the same network. It is
512 recommended to change that, as corosync is a time-critical, real-time
513 application.
514
515 Setting Up A New Network
516 ^^^^^^^^^^^^^^^^^^^^^^^^
517
518 First you have to set up a new network interface. It should be on a physically
519 separate network. Ensure that your network fulfills the
520 xref:pvecm_cluster_network_requirements[cluster network requirements].
521
522 Separate On Cluster Creation
523 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
524
525 This is possible via the 'linkX' parameters of the 'pvecm create'
526 command used for creating a new cluster.
527
528 If you have set up an additional NIC with a static address on 10.10.10.1/25,
529 and want to send and receive all cluster communication over this interface,
530 you would execute:
531
532 [source,bash]
533 ----
534 pvecm create test --link0 10.10.10.1
535 ----
536
537 To check if everything is working properly execute:
538 [source,bash]
539 ----
540 systemctl status corosync
541 ----
542
543 Afterwards, proceed as described above to
544 xref:pvecm_adding_nodes_with_separated_cluster_network[add nodes with a separated cluster network].
545
546 [[pvecm_separate_cluster_net_after_creation]]
547 Separate After Cluster Creation
548 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
549
550 You can do this if you have already created a cluster and want to switch
551 its communication to another network, without rebuilding the whole cluster.
552 This change may lead to short durations of quorum loss in the cluster, as nodes
553 have to restart corosync and come up one after the other on the new network.
554
555 Check how to xref:pvecm_edit_corosync_conf[edit the corosync.conf file] first.
556 Then, open it and you should see a file similar to:
557
558 ----
559 logging {
560 debug: off
561 to_syslog: yes
562 }
563
564 nodelist {
565
566 node {
567 name: due
568 nodeid: 2
569 quorum_votes: 1
570 ring0_addr: due
571 }
572
573 node {
574 name: tre
575 nodeid: 3
576 quorum_votes: 1
577 ring0_addr: tre
578 }
579
580 node {
581 name: uno
582 nodeid: 1
583 quorum_votes: 1
584 ring0_addr: uno
585 }
586
587 }
588
589 quorum {
590 provider: corosync_votequorum
591 }
592
593 totem {
594 cluster_name: testcluster
595 config_version: 3
596 ip_version: ipv4-6
597 secauth: on
598 version: 2
599 interface {
600 linknumber: 0
601 }
602
603 }
604 ----
605
606 NOTE: `ringX_addr` actually specifies a corosync *link address*. The name "ring"
607 is a remnant of older corosync versions that is kept for backwards
608 compatibility.
609
610 The first thing you want to do is add the 'name' properties in the node entries
611 if you do not see them already. Those *must* match the node name.
612
613 Then replace all addresses from the 'ring0_addr' properties of all nodes with
614 the new addresses. You may use plain IP addresses or hostnames here. If you use
615 hostnames, ensure that they are resolvable from all nodes (see also
616 xref:pvecm_corosync_addresses[Link Address Types]).
617
618 In this example, we want to switch the cluster communication to the
619 10.10.10.1/25 network. So we replace all 'ring0_addr' values accordingly.
620
621 NOTE: The exact same procedure can be used to change other 'ringX_addr' values
622 as well, although we recommend not changing multiple addresses at once, to make
623 it easier to recover if something goes wrong.
624
625 After we increase the 'config_version' property, the new configuration file
626 should look like:
627
628 ----
629 logging {
630 debug: off
631 to_syslog: yes
632 }
633
634 nodelist {
635
636 node {
637 name: due
638 nodeid: 2
639 quorum_votes: 1
640 ring0_addr: 10.10.10.2
641 }
642
643 node {
644 name: tre
645 nodeid: 3
646 quorum_votes: 1
647 ring0_addr: 10.10.10.3
648 }
649
650 node {
651 name: uno
652 nodeid: 1
653 quorum_votes: 1
654 ring0_addr: 10.10.10.1
655 }
656
657 }
658
659 quorum {
660 provider: corosync_votequorum
661 }
662
663 totem {
664 cluster_name: testcluster
665 config_version: 4
666 ip_version: ipv4-6
667 secauth: on
668 version: 2
669 interface {
670 linknumber: 0
671 }
672
673 }
674 ----
675
676 Then, after a final check that all the changed information is correct, we save it and
677 once again follow the xref:pvecm_edit_corosync_conf[edit corosync.conf file]
678 section to bring it into effect.
679
680 The changes will be applied live, so restarting corosync is not strictly
681 necessary. If you changed other settings as well, or notice corosync
682 complaining, you can optionally trigger a restart.
683
684 On a single node execute:
685
686 [source,bash]
687 ----
688 systemctl restart corosync
689 ----
690
691 Now check if everything is fine:
692
693 [source,bash]
694 ----
695 systemctl status corosync
696 ----
697
698 If corosync runs correctly again, restart it on all other nodes as well.
699 They will then join the cluster membership one by one on the new network.
700
701 [[pvecm_corosync_addresses]]
702 Corosync addresses
703 ~~~~~~~~~~~~~~~~~~
704
705 A corosync link address (for backwards compatibility denoted by 'ringX_addr' in
706 `corosync.conf`) can be specified in two ways:
707
708 * **IPv4/v6 addresses** will be used directly. They are recommended, since they
709 are static and usually not changed carelessly.
710
711 * **Hostnames** will be resolved using `getaddrinfo`, which means that per
712 default, IPv6 addresses will be used first, if available (see also
713 `man gai.conf`). Keep this in mind, especially when upgrading an existing
714 cluster to IPv6.
715
716 CAUTION: Hostnames should be used with care, since the address they
717 resolve to can be changed without touching corosync or the node it runs on -
718 which may lead to a situation where an address is changed without thinking
719 about implications for corosync.
720
721 A separate, static hostname specifically for corosync is recommended, if
722 hostnames are preferred. Also, make sure that every node in the cluster can
723 resolve all hostnames correctly.
724
725 Since {pve} 5.1, while supported, hostnames will be resolved at the time of
726 entry. Only the resolved IP is then saved to the configuration.
727
728 Nodes that joined the cluster on earlier versions likely still use their
729 unresolved hostname in `corosync.conf`. It might be a good idea to replace
730 them with IPs or a separate hostname, as mentioned above.
731
732
733 [[pvecm_redundancy]]
734 Corosync Redundancy
735 -------------------
736
737 Corosync supports redundant networking via its integrated kronosnet layer by
738 default (it is not supported on the legacy udp/udpu transports). It can be
739 enabled by specifying more than one link address, either via the '--linkX'
740 parameters of `pvecm`, in the GUI as **Link 1** (while creating a cluster or
741 adding a new node) or by specifying more than one 'ringX_addr' in
742 `corosync.conf`.
743
744 NOTE: To provide useful failover, every link should be on its own
745 physical network connection.
746
747 Links are used according to a priority setting. You can configure this priority
748 by setting 'knet_link_priority' in the corresponding interface section in
749 `corosync.conf`, or, preferably, using the 'priority' parameter when creating
750 your cluster with `pvecm`:
751
752 ----
753 # pvecm create CLUSTERNAME --link0 10.10.10.1,priority=20 --link1 10.20.20.1,priority=15
754 ----
755
756 This would cause 'link0' to be used first, since it has the higher priority.
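
Alternatively, roughly the same result can be achieved by setting
'knet_link_priority' directly in the interface sections of
xref:pvecm_edit_corosync_conf[corosync.conf]; a sketch of the relevant `totem`
fragment matching the command above:

----
totem {
  interface {
    linknumber: 0
    knet_link_priority: 20
  }
  interface {
    linknumber: 1
    knet_link_priority: 15
  }
}
----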
757
758 If no priorities are configured manually (or two links have the same priority),
759 links will be used in order of their number, with the lower number having higher
760 priority.
761
762 Even if all links are working, only the one with the highest priority will see
763 corosync traffic. Link priorities cannot be mixed, i.e. links with different
764 priorities will not be able to communicate with each other.
765
766 Since lower priority links will not see traffic unless all higher priorities
767 have failed, it is a useful strategy to specify even networks used for
768 other tasks (VMs, storage, etc.) as low-priority links. If worst comes to
769 worst, a higher-latency or more congested connection might be better than no
770 connection at all.
771
772 Adding Redundant Links To An Existing Cluster
773 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
774
775 To add a new link to a running configuration, first check how to
776 xref:pvecm_edit_corosync_conf[edit the corosync.conf file].
777
778 Then, add a new 'ringX_addr' to every node in the `nodelist` section. Make
779 sure that your 'X' is the same for every node you add it to, and that it is
780 unique for each node.
781
782 Lastly, add a new 'interface', as shown below, to your `totem`
783 section, replacing 'X' with your link number chosen above.
784
785 Assuming you added a link with number 1, the new configuration file could look
786 like this:
787
788 ----
789 logging {
790 debug: off
791 to_syslog: yes
792 }
793
794 nodelist {
795
796 node {
797 name: due
798 nodeid: 2
799 quorum_votes: 1
800 ring0_addr: 10.10.10.2
801 ring1_addr: 10.20.20.2
802 }
803
804 node {
805 name: tre
806 nodeid: 3
807 quorum_votes: 1
808 ring0_addr: 10.10.10.3
809 ring1_addr: 10.20.20.3
810 }
811
812 node {
813 name: uno
814 nodeid: 1
815 quorum_votes: 1
816 ring0_addr: 10.10.10.1
817 ring1_addr: 10.20.20.1
818 }
819
820 }
821
822 quorum {
823 provider: corosync_votequorum
824 }
825
826 totem {
827 cluster_name: testcluster
828 config_version: 4
829 ip_version: ipv4-6
830 secauth: on
831 version: 2
832 interface {
833 linknumber: 0
834 }
835 interface {
836 linknumber: 1
837 }
838 }
839 ----
840
841 The new link will be enabled as soon as you follow the last steps to
842 xref:pvecm_edit_corosync_conf[edit the corosync.conf file]. A restart should not
843 be necessary. You can check that corosync loaded the new link using:
844
845 ----
846 journalctl -b -u corosync
847 ----
848
849 It might be a good idea to test the new link by temporarily disconnecting the
850 old link on one node and making sure that its status remains online while
851 disconnected:
852
853 ----
854 pvecm status
855 ----
856
857 If you see a healthy cluster state, it means that your new link is being used.
858
859
860 Corosync External Vote Support
861 ------------------------------
862
863 This section describes a way to deploy an external voter in a {pve} cluster.
864 When configured, the cluster can sustain more node failures without
865 violating safety properties of the cluster communication.
866
867 For this to work there are two services involved:
868
869 * a so-called 'QDevice' daemon which runs on each {pve} node
870
871 * an external vote daemon which runs on an independent server.
872
873 As a result you can achieve higher availability even in smaller setups (for
874 example 2+1 nodes).
875
876 QDevice Technical Overview
877 ~~~~~~~~~~~~~~~~~~~~~~~~~~
878
879 The Corosync Quorum Device (QDevice) is a daemon which runs on each cluster
880 node. It provides a configured number of votes to the cluster's quorum
881 subsystem, based on the decision of an externally running third-party arbitrator.
882 Its primary use is to allow a cluster to sustain more node failures than
883 standard quorum rules allow. This can be done safely as the external device
884 can see all nodes and thus choose only one set of nodes to give its vote.
885 This will only be done if said set of nodes can have quorum (again) when
886 receiving the third-party vote.
887
888 Currently only 'QDevice Net' is supported as a third-party arbitrator. It is
889 a daemon which provides a vote to a cluster partition if it can reach the
890 partition members over the network. It will only give votes to one partition
891 of a cluster at any time.
892 It's designed to support multiple clusters and is almost configuration and
893 state free. New clusters are handled dynamically and no configuration file
894 is needed on the host running a QDevice.
895
896 The only requirements for the external host are that it needs network access to
897 the cluster and has a corosync-qnetd package available. We provide such a package
898 for Debian-based hosts; other Linux distributions should also have a package
899 available through their respective package manager.
900
901 NOTE: In contrast to corosync itself, a QDevice connects to the cluster over
902 TCP/IP. The daemon may even run outside of the cluster's LAN and can tolerate
903 latencies longer than 2 ms.
904
905 Supported Setups
906 ~~~~~~~~~~~~~~~~
907
908 We support QDevices for clusters with an even number of nodes and recommend
909 them for 2-node clusters, if they should provide higher availability.
910 For clusters with an odd node count, we currently discourage the use of
911 QDevices. The reason for this is the difference in the number of votes which
912 the QDevice provides for each cluster type. Even-numbered clusters get a single
913 additional vote, which can only increase availability, because if the QDevice
914 itself fails, you are in the same situation as with no QDevice at all.
915
916 On the other hand, with an odd-numbered cluster size, the QDevice provides
917 '(N-1)' votes -- where 'N' corresponds to the cluster node count. This
918 difference makes sense: if it provided only one additional vote, the cluster
919 could get into a split-brain situation.
920 This algorithm allows all nodes but one (and naturally the
921 QDevice itself) to fail.
922 There are two drawbacks to this:
923
924 * If the QNet daemon itself fails, no other node may fail or the cluster
925 immediately loses quorum. For example, in a cluster with 15 nodes, 7
926 could fail before the cluster becomes inquorate. But, if a QDevice is
927 configured here and said QDevice itself fails, **no single node** of
928 the 15 may fail. The QDevice acts almost as a single point of failure in
929 this case.
930
931 * The fact that all but one node plus the QDevice may fail sounds promising at
932 first, but this may result in a mass recovery of HA services that would
933 overload the single remaining node. Also, a Ceph server will stop providing
934 services if only '((N-1)/2)' nodes or fewer remain online.
935
936 If you understand the drawbacks and implications you can decide yourself if
937 you should use this technology in an odd numbered cluster setup.
938
939 QDevice-Net Setup
940 ~~~~~~~~~~~~~~~~~
941
942 We recommend running any daemon which provides votes to corosync-qdevice as an
943 unprivileged user. {pve} and Debian provide a package which is already
944 configured to do so.
945 The traffic between the daemon and the cluster must be encrypted to ensure a
946 safe and secure QDevice integration in {pve}.
947
948 First install the 'corosync-qnetd' package on your external server and
949 the 'corosync-qdevice' package on all cluster nodes.
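
A minimal sketch of the package installation, assuming Debian-based hosts:

[source,bash]
----
# on the external server
apt install corosync-qnetd

# on all cluster nodes
apt install corosync-qdevice
----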
950
951 After that, ensure that all the nodes in your cluster are online.
952
953 You can now easily set up your QDevice by running the following command on one
954 of the {pve} nodes:
955
956 ----
957 pve# pvecm qdevice setup <QDEVICE-IP>
958 ----
959
960 The SSH key from the cluster will be automatically copied to the QDevice. You
961 might need to enter an SSH password during this step.
962
963 After you enter the password and all the steps are successfully completed, you
964 will see "Done". You can check the status now:
965
966 ----
967 pve# pvecm status
968
969 ...
970
971 Votequorum information
972 ~~~~~~~~~~~~~~~~~~~~~
973 Expected votes: 3
974 Highest expected: 3
975 Total votes: 3
976 Quorum: 2
977 Flags: Quorate Qdevice
978
979 Membership information
980 ~~~~~~~~~~~~~~~~~~~~~~
981 Nodeid Votes Qdevice Name
982 0x00000001 1 A,V,NMW 192.168.22.180 (local)
983 0x00000002 1 A,V,NMW 192.168.22.181
984 0x00000000 1 Qdevice
985
986 ----
987
988 which means the QDevice is set up.
989
990 Frequently Asked Questions
991 ~~~~~~~~~~~~~~~~~~~~~~~~~~
992
993 Tie Breaking
994 ^^^^^^^^^^^^
995
996 In case of a tie, where two same-sized cluster partitions cannot see each other
997 but can both see the QDevice, the QDevice randomly chooses one of those partitions
998 and provides a vote to it.
999
1000 Possible Negative Implications
1001 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1002
1003 For clusters with an even node count there are no negative implications when
1004 setting up a QDevice. If it fails to work, you are no worse off than without a
1005 QDevice at all.
1006
1007 Adding/Deleting Nodes After QDevice Setup
1008 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1009
1010 If you want to add a new node or remove an existing one from a cluster with a
1011 QDevice setup, you need to remove the QDevice first. After that, you can add or
1012 remove nodes normally. Once you have a cluster with an even node count again,
1013 you can set up the QDevice again as described above.
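
A sketch of this sequence, reusing the commands shown elsewhere in this chapter:

----
# remove the QDevice from the current cluster
pve# pvecm qdevice remove

# add or remove nodes as usual, for example
pve# pvecm delnode hp4

# once the node count is even again, re-add the QDevice
pve# pvecm qdevice setup <QDEVICE-IP>
----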
1014
1015 Removing the QDevice
1016 ^^^^^^^^^^^^^^^^^^^^
1017
1018 If you used the official `pvecm` tool to add the QDevice, you can remove it
1019 trivially by running:
1020
1021 ----
1022 pve# pvecm qdevice remove
1023 ----
1024
1025 //Still TODO
1026 //^^^^^^^^^^
1027 //There is still stuff to add here
1028
1029
1030 Corosync Configuration
1031 ----------------------
1032
1033 The `/etc/pve/corosync.conf` file plays a central role in a {pve} cluster. It
1034 controls the cluster membership and its network.
1035 For further information about it, check the corosync.conf man page:
1036 [source,bash]
1037 ----
1038 man corosync.conf
1039 ----
1040
1041 For node membership you should always use the `pvecm` tool provided by {pve}.
1042 You may have to edit the configuration file manually for other changes.
1043 Here are a few best practice tips for doing this.
1044
1045 [[pvecm_edit_corosync_conf]]
1046 Edit corosync.conf
1047 ~~~~~~~~~~~~~~~~~~
1048
1049 Editing the corosync.conf file is not always very straightforward. There are
1050 two files on each cluster node: one in `/etc/pve/corosync.conf` and the other in
1051 `/etc/corosync/corosync.conf`. Editing the one in our cluster file system will
1052 propagate the changes to the local one, but not vice versa.
1053
1054 The configuration will get updated automatically as soon as the file changes.
1055 This means changes which can be integrated in a running corosync will take
1056 effect immediately. So you should always make a copy and edit that instead, to
1057 avoid triggering unwanted changes from an intermediate save.
1058
1059 [source,bash]
1060 ----
1061 cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
1062 ----
1063
1064 Then open the config file with your favorite editor; for example, `nano` and
1065 `vim.tiny` are preinstalled on every {pve} node.
1066
1067 NOTE: Always increment the 'config_version' number on configuration changes;
1068 omitting this can lead to problems.
1069
1070 After making the necessary changes, create another copy of the current working
1071 configuration file. This serves as a backup if the new configuration fails to
1072 apply or causes other problems.
1073
1074 [source,bash]
1075 ----
1076 cp /etc/pve/corosync.conf /etc/pve/corosync.conf.bak
1077 ----
1078
1079 Then move the new configuration file over the old one:
1080 [source,bash]
1081 ----
1082 mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf
1083 ----
1084
1085 With the following commands, you can check whether the changes were applied automatically:
1086 [source,bash]
1087 ----
1088 systemctl status corosync
1089 journalctl -b -u corosync
1090 ----
1091
1092 If the changes could not be applied automatically, you may have to restart the
1093 corosync service via:
1094 [source,bash]
1095 ----
1096 systemctl restart corosync
1097 ----
1098
1099 On errors check the troubleshooting section below.
1100
1101 Troubleshooting
1102 ~~~~~~~~~~~~~~~
1103
1104 Issue: 'quorum.expected_votes must be configured'
1105 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1106
1107 When corosync starts to fail and you get the following message in the system log:
1108
1109 ----
1110 [...]
1111 corosync[1647]: [QUORUM] Quorum provider: corosync_votequorum failed to initialize.
1112 corosync[1647]: [SERV ] Service engine 'corosync_quorum' failed to load for reason
1113 'configuration error: nodelist or quorum.expected_votes must be configured!'
1114 [...]
1115 ----
1116
1117 It means that the hostname you set for corosync 'ringX_addr' in the
1118 configuration could not be resolved.
1119
1120 Write Configuration When Not Quorate
1121 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1122
1123 If you need to change '/etc/pve/corosync.conf' on a node with no quorum, and you
1124 know what you are doing, use:
1125 [source,bash]
1126 ----
1127 pvecm expected 1
1128 ----
1129
1130 This sets the expected vote count to 1 and makes the cluster quorate. You can
1131 now fix your configuration, or revert it back to the last working backup.
1132
1133 If corosync cannot start anymore, this is not enough. In that case, it is best to
1134 edit the local copy of the corosync configuration in '/etc/corosync/corosync.conf'
1135 so that corosync can start again. Ensure that this configuration has the same
1136 content on all nodes, to avoid split-brain situations. If you are not sure what
1137 went wrong, it's best to ask the Proxmox Community to help you.
1138
1139
1140 [[pvecm_corosync_conf_glossary]]
1141 Corosync Configuration Glossary
1142 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1143
1144 ringX_addr::
1145 This names the different link addresses for the kronosnet connections between
1146 nodes.
1147
1148
1149 Cluster Cold Start
1150 ------------------
1151
1152 It is obvious that a cluster is not quorate when all nodes are
1153 offline. This is a common case after a power failure.
1154
1155 NOTE: It is always a good idea to use an uninterruptible power supply
1156 (``UPS'', also called ``battery backup'') to avoid this state, especially if
1157 you want HA.
1158
1159 On node startup, the `pve-guests` service is started and waits for
1160 quorum. Once quorate, it starts all guests which have the `onboot`
1161 flag set.
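
For example, assuming a hypothetical VM 100 and container 101, the `onboot` flag
could be set as follows:

[source,bash]
----
qm set 100 --onboot 1
pct set 101 --onboot 1
----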
1162
1163 When you turn on nodes, or when power comes back after a power failure,
1164 it is likely that some nodes boot faster than others. Please keep in
1165 mind that guest startup is delayed until you reach quorum.
1166
1167
1168 Guest Migration
1169 ---------------
1170
1171 Migrating virtual guests to other nodes is a useful feature in a
1172 cluster. There are settings to control the behavior of such
1173 migrations. This can be done via the configuration file
1174 `datacenter.cfg` or for a specific migration via API or command line
1175 parameters.
1176
1177 It makes a difference whether a guest is online or offline, or whether it has
1178 local resources (like a local disk).
1179
1180 For Details about Virtual Machine Migration see the
1181 xref:qm_migration[QEMU/KVM Migration Chapter].
1182
1183 For Details about Container Migration see the
1184 xref:pct_migration[Container Migration Chapter].
1185
1186 Migration Type
1187 ~~~~~~~~~~~~~~
1188
1189 The migration type defines whether the migration data should be sent over an
1190 encrypted (`secure`) channel or an unencrypted (`insecure`) one.
1191 Setting the migration type to `insecure` means that the RAM content of a
1192 virtual guest is also transferred unencrypted, which can lead to
1193 information disclosure of critical data from inside the guest (for
1194 example, passwords or encryption keys).
1195
1196 Therefore, we strongly recommend using the secure channel if you do
1197 not have full control over the network and cannot guarantee that no
1198 one is eavesdropping on it.
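
As a sketch, the migration type can also be overridden for a single migration on
the command line (reusing the VM ID and target node from the example further
below; treat the exact invocation as illustrative):

----
# qm migrate 106 tre --online --migration_type insecure
----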
1199
1200 NOTE: Storage migration does not follow this setting. Currently, it
1201 always sends the storage content over a secure channel.
1202
1203 Encryption requires a lot of computing power, so this setting is often
1204 changed to `insecure` to achieve better performance. The impact on
1205 modern systems is lower because they implement AES encryption in
1206 hardware. The performance impact is particularly evident in fast
1207 networks where you can transfer 10 Gbps or more.
1208
1209 Migration Network
1210 ~~~~~~~~~~~~~~~~~
1211
1212 By default, {pve} uses the network in which cluster communication
1213 takes place to send the migration traffic. This is not optimal because
1214 sensitive cluster traffic can be disrupted and this network may not
1215 have the best bandwidth available on the node.
1216
1217 Setting the migration network parameter allows the use of a dedicated
1218 network for the entire migration traffic. In addition to the memory,
1219 this also affects the storage traffic for offline migrations.
1220
1221 The migration network is set as a network in the CIDR notation. This
1222 has the advantage that you do not have to set individual IP addresses
1223 for each node. {pve} can determine the real address on the
1224 destination node from the network specified in the CIDR form. To
1225 enable this, the network must be specified so that each node has one,
1226 but only one IP in the respective network.
1227
1228 Example
1229 ^^^^^^^
1230
1231 We assume that we have a three-node setup with three separate
1232 networks. One for public communication with the Internet, one for
1233 cluster communication and a very fast one, which we want to use as a
1234 dedicated network for migration.
1235
1236 A network configuration for such a setup might look as follows:
1237
1238 ----
1239 iface eno1 inet manual
1240
1241 # public network
1242 auto vmbr0
1243 iface vmbr0 inet static
1244 address 192.X.Y.57
1245         netmask 255.255.240.0
1246 gateway 192.X.Y.1
1247 bridge_ports eno1
1248 bridge_stp off
1249 bridge_fd 0
1250
1251 # cluster network
1252 auto eno2
1253 iface eno2 inet static
1254 address 10.1.1.1
1255 netmask 255.255.255.0
1256
1257 # fast network
1258 auto eno3
1259 iface eno3 inet static
1260 address 10.1.2.1
1261 netmask 255.255.255.0
1262 ----
1263
1264 Here, we will use the network 10.1.2.0/24 as a migration network. For
1265 a single migration, you can do this using the `migration_network`
1266 parameter of the command line tool:
1267
1268 ----
1269 # qm migrate 106 tre --online --migration_network 10.1.2.0/24
1270 ----
1271
1272 To configure this as the default network for all migrations in the
1273 cluster, set the `migration` property of the `/etc/pve/datacenter.cfg`
1274 file:
1275
1276 ----
1277 # use dedicated migration network
1278 migration: secure,network=10.1.2.0/24
1279 ----
1280
1281 NOTE: The migration type must always be set when the migration network
1282 gets set in `/etc/pve/datacenter.cfg`.
1283
1284
1285 ifdef::manvolnum[]
1286 include::pve-copyright.adoc[]
1287 endif::manvolnum[]