1 [[chapter_pvecm]]
2 ifdef::manvolnum[]
3 pvecm(1)
4 ========
5 :pve-toplevel:
6
7 NAME
8 ----
9
10 pvecm - Proxmox VE Cluster Manager
11
12 SYNOPSIS
13 --------
14
15 include::pvecm.1-synopsis.adoc[]
16
17 DESCRIPTION
18 -----------
19 endif::manvolnum[]
20
21 ifndef::manvolnum[]
22 Cluster Manager
23 ===============
24 :pve-toplevel:
25 endif::manvolnum[]
26
27 The {PVE} cluster manager `pvecm` is a tool to create a group of
28 physical servers. Such a group is called a *cluster*. We use the
29 http://www.corosync.org[Corosync Cluster Engine] for reliable group
30 communication, and such clusters can consist of up to 32 physical nodes
(probably more, depending on network latency).
32
33 `pvecm` can be used to create a new cluster, join nodes to a cluster,
34 leave the cluster, get status information and do various other cluster
35 related tasks. The **P**rox**m**o**x** **C**luster **F**ile **S**ystem (``pmxcfs'')
36 is used to transparently distribute the cluster configuration to all cluster
37 nodes.
38
39 Grouping nodes into a cluster has the following advantages:
40
41 * Centralized, web based management
42
* Multi-master clusters: each node can do all management tasks
44
45 * `pmxcfs`: database-driven file system for storing configuration files,
46 replicated in real-time on all nodes using `corosync`.
47
48 * Easy migration of virtual machines and containers between physical
49 hosts
50
51 * Fast deployment
52
53 * Cluster-wide services like firewall and HA
54
55
56 Requirements
57 ------------
58
59 * All nodes must be in the same network as `corosync` uses IP Multicast
60 to communicate between nodes (also see
61 http://www.corosync.org[Corosync Cluster Engine]). Corosync uses UDP
62 ports 5404 and 5405 for cluster communication.
63 +
64 NOTE: Some switches do not support IP multicast by default and must be
65 manually enabled first.
66
67 * Date and time have to be synchronized.
68
* An SSH tunnel on TCP port 22 between nodes is used.
70
71 * If you are interested in High Availability, you need to have at
72 least three nodes for reliable quorum. All nodes should have the
73 same version.
74
75 * We recommend a dedicated NIC for the cluster traffic, especially if
76 you use shared storage.
77
* The root password of a cluster node is required for adding nodes.
79
80 NOTE: It is not possible to mix {pve} 3.x and earlier with {pve} 4.X cluster
81 nodes.
82
NOTE: While it's possible to mix {pve} 4.4 and {pve} 5.0 nodes, this is not
supported as a production configuration and should only be done temporarily,
during an upgrade of the whole cluster from one major version to another.
86
87
88 Preparing Nodes
89 ---------------
90
91 First, install {PVE} on all nodes. Make sure that each node is
92 installed with the final hostname and IP configuration. Changing the
93 hostname and IP is not possible after cluster creation.
94
Currently the cluster creation can either be done on the console (login via `ssh`)
or through the API, for which we have a GUI implementation (__Datacenter ->
Cluster__).
98
While it is common practice to reference all other node names with their IPs in
`/etc/hosts`, this is not strictly necessary for a cluster, which normally uses
multicast, to work. It may still be useful, as you can then connect from one
node to the other via SSH, using the easier to remember node name.
103
104 [[pvecm_create_cluster]]
105 Create the Cluster
106 ------------------
107
108 Login via `ssh` to the first {pve} node. Use a unique name for your cluster.
109 This name cannot be changed later. The cluster name follows the same rules as
110 node names.
111
112 ----
113 hp1# pvecm create CLUSTERNAME
114 ----
115
116 CAUTION: The cluster name is used to compute the default multicast address.
117 Please use unique cluster names if you run more than one cluster inside your
118 network. To avoid human confusion, it is also recommended to choose different
119 names even if clusters do not share the cluster network.
120
121 To check the state of your cluster use:
122
123 ----
124 hp1# pvecm status
125 ----
126
127 Multiple Clusters In Same Network
128 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
129
130 It is possible to create multiple clusters in the same physical or logical
131 network. Each cluster must have a unique name, which is used to generate the
132 cluster's multicast group address. As long as no duplicate cluster names are
133 configured in one network segment, the different clusters won't interfere with
134 each other.
135
If multiple clusters operate in a single network, it may be beneficial to set up
an IGMP querier and enable IGMP snooping in said network. This may reduce the
load on the network significantly, because multicast packets are only delivered
to the endpoints of the respective member nodes.
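
If the {pve} nodes themselves attach to the cluster network through a Linux
bridge, a minimal sketch for enabling snooping and a querier on such a bridge
could look like the following. The bridge name `vmbr0` is just an assumption,
and on a managed switch the equivalent settings live in the switch
configuration instead; also note that these sysfs settings do not persist
across reboots unless added to your network configuration:

[source,bash]
----
# enable IGMP snooping on the bridge (example bridge name)
echo 1 > /sys/class/net/vmbr0/bridge/multicast_snooping

# let the bridge act as IGMP querier, so group memberships stay refreshed
echo 1 > /sys/class/net/vmbr0/bridge/multicast_querier
----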
140
141
142 [[pvecm_join_node_to_cluster]]
143 Adding Nodes to the Cluster
144 ---------------------------
145
146 Login via `ssh` to the node you want to add.
147
148 ----
149 hp2# pvecm add IP-ADDRESS-CLUSTER
150 ----
151
152 For `IP-ADDRESS-CLUSTER` use the IP or hostname of an existing cluster node.
153 An IP address is recommended (see <<corosync-addresses,Ring Address Types>>).
154
CAUTION: A new node cannot hold any VMs, because you would get
conflicts about identical VM IDs. Also, all existing configuration in
`/etc/pve` is overwritten when you join a new node to the cluster. As a
workaround, use `vzdump` to back up and restore the guests to a different VMID
after adding the node to the cluster.
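
As a sketch of that workaround, assuming a VM with VMID 100 and a backup
directory mounted at `/mnt/backup` (both placeholder values), the backup and
restore could look like the following; for containers, `pct restore` would be
used instead of `qmrestore`:

[source,bash]
----
# on the node to be added, before joining the cluster
vzdump 100 --dumpdir /mnt/backup --mode stop

# after joining, restore the guest under a free VMID, for example 200
qmrestore /mnt/backup/vzdump-qemu-100-<timestamp>.vma 200
----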
160
To check the state of the cluster:
162
163 ----
164 # pvecm status
165 ----
166
167 .Cluster status after adding 4 nodes
168 ----
169 hp2# pvecm status
170 Quorum information
171 ~~~~~~~~~~~~~~~~~~
172 Date: Mon Apr 20 12:30:13 2015
173 Quorum provider: corosync_votequorum
174 Nodes: 4
175 Node ID: 0x00000001
176 Ring ID: 1928
177 Quorate: Yes
178
179 Votequorum information
180 ~~~~~~~~~~~~~~~~~~~~~~
181 Expected votes: 4
182 Highest expected: 4
183 Total votes: 4
184 Quorum: 3
185 Flags: Quorate
186
187 Membership information
188 ~~~~~~~~~~~~~~~~~~~~~~
189 Nodeid Votes Name
190 0x00000001 1 192.168.15.91
191 0x00000002 1 192.168.15.92 (local)
192 0x00000003 1 192.168.15.93
193 0x00000004 1 192.168.15.94
194 ----
195
196 If you only want the list of all nodes use:
197
198 ----
199 # pvecm nodes
200 ----
201
202 .List nodes in a cluster
203 ----
204 hp2# pvecm nodes
205
206 Membership information
207 ~~~~~~~~~~~~~~~~~~~~~~
208 Nodeid Votes Name
209 1 1 hp1
210 2 1 hp2 (local)
211 3 1 hp3
212 4 1 hp4
213 ----
214
215 [[adding-nodes-with-separated-cluster-network]]
216 Adding Nodes With Separated Cluster Network
217 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
218
When adding a node to a cluster with a separated cluster network, you need to
use the 'ringX_addr' parameters to set the node's address on those networks:
221
222 [source,bash]
223 ----
224 pvecm add IP-ADDRESS-CLUSTER -ring0_addr IP-ADDRESS-RING0
225 ----
226
227 If you want to use the Redundant Ring Protocol you will also want to pass the
228 'ring1_addr' parameter.
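
For example, assuming the new node's addresses on the two cluster networks are
10.10.10.2 and 10.10.20.2 (placeholder values), the join command might look
like:

[source,bash]
----
pvecm add IP-ADDRESS-CLUSTER -ring0_addr 10.10.10.2 -ring1_addr 10.10.20.2
----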
229
230
231 Remove a Cluster Node
232 ---------------------
233
CAUTION: Read the procedure carefully before proceeding, as it may not be
what you want or need.
236
237 Move all virtual machines from the node. Make sure you have no local
238 data or backups you want to keep, or save them accordingly.
239 In the following example we will remove the node hp4 from the cluster.
240
241 Log in to a *different* cluster node (not hp4), and issue a `pvecm nodes`
242 command to identify the node ID to remove:
243
244 ----
245 hp1# pvecm nodes
246
247 Membership information
248 ~~~~~~~~~~~~~~~~~~~~~~
249 Nodeid Votes Name
250 1 1 hp1 (local)
251 2 1 hp2
252 3 1 hp3
253 4 1 hp4
254 ----
255
256
257 At this point you must power off hp4 and
258 make sure that it will not power on again (in the network) as it
259 is.
260
IMPORTANT: As mentioned above, it is critical to power off the node
*before* removal, and make sure that it will *never* power on again
(in the existing cluster network) as it is.
If you power on the node as it is, your cluster could end up in a broken state,
and it could be difficult to restore a clean cluster state.
266
267 After powering off the node hp4, we can safely remove it from the cluster.
268
269 ----
270 hp1# pvecm delnode hp4
271 ----
272
If the operation succeeds, no output is returned; just check the node
list again with `pvecm nodes` or `pvecm status`. You should see
something like:
276
277 ----
278 hp1# pvecm status
279
280 Quorum information
281 ~~~~~~~~~~~~~~~~~~
282 Date: Mon Apr 20 12:44:28 2015
283 Quorum provider: corosync_votequorum
284 Nodes: 3
285 Node ID: 0x00000001
286 Ring ID: 1992
287 Quorate: Yes
288
289 Votequorum information
290 ~~~~~~~~~~~~~~~~~~~~~~
291 Expected votes: 3
292 Highest expected: 3
293 Total votes: 3
294 Quorum: 2
295 Flags: Quorate
296
297 Membership information
298 ~~~~~~~~~~~~~~~~~~~~~~
299 Nodeid Votes Name
300 0x00000001 1 192.168.15.90 (local)
301 0x00000002 1 192.168.15.91
302 0x00000003 1 192.168.15.92
303 ----
304
If, for whatever reason, you want this server to join the same
cluster again, you have to
307
308 * reinstall {pve} on it from scratch
309
310 * then join it, as explained in the previous section.
311
312 [[pvecm_separate_node_without_reinstall]]
313 Separate A Node Without Reinstalling
314 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
315
316 CAUTION: This is *not* the recommended method, proceed with caution. Use the
317 above mentioned method if you're unsure.
318
You can also separate a node from a cluster without reinstalling it from
scratch. But after removing the node from the cluster, it will still have
access to the shared storages! This must be resolved before you start removing
the node from the cluster. A {pve} cluster cannot share the exact same
storage with another cluster, as storage locking doesn't work over the cluster
boundary. Furthermore, it may also lead to VMID conflicts.
325
It's suggested that you create a new storage to which only the node you want
to separate has access. This can be a new export on your NFS server or a new Ceph
pool, to name a few examples. It's just important that the exact same storage
does not get accessed by multiple clusters. After setting up this storage, move
all data from the node and its VMs to it. Then you are ready to separate the
node from the cluster.
332
WARNING: Ensure all shared resources are cleanly separated! Otherwise you will
run into conflicts and problems.
335
336 First stop the corosync and the pve-cluster services on the node:
337 [source,bash]
338 ----
339 systemctl stop pve-cluster
340 systemctl stop corosync
341 ----
342
343 Start the cluster filesystem again in local mode:
344 [source,bash]
345 ----
346 pmxcfs -l
347 ----
348
349 Delete the corosync configuration files:
350 [source,bash]
351 ----
352 rm /etc/pve/corosync.conf
353 rm /etc/corosync/*
354 ----
355
You can now start the filesystem again as a normal service:
357 [source,bash]
358 ----
359 killall pmxcfs
360 systemctl start pve-cluster
361 ----
362
The node is now separated from the cluster. You can delete it from any remaining
node of the cluster with:
365 [source,bash]
366 ----
367 pvecm delnode oldnode
368 ----
369
If the command fails because the remaining node in the cluster lost quorum
when the now separated node exited, you may set the expected votes to 1 as a workaround:
372 [source,bash]
373 ----
374 pvecm expected 1
375 ----
376
377 And then repeat the 'pvecm delnode' command.
378
Now switch back to the separated node and delete all remaining files left
over from the old cluster. This ensures that the node can be added to another
cluster again without problems.
382
383 [source,bash]
384 ----
385 rm /var/lib/corosync/*
386 ----
387
As the configuration files from the other nodes are still in the cluster
filesystem, you may want to clean those up too. Simply remove the whole
directory '/etc/pve/nodes/NODENAME' recursively, but check three times that
you used the correct one before deleting it.
392
CAUTION: The node's SSH keys are still in the 'authorized_keys' file; this means
the nodes can still connect to each other with public key authentication. This
should be fixed by removing the respective keys from the
'/etc/pve/priv/authorized_keys' file.
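
A possible way to locate the stale entries, assuming the removed node was
called hp4 and its keys carry the usual `root@hp4` comment (verify this before
relying on it), is to search for the name and then delete the matching lines
with an editor:

[source,bash]
----
# list the lines belonging to the removed node (example name hp4)
grep -n hp4 /etc/pve/priv/authorized_keys
----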
397
398 Quorum
399 ------
400
{pve} uses a quorum-based technique to provide a consistent state among
all cluster nodes.
403
404 [quote, from Wikipedia, Quorum (distributed computing)]
405 ____
406 A quorum is the minimum number of votes that a distributed transaction
407 has to obtain in order to be allowed to perform an operation in a
408 distributed system.
409 ____
410
In case of network partitioning, state changes require that a
majority of nodes are online. The cluster switches to read-only mode
if it loses quorum.
414
415 NOTE: {pve} assigns a single vote to each node by default.
416
417 Cluster Network
418 ---------------
419
The cluster network is the core of a cluster. All messages sent over it have to
be delivered reliably to all nodes in their respective order. In {pve} this
part is done by corosync, an implementation of a high performance, low overhead,
high availability development toolkit. It serves our decentralized
configuration file system (`pmxcfs`).
425
426 [[cluster-network-requirements]]
427 Network Requirements
428 ~~~~~~~~~~~~~~~~~~~~
This needs a reliable network with latencies under 2 milliseconds (LAN
performance) to work properly. While corosync can also use unicast for
communication between nodes, it's **highly recommended** to have a multicast
capable network. The network should not be used heavily by other members;
ideally corosync runs on its own network.
*Never* share it with a network where storage communicates too.
435
436 Before setting up a cluster it is good practice to check if the network is fit
437 for that purpose.
438
439 * Ensure that all nodes are in the same subnet. This must only be true for the
440 network interfaces used for cluster communication (corosync).
441
442 * Ensure all nodes can reach each other over those interfaces, using `ping` is
443 enough for a basic test.
444
* Ensure that multicast works in general and at high packet rates. This can be
  done with the `omping` tool. The final "%loss" number should be < 1%.
447 +
448 [source,bash]
449 ----
450 omping -c 10000 -i 0.001 -F -q NODE1-IP NODE2-IP ...
451 ----
452
453 * Ensure that multicast communication works over an extended period of time.
454 This uncovers problems where IGMP snooping is activated on the network but
455 no multicast querier is active. This test has a duration of around 10
456 minutes.
457 +
458 [source,bash]
459 ----
460 omping -c 600 -i 1 -q NODE1-IP NODE2-IP ...
461 ----
462
Your network is not ready for clustering if any of these tests fails. Recheck
your network configuration. Switches in particular are notorious for having
multicast disabled by default or IGMP snooping enabled with no IGMP querier
active.
467
In smaller clusters, it's also an option to use unicast if you really cannot get
multicast to work.
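
A minimal sketch of what this means for `corosync.conf`, assuming corosync 2.x:
set the totem transport to `udpu` (UDP unicast), bump the 'config_version' and
apply the change as described in the
<<edit-corosync-conf,edit the corosync.conf file>> section:

----
totem {
  ...
  transport: udpu
}
----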
470
471 Separate Cluster Network
472 ~~~~~~~~~~~~~~~~~~~~~~~~
473
When creating a cluster without any parameters, the cluster network is generally
shared with the Web UI and the VMs and their traffic. Depending on your setup,
even storage traffic may get sent over the same network. It's recommended to
change that, as corosync is a time-critical, real-time application.
478
479 Setting Up A New Network
480 ^^^^^^^^^^^^^^^^^^^^^^^^
481
First you have to set up a new network interface. It should be on a physically
separate network. Ensure that your network fulfills the
<<cluster-network-requirements,cluster network requirements>>.
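
A sketch of such an interface definition in `/etc/network/interfaces`, assuming
a spare NIC named `eno4` (an assumption) and the 10.10.10.1/25 address used in
the following example, could look like:

----
# dedicated cluster network
auto eno4
iface eno4 inet static
        address  10.10.10.1
        netmask  255.255.255.128
----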
485
486 Separate On Cluster Creation
487 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
488
This is possible via the 'ring0_addr' and 'bindnet0_addr' parameters of
the 'pvecm create' command, used for creating a new cluster.
491
If you have set up an additional NIC with a static address on 10.10.10.1/25,
and want to send and receive all cluster communication over this interface,
you would execute:
495
496 [source,bash]
497 ----
498 pvecm create test --ring0_addr 10.10.10.1 --bindnet0_addr 10.10.10.0
499 ----
500
501 To check if everything is working properly execute:
502 [source,bash]
503 ----
504 systemctl status corosync
505 ----
506
Afterwards, proceed as described in the section to
<<adding-nodes-with-separated-cluster-network,add nodes with a separated cluster network>>.
509
510 [[separate-cluster-net-after-creation]]
511 Separate After Cluster Creation
512 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
513
You can also do this if you have already created a cluster and want to switch
its communication to another network, without rebuilding the whole cluster.
This change may lead to short periods of quorum loss in the cluster, as nodes
have to restart corosync and come up one after the other on the new network.
518
Check how to <<edit-corosync-conf,edit the corosync.conf file>> first.
Then open it and you should see a file similar to:
521
522 ----
523 logging {
524 debug: off
525 to_syslog: yes
526 }
527
528 nodelist {
529
530 node {
531 name: due
532 nodeid: 2
533 quorum_votes: 1
534 ring0_addr: due
535 }
536
537 node {
538 name: tre
539 nodeid: 3
540 quorum_votes: 1
541 ring0_addr: tre
542 }
543
544 node {
545 name: uno
546 nodeid: 1
547 quorum_votes: 1
548 ring0_addr: uno
549 }
550
551 }
552
553 quorum {
554 provider: corosync_votequorum
555 }
556
557 totem {
558 cluster_name: thomas-testcluster
559 config_version: 3
560 ip_version: ipv4
561 secauth: on
562 version: 2
563 interface {
564 bindnetaddr: 192.168.30.50
565 ringnumber: 0
566 }
567
568 }
569 ----
570
The first thing you want to do is add the 'name' properties to the node entries,
if you do not see them already. Those *must* match the node name.
573
Then replace the addresses of the 'ring0_addr' properties with the new
addresses. You may use plain IP addresses or hostnames here. If you use
hostnames, ensure that they are resolvable from all nodes (see also
<<corosync-addresses,Ring Address Types>>).
578
In this example, we want to switch the cluster communication to the 10.10.10.1/25
network, so we replace all 'ring0_addr' properties accordingly. We also set the
'bindnetaddr' in the totem section of the config to an address of the new network.
It can be any address from the subnet configured on the new network interface.
583
After you have increased the 'config_version' property, the new configuration file
should look like:
586
587 ----
588
589 logging {
590 debug: off
591 to_syslog: yes
592 }
593
594 nodelist {
595
596 node {
597 name: due
598 nodeid: 2
599 quorum_votes: 1
600 ring0_addr: 10.10.10.2
601 }
602
603 node {
604 name: tre
605 nodeid: 3
606 quorum_votes: 1
607 ring0_addr: 10.10.10.3
608 }
609
610 node {
611 name: uno
612 nodeid: 1
613 quorum_votes: 1
614 ring0_addr: 10.10.10.1
615 }
616
617 }
618
619 quorum {
620 provider: corosync_votequorum
621 }
622
623 totem {
624 cluster_name: thomas-testcluster
625 config_version: 4
626 ip_version: ipv4
627 secauth: on
628 version: 2
629 interface {
630 bindnetaddr: 10.10.10.1
631 ringnumber: 0
632 }
633
634 }
635 ----
636
Now, after a final check that all the changed information is correct, we save it
and refer again to the <<edit-corosync-conf,edit corosync.conf file>> section to
learn how to bring it into effect.
640
As our change cannot be applied live by corosync, we have to do a restart.
642
643 On a single node execute:
644 [source,bash]
645 ----
646 systemctl restart corosync
647 ----
648
649 Now check if everything is fine:
650
651 [source,bash]
652 ----
653 systemctl status corosync
654 ----
655
If corosync runs correctly again, restart it on all other nodes as well.
They will then join the cluster membership one by one on the new network.
658
659 [[corosync-addresses]]
660 Corosync addresses
661 ~~~~~~~~~~~~~~~~~~
662
663 A corosync link or ring address can be specified in two ways:
664
665 * **IPv4/v6 addresses** will be used directly. They are recommended, since they
666 are static and usually not changed carelessly.
667
668 * **Hostnames** will be resolved using `getaddrinfo`, which means that per
669 default, IPv6 addresses will be used first, if available (see also
670 `man gai.conf`). Keep this in mind, especially when upgrading an existing
671 cluster to IPv6.
672
673 CAUTION: Hostnames should be used with care, since the address they
674 resolve to can be changed without touching corosync or the node it runs on -
675 which may lead to a situation where an address is changed without thinking
676 about implications for corosync.
677
A separate, static hostname specifically for corosync is recommended, if
hostnames are preferred. Also, make sure that every node in the cluster can
resolve all hostnames correctly.
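
If you prefer hostnames, a sketch of such dedicated entries in `/etc/hosts` on
every node could look like the following; the names and addresses are purely
illustrative:

----
# static hostnames used only for corosync
10.10.10.1   corosync1
10.10.10.2   corosync2
10.10.10.3   corosync3
----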
681
682 Since {pve} 5.1, while supported, hostnames will be resolved at the time of
683 entry. Only the resolved IP is then saved to the configuration.
684
Nodes that joined the cluster on earlier versions likely still use their
unresolved hostname in `corosync.conf`. It might be a good idea to replace
them with IPs or a separate hostname, as mentioned above.
688
689 [[pvecm_rrp]]
690 Redundant Ring Protocol
691 ~~~~~~~~~~~~~~~~~~~~~~~
To avoid a single point of failure, you should implement counter measures.
This can be done on the hardware and operating system level through network bonding.
694
Corosync itself also offers the possibility to add redundancy through the
so-called 'Redundant Ring Protocol'. This protocol allows running a second totem
ring on another network. This network should be physically separated from the
other ring's network to actually increase availability.
699
700 RRP On Cluster Creation
701 ~~~~~~~~~~~~~~~~~~~~~~~
702
The 'pvecm create' command provides the additional parameters 'bindnetX_addr',
'ringX_addr' and 'rrp_mode', which can be used for RRP configuration.
705
706 NOTE: See the <<corosync-conf-glossary,glossary>> if you do not know what each parameter means.
707
708 So if you have two networks, one on the 10.10.10.1/24 and the other on the
709 10.10.20.1/24 subnet you would execute:
710
711 [source,bash]
712 ----
713 pvecm create CLUSTERNAME -bindnet0_addr 10.10.10.1 -ring0_addr 10.10.10.1 \
714 -bindnet1_addr 10.10.20.1 -ring1_addr 10.10.20.1
715 ----
716
717 RRP On Existing Clusters
718 ~~~~~~~~~~~~~~~~~~~~~~~~
719
You will take similar steps as described in
<<separate-cluster-net-after-creation,separating the cluster network>> to
enable RRP on an already running cluster. The only difference is that you
will add `ring1` and use it instead of `ring0`.
724
First, add a new `interface` subsection in the `totem` section and set its
`ringnumber` property to `1`. Set the interface's `bindnetaddr` property to an
address of the subnet you have configured for your new ring.
Further, set the `rrp_mode` to `passive`; this is the only stable mode.
729
Then add to each node entry in the `nodelist` section its new `ring1_addr`
property with the node's additional ring address.
732
733 So if you have two networks, one on the 10.10.10.1/24 and the other on the
734 10.10.20.1/24 subnet, the final configuration file should look like:
735
736 ----
737 totem {
738 cluster_name: tweak
739 config_version: 9
740 ip_version: ipv4
741 rrp_mode: passive
742 secauth: on
743 version: 2
744 interface {
745 bindnetaddr: 10.10.10.1
746 ringnumber: 0
747 }
748 interface {
749 bindnetaddr: 10.10.20.1
750 ringnumber: 1
751 }
752 }
753
754 nodelist {
755 node {
756 name: pvecm1
757 nodeid: 1
758 quorum_votes: 1
759 ring0_addr: 10.10.10.1
760 ring1_addr: 10.10.20.1
761 }
762
763 node {
764 name: pvecm2
765 nodeid: 2
766 quorum_votes: 1
767 ring0_addr: 10.10.10.2
768 ring1_addr: 10.10.20.2
769 }
770
771 [...] # other cluster nodes here
772 }
773
774 [...] # other remaining config sections here
775
776 ----
777
Bring it into effect as described in the
<<edit-corosync-conf,edit the corosync.conf file>> section.

This is a change which cannot take effect live and needs at least a restart
of corosync. A restart of the whole cluster is recommended.
783
If you cannot reboot the whole cluster, ensure that no High Availability services
are configured and then stop the corosync service on all nodes. After corosync is
stopped on all nodes, start it again one node after the other.
787
788 Corosync External Vote Support
789 ------------------------------
790
791 This section describes a way to deploy an external voter in a {pve} cluster.
792 When configured, the cluster can sustain more node failures without
793 violating safety properties of the cluster communication.
794
795 For this to work there are two services involved:
796
* a so-called QDevice daemon which runs on each {pve} node
798
799 * an external vote daemon which runs on an independent server.
800
801 As a result you can achieve higher availability even in smaller setups (for
802 example 2+1 nodes).
803
804 QDevice Technical Overview
805 ~~~~~~~~~~~~~~~~~~~~~~~~~~
806
The Corosync Quorum Device (QDevice) is a daemon which runs on each cluster
node. It provides a configured number of votes to the cluster's quorum
subsystem, based on the decision of an externally running third-party arbitrator.
Its primary use is to allow a cluster to sustain more node failures than
standard quorum rules allow. This can be done safely as the external device
can see all nodes and thus choose only one set of nodes to give its vote.
This will only be done if said set of nodes can have quorum (again) when
receiving the third-party vote.
815
816 Currently only 'QDevice Net' is supported as a third-party arbitrator. It is
817 a daemon which provides a vote to a cluster partition if it can reach the
partition members over the network. It will only give votes to one partition
of a cluster at any time.
820 It's designed to support multiple clusters and is almost configuration and
821 state free. New clusters are handled dynamically and no configuration file
822 is needed on the host running a QDevice.
823
The only requirement for the external host is that it needs network access to the
cluster and has a corosync-qnetd package available. We provide such a package
for Debian based hosts; other Linux distributions should also have a package
available through their respective package manager.
828
829 NOTE: In contrast to corosync itself, a QDevice connects to the cluster over
830 TCP/IP and thus does not need a multicast capable network between itself and
the cluster. In fact, the daemon may even run outside of the LAN and can have
higher latencies than 2 ms.
833
834
835 Supported Setups
836 ~~~~~~~~~~~~~~~~
837
We support QDevices for clusters with an even number of nodes and recommend
it for 2 node clusters, if they should provide higher availability.
For clusters with an odd node count, we currently discourage the use of
QDevices. The reason for this is the difference in the number of votes the
QDevice provides for each cluster type. Even numbered clusters get a single
additional vote, with which we can only increase availability, i.e. if the
QDevice itself fails we are in the same situation as with no QDevice at all.
845
Now, with an odd numbered cluster size, the QDevice provides '(N-1)' votes --
where 'N' corresponds to the cluster node count. This difference makes
sense; if we had only one additional vote, the cluster could get into a split
brain situation.
This algorithm allows that all nodes but one (and naturally the
QDevice itself) can fail.
There are two drawbacks with this:
853
854 * If the QNet daemon itself fails, no other node may fail or the cluster
855 immediately loses quorum. For example, in a cluster with 15 nodes 7
856 could fail before the cluster becomes inquorate. But, if a QDevice is
857 configured here and said QDevice fails itself **no single node** of
858 the 15 may fail. The QDevice acts almost as a single point of failure in
859 this case.
860
* The fact that all but one node plus QDevice may fail sounds promising at
  first, but this may result in a mass recovery of HA services, which could
  overload the single remaining node. Also, a Ceph server will stop providing
  services if only '((N-1)/2)' nodes or less remain online.
865
866 If you understand the drawbacks and implications you can decide yourself if
867 you should use this technology in an odd numbered cluster setup.
868
869
870 QDevice-Net Setup
871 ~~~~~~~~~~~~~~~~~
872
We recommend running any daemon which provides votes to corosync-qdevice as an
unprivileged user. {pve} and Debian provide a package which is already
configured to do so.
The traffic between the daemon and the cluster must be encrypted to ensure a
safe and secure QDevice integration in {pve}.
878
879 First install the 'corosync-qnetd' package on your external server and
880 the 'corosync-qdevice' package on all cluster nodes.
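
Assuming Debian based hosts on both sides, the package installation is a plain
`apt` call:

[source,bash]
----
# on the external server
apt install corosync-qnetd

# on all {pve} cluster nodes
apt install corosync-qdevice
----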
881
After that, ensure that all the nodes in the cluster are online.
883
884 You can now easily set up your QDevice by running the following command on one
885 of the {pve} nodes:
886
887 ----
888 pve# pvecm qdevice setup <QDEVICE-IP>
889 ----
890
891 The SSH key from the cluster will be automatically copied to the QDevice. You
892 might need to enter an SSH password during this step.
893
894 After you enter the password and all the steps are successfully completed, you
895 will see "Done". You can check the status now:
896
897 ----
898 pve# pvecm status
899
900 ...
901
902 Votequorum information
903 ~~~~~~~~~~~~~~~~~~~~~
904 Expected votes: 3
905 Highest expected: 3
906 Total votes: 3
907 Quorum: 2
908 Flags: Quorate Qdevice
909
910 Membership information
911 ~~~~~~~~~~~~~~~~~~~~~~
912 Nodeid Votes Qdevice Name
913 0x00000001 1 A,V,NMW 192.168.22.180 (local)
914 0x00000002 1 A,V,NMW 192.168.22.181
915 0x00000000 1 Qdevice
916
917 ----
918
919 which means the QDevice is set up.
920
921
922 Frequently Asked Questions
923 ~~~~~~~~~~~~~~~~~~~~~~~~~~
924
925 Tie Breaking
926 ^^^^^^^^^^^^
927
In case of a tie, where two same-sized cluster partitions cannot see each other
but can see the QDevice, the QDevice randomly chooses one of those partitions and
provides a vote to it.
931
932 Possible Negative Implications
933 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
934
935 For clusters with an even node count there are no negative implications when
936 setting up a QDevice. If it fails to work, you are as good as without QDevice at
937 all.
938
939 Adding/Deleting Nodes After QDevice Setup
940 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
941
942 If you want to add a new node or remove an existing one from a cluster with a
943 QDevice setup, you need to remove the QDevice first. After that, you can add or
944 remove nodes normally. Once you have a cluster with an even node count again,
945 you can set up the QDevice again as described above.
946
947 Removing the QDevice
948 ^^^^^^^^^^^^^^^^^^^^
949
950 If you used the official `pvecm` tool to add the QDevice, you can remove it
951 trivially by running:
952
953 ----
954 pve# pvecm qdevice remove
955 ----
956
957 //Still TODO
958 //^^^^^^^^^^
959 //There ist still stuff to add here
960
961
962 Corosync Configuration
963 ----------------------
964
The `/etc/pve/corosync.conf` file plays a central role in a {pve} cluster. It
controls the cluster membership and its network.
To read more about it, check the corosync.conf man page:
968 [source,bash]
969 ----
970 man corosync.conf
971 ----
972
973 For node membership you should always use the `pvecm` tool provided by {pve}.
974 You may have to edit the configuration file manually for other changes.
975 Here are a few best practice tips for doing this.
976
977 [[edit-corosync-conf]]
978 Edit corosync.conf
979 ~~~~~~~~~~~~~~~~~~
980
Editing the corosync.conf file is not always straightforward. There are
two on each cluster node, one in `/etc/pve/corosync.conf` and the other in
`/etc/corosync/corosync.conf`. Editing the one in our cluster file system will
propagate the changes to the local one, but not vice versa.
985
The configuration will get updated automatically as soon as the file changes.
This means changes which can be integrated in a running corosync will take
effect instantly. So you should always make a copy and edit that instead, to
avoid triggering unwanted changes with an intermediate save.
990
991 [source,bash]
992 ----
993 cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
994 ----
995
Then open the config file with your favorite editor; `nano` and `vim.tiny` are
preinstalled on {pve}, for example.
998
NOTE: Always increment the 'config_version' number on configuration changes;
omitting this can lead to problems.
1001
After making the necessary changes, create another copy of the current working
configuration file. This serves as a backup if the new configuration fails to
apply or causes problems in other ways.
1005
1006 [source,bash]
1007 ----
1008 cp /etc/pve/corosync.conf /etc/pve/corosync.conf.bak
1009 ----
1010
1011 Then move the new configuration file over the old one:
1012 [source,bash]
1013 ----
1014 mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf
1015 ----
1016
Check with the following commands whether the change could be applied automatically:
1018 [source,bash]
1019 ----
1020 systemctl status corosync
1021 journalctl -b -u corosync
1022 ----
1023
If the change could not be applied automatically, you may have to restart the
corosync service via:
1026 [source,bash]
1027 ----
1028 systemctl restart corosync
1029 ----
1030
1031 On errors check the troubleshooting section below.
1032
1033 Troubleshooting
1034 ~~~~~~~~~~~~~~~
1035
1036 Issue: 'quorum.expected_votes must be configured'
1037 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1038
If corosync fails to start and you get the following message in the system log:
1040
1041 ----
1042 [...]
1043 corosync[1647]: [QUORUM] Quorum provider: corosync_votequorum failed to initialize.
1044 corosync[1647]: [SERV ] Service engine 'corosync_quorum' failed to load for reason
1045 'configuration error: nodelist or quorum.expected_votes must be configured!'
1046 [...]
1047 ----
1048
It means that the hostname you set for the corosync 'ringX_addr' in the
configuration could not be resolved.
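
A quick way to verify this is to check whether the name from the affected
'ringX_addr' entry (for example the node name `due` from the configuration
shown earlier) resolves on the node at all:

[source,bash]
----
getent hosts due
----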
1051
1052
1053 Write Configuration When Not Quorate
1054 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1055
If you need to change '/etc/pve/corosync.conf' on a node with no quorum, and you
know what you are doing, use:
1058 [source,bash]
1059 ----
1060 pvecm expected 1
1061 ----
1062
1063 This sets the expected vote count to 1 and makes the cluster quorate. You can
1064 now fix your configuration, or revert it back to the last working backup.
1065
This is not enough if corosync cannot start anymore. In that case, it is best to
edit the local copy of the corosync configuration in '/etc/corosync/corosync.conf',
so that corosync can start again. Ensure that this configuration has the same
content on all nodes to avoid split brain situations. If you are not sure what
went wrong, it's best to ask the Proxmox Community to help you.
1071
1072
1073 [[corosync-conf-glossary]]
1074 Corosync Configuration Glossary
1075 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1076
1077 ringX_addr::
1078 This names the different ring addresses for the corosync totem rings used for
1079 the cluster communication.
1080
1081 bindnetaddr::
Defines which interface the ring should bind to. It may be any address of
the subnet configured on the interface we want to use. In general, it is
recommended to just use an address a node uses on this interface.
1085
1086 rrp_mode::
Specifies the mode of the redundant ring protocol and may be passive, active or
none. Note that use of active is highly experimental and not officially
supported. Passive is the preferred mode; it may double the cluster
communication throughput and increases availability.
1091
1092
1093 Cluster Cold Start
1094 ------------------
1095
1096 It is obvious that a cluster is not quorate when all nodes are
1097 offline. This is a common case after a power failure.
1098
1099 NOTE: It is always a good idea to use an uninterruptible power supply
1100 (``UPS'', also called ``battery backup'') to avoid this state, especially if
1101 you want HA.
1102
1103 On node startup, the `pve-guests` service is started and waits for
1104 quorum. Once quorate, it starts all guests which have the `onboot`
1105 flag set.
1106
When you turn on nodes, or when power comes back after power failure,
it is likely that some nodes will boot faster than others. Please keep in
mind that guest startup is delayed until you reach quorum.
1110
1111
1112 Guest Migration
1113 ---------------
1114
1115 Migrating virtual guests to other nodes is a useful feature in a
1116 cluster. There are settings to control the behavior of such
1117 migrations. This can be done via the configuration file
1118 `datacenter.cfg` or for a specific migration via API or command line
1119 parameters.
1120
1121 It makes a difference if a Guest is online or offline, or if it has
1122 local resources (like a local disk).
1123
For details about virtual machine migration, see the
xref:qm_migration[QEMU/KVM Migration Chapter].

For details about container migration, see the
xref:pct_migration[Container Migration Chapter].
1129
1130 Migration Type
1131 ~~~~~~~~~~~~~~
1132
1133 The migration type defines if the migration data should be sent over an
1134 encrypted (`secure`) channel or an unencrypted (`insecure`) one.
Setting the migration type to insecure means that the RAM content of a
virtual guest also gets transferred unencrypted, which can lead to
information disclosure of critical data from inside the guest (for
example, passwords or encryption keys).
1139
Therefore, we strongly recommend using the secure channel if you do
not have full control over the network and can not guarantee that no
one is eavesdropping on it.
1143
1144 NOTE: Storage migration does not follow this setting. Currently, it
1145 always sends the storage content over a secure channel.
1146
Encryption requires a lot of computing power, so this setting is often
changed to `insecure` to achieve better performance. The impact on
modern systems is lower because they implement AES encryption in
hardware. The performance impact is particularly evident in fast
networks, where you can transfer 10 Gbps or more.
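
For reference, a minimal sketch of setting the type cluster-wide in
`/etc/pve/datacenter.cfg`; keep `secure` unless the migration network is fully
trusted:

----
# /etc/pve/datacenter.cfg
migration: secure
----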
1152
1153
1154 Migration Network
1155 ~~~~~~~~~~~~~~~~~
1156
1157 By default, {pve} uses the network in which cluster communication
1158 takes place to send the migration traffic. This is not optimal because
1159 sensitive cluster traffic can be disrupted and this network may not
1160 have the best bandwidth available on the node.
1161
1162 Setting the migration network parameter allows the use of a dedicated
1163 network for the entire migration traffic. In addition to the memory,
1164 this also affects the storage traffic for offline migrations.
1165
The migration network is set as a network in CIDR notation. This
has the advantage that you do not have to set individual IP addresses
for each node. {pve} can determine the real address on the
destination node from the network specified in the CIDR form. To
enable this, the network must be specified so that each node has exactly
one IP in the respective network.
1172
1173
1174 Example
1175 ^^^^^^^
1176
1177 We assume that we have a three-node setup with three separate
1178 networks. One for public communication with the Internet, one for
1179 cluster communication and a very fast one, which we want to use as a
1180 dedicated network for migration.
1181
1182 A network configuration for such a setup might look as follows:
1183
1184 ----
1185 iface eno1 inet manual
1186
1187 # public network
1188 auto vmbr0
1189 iface vmbr0 inet static
1190 address 192.X.Y.57
        netmask  255.255.255.0
1192 gateway 192.X.Y.1
1193 bridge_ports eno1
1194 bridge_stp off
1195 bridge_fd 0
1196
1197 # cluster network
1198 auto eno2
1199 iface eno2 inet static
1200 address 10.1.1.1
1201 netmask 255.255.255.0
1202
1203 # fast network
1204 auto eno3
1205 iface eno3 inet static
1206 address 10.1.2.1
1207 netmask 255.255.255.0
1208 ----
1209
1210 Here, we will use the network 10.1.2.0/24 as a migration network. For
1211 a single migration, you can do this using the `migration_network`
1212 parameter of the command line tool:
1213
1214 ----
1215 # qm migrate 106 tre --online --migration_network 10.1.2.0/24
1216 ----
1217
1218 To configure this as the default network for all migrations in the
1219 cluster, set the `migration` property of the `/etc/pve/datacenter.cfg`
1220 file:
1221
1222 ----
1223 # use dedicated migration network
1224 migration: secure,network=10.1.2.0/24
1225 ----
1226
1227 NOTE: The migration type must always be set when the migration network
1228 gets set in `/etc/pve/datacenter.cfg`.
1229
1230
1231 ifdef::manvolnum[]
1232 include::pve-copyright.adoc[]
1233 endif::manvolnum[]