1 [[chapter_pvecm]]
2 ifdef::manvolnum[]
3 pvecm(1)
4 ========
5 :pve-toplevel:
6
7 NAME
8 ----
9
10 pvecm - Proxmox VE Cluster Manager
11
12 SYNOPSIS
13 --------
14
15 include::pvecm.1-synopsis.adoc[]
16
17 DESCRIPTION
18 -----------
19 endif::manvolnum[]
20
21 ifndef::manvolnum[]
22 Cluster Manager
23 ===============
24 :pve-toplevel:
25 endif::manvolnum[]
26
27 The {PVE} cluster manager `pvecm` is a tool to create a group of
28 physical servers. Such a group is called a *cluster*. We use the
29 http://www.corosync.org[Corosync Cluster Engine] for reliable group
30 communication, and such clusters can consist of up to 32 physical nodes
(probably more, depending on network latency).
32
33 `pvecm` can be used to create a new cluster, join nodes to a cluster,
34 leave the cluster, get status information and do various other cluster
35 related tasks. The **P**rox**m**o**x** **C**luster **F**ile **S**ystem (``pmxcfs'')
36 is used to transparently distribute the cluster configuration to all cluster
37 nodes.
38
39 Grouping nodes into a cluster has the following advantages:
40
41 * Centralized, web based management
42
* Multi-master clusters: each node can do all management tasks
44
45 * `pmxcfs`: database-driven file system for storing configuration files,
46 replicated in real-time on all nodes using `corosync`.
47
48 * Easy migration of virtual machines and containers between physical
49 hosts
50
51 * Fast deployment
52
53 * Cluster-wide services like firewall and HA
54
55
56 Requirements
57 ------------
58
59 * All nodes must be in the same network as `corosync` uses IP Multicast
60 to communicate between nodes (also see
61 http://www.corosync.org[Corosync Cluster Engine]). Corosync uses UDP
62 ports 5404 and 5405 for cluster communication.
63 +
64 NOTE: Some switches do not support IP multicast by default and must be
65 manually enabled first.
66
67 * Date and time have to be synchronized.
68
* An SSH tunnel on TCP port 22 between nodes is used.
70
71 * If you are interested in High Availability, you need to have at
72 least three nodes for reliable quorum. All nodes should have the
73 same version.
74
75 * We recommend a dedicated NIC for the cluster traffic, especially if
76 you use shared storage.
77
* The root password of a cluster node is required for adding nodes.
79
80 NOTE: It is not possible to mix {pve} 3.x and earlier with {pve} 4.X cluster
81 nodes.
82
NOTE: While it's possible to mix {pve} 4.4 and {pve} 5.0 nodes, this is not
supported as a production configuration and should only be done temporarily,
during an upgrade of the whole cluster from one major version to another.
86
87
88 Preparing Nodes
89 ---------------
90
91 First, install {PVE} on all nodes. Make sure that each node is
92 installed with the final hostname and IP configuration. Changing the
93 hostname and IP is not possible after cluster creation.
94
Currently the cluster creation can either be done on the console (login via `ssh`)
or through the API, for which we have a GUI implementation (__Datacenter ->
Cluster__).
98
While it is common practice to reference all other node names with their IPs in
`/etc/hosts`, this is not strictly necessary for a cluster, which normally uses
multicast, to work. It may still be useful, as you can then connect from one
node to the other via SSH, using the easier to remember node name.
103
104 [[pvecm_create_cluster]]
105 Create the Cluster
106 ------------------
107
108 Login via `ssh` to the first {pve} node. Use a unique name for your cluster.
109 This name cannot be changed later. The cluster name follows the same rules as
110 node names.
111
112 ----
113 hp1# pvecm create CLUSTERNAME
114 ----
115
116 CAUTION: The cluster name is used to compute the default multicast address.
117 Please use unique cluster names if you run more than one cluster inside your
118 network. To avoid human confusion, it is also recommended to choose different
119 names even if clusters do not share the cluster network.
120
121 To check the state of your cluster use:
122
123 ----
124 hp1# pvecm status
125 ----
126
127 Multiple Clusters In Same Network
128 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
129
130 It is possible to create multiple clusters in the same physical or logical
131 network. Each cluster must have a unique name, which is used to generate the
132 cluster's multicast group address. As long as no duplicate cluster names are
133 configured in one network segment, the different clusters won't interfere with
134 each other.
135
If multiple clusters operate in a single network, it may be beneficial to set up
an IGMP querier and enable IGMP snooping in said network. This may reduce the
load on the network significantly, because multicast packets are only delivered
to the endpoints of the respective member nodes.
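
If the {pve} nodes themselves attach to the cluster network through a Linux
bridge, a minimal sketch for enabling snooping and a querier on such a bridge
could look like the following. The bridge name `vmbr0` is just an assumption,
and on a managed switch the equivalent settings live in the switch
configuration instead; also note that these sysfs settings do not persist
across reboots unless added to your network configuration:

[source,bash]
----
# enable IGMP snooping on the bridge (example bridge name)
echo 1 > /sys/class/net/vmbr0/bridge/multicast_snooping

# let the bridge act as IGMP querier, so group memberships stay refreshed
echo 1 > /sys/class/net/vmbr0/bridge/multicast_querier
----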
140
141
142 [[pvecm_join_node_to_cluster]]
143 Adding Nodes to the Cluster
144 ---------------------------
145
146 Login via `ssh` to the node you want to add.
147
148 ----
149 hp2# pvecm add IP-ADDRESS-CLUSTER
150 ----
151
152 For `IP-ADDRESS-CLUSTER` use the IP or hostname of an existing cluster node.
153 An IP address is recommended (see <<corosync-addresses,Ring Address Types>>).
154
CAUTION: A new node cannot hold any VMs, because you would get
conflicts about identical VM IDs. Also, all existing configuration in
`/etc/pve` is overwritten when you join a new node to the cluster. As a
workaround, use `vzdump` to back up and restore the guests to a different VMID
after adding the node to the cluster.
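
As a sketch of that workaround, assuming a VM with VMID 100 and a backup
directory mounted at `/mnt/backup` (both placeholder values), the backup and
restore could look like the following; for containers, `pct restore` would be
used instead of `qmrestore`:

[source,bash]
----
# on the node to be added, before joining the cluster
vzdump 100 --dumpdir /mnt/backup --mode stop

# after joining, restore the guest under a free VMID, for example 200
qmrestore /mnt/backup/vzdump-qemu-100-<timestamp>.vma 200
----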
160
To check the state of the cluster:
162
163 ----
164 # pvecm status
165 ----
166
167 .Cluster status after adding 4 nodes
168 ----
169 hp2# pvecm status
170 Quorum information
171 ~~~~~~~~~~~~~~~~~~
172 Date: Mon Apr 20 12:30:13 2015
173 Quorum provider: corosync_votequorum
174 Nodes: 4
175 Node ID: 0x00000001
176 Ring ID: 1928
177 Quorate: Yes
178
179 Votequorum information
180 ~~~~~~~~~~~~~~~~~~~~~~
181 Expected votes: 4
182 Highest expected: 4
183 Total votes: 4
184 Quorum: 3
185 Flags: Quorate
186
187 Membership information
188 ~~~~~~~~~~~~~~~~~~~~~~
189 Nodeid Votes Name
190 0x00000001 1 192.168.15.91
191 0x00000002 1 192.168.15.92 (local)
192 0x00000003 1 192.168.15.93
193 0x00000004 1 192.168.15.94
194 ----
195
196 If you only want the list of all nodes use:
197
198 ----
199 # pvecm nodes
200 ----
201
202 .List nodes in a cluster
203 ----
204 hp2# pvecm nodes
205
206 Membership information
207 ~~~~~~~~~~~~~~~~~~~~~~
208 Nodeid Votes Name
209 1 1 hp1
210 2 1 hp2 (local)
211 3 1 hp3
212 4 1 hp4
213 ----
214
215 [[adding-nodes-with-separated-cluster-network]]
216 Adding Nodes With Separated Cluster Network
217 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
218
When adding a node to a cluster with a separated cluster network, you need to
use the 'ringX_addr' parameters to set the node's address on those networks:
221
222 [source,bash]
223 ----
224 pvecm add IP-ADDRESS-CLUSTER -ring0_addr IP-ADDRESS-RING0
225 ----
226
227 If you want to use the Redundant Ring Protocol you will also want to pass the
228 'ring1_addr' parameter.
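
For example, assuming the new node's addresses on the two cluster networks are
10.10.10.2 and 10.10.20.2 (placeholder values), the join command might look
like:

[source,bash]
----
pvecm add IP-ADDRESS-CLUSTER -ring0_addr 10.10.10.2 -ring1_addr 10.10.20.2
----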
229
230
231 Remove a Cluster Node
232 ---------------------
233
CAUTION: Read the procedure carefully before proceeding, as it may not be
what you want or need.
236
237 Move all virtual machines from the node. Make sure you have no local
238 data or backups you want to keep, or save them accordingly.
239 In the following example we will remove the node hp4 from the cluster.
240
241 Log in to a *different* cluster node (not hp4), and issue a `pvecm nodes`
242 command to identify the node ID to remove:
243
244 ----
245 hp1# pvecm nodes
246
247 Membership information
248 ~~~~~~~~~~~~~~~~~~~~~~
249 Nodeid Votes Name
250 1 1 hp1 (local)
251 2 1 hp2
252 3 1 hp3
253 4 1 hp4
254 ----
255
256
257 At this point you must power off hp4 and
258 make sure that it will not power on again (in the network) as it
259 is.
260
IMPORTANT: As mentioned above, it is critical to power off the node
*before* removal, and make sure that it will *never* power on again
(in the existing cluster network) as it is.
If you power on the node as it is, your cluster could end up in a broken state,
and it could be difficult to restore a clean cluster state.
266
267 After powering off the node hp4, we can safely remove it from the cluster.
268
269 ----
270 hp1# pvecm delnode hp4
271 ----
272
If the operation succeeds, no output is returned; just check the node
list again with `pvecm nodes` or `pvecm status`. You should see
something like:
276
277 ----
278 hp1# pvecm status
279
280 Quorum information
281 ~~~~~~~~~~~~~~~~~~
282 Date: Mon Apr 20 12:44:28 2015
283 Quorum provider: corosync_votequorum
284 Nodes: 3
285 Node ID: 0x00000001
286 Ring ID: 1992
287 Quorate: Yes
288
289 Votequorum information
290 ~~~~~~~~~~~~~~~~~~~~~~
291 Expected votes: 3
292 Highest expected: 3
293 Total votes: 3
294 Quorum: 2
295 Flags: Quorate
296
297 Membership information
298 ~~~~~~~~~~~~~~~~~~~~~~
299 Nodeid Votes Name
300 0x00000001 1 192.168.15.90 (local)
301 0x00000002 1 192.168.15.91
302 0x00000003 1 192.168.15.92
303 ----
304
If, for whatever reason, you want this server to join the same
cluster again, you have to
307
308 * reinstall {pve} on it from scratch
309
310 * then join it, as explained in the previous section.
311
312 [[pvecm_separate_node_without_reinstall]]
313 Separate A Node Without Reinstalling
314 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
315
316 CAUTION: This is *not* the recommended method, proceed with caution. Use the
317 above mentioned method if you're unsure.
318
You can also separate a node from a cluster without reinstalling it from
scratch. But after removing the node from the cluster, it will still have
access to the shared storages! This must be resolved before you start removing
the node from the cluster. A {pve} cluster cannot share the exact same
storage with another cluster, as storage locking doesn't work over the cluster
boundary. Furthermore, it may also lead to VMID conflicts.
325
It's suggested that you create a new storage to which only the node you want
to separate has access. This can be a new export on your NFS server or a new Ceph
pool, to name a few examples. It's just important that the exact same storage
does not get accessed by multiple clusters. After setting up this storage, move
all data from the node and its VMs to it. Then you are ready to separate the
node from the cluster.
332
WARNING: Ensure all shared resources are cleanly separated! Otherwise you will
run into conflicts and problems.
335
336 First stop the corosync and the pve-cluster services on the node:
337 [source,bash]
338 ----
339 systemctl stop pve-cluster
340 systemctl stop corosync
341 ----
342
343 Start the cluster filesystem again in local mode:
344 [source,bash]
345 ----
346 pmxcfs -l
347 ----
348
349 Delete the corosync configuration files:
350 [source,bash]
351 ----
352 rm /etc/pve/corosync.conf
353 rm /etc/corosync/*
354 ----
355
You can now start the filesystem again as a normal service:
357 [source,bash]
358 ----
359 killall pmxcfs
360 systemctl start pve-cluster
361 ----
362
The node is now separated from the cluster. You can delete it from any remaining
node of the cluster with:
365 [source,bash]
366 ----
367 pvecm delnode oldnode
368 ----
369
If the command fails because the remaining node in the cluster lost quorum
when the now separated node exited, you may set the expected votes to 1 as a workaround:
372 [source,bash]
373 ----
374 pvecm expected 1
375 ----
376
377 And then repeat the 'pvecm delnode' command.
378
Now switch back to the separated node and delete all remaining files left
over from the old cluster. This ensures that the node can be added to another
cluster again without problems.
382
383 [source,bash]
384 ----
385 rm /var/lib/corosync/*
386 ----
387
As the configuration files from the other nodes are still in the cluster
filesystem, you may want to clean those up too. Simply remove the whole
directory '/etc/pve/nodes/NODENAME' recursively, but check three times that
you used the correct one before deleting it.
392
CAUTION: The node's SSH keys are still in the 'authorized_keys' file; this means
the nodes can still connect to each other with public key authentication. This
should be fixed by removing the respective keys from the
'/etc/pve/priv/authorized_keys' file.
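
A possible way to locate the stale entries, assuming the removed node was
called hp4 and its keys carry the usual `root@hp4` comment (verify this before
relying on it), is to search for the name and then delete the matching lines
with an editor:

[source,bash]
----
# list the lines belonging to the removed node (example name hp4)
grep -n hp4 /etc/pve/priv/authorized_keys
----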
397
398 Quorum
399 ------
400
{pve} uses a quorum-based technique to provide a consistent state among
all cluster nodes.
403
404 [quote, from Wikipedia, Quorum (distributed computing)]
405 ____
406 A quorum is the minimum number of votes that a distributed transaction
407 has to obtain in order to be allowed to perform an operation in a
408 distributed system.
409 ____
410
In case of network partitioning, state changes require that a
majority of nodes are online. The cluster switches to read-only mode
if it loses quorum.
414
415 NOTE: {pve} assigns a single vote to each node by default.
416
417 Cluster Network
418 ---------------
419
The cluster network is the core of a cluster. All messages sent over it have to
be delivered reliably to all nodes in their respective order. In {pve} this
part is done by corosync, an implementation of a high performance, low overhead,
high availability development toolkit. It serves our decentralized
configuration file system (`pmxcfs`).
425
426 [[cluster-network-requirements]]
427 Network Requirements
428 ~~~~~~~~~~~~~~~~~~~~
This needs a reliable network with latencies under 2 milliseconds (LAN
performance) to work properly. While corosync can also use unicast for
communication between nodes, it's **highly recommended** to have a multicast
capable network. The network should not be used heavily by other members;
ideally corosync runs on its own network.
*Never* share it with a network where storage communicates too.
435
436 Before setting up a cluster it is good practice to check if the network is fit
437 for that purpose.
438
439 * Ensure that all nodes are in the same subnet. This must only be true for the
440 network interfaces used for cluster communication (corosync).
441
442 * Ensure all nodes can reach each other over those interfaces, using `ping` is
443 enough for a basic test.
444
* Ensure that multicast works in general and at high packet rates. This can be
  done with the `omping` tool. The final "%loss" number should be < 1%.
447 +
448 [source,bash]
449 ----
450 omping -c 10000 -i 0.001 -F -q NODE1-IP NODE2-IP ...
451 ----
452
453 * Ensure that multicast communication works over an extended period of time.
454 This uncovers problems where IGMP snooping is activated on the network but
455 no multicast querier is active. This test has a duration of around 10
456 minutes.
457 +
458 [source,bash]
459 ----
460 omping -c 600 -i 1 -q NODE1-IP NODE2-IP ...
461 ----
462
Your network is not ready for clustering if any of these tests fails. Recheck
your network configuration. Switches in particular are notorious for having
multicast disabled by default or IGMP snooping enabled with no IGMP querier
active.
467
In smaller clusters, it's also an option to use unicast if you really cannot get
multicast to work.
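
A minimal sketch of what this means for `corosync.conf`, assuming corosync 2.x:
set the totem transport to `udpu` (UDP unicast), bump the 'config_version' and
apply the change as described in the
<<edit-corosync-conf,edit the corosync.conf file>> section:

----
totem {
  ...
  transport: udpu
}
----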
470
471 Separate Cluster Network
472 ~~~~~~~~~~~~~~~~~~~~~~~~
473
When creating a cluster without any parameters, the cluster network is generally
shared with the Web UI and the VMs and their traffic. Depending on your setup,
even storage traffic may get sent over the same network. It's recommended to
change that, as corosync is a time-critical, real-time application.
478
479 Setting Up A New Network
480 ^^^^^^^^^^^^^^^^^^^^^^^^
481
First you have to set up a new network interface. It should be on a physically
separate network. Ensure that your network fulfills the
<<cluster-network-requirements,cluster network requirements>>.
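
A sketch of such an interface definition in `/etc/network/interfaces`, assuming
a spare NIC named `eno4` (an assumption) and the 10.10.10.1/25 address used in
the following example, could look like:

----
# dedicated cluster network
auto eno4
iface eno4 inet static
        address  10.10.10.1
        netmask  255.255.255.128
----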
485
486 Separate On Cluster Creation
487 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
488
This is possible via the 'ring0_addr' and 'bindnet0_addr' parameters of
the 'pvecm create' command, used for creating a new cluster.
491
If you have set up an additional NIC with a static address on 10.10.10.1/25,
and want to send and receive all cluster communication over this interface,
you would execute:
495
496 [source,bash]
497 ----
498 pvecm create test --ring0_addr 10.10.10.1 --bindnet0_addr 10.10.10.0
499 ----
500
501 To check if everything is working properly execute:
502 [source,bash]
503 ----
504 systemctl status corosync
505 ----
506
Afterwards, proceed as described in the section to
<<adding-nodes-with-separated-cluster-network,add nodes with a separated cluster network>>.
509
510 [[separate-cluster-net-after-creation]]
511 Separate After Cluster Creation
512 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
513
You can also do this if you have already created a cluster and want to switch
its communication to another network, without rebuilding the whole cluster.
This change may lead to short periods of quorum loss in the cluster, as nodes
have to restart corosync and come up one after the other on the new network.
518
Check how to <<edit-corosync-conf,edit the corosync.conf file>> first.
Then open it and you should see a file similar to:
521
522 ----
523 logging {
524 debug: off
525 to_syslog: yes
526 }
527
528 nodelist {
529
530 node {
531 name: due
532 nodeid: 2
533 quorum_votes: 1
534 ring0_addr: due
535 }
536
537 node {
538 name: tre
539 nodeid: 3
540 quorum_votes: 1
541 ring0_addr: tre
542 }
543
544 node {
545 name: uno
546 nodeid: 1
547 quorum_votes: 1
548 ring0_addr: uno
549 }
550
551 }
552
553 quorum {
554 provider: corosync_votequorum
555 }
556
557 totem {
558 cluster_name: thomas-testcluster
559 config_version: 3
560 ip_version: ipv4
561 secauth: on
562 version: 2
563 interface {
564 bindnetaddr: 192.168.30.50
565 ringnumber: 0
566 }
567
568 }
569 ----
570
The first thing you want to do is add the 'name' properties to the node entries,
if you do not see them already. Those *must* match the node name.
573
Then replace the addresses of the 'ring0_addr' properties with the new
addresses. You may use plain IP addresses or hostnames here. If you use
hostnames, ensure that they are resolvable from all nodes (see also
<<corosync-addresses,Ring Address Types>>).
578
In this example, we want to switch the cluster communication to the 10.10.10.1/25
network, so we replace all 'ring0_addr' properties accordingly. We also set the
'bindnetaddr' in the totem section of the config to an address of the new network.
It can be any address from the subnet configured on the new network interface.
583
After you have increased the 'config_version' property, the new configuration file
should look like:
586
587 ----
588
589 logging {
590 debug: off
591 to_syslog: yes
592 }
593
594 nodelist {
595
596 node {
597 name: due
598 nodeid: 2
599 quorum_votes: 1
600 ring0_addr: 10.10.10.2
601 }
602
603 node {
604 name: tre
605 nodeid: 3
606 quorum_votes: 1
607 ring0_addr: 10.10.10.3
608 }
609
610 node {
611 name: uno
612 nodeid: 1
613 quorum_votes: 1
614 ring0_addr: 10.10.10.1
615 }
616
617 }
618
619 quorum {
620 provider: corosync_votequorum
621 }
622
623 totem {
624 cluster_name: thomas-testcluster
625 config_version: 4
626 ip_version: ipv4
627 secauth: on
628 version: 2
629 interface {
630 bindnetaddr: 10.10.10.1
631 ringnumber: 0
632 }
633
634 }
635 ----
636
Now, after a final check that all the changed information is correct, we save it
and refer again to the <<edit-corosync-conf,edit corosync.conf file>> section to
learn how to bring it into effect.
640
As our change cannot be applied live by corosync, we have to do a restart.
642
643 On a single node execute:
644 [source,bash]
645 ----
646 systemctl restart corosync
647 ----
648
649 Now check if everything is fine:
650
651 [source,bash]
652 ----
653 systemctl status corosync
654 ----
655
If corosync runs correctly again, restart it on all other nodes as well.
They will then join the cluster membership one by one on the new network.
658
659 [[corosync-addresses]]
660 Corosync addresses
661 ~~~~~~~~~~~~~~~~~~
662
663 A corosync link or ring address can be specified in two ways:
664
665 * **IPv4/v6 addresses** will be used directly. They are recommended, since they
666 are static and usually not changed carelessly.
667
668 * **Hostnames** will be resolved using `getaddrinfo`, which means that per
669 default, IPv6 addresses will be used first, if available (see also
670 `man gai.conf`). Keep this in mind, especially when upgrading an existing
671 cluster to IPv6.
672
673 CAUTION: Hostnames should be used with care, since the address they
674 resolve to can be changed without touching corosync or the node it runs on -
675 which may lead to a situation where an address is changed without thinking
676 about implications for corosync.
677
A separate, static hostname specifically for corosync is recommended, if
hostnames are preferred. Also, make sure that every node in the cluster can
resolve all hostnames correctly.
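
If you prefer hostnames, a sketch of such dedicated entries in `/etc/hosts` on
every node could look like the following; the names and addresses are purely
illustrative:

----
# static hostnames used only for corosync
10.10.10.1   corosync1
10.10.10.2   corosync2
10.10.10.3   corosync3
----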
681
682 Since {pve} 5.1, while supported, hostnames will be resolved at the time of
683 entry. Only the resolved IP is then saved to the configuration.
684
Nodes that joined the cluster on earlier versions likely still use their
unresolved hostname in `corosync.conf`. It might be a good idea to replace
them with IPs or a separate hostname, as mentioned above.
688
689 [[pvecm_rrp]]
690 Redundant Ring Protocol
691 ~~~~~~~~~~~~~~~~~~~~~~~
To avoid a single point of failure, you should implement counter measures.
This can be done on the hardware and operating system level through network bonding.
694
Corosync itself also offers the possibility to add redundancy through the
so-called 'Redundant Ring Protocol'. This protocol allows running a second totem
ring on another network. This network should be physically separated from the
other ring's network to actually increase availability.
699
700 RRP On Cluster Creation
701 ~~~~~~~~~~~~~~~~~~~~~~~
702
The 'pvecm create' command provides the additional parameters 'bindnetX_addr',
'ringX_addr' and 'rrp_mode', which can be used for RRP configuration.
705
706 NOTE: See the <<corosync-conf-glossary,glossary>> if you do not know what each parameter means.
707
708 So if you have two networks, one on the 10.10.10.1/24 and the other on the
709 10.10.20.1/24 subnet you would execute:
710
711 [source,bash]
712 ----
713 pvecm create CLUSTERNAME -bindnet0_addr 10.10.10.1 -ring0_addr 10.10.10.1 \
714 -bindnet1_addr 10.10.20.1 -ring1_addr 10.10.20.1
715 ----
716
717 RRP On Existing Clusters
718 ~~~~~~~~~~~~~~~~~~~~~~~~
719
You will take similar steps as described in
<<separate-cluster-net-after-creation,separating the cluster network>> to
enable RRP on an already running cluster. The only difference is that you
will add `ring1` and use it instead of `ring0`.
724
First, add a new `interface` subsection in the `totem` section and set its
`ringnumber` property to `1`. Set the interface's `bindnetaddr` property to an
address of the subnet you have configured for your new ring.
Further, set the `rrp_mode` to `passive`; this is the only stable mode.
729
Then add to each node entry in the `nodelist` section its new `ring1_addr`
property with the node's additional ring address.
732
733 So if you have two networks, one on the 10.10.10.1/24 and the other on the
734 10.10.20.1/24 subnet, the final configuration file should look like:
735
736 ----
737 totem {
738 cluster_name: tweak
739 config_version: 9
740 ip_version: ipv4
741 rrp_mode: passive
742 secauth: on
743 version: 2
744 interface {
745 bindnetaddr: 10.10.10.1
746 ringnumber: 0
747 }
748 interface {
749 bindnetaddr: 10.10.20.1
750 ringnumber: 1
751 }
752 }
753
754 nodelist {
755 node {
756 name: pvecm1
757 nodeid: 1
758 quorum_votes: 1
759 ring0_addr: 10.10.10.1
760 ring1_addr: 10.10.20.1
761 }
762
763 node {
764 name: pvecm2
765 nodeid: 2
766 quorum_votes: 1
767 ring0_addr: 10.10.10.2
768 ring1_addr: 10.10.20.2
769 }
770
771 [...] # other cluster nodes here
772 }
773
774 [...] # other remaining config sections here
775
776 ----
777
Bring it into effect as described in the
<<edit-corosync-conf,edit the corosync.conf file>> section.

This is a change which cannot take effect live and needs at least a restart
of corosync. A restart of the whole cluster is recommended.
783
If you cannot reboot the whole cluster, ensure that no High Availability services
are configured and then stop the corosync service on all nodes. After corosync is
stopped on all nodes, start it again one node after the other.
787
788 Corosync External Vote Support
789 ------------------------------
790
791 This section describes a way to deploy an external voter in a {pve} cluster.
792 When configured, the cluster can sustain more node failures without
793 violating safety properties of the cluster communication.
794
795 For this to work there are two services involved:
796
* a so-called QDevice daemon which runs on each {pve} node
798
799 * an external vote daemon which runs on an independent server.
800
801 As a result you can achieve higher availability even in smaller setups (for
802 example 2+1 nodes).
803
804 QDevice Technical Overview
805 ~~~~~~~~~~~~~~~~~~~~~~~~~~
806
The Corosync Quorum Device (QDevice) is a daemon which runs on each cluster
node. It provides a configured number of votes to the cluster's quorum
subsystem, based on the decision of an externally running third-party arbitrator.
Its primary use is to allow a cluster to sustain more node failures than
standard quorum rules allow. This can be done safely as the external device
can see all nodes and thus choose only one set of nodes to give its vote.
This will only be done if said set of nodes can have quorum (again) when
receiving the third-party vote.
815
816 Currently only 'QDevice Net' is supported as a third-party arbitrator. It is
817 a daemon which provides a vote to a cluster partition if it can reach the
partition members over the network. It will only give votes to one partition
of a cluster at any time.
820 It's designed to support multiple clusters and is almost configuration and
821 state free. New clusters are handled dynamically and no configuration file
822 is needed on the host running a QDevice.
823
The only requirement for the external host is that it needs network access to the
cluster and has a corosync-qnetd package available. We provide such a package
for Debian based hosts; other Linux distributions should also have a package
available through their respective package manager.
828
829 NOTE: In contrast to corosync itself, a QDevice connects to the cluster over
830 TCP/IP and thus does not need a multicast capable network between itself and
the cluster. In fact, the daemon may even run outside of the LAN and can have
higher latencies than 2 ms.
833
834
835 Supported Setups
836 ~~~~~~~~~~~~~~~~
837
We support QDevices for clusters with an even number of nodes and recommend
it for 2 node clusters, if they should provide higher availability.
For clusters with an odd node count, we currently discourage the use of
QDevices. The reason for this is the difference in the number of votes the
QDevice provides for each cluster type. Even numbered clusters get a single
additional vote, with which we can only increase availability, i.e. if the
QDevice itself fails we are in the same situation as with no QDevice at all.
845
Now, with an odd numbered cluster size, the QDevice provides '(N-1)' votes --
where 'N' corresponds to the cluster node count. This difference makes
sense; if we had only one additional vote, the cluster could get into a split
brain situation.
This algorithm allows that all nodes but one (and naturally the
QDevice itself) can fail.
There are two drawbacks with this:
853
854 * If the QNet daemon itself fails, no other node may fail or the cluster
855 immediately loses quorum. For example, in a cluster with 15 nodes 7
856 could fail before the cluster becomes inquorate. But, if a QDevice is
857 configured here and said QDevice fails itself **no single node** of
858 the 15 may fail. The QDevice acts almost as a single point of failure in
859 this case.
860
* The fact that all but one node plus QDevice may fail sounds promising at
  first, but this may result in a mass recovery of HA services, which could
  overload the single remaining node. Also, a Ceph server will stop providing
  services if only '((N-1)/2)' nodes or less remain online.
865
866 If you understand the drawbacks and implications you can decide yourself if
867 you should use this technology in an odd numbered cluster setup.
868
869
870 QDevice-Net Setup
871 ~~~~~~~~~~~~~~~~~
872
We recommend running any daemon which provides votes to corosync-qdevice as an
unprivileged user. {pve} and Debian provide a package which is already
configured to do so.
The traffic between the daemon and the cluster must be encrypted to ensure a
safe and secure QDevice integration in {pve}.
878
879 First install the 'corosync-qnetd' package on your external server and
880 the 'corosync-qdevice' package on all cluster nodes.
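
Assuming Debian based hosts on both sides, the package installation is a plain
`apt` call:

[source,bash]
----
# on the external server
apt install corosync-qnetd

# on all {pve} cluster nodes
apt install corosync-qdevice
----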
881
After that, ensure that all the nodes in the cluster are online.
883
884 You can now easily set up your QDevice by running the following command on one
885 of the {pve} nodes:
886
887 ----
888 pve# pvecm qdevice setup <QDEVICE-IP>
889 ----
890
891 The SSH key from the cluster will be automatically copied to the QDevice. You
892 might need to enter an SSH password during this step.
893
894 After you enter the password and all the steps are successfully completed, you
895 will see "Done". You can check the status now:
896
897 ----
898 pve# pvecm status
899
900 ...
901
902 Votequorum information
903 ~~~~~~~~~~~~~~~~~~~~~
904 Expected votes: 3
905 Highest expected: 3
906 Total votes: 3
907 Quorum: 2
908 Flags: Quorate Qdevice
909
910 Membership information
911 ~~~~~~~~~~~~~~~~~~~~~~
912 Nodeid Votes Qdevice Name
913 0x00000001 1 A,V,NMW 192.168.22.180 (local)
914 0x00000002 1 A,V,NMW 192.168.22.181
915 0x00000000 1 Qdevice
916
917 ----
918
919 which means the QDevice is set up.
920
921
922 Frequently Asked Questions
923 ~~~~~~~~~~~~~~~~~~~~~~~~~~
924
925 Tie Breaking
926 ^^^^^^^^^^^^
927
In case of a tie, where two same-sized cluster partitions cannot see each other
but can see the QDevice, the QDevice randomly chooses one of those partitions and
provides a vote to it.
931
932 Possible Negative Implications
933 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
934
935 For clusters with an even node count there are no negative implications when
936 setting up a QDevice. If it fails to work, you are as good as without QDevice at
937 all.
938
939 Adding/Deleting Nodes After QDevice Setup
940 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
941
942 If you want to add a new node or remove an existing one from a cluster with a
943 QDevice setup, you need to remove the QDevice first. After that, you can add or
944 remove nodes normally. Once you have a cluster with an even node count again,
945 you can set up the QDevice again as described above.
946
947 Removing the QDevice
948 ^^^^^^^^^^^^^^^^^^^^
949
950 If you used the official `pvecm` tool to add the QDevice, you can remove it
951 trivially by running:
952
953 ----
954 pve# pvecm qdevice remove
955 ----
956
957 //Still TODO
958 //^^^^^^^^^^
959 //There ist still stuff to add here
960
961
962 Corosync Configuration
963 ----------------------
964
The `/etc/pve/corosync.conf` file plays a central role in a {pve} cluster. It
controls the cluster membership and its network.
To read more about it, check the corosync.conf man page:
968 [source,bash]
969 ----
970 man corosync.conf
971 ----
972
973 For node membership you should always use the `pvecm` tool provided by {pve}.
974 You may have to edit the configuration file manually for other changes.
975 Here are a few best practice tips for doing this.
976
977 [[edit-corosync-conf]]
978 Edit corosync.conf
979 ~~~~~~~~~~~~~~~~~~
980
Editing the corosync.conf file is not always straightforward. There are
two on each cluster node, one in `/etc/pve/corosync.conf` and the other in
`/etc/corosync/corosync.conf`. Editing the one in our cluster file system will
propagate the changes to the local one, but not vice versa.
985
The configuration will get updated automatically as soon as the file changes.
This means changes which can be integrated in a running corosync will take
effect instantly. So you should always make a copy and edit that instead, to
avoid triggering unwanted changes with an intermediate save.
990
991 [source,bash]
992 ----
993 cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
994 ----
995
Then open the config file with your favorite editor; `nano` and `vim.tiny` are
preinstalled on {pve}, for example.
998
NOTE: Always increment the 'config_version' number on configuration changes;
omitting this can lead to problems.
1001
After making the necessary changes, create another copy of the current working
configuration file. This serves as a backup if the new configuration fails to
apply or causes problems in other ways.
1005
1006 [source,bash]
1007 ----
1008 cp /etc/pve/corosync.conf /etc/pve/corosync.conf.bak
1009 ----
1010
1011 Then move the new configuration file over the old one:
1012 [source,bash]
1013 ----
1014 mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf
1015 ----
1016
Check with the following commands whether the change could be applied automatically:
1018 [source,bash]
1019 ----
1020 systemctl status corosync
1021 journalctl -b -u corosync
1022 ----
1023
If the change could not be applied automatically, you may have to restart the
corosync service via:
1026 [source,bash]
1027 ----
1028 systemctl restart corosync
1029 ----
1030
1031 On errors check the troubleshooting section below.
1032
1033 Troubleshooting
1034 ~~~~~~~~~~~~~~~
1035
1036 Issue: 'quorum.expected_votes must be configured'
1037 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1038
If corosync fails to start and you get the following message in the system log:
1040
1041 ----
1042 [...]
1043 corosync[1647]: [QUORUM] Quorum provider: corosync_votequorum failed to initialize.
1044 corosync[1647]: [SERV ] Service engine 'corosync_quorum' failed to load for reason
1045 'configuration error: nodelist or quorum.expected_votes must be configured!'
1046 [...]
1047 ----
1048
It means that the hostname you set for the corosync 'ringX_addr' in the
configuration could not be resolved.
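
A quick way to verify this is to check whether the name from the affected
'ringX_addr' entry (for example the node name `due` from the configuration
shown earlier) resolves on the node at all:

[source,bash]
----
getent hosts due
----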
1051
1052
1053 Write Configuration When Not Quorate
1054 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1055
If you need to change '/etc/pve/corosync.conf' on a node with no quorum, and you
know what you are doing, use:
1058 [source,bash]
1059 ----
1060 pvecm expected 1
1061 ----
1062
1063 This sets the expected vote count to 1 and makes the cluster quorate. You can
1064 now fix your configuration, or revert it back to the last working backup.
1065
This is not enough if corosync cannot start anymore. In that case, it is best to
edit the local copy of the corosync configuration in '/etc/corosync/corosync.conf',
so that corosync can start again. Ensure that this configuration has the same
content on all nodes to avoid split brain situations. If you are not sure what
went wrong, it's best to ask the Proxmox Community to help you.
1071
1072
1073 [[corosync-conf-glossary]]
1074 Corosync Configuration Glossary
1075 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1076
1077 ringX_addr::
1078 This names the different ring addresses for the corosync totem rings used for
1079 the cluster communication.
1080
1081 bindnetaddr::
Defines which interface the ring should bind to. It may be any address of
the subnet configured on the interface we want to use. In general, it is
recommended to just use an address a node uses on this interface.
1085
1086 rrp_mode::
Specifies the mode of the redundant ring protocol and may be passive, active or
none. Note that use of active is highly experimental and not officially
supported. Passive is the preferred mode; it may double the cluster
communication throughput and increases availability.
1091
1092
1093 Cluster Cold Start
1094 ------------------
1095
1096 It is obvious that a cluster is not quorate when all nodes are
1097 offline. This is a common case after a power failure.
1098
1099 NOTE: It is always a good idea to use an uninterruptible power supply
1100 (``UPS'', also called ``battery backup'') to avoid this state, especially if
1101 you want HA.
1102
1103 On node startup, the `pve-guests` service is started and waits for
1104 quorum. Once quorate, it starts all guests which have the `onboot`
1105 flag set.
1106
When you turn on nodes, or when power comes back after power failure,
it is likely that some nodes will boot faster than others. Please keep in
mind that guest startup is delayed until you reach quorum.
1110
1111
1112 Guest Migration
1113 ---------------
1114
1115 Migrating virtual guests to other nodes is a useful feature in a
1116 cluster. There are settings to control the behavior of such
1117 migrations. This can be done via the configuration file
1118 `datacenter.cfg` or for a specific migration via API or command line
1119 parameters.
1120
1121 It makes a difference if a Guest is online or offline, or if it has
1122 local resources (like a local disk).
1123
For details about virtual machine migration, see the
xref:qm_migration[QEMU/KVM Migration Chapter].

For details about container migration, see the
xref:pct_migration[Container Migration Chapter].
1129
1130 Migration Type
1131 ~~~~~~~~~~~~~~
1132
1133 The migration type defines if the migration data should be sent over an
1134 encrypted (`secure`) channel or an unencrypted (`insecure`) one.
Setting the migration type to insecure means that the RAM content of a
virtual guest also gets transferred unencrypted, which can lead to
information disclosure of critical data from inside the guest (for
example, passwords or encryption keys).
1139
Therefore, we strongly recommend using the secure channel if you do
not have full control over the network and can not guarantee that no
one is eavesdropping on it.
1143
1144 NOTE: Storage migration does not follow this setting. Currently, it
1145 always sends the storage content over a secure channel.
1146
Encryption requires a lot of computing power, so this setting is often
changed to `insecure` to achieve better performance. The impact on
modern systems is lower because they implement AES encryption in
hardware. The performance impact is particularly evident in fast
networks, where you can transfer 10 Gbps or more.
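
For reference, a minimal sketch of setting the type cluster-wide in
`/etc/pve/datacenter.cfg`; keep `secure` unless the migration network is fully
trusted:

----
# /etc/pve/datacenter.cfg
migration: secure
----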
1152
1153
1154 Migration Network
1155 ~~~~~~~~~~~~~~~~~
1156
1157 By default, {pve} uses the network in which cluster communication
1158 takes place to send the migration traffic. This is not optimal because
1159 sensitive cluster traffic can be disrupted and this network may not
1160 have the best bandwidth available on the node.
1161
1162 Setting the migration network parameter allows the use of a dedicated
1163 network for the entire migration traffic. In addition to the memory,
1164 this also affects the storage traffic for offline migrations.
1165
The migration network is set as a network in CIDR notation. This
has the advantage that you do not have to set individual IP addresses
for each node. {pve} can determine the real address on the
destination node from the network specified in the CIDR form. To
enable this, the network must be specified so that each node has exactly
one IP in the respective network.
1172
1173
1174 Example
1175 ^^^^^^^
1176
1177 We assume that we have a three-node setup with three separate
1178 networks. One for public communication with the Internet, one for
1179 cluster communication and a very fast one, which we want to use as a
1180 dedicated network for migration.
1181
1182 A network configuration for such a setup might look as follows:
1183
1184 ----
1185 iface eno1 inet manual
1186
1187 # public network
1188 auto vmbr0
1189 iface vmbr0 inet static
1190 address 192.X.Y.57
        netmask  255.255.255.0
1192 gateway 192.X.Y.1
1193 bridge_ports eno1
1194 bridge_stp off
1195 bridge_fd 0
1196
1197 # cluster network
1198 auto eno2
1199 iface eno2 inet static
1200 address 10.1.1.1
1201 netmask 255.255.255.0
1202
1203 # fast network
1204 auto eno3
1205 iface eno3 inet static
1206 address 10.1.2.1
1207 netmask 255.255.255.0
1208 ----
1209
1210 Here, we will use the network 10.1.2.0/24 as a migration network. For
1211 a single migration, you can do this using the `migration_network`
1212 parameter of the command line tool:
1213
1214 ----
1215 # qm migrate 106 tre --online --migration_network 10.1.2.0/24
1216 ----
1217
1218 To configure this as the default network for all migrations in the
1219 cluster, set the `migration` property of the `/etc/pve/datacenter.cfg`
1220 file:
1221
1222 ----
1223 # use dedicated migration network
1224 migration: secure,network=10.1.2.0/24
1225 ----
1226
1227 NOTE: The migration type must always be set when the migration network
1228 gets set in `/etc/pve/datacenter.cfg`.
1229
1230
1231 ifdef::manvolnum[]
1232 include::pve-copyright.adoc[]
1233 endif::manvolnum[]