[[chapter_pvecm]]
ifdef::manvolnum[]
pvecm(1)
========
:pve-toplevel:

NAME
----

pvecm - Proxmox VE Cluster Manager

SYNOPSIS
--------

include::pvecm.1-synopsis.adoc[]

DESCRIPTION
-----------
endif::manvolnum[]

ifndef::manvolnum[]
Cluster Manager
===============
:pve-toplevel:
endif::manvolnum[]

The {PVE} cluster manager `pvecm` is a tool to create a group of
physical servers. Such a group is called a *cluster*. We use the
http://www.corosync.org[Corosync Cluster Engine] for reliable group
communication, and such clusters can consist of up to 32 physical nodes
(probably more, depending on network latency).

`pvecm` can be used to create a new cluster, join nodes to a cluster,
leave the cluster, get status information and do various other cluster-related
tasks. The **P**rox**m**o**x** **C**luster **F**ile **S**ystem (``pmxcfs'')
is used to transparently distribute the cluster configuration to all cluster
nodes.

Grouping nodes into a cluster has the following advantages:

* Centralized, web-based management

* Multi-master clusters: each node can do all management tasks

* `pmxcfs`: database-driven file system for storing configuration files,
  replicated in real-time on all nodes using `corosync`.

* Easy migration of virtual machines and containers between physical
  hosts

* Fast deployment

* Cluster-wide services like firewall and HA


Requirements
------------

* All nodes must be able to connect to each other via UDP ports 5404 and 5405
  for corosync to work.

* Date and time have to be synchronized (a quick way to verify this is shown
  below).

* An SSH tunnel on TCP port 22 between nodes is used.

* If you are interested in High Availability, you need to have at
  least three nodes for reliable quorum. All nodes should have the
  same version.

* We recommend a dedicated NIC for the cluster traffic, especially if
  you use shared storage.

* The root password of a cluster node is required for adding nodes.

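Before creating the cluster, it can help to verify these requirements on each
node. A minimal sketch of such a check (node addresses are only examples):

[source,bash]
----
# verify that the system clock is NTP-synchronized
timedatectl status

# verify SSH reachability and low latency to another node
ssh root@192.168.15.92 hostname
ping -c 3 192.168.15.92
----
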
NOTE: It is not possible to mix {pve} 3.x and earlier with {pve} 4.X cluster
nodes.

NOTE: While it's possible to mix {pve} 4.4 and {pve} 5.0 nodes, this is not
supported as a production configuration and should only be done temporarily,
while upgrading the whole cluster from one major version to another.

NOTE: Running a cluster of {pve} 6.x with earlier versions is not possible. The
cluster protocol (corosync) between {pve} 6.x and earlier versions changed
fundamentally. The corosync 3 packages for {pve} 5.4 are only intended for the
upgrade procedure to {pve} 6.0.


Preparing Nodes
---------------

First, install {PVE} on all nodes. Make sure that each node is
installed with the final hostname and IP configuration. Changing the
hostname and IP is not possible after cluster creation.

Currently, the cluster creation can either be done on the console (login via
`ssh`) or through the API, for which we have a GUI implementation
(__Datacenter -> Cluster__).

While it's common to reference all node names and their IPs in `/etc/hosts` (or
make their names resolvable through other means), this is not necessary for a
cluster to work. It may be useful however, as you can then connect from one node
to the other with SSH via the easier-to-remember node name (see also
xref:pvecm_corosync_addresses[Link Address Types]). Note that we always
recommend referencing nodes by their IP addresses in the cluster configuration.

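For example, matching `/etc/hosts` entries on every node might look like this
(host names and addresses are purely illustrative):

----
192.168.15.91 hp1.example.com hp1
192.168.15.92 hp2.example.com hp2
192.168.15.93 hp3.example.com hp3
----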

[[pvecm_create_cluster]]
Create the Cluster
------------------

Log in via `ssh` to the first {pve} node. Use a unique name for your cluster.
This name cannot be changed later. The cluster name follows the same rules as
node names.

----
 hp1# pvecm create CLUSTERNAME
----

NOTE: It is possible to create multiple clusters in the same physical or logical
network. Use unique cluster names if you do so. To avoid human confusion, it is
also recommended to choose different names even if clusters do not share the
cluster network.

To check the state of your cluster use:

----
 hp1# pvecm status
----


[[pvecm_join_node_to_cluster]]
Adding Nodes to the Cluster
---------------------------

Log in via `ssh` to the node you want to add.

----
 hp2# pvecm add IP-ADDRESS-CLUSTER
----

For `IP-ADDRESS-CLUSTER` use the IP or hostname of an existing cluster node.
An IP address is recommended (see xref:pvecm_corosync_addresses[Link Address Types]).

CAUTION: A new node cannot hold any VMs, because you would get
conflicts about identical VM IDs. Also, all existing configuration in
`/etc/pve` is overwritten when you join a new node to the cluster. As a
workaround, use `vzdump` to back up the VMs and restore them under different
VMIDs after adding the node to the cluster.

To check the state of the cluster use:

----
 # pvecm status
----

.Cluster status after adding 4 nodes
----
hp2# pvecm status
Quorum information
~~~~~~~~~~~~~~~~~~
Date:             Mon Apr 20 12:30:13 2015
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000001
Ring ID:          1/8
Quorate:          Yes

Votequorum information
~~~~~~~~~~~~~~~~~~~~~~
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3
Flags:            Quorate

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes Name
0x00000001          1 192.168.15.91
0x00000002          1 192.168.15.92 (local)
0x00000003          1 192.168.15.93
0x00000004          1 192.168.15.94
----

If you only want the list of all nodes use:

----
 # pvecm nodes
----

.List nodes in a cluster
----
hp2# pvecm nodes

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes Name
         1          1 hp1
         2          1 hp2 (local)
         3          1 hp3
         4          1 hp4
----

[[pvecm_adding_nodes_with_separated_cluster_network]]
Adding Nodes With Separated Cluster Network
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When adding a node to a cluster with a separated cluster network, you need to
use the 'link0' parameter to set the node's address on that network:

[source,bash]
----
pvecm add IP-ADDRESS-CLUSTER -link0 LOCAL-IP-ADDRESS-LINK0
----

If you want to use the built-in xref:pvecm_redundancy[redundancy] of the
kronosnet transport layer, also use the 'link1' parameter.

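For example, if the existing cluster is reachable at 192.168.1.10 and the new
node's addresses on the separated cluster networks are 10.10.10.2 and
10.20.20.2 (all addresses here are only placeholders), the command might look
like this:

[source,bash]
----
pvecm add 192.168.1.10 -link0 10.10.10.2 -link1 10.20.20.2
----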

Remove a Cluster Node
---------------------

CAUTION: Read the procedure carefully before proceeding, as it may not
be what you want or need.

Move all virtual machines from the node. Make sure you have no local
data or backups you want to keep, or save them accordingly.
In the following example we will remove the node hp4 from the cluster.

Log in to a *different* cluster node (not hp4), and issue a `pvecm nodes`
command to identify the node ID to remove:

----
hp1# pvecm nodes

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes Name
         1          1 hp1 (local)
         2          1 hp2
         3          1 hp3
         4          1 hp4
----


At this point you must power off hp4 and make sure that it will not power on
again (in the network) as it is.

IMPORTANT: As mentioned above, it is critical to power off the node
*before* removal, and make sure that it will *never* power on again
(in the existing cluster network) as it is.
If you power on the node as it is, your cluster will end up in a broken state
and it could be difficult to restore a clean cluster state.

After powering off the node hp4, we can safely remove it from the cluster.

----
 hp1# pvecm delnode hp4
----

If the operation succeeds, no output is returned; just check the node
list again with `pvecm nodes` or `pvecm status`. You should see
something like:

----
hp1# pvecm status

Quorum information
~~~~~~~~~~~~~~~~~~
Date:             Mon Apr 20 12:44:28 2015
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1/8
Quorate:          Yes

Votequorum information
~~~~~~~~~~~~~~~~~~~~~~
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes Name
0x00000001          1 192.168.15.90 (local)
0x00000002          1 192.168.15.91
0x00000003          1 192.168.15.92
----

If, for whatever reason, you want this server to join the same cluster again,
you have to

* reinstall {pve} on it from scratch

* then join it, as explained in the previous section.

NOTE: After removal of the node, its SSH fingerprint will still reside in the
'known_hosts' of the other nodes. If you receive an SSH error after rejoining
a node with the same IP or hostname, run `pvecm updatecerts` once on the
re-added node to update its fingerprint cluster-wide.

[[pvecm_separate_node_without_reinstall]]
Separate A Node Without Reinstalling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

CAUTION: This is *not* the recommended method, proceed with caution. Use the
above-mentioned method if you're unsure.

You can also separate a node from a cluster without reinstalling it from
scratch. But after removing the node from the cluster, it will still have
access to the shared storages! This must be resolved before you start removing
the node from the cluster. A {pve} cluster cannot share the exact same
storage with another cluster, as storage locking doesn't work across cluster
boundaries. Furthermore, it may also lead to VMID conflicts.

It's suggested that you create a new storage, to which only the node which you
want to separate has access. This can be a new export on your NFS or a new Ceph
pool, to name a few examples. It's just important that the exact same storage
does not get accessed by multiple clusters. After setting up this storage, move
all data from the node and its VMs to it. Then you are ready to separate the
node from the cluster.

WARNING: Ensure all shared resources are cleanly separated! Otherwise you will
run into conflicts and problems.

First, stop the corosync and pve-cluster services on the node:
[source,bash]
----
systemctl stop pve-cluster
systemctl stop corosync
----

Start the cluster filesystem again in local mode:
[source,bash]
----
pmxcfs -l
----

Delete the corosync configuration files:
[source,bash]
----
rm /etc/pve/corosync.conf
rm /etc/corosync/*
----

You can now start the filesystem again as a normal service:
[source,bash]
----
killall pmxcfs
systemctl start pve-cluster
----

The node is now separated from the cluster. You can delete it from a remaining
node of the cluster with:
[source,bash]
----
pvecm delnode oldnode
----

If the command fails because the remaining node in the cluster lost quorum
when the now separated node exited, you may set the expected votes to 1 as a
workaround:
[source,bash]
----
pvecm expected 1
----

And then repeat the 'pvecm delnode' command.

Now switch back to the separated node and delete all remaining files left
over from the old cluster. This ensures that the node can be added to another
cluster again without problems.

[source,bash]
----
rm /var/lib/corosync/*
----

As the configuration files from the other nodes are still in the cluster
filesystem, you may want to clean those up too. Simply remove the whole
directory '/etc/pve/nodes/NODENAME' recursively, but check three times that
you are using the correct one before deleting it, as shown below.

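For example, if the separated node is hp4 and the old cluster consisted of
hp1-hp4, the leftover directories of the other nodes could be removed on hp4
like this (double-check each name before deleting):

[source,bash]
----
rm -r /etc/pve/nodes/hp1
rm -r /etc/pve/nodes/hp2
rm -r /etc/pve/nodes/hp3
----
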
CAUTION: The node's SSH keys are still in the 'authorized_keys' file. This means
that the nodes can still connect to each other with public key authentication.
This should be fixed by removing the respective keys from the
'/etc/pve/priv/authorized_keys' file.


Quorum
------

{pve} uses a quorum-based technique to provide a consistent state among
all cluster nodes.

[quote, from Wikipedia, Quorum (distributed computing)]
____
A quorum is the minimum number of votes that a distributed transaction
has to obtain in order to be allowed to perform an operation in a
distributed system.
____

In case of network partitioning, state changes require that a
majority of nodes are online. The cluster switches to read-only mode
if it loses quorum.

NOTE: {pve} assigns a single vote to each node by default.

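As a concrete example (assuming the default of one vote per node): a cluster of
4 nodes needs at least 3 votes (⌊4/2⌋ + 1) to stay quorate, which matches the
`Quorum: 3` line in the `pvecm status` output shown above; a 3-node cluster
needs 2 votes and therefore tolerates the loss of a single node.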

Cluster Network
---------------

The cluster network is the core of a cluster. All messages sent over it have to
be delivered reliably to all nodes in their respective order. In {pve} this
part is done by corosync, an implementation of a high-performance, low-overhead,
high-availability development toolkit. It serves our decentralized
configuration file system (`pmxcfs`).

[[pvecm_cluster_network_requirements]]
Network Requirements
~~~~~~~~~~~~~~~~~~~~
This needs a reliable network with latencies under 2 milliseconds (LAN
performance) to work properly. The network should not be used heavily by other
members; ideally corosync runs on its own network. Do not use a shared network
for corosync and storage (except as a potential low-priority fallback in a
xref:pvecm_redundancy[redundant] configuration).

Before setting up a cluster, it is good practice to check if the network is fit
for that purpose. To make sure the nodes can connect to each other on the
cluster network, you can test the connectivity between them with the `ping`
tool.

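For example, to get an idea of the round-trip latency from one node to another
(the address is a placeholder), you could run:

[source,bash]
----
ping -c 10 10.10.10.2
----

Keep an eye on the reported `rtt` values; they should stay well below the
2 millisecond requirement mentioned above.
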
If the {pve} firewall is enabled, ACCEPT rules for corosync will automatically
be generated - no manual action is required.

NOTE: Corosync used Multicast before version 3.0 (introduced in {pve} 6.0).
Modern versions rely on https://kronosnet.org/[Kronosnet] for cluster
communication, which, for now, only supports regular UDP unicast.

CAUTION: You can still enable Multicast or legacy unicast by setting your
transport to `udp` or `udpu` in your xref:pvecm_edit_corosync_conf[corosync.conf],
but keep in mind that this will disable all cryptography and redundancy support.
This is therefore not recommended.

Separate Cluster Network
~~~~~~~~~~~~~~~~~~~~~~~~

When creating a cluster without any parameters, the corosync cluster network is
generally shared with the Web UI and the VMs and their traffic. Depending on
your setup, even storage traffic may get sent over the same network. It's
recommended to change that, as corosync is a time-critical, real-time
application.

Setting Up A New Network
^^^^^^^^^^^^^^^^^^^^^^^^

First, you have to set up a new network interface. It should be on a physically
separate network. Ensure that your network fulfills the
xref:pvecm_cluster_network_requirements[cluster network requirements].

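As a sketch, the new interface definition in `/etc/network/interfaces` could
look like this (the interface name `eno2` is only an example; the address
matches the 10.10.10.1/25 network used below):

----
auto eno2
iface eno2 inet static
        address 10.10.10.1
        netmask 255.255.255.128
----
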
Separate On Cluster Creation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is possible via the 'linkX' parameters of the 'pvecm create'
command used for creating a new cluster.

If you have set up an additional NIC with a static address on 10.10.10.1/25,
and want to send and receive all cluster communication over this interface,
you would execute:

[source,bash]
----
pvecm create test --link0 10.10.10.1
----

To check if everything is working properly, execute:
[source,bash]
----
systemctl status corosync
----

Afterwards, proceed as described above to
xref:pvecm_adding_nodes_with_separated_cluster_network[add nodes with a separated cluster network].

[[pvecm_separate_cluster_net_after_creation]]
Separate After Cluster Creation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can do this if you have already created a cluster and want to switch
its communication to another network, without rebuilding the whole cluster.
This change may lead to short periods of quorum loss in the cluster, as nodes
have to restart corosync and come up one after the other on the new network.

Check how to xref:pvecm_edit_corosync_conf[edit the corosync.conf file] first.
Then, open it and you should see a file similar to:

----
logging {
  debug: off
  to_syslog: yes
}

nodelist {

  node {
    name: due
    nodeid: 2
    quorum_votes: 1
    ring0_addr: due
  }

  node {
    name: tre
    nodeid: 3
    quorum_votes: 1
    ring0_addr: tre
  }

  node {
    name: uno
    nodeid: 1
    quorum_votes: 1
    ring0_addr: uno
  }

}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: testcluster
  config_version: 3
  ip_version: ipv4-6
  secauth: on
  version: 2
  interface {
    linknumber: 0
  }

}
----

NOTE: `ringX_addr` actually specifies a corosync *link address*; the name "ring"
is a remnant of older corosync versions that is kept for backwards
compatibility.

The first thing you want to do is add the 'name' properties in the node entries
if you do not see them already. Those *must* match the node name.

Then replace all addresses from the 'ring0_addr' properties of all nodes with
the new addresses. You may use plain IP addresses or hostnames here. If you use
hostnames, ensure that they are resolvable from all nodes (see also
xref:pvecm_corosync_addresses[Link Address Types]).

In this example, we want to switch the cluster communication to the
10.10.10.1/25 network, so we replace all 'ring0_addr' properties accordingly.

NOTE: The exact same procedure can be used to change other 'ringX_addr' values
as well, although we recommend not changing multiple addresses at once, to make
it easier to recover if something goes wrong.

After we increase the 'config_version' property, the new configuration file
should look like:

----
logging {
  debug: off
  to_syslog: yes
}

nodelist {

  node {
    name: due
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2
  }

  node {
    name: tre
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.10.3
  }

  node {
    name: uno
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
  }

}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: testcluster
  config_version: 4
  ip_version: ipv4-6
  secauth: on
  version: 2
  interface {
    linknumber: 0
  }

}
----

Then, after a final check that all the changed information is correct, we save
it and once again follow the xref:pvecm_edit_corosync_conf[edit corosync.conf file]
section to bring it into effect.

The changes will be applied live, so restarting corosync is not strictly
necessary. If you changed other settings as well, or notice corosync
complaining, you can optionally trigger a restart.

On a single node execute:

[source,bash]
----
systemctl restart corosync
----

Now check if everything is fine:

[source,bash]
----
systemctl status corosync
----

If corosync runs correctly again, restart it on all other nodes too.
They will then join the cluster membership one by one on the new network.

[[pvecm_corosync_addresses]]
Corosync addresses
~~~~~~~~~~~~~~~~~~

A corosync link address (for backwards compatibility denoted by 'ringX_addr' in
`corosync.conf`) can be specified in two ways:

* **IPv4/v6 addresses** will be used directly. They are recommended, since they
are static and usually not changed carelessly.

* **Hostnames** will be resolved using `getaddrinfo`, which means that by
default, IPv6 addresses will be used first, if available (see also
`man gai.conf`). Keep this in mind, especially when upgrading an existing
cluster to IPv6.

CAUTION: Hostnames should be used with care, since the address they
resolve to can be changed without touching corosync or the node it runs on -
which may lead to a situation where an address is changed without thinking
about implications for corosync.

A separate, static hostname specifically for corosync is recommended, if
hostnames are preferred. Also, make sure that every node in the cluster can
resolve all hostnames correctly.

Since {pve} 5.1, while supported, hostnames will be resolved at the time of
entry. Only the resolved IP is then saved to the configuration.

Nodes that joined the cluster on earlier versions likely still use their
unresolved hostname in `corosync.conf`. It might be a good idea to replace
them with IPs or a separate hostname, as mentioned above.

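To check what a hostname resolves to on a node, and in which order IPv6/IPv4
addresses are returned, you can use `getent`, which also relies on
`getaddrinfo` -- for example (the hostname is just a placeholder):

[source,bash]
----
getent ahosts corosync1.example.com
----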

[[pvecm_redundancy]]
Corosync Redundancy
-------------------

Corosync supports redundant networking via its integrated kronosnet layer by
default (it is not supported on the legacy udp/udpu transports). It can be
enabled by specifying more than one link address, either via the '--linkX'
parameters of `pvecm` (while creating a cluster or adding a new node) or by
specifying more than one 'ringX_addr' in `corosync.conf`.

NOTE: To provide useful failover, every link should be on its own
physical network connection.

Links are used according to a priority setting. You can configure this priority
by setting 'knet_link_priority' in the corresponding interface section in
`corosync.conf`, or, preferably, using the 'priority' parameter when creating
your cluster with `pvecm`:

----
 # pvecm create CLUSTERNAME --link0 10.10.10.1,priority=20 --link1 10.20.20.1,priority=15
----

This would cause 'link0' to be used first, since it has the higher priority.

If no priorities are configured manually (or two links have the same priority),
links will be used in order of their number, with the lower number having higher
priority.

Even if all links are working, only the one with the highest priority will see
corosync traffic. Link priorities cannot be mixed, i.e. links with different
priorities will not be able to communicate with each other.

Since lower priority links will not see traffic unless all higher priorities
have failed, it becomes a useful strategy to specify even networks used for
other tasks (VMs, storage, etc.) as low-priority links. If worst comes to
worst, a higher-latency or more congested connection might be better than no
connection at all.

Adding Redundant Links To An Existing Cluster
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To add a new link to a running configuration, first check how to
xref:pvecm_edit_corosync_conf[edit the corosync.conf file].

Then, add a new 'ringX_addr' to every node in the `nodelist` section. Make
sure that your 'X' is the same for every node you add it to, and that it is
unique for each node.

Lastly, add a new 'interface', as shown below, to your `totem`
section, replacing 'X' with your link number chosen above.

Assuming you added a link with number 1, the new configuration file could look
like this:

----
logging {
  debug: off
  to_syslog: yes
}

nodelist {

  node {
    name: due
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2
    ring1_addr: 10.20.20.2
  }

  node {
    name: tre
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.10.3
    ring1_addr: 10.20.20.3
  }

  node {
    name: uno
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
    ring1_addr: 10.20.20.1
  }

}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: testcluster
  config_version: 4
  ip_version: ipv4-6
  secauth: on
  version: 2
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}
----

The new link will be enabled as soon as you follow the last steps to
xref:pvecm_edit_corosync_conf[edit the corosync.conf file]. A restart should not
be necessary. You can check that corosync loaded the new link using:

----
journalctl -b -u corosync
----

It might be a good idea to test the new link by temporarily disconnecting the
old link on one node and making sure that its status remains online while
disconnected:

----
pvecm status
----

If you see a healthy cluster state, it means that your new link is being used.


Corosync External Vote Support
------------------------------

This section describes a way to deploy an external voter in a {pve} cluster.
When configured, the cluster can sustain more node failures without
violating safety properties of the cluster communication.

For this to work there are two services involved:

* a so-called QDevice daemon which runs on each {pve} node

* an external vote daemon which runs on an independent server.

As a result you can achieve higher availability even in smaller setups (for
example 2+1 nodes).

QDevice Technical Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~

The Corosync Quorum Device (QDevice) is a daemon which runs on each cluster
node. It provides a configured number of votes to the cluster's quorum
subsystem, based on the decision of an externally running third-party
arbitrator. Its primary use is to allow a cluster to sustain more node failures
than standard quorum rules allow. This can be done safely as the external
device can see all nodes and thus choose only one set of nodes to give its
vote. This will only be done if said set of nodes can have quorum (again) when
receiving the third-party vote.

Currently only 'QDevice Net' is supported as a third-party arbitrator. It is
a daemon which provides a vote to a cluster partition if it can reach the
partition members over the network. It will only give votes to one partition
of a cluster at any time.
It's designed to support multiple clusters and is almost configuration and
state free. New clusters are handled dynamically and no configuration file
is needed on the host running a QDevice.

The only requirements for the external host are that it needs network access to
the cluster and has a corosync-qnetd package available. We provide such a
package for Debian-based hosts; other Linux distributions should also have a
package available through their respective package manager.

NOTE: In contrast to corosync itself, a QDevice connects to the cluster over
TCP/IP. The daemon may even run outside of the cluster's LAN and can have
longer latencies than 2 ms.
c21d2cbe
OB
845
846Supported Setups
847~~~~~~~~~~~~~~~~
848
849We support QDevices for clusters with an even number of nodes and recommend
850it for 2 node clusters, if they should provide higher availability.
851For clusters with an odd node count we discourage the use of QDevices
852currently. The reason for this, is the difference of the votes the QDevice
853provides for each cluster type. Even numbered clusters get single additional
854vote, with this we can only increase availability, i.e. if the QDevice
855itself fails we are in the same situation as with no QDevice at all.
856
857Now, with an odd numbered cluster size the QDevice provides '(N-1)' votes --
858where 'N' corresponds to the cluster node count. This difference makes
859sense, if we had only one additional vote the cluster can get into a split
860brain situation.
861This algorithm would allow that all nodes but one (and naturally the
862QDevice itself) could fail.
863There are two drawbacks with this:
864
865* If the QNet daemon itself fails, no other node may fail or the cluster
866 immediately loses quorum. For example, in a cluster with 15 nodes 7
867 could fail before the cluster becomes inquorate. But, if a QDevice is
868 configured here and said QDevice fails itself **no single node** of
869 the 15 may fail. The QDevice acts almost as a single point of failure in
870 this case.
871
872* The fact that all but one node plus QDevice may fail sound promising at
873 first, but this may result in a mass recovery of HA services that would
874 overload the single node left. Also ceph server will stop to provide
875 services after only '((N-1)/2)' nodes are online.
876
877If you understand the drawbacks and implications you can decide yourself if
878you should use this technology in an odd numbered cluster setup.
879
QDevice-Net Setup
~~~~~~~~~~~~~~~~~

We recommend running any daemon which provides votes to corosync-qdevice as an
unprivileged user. {pve} and Debian provide a package which is already
configured to do so.
The traffic between the daemon and the cluster must be encrypted to ensure
safe and secure integration of the QDevice in {pve}.

First, install the 'corosync-qnetd' package on your external server and
the 'corosync-qdevice' package on all cluster nodes.

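On Debian-based systems, the installation could look like the following sketch
(package names as mentioned above):

[source,bash]
----
# on the external server
apt install corosync-qnetd

# on all {pve} cluster nodes
apt install corosync-qdevice
----
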
After that, ensure that all the nodes in the cluster are online.

You can now easily set up your QDevice by running the following command on one
of the {pve} nodes:

----
pve# pvecm qdevice setup <QDEVICE-IP>
----

The SSH key from the cluster will be automatically copied to the QDevice. You
might need to enter an SSH password during this step.

After you enter the password and all the steps have successfully completed, you
will see "Done". You can check the status now:

----
pve# pvecm status

...

Votequorum information
~~~~~~~~~~~~~~~~~~~~~~
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate Qdevice

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 192.168.22.180 (local)
0x00000002          1    A,V,NMW 192.168.22.181
0x00000000          1            Qdevice

----

which means the QDevice is set up.

Frequently Asked Questions
~~~~~~~~~~~~~~~~~~~~~~~~~~

Tie Breaking
^^^^^^^^^^^^

In case of a tie, where two same-sized cluster partitions cannot see each other
but can see the QDevice, the QDevice chooses one of those partitions randomly
and provides a vote to it.

Possible Negative Implications
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For clusters with an even node count, there are no negative implications when
setting up a QDevice. If it fails to work, you are as good as without a QDevice
at all.

Adding/Deleting Nodes After QDevice Setup
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you want to add a new node or remove an existing one from a cluster with a
QDevice setup, you need to remove the QDevice first. After that, you can add or
remove nodes normally. Once you have a cluster with an even node count again,
you can set up the QDevice again as described above.

Removing the QDevice
^^^^^^^^^^^^^^^^^^^^

If you used the official `pvecm` tool to add the QDevice, you can remove it
trivially by running:

----
pve# pvecm qdevice remove
----

//Still TODO
//^^^^^^^^^^
//There is still stuff to add here


Corosync Configuration
----------------------

The `/etc/pve/corosync.conf` file plays a central role in a {pve} cluster. It
controls the cluster membership and its network.
For further information about it, check the corosync.conf man page:
[source,bash]
----
man corosync.conf
----

For node membership you should always use the `pvecm` tool provided by {pve}.
You may have to edit the configuration file manually for other changes.
Here are a few best practice tips for doing this.

[[pvecm_edit_corosync_conf]]
Edit corosync.conf
~~~~~~~~~~~~~~~~~~

Editing the corosync.conf file is not always very straightforward. There are
two on each cluster node, one in `/etc/pve/corosync.conf` and the other in
`/etc/corosync/corosync.conf`. Editing the one in our cluster file system will
propagate the changes to the local one, but not vice versa.

The configuration will get updated automatically as soon as the file changes.
This means that changes which can be integrated in a running corosync will take
effect immediately. So you should always make a copy and edit that instead, to
avoid triggering unintended changes when saving the file while editing.

[source,bash]
----
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
----

Then, open the config file with your favorite editor, such as `nano` or
`vim.tiny`, which come preinstalled on every {pve} node.

NOTE: Always increment the 'config_version' number after configuration changes;
omitting this can lead to problems.

After making the necessary changes, create another copy of the current working
configuration file. This serves as a backup if the new configuration fails to
apply or causes other problems.

[source,bash]
----
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.bak
----

Then move the new configuration file over the old one:
[source,bash]
----
mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf
----

You can check if the changes could be applied automatically, using the
following commands:
[source,bash]
----
systemctl status corosync
journalctl -b -u corosync
----

If the changes could not be applied automatically, you may have to restart the
corosync service via:
[source,bash]
----
systemctl restart corosync
----

On errors, check the troubleshooting section below.

Troubleshooting
~~~~~~~~~~~~~~~

Issue: 'quorum.expected_votes must be configured'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When corosync starts to fail and you get the following message in the system log:

----
[...]
corosync[1647]:  [QUORUM] Quorum provider: corosync_votequorum failed to initialize.
corosync[1647]:  [SERV  ] Service engine 'corosync_quorum' failed to load for reason
    'configuration error: nodelist or quorum.expected_votes must be configured!'
[...]
----

It means that the hostname you set for the corosync 'ringX_addr' in the
configuration could not be resolved.

Write Configuration When Not Quorate
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you need to change '/etc/pve/corosync.conf' on a node with no quorum, and you
know what you are doing, use:
[source,bash]
----
pvecm expected 1
----

This sets the expected vote count to 1 and makes the cluster quorate. You can
now fix your configuration, or revert it back to the last working backup.

This is not enough if corosync cannot start anymore. In that case, it is best to
edit the local copy of the corosync configuration in
'/etc/corosync/corosync.conf', so that corosync can start again. Ensure that on
all nodes, this configuration has the same content to avoid split-brain
situations. If you are not sure what went wrong, it's best to ask the Proxmox
Community to help you.


[[pvecm_corosync_conf_glossary]]
Corosync Configuration Glossary
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ringX_addr::
This names the different link addresses for the kronosnet connections between
nodes.


Cluster Cold Start
------------------

It is obvious that a cluster is not quorate when all nodes are
offline. This is a common case after a power failure.

NOTE: It is always a good idea to use an uninterruptible power supply
(``UPS'', also called ``battery backup'') to avoid this state, especially if
you want HA.

On node startup, the `pve-guests` service is started and waits for
quorum. Once quorate, it starts all guests which have the `onboot`
flag set.

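The `onboot` flag can be set per guest, for example (the VMIDs are just
placeholders):

[source,bash]
----
# for a virtual machine
qm set 100 --onboot 1

# for a container
pct set 101 --onboot 1
----
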
When you turn on nodes, or when power comes back after power failure,
it is likely that some nodes boot faster than others. Please keep in
mind that guest startup is delayed until you reach quorum.


Guest Migration
---------------

Migrating virtual guests to other nodes is a useful feature in a
cluster. There are settings to control the behavior of such
migrations. This can be done via the configuration file
`datacenter.cfg` or for a specific migration via API or command line
parameters.

It makes a difference if a guest is online or offline, or if it has
local resources (like a local disk).

For details about virtual machine migration, see the
xref:qm_migration[QEMU/KVM Migration Chapter].

For details about container migration, see the
xref:pct_migration[Container Migration Chapter].

Migration Type
~~~~~~~~~~~~~~

The migration type defines if the migration data should be sent over an
encrypted (`secure`) channel or an unencrypted (`insecure`) one.
Setting the migration type to `insecure` means that the RAM content of a
virtual guest is also transferred unencrypted, which can lead to
information disclosure of critical data from inside the guest (for
example, passwords or encryption keys).

Therefore, we strongly recommend using the secure channel if you do
not have full control over the network and cannot guarantee that no
one is eavesdropping on it.

NOTE: Storage migration does not follow this setting. Currently, it
always sends the storage content over a secure channel.

Encryption requires a lot of computing power, so this setting is often
changed to `insecure` to achieve better performance. The impact on
modern systems is lower because they implement AES encryption in
hardware. The performance impact is particularly evident in fast
networks where you can transfer 10 Gbps or more.

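For a single migration, the type can also be selected directly on the command
line, for example (the VMID and target node name are placeholders):

[source,bash]
----
# qm migrate 106 tre --online --migration_type insecure
----
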
Migration Network
~~~~~~~~~~~~~~~~~

By default, {pve} uses the network in which cluster communication
takes place to send the migration traffic. This is not optimal because
sensitive cluster traffic can be disrupted and this network may not
have the best bandwidth available on the node.

Setting the migration network parameter allows the use of a dedicated
network for all migration traffic. In addition to the memory,
this also affects the storage traffic for offline migrations.

The migration network is set as a network using CIDR notation. This
has the advantage that you do not have to set individual IP addresses
for each node. {pve} can determine the real address on the
destination node from the network specified in the CIDR form. To
enable this, the network must be specified so that each node has exactly
one IP in the respective network.

Example
^^^^^^^

We assume that we have a three-node setup, with three separate
networks. One for public communication with the Internet, one for
cluster communication, and a very fast one, which we want to use as a
dedicated network for migration.

A network configuration for such a setup might look as follows:

----
iface eno1 inet manual

# public network
auto vmbr0
iface vmbr0 inet static
        address 192.X.Y.57
        netmask 255.255.255.0
        gateway 192.X.Y.1
        bridge_ports eno1
        bridge_stp off
        bridge_fd 0

# cluster network
auto eno2
iface eno2 inet static
        address  10.1.1.1
        netmask  255.255.255.0

# fast network
auto eno3
iface eno3 inet static
        address  10.1.2.1
        netmask  255.255.255.0
----

Here, we will use the network 10.1.2.0/24 as a migration network. For
a single migration, you can do this using the `migration_network`
parameter of the command line tool:

----
# qm migrate 106 tre --online --migration_network 10.1.2.0/24
----

To configure this as the default network for all migrations in the
cluster, set the `migration` property of the `/etc/pve/datacenter.cfg`
file:

----
# use dedicated migration network
migration: secure,network=10.1.2.0/24
----

NOTE: The migration type must always be set when the migration network
is set in `/etc/pve/datacenter.cfg`.


ifdef::manvolnum[]
include::pve-copyright.adoc[]
endif::manvolnum[]