[[chapter_pvecm]]
ifdef::manvolnum[]
pvecm(1)
========
:pve-toplevel:

NAME
----

pvecm - Proxmox VE Cluster Manager

SYNOPSIS
--------

include::pvecm.1-synopsis.adoc[]

DESCRIPTION
-----------
endif::manvolnum[]

ifndef::manvolnum[]
Cluster Manager
===============
:pve-toplevel:
endif::manvolnum[]

The {PVE} cluster manager `pvecm` is a tool to create a group of
physical servers. Such a group is called a *cluster*. We use the
http://www.corosync.org[Corosync Cluster Engine] for reliable group
communication, and such clusters can consist of up to 32 physical nodes
(probably more, depending on network latency).

`pvecm` can be used to create a new cluster, join nodes to a cluster,
leave the cluster, get status information and do various other cluster
related tasks. The **P**rox**m**o**x** **C**luster **F**ile **S**ystem (``pmxcfs'')
is used to transparently distribute the cluster configuration to all cluster
nodes.

Grouping nodes into a cluster has the following advantages:

* Centralized, web-based management

* Multi-master clusters: each node can do all management tasks

* `pmxcfs`: database-driven file system for storing configuration files,
  replicated in real-time on all nodes using `corosync`.

* Easy migration of virtual machines and containers between physical
  hosts

* Fast deployment

* Cluster-wide services like firewall and HA


Requirements
------------

* All nodes must be able to connect to each other via UDP ports 5404 and 5405
  for corosync to work.

* Date and time have to be synchronized.

* An SSH tunnel on TCP port 22 between nodes is used.

* If you are interested in High Availability, you need to have at
  least three nodes for reliable quorum. All nodes should have the
  same version.

* We recommend a dedicated NIC for the cluster traffic, especially if
  you use shared storage.

* The root password of a cluster node is required for adding nodes.

NOTE: It is not possible to mix {pve} 3.x and earlier with {pve} 4.X cluster
nodes.

NOTE: While it's possible to mix {pve} 4.4 and {pve} 5.0 nodes, this is not
supported as a production configuration and should only be done temporarily,
while upgrading the whole cluster from one major version to another.

NOTE: Running a cluster of {pve} 6.x with earlier versions is not possible. The
cluster protocol (corosync) between {pve} 6.x and earlier versions changed
fundamentally. The corosync 3 packages for {pve} 5.4 are only intended for the
upgrade procedure to {pve} 6.0.

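A quick way to sanity check some of these requirements from a shell is sketched
below. The addresses are only placeholders for this example; substitute your
own node IPs.

[source,bash]
----
# check that the system clock is NTP-synchronized on this node
timedatectl status | grep -iE 'synchronized|ntp'

# check latency and basic reachability to the other nodes
ping -c 3 192.168.15.92
ping -c 3 192.168.15.93

# check that SSH on TCP port 22 works between the nodes
ssh root@192.168.15.92 true && echo "SSH to 192.168.15.92 OK"
----
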
ceabe189
DM
88Preparing Nodes
89---------------
8a865621
DM
90
91First, install {PVE} on all nodes. Make sure that each node is
92installed with the final hostname and IP configuration. Changing the
93hostname and IP is not possible after cluster creation.
94
30101530
TL
95Currently the cluster creation can either be done on the console (login via
96`ssh`) or the API, which we have a GUI implementation for (__Datacenter ->
97Cluster__).
8a865621 98
a9e7c3aa
SR
99While it's common to reference all nodenames and their IPs in `/etc/hosts` (or
100make their names resolvable through other means), this is not necessary for a
101cluster to work. It may be useful however, as you can then connect from one node
102to the other with SSH via the easier to remember node name (see also
103xref:pvecm_corosync_addresses[Link Address Types]). Note that we always
104recommend to reference nodes by their IP addresses in the cluster configuration.
105
9a7396aa 106
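If you do maintain such entries, the `/etc/hosts` file on each node could
contain lines like the following. The host names and addresses here are purely
illustrative.

----
10.10.10.1 hp1.example.local hp1
10.10.10.2 hp2.example.local hp2
10.10.10.3 hp3.example.local hp3
----
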
[[pvecm_create_cluster]]
Create the Cluster
------------------

Use a unique name for your cluster. This name cannot be changed later. The
cluster name follows the same rules as node names.

Create via Web GUI
~~~~~~~~~~~~~~~~~~

Under __Datacenter -> Cluster__, click on *Create Cluster*. Enter the cluster
name and select a network connection from the dropdown to serve as the main
cluster network (Link 0). It defaults to the IP resolved via the node's
hostname.

To add a second link as fallback, you can select the 'Advanced' checkbox and
choose an additional network interface (Link 1, see also
xref:pvecm_redundancy[Corosync Redundancy]).

Create via Command Line
~~~~~~~~~~~~~~~~~~~~~~~

Log in via `ssh` to the first {pve} node and run the following command:

----
 hp1# pvecm create CLUSTERNAME
----

To check the state of the new cluster, use:

----
 hp1# pvecm status
----

Multiple Clusters In Same Network
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is possible to create multiple clusters in the same physical or logical
network. Each such cluster must have a unique name to avoid possible clashes in
the cluster communication stack. This also helps avoid human confusion by making
clusters clearly distinguishable.

While the bandwidth requirement of a corosync cluster is relatively low, the
latency of packets and the packets per second (PPS) rate are the limiting
factors. Different clusters in the same network can compete with each other for
these resources, so it may still make sense to use separate physical network
infrastructure for bigger clusters.

[[pvecm_join_node_to_cluster]]
Adding Nodes to the Cluster
---------------------------

CAUTION: A node that is about to be added to the cluster cannot hold any guests.
All existing configuration in `/etc/pve` is overwritten when joining a cluster,
since guest IDs could conflict. As a workaround, create a backup of the guest
(`vzdump`) and restore it with a different ID after the node has been added to
the cluster.

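As a rough sketch of that workaround for a VM, assuming the node holds a VM with
ID '100' that would conflict with an existing guest in the cluster, and using
'200' as a hypothetical new ID and `/mnt/backup` as a hypothetical dump
directory:

[source,bash]
----
# on the node to be added, before joining: back up the guest
vzdump 100 --dumpdir /mnt/backup --mode stop

# after the node has joined the cluster: restore it under a free ID
qmrestore /mnt/backup/vzdump-qemu-100-*.vma* 200
----

For containers, `pct restore` would be used instead of `qmrestore`.
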
Add Node via GUI
~~~~~~~~~~~~~~~~

Log in to the web interface on an existing cluster node. Under __Datacenter ->
Cluster__, click the button *Join Information* at the top. Then, click on the
button *Copy Information*. Alternatively, copy the string from the 'Information'
field manually.

Next, log in to the web interface on the node you want to add.
Under __Datacenter -> Cluster__, click on *Join Cluster*. Fill in the
'Information' field with the text you copied earlier.

For security reasons, the cluster password has to be entered manually.

NOTE: To enter all required data manually, you can disable the 'Assisted Join'
checkbox.

After clicking on *Join*, the node will immediately be added to the cluster. You
might need to reload the web page and re-login with the cluster credentials.

Confirm that your node is visible under __Datacenter -> Cluster__.

Add Node via Command Line
~~~~~~~~~~~~~~~~~~~~~~~~~

Log in via `ssh` to the node you want to add.

----
 hp2# pvecm add IP-ADDRESS-CLUSTER
----

For `IP-ADDRESS-CLUSTER`, use the IP or hostname of an existing cluster node.
An IP address is recommended (see xref:pvecm_corosync_addresses[Link Address Types]).


To check the state of the cluster, use:

----
 # pvecm status
----

.Cluster status after adding 4 nodes
----
hp2# pvecm status
Quorum information
~~~~~~~~~~~~~~~~~~
Date:             Mon Apr 20 12:30:13 2015
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000001
Ring ID:          1/8
Quorate:          Yes

Votequorum information
~~~~~~~~~~~~~~~~~~~~~~
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3
Flags:            Quorate

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes Name
0x00000001          1 192.168.15.91
0x00000002          1 192.168.15.92 (local)
0x00000003          1 192.168.15.93
0x00000004          1 192.168.15.94
----

If you only want a list of all nodes, use:

----
 # pvecm nodes
----

.List nodes in a cluster
----
hp2# pvecm nodes

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes Name
         1          1 hp1
         2          1 hp2 (local)
         3          1 hp3
         4          1 hp4
----

[[pvecm_adding_nodes_with_separated_cluster_network]]
Adding Nodes With Separated Cluster Network
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When adding a node to a cluster with a separated cluster network, you need to
use the 'link0' parameter to set the node's address on that network:

[source,bash]
----
pvecm add IP-ADDRESS-CLUSTER -link0 LOCAL-IP-ADDRESS-LINK0
----

If you want to use the built-in xref:pvecm_redundancy[redundancy] of the
kronosnet transport layer, also use the 'link1' parameter.

Using the GUI, you can select the correct interface from the corresponding 'Link 0'
and 'Link 1' fields in the *Cluster Join* dialog.

Remove a Cluster Node
---------------------

CAUTION: Read the procedure carefully before proceeding, as it may not be what
you want or need.

Move all virtual machines from the node. Make sure you have no local
data or backups you want to keep, or save them accordingly.
In the following example we will remove the node hp4 from the cluster.

Log in to a *different* cluster node (not hp4), and issue a `pvecm nodes`
command to identify the node ID to remove:

----
hp1# pvecm nodes

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes Name
         1          1 hp1 (local)
         2          1 hp2
         3          1 hp3
         4          1 hp4
----

At this point you must power off hp4 and make sure that it will not power on
again (in the network) as it is.

IMPORTANT: As said above, it is critical to power off the node
*before* removal, and make sure that it will *never* power on again
(in the existing cluster network) as it is.
If you power on the node as it is, your cluster will be screwed up and
it could be difficult to restore a clean cluster state.

After powering off the node hp4, we can safely remove it from the cluster.

----
 hp1# pvecm delnode hp4
----

If the operation succeeds, no output is returned. Just check the node
list again with `pvecm nodes` or `pvecm status`. You should see
something like:

----
hp1# pvecm status

Quorum information
~~~~~~~~~~~~~~~~~~
Date:             Mon Apr 20 12:44:28 2015
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1/8
Quorate:          Yes

Votequorum information
~~~~~~~~~~~~~~~~~~~~~~
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes Name
0x00000001          1 192.168.15.90 (local)
0x00000002          1 192.168.15.91
0x00000003          1 192.168.15.92
----

If, for whatever reason, you want this server to join the same cluster again,
you have to

* reinstall {pve} on it from scratch

* then join it, as explained in the previous section.

NOTE: After removal of the node, its SSH fingerprint will still reside in the
'known_hosts' of the other nodes. If you receive an SSH error after rejoining
a node with the same IP or hostname, run `pvecm updatecerts` once on the
re-added node to update its fingerprint cluster-wide.

[[pvecm_separate_node_without_reinstall]]
Separate A Node Without Reinstalling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

CAUTION: This is *not* the recommended method, proceed with caution. Use the
above-mentioned method if you're unsure.

You can also separate a node from a cluster without reinstalling it from
scratch. But after removing the node from the cluster, it will still have
access to the shared storages! This must be resolved before you start removing
the node from the cluster. A {pve} cluster cannot share the exact same
storage with another cluster, as storage locking doesn't work across cluster
boundaries. Furthermore, it may also lead to VMID conflicts.

It is suggested that you create a new storage to which only the node which you
want to separate has access. This can be a new export on your NFS or a new Ceph
pool, to name a few examples. It is just important that the exact same storage
does not get accessed by multiple clusters. After setting this storage up, move
all data from the node and its VMs to it. Then you are ready to separate the
node from the cluster.

WARNING: Ensure all shared resources are cleanly separated! Otherwise you will
run into conflicts and problems.

First stop the corosync and the pve-cluster services on the node:
[source,bash]
----
systemctl stop pve-cluster
systemctl stop corosync
----

Start the cluster filesystem again in local mode:
[source,bash]
----
pmxcfs -l
----

Delete the corosync configuration files:
[source,bash]
----
rm /etc/pve/corosync.conf
rm /etc/corosync/*
----

You can now start the filesystem again as a normal service:
[source,bash]
----
killall pmxcfs
systemctl start pve-cluster
----

The node is now separated from the cluster. You can delete it from a remaining
node of the cluster with:
[source,bash]
----
pvecm delnode oldnode
----

If the command fails because the remaining node in the cluster lost quorum when
the now separated node exited, you can set the expected votes to 1 as a
workaround:
[source,bash]
----
pvecm expected 1
----

And then repeat the 'pvecm delnode' command.

Now switch back to the separated node and delete all remaining files left over
from the old cluster. This ensures that the node can be added to another
cluster again without problems.

[source,bash]
----
rm /var/lib/corosync/*
----

As the configuration files from the other nodes are still in the cluster
filesystem, you may want to clean those up too. Simply remove the whole
directory recursively under '/etc/pve/nodes/NODENAME', but check three times
that you used the correct one before deleting it.

CAUTION: The node's SSH keys are still in the 'authorized_keys' file. This
means the nodes can still connect to each other with public key authentication.
You should fix this by removing the respective keys from the
'/etc/pve/priv/authorized_keys' file.


Quorum
------

{pve} uses a quorum-based technique to provide a consistent state among
all cluster nodes.

[quote, from Wikipedia, Quorum (distributed computing)]
____
A quorum is the minimum number of votes that a distributed transaction
has to obtain in order to be allowed to perform an operation in a
distributed system.
____

In case of network partitioning, state changes require that a
majority of nodes are online. The cluster switches to read-only mode
if it loses quorum.

NOTE: {pve} assigns a single vote to each node by default.

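As a short worked example: in a cluster of five nodes with one vote each, at
least three votes (a majority of five) must be online for the cluster to be
quorate. Such a cluster can therefore lose up to two nodes and keep working,
while losing a third node leaves the remaining partition read-only.
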
Cluster Network
---------------

The cluster network is the core of a cluster. All messages sent over it have to
be delivered reliably to all nodes in their respective order. In {pve} this
part is done by corosync, an implementation of a high-performance, low-overhead,
high-availability development toolkit. It serves our decentralized
configuration file system (`pmxcfs`).

[[pvecm_cluster_network_requirements]]
Network Requirements
~~~~~~~~~~~~~~~~~~~~
This needs a reliable network with latencies under 2 milliseconds (LAN
performance) to work properly. The network should not be used heavily by other
members; ideally corosync runs on its own network. Do not use a shared network
for corosync and storage (except as a potential low-priority fallback in a
xref:pvecm_redundancy[redundant] configuration).

Before setting up a cluster, it is good practice to check if the network is fit
for that purpose. To make sure the nodes can connect to each other on the
cluster network, you can test the connectivity between them with the `ping`
tool.

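A minimal connectivity and latency check from one node to the others could look
like the following; the addresses are placeholders for your cluster network IPs.

[source,bash]
----
# send 10 pings and look at the round-trip times in the summary;
# averages well below 2 ms indicate LAN-grade latency
ping -c 10 10.10.10.2
ping -c 10 10.10.10.3
----
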
If the {pve} firewall is enabled, ACCEPT rules for corosync will automatically
be generated - no manual action is required.

NOTE: Corosync used Multicast before version 3.0 (introduced in {pve} 6.0).
Modern versions rely on https://kronosnet.org/[Kronosnet] for cluster
communication, which, for now, only supports regular UDP unicast.

CAUTION: You can still enable Multicast or legacy unicast by setting your
transport to `udp` or `udpu` in your xref:pvecm_edit_corosync_conf[corosync.conf],
but keep in mind that this will disable all cryptography and redundancy support.
This is therefore not recommended.

Separate Cluster Network
~~~~~~~~~~~~~~~~~~~~~~~~

When creating a cluster without any parameters, the corosync cluster network is
generally shared with the Web UI and the VMs and their traffic. Depending on
your setup, even storage traffic may get sent over the same network. It is
recommended to change that, as corosync is a time-critical, real-time
application.

Setting Up A New Network
^^^^^^^^^^^^^^^^^^^^^^^^

First you have to set up a new network interface. It should be on a physically
separate network. Ensure that your network fulfills the
xref:pvecm_cluster_network_requirements[cluster network requirements].

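As an illustration, such a dedicated interface could be configured in
`/etc/network/interfaces` roughly as follows. The interface name and the
10.10.10.1/25 address are only examples, chosen to match the cluster creation
example below.

----
auto eno4
iface eno4 inet static
        address  10.10.10.1
        netmask  255.255.255.128
----
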
516Separate On Cluster Creation
517^^^^^^^^^^^^^^^^^^^^^^^^^^^^
518
a9e7c3aa
SR
519This is possible via the 'linkX' parameters of the 'pvecm create'
520command used for creating a new cluster.
e4ec4154 521
a9e7c3aa
SR
522If you have set up an additional NIC with a static address on 10.10.10.1/25,
523and want to send and receive all cluster communication over this interface,
e4ec4154
TL
524you would execute:
525
526[source,bash]
4d19cb00 527----
a9e7c3aa 528pvecm create test --link0 10.10.10.1
4d19cb00 529----
e4ec4154
TL
530
531To check if everything is working properly execute:
532[source,bash]
4d19cb00 533----
e4ec4154 534systemctl status corosync
4d19cb00 535----
e4ec4154 536
a9e7c3aa 537Afterwards, proceed as described above to
3254bfdd 538xref:pvecm_adding_nodes_with_separated_cluster_network[add nodes with a separated cluster network].
82d52451 539
[[pvecm_separate_cluster_net_after_creation]]
Separate After Cluster Creation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can do this if you have already created a cluster and want to switch
its communication to another network, without rebuilding the whole cluster.
This change may lead to short periods of quorum loss in the cluster, as nodes
have to restart corosync and come up one after the other on the new network.

Check how to xref:pvecm_edit_corosync_conf[edit the corosync.conf file] first.
Then, open it and you should see a file similar to:

----
logging {
  debug: off
  to_syslog: yes
}

nodelist {

  node {
    name: due
    nodeid: 2
    quorum_votes: 1
    ring0_addr: due
  }

  node {
    name: tre
    nodeid: 3
    quorum_votes: 1
    ring0_addr: tre
  }

  node {
    name: uno
    nodeid: 1
    quorum_votes: 1
    ring0_addr: uno
  }

}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: testcluster
  config_version: 3
  ip_version: ipv4-6
  secauth: on
  version: 2
  interface {
    linknumber: 0
  }

}
----

NOTE: `ringX_addr` actually specifies a corosync *link address*; the name "ring"
is a remnant of older corosync versions that is kept for backwards
compatibility.

The first thing you want to do is add the 'name' properties in the node entries,
if you do not see them already. Those *must* match the node name.

Then replace all addresses from the 'ring0_addr' properties of all nodes with
the new addresses. You may use plain IP addresses or hostnames here. If you use
hostnames, ensure that they are resolvable from all nodes (see also
xref:pvecm_corosync_addresses[Link Address Types]).

In this example, we want to switch the cluster communication to the
10.10.10.1/25 network. So we replace all 'ring0_addr' values accordingly.

NOTE: The exact same procedure can be used to change other 'ringX_addr' values
as well, although we recommend not changing multiple addresses at once, to make
it easier to recover if something goes wrong.

After we increase the 'config_version' property, the new configuration file
should look like:

----
logging {
  debug: off
  to_syslog: yes
}

nodelist {

  node {
    name: due
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2
  }

  node {
    name: tre
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.10.3
  }

  node {
    name: uno
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
  }

}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: testcluster
  config_version: 4
  ip_version: ipv4-6
  secauth: on
  version: 2
  interface {
    linknumber: 0
  }

}
----

Then, after a final check that all the changed information is correct, we save
it and once again follow the xref:pvecm_edit_corosync_conf[edit corosync.conf
file] section to bring it into effect.

The changes will be applied live, so restarting corosync is not strictly
necessary. If you changed other settings as well, or notice corosync
complaining, you can optionally trigger a restart.

On a single node execute:

[source,bash]
----
systemctl restart corosync
----

Now check if everything is fine:

[source,bash]
----
systemctl status corosync
----

If corosync runs correctly again, restart it on all other nodes as well.
They will then join the cluster membership one by one on the new network.

[[pvecm_corosync_addresses]]
Corosync addresses
~~~~~~~~~~~~~~~~~~

A corosync link address (for backwards compatibility denoted by 'ringX_addr' in
`corosync.conf`) can be specified in two ways:

* **IPv4/v6 addresses** will be used directly. They are recommended, since they
are static and usually not changed carelessly.

* **Hostnames** will be resolved using `getaddrinfo`, which means that by
default, IPv6 addresses will be used first, if available (see also
`man gai.conf`). Keep this in mind, especially when upgrading an existing
cluster to IPv6.

CAUTION: Hostnames should be used with care, since the address they
resolve to can be changed without touching corosync or the node it runs on -
which may lead to a situation where an address is changed without thinking
about implications for corosync.

A separate, static hostname specifically for corosync is recommended, if
hostnames are preferred. Also, make sure that every node in the cluster can
resolve all hostnames correctly.

Since {pve} 5.1, while supported, hostnames will be resolved at the time of
entry. Only the resolved IP is then saved to the configuration.

Nodes that joined the cluster on earlier versions likely still use their
unresolved hostname in `corosync.conf`. It might be a good idea to replace
them with IPs or a separate hostname, as mentioned above.

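To see which addresses a hostname actually resolves to on a given node
(following the `getaddrinfo`/`gai.conf` rules mentioned above), you can use
`getent`. The node name below is only an example.

[source,bash]
----
# prints the resolved addresses in getaddrinfo preference order
getent ahosts hp2
----
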
[[pvecm_redundancy]]
Corosync Redundancy
-------------------

Corosync supports redundant networking via its integrated kronosnet layer by
default (it is not supported on the legacy udp/udpu transports). It can be
enabled by specifying more than one link address, either via the '--linkX'
parameters of `pvecm`, in the GUI as **Link 1** (while creating a cluster or
adding a new node) or by specifying more than one 'ringX_addr' in
`corosync.conf`.

NOTE: To provide useful failover, every link should be on its own
physical network connection.

Links are used according to a priority setting. You can configure this priority
by setting 'knet_link_priority' in the corresponding interface section in
`corosync.conf`, or, preferably, using the 'priority' parameter when creating
your cluster with `pvecm`:

----
 # pvecm create CLUSTERNAME --link0 10.10.10.1,priority=15 --link1 10.20.20.1,priority=20
----

This would cause 'link1' to be used first, since it has the higher priority.

If no priorities are configured manually (or two links have the same priority),
links will be used in order of their number, with the lower number having higher
priority.

Even if all links are working, only the one with the highest priority will see
corosync traffic. Link priorities cannot be mixed, meaning links with different
priorities will not be able to communicate with each other.

Since lower-priority links will not see traffic unless all higher priorities
have failed, it becomes a useful strategy to specify even networks used for
other tasks (VMs, storage, etc.) as low-priority links. If worst comes to
worst, a higher-latency or more congested connection might be better than no
connection at all.

Adding Redundant Links To An Existing Cluster
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To add a new link to a running configuration, first check how to
xref:pvecm_edit_corosync_conf[edit the corosync.conf file].

Then, add a new 'ringX_addr' to every node in the `nodelist` section. Make
sure that your 'X' is the same for every node you add it to, and that it is
unique for each node.

Lastly, add a new 'interface', as shown below, to your `totem`
section, replacing 'X' with your link number chosen above.

Assuming you added a link with number 1, the new configuration file could look
like this:

----
logging {
  debug: off
  to_syslog: yes
}

nodelist {

  node {
    name: due
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2
    ring1_addr: 10.20.20.2
  }

  node {
    name: tre
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.10.3
    ring1_addr: 10.20.20.3
  }

  node {
    name: uno
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
    ring1_addr: 10.20.20.1
  }

}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: testcluster
  config_version: 4
  ip_version: ipv4-6
  secauth: on
  version: 2
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}
----

The new link will be enabled as soon as you follow the last steps to
xref:pvecm_edit_corosync_conf[edit the corosync.conf file]. A restart should not
be necessary. You can check that corosync loaded the new link using:

----
journalctl -b -u corosync
----

It might be a good idea to test the new link by temporarily disconnecting the
old link on one node and making sure that its status remains online while
disconnected:

----
pvecm status
----

If you see a healthy cluster state, it means that your new link is being used.


Corosync External Vote Support
------------------------------

This section describes a way to deploy an external voter in a {pve} cluster.
When configured, the cluster can sustain more node failures without
violating safety properties of the cluster communication.

For this to work, there are two services involved:

* a so-called QDevice daemon which runs on each {pve} node

* an external vote daemon which runs on an independent server

As a result, you can achieve higher availability, even in smaller setups (for
example 2+1 nodes).

QDevice Technical Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~

The Corosync Quorum Device (QDevice) is a daemon which runs on each cluster
node. It provides a configured number of votes to the cluster's quorum
subsystem, based on the decision of an externally running third-party
arbitrator. Its primary use is to allow a cluster to sustain more node failures
than standard quorum rules allow. This can be done safely as the external device
can see all nodes and thus choose only one set of nodes to give its vote.
This will only be done if said set of nodes can have quorum (again) when
receiving the third-party vote.

Currently only 'QDevice Net' is supported as a third-party arbitrator. It is
a daemon which provides a vote to a cluster partition if it can reach the
partition members over the network. It will only give votes to one partition
of a cluster at any time.
It's designed to support multiple clusters and is almost configuration and
state free. New clusters are handled dynamically and no configuration file
is needed on the host running a QDevice.

The only requirement for the external host is that it needs network access to
the cluster and the 'corosync-qnetd' package available. We provide such a
package for Debian-based hosts; other Linux distributions should also have a
package available through their respective package manager.

NOTE: In contrast to corosync itself, a QDevice connects to the cluster over
TCP/IP. The daemon may even run outside of the cluster's LAN and can have
longer latencies than 2 ms.

Supported Setups
~~~~~~~~~~~~~~~~

We support QDevices for clusters with an even number of nodes and recommend
it for 2-node clusters, if they should provide higher availability.
For clusters with an odd node count, we currently discourage the use of
QDevices. The reason for this is the difference in the number of votes the
QDevice provides for each cluster type. Even-numbered clusters get a single
additional vote, which only increases availability: if the QDevice
itself fails, you are in the same situation as with no QDevice at all.

With an odd-numbered cluster size, however, the QDevice provides '(N-1)' votes,
where 'N' corresponds to the cluster node count. This difference makes sense:
if it provided only one additional vote, the cluster could get into a
split-brain situation.
This algorithm allows all nodes but one (and naturally the
QDevice itself) to fail.
There are two drawbacks with this:

* If the QNet daemon itself fails, no other node may fail or the cluster
  immediately loses quorum. For example, in a cluster with 15 nodes, 7
  could fail before the cluster becomes inquorate. But, if a QDevice is
  configured here and said QDevice fails itself, **no single node** of
  the 15 may fail. The QDevice acts almost as a single point of failure in
  this case.

* The fact that all but one node plus the QDevice may fail sounds promising at
  first, but this may result in a mass recovery of HA services, which could
  overload the single node left. Also, Ceph servers will stop providing
  services after only '((N-1)/2)' nodes are online.

If you understand the drawbacks and implications, you can decide for yourself
whether you should use this technology in an odd-numbered cluster setup.

QDevice-Net Setup
~~~~~~~~~~~~~~~~~

We recommend running any daemon which provides votes to corosync-qdevice as an
unprivileged user. {pve} and Debian provide a package which is already
configured to do so.
The traffic between the daemon and the cluster must be encrypted to ensure a
safe and secure QDevice integration in {pve}.

First, install the 'corosync-qnetd' package on your external server and
the 'corosync-qdevice' package on all cluster nodes.

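On a Debian-based external server and on the {pve} nodes, the installation could
look like this (package names as above, the rest is standard `apt` usage):

[source,bash]
----
# on the external server
apt install corosync-qnetd

# on every {pve} cluster node
apt install corosync-qdevice
----
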
After that, ensure that all the nodes in the cluster are online.

You can now easily set up your QDevice by running the following command on one
of the {pve} nodes:

----
pve# pvecm qdevice setup <QDEVICE-IP>
----

The SSH key from the cluster will be automatically copied to the QDevice. You
might need to enter an SSH password during this step.

After you enter the password and all the steps are successfully completed, you
will see "Done". You can check the status now:

----
pve# pvecm status

...

Votequorum information
~~~~~~~~~~~~~~~~~
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate Qdevice

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 192.168.22.180 (local)
0x00000002          1    A,V,NMW 192.168.22.181
0x00000000          1            Qdevice

----

which means the QDevice is set up.

Frequently Asked Questions
~~~~~~~~~~~~~~~~~~~~~~~~~~

Tie Breaking
^^^^^^^^^^^^

In case of a tie, where two same-sized cluster partitions cannot see each other
but can see the QDevice, the QDevice chooses one of those partitions randomly
and provides a vote to it.

Possible Negative Implications
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For clusters with an even node count, there are no negative implications when
setting up a QDevice. If it fails to work, you are as good as without a QDevice
at all.

Adding/Deleting Nodes After QDevice Setup
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you want to add a new node or remove an existing one from a cluster with a
QDevice setup, you need to remove the QDevice first. After that, you can add or
remove nodes normally. Once you have a cluster with an even node count again,
you can set up the QDevice again as described above.

Removing the QDevice
^^^^^^^^^^^^^^^^^^^^

If you used the official `pvecm` tool to add the QDevice, you can remove it
trivially by running:

----
pve# pvecm qdevice remove
----

//Still TODO
//^^^^^^^^^^
//There is still stuff to add here


Corosync Configuration
----------------------

The `/etc/pve/corosync.conf` file plays a central role in a {pve} cluster. It
controls the cluster membership and its network.
For further information about it, check the corosync.conf man page:
[source,bash]
----
man corosync.conf
----

For node membership you should always use the `pvecm` tool provided by {pve}.
You may have to edit the configuration file manually for other changes.
Here are a few best practice tips for doing this.

[[pvecm_edit_corosync_conf]]
Edit corosync.conf
~~~~~~~~~~~~~~~~~~

Editing the corosync.conf file is not always very straightforward. There are
two on each cluster node, one in `/etc/pve/corosync.conf` and the other in
`/etc/corosync/corosync.conf`. Editing the one in our cluster file system will
propagate the changes to the local one, but not vice versa.

The configuration will get updated automatically as soon as the file changes.
This means changes which can be integrated in a running corosync will take
effect immediately. So you should always make a copy and edit that instead, to
avoid triggering unwanted changes with an in-between save.

[source,bash]
----
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
----

Then open the config file with your favorite editor; `nano` and `vim.tiny`,
for example, are preinstalled on every {pve} node.

NOTE: Always increment the 'config_version' number on configuration changes;
omitting this can lead to problems.

After making the necessary changes, create another copy of the current working
configuration file. This serves as a backup if the new configuration fails to
apply or causes other problems.

[source,bash]
----
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.bak
----

Then move the new configuration file over the old one:
[source,bash]
----
mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf
----
e4ec4154
TL
1078
1079You may check with the commands
1080[source,bash]
4d19cb00 1081----
e4ec4154
TL
1082systemctl status corosync
1083journalctl -b -u corosync
4d19cb00 1084----
e4ec4154 1085
a9e7c3aa 1086If the change could be applied automatically. If not you may have to restart the
e4ec4154
TL
1087corosync service via:
1088[source,bash]
4d19cb00 1089----
e4ec4154 1090systemctl restart corosync
4d19cb00 1091----
e4ec4154
TL
1092
1093On errors check the troubleshooting section below.
1094
Troubleshooting
~~~~~~~~~~~~~~~

Issue: 'quorum.expected_votes must be configured'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When corosync starts to fail and you get the following message in the system log:

----
[...]
corosync[1647]: [QUORUM] Quorum provider: corosync_votequorum failed to initialize.
corosync[1647]: [SERV ] Service engine 'corosync_quorum' failed to load for reason
 'configuration error: nodelist or quorum.expected_votes must be configured!'
[...]
----

It means that the hostname you set for corosync 'ringX_addr' in the
configuration could not be resolved.

Write Configuration When Not Quorate
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you need to change '/etc/pve/corosync.conf' on a node with no quorum, and you
know what you are doing, use:
[source,bash]
----
pvecm expected 1
----

This sets the expected vote count to 1 and makes the cluster quorate. You can
now fix your configuration, or revert it back to the last working backup.

This is not enough if corosync cannot start anymore. In that case, it is best to
edit the local copy of the corosync configuration in
'/etc/corosync/corosync.conf', so that corosync can start again. Ensure that on
all nodes this configuration has the same content to avoid split-brain
situations. If you are not sure what went wrong, it's best to ask the Proxmox
Community to help you.


[[pvecm_corosync_conf_glossary]]
Corosync Configuration Glossary
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ringX_addr::
This names the different link addresses for the kronosnet connections between
nodes.


Cluster Cold Start
------------------

It is obvious that a cluster is not quorate when all nodes are
offline. This is a common case after a power failure.

NOTE: It is always a good idea to use an uninterruptible power supply
(``UPS'', also called ``battery backup'') to avoid this state, especially if
you want HA.

On node startup, the `pve-guests` service is started and waits for
quorum. Once quorate, it starts all guests which have the `onboot`
flag set.

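To mark a guest to be started automatically once the node is up and the cluster
is quorate, the `onboot` flag can be set, for example as shown below (the guest
IDs are placeholders):

[source,bash]
----
# enable automatic start for a VM and for a container
qm set 100 --onboot 1
pct set 101 --onboot 1
----
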
When you turn on nodes, or when power comes back after power failure,
it is likely that some nodes boot faster than others. Please keep in
mind that guest startup is delayed until you reach quorum.


Guest Migration
---------------

Migrating virtual guests to other nodes is a useful feature in a
cluster. There are settings to control the behavior of such
migrations. This can be done via the configuration file
`datacenter.cfg` or for a specific migration via API or command line
parameters.

It makes a difference whether a guest is online or offline, or if it has
local resources (like a local disk).

For details about virtual machine migration, see the
xref:qm_migration[QEMU/KVM Migration Chapter].

For details about container migration, see the
xref:pct_migration[Container Migration Chapter].

Migration Type
~~~~~~~~~~~~~~

The migration type defines whether the migration data should be sent over an
encrypted (`secure`) channel or an unencrypted (`insecure`) one.
Setting the migration type to insecure means that the RAM content of a
virtual guest is also transferred unencrypted, which can lead to
information disclosure of critical data from inside the guest (for
example passwords or encryption keys).

Therefore, we strongly recommend using the secure channel if you do
not have full control over the network and cannot guarantee that no
one is eavesdropping on it.

NOTE: Storage migration does not follow this setting. Currently, it
always sends the storage content over a secure channel.

Encryption requires a lot of computing power, so this setting is often
changed to `insecure` to achieve better performance. The impact on
modern systems is lower because they implement AES encryption in
hardware. The performance impact is particularly evident in fast
networks, where you can transfer 10 Gbps or more.

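To apply this to a single migration, the type can be passed on the command line;
the sketch below assumes the `migration_type` parameter of `qm migrate`
(analogous to the `migration_network` parameter shown in the next section), with
a placeholder VM ID and target node. The cluster-wide default is set via the
`migration` property in `/etc/pve/datacenter.cfg`, as shown further below.

----
# qm migrate 106 tre --online --migration_type insecure
----
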
Migration Network
~~~~~~~~~~~~~~~~~

By default, {pve} uses the network in which cluster communication
takes place to send the migration traffic. This is not optimal, because
sensitive cluster traffic can be disrupted and this network may not
have the best bandwidth available on the node.

Setting the migration network parameter allows the use of a dedicated
network for all migration traffic. In addition to the memory,
this also affects the storage traffic for offline migrations.

The migration network is set as a network using CIDR notation. This
has the advantage that you do not have to set individual IP addresses
for each node. {pve} can determine the real address on the
destination node from the network specified in the CIDR form. To
enable this, the network must be specified so that each node has
exactly one IP in the respective network.

Example
^^^^^^^

We assume that we have a three-node setup, with three separate
networks. One for public communication with the Internet, one for
cluster communication, and a very fast one, which we want to use as a
dedicated network for migration.

A network configuration for such a setup might look as follows:

----
iface eno1 inet manual

# public network
auto vmbr0
iface vmbr0 inet static
        address  192.X.Y.57
        netmask  255.255.240.0
        gateway  192.X.Y.1
        bridge_ports eno1
        bridge_stp off
        bridge_fd 0

# cluster network
auto eno2
iface eno2 inet static
        address  10.1.1.1
        netmask  255.255.255.0

# fast network
auto eno3
iface eno3 inet static
        address  10.1.2.1
        netmask  255.255.255.0
----

Here, we will use the network 10.1.2.0/24 as the migration network. For
a single migration, you can do this using the `migration_network`
parameter of the command line tool:

----
# qm migrate 106 tre --online --migration_network 10.1.2.0/24
----

To configure this as the default network for all migrations in the
cluster, set the `migration` property of the `/etc/pve/datacenter.cfg`
file:

----
# use dedicated migration network
migration: secure,network=10.1.2.0/24
----

NOTE: The migration type must always be set when the migration network
is set in `/etc/pve/datacenter.cfg`.


ifdef::manvolnum[]
include::pve-copyright.adoc[]
endif::manvolnum[]