1 [[chapter_pvecm]]
2 ifdef::manvolnum[]
3 pvecm(1)
4 ========
5 :pve-toplevel:
6
7 NAME
8 ----
9
10 pvecm - Proxmox VE Cluster Manager
11
12 SYNOPSIS
13 --------
14
15 include::pvecm.1-synopsis.adoc[]
16
17 DESCRIPTION
18 -----------
19 endif::manvolnum[]
20
21 ifndef::manvolnum[]
22 Cluster Manager
23 ===============
24 :pve-toplevel:
25 endif::manvolnum[]
26
27 The {PVE} cluster manager `pvecm` is a tool to create a group of
28 physical servers. Such a group is called a *cluster*. We use the
29 http://www.corosync.org[Corosync Cluster Engine] for reliable group
30 communication, and such clusters can consist of up to 32 physical nodes
31 (probably more, depending on network latency).
32
33 `pvecm` can be used to create a new cluster, join nodes to a cluster,
34 leave the cluster, get status information and do various other cluster
35 related tasks. The **P**rox**m**o**x** **C**luster **F**ile **S**ystem (``pmxcfs'')
36 is used to transparently distribute the cluster configuration to all cluster
37 nodes.
38
39 Grouping nodes into a cluster has the following advantages:
40
41 * Centralized, web-based management
42
43 * Multi-master clusters: each node can do all management tasks
44
45 * `pmxcfs`: database-driven file system for storing configuration files,
46 replicated in real-time on all nodes using `corosync`.
47
48 * Easy migration of virtual machines and containers between physical
49 hosts
50
51 * Fast deployment
52
53 * Cluster-wide services like firewall and HA
54
55
56 Requirements
57 ------------
58
59 * All nodes must be able to connect to each other via UDP ports 5404 and 5405
60 for corosync to work.
61
62 * Date and time have to be synchronized.
63
64 * SSH tunnel on TCP port 22 between nodes is used.
65
66 * If you are interested in High Availability, you need to have at
67 least three nodes for reliable quorum. All nodes should have the
68 same version.
69
70 * We recommend a dedicated NIC for the cluster traffic, especially if
71 you use shared storage.
72
73 * The root password of a cluster node is required for adding nodes.
74
75 NOTE: It is not possible to mix {pve} 3.x and earlier with {pve} 4.X cluster
76 nodes.
77
78 NOTE: While it's possible to mix {pve} 4.4 and {pve} 5.0 nodes, this is not supported as
79 a production configuration and should only be done temporarily, while upgrading the
80 whole cluster from one major version to another.
81
82 NOTE: Running a cluster of {pve} 6.x with earlier versions is not possible. The
83 cluster protocol (corosync) between {pve} 6.x and earlier versions changed
84 fundamentally. The corosync 3 packages for {pve} 5.4 are only intended for the
85 upgrade procedure to {pve} 6.0.
86
87
88 Preparing Nodes
89 ---------------
90
91 First, install {PVE} on all nodes. Make sure that each node is
92 installed with the final hostname and IP configuration. Changing the
93 hostname and IP is not possible after cluster creation.
94
95 Currently, cluster creation can either be done on the console (login via
96 `ssh`) or through the API, for which we have a GUI implementation (__Datacenter ->
97 Cluster__).
98
99 While it's common to reference all node names and their IPs in `/etc/hosts` (or
100 make their names resolvable through other means), this is not necessary for a
101 cluster to work. It may be useful however, as you can then connect from one node
102 to the other with SSH via the easier-to-remember node name (see also
103 xref:pvecm_corosync_addresses[Link Address Types]). Note that we always
104 recommend referencing nodes by their IP addresses in the cluster configuration.
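
For example, with three hypothetical nodes, the `/etc/hosts` entries might look
like this (names and addresses are purely illustrative):

----
10.10.10.1   pve-node1.example.com pve-node1
10.10.10.2   pve-node2.example.com pve-node2
10.10.10.3   pve-node3.example.com pve-node3
----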
105
106
107 [[pvecm_create_cluster]]
108 Create the Cluster
109 ------------------
110
111 Use a unique name for your cluster. This name cannot be changed later. The
112 cluster name follows the same rules as node names.
113
114 Create via Web GUI
115 ~~~~~~~~~~~~~~~~~~
116
117 [thumbnail="screenshot/gui-cluster-create.png"]
118
119 Under __Datacenter -> Cluster__, click on *Create Cluster*. Enter the cluster
120 name and select a network connection from the dropdown to serve as the main
121 cluster network (Link 0). It defaults to the IP resolved via the node's
122 hostname.
123
124 To add a second link as fallback, you can select the 'Advanced' checkbox and
125 choose an additional network interface (Link 1, see also
126 xref:pvecm_redundancy[Corosync Redundancy]).
127
128 Create via Command Line
129 ~~~~~~~~~~~~~~~~~~~~~~~
130
131 Login via `ssh` to the first {pve} node and run the following command:
132
133 ----
134 hp1# pvecm create CLUSTERNAME
135 ----
136
137 To check the state of the new cluster use:
138
139 ----
140 hp1# pvecm status
141 ----
142
143 Multiple Clusters In Same Network
144 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
145
146 It is possible to create multiple clusters in the same physical or logical
147 network. Each such cluster must have a unique name to avoid possible clashes in
148 the cluster communication stack. This also helps avoid human confusion by making
149 clusters clearly distinguishable.
150
151 While the bandwidth requirement of a corosync cluster is relatively low, the
152 latency of packets and the packets per second (PPS) rate are the limiting
153 factors. Different clusters in the same network can compete with each other for
154 these resources, so it may still make sense to use separate physical network
155 infrastructure for bigger clusters.
156
157 [[pvecm_join_node_to_cluster]]
158 Adding Nodes to the Cluster
159 ---------------------------
160
161 CAUTION: A node that is about to be added to the cluster cannot hold any guests.
162 All existing configuration in `/etc/pve` is overwritten when joining a cluster,
163 since guest IDs could otherwise conflict. As a workaround, create a backup of the
164 guest (`vzdump`) and restore it under a different ID after the node has been added
165 to the cluster.
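
The following is a minimal sketch of this workaround for a QEMU/KVM guest,
assuming a hypothetical guest with ID 100 on the joining node and a free ID 120
in the cluster (the archive file name depends on the actual backup):

[source,bash]
----
# on the node that is going to join, before joining the cluster
vzdump 100 --mode stop

# after the node has been added to the cluster, restore under a new ID
qmrestore /var/lib/vz/dump/vzdump-qemu-100-<timestamp>.vma 120
----

For containers, `pct restore` would be used instead of `qmrestore`.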
166
167 Add Node via GUI
168 ~~~~~~~~~~~~~~~~
169
170 [thumbnail="screenshot/gui-cluster-join-information.png"]
171
172 Login to the web interface on an existing cluster node. Under __Datacenter ->
173 Cluster__, click the button *Join Information* at the top. Then, click on the
174 button *Copy Information*. Alternatively, copy the string from the 'Information'
175 field manually.
176
177 [thumbnail="screenshot/gui-cluster-join.png"]
178
179 Next, login to the web interface on the node you want to add.
180 Under __Datacenter -> Cluster__, click on *Join Cluster*. Fill in the
181 'Information' field with the text you copied earlier.
182
183 For security reasons, the cluster password has to be entered manually.
184
185 NOTE: To enter all required data manually, you can disable the 'Assisted Join'
186 checkbox.
187
188 After clicking on *Join* the node will immediately be added to the cluster. You
189 might need to reload the web page and re-login with the cluster credentials.
190
191 Confirm that your node is visible under __Datacenter -> Cluster__.
192
193 Add Node via Command Line
194 ~~~~~~~~~~~~~~~~~~~~~~~~~
195
196 Login via `ssh` to the node you want to add.
197
198 ----
199 hp2# pvecm add IP-ADDRESS-CLUSTER
200 ----
201
202 For `IP-ADDRESS-CLUSTER` use the IP or hostname of an existing cluster node.
203 An IP address is recommended (see xref:pvecm_corosync_addresses[Link Address Types]).
204
205
206 To check the state of the cluster use:
207
208 ----
209 # pvecm status
210 ----
211
212 .Cluster status after adding 4 nodes
213 ----
214 hp2# pvecm status
215 Quorum information
216 ~~~~~~~~~~~~~~~~~~
217 Date: Mon Apr 20 12:30:13 2015
218 Quorum provider: corosync_votequorum
219 Nodes: 4
220 Node ID: 0x00000001
221 Ring ID: 1/8
222 Quorate: Yes
223
224 Votequorum information
225 ~~~~~~~~~~~~~~~~~~~~~~
226 Expected votes: 4
227 Highest expected: 4
228 Total votes: 4
229 Quorum: 3
230 Flags: Quorate
231
232 Membership information
233 ~~~~~~~~~~~~~~~~~~~~~~
234 Nodeid Votes Name
235 0x00000001 1 192.168.15.91
236 0x00000002 1 192.168.15.92 (local)
237 0x00000003 1 192.168.15.93
238 0x00000004 1 192.168.15.94
239 ----
240
241 If you only want the list of all nodes use:
242
243 ----
244 # pvecm nodes
245 ----
246
247 .List nodes in a cluster
248 ----
249 hp2# pvecm nodes
250
251 Membership information
252 ~~~~~~~~~~~~~~~~~~~~~~
253 Nodeid Votes Name
254 1 1 hp1
255 2 1 hp2 (local)
256 3 1 hp3
257 4 1 hp4
258 ----
259
260 [[pvecm_adding_nodes_with_separated_cluster_network]]
261 Adding Nodes With Separated Cluster Network
262 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
263
264 When adding a node to a cluster with a separated cluster network, you need to
265 use the 'link0' parameter to set the node's address on that network:
266
267 [source,bash]
268 ----
269 pvecm add IP-ADDRESS-CLUSTER -link0 LOCAL-IP-ADDRESS-LINK0
270 ----
271
272 If you want to use the built-in xref:pvecm_redundancy[redundancy] of the
273 kronosnet transport layer, also use the 'link1' parameter.
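
For example, assuming the existing cluster is reachable at 192.168.15.91 and the
joining node uses 10.10.10.4 on the first and 10.20.20.4 on the second cluster
network (all addresses are illustrative), the command might look like:

[source,bash]
----
pvecm add 192.168.15.91 -link0 10.10.10.4 -link1 10.20.20.4
----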
274
275 Using the GUI, you can select the correct interface from the corresponding 'Link 0'
276 and 'Link 1' fields in the *Cluster Join* dialog.
277
278 Remove a Cluster Node
279 ---------------------
280
281 CAUTION: Read the procedure carefully before proceeding, as it may not be
282 what you want or need.
283
284 Move all virtual machines from the node. Make sure you have no local
285 data or backups you want to keep, or save them accordingly.
286 In the following example we will remove the node hp4 from the cluster.
287
288 Log in to a *different* cluster node (not hp4), and issue a `pvecm nodes`
289 command to identify the node ID to remove:
290
291 ----
292 hp1# pvecm nodes
293
294 Membership information
295 ~~~~~~~~~~~~~~~~~~~~~~
296 Nodeid Votes Name
297 1 1 hp1 (local)
298 2 1 hp2
299 3 1 hp3
300 4 1 hp4
301 ----
302
303
304 At this point, you must power off hp4 and make sure that it will not power on
305 again (in the existing cluster network) as it is.
307
308 IMPORTANT: As mentioned above, it is critical to power off the node
309 *before* removal, and make sure that it will *never* power on again
310 (in the existing cluster network) as it is.
311 If you power on the node as it is, the cluster could end up broken, and
312 it may be difficult to restore a clean cluster state.
313
314 After powering off the node hp4, we can safely remove it from the cluster.
315
316 ----
317 hp1# pvecm delnode hp4
318 ----
319
320 If the operation succeeds, no output is returned. Just check the node
321 list again with `pvecm nodes` or `pvecm status`. You should see
322 something like:
323
324 ----
325 hp1# pvecm status
326
327 Quorum information
328 ~~~~~~~~~~~~~~~~~~
329 Date: Mon Apr 20 12:44:28 2015
330 Quorum provider: corosync_votequorum
331 Nodes: 3
332 Node ID: 0x00000001
333 Ring ID: 1/8
334 Quorate: Yes
335
336 Votequorum information
337 ~~~~~~~~~~~~~~~~~~~~~~
338 Expected votes: 3
339 Highest expected: 3
340 Total votes: 3
341 Quorum: 2
342 Flags: Quorate
343
344 Membership information
345 ~~~~~~~~~~~~~~~~~~~~~~
346 Nodeid Votes Name
347 0x00000001 1 192.168.15.90 (local)
348 0x00000002 1 192.168.15.91
349 0x00000003 1 192.168.15.92
350 ----
351
352 If, for whatever reason, you want this server to join the same cluster again,
353 you have to
354
355 * reinstall {pve} on it from scratch
356
357 * then join it, as explained in the previous section.
358
359 NOTE: After removal of the node, its SSH fingerprint will still reside in the
360 'known_hosts' of the other nodes. If you receive an SSH error after rejoining
361 a node with the same IP or hostname, run `pvecm updatecerts` once on the
362 re-added node to update its fingerprint cluster wide.
363
364 [[pvecm_separate_node_without_reinstall]]
365 Separate A Node Without Reinstalling
366 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
367
368 CAUTION: This is *not* the recommended method; proceed with caution. Use the
369 method mentioned above if you're unsure.
370
371 You can also separate a node from a cluster without reinstalling it from
372 scratch. But after removing the node from the cluster it will still have
373 access to the shared storages! This must be resolved before you start removing
374 the node from the cluster. A {pve} cluster cannot share the exact same
375 storage with another cluster, as storage locking doesn't work across the cluster
376 boundary. Furthermore, it may also lead to VMID conflicts.
377
378 It is suggested that you create a new storage, to which only the node that you want
379 to separate has access. This can be a new export on your NFS or a new Ceph
380 pool, to name a few examples. It is just important that the exact same storage
381 is not accessed by multiple clusters. After setting up this storage, move
382 all data from the node and its VMs to it. Then you are ready to separate the
383 node from the cluster.
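
As a sketch, a new NFS export that only the node to be separated (here assumed to
be named 'hp4', with a hypothetical NFS server and export path) may access could
be added like this:

[source,bash]
----
pvesm add nfs separate-nfs --server 192.168.15.200 --export /export/separate \
    --content images,rootdir --nodes hp4
----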
384
385 WARNING: Ensure all shared resources are cleanly separated! Otherwise you will
386 run into conflicts and problems.
387
388 First stop the corosync and the pve-cluster services on the node:
389 [source,bash]
390 ----
391 systemctl stop pve-cluster
392 systemctl stop corosync
393 ----
394
395 Start the cluster filesystem again in local mode:
396 [source,bash]
397 ----
398 pmxcfs -l
399 ----
400
401 Delete the corosync configuration files:
402 [source,bash]
403 ----
404 rm /etc/pve/corosync.conf
405 rm /etc/corosync/*
406 ----
407
408 You can now start the filesystem again as normal service:
409 [source,bash]
410 ----
411 killall pmxcfs
412 systemctl start pve-cluster
413 ----
414
415 The node is now separated from the cluster. You can delete it from a remaining
416 node of the cluster with:
417 [source,bash]
418 ----
419 pvecm delnode oldnode
420 ----
421
422 If the command fails because the remaining node in the cluster lost quorum
423 when the now separated node exited, you may set the expected votes to 1 as a workaround:
424 [source,bash]
425 ----
426 pvecm expected 1
427 ----
428
429 And then repeat the 'pvecm delnode' command.
430
431 Now switch back to the separated node and delete all remaining files left
432 over from the old cluster. This ensures that the node can be added to another
433 cluster again without problems.
434
435 [source,bash]
436 ----
437 rm /var/lib/corosync/*
438 ----
439
440 As the configuration files from the other nodes are still in the cluster
441 filesystem, you may want to clean those up too. Simply remove the whole
442 directory recursively under '/etc/pve/nodes/NODENAME', but check three times that
443 you used the correct one before deleting it.
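
For example, if one of the former cluster members was named 'hp3', its leftover
configuration directory could be removed like this (double-check the name first):

[source,bash]
----
rm -r /etc/pve/nodes/hp3
----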
444
445 CAUTION: The node's SSH keys are still in the 'authorized_keys' file. This means
446 that the nodes can still connect to each other with public key authentication. This
447 should be fixed by removing the respective keys from the
448 '/etc/pve/priv/authorized_keys' file.
449
450
451 Quorum
452 ------
453
454 {pve} uses a quorum-based technique to provide a consistent state among
455 all cluster nodes.
456
457 [quote, from Wikipedia, Quorum (distributed computing)]
458 ____
459 A quorum is the minimum number of votes that a distributed transaction
460 has to obtain in order to be allowed to perform an operation in a
461 distributed system.
462 ____
463
464 In case of network partitioning, state changes require that a
465 majority of nodes is online (for example, at least three out of five nodes).
466 The cluster switches to read-only mode if it loses quorum.
467
468 NOTE: {pve} assigns a single vote to each node by default.
469
470
471 Cluster Network
472 ---------------
473
474 The cluster network is the core of a cluster. All messages sent over it have to
475 be delivered reliably to all nodes in their respective order. In {pve} this
476 part is done by corosync, an implementation of a high-performance, low-overhead,
477 high-availability development toolkit. It serves our decentralized
478 configuration file system (`pmxcfs`).
479
480 [[pvecm_cluster_network_requirements]]
481 Network Requirements
482 ~~~~~~~~~~~~~~~~~~~~
483 Corosync needs a reliable network with latencies under 2 milliseconds (LAN
484 performance) to work properly. The network should not be used heavily by other
485 members; ideally, corosync runs on its own network. Do not use a shared network
486 for corosync and storage (except as a potential low-priority fallback in a
487 xref:pvecm_redundancy[redundant] configuration).
488
489 Before setting up a cluster, it is good practice to check if the network is fit
490 for that purpose. To make sure the nodes can connect to each other on the
491 cluster network, you can test the connectivity between them with the `ping`
492 tool.
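
A simple sketch of such a test, assuming illustrative cluster network addresses,
run from each node against every other node:

[source,bash]
----
ping -c 4 10.10.10.2
ping -c 4 10.10.10.3
----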
493
494 If the {pve} firewall is enabled, ACCEPT rules for corosync will automatically
495 be generated - no manual action is required.
496
497 NOTE: Corosync used Multicast before version 3.0 (introduced in {pve} 6.0).
498 Modern versions rely on https://kronosnet.org/[Kronosnet] for cluster
499 communication, which, for now, only supports regular UDP unicast.
500
501 CAUTION: You can still enable Multicast or legacy unicast by setting your
502 transport to `udp` or `udpu` in your xref:pvecm_edit_corosync_conf[corosync.conf],
503 but keep in mind that this will disable all cryptography and redundancy support.
504 This is therefore not recommended.
505
506 Separate Cluster Network
507 ~~~~~~~~~~~~~~~~~~~~~~~~
508
509 When creating a cluster without any parameters, the corosync cluster network is
510 generally shared with the web UI, the VMs, and their traffic. Depending on
511 your setup, even storage traffic may get sent over the same network. It is
512 recommended to change that, as corosync is a time-critical, real-time
513 application.
514
515 Setting Up A New Network
516 ^^^^^^^^^^^^^^^^^^^^^^^^
517
518 First you have to set up a new network interface. It should be on a physically
519 separate network. Ensure that your network fulfills the
520 xref:pvecm_cluster_network_requirements[cluster network requirements].
521
522 Separate On Cluster Creation
523 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
524
525 This is possible via the 'linkX' parameters of the 'pvecm create'
526 command used for creating a new cluster.
527
528 If you have set up an additional NIC with a static address on 10.10.10.1/25,
529 and want to send and receive all cluster communication over this interface,
530 you would execute:
531
532 [source,bash]
533 ----
534 pvecm create test --link0 10.10.10.1
535 ----
536
537 To check if everything is working properly execute:
538 [source,bash]
539 ----
540 systemctl status corosync
541 ----
542
543 Afterwards, proceed as described above to
544 xref:pvecm_adding_nodes_with_separated_cluster_network[add nodes with a separated cluster network].
545
546 [[pvecm_separate_cluster_net_after_creation]]
547 Separate After Cluster Creation
548 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
549
550 You can do this if you have already created a cluster and want to switch
551 its communication to another network, without rebuilding the whole cluster.
552 This change may lead to short durations of quorum loss in the cluster, as nodes
553 have to restart corosync and come up one after the other on the new network.
554
555 Check how to xref:pvecm_edit_corosync_conf[edit the corosync.conf file] first.
556 Then, open it and you should see a file similar to:
557
558 ----
559 logging {
560 debug: off
561 to_syslog: yes
562 }
563
564 nodelist {
565
566 node {
567 name: due
568 nodeid: 2
569 quorum_votes: 1
570 ring0_addr: due
571 }
572
573 node {
574 name: tre
575 nodeid: 3
576 quorum_votes: 1
577 ring0_addr: tre
578 }
579
580 node {
581 name: uno
582 nodeid: 1
583 quorum_votes: 1
584 ring0_addr: uno
585 }
586
587 }
588
589 quorum {
590 provider: corosync_votequorum
591 }
592
593 totem {
594 cluster_name: testcluster
595 config_version: 3
596 ip_version: ipv4-6
597 secauth: on
598 version: 2
599 interface {
600 linknumber: 0
601 }
602
603 }
604 ----
605
606 NOTE: `ringX_addr` actually specifies a corosync *link address*. The name "ring"
607 is a remnant of older corosync versions that is kept for backwards
608 compatibility.
609
610 The first thing you want to do is add the 'name' properties in the node entries
611 if you do not see them already. Those *must* match the node name.
612
613 Then replace all addresses from the 'ring0_addr' properties of all nodes with
614 the new addresses. You may use plain IP addresses or hostnames here. If you use
615 hostnames, ensure that they are resolvable from all nodes (see also
616 xref:pvecm_corosync_addresses[Link Address Types]).
617
618 In this example, we want to switch the cluster communication to the
619 10.10.10.1/25 network. So we replace all 'ring0_addr' values accordingly.
620
621 NOTE: The exact same procedure can be used to change other 'ringX_addr' values
622 as well, although we recommend not changing multiple addresses at once, to make
623 it easier to recover if something goes wrong.
624
625 After we increase the 'config_version' property, the new configuration file
626 should look like:
627
628 ----
629 logging {
630 debug: off
631 to_syslog: yes
632 }
633
634 nodelist {
635
636 node {
637 name: due
638 nodeid: 2
639 quorum_votes: 1
640 ring0_addr: 10.10.10.2
641 }
642
643 node {
644 name: tre
645 nodeid: 3
646 quorum_votes: 1
647 ring0_addr: 10.10.10.3
648 }
649
650 node {
651 name: uno
652 nodeid: 1
653 quorum_votes: 1
654 ring0_addr: 10.10.10.1
655 }
656
657 }
658
659 quorum {
660 provider: corosync_votequorum
661 }
662
663 totem {
664 cluster_name: testcluster
665 config_version: 4
666 ip_version: ipv4-6
667 secauth: on
668 version: 2
669 interface {
670 linknumber: 0
671 }
672
673 }
674 ----
675
676 Then, after a final check that all the changed information is correct, we save it and
677 once again follow the xref:pvecm_edit_corosync_conf[edit corosync.conf file]
678 section to bring it into effect.
679
680 The changes will be applied live, so restarting corosync is not strictly
681 necessary. If you changed other settings as well, or notice corosync
682 complaining, you can optionally trigger a restart.
683
684 On a single node execute:
685
686 [source,bash]
687 ----
688 systemctl restart corosync
689 ----
690
691 Now check if everything is fine:
692
693 [source,bash]
694 ----
695 systemctl status corosync
696 ----
697
698 If corosync runs correctly again, restart it on all other nodes as well.
699 They will then join the cluster membership one by one on the new network.
700
701 [[pvecm_corosync_addresses]]
702 Corosync addresses
703 ~~~~~~~~~~~~~~~~~~
704
705 A corosync link address (for backwards compatibility denoted by 'ringX_addr' in
706 `corosync.conf`) can be specified in two ways:
707
708 * **IPv4/v6 addresses** will be used directly. They are recommended, since they
709 are static and usually not changed carelessly.
710
711 * **Hostnames** will be resolved using `getaddrinfo`, which means that per
712 default, IPv6 addresses will be used first, if available (see also
713 `man gai.conf`). Keep this in mind, especially when upgrading an existing
714 cluster to IPv6.
715
716 CAUTION: Hostnames should be used with care, since the address they
717 resolve to can be changed without touching corosync or the node it runs on -
718 which may lead to a situation where an address is changed without thinking
719 about implications for corosync.
720
721 A separate, static hostname specifically for corosync is recommended, if
722 hostnames are preferred. Also, make sure that every node in the cluster can
723 resolve all hostnames correctly.
724
725 Since {pve} 5.1, while supported, hostnames will be resolved at the time of
726 entry. Only the resolved IP is then saved to the configuration.
727
728 Nodes that joined the cluster on earlier versions likely still use their
729 unresolved hostname in `corosync.conf`. It might be a good idea to replace
730 them with IPs or a separate hostname, as mentioned above.
731
732
733 [[pvecm_redundancy]]
734 Corosync Redundancy
735 -------------------
736
737 Corosync supports redundant networking via its integrated kronosnet layer by
738 default (it is not supported on the legacy udp/udpu transports). It can be
739 enabled by specifying more than one link address, either via the '--linkX'
740 parameters of `pvecm`, in the GUI as **Link 1** (while creating a cluster or
741 adding a new node) or by specifying more than one 'ringX_addr' in
742 `corosync.conf`.
743
744 NOTE: To provide useful failover, every link should be on its own
745 physical network connection.
746
747 Links are used according to a priority setting. You can configure this priority
748 by setting 'knet_link_priority' in the corresponding interface section in
749 `corosync.conf`, or, preferably, using the 'priority' parameter when creating
750 your cluster with `pvecm`:
751
752 ----
753 # pvecm create CLUSTERNAME --link0 10.10.10.1,priority=20 --link1 10.20.20.1,priority=15
754 ----
755
756 This would cause 'link0' to be used first, since it has the higher priority.
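
Alternatively, roughly the same result can be achieved by setting
'knet_link_priority' directly in the interface sections of
xref:pvecm_edit_corosync_conf[corosync.conf]; a sketch of the relevant `totem`
fragment matching the command above:

----
totem {
  interface {
    linknumber: 0
    knet_link_priority: 20
  }
  interface {
    linknumber: 1
    knet_link_priority: 15
  }
}
----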
757
758 If no priorities are configured manually (or two links have the same priority),
759 links will be used in order of their number, with the lower number having higher
760 priority.
761
762 Even if all links are working, only the one with the highest priority will see
763 corosync traffic. Link priorities cannot be mixed, i.e. links with different
764 priorities will not be able to communicate with each other.
765
766 Since lower priority links will not see traffic unless all higher priorities
767 have failed, it is a useful strategy to specify even networks used for
768 other tasks (VMs, storage, etc.) as low-priority links. If worst comes to
769 worst, a higher-latency or more congested connection might be better than no
770 connection at all.
771
772 Adding Redundant Links To An Existing Cluster
773 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
774
775 To add a new link to a running configuration, first check how to
776 xref:pvecm_edit_corosync_conf[edit the corosync.conf file].
777
778 Then, add a new 'ringX_addr' to every node in the `nodelist` section. Make
779 sure that your 'X' is the same for every node you add it to, and that it is
780 unique for each node.
781
782 Lastly, add a new 'interface', as shown below, to your `totem`
783 section, replacing 'X' with your link number chosen above.
784
785 Assuming you added a link with number 1, the new configuration file could look
786 like this:
787
788 ----
789 logging {
790 debug: off
791 to_syslog: yes
792 }
793
794 nodelist {
795
796 node {
797 name: due
798 nodeid: 2
799 quorum_votes: 1
800 ring0_addr: 10.10.10.2
801 ring1_addr: 10.20.20.2
802 }
803
804 node {
805 name: tre
806 nodeid: 3
807 quorum_votes: 1
808 ring0_addr: 10.10.10.3
809 ring1_addr: 10.20.20.3
810 }
811
812 node {
813 name: uno
814 nodeid: 1
815 quorum_votes: 1
816 ring0_addr: 10.10.10.1
817 ring1_addr: 10.20.20.1
818 }
819
820 }
821
822 quorum {
823 provider: corosync_votequorum
824 }
825
826 totem {
827 cluster_name: testcluster
828 config_version: 4
829 ip_version: ipv4-6
830 secauth: on
831 version: 2
832 interface {
833 linknumber: 0
834 }
835 interface {
836 linknumber: 1
837 }
838 }
839 ----
840
841 The new link will be enabled as soon as you follow the last steps to
842 xref:pvecm_edit_corosync_conf[edit the corosync.conf file]. A restart should not
843 be necessary. You can check that corosync loaded the new link using:
844
845 ----
846 journalctl -b -u corosync
847 ----
848
849 It might be a good idea to test the new link by temporarily disconnecting the
850 old link on one node and making sure that its status remains online while
851 disconnected:
852
853 ----
854 pvecm status
855 ----
856
857 If you see a healthy cluster state, it means that your new link is being used.
858
859
860 Corosync External Vote Support
861 ------------------------------
862
863 This section describes a way to deploy an external voter in a {pve} cluster.
864 When configured, the cluster can sustain more node failures without
865 violating safety properties of the cluster communication.
866
867 For this to work there are two services involved:
868
869 * a so-called 'QDevice' daemon which runs on each {pve} node
870
871 * an external vote daemon which runs on an independent server.
872
873 As a result you can achieve higher availability even in smaller setups (for
874 example 2+1 nodes).
875
876 QDevice Technical Overview
877 ~~~~~~~~~~~~~~~~~~~~~~~~~~
878
879 The Corosync Quorum Device (QDevice) is a daemon which runs on each cluster
880 node. It provides a configured number of votes to the cluster's quorum
881 subsystem, based on the decision of an externally running third-party arbitrator.
882 Its primary use is to allow a cluster to sustain more node failures than
883 standard quorum rules allow. This can be done safely as the external device
884 can see all nodes and thus choose only one set of nodes to give its vote.
885 This will only be done if said set of nodes can have quorum (again) when
886 receiving the third-party vote.
887
888 Currently only 'QDevice Net' is supported as a third-party arbitrator. It is
889 a daemon which provides a vote to a cluster partition if it can reach the
890 partition members over the network. It will only give votes to one partition
891 of a cluster at any time.
892 It's designed to support multiple clusters and is almost configuration and
893 state free. New clusters are handled dynamically and no configuration file
894 is needed on the host running a QDevice.
895
896 The only requirements for the external host are that it needs network access to
897 the cluster and has a corosync-qnetd package available. We provide such a package
898 for Debian-based hosts; other Linux distributions should also have a package
899 available through their respective package manager.
900
901 NOTE: In contrast to corosync itself, a QDevice connects to the cluster over
902 TCP/IP. The daemon may even run outside of the cluster's LAN and can tolerate
903 latencies longer than 2 ms.
904
905 Supported Setups
906 ~~~~~~~~~~~~~~~~
907
908 We support QDevices for clusters with an even number of nodes and recommend
909 them for 2-node clusters, if they should provide higher availability.
910 For clusters with an odd node count, we currently discourage the use of
911 QDevices. The reason for this is the difference in the number of votes which
912 the QDevice provides for each cluster type. Even-numbered clusters get a single
913 additional vote, which can only increase availability, because if the QDevice
914 itself fails, you are in the same situation as with no QDevice at all.
915
916 On the other hand, with an odd-numbered cluster size, the QDevice provides
917 '(N-1)' votes -- where 'N' corresponds to the cluster node count. This
918 difference makes sense: if it provided only one additional vote, the cluster
919 could get into a split-brain situation.
920 This algorithm allows all nodes but one (and naturally the
921 QDevice itself) to fail.
922 There are two drawbacks to this:
923
924 * If the QNet daemon itself fails, no other node may fail or the cluster
925 immediately loses quorum. For example, in a cluster with 15 nodes, 7
926 could fail before the cluster becomes inquorate. But, if a QDevice is
927 configured here and said QDevice itself fails, **no single node** of
928 the 15 may fail. The QDevice acts almost as a single point of failure in
929 this case.
930
931 * The fact that all but one node plus the QDevice may fail sounds promising at
932 first, but this may result in a mass recovery of HA services that would
933 overload the single remaining node. Also, a Ceph server will stop providing
934 services if only '((N-1)/2)' nodes or fewer remain online.
935
936 If you understand the drawbacks and implications you can decide yourself if
937 you should use this technology in an odd numbered cluster setup.
938
939 QDevice-Net Setup
940 ~~~~~~~~~~~~~~~~~
941
942 We recommend running any daemon which provides votes to corosync-qdevice as an
943 unprivileged user. {pve} and Debian provide a package which is already
944 configured to do so.
945 The traffic between the daemon and the cluster must be encrypted to ensure a
946 safe and secure QDevice integration in {pve}.
947
948 First install the 'corosync-qnetd' package on your external server and
949 the 'corosync-qdevice' package on all cluster nodes.
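
A minimal sketch of the package installation, assuming Debian-based hosts:

[source,bash]
----
# on the external server
apt install corosync-qnetd

# on all cluster nodes
apt install corosync-qdevice
----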
950
951 After that, ensure that all the nodes in your cluster are online.
952
953 You can now easily set up your QDevice by running the following command on one
954 of the {pve} nodes:
955
956 ----
957 pve# pvecm qdevice setup <QDEVICE-IP>
958 ----
959
960 The SSH key from the cluster will be automatically copied to the QDevice. You
961 might need to enter an SSH password during this step.
962
963 After you enter the password and all the steps are successfully completed, you
964 will see "Done". You can check the status now:
965
966 ----
967 pve# pvecm status
968
969 ...
970
971 Votequorum information
972 ~~~~~~~~~~~~~~~~~~~~~
973 Expected votes: 3
974 Highest expected: 3
975 Total votes: 3
976 Quorum: 2
977 Flags: Quorate Qdevice
978
979 Membership information
980 ~~~~~~~~~~~~~~~~~~~~~~
981 Nodeid Votes Qdevice Name
982 0x00000001 1 A,V,NMW 192.168.22.180 (local)
983 0x00000002 1 A,V,NMW 192.168.22.181
984 0x00000000 1 Qdevice
985
986 ----
987
988 which means the QDevice is set up.
989
990 Frequently Asked Questions
991 ~~~~~~~~~~~~~~~~~~~~~~~~~~
992
993 Tie Breaking
994 ^^^^^^^^^^^^
995
996 In case of a tie, where two same-sized cluster partitions cannot see each other
997 but can both see the QDevice, the QDevice randomly chooses one of those partitions
998 and provides a vote to it.
999
1000 Possible Negative Implications
1001 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1002
1003 For clusters with an even node count there are no negative implications when
1004 setting up a QDevice. If it fails to work, you are no worse off than without a
1005 QDevice at all.
1006
1007 Adding/Deleting Nodes After QDevice Setup
1008 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1009
1010 If you want to add a new node or remove an existing one from a cluster with a
1011 QDevice setup, you need to remove the QDevice first. After that, you can add or
1012 remove nodes normally. Once you have a cluster with an even node count again,
1013 you can set up the QDevice again as described above.
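
A sketch of this sequence, reusing the commands shown elsewhere in this chapter:

----
# remove the QDevice from the current cluster
pve# pvecm qdevice remove

# add or remove nodes as usual, for example
pve# pvecm delnode hp4

# once the node count is even again, re-add the QDevice
pve# pvecm qdevice setup <QDEVICE-IP>
----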
1014
1015 Removing the QDevice
1016 ^^^^^^^^^^^^^^^^^^^^
1017
1018 If you used the official `pvecm` tool to add the QDevice, you can remove it
1019 trivially by running:
1020
1021 ----
1022 pve# pvecm qdevice remove
1023 ----
1024
1025 //Still TODO
1026 //^^^^^^^^^^
1027 //There is still stuff to add here
1028
1029
1030 Corosync Configuration
1031 ----------------------
1032
1033 The `/etc/pve/corosync.conf` file plays a central role in a {pve} cluster. It
1034 controls the cluster membership and its network.
1035 For further information about it, check the corosync.conf man page:
1036 [source,bash]
1037 ----
1038 man corosync.conf
1039 ----
1040
1041 For node membership you should always use the `pvecm` tool provided by {pve}.
1042 You may have to edit the configuration file manually for other changes.
1043 Here are a few best practice tips for doing this.
1044
1045 [[pvecm_edit_corosync_conf]]
1046 Edit corosync.conf
1047 ~~~~~~~~~~~~~~~~~~
1048
1049 Editing the corosync.conf file is not always very straightforward. There are
1050 two files on each cluster node: one in `/etc/pve/corosync.conf` and the other in
1051 `/etc/corosync/corosync.conf`. Editing the one in our cluster file system will
1052 propagate the changes to the local one, but not vice versa.
1053
1054 The configuration will get updated automatically as soon as the file changes.
1055 This means changes which can be integrated in a running corosync will take
1056 effect immediately. So you should always make a copy and edit that instead, to
1057 avoid triggering unwanted changes from an intermediate save.
1058
1059 [source,bash]
1060 ----
1061 cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
1062 ----
1063
1064 Then open the config file with your favorite editor; for example, `nano` and
1065 `vim.tiny` are preinstalled on every {pve} node.
1066
1067 NOTE: Always increment the 'config_version' number on configuration changes;
1068 omitting this can lead to problems.
1069
1070 After making the necessary changes, create another copy of the current working
1071 configuration file. This serves as a backup if the new configuration fails to
1072 apply or causes other problems.
1073
1074 [source,bash]
1075 ----
1076 cp /etc/pve/corosync.conf /etc/pve/corosync.conf.bak
1077 ----
1078
1079 Then move the new configuration file over the old one:
1080 [source,bash]
1081 ----
1082 mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf
1083 ----
1084
1085 With the following commands, you can check whether the changes were applied automatically:
1086 [source,bash]
1087 ----
1088 systemctl status corosync
1089 journalctl -b -u corosync
1090 ----
1091
1092 If the changes could not be applied automatically, you may have to restart the
1093 corosync service via:
1094 [source,bash]
1095 ----
1096 systemctl restart corosync
1097 ----
1098
1099 On errors check the troubleshooting section below.
1100
1101 Troubleshooting
1102 ~~~~~~~~~~~~~~~
1103
1104 Issue: 'quorum.expected_votes must be configured'
1105 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1106
1107 When corosync starts to fail and you get the following message in the system log:
1108
1109 ----
1110 [...]
1111 corosync[1647]: [QUORUM] Quorum provider: corosync_votequorum failed to initialize.
1112 corosync[1647]: [SERV ] Service engine 'corosync_quorum' failed to load for reason
1113 'configuration error: nodelist or quorum.expected_votes must be configured!'
1114 [...]
1115 ----
1116
1117 It means that the hostname you set for corosync 'ringX_addr' in the
1118 configuration could not be resolved.
1119
1120 Write Configuration When Not Quorate
1121 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1122
1123 If you need to change '/etc/pve/corosync.conf' on a node with no quorum, and you
1124 know what you are doing, use:
1125 [source,bash]
1126 ----
1127 pvecm expected 1
1128 ----
1129
1130 This sets the expected vote count to 1 and makes the cluster quorate. You can
1131 now fix your configuration, or revert it back to the last working backup.
1132
1133 If corosync cannot start anymore, this is not enough. In that case, it is best to
1134 edit the local copy of the corosync configuration in '/etc/corosync/corosync.conf'
1135 so that corosync can start again. Ensure that this configuration has the same
1136 content on all nodes, to avoid split-brain situations. If you are not sure what
1137 went wrong, it's best to ask the Proxmox Community to help you.
1138
1139
1140 [[pvecm_corosync_conf_glossary]]
1141 Corosync Configuration Glossary
1142 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1143
1144 ringX_addr::
1145 This names the different link addresses for the kronosnet connections between
1146 nodes.
1147
1148
1149 Cluster Cold Start
1150 ------------------
1151
1152 It is obvious that a cluster is not quorate when all nodes are
1153 offline. This is a common case after a power failure.
1154
1155 NOTE: It is always a good idea to use an uninterruptible power supply
1156 (``UPS'', also called ``battery backup'') to avoid this state, especially if
1157 you want HA.
1158
1159 On node startup, the `pve-guests` service is started and waits for
1160 quorum. Once quorate, it starts all guests which have the `onboot`
1161 flag set.
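
For example, assuming a hypothetical VM 100 and container 101, the `onboot` flag
could be set as follows:

[source,bash]
----
qm set 100 --onboot 1
pct set 101 --onboot 1
----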
1162
1163 When you turn on nodes, or when power comes back after a power failure,
1164 it is likely that some nodes boot faster than others. Please keep in
1165 mind that guest startup is delayed until you reach quorum.
1166
1167
1168 Guest Migration
1169 ---------------
1170
1171 Migrating virtual guests to other nodes is a useful feature in a
1172 cluster. There are settings to control the behavior of such
1173 migrations. This can be done via the configuration file
1174 `datacenter.cfg` or for a specific migration via API or command line
1175 parameters.
1176
1177 It makes a difference whether a guest is online or offline, or whether it has
1178 local resources (like a local disk).
1179
1180 For Details about Virtual Machine Migration see the
1181 xref:qm_migration[QEMU/KVM Migration Chapter].
1182
1183 For Details about Container Migration see the
1184 xref:pct_migration[Container Migration Chapter].
1185
1186 Migration Type
1187 ~~~~~~~~~~~~~~
1188
1189 The migration type defines whether the migration data should be sent over an
1190 encrypted (`secure`) channel or an unencrypted (`insecure`) one.
1191 Setting the migration type to `insecure` means that the RAM content of a
1192 virtual guest is also transferred unencrypted, which can lead to
1193 information disclosure of critical data from inside the guest (for
1194 example, passwords or encryption keys).
1195
1196 Therefore, we strongly recommend using the secure channel if you do
1197 not have full control over the network and cannot guarantee that no
1198 one is eavesdropping on it.
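
As a sketch, the migration type can also be overridden for a single migration on
the command line (reusing the VM ID and target node from the example further
below; treat the exact invocation as illustrative):

----
# qm migrate 106 tre --online --migration_type insecure
----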
1199
1200 NOTE: Storage migration does not follow this setting. Currently, it
1201 always sends the storage content over a secure channel.
1202
1203 Encryption requires a lot of computing power, so this setting is often
1204 changed to `insecure` to achieve better performance. The impact on
1205 modern systems is lower because they implement AES encryption in
1206 hardware. The performance impact is particularly evident in fast
1207 networks where you can transfer 10 Gbps or more.
1208
1209 Migration Network
1210 ~~~~~~~~~~~~~~~~~
1211
1212 By default, {pve} uses the network in which cluster communication
1213 takes place to send the migration traffic. This is not optimal because
1214 sensitive cluster traffic can be disrupted and this network may not
1215 have the best bandwidth available on the node.
1216
1217 Setting the migration network parameter allows the use of a dedicated
1218 network for the entire migration traffic. In addition to the memory,
1219 this also affects the storage traffic for offline migrations.
1220
1221 The migration network is set as a network in the CIDR notation. This
1222 has the advantage that you do not have to set individual IP addresses
1223 for each node. {pve} can determine the real address on the
1224 destination node from the network specified in the CIDR form. To
1225 enable this, the network must be specified so that each node has one,
1226 but only one IP in the respective network.
1227
1228 Example
1229 ^^^^^^^
1230
1231 We assume that we have a three-node setup with three separate
1232 networks. One for public communication with the Internet, one for
1233 cluster communication and a very fast one, which we want to use as a
1234 dedicated network for migration.
1235
1236 A network configuration for such a setup might look as follows:
1237
1238 ----
1239 iface eno1 inet manual
1240
1241 # public network
1242 auto vmbr0
1243 iface vmbr0 inet static
1244 address 192.X.Y.57
1245         netmask 255.255.240.0
1246 gateway 192.X.Y.1
1247 bridge_ports eno1
1248 bridge_stp off
1249 bridge_fd 0
1250
1251 # cluster network
1252 auto eno2
1253 iface eno2 inet static
1254 address 10.1.1.1
1255 netmask 255.255.255.0
1256
1257 # fast network
1258 auto eno3
1259 iface eno3 inet static
1260 address 10.1.2.1
1261 netmask 255.255.255.0
1262 ----
1263
1264 Here, we will use the network 10.1.2.0/24 as a migration network. For
1265 a single migration, you can do this using the `migration_network`
1266 parameter of the command line tool:
1267
1268 ----
1269 # qm migrate 106 tre --online --migration_network 10.1.2.0/24
1270 ----
1271
1272 To configure this as the default network for all migrations in the
1273 cluster, set the `migration` property of the `/etc/pve/datacenter.cfg`
1274 file:
1275
1276 ----
1277 # use dedicated migration network
1278 migration: secure,network=10.1.2.0/24
1279 ----
1280
1281 NOTE: The migration type must always be set when the migration network
1282 gets set in `/etc/pve/datacenter.cfg`.
1283
1284
1285 ifdef::manvolnum[]
1286 include::pve-copyright.adoc[]
1287 endif::manvolnum[]