[[chapter_pvecm]]
ifdef::manvolnum[]
pvecm(1)
========
:pve-toplevel:

NAME
----

pvecm - Proxmox VE Cluster Manager

SYNOPSIS
--------

include::pvecm.1-synopsis.adoc[]

DESCRIPTION
-----------
endif::manvolnum[]

ifndef::manvolnum[]
Cluster Manager
===============
:pve-toplevel:
endif::manvolnum[]

The {pve} cluster manager `pvecm` is a tool to create a group of
physical servers. Such a group is called a *cluster*. We use the
http://www.corosync.org[Corosync Cluster Engine] for reliable group
communication. There's no explicit limit for the number of nodes in a cluster.
In practice, the actual possible node count may be limited by the host and
network performance. Currently (2021), there are reports of clusters (using
high-end enterprise hardware) with over 50 nodes in production.

`pvecm` can be used to create a new cluster, join nodes to a cluster,
leave the cluster, get status information, and do various other cluster-related
tasks. The **P**rox**m**o**x** **C**luster **F**ile **S**ystem (``pmxcfs'')
is used to transparently distribute the cluster configuration to all cluster
nodes.

Grouping nodes into a cluster has the following advantages:

* Centralized, web-based management

* Multi-master clusters: each node can do all management tasks

* Use of `pmxcfs`, a database-driven file system, for storing configuration
files, replicated in real-time on all nodes using `corosync`

* Easy migration of virtual machines and containers between physical
hosts

* Fast deployment

* Cluster-wide services like firewall and HA


Requirements
------------

* All nodes must be able to connect to each other via UDP ports 5405-5412
for corosync to work (see the example check after this list).

* Date and time must be synchronized.

* An SSH tunnel on TCP port 22 between nodes is required.

* If you are interested in High Availability, you need to have at
least three nodes for reliable quorum. All nodes should have the
same version.

* We recommend a dedicated NIC for the cluster traffic, especially if
you use shared storage.

* The root password of a cluster node is required for adding nodes.

* Online migration of virtual machines is only supported when nodes have CPUs
from the same vendor. It might work otherwise, but this is never guaranteed.
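
To verify the time synchronization and UDP port requirements up front, a check
along the following lines can help. This is only a sketch: it assumes `nmap` is
installed on the probing node, and `10.10.10.2` stands in for another node's
address on the intended cluster network.

----
# check whether the local clock is synchronized
timedatectl status

# probe the corosync UDP port range on another node (needs root)
nmap -sU -p 5405-5412 10.10.10.2
----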

NOTE: It is not possible to mix {pve} 3.x and earlier with {pve} 4.x cluster
nodes.

NOTE: While it's possible to mix {pve} 4.4 and {pve} 5.0 nodes, doing so is
not supported as a production configuration and should only be done temporarily,
during an upgrade of the whole cluster from one major version to another.

NOTE: Running a cluster of {pve} 6.x with earlier versions is not possible. The
cluster protocol (corosync) between {pve} 6.x and earlier versions changed
fundamentally. The corosync 3 packages for {pve} 5.4 are only intended for the
upgrade procedure to {pve} 6.0.


Preparing Nodes
---------------

First, install {pve} on all nodes. Make sure that each node is
installed with the final hostname and IP configuration. Changing the
hostname and IP is not possible after cluster creation.

While it's common to reference all node names and their IPs in `/etc/hosts` (or
make their names resolvable through other means), this is not necessary for a
cluster to work. It may be useful however, as you can then connect from one node
to another via SSH, using the easier-to-remember node name (see also
xref:pvecm_corosync_addresses[Link Address Types]). Note that we always
recommend referencing nodes by their IP addresses in the cluster configuration.
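
For example, `/etc/hosts` entries for a three-node setup could look like the
following sketch; the names and addresses are placeholders, not values {pve}
generates:

----
# /etc/hosts (example entries, identical on every node)
10.10.10.1   uno.example.local   uno
10.10.10.2   due.example.local   due
10.10.10.3   tre.example.local   tre
----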


[[pvecm_create_cluster]]
Create a Cluster
----------------

You can either create a cluster on the console (login via `ssh`), or through
the API using the {pve} web interface (__Datacenter -> Cluster__).

NOTE: Use a unique name for your cluster. This name cannot be changed later.
The cluster name follows the same rules as node names.

[[pvecm_cluster_create_via_gui]]
Create via Web GUI
~~~~~~~~~~~~~~~~~~

[thumbnail="screenshot/gui-cluster-create.png"]

Under __Datacenter -> Cluster__, click on *Create Cluster*. Enter the cluster
name and select a network connection from the drop-down list to serve as the
main cluster network (Link 0). It defaults to the IP resolved via the node's
hostname.

As of {pve} 6.2, up to 8 fallback links can be added to a cluster. To add a
redundant link, click the 'Add' button and select a link number and IP address
from the respective fields. Prior to {pve} 6.2, to add a second link as
fallback, you can select the 'Advanced' checkbox and choose an additional
network interface (Link 1, see also xref:pvecm_redundancy[Corosync Redundancy]).

NOTE: Ensure that the network selected for cluster communication is not used for
any high-traffic purposes, like network storage or live migration.
While the cluster network itself produces small amounts of data, it is very
sensitive to latency. See the full
xref:pvecm_cluster_network_requirements[cluster network requirements].

[[pvecm_cluster_create_via_cli]]
Create via the Command Line
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Log in via `ssh` to the first {pve} node and run the following command:

----
hp1# pvecm create CLUSTERNAME
----

To check the state of the new cluster use:

----
hp1# pvecm status
----

Multiple Clusters in the Same Network
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is possible to create multiple clusters in the same physical or logical
network. In this case, each cluster must have a unique name to avoid possible
clashes in the cluster communication stack. Furthermore, this helps avoid human
confusion by making clusters clearly distinguishable.

While the bandwidth requirement of a corosync cluster is relatively low, the
latency of packets and the packets per second (PPS) rate are the limiting
factors. Different clusters in the same network can compete with each other for
these resources, so it may still make sense to use separate physical network
infrastructure for bigger clusters.

[[pvecm_join_node_to_cluster]]
Adding Nodes to the Cluster
---------------------------

CAUTION: All existing configuration in `/etc/pve` is overwritten when joining a
cluster. In particular, a joining node cannot hold any guests, since guest IDs
could otherwise conflict, and the node will inherit the cluster's storage
configuration. To join a node with existing guests, as a workaround, you can
create a backup of each guest (using `vzdump`) and restore it under a different
ID after joining. If the node's storage layout differs, you will need to re-add
the node's storages, and adapt each storage's node restriction to reflect on
which nodes the storage is actually available.
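
For a single VM, that workaround could look like the following sketch. The
VMID `105`, the new VMID `4105`, the storage `local`, and the archive name are
placeholders; for containers, `pct restore` takes the place of `qmrestore`:

----
# before joining: back up the guest on the node that is about to join
vzdump 105 --storage local --mode stop

# after joining: restore the backup under a new, cluster-unique VMID
qmrestore /var/lib/vz/dump/vzdump-qemu-105-<timestamp>.vma.zst 4105 --storage local
----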

Join Node to Cluster via GUI
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[thumbnail="screenshot/gui-cluster-join-information.png"]

Log in to the web interface on an existing cluster node. Under __Datacenter ->
Cluster__, click the *Join Information* button at the top. Then, click on the
button *Copy Information*. Alternatively, copy the string from the 'Information'
field manually.

[thumbnail="screenshot/gui-cluster-join.png"]

Next, log in to the web interface on the node you want to add.
Under __Datacenter -> Cluster__, click on *Join Cluster*. Fill in the
'Information' field with the 'Join Information' text you copied earlier.
Most settings required for joining the cluster will be filled out
automatically. For security reasons, the cluster password has to be entered
manually.

NOTE: To enter all required data manually, you can disable the 'Assisted Join'
checkbox.

After clicking the *Join* button, the cluster join process will start
immediately. After the node has joined the cluster, its current node certificate
will be replaced by one signed by the cluster certificate authority (CA).
This means that the current session will stop working after a few seconds. You
then might need to force-reload the web interface and log in again with the
cluster credentials.

Now your node should be visible under __Datacenter -> Cluster__.

Join Node to Cluster via Command Line
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Log in to the node you want to join into an existing cluster via `ssh`.

----
# pvecm add IP-ADDRESS-CLUSTER
----

For `IP-ADDRESS-CLUSTER`, use the IP or hostname of an existing cluster node.
An IP address is recommended (see xref:pvecm_corosync_addresses[Link Address Types]).


To check the state of the cluster use:

----
# pvecm status
----

.Cluster status after adding 4 nodes
----
# pvecm status
Cluster information
~~~~~~~~~~~~~~~~~~~
Name:             prod-central
Config Version:   3
Transport:        knet
Secure auth:      on

Quorum information
~~~~~~~~~~~~~~~~~~
Date:             Tue Sep 14 11:06:47 2021
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000001
Ring ID:          1.1a8
Quorate:          Yes

Votequorum information
~~~~~~~~~~~~~~~~~~~~~~
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3
Flags:            Quorate

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes Name
0x00000001          1 192.168.15.91
0x00000002          1 192.168.15.92 (local)
0x00000003          1 192.168.15.93
0x00000004          1 192.168.15.94
----

If you only want a list of all nodes, use:

----
# pvecm nodes
----

.List nodes in a cluster
----
# pvecm nodes

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes Name
         1          1 hp1
         2          1 hp2 (local)
         3          1 hp3
         4          1 hp4
----

[[pvecm_adding_nodes_with_separated_cluster_network]]
Adding Nodes with Separated Cluster Network
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When adding a node to a cluster with a separated cluster network, you need to
use the 'link0' parameter to set the node's address on that network:

[source,bash]
----
# pvecm add IP-ADDRESS-CLUSTER --link0 LOCAL-IP-ADDRESS-LINK0
----

If you want to use the built-in xref:pvecm_redundancy[redundancy] of the
Kronosnet transport layer, also use the 'link1' parameter.
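
For example, a join using two separated cluster networks could look like this;
both addresses are placeholders for the joining node's own addresses on those
networks:

[source,bash]
----
# pvecm add IP-ADDRESS-CLUSTER --link0 10.10.10.2 --link1 10.20.20.2
----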

Using the GUI, you can select the correct interface from the corresponding
'Link X' fields in the *Cluster Join* dialog.

Remove a Cluster Node
---------------------

CAUTION: Read the procedure carefully before proceeding, as it may
not be what you want or need.

Move all virtual machines from the node. Ensure that you have made copies of any
local data or backups that you want to keep. In addition, make sure to remove
any scheduled replication jobs to the node to be removed.

CAUTION: Failure to remove replication jobs to a node before removing said node
will result in the replication job becoming irremovable. Especially note that
replication automatically switches direction if a replicated VM is migrated, so
by migrating a replicated VM from a node to be deleted, replication jobs will be
set up to that node automatically.
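
To find such jobs before removing the node, you can, for example, list the
configured replication jobs with the `pvesr` tool and delete any that target
the node; the job ID `100-0` below is just an example:

----
# pvesr list
# pvesr delete 100-0
----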

In the following example, we will remove the node hp4 from the cluster.

Log in to a *different* cluster node (not hp4), and issue a `pvecm nodes`
command to identify the node ID to remove:

----
hp1# pvecm nodes

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes Name
         1          1 hp1 (local)
         2          1 hp2
         3          1 hp3
         4          1 hp4
----


At this point, you must power off hp4 and ensure that it will not power on
again (in the network) with its current configuration.

IMPORTANT: As mentioned above, it is critical to power off the node
*before* removal, and make sure that it will *not* power on again
(in the existing cluster network) with its current configuration.
If you power on the node as it is, the cluster could end up broken,
and it could be difficult to restore it to a functioning state.

After powering off the node hp4, we can safely remove it from the cluster.

----
hp1# pvecm delnode hp4
Killing node 4
----

NOTE: At this point, it is possible that you will receive an error message
stating `Could not kill node (error = CS_ERR_NOT_EXIST)`. This does not
signify an actual failure in the deletion of the node, but rather a failure in
corosync trying to kill an offline node. Thus, it can be safely ignored.

Use `pvecm nodes` or `pvecm status` to check the node list again. It should
look something like:

----
hp1# pvecm status

...

Votequorum information
~~~~~~~~~~~~~~~~~~~~~~
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes Name
0x00000001          1 192.168.15.90 (local)
0x00000002          1 192.168.15.91
0x00000003          1 192.168.15.92
----

If, for whatever reason, you want this server to join the same cluster again,
you have to:

* do a fresh install of {pve} on it,

* then join it, as explained in the previous section.

The configuration files for the removed node will still reside in
'/etc/pve/nodes/hp4'. Recover any configuration you still need and remove the
directory afterwards.

NOTE: After removal of the node, its SSH fingerprint will still reside in the
'known_hosts' of the other nodes. If you receive an SSH error after rejoining
a node with the same IP or hostname, run `pvecm updatecerts` once on the
re-added node to update its fingerprint cluster wide.

[[pvecm_separate_node_without_reinstall]]
Separate a Node Without Reinstalling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

CAUTION: This is *not* the recommended method, proceed with caution. Use the
previous method if you're unsure.

You can also separate a node from a cluster without reinstalling it from
scratch. But after removing the node from the cluster, it will still have
access to any shared storage. This must be resolved before you start removing
the node from the cluster. A {pve} cluster cannot share the exact same
storage with another cluster, as storage locking doesn't work over the cluster
boundary. Furthermore, it may also lead to VMID conflicts.

It's suggested that you create a new storage, where only the node which you want
to separate has access. This can be a new export on your NFS or a new Ceph
pool, to name a few examples. It's just important that the exact same storage
does not get accessed by multiple clusters. After setting up this storage, move
all data and VMs from the node to it. Then you are ready to separate the
node from the cluster.

WARNING: Ensure that all shared resources are cleanly separated! Otherwise you
will run into conflicts and problems.

First, stop the corosync and pve-cluster services on the node:
[source,bash]
----
systemctl stop pve-cluster
systemctl stop corosync
----

Start the cluster file system again in local mode:
[source,bash]
----
pmxcfs -l
----

Delete the corosync configuration files:
[source,bash]
----
rm /etc/pve/corosync.conf
rm -r /etc/corosync/*
----

You can now start the file system again as a normal service:
[source,bash]
----
killall pmxcfs
systemctl start pve-cluster
----

The node is now separated from the cluster. You can delete it from any
remaining node of the cluster with:
[source,bash]
----
pvecm delnode oldnode
----

If the command fails due to a loss of quorum in the remaining node, you can set
the expected votes to 1 as a workaround:
[source,bash]
----
pvecm expected 1
----

And then repeat the 'pvecm delnode' command.

Now switch back to the separated node and delete all the remaining cluster
files on it. This ensures that the node can be added to another cluster again
without problems.

[source,bash]
----
rm /var/lib/corosync/*
----

As the configuration files from the other nodes are still in the cluster
file system, you may want to clean those up too. After making absolutely sure
that you have the correct node name, you can simply remove the entire
directory recursively from '/etc/pve/nodes/NODENAME'.

CAUTION: The node's SSH keys will remain in the 'authorized_keys' file. This
means that the nodes can still connect to each other with public key
authentication. You should fix this by removing the respective keys from the
'/etc/pve/priv/authorized_keys' file.


Quorum
------

{pve} uses a quorum-based technique to provide a consistent state among
all cluster nodes.

[quote, from Wikipedia, Quorum (distributed computing)]
____
A quorum is the minimum number of votes that a distributed transaction
has to obtain in order to be allowed to perform an operation in a
distributed system.
____

In case of network partitioning, state changes require that a
majority of nodes are online. The cluster switches to read-only mode
if it loses quorum. For example, a five-node cluster with one vote per node
stays quorate as long as at least three nodes, holding a majority of the
votes, are online.

NOTE: {pve} assigns a single vote to each node by default.


Cluster Network
---------------

The cluster network is the core of a cluster. All messages sent over it have to
be delivered reliably to all nodes in their respective order. In {pve} this
part is done by corosync, an implementation of a high-performance, low-overhead,
high-availability development toolkit. It serves our decentralized configuration
file system (`pmxcfs`).

[[pvecm_cluster_network_requirements]]
Network Requirements
~~~~~~~~~~~~~~~~~~~~

The {pve} cluster stack requires a reliable network with latencies under 5
milliseconds (LAN performance) between all nodes to operate stably. While on
setups with a small node count a network with higher latencies _may_ work, this
is not guaranteed and gets rather unlikely with more than three nodes and
latencies above around 10 ms.

The network should not be used heavily by other members, as while corosync does
not use much bandwidth, it is sensitive to latency jitter; ideally, corosync
runs on its own physically separated network. Especially do not use a shared
network for corosync and storage (except as a potential low-priority fallback
in a xref:pvecm_redundancy[redundant] configuration).

Before setting up a cluster, it is good practice to check if the network is fit
for that purpose. To ensure that the nodes can connect to each other on the
cluster network, you can test the connectivity between them with the `ping`
tool.
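
For example, to get a first impression of the latency between two nodes, send a
few probes and look at the reported round-trip times, which should stay well
below the 5 ms bound (the address is a placeholder):

----
# ping -c 10 10.10.10.2
----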

If the {pve} firewall is enabled, ACCEPT rules for corosync will automatically
be generated - no manual action is required.

NOTE: Corosync used Multicast before version 3.0 (introduced in {pve} 6.0).
Modern versions rely on https://kronosnet.org/[Kronosnet] for cluster
communication, which, for now, only supports regular UDP unicast.

CAUTION: You can still enable Multicast or legacy unicast by setting your
transport to `udp` or `udpu` in your xref:pvecm_edit_corosync_conf[corosync.conf],
but keep in mind that this will disable all cryptography and redundancy support.
This is therefore not recommended.

Separate Cluster Network
~~~~~~~~~~~~~~~~~~~~~~~~

When creating a cluster without any parameters, the corosync cluster network is
generally shared with the web interface and the VMs' network. Depending on
your setup, even storage traffic may get sent over the same network. It's
recommended to change that, as corosync is a time-critical, real-time
application.

Setting Up a New Network
^^^^^^^^^^^^^^^^^^^^^^^^

First, you have to set up a new network interface. It should be on a physically
separate network. Ensure that your network fulfills the
xref:pvecm_cluster_network_requirements[cluster network requirements].

Separate On Cluster Creation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is possible via the 'linkX' parameters of the 'pvecm create'
command, used for creating a new cluster.

If you have set up an additional NIC with a static address on 10.10.10.1/25,
and want to send and receive all cluster communication over this interface,
you would execute:

[source,bash]
----
pvecm create test --link0 10.10.10.1
----

To check if everything is working properly, execute:
[source,bash]
----
systemctl status corosync
----

Afterwards, proceed as described above to
xref:pvecm_adding_nodes_with_separated_cluster_network[add nodes with a separated cluster network].

[[pvecm_separate_cluster_net_after_creation]]
Separate After Cluster Creation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can do this if you have already created a cluster and want to switch
its communication to another network, without rebuilding the whole cluster.
This change may lead to short periods of quorum loss in the cluster, as nodes
have to restart corosync and come up one after the other on the new network.

Check how to xref:pvecm_edit_corosync_conf[edit the corosync.conf file] first.
Then, open it and you should see a file similar to:

----
logging {
  debug: off
  to_syslog: yes
}

nodelist {

  node {
    name: due
    nodeid: 2
    quorum_votes: 1
    ring0_addr: due
  }

  node {
    name: tre
    nodeid: 3
    quorum_votes: 1
    ring0_addr: tre
  }

  node {
    name: uno
    nodeid: 1
    quorum_votes: 1
    ring0_addr: uno
  }

}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: testcluster
  config_version: 3
  ip_version: ipv4-6
  secauth: on
  version: 2
  interface {
    linknumber: 0
  }

}
----

NOTE: `ringX_addr` actually specifies a corosync *link address*. The name "ring"
is a remnant of older corosync versions that is kept for backwards
compatibility.

The first thing you want to do is add the 'name' properties in the node entries,
if you do not see them already. Those *must* match the node name.

Then replace all addresses from the 'ring0_addr' properties of all nodes with
the new addresses. You may use plain IP addresses or hostnames here. If you use
hostnames, ensure that they are resolvable from all nodes (see also
xref:pvecm_corosync_addresses[Link Address Types]).

In this example, we want to switch cluster communication to the
10.10.10.0/25 network, so we change the 'ring0_addr' of each node respectively.

NOTE: The exact same procedure can be used to change other 'ringX_addr' values
as well. However, we recommend only changing one link address at a time, so
that it's easier to recover if something goes wrong.

After we increase the 'config_version' property, the new configuration file
should look like:

----
logging {
  debug: off
  to_syslog: yes
}

nodelist {

  node {
    name: due
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2
  }

  node {
    name: tre
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.10.3
  }

  node {
    name: uno
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
  }

}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: testcluster
  config_version: 4
  ip_version: ipv4-6
  secauth: on
  version: 2
  interface {
    linknumber: 0
  }

}
----

Then, after a final check to see that all changed information is correct, we
save it and once again follow the
xref:pvecm_edit_corosync_conf[edit corosync.conf file] section to bring it into
effect.

The changes will be applied live, so restarting corosync is not strictly
necessary. If you changed other settings as well, or notice corosync
complaining, you can optionally trigger a restart.

On a single node execute:

[source,bash]
----
systemctl restart corosync
----

Now check if everything is okay:

[source,bash]
----
systemctl status corosync
----

If corosync begins to work again, restart it on all other nodes too.
They will then join the cluster membership one by one on the new network.

[[pvecm_corosync_addresses]]
Corosync Addresses
~~~~~~~~~~~~~~~~~~

A corosync link address (for backwards compatibility denoted by 'ringX_addr' in
`corosync.conf`) can be specified in two ways:

* **IPv4/v6 addresses** can be used directly. They are recommended, since they
are static and usually not changed carelessly.

* **Hostnames** will be resolved using `getaddrinfo`, which means that by
default, IPv6 addresses will be used first, if available (see also
`man gai.conf`). Keep this in mind, especially when upgrading an existing
cluster to IPv6.

CAUTION: Hostnames should be used with care, since the addresses they
resolve to can be changed without touching corosync or the node it runs on -
which may lead to a situation where an address is changed without thinking
about implications for corosync.

A separate, static hostname specifically for corosync is recommended, if
hostnames are preferred. Also, make sure that every node in the cluster can
resolve all hostnames correctly.

Since {pve} 5.1, while supported, hostnames will be resolved at the time of
entry. Only the resolved IP is saved to the configuration.

Nodes that joined the cluster on earlier versions likely still use their
unresolved hostname in `corosync.conf`. It might be a good idea to replace
them with IPs or a separate hostname, as mentioned above.


[[pvecm_redundancy]]
Corosync Redundancy
-------------------

Corosync supports redundant networking via its integrated Kronosnet layer by
default (it is not supported on the legacy udp/udpu transports). It can be
enabled by specifying more than one link address, either via the '--linkX'
parameters of `pvecm`, in the GUI as **Link 1** (while creating a cluster or
adding a new node) or by specifying more than one 'ringX_addr' in
`corosync.conf`.

NOTE: To provide useful failover, every link should be on its own
physical network connection.

Links are used according to a priority setting. You can configure this priority
by setting 'knet_link_priority' in the corresponding interface section in
`corosync.conf`, or, preferably, using the 'priority' parameter when creating
your cluster with `pvecm`:

----
# pvecm create CLUSTERNAME --link0 10.10.10.1,priority=15 --link1 10.20.20.1,priority=20
----

This would cause 'link1' to be used first, since it has the higher priority.

If no priorities are configured manually (or two links have the same priority),
links will be used in order of their number, with the lower number having higher
priority.

Even if all links are working, only the one with the highest priority will see
corosync traffic. Link priorities cannot be mixed, meaning that links with
different priorities will not be able to communicate with each other.

Since lower priority links will not see traffic unless all higher priorities
have failed, it becomes a useful strategy to specify networks used for
other tasks (VMs, storage, etc.) as low-priority links. If worst comes to
worst, a higher latency or more congested connection might be better than no
connection at all.

Adding Redundant Links To An Existing Cluster
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To add a new link to a running configuration, first check how to
xref:pvecm_edit_corosync_conf[edit the corosync.conf file].

Then, add a new 'ringX_addr' to every node in the `nodelist` section. Make
sure that your 'X' is the same for every node you add it to, and that it is
unique for each node.

Lastly, add a new 'interface', as shown below, to your `totem`
section, replacing 'X' with the link number chosen above.

Assuming you added a link with number 1, the new configuration file could look
like this:

----
logging {
  debug: off
  to_syslog: yes
}

nodelist {

  node {
    name: due
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2
    ring1_addr: 10.20.20.2
  }

  node {
    name: tre
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.10.3
    ring1_addr: 10.20.20.3
  }

  node {
    name: uno
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
    ring1_addr: 10.20.20.1
  }

}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: testcluster
  config_version: 4
  ip_version: ipv4-6
  secauth: on
  version: 2
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}
----

The new link will be enabled as soon as you follow the last steps to
xref:pvecm_edit_corosync_conf[edit the corosync.conf file]. A restart should not
be necessary. You can check that corosync loaded the new link using:

----
journalctl -b -u corosync
----

It might be a good idea to test the new link by temporarily disconnecting the
old link on one node and making sure that its status remains online while
disconnected:

----
pvecm status
----

If you see a healthy cluster state, it means that your new link is being used.


Role of SSH in {pve} Clusters
-----------------------------

{pve} utilizes SSH tunnels for various features.

* Proxying console/shell sessions (node and guests)
+
When using the shell for node B while being connected to node A, the web
interface connects to a terminal proxy on node A, which is in turn connected
to the login shell on node B via a non-interactive SSH tunnel.

* VM and CT memory and local-storage migration in 'secure' mode.
+
During the migration, one or more SSH tunnel(s) are established between the
source and target nodes, in order to exchange migration information and
transfer memory and disk contents.

* Storage replication

SSH setup
~~~~~~~~~

On {pve} systems, the following changes are made to the SSH configuration/setup:

* the `root` user's SSH client config gets set up to prefer `AES` over `ChaCha20`

* the `root` user's `authorized_keys` file gets linked to
`/etc/pve/priv/authorized_keys`, merging all authorized keys within a cluster

* `sshd` is configured to allow logging in as root with a password

NOTE: Older systems might also have `/etc/ssh/ssh_known_hosts` set up as a
symlink pointing to `/etc/pve/priv/known_hosts`, containing a merged version of
all node host keys. This system was replaced with explicit host key pinning in
`pve-cluster <<INSERT VERSION>>`; the symlink can be deconfigured if still in
place by running `pvecm updatecerts --unmerge-known-hosts`.

Pitfalls due to automatic execution of `.bashrc` and siblings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In case you have a custom `.bashrc`, or similar files that get executed on
login by the configured shell, `ssh` will automatically run it once the session
is established successfully. This can cause unexpected behavior, as those
commands may be executed with root permissions on any of the operations
described above, with possibly problematic side effects!

In order to avoid such complications, it's recommended to add a check in
`/root/.bashrc` to make sure the session is interactive, and only then run
`.bashrc` commands.

You can add this snippet at the beginning of your `.bashrc` file:

----
# Early exit if not running interactively to avoid side-effects!
case $- in
    *i*) ;;
      *) return;;
esac
----

Corosync External Vote Support
------------------------------

This section describes a way to deploy an external voter in a {pve} cluster.
When configured, the cluster can sustain more node failures without
violating safety properties of the cluster communication.

For this to work, there are two services involved:

* A QDevice daemon which runs on each {pve} node

* An external vote daemon which runs on an independent server

As a result, you can achieve higher availability, even in smaller setups (for
example 2+1 nodes).

QDevice Technical Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~

The Corosync Quorum Device (QDevice) is a daemon which runs on each cluster
node. It provides a configured number of votes to the cluster's quorum
subsystem, based on an externally running third-party arbitrator's decision.
Its primary use is to allow a cluster to sustain more node failures than
standard quorum rules allow. This can be done safely as the external device
can see all nodes and thus choose only one set of nodes to give its vote.
This will only be done if said set of nodes can have quorum (again) after
receiving the third-party vote.

Currently, only 'QDevice Net' is supported as a third-party arbitrator. This is
a daemon which provides a vote to a cluster partition, if it can reach the
partition members over the network. It will only give votes to one partition
of a cluster at any time.
It's designed to support multiple clusters and is almost configuration and
state free. New clusters are handled dynamically and no configuration file
is needed on the host running a QDevice.

The only requirements for the external host are that it needs network access to
the cluster and to have a corosync-qnetd package available. We provide a package
for Debian based hosts, and other Linux distributions should also have a package
available through their respective package manager.

NOTE: Unlike corosync itself, a QDevice connects to the cluster over TCP/IP.
The daemon can also run outside the LAN of the cluster and isn't limited to the
low latency requirements of corosync.

Supported Setups
~~~~~~~~~~~~~~~~

We support QDevices for clusters with an even number of nodes and recommend
it for 2 node clusters, if they should provide higher availability.
For clusters with an odd node count, we currently discourage the use of
QDevices. The reason for this is the difference in the votes which the QDevice
provides for each cluster type. Even numbered clusters get a single additional
vote, which only increases availability, because if the QDevice
itself fails, you are in the same position as with no QDevice at all.

On the other hand, with an odd numbered cluster size, the QDevice provides
'(N-1)' votes -- where 'N' corresponds to the cluster node count. This
alternative behavior makes sense; if it had only one additional vote, the
cluster could get into a split-brain situation. This algorithm allows for all
nodes but one (and naturally the QDevice itself) to fail. However, there are two
drawbacks to this:

* If the QNet daemon itself fails, no other node may fail or the cluster
immediately loses quorum. For example, in a cluster with 15 nodes, 7
could fail before the cluster becomes inquorate. But, if a QDevice is
configured here and it itself fails, **no single node** of the 15 may fail.
The QDevice acts almost as a single point of failure in this case.

* The fact that all but one node plus QDevice may fail sounds promising at
first, but this may result in a mass recovery of HA services, which could
overload the single remaining node. Furthermore, a Ceph server will stop
providing services if only '((N-1)/2)' nodes or less remain online.

If you understand the drawbacks and implications, you can decide yourself if
you want to use this technology in an odd numbered cluster setup.

QDevice-Net Setup
~~~~~~~~~~~~~~~~~

We recommend running any daemon which provides votes to corosync-qdevice as an
unprivileged user. {pve} and Debian provide a package which is already
configured to do so.
The traffic between the daemon and the cluster must be encrypted to ensure a
safe and secure integration of the QDevice in {pve}.

First, install the 'corosync-qnetd' package on your external server

----
external# apt install corosync-qnetd
----

and the 'corosync-qdevice' package on all cluster nodes

----
pve# apt install corosync-qdevice
----

After doing this, ensure that all the nodes in the cluster are online.

You can now set up your QDevice by running the following command on one
of the {pve} nodes:

----
pve# pvecm qdevice setup <QDEVICE-IP>
----

The SSH key from the cluster will be automatically copied to the QDevice.

NOTE: Make sure to set up key-based access for the root user on your external
server, or temporarily allow root login with password during the setup phase.
If you receive an error such as 'Host key verification failed.' at this
stage, running `pvecm updatecerts` could fix the issue.

After all the steps have successfully completed, you will see "Done". You can
verify that the QDevice has been set up with:

----
pve# pvecm status

...

Votequorum information
~~~~~~~~~~~~~~~~~~~~~~
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate Qdevice

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 192.168.22.180 (local)
0x00000002          1    A,V,NMW 192.168.22.181
0x00000000          1            Qdevice

----

[[pvecm_qdevice_status_flags]]
QDevice Status Flags
^^^^^^^^^^^^^^^^^^^^

The status output of the QDevice, as seen above, will usually contain three
columns:

* `A` / `NA`: Alive or Not Alive. Indicates if the communication to the external
`corosync-qnetd` daemon works.
* `V` / `NV`: If the QDevice will cast a vote for the node. In a split-brain
situation, where the corosync connection between the nodes is down, but they
both can still communicate with the external `corosync-qnetd` daemon,
only one node will get the vote.
* `MW` / `NMW`: Master wins (`MW`) or not (`NMW`). The default is `NMW`, see
footnote:[`votequorum_qdevice_master_wins` manual page
https://manpages.debian.org/bookworm/libvotequorum-dev/votequorum_qdevice_master_wins.3.en.html].
* `NR`: QDevice is not registered.

NOTE: If your QDevice is listed as `Not Alive` (`NA` in the output above),
ensure that port `5403` (the default port of the qnetd server) of your external
server is reachable via TCP/IP!
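
One way to test this from a cluster node is a plain TCP probe, for example with
`nc` (assuming it is installed; the address is a placeholder for your external
server):

----
# nc -zv 192.168.22.190 5403
----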


Frequently Asked Questions
~~~~~~~~~~~~~~~~~~~~~~~~~~

Tie Breaking
^^^^^^^^^^^^

In case of a tie, where two same-sized cluster partitions cannot see each other
but can see the QDevice, the QDevice chooses one of those partitions randomly
and provides a vote to it.

Possible Negative Implications
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For clusters with an even node count, there are no negative implications when
using a QDevice. If it fails to work, it is the same as not having a QDevice
at all.

Adding/Deleting Nodes After QDevice Setup
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you want to add a new node or remove an existing one from a cluster with a
QDevice setup, you need to remove the QDevice first. After that, you can add or
remove nodes normally. Once you have a cluster with an even node count again,
you can set up the QDevice again as described previously.

Removing the QDevice
^^^^^^^^^^^^^^^^^^^^

If you used the official `pvecm` tool to add the QDevice, you can remove it
by running:

----
pve# pvecm qdevice remove
----

//Still TODO
//^^^^^^^^^^
//There is still stuff to add here


Corosync Configuration
----------------------

The `/etc/pve/corosync.conf` file plays a central role in a {pve} cluster. It
controls the cluster membership and its network.
For further information about it, check the corosync.conf man page:
[source,bash]
----
man corosync.conf
----

For node membership, you should always use the `pvecm` tool provided by {pve}.
You may have to edit the configuration file manually for other changes.
Here are a few best practice tips for doing this.

[[pvecm_edit_corosync_conf]]
Edit corosync.conf
~~~~~~~~~~~~~~~~~~

Editing the corosync.conf file is not always very straightforward. There are
two on each cluster node, one in `/etc/pve/corosync.conf` and the other in
`/etc/corosync/corosync.conf`. Editing the one in our cluster file system will
propagate the changes to the local one, but not vice versa.

The configuration will get updated automatically, as soon as the file changes.
This means that changes which can be integrated in a running corosync will take
effect immediately. Thus, you should always make a copy and edit that instead,
to avoid triggering unintended changes when saving the file while editing.

[source,bash]
----
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
----

Then, open the config file with your favorite editor, such as `nano` or
`vim.tiny`, which come pre-installed on every {pve} node.

NOTE: Always increment the 'config_version' number after configuration changes;
omitting this can lead to problems.

After making the necessary changes, create another copy of the current working
configuration file. This serves as a backup if the new configuration fails to
apply or causes other issues.

[source,bash]
----
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.bak
----

Then replace the old configuration file with the new one:
[source,bash]
----
mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf
----

You can check if the changes could be applied automatically, using the following
commands:
[source,bash]
----
systemctl status corosync
journalctl -b -u corosync
----

If the changes could not be applied automatically, you may have to restart the
corosync service via:
[source,bash]
----
systemctl restart corosync
----

On errors, check the troubleshooting section below.

Troubleshooting
~~~~~~~~~~~~~~~

Issue: 'quorum.expected_votes must be configured'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When corosync starts to fail and you get the following message in the system log:

----
[...]
corosync[1647]:  [QUORUM] Quorum provider: corosync_votequorum failed to initialize.
corosync[1647]:  [SERV  ] Service engine 'corosync_quorum' failed to load for reason
    'configuration error: nodelist or quorum.expected_votes must be configured!'
[...]
----

It means that the hostname you set for a corosync 'ringX_addr' in the
configuration could not be resolved.
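
You can check, for example, how a configured name resolves on the affected node
with `getent`, which uses the same `getaddrinfo` mechanism (the node name `due`
is a placeholder):

----
# getent ahosts due
----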

Write Configuration When Not Quorate
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you need to change '/etc/pve/corosync.conf' on a node with no quorum, and you
understand what you are doing, use:
[source,bash]
----
pvecm expected 1
----

This sets the expected vote count to 1 and makes the cluster quorate. You can
then fix your configuration, or revert it back to the last working backup.

This is not enough if corosync cannot start anymore. In that case, it is best to
edit the local copy of the corosync configuration in
'/etc/corosync/corosync.conf', so that corosync can start again. Ensure that on
all nodes, this configuration has the same content to avoid split-brain
situations.


[[pvecm_corosync_conf_glossary]]
Corosync Configuration Glossary
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ringX_addr::
This names the different link addresses for the Kronosnet connections between
nodes.


Cluster Cold Start
------------------

It is obvious that a cluster is not quorate when all nodes are
offline. This is a common case after a power failure.

NOTE: It is always a good idea to use an uninterruptible power supply
(``UPS'', also called ``battery backup'') to avoid this state, especially if
you want HA.

On node startup, the `pve-guests` service is started and waits for
quorum. Once quorate, it starts all guests which have the `onboot`
flag set.
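
The `onboot` flag can be set per guest in the web interface or on the command
line, for example (VMID `100` is a placeholder; for containers, use `pct set`
accordingly):

----
# qm set 100 --onboot 1
----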

When you turn on nodes, or when power comes back after power failure,
it is likely that some nodes will boot faster than others. Please keep in
mind that guest startup is delayed until you reach quorum.


[[pvecm_next_id_range]]
Guest VMID Auto-Selection
-------------------------

When creating new guests the web interface will ask the backend for a free VMID
automatically. The default range for searching is `100` to `1000000` (lower
than the maximal allowed VMID enforced by the schema).

Sometimes admins either want to allocate new VMIDs in a separate range, for
example to easily separate temporary VMs from ones that choose a VMID manually.
Other times it's simply desirable to provide a fixed-length VMID, for which
setting the lower boundary to, for example, `100000` gives much more room.

To accommodate this use case, one can set either the lower, upper, or both
boundaries via the `datacenter.cfg` configuration file, which can be edited in
the web interface under 'Datacenter' -> 'Options'.
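
For example, restricting auto-selection to six-digit VMIDs could look like the
following `datacenter.cfg` entry; the boundary values are just an illustration:

----
next-id: lower=100000,upper=999999
----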

NOTE: The range is only used for the next-id API call, so it isn't a hard
limit.

Guest Migration
---------------

Migrating virtual guests to other nodes is a useful feature in a
cluster. There are settings to control the behavior of such
migrations. This can be done via the configuration file
`datacenter.cfg` or for a specific migration via API or command-line
parameters.

It makes a difference if a guest is online or offline, or if it has
local resources (like a local disk).

For details about virtual machine migration, see the
xref:qm_migration[QEMU/KVM Migration Chapter].

For details about container migration, see the
xref:pct_migration[Container Migration Chapter].

Migration Type
~~~~~~~~~~~~~~

The migration type defines if the migration data should be sent over an
encrypted (`secure`) channel or an unencrypted (`insecure`) one.
Setting the migration type to `insecure` means that the RAM content of a
virtual guest is also transferred unencrypted, which can lead to
information disclosure of critical data from inside the guest (for
example, passwords or encryption keys).

Therefore, we strongly recommend using the secure channel if you do
not have full control over the network and can not guarantee that no
one is eavesdropping on it.

NOTE: Storage migration does not follow this setting. Currently, it
always sends the storage content over a secure channel.

Encryption requires a lot of computing power, so this setting is often
changed to `insecure` to achieve better performance. The impact on
modern systems is lower because they implement AES encryption in
hardware. The performance impact is particularly evident in fast
networks, where you can transfer 10 Gbps or more.
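
For a single migration, the type can be set, for example, via the
`migration_type` parameter of the command-line tool (the VMID and target node
are placeholders):

----
# qm migrate 106 tre --online --migration_type insecure
----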

Migration Network
~~~~~~~~~~~~~~~~~

By default, {pve} uses the network in which cluster communication
takes place to send the migration traffic. This is not optimal, both because
sensitive cluster traffic can be disrupted and because this network may not
have the best bandwidth available on the node.

Setting the migration network parameter allows the use of a dedicated
network for all migration traffic. In addition to the memory,
this also affects the storage traffic for offline migrations.

The migration network is set as a network using CIDR notation. This
has the advantage that you don't have to set individual IP addresses
for each node. {pve} can determine the real address on the
destination node from the network specified in the CIDR form. To
enable this, the network must be specified so that each node has exactly one
IP in the respective network.

Example
^^^^^^^

We assume that we have a three-node setup, with three separate
networks. One for public communication with the Internet, one for
cluster communication, and a very fast one, which we want to use as a
dedicated network for migration.

A network configuration for such a setup might look as follows:

----
iface eno1 inet manual

# public network
auto vmbr0
iface vmbr0 inet static
        address 192.X.Y.57/24
        gateway 192.X.Y.1
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0

# cluster network
auto eno2
iface eno2 inet static
        address 10.1.1.1/24

# fast network
auto eno3
iface eno3 inet static
        address 10.1.2.1/24
----

Here, we will use the network 10.1.2.0/24 as a migration network. For
a single migration, you can do this using the `migration_network`
parameter of the command-line tool:

----
# qm migrate 106 tre --online --migration_network 10.1.2.0/24
----

To configure this as the default network for all migrations in the
cluster, set the `migration` property of the `/etc/pve/datacenter.cfg`
file:

----
# use dedicated migration network
migration: secure,network=10.1.2.0/24
----

NOTE: The migration type must always be set when the migration network
is set in `/etc/pve/datacenter.cfg`.


ifdef::manvolnum[]
include::pve-copyright.adoc[]
endif::manvolnum[]