[[chapter_pvecm]]
ifdef::manvolnum[]
pvecm(1)
========
:pve-toplevel:

NAME
----

pvecm - Proxmox VE Cluster Manager

SYNOPSIS
--------

include::pvecm.1-synopsis.adoc[]

DESCRIPTION
-----------
endif::manvolnum[]

ifndef::manvolnum[]
Cluster Manager
===============
:pve-toplevel:
endif::manvolnum[]

The {pve} cluster manager `pvecm` is a tool to create a group of
physical servers. Such a group is called a *cluster*. We use the
http://www.corosync.org[Corosync Cluster Engine] for reliable group
communication. There's no explicit limit on the number of nodes in a cluster.
In practice, the actual possible node count may be limited by the host and
network performance. Currently (2021), there are reports of clusters (using
high-end enterprise hardware) with over 50 nodes in production.

`pvecm` can be used to create a new cluster, join nodes to a cluster,
leave the cluster, get status information, and do various other cluster-related
tasks. The **P**rox**m**o**x** **C**luster **F**ile **S**ystem (``pmxcfs'')
is used to transparently distribute the cluster configuration to all cluster
nodes.

Grouping nodes into a cluster has the following advantages:

* Centralized, web-based management

* Multi-master clusters: each node can do all management tasks

* Use of `pmxcfs`, a database-driven file system, for storing configuration
files, replicated in real-time on all nodes using `corosync`

* Easy migration of virtual machines and containers between physical
hosts

* Fast deployment

* Cluster-wide services like firewall and HA


Requirements
------------

* All nodes must be able to connect to each other via UDP ports 5404 and 5405
for corosync to work.

* Date and time must be synchronized.

* An SSH tunnel on TCP port 22 between nodes is required.

* If you are interested in High Availability, you need to have at
least three nodes for reliable quorum. All nodes should have the
same version.

* We recommend a dedicated NIC for the cluster traffic, especially if
you use shared storage.

* The root password of a cluster node is required for adding nodes.

NOTE: It is not possible to mix {pve} 3.x and earlier with {pve} 4.x cluster
nodes.

NOTE: While it's possible to mix {pve} 4.4 and {pve} 5.0 nodes, doing so is
not supported as a production configuration and should only be done temporarily,
during an upgrade of the whole cluster from one major version to another.

NOTE: Running a cluster of {pve} 6.x with earlier versions is not possible. The
cluster protocol (corosync) between {pve} 6.x and earlier versions changed
fundamentally. The corosync 3 packages for {pve} 5.4 are only intended for the
upgrade procedure to {pve} 6.0.


Preparing Nodes
---------------

First, install {pve} on all nodes. Make sure that each node is
installed with the final hostname and IP configuration. Changing the
hostname and IP is not possible after cluster creation.

While it's common to reference all node names and their IPs in `/etc/hosts` (or
make their names resolvable through other means), this is not necessary for a
cluster to work. It may be useful however, as you can then connect from one node
to another via SSH, using the easier to remember node name (see also
xref:pvecm_corosync_addresses[Link Address Types]). Note that we always
recommend referencing nodes by their IP addresses in the cluster configuration.

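If you do maintain `/etc/hosts`, the entries for a three-node cluster could look like the following (names and addresses are hypothetical examples):

----
192.168.15.91 hp1.example.com hp1
192.168.15.92 hp2.example.com hp2
192.168.15.93 hp3.example.com hp3
----

Keep such entries identical on all nodes, so that a name resolves to the same address everywhere.
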

[[pvecm_create_cluster]]
Create a Cluster
----------------

You can either create a cluster on the console (login via `ssh`), or through
the API using the {pve} web interface (__Datacenter -> Cluster__).

NOTE: Use a unique name for your cluster. This name cannot be changed later.
The cluster name follows the same rules as node names.

[[pvecm_cluster_create_via_gui]]
Create via Web GUI
~~~~~~~~~~~~~~~~~~

[thumbnail="screenshot/gui-cluster-create.png"]

Under __Datacenter -> Cluster__, click on *Create Cluster*. Enter the cluster
name and select a network connection from the drop-down list to serve as the
main cluster network (Link 0). It defaults to the IP resolved via the node's
hostname.

As of {pve} 6.2, up to 8 fallback links can be added to a cluster. To add a
redundant link, click the 'Add' button and select a link number and IP address
from the respective fields. Prior to {pve} 6.2, to add a second link as
fallback, you can select the 'Advanced' checkbox and choose an additional
network interface (Link 1, see also xref:pvecm_redundancy[Corosync Redundancy]).

NOTE: Ensure that the network selected for cluster communication is not used for
any high traffic purposes, like network storage or live-migration.
While the cluster network itself produces small amounts of data, it is very
sensitive to latency. Check out the full
xref:pvecm_cluster_network_requirements[cluster network requirements].

[[pvecm_cluster_create_via_cli]]
Create via the Command Line
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Log in via `ssh` to the first {pve} node and run the following command:

----
hp1# pvecm create CLUSTERNAME
----

To check the state of the new cluster use:

----
hp1# pvecm status
----

Multiple Clusters in the Same Network
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is possible to create multiple clusters in the same physical or logical
network. In this case, each cluster must have a unique name to avoid possible
clashes in the cluster communication stack. Furthermore, this helps avoid human
confusion by making clusters clearly distinguishable.

While the bandwidth requirement of a corosync cluster is relatively low, the
latency of packets and the packets per second (PPS) rate is the limiting
factor. Different clusters in the same network can compete with each other for
these resources, so it may still make sense to use separate physical network
infrastructure for bigger clusters.

[[pvecm_join_node_to_cluster]]
Adding Nodes to the Cluster
---------------------------

CAUTION: A node that is about to be added to the cluster cannot hold any guests.
All existing configuration in `/etc/pve` is overwritten when joining a cluster,
since guest IDs could otherwise conflict. As a workaround, you can create a
backup of the guest (`vzdump`) and restore it under a different ID, after the
node has been added to the cluster.

Join Node to Cluster via GUI
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[thumbnail="screenshot/gui-cluster-join-information.png"]

Log in to the web interface on an existing cluster node. Under __Datacenter ->
Cluster__, click the *Join Information* button at the top. Then, click on the
button *Copy Information*. Alternatively, copy the string from the 'Information'
field manually.

[thumbnail="screenshot/gui-cluster-join.png"]

Next, log in to the web interface on the node you want to add.
Under __Datacenter -> Cluster__, click on *Join Cluster*. Fill in the
'Information' field with the 'Join Information' text you copied earlier.
Most settings required for joining the cluster will be filled out
automatically. For security reasons, the cluster password has to be entered
manually.

NOTE: To enter all required data manually, you can disable the 'Assisted Join'
checkbox.

After clicking the *Join* button, the cluster join process will start
immediately. After the node has joined the cluster, its current node certificate
will be replaced by one signed by the cluster certificate authority (CA).
This means that the current session will stop working after a few seconds. You
then might need to force-reload the web interface and log in again with the
cluster credentials.

Now your node should be visible under __Datacenter -> Cluster__.

Join Node to Cluster via Command Line
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Log in via `ssh` to the node you want to join into an existing cluster.

----
# pvecm add IP-ADDRESS-CLUSTER
----

For `IP-ADDRESS-CLUSTER`, use the IP or hostname of an existing cluster node.
An IP address is recommended (see xref:pvecm_corosync_addresses[Link Address Types]).

To check the state of the cluster use:

----
# pvecm status
----

.Cluster status after adding 4 nodes
----
# pvecm status
Cluster information
~~~~~~~~~~~~~~~~~~~
Name:             prod-central
Config Version:   3
Transport:        knet
Secure auth:      on

Quorum information
~~~~~~~~~~~~~~~~~~
Date:             Tue Sep 14 11:06:47 2021
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000001
Ring ID:          1.1a8
Quorate:          Yes

Votequorum information
~~~~~~~~~~~~~~~~~~~~~~
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3
Flags:            Quorate

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes Name
0x00000001          1 192.168.15.91
0x00000002          1 192.168.15.92 (local)
0x00000003          1 192.168.15.93
0x00000004          1 192.168.15.94
----

If you only want a list of all nodes, use:

----
# pvecm nodes
----

.List nodes in a cluster
----
# pvecm nodes

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes Name
         1          1 hp1
         2          1 hp2 (local)
         3          1 hp3
         4          1 hp4
----

[[pvecm_adding_nodes_with_separated_cluster_network]]
Adding Nodes with Separated Cluster Network
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When adding a node to a cluster with a separated cluster network, you need to
use the 'link0' parameter to set the node's address on that network:

[source,bash]
----
pvecm add IP-ADDRESS-CLUSTER -link0 LOCAL-IP-ADDRESS-LINK0
----

If you want to use the built-in xref:pvecm_redundancy[redundancy] of the
Kronosnet transport layer, also use the 'link1' parameter.

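For example, a join that sets both a primary and a redundant link address could look like this (a sketch; the capitalized values are placeholders, as above):

[source,bash]
----
pvecm add IP-ADDRESS-CLUSTER -link0 LOCAL-IP-ADDRESS-LINK0 -link1 LOCAL-IP-ADDRESS-LINK1
----
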
Using the GUI, you can select the correct interface from the corresponding
'Link X' fields in the *Cluster Join* dialog.

Remove a Cluster Node
---------------------

CAUTION: Read the procedure carefully before proceeding, as it may
not be what you want or need.

Move all virtual machines from the node. Ensure that you have made copies of any
local data or backups that you want to keep. In addition, make sure to remove
any scheduled replication jobs to the node to be removed.

CAUTION: Failure to remove replication jobs to a node before removing said node
will result in the replication job becoming irremovable. Especially note that
replication automatically switches direction if a replicated VM is migrated, so
by migrating a replicated VM from a node to be deleted, replication jobs will be
set up to that node automatically.

In the following example, we will remove the node hp4 from the cluster.

Log in to a *different* cluster node (not hp4), and issue a `pvecm nodes`
command to identify the node ID to remove:

----
hp1# pvecm nodes

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes Name
         1          1 hp1 (local)
         2          1 hp2
         3          1 hp3
         4          1 hp4
----


At this point, you must power off hp4 and ensure that it will not power on
again (in the network) with its current configuration.

IMPORTANT: As mentioned above, it is critical to power off the node
*before* removal, and make sure that it will *not* power on again
(in the existing cluster network) with its current configuration.
If you power on the node as it is, the cluster could end up broken,
and it could be difficult to restore it to a functioning state.

After powering off the node hp4, we can safely remove it from the cluster.

----
hp1# pvecm delnode hp4
Killing node 4
----

NOTE: At this point, it is possible that you will receive an error message
stating `Could not kill node (error = CS_ERR_NOT_EXIST)`. This does not
signify an actual failure in the deletion of the node, but rather a failure in
corosync trying to kill an offline node. Thus, it can be safely ignored.

Use `pvecm nodes` or `pvecm status` to check the node list again. It should
look something like:

----
hp1# pvecm status

...

Votequorum information
~~~~~~~~~~~~~~~~~~~~~~
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes Name
0x00000001          1 192.168.15.90 (local)
0x00000002          1 192.168.15.91
0x00000003          1 192.168.15.92
----

If, for whatever reason, you want this server to join the same cluster again,
you have to:

* do a fresh install of {pve} on it,

* then join it, as explained in the previous section.

NOTE: After removal of the node, its SSH fingerprint will still reside in the
'known_hosts' of the other nodes. If you receive an SSH error after rejoining
a node with the same IP or hostname, run `pvecm updatecerts` once on the
re-added node to update its fingerprint cluster-wide.

[[pvecm_separate_node_without_reinstall]]
Separate a Node Without Reinstalling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

CAUTION: This is *not* the recommended method; proceed with caution. Use the
previous method if you're unsure.

You can also separate a node from a cluster without reinstalling it from
scratch. But after removing the node from the cluster, it will still have
access to any shared storage. This must be resolved before you start removing
the node from the cluster. A {pve} cluster cannot share the exact same
storage with another cluster, as storage locking doesn't work over the cluster
boundary. Furthermore, it may also lead to VMID conflicts.

It's suggested that you create a new storage, to which only the node that you
want to separate has access. This can be a new export on your NFS or a new Ceph
pool, to name a few examples. It's just important that the exact same storage
does not get accessed by multiple clusters. After setting up this storage, move
all data and VMs from the node to it. Then you are ready to separate the
node from the cluster.

WARNING: Ensure that all shared resources are cleanly separated! Otherwise you
will run into conflicts and problems.

First, stop the corosync and pve-cluster services on the node:
[source,bash]
----
systemctl stop pve-cluster
systemctl stop corosync
----

Start the cluster file system again in local mode:
[source,bash]
----
pmxcfs -l
----

Delete the corosync configuration files:
[source,bash]
----
rm /etc/pve/corosync.conf
rm -r /etc/corosync/*
----

You can now start the file system again as a normal service:
[source,bash]
----
killall pmxcfs
systemctl start pve-cluster
----

The node is now separated from the cluster. You can delete it from any
remaining node of the cluster with:
[source,bash]
----
pvecm delnode oldnode
----

If the command fails due to a loss of quorum in the remaining node, you can set
the expected votes to 1 as a workaround:
[source,bash]
----
pvecm expected 1
----

And then repeat the 'pvecm delnode' command.

Now switch back to the separated node and delete all the remaining cluster
files on it. This ensures that the node can be added to another cluster again
without problems.

[source,bash]
----
rm /var/lib/corosync/*
----

As the configuration files from the other nodes are still in the cluster
file system, you may want to clean those up too. After making absolutely sure
that you have the correct node name, you can simply remove the entire
directory recursively from '/etc/pve/nodes/NODENAME'.

CAUTION: The node's SSH keys will remain in the 'authorized_key' file. This
means that the nodes can still connect to each other with public key
authentication. You should fix this by removing the respective keys from the
'/etc/pve/priv/authorized_keys' file.

Quorum
------

{pve} uses a quorum-based technique to provide a consistent state among
all cluster nodes.

[quote, from Wikipedia, Quorum (distributed computing)]
____
A quorum is the minimum number of votes that a distributed transaction
has to obtain in order to be allowed to perform an operation in a
distributed system.
____

In case of network partitioning, state changes require that a
majority of nodes are online. The cluster switches to read-only mode
if it loses quorum.

NOTE: {pve} assigns a single vote to each node by default.

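The majority rule can be made concrete with a little arithmetic. The following plain-shell sketch (not a `pvecm` feature) computes the strict majority needed for quorum among N single-vote nodes:

[source,bash]
----
# Strict majority: more than half of all votes. With 4 nodes this yields 3,
# which matches the "Quorum: 3" line in the 4-node status output shown earlier.
quorum_votes() {
    echo $(( $1 / 2 + 1 ))
}

quorum_votes 4    # prints 3
----

This is also why two-node clusters are fragile: the majority of 2 votes is 2, so losing either node already costs quorum.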
Cluster Network
---------------

The cluster network is the core of a cluster. All messages sent over it have to
be delivered reliably to all nodes in their respective order. In {pve} this
part is done by corosync, an implementation of a high-performance, low-overhead,
high-availability development toolkit. It serves our decentralized configuration
file system (`pmxcfs`).

[[pvecm_cluster_network_requirements]]
Network Requirements
~~~~~~~~~~~~~~~~~~~~

This needs a reliable network with latencies under 2 milliseconds (LAN
performance) to work properly. The network should not be used heavily by other
members; ideally corosync runs on its own network. Do not use a shared network
for corosync and storage (except as a potential low-priority fallback in a
xref:pvecm_redundancy[redundant] configuration).

Before setting up a cluster, it is good practice to check if the network is fit
for that purpose. To ensure that the nodes can connect to each other on the
cluster network, you can test the connectivity between them with the `ping`
tool.

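As a sketch, the average round-trip time can be pulled out of `ping`'s summary line. The helper below assumes the usual iputils/BSD `min/avg/max` summary format; `192.0.2.10` is a placeholder documentation address, not a real node:

[source,bash]
----
# Print the average RTT in milliseconds; for corosync it should stay well
# under 2 ms on the prospective cluster network.
avg_rtt_ms() {
    ping -c 10 -q "$1" | awk -F'/' '/^(rtt|round-trip)/ { print $5 }'
}

# Example invocation (placeholder address): avg_rtt_ms 192.0.2.10
----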
If the {pve} firewall is enabled, ACCEPT rules for corosync will automatically
be generated - no manual action is required.

NOTE: Corosync used Multicast before version 3.0 (introduced in {pve} 6.0).
Modern versions rely on https://kronosnet.org/[Kronosnet] for cluster
communication, which, for now, only supports regular UDP unicast.

CAUTION: You can still enable Multicast or legacy unicast by setting your
transport to `udp` or `udpu` in your xref:pvecm_edit_corosync_conf[corosync.conf],
but keep in mind that this will disable all cryptography and redundancy support.
This is therefore not recommended.

Separate Cluster Network
~~~~~~~~~~~~~~~~~~~~~~~~

When creating a cluster without any parameters, the corosync cluster network is
generally shared with the web interface and the VMs' network. Depending on
your setup, even storage traffic may get sent over the same network. It's
recommended to change that, as corosync is a time-critical, real-time
application.

Setting Up a New Network
^^^^^^^^^^^^^^^^^^^^^^^^

First, you have to set up a new network interface. It should be on a physically
separate network. Ensure that your network fulfills the
xref:pvecm_cluster_network_requirements[cluster network requirements].

Separate On Cluster Creation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is possible via the 'linkX' parameters of the 'pvecm create'
command, used for creating a new cluster.

If you have set up an additional NIC with a static address on 10.10.10.1/25,
and want to send and receive all cluster communication over this interface,
you would execute:

[source,bash]
----
pvecm create test --link0 10.10.10.1
----

To check if everything is working properly, execute:
[source,bash]
----
systemctl status corosync
----

Afterwards, proceed as described above to
xref:pvecm_adding_nodes_with_separated_cluster_network[add nodes with a separated cluster network].

[[pvecm_separate_cluster_net_after_creation]]
Separate After Cluster Creation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can do this if you have already created a cluster and want to switch
its communication to another network, without rebuilding the whole cluster.
This change may lead to short periods of quorum loss in the cluster, as nodes
have to restart corosync and come up one after the other on the new network.

Check how to xref:pvecm_edit_corosync_conf[edit the corosync.conf file] first.
Then, open it and you should see a file similar to:

----
logging {
  debug: off
  to_syslog: yes
}

nodelist {

  node {
    name: due
    nodeid: 2
    quorum_votes: 1
    ring0_addr: due
  }

  node {
    name: tre
    nodeid: 3
    quorum_votes: 1
    ring0_addr: tre
  }

  node {
    name: uno
    nodeid: 1
    quorum_votes: 1
    ring0_addr: uno
  }

}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: testcluster
  config_version: 3
  ip_version: ipv4-6
  secauth: on
  version: 2
  interface {
    linknumber: 0
  }

}
----

NOTE: `ringX_addr` actually specifies a corosync *link address*. The name "ring"
is a remnant of older corosync versions that is kept for backwards
compatibility.

The first thing you want to do is add the 'name' properties in the node entries,
if you do not see them already. Those *must* match the node name.

Then replace all addresses from the 'ring0_addr' properties of all nodes with
the new addresses. You may use plain IP addresses or hostnames here. If you use
hostnames, ensure that they are resolvable from all nodes (see also
xref:pvecm_corosync_addresses[Link Address Types]).

In this example, we want to switch cluster communication to the
10.10.10.0/25 network, so we change the 'ring0_addr' of each node respectively.

NOTE: The exact same procedure can be used to change other 'ringX_addr' values
as well. However, we recommend only changing one link address at a time, so
that it's easier to recover if something goes wrong.

After we increase the 'config_version' property, the new configuration file
should look like:

----
logging {
  debug: off
  to_syslog: yes
}

nodelist {

  node {
    name: due
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2
  }

  node {
    name: tre
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.10.3
  }

  node {
    name: uno
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
  }

}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: testcluster
  config_version: 4
  ip_version: ipv4-6
  secauth: on
  version: 2
  interface {
    linknumber: 0
  }

}
----

Then, after a final check to see that all changed information is correct, we
save it and once again follow the
xref:pvecm_edit_corosync_conf[edit corosync.conf file] section to bring it into
effect.

The changes will be applied live, so restarting corosync is not strictly
necessary. If you changed other settings as well, or notice corosync
complaining, you can optionally trigger a restart.

On a single node, execute:

[source,bash]
----
systemctl restart corosync
----

Now check if everything is okay:

[source,bash]
----
systemctl status corosync
----

If corosync begins to work again, restart it on all other nodes too.
They will then join the cluster membership one by one on the new network.

[[pvecm_corosync_addresses]]
Corosync Addresses
~~~~~~~~~~~~~~~~~~

A corosync link address (for backwards compatibility denoted by 'ringX_addr' in
`corosync.conf`) can be specified in two ways:

* **IPv4/v6 addresses** can be used directly. They are recommended, since they
are static and usually not changed carelessly.

* **Hostnames** will be resolved using `getaddrinfo`, which means that by
default, IPv6 addresses will be used first, if available (see also
`man gai.conf`). Keep this in mind, especially when upgrading an existing
cluster to IPv6.

CAUTION: Hostnames should be used with care, since the addresses they
resolve to can be changed without touching corosync or the node it runs on -
which may lead to a situation where an address is changed without thinking
about implications for corosync.

A separate, static hostname specifically for corosync is recommended, if
hostnames are preferred. Also, make sure that every node in the cluster can
resolve all hostnames correctly.

Since {pve} 5.1, while supported, hostnames will be resolved at the time of
entry. Only the resolved IP is saved to the configuration.

Nodes that joined the cluster on earlier versions likely still use their
unresolved hostname in `corosync.conf`. It might be a good idea to replace
them with IPs or a separate hostname, as mentioned above.
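
When in doubt about how a hostname will resolve on a given node, you can
inspect the result of a `getaddrinfo`-style lookup with `getent`. The name
below is only a placeholder; substitute the corosync hostname you intend to
use:

[source,bash]
----
# Show the addresses getaddrinfo returns for a name, in resolution order.
# 'localhost' is a stand-in here; use your corosync hostname instead.
getent ahosts localhost
----

If an IPv6 address is listed first, corosync will use it, which matters when
some nodes are not yet reachable over IPv6.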


[[pvecm_redundancy]]
Corosync Redundancy
-------------------

Corosync supports redundant networking via its integrated Kronosnet layer by
default (it is not supported on the legacy udp/udpu transports). It can be
enabled by specifying more than one link address, either via the '--linkX'
parameters of `pvecm`, in the GUI as **Link 1** (while creating a cluster or
adding a new node) or by specifying more than one 'ringX_addr' in
`corosync.conf`.

NOTE: To provide useful failover, every link should be on its own
physical network connection.

Links are used according to a priority setting. You can configure this priority
by setting 'knet_link_priority' in the corresponding interface section in
`corosync.conf`, or, preferably, using the 'priority' parameter when creating
your cluster with `pvecm`:

----
 # pvecm create CLUSTERNAME --link0 10.10.10.1,priority=15 --link1 10.20.20.1,priority=20
----

This would cause 'link1' to be used first, since it has the higher priority.

If no priorities are configured manually (or two links have the same priority),
links will be used in order of their number, with the lower number having higher
priority.
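
The same priorities can also be set directly in `corosync.conf`. As a sketch,
showing only the relevant parts of the `totem` section, the `pvecm create`
example above would correspond to:

----
totem {
  interface {
    linknumber: 0
    knet_link_priority: 15
  }
  interface {
    linknumber: 1
    knet_link_priority: 20
  }
}
----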
790 | ||
791 | Even if all links are working, only the one with the highest priority will see | |
a37d539f DW |
792 | corosync traffic. Link priorities cannot be mixed, meaning that links with |
793 | different priorities will not be able to communicate with each other. | |
e4ec4154 | 794 | |
a9e7c3aa | 795 | Since lower priority links will not see traffic unless all higher priorities |
a37d539f DW |
796 | have failed, it becomes a useful strategy to specify networks used for |
797 | other tasks (VMs, storage, etc.) as low-priority links. If worst comes to | |
798 | worst, a higher latency or more congested connection might be better than no | |
a9e7c3aa | 799 | connection at all. |
e4ec4154 | 800 | |
a9e7c3aa SR |
801 | Adding Redundant Links To An Existing Cluster |
802 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
e4ec4154 | 803 | |
a9e7c3aa SR |
804 | To add a new link to a running configuration, first check how to |
805 | xref:pvecm_edit_corosync_conf[edit the corosync.conf file]. | |
e4ec4154 | 806 | |
a9e7c3aa SR |
807 | Then, add a new 'ringX_addr' to every node in the `nodelist` section. Make |
808 | sure that your 'X' is the same for every node you add it to, and that it is | |
809 | unique for each node. | |
810 | ||
811 | Lastly, add a new 'interface', as shown below, to your `totem` | |
a37d539f | 812 | section, replacing 'X' with the link number chosen above. |
a9e7c3aa SR |
813 | |
814 | Assuming you added a link with number 1, the new configuration file could look | |
815 | like this: | |
e4ec4154 TL |
816 | |
817 | ---- | |
a9e7c3aa SR |
818 | logging { |
819 | debug: off | |
820 | to_syslog: yes | |
e4ec4154 TL |
821 | } |
822 | ||
823 | nodelist { | |
a9e7c3aa | 824 | |
e4ec4154 | 825 | node { |
a9e7c3aa SR |
826 | name: due |
827 | nodeid: 2 | |
e4ec4154 | 828 | quorum_votes: 1 |
a9e7c3aa SR |
829 | ring0_addr: 10.10.10.2 |
830 | ring1_addr: 10.20.20.2 | |
e4ec4154 TL |
831 | } |
832 | ||
a9e7c3aa SR |
833 | node { |
834 | name: tre | |
835 | nodeid: 3 | |
e4ec4154 | 836 | quorum_votes: 1 |
a9e7c3aa SR |
837 | ring0_addr: 10.10.10.3 |
838 | ring1_addr: 10.20.20.3 | |
e4ec4154 TL |
839 | } |
840 | ||
a9e7c3aa SR |
841 | node { |
842 | name: uno | |
843 | nodeid: 1 | |
844 | quorum_votes: 1 | |
845 | ring0_addr: 10.10.10.1 | |
846 | ring1_addr: 10.20.20.1 | |
847 | } | |
848 | ||
849 | } | |
850 | ||
851 | quorum { | |
852 | provider: corosync_votequorum | |
853 | } | |
854 | ||
855 | totem { | |
856 | cluster_name: testcluster | |
857 | config_version: 4 | |
858 | ip_version: ipv4-6 | |
859 | secauth: on | |
860 | version: 2 | |
861 | interface { | |
862 | linknumber: 0 | |
863 | } | |
864 | interface { | |
865 | linknumber: 1 | |
866 | } | |
e4ec4154 | 867 | } |
a9e7c3aa | 868 | ---- |
e4ec4154 | 869 | |
a9e7c3aa SR |
870 | The new link will be enabled as soon as you follow the last steps to |
871 | xref:pvecm_edit_corosync_conf[edit the corosync.conf file]. A restart should not | |
872 | be necessary. You can check that corosync loaded the new link using: | |
e4ec4154 | 873 | |
a9e7c3aa SR |
874 | ---- |
875 | journalctl -b -u corosync | |
e4ec4154 TL |
876 | ---- |
877 | ||
a9e7c3aa SR |
878 | It might be a good idea to test the new link by temporarily disconnecting the |
879 | old link on one node and making sure that its status remains online while | |
880 | disconnected: | |
e4ec4154 | 881 | |
a9e7c3aa SR |
882 | ---- |
883 | pvecm status | |
884 | ---- | |
885 | ||
886 | If you see a healthy cluster state, it means that your new link is being used. | |
e4ec4154 | 887 | |
e4ec4154 | 888 | |
Role of SSH in {pve} Clusters
-----------------------------

{pve} utilizes SSH tunnels for various features.

* Proxying console/shell sessions (node and guests)
+
When using the shell for node B while being connected to node A, the session
connects to a terminal proxy on node A, which is in turn connected to the
login shell on node B via a non-interactive SSH tunnel.

* VM and CT memory and local-storage migration in 'secure' mode.
+
During the migration, one or more SSH tunnel(s) are established between the
source and target nodes, in order to exchange migration information and
transfer memory and disk contents.

* Storage replication

.Pitfalls due to automatic execution of `.bashrc` and siblings
[IMPORTANT]
====
In case you have a custom `.bashrc`, or similar files that get executed on
login by the configured shell, `ssh` will automatically run it once the session
is established successfully. This can cause some unexpected behavior, as those
commands may be executed with root permissions on any of the operations
described above. This can cause possible problematic side-effects!

In order to avoid such complications, it's recommended to add a check in
`/root/.bashrc` to make sure the session is interactive, and only then run
`.bashrc` commands.

You can add this snippet at the beginning of your `.bashrc` file:

----
# Early exit if not running interactively to avoid side-effects!
case $- in
    *i*) ;;
      *) return;;
esac
----
====


Corosync External Vote Support
------------------------------

This section describes a way to deploy an external voter in a {pve} cluster.
When configured, the cluster can sustain more node failures without
violating safety properties of the cluster communication.

For this to work, there are two services involved:

* A QDevice daemon which runs on each {pve} node

* An external vote daemon which runs on an independent server

As a result, you can achieve higher availability, even in smaller setups (for
example 2+1 nodes).

QDevice Technical Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~

The Corosync Quorum Device (QDevice) is a daemon which runs on each cluster
node. It provides a configured number of votes to the cluster's quorum
subsystem, based on an externally running third-party arbitrator's decision.
Its primary use is to allow a cluster to sustain more node failures than
standard quorum rules allow. This can be done safely as the external device
can see all nodes and thus choose only one set of nodes to give its vote.
This will only be done if said set of nodes can have quorum (again) after
receiving the third-party vote.

Currently, only 'QDevice Net' is supported as a third-party arbitrator. This is
a daemon which provides a vote to a cluster partition, if it can reach the
partition members over the network. It will only give votes to one partition
of a cluster at any time.
It's designed to support multiple clusters and is almost configuration and
state free. New clusters are handled dynamically and no configuration file
is needed on the host running a QDevice.

The only requirements for the external host are that it needs network access to
the cluster and to have a corosync-qnetd package available. We provide a package
for Debian based hosts, and other Linux distributions should also have a package
available through their respective package manager.

NOTE: In contrast to corosync itself, a QDevice connects to the cluster over
TCP/IP. The daemon may even run outside of the cluster's LAN and can have longer
latencies than 2 ms.

Supported Setups
~~~~~~~~~~~~~~~~

We support QDevices for clusters with an even number of nodes and recommend
it for 2 node clusters, if they should provide higher availability.
For clusters with an odd node count, we currently discourage the use of
QDevices. The reason for this is the difference in the votes which the QDevice
provides for each cluster type. Even numbered clusters get a single additional
vote, which only increases availability, because if the QDevice
itself fails, you are in the same position as with no QDevice at all.

On the other hand, with an odd numbered cluster size, the QDevice provides
'(N-1)' votes -- where 'N' corresponds to the cluster node count. This
alternative behavior makes sense; if it had only one additional vote, the
cluster could get into a split-brain situation. This algorithm allows for all
nodes but one (and naturally the QDevice itself) to fail. However, there are two
drawbacks to this:

* If the QNet daemon itself fails, no other node may fail or the cluster
immediately loses quorum. For example, in a cluster with 15 nodes, 7
could fail before the cluster becomes inquorate. But, if a QDevice is
configured here and it itself fails, **no single node** of the 15 may fail.
The QDevice acts almost as a single point of failure in this case.

* The fact that all but one node plus QDevice may fail sounds promising at
first, but this may result in a mass recovery of HA services, which could
overload the single remaining node. Furthermore, a Ceph server will stop
providing services if only '((N-1)/2)' nodes or less remain online.

If you understand the drawbacks and implications, you can decide yourself if
you want to use this technology in an odd numbered cluster setup.

QDevice-Net Setup
~~~~~~~~~~~~~~~~~

We recommend running any daemon which provides votes to corosync-qdevice as an
unprivileged user. {pve} and Debian provide a package which is already
configured to do so.
The traffic between the daemon and the cluster must be encrypted to ensure a
safe and secure integration of the QDevice in {pve}.

First, install the 'corosync-qnetd' package on your external server

----
external# apt install corosync-qnetd
----

and the 'corosync-qdevice' package on all cluster nodes

----
pve# apt install corosync-qdevice
----

After doing this, ensure that all the nodes in the cluster are online.

You can now set up your QDevice by running the following command on one
of the {pve} nodes:

----
pve# pvecm qdevice setup <QDEVICE-IP>
----

The SSH key from the cluster will be automatically copied to the QDevice.

NOTE: Make sure that the SSH configuration on your external server allows root
login via password, if you are asked for a password during this step.

After you enter the password and all the steps have successfully completed, you
will see "Done". You can verify that the QDevice has been set up with:

----
pve# pvecm status

...

Votequorum information
~~~~~~~~~~~~~~~~~~~~~
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate Qdevice

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes    Qdevice Name
    0x00000001      1    A,V,NMW 192.168.22.180 (local)
    0x00000002      1    A,V,NMW 192.168.22.181
    0x00000000      1            Qdevice

----


Frequently Asked Questions
~~~~~~~~~~~~~~~~~~~~~~~~~~

Tie Breaking
^^^^^^^^^^^^

In case of a tie, where two same-sized cluster partitions cannot see each other
but can see the QDevice, the QDevice chooses one of those partitions randomly
and provides a vote to it.

Possible Negative Implications
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For clusters with an even node count, there are no negative implications when
using a QDevice. If it fails to work, it is the same as not having a QDevice
at all.

Adding/Deleting Nodes After QDevice Setup
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you want to add a new node or remove an existing one from a cluster with a
QDevice setup, you need to remove the QDevice first. After that, you can add or
remove nodes normally. Once you have a cluster with an even node count again,
you can set up the QDevice again as described previously.

Removing the QDevice
^^^^^^^^^^^^^^^^^^^^

If you used the official `pvecm` tool to add the QDevice, you can remove it
by running:

----
pve# pvecm qdevice remove
----

//Still TODO
//^^^^^^^^^^
//There is still stuff to add here


Corosync Configuration
----------------------

The `/etc/pve/corosync.conf` file plays a central role in a {pve} cluster. It
controls the cluster membership and its network.
For further information about it, check the corosync.conf man page:
[source,bash]
----
man corosync.conf
----

For node membership, you should always use the `pvecm` tool provided by {pve}.
You may have to edit the configuration file manually for other changes.
Here are a few best practice tips for doing this.

[[pvecm_edit_corosync_conf]]
Edit corosync.conf
~~~~~~~~~~~~~~~~~~

Editing the corosync.conf file is not always very straightforward. There are
two on each cluster node, one in `/etc/pve/corosync.conf` and the other in
`/etc/corosync/corosync.conf`. Editing the one in our cluster file system will
propagate the changes to the local one, but not vice versa.

The configuration will get updated automatically, as soon as the file changes.
This means that changes which can be integrated in a running corosync will take
effect immediately. Thus, you should always make a copy and edit that instead,
to avoid triggering unintended changes when saving the file while editing.

[source,bash]
----
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
----

Then, open the config file with your favorite editor, such as `nano` or
`vim.tiny`, which come pre-installed on every {pve} node.

NOTE: Always increment the 'config_version' number after configuration changes;
omitting this can lead to problems.
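
If you want to script this step, the version bump can be done with a small
`awk` filter. The following is only an illustrative sketch working on a
throwaway sample file; on a real node you would run the same filter on your
`corosync.conf.new` copy:

[source,bash]
----
# Sketch: increment config_version in a corosync-style config file.
# A minimal sample stands in for /etc/pve/corosync.conf.new here.
cat > /tmp/corosync.conf.new <<'EOF'
totem {
  cluster_name: testcluster
  config_version: 4
}
EOF

# Bump the first number on the config_version line, keeping indentation.
awk '/^[ \t]*config_version:/ { sub(/[0-9]+/, $2 + 1) } { print }' \
    /tmp/corosync.conf.new > /tmp/corosync.conf.bumped

grep config_version /tmp/corosync.conf.bumped   # ->   config_version: 5
----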
1150 | ||
a37d539f | 1151 | After making the necessary changes, create another copy of the current working |
e4ec4154 | 1152 | configuration file. This serves as a backup if the new configuration fails to |
a37d539f | 1153 | apply or causes other issues. |
e4ec4154 TL |
1154 | |
1155 | [source,bash] | |
4d19cb00 | 1156 | ---- |
e4ec4154 | 1157 | cp /etc/pve/corosync.conf /etc/pve/corosync.conf.bak |
4d19cb00 | 1158 | ---- |
e4ec4154 | 1159 | |
a37d539f | 1160 | Then replace the old configuration file with the new one: |
e4ec4154 | 1161 | [source,bash] |
4d19cb00 | 1162 | ---- |
e4ec4154 | 1163 | mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf |
4d19cb00 | 1164 | ---- |
e4ec4154 | 1165 | |
a37d539f DW |
1166 | You can check if the changes could be applied automatically, using the following |
1167 | commands: | |
e4ec4154 | 1168 | [source,bash] |
4d19cb00 | 1169 | ---- |
e4ec4154 TL |
1170 | systemctl status corosync |
1171 | journalctl -b -u corosync | |
4d19cb00 | 1172 | ---- |
e4ec4154 | 1173 | |
a37d539f | 1174 | If the changes could not be applied automatically, you may have to restart the |
e4ec4154 TL |
1175 | corosync service via: |
1176 | [source,bash] | |
4d19cb00 | 1177 | ---- |
e4ec4154 | 1178 | systemctl restart corosync |
4d19cb00 | 1179 | ---- |
e4ec4154 | 1180 | |
a37d539f | 1181 | On errors, check the troubleshooting section below. |
e4ec4154 TL |
1182 | |
1183 | Troubleshooting | |
1184 | ~~~~~~~~~~~~~~~ | |
1185 | ||
1186 | Issue: 'quorum.expected_votes must be configured' | |
1187 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
1188 | ||
1189 | When corosync starts to fail and you get the following message in the system log: | |
1190 | ||
1191 | ---- | |
1192 | [...] | |
1193 | corosync[1647]: [QUORUM] Quorum provider: corosync_votequorum failed to initialize. | |
1194 | corosync[1647]: [SERV ] Service engine 'corosync_quorum' failed to load for reason | |
1195 | 'configuration error: nodelist or quorum.expected_votes must be configured!' | |
1196 | [...] | |
1197 | ---- | |
1198 | ||
a37d539f | 1199 | It means that the hostname you set for a corosync 'ringX_addr' in the |
e4ec4154 TL |
1200 | configuration could not be resolved. |
1201 | ||
e4ec4154 TL |
1202 | Write Configuration When Not Quorate |
1203 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
1204 | ||
a37d539f DW |
1205 | If you need to change '/etc/pve/corosync.conf' on a node with no quorum, and you |
1206 | understand what you are doing, use: | |
e4ec4154 | 1207 | [source,bash] |
4d19cb00 | 1208 | ---- |
e4ec4154 | 1209 | pvecm expected 1 |
4d19cb00 | 1210 | ---- |
e4ec4154 TL |
1211 | |
1212 | This sets the expected vote count to 1 and makes the cluster quorate. You can | |
a37d539f | 1213 | then fix your configuration, or revert it back to the last working backup. |
e4ec4154 | 1214 | |
a37d539f DW |
1215 | This is not enough if corosync cannot start anymore. In that case, it is best to |
1216 | edit the local copy of the corosync configuration in | |
1217 | '/etc/corosync/corosync.conf', so that corosync can start again. Ensure that on | |
1218 | all nodes, this configuration has the same content to avoid split-brain | |
1219 | situations. | |
e4ec4154 TL |
1220 | |
1221 | ||
3254bfdd | 1222 | [[pvecm_corosync_conf_glossary]] |
e4ec4154 TL |
1223 | Corosync Configuration Glossary |
1224 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
1225 | ||
1226 | ringX_addr:: | |
a37d539f | 1227 | This names the different link addresses for the Kronosnet connections between |
a9e7c3aa | 1228 | nodes. |
e4ec4154 | 1229 | |
806ef12d DM |
1230 | |
1231 | Cluster Cold Start | |
1232 | ------------------ | |
1233 | ||
1234 | It is obvious that a cluster is not quorate when all nodes are | |
1235 | offline. This is a common case after a power failure. | |
1236 | ||
1237 | NOTE: It is always a good idea to use an uninterruptible power supply | |
8c1189b6 | 1238 | (``UPS'', also called ``battery backup'') to avoid this state, especially if |
806ef12d DM |
1239 | you want HA. |
1240 | ||
204231df | 1241 | On node startup, the `pve-guests` service is started and waits for |
8c1189b6 | 1242 | quorum. Once quorate, it starts all guests which have the `onboot` |
612417fd DM |
1243 | flag set. |
1244 | ||
1245 | When you turn on nodes, or when power comes back after power failure, | |
a37d539f | 1246 | it is likely that some nodes will boot faster than others. Please keep in |
612417fd | 1247 | mind that guest startup is delayed until you reach quorum. |


Guest Migration
---------------

Migrating virtual guests to other nodes is a useful feature in a
cluster. There are settings to control the behavior of such
migrations. This can be done via the configuration file
`datacenter.cfg` or for a specific migration via API or command line
parameters.

It makes a difference if a guest is online or offline, or if it has
local resources (like a local disk).

For details about virtual machine migration, see the
xref:qm_migration[QEMU/KVM Migration Chapter].

For details about container migration, see the
xref:pct_migration[Container Migration Chapter].

Migration Type
~~~~~~~~~~~~~~

The migration type defines if the migration data should be sent over an
encrypted (`secure`) channel or an unencrypted (`insecure`) one.
Setting the migration type to `insecure` means that the RAM content of a
virtual guest is also transferred unencrypted, which can lead to
information disclosure of critical data from inside the guest (for
example, passwords or encryption keys).

Therefore, we strongly recommend using the secure channel if you do
not have full control over the network and cannot guarantee that no
one is eavesdropping on it.

NOTE: Storage migration does not follow this setting. Currently, it
always sends the storage content over a secure channel.

Encryption requires a lot of computing power, so this setting is often
changed to `insecure` to achieve better performance. The impact on
modern systems is lower because they implement AES encryption in
hardware. The performance impact is particularly evident in fast
networks, where you can transfer 10 Gbps or more.

Migration Network
~~~~~~~~~~~~~~~~~

By default, {pve} uses the network in which cluster communication
takes place to send the migration traffic. This is not optimal both because
sensitive cluster traffic can be disrupted and this network may not
have the best bandwidth available on the node.

Setting the migration network parameter allows the use of a dedicated
network for all migration traffic. In addition to the memory,
this also affects the storage traffic for offline migrations.

The migration network is set as a network using CIDR notation. This
has the advantage that you don't have to set individual IP addresses
for each node. {pve} can determine the real address on the
destination node from the network specified in the CIDR form. To
enable this, the network must be specified so that each node has exactly one
IP in the respective network.

Example
^^^^^^^

We assume that we have a three-node setup, with three separate
networks. One for public communication with the Internet, one for
cluster communication, and a very fast one, which we want to use as a
dedicated network for migration.

A network configuration for such a setup might look as follows:

----
iface eno1 inet manual

# public network
auto vmbr0
iface vmbr0 inet static
    address 192.X.Y.57/24
    gateway 192.X.Y.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0

# cluster network
auto eno2
iface eno2 inet static
    address 10.1.1.1/24

# fast network
auto eno3
iface eno3 inet static
    address 10.1.2.1/24
----

Here, we will use the network 10.1.2.0/24 as a migration network. For
a single migration, you can do this using the `migration_network`
parameter of the command line tool:

----
# qm migrate 106 tre --online --migration_network 10.1.2.0/24
----

To configure this as the default network for all migrations in the
cluster, set the `migration` property of the `/etc/pve/datacenter.cfg`
file:

----
# use dedicated migration network
migration: secure,network=10.1.2.0/24
----

NOTE: The migration type must always be set when the migration network
is set in `/etc/pve/datacenter.cfg`.


ifdef::manvolnum[]
include::pve-copyright.adoc[]
endif::manvolnum[]