[[chapter_pvecm]]
ifdef::manvolnum[]
pvecm(1)
========
:pve-toplevel:

NAME
----

pvecm - Proxmox VE Cluster Manager

SYNOPSIS
--------

include::pvecm.1-synopsis.adoc[]

DESCRIPTION
-----------
endif::manvolnum[]

ifndef::manvolnum[]
Cluster Manager
===============
:pve-toplevel:
endif::manvolnum[]

The {pve} cluster manager `pvecm` is a tool to create a group of
physical servers. Such a group is called a *cluster*. We use the
http://www.corosync.org[Corosync Cluster Engine] for reliable group
communication. There's no explicit limit for the number of nodes in a cluster.
In practice, the actual possible node count may be limited by the host and
network performance. Currently (2021), there are reports of clusters (using
high-end enterprise hardware) with over 50 nodes in production.

`pvecm` can be used to create a new cluster, join nodes to a cluster,
leave the cluster, get status information, and do various other cluster-related
tasks. The **P**rox**m**o**x** **C**luster **F**ile **S**ystem (``pmxcfs'')
is used to transparently distribute the cluster configuration to all cluster
nodes.

Grouping nodes into a cluster has the following advantages:

* Centralized, web-based management

* Multi-master clusters: each node can do all management tasks

* Use of `pmxcfs`, a database-driven file system, for storing configuration
files, replicated in real-time on all nodes using `corosync`

* Easy migration of virtual machines and containers between physical
hosts

* Fast deployment

* Cluster-wide services like firewall and HA


Requirements
------------

* All nodes must be able to connect to each other via UDP ports 5404 and 5405
for corosync to work.

* Date and time must be synchronized.

* An SSH tunnel on TCP port 22 between nodes is required.

* If you are interested in High Availability, you need to have at
least three nodes for reliable quorum. All nodes should have the
same version.

* We recommend a dedicated NIC for the cluster traffic, especially if
you use shared storage.

* The root password of a cluster node is required for adding nodes.

NOTE: It is not possible to mix {pve} 3.x and earlier with {pve} 4.x cluster
nodes.

NOTE: While it's possible to mix {pve} 4.4 and {pve} 5.0 nodes, doing so is
not supported as a production configuration and should only be done temporarily,
during an upgrade of the whole cluster from one major version to another.

NOTE: Running a cluster of {pve} 6.x with earlier versions is not possible. The
cluster protocol (corosync) between {pve} 6.x and earlier versions changed
fundamentally. The corosync 3 packages for {pve} 5.4 are only intended for the
upgrade procedure to {pve} 6.0.


Preparing Nodes
---------------

First, install {pve} on all nodes. Make sure that each node is
installed with the final hostname and IP configuration. Changing the
hostname and IP is not possible after cluster creation.

While it's common to reference all node names and their IPs in `/etc/hosts` (or
make their names resolvable through other means), this is not necessary for a
cluster to work. It may be useful however, as you can then connect from one node
to another via SSH, using the easier to remember node name (see also
xref:pvecm_corosync_addresses[Link Address Types]). Note that we always
recommend referencing nodes by their IP addresses in the cluster configuration.


[[pvecm_create_cluster]]
Create a Cluster
----------------

You can either create a cluster on the console (login via `ssh`), or through
the API using the {pve} web interface (__Datacenter -> Cluster__).

NOTE: Use a unique name for your cluster. This name cannot be changed later.
The cluster name follows the same rules as node names.

[[pvecm_cluster_create_via_gui]]
Create via Web GUI
~~~~~~~~~~~~~~~~~~

[thumbnail="screenshot/gui-cluster-create.png"]

Under __Datacenter -> Cluster__, click on *Create Cluster*. Enter the cluster
name and select a network connection from the drop-down list to serve as the
main cluster network (Link 0). It defaults to the IP resolved via the node's
hostname.

To add a second link as fallback, you can select the 'Advanced' checkbox and
choose an additional network interface (Link 1, see also
xref:pvecm_redundancy[Corosync Redundancy]).

NOTE: Ensure that the network selected for cluster communication is not used for
any high traffic purposes, like network storage or live-migration.
While the cluster network itself produces small amounts of data, it is very
sensitive to latency. Check out the full
xref:pvecm_cluster_network_requirements[cluster network requirements].

[[pvecm_cluster_create_via_cli]]
Create via the Command Line
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Log in via `ssh` to the first {pve} node and run the following command:

----
hp1# pvecm create CLUSTERNAME
----

To check the state of the new cluster, use:

----
hp1# pvecm status
----

Multiple Clusters in the Same Network
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is possible to create multiple clusters in the same physical or logical
network. In this case, each cluster must have a unique name to avoid possible
clashes in the cluster communication stack. Furthermore, this helps avoid human
confusion by making clusters clearly distinguishable.

While the bandwidth requirement of a corosync cluster is relatively low, the
latency of packets and the packets per second (PPS) rate are the limiting
factors. Different clusters in the same network can compete with each other for
these resources, so it may still make sense to use separate physical network
infrastructure for bigger clusters.

[[pvecm_join_node_to_cluster]]
Adding Nodes to the Cluster
---------------------------

CAUTION: A node that is about to be added to the cluster cannot hold any guests.
All existing configuration in `/etc/pve` is overwritten when joining a cluster,
since guest IDs could otherwise conflict. As a workaround, you can create a
backup of the guest (`vzdump`) and restore it under a different ID, after the
node has been added to the cluster.

Join Node to Cluster via GUI
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[thumbnail="screenshot/gui-cluster-join-information.png"]

Log in to the web interface on an existing cluster node. Under __Datacenter ->
Cluster__, click the *Join Information* button at the top. Then, click on the
button *Copy Information*. Alternatively, copy the string from the 'Information'
field manually.

[thumbnail="screenshot/gui-cluster-join.png"]

Next, log in to the web interface on the node you want to add.
Under __Datacenter -> Cluster__, click on *Join Cluster*. Fill in the
'Information' field with the 'Join Information' text you copied earlier.
Most settings required for joining the cluster will be filled out
automatically. For security reasons, the cluster password has to be entered
manually.

NOTE: To enter all required data manually, you can disable the 'Assisted Join'
checkbox.

After clicking the *Join* button, the cluster join process will start
immediately. After the node has joined the cluster, its current node certificate
will be replaced by one signed by the cluster certificate authority (CA).
This means that the current session will stop working after a few seconds. You
then might need to force-reload the web interface and log in again with the
cluster credentials.

Now your node should be visible under __Datacenter -> Cluster__.

Join Node to Cluster via Command Line
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Log in to the node you want to join into an existing cluster via `ssh`.

----
hp2# pvecm add IP-ADDRESS-CLUSTER
----

For `IP-ADDRESS-CLUSTER`, use the IP or hostname of an existing cluster node.
An IP address is recommended (see xref:pvecm_corosync_addresses[Link Address Types]).

To check the state of the cluster, use:

----
# pvecm status
----

.Cluster status after adding 4 nodes
----
hp2# pvecm status
Quorum information
~~~~~~~~~~~~~~~~~~
Date:             Mon Apr 20 12:30:13 2015
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000001
Ring ID:          1/8
Quorate:          Yes

Votequorum information
~~~~~~~~~~~~~~~~~~~~~~
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3
Flags:            Quorate

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes Name
0x00000001          1 192.168.15.91
0x00000002          1 192.168.15.92 (local)
0x00000003          1 192.168.15.93
0x00000004          1 192.168.15.94
----

If you only want a list of all nodes, use:

----
# pvecm nodes
----

.List nodes in a cluster
----
hp2# pvecm nodes

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes Name
         1          1 hp1
         2          1 hp2 (local)
         3          1 hp3
         4          1 hp4
----

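The membership table above is plain columnar text, so it can be post-processed with standard tools when scripting. As a minimal sketch (the `nodeid_of` helper name is ours, not part of the `pvecm` tooling; the column layout is assumed to match the sample above):

```shell
# Sketch: extract the Nodeid for a given node name from `pvecm nodes`-style
# membership output (columns: Nodeid, Votes, Name). Illustrative helper only.
nodeid_of() {
    awk -v node="$1" '$3 == node { print $1 }'
}

# Example run against a captured sample instead of a live cluster:
sample='         1          1 hp1
         2          1 hp2 (local)
         3          1 hp3'
printf '%s\n' "$sample" | nodeid_of hp3   # prints 3
```

On a live node you would pipe `pvecm nodes` into the helper instead of the captured sample.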
[[pvecm_adding_nodes_with_separated_cluster_network]]
Adding Nodes with Separated Cluster Network
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When adding a node to a cluster with a separated cluster network, you need to
use the 'link0' parameter to set the node's address on that network:

[source,bash]
----
pvecm add IP-ADDRESS-CLUSTER -link0 LOCAL-IP-ADDRESS-LINK0
----

If you want to use the built-in xref:pvecm_redundancy[redundancy] of the
Kronosnet transport layer, also use the 'link1' parameter.

Using the GUI, you can select the correct interface from the corresponding
'Link X' fields in the *Cluster Join* dialog.

Remove a Cluster Node
---------------------

CAUTION: Read the procedure carefully before proceeding, as it may
not be what you want or need.

Move all virtual machines from the node. Make sure you have made copies of any
local data or backups that you want to keep. In the following example, we will
remove the node hp4 from the cluster.

Log in to a *different* cluster node (not hp4), and issue a `pvecm nodes`
command to identify the node ID to remove:

----
hp1# pvecm nodes

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes Name
         1          1 hp1 (local)
         2          1 hp2
         3          1 hp3
         4          1 hp4
----

At this point, you must power off hp4 and ensure that it will not power on
again (in the network) with its current configuration.

IMPORTANT: As mentioned above, it is critical to power off the node
*before* removal, and make sure that it will *not* power on again
(in the existing cluster network) with its current configuration.
If you power on the node as it is, the cluster could end up broken,
and it could be difficult to restore it to a functioning state.

After powering off the node hp4, we can safely remove it from the cluster.

----
hp1# pvecm delnode hp4
Killing node 4
----

Use `pvecm nodes` or `pvecm status` to check the node list again. It should
look something like:

----
hp1# pvecm status

Quorum information
~~~~~~~~~~~~~~~~~~
Date:             Mon Apr 20 12:44:28 2015
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1/8
Quorate:          Yes

Votequorum information
~~~~~~~~~~~~~~~~~~~~~~
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes Name
0x00000001          1 192.168.15.90 (local)
0x00000002          1 192.168.15.91
0x00000003          1 192.168.15.92
----

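When automating such checks, the `Quorate` flag is usually the field of interest. A small sketch (the `is_quorate` helper name is ours; the `Quorate: Yes/No` line format is assumed to match the output above):

```shell
# Sketch: pull the Quorate flag out of `pvecm status`-style output.
# Illustrative helper, not part of the pvecm tooling.
is_quorate() {
    awk -F': *' '$1 == "Quorate" { print $2 }'
}

# Example on a captured line; live usage would be `pvecm status | is_quorate`.
printf 'Quorate:          Yes\n' | is_quorate   # prints Yes
```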
If, for whatever reason, you want this server to join the same cluster again,
you have to:

* do a fresh install of {pve} on it,

* then join it, as explained in the previous section.

NOTE: After removal of the node, its SSH fingerprint will still reside in the
'known_hosts' of the other nodes. If you receive an SSH error after rejoining
a node with the same IP or hostname, run `pvecm updatecerts` once on the
re-added node to update its fingerprint cluster-wide.

[[pvecm_separate_node_without_reinstall]]
Separate a Node Without Reinstalling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

CAUTION: This is *not* the recommended method, proceed with caution. Use the
previous method if you're unsure.

You can also separate a node from a cluster without reinstalling it from
scratch. But after removing the node from the cluster, it will still have
access to any shared storage. This must be resolved before you start removing
the node from the cluster. A {pve} cluster cannot share the exact same
storage with another cluster, as storage locking doesn't work over the cluster
boundary. Furthermore, it may also lead to VMID conflicts.

It's suggested that you create a new storage, to which only the node which you
want to separate has access. This can be a new export on your NFS or a new Ceph
pool, to name a few examples. It's just important that the exact same storage
does not get accessed by multiple clusters. After setting up this storage, move
all data and VMs from the node to it. Then you are ready to separate the
node from the cluster.

WARNING: Ensure that all shared resources are cleanly separated! Otherwise you
will run into conflicts and problems.

First, stop the corosync and pve-cluster services on the node:
[source,bash]
----
systemctl stop pve-cluster
systemctl stop corosync
----

Start the cluster file system again in local mode:
[source,bash]
----
pmxcfs -l
----

Delete the corosync configuration files:
[source,bash]
----
rm /etc/pve/corosync.conf
rm -r /etc/corosync/*
----

You can now start the file system again as a normal service:
[source,bash]
----
killall pmxcfs
systemctl start pve-cluster
----

The node is now separated from the cluster. You can delete it from any
remaining node of the cluster with:
[source,bash]
----
pvecm delnode oldnode
----

If the command fails due to a loss of quorum in the remaining nodes, you can
set the expected votes to 1 as a workaround:
[source,bash]
----
pvecm expected 1
----

And then repeat the 'pvecm delnode' command.

Now switch back to the separated node and delete all the remaining cluster
files on it. This ensures that the node can be added to another cluster again
without problems.

[source,bash]
----
rm /var/lib/corosync/*
----

As the configuration files from the other nodes are still in the cluster
file system, you may want to clean those up too. After making absolutely sure
that you have the correct node name, you can simply remove the entire
directory recursively from '/etc/pve/nodes/NODENAME'.

CAUTION: The node's SSH keys will remain in the 'authorized_keys' file. This
means that the nodes can still connect to each other with public key
authentication. You should fix this by removing the respective keys from the
'/etc/pve/priv/authorized_keys' file.


Quorum
------

{pve} uses a quorum-based technique to provide a consistent state among
all cluster nodes.

[quote, from Wikipedia, Quorum (distributed computing)]
____
A quorum is the minimum number of votes that a distributed transaction
has to obtain in order to be allowed to perform an operation in a
distributed system.
____

In case of network partitioning, state changes require that a
majority of nodes are online. The cluster switches to read-only mode
if it loses quorum.

NOTE: {pve} assigns a single vote to each node by default.

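With one vote per node, the majority needed for quorum works out to floor(N/2) + 1. A quick sketch of that arithmetic (the helper name is ours, for illustration only):

```shell
# Sketch: votes needed for quorum in a cluster of N single-vote nodes,
# i.e. floor(N/2) + 1.
quorum_for() {
    echo $(( $1 / 2 + 1 ))
}

quorum_for 3   # prints 2 -- 2 of 3 nodes must be online
quorum_for 4   # prints 3 -- an even node count gains no failure tolerance
```

This is also why at least three nodes are recommended for High Availability: with only two nodes, losing either one drops the cluster below quorum.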

Cluster Network
---------------

The cluster network is the core of a cluster. All messages sent over it have to
be delivered reliably to all nodes in their respective order. In {pve} this
part is done by corosync, an implementation of a high performance, low overhead,
high availability development toolkit. It serves our decentralized configuration
file system (`pmxcfs`).

[[pvecm_cluster_network_requirements]]
Network Requirements
~~~~~~~~~~~~~~~~~~~~
This needs a reliable network with latencies under 2 milliseconds (LAN
performance) to work properly. The network should not be used heavily by other
members; ideally corosync runs on its own network. Do not use a shared network
for corosync and storage (except as a potential low-priority fallback in a
xref:pvecm_redundancy[redundant] configuration).

Before setting up a cluster, it is good practice to check if the network is fit
for that purpose. To ensure that the nodes can connect to each other on the
cluster network, you can test the connectivity between them with the `ping`
tool.
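To compare measured latency against the 2 millisecond budget, the average round-trip time can be pulled from the `ping` summary line. A sketch assuming the common Linux `ping` summary format (the `avg_rtt_ms` helper name is ours):

```shell
# Sketch: extract the average RTT in ms from a `ping -c` summary line like
#   rtt min/avg/max/mdev = 0.321/0.584/0.912/0.210 ms
avg_rtt_ms() {
    awk -F'/' '/^rtt/ { print $5 }'
}

# Example on a captured summary line; on a live node you would pipe
# `ping -c 10 OTHER-NODE` into avg_rtt_ms instead.
printf 'rtt min/avg/max/mdev = 0.321/0.584/0.912/0.210 ms\n' | avg_rtt_ms
```

If the printed average approaches or exceeds 2 ms, the network is a poor fit for corosync.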

If the {pve} firewall is enabled, ACCEPT rules for corosync will automatically
be generated - no manual action is required.

NOTE: Corosync used Multicast before version 3.0 (introduced in {pve} 6.0).
Modern versions rely on https://kronosnet.org/[Kronosnet] for cluster
communication, which, for now, only supports regular UDP unicast.

CAUTION: You can still enable Multicast or legacy unicast by setting your
transport to `udp` or `udpu` in your xref:pvecm_edit_corosync_conf[corosync.conf],
but keep in mind that this will disable all cryptography and redundancy support.
This is therefore not recommended.

Separate Cluster Network
~~~~~~~~~~~~~~~~~~~~~~~~

When creating a cluster without any parameters, the corosync cluster network is
generally shared with the web interface and the VMs' network. Depending on
your setup, even storage traffic may get sent over the same network. It's
recommended to change that, as corosync is a time-critical, real-time
application.

Setting Up a New Network
^^^^^^^^^^^^^^^^^^^^^^^^

First, you have to set up a new network interface. It should be on a physically
separate network. Ensure that your network fulfills the
xref:pvecm_cluster_network_requirements[cluster network requirements].

Separate On Cluster Creation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is possible via the 'linkX' parameters of the 'pvecm create'
command, used for creating a new cluster.

If you have set up an additional NIC with a static address on 10.10.10.1/25,
and want to send and receive all cluster communication over this interface,
you would execute:

[source,bash]
----
pvecm create test --link0 10.10.10.1
----

To check if everything is working properly, execute:
[source,bash]
----
systemctl status corosync
----

Afterwards, proceed as described above to
xref:pvecm_adding_nodes_with_separated_cluster_network[add nodes with a separated cluster network].

[[pvecm_separate_cluster_net_after_creation]]
Separate After Cluster Creation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can do this if you have already created a cluster and want to switch
its communication to another network, without rebuilding the whole cluster.
This change may lead to short periods of quorum loss in the cluster, as nodes
have to restart corosync and come up one after the other on the new network.

Check how to xref:pvecm_edit_corosync_conf[edit the corosync.conf file] first.
Then, open it and you should see a file similar to:

----
logging {
  debug: off
  to_syslog: yes
}

nodelist {

  node {
    name: due
    nodeid: 2
    quorum_votes: 1
    ring0_addr: due
  }

  node {
    name: tre
    nodeid: 3
    quorum_votes: 1
    ring0_addr: tre
  }

  node {
    name: uno
    nodeid: 1
    quorum_votes: 1
    ring0_addr: uno
  }

}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: testcluster
  config_version: 3
  ip_version: ipv4-6
  secauth: on
  version: 2
  interface {
    linknumber: 0
  }

}
----

NOTE: `ringX_addr` actually specifies a corosync *link address*. The name "ring"
is a remnant of older corosync versions that is kept for backwards
compatibility.

The first thing you want to do is add the 'name' properties in the node entries,
if you do not see them already. Those *must* match the node name.

Then replace all addresses from the 'ring0_addr' properties of all nodes with
the new addresses. You may use plain IP addresses or hostnames here. If you use
hostnames, ensure that they are resolvable from all nodes (see also
xref:pvecm_corosync_addresses[Link Address Types]).

In this example, we want to switch cluster communication to the
10.10.10.0/25 network, so we change the 'ring0_addr' of each node respectively.

NOTE: The exact same procedure can be used to change other 'ringX_addr' values
as well. However, we recommend only changing one link address at a time, so
that it's easier to recover if something goes wrong.

After we increase the 'config_version' property, the new configuration file
should look like:

----
logging {
  debug: off
  to_syslog: yes
}

nodelist {

  node {
    name: due
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2
  }

  node {
    name: tre
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.10.3
  }

  node {
    name: uno
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
  }

}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: testcluster
  config_version: 4
  ip_version: ipv4-6
  secauth: on
  version: 2
  interface {
    linknumber: 0
  }

}
----

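The 'config_version' bump is easy to forget. As a hedged sketch, it can be scripted with awk (the `bump_config_version` helper name is ours; it assumes exactly one `config_version:` line and prints the edited file to stdout rather than modifying anything in place, so you can review it first):

```shell
# Sketch: print a corosync.conf copy with config_version incremented by one.
# Review the output before copying it over the real configuration file.
bump_config_version() {
    awk '
        $1 == "config_version:" { sub(/[0-9]+/, $2 + 1) }  # keep indentation
        { print }
    ' "$1"
}
```

Usage would be along the lines of `bump_config_version corosync.conf.new > corosync.conf.bumped`, followed by a manual diff against the original.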
Then, after a final check to see that all changed information is correct, we
save it and once again follow the
xref:pvecm_edit_corosync_conf[edit corosync.conf file] section to bring it into
effect.

The changes will be applied live, so restarting corosync is not strictly
necessary. If you changed other settings as well, or notice corosync
complaining, you can optionally trigger a restart.

On a single node, execute:

[source,bash]
----
systemctl restart corosync
----

Now check if everything is okay:

[source,bash]
----
systemctl status corosync
----

If corosync begins to work again, restart it on all other nodes too.
They will then join the cluster membership one by one on the new network.
714 | ||
3254bfdd | 715 | [[pvecm_corosync_addresses]] |
a37d539f | 716 | Corosync Addresses |
270757a1 SR |
717 | ~~~~~~~~~~~~~~~~~~ |
718 | ||
a9e7c3aa SR |
719 | A corosync link address (for backwards compatibility denoted by 'ringX_addr' in |
720 | `corosync.conf`) can be specified in two ways: | |
270757a1 | 721 | |
a37d539f | 722 | * **IPv4/v6 addresses** can be used directly. They are recommended, since they |
270757a1 SR |
723 | are static and usually not changed carelessly. |
724 | ||
a37d539f | 725 | * **Hostnames** will be resolved using `getaddrinfo`, which means that by |
270757a1 SR |
726 | default, IPv6 addresses will be used first, if available (see also |
727 | `man gai.conf`). Keep this in mind, especially when upgrading an existing | |
728 | cluster to IPv6. | |
729 | ||
a37d539f | 730 | CAUTION: Hostnames should be used with care, since the addresses they |
270757a1 SR |
731 | resolve to can be changed without touching corosync or the node it runs on - |
732 | which may lead to a situation where an address is changed without thinking | |
733 | about implications for corosync. | |
734 | ||
5f318cc0 | 735 | A separate, static hostname specifically for corosync is recommended, if |
270757a1 SR |
736 | hostnames are preferred. Also, make sure that every node in the cluster can |
737 | resolve all hostnames correctly. | |
738 | ||
739 | Since {pve} 5.1, while supported, hostnames will be resolved at the time of | |
a37d539f | 740 | entry. Only the resolved IP is saved to the configuration. |
270757a1 SR |
741 | |
742 | Nodes that joined the cluster on earlier versions likely still use their | |
743 | unresolved hostname in `corosync.conf`. It might be a good idea to replace | |
5f318cc0 | 744 | them with IPs or a separate hostname, as mentioned above. |
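
If hostnames are used, it can help to verify how they resolve before putting
them into `corosync.conf`. The sketch below uses `getent`, which resolves via
the same `getaddrinfo` mechanism; substitute a node's corosync hostname for
`localhost`:

```shell
# Show the addresses a hostname resolves to, in getaddrinfo order.
# IPv6 entries may be listed first, depending on /etc/gai.conf.
getent ahosts localhost
```

Running this on every node is a quick way to confirm that all corosync
hostnames resolve consistently across the cluster.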


[[pvecm_redundancy]]
Corosync Redundancy
-------------------

Corosync supports redundant networking via its integrated Kronosnet layer by
default (it is not supported on the legacy udp/udpu transports). It can be
enabled by specifying more than one link address, either via the '--linkX'
parameters of `pvecm`, in the GUI as **Link 1** (while creating a cluster or
adding a new node) or by specifying more than one 'ringX_addr' in
`corosync.conf`.

NOTE: To provide useful failover, every link should be on its own
physical network connection.

Links are used according to a priority setting. You can configure this priority
by setting 'knet_link_priority' in the corresponding interface section in
`corosync.conf`, or, preferably, using the 'priority' parameter when creating
your cluster with `pvecm`:

----
 # pvecm create CLUSTERNAME --link0 10.10.10.1,priority=15 --link1 10.20.20.1,priority=20
----

This would cause 'link1' to be used first, since it has the higher priority.

If no priorities are configured manually (or two links have the same priority),
links will be used in order of their number, with the lower number having higher
priority.
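
For reference, the equivalent manual configuration would set
'knet_link_priority' in the interface sections of the `totem` block in
`corosync.conf`. A sketch matching the numbers from the `pvecm` example above
(remember to increment 'config_version' when editing by hand):

----
totem {
  interface {
    linknumber: 0
    knet_link_priority: 15
  }
  interface {
    linknumber: 1
    knet_link_priority: 20
  }
}
----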

Even if all links are working, only the one with the highest priority will see
corosync traffic. Link priorities cannot be mixed, meaning that links with
different priorities will not be able to communicate with each other.

Since lower priority links will not see traffic unless all higher priorities
have failed, it becomes a useful strategy to specify networks used for
other tasks (VMs, storage, etc.) as low-priority links. If worst comes to
worst, a higher latency or more congested connection might be better than no
connection at all.

Adding Redundant Links To An Existing Cluster
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To add a new link to a running configuration, first check how to
xref:pvecm_edit_corosync_conf[edit the corosync.conf file].

Then, add a new 'ringX_addr' to every node in the `nodelist` section. Make
sure that your 'X' is the same for every node you add it to, and that it is
unique for each node.

Lastly, add a new 'interface', as shown below, to your `totem`
section, replacing 'X' with the link number chosen above.

Assuming you added a link with number 1, the new configuration file could look
like this:

----
logging {
  debug: off
  to_syslog: yes
}

nodelist {

  node {
    name: due
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2
    ring1_addr: 10.20.20.2
  }

  node {
    name: tre
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.10.3
    ring1_addr: 10.20.20.3
  }

  node {
    name: uno
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
    ring1_addr: 10.20.20.1
  }

}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: testcluster
  config_version: 4
  ip_version: ipv4-6
  secauth: on
  version: 2
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}
----

The new link will be enabled as soon as you follow the last steps to
xref:pvecm_edit_corosync_conf[edit the corosync.conf file]. A restart should not
be necessary. You can check that corosync loaded the new link using:

----
journalctl -b -u corosync
----

It might be a good idea to test the new link by temporarily disconnecting the
old link on one node and making sure that its status remains online while
disconnected:

----
pvecm status
----

If you see a healthy cluster state, it means that your new link is being used.


Role of SSH in {pve} Clusters
-----------------------------

{pve} utilizes SSH tunnels for various features.

* Proxying console/shell sessions (node and guests)
+
When using the shell for node B while being connected to node A, the
connection goes to a terminal proxy on node A, which is in turn connected to
the login shell on node B via a non-interactive SSH tunnel.

* VM and CT memory and local-storage migration in 'secure' mode.
+
During the migration, one or more SSH tunnel(s) are established between the
source and target nodes, in order to exchange migration information and
transfer memory and disk contents.

* Storage replication

.Pitfalls due to automatic execution of `.bashrc` and siblings
[IMPORTANT]
====
If you have a custom `.bashrc`, or similar files that get executed on login by
the configured shell, `ssh` will automatically run them once the session is
established. This can lead to unexpected behavior, as those commands may be
executed with root permissions during any of the operations described above,
with potentially problematic side effects.

In order to avoid such complications, it's recommended to add a check in
`/root/.bashrc` to make sure the session is interactive, and only then run
`.bashrc` commands.

You can add this snippet at the beginning of your `.bashrc` file:

----
# Early exit if not running interactively to avoid side-effects!
case $- in
    *i*) ;;
      *) return;;
esac
----
====


Corosync External Vote Support
------------------------------

This section describes a way to deploy an external voter in a {pve} cluster.
When configured, the cluster can sustain more node failures without
violating safety properties of the cluster communication.

For this to work, there are two services involved:

* A QDevice daemon which runs on each {pve} node

* An external vote daemon which runs on an independent server

As a result, you can achieve higher availability, even in smaller setups (for
example 2+1 nodes).

QDevice Technical Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~

The Corosync Quorum Device (QDevice) is a daemon which runs on each cluster
node. It provides a configured number of votes to the cluster's quorum
subsystem, based on an externally running third-party arbitrator's decision.
Its primary use is to allow a cluster to sustain more node failures than
standard quorum rules allow. This can be done safely as the external device
can see all nodes and thus choose only one set of nodes to give its vote.
This will only be done if said set of nodes can have quorum (again) after
receiving the third-party vote.

Currently, only 'QDevice Net' is supported as a third-party arbitrator. This is
a daemon which provides a vote to a cluster partition, if it can reach the
partition members over the network. It will only give votes to one partition
of a cluster at any time.
It's designed to support multiple clusters and is almost configuration and
state free. New clusters are handled dynamically, and no configuration file
is needed on the host running a QDevice.

The only requirements for the external host are that it needs network access to
the cluster and to have a corosync-qnetd package available. We provide a package
for Debian based hosts, and other Linux distributions should also have a package
available through their respective package manager.

NOTE: In contrast to corosync itself, a QDevice connects to the cluster over
TCP/IP. The daemon may even run outside of the cluster's LAN and can have longer
latencies than 2 ms.

Supported Setups
~~~~~~~~~~~~~~~~

We support QDevices for clusters with an even number of nodes and recommend
it for 2 node clusters, if they should provide higher availability.
For clusters with an odd node count, we currently discourage the use of
QDevices. The reason for this is the difference in the votes which the QDevice
provides for each cluster type. Even numbered clusters get a single additional
vote, which only increases availability, because if the QDevice
itself fails, you are in the same position as with no QDevice at all.

On the other hand, with an odd numbered cluster size, the QDevice provides
'(N-1)' votes -- where 'N' corresponds to the cluster node count. This
alternative behavior makes sense; if it had only one additional vote, the
cluster could get into a split-brain situation. This algorithm allows for all
nodes but one (and naturally the QDevice itself) to fail. However, there are two
drawbacks to this:

* If the QNet daemon itself fails, no other node may fail or the cluster
immediately loses quorum. For example, in a cluster with 15 nodes, 7
could fail before the cluster becomes inquorate. But, if a QDevice is
configured here and it itself fails, **no single node** of the 15 may fail.
The QDevice acts almost as a single point of failure in this case.

* The fact that all but one node plus QDevice may fail sounds promising at
first, but this may result in a mass recovery of HA services, which could
overload the single remaining node. Furthermore, a Ceph server will stop
providing services if only '((N-1)/2)' nodes or less remain online.

If you understand the drawbacks and implications, you can decide yourself if
you want to use this technology in an odd numbered cluster setup.
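
The vote arithmetic above can be sketched numerically. This is purely
illustrative shell arithmetic, not a {pve} tool: an even cluster gets one
QDevice vote, an odd cluster gets 'N-1' votes, and quorum is a strict
majority of the expected votes.

```shell
# Illustrative quorum arithmetic for a cluster with a QDevice.
quorum_info() {
  n=$1
  if [ $((n % 2)) -eq 0 ]; then
    qdevice_votes=1            # even cluster: one extra vote
  else
    qdevice_votes=$((n - 1))   # odd cluster: N-1 extra votes
  fi
  expected=$((n + qdevice_votes))
  quorum=$((expected / 2 + 1))
  echo "$n nodes: expected=$expected quorum=$quorum"
}

quorum_info 2    # 2 nodes: expected=3 quorum=2
quorum_info 15   # 15 nodes: expected=29 quorum=15
```

For 15 nodes, quorum=15 means a single surviving node (1 vote) plus the
QDevice's 14 votes just reach quorum, matching the "all nodes but one may
fail" behavior described above.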

QDevice-Net Setup
~~~~~~~~~~~~~~~~~

We recommend running any daemon which provides votes to corosync-qdevice as an
unprivileged user. {pve} and Debian provide a package which is already
configured to do so.
The traffic between the daemon and the cluster must be encrypted to ensure a
safe and secure integration of the QDevice in {pve}.

First, install the 'corosync-qnetd' package on your external server

----
external# apt install corosync-qnetd
----

and the 'corosync-qdevice' package on all cluster nodes

----
pve# apt install corosync-qdevice
----

After doing this, ensure that all the nodes in the cluster are online.

You can now set up your QDevice by running the following command on one
of the {pve} nodes:

----
pve# pvecm qdevice setup <QDEVICE-IP>
----

The SSH key from the cluster will be automatically copied to the QDevice.

NOTE: Make sure that the SSH configuration on your external server allows root
login via password, if you are asked for a password during this step.

After you enter the password and all the steps have successfully completed, you
will see "Done". You can verify that the QDevice has been set up with:

----
pve# pvecm status

...

Votequorum information
~~~~~~~~~~~~~~~~~~~~~
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate Qdevice

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 192.168.22.180 (local)
0x00000002          1    A,V,NMW 192.168.22.181
0x00000000          1            Qdevice

----
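
For monitoring scripts, the quorum state can be checked programmatically by
matching on the 'Flags' line of this output. A self-contained sketch, using a
saved sample line in place of live `pvecm status` output:

```shell
# The variable stands in for `pvecm status` output on a live node,
# where you would pipe the command's output into grep directly.
sample_status='Flags:            Quorate Qdevice'

if echo "$sample_status" | grep -q 'Quorate'; then
  echo "cluster is quorate"
fi
```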


Frequently Asked Questions
~~~~~~~~~~~~~~~~~~~~~~~~~~

Tie Breaking
^^^^^^^^^^^^

In case of a tie, where two same-sized cluster partitions cannot see each other
but can see the QDevice, the QDevice chooses one of those partitions randomly
and provides a vote to it.

Possible Negative Implications
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For clusters with an even node count, there are no negative implications when
using a QDevice. If it fails to work, it is the same as not having a QDevice
at all.

Adding/Deleting Nodes After QDevice Setup
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you want to add a new node or remove an existing one from a cluster with a
QDevice setup, you need to remove the QDevice first. After that, you can add or
remove nodes normally. Once you have a cluster with an even node count again,
you can set up the QDevice again as described previously.

Removing the QDevice
^^^^^^^^^^^^^^^^^^^^

If you used the official `pvecm` tool to add the QDevice, you can remove it
by running:

----
pve# pvecm qdevice remove
----

//Still TODO
//^^^^^^^^^^
//There is still stuff to add here


Corosync Configuration
----------------------

The `/etc/pve/corosync.conf` file plays a central role in a {pve} cluster. It
controls the cluster membership and its network.
For further information about it, check the corosync.conf man page:
[source,bash]
----
man corosync.conf
----

For node membership, you should always use the `pvecm` tool provided by {pve}.
You may have to edit the configuration file manually for other changes.
Here are a few best practice tips for doing this.

[[pvecm_edit_corosync_conf]]
Edit corosync.conf
~~~~~~~~~~~~~~~~~~

Editing the corosync.conf file is not always very straightforward. There are
two on each cluster node, one in `/etc/pve/corosync.conf` and the other in
`/etc/corosync/corosync.conf`. Editing the one in our cluster file system will
propagate the changes to the local one, but not vice versa.

The configuration will get updated automatically, as soon as the file changes.
This means that changes which can be integrated in a running corosync will take
effect immediately. Thus, you should always make a copy and edit that instead,
to avoid triggering unintended changes when saving the file while editing.

[source,bash]
----
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
----

Then, open the config file with your favorite editor, such as `nano` or
`vim.tiny`, which come pre-installed on every {pve} node.

NOTE: Always increment the 'config_version' number after configuration changes;
omitting this can lead to problems.
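
Since a forgotten version bump is a common source of problems, the increment
can also be scripted. A hypothetical sketch using `awk`, demonstrated on a
temporary file here (note that `awk` reflows the line's leading whitespace
when it rewrites a field):

```shell
# Demo: increment the config_version field in a corosync.conf copy.
conf=$(mktemp)
printf 'totem {\n  config_version: 4\n}\n' > "$conf"

awk '/config_version:/ { $2 = $2 + 1 } { print }' "$conf" > "$conf.tmp" \
  && mv "$conf.tmp" "$conf"

grep 'config_version' "$conf"   # now reads: config_version: 5
rm -f "$conf"
```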

After making the necessary changes, create another copy of the current working
configuration file. This serves as a backup if the new configuration fails to
apply or causes other issues.

[source,bash]
----
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.bak
----

Then replace the old configuration file with the new one:
[source,bash]
----
mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf
----

You can check if the changes could be applied automatically, using the following
commands:
[source,bash]
----
systemctl status corosync
journalctl -b -u corosync
----

If the changes could not be applied automatically, you may have to restart the
corosync service via:
[source,bash]
----
systemctl restart corosync
----

On errors, check the troubleshooting section below.

Troubleshooting
~~~~~~~~~~~~~~~

Issue: 'quorum.expected_votes must be configured'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When corosync starts to fail and you get the following message in the system log:

----
[...]
corosync[1647]:  [QUORUM] Quorum provider: corosync_votequorum failed to initialize.
corosync[1647]:  [SERV  ] Service engine 'corosync_quorum' failed to load for reason
    'configuration error: nodelist or quorum.expected_votes must be configured!'
[...]
----

It means that the hostname you set for a corosync 'ringX_addr' in the
configuration could not be resolved.

Write Configuration When Not Quorate
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you need to change '/etc/pve/corosync.conf' on a node with no quorum, and you
understand what you are doing, use:
[source,bash]
----
pvecm expected 1
----

This sets the expected vote count to 1 and makes the cluster quorate. You can
then fix your configuration, or revert it back to the last working backup.

This is not enough if corosync cannot start anymore. In that case, it is best to
edit the local copy of the corosync configuration in
'/etc/corosync/corosync.conf', so that corosync can start again. Ensure that on
all nodes, this configuration has the same content to avoid split-brain
situations.


[[pvecm_corosync_conf_glossary]]
Corosync Configuration Glossary
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ringX_addr::
This names the different link addresses for the Kronosnet connections between
nodes.


Cluster Cold Start
------------------

It is obvious that a cluster is not quorate when all nodes are
offline. This is a common case after a power failure.

NOTE: It is always a good idea to use an uninterruptible power supply
(``UPS'', also called ``battery backup'') to avoid this state, especially if
you want HA.

On node startup, the `pve-guests` service is started and waits for
quorum. Once quorate, it starts all guests which have the `onboot`
flag set.

When you turn on nodes, or when power comes back after power failure,
it is likely that some nodes will boot faster than others. Please keep in
mind that guest startup is delayed until you reach quorum.


Guest Migration
---------------

Migrating virtual guests to other nodes is a useful feature in a
cluster. There are settings to control the behavior of such
migrations. This can be done via the configuration file
`datacenter.cfg` or for a specific migration via API or command line
parameters.

It makes a difference if a guest is online or offline, or if it has
local resources (like a local disk).

For details about virtual machine migration, see the
xref:qm_migration[QEMU/KVM Migration Chapter].

For details about container migration, see the
xref:pct_migration[Container Migration Chapter].

Migration Type
~~~~~~~~~~~~~~

The migration type defines if the migration data should be sent over an
encrypted (`secure`) channel or an unencrypted (`insecure`) one.
Setting the migration type to insecure means that the RAM content of a
virtual guest is also transferred unencrypted, which can lead to
information disclosure of critical data from inside the guest (for
example, passwords or encryption keys).

Therefore, we strongly recommend using the secure channel if you do
not have full control over the network and can not guarantee that no
one is eavesdropping on it.

NOTE: Storage migration does not follow this setting. Currently, it
always sends the storage content over a secure channel.
Encryption requires a lot of computing power, so this setting is often
changed to `insecure` to achieve better performance. The impact on
modern systems is lower because they implement AES encryption in
hardware. The performance impact is particularly evident in fast
networks, where you can transfer 10 Gbps or more.

Migration Network
~~~~~~~~~~~~~~~~~

By default, {pve} uses the network in which cluster communication
takes place to send the migration traffic. This is not optimal, both because
sensitive cluster traffic can be disrupted and because this network may not
have the best bandwidth available on the node.

Setting the migration network parameter allows the use of a dedicated
network for all migration traffic. In addition to the memory,
this also affects the storage traffic for offline migrations.

The migration network is set as a network using CIDR notation. This
has the advantage that you don't have to set individual IP addresses
for each node. {pve} can determine the real address on the
destination node from the network specified in the CIDR form. To
enable this, the network must be specified so that each node has exactly one
IP in the respective network.
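
To illustrate the matching, the following purely illustrative shell sketch
checks whether an IPv4 address lies within a CIDR network ({pve} performs this
lookup internally; the helper names here are made up for the example):

```shell
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
  oldifs=$IFS; IFS=.
  set -- $1
  IFS=$oldifs
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed if address $1 falls inside CIDR network $2.
in_cidr() {
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
}

in_cidr 10.1.2.1 10.1.2.0/24 && echo "10.1.2.1 is inside 10.1.2.0/24"
in_cidr 10.1.1.1 10.1.2.0/24 || echo "10.1.1.1 is outside 10.1.2.0/24"
```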

Example
^^^^^^^

We assume that we have a three-node setup, with three separate
networks. One for public communication with the Internet, one for
cluster communication, and a very fast one, which we want to use as a
dedicated network for migration.

A network configuration for such a setup might look as follows:

----
iface eno1 inet manual

# public network
auto vmbr0
iface vmbr0 inet static
    address 192.X.Y.57
    netmask 255.255.255.0
    gateway 192.X.Y.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0

# cluster network
auto eno2
iface eno2 inet static
    address 10.1.1.1
    netmask 255.255.255.0

# fast network
auto eno3
iface eno3 inet static
    address 10.1.2.1
    netmask 255.255.255.0
----

Here, we will use the network 10.1.2.0/24 as a migration network. For
a single migration, you can do this using the `migration_network`
parameter of the command line tool:

----
# qm migrate 106 tre --online --migration_network 10.1.2.0/24
----

To configure this as the default network for all migrations in the
cluster, set the `migration` property of the `/etc/pve/datacenter.cfg`
file:

----
# use dedicated migration network
migration: secure,network=10.1.2.0/24
----

NOTE: The migration type must always be set when the migration network
is set in `/etc/pve/datacenter.cfg`.


ifdef::manvolnum[]
include::pve-copyright.adoc[]
endif::manvolnum[]