[[chapter_pvecm]]
ifdef::manvolnum[]
pvecm(1)
========
:pve-toplevel:

NAME
----

pvecm - Proxmox VE Cluster Manager

SYNOPSIS
--------

include::pvecm.1-synopsis.adoc[]

DESCRIPTION
-----------
endif::manvolnum[]

ifndef::manvolnum[]
Cluster Manager
===============
:pve-toplevel:
endif::manvolnum[]

The {PVE} cluster manager `pvecm` is a tool to create a group of
physical servers. Such a group is called a *cluster*. We use the
http://www.corosync.org[Corosync Cluster Engine] for reliable group
communication, and such clusters can consist of up to 32 physical nodes
(probably more, depending on network latency).

`pvecm` can be used to create a new cluster, join nodes to a cluster,
leave the cluster, get status information and do various other cluster
related tasks. The **P**rox**m**o**x** **C**luster **F**ile **S**ystem (``pmxcfs'')
is used to transparently distribute the cluster configuration to all cluster
nodes.

Grouping nodes into a cluster has the following advantages:

* Centralized, web-based management

* Multi-master clusters: each node can do all management tasks

* `pmxcfs`: database-driven file system for storing configuration files,
replicated in real-time on all nodes using `corosync`.

* Easy migration of virtual machines and containers between physical
hosts

* Fast deployment

* Cluster-wide services like firewall and HA


Requirements
------------

* All nodes must be in the same network, as `corosync` uses IP Multicast
to communicate between nodes (also see
http://www.corosync.org[Corosync Cluster Engine]). Corosync uses UDP
ports 5404 and 5405 for cluster communication.
+
NOTE: Some switches do not support IP multicast by default, and it must be
enabled manually first.

* Date and time have to be synchronized (see the quick check below).

* An SSH tunnel on TCP port 22 between nodes is used.

* If you are interested in High Availability, you need to have at
least three nodes for reliable quorum. All nodes should have the
same version.

* We recommend a dedicated NIC for the cluster traffic, especially if
you use shared storage.

* The root password of a cluster node is required for adding nodes.

NOTE: It is not possible to mix {pve} 3.x and earlier with {pve} 4.x cluster
nodes.

NOTE: While it's possible to mix {pve} 4.4 and {pve} 5.0 nodes, this is not
supported as a production configuration and should only be done temporarily,
while upgrading the whole cluster from one major version to another.
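
As a quick sanity check of the time synchronization and SSH requirements, you
can run the following on each node before creating the cluster. This is just a
minimal sketch; `OTHER-NODE` is a placeholder for a prospective peer:

[source,bash]
----
# verify that the system clock is NTP-synchronized (look for "synchronized: yes")
timedatectl status

# verify that the peer is reachable over SSH on TCP port 22
ssh root@OTHER-NODE true
----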


Preparing Nodes
---------------

First, install {PVE} on all nodes. Make sure that each node is
installed with the final hostname and IP configuration. Changing the
hostname and IP is not possible after cluster creation.

Currently, cluster creation can be done either on the console (login via
`ssh`) or through the API, for which we have a GUI implementation
(__Datacenter -> Cluster__).

While it is common to reference all node names and their IPs in `/etc/hosts`,
this is not strictly necessary for a cluster, which normally uses multicast,
to work. It may be useful though, as you can then connect from one node to
another via SSH, using the easier to remember node name.
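
For example, such `/etc/hosts` entries could look like the following on each
node (the names and addresses are the ones used in the examples below, the
domain is illustrative):

----
192.168.15.91 hp1.example.com hp1
192.168.15.92 hp2.example.com hp2
192.168.15.93 hp3.example.com hp3
192.168.15.94 hp4.example.com hp4
----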

[[pvecm_create_cluster]]
Create the Cluster
------------------

Login via `ssh` to the first {pve} node. Use a unique name for your cluster.
This name cannot be changed later. The cluster name follows the same rules as
node names.

----
hp1# pvecm create CLUSTERNAME
----

CAUTION: The cluster name is used to compute the default multicast address.
Please use unique cluster names if you run more than one cluster inside your
network. To avoid human confusion, it is also recommended to choose different
names even if clusters do not share the cluster network.

To check the state of your cluster use:

----
hp1# pvecm status
----

Multiple Clusters In Same Network
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is possible to create multiple clusters in the same physical or logical
network. Each cluster must have a unique name, which is used to generate the
cluster's multicast group address. As long as no duplicate cluster names are
configured in one network segment, the different clusters won't interfere with
each other.

If multiple clusters operate in a single network, it may be beneficial to set
up an IGMP querier and enable IGMP snooping in said network. This may reduce
the load on the network significantly, because multicast packets are only
delivered to the endpoints of the respective member nodes.
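
On a plain Linux bridge, snooping and the querier can be toggled at runtime
through sysfs. This is a minimal sketch, assuming the bridge is named `vmbr0`;
note that these settings do not persist across reboots:

[source,bash]
----
# enable IGMP snooping on the bridge
echo 1 > /sys/class/net/vmbr0/bridge/multicast_snooping

# let the bridge act as IGMP querier if no other querier is present
echo 1 > /sys/class/net/vmbr0/bridge/multicast_querier
----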


[[pvecm_join_node_to_cluster]]
Adding Nodes to the Cluster
---------------------------

Login via `ssh` to the node you want to add.

----
hp2# pvecm add IP-ADDRESS-CLUSTER
----

For `IP-ADDRESS-CLUSTER`, use the IP of an existing cluster node.

CAUTION: A new node cannot hold any VMs, because you would get
conflicts about identical VM IDs. Also, all existing configuration in
`/etc/pve` is overwritten when you join a new node to the cluster. As a
workaround, use `vzdump` to back up and restore guests to a different VMID
after adding the node to the cluster.

To check the state of the cluster:

----
# pvecm status
----

.Cluster status after adding 4 nodes
----
hp2# pvecm status
Quorum information
~~~~~~~~~~~~~~~~~~
Date:             Mon Apr 20 12:30:13 2015
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000001
Ring ID:          1928
Quorate:          Yes

Votequorum information
~~~~~~~~~~~~~~~~~~~~~~
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           2
Flags:            Quorate

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes Name
0x00000001          1 192.168.15.91
0x00000002          1 192.168.15.92 (local)
0x00000003          1 192.168.15.93
0x00000004          1 192.168.15.94
----

If you only want a list of all nodes, use:

----
# pvecm nodes
----

.List nodes in a cluster
----
hp2# pvecm nodes

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes Name
         1          1 hp1
         2          1 hp2 (local)
         3          1 hp3
         4          1 hp4
----

[[adding-nodes-with-separated-cluster-network]]
Adding Nodes With Separated Cluster Network
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When adding a node to a cluster with a separated cluster network, you need to
use the 'ringX_addr' parameters to set the node's address on those networks:

[source,bash]
----
pvecm add IP-ADDRESS-CLUSTER -ring0_addr IP-ADDRESS-RING0
----

If you want to use the Redundant Ring Protocol, you will also want to pass the
'ring1_addr' parameter, as in the sketch below.
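
For example, joining a cluster that runs the Redundant Ring Protocol over two
rings could look like this (all addresses are placeholders):

[source,bash]
----
pvecm add IP-ADDRESS-CLUSTER -ring0_addr IP-ADDRESS-RING0 \
  -ring1_addr IP-ADDRESS-RING1
----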


Remove a Cluster Node
---------------------

CAUTION: Read the procedure carefully before proceeding, as it may not
be what you want or need.

Move all virtual machines off the node. Make sure you have no local
data or backups you want to keep, or save them accordingly.
In the following example, we will remove the node hp4 from the cluster.

Log in to a *different* cluster node (not hp4), and issue a `pvecm nodes`
command to identify the node ID to remove:

----
hp1# pvecm nodes

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes Name
         1          1 hp1 (local)
         2          1 hp2
         3          1 hp3
         4          1 hp4
----


At this point, you must power off hp4 and make sure that it will not
power on again (in the network) as it is.

IMPORTANT: As said above, it is critical to power off the node
*before* removal, and make sure that it will *never* power on again
(in the existing cluster network) as it is.
If you power on the node as it is, your cluster will end up in a broken
state, and it could be difficult to restore a clean cluster state.

After powering off the node hp4, we can safely remove it from the cluster.

----
hp1# pvecm delnode hp4
----

If the operation succeeds, no output is returned; just check the node
list again with `pvecm nodes` or `pvecm status`. You should see
something like:

----
hp1# pvecm status

Quorum information
~~~~~~~~~~~~~~~~~~
Date:             Mon Apr 20 12:44:28 2015
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1992
Quorate:          Yes

Votequorum information
~~~~~~~~~~~~~~~~~~~~~~
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           3
Flags:            Quorate

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes Name
0x00000001          1 192.168.15.90 (local)
0x00000002          1 192.168.15.91
0x00000003          1 192.168.15.92
----

If, for whatever reason, you want this server to join the same cluster
again, you have to:

* reinstall {pve} on it from scratch

* then join it, as explained in the previous section.

[[pvecm_separate_node_without_reinstall]]
Separate A Node Without Reinstalling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

CAUTION: This is *not* the recommended method, proceed with caution. Use the
above-mentioned method if you're unsure.

You can also separate a node from a cluster without reinstalling it from
scratch. But after removing the node from the cluster, it will still have
access to any shared storage! This must be resolved before you start removing
the node from the cluster. A {pve} cluster cannot share the exact same
storage with another cluster, as storage locking doesn't work across cluster
boundaries. Further, it may also lead to VMID conflicts.

It is suggested that you create a new storage, to which only the node that you
want to separate has access. This can be a new export on your NFS or a new
Ceph pool, to name a few examples. It is just important that the exact same
storage does not get accessed by multiple clusters. After setting up this
storage, move all data from the node and its VMs to it. Then you are ready to
separate the node from the cluster.

WARNING: Ensure all shared resources are cleanly separated! Otherwise you will
run into conflicts and problems.

First, stop the corosync and pve-cluster services on the node:
[source,bash]
----
systemctl stop pve-cluster
systemctl stop corosync
----

Start the cluster filesystem again in local mode:
[source,bash]
----
pmxcfs -l
----

Delete the corosync configuration files:
[source,bash]
----
rm /etc/pve/corosync.conf
rm /etc/corosync/*
----

You can now start the filesystem again as a normal service:
[source,bash]
----
killall pmxcfs
systemctl start pve-cluster
----

The node is now separated from the cluster. You can delete it from any
remaining node of the cluster with:
[source,bash]
----
pvecm delnode oldnode
----

If the command fails because the remaining node in the cluster lost quorum
when the now separate node exited, you may set the expected votes to 1 as a
workaround:
[source,bash]
----
pvecm expected 1
----

And then repeat the 'pvecm delnode' command.

Now switch back to the separated node and delete all remaining files left
over from the old cluster. This ensures that the node can be added to another
cluster again without problems.

[source,bash]
----
rm /var/lib/corosync/*
----

As the configuration files from the other nodes are still in the cluster
filesystem, you may want to clean those up too. Simply remove the whole
directory '/etc/pve/nodes/NODENAME' recursively, but check three times that
you are using the correct one before deleting it.
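
For instance, if the separated node was called hp4 as in the earlier example,
the cleanup on a remaining cluster node would be (make absolutely sure the
name is right):

[source,bash]
----
rm -rf /etc/pve/nodes/hp4
----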

CAUTION: The node's SSH keys are still in the 'authorized_keys' file. This
means that the nodes can still connect to each other with public key
authentication. This should be fixed by removing the respective keys from the
'/etc/pve/priv/authorized_keys' file.

Quorum
------

{pve} uses a quorum-based technique to provide a consistent state among
all cluster nodes.

[quote, from Wikipedia, Quorum (distributed computing)]
____
A quorum is the minimum number of votes that a distributed transaction
has to obtain in order to be allowed to perform an operation in a
distributed system.
____

In case of network partitioning, state changes require that a
majority of nodes are online. The cluster switches to read-only mode
if it loses quorum.

NOTE: {pve} assigns a single vote to each node by default.

Cluster Network
---------------

The cluster network is the core of a cluster. All messages sent over it have to
be delivered reliably to all nodes in their respective order. In {pve} this
part is done by corosync, an implementation of a high-performance, low-overhead,
high-availability development toolkit. It serves our decentralized
configuration file system (`pmxcfs`).

[[cluster-network-requirements]]
Network Requirements
~~~~~~~~~~~~~~~~~~~~
This needs a reliable network with latencies under 2 milliseconds (LAN
performance) to work properly. While corosync can also use unicast for
communication between nodes, it is **highly recommended** to have a multicast
capable network. The network should not be used heavily by other members;
ideally corosync runs on its own network. *Never* share it with a network
where storage communicates too.

Before setting up a cluster, it is good practice to check if the network is fit
for that purpose.

* Ensure that all nodes are in the same subnet. This must only be true for the
network interfaces used for cluster communication (corosync).

* Ensure all nodes can reach each other over those interfaces, using `ping` is
enough for a basic test.

* Ensure that multicast works in general and at high packet rates. This can be
done with the `omping` tool. The final "%loss" number should be < 1%.
+
[source,bash]
----
omping -c 10000 -i 0.001 -F -q NODE1-IP NODE2-IP ...
----

* Ensure that multicast communication works over an extended period of time.
This uncovers problems where IGMP snooping is activated on the network but
no multicast querier is active. This test has a duration of around 10
minutes.
+
[source,bash]
----
omping -c 600 -i 1 -q NODE1-IP NODE2-IP ...
----

Your network is not ready for clustering if any of these tests fails. Recheck
your network configuration. Switches in particular are notorious for having
multicast disabled by default or IGMP snooping enabled with no IGMP querier
active.

In smaller clusters, it is also an option to use unicast if you really cannot
get multicast to work.
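
Corosync is switched to unicast by setting the transport in the `totem`
section of its configuration. This is only a sketch; see the
<<edit-corosync-conf,edit the corosync.conf file>> section for how to apply
such a change safely:

----
totem {
  # ... keep the existing totem settings and bump config_version ...
  transport: udpu
}
----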

Separate Cluster Network
~~~~~~~~~~~~~~~~~~~~~~~~

When creating a cluster without any parameters, the cluster network is
generally shared with the web UI and the VMs and their traffic. Depending on
your setup, even storage traffic may get sent over the same network. It is
recommended to change that, as corosync is a time-critical, real-time
application.

Setting Up A New Network
^^^^^^^^^^^^^^^^^^^^^^^^

First, you have to set up a new network interface. It should be on a
physically separate network. Ensure that your network fulfills the
<<cluster-network-requirements,cluster network requirements>>.
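
A dedicated interface could, for instance, be configured statically in
`/etc/network/interfaces`; the interface name is a placeholder, and the
address matches the example below:

----
auto eno2
iface eno2 inet static
        address 10.10.10.1
        netmask 255.255.255.128
----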

Separate On Cluster Creation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is possible through the 'ring0_addr' and 'bindnet0_addr' parameters of
the 'pvecm create' command, used for creating a new cluster.

If you have set up an additional NIC with a static address on 10.10.10.1/25,
and want to send and receive all cluster communication over this interface,
you would execute:

[source,bash]
----
pvecm create test --ring0_addr 10.10.10.1 --bindnet0_addr 10.10.10.0
----

To check if everything is working properly, execute:
[source,bash]
----
systemctl status corosync
----

Afterwards, proceed as described in the section on how to
<<adding-nodes-with-separated-cluster-network,add nodes with a separated cluster network>>.

[[separate-cluster-net-after-creation]]
Separate After Cluster Creation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can also do this if you have already created a cluster and want to switch
its communication to another network, without rebuilding the whole cluster.
This change may lead to short periods of quorum loss in the cluster, as nodes
have to restart corosync and come up one after the other on the new network.

Check how to <<edit-corosync-conf,edit the corosync.conf file>> first.
Then open it and you should see a file similar to:

----
logging {
  debug: off
  to_syslog: yes
}

nodelist {

  node {
    name: due
    nodeid: 2
    quorum_votes: 1
    ring0_addr: due
  }

  node {
    name: tre
    nodeid: 3
    quorum_votes: 1
    ring0_addr: tre
  }

  node {
    name: uno
    nodeid: 1
    quorum_votes: 1
    ring0_addr: uno
  }

}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: thomas-testcluster
  config_version: 3
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    bindnetaddr: 192.168.30.50
    ringnumber: 0
  }

}
----

The first thing you want to do is add the 'name' properties in the node
entries, if you do not see them already. Those *must* match the node name.

Then replace the addresses in the 'ring0_addr' properties with the new
addresses. You may use plain IP addresses or hostnames here. If you use
hostnames, ensure that they are resolvable from all nodes.

In my example, I want to switch my cluster communication to the 10.10.10.1/25
network. So I replace all 'ring0_addr' entries accordingly. I also set the
'bindnetaddr' in the totem section of the config to an address of the new
network. It can be any address from the subnet configured on the new network
interface.

After you have increased the 'config_version' property, the new configuration
file should look like:

----
logging {
  debug: off
  to_syslog: yes
}

nodelist {

  node {
    name: due
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2
  }

  node {
    name: tre
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.10.3
  }

  node {
    name: uno
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
  }

}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: thomas-testcluster
  config_version: 4
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    bindnetaddr: 10.10.10.1
    ringnumber: 0
  }

}
----

Now, after a final check that all the changed information is correct, we save
it and once again follow the <<edit-corosync-conf,edit the corosync.conf
file>> section to learn how to bring it into effect.

As our change cannot be applied live by corosync, we have to do a restart.

On a single node execute:
[source,bash]
----
systemctl restart corosync
----

Now check if everything is fine:

[source,bash]
----
systemctl status corosync
----

If corosync runs correctly again, restart it on all other nodes too.
They will then join the cluster membership one by one on the new network.

[[pvecm_rrp]]
Redundant Ring Protocol
~~~~~~~~~~~~~~~~~~~~~~~
To avoid a single point of failure, you should implement countermeasures.
This can be done on the hardware and operating system level through network
bonding.

Corosync itself also offers a possibility to add redundancy, through the
so-called 'Redundant Ring Protocol'. This protocol allows running a second
totem ring on another network; this network should be physically separated
from the other ring's network to actually increase availability.

RRP On Cluster Creation
~~~~~~~~~~~~~~~~~~~~~~~

The 'pvecm create' command provides the additional parameters 'bindnetX_addr',
'ringX_addr' and 'rrp_mode', which can be used for RRP configuration.

NOTE: See the <<corosync-conf-glossary,glossary>> if you do not know what each parameter means.

So if you have two networks, one on the 10.10.10.1/24 and the other on the
10.10.20.1/24 subnet, you would execute:

[source,bash]
----
pvecm create CLUSTERNAME -bindnet0_addr 10.10.10.1 -ring0_addr 10.10.10.1 \
-bindnet1_addr 10.10.20.1 -ring1_addr 10.10.20.1
----

RRP On Existing Clusters
~~~~~~~~~~~~~~~~~~~~~~~~

You will take similar steps as described in
<<separate-cluster-net-after-creation,separating the cluster network>> to
enable RRP on an already running cluster. The single difference is that you
will add `ring1`, using it where those steps used `ring0`.

First, add a new `interface` subsection in the `totem` section and set its
`ringnumber` property to `1`. Set the interface's `bindnetaddr` property to an
address of the subnet you have configured for your new ring.
Further, set the `rrp_mode` to `passive`; this is the only stable mode.

Then add to each node entry in the `nodelist` section its new `ring1_addr`
property with the node's additional ring address.

So if you have two networks, one on the 10.10.10.1/24 and the other on the
10.10.20.1/24 subnet, the final configuration file should look like:

----
totem {
  cluster_name: tweak
  config_version: 9
  ip_version: ipv4
  rrp_mode: passive
  secauth: on
  version: 2
  interface {
    bindnetaddr: 10.10.10.1
    ringnumber: 0
  }
  interface {
    bindnetaddr: 10.10.20.1
    ringnumber: 1
  }
}

nodelist {
  node {
    name: pvecm1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
    ring1_addr: 10.10.20.1
  }

  node {
    name: pvecm2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2
    ring1_addr: 10.10.20.2
  }

  [...] # other cluster nodes here
}

[...] # other remaining config sections here

----

Bring it into effect as described in the
<<edit-corosync-conf,edit the corosync.conf file>> section.

This is a change which cannot take effect live and needs at least a restart of
corosync. A restart of the whole cluster is recommended.

If you cannot reboot the whole cluster, ensure that no High Availability
services are configured, and then stop the corosync service on all nodes.
After corosync is stopped on all nodes, start it again one node after the
other.

Corosync Configuration
----------------------

The `/etc/pve/corosync.conf` file plays a central role in a {pve} cluster. It
controls the cluster membership and its network.
To read more about it, check the corosync.conf man page:
[source,bash]
----
man corosync.conf
----

For node membership, you should always use the `pvecm` tool provided by {pve}.
You may have to edit the configuration file manually for other changes.
Here are a few best practice tips for doing this.

[[edit-corosync-conf]]
Edit corosync.conf
~~~~~~~~~~~~~~~~~~

Editing the corosync.conf file is not always straightforward. There are two
copies on each cluster node: one in `/etc/pve/corosync.conf` and the other in
`/etc/corosync/corosync.conf`. Editing the one in our cluster file system will
propagate the changes to the local one, but not vice versa.

The configuration will get updated automatically as soon as the file changes.
This means that changes which can be integrated in a running corosync will
take effect instantly. So you should always make a copy and edit that instead,
to avoid triggering unintended changes by an intermediate save.

[source,bash]
----
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
----

Then open the config file with your favorite editor; `nano` and `vim.tiny` are
preinstalled on {pve}, for example.

NOTE: Always increment the 'config_version' number on configuration changes;
omitting this can lead to problems.

After making the necessary changes, create another copy of the current working
configuration file. This serves as a backup if the new configuration fails to
apply or causes problems in other ways.

[source,bash]
----
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.bak
----

Then move the new configuration file over the old one:
[source,bash]
----
mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf
----

With the following commands you can check whether the change could be applied
automatically:
[source,bash]
----
systemctl status corosync
journalctl -b -u corosync
----

If not, you may have to restart the corosync service via:
[source,bash]
----
systemctl restart corosync
----

On errors, check the troubleshooting section below.

Troubleshooting
~~~~~~~~~~~~~~~

Issue: 'quorum.expected_votes must be configured'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If corosync begins to fail and you get the following message in the system
log:

----
[...]
corosync[1647]:  [QUORUM] Quorum provider: corosync_votequorum failed to initialize.
corosync[1647]:  [SERV  ] Service engine 'corosync_quorum' failed to load for reason
'configuration error: nodelist or quorum.expected_votes must be configured!'
[...]
----

it means that the hostname you set for a corosync 'ringX_addr' in the
configuration could not be resolved.
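
You can quickly verify name resolution on each node with `getent`; `NODENAME`
is a placeholder for the name used in 'ringX_addr':

[source,bash]
----
# should print the address that corosync will resolve for this name
getent hosts NODENAME
----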


Write Configuration When Not Quorate
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you need to change '/etc/pve/corosync.conf' on a node with no quorum, and
you know what you are doing, use:
[source,bash]
----
pvecm expected 1
----

This sets the expected vote count to 1 and makes the cluster quorate. You can
now fix your configuration, or revert it back to the last working backup.

This is not enough if corosync cannot start anymore. In that case, it is best
to edit the local copy of the corosync configuration in
'/etc/corosync/corosync.conf', so that corosync can start again. Ensure that
this configuration has the same content on all nodes, to avoid split-brain
situations. If you are not sure what went wrong, it's best to ask the Proxmox
Community to help you.


[[corosync-conf-glossary]]
Corosync Configuration Glossary
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ringX_addr::
This names the different ring addresses for the corosync totem rings used for
the cluster communication.

bindnetaddr::
Defines the interface to which the ring should bind. It may be any address of
the subnet configured on the interface we want to use. In general, it is
recommended to just use an address that a node uses on this interface.

rrp_mode::
Specifies the mode of the redundant ring protocol and may be passive, active
or none. Note that the use of active is highly experimental and not officially
supported. Passive is the preferred mode; it may double the cluster
communication throughput and increases availability.


Cluster Cold Start
------------------

It is obvious that a cluster is not quorate when all nodes are
offline. This is a common case after a power failure.

NOTE: It is always a good idea to use an uninterruptible power supply
(``UPS'', also called ``battery backup'') to avoid this state, especially if
you want HA.

On node startup, the `pve-guests` service is started and waits for
quorum. Once quorate, it starts all guests which have the `onboot`
flag set.

When you turn on nodes, or when power comes back after a power failure,
it is likely that some nodes will boot faster than others. Please keep in
mind that guest startup is delayed until you reach quorum.
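
The `onboot` flag can be set per guest; for example, for a hypothetical VM 100
and container 200:

[source,bash]
----
# start this VM automatically once the node is up and the cluster is quorate
qm set 100 --onboot 1

# the same for a container
pct set 200 --onboot 1
----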


Guest Migration
---------------

Migrating virtual guests to other nodes is a useful feature in a
cluster. There are settings to control the behavior of such
migrations. This can be done via the configuration file
`datacenter.cfg`, or for a specific migration via API or command line
parameters.

It makes a difference if a guest is online or offline, or if it has
local resources (like a local disk).

For details about virtual machine migration, see the
xref:qm_migration[QEMU/KVM Migration Chapter].

For details about container migration, see the
xref:pct_migration[Container Migration Chapter].

Migration Type
~~~~~~~~~~~~~~

The migration type defines if the migration data should be sent over an
encrypted (`secure`) channel or an unencrypted (`insecure`) one.
Setting the migration type to insecure means that the RAM content of a
virtual guest is also transferred unencrypted, which can lead to
information disclosure of critical data from inside the guest (for
example, passwords or encryption keys).

Therefore, we strongly recommend using the secure channel if you do
not have full control over the network and cannot guarantee that no
one is eavesdropping on it.

NOTE: Storage migration does not follow this setting. Currently, it
always sends the storage content over a secure channel.

Encryption requires a lot of computing power, so this setting is often
changed to `insecure` to achieve better performance. The impact on
modern systems is lower because they implement AES encryption in
hardware. The performance impact is particularly evident in fast
networks, where you can transfer 10 Gbps or more.
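
For example, a single online migration over a fully trusted network could
select the unencrypted channel explicitly. This is a sketch, with the VMID and
target node name taken from the example below:

----
# qm migrate 106 tre --online --migration_type insecure
----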


Migration Network
~~~~~~~~~~~~~~~~~

By default, {pve} uses the network in which cluster communication
takes place to send the migration traffic. This is not optimal, because
sensitive cluster traffic can be disrupted and this network may not
have the best bandwidth available on the node.

Setting the migration network parameter allows the use of a dedicated
network for the entire migration traffic. In addition to the memory,
this also affects the storage traffic for offline migrations.

The migration network is set as a network in CIDR notation. This
has the advantage that you do not have to set individual IP addresses
for each node. {pve} can determine the real address on the
destination node from the network specified in the CIDR form. To
enable this, the network must be specified so that each node has
exactly one IP in the respective network.


Example
^^^^^^^

We assume that we have a three-node setup with three separate
networks. One for public communication with the Internet, one for
cluster communication, and a very fast one, which we want to use as a
dedicated network for migration.

A network configuration for such a setup might look as follows:

----
iface eno1 inet manual

# public network
auto vmbr0
iface vmbr0 inet static
    address 192.X.Y.57
    netmask 255.255.255.0
    gateway 192.X.Y.1
    bridge_ports eno1
    bridge_stp off
    bridge_fd 0

# cluster network
auto eno2
iface eno2 inet static
    address 10.1.1.1
    netmask 255.255.255.0

# fast network
auto eno3
iface eno3 inet static
    address 10.1.2.1
    netmask 255.255.255.0
----

Here, we will use the network 10.1.2.0/24 as a migration network. For
a single migration, you can do this using the `migration_network`
parameter of the command line tool:

----
# qm migrate 106 tre --online --migration_network 10.1.2.0/24
----

To configure this as the default network for all migrations in the
cluster, set the `migration` property of the `/etc/pve/datacenter.cfg`
file:

----
# use dedicated migration network
migration: secure,network=10.1.2.0/24
----

NOTE: The migration type must always be set when the migration network
gets set in `/etc/pve/datacenter.cfg`.


ifdef::manvolnum[]
include::pve-copyright.adoc[]
endif::manvolnum[]