[[chapter_ha_manager]]
ifdef::manvolnum[]
ha-manager(1)
=============
:pve-toplevel:

NAME
----

ha-manager - Proxmox VE HA Manager

SYNOPSIS
--------

include::ha-manager.1-synopsis.adoc[]

DESCRIPTION
-----------
endif::manvolnum[]
ifndef::manvolnum[]
High Availability
=================
:pve-toplevel:
endif::manvolnum[]

Our modern society depends heavily on information provided by
computers over the network. Mobile devices amplified that dependency,
because people can access the network any time from anywhere. If you
provide such services, it is very important that they are available
most of the time.

We can mathematically define the availability as the ratio of (A) the
total time a service is capable of being used during a given interval
to (B) the length of the interval. It is normally expressed as a
percentage of uptime in a given year.
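
As a quick sanity check of the table below, the downtime figures follow
directly from this ratio. For example, for 99.9% availability over one year:

----
(1 - 0.999) * 365 days * 24 hours = 8.76 hours of downtime per year
----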

.Availability - Downtime per Year
[width="60%",cols="<d,d",options="header"]
|===========================================================
|Availability % |Downtime per year
|99 |3.65 days
|99.9 |8.76 hours
|99.99 |52.56 minutes
|99.999 |5.26 minutes
|99.9999 |31.5 seconds
|99.99999 |3.15 seconds
|===========================================================

There are several ways to increase availability. The most elegant
solution is to rewrite your software, so that you can run it on
several hosts at the same time. The software itself needs to have a way
to detect errors and do failover. This is relatively easy if you just
want to serve read-only web pages. But in general this is complex, and
sometimes impossible, because you cannot modify the software
yourself. The following solutions work without modifying the
software:

* Use reliable ``server'' components
+
NOTE: Computer components with the same functionality can have varying
reliability numbers, depending on the component quality. Most vendors
sell components with higher reliability as ``server'' components -
usually at a higher price.

* Eliminate single points of failure (redundant components)
** use an uninterruptible power supply (UPS)
** use redundant power supplies on the main boards
** use ECC-RAM
** use redundant network hardware
** use RAID for local storage
** use distributed, redundant storage for VM data

* Reduce downtime
** rapidly accessible administrators (24/7)
** availability of spare parts (other nodes in a {pve} cluster)
** automatic error detection (provided by `ha-manager`)
** automatic failover (provided by `ha-manager`)

Virtualization environments like {pve} make it much easier to reach
high availability because they remove the ``hardware'' dependency. They
also make it possible to set up and use redundant storage and network
devices. So if one host fails, you can simply start those services on
another host within your cluster.

Even better, {pve} provides a software stack called `ha-manager`,
which can do that automatically for you. It is able to automatically
detect errors and do automatic failover.

{pve} `ha-manager` works like an ``automated'' administrator. First, you
configure what resources (VMs, containers, ...) it should
manage. `ha-manager` then observes correct functionality, and handles
service failover to another node in case of errors. `ha-manager` can
also handle normal user requests which may start, stop, relocate and
migrate a service.

But high availability comes at a price. High quality components are
more expensive, and making them redundant at least doubles the
costs. Additional spare parts increase costs further. So you should
carefully calculate the benefits, and compare them with those additional
costs.

TIP: Increasing availability from 99% to 99.9% is relatively
simple. But increasing availability from 99.9999% to 99.99999% is very
hard and costly. `ha-manager` has typical error detection and failover
times of about 2 minutes, so you can get no more than 99.999%
availability.


Requirements
------------

You must meet the following requirements before you start with HA:

* at least three cluster nodes (to get reliable quorum)

* shared storage for VMs and containers

* hardware redundancy (everywhere)

* use reliable ``server'' components

* hardware watchdog - if not available we fall back to the
Linux kernel software watchdog (`softdog`)

* optional hardware fencing devices

[[ha_manager_resources]]
Resources
---------

We call the primary management unit handled by `ha-manager` a
resource. A resource (also called ``service'') is uniquely
identified by a service ID (SID), which consists of the resource type
and a type specific ID, e.g.: `vm:100`. That example would be a
resource of type `vm` (virtual machine) with the ID 100.

For now we have two important resource types - virtual machines and
containers. One basic idea here is that we can bundle related software
into such a VM or container, so there is no need to compose one big
service from other services, as was done with `rgmanager`. In
general, an HA enabled resource should not depend on other resources.


How It Works
------------

This section provides a detailed description of the {PVE} HA manager
internals. It describes all involved daemons and how they work
together. To provide HA, two daemons run on each node:

`pve-ha-lrm`::

The local resource manager (LRM), which controls the services running on
the local node. It reads the requested states for its services from
the current manager status file and executes the respective commands.

`pve-ha-crm`::

The cluster resource manager (CRM), which makes the cluster wide
decisions. It sends commands to the LRM, processes the results,
and moves resources to other nodes if something fails. The CRM also
handles node fencing.

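Both daemons run as system services. One quick way to verify that they are
active on a node is with `systemctl`, for example:

----
# systemctl status pve-ha-crm pve-ha-lrm
----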

.Locks in the LRM & CRM
[NOTE]
Locks are provided by our distributed configuration file system (pmxcfs).
They are used to guarantee that each LRM is active once and working. As an
LRM only executes actions when it holds its lock, we can mark a failed node
as fenced if we can acquire its lock. This then lets us recover any failed
HA services safely, without any interference from the now unknown failed node.
All of this gets supervised by the CRM, which currently holds the manager
master lock.


Service States
~~~~~~~~~~~~~~

[thumbnail="gui-ha-manager-status.png"]

The CRM uses a service state enumeration to record the current service
state. We display this state in the GUI and you can query it using
the `ha-manager` command line tool:

----
# ha-manager status
quorum OK
master elsa (active, Mon Nov 21 07:23:29 2016)
lrm elsa (active, Mon Nov 21 07:23:22 2016)
service ct:100 (elsa, stopped)
service ct:102 (elsa, started)
service vm:501 (elsa, started)
----

Here is the list of possible states:

stopped::

Service is stopped (confirmed by the LRM). If the LRM detects that a stopped
service is still running, it will stop it again.

request_stop::

Service should be stopped. The CRM waits for confirmation from the
LRM.

started::

Service is active, and the LRM should start it ASAP if not already running.
If the service fails and is detected to be not running, the LRM
restarts it
(see xref:ha_manager_start_failure_policy[Start Failure Policy]).

fence::

Wait for node fencing (the service node is not inside the quorate cluster
partition). As soon as the node gets fenced successfully, the service will
be recovered to another node, if possible
(see xref:ha_manager_fencing[Fencing]).

freeze::

Do not touch the service state. We use this state while we reboot a
node, or when we restart the LRM daemon
(see xref:ha_manager_package_updates[Package Updates]).

migrate::

Migrate the service (live) to another node.

error::

Service is disabled because of LRM errors. Needs manual intervention
(see xref:ha_manager_error_recovery[Error Recovery]).


Local Resource Manager
~~~~~~~~~~~~~~~~~~~~~~

The local resource manager (`pve-ha-lrm`) is started as a daemon on
boot and waits until the HA cluster is quorate and thus cluster wide
locks are working.

It can be in three states:

wait for agent lock::

The LRM waits for its exclusive lock. This is also used as the idle state
if no service is configured.

active::

The LRM holds its exclusive lock and has services configured.

lost agent lock::

The LRM lost its lock. This means a failure happened and quorum was lost.

After the LRM gets into the active state, it reads the manager status
file in `/etc/pve/ha/manager_status` and determines the commands it
has to execute for the services it owns.
For each command a worker gets started. These workers run in
parallel and are limited to at most 4 by default. This default setting
may be changed through the datacenter configuration key `max_worker`.
When finished, the worker process gets collected and its result saved for
the CRM.

.Maximum Concurrent Worker Adjustment Tips
[NOTE]
The default value of at most 4 concurrent workers may be unsuited for
a specific setup. For example, 4 live migrations may happen at the same
time, which can lead to network congestion with slower networks and/or
big (memory wise) services. Ensure that no congestion happens even in the
worst case, and lower the `max_worker` value if needed. On the contrary, if
you have a particularly powerful, high end setup you may also want to
increase it.

Each command requested by the CRM is uniquely identifiable by a UID. When
the worker finishes, its result will be processed and written to the LRM
status file `/etc/pve/nodes/<nodename>/lrm_status`. There the CRM may collect
it and let its state machine act on it, depending on the command's output.

The actions on each service between CRM and LRM are normally always synced.
This means that the CRM requests a state uniquely marked by a UID, and the LRM
then executes this action *one time* and writes back the result, which is also
identifiable by the same UID. This is needed so that the LRM does not
execute an outdated command.
The only exceptions are the `stop` and the `error` command;
those two do not depend on the result produced and are executed
always in the case of the stopped state, and once in the case of
the error state.

.Read the Logs
[NOTE]
The HA stack logs every action it takes. This helps to understand what
happens in the cluster, and also why. Here it is important to see
what both daemons, the LRM and the CRM, did. You may use
`journalctl -u pve-ha-lrm` on the node(s) where the service is, and
the same command for the `pve-ha-crm` on the node which is the current master.

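For example, to inspect the HA related logs on a node (and the CRM part on
the current master):

----
# journalctl -u pve-ha-lrm
# journalctl -u pve-ha-crm
----
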
Cluster Resource Manager
~~~~~~~~~~~~~~~~~~~~~~~~

The cluster resource manager (`pve-ha-crm`) starts on each node and
waits there for the manager lock, which can only be held by one node
at a time. The node which successfully acquires the manager lock gets
promoted to the CRM master.

It can be in three states:

wait for agent lock::

The CRM waits for its exclusive lock. This is also used as the idle state
if no service is configured.

active::

The CRM holds its exclusive lock and has services configured.

lost agent lock::

The CRM lost its lock. This means a failure happened and quorum was lost.

Its main task is to manage the services which are configured to be highly
available and try to always enforce the requested state. For example, an
enabled service will be started if it is not running; if it crashes, it will
be started again. Thus the CRM dictates the actions the LRM needs to execute.

When a node leaves the cluster quorum, its state changes to unknown.
If the current CRM can then secure the failed node's lock, the services
will be 'stolen' and restarted on another node.

When a cluster member determines that it is no longer in the cluster
quorum, the LRM waits for a new quorum to form. As long as there is no
quorum, the node cannot reset the watchdog. This will trigger a reboot
after the watchdog times out; this happens after 60 seconds.


Configuration
-------------

The HA stack is well integrated into the {pve} API. So, for example,
HA can be configured via the `ha-manager` command line interface, or
the {pve} web interface - both interfaces provide an easy way to
manage HA. Automation tools can use the API directly.

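For example, automation tools could query the list of HA managed resources
through the API with `pvesh` (shown here only as a sketch):

----
# pvesh get /cluster/ha/resources
----
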
All HA configuration files are within `/etc/pve/ha/`, so they get
automatically distributed to the cluster nodes, and all nodes share
the same HA configuration.


Resources
~~~~~~~~~

[thumbnail="gui-ha-manager-resources-view.png"]

The resource configuration file `/etc/pve/ha/resources.cfg` stores
the list of resources managed by `ha-manager`. A resource configuration
inside that list looks like this:

----
<type>: <name>
        <property> <value>
        ...
----

It starts with a resource type, followed by a resource specific name,
separated by a colon. Together this forms the HA resource ID, which is
used by all `ha-manager` commands to uniquely identify a resource
(example: `vm:100` or `ct:101`). The next lines contain additional
properties:

include::ha-resources-opts.adoc[]

Here is a real world example with one VM and one container. As you see,
the syntax of those files is really simple, so it is even possible to
read or edit those files using your favorite editor:

.Configuration Example (`/etc/pve/ha/resources.cfg`)
----
vm: 501
    state started
    max_relocate 2

ct: 102
    # Note: use default settings for everything
----

[thumbnail="gui-ha-manager-add-resource.png"]

The above config was generated using the `ha-manager` command line tool:

----
# ha-manager add vm:501 --state started --max_relocate 2
# ha-manager add ct:102
----


[[ha_manager_groups]]
Groups
~~~~~~

[thumbnail="gui-ha-manager-groups-view.png"]

The HA group configuration file `/etc/pve/ha/groups.cfg` is used to
define groups of cluster nodes. A resource can be restricted to run
only on the members of such a group. A group configuration looks like
this:

----
group: <group>
       nodes <node_list>
       <property> <value>
       ...
----

include::ha-groups-opts.adoc[]

[thumbnail="gui-ha-manager-add-group.png"]

A common requirement is that a resource should run on a specific
node. Usually the resource is able to run on other nodes, so you can define
an unrestricted group with a single member:

----
# ha-manager groupadd prefer_node1 --nodes node1
----

For bigger clusters, it makes sense to define a more detailed failover
behavior. For example, you may want to run a set of services on
`node1` if possible. If `node1` is not available, you want to run them
equally split on `node2` and `node3`. If those nodes also fail, the
services should run on `node4`. To achieve this you could set the node
list to:

----
# ha-manager groupadd mygroup1 -nodes "node1:2,node2:1,node3:1,node4"
----

Another use case is if a resource uses other resources only available
on specific nodes, let's say `node1` and `node2`. We need to make sure
that the HA manager does not use other nodes, so we need to create a
restricted group with said nodes:

----
# ha-manager groupadd mygroup2 -nodes "node1,node2" -restricted
----

The above commands created the following group configuration file:

.Configuration Example (`/etc/pve/ha/groups.cfg`)
----
group: prefer_node1
       nodes node1

group: mygroup1
       nodes node2:1,node4,node1:2,node3:1

group: mygroup2
       nodes node2,node1
       restricted 1
----


The `nofailback` option is mostly useful to avoid unwanted resource
movements during administration tasks. For example, if you need to
migrate a service to a node which doesn't have the highest priority in the
group, you need to tell the HA manager not to move this service
instantly back by setting the `nofailback` option.

Another scenario is when a service was fenced and it got recovered to
another node. The admin tries to repair the fenced node and brings it
back online to investigate the failure cause and check if it runs
stably again. Setting the `nofailback` flag prevents the recovered
services from moving straight back to the fenced node.

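For example, assuming the `prefer_node1` group created above, the flag could
be set like this (a sketch):

----
# ha-manager groupset prefer_node1 --nofailback 1
----
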

[[ha_manager_fencing]]
Fencing
-------

On node failures, fencing ensures that the erroneous node is
guaranteed to be offline. This is required to make sure that no
resource runs twice when it gets recovered on another node. This is a
really important task, because without it, it would not be possible to
recover a resource on another node.

If a node did not get fenced, it would be in an unknown state, where
it may still have access to shared resources. This is really
dangerous! Imagine that every network but the storage one broke. Now,
while not reachable from the public network, the VM still runs and
writes to the shared storage.

If we then simply started this VM up on another node, we would get a
dangerous race condition, because we would write from both nodes. Such a
condition can destroy all VM data and the whole VM could be rendered
unusable. The recovery could also fail if the storage protects against
multiple mounts.


How {pve} Fences
~~~~~~~~~~~~~~~~

There are different methods to fence a node, for example, fence
devices which cut off the power from the node or disable its
communication completely. Those are often quite expensive and bring
additional critical components into a system, because if they fail you
cannot recover any service.

We thus wanted to integrate a simpler fencing method, which does not
require additional external hardware. This can be done using
watchdog timers.

.Possible Fencing Methods
- external power switches
- isolate nodes by disabling complete network traffic on the switch
- self fencing using watchdog timers

Watchdog timers have been widely used in critical and dependable systems
since the beginning of microcontrollers. They are often independent
and simple integrated circuits which are used to detect and recover
from computer malfunctions.

During normal operation, `ha-manager` regularly resets the watchdog
timer to prevent it from elapsing. If, due to a hardware fault or
program error, the computer fails to reset the watchdog, the timer
will elapse and trigger a reset of the whole server (reboot).

Recent server motherboards often include such hardware watchdogs, but
these need to be configured. If no watchdog is available or
configured, we fall back to the Linux kernel 'softdog'. While still
reliable, it is not independent of the server's hardware, and thus has
a lower reliability than a hardware watchdog.


Configure Hardware Watchdog
~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, all hardware watchdog modules are blocked for security
reasons. They are like a loaded gun if not correctly initialized. To
enable a hardware watchdog, you need to specify the module to load in
'/etc/default/pve-ha-manager', for example:

----
# select watchdog module (default is softdog)
WATCHDOG_MODULE=iTCO_wdt
----

This configuration is read by the 'watchdog-mux' service, which loads
the specified module at startup.

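To verify which watchdog module actually got loaded, you could, for example,
check the kernel module list (a sketch, assuming the `iTCO_wdt` module from
the example above):

----
# lsmod | grep -e softdog -e iTCO_wdt
----
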

Recover Fenced Services
~~~~~~~~~~~~~~~~~~~~~~~

After a node failed and its fencing was successful, the CRM tries to
move services from the failed node to nodes which are still online.

The selection of nodes, on which those services get recovered, is
influenced by the resource `group` settings, the list of currently active
nodes, and their respective active service count.

The CRM first builds a set out of the intersection between user selected
nodes (from the `group` setting) and available nodes. It then chooses the
subset of nodes with the highest priority, and finally selects the node
with the lowest active service count. This minimizes the possibility
of an overloaded node.

CAUTION: On node failure, the CRM distributes services to the
remaining nodes. This increases the service count on those nodes, and
can lead to high load, especially on small clusters. Please design
your cluster so that it can handle such worst case scenarios.


[[ha_manager_start_failure_policy]]
Start Failure Policy
--------------------

The start failure policy comes into effect if a service fails to start on a
node one or more times. It can be used to configure how often a restart
should be triggered on the same node and how often a service should be
relocated, so that it gets a chance to be started on another node.
The aim of this policy is to circumvent temporary unavailability of shared
resources on a specific node. For example, if a shared storage isn't available
on a quorate node anymore, e.g. because of network problems, but is still
available on other nodes, the relocate policy allows the service to get
started nonetheless.

There are two service start recovery policy settings which can be configured
specifically for each resource, as shown in the example below.

max_restart::

Maximum number of attempts to restart a failed service on the actual
node. The default is set to one.

max_relocate::

Maximum number of attempts to relocate the service to a different node.
A relocate only happens after the max_restart value is exceeded on the
actual node. The default is set to one.

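For example, both limits could be raised for the VM from the earlier
configuration example (a sketch, adjust the values to your needs):

----
# ha-manager set vm:501 --max_restart 2 --max_relocate 3
----
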
NOTE: The relocate count state will only reset to zero when the
service has had at least one successful start. That means that if a service is
re-enabled without fixing the error, only the restart policy gets
repeated.


[[ha_manager_error_recovery]]
Error Recovery
--------------

If, after all attempts, the service state could not be recovered, it gets
placed in an error state. In this state, the service won't get touched
by the HA stack anymore. To recover from this state you should follow
these steps:

* bring the resource back into a safe and consistent state (e.g.,
by killing its process)

* disable the HA resource to place it in a stopped state

* fix the error which led to these failures

* *after* you fixed all errors you may enable the service again, as
sketched in the example below

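A minimal sketch of this cycle, assuming a resource `vm:100` and that the
underlying problem gets fixed between the two commands (the exact state
names may vary between versions, see the `ha-manager` man page):

----
# ha-manager set vm:100 --state disabled
# ha-manager set vm:100 --state started
----
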

Node Maintenance
----------------

It is sometimes necessary to shut down or reboot a node to do
maintenance tasks, such as to replace hardware or simply to install a
new kernel image.


Shutdown
~~~~~~~~

A shutdown ('poweroff') is usually done if the node is planned to stay
down for some time. The LRM stops all managed services in that
case. This means that other nodes will take over those services
afterwards.

NOTE: Recent hardware has large amounts of RAM. So we stop all
resources, then restart them to avoid online migration of all that
RAM. If you want to use online migration, you need to invoke that
manually before you shut down the node.


Reboot
~~~~~~

Node reboots are initiated with the 'reboot' command. This is usually
done after installing a new kernel. Please note that this is different
from ``shutdown'', because the node immediately starts again.

The LRM tells the CRM that it wants to restart, and waits until the
CRM puts all resources into the `freeze` state. This prevents those
resources from being moved to other nodes. Instead, the CRM starts the
resources on the same node after the reboot.


Manual Resource Movement
~~~~~~~~~~~~~~~~~~~~~~~~

Last but not least, you can also move resources manually to other
nodes before you shut down or restart a node. The advantage is that you
have full control, and you can decide if you want to use online
migration or not.

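For example, to live migrate an HA managed VM away before maintenance
(a sketch, assuming resource `vm:100` and a target node named `node2`):

----
# ha-manager migrate vm:100 node2
----
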
NOTE: Please do not 'kill' services like `pve-ha-crm`, `pve-ha-lrm` or
`watchdog-mux`. They manage and use the watchdog, so this can result
in a node reboot.


[[ha_manager_package_updates]]
Package Updates
---------------

When updating the ha-manager, you should do one node after the other, never
all at once, for various reasons. First, while we test our software
thoroughly, a bug affecting your specific setup cannot totally be ruled out.
Upgrading one node after the other and checking the functionality of each node
after finishing the update helps to recover from eventual problems, while
updating all at once could leave you with a broken cluster state and is
generally not good practice.

Also, the {pve} HA stack uses a request acknowledge protocol to perform
actions between the cluster and the local resource manager. For restarting,
the LRM makes a request to the CRM to freeze all its services. This prevents
them from getting touched by the cluster during the short time the LRM is
restarting. After that, the LRM may safely close the watchdog during a restart.
Such a restart normally happens during a package update and, as already stated,
an active master CRM is needed to acknowledge the requests from the LRM. If
this is not the case, the update process can take too long which, in the worst
case, may result in a watchdog reset.


[[ha_manager_service_operations]]
Service Operations
------------------

This is how the basic user-initiated service operations (via
`ha-manager`) work.

enable::

The service will be started by the LRM if not already running.

disable::

The service will be stopped by the LRM if running.

migrate/relocate::

The service will be relocated (live) to another node.

remove::

The service will be removed from the HA managed resource list. Its
current state will not be touched.

start/stop::

`start` and `stop` commands can be issued to the resource specific tools
(like `qm` or `pct`); they will forward the request to the
`ha-manager`, which will then execute the action and set the resulting
service state (enabled, disabled). An example follows this list.

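For example, using the resources from the configuration example above, the
requests would be forwarded to the HA stack like this:

----
# qm start 501
# pct stop 102
----
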

ifdef::manvolnum[]
include::pve-copyright.adoc[]
endif::manvolnum[]