[[chapter_ha_manager]]
ifdef::manvolnum[]
ha-manager(1)
=============
:pve-toplevel:

NAME
----

ha-manager - Proxmox VE HA Manager

SYNOPSIS
--------

include::ha-manager.1-synopsis.adoc[]

DESCRIPTION
-----------
endif::manvolnum[]
ifndef::manvolnum[]
High Availability
=================
:pve-toplevel:
endif::manvolnum[]

Our modern society depends heavily on information provided by
computers over the network. Mobile devices amplified that dependency,
because people can access the network any time from anywhere. If you
provide such services, it is very important that they are available
most of the time.

We can mathematically define the availability as the ratio of (A) the
total time a service is capable of being used during a given interval
to (B) the length of the interval. It is normally expressed as a
percentage of uptime in a given year.

.Availability - Downtime per Year
[width="60%",cols="<d,d",options="header"]
|===========================================================
|Availability % |Downtime per year
|99 |3.65 days
|99.9 |8.76 hours
|99.99 |52.56 minutes
|99.999 |5.26 minutes
|99.9999 |31.5 seconds
|99.99999 |3.15 seconds
|===========================================================
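
For example, the 99.99% row follows directly from the definition
above: 365 × 24 × 60 × (1 - 0.9999) = 52.56 minutes of allowed
downtime per year.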

There are several ways to increase availability. The most elegant
solution is to rewrite your software, so that you can run it on
several hosts at the same time. The software itself needs to have a
way to detect errors and do failover. This is relatively easy if you
just want to serve read-only web pages. But in general this is
complex, and sometimes impossible, because you cannot modify the
software yourself. The following solutions work without modifying the
software:

* Use reliable ``server'' components
+
NOTE: Computer components with the same functionality can have varying
reliability numbers, depending on the component quality. Most vendors
sell components with higher reliability as ``server'' components -
usually at a higher price.

* Eliminate single points of failure (redundant components)
** use an uninterruptible power supply (UPS)
** use redundant power supplies on the main boards
** use ECC-RAM
** use redundant network hardware
** use RAID for local storage
** use distributed, redundant storage for VM data

* Reduce downtime
** rapidly accessible administrators (24/7)
** availability of spare parts (other nodes in a {pve} cluster)
** automatic error detection (provided by `ha-manager`)
** automatic failover (provided by `ha-manager`)

Virtualization environments like {pve} make it much easier to reach
high availability because they remove the ``hardware'' dependency. They
also make it possible to set up and use redundant storage and network
devices. So if one host fails, you can simply start those services on
another host within your cluster.

Even better, {pve} provides a software stack called `ha-manager`,
which can do that automatically for you. It is able to automatically
detect errors and handle failover.

{pve} `ha-manager` works like an ``automated'' administrator. First, you
configure what resources (VMs, containers, ...) it should
manage. `ha-manager` then observes correct functionality, and handles
service failover to another node in case of errors. `ha-manager` can
also handle normal user requests which may start, stop, relocate and
migrate a service.

But high availability comes at a price. High quality components are
more expensive, and making them redundant at least doubles the
costs. Additional spare parts increase costs further. So you should
carefully calculate the benefits, and compare them with those
additional costs.

TIP: Increasing availability from 99% to 99.9% is relatively
simple. But increasing availability from 99.9999% to 99.99999% is very
hard and costly. `ha-manager` has typical error detection and failover
times of about 2 minutes, so you can get no more than 99.999%
availability.


Requirements
------------

You must meet the following requirements before you start with HA:

* at least three cluster nodes (to get reliable quorum)

* shared storage for VMs and containers

* hardware redundancy (everywhere)

* use reliable ``server'' components

* hardware watchdog - if not available we fall back to the
  Linux kernel software watchdog (`softdog`)

* optional hardware fencing devices


[[ha_manager_resources]]
Resources
---------

We call the primary management unit handled by `ha-manager` a
resource. A resource (also called ``service'') is uniquely
identified by a service ID (SID), which consists of the resource type
and a type-specific ID, e.g.: `vm:100`. That example would be a
resource of type `vm` (virtual machine) with the ID 100.

For now we have two important resource types - virtual machines and
containers. One basic idea here is that we can bundle related software
into such a VM or container, so there is no need to compose one big
service from other services, as was done with `rgmanager`. In
general, an HA enabled resource should not depend on other resources.
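
As a quick example (assuming a VM with the ID 100 already exists), the
following command puts it under HA control, after which it shows up in
the HA status output:

----
# ha-manager add vm:100
# ha-manager status
----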


How It Works
------------

This section provides a detailed description of the {PVE} HA manager
internals. It describes all involved daemons and how they work
together. To provide HA, two daemons run on each node:

`pve-ha-lrm`::

The local resource manager (LRM), which controls the services running on
the local node. It reads the requested states for its services from
the current manager status file and executes the respective commands.

`pve-ha-crm`::

The cluster resource manager (CRM), which makes the cluster wide
decisions. It sends commands to the LRM, processes the results,
and moves resources to other nodes if something fails. The CRM also
handles node fencing.


.Locks in the LRM & CRM
[NOTE]
Locks are provided by our distributed configuration file system (pmxcfs).
They are used to guarantee that each LRM is active and working only
once. As an LRM only executes actions when it holds its lock, we can
mark a failed node as fenced if we can acquire its lock. This lets us
then recover any failed HA services securely, without any interference
from the now unknown failed node. This all gets supervised by the CRM,
which currently holds the manager master lock.


Service States
~~~~~~~~~~~~~~

[thumbnail="gui-ha-manager-status.png"]

The CRM uses a service state enumeration to record the current service
state. We display this state in the GUI and you can query it using
the `ha-manager` command line tool:

----
# ha-manager status
quorum OK
master elsa (active, Mon Nov 21 07:23:29 2016)
lrm elsa (active, Mon Nov 21 07:23:22 2016)
service ct:100 (elsa, stopped)
service ct:102 (elsa, started)
service vm:501 (elsa, started)
----

Here is the list of possible states:

stopped::

Service is stopped (confirmed by the LRM). If the LRM detects that a
stopped service is still running, it will stop it again.

request_stop::

Service should be stopped. The CRM waits for confirmation from the
LRM.

started::

Service is active and the LRM should start it ASAP if it is not
already running. If the service fails and is detected as not running,
the LRM restarts it
(see xref:ha_manager_start_failure_policy[Start Failure Policy]).

fence::

Wait for node fencing (the service node is not inside the quorate
cluster partition). As soon as the node gets fenced successfully, the
service will be recovered to another node, if possible
(see xref:ha_manager_fencing[Fencing]).

freeze::

Do not touch the service state. We use this state while we reboot a
node, or when we restart the LRM daemon
(see xref:ha_manager_package_updates[Package Updates]).

migrate::

Migrate the service (live) to another node.

error::

Service is disabled because of LRM errors. Needs manual intervention
(see xref:ha_manager_error_recovery[Error Recovery]).


Local Resource Manager
~~~~~~~~~~~~~~~~~~~~~~

The local resource manager (`pve-ha-lrm`) is started as a daemon on
boot and waits until the HA cluster is quorate and thus cluster wide
locks are working.

It can be in three states:

wait for agent lock::

The LRM waits for our exclusive lock. This is also used as the idle
state if no service is configured.

active::

The LRM holds its exclusive lock and has services configured.

lost agent lock::

The LRM lost its lock. This means a failure happened and quorum was lost.

After the LRM gets into the active state, it reads the manager status
file in `/etc/pve/ha/manager_status` and determines the commands it
has to execute for the services it owns.
For each command a worker gets started. These workers run in
parallel and are limited to at most 4 by default. This default setting
may be changed through the datacenter configuration key `max_worker`.
When finished, the worker process gets collected and its result saved
for the CRM.

.Maximum Concurrent Worker Adjustment Tips
[NOTE]
The default value of at most 4 concurrent workers may be unsuited for
a specific setup. For example, 4 live migrations may happen at the
same time, which can lead to network congestion with slower networks
and/or big (memory wise) services. Ensure that no congestion happens
even in the worst case, and lower the `max_worker` value if needed. On
the contrary, if you have a particularly powerful, high end setup you
may also want to increase it.
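
As a sketch of how this could look, assuming the key is set in the
datacenter configuration file `/etc/pve/datacenter.cfg` (check the
datacenter configuration documentation for the exact key name on your
version):

----
# /etc/pve/datacenter.cfg
max_worker: 8
----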

Each command requested by the CRM is uniquely identifiable by a UID.
When the worker finishes, its result will be processed and written to
the LRM status file `/etc/pve/nodes/<nodename>/lrm_status`. There the
CRM may collect it and let its state machine - respectively the
command's output - act on it.

The actions on each service between CRM and LRM are normally always
synced. This means that the CRM requests a state uniquely marked by a
UID, the LRM then executes this action *one time* and writes back the
result, which is also identifiable by the same UID. This is needed so
that the LRM does not execute an outdated command.
The only exceptions are the `stop` and the `error` commands; these two
do not depend on the result produced and are executed always in the
case of the stopped state and once in the case of the error state.

.Read the Logs
[NOTE]
The HA Stack logs every action it makes. This helps to understand what
happened in the cluster, and also why it happened. Here it is important
to see what both daemons, the LRM and the CRM, did. You may use
`journalctl -u pve-ha-lrm` on the node(s) where the service is, and
the same command for `pve-ha-crm` on the node which is the current
master.

Cluster Resource Manager
~~~~~~~~~~~~~~~~~~~~~~~~

The cluster resource manager (`pve-ha-crm`) starts on each node and
waits there for the manager lock, which can only be held by one node
at a time. The node which successfully acquires the manager lock gets
promoted to the CRM master.

It can be in three states:

wait for agent lock::

The CRM waits for our exclusive lock. This is also used as the idle
state if no service is configured.

active::

The CRM holds its exclusive lock and has services configured.

lost agent lock::

The CRM lost its lock. This means a failure happened and quorum was lost.

Its main task is to manage the services which are configured to be
highly available and to always try to enforce the requested state. For
example, an enabled service will be started if it is not running; if
it crashes, it will be started again. Thus the CRM dictates the
actions the LRM needs to execute.

When a node leaves the cluster quorum, its state changes to unknown.
If the current CRM can then secure the failed node's lock, the
services will be 'stolen' and restarted on another node.

When a cluster member determines that it is no longer in the cluster
quorum, the LRM waits for a new quorum to form. As long as there is no
quorum, the node cannot reset the watchdog. This will trigger a reboot
after the watchdog times out; this happens after 60 seconds.


Configuration
-------------

The HA stack is well integrated into the {pve} API. So, for example,
HA can be configured via the `ha-manager` command line interface, or
the {pve} web interface - both interfaces provide an easy way to
manage HA. Automation tools can use the API directly.

All HA configuration files are within `/etc/pve/ha/`, so they get
automatically distributed to the cluster nodes, and all nodes share
the same HA configuration.


Resources
~~~~~~~~~

[thumbnail="gui-ha-manager-resources-view.png"]

The resource configuration file `/etc/pve/ha/resources.cfg` stores
the list of resources managed by `ha-manager`. A resource configuration
inside that list looks like this:

----
<type>: <name>
    <property> <value>
    ...
----

It starts with a resource type, followed by a resource specific name,
separated by a colon. Together this forms the HA resource ID, which is
used by all `ha-manager` commands to uniquely identify a resource
(example: `vm:100` or `ct:101`). The next lines contain additional
properties:

include::ha-resources-opts.adoc[]

Here is a real world example with one VM and one container. As you see,
the syntax of those files is really simple, so it is even possible to
read or edit those files using your favorite editor:

.Configuration Example (`/etc/pve/ha/resources.cfg`)
----
vm: 501
    state started
    max_relocate 2

ct: 102
    # Note: use default settings for everything
----

[thumbnail="gui-ha-manager-add-resource.png"]

The above config was generated using the `ha-manager` command line tool:

----
# ha-manager add vm:501 --state started --max_relocate 2
# ha-manager add ct:102
----


[[ha_manager_groups]]
Groups
~~~~~~

[thumbnail="gui-ha-manager-groups-view.png"]

The HA group configuration file `/etc/pve/ha/groups.cfg` is used to
define groups of cluster nodes. A resource can be restricted to run
only on the members of such a group. A group configuration looks like
this:

----
group: <group>
    nodes <node_list>
    <property> <value>
    ...
----

include::ha-groups-opts.adoc[]

[thumbnail="gui-ha-manager-add-group.png"]

A common requirement is that a resource should run on a specific
node. Usually the resource is able to run on other nodes, so you can
define an unrestricted group with a single member:

----
# ha-manager groupadd prefer_node1 --nodes node1
----

For bigger clusters, it makes sense to define a more detailed failover
behavior. For example, you may want to run a set of services on
`node1` if possible. If `node1` is not available, you want to run them
equally split on `node2` and `node3`. If those nodes also fail, the
services should run on `node4`. To achieve this you could set the node
list to:

----
# ha-manager groupadd mygroup1 -nodes "node1:2,node2:1,node3:1,node4"
----

Another use case is if a resource uses other resources only available
on specific nodes, let's say `node1` and `node2`. We need to make sure
that the HA manager does not use other nodes, so we need to create a
restricted group with said nodes:

----
# ha-manager groupadd mygroup2 -nodes "node1,node2" -restricted
----

The above commands created the following group configuration file:

.Configuration Example (`/etc/pve/ha/groups.cfg`)
----
group: prefer_node1
    nodes node1

group: mygroup1
    nodes node2:1,node4,node1:2,node3:1

group: mygroup2
    nodes node2,node1
    restricted 1
----


The `nofailback` option is mostly useful to avoid unwanted resource
movements during administration tasks. For example, if you need to
migrate a service to a node which doesn't have the highest priority in
the group, you need to tell the HA manager not to instantly move this
service back by setting the `nofailback` option.

Another scenario is when a service was fenced and got recovered to
another node. The admin tries to repair the fenced node and brings it
online again to investigate the failure cause and check if it runs
stably again. Setting the `nofailback` flag prevents the recovered
services from moving straight back to the fenced node.
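
For example, assuming `nofailback` can be set like any other group
property on the command line, a group which prefers `node1` but does
not automatically move services back to it could be created like this
(a sketch - adapt the group and node names to your cluster):

----
# ha-manager groupadd prefer_node1_nofb -nodes "node1:2,node2" -nofailback
----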


[[ha_manager_fencing]]
Fencing
-------

On node failures, fencing ensures that the erroneous node is
guaranteed to be offline. This is required to make sure that no
resource runs twice when it gets recovered on another node. This is a
really important task, because without it, it would not be possible to
recover a resource on another node.

If a node did not get fenced, it would be in an unknown state, where
it may still have access to shared resources. This is really
dangerous! Imagine that every network but the storage one broke. Now,
while not reachable from the public network, the VM still runs and
writes to the shared storage.

If we then simply started up this VM on another node, we would get a
dangerous race condition, because we would write from both nodes. Such
a condition can destroy all VM data and the whole VM could be rendered
unusable. The recovery could also fail if the storage protects against
multiple mounts.


How {pve} Fences
~~~~~~~~~~~~~~~~

There are different methods to fence a node, for example, fence
devices which cut off the power from the node or disable their
communication completely. Those are often quite expensive and bring
additional critical components into a system, because if they fail you
cannot recover any service.

We thus wanted to integrate a simpler fencing method, which does not
require additional external hardware. This can be done using
watchdog timers.

.Possible Fencing Methods
- external power switches
- isolate nodes by disabling complete network traffic on the switch
- self fencing using watchdog timers

Watchdog timers have been widely used in critical and dependable
systems since the beginning of microcontrollers. They are often
independent and simple integrated circuits which are used to detect
and recover from computer malfunctions.

During normal operation, `ha-manager` regularly resets the watchdog
timer to prevent it from elapsing. If, due to a hardware fault or
program error, the computer fails to reset the watchdog, the timer
will elapse and trigger a reset of the whole server (reboot).

Recent server motherboards often include such hardware watchdogs, but
these need to be configured. If no watchdog is available or
configured, we fall back to the Linux kernel 'softdog'. While still
reliable, it is not independent of the server's hardware, and thus has
a lower reliability than a hardware watchdog.


Configure Hardware Watchdog
~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, all hardware watchdog modules are blocked for security
reasons. They are like a loaded gun if not correctly initialized. To
enable a hardware watchdog, you need to specify the module to load in
'/etc/default/pve-ha-manager', for example:

----
# select watchdog module (default is softdog)
WATCHDOG_MODULE=iTCO_wdt
----

This configuration is read by the 'watchdog-mux' service, which loads
the specified module at startup.
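
After the next reboot, you can verify that the intended module is in
use with standard kernel tools (a quick sketch, using the `iTCO_wdt`
module from the example above):

----
# lsmod | grep wdt
# ls -l /dev/watchdog
----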


Recover Fenced Services
~~~~~~~~~~~~~~~~~~~~~~~

After a node failed and its fencing was successful, the CRM tries to
move services from the failed node to nodes which are still online.

The selection of nodes, on which those services get recovered, is
influenced by the resource `group` settings, the list of currently active
nodes, and their respective active service count.

The CRM first builds a set out of the intersection between user selected
nodes (from the `group` setting) and available nodes. It then chooses the
subset of nodes with the highest priority, and finally selects the node
with the lowest active service count. This minimizes the possibility
of an overloaded node.

CAUTION: On node failure, the CRM distributes services to the
remaining nodes. This increases the service count on those nodes, and
can lead to high load, especially on small clusters. Please design
your cluster so that it can handle such worst case scenarios.


[[ha_manager_start_failure_policy]]
Start Failure Policy
---------------------

The start failure policy comes into effect if a service fails to start
on a node one or more times. It can be used to configure how often a
restart should be triggered on the same node and how often a service
should be relocated, so that it has a chance to be started on another
node. The aim of this policy is to circumvent temporary unavailability
of shared resources on a specific node. For example, if shared storage
isn't available on a quorate node anymore, e.g. due to network
problems, but is still available on other nodes, the relocate policy
allows the service to be started nonetheless.

There are two service start recovery policy settings which can be
configured specifically for each resource.

max_restart::

Maximum number of attempts to restart a failed service on the actual
node. The default is set to one.

max_relocate::

Maximum number of attempts to relocate the service to a different node.
A relocate only happens after the max_restart value is exceeded on the
actual node. The default is set to one.

NOTE: The relocate count state will only reset to zero when the
service had at least one successful start. That means if a service is
re-enabled without fixing the error, only the restart policy gets
repeated.
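
For example, the following resource configuration would allow two
restart attempts on the current node before trying up to two
relocations (a sketch - both properties can also be set when adding
the resource with `ha-manager add`, as shown earlier):

.Example (`/etc/pve/ha/resources.cfg`)
----
vm: 501
    state started
    max_restart 2
    max_relocate 2
----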


[[ha_manager_error_recovery]]
Error Recovery
--------------

If, after all attempts, the service state could not be recovered, it
gets placed in an error state. In this state, the service won't get
touched by the HA stack anymore. To recover from this state you should
follow these steps (see the command example below):

* bring the resource back into a safe and consistent state (e.g.,
killing its process)

* disable the HA resource to place it in a stopped state

* fix the error which led to these failures

* *after* you fixed all errors, you may enable the service again

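A possible command sequence for a VM resource could look like this. It
is only a sketch: it assumes a `ha-manager` version where the requested
state is changed with the `set` sub-command; on versions without it,
use the equivalent enable/disable operations described in
xref:ha_manager_service_operations[Service Operations].

----
# ha-manager set vm:100 --state disabled
... fix the underlying problem ...
# ha-manager set vm:100 --state started
----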

Node Maintenance
----------------

It is sometimes necessary to shut down or reboot a node to do
maintenance tasks, such as replacing hardware, or simply installing a
new kernel image.


Shutdown
~~~~~~~~

A shutdown ('poweroff') is usually done if the node is planned to stay
down for some time. The LRM stops all managed services in that
case. This means that other nodes will take over those services
afterwards.

NOTE: Recent hardware has large amounts of RAM. So we stop all
resources, then restart them, to avoid online migration of all that
RAM. If you want to use online migration, you need to invoke that
manually before you shut down the node.


Reboot
~~~~~~

Node reboots are initiated with the 'reboot' command. This is usually
done after installing a new kernel. Please note that this is different
from ``shutdown'', because the node immediately starts again.

The LRM tells the CRM that it wants to restart, and waits until the
CRM puts all resources into the `freeze` state. This prevents those
resources from being moved to other nodes. Instead, the CRM starts the
resources on the same node after the reboot.


Manual Resource Movement
~~~~~~~~~~~~~~~~~~~~~~~~

Last but not least, you can also move resources manually to other
nodes before you shut down or restart a node. The advantage is that you
have full control, and you can decide if you want to use online
migration or not.

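For example, to move a resource away from a node you are about to
reboot, you can use the HA manager's migrate or relocate operations (a
sketch, assuming a resource `vm:100` and a target node named `node2`):

----
# ha-manager migrate vm:100 node2     # online migration
# ha-manager relocate vm:100 node2    # stop, move, then restart on the target
----
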
NOTE: Please do not 'kill' services like `pve-ha-crm`, `pve-ha-lrm` or
`watchdog-mux`. They manage and use the watchdog, so this can result
in a node reboot.


[[ha_manager_package_updates]]
Package Updates
---------------

When updating the ha-manager, you should do one node after the other,
never all at once, for various reasons. First, while we test our
software thoroughly, a bug affecting your specific setup cannot
totally be ruled out. Updating one node after the other and checking
the functionality of each node after finishing the update helps to
recover from eventual problems, while updating all at once could leave
you with a broken cluster state, and is generally not good practice.

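As a sketch, such a rolling update simply uses the standard Debian
package tools, run on one node at a time and followed by a check of
the HA status before moving on to the next node:

----
# apt-get update && apt-get dist-upgrade
# ha-manager status
----
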
Also, the {pve} HA stack uses a request acknowledge protocol to perform
actions between the cluster and the local resource manager. For
restarting, the LRM makes a request to the CRM to freeze all its
services. This prevents them from being touched by the cluster during
the short time the LRM is restarting. After that, the LRM may safely
close the watchdog during a restart. Such a restart happens during an
update and, as already stated, an active master CRM is needed to
acknowledge the requests from the LRM. If this is not the case, the
update process can take too long which, in the worst case, may result
in a watchdog reset.


[[ha_manager_service_operations]]
Service Operations
------------------

This is how the basic user-initiated service operations (via
`ha-manager`) work.

enable::

The service will be started by the LRM if not already running.

disable::

The service will be stopped by the LRM if running.

migrate/relocate::

The service will be relocated (live) to another node.

remove::

The service will be removed from the HA managed resource list. Its
current state will not be touched.

start/stop::

`start` and `stop` commands can be issued to the resource specific tools
(like `qm` or `pct`); they will forward the request to the
`ha-manager`, which then will execute the action and set the resulting
service state (enabled, disabled). See the example below.

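For instance, for the HA-managed VM 501 from the configuration example
above, the usual guest commands are simply forwarded to the HA stack
(a sketch; the forwarding happens transparently):

----
# qm stop 501      # forwarded to ha-manager, which stops the service
# qm start 501     # forwarded to ha-manager, which starts it again
----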

ifdef::manvolnum[]
include::pve-copyright.adoc[]
endif::manvolnum[]
