[[chapter-ha-manager]]
ifdef::manvolnum[]
PVE({manvolnum})
================
include::attributes.txt[]

NAME
----

ha-manager - Proxmox VE HA Manager

SYNOPSIS
--------

include::ha-manager.1-synopsis.adoc[]

DESCRIPTION
-----------
endif::manvolnum[]

ifndef::manvolnum[]
High Availability
=================
include::attributes.txt[]
endif::manvolnum[]

Our modern society depends heavily on information provided by
computers over the network. Mobile devices have amplified that
dependency, because people can access the network any time from
anywhere. If you provide such services, it is very important that
they are available most of the time.

We can mathematically define the availability as the ratio of (A) the
total time a service is capable of being used during a given interval
to (B) the length of the interval. It is normally expressed as a
percentage of uptime in a given year.

.Availability - Downtime per Year
[width="60%",cols="<d,d",options="header"]
|===========================================================
|Availability % |Downtime per year
|99 |3.65 days
|99.9 |8.76 hours
|99.99 |52.56 minutes
|99.999 |5.26 minutes
|99.9999 |31.5 seconds
|99.99999 |3.15 seconds
|===========================================================
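
For example, 99.99% availability permits at most (1 - 0.9999) * 365 *
24 * 60 = 52.56 minutes of accumulated downtime per year, which is
exactly the value listed in the table above.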

There are several ways to increase availability. The most elegant
solution is to rewrite your software, so that you can run it on
several hosts at the same time. The software itself needs to have a
way to detect errors and do failover. This is relatively easy if you
just want to serve read-only web pages. But in general this is
complex, and sometimes impossible because you cannot modify the
software yourself. The following solutions work without modifying the
software:

* Use reliable "server" components

NOTE: Computer components with the same functionality can have
varying reliability numbers, depending on the component quality. Most
vendors sell components with higher reliability as "server"
components - usually at a higher price.

* Eliminate single points of failure (redundant components)

 - use an uninterruptible power supply (UPS)
 - use redundant power supplies on the main boards
 - use ECC-RAM
 - use redundant network hardware
 - use RAID for local storage
 - use distributed, redundant storage for VM data

* Reduce downtime

 - rapidly accessible administrators (24/7)
 - availability of spare parts (other nodes in a {pve} cluster)
 - automatic error detection ('ha-manager')
 - automatic failover ('ha-manager')

Virtualization environments like {pve} make it much easier to reach
high availability because they remove the "hardware" dependency. They
also make it easy to set up and use redundant storage and network
devices. So if one host fails, you can simply start those services on
another host within your cluster.

Even better, {pve} provides a software stack called 'ha-manager',
which can do that automatically for you. It is able to automatically
detect errors and perform failover.

{pve} 'ha-manager' works like an "automated" administrator. First, you
configure what resources (VMs, containers, ...) it should
manage. 'ha-manager' then observes correct functionality, and handles
service failover to another node in case of errors. 'ha-manager' can
also handle normal user requests, which may start, stop, relocate and
migrate a service.

But high availability comes at a price. High quality components are
more expensive, and making them redundant at least doubles the
costs. Additional spare parts increase costs further. So you should
carefully calculate the benefits, and compare them with those
additional costs.

TIP: Increasing availability from 99% to 99.9% is relatively
simple. But increasing availability from 99.9999% to 99.99999% is very
hard and costly. 'ha-manager' has typical error detection and failover
times of about 2 minutes, so you can get no more than 99.999%
availability.

Resources
---------

A resource (sometimes also called a service) is uniquely identified
by a service ID (SID), which consists of the service type and a type
specific ID, e.g.: 'vm:100'. That example would be a service of type
vm (virtual machine) with the VMID 100.
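
For example, to put that VM under the control of 'ha-manager', you
could add it as a resource (a minimal sketch; the VMID is just an
example):

----
# ha-manager add vm:100
----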

Requirements
------------

* at least three nodes

* shared storage

* hardware redundancy

* hardware watchdog - if not available we fall back to the
  Linux kernel software watchdog ('softdog')

How It Works
------------

This section provides a detailed description of the {PVE} HA manager
internals. It describes how the CRM and the LRM work together.

To provide High Availability two daemons run on each node:

'pve-ha-lrm'::

The local resource manager (LRM) controls the services running on
the local node.
It reads the requested states for its services from the current
manager status file and executes the respective commands.

'pve-ha-crm'::

The cluster resource manager (CRM) controls the cluster wide
actions of the services, processes the LRM results, and includes the
state machine which controls the state of each service.

.Locks in the LRM & CRM
[NOTE]
Locks are provided by our distributed configuration file system
(pmxcfs). They are used to guarantee that each LRM is active once and
working. As an LRM only executes actions when it holds its lock, we
can mark a failed node as fenced if we can acquire its lock. This
lets us then recover the failed HA services securely, without the
failed (but maybe still running) LRM interfering. This all gets
supervised by the CRM, which currently holds the manager master lock.

Local Resource Manager
~~~~~~~~~~~~~~~~~~~~~~

The local resource manager ('pve-ha-lrm') is started as a daemon on
boot and waits until the HA cluster is quorate and thus cluster wide
locks are working.

It can be in three states:

* *wait for agent lock*: the LRM waits for our exclusive lock. This
  is also used as idle state if no service is configured
* *active*: the LRM holds its exclusive lock and has services
  configured
* *lost agent lock*: the LRM lost its lock, this means a failure
  happened and quorum was lost.

Once the LRM is in the active state, it reads the manager status
file in '/etc/pve/ha/manager_status' and determines the commands it
has to execute for the services it owns.
For each command a worker gets started; these workers run in
parallel and are limited to a maximum of 4 by default. This default
setting may be changed through the datacenter configuration key
"max_worker".

.Maximal Concurrent Worker Adjustment Tips
[NOTE]
The default value of 4 maximal concurrent workers may be unsuited for
a specific setup. For example, 4 live migrations may happen at the
same time, which can lead to network congestion with slower networks
and/or big (memory wise) services. Ensure that also in the worst case
no congestion happens, and lower the "max_worker" value if needed. On
the contrary, if you have a particularly powerful, high end setup you
may also want to increase it.
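
For example, to lower the limit to 2 concurrent workers, you could
set the corresponding key in '/etc/pve/datacenter.cfg' (a sketch;
note that recent {pve} versions spell this option 'max_workers'):

----
max_workers: 2
----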

Each command requested by the CRM is uniquely identifiable by a UID.
When the worker finishes, its result will be processed and written to
the LRM status file '/etc/pve/nodes/<nodename>/lrm_status'. There the
CRM may collect it and let its state machine - respective to the
command's output - act on it.

The actions on each service between CRM and LRM are normally always
synced. This means that the CRM requests a state uniquely marked by a
UID, the LRM then executes this action *one time* and writes back the
result, which is also identifiable by the same UID. This is needed so
that the LRM does not execute an outdated command.
The only exceptions are the 'stop' and the 'error' commands; these
two do not depend on the result produced and are executed always in
the case of the stopped state and once in the case of the error
state.

.Read the Logs
[NOTE]
The HA Stack logs every action it makes. This helps to understand
what and also why something happens in the cluster. Here it is
important to see what both daemons, the LRM and the CRM, did. You may
use `journalctl -u pve-ha-lrm` on the node(s) where the service is,
and the same command for the 'pve-ha-crm' on the node which is the
current master.
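
For example, to follow both logs in real time (run each command on
the respective node):

----
# journalctl -f -u pve-ha-lrm
# journalctl -f -u pve-ha-crm
----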

Cluster Resource Manager
~~~~~~~~~~~~~~~~~~~~~~~~

The cluster resource manager ('pve-ha-crm') starts on each node and
waits there for the manager lock, which can only be held by one node
at a time. The node which successfully acquires the manager lock gets
promoted to the CRM master.

It can be in three states:

* *wait for agent lock*: the CRM waits for our exclusive lock. This
  is also used as idle state if no service is configured
* *active*: the CRM holds its exclusive lock and has services
  configured
* *lost agent lock*: the CRM lost its lock, this means a failure
  happened and quorum was lost.

Its main task is to manage the services which are configured to be
highly available and to always try to bring them into the wanted
state. For example, an enabled service will be started if it is not
running; if it crashes, it will be started again. Thus it dictates
the wanted actions to the LRM.

When a node leaves the cluster quorum, its state changes to unknown.
If the current CRM can then secure the failed node's lock, the
services will be 'stolen' and restarted on another node.

When a cluster member determines that it is no longer in the cluster
quorum, the LRM waits for a new quorum to form. As long as there is
no quorum, the node cannot reset the watchdog. This will trigger a
reboot after 60 seconds.

Configuration
-------------

The HA stack is well integrated into the Proxmox VE API. So, for
example, HA can be configured via the 'ha-manager' command line tool
or the PVE web interface - both provide an easy to use tool.

The resource configuration file can be found at
'/etc/pve/ha/resources.cfg', and the group configuration file at
'/etc/pve/ha/groups.cfg'. Use the provided tools to make changes;
there shouldn't be any need to edit them manually.
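
For illustration, a resource entry in '/etc/pve/ha/resources.cfg' may
look roughly like this (a hypothetical example; the VMID, group name
and values are made up):

----
vm: 100
	group mygroup
	max_restart 1
	max_relocate 1
----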

Node Power Status
-----------------

If a node needs maintenance, you should first migrate and/or relocate
all services which need to keep running to another node. After that,
you can stop the LRM and CRM services. But note that the watchdog
triggers if you stop the LRM while it still has active services.
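
A possible maintenance sequence could look like this (a sketch;
'vm:100' and 'node2' are placeholders):

----
# ha-manager migrate vm:100 node2
# systemctl stop pve-ha-lrm pve-ha-crm
----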

Fencing
-------

What Is Fencing
~~~~~~~~~~~~~~~

Fencing ensures that, on a node failure, the failed node is rendered
unable to do any damage, and that no resource runs twice when it gets
recovered from the failed node.

Configure Hardware Watchdog
~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default all watchdog modules are blocked for security reasons, as
they are like a loaded gun if not correctly initialized.
If you have a hardware watchdog available, remove its module from the
blacklist and restart the 'watchdog-mux' service.
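
For example, with an IPMI watchdog this could look like the following
(a sketch; the actual module depends on your hardware, 'ipmi_watchdog'
is just a common example, and the blacklist file location may vary
between {pve} versions):

----
# modprobe ipmi_watchdog
# systemctl restart watchdog-mux
----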

Resource/Service Agents
-------------------------

A resource, also called a service, can be managed by the
'ha-manager'. Currently we support virtual machines and containers.

Groups
------

A group is a collection of cluster nodes which a service may be bound to.

Group Settings
~~~~~~~~~~~~~~

nodes::

list of group node members

restricted::

resources bound to this group may only run on nodes defined by the
group. If no group node member is available, the resource will be
placed in the stopped state.

nofailback::

the resource won't automatically fail back when a more preferred node
(re)joins the cluster.
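
For example, a group restricted to two nodes could be created like
this (a sketch; 'mygroup', 'node1' and 'node2' are placeholder
names):

----
# ha-manager groupadd mygroup --nodes "node1,node2" --restricted
----
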
Recovery Policy
---------------

There are two service recovery policy settings which can be
configured specifically for each resource.

max_restart::

maximal number of tries to restart a failed service on the current
node. The default is set to one.

max_relocate::

maximal number of tries to relocate the service to a different node.
A relocate only happens after the max_restart value is exceeded on
the current node. The default is set to one.

Note that the relocate count state will only reset to zero when the
service had at least one successful start. That means if a service is
re-enabled without fixing the error, only the restart policy gets
repeated.
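
Both settings can be adjusted per resource, for example (a sketch;
'vm:100' and the values are placeholders):

----
# ha-manager set vm:100 --max_restart 2 --max_relocate 2
----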

Error Recovery
--------------

If, after all tries, the service state could not be recovered, it
gets placed in an error state. In this state the service won't get
touched by the HA stack anymore. To recover from this state you
should follow these steps (see the command sketch after this list):

* bring the resource back into a safe and consistent state (e.g.:
killing its process)

* disable the HA resource to place it in a stopped state

* fix the error which led to these failures

* *after* you fixed all errors you may enable the service again
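
On the command line, the disable/enable cycle could look like this (a
sketch; 'vm:100' is a placeholder; fix the underlying problem before
re-enabling):

----
# ha-manager disable vm:100
# ha-manager enable vm:100
----
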
Service Operations
------------------

This is how the basic user-initiated service operations (via
'ha-manager') work.

enable::

the service will be started by the LRM if not already running.

disable::

the service will be stopped by the LRM if running.

migrate/relocate::

the service will be relocated (live) to another node.

remove::

the service will be removed from the HA managed resource list. Its
current state will not be touched.

start/stop::

start and stop commands can be issued to the resource specific tools
(like 'qm' or 'pct'); they will forward the request to the
'ha-manager', which will then execute the action and set the
resulting service state (enabled, disabled).
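
For example, to relocate an HA managed VM to another node (a sketch;
'vm:100' and 'node2' are placeholders):

----
# ha-manager relocate vm:100 node2
----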

Service States
--------------

stopped::

Service is stopped (confirmed by LRM).

request_stop::

Service should be stopped. Waiting for confirmation from the LRM.

started::

Service is active, and the LRM should start it immediately if not
already running.

fence::

Wait for node fencing (service node is not inside the quorate cluster
partition).

freeze::

Do not touch the service state. We use this state while we reboot a
node, or when we restart the LRM daemon.

migrate::

Migrate service (live) to another node.

error::

Service disabled because of LRM errors. Needs manual intervention.
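
You can inspect the current state of all managed resources with the
'ha-manager status' command:

----
# ha-manager status
----
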
ifdef::manvolnum[]
include::pve-copyright.adoc[]
endif::manvolnum[]