| 1 | [[chapter-ha-manager]] |
| 2 | ifdef::manvolnum[] |
| 3 | PVE({manvolnum}) |
| 4 | ================ |
| 5 | include::attributes.txt[] |
| 6 | |
| 7 | NAME |
| 8 | ---- |
| 9 | |
| 10 | ha-manager - Proxmox VE HA Manager |
| 11 | |
SYNOPSIS
| 13 | -------- |
| 14 | |
| 15 | include::ha-manager.1-synopsis.adoc[] |
| 16 | |
| 17 | DESCRIPTION |
| 18 | ----------- |
| 19 | endif::manvolnum[] |
| 20 | |
| 21 | ifndef::manvolnum[] |
| 22 | High Availability |
| 23 | ================= |
| 24 | include::attributes.txt[] |
| 25 | endif::manvolnum[] |
| 26 | |
| 27 | |
| 28 | Our modern society depends heavily on information provided by |
| 29 | computers over the network. Mobile devices amplified that dependency, |
| 30 | because people can access the network any time from anywhere. If you |
| 31 | provide such services, it is very important that they are available |
| 32 | most of the time. |
| 33 | |
| 34 | We can mathematically define the availability as the ratio of (A) the |
| 35 | total time a service is capable of being used during a given interval |
| 36 | to (B) the length of the interval. It is normally expressed as a |
| 37 | percentage of uptime in a given year. |
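
For example, at 99.99% availability, the expected downtime per year
follows directly from this definition (a quick check of the table
below):

----
downtime = (1 - availability) * interval
         = (1 - 0.9999) * 365 days
         = 0.0365 days ~= 52.56 minutes per year
----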
| 38 | |
| 39 | .Availability - Downtime per Year |
| 40 | [width="60%",cols="<d,d",options="header"] |
| 41 | |=========================================================== |
| 42 | |Availability % |Downtime per year |
| 43 | |99 |3.65 days |
| 44 | |99.9 |8.76 hours |
| 45 | |99.99 |52.56 minutes |
| 46 | |99.999 |5.26 minutes |
| 47 | |99.9999 |31.5 seconds |
| 48 | |99.99999 |3.15 seconds |
| 49 | |=========================================================== |
| 50 | |
There are several ways to increase availability. The most elegant
solution is to rewrite your software, so that you can run it on
several hosts at the same time. The software itself needs to have a
way to detect errors and do failover. This is relatively easy if you
just want to serve read-only web pages. But in general this is
complex, and sometimes impossible because you cannot modify the
software yourself. The following solutions work without modifying the
software:
| 59 | |
| 60 | * Use reliable "server" components |
| 61 | |
NOTE: Computer components with the same functionality can have varying
reliability numbers, depending on the component quality. Most vendors
sell components with higher reliability as "server" components -
usually at a higher price.
| 66 | |
* Eliminate single points of failure (redundant components)
| 68 | |
| 69 | - use an uninterruptible power supply (UPS) |
| 70 | - use redundant power supplies on the main boards |
| 71 | - use ECC-RAM |
| 72 | - use redundant network hardware |
| 73 | - use RAID for local storage |
| 74 | - use distributed, redundant storage for VM data |
| 75 | |
| 76 | * Reduce downtime |
| 77 | |
| 78 | - rapidly accessible administrators (24/7) |
| 79 | - availability of spare parts (other nodes in a {pve} cluster) |
| 80 | - automatic error detection ('ha-manager') |
| 81 | - automatic failover ('ha-manager') |
| 82 | |
Virtualization environments like {pve} make it much easier to reach
high availability because they remove the "hardware" dependency. They
also make it easy to set up and use redundant storage and network
devices. So if one host fails, you can simply start those services on
another host within your cluster.
| 88 | |
Even better, {pve} provides a software stack called 'ha-manager',
which can do that automatically for you. It is able to detect errors
and handle failover without human intervention.
| 92 | |
| 93 | {pve} 'ha-manager' works like an "automated" administrator. First, you |
| 94 | configure what resources (VMs, containers, ...) it should |
| 95 | manage. 'ha-manager' then observes correct functionality, and handles |
| 96 | service failover to another node in case of errors. 'ha-manager' can |
| 97 | also handle normal user requests which may start, stop, relocate and |
| 98 | migrate a service. |
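
For example, putting a VM under HA management and checking the result
could look like this ('vm:100' is a placeholder; subcommand names may
differ slightly between versions):

----
# add virtual machine 100 to the list of HA managed resources
ha-manager add vm:100

# show the current state of all managed services
ha-manager status
----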
| 99 | |
But high availability comes at a price. High quality components are
more expensive, and making them redundant at least doubles the
costs. Additional spare parts increase costs further. So you should
carefully calculate the benefits, and compare them with those
additional costs.
| 105 | |
TIP: Increasing availability from 99% to 99.9% is relatively
simple. But increasing availability from 99.9999% to 99.99999% is very
hard and costly. 'ha-manager' has typical error detection and failover
times of about 2 minutes, so you can get no more than 99.999%
availability.
| 111 | |
| 112 | Requirements |
| 113 | ------------ |
| 114 | |
| 115 | * at least three cluster nodes (to get reliable quorum) |
| 116 | |
| 117 | * shared storage for VMs and containers |
| 118 | |
| 119 | * hardware redundancy (everywhere) |
| 120 | |
* hardware watchdog - if not available we fall back to the
  Linux kernel software watchdog ('softdog')
| 123 | |
| 124 | * optional hardware fencing devices |
| 125 | |
| 126 | |
| 127 | Resources |
| 128 | --------- |
| 129 | |
We call the primary management unit handled by 'ha-manager' a
resource. A resource (also called "service") is uniquely
identified by a service ID (SID), which consists of the resource type
and a type specific ID, e.g.: 'vm:100'. That example would be a
resource of type 'vm' (virtual machine) with the ID 100.
| 135 | |
For now we have two important resource types - virtual machines and
containers. One basic idea here is that we can bundle related software
into such a VM or container, so there is no need to compose one big
service from other services, as was done with 'rgmanager'. In
general, an HA enabled resource should not depend on other resources.
| 141 | |
| 142 | |
| 143 | How It Works |
| 144 | ------------ |
| 145 | |
This section provides a detailed description of the {pve} HA manager
internals. It describes how the CRM and the LRM work together.
| 148 | |
| 149 | To provide High Availability two daemons run on each node: |
| 150 | |
| 151 | 'pve-ha-lrm':: |
| 152 | |
The local resource manager (LRM) controls the services running on
the local node.
| 155 | It reads the requested states for its services from the current manager |
| 156 | status file and executes the respective commands. |
| 157 | |
| 158 | 'pve-ha-crm':: |
| 159 | |
The cluster resource manager (CRM) controls the cluster-wide
actions of the services, processes the LRM results, and includes the
state machine which controls the state of each service.
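
Both daemons run as regular systemd services, so their state can be
checked directly on any node (a quick sketch):

----
# check that both HA daemons are running on a node
systemctl status pve-ha-lrm pve-ha-crm
----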
| 163 | |
| 164 | .Locks in the LRM & CRM |
| 165 | [NOTE] |
Locks are provided by our distributed configuration file system (pmxcfs).
They are used to guarantee that each LRM is active only once and working.
As an LRM only executes actions when it holds its lock, we can mark a
failed node as fenced if we can acquire its lock. This lets us then
recover any failed HA services securely, without any interference from
the now unreachable failed node. All this gets supervised by the CRM,
which currently holds the manager master lock.
| 173 | |
| 174 | Local Resource Manager |
| 175 | ~~~~~~~~~~~~~~~~~~~~~~ |
| 176 | |
| 177 | The local resource manager ('pve-ha-lrm') is started as a daemon on |
| 178 | boot and waits until the HA cluster is quorate and thus cluster wide |
| 179 | locks are working. |
| 180 | |
| 181 | It can be in three states: |
| 182 | |
* *wait for agent lock*: the LRM waits for our exclusive lock. This is
  also used as the idle state if no service is configured
* *active*: the LRM holds its exclusive lock and has services configured
* *lost agent lock*: the LRM lost its lock, which means a failure
  happened and quorum was lost.
| 188 | |
After the LRM gets in the active state it reads the manager status
file in '/etc/pve/ha/manager_status' and determines the commands it
has to execute for the services it owns.
For each command a worker gets started. These workers run in parallel
and are limited to a maximum of 4 by default. This default setting may
be changed through the datacenter configuration key "max_workers".
When finished, the worker process gets collected and its result saved
for the CRM.
| 197 | |
| 198 | .Maximal Concurrent Worker Adjustment Tips |
| 199 | [NOTE] |
The default value of 4 maximal concurrent workers may be unsuited for
a specific setup. For example, 4 live migrations may happen at the same
time, which can lead to network congestion with slower networks and/or
big (memory wise) services. Ensure that, even in the worst case, no
congestion happens, and lower the "max_workers" value if needed. On the
contrary, if you have a particularly powerful, high end setup you may
also want to increase it.
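
For example, the limit could be lowered in '/etc/pve/datacenter.cfg'
(a sketch of the key/value format; see the datacenter configuration
documentation for details):

----
# /etc/pve/datacenter.cfg
# allow at most 2 concurrent HA worker processes per node
max_workers: 2
----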
| 206 | |
Each command requested by the CRM is uniquely identifiable by a UID.
When the worker finishes, its result will be processed and written to
the LRM status file '/etc/pve/nodes/<nodename>/lrm_status'. There the
CRM may collect it and let its state machine act on the command's
result.
| 211 | |
The actions on each service between CRM and LRM are normally always
synced. This means that the CRM requests a state uniquely marked by a
UID, the LRM then executes this action *one time* and writes back the
result, which is also identifiable by the same UID. This is needed so
that the LRM does not execute an outdated command.
The only exceptions are the 'stop' and the 'error' commands; these two
do not depend on the result produced, and are executed always in the
case of the stopped state, and once in the case of the error state.
| 221 | |
| 222 | .Read the Logs |
| 223 | [NOTE] |
The HA stack logs every action it makes. This helps to understand what
happens in the cluster, and also why. Here it is important to see what
both daemons, the LRM and the CRM, did. You may use
`journalctl -u pve-ha-lrm` on the node(s) where the service is, and
`journalctl -u pve-ha-crm` on the node which is the current master.
| 229 | |
| 230 | Cluster Resource Manager |
| 231 | ~~~~~~~~~~~~~~~~~~~~~~~~ |
| 232 | |
| 233 | The cluster resource manager ('pve-ha-crm') starts on each node and |
| 234 | waits there for the manager lock, which can only be held by one node |
| 235 | at a time. The node which successfully acquires the manager lock gets |
| 236 | promoted to the CRM master. |
| 237 | |
| 238 | It can be in three states: |
| 239 | |
* *wait for agent lock*: the CRM waits for our exclusive lock. This is
  also used as the idle state if no service is configured
* *active*: the CRM holds its exclusive lock and has services configured
* *lost agent lock*: the CRM lost its lock, which means a failure
  happened and quorum was lost.
| 245 | |
Its main task is to manage the services which are configured to be
highly available, and to always try to enforce the wanted state. For
example, an enabled service will be started if it's not running; if it
crashes, it will be started again. Thus the CRM dictates the actions
the LRM needs to execute.
| 250 | |
When a node leaves the cluster quorum, its state changes to unknown.
If the current CRM can then secure the failed node's lock, the services
will be 'stolen' and restarted on another node.
| 254 | |
When a cluster member determines that it is no longer in the cluster
quorum, the LRM waits for a new quorum to form. As long as there is no
quorum, the node cannot reset the watchdog. This will trigger a reboot
after the watchdog times out; this happens after 60 seconds.
| 259 | |
| 260 | Configuration |
| 261 | ------------- |
| 262 | |
The HA stack is well integrated into the {pve} API2. So, for
example, HA can be configured via 'ha-manager' or the {pve} web
interface - both provide an easy to use tool.
| 266 | |
The resource configuration file can be located at
'/etc/pve/ha/resources.cfg' and the group configuration file at
'/etc/pve/ha/groups.cfg'. Use the provided tools to make changes;
there shouldn't be any need to edit the files manually.
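
For illustration, a resource entry in '/etc/pve/ha/resources.cfg'
could look like the following sketch (the exact properties depend on
your version; again, prefer 'ha-manager' or the GUI over manual edits):

----
# /etc/pve/ha/resources.cfg (illustrative)
vm: 100
        group mygroup
        max_restart 2
        max_relocate 2
----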
| 271 | |
| 272 | Node Power Status |
| 273 | ----------------- |
| 274 | |
If a node needs maintenance, you should first migrate and/or relocate
all services which are required to run at all times to another node.
After that you can stop the LRM and CRM services. But note that the
watchdog triggers if you stop the LRM while it still has active
services.
| 279 | |
| 280 | Package Updates |
| 281 | --------------- |
| 282 | |
When updating the ha-manager you should do one node after the other,
never all at once, for various reasons. First, while we test our
software thoroughly, a bug affecting your specific setup cannot totally
be ruled out. Upgrading one node after the other, and checking the
functionality of each node after finishing the update, helps to recover
from eventual problems, while updating all nodes at once could leave
you with a broken cluster and is generally not good practice.
| 290 | |
Also, the {pve} HA stack uses a request acknowledge protocol to perform
actions between the cluster and the local resource manager. For
restarting, the LRM makes a request to the CRM to freeze all its
services. This prevents them from being touched by the cluster during
the short time the LRM is restarting. After that the LRM may safely
close the watchdog during a restart. Such a restart happens during a
package update and, as already stated, an active master CRM is needed
to acknowledge the requests from the LRM. If this is not the case the
update process can take too long which, in the worst case, may result
in a watchdog reset.
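
In practice, a rolling update could look like this (standard Debian
tooling; verify the HA stack is healthy before moving to the next node):

----
# on each node, one after the other:
apt-get update
apt-get dist-upgrade

# wait until the node's LRM is active again and all services are fine
ha-manager status
----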
| 300 | |
| 301 | |
| 302 | Fencing |
| 303 | ------- |
| 304 | |
| 305 | What Is Fencing |
| 306 | ~~~~~~~~~~~~~~~ |
| 307 | |
Fencing ensures that, on a node failure, the failed node is rendered
unable to do any damage, and that no resource runs twice when it gets
recovered from the failed node. This is a very important task, and one
of the fundamental principles for making a system highly available.
| 312 | |
If a node were not fenced, it would be in an unknown state, where it
may still have access to shared resources. This is really dangerous!
Imagine that every network but the storage one broke. Now, while not
reachable from the public network, the VM still runs and writes to the
shared storage. If we did not fence the node and just started up this
VM on another node, we would get dangerous race conditions and
atomicity violations, and the whole VM could be rendered unusable. The
recovery could also simply fail if the storage protects against
multiple mounts, thus defeating the purpose of HA.
| 321 | |
How {pve} Fences
~~~~~~~~~~~~~~~~
| 324 | |
There are different methods to fence a node, for example fence
devices which cut off the power from the node or disable its
communication completely.

Those are often quite expensive, and they bring additional critical
components into a system, because if they fail you cannot recover any
service.
| 330 | |
We thus wanted to integrate a simpler method into the HA Manager
first, namely self fencing with watchdogs.
| 333 | |
Watchdogs have been widely used in critical and dependable systems
since the beginning of microcontrollers. They are often simple,
independent integrated circuits which programs can use to watch them.
After opening the watchdog, a program needs to report to it
periodically. If, for whatever reason, it becomes unable to do so, the
watchdog triggers a reset of the whole server.
| 339 | |
Server motherboards often already include such hardware watchdogs, but
these need to be configured. If no watchdog is available or configured,
we fall back to the Linux kernel softdog. While still reliable, it is
not independent of the server's hardware, and thus has a lower
reliability than a hardware watchdog.
| 344 | |
Configure Hardware Watchdog
~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default all watchdog modules are blocked for security reasons, as
they are like a loaded gun if not correctly initialized. If you have a
hardware watchdog available, remove its kernel module from the
blacklist, load it with insmod, and restart the 'watchdog-mux' service
or reboot the node.
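
This could look like the following sketch (the module name 'iTCO_wdt'
is an example and depends on your hardware; modprobe is used here
instead of insmod, as it resolves module dependencies):

----
# load the hardware watchdog module manually
modprobe iTCO_wdt

# restart the watchdog multiplexer so it uses the new device
systemctl restart watchdog-mux
----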
| 352 | |
| 353 | Groups |
| 354 | ------ |
| 355 | |
| 356 | A group is a collection of cluster nodes which a service may be bound to. |
| 357 | |
| 358 | Group Settings |
| 359 | ~~~~~~~~~~~~~~ |
| 360 | |
| 361 | nodes:: |
| 362 | |
List of group node members, where a priority can be given to each
node. A service bound to this group will run on the available nodes
with the highest priority. If more nodes are in the highest priority
class, the services will get distributed to those nodes if not already
there. The priorities have a relative meaning only.
| 368 | |
| 369 | restricted:: |
| 370 | |
Resources bound to this group may only run on nodes defined by the
group. If no group node member is available, the resource will be
placed in the stopped state.
| 374 | |
| 375 | nofailback:: |
| 376 | |
The resource won't automatically fail back when a more preferred node
(re)joins the cluster.
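
As an illustration, a group definition in '/etc/pve/ha/groups.cfg'
could look like the following sketch (names and priorities are
examples; use the provided tools instead of editing the file by hand):

----
# /etc/pve/ha/groups.cfg (illustrative)
group: mygroup
        nodes node1:2,node2:2,node3:1
        restricted 0
        nofailback 1
----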
| 379 | |
| 380 | |
Start Failure Policy
--------------------
| 383 | |
The start failure policy comes into effect if a service failed to
start on a node one or more times. It can be used to configure how
often a restart should be triggered on the same node, and how often a
service should be relocated, so that it gets a chance to be started on
another node. The aim of this policy is to circumvent temporary
unavailability of shared resources on a specific node. For example, if
a shared storage isn't available on a quorate node anymore, e.g. due
to network problems, but is still available on other nodes, the
relocate policy allows the service to get started nonetheless.
| 392 | |
There are two service start recovery policy settings which can be
configured specifically for each resource.
| 395 | |
| 396 | max_restart:: |
| 397 | |
Maximal number of tries to restart a failed service on the actual
node. The default is set to one.
| 400 | |
| 401 | max_relocate:: |
| 402 | |
Maximal number of tries to relocate the service to a different node.
A relocate only happens after the 'max_restart' value is exceeded on
the actual node. The default is set to one.
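
Assuming the 'set' subcommand of 'ha-manager' is available on your
version, these values could be adjusted like this (a sketch):

----
# allow two restarts on the current node, then up to two
# relocation attempts to other nodes
ha-manager set vm:100 --max_restart 2 --max_relocate 2
----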
| 406 | |
NOTE: The relocate count state will only reset to zero when the
service had at least one successful start. That means, if a service is
re-enabled without fixing the error, only the restart policy gets
repeated.
| 411 | |
| 412 | Error Recovery |
| 413 | -------------- |
| 414 | |
If after all tries the service state could not be recovered, it gets
placed in an error state. In this state, the service won't get touched
by the HA stack anymore. To recover from this state you should follow
these steps (see the sketch after the list):
| 419 | |
* bring the resource back into a safe and consistent state (e.g.:
killing its process)

* disable the HA resource to place it in a stopped state

* fix the error which led to these failures

* *after* you fixed all errors you may enable the service again
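
For a VM resource, these steps could look like the following sketch
(VM ID 100 is a placeholder):

----
# bring the resource into a safe state, e.g. make sure the
# VM process is really stopped on its node
qm stop 100

# place the HA resource in the stopped state
ha-manager disable vm:100

# after fixing the root cause, enable the service again
ha-manager enable vm:100
----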
| 428 | |
| 429 | |
| 430 | Service Operations |
| 431 | ------------------ |
| 432 | |
This is how the basic user-initiated service operations (via
'ha-manager') work.
| 435 | |
| 436 | enable:: |
| 437 | |
| 438 | the service will be started by the LRM if not already running. |
| 439 | |
| 440 | disable:: |
| 441 | |
| 442 | the service will be stopped by the LRM if running. |
| 443 | |
migrate/relocate::

the service will be moved to another node: 'migrate' keeps the service
running (live migration), while 'relocate' stops it, moves it, and
starts it again on the target node (see the example after this list).
| 447 | |
| 448 | remove:: |
| 449 | |
| 450 | the service will be removed from the HA managed resource list. Its |
| 451 | current state will not be touched. |
| 452 | |
| 453 | start/stop:: |
| 454 | |
start and stop commands can be issued to the resource specific tools
(like 'qm' or 'pct'); they will forward the request to 'ha-manager',
which will then execute the action and set the resulting service state
(enabled, disabled).
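
For example (node names are placeholders):

----
# live migrate the VM to node2, keeping it running
ha-manager migrate vm:100 node2

# or: stop it, move it, and start it again on node2
ha-manager relocate vm:100 node2
----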
| 459 | |
| 460 | |
| 461 | Service States |
| 462 | -------------- |
| 463 | |
| 464 | stopped:: |
| 465 | |
Service is stopped (confirmed by the LRM). If detected to be running,
it will get stopped again.
| 468 | |
| 469 | request_stop:: |
| 470 | |
| 471 | Service should be stopped. Waiting for confirmation from LRM. |
| 472 | |
| 473 | started:: |
| 474 | |
Service is active, and the LRM should start it ASAP if it's not
already running. If the service fails and is detected to be not
running, the LRM restarts it.
| 477 | |
| 478 | fence:: |
| 479 | |
Wait for node fencing (the service node is not inside the quorate
cluster partition).
As soon as the node gets fenced successfully, the service will be
recovered to another node, if possible.
| 484 | |
| 485 | freeze:: |
| 486 | |
| 487 | Do not touch the service state. We use this state while we reboot a |
| 488 | node, or when we restart the LRM daemon. |
| 489 | |
| 490 | migrate:: |
| 491 | |
Migrate service (live) to another node.
| 493 | |
| 494 | error:: |
| 495 | |
| 496 | Service disabled because of LRM errors. Needs manual intervention. |
| 497 | |
| 498 | |
| 499 | ifdef::manvolnum[] |
| 500 | include::pve-copyright.adoc[] |
| 501 | endif::manvolnum[] |
| 502 | |