[[chapter_ha_manager]]
ifdef::manvolnum[]
ha-manager(1)
=============
:pve-toplevel:

NAME
----

ha-manager - Proxmox VE HA Manager

SYNOPSIS
--------

include::ha-manager.1-synopsis.adoc[]

DESCRIPTION
-----------
endif::manvolnum[]
ifndef::manvolnum[]
High Availability
=================
:pve-toplevel:
endif::manvolnum[]

Our modern society depends heavily on information provided by
computers over the network. Mobile devices amplified that dependency,
because people can access the network any time from anywhere. If you
provide such services, it is very important that they are available
most of the time.

We can mathematically define the availability as the ratio of (A), the
total time a service is capable of being used during a given interval,
to (B), the length of the interval. It is normally expressed as a
percentage of uptime in a given year.

.Availability - Downtime per Year
[width="60%",options="header"]
|===========================================================
|Availability % |Downtime per year
|99             |3.65 days
|99.9           |8.76 hours
|99.99          |52.56 minutes
|99.999         |5.26 minutes
|99.9999        |31.5 seconds
|99.99999       |3.15 seconds
|===========================================================

{pve} provides a software stack called `ha-manager`, which can detect
errors and handle failover automatically.


How It Works
------------

Local Resource Manager
~~~~~~~~~~~~~~~~~~~~~~

The local resource manager (`pve-ha-lrm`) runs on each cluster node and
executes the actions the CRM requests for the services located on that
node. The result of each action is written back through the node's
`lrm_status` file. There the CRM may collect it and let its state
machine - respective to the command's output - act on it.

The actions on each service between CRM and LRM are normally always synced.
This means that the CRM requests a state uniquely marked by a UID, the LRM
then executes this action *one time* and writes back the result, which is also
identifiable by the same UID. This is needed so that the LRM does not
execute an outdated command.
The only exceptions to this behaviour are the `stop` and `error` commands;
these two do not depend on the result produced and are always executed
in the case of the stopped state and once in the case of the error state.

.Read the Logs
[NOTE]
The HA Stack logs every action it makes. This helps to understand what
happens in the cluster, and also why. Here it's important to see
what both daemons, the LRM and the CRM, did. You may use
`journalctl -u pve-ha-lrm` on the node(s) where the service is and
the same command for `pve-ha-crm` on the node which is the current master.
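
For example, to follow both logs live while reproducing a problem, you could
run something like the following - the LRM log on the node where the service
is located, and the CRM log on the current master:

----
# journalctl -u pve-ha-lrm -f
# journalctl -u pve-ha-crm -f
----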

Cluster Resource Manager
~~~~~~~~~~~~~~~~~~~~~~~~

The cluster resource manager (`pve-ha-crm`) starts on each node and
waits there for the manager lock, which can only be held by one node
at a time. The node which successfully acquires the manager lock gets
promoted to the CRM master.

It can be in three states:

wait for agent lock::

The CRM waits for its exclusive lock. This is also used as the idle
state if no service is configured.

active::

The CRM holds its exclusive lock and has services configured.

lost agent lock::

The CRM lost its lock; this means a failure happened and quorum was lost.

Its main task is to manage the services which are configured to be highly
available and try to always enforce the requested state. For example, a
service with the requested state 'started' will be started if it's not
already running. If it crashes, it will be automatically started again.
Thus the CRM dictates the actions the LRM needs to execute.

When a node leaves the cluster quorum, its state changes to unknown.
If the current CRM can then secure the failed node's lock, the services
will be 'stolen' and restarted on another node.

When a cluster member determines that it is no longer in the cluster
quorum, the LRM waits for a new quorum to form. As long as there is no
quorum, the node cannot reset the watchdog. This will trigger a reboot
after the watchdog times out (this happens after 60 seconds).


HA Simulator
------------

[thumbnail="screenshot/gui-ha-manager-status.png"]

By using the HA simulator you can test and learn all the functionality of the
Proxmox VE HA solution.

By default, the simulator allows you to watch and test the behaviour of a
real-world 3 node cluster with 6 VMs. You can also add or remove additional
VMs or containers.

You do not have to set up or configure a real cluster, the HA simulator runs
out of the box.

Install with apt:

----
apt install pve-ha-simulator
----

You can even install the package on any Debian-based system without any
other Proxmox VE packages. For that you will need to download the package and
copy it to the system you want to run it on for installation. When you install
the package with apt from the local file system it will also resolve the
required dependencies for you.


To start the simulator on a remote machine you must have X11 redirection
to your current system.

If you are on a Linux machine you can use:

----
ssh root@<node> -Y
----

On Windows it works with https://mobaxterm.mobatek.net/[mobaxterm].

After connecting to an existing {pve} node with the simulator installed, or
after installing it on your local Debian-based system manually, you can try it
out as follows.

First you need to create a working directory where the simulator saves its
current state and writes its default config:

----
mkdir working
----

Then, simply pass the created directory as a parameter to 'pve-ha-simulator':

----
pve-ha-simulator working/
----

You can then start, stop, migrate the simulated HA services, or even check out
what happens on a node failure.


Configuration
-------------

The HA stack is well integrated into the {pve} API. So, for example,
HA can be configured via the `ha-manager` command line interface, or
the {pve} web interface - both interfaces provide an easy way to
manage HA. Automation tools can use the API directly.

All HA configuration files are within `/etc/pve/ha/`, so they get
automatically distributed to the cluster nodes, and all nodes share
the same HA configuration.
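
For example, automation scripts could read the current HA configuration
through the API with `pvesh`. This is only a sketch - it assumes the
cluster-wide HA API paths used by the web interface, which may differ between
{pve} versions:

----
# pvesh get /cluster/ha/resources
# pvesh get /cluster/ha/groups
----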

[[ha_manager_resource_config]]
Resources
~~~~~~~~~

[thumbnail="screenshot/gui-ha-manager-status.png"]

The resource configuration file `/etc/pve/ha/resources.cfg` stores
the list of resources managed by `ha-manager`. A resource configuration
inside that list looks like this:

----
<type>: <name>
    <property> <value>
    ...
----

It starts with a resource type, followed by a resource specific name,
separated with a colon. Together this forms the HA resource ID, which is
used by all `ha-manager` commands to uniquely identify a resource
(example: `vm:100` or `ct:101`). The next lines contain additional
properties:

include::ha-resources-opts.adoc[]

Here is a real world example with one VM and one container. As you see,
the syntax of those files is really simple, so it is even possible to
read or edit those files using your favorite editor:

.Configuration Example (`/etc/pve/ha/resources.cfg`)
----
vm: 501
    state started
    max_relocate 2

ct: 102
    # Note: use default settings for everything
----

[thumbnail="screenshot/gui-ha-manager-add-resource.png"]

The above config was generated using the `ha-manager` command line tool:

----
# ha-manager add vm:501 --state started --max_relocate 2
# ha-manager add ct:102
----


[[ha_manager_groups]]
Groups
~~~~~~

[thumbnail="screenshot/gui-ha-manager-groups-view.png"]

The HA group configuration file `/etc/pve/ha/groups.cfg` is used to
define groups of cluster nodes. A resource can be restricted to run
only on the members of such a group. A group configuration looks like
this:

----
group: <group>
    nodes <node_list>
    <property> <value>
    ...
----

include::ha-groups-opts.adoc[]

[thumbnail="screenshot/gui-ha-manager-add-group.png"]

A common requirement is that a resource should run on a specific
node. Usually the resource is able to run on other nodes, so you can define
an unrestricted group with a single member:

----
# ha-manager groupadd prefer_node1 --nodes node1
----

For bigger clusters, it makes sense to define a more detailed failover
behavior. For example, you may want to run a set of services on
`node1` if possible. If `node1` is not available, you want to run them
equally split on `node2` and `node3`. If those nodes also fail, the
services should run on `node4`. To achieve this you could set the node
list to:

----
# ha-manager groupadd mygroup1 -nodes "node1:2,node2:1,node3:1,node4"
----

Another use case is if a resource uses other resources only available
on specific nodes, let's say `node1` and `node2`. We need to make sure
that the HA manager does not use other nodes, so we need to create a
restricted group with said nodes:

----
# ha-manager groupadd mygroup2 -nodes "node1,node2" -restricted
----

The above commands created the following group configuration file:

.Configuration Example (`/etc/pve/ha/groups.cfg`)
----
group: prefer_node1
    nodes node1

group: mygroup1
    nodes node2:1,node4,node1:2,node3:1

group: mygroup2
    nodes node2,node1
    restricted 1
----


The `nofailback` option is mostly useful to avoid unwanted resource
movements during administration tasks. For example, if you need to
migrate a service to a node which doesn't have the highest priority in the
group, you need to tell the HA manager not to instantly move this service
back by setting the `nofailback` option.

Another scenario is when a service was fenced and it got recovered to
another node. The admin tries to repair the fenced node and brings it
up online again to investigate the cause of the failure and check if it runs
stably again. Setting the `nofailback` flag prevents the recovered services
from moving straight back to the fenced node.
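
If you want to toggle this for an existing group, for example around a
maintenance window, a sketch could look like the following. This assumes that
group options such as `nofailback` can also be changed after creation via
`ha-manager groupset`:

----
# ha-manager groupset prefer_node1 --nofailback 1
----

Set it back to `0` once the maintenance is finished.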

[[ha_manager_fencing]]
Fencing
-------

On node failures, fencing ensures that the erroneous node is
guaranteed to be offline. This is required to make sure that no
resource runs twice when it gets recovered on another node. This is a
really important task, because without this, it would not be possible to
recover a resource on another node.

If a node did not get fenced, it would be in an unknown state where
it may still have access to shared resources. This is really
dangerous! Imagine that every network but the storage one broke. Now,
while not reachable from the public network, the VM still runs and
writes to the shared storage.

If we then simply start up this VM on another node, we would get a
dangerous race condition, because we write from both nodes. Such
conditions can destroy all VM data and the whole VM could be rendered
unusable. The recovery could also fail if the storage protects against
multiple mounts.


How {pve} Fences
~~~~~~~~~~~~~~~~

There are different methods to fence a node, for example, fence
devices which cut off the power from the node or disable their
communication completely. Those are often quite expensive and bring
additional critical components into a system, because if they fail you
cannot recover any service.

We thus wanted to integrate a simpler fencing method, which does not
require additional external hardware. This can be done using
watchdog timers.

.Possible Fencing Methods
- external power switches
- isolate nodes by disabling complete network traffic on the switch
- self fencing using watchdog timers

Watchdog timers have been widely used in critical and dependable systems
since the beginning of microcontrollers. They are often simple, independent
integrated circuits which are used to detect and recover from computer
malfunctions.

During normal operation, `ha-manager` regularly resets the watchdog
timer to prevent it from elapsing. If, due to a hardware fault or
program error, the computer fails to reset the watchdog, the timer
will elapse and trigger a reset of the whole server (reboot).

Recent server motherboards often include such hardware watchdogs, but
these need to be configured. If no watchdog is available or
configured, we fall back to the Linux Kernel 'softdog'. While still
reliable, it is not independent of the server's hardware, and thus has
a lower reliability than a hardware watchdog.


Configure Hardware Watchdog
~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, all hardware watchdog modules are blocked for security
reasons. They are like a loaded gun if not correctly initialized. To
enable a hardware watchdog, you need to specify the module to load in
'/etc/default/pve-ha-manager', for example:

----
# select watchdog module (default is softdog)
WATCHDOG_MODULE=iTCO_wdt
----

This configuration is read by the 'watchdog-mux' service, which loads
the specified module at startup.


Recover Fenced Services
~~~~~~~~~~~~~~~~~~~~~~~

After a node failed and its fencing was successful, the CRM tries to
move services from the failed node to nodes which are still online.

The selection of nodes on which those services get recovered is
influenced by the resource `group` settings, the list of currently active
nodes, and their respective active service count.

The CRM first builds a set out of the intersection between user selected
nodes (from the `group` setting) and available nodes. It then chooses the
subset of nodes with the highest priority, and finally selects the node
with the lowest active service count. This minimizes the possibility
of an overloaded node.

CAUTION: On node failure, the CRM distributes services to the
remaining nodes. This increases the service count on those nodes, and
can lead to high load, especially on small clusters. Please design
your cluster so that it can handle such worst case scenarios.
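
After a node was fenced, you can, for example, check which node currently
holds the manager lock and where the recovered services ended up with the
status command:

----
# ha-manager status
----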

[[ha_manager_start_failure_policy]]
Start Failure Policy
--------------------

The start failure policy comes into effect if a service failed to start on a
node one or more times. It can be used to configure how often a restart
should be triggered on the same node and how often a service should be
relocated, so that it has an attempt to be started on another node.
The aim of this policy is to circumvent temporary unavailability of shared
resources on a specific node. For example, if a shared storage isn't available
on a quorate node anymore, for instance due to network problems, but is still
available on other nodes, the relocate policy allows the service to start
nonetheless.

There are two service start recovery policy settings which can be configured
specifically for each resource.

max_restart::

Maximum number of attempts to restart a failed service on the current
node. The default is set to one.

max_relocate::

Maximum number of attempts to relocate the service to a different node.
A relocate only happens after the max_restart value is exceeded on the
current node. The default is set to one.

NOTE: The relocate count state will only reset to zero when the
service had at least one successful start. That means if a service is
restarted without fixing the error, only the restart policy gets
repeated.


[[ha_manager_error_recovery]]
Error Recovery
--------------

If, after all attempts, the service state could not be recovered, it gets
placed in an error state. In this state, the service won't get touched
by the HA stack anymore. The only way out is disabling a service:

----
# ha-manager set vm:100 --state disabled
----

This can also be done in the web interface.

To recover from the error state you should do the following:

* bring the resource back into a safe and consistent state (e.g., kill
its process if the service could not be stopped)

* disable the resource to remove the error flag

* fix the error which led to these failures

* *after* you fixed all errors, you may request that the service starts
again (see the example below)
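
For the `vm:100` resource from above, requesting the start again could, for
example, look like this once the underlying problem is fixed:

----
# ha-manager set vm:100 --state started
----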

[[ha_manager_package_updates]]
Package Updates
---------------

When updating the ha-manager, you should do one node after the other, never
all at once, for various reasons. First, while we test our software
thoroughly, a bug affecting your specific setup cannot totally be ruled out.
Updating one node after the other and checking the functionality of each node
after finishing the update helps to recover from potential problems, while
updating all at once could result in a broken cluster and is generally not
good practice.

Also, the {pve} HA stack uses a request-acknowledge protocol to perform
actions between the cluster and the local resource manager. For restarting,
the LRM makes a request to the CRM to freeze all its services. This prevents
them from getting touched by the cluster during the short time the LRM is
restarting. After that, the LRM may safely close the watchdog during a restart.
Such a restart normally happens during a package update and, as already stated,
an active master CRM is needed to acknowledge the requests from the LRM. If
this is not the case, the update process can take too long which, in the worst
case, may result in a reset triggered by the watchdog.


Node Maintenance
----------------

It is sometimes necessary to shut down or reboot a node to do maintenance
tasks, such as to replace hardware, or simply to install a new kernel image.
This is also true when using the HA stack. The behaviour of the HA stack during
a shutdown can be configured.

[[ha_manager_shutdown_policy]]
Shutdown Policy
~~~~~~~~~~~~~~~

Below you will find a description of the different HA policies for a node
shutdown. Currently 'Conditional' is the default due to backward compatibility.
Some users may find that 'Migrate' behaves more as expected.

Migrate
^^^^^^^

Once the Local Resource Manager (LRM) gets a shutdown request and this policy
is enabled, it will mark itself as unavailable for the current HA manager.
This triggers a migration of all HA services currently located on this node.
The LRM will try to delay the shutdown process until all running services get
moved away. But this expects that the running services *can* be migrated to
another node. In other words, the service must not be locally bound, for
example by using hardware passthrough. As non-group member nodes are considered
as runnable targets if no group member is available, this policy can still be
used when making use of HA groups with only some nodes selected. But marking a
group as 'restricted' tells the HA manager that the service cannot run outside
of the chosen set of nodes. If all of those nodes are unavailable, the shutdown
will hang until you manually intervene. Once the shut down node comes back
online again, the previously displaced services will be moved back, if they
were not already manually migrated in between.

NOTE: The watchdog is still active during the migration process on shutdown.
If the node loses quorum, it will be fenced and the services will be recovered.

If you start a (previously stopped) service on a node which is currently being
maintained, the node needs to be fenced to ensure that the service can be moved
and started on another available node.

Failover
^^^^^^^^

This mode ensures that all services get stopped, but that they will also be
recovered, if the current node is not online soon. It can be useful when doing
maintenance on a cluster scale, where live-migrating VMs may not be possible if
too many nodes are powered off at a time, but you still want to ensure HA
services get recovered and started again as soon as possible.

Freeze
^^^^^^

This mode ensures that all services get stopped and frozen, so that they won't
get recovered until the current node is online again.

Conditional
^^^^^^^^^^^

The 'Conditional' shutdown policy automatically detects if a shutdown or a
reboot is requested, and changes behaviour accordingly.

.Shutdown

A shutdown ('poweroff') is usually done if it is planned for the node to stay
down for some time. The LRM stops all managed services in this case. This means
that other nodes will take over those services afterwards.

NOTE: Recent hardware has large amounts of memory (RAM). So we stop all
resources, then restart them to avoid online migration of all that RAM. If you
want to use online migration, you need to invoke that manually before you
shut down the node.
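
For example, to move a single HA-managed VM away by hand before the shutdown,
you could request a migration through the HA stack. This is only a sketch -
`node2` is a placeholder for your target node, and depending on the {pve}
version the command may instead be spelled `ha-manager crm-command migrate`:

----
# ha-manager migrate vm:100 node2
----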

.Reboot

Node reboots are initiated with the 'reboot' command. This is usually done
after installing a new kernel. Please note that this is different from
``shutdown'', because the node immediately starts again.

The LRM tells the CRM that it wants to restart, and waits until the CRM puts
all resources into the `freeze` state (the same mechanism is used for
xref:ha_manager_package_updates[Package Updates]). This prevents those
resources from being moved to other nodes. Instead, the CRM starts the
resources after the reboot on the same node.


Manual Resource Movement
^^^^^^^^^^^^^^^^^^^^^^^^

Last but not least, you can also manually move resources to other nodes, before
you shut down or restart a node. The advantage is that you have full control,
and you can decide if you want to use online migration or not.

NOTE: Please do not 'kill' services like `pve-ha-crm`, `pve-ha-lrm` or
`watchdog-mux`. They manage and use the watchdog, so this can result in an
immediate node reboot or even reset.

ifdef::manvolnum[]