X-Git-Url: https://git.proxmox.com/?p=pve-docs.git;a=blobdiff_plain;f=ha-manager.adoc;h=54db2a50f13409b24b4e7efe5075807ec705f475;hp=e68dcbe4645a769b22c849919a1a9834b76c4717;hb=HEAD;hpb=3810ae1e90774bd0d54e36485b9020cb46d1c512 diff --git a/ha-manager.adoc b/ha-manager.adoc index e68dcbe..66a3b8f 100644 --- a/ha-manager.adoc +++ b/ha-manager.adoc @@ -1,15 +1,15 @@ -[[chapter-ha-manager]] +[[chapter_ha_manager]] ifdef::manvolnum[] -PVE({manvolnum}) -================ -include::attributes.txt[] +ha-manager(1) +============= +:pve-toplevel: NAME ---- ha-manager - Proxmox VE HA Manager -SYNOPSYS +SYNOPSIS -------- include::ha-manager.1-synopsis.adoc[] @@ -17,110 +17,396 @@ include::ha-manager.1-synopsis.adoc[] DESCRIPTION ----------- endif::manvolnum[] - ifndef::manvolnum[] High Availability ================= -include::attributes.txt[] +:pve-toplevel: endif::manvolnum[] -'ha-manager' handles management of user-defined cluster services. This -includes handling of user requests which may start, stop, relocate, +Our modern society depends heavily on information provided by +computers over the network. Mobile devices amplified that dependency, +because people can access the network any time from anywhere. If you +provide such services, it is very important that they are available +most of the time. + +We can mathematically define the availability as the ratio of (A), the +total time a service is capable of being used during a given interval +to (B), the length of the interval. It is normally expressed as a +percentage of uptime in a given year. + +.Availability - Downtime per Year +[width="60%",cols="/lrm_status'. There the CRM may collect -it and let its state machine - respective the commands output - act on it. +Each command requested by the CRM is uniquely identifiable by a UID. When +the worker finishes, its result will be processed and written in the LRM +status file `/etc/pve/nodes//lrm_status`. There the CRM may collect +it and let its state machine - respective to the commands output - act on it. The actions on each service between CRM and LRM are normally always synced. -This means that the CRM requests a state uniquely marked by an UID, the LRM -then executes this action *one time* and writes back the result, also +This means that the CRM requests a state uniquely marked by a UID, the LRM +then executes this action *one time* and writes back the result, which is also identifiable by the same UID. This is needed so that the LRM does not -executes an outdated command. -With the exception of the 'stop' and the 'error' command, -those two do not depend on the result produce and are executed +execute an outdated command. +The only exceptions to this behaviour are the `stop` and `error` commands; +these two do not depend on the result produced and are executed always in the case of the stopped state and once in the case of the error state. @@ -132,208 +418,675 @@ what both daemons, the LRM and the CRM, did. You may use `journalctl -u pve-ha-lrm` on the node(s) where the service is and the same command for the pve-ha-crm on the node which is the current master. + +[[ha_manager_crm]] Cluster Resource Manager ~~~~~~~~~~~~~~~~~~~~~~~~ -The cluster resource manager ('pve-ha-crm') starts on each node and +The cluster resource manager (`pve-ha-crm`) starts on each node and waits there for the manager lock, which can only be held by one node at a time. The node which successfully acquires the manager lock gets promoted to the CRM master. 
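+
+You can check which node currently holds the manager lock, the state of each
+node's LRM, and the state of all managed services with the `ha-manager status`
+command. The output below is only an illustration (node names, dates and
+services are made up, and the exact wording may differ between versions):
+
+----
+# ha-manager status
+quorum OK
+master node1 (active, Mon Jan  8 10:21:12 2024)
+lrm node1 (active, Mon Jan  8 10:21:14 2024)
+lrm node2 (active, Mon Jan  8 10:21:15 2024)
+service vm:100 (node1, started)
+----
+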
-It can be in three states: TODO
+The CRM can be in three states:
+
+wait for agent lock::
+
+The CRM waits for our exclusive lock. This is also used as idle state if no
+service is configured.
+
+active::
 
-* *wait for agent lock*: the LRM waits for our exclusive lock. This is
-  also used as idle sate if no service is configured
-* *active*: the LRM holds its exclusive lock and has services configured
-* *lost agent lock*: the LRM lost its lock, this means a failure happened
-  and quorum was lost.
+The CRM holds its exclusive lock and has services configured.
 
-It main task is to manage the services which are configured to be highly
-available and try to get always bring them in the wanted state, e.g.: a
-enabled service will be started if its not running, if it crashes it will
-be started again. Thus it dictates the LRM the wanted actions.
+lost agent lock::
 
-When an node leaves the cluster quorum, its state changes to unknown.
-If the current CRM then can secure the failed nodes lock, the services
+The CRM lost its lock, which means a failure happened and quorum was lost.
+
+Its main task is to manage the services which are configured to be highly
+available and try to always enforce the requested state. For example, a
+service with the requested state 'started' will be started if it's not
+already running. If it crashes, it will be automatically started again.
+Thus the CRM dictates the actions the LRM needs to execute.
+
+When a node leaves the cluster quorum, its state changes to unknown.
+If the current CRM can then secure the failed node's lock, the services
 will be 'stolen' and restarted on another node.
 
 When a cluster member determines that it is no longer in the cluster
 quorum, the LRM waits for a new quorum to form. As long as there is no
 quorum the node cannot reset the watchdog. This will trigger a reboot
-after 60 seconds.
+after the watchdog times out (this happens after 60 seconds).
+
+
+HA Simulator
+------------
+
+[thumbnail="screenshot/gui-ha-manager-status.png"]
+
+By using the HA simulator you can test and learn all functionalities of the
+Proxmox VE HA solution.
+
+By default, the simulator allows you to watch and test the behaviour of a
+real-world 3 node cluster with 6 VMs. You can also add or remove additional VMs
+or containers.
+
+You do not have to set up or configure a real cluster, the HA simulator runs
+out of the box.
+
+Install with apt:
+
+----
+apt install pve-ha-simulator
+----
+
+You can even install the package on any Debian-based system without any
+other Proxmox VE packages. For that you will need to download the package and
+copy it to the system you want to run it on for installation. When you install
+the package with apt from the local file system it will also resolve the
+required dependencies for you.
+
+
+To start the simulator on a remote machine you must have an X11 redirection to
+your current system.
+
+If you are on a Linux machine you can use:
+
+----
+ssh root@<IP-of-node> -Y
+----
+
+On Windows it works with https://mobaxterm.mobatek.net/[mobaxterm].
+
+After connecting to an existing {pve} node with the simulator installed, or
+installing it on your local Debian-based system manually, you can try it out as
+follows.
+
+First you need to create a working directory where the simulator saves its
+current state and writes its default config:
+
+----
+mkdir working
+----
+
+Then, simply pass the created directory as a parameter to 'pve-ha-simulator':
+
+----
+pve-ha-simulator working/
+----
+
+You can then start, stop, migrate the simulated HA services, or even check out
+what happens on a node failure.
 
 Configuration
 -------------
 
-The HA stack is well integrated int the Proxmox VE API2. So, for
-example, HA can be configured via 'ha-manager' or the PVE web
-interface, which both provide an easy to use tool.
+The HA stack is well integrated into the {pve} API. So, for example,
+HA can be configured via the `ha-manager` command-line interface, or
+the {pve} web interface - both interfaces provide an easy way to
+manage HA. Automation tools can use the API directly.
+
+All HA configuration files are within `/etc/pve/ha/`, so they get
+automatically distributed to the cluster nodes, and all nodes share
+the same HA configuration.
+
+
+[[ha_manager_resource_config]]
+Resources
+~~~~~~~~~
+
+[thumbnail="screenshot/gui-ha-manager-status.png"]
+
+
+The resource configuration file `/etc/pve/ha/resources.cfg` stores
+the list of resources managed by `ha-manager`. A resource configuration
+inside that list looks like this:
+
+----
+<type>: <name>
+    <property> <value>
+    ...
+----
+
+It starts with a resource type followed by a resource specific name,
+separated by a colon. Together this forms the HA resource ID, which is
+used by all `ha-manager` commands to uniquely identify a resource
+(example: `vm:100` or `ct:101`). The next lines contain additional
+properties:
+
+include::ha-resources-opts.adoc[]
+
+Here is a real world example with one VM and one container. As you see,
+the syntax of those files is really simple, so it is even possible to
+read or edit those files using your favorite editor:
+
+.Configuration Example (`/etc/pve/ha/resources.cfg`)
+----
+vm: 501
+    state started
+    max_relocate 2
+
+ct: 102
+    # Note: use default settings for everything
+----
 
-The resource configuration file can be located at
-'/etc/pve/ha/resources.cfg' and the group configuration file at
-'/etc/pve/ha/groups.cfg'. Use the provided tools to make changes,
-there shouldn't be any need to edit them manually.
 
+[thumbnail="screenshot/gui-ha-manager-add-resource.png"]
 
-Node Power Status
------------------
+The above config was generated using the `ha-manager` command-line tool:
+
+----
+# ha-manager add vm:501 --state started --max_relocate 2
+# ha-manager add ct:102
+----
+
+
+[[ha_manager_groups]]
+Groups
+~~~~~~
+
+[thumbnail="screenshot/gui-ha-manager-groups-view.png"]
+
+The HA group configuration file `/etc/pve/ha/groups.cfg` is used to
+define groups of cluster nodes. A resource can be restricted to run
+only on the members of such a group. A group configuration looks like
+this:
+
+----
+group: <group>
+    nodes <node_list>
+    <property> <value>
+    ...
+----
+
+include::ha-groups-opts.adoc[]
+
+[thumbnail="screenshot/gui-ha-manager-add-group.png"]
+
+A common requirement is that a resource should run on a specific
+node. Usually the resource is able to run on other nodes, so you can define
+an unrestricted group with a single member:
+
+----
+# ha-manager groupadd prefer_node1 --nodes node1
+----
 
-If a node needs maintenance you should migrate and or relocate all
-services which are required to run always on another node first.
-After that you can stop the LRM and CRM services. But note that the
-watchdog triggers if you stop it with active services.
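+
+To actually place a resource under the control of that group, set its `group`
+property. For example, using the VM from the resource configuration example
+above:
+
+----
+# ha-manager set vm:501 --group prefer_node1
+----
+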
+For bigger clusters, it makes sense to define a more detailed failover
+behavior. For example, you may want to run a set of services on
+`node1` if possible. If `node1` is not available, you want to run them
+equally split on `node2` and `node3`. If those nodes also fail, the
+services should run on `node4`. To achieve this you could set the node
+list to:
+
+----
+# ha-manager groupadd mygroup1 -nodes "node1:2,node2:1,node3:1,node4"
+----
+
+Another use case is if a resource uses other resources only available
+on specific nodes, let's say `node1` and `node2`. We need to make sure
+that the HA manager does not use other nodes, so we need to create a
+restricted group with said nodes:
+
+----
+# ha-manager groupadd mygroup2 -nodes "node1,node2" -restricted
+----
+
+The above commands created the following group configuration file:
+
+.Configuration Example (`/etc/pve/ha/groups.cfg`)
+----
+group: prefer_node1
+    nodes node1
+
+group: mygroup1
+    nodes node2:1,node4,node1:2,node3:1
+
+group: mygroup2
+    nodes node2,node1
+    restricted 1
+----
+
+
+The `nofailback` option is mostly useful to avoid unwanted resource
+movements during administration tasks. For example, if you need to
+migrate a service to a node which doesn't have the highest priority in the
+group, you need to tell the HA manager not to instantly move this service
+back by setting the `nofailback` option.
+
+Another scenario is when a service was fenced and it got recovered to
+another node. The admin tries to repair the fenced node and brings it
+up online again to investigate the cause of failure and check if it runs
+stably again. Setting the `nofailback` flag prevents the recovered services from
+moving straight back to the fenced node.
+
+
+[[ha_manager_fencing]]
 Fencing
 -------
 
-What Is Fencing
-~~~~~~~~~~~~~~~
+On node failures, fencing ensures that the erroneous node is
+guaranteed to be offline. This is required to make sure that no
+resource runs twice when it gets recovered on another node. This is a
+really important task, because without this, it would not be possible to
+recover a resource on another node.
 
-Fencing secures that on a node failure the dangerous node gets will be rendered
-unable to do any damage and that no resource runs twice when it gets recovered
-from the failed node.
+If a node did not get fenced, it would be in an unknown state where
+it may still have access to shared resources. This is really
+dangerous! Imagine that every network but the storage one broke. Now,
+while not reachable from the public network, the VM still runs and
+writes to the shared storage.
 
-Configure Hardware Watchdog
-~~~~~~~~~~~~~~~~~~~~~~~~~~~
-By default all watchdog modules are blocked for security reasons as they are
-like a loaded gun if not correctly initialized.
-If you have a hardware watchdog available remove its module from the blacklist
-and restart 'the watchdog-mux' service.
+If we then simply start up this VM on another node, we would get a
+dangerous race condition, because we write from both nodes. Such
+conditions can destroy all VM data and the whole VM could be rendered
+unusable. The recovery could also fail if the storage protects against
+multiple mounts.
 
-Resource/Service Agents
--------------------------
+How {pve} Fences
+~~~~~~~~~~~~~~~~
 
-A resource or also called service can be managed by the
-ha-manager. Currently we support virtual machines and container.
+There are different methods to fence a node, for example, fence
+devices which cut off the power from the node or disable their
+communication completely. Those are often quite expensive and bring
+additional critical components into a system, because if they fail you
+cannot recover any service.
 
-Groups
-------
+We thus wanted to integrate a simpler fencing method, which does not
+require additional external hardware. This can be done using
+watchdog timers.
 
-A group is a collection of cluster nodes which a service may be bound to.
+.Possible Fencing Methods
+- external power switches
+- isolate nodes by disabling complete network traffic on the switch
+- self fencing using watchdog timers
 
-Group Settings
-~~~~~~~~~~~~~~
+Watchdog timers have been widely used in critical and dependable systems
+since the beginning of microcontrollers. They are often simple, independent
+integrated circuits which are used to detect and recover from computer malfunctions.
 
-nodes::
+During normal operation, `ha-manager` regularly resets the watchdog
+timer to prevent it from elapsing. If, due to a hardware fault or
+program error, the computer fails to reset the watchdog, the timer
+will elapse and trigger a reset of the whole server (reboot).
 
-list of group node members
+Recent server motherboards often include such hardware watchdogs, but
+these need to be configured. If no watchdog is available or
+configured, we fall back to the Linux Kernel 'softdog'. While still
+reliable, it is not independent of the server's hardware, and thus has
+a lower reliability than a hardware watchdog.
 
-restricted::
 
-resources bound to this group may only run on nodes defined by the
-group. If no group node member is available the resource will be
-placed in the stopped state.
+Configure Hardware Watchdog
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-nofailback::
+By default, all hardware watchdog modules are blocked for security
+reasons. They are like a loaded gun if not correctly initialized. To
+enable a hardware watchdog, you need to specify the module to load in
+'/etc/default/pve-ha-manager', for example:
 
-the resource won't automatically fail back when a more preferred node
-(re)joins the cluster.
+----
+# select watchdog module (default is softdog)
+WATCHDOG_MODULE=iTCO_wdt
+----
+
+This configuration is read by the 'watchdog-mux' service, which loads
+the specified module at startup.
 
-Recovery Policy
----------------
 
-There are two service recover policy settings which can be configured
+Recover Fenced Services
+~~~~~~~~~~~~~~~~~~~~~~~
+
+After a node failed and its fencing was successful, the CRM tries to
+move services from the failed node to nodes which are still online.
+
+The selection of nodes, on which those services get recovered, is
+influenced by the resource `group` settings, the list of currently active
+nodes, and their respective active service count.
+
+The CRM first builds a set out of the intersection between user selected
+nodes (from the `group` setting) and available nodes. It then chooses the
+subset of nodes with the highest priority, and finally selects the node
+with the lowest active service count. This minimizes the possibility
+of an overloaded node.
+
+CAUTION: On node failure, the CRM distributes services to the
+remaining nodes. This increases the service count on those nodes, and
+can lead to high load, especially on small clusters. Please design
+your cluster so that it can handle such worst case scenarios.
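+
+As a purely hypothetical example of the selection described above, consider
+the following group and assume `node1` fails while `node2` currently runs
+three HA services and `node3` runs only one:
+
+----
+group: example_prio
+    nodes node1:2,node2:2,node3:1
+----
+
+The CRM first drops `node1` from the candidate set, then keeps only the
+remaining nodes with the highest priority (just `node2`, priority 2), and
+therefore recovers the services to `node2`. The less loaded `node3` is not
+chosen, because priority is evaluated before the active service count.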
+
+
+[[ha_manager_start_failure_policy]]
+Start Failure Policy
+--------------------
+
+The start failure policy comes into effect if a service failed to start on a
+node one or more times. It can be used to configure how often a restart
+should be triggered on the same node and how often a service should be
+relocated, so that it gets a chance to be started on another node.
+The aim of this policy is to circumvent temporary unavailability of shared
+resources on a specific node. For example, if a shared storage isn't available
+on a quorate node anymore, for instance due to network problems, but is still
+available on other nodes, the relocate policy allows the service to start
+nonetheless.
+
+There are two service start recovery policy settings which can be configured
 specifically for each resource.
 
 max_restart::
 
-maximal number of tries to restart an failed service on the actual
+Maximum number of attempts to restart a failed service on the actual
 node. The default is set to one.
 
 max_relocate::
 
-maximal number of tries to relocate the service to a different node.
+Maximum number of attempts to relocate the service to a different node.
 A relocate only happens after the max_restart value is exceeded on the
 actual node. The default is set to one.
 
-Note that the relocate count state will only reset to zero when the
+NOTE: The relocate count state will only reset to zero when the
 service had at least one successful start. That means if a service is
-re-enabled without fixing the error only the restart policy gets
+re-started without fixing the error, only the restart policy gets
 repeated.
 
+
+[[ha_manager_error_recovery]]
 Error Recovery
 --------------
 
-If after all tries the service state could not be recovered it gets
-placed in an error state. In this state the service won't get touched
-by the HA stack anymore. To recover from this state you should follow
-these steps:
+If, after all attempts, the service state could not be recovered, it gets
+placed in an error state. In this state, the service won't get touched
+by the HA stack anymore. The only way out is disabling a service:
+
+----
+# ha-manager set vm:100 --state disabled
+----
 
-* bring the resource back into an safe and consistent state (e.g:
-killing its process)
+This can also be done in the web interface.
 
-* disable the ha resource to place it in an stopped state
+To recover from the error state you should do the following:
+
+* bring the resource back into a safe and consistent state (e.g.:
+kill its process if the service could not be stopped)
+
+* disable the resource to remove the error flag
 
 * fix the error which led to these failures
 
-* *after* you fixed all errors you may enable the service again
+* *after* you fixed all errors you may request that the service starts again
 
-Service Operations
-------------------
+[[ha_manager_package_updates]]
+Package Updates
+---------------
 
-This are how the basic user-initiated service operations (via
-'ha-manager') work.
+When updating the ha-manager, you should do one node after the other, never
+all at once for various reasons. First, while we test our software
+thoroughly, a bug affecting your specific setup cannot totally be ruled out.
+Updating one node after the other and checking the functionality of each node
+after finishing the update helps to recover from eventual problems, while
+updating all at once could result in a broken cluster and is generally not
+good practice.
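+
+A conservative update sequence could therefore look like the following,
+repeated for one node at a time (plain Debian package management, nothing HA
+specific):
+
+----
+# apt update
+# apt dist-upgrade
+# ha-manager status
+----
+
+Only continue with the next node once `ha-manager status` reports the expected
+quorum, master, LRM and service states again.
+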
-enable::
+Also, the {pve} HA stack uses a request acknowledge protocol to perform
+actions between the cluster and the local resource manager. For restarting,
+the LRM makes a request to the CRM to freeze all its services. This prevents
+them from getting touched by the cluster during the short time the LRM is restarting.
+After that, the LRM may safely close the watchdog during a restart.
+Such a restart happens normally during a package update and, as already stated,
+an active master CRM is needed to acknowledge the requests from the LRM. If
+this is not the case, the update process can take too long which, in the worst
+case, may result in a reset triggered by the watchdog.
 
-the service will be started by the LRM if not already running.
 
-disable::
+[[ha_manager_node_maintenance]]
+Node Maintenance
+----------------
 
-the service will be stopped by the LRM if running.
+Sometimes it is necessary to perform maintenance on a node, such as replacing
+hardware or simply installing a new kernel image. This also applies while the
+HA stack is in use.
 
-migrate/relocate::
+The HA stack can support you mainly in two types of maintenance:
 
-the service will be relocated (live) to another node.
+* for general shutdowns or reboots, the behavior can be configured, see
+  xref:ha_manager_shutdown_policy[Shutdown Policy].
+* for maintenance that does not require a shutdown or reboot, or that should
+  not be switched off automatically after only one reboot, you can enable the
+  manual maintenance mode.
 
-remove::
 
-the service will be removed from the HA managed resource list. Its
-current state will not be touched.
+Maintenance Mode
+~~~~~~~~~~~~~~~~
 
-start/stop::
+You can use the manual maintenance mode to mark the node as unavailable for HA
+operation, prompting all services managed by HA to migrate to other nodes.
 
-start and stop commands can be issued to the resource specific tools
-(like 'qm' or 'pct'), they will forward the request to the
-'ha-manager' which then will execute the action and set the resulting
-service state (enabled, disabled).
+The target nodes for these migrations are selected from the other currently
+available nodes, and determined by the HA group configuration and the configured
+cluster resource scheduler (CRS) mode.
+During each migration, the original node will be recorded in the HA manager's
+state, so that the service can be moved back again automatically once the
+maintenance mode is disabled and the node is back online.
+Currently you can enable or disable the maintenance mode using the ha-manager
+CLI tool.
 
-Service States
---------------
+.Enabling maintenance mode for a node
+----
+# ha-manager crm-command node-maintenance enable NODENAME
+----
 
-stopped::
+This will queue a CRM command; when the manager processes this command, it will
+record the request for maintenance mode in the manager status. This allows you
+to submit the command on any node, not just on the one you want to place in or
+out of maintenance mode.
 
-Service is stopped (confirmed by LRM)
+Once the LRM on the respective node picks the command up, it will mark itself as
+unavailable, but still process all migration commands. This means that the LRM
+self-fencing watchdog will stay active until all active services got moved, and
+all running workers finished.
 
-request_stop::
+Note that the LRM status will read `maintenance` mode as soon as the LRM has
+picked up the requested state, not only once all services have been moved away;
+this user experience is planned to be improved in the future.
+For now, you can check for any active HA service left on the node, or watch
+out for a log line like: `pve-ha-lrm[PID]: watchdog closed (disabled)` to know
+when the node has finished its transition into maintenance mode.
 
-Service should be stopped. Waiting for confirmation from LRM.
+NOTE: The manual maintenance mode is not automatically deleted on node reboot;
+it is only removed if it is either manually deactivated using the `ha-manager`
+CLI or if the manager-status is manually cleared.
 
-started::
+.Disabling maintenance mode for a node
+----
+# ha-manager crm-command node-maintenance disable NODENAME
+----
 
-Service is active an LRM should start it ASAP if not already running.
+The process of disabling the manual maintenance mode is similar to enabling it.
+Using the `ha-manager` CLI command shown above will queue a CRM command that,
+once processed, marks the respective LRM node as available again.
 
-fence::
+If you deactivate the maintenance mode, all services that were on the node when
+the maintenance mode was activated will be moved back.
 
-Wait for node fencing (service node is not inside quorate cluster
-partition).
+[[ha_manager_shutdown_policy]]
+Shutdown Policy
+~~~~~~~~~~~~~~~
 
-freeze::
+Below you will find a description of the different HA policies for a node
+shutdown. Currently 'Conditional' is the default due to backward compatibility.
+Some users may find that 'Migrate' behaves more as expected.
 
-Do not touch the service state. We use this state while we reboot a
-node, or when we restart the LRM daemon.
+The shutdown policy can be configured in the Web UI (`Datacenter` -> `Options`
+-> `HA Settings`), or directly in `datacenter.cfg`:
 
-migrate::
+----
+ha: shutdown_policy=<value>
+----
 
-Migrate service (live) to other node.
+Migrate
+^^^^^^^
 
-error::
+Once the Local Resource Manager (LRM) gets a shutdown request and this policy
+is enabled, it will mark itself as unavailable for the current HA manager.
+This triggers a migration of all HA services currently located on this node.
+The LRM will try to delay the shutdown process until all running services get
+moved away. But this expects that the running services *can* be migrated to
+another node. In other words, the service must not be locally bound, for example
+by using hardware passthrough. As non-group member nodes are considered
+runnable targets if no group member is available, this policy can still be used
+when making use of HA groups with only some nodes selected. But marking a group
+as 'restricted' tells the HA manager that the service cannot run outside of the
+chosen set of nodes. If all of those nodes are unavailable, the shutdown will
+hang until you manually intervene. Once the shut-down node comes back online
+again, the previously displaced services will be moved back, if they were not
+already manually migrated in-between.
+
+NOTE: The watchdog is still active during the migration process on shutdown.
+If the node loses quorum it will be fenced and the services will be recovered.
+
+If you start a (previously stopped) service on a node which is currently being
+maintained, the node needs to be fenced to ensure that the service can be moved
+and started on another available node.
+
+Failover
+^^^^^^^^
+
+This mode ensures that all services get stopped, but that they will also be
+recovered if the current node is not online soon. 
It can be useful when doing +maintenance on a cluster scale, where live-migrating VMs may not be possible if +too many nodes are powered off at a time, but you still want to ensure HA +services get recovered and started again as soon as possible. + +Freeze +^^^^^^ + +This mode ensures that all services get stopped and frozen, so that they won't +get recovered until the current node is online again. + +Conditional +^^^^^^^^^^^ + +The 'Conditional' shutdown policy automatically detects if a shutdown or a +reboot is requested, and changes behaviour accordingly. + +.Shutdown + +A shutdown ('poweroff') is usually done if it is planned for the node to stay +down for some time. The LRM stops all managed services in this case. This means +that other nodes will take over those services afterwards. + +NOTE: Recent hardware has large amounts of memory (RAM). So we stop all +resources, then restart them to avoid online migration of all that RAM. If you +want to use online migration, you need to invoke that manually before you +shutdown the node. + + +.Reboot + +Node reboots are initiated with the 'reboot' command. This is usually done +after installing a new kernel. Please note that this is different from +``shutdown'', because the node immediately starts again. + +The LRM tells the CRM that it wants to restart, and waits until the CRM puts +all resources into the `freeze` state (same mechanism is used for +xref:ha_manager_package_updates[Package Updates]). This prevents those resources +from being moved to other nodes. Instead, the CRM starts the resources after the +reboot on the same node. + + +Manual Resource Movement +^^^^^^^^^^^^^^^^^^^^^^^^ + +Last but not least, you can also manually move resources to other nodes, before +you shutdown or restart a node. The advantage is that you have full control, +and you can decide if you want to use online migration or not. + +NOTE: Please do not 'kill' services like `pve-ha-crm`, `pve-ha-lrm` or +`watchdog-mux`. They manage and use the watchdog, so this can result in an +immediate node reboot or even reset. + + +[[ha_manager_crs]] +Cluster Resource Scheduling +--------------------------- + +The cluster resource scheduler (CRS) mode controls how HA selects nodes for the +recovery of a service as well as for migrations that are triggered by a +shutdown policy. The default mode is `basic`, you can change it in the Web UI +(`Datacenter` -> `Options`), or directly in `datacenter.cfg`: + +---- +crs: ha=static +---- + +[thumbnail="screenshot/gui-datacenter-options-crs.png"] + +The change will be in effect starting with the next manager round (after a few +seconds). + +For each service that needs to be recovered or migrated, the scheduler +iteratively chooses the best node among the nodes with the highest priority in +the service's group. + +NOTE: There are plans to add modes for (static and dynamic) load-balancing in +the future. + +Basic Scheduler +~~~~~~~~~~~~~~~ + +The number of active HA services on each node is used to choose a recovery node. +Non-HA-managed services are currently not counted. + +Static-Load Scheduler +~~~~~~~~~~~~~~~~~~~~~ + +IMPORTANT: The static mode is still a technology preview. + +Static usage information from HA services on each node is used to choose a +recovery node. Usage of non-HA-managed services is currently not considered. + +For this selection, each node in turn is considered as if the service was +already running on it, using CPU and memory usage from the associated guest +configuration. 
Then, for each such alternative, CPU and memory usage of all nodes
+are considered, with memory being weighted much more, because it's a truly
+limited resource. For both CPU and memory, the highest usage among nodes
+(weighted more, as ideally no node should be overcommitted) and the average
+usage of all nodes (to still be able to distinguish in case there already is a
+more highly committed node) are considered.
+
+IMPORTANT: The more services there are, the more possible combinations exist,
+so it's currently not recommended to use it if you have thousands of HA managed
+services.
+
+
+CRS Scheduling Points
+~~~~~~~~~~~~~~~~~~~~~
+
+The CRS algorithm is not applied for every service in every round, since this
+would mean a large number of constant migrations. Depending on the workload,
+this could put more strain on the cluster than the constant balancing would
+save.
+That's why the {pve} HA manager favors keeping services on their current node.
+
+The CRS is currently used at the following scheduling points:
+
+- Service recovery (always active). When a node with active HA services fails,
+  all its services need to be recovered to other nodes. The CRS algorithm will
+  be used here to balance that recovery over the remaining nodes.
 
-Service disabled because of LRM errors. Needs manual intervention.
 
+- HA group config changes (always active). If a node is removed from a group,
+  or its priority is reduced, the HA stack will use the CRS algorithm to find a
+  new target node for the HA services in that group, matching the adapted
+  priority constraints.
+
+- HA service stopped -> start transition (opt-in). Requesting that a stopped
+  service should be started is a good opportunity to check for the best suited
+  node as per the CRS algorithm, as moving stopped services is cheaper to do
+  than moving them started, especially if their disk volumes reside on shared
+  storage. You can enable this by setting the **`ha-rebalance-on-start`**
+  CRS option in the datacenter config. You can also change that option in the
+  Web UI, under `Datacenter` -> `Options` -> `Cluster Resource Scheduling`.
 
 ifdef::manvolnum[]
 include::pve-copyright.adoc[]