X-Git-Url: https://git.proxmox.com/?p=pve-docs.git;a=blobdiff_plain;f=ha-manager.adoc;h=fadc6b5bb3bfceea1067a235624d78fe98e75a3b;hp=2162d25ea0e64db8561375e9a9fe85d43a0b63d9;hb=HEAD;hpb=049fc55728e69ae80361ebf88c8dabbf068a4417 diff --git a/ha-manager.adoc b/ha-manager.adoc index 2162d25..66a3b8f 100644 --- a/ha-manager.adoc +++ b/ha-manager.adoc @@ -63,7 +63,7 @@ usually at higher price. * Eliminate single point of failure (redundant components) ** use an uninterruptible power supply (UPS) -** use redundant power supplies on the main boards +** use redundant power supplies in your servers ** use ECC-RAM ** use redundant network hardware ** use RAID for local storage @@ -147,7 +147,7 @@ Management Tasks This section provides a short overview of common management tasks. The first step is to enable HA for a resource. This is done by adding the resource to the HA resource configuration. You can do this using the -GUI, or simply use the command line tool, for example: +GUI, or simply use the command-line tool, for example: ---- # ha-manager add vm:100 @@ -243,7 +243,7 @@ the current manager status file and executes the respective commands. `pve-ha-crm`:: -The cluster resource manager (CRM), which makes the cluster wide +The cluster resource manager (CRM), which makes the cluster-wide decisions. It sends commands to the LRM, processes the results, and moves resources to other nodes if something fails. The CRM also handles node fencing. @@ -260,12 +260,13 @@ This all gets supervised by the CRM which currently holds the manager master lock. +[[ha_manager_service_states]] Service States ~~~~~~~~~~~~~~ The CRM uses a service state enumeration to record the current service state. This state is displayed on the GUI and can be queried using -the `ha-manager` command line tool: +the `ha-manager` command-line tool: ---- # ha-manager status @@ -307,10 +308,20 @@ LRM that the service is running. fence:: -Wait for node fencing (service node is not inside quorate cluster -partition). 
As soon as node gets fenced successfully the service will
-be recovered to another node, if possible
-(see xref:ha_manager_fencing[Fencing]).
+Wait for node fencing as the service node is not inside the quorate cluster
+partition (see xref:ha_manager_fencing[Fencing]).
+As soon as the node gets fenced successfully, the service will be placed into
+the recovery state.
+
+recovery::
+
+Wait for recovery of the service. The HA manager tries to find a new node on
+which the service can run. This search depends not only on the list of online
+and quorate nodes, but also on whether the service is a group member and how
+such a group is restricted.
+As soon as an available node is found, the service will be moved there and
+initially placed into the stopped state. If it's configured to run, the new
+node will start it.

freeze::

@@ -321,9 +332,8 @@ node, or when we restart the LRM daemon
ignored::

Act as if the service were not managed by HA at all.
-Useful, when full control over the service is desired temporarily,
-without removing it from the HA configuration.
-
+Useful when full control over the service is desired temporarily, without
+removing it from the HA configuration.

migrate::

@@ -343,11 +353,12 @@ disabled::

Service is stopped and marked as `disabled`

+[[ha_manager_lrm]]
Local Resource Manager
~~~~~~~~~~~~~~~~~~~~~~

The local resource manager (`pve-ha-lrm`) is started as a daemon on
-boot and waits until the HA cluster is quorate and thus cluster wide
+boot and waits until the HA cluster is quorate and thus cluster-wide
locks are working.

It can be in three states:

@@ -407,6 +418,8 @@ what both daemons, the LRM and the CRM, did. You may use
`journalctl -u pve-ha-lrm` on the node(s) where the service is and
the same command for the pve-ha-crm on the node which is the current
master.
+
+[[ha_manager_crm]]
Cluster Resource Manager
~~~~~~~~~~~~~~~~~~~~~~~~

@@ -509,7 +522,7 @@ Configuration
-------------

The HA stack is well integrated into the {pve} API.
So, for example,
-HA can be configured via the `ha-manager` command line interface, or
+HA can be configured via the `ha-manager` command-line interface, or
the {pve} web interface - both interfaces provide an easy way to
manage HA. Automation tools can use the API directly.

@@ -559,7 +572,7 @@ ct: 102

[thumbnail="screenshot/gui-ha-manager-add-resource.png"]

-The above config was generated using the `ha-manager` command line tool:
+The above config was generated using the `ha-manager` command-line tool:

----
# ha-manager add vm:501 --state started --max_relocate 2
----

@@ -825,13 +838,76 @@ this is not the case the update process can take too long which, in the
worst case, may result in a reset triggered by the watchdog.

+[[ha_manager_node_maintenance]]
Node Maintenance
----------------

-It is sometimes necessary to shutdown or reboot a node to do maintenance tasks,
-such as to replace hardware, or simply to install a new kernel image. This is
-also true when using the HA stack. The behaviour of the HA stack during a
-shutdown can be configured.
+Sometimes it is necessary to perform maintenance on a node, such as replacing
+hardware or simply installing a new kernel image. This also applies while the
+HA stack is in use.
+
+The HA stack can support you mainly in two types of maintenance:
+
+* for general shutdowns or reboots, the behavior can be configured; see
+  xref:ha_manager_shutdown_policy[Shutdown Policy].
+* for maintenance that does not require a shutdown or reboot, or that should
+  not be switched off automatically after only one reboot, you can enable the
+  manual maintenance mode.
+
+
+Maintenance Mode
+~~~~~~~~~~~~~~~~
+
+You can use the manual maintenance mode to mark the node as unavailable for HA
+operation, prompting all services managed by HA to migrate to other nodes.
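To see which HA services such a migration would affect, you can inspect the `ha-manager status` output for the node in question. A minimal sketch, assuming a typical status layout; the sample output below is embedded so the filter can be demonstrated without a live cluster, and the node and service names are hypothetical:

```shell
# Embedded sample of `ha-manager status` output (assumed format; on a real
# cluster you would pipe the command's output directly into the filter).
status='quorum OK
master node1 (active, Mon Nov 21 12:00:00 2016)
lrm node1 (active, Mon Nov 21 12:00:00 2016)
lrm node2 (idle, Mon Nov 21 12:00:00 2016)
service vm:100 (node1, started)
service ct:102 (node2, started)'

# Keep only the service lines that are currently placed on node1
printf '%s\n' "$status" | awk '$1 == "service" && $3 == "(node1,"'
```

On a live cluster, piping `ha-manager status` through the same `awk` filter would serve the same purpose.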
+
+The target nodes for these migrations are selected from the other currently
+available nodes, and determined by the HA group configuration and the configured
+cluster resource scheduler (CRS) mode.
+During each migration, the original node will be recorded in the HA manager's
+state, so that the service can be moved back again automatically once the
+maintenance mode is disabled and the node is back online.
+
+Currently you can enable or disable the maintenance mode using the `ha-manager`
+CLI tool.
+
+.Enabling maintenance mode for a node
+----
+# ha-manager crm-command node-maintenance enable NODENAME
+----
+
+This will queue a CRM command; when the manager processes this command, it will
+record the request for maintenance mode in the manager status. This allows you
+to submit the command on any node, not just on the one that you want to place
+into or out of maintenance mode.
+
+Once the LRM on the respective node picks up the command, it will mark itself
+as unavailable, but still process all migration commands. This means that the
+LRM self-fencing watchdog will stay active until all active services have been
+moved away and all running workers have finished.
+
+Note that the LRM status will read `maintenance` as soon as the LRM has picked
+up the requested state, not only once all services have been moved away; this
+user experience is planned to be improved in the future.
+For now, you can check for any active HA services left on the node, or watch
+for a log line like `pve-ha-lrm[PID]: watchdog closed (disabled)` to know
+when the node has finished its transition into maintenance mode.
+
+NOTE: The manual maintenance mode is not automatically deleted on node reboot,
+but only if it is either manually deactivated using the `ha-manager` CLI or if
+the manager-status is manually cleared.
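For example, the check for that log line could be scripted as follows. This is a sketch assuming a syslog-style line format; the timestamp and PID are made up, and on a live node you would read the real journal via `journalctl -u pve-ha-lrm` instead of a captured sample:

```shell
# Hypothetical captured journal line (on a real node, you might obtain it
# with: journalctl -u pve-ha-lrm --since "10 minutes ago")
sample='Nov 21 12:00:01 node1 pve-ha-lrm[2317]: watchdog closed (disabled)'

# The fixed part of the message is what signals the completed transition
if printf '%s\n' "$sample" | grep -q 'watchdog closed (disabled)'; then
    echo "LRM watchdog closed, node transitioned into maintenance mode"
fi
```
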
+
+.Disabling maintenance mode for a node
+----
+# ha-manager crm-command node-maintenance disable NODENAME
+----
+
+The process of disabling the manual maintenance mode is similar to enabling it.
+Using the `ha-manager` CLI command shown above will queue a CRM command that,
+once processed, marks the respective LRM node as available again.
+
+If you deactivate the maintenance mode, all services that were on the node when
+the maintenance mode was activated will be moved back.

[[ha_manager_shutdown_policy]]
Shutdown Policy
@@ -841,6 +917,13 @@ Below you will find a description of the different HA policies for a node
shutdown. Currently 'Conditional' is the default due to backward compatibility.
Some users may find that 'Migrate' behaves more as expected.

+The shutdown policy can be configured in the Web UI (`Datacenter` -> `Options`
+-> `HA Settings`), or directly in `datacenter.cfg`:
+
+----
+ha: shutdown_policy=<value>
+----
+
Migrate
^^^^^^^

@@ -924,6 +1007,87 @@ NOTE: Please do not 'kill' services like `pve-ha-crm`, `pve-ha-lrm` or
immediate node reboot or even reset.

+[[ha_manager_crs]]
+Cluster Resource Scheduling
+---------------------------
+
+The cluster resource scheduler (CRS) mode controls how HA selects nodes for the
+recovery of a service as well as for migrations that are triggered by a
+shutdown policy. The default mode is `basic`; you can change it in the Web UI
+(`Datacenter` -> `Options`), or directly in `datacenter.cfg`:
+
+----
+crs: ha=static
+----
+
+[thumbnail="screenshot/gui-datacenter-options-crs.png"]
+
+The change will take effect starting with the next manager round (after a few
+seconds).
+
+For each service that needs to be recovered or migrated, the scheduler
+iteratively chooses the best node among the nodes with the highest priority in
+the service's group.
+
+NOTE: There are plans to add modes for (static and dynamic) load-balancing in
+the future.
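To illustrate, switching between the scheduler modes only changes that single `crs` line. Below is a sketch operating on a sample configuration string; the other option line is hypothetical, and on a real cluster you would edit `/etc/pve/datacenter.cfg` (or use the Web UI) rather than a shell variable:

```shell
# Hypothetical datacenter.cfg content; only the crs line is relevant here.
cfg='keyboard: en-us
crs: ha=basic'

# Flip the HA scheduler from the basic mode to the static mode
printf '%s\n' "$cfg" | sed 's/^crs: ha=basic$/crs: ha=static/'
```
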
+
+Basic Scheduler
+~~~~~~~~~~~~~~~
+
+The number of active HA services on each node is used to choose a recovery node.
+Non-HA-managed services are currently not counted.
+
+Static-Load Scheduler
+~~~~~~~~~~~~~~~~~~~~~
+
+IMPORTANT: The static mode is still a technology preview.
+
+Static usage information from HA services on each node is used to choose a
+recovery node. Usage of non-HA-managed services is currently not considered.
+
+For this selection, each node in turn is considered as if the service was
+already running on it, using CPU and memory usage from the associated guest
+configuration. Then for each such alternative, CPU and memory usage of all nodes
+are considered, with memory being weighted much more, because it's a truly
+limited resource. For both CPU and memory, the highest usage among nodes
+(weighted more, as ideally no node should be overcommitted) and the average
+usage of all nodes (to still be able to distinguish in case there already is a
+more highly committed node) are considered.
+
+IMPORTANT: The more services there are, the more possible combinations exist,
+so it's currently not recommended to use this mode if you have thousands of
+HA-managed services.
+
+
+CRS Scheduling Points
+~~~~~~~~~~~~~~~~~~~~~
+
+The CRS algorithm is not applied for every service in every round, since this
+would mean a large number of constant migrations. Depending on the workload,
+this could put more strain on the cluster than what it would avoid through
+constant balancing.
+That's why the {pve} HA manager favors keeping services on their current node.
+
+The CRS is currently used at the following scheduling points:
+
+- Service recovery (always active). When a node with active HA services fails,
+  all its services need to be recovered to other nodes. The CRS algorithm will
+  be used here to balance that recovery over the remaining nodes.
+
+- HA group config changes (always active).
If a node is removed from a group,
+  or its priority is reduced, the HA stack will use the CRS algorithm to find a
+  new target node for the HA services in that group, matching the adapted
+  priority constraints.
+
+- HA service stopped -> start transition (opt-in). Requesting that a stopped
+  service should be started is a good opportunity to check for the best suited
+  node as per the CRS algorithm, as moving stopped services is cheaper than
+  moving started ones, especially if their disk volumes reside on shared
+  storage. You can enable this by setting the **`ha-rebalance-on-start`**
+  CRS option in the datacenter config. You can also change that option in the
+  Web UI, under `Datacenter` -> `Options` -> `Cluster Resource Scheduling`.
+
 ifdef::manvolnum[]
 include::pve-copyright.adoc[]
 endif::manvolnum[]