* Eliminate single point of failure (redundant components)
** use an uninterruptible power supply (UPS)
-** use redundant power supplies on the main boards
+** use redundant power supplies in your servers
** use ECC-RAM
** use redundant network hardware
** use RAID for local storage
This section provides a short overview of common management tasks. The
first step is to enable HA for a resource. This is done by adding the
resource to the HA resource configuration. You can do this using the
-GUI, or simply use the command line tool, for example:
+GUI, or simply use the command-line tool, for example:
----
# ha-manager add vm:100
lock.
+[[ha_manager_service_states]]
Service States
~~~~~~~~~~~~~~
The CRM uses a service state enumeration to record the current service
state. This state is displayed on the GUI and can be queried using
-the `ha-manager` command line tool:
+the `ha-manager` command-line tool:
----
# ha-manager status
Service is stopped and marked as `disabled`
+[[ha_manager_lrm]]
Local Resource Manager
~~~~~~~~~~~~~~~~~~~~~~
`journalctl -u pve-ha-lrm` on the node(s) where the service is and
the same command for `pve-ha-crm` on the node which is the current master.
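
For example, to follow the LRM log on the local node, and the CRM log on the
node that currently holds the master role (the `-f` flag simply follows new
log entries as they arrive):

----
# journalctl -u pve-ha-lrm -f
----

----
# journalctl -u pve-ha-crm -f
----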
+
+[[ha_manager_crm]]
Cluster Resource Manager
~~~~~~~~~~~~~~~~~~~~~~~~
-------------
The HA stack is well integrated into the {pve} API. So, for example,
-HA can be configured via the `ha-manager` command line interface, or
+HA can be configured via the `ha-manager` command-line interface, or
the {pve} web interface - both interfaces provide an easy way to
manage HA. Automation tools can use the API directly.
[thumbnail="screenshot/gui-ha-manager-add-resource.png"]
-The above config was generated using the `ha-manager` command line tool:
+The above config was generated using the `ha-manager` command-line tool:
----
# ha-manager add vm:501 --state started --max_relocate 2
case, may result in a reset triggered by the watchdog.
+[[ha_manager_node_maintenance]]
Node Maintenance
----------------
Maintenance Mode
~~~~~~~~~~~~~~~~
-Enabling the manual maintenance mode will mark the node as unavailable for
-operation, this in turn will migrate away all services to other nodes, which
-are selected through the configured cluster resource scheduler (CRS) mode.
-During migration the original node will be recorded, so that the service can be
-moved back to to that node as soon as the maintenance mode is disabled, and it
-becomes online again.
+You can use the manual maintenance mode to mark the node as unavailable for HA
+operation, prompting all services managed by HA to migrate to other nodes.
+
+The target nodes for these migrations are selected from the other currently
+available nodes, and determined by the HA group configuration and the configured
+cluster resource scheduler (CRS) mode.
+During each migration, the original node will be recorded in the HA manager's
+state, so that the service can be moved back again automatically once the
+maintenance mode is disabled and the node is back online.
Currently, you can enable or disable maintenance mode using the ha-manager
CLI tool.
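
For example, to put a node into maintenance mode and take it out again later
(replace `example-node` with the actual node name):

----
# ha-manager crm-command node-maintenance enable example-node
# ha-manager crm-command node-maintenance disable example-node
----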
shutdown. Currently 'Conditional' is the default due to backward compatibility.
Some users may find that 'Migrate' behaves more as expected.
+The shutdown policy can be configured in the Web UI (`Datacenter` -> `Options`
+-> `HA Settings`), or directly in `datacenter.cfg`:
+
+----
+ha: shutdown_policy=<value>
+----
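+
+For example, to always migrate services away before a node shuts down, the
+entry would read (using `migrate`, one of the available policy values):
+
+----
+ha: shutdown_policy=migrate
+----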
+
Migrate
^^^^^^^
Cluster Resource Scheduling
---------------------------
-The scheduler mode controls how HA selects nodes for the recovery of a service
-as well as for migrations that are triggered by a shutdown policy. The default
-mode is `basic`, you can change it in `datacenter.cfg`:
+The cluster resource scheduler (CRS) mode controls how HA selects nodes for the
+recovery of a service as well as for migrations that are triggered by a
+shutdown policy. The default mode is `basic`, you can change it in the Web UI
+(`Datacenter` -> `Options`), or directly in `datacenter.cfg`:
----
crs: ha=static
----
+[thumbnail="screenshot/gui-datacenter-options-crs.png"]
+
The change will be in effect starting with the next manager round (after a few
seconds).
NOTE: There are plans to add modes for (static and dynamic) load-balancing in
the future.
-Basic
-~~~~~
+Basic Scheduler
+~~~~~~~~~~~~~~~
The number of active HA services on each node is used to choose a recovery node.
Non-HA-managed services are currently not counted.
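
This is the default mode, so it needs no configuration; setting it explicitly
in `datacenter.cfg` would look like:

----
crs: ha=basic
----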
-Static
-~~~~~~
+Static-Load Scheduler
+~~~~~~~~~~~~~~~~~~~~~
IMPORTANT: The static mode is still a technology preview.
services.
+CRS Scheduling Points
+~~~~~~~~~~~~~~~~~~~~~
+
+The CRS algorithm is not applied for every service in every round, since this
+would mean a large number of constant migrations. Depending on the workload,
+constant balancing could put more strain on the cluster than it would avoid.
+That's why the {pve} HA manager favors keeping services on their current node.
+
+The CRS is currently used at the following scheduling points:
+
+- Service recovery (always active). When a node with active HA services fails,
+ all its services need to be recovered to other nodes. The CRS algorithm will
+ be used here to balance that recovery over the remaining nodes.
+
+- HA group config changes (always active). If a node is removed from a group,
+ or its priority is reduced, the HA stack will use the CRS algorithm to find a
+ new target node for the HA services in that group, matching the adapted
+ priority constraints.
+
+- HA service stopped -> start transition (opt-in). Requesting that a stopped
+  service should be started is a good opportunity to check for the best-suited
+  node as per the CRS algorithm, as moving stopped services is cheaper to do
+  than moving them started, especially if their disk volumes reside on shared
+  storage. You can enable this by setting the **`ha-rebalance-on-start`**
+  CRS option in the datacenter config. You can also change this option in the
+  Web UI, under `Datacenter` -> `Options` -> `Cluster Resource Scheduling`.
+
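+For example, combining the static mode with rebalancing on start in
+`datacenter.cfg` could look like this (assuming both options are set on the
+same `crs` property line):
+
+----
+crs: ha=static,ha-rebalance-on-start=1
+----
+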
ifdef::manvolnum[]
include::pve-copyright.adoc[]
endif::manvolnum[]