ha-manager.adoc: reorder sections

diff --git a/ha-manager.adoc b/ha-manager.adoc
index e25b34564c643c917dc29f8c129b15021c35c2d7..944ad480e863fdb73297b300aa91732d990338ea 100644
--- a/ha-manager.adoc
+++ b/ha-manager.adoc
@@ -177,6 +177,8 @@ lock.
  Service States
  ~~~~~~~~~~~~~~
  
+[thumbnail="gui-ha-manager-status.png"]
+
  The CRM use a service state enumeration to record the current service
  state. We display this state on the GUI and you can query it using
  the `ha-manager` command line tool:
@@ -350,6 +352,8 @@ the same HA configuration.
  Resources
  ~~~~~~~~~
  
+[thumbnail="gui-ha-manager-resources-view.png"]
+
  The resource configuration file `/etc/pve/ha/resources.cfg` stores
  the list of resources managed by `ha-manager`. A resource configuration
  inside that list look like this:
@@ -382,6 +386,8 @@ ct: 102
      # Note: use default settings for everything
  ----
  
+[thumbnail="gui-ha-manager-add-resource.png"]
+
  Above config was generated using the `ha-manager` command line tool:
  
  ----
@@ -394,6 +400,8 @@ Above config was generated using the `ha-manager` command line tool:
  Groups
  ~~~~~~
  
+[thumbnail="gui-ha-manager-groups-view.png"]
+
  The HA group configuration file `/etc/pve/ha/groups.cfg` is used to
  define groups of cluster nodes. A resource can be restricted to run
  only on the members of such group. A group configuration look like
@@ -408,6 +416,8 @@ group: <group>
  
  include::ha-groups-opts.adoc[]
  
+[thumbnail="gui-ha-manager-add-group.png"]
+
  A commom requirement is that a resource should run on a specific
  node. Usually the resource is able to run on other nodes, so you can define
  an unrestricted group with a single member:
@@ -465,38 +475,6 @@ stable again. Setting the `nofailback` flag prevents that the
  recovered services move straight back to the fenced node.
  
  
-Node Power Status
------------------
-
-If a node needs maintenance you should migrate and or relocate all
-services which are required to run always on another node first.
-After that you can stop the LRM and CRM services. But note that the
-watchdog triggers if you stop it with active services.
-
-
-[[ha_manager_package_updates]]
-Package Updates
----------------
-
-When updating the ha-manager you should do one node after the other, never
-all at once for various reasons. First, while we test our software
-thoughtfully, a bug affecting your specific setup cannot totally be ruled out.
-Upgrading one node after the other and checking the functionality of each node
-after finishing the update helps to recover from an eventual problems, while
-updating all could render you in a broken cluster state and is generally not
-good practice.
-
-Also, the {pve} HA stack uses a request acknowledge protocol to perform
-actions between the cluster and the local resource manager. For restarting,
-the LRM makes a request to the CRM to freeze all its services. This prevents
-that they get touched by the Cluster during the short time the LRM is restarting.
-After that the LRM may safely close the watchdog during a restart.
-Such a restart happens on a update and as already stated a active master
-CRM is needed to acknowledge the requests from the LRM, if this is not the case
-the update process can be too long which, in the worst case, may result in
-a watchdog reset.
-
-
  [[ha_manager_fencing]]
  Fencing
  -------
@@ -575,20 +553,23 @@ the specified module at startup.
  Recover Fenced Services
  ~~~~~~~~~~~~~~~~~~~~~~~
  
-After a node failed and its fencing was successful we start to recover services
-to other available nodes and restart them there so that they can provide service
-again.
+After a node failed and its fencing was successful, the CRM tries to
+move services from the failed node to nodes which are still online.
+
+The selection of nodes on which those services get recovered is
+influenced by the resource `group` settings, the list of currently active
+nodes, and their respective active service count.
+
+The CRM first builds a set out of the intersection between user selected
+nodes (from `group` setting) and available nodes. It then chooses the
+subset of nodes with the highest priority, and finally selects the node
+with the lowest active service count. This minimizes the possibility
+of an overloaded node.
  
-The selection of the node on which the services gets recovered is influenced
-by the users group settings, the currently active nodes and their respective
-active service count.
-First we build a set out of the intersection between user selected nodes and
-available nodes. Then the subset with the highest priority of those nodes
-gets chosen as possible nodes for recovery. We select the node with the
-currently lowest active service count as a new node for the service.
-That minimizes the possibility of an overload, which else could cause an
-unresponsive node and as a result a chain reaction of node failures in the
-cluster.
+CAUTION: On node failure, the CRM distributes services to the
+remaining nodes. This increases the service count on those nodes, and
+can lead to high load, especially on small clusters. Please design
+your cluster so that it can handle such worst-case scenarios.
  
  
  [[ha_manager_start_failure_policy]]
@@ -643,6 +624,38 @@ killing its process)
  * *after* you fixed all errors you may enable the service again
  
  
+Node Power Status
+-----------------
+
+If a node needs maintenance, you should first migrate or relocate all
+services which are required to run always to another node.
+After that you can stop the LRM and CRM services. But note that the
+watchdog triggers if you stop it with active services.
+
+
+[[ha_manager_package_updates]]
+Package Updates
+---------------
+
+When updating the ha-manager you should do one node after the other, never
+all at once, for various reasons. First, while we test our software
+thoroughly, a bug affecting your specific setup cannot totally be ruled out.
+Upgrading one node after the other and checking the functionality of each node
+after finishing the update helps to recover from eventual problems, while
+updating all at once could leave you with a broken cluster and is generally not
+good practice.
+
+Also, the {pve} HA stack uses a request acknowledge protocol to perform
+actions between the cluster and the local resource manager. For restarting,
+the LRM makes a request to the CRM to freeze all its services. This prevents
+them from being touched by the cluster during the short time the LRM is restarting.
+After that the LRM may safely close the watchdog during a restart.
+Such a restart happens on an update and, as already stated, an active master
+CRM is needed to acknowledge the requests from the LRM. If this is not the case,
+the update process can take too long which, in the worst case, may result in
+a watchdog reset.
+
+
  [[ha_manager_service_operations]]
  Service Operations
  ------------------