X-Git-Url: https://git.proxmox.com/?p=pve-docs.git;a=blobdiff_plain;f=ha-manager.adoc;h=6400e208f6338f6b136a91e2767e9c59a31d5a5f;hp=cef806d7af723599410a96f8c8a9a7b41e38b206;hb=1acab952e33ec87ba3d15fd1711fc45a1989f5b6;hpb=01911cf3ca3ed6f4560fe510f3cbbbf8b1219e0d

diff --git a/ha-manager.adoc b/ha-manager.adoc
index cef806d..6400e20 100644
--- a/ha-manager.adoc
+++ b/ha-manager.adoc
@@ -1,15 +1,15 @@
-[[chapter-ha-manager]]
+[[chapter_ha_manager]]
 ifdef::manvolnum[]
-PVE({manvolnum})
-================
-include::attributes.txt[]
+ha-manager(1)
+=============
+:pve-toplevel:
 
 NAME
 ----
 
 ha-manager - Proxmox VE HA Manager
 
-SYNOPSYS
+SYNOPSIS
 --------
 
 include::ha-manager.1-synopsis.adoc[]
@@ -17,14 +17,12 @@ include::ha-manager.1-synopsis.adoc[]
 
 DESCRIPTION
 -----------
 endif::manvolnum[]
-
 ifndef::manvolnum[]
 High Availability
 =================
-include::attributes.txt[]
+:pve-toplevel:
 endif::manvolnum[]
-
 
 Our modern society depends heavily on information provided by
 computers over the network. Mobile devices amplified that dependency,
 because people can access the network any time from anywhere. If you
@@ -58,7 +56,7 @@ yourself. The following solutions works without modifying the
 software:
 
 * Use reliable ``server'' components
-
++
 NOTE: Computer components with same functionality can have varying
 reliability numbers, depending on the component quality. Most vendors
 sell components with higher reliability as ``server'' components -
@@ -107,21 +105,27 @@ hard and costly. `ha-manager` has typical error detection and failover
 times of about 2 minutes, so you can get no more than 99.999%
 availability.
 
+
 Requirements
 ------------
 
+You must meet the following requirements before you start with HA:
+
 * at least three cluster nodes (to get reliable quorum)
 
 * shared storage for VMs and containers
 
 * hardware redundancy (everywhere)
 
+* use reliable ``server'' components
+
 * hardware watchdog - if not available we fall back to the
   linux kernel software watchdog (`softdog`)
 
 * optional hardware fencing devices
 
 
+[[ha_manager_resources]]
 Resources
 ---------
 
@@ -148,16 +152,17 @@ To provide High Availability two daemons run on each node:
 
 `pve-ha-lrm`::
 
-The local resource manager (LRM), it controls the services running on
-the local node.
-It reads the requested states for its services from the current manager
-status file and executes the respective commands.
+The local resource manager (LRM), which controls the services running on
+the local node. It reads the requested states for its services from
+the current manager status file and executes the respective commands.
 
 `pve-ha-crm`::
 
-The cluster resource manager (CRM), it controls the cluster wide
-actions of the services, processes the LRM results and includes the state
-machine which controls the state of each service.
+The cluster resource manager (CRM), which makes the cluster-wide
+decisions. It sends commands to the LRM, processes the results,
+and moves resources to other nodes if something fails. The CRM also
+handles node fencing.
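+
+To verify that both daemons are running and to see the manager's
+current view of the cluster, you can query them directly. This is a
+minimal sketch, assuming a standard {pve} installation where both
+daemons run as systemd units:
+
+----
+# systemctl status pve-ha-lrm pve-ha-crm
+# ha-manager status
+----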
+
 
 .Locks in the LRM & CRM
 [NOTE]
@@ -269,17 +274,137 @@ quorum, the LRM waits for a new quorum to form.
 As long as there is no quorum the node cannot reset the watchdog. This will trigger a reboot
 after the watchdog then times out, this happens after 60 seconds.
 
+
 Configuration
 -------------
 
-The HA stack is well integrated in the Proxmox VE API2. So, for
-example, HA can be configured via `ha-manager` or the PVE web
-interface, which both provide an easy to use tool.
+The HA stack is well integrated into the {pve} API. So, for example,
+HA can be configured via the `ha-manager` command line interface, or
+the {pve} web interface - both interfaces provide an easy way to
+manage HA. Automation tools can use the API directly.
+
+All HA configuration files are within `/etc/pve/ha/`, so they get
+automatically distributed to the cluster nodes, and all nodes share
+the same HA configuration.
+
+
+Resources
+~~~~~~~~~
+
+The resource configuration file `/etc/pve/ha/resources.cfg` stores
+the list of resources managed by `ha-manager`. A resource configuration
+inside that list looks like this:
+
+----
+<type>: <name>
+        <property> <value>
+        ...
+----
+
+It starts with a resource type followed by a resource-specific name,
+separated by a colon. Together this forms the HA resource ID, which is
+used by all `ha-manager` commands to uniquely identify a resource
+(example: `vm:100` or `ct:101`). The next lines contain additional
+properties:
+
+include::ha-resources-opts.adoc[]
+
+Here is a real-world example with one VM and one container. As you see,
+the syntax of those files is really simple, so it is even possible to
+read or edit those files using your favorite editor:
+
+.Configuration Example (`/etc/pve/ha/resources.cfg`)
+----
+vm: 501
+    state started
+    max_relocate 2
+
+ct: 102
+    # Note: use default settings for everything
+----
+
+The above config was generated using the `ha-manager` command line tool:
+
+----
+# ha-manager add vm:501 --state started --max_relocate 2
+# ha-manager add ct:102
+----
+
+
+[[ha_manager_groups]]
+Groups
+~~~~~~
+
+The HA group configuration file `/etc/pve/ha/groups.cfg` is used to
+define groups of cluster nodes. A resource can be restricted to run
+only on the members of such a group. A group configuration looks like
+this:
+
+----
+group: <group>
+       nodes <node_list>
+       <property> <value>
+       ...
+----
+
+include::ha-groups-opts.adoc[]
+
+A common requirement is that a resource should run on a specific
+node. Usually the resource is able to run on other nodes, so you can define
+an unrestricted group with a single member:
+
+----
+# ha-manager groupadd prefer_node1 --nodes node1
+----
+
+For bigger clusters, it makes sense to define a more detailed failover
+behavior. For example, you may want to run a set of services on
+`node1` if possible. If `node1` is not available, you want to run them
+equally split on `node2` and `node3`. If those nodes also fail, the
+services should run on `node4`. To achieve this you could set the node
+list to:
+
+----
+# ha-manager groupadd mygroup1 --nodes "node1:2,node2:1,node3:1,node4"
+----
+
+Another use case is if a resource uses other resources only available
+on specific nodes, let's say `node1` and `node2`. We need to make sure
+that the HA manager does not use other nodes, so we need to create a
+restricted group with said nodes:
+
+----
+# ha-manager groupadd mygroup2 --nodes "node1,node2" --restricted
+----
+
+The above commands create the following group configuration file:
+
+.Configuration Example (`/etc/pve/ha/groups.cfg`)
+----
+group: prefer_node1
+       nodes node1
+
+group: mygroup1
+       nodes node2:1,node4,node1:2,node3:1
+
+group: mygroup2
+       nodes node2,node1
+       restricted 1
+----
+
+
+The `nofailback` option is mostly useful to avoid unwanted resource
+movements during administration tasks. For example, if you need to
+migrate a service to a node which does not have the highest priority
+in the group, you can tell the HA manager not to move this service
+back instantly by setting the `nofailback` option.
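+
+For example, to keep services from moving back to `node1` while you do
+maintenance there, you could enable the flag on the `prefer_node1`
+group created above. This is a sketch only, assuming that
+`ha-manager groupset` accepts the same options as `groupadd` and that
+boolean options take a `1`/`0` value:
+
+----
+# ha-manager groupset prefer_node1 --nofailback 1
+----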
+
+Another scenario is when a service was fenced and recovered to another
+node. The admin then tries to repair the fenced node and brings it
+back online to investigate the failure cause and check whether it runs
+stably again. Setting the `nofailback` flag prevents the recovered
+services from moving straight back to the fenced node.
 
-The resource configuration file can be located at
-`/etc/pve/ha/resources.cfg` and the group configuration file at
-`/etc/pve/ha/groups.cfg`. Use the provided tools to make changes,
-there shouldn't be any need to edit them manually.
 
 Node Power Status
 -----------------
@@ -311,6 +436,7 @@ the update process can be too long which, in the worst case, may
 result in a watchdog reset.
 
 
+[[ha_manager_fencing]]
 Fencing
 -------
 
@@ -380,56 +506,6 @@ That minimizes the possibility of an overload, which else could
 cause an unresponsive node and as a result a chain reaction of node
 failures in the cluster.
 
-Groups
-------
-
-A group is a collection of cluster nodes which a service may be bound to.
-
-Group Settings
-~~~~~~~~~~~~~~
-
-nodes::
-
-List of group node members where a priority can be given to each node.
-A service bound to this group will run on the nodes with the highest priority
-available. If more nodes are in the highest priority class the services will
-get distributed to those node if not already there. The priorities have a
-relative meaning only.
- Example;;
- You want to run all services from a group on `node1` if possible. If this node
- is not available, you want them to run equally splitted on `node2` and `node3`, and
- if those fail it should use `node4`.
- To achieve this you could set the node list to:
-[source,bash]
- ha-manager groupset mygroup -nodes "node1:2,node2:1,node3:1,node4"
-
-restricted::
-
-Resources bound to this group may only run on nodes defined by the
-group. If no group node member is available the resource will be
-placed in the stopped state.
- Example;;
- Lets say a service uses resources only available on `node1` and `node2`,
- so we need to make sure that HA manager does not use other nodes.
- We need to create a 'restricted' group with said nodes:
-[source,bash]
- ha-manager groupset mygroup -nodes "node1,node2" -restricted
-
-nofailback::
-
-The resource won't automatically fail back when a more preferred node
-(re)joins the cluster.
- Examples;;
- * You need to migrate a service to a node which hasn't the highest priority
-   in the group at the moment, to tell the HA manager to not move this service
-   instantly back set the nofailnback option and the service will stay on
-
- * A service was fenced and he got recovered to another node. The admin
-   repaired the node and brang it up online again but does not want that the
-   recovered services move straight back to the repaired node as he wants to
-   first investigate the failure cause and check if it runs stable. He can use
-   the nofailback option to achieve this.
-
 Start Failure Policy
 ---------------------
 
@@ -480,6 +556,7 @@ killing its process)
 
 * *after* you fixed all errors you may enable the service again
 
+[[ha_manager_service_operations]]
 Service Operations
 ------------------
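+
+For day-to-day operation, the most common tasks map directly to
+`ha-manager` subcommands. The following is an illustrative sketch only:
+`vm:501` is the example resource from above, `node2` is a placeholder
+node name, and the exact options may differ between versions (see
+`ha-manager help`):
+
+----
+# ha-manager set vm:501 --state stopped
+# ha-manager migrate vm:501 node2
+# ha-manager remove vm:501
+----
+
+The first command requests a clean stop while keeping the resource
+under HA management, the second asks the CRM to move it to another
+node, and the last one removes it from HA management entirely.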