identifiable by the same UID. This is needed so that the LRM does not
-executes an outdated command.
+execute an outdated command.
-With the exception of the 'stop' and the 'error' command,
-those two do not depend on the result produce and are executed
-always in the case of the stopped state and once in the case of
-the error state.
+The only exceptions are the 'stop' and the 'error' command; those two
+do not depend on the produced result and are always executed in the
+stopped state and once in the error state.
~~~~~~~~~~~~~~~~~~~~~~~~~~~
By default all watchdog modules are blocked for security reasons as they are
like a loaded gun if not correctly initialized.
-If you have a hardware watchdog available remove its module from the blacklist
-and restart 'the watchdog-mux' service.
-
+If you have a hardware watchdog available, remove its kernel module from the
+blacklist, load it with 'modprobe', and restart the 'watchdog-mux' service or
+reboot the node.
+
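+For example, assuming an Intel TCO watchdog whose 'iTCO_wdt' module got
+blacklisted in '/etc/modprobe.d/blacklist.conf' (module name and file path
+are assumptions here, adjust them to your setup), the steps could look like:
+
+----
+# drop the blacklist entry for the watchdog module
+sed -i '/iTCO_wdt/d' /etc/modprobe.d/blacklist.conf
+# load the module and restart the multiplexer so it opens the device
+modprobe iTCO_wdt
+systemctl restart watchdog-mux.service
+----
+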
+Recover Fenced Services
+~~~~~~~~~~~~~~~~~~~~~~~
+
+After a node has failed and its fencing was successful, we start to recover
+services to other available nodes and restart them there so that they can
+provide service again.
+
+The selection of the node a service gets recovered to is influenced by the
+user's group settings, the currently active nodes and their respective
+active service count.
+First we build a set as the intersection of the user-selected nodes and the
+available nodes. From that set, the subset of nodes with the highest
+priority gets chosen as possible nodes for recovery. Out of those, we
+select the node with the currently lowest active service count as the new
+node for the service.
+This minimizes the possibility of an overload, which could otherwise cause
+an unresponsive node and, as a result, a chain reaction of node failures in
+the cluster.
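+
+A sketch of the selection, assuming a group with three members where
+'node3' just got fenced (node names, priorities and service counts are
+made up for illustration):
+
+----
+group nodes      : node1:2 node2:2 node3:1
+online nodes     : node1 node2
+highest priority : node1 node2     (priority 2)
+active services  : node1=4 node2=2
+recovery target  : node2           (lowest active service count)
+----
+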
Groups
------
nodes::
-list of group node members
+List of group node members where a priority can be given to each node.
+A service bound to this group will run on the available nodes with the
+highest priority. If more than one node is in the highest priority class,
+the services will get distributed among those nodes, if not already the
+case. The priorities have a relative meaning only; see the example after
+this list.
restricted::
(re)joins the cluster.
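+
+For example, a group preferring 'node1' and 'node2' over 'node3' could be
+created like this (group and node names are placeholders):
+
+----
+ha-manager groupadd prefer_fast -nodes "node1:2,node2:2,node3:1"
+----
+
+A service bound to this group runs on 'node1' or 'node2' as long as one of
+them is available, and only moves to 'node3' otherwise.
+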
-Recovery Policy
----------------
+Start Failure Policy
+---------------------
+
+The start failure policy comes into effect if a service fails to start on
+a node one or more times. It can be used to configure how often a restart
+should be triggered on the same node and how often a service should be
+relocated so that it gets a chance to be started on another node.
+The aim of this policy is to circumvent temporary unavailability of shared
+resources on a specific node. For example, if a shared storage isn't
+available on a quorate node anymore, e.g. because of network problems, but
+still is on other nodes, the relocate policy allows the service to be
+started nonetheless.
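+
+For example, a resource configured with both start failure policy settings
+('max_restart' and 'max_relocate', described below) could look like this in
+'/etc/pve/ha/resources.cfg' (service ID and values are just examples):
+
+----
+vm: 100
+        state started
+        max_restart 2
+        max_relocate 1
+----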
-There are two service recover policy settings which can be configured
-specific for each resource.
+There are two start failure policy settings which can be configured
+specifically for each resource.
max_restart::
stopped::
-Service is stopped (confirmed by LRM)
+Service is stopped (confirmed by LRM). If detected to be running, it will
+get stopped again.
request_stop::
started::
-Service is active an LRM should start it ASAP if not already running.
+Service is active and the LRM should start it ASAP if not already running.
+If the service fails and is detected to be not running, the LRM restarts it.
fence::
Wait for node fencing (service node is not inside quorate cluster
partition).
+As soon as the node gets fenced successfully, the service will be recovered
+to another node, if possible.
freeze::