X-Git-Url: https://git.proxmox.com/?a=blobdiff_plain;f=ha-manager.adoc;h=5db5b052e44d3f1817298a284ca8b65efc8eadca;hb=ba1d96fd8f6ed0109479b65cb5f543c315d5f78c;hp=78bbf10ecafd76ed405a77c48b92cc47bf9ea2db;hpb=5771d9b0cb6bff7a2a57f8725c3aaa6dc5b25922;p=pve-docs.git

diff --git a/ha-manager.adoc b/ha-manager.adoc
index 78bbf10..5db5b05 100644
--- a/ha-manager.adoc
+++ b/ha-manager.adoc
@@ -215,7 +215,7 @@ then executes this action *one time* and writes back the result, also
 identifiable by the same UID. This is needed so that the LRM does not
 executes an outdated command.
 With the exception of the 'stop' and the 'error' command,
-those two do not depend on the result produce and are executed
+those two do not depend on the result produced and are executed
 always in the case of the stopped state and once in the case of
 the error state.
 
@@ -346,9 +346,27 @@ Configure Hardware Watchdog
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
 By default all watchdog modules are blocked for security reasons as they are
 like a loaded gun if not correctly initialized.
-If you have a hardware watchdog available remove its module from the blacklist
-and restart 'the watchdog-mux' service.
-
+If you have a hardware watchdog available remove its kernel module from the
+blacklist, load it with insmod and restart the 'watchdog-mux' service or reboot
+the node.
+
+Recover Fenced Services
+~~~~~~~~~~~~~~~~~~~~~~~
+
+After a node failed and its fencing was successful we start to recover services
+to other available nodes and restart them there so that they can provide service
+again.
+
+The selection of the node on which the services gets recovered is influenced
+by the users group settings, the currently active nodes and their respective
+active service count.
+First we build a set out of the intersection between user selected nodes and
+available nodes. Then the subset with the highest priority of those nodes
+gets chosen as possible nodes for recovery. We select the node with the
+currently lowest active service count as a new node for the service.
+That minimizes the possibility of an overload, which else could cause an
+unresponsive node and as a result a chain reaction of node failures in the
+cluster.
 
 Groups
 ------
@@ -360,7 +378,11 @@ Group Settings
 
 nodes::
 
-list of group node members
+List of group node members where a priority can be given to each node.
+A service bound to this group will run on the nodes with the highest priority
+available. If more nodes are in the highest priority class the services will
+get distributed to those node if not already there. The priorities have a
+relative meaning only.
 
 restricted::
 
@@ -374,10 +396,19 @@ the resource won't automatically fail back when a more preferred node
 (re)joins the cluster.
 
 
-Recovery Policy
----------------
+Start Failure Policy
+---------------------
+
+The start failure policy comes in effect if a service failed to start on a
+node once ore more times. It can be used to configure how often a restart
+should be triggered on the same node and how often a service should be
+relocated so that it gets a try to be started on another node.
+The aim of this policy is to circumvent temporary unavailability of shared
+resources on a specific node. For example, if a shared storage isn't available
+on a quorate node anymore, e.g. network problems, but still on other nodes,
+the relocate policy allows then that the service gets started nonetheless.
 
-There are two service recover policy settings which can be configured
+There are two service start recover policy settings which can be configured
 specific for each resource.
 
 max_restart::
@@ -450,7 +481,8 @@ Service States
 
 stopped::
 
-Service is stopped (confirmed by LRM)
+Service is stopped (confirmed by LRM), if detected running it will get stopped
+again.
 
 request_stop::
 
@@ -459,11 +491,14 @@ Service should be stopped. Waiting for confirmation from LRM.
 started::
 
 Service is active an LRM should start it ASAP if not already running.
+If the Service fails and is detected to be not running the LRM restarts it.
 
 fence::
 
 Wait for node fencing (service node is not inside quorate cluster
 partition).
+As soon as node gets fenced successfully the service will be recovered to
+another node, if possible.
 
 freeze::
 