ha-manager.adoc: add new section Node Maintenance

[pve-docs.git] / ha-manager.adoc
diff --git a/ha-manager.adoc b/ha-manager.adoc

index 77788a77a014f9355be5c11dec8923d9f77d9712..e1b0df82ebb1023710d51af725b560b2165ca1ef 100644
--- a/ha-manager.adoc
+++ b/ha-manager.adoc
@@ -177,6 +177,8 @@ lock.
  Service States
  ~~~~~~~~~~~~~~
  
+[thumbnail="gui-ha-manager-status.png"]
+
  The CRM use a service state enumeration to record the current service
  state. We display this state on the GUI and you can query it using
  the `ha-manager` command line tool:
@@ -350,6 +352,8 @@ the same HA configuration.
  Resources
  ~~~~~~~~~
  
+[thumbnail="gui-ha-manager-resources-view.png"]
+
  The resource configuration file `/etc/pve/ha/resources.cfg` stores
  the list of resources managed by `ha-manager`. A resource configuration
  inside that list look like this:
@@ -382,6 +386,8 @@ ct: 102
      # Note: use default settings for everything
  ----
  
+[thumbnail="gui-ha-manager-add-resource.png"]
+
  Above config was generated using the `ha-manager` command line tool:
  
  ----
@@ -394,6 +400,8 @@ Above config was generated using the `ha-manager` command line tool:
  Groups
  ~~~~~~
  
+[thumbnail="gui-ha-manager-groups-view.png"]
+
  The HA group configuration file `/etc/pve/ha/groups.cfg` is used to
  define groups of cluster nodes. A resource can be restricted to run
  only on the members of such group. A group configuration look like
@@ -408,6 +416,8 @@ group: <group>
  
  include::ha-groups-opts.adoc[]
  
+[thumbnail="gui-ha-manager-add-group.png"]
+
  A commom requirement is that a resource should run on a specific
  node. Usually the resource is able to run on other nodes, so you can define
  an unrestricted group with a single member:
@@ -465,38 +475,6 @@ stable again. Setting the `nofailback` flag prevents that the
  recovered services move straight back to the fenced node.
  
  
-Node Power Status
------------------
-
-If a node needs maintenance you should migrate and or relocate all
-services which are required to run always on another node first.
-After that you can stop the LRM and CRM services. But note that the
-watchdog triggers if you stop it with active services.
-
-
-[[ha_manager_package_updates]]
-Package Updates
----------------
-
-When updating the ha-manager you should do one node after the other, never
-all at once for various reasons. First, while we test our software
-thoughtfully, a bug affecting your specific setup cannot totally be ruled out.
-Upgrading one node after the other and checking the functionality of each node
-after finishing the update helps to recover from an eventual problems, while
-updating all could render you in a broken cluster state and is generally not
-good practice.
-
-Also, the {pve} HA stack uses a request acknowledge protocol to perform
-actions between the cluster and the local resource manager. For restarting,
-the LRM makes a request to the CRM to freeze all its services. This prevents
-that they get touched by the Cluster during the short time the LRM is restarting.
-After that the LRM may safely close the watchdog during a restart.
-Such a restart happens on a update and as already stated a active master
-CRM is needed to acknowledge the requests from the LRM, if this is not the case
-the update process can be too long which, in the worst case, may result in
-a watchdog reset.
-
-
  [[ha_manager_fencing]]
  Fencing
  -------
@@ -646,6 +624,77 @@ killing its process)
  * *after* you fixed all errors you may enable the service again
  
  
+Node Maintenance
+----------------
+
+It is sometimes necessary to shut down or reboot a node to do
+maintenance tasks, for example to replace hardware or simply to install a
+new kernel image.
+
+
+Shutdown
+~~~~~~~~
+
+A shutdown ('poweroff') is usually done if the node is planned to stay
+down for some time. The LRM stops all managed services in that
+case. This means that other nodes will take over those services
+afterwards.
+
+NOTE: Recent hardware has large amounts of RAM. So we stop all
+resources, then restart them, to avoid online migration of all that
+RAM. If you want to use online migration, you need to invoke it
+manually before you shut down the node.
+
+
+Reboot
+~~~~~~
+
+Node reboots are initiated with the 'reboot' command. This is usually
+done after installing a new kernel. Please note that this is different
+from ``shutdown'', because the node immediately starts again.
+
+The LRM tells the CRM that it wants to restart, and waits until the
+CRM puts all resources into the `freeze` state. This prevents
+those resources from being moved to other nodes. Instead, the CRM starts
+the resources on the same node after the reboot.
+
+
+Manual Resource Movement
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Last but not least, you can also move resources manually to other
+nodes before you shut down or restart a node. The advantage is that you
+have full control, and you can decide if you want to use online
+migration or not.
+
+NOTE: Please do not 'kill' services like `pve-ha-crm`, `pve-ha-lrm` or
+`watchdog-mux`. They manage and use the watchdog, so this can result
+in a node reboot.
+
+
+[[ha_manager_package_updates]]
+Package Updates
+---------------
+
+When updating the ha-manager, you should do one node after the other, never
+all at once, for various reasons. First, while we test our software
+thoroughly, a bug affecting your specific setup cannot be ruled out entirely.
+Upgrading one node after the other and checking the functionality of each node
+after finishing the update helps to recover from eventual problems, while
+updating all at once could leave you with a broken cluster and is generally not
+good practice.
+
+Also, the {pve} HA stack uses a request acknowledge protocol to perform
+actions between the cluster and the local resource manager. For restarting,
+the LRM makes a request to the CRM to freeze all its services. This prevents
+them from being touched by the cluster during the short time the LRM is
+restarting. After that, the LRM may safely close the watchdog during a restart.
+Such a restart happens on an update and, as already stated, an active master
+CRM is needed to acknowledge the requests from the LRM. If this is not the case,
+the update process can take too long which, in the worst case, may result in
+a watchdog reset.
+
+
  [[ha_manager_service_operations]]
  Service Operations
  ------------------