From a4a67cdb745e3f553c55d8e0ae24c19fe2b7664c Mon Sep 17 00:00:00 2001
From: Thomas Lamprecht <t.lamprecht@proxmox.com>
Date: Wed, 27 Nov 2019 15:42:42 +0100
Subject: [PATCH] ha: add shutdown policy docs

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
---
 ha-manager.adoc | 98 ++++++++++++++++++++++++++++++++++---------------
 1 file changed, 69 insertions(+), 29 deletions(-)

diff --git a/ha-manager.adoc b/ha-manager.adoc
index 284e5fb..a98d3f6 100644
--- a/ha-manager.adoc
+++ b/ha-manager.adoc
@@ -828,50 +828,90 @@ case, may result in a reset triggered by the watchdog.
 Node Maintenance
 ----------------
 
-It is sometimes possible to shutdown or reboot a node to do
-maintenance tasks. Either to replace hardware, or simply to install a
-new kernel image.
+It is sometimes necessary to shut down or reboot a node to do maintenance tasks,
+such as replacing hardware or simply installing a new kernel image.
+This is also true when using the HA stack. The behaviour of the HA stack during
+a shutdown can be configured.
 
+[[ha_manager_shutdown_policy]]
+Shutdown Policy
+~~~~~~~~~~~~~~~
 
-Shutdown
-~~~~~~~~
+Below you will find a description of the different HA policies for a node
+shutdown. Currently 'Conditional' is the default due to backward compatibility.
+Some users may find that 'Migrate' behaves more as expected.
 
-A shutdown ('poweroff') is usually done if the node is planned to stay
-down for some time. The LRM stops all managed services in that
-case. This means that other nodes will take over those service
-afterwards.
+Migrate
+^^^^^^^
 
-NOTE: Recent hardware has large amounts of RAM. So we stop all
-resources, then restart them to avoid online migration of all that
-RAM. If you want to use online migration, you need to invoke that
-manually before you shutdown the node.
+Once the Local Resource Manager (LRM) gets a shutdown request and this policy
+is enabled, it will mark itself as unavailable for the current HA manager.
+This triggers a migration of all HA services currently located on this node.
+The LRM will try to delay the shutdown process until all running services have
+been moved away. However, this expects that the running services *can* be
+migrated to another node. In other words, the service must not be locally
+bound, for example by using hardware passthrough. As non-group member nodes are
+considered runnable targets if no group member is available, this policy can
+still be used when making use of group node restrictions.
+Once the shut-down node comes back online again, the previously displaced
+services will be moved back, if they did not get migrated manually in between.
 
+NOTE: The watchdog is still active during the migration process on shutdown.
+If the node loses quorum it will be fenced and the services will be recovered.
 
-Reboot
-~~~~~~
+Failover
+^^^^^^^^
+
+This mode ensures that all services get stopped, but that they will also be
+recovered, if the current node is not online soon. It can be useful when doing
+maintenance on a cluster scale, where live-migrating VMs may not be possible if
+too many nodes are powered off at a time, but you still want to ensure HA
+services get recovered and started again as soon as possible.
+
+Freeze
+^^^^^^
+
+This mode ensures that all services get stopped and frozen, so that they won't
+get recovered until the current node is online again.
+
+Conditional
+^^^^^^^^^^^
+
+.Shutdown
+
+A shutdown ('poweroff') is usually done if the node is planned to stay down for
+some time. The LRM stops all managed services in that case. This means that
+other nodes will take over those services afterwards.
+
+NOTE: Recent hardware has large amounts of memory (RAM). So we stop all
+resources, then restart them to avoid online migration of all that RAM. If you
+want to use online migration, you need to invoke that manually before you
+shut down the node.
+
+
+.Reboot
 
-Node reboots are initiated with the 'reboot' command. This is usually
-done after installing a new kernel. Please note that this is different
-from ``shutdown'', because the node immediately starts again.
+Node reboots are initiated with the 'reboot' command. This is usually done
+after installing a new kernel. Please note that this is different from
+``shutdown'', because the node immediately starts again.
 
-The LRM tells the CRM that it wants to restart, and waits until the
-CRM puts all resources into the `freeze` state (same mechanism is used
-for xref:ha_manager_package_updates[Package Updates]). This prevents
-that those resources are moved to other nodes. Instead, the CRM start
-the resources after the reboot on the same node.
+The LRM tells the CRM that it wants to restart, and waits until the CRM puts
+all resources into the `freeze` state (same mechanism is used for
+xref:ha_manager_package_updates[Package Updates]). This prevents those
+resources from being moved to other nodes. Instead, the CRM starts the
+resources after the reboot on the same node.
 
 
 Manual Resource Movement
 ~~~~~~~~~~~~~~~~~~~~~~~~
 
-Last but not least, you can also move resources manually to other
-nodes before you shutdown or restart a node. The advantage is that you
-have full control, and you can decide if you want to use online
-migration or not.
+Last but not least, you can also move resources manually to other nodes before
+you shut down or restart a node. The advantage is that you have full control,
+and you can decide if you want to use online migration or not.
 
 NOTE: Please do not 'kill' services like `pve-ha-crm`, `pve-ha-lrm` or
-`watchdog-mux`. They manage and use the watchdog, so this can result
-in a node reboot.
+`watchdog-mux`. They manage and use the watchdog, so this can result in an
+immediate node reboot or even reset.
 
 
 ifdef::manvolnum[]
-- 
2.39.2