X-Git-Url: https://git.proxmox.com/?a=blobdiff_plain;f=ha-manager.adoc;h=fadc6b5bb3bfceea1067a235624d78fe98e75a3b;hb=ac70d7d134ce81431aa875c336bd547094326d9b;hp=93c2632e46d347ced4f967239d72119743360712;hpb=bdfd46015eaa7acfa40bd70cf2cb97442c36be39;p=pve-docs.git

diff --git a/ha-manager.adoc b/ha-manager.adoc
index 93c2632..fadc6b5 100644
--- a/ha-manager.adoc
+++ b/ha-manager.adoc
@@ -100,7 +100,7 @@ carefully calculate the benefits, and compare with those additional costs.
 
 TIP: Increasing availability from 99% to 99.9% is relatively
-simply. But increasing availability from 99.9999% to 99.99999% is very
+simple. But increasing availability from 99.9999% to 99.99999% is very
 hard and costly. `ha-manager` has typical error detection and failover
 times of about 2 minutes, so you can get no more than 99.999%
 availability.
@@ -446,6 +446,65 @@ quorum the node cannot reset the watchdog. This will trigger a reboot
 after the watchdog then times out, this happens after 60 seconds.
 
+HA Simulator
+------------
+
+[thumbnail="screenshot/gui-ha-manager-status.png"]
+
+By using the HA simulator you can test and learn all functionalities of the
+Proxmox VE HA solution.
+
+By default, the simulator allows you to watch and test the behaviour of a
+real-world 3-node cluster with 6 VMs. You can also add or remove additional
+VMs or Containers.
+
+You do not have to set up or configure a real cluster; the HA simulator runs
+out of the box.
+
+Install with apt:
+
+----
+apt install pve-ha-simulator
+----
+
+You can even install the package on any Debian-based system without any
+other Proxmox VE packages. For that you will need to download the package and
+copy it to the system you want to run it on for installation. When you install
+the package with apt from the local file system, it will also resolve the
+required dependencies for you.
+
+
+To start the simulator on a remote machine you must have an X11 redirection
+to your current system.
+
+If you are on a Linux machine you can use:
+
+----
+ssh root@<IPofPVE> -Y
+----
+
+On Windows it works with https://mobaxterm.mobatek.net/[mobaxterm].
+
+After either connecting to an existing {pve} node with the simulator
+installed, or installing it on your local Debian-based system manually, you
+can try it out as follows.
+
+First you need to create a working directory, where the simulator saves its
+current state and writes its default config:
+
+----
+mkdir working
+----
+
+Then, simply pass the created directory as a parameter to 'pve-ha-simulator':
+
+----
+pve-ha-simulator working/
+----
+
+You can then start, stop, migrate the simulated HA services, or even check out
+what happens on a node failure.
+
 Configuration
 -------------
@@ -702,7 +761,7 @@ specific for each resource.
 
 max_restart::
 
-Maximum number of tries to restart an failed service on the actual
+Maximum number of tries to restart a failed service on the current
 node. The default is set to one.
 
 max_relocate::
@@ -750,9 +809,9 @@ Package Updates
 When updating the ha-manager you should do one node after the other, never
 all at once for various reasons. First, while we test our software
 thoughtfully, a bug affecting your specific setup cannot totally be ruled out.
-Upgrading one node after the other and checking the functionality of each node
-after finishing the update helps to recover from an eventual problems, while
-updating all could render you in a broken cluster state and is generally not
+Updating one node after the other and checking the functionality of each node
+after finishing the update helps to recover from potential problems, while
+updating all at once could result in a broken cluster and is generally not
 good practice.
 
 Also, the {pve} HA stack uses a request acknowledge protocol to perform
@@ -769,50 +828,100 @@ case, may result in a reset triggered by the watchdog.
 
 Node Maintenance
 ----------------
 
-It is sometimes possible to shutdown or reboot a node to do
-maintenance tasks. Either to replace hardware, or simply to install a
-new kernel image.
+It is sometimes necessary to shut down or reboot a node for maintenance tasks,
+for example to replace hardware, or simply to install a new kernel image.
+This is also true when using the HA stack. The behaviour of the HA stack during
+a shutdown can be configured.
 
+[[ha_manager_shutdown_policy]]
+Shutdown Policy
+~~~~~~~~~~~~~~~
 
-Shutdown
-~~~~~~~~
+Below you will find a description of the different HA policies for a node
+shutdown. Currently 'Conditional' is the default due to backward compatibility.
+Some users may find that 'Migrate' behaves more as expected.
 
-A shutdown ('poweroff') is usually done if the node is planned to stay
-down for some time. The LRM stops all managed services in that
-case. This means that other nodes will take over those service
-afterwards.
+Migrate
+^^^^^^^
 
-NOTE: Recent hardware has large amounts of RAM. So we stop all
-resources, then restart them to avoid online migration of all that
-RAM. If you want to use online migration, you need to invoke that
-manually before you shutdown the node.
+Once the Local Resource Manager (LRM) gets a shutdown request and this policy
+is enabled, it will mark itself as unavailable for the current HA manager.
+This triggers a migration of all HA services currently located on this node.
+Until all running services have been moved away, the LRM will try to delay the
+shutdown process. However, this expects that the running services *can* be
+migrated to another node. In other words, the service must not be locally
+bound, for example by using hardware passthrough. As non-group member nodes
+are considered as runnable targets if no group member is available, this
+policy can still be used when making use of HA groups with only some nodes
+selected. However, marking a group as 'restricted' tells the HA manager that
+the service cannot run outside of the chosen set of nodes; if all of those
+nodes are unavailable, the shutdown will hang until you manually intervene.
+Once the shut-down node comes back online again, the previously displaced
+services will be moved back, if they did not get migrated manually in-between.
 
+NOTE: The watchdog is still active during the migration process on shutdown.
+If the node loses quorum it will be fenced and the services will be recovered.
 
-Reboot
-~~~~~~
+If you start a (previously stopped) service on a node which is currently being
+maintained, the node needs to be fenced to ensure that the service can be moved
+and started on another available node.
+
+Failover
+^^^^^^^^
+
+This mode ensures that all services get stopped, but that they will also be
+recovered if the current node is not online soon. It can be useful when doing
+maintenance on a cluster scale, where live-migrating VMs may not be possible
+if too many nodes are powered off at a time, but you still want to ensure HA
+services get recovered and started again as soon as possible.
 
-Node reboots are initiated with the 'reboot' command. This is usually
-done after installing a new kernel. Please note that this is different
-from ``shutdown'', because the node immediately starts again.
+Freeze
+^^^^^^
 
-The LRM tells the CRM that it wants to restart, and waits until the
-CRM puts all resources into the `freeze` state (same mechanism is used
-for xref:ha_manager_package_updates[Package Updates]). This prevents
-that those resources are moved to other nodes. Instead, the CRM start
-the resources after the reboot on the same node.
+This mode ensures that all services get stopped and frozen, so that they won't
+get recovered until the current node is online again.
+
+Conditional
+^^^^^^^^^^^
+
+The 'Conditional' shutdown policy automatically detects if a shutdown or a
+reboot is requested, and changes behaviour accordingly.
+
+.Shutdown
+
+A shutdown ('poweroff') is usually done if the node is planned to stay down for
+some time. The LRM stops all managed services in that case. This means that
+other nodes will take over those services afterwards.
+
+NOTE: Recent hardware has large amounts of memory (RAM). So we stop all
+resources, then restart them to avoid online migration of all that RAM. If you
+want to use online migration, you need to invoke that manually before you
+shut down the node.
+
+
+.Reboot
+
+Node reboots are initiated with the 'reboot' command. This is usually done
+after installing a new kernel. Please note that this is different from
+``shutdown'', because the node immediately starts again.
+
+The LRM tells the CRM that it wants to restart, and waits until the CRM puts
+all resources into the `freeze` state (the same mechanism is used for
+xref:ha_manager_package_updates[Package Updates]). This prevents those
+resources from being moved to other nodes. Instead, the CRM starts the
+resources after the reboot on the same node.
 
 Manual Resource Movement
-~~~~~~~~~~~~~~~~~~~~~~~~
+^^^^^^^^^^^^^^^^^^^^^^^^
 
-Last but not least, you can also move resources manually to other
-nodes before you shutdown or restart a node. The advantage is that you
-have full control, and you can decide if you want to use online
-migration or not.
+Last but not least, you can also move resources manually to other nodes before
+you shut down or restart a node. The advantage is that you have full control,
+and you can decide if you want to use online migration or not.
 
 NOTE: Please do not 'kill' services like `pve-ha-crm`, `pve-ha-lrm` or
-`watchdog-mux`. They manage and use the watchdog, so this can result
-in a node reboot.
+`watchdog-mux`. They manage and use the watchdog, so this can result in an
+immediate node reboot or even reset.
 
 
 ifdef::manvolnum[]
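
The 'Shutdown Policy' section added above describes the available policies,
but not where they are selected. As a minimal sketch, assuming the policy is
set cluster-wide through the `ha` option in `/etc/pve/datacenter.cfg` (also
reachable in the web interface under Datacenter -> Options), switching a
cluster to the 'Migrate' policy could look like this:

----
# /etc/pve/datacenter.cfg (cluster-wide options; sketch, values are examples)
# shutdown_policy is one of: conditional (default) | freeze | failover | migrate
ha: shutdown_policy=migrate
----

The LRM reads the policy when it receives the shutdown or reboot request, so a
change should take effect on the next node shutdown without touching running
services.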
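
Similarly, the `max_restart` and `max_relocate` options touched by the hunk
above are per-resource settings. A minimal sketch of setting them with the
`ha-manager` CLI, assuming a placeholder service ID `vm:100` and example
values (check `man ha-manager` for the exact option set of your version):

----
# add a VM as an HA resource: try two restarts on the current node before
# relocating, and allow at most one relocation attempt (example values)
ha-manager add vm:100 --state started --max_restart 2 --max_relocate 1

# adjust the options later on an already managed resource
ha-manager set vm:100 --max_restart 3
----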