X-Git-Url: https://git.proxmox.com/?a=blobdiff_plain;f=ha-manager.adoc;h=fadc6b5bb3bfceea1067a235624d78fe98e75a3b;hb=ac70d7d134ce81431aa875c336bd547094326d9b;hp=93c2632e46d347ced4f967239d72119743360712;hpb=bdfd46015eaa7acfa40bd70cf2cb97442c36be39;p=pve-docs.git

diff --git a/ha-manager.adoc b/ha-manager.adoc
index 93c2632..fadc6b5 100644
--- a/ha-manager.adoc
+++ b/ha-manager.adoc
@@ -100,7 +100,7 @@ carefully calculate the benefits, and compare with those additional costs.
 
 TIP: Increasing availability from 99% to 99.9% is relatively
-simply. But increasing availability from 99.9999% to 99.99999% is very
+simple. But increasing availability from 99.9999% to 99.99999% is very
 hard and costly. `ha-manager` has typical error detection and failover
 times of about 2 minutes, so you can get no more than 99.999%
 availability.
@@ -446,6 +446,65 @@ quorum the node cannot reset the watchdog. This will trigger a reboot
 after the watchdog then times out, this happens after 60 seconds.
 
+HA Simulator
+------------
+
+[thumbnail="screenshot/gui-ha-manager-status.png"]
+
+By using the HA simulator you can test and learn all functionalities of the
+Proxmox VE HA solution.
+
+By default, the simulator allows you to watch and test the behaviour of a
+real-world 3-node cluster with 6 VMs. You can also add or remove additional
+VMs or Containers.
+
+You do not have to set up or configure a real cluster; the HA simulator runs
+out of the box.
+
+Install with apt:
+
+----
+apt install pve-ha-simulator
+----
+
+You can even install the package on any Debian-based system without any
+other Proxmox VE packages. For that you will need to download the package and
+copy it to the system you want to run it on for installation. When you install
+the package with apt from the local file system, it will also resolve the
+required dependencies for you.
+
+
+To start the simulator on a remote machine you must have an X11 redirection
+to your current system.
+
+If you are on a Linux machine you can use:
+
+----
+ssh root@<IPofPVE> -Y
+----
+
+On Windows it works with https://mobaxterm.mobatek.net/[mobaxterm].
+
+After either connecting to an existing {pve} node with the simulator
+installed, or installing it on your local Debian-based system manually, you
+can try it out as follows.
+
+First you need to create a working directory, where the simulator saves its
+current state and writes its default config:
+
+----
+mkdir working
+----
+
+Then, simply pass the created directory as a parameter to 'pve-ha-simulator':
+
+----
+pve-ha-simulator working/
+----
+
+You can then start, stop, migrate the simulated HA services, or even check out
+what happens on a node failure.
+
 Configuration
 -------------
@@ -702,7 +761,7 @@ specific for each resource.
 
 max_restart::
 
-Maximum number of tries to restart an failed service on the actual
+Maximum number of tries to restart a failed service on the current
 node. The default is set to one.
 
 max_relocate::
@@ -750,9 +809,9 @@ Package Updates
 When updating the ha-manager you should do one node after the other, never
 all at once for various reasons. First, while we test our software
 thoughtfully, a bug affecting your specific setup cannot totally be ruled out.
-Upgrading one node after the other and checking the functionality of each node
-after finishing the update helps to recover from an eventual problems, while
-updating all could render you in a broken cluster state and is generally not
+Updating one node after the other and checking the functionality of each node
+after finishing the update helps to recover from potential problems, while
+updating all at once could result in a broken cluster and is generally not
 good practice.
 
 Also, the {pve} HA stack uses a request acknowledge protocol to perform
@@ -769,50 +828,100 @@ case, may result in a reset triggered by the watchdog.
 
 Node Maintenance
 ----------------
 
-It is sometimes possible to shutdown or reboot a node to do
-maintenance tasks. Either to replace hardware, or simply to install a
-new kernel image.
+It is sometimes necessary to shut down or reboot a node for maintenance tasks,
+for example to replace hardware, or simply to install a new kernel image.
+This is also true when using the HA stack. The behaviour of the HA stack during
+a shutdown can be configured.
 
+[[ha_manager_shutdown_policy]]
+Shutdown Policy
+~~~~~~~~~~~~~~~
 
-Shutdown
-~~~~~~~~
+Below you will find a description of the different HA policies for a node
+shutdown. Currently 'Conditional' is the default due to backward compatibility.
+Some users may find that 'Migrate' behaves more as expected.
 
-A shutdown ('poweroff') is usually done if the node is planned to stay
-down for some time. The LRM stops all managed services in that
-case. This means that other nodes will take over those service
-afterwards.
+Migrate
+^^^^^^^
 
-NOTE: Recent hardware has large amounts of RAM. So we stop all
-resources, then restart them to avoid online migration of all that
-RAM. If you want to use online migration, you need to invoke that
-manually before you shutdown the node.
+Once the Local Resource Manager (LRM) gets a shutdown request and this policy
+is enabled, it will mark itself as unavailable for the current HA manager.
+This triggers a migration of all HA services currently located on this node.
+Until all running services have been moved away, the LRM will try to delay the
+shutdown process. However, this expects that the running services *can* be
+migrated to another node. In other words, the service must not be locally
+bound, for example by using hardware passthrough. As non-group member nodes
+are considered as runnable targets if no group member is available, this
+policy can still be used when making use of HA groups with only some nodes
+selected. However, marking a group as 'restricted' tells the HA manager that
+the service cannot run outside of the chosen set of nodes; if all of those
+nodes are unavailable, the shutdown will hang until you manually intervene.
+Once the shut-down node comes back online again, the previously displaced
+services will be moved back, if they did not get migrated manually in-between.
 
+NOTE: The watchdog is still active during the migration process on shutdown.
+If the node loses quorum it will be fenced and the services will be recovered.
 
-Reboot
-~~~~~~
+If you start a (previously stopped) service on a node which is currently being
+maintained, the node needs to be fenced to ensure that the service can be moved
+and started on another available node.
+
+Failover
+^^^^^^^^
+
+This mode ensures that all services get stopped, but that they will also be
+recovered if the current node is not online soon. It can be useful when doing
+maintenance on a cluster scale, where live-migrating VMs may not be possible
+if too many nodes are powered off at a time, but you still want to ensure HA
+services get recovered and started again as soon as possible.
 
-Node reboots are initiated with the 'reboot' command. This is usually
-done after installing a new kernel. Please note that this is different
-from ``shutdown'', because the node immediately starts again.
+Freeze
+^^^^^^
 
-The LRM tells the CRM that it wants to restart, and waits until the
-CRM puts all resources into the `freeze` state (same mechanism is used
-for xref:ha_manager_package_updates[Package Updates]). This prevents
-that those resources are moved to other nodes. Instead, the CRM start
-the resources after the reboot on the same node.
+This mode ensures that all services get stopped and frozen, so that they won't
+get recovered until the current node is online again.
+
+Conditional
+^^^^^^^^^^^
+
+The 'Conditional' shutdown policy automatically detects if a shutdown or a
+reboot is requested, and changes behaviour accordingly.
+
+.Shutdown
+
+A shutdown ('poweroff') is usually done if the node is planned to stay down for
+some time. The LRM stops all managed services in that case. This means that
+other nodes will take over those services afterwards.
+
+NOTE: Recent hardware has large amounts of memory (RAM). So we stop all
+resources, then restart them to avoid online migration of all that RAM. If you
+want to use online migration, you need to invoke that manually before you
+shut down the node.
+
+
+.Reboot
+
+Node reboots are initiated with the 'reboot' command. This is usually done
+after installing a new kernel. Please note that this is different from
+``shutdown'', because the node immediately starts again.
+
+The LRM tells the CRM that it wants to restart, and waits until the CRM puts
+all resources into the `freeze` state (the same mechanism is used for
+xref:ha_manager_package_updates[Package Updates]). This prevents those
+resources from being moved to other nodes. Instead, the CRM starts the
+resources after the reboot on the same node.
 
 Manual Resource Movement
-~~~~~~~~~~~~~~~~~~~~~~~~
+^^^^^^^^^^^^^^^^^^^^^^^^
 
-Last but not least, you can also move resources manually to other
-nodes before you shutdown or restart a node. The advantage is that you
-have full control, and you can decide if you want to use online
-migration or not.
+Last but not least, you can also move resources manually to other nodes before
+you shut down or restart a node. The advantage is that you have full control,
+and you can decide if you want to use online migration or not.
 
 NOTE: Please do not 'kill' services like `pve-ha-crm`, `pve-ha-lrm` or
-`watchdog-mux`. They manage and use the watchdog, so this can result
-in a node reboot.
+`watchdog-mux`. They manage and use the watchdog, so this can result in an
+immediate node reboot or even reset.
 
 
 ifdef::manvolnum[]
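
The 'Shutdown Policy' section added above describes the available policies,
but not where they are selected. As a minimal sketch, assuming the policy is
set cluster-wide through the `ha` option in `/etc/pve/datacenter.cfg` (also
reachable in the web interface under Datacenter -> Options), switching a
cluster to the 'Migrate' policy could look like this:

----
# /etc/pve/datacenter.cfg (cluster-wide options; sketch, values are examples)
# shutdown_policy is one of: conditional (default) | freeze | failover | migrate
ha: shutdown_policy=migrate
----

The LRM reads the policy when it receives the shutdown or reboot request, so a
change should take effect on the next node shutdown without touching running
services.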
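
Similarly, the `max_restart` and `max_relocate` options touched by the hunk
above are per-resource settings. A minimal sketch of setting them with the
`ha-manager` CLI, assuming a placeholder service ID `vm:100` and example
values (check `man ha-manager` for the exact option set of your version):

----
# add a VM as an HA resource: try two restarts on the current node before
# relocating, and allow at most one relocation attempt (example values)
ha-manager add vm:100 --state started --max_restart 2 --max_relocate 1

# adjust the options later on an already managed resource
ha-manager set vm:100 --max_restart 3
----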