X-Git-Url: https://git.proxmox.com/?p=pve-docs.git;a=blobdiff_plain;f=ha-manager.adoc;h=d8489cb232652a4e2e0c04c30c8b3162749e4852;hp=a5ffe00325e0b711cac094036ef8a229a1c37af2;hb=a9c77fec9239c1dd979bb0fd025a4d9186ae6449;hpb=49a5e11cd14742d8ad28116fab7fce9fc85321bd

diff --git a/ha-manager.adoc b/ha-manager.adoc
index a5ffe00..d8489cb 100644
--- a/ha-manager.adoc
+++ b/ha-manager.adoc
@@ -1,8 +1,8 @@
-[[chapter-ha-manager]]
+[[chapter_ha_manager]]
 ifdef::manvolnum[]
-PVE({manvolnum})
-================
-include::attributes.txt[]
+ha-manager(1)
+=============
+:pve-toplevel:
 
 NAME
 ----
@@ -17,14 +17,12 @@ include::ha-manager.1-synopsis.adoc[]
 DESCRIPTION
 -----------
 endif::manvolnum[]
-
 ifndef::manvolnum[]
 High Availability
 =================
-include::attributes.txt[]
+:pve-toplevel:
 endif::manvolnum[]
-
 Our modern society depends heavily on information provided by
 computers over the network. Mobile devices amplified that dependency,
 because people can access the network any time from anywhere. If you
@@ -58,7 +56,7 @@ yourself. The following solutions work without modifying the
 software:
 
 * Use reliable ``server'' components
-
++
 NOTE: Computer components with the same functionality can have varying
 reliability numbers, depending on the component quality. Most vendors
 sell components with higher reliability as ``server'' components -
@@ -107,21 +105,27 @@ hard and costly. `ha-manager` has typical error detection and failover
 times of about 2 minutes, so you can get no more than 99.999%
 availability.
 
+
 Requirements
 ------------
 
+You must meet the following requirements before you start with HA:
+
 * at least three cluster nodes (to get reliable quorum)
 
 * shared storage for VMs and containers
 
 * hardware redundancy (everywhere)
 
+* use reliable ``server'' components
+
 * hardware watchdog - if not available we fall back to the
   Linux kernel software watchdog (`softdog`)
 
 * optional hardware fencing devices
 
+
+[[ha_manager_resources]]
 Resources
 ---------
 
@@ -148,16 +152,17 @@ To provide High Availability two daemons run on each node:
 
 `pve-ha-lrm`::
 
-The local resource manager (LRM), it controls the services running on
-the local node.
-It reads the requested states for its services from the current manager
-status file and executes the respective commands.
+The local resource manager (LRM), which controls the services running on
+the local node. It reads the requested states for its services from
+the current manager status file and executes the respective commands.
 
 `pve-ha-crm`::
 
-The cluster resource manager (CRM), it controls the cluster wide
-actions of the services, processes the LRM results and includes the state
-machine which controls the state of each service.
+The cluster resource manager (CRM), which makes the cluster-wide
+decisions. It sends commands to the LRM, processes the results,
+and moves resources to other nodes if something fails. The CRM also
+handles node fencing.
+
 
 .Locks in the LRM & CRM
 [NOTE]
@@ -269,17 +274,59 @@ quorum, the LRM waits for a new quorum to form. As long as there is no
 quorum the node cannot reset the watchdog. This will trigger a reboot
 after the watchdog times out, which happens after 60 seconds.
 
+
 Configuration
 -------------
 
-The HA stack is well integrated in the Proxmox VE API2. So, for
-example, HA can be configured via `ha-manager` or the PVE web
-interface, which both provide an easy to use tool.
+The HA stack is well integrated into the {pve} API. So, for example,
+HA can be configured via the `ha-manager` command line interface, or
+the {pve} web interface - both interfaces provide an easy way to
+manage HA. Automation tools can use the API directly.
+
+All HA configuration files are within `/etc/pve/ha/`, so they get
+automatically distributed to the cluster nodes, and all nodes share
+the same HA configuration.
+
+
+Resources
+~~~~~~~~~
+
+The resource configuration file `/etc/pve/ha/resources.cfg` stores
+the list of resources managed by `ha-manager`. A resource configuration
+inside that list looks like this:
+
+----
+<type>: <name>
+	<property> <value>
+	...
+----
+
+It starts with a resource type followed by a resource-specific name,
+separated by a colon. Together this forms the HA resource ID, which is
+used by all `ha-manager` commands to uniquely identify a resource
+(example: `vm:100` or `ct:101`). The next lines contain additional
+properties:
+
+include::ha-resources-opts.adoc[]
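+
+For example, a concrete entry for a virtual machine could look like
+this (a sketch only: it assumes that VM 100 exists and that a group
+named `mygroup` is defined; see the property list above for all
+available options):
+
+----
+vm: 100
+	group mygroup
+	max_relocate 2
+----
+
+This restricts VM 100 to the nodes of `mygroup`, and allows the
+manager to try up to two relocations to other nodes if the service
+fails to start. Instead of editing the file by hand, such an entry
+can also be created with the command line tool, for example with
+`ha-manager add vm:100`.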
+
+
+Groups
+~~~~~~
+
+The HA group configuration file `/etc/pve/ha/groups.cfg` is used to
+define groups of cluster nodes. A resource can be restricted to run
+only on the members of such a group. A group configuration looks like
+this:
+
+----
+group: <group>
+	nodes <node_list>
+	<property> <value>
+	...
+----
+
+include::ha-groups-opts.adoc[]
 
-The resource configuration file can be located at
-`/etc/pve/ha/resources.cfg` and the group configuration file at
-`/etc/pve/ha/groups.cfg`. Use the provided tools to make changes,
-there shouldn't be any need to edit them manually.
 
 Node Power Status
 -----------------
@@ -311,6 +358,7 @@ the update process can take too long, which, in the worst case, may
 result in a watchdog reset.
 
 
+[[ha_manager_fencing]]
 Fencing
 -------
 
@@ -380,6 +428,7 @@ That minimizes the possibility of an overload, which could otherwise cause
 an unresponsive node and, as a result, a chain reaction of node failures
 in the cluster.
 
+[[ha_manager_groups]]
 Groups
 ------
 
@@ -481,6 +530,7 @@ killing its process)
 
 * *after* you have fixed all errors, you may enable the service again
 
+[[ha_manager_service_operations]]
 Service Operations
 ------------------
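+
+For example, a typical session with the command line tool could look
+like the following sketch (`vm:100` and `node2` are placeholders for
+an actual resource ID and target node; see the command reference for
+the full option list):
+
+----
+# put virtual machine 100 under HA management
+ha-manager add vm:100
+
+# show the current status of all HA managed services
+ha-manager status
+
+# move the service to another cluster node
+ha-manager migrate vm:100 node2
+
+# remove the service from HA management again
+ha-manager remove vm:100
+----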