From de225e04c47b0a06c64fe6c9d12af48bf5823e58 Mon Sep 17 00:00:00 2001
From: Thomas Lamprecht
Date: Tue, 3 Jan 2023 13:19:16 +0100
Subject: [PATCH] update readme to be a bit less confusing/outdated

E.g., pve-ha-manager is our current HA manager, so talking about the
"current HA stack" being EOL without mentioning that the `rgmanager`
stack was actually meant got taken up the wrong way by some potential
users.

Correct that and a few other things, but as some content is definitely
still out of date, or will be in a few months, mention that this is an
older README and refer to the HA reference docs at the top.

Signed-off-by: Thomas Lamprecht
---
 README | 101 +++++++++++++++++++++++++++++++++------------------------
 1 file changed, 58 insertions(+), 43 deletions(-)

diff --git a/README b/README
index 1c5177f..cf6560d 100644
--- a/README
+++ b/README
@@ -1,41 +1,42 @@
 = Proxmox HA Manager =

-== Motivation ==
+Note that this README was written as early development planning/documentation
+in 2015; even though small updates were made in 2023, it might be a bit out of
+date. For usage documentation see the official reference docs shipped with your
+Proxmox VE installation or, for your convenience, hosted at:
+https://pve.proxmox.com/pve-docs/chapter-ha-manager.html

-The current HA manager has a bunch of drawbacks:
+== History & Motivation ==

-- no more development (redhat moved to pacemaker)
+The `rgmanager` HA stack used in Proxmox VE 3.x had a bunch of drawbacks:

+- no more development (redhat moved to pacemaker)
 - highly depend on old version of corosync
-
-- complicated code (cause by compatibility layer with
-  older cluster stack (cman)
-
+- complicated code (caused by a compatibility layer with the older cluster
+  stack (cman))
 - no self-fencing

-In future, we want to make HA easier for our users, and it should
-be possible to move to newest corosync, or even a totally different
-cluster stack. So we want:
-
-- possibility to run with any distributed key/value store which provides
-  some kind of locking with timeouts (zookeeper, consul, etcd, ..)
+For Proxmox VE 4.0 we thus required a new HA stack and wanted to make HA easier
+for our users, while also making it possible to move to the newest corosync, or
+even a totally different cluster stack. So, the following core requirements
+were set out:

+- possibility to run with any distributed key/value store which provides some
+  kind of locking with timeouts (zookeeper, consul, etcd, ..)
 - self fencing using Linux watchdog device
-
 - implemented in Perl, so that we can use PVE framework
-
 - only work with simply resources like VMs

-We dropped the idea to assemble complex, dependend services, because we think
-this is already done with the VM abstraction.
+We dropped the idea to assemble complex, dependent services, because we think
+this is already done with the VM/CT abstraction.

-= Architecture =
+== Architecture ==

-== Cluster requirements ==
+The HA stack has the following cluster requirements.

 === Cluster wide locks with timeouts ===

-The cluster stack must provide cluster wide locks with timeouts.
+The cluster stack must provide cluster-wide locks with timeouts.
 The Proxmox 'pmxcfs' implements this on top of corosync.

 === Watchdog ===
@@ -50,7 +51,7 @@ allows one user. We need to protect at least two daemons, so we write
 our own watchdog daemon. This daemon work on /dev/watchdog, but provides
 that service to several other daemons using a local socket.
-== Self fencing ==
+=== Self fencing ===

 A node needs to acquire a special 'ha_agent_${node}_lock' (one separate
 lock for each node) before starting HA resources, and the node updates
@@ -65,21 +66,25 @@ long as there are running services on that node.
 The HA manger can assume that the watchdog triggered a reboot when he
 is able to acquire the 'ha_agent_${node}_lock' for that node.

-=== Problems with "two_node" Clusters ===
+==== Problems with "two_node" Clusters ====

 This corosync options depends on a fence race condition, and only
 works using reliable HW fence devices.

 Above 'self fencing' algorithm does not work if you use this option!

-== Testing requirements ==
+Note that you can use a QDevice, i.e., an external and simple node arbiter
+process (no full corosync membership, so relaxed networking requirements).
+
+=== Testing Requirements ===

-We want to be able to simulate HA cluster, using a GUI. This makes it easier
-to learn how the system behaves. We also need a way to run regression tests.
+We want to be able to simulate and test the behavior of an HA cluster, using
+either a GUI or a CLI. This makes it easier to learn how the system behaves. We
+also need a way to run regression tests.

-= Implementation details =
+== Implementation details ==

-== Cluster Resource Manager (class PVE::HA::CRM) ==
+=== Cluster Resource Manager (class PVE::HA::CRM) ===

 The Cluster Resource Manager (CRM) daemon runs one each node, but
 locking makes sure only one CRM daemon act in 'master' role. That
@@ -88,41 +93,51 @@ service states by writing the global 'manager_status'.
 That data structure is read by the Local Resource Manager, which performs
 the real work (start/stop/migrate) services.

-=== Service Relocation ===
+==== Service Relocation ====

-Some services like Qemu Virtual Machines supports live migration.
-So the LRM can migrate those services without stopping them (CRM
-service state 'migrate'),
+Some services, like a QEMU Virtual Machine, support live migration.
+So the LRM can migrate those services without stopping them (CRM service state
+'migrate').

-Most other service types requires the service to be stopped, and then
-restarted at the other node. Stopped services are moved by the CRM
-(usually by simply changing the service configuration).
+Most other service types require the service to be stopped, and then restarted
+at the other node. Stopped services are moved by the CRM (usually by simply
+changing the service configuration).

-=== Service ordering and colocation constarints ===
+==== Service Ordering and Co-location Constraints ====

-So far there are no plans to implement this (although it would be possible).
+There are no plans to implement this for the initial version, although it would
+be possible and probably should be done in a later version.

-=== Possible CRM Service States ===
+==== Possible CRM Service States ====

 stopped:      Service is stopped (confirmed by LRM)

 request_stop: Service should be stopped. Waiting for
-	      confirmation from LRM.
+              confirmation from LRM.

 started:      Service is active an LRM should start it asap.

 fence:        Wait for node fencing (service node is not inside
-	      quorate cluster partition).
+              quorate cluster partition).
+
+recovery:     Service gets recovered to a new node as its current node was
+              fenced. Note that a service might be stuck here, depending on the
+              group/priority configuration.

 freeze:       Do not touch. We use this state while we reboot a node,
-	      or when we restart the LRM daemon.
+              or when we restart the LRM daemon.

 migrate:      Migrate (live) service to other node.
+relocate:     Migrate (stop, move, start) service to other node.
+
 error:        Service disabled because of LRM errors.

-== Local Resource Manager (class PVE::HA::LRM) ==
+There's also an `ignored` state, which tells the HA stack to ignore a service
+completely, i.e., as if it wasn't under HA control at all.
+
+=== Local Resource Manager (class PVE::HA::LRM) ===

 The Local Resource Manager (LRM) daemon runs one each node, and
 performs service commands (start/stop/migrate) for services assigned
@@ -131,11 +146,11 @@ cluster wide 'ha_agent_${node}_lock' lock, and the CRM is not allowed
 to assign the service to another node while the LRM holds that lock.

 The LRM reads the requested service state from 'manager_status', and
-tries to bring the local service into that state. The actial service
+tries to bring the local service into that state. The actual service
 status is written back to the 'service_${node}_status', and can be
 read by the CRM.

-== Pluggable Interface for cluster environment (class PVE::HA::Env) ==
+=== Pluggable Interface for Cluster Environment (class PVE::HA::Env) ===

 This class defines an interface to the actual cluster environment:

-- 
2.39.2
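
As a small illustration of the pluggable-environment idea the patched README
ends on, here is a minimal Perl sketch of a dummy, in-memory environment such
as a simulator or regression test could use. The package and method names
(My::HA::DummyEnv, acquire_lock, read_status, write_status) are hypothetical
and chosen for illustration only; they are not the actual PVE::HA::Env API.
The sketch only shows the core concept the README describes: cluster-wide
locks with timeouts plus shared status, hidden behind an interface so the
CRM/LRM logic does not need to care about the real cluster stack.

# Hypothetical sketch, NOT the real PVE::HA::Env interface: a minimal
# in-memory "cluster environment" for simulation/testing purposes.
package My::HA::DummyEnv;

use strict;
use warnings;

sub new {
    my ($class, $nodename) = @_;
    my $self = {
        nodename => $nodename,
        locks    => {},    # lock name => expiry timestamp (epoch seconds)
        status   => {},    # shared state, e.g. a 'manager_status' structure
    };
    return bless $self, $class;
}

# Cluster-wide lock with timeout: succeeds if the lock is free or expired.
# A real environment would implement this on top of pmxcfs/corosync instead.
sub acquire_lock {
    my ($self, $name, $timeout) = @_;
    my $now    = time();
    my $expiry = $self->{locks}->{$name};
    if (!defined($expiry) || $expiry < $now) {
        $self->{locks}->{$name} = $now + $timeout;
        return 1;
    }
    return 0;
}

# Read/write shared state, e.g. the global 'manager_status' data.
sub read_status  { my ($self, $key) = @_; return $self->{status}->{$key}; }
sub write_status { my ($self, $key, $val) = @_; $self->{status}->{$key} = $val; }

package main;

# Usage example: a node tries to become CRM master by grabbing a
# (hypothetical) manager lock for 60 seconds.
my $env = My::HA::DummyEnv->new('node1');
if ($env->acquire_lock('ha_manager_lock', 60)) {
    $env->write_status('manager_status', { master => 'node1' });
    print "node1 acquired the manager lock\n";
}

The point, as the README argues, is that the same CRM/LRM code can then run
against pmxcfs in production or against such a dummy environment in the
simulator and regression tests.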