= Proxmox HA Manager =

Note that this README was written as early development planning/documentation
in 2015; even though small updates were made in 2023, it might be a bit out of
date. For usage documentation see the official reference docs shipped with your
Proxmox VE installation or, for your convenience, hosted at:
https://pve.proxmox.com/pve-docs/chapter-ha-manager.html

== History & Motivation ==

The `rgmanager` HA stack used in Proxmox VE 3.x has a bunch of drawbacks:

- no more development (redhat moved to pacemaker)
- highly depends on an old version of corosync
- complicated code (caused by the compatibility layer with the older cluster
  stack (cman))
- no self-fencing

For Proxmox VE 4.0 we thus required a new HA stack and also wanted to make HA
easier for our users while also making it possible to move to the newest
corosync, or even a totally different cluster stack. So, the following core
requirements got set out:

- possibility to run with any distributed key/value store which provides some
  kind of locking with timeouts (zookeeper, consul, etcd, ..)
- self fencing using Linux watchdog device
- implemented in Perl, so that we can use the PVE framework
- only work with simple resources like VMs

We dropped the idea to assemble complex, dependent services, because we think
this is already done with the VM/CT abstraction.

== Architecture ==

Cluster requirements:

=== Cluster wide locks with timeouts ===

The cluster stack must provide cluster-wide locks with timeouts.
The Proxmox 'pmxcfs' implements this on top of corosync.

f02ff212 DM |
42 | === Watchdog === |
43 | ||
44 | We need a reliable watchdog mechanism, which is able to provide hard | |
63f6a08c | 45 | timeouts. It must be guaranteed that the node reboots within the specified |
f02ff212 DM |
46 | timeout if we do not update the watchdog. For me it looks that neither |
47 | systemd nor the standard watchdog(8) daemon provides such guarantees. | |
48 | ||
49 | We could use the /dev/watchdog directly, but unfortunately this only | |
50 | allows one user. We need to protect at least two daemons, so we write | |
51 | our own watchdog daemon. This daemon work on /dev/watchdog, but | |
52 | provides that service to several other daemons using a local socket. | |
53 | ||
=== Self fencing ===

A node needs to acquire a special 'ha_agent_${node}_lock' (one separate
lock for each node) before starting HA resources, and the node updates
the watchdog device once it gets that lock. If the node loses quorum,
or is unable to get the 'ha_agent_${node}_lock', the watchdog is no
longer updated. The node can release the lock if there are no running
HA resources.

This makes sure that the node holds the 'ha_agent_${node}_lock' as
long as there are running services on that node.

The HA manager can assume that the watchdog triggered a reboot when it
is able to acquire the 'ha_agent_${node}_lock' for that node.

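A hedged sketch of that loop, where the helper subs are mere placeholders for
the real quorum, lock and watchdog interfaces (not actual PVE::HA code):

    use strict;
    use warnings;

    my $node = 'node1';

    sub have_quorum            { 1 }   # placeholder: ask the cluster stack
    sub acquire_agent_lock     { 1 }   # placeholder: cluster-wide lock with timeout
    sub local_services_running { 1 }   # placeholder: query local HA services
    sub update_watchdog        { }     # placeholder: pet the shared watchdog
    sub release_agent_lock     { }     # placeholder: drop the per-node lock

    while (1) {
        if (have_quorum() && acquire_agent_lock("ha_agent_${node}_lock")) {
            update_watchdog();         # keep the node alive while we hold the lock
        } elsif (!local_services_running()) {
            release_agent_lock("ha_agent_${node}_lock");
            last;                      # nothing to protect, safe to stop
        }
        # Otherwise: do not update the watchdog; the node will self-fence
        # (hard reset) once the watchdog timeout expires.
        sleep(10);
    }
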
==== Problems with "two_node" Clusters ====

This corosync option depends on a fence race condition, and only
works using reliable HW fence devices.

The above 'self fencing' algorithm does not work if you use this option!

Note that you can use a QDevice instead, i.e., an external, simple arbiter
process that does not need full corosync membership and thus has relaxed
network requirements.

=== Testing Requirements ===

We want to be able to simulate and test the behavior of an HA cluster, using
either a GUI or a CLI. This makes it easier to learn how the system behaves. We
also need a way to run regression tests.

== Implementation details ==

=== Cluster Resource Manager (class PVE::HA::CRM) ===

The Cluster Resource Manager (CRM) daemon runs on each node, but
locking makes sure only one CRM daemon acts in the 'master' role. That
'master' daemon reads the service configuration file, and requests new
service states by writing the global 'manager_status'. That data
structure is read by the Local Resource Manager, which performs the
real work (start/stop/migrate services).

==== Service Relocation ====

Some services, like a QEMU Virtual Machine, support live migration.
So the LRM can migrate those services without stopping them (CRM service state
'migrate').

Most other service types require the service to be stopped, and then restarted
at the other node. Stopped services are moved by the CRM (usually by simply
changing the service configuration).

==== Service Ordering and Co-location Constraints ====

There are no plans to implement this for the initial version, although it would
be possible and probably should be done for later versions.

==== Possible CRM Service States ====

stopped: Service is stopped (confirmed by LRM)

request_stop: Service should be stopped. Waiting for
              confirmation from LRM.

started: Service is active and LRM should start it asap.

fence: Wait for node fencing (service node is not inside
       quorate cluster partition).

recovery: Service gets recovered to a new node as its current one was
          fenced. Note that a service might be stuck here depending on the
          group/priority configuration.

freeze: Do not touch. We use this state while we reboot a node,
        or when we restart the LRM daemon.

migrate: Migrate (live) service to other node.

relocate: Migrate (stop, move, start) service to other node.

error: Service disabled because of LRM errors.

There's also an `ignored` state which tells the HA stack to ignore a service
completely, i.e., as if it wasn't under HA control at all.

=== Local Resource Manager (class PVE::HA::LRM) ===

The Local Resource Manager (LRM) daemon runs on each node, and
performs service commands (start/stop/migrate) for services assigned
to the local node. It should be mentioned that each LRM holds a
cluster wide 'ha_agent_${node}_lock' lock, and the CRM is not allowed
to assign the service to another node while the LRM holds that lock.

The LRM reads the requested service state from 'manager_status', and
tries to bring the local service into that state. The actual service
status is written back to the 'service_${node}_status', and can be
read by the CRM.

=== Pluggable Interface for Cluster Environment (class PVE::HA::Env) ===

This class defines an interface to the actual cluster environment:

* get node membership and quorum information

* get/release cluster wide locks

* get system time

* watchdog interface

* read/write cluster wide status files

We have plugins for several different environments:

* PVE::HA::Sim::TestEnv: the regression test environment

* PVE::HA::Sim::RTEnv: the graphical simulator

* PVE::HA::Env::PVE2: the real Proxmox VE cluster

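To sketch what such a plugin boils down to, here is a skeleton whose package
and method names are illustrative only, mirroring the bullet points above;
they are not necessarily the real PVE::HA::Env method names:

    package PVE::HA::Env::Example;   # hypothetical plugin, not shipped

    use strict;
    use warnings;

    sub new {
        my ($class, $nodename) = @_;
        return bless { nodename => $nodename }, $class;
    }

    # Illustrative interface, mirroring the list above:
    sub quorate          { my ($self) = @_; return 1; }         # membership/quorum
    sub get_cluster_lock { my ($self, $name) = @_; return 1; }  # locks with timeout
    sub release_lock     { my ($self, $name) = @_; }
    sub get_time         { return time(); }                     # system time
    sub watchdog_update  { my ($self) = @_; }                   # watchdog interface
    sub read_status      { my ($self, $name) = @_; return {}; } # status files
    sub write_status     { my ($self, $name, $data) = @_; }

    1;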