= Proxmox HA Manager =

== Motivation ==

The current HA manager has a bunch of drawbacks:

- no more development (Red Hat moved to pacemaker)

- highly dependent on an old version of corosync

- complicated code (caused by the compatibility layer with the
  older cluster stack, cman)

- no self-fencing

In the future, we want to make HA easier for our users, and it should
be possible to move to the newest corosync, or even a totally different
cluster stack. So we want:

- possibility to run with any distributed key/value store which provides
  some kind of locking with timeouts (zookeeper, consul, etcd, ..) 

- self-fencing using the Linux watchdog device

- implemented in Perl, so that we can use the PVE framework

- only work with simple resources like VMs

We dropped the idea of assembling complex, dependent services, because we
think this is already done with the VM abstraction.

= Architecture =

== Cluster requirements ==

=== Cluster wide locks with timeouts ===

The cluster stack must provide cluster wide locks with timeouts.
The Proxmox 'pmxcfs' implements this on top of corosync.
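
As a rough illustration of what "locks with timeouts" means here, the
following Python sketch models a lock that expires unless its holder
renews it. It is not the pmxcfs implementation (and not the Perl code of
this project); the class name `TimedLock`, its methods, the explicit
`now` argument and the 120 second timeout in the usage note are all
assumptions made for the example.

```python
class TimedLock:
    """In-memory model of a lock that expires unless renewed (hypothetical)."""

    def __init__(self, timeout):
        self.timeout = timeout      # seconds a lock stays valid without renewal
        self.holder = None          # name of the current holder, if any
        self.expires = 0.0          # time at which the current lock expires

    def acquire(self, who, now):
        """Acquire or renew the lock; return True on success."""
        held_by_other = self.holder is not None and self.holder != who
        if held_by_other and now < self.expires:
            return False            # still validly held by someone else
        self.holder = who
        self.expires = now + self.timeout
        return True

    def release(self, who):
        """Give up the lock early; only the current holder may do so."""
        if self.holder == who:
            self.holder = None
```

With `TimedLock(timeout=120)`, a holder that keeps calling `acquire`
retains the lock, while a holder that stops renewing loses it once the
timeout has passed, so another node can take over without any
cooperation from the old holder.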

=== Watchdog ===

We need a reliable watchdog mechanism, which is able to provide hard
timeouts. It must be guaranteed that the node reboots within the
specified timeout if we do not update the watchdog. It appears that
neither systemd nor the standard watchdog(8) daemon provides such
guarantees.

We could use /dev/watchdog directly, but unfortunately this only
allows one user. We need to protect at least two daemons, so we write
our own watchdog daemon. This daemon works on /dev/watchdog, but
provides that service to several other daemons using a local socket.
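
One way such a daemon could decide whether to keep the hardware watchdog
alive is sketched below in Python. This only models the bookkeeping; it
does not open /dev/watchdog or a socket. The class `WatchdogMux`, its
method names and the per-client timeout are assumptions for the example,
not the actual daemon.

```python
class WatchdogMux:
    """Model of one watchdog owner serving several clients (hypothetical)."""

    def __init__(self, client_timeout):
        self.client_timeout = client_timeout  # seconds a client may stay silent
        self.last_update = {}                 # client name -> time of last update

    def open(self, client, now):
        """A client asks to be protected from now on."""
        self.last_update[client] = now

    def update(self, client, now):
        """A protected client reports that it is still alive."""
        if client in self.last_update:
            self.last_update[client] = now

    def close(self, client):
        """A client cleanly ends its protection."""
        self.last_update.pop(client, None)

    def may_update_hardware(self, now):
        """True only if no protected client has missed its deadline."""
        return all(now - t <= self.client_timeout
                   for t in self.last_update.values())
```

The design choice shown here is that a single silent client is enough
to stop updates of the hardware device, so every protected daemon gets
the same hard timeout it would have if it owned the device alone.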

== Self fencing ==

A node needs to acquire a special 'ha_agent_${node}_lock' (one separate
lock for each node) before starting HA resources, and the node updates
the watchdog device once it gets that lock. If the node loses quorum,
or is unable to get the 'ha_agent_${node}_lock', the watchdog is no
longer updated. The node can release the lock if there are no running
HA resources.

This makes sure that the node holds the 'ha_agent_${node}_lock' as
long as there are running services on that node.

The HA manager can assume that the watchdog triggered a reboot when it
is able to acquire the 'ha_agent_${node}_lock' for that node.
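
The reasoning can be sketched in Python as follows. The function names
and both timeout values are assumptions chosen for illustration; the
sketch further assumes that the lock timeout is longer than the watchdog
timeout, since otherwise the lock could become free before the node has
rebooted.

```python
WATCHDOG_TIMEOUT = 60    # assumed value, in seconds
LOCK_TIMEOUT = 120       # assumed value; must exceed WATCHDOG_TIMEOUT


def node_updates_watchdog(quorate, holds_agent_lock):
    """A node keeps its watchdog alive only while quorate and holding its lock."""
    return quorate and holds_agent_lock


def fencing_timeline(failure_time):
    """Return (reboot_time, lock_free_time) for a node that stops updating
    both its watchdog and its agent lock at failure_time."""
    reboot_time = failure_time + WATCHDOG_TIMEOUT      # latest possible reboot
    lock_free_time = failure_time + LOCK_TIMEOUT       # earliest lock takeover
    return reboot_time, lock_free_time
```

For a node that fails at time 1000, `fencing_timeline(1000)` gives a
reboot no later than 1060 and a lock that becomes available at 1120, so
a manager that obtains the lock knows the watchdog deadline has already
passed.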

=== Problems with "two_node" Clusters ===

This corosync option depends on a fence race condition, and only
works using reliable HW fence devices.

The above 'self fencing' algorithm does not work if you use this option!

== Testing requirements ==

We want to be able to simulate an HA cluster, using a GUI. This makes it easier
to learn how the system behaves. We also need a way to run regression tests.

= Implementation details =

== Cluster Resource Manager (class PVE::HA::CRM) ==

The Cluster Resource Manager (CRM) daemon runs on each node, but
locking makes sure only one CRM daemon acts in the 'master' role. That
'master' daemon reads the service configuration file, and requests new
service states by writing the global 'manager_status'. That data
structure is read by the Local Resource Manager, which performs the
real work (start/stop/migrate services).

=== Service Relocation ===

Some services, like Qemu Virtual Machines, support live migration.
So the LRM can migrate those services without stopping them (CRM
service state 'migrate').

Most other service types require the service to be stopped, and then
restarted on the other node. Stopped services are moved by the CRM
(usually by simply changing the service configuration).

=== Possible CRM Service States ===

stopped:      Service is stopped (confirmed by LRM)

request_stop: Service should be stopped. Waiting for 
	      confirmation from LRM.

started:      Service is active and the LRM should start it asap.

fence:        Wait for node fencing (service node is not inside
	      quorate cluster partition).

freeze:       Do not touch. We use this state while we reboot a node,
	      or when we restart the LRM daemon.

migrate:      Migrate (live) service to other node.

error:        Service disabled because of LRM errors.
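
The state list above can be written down as a small Python enumeration.
The state names come from this document; the `ServiceState` class and
the `lrm_action` helper (including the command strings it returns) are
assumptions for illustration and do not mirror the Perl code.

```python
from enum import Enum


class ServiceState(Enum):
    """CRM service states as listed in this document."""
    STOPPED = "stopped"
    REQUEST_STOP = "request_stop"
    STARTED = "started"
    FENCE = "fence"
    FREEZE = "freeze"
    MIGRATE = "migrate"
    ERROR = "error"


def lrm_action(state):
    """Return the command the LRM would run for a state, or None (hypothetical)."""
    actions = {
        ServiceState.STARTED: "start",
        ServiceState.REQUEST_STOP: "stop",
        ServiceState.MIGRATE: "migrate",
    }
    # stopped, fence, freeze and error require no command from the LRM
    return actions.get(state)
```

For example, `lrm_action(ServiceState("request_stop"))` yields `"stop"`,
while `lrm_action(ServiceState.FREEZE)` yields `None` because a frozen
service must not be touched.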


== Local Resource Manager (class PVE::HA::LRM) ==

The Local Resource Manager (LRM) daemon runs on each node, and
performs service commands (start/stop/migrate) for services assigned
to the local node. Note that each LRM holds a cluster wide
'ha_agent_${node}_lock' lock, and the CRM is not allowed to assign
the service to another node while the LRM holds that lock.

The LRM reads the requested service state from 'manager_status', and
tries to bring the local service into that state. The actual service
status is written back to the 'service_${node}_status', and can be
read by the CRM.
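
A single pass of this read/act/write-back cycle might look like the
Python sketch below. The layout of 'manager_status' used here (a mapping
from service ID to node and requested state), the function name
`lrm_pass` and the `run_command` callback are assumptions made for the
example; the document does not specify the real data format.

```python
def lrm_pass(node, manager_status, run_command):
    """One LRM pass: act on local services, return the status to write back.

    manager_status: {service_id: {"node": ..., "state": ...}} (assumed layout)
    run_command:    callable(service_id, state) returning a status string
    """
    service_status = {}
    for sid, entry in sorted(manager_status.items()):
        if entry["node"] != node:
            continue                    # service is assigned to another node
        service_status[sid] = run_command(sid, entry["state"])
    return service_status
```

The returned dictionary stands in for the content of
'service_${node}_status'; passing the command runner as a callback keeps
the loop testable without touching real services.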

== Pluggable Interface for cluster environment (class PVE::HA::Env) ==

This class defines an interface to the actual cluster environment:

* get node membership and quorum information

* get/release cluster wide locks

* get system time

* watchdog interface

* read/write cluster wide status files 

We have plugins for several different environments:

* PVE::HA::Sim::TestEnv: the regression test environment

* PVE::HA::Sim::RTEnv: the graphical simulator

* PVE::HA::Env::PVE2: the real Proxmox VE cluster
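
The shape of such a pluggable interface can be sketched in Python with
an abstract base class and an in-memory test double. All class and
method names below (`Env`, `FakeEnv`, `get_nodes`, `get_lock`, ...) are
hypothetical and chosen only to mirror the bullet list above; they are
not the method names of the Perl classes.

```python
from abc import ABC, abstractmethod


class Env(ABC):
    """Operations the HA code needs from its environment (names assumed)."""

    @abstractmethod
    def get_nodes(self):
        """Return the list of known cluster member names."""

    @abstractmethod
    def quorate(self):
        """Return True if this node is in the quorate partition."""

    @abstractmethod
    def get_lock(self, name):
        """Try to acquire a cluster wide lock; return True on success."""

    @abstractmethod
    def release_lock(self, name):
        """Release a cluster wide lock."""

    @abstractmethod
    def get_time(self):
        """Return the current time in seconds."""

    @abstractmethod
    def watchdog_update(self):
        """Reset the watchdog timer."""

    @abstractmethod
    def read_status(self, name):
        """Read a cluster wide status file."""

    @abstractmethod
    def write_status(self, name, data):
        """Write a cluster wide status file."""


class FakeEnv(Env):
    """Single-node, in-memory environment for tests (hypothetical)."""

    def __init__(self):
        self.time = 0                   # simulated clock, set by the test
        self.locks = set()
        self.status = {}
        self.watchdog_updates = []      # simulated times of watchdog resets

    def get_nodes(self):
        return ["node1"]

    def quorate(self):
        return True

    def get_lock(self, name):
        self.locks.add(name)
        return True

    def release_lock(self, name):
        self.locks.discard(name)

    def get_time(self):
        return self.time

    def watchdog_update(self):
        self.watchdog_updates.append(self.time)

    def read_status(self, name):
        return self.status.get(name)

    def write_status(self, name, data):
        self.status[name] = data
```

Because time and the watchdog are part of the interface, a test
environment can advance a simulated clock and record watchdog updates
instead of waiting in real time or rebooting a machine.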