= Proxmox HA Manager =

== Motivation ==

The current HA manager has a number of drawbacks:

- no more development (Red Hat moved to Pacemaker)

- highly depends on an old version of corosync

- complicated code (caused by the compatibility layer with the
  older cluster stack (cman))

- no self-fencing

In the future, we want to make HA easier for our users, and it should
be possible to move to the newest corosync, or even a totally different
cluster stack. So we want:

- possible to run with any distributed key/value store which provides
  some kind of locking with timeouts.

- self fencing using a Linux watchdog device

- implemented in Perl, so that we can use the PVE framework

- only works with simple resources like VMs

= Architecture =

== Cluster requirements ==

=== Cluster wide locks with timeouts ===

The cluster stack must provide cluster wide locks with timeouts.
The Proxmox 'pmxcfs' implements this on top of corosync.
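
To illustrate the required semantics, here is a minimal, single-process
Perl sketch. It is not the pmxcfs API; the 'Sim::LockStore' package and
its method names are made up. The point is only the contract: a lock
stays owned until its timeout expires, a holder renews it by
re-acquiring it, and a stale lock can be taken over.

  # Illustrative sketch only - real cluster wide locks come from 'pmxcfs'.
  # This single-process simulation just shows the contract: a lock stays
  # owned until its timeout expires, after which someone else may take it.
  package Sim::LockStore;

  use strict;
  use warnings;

  sub new {
      my ($class) = @_;
      return bless({ locks => {} }, $class);
  }

  # Try to acquire lock $name for $owner at time $now. Succeeds if the
  # lock is free, already held by $owner (renewal), or expired.
  sub acquire {
      my ($self, $name, $owner, $timeout, $now) = @_;
      my $lock = $self->{locks}->{$name};
      if (!$lock || $lock->{owner} eq $owner || $now >= $lock->{expires}) {
          $self->{locks}->{$name} = { owner => $owner, expires => $now + $timeout };
          return 1;
      }
      return 0;
  }

  package main;

  my $store = Sim::LockStore->new();
  print $store->acquire('ha_agent_node1_lock', 'node1', 120, 0), "\n";   # 1 - free
  print $store->acquire('ha_agent_node1_lock', 'node2', 120, 60), "\n";  # 0 - still held
  print $store->acquire('ha_agent_node1_lock', 'node2', 120, 130), "\n"; # 1 - lock expired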

== Self fencing ==

A node needs to acquire a special 'ha_agent_${node}_lock' (one separate
lock for each node) before starting HA resources, and the node updates
the watchdog device once it gets that lock. If the node loses quorum,
or is unable to get the 'ha_agent_${node}_lock', the watchdog is no
longer updated. The node can release the lock if there are no running
HA resources.

This makes sure that the node holds the 'ha_agent_${node}_lock' as
long as there are running services on that node.

The HA manager can assume that the watchdog triggered a reboot when it
is able to acquire the 'ha_agent_${node}_lock' for that node.
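
The following sketch shows the basic idea, assuming a hypothetical
get_ha_agent_lock() helper that stands in for renewing the node's lock
through the cluster stack; the real daemon goes through the watchdog
interface of PVE::HA::Env. The loop simply stops petting the standard
Linux watchdog device once the lock can no longer be renewed, so the
watchdog timer expires and reboots the node.

  # Minimal self-fencing sketch. get_ha_agent_lock() is a hypothetical
  # placeholder for renewing the cluster wide 'ha_agent_${node}_lock';
  # the real daemon uses the watchdog interface of PVE::HA::Env.
  use strict;
  use warnings;

  sub get_ha_agent_lock { return 1; }   # placeholder: ask the cluster stack

  open(my $wd, '>', '/dev/watchdog')
      or die "unable to open watchdog device: $!\n";

  while (1) {
      if (get_ha_agent_lock()) {
          # Lock renewed (node is quorate) - reset the watchdog timer.
          syswrite($wd, "\0");
      }
      # Otherwise do nothing: the watchdog expires and reboots the node.
      sleep(10);
  }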

== Testing requirements ==

We want to be able to simulate an HA cluster using a GUI. This makes it easier
to learn how the system behaves. We also need a way to run regression tests.

= Implementation details =

== Cluster Resource Manager (class PVE::HA::CRM) ==

The Cluster Resource Manager (CRM) daemon runs on each node, but
locking makes sure only one CRM daemon acts in the 'master' role. That
'master' daemon reads the service configuration file, and requests new
service states by writing the global 'manager_status'. That data
structure is read by the Local Resource Manager, which performs the
real work (start/stop/migrate services).
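
A rough sketch of what one 'master' round could look like; the data
layout and the manage_services() helper are assumptions for
illustration, not the actual PVE::HA::CRM code.

  # Illustrative only: turn the requested service configuration into new
  # service states inside the global 'manager_status' structure, which
  # the LRMs then pick up and execute.
  use strict;
  use warnings;

  sub manage_services {
      my ($service_config, $manager_status) = @_;

      foreach my $sid (sort keys %$service_config) {
          my $req = $service_config->{$sid}->{state} // 'started';
          my $ss = $manager_status->{service_status}->{$sid} //= {};

          # request the new state; the real code also handles fencing,
          # relocation and error handling (see the state list below)
          $ss->{state} = ($req eq 'stopped') ? 'request_stop' : 'started';
      }
      return $manager_status;
  }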

=== Service Relocation ===

Some services, like QEMU virtual machines, support live migration.
So the LRM can migrate those services without stopping them (CRM
service state 'migrate').

Most other service types require the service to be stopped, and
then restarted on the other node. We use the following CRM service
state transitions: 'relocate_stop' => 'relocate_move' => 'started'

Stopped services are moved using service state 'move'. It has to be
noted that service relocation is always done by the LRM (the LRM
'owns' the service), unless a node is fenced. In that case the CRM
is allowed to 'steal' the resource and move it to another node.
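
Expressed as a simple next-state table (again only an illustration, not
the actual implementation), the relocation chain looks like this:

  # Illustrative next-state table for the relocation chain above;
  # next_relocation_state() is not a real PVE::HA::CRM function.
  use strict;
  use warnings;

  my $relocate_next = {
      'relocate_stop' => 'relocate_move',  # LRM stops the service on the old node
      'relocate_move' => 'started',        # service is reassigned and started again
  };

  sub next_relocation_state {
      my ($state) = @_;
      return $relocate_next->{$state} // $state;  # unchanged if not relocating
  }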

=== Possible CRM Service States ===

stopped: Service is stopped (confirmed by LRM)

request_stop: Service should be stopped. Waiting for
confirmation from LRM.

started: Service is active, and the LRM should start it asap.

fence: Wait for node fencing (service node is not inside
quorate cluster partition).

migrate: Migrate (live) service to other node.

error: Service disabled because of LRM errors.
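
For illustration, the same set could be kept in a small lookup table
used to validate requested states; this helper is hypothetical and not
part of the actual code.

  # Illustrative lookup table of the states listed above; the relocation
  # states from the previous section would be added here as well.
  use strict;
  use warnings;

  my $valid_crm_states = {
      stopped      => 1,
      request_stop => 1,
      started      => 1,
      fence        => 1,
      migrate      => 1,
      error        => 1,
  };

  sub assert_valid_state {
      my ($state) = @_;
      die "unknown CRM service state '$state'\n"
          if !$valid_crm_states->{$state};
  }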

== Local Resource Manager (class PVE::HA::LRM) ==

The Local Resource Manager (LRM) daemon runs on each node, and
performs service commands (start/stop/migrate) for services assigned
to the local node. It should be mentioned that each LRM holds a
cluster wide 'ha_agent_${node}_lock' lock, and the CRM is not allowed
to assign the service to another node while the LRM holds that lock.

The LRM reads the requested service state from 'manager_status', and
tries to bring the local service into that state. The actual service
status is written back to 'service_${node}_status', and can be
read by the CRM.
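
A rough, hypothetical sketch of one such LRM round (the data layout and
helper name are assumptions): compare the requested state with the
local one and run the matching command.

  # Illustrative only: bring local services into the state requested by
  # the CRM. Command execution is stubbed out; the real LRM would start,
  # stop or migrate the actual services (e.g. VMs) here.
  use strict;
  use warnings;

  sub run_lrm_round {
      my ($node, $manager_status, $local_status) = @_;

      foreach my $sid (keys %{$manager_status->{service_status}}) {
          my $ss = $manager_status->{service_status}->{$sid};
          next if ($ss->{node} // '') ne $node;    # not assigned to this node

          my $want = $ss->{state} // 'stopped';
          my $have = $local_status->{$sid} // 'stopped';

          if ($want eq 'started' && $have ne 'started') {
              $local_status->{$sid} = 'started';   # would start the service
          } elsif ($want eq 'request_stop' && $have ne 'stopped') {
              $local_status->{$sid} = 'stopped';   # would stop the service
          }
      }
      return $local_status;   # written back to 'service_${node}_status'
  }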

== Pluggable Interface for cluster environment (class PVE::HA::Env) ==

This class defines an interface to the actual cluster environment:

* get node membership and quorum information

* get/release cluster wide locks

* get system time

* watchdog interface

* read/write cluster wide status files

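As a sketch of what such a pluggable interface can look like in Perl
(the method names are illustrative placeholders, not necessarily the
real PVE::HA::Env methods), each plugin would override a small set of
methods:

  # Illustrative base class; the method names are placeholders, not
  # necessarily the real PVE::HA::Env methods. Each environment plugin
  # overrides them with whatever its cluster stack provides.
  package Example::HA::Env;

  use strict;
  use warnings;

  sub new { my ($class) = @_; return bless({}, $class); }

  sub quorate          { die "implement in subclass\n" }  # membership/quorum
  sub get_cluster_lock { die "implement in subclass\n" }  # cluster wide locks
  sub release_lock     { die "implement in subclass\n" }
  sub get_time         { die "implement in subclass\n" }  # system time
  sub watchdog_update  { die "implement in subclass\n" }  # watchdog interface
  sub read_status      { die "implement in subclass\n" }  # cluster wide status files
  sub write_status     { die "implement in subclass\n" }

  1;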

We have plugins for several different environments:

* PVE::HA::Sim::TestEnv: the regression test environment

* PVE::HA::Sim::RTEnv: the graphical simulator

* PVE::HA::Env::PVE2: the real Proxmox VE cluster