= Proxmox HA Manager =

== Motivation ==

The current HA manager has a bunch of drawbacks:

- no more development (Red Hat moved to pacemaker)

- highly depends on an old version of corosync

- complicated code (caused by the compatibility layer with the
  older cluster stack (cman))

- no self-fencing

In the future, we want to make HA easier for our users, and it should
be possible to move to the newest corosync, or even a totally different
cluster stack. So we want:

- possible to run with any distributed key/value store which provides
  some kind of locking with timeouts

- self fencing using the Linux watchdog device

- implemented in Perl, so that we can use the PVE framework

- only works with simple resources like VMs

= Architecture =

== Cluster requirements ==

=== Cluster wide locks with timeouts ===

The cluster stack must provide cluster wide locks with timeouts.
The Proxmox 'pmxcfs' implements this on top of corosync.

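To make that requirement concrete, here is a minimal toy sketch in Perl
of the locking semantics the HA stack depends on: a named, cluster wide
lock that is granted for a limited time and must be renewed by its
owner. This is not the pmxcfs API; all names are invented for
illustration.

  use strict;
  use warnings;

  # Toy model of a cluster wide lock with timeout (names are illustrative)
  my %locks;    # lock name => { owner => ..., expires => ... }

  sub acquire_lock {
      my ($name, $owner, $timeout, $now) = @_;
      my $lock = $locks{$name};
      # grant the lock if it is free, expired, or already ours (renewal)
      if (!$lock || $lock->{expires} <= $now || $lock->{owner} eq $owner) {
          $locks{$name} = { owner => $owner, expires => $now + $timeout };
          return 1;
      }
      return 0;    # still held by another owner
  }

  sub release_lock {
      my ($name, $owner) = @_;
      delete $locks{$name}
          if $locks{$name} && $locks{$name}->{owner} eq $owner;
  }

A crashed node simply stops renewing its lock; once the timeout
expires, the lock becomes available again.
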
== Self fencing ==

A node needs to acquire a special 'ha_agent_${node}_lock' (one separate
lock for each node) before starting HA resources, and the node updates
the watchdog device once it gets that lock. If the node loses quorum,
or is unable to get the 'ha_agent_${node}_lock', the watchdog is no
longer updated. The node can release the lock if there are no running
HA resources.

This makes sure that the node holds the 'ha_agent_${node}_lock' as
long as there are running services on that node.

The HA manager can assume that the watchdog triggered a reboot when it
is able to acquire the 'ha_agent_${node}_lock' for that node.

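The resulting rule is simple enough to state in code. The following is
only an illustrative sketch; the subroutine and the methods on $env are
invented here, not the real agent code.

  # The watchdog is only updated while the node is quorate and
  # holds its own agent lock.
  sub agent_tick {
      my ($env, $node) = @_;
      if ($env->quorate() && $env->get_lock("ha_agent_${node}_lock")) {
          $env->watchdog_update();    # keep the hardware watchdog from firing
      }
      # otherwise: do nothing; the watchdog times out and resets the node
  }
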
== Testing requirements ==

We want to be able to simulate an HA cluster using a GUI. This makes it
easier to learn how the system behaves. We also need a way to run
regression tests.

= Implementation details =

== Cluster Resource Manager (class PVE::HA::CRM) ==

The Cluster Resource Manager (CRM) daemon runs on each node, but
locking makes sure only one CRM daemon acts in the 'master' role. That
'master' daemon reads the service configuration file, and requests new
service states by writing the global 'manager_status'. That data
structure is read by the Local Resource Manager, which performs the
real work (start/stop/migrate services).

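As a rough illustration, the 'manager_status' data can be pictured as a
structure like the one below. The field names are invented for this
example; the real format is defined by the implementation.

  my $manager_status = {
      master_node    => 'node1',
      service_status => {
          # requested state and assigned node for each service
          'vm:100' => { state => 'started',      node => 'node1' },
          'vm:101' => { state => 'request_stop', node => 'node2' },
      },
  };
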
=== Service Relocation ===

Some services, like Qemu Virtual Machines, support live migration.
So the LRM can migrate those services without stopping them (CRM
service state 'migrate').

Most other service types require the service to be stopped, and
then restarted on the other node. We use the following CRM service
state transitions: 'relocate_stop' => 'relocate_move' => 'started'

Stopped services are moved using service state 'move'. It has to be
noted that service relocation is always done using the LRM (the LRM
'owns' the service), unless a node is fenced. In that case the CRM
is allowed to 'steal' the resource and move it to another node.

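The relocation sequence for such services can be pictured as a small
transition map (illustrative only, not the actual CRM code):

  # each step is confirmed by the LRM before the CRM requests the next one
  my %next_relocation_state = (
      'relocate_stop' => 'relocate_move',   # stop the service on the source node
      'relocate_move' => 'started',         # move ownership, then start on target
  );
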
=== Possible CRM Service States ===

stopped: Service is stopped (confirmed by LRM)

request_stop: Service should be stopped. Waiting for
    confirmation from LRM.

started: Service is active and the LRM should start it asap.

fence: Wait for node fencing (service node is not inside
    quorate cluster partition).

migrate: Migrate (live) service to other node.

error: Service disabled because of LRM errors.

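For illustration, the state set listed above could be encoded and
checked like this (not the actual CRM code):

  my %valid_service_state = map { $_ => 1 }
      qw(stopped request_stop started fence migrate error);

  sub check_service_state {
      my ($state) = @_;
      die "invalid service state '$state'\n"
          if !$valid_service_state{$state};
  }
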
== Local Resource Manager (class PVE::HA::LRM) ==

The Local Resource Manager (LRM) daemon runs on each node, and
performs service commands (start/stop/migrate) for services assigned
to the local node. It should be mentioned that each LRM holds a
cluster wide 'ha_agent_${node}_lock' lock, and the CRM is not allowed
to assign the service to another node while the LRM holds that lock.

The LRM reads the requested service state from 'manager_status', and
tries to bring the local service into that state. The actual service
status is written back to 'service_${node}_status', and can be
read by the CRM.

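A much simplified sketch of that control loop is shown below. The
methods on $env and the status layout are invented for illustration;
the real LRM also handles errors, locking and many more cases.

  sub lrm_work {
      my ($env, $node) = @_;

      my $ms  = $env->read_manager_status();
      my $sls = {};    # local service status, reported back to the CRM

      foreach my $sid (keys %{$ms->{service_status}}) {
          my $sd = $ms->{service_status}->{$sid};
          next if $sd->{node} ne $node;    # only services assigned to us

          # try to bring the local service into the requested state
          if ($sd->{state} eq 'started') {
              $env->start_service($sid);
          } elsif ($sd->{state} eq 'request_stop') {
              $env->stop_service($sid);
          }
          $sls->{$sid} = { state => $env->query_service_state($sid) };
      }

      $env->write_local_status($node, $sls);
  }
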
== Pluggable Interface for cluster environment (class PVE::HA::Env) ==

This class defines an interface to the actual cluster environment:

* get node membership and quorum information

* get/release cluster wide locks

* get system time

* watchdog interface

* read/write cluster wide status files

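A stripped down sketch of what such an interface can look like as a
Perl class follows. The package and method names are simplified for
illustration; the real interface is defined in the PVE::HA::Env source.

  package My::HA::Env;    # illustrative name, not the real package

  use strict;
  use warnings;

  sub new {
      my ($class, %opts) = @_;
      return bless { %opts }, $class;
  }

  # node membership and quorum information
  sub quorate { die "implement in subclass" }

  # cluster wide locks (with timeout), e.g. 'ha_agent_${node}_lock'
  sub get_lock     { die "implement in subclass" }
  sub release_lock { die "implement in subclass" }

  # system time (abstracted so the simulator can fake it)
  sub get_time { die "implement in subclass" }

  # watchdog interface
  sub watchdog_open   { die "implement in subclass" }
  sub watchdog_update { die "implement in subclass" }

  # read/write cluster wide status files
  sub read_status  { die "implement in subclass" }
  sub write_status { die "implement in subclass" }

  1;

The plugins listed below each provide a concrete implementation of such
an interface for their environment.
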
We have plugins for several different environments:

* PVE::HA::Sim::TestEnv: the regression test environment

* PVE::HA::Sim::RTEnv: the graphical simulator

* PVE::HA::Env::PVE2: the real Proxmox VE cluster