= Proxmox HA Manager =

Note that this README was written as early development planning/documentation
in 2015; even though small updates were made in 2023, it might be a bit out of
date. For usage documentation see the official reference docs shipped with your
Proxmox VE installation or, for your convenience, hosted at:
https://pve.proxmox.com/pve-docs/chapter-ha-manager.html

== History & Motivation ==

The `rgmanager` HA stack used in Proxmox VE 3.x has a bunch of drawbacks:

- no more development (Red Hat moved to pacemaker)
- highly depends on an old version of corosync
- complicated code (caused by the compatibility layer with the older cluster
  stack (cman))
- no self-fencing

For Proxmox VE 4.0 we thus required a new HA stack and also wanted to make HA
easier for our users while also making it possible to move to the newest
corosync, or even a totally different cluster stack. So, the following core
requirements got set out:

- possibility to run with any distributed key/value store which provides some
  kind of locking with timeouts (zookeeper, consul, etcd, ..)
- self fencing using the Linux watchdog device
- implemented in Perl, so that we can use the PVE framework
- only work with simple resources like VMs

We dropped the idea to assemble complex, dependent services, because we think
this is already done with the VM/CT abstraction.

== Architecture ==

Cluster requirements.

=== Cluster wide locks with timeouts ===

The cluster stack must provide cluster-wide locks with timeouts.
The Proxmox 'pmxcfs' implements this on top of corosync.

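To make the required semantics concrete, here is a minimal, purely
illustrative Perl sketch of a lock with a timeout, roughly how a simulation
environment could model it in memory (the real implementation lives in
'pmxcfs' on top of corosync, not here):

    use strict;
    use warnings;

    # illustrative only - in-memory model of a cluster lock with timeout
    my $locks = {};            # lockid => { node => ..., expire => ... }
    my $lock_timeout = 120;    # seconds until an unrenewed lock expires

    sub get_lock {
        my ($lockid, $nodename) = @_;
        my $now = time();
        my $entry = $locks->{$lockid};

        # free, already ours, or the previous holder let it expire
        if (!$entry || $entry->{node} eq $nodename || $entry->{expire} < $now) {
            $locks->{$lockid} = { node => $nodename, expire => $now + $lock_timeout };
            return 1;    # acquired or renewed
        }
        return 0;        # held by another node and not yet expired
    }
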
=== Watchdog ===

We need a reliable watchdog mechanism, which is able to provide hard
timeouts. It must be guaranteed that the node reboots within the specified
timeout if we do not update the watchdog. As far as we can tell, neither
systemd nor the standard watchdog(8) daemon provides such guarantees.

We could use /dev/watchdog directly, but unfortunately this only
allows one user. We need to protect at least two daemons, so we write
our own watchdog daemon. This daemon works on /dev/watchdog, but
provides that service to several other daemons using a local socket.

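For reference, the basic /dev/watchdog protocol such a daemon builds on is
simple: any write resets the hardware timer, and a 'magic close' (writing "V"
before closing) disables the watchdog again on a clean shutdown. A minimal
Perl sketch (illustrative only, not the actual watchdog-mux implementation):

    use strict;
    use warnings;
    use Fcntl qw(O_WRONLY);

    sub watchdog_open {
        sysopen(my $fh, '/dev/watchdog', O_WRONLY)
            or die "unable to open watchdog device - $!\n";
        return $fh;
    }

    sub watchdog_update {
        my ($fh) = @_;
        syswrite($fh, "\0", 1);    # any write counts as a keep-alive ping
    }

    sub watchdog_close {
        my ($fh) = @_;
        syswrite($fh, "V", 1);     # magic close - intentional shutdown
        close($fh);
    }
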
de225e04 54=== Self fencing ===
7cdfa499 55
63f6a08c 56A node needs to acquire a special 'ha_agent_${node}_lock' (one separate
b101fa0c
DM
57lock for each node) before starting HA resources, and the node updates
58the watchdog device once it get that lock. If the node loose quorum,
59or is unable to get the 'ha_agent_${node}_lock', the watchdog is no
60longer updated. The node can release the lock if there are no running
61HA resources.
7cdfa499 62
b101fa0c
DM
63This makes sure that the node holds the 'ha_agent_${node}_lock' as
64long as there are running services on that node.
7cdfa499
DM
65
66The HA manger can assume that the watchdog triggered a reboot when he
63f6a08c 67is able to acquire the 'ha_agent_${node}_lock' for that node.
b101fa0c 68
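The resulting rule can be summarized in a small sketch (the helper names are
hypothetical stand-ins, not the real PVE::HA API):

    my $node = 'node1';

    while (1) {
        if (get_ha_agent_lock($node)) {
            # we hold 'ha_agent_${node}_lock' - keep the watchdog updated
            watchdog_update();
        } elsif (!has_active_services($node)) {
            # nothing running - safe to release the lock and stop updating
            release_ha_agent_lock($node);
        }
        # else: lock lost while services are running -> stop updating,
        # the watchdog will reboot the node (self-fencing)
        sleep(10);
    }
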
==== Problems with "two_node" Clusters ====

This corosync option depends on a fence race condition, and only
works using reliable HW fence devices.

The above 'self fencing' algorithm does not work if you use this option!

Note that you can use a QDevice instead, i.e., an external, simple (no full
corosync membership, so relaxed networking requirements) vote arbiter process.

=== Testing Requirements ===

We want to be able to simulate and test the behavior of a HA cluster, using
either a GUI or a CLI. This makes it easier to learn how the system behaves. We
also need a way to run regression tests.

== Implementation details ==

=== Cluster Resource Manager (class PVE::HA::CRM) ===

The Cluster Resource Manager (CRM) daemon runs on each node, but
locking makes sure only one CRM daemon acts in the 'master' role. That
'master' daemon reads the service configuration file, and requests new
service states by writing the global 'manager_status'. That data
structure is read by the Local Resource Manager, which performs the
real work (start/stop/migrate) on services.

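To give an idea of how that communication works, here is a hypothetical,
simplified shape of the 'manager_status' data (the field names are
illustrative, not necessarily the exact ones used by the implementation):

    my $manager_status = {
        master_node => 'node1',
        service_status => {
            'vm:100' => { state => 'started', node => 'node2' },
            'vm:101' => { state => 'migrate', node => 'node2', target => 'node3' },
        },
    };
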
==== Service Relocation ====

Some services, like a QEMU Virtual Machine, support live migration,
so the LRM can migrate those services without stopping them (CRM service state
'migrate').

Most other service types require the service to be stopped, and then restarted
at the other node. Stopped services are moved by the CRM (usually by simply
changing the service configuration).

==== Service Ordering and Co-location Constraints ====

There are no plans to implement this for the initial version, although it would
be possible and probably should be done for later versions.

==== Possible CRM Service States ====

stopped: Service is stopped (confirmed by LRM).

request_stop: Service should be stopped. Waiting for
              confirmation from LRM.

started: Service is active and the LRM should start it asap.

fence: Wait for node fencing (service node is not inside the
       quorate cluster partition).

recovery: Service gets recovered to a new node as its current one was
          fenced. Note that a service might be stuck here depending on the
          group/priority configuration.

freeze: Do not touch. We use this state while we reboot a node,
        or when we restart the LRM daemon.

migrate: Migrate (live) service to other node.

relocate: Migrate (stop, move, start) service to other node.

error: Service disabled because of LRM errors.

There's also an `ignored` state which tells the HA stack to ignore a service
completely, i.e., as if it wasn't under HA control at all.

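As a quick illustration, the full set of states can be captured in a simple
lookup table (a sketch, not taken from the actual code):

    my $valid_service_states = {
        stopped => 1, request_stop => 1, started => 1, fence => 1,
        recovery => 1, freeze => 1, migrate => 1, relocate => 1,
        error => 1, ignored => 1,
    };

    sub check_service_state {
        my ($state) = @_;
        die "invalid service state '$state'\n"
            if !$valid_service_states->{$state};
    }
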
=== Local Resource Manager (class PVE::HA::LRM) ===

The Local Resource Manager (LRM) daemon runs on each node, and
performs service commands (start/stop/migrate) for services assigned
to the local node. It should be mentioned that each LRM holds a
cluster wide 'ha_agent_${node}_lock' lock, and the CRM is not allowed
to assign the service to another node while the LRM holds that lock.

The LRM reads the requested service state from 'manager_status', and
tries to bring the local service into that state. The actual service
status is written back to 'service_${node}_status', and can be
read by the CRM.

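Put together, the LRM work loop roughly looks like this (hypothetical helper
names, not the real PVE::HA::LRM code):

    my $nodename = 'node1';

    while (1) {
        my $manager_status = read_manager_status();

        foreach my $sid (services_assigned_to($manager_status, $nodename)) {
            my $req = $manager_status->{service_status}->{$sid}->{state};
            my $res = execute_service_command($sid, $req);  # start/stop/migrate
            record_service_result($sid, $res);
        }

        write_service_status($nodename);   # publish 'service_${node}_status'
        sleep(5);
    }
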
=== Pluggable Interface for Cluster Environment (class PVE::HA::Env) ===

This class defines an interface to the actual cluster environment:

* get node membership and quorum information

* get/release cluster wide locks

* get system time

* watchdog interface

* read/write cluster wide status files

We have plugins for several different environments:

* PVE::HA::Sim::TestEnv: the regression test environment

* PVE::HA::Sim::RTEnv: the graphical simulator

* PVE::HA::Env::PVE2: the real Proxmox VE cluster
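
A new environment plugin is essentially a Perl class implementing those
methods. The following stub is only a sketch under the assumption of a
simplified interface; the method names are illustrative and may differ from
the exact PVE::HA::Env API:

    package PVE::HA::Env::MyTestEnv;   # hypothetical example plugin

    use strict;
    use warnings;

    sub new {
        my ($class, $nodename) = @_;
        return bless { nodename => $nodename }, $class;
    }

    sub nodename { return $_[0]->{nodename}; }

    sub quorate { return 1; }               # membership/quorum information

    sub get_time { return time(); }         # system time

    sub get_ha_agent_lock { return 1; }     # cluster wide locks

    sub watchdog_update { return 1; }       # watchdog interface

    sub read_manager_status { return {}; }  # cluster wide status files

    1;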