= Proxmox HA Manager =

== Motivation ==

The current HA manager has a bunch of drawbacks:

- no more development (Red Hat moved to pacemaker)

- highly depends on an old version of corosync

- complicated code (caused by the compatibility layer with
  the older cluster stack (cman))

- no self-fencing

In the future, we want to make HA easier for our users, and it should
be possible to move to the newest corosync, or even a totally different
cluster stack. So we want:

- possibility to run with any distributed key/value store which provides
  some kind of locking with timeouts (zookeeper, consul, etcd, ..)

- self fencing using the Linux watchdog device

- implemented in Perl, so that we can use the PVE framework

- only works with simple resources like VMs

We dropped the idea of assembling complex, dependent services, because we
think this is already done with the VM abstraction.

= Architecture =

== Cluster requirements ==

=== Cluster wide locks with timeouts ===

The cluster stack must provide cluster wide locks with timeouts.
The Proxmox 'pmxcfs' implements this on top of corosync.

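To illustrate the semantics we rely on, here is a minimal sketch that
simulates such a lock with a local directory and mtime based expiry. The
helper names and the 120 second timeout are assumptions for illustration,
not the real pmxcfs implementation:

  #!/usr/bin/perl
  # Sketch of the lock semantics we rely on, simulated with a local
  # directory and mtime based expiry - not the real pmxcfs code, and the
  # 120 second timeout is just an example value.
  use strict;
  use warnings;
  use File::Path qw(make_path);

  my $lockdir = '/tmp/ha-lock-demo';
  my $timeout = 120; # an unrenewed lock expires after this many seconds

  # Try to acquire a named cluster wide lock. Returns 1 on success.
  sub acquire_lock {
      my ($name) = @_;
      make_path($lockdir);
      my $path = "$lockdir/$name";

      return 1 if mkdir($path); # nobody holds it - we got it

      # Somebody holds it: only take it over if it has expired.
      my $age = time() - (stat($path))[9];
      if ($age > $timeout) {
          utime(undef, undef, $path); # reset the expiry time
          return 1;
      }
      return 0;
  }

  # The holder must renew the lock periodically, well before it expires.
  sub renew_lock {
      my ($name) = @_;
      utime(undef, undef, "$lockdir/$name");
  }

  print acquire_lock('ha_manager_lock') ? "got lock\n" : "lock busy\n";
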
=== Watchdog ===

We need a reliable watchdog mechanism, which is able to provide hard
timeouts. It must be guaranteed that the node reboots within the
specified timeout if we do not update the watchdog. It looks like
neither systemd nor the standard watchdog(8) daemon provides such
guarantees.

We could use /dev/watchdog directly, but unfortunately this only
allows one user. We need to protect at least two daemons, so we write
our own watchdog daemon. This daemon works on /dev/watchdog, but
provides that service to several other daemons using a local socket.

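The following is a rough sketch of that multiplexing idea in Perl. The
socket path, protocol and timing are assumptions for illustration and do
not describe the actual daemon:

  #!/usr/bin/perl
  # Illustrative sketch of the multiplexing idea only - socket path,
  # protocol and timing here are assumptions, not the real daemon.
  use strict;
  use warnings;
  use IO::Socket::UNIX;
  use IO::Select;

  my $wd_dev    = '/dev/watchdog';
  my $sock_path = '/run/watchdog-demo.sock';

  # Open the hardware watchdog once - the kernel only allows one user.
  open(my $watchdog, '>', $wd_dev) or die "cannot open $wd_dev: $!";

  unlink $sock_path;
  my $server = IO::Socket::UNIX->new(
      Type   => SOCK_STREAM(),
      Local  => $sock_path,
      Listen => 5,
  ) or die "cannot listen on $sock_path: $!";

  my $select = IO::Select->new($server);
  my %last_seen; # per client: time of the last keep-alive byte

  while (1) {
      for my $fh ($select->can_read(1)) {
          if ($fh == $server) {
              my $client = $server->accept();
              $select->add($client);
              $last_seen{fileno($client)} = time();
          } elsif (sysread($fh, my $buf, 1)) {
              $last_seen{fileno($fh)} = time(); # client keep-alive
          } else {
              $select->remove($fh); # client closed its connection
              delete $last_seen{fileno($fh)};
              close($fh);
          }
      }
      # Only feed the hardware watchdog if every connected client
      # updated us recently; otherwise the node will be reset.
      my $stale = grep { time() - $_ > 10 } values %last_seen;
      syswrite($watchdog, "\0") if !$stale;
  }
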
== Self fencing ==

A node needs to acquire a special 'ha_agent_${node}_lock' (one separate
lock for each node) before starting HA resources, and the node updates
the watchdog device once it gets that lock. If the node loses quorum,
or is unable to get the 'ha_agent_${node}_lock', the watchdog is no
longer updated. The node can release the lock if there are no running
HA resources.

This makes sure that the node holds the 'ha_agent_${node}_lock' as
long as there are running services on that node.

The HA manager can assume that the watchdog triggered a reboot when it
is able to acquire the 'ha_agent_${node}_lock' for that node.

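A simplified sketch of that fencing loop, with stand-in helpers instead
of the real cluster and watchdog interfaces:

  #!/usr/bin/perl
  # Sketch of the self-fencing loop only - the helper functions are
  # stand-ins for the real cluster and watchdog interfaces.
  use strict;
  use warnings;

  my $node = 'node1';

  # Placeholders for illustration - always "healthy" in this sketch.
  sub node_is_quorate         { return 1 }
  sub get_ha_agent_lock       { my ($node) = @_; return 1 }
  sub release_ha_agent_lock   { my ($node) = @_; return 1 }
  sub count_local_ha_services { return 0 }
  sub update_watchdog         { print "watchdog updated\n" }

  for (1 .. 3) { # the real loop runs forever
      if (node_is_quorate() && get_ha_agent_lock($node)) {
          # We hold our 'ha_agent_${node}_lock' - keep the node alive.
          update_watchdog();
          # Without running services the lock may be given up again,
          # so the node is free to leave the cluster gracefully.
          release_ha_agent_lock($node) if count_local_ha_services() == 0;
      } else {
          # No quorum or no lock: stop updating the watchdog, which
          # will reset the node after the watchdog timeout expires.
      }
      sleep(1);
  }
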
=== Problems with "two_node" Clusters ===

This corosync option depends on a fence race condition, and only
works when using reliable HW fence devices.

The above 'self fencing' algorithm does not work if you use this option!

== Testing requirements ==

We want to be able to simulate an HA cluster using a GUI. This makes it
easier to learn how the system behaves. We also need a way to run
regression tests.

= Implementation details =

== Cluster Resource Manager (class PVE::HA::CRM) ==

The Cluster Resource Manager (CRM) daemon runs on each node, but
locking makes sure only one CRM daemon acts in the 'master' role. That
'master' daemon reads the service configuration file, and requests new
service states by writing the global 'manager_status'. That data
structure is read by the Local Resource Manager, which performs the
real work (start/stop/migrate services).

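A rough sketch of that control flow; the helper names and data layout
are illustrative only, not the real implementation:

  #!/usr/bin/perl
  # Rough sketch of the CRM 'master' control flow - helper names and
  # data layout are illustrative only.
  use strict;
  use warnings;

  # Placeholders standing in for the cluster filesystem and config.
  sub get_crm_master_lock  { return 1 }
  sub read_service_config  { return { 'vm:100' => { state => 'started', node => 'node1' } } }
  sub read_manager_status  { return { service_status => {} } }
  sub write_manager_status { my ($status) = @_; } # persist for the LRMs

  sub manage {
      return if !get_crm_master_lock(); # only the master may decide

      my $config = read_service_config();
      my $status = read_manager_status();

      # Request a new state for every configured service; the LRM on the
      # service's node reads this and does the actual work.
      foreach my $sid (keys %$config) {
          my $req = $config->{$sid}->{state} // 'started';
          $status->{service_status}->{$sid}->{state} = $req;
          $status->{service_status}->{$sid}->{node}  = $config->{$sid}->{node};
      }

      write_manager_status($status);
  }

  manage();
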
=== Service Relocation ===

Some services, like Qemu Virtual Machines, support live migration,
so the LRM can migrate those services without stopping them (CRM
service state 'migrate').

Most other service types require the service to be stopped, and then
restarted on the other node. Stopped services are moved by the CRM
(usually by simply changing the service configuration).

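A minimal sketch of that decision; the helper and the assumption that
only VMs support live migration are for illustration only:

  #!/usr/bin/perl
  # Sketch of the relocation decision only - the state names follow the
  # README, the helper is a stand-in.
  use strict;
  use warnings;

  # Assume only VMs support live migration in this sketch.
  sub supports_live_migration {
      my ($sid) = @_;
      return $sid =~ /^vm:/ ? 1 : 0;
  }

  # Pick the CRM request used to move a service to another node.
  sub relocation_request {
      my ($sid) = @_;
      return supports_live_migration($sid)
          ? 'migrate'       # move without stopping the service
          : 'request_stop'; # stop first, then start on the target node
  }

  print relocation_request('vm:100'), "\n"; # migrate
  print relocation_request('ct:200'), "\n"; # request_stop
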
=== Service ordering and colocation constraints ===

So far there are no plans to implement this (although it would be possible).

=== Possible CRM Service States ===

stopped: Service is stopped (confirmed by LRM)

request_stop: Service should be stopped. Waiting for
              confirmation from LRM.

started: Service is active and the LRM should start it asap.

fence: Wait for node fencing (service node is not inside
       quorate cluster partition).

freeze: Do not touch. We use this state while we reboot a node,
        or when we restart the LRM daemon.

migrate: Migrate (live) service to other node.

error: Service disabled because of LRM errors.

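To illustrate how such states could be tracked, here is a small sketch
with a simplified (assumed) transition table; the state names are the
ones listed above, but the allowed transitions are not the real rules:

  #!/usr/bin/perl
  # Illustrative sketch only - the state names come from the list above,
  # the transition table is a simplified assumption, not the real rules.
  use strict;
  use warnings;

  # A few plausible transitions between the CRM service states.
  my $transitions = {
      stopped      => [ 'started', 'request_stop' ],
      request_stop => [ 'stopped' ],
      started      => [ 'request_stop', 'migrate', 'fence', 'freeze', 'error' ],
      migrate      => [ 'started', 'error' ],
      fence        => [ 'stopped', 'started' ],
      freeze       => [ 'started', 'stopped' ],
      error        => [ 'stopped' ],
  };

  sub change_state {
      my ($service, $new) = @_;
      my $cur = $service->{state};
      die "invalid transition: $cur -> $new\n"
          if !grep { $_ eq $new } @{$transitions->{$cur} // []};
      $service->{state} = $new;
  }

  my $svc = { sid => 'vm:100', state => 'stopped' };
  change_state($svc, 'started');  # ok
  change_state($svc, 'migrate');  # ok
  print "$svc->{sid} is now $svc->{state}\n";
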
== Local Resource Manager (class PVE::HA::LRM) ==

The Local Resource Manager (LRM) daemon runs on each node, and
performs service commands (start/stop/migrate) for services assigned
to the local node. It should be mentioned that each LRM holds a
cluster wide 'ha_agent_${node}_lock' lock, and the CRM is not allowed
to assign the service to another node while the LRM holds that lock.

The LRM reads the requested service state from 'manager_status', and
tries to bring the local service into that state. The actual service
status is written back to 'service_${node}_status', and can be
read by the CRM.

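A simplified sketch of that work loop, with stand-in helpers for the
status files and service commands:

  #!/usr/bin/perl
  # Sketch of the LRM work loop - helper names are illustrative only.
  use strict;
  use warnings;

  my $node = 'node1';

  # Placeholders standing in for cluster status files and service tools.
  sub read_manager_status       { return { service_status => { 'vm:100' => { node => 'node1', state => 'started' } } } }
  sub read_local_service_state  { my ($sid) = @_; return 'stopped' }
  sub start_service             { my ($sid) = @_; print "starting $sid\n" }
  sub stop_service              { my ($sid) = @_; print "stopping $sid\n" }
  sub write_service_node_status { my ($node, $status) = @_; } # 'service_${node}_status'

  sub run_once {
      my $manager_status = read_manager_status();
      my $local_status = {};

      foreach my $sid (keys %{$manager_status->{service_status}}) {
          my $req = $manager_status->{service_status}->{$sid};
          next if $req->{node} ne $node; # only services assigned to us

          my $cur = read_local_service_state($sid);
          if ($req->{state} eq 'started' && $cur ne 'started') {
              start_service($sid);
          } elsif ($req->{state} eq 'request_stop' && $cur ne 'stopped') {
              stop_service($sid);
          }
          $local_status->{$sid} = read_local_service_state($sid);
      }

      # Report the actual state back so the CRM can see the result.
      write_service_node_status($node, $local_status);
  }

  run_once();
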
== Pluggable Interface for cluster environment (class PVE::HA::Env) ==

This class defines an interface to the actual cluster environment:

* get node membership and quorum information

* get/release cluster wide locks

* get system time

* watchdog interface

* read/write cluster wide status files

We have plugins for several different environments:

* PVE::HA::Sim::TestEnv: the regression test environment

* PVE::HA::Sim::RTEnv: the graphical simulator

* PVE::HA::Env::PVE2: the real Proxmox VE cluster

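To make the shape of this interface concrete, here is a minimal sketch
of such an environment plugin. The method names and return values are
illustrative assumptions, not the actual PVE::HA::Env API:

  package Demo::HA::Env;
  # Minimal sketch of an environment plugin - method names and return
  # values are illustrative assumptions, not the actual PVE::HA::Env API.
  use strict;
  use warnings;

  sub new {
      my ($class, $nodename) = @_;
      return bless { nodename => $nodename, locks => {}, files => {} }, $class;
  }

  # node membership and quorum information
  sub nodename { my ($self) = @_; return $self->{nodename} }
  sub quorate  { return 1 } # this demo environment is always quorate

  # cluster wide locks (here just tracked in memory)
  sub get_lock     { my ($self, $name) = @_; $self->{locks}->{$name} = 1; return 1 }
  sub release_lock { my ($self, $name) = @_; delete $self->{locks}->{$name} }

  # system time (a simulator could return virtual time instead)
  sub get_time { return time() }

  # watchdog interface
  sub watchdog_update { my ($self) = @_; print "watchdog updated\n" }

  # cluster wide status files (here just an in-memory hash)
  sub read_status  { my ($self, $name) = @_; return $self->{files}->{$name} }
  sub write_status { my ($self, $name, $data) = @_; $self->{files}->{$name} = $data }

  package main;
  use strict;
  use warnings;

  my $env = Demo::HA::Env->new('node1');
  $env->write_status('manager_status', { master_node => $env->nodename() });
  $env->watchdog_update() if $env->quorate() && $env->get_lock('ha_manager_lock');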