= Proxmox HA Manager =

Note that this README was written as early development planning/documentation
in 2015; even though small updates were made in 2023, it might be a bit out of
date. For usage documentation see the official reference docs shipped with your
Proxmox VE installation or, for your convenience, hosted at:
https://pve.proxmox.com/pve-docs/chapter-ha-manager.html

== History & Motivation ==

The `rgmanager` HA stack used in Proxmox VE 3.x has a bunch of drawbacks:

- no longer developed (Red Hat moved to Pacemaker)
- highly depends on an old version of corosync
- complicated code (caused by the compatibility layer with the older
  cluster stack (cman))
- no self-fencing

For Proxmox VE 4.0 we thus required a new HA stack and also wanted to make HA
easier for our users while also making it possible to move to the newest
corosync, or even a totally different cluster stack. So, the following core
requirements were set out:

- possibility to run with any distributed key/value store which provides some
  kind of locking with timeouts (zookeeper, consul, etcd, ..)
- self fencing using a Linux watchdog device
- implemented in Perl, so that we can use the PVE framework
- only works with simple resources like VMs

We dropped the idea to assemble complex, dependent services, because we think
this is already done with the VM/CT abstraction.

== Architecture ==

Cluster requirements.

=== Cluster wide locks with timeouts ===

The cluster stack must provide cluster-wide locks with timeouts.
The Proxmox 'pmxcfs' implements this on top of corosync.
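
The required lock semantics can be sketched as follows (a minimal in-memory
model standing in for pmxcfs/corosync; the class and its API are hypothetical,
only the semantics are from the text): a lock belongs to one node and expires
unless renewed within its timeout, so a dead holder loses it automatically.

```python
import time

# Hypothetical in-memory lock store illustrating "locks with timeouts".
class LockStore:
    def __init__(self, timeout=120):
        self.timeout = timeout
        self.locks = {}  # name -> (holder, expire_time)

    def acquire(self, name, node, now=None):
        now = time.time() if now is None else now
        holder, expires = self.locks.get(name, (None, 0))
        # grant if free, already ours, or expired; (re)start the timeout
        if holder in (None, node) or now >= expires:
            self.locks[name] = (node, now + self.timeout)
            return True
        return False

    def release(self, name, node):
        if self.locks.get(name, (None, 0))[0] == node:
            del self.locks[name]
```

A second node can only take over a lock after the first holder's timeout has
elapsed without renewal, which is exactly what makes self-fencing safe.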

=== Watchdog ===

We need a reliable watchdog mechanism, which is able to provide hard
timeouts. It must be guaranteed that the node reboots within the specified
timeout if we do not update the watchdog. It looks like neither systemd
nor the standard watchdog(8) daemon provides such guarantees.

We could use /dev/watchdog directly, but unfortunately this only
allows one user. We need to protect at least two daemons, so we write
our own watchdog daemon. This daemon works on /dev/watchdog, but
provides that service to several other daemons using a local socket.
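
The multiplexing idea can be sketched like this (names and API are
hypothetical; real clients connect over a local socket): the mux keeps petting
the single /dev/watchdog only while *every* registered client refreshes in
time, so one hung daemon is enough to let the hardware timer fire.

```python
# Hypothetical sketch of a watchdog multiplexer: one hardware watchdog,
# several protected client daemons.
class WatchdogMux:
    def __init__(self, client_timeout=60):
        self.client_timeout = client_timeout
        self.clients = {}  # client id -> last refresh timestamp

    def register(self, client, now):
        self.clients[client] = now

    def refresh(self, client, now):
        if client in self.clients:
            self.clients[client] = now

    def may_pet_hardware(self, now):
        # True only while no registered client missed its deadline;
        # otherwise /dev/watchdog is left alone and the node self-fences.
        return all(now - t < self.client_timeout
                   for t in self.clients.values())
```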

=== Self fencing ===

A node needs to acquire a special 'ha_agent_${node}_lock' (one separate
lock for each node) before starting HA resources, and the node updates
the watchdog device once it gets that lock. If the node loses quorum,
or is unable to get the 'ha_agent_${node}_lock', the watchdog is no
longer updated. The node can release the lock if there are no running
HA resources.

This makes sure that the node holds the 'ha_agent_${node}_lock' as
long as there are running services on that node.

The HA manager can assume that the watchdog triggered a reboot when it
is able to acquire the 'ha_agent_${node}_lock' for that node.
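
One iteration of this rule can be sketched as a pure decision function (the
helper name and injected callables are illustrative, not the real Perl code):
the watchdog is only updated while the node is quorate and holds its agent
lock; with no active services it may simply go idle instead of fencing.

```python
# Hypothetical sketch of one self-fencing decision round.
def agent_tick(node, has_quorum, acquire_lock, active_services, update_watchdog):
    """acquire_lock and update_watchdog are injected callables (assumptions)."""
    lock = "ha_agent_%s_lock" % node
    if has_quorum and acquire_lock(lock):
        update_watchdog()          # node stays alive
        return 'protected'
    if not active_services:
        return 'idle'              # safe: nothing that needs fencing
    return 'fencing'               # watchdog no longer updated -> reboot soon
```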

==== Problems with "two_node" Clusters ====

This corosync option depends on a fence race condition, and only
works using reliable HW fence devices.

The above 'self fencing' algorithm does not work if you use this option!

Note that you can use a QDevice, i.e., an external, simple (no full corosync
membership, so relaxed networking requirements) node arbiter process.
=== Testing Requirements ===

We want to be able to simulate and test the behavior of a HA cluster, using
either a GUI or a CLI. This makes it easier to learn how the system behaves. We
also need a way to run regression tests.

== Implementation details ==

=== Cluster Resource Manager (class PVE::HA::CRM) ===

The Cluster Resource Manager (CRM) daemon runs on each node, but
locking makes sure only one CRM daemon acts in the 'master' role. That
'master' daemon reads the service configuration file, and requests new
service states by writing the global 'manager_status'. That data
structure is read by the Local Resource Manager, which performs the
real work (start/stop/migrate services).
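
A single CRM round can be sketched like this (function and field names are
hypothetical; only the lock-guarded master role and the write to
'manager_status' come from the text):

```python
# Hypothetical sketch of one CRM round: only the lock holder acts as master
# and publishes requested service states for the LRMs to pick up.
def crm_round(node, try_acquire_manager_lock, service_config, manager_status):
    if not try_acquire_manager_lock(node):
        return False  # another CRM daemon is master; stay idle
    for sid, cfg in service_config.items():
        # request the user-configured state; the LRMs do the real work
        manager_status.setdefault(sid, {})['request'] = (
            'started' if cfg.get('state') == 'started' else 'request_stop')
    return True
```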

==== Service Relocation ====

Some services, like a QEMU Virtual Machine, support live migration.
So the LRM can migrate those services without stopping them (CRM service state
'migrate').

Most other service types require the service to be stopped, and then restarted
at the other node. Stopped services are moved by the CRM (usually by simply
changing the service configuration).

==== Service Ordering and Co-location Constraints ====

There are no plans to implement this for the initial version, although it
would be possible and probably should be done for later versions.

==== Possible CRM Service States ====

stopped: Service is stopped (confirmed by LRM).

request_stop: Service should be stopped. Waiting for
confirmation from the LRM.

started: Service is active and the LRM should start it ASAP.

fence: Wait for node fencing (service node is not inside
the quorate cluster partition).

recovery: Service gets recovered to a new node as its current one was
fenced. Note that a service might be stuck here depending on the
group/priority configuration.

freeze: Do not touch. We use this state while we reboot a node,
or when we restart the LRM daemon.

migrate: Migrate (live) service to another node.

relocate: Migrate (stop, move, start) service to another node.

error: Service disabled because of LRM errors.


There's also an `ignored` state which tells the HA stack to ignore a service
completely, i.e., as if it wasn't under HA control at all.
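
A subset of the transitions between these states can be sketched as a simple
table (the event names are hypothetical; the real logic in the CRM is
considerably richer):

```python
# Hypothetical transition table for a subset of the CRM service states.
CRM_TRANSITIONS = {
    ('stopped', 'enable'): 'started',
    ('started', 'disable'): 'request_stop',
    ('request_stop', 'lrm_confirmed_stop'): 'stopped',
    ('started', 'node_lost'): 'fence',
    ('fence', 'node_fenced'): 'recovery',
    ('recovery', 'new_node_found'): 'started',
    ('started', 'migrate'): 'migrate',
    ('migrate', 'done'): 'started',
}

def next_state(state, event):
    # unknown events leave the state unchanged (e.g. stuck in 'recovery')
    return CRM_TRANSITIONS.get((state, event), state)
```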

=== Local Resource Manager (class PVE::HA::LRM) ===

The Local Resource Manager (LRM) daemon runs on each node, and
performs service commands (start/stop/migrate) for services assigned
to the local node. It should be mentioned that each LRM holds a
cluster wide 'ha_agent_${node}_lock' lock, and the CRM is not allowed
to assign the service to another node while the LRM holds that lock.

The LRM reads the requested service state from 'manager_status', and
tries to bring the local service into that state. The actual service
status is written back to 'service_${node}_status', and can be
read by the CRM.
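
The LRM reconcile step can be sketched as follows (helper names and the data
layout are illustrative, not the real Perl structures): compare the requested
state with the local one, derive a command, and report results back for the
CRM to read.

```python
# Hypothetical sketch of the LRM reconcile step.
def lrm_reconcile(requested, actual):
    if requested == 'started' and actual != 'started':
        return 'start'
    if requested in ('request_stop', 'stopped') and actual == 'started':
        return 'stop'
    return None  # already in the requested state

def lrm_round(node, manager_status, local_status):
    results = {}
    for sid, entry in manager_status.items():
        if entry.get('node') != node:
            continue  # service not assigned to this node
        results[sid] = lrm_reconcile(entry['request'],
                                     local_status.get(sid, 'stopped'))
    return results  # would be written back to 'service_${node}_status'
```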

=== Pluggable Interface for Cluster Environment (class PVE::HA::Env) ===

This class defines an interface to the actual cluster environment:

* get node membership and quorum information

* get/release cluster wide locks

* get system time

* watchdog interface

* read/write cluster wide status files
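
In Python terms, this pluggable interface could be sketched as an abstract
base class (the real interface is the Perl class PVE::HA::Env; the method
names here are illustrative, chosen to mirror the bullet list above):

```python
from abc import ABC, abstractmethod

# Hypothetical Python rendering of the PVE::HA::Env plugin interface.
class HAEnv(ABC):
    @abstractmethod
    def quorate(self): ...                 # node membership / quorum info

    @abstractmethod
    def get_ha_agent_lock(self, node): ... # cluster-wide locks

    @abstractmethod
    def get_time(self): ...                # system time (mockable in tests)

    @abstractmethod
    def watchdog_update(self): ...         # watchdog interface

    @abstractmethod
    def read_status(self, name): ...       # cluster-wide status files

    @abstractmethod
    def write_status(self, name, data): ...
```

Making time and the watchdog part of the interface is what lets the test
environment run simulated clusters deterministically.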

We have plugins for several different environments:

* PVE::HA::Sim::TestEnv: the regression test environment

* PVE::HA::Sim::RTEnv: the graphical simulator

* PVE::HA::Env::PVE2: the real Proxmox VE cluster
