= Proxmox HA Manager =

Note that this README was written as early development planning/documentation
in 2015; even though small updates were made in 2023, it might be a bit out of
date. For usage documentation see the official reference docs shipped with your
Proxmox VE installation or, for your convenience, hosted at:
https://pve.proxmox.com/pve-docs/chapter-ha-manager.html

== History & Motivation ==

The `rgmanager` HA stack used in Proxmox VE 3.x has a bunch of drawbacks:

- no more development (redhat moved to pacemaker)
- highly depends on an old version of corosync
- complicated code (caused by the compatibility layer with the older cluster
  stack (cman))
- no self-fencing

For Proxmox VE 4.0 we thus required a new HA stack and also wanted to make HA
easier for our users while also making it possible to move to the newest
corosync, or even a totally different cluster stack. So, the following core
requirements got set out:

- possibility to run with any distributed key/value store which provides some
  kind of locking with timeouts (zookeeper, consul, etcd, ..)
- self fencing using Linux watchdog device
- implemented in Perl, so that we can use the PVE framework
- only work with simple resources like VMs

We dropped the idea to assemble complex, dependent services, because we think
this is already done with the VM/CT abstraction.

== Architecture ==

Cluster requirements:

=== Cluster wide locks with timeouts ===

The cluster stack must provide cluster-wide locks with timeouts.
The Proxmox 'pmxcfs' implements this on top of corosync.

f02ff212 DM |
42 | === Watchdog === |
43 | ||
44 | We need a reliable watchdog mechanism, which is able to provide hard | |
63f6a08c | 45 | timeouts. It must be guaranteed that the node reboots within the specified |
f02ff212 DM |
46 | timeout if we do not update the watchdog. For me it looks that neither |
47 | systemd nor the standard watchdog(8) daemon provides such guarantees. | |
48 | ||
49 | We could use the /dev/watchdog directly, but unfortunately this only | |
50 | allows one user. We need to protect at least two daemons, so we write | |
51 | our own watchdog daemon. This daemon work on /dev/watchdog, but | |
52 | provides that service to several other daemons using a local socket. | |
53 | ||
=== Self fencing ===

A node needs to acquire a special 'ha_agent_${node}_lock' (one separate
lock for each node) before starting HA resources, and the node updates
the watchdog device once it gets that lock. If the node loses quorum,
or is unable to get the 'ha_agent_${node}_lock', the watchdog is no
longer updated. The node can release the lock if there are no running
HA resources.

This makes sure that the node holds the 'ha_agent_${node}_lock' as
long as there are running services on that node.

The HA manager can assume that the watchdog triggered a reboot when it
is able to acquire the 'ha_agent_${node}_lock' for that node.

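A hedged sketch of that loop, where the helper subs are mere placeholders for
the real quorum, lock and watchdog interfaces (not actual PVE::HA code):

    use strict;
    use warnings;

    my $node = 'node1';

    sub have_quorum            { 1 }   # placeholder: ask the cluster stack
    sub acquire_agent_lock     { 1 }   # placeholder: cluster-wide lock with timeout
    sub local_services_running { 1 }   # placeholder: query local HA services
    sub update_watchdog        { }     # placeholder: pet the shared watchdog
    sub release_agent_lock     { }     # placeholder: drop the per-node lock

    while (1) {
        if (have_quorum() && acquire_agent_lock("ha_agent_${node}_lock")) {
            update_watchdog();         # keep the node alive while we hold the lock
        } elsif (!local_services_running()) {
            release_agent_lock("ha_agent_${node}_lock");
            last;                      # nothing to protect, safe to stop
        }
        # Otherwise: do not update the watchdog; the node will self-fence
        # (hard reset) once the watchdog timeout expires.
        sleep(10);
    }
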
==== Problems with "two_node" Clusters ====

This corosync option depends on a fence race condition, and only
works using reliable HW fence devices.

The above 'self fencing' algorithm does not work if you use this option!

Note that you can use a QDevice instead, i.e., an external, simple arbiter
process that does not need full corosync membership and thus has relaxed
network requirements.

=== Testing Requirements ===

We want to be able to simulate and test the behavior of an HA cluster, using
either a GUI or a CLI. This makes it easier to learn how the system behaves. We
also need a way to run regression tests.

== Implementation details ==

=== Cluster Resource Manager (class PVE::HA::CRM) ===

The Cluster Resource Manager (CRM) daemon runs on each node, but
locking makes sure only one CRM daemon acts in the 'master' role. That
'master' daemon reads the service configuration file, and requests new
service states by writing the global 'manager_status'. That data
structure is read by the Local Resource Manager, which performs the
real work (start/stop/migrate services).

==== Service Relocation ====

Some services, like a QEMU Virtual Machine, support live migration.
So the LRM can migrate those services without stopping them (CRM service state
'migrate').

Most other service types require the service to be stopped, and then restarted
at the other node. Stopped services are moved by the CRM (usually by simply
changing the service configuration).

==== Service Ordering and Co-location Constraints ====

There are no plans to implement this for the initial version, although it would
be possible and probably should be done for later versions.

==== Possible CRM Service States ====

stopped: Service is stopped (confirmed by LRM)

request_stop: Service should be stopped. Waiting for
              confirmation from LRM.

started: Service is active and LRM should start it asap.

fence: Wait for node fencing (service node is not inside
       quorate cluster partition).

recovery: Service gets recovered to a new node as its current one was
          fenced. Note that a service might be stuck here depending on the
          group/priority configuration.

freeze: Do not touch. We use this state while we reboot a node,
        or when we restart the LRM daemon.

migrate: Migrate (live) service to other node.

relocate: Migrate (stop, move, start) service to other node.

error: Service disabled because of LRM errors.

There's also an `ignored` state which tells the HA stack to ignore a service
completely, i.e., as if it wasn't under HA control at all.

=== Local Resource Manager (class PVE::HA::LRM) ===

The Local Resource Manager (LRM) daemon runs on each node, and
performs service commands (start/stop/migrate) for services assigned
to the local node. It should be mentioned that each LRM holds a
cluster wide 'ha_agent_${node}_lock' lock, and the CRM is not allowed
to assign the service to another node while the LRM holds that lock.

The LRM reads the requested service state from 'manager_status', and
tries to bring the local service into that state. The actual service
status is written back to the 'service_${node}_status', and can be
read by the CRM.

=== Pluggable Interface for Cluster Environment (class PVE::HA::Env) ===

This class defines an interface to the actual cluster environment:

* get node membership and quorum information

* get/release cluster wide locks

* get system time

* watchdog interface

* read/write cluster wide status files

We have plugins for several different environments:

* PVE::HA::Sim::TestEnv: the regression test environment

* PVE::HA::Sim::RTEnv: the graphical simulator

* PVE::HA::Env::PVE2: the real Proxmox VE cluster

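To sketch what such a plugin boils down to, here is a skeleton whose package
and method names are illustrative only, mirroring the bullet points above;
they are not necessarily the real PVE::HA::Env method names:

    package PVE::HA::Env::Example;   # hypothetical plugin, not shipped

    use strict;
    use warnings;

    sub new {
        my ($class, $nodename) = @_;
        return bless { nodename => $nodename }, $class;
    }

    # Illustrative interface, mirroring the list above:
    sub quorate          { my ($self) = @_; return 1; }         # membership/quorum
    sub get_cluster_lock { my ($self, $name) = @_; return 1; }  # locks with timeout
    sub release_lock     { my ($self, $name) = @_; }
    sub get_time         { return time(); }                     # system time
    sub watchdog_update  { my ($self) = @_; }                   # watchdog interface
    sub read_status      { my ($self, $name) = @_; return {}; } # status files
    sub write_status     { my ($self, $name, $data) = @_; }

    1;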