= Proxmox HA Manager =

== Motivation ==

The current HA manager has a bunch of drawbacks:

- no more development (Red Hat moved to Pacemaker)

- highly depends on an old version of corosync

- complicated code (caused by the compatibility layer with
  the older cluster stack (cman))

- no self-fencing

In the future, we want to make HA easier for our users, and it should
be possible to move to the newest corosync, or even to a totally
different cluster stack. So we want:

- the possibility to run with any distributed key/value store which
  provides some kind of locking with timeouts (zookeeper, consul,
  etcd, ..)

- self fencing using a Linux watchdog device

- an implementation in Perl, so that we can use the PVE framework

- to work only with simple resources like VMs

We dropped the idea to assemble complex, dependent services, because
we think this is already done with the VM abstraction.

= Architecture =

== Cluster requirements ==

=== Cluster wide locks with timeouts ===

The cluster stack must provide cluster wide locks with timeouts.
The Proxmox 'pmxcfs' implements this on top of corosync.
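
The required semantics can be illustrated with a small sketch. The
key/value store object and its methods below are assumptions for
illustration, not the pmxcfs API; a real store must also make the
test-and-set atomic (e.g. via compare-and-swap):

    # Minimal sketch of lock-with-timeout semantics (illustrative only).
    sub acquire_or_renew_lock {
        my ($store, $lockid, $node, $timeout) = @_;

        my $now = time();
        my $lock = $store->read($lockid);

        if (!$lock || ($now - $lock->{time}) > $timeout) {
            # lock is free or has expired - try to take it over
            return $store->write($lockid, { holder => $node, time => $now });
        } elsif ($lock->{holder} eq $node) {
            # we already hold the lock - renew it before it times out
            return $store->write($lockid, { holder => $node, time => $now });
        }

        return 0; # somebody else holds a valid lock
    }

A node that stops renewing its lock loses it after the timeout, which
is exactly the property the self-fencing algorithm below relies on.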

=== Watchdog ===

We need a reliable watchdog mechanism which is able to provide hard
timeouts. It must be guaranteed that the node reboots within the
specified timeout if we do not update the watchdog. To me it looks
like neither systemd nor the standard watchdog(8) daemon provides
such guarantees.

We could use /dev/watchdog directly, but unfortunately this allows
only one user. We need to protect at least two daemons, so we write
our own watchdog daemon. This daemon works on /dev/watchdog, but
provides that service to several other daemons using a local socket.
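
A minimal sketch of the /dev/watchdog handling in Perl (the local
socket multiplexing for client daemons is omitted):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Pet the hardware watchdog once per second. If updates stop,
    # the node reboots after the configured hard timeout.
    open(my $wd, '>', '/dev/watchdog')
        or die "unable to open watchdog device: $!\n";

    while (1) {
        syswrite($wd, "\0") or die "watchdog update failed: $!\n";
        sleep(1);
    }

On a clean shutdown, a daemon writes the magic character 'V' before
closing the device ("magic close"), which disables the watchdog
instead of triggering a reboot.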

== Self fencing ==

A node needs to acquire a special 'ha_agent_${node}_lock' (one
separate lock for each node) before starting HA resources, and the
node updates the watchdog device once it gets that lock. If the node
loses quorum, or is unable to get the 'ha_agent_${node}_lock', the
watchdog is no longer updated. The node can release the lock if there
are no running HA resources.

This makes sure that the node holds the 'ha_agent_${node}_lock' as
long as there are running services on that node.

The HA manager can assume that the watchdog triggered a reboot when it
is able to acquire the 'ha_agent_${node}_lock' for that node.
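
Put together, the per-node agent logic looks roughly like this ($env
and its method names are illustrative assumptions, not the actual
API):

    # Rough sketch of the self-fencing loop (illustrative only).
    my $node = 'node1';  # the local node name
    my $lockid = "ha_agent_${node}_lock";

    while (1) {
        if ($env->quorate() && $env->get_lock($lockid)) {
            # quorate and holding our agent lock - stay alive
            $env->watchdog_update();
        } else {
            # do NOT update the watchdog; if the lock cannot be
            # re-acquired before the hard timeout expires, the node
            # reboots (self-fences) and others can safely take over
        }
        sleep(10);
    }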

=== Problems with "two_node" Clusters ===

This corosync option depends on a fence race condition, and only
works with reliable HW fence devices.

The above 'self fencing' algorithm does not work if you use this
option!

== Testing requirements ==

We want to be able to simulate an HA cluster using a GUI. This makes
it easier to learn how the system behaves. We also need a way to run
regression tests.

= Implementation details =

== Cluster Resource Manager (class PVE::HA::CRM) ==

The Cluster Resource Manager (CRM) daemon runs on each node, but
locking makes sure that only one CRM daemon acts in the 'master'
role. That 'master' daemon reads the service configuration file, and
requests new service states by writing the global 'manager_status'.
That data structure is read by the Local Resource Manager, which
performs the real work (start/stop/migrate services).
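
A rough sketch of the resulting CRM main loop (the lock name and the
$env methods are illustrative assumptions):

    # Only the holder of the manager lock acts as 'master'.
    while (1) {
        if ($env->get_lock('ha_manager_lock')) {
            my $sc = $env->read_service_config();
            my $ms = $env->read_manager_status();

            # derive requested service states from the configuration
            foreach my $sid (keys %$sc) {
                $ms->{service_status}->{$sid}->{state} //= 'stopped';
            }

            $env->write_manager_status($ms);
        }
        sleep(5);
    }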

=== Service Relocation ===

Some services, like Qemu Virtual Machines, support live migration,
so the LRM can migrate those services without stopping them (CRM
service state 'migrate').

Most other service types require the service to be stopped and then
restarted on the other node. Stopped services are moved by the CRM
(usually by simply changing the service configuration).
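
Expressed as code, the choice of relocation strategy might look like
this (the function and the capability table are made-up examples):

    # Pick the CRM relocation strategy by service type (illustrative).
    my $can_live_migrate = { vm => 1 };  # Qemu VMs support live migration

    sub relocation_request {
        my ($service_type) = @_;
        # live-migratable services keep running; others are stopped
        # first and restarted on the target node
        return $can_live_migrate->{$service_type} ? 'migrate' : 'request_stop';
    }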

=== Service ordering and colocation constraints ===

So far there are no plans to implement this (although it would be
possible).

=== Possible CRM Service States ===

stopped: Service is stopped (confirmed by LRM)

request_stop: Service should be stopped. Waiting for
confirmation from LRM.

started: Service is active, and the LRM should start it ASAP.

fence: Wait for node fencing (the service node is not inside the
quorate cluster partition).

freeze: Do not touch. We use this state while we reboot a node,
or when we restart the LRM daemon.

migrate: Migrate (live) the service to another node.

error: Service disabled because of LRM errors.
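
As an example, these states might show up inside the global
'manager_status' like this (the exact layout is an illustrative
assumption, not the real format):

    my $manager_status = {
        service_status => {
            'vm:100' => { state => 'started', node => 'node1' },
            'vm:101' => { state => 'migrate', node => 'node1',
                          target => 'node2' },
            'vm:102' => { state => 'stopped', node => 'node2' },
        },
    };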

== Local Resource Manager (class PVE::HA::LRM) ==

The Local Resource Manager (LRM) daemon runs on each node, and
performs service commands (start/stop/migrate) for services assigned
to the local node. It should be mentioned that each LRM holds a
cluster wide 'ha_agent_${node}_lock' lock, and the CRM is not allowed
to assign the service to another node while the LRM holds that lock.

The LRM reads the requested service state from 'manager_status', and
tries to bring the local service into that state. The actual service
status is written back to 'service_${node}_status', and can be
read by the CRM.
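
A rough sketch of the LRM work loop ($env methods and the data layout
are illustrative assumptions):

    my $node = 'node1';  # the local node name

    while (1) {
        # without our agent lock we must not touch any service
        last if !$env->get_lock("ha_agent_${node}_lock");

        my $ms = $env->read_manager_status();
        my $local = {};

        foreach my $sid (keys %{$ms->{service_status}}) {
            my $sd = $ms->{service_status}->{$sid};
            next if $sd->{node} ne $node;  # not assigned to this node

            # bring the local service into the requested state
            # (start/stop/migrate), then record the local result
            $local->{$sid}->{state} = $sd->{state};
        }

        $env->write_service_status($node, $local);
        sleep(5);
    }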

== Pluggable Interface for cluster environment (class PVE::HA::Env) ==

This class defines an interface to the actual cluster environment:

* get node membership and quorum information

* get/release cluster wide locks

* get system time

* watchdog interface

* read/write cluster wide status files
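
As a sketch, such an environment plugin could look like the following
skeleton (package and method names chosen for illustration; the
actual interface may differ):

    package PVE::HA::Env::Example;

    sub quorate { ... }              # node membership and quorum info
    sub get_lock { ... }             # acquire/renew a cluster wide lock
    sub release_lock { ... }
    sub get_time { ... }             # system time (mockable for tests)
    sub watchdog_update { ... }      # watchdog interface
    sub read_manager_status { ... }  # cluster wide status files
    sub write_manager_status { ... }

    1;

The simulator and test plugins can then supply simulated time, locks
and membership, while the PVE2 plugin talks to the real cluster.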

We have plugins for several different environments:

* PVE::HA::Sim::TestEnv: the regression test environment

* PVE::HA::Sim::RTEnv: the graphical simulator

* PVE::HA::Env::PVE2: the real Proxmox VE cluster