= Proxmox HA Manager =

== Motivation ==

The current HA manager has a number of drawbacks:

- no more development (Red Hat moved to Pacemaker)

- highly depends on an old version of corosync

- complicated code (caused by the compatibility layer with the
  older cluster stack (cman))

- no self-fencing

In the future, we want to make HA easier for our users, and it should
be possible to move to the newest corosync, or even a totally different
cluster stack. So we want:

- possible to run with any distributed key/value store which provides
  some kind of locking with timeouts.

- self fencing using a Linux watchdog device

- implemented in Perl, so that we can use the PVE framework

- only works with simple resources like VMs

= Architecture =

== Cluster requirements ==

=== Cluster wide locks with timeouts ===

The cluster stack must provide cluster wide locks with timeouts.
The Proxmox 'pmxcfs' implements this on top of corosync.
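
To illustrate the required semantics, here is a minimal, single-process
Perl sketch. It is not the pmxcfs API; the 'Sim::LockStore' package and
its method names are made up. The point is only the contract: a lock
stays owned until its timeout expires, a holder renews it by
re-acquiring it, and a stale lock can be taken over.

  # Illustrative sketch only - real cluster wide locks come from 'pmxcfs'.
  # This single-process simulation just shows the contract: a lock stays
  # owned until its timeout expires, after which someone else may take it.
  package Sim::LockStore;

  use strict;
  use warnings;

  sub new {
      my ($class) = @_;
      return bless({ locks => {} }, $class);
  }

  # Try to acquire lock $name for $owner at time $now. Succeeds if the
  # lock is free, already held by $owner (renewal), or expired.
  sub acquire {
      my ($self, $name, $owner, $timeout, $now) = @_;
      my $lock = $self->{locks}->{$name};
      if (!$lock || $lock->{owner} eq $owner || $now >= $lock->{expires}) {
          $self->{locks}->{$name} = { owner => $owner, expires => $now + $timeout };
          return 1;
      }
      return 0;
  }

  package main;

  my $store = Sim::LockStore->new();
  print $store->acquire('ha_agent_node1_lock', 'node1', 120, 0), "\n";   # 1 - free
  print $store->acquire('ha_agent_node1_lock', 'node2', 120, 60), "\n";  # 0 - still held
  print $store->acquire('ha_agent_node1_lock', 'node2', 120, 130), "\n"; # 1 - lock expired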

== Self fencing ==

A node needs to acquire a special 'ha_agent_${node}_lock' (one separate
lock for each node) before starting HA resources, and the node updates
the watchdog device once it gets that lock. If the node loses quorum,
or is unable to get the 'ha_agent_${node}_lock', the watchdog is no
longer updated. The node can release the lock if there are no running
HA resources.

This makes sure that the node holds the 'ha_agent_${node}_lock' as
long as there are running services on that node.

The HA manager can assume that the watchdog triggered a reboot when it
is able to acquire the 'ha_agent_${node}_lock' for that node.
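
The following sketch shows the basic idea, assuming a hypothetical
get_ha_agent_lock() helper that stands in for renewing the node's lock
through the cluster stack; the real daemon goes through the watchdog
interface of PVE::HA::Env. The loop simply stops petting the standard
Linux watchdog device once the lock can no longer be renewed, so the
watchdog timer expires and reboots the node.

  # Minimal self-fencing sketch. get_ha_agent_lock() is a hypothetical
  # placeholder for renewing the cluster wide 'ha_agent_${node}_lock';
  # the real daemon uses the watchdog interface of PVE::HA::Env.
  use strict;
  use warnings;

  sub get_ha_agent_lock { return 1; }   # placeholder: ask the cluster stack

  open(my $wd, '>', '/dev/watchdog')
      or die "unable to open watchdog device: $!\n";

  while (1) {
      if (get_ha_agent_lock()) {
          # Lock renewed (node is quorate) - reset the watchdog timer.
          syswrite($wd, "\0");
      }
      # Otherwise do nothing: the watchdog expires and reboots the node.
      sleep(10);
  }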

== Testing requirements ==

We want to be able to simulate an HA cluster using a GUI. This makes it easier
to learn how the system behaves. We also need a way to run regression tests.

= Implementation details =

== Cluster Resource Manager (class PVE::HA::CRM) ==

The Cluster Resource Manager (CRM) daemon runs on each node, but
locking makes sure only one CRM daemon acts in the 'master' role. That
'master' daemon reads the service configuration file, and requests new
service states by writing the global 'manager_status'. That data
structure is read by the Local Resource Manager, which performs the
real work (start/stop/migrate services).
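
A rough sketch of what one 'master' round could look like; the data
layout and the manage_services() helper are assumptions for
illustration, not the actual PVE::HA::CRM code.

  # Illustrative only: turn the requested service configuration into new
  # service states inside the global 'manager_status' structure, which
  # the LRMs then pick up and execute.
  use strict;
  use warnings;

  sub manage_services {
      my ($service_config, $manager_status) = @_;

      foreach my $sid (sort keys %$service_config) {
          my $req = $service_config->{$sid}->{state} // 'started';
          my $ss = $manager_status->{service_status}->{$sid} //= {};

          # request the new state; the real code also handles fencing,
          # relocation and error handling (see the state list below)
          $ss->{state} = ($req eq 'stopped') ? 'request_stop' : 'started';
      }
      return $manager_status;
  }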

=== Service Relocation ===

Some services, like QEMU virtual machines, support live migration.
So the LRM can migrate those services without stopping them (CRM
service state 'migrate').

Most other service types require the service to be stopped, and
then restarted on the other node. We use the following CRM service
state transitions: 'relocate_stop' => 'relocate_move' => 'started'

Stopped services are moved using service state 'move'. It has to be
noted that service relocation is always done by the LRM (the LRM
'owns' the service), unless a node is fenced. In that case the CRM
is allowed to 'steal' the resource and move it to another node.
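
Expressed as a simple next-state table (again only an illustration, not
the actual implementation), the relocation chain looks like this:

  # Illustrative next-state table for the relocation chain above;
  # next_relocation_state() is not a real PVE::HA::CRM function.
  use strict;
  use warnings;

  my $relocate_next = {
      'relocate_stop' => 'relocate_move',  # LRM stops the service on the old node
      'relocate_move' => 'started',        # service is reassigned and started again
  };

  sub next_relocation_state {
      my ($state) = @_;
      return $relocate_next->{$state} // $state;  # unchanged if not relocating
  }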

=== Possible CRM Service States ===

stopped: Service is stopped (confirmed by LRM)

request_stop: Service should be stopped. Waiting for
confirmation from LRM.

started: Service is active, and the LRM should start it asap.

fence: Wait for node fencing (service node is not inside
quorate cluster partition).

migrate: Migrate (live) service to other node.

error: Service disabled because of LRM errors.
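
For illustration, the same set could be kept in a small lookup table
used to validate requested states; this helper is hypothetical and not
part of the actual code.

  # Illustrative lookup table of the states listed above; the relocation
  # states from the previous section would be added here as well.
  use strict;
  use warnings;

  my $valid_crm_states = {
      stopped      => 1,
      request_stop => 1,
      started      => 1,
      fence        => 1,
      migrate      => 1,
      error        => 1,
  };

  sub assert_valid_state {
      my ($state) = @_;
      die "unknown CRM service state '$state'\n"
          if !$valid_crm_states->{$state};
  }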

== Local Resource Manager (class PVE::HA::LRM) ==

The Local Resource Manager (LRM) daemon runs on each node, and
performs service commands (start/stop/migrate) for services assigned
to the local node. It should be mentioned that each LRM holds a
cluster wide 'ha_agent_${node}_lock' lock, and the CRM is not allowed
to assign the service to another node while the LRM holds that lock.

The LRM reads the requested service state from 'manager_status', and
tries to bring the local service into that state. The actual service
status is written back to 'service_${node}_status', and can be
read by the CRM.
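
A rough, hypothetical sketch of one such LRM round (the data layout and
helper name are assumptions): compare the requested state with the
local one and run the matching command.

  # Illustrative only: bring local services into the state requested by
  # the CRM. Command execution is stubbed out; the real LRM would start,
  # stop or migrate the actual services (e.g. VMs) here.
  use strict;
  use warnings;

  sub run_lrm_round {
      my ($node, $manager_status, $local_status) = @_;

      foreach my $sid (keys %{$manager_status->{service_status}}) {
          my $ss = $manager_status->{service_status}->{$sid};
          next if ($ss->{node} // '') ne $node;    # not assigned to this node

          my $want = $ss->{state} // 'stopped';
          my $have = $local_status->{$sid} // 'stopped';

          if ($want eq 'started' && $have ne 'started') {
              $local_status->{$sid} = 'started';   # would start the service
          } elsif ($want eq 'request_stop' && $have ne 'stopped') {
              $local_status->{$sid} = 'stopped';   # would stop the service
          }
      }
      return $local_status;   # written back to 'service_${node}_status'
  }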

== Pluggable Interface for cluster environment (class PVE::HA::Env) ==

This class defines an interface to the actual cluster environment:

* get node membership and quorum information

* get/release cluster wide locks

* get system time

* watchdog interface

* read/write cluster wide status files

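As a sketch of what such a pluggable interface can look like in Perl
(the method names are illustrative placeholders, not necessarily the
real PVE::HA::Env methods), each plugin would override a small set of
methods:

  # Illustrative base class; the method names are placeholders, not
  # necessarily the real PVE::HA::Env methods. Each environment plugin
  # overrides them with whatever its cluster stack provides.
  package Example::HA::Env;

  use strict;
  use warnings;

  sub new { my ($class) = @_; return bless({}, $class); }

  sub quorate          { die "implement in subclass\n" }  # membership/quorum
  sub get_cluster_lock { die "implement in subclass\n" }  # cluster wide locks
  sub release_lock     { die "implement in subclass\n" }
  sub get_time         { die "implement in subclass\n" }  # system time
  sub watchdog_update  { die "implement in subclass\n" }  # watchdog interface
  sub read_status      { die "implement in subclass\n" }  # cluster wide status files
  sub write_status     { die "implement in subclass\n" }

  1;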

We have plugins for several different environments:

* PVE::HA::Sim::TestEnv: the regression test environment

* PVE::HA::Sim::RTEnv: the graphical simulator

* PVE::HA::Env::PVE2: the real Proxmox VE cluster