= Proxmox HA Manager =

== Motivation ==

The current HA manager has a bunch of drawbacks:

- no more development (Red Hat moved to pacemaker)

- highly depends on an old version of corosync

- complicated code (caused by a compatibility layer with
  the older cluster stack (cman))

- no self-fencing

In the future, we want to make HA easier for our users, and it should
be possible to move to the newest corosync, or even a totally different
cluster stack. So we want:

- possible to run with any distributed key/value store which provides
  some kind of locking with timeouts

- self fencing using a Linux watchdog device

- implemented in Perl, so that we can use the PVE framework

- only works with simple resources like VMs

= Architecture =

== Cluster requirements ==

=== Cluster wide locks with timeouts ===

The cluster stack must provide cluster wide locks with timeouts.
The Proxmox 'pmxcfs' implements this on top of corosync.

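The lock semantics the design relies on can be sketched as follows. This is an illustrative Python model, not the pmxcfs API: the class and method names ('ClusterLock', 'acquire', 'release') are made up for this sketch, and the real implementation distributes the lock state via corosync instead of a local dict.

```python
import time

class ClusterLock:
    """Sketch of a cluster-wide lock with a timeout: a lock held by a
    node that stops refreshing it can be taken over once it expires."""

    def __init__(self, store, name, timeout=120):
        self.store, self.name, self.timeout = store, name, timeout

    def acquire(self, node, now=None):
        """Take (or refresh) the lock; returns True on success."""
        now = time.time() if now is None else now
        holder = self.store.get(self.name)
        if holder is None or now - holder['since'] > self.timeout:
            # lock is free or the previous holder's lease expired
            self.store[self.name] = {'node': node, 'since': now}
            return True
        if holder['node'] == node:
            # current holder refreshes its own lease
            self.store[self.name] = {'node': node, 'since': now}
            return True
        return False

    def release(self, node):
        holder = self.store.get(self.name)
        if holder and holder['node'] == node:
            del self.store[self.name]
```

The timeout is the crucial property: another node may only steal the lock after the lease has expired, which is what makes the self-fencing scheme below safe.
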
== Self fencing ==

A node needs to acquire a special 'ha_agent_${node}_lock' (one separate
lock for each node) before starting HA resources, and the node updates
the watchdog device once it gets that lock. If the node loses quorum,
or is unable to get the 'ha_agent_${node}_lock', the watchdog is no
longer updated. The node can release the lock if there are no running
HA resources.

This makes sure that the node holds the 'ha_agent_${node}_lock' as
long as there are running services on that node.

The HA manager can assume that the watchdog triggered a reboot when it
is able to acquire the 'ha_agent_${node}_lock' for that node.

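The mechanism above can be modeled in a few lines. This is a simulated sketch, not the real agent: 'Watchdog' stands in for the hardware device behind /dev/watchdog, and 'agent_tick', 'have_quorum', and 'acquire_agent_lock' are hypothetical names for one iteration of the agent loop.

```python
WATCHDOG_TIMEOUT = 10  # seconds until the hardware resets the node

class Watchdog:
    """Simulated hardware watchdog: it must be updated before its
    deadline passes, otherwise it fires and the node reboots."""
    def __init__(self, now):
        self.deadline = now + WATCHDOG_TIMEOUT
    def update(self, now):
        self.deadline = now + WATCHDOG_TIMEOUT
    def fired(self, now):
        return now >= self.deadline

def agent_tick(node, now, watchdog, have_quorum, acquire_agent_lock):
    """One loop iteration: the watchdog is updated only while the node
    is quorate and holds its 'ha_agent_${node}_lock'."""
    if have_quorum(node) and acquire_agent_lock(node):
        watchdog.update(now)
        return True
    return False  # watchdog keeps running down -> node self-fences
```

The key point the sketch shows: no explicit "fence" command is ever sent. A node that loses quorum simply stops updating the watchdog, and the hardware reset follows on its own.
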
== Testing requirements ==

We want to be able to simulate an HA cluster using a GUI. This makes it
easier to learn how the system behaves. We also need a way to run
regression tests.

= Implementation details =

== Cluster Resource Manager (class PVE::HA::CRM) ==

The Cluster Resource Manager (CRM) daemon runs on each node, but
locking makes sure that only one CRM daemon acts in the 'master' role.
That 'master' daemon reads the service configuration file, and requests
new service states by writing the global 'manager_status'. That data
structure is read by the Local Resource Manager, which performs the
real work (start/stop/migrate services).

=== Possible CRM Service States ===

stopped:      Service is stopped (confirmed by the LRM).

request_stop: Service should be stopped. Waiting for
              confirmation from the LRM.

started:      Service is active, and the LRM should start it asap.

fence:        Wait for node fencing (service node is not inside the
              quorate cluster partition).

migrate:      Migrate the VM to another node.

error:        Service disabled because of LRM errors.

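The states above form a small state machine. The transition table below is an illustrative reconstruction from the descriptions; the event names ('user_stop', 'node_lost_quorum', and so on) are invented for this sketch, and the real CRM logic is more involved.

```python
CRM_STATES = {'stopped', 'request_stop', 'started',
              'fence', 'migrate', 'error'}

def next_state(state, event):
    """Hypothetical transition function for the CRM service states;
    unknown events leave the state unchanged."""
    table = {
        ('stopped', 'user_start'):              'started',
        ('started', 'user_stop'):               'request_stop',
        ('request_stop', 'lrm_confirmed_stop'): 'stopped',
        ('started', 'node_lost_quorum'):        'fence',
        ('fence', 'node_fenced'):               'started',  # on another node
        ('started', 'user_migrate'):            'migrate',
        ('migrate', 'migration_done'):          'started',
        ('started', 'lrm_error'):               'error',
    }
    return table.get((state, event), state)
```
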
== Local Resource Manager (class PVE::HA::LRM) ==

The Local Resource Manager (LRM) daemon runs on each node, and
performs service commands (start/stop/migrate) for services assigned
to the local node. It should be mentioned that each LRM holds a
cluster wide 'ha_agent_${node}_lock' lock, and the CRM is not allowed
to assign the service to another node while the LRM holds that lock.

The LRM reads the requested service state from 'manager_status', and
tries to bring the local service into that state. The actual service
status is written back to 'service_${node}_status', and can be
read by the CRM.

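One LRM reconciliation pass can be sketched like this. It is an assumption-laden model: the function name 'lrm_reconcile', the shape of the 'manager_status' entries, and the service id format are illustrative, not the actual on-disk layout.

```python
def lrm_reconcile(node, manager_status, local_services, commands):
    """Bring local services toward the state the CRM requested, record
    the commands that would run, and return the status document this
    node would write back to 'service_${node}_status'."""
    status = {}
    for sid, req in manager_status.items():
        if req['node'] != node:
            continue  # service is assigned to another node
        current = local_services.get(sid, 'stopped')
        want = req['state']
        if want == 'started' and current != 'started':
            commands.append(('start', sid))
            local_services[sid] = 'started'
        elif want == 'request_stop' and current != 'stopped':
            commands.append(('stop', sid))
            local_services[sid] = 'stopped'
        status[sid] = local_services[sid]
    return status
```

Note the division of labor the sketch mirrors: the CRM only writes desired states, and each LRM acts purely on the services assigned to its own node.
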
== Pluggable Interface for cluster environment (class PVE::HA::Env) ==

This class defines an interface to the actual cluster environment:

* get node membership and quorum information

* get/release cluster wide locks

* get system time

* watchdog interface

* read/write cluster wide status files

We have plugins for several different environments:

* PVE::HA::Sim::TestEnv: the regression test environment

* PVE::HA::Sim::RTEnv: the graphical simulator

* PVE::HA::Env::PVE2: the real Proxmox VE cluster
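Transliterated into a Python abstract base class, the plugin interface looks roughly like this. The method names mirror the bullet list above but are not the actual Perl method names, and 'TestEnv' is only a toy stand-in for PVE::HA::Sim::TestEnv with simulated time.

```python
from abc import ABC, abstractmethod

class HAEnv(ABC):
    """Sketch of the pluggable cluster-environment interface."""
    @abstractmethod
    def quorate(self): ...                    # membership / quorum info
    @abstractmethod
    def get_ha_agent_lock(self, node): ...    # cluster wide locks
    @abstractmethod
    def release_ha_agent_lock(self, node): ...
    @abstractmethod
    def get_time(self): ...                   # virtual in the simulator
    @abstractmethod
    def watchdog_update(self): ...
    @abstractmethod
    def read_manager_status(self): ...        # cluster wide status files
    @abstractmethod
    def write_manager_status(self, status): ...

class TestEnv(HAEnv):
    """Toy regression-test plugin: everything is in-process, time is
    simulated, and quorum is always granted."""
    def __init__(self):
        self.time, self.locks, self.status = 0, {}, {}
    def quorate(self): return True
    def get_ha_agent_lock(self, node):
        self.locks[node] = self.time
        return True
    def release_ha_agent_lock(self, node): self.locks.pop(node, None)
    def get_time(self): return self.time
    def watchdog_update(self): pass
    def read_manager_status(self): return dict(self.status)
    def write_manager_status(self, status): self.status = dict(status)
```

Because the CRM and LRM only ever talk to this interface, the same manager code runs unchanged against the regression tests, the graphical simulator, and a real Proxmox VE cluster.
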