= Proxmox HA Manager =

== Motivation ==

The current HA manager has several drawbacks:

- no more development (Red Hat moved to pacemaker)

- highly depends on corosync (old version)

- complicated code (caused by a compatibility layer with the
  older cluster stack (cman))

- no self-fencing

In the future, we want to make HA easier for our users, and it should
be possible to move to the newest corosync, or even to a totally
different cluster stack. So we want:

- the possibility to run with any distributed key/value store which
  provides some kind of locking (with timeouts)

- self-fencing using the Linux watchdog device

- an implementation in Perl, so that we can use the PVE framework

- support for simple resources like VMs only

= Architecture =

== Cluster requirements ==

=== Cluster wide locks with timeouts ===

The cluster stack must provide cluster-wide locks with timeouts.
The Proxmox 'pmxcfs' implements this on top of corosync.

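To make the requirement concrete, here is a minimal sketch of the lock
semantics we rely on: a named lock that can be acquired or refreshed by
its owner, and that expires after a timeout so another node can take it
over. This is a toy, in-memory Python illustration, not the pmxcfs API;
all class and method names ('LockStore', 'try_acquire') are hypothetical.

```python
import time

class LockStore:
    # Toy stand-in for a distributed key/value store with expiring
    # locks (pmxcfs provides the real thing on top of corosync).
    def __init__(self, timeout=60):
        self.timeout = timeout  # seconds until a held lock expires
        self.locks = {}         # lock name -> (owner, expiry timestamp)

    def try_acquire(self, name, owner, now=None):
        """Acquire 'name' for 'owner', or refresh it if 'owner' already
        holds it. Fails while another owner holds an unexpired lock."""
        now = time.time() if now is None else now
        holder, expiry = self.locks.get(name, (None, 0.0))
        if holder is not None and holder != owner and expiry > now:
            return False        # held by someone else, not yet timed out
        self.locks[name] = (owner, now + self.timeout)
        return True

    def release(self, name, owner):
        # A lock can only be released by its current holder.
        if self.locks.get(name, (None, 0.0))[0] == owner:
            del self.locks[name]
```

The key property is the timeout: if a node stops refreshing its lock
(for example after losing quorum), the lock eventually expires and can
be taken over, which is what makes safe recovery possible.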
== Self fencing ==

A node needs to acquire a special 'agent_lock' (one separate lock for
each node) before starting HA resources, and the node updates the
watchdog device once it gets that lock. If the node loses quorum, or is
unable to get the 'agent_lock', the watchdog is no longer updated. The
node can release the lock if there are no running HA resources.

This makes sure that the node holds the 'agent_lock' as long as there
are running services on that node.

The HA manager can assume that the watchdog triggered a reboot when it
is able to acquire the 'agent_lock' for that node.
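The fencing rule above can be sketched as a loop: the watchdog is only
updated ("petted") while the node is quorate and holds its 'agent_lock';
otherwise the node deliberately stops petting and the countdown runs out.
This is an illustrative Python sketch, not the actual Perl agent; the
watchdog dict and every function name here are hypothetical stand-ins
for the real corosync and /dev/watchdog interfaces.

```python
WATCHDOG_TIMEOUT = 10  # seconds the (simulated) device allows between pets

def fencing_iteration(has_quorum, holds_agent_lock, watchdog):
    """One pass of the agent loop. 'watchdog' simulates the device:
    {'now': current time, 'last_pet': time of last update}."""
    if has_quorum and holds_agent_lock:
        watchdog["last_pet"] = watchdog["now"]
    # otherwise: deliberately stop updating -> the node will self-fence

def watchdog_fired(watchdog):
    """True once the pet interval was missed; real hardware would reboot."""
    return watchdog["now"] - watchdog["last_pet"] > WATCHDOG_TIMEOUT

# A healthy node keeps the countdown from expiring:
wd = {"now": 0, "last_pet": 0}
fencing_iteration(True, True, wd)
wd["now"] = 8
fencing_iteration(True, True, wd)   # petted at t=8
# After losing quorum the node stops petting and eventually fences:
wd["now"] = 20
fencing_iteration(False, True, wd)  # no pet: 20 - 8 = 12 > 10
```

Because the reboot happens within a bounded interval after the last pet,
the manager knows that once it can take over a dead node's 'agent_lock'
(i.e. the lock's timeout elapsed), that node is no longer running its
HA resources.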