= Proxmox HA Manager =

Note that this README was written as early development planning/documentation
in 2015; even though small updates were made in 2023, it might be a bit out of
date. For usage documentation see the official reference docs shipped with your
Proxmox VE installation or, for your convenience, hosted at:
https://pve.proxmox.com/pve-docs/chapter-ha-manager.html

== History & Motivation ==

The `rgmanager` HA stack used in Proxmox VE 3.x has a bunch of drawbacks:

- no more development (Red Hat moved to Pacemaker)
- highly dependent on an old version of corosync
- complicated code (caused by a compatibility layer for the older cluster
  stack, cman)
- no self-fencing

For Proxmox VE 4.0 we thus required a new HA stack and also wanted to make HA
easier for our users while making it possible to move to the newest corosync,
or even a totally different cluster stack. So, the following core requirements
were set out:

- possibility to run with any distributed key/value store which provides some
  kind of locking with timeouts (ZooKeeper, Consul, etcd, ..)
- self fencing using the Linux watchdog device
- implemented in Perl, so that we can use the PVE framework
- only work with simple resources like VMs

We dropped the idea of assembling complex, dependent services, because we think
this is already done with the VM/CT abstraction.

== Architecture ==

Cluster requirements:

=== Cluster wide locks with timeouts ===

The cluster stack must provide cluster-wide locks with timeouts.
The Proxmox 'pmxcfs' implements this on top of corosync.
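
As a rough illustration of the required primitive, acquiring such a lock could
look like the sketch below. The `acquire_cluster_lock` helper and the returned
guard object are hypothetical placeholders, not the actual pmxcfs API:

----
# Minimal sketch, assuming a hypothetical acquire_cluster_lock() helper that
# takes a named, cluster-wide lock which expires after the given timeout.
my $lock_name = "ha_agent_node1_lock";
my $timeout   = 120;    # seconds until the lock expires on its own

if (my $guard = acquire_cluster_lock($lock_name, $timeout)) {
    # we own the lock until we release it or the timeout expires
    do_protected_work();
    $guard->release();
} else {
    warn "could not acquire '$lock_name', another node may hold it\n";
}
----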

=== Watchdog ===

We need a reliable watchdog mechanism, which is able to provide hard
timeouts. It must be guaranteed that the node reboots within the specified
timeout if we do not update the watchdog. As far as we can tell, neither
systemd nor the standard watchdog(8) daemon provides such guarantees.

We could use /dev/watchdog directly, but unfortunately this only
allows one user. We need to protect at least two daemons, so we write
our own watchdog daemon. This daemon works on /dev/watchdog, but
provides that service to several other daemons using a local socket.
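
For reference, the kernel interface such a daemon sits on top of is simple:
once /dev/watchdog is opened, the node is reset unless the device is written
to before the hardware timeout expires. The sketch below shows only this raw
kernel interface, not the actual multiplexing daemon or its socket protocol:

----
use strict;
use warnings;
use IO::Handle;

# Open the hardware watchdog; from now on the kernel will reset the node
# unless we keep writing to the device.
open(my $wd, '>', '/dev/watchdog')
    or die "unable to open watchdog device: $!\n";
$wd->autoflush(1);

for (1 .. 10) {          # a real daemon loops until a clean shutdown
    print {$wd} "\0";    # any write resets the hardware timer
    sleep(5);            # must stay well below the configured timeout
}

# "Magic close": ask the kernel to disarm the watchdog again (only honored
# if the driver is not configured with nowayout).
print {$wd} 'V';
close($wd);
----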

=== Self fencing ===

A node needs to acquire a special 'ha_agent_${node}_lock' (one separate
lock for each node) before starting HA resources, and the node updates
the watchdog device once it gets that lock. If the node loses quorum,
or is unable to get the 'ha_agent_${node}_lock', the watchdog is no
longer updated. The node can release the lock if there are no running
HA resources.

This makes sure that the node holds the 'ha_agent_${node}_lock' as
long as there are running services on that node.

The HA manager can assume that the watchdog triggered a reboot when it
is able to acquire the 'ha_agent_${node}_lock' for that node.
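
Putting the lock and the watchdog together, the agent side follows roughly the
pattern below. This is only a sketch: have_quorum(), try_acquire_agent_lock(),
update_watchdog() and run_pending_service_commands() are hypothetical helpers,
not the actual LRM code:

----
use strict;
use warnings;

my $node = 'node1';    # example node name

while (1) {
    if (have_quorum() && try_acquire_agent_lock("ha_agent_${node}_lock")) {
        # we are allowed to run HA services, so keep the watchdog happy
        update_watchdog();
        run_pending_service_commands();
    } else {
        # No quorum or no lock: stop updating the watchdog. If HA services
        # are still active here, the watchdog resets the node before another
        # node can acquire our agent lock.
    }
    sleep(10);    # loop interval, well below the lock and watchdog timeouts
}
----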

==== Problems with "two_node" Clusters ====

This corosync option depends on a fence race condition, and only
works with reliable HW fence devices.

The above 'self fencing' algorithm does not work if you use this option!

Note that you can instead use a QDevice, i.e., an external, simple arbiter
process (no full corosync membership, so relaxed networking requirements).

=== Testing Requirements ===

We want to be able to simulate and test the behavior of an HA cluster, using
either a GUI or a CLI. This makes it easier to learn how the system behaves. We
also need a way to run regression tests.

== Implementation details ==

=== Cluster Resource Manager (class PVE::HA::CRM) ===

The Cluster Resource Manager (CRM) daemon runs on each node, but
locking makes sure only one CRM daemon acts in the 'master' role. That
'master' daemon reads the service configuration file, and requests new
service states by writing the global 'manager_status'. That data
structure is read by the Local Resource Manager, which performs the
real work (start/stop/migrate services).
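
A rough sketch of the master part of this loop is shown below. All helpers
(try_acquire_manager_lock, read_service_config, read_manager_status,
compute_target_state, write_manager_status) are hypothetical placeholders; the
real logic lives in PVE::HA::CRM and the modules it uses:

----
use strict;
use warnings;

while (1) {
    if (try_acquire_manager_lock()) {         # only one node wins => 'master'
        my $conf   = read_service_config();   # desired services and their groups
        my $status = read_manager_status();   # cluster-wide view written so far

        for my $sid (keys %$conf) {
            # decide the next CRM state: started, stopped, migrate, fence, ...
            $status->{services}->{$sid}->{state} =
                compute_target_state($sid, $conf, $status);
        }
        write_manager_status($status);        # picked up by the LRM on each node
    }
    sleep(10);
}
----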

==== Service Relocation ====

Some services, like a QEMU Virtual Machine, support live migration.
So the LRM can migrate those services without stopping them (CRM service state
'migrate').

Most other service types require the service to be stopped, and then restarted
at the other node. Stopped services are moved by the CRM (usually by simply
changing the service configuration).
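
The choice between the two modes could be expressed like the sketch below,
where service_supports_live_migration() and set_crm_state() are hypothetical
helpers and not part of the actual API:

----
# Sketch only: pick live migration where the service type supports it,
# otherwise fall back to stop/move/start relocation.
sub move_service {
    my ($sid, $target_node) = @_;

    if (service_supports_live_migration($sid)) {
        set_crm_state($sid, 'migrate', $target_node);    # keep it running
    } else {
        set_crm_state($sid, 'relocate', $target_node);   # stop, move, start
    }
}
----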

==== Service Ordering and Co-location Constraints ====

There are no plans to implement this for the initial version, although it would
be possible and probably should be done for later versions.

==== Possible CRM Service States ====

stopped: Service is stopped (confirmed by LRM)

request_stop: Service should be stopped. Waiting for
        confirmation from LRM.

started: Service is active and the LRM should start it asap.

fence: Wait for node fencing (service node is not inside
        quorate cluster partition).

recovery: Service gets recovered to a new node as its current one was
        fenced. Note that a service might be stuck here depending on the
        group/priority configuration.

freeze: Do not touch. We use this state while we reboot a node,
        or when we restart the LRM daemon.

migrate: Migrate (live) service to other node.

relocate: Migrate (stop, move, start) service to other node.

error: Service disabled because of LRM errors.


There's also an `ignored` state which tells the HA stack to ignore a service
completely, i.e., as if it wasn't under HA control at all.
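
For illustration only, the states above can be thought of as a simple lookup
table like the sketch below; this is not necessarily how the manager stores
them internally:

----
# Sketch only: the possible CRM service states and a short description.
my $crm_service_states = {
    stopped      => 'stopped, confirmed by the LRM',
    request_stop => 'should be stopped, waiting for LRM confirmation',
    started      => 'active, LRM should start it asap',
    fence        => 'waiting for node fencing',
    recovery     => 'recover to a new node after the current one was fenced',
    freeze       => 'do not touch (node reboot or LRM restart)',
    migrate      => 'live-migrate to another node',
    relocate     => 'stop, move and start on another node',
    error        => 'disabled because of LRM errors',
    ignored      => 'not under HA control at all',
};

sub assert_valid_state {
    my ($state) = @_;
    die "unknown CRM service state '$state'\n"
        if !exists $crm_service_states->{$state};
}
----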

=== Local Resource Manager (class PVE::HA::LRM) ===

The Local Resource Manager (LRM) daemon runs on each node, and
performs service commands (start/stop/migrate) for services assigned
to the local node. It should be mentioned that each LRM holds a
cluster wide 'ha_agent_${node}_lock' lock, and the CRM is not allowed
to assign the service to another node while the LRM holds that lock.

The LRM reads the requested service state from 'manager_status', and
tries to bring the local service into that state. The actual service
status is written back to 'service_${node}_status', and can be
read by the CRM.
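
A sketch of this work loop is shown below; read_manager_status(),
services_assigned_to(), apply_service_state(), record_result() and
write_local_status() are hypothetical helpers, the real logic lives in
PVE::HA::LRM:

----
use strict;
use warnings;

my $local_node = 'node1';    # example node name

while (1) {
    my $manager_status = read_manager_status();    # written by the CRM master

    for my $sid (services_assigned_to($local_node, $manager_status)) {
        my $wanted = $manager_status->{services}->{$sid}->{state};
        my $result = apply_service_state($sid, $wanted);   # start/stop/migrate
        record_result($sid, $result);
    }

    # publish the results so the CRM can read them back
    write_local_status($local_node);
    sleep(10);
}
----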

=== Pluggable Interface for Cluster Environment (class PVE::HA::Env) ===

This class defines an interface to the actual cluster environment:

* get node membership and quorum information

* get/release cluster wide locks

* get system time

* watchdog interface

* read/write cluster wide status files

We have plugins for several different environments:

* PVE::HA::Sim::TestEnv: the regression test environment

* PVE::HA::Sim::RTEnv: the graphical simulator

* PVE::HA::Env::PVE2: the real Proxmox VE cluster
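
As a rough sketch, a minimal (test-style) environment plugin could look like
the following. The method names are illustrative only and not necessarily the
exact PVE::HA::Env interface:

----
package My::Fake::Env;

use strict;
use warnings;

# Sketch only: a stub environment covering the interface areas listed above;
# a real plugin talks to pmxcfs, corosync and the watchdog daemon.
sub new { my ($class) = @_; return bless { time => 0, locks => {} }, $class }

# node membership and quorum information
sub quorate { return 1 }

# cluster-wide locks (with timeout semantics in a real environment)
sub acquire_lock { my ($self, $name) = @_; $self->{locks}->{$name} = 1; return 1 }
sub release_lock { my ($self, $name) = @_; delete $self->{locks}->{$name}; return 1 }

# system time (a simulator may use virtual time here)
sub get_time { my ($self) = @_; return $self->{time} }

# watchdog interface (a no-op outside the real cluster environment)
sub watchdog_update { return 1 }

# cluster-wide status files
sub read_manager_status  { return {} }
sub write_manager_status { my ($self, $status) = @_; return 1 }

1;
----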