README

= Proxmox HA Manager =

== Motivation ==

The current HA manager has a bunch of drawbacks:

- no more development (Red Hat moved to pacemaker)

- highly dependent on an old version of corosync

- complicated code (caused by the compatibility layer with the
  older cluster stack, cman)

- no self-fencing

In the future, we want to make HA easier for our users, and it should
be possible to move to the newest corosync, or even a totally different
cluster stack. So we want:

- the possibility to run with any distributed key/value store which
  provides some kind of locking with timeouts (zookeeper, consul,
  etcd, ..)

- self-fencing using the Linux watchdog device

- an implementation in Perl, so that we can use the PVE framework

- to only work with simple resources like VMs

We dropped the idea of assembling complex, dependent services, because
we think this is already done with the VM abstraction.

= Architecture =

== Cluster requirements ==

=== Cluster wide locks with timeouts ===

The cluster stack must provide cluster wide locks with timeouts.
The Proxmox 'pmxcfs' implements this on top of corosync.

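The locking requirement can be illustrated with a small in-memory
model. This is only a sketch in Python, not the pmxcfs implementation:
the class name TimedLockTable, its methods, and the injected clock are
assumptions made for illustration. The point it shows is that a lock
which is not renewed becomes available to other owners once its
timeout has passed.

```python
class TimedLockTable:
    """In-memory stand-in for a cluster wide lock service (hypothetical)."""

    def __init__(self, timeout, clock):
        self.timeout = timeout   # seconds a lock stays valid without renewal
        self.clock = clock       # callable returning the current time
        self.locks = {}          # lock name -> (owner, expiry time)

    def acquire(self, name, owner):
        """Take or renew a lock; return True if 'owner' holds it afterwards."""
        now = self.clock()
        holder = self.locks.get(name)
        if holder is not None:
            current_owner, expires = holder
            if current_owner != owner and now < expires:
                return False     # another owner holds an unexpired lock
        self.locks[name] = (owner, now + self.timeout)
        return True

    def release(self, name, owner):
        """Drop the lock if 'owner' is the recorded holder."""
        holder = self.locks.get(name)
        if holder is not None and holder[0] == owner:
            del self.locks[name]
```

Calling acquire() again as the current owner renews the lock, so a
healthy node keeps it by calling acquire() periodically.
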
=== Watchdog ===

We need a reliable watchdog mechanism, which is able to provide hard
timeouts. It must be guaranteed that the node reboots within the
specified timeout if we do not update the watchdog. As far as we can
tell, neither systemd nor the standard watchdog(8) daemon provides such
guarantees.

We could use /dev/watchdog directly, but unfortunately this only
allows one user. We need to protect at least two daemons, so we write
our own watchdog daemon. This daemon works on /dev/watchdog, but
provides that service to several other daemons using a local socket.

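The multiplexing idea can be sketched as follows. This Python model
leaves out the socket and the real device; the class WatchdogMux, the
callback pet_hardware, and the client names are hypothetical. It also
assumes one particular policy that the text above does not spell out:
as soon as any connected client misses its timeout, the hardware
watchdog is never updated again, so the node reboots.

```python
class WatchdogMux:
    """Shares one hardware watchdog between several clients (sketch)."""

    def __init__(self, client_timeout, clock, pet_hardware):
        self.client_timeout = client_timeout  # max seconds between client updates
        self.clock = clock                    # callable returning the current time
        self.pet_hardware = pet_hardware      # callable that updates the real device
        self.last_update = {}                 # client id -> time of last update
        self.expired = False                  # set once any client timed out

    def client_update(self, client):
        """Record a keep-alive from a connected client."""
        self.last_update[client] = self.clock()

    def client_close(self, client):
        """A client disconnected cleanly and no longer needs protection."""
        self.last_update.pop(client, None)

    def tick(self):
        """Periodic check; returns True if the hardware watchdog was updated."""
        now = self.clock()
        if any(now - seen > self.client_timeout
               for seen in self.last_update.values()):
            self.expired = True
        if self.expired:
            return False   # stop updating: the hardware will reset the node
        self.pet_hardware()
        return True
```

A client that wants to stop being protected must call client_close()
before it stops sending updates.
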
== Self fencing ==

A node needs to acquire a special 'ha_agent_${node}_lock' (one separate
lock for each node) before starting HA resources, and the node updates
the watchdog device once it gets that lock. If the node loses quorum,
or is unable to get the 'ha_agent_${node}_lock', the watchdog is no
longer updated. The node can release the lock if there are no running
HA resources.

This makes sure that the node holds the 'ha_agent_${node}_lock' as
long as there are running services on that node.

The HA manager can assume that the watchdog triggered a reboot when it
is able to acquire the 'ha_agent_${node}_lock' for that node.

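The rules above can be written down as two small functions. This is a
Python sketch under stated assumptions: the function names and the
callbacks has_quorum, try_lock and update_watchdog are hypothetical,
and only the lock name pattern is taken from the text.

```python
def agent_step(node, has_quorum, try_lock, update_watchdog):
    """One iteration of the self fencing loop on 'node'.

    Returns True if the watchdog was updated in this iteration.
    """
    lock_name = f"ha_agent_{node}_lock"
    if has_quorum() and try_lock(lock_name):
        update_watchdog()
        return True
    # No quorum or no lock: do not update the watchdog, so the node is
    # rebooted once the watchdog timeout expires.
    return False


def node_is_fenced(node, try_lock):
    """Manager side: a node counts as fenced once its lock can be taken."""
    return try_lock(f"ha_agent_{node}_lock")
```

The manager side only works if the lock timeout is long enough that the
watchdog has certainly fired before the lock becomes free.
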
=== Problems with "two_node" Clusters ===

This corosync option depends on a fence race condition, and only
works with reliable HW fence devices.

The 'self fencing' algorithm above does not work if you use this option!

== Testing requirements ==

We want to be able to simulate an HA cluster, using a GUI. This makes it
easier to learn how the system behaves. We also need a way to run
regression tests.

= Implementation details =

== Cluster Resource Manager (class PVE::HA::CRM) ==

The Cluster Resource Manager (CRM) daemon runs on each node, but
locking makes sure only one CRM daemon acts in the 'master' role. That
'master' daemon reads the service configuration file, and requests new
service states by writing the global 'manager_status'. That data
structure is read by the Local Resource Manager, which performs the
real work (start/stop/migrate services).

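A single CRM iteration might look like the following Python sketch. It
is not the PVE::HA::CRM code: the lock name 'ha_manager_lock', the
callbacks, the service ID 'vm:100' used in examples, and the layout of
the status dictionary are all assumptions made for illustration.

```python
def crm_step(node, try_lock, read_service_config, write_manager_status):
    """Run one CRM iteration; return True if this node acted as master."""
    # 'ha_manager_lock' is a hypothetical name for the master lock.
    if not try_lock("ha_manager_lock"):
        return False                 # another CRM is master; do nothing

    # Assumed config layout: service id -> {"node": ..., "state": ...}
    config = read_service_config()
    status = {
        "master_node": node,
        "service_status": {
            sid: {"node": cfg["node"], "state": cfg["state"]}
            for sid, cfg in config.items()
        },
    }
    write_manager_status(status)     # the LRMs read this structure
    return True
```

Every node can call crm_step() in a loop; the lock decides which call
actually writes 'manager_status'.
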
=== Service Relocation ===

Some services, like Qemu Virtual Machines, support live migration.
So the LRM can migrate those services without stopping them (CRM
service state 'migrate').

Most other service types require the service to be stopped, and then
restarted on the other node. Stopped services are moved by the CRM
(usually by simply changing the service configuration).

=== Service ordering and colocation constraints ===

So far there are no plans to implement this (although it would be possible).

=== Possible CRM Service States ===

stopped:      Service is stopped (confirmed by LRM).

request_stop: Service should be stopped. Waiting for
              confirmation from LRM.

started:      Service is active and the LRM should start it asap.

fence:        Wait for node fencing (service node is not inside
              quorate cluster partition).

freeze:       Do not touch. We use this state while we reboot a node,
              or when we restart the LRM daemon.

migrate:      Migrate (live) service to other node.

error:        Service disabled because of LRM errors.


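The state names above map naturally onto an enumeration. The following
Python sketch also encodes the relocation rule from the 'Service
Relocation' section. The names ServiceState and relocation_state are
hypothetical, and refusing relocation in the remaining states is an
assumption of this sketch, not documented behaviour.

```python
from enum import Enum


class ServiceState(Enum):
    STOPPED = "stopped"
    REQUEST_STOP = "request_stop"
    STARTED = "started"
    FENCE = "fence"
    FREEZE = "freeze"
    MIGRATE = "migrate"
    ERROR = "error"


def relocation_state(current, supports_live_migration):
    """State the CRM would request in order to move a service."""
    if current is ServiceState.STOPPED:
        # Already stopped: the CRM only changes the service configuration.
        return ServiceState.STOPPED
    if current is ServiceState.STARTED:
        if supports_live_migration:
            return ServiceState.MIGRATE      # move without stopping
        return ServiceState.REQUEST_STOP     # stop first, restart elsewhere
    # Assumption: no relocation is attempted in any other state.
    raise ValueError(f"cannot relocate a service in state '{current.value}'")
```

Using the string values from the list keeps the enumeration compatible
with status data that stores the state as plain text.
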
== Local Resource Manager (class PVE::HA::LRM) ==

The Local Resource Manager (LRM) daemon runs on each node, and
performs service commands (start/stop/migrate) for services assigned
to the local node. Each LRM holds a cluster wide
'ha_agent_${node}_lock' lock, and the CRM is not allowed to assign the
service to another node while the LRM holds that lock.

The LRM reads the requested service state from 'manager_status', and
tries to bring the local service into that state. The actual service
status is written back to the 'service_${node}_status', and can be
read by the CRM.

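The LRM side can be sketched in the same way. Again this is Python for
illustration only: the dictionary layout of 'manager_status', the
mapping from states to commands, and the run_command callback are
assumptions. The returned dictionary stands for what would be written
to 'service_${node}_status'.

```python
# Requested CRM state -> command the LRM runs (assumed mapping).
COMMAND_FOR_STATE = {
    "started": "start",
    "request_stop": "stop",
    "migrate": "migrate",
}


def lrm_step(node, manager_status, run_command):
    """Apply the requested states for services assigned to 'node'."""
    local_status = {}
    for sid, requested in manager_status["service_status"].items():
        if requested["node"] != node:
            continue                  # service belongs to another node
        command = COMMAND_FOR_STATE.get(requested["state"])
        if command is None:
            continue                  # e.g. 'stopped' or 'freeze': nothing to do
        ok = run_command(command, sid)
        local_status[sid] = {"command": command, "ok": ok}
    return local_status
```

Services assigned to other nodes are skipped, which matches the rule
that an LRM only acts on services of its own node.
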
== Pluggable Interface for cluster environment (class PVE::HA::Env) ==

This class defines an interface to the actual cluster environment:

* get node membership and quorum information

* get/release cluster wide locks

* get system time

* watchdog interface

* read/write cluster wide status files

We have plugins for several different environments:

* PVE::HA::Sim::TestEnv: the regression test environment

* PVE::HA::Sim::RTEnv: the graphical simulator

* PVE::HA::Env::PVE2: the real Proxmox VE cluster

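The shape of such an interface can be illustrated with an abstract
base class. This Python sketch does not reproduce the PVE::HA::Env
API: all method names are hypothetical and only mirror the list of
responsibilities above. SimEnv is a toy single-node environment with a
manually advanced clock, similar in spirit to a test environment.

```python
from abc import ABC, abstractmethod


class Env(ABC):
    """Interface between the HA daemons and a cluster environment."""

    @abstractmethod
    def quorate(self): ...

    @abstractmethod
    def get_lock(self, name): ...

    @abstractmethod
    def release_lock(self, name): ...

    @abstractmethod
    def get_time(self): ...

    @abstractmethod
    def update_watchdog(self): ...

    @abstractmethod
    def read_status(self, name): ...

    @abstractmethod
    def write_status(self, name, data): ...


class SimEnv(Env):
    """Simulated environment: everything lives in memory."""

    def __init__(self):
        self.now = 0                  # simulated time, advanced by the test
        self.has_quorum = True
        self.locks = set()
        self.status = {}
        self.watchdog_updates = []    # times at which the watchdog was updated

    def quorate(self):
        return self.has_quorum

    def get_lock(self, name):
        if not self.has_quorum:
            return False              # no locks without quorum
        self.locks.add(name)
        return True

    def release_lock(self, name):
        self.locks.discard(name)

    def get_time(self):
        return self.now

    def update_watchdog(self):
        self.watchdog_updates.append(self.now)

    def read_status(self, name):
        return self.status.get(name)

    def write_status(self, name, data):
        self.status[name] = data
```

Because the daemons only talk to the interface, the same CRM and LRM
logic can run against the simulator and against the real cluster.
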
 160