author     Thomas Lamprecht <t.lamprecht@proxmox.com>
           Fri, 4 Oct 2019 17:35:15 +0000 (19:35 +0200)
committer  Thomas Lamprecht <t.lamprecht@proxmox.com>
           Mon, 25 Nov 2019 18:46:09 +0000 (19:46 +0100)
commit     99278e06a8f1fc9266e24538ee0c1c914cdb68f8
tree       28fd7fd873cf2051726fc1469e22192e8e002b22
parent     5c2eef4b9ede36093d10801d038df6455f46bb26
add 'migrate' node shutdown policy

This adds handling for a new shutdown policy, namely "migrate".
If it is set, the LRM doesn't queue stop jobs but transitions to a
new mode, namely 'maintenance'.
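
As a rough illustration of that idea (a simplified sketch, not the
actual LRM.pm code; the helper name and the mode/policy mapping for the
pre-existing policies are assumptions here), the configured policy
essentially decides which mode the LRM enters on a shutdown request:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Sketch: map the configured shutdown policy to the LRM mode used
    # while the node goes down. Only 'migrate' results in the new
    # 'maintenance' mode; the others keep the previous freeze/stop logic.
    sub shutdown_mode_for_policy {
        my ($policy, $is_reboot) = @_;

        return 'restart' if $policy eq 'freeze';
        return 'shutdown' if $policy eq 'failover';
        return $is_reboot ? 'restart' : 'shutdown' if $policy eq 'conditional';
        return 'maintenance' if $policy eq 'migrate';

        die "unknown shutdown policy '$policy'\n";
    }

    print shutdown_mode_for_policy('migrate', 0), "\n"; # -> maintenance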

The LRM modes now get passed from the CRM to the NodeStatus update
method, which allows detecting such a mode and making node-status
state transitions. Effectively, we only allow the transition if the
node is currently online, else it is ignored. 'maintenance' does not
protect from fencing.
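
A minimal sketch of that transition (not the actual NodeStatus.pm
code; the function name and data layout are assumptions): only an
online node may enter 'maintenance', anything else is left untouched:

    use strict;
    use warnings;

    # Sketch: node-status transition driven by the LRM mode the CRM
    # passes in. Only 'online' -> 'maintenance' is allowed; a node in
    # any other state ignores the maintenance request, and the
    # 'maintenance' state itself gives no protection from fencing.
    sub update_node_state_sketch {
        my ($node_status, $node, $lrm_mode) = @_;

        my $state = $node_status->{$node} // 'unknown';

        if (defined($lrm_mode) && $lrm_mode eq 'maintenance' && $state eq 'online') {
            $node_status->{$node} = 'maintenance';
        }

        return $node_status->{$node};
    }

    my $status = { node1 => 'online', node2 => 'unknown' };
    update_node_state_sketch($status, 'node1', 'maintenance'); # -> maintenance
    update_node_state_sketch($status, 'node2', 'maintenance'); # stays 'unknown'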

The actual moving is then done by select_service_node. A node in
maintenance mode is not in "list_online_nodes" and thus also not in
online_node_usage, which is used to re-calculate whether a service
needs to be moved. Only started services get moved; this can be done
almost entirely by leveraging existing behavior, the
next_state_started FSM state transition method just needs to be taught
to not return early for nodes which are not online but in maintenance
mode.
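
The relevant change can be pictured roughly like this (an illustrative
sketch, not the actual Manager.pm code; the helper names, parameters
and the returned command structure are assumptions):

    use strict;
    use warnings;

    # Sketch of the next_state_started idea: instead of returning early
    # for every node that is not online (and waiting for fencing), a node
    # in maintenance mode falls through to the re-placement logic, so
    # select_service_node() can pick a target among the online nodes.
    sub next_state_started_sketch {
        my ($node_status, $sid, $sd, $select_service_node) = @_;

        my $state = $node_status->{ $sd->{node} } // 'unknown';

        if ($state ne 'online') {
            # old behavior for dead/unknown nodes: bail out, wait for fencing
            return if $state ne 'maintenance';
        }

        my $target = $select_service_node->($sid, $sd->{node});
        if (defined($target) && $target ne $sd->{node}) {
            return { state => 'migrate', node => $sd->{node}, target => $target };
        }

        return; # no other node found, the manager simply retries next round
    }

    my $pick = sub { return 'node2' }; # trivial stand-in for select_service_node
    my $cmd = next_state_started_sketch(
        { node1 => 'maintenance', node2 => 'online' }, 'vm:100', { node => 'node1' }, $pick);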

A few tests, adapted from the other policy tests, are added to
showcase the behavior with reboot, shutdown, and shutdown of the
current manager. They also show the behavior when a service cannot be
migrated; as our test system can only simulate a maximum of 9
migration failures, the migration "seems" to succeed after that. Note
that in reality the maximum retry count would have been hit much
earlier, so this is just an artifact of our test system.
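
For reference, the simulated test cases follow the usual layout of
this repository; the concrete file contents below are only an
illustrative guess at what such a test looks like, not a copy of the
added files:

    src/test/test-shutdown-policy3/
        cmdlist          # e.g. [ [ "power node1 on", "power node2 on",
                         #          "power node3 on" ], [ "shutdown node3" ] ]
        datacenter.cfg   # e.g. { "shutdown_policy": "migrate" }
        hardware_status  # simulated node definitions
        service_config   # HA services and their initial nodes/states
        manager_status   # initial manager state
        log.expect       # expected simulator log output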

Besides some implementation details, two questions are still not
solved by this approach:
* what if a service cannot be moved away, either due to errors or
  because no alternative node is found by select_service_node
  - retry indefinitely; this is what currently happens. The user set
    this up like this in the first place. We will order SSH and
    pveproxy after the LRM service to ensure that there's still the
    possibility for manual intervention
  - an idea would be to track the time and see if we're stuck (this is
    not too hard); in such a case we could stop the services after X
    minutes and continue.
* a full cluster shutdown; but that is already not ideal even without
  this mode, as nodes get fenced once no partition is quorate anymore.
  And as long as it's just a central setting in the DC config, an
  admin has a single switch to flip to make it work, so I'm not sure
  how much handling we want to do here; once we're past the point
  where we have no quorum we're dead anyhow. So this is at least not
  really an issue of this series; related but orthogonal, not more.

For real-world usability the datacenter.cfg schema needs to be changed
to allow the 'migrate' shutdown policy, but that's trivial.
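
For illustration, that change boils down to adding 'migrate' to the
shutdown_policy enum in the datacenter.cfg schema (which lives in
pve-cluster); the property definition below is only a hedged sketch,
its description and default are assumptions, not the real definition:

    use strict;
    use warnings;

    # Sketch of the shutdown_policy property in the datacenter.cfg schema.
    my $shutdown_policy_property = {
        type => 'string',
        enum => [ 'freeze', 'failover', 'conditional', 'migrate' ],
        default => 'conditional',
        optional => 1,
        description => "Policy for handling HA services when a node shuts down.",
    };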

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
27 files changed:
src/PVE/HA/LRM.pm
src/PVE/HA/Manager.pm
src/PVE/HA/NodeStatus.pm
src/test/test-shutdown-policy-migrate-fail1/cmdlist [new file with mode: 0644]
src/test/test-shutdown-policy-migrate-fail1/datacenter.cfg [new file with mode: 0644]
src/test/test-shutdown-policy-migrate-fail1/hardware_status [new file with mode: 0644]
src/test/test-shutdown-policy-migrate-fail1/log.expect [new file with mode: 0644]
src/test/test-shutdown-policy-migrate-fail1/manager_status [new file with mode: 0644]
src/test/test-shutdown-policy-migrate-fail1/service_config [new file with mode: 0644]
src/test/test-shutdown-policy3/cmdlist [new file with mode: 0644]
src/test/test-shutdown-policy3/datacenter.cfg [new file with mode: 0644]
src/test/test-shutdown-policy3/hardware_status [new file with mode: 0644]
src/test/test-shutdown-policy3/log.expect [new file with mode: 0644]
src/test/test-shutdown-policy3/manager_status [new file with mode: 0644]
src/test/test-shutdown-policy3/service_config [new file with mode: 0644]
src/test/test-shutdown-policy4/cmdlist [new file with mode: 0644]
src/test/test-shutdown-policy4/datacenter.cfg [new file with mode: 0644]
src/test/test-shutdown-policy4/hardware_status [new file with mode: 0644]
src/test/test-shutdown-policy4/log.expect [new file with mode: 0644]
src/test/test-shutdown-policy4/manager_status [new file with mode: 0644]
src/test/test-shutdown-policy4/service_config [new file with mode: 0644]
src/test/test-shutdown-policy5/cmdlist [new file with mode: 0644]
src/test/test-shutdown-policy5/datacenter.cfg [new file with mode: 0644]
src/test/test-shutdown-policy5/hardware_status [new file with mode: 0644]
src/test/test-shutdown-policy5/log.expect [new file with mode: 0644]
src/test/test-shutdown-policy5/manager_status [new file with mode: 0644]
src/test/test-shutdown-policy5/service_config [new file with mode: 0644]