git.proxmox.com Git - pve-ha-manager.git/commit
fix #3415: never switch in error state on recovery, try harder
author Thomas Lamprecht <t.lamprecht@proxmox.com>
Fri, 2 Jul 2021 15:32:42 +0000 (17:32 +0200)
committer Thomas Lamprecht <t.lamprecht@proxmox.com>
Fri, 2 Jul 2021 18:08:12 +0000 (20:08 +0200)
commit 90a247552cc27d84f13c31a5dfa560ee9ae10af6
tree 48e73331a167e8e67927a36bd41443b3cdb60555
parent bdbd9b2ba5aa589c084355024097815d22d1093d
fix #3415: never switch in error state on recovery, try harder

With the new 'recovery' state introduced in a previous commit we get
a clean transition, and thus an actual difference, between
to-be-fenced and fenced.

Use that to avoid going into the error state when we did not find
any possible new node we could recover the service to.
That can happen if the user uses the HA manager for local services,
which is an OK use case as long as the service is restricted to a
group with only that node. But previously we could never recover
such services if their node failed, as they were always put into the
"error" dummy/final state.
That was just artificially limiting ourselves for a false sense of
safety.
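
To illustrate the intended behavior, here is a minimal, self-contained
toy sketch in plain Perl. It is not the actual Manager.pm code: the sub
names, the service hash layout and the node selection below are made-up
stand-ins. The point it shows is that when no recovery node is found
the service simply stays in 'recovery' and is retried on the next
iteration, instead of being moved to 'error':

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Toy stand-in for the node selection; in the real manager a proper
    # group/usage-aware helper picks the target node.
    sub pick_recovery_node {
        my ($online_allowed_nodes) = @_;
        return $online_allowed_nodes->[0];
    }

    sub next_state_recovery {
        my ($service, $online_allowed_nodes) = @_;

        my $recovery_node = pick_recovery_node($online_allowed_nodes);

        if (defined($recovery_node)) {
            $service->{node}  = $recovery_node;
            $service->{state} = 'started';
            print "recovered service '$service->{sid}' to node '$recovery_node'\n";
        } else {
            # no candidate found: stay in 'recovery' and retry on the
            # next manager iteration, never fall through to 'error'
            print "no recovery node for '$service->{sid}' found yet, retrying\n";
        }
    }

    my $svc = { sid => 'vm:101', state => 'recovery', node => 'node3' };
    next_state_recovery($svc, []);          # restricted node still fenced -> retry
    next_state_recovery($svc, ['node3']);   # node is back online -> recover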

Nobody touches a service while it's in the recovery state, neither
the LRM nor anything else (any normal API call just gets routed
through the HA stack anyway), so there's no chance of a bad
double-start of the same service, with resource access collisions and
all the other bad stuff that could happen. Note that in practice this
only matters for restricted services, which normally use only local
resources, so there it wouldn't even matter if it wasn't already safe
- but it is, doubly so!

So, the usual transition guarantees still hold:
* only the current master does transitions
* there needs to be an OK, quorate partition to have a master

And, for getting into recovery the following holds:
* the old node's lock was acquired by the master, which means it was
  (self-)fenced -> resource not running

So, as "recovery" is a no-op state that we only get into once the
node was fenced, we can continue with recovery, i.e., try to find a
new node for the failed services.
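
A toy model of that guarantee, again a sketch with made-up names
rather than the real manager code: the transition from 'fence' to
'recovery' only happens once the master holds the failed node's agent
lock, so the old instance can no longer be running when recovery
starts.

    #!/usr/bin/perl
    use strict;
    use warnings;

    sub try_fence_to_recovery {
        my ($service, $master_holds_node_lock) = @_;

        if ($master_holds_node_lock) {
            # the failed node's lock was acquired by the master, so the
            # node was (self-)fenced and the resource cannot run anymore
            $service->{state} = 'recovery';
        }
        # else: stay in 'fence' and re-check on the next manager iteration
        return $service->{state};
    }

    my $svc = { sid => 'vm:102', state => 'fence', node => 'node3' };
    print try_fence_to_recovery($svc, 0), "\n";   # lock not yet held -> 'fence'
    print try_fence_to_recovery($svc, 1), "\n";   # node fenced -> 'recovery'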

Tests:
* adapt the existing recovery test output to match the endless retry
  for finding a new node (vs. the previous "go into error
  immediately" behavior)
* add a test where the node comes up again eventually, so that we
  also cover recovery to the same node the service was on prior to
  the failure (see the illustrative cmdlist sketch below)
* add a test with a non-empty start state where the restricted,
  failed node is online again. This ensures that the service won't
  get started until the HA manager has actively recovered it, even if
  it stays on that node.
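
For the "node comes up again eventually" scenario, a cmdlist along the
following lines would drive the simulation. This is purely illustrative
and assumes the HA simulator's usual JSON list-of-command-rounds
format; the actual test-recovery2/cmdlist may differ:

    [
        [ "power node1 on", "power node2 on", "power node3 on" ],
        [ "network node3 off" ],
        [ "delay 120" ],
        [ "network node3 on" ]
    ]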

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
18 files changed:
src/PVE/HA/LRM.pm
src/PVE/HA/Manager.pm
src/test/test-recovery1/README
src/test/test-recovery1/log.expect
src/test/test-recovery2/README [new file with mode: 0644]
src/test/test-recovery2/cmdlist [new file with mode: 0644]
src/test/test-recovery2/groups [new file with mode: 0644]
src/test/test-recovery2/hardware_status [new file with mode: 0644]
src/test/test-recovery2/log.expect [new file with mode: 0644]
src/test/test-recovery2/manager_status [new file with mode: 0644]
src/test/test-recovery2/service_config [new file with mode: 0644]
src/test/test-recovery3/README [new file with mode: 0644]
src/test/test-recovery3/cmdlist [new file with mode: 0644]
src/test/test-recovery3/groups [new file with mode: 0644]
src/test/test-recovery3/hardware_status [new file with mode: 0644]
src/test/test-recovery3/log.expect [new file with mode: 0644]
src/test/test-recovery3/manager_status [new file with mode: 0644]
src/test/test-recovery3/service_config [new file with mode: 0644]