fix #3415: never switch into error state on recovery, try harder
With the new 'recovery' state introduced in the previous commit, we
get a clean transition, and thus an actual distinction, between
to-be-fenced and fenced.
Use that to avoid going into the error state when we did not find any
possible new node we could recover the service to.
That can happen if the user uses the HA manager for local services,
which is an OK use-case as long as the service is restricted to a
group with only that node. But until now we could never recover such
services if their node failed, as they were always put into the
"error" dummy/final state.
But that's just artificially limiting ourselves to get a false sense
of safety.
Nobody touches the service while it's in the recovery state, neither
the LRM nor anything else (as any normal API call just gets routed to
the HA stack anyway), so there's no chance that we get a bad
double-start of the same service, with resource access collisions and
all the bad stuff that could happen. Note that this will in practice
only matter for restricted services, which normally only use local
resources, so here it wouldn't even matter if it weren't safe
already. But it is, so we are doubly safe.
So, the usual transition guarantees still hold:
* only the current master does transitions
* there needs to be an OK quorate partition to have a master
And, for getting into recovery the following holds:
* the old node's lock was acquired by the master, which means it was
(self-)fenced -> resource not running
So, as "recovery" is a no-op state that we only get into once the
node was fenced, we can continue recovery, i.e., keep trying to find
a new node for the failed services.
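
The retry behaviour described above can be illustrated with a minimal
sketch. It is not the actual manager code; the names `Service`,
`allowed_nodes`, `select_recovery_node` and `recovery_step`, as well
as the "started" target state, are assumptions made up for this
example. It only shows the idea: a service in "recovery" stays there
until a usable node shows up, instead of being moved to "error".

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Service:
    sid: str
    state: str                 # e.g. "fence", "recovery", "started"
    node: str                  # node the service is currently assigned to
    allowed_nodes: frozenset   # hypothetical: nodes of a restricted group


def select_recovery_node(service: Service, online_nodes: Set[str]) -> Optional[str]:
    """Pick any online node the service is allowed to run on, if one exists."""
    candidates = sorted(service.allowed_nodes & set(online_nodes))
    return candidates[0] if candidates else None


def recovery_step(service: Service, online_nodes: Set[str]) -> Service:
    """One manager round for a service in the 'recovery' state.

    Assumes the caller is the current master of a quorate partition and
    that the old node was fenced before the service entered 'recovery'.
    """
    if service.state != "recovery":
        return service

    target = select_recovery_node(service, online_nodes)
    if target is None:
        # No candidate node: stay in 'recovery' and retry next round,
        # rather than switching to 'error'.
        return service

    service.node = target
    service.state = "started"  # assumed target state for this sketch
    return service
```

With a service restricted to a single node, repeated calls leave it in
"recovery" while that node is offline, and recover it onto the same
node once it is back online.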
Tests:
* adapt the existing recovery test output to match the endless retry
for finding a new node (vs. the previous "go into error immediately")
* add a test where the node comes up eventually, so that we also
cover recovery to the same node the service was on before the failure
* add a test with a non-empty start-state, where the restricted
failed node is online again. This ensures that the service won't get
started until the HA manager has actively recovered it, even if it
stays on that node.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>