git.proxmox.com Git - pve-ha-manager.git/commit

author	Thomas Lamprecht <t.lamprecht@proxmox.com>
	Wed, 22 Nov 2017 10:53:07 +0000 (11:53 +0100)
committer	Thomas Lamprecht <t.lamprecht@proxmox.com>
	Tue, 30 Jan 2018 08:33:16 +0000 (09:33 +0100)
commit	8e940b68f950f68e49aaf389e2f85ffe0c6bcb5f
tree	e4f5ebc99142f7a1d12c3ea5e7482733ffe36c4e
parent	ba2a45cd9d39116044fef403e09c97d58ea15f4c

lrm: handle an error during service_status update

We may get an error here if the cluster filesystem is (temporarily)
unavailable. This error resulted in stopping the whole CRM service
immediately, which then triggered a node reset (if it happened on the
current master), even if we still had time left to retry and thus,
for example, handle an update of pve-cluster gracefully.

Add a method which wraps the status read in an eval and logs any
error that occurs, but does not abort the service. Instead we rely on
our get_protected_ha_agent_lock method to detect a problem and switch
to the lost_agent_lock state.
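
The wrapper described above can be sketched roughly as follows. This is
an illustrative Python analogue, not the Perl code of this commit: the
class name, the exception type, the callable passed in, and the log
wording are all assumptions made for the example. It only shows the
idea of catching a failed status read, logging it, and reporting the
failure to the caller instead of letting it abort the service.

```python
import logging

log = logging.getLogger("lrm-sketch")


class ClusterFsUnavailable(Exception):
    """Hypothetical error raised while the cluster filesystem is down."""


class Lrm:
    """Minimal stand-in for the resource manager (hypothetical)."""

    def __init__(self, read_manager_status):
        # read_manager_status: callable returning a dict; it may raise
        # while the cluster filesystem is unavailable.
        self.read_manager_status = read_manager_status
        self.service_status = {}

    def update_service_status(self):
        """Refresh the cached service status.

        On error, log it and return False instead of propagating the
        exception; the previously cached status is left untouched.
        """
        try:
            ms = self.read_manager_status()
        except Exception as err:
            log.error("service status update failed: %s", err)
            return False
        self.service_status = ms.get("service_status") or {}
        return True
```

The boolean return value lets the caller decide whether to go on with
the current round, while lock handling stays responsible for detecting
a longer outage.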

If the pmxcfs outage was really short, so that the manager status
read failed but the lock update worked again, we also always update
the service status before doing real work when in the 'active' state.
If this update fails we return from the eval and try again next
round, as there is no point in doing anything without a consistent
state.
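
The 'active' state handling described above might look like the
following sketch. It is again a Python illustration under stated
assumptions: the function name, its parameters, and the state string
comparison are hypothetical and do not mirror the actual Perl code.

```python
def do_one_round(state, update_service_status, manage_resources):
    """Run one work round; return True only if real work was done.

    All names here are hypothetical. update_service_status is expected
    to return False when the status read failed.
    """
    if state != "active":
        return False

    # Refresh the status right before working: an earlier read in this
    # round may have failed during a short outage even though the lock
    # update succeeded afterwards.
    if not update_service_status():
        # No consistent state available, skip and retry next round.
        return False

    manage_resources()
    return True
```

A failed refresh skips only the current round, so a brief outage costs
one iteration rather than stopping the service.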

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>