git.proxmox.com Git - pve-ha-manager.git/commit

author	Thomas Lamprecht <t.lamprecht@proxmox.com>
	Mon, 12 Sep 2016 09:28:17 +0000 (11:28 +0200)
committer	Dietmar Maurer <dietmar@proxmox.com>
	Mon, 12 Sep 2016 10:57:45 +0000 (12:57 +0200)
commit	bf2d8d7498026c396c76a9659207d30f232ad327
tree	6830b6e63126827172d4d1c69c5a68488cb23036
parent	8a126e581ff01de62ba831edd53b9830e2b21c2a

fix race condition on slow resource commands in started state

When we fixed the dangling state machine - where one command request
from the CRM could result in multiple executions of said command by
the LRM - by ensuring in the LRM that UID-identified actions get
started only once per UID (except the stop command), we introduced a
bug which can result in a lost LRM result from a failed service
start.

The reason for this is that we generated a new UID for the started
state every CRM turn, so that a service gets restarted if it
crashes. But as we do this without checking if the LRM has finished
our last request, we may lose the result of this last request.
As an example, consider the following timeline of events:
1. CRM requests start of service 'ct:100'
2. LRM starts this request, which takes a bit longer
3. Before the LRM worker finishes, the CRM does an iteration and
   generates a new UID for this service
4. The LRM worker finishes but cannot write back its result, as
   the UID doesn't exist anymore in the manager's service status
5. The CRM gets another round and generates a new UID for 'ct:100'
6. The cycle begins again; the LRM always throws away its last
   result as the CRM wrongfully generated a new command
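
The timeline above can be sketched as a small simulation. This is a
simplified Python model, not the actual pve-ha-manager code: the
function names, the dictionary layout and the UID format are all
assumptions made for illustration.

```python
import itertools

# Hypothetical UID generator; the real UID format is not shown in this commit.
_uid_counter = itertools.count(1)


def new_uid():
    return f"uid-{next(_uid_counter)}"


def crm_turn_buggy(service_status, sid):
    """Model of the pre-fix CRM turn: in 'started' state a fresh UID is
    generated unconditionally, without looking at any LRM result."""
    service_status[sid]["uid"] = new_uid()


def lrm_write_result(service_status, lrm_results, sid, uid, exit_code):
    """Model of the LRM worker writing back: the result is only stored
    if the UID it worked on is still the current one."""
    if service_status[sid]["uid"] != uid:
        return False  # UID vanished, the result is thrown away
    lrm_results[uid] = exit_code
    return True


def simulate_lost_result():
    sid = "ct:100"
    # Step 1: CRM requests the start.
    service_status = {sid: {"state": "started", "uid": new_uid()}}
    lrm_results = {}
    # Step 2: the LRM picks up the request and remembers its UID.
    uid_seen_by_lrm = service_status[sid]["uid"]
    # Step 3: the CRM iterates before the worker has finished.
    crm_turn_buggy(service_status, sid)
    # Step 4: the worker finishes with a failure (exit code 1).
    stored = lrm_write_result(service_status, lrm_results, sid,
                              uid_seen_by_lrm, 1)
    return stored, lrm_results
```

In this model the failed start is never recorded, so a restart or
relocate policy that counts failures would never see it.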

This loss of the result is problematic if it was an erroneous one,
because it then results in a malfunction of the failure restart and
relocate policies.

Fix this by checking in the CRM whether the last command was processed
by the LRM, i.e. simply check if a $lrm_result exists.
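
A minimal sketch of this check, again as a simplified Python model
rather than the actual Perl implementation. The names crm_turn_fixed,
service_status, lrm_results and make_uid are hypothetical; only the
idea of testing for an existing LRM result is taken from the
description above.

```python
def crm_turn_fixed(service_status, lrm_results, sid, make_uid):
    """Model of the fixed CRM turn for a service in 'started' state.

    A new UID is only generated once the LRM has reported a result
    for the current UID; otherwise the pending request is left alone.
    """
    entry = service_status[sid]
    lrm_result = lrm_results.get(entry["uid"])
    if lrm_result is None:
        # The LRM has not processed the last command yet: keep the
        # UID so the worker can still write back its result.
        return None
    # The last command was processed, so it is safe to issue a new one.
    entry["uid"] = make_uid()
    return lrm_result
```

With this guard, a slow start keeps its UID across CRM iterations, and
the failure result reaches the CRM before a new command is generated.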

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>