From bf2d8d7498026c396c76a9659207d30f232ad327 Mon Sep 17 00:00:00 2001
From: Thomas Lamprecht
Date: Mon, 12 Sep 2016 11:28:17 +0200
Subject: [PATCH] fix race condition on slow resource commands in started state

When we fixed the dangling state machine - where one command request
from the CRM could result in multiple executions of said command by the
LRM - by ensuring in the LRM that UID-identified actions get started
only once per UID (except the stop command), we introduced a bug which
can result in a lost LRM result from a failed service start.

The reason for this is that we generated a new UID for the started
state on every CRM turn, so that a service gets restarted if it crashes.
But as we do this without checking whether the LRM has finished our
last request, we may lose the result of that request.

As an example consider the following timeline of events:
1. The CRM requests the start of service 'ct:100'
2. The LRM starts this request, but needs a bit longer
3. Before the LRM worker finishes, the CRM does an iteration and
   generates a new UID for this service
4. The LRM worker finishes but cannot write back its result, as the
   UID doesn't exist anymore in the manager's service status
5. The CRM gets another round and generates a new UID for 'ct:100'
6. The cycle begins again; the LRM always throws away its last result,
   as the CRM wrongfully generated a new command

This loss of the result is problematic if it was an erroneous one,
because it then results in a malfunction of the failure restart and
relocate policies.

Fix this by checking in the CRM whether the last command was processed
by the LRM, so simply check if a $lrm_result exists.

Signed-off-by: Thomas Lamprecht
---
 src/PVE/HA/Manager.pm | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index 2fa8a2d..e3d6ffa 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -679,7 +679,8 @@ sub next_state_started {
 			    "Tried nodes: " . join(', ', @{$sd->{failed_nodes}}));
 		}
 		# ensure service get started again if it went unexpected down
-		$sd->{uid} = compute_new_uuid($sd->{state});
+		# but ensure also no LRM result gets lost
+		$sd->{uid} = compute_new_uuid($sd->{state}) if defined($lrm_res);
 	    }
 	}
 
-- 
2.39.2
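
The timeline from the commit message can be reproduced with a small standalone simulation. This is only a sketch under stated assumptions, not the real PVE::HA code: `simulate`, its round-based scheduling, and the `$guard` flag are hypothetical stand-ins. `$guard` toggles the `if defined($lrm_res)` condition added by this patch, and `$lrm_delay` is the number of CRM rounds one LRM worker needs to finish.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical model of the CRM/LRM interaction, not the real PVE::HA code.
# Returns (number of LRM results lost, number of LRM results the CRM saw).
sub simulate {
    my ($guard, $rounds, $lrm_delay) = @_;

    my $uid_counter = 0;
    my $new_uid = sub { return 'uid' . (++$uid_counter) };

    my $sd = { uid => $new_uid->() };   # the manager's service status
    my $lrm_results = {};               # uid => result written back by the LRM
    my $worker;                         # running LRM worker: { uid, done_at }
    my ($lost, $seen) = (0, 0);

    for my $round (1 .. $rounds) {
        # LRM: a finished worker can only write back its result if the
        # UID it was started for is still the current one.
        if ($worker && $round >= $worker->{done_at}) {
            if ($worker->{uid} eq $sd->{uid}) {
                $lrm_results->{$worker->{uid}} = 'start failed';
            } else {
                $lost++;                # result is thrown away
            }
            undef $worker;
        }

        # LRM: start each UID at most once, and only while idle.
        if (!$worker && !exists $lrm_results->{$sd->{uid}}) {
            $worker = { uid => $sd->{uid}, done_at => $round + $lrm_delay };
        }

        # CRM: one manager iteration in the 'started' state.
        my $lrm_res = $lrm_results->{$sd->{uid}};
        $seen++ if defined($lrm_res);

        # Without the guard a new UID is generated every round; with it,
        # only after the LRM has reported back on the last command.
        $sd->{uid} = $new_uid->() if !$guard || defined($lrm_res);
    }

    return ($lost, $seen);
}

for my $guard (0, 1) {
    my ($lost, $seen) = simulate($guard, 6, 2);
    printf "guard=%d lost=%d seen=%d\n", $guard, $lost, $seen;
}
```

In this model the unguarded run never lets the CRM see a result, because every worker finishes against a UID that has already been replaced. The guarded run loses none.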