From: Fiona Ebner
Date: Fri, 14 Apr 2023 12:38:30 +0000 (+0200)
Subject: lrm: do not migrate if service already running upon rebalance on start
X-Git-Url: https://git.proxmox.com/?p=pve-ha-manager.git;a=commitdiff_plain;h=5a9c3a28083820107f05bf45b111457725bcdab9

lrm: do not migrate if service already running upon rebalance on start

As reported in the community forum[0], currently, a newly added service
that's already running is shut down, offline migrated and started again
if rebalance selects a new node for it. This is unexpected. An
improvement would be to online migrate the service, but rebalance is
only supposed to happen for a stopped->start transition[1], so the
service should not be migrated at all.

The cleanest solution would be for the CRM to use the state 'started'
instead of 'request_start' for newly added services that are already
running, i.e. restore the behavior from before commit c2f2b9c
("manager: set new request_start state for services freshly added to
HA") for such services. But currently, there is no mechanism for the
CRM to check whether the service is already running, because it could be
on a different node. For now, avoiding the migration has to be handled
in the LRM instead. If the CRM ever has access to the necessary
information in the future, the solution mentioned above can be
re-considered.

Note that the CRM log message relies on the fact that the LRM only
returns the IGNORED status in this case, but it's more user-friendly
than a generic message like "migration ignored (check LRM log)".
[0]: https://forum.proxmox.com/threads/125597/
[1]: https://pve.proxmox.com/pve-docs/chapter-ha-manager.html#_crs_scheduling_points

Suggested-by: Thomas Lamprecht
Signed-off-by: Fiona Ebner
 [ T: split out adding the test to a previous commit so that one can
   see in git what the original bad behavior was and how it's now ]
Signed-off-by: Thomas Lamprecht
---

diff --git a/src/PVE/HA/LRM.pm b/src/PVE/HA/LRM.pm
index a283070..b6ac0fe 100644
--- a/src/PVE/HA/LRM.pm
+++ b/src/PVE/HA/LRM.pm
@@ -962,6 +962,11 @@ sub exec_resource_agent {
 	    return SUCCESS;
 	}
 
+	if ($cmd eq 'request_start_balance' && $running) {
+	    $haenv->log("info", "ignoring rebalance-on-start for service $sid - already running");
+	    return IGNORED;
+	}
+
 	my $online = ($cmd eq 'migrate') ? 1 : 0;
 
 	my $res = $plugin->migrate($haenv, $id, $target, $online);
diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index e63d281..7cfc402 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -648,6 +648,12 @@ sub next_state_migrate_relocate {
 	    $haenv->log('err', "service '$sid' - migration failed: service" .
 	        " registered on wrong node!");
 	    &$change_service_state($self, $sid, 'error');
+	} elsif ($exit_code == IGNORED) {
+	    $haenv->log(
+		"info",
+		"service '$sid' - rebalance-on-start request ignored - service already running",
+	    );
+	    $change_service_state->($self, $sid, $req_state, node => $sd->{node});
 	} else {
 	    $haenv->log('err', "service '$sid' - migration failed (exit code $exit_code)");
 	    &$change_service_state($self, $sid, $req_state, node => $sd->{node});
diff --git a/src/test/test-crs-static-rebalance2/log.expect b/src/test/test-crs-static-rebalance2/log.expect
index 58e53b0..286514d 100644
--- a/src/test/test-crs-static-rebalance2/log.expect
+++ b/src/test/test-crs-static-rebalance2/log.expect
@@ -21,44 +21,39 @@ info    120    node1/crm: service 'vm:100': state changed from 'request_start' t
 info    122    node2/crm: status change wait_for_quorum => slave
 info    123    node2/lrm: got lock 'ha_agent_node2_lock'
 info    123    node2/lrm: status change wait_for_agent_lock => active
-info    123    node2/lrm: service vm:100 - start relocate to node 'node1'
-info    123    node2/lrm: stopping service vm:100 (relocate)
-info    123    node2/lrm: service status vm:100 stopped
-info    123    node2/lrm: service vm:100 - end relocate to node 'node1'
+info    123    node2/lrm: ignoring rebalance-on-start for service vm:100 - already running
 info    124    node3/crm: status change wait_for_quorum => slave
-info    140    node1/crm: service 'vm:100': state changed from 'request_start_balance' to 'started'  (node = node1)
-info    141    node1/lrm: got lock 'ha_agent_node1_lock'
-info    141    node1/lrm: status change wait_for_agent_lock => active
-info    141    node1/lrm: starting service vm:100
-info    141    node1/lrm: service status vm:100 started
+info    140    node1/crm: service 'vm:100' - rebalance-on-start request ignored - service already running
+info    140    node1/crm: service 'vm:100': state changed from 'request_start_balance' to 'started'  (node = node2)
 info    220    cmdlist: execute service vm:101 add node2 started 0
 info    220    node1/crm: adding new service 'vm:101' on node 'node2'
-info    220    node1/crm: service vm:101: re-balance selected current node node2 for startup
-info    220    node1/crm: service 'vm:101': state changed from 'request_start' to 'started'  (node = node2)
-info    223    node2/lrm: starting service vm:101
-info    223    node2/lrm: service status vm:101 started
+info    220    node1/crm: service vm:101: re-balance selected new node node1 for startup
+info    220    node1/crm: service 'vm:101': state changed from 'request_start' to 'request_start_balance'  (node = node2, target = node1)
+info    223    node2/lrm: service vm:101 - start relocate to node 'node1'
+info    223    node2/lrm: service vm:101 - end relocate to node 'node1'
+info    240    node1/crm: service 'vm:101': state changed from 'request_start_balance' to 'started'  (node = node1)
+info    241    node1/lrm: got lock 'ha_agent_node1_lock'
+info    241    node1/lrm: status change wait_for_agent_lock => active
+info    241    node1/lrm: starting service vm:101
+info    241    node1/lrm: service status vm:101 started
 info    320    cmdlist: execute service vm:102 add node2 started 1
 info    320    node1/crm: adding new service 'vm:102' on node 'node2'
 info    320    node1/crm: service vm:102: re-balance selected new node node3 for startup
 info    320    node1/crm: service 'vm:102': state changed from 'request_start' to 'request_start_balance'  (node = node2, target = node3)
-info    323    node2/lrm: service vm:102 - start relocate to node 'node3'
-info    323    node2/lrm: stopping service vm:102 (relocate)
-info    323    node2/lrm: service status vm:102 stopped
-info    323    node2/lrm: service vm:102 - end relocate to node 'node3'
-info    340    node1/crm: service 'vm:102': state changed from 'request_start_balance' to 'started'  (node = node3)
-info    345    node3/lrm: got lock 'ha_agent_node3_lock'
-info    345    node3/lrm: status change wait_for_agent_lock => active
-info    345    node3/lrm: starting service vm:102
-info    345    node3/lrm: service status vm:102 started
+info    323    node2/lrm: ignoring rebalance-on-start for service vm:102 - already running
+info    340    node1/crm: service 'vm:102' - rebalance-on-start request ignored - service already running
+info    340    node1/crm: service 'vm:102': state changed from 'request_start_balance' to 'started'  (node = node2)
 info    420    cmdlist: execute service vm:103 add node2 started 0
 info    420    node1/crm: adding new service 'vm:103' on node 'node2'
-info    420    node1/crm: service vm:103: re-balance selected new node node1 for startup
-info    420    node1/crm: service 'vm:103': state changed from 'request_start' to 'request_start_balance'  (node = node2, target = node1)
-info    423    node2/lrm: service vm:103 - start relocate to node 'node1'
-info    423    node2/lrm: service vm:103 - end relocate to node 'node1'
-info    440    node1/crm: service 'vm:103': state changed from 'request_start_balance' to 'started'  (node = node1)
-info    441    node1/lrm: starting service vm:103
-info    441    node1/lrm: service status vm:103 started
+info    420    node1/crm: service vm:103: re-balance selected new node node3 for startup
+info    420    node1/crm: service 'vm:103': state changed from 'request_start' to 'request_start_balance'  (node = node2, target = node3)
+info    423    node2/lrm: service vm:103 - start relocate to node 'node3'
+info    423    node2/lrm: service vm:103 - end relocate to node 'node3'
+info    440    node1/crm: service 'vm:103': state changed from 'request_start_balance' to 'started'  (node = node3)
+info    445    node3/lrm: got lock 'ha_agent_node3_lock'
+info    445    node3/lrm: status change wait_for_agent_lock => active
+info    445    node3/lrm: starting service vm:103
+info    445    node3/lrm: service status vm:103 started
 info    520    cmdlist: execute service vm:104 add node2 stopped 0
 info    520    node1/crm: adding new service 'vm:104' on node 'node2'
 info    540    node1/crm: service 'vm:104': state changed from 'request_stop' to 'stopped'
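
For readers following along outside the Perl sources: the two-sided handshake the patch adds (LRM returns the new IGNORED status, CRM reacts by keeping the service where it is) can be sketched as below. This is a minimal Python model, not the actual implementation in src/PVE/HA/LRM.pm / src/PVE/HA/Manager.pm; the function names and the plain-tuple state representation are hypothetical, only the SUCCESS/IGNORED constants and the branching mirror the diff.

```python
# Hypothetical model of the control flow introduced by this patch.
SUCCESS = 0
IGNORED = 2  # new exit status: LRM skipped rebalance-on-start for a running service


def lrm_exec_resource_agent(cmd, running, log):
    """LRM side: before relocating, refuse a rebalance-on-start request
    when the freshly added service is already running."""
    if cmd == 'request_start_balance' and running:
        log("info", "ignoring rebalance-on-start - already running")
        return IGNORED
    # ... otherwise perform the (offline) relocate as before ...
    return SUCCESS


def crm_next_state_migrate_relocate(exit_code, req_state, current_node, log):
    """CRM side: on IGNORED, log the specific reason and move the service
    straight to the requested state on its *current* node (no migration).
    Returns the (state, node) pair the service ends up with."""
    if exit_code == IGNORED:
        log("info", "rebalance-on-start request ignored - service already running")
        return (req_state, current_node)
    if exit_code != SUCCESS:
        log("err", f"migration failed (exit code {exit_code})")
    return (req_state, current_node)


# Usage: a running service added to HA stays on node2 instead of bouncing.
msgs = []
code = lrm_exec_resource_agent('request_start_balance', True,
                               lambda lvl, m: msgs.append(m))
state, node = crm_next_state_migrate_relocate(code, 'started', 'node2',
                                              lambda lvl, m: msgs.append(m))
```

Note the design point the commit message calls out: the CRM can print the specific "service already running" message only because IGNORED is returned in exactly this one situation; if the LRM ever returns IGNORED elsewhere, the CRM side would have to fall back to a generic message.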