Fiona Ebner [Thu, 17 Nov 2022 14:00:07 +0000 (15:00 +0100)]
usage: add Usage::Static plugin
for calculating node usage of services based on static CPU and
memory configuration, as well as scoring the nodes with that
information to decide where to start a new or recovered service.
For getting the service stats, it's necessary to also consider the
migration target (if present), because the configuration file might
have already moved.
It's necessary to update the cluster filesystem upon stealing the
service, to be able to always read the moved config right away when
adding the usage.
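A rough Perl sketch of the scoring idea; node names, numbers, and the
formula are illustrative assumptions, not the actual implementation:

    use strict;
    use warnings;

    # per-node usage so far plus static totals (cores / GiB, made up)
    my %node_usage = (
        node1 => { cpu => 2, mem => 4,  maxcpu => 8, maxmem => 32 },
        node2 => { cpu => 6, mem => 24, maxcpu => 8, maxmem => 32 },
    );
    # static stats as read from the service's VM/CT config
    my %service = ( maxcpu => 2, maxmem => 4 );

    sub score_node {
        my ($usage) = @_;
        # relative load after hypothetically placing the service there
        my $cpu = ($usage->{cpu} + $service{maxcpu}) / $usage->{maxcpu};
        my $mem = ($usage->{mem} + $service{maxmem}) / $usage->{maxmem};
        return $cpu + $mem; # lower is better
    }

    my ($best) = sort {
        score_node($node_usage{$a}) <=> score_node($node_usage{$b})
    } keys %node_usage;
    print "start/recover service on $best\n"; # node1 here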
Fiona Ebner [Thu, 17 Nov 2022 14:00:05 +0000 (15:00 +0100)]
manager: select service node: add $sid to parameters
In preparation for scheduling based on static information, where the
scoring of nodes depends on information from the service's
VM/CT configuration file (and the $sid is required to query that).
Fiona Ebner [Thu, 17 Nov 2022 14:00:04 +0000 (15:00 +0100)]
add Usage base plugin and Usage::Basic plugin
in preparation to also support static resource scheduling via another
such Usage plugin.
The interface is designed in anticipation of the Usage::Static plugin;
the Usage::Basic plugin doesn't require all parameters.
In Usage::Static, the $haenv will be necessary for logging and getting
the static node stats. add_service_usage_to_node() and
score_nodes_to_start_service() take the sid and service node, and the
former also takes the optional migration target (during a migration
it's not clear whether the config file has already been moved or not),
to be able to get the static service stats.
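A condensed sketch of the plugin pair (package names abbreviated and
method bodies simplified assumptions, following the description above):

    package Usage;

    use strict;
    use warnings;

    sub new { my ($class, $haenv) = @_; return bless { haenv => $haenv }, $class; }

    # subclasses must implement these
    sub add_node { die "implement in subclass" }
    sub add_service_usage_to_node { die "implement in subclass" }
    sub score_nodes_to_start_service { die "implement in subclass" }

    package Usage::Basic;

    use strict;
    use warnings;
    use parent -norequire, 'Usage';

    sub add_node {
        my ($self, $nodename) = @_;
        $self->{nodes}->{$nodename} = 0;
    }

    sub add_service_usage_to_node {
        my ($self, $nodename, $sid, $service_node, $migration_target) = @_;
        # basic counting only needs the node; the extra arguments exist
        # for interface parity with the upcoming Usage::Static plugin
        $self->{nodes}->{$nodename}++;
    }

    sub score_nodes_to_start_service {
        my ($self, $sid, $service_node) = @_;
        return { %{ $self->{nodes} } }; # lower count = better candidate
    }

    1;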
Thomas Lamprecht [Fri, 22 Jul 2022 07:12:37 +0000 (09:12 +0200)]
manager: online node usage: factor out possible target and future proof
only count up target selection if that node is already in the online
node usage list, to avoid that an offline node is considered online if
it's a target from any command
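Sketch of the guard, with made-up data:

    use strict;
    use warnings;

    my $online_node_usage = { node1 => 2, node2 => 0 }; # node3 is offline
    my $target = 'node3';

    # only count up the target if it's already tracked as online;
    # otherwise an offline node would suddenly be treated as a valid
    # (online) candidate
    $online_node_usage->{$target}++
        if defined($target) && exists($online_node_usage->{$target});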
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
run_workers is responsible for updating the state after workers have
exited. if the current LRM state is 'active', but a shutdown_request was
issued in 'restart' mode (like on package upgrades), this call is the
only one made in the LRM work() loop.
skipping it if there are active services means the following sequence of
events effectively keeps the LRM from restarting or making any progress:
- start HA migration on node A
- reload LRM on node A while migration is still running
even once the migration is finished, the service count is still >= 1
since the LRM never calls run_workers (directly or via
manage_resources), so the service having been migrated is never noticed.
maintenance mode (i.e., rebooting the node with shutdown policy migrate)
does call manage_resources and thus run_workers, and will proceed once
the last worker has exited.
Thomas Lamprecht [Thu, 20 Jan 2022 14:35:02 +0000 (15:35 +0100)]
lrm: avoid job starvation on huge workloads
If a setup has a lot of VMs we may run into the time limit of the
run_worker loop before processing all workers, which can easily
happen if an admin did not increase the default max_workers in their
setup, but even with a bigger max_worker setting one can run into it.
That, combined with the fact that we sorted just by the $sid
alpha-numerically, means that CTs were preferred over VMs (C comes
before V) and additionally lower VMIDs were preferred too.
That means that a set of SIDs had a lower chance of ever actually
getting run, which is naturally not ideal at all.
Improve on that behavior by adding a counter to each queued worker and
preferring those that have a higher one, i.e., those that spent more
time waiting to get actively run.
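For instance (a small sketch; the data layout and field names are
assumptions):

    use strict;
    use warnings;

    # two queued workers; 'tries' counts rounds spent waiting to start
    my $workers = {
        'ct:200' => { sid => 'ct:200', tries => 0 }, # freshly queued
        'vm:100' => { sid => 'vm:100', tries => 3 }, # skipped 3 rounds
    };

    # prefer workers that waited longer over plain alphanumeric order
    my @order = sort {
        $workers->{$b}->{tries} <=> $workers->{$a}->{tries} || $a cmp $b
    } keys %$workers;
    # @order is ('vm:100', 'ct:200'): the starving VM now runs before
    # the CT, while sorting by $sid alone would always prefer 'ct:200'

    # a worker we ran out of time for gets bumped for the next round
    $workers->{'ct:200'}->{tries}++;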
Note, due to the way the stop state is enforced, i.e., always
enqueued as a new worker, its start-try counter will be reset every
round and it will thus have a lower priority compared to other request
states. We probably want to differentiate between a stop request made
while the service is/was in another state just before, and a stop that
is just re-requested for a service that was already stopped for a
while.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Fabian Ebner [Fri, 8 Oct 2021 12:52:26 +0000 (14:52 +0200)]
manage: handle edge case where a node gets stuck in 'fence' state
If all services in 'fence' state are gone from a node (e.g. by
removing the services) before fence_node() was successful, a node
would get stuck in the 'fence' state. Avoid this by calling
fence_node() if the node is in 'fence' state, regardless of service
state.
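Illustratively (stubbed fence_node(), made-up node-status data):

    use strict;
    use warnings;

    sub fence_node { my ($node) = @_; print "fenced $node\n"; return 1; } # stub

    my $node_status = { node1 => 'fence' };

    # previously this only ran while some service still referenced the
    # node; now a node in 'fence' state is always processed, even with
    # zero services left on it
    for my $node (sort keys %$node_status) {
        next if $node_status->{$node} ne 'fence';
        $node_status->{$node} = 'unknown' if fence_node($node); # fence -> unknown
    }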
Reported in the community forum:
https://forum.proxmox.com/threads/ha-migration-stuck-is-doing-nothing.94469/
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
[ T: track test change of new test ]
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Thomas Lamprecht [Mon, 17 Jan 2022 14:52:53 +0000 (15:52 +0100)]
lrm: also check CRM node-status for determining fence-request
This fixes point 2. of commit 3addeeb - avoiding that an LRM goes
active as long as the CRM still has it in (pending) `fence` state,
which can happen after a watchdog reset + fast boot. This avoids that
we interfere with the CRM acquiring the lock, which is all the more
important once a future commit gets added that ensures a node isn't
stuck in `fence` state if there's no service configured (anymore) due
to the admin manually removing them during fencing.
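A rough sketch of the combined check; the helper name and data layout
are assumptions:

    use strict;
    use warnings;

    # a fence request is pending if a local service is in 'fence' state
    # OR the CRM still marks this node as 'fence' in its node status
    sub has_pending_fence_request {
        my ($local_services, $crm_node_status, $nodename) = @_;
        return 1 if grep { $_->{state} eq 'fence' } values %$local_services;
        return 1 if ($crm_node_status->{$nodename} // '') eq 'fence';
        return 0;
    }

    # fast-booted node, admin removed all services, CRM still says 'fence':
    if (has_pending_fence_request({}, { node1 => 'fence' }, 'node1')) {
        print "hold off acquiring the LRM work lock\n";
    }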
We explicitly fix the startup first to better show how it works in
the test framework, but as the test/sim hardware can now delay the
CRM while keeping the LRM running, the second test (i.e.,
test-service-command9) should still trigger after the next commit, if
this one were reverted or otherwise broken.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Thomas Lamprecht [Mon, 17 Jan 2022 11:25:35 +0000 (12:25 +0100)]
test: cover case where all services get removed from in-progress fenced node
this test's log shows two issues we'll fix in later commits:
1. If a node gets fenced and an admin removes all services before the
fencing completes, the manager will ignore that node's state and
thus never make the "fence" -> "unknown" transition required by
the state machine
2. If a node is marked as "fence" in the manager's node status, but
has no service, its LRM's check for a "pending fence request"
returns a false negative and the node starts trying to acquire its
LRM work lock. This can even succeed in practice, e.g. with the
following events:
1. Node A gets fenced (for whatever reason), the CRM is working on
acquiring its lock while Node A reboots
2. An admin is present and removes all services of Node A from HA
2. Node A booted up fast again, the LRM is already starting before
the CRM could ever get the lock (<< 2 minutes)
3. A service located on Node A gets added to HA (again)
4. The LRM of Node A will actively try to get the lock, as it has no
service in fence state and is (currently) not checking the
manager's node state, so it is ignorant of the not-yet-processed
fence -> unknown transition
(note: the above uses 2. twice, as those points' order doesn't matter)
As a result the CRM may never get to acquire the lock of Node A's
LRM, and thus cannot finish the fence -> unknown transition,
resulting in user confusion and possible weird effects.
In the current log one can observe 1. by the missing fence tries of
the master, and 2. can be observed by the LRM acquiring the lock while
still being in "fence" state from the master's POV.
We use two tests so that point 2. is better covered later on.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Thomas Lamprecht [Wed, 19 Jan 2022 09:55:29 +0000 (10:55 +0100)]
sim: implement skip-round command for crm/lrm
This allows simulating situations where some asymmetry is required
in service type scheduling, e.g., if the master should not pick up
LRM changes just yet - something that can happen quite often in the
real world due to scheduling not being predictable, especially across
different hosts.
The implementation is pretty simple for now; that also means we just
do not care about watchdog updates for the skipped service, meaning
one is limited to skipping two 20s rounds max before self-fencing
kicks in.
This can be made more advanced once required.
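Pictured roughly (a simplified sketch; command parsing and the
bookkeeping here are assumptions, not the actual sim hardware code):

    use strict;
    use warnings;

    my %skip_rounds; # remaining rounds to skip, per simulated service

    sub sim_hardware_cmd {
        my ($cmd) = @_;
        if ($cmd =~ m/^skip-round\s+(crm|lrm)$/) {
            $skip_rounds{$1}++;
        } else {
            die "unknown command '$cmd'\n";
        }
    }

    sub run_round {
        my ($type) = @_; # 'crm' or 'lrm'
        if ($skip_rounds{$type}) {
            $skip_rounds{$type}--;
            return; # also skips the watchdog update -> two-round limit
        }
        print "$type does its regular round of work\n";
    }

    sim_hardware_cmd('skip-round crm');
    run_round('crm'); # skipped
    run_round('crm'); # runs normally again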
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Thomas Lamprecht [Mon, 17 Jan 2022 10:30:46 +0000 (11:30 +0100)]
d/postinst: fix restarting LRM/CRM when triggered
We wrongly dropped the semi-manual postinst in favor of a fully
auto-generated one, but we always need to generate the trigger
actions ourselves - it cannot work otherwise.
Fixes: 3166752 ("postinst: use auto generated postinst")
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Mostly for convenience for the admin, to avoid the need to remove it
completely, which is always frowned upon by most users.
Follows the same logic and safety criteria as the transition to
`stopped` on getting into the `disabled` state in
`next_state_error`.
As we previously had a rather immediate transition from recovery ->
error (not anymore), this actually restores a previous feature and
does not add new implications or the like.
Still, add a test which also covers that the recovery state does not
allow things like stop or migrate to happen.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
tests: cover request-state changes and crm-cmds for in-recovery services
Add a test which covers that the recovery state does not allow
things like stop or migrate to happen.
Also add one for disabling at the end; this is currently blocked too,
but will change in the next patch, as it can be a safe way out for
the admin to reset the service without removing it.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
fix #3415: never switch in error state on recovery, try harder
With the new 'recovery' state introduced in a previous commit we get
a clean transition, and thus an actual difference, between
to-be-fenced and fenced.
Use that to avoid going into the error state when we did not find any
possible new node we could recover the service to.
That can happen if the user uses the HA manager for local services,
which is an OK use-case as long as the service is restricted to a
group with only that node. But previously we could never recover
such services if their node failed, as they always got put into the
"error" dummy/final state.
But that's just artificially limiting ourselves to get a false sense
of safety.
Nobody touches the services while they're in the recovery state, no
LRM nor anything else (as any normal API call gets just routed to the
HA stack anyway), so there's just no chance that we get a bad
double-start of the same service, with resource access collisions
and all the bad stuff that could happen (and note, this will in
practice only matter for restricted services, which normally only
use local resources, so here it wouldn't even matter if it wasn't
safe already - but it is, double time!).
So, the usual transition guarantees still hold:
* only the current master does transitions
* there needs to be an OK quorate partition to have a master
And, for getting into recovery the following holds:
* the old node's lock was acquired by the master, which means it was
(self-)fenced -> resource not running
So as "recovery" is a no-op state we got only into once the nodes was
fenced we can continue recovery, i.e., try to find a new node for t
the failed services.
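A simplified sketch of that retry behavior (the signature and state
layout are assumptions):

    use strict;
    use warnings;

    # stay in 'recovery' and retry every round, instead of giving up
    # into the final 'error' state
    sub next_state_recovery {
        my ($sd, $recovery_node) = @_;
        if (defined($recovery_node)) {
            $sd->{node} = $recovery_node;
            $sd->{state} = 'started'; # safe: the old node was fenced
        }
        # else: no candidate found (e.g., restricted group, node still
        # down); keep 'recovery' and try again on the next manager round
    }

    my $sd = { state => 'recovery', node => 'node1' };
    next_state_recovery($sd, undef);   # no node available yet
    print "$sd->{state}\n";            # still 'recovery', no error state
    next_state_recovery($sd, 'node1'); # node came back -> recover to it
    print "$sd->{state} on $sd->{node}\n";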
Tests:
* adapt the existing recovery test output to match the endless retry
for finding a new node (vs. the previous "go into error immediately")
* add a test where the node comes up eventually, so that we also
cover the recovery to the same node it was on previous to a failure
* add a test with a non-empty start-state, where the restricted failed
node is online again. This ensures that the service won't get started
until the HA manager actively recovered it, even if it's staying on
that node.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
It's not much, but it's repeated a few times, and as a next commit
will add another such instance, let's just refactor it into a local
private helper with a very explicit name and a comment about what
implications calling it has.
Take the chance and add some more safety comments too.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Thomas Lamprecht [Mon, 24 May 2021 09:36:57 +0000 (11:36 +0200)]
d/rules: update to systemd dh changes
Both `override_dh_systemd_enable` and `override_dh_systemd_start`
are ignored with the current compat level 12, and will become an
error in level >= 13, so drop them and use
`override_dh_installsystemd` for both of the previous uses.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Thomas Lamprecht [Thu, 14 May 2020 08:15:59 +0000 (10:15 +0200)]
vm resource: add "with-local-disks" for replicated migrate
We do not need to pass a target storage, as the identity mapping
already prefers replicated storage for replicated disks, and other
cases do not make sense anyway as they wouldn't work for HA
recovery.
We probably want to check the "really only replicated OK migrations"
in the respective API code paths for the "ha" RPC environment case,
though.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Thomas Lamprecht [Thu, 12 Mar 2020 11:50:04 +0000 (12:50 +0100)]
factor out service configured/delete helpers
those differ from the "managed" service helpers in that they do not
check the state at all, they just check if a SID is in the config or
not, or respectively delete it.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Thomas Lamprecht [Mon, 25 Nov 2019 16:35:43 +0000 (17:35 +0100)]
lrm.service: add after ordering for SSH and pveproxy
To avoid early disconnect during shutdown, ensure we order After
them; for shutdown the ordering is reversed and so we're stopped
before those two - this allows checking out the node stats and doing
SSH work if something fails.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Thomas Lamprecht [Mon, 25 Nov 2019 16:48:42 +0000 (17:48 +0100)]
do simple fallback if node comes back online from maintenance
We simply remember the node we were on, if moved for maintenance.
This record gets dropped once we move to _any_ other node, be it:
* our previous node, as it came back from maintenance
* another node due to manual migration, group priority changes or
fencing
The first point is handled explicitly by this patch. In select
service node we check for an old fallback node; if that one is found
in the online node list with top priority we _always_ move to it -
even if there's no real reason for a move.
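Sketch of the preference order (simplified signature and made-up node
names):

    use strict;
    use warnings;

    # prefer the remembered pre-maintenance node if it's back among
    # the top-priority online nodes
    sub select_node_with_fallback {
        my ($top_prio_nodes, $current_node, $maintenance_fallback) = @_;
        my %online = map { ($_ => 1) } @$top_prio_nodes;
        return $maintenance_fallback
            if defined($maintenance_fallback) && $online{$maintenance_fallback};
        return $current_node if $online{$current_node};
        return $top_prio_nodes->[0];
    }

    # node1 went into maintenance, service moved to node2; node1 is back:
    my $node = select_node_with_fallback([ 'node1', 'node2' ], 'node2', 'node1');
    print "$node\n"; # node1 - move back even without another reason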
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
This adds handling for a new shutdown policy, namely "migrate".
If that is set, then the LRM doesn't queue stop jobs, but transitions
to a new mode, namely 'maintenance'.
The LRM modes now get passed from the CRM in the NodeStatus update
method; this allows detecting such a mode and making node-status state
transitions. Effectively we only allow the transition if we're
currently online, else this is ignored. 'maintenance' does not
protect from fencing.
The moving then gets done by select service node. A node in
maintenance mode is not in "list_online_nodes" and so also not in
the online_node_usage used to re-calculate if a service needs to be
moved. Only started services will get moved; this can be done almost
entirely by leveraging existing behavior, the next_state_started FSM
state transition method just needs to be taught to not return early
for nodes which are not online but in maintenance mode.
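Reduced to its core, the changed check could look like this (names
and return values are assumptions):

    use strict;
    use warnings;

    # decide what to do with a started service based on its node status
    sub handle_started_on_node {
        my ($node_status, $service_node) = @_;
        my $status = $node_status->{$service_node} // 'unknown';
        return 'wait' if $status ne 'online' && $status ne 'maintenance';
        return 'migrate-away' if $status eq 'maintenance'; # pick new node
        return 'keep-running';
    }

    print handle_started_on_node({ node1 => 'maintenance' }, 'node1'), "\n";
    # 'migrate-away' - previously the early return for non-online nodes
    # would have hit here and nothing would happen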
A few tests adapted from the other policy tests are added to
showcase behavior with reboot, shutdown, and shutdown of the current
manager. They also show the behavior when a service cannot be
migrated, albeit as our test system is limited to simulating at most
9 migration failures, it "seems" to succeed after that. But note here
that the maximal retries would have been hit way earlier, so this is
just an artifact of our test system.
Besides some implementation details, two questions are still not
solved by this approach:
* what if a service cannot be moved away, either due to errors or
because no alternative node is found by select_service_node
- retrying indefinitely; this is what happens currently. The user
set it up like this in the first place. We will order SSH and
pveproxy after the LRM service to ensure that there's still the
possibility for manual intervention
- an idea would be to track the time and see if we're stuck (this
is not too hard); in such a case we could stop the services after
X minutes and continue.
* a full cluster shutdown, but that is not too ideal even without
this mode; nodes will already get fenced once no partition is
quorate anymore. And as long as it's just a central setting in the
DC config, an admin has a single switch to flip to make it work, so
not sure how much handling we want to do here; if we go past the
point where we have no quorum we're dead anyhow, so.. at least not
really an issue of this series, orthogonally related yes, but not
more.
For real world usability the datacenter.cfg schema needs to be
changed to allow the migrate shutdown policy, but that's trivial.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Thomas Lamprecht [Mon, 25 Nov 2019 17:05:11 +0000 (18:05 +0100)]
account service to source and target during move
As the service load is often still happening on the source, and the
target may feel the performance impact from an incoming migration,
account the service to both nodes during that time.
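In essence (a sketch with made-up data):

    use strict;
    use warnings;

    my $online_node_usage = { node1 => 0, node2 => 0 };
    my $sd = { node => 'node1', target => 'node2', state => 'migrate' };

    $online_node_usage->{ $sd->{node} }++; # source still carries the load
    if ($sd->{state} eq 'migrate' && defined($sd->{target})) {
        # the target already feels the incoming migration, count it too,
        # but only if it's tracked as online
        $online_node_usage->{ $sd->{target} }++
            if exists $online_node_usage->{ $sd->{target} };
    }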
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Thomas Lamprecht [Tue, 19 Nov 2019 13:05:30 +0000 (14:05 +0100)]
fix #1339: remove more locks from services IF the node got fenced
Remove further locks from a service after it was recovered from a
fenced node. This can be done due to the fact that the node was
fenced and thus the operation it was locked for was interrupted
anyway. We note in the syslog that we removed a lock.
Mostly, we disallow removing the 'create' lock, as that is the only
case where we know that the service was not yet in a runnable state
before.
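Illustratively (the exact lock list below is an assumption, not the
definitive set):

    use strict;
    use warnings;

    # leftover locks considered safe to clear: the fence interrupted
    # the locked operation anyway
    my %removable_lock = map { ($_ => 1) }
        qw(backup mounted migrate clone rollback snapshot snapshot-delete);

    sub remove_stale_lock {
        my ($conf, $sid) = @_;
        my $lock = $conf->{lock} // return;
        if ($removable_lock{$lock}) {
            delete $conf->{lock};
            print "removed leftover lock '$lock' from recovered service $sid\n";
        }
        # 'create' is deliberately not removable: the service may never
        # have reached a runnable state, so recovery isn't safe there
    }

    remove_stale_lock({ lock => 'backup' }, 'vm:100');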
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Fabian Ebner [Thu, 10 Oct 2019 10:25:08 +0000 (12:25 +0200)]
Introduce crm-command to CLI and add stop as a subcommand
This should reduce confusion between the old 'set <sid> --state stopped' and
the new 'stop' command by making explicit that it is sent as a crm command.