git.proxmox.com Git - pve-ha-manager.git/commit

author	Thomas Lamprecht <t.lamprecht@proxmox.com>
	Thu, 20 Jan 2022 14:35:02 +0000 (15:35 +0100)
committer	Thomas Lamprecht <t.lamprecht@proxmox.com>
	Thu, 20 Jan 2022 15:14:03 +0000 (16:14 +0100)
commit	65c1fbac992d8a1f26c401cf49f2d4848bc42080
tree	2a1041f20d05000f2d0b58df820666687a025721
parent	b538340c9d474bd0a4267c865023d4d2b42be846

lrm: avoid job starvation on huge workloads

If a setup has a lot of VMs we may run into the time limit from the
run_worker loop before processing all workers, which can easily
happen if an admin did not increase their default of max_workers in
the setup, but even with a bigger max_workers setting one can run
into it.

That, combined with the fact that we sorted just by the $sid
alpha-numerically, means that CTs were preferred over VMs (C comes
before V) and additionally lower VMIDs were preferred too.
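
To illustrate the bias, here is a minimal Python sketch (not the LRM's
Perl code). The SIDs, the `pick_round` helper and the `budget`
parameter are hypothetical, and the sketch assumes the same set of
services is queued again every round and that only `budget` workers
fit within the time limit:

```python
# Hypothetical service IDs; the "ct:"/"vm:" prefixes mirror the SID format.
queued = ["vm:105", "ct:200", "vm:101", "ct:110", "vm:100"]

def pick_round(sids, budget):
    """Return the SIDs that get started when only `budget` workers
    fit into one round and the queue is ordered by SID string alone."""
    return sorted(sids)[:budget]

# Assumption: the queue looks the same in the next round, so the
# plain string sort hands the slots to the same SIDs again.
first = pick_round(queued, 2)
second = pick_round(queued, 2)
print(first, second)
```

Under these assumptions both rounds pick the two CTs, and the VMs at
the end of the sort order never get a slot.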

That means that a set of SIDs had a lower chance of ever actually
getting run, which is naturally not ideal at all.
Improve on that behavior by adding a counter to the queued worker and
preferring those that have a higher one, i.e., those that spent more
time waiting to get actively run.
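
The idea can be sketched in Python as follows (again not the actual
Perl change). The names `run_round`, `start_tries` and `budget` are
assumptions made for illustration, as is the model that a started
service is queued again with a fresh counter:

```python
def run_round(queue, budget):
    """queue maps SID -> start_tries (rounds spent waiting).
    Start up to `budget` workers, longest-waiting first; ties fall
    back to the SID so the order stays deterministic."""
    order = sorted(queue, key=lambda sid: (-queue[sid], sid))
    started, waiting = order[:budget], order[budget:]
    for sid in waiting:
        queue[sid] += 1   # skipped this round, gains priority
    for sid in started:
        queue[sid] = 0    # assumption: re-queued as a fresh worker
    return started

queue = {sid: 0 for sid in ["vm:105", "ct:200", "vm:101", "ct:110", "vm:100"]}
rounds = [run_round(queue, 2) for _ in range(3)]
print(rounds)
```

With a budget of two workers per round, every one of the five
hypothetical services gets started within three rounds, instead of
the two CTs winning each time.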

Note, due to the way the stop state is enforced, i.e., always
enqueued as a new worker, its start-try counter will be reset every
round and thus it has a lower priority compared to other request
states. We probably want to differentiate between a stop request when
the service is/was in another state just before and a stop that is
just re-requested even if a service was already stopped for a while.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>