git.proxmox.com Git - pve-ha-manager.git/log
2 years ago  manage: handle edge case where a node gets stuck in 'fence' state
Fabian Ebner [Fri, 8 Oct 2021 12:52:26 +0000 (14:52 +0200)]
manage: handle edge case where a node gets stuck in 'fence' state

If all services in 'fence' state are gone from a node (e.g. by
removing the services) before fence_node() was successful, a node
would get stuck in the 'fence' state. Avoid this by calling
fence_node() if the node is in 'fence' state, regardless of service
state.

Reported in the community forum:
https://forum.proxmox.com/threads/ha-migration-stuck-is-doing-nothing.94469/
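The guard described above can be sketched in a few lines (illustrative Python, not the project's actual Perl code; `manage_node` and the callback are stand-ins modeled on the message, only the state names 'fence' and 'unknown' come from the commit):

```python
# Illustrative sketch: a node in 'fence' state must be fenced regardless
# of whether any of its services are still in 'fence' state (they may
# have been removed by the admin in the meantime).

def manage_node(node_status, node, services_in_fence_state, fence_node):
    """fence_node is a callback returning True once fencing succeeded."""
    if node_status[node] == 'fence':
        # Call unconditionally; before the fix this only happened while
        # at least one service on the node was still in 'fence' state.
        if fence_node(node):
            # fencing done -> make the required state machine transition
            node_status[node] = 'unknown'
    return node_status[node]


# minimal demonstration: a node with no services left still leaves 'fence'
status = {'node1': 'fence'}
result = manage_node(status, 'node1', [], lambda n: True)
```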

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
[ T: track test change of new test ]
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2 years ago  lrm: also check CRM node-status for determining fence-request
Thomas Lamprecht [Mon, 17 Jan 2022 14:52:53 +0000 (15:52 +0100)]
lrm: also check CRM node-status for determining fence-request

This fixes point 2 of commit 3addeeb: avoiding that an LRM goes
active as long as the CRM still has it in (pending) `fence` state,
which can happen after a watchdog reset plus a fast boot. This avoids
interfering with the CRM acquiring the lock, which is all the more
important once a future commit ensures that a node isn't stuck in
`fence` state if there's no service configured (anymore) because an
admin removed them manually during fencing.

We explicitly fix the startup first to better show how it works in
the test framework; but as the test/sim hardware can now delay the
CRM while keeping the LRM running, the second test (i.e.,
test-service-command9) should still trigger after the next commit if
this one were reverted or broken.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2 years ago  lrm: factor out fence-request check into own helper
Thomas Lamprecht [Mon, 17 Jan 2022 14:48:27 +0000 (15:48 +0100)]
lrm: factor out fence-request check into own helper

we'll extend that a bit in a future commit

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2 years ago  test: cover case where all services get removed from in-progress fenced node
Thomas Lamprecht [Mon, 17 Jan 2022 11:25:35 +0000 (12:25 +0100)]
test: cover case where all services get removed from in-progress fenced node

this test's log shows two issues we'll fix in later commits:

1. If a node gets fenced and an admin removes all services before the
   fencing completes, the manager will ignore that node's state and
   thus never make the "fence" -> "unknown" transition required by
   the state machine

2. If a node is marked as "fence" in the manager's node status, but
   has no service, its LRM's check for "pending fence request"
   returns a false negative and the node starts trying to acquire its
   LRM work lock. This can even succeed in practice, e.g. the events:
    1. Node A gets fenced (whyever that is), CRM is working on
       acquiring its lock while Node A reboots
    2. Admin is present and removes all services of Node A from HA
    2. Node A booted up fast again, LRM is already starting before
       CRM could ever get the lock (<< 2 minutes)
    3. Service located on Node A gets added to HA (again)
    4. LRM of Node A will actively try to get lock as it has no
       service in fence state and is (currently) not checking the
       manager's node state, so is ignorant of the not yet processed
       fence -> unknown transition
    (note: above uses 2. twice as those points order doesn't matter)

    As a result the CRM may never get to acquire the lock of Node A's
    LRM, and thus cannot finish the fence -> unknown transition,
    resulting in user confusion and possible weird effects.

In the current log one can observe 1. by the missing fence tries of
the master, and 2. by the LRM acquiring the lock while still being in
"fence" state from the master's POV.

We use two tests so that point 2. is better covered later on.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2 years ago  sim: implement skip-round command for crm/lrm
Thomas Lamprecht [Wed, 19 Jan 2022 09:55:29 +0000 (10:55 +0100)]
sim: implement skip-round command for crm/lrm

This allows simulating situations where some asymmetry in service
type scheduling is required, e.g., when the master should not pick up
LRM changes just yet - something that can happen quite often in the
real world due to scheduling not being predictable, especially across
different hosts.

The implementation is pretty simple for now; that also means we just
do not care about watchdog updates for the skipped service, so one is
limited to skipping two 20s rounds at most before self-fencing kicks
in.

This can be made more advanced once required.
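The two-round limit can be made concrete with a small calculation. This assumes a 60s watchdog expiry (an assumption for illustration, not stated in the message) against the 20s rounds it mentions:

```python
# Back-of-the-envelope check of the skip limit: with no watchdog
# update during a skipped round, the round after the last skip must
# still update the watchdog before it expires.

WATCHDOG_EXPIRY_S = 60   # assumed self-fence timeout (illustrative)
ROUND_LEN_S = 20         # one CRM/LRM scheduler round, per the message

def max_skippable_rounds(expiry=WATCHDOG_EXPIRY_S, round_len=ROUND_LEN_S):
    # strictly fewer than expiry/round_len rounds may go without an update
    return expiry // round_len - 1

limit = max_skippable_rounds()
```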

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2 years ago  sim: test hw: small code cleanups and whitespace fixes
Thomas Lamprecht [Tue, 18 Jan 2022 14:33:41 +0000 (15:33 +0100)]
sim: test hw: small code cleanups and whitespace fixes

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2 years ago  sim: service add command: allow to override state
Thomas Lamprecht [Mon, 17 Jan 2022 14:45:20 +0000 (15:45 +0100)]
sim: service add command: allow to override state

Until now we had at most one extra param, so let's get all remaining
params in an array and use that; the fallback stayed the same.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2 years ago  sim: add service: set type/name in config
Thomas Lamprecht [Mon, 17 Jan 2022 14:47:19 +0000 (15:47 +0100)]
sim: add service: set type/name in config

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2 years ago  test/sim: also log delay commands
Thomas Lamprecht [Wed, 19 Jan 2022 10:17:24 +0000 (11:17 +0100)]
test/sim: also log delay commands

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2 years ago  sim/hardware: sort and split use statements
Thomas Lamprecht [Mon, 17 Jan 2022 14:43:48 +0000 (15:43 +0100)]
sim/hardware: sort and split use statements

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2 years ago  lrm: fix comment typos
Thomas Lamprecht [Mon, 17 Jan 2022 14:43:03 +0000 (15:43 +0100)]
lrm: fix comment typos

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2 years ago  crm: code/style cleanup
Thomas Lamprecht [Mon, 17 Jan 2022 11:27:30 +0000 (12:27 +0100)]
crm: code/style cleanup

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2 years ago  d/postinst: fix restarting LRM/CRM when triggered
Thomas Lamprecht [Mon, 17 Jan 2022 10:30:46 +0000 (11:30 +0100)]
d/postinst: fix restarting LRM/CRM when triggered

We wrongly dropped the semi-manual postinst in favor of a fully
auto-generated one, but we always need to generate the trigger
actions ourselves - it cannot work otherwise.

Fixes 3166752 ("postinst: use auto generated postinst")
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2 years ago  d/lintian: update repeated-trigger override
Thomas Lamprecht [Mon, 17 Jan 2022 10:30:08 +0000 (11:30 +0100)]
d/lintian: update repeated-trigger override

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2 years ago  lrm: fix log call on wrong module
Thomas Lamprecht [Thu, 7 Oct 2021 13:19:30 +0000 (15:19 +0200)]
lrm: fix log call on wrong module

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2 years ago  bump version to 3.3-1
Thomas Lamprecht [Fri, 2 Jul 2021 18:03:36 +0000 (20:03 +0200)]
bump version to 3.3-1

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2 years ago  recovery: allow disabling an in-recovery service
Thomas Lamprecht [Fri, 2 Jul 2021 17:51:31 +0000 (19:51 +0200)]
recovery: allow disabling an in-recovery service

Mostly for the admin's convenience, to avoid the need to remove the
service completely, which is always frowned upon by most users.

Follows the same logic and safety criteria as the transition to
`stopped` on getting into the `disabled` state in the
`next_state_error`.

As we previously had a rather immediate transition from recovery ->
error (not anymore), this actually restores a previous feature and
does not add new implications or the like.

Still, add a test which also covers that the recovery state does not
allow things like stop or migrate to happen.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2 years ago  tests: cover request-state changes and crm-cmds for in-recovery services
Thomas Lamprecht [Fri, 2 Jul 2021 17:31:42 +0000 (19:31 +0200)]
tests: cover request-state changes and crm-cmds for in-recovery services

Add a test which covers that the recovery state does not allow
things like stop or migrate to happen.

Also add one for disabling at the end; this is currently blocked too,
but will change in the next patch, as it can be a safe way out for
the admin to reset the service without removing it.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2 years ago  recompute_online_node_usage: show state on internal error
Thomas Lamprecht [Fri, 2 Jul 2021 17:18:22 +0000 (19:18 +0200)]
recompute_online_node_usage: show state on internal error

makes debugging easier, also throw in some code cleanup

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2 years ago  fix #3415: never switch in error state on recovery, try harder
Thomas Lamprecht [Fri, 2 Jul 2021 15:32:42 +0000 (17:32 +0200)]
fix #3415: never switch in error state on recovery, try harder

With the new 'recovery' state introduced in a previous commit we get
a clean transition, and thus an actual difference, between
to-be-fenced and fenced.

Use that to avoid going into the error state when we did not find
any possible new node we could recover the service to.
That can happen if the user uses the HA manager for local services,
which is an OK use-case as long as the service is restricted to a
group with only that node. But before that we could never recover
such services if their node failed, as they always got put into the
"error" dummy/final state.
That's just artificially limiting ourselves to get a false sense of
safety.

Nobody touches a service while it's in the recovery state, no LRM
nor anything else (as any normal API call just gets routed to the HA
stack anyway), so there's just no chance that we get a bad
double-start of the same service, with resource access collisions
and all the bad stuff that could happen. (Note: in practice this
only matters for restricted services, which normally use only local
resources, so here it wouldn't even matter if it wasn't safe already
- but it is, double time!)

So, the usual transition guarantees still hold:
* only the current master does transitions
* there needs to be an OK quorate partition to have a master

And, for getting into recovery the following holds:
* the old node's lock was acquired by the master, which means it was
  (self-)fenced -> resource not running

So, as "recovery" is a no-op state that we only get into once the
node was fenced, we can continue recovery, i.e., try to find a new
node for the failed services.

Tests:
* adapt the existing recovery test output to match the endless retry
  for finding a new node (vs. the previous "go into error
  immediately")
* add a test where the node comes up eventually, so that we cover
  also the recovery to the same node it was on, previous to a failure
* add a test with a non-empty start-state, the restricted failed node
  is online again. This ensures that the service won't get started
  until the HA manager actively recovered it, even if it's staying on
  that node.
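The behavior change can be sketched as a small retry loop (illustrative Python, not the actual Perl; `next_state_recovery` and the stand-in `pick` helper are assumptions, the state names come from the message):

```python
# A service in 'recovery' retries node selection every round instead
# of entering the final 'error' state when no target node is found.

def next_state_recovery(service, online_nodes, select_service_node):
    node = select_service_node(service, online_nodes)
    if node is None:
        # before the fix: the service went to 'error' (a dead end);
        # now: stay in 'recovery' and retry on the next manager round
        return 'recovery'
    service['node'] = node
    return 'started'


svc = {'sid': 'vm:100', 'group': ['node3'], 'node': 'node3'}
# stand-in selector: only nodes of the restricted group qualify
pick = lambda s, nodes: next((n for n in nodes if n in s['group']), None)

state1 = next_state_recovery(svc, ['node1', 'node2'], pick)  # node3 still down
state2 = next_state_recovery(svc, ['node1', 'node3'], pick)  # node3 back online
```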

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2 years ago  gitignore: add test status output directory's content to ignored files
Thomas Lamprecht [Fri, 2 Jul 2021 14:12:09 +0000 (16:12 +0200)]
gitignore: add test status output directory's content to ignored files

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2 years ago  tests: add one for service set to be & stay ignored from the start
Thomas Lamprecht [Thu, 1 Jul 2021 15:26:13 +0000 (17:26 +0200)]
tests: add one for service set to be & stay ignored from the start

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2 years ago  LRM: release lock and close watchdog if no service configured for >10min
Thomas Lamprecht [Thu, 1 Jul 2021 13:55:43 +0000 (15:55 +0200)]
LRM: release lock and close watchdog if no service configured for >10min

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2 years ago  LRM: factor out closing watchdog local helper
Thomas Lamprecht [Thu, 1 Jul 2021 13:56:37 +0000 (15:56 +0200)]
LRM: factor out closing watchdog local helper

It's not much but repeated a few times, and as a next commit will
add another such call site, let's just refactor it into a local
private helper with a very explicit name and a comment about what
implications calling it has.

Take the chance and add some more safety comments too.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2 years ago  manager: make recovery actual state in FSM
Thomas Lamprecht [Wed, 30 Jun 2021 10:43:08 +0000 (12:43 +0200)]
manager: make recovery actual state in FSM

This basically makes recovery just an active state transition, as can
be seen from the regression tests - no other semantic change is
caused.

For the admin this is much better to grasp than services still marked
as "fence" when the failed node is already fenced or even already up
again.

Code-wise it makes sense too, to make the recovery part not so hidden
anymore, but to show it as what it is: an actual part of the FSM.
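As a rough picture, the now-explicit state can be shown in a small transition table (an illustrative simplification in Python, not the full set of pve-ha-manager transitions; only the state names 'fence', 'recovery', and 'started' come from the surrounding messages):

```python
# Simplified service-FSM transitions with 'recovery' as a real state.

TRANSITIONS = {
    'fence':    {'recovery'},            # node fenced -> recover service
    'recovery': {'recovery', 'started'}, # retry until a node is found
    'started':  {'started', 'fence'},    # may fail again later
}

def step(state, requested):
    if requested not in TRANSITIONS.get(state, set()):
        raise ValueError(f'illegal transition {state} -> {requested}')
    return requested


s = step('fence', 'recovery')
s = step(s, 'recovery')   # no target node found yet, stay in recovery
s = step(s, 'started')    # recovered to a new node
```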

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2 years ago  manager: indentation/code-style cleanups
Thomas Lamprecht [Wed, 30 Jun 2021 08:39:24 +0000 (10:39 +0200)]
manager: indentation/code-style cleanups

we now allow for a longer text-width in general and adapt some lines
for that

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2 years ago  ha-tester: allow one to suppress the actual test output
Thomas Lamprecht [Thu, 1 Jul 2021 13:39:15 +0000 (15:39 +0200)]
ha-tester: allow one to suppress the actual test output

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2 years ago  ha-tester: report summary count of run/passed tests and list failed ones
Thomas Lamprecht [Thu, 1 Jul 2021 12:53:16 +0000 (14:53 +0200)]
ha-tester: report summary count of run/passed tests and list failed ones

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2 years ago  ha-tester: allow to continue harness on test failure
Thomas Lamprecht [Thu, 1 Jul 2021 12:51:45 +0000 (14:51 +0200)]
ha-tester: allow to continue harness on test failure

To see whether just a few or many tests are broken, it is sometimes
useful to run all of them and not just exit after the first failure.

Allow this as an opt-in feature.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2 years ago  sim: hardware: update & reformat comment for available commands
Thomas Lamprecht [Thu, 1 Jul 2021 13:59:18 +0000 (15:59 +0200)]
sim: hardware: update & reformat comment for available commands

The service addition and deletion commands, and also the artificial
delay command (useful to force continuation of the HW), were missing
completely.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2 years ago  buildsys: change upload/repo dist to bullseye
Thomas Lamprecht [Mon, 24 May 2021 09:40:39 +0000 (11:40 +0200)]
buildsys: change upload/repo dist to bullseye

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2 years ago  bump version to 3.2-2
Thomas Lamprecht [Mon, 24 May 2021 09:38:46 +0000 (11:38 +0200)]
bump version to 3.2-2

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2 years ago  d/rules: update to systemd dh changes
Thomas Lamprecht [Mon, 24 May 2021 09:36:57 +0000 (11:36 +0200)]
d/rules: update to systemd dh changes

Both `override_dh_systemd_enable` and `override_dh_systemd_start`
are ignored at the current compat level 12, and will become an error
at level >= 13, so drop them and use `override_dh_installsystemd` for
both of the previous uses.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2 years ago  bump version to 3.2-1
Thomas Lamprecht [Wed, 12 May 2021 18:56:03 +0000 (20:56 +0200)]
bump version to 3.2-1

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2 years ago  d/control: bump debhelper compat level to 12
Thomas Lamprecht [Wed, 12 May 2021 18:54:22 +0000 (20:54 +0200)]
d/control: bump debhelper compat level to 12

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
3 years ago  bump version to 3.1-1
Thomas Lamprecht [Mon, 31 Aug 2020 08:52:17 +0000 (10:52 +0200)]
bump version to 3.1-1

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
3 years ago  vm resource: add "with-local-disks" for replicated migrate
Thomas Lamprecht [Thu, 14 May 2020 08:15:59 +0000 (10:15 +0200)]
vm resource: add "with-local-disks" for replicated migrate

We do not need to pass a target storage, as the identity mapping
already prefers replicated storage for replicated disks, and other
cases do not make sense anyway as they wouldn't work for HA
recovery.

We probably want to check the "really only replicated OK migrations"
in the respective API code paths for the "ha" RPC environment case,
though.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years ago  bump version to 3.0-9
Thomas Lamprecht [Thu, 12 Mar 2020 12:18:52 +0000 (13:18 +0100)]
bump version to 3.0-9

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years ago  factor out service configured/delete helpers
Thomas Lamprecht [Thu, 12 Mar 2020 11:50:04 +0000 (12:50 +0100)]
factor out service configured/delete helpers

Those differ from the "managed" service helpers in that they do not
check the state at all; they just check whether, or respectively
delete, a SID is in the config.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years ago  config parse_sid: improve error message, not only used on 'add'
Thomas Lamprecht [Thu, 12 Mar 2020 11:48:03 +0000 (12:48 +0100)]
config parse_sid: improve error message, not only used on 'add'

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years ago  nodestate: move to modern coderef call syntax
Thomas Lamprecht [Sat, 15 Feb 2020 12:17:12 +0000 (13:17 +0100)]
nodestate: move to modern coderef call syntax

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years ago  fix service name for pve-ha-crm
Oguz Bektas [Tue, 11 Feb 2020 08:26:25 +0000 (09:26 +0100)]
fix service name for pve-ha-crm

"PVE Cluster Resource Manager Daemon" should be "PVE Cluster HA Resource
Manager Daemon"

[0]: https://forum.proxmox.com/threads/typo-omission.65107/

Signed-off-by: Oguz Bektas <o.bektas@proxmox.com>
4 years ago  grammar fix: s/does not exists/does not exist/g
Thomas Lamprecht [Fri, 13 Dec 2019 11:08:30 +0000 (12:08 +0100)]
grammar fix: s/does not exists/does not exist/g

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years ago  tests: add a start on a maintained node
Thomas Lamprecht [Mon, 2 Dec 2019 09:56:18 +0000 (10:56 +0100)]
tests: add a start on a maintained node

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years ago  test shutdown policy: add stopped service to ensure maintained node is not fenced
Thomas Lamprecht [Mon, 2 Dec 2019 09:37:18 +0000 (10:37 +0100)]
test shutdown policy: add stopped service to ensure maintained node is not fenced

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years ago  bump version to 3.0-8
Thomas Lamprecht [Mon, 2 Dec 2019 09:33:10 +0000 (10:33 +0100)]
bump version to 3.0-8

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years ago  Fix check for maintenance mode
Fabian Ebner [Mon, 2 Dec 2019 08:45:32 +0000 (09:45 +0100)]
Fix check for maintenance mode

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
4 years ago  bump version to 3.0-7
Thomas Lamprecht [Sat, 30 Nov 2019 18:47:48 +0000 (19:47 +0100)]
bump version to 3.0-7

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years ago  api/status: extra handling of maintenance mode
Thomas Lamprecht [Sat, 30 Nov 2019 18:46:47 +0000 (19:46 +0100)]
api/status: extra handling of maintenance mode

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years ago  do not mark maintained nodes as unknown
Thomas Lamprecht [Sat, 30 Nov 2019 18:31:50 +0000 (19:31 +0100)]
do not mark maintained nodes as unknown

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years ago  bump LRM stop_wait_time to an hour
Thomas Lamprecht [Fri, 29 Nov 2019 13:15:11 +0000 (14:15 +0100)]
bump LRM stop_wait_time to an hour

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years ago  bump version to 3.0-6
Thomas Lamprecht [Tue, 26 Nov 2019 17:03:32 +0000 (18:03 +0100)]
bump version to 3.0-6

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years ago  lrm.service: add after ordering for SSH and pveproxy
Thomas Lamprecht [Mon, 25 Nov 2019 16:35:43 +0000 (17:35 +0100)]
lrm.service: add after ordering for SSH and pveproxy

To avoid early disconnects during shutdown, ensure we order After
them; for shutdown the ordering is reversed, so we're stopped before
those two. This allows checking out the node stats and doing SSH
work if something fails.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years ago  do simple fallback if node comes back online from maintenance
Thomas Lamprecht [Mon, 25 Nov 2019 16:48:42 +0000 (17:48 +0100)]
do simple fallback if node comes back online from maintenance

We simply remember the node we were on, if moved for maintenance.
This record gets dropped once we move to _any_ other node, be it:
* our previous node, as it came back from maintenance
* another node due to manual migration, group priority changes or
  fencing

The first point is handled explicitly by this patch. In the select
service node code we check for an old fallback node; if that one is
found in the online node list with top priority, we _always_ move to
it - even if there's no real reason for a move.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years ago  add 'migrate' node shutdown policy
Thomas Lamprecht [Fri, 4 Oct 2019 17:35:15 +0000 (19:35 +0200)]
add 'migrate' node shutdown policy

This adds handling for a new shutdown policy, namely "migrate".
If that is set, the LRM doesn't queue stop jobs, but transitions to
a new mode, namely 'maintenance'.

The LRM modes now get passed from the CRM in the NodeStatus update
method; this allows detecting such a mode and making node-status
state transitions. Effectively we only allow the transition if we're
currently online, else this is ignored. 'maintenance' does not
protect against fencing.

The moving then gets done by select service node. A node in
maintenance mode is not in "list_online_nodes" and so also not in
the online_node_usage used to re-calculate whether a service needs
to be moved. Only started services will get moved; this can be done
almost entirely by leveraging existing behavior, the
next_state_started FSM state transition method just needs to be
taught not to return early for nodes which are not online but in
maintenance mode.

A few tests, adapted from the other policy tests, are added to
showcase behavior with reboot, shutdown, and shutdown of the current
manager. They also show the behavior when a service cannot be
migrated; albeit, as our test system is limited to simulating at
most 9 migration failures, it "seems" to succeed after that. But
note that the maximum retries would have been hit much earlier, so
this is just an artifact of our test system.

Besides some implementation details, two questions are still not
solved by this approach:
* what if a service cannot be moved away, either due to errors or as
  no alternative node is found by select_service_node
  - retrying indefinitely; this happens currently. The user set this
    up like this in the first place. We will order SSH and pveproxy
    after the LRM service to ensure that there's still the
    possibility for manual interventions
  - an idea would be to track the time and see if we're stuck (this
    is not too hard); in such a case we could stop the services
    after X minutes and continue.
* a full cluster shutdown; but that is not too ideal even without
  this mode, as nodes already get fenced once no partition is
  quorate anymore. And as long as it's just a central setting in the
  DC config, an admin has a single switch to flip to make it work,
  so it's unclear how much handling we want to do here; if we go
  past the point where we have no quorum we're dead anyhow. So this
  is at least not really an issue of this series - orthogonally
  related, yes, but not more.

For real world usability the datacenter.cfg schema needs to be
changed to allow the migrate shutdown policy, but that's trivial.
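The node-selection side of this can be sketched briefly (illustrative Python, not the actual Perl; the helper names mirror `list_online_nodes`/select service node from the message, the tie-break by sorted name is purely an assumption):

```python
# A node in 'maintenance' mode is excluded from the online-node list,
# so started services get moved away from it while it shuts down.

def list_online_nodes(node_status):
    return [n for n, s in node_status.items() if s == 'online']

def select_service_node(node_status, current_node):
    online = list_online_nodes(node_status)
    if current_node in online:
        return current_node
    # current node went into maintenance (policy=migrate): pick another
    return sorted(online)[0] if online else None


nodes = {'node1': 'maintenance', 'node2': 'online', 'node3': 'online'}
target = select_service_node(nodes, 'node1')
```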

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years ago  account service to source and target during move
Thomas Lamprecht [Mon, 25 Nov 2019 17:05:11 +0000 (18:05 +0100)]
account service to source and target during move

As the service load is often still happening on the source, and the
target may feel the performance impact of an incoming migration,
account the service to both nodes during that time.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years ago  manager select_service_node: code cleanup
Thomas Lamprecht [Mon, 25 Nov 2019 16:08:06 +0000 (17:08 +0100)]
manager select_service_node: code cleanup

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years ago  lrm.service: sort After statements
Thomas Lamprecht [Mon, 25 Nov 2019 16:34:53 +0000 (17:34 +0100)]
lrm.service: sort After statements

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years ago  bump version to 3.0-5
Thomas Lamprecht [Wed, 20 Nov 2019 19:14:11 +0000 (20:14 +0100)]
bump version to 3.0-5

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years ago  d/control: re-add CT/VM dependency
Thomas Lamprecht [Wed, 20 Nov 2019 19:13:33 +0000 (20:13 +0100)]
d/control: re-add CT/VM dependency

This was an issue for 5.x and initial pre-6.0, and should now work
again as expected.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years ago  refactor: vm_qmp_command was moved to PVE::QemuServer::Monitor
Stefan Reiter [Tue, 19 Nov 2019 11:23:50 +0000 (12:23 +0100)]
refactor: vm_qmp_command was moved to PVE::QemuServer::Monitor

Also change to mon_cmd helper, avoid calling qmp_cmd directly.

Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
4 years ago  fix #1339: remove more locks from services IF the node got fenced
Thomas Lamprecht [Tue, 19 Nov 2019 13:05:30 +0000 (14:05 +0100)]
fix #1339: remove more locks from services IF the node got fenced

Remove further locks from a service after it was recovered from a
fenced node. This can be done due to the fact that the node was
fenced and thus the operation it was locked for was interrupted
anyway. We note in the syslog that we removed a lock.

Mostly we disallow the 'create' lock, as that is the only case
where we know that the service was not yet in a runnable state
before.
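The policy can be sketched as a tiny filter (illustrative Python, not the Perl implementation; the concrete set of removable lock names here is an example, only the special-casing of 'create' and the syslog note come from the message):

```python
# After recovering a service from a fenced node, locks of interrupted
# operations may be dropped, but 'create' must stay: the service may
# never have reached a runnable state.

REMOVABLE_AFTER_FENCE = {'backup', 'migrate', 'snapshot'}  # example set

def lock_to_remove(lock):
    if lock is None or lock == 'create':
        return None          # keep 'create'; nothing to do otherwise
    if lock in REMOVABLE_AFTER_FENCE:
        return lock          # caller removes it and notes it in syslog
    return None


removed = lock_to_remove('backup')
kept = lock_to_remove('create')
```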

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years ago  bump version to 3.0-4
Fabian Grünbichler [Mon, 11 Nov 2019 10:28:13 +0000 (11:28 +0100)]
bump version to 3.0-4

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years ago  use PVE::DataCenterConfig
Fabian Grünbichler [Mon, 11 Nov 2019 10:28:12 +0000 (11:28 +0100)]
use PVE::DataCenterConfig

to make sure that the corresponding cfs_read_file() call works.

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
4 years ago  cli stop cmd: fix property desc. indentation
Thomas Lamprecht [Thu, 14 Nov 2019 13:39:30 +0000 (14:39 +0100)]
cli stop cmd: fix property desc. indentation

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years ago  bump version to 3.0-3
Thomas Lamprecht [Mon, 11 Nov 2019 16:04:40 +0000 (17:04 +0100)]
bump version to 3.0-3

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years ago  followup, adapt stop request log messages; include SID
Thomas Lamprecht [Mon, 11 Nov 2019 15:50:37 +0000 (16:50 +0100)]
followup, adapt stop request log messages; include SID

It's always good to say that we requested it, so that people don't
think the task should have already started.

Also include the service ID (SID), so people know what we wanted to
stop at all.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years ago  Introduce crm-command to CLI and add stop as a subcommand
Fabian Ebner [Thu, 10 Oct 2019 10:25:08 +0000 (12:25 +0200)]
Introduce crm-command to CLI and add stop as a subcommand

This should reduce confusion between the old 'set <sid> --state stopped' and
the new 'stop' command by making explicit that it is sent as a crm command.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
4 years ago  Add crm command 'stop'
Fabian Ebner [Thu, 10 Oct 2019 10:25:07 +0000 (12:25 +0200)]
Add crm command 'stop'

Not every command parameter is 'target' anymore, so
it was necessary to modify the parsing of $sd->{cmd}.

Just changing the state to request_stop is not enough; we need to
actually update the service configuration as well.

Add a simple test for the stop command.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years ago  Add timeout parameter for shutdown
Fabian Ebner [Thu, 10 Oct 2019 10:25:06 +0000 (12:25 +0200)]
Add timeout parameter for shutdown

Introduces a timeout parameter for shutting a resource down.
If the parameter is 0, we perform a hard stop instead of a shutdown.
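The timeout semantics can be sketched as follows (illustrative Python, not the Perl resource API; the class and method names are stand-ins, only the "0 means hard stop" rule comes from the message):

```python
# timeout == 0 requests a hard stop; any other value requests a
# graceful shutdown bounded by that many seconds.

def shutdown_resource(resource, timeout):
    if timeout == 0:
        return resource.hard_stop()
    return resource.shutdown(timeout=timeout)

class FakeVM:
    """Minimal stand-in for a managed resource, for illustration."""
    def hard_stop(self):
        return 'stopped (hard)'
    def shutdown(self, timeout):
        return f'shutdown requested (up to {timeout}s)'


hard = shutdown_resource(FakeVM(), 0)
soft = shutdown_resource(FakeVM(), 180)
```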

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
4 years ago  Add update_service_config to the HA environment interface and simulation
Fabian Ebner [Thu, 10 Oct 2019 10:25:05 +0000 (12:25 +0200)]
Add update_service_config to the HA environment interface and simulation

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
4 years ago  followup: s/ss/sc/
Thomas Lamprecht [Sat, 5 Oct 2019 18:25:32 +0000 (20:25 +0200)]
followup: s/ss/sc/

fixes: dcb4a2a48404a8bf06df41e071fea348d0c971a4

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years ago  fix #2241: VM resource: allow migration with local device, when not running
Thomas Lamprecht [Sat, 5 Oct 2019 18:10:27 +0000 (20:10 +0200)]
fix #2241: VM resource: allow migration with local device, when not running

qemu-server ignores the flag if the VM runs, so just set it to true
hardcoded.

People have identical hosts with the same HW and want to be able to
relocate VMs in such cases, so allow it here - qemu knows to complain
if it cannot work, and as nothing bad happens then (the VM just
stays where it is) we can only win, so do it.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years ago  get_verbose_service_state: render removal transition as 'deleting'
Thomas Lamprecht [Sat, 5 Oct 2019 17:11:44 +0000 (19:11 +0200)]
get_verbose_service_state: render removal transition as 'deleting'

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years ago  fix #1919, #1920: improve handling zombie (without node) services
Thomas Lamprecht [Sat, 5 Oct 2019 16:52:04 +0000 (18:52 +0200)]
fix #1919, #1920: improve handling zombie (without node) services

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years ago  read_and_check_resources_config: remove dead if branch
Thomas Lamprecht [Sat, 5 Oct 2019 16:34:31 +0000 (18:34 +0200)]
read_and_check_resources_config: remove dead if branch

We only reach the if (!$vmd) check if the previous
if (my $vmd = $vmlist->{ids}->{$name}) branch is taken, which means
$vmd is always true there.
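The dead branch can be shown in a few lines (the original is Perl; this is an equivalent Python sketch with illustrative names):

```python
# The inner `not vmd` check can only be reached when vmd is truthy,
# so it never fires; the whole branch is dead code.

def check(vmlist, name):
    vmd = vmlist['ids'].get(name)
    if vmd:
        if not vmd:  # dead branch: vmd is always truthy here
            return 'unreachable'
        return 'found'
    return 'missing'
```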

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years ago  LRM shutdown: factor out shutdown type to reuse message
Thomas Lamprecht [Sat, 5 Oct 2019 15:54:11 +0000 (17:54 +0200)]
LRM shutdown: factor out shutdown type to reuse message

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years ago  LRM shutdown request: propagate if we could not write out LRM status
Thomas Lamprecht [Sat, 5 Oct 2019 15:50:45 +0000 (17:50 +0200)]
LRM shutdown request: propagate if we could not write out LRM status

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years agofactor out resource config update from api to HA::Config
Fabian Ebner [Wed, 2 Oct 2019 09:46:02 +0000 (11:46 +0200)]
factor out resource config update from api to HA::Config

This makes it easier to update the resource configuration from within the CRM/LRM stack,
which is needed for the new 'stop' command.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
4 years agoRename target to param in simulation
Fabian Ebner [Mon, 30 Sep 2019 07:22:33 +0000 (09:22 +0200)]
Rename target to param in simulation

In preparation to introduce a stop command with a timeout parameter.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
4 years agoMake parameters for LRM resource commands more flexible
Fabian Ebner [Mon, 30 Sep 2019 07:22:26 +0000 (09:22 +0200)]
Make parameters for LRM resource commands more flexible

This will allow for new parameters beside 'target' to be used.
This is in preparation to allow for a 'timeout' parameter for a new 'stop' command.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
4 years agoCleanup
Fabian Ebner [Thu, 26 Sep 2019 11:38:59 +0000 (13:38 +0200)]
Cleanup

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
4 years agoWhitespace cleanup
Fabian Ebner [Thu, 26 Sep 2019 11:38:58 +0000 (13:38 +0200)]
Whitespace cleanup

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
4 years agobump version to 3.0-2
Thomas Lamprecht [Thu, 11 Jul 2019 17:27:27 +0000 (19:27 +0200)]
bump version to 3.0-2

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years agobuildsys: use DEB_VERSION_UPSTREAM for builddir
Thomas Lamprecht [Thu, 11 Jul 2019 17:23:51 +0000 (19:23 +0200)]
buildsys: use DEB_VERSION_UPSTREAM for builddir

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years agoAdd missing Dependencies to pve-ha-simulator
Rhonda D'Vine [Thu, 27 Jun 2019 10:50:16 +0000 (12:50 +0200)]
Add missing Dependencies to pve-ha-simulator

Adding these two missing dependencies makes it possible to install the
package on a stock Debian system (without PVE).

Signed-off-by: Rhonda D'Vine <rhonda@proxmox.com>
4 years agofix #2234: fix typo in service description
Christian Ebner [Wed, 12 Jun 2019 08:17:20 +0000 (10:17 +0200)]
fix #2234: fix typo in service description

replace Ressource by Resource

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
4 years agoservices: update PIDFile to point directly to /run
Thomas Lamprecht [Sun, 26 May 2019 13:16:10 +0000 (15:16 +0200)]
services: update PIDFile to point directly to /run

fixes a complaint from systemd:
> PIDFile= references path below legacy directory /var/run/

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
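The unit-file change can be sketched as below; the PID-file name is hypothetical, the point is the move from the legacy /var/run/ path to /run, which silences the systemd warning quoted above:

```ini
[Service]
# Before (legacy path, triggers the systemd complaint):
#   PIDFile=/var/run/example-daemon.pid
# After (references /run directly):
PIDFile=/run/example-daemon.pid
```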
4 years agobuildsys: switch upload dist over to buster
Thomas Lamprecht [Thu, 23 May 2019 16:18:16 +0000 (18:18 +0200)]
buildsys: switch upload dist over to buster

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years agobump version to 3.0-1
Thomas Lamprecht [Wed, 22 May 2019 17:18:40 +0000 (19:18 +0200)]
bump version to 3.0-1

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years agobuildsys: use dpkg-dev makefile helpers for pkg info
Thomas Lamprecht [Wed, 22 May 2019 17:11:29 +0000 (19:11 +0200)]
buildsys: use dpkg-dev makefile helpers for pkg info

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agohandle the case where a node gets fully removed
Thomas Lamprecht [Wed, 10 Apr 2019 10:41:17 +0000 (12:41 +0200)]
handle the case where a node gets fully removed

If an admin removes a node, they may also remove /etc/pve/nodes/NODE
quite soon after that. If the "node really deleted" logic of our
NodeStatus module has not triggered by then (it waits an hour), the
current manager still tries to read the gone node's LRM status, which
results in an exception. Turn this exception into a warning and return
an 'unknown' node state in such a case.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agocoding style cleanup
Thomas Lamprecht [Wed, 10 Apr 2019 10:29:49 +0000 (12:29 +0200)]
coding style cleanup

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agobump version to 2.0-9
Thomas Lamprecht [Thu, 4 Apr 2019 14:27:49 +0000 (16:27 +0200)]
bump version to 2.0-9

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agoservice data: only set failed_nodes key if needed
Thomas Lamprecht [Sat, 30 Mar 2019 18:52:46 +0000 (19:52 +0100)]
service data: only set failed_nodes key if needed

Currently we always set this, and thus each service gets a
"failed_nodes": null,
entry in the written-out JSON ha/manager_status.

So only set it if needed, which can reduce manager_status quite a bit
with a lot of services.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
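The saving can be illustrated with a small sketch (hypothetical Python with illustrative names; the real code is Perl): only emit the 'failed_nodes' key when there is actual data, instead of always writing a null entry.

```python
import json

def service_status(state, failed_nodes=None):
    # Build the per-service status entry, omitting 'failed_nodes'
    # entirely when there is nothing to record.
    entry = {"state": state}
    if failed_nodes is not None:
        entry["failed_nodes"] = failed_nodes
    return entry

# The sparse form is shorter than always serializing the null key,
# and the saving multiplies with the number of services.
always = json.dumps({"state": "started", "failed_nodes": None})
sparse = json.dumps(service_status("started"))
```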
5 years agopartially revert previous unclean commit
Thomas Lamprecht [Sat, 30 Mar 2019 18:21:03 +0000 (19:21 +0100)]
partially revert previous unclean commit

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agomake clean: also clean source tar ball
Thomas Lamprecht [Sat, 30 Mar 2019 18:17:03 +0000 (19:17 +0100)]
make clean: also clean source tar ball

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agod/control: remove obsolete dh-systemd dependency
Thomas Lamprecht [Sat, 30 Mar 2019 18:02:26 +0000 (19:02 +0100)]
d/control: remove obsolete dh-systemd dependency

We do not need to depend explicitly on dh-systemd as we have a
versioned debhelper dependency with >= 10~, and lintian on buster for
this .dsc even warns:

> build-depends-on-obsolete-package build-depends: dh-systemd => use debhelper (>= 9.20160709)

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agoadd target to build DSC
Thomas Lamprecht [Sat, 30 Mar 2019 17:59:36 +0000 (18:59 +0100)]
add target to build DSC

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agoadd gitignore
Thomas Lamprecht [Sat, 30 Mar 2019 17:57:37 +0000 (18:57 +0100)]
add gitignore

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>