git.proxmox.com Git - pve-ha-manager.git/log
pve-ha-manager.git
4 years agolrm.service: add after ordering for SSH and pveproxy
Thomas Lamprecht [Mon, 25 Nov 2019 16:35:43 +0000 (17:35 +0100)]
lrm.service: add after ordering for SSH and pveproxy

To avoid an early disconnect during shutdown, ensure we are ordered
After them. For shutdown the ordering is reversed, so we are stopped
before those two. This allows checking the node stats and doing SSH
stuff if something fails.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years agodo simple fallback if node comes back online from maintenance
Thomas Lamprecht [Mon, 25 Nov 2019 16:48:42 +0000 (17:48 +0100)]
do simple fallback if node comes back online from maintenance

We simply remember the node we were on, if moved for maintenance.
This record gets dropped once we move to _any_ other node, be it:
* our previous node, as it came back from maintenance
* another node due to manual migration, group priority changes or
  fencing

The first point is handled explicitly by this patch. In
select_service_node we check for an old fallback node; if that one is
found in the online node list with top priority we _always_ move to it,
even if there's no real reason for a move.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
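The fallback rule described above can be sketched roughly as follows. This is a simplified illustration in Python, not the actual Perl code: the function signature, the priority mapping, and the alphabetical tie-break are assumptions made for the example (the real selection also weighs node usage).

```python
def select_service_node(current_node, online_priority, fallback_node=None):
    """Pick the node a service should run on.

    online_priority maps each *online* node to its group priority
    (higher wins). fallback_node is the node the service was moved
    away from for maintenance, if any (hypothetical parameter).
    """
    if not online_priority:
        return current_node  # nowhere to go, stay put

    top = max(online_priority.values())
    top_nodes = sorted(n for n, p in online_priority.items() if p == top)

    # The remembered node is back online with top priority: always
    # move back, even if there is no other reason for a move.
    if fallback_node is not None and fallback_node in top_nodes:
        return fallback_node

    # Otherwise stay where we are if that is still a top-priority node.
    if current_node in top_nodes:
        return current_node

    # Simplification: the first top-priority node in alphabetical order.
    return top_nodes[0]
```

With three equal-priority nodes, a service currently on node2 whose remembered node is node1 moves back to node1; without a remembered node it stays on node2.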
4 years agoadd 'migrate' node shutdown policy
Thomas Lamprecht [Fri, 4 Oct 2019 17:35:15 +0000 (19:35 +0200)]
add 'migrate' node shutdown policy

This adds handling for a new shutdown policy, namely "migrate".
If that is set then the LRM doesn't queue stop jobs, but transitions
to a new mode, namely 'maintenance'.

The LRM modes now get passed from the CRM in the NodeStatus update
method, this allows to detect such a mode and make node-status state
transitions. Effectively we only allow to transition if we're
currently online, else this is ignored. 'maintenance' does not
protect from fencing.

The moving then gets done by select_service_node. A node in
maintenance mode is not in "list_online_nodes" and so also not in
online_node_usage, which is used to re-calculate if a service needs to
be moved. Only started services will get moved; this can be done almost
entirely by leveraging existing behavior, the next_state_started FSM
state transition method just needs to be taught to not return early for
nodes which are not online but in maintenance mode.

A few tests, adapted from the other policy tests, are added to
showcase behavior with reboot, shutdown, and shutdown of the current
manager. They also show the behavior when a service cannot be
migrated, albeit as our test system is limited to simulating at most 9
migration failures, it "seems" to succeed after that. But note here
that the maximal retries would have been hit much earlier, so
this is just an artifact of our test system.

Besides some implementation details, two questions are still not solved
by this approach:
* what if a service cannot be moved away, either by errors or as no
  alternative node is found by select_service_node
  - retrying indefinitely, this happens currently. The user set this
    up like this in the first place. We will order SSH, pveproxy,
    after the LRM service to ensure that there's still the possibility
    for manual interventions
  - an idea would be to track the time and see if we're stuck (this is
    not too hard), in such a case we could stop the services after X
    minutes and continue.
* a full cluster shutdown, but that is even without this mode not too
  ideal, nodes will get fenced after no partition is quorate anymore,
  already. And as long as it's just a central setting in DC config,
  an admin has a single switch to flip to make it work, so not sure
  how much handling we want to do here, if we go over the point where
  we have no quorum we're dead anyhow, so.. at least not really an
  issue of this series, orthogonally related yes, but not more.

For real world usability the datacenter.cfg schema needs to be
changed to allow the migrate shutdown policy, but that's trivial.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
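The node-status transition described above can be sketched like this. It is an illustrative Python sketch only: the function names mirror terms from the commit message, while the state and mode strings other than 'online' and 'maintenance' are placeholders, and the way back out of maintenance is deliberately left out because the message does not describe it.

```python
def next_node_state(current_state, lrm_mode):
    """Return the node state after seeing the LRM mode reported for it."""
    # Only an online node may enter maintenance; in any other state the
    # request is ignored. Maintenance does not protect from fencing.
    if lrm_mode == "maintenance" and current_state == "online":
        return "maintenance"
    return current_state


def list_online_nodes(node_states):
    """Nodes in maintenance are not online, so they are left out here."""
    return sorted(n for n, state in node_states.items() if state == "online")
```

Because a node in maintenance drops out of the online list, the normal node selection moves started services away without further special casing.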
4 years agoaccount service to source and target during move
Thomas Lamprecht [Mon, 25 Nov 2019 17:05:11 +0000 (18:05 +0100)]
account service to source and target during move

As the service load is often still happening on the source, and the
target may feel the performance impact from an incoming migration,
account the service to both nodes during that time.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
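A minimal sketch of this accounting, in Python for illustration. The service dictionary layout and the state names 'migrate' and 'relocate' are assumptions of the example, not taken from the patch.

```python
def online_node_usage(services, online_nodes):
    """Count services per online node; a moving service counts on both ends."""
    usage = {node: 0 for node in online_nodes}
    for sd in services:
        nodes = {sd["node"]}
        # During a move the load is still on the source while the target
        # already feels the incoming migration, so count both.
        if sd["state"] in ("migrate", "relocate"):
            nodes.add(sd["target"])
        for node in nodes:
            if node in usage:  # offline nodes are not tracked
                usage[node] += 1
    return usage
```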
4 years agomanager select_service_node: code cleanup
Thomas Lamprecht [Mon, 25 Nov 2019 16:08:06 +0000 (17:08 +0100)]
manager select_service_node: code cleanup

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years agolrm.service: sort After statements
Thomas Lamprecht [Mon, 25 Nov 2019 16:34:53 +0000 (17:34 +0100)]
lrm.service: sort After statements

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years agobump version to 3.0-5
Thomas Lamprecht [Wed, 20 Nov 2019 19:14:11 +0000 (20:14 +0100)]
bump version to 3.0-5

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years agod/control: re-add CT/VM dependency
Thomas Lamprecht [Wed, 20 Nov 2019 19:13:33 +0000 (20:13 +0100)]
d/control: re-add CT/VM dependency

this was an issue for 5.x, initial pre-6.0 and should work now again
as expected..

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years agorefactor: vm_qmp_command was moved to PVE::QemuServer::Monitor
Stefan Reiter [Tue, 19 Nov 2019 11:23:50 +0000 (12:23 +0100)]
refactor: vm_qmp_command was moved to PVE::QemuServer::Monitor

Also change to mon_cmd helper, avoid calling qmp_cmd directly.

Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
4 years agofix #1339: remove more locks from services IF the node got fenced
Thomas Lamprecht [Tue, 19 Nov 2019 13:05:30 +0000 (14:05 +0100)]
fix #1339: remove more locks from services IF the node got fenced

Remove further locks from a service after it was recovered from a
fenced node. This can be done due to the fact that the node was
fenced and thus the operation it was locked for was interrupted
anyway. We note in the syslog that we removed a lock.

Mostly we disallow the 'create' lock, as here is the only case where
we know that the service was not yet in a runnable state before.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
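As a rough illustration of the rule above (Python sketch, not the actual implementation): drop a leftover lock after recovery from a fenced node, log that we did so, but keep 'create'. The config layout, the function name, and the assumption that 'create' is the only lock kept are made up for the example.

```python
def remove_lock_after_fence(service_config, log):
    """Remove a stale lock from a recovered service; return True if removed."""
    lock = service_config.get("lock")
    if lock is None:
        return False
    # A service locked for 'create' was never in a runnable state,
    # so this lock must stay.
    if lock == "create":
        return False
    del service_config["lock"]
    log(f"removed leftover lock '{lock}' from recovered service")
    return True
```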
4 years agobump version to 3.0-4
Fabian Grünbichler [Mon, 11 Nov 2019 10:28:13 +0000 (11:28 +0100)]
bump version to 3.0-4

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years agouse PVE::DataCenterConfig
Fabian Grünbichler [Mon, 11 Nov 2019 10:28:12 +0000 (11:28 +0100)]
use PVE::DataCenterConfig

to make sure that the corresponding cfs_read_file() works.

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
4 years agocli stop cmd: fix property desc. indentation
Thomas Lamprecht [Thu, 14 Nov 2019 13:39:30 +0000 (14:39 +0100)]
cli stop cmd: fix property desc. indentation

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years agobump version to 3.0-3
Thomas Lamprecht [Mon, 11 Nov 2019 16:04:40 +0000 (17:04 +0100)]
bump version to 3.0-3

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years agofollowup, adapt stop request log messages; include SID
Thomas Lamprecht [Mon, 11 Nov 2019 15:50:37 +0000 (16:50 +0100)]
followup, adapt stop request log messages; include SID

it's always good to say that we request it, not that people think the
task should have been already started..

Also include the service ID (SID), so people know what we want(ed) to
stop at all.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years agoIntroduce crm-command to CLI and add stop as a subcommand
Fabian Ebner [Thu, 10 Oct 2019 10:25:08 +0000 (12:25 +0200)]
Introduce crm-command to CLI and add stop as a subcommand

This should reduce confusion between the old 'set <sid> --state stopped' and
the new 'stop' command by making explicit that it is sent as a crm command.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
4 years agoAdd crm command 'stop'
Fabian Ebner [Thu, 10 Oct 2019 10:25:07 +0000 (12:25 +0200)]
Add crm command 'stop'

Not every command parameter is 'target' anymore, so
it was necessary to modify the parsing of $sd->{cmd}.

Just changing the state to request_stop is not enough,
we need to actually update the service configuration as well.

Add a simple test for the stop command

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years agoAdd timeout parameter for shutdown
Fabian Ebner [Thu, 10 Oct 2019 10:25:06 +0000 (12:25 +0200)]
Add timeout parameter for shutdown

Introduces a timeout parameter for shutting a resource down.
If the parameter is 0, we perform a hard stop instead of a shutdown.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
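The timeout semantics can be summarized in a few lines. This Python sketch only illustrates the rule stated above; the function name and the returned dictionary are hypothetical.

```python
def plan_stop(sid, timeout):
    """Decide how to bring a resource down for a given timeout in seconds."""
    if timeout < 0:
        raise ValueError("timeout must not be negative")
    if timeout == 0:
        # No time granted for a clean shutdown: hard stop.
        return {"sid": sid, "action": "stop"}
    return {"sid": sid, "action": "shutdown", "timeout": timeout}
```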
4 years agoAdd update_service_config to the HA environment interface and simulation
Fabian Ebner [Thu, 10 Oct 2019 10:25:05 +0000 (12:25 +0200)]
Add update_service_config to the HA environment interface and simulation

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
4 years agofollowup: s/ss/sc/
Thomas Lamprecht [Sat, 5 Oct 2019 18:25:32 +0000 (20:25 +0200)]
followup: s/ss/sc/

fixes: dcb4a2a48404a8bf06df41e071fea348d0c971a4

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years agofix # 2241: VM resource: allow migration with local device, when not running
Thomas Lamprecht [Sat, 5 Oct 2019 18:10:27 +0000 (20:10 +0200)]
fix # 2241: VM resource: allow migration with local device, when not running

qemu-server ignores the flag if the VM runs, so just set it to true
hardcoded.

People have identical hosts with the same HW and want to be able to
relocate VMs in such cases, so allow it here. QEMU knows to complain
if it cannot work, and as nothing bad happens then (the VM just stays
where it is) we can only win, so do it.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years agoget_verbose_service_state: render removal transition as 'deleting'
Thomas Lamprecht [Sat, 5 Oct 2019 17:11:44 +0000 (19:11 +0200)]
get_verbose_service_state: render removal transition as 'deleting'

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years agofix #1919, #1920: improve handling zombie (without node) services
Thomas Lamprecht [Sat, 5 Oct 2019 16:52:04 +0000 (18:52 +0200)]
fix #1919, #1920: improve handling zombie (without node) services

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years agoread_and_check_resources_config: remove dead if branch
Thomas Lamprecht [Sat, 5 Oct 2019 16:34:31 +0000 (18:34 +0200)]
read_and_check_resources_config: remove dead if branch

we only come to the if (!$vmd) check if the previous
if (my $vmd = $vmlist->{ids}->{$name}) is taken, which means $vmd is
always true then.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years agoLRM shutdown: factor out shutdown type to reuse message
Thomas Lamprecht [Sat, 5 Oct 2019 15:54:11 +0000 (17:54 +0200)]
LRM shutdown: factor out shutdown type to reuse message

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years agoLRM shutdown request: propagate if we could not write out LRM status
Thomas Lamprecht [Sat, 5 Oct 2019 15:50:45 +0000 (17:50 +0200)]
LRM shutdown request: propagate if we could not write out LRM status

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years agofactor out resource config update from api to HA::Config
Fabian Ebner [Wed, 2 Oct 2019 09:46:02 +0000 (11:46 +0200)]
factor out resource config update from api to HA::Config

This makes it easier to update the resource configuration from within the CRM/LRM stack,
which is needed for the new 'stop' command.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
4 years agoRename target to param in simulation
Fabian Ebner [Mon, 30 Sep 2019 07:22:33 +0000 (09:22 +0200)]
Rename target to param in simulation

In preparation to introduce a stop command with a timeout parameter.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
4 years agoMake parameters for LRM resource commands more flexible
Fabian Ebner [Mon, 30 Sep 2019 07:22:26 +0000 (09:22 +0200)]
Make parameters for LRM resource commands more flexible

This will allow for new parameters beside 'target' to be used.
This is in preparation to allow for a 'timeout' parameter for a new 'stop' command.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
4 years agoCleanup
Fabian Ebner [Thu, 26 Sep 2019 11:38:59 +0000 (13:38 +0200)]
Cleanup

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
4 years agoWhitespace cleanup
Fabian Ebner [Thu, 26 Sep 2019 11:38:58 +0000 (13:38 +0200)]
Whitespace cleanup

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
4 years agobump version to 3.0-2
Thomas Lamprecht [Thu, 11 Jul 2019 17:27:27 +0000 (19:27 +0200)]
bump version to 3.0-2

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years agobuildsys: use DEB_VERSION_UPSTREAM for buildir
Thomas Lamprecht [Thu, 11 Jul 2019 17:23:51 +0000 (19:23 +0200)]
buildsys: use DEB_VERSION_UPSTREAM for buildir

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years agoAdd missing Dependencies to pve-ha-simulator
Rhonda D'Vine [Thu, 27 Jun 2019 10:50:16 +0000 (12:50 +0200)]
Add missing Dependencies to pve-ha-simulator

Adding these two missing dependencies makes it possible to install the
package on a stock Debian system (without PVE)

Signed-off-by: Rhonda D'Vine <rhonda@proxmox.com>
4 years agofix #2234: fix typo in service description
Christian Ebner [Wed, 12 Jun 2019 08:17:20 +0000 (10:17 +0200)]
fix #2234: fix typo in service description

replace Ressource by Resource

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
4 years agoservices: update PIDFile to point directly to /run
Thomas Lamprecht [Sun, 26 May 2019 13:16:10 +0000 (15:16 +0200)]
services: update PIDFile to point directly to /run

fixes a complaint from systemd:
> PIDFile= references path below legacy directory /var/run/

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years agobuildsys: switch upload dist over to buster
Thomas Lamprecht [Thu, 23 May 2019 16:18:16 +0000 (18:18 +0200)]
buildsys: switch upload dist over to buster

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years agobump version to 3.0-1
Thomas Lamprecht [Wed, 22 May 2019 17:18:40 +0000 (19:18 +0200)]
bump version to 3.0-1

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
4 years agobuildsys: use dpkg-dev makefile helpers for pkg info
Thomas Lamprecht [Wed, 22 May 2019 17:11:29 +0000 (19:11 +0200)]
buildsys: use dpkg-dev makefile helpers for pkg info

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agohandle the case where a node gets fully removed
Thomas Lamprecht [Wed, 10 Apr 2019 10:41:17 +0000 (12:41 +0200)]
handle the case where a node gets fully removed

If an admin removes a node he may also remove /etc/pve/nodes/NODE
quite soon after that. If the "node really deleted" logic of our
NodeStatus module has not triggered until then (it waits an hour), the
current manager still tries to read the gone node's LRM status, which
results in an exception. Turn this exception into a warning and return
a node == unknown state in such a case.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
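The change amounts to downgrading a read failure to a warning. A hedged Python sketch of that pattern follows; the reader callback, the 'state' key, and the function name are assumptions for illustration only.

```python
def read_node_state(node, read_lrm_status, warn):
    """Return the node's reported state, or 'unknown' if it cannot be read."""
    try:
        status = read_lrm_status(node)
    except Exception as err:
        # The node directory may already be gone: warn, do not die.
        warn(f"unable to read lrm status for node '{node}': {err}")
        return "unknown"
    return status.get("state", "unknown")
```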
5 years agocoding style cleanup
Thomas Lamprecht [Wed, 10 Apr 2019 10:29:49 +0000 (12:29 +0200)]
coding style cleanup

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agobump version to 2.0-9
Thomas Lamprecht [Thu, 4 Apr 2019 14:27:49 +0000 (16:27 +0200)]
bump version to 2.0-9

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agoservice data: only set failed_nodes key if needed
Thomas Lamprecht [Sat, 30 Mar 2019 18:52:46 +0000 (19:52 +0100)]
service data: only set failed_nodes key if needed

Currently we always set this, and thus each service gets a
"failed_nodes": null,
entry in the written out JSON ha/manager_status

so only set it if needed, which can reduce manager_status quite a bit
with a lot of services.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
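To illustrate the effect on the serialized status, here is a small Python sketch. Apart from failed_nodes, the keys and the helper name are chosen for the example and are not the real manager_status layout.

```python
import json


def service_status(node, state, failed_nodes=None):
    """Build a per-service status entry, omitting failed_nodes when empty."""
    sd = {"node": node, "state": state}
    if failed_nodes:  # skip both None and an empty list
        sd["failed_nodes"] = list(failed_nodes)
    return sd


# Without failures the key is absent instead of being written as null.
print(json.dumps(service_status("node1", "started"), sort_keys=True))
```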
5 years agopartially revert previous unclean commit
Thomas Lamprecht [Sat, 30 Mar 2019 18:21:03 +0000 (19:21 +0100)]
partially revert previous unclean commit

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agomake clean: also clean source tar ball
Thomas Lamprecht [Sat, 30 Mar 2019 18:17:03 +0000 (19:17 +0100)]
make clean: also clean source tar ball

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agod/control: remove obsolete dh-systemd dependency
Thomas Lamprecht [Sat, 30 Mar 2019 18:02:26 +0000 (19:02 +0100)]
d/control: remove obsolete dh-systemd dependency

We do not need to depend explicitly on dh-systemd as we have a
versioned debhelper dependency with >= 10~, and lintian on buster for
this .dsc even warns:

> build-depends-on-obsolete-package build-depends: dh-systemd => use debhelper (>= 9.20160709)

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agoadd target to build DSC
Thomas Lamprecht [Sat, 30 Mar 2019 17:59:36 +0000 (18:59 +0100)]
add target to build DSC

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agoadd gitignore
Thomas Lamprecht [Sat, 30 Mar 2019 17:57:37 +0000 (18:57 +0100)]
add gitignore

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agod/control: remove unused libsystemd-dev from build dependencies
Thomas Lamprecht [Thu, 21 Mar 2019 12:18:37 +0000 (13:18 +0100)]
d/control: remove unused libsystemd-dev from build dependencies

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agolrm: exit on restart and agent lock lost for > 90s
Thomas Lamprecht [Fri, 15 Mar 2019 09:48:44 +0000 (10:48 +0100)]
lrm: exit on restart and agent lock lost for > 90s

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agoPVE2 Env: get_ha_settings: don't die if pmxcfs failed
Thomas Lamprecht [Fri, 15 Mar 2019 08:43:28 +0000 (09:43 +0100)]
PVE2 Env: get_ha_settings: don't die if pmxcfs failed

This is a method called in our shutdown path, so if we die here we
may silence a shutdown, and just ignore it.
In combination with the fact that our service unit is configured
with: 'TimeoutStopSec=infinity' this means that a systemctl stop may
wait infinitely for this to happen, and any other systemctl command
will be queued for that long.

So if pmxcfs is stopped, we then get a shutdown request, we cannot
start pmxcfs again, at least not through systemd.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
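The idea is that a failing config read must not abort the shutdown path. A minimal Python sketch under stated assumptions: the reader callback, the 'ha' key, and returning an empty dict on failure are illustrative choices, not the actual behavior of get_ha_settings.

```python
def get_ha_settings(read_datacenter_config, log):
    """Return the HA settings, or an empty dict if the config is unreadable."""
    try:
        dc_config = read_datacenter_config()
    except Exception as err:
        # e.g. pmxcfs is stopped; log it and fall back to defaults
        # instead of dying inside the shutdown path.
        log(f"unable to read datacenter config: {err}")
        return {}
    return dc_config.get("ha", {})
```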
5 years agotreewide trailing whitespace cleanup
Thomas Lamprecht [Thu, 14 Mar 2019 12:18:03 +0000 (13:18 +0100)]
treewide trailing whitespace cleanup

generated by: find . -name '*.pm' -exec sed -i 's/\s*$//g' {} \;

As I touched almost any file here anyway I'm not scared to appear in
git blame ;-) also it has support to suppress whitespace changes.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
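The sed expression in the commit above strips trailing whitespace from each line. A rough Python equivalent, for illustration only: the function name is made up, and Python's \s is not guaranteed to match exactly the same characters as sed's.

```python
import re


def strip_trailing_whitespace(text):
    """Remove trailing whitespace from every line, keeping the line breaks."""
    lines = text.split("\n")
    return "\n".join(re.sub(r"\s+$", "", line) for line in lines)
```

Leading indentation is left untouched; only whitespace directly before a line end is removed.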
5 years agobump version to 2.0-8
Thomas Lamprecht [Wed, 6 Mar 2019 07:03:32 +0000 (08:03 +0100)]
bump version to 2.0-8

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agod/control: do not track qemu-server and pve-container dependency
Thomas Lamprecht [Wed, 6 Mar 2019 06:51:46 +0000 (07:51 +0100)]
d/control: do not track qemu-server and pve-container dependency

While it would be correct to have them tracked here we cannot do this
at the moment, as those two also depend on pve-ha-manager, and
with the dpkg packaged under stretch there's an issue with such cyclic
dependencies and trigger cycle detection, only resolved for buster[0]

Currently, the issue exists under the following conditions:

* update of pve-ha-manager plus either pve-container or qemu-server
* but _no_ update of pve-manager in the same upgrade cycle

[0]: https://salsa.debian.org/dpkg-team/dpkg/commit/7f43bf5f93c857bdb419892abfc014a5e9c3c273

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agobump version to 2.0-7
Thomas Lamprecht [Mon, 4 Mar 2019 09:36:41 +0000 (10:36 +0100)]
bump version to 2.0-7

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agod/control: bump version dependency to pve-doc-generator
Thomas Lamprecht [Fri, 22 Feb 2019 12:31:32 +0000 (13:31 +0100)]
d/control: bump version dependency to pve-doc-generator

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years ago1891 Add zsh command completion for ha-manager CLI tools
Christian Ebner [Thu, 21 Feb 2019 13:25:00 +0000 (14:25 +0100)]
1891 Add zsh command completion for ha-manager CLI tools

Add the zsh command completion generation for the ha-manager CLI tools.

This adds the automatic generation of the autocompletion scripts for zsh

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
5 years agoapi: delete resource: refactor and cleanup indentation
Thomas Lamprecht [Wed, 23 Jan 2019 12:42:11 +0000 (13:42 +0100)]
api: delete resource: refactor and cleanup indentation

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agofix #1602: allow to delete 'ignored' services over API
Thomas Lamprecht [Wed, 23 Jan 2019 09:34:40 +0000 (10:34 +0100)]
fix #1602: allow to delete 'ignored' services over API

service_is_ha_managed returns false if a service is in the resource
configuration but marked as 'ignore', as for the internal stack it is
as if it wasn't HA managed at all.

But a user should be able to remove it from the configuration easily
even in this state, without setting the request state to anything else
first.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agofix #1794: VM resource: catch qmp command exceptions
Thomas Lamprecht [Wed, 23 Jan 2019 12:50:14 +0000 (13:50 +0100)]
fix #1794: VM resource: catch qmp command exceptions

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agoapi: resource migrate: add description for node parameter
Thomas Lamprecht [Thu, 24 Jan 2019 13:18:55 +0000 (14:18 +0100)]
api: resource migrate: add description for node parameter

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agofix #1842: do not pass forceStop to CT shutdown
Thomas Lamprecht [Wed, 23 Jan 2019 08:43:14 +0000 (09:43 +0100)]
fix #1842: do not pass forceStop to CT shutdown

The vm_shutdown parameter forceStop differs in behaviour between VMs
and CTs. While on VMs it ensures that a VM gets stopped if it could
not shut down gracefully only after the timeout passed, the container
stack always ignores any timeout if forceStop is set and hard stops
the CT immediately.
To achieve this behaviour for CTs too, the timeout is enough, as
lxc-stop then does the hard stop after the timeout itself.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
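The resulting difference in the shutdown parameters can be sketched as below. This is a Python illustration, not the Perl resource code; the function name and the 'vm'/'ct' type strings are assumptions, only forceStop and timeout come from the message above.

```python
def shutdown_params(resource_type, timeout):
    """Build shutdown parameters for a VM or CT resource."""
    params = {"timeout": timeout}
    if resource_type == "vm":
        # For VMs, forceStop means: hard stop only after the timeout passed.
        params["forceStop"] = 1
    # For CTs, forceStop would hard stop immediately, so it is not passed;
    # lxc-stop does the hard stop after the timeout on its own.
    return params
```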
5 years agofence cfg: count_devices: improve comment
Thomas Lamprecht [Sun, 13 Jan 2019 12:17:12 +0000 (13:17 +0100)]
fence cfg: count_devices: improve comment

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agofence cfg parser: check command explicit, mark fence_all as todo
Thomas Lamprecht [Sun, 13 Jan 2019 12:06:05 +0000 (13:06 +0100)]
fence cfg parser: check command explicit, mark fence_all as todo

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agofence config parser: output all errors at once
Thomas Lamprecht [Sun, 13 Jan 2019 11:44:56 +0000 (12:44 +0100)]
fence config parser: output all errors at once

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agofence config parser: early return on ignored devices
Thomas Lamprecht [Sun, 13 Jan 2019 11:39:53 +0000 (12:39 +0100)]
fence config parser: early return on ignored devices

We do not support all of the dlm.conf possibilities, but we also do
not want to die on such "unknown" keys/commands as an admin should be
able to share this config if it is already used for other purposes,
e.g. lockd, gfs, or such.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agoFenceConfig: move line parsing out to closure
Thomas Lamprecht [Sun, 13 Jan 2019 11:30:36 +0000 (12:30 +0100)]
FenceConfig: move line parsing out to closure

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agoFenceConfig: whitespace cleanup
Thomas Lamprecht [Sun, 13 Jan 2019 11:29:08 +0000 (12:29 +0100)]
FenceConfig: whitespace cleanup

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agoFenceConfig: early return if file is empty
Thomas Lamprecht [Sun, 13 Jan 2019 11:21:26 +0000 (12:21 +0100)]
FenceConfig: early return if file is empty

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agod/lintian-overrids: add repeated-trigger-name override
Thomas Lamprecht [Tue, 4 Sep 2018 09:21:04 +0000 (11:21 +0200)]
d/lintian-overrids: add repeated-trigger-name override

In this package we provide API functions, thus we want to activate
the pve-api-update trigger, so that packages like pve-manager get
notified about it. But we also use API functions directly, so we set up
an interest in the pve-api-update trigger. This results in a lintian
error (lintian version from buster or newer) which we can override:

> [...]
> This tag is also triggered if the package has an activate trigger
> for something on which it also declares an interest. The only (but
> rather unlikely) reason to do this is if another package also
> declares an interest and this package needs to activate that other
> package. If the package is using it for this exact purpose, then
> please use a Lintian override to state this.
-- https://lintian.debian.org/tags/repeated-trigger-name.html

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agosim: show sent emails in regression tests
Thomas Lamprecht [Fri, 27 Jan 2017 10:51:28 +0000 (11:51 +0100)]
sim: show sent emails in regression tests

it's good to check if any regression regarding sendmail happened, as
it can be annoying if a sendmail loop happens.

5 years agofence config: allow to pass arguments to fence agents via short-opts
Thomas Lamprecht [Tue, 8 Jan 2019 14:21:48 +0000 (15:21 +0100)]
fence config: allow to pass arguments to fence agents via short-opts

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agod/control: add missing pve-container dependency
Thomas Lamprecht [Tue, 4 Sep 2018 09:27:00 +0000 (11:27 +0200)]
d/control: add missing pve-container dependency

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agofencing: fixup run_fence_jobs
Thomas Lamprecht [Tue, 4 Sep 2018 09:28:05 +0000 (11:28 +0200)]
fencing: fixup run_fence_jobs

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agofixup changelog line length and typos
Thomas Lamprecht [Mon, 7 Jan 2019 12:35:34 +0000 (13:35 +0100)]
fixup changelog line length and typos

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agobump version 2.0-6
Thomas Lamprecht [Mon, 7 Jan 2019 12:00:00 +0000 (13:00 +0100)]
bump version 2.0-6

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agofixup parse_sid call
Wolfgang Bumiller [Mon, 7 Jan 2019 11:04:24 +0000 (12:04 +0100)]
fixup parse_sid call

This call was missed in the commit moving it from
PVE::HA::Tools to PVE::HA::Config.

Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Fixes: 0087839aa530 ("Tools: remove dependency on PVE::Cluster")
5 years agofollowup code cleanup
Thomas Lamprecht [Mon, 7 Jan 2019 11:07:03 +0000 (12:07 +0100)]
followup code cleanup

addresses a few nits from Fabian's review at:
https://pve.proxmox.com/pipermail/pve-devel/2018-December/035061.html
https://pve.proxmox.com/pipermail/pve-devel/2018-December/035085.html

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agolrm: explicitly log shutdown_policy on node shutdown
Thomas Lamprecht [Thu, 20 Dec 2018 07:44:43 +0000 (08:44 +0100)]
lrm: explicitly log shutdown_policy on node shutdown

Makes the regression tests a bit more telling and it helps to be verbose
for a user here too.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agofix #1378: allow to specify a service shutdown policy
Thomas Lamprecht [Thu, 20 Dec 2018 07:44:42 +0000 (08:44 +0100)]
fix #1378: allow to specify a service shutdown policy

Allow an admin to set a datacenter wide HA policy which can change
the way we handle services on a node shutdown.

There's:

* freeze: always freeze services, independent of the shutdown type
  (reboot, poweroff)
* failover: never freeze services, this means that a service will get
  recovered to another node if possible and if the current node does
  not come back up in the grace period of 1 minute.
* default: this is the current behavior, freeze on reboot but do not
  freeze on poweroff

Add two tests, shutdown-policy1 which is based on the reboot1 test,
but enforces no freeze with a failover policy, and shutdown-policy2
which is based on the shutdown1 test but with an explicit freeze
policy. You can compare (diff) each test's log result to the test it's
based on to see what changes.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
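The three policies listed above boil down to a small decision table. A Python sketch for illustration; the function name and the shutdown type strings are assumptions, the policy names are taken from the list.

```python
def freeze_services(policy, shutdown_type):
    """Return True if services should be frozen on this node shutdown."""
    if policy == "freeze":
        return True   # always freeze, whatever the shutdown type
    if policy == "failover":
        return False  # never freeze, let services be recovered elsewhere
    if policy == "default":
        # previous behavior: freeze on reboot, not on poweroff
        return shutdown_type == "reboot"
    raise ValueError(f"unknown shutdown policy '{policy}'")
```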
5 years agoEnv: add get_ha_settings method
Thomas Lamprecht [Thu, 20 Dec 2018 07:44:41 +0000 (08:44 +0100)]
Env: add get_ha_settings method

Add get_ha_settings, a method which returns the datacenter wide HA
settings

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agoAdd missing Build-Depends
Rhonda D'Vine [Fri, 14 Dec 2018 14:52:32 +0000 (15:52 +0100)]
Add missing Build-Depends

Signed-off-by: Rhonda D'Vine <rhonda@proxmox.com>
5 years agoinstall simulator executable into bin not sbin
Thomas Lamprecht [Wed, 17 Oct 2018 09:51:04 +0000 (11:51 +0200)]
install simulator executable into bin not sbin

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agoTools: add note about indirect include of Config module
Thomas Lamprecht [Wed, 17 Oct 2018 09:41:44 +0000 (11:41 +0200)]
Tools: add note about indirect include of Config module

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
5 years agobuild: actually ship SOURCE file
Fabian Grünbichler [Wed, 10 Oct 2018 11:55:07 +0000 (13:55 +0200)]
build: actually ship SOURCE file

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
5 years agobuild: bump compat level to 10
Fabian Grünbichler [Wed, 10 Oct 2018 11:55:06 +0000 (13:55 +0200)]
build: bump compat level to 10

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
5 years agobuild: restructure packaging
Fabian Grünbichler [Wed, 10 Oct 2018 11:55:05 +0000 (13:55 +0200)]
build: restructure packaging

use dpkg-buildpackage and debhelper properly, add missing dependencies and
embed used perl modules from libpve-common-perl to make pve-ha-simulator
standalone.

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
5 years agoTools: remove dependency on PVE::Cluster
Fabian Grünbichler [Wed, 10 Oct 2018 11:55:04 +0000 (13:55 +0200)]
Tools: remove dependency on PVE::Cluster

by moving parse_sid to PVE::HA::Env, with the default implementation in
PVE::HA::Config.

the bash completion methods use PVE::HA::Config (and PVE::Cluster), but
the corresponding use statements are only in PVE::CLI::ha_manager, where the
bash completion is actually used.

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
5 years agoTools/Config: refactor lrm status json reading
Fabian Grünbichler [Wed, 10 Oct 2018 11:55:03 +0000 (13:55 +0200)]
Tools/Config: refactor lrm status json reading

to avoid unnecessary dependency on PVE::Cluster in PVE::HA::Tools.

reading the LRM status file was the only instance of reading from the
CFS via this method.

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
5 years agosim: don't install PVE::HA::Config
Fabian Grünbichler [Fri, 28 Sep 2018 10:48:54 +0000 (12:48 +0200)]
sim: don't install PVE::HA::Config

it is not needed anymore by the simulator.

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
5 years agosim: don't install real resources
Fabian Grünbichler [Fri, 28 Sep 2018 10:48:53 +0000 (12:48 +0200)]
sim: don't install real resources

they are not needed, the simulator contains its own (simulated)
resources.

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
5 years agogroups: register groups directly
Fabian Grünbichler [Fri, 28 Sep 2018 10:48:51 +0000 (12:48 +0200)]
groups: register groups directly

and use PVE::HA::Groups to parse the config when testing/simulating.

this allows us to drop the dependency on PVE::HA::Config, which would
otherwise pull in a lot of additional dependencies that we don't want
in the simulator.

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
5 years agopve-ha-tester: use correct lib path
Fabian Grünbichler [Fri, 28 Sep 2018 10:48:50 +0000 (12:48 +0200)]
pve-ha-tester: use correct lib path

since we want to test the version from the current working tree, and not
the installed one.

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
5 years agoremove unused use statements
Fabian Grünbichler [Fri, 28 Sep 2018 10:48:49 +0000 (12:48 +0200)]
remove unused use statements

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
5 years agobuild: remove leftover PHONY declaration
Fabian Grünbichler [Fri, 28 Sep 2018 10:48:48 +0000 (12:48 +0200)]
build: remove leftover PHONY declaration

simdeb is already declared PHONY on its own

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
5 years agodocument api result for ha resources
Dominik Csapak [Mon, 17 Sep 2018 08:33:21 +0000 (10:33 +0200)]
document api result for ha resources

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
6 years agobump version to 2.0-5
Thomas Lamprecht [Wed, 7 Feb 2018 10:20:21 +0000 (11:20 +0100)]
bump version to 2.0-5

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
6 years agobuildsys: use correct git revision for SOURCE file
Thomas Lamprecht [Wed, 7 Feb 2018 09:32:27 +0000 (10:32 +0100)]
buildsys: use correct git revision for SOURCE file

6 years agodo not do active work if cfs update failed
Thomas Lamprecht [Wed, 22 Nov 2017 10:53:12 +0000 (11:53 +0100)]
do not do active work if cfs update failed

We ignored if the cluster state update failed and happily worked with
an empty state, resulting in strange actions, e.g., the removal of
all (not so) "stale" services or changing the state of all nodes but
the master's to unknown.

Check the update result and if it failed, either do not get active,
or, if already active, skip the current round with the knowledge
that we only got here because the update failed but our lock renewal
worked => cfs already got into a working and quorate state again
(probably just a restart)

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
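The check described above can be condensed into a small decision function. This is a Python sketch for illustration; the function name and the returned labels are hypothetical.

```python
def crm_round_decision(cluster_update_ok, is_active):
    """Decide what a manager round may do given the cfs update result."""
    if cluster_update_ok:
        return "work"
    # Update failed: never act on a possibly empty cluster state.
    if is_active:
        # Our lock renewal worked, so cfs is most likely back already;
        # just skip this round.
        return "skip_round"
    return "stay_inactive"
```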
6 years agomove cfs update to common code
Thomas Lamprecht [Wed, 22 Nov 2017 10:53:11 +0000 (11:53 +0100)]
move cfs update to common code

We updated the CRM and LRM view of the cluster state only in the PVE2
environment, outside of all regression testing and simulation scope.

Further, we ignored if this update failed and happily worked with an
empty state, resulting in strange actions, e.g., the removal of all
(not so) "stale" services or changing the state of all nodes but the
master's to unknown.

This patch tries to improve this by moving the update out into its own
environment method, cluster_update_state, calling this in the LRM and
CRM and saving its result.
As with our introduced functionality to simulate cfs rw or update
errors, we can also simulate failures of this state update with the RT
system.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>