Thomas Lamprecht [Fri, 22 Jan 2016 16:06:42 +0000 (17:06 +0100)]
Move exec_resource_agent from environment classes to LRM
With the changes and preparation work from the previous commits
we can now move the quite important method exec_resource_agent
from the Env classes to the LRM where it gets called.
The main advantage is that it is now covered by regression tests
and that we no longer have two separate methods, which
- does not make sense, as the agents themselves should be
virtualized, not the method executing them
- adds more work, as they must (or at least should) be kept in sync
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Thomas Lamprecht [Fri, 22 Jan 2016 16:06:39 +0000 (17:06 +0100)]
Add virtual resources for tests and simulation
Introduce a base class for Virtual Resources with almost all methods
already implemented.
Also add a class for virtual CTs and VMs, with the primary
distinction that CTs may not migrate online.
The resources are registered in the Hardware class and override
any already registered resources of the same type (e.g. VirtVM
overrides PVEVM) so that the correct plugins are loaded for
regression tests and the simulator.
This paves the way for adding (deterministic) 'malicious'
resources, so we can write tests where, for example, a service fails
a few times to start or migrate.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
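The idea of the commit above can be sketched roughly as follows. This is only an illustration in Python, not the project's Perl code: the registry, the class layout and the FailingStartVM class are hypothetical names chosen for the example; only VirtVM, PVEVM and the "CTs may not migrate online" rule come from the commit message.

```python
# Hypothetical sketch: a type registry where a later registration
# overrides an earlier one, plus a deterministic "malicious" resource.
REGISTRY = {}


def register(type_name, cls):
    """Register cls for type_name, replacing any previous plugin."""
    REGISTRY[type_name] = cls


class VirtResource:
    """Base class with (almost) everything already implemented."""
    can_migrate_online = True

    def __init__(self, sid):
        self.sid = sid
        self.running = False

    def start(self):
        self.running = True
        return True


class VirtVM(VirtResource):
    pass


class VirtCT(VirtResource):
    # Primary distinction named in the commit: no online migration.
    can_migrate_online = False


class FailingStartVM(VirtVM):
    """Assumed example of a 'malicious' resource: the first
    fail_count start attempts fail, later ones succeed."""

    def __init__(self, sid, fail_count):
        super().__init__(sid)
        self.remaining_failures = fail_count

    def start(self):
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            return False
        return super().start()
```

Because the failure count is fixed up front, a test using such a resource always produces the same log, which is what a regression test needs.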
Thomas Lamprecht [Mon, 18 Jan 2016 10:35:20 +0000 (11:35 +0100)]
LRM: release lock also on restart
When restarting the LRM (e.g. on an update) we get a new pid and thus
have to wait for our own lock to time out.
We can (and should) release the lock, as there are either no services
or all services are frozen. If they are frozen, only our LRM may touch
them, so we can unfreeze them faster with this patch.
The expected log of the restart-lrm test does not change much as the
test system does not need to wait for a timeout.
This lets the LRM start working directly after a restart, which is
especially useful on package updates.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Thomas Lamprecht [Mon, 18 Jan 2016 09:26:45 +0000 (10:26 +0100)]
TestHardware: correct shutdown/reboot behaviour of CRM and LRM
Instead of shutting down the LRM and then killing the CRM we now
also make a shutdown request to the CRM, which mirrors real-world
behaviour much better and also lets us test the lock release from
the CRM.
To accomplish this we add new sim_hardware commands for stopping and
starting the CRM.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
The params of resource methods are normally specific for each
resource type.
CTs and VMs have the same interfaces in the needed cases so we could
generate the params in the exec_resource_agent method. This is not
clean because:
* resource specific stuff shouldn't be in this method
* this can cause problems if we want to add another resource type
in the future which has a completely different interface
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Thomas Lamprecht [Mon, 11 Jan 2016 12:20:17 +0000 (13:20 +0100)]
free cmd pointer after its execution
Quoting the asprintf man page:
> [..]
> This pointer should be passed to free(3) to release the allocated
> storage when it is no longer needed.
> [..]
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Thomas Lamprecht [Mon, 11 Jan 2016 12:20:15 +0000 (13:20 +0100)]
small cleanup
Remove the unlink_socket variable and its check, as it was always
true: the error path and the end of the program can only be
reached when the socket is already set up.
Also, unlinking a non-existent file does not result in any error.
Also some whitespace cleanup in the surrounding area.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Thomas Lamprecht [Mon, 11 Jan 2016 12:20:14 +0000 (13:20 +0100)]
remove watchdog-mux.socket
Using a systemd socket unit for the watchdog socket is not
necessary for us; it even causes problems, as the socket is already
active and accepts input when the watchdog-mux daemon itself is not
running. So the LRM/CRM could successfully open and update the
watchdog even if it was not running!
This patch removes the unit file, adds a postinst script which
handles the removal of the links generated by systemd itself,
and also removes the code from watchdog-mux which handled
the systemd socket unit.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Thomas Lamprecht [Tue, 22 Dec 2015 07:52:38 +0000 (08:52 +0100)]
Sim/Env: fix removing service from old node on migration
We only removed the service from the source node on a relocate; we
also want to remove it on a successful migration, otherwise we have
it on two nodes at the same time.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Thomas Lamprecht [Tue, 22 Dec 2015 07:52:35 +0000 (08:52 +0100)]
add service disable/enable to regression tests
Allow execution of user triggered service commands in regression
tests, like enable or disable. This is the test equivalent to a
ha-manager action service:id
command.
Also add a test for a disable/enable cycle.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Thomas Lamprecht [Mon, 21 Dec 2015 09:12:47 +0000 (10:12 +0100)]
check_active_workers: fix typo /uuid/uid/
This typo caused a bug where resource_command_finished was never
called, as $w->{uuid} does not exist and is thus always undefined.
Use the correct $w->{uid} instead.
Also fix a comment which used 'uuid', to avoid confusion.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Thomas Lamprecht [Mon, 14 Dec 2015 14:29:59 +0000 (15:29 +0100)]
HA Env: add 'is_poweroff' function
This function returns true if we do a shutdown/poweroff and thus the
services should not get frozen but fenced if the node does not
come back up fast enough.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Signed-off-by: Dietmar Maurer <dietmar@proxmox.com>
Thomas Lamprecht [Mon, 14 Dec 2015 14:29:53 +0000 (15:29 +0100)]
Hardware: remove unnecessary lock in get_node_info
This lock is not needed as the status gets written with
file_set_contents from PVE::Tools, which is atomic (fuse bug does
not apply here), so we do not need to worry about reading
inconsistent data.
This also prevents a deadlock if any function needs to know if the
simulated node is quorate in a locked context, like
sim_hardware_cmd is.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Thomas Lamprecht [Fri, 30 Oct 2015 09:55:44 +0000 (10:55 +0100)]
HA API: Fix permissions
Integrate permissions into the HA API so that not only root may make
changes.
-) create/edit/update actions need the 'Sys.Console' privilege on
the root (/) path
-) read actions need the 'Sys.Audit' privilege on the root (/) path
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Thomas Lamprecht [Fri, 23 Oct 2015 12:04:25 +0000 (14:04 +0200)]
exec_resource_agent: return valid exit code instead of die's
Switch from die's to logging and returning the respective exit codes.
This adds the possibility to handle (i.e. fix) some errors outside
of the forked exec_resource_agent worker.
This does not change behaviour for now, as the die returned a 255
exit code. We did not check for that exit code explicitly, so it is
safe to use the new exit codes; they result in the same behaviour
for the other code (most importantly the CRM Manager class).
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Thomas Lamprecht [Mon, 12 Oct 2015 13:04:41 +0000 (15:04 +0200)]
check resource better on addition and update
Check if the resource exists in the cluster when adding it to the
HA stack.
When trying to update/migrate or delete a resource, check if it is
HA managed at all.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Thomas Lamprecht [Mon, 28 Sep 2015 09:34:52 +0000 (11:34 +0200)]
delete node from HA stack when deleted from cluster
When a node gets deleted from the cluster with pvecm delnode
we set its node state in the manager status to 'gone'.
When set to gone the manager waits an hour after the node was last
seen online and only then deletes it from the manager status.
When some HA services were forgotten on the node (shouldn't happen
at all!!) the node will be fenced, the service migrated and then its
state reset to 'gone'. After an hour the node will be deleted,
unless it joined the cluster again in the meantime.
Deleting a node from the HA manager status is by no means a final
act; the ha-manager could live without deleting it, but for the user
it is confusing to see dead nodes in the interface.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
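The deletion rule from the commit above can be sketched as follows. This is a simplified Python illustration, not the actual Perl implementation: the function name, the two dictionaries and the use of plain epoch seconds are assumptions made for the example; only the 'gone' state and the one hour wait come from the commit message.

```python
# Hypothetical sketch of the "delete gone nodes after an hour" rule.
GONE_TIMEOUT = 60 * 60  # one hour, in seconds


def prune_gone_nodes(node_status, last_online, now):
    """Drop nodes that are marked 'gone' and were last seen online
    more than GONE_TIMEOUT seconds before 'now'.

    node_status: dict node name -> state string (assumed layout)
    last_online: dict node name -> epoch seconds (assumed layout)
    """
    for node in list(node_status):
        if node_status[node] != 'gone':
            # A node that rejoined the cluster is no longer 'gone'
            # and therefore is kept.
            continue
        if now - last_online[node] > GONE_TIMEOUT:
            del node_status[node]
    return node_status
```

A node that is fenced first simply reaches the 'gone' state later; the same check then applies to it.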
Groups: correctly set optional flag in propertyList
Only group and type should be required, all other properties should
be marked optional inside propertyList. We can set the correct values
for the optional flag inside options().
Thomas Lamprecht [Wed, 16 Sep 2015 09:25:18 +0000 (11:25 +0200)]
fix includes from services
The crm and lrm daemon executables need to include SafeSyslog, as
they use syslog in their signal handler.
It is, however, no longer needed in the Service class of the daemons.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Thomas Lamprecht [Wed, 16 Sep 2015 09:25:15 +0000 (11:25 +0200)]
implement recovery policy for services
We implement recovery policies which use settings known from
rgmanager. The behaviour is not strictly the same, however;
our approach is more configurable. For example, rgmanager cannot
combine its restart and relocate policy.
The following policy settings kick in on a failed service start:
* max_restart: maximal number of tries to restart a failed service
on the current node. The default is 1 restart try.
This policy gets enforced by the LRM.
* max_relocate: maximal number of tries to relocate the service to
a different node. A relocate only takes place after
the max_restart value is exceeded on the current node.
This policy gets enforced by the CRM.
If a service is still not running after all max tries, its state
gets set to 'error'. This means that the service needs to be checked
and disabled manually.
*Note* that the relocate state will only reset when the service had
at least one successful start. That means if a service is reenabled
without fixing the error only the restart policy gets repeated.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
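How the two settings from the commit above interact can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the actual LRM/CRM code: the function name, the try_start callback, the round-robin choice of the next node and the assumption that every node gets one start plus max_restart restart tries are all hypothetical. The reset of the relocate counter after a successful start is not modelled.

```python
# Hypothetical sketch of the restart/relocate recovery policy.
def recover(try_start, nodes, max_relocate, max_restart=1):
    """Try to get a service running.

    try_start:    callable(node) -> bool, True if the start succeeded
    nodes:        candidate nodes, tried in order (assumption)
    max_relocate: relocation tries (CRM side in the real stack)
    max_restart:  restart tries per node (LRM side), default 1
    Returns a (state, node) tuple.
    """
    relocations = 0
    index = 0
    while True:
        node = nodes[index % len(nodes)]
        # One initial start plus up to max_restart restarts here.
        for _ in range(1 + max_restart):
            if try_start(node):
                return ("started", node)
        # Restart budget on this node is used up.
        if relocations >= max_relocate:
            return ("error", node)
        relocations += 1
        index += 1
```

With max_restart=1 and max_relocate=1 a service that never starts is tried twice on the first node and twice on the second one before it ends up in 'error'.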