Wolfgang Link [Wed, 9 May 2018 12:48:24 +0000 (14:48 +0200)]
Get snapshots when no state is available.
With this patch we can restore the state of a stateless job.
It may happen that multiple replication snapshots exist,
because without a known job state we cannot delete any snapshot.
This occurs when a node fails in the middle of a replication
and the VM is then moved to another node.
That's why we have to test whether we have a common base snapshot on both nodes.
If we do, we take that snapshot as the replication state.
Once we have a state again, the remaining snapshots can be deleted on the next run.
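A rough sketch of that lookup (all names below are illustrative, not the actual code):

    # pick the newest replication snapshot present on both nodes and treat it
    # as the replication state; leftover snapshots are removed on a later run
    sub find_common_base {
        my ($local_snaps, $remote_snaps) = @_;   # hashrefs: snapname => timestamp

        my $best;
        for my $snap (keys %$local_snaps) {
            next if !exists $remote_snaps->{$snap};
            $best = $snap if !defined($best)
                || $local_snaps->{$snap} > $local_snaps->{$best};
        }
        return $best;   # undef => no common base, a full sync is needed
    }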
Wolfgang Link [Wed, 9 May 2018 12:48:23 +0000 (14:48 +0200)]
Delete replication snapshots only if last_sync is not 0.
If last_sync is 0, the VM configuration has been stolen
(either manually or by HA restoration).
Under this condition, the replication snapshot should not be deleted.
This snapshot is used to restore replication state.
If last_sync is greater than 0 and does not match the snapshot name,
it must be a remnant of an earlier sync and should be deleted.
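Roughly, the guard looks like this (sub and variable names are made up for illustration):

    sub cleanup_stale_snapshots {
        my ($last_sync, $expected_snap, $snapshots) = @_;

        # last_sync == 0 means the config was stolen; keep the snapshots so the
        # replication state can be restored from them
        return if $last_sync == 0;

        for my $snap (@$snapshots) {
            next if $snap eq $expected_snap;   # current state snapshot, keep it
            print "would remove stale snapshot '$snap'\n";
        }
    }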
Wolfgang Link [Wed, 9 May 2018 12:48:21 +0000 (14:48 +0200)]
Cleanup for stateless jobs.
If a VM configuration has been manually moved or recovered by HA,
there is no job state on the new node.
In this case, the replication snapshots still exist on the remote side.
It must be possible to remove a job without state,
otherwise a new replication job on the same remote node will fail
and the disks will have to be removed manually.
By iterating over the sorted_volumes generated from the VMID.conf,
we can be sure that every disk will be removed in the event
of a complete job removal on the remote side.
Finally, remote_prepare_local_job calls prepare on the remote side.
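In rough pseudo-Perl (illustrative names only):

    # derive the volume list from the guest config instead of the (missing)
    # job state, so every replicated disk is cleaned up on the remote node
    sub cleanup_stateless_job {
        my ($conf_volumes) = @_;             # volume IDs parsed from VMID.conf

        for my $volid (sort @$conf_volumes) {
            print "remove replication snapshots of '$volid' on the remote node\n";
        }
    }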
Wolfgang Link [Fri, 13 Apr 2018 10:24:39 +0000 (12:24 +0200)]
fix #1694: make failure of snapshot removal non-fatal
In certain high-load scenarios ANY ZFS operation can block,
including registering an (async) destroy.
Since ZFS operations are implemented via ioctls,
killing the user space process
does not affect the waiting kernel thread processing the ioctl.
Once "zfs destroy" has been called, killing the process tells us nothing
about whether the destroy operation will be aborted or not.
Since running into a timeout effectively means killing it,
we don't know whether the snapshot exists afterwards or not.
We also don't know how long it takes for ZFS to catch up on pending ioctls.
Given the above problem, we must not die when deleting a no longer
needed snapshot fails (due to a timeout) after an otherwise
successful replication. Since we retry on the next run anyway, this is
not problematic.
The snapshot deletion error will be logged in the replication log
and the syslog/journal.
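Conceptually the cleanup now behaves like this sketch (helper and variable names are assumptions, not the real code):

    sub remove_snapshot_nonfatal {
        my ($logfunc, $remove_snapshot, $snapname) = @_;

        eval { $remove_snapshot->($snapname) };
        if (my $err = $@) {
            # non-fatal: log and continue; the next replication run
            # retries the cleanup anyway
            $logfunc->("WARN: unable to remove snapshot '$snapname' - $err");
        }
    }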
Thomas Lamprecht [Thu, 14 Dec 2017 06:58:36 +0000 (07:58 +0100)]
vzdump: add common log sub-method
Add a general log method here which supports passing on the "log to
syslog too" functionality and makes it clearer what each
parameter of logerr and loginfo means.
Further, we can now also log with a 'warn' level, which can be
useful to notify a backup user of a possible problem which isn't an
error per se, but may need the user's attention.
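Such a helper could look roughly like the following sketch (sub name, parameters and the plain Sys::Syslog usage are assumptions, not the actual code):

    use Sys::Syslog qw(openlog syslog);

    openlog('vzdump-sketch', 'pid', 'daemon');

    # $level: 'info', 'warn' or 'err'; $to_syslog additionally forwards the line
    sub logmsg {
        my ($level, $msg, $to_syslog) = @_;

        chomp $msg;
        print STDERR uc($level) . ": $msg\n";
        syslog($level eq 'warn' ? 'warning' : $level, "%s", $msg) if $to_syslog;
    }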
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Thomas Lamprecht [Wed, 13 Sep 2017 08:30:14 +0000 (10:30 +0200)]
VZDump/Plugin: avoid cyclic dependency
pve-guest-common is above qemu-server, pve-container and thus also
pve-manager in the package hierarchy.
The latter hosts PVE::VZDump, so using it here adds a cyclic
dependency between pve-manager and pve-guest-common.
Move the log method to the base plugin class and inline the
run_command function directly into the plugin's cmd method.
pve-manager's PVE::VZDump can then use this plugin's static log
function instead of its own copy.
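A rough sketch of the resulting structure (package and sub names are stand-ins, not the real ones):

    package VZDumpPluginSketch;

    use strict;
    use warnings;

    # static log function living in the base plugin class, callable from
    # pve-manager's PVE::VZDump without a dependency the other way round
    sub logmsg {
        my ($level, $msg) = @_;
        chomp $msg;
        print STDERR uc($level) . ": $msg\n";
    }

    # cmd runs the command itself instead of going through PVE::VZDump,
    # which lives in pve-manager
    sub cmd {
        my ($self, $cmdstr) = @_;
        logmsg('info', "running: $cmdstr");
        system($cmdstr) == 0 or logmsg('err', "command '$cmdstr' failed");
    }

    1;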
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
AbstractMigrate: do not overwrite global signal handlers
perl's 'local' must either be used in front of each $SIG{...}
assignment or the assignments must be put in a list, else it affects only the
first variable and the rest are *not* in local context.
This may cause weird behaviour where daemons seemingly do not get
terminating signals delivered correctly and thus may not shut down
gracefully anymore.
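For illustration (signal names chosen arbitrarily):

    # broken: 'local' only applies to $SIG{INT}; the TERM/QUIT/HUP handlers
    # of the daemon are overwritten permanently
    local $SIG{INT} = $SIG{TERM} = $SIG{QUIT} = $SIG{HUP} = sub { die "interrupted\n" };

    # correct: localize each assignment ...
    local $SIG{INT}  = sub { die "interrupted\n" };
    local $SIG{TERM} = sub { die "interrupted\n" };

    # ... or localize them all in one list assignment
    my $handler = sub { die "interrupted\n" };
    local ($SIG{INT}, $SIG{TERM}, $SIG{QUIT}, $SIG{HUP}) = ($handler) x 4;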
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
replication: don't sync to offline targets on error states
There's no point in trying to replicate to a target node
which is offline. Note that if we're not already in an
error state we do still give it a try in order for this to
get logged as an error at least once.
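Roughly the intended behaviour, as a sketch with made-up names:

    # skip the sync only when the job is already in an error state *and*
    # the target node is offline; otherwise still try once so the failure
    # gets logged as an error
    sub should_skip_sync {
        my ($jobcfg, $online_nodes) = @_;

        my $target_online = grep { $_ eq $jobcfg->{target} } @$online_nodes;
        return 1 if $jobcfg->{error} && !$target_online;
        return 0;
    }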