Fabian Ebner [Tue, 4 May 2021 09:52:54 +0000 (11:52 +0200)]
lvm: volume import: handle worker returned by free_image
only affects LVM storages with 'saferemove 1' where the import fails at a rather
advanced stage. Previously in such cases, the renamed (by free_image) volume
del-vm-XYZ-disk-N would be left over.
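For illustration, a minimal sketch of the caller-side handling (variable and sub names here are hypothetical, not the actual Storage.pm code):

sub my_vdisk_free { # illustrative only
    my ($plugin, $storeid, $scfg, $volname) = @_;
    # plugins may return a worker sub from free_image instead of doing
    # all cleanup synchronously
    my $cleanup_worker = $plugin->free_image($storeid, $scfg, $volname, 0);
    # with 'saferemove 1', the LVM plugin first renames the volume to
    # del-vm-XYZ-disk-N and returns a sub that zeroes and removes it;
    # not running that sub leaves the renamed volume behind
    $cleanup_worker->() if defined($cleanup_worker);
}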
Fabian Ebner [Tue, 4 May 2021 09:52:53 +0000 (11:52 +0200)]
pbs: free image: explicitly return undef
Storage.pm's vdisk_free interprets truthy return values as worker subs, so be
explicit about returning undef here. Not an issue at the moment, because
run_client_command already returns undef, but better be safe than sorry.
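A minimal sketch of why the explicit return matters (the argument list passed to run_client_command here is hypothetical):

sub free_image { # illustrative shape only
    my ($class, $storeid, $scfg, $volname) = @_;
    # delete the backup snapshot via the PBS client; happens to return
    # undef today, but that is an implementation detail
    run_client_command($scfg, $storeid, 'forget', $volname); # hypothetical arguments
    return undef; # explicit: no cleanup worker for vdisk_free to run
}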
The latter commit suggests switching to mount.fuse.ceph for the '_netdev'
option, but it doesn't seem to work:
root@pve701 / # mount -t fuse.ceph 10.10.10.11,10.10.10.12,10.10.10.13:/ /mnttest/fuse -o 'ceph.id=admin,ceph.keyfile=/etc/pve/priv/ceph/cephfs.secret,ceph.conf=/etc/pve/ceph.conf,_netdev'
ceph-fuse[20729]: starting ceph client
2021-06-15T14:22:00.631+0200 7f995f878080 -1 init, newargv = 0x55e09fc11a40 newargc=11
ceph-fuse[20729]: starting fuse
root@pve701 / # mount -t ceph 10.10.10.11,10.10.10.12,10.10.10.13:/ /mnttest/normal -o 'name=admin,secretfile=/etc/pve/priv/ceph/cephfs.secret,conf=/etc/pve/ceph.conf,_netdev'
root@pve701 / # mount | grep mnttest
ceph-fuse on /mnttest/fuse type fuse.ceph-fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
10.10.10.11,10.10.10.12,10.10.10.13:/ on /mnttest/normal type ceph (rw,relatime,name=admin,secret=<hidden>,acl,_netdev)
Also, the return value is not propagated by mount.fuse.ceph, meaning the output
would need to be parsed...
root@pve701 ~ # mount -t fuse.ceph 10.10.10.11,10.10.10.12,10.10.10.13:/ /mnttest/fuse -o 'ceph.id=admin,ceph.keyfile=/etc/pve/priv/ceph/cephfs.secret,ceph.conf=/etc/pve/ceph.conf,_netdev'
2021-06-15T14:42:56.326+0200 7f634edae080 -1 init, newargv = 0x560cdb5e0a40 newargc=11
ceph-fuse[34480]: starting ceph client
fuse: mountpoint is not empty
fuse: if you are sure this is safe, use the 'nonempty' mount option
ceph-fuse[34480]: fuse failed to start
2021-06-15T14:42:56.338+0200 7f634edae080 -1
fuse_mount(mountpoint=/mnttest/fuse) failed.
Mount failed with status code: 5
root@pve701 ~ # echo $?
0
Fabian Ebner [Wed, 16 Jun 2021 07:26:59 +0000 (09:26 +0200)]
cephfs: revert safe-guard check for Luminous
It's necessary to be on Nautilus before upgrading to 7.x, so the check is no
longer needed. See commit e54c3e334760491954bc42f3585a8b5b136d4b1d. The revert
did not apply cleanly, because of cleanups made afterwards.
Fabian Ebner [Wed, 16 Jun 2021 07:26:58 +0000 (09:26 +0200)]
config: add backup content type to default local storage
which is used if there is no ('dir'-type) 'local' entry. Storage configurations
made by the installer also support backups for the 'local' storage, and the
'prune-backups' parameter is not really useful otherwise.
Fabian Ebner [Wed, 16 Jun 2021 07:26:57 +0000 (09:26 +0200)]
config: mention that maxfiles is deprecated
Don't add an explicit deprecation warning on parsing (yet); this is already
done in the pve6to7 script. Also, automatic conversion to 'prune-backups'
happens when the section config is read, so over time fewer users should be
affected.
Postpone explicit warning/dropping the parameter to a future major release.
Also switch the setting for the default 'local' storage to 'prune-backups'.
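A sketch of the conversion on config read, under the assumption that 'maxfiles N' maps to 'keep-last=N' and 'maxfiles 0' (unlimited) maps to 'keep-all' ($scfg is the parsed section config; the real code may differ in detail):

# assumed mapping, for illustration only
my $maxfiles = delete $scfg->{maxfiles};
if (defined($maxfiles) && !defined($scfg->{'prune-backups'})) {
    $scfg->{'prune-backups'} = $maxfiles
        ? { 'keep-last' => $maxfiles }
        : { 'keep-all' => 1 };
}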
Try to detect active mounts and holders early, because it's cheap. The wipefs
command in the worker will detect even more situations where wiping alone is
not enough for the device to show up as unused, or could otherwise be
problematic.
based on the wipe_disks method from pve-manager's Ceph/Tools.pm with the
following main differences:
* use wipefs to wipe labels first (to avoid sgdisk complaining about the
backed up GPT structure on a subsequent GPT initialization)
* only take one device as an argument
* do not use an absolute path for 'dd'
* die if one of the commands fails
The wipefs command makes checks and complains about e.g. mounted or active
devices. One could supply --force to wipefs, but in many such situations it
does not work as expected, because the device would still be detected as in-use
afterwards, and further manual steps would be needed.
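A rough sketch of the wiping sequence described above, assuming PVE::Tools::run_command; the zeroed size and exact arguments are assumptions, not the verbatim implementation:

use PVE::Tools qw(run_command);

sub wipe_disk { # illustrative only
    my ($dev) = @_;
    die "invalid device '$dev'\n" if $dev !~ m|^/dev/|;
    # clear filesystem/RAID/partition-table signatures first, so a later
    # GPT initialization via sgdisk doesn't complain about the backup GPT
    run_command(['wipefs', '--all', $dev], errmsg => "wipefs on '$dev' failed");
    # zero the beginning of the device; note: no absolute path for 'dd'
    run_command(
        ['dd', 'if=/dev/zero', "of=$dev", 'bs=1M', 'count=200', 'conv=fdatasync'],
        errmsg => "dd on '$dev' failed",
    );
}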
Fabian Ebner [Thu, 4 Feb 2021 10:26:06 +0000 (11:26 +0100)]
clone image: specify base format option with qemu-img
and avoid a warning. It is deprecated to auto-detect the format of the base
volume. See commit d9f059aa6cfccefaffa3532556e966df4a99ece2 in qemu for more
information.
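For illustration, the resulting kind of invocation (paths and variables are placeholders; -b and -F are the real qemu-img options for the backing file and its format):

# pass the base image's format explicitly via -F instead of letting
# qemu-img probe it (probing the backing format is deprecated)
run_command([
    'qemu-img', 'create', '-f', $clone_format,
    '-b', $base_path, '-F', $base_format,
    $clone_path,
]);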
instead of just the snapshot for consistency with other API endpoints,
and possible future extension to VMA backups (where 'snapshot' would be
a rather strange terminology).
add some additional checks (pbs storage type, backup volume type),
completion and magic (allow passing in either a full volume ID with
correct storage, or just the volume name, or just the snapshot for
easier API/CLI usage/convenience).
Not replacing it with return, because the current behavior is dying:
Can't "next" outside a loop block
and the single existing caller in pve-manager's API2/Ceph/OSD.pm does not check
the return value.
Also check for $st, which can be undefined in case a non-existing path was
provided. This also led to dying previously:
Can't call method "mode" on an undefined value
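A minimal sketch of the hardened checks; the actual code differs, but the shape is:

use File::stat ();
use Fcntl ':mode';

my $st = File::stat::stat($path);
# a non-existing path used to leave $st undefined, and $st->mode then
# died with: Can't call method "mode" on an undefined value
die "unable to stat '$path'\n" if !$st;
# a bare 'next' here used to die with: Can't "next" outside a loop block
die "'$path' is not a block device\n" if !S_ISBLK($st->mode);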
diskmanage: improve setting usage for whole disk with include-partitions
in case a disk with partitions also has an fstype set, which happens for our ZFS
boot disks. Do not change the behavior without include-partitions, as we
prefer(red) to be more specific than simply 'partitions' then.
Currently, it's possible to break replication by:
1. have an existing snapshot whose name contains an uppercase letter
2. set up a replication job and run it
3. rollback to the existing snapshot
4. replicate again -> fails
The failure occurs because, after step 3, the most recent common snapshot is
the previously existing one, and currently no uppercase letters are allowed for
export/import.
The pve-snapshot-name option uses the CONFIGID_RE
qr/[a-z][a-z0-9_-]+/i
so it cannot be used here, because it would not allow for e.g. '__migrate__'.
Simply allow uppercase letters, to be backwards compatible and allow all
possible pve-snapshot-name values.
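For illustration only, an assumed before/after shape of the check (the actual expression in the code may differ):

# before (assumed): lowercase-only, rejecting an existing snapshot with
# uppercase letters as the most recent common replication snapshot
die "invalid snapshot name\n" if $snap !~ m/^[a-z0-9_-]+$/;
# after: also allow uppercase, covering all pve-snapshot-name values
# as well as internal names like '__migrate__'
die "invalid snapshot name\n" if $snap !~ m/^[a-zA-Z0-9_-]+$/;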
There is still an issue if there was also a state volume, but that's a
different bug[1].
fix #3345: zfs: restore container volume to ZFS with size 0
A restore to ZFS for a container which has a volume (rootfs / mount
point) of size 0 failed because the refquota property does not accept
'0k' but wants 'none' in that situation.
Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
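A minimal sketch of the fix, assuming $size holds the volume size in KiB (variable names are hypothetical):

# refquota rejects '0k' for a size of 0 and wants 'none' instead
my $refquota = $size ? "${size}k" : 'none';
run_command(['zfs', 'set', "refquota=$refquota", "$scfg->{pool}/$volname"]);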
This patch introduces support for Ceph's RBD namespaces.
A new storage config parameter 'namespace' defines the namespace to be
used for the RBD storage.
The namespace must already exist in the Ceph cluster as it is not
automatically created.
The main intention is to use this for external Ceph clusters. With
namespaces, each PVE cluster can get its own namespace and will not
conflict with other PVE clusters.
The <pool>/<image> paths are needed in quite a lot of places. Having one
single place where they are created helps to reduce duplicate code and
makes it easier to introduce new features.
The 'add_pool_to_disk' sub was already doing that but the name was not
really fitting. This commit renames it to the more general
'get_rbd_path' and changes the second parameter to the more widely used
$volume instead of $disk.
Furthermore, all occurrences where "$pool/$volume" was concatenated have been
replaced with a call to get_rbd_path.
Plus some minor code style cleanups for long function calls that were
touched.
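A sketch of the helper's shape as described, producing a '<pool>[/<namespace>]/<image>' path; illustrative, not the verbatim implementation:

sub get_rbd_path {
    my ($scfg, $volume) = @_;
    my $path = $scfg->{pool} // 'rbd'; # assumed default pool name
    $path .= "/$scfg->{namespace}" if defined($scfg->{namespace});
    $path .= "/$volume" if defined($volume);
    return $path;
}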
by relying on archive_info's vmid first. archive_info is already used to
determine if it's a standard name, and in that case the vmid is certainly set.
Also add asserts to make sure we got what we expected.
The mentioned commit is actually a backwards-incompatible change that leads to
slightly different behavior when migrating a VM with volumes on a misconfigured
storage. For example, unreferenced volumes on a misconfigured storage won't be
picked up, even though they were before. And for referenced volumes on a
misconfigured storage, the disk size would not be updated on migration anymore.
We should wait until the next major release for this change and then also
re-evaluate the migration behavior with misconfigured disks.
Fabian Ebner [Fri, 12 Mar 2021 09:50:26 +0000 (10:50 +0100)]
vdisk list: only collect images from storages with an appropriate content type
Only these storages are activated in the first place, and it's bad behavior to
list images when no appropriate content type is set.
For example, on VM destruction, this avoids unreferenced images being deleted
from a storage with only the 'backup' content type set, which is supposedly
what happened in this[0] forum thread.
(Some) callers expect all keys to be present and valid array references in the
result, so initialization is needed.
Now, the enabled check is already done by the preceding code for every element
that is iterated over, and thus isn't needed in the main loop anymore.
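For illustration, a minimal sketch of the initialization (variable names are hypothetical):

# pre-initialize the result so that every queried storage ID maps to a
# valid (possibly empty) array reference, even if nothing is found for it
my $res = {};
$res->{$_} //= [] for @$storeid_list;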
Fabian Ebner [Wed, 10 Mar 2021 09:26:27 +0000 (10:26 +0100)]
api: disk list: allow if an audit permission for the node is present
as that seems to be the more natural permission path for listing a node's local
disks. For backwards compatibility, the old permission check has to be kept
(relevant with propagate=0).
This API call was originally part of the Ceph API and got copied here later,
which might explain the current permission check.
In the UI, the Disk panel is visible with a node audit permission, but the API
call itself failed without the '/' audit permission.
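A sketch of what the widened permission definition could look like as an excerpt of an API method definition; the 'or'/'perm' check syntax exists in the PVE API schema, but the exact privilege sets here are assumptions:

permissions => {
    check => ['or',
        ['perm', '/', ['Sys.Audit', 'Datastore.Audit'], any => 1],
        ['perm', '/nodes/{node}', ['Sys.Audit']],
    ],
},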
Fabian Ebner [Thu, 11 Feb 2021 10:24:13 +0000 (11:24 +0100)]
storage migration: insecure: improve logging
by including the message/error from the remote side. Some people on the forum[0]
ran into 'no tunnel IP received', but without information from the remote side
it's hard to tell why.
Thomas Lamprecht [Fri, 19 Feb 2021 14:01:36 +0000 (15:01 +0100)]
zpool: avoid wrong mount-decoding of dataset
this was mistakenly done because the procfs code uses it, and it was
assumed that we need to decode this value too, to get both into the same
encoding space for a correct comparison.
But only procfs has that encoding; we don't have it for pool values
in the storage config, so we must not decode that value, as doing so
could potentially break things.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Stoiko Ivanov [Fri, 19 Feb 2021 12:45:44 +0000 (13:45 +0100)]
zfspoolplugin: check if imported before importing
This commit is a small performance optimization to the previous one:
`zpool list` is cheaper than `zpool import -d /dev..` (the latter
scans the disks in the provided directory for zfs signatures,
unconditionally)
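A sketch of the cheaper check, assuming PVE::Tools::run_command is imported:

# `zpool list <pool>` exits with an error if the pool is not imported,
# without scanning any devices for ZFS signatures
my $imported = eval {
    run_command(
        ['zpool', 'list', '-H', '-o', 'name', $pool],
        outfunc => sub {}, errfunc => sub {},
    );
    1;
};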
Stoiko Ivanov [Fri, 19 Feb 2021 12:45:43 +0000 (13:45 +0100)]
zfspoolplugin: check if mounted instead of imported
This patch addresses an issue we recently saw on a production machine:
* after booting a ZFS pool failed to get imported (due to an empty
/etc/zfs/zpool.cache)
* pvestatd/guest-startall eventually tried to import the pool
* the pool was imported, yet the datasets of the pool remained unmounted
A bit of debugging showed that `zpool import <poolname>` is not
atomic, in fact it does fork+exec `mount` with appropriate parameters.
If an import ran longer than the hardcoded timeout of 15s, it could
happen that the pool got imported, but the zpool command (and its
forks) got terminated due to timing out.
Reproducing this is straightforward by setting (drastic) bw+iops
limits on a guest's disk (which contains a zpool), e.g.:
`qm set 100 -scsi1 wd:vm-100-disk-1,iops_rd=10,iops_rd_max=20,\
iops_wr=15,iops_wr_max=20,mbps_rd=10,mbps_rd_max=15,mbps_wr=10,\
mbps_wr_max=15`
Afterwards, running `timeout 15 zpool import <poolname>` resulted in
that situation in the guest on my machine.
The patch changes the check in activate_storage for the ZFSPoolPlugin,
to check if any dataset below the 'pool' (which can also be a sub-dataset)
is mounted by parsing /proc/mounts:
* this is cheaper than running `zfs get` or `zpool list`
* it catches a properly imported and mounted pool in case the
root-dataset has 'canmount' set to off (or noauto), as long
as any dataset below is mounted
After trying to import the pool, we also run `zfs mount -a` (in case
another check of /proc/mounts fails).
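A rough sketch of the /proc/mounts check described above (simplified; the real code differs in detail):

sub zfs_any_dataset_mounted { # illustrative helper
    my ($pool) = @_;
    open(my $fh, '<', '/proc/mounts') or die "unable to read /proc/mounts\n";
    while (defined(my $line = <$fh>)) {
        my ($source, $mountpoint, $fstype) = split(/\s+/, $line);
        # matches the pool's root dataset or any dataset below it
        return 1 if $fstype eq 'zfs'
            && ($source eq $pool || $source =~ m!^\Q$pool\E/!);
    }
    return 0;
}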
Potential for regression:
* running `zfs mount -a` is problematic, if a dataset is manually
unmounted after booting (without setting 'canmount')
* a pool without any mounted dataset (no mountpoint property set and
only zvols) will result in repeated calls to `zfs mount -a`
Both of the above seem unlikely and should not occur when using our
tooling.
Fabian Ebner [Fri, 15 Jan 2021 10:58:05 +0000 (11:58 +0100)]
remove lock from is_base_and_used check
and squash the __no_lock-variant into it.
This lock is not broad enough, because for a caller that plans to do or not do
some storage operation based on the result of the check, the following could
happen:
1. volume_is_base_and_used is called and the result is used to enter a branch
2. situation on the storage changes in the meantime
3. the branch chosen in 1. might not be the one that should be taken anymore
This means that callers are responsible for locking, and luckily the existing
callers do use their own locks already:
1. vdisk_free used the __no_lock-variant with a broader lock also covering
the free operation.
2. vdisk_clone is not a caller, but is relevant, and it does lock the storage.
3. the calls during VM migration and VM destruction happen in the context of a
locked VM config. Because the clone operation also locks the VM config, it
cannot happen that a linked clone is created while the template VM is
migrated away or destroyed, or vice versa. And even if that were the case,
the base disk would not be freed, because of what vdisk_free/vdisk_clone do.
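A sketch of the resulting caller responsibility, assuming the plugin's cluster_lock_storage helper; argument details are assumptions:

# check and act under a single storage lock, so the situation cannot
# change between the check (1.) and the chosen branch (3.)
$plugin->cluster_lock_storage($storeid, $scfg->{shared}, undef, sub {
    die "base volume '$volid' is still in use\n"
        if PVE::Storage::volume_is_base_and_used($cfg, $volid);
    # ... perform the actual operation while still holding the lock ...
});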
Fabian Ebner [Tue, 26 Jan 2021 11:45:27 +0000 (12:45 +0100)]
Diskmanage: also include partitions with get_disks if flag is set
and have a parent key for partitions, to be able to see the associated disk in
the result without having to rely on naming heuristics (just adding a number at
the end doesn't work for NVMes).
The disk's usage will not be based on the partitions usage if the flag is set,
but will simply be 'partitions'.
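An illustrative (assumed) excerpt of the result shape with the flag set; key names are hypothetical:

my $disks = {
    'nvme0n1' => {
        # with the flag set, usage is simply 'partitions' rather than
        # being derived from the partitions' usage
        usage => 'partitions',
    },
    'nvme0n1p1' => {
        # explicit link to the disk; appending a number to the disk name
        # would be wrong for NVMe ('nvme0n1' . '1' is not 'nvme0n1p1')
        parent => '/dev/nvme0n1',
    },
};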
Fabian Ebner [Tue, 26 Jan 2021 11:45:25 +0000 (12:45 +0100)]
Diskmanage: introduce ceph info helper
so it can be re-used for partitions.
Also changes the regular expression in get_ceph_volume_info to match the full
device/partition name the LV is on. Not only is this needed for partitions,
especially if there are multiple partitions with an OSD, but it also fixes
handling NVMe devices with an OSD as a side effect. Previously, those were not
detected here because of the digits in the name, e.g. /dev/nvme0n1.
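An illustrative before/after of the kind of pattern change described (the actual expressions in the code differ):

# before (illustrative): stopping at the first digit to find the device
# name breaks for NVMe, where /dev/nvme0n1 itself contains digits
my ($dev) = $lv_device =~ m!^/dev/([a-z]+)!;
# after (illustrative): capture the full device/partition name instead
my ($dev_full) = $lv_device =~ m!^/dev/(\S+)$!;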