]> git.proxmox.com Git - mirror_zfs.git/log
mirror_zfs.git
9 months agocopy_file_range: fix fallback when source create on same txg
Rob N [Tue, 15 Aug 2023 00:34:14 +0000 (10:34 +1000)]
copy_file_range: fix fallback when source create on same txg

In 019dea0a5 we removed the conversion from EAGAIN->EXDEV inside
zfs_clone_range(), but forgot to add a test for EAGAIN to the
copy_file_range() entry points to trigger fallback to a content copy.

This commit fixes that.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #15170
Closes #15172

9 months agoMove zinject from openzfs-zfs-test to openzfs-zfsutils
Umer Saleem [Tue, 8 Aug 2023 16:40:36 +0000 (21:40 +0500)]
Move zinject from openzfs-zfs-test to openzfs-zfsutils

For Native Debian packaging, zinject binary and man page is
packaged in ZFS test package. zinject is not not directly related
to ZTS and should be packaged with other utilities, like it is
present in zfs_<ver>.rpm/deb packages.

This commit moves zinject binary and man page from openzfs-zfs-test
to openzfs-zfsutils package.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #15160

9 months agodracut: support mountpoint=legacy for root dataset
Rafael Kitover [Tue, 8 Aug 2023 16:38:34 +0000 (16:38 +0000)]
dracut: support mountpoint=legacy for root dataset

Support mountpoint=legacy for the root dataset in the dracut zfs support
scripts.

mountpoint=/ or mountpoint=/sysroot also works.

Change zfs-env-bootfs.service to add zfsutil to BOOTFSFLAGS only for
root datasets with mountpoint != legacy.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Signed-off-by: Rafael Kitover <rkitover@gmail.com>
Closes #15149

9 months agozfs_clone_range should return a descriptive error codes
oromenahar [Tue, 8 Aug 2023 16:37:06 +0000 (18:37 +0200)]
zfs_clone_range should return a descriptive error codes

Return the more descriptive error codes instead of `EXDEV` when
the parameters don't match the requirements of the clone function.
Updated the comments in `brt.c` accordingly.
The first three errors are just invalid parameters, which zfs can
not handle.
The fourth error indicates that the block which should be cloned
is created and cloned or modified in the same transaction
group (`txg`).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Kay Pedersen <mail@mkwg.de>
Closes #15148

9 months agolibzfs: sendrecv: send_progress_thread: handle SIGINFO/SIGUSR1
наб [Tue, 8 Aug 2023 16:35:35 +0000 (18:35 +0200)]
libzfs: sendrecv: send_progress_thread: handle SIGINFO/SIGUSR1

POSIX timers target the process, not the thread (as does SIGINFO),
so we need to block it in the main thread which will die if interrupted.

Ref: https://101010.pl/@ed1conf@bsd.network/110731819189629373
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #15113

9 months agolinux/spl/kmem_cache: undefine `kmem_cache_alloc` before defining it
Ryan Lahfa [Mon, 7 Aug 2023 20:55:59 +0000 (22:55 +0200)]
linux/spl/kmem_cache: undefine `kmem_cache_alloc` before defining it

When compiling a kernel with bcachefs and zfs,
the two macros will collide, making it impossible
to have both filesystems.

It is sufficient to just undefine the macro before calling it.

On why this should be in ZFS rather than bcachefs, currently,
bcachefs is not a in-tree filesystem, but,
it has a reasonably high chance of getting included soon.

This avoids the breakage in ZFS early,
this patch may be distributed downstream in NixOS
and is already used there.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Lahfa <ryan@lahfa.xyz>
Closes #15144

9 months agoFix some typos
Mateusz Piotrowski [Mon, 7 Aug 2023 20:53:59 +0000 (22:53 +0200)]
Fix some typos

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mateusz Piotrowski <0mp@FreeBSD.org>
Closes #15141

9 months agoZIL: Second attempt to reduce scope of zl_issuer_lock.
Alexander Motin [Fri, 25 Aug 2023 00:08:49 +0000 (20:08 -0400)]
ZIL: Second attempt to reduce scope of zl_issuer_lock.

The previous patch #14841 appeared to have significant flaw, causing
deadlocks if zl_get_data callback got blocked waiting for TXG sync.  I
already handled some of such cases in the original patch, but issue
 #14982 shown cases that were impossible to solve in that design.

This patch fixes the problem by postponing log blocks allocation till
the very end, just before the zios issue, leaving nothing blocking after
that point to cause deadlocks.  Before that point though any sleeps are
now allowed, not causing sync thread blockage.  This require slightly
more complicated lwb state machine to allocate blocks and issue zios
in proper order.  But with removal of special early issue workarounds
the new code is much cleaner now, and should even be more efficient.

Since this patch uses null zios between write, I've found that null
zios do not wait for logical children ready status in zio_ready(),
that makes parent write to proceed prematurely, producing incorrect
log blocks.  Added ZIO_CHILD_LOGICAL_BIT to zio_wait_for_children()
fixes it.

Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15122

9 months agoZIL: Replay blocks without next block pointer.
Alexander Motin [Fri, 11 Aug 2023 16:04:44 +0000 (12:04 -0400)]
ZIL: Replay blocks without next block pointer.

If we get next block allocation error during log write, we trigger
transaction commit.  But the block we have just completed is still
written and transactions it covers will be acknowledged normally.
If after that we ignore the block during replay just because it is
the last in the chain, we may not replay some transactions that we
have acknowledged as synced, that is not right.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15132

9 months agoZIL: Avoid dbuf_read() before dmu_sync().
Alexander Motin [Fri, 11 Aug 2023 16:04:08 +0000 (12:04 -0400)]
ZIL: Avoid dbuf_read() before dmu_sync().

In most cases dmu_sync() works with dirty records directly and does
not need actual data. The only exception is dmu_sync_late_arrival().
To save some CPU time use dmu_buf_hold_noread*() in z*_get_data()
and explicitly call dbuf_read() in dmu_sync_late_arrival(). There
is also a chance that by that time TXG will already be synced and
we won't have to do it at all.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15153

9 months agoRemove fastwrite mechanism.
Alexander Motin [Fri, 28 Jul 2023 20:30:33 +0000 (16:30 -0400)]
Remove fastwrite mechanism.

Fastwrite was introduced many years ago to improve ZIL writes spread
between multiple top-level vdevs by tracking number of allocated but
not written blocks and choosing vdev with smaller count.  It suposed
to reduce ZIL knowledge about allocation, but actually made ZIL to
even more actively report allocation code about the allocations,
complicating both ZIL and metaslabs code.

On top of that, it seems ZIO_FLAG_FASTWRITE setting in dmu_sync()
was lost many years ago, that was one of the declared benefits. Plus
introduction of embedded log metaslab class solved another problem
with allocation rotor accounting both normal and log allocations,
since in most cases those are now in different metaslab classes.

After all that, I'd prefer to simplify already too complicated ZIL,
ZIO and metaslab code if the benefit of complexity is not obvious.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15107

9 months agoAvoid waiting in dmu_sync_late_arrival().
Alexander Motin [Thu, 27 Jul 2023 16:07:09 +0000 (12:07 -0400)]
Avoid waiting in dmu_sync_late_arrival().

The transaction there does not produce any dirty data or log blocks,
so it should not be throttled. All other cases wait for TXG sync, by
which time the log block we are writing will be obsolete, so we can
skip waiting and just return error here instead.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15096

10 months agozpool_vdev_remove() should handle EALREADY error return
Serapheim Dimitropoulos [Tue, 1 Aug 2023 21:47:00 +0000 (14:47 -0700)]
zpool_vdev_remove() should handle EALREADY error return

When the vdev properties features was merged an extra check
was added in `spa_vdev_remove_top_check()` which checked
whether the vdev that we want to remove is already being
removed and if so return an EALREADY error.

```
static int
spa_vdev_remove_top_check(vdev_t *vd)
{
... <snip> ...
/*
 * This device is already being removed
 */
if (vd->vdev_removing)
return (SET_ERROR(EALREADY));
```

Before that change we'd still fail with an error but it
was a more generic one - here is the check that failed
later in the same function:
```
/*
 * There can not be a removal in progress.
 */
if (spa->spa_removing_phys.sr_state == DSS_SCANNING)
return (SET_ERROR(EBUSY));
```

Changing the error code returned from that function changed
the behavior of the removal's library interface exposed to
the userland - `spa_vdev_remove()` now returns `EZFS_UNKNOWN`
instead of `EZFS_EBUSY` that was returning before.

This patch adds logic to make `spa_vdev_remove()` mindful
of the new EALREADY code and propagating `EZFS_EBUSY`
reverting to the previously established semantics of that
function.

Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #15013
Closes #15129

10 months agolinux: zfs: ctldir: set [amc]time to snapshot's creation property
наб [Tue, 1 Aug 2023 15:50:17 +0000 (17:50 +0200)]
linux: zfs: ctldir: set [amc]time to snapshot's creation property

If looking up a snapdir inode failed, hold pool config – hold the
snapshot – get its creation property – release it – release it,
then use that as the [amc]time in the allocated inode. If that
fails then fall back to current time. No performance impact since
this is only done when allocating a new snapdir inode.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #15110
Closes #15117

10 months agoreadmmap.c: fix building with MUSL libc
Zach Dykstra [Tue, 1 Aug 2023 16:01:32 +0000 (11:01 -0500)]
readmmap.c: fix building with MUSL libc

glibc includes sys/types.h from stdlib.h. This is not the case for MUSL,
so explicitly include it. Fixes usage of uint_t.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Zach Dykstra <dykstra.zachary@gmail.com>
Closes #15130

10 months agoCheck the return value in clonefile test
oromenahar [Tue, 1 Aug 2023 15:26:12 +0000 (17:26 +0200)]
Check the return value in clonefile test

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Kay Pedersen <mail@mkwg.de>
Closes #15128

10 months agolinux/copy_file_range: properly request a fallback copy on Linux <5.3
Rob N [Tue, 1 Aug 2023 18:31:11 +0000 (04:31 +1000)]
linux/copy_file_range: properly request a fallback copy on Linux <5.3

Before Linux 5.3, the filesystem's copy_file_range handler had to signal
back to the kernel that we can't fulfill the request and it should
fallback to a content copy. This is done by returning -EOPNOTSUPP.

This commit converts the EXDEV return from zfs_clone_range to
EOPNOTSUPP, to force the kernel to fallback for all the valid reasons it
might be unable to clone. Without it the copy_file_range() syscall will
return EXDEV to userspace, breaking its semantics.

Add test for copy_file_range fallbacks.  copy_file_range should always
fallback to a content copy whenever ZFS can't service the request with
cloning.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #15131

10 months agozdb: include cloned blocks in block statistics
Rob N [Tue, 1 Aug 2023 15:56:30 +0000 (01:56 +1000)]
zdb: include cloned blocks in block statistics

This gives `zdb -b` support for clone blocks.

Previously, it didn't know what clones were, so would count their space
allocation multiple times and then report leaked space (or, in debug,
would assert trying to claim blocks a second time).

This commit fixes those bugs, and reports the number of clones and the
space "used" (saved) by them.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Closes #15123

10 months agoTag 2.2.0-rc3
Brian Behlendorf [Thu, 27 Jul 2023 23:15:44 +0000 (16:15 -0700)]
Tag 2.2.0-rc3

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
10 months agoBRT should return EOPNOTSUPP
oromenahar [Thu, 27 Jul 2023 18:32:34 +0000 (20:32 +0200)]
BRT should return EOPNOTSUPP

Return the more descriptive EOPNOTSUPP instead of EXDEV when the
storage pool doesn't support block cloning.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Kay Pedersen <mail@mkwg.de>
Closes #15097

10 months agozts: block cloning tests
Rob Norris [Tue, 11 Jul 2023 10:46:33 +0000 (20:46 +1000)]
zts: block cloning tests

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Closes #15050
Closes #405
Closes #13349

10 months agolinux: implement filesystem-side copy/clone functions for EL7
Rob Norris [Sun, 25 Jun 2023 10:50:19 +0000 (20:50 +1000)]
linux: implement filesystem-side copy/clone functions for EL7

Redhat have backported copy_file_range and clone_file_range to the EL7
kernel using an "extended file operations" wrapper structure. This
connects all that up to let cloning work there too.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Closes #15050

10 months agolinux: implement filesystem-side clone ioctls
Rob Norris [Tue, 27 Jun 2023 13:45:00 +0000 (23:45 +1000)]
linux: implement filesystem-side clone ioctls

Prior to Linux 4.5, the FICLONE etc ioctls were specific to BTRFS, and
were implemented as regular filesystem-specific ioctls. This implements
those ioctls directly in OpenZFS, allowing cloning to work on older
kernels.

There's no need to gate these behind version checks; on later kernels
Linux will simply never deliver these ioctls, instead calling the
approprate VFS op.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Closes #15050

10 months agolinux: implement filesystem-side copy/clone functions
Rob Norris [Tue, 27 Jun 2023 13:44:53 +0000 (23:44 +1000)]
linux: implement filesystem-side copy/clone functions

This implements the Linux VFS ops required to service the file
copy/clone APIs:

  .copy_file_range    (4.5+)
  .clone_file_range   (4.5-4.19)
  .dedupe_file_range  (4.5-4.19)
  .remap_file_range   (4.20+)

Note that dedupe_file_range() and remap_file_range(REMAP_FILE_DEDUP) are
hooked up here, but are not implemented yet.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Closes #15050

10 months agodbuf_sync_leaf: check DB_READ in state assertions
Rob Norris [Mon, 24 Jul 2023 08:02:21 +0000 (18:02 +1000)]
dbuf_sync_leaf: check DB_READ in state assertions

Block cloning introduced a new state transition from DB_NOFILL to
DB_READ. This occurs when a block is cloned and then read on the
current txg.

In this case, the clone will move the dbuf to DB_NOFILL, and then the
read will be issued for the overidden block pointer. If that read is
still outstanding when it comes time to write, the dbuf will be in
DB_READ, which is not handled by the checks in dbuf_sync_leaf, thus
tripping the assertions.

This updates those checks to allow DB_READ as a valid state iff the
dirty record is for a BRT write and there is a override block pointer.
This is a safe situation because the block already exists, so there's
nothing that could change from underneath the read.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Original-patch-by: Kay Pedersen <mail@mkwg.de>
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Closes #15050

10 months agodmu_buf_will_clone: only check that current txg is clean
Rob Norris [Mon, 24 Jul 2023 07:54:05 +0000 (17:54 +1000)]
dmu_buf_will_clone: only check that current txg is clean

dbuf_undirty() will (correctly) only removed dirty records for the given
(open) txg. If there is a dirty record for an earlier closed txg that
has not been synced out yet, then db_dirty_records will still have
entries on it, tripping the assertion.

Instead, change the assertion to only consider the current txg. To some
extent this is redundant, as its really just saying "did dbuf_undirty()
work?", but it it doesn't hurt and accurately expresses our
expectations.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Original-patch-by: Kay Pedersen <mail@mkwg.de>
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Closes #15050

10 months agobrt_vdev_realloc: use vmem_alloc for large allocation
Rob Norris [Thu, 22 Jun 2023 03:44:00 +0000 (13:44 +1000)]
brt_vdev_realloc: use vmem_alloc for large allocation

bv_entcount can be a relatively large allocation (see comment for
BRT_RANGESIZE), so get it from the big allocator.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Closes #15050

10 months agozfs_clone_range: use vmem_malloc for large allocation
Rob Norris [Thu, 22 Jun 2023 03:44:00 +0000 (13:44 +1000)]
zfs_clone_range: use vmem_malloc for large allocation

Just silencing the warning about large allocations.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Closes #15050

10 months agozed: Reduce log noise for large JBODs
Brian Behlendorf [Tue, 25 Jul 2023 20:55:29 +0000 (13:55 -0700)]
zed: Reduce log noise for large JBODs

For large JBODs the log message "zfs_iter_vdev: no match" can
account for the bulk of the log messages (over 70%).  Since this
message is purely informational and not that useful we remove it.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #15086
Closes #15094

10 months agoLinux 6.4 compat: META
Brian Behlendorf [Mon, 24 Jul 2023 18:20:42 +0000 (11:20 -0700)]
Linux 6.4 compat: META

Update the META file to reflect compatibility with the 6.4 kernel.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #15095

10 months agoRemove zl_issuer_lock from zil_suspend().
Alexander Motin [Tue, 25 Jul 2023 16:08:36 +0000 (12:08 -0400)]
Remove zl_issuer_lock from zil_suspend().

This locking was recently added as part of #14979. But appears it
is illegal to take zl_issuer_lock while holding dp_config_rwlock,
taken by dsl_pool_hold().  It causes deadlock with sync thread in
spa_sync_upgrades().  On a second thought, we should not
need this locking, since zil_commit_impl() we call below takes
zl_issuer_lock, that should sufficiently protect zl_suspend reads,
combined with other logic from #14979.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15103

10 months agoZIL: Fix config lock deadlock.
Alexander Motin [Mon, 24 Jul 2023 20:41:11 +0000 (16:41 -0400)]
ZIL: Fix config lock deadlock.

When we have some LWBs closed and their ZIOs ready to be issued, we
can not afford sleeping on config lock if somebody else try to lock
it as writer, or it will cause a deadlock.

To solve it, move spa_config_enter() from zil_lwb_write_issue() to
zil_lwb_write_close() under zl_issuer_lock to enforce lock ordering
with other threads.  Now if we can't immediately lock config, issue
all previously closed LWBs so that they could drop their config
locks after completion, and only then allow sleeping on our lock.

Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15078
Closes #15080

10 months agoUpdate changelog for OpenZFS 2.2.0 release
Umer Saleem [Tue, 25 Jul 2023 16:01:27 +0000 (21:01 +0500)]
Update changelog for OpenZFS 2.2.0 release

This commit updates changelog for native Debian packages for
OpenZFS 2.2.0 release.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #15104

10 months agoTag 2.2.0-rc2
Brian Behlendorf [Fri, 21 Jul 2023 22:50:43 +0000 (15:50 -0700)]
Tag 2.2.0-rc2

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
10 months agoshellcheck: disable "unreachable command" check [SC2317]
Rob N [Fri, 21 Jul 2023 18:53:06 +0000 (04:53 +1000)]
shellcheck: disable "unreachable command" check [SC2317]

This new check in 0.9.0 appears to have some issues with various forms
of "early return", like trap, exit and return. This is tripping up (at
least):

  cmd/zed/zed.d/history_event-zfs-list-cacher.sh
  /etc/zfs/zfs-functions

Its not obvious what its complaining about or what the remedy is, so it
seems sensible to disable this check for now.

See also:

  https://www.shellcheck.net/wiki/SC2317
  https://github.com/koalaman/shellcheck/issues/2542
  https://github.com/koalaman/shellcheck/issues/2613

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #15089

10 months agometaslab: tuneable to better control force ganging
Rob N [Fri, 21 Jul 2023 18:52:32 +0000 (04:52 +1000)]
metaslab: tuneable to better control force ganging

metaslab_force_ganging isn't enough to actually force ganging, because
it still only forces 3% of the time. This adds
metaslab_force_ganging_pct so we can configure how often to force
ganging.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #15088

10 months agoAdjust prefetch parameters.
Alexander Motin [Fri, 21 Jul 2023 18:51:47 +0000 (14:51 -0400)]
Adjust prefetch parameters.

- Reduce maximum prefetch distance for 32bit platforms to 8MB as it
was previously.  Those systems didn't grow much probably, so better
stay conservative there.
 - Retire array_rd_sz tunable, blocking prefetch for large requests.
We should not penalize applications trying to be more efficient. The
speculative prefetcher by itself has reasonable distance limits, and
1MB is not much at all these days.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15072

10 months agoAdd explicit prefetches to bpobj_iterate().
Alexander Motin [Fri, 21 Jul 2023 18:50:48 +0000 (14:50 -0400)]
Add explicit prefetches to bpobj_iterate().

To simplify error handling bpobj_iterate_blkptrs() iterates through
the list of block pointers backwards.  Unfortunately speculative
prefetcher is currently unable to detect such patterns, that makes
each block read there synchronous and very slow on HDD pools.

According to my tests, added explicit prefetch reduces time needed
to asynchronously delete 8 snapshots of 4 million blocks each from
20 seconds to less than one, that should free sync thread for other
useful work, such as async writes, scrub, etc.

While there, plug one memory leak in case of bpobj_open() error and
harmonize some variable names.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15071

10 months agoDon't emit cksum_{actual_expected} in ereport.fs.zfs.checksum events
Alan Somers [Tue, 11 Jul 2023 21:13:57 +0000 (15:13 -0600)]
Don't emit cksum_{actual_expected} in ereport.fs.zfs.checksum events

With anything but fletcher-4, even a tiny change in the input will cause
the checksum value to change completely.  So knowing the actual and
expected checksums doesn't provide much more information than "they
don't match".  The harm in sending them is simply that they bloat the
event.  In particular, on FreeBSD the event must fit into a 1016 byte
buffer.

Fixes #14717 for mirrored pools.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Alan Somers <asomers@gmail.com>
Sponsored-by: Axcient
Closes #14717
Closes #15052

10 months agoDon't emit checksum histograms in ereport.fs.zfs.checksum events
Alan Somers [Tue, 11 Jul 2023 20:45:06 +0000 (14:45 -0600)]
Don't emit checksum histograms in ereport.fs.zfs.checksum events

The checksum histograms were intended to be used with ATA and parallel
SCSI, which are obsolete.  With modern storage hardware, they will
almost always look like white noise; all bits will be wrong.  They only
serve to bloat the event.  That's a particular problem on FreeBSD, where
events must fit into a 1016 byte buffer.

This fixes issue #14717 for RAIDZ pools, but not for mirror pools.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Alan Somers <asomers@gmail.com>
Sponsored-by: Axcient
Closes #15052

10 months agozed: Fix zed ASSERT on slot power cycle
Tony Hutter [Fri, 21 Jul 2023 18:46:58 +0000 (11:46 -0700)]
zed: Fix zed ASSERT on slot power cycle

We would see zed assert on one of our systems if we powered off a
slot.  Further examination showed zfs_retire_recv() was reporting
a GUID of 0, which in turn would return a NULL nvlist.  Add
in a check for a zero GUID.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #15084

10 months agoFix zpl_test_super race with zfs_umount
Chunwei Chen [Thu, 20 Jul 2023 17:30:21 +0000 (10:30 -0700)]
Fix zpl_test_super race with zfs_umount

We cannot call zpl_enter in zpl_test_super, because zpl_test_super is
under spinlock so we can't sleep, and also because zpl_test_super is
called without sb->s_umount taken, so it's possible we would race with
zfs_umount and call zpl_enter on freed zfsvfs.

Here's an stack trace when this happens:
[ 2379.114837] VERIFY(cvp->cv_magic == CV_MAGIC) failed
[ 2379.114845] PANIC at spl-condvar.c:497:__cv_broadcast()
[ 2379.114854] Kernel panic - not syncing: VERIFY(cvp->cv_magic == CV_MAGIC) failed
[ 2379.115012] Call Trace:
[ 2379.115019]  dump_stack+0x74/0x96
[ 2379.115024]  panic+0x114/0x2f6
[ 2379.115035]  spl_panic+0xcf/0xfc [spl]
[ 2379.115477]  __cv_broadcast+0x68/0xa0 [spl]
[ 2379.115585]  rrw_exit+0xb8/0x310 [zfs]
[ 2379.115696]  rrm_exit+0x4a/0x80 [zfs]
[ 2379.115808]  zpl_test_super+0xa9/0xd0 [zfs]
[ 2379.115920]  sget+0xd1/0x230
[ 2379.116033]  zpl_mount+0xdc/0x230 [zfs]
[ 2379.116037]  legacy_get_tree+0x28/0x50
[ 2379.116039]  vfs_get_tree+0x27/0xc0
[ 2379.116045]  path_mount+0x2fe/0xa70
[ 2379.116048]  do_mount+0x80/0xa0
[ 2379.116050]  __x64_sys_mount+0x8b/0xe0
[ 2379.116052]  do_syscall_64+0x35/0x50
[ 2379.116054]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 2379.116057] RIP: 0033:0x7f9912e8b26a

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #15077

10 months agospa_min_alloc should be GCD, not min
Ameer Hamza [Thu, 20 Jul 2023 17:23:52 +0000 (22:23 +0500)]
spa_min_alloc should be GCD, not min

Since spa_min_alloc may not be a power of 2, unlike ashifts, in the
case of DRAID, we should not select the minimal value among several
vdevs. Rounding to a multiple of it is unlikely to work for other
vdevs. Instead, using the greatest common divisor produces smaller
yet more reasonable results.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #15067

10 months agoDon't panic if setting vdev properties is unsupported for this vdev type
Yuri Pankov [Thu, 20 Jul 2023 17:21:47 +0000 (19:21 +0200)]
Don't panic if setting vdev properties is unsupported for this vdev type

Check that vdev has valid zap and bail out early.

While here, move objid selection out of the loop, it's not going to
change.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Yuri Pankov <yuripv@FreeBSD.org>
Closes #15063

10 months agoIgnore pool ashift property during vdev attachment
Ameer Hamza [Thu, 20 Jul 2023 16:57:16 +0000 (21:57 +0500)]
Ignore pool ashift property during vdev attachment

Ashift can be set for a vdev only during its creation, and the
top-level vdev does not change when a vdev is attached or replaced.
The ashift property should not be used during attachment, as it
does not allow attaching/replacing a vdev if the pool's ashift
property is increased after the existing vdev was created. Instead,
we should be able to attach the vdev if the attached vdev can
satisfy the ashift requirement with its parent.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #15061

10 months agoRollback before zfs root is mounted
Wojciech Małota-Wójcik [Thu, 20 Jul 2023 16:55:22 +0000 (18:55 +0200)]
Rollback before zfs root is mounted

On my machines I observe random failures caused by rollback happening
after zfs root is mounted. I've observed two types of failures:

- zfs-rollback-bootfs.service fails saying that rollback must be
  done just before mounting the dataset
- boot process fails and rescue console is entered.

After making this modification and testing it for couple of days
none of those problems have been observed anymore.

I don't know if `dracut-mount.service` is still needed in the
`After` directive. Maybe someone else is able to address this?

Reviewed-by: Gregory Bartholomew <gregory.lee.bartholomew@gmail.com>
Signed-off-by: Wojciech Małota-Wójcik <59281144+outofforest@users.noreply.github.com>
Closes #15025

10 months agoDo not request data L1 buffers on scan prefetch.
Alexander Motin [Thu, 20 Jul 2023 16:10:04 +0000 (12:10 -0400)]
Do not request data L1 buffers on scan prefetch.

Set ARC_FLAG_NO_BUF when prefetching data L1 buffers for scan.  We
do not prefetch data L0 buffers, so we do not need the L1 buffers,
only want them to be ready in ARC. This saves some CPU time on the
buffers decompression.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15029

10 months agoLinux 6.5 compat: disk_check_media_change() was added
Coleman Kane [Thu, 20 Jul 2023 16:09:25 +0000 (12:09 -0400)]
Linux 6.5 compat: disk_check_media_change() was added

The disk_check_media_change() function was added which replaces
bdev_check_media_change.  This change was introduced in 6.5rc1
444aa2c58cb3b6cfe3b7cc7db6c294d73393a894 and the new function takes a
gendisk* as its argument, no longer a block_device*. Thus, bdev->bd_disk
is now used to pass the expected data.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #15060

10 months agoLinux 6.5 compat: BLK_STS_NEXUS renamed to BLK_STS_RESV_CONFLICT
Coleman Kane [Fri, 14 Jul 2023 23:33:51 +0000 (19:33 -0400)]
Linux 6.5 compat: BLK_STS_NEXUS renamed to BLK_STS_RESV_CONFLICT

This change was introduced in Linux commit
7ba150834b840f6f5cdd07ca69a4ccf39df59a66

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #15059

10 months agoLinux 6.5 compat: intptr_t definition is canonically signed
Coleman Kane [Fri, 14 Jul 2023 23:32:49 +0000 (19:32 -0400)]
Linux 6.5 compat: intptr_t definition is canonically signed

Make the version here match that elsewhere in the kernel and system
headers.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #15058
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
10 months agoset autotrim default to 'off' everywhere
Yuri Pankov [Thu, 20 Jul 2023 16:06:55 +0000 (18:06 +0200)]
set autotrim default to 'off' everywhere

As it turns out having autotrim default to 'on' on FreeBSD never really
worked due to mess with defines where userland and kernel module were
getting different default values (userland was defaulting to 'off',
module was thinking it's 'on').

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Yuri Pankov <yuripv@FreeBSD.org>
Closes #15079

10 months agoFix the ZFS checksum error histograms with larger record sizes
Alan Somers [Fri, 14 Jul 2023 23:13:15 +0000 (17:13 -0600)]
Fix the ZFS checksum error histograms with larger record sizes

My analysis in PR #14716 was incorrect.  Each histogram bucket contains
the number of incorrect bits, by position in a 64-bit word, over the
entire record.  8-bit buckets can overflow for record sizes above 2k.
To forestall that, saturate each bucket at 255.  That should still get
the point across: either all bits are equally wrong, or just a couple
are.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alan Somers <asomers@gmail.com>
Sponsored-by: Axcient
Closes #15049

10 months agoFix raw receive with different indirect block size.
Alexander Motin [Fri, 14 Jul 2023 23:16:40 +0000 (19:16 -0400)]
Fix raw receive with different indirect block size.

Unlike regular receive, raw receive require destination to have the
same block structure as the source.  In case of dnode reclaim this
triggers two special cases, requiring special handling:
 - If dn_nlevels == 1, we can change the ibs, but dnode_set_blksz()
should not dirty the data buffer if block size does not change, or
durign receive dbuf_dirty_lightweight() will trigger assertion.
 - If dn_nlevels > 1, we just can't change the ibs, dnode_set_blksz()
would fail and receive_object would trigger assertion, so we should
destroy and recreate the dnode from scratch.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15039

(cherry picked from commit c4e8742149e31a77dc074f3baacfefed3ccb800e)

10 months agoAvoid extra snprintf() in dsl_deadlist_merge().
Alexander Motin [Fri, 14 Jul 2023 23:11:46 +0000 (19:11 -0400)]
Avoid extra snprintf() in dsl_deadlist_merge().

Since we are already iterating the ZAP, we have exact string key to
remove, we do not need to call zap_remove_int() with the int key we
just converted, we can call zap_remove() for the original string.

This should make no functional change, only a micro-optimization.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15056

(cherry picked from commit fdba8cbb796cb089c3d6eefa833f5176b0474c29)

10 months agoAdd missed DMU_PROJECTUSED_OBJECT prefetch.
Alexander Motin [Thu, 13 Jul 2023 16:12:55 +0000 (12:12 -0400)]
Add missed DMU_PROJECTUSED_OBJECT prefetch.

It seems 9c5167d19f "Project Quota on ZFS" missed to add prefetch
for DMU_PROJECTUSED_OBJECT during scan (scrub/resilver).  It should
not cause visible problems, but may affect scub/resilver performance.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15024

10 months agoFreeBSD: catch up to __FreeBSD_version 1400093
Mateusz Guzik [Thu, 13 Jul 2023 16:06:57 +0000 (18:06 +0200)]
FreeBSD: catch up to __FreeBSD_version 1400093

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #15036

10 months agoFreeBSD: Fix build on stable/13 after 1302506.
Alexander Motin [Thu, 13 Jul 2023 15:50:34 +0000 (11:50 -0400)]
FreeBSD: Fix build on stable/13 after 1302506.

Starting approximately from version 1302506 vn_lock_pair() grown two
additional arguments following head.  There is a one week hole, but
that is closet reference point we have.

Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by:   iXsystems, Inc.
Closes #15047

11 months agoTag 2.2.0-rc1
Brian Behlendorf [Fri, 30 Jun 2023 18:42:14 +0000 (11:42 -0700)]
Tag 2.2.0-rc1

New features:
- Fully adaptive ARC eviction (#14359)
- Block cloning (#13392)
- Scrub error log (#12812, #12355)
- Linux container support (#14070, #14097, #12263)
- BLAKE3 Checksums (#12918)
- Corrective "zfs receive" (#9372)

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
11 months agoEnable tuning of ZVOL open timeout value
Prakash Surya [Fri, 30 Jun 2023 18:34:05 +0000 (11:34 -0700)]
Enable tuning of ZVOL open timeout value

The default timeout for ZVOL opens may not be sufficient for all cases,
so we should enable the value to be more easily tuned to account for
systems where the default value is insufficient.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Closes #15023

11 months agoRevert "spa.h: use IN_BASE instead of IN_FREEBSD_BASE"
Brian Behlendorf [Fri, 30 Jun 2023 17:03:41 +0000 (10:03 -0700)]
Revert "spa.h: use IN_BASE instead of IN_FREEBSD_BASE"

This reverts commit 77a3bb1f47e67c233eb1961b8746748c02bafde1.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
11 months agoPack our DDT ZAPs a bit denser.
Rich Ercolani [Fri, 30 Jun 2023 16:42:02 +0000 (12:42 -0400)]
Pack our DDT ZAPs a bit denser.

The DDT is really inefficient on 4k and up vdevs, because it always
allocates 4k blocks, and while compression could save us somewhat
at ashift 9, that stops being true.

So let's change the default to 32 KiB, which seems like a reasonable
compromise between improved space savings and inflated write sizes
for DDT updates.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #14654

11 months agoddt_addref: remove unnecessary phys fill when refcount is 0
Rob N [Fri, 30 Jun 2023 16:01:58 +0000 (02:01 +1000)]
ddt_addref: remove unnecessary phys fill when refcount is 0

The previous comment wondered if this case could happen; it turns out
that it really can't.

This block can only be entered if dde_type and dde_class are "real";
that only happens when a ddt entry has been previously synced to a ddt
store, that is, it was created on a previous txg. Since its gone through
that sync, its dde_refcount must be >0.

ddt_addref() is called from brt_pending_apply(), which is called at the
beginning of spa_sync(), before pending DMU writes/frees are issued.
Freeing a dedup block is the only thing that can decrement dde_refcount,
so there's no way for it to drop to zero before applying the clone bumps
it.

Further, even if it _could_ go to zero, it wouldn't be necessary to fill
the entry from the block. The phys content is not cleared until the free
is issued, which happens when the refcount goes to zero, when the last
real free comes through. The cloned block should be identical to what's
in the phys already, so the fill should be a no-op anyway.

I've replaced this with an assertion because this is all very dependent
on the ordering in which BRT and DDT changes are applied, and that might
change in the future.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-By: Klara, Inc.
Closes #15004

11 months agoAgain fix race between zil_commit() and zil_suspend().
Alexander Motin [Fri, 30 Jun 2023 15:59:39 +0000 (11:59 -0400)]
Again fix race between zil_commit() and zil_suspend().

With zl_suspend read in zil_commit() not protected by any locks it
is possible for new ZIL writes to be in progress while zil_destroy()
called by zil_suspend() freeing them.  This patch closes the race
by taking zl_issuer_lock in zil_suspend() and adding the second
zl_suspend check to zil_get_commit_list(), protected by the lock.
It allows all already queued transactions to be logged normally,
while blocks any new ones, calling txg_wait_synced() for the TXGs.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14979

11 months agoSome ZIO micro-optimizations.
Alexander Motin [Fri, 30 Jun 2023 15:54:00 +0000 (11:54 -0400)]
Some ZIO micro-optimizations.

- Pack struct zio_prop by 4 bytes from 84 to 80.
 - Skip new child ZIO locking while linking to parent.  The newly
allocated ZIO is not externally visible yet, so nobody should care.
 - Skip io_bp_copy writes when not used (write && non-debug).

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14985

11 months agoDo not report bytes skipped by scan as issued.
Alexander Motin [Fri, 30 Jun 2023 15:47:13 +0000 (11:47 -0400)]
Do not report bytes skipped by scan as issued.

Scan process may skip blocks based on their birth time, DVA, etc.
Traditionally those blocks were accounted as issued, that caused
reporting of hugely over-inflated numbers, having nothing to do
with actual disk I/O.  This change utilizes never used field in
struct dsl_scan_phys to account such skipped bytes, allowing to
report how much data were actually scrubbed/resilvered and what
is the actual I/O speed.  While formally it is an on-disk format
change, it should be compatible both ways, so should not need a
feature flag.

This should partially address the same issue as c85ac731a0e, but
from a different perspective, complementing it.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by:   iXsystems, Inc.
Closes #15007

11 months agoDon't use hard-coded 'size' value in snprintf()
Arshad Hussain [Fri, 30 Jun 2023 15:37:26 +0000 (21:07 +0530)]
Don't use hard-coded 'size' value in snprintf()

This patch changes the passing of "size" to snprintf
from hard-coded (openended) to sizeof(errbuf). This
is bringing to standard with rest of the code where-
ever 'errbuf' is used.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arshad Hussain <arshad.hussain@aeoncomputing.com>
Closes #15003

11 months agoFix remount when setting multiple properties.
Alexander Motin [Fri, 30 Jun 2023 15:36:43 +0000 (11:36 -0400)]
Fix remount when setting multiple properties.

The previous code was checking zfs_is_namespace_prop() only for the
last property on the list.  If one was not "namespace", then remount
wasn't called.  To fix that move zfs_is_namespace_prop() inside the
loop and remount if at least one of properties was "namespace".

Reviewed-by: Umer Saleem <usaleem@ixsystems.com>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15000

11 months agocontrib: dracut: Conditionalize copying of libgcc_s.so.1 to glibc only
vimproved [Thu, 29 Jun 2023 19:54:37 +0000 (19:54 +0000)]
contrib: dracut: Conditionalize copying of libgcc_s.so.1 to glibc only

The issue that this is designed to work around is only applicable to
glibc, since it's caused by glibc's pthread_cancel() implementation
using dlopen on libgcc_s.so.1 (and therefor not triggering dracut to
include it in the initramfs). This commit adds an extra condition to the
workaround that tests for glibc via "ldconfig -p | grep -qF 'libc.so.6'"
(which should only be present on glibc systems).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Violet Purcell <vimproved@inventati.org>
Closes #14992

11 months agospa.h: use IN_BASE instead of IN_FREEBSD_BASE
Yuri Pankov [Thu, 29 Jun 2023 18:50:52 +0000 (20:50 +0200)]
spa.h: use IN_BASE instead of IN_FREEBSD_BASE

Consistently get the proper default value for autotrim.

Currently, only the kernel module is built with IN_FREEBSD_BASE,
and libzfs get the wrong default value, leading to confusion and
incorrect output when autotrim value was not set explicitly.

Reviewed-by: Warner Losh <imp@bsdimp.com>
Signed-off-by: Yuri Pankov <yuripv@FreeBSD.org>
Closes #15016

11 months agozdb: Add missing poolname to -C synopsis
Mateusz Piotrowski [Thu, 29 Jun 2023 17:54:43 +0000 (19:54 +0200)]
zdb: Add missing poolname to -C synopsis

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Mateusz Piotrowski <0mp@FreeBSD.org>
Sponsored-by: Klara Inc.
Closes #15014

11 months agoZIL: Fix another use-after-free.
Alexander Motin [Wed, 28 Jun 2023 00:03:37 +0000 (20:03 -0400)]
ZIL: Fix another use-after-free.

lwb->lwb_issued_txg can not be accessed after lwb_state is set to
LWB_STATE_FLUSH_DONE and zl_lock is dropped, since the lwb may be
freed by zil_sync().  We must save the txg number before that.

This is similar to the 55b1842f92, but as I see the bug is not new.
It existed for quite a while, just was not triggered due to smaller
race window.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14988
Closes #14999

11 months agoUse big transactions for small recordsize writes.
Alexander Motin [Wed, 28 Jun 2023 00:00:30 +0000 (20:00 -0400)]
Use big transactions for small recordsize writes.

When ZFS appends files in chunks bigger than recordsize, it borrows
buffer from ARC and fills it before opening transaction.  This
supposed to help in case of page faults to not hold transaction open
indefinitely.  The problem appears when recordsize is set lower than
default 128KB. Since each block is committed in separate transaction,
per-transaction overhead becomes significant, and what is even worse,
active use of of per-dataset and per-pool locks to protect space use
accounting for each transaction badly hurts the code SMP scalability.
The same transaction size limitation applies in case of file rewrite,
but without even excuse of buffer borrowing.

To address the issue, disable the borrowing mechanism if recordsize
is smaller than default and the write request is 4x bigger than it.
In such case writes up to 32MB are executed in single transaction,
that dramatically reduces overhead and lock contention.  Since the
borrowing mechanism is not used for file rewrites, and it was never
used by zvols, which seem to work fine, I don't think this change
should create significant problems, partially because in addition to
the borrowing mechanism there are also used pre-faults.

My tests with 4/8 threads writing several files same time on datasets
with 32KB recordsize in 1MB requests show reduction of CPU usage by
the user threads by 25-35%.  I would measure it in GB/s, but at that
block size we are now limited by the lock contention of single write
issue taskqueue, which is a separate problem we are going to work on.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14964

11 months agoRemove unnecessary commas in zpool-create.8
Laevos [Tue, 27 Jun 2023 23:58:32 +0000 (16:58 -0700)]
Remove unnecessary commas in zpool-create.8

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Laevos <5572812+Laevos@users.noreply.github.com>
Closes #15011

11 months agoAnother set of vdev queue optimizations.
Alexander Motin [Tue, 27 Jun 2023 16:09:48 +0000 (12:09 -0400)]
Another set of vdev queue optimizations.

Switch FIFO queues (SYNC/TRIM) and active queue of vdev queue from
time-sorted AVL-trees to simple lists.  AVL-trees are too expensive
for such a simple task.  To change I/O priority without searching
through the trees, add io_queue_state field to struct zio.

To not check number of queued I/Os for each priority add vq_cqueued
bitmap to struct vdev_queue.  Update it when adding/removing I/Os.
Make vq_cactive a separate array instead of struct vdev_queue_class
member.  Together those allow to avoid lots of cache misses when
looking for work in vdev_queue_class_to_issue().

Introduce deadline of ~0.5s for LBA-sorted queues.  Before this I
saw some I/Os waiting in a queue for up to 8 seconds and possibly
more due to starvation.  With this change I no longer see it.  I
had to slightly more complicate the comparison function, but since
it uses all the same cache lines the difference is minimal.  For a
sequential I/Os the new code in vdev_queue_io_to_issue() actually
often uses more simple avl_first(), falling back to avl_find() and
avl_nearest() only when needed.

Arrange members in struct zio to access only one cache line when
searching through vdev queues.  While there, remove io_alloc_node,
reusing the io_queue_node instead.  Those two are never used same
time.

Remove zfs_vdev_aggregate_trim parameter.  It was disabled for 4
years since implemented, while still wasted time maintaining the
offset-sorted tree of TRIM requests.  Just remove the tree.

Remove locking from txg_all_lists_empty().  It is racy by design,
while 2 pair of locks/unlocks take noticeable time under the vdev
queue lock.

With these changes in my tests with volblocksize=4KB I measure vdev
queue lock spin time reduction by 50% on read and 75% on write.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14925

11 months agoAdd a delay to tearing down threads.
Rich Ercolani [Mon, 26 Jun 2023 20:57:12 +0000 (16:57 -0400)]
Add a delay to tearing down threads.

It's been observed that in certain workloads (zvol-related being a
big one), ZFS will end up spending a large amount of time spinning
up taskqs only to tear them down again almost immediately, then
spin them up again...

I noticed this when I looked at what my mostly-idle system was doing
and wondered how on earth taskq creation/destroy was a bunch of time...

So I added a configurable delay to avoid it tearing down tasks the
first time it notices them idle, and the total number of threads at
steady state went up, but the amount of time being burned just
tearing down/turning up new ones almost vanished.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #14938

11 months agoFix memory leak in zil_parse().
Alexander Motin [Sun, 18 Jun 2023 02:51:37 +0000 (22:51 -0400)]
Fix memory leak in zil_parse().

482da24e2 missed arc_buf_destroy() calls on log parse errors, possibly
leaking up to 128KB of memory per dataset during ZIL replay.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14987

11 months agoShorten arcstat_quiescence sleep time
George Amanakis [Thu, 15 Jun 2023 19:45:36 +0000 (21:45 +0200)]
Shorten arcstat_quiescence sleep time

With the latest L2ARC fixes, 2 seconds is too long to wait for
quiescence of arcstats like l2_size. Shorten this interval to avoid
having the persistent L2ARC tests in ZTS prematurely terminated.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14981

11 months agoRemove ARC/ZIO physdone callbacks.
Alexander Motin [Thu, 15 Jun 2023 17:49:03 +0000 (13:49 -0400)]
Remove ARC/ZIO physdone callbacks.

Those callbacks were introduced many years ago as part of a bigger
patch to smoothen the write throttling within a txg. They allow to
account completion of individual physical writes within a logical
one, improving cases when some of physical writes complete much
sooner than others, gradually opening the write throttle.

Few years after that ZFS got allocation throttling, working on a
level of logical writes and limiting number of writes queued to
vdevs at any point, and so limiting latency distribution between
the physical writes and especially writes of multiple copies.
The addition of scheduling deadline I proposed in #14925 should
further reduce the latency distribution.  Grown memory sizes over
the past 10 years should also reduce importance of the smoothing.

While the use of physdone callback may still in theory provide
some smoother throttling, there are cases where we simply can not
afford it.  Since dirty data accounting is protected by pool-wide
lock, in case of 6-wide RAIDZ, for example, it requires us to take
it 8 times per logical block write, creating huge lock contention.

My tests of this patch show radical reduction of the lock spinning
time on workloads when smaller blocks are written to RAIDZ pools,
when each of the disks receives 8-16KB chunks, but the total rate
reaching 100K+ blocks per second.  Same time attempts to measure
any write time fluctuations didn't show anything noticeable.

While there, remove also io_child_count/io_parent_count counters.
They are used only for couple assertions that can be avoided.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14948

11 months agoZTS: Skip send_raw_ashift on FreeBSD
Brian Behlendorf [Wed, 14 Jun 2023 15:04:05 +0000 (10:04 -0500)]
ZTS: Skip send_raw_ashift on FreeBSD

On FreeBSD 14 this test runs slowly in the CI environment
and is killed by the 10 minute timeout.  Skip the test on
FreeBSD until the slow down is resolved.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #14961

11 months agoSwitch refcount tracking from lists to AVL-trees.
Alexander Motin [Wed, 14 Jun 2023 15:02:27 +0000 (11:02 -0400)]
Switch refcount tracking from lists to AVL-trees.

With large number of tracked references list searches under the lock
become too expensive, creating enormous lock contention.

On my tests with ZFS_DEBUG enabled this increases write throughput
with 32KB blocks from ~1.2GB/s to ~7.5GB/s.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14970

11 months agoStore the L2ARC device ashift in the vdev label
George Amanakis [Wed, 14 Jun 2023 15:01:17 +0000 (17:01 +0200)]
Store the L2ARC device ashift in the vdev label

If this is not done, and the pool has an ashift other than the default
(at the moment 9) then the following happens:

1) vdev_alloc() assigns the ashift of the pool to L2ARC device, but
   upon export it is not stored anywhere
2) at the first import, vdev_open() sees an vdev_ashift() of 0 and
   assigns the logical_ashift, which is 9
3) reading the contents of L2ARC, including the header fails
4) L2ARC buffers are not restored in ARC.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14313
Closes #14963

11 months agoFix the L2ARC write size calculating logic (2)
George Amanakis [Sat, 10 Jun 2023 00:05:47 +0000 (02:05 +0200)]
Fix the L2ARC write size calculating logic (2)

While commit bcd5321 adjusts the write size based on the size of the log
block, this happens after comparing the unadjusted write size to the
evicted (target) size.

In this case l2ad_hand will exceed l2ad_evict and violate an assertion
at the end of l2arc_write_buffers().

Fix this by adding the max log block size to the allocated size of the
buffer to be committed before comparing the result to the target
size.

Also reset the l2arc_trim_ahead ZFS module variable when the adjusted
write size exceeds the size of the L2ARC device.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14936
Closes #14954

11 months agoFinally drop long disabled vdev cache.
Alexander Motin [Fri, 9 Jun 2023 19:40:55 +0000 (15:40 -0400)]
Finally drop long disabled vdev cache.

It was a vdev level read cache, designed to aggregate many small
reads by speculatively issuing bigger reads instead and caching
the result.  But since it has almost no idea about what is going
on with exception of ZIO_FLAG_DONT_CACHE flag set by higher layers,
it was found to make more harm than good, for which reason it was
disabled for the past 12 years.  These days we have much better
instruments to enlarge the I/Os, such as speculative and prescient
prefetches, I/O scheduler, I/O aggregation etc.

Besides just the dead code removal this removes one extra mutex
lock/unlock per write inside vdev_cache_write(), not otherwise
disabled and trying to do some work.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14953

11 months agoZTS: Skip checkpoint_discard_busy
Brian Behlendorf [Fri, 9 Jun 2023 18:10:01 +0000 (11:10 -0700)]
ZTS: Skip checkpoint_discard_busy

Until the ASSERT which is occasionally hit while running
checkpoint_discard_busy is resolved skip this test case.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #12053
Closes #14952

11 months agoImprove l2arc reporting in arc_summary.
Alexander Motin [Fri, 9 Jun 2023 17:14:05 +0000 (13:14 -0400)]
Improve l2arc reporting in arc_summary.

- Do not report L2ARC as FAULTED in presence of in-flight writes.
- Report read and write I/Os, bytes and errors.
- Remove few numbers not important to average user.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #12304
Closes #14946

11 months agoUse list_remove_head() where possible.
Alexander Motin [Fri, 9 Jun 2023 17:12:52 +0000 (13:12 -0400)]
Use list_remove_head() where possible.

... instead of list_head() + list_remove().  On FreeBSD the list
functions are not inlined, so in addition to more compact code
this also saves another function call.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14955

11 months agoZIL: Fix race introduced by f63811f0721.
Alexander Motin [Fri, 9 Jun 2023 17:08:05 +0000 (13:08 -0400)]
ZIL: Fix race introduced by f63811f0721.

We are not allowed to access lwb after setting LWB_STATE_FLUSH_DONE
state and dropping zl_lock, since it may be freed by zil_sync().
To free itxs and waiters after dropping the lock we need to move
lwb_itxs and lwb_waiters lists elements to local storage.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14957
Closes #14959

11 months agoRevert "systemd: Use non-absolute paths in Exec* lines"
Rich Ercolani [Wed, 7 Jun 2023 18:14:05 +0000 (14:14 -0400)]
Revert "systemd: Use non-absolute paths in Exec* lines"

This reverts commit 79b20949b25c8db4d379f6486b0835a6613b480c since it
doesn't work with the systemd version shipped with RHEL7-based systems.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #14943
Closes #14945

11 months agoLinux: Never sleep in kmem_cache_alloc(..., KM_NOSLEEP) (#14926)
Brian Behlendorf [Wed, 7 Jun 2023 17:43:43 +0000 (10:43 -0700)]
Linux: Never sleep in kmem_cache_alloc(..., KM_NOSLEEP) (#14926)

When a kmem cache is exhausted and needs to be expanded a new
slab is allocated.  KM_SLEEP callers can block and wait for the
allocation, but KM_NOSLEEP callers were incorrectly allowed to
block as well.

Resolve this by attempting an emergency allocation as a best
effort.  This may fail but that's fine since any KM_NOSLEEP
consumer is required to handle an allocation failure.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
11 months agoFix the L2ARC write size calculating logic
George Amanakis [Tue, 6 Jun 2023 19:32:37 +0000 (21:32 +0200)]
Fix the L2ARC write size calculating logic

l2arc_write_size() should return the write size after adjusting for trim
and overhead of the L2ARC log blocks. Also take into account the
allocated size of log blocks when deciding when to stop writing buffers
to L2ARC.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14939

11 months agozdb: add -B option to generate backup stream
Rob Norris [Wed, 15 Mar 2023 07:18:10 +0000 (18:18 +1100)]
zdb: add -B option to generate backup stream

This is more-or-less like `zfs send`, but specifying the snapshot by its
objset id for situations where it can't be referenced any other way.

Sponsored-By: Klara, Inc.
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: WHR <msl0000023508@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #14642

11 months agoznode: expose zfs_get_zplprop to libzpool
Rob Norris [Sun, 4 Jun 2023 01:14:20 +0000 (11:14 +1000)]
znode: expose zfs_get_zplprop to libzpool

There's no particular reason this function should be kernel-only, and I
want to use it (indirectly) from zdb. I've moved it to zfs_znode.c
because libzpool does not compile in zfs_vfsops.c, and this at least
matches the header its imported from.

Sponsored-By: Klara, Inc.
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: WHR <msl0000023508@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #14642

11 months agoIntroduce zfs_refcount_(add|remove)_few().
Alexander Motin [Mon, 5 Jun 2023 18:51:44 +0000 (14:51 -0400)]
Introduce zfs_refcount_(add|remove)_few().

There are two places where we need to add/remove several references
with semantics of zfs_refcount_(add|remove). But when debug/tracing
is disabled, it is a crime to run multiple atomic_inc() in a loop,
especially under congested pool-wide allocator lock.

Introduced new functions implement the same semantics as the loop,
but without overhead in production builds.

Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14934

11 months agoLinux 6.3 compat: META (#14930)
Brian Behlendorf [Mon, 5 Jun 2023 18:08:24 +0000 (11:08 -0700)]
Linux 6.3 compat: META (#14930)

Update the META file to reflect compatibility with the 6.3 kernel.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
12 months agozfs-create(8): ZFS for swap: caution, clarity
Graham Perrin [Fri, 2 Jun 2023 18:25:13 +0000 (19:25 +0100)]
zfs-create(8): ZFS for swap: caution, clarity

Make the section heading more generic (the section relates to ZFS files
as well as ZFS volumes).

Swapping to a ZFS volume is prone to deadlock. Remove the related
instruction, direct readers to OpenZFS FAQ. Related, but not linked
from within the manual page:

<https://openzfs.github.io/openzfs-docs/Project%20and%20Community/FAQ.html#using-a-zvol-for-a-swap-device-on-linux>
(Using a zvol for a swap device on Linux).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Graham Perrin <grahamperrin@freebsd.org>
Issue #7734
Closes #14756

12 months agoZIL: Allow to replay blocks of any size.
Alexander Motin [Fri, 2 Jun 2023 18:01:58 +0000 (14:01 -0400)]
ZIL: Allow to replay blocks of any size.

There seems to be no reason for ZIL blocks to be limited by 128KB
other than replay code is written in such a way.  This change does
not increase the limit yet, just removes the artificial limitation.

Avoided extra memcpy() may save us a second during replay.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14910

12 months agoPAM: enable testing on FreeBSD
Val Packett [Thu, 11 May 2023 21:16:57 +0000 (18:16 -0300)]
PAM: enable testing on FreeBSD

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Felix Dörre <felix@dogcraft.de>
Signed-off-by: Val Packett <val@packett.cool>
Closes #14834

12 months agoPAM: support password changes even when not mounted
Val Packett [Sat, 6 May 2023 01:17:12 +0000 (22:17 -0300)]
PAM: support password changes even when not mounted

There's usually no requirement that a user be logged in for changing
their password, so let's not be surprising here.

We need to use the fetch_lazy mechanism for the old password to avoid
a double prompt for it, so that mechanism is now generalized a bit.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Felix Dörre <felix@dogcraft.de>
Signed-off-by: Val Packett <val@packett.cool>
Closes #14834

12 months agoPAM: add 'uid_min' and 'uid_max' options for changing the uid range
Val Packett [Sat, 6 May 2023 01:34:58 +0000 (22:34 -0300)]
PAM: add 'uid_min' and 'uid_max' options for changing the uid range

Instead of a fixed >=1000 check, allow the configuration to override
the minimum UID and add a maximum one as well. While here, add the
uid range check to the authenticate method as well, and fix the return
in the chauthtok method (seems very wrong to report success when we've
done absolutely nothing).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Felix Dörre <felix@dogcraft.de>
Signed-off-by: Val Packett <val@packett.cool>
Closes #14834

12 months agoPAM: add 'forceunmount' flag
Val Packett [Sat, 6 May 2023 01:02:13 +0000 (22:02 -0300)]
PAM: add 'forceunmount' flag

Probably not always a good idea, but it's nice to have the option.
It is a workaround for FreeBSD calling the PAM session end earier than
the last process is actually done touching the mount, for example.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Felix Dörre <felix@dogcraft.de>
Signed-off-by: Val Packett <val@packett.cool>
Closes #14834