git.proxmox.com Git - mirror_zfs.git/log
mirror_zfs.git
15 months ago ZIL: Fix race introduced by f63811f0721.
Alexander Motin [Fri, 9 Jun 2023 17:08:05 +0000 (13:08 -0400)]
ZIL: Fix race introduced by f63811f0721.

We are not allowed to access lwb after setting LWB_STATE_FLUSH_DONE
state and dropping zl_lock, since it may be freed by zil_sync().
To free itxs and waiters after dropping the lock we need to move
lwb_itxs and lwb_waiters lists elements to local storage.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14957
Closes #14959
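
The pattern this commit message describes (detach the lists into local storage while the lock is held, then free from the local copy) can be sketched as follows. This is a minimal illustration, not the OpenZFS code: `block_t`, `item_t`, `block_push()` and `block_flush_done()` are hypothetical stand-ins for the lwb, its itx list, and the flush-done path.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins for an itx and an lwb; not the OpenZFS structs. */
typedef struct item {
	struct item *next;
} item_t;

typedef struct block {
	pthread_mutex_t *lock;	/* plays the role of zl_lock */
	item_t *items;		/* plays the role of lwb_itxs */
	int done;		/* plays the role of LWB_STATE_FLUSH_DONE */
} block_t;

/* Push a freshly allocated item onto the block's list. */
static void
block_push(block_t *blk)
{
	item_t *it = malloc(sizeof (*it));

	assert(it != NULL);
	it->next = blk->items;
	blk->items = it;
}

/*
 * Mark the block done and free its items; returns the number freed.
 * Once the lock is dropped, blk is never touched again, because another
 * thread may free it as soon as it observes the done state.
 */
static int
block_flush_done(block_t *blk)
{
	pthread_mutex_t *lock = blk->lock;
	item_t *local;
	int freed = 0;

	pthread_mutex_lock(lock);
	local = blk->items;	/* move the whole list to local storage */
	blk->items = NULL;
	blk->done = 1;
	pthread_mutex_unlock(lock);

	/* Only the local list head is used from here on. */
	while (local != NULL) {
		item_t *next = local->next;

		free(local);
		local = next;
		freed++;
	}
	return (freed);
}
```

The lock pointer is also copied to a local before the state change, so the unlock itself does not need to read from the block.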

15 months ago Revert "systemd: Use non-absolute paths in Exec* lines"
Rich Ercolani [Wed, 7 Jun 2023 18:14:05 +0000 (14:14 -0400)]
Revert "systemd: Use non-absolute paths in Exec* lines"

This reverts commit 79b20949b25c8db4d379f6486b0835a6613b480c since it
doesn't work with the systemd version shipped with RHEL7-based systems.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #14943
Closes #14945

15 months ago Linux: Never sleep in kmem_cache_alloc(..., KM_NOSLEEP) (#14926)
Brian Behlendorf [Wed, 7 Jun 2023 17:43:43 +0000 (10:43 -0700)]
Linux: Never sleep in kmem_cache_alloc(..., KM_NOSLEEP) (#14926)

When a kmem cache is exhausted and needs to be expanded a new
slab is allocated.  KM_SLEEP callers can block and wait for the
allocation, but KM_NOSLEEP callers were incorrectly allowed to
block as well.

Resolve this by attempting an emergency allocation as a best
effort.  This may fail but that's fine since any KM_NOSLEEP
consumer is required to handle an allocation failure.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>

15 months ago Fix the L2ARC write size calculating logic
George Amanakis [Tue, 6 Jun 2023 19:32:37 +0000 (21:32 +0200)]
Fix the L2ARC write size calculating logic

l2arc_write_size() should return the write size after adjusting for trim
and overhead of the L2ARC log blocks. Also take into account the
allocated size of log blocks when deciding when to stop writing buffers
to L2ARC.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14939

15 months ago zdb: add -B option to generate backup stream
Rob Norris [Wed, 15 Mar 2023 07:18:10 +0000 (18:18 +1100)]
zdb: add -B option to generate backup stream

This is more-or-less like `zfs send`, but specifying the snapshot by its
objset id for situations where it can't be referenced any other way.

Sponsored-By: Klara, Inc.
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: WHR <msl0000023508@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #14642

15 months ago znode: expose zfs_get_zplprop to libzpool
Rob Norris [Sun, 4 Jun 2023 01:14:20 +0000 (11:14 +1000)]
znode: expose zfs_get_zplprop to libzpool

There's no particular reason this function should be kernel-only, and I
want to use it (indirectly) from zdb. I've moved it to zfs_znode.c
because libzpool does not compile in zfs_vfsops.c, and this at least
matches the header it's imported from.

Sponsored-By: Klara, Inc.
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: WHR <msl0000023508@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #14642

15 months ago Introduce zfs_refcount_(add|remove)_few().
Alexander Motin [Mon, 5 Jun 2023 18:51:44 +0000 (14:51 -0400)]
Introduce zfs_refcount_(add|remove)_few().

There are two places where we need to add/remove several references
with semantics of zfs_refcount_(add|remove). But when debug/tracing
is disabled, it is a crime to run multiple atomic_inc() in a loop,
especially under congested pool-wide allocator lock.

The newly introduced functions implement the same semantics as the loop,
but without the overhead in production builds.

Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14934
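
The cost difference described above can be illustrated with C11 atomics. This is only a sketch of the idea (one atomic add of N instead of N atomic increments); `counter_t`, `counter_add_loop()` and `counter_add_few()` are hypothetical names, and the real zfs_refcount functions also carry debug/tracing state that is not modelled here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical counter; not the zfs_refcount_t layout. */
typedef struct counter {
	atomic_uint_fast64_t count;
} counter_t;

/* The pattern being replaced: n separate atomic read-modify-write ops. */
static uint64_t
counter_add_loop(counter_t *c, uint64_t n)
{
	for (uint64_t i = 0; i < n; i++)
		atomic_fetch_add(&c->count, 1);
	return (atomic_load(&c->count));
}

/* Same final count, but a single atomic operation; returns the new value. */
static uint64_t
counter_add_few(counter_t *c, uint64_t n)
{
	return (atomic_fetch_add(&c->count, n) + n);
}
```

Under a contended lock, the single-operation variant shortens the time the lock is held, which is the motivation the commit message gives.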

15 months ago Linux 6.3 compat: META (#14930)
Brian Behlendorf [Mon, 5 Jun 2023 18:08:24 +0000 (11:08 -0700)]
Linux 6.3 compat: META (#14930)

Update the META file to reflect compatibility with the 6.3 kernel.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>

15 months ago zfs-create(8): ZFS for swap: caution, clarity
Graham Perrin [Fri, 2 Jun 2023 18:25:13 +0000 (19:25 +0100)]
zfs-create(8): ZFS for swap: caution, clarity

Make the section heading more generic (the section relates to ZFS files
as well as ZFS volumes).

Swapping to a ZFS volume is prone to deadlock. Remove the related
instruction, direct readers to OpenZFS FAQ. Related, but not linked
from within the manual page:

<https://openzfs.github.io/openzfs-docs/Project%20and%20Community/FAQ.html#using-a-zvol-for-a-swap-device-on-linux>
(Using a zvol for a swap device on Linux).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Graham Perrin <grahamperrin@freebsd.org>
Issue #7734
Closes #14756

15 months ago ZIL: Allow to replay blocks of any size.
Alexander Motin [Fri, 2 Jun 2023 18:01:58 +0000 (14:01 -0400)]
ZIL: Allow to replay blocks of any size.

There seems to be no reason for ZIL blocks to be limited to 128KB
other than the way the replay code is written.  This change does
not increase the limit yet, it just removes the artificial limitation.

Avoiding the extra memcpy() may save us a second during replay.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14910

15 months ago PAM: enable testing on FreeBSD
Val Packett [Thu, 11 May 2023 21:16:57 +0000 (18:16 -0300)]
PAM: enable testing on FreeBSD

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Felix Dörre <felix@dogcraft.de>
Signed-off-by: Val Packett <val@packett.cool>
Closes #14834

15 months ago PAM: support password changes even when not mounted
Val Packett [Sat, 6 May 2023 01:17:12 +0000 (22:17 -0300)]
PAM: support password changes even when not mounted

There's usually no requirement that a user be logged in for changing
their password, so let's not be surprising here.

We need to use the fetch_lazy mechanism for the old password to avoid
a double prompt for it, so that mechanism is now generalized a bit.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Felix Dörre <felix@dogcraft.de>
Signed-off-by: Val Packett <val@packett.cool>
Closes #14834

15 months ago PAM: add 'uid_min' and 'uid_max' options for changing the uid range
Val Packett [Sat, 6 May 2023 01:34:58 +0000 (22:34 -0300)]
PAM: add 'uid_min' and 'uid_max' options for changing the uid range

Instead of a fixed >=1000 check, allow the configuration to override
the minimum UID and add a maximum one as well. While here, add the
uid range check to the authenticate method as well, and fix the return
in the chauthtok method (seems very wrong to report success when we've
done absolutely nothing).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Felix Dörre <felix@dogcraft.de>
Signed-off-by: Val Packett <val@packett.cool>
Closes #14834
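
A configurable range check of the kind described above might look like the sketch below. Only the option names uid_min and uid_max come from the commit message; the struct, the function name, and the choice of inclusive bounds are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical config holder; only the option names are from the commit. */
typedef struct pam_cfg {
	unsigned long uid_min;
	unsigned long uid_max;
} pam_cfg_t;

/*
 * Return true when uid falls inside the configured range.
 * Both bounds are treated as inclusive here, which is an assumption.
 */
static bool
uid_in_range(const pam_cfg_t *cfg, unsigned long uid)
{
	return (uid >= cfg->uid_min && uid <= cfg->uid_max);
}
```

Applying the same helper in every entry point (authenticate, session, password change) keeps the range policy consistent, which is the point of the "while here" part of the commit.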

15 months ago PAM: add 'forceunmount' flag
Val Packett [Sat, 6 May 2023 01:02:13 +0000 (22:02 -0300)]
PAM: add 'forceunmount' flag

Probably not always a good idea, but it's nice to have the option.
It is a workaround for FreeBSD calling the PAM session end earlier than
the last process is actually done touching the mount, for example.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Felix Dörre <felix@dogcraft.de>
Signed-off-by: Val Packett <val@packett.cool>
Closes #14834

15 months ago PAM: add 'recursive_homes' flag to use with 'prop_mountpoint'
Val Packett [Fri, 5 May 2023 22:35:57 +0000 (19:35 -0300)]
PAM: add 'recursive_homes' flag to use with 'prop_mountpoint'

It's not always desirable to have a fixed flat homes directory.
With the 'recursive_homes' flag, 'prop_mountpoint' search would
traverse the whole tree starting at 'homes' (which can now be '*'
to mean all pools) to find a dataset with a mountpoint matching
the home directory.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Felix Dörre <felix@dogcraft.de>
Signed-off-by: Val Packett <val@packett.cool>
Closes #14834

15 months ago PAM: use boolean_t for config flags
Val Packett [Sat, 6 May 2023 00:56:39 +0000 (21:56 -0300)]
PAM: use boolean_t for config flags

Since we already use boolean_t in the file, we can use it here.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Felix Dörre <felix@dogcraft.de>
Signed-off-by: Val Packett <val@packett.cool>
Closes #14834

15 months ago PAM: do not fail to mount if the key's already loaded
Val Packett [Fri, 5 May 2023 23:00:48 +0000 (20:00 -0300)]
PAM: do not fail to mount if the key's already loaded

If we're expecting a working home directory on login, it would be
rather frustrating to not have it mounted just because it e.g. failed to
unmount once on logout.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Felix Dörre <felix@dogcraft.de>
Signed-off-by: Val Packett <val@packett.cool>
Closes #14834

15 months ago Revert "initramfs: use `mount.zfs` instead of `mount`"
Rich Ercolani [Wed, 31 May 2023 23:58:41 +0000 (19:58 -0400)]
Revert "initramfs: use `mount.zfs` instead of `mount`"

This broke mounting of snapshots on / for users.

See https://github.com/openzfs/zfs/issues/9461#issuecomment-1376162949 for more context.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #14908

15 months ago Fix NULL pointer dereference when doing concurrent 'send' operations
Luís Henriques [Tue, 30 May 2023 22:15:24 +0000 (23:15 +0100)]
Fix NULL pointer dereference when doing concurrent 'send' operations

A NULL pointer dereference will occur when doing a 'zfs send -S' on a dataset that
is still being received.  The problem is that the new 'send' will
rightfully fail to own the datasets (i.e. dsl_dataset_own_force() will
fail), but then dmu_send() will still do the dsl_dataset_disown().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Luís Henriques <henrix@camandro.org>
Closes #14903
Closes #14890

15 months ago ZTS: zvol_misc_trim disable blk mq
Brian Behlendorf [Mon, 29 May 2023 19:55:35 +0000 (12:55 -0700)]
ZTS: zvol_misc_trim disable blk mq

Disable the zvol_misc_fua.ksh and zvol_misc_trim.ksh test cases on impacted
kernels.  This issue is being actively worked in #14872 and as part of that
fix this commit will be reverted.

    VERIFY(zh->zh_claim_txg == 0) failed
    PANIC at zil.c:904:zil_create()

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #14872
Closes #14870

15 months ago Use __attribute__((malloc)) on memory allocation functions
Richard Yao [Fri, 26 May 2023 22:47:52 +0000 (18:47 -0400)]
Use __attribute__((malloc)) on memory allocation functions

This informs the C compiler that pointers returned from these functions
do not alias other pointers, which allows it to do better code
optimization and should make the compiled code smaller.

References:
https://stackoverflow.com/a/53654773
https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-malloc-function-attribute
https://clang.llvm.org/docs/AttributeReference.html#malloc

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14827
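
As a small illustration of the attribute discussed above, the sketch below annotates a hypothetical allocation wrapper. `zeroed_alloc()`, `store_both()` and the `ATTR_MALLOC` macro are made-up names for this example and are not taken from the OpenZFS tree; the attribute itself is the documented GCC/Clang `malloc` function attribute.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#if defined(__GNUC__) || defined(__clang__)
#define	ATTR_MALLOC	__attribute__((malloc))
#else
#define	ATTR_MALLOC	/* no-op on other compilers */
#endif

/* Hypothetical wrapper: the returned pointer aliases no existing pointer. */
static void *zeroed_alloc(size_t size) ATTR_MALLOC;

static void *
zeroed_alloc(size_t size)
{
	void *p = malloc(size);

	if (p != NULL)
		memset(p, 0, size);
	return (p);
}

/*
 * Because fresh cannot alias existing, the compiler is allowed to assume
 * that the store through fresh does not change *existing.
 */
static int
store_both(int *existing)
{
	int *fresh = zeroed_alloc(sizeof (int));
	int value;

	if (fresh == NULL)
		return (-1);
	*existing = 1;
	*fresh = 2;
	value = *existing;	/* may be folded to the constant 1 */
	free(fresh);
	return (value);
}
```

The observable behaviour is the same with or without the attribute; it only gives the optimizer more freedom.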

15 months ago ZTS: Add zpool_resilver_concurrent exception
Brian Behlendorf [Fri, 26 May 2023 22:39:23 +0000 (15:39 -0700)]
ZTS: Add zpool_resilver_concurrent exception

The zpool_resilver_concurrent test case requires the ZED which is not used
on FreeBSD.  Add this test to the known list of skipped tests for FreeBSD.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14904

15 months ago Add compatibility symlinks for FreeBSD 12.{3,4} and 13.{0,1,2}
Mike Swanson [Fri, 26 May 2023 22:37:15 +0000 (15:37 -0700)]
Add compatibility symlinks for FreeBSD 12.{3,4} and 13.{0,1,2}

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mike Swanson <mikeonthecomputer@gmail.com>
Closes #14902

15 months ago Adding new read-only compatible zpool features to compatibility.d/grub2
Colm [Fri, 26 May 2023 17:04:19 +0000 (10:04 -0700)]
Adding new read-only compatible zpool features to compatibility.d/grub2

GRUB2 is compatible with all "read-only compatible" features,
so it is safe to add new features of this type to the grub2
compatibility list. We generally want to include all compatible
features, to minimize the differences between grub2-compatible
pools and no-compatibility pools.

Adding new properties `livelist` and `zpool_checkpoint` accordingly.

Also adding them to the man page which references this file as an
example, for consistency.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Colm Buckley <colm@tuatha.org>
Closes #14893

15 months ago btree: Implement faster binary search algorithm
Richard Yao [Fri, 26 May 2023 17:03:12 +0000 (13:03 -0400)]
btree: Implement faster binary search algorithm

This implements a binary search algorithm for B-Trees that reduces
branching to the absolute minimum necessary for a binary search
algorithm. It also enables the compiler to inline the comparator to
ensure that the only slowdown when doing binary search is from waiting
for memory accesses. Additionally, it instructs the compiler to unroll
the loop, which gives an additional 40% improvement with Clang and 8%
improvement with GCC.

Consumers must opt into using the faster algorithm. At present, only
B-Trees used inside kernel code have been modified to use the faster
algorithm.

Micro-benchmarks suggest that this can improve binary search performance
by up to 3.5 times when compiling with Clang 16 and up to 1.9 times when
compiling with GCC 12.2.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14866
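
To give a feel for what "reducing branching" means here, the sketch below shows a well-known branch-reduced lower-bound search over a sorted array. It is not the algorithm from this commit and makes no claim about the speedups quoted above; `lower_bound_u32()` is a hypothetical name, and the comparison is written inline rather than through a comparator callback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the index of the first element >= key in the sorted array a[0..n),
 * or n if every element is smaller. The loop body contains no
 * data-dependent branch: the comparison result is used as a multiplier,
 * which compilers typically turn into a conditional move.
 */
static size_t
lower_bound_u32(const uint32_t *a, size_t n, uint32_t key)
{
	const uint32_t *base = a;
	size_t len = n;

	while (len > 1) {
		size_t half = len / 2;

		/* Skip the lower half when its last element is < key. */
		base += (size_t)(base[half - 1] < key) * half;
		len -= half;
	}
	/* One candidate remains (or none, when n == 0). */
	return ((size_t)(base - a) + (size_t)(len == 1 && *base < key));
}
```

The number of iterations depends only on n, not on the data, which is also what makes loop unrolling effective for fixed-size nodes.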

15 months ago Fix inconsistent definition of zfs_scrub_error_blocks_per_txg
George Amanakis [Fri, 26 May 2023 16:53:00 +0000 (18:53 +0200)]
Fix inconsistent definition of zfs_scrub_error_blocks_per_txg

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14894

15 months ago Add missing files to Debian DKMS package
Damiano Albani [Thu, 25 May 2023 23:10:54 +0000 (01:10 +0200)]
Add missing files to Debian DKMS package

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Umer Saleem <usaleem@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Damiano Albani <damiano.albani@gmail.com>
Closes #14887
Closes #14889

15 months ago Update compatibility.d files
Brian Behlendorf [Thu, 25 May 2023 20:53:08 +0000 (13:53 -0700)]
Update compatibility.d files

Add an openzfs-2.2 compatibility file for the next release.

Edon-R support has been enabled for FreeBSD removing the need
for different FreeBSD and Linux files.  Symlinks for the -linux
and -freebsd names are created for any scripts expecting that
convention.

Additionally, a symlink for ubuntu-22.04 was added.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14833

15 months ago zil: Add some more statistics.
Alexander Motin [Thu, 25 May 2023 20:51:53 +0000 (16:51 -0400)]
zil: Add some more statistics.

In addition to the number of actual log bytes written, also account for
the total bytes written including padding and the total bytes allocated
(bytes <= write <= alloc).  This should make it possible to monitor ZIL
traffic and space efficiency.

Add dtrace probe for zil block size selection.

Make zilstat report more information and fit it into less width.

Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14863

15 months ago ZIL: Reduce scope of per-dataset zl_issuer_lock.
Alexander Motin [Thu, 25 May 2023 16:48:43 +0000 (12:48 -0400)]
ZIL: Reduce scope of per-dataset zl_issuer_lock.

Before this change ZIL copied all log data while holding the lock.
It caused huge lock contention on workloads with many big parallel
writes.  This change splits the process into two parts: first,
zil_lwb_assign() estimates the log space needed for all transactions,
and zil_lwb_write_close() allocates blocks and zios while holding the
lock, then, after the lock is dropped, zil_lwb_commit() copies the
data, and zil_lwb_write_issue() issues the I/Os.

Also, while there, slightly reduce the scope of zl_lock.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14841
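
The two-phase split described above (reserve space while holding the lock, copy the data after dropping it) can be sketched in miniature. This is an illustrative model only: `logbuf_t`, `logbuf_reserve()` and `logbuf_append()` are hypothetical and much simpler than the zil_lwb_*() functions named in the commit message.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-size log buffer; not an OpenZFS structure. */
typedef struct logbuf {
	pthread_mutex_t lock;
	size_t used;
	char data[256];
} logbuf_t;

/* Phase 1: under the lock, only claim a region; no data is copied. */
static char *
logbuf_reserve(logbuf_t *lb, size_t len)
{
	char *dst = NULL;

	pthread_mutex_lock(&lb->lock);
	if (len <= sizeof (lb->data) - lb->used) {
		dst = lb->data + lb->used;
		lb->used += len;
	}
	pthread_mutex_unlock(&lb->lock);
	return (dst);
}

/* Phase 2: the potentially large copy runs outside the lock. */
static int
logbuf_append(logbuf_t *lb, const void *src, size_t len)
{
	char *dst = logbuf_reserve(lb, len);

	if (dst == NULL)
		return (-1);	/* not enough space left */
	memcpy(dst, src, len);
	return (0);
}
```

Because each writer owns its reserved region exclusively, concurrent copies do not need to serialize on the lock; only the short bookkeeping step does.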

16 months ago systemd: Use non-absolute paths in Exec* lines
Dimitri John Ledkov [Wed, 24 May 2023 19:31:28 +0000 (20:31 +0100)]
systemd: Use non-absolute paths in Exec* lines

Since systemd v239, Exec* binaries are resolved from PATH when they
are not-absolute. Switch to this by default for ease of downstream
maintenance. Many downstream distributions move individual binaries
to locations that existing compile-time configurations cannot
accommodate.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Dimitri John Ledkov <dimitri.ledkov@canonical.com>
Closes #14880

16 months ago Fix concurrent resilvers initiated at same time
Akash B [Wed, 24 May 2023 19:28:09 +0000 (00:58 +0530)]
Fix concurrent resilvers initiated at same time

For draid vdevs it was possible to initiate both the
sequential and healing resilver at same time.

This fixes the following two scenarios.
1) There's a window where a sequential rebuild can be started via ZED
   even if a healing resilver has been scheduled.
   - This is fixed by adding an additional check in spa_vdev_attach()
     for any scheduled resilver and returning an appropriate error code
     when a resilver is already in progress.

2) It was possible for zpool clear to start a healing resilver when it
   wasn't needed at all. This occurs because during a vdev_open() the
   device is presumed to be healthy until it is validated by
   vdev_validate() and set unavailable. However, by this point an async
   resilver will have already been requested if the DTL isn't empty.
   - This is fixed by cancelling the SPA_ASYNC_RESILVER request
     immediately at the end of vdev_reopen() when a resilver is unneeded.

Finally, added a testcase in ZTS for verification.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Dipak Ghosh <dipak.ghosh@hpe.com>
Signed-off-by: Akash B <akash-b@hpe.com>
Closes #14881
Closes #14892

16 months ago Linux 6.4 compat: reclaimed_slab renamed to reclaimed
youzhongyang [Wed, 24 May 2023 19:23:42 +0000 (15:23 -0400)]
Linux 6.4 compat: reclaimed_slab renamed to reclaimed

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes #14891

16 months ago Hold db_mtx when updating db_state
Brian Atkinson [Fri, 19 May 2023 20:05:53 +0000 (16:05 -0400)]
Hold db_mtx when updating db_state

Commit 555ef90 did some general code refactoring for
dmu_buf_will_not_fill() and dmu_buf_will_fill(). However, the db_mtx was
not held when updating db->db_state in those code blocks. The rest of the
dbuf code always holds the db_mtx when updating db_state. This is
important because cv_wait() db_changed is used to check for db_state
changes.

Updating dmu_buf_will_not_fill() and dmu_buf_will_fill() to hold the
db_mtx when updating db_state.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #14875
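
Why the mutex matters for a state variable that is waited on with a condition variable can be shown with POSIX threads. The sketch below is a generic illustration, assuming pthread primitives in place of the kernel's cv_wait() and mutex API; `buf_t` and its functions are hypothetical names, not the dbuf code.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical buffer with a state guarded by a mutex and condvar. */
typedef struct buf {
	pthread_mutex_t mtx;		/* plays the role of db_mtx */
	pthread_cond_t changed;		/* plays the role of db_changed */
	int state;			/* plays the role of db_state */
} buf_t;

static void
buf_init(buf_t *b)
{
	pthread_mutex_init(&b->mtx, NULL);
	pthread_cond_init(&b->changed, NULL);
	b->state = 0;
}

/*
 * Update the state with the mutex held. Without the lock, a waiter could
 * test the old state, then miss the broadcast before it starts waiting.
 */
static void
buf_set_state(buf_t *b, int state)
{
	pthread_mutex_lock(&b->mtx);
	b->state = state;
	pthread_cond_broadcast(&b->changed);
	pthread_mutex_unlock(&b->mtx);
}

/* Block until the state equals wanted; returns the observed state. */
static int
buf_wait_state(buf_t *b, int wanted)
{
	int seen;

	pthread_mutex_lock(&b->mtx);
	while (b->state != wanted)
		pthread_cond_wait(&b->changed, &b->mtx);
	seen = b->state;
	pthread_mutex_unlock(&b->mtx);
	return (seen);
}
```

The waiter's check and its transition to sleeping are atomic with respect to the mutex, so an updater that also holds the mutex cannot slip in between them.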

16 months ago Probe vdevs before marking removed
Brian Behlendorf [Fri, 19 May 2023 20:05:09 +0000 (13:05 -0700)]
Probe vdevs before marking removed

Before allowing the ZED to mark a vdev as REMOVED due to a
hotplug event confirm that it is non-responsive with probe.
Any device which can be successfully probed should be left
ONLINE to prevent a healthy pool from being incorrectly
SUSPENDED.  This may occur for at least the following two
scenarios.

1) Drive expansion (zpool online -e) in VMware environments.
   If, during the partition resize operation, a partition is
   removed and re-created then udev will send a removed event.

2) Re-scanning the namespaces of an NVMe device (nvme ns-rescan)
   may result in a udev remove and add event being delivered.

Finally, update the ZED to only kick in a spare when the
removal was successful.

Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #14859
Closes #14861

16 months ago Teach zpool scrub to scrub only blocks in error log
George Amanakis [Fri, 17 Dec 2021 20:35:28 +0000 (21:35 +0100)]
Teach zpool scrub to scrub only blocks in error log

Added a flag '-e' in zpool scrub to scrub only blocks in error log. A
user can pause, resume and cancel the error scrub by passing additional
command line arguments -p -s just like a regular scrub. This involves
adding a new flag, creating new libzfs interfaces, a new ioctl, and the
actual iteration and read-issuing logic. Error scrubbing is executed in
multiple txgs to make sure pool performance is not affected.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Co-authored-by: TulsiJain <tulsi.jain@delphix.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #8995
Closes #12355

16 months ago Add the ability to uninitialize
Brian Behlendorf [Thu, 18 May 2023 17:02:20 +0000 (10:02 -0700)]
Add the ability to uninitialize

zpool initialize functions well for touching every free byte...once.
But if we want to do it again, we're currently out of luck.

So let's add zpool initialize -u to clear it.

Co-authored-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #12451
Closes #14873

16 months ago test-runner: pass kmemleak and kmsg to Cmd.run
Antonio Russo [Mon, 15 May 2023 23:11:33 +0000 (17:11 -0600)]
test-runner: pass kmemleak and kmsg to Cmd.run

test-runner.py orchestrates all of the ZTS executions. The `Cmd` object
manages these processes, and its `run` method specifically invokes these
possibly long-running processes, possibly retrying in the event of a
timeout. Since its inception, memory leak detection using the kmemleak
infrastructure [1], and kernel logging [2] have been added to this run
mechanism.

However, the callback to cull a process beyond its timeout threshold,
`kill_cmd`, has evaded modernization by both of these changes. As a
result, this function fails to properly invoke `run`, leading to an
untrapped exception and unreported test failure.

This patch extends `kill_cmd` to receive these kernel devices through
the `options` parameter, and regularizes all the `.run` calls from
`Cmd`, and its subclasses, to accept that parameter.

[1] Commit a69765ea5b563e0cd4d15fac4b1ac08c6ccf12d1
[2] Commit fc2c0256c55a2859d1988671b0896d22b75c8aba

Reviewed-by: John Wren Kennedy <john.kennedy@delphix.com>
Signed-off-by: Antonio Russo <aerusso@aerusso.net>
Closes #14849

16 months ago Fix undefined behavior in spa_sync_props()
Richard Yao [Fri, 12 May 2023 21:10:14 +0000 (17:10 -0400)]
Fix undefined behavior in spa_sync_props()

8eae2d214cfa53862833eeeda9a5c1e9d5ded47d caused Coverity to begin
complaining about "Improper use of negative value" in two places in
spa_sync_props() because Coverity correctly inferred from `prop ==
ZPOOL_PROP_INVAL` that prop could be -1 while both zpool_prop_to_name()
and zpool_prop_get_type() use it as an array index, which is undefined
behavior.

Assuming that the system does not panic from an attempt to read invalid
memory, the case statement for ZPOOL_PROP_INVAL will ensure that only
user properties will reach this code when prop is ZPOOL_PROP_INVAL, such
that execution will continue safely. However, if we are unlucky enough
to read invalid memory, then the system will panic.

This issue predates the patch that caused Coverity to begin complaining.
Thankfully, our userland tools do not pass nonsense to us, so this bug
should not be triggered unless a future userland tool attempts to set a
property that we do not understand.

Reported-by: Coverity (CID-1561129)
Reported-by: Coverity (CID-1561130)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14860
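
The class of bug described above (an "invalid" enum value of -1 reaching a table lookup) can be shown with a small, self-contained example. The enum, the table and `prop_to_name()` below are hypothetical and only mirror the shape of the problem; they are not the zpool property code or the fix that was applied.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical property enum with an "invalid" sentinel of -1. */
typedef enum prop {
	PROP_INVAL = -1,
	PROP_NAME = 0,
	PROP_SIZE,
	PROP_COUNT
} prop_t;

static const char *const prop_names[PROP_COUNT] = { "name", "size" };

/*
 * Look up a property name. Indexing prop_names[] with PROP_INVAL would
 * read before the start of the array, which is undefined behavior, so
 * out-of-range values are rejected before the table is touched.
 */
static const char *
prop_to_name(prop_t prop)
{
	int idx = (int)prop;

	if (idx < 0 || idx >= (int)PROP_COUNT)
		return (NULL);
	return (prop_names[idx]);
}
```

Callers then have to handle the NULL result explicitly, which makes the "user property" path visible instead of relying on whatever memory happens to precede the table.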

16 months ago Fix use after free regression in spa_remove_healed_errors()
Richard Yao [Fri, 12 May 2023 20:47:56 +0000 (16:47 -0400)]
Fix use after free regression in spa_remove_healed_errors()

6839ec6f1098c28ff7b772f1b31b832d05e6b567 placed code in
spa_remove_healed_errors() that uses a pointer after the kmem_free()
call that frees it.

Reported-by: Coverity (CID-1562375)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14860

16 months ago zil: Free lwb_buf after write completion.
Alexander Motin [Fri, 12 May 2023 16:49:26 +0000 (12:49 -0400)]
zil: Free lwb_buf after write completion.

There is no point in keeping that memory allocated during the flush.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14855

16 months ago zil: Some micro-optimizations.
Alexander Motin [Fri, 12 May 2023 16:14:29 +0000 (12:14 -0400)]
zil: Some micro-optimizations.

Should not cause functional changes.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14854

16 months ago Refine special_small_blocks property validation
Don Brady [Fri, 12 May 2023 16:12:28 +0000 (10:12 -0600)]
Refine special_small_blocks property validation

When the special_small_blocks property is being set during a pool
create it enforces a limit of 128KiB even if the pool's record size
is larger.

If the recordsize property is being set during a pool create, then
use that value instead of the default SPA_OLD_MAXBLOCKSIZE value.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #13815
Closes #14811

16 months ago ZTS: Add auto_replace_001_pos to exceptions
Brian Behlendorf [Fri, 12 May 2023 16:07:58 +0000 (09:07 -0700)]
ZTS: Add auto_replace_001_pos to exceptions

The auto_replace_001_pos test case does not reliably pass on
Fedora 37 and newer.  Until the test case can be updated to make
it reliable add it to the list of "maybe" exceptions on Linux.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #14851
Closes #14852

16 months ago Make sure we are not trying to clone a spill block.
Pawel Jakub Dawidek [Wed, 10 May 2023 05:32:30 +0000 (22:32 -0700)]
Make sure we are not trying to clone a spill block.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14825

16 months ago Correct comment.
Pawel Jakub Dawidek [Thu, 4 May 2023 23:14:19 +0000 (16:14 -0700)]
Correct comment.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14825

16 months ago Remove badly placed comment.
Pawel Jakub Dawidek [Thu, 4 May 2023 06:25:22 +0000 (23:25 -0700)]
Remove badly placed comment.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14825

16 months ago Don't call zfs_exit_two() before zfs_enter_two().
Pawel Jakub Dawidek [Wed, 3 May 2023 07:24:47 +0000 (00:24 -0700)]
Don't call zfs_exit_two() before zfs_enter_two().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14825

16 months ago Don't use dmu_buf_is_dirty() for unassigned transaction.
Pawel Jakub Dawidek [Tue, 2 May 2023 22:46:14 +0000 (15:46 -0700)]
Don't use dmu_buf_is_dirty() for unassigned transaction.

The dmu_buf_is_dirty() call doesn't make sense here for two reasons:
1. txg is 0 for unassigned tx, so it was a no-op.
2. It is equivalent to checking whether we have dirty records, which we
   already do a few lines earlier.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14825

16 months ago Deny block cloning if dbuf size doesn't match BP size.
Pawel Jakub Dawidek [Tue, 2 May 2023 21:24:43 +0000 (14:24 -0700)]
Deny block cloning if dbuf size doesn't match BP size.

I don't know an easy way to shrink down dbuf size, so just deny block cloning
into dbufs that don't match our BP's size.

This fixes the following situation:
1. Create a small file, eg. 1kB of random bytes. Its dbuf will be 1kB.
2. Create a larger file, eg. 2kB of random bytes. Its dbuf will be 2kB.
3. Truncate the large file to 0. Its dbuf will remain 2kB.
4. Clone the small file into the large file. Small file's BP lsize is
   1kB, but the large file's dbuf is 2kB.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14825

16 months ago Additional block cloning fixes.
Pawel Jakub Dawidek [Sun, 30 Apr 2023 09:47:09 +0000 (02:47 -0700)]
Additional block cloning fixes.

Reimplement some of the block cloning vs dbuf logic, mostly to fix
situation where we clone a block and in the same transaction group
we want to partially overwrite the clone.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14825

16 months ago zil: Don't expect zio_shrink() to succeed.
Alexander Motin [Thu, 11 May 2023 21:27:12 +0000 (17:27 -0400)]
zil: Don't expect zio_shrink() to succeed.

At least for RAIDZ zio_shrink() does not reduce zio size, but reduced
wsz in that case likely results in writing uninitialized memory.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14853

16 months ago Prevent panic during concurrent snapshot rollback and zvol read
Ameer Hamza [Wed, 10 May 2023 00:56:35 +0000 (05:56 +0500)]
Prevent panic during concurrent snapshot rollback and zvol read

Protect zvol_cdev_read with zv_suspend_lock to prevent concurrent
release of the dnode, avoiding panic when a snapshot is rolled back
in parallel during ongoing zvol read operation.

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #14839

16 months ago pam: Fix "buffer overflow" in pam ZTS tests on F38
Tony Hutter [Wed, 10 May 2023 00:55:19 +0000 (17:55 -0700)]
pam: Fix "buffer overflow" in pam ZTS tests on F38

The pam ZTS tests were reporting a buffer overflow on F38, possibly
due to F38 now setting _FORTIFY_SOURCE=3 by default.  gdb and
valgrind narrowed this down to a snprintf() buffer overflow in
zfs_key_config_modify_session_counter().  I'm not clear why this
particular snprintf() was being flagged as an overflow, but when
I replaced it with an asprintf(), the test passed reliably.
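
A minimal sketch of the size-then-allocate idea behind asprintf(), not
the actual change to zfs_key_config_modify_session_counter(). Because
asprintf() is a GNU/BSD extension, the sketch gets the same effect with
two standard snprintf() calls. The helper name `format_counter_path`
and the "<dir>/<uid>" layout are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: build "<dir>/<uid>" in a heap buffer that is
 * sized from the formatted length, so there is no fixed-size
 * destination to overflow. The caller must free() the result.
 */
static char *
format_counter_path(const char *dir, unsigned int uid)
{
	assert(dir != NULL);

	/* First pass: ask snprintf() how many bytes are needed. */
	int len = snprintf(NULL, 0, "%s/%u", dir, uid);
	if (len < 0)
		return (NULL);

	char *path = malloc((size_t)len + 1);
	if (path == NULL)
		return (NULL);

	/* Second pass: format into a buffer that is exactly large enough. */
	(void) snprintf(path, (size_t)len + 1, "%s/%u", dir, uid);
	return (path);
}
```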

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #14802
Closes #14842

16 months agoAdd dmu_tx_hold_append() interface
Brian Behlendorf [Tue, 9 May 2023 16:03:10 +0000 (09:03 -0700)]
Add dmu_tx_hold_append() interface

Provides an interface which callers can use to declare a write when
the exact starting offset is not yet known.  Since the full range
being updated is not available, only the first L0 block at the
provided offset will be prefetched.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14819

16 months agoDebug auto_replace_001_pos failures
Brian Behlendorf [Tue, 9 May 2023 15:57:02 +0000 (08:57 -0700)]
Debug auto_replace_001_pos failures

Reduced the timeout to 60 seconds which should be more than
sufficient and allow the test to be marked as FAILED rather
than KILLED.  Also dump the pool status on cleanup.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14829

16 months agoRemove duplicate code in l2arc_evict()
George Amanakis [Tue, 9 May 2023 15:54:41 +0000 (17:54 +0200)]
Remove duplicate code in l2arc_evict()

l2arc_evict() performs the adjustment of the size of buffers to be
written on L2ARC unnecessarily. l2arc_write_size() is called right
before l2arc_evict() and performs those adjustments.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14828

16 months agoRemove single parent assertion from zio_nowait().
Alexander Motin [Tue, 9 May 2023 15:54:01 +0000 (11:54 -0400)]
Remove single parent assertion from zio_nowait().

We only need to know if ZIO has any parent there.  We do not care if
it has more than one, but use of zio_unique_parent() == NULL asserts
that.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14823

16 months agoEnable the head_errlog feature to remove errors
George Amanakis [Tue, 9 May 2023 15:53:27 +0000 (17:53 +0200)]
Enable the head_errlog feature to remove errors

In case check_filesystem() does not error out and does not report
an error, remove that error block from error lists and logs
without requiring a scrub. This can happen when the original file and
all snapshots/clones referencing it have been removed.

Otherwise zpool status will still report that "Permanent errors have
been detected..." without actually reporting any of them.

To implement this change the functions introduced in corrective
receive were modified to take into account the head_errlog feature.

Before this change:
=============================
  pool: test
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
config:

        NAME                   STATE     READ WRITE CKSUM
        test                   ONLINE       0     0     0
          /home/user/vdev_a    ONLINE       0     0     2

errors: Permanent errors have been detected in the following files:

=============================

After this change:
=============================
  pool: test
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
config:

        NAME                   STATE     READ WRITE CKSUM
        test                   ONLINE       0     0     0
          /home/user/vdev_a    ONLINE       0     0     2

errors: No known data errors
=============================

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14813

16 months agoFixes in head_errlog feature with encryption
George Amanakis [Mon, 8 May 2023 20:35:03 +0000 (22:35 +0200)]
Fixes in head_errlog feature with encryption

For the head_errlog feature use dsl_dataset_hold_obj_flags() instead of
dsl_dataset_hold_obj() in order to enable access to the encryption keys
(if loaded). This enables reporting of errors in encrypted filesystems
which are not mounted but have their keys loaded.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14837

16 months agoVerify block pointers before writing them out
Matthew Ahrens [Mon, 8 May 2023 18:20:23 +0000 (11:20 -0700)]
Verify block pointers before writing them out

If a block pointer is corrupted (but the block containing it checksums
correctly, e.g. due to a bug that overwrites random memory), we can
often detect it before the block is read, with the `zfs_blkptr_verify()`
function, which is used in `arc_read()`, `zio_free()`, etc.

However, such corruption is not typically recoverable.  To recover from
it we would need to detect the memory error before the block pointer is
written to disk.

This PR verifies BP's that are contained in indirect blocks and dnodes
before they are written to disk, in `dbuf_write_ready()`. This way,
we'll get a panic before the on-disk data is corrupted. This will help
us to diagnose what's causing the corruption, as well as being much
easier to recover from.

To minimize performance impact, only checks that can be done without
holding the spa_config_lock are performed.

Additionally, when corruption is detected, the raw words of the block
pointer are logged.  (Note that `dprintf_bp()` is a no-op by default,
but if enabled it is not safe to use with invalid block pointers.)
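
The verify-before-write idea can be sketched as follows. This is an
illustration under stated assumptions, not the code in
`dbuf_write_ready()`: the record layout, `rec_t`, `rec_verify()` and
`rec_write_ready()` are all hypothetical, and the sketch returns an
error where the commit panics.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define	REC_WORDS	4	/* hypothetical record size in 64-bit words */
#define	REC_TYPE_MAX	16	/* hypothetical number of valid type codes */

typedef struct rec {
	uint64_t word[REC_WORDS];
} rec_t;

/*
 * Cheap sanity check that needs no locks: in this made-up layout the
 * low byte of word 0 is a type code. On failure, log the raw words
 * rather than decoding fields that can no longer be trusted.
 */
static int
rec_verify(const rec_t *r)
{
	assert(r != NULL);
	if ((r->word[0] & 0xff) < REC_TYPE_MAX)
		return (1);

	for (int i = 0; i < REC_WORDS; i++) {
		(void) fprintf(stderr, "rec word[%d] = %016" PRIx64 "\n",
		    i, r->word[i]);
	}
	return (0);
}

/*
 * Verify immediately before the write is issued, so corruption is
 * caught while it is still only in memory. Returns 0 if the record
 * may be written and -1 if it must not be.
 */
static int
rec_write_ready(const rec_t *r)
{
	return (rec_verify(r) ? 0 : -1);
}
```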

Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #14817

16 months agozdb: consistent xattr output
Brian Behlendorf [Mon, 8 May 2023 18:17:41 +0000 (11:17 -0700)]
zdb: consistent xattr output

When using zdb to output the value of an xattr only interpret it
as printable characters if the entire byte array is printable.
Additionally, if the --parseable option is set always output the
buffer contents as octal for easy parsing.
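
A minimal sketch of the rule described above, not zdb's actual code.
The function names, the `parseable` flag and the `\ooo` escape format
are assumptions made for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Return 1 only if every byte in buf is a printable character. */
static int
all_printable(const unsigned char *buf, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		if (!isprint(buf[i]))
			return (0);
	}
	return (1);
}

/*
 * Hypothetical formatter: emit the value as text only when the whole
 * byte array is printable and parseable output was not requested;
 * otherwise emit every byte as a \ooo octal escape. Returns the number
 * of characters written (excluding the NUL), or -1 if out is too small.
 */
static int
format_xattr(const unsigned char *buf, size_t len, int parseable,
    char *out, size_t outsz)
{
	size_t need, pos = 0;
	int as_text;

	assert(buf != NULL && out != NULL);
	as_text = !parseable && all_printable(buf, len);
	need = (as_text ? len : 4 * len) + 1;
	if (outsz < need)
		return (-1);

	for (size_t i = 0; i < len; i++) {
		if (as_text) {
			out[pos++] = (char)buf[i];
		} else {
			pos += (size_t)snprintf(out + pos, outsz - pos,
			    "\\%03o", (unsigned int)buf[i]);
		}
	}
	out[pos] = '\0';
	return ((int)pos);
}
```

A single unprintable byte switches the whole value to octal, so a
value is never shown as a mix of text and escapes.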

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14830

16 months agoZTS: add snapshot/snapshot_002_pos exception
Brian Behlendorf [Mon, 8 May 2023 17:09:30 +0000 (10:09 -0700)]
ZTS: add snapshot/snapshot_002_pos exception

Add snapshot_002_pos to the known list of occasional failures
for FreeBSD until it can be made entirely reliable.

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #14831
Closes #14832

16 months agoFix two abd_gang_add_gang() issues.
Alexander Motin [Fri, 5 May 2023 16:17:55 +0000 (12:17 -0400)]
Fix two abd_gang_add_gang() issues.

 - There is no reason to assert that added gang is not empty.  It
   may be weird to add an empty gang, but it is legal.
 - When moving chain list from the added gang clear its size, or it
   will trigger assertion in abd_verify() when that gang is freed.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14816

16 months agoSimplify and optimize random_int_between().
Pawel Jakub Dawidek [Fri, 5 May 2023 16:09:12 +0000 (01:09 +0900)]
Simplify and optimize random_int_between().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14805

16 months agoPlug memory leak in zfsdev_state.
Pawel Jakub Dawidek [Fri, 5 May 2023 15:51:41 +0000 (00:51 +0900)]
Plug memory leak in zfsdev_state.

On kernel module unload, free all zfsdev state structures, except for
zfsdev_state_listhead, which is statically allocated.
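
A sketch of the cleanup pattern in userspace C, assuming a singly
linked list whose head is a static object. `state_t`, `state_add()`
and `state_free_all()` are hypothetical names, not the kernel code.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct state {
	struct state *next;
	int minor;
} state_t;

/* The list head is statically allocated and must never be free()d. */
static state_t state_listhead;

/* Allocate a new entry and link it in right after the static head. */
static int
state_add(int minor)
{
	state_t *s = calloc(1, sizeof (*s));

	if (s == NULL)
		return (-1);
	s->minor = minor;
	s->next = state_listhead.next;
	state_listhead.next = s;
	return (0);
}

/* Free every dynamically allocated entry; returns how many were freed. */
static int
state_free_all(void)
{
	int freed = 0;
	state_t *s = state_listhead.next;	/* skip the static head */

	while (s != NULL) {
		state_t *next = s->next;

		assert(s != &state_listhead);
		free(s);
		freed++;
		s = next;
	}
	state_listhead.next = NULL;
	return (freed);
}
```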

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14824

16 months agozpool import -m also removing spare and cache when log device is missing
Ameer Hamza [Wed, 3 May 2023 22:10:32 +0000 (03:10 +0500)]
zpool import -m also removing spare and cache when log device is missing

spa_import() relies on a pool config fetched by spa_tryimport() for
spare/cache devices. Import flags are not passed to spa_tryimport(),
which makes it return early due to a missing log device, so the cache
and spare devices are never retrieved. Passing
ZFS_IMPORT_MISSING_LOG to spa_tryimport() makes it fetch the correct
configuration regardless of the missing log device.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #14794

16 months agoAllow zhack label repair to restore detached devices.
buzzingwires [Wed, 3 May 2023 16:03:57 +0000 (12:03 -0400)]
Allow zhack label repair to restore detached devices.

This commit expands on the zhack label repair command in d04b5c9 by
adding the -u option to undetach a device by regenerating uberblocks,
in addition to the existing functionality of fixing checksums, now
represented by -c. Previous behavior is retained in the case of no
options.

The changes are heavily inspired by Jeff Bonwick's labelfix
utility, as archived at:

https://gist.github.com/jjwhitney/baaa63144da89726e482

Additionally, it is now capable of properly determining the size of
block devices and other media, as well as handling sizes which are
not divisible by 2^18. This should make it viable for use on physical
devices and partitions, in addition to files.
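
One plausible way to handle a size that is not a multiple of 2^18 is
to round it down before computing label offsets, with two labels at
the front and two at the end of the aligned size. Whether zhack does
exactly this is an assumption; the names below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define	LABEL_SIZE	(UINT64_C(1) << 18)	/* 256 KiB, i.e. 2^18 */
#define	LABEL_COUNT	4

/* Round a raw device size down to a whole number of labels. */
static uint64_t
align_down_to_label(uint64_t size)
{
	return (size & ~(LABEL_SIZE - 1));
}

/*
 * Offset of label l: labels 0 and 1 sit at the front of the device,
 * labels 2 and 3 at the end of the aligned size.
 */
static uint64_t
label_offset(uint64_t raw_size, int l)
{
	uint64_t psize = align_down_to_label(raw_size);

	assert(l >= 0 && l < LABEL_COUNT);
	assert(psize >= LABEL_COUNT * LABEL_SIZE);
	return ((uint64_t)l * LABEL_SIZE +
	    (l < LABEL_COUNT / 2 ? 0 : psize - LABEL_COUNT * LABEL_SIZE));
}
```

Any trailing bytes past the last whole 256 KiB are simply left unused.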

These changes should make it possible to import zpools that have had
their uberblocks erased, such as in the case of pools rendered
inaccessible by erroneous detach commands.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: buzzingwires <buzzingwires@outlook.com>
Closes #14773

16 months agoOptimize check_filesystem() and process_error_log()
George Amanakis [Wed, 3 May 2023 16:00:14 +0000 (18:00 +0200)]
Optimize check_filesystem() and process_error_log()

Integrate check_clones() into check_filesystem() and implement a list
instead of iterating recursively over the clones, thus eliminating the
risk of a stack overflow.

Also use kmem_zalloc() to allocate large structures in
process_error_log() reducing its stack size from ~700 to ~128 bytes.
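
The recursion-to-list change can be sketched as below. This is a
generic illustration, not check_filesystem() itself: `node_t`,
`work_t`, `push()` and `visit_all()` are hypothetical, and malloc()
stands in for kernel allocation.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical dataset node: each dataset may have several clones. */
typedef struct node {
	int id;
	struct node **clones;
	size_t nclones;
} node_t;

/* Work-list entry, heap allocated so depth is not bounded by the stack. */
typedef struct work {
	node_t *node;
	struct work *next;
} work_t;

static int
push(work_t **head, node_t *n)
{
	work_t *w = malloc(sizeof (*w));

	if (w == NULL)
		return (-1);
	w->node = n;
	w->next = *head;
	*head = w;
	return (0);
}

/*
 * Visit root and every clone reachable from it without recursion.
 * Returns the number of nodes visited, or -1 on allocation failure.
 */
static int
visit_all(node_t *root)
{
	work_t *head = NULL;
	int visited = 0, err = 0;

	assert(root != NULL);
	if (push(&head, root) != 0)
		return (-1);

	while (head != NULL) {
		work_t *w = head;
		node_t *n = w->node;

		head = w->next;
		free(w);
		if (err != 0)
			continue;	/* drain the list after a failure */

		visited++;
		for (size_t i = 0; i < n->nclones; i++) {
			if (push(&head, n->clones[i]) != 0) {
				err = -1;
				break;
			}
		}
	}
	return (err != 0 ? -1 : visited);
}
```

Stack use stays constant however deeply the clones are nested; only
the heap-allocated list grows.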

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14744

16 months agoUse correct block pointer in block cloning case.
Pawel Jakub Dawidek [Tue, 2 May 2023 16:24:26 +0000 (01:24 +0900)]
Use correct block pointer in block cloning case.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14806

16 months agoWrap clang specific pragma
Brian Behlendorf [Tue, 2 May 2023 16:21:47 +0000 (09:21 -0700)]
Wrap clang specific pragma

Clang specific pragmas need to be wrapped to prevent a build
warning when compiling with gcc.
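
A minimal sketch of the wrapping pattern. The specific diagnostic
(-Wunused-parameter) and the function are hypothetical; the commit
does not say which pragma was wrapped. Without the `__clang__` guard,
gcc reports the pragmas under -Wunknown-pragmas.

```c
#include <assert.h>

/* Only clang understands "#pragma clang ..."; hide it from gcc. */
#if defined(__clang__)
#pragma clang diagnostic push
#pragma clang diagnostic ignored "-Wunused-parameter"
#endif

/* Hypothetical function whose second parameter is intentionally unused. */
static int
scaled(int value, int unused_flags)
{
	assert(value >= 0);
	return (value * 2);
}

#if defined(__clang__)
#pragma clang diagnostic pop
#endif
```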

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14814

16 months agoblake3: fix up bogus checksums in face of cpu migration
Mateusz Guzik [Tue, 2 May 2023 00:21:27 +0000 (02:21 +0200)]
blake3: fix up bogus checksums in face of cpu migration

This is a temporary measure until a better fix is sorted out.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Sponsored by: Rubicon Communications, LLC ("Netgate")
Closes #14785
Closes #14808

16 months agoCorrect ABD size for split block ZIOs
Serapheim Dimitropoulos [Tue, 2 May 2023 00:18:42 +0000 (17:18 -0700)]
Correct ABD size for split block ZIOs

Currently when layering the ABD buffer of each split block on top of
an indirect vdev's ZIO ABD we don't specify the split block's ABD.
This results in those ABDs being incorrectly sized by inheriting
the size of their parent ABD which is larger than what each split
block needs.

The above behavior isn't causing any bugs currently but can lead
to unexpected ABD sizes for people analyzing and/or working on
the ZIO codepath. This patch fixes this behavior by properly setting
the ABD size for split block ZIOs.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #14804

16 months agopowerpc64: Support ELFv2 asm on Big Endian
Justin Hibbits [Thu, 27 Apr 2023 19:49:21 +0000 (15:49 -0400)]
powerpc64: Support ELFv2 asm on Big Endian

FreeBSD/powerpc64 is all ELFv2 since FreeBSD 13, even big endian.  The
existing sha256 and sha512 asm code assumes that BE is all ELFv1, and LE
is ELFv2.  Minor changes to add ELFv2 in the BE side get this working
correctly on FreeBSD with latest OpenZFS import.

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Justin Hibbits <chmeeedalf@gmail.com>
Closes #14779

16 months agoMark TX_COMMIT transaction with TXG_NOTHROTTLE.
Alexander Motin [Thu, 27 Apr 2023 19:32:58 +0000 (15:32 -0400)]
Mark TX_COMMIT transaction with TXG_NOTHROTTLE.

TX_COMMIT has no on-disk representation and does not produce any more
dirty data.  It should not wait for anything, and even just skipping
the checks if not waiting gives improvement noticeable in profiler.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14798

16 months agoPAM: support the authentication facility
Val Packett [Thu, 27 Apr 2023 16:49:03 +0000 (13:49 -0300)]
PAM: support the authentication facility

Implement the pam_sm_authenticate method, using the noop argument of
lzc_load_key to do a passphrase check without actually loading the key.

This allows using ZFS as the source of truth for user passwords,
without storing any password hashes in /etc or using other PAM modules.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Felix Dörre <felix@dogcraft.de>
Signed-off-by: Val Packett <val@packett.cool>
Closes #14789

16 months agoFix BLAKE3 aarch64 assembly for FreeBSD and macOS
Tino Reichardt [Wed, 26 Apr 2023 19:40:26 +0000 (21:40 +0200)]
Fix BLAKE3 aarch64 assembly for FreeBSD and macOS

The x18 register isn't usable within FreeBSD kernel space, so we
have to fix the BLAKE3 aarch64 assembly so that it does not use it.

The source files are here: https://github.com/mcmilk/BLAKE3-tests

Reviewed-by: Kyle Evans <kevans@FreeBSD.org>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes #14728

16 months agoFix checkstyle warning
Brian Behlendorf [Wed, 26 Apr 2023 18:49:16 +0000 (11:49 -0700)]
Fix checkstyle warning

Resolve a missed checkstyle warning.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14799

16 months agoFix positive ABD size assertion in abd_verify().
Alexander Motin [Wed, 26 Apr 2023 16:20:43 +0000 (12:20 -0400)]
Fix positive ABD size assertion in abd_verify().

Gang ABDs without children are legal, and they do have zero size.
For other ABD types zero size does not make much sense and is likely
not working correctly now.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14795

16 months agoFreeBSD: fix up EINVAL from getdirentries on .zfs
Mateusz Guzik [Thu, 20 Apr 2023 09:00:03 +0000 (09:00 +0000)]
FreeBSD: fix up EINVAL from getdirentries on .zfs

Without the change:
/.zfs
/.zfs/snapshot
find: /.zfs: Invalid argument

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #14774

16 months agoFreeBSD: add missing vn state transition for .zfs
Mateusz Guzik [Thu, 20 Apr 2023 08:59:38 +0000 (08:59 +0000)]
FreeBSD: add missing vn state transition for .zfs

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #14774

16 months agotests/zdb_encrypted: parse numbers a little more robustly
Rob N [Wed, 26 Apr 2023 15:50:44 +0000 (01:50 +1000)]
tests/zdb_encrypted: parse numbers a little more robustly

On FreeBSD, `wc` prints some leading spaces, while on Linux it does not.
So we tell ksh to expect an integer, and it does the rest.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #14791
Closes #14797

16 months agozdb: Fix minor memory leak
Brian Behlendorf [Wed, 26 Apr 2023 15:43:39 +0000 (08:43 -0700)]
zdb: Fix minor memory leak

Commit 6b6aaf6dc2e65c63c74fbd7840c14627e9a91ce2 introduced a small
memory leak in zdb.  This was detected by the LeakSanitizer and was
causing all ztest runs to fail.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14796

16 months agoRevert "Fix data race between zil_commit() and zil_suspend()"
Brian Behlendorf [Tue, 25 Apr 2023 23:40:55 +0000 (16:40 -0700)]
Revert "Fix data race between zil_commit() and zil_suspend()"

This reverts commit 4c856fb333ac57d9b4a6ddd44407fd022a702f00 to
resolve a newly introduced deadlock which in practice is more
disruptive than the issue this commit intended to address.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #14775
Closes #14790

16 months agoAdd loongarch64 support
Han Gao [Tue, 25 Apr 2023 23:05:45 +0000 (07:05 +0800)]
Add loongarch64 support

Add loongarch64 definitions & lua module setjmp asm

LoongArch is a new RISC ISA, which is a bit like MIPS or RISC-V.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Han Gao <gaohan@uniontech.com>
Signed-off-by: WANG Xuerui <xen0n@gentoo.org>
Closes #13422

16 months agoTaught zdb -bb to print metadata totals
Rich Ercolani [Mon, 24 Apr 2023 23:55:07 +0000 (19:55 -0400)]
Taught zdb -bb to print metadata totals

People often want estimates of how much of their pool is occupied
by metadata, but they end up using lots of text processing on zdb's
output to get it.

So let's just...provide it for them.

Now, zdb -bbbs will output something like:

Blocks  LSIZE   PSIZE   ASIZE     avg    comp   %Total  Type
[...]
    68  1.06M    272K    544K      8K    4.00     0.00      L6 Total
 1.71K   212M   6.85M   13.7M      8K   30.91     0.00      L5 Total
 1.71K   212M   6.85M   13.7M      8K   30.91     0.00      L4 Total
 1.73K   214M   6.92M   13.8M      8K   30.89     0.00      L3 Total
 18.7K  2.29G    111M    221M   11.8K   21.19     0.00      L2 Total
 3.56M   454G   28.4G   56.9G   16.0K   15.97     0.19      L1 Total
  308M  36.8T   28.2T   28.6T   95.1K    1.30    99.80      L0 Total
  311M  37.3T   28.3T   28.6T   94.2K    1.32   100.00  Total
 50.4M   774G    113G    291G   5.77K    6.85     0.99  Metadata Total

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #14746

16 months agoFreeBSD: add missing vop_fplookup assignments
Mateusz Guzik [Mon, 24 Apr 2023 23:15:42 +0000 (01:15 +0200)]
FreeBSD: add missing vop_fplookup assignments

It became illegal to not have them as of
5f6df177758b9dff88e4b6069aeb2359e8b0c493 ("vfs: validate that vop
vectors provide all or none fplookup vops") upstream.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #14788

16 months agoFreeBSD: try to fallback early if can't do optimized copy
Mateusz Guzik [Wed, 5 Apr 2023 21:28:52 +0000 (21:28 +0000)]
FreeBSD: try to fallback early if can't do optimized copy

Not complete, but already shaves off some locking.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Sponsored by: Rubicon Communications, LLC ("Netgate")
Closes #14723

16 months agoFreeBSD: fix up EXDEV handling for clone_range
Mateusz Guzik [Wed, 5 Apr 2023 21:12:17 +0000 (21:12 +0000)]
FreeBSD: fix up EXDEV handling for clone_range

API contract requires VOPs to handle EXDEV internally, worst case by
falling back to the generic copy routine. This broke with the recent
changes.

While here whack custom loop to lock 2 vnodes with vn_lock_pair, which
provides the same functionality internally. write start/finish around
it plays no role so got eliminated.

One difference is that vn_lock_pair always takes an exclusive lock on
both vnodes. I did not patch around it because current code takes an
exclusive lock on the target vnode. zfs supports shared-locking for
writes, so this serializes different calls to the routine as is, despite
range locking inside. At the same time you may notice the source vnode
can get some traffic if only shared-locked, thus once more this goes
the safer route of exclusive-locking. Note this should be patched to
use shared-locking for both once the feature is considered stable.

Technically the switch to vn_lock_pair should be a separate change, but
it would only introduce churn immediately whacked by the rest of the
patch.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Sponsored by: Rubicon Communications, LLC ("Netgate")
Closes #14723

17 months agoFreeBSD: make zfs_vfs_held() definition consistent with declaration
Dimitry Andric [Fri, 21 Apr 2023 17:22:52 +0000 (19:22 +0200)]
FreeBSD: make zfs_vfs_held() definition consistent with declaration

Noticed while attempting to change FreeBSD's boolean_t into an actual
bool: in include/sys/zfs_ioctl_impl.h, zfs_vfs_held() is declared to
return a boolean_t, but in module/os/freebsd/zfs/zfs_ioctl_os.c it is
defined to return an int. Make the definition match the declaration.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Dimitry Andric <dimitry@andric.com>
Closes #14776

17 months agoAdd support for zpool user properties
Allan Jude [Fri, 21 Apr 2023 17:20:36 +0000 (13:20 -0400)]
Add support for zpool user properties

Usage:

    zpool set org.freebsd:comment="this is my pool" poolname

Tests are based on zfs_set's user property tests.

Also stop truncating property values at MAXNAMELEN, use ZFS_MAXPROPLEN.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com>
Sponsored-by: Beckhoff Automation GmbH & Co. KG.
Sponsored-by: Klara Inc.
Closes #11680

17 months agoLinux: Suppress -Wordered-compare-function-pointers in tracepoint code
Richard Yao [Tue, 11 Apr 2023 17:56:16 +0000 (17:56 +0000)]
Linux: Suppress -Wordered-compare-function-pointers in tracepoint code

Clang points out that there is a comparison against -1, but we cannot
fix it because that is from the kernel headers, which we must support.
We can work around this by using a pragma.

Sponsored-By: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Youzhong Yang <yyang@mathworks.com>
Signed-off-by: Richard Yao <richard.yao@klarasystems.com>
Closes #14738

17 months agoLinux: zfs_zaccess_trivial() should always call generic_permission()
Richard Yao [Tue, 11 Apr 2023 17:50:43 +0000 (17:50 +0000)]
Linux: zfs_zaccess_trivial() should always call generic_permission()

Building with Clang on Linux generates a warning that err could be
uninitialized if mnt_ns is a NULL pointer. However, mnt_ns should never
be NULL, so there is no need to put this behind an if statement.  Taking
it outside of the if statement means that the possibility of err being
uninitialized goes from zero in a way that the compiler could not
prove to zero in a way that the compiler can prove.
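
The shape of the fix can be sketched like this. `check_permission()`
and `access_trivial()` are hypothetical stand-ins for illustration,
not the kernel functions or their signatures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the permission check; returns 0 on success. */
static int
check_permission(const void *ns, int mask)
{
	assert(ns != NULL);
	return ((mask & ~07) == 0 ? 0 : -1);
}

/*
 * Before (sketch):
 *	int err;
 *	if (ns != NULL)
 *		err = check_permission(ns, mask);
 *	return (err);	<- compiler cannot prove err was assigned
 */
static int
access_trivial(const void *ns, int mask)
{
	/* After: the call is unconditional, so err is always assigned. */
	int err = check_permission(ns, mask);

	return (err);
}
```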

Sponsored-By: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Youzhong Yang <yyang@mathworks.com>
Signed-off-by: Richard Yao <richard.yao@klarasystems.com>
Closes #14738

17 months agoZTS: zvol_misc_trim retry busy export
Brian Behlendorf [Thu, 20 Apr 2023 17:25:16 +0000 (10:25 -0700)]
ZTS: zvol_misc_trim retry busy export

Retry the export if the pool is busy due to an open zvol.
Observed in the CI on Fedora 37.

  cannot export 'testpool': pool is busy
  ERROR: zpool export testpool exited 1

Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14769

17 months agoCreate zap for root vdev
rob-wing [Thu, 20 Apr 2023 17:07:56 +0000 (09:07 -0800)]
Create zap for root vdev

And add it to the AVZ, this is not backwards compatible with older pools
due to an assertion in spa_sync() that verifies the number of ZAPs of
all vdevs matches the number of ZAPs in the AVZ.

Granted, the assertion only applies to #DEBUG builds - still, a feature
flag is introduced to avoid the assertion, com.klarasystems:vdev_zaps_v2

Notably, this allows getting/setting properties on the root vdev:

    % zpool set user:prop=value <pool> root-0

Before this commit, it was already possible to get/set properties on
top-level vdevs with the syntax <type>-<vdev_id> (e.g. mirror-0):

    % zpool set user:prop=value <pool> mirror-0

This syntax also applies to the root vdev, as it is of type 'root'
with a vdev_id of 0, root-0. The keyword 'root' can be used as an
alias for 'root-0'.

The following tests have been added:

    - zpool get all properties from root vdev
    - zpool set a property on root vdev
    - verify root vdev ZAP is created

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Sponsored-by: Seagate Technology
Submitted-by: Klara, Inc.
Closes #14405

17 months agoAllow MMP to bypass waiting for other threads
Herb Wartens [Wed, 19 Apr 2023 20:22:59 +0000 (13:22 -0700)]
Allow MMP to bypass waiting for other threads

At our site we have seen cases when multi-modifier protection is enabled
(multihost=on) on our pool and the pool gets suspended due to a single
disk that is failing and responding very slowly. Our pools have 90 disks
in them and we expect disks to fail. The current version of MMP requires
that we wait for other writers before moving on. When a disk is
responding very slowly, we observed that waiting here was bad enough to
cause the pool to suspend. This change allows the MMP thread to bypass
waiting for other threads and reduces the chances the pool gets
suspended.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Herb Wartens <hawartens@gmail.com>
Closes #14659

17 months agoZTS: send-c_volume is flaky
Paul Dagnelie [Wed, 19 Apr 2023 20:20:02 +0000 (13:20 -0700)]
ZTS: send-c_volume is flaky

We use block_device_wait to wait for the zvol block device to
actually appear, and we log the result of the dd calls by using
an intermediate file.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #14767

17 months agoFix "Detach spare vdev in case if resilvering does not happen"
Ameer Hamza [Wed, 19 Apr 2023 16:04:32 +0000 (21:04 +0500)]
Fix "Detach spare vdev in case if resilvering does not happen"

Spare vdev should detach from the pool when a disk is reinserted.
However, spare detachment depends on the completion of resilvering,
and if a resilver is not scheduled, the spare vdev stays attached to
the pool until the next resilvering. When a zfs pool contains
several disks (25+ mirror), resilvering does not always happen when
a disk is reinserted. In this patch, spare vdev is manually detached
from the pool when resilvering does not occur and it has been tested
on both Linux and FreeBSD.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #14722

17 months agozfsprops.7: update mandlock
наб [Wed, 19 Apr 2023 16:03:42 +0000 (18:03 +0200)]
zfsprops.7: update mandlock

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=f7e33bdbd6d1bdf9c3df8bba5abcf3399f957ac3
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=7e59106e9c34458540f7d382d5b49071d1b7104f

Fixes: commit fb9baa9b2045a193a3caf0a46b5cac5ef7a84b61 ("zfsprops.8:
 remove nbmand-not-used-on-Linux and pointer to mount(8)")

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #14765

17 months agoSilence clang warning of flexible array not at end
youzhongyang [Wed, 19 Apr 2023 01:10:40 +0000 (21:10 -0400)]
Silence clang warning of flexible array not at end

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes #14764