]> git.proxmox.com Git - mirror_zfs.git/log
mirror_zfs.git
12 months agoLinux 6.4 compat: reclaimed_slab renamed to reclaimed
youzhongyang [Wed, 24 May 2023 19:23:42 +0000 (15:23 -0400)]
Linux 6.4 compat: reclaimed_slab renamed to reclaimed

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes #14891

13 months agoHold db_mtx when updating db_state
Brian Atkinson [Fri, 19 May 2023 20:05:53 +0000 (16:05 -0400)]
Hold db_mtx when updating db_state

Commit 555ef90 did some general code refactoring for
dmu_buf_will_not_fill() and dmu_buf_will_fill(). However, the db_mtx was
not held when update db->db_state in those code block. The rest of the
dbuf code always holds the db_mtx when updating db_state. This is
important because cv_wait() db_changed is used to check for db_state
changes.

Updating dmu_buf_will_not_fill() and dmu_buf_will_fill() to hold the
db_mtx when updating db_state.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #14875

13 months agoProbe vdevs before marking removed
Brian Behlendorf [Fri, 19 May 2023 20:05:09 +0000 (13:05 -0700)]
Probe vdevs before marking removed

Before allowing the ZED to mark a vdev as REMOVED due to a
hotplug event confirm that it is non-responsive with probe.
Any device which can be successfully probed should be left
ONLINE to prevent a healthy pool from being incorrectly
SUSPENDED.  This may occur for at least the following two
scenarios.

1) Drive expansion (zpool online -e) in VMware environments.
   If, during the partition resize operation, a partition is
   removed and re-created then udev will send a removed event.

2) Re-scanning the namespaces of an NVMe device (nvme ns-rescan)
   may result in a udev remove and add event being delivered.

Finally, update the ZED to only kick in a spare when the
removal was successful.

Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #14859
Closes #14861

13 months agoTeach zpool scrub to scrub only blocks in error log
George Amanakis [Fri, 17 Dec 2021 20:35:28 +0000 (21:35 +0100)]
Teach zpool scrub to scrub only blocks in error log

Added a flag '-e' in zpool scrub to scrub only blocks in error log. A
user can pause, resume and cancel the error scrub by passing additional
command line arguments -p -s just like a regular scrub. This involves
adding a new flag, creating new libzfs interfaces, a new ioctl, and the
actual iteration and read-issuing logic. Error scrubbing is executed in
multiple txg to make sure pool performance is not affected.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Co-authored-by: TulsiJain tulsi.jain@delphix.com
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #8995
Closes #12355

13 months agoAdd the ability to uninitialize
Brian Behlendorf [Thu, 18 May 2023 17:02:20 +0000 (10:02 -0700)]
Add the ability to uninitialize

zpool initialize functions well for touching every free byte...once.
But if we want to do it again, we're currently out of luck.

So let's add zpool initialize -u to clear it.

Co-authored-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #12451
Closes #14873

13 months agotest-runner: pass kmemleak and kmsg to Cmd.run
Antonio Russo [Mon, 15 May 2023 23:11:33 +0000 (17:11 -0600)]
test-runner: pass kmemleak and kmsg to Cmd.run

test-runner.py orchestrates all of the ZTS executions. The `Cmd` object
manages these process, and its `run` method specifically invokes these
possibly long-running processes, possibly retrying in the event of a
timeout. Since its inception, memory leak detection using the kmemleak
infrastructure [1], and kernel logging [2] have been added to this run
mechanism.

However, the callback to cull a process beyond its timeout threshold,
`kill_cmd`, has evaded modernization by both of these changes. As a
result, this function fails to properly invoke `run`, leading to an
untrapped exception and unreported test failure.

This patch extends `kill_cmd` to receive these kernel devices through
the `options` parameter, and regularizes all the `.run` calls from
`Cmd`, and its subclasses, to accept that parameter.

[1] Commit a69765ea5b563e0cd4d15fac4b1ac08c6ccf12d1
[2] Commit fc2c0256c55a2859d1988671b0896d22b75c8aba

Reviewed-by: John Wren Kennedy <john.kennedy@delphix.com>
Signed-off-by: Antonio Russo <aerusso@aerusso.net>
Closes #14849

13 months agoFix undefined behavior in spa_sync_props()
Richard Yao [Fri, 12 May 2023 21:10:14 +0000 (17:10 -0400)]
Fix undefined behavior in spa_sync_props()

8eae2d214cfa53862833eeeda9a5c1e9d5ded47d caused Coverity to begin
complaining about "Improper use of negative value" in two places in
spa_sync_props() because Coverity correctly inferred from `prop ==
ZPOOL_PROP_INVAL` that prop could be -1 while both zpool_prop_to_name()
and zpool_prop_get_type() use it an array index, which is undefined
behavior.

Assuming that the system does not panic from an attempt to read invalid
memory, the case statement for ZPOOL_PROP_INVAL will ensure that only
user properties will reach this code when prop is ZPOOL_PROP_INVAL, such
that execution will continue safely. However, if we are unlucky enough
to read invalid memory, then the system will panic.

This issue predates the patch that caused coverity to begin complaining.
Thankfully, our userland tools do not pass nonsense to us, so this bug
should not be triggered unless a future userland tool attempts to set a
property that we do not understand.

Reported-by: Coverity (CID-1561129)
Reported-by: Coverity (CID-1561130)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14860

13 months agoFix use after free regression in spa_remove_healed_errors()
Richard Yao [Fri, 12 May 2023 20:47:56 +0000 (16:47 -0400)]
Fix use after free regression in spa_remove_healed_errors()

6839ec6f1098c28ff7b772f1b31b832d05e6b567 placed code in
spa_remove_healed_errors() that uses a pointer after the kmem_free()
call that frees it.

Reported-by: Coverity (CID-1562375)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14860

13 months agozil: Free lwb_buf after write completion.
Alexander Motin [Fri, 12 May 2023 16:49:26 +0000 (12:49 -0400)]
zil: Free lwb_buf after write completion.

There is no sense to keep that memory allocated during the flush.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by:   iXsystems, Inc.
Closes #14855

13 months agozil: Some micro-optimizations.
Alexander Motin [Fri, 12 May 2023 16:14:29 +0000 (12:14 -0400)]
zil: Some micro-optimizations.

Should not cause functional changes.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by:   iXsystems, Inc.
Closes #14854

13 months agoRefine special_small_blocks property validation
Don Brady [Fri, 12 May 2023 16:12:28 +0000 (10:12 -0600)]
Refine special_small_blocks property validation

When the special_small_blocks property is being set during a pool
create it enforces a limit of 128KiB even if the pool's record size
is larger.

If the recordsize property is being set during a pool create, then
use that value instead of the default SPA_OLD_MAXBLOCKSIZE value.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #13815
Closes #14811

13 months agoZTS: Add auto_replace_001_pos to exceptions
Brian Behlendorf [Fri, 12 May 2023 16:07:58 +0000 (09:07 -0700)]
ZTS: Add auto_replace_001_pos to exceptions

The auto_replace_001_pos test case does not reliably pass on
Fedora 37 and newer.  Until the test case can be updated to make
it reliable add it to the list of "maybe" exceptions on Linux.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #14851
Closes #14852

13 months agoMake sure we are not trying to clone a spill block.
Pawel Jakub Dawidek [Wed, 10 May 2023 05:32:30 +0000 (22:32 -0700)]
Make sure we are not trying to clone a spill block.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14825

13 months agoCorrect comment.
Pawel Jakub Dawidek [Thu, 4 May 2023 23:14:19 +0000 (16:14 -0700)]
Correct comment.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14825

13 months agoRemove badly placed comment.
Pawel Jakub Dawidek [Thu, 4 May 2023 06:25:22 +0000 (23:25 -0700)]
Remove badly placed comment.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14825

13 months agoDon't call zfs_exit_two() before zfs_enter_two().
Pawel Jakub Dawidek [Wed, 3 May 2023 07:24:47 +0000 (00:24 -0700)]
Don't call zfs_exit_two() before zfs_enter_two().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14825

13 months agoDon't use dmu_buf_is_dirty() for unassigned transaction.
Pawel Jakub Dawidek [Tue, 2 May 2023 22:46:14 +0000 (15:46 -0700)]
Don't use dmu_buf_is_dirty() for unassigned transaction.

The dmu_buf_is_dirty() call doesn't make sense here for two reasons:
1. txg is 0 for unassigned tx, so it was a no-op.
2. It is equivalent of checking if we have dirty records and we are doing
   this few lines earlier.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14825

13 months agoDeny block cloning is dbuf size doesn't match BP size.
Pawel Jakub Dawidek [Tue, 2 May 2023 21:24:43 +0000 (14:24 -0700)]
Deny block cloning is dbuf size doesn't match BP size.

I don't know an easy way to shrink down dbuf size, so just deny block cloning
into dbufs that don't match our BP's size.

This fixes the following situation:
1. Create a small file, eg. 1kB of random bytes. Its dbuf will be 1kB.
2. Create a larger file, eg. 2kB of random bytes. Its dbuf will be 2kB.
3. Truncate the large file to 0. Its dbuf will remain 2kB.
4. Clone the small file into the large file. Small file's BP lsize is
   1kB, but the large file's dbuf is 2kB.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14825

13 months agoAdditional block cloning fixes.
Pawel Jakub Dawidek [Sun, 30 Apr 2023 09:47:09 +0000 (02:47 -0700)]
Additional block cloning fixes.

Reimplement some of the block cloning vs dbuf logic, mostly to fix
situation where we clone a block and in the same transaction group
we want to partially overwrite the clone.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14825

13 months agozil: Don't expect zio_shrink() to succeed.
Alexander Motin [Thu, 11 May 2023 21:27:12 +0000 (17:27 -0400)]
zil: Don't expect zio_shrink() to succeed.

At least for RAIDZ zio_shrink() does not reduce zio size, but reduced
wsz in that case likely results in writing uninitialized memory.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by:   iXsystems, Inc.
Closes #14853

13 months agoPrevent panic during concurrent snapshot rollback and zvol read
Ameer Hamza [Wed, 10 May 2023 00:56:35 +0000 (05:56 +0500)]
Prevent panic during concurrent snapshot rollback and zvol read

Protect zvol_cdev_read with zv_suspend_lock to prevent concurrent
release of the dnode, avoiding panic when a snapshot is rolled back
in parallel during ongoing zvol read operation.

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #14839

13 months agopam: Fix "buffer overflow" in pam ZTS tests on F38
Tony Hutter [Wed, 10 May 2023 00:55:19 +0000 (17:55 -0700)]
pam: Fix "buffer overflow" in pam ZTS tests on F38

The pam ZTS tests were reporting a buffer overflow on F38, possibly
due to F38 now setting _FORTIFY_SOURCE=3 by default.  gdb and
valgrind narrowed this down to a snprintf() buffer overflow in
zfs_key_config_modify_session_counter().  I'm not clear why this
particular snprintf() was being flagged as an overflow, but when
I replaced it with an asprintf(), the test passed reliably.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #14802
Closes #14842

13 months agoAdd dmu_tx_hold_append() interface
Brian Behlendorf [Tue, 9 May 2023 16:03:10 +0000 (09:03 -0700)]
Add dmu_tx_hold_append() interface

Provides an interface which callers can use to declare a write when
the exact starting offset in not yet known.  Since the full range
being updated is not available only the first L0 block at the
provided offset will be prefetched.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14819

13 months agoDebug auto_replace_001_pos failures
Brian Behlendorf [Tue, 9 May 2023 15:57:02 +0000 (08:57 -0700)]
Debug auto_replace_001_pos failures

Reduced the timeout to 60 seconds which should be more than
sufficient and allow the test to be marked as FAILED rather
than KILLED.  Also dump the pool status on cleanup.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14829

13 months agoRemove duplicate code in l2arc_evict()
George Amanakis [Tue, 9 May 2023 15:54:41 +0000 (17:54 +0200)]
Remove duplicate code in l2arc_evict()

l2arc_evict() performs the adjustment of the size of buffers to be
written on L2ARC unnecessarily. l2arc_write_size() is called right
before l2arc_evict() and performs those adjustments.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14828

13 months agoRemove single parent assertion from zio_nowait().
Alexander Motin [Tue, 9 May 2023 15:54:01 +0000 (11:54 -0400)]
Remove single parent assertion from zio_nowait().

We only need to know if ZIO has any parent there.  We do not care if
it has more than one, but use of zio_unique_parent() == NULL asserts
that.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14823

13 months agoEnable the head_errlog feature to remove errors
George Amanakis [Tue, 9 May 2023 15:53:27 +0000 (17:53 +0200)]
Enable the head_errlog feature to remove errors

In case check_filesystem() does not error out and does not report
an error, remove that error block from error lists and logs
without requiring a scrub. This can happen when the original file and
all snapshots/clones referencing it have been removed.

Otherwise zpool status will still report that "Permanent errors have
been detected..." without actually reporting any of them.

To implement this change the functions introduced in corrective
receive were modified to take into account the head_errlog feature.

Before this change:
=============================
pool: test
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
config:

        NAME                   STATE     READ WRITE CKSUM
        test                   ONLINE       0     0     0
          /home/user/vdev_a    ONLINE       0     0     2

errors: Permanent errors have been detected in the following files:

=============================

After this change:
=============================
  pool: test
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are
unaffected.
action: Determine if the device needs to be replaced, and clear the
errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
config:

        NAME                   STATE     READ WRITE CKSUM
        test                   ONLINE       0     0     0
          /home/user/vdev_a    ONLINE       0     0     2

errors: No known data errors
=============================

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14813

13 months agoFixes in head_errlog feature with encryption
George Amanakis [Mon, 8 May 2023 20:35:03 +0000 (22:35 +0200)]
Fixes in head_errlog feature with encryption

For the head_errlog feature use dsl_dataset_hold_obj_flags() instead of
dsl_dataset_hold_obj() in order to enable access to the encryption keys
(if loaded). This enables reporting of errors in encrypted filesystems
which are not mounted but have their keys loaded.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14837

13 months agoVerify block pointers before writing them out
Matthew Ahrens [Mon, 8 May 2023 18:20:23 +0000 (11:20 -0700)]
Verify block pointers before writing them out

If a block pointer is corrupted (but the block containing it checksums
correctly, e.g. due to a bug that overwrites random memory), we can
often detect it before the block is read, with the `zfs_blkptr_verify()`
function, which is used in `arc_read()`, `zio_free()`, etc.

However, such corruption is not typically recoverable.  To recover from
it we would need to detect the memory error before the block pointer is
written to disk.

This PR verifies BP's that are contained in indirect blocks and dnodes
before they are written to disk, in `dbuf_write_ready()`. This way,
we'll get a panic before the on-disk data is corrupted. This will help
us to diagnose what's causing the corruption, as well as being much
easier to recover from.

To minimize performance impact, only checks that can be done without
holding the spa_config_lock are performed.

Additionally, when corruption is detected, the raw words of the block
pointer are logged.  (Note that `dprintf_bp()` is a no-op by default,
but if enabled it is not safe to use with invalid block pointers.)

Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #14817

13 months agozdb: consistent xattr output
Brian Behlendorf [Mon, 8 May 2023 18:17:41 +0000 (11:17 -0700)]
zdb: consistent xattr output

When using zdb to output the value of an xattr only interpret it
as printable characters if the entire byte array is printable.
Additionally, if the --parseable option is set always output the
buffer contents as octal for easy parsing.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14830

13 months agoZTS: add snapshot/snapshot_002_pos exception
Brian Behlendorf [Mon, 8 May 2023 17:09:30 +0000 (10:09 -0700)]
ZTS: add snapshot/snapshot_002_pos exception

Add snapshot_002_pos to the known list of occasional failures
for FreeBSD until it can be made entirely reliable.

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #14831
Closes #14832

13 months agoFix two abd_gang_add_gang() issues.
Alexander Motin [Fri, 5 May 2023 16:17:55 +0000 (12:17 -0400)]
Fix two abd_gang_add_gang() issues.

- There is no reason to assert that added gang is not empty.  It
may be weird to add an empty gang, but it is legal.
 - When moving chain list from the added gang clear its size, or it
will trigger assertion in abd_verify() when that gang is freed.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14816

13 months agoSimplify and optimize random_int_between().
Pawel Jakub Dawidek [Fri, 5 May 2023 16:09:12 +0000 (01:09 +0900)]
Simplify and optimize random_int_between().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14805

13 months agoPlug memory leak in zfsdev_state.
Pawel Jakub Dawidek [Fri, 5 May 2023 15:51:41 +0000 (00:51 +0900)]
Plug memory leak in zfsdev_state.

On kernel module unload, free all zfsdev state structures, except for
zfsdev_state_listhead, which is statically allocated.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14824

13 months agozpool import -m also removing spare and cache when log device is missing
Ameer Hamza [Wed, 3 May 2023 22:10:32 +0000 (03:10 +0500)]
zpool import -m also removing spare and cache when log device is missing

spa_import() relies on a pool config fetched by spa_try_import() for
spare/cache devices. Import flags are not passed to spa_tryimport(),
which makes it return early due to a missing log device and missing
retrieving the cache device and spare eventually. Passing
ZFS_IMPORT_MISSING_LOG to spa_tryimport() makes it fetch the correct
configuration regardless of the missing log device.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #14794

13 months agoAllow zhack label repair to restore detached devices.
buzzingwires [Wed, 3 May 2023 16:03:57 +0000 (12:03 -0400)]
Allow zhack label repair to restore detached devices.

This commit expands on the zhack label repair command in d04b5c9 by
adding the -u option to undetach a device by regenerating uberblocks,
in addition to the existing functionality of fixing checksums, now
represented by -c. Previous behavior is retained in the case of no
options.

The changes are heavily inspired by Jeff Bonwick's labelfix
utility, as archived at:

https://gist.github.com/jjwhitney/baaa63144da89726e482

Additionally, it is now capable of properly determining the size of
block devices and other media, as well as handling sizes which are
not divisible by 2^18. This should make it viable for use on physical
devices and partitions, in addition to files.

These changes should make it possible to import zpools that have had
their uberblocks erased, such as in the case of pools rendered
inaccessible by erroneous detach commands.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: buzzingwires <buzzingwires@outlook.com>
Closes #14773

13 months agoOptimize check_filesystem() and process_error_log()
George Amanakis [Wed, 3 May 2023 16:00:14 +0000 (18:00 +0200)]
Optimize check_filesystem() and process_error_log()

Integrate check_clones() into check_filesystem() and implement a list
instead of iterating recursively over the clones, thus eliminating the
risk of a stack overflow.

Also use kmem_zalloc() to allocate large structures in
process_error_log() reducing its stack size from ~700 to ~128 bytes.

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14744

13 months agoUse correct block pointer in block cloning case.
Pawel Jakub Dawidek [Tue, 2 May 2023 16:24:26 +0000 (01:24 +0900)]
Use correct block pointer in block cloning case.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14806

13 months agoWrap clang specific pragma
Brian Behlendorf [Tue, 2 May 2023 16:21:47 +0000 (09:21 -0700)]
Wrap clang specific pragma

Clang specific pragmas need to be wrapped to prevent a build
warning when compiling with gcc.

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14814

13 months agoblake3: fix up bogus checksums in face of cpu migration
Mateusz Guzik [Tue, 2 May 2023 00:21:27 +0000 (02:21 +0200)]
blake3: fix up bogus checksums in face of cpu migration

This is a temporary measure until a better fix is sorted out.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Sponsored by: Rubicon Communications, LLC ("Netgate")
Closes #14785
Closes #14808

13 months agoCorrect ABD size for split block ZIOs
Serapheim Dimitropoulos [Tue, 2 May 2023 00:18:42 +0000 (17:18 -0700)]
Correct ABD size for split block ZIOs

Currently when layering the ABD buffer of each split block on top of
an indirect vdev's ZIO ABD we don't specify the split block's ABD.
This results in those ABDs being incorrectly sized by inheriting
the size of their parent ABD which is larger than what each split
block needs.

The above behavior isn't causing any bugs currently but can lead
to unexpected ABD sizes for people analyzing and/or working on
the ZIO codepath. This patch fixes this behavior by properly setting
the ABD size for split block ZIOs.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #14804

13 months agopowerpc64: Support ELFv2 asm on Big Endian
Justin Hibbits [Thu, 27 Apr 2023 19:49:21 +0000 (15:49 -0400)]
powerpc64: Support ELFv2 asm on Big Endian

FreeBSD/powerpc64 is all ELFv2 since FreeBSD 13, even big endian.  The
existing sha256 and sha512 asm code assumes that BE is all ELFv1, and LE
is ELFv2.  Minor changes to add ELFv2 in the BE side gets this working
correctly on FreeBSD with latest OpenZFS import.

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Justin Hibbits <chmeeedalf@gmail.com>
Closes #14779

13 months agoMark TX_COMMIT transaction with TXG_NOTHROTTLE.
Alexander Motin [Thu, 27 Apr 2023 19:32:58 +0000 (15:32 -0400)]
Mark TX_COMMIT transaction with TXG_NOTHROTTLE.

TX_COMMIT has no on-disk representation and does not produce any more
dirty data.  It should not wait for anything, and even just skipping
the checks if not waiting gives improvement noticeable in profiler.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14798

13 months agoPAM: support the authentication facility
Val Packett [Thu, 27 Apr 2023 16:49:03 +0000 (13:49 -0300)]
PAM: support the authentication facility

Implement the pam_sm_authenticate method, using the noop argument of
lzc_load_key to do a passphrase check without actually loading the key.

This allows using ZFS as the source of truth for user passwords,
without storing any password hashes in /etc or using other PAM modules.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Felix Dörre <felix@dogcraft.de>
Signed-off-by: Val Packett <val@packett.cool>
Closes #14789

13 months agoFix BLAKE3 aarch64 assembly for FreeBSD and macOS
Tino Reichardt [Wed, 26 Apr 2023 19:40:26 +0000 (21:40 +0200)]
Fix BLAKE3 aarch64 assembly for FreeBSD and macOS

The x18 register isn't useable within FreeBSD kernel space, so we
have to fix the BLAKE3 aarch64 assembly for not using it.

The source files are here: https://github.com/mcmilk/BLAKE3-tests

Reviewed-by: Kyle Evans <kevans@FreeBSD.org>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes #14728

13 months agoFix checkstyle warning
Brian Behlendorf [Wed, 26 Apr 2023 18:49:16 +0000 (11:49 -0700)]
Fix checkstyle warning

Resolve a missed checkstyle warning.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14799

13 months agoFix positive ABD size assertion in abd_verify().
Alexander Motin [Wed, 26 Apr 2023 16:20:43 +0000 (12:20 -0400)]
Fix positive ABD size assertion in abd_verify().

Gang ABDs without childred are legal, and they do have zero size.
For other ABD types zero size doesn't have much sense and likely
not working correctly now.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14795

13 months agoFreeBSD: fix up EINVAL from getdirentries on .zfs
Mateusz Guzik [Thu, 20 Apr 2023 09:00:03 +0000 (09:00 +0000)]
FreeBSD: fix up EINVAL from getdirentries on .zfs

Without the change:
/.zfs
/.zfs/snapshot
find: /.zfs: Invalid argument

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #14774

13 months agoFreeBSD: add missing vn state transition for .zfs
Mateusz Guzik [Thu, 20 Apr 2023 08:59:38 +0000 (08:59 +0000)]
FreeBSD: add missing vn state transition for .zfs

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #14774

13 months agotests/zdb_encrypted: parse numbers a little more robustly
Rob N [Wed, 26 Apr 2023 15:50:44 +0000 (01:50 +1000)]
tests/zdb_encrypted: parse numbers a little more robustly

On FreeBSD, `wc` prints some leading spaces, while on Linux it does not.
So we tell ksh to expect an integer, and it does the rest.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #14791
Closes #14797

13 months agozdb: Fix minor memory leak
Brian Behlendorf [Wed, 26 Apr 2023 15:43:39 +0000 (08:43 -0700)]
zdb: Fix minor memory leak

Commit 6b6aaf6dc2e65c63c74fbd7840c14627e9a91ce2 introduced a small
memory leak in zdb.  This was detected by the LeakSanitizer and was
causing all ztest runs to fail.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14796

13 months agoRevert "Fix data race between zil_commit() and zil_suspend()"
Brian Behlendorf [Tue, 25 Apr 2023 23:40:55 +0000 (16:40 -0700)]
Revert "Fix data race between zil_commit() and zil_suspend()"

This reverts commit 4c856fb333ac57d9b4a6ddd44407fd022a702f00 to
resolve a newly introduced deadlock which in practice in more
disruptive that the issue this commit intended to address.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #14775
Closes #14790

13 months agoAdd loongarch64 support
Han Gao [Tue, 25 Apr 2023 23:05:45 +0000 (07:05 +0800)]
Add loongarch64 support

Add loongarch64 definitions & lua module setjmp asm

LoongArch is a new RISC ISA, which is a bit like MIPS or RISC-V.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Han Gao <gaohan@uniontech.com>
Signed-off-by: WANG Xuerui <xen0n@gentoo.org>
Closes #13422

13 months agoTaught zdb -bb to print metadata totals
Rich Ercolani [Mon, 24 Apr 2023 23:55:07 +0000 (19:55 -0400)]
Taught zdb -bb to print metadata totals

People often want estimates of how much of their pool is occupied
by metadata, but they end up using lots of text processing on zdb's
output to get it.

So let's just...provide it for them.

Now, zdb -bbbs will output something like:

Blocks  LSIZE   PSIZE   ASIZE     avg    comp   %Total  Type
[...]
    68  1.06M    272K    544K      8K    4.00     0.00      L6 Total
 1.71K   212M   6.85M   13.7M      8K   30.91     0.00      L5 Total
 1.71K   212M   6.85M   13.7M      8K   30.91     0.00      L4 Total
 1.73K   214M   6.92M   13.8M      8K   30.89     0.00      L3 Total
 18.7K  2.29G    111M    221M   11.8K   21.19     0.00      L2 Total
 3.56M   454G   28.4G   56.9G   16.0K   15.97     0.19      L1 Total
  308M  36.8T   28.2T   28.6T   95.1K    1.30    99.80      L0 Total
  311M  37.3T   28.3T   28.6T   94.2K    1.32   100.00  Total
 50.4M   774G    113G    291G   5.77K    6.85     0.99  Metadata Total

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #14746

13 months agoFreeBSD: add missing vop_fplookup assignments
Mateusz Guzik [Mon, 24 Apr 2023 23:15:42 +0000 (01:15 +0200)]
FreeBSD: add missing vop_fplookup assignments

It became illegal to not have them as of
5f6df177758b9dff88e4b6069aeb2359e8b0c493 ("vfs: validate that vop
vectors provide all or none fplookup vops") upstream.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #14788

13 months agoFreeBSD: try to fallback early if can't do optimized copy
Mateusz Guzik [Wed, 5 Apr 2023 21:28:52 +0000 (21:28 +0000)]
FreeBSD: try to fallback early if can't do optimized copy

Not complete, but already shaves on some locking.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Sponsored by: Rubicon Communications, LLC ("Netgate")
Closes #14723

13 months agoFreeBSD: fix up EXDEV handling for clone_range
Mateusz Guzik [Wed, 5 Apr 2023 21:12:17 +0000 (21:12 +0000)]
FreeBSD: fix up EXDEV handling for clone_range

API contract requires VOPs to handle EXDEV internally, worst case by
falling back to the generic copy routine. This broke with the recent
changes.

While here whack custom loop to lock 2 vnodes with vn_lock_pair, which
provides the same functionality internally. write start/finish around
it plays no role so got eliminated.

One difference is that vn_lock_pair always takes an exclusive lock on
both vnodes. I did not patch around it because current code takes an
exclusive lock on the target vnode. zfs supports shared-locking for
writes, so this serializes different calls to the routine as is, despite
range locking inside. At the same time you may notice the source vnode
can get some traffic if only shared-locked, thus once more this goes
the safer route of exclusive-locking. Note this should be patched to
use shared-locking for both once the feature is considered stable.

Technically the switch to vn_lock_pair should be a separate change, but
it would only introduce churn immediately whacked by the rest of the
patch.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Sponsored by: Rubicon Communications, LLC ("Netgate")
Closes #14723

13 months agoFreeBSD: make zfs_vfs_held() definition consistent with declaration
Dimitry Andric [Fri, 21 Apr 2023 17:22:52 +0000 (19:22 +0200)]
FreeBSD: make zfs_vfs_held() definition consistent with declaration

Noticed while attempting to change FreeBSD's boolean_t into an actual
bool: in include/sys/zfs_ioctl_impl.h, zfs_vfs_held() is declared to
return a boolean_t, but in module/os/freebsd/zfs/zfs_ioctl_os.c it is
defined to return an int. Make the definition match the declaration.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Dimitry Andric <dimitry@andric.com>
Closes #14776

13 months agoAdd support for zpool user properties
Allan Jude [Fri, 21 Apr 2023 17:20:36 +0000 (13:20 -0400)]
Add support for zpool user properties

Usage:

    zpool set org.freebsd:comment="this is my pool" poolname

Tests are based on zfs_set's user property tests.

Also stop truncating property values at MAXNAMELEN, use ZFS_MAXPROPLEN.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com>
Sponsored-by: Beckhoff Automation GmbH & Co. KG.
Sponsored-by: Klara Inc.
Closes #11680

14 months agoLinux: Suppress -Wordered-compare-function-pointers in tracepoint code
Richard Yao [Tue, 11 Apr 2023 17:56:16 +0000 (17:56 +0000)]
Linux: Suppress -Wordered-compare-function-pointers in tracepoint code

Clang points out that there is a comparison against -1, but we cannot
fix it because that is from the kernel headers, which we must support.
We can workaround this by using a pragma.

Sponsored-By: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Youzhong Yang <yyang@mathworks.com>
Signed-off-by: Richard Yao <richard.yao@klarasystems.com>
Closes #14738

14 months agoLinux: zfs_zaccess_trivial() should always call generic_permission()
Richard Yao [Tue, 11 Apr 2023 17:50:43 +0000 (17:50 +0000)]
Linux: zfs_zaccess_trivial() should always call generic_permission()

Building with Clang on Linux generates a warning that err could be
uninitialized if mnt_ns is a NULL pointer. However, mnt_ns should never
be NULL, so there is no need to put this behind an if statement.  Taking
it outside of the if statement means that the possibility of err being
uninitialized goes from being always zero in a way that the compiler
could not realize to a way that is always zero in a way that the
compiler can realize.

Sponsored-By: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Youzhong Yang <yyang@mathworks.com>
Signed-off-by: Richard Yao <richard.yao@klarasystems.com>
Closes #14738

14 months agoZTS: zvol_misc_trim retry busy export
Brian Behlendorf [Thu, 20 Apr 2023 17:25:16 +0000 (10:25 -0700)]
ZTS: zvol_misc_trim retry busy export

Retry the export if the pool is busy due to an open zvol.
Observed in the CI on Fedora 37.

  cannot export 'testpool': pool is busy
  ERROR: zpool export testpool exited 1

Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14769

14 months agoCreate zap for root vdev
rob-wing [Thu, 20 Apr 2023 17:07:56 +0000 (09:07 -0800)]
Create zap for root vdev

And add it to the AVZ, this is not backwards compatible with older pools
due to an assertion in spa_sync() that verifies the number of ZAPs of
all vdevs matches the number of ZAPs in the AVZ.

Granted, the assertion only applies to #DEBUG builds - still, a feature
flag is introduced to avoid the assertion, com.klarasystems:vdev_zaps_v2

Notably, this allows to get/set properties on the root vdev:

    % zpool set user:prop=value <pool> root-0

Before this commit, it was already possible to get/set properties on
top-level vdevs with the syntax <type>-<vdev_id> (e.g. mirror-0):

    % zpool set user:prop=value <pool> mirror-0

This syntax also applies to the root vdev as it is is of type 'root'
with a vdev_id of 0, root-0. The keyword 'root' as an alias for
'root-0'.

The following tests have been added:

    - zpool get all properties from root vdev
    - zpool set a property on root vdev
    - verify root vdev ZAP is created

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Sponsored-by: Seagate Technology
Submitted-by: Klara, Inc.
Closes #14405

14 months agoAllow MMP to bypass waiting for other threads
Herb Wartens [Wed, 19 Apr 2023 20:22:59 +0000 (13:22 -0700)]
Allow MMP to bypass waiting for other threads

At our site we have seen cases when multi-modifier protection is enabled
(multihost=on) on our pool and the pool gets suspended due to a single
disk that is failing and responding very slowly. Our pools have 90 disks
in them and we expect disks to fail. The current version of MMP requires
that we wait for other writers before moving on. When a disk is
responding very slowly, we observed that waiting here was bad enough to
cause the pool to suspend. This change allows the MMP thread to bypass
waiting for other threads and reduces the chances the pool gets
suspended.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Herb Wartens <hawartens@gmail.com>
Closes #14659

14 months agoZTS: send-c_volume is flaky
Paul Dagnelie [Wed, 19 Apr 2023 20:20:02 +0000 (13:20 -0700)]
ZTS: send-c_volume is flaky

We use block_device_wait to wait for the zvol block device to
actually appear, and we log the result of the dd calls by using
an intermediate file.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #14767

14 months agoFix "Detach spare vdev in case if resilvering does not happen"
Ameer Hamza [Wed, 19 Apr 2023 16:04:32 +0000 (21:04 +0500)]
Fix "Detach spare vdev in case if resilvering does not happen"

Spare vdev should detach from the pool when a disk is reinserted.
However, spare detachment depends on the completion of resilvering,
and if resilver does not schedule, the spare vdev keeps attached to
the pool until the next resilvering. When a zfs pool contains
several disks (25+ mirror), resilvering does not always happen when
a disk is reinserted. In this patch, spare vdev is manually detached
from the pool when resilvering does not occur and it has been tested
on both Linux and FreeBSD.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #14722

14 months agozfsprops.7: update mandlock
наб [Wed, 19 Apr 2023 16:03:42 +0000 (18:03 +0200)]
zfsprops.7: update mandlock

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=f7e33bdbd6d1bdf9c3df8bba5abcf3399f957ac3
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=7e59106e9c34458540f7d382d5b49071d1b7104f

Fixes: commit fb9baa9b2045a193a3caf0a46b5cac5ef7a84b61 ("zfsprops.8:
 remove nbmand-not-used-on-Linux and pointer to mount(8)")

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #14765

14 months agoSilence clang warning of flexible array not at end
youzhongyang [Wed, 19 Apr 2023 01:10:40 +0000 (21:10 -0400)]
Silence clang warning of flexible array not at end

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes #14764

14 months agoValues printed by zpool-iostat(8) should be right-aligned
Low-power [Tue, 18 Apr 2023 18:34:41 +0000 (02:34 +0800)]
Values printed by zpool-iostat(8) should be right-aligned

This inappropriate left-alignment was introduced in 7bb7b1f.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: WHR <msl0000023508@gmail.com>
Closes #14751

14 months agoRevert "ZFS_IOC_COUNT_FILLED does unnecessary txg_wait_synced()"
Tony Hutter [Tue, 18 Apr 2023 15:41:52 +0000 (08:41 -0700)]
Revert "ZFS_IOC_COUNT_FILLED does unnecessary txg_wait_synced()"

This reverts commit 4b3133e671b958fa2c915a4faf57812820124a7b.

Users identified this commit as a possible source of data
corruption:
https://github.com/openzfs/zfs/issues/14753

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Issue #14753
Closes #14761

14 months agoWork around Raspberry Pi kernel packaging oddities
Rich Ercolani [Tue, 18 Apr 2023 00:38:09 +0000 (20:38 -0400)]
Work around Raspberry Pi kernel packaging oddities

On Debian and Ubuntu and friends, you get something like
"linux-image-$(uname -r)" and "linux-headers-$(uname -r)" you
can put a Depends on.

On Raspberry Pi OS, you get "raspberrypi-kernel" and
"raspberrypi-kernel-headers", with version numbers like 20230411.

There is not, as far as I can tell, a reasonable way to map that
to a kernel version short of reaching out and digging around in
the changelogs or Makefile, so just special-case it so the packages
don't fail to install at install time. They still might not build
if the versions don't match, but I don't see a way to do anything
about that...

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #14745
Closes #14747

14 months agoFix VERIFY(!zil_replaying(zilog, tx)) panic
Pawel Jakub Dawidek [Mon, 17 Apr 2023 23:42:09 +0000 (08:42 +0900)]
Fix VERIFY(!zil_replaying(zilog, tx)) panic

The zfs_log_clone_range() function is never called from the
zfs_clone_range_replay() function, so I assumed it is safe to assert
that zil_replaying() is never TRUE here. It turns out zil_replaying()
also returns TRUE when the sync property is set to disabled.

Fix the problem by just returning if zil_replaying() returns TRUE.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reported by: Florian Smeets
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14758

14 months agoMinor improvements to zpoolconcepts.7
dodexahedron [Thu, 13 Apr 2023 16:15:34 +0000 (09:15 -0700)]
Minor improvements to zpoolconcepts.7

 * Fixed one typo (effects -> affects)
 * Re-worded raidz description to make it clearer that it is not
   quite the same as RAID5, though similar
 * Clarified that data is not necessarily written in a static
   stripe width
 * Minor grammar consistency improvement
 * Noted that "volumes" means zvols
 * Fixed a couple of split infinitives
 * Clarified that hot spares come from the same pool they were
   assigned to
* "we" -> ZFS
* Fixed warnings thrown by mandoc, and removed unnecessary
  wordiness in one fixed line.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brandon Thetford <brandon@dodecatec.com>
Closes #14726

14 months agoLinux 6.3 compat: Fix memcpy "detected field-spanning write" error
youzhongyang [Thu, 13 Apr 2023 16:12:03 +0000 (12:12 -0400)]
Linux 6.3 compat: Fix memcpy "detected field-spanning write" error

Add a new union member of flexible array to dnode_phys_t and use
it in the macro so we can silence the memcpy() fortify error.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes #14737

14 months agoFix data corruption when cloning embedded blocks
Pawel Jakub Dawidek [Wed, 12 Apr 2023 23:15:05 +0000 (08:15 +0900)]
Fix data corruption when cloning embedded blocks

Don't overwrite blk_phys_birth, as for embedded blocks it is part of
the payload.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Issue #13392
Closes #14739

14 months agoinitramfs: source user scripts from /e/z/initramfs-tools-load-key{,.d/*}
наб [Wed, 12 Apr 2023 17:08:49 +0000 (19:08 +0200)]
initramfs: source user scripts from /e/z/initramfs-tools-load-key{,.d/*}

By dropping in a file in a directory (for packages) or by making a file
(for local administrators), custom key loading methods may be provided
for the rootfs and necessities.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Nicholas Morris <security@niwamo.com>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Co-authored-by: Nicholas Morris <security@niwamo.com>
Supersedes: #14704
Closes: #13757
Closes #14733

14 months agoFix in check_filesystem()
George Amanakis [Wed, 12 Apr 2023 15:53:53 +0000 (17:53 +0200)]
Fix in check_filesystem()

Fix the code in case of missing snapshots. Previously the check was in
a conditional that would be executed if the filesystem had snapshots.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14735

14 months agoTrim needless zeroes from checksum events
Alan Somers [Mon, 10 Apr 2023 21:24:27 +0000 (15:24 -0600)]
Trim needless zeroes from checksum events

The ereport.fs.zfs.checksum event contains histograms of the bits that
were wrongly set or cleared according to their bit position in a 64-bit
word.  So the maximum value that any histogram bucket could have would
be 64.  But ZFS currently uses a uint32_t to hold each bucket.  As a
result, the event report is full of needless zeroes.

Change the bucket size to uint8_t, stripping 768 needless zeros from
each event.

Original event format:
```
 class=ereport.fs.zfs.checksum ena=639460469834258433 pool=testpool.1933 pool_guid=4979719877084416563 pool_state=0 pool_context=0 pool_failmode=wait vdev_guid=4136721804819128578 vdev_type=file vdev_path=/tmp/kyua.1TxP3A/2/work/file1.1933 vdev_ashift=9 vdev_complete_ts=609837019678 vdev_delta_ts=33450 vdev_read_errors=0 vdev_write_errors=0 vdev_cksum_errors=20 vdev_delays=0 parent_guid=2751977006639883417 parent_type=raidz vdev_spare_guids= zio_err=0 zio_flags=1048752 zio_stage=4194304 zio_pipeline=65011712 zio_delay=0 zio_timestamp=0 zio_delta=0 zio_priority=4 zio_offset=702976 zio_size=1024 zio_objset=24 zio_object=0 zio_level=3 zio_blkid=0 bad_ranges=0000000000000400 bad_ranges_min_gap=8 bad_range_sets=0000079e bad_range_clears=00000854 bad_set_histogram=000000210000001a000000150000001d000000240000001b000000220000001b000000210000002100000018000000260000002300000025000000210000001e000000250000001b0000001d0000001e0000001600000025000000180000001b000000240000001b000000240000001b0000001c000000210000001b0000001e000000210000001a0000001e000000220000001d0000001b000000200000001f0000001a000000250000001f0000001d0000001b0000001d000000240000001d0000001b0000001b0000001f00000024000000190000001a0000001f0000001e000000240000001e0000002400000021000000200000001d0000001d00000021 bad_cleared_histogram=000000220000002700000021000000210000001b0000001a000000250000001f0000001c0000001e0000002400000022000000220000002400000022000000240000002200000021000000220000001b0000002100000021000000190000001b000000240000002400000020000000290000002a00000028000000250000002400000020000000270000002500000016000000270000001c000000210000001f000000240000001c0000002100000022000000240000002100000023000000210000002700000022000000240000001b00000022000000210000001c00000023000000150000002600000020000000270000001e0000001d0000002400000026 time=00000016806457270000000323406839 eid=458
```

New format:
```
 class=ereport.fs.zfs.checksum ena=96599319807790081 pool=testpool.1933 pool_guid=1236902063710799041 pool_state=0 pool_context=0 pool_failmode=wait vdev_guid=2774253874431514999 vdev_type=file vdev_path=/tmp/kyua.6Temlq/2/work/file1.1933 vdev_ashift=9 vdev_complete_ts=92124283803 vdev_delta_ts=46670 vdev_read_errors=0 vdev_write_errors=0 vdev_cksum_errors=20 vdev_delays=0 parent_guid=8090931855087882905 parent_type=raidz vdev_spare_guids= zio_err=0 zio_flags=1048752 zio_stage=4194304 zio_pipeline=65011712 zio_delay=0 zio_timestamp=0 zio_delta=0 zio_priority=4 zio_offset=1028608 zio_size=512 zio_objset=0 zio_object=0 zio_level=0 zio_blkid=4 bad_ranges=0000000000000200 bad_ranges_min_gap=8 bad_range_sets=0000061f bad_range_clears=000001f4 bad_set_histogram=1719161c1c1c101618171a151a1a19161e1c171d1816161c191f1a18192117191c131d171b1613151a171419161a1b1319101b14171b18151e191a1b141a1c17 bad_cleared_histogram=06090a0808070a0b020609060506090a01090a050a0a0509070609080d050d0607080d060507080c04070807070a0608020c080c080908040808090a05090a07 time=00000016806477050000000604157480 eid=62
```

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Alan Somers <asomers@FreeBSD.org>
Sponsored-by: Axcient
Closes #14716

14 months agoLinux 6.3 compat: idmapped mount API changes
youzhongyang [Mon, 10 Apr 2023 21:15:36 +0000 (17:15 -0400)]
Linux 6.3 compat: idmapped mount API changes

Linux kernel 6.3 changed a bunch of APIs to use the dedicated idmap
type for mounts (struct mnt_idmap), we need to detect these changes
and make zfs work with the new APIs.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes #14682

14 months agomodule: freebsd: fix aarch64 fpu handling
Kyle Evans [Fri, 7 Apr 2023 00:40:34 +0000 (00:40 +0000)]
module: freebsd: fix aarch64 fpu handling

Just like x86, aarch64 needs to use the fpu_kern(9) API around FPU
usage, otherwise we panic promptly at boot as soon as ZFS attempts to
do checksum benchmarking.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Kyle Evans <kevans@FreeBSD.org>
Closes #14715

14 months agomodule: resync part of Makefile.bsd
Kyle Evans [Fri, 7 Apr 2023 00:40:34 +0000 (00:40 +0000)]
module: resync part of Makefile.bsd

sha256-armv8.S and sha512-armv8.S need the same treatment as the sse
bits; removal of -mgeneral-regs-only from flags.

This fixes errors about requiring NEON, which is a difference in clang
vs. gcc treatment of -mgeneral-regs-only being specified on asm files.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Kyle Evans <kevans@FreeBSD.org>
Closes #14715

14 months agolibzfs: add v2 iterator interfaces
Rob N [Mon, 10 Apr 2023 18:53:02 +0000 (04:53 +1000)]
libzfs: add v2 iterator interfaces

f6a0dac84 modified the zfs_iter_* functions to take a new "flags"
parameter, and introduced a variety of flags to ask the kernel to limit
the results in various ways, reducing the amount of work the caller
needed to do to filter out things they didn't need.

Unfortunately this change broke the ABI for existing clients (read:
older versions of the `zfs` program), and was reverted 399b98198.

dc95911d2 reintroduced the original patch, with the understanding that a
backwards-compatible fix would be made before the 2.2 release branch was
tagged. This commit is that fix.

This introduces zfs_iter_*_v2 functions that have the new flags
argument, and reverts the existing functions to not have the flags
parameter, as they were before. The old functions are now reimplemented
in terms of the new, with flags set to 0.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Original-patch-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Closes #14597

14 months agovdev: expose zfs_vdev_max_ms_shift as a module parameter
Rob N [Thu, 6 Apr 2023 17:52:50 +0000 (03:52 +1000)]
vdev: expose zfs_vdev_max_ms_shift as a module parameter

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Seagate Technology LLC
Closes #14719

14 months agoFix typo in check_clones()
George Amanakis [Thu, 6 Apr 2023 17:46:18 +0000 (19:46 +0200)]
Fix typo in check_clones()

Run kmem_free() after zap_cursor_fini().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Adam Moss <c@yotes.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14702

14 months agoZTS: add existing tests to runfiles
Damian Szuberski [Thu, 6 Apr 2023 17:43:24 +0000 (03:43 +1000)]
ZTS: add existing tests to runfiles

Some test cases were committed to the repository but never added to
runfiles.
Move `zfs_unshare_008_pos` to the Linux-only runfile.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: szubersk <szuberskidamian@gmail.com>
Closes #14701

14 months agoZTS: Use inbuilt monotonic time
Andrew Innes [Thu, 6 Apr 2023 17:40:23 +0000 (01:40 +0800)]
ZTS: Use inbuilt monotonic time

Make the test runner try to use the included python monotonic time
function instead of calling librt.

This makes the test runner work on macos where librt wasn't available.

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Andrew Innes <andrew.c12@gmail.com>
Closes #14700

14 months agoMiscellaneous FreBSD compilation bugfixes
Martin Matuška [Thu, 6 Apr 2023 17:35:02 +0000 (19:35 +0200)]
Miscellaneous FreBSD compilation bugfixes

Add missing machine/md_var.h to spl/sys/simd_aarch64.h and
spl/sys/simd_arm.h

In spl/sys/simd_x86.h, PCB_FPUNOSAVE exists only on amd64, use PCB_NPXNOSAVE
on i386

In FreeBSD sys/elf_common.h redefines AT_UID and AT_GID on FreeBSD, we need
a hack in vnode.h similar to Linux. sys/simd.h needs to be included early.

In zfs_freebsd_copy_file_range() we pass a (size_t *)lenp to
zfs_clone_range() that expects a (uint64_t *)

Allow compiling armv6 world by limiting ARM macros in sha256_impl.c and
sha512_impl.c to __ARM_ARCH > 6

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Reviewed-by: Signed-off-by: WHR <msl0000023508@gmail.com>
Signed-off-by: Martin Matuska <mm@FreeBSD.org>
Closes #14674

14 months agovdev: expose zfs_vdev_def_queue_depth as a module parameter
Rob N [Thu, 6 Apr 2023 17:31:19 +0000 (03:31 +1000)]
vdev: expose zfs_vdev_def_queue_depth as a module parameter

It was previously available only to FreeBSD.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Seagate Technology LLC
Closes #14718

14 months agoStorage device expansion "silently" fails on degraded vdev
Paul Dagnelie [Thu, 6 Apr 2023 17:29:27 +0000 (10:29 -0700)]
Storage device expansion "silently" fails on degraded vdev

When a vdev is degraded or faulted, we refuse to expand it when doing
online -e. However, we also don't actually cause the online command
to fail, even though the disk didn't expand. This is confusing and
misleading, and can result in violated expectations.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes 14145

14 months agoFix some signedness issues in arc_evict()
Alexander Motin [Wed, 5 Apr 2023 17:42:22 +0000 (02:42 +0900)]
Fix some signedness issues in arc_evict()

It may happen that "wanted total ARC size" (wt) is negative, that was
expected.  But multiplication product of it and unsigned fractions
result in unsigned value, incorrectly shifted right with a sing loss.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by:   iXsystems, Inc.
Closes #14692

14 months agoLinux 6.3 compat: writepage_t first arg struct folio*
youzhongyang [Wed, 5 Apr 2023 17:01:38 +0000 (13:01 -0400)]
Linux 6.3 compat: writepage_t first arg struct folio*

The type def of writepage_t in kernel 6.3 is changed to take
struct folio* as the first argument. We need to detect this
change and pass correct function to write_cache_pages().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes #14699

14 months agoFix "Add colored output to zfs list"
Tino Reichardt [Wed, 5 Apr 2023 16:57:01 +0000 (18:57 +0200)]
Fix "Add colored output to zfs list"

Running `zfs list -o avail rpool` resulted in a core dump.
This commit will fix this.

Run the needed overhead only, when `use_color()` is true.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes #14712

14 months agocontrib: dracut: fix race with root=zfs:dset when necessities required
наб [Fri, 31 Mar 2023 16:47:48 +0000 (18:47 +0200)]
contrib: dracut: fix race with root=zfs:dset when necessities required

This had always worked in my testing, but a user on hardware reported
this to happen 100%, and I reproduced it once with cold VM host caches.

dracut-zfs-generator runs as a systemd generator, i.e. at Some
Relatively Early Time; if root= is a fixed dataset, it tries to
"solve [necessities] statically at generation time".

If by that point zfs-import.target hasn't popped (because the import is
taking a non-negligible amount of time for whatever reason), it'll see
no children for the root datase, and as such generate no mounts.

This has never had any right to work. No-one caught this earlier because
it's just that much more convenient to have root=zfs:AUTO, which orders
itself properly.

To fix this, always run zfs-nonroot-necessities.service;
this additionally simplifies the implementation by:
  * making BOOTFS from zfs-env-bootfs.service be the real, canonical,
    root dataset name, not just "whatever the first bootfs is",
    and only set it if we're ZFS-booting
  * zfs-{rollback,snapshot}-bootfs.service can use this instead of
    re-implementing it
  * having zfs-env-bootfs.service also set BOOTFSFLAGS
  * this means the sysroot.mount drop-in can be fixed text
  * zfs-nonroot-necessities.service can also be constant and always
    enabled, because it's conditioned on BOOTFS being set

There is no longer any code generated at run-time
(the sysroot.mount drop-in is an unavoidable gratuitous cp).

The flow of BOOTFS{,FLAGS} from zfs-env-bootfs.service to sysroot.mount
is not noted explicitly in dracut.zfs(7), because (a) at some point it's
just visual noise and (b) it's already ordered via d-p-m.s from z-i.t.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #14690

14 months agolinux 6.3 compat: needs REQ_PREFLUSH | REQ_OP_WRITE
youzhongyang [Fri, 31 Mar 2023 16:46:22 +0000 (12:46 -0400)]
linux 6.3 compat: needs REQ_PREFLUSH | REQ_OP_WRITE

Modify bio_set_flush() so if kernel version is >= 4.10, flags
REQ_PREFLUSH and REQ_OP_WRITE are set together.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes #14695

14 months agoUse vmem_zalloc to silence allocation warning
Brian Behlendorf [Fri, 31 Mar 2023 16:43:54 +0000 (09:43 -0700)]
Use vmem_zalloc to silence allocation warning

The kmem allocation in zfs_prune_aliases() will trigger a large
allocation warning on systems with 64K pages.  Resolve this by
switching to vmem_alloc() which internally uses kvmalloc() so the
right allocator will be used based on the allocation size.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8491
Closes #14694

14 months agoLinux 6.2 compat: META
Tony Hutter [Wed, 29 Mar 2023 22:39:36 +0000 (15:39 -0700)]
Linux 6.2 compat: META

Update the META file to reflect compatibility with the 6.2 kernel.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #14689

14 months agoFixes in persistent error log
George Amanakis [Tue, 28 Mar 2023 23:51:58 +0000 (01:51 +0200)]
Fixes in persistent error log

Address the following bugs in persistent error log:

1) Check nested clones, eg "fs->snap->clone->snap2->clone2".

2) When deleting files containing error blocks in those clones (from
   "clone" the example above), do not break the check chain.

3) When deleting files in the originating fs before syncing the errlog
   to disk, do not break the check chain. This happens because at the
   time of introducing the error block in the error list, we do not have
   its birth txg and the head filesystem. If the original file is
   deleted before the error list is synced to the error log (which is
   when we actually lookup the birth txg and the head filesystem), then
   we do not have access to this info anymore and break the check chain.

The most prominent change is related to achieving (3). We expand the
spa_error_entry_t structure to accommodate the newly introduced
zbookmark_err_phys_t structure (containing the birth txg of the error
block).Due to compatibility reasons we cannot remove the
zbookmark_phys_t structure and we also need to place the new structure
after se_avl, so it is not accounted for in avl_find(). Then we modify
spa_log_error() to also provide the birth txg of the error block. With
these changes in place we simplify the previously introduced function
get_head_and_birth_txg() (now named get_head_ds()).

We chose not to follow the same approach for the head filesystem (thus
completely removing get_head_ds()) to avoid introducing new lock
contentions.

The stack sizes of nested functions (as measured by checkstack.pl in the
linux kernel) are:
check_filesystem [zfs]: 272 (was 912)
check_clones [zfs]: 64

We also introduced two new tests covering the above changes.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14633

14 months agoFix short-lived txg caused by autotrim
Kevin Jin [Tue, 28 Mar 2023 15:43:41 +0000 (11:43 -0400)]
Fix short-lived txg caused by autotrim

Current autotrim causes short-lived txg through:

1. calling txg_wait_synced() in metaslab_enable()
2. calling txg_wait_open() with should_quiesce = true

This patch addresses all the issues mentioned above.

A new cv, vdev_autotrim_kick_cv is added to kick autotrim activity.
It will be signaled once a txg is synced so that it does not change
the original autotrim pace. Also because it is a cv, the wait is
interruptible which speeds up the vdev_autotrim_stop_wait() call.

Finally, combining big zfs_txg_timeout, txg_wait_open() also causes
delay when exporting a pool.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: jxdking <lostking2008@hotmail.com>
Issue #8993
Closes #12194

14 months agoAdditional limits on hole reporting
Brian Behlendorf [Tue, 28 Mar 2023 15:19:03 +0000 (08:19 -0700)]
Additional limits on hole reporting

Holding the zp->z_rangelock as a RL_READER over the range
0-UINT64_MAX is sufficient to prevent the dnode from being
re-dirtied by concurrent writers.  To avoid potentially
looping multiple times for external caller which do not
take the rangelock holes are not reported after the first
sync.  While not optimal this is always functionally correct.

This change adds the missing rangelock calls on FreeBSD to
zvol_cdev_ioctl().

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14512
Closes #14641

14 months agoRevert "Do not hold spa_config in ZIL while blocked on IO"
George Wilson [Tue, 28 Mar 2023 15:13:32 +0000 (11:13 -0400)]
Revert "Do not hold spa_config in ZIL while blocked on IO"

This reverts commit 7d638df09be7482935bcf6ec8e4ea2ac8a8be1a8.

Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <gwilson@delphix.com>
Closes #14678