]> git.proxmox.com Git - mirror_zfs.git/log
mirror_zfs.git
20 months agoAdd SHA2 SIMD feature tests for Linux
Tino Reichardt [Wed, 28 Sep 2022 08:54:02 +0000 (10:54 +0200)]
Add SHA2 SIMD feature tests for Linux

These are added:
- zfs_neon_available() for arm and aarch64
- zfs_sha256_available() for arm and aarch64
- zfs_sha512_available() for aarch64
- zfs_shani_available() for x86_64

Tested-by: Rich Ercolani <rincebrain@gmail.com>
Tested-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Co-Authored-By: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Closes #13741

20 months agoAdd SHA2 SIMD feature tests for FreeBSD
Tino Reichardt [Wed, 28 Sep 2022 08:53:18 +0000 (10:53 +0200)]
Add SHA2 SIMD feature tests for FreeBSD

These are added:
- zfs_neon_available() for arm and aarch64
- zfs_sha256_available() for arm and aarch64
- zfs_sha512_available() for aarch64
- zfs_shani_available() for x86_64

Changes:
- simd_powerpc.h: change license from CDDL to BSD

Tested-by: Rich Ercolani <rincebrain@gmail.com>
Tested-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes #13741

20 months agoAdd ARM architecture to OpenZFS buildsystem
Tino Reichardt [Wed, 28 Sep 2022 08:55:04 +0000 (10:55 +0200)]
Add ARM architecture to OpenZFS buildsystem

Tested-by: Rich Ercolani <rincebrain@gmail.com>
Tested-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes #13741

20 months agoRemove old or redundant SHA2 files
Tino Reichardt [Mon, 27 Feb 2023 15:11:51 +0000 (16:11 +0100)]
Remove old or redundant SHA2 files

We had three sha2.h headers in different places.
The FreeBSD version, the Linux version and the generic solaris version.

The only assembly used for acceleration was some old x86-64 openssl
implementation for sha256 within the icp module.

For FreeBSD the whole SHA2 files of FreeBSD were copied into OpenZFS,
these files got removed also.

Tested-by: Rich Ercolani <rincebrain@gmail.com>
Tested-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes #13741

20 months agozdb: add decryption support
Rob N [Thu, 2 Mar 2023 21:39:09 +0000 (08:39 +1100)]
zdb: add decryption support

The approach is straightforward: for dataset ops, if a key was offered,
find the encryption root and the various encryption parameters, derive a
wrapping key if necessary, and then unlock the encryption root. After
that all the regular dataset ops will return unencrypted data, and
that's kinda the whole thing.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #11551
Closes #12707
Closes #14503

20 months agoSystem-wide speculative prefetch limit.
Alexander Motin [Wed, 1 Mar 2023 23:27:40 +0000 (18:27 -0500)]
System-wide speculative prefetch limit.

With some pathological access patterns it is possible to make ZFS
accumulate almost unlimited amount of speculative prefetch ZIOs.
Combined with linear ABD allocations in RAIDZ code, it appears to
be possible to exhaust system KVA, triggering kernel panic.

Address this by introducing a system-wide counter of active prefetch
requests and blocking prefetch distance doubling per stream hits if
the number of active requests is higher that ~6% of ARC size.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14516

20 months agoAccommodate debug-kernel stack frame size
Brian Behlendorf [Wed, 1 Mar 2023 22:40:45 +0000 (14:40 -0800)]
Accommodate debug-kernel stack frame size

The blk_queue_discard() and blk_queue_sector_erase() functions
slightly exceed the allowed 4096 maximum stack frame size when
building with the RedHat debug kernel which causes their
configure checks to fail.

Add an exception for these two tests so the interfaces are
correctly detected.

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14540

20 months agoFix data race between zil_commit() and zil_suspend()
Richard Yao [Wed, 1 Mar 2023 21:23:09 +0000 (16:23 -0500)]
Fix data race between zil_commit() and zil_suspend()

openzfsonwindows/openzfs#206 found that it is possible to trip
`VERIFY(list_is_empty(&lwb->lwb_itxs))` when a `zil_commit()` is delayed
by the scheduler long enough for a parallel `zil_suspend()` operation to
exit `zil_commit_impl()`. This is a data race. To prevent this, we
introduce a `zilog->zl_suspend_lock` rwlock to ensure that all
outstanding `zil_commit()` operations finish before `zil_suspend()`
begins and that subsequent operations fallback to `txg_wait_synced()`
after `zil_suspend()` has begun.

On `PREEMPT_RT` Linux kernels, the `rw_enter()` implementation suffers
from writer starvation. This means that a ZIL intensive system can delay
`zil_suspend()` indefinitely. This is a pre-existing problem that
affects everything that uses rw locks, so it needs to be addressed in
the SPL.  However, builds against `PREEMPT_RT` Linux kernels are
currently broken due to a GPL symbol issue (#11097), so we can safely
disregard that issue for now.

Reported-by: Arun KV <arun.kv@datacore.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14514

20 months agoRemove bad kmem_free() oversight from previous zfsdev_state_list patch
Richard Yao [Wed, 1 Mar 2023 21:20:53 +0000 (16:20 -0500)]
Remove bad kmem_free() oversight from previous zfsdev_state_list patch

I forgot to remove the corresponding kmem_free() from zfs_kmod_fini() in
9a14ce43c3d6a9939804215bbbe66de5115ace42. Clang's static analyzer did
not complain, but the Coverity scan that was run after the patch was
merged did.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reported-by: Coverity (CID-1535275)
Closes #14556

20 months agoLinux: zfs_fillpage() should handle partial pages from end of file
Richard Yao [Wed, 1 Mar 2023 21:19:47 +0000 (16:19 -0500)]
Linux: zfs_fillpage() should handle partial pages from end of file

After 89cd2197b94986d315b9b1be707b645baf59af4f was merged, Clang's
static analyzer began complaining about a dead assignment in
`zfs_fillpage()`. Upon inspection, I noticed that the dead assignment
was because we are not using the calculated io_len that we should use to
avoid asking the DMU to read past the end of a file. This should result
in `dmu_buf_hold_array_by_dnode()` calling `zfs_panic_recover()`.

This issue predates 89cd2197b94986d315b9b1be707b645baf59af4f, but its
simplification of zfs_fillpage() eliminated the only use of the
assignment to io_len, which made Clang's static analyzer complain about
the issue.

Also, as a precaution, we add an assertion that io_offset < i_size. If
this ever fails, bad things will happen. Otherwise, we are blindly
trusting the kernel not to give us invalid offsets. We continue to
blindly trust it on non-debug kernels.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14534

20 months agoUpdate reclaim disk space script
Tino Reichardt [Wed, 1 Mar 2023 20:25:09 +0000 (21:25 +0100)]
Update reclaim disk space script

The script uses systemd-run, which does the job in background.
We should take the the time and wait for the job to finish.

Maybe some functional tests suffer from not really freed disk
space and fail because of this.

We also add some trimming in the end of the script.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes #14554

20 months agoHandle unexpected errors in zil_lwb_commit() without ASSERT()
Richard Yao [Wed, 1 Mar 2023 17:39:41 +0000 (12:39 -0500)]
Handle unexpected errors in zil_lwb_commit() without ASSERT()

We tripped `ASSERT(error == ENOENT || error == EEXIST || error ==
EALREADY)` in `zil_lwb_commit()` at Klara when doing robustness testing
of ZIL against drive power cycles.

That assertion presumably exists because when this code was written, the
only errors expected from here were EIO, ENOENT, EEXIST and EALREADY,
with EIO having its own handling before the assertion. However, upon
doing a manual depth first search traversal of the source tree, it turns
out that a large number of unexpected errors are possible here. In
theory, EINVAL and ENOSPC can come from dnode_hold_impl(). However, most
unexpected errors originate in the block layer and come to us from
zio_wait() in various ways. One way is ->zl_get_data() -> dmu_buf_hold()
-> dbuf_read() -> zio_wait().

From vdev_disk.c on Linux alone, zio_wait() can return the unexpected
errors ENXIO, ENOTSUP, EOPNOTSUPP, ETIMEDOUT, ENOSPC, ENOLINK,
EREMOTEIO, EBADE, ENODATA, EILSEQ and ENOMEM

This was only observed after what have been likely over 1000 test
iterations, so we do not expect to reproduce this again to find out what
the error code was. However, circumstantial evidence suggests that the
error was ENXIO.

When ENXIO or any other unexpected error occurs, the `fsync()` or
equivalent operation that called zil_commit() will return success, when
in fact, dirty data has not been committed to stable storage. This is a
violation of the Single UNIX Specification.

The code should be able to handle this and any other unknown error by
calling `txg_wait_synced()`. In addition to changing the code to call
txg_wait_synced() on unexpected errors instead of returning, we modify
it to print information about unexpected errors to dmesg.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@klarasystems.com>
Sponsored-By: Wasabi Technology, Inc.
Closes #14532

20 months agomancheck: exclude dotfiles in man dir
Rob N [Wed, 1 Mar 2023 17:38:31 +0000 (04:38 +1100)]
mancheck: exclude dotfiles in man dir

Its not uncommon for an editor to drop a hidden swap file in the dir
while editing a file there. mancheck would find it and run mandoc on it,
which would complain about its distinctly not-manpage format.

A more correct solution might be to reconfigure the editor to not put
swap files in the same dir, but its the default a lot of the time, and
this is a very small change that gives a very nice quality-of-life
improvement.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #14549

20 months agoman: note that zdb operates directly on pool storage
Rob N [Wed, 1 Mar 2023 01:35:33 +0000 (12:35 +1100)]
man: note that zdb operates directly on pool storage

A frequent misunderstanding is that zdb accesses the pool through the
kernel or filesystem, leading to confusion particularly when it can't
access something that it seems like it should be able to.

I've seen this confusion recently when zdb couldn't access a pool because
the user didn't have permission to read directly from the block devices,
and when it couldn't show attributes of encrypted files even though the
dataset was unlocked.

The manpage already speaks to another symptom of this, namely that zdb
may "behave erratically" on an active pool; here I'm trying to make that
a little more explicit.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #14539

20 months agoMakefile.bsd: cleanup and sync with FreeBSD
Allan Jude [Wed, 1 Mar 2023 01:33:59 +0000 (20:33 -0500)]
Makefile.bsd: cleanup and sync with FreeBSD

xxhash.c was not being compiled, so when FreeBSD's kernel
switched to a newer version of ZSTD a few weeks ago, out-of-tree ZFS
failed to build

Sync module/Makefile.bsd with FreeBSD's sys/modules/zfs/Makefile

And restore the alphabetical sort in a number of places

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@klarasystems.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Sponsored-by: Klara, Inc.
Closes #14508

20 months agoSuppress static analyzer warning in dmu_objset_create_impl_dnstats()
Richard Yao [Tue, 7 Feb 2023 10:43:05 +0000 (05:43 -0500)]
Suppress static analyzer warning in dmu_objset_create_impl_dnstats()

Clang's static analyzer claims that dereferencing ds in
dmu_objset_create_impl_dnstats() could cause a NULL pointer dereference
when a previous NULL check confirms that it is NULL. It is only NULL on
the MOS, for which dmu_objset_userused_enabled(os) should always return
false, so ds will never be dereferenced when it is NULL. We add an
assertion to suppress this warning.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14470

20 months agoSuppress static analyzer warning in dbuf_hold_copy()
Richard Yao [Tue, 7 Feb 2023 09:51:56 +0000 (04:51 -0500)]
Suppress static analyzer warning in dbuf_hold_copy()

Clang's static analyzer claims that dbuf_hold_copy() will have a NULL
pointer dereference in data->b_data when called by dbuf_hold_impl().
This is impossible because data is dr->dt.dl.dr_data, which is non-NULL
whenever db->db_level == 0, which is always the case whenever
dbuf_hold_impl() calls dbuf_hold_copy(). We add an assertion to suppress
the complaint.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14470

20 months agoStatically allocate first node of zfsdev_state_list
Richard Yao [Tue, 7 Feb 2023 08:23:45 +0000 (03:23 -0500)]
Statically allocate first node of zfsdev_state_list

This avoids a call to kmem_alloc() during module load. It also
suppresses a defect report from Clang's static analyzer that claims that
we will have a NULL pointer dereference in zfsdev_state_init() because
it does not understand that this has already been allocated in
zfs_kmod_init().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14470

20 months agoSuppress static analyzer warning in sa_attr_iter()
Richard Yao [Tue, 7 Feb 2023 08:02:10 +0000 (03:02 -0500)]
Suppress static analyzer warning in sa_attr_iter()

Clang's static analyzer points out that when IS_SA_BONUSTYPE(type) is
true and .sa_length is 0 for an attribute, we have a NULL pointer
dereference. We suppress this with an IMPLY() statement.

This was also identified by Coverity.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reported-by: Coverity (CID-1017954)
Closes #14470

20 months agoSuppress static analyzer warnings in zio_checksum_error_impl()
Richard Yao [Tue, 7 Feb 2023 07:38:45 +0000 (02:38 -0500)]
Suppress static analyzer warnings in zio_checksum_error_impl()

Clang's static analyzer informs us of multiple NULL pointer dereferences
involving zio_checksum_error_impl().

The first is a NULL pointer dereference if bp is NULL and ci->ci_flags &
ZCHECKSUM_FLAG_EMBEDDED is false, but bp is NULL implies that
ci->ci_flags & ZCHECKSUM_FLAG_EMBEDDED is true, so we add an IMPLY()
statement to suppress the report.

The second and third are identical, and are duplicated because while the
NULL pointer dereference occurs in zio_checksum_gang_verifier(), it is
called by zio_checksum_error_impl() and there is a report for each of
the two functions. The reports state that when bp is NULL, ci->ci_flags
& ZCHECKSUM_FLAG_EMBEDDED is true and checksum is not
ZIO_CHECKSUM_LABEL, we also have a NULL pointer dereference. bp is NULL
should imply that checksum == ZIO_CHECKSUM_LABEL, so we add an IMPLY()
statement to suppress the second report. The two reports are
functionally identical.

A fourth variation of this was also reported by Coverity. It occurs when
checksum == ZIO_CHECKSUM_ZILOG2.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reported-by: Coverity (CID-1524672)
Closes #14470

20 months agoicp: Prevent compilers from optimizing away memset() in gcm_clear_ctx()
Richard Yao [Wed, 1 Mar 2023 01:28:50 +0000 (20:28 -0500)]
icp: Prevent compilers from optimizing away memset() in gcm_clear_ctx()

The recently merged f58e513f7408f353bf0151fdaf235d4e062e8950 was
intended to zero sensitive data before exit from encryption
functions to harden the code against theoretical information
leaks. Unfortunately, the method by which it did that is
optimized away by the compiler, so some information still leaks. This
was confirmed by counting function calls in disassembly.

After studying how the OpenBSD, FreeBSD and Linux kernels handle this,
and looking at our disassembly, I decided on a two-factor approach to
protect us from compiler dead store elimination passes.

The first factor is to stop trying to inline gcm_clear_ctx(). GCC does
not actually inline it in the first place, and testing suggests that
dead store elimination passes appear to become more powerful in a bad
way when inlining is forced, so we recognize that and move
gcm_clear_ctx() to a C file.

The second factor is to implement an explicit_memset() function based on
the technique used by `secure_zero_memory()` in FreeBSD's blake2
implementation, which coincidentally is functionally identical to the
one used by Linux. The source for this appears to be a LLVM bug:

https://llvm.org/bugs/show_bug.cgi?id=15495

Unlike both FreeBSD and Linux, we explicitly avoid the inline keyword,
based on my observations that GCC's dead store elimination pass becomes
more powerful when inlining is forced, under the assumption that it will
be equally powerful when the compiler does decide to inline function
calls.

Disassembly of GCC's output confirms that all 6 memset() calls are
executed with this patch applied.

Reviewed-by: Attila Fülöp <attila@fueloep.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14544

20 months agoLinux: Assert mutex is held in mutex_exit()
Richard Yao [Wed, 1 Mar 2023 01:27:20 +0000 (20:27 -0500)]
Linux: Assert mutex is held in mutex_exit()

A spurious mutex_exit() in a development branch caused weird issues
until I identified it. An assertion prior to mutex_exit() would have
caught it. Rather than adding assertions before invocations of
mutex_exit() in the code, let us simply add an assertion to
mutex_exit(). It is cheap and will likely improve developer
productivity.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Richard Yao <richard.yao@klarasystems.com>
Sponsored-By: Wasabi Technology, Inc.
Closes #14541

20 months agoRevert zfeature_active() to static
George Amanakis [Tue, 28 Feb 2023 22:03:52 +0000 (23:03 +0100)]
Revert zfeature_active() to static

Commit 34ce4c4 made zfeature_active() non-static. This is not required.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14546

20 months agoFix github build failures because of vdevprops.7
Tino Reichardt [Tue, 28 Feb 2023 22:02:48 +0000 (23:02 +0100)]
Fix github build failures because of vdevprops.7

This small fix adds the manpage vdevprops.7 to the file
contrib/debian/openzfs-zfsutils.install and the github
actions will work again.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes #14553

20 months agoPrevent incorrect datasets being mounted
John Poduska [Tue, 28 Feb 2023 00:49:34 +0000 (19:49 -0500)]
Prevent incorrect datasets being mounted

During a mount, zpl_mount_impl(), uses sget() with the callback
zpl_test_super() to find a super_block with a matching objset,
stored in z_os. It does so without taking the teardown lock on
the zfsvfs.

The problem is that operations like rollback will replace the
z_os. And, there is a window where the objset in the rollback
is freed, but z_os still points to it. Then, a mount like
operation, for instance a clone, can reallocate that exact same
pointer and zpl_test_super() will then match the super_block
associated with the rollback as opposed to the clone.

This fix tests for a match and if so, takes the teardown lock
before doing the final match test.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Poduska <jpoduska@datto.com>
Closes #14518

20 months agoAdd vdevprops.7 to the Makefile
D. Ebdrup [Tue, 28 Feb 2023 00:38:09 +0000 (01:38 +0100)]
Add vdevprops.7 to the Makefile

Adding vdevprops.7 to the Makefile ensures that it gets installed
properly on FreeBSD.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Daniel Ebdrup Jensen <debdrup@FreeBSD.org>
Closes #14527

20 months agoSkip memory allocation when compressing holes
Richard Yao [Mon, 27 Feb 2023 22:41:02 +0000 (17:41 -0500)]
Skip memory allocation when compressing holes

Hole detection in the zio compression code allows us to
opportunistically skip compression on holes. We can go a step further
by not doing memory allocations on holes either.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Richard Yao <richard.yao@klarasystems.com>
Sponsored-by: Wasabi Technology, Inc.
Closes #14500

20 months agoICP: AES-GCM: Refactor gcm_clear_ctx()
Attila Fülöp [Mon, 27 Feb 2023 22:38:12 +0000 (23:38 +0100)]
ICP: AES-GCM: Refactor gcm_clear_ctx()

Currently the temporary buffer in which decryption takes place
isn't cleared on context destruction. Further in some routines we
fail to call gcm_clear_ctx() on error exit. Both flaws may result
in leaking sensitive data.

We follow best practices and zero out the plaintext buffer before
freeing the memory holding it. Also move all cleanup into
gcm_clear_ctx() and call it on any context destruction.

The performance impact should be negligible.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #14528

20 months agoMove zap_attribute_t to the heap in dsl_deadlist_merge
Mariusz Zaborski [Mon, 27 Feb 2023 22:27:58 +0000 (23:27 +0100)]
Move zap_attribute_t to the heap in dsl_deadlist_merge

In the case of a regular compilation, the compiler
raises a warning for a dsl_deadlist_merge function, that
the stack size is to large. In debug build this can
generate an error.

Move large structures to heap.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #14524

20 months agoWorkaround GitHub Action failure
Brian Behlendorf [Mon, 27 Feb 2023 17:19:25 +0000 (09:19 -0800)]
Workaround GitHub Action failure

Ubuntu 20.04 and 22.04 workflows are failing due to an error
which is hit when running `apt-get update`.  Until the
problematic package is fixed apply the suggested workaround
described here:

  https://github.com/orgs/community/discussions/47863

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14530

21 months agoUse .section .rodata instead of .rodata on FreeBSD
Dimitry Andric [Sat, 25 Feb 2023 00:45:48 +0000 (01:45 +0100)]
Use .section .rodata instead of .rodata on FreeBSD

In commit 0a5b942d4 the FreeBSD SECTION_STATIC macro was set to
".rodata". This assembler directive is supported by LLVM (as a
convenience alias for ".section .rodata") by not by GNU as.

This caused the FreeBSD builds that are done with gcc to fail.
Therefore, use ".section .rodata" instead, similar to the other
asm_linkage.h headers.

Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Attila Fülöp <attila@fueloep.org>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Dimitry Andric <dimitry@andric.com>
Closes #14526

21 months agoMove dmu_buf_rele() after dsl_dataset_sync_done()
George Amanakis [Fri, 24 Feb 2023 01:14:52 +0000 (02:14 +0100)]
Move dmu_buf_rele() after dsl_dataset_sync_done()

Otherwise the dataset may be freed after the last dmu_buf_rele() leading
to a panic.

Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14522
Closes #14523

21 months agoZTS: Minor fixes
Brian Behlendorf [Fri, 24 Feb 2023 01:10:46 +0000 (17:10 -0800)]
ZTS: Minor fixes

- The migration_012_pos.ksh test case was failing because of a
  missing space after `log_must`.

- None of the tests listed in the runfiles should include the .ksh
  suffix.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14515

21 months agoFix buffered/direct/mmap I/O race
Brian Behlendorf [Thu, 23 Feb 2023 18:57:24 +0000 (10:57 -0800)]
Fix buffered/direct/mmap I/O race

When a page is faulted in for memory mapped I/O the page lock
may be dropped before it has been read and marked up to date.
If a buffered read encounters such a page in mappedread() it
must wait until the page has been updated. Failure to do so
will result in a panic on debug builds and incorrect data on
production builds.

The critical part of this change is in mappedread() where pages
which are not up to date are now handled. Additionally, it
includes the following simplifications.

- zfs_getpage() and zfs_fillpage() could be passed an array of
  pages. This could be more efficient if it was used but in
  practice only a single page was ever provided. These
  interfaces were simplified to acknowledge that.

- update_pages() was modified to correctly set the PG_error bit
  on a page when it cannot be read by dmu_read().

- Setting PG_error and PG_uptodate was moved to zfs_fillpage()
  from zpl_readpage_common(). This is consistent with the
  handling in update_pages() and mappedread().

- Minor additional refactoring to comments and variable
  declarations to improve readability.

- Add a test case to exercise concurrent buffered, direct,
  and mmap IO to the same file.

- Reduce the mmap_sync test case default run time.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13608
Closes #14498

21 months agoFix NULL pointer dereference in zio_ready()
Richard Yao [Thu, 23 Feb 2023 18:19:08 +0000 (13:19 -0500)]
Fix NULL pointer dereference in zio_ready()

Clang's static analyzer correctly identified a NULL pointer dereference
in zio_ready() when ZIO_FLAG_NODATA has been set on a zio that is
missing a block pointer. The NULL pointer dereference occurs because we
have logic intended to disable ZIO_FLAG_NODATA when it has been set on a
gang block.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14469

21 months agoUse rw_tryupgrade() in dmu_bonus_hold_by_dnode()
Richard Yao [Thu, 23 Feb 2023 00:33:23 +0000 (19:33 -0500)]
Use rw_tryupgrade() in dmu_bonus_hold_by_dnode()

When dn->dn_bonus == NULL, dmu_bonus_hold_by_dnode() will unlock its
read lock on dn->dn_struct_rwlock and grab a write lock. This can be
micro-optimized by calling rw_tryupgrade().

Linux will not benefit from this since it does not support rwlock
upgrades, but FreeBSD will.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14517

21 months agoImprove error message of zfs redact
Paul Dagnelie [Wed, 22 Feb 2023 01:30:05 +0000 (17:30 -0800)]
Improve error message of zfs redact

We improve the error message of zfs redact by checking if the target
snapshot exists, and if all the redaction snapshots exist. As a
future improvement we could iterate over every snapshot provided and
use that to determine which one specifically doesn't exist.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #11426
Closes #14496

21 months agoFreeBSD: don't verify recycled vnode for zfs control directory
rob-wing [Wed, 22 Feb 2023 01:26:33 +0000 (16:26 -0900)]
FreeBSD: don't verify recycled vnode for zfs control directory

Under certain loads, the following panic is hit:

    panic: page fault
    KDB: stack backtrace:
    #0 0xffffffff805db025 at kdb_backtrace+0x65
    #1 0xffffffff8058e86f at vpanic+0x17f
    #2 0xffffffff8058e6e3 at panic+0x43
    #3 0xffffffff808adc15 at trap_fatal+0x385
    #4 0xffffffff808adc6f at trap_pfault+0x4f
    #5 0xffffffff80886da8 at calltrap+0x8
    #6 0xffffffff80669186 at vgonel+0x186
    #7 0xffffffff80669841 at vgone+0x31
    #8 0xffffffff8065806d at vfs_hash_insert+0x26d
    #9 0xffffffff81a39069 at sfs_vgetx+0x149
    #10 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4
    #11 0xffffffff8065a28c at lookup+0x45c
    #12 0xffffffff806594b9 at namei+0x259
    #13 0xffffffff80676a33 at kern_statat+0xf3
    #14 0xffffffff8067712f at sys_fstatat+0x2f
    #15 0xffffffff808ae50c at amd64_syscall+0x10c
    #16 0xffffffff808876bb at fast_syscall_common+0xf8

The page fault occurs because vgonel() will call VOP_CLOSE() for active
vnodes. For this reason, define vop_close for zfsctl_ops_snapshot. While
here, define vop_open for consistency.

After adding the necessary vop, the bug progresses to the following
panic:

    panic: VERIFY3(vrecycle(vp) == 1) failed (0 == 1)
    cpuid = 17
    KDB: stack backtrace:
    #0 0xffffffff805e29c5 at kdb_backtrace+0x65
    #1 0xffffffff8059620f at vpanic+0x17f
    #2 0xffffffff81a27f4a at spl_panic+0x3a
    #3 0xffffffff81a3a4d0 at zfsctl_snapshot_inactive+0x40
    #4 0xffffffff8066fdee at vinactivef+0xde
    #5 0xffffffff80670b8a at vgonel+0x1ea
    #6 0xffffffff806711e1 at vgone+0x31
    #7 0xffffffff8065fa0d at vfs_hash_insert+0x26d
    #8 0xffffffff81a39069 at sfs_vgetx+0x149
    #9 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4
    #10 0xffffffff80661c2c at lookup+0x45c
    #11 0xffffffff80660e59 at namei+0x259
    #12 0xffffffff8067e3d3 at kern_statat+0xf3
    #13 0xffffffff8067eacf at sys_fstatat+0x2f
    #14 0xffffffff808b5ecc at amd64_syscall+0x10c
    #15 0xffffffff8088f07b at fast_syscall_common+0xf8

This is caused by a race condition that can occur when allocating a new
vnode and adding that vnode to the vfs hash. If the newly created vnode
loses the race when being inserted into the vfs hash, it will not be
recycled as its usecount is greater than zero, hitting the above
assertion.

Fix this by dropping the assertion.

FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=252700
Reviewed-by: Andriy Gapon <avg@FreeBSD.org>
Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Alek Pinchuk <apinchuk@axcient.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Submitted-by: Klara, Inc.
Sponsored-by: rsync.net
Closes #14501

21 months agoFix per-jail zfs.mount_snapshot setting
Allan Jude [Wed, 22 Feb 2023 01:23:01 +0000 (20:23 -0500)]
Fix per-jail zfs.mount_snapshot setting

When jail.conf set the nopersist flag during startup, it was
incorrectly destroying the per-jail ZFS settings.

Reported-by: Martin Matuska <mm@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Sponsored-by: Modirum MDPay
Sponsored-by: Klara, Inc.
Closes #14509

21 months agoPartially revert eee9362a7
George Amanakis [Tue, 21 Feb 2023 17:36:22 +0000 (18:36 +0100)]
Partially revert eee9362a7

With commit 34ce4c42f applied, there is no need for eee9362a7.
Revert that aside from the test. All tests introduced in those commits
pass.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14502

21 months agoSync thread should avoid holding the spa config write lock when possible
Richard Yao [Thu, 16 Feb 2023 22:10:52 +0000 (17:10 -0500)]
Sync thread should avoid holding the spa config write lock when possible

spa_sync() currently grabs the write lock due to an old hack that is
documented by a comment:

    We need the write lock here because, for aux vdevs,
    calling vdev_config_dirty() modifies sav_config.
    This is ugly and will become unnecessary when we
    eliminate the aux vdev wart by integrating all vdevs
    into the root vdev tree.

This has lead to deadlocks in rare edge cases from holding the write
lock. We can reduce incidence of these deadlocks by not grabbing the
write lock on pools without auxillary vdevs.

Sponsored-By: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@klarasystems.com>
Closes #14282

21 months agozfs redact fails when dnodesize=auto
Paul Dagnelie [Thu, 16 Feb 2023 17:23:39 +0000 (09:23 -0800)]
zfs redact fails when dnodesize=auto

Add handling to dmu_object_next for the case where *objectp == 0.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #14479

21 months agozdb: zero-pad checksum output follow up
Brian Behlendorf [Wed, 15 Feb 2023 17:06:29 +0000 (09:06 -0800)]
zdb: zero-pad checksum output follow up

Apply zero padding for checksums consistently.  The SNPRINTF_BLKPTR
macro was not updated in commit ac7648179c8 which results in the
`cli_root/zdb/zdb_checksum.ksh` test case reliably failing.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Akash B <akash-b@hpe.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14497

21 months agoSuppress Clang static analyzer complaint in zfs_replay_create()
Richard Yao [Tue, 14 Feb 2023 19:05:41 +0000 (14:05 -0500)]
Suppress Clang static analyzer complaint in zfs_replay_create()

Clang's static analyzer incorrectly complains about an undefined value
here when lr->lr_common.lrc_txtype == TX_SYMLINK and txtype ==
TX_CREATE. This is impossible, because of this line:

txtype = (lr->lr_common.lrc_txtype & ~TX_CI((uint64_t)0x1 << 63));

Changing the code to compare against txtype suppresses the report.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14472

21 months agoLinux: use filemap_range_has_page()
Brian Behlendorf [Tue, 14 Feb 2023 19:04:34 +0000 (11:04 -0800)]
Linux: use filemap_range_has_page()

As of the 4.13 kernel filemap_range_has_page() can be used to
check if there is a page mapped in a given file range.  When
available this interface should be used which eliminates the
need for the zp->z_is_mapped boolean.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14493

21 months agoGive strlcat() full buffer lengths rather than smaller buffer lengths
Richard Yao [Tue, 14 Feb 2023 19:03:42 +0000 (14:03 -0500)]
Give strlcat() full buffer lengths rather than smaller buffer lengths

strlcat() is supposed to be given the length of the destination buffer,
including the existing contents. Unfortunately, I had been overzealous
when I wrote a51288aabbbc176a8a73a8b3cd56f79607db32cf, since I gave it
the length of the destination buffer, minus the existing contents. This
likely caused a regression on large strings.

On the topic of being overzealous, the use of strlcat() in
dmu_send_estimate_fast() was unnecessary because recv_clone_name is a
fixed length string. We continue using strlcat() mostly as defensive
programming, in case the string length is ever changed, even though it
is unnecessary.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14476

21 months agoquick fix for lingering snapdir unmount problems
Rich Ercolani [Tue, 14 Feb 2023 00:40:13 +0000 (19:40 -0500)]
quick fix for lingering snapdir unmount problems

Unfortunately, even after e79b6807, I still, much more rarely,
tripped asserts when playing with many ctldir mounts at once.

Since this appears to happen if we dispatched twice too fast, just
ignore it. We don't actually need to do anything if someone already
started doing it for us.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #14462

21 months agoFix a race condition in dsl_dataset_sync() when activating features
George Amanakis [Tue, 14 Feb 2023 00:37:46 +0000 (01:37 +0100)]
Fix a race condition in dsl_dataset_sync() when activating features

The zio returned from arc_write() in dmu_objset_sync() uses
zio_nowait(). However we may reach the end of dsl_dataset_sync()
which checks if we need to activate features in the filesystem
without knowing if that zio has even run through the ZIO pipeline yet.
In that case we will flag features to be activated in
dsl_dataset_block_born() but dsl_dataset_sync() has already
completed its run and those features will not actually be activated.
Mitigate this by moving the feature activation code in
dsl_dataset_sync_done(). Also add new ASSERTs in
dsl_scan_visitbp() checking if a block contradicts any filesystem
flags.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #13816

21 months agoReduce need for contiguous memory for ioctls
Prakash Surya [Tue, 14 Feb 2023 00:35:59 +0000 (16:35 -0800)]
Reduce need for contiguous memory for ioctls

We've had cases where we trigger an OOM despite having memory freely
available on the system. For example, here, we had about 21GB free:

    kernel: Node 0 Normal: 2418758*4kB (UME) 1549533*8kB (UE) 0*16kB
    0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB =
    22071296kB

The problem being, all the memory is in 4K and 8K contiguous regions,
but the allocation request was for a 16K contiguous region:

    kernel: SafeExecutors-4 invoked oom-killer:
    gfp_mask=0x42dc0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_ZERO),
    order=2, oom_score_adj=0

The offending allocation came from this call trace:

    kernel: Call Trace:
    kernel:  dump_stack+0x57/0x7a
    kernel:  dump_header+0x4f/0x1e1
    kernel:  oom_kill_process.cold.33+0xb/0x10
    kernel:  out_of_memory+0x1ad/0x490
    kernel:  __alloc_pages_slowpath+0xd55/0xe40
    kernel:  __alloc_pages_nodemask+0x2df/0x330
    kernel:  kmalloc_large_node+0x42/0x90
    kernel:  __kmalloc_node+0x25a/0x320
    kernel:  ? spl_kmem_free_impl+0x21/0x30 [spl]
    kernel:  spl_kmem_alloc_impl+0xa5/0x100 [spl]
    kernel:  spl_kmem_zalloc+0x19/0x20 [spl]
    kernel:  zfsdev_ioctl+0x2b/0xe0 [zfs]
    kernel:  do_vfs_ioctl+0xa9/0x640
    kernel:  ? __audit_syscall_entry+0xdd/0x130
    kernel:  ksys_ioctl+0x67/0x90
    kernel:  __x64_sys_ioctl+0x1a/0x20
    kernel:  do_syscall_64+0x5e/0x200
    kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
    kernel: RIP: 0033:0x7fdca3674317

The problem is, for each ioctl that ZFS makes, it has to allocate a
zfs_cmd_t structure, which is 13744 bytes in size (on my system):

    sdb> sizeof zfs_cmd
    (size_t)13744

This size, coupled with the fact that we currently allocate it with
kmem_zalloc, means we need a 16K contiguous region of memory to satisfy
the request.

The solution taken by this change, is to use "vmem" instead of "kmem" to
do the allocation, such that we don't necessarily need a contiguous 16K
memory region to satisfy the allocation.

Arguably, a better solution would be not to require such a large
allocation to begin with (e.g. reduce the size of the zfs_cmd_t
structure), but that'd be a much larger change than this "one liner".
Thus, I've opted for this approach for now; we can always circle back
and attempt to reduce the size of the structure in the future.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Closes #14474

21 months agoImprove arc_read() error reporting
Alexander Motin [Mon, 13 Feb 2023 21:21:53 +0000 (16:21 -0500)]
Improve arc_read() error reporting

Debugging reported NULL de-reference panic in dnode_hold_impl() I found
that for certain types of errors arc_read() may only return error code,
but not properly report it via done and pio arguments.  Lack of done
calls may result in reference and/or memory leaks in higher level code.
Lack of error reporting via pio may result in unnoticed errors there.
For example, dbuf_read(), where dbuf_read_impl() ignores arc_read()
return, relies completely on the pio mechanism and missed the errors.

This patch makes arc_read() to always call done callback and always
propagate errors to parent zio, if either is provided.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14454

21 months agorpm: Use libtirpc-devel and /usr/lib on SUSE
Andreas Vögele [Thu, 9 Feb 2023 19:57:50 +0000 (20:57 +0100)]
rpm: Use libtirpc-devel and /usr/lib on SUSE

SUSE Linux distributions require libtirpc-devel.  The dracut and udev
directories are /usr/lib/dracut and /usr/lib/udev.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Andreas Vögele <andreas@andreasvoegele.com>
Closes #14467
Closes #14468

21 months agozdb: zero-pad checksum output
Rob N ★ [Tue, 7 Feb 2023 21:48:22 +0000 (08:48 +1100)]
zdb: zero-pad checksum output

The leading zeroes are part of the checksum so we should show them.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #14464

21 months agoinitramfs: Make mountpoint=none work
Ryan Moeller [Mon, 6 Feb 2023 19:16:01 +0000 (14:16 -0500)]
initramfs: Make mountpoint=none work

In initramfs, mount.zfs fails to mount a dataset with mountpoint=none,
but mount.zfs -o zfsutil works.  Use -o zfsutil when mountpoint=none.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #14455

21 months agoAdd assertion and make variables unsigned in abd_alloc_chunks()
Richard Yao [Thu, 2 Feb 2023 22:30:39 +0000 (17:30 -0500)]
Add assertion and make variables unsigned in abd_alloc_chunks()

Clang's static analyzer pointed out that if alloc_pages >= nr_pages
before the loop, the value of page will be undefined and will be used
anyway. This should not be possible, but as cleanup, we add an
assertion. We also recognize that the local variables should be unsigned
in the first place, so we make them unsigned. This is not enough to
avoid the need for the assertion, since there is still the case that
alloc_pages == nr_pages and nr_pages == 0, which the assertion
implicitly checks.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14456

21 months agoCleanup: spa vdev processing should check NULL pointers
Richard Yao [Mon, 23 Jan 2023 20:03:21 +0000 (15:03 -0500)]
Cleanup: spa vdev processing should check NULL pointers

The PVS Studio 2016 FreeBSD kernel report stated:

\contrib\opensolaris\uts\common\fs\zfs\spa.c (1341): error V595: The 'spa->spa_spares.sav_vdevs' pointer was utilized before it was verified against nullptr. Check lines: 1341, 1342.
\sys\cddl\contrib\opensolaris\uts\common\fs\zfs\spa.c (1355): error V595: The 'spa->spa_l2cache.sav_vdevs' pointer was utilized before it was verified against nullptr. Check lines: 1355, 1357.
\sys\cddl\contrib\opensolaris\uts\common\fs\zfs\spa.c (1398): error V595: The 'spa->spa_spares.sav_vdevs' pointer was utilized before it was verified against nullptr. Check lines: 1398, 1408.
\sys\cddl\contrib\opensolaris\uts\common\fs\zfs\spa.c (1583): error V595: The 'oldvdevs' pointer was utilized before it was verified against nullptr. Check lines: 1583, 1595.

In practice, all of these uses were safe because a NULL pointer
implied a 0 vdev count, which kept us from iterating over vdevs.
However, rearranging the code to check the pointer first is not a
terrible micro-optimization and makes it more readable, so let us
do that.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14456

21 months agozfs_get_temporary_prop() should not pass NULL to strcpy()
Richard Yao [Fri, 23 Dec 2022 05:00:38 +0000 (00:00 -0500)]
zfs_get_temporary_prop() should not pass NULL to strcpy()

`dsl_dir_activity_in_progress()` can call `zfs_get_temporary_prop()` with
the forth value set to NULL, which will pass NULL to `strcpy()` when
there is a match

Clang's static analyzer caught this with the help of CodeChecker for
Cross Translation Unit analysis.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14456

21 months agoEIO caused by encryption + recursive gang
Matthew Ahrens [Mon, 6 Feb 2023 17:37:06 +0000 (09:37 -0800)]
EIO caused by encryption + recursive gang

Encrypted blocks can not have 3 DVAs, because they use the space of the
3rd DVA for the IV+salt.  zio_write_gang_block() takes this into
account, setting `gbh_copies` to no more than 2 in this case.  Gang
members BP's do not have the X (encrypted) bit set (nor do they have the
DMU level and type fields set), because encryption is not handled at
this level.  The gang block is reassembled, and then encryption (and
compression) are handled.

To check if this gang block is encrypted, the code in
zio_write_gang_block() checks `pio->io_bp`.  This is normally fine,
because the block that's being ganged is typically the encrypted BP.

The problem is that if there is "recursive ganging", where a gang member
is itself a gang block, then when zio_write_gang_block() is called to
create a gang block for a gang member, `pio->io_bp` is the gang member's
BP, which doesn't have the X bit set, so the number of DVA's is not
restricted to 2.  It should instead be looking at the the "gang leader",
i.e. the top-level gang block, to determine how many DVA's can be used,
to avoid a "NDVA's inversion" (where a child has more DVA's than its
parent).

gang leader BP: X (encrypted) bit set, 2 DVA's, IV+salt in 3rd DVA's
space:
```
DVA[0]=<1:...:100400> DVA[1]=<0:...:100400> salt=... iv=...
[L0 ZFS plain file] fletcher4 uncompressed encrypted LE
gang unique double size=100000L/100000P birth=... fill=1 cksum=...
```

leader's GBH contains a BP with gang bit set and 3 DVA's:
```
DVA[0]=<1:...:55600> DVA[1]=<0:...:55600>
[L0 unallocated] fletcher4 uncompressed unencrypted LE
contiguous unique double size=55600L/55600P birth=... fill=0 cksum=...

DVA[0]=<1:...:55600> DVA[1]=<0:...:55600>
[L0 unallocated] fletcher4 uncompressed unencrypted LE
contiguous unique double size=55600L/55600P birth=... fill=0 cksum=...

DVA[0]=<1:...:55600> DVA[1]=<0:...:55600> DVA[2]=<1:...:200>
[L0 unallocated] fletcher4 uncompressed unencrypted LE
gang unique double size=55400L/55400P birth=... fill=0 cksum=...
```

On nondebug bits, having the 3rd DVA in the gang block works for the
most part, because it's true that all 3 DVA's are available in the gang
member BP (in the GBH).  However, for accounting purposes, gang block
DVA's ASIZE include all the space allocated below them, i.e. the
512-byte gang block header (GBH) as well as the gang members below that.
We see that above where the gang leader BP is 1MB logical (and after
compression: 0x`100000P`), but the ASIZE of each DVA is 2 sectors (1KB)
more than 1MB (0x`100400`).

Since thre are 3 copies of a block below it, we increment the ATIME of
the 3rd DVA of the gang leader by the space used by the 3rd DVA of the
child (1 sector, in this case).  But there isn't really a 3rd DVA of the
parent; the salt is stored in place of the 3rd DVA's ASIZE.

So when zio_write_gang_member_ready() increments the parent's BP's
`DVA[2]`'s ASIZE, it's actually incrementing the parent's salt.  When we
later try to read the encrypted recursively-ganged block, the salt
doesn't match what we used to write it, so MAC verification fails and we
get an EIO.

```
zio_encrypt():  encrypted 515/2/0/403 salt: 25 25 bb 9d ad d6 cd 89
zio_decrypt(): decrypting 515/2/0/403 salt: 26 25 bb 9d ad d6 cd 89
```

This commit addresses the problem by not increasing the number of copies
of the GBH beyond 2 (even for non-encrypted blocks).  This simplifies
the logic while maintaining the ability to traverse all metadata
(including gang blocks) even if one copy is lost.  (Note that 3 copies
of the GBH will still be created if requested, e.g. for `copies=3` or
MOS blocks.)  Additionally, the code that increments the parent's DVA's
ASIZE is made to check the parent DVA's NDVAS even on nondebug bits.  So
if there's a similar bug in the future, it will cause a panic when
trying to write, rather than corrupting the parent BP and causing an
error when reading.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Caused-by: #14356
Closes #14440
Closes #14413

21 months agoRestore FreeBSD to use .rodata
Jorgen Lundman [Mon, 6 Feb 2023 17:34:59 +0000 (02:34 +0900)]
Restore FreeBSD to use .rodata

In https://github.com/openzfs/zfs/pull/14228 the FreeBSD
SECTION_STATIC was set to ".data" instead of ".rodata". This
commit just restores it back to .rodata.

Reviewed-by: Attila Fülöp <attila@fueloep.org>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #14460

21 months agoUnify assembly files with macOS
Jorgen Lundman [Mon, 6 Feb 2023 17:27:55 +0000 (02:27 +0900)]
Unify assembly files with macOS

The remaining changes needed to make the assembly files work
with macOS.

Reviewed-by: Attila Fülöp <attila@fueloep.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #14451

21 months agoFix variable shadowing in libzfs_mount
Reno Reckling [Thu, 2 Feb 2023 23:22:12 +0000 (00:22 +0100)]
Fix variable shadowing in libzfs_mount

We accidentally reused variable name "i" for inner and outer loops.

Reviewed-by: Rich Ercolani <Rincebrain@gmail.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Reno Reckling <e-github@wthack.de>
Closes #14452
Closes #14445

21 months agoPrevent error messages when running tests with no timeout
Paul Dagnelie [Thu, 2 Feb 2023 23:19:26 +0000 (15:19 -0800)]
Prevent error messages when running tests with no timeout

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #14450

21 months agoTeach zdb about DMU_OT_ERROR_LOG objects
George Amanakis [Thu, 2 Feb 2023 23:17:37 +0000 (00:17 +0100)]
Teach zdb about DMU_OT_ERROR_LOG objects

With the persistent error log feature we need to account for
spa_errlog_{scrub, last} containing mappings to other error log objects,
which need to be marked as in-use as well.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14442
Closes #14434

21 months agozfs_main.c: fix unused variable error with GCC
rob-wing [Thu, 2 Feb 2023 23:16:40 +0000 (14:16 -0900)]
zfs_main.c: fix unused variable error with GCC

zfs_setproctitle_init() is stubbed out on FreeBSD.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Rob Wing <rob.fx907@gmail.com>
Closes #14441

21 months agoResolve WS-2021-0184 vulnerability in zstd
Allan Jude [Thu, 2 Feb 2023 23:12:51 +0000 (18:12 -0500)]
Resolve WS-2021-0184 vulnerability in zstd

Pull in d40f55cd950919d7eac951b122668e55e33e5202 from upstream

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes #14439

21 months agorootdelay on zfs should be adaptive
George Wilson [Thu, 2 Feb 2023 23:11:35 +0000 (18:11 -0500)]
rootdelay on zfs should be adaptive

The 'rootdelay' boot option currently pauses the boot for a specified
amount of time. The original intent was to ensure that slower
configurations would have ample time to enumerate the devices to make
importing the root pool successful. This, however, causes unnecessary
boot delay for environments like Azure which set this parameter by
default.

This commit changes the initramfs logic to pause until it can
successfully load the 'zfs' module. The timeout specified by
'rootdelay' now becomes the maximum amount of time that initramfs will
wait before failing the boot.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: George Wilson <gwilson@delphix.com>
Closes #14430

21 months agoFix console progress reporting for recursive send
Ameer Hamza [Thu, 2 Feb 2023 23:09:57 +0000 (04:09 +0500)]
Fix console progress reporting for recursive send

After commit 19d3961, progress reporting (-v) with replication flag
enabled does not report the progress on the console. This commit
fixes the issue by updating the logic to check for pa->progress
instead of pa_verbosity in send_progress_thread().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #14448

21 months agoIncrease default zfs_rebuild_vdev_limit to 64MB
Brian Behlendorf [Tue, 24 Jan 2023 23:23:32 +0000 (15:23 -0800)]
Increase default zfs_rebuild_vdev_limit to 64MB

When testing distributed rebuild performance with more capable
hardware it was observed than increasing the zfs_rebuild_vdev_limit
to 64M reduced the rebuild time by 17%.  Beyond 64MB there was
some improvement (~2%) but it was not significant when weighed
against the increased memory usage. Memory usage is capped at 1/4
of arc_c_max.

Additionally, vr_bytes_inflight_max has been moved so it's updated
per-metaslab to allow the size to be adjust while a rebuild is
running.

Reviewed-by: Akash B <akash-b@hpe.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14428

21 months agoIncrease default zfs_scan_vdev_limit to 16MB
Brian Behlendorf [Tue, 24 Jan 2023 22:05:45 +0000 (14:05 -0800)]
Increase default zfs_scan_vdev_limit to 16MB

For HDD based pools the default zfs_scan_vdev_limit of 4M
per-vdev can significantly limit the maximum scrub performance.
Increasing the default to 16M can double the scrub speed from
80 MB/s per disk to 160 MB/s per disk.

This does increase the memory footprint during scrub/resilver
but given the performance win this is a reasonable trade off.
Memory usage is capped at 1/4 of arc_c_max.  Note that number
of outstanding I/Os has not changed and is still limited by
zfs_vdev_scrub_max_active.

Reviewed-by: Akash B <akash-b@hpe.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14428

22 months agoPrefetch on deadlists merge
Alexander Motin [Wed, 25 Jan 2023 19:30:24 +0000 (14:30 -0500)]
Prefetch on deadlists merge

During snapshot deletion ZFS may issue several reads for each deadlist
to merge them into next snapshot's or pool's bpobj.  Number of the dead
lists increases with number of snapshots.  On HDD pools it may take
significant time during which sync thread is blocked.

This patch introduces prescient prefetch of required blocks for up to
128 deadlists ahead.  Tests show reduction of time required to delete
dataset with 720 snapshots with randomly overwritten file on wide HDD
pool from 75-85 to 22-28 seconds.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Issue #14276
Closes #14402

22 months agoImprove resilver ETAs
Brian Behlendorf [Wed, 25 Jan 2023 19:28:54 +0000 (11:28 -0800)]
Improve resilver ETAs

When resilvering the estimated time remaining is calculated using
the average issue rate over the current pass.  Where the current
pass starts when a scan was started, or restarted, if the pool
was exported/imported.

For dRAID pools in particular this can result in wildly optimistic
estimates since the issue rate will be very high while scanning
when non-degraded regions of the pool are scanned.  Once repair
I/O starts being issued performance drops to a realistic number
but the estimated performance is still significantly skewed.

To address this we redefine a pass such that it starts after a
scanning phase completes so the issue rate is more reflective of
recent performance.  Additionally, the zfs_scan_report_txgs
module option can be set to reset the pass statistics more often.

Reviewed-by: Akash B <akash-b@hpe.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14410

22 months agolinux 6.2 compat: zpl_set_acl arg2 is now struct dentry
Coleman Kane [Tue, 24 Jan 2023 19:20:50 +0000 (14:20 -0500)]
linux 6.2 compat:  zpl_set_acl arg2 is now struct dentry

Linux 6.2 changes the second argument of the set_acl operation to be a
"struct dentry *" rather than a "struct inode *". The inode* parameter
is still available as dentry->d_inode, so adjust the call to the _impl
function call to dereference and pass that pointer to it.

Also document that the get_acl -> get_inode_acl member name change from
commit 884a693 was an API change also introduced in Linux 6.2.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #14415

22 months agoIntroduce minimal ZIL block commit delay
Alexander Motin [Tue, 24 Jan 2023 17:20:32 +0000 (12:20 -0500)]
Introduce minimal ZIL block commit delay

Despite all optimizations, tests on actual hardware show that FreeBSD
kernel can't sleep for less then ~2us.  Similar tests on Linux show
~50us delay at least from nanosleep() (haven't tested inside kernel).
It means that on very fast log device ZIL may not be able to satisfy
zfs_commit_timeout_pct block commit timeout, increasing log latency
more than desired.

Handle that by introduction of zil_min_commit_timeout parameter,
specifying minimal timeout value where additional delays to aggregate
writes may be skipped.  Also skip delays if the LWB is more than 7/8
full, that often happens if I/O sizes are constant and match one of
LWB sizes.  Both things are applied only if there were no already
outstanding log blocks, that may indicate single-threaded workload,
that by definition can not benefit from the commit delays.

While there, add short time moving average to zl_last_lwb_latency to
make it more stable.

Tests of single-threaded 4KB writes to NVDIMM SLOG on FreeBSD show IOPS
increase by 9% instead of expected 5%.  For zfs_commit_timeout_pct of
1 there IOPS increase by 5.5% instead of expected 1%.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14418

22 months agox86 asm: Replace .align with .balign
Attila Fülöp [Mon, 23 Jan 2023 19:25:21 +0000 (20:25 +0100)]
x86 asm: Replace .align with .balign

The .align directive used to align storage locations is
ambiguous. On some platforms and assemblers it takes a byte count,
on others the argument is interpreted as a shift value. The current
usage expects the first interpretation.

Replace it with the unambiguous .balign directive which always
expects a byte count, regardless of platform and assembler.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #14422

22 months agoIPC: blake3 x86 asm: fix placement of .size directives
Attila Fülöp [Mon, 23 Jan 2023 18:48:39 +0000 (19:48 +0100)]
IPC: blake3 x86 asm: fix placement of .size directives

The .size directive used by the SET_SIZE C macro uses the special
dot symbol to calculate the size of a function. The dot symbol
refers to the current address, so for the calculation to be
meaningful the SET_SIZE macro must be placed immediately after the
end of the function the size is calculated for.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #14422

22 months agoWait for txg sync if the last DRR_FREEOBJECTS might result in a hole
David Hedberg [Mon, 23 Jan 2023 21:19:43 +0000 (22:19 +0100)]
Wait for txg sync if the last DRR_FREEOBJECTS might result in a hole

If we receive a DRR_FREEOBJECTS as the first entry in an object range,
this might end up producing a hole if the freed objects were the
only existing objects in the block.

If the txg starts syncing before we've processed any following
DRR_OBJECT records, this leads to a possible race where the backing
arc_buf_t gets its psize set to 0 in the arc_write_ready() callback
while still being referenced from a dirty record in the open txg.

To prevent this, we insert a txg_wait_synced call if the first
record in the range was a DRR_FREEOBJECTS that actually
resulted in one or more freed objects.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: David Hedberg <david.hedberg@findity.com>
Sponsored by: Findity AB
Closes #11893
Closes #14358

22 months agoReject streams that set ->drr_payloadlen to unreasonably large values
Richard Yao [Mon, 23 Jan 2023 21:16:22 +0000 (16:16 -0500)]
Reject streams that set ->drr_payloadlen to unreasonably large values

In the zstream code, Coverity reported:

"The argument could be controlled by an attacker, who could invoke the
function with arbitrary values (for example, a very high or negative
buffer size)."

It did not report this in the kernel. This is likely because the
userspace code stored this in an int before passing it into the
allocator, while the kernel code stored it in a uint32_t.

However, this did reveal a potentially real problem. On 32-bit systems
and systems with only 4GB of physical memory or less in general, it is
possible to pass a large enough value that the system will hang. Even
worse, on Linux systems, the kernel memory allocator is not able to
support allocations up to the maximum 4GB allocation size that this
allows.

This had already been limited in userspace to 64MB by
`ZFS_SENDRECV_MAX_NVLIST`, but we need a hard limit in the kernel to
protect systems. After some discussion, we settle on 256MB as a hard
upper limit. Attempting to receive a stream that requires more memory
than that will result in E2BIG being returned to user space.

Reported-by: Coverity (CID-1529836)
Reported-by: Coverity (CID-1529837)
Reported-by: Coverity (CID-1529838)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14285

22 months agoConfigure zed's diagnosis engine with vdev properties
rob-wing [Mon, 23 Jan 2023 21:14:25 +0000 (12:14 -0900)]
Configure zed's diagnosis engine with vdev properties

Introduce four new vdev properties:
    checksum_n
    checksum_t
    io_n
    io_t

These properties can be used for configuring the thresholds of zed's
diagnosis engine and are interpeted as <N> events in T <seconds>.

When this property is set to a non-default value on a top-level vdev,
those thresholds will also apply to its leaf vdevs. This behavior can be
overridden by explicitly setting the property on the leaf vdev.

Note that, these properties do not persist across vdev replacement. For
this reason, it is advisable to set the property on the top-level vdev
instead of the leaf vdev.

The default values for zed's diagnosis engine (10 events, 600 seconds)
remains unchanged.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Sponsored-by: Seagate Technology LLC
Closes #13805

22 months agofree_blocks(): Fix reports from 2016 PVS Studio FreeBSD report
Richard Yao [Mon, 23 Jan 2023 21:12:37 +0000 (16:12 -0500)]
free_blocks(): Fix reports from 2016 PVS Studio FreeBSD report

In 2016, the authors of PVS Studio ran it on the FreeBSD kernel, which
identified a number of bugs / cleanup opportunities in the FreeBSD ZFS kernel
code. A few of them persist to the present day:

https://reviews.freebsd.org/D5245

Note that the scan was done against
freebsd/freebsd-src@46763fd4ca8a37f836c9bf2333f9d687509278f3.

In particular, we have the following in free_blocks():

\sys\cddl\contrib\opensolaris\uts\common\fs\zfs\dnode_sync.c (174): error V547: Expression '__left >= __right' is always true. Unsigned type value is always >= 0.
\sys\cddl\contrib\opensolaris\uts\common\fs\zfs\dnode_sync.c (171): error V634: The priority of the '*' operation is higher than that of the '<<' operation. It's possible that parentheses should be used in the expression.
\sys\cddl\contrib\opensolaris\uts\common\fs\zfs\dnode_sync.c (175): error V547: Expression '__left >= __right' is always true. Unsigned type value is always >= 0.

A couple of assertions accidentally typecast the arguments they check to
unsigned in such a way that the result is always true. Also, parentheses
are missing around `1<<epbs` in `(db->db_blkid * 1<<epbs)`. This works
out to be okay due to multiplication not caring what order of operations
we use, but it is better to fix it to be `(db->db_blkid << epbs)`.

A few of the function local variables probably never should have been
32-bit in the first place, so we make them 64-bit. We also replace the
existing assertions with additional assertions to ensure that 64-bit
unsigned arithmetic is safe.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14407

22 months agoFix reading uninitialized variable in receive_read
Chunwei Chen [Fri, 20 Jan 2023 19:49:56 +0000 (11:49 -0800)]
Fix reading uninitialized variable in receive_read

When zfs_file_read returns error, resid may be uninitialized.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #14404

22 months agozfs_receive_one: Check for the more likely error first
Allan Jude [Fri, 20 Jan 2023 19:11:54 +0000 (14:11 -0500)]
zfs_receive_one: Check for the more likely error first

If zfs_receive_one() gets back EINVAL, check for the more likely case,
embedded block pointers + encryption and return that error, before
falling back to the less likely case, a resumable stream when the
kernel has not been upgraded to support resume.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Sponsored-by: rsync.net
Sponsored-by: Klara Inc.
Closes #14379

22 months agoCleanup ->dd_space_towrite should be unsigned
Richard Yao [Fri, 20 Jan 2023 19:10:15 +0000 (14:10 -0500)]
Cleanup ->dd_space_towrite should be unsigned

This is only ever used with unsigned data, so the type itself should be
unsigned. Also, PVS Studio's 2016 FreeBSD kernel report correctly
identified the following assertion as always being true, so we can drop
it:

ASSERT3U(dd->dd_space_towrite[i & TXG_MASK], >=, 0);

The reason it was always true is because it would do casts to give us
unsigned comparisons. This could have been fixed by switching to
`ASSERT3S()`, but upon inspection, it turned out that this variable
never should have been allowed to be signed in the first place.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14408

22 months agoMicro-optimize dsl_prop_get_dd()
Mark Johnston [Mon, 16 Jan 2023 12:53:16 +0000 (07:53 -0500)]
Micro-optimize dsl_prop_get_dd()

Use the saved property index instead of looking it up once per DSL
directory when traversing up towards the root.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Sponsored-by: The FreeBSD Foundation
Closes #14397

22 months agoAvoid passing an uninitialized index to dsl_prop_known_index
Mark Johnston [Sun, 15 Jan 2023 22:05:19 +0000 (17:05 -0500)]
Avoid passing an uninitialized index to dsl_prop_known_index

Reported-by: KMSAN
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Sponsored-by: The FreeBSD Foundation
Closes #14397

22 months agoFix unprotected zfs_znode_dmu_fini
Chunwei Chen [Fri, 20 Jan 2023 00:59:05 +0000 (16:59 -0800)]
Fix unprotected zfs_znode_dmu_fini

In original code, zfs_znode_dmu_fini is called in zfs_rmnode without
zfs_znode_hold_enter. It seems to assume it's ok to do so when the znode
is unlinked. However this assumption is not correct, as zfs_zget can be
called by NFS through zpl_fh_to_dentry as pointed out by Christian in
https://github.com/openzfs/zfs/pull/12767, which could result in a
use-after-free bug.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #12767
Closes #14364

22 months agoMan: fix defaults for zfs_dirty_data_max_max
George Melikov [Tue, 17 Jan 2023 21:12:29 +0000 (00:12 +0300)]
Man: fix defaults for zfs_dirty_data_max_max

It was changed in e99932f7dec6efeb006e225e0bf0901c30345cac,
but without docs update.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #14400

22 months agoUnify Assembler files between Linux and Windows
Jorgen Lundman [Tue, 17 Jan 2023 19:09:19 +0000 (04:09 +0900)]
Unify Assembler files between Linux and Windows

Add new macro ASMABI used by Windows to change
calling API to "sysv_abi".

Reviewed-by: Attila Fülöp <attila@fueloep.org>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #14228

22 months agoUse setproctitle to report progress of zfs send
Ameer Hamza [Tue, 17 Jan 2023 18:17:35 +0000 (23:17 +0500)]
Use setproctitle to report progress of zfs send

This allows parsing of zfs send progress by checking the process
title.
Doing so requires some changes to the send code in libzfs_sendrecv.c;
primarily these changes move some of the accounting around, to allow
for the code to be verbose as normal, or set the process title. Unlike
BSD, setproctitle() isn't standard in Linux; thus, borrowed it from
libbsd with slight modifications.

Authored-by: Sean Eric Fagan <sef@FreeBSD.org>
Co-authored-by: Ryan Moeller <ryan@iXsystems.com>
Co-authored-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #14376

22 months agoCleanup of dead code suggested by Clang Static Analyzer (#14380)
Richard Yao [Tue, 17 Jan 2023 17:57:12 +0000 (12:57 -0500)]
Cleanup of dead code suggested by Clang Static Analyzer (#14380)

I recently gained the ability to run Clang's static analyzer on the
linux kernel modules via a few hacks. This extended coverage to code
that was previously missed since Clang's static analyzer only looked at
code that we built in userspace. Running it against the Linux kernel
modules built from my local branch produced a total of 72 reports
against my local branch. Of those, 50 were reports of logic errors and
22 were reports of dead code. Since we already had cleaned up all of
the previous dead code reports, I felt it would be a good next step to
clean up these dead code reports. Clang did a further breakdown of the
dead code reports into:

Dead assignment 15

Dead increment 2

Dead nested assignment 5

The benefit of cleaning these up, especially in the case of dead nested
assignment, is that they can expose places where our error handling is
incorrect. A number of them were fairly straight forward. However
several were not:

In vdev_disk_physio_completion(), not only were we not using the return
value from the static function vdev_disk_dio_put(), but nothing used it,
so I changed it to return void and removed the existing (void) cast in
the other area where we call it in addition to no longer storing it to a
stack value.

In FSE_createDTable(), the function is dead code. Its helper function
FSE_freeDTable() is also dead code, as are the CPP definitions in
`module/zstd/include/zstd_compat_wrapper.h`. We just delete it all.

In zfs_zevent_wait(), we have an optimization opportunity. cv_wait_sig()
returns 0 if there are waiting signals and 1 if there are none. The
Linux SPL version literally returns `signal_pending(current) ? 0 : 1)`
and FreeBSD implements the same semantics, we can just do
`!cv_wait_sig()` in place of `signal_pending(current)` to avoid
unnecessarily calling it again.

zfs_setattr() on FreeBSD version did not have error handling issue
because the code was removed entirely from FreeBSD version. The error is
from updating the attribute directory's files. After some thought, I
decided to propapage errors on it to userspace.

In zfs_secpolicy_tmp_snapshot(), we ignore a lack of permission from the
first check in favor of checking three other permissions. I assume this
is intentional.

In zfs_create_fs(), the return value of zap_update() was not checked
despite setting an important version number. I see no backward
compatibility reason to permit failures, so we add an assertion to catch
failures. Interestingly, Linux is still using ASSERT(error == 0) from
OpenSolaris while FreeBSD has switched to the improved ASSERT0(error)
from illumos, although illumos has yet to adopt it here. ASSERT(error ==
0) was used on Linux while ASSERT0(error) was used on FreeBSD since the
entire file needs conversion and that should be the subject of
another patch.

dnode_move()'s issue was caused by us not having implemented
POINTER_IS_VALID() on Linux. We have a stub in
`include/os/linux/spl/sys/kmem_cache.h` for it, when it really should be
in `include/os/linux/spl/sys/kmem.h` to be consistent with
Illumos/OpenSolaris. FreeBSD put both `POINTER_IS_VALID()` and
`POINTER_INVALIDATE()` in `include/os/freebsd/spl/sys/kmem.h`, so we
copy what it did.

Whenever a report was in platform-specific code, I checked the FreeBSD
version to see if it also applied to FreeBSD, but it was only relevant a
few times.

Lastly, the patch that enabled Clang's static analyzer to be run on the
Linux kernel modules needs more work before it can be put into a PR. I
plan to do that in the future as part of the on-going static analysis
work that I am doing.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14380

22 months agoZTS: Annotate additonal flaky test cases
Brian Behlendorf [Tue, 17 Jan 2023 17:53:53 +0000 (09:53 -0800)]
ZTS: Annotate additonal flaky test cases

Update several flaky test cases in zts-report.py.in until they
can be made entirely reliable.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14392

22 months agoCI: Reclaim space after package operations
Brian Behlendorf [Tue, 17 Jan 2023 17:52:27 +0000 (09:52 -0800)]
CI: Reclaim space after package operations

Rather than reclaiming space before updating the packages do
it afterwards.  This avoids issues with apt returning an
error due to missing files on the system.

This commit includes a revert for 6320b9e6.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14387

22 months agozpool-set: print error message when pool or vdev is not valid
Rob Wing [Wed, 21 Dec 2022 06:52:26 +0000 (21:52 -0900)]
zpool-set: print error message when pool or vdev is not valid

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Sponsored-by: Seagate Technology
Submitted-by: Klara, Inc.
Closes #14310

22 months agozpool-set: update usage text
Rob Wing [Wed, 21 Dec 2022 06:33:11 +0000 (21:33 -0900)]
zpool-set: update usage text

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Sponsored-by: Seagate Technology
Submitted-by: Klara, Inc.
Closes #14310

22 months agoLinux ppc64le ieee128 compat: Do not redefine __asm on external headers
Richard Yao [Fri, 13 Jan 2023 18:58:58 +0000 (13:58 -0500)]
Linux ppc64le ieee128 compat: Do not redefine __asm on external headers

There is an external assembly declaration extension in GNU C that glibc
uses when building with ieee128 floating point support on ppc64le.
Marking that as volatile makes no sense, so the build breaks.

It does not make sense to only mark this as volatile on Linux, since if
do not want the compiler reordering things on Linux, we do not want the
compiler reordering things on any other platform, so we stop treating
Linux specially and just manually inline the CPP macro so that we can
eliminate it. This should fix the build on ppc64le.

Tested-by: @gyakovlev
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14308
Closes #14384

22 months agoCleanup: Use NULL when doing NULL pointer comparisons
Richard Yao [Tue, 10 Jan 2023 21:58:21 +0000 (16:58 -0500)]
Cleanup: Use NULL when doing NULL pointer comparisons

The Linux 5.16.14 kernel's coccicheck caught this. The semantic
patch that caught it was:

./scripts/coccinelle/null/badzero.cocci

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14372

22 months agoCleanup: Remove unneeded semicolons
Richard Yao [Tue, 10 Jan 2023 21:54:23 +0000 (16:54 -0500)]
Cleanup: Remove unneeded semicolons

The Linux 5.16.14 kernel's coccicheck caught this. The semantic
patch that caught it was:

./scripts/coccinelle/misc/semicolon.cocci

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14372

22 months agoCleanup: Use MIN() macro
Richard Yao [Tue, 10 Jan 2023 21:41:26 +0000 (16:41 -0500)]
Cleanup: Use MIN() macro

The Linux 5.16.14 kernel's coccicheck caught this. The semantic
patch that caught it was:

./scripts/coccinelle/misc/minmax.cocci

There was a third opportunity to use `MIN()`, but that was in
`FSE_minTableLog()` in `module/zstd/lib/compress/fse_compress.c`.
Upstream zstd has yet to make this change and I did not want to change
header includes just for MIN, or do a one off, so I left it alone.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14372

22 months agoCleanup: !A || A && B is equivalent to !A || B
Richard Yao [Tue, 10 Jan 2023 21:34:29 +0000 (16:34 -0500)]
Cleanup: !A || A && B is equivalent to !A || B

In zfs_zaccess_dataset_check(), we have the following subexpression:

(!IS_DEVVP(ZTOV(zp)) ||
    (IS_DEVVP(ZTOV(zp)) && (v4_mode & WRITE_MASK_ATTRS)))

When !IS_DEVVP(ZTOV(zp)) is false, IS_DEVVP(ZTOV(zp)) is true under the
law of the excluded middle since we are not doing pseudoboolean alegbra.
Therefore doing:

(IS_DEVVP(ZTOV(zp)) && (v4_mode & WRITE_MASK_ATTRS))

Is unnecessary and we can just do:

(v4_mode & WRITE_MASK_ATTRS)

The Linux 5.16.14 kernel's coccicheck caught this. The semantic
patch that caught it was:

./scripts/coccinelle/misc/excluded_middle.cocci

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14372

22 months agoCleanup: Replace oldstyle struct hack with C99 flexible array members
Richard Yao [Tue, 10 Jan 2023 20:44:35 +0000 (15:44 -0500)]
Cleanup: Replace oldstyle struct hack with C99 flexible array members

The Linux 5.16.14 kernel's coccicheck caught this. The semantic
patch that caught it was:

./scripts/coccinelle/misc/flexible_array.cocci

However, unlike the cases where the GNU zero length array extension had
been used, coccicheck would not suggest patches for the older style
single member arrays. That was good because blindly changing them would
break size calculations in most cases.

Therefore, this required care to make sure that we did not break size
calculations. In the case of `indirect_split_t`, we use
`offsetof(indirect_split_t, is_child[is->is_children])` to calculate
size. This might be subtly wrong according to an old mailing list
thread:

https://inbox.sourceware.org/gcc-prs/20021226123454.27019.qmail@sources.redhat.com/T/

That is because the C99 specification should consider the flexible array
members to start at the end of a structure, but compilers prefer to put
padding at the end. A suggestion was made to allow compilers to allocate
padding after the VLA like compilers already did:

http://std.dkuug.dk/JTC1/SC22/WG14/www/docs/n983.htm

However, upon thinking about it, whether or not we allocate end of
structure padding does not matter, so using offsetof() to calculate the
size of the structure is fine, so long as we do not mix it with sizeof()
on structures with no array members.

In the case that we mix them and padding causes offsetof(struct_t,
vla_member[0]) to differ from sizeof(struct_t), we would be doing unsafe
operations if we underallocate via `offsetof()` and then overcopy via
sizeof().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14372

22 months agoCleanup: Fix indentation in zfs_dbgmsg_t
Richard Yao [Tue, 10 Jan 2023 20:18:57 +0000 (15:18 -0500)]
Cleanup: Fix indentation in zfs_dbgmsg_t

fdc2d303710416868a05084e984fd8f231e948bd accidentally broke the
indentation.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14372

22 months agoCleanup: Use C99 flexible array members instead of zero length arrays
Richard Yao [Tue, 10 Jan 2023 20:14:59 +0000 (15:14 -0500)]
Cleanup: Use C99 flexible array members instead of zero length arrays

The Linux 5.16.14 kernel's coccicheck caught this. The semantic
patch that caught it was:

./scripts/coccinelle/misc/flexible_array.cocci

The Linux kernel's documentation makes a good case for why we should not
use these:

https://www.kernel.org/doc/html/latest/process/deprecated.html#zero-length-and-one-element-arrays

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14372