]> git.proxmox.com Git - mirror_zfs.git/log
mirror_zfs.git
18 months agoicp: properly fix all RETs in x86_64 Asm code
Alexander Lobakin [Sun, 16 Oct 2022 14:53:22 +0000 (16:53 +0200)]
icp: properly fix all RETs in x86_64 Asm code

Commit 43569ee37420 ("Fix objtool: missing int3 after ret warning")
addressed replacing all `ret`s in x86 asm code to a macro in the
Linux kernel in order to enable SLS. That was done by copying the
upstream macro definitions and fixed objtool complaints.
Since then, several more mitigations were introduced, including
Rethunk. It requires to have a jump to one of the thunks in order
to work, so the RET macro was changed again. And, as ZFS code
didn't use the mainline defition, but copied it, this is currently
missing.

Objtool reminds about it time to time (Clang 16, CONFIG_RETHUNK=y):

fs/zfs/lua/zlua.o: warning: objtool: setjmp+0x25: 'naked' return
 found in RETHUNK build
fs/zfs/lua/zlua.o: warning: objtool: longjmp+0x27: 'naked' return
 found in RETHUNK build

Do it the following way:
* if we're building under Linux, unconditionally include
  <linux/linkage.h> in the related files. It is available in x86
  sources since even pre-2.6 times, so doesn't need any conftests;
* then, if RET macro is available, it will be used directly, so that
  we will always have the version actual to the kernel we build;
* if there's no such macro, we define it as a simple `ret`, as it
  was on pre-SLS times.

This ensures we always have the up-to-date definition with no need
to update it manually, and at the same time is safe for the whole
variety of kernels ZFS module supports.
Then, there's a couple more "naked" rets left in the code, they're
just defined as:

.byte 0xf3,0xc3

In fact, this is just:

rep ret

`rep ret` instead of just `ret` seems to mitigate performance issues
on some old AMD processors and most likely makes no sense as of
today.
Anyways, address those rets, so that they will be protected with
Rethunk and SLS. Include <sys/asm_linkage.h> here which now always
has RET definition and replace those constructs with just RET.
This wipes the last couple of places with unpatched rets objtool's
been complaining about.

Reviewed-by: Attila Fülöp <attila@fueloep.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Closes #14035

18 months agoFreeBSD: Fix out of bounds read in zfs_ioctl_ozfs_to_legacy()
Richard Yao [Fri, 4 Nov 2022 18:06:14 +0000 (14:06 -0400)]
FreeBSD: Fix out of bounds read in zfs_ioctl_ozfs_to_legacy()

There is an off by 1 error in the check. Fortunately, this function does
not appear to be used in kernel space, despite being compiled as part of
the kernel module. However, it is used in userspace. Callers of
lzc_ioctl_fd() likely will crash if they attempt to use the
unimplemented request number.

This was reported by FreeBSD's coverity scan.

Reported-by: Coverity (CID 1432059)
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14135

18 months agoExpose zfs_vdev_open_timeout_ms as a tunable
Serapheim Dimitropoulos [Thu, 3 Nov 2022 22:02:46 +0000 (15:02 -0700)]
Expose zfs_vdev_open_timeout_ms as a tunable

Some of our customers have been occasionally hitting zfs import failures
in Linux because udevd doesn't create the by-id symbolic links in time
for zpool import to use them. The main issue is that the
systemd-udev-settle.service that zfs-import-cache.service and other
services depend on is racy. There is also an openzfs issue filed (see
https://github.com/openzfs/zfs/issues/10891) outlining the problem and
potential solutions.

With the proper solutions being significant in terms of complexity and
the priority of the issue being low for the time being, this patch
exposes `zfs_vdev_open_timeout_ms` as a tunable so people that are
experiencing this issue often can increase it as a workaround.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #14133

18 months agoAllow mounting snapshots in .zfs/snapshot as a regular user
Allan Jude [Thu, 3 Nov 2022 18:53:24 +0000 (14:53 -0400)]
Allow mounting snapshots in .zfs/snapshot as a regular user

Rather than doing a terrible credential swapping hack, we just
check that the thing being mounted is a snapshot, and the mountpoint
is the zfsctl directory, then we allow it.

If the mount attempt is from inside a jail, on an unjailed dataset
(mounted from the host, not by the jail), the ability to mount the
snapshot is controlled by a new per-jail parameter: zfs.mount_snapshot

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Sponsored-by: Modirum MDPay
Sponsored-by: Klara Inc.
Closes #13758

18 months agoCleanup: Remove branches that always evaluate the same way
Richard Yao [Thu, 3 Nov 2022 17:47:48 +0000 (13:47 -0400)]
Cleanup: Remove branches that always evaluate the same way

Coverity reported that the ASSERT in taskq_create() is always true and
the `*offp > MAXOFFSET_T` check in zfs_file_seek() is always false.

We delete them as cleanup.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14130

18 months agoRemove an unused variable
Brooks Davis [Tue, 1 Nov 2022 20:45:36 +0000 (20:45 +0000)]
Remove an unused variable

Clang-16 detects this set-but-unused variable which is assigned and
incremented, but never referenced otherwise.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Brooks Davis <brooks.davis@sri.com>
Closes #14125

18 months agoMake 1-bit bitfields unsigned
Brooks Davis [Tue, 1 Nov 2022 20:43:32 +0000 (20:43 +0000)]
Make 1-bit bitfields unsigned

This fixes -Wsingle-bit-bitfield-constant-conversion warning from
clang-16 like:

lib/libzfs/libzfs_dataset.c:4529:19: error: implicit truncation
  from 'int' to a one-bit wide bit-field changes value from
  1 to -1 [-Werror,-Wsingle-bit-bitfield-constant-conversion]
                flags.nounmount = B_TRUE;
^ ~~~~~~

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Brooks Davis <brooks.davis@sri.com>
Closes #14125

18 months agoAddress warnings about possible division by zero from clangsa
Richard Yao [Thu, 3 Nov 2022 16:58:14 +0000 (12:58 -0400)]
Address warnings about possible division by zero from clangsa

* The complaint in ztest_replay_write() is only possible if something
   went horribly wrong. An assertion will silence this and if it goes
   off, we will know that something is wrong.
 * The complaint in spa_estimate_metaslabs_to_flush() is not impossible,
   but seems very unlikely. We resolve this by passing the value from
   the `MIN()` that does not go to infinity when the variable is zero.

There was a third report from Clang's scan-build, but that was a
definite false positive and disappeared when checked again through
Clang's static analyzer with Z3 refution via CodeChecker.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14124

18 months agolibuutil: deobfuscate internal pointers
Brooks Davis [Thu, 3 Nov 2022 16:57:05 +0000 (09:57 -0700)]
libuutil: deobfuscate internal pointers

uu_avl and uu_list stored internal next/prev pointers and parent
pointers (unused) obfuscated (byte swapped) to hide them from a long
forgotten leak checker (No one at the 2022 OpenZFS developers meeting
could recall the history.)  This would break on CHERI systems and adds
no obvious value.  Rename the members, use proper types rather than
uintptr_t, and eliminate the related macros.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brooks Davis <brooks.davis@sri.com>
Closes #14126

18 months agoDeny receiving into encrypted datasets if the keys are not loaded
Attila Fülöp [Thu, 3 Nov 2022 16:55:13 +0000 (17:55 +0100)]
Deny receiving into encrypted datasets if the keys are not loaded

Commit 68ddc06b611854560fefa377437eb3c9480e084b introduced support
for receiving unencrypted datasets as children of encrypted ones but
unfortunately got the logic upside down. This resulted in failing to
deny receives of incremental sends into encrypted datasets without
their keys loaded. If receiving a filesystem, the receive was done
into a newly created unencrypted child dataset of the target. In
case of volumes the receive made the target volume undeletable since
a dataset was created below it, which we obviously can't handle.
Incremental streams with embedded blocks are affected as well.

We fix the broken logic to properly deny receives in such cases.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #13598
Closes #14055
Closes #14119

18 months agolua: cast through uintptr_t when return a pointer
Brooks Davis [Thu, 27 Oct 2022 22:39:06 +0000 (23:39 +0100)]
lua: cast through uintptr_t when return a pointer

Don't assume size_t can carry pointer provenance and use uintptr_t
(identialy on all current platforms) instead.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Brooks Davis <brooks.davis@sri.com>
Closes #14131

18 months agoUse intptr_t when storing an integer in a pointer
Brooks Davis [Thu, 27 Oct 2022 22:28:03 +0000 (23:28 +0100)]
Use intptr_t when storing an integer in a pointer

Cast the integer type to (u)intptr_t before casting to "void *".  In
CHERI C/C++ we warn on bare casts from integers to pointers to catch
attempts to create pointers our of thin air.  We allow the warning to be
supressed with a suitable cast through (u)intptr_t.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Brooks Davis <brooks.davis@sri.com>
Closes #14131

18 months agorecvd_props_mode: use a uintptr_t to stash nvlists
Brooks Davis [Thu, 27 Oct 2022 22:25:42 +0000 (23:25 +0100)]
recvd_props_mode: use a uintptr_t to stash nvlists

Avoid assuming than a uint64_t can hold a pointer.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Brooks Davis <brooks.davis@sri.com>
Closes #14131

18 months agozfs_onexit_add_cb: make action_handle point to a uintptr_t
Brooks Davis [Thu, 27 Oct 2022 22:20:05 +0000 (23:20 +0100)]
zfs_onexit_add_cb: make action_handle point to a uintptr_t

Avoid assuming than a uint64_t can hold a pointer and reduce the
number of casts in the process.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Brooks Davis <brooks.davis@sri.com>
Closes #14131

18 months agoacl: use uintptr_t for ace walker cookies
Brooks Davis [Thu, 27 Oct 2022 22:04:17 +0000 (23:04 +0100)]
acl: use uintptr_t for ace walker cookies

Avoid assuming that a pointer can fit in a uint64_t and use uintptr_t
instead.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Brooks Davis <brooks.davis@sri.com>
Closes #14131

18 months agolinux isa_defs.h: Don't define _ALIGNMENT_REQUIRED
Brooks Davis [Fri, 28 Oct 2022 16:36:43 +0000 (17:36 +0100)]
linux isa_defs.h: Don't define _ALIGNMENT_REQUIRED

Nothing consumes this definition so stop defining it.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brooks Davis <brooks.davis@sri.com>
Closes #14128

18 months agoImprove RISC-V support
Brooks Davis [Thu, 27 Oct 2022 23:58:41 +0000 (00:58 +0100)]
Improve RISC-V support

Check __riscv_xlen == 64 rather than _LP64 and define _LP64 if missing.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brooks Davis <brooks.davis@sri.com>
Closes #14128

19 months agoFreeBSD: Fix regression from kmem_scnprintf() in libzfs
Richard Yao [Tue, 1 Nov 2022 20:58:17 +0000 (16:58 -0400)]
FreeBSD: Fix regression from kmem_scnprintf() in libzfs

kmem_scnprintf() is only available in libzpool. Recent buildbot issues
with showing FreeBSD results kept us from seeing this before
97143b9d314d54409244f3995576d8cc8c1ebf0a was merged.

The code has been changed to sanitize the output from `kmem_scnprintf()`.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14111

19 months agoinclude overrides for zfs snapshot/rollback bootfs.service
Vince van Oosten [Sun, 23 Oct 2022 09:11:58 +0000 (11:11 +0200)]
include overrides for zfs snapshot/rollback bootfs.service

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Vince van Oosten <techhazard@codeforyouand.me>
Closes #14075
Closes #14076

19 months agoinclude overrides for zfs-import.target
Vince van Oosten [Sun, 23 Oct 2022 09:11:18 +0000 (11:11 +0200)]
include overrides for zfs-import.target

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Vince van Oosten <techhazard@codeforyouand.me>
Closes #14075
Closes #14076

19 months agoinclude systemd overrides to zfs-dracut module
Vince van Oosten [Sun, 23 Oct 2022 08:55:46 +0000 (10:55 +0200)]
include systemd overrides to zfs-dracut module

If a user that uses systemd and dracut wants to overide certain
settings, they typically use `systemctl edit [unit]` or place a file in
`/etc/systemd/system/[unit].d/override.conf` directly.

The zfs-dracut module did not include those overrides however, so this
did not have any effect at boot time.

For zfs-import-scan.service and zfs-import-cache.service, overrides are
now included in the dracut initramfs image.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Vince van Oosten <techhazard@codeforyouand.me>
Closes #14075
Closes #14076

19 months agozil: Relax assertion in zil_parse
Ryan Moeller [Tue, 1 Nov 2022 19:19:32 +0000 (15:19 -0400)]
zil: Relax assertion in zil_parse

Rather than panic debug builds when we fail to parse a whole ZIL, let's
instead improve the logging of errors and continue like in a release
build.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #14116

19 months agoZTS: rsend_009_pos.ksh is destructive on zfs-on-root system
youzhongyang [Tue, 1 Nov 2022 19:08:37 +0000 (15:08 -0400)]
ZTS: rsend_009_pos.ksh is destructive on zfs-on-root system

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes #14113

19 months agoFix oversights from 4170ae4e
Richard Yao [Mon, 31 Oct 2022 17:01:04 +0000 (13:01 -0400)]
Fix oversights from 4170ae4e

4170ae4ea600fea6ac9daa8b145960c9de3915fc was intended to tackle TOCTOU
race conditions reported by CodeQL, but as an oversight, a file
descriptor was not closed and some comments were not updated.
Interestingly, CodeQL did not complain about the file descriptor leak,
so there is room for improvement in how we configure it to try to detect
this issue so that we get early warning about this.

In addition, an optimization opportunity was missed by mistake in
lib/libshare/os/linux/smb.c, which prevented us from truly closing the
TOCTOU race. This was also caught by Coverity.

Reported-by: Coverity (CID 1524424)
Reported-by: Coverity (CID 1526804)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14109

19 months agoAvoid null pointer dereference in dsl_fs_ss_limit_check()
Allan Jude [Sat, 29 Oct 2022 20:08:54 +0000 (16:08 -0400)]
Avoid null pointer dereference in dsl_fs_ss_limit_check()

Check for cr == NULL before dereferencing it in
dsl_enforce_ds_ss_limits() to lookup the zone/jail ID.

Reported-by: Coverity (CID 1210459)
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes #14103

19 months agoIntroduce kmem_scnprintf()
Richard Yao [Thu, 27 Oct 2022 18:16:04 +0000 (14:16 -0400)]
Introduce kmem_scnprintf()

`snprintf()` is meant to protect against buffer overflows, but operating
on the buffer using its return value, possibly by calling it again, can
cause a buffer overflow, because it will return how many characters it
would have written if it had enough space even when it did not. In a
number of places, we repeatedly call snprintf() by successively
incrementing a buffer offset and decrementing a buffer length, by its
return value. This is a potentially unsafe usage of `snprintf()`
whenever the buffer length is reached. CodeQL complained about this.

To fix this, we introduce `kmem_scnprintf()`, which will return 0 when
the buffer is zero or the number of written characters, minus 1 to
exclude the NULL character, when the buffer was too small. In all other
cases, it behaves like snprintf(). The name is inspired by the Linux and
XNU kernels' `scnprintf()`. The implementation was written before I
thought to look at `scnprintf()` and had a good name for it, but it
turned out to have identical semantics to the Linux kernel version.
That lead to the name, `kmem_scnprintf()`.

CodeQL only catches this issue in loops, so repeated use of snprintf()
outside of a loop was not caught. As a result, a thorough audit of the
codebase was done to examine all instances of `snprintf()` usage for
potential problems and a few were caught. Fixes for them are included in
this patch.

Unfortunately, ZED is one of the places where `snprintf()` is
potentially used incorrectly. Since using `kmem_scnprintf()` in it would
require changing how it is linked, we modify its usage to make it safe,
no matter what buffer length is used. In addition, there was a bug in
the use of the return value where the NULL format character was not
being written by pwrite(). That has been fixed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14098

19 months agoCleanup dump_bookmarks()
Richard Yao [Thu, 27 Oct 2022 19:41:39 +0000 (15:41 -0400)]
Cleanup dump_bookmarks()

Assertions are meant to check assumptions, but the way that this
assertion is written does not check an assumption, since it is provably
always true. Removing the assertion will cause a compiler warning (made
into an error by -Werror) about printing up to 512 bytes to a 256-byte
buffer, so instead, we change the assertion to verify the assumption
that we never do a snprintf() that is truncated to avoid overrunning the
256-byte buffer.

This was caught by an audit of the codebase to look for misuse of
`snprintf()` after CodeQL reported that we had misused `snprintf()`. An
explanation of how snprintf() can be misused is here:

https://www.redhat.com/en/blog/trouble-snprintf

This particular instance did not misuse `snprintf()`, but it was caught
by the audit anyway.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14098

19 months agoFix too few arguments to formatting function
Richard Yao [Thu, 27 Oct 2022 16:45:26 +0000 (12:45 -0400)]
Fix too few arguments to formatting function

CodeQL reported that when the VERIFY3U condition is false, we do not
pass enough arguments to `spl_panic()`. This is because the format
string from `snprintf()` was concatenated into the format string for
`spl_panic()`, which causes us to have an unexpected format specifier.

A CodeQL developer suggested fixing the macro to have a `%s` format
string that takes a stringified RIGHT argument, which would fix this.
However, upon inspection, the VERIFY3U check was never necessary in the
first place, so we remove it in favor of just calling `snprintf()`.

Lastly, it is interesting that every other static analyzer run on the
codebase did not catch this, including some that made an effort to catch
such things. Presumably, all of them relied on header annotations, which
we have not yet done on `spl_panic()`. CodeQL apparently is able to
track the flow of arguments on their way to annotated functions, which
llowed it to catch this when others did not. A future patch that I have
in development should annotate `spl_panic()`, so the others will catch
this too.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14098

19 months agoFix TOCTOU race conditions reported by CodeQL and Coverity
Richard Yao [Thu, 27 Oct 2022 15:03:48 +0000 (11:03 -0400)]
Fix TOCTOU race conditions reported by CodeQL and Coverity

CodeQL and Coverity both complained about:

 * lib/libshare/os/linux/smb.c
 * tests/zfs-tests/cmd/mmapwrite.c
  * twice
 * tests/zfs-tests/tests/functional/tmpfile/tmpfile_002_pos.c
 * tests/zfs-tests/tests/functional/tmpfile/tmpfile_stat_mode.c
* coverity had a second complaint that CodeQL did not have
 * tests/zfs-tests/cmd/suid_write_to_file.c
* Coverity had two complaints and CodeQL had one complaint, both
  differed. The CodeQL complaint is about the main point of the
  test, so it is not fixable without a hack involving `fork()`.

The issues reported by CodeQL are fixed, with the exception of the last
one, which is deemed to be a false positive that is too much trouble to
wrokaround. The issues reported by Coverity were only fixed if CodeQL
complained about them.

There were issues reported by Coverity in a number of other files that
were not reported by CodeQL, but fixing the CodeQL complaints is
considered a priority since we want to integrate it into a github
workflow, so the remaining Coverity complaints are left for future work.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14098

19 months agoRevert "Cleanup: Delete dead code from send_merge_thread()"
Brian Behlendorf [Fri, 28 Oct 2022 20:25:37 +0000 (13:25 -0700)]
Revert "Cleanup: Delete dead code from send_merge_thread()"

This reverts commit fb823de9f due to a regression.  It is in fact possible
for the range->eos_marker to be false on error.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #14042
Closes #14104

19 months agodebug: fix output from VERIFY0 assertion
Rob N ★ [Fri, 28 Oct 2022 18:46:44 +0000 (05:46 +1100)]
debug: fix output from VERIFY0 assertion

The previous version reported all the right info, but the VERIFY3 name
made a little more confusing when looking for the matching location in
the source code.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Rob N ★ <robn@despairlabs.com>
Closes #14099

19 months agoquota: extend quota for dataset
Mariusz Zaborski [Fri, 28 Oct 2022 18:44:18 +0000 (20:44 +0200)]
quota: extend quota for dataset

This patch relax the quota limitation for dataset by around 3%.
What this means is that user can write more data then the quota is
set to. However thanks to that we can get more stable bandwidth, in
case when we are overwriting data in-place, and not consuming any
additional space.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Mariusz Zaborski <oshogbo@vexillium.org>
Sponsored-by: Zededa Inc.
Sponsored-by: Klara Inc.
Closes #13839

19 months agoFix ARC target collapse when zfs_arc_meta_limit_percent=100
shodanshok [Fri, 28 Oct 2022 17:21:54 +0000 (19:21 +0200)]
Fix ARC target collapse when zfs_arc_meta_limit_percent=100

Reclaim metadata when arc_available_memory < 0 even if
meta_used is not bigger than arc_meta_limit.

As described in https://github.com/openzfs/zfs/issues/14054 if
zfs_arc_meta_limit_percent=100 then ARC target can collapse to
arc_min due to arc_purge not freeing any metadata.

This patch lets arc_prune to do its work when arc_available_memory
is negative even if meta_used is not bigger than arc_meta_limit,
avoiding ARC target collapse.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Closes #14054
Closes #14093

19 months agoPropagate extent_bytes change to autotrim thread
vaclavskala [Fri, 28 Oct 2022 17:16:31 +0000 (19:16 +0200)]
Propagate extent_bytes change to autotrim thread

The autotrim thread only reads zfs_trim_extent_bytes_min and
zfs_trim_extent_bytes_max variable only on thread start.  We
should check for parameter changes during thread execution to
allow parameter changes take effect without needing to disable
then restart the autotrim.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Václav Skála <skala@vshosting.cz>
Closes #14077

19 months agozfs_rename: support RENAME_* flags
Aleksa Sarai [Sat, 22 Jun 2019 00:35:11 +0000 (10:35 +1000)]
zfs_rename: support RENAME_* flags

Implement support for Linux's RENAME_* flags (for renameat2). Aside from
being quite useful for userspace (providing race-free ways to exchange
paths and implement mv --no-clobber), they are used by overlayfs and are
thus required in order to use overlayfs-on-ZFS.

In order for us to represent the new renameat2(2) flags in the ZIL, we
create two new transaction types for the two flags which need
transactional-level support (RENAME_EXCHANGE and RENAME_WHITEOUT).
RENAME_NOREPLACE does not need any ZIL support because we know that if
the operation succeeded before creating the ZIL entry, there was no file
to be clobbered and thus it can be treated as a regular TX_RENAME.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Closes #12209
Closes #14070

19 months agozfs_rename: restructure to have cleaner fallbacks
Aleksa Sarai [Fri, 26 Apr 2019 13:23:07 +0000 (23:23 +1000)]
zfs_rename: restructure to have cleaner fallbacks

This is in preparation for RENAME_EXCHANGE and RENAME_WHITEOUT support
for ZoL, but the changes here allow for far nicer fallbacks than the
previous implementation (the source and target are re-linked in case of
the final link failing).

In addition, a small cleanup was done for the "target exists but is a
different type" codepath so that it's more understandable.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Closes #12209
Closes #14070

19 months agodebug: add VERIFY_{IMPLY,EQUIV} variants
Aleksa Sarai [Wed, 18 May 2022 10:29:33 +0000 (20:29 +1000)]
debug: add VERIFY_{IMPLY,EQUIV} variants

This allows for much cleaner VERIFY-level assertions.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Closes #14070

19 months agoRemove zpl_revalidate: fix snapshot rollback
Pavel Snajdr [Thu, 5 Dec 2019 00:52:27 +0000 (01:52 +0100)]
Remove zpl_revalidate: fix snapshot rollback

Open files, which aren't present in the snapshot, which is being
roll-backed to, need to disappear from the visible VFS image of
the dataset.

Kernel provides d_drop function to drop invalid entry from
the dcache, but inode can be referenced by dentry multiple dentries.

The introduced zpl_d_drop_aliases function walks and invalidates
all aliases of an inode.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Closes #9600
Closes #14070

19 months agoFix multiplication converted to larger type
Andrew Innes [Fri, 28 Oct 2022 16:30:37 +0000 (00:30 +0800)]
Fix multiplication converted to larger type

This fixes the instances of the "Multiplication result converted to
larger type" alert that codeQL scanning found.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Andrew Innes <andrew.c12@gmail.com>
Closes #14094

19 months agoFix zio_flag_t print format
youzhongyang [Fri, 28 Oct 2022 16:08:12 +0000 (12:08 -0400)]
Fix zio_flag_t print format

Follow up for 4938d01d which changed zio_flag from enum to uint64_t.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes #14100

19 months agoProcess `script` directory for all configs
Damian Szuberski [Thu, 27 Oct 2022 23:45:14 +0000 (09:45 +1000)]
Process `script` directory for all configs

Even when only building kmods process the scripts directory.  This
way the common.sh script will be generated and the zfs.sh script
can be used to load/unload the in-tree kernel modules.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: szubersk <szuberskidamian@gmail.com>
Closes #14027
Closes #14051

19 months agoAdd native Debian Packaging for Linux
Umer Saleem [Thu, 27 Oct 2022 22:38:45 +0000 (03:38 +0500)]
Add native Debian Packaging for Linux

Currently, the Debian packages are generated from ALIEN that converts
RPMs to Debian packages. This commit adds native Debian packaging for
Debian based systems.

This packaging is a fork of Debian zfs-linux 2.1.6-2 release.
(source: https://salsa.debian.org/zfsonlinux-team/zfs)

Some updates have been made to keep the footprint minimal that
include removing the tests, translation files, patches directory etc.
All credits go to Debian ZFS on Linux Packaging Team.

For copyright information, please refer to contrib/debian/copyright.

scripts/debian-packaging.sh can be used to invoke the build.

Reviewed-by: Mo Zhou <cdluminate@gmail.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #13451

19 months agoConvert enum zio_flag to uint64_t
Richard Yao [Thu, 27 Oct 2022 16:54:54 +0000 (09:54 -0700)]
Convert enum zio_flag to uint64_t

We ran out of space in enum zio_flag for additional flags. Rather than
introduce enum zio_flag2 and then modify a bunch of functions to take a
second flags variable, we expand the type to 64 bits via `typedef
uint64_t zio_flag_t`.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@klarasystems.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Co-authored-by: Richard Yao <richard.yao@klarasystems.com>
Closes #14086

19 months agoAdd CodeQL workflow
Richard Yao [Thu, 27 Oct 2022 16:36:17 +0000 (09:36 -0700)]
Add CodeQL workflow

CodeQL is a static analyzer from github with a very low false positive
rate. We have long wanted to have static analysis runs done on every
pull request and using CodeQL, we can.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Andrew Innes <andrew.c12@gmail.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14087

19 months agoAligned free for aligned alloc
Andrew Innes [Wed, 26 Oct 2022 22:08:31 +0000 (06:08 +0800)]
Aligned free for aligned alloc

Windows port frees memory that was alloc'd aligned in a different way
then alloc'd memory.  So changing frees to be specific.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrew Innes <andrew.c12@gmail.com>
Co-Authored-By: Jorgen Lundman <lundman@lundman.net>
Closes #14059

19 months agoFreeBSD: vn_flush_cached_data: observe vnode locking contract
Andriy Gapon [Wed, 26 Oct 2022 22:00:58 +0000 (01:00 +0300)]
FreeBSD: vn_flush_cached_data: observe vnode locking contract

vm_object_page_clean() expects that the associated vnode is locked
as VOP_PUTPAGES() may get called on the vnode.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Andriy Gapon <avg@FreeBSD.org>
Closes #14079

19 months agoSilence objtool warnings from 55d7afa4
Richard Yao [Wed, 26 Oct 2022 21:57:37 +0000 (14:57 -0700)]
Silence objtool warnings from 55d7afa4

The use of __noreturn__ in 55d7afa4adbb4ca569e9c4477a7d121f4dc0bfbd on
spl_panic() caused objtool warnings on Linux when the kernel is built
with CONFIG_STACK_VALIDATION=y. This patch works around that by
restricting the application of __noreturn__ to builds for static
analyzers.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14068

19 months agoLinux 6.0 compat: META
Brian Behlendorf [Wed, 26 Oct 2022 21:55:12 +0000 (14:55 -0700)]
Linux 6.0 compat: META

Update the META file to reflect compatibility with the 6.0 kernel.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14091

19 months agozed: Avoid core dump if wholedisk property does not exist
Ameer Hamza [Fri, 21 Oct 2022 17:46:38 +0000 (22:46 +0500)]
zed: Avoid core dump if wholedisk property does not exist

zed aborts and dumps core in vdev_whole_disk_from_config() if
wholedisk property does not exist. make_leaf_vdev() adds the
property but there may be already pools that don't have the
wholedisk in the label.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #14062

19 months agoAdd delay between zpool add zvol and zpool destroy
youzhongyang [Fri, 21 Oct 2022 17:05:13 +0000 (13:05 -0400)]
Add delay between zpool add zvol and zpool destroy

As investigated by #14026, the zpool_add_004_pos can reliably hang if
the timing is not right. This is caused by a race condition between
zed doing zpool reopen (due to the zvol being added to the zpool),
and the command zpool destroy.

This change adds a delay between zpool add zvol and zpool destroy to
avoid these issue, but does not address the underlying problem.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Issue #14026
Closes #14052

19 months agoLinux: Fix big endian and partial read bugs in get_system_hostid()
Richard Yao [Thu, 20 Oct 2022 21:52:35 +0000 (17:52 -0400)]
Linux: Fix big endian and partial read bugs in get_system_hostid()

Coverity made two complaints about this function. The first is that we
ignore the number of bytes read. The second is that we have a sizeof
mismatch.

On 64-bit systems, long is a 64-bit type. Paradoxically, the standard
says that hostid is 32-bit, yet is also a long type. On 64-bit big
endian systems, reading into the long would cause us to return 0 as our
hostid after the mask. This is wrong.

Also, if a partial read were to happen (it should not), we would return
a partial hostid, which is also wrong.

We introduce a uint32_t system_hostid stack variable and ensure that the
read is done into it and check the read's return value. Then we set the
value based on whether the read was successful. This should fix both of
coverity's complaints.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13968

19 months agoSilence new static analyzer defect reports from idmap_util.c
Richard Yao [Thu, 20 Oct 2022 21:46:12 +0000 (17:46 -0400)]
Silence new static analyzer defect reports from idmap_util.c

2a068a1394d179dda4becf350e3afb4e8819675e introduced 2 new defect
reports from Coverity and 1 from Clang's static analyzer.

Coverity complained about a potential resource leak from only calling
`close(fd)` when `fd > 0` because `fd` might be `0`. This is a false
positive, but rather than dismiss it as such, we can change the
comparison to ensure that this never appears again from any static
analyzer. Upon inspection, 6 more instances of this were found in the
file, so those were changed too. Unfortunately, since the file
descriptor has been put into an unsigned variable in `attr.userns_fd`,
we cannot do a non-negative check on it to see if it has not been
allocated, so we instead restructure the error handling to avoid the
need for a check. This also means that errors had not been handled
correctly here, so the static analyzer found a bug (although practically
by accident).

Coverity also complained about a dereference before a NULL check in
`do_idmap_mount()` on `source`. Upon inspection, it appears that the
pointer is never NULL, so we delete the NULL check as cleanup.

Clang's static analyzer complained that the return value of
`write_pid_idmaps()` can be uninitialized if we have no idmaps to write.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Youzhong Yang <yyang@mathworks.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14061

19 months agoLinux: Upgrade random_get_pseudo_bytes() to xoshiro256++ 1.0
Richard Yao [Thu, 20 Oct 2022 21:14:42 +0000 (17:14 -0400)]
Linux: Upgrade random_get_pseudo_bytes() to xoshiro256++ 1.0

The motivation for upgrading our PRNG is the recent buildbot failures in
the ZTS' tests/functional/fault/decompress_fault test. The probability
of a failure in that test is 0.8^256, which is ~1.6e-25 out of 1, yet we
have observed multiple test failures in it. This suggests a problem with
our random number generation.

The xorshift128+ generator that we were using has been replaced by newer
generators that have "better statistical properties". After doing some
reading, it turns out that these generators have "low linear complexity
of the lowest bits", which could explain the ZTS test failures.

We do two things to try to fix this:

1. We upgrade from xorshift128+ to xoshiro256++ 1.0.

2. We tweak random_get_pseudo_bytes() to copy the higher order
   bytes first.

It is hoped that this will fix the test failures in
tests/functional/fault/decompress_fault, although I have not done
simulations. I am skeptical that any simulations I do on a PRNG with a
period of 2^256 - 1 would be meaningful.

Since we have raised the minimum kernel version to 3.10 since this was
first implemented, we have the option of using the Linux kernel's
get_random_int(). However, I am not currently prepared to do performance
tests to ensure that this would not be a regression (for the time
being), so we opt for upgrading our PRNG to a newer one from Sebastiano
Vigna.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13983

19 months agoOptimize microzaps
Alexander Motin [Thu, 20 Oct 2022 18:57:15 +0000 (14:57 -0400)]
Optimize microzaps

Microzap on-disk format does not include a hash tree, expecting one to
be built in RAM during mzap_open().  The built tree is linked to DMU
user buffer, freed when original DMU buffer is dropped from cache. I've
found that workloads accessing many large directories and having active
eviction from DMU cache spend significant amount of time building and
then destroying the trees.  I've also found that for each 64 byte mzap
element additional 64 byte tree element is allocated, that is a waste
of memory and CPU caches.

Improve memory efficiency of the hash tree by switching from AVL-tree
to B-tree.  It allows to save 24 bytes per element just on pointers.
Save 32 bits on mze_hash by storing only upper 32 bits since lower 32
bits are always zero for microzaps.  Save 16 bits on mze_chunkid, since
microzap can never have so many elements.  Respectively with the 16 bits
there can be no more than 16 bits of collision differentiators.  As
result, struct mzap_ent now drops from 48 (rounded to 64) to 8 bytes.

Tune B-trees for small data.  Reduce BTREE_CORE_ELEMS from 128 to 126
to allow struct zfs_btree_core in case of 8 byte elements to pack into
2KB instead of 4KB.  Aside of the microzaps it should also help 32bit
range trees.  Allow custom B-tree leaf size to reduce memmove() time.

Split zap_name_alloc() into zap_name_alloc() and zap_name_init_str().
It allows to not waste time allocating/freeing memory when processing
multiple names in a loop during mzap_open().

Together on a pool with 10K directories of 1800 files each and DMU
cache limited to 128MB this reduces time of `find . -name zzz` by 41%
from 7.63s to 4.47s, and saves additional ~30% of CPU time on the DMU
cache reclamation.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14039

19 months agoFix multiple definitions of struct mount_attr on recent glibc versions
Richard Yao [Thu, 20 Oct 2022 16:12:21 +0000 (12:12 -0400)]
Fix multiple definitions of struct mount_attr on recent glibc versions

The ifdef used would never work because the CPP is not aware of C
structure definitions. Rather than use an autotools check, we can just
use a nameless structure that we typedef to mount_attr_t. This is a
Linux kernel interface, which means that it is stable and this is fine
to do.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Youzhong Yang <yyang@mathworks.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14057
Closes #14058

19 months agoAdd defensive assertion to vdev_queue_aggregate()
Richard Yao [Mon, 17 Oct 2022 16:37:31 +0000 (12:37 -0400)]
Add defensive assertion to vdev_queue_aggregate()

a6ccb36b948efed0eeee4fcf99fe4b5fb81ae1d5 had been intended to include
this to silence Coverity reports, but this one was missed by mistake.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14043

19 months agoabd_return_buf() should call zfs_refcount_remove_many() early
Richard Yao [Sun, 16 Oct 2022 04:02:47 +0000 (00:02 -0400)]
abd_return_buf() should call zfs_refcount_remove_many() early

Calling zfs_refcount_remove_many() after freeing memory means we pass a
reference to freed memory as the holder. This is not believed to be able
to cause a problem, but there is a bit of a tradition of fixing these
issues when they appear so that they do not obscure more serious issues
in static analyzer output, so we fix this one too.

Clang's static analyzer found this with the help of CodeChecker's CTU
analysis.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14043

19 months agocrypto_get_ptrs() should always write to *out_data_2
Richard Yao [Sun, 16 Oct 2022 03:35:56 +0000 (23:35 -0400)]
crypto_get_ptrs() should always write to *out_data_2

Callers will check if it has been set to NULL before trying to access
it, but never initialize it themselves. Whenever "one block spans two
iovecs", `crypto_get_ptrs()` will return, without ever setting
`*out_data_2 = NULL`. The caller will then do a NULL check against the
uninitailized pointer and if it is not zero, pass it to `memcpy()`.

The only reason this has not caused horrible runtime issues is because
`memcpy()` should be told to copy zero bytes when this happens. That
said, this is technically undefined behavior, so we should correct it so
that future changes to the code cannot trigger it.

Clang's static analyzer found this with the help of CodeChecker's CTU
analysis.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14043

19 months agoSilence static analyzer warnings about spa_sync_props()
Richard Yao [Sat, 15 Oct 2022 22:34:31 +0000 (18:34 -0400)]
Silence static analyzer warnings about spa_sync_props()

Both Coverity and Clang's static analyzer complain about reading an
uninitialized intval if the property is not passed as DATA_TYPE_UINT64
in the nvlist. This is impossible becuase spa_prop_validate() already
checked this, but they are unlikely to be the last static analyzers to
complain about this, so lets just refactor the code to suppress the
warnings.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14043

19 months agoFix theoretical use of uninitialized values
Richard Yao [Sat, 15 Oct 2022 22:27:03 +0000 (18:27 -0400)]
Fix theoretical use of uninitialized values

Clang's static analyzer complains about this.

In get_configs(), if we have an invalid configuration that has no top
level vdevs, we can read a couple of uninitialized variables. Aborting
upon seeing this would break the userland tools for healthy pools, so we
instead initialize the two variables to 0 to allow the userland tools to
continue functioning for the pools with valid configurations.

In zfs_do_wait(), if no wait activities are enabled, we read an
uninitialized error variable.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14043

19 months agoFix userland memory leak in zfs_do_send()
Richard Yao [Thu, 20 Oct 2022 00:08:33 +0000 (20:08 -0400)]
Fix userland memory leak in zfs_do_send()

Clang 15's static analyzer caught this.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14045

19 months agoAdd options to zfs redundant_metadata property
Akash B [Thu, 20 Oct 2022 00:07:51 +0000 (05:37 +0530)]
Add options to zfs redundant_metadata property

Currently, additional/extra copies are created for metadata in
addition to the redundancy provided by the pool(mirror/raidz/draid),
due to this 2 times more space is utilized per inode and this decreases
the total number of inodes that can be created in the filesystem. By
setting redundant_metadata to none, no additional copies of metadata
are created, hence can reduce the space consumed by the additional
metadata copies and increase the total number of inodes that can be
created in the filesystem.  Additionally, this can improve file create
performance due to the reduced amount of metadata which needs
to be written.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Dipak Ghosh <dipak.ghosh@hpe.com>
Signed-off-by: Akash B <akash-b@hpe.com>
Closes #13680

19 months agoFix sequential resilver drive failure race condition
samwyc [Wed, 19 Oct 2022 22:48:13 +0000 (04:18 +0530)]
Fix sequential resilver drive failure race condition

This patch handles the race condition on simultaneous failure of
2 drives, which misses the vdev_rebuild_reset_wanted signal in
vdev_rebuild_thread. We retry to catch this inside the
vdev_rebuild_complete_sync function.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Dipak Ghosh <dipak.ghosh@hpe.com>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Samuel Wycliffe J <samwyc@hpe.com>
Closes #14041
Closes #14050

19 months agoSupport idmapped mount
youzhongyang [Wed, 19 Oct 2022 18:17:09 +0000 (14:17 -0400)]
Support idmapped mount

Adds support for idmapped mounts.  Supported as of Linux 5.12 this
functionality allows user and group IDs to be remapped without changing
their state on disk.  This can be useful for portable home directories
and a variety of container related use cases.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes #12923
Closes #13671

19 months agoFix memory leaks in dmu_send()/dmu_send_obj()
Richard Yao [Tue, 18 Oct 2022 23:03:33 +0000 (19:03 -0400)]
Fix memory leaks in dmu_send()/dmu_send_obj()

If we encounter an EXDEV error when using the redacted snapshots
feature, the memory used by dspp.fromredactsnaps is leaked.

Clang's static analyzer caught this during an experiment in which I had
annotated various headers in an attempt to improve the results of static
analysis.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13973

19 months agoCleanup: Remove NULL pointer check from dmu_send_impl()
Richard Yao [Mon, 17 Oct 2022 21:57:59 +0000 (17:57 -0400)]
Cleanup: Remove NULL pointer check from dmu_send_impl()

The pointer is to a structure member, so it is never NULL.

Coverity complained about this.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14042

19 months agoCleanup: Delete dead code from send_merge_thread()
Richard Yao [Mon, 17 Oct 2022 06:20:21 +0000 (02:20 -0400)]
Cleanup: Delete dead code from send_merge_thread()

range is always deferenced before it reaches this check, such that the
kmem_zalloc() call is never executed.

There is also no need to set `range->eos_marker = B_TRUE` because it is
already set.

Coverity incorrectly complained about a potential NULL pointer
dereference because of this.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14042

19 months agoCleanup: zvol_add_clones() should not NULL check dp
Richard Yao [Mon, 17 Oct 2022 06:18:09 +0000 (02:18 -0400)]
Cleanup: zvol_add_clones() should not NULL check dp

It is never NULL because we return early if dsl_pool_hold() fails.

This caused Coverity to complain.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14042

19 months agoCleanup: metaslab_alloc_dva() should not NULL check mg->mg_next
Richard Yao [Mon, 17 Oct 2022 06:02:12 +0000 (02:02 -0400)]
Cleanup: metaslab_alloc_dva() should not NULL check mg->mg_next

This is a circularly linked list. mg->mg_next can never be NULL.

This caused 3 defect reports in Coverity.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14042

19 months agoCleanup: Delete unnecessary pointer check from vdev_to_nvlist_iter()
Richard Yao [Sun, 16 Oct 2022 04:42:03 +0000 (00:42 -0400)]
Cleanup: Delete unnecessary pointer check from vdev_to_nvlist_iter()

This confused Clang's static analyzer, making it think there was a
possible NULL pointer dereference. There is no NULL pointer dereference.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14042

19 months agoCleanup: Simplify userspace abd_free_chunks()
Richard Yao [Sun, 16 Oct 2022 02:54:57 +0000 (22:54 -0400)]
Cleanup: Simplify userspace abd_free_chunks()

Clang's static analyzer complained that we could use after free here if
the inner loop ever iterated. That is a false positive, but upon
inspection, the userland abd_alloc_chunks() function never will put
multiple consecutive pages into a `struct scatterlist`, so there is no
need to loop. We delete the inner loop.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14042

19 months agoFix possible NULL pointer dereference in sha2_mac_init()
Richard Yao [Mon, 17 Oct 2022 06:06:40 +0000 (02:06 -0400)]
Fix possible NULL pointer dereference in sha2_mac_init()

If mechanism->cm_param is NULL, passing mechanism to
PROV_SHA2_GET_DIGEST_LEN() will dereference a NULL pointer.

Coverity reported this.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14044

19 months agoset_global_var() should not pass NULL pointers to dlclose()
Richard Yao [Sun, 16 Oct 2022 04:56:55 +0000 (00:56 -0400)]
set_global_var() should not pass NULL pointers to dlclose()

Both Coverity and Clang's static analyzer caught this.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14044

19 months agoFix NULL pointer dereference in spa_open_common()
Richard Yao [Sun, 16 Oct 2022 04:19:13 +0000 (00:19 -0400)]
Fix NULL pointer dereference in spa_open_common()

Calling spa_open() will pass a NULL pointer to spa_open_common()'s
config parameter. Under the right circumstances, we will dereference the
config parameter without doing a NULL check.

Clang's static analyzer found this.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14044

19 months agoFix NULL pointer passed to strlcpy from zap_lookup_impl()
Richard Yao [Sat, 15 Oct 2022 02:55:48 +0000 (22:55 -0400)]
Fix NULL pointer passed to strlcpy from zap_lookup_impl()

Clang's static analyzer pointed out that whenever zap_lookup_by_dnode()
is called, we have the following stack where strlcpy() is passed a NULL
pointer for realname from zap_lookup_by_dnode():

strlcpy()
zap_lookup_impl()
zap_lookup_norm_by_dnode()
zap_lookup_by_dnode()

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14044

19 months agofm_fmri_hc_create() must call va_end() before returning
Richard Yao [Sat, 15 Oct 2022 02:46:43 +0000 (22:46 -0400)]
fm_fmri_hc_create() must call va_end() before returning

clang-tidy caught this.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14044

19 months agoFix NULL pointer dereference in zdb
Richard Yao [Sat, 15 Oct 2022 02:45:13 +0000 (22:45 -0400)]
Fix NULL pointer dereference in zdb

Clang's static analyzer complained that we dereference a NULL pointer in
dump_path() if we return 0 when there is an error.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14044

19 months agoZED: Fix uninitialized value reads
Richard Yao [Tue, 18 Oct 2022 19:42:14 +0000 (15:42 -0400)]
ZED: Fix uninitialized value reads

Coverity complained about a couple of uninitialized value reads in ZED.

 * zfs_deliver_dle() can pass an uninitialized string to zed_log_msg()
 * An uninitialized sev.sigev_signo is passed to timer_create()

The former would log garbage while the latter is not a real issue, but
we might as well suppress it by initializing the field to 0 for
consistency's sake.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14047

19 months agoLinux 6.1 compat: change order of sys/mutex.h includes
Coleman Kane [Tue, 18 Oct 2022 19:29:44 +0000 (15:29 -0400)]
Linux 6.1 compat: change order of sys/mutex.h includes

After Linux 6.1-rc1 came out, the build started failing to build a
couple of the files in the linux spl code due to the mutex_init
redefinition. Moving the sys/mutex.h include to a lower position within
these two files appears to fix the problem.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #14040

19 months agoCoverity model file update
Richard Yao [Tue, 18 Oct 2022 18:06:35 +0000 (14:06 -0400)]
Coverity model file update

Upon review, it was found that the model for malloc() was incorrect.

In addition, several general purpose memory allocation functions were
missing models:

 * kmem_vasprintf()
 * kmem_asprintf()
 * kmem_strdup()
 * kmem_strfree()
 * spl_vmem_alloc()
 * spl_vmem_zalloc()
 * spl_vmem_free()
 * calloc()

As an experiment to try to find more bugs, some less than general
purpose memory allocation functions were also given models:

 * zfsvfs_create()
 * zfsvfs_free()
 * nvlist_alloc()
 * nvlist_dup()
 * nvlist_free()
 * nvlist_pack()
 * nvlist_unpack()

Finally, the models were improved using additional coverity primitives:

 * __coverity_negative_sink__()
 * __coverity_writeall0__()
 * __coverity_mark_as_uninitialized_buffer__()
 * __coverity_mark_as_afm_allocated__()

In addition, an attempt to inform coverity that certain modelled
functions read entire buffers was used by adding the following to
certain models:

int first = buf[0];
int last = buf[buflen-1];

It was inspired by the QEMU model file.

No additional false positives were found by this, but it is believed
that the more accurate model file will help to catch false positives in
the future.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14048

19 months agoFix declarations of non-global variables
Tino Reichardt [Tue, 18 Oct 2022 18:05:32 +0000 (20:05 +0200)]
Fix declarations of non-global variables

This patch inserts the `static` keyword to non-global variables,
which where found by the analysis tool smatch.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes #13970

19 months agoLinux compat: fix DECLARE_EVENT_CLASS() test when ZFS is built-in
Alexander [Mon, 17 Oct 2022 18:08:36 +0000 (20:08 +0200)]
Linux compat: fix DECLARE_EVENT_CLASS() test when ZFS is built-in

ZFS_LINUX_TRY_COMPILE_HEADER macro doesn't take CONFIG_ZFS=y into
account. As a result, on several latest Linux versions, configure
script marks DECLARE_EVENT_CLASS() available for non-GPL when ZFS
is being built as a module, but marks it unavailable when ZFS is
built-in.
Follow the logic of the neighbor macros and adjust
ZFS_LINUX_TRY_COMPILE_HEADER accordingly, so that it doesn't try
to look for a .ko when ZFS is built-in.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Closes #14006

19 months agoFix theoretical array overflow in lua_typename()
Richard Yao [Fri, 14 Oct 2022 20:41:56 +0000 (16:41 -0400)]
Fix theoretical array overflow in lua_typename()

Out of the 12 defects in lua that coverity reports, 5 of them involve
`lua_typename()` and out of the dozens of defects in ZFS that lua
reports, 3 of them involve `lua_typename()` due to the ZCP code. Given
all of the uses of `lua_typename()` in the ZCP code, I was surprised
that there were not more. It appears that only 2 were reported because
only 3 called `lua_type()`, which does a defective sanity check that
allows invalid types to be passed.

lua/lua@d4fb848be77f4b0209acaf37a5b5e1cee741ddce addressed this in
upstream lua 5.3. Unfortunately, we did not get that fix since we use
lua 5.2 and we do not have assertions enabled in lua, so the upstream
solution would not do anything.

While we could adopt the upstream solution and enable assertions, a
simpler solution is to fix the issue by making `lua_typename()` return
`internal_type_error` whenever it is called with an invalid type. This
avoids the array overflow and if we ever see it appear somewhere, we
will know there is a problem with the lua interpreter.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13947

19 months agozstream: allow decompress to fix metadata for uncompressed records
Alan Somers [Fri, 14 Oct 2022 20:40:00 +0000 (14:40 -0600)]
zstream: allow decompress to fix metadata for uncompressed records

If a record is uncompressed on-disk but the block pointer insists
otherwise, reading it will return EIO.  This commit adds an "off" type
to the "zstream decompress" command.  Using it will set the compression
field in a zfs stream to "off" without changing the record's data.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Alan Somers <asomers@FreeBSD.org>
Sponsored by: Axcient
Closes #13997

19 months agoCleanup: Address Clang's static analyzer's unused code complaints
Richard Yao [Fri, 14 Oct 2022 20:37:54 +0000 (16:37 -0400)]
Cleanup: Address Clang's static analyzer's unused code complaints

These were categorized as the following:

 * Dead assignment 23
 * Dead increment 4
 * Dead initialization 6
 * Dead nested assignment 18

Most of these are harmless, but since actual issues can hide among them,
we correct them.

That said, there were a few return values that were being ignored that
appeared to merit some correction:

 * `destroy_callback()` in `cmd/zfs/zfs_main.c` ignored the error from
   `destroy_batched()`. We handle it by returning -1 if there is an
   error.

 * `zfs_do_upgrade()` in `cmd/zfs/zfs_main.c` ignored the error from
   `zfs_for_each()`. We handle it by doing a binary OR of the error
   value from the subsequent `zfs_for_each()` call to the existing
   value. This is how errors are mostly handled inside `zfs_for_each()`.
   The error value here is passed to exit from the zfs command, so doing
   a binary or on it is better than what we did previously.

 * `get_zap_prop()` in `module/zfs/zcp_get.c` ignored the error from
   `dsl_prop_get_ds()` when the property is not of type string. We
   return an error when it does. There is a small concern that the
   `zfs_get_temporary_prop()` call would handle things, but in the case
   that it does not, we would be pushing an uninitialized numval onto
   the lua stack. It is expected that `dsl_prop_get_ds()` will succeed
   anytime that `zfs_get_temporary_prop()` does, so that not giving it a
   chance to fix things is not a problem.

 * `draid_merge_impl()` in `tests/zfs-tests/cmd/draid.c` used
   `nvlist_add_nvlist()` twice in ways in which errors are expected to
   be impossible, so we switch to `fnvlist_add_nvlist()`.

A few notable ones did not merit use of the return value, so we
suppressed it with `(void)`:

 * `write_free_diffs()` in `lib/libzfs/libzfs_diff.c` ignored the error
   value from `describe_free()`. A look through the commit history
   revealed that this was intentional.

 * `arc_evict_hdr()` in `module/zfs/arc.c` did not need to use the
   returned handle from `arc_hdr_realloc()` because it is already
   referenced in lists.

 * `spa_vdev_detach()` in `module/zfs/spa.c` has a comment explicitly
   saying not to use the error from `vdev_label_init()` because whatever
   causes the error could be the reason why a detach is being done.

Unfortunately, I am not presently able to analyze the kernel modules
with Clang's static analyzer, so I could have missed some cases of this.
In cases where reports were present in code that is duplicated between
Linux and FreeBSD, I made a conscious effort to fix the FreeBSD version
too.

After this commit is merged, regressions like dee8934 should become
extremely obvious with Clang's static analyzer since a regression would
appear in the results as the only instance of unused code. That assumes
that Coverity does not catch the issue first.

My local branch with fixes from all of my outstanding non-draft pull
requests shows 118 reports from Clang's static anlayzer after this
patch. That is down by 51 from 169.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Cedric Berger <cedric@precidata.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13986

19 months agoFix potential NULL pointer dereference in lzc_ioctl()
Richard Yao [Fri, 14 Oct 2022 20:33:22 +0000 (16:33 -0400)]
Fix potential NULL pointer dereference in lzc_ioctl()

Users are allowed to pass NULL to resultp, but we unconditionally assume
that they never do. When an external user does pass NULL to resultp, we
dereference a NULL pointer.

Clang's static analyzer complained about this.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14008

19 months agozfs_domount: fix double-disown of dataset / double-free of zfsvfs_t
Christian Schwarz [Fri, 14 Oct 2022 18:46:47 +0000 (20:46 +0200)]
zfs_domount: fix double-disown of dataset / double-free of zfsvfs_t

Before this patch, in zfs_domount, if zfs_root or d_make_root fails, we
leave zfsvfs != NULL. This will lead to execution of the error handling
`if` statement at the `out` label, and hence to a call to
dmu_objset_disown and zfsvfs_free.

However, zfs_umount, which we call upon failure of zfs_root and
d_make_root already does dmu_objset_disown and zfsvfs_free.

I suppose this patch rather adds to the brittleness of this part of the
code base, but I don't want to invest more time in this right now.
To add a regression test, we'd need some kind of fault injection
facility for zfs_root or d_make_root, which doesn't exist right now.
And even then, I think that regression test would be too closely tied
to the implementation.

To repro the double-disown / double-free, do the following:
1. patch zfs_root to always return an error
2. mount a ZFS filesystem

Here's the stack trace you would see then:

  VERIFY3(ds->ds_owner == tag) failed (0000000000000000 == ffff9142361e8000)
  PANIC at dsl_dataset.c:1003:dsl_dataset_disown()
  Showing stack for process 28332
  CPU: 2 PID: 28332 Comm: zpool Tainted: G           O      5.10.103-1.nutanix.el7.x86_64 #1
  Call Trace:
   dump_stack+0x74/0x92
   spl_dumpstack+0x29/0x2b [spl]
   spl_panic+0xd4/0xfc [spl]
   dsl_dataset_disown+0xe9/0x150 [zfs]
   dmu_objset_disown+0xd6/0x150 [zfs]
   zfs_domount+0x17b/0x4b0 [zfs]
   zpl_mount+0x174/0x220 [zfs]
   legacy_get_tree+0x2b/0x50
   vfs_get_tree+0x2a/0xc0
   path_mount+0x2fa/0xa70
   do_mount+0x7c/0xa0
   __x64_sys_mount+0x8b/0xe0
   do_syscall_64+0x38/0x50
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Co-authored-by: Christian Schwarz <christian.schwarz@nutanix.com>
Signed-off-by: Christian Schwarz <christian.schwarz@nutanix.com>
Closes #14025

19 months agocstyle: Allow URLs in C++ comments
ColMelvin [Thu, 13 Oct 2022 18:05:05 +0000 (13:05 -0500)]
cstyle: Allow URLs in C++ comments

If a C++ comment contained a URL, the `://` part of the URL would
trigger an error because there was no trailing blank, but trailing
blanks make for an invalid URL.  Modify the check to ignore text
within the C++ comment.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Lindee <chris.lindee+github@gmail.com>
Closes #13987

19 months agoCleanup: 64-bit kernel module parameters should use fixed width types
Richard Yao [Mon, 3 Oct 2022 19:06:54 +0000 (15:06 -0400)]
Cleanup: 64-bit kernel module parameters should use fixed width types

Various module parameters such as `zfs_arc_max` were originally
`uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long`
for Linux compatibility because Linux's kernel default module parameter
implementation did not support 64-bit types on 32-bit platforms. This
caused problems when porting OpenZFS to Windows because its LLP64 memory
model made `unsigned long` a 32-bit type on 64-bit, which created the
undesireable situation that parameters that should accept 64-bit values
could not on 64-bit Windows.

Upon inspection, it turns out that the Linux kernel module parameter
interface is extensible, such that we are allowed to define our own
types. Rather than maintaining the original type change via hacks to to
continue shrinking module parameters on 32-bit Linux, we implement
support for 64-bit module parameters on Linux.

After doing a review of all 64-bit kernel parameters (found via the man
page and also proposed changes by Andrew Innes), the kernel module
parameters fell into a few groups:

Parameters that were originally 64-bit on Illumos:

 * dbuf_cache_max_bytes
 * dbuf_metadata_cache_max_bytes
 * l2arc_feed_min_ms
 * l2arc_feed_secs
 * l2arc_headroom
 * l2arc_headroom_boost
 * l2arc_write_boost
 * l2arc_write_max
 * metaslab_aliquot
 * metaslab_force_ganging
 * zfetch_array_rd_sz
 * zfs_arc_max
 * zfs_arc_meta_limit
 * zfs_arc_meta_min
 * zfs_arc_min
 * zfs_async_block_max_blocks
 * zfs_condense_max_obsolete_bytes
 * zfs_condense_min_mapping_bytes
 * zfs_deadman_checktime_ms
 * zfs_deadman_synctime_ms
 * zfs_initialize_chunk_size
 * zfs_initialize_value
 * zfs_lua_max_instrlimit
 * zfs_lua_max_memlimit
 * zil_slog_bulk

Parameters that were originally 32-bit on Illumos:

 * zfs_per_txg_dirty_frees_percent

Parameters that were originally `ssize_t` on Illumos:

 * zfs_immediate_write_sz

Note that `ssize_t` is `int32_t` on 32-bit and `int64_t` on 64-bit. It
has been upgraded to 64-bit.

Parameters that were `long`/`unsigned long` because of Linux/FreeBSD
influence:

 * l2arc_rebuild_blocks_min_l2size
 * zfs_key_max_salt_uses
 * zfs_max_log_walking
 * zfs_max_logsm_summary_length
 * zfs_metaslab_max_size_cache_sec
 * zfs_min_metaslabs_to_flush
 * zfs_multihost_interval
 * zfs_unflushed_log_block_max
 * zfs_unflushed_log_block_min
 * zfs_unflushed_log_block_pct
 * zfs_unflushed_max_mem_amt
 * zfs_unflushed_max_mem_ppm

New parameters that do not exist in Illumos:

 * l2arc_trim_ahead
 * vdev_file_logical_ashift
 * vdev_file_physical_ashift
 * zfs_arc_dnode_limit
 * zfs_arc_dnode_limit_percent
 * zfs_arc_dnode_reduce_percent
 * zfs_arc_meta_limit_percent
 * zfs_arc_sys_free
 * zfs_deadman_ziotime_ms
 * zfs_delete_blocks
 * zfs_history_output_max
 * zfs_livelist_max_entries
 * zfs_max_async_dedup_frees
 * zfs_max_nvlist_src_size
 * zfs_rebuild_max_segment
 * zfs_rebuild_vdev_limit
 * zfs_unflushed_log_txg_max
 * zfs_vdev_max_auto_ashift
 * zfs_vdev_min_auto_ashift
 * zfs_vnops_read_chunk_size
 * zvol_max_discard_blocks

Rather than clutter the lists with commentary, the module parameters
that need comments are repeated below.

A few parameters were defined in Linux/FreeBSD specific code, where the
use of ulong/long is not an issue for portability, so we leave them
alone:

 * zfs_delete_blocks
 * zfs_key_max_salt_uses
 * zvol_max_discard_blocks

The documentation for a few parameters was found to be incorrect:

 * zfs_deadman_checktime_ms - incorrectly documented as int
 * zfs_delete_blocks - not documented as Linux only
 * zfs_history_output_max - incorrectly documented as int
 * zfs_vnops_read_chunk_size - incorrectly documented as long
 * zvol_max_discard_blocks - incorrectly documented as ulong

The documentation for these has been fixed, alongside the changes to
document the switch to fixed width types.

In addition, several kernel module parameters were percentages or held
ashift values, so being 64-bit never made sense for them. They have been
downgraded to 32-bit:

 * vdev_file_logical_ashift
 * vdev_file_physical_ashift
 * zfs_arc_dnode_limit_percent
 * zfs_arc_dnode_reduce_percent
 * zfs_arc_meta_limit_percent
 * zfs_per_txg_dirty_frees_percent
 * zfs_unflushed_log_block_pct
 * zfs_vdev_max_auto_ashift
 * zfs_vdev_min_auto_ashift

Of special note are `zfs_vdev_max_auto_ashift` and
`zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`,
and passed to the kernel as `ulong`. This is inherently buggy on big
endian 32-bit Linux, since the values would not be written to the
correct locations. 32-bit FreeBSD was unaffected because its sysctl code
correctly treated this as a `uint64_t`.

Lastly, a code comment suggests that `zfs_arc_sys_free` is
Linux-specific, but there is nothing to indicate to me that it is
Linux-specific. Nothing was done about that.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Original-patch-by: Andrew Innes <andrew.c12@gmail.com>
Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13984
Closes #14004

19 months agoLinux: Remove ZFS_AC_KERNEL_SRC_MODULE_PARAM_CALL_CONST autotools check
Richard Yao [Sun, 9 Oct 2022 01:29:54 +0000 (21:29 -0400)]
Linux: Remove ZFS_AC_KERNEL_SRC_MODULE_PARAM_CALL_CONST autotools check

On older kernels, the definition for `module_param_call()` typecasts
function pointers to `(void *)`, which triggers -Werror, causing the
check to return false when it should return true.

Fixing this breaks the build process on some older kernels because they
define a `__check_old_set_param()` function in their headers that checks
for a non-constified `->set()`. We workaround that through the c
preprocessor by defining `__check_old_set_param(set)` to `(set)`, which
prevents the build failures.

However, it is now apparent that all kernels that we support have
adopted the GRSecurity change, so there is no need to have an explicit
autotools check for it anymore. We therefore remove the autotools check,
while adding the workaround to our headers for the build time
non-constified `->set()` check done by older kernel headers.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13984
Closes #14004

19 months agoetc: mask zfs-load-key.service
наб [Wed, 12 Oct 2022 22:27:55 +0000 (00:27 +0200)]
etc: mask zfs-load-key.service

Otherwise, systemd-sysv-generator will generate a service equivalent
that breaks the boot: under systemd this is covered by
zfs-mount-generator

We already do this for zfs-import.service, and other init scripts are
suppressed automatically by the "actual" .service files

Fixes: commit f04b97620059d08b37d9e80ada397e742bb2f311 ("Add init script
 to load keys")
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #14010
Closes #14019

19 months agoCI: bump actions/upload-artifact to v3
George Melikov [Wed, 12 Oct 2022 06:33:38 +0000 (09:33 +0300)]
CI: bump actions/upload-artifact to v3

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #14018

19 months agoCI: bump actions/checkout to v3
George Melikov [Tue, 11 Oct 2022 19:59:06 +0000 (22:59 +0300)]
CI: bump actions/checkout to v3

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #14018

19 months agoAdd defensive assertions
Richard Yao [Wed, 12 Oct 2022 18:25:18 +0000 (14:25 -0400)]
Add defensive assertions

Coverity complains about possible bugs involving referencing NULL return
values and division by zero. The division by zero bugs require that a
block pointer be corrupt, either from in-memory corruption, or on-disk
corruption. The NULL return value complaints are only bugs if
assumptions that we make about the state of data structures are wrong.
Some seem impossible to be wrong and thus are false positives, while
others are hard to analyze.

Rather than dismiss these as false positives by assuming we know better,
we add defensive assertions to let us know when our assumptions are
wrong.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13972

19 months agoscripts/enum-extract.pl should not hard code perl path
Richard Yao [Tue, 11 Oct 2022 19:32:07 +0000 (15:32 -0400)]
scripts/enum-extract.pl should not hard code perl path

This is a portability issue. The issue had already been fixed for
scripts/cstyle.pl by 2dbf1bf8296f66f24d5e404505c991bfbeec7808.
scripts/enum-extract.pl was added to the repository the following year
without this portability fix.

Michael Bishop informed me that this broke his attempt to build ZFS
2.1.6 on NixOS, since he was building manually outside of their package
manager (that usually rewrites the shebangs to NixOS' unusual paths).
NixOS puts all of the paths into $PATH, so scripts that portably rely
on env to find the interpreter still work.

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14012

19 months agoFreeBSD: Fix a pair of bugs in zfs_fhtovp()
Mark Johnston [Tue, 11 Oct 2022 19:29:55 +0000 (15:29 -0400)]
FreeBSD: Fix a pair of bugs in zfs_fhtovp()

- Add a zfs_exit() call in an error path, otherwise a lock is leaked.
- Remove the fid_gen > 1 check.  That appears to be Linux-specific:
  zfsctl_snapdir_fid() sets fid_gen to 0 or 1 depending on whether the
  snapshot directory is mounted.  On FreeBSD it fails, making snapshot
  dirs inaccessible via NFS.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Andriy Gapon <avg@FreeBSD.org>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Fixes: 43dbf8817808 ("FreeBSD: vfsops: use setgen for error case")
Closes #14001
Closes #13974

19 months agoStop ganging due to past vdev write errors
Serapheim Dimitropoulos [Tue, 11 Oct 2022 19:27:41 +0000 (12:27 -0700)]
Stop ganging due to past vdev write errors

= Problem

While examining a customer's system we noticed unreasonable space
usage from a few snapshots due to gang blocks. Under some further
analysis we discovered that the pool would create gang blocks because
all its disks had non-zero write error counts and they'd be skipped
for normal metaslab allocations due to the following if-clause in
`metaslab_alloc_dva()`:
```
/*
 * Avoid writing single-copy data to a failing,
 * non-redundant vdev, unless we've already tried all
 * other vdevs.
 */
if ((vd->vdev_stat.vs_write_errors > 0 ||
    vd->vdev_state < VDEV_STATE_HEALTHY) &&
    d == 0 && !try_hard && vd->vdev_children == 0) {
metaslab_trace_add(zal, mg, NULL, psize, d,
    TRACE_VDEV_ERROR, allocator);
goto next;
}
```

= Proposed Solution

Get rid of the predicate in the if-clause that checks the past
write errors of the selected vdev. We still try to allocate from
HEALTHY vdevs anyway by checking vdev_state so the past write
errors doesn't seem to help us (quite the opposite - it can cause
issues in long-lived pools like the one from our customer).

= Testing

I first created a pool with 3 vdevs:
```
$ zpool list -v volpool
NAME        SIZE  ALLOC   FREE
volpool    22.5G   117M  22.4G
  xvdb     7.99G  40.2M  7.46G
  xvdc     7.99G  39.1M  7.46G
  xvdd     7.99G  37.8M  7.46G
```

And used `zinject` like so with each one of them:
```
$ sudo zinject -d xvdb -e io -T write -f 0.1 volpool
```

And got the vdevs to the following state:
```
$ zpool status volpool
  pool: volpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.
...<cropped>..
action: Determine if the device needs to be replaced, and clear the
...<cropped>..
config:

NAME        STATE     READ WRITE CKSUM
volpool     ONLINE       0     0     0
  xvdb      ONLINE       0     1     0
  xvdc      ONLINE       0     1     0
  xvdd      ONLINE       0     4     0

```

I also double-checked their write error counters with sdb:
```
sdb> spa volpool | vdev | member vdev_stat.vs_write_errors
(uint64_t)0  # <---- this is the root vdev
(uint64_t)2
(uint64_t)1
(uint64_t)1
```

Then I checked that I the problem was reproduced in my VM as I the
gang count was growing in zdb as I was writting more data:
```
$ sudo zdb volpool | grep gang
        ganged count:              1384

$ sudo zdb volpool | grep gang
        ganged count:              1393

$ sudo zdb volpool | grep gang
        ganged count:              1402

$ sudo zdb volpool | grep gang
        ganged count:              1414
```

Then I updated my bits with this patch and the gang count stayed the
same.

Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #14003

19 months agoFix uninitialized value read in vdev_prop_set()
Richard Yao [Tue, 11 Oct 2022 19:24:36 +0000 (15:24 -0400)]
Fix uninitialized value read in vdev_prop_set()

If no errors are encountered, we read an uninitialized error value.

Clang's static analyzer complained about this.

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14007

19 months agozvol_wait logic may terminate prematurely
Serapheim Dimitropoulos [Tue, 11 Oct 2022 19:12:04 +0000 (12:12 -0700)]
zvol_wait logic may terminate prematurely

Setups that have a lot of zvols may see zvol_wait terminate prematurely
even though the script is still making progress.  For example, we have a
customer that called zvol_wait for ~7100 zvols and by the last iteration
of that script it was still waiting on ~2900. Similarly another one
called zvol_wait for 2200 and by the time the script terminated there
were only 50 left.

This patch adjusts the logic to stay within the outer loop of the script
if we are making any progress whatsoever.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #13998

19 months agoHandle possible null pointers from malloc/strdup/strndup()
Richard Yao [Fri, 7 Oct 2022 00:18:40 +0000 (20:18 -0400)]
Handle possible null pointers from malloc/strdup/strndup()

GCC 12.1.1_p20220625's static analyzer caught these.

Of the two in the btree test, one had previously been caught by Coverity
and Smatch, but GCC flagged it as a false positive. Upon examining how
other test cases handle this, the solution was changed from
`ASSERT3P(node, !=, NULL);` to using `perror()` to be consistent with
the fixes to the other fixes done to the ZTS code.

That approach was also used in ZED since I did not see a better way of
handling this there. Also, upon inspection, additional unchecked
pointers from malloc()/calloc()/strdup() were found in ZED, so those
were handled too.

In other parts of the code, the existing methods to avoid issues from
memory allocators returning NULL were used, such as using
`umem_alloc(size, UMEM_NOFAIL)` or returning `ENOMEM`.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13979