Rob N ★ [Fri, 28 Oct 2022 18:46:44 +0000 (05:46 +1100)]
debug: fix output from VERIFY0 assertion
The previous version reported all the right info, but the VERIFY3 name
made a little more confusing when looking for the matching location in
the source code.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Rob N ★ <robn@despairlabs.com>
Closes #14099
Mariusz Zaborski [Fri, 28 Oct 2022 18:44:18 +0000 (20:44 +0200)]
quota: extend quota for dataset
This patch relax the quota limitation for dataset by around 3%.
What this means is that user can write more data then the quota is
set to. However thanks to that we can get more stable bandwidth, in
case when we are overwriting data in-place, and not consuming any
additional space.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Mariusz Zaborski <oshogbo@vexillium.org> Sponsored-by: Zededa Inc. Sponsored-by: Klara Inc.
Closes #13839
shodanshok [Fri, 28 Oct 2022 17:21:54 +0000 (19:21 +0200)]
Fix ARC target collapse when zfs_arc_meta_limit_percent=100
Reclaim metadata when arc_available_memory < 0 even if
meta_used is not bigger than arc_meta_limit.
As described in https://github.com/openzfs/zfs/issues/14054 if
zfs_arc_meta_limit_percent=100 then ARC target can collapse to
arc_min due to arc_purge not freeing any metadata.
This patch lets arc_prune to do its work when arc_available_memory
is negative even if meta_used is not bigger than arc_meta_limit,
avoiding ARC target collapse.
vaclavskala [Fri, 28 Oct 2022 17:16:31 +0000 (19:16 +0200)]
Propagate extent_bytes change to autotrim thread
The autotrim thread only reads zfs_trim_extent_bytes_min and
zfs_trim_extent_bytes_max variable only on thread start. We
should check for parameter changes during thread execution to
allow parameter changes take effect without needing to disable
then restart the autotrim.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Václav Skála <skala@vshosting.cz>
Closes #14077
Aleksa Sarai [Sat, 22 Jun 2019 00:35:11 +0000 (10:35 +1000)]
zfs_rename: support RENAME_* flags
Implement support for Linux's RENAME_* flags (for renameat2). Aside from
being quite useful for userspace (providing race-free ways to exchange
paths and implement mv --no-clobber), they are used by overlayfs and are
thus required in order to use overlayfs-on-ZFS.
In order for us to represent the new renameat2(2) flags in the ZIL, we
create two new transaction types for the two flags which need
transactional-level support (RENAME_EXCHANGE and RENAME_WHITEOUT).
RENAME_NOREPLACE does not need any ZIL support because we know that if
the operation succeeded before creating the ZIL entry, there was no file
to be clobbered and thus it can be treated as a regular TX_RENAME.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Closes #12209
Closes #14070
This is in preparation for RENAME_EXCHANGE and RENAME_WHITEOUT support
for ZoL, but the changes here allow for far nicer fallbacks than the
previous implementation (the source and target are re-linked in case of
the final link failing).
In addition, a small cleanup was done for the "target exists but is a
different type" codepath so that it's more understandable.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Closes #12209
Closes #14070
Pavel Snajdr [Thu, 5 Dec 2019 00:52:27 +0000 (01:52 +0100)]
Remove zpl_revalidate: fix snapshot rollback
Open files, which aren't present in the snapshot, which is being
roll-backed to, need to disappear from the visible VFS image of
the dataset.
Kernel provides d_drop function to drop invalid entry from
the dcache, but inode can be referenced by dentry multiple dentries.
The introduced zpl_d_drop_aliases function walks and invalidates
all aliases of an inode.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Closes #9600
Closes #14070
Andrew Innes [Fri, 28 Oct 2022 16:30:37 +0000 (00:30 +0800)]
Fix multiplication converted to larger type
This fixes the instances of the "Multiplication result converted to
larger type" alert that codeQL scanning found.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Andrew Innes <andrew.c12@gmail.com>
Closes #14094
youzhongyang [Fri, 28 Oct 2022 16:08:12 +0000 (12:08 -0400)]
Fix zio_flag_t print format
Follow up for 4938d01d which changed zio_flag from enum to uint64_t.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes #14100
Damian Szuberski [Thu, 27 Oct 2022 23:45:14 +0000 (09:45 +1000)]
Process `script` directory for all configs
Even when only building kmods process the scripts directory. This
way the common.sh script will be generated and the zfs.sh script
can be used to load/unload the in-tree kernel modules.
Umer Saleem [Thu, 27 Oct 2022 22:38:45 +0000 (03:38 +0500)]
Add native Debian Packaging for Linux
Currently, the Debian packages are generated from ALIEN that converts
RPMs to Debian packages. This commit adds native Debian packaging for
Debian based systems.
This packaging is a fork of Debian zfs-linux 2.1.6-2 release.
(source: https://salsa.debian.org/zfsonlinux-team/zfs)
Some updates have been made to keep the footprint minimal that
include removing the tests, translation files, patches directory etc.
All credits go to Debian ZFS on Linux Packaging Team.
For copyright information, please refer to contrib/debian/copyright.
scripts/debian-packaging.sh can be used to invoke the build.
Reviewed-by: Mo Zhou <cdluminate@gmail.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #13451
Richard Yao [Thu, 27 Oct 2022 16:54:54 +0000 (09:54 -0700)]
Convert enum zio_flag to uint64_t
We ran out of space in enum zio_flag for additional flags. Rather than
introduce enum zio_flag2 and then modify a bunch of functions to take a
second flags variable, we expand the type to 64 bits via `typedef
uint64_t zio_flag_t`.
Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@klarasystems.com> Signed-off-by: Allan Jude <allan@klarasystems.com> Co-authored-by: Richard Yao <richard.yao@klarasystems.com>
Closes #14086
Richard Yao [Thu, 27 Oct 2022 16:36:17 +0000 (09:36 -0700)]
Add CodeQL workflow
CodeQL is a static analyzer from github with a very low false positive
rate. We have long wanted to have static analysis runs done on every
pull request and using CodeQL, we can.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Andrew Innes <andrew.c12@gmail.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14087
Andrew Innes [Wed, 26 Oct 2022 22:08:31 +0000 (06:08 +0800)]
Aligned free for aligned alloc
Windows port frees memory that was alloc'd aligned in a different way
then alloc'd memory. So changing frees to be specific.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrew Innes <andrew.c12@gmail.com> Co-Authored-By: Jorgen Lundman <lundman@lundman.net>
Closes #14059
The use of __noreturn__ in 55d7afa4adbb4ca569e9c4477a7d121f4dc0bfbd on
spl_panic() caused objtool warnings on Linux when the kernel is built
with CONFIG_STACK_VALIDATION=y. This patch works around that by
restricting the application of __noreturn__ to builds for static
analyzers.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14068
Ameer Hamza [Fri, 21 Oct 2022 17:46:38 +0000 (22:46 +0500)]
zed: Avoid core dump if wholedisk property does not exist
zed aborts and dumps core in vdev_whole_disk_from_config() if
wholedisk property does not exist. make_leaf_vdev() adds the
property but there may be already pools that don't have the
wholedisk in the label.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #14062
youzhongyang [Fri, 21 Oct 2022 17:05:13 +0000 (13:05 -0400)]
Add delay between zpool add zvol and zpool destroy
As investigated by #14026, the zpool_add_004_pos can reliably hang if
the timing is not right. This is caused by a race condition between
zed doing zpool reopen (due to the zvol being added to the zpool),
and the command zpool destroy.
This change adds a delay between zpool add zvol and zpool destroy to
avoid these issue, but does not address the underlying problem.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Issue #14026
Closes #14052
Richard Yao [Thu, 20 Oct 2022 21:52:35 +0000 (17:52 -0400)]
Linux: Fix big endian and partial read bugs in get_system_hostid()
Coverity made two complaints about this function. The first is that we
ignore the number of bytes read. The second is that we have a sizeof
mismatch.
On 64-bit systems, long is a 64-bit type. Paradoxically, the standard
says that hostid is 32-bit, yet is also a long type. On 64-bit big
endian systems, reading into the long would cause us to return 0 as our
hostid after the mask. This is wrong.
Also, if a partial read were to happen (it should not), we would return
a partial hostid, which is also wrong.
We introduce a uint32_t system_hostid stack variable and ensure that the
read is done into it and check the read's return value. Then we set the
value based on whether the read was successful. This should fix both of
coverity's complaints.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Neal Gompa <ngompa@datto.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13968
Coverity complained about a potential resource leak from only calling
`close(fd)` when `fd > 0` because `fd` might be `0`. This is a false
positive, but rather than dismiss it as such, we can change the
comparison to ensure that this never appears again from any static
analyzer. Upon inspection, 6 more instances of this were found in the
file, so those were changed too. Unfortunately, since the file
descriptor has been put into an unsigned variable in `attr.userns_fd`,
we cannot do a non-negative check on it to see if it has not been
allocated, so we instead restructure the error handling to avoid the
need for a check. This also means that errors had not been handled
correctly here, so the static analyzer found a bug (although practically
by accident).
Coverity also complained about a dereference before a NULL check in
`do_idmap_mount()` on `source`. Upon inspection, it appears that the
pointer is never NULL, so we delete the NULL check as cleanup.
Clang's static analyzer complained that the return value of
`write_pid_idmaps()` can be uninitialized if we have no idmaps to write.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Youzhong Yang <yyang@mathworks.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14061
Richard Yao [Thu, 20 Oct 2022 21:14:42 +0000 (17:14 -0400)]
Linux: Upgrade random_get_pseudo_bytes() to xoshiro256++ 1.0
The motivation for upgrading our PRNG is the recent buildbot failures in
the ZTS' tests/functional/fault/decompress_fault test. The probability
of a failure in that test is 0.8^256, which is ~1.6e-25 out of 1, yet we
have observed multiple test failures in it. This suggests a problem with
our random number generation.
The xorshift128+ generator that we were using has been replaced by newer
generators that have "better statistical properties". After doing some
reading, it turns out that these generators have "low linear complexity
of the lowest bits", which could explain the ZTS test failures.
We do two things to try to fix this:
1. We upgrade from xorshift128+ to xoshiro256++ 1.0.
2. We tweak random_get_pseudo_bytes() to copy the higher order
bytes first.
It is hoped that this will fix the test failures in
tests/functional/fault/decompress_fault, although I have not done
simulations. I am skeptical that any simulations I do on a PRNG with a
period of 2^256 - 1 would be meaningful.
Since we have raised the minimum kernel version to 3.10 since this was
first implemented, we have the option of using the Linux kernel's
get_random_int(). However, I am not currently prepared to do performance
tests to ensure that this would not be a regression (for the time
being), so we opt for upgrading our PRNG to a newer one from Sebastiano
Vigna.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13983
Alexander Motin [Thu, 20 Oct 2022 18:57:15 +0000 (14:57 -0400)]
Optimize microzaps
Microzap on-disk format does not include a hash tree, expecting one to
be built in RAM during mzap_open(). The built tree is linked to DMU
user buffer, freed when original DMU buffer is dropped from cache. I've
found that workloads accessing many large directories and having active
eviction from DMU cache spend significant amount of time building and
then destroying the trees. I've also found that for each 64 byte mzap
element additional 64 byte tree element is allocated, that is a waste
of memory and CPU caches.
Improve memory efficiency of the hash tree by switching from AVL-tree
to B-tree. It allows to save 24 bytes per element just on pointers.
Save 32 bits on mze_hash by storing only upper 32 bits since lower 32
bits are always zero for microzaps. Save 16 bits on mze_chunkid, since
microzap can never have so many elements. Respectively with the 16 bits
there can be no more than 16 bits of collision differentiators. As
result, struct mzap_ent now drops from 48 (rounded to 64) to 8 bytes.
Tune B-trees for small data. Reduce BTREE_CORE_ELEMS from 128 to 126
to allow struct zfs_btree_core in case of 8 byte elements to pack into
2KB instead of 4KB. Aside of the microzaps it should also help 32bit
range trees. Allow custom B-tree leaf size to reduce memmove() time.
Split zap_name_alloc() into zap_name_alloc() and zap_name_init_str().
It allows to not waste time allocating/freeing memory when processing
multiple names in a loop during mzap_open().
Together on a pool with 10K directories of 1800 files each and DMU
cache limited to 128MB this reduces time of `find . -name zzz` by 41%
from 7.63s to 4.47s, and saves additional ~30% of CPU time on the DMU
cache reclamation.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14039
Richard Yao [Thu, 20 Oct 2022 16:12:21 +0000 (12:12 -0400)]
Fix multiple definitions of struct mount_attr on recent glibc versions
The ifdef used would never work because the CPP is not aware of C
structure definitions. Rather than use an autotools check, we can just
use a nameless structure that we typedef to mount_attr_t. This is a
Linux kernel interface, which means that it is stable and this is fine
to do.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Youzhong Yang <yyang@mathworks.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14057
Closes #14058
Richard Yao [Sun, 16 Oct 2022 04:02:47 +0000 (00:02 -0400)]
abd_return_buf() should call zfs_refcount_remove_many() early
Calling zfs_refcount_remove_many() after freeing memory means we pass a
reference to freed memory as the holder. This is not believed to be able
to cause a problem, but there is a bit of a tradition of fixing these
issues when they appear so that they do not obscure more serious issues
in static analyzer output, so we fix this one too.
Clang's static analyzer found this with the help of CodeChecker's CTU
analysis.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14043
Richard Yao [Sun, 16 Oct 2022 03:35:56 +0000 (23:35 -0400)]
crypto_get_ptrs() should always write to *out_data_2
Callers will check if it has been set to NULL before trying to access
it, but never initialize it themselves. Whenever "one block spans two
iovecs", `crypto_get_ptrs()` will return, without ever setting
`*out_data_2 = NULL`. The caller will then do a NULL check against the
uninitailized pointer and if it is not zero, pass it to `memcpy()`.
The only reason this has not caused horrible runtime issues is because
`memcpy()` should be told to copy zero bytes when this happens. That
said, this is technically undefined behavior, so we should correct it so
that future changes to the code cannot trigger it.
Clang's static analyzer found this with the help of CodeChecker's CTU
analysis.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14043
Richard Yao [Sat, 15 Oct 2022 22:34:31 +0000 (18:34 -0400)]
Silence static analyzer warnings about spa_sync_props()
Both Coverity and Clang's static analyzer complain about reading an
uninitialized intval if the property is not passed as DATA_TYPE_UINT64
in the nvlist. This is impossible becuase spa_prop_validate() already
checked this, but they are unlikely to be the last static analyzers to
complain about this, so lets just refactor the code to suppress the
warnings.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14043
Richard Yao [Sat, 15 Oct 2022 22:27:03 +0000 (18:27 -0400)]
Fix theoretical use of uninitialized values
Clang's static analyzer complains about this.
In get_configs(), if we have an invalid configuration that has no top
level vdevs, we can read a couple of uninitialized variables. Aborting
upon seeing this would break the userland tools for healthy pools, so we
instead initialize the two variables to 0 to allow the userland tools to
continue functioning for the pools with valid configurations.
In zfs_do_wait(), if no wait activities are enabled, we read an
uninitialized error variable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14043
Akash B [Thu, 20 Oct 2022 00:07:51 +0000 (05:37 +0530)]
Add options to zfs redundant_metadata property
Currently, additional/extra copies are created for metadata in
addition to the redundancy provided by the pool(mirror/raidz/draid),
due to this 2 times more space is utilized per inode and this decreases
the total number of inodes that can be created in the filesystem. By
setting redundant_metadata to none, no additional copies of metadata
are created, hence can reduce the space consumed by the additional
metadata copies and increase the total number of inodes that can be
created in the filesystem. Additionally, this can improve file create
performance due to the reduced amount of metadata which needs
to be written.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Dipak Ghosh <dipak.ghosh@hpe.com> Signed-off-by: Akash B <akash-b@hpe.com>
Closes #13680
This patch handles the race condition on simultaneous failure of
2 drives, which misses the vdev_rebuild_reset_wanted signal in
vdev_rebuild_thread. We retry to catch this inside the
vdev_rebuild_complete_sync function.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Dipak Ghosh <dipak.ghosh@hpe.com> Reviewed-by: Akash B <akash-b@hpe.com> Signed-off-by: Samuel Wycliffe J <samwyc@hpe.com>
Closes #14041
Closes #14050
youzhongyang [Wed, 19 Oct 2022 18:17:09 +0000 (14:17 -0400)]
Support idmapped mount
Adds support for idmapped mounts. Supported as of Linux 5.12 this
functionality allows user and group IDs to be remapped without changing
their state on disk. This can be useful for portable home directories
and a variety of container related use cases.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes #12923
Closes #13671
Richard Yao [Tue, 18 Oct 2022 23:03:33 +0000 (19:03 -0400)]
Fix memory leaks in dmu_send()/dmu_send_obj()
If we encounter an EXDEV error when using the redacted snapshots
feature, the memory used by dspp.fromredactsnaps is leaked.
Clang's static analyzer caught this during an experiment in which I had
annotated various headers in an attempt to improve the results of static
analysis.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13973
Richard Yao [Sun, 16 Oct 2022 02:54:57 +0000 (22:54 -0400)]
Cleanup: Simplify userspace abd_free_chunks()
Clang's static analyzer complained that we could use after free here if
the inner loop ever iterated. That is a false positive, but upon
inspection, the userland abd_alloc_chunks() function never will put
multiple consecutive pages into a `struct scatterlist`, so there is no
need to loop. We delete the inner loop.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14042
Richard Yao [Sun, 16 Oct 2022 04:19:13 +0000 (00:19 -0400)]
Fix NULL pointer dereference in spa_open_common()
Calling spa_open() will pass a NULL pointer to spa_open_common()'s
config parameter. Under the right circumstances, we will dereference the
config parameter without doing a NULL check.
Clang's static analyzer found this.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14044
Richard Yao [Sat, 15 Oct 2022 02:55:48 +0000 (22:55 -0400)]
Fix NULL pointer passed to strlcpy from zap_lookup_impl()
Clang's static analyzer pointed out that whenever zap_lookup_by_dnode()
is called, we have the following stack where strlcpy() is passed a NULL
pointer for realname from zap_lookup_by_dnode():
Richard Yao [Tue, 18 Oct 2022 19:42:14 +0000 (15:42 -0400)]
ZED: Fix uninitialized value reads
Coverity complained about a couple of uninitialized value reads in ZED.
* zfs_deliver_dle() can pass an uninitialized string to zed_log_msg()
* An uninitialized sev.sigev_signo is passed to timer_create()
The former would log garbage while the latter is not a real issue, but
we might as well suppress it by initializing the field to 0 for
consistency's sake.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14047
Coleman Kane [Tue, 18 Oct 2022 19:29:44 +0000 (15:29 -0400)]
Linux 6.1 compat: change order of sys/mutex.h includes
After Linux 6.1-rc1 came out, the build started failing to build a
couple of the files in the linux spl code due to the mutex_init
redefinition. Moving the sys/mutex.h include to a lower position within
these two files appears to fix the problem.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #14040
In addition, an attempt to inform coverity that certain modelled
functions read entire buffers was used by adding the following to
certain models:
int first = buf[0];
int last = buf[buflen-1];
It was inspired by the QEMU model file.
No additional false positives were found by this, but it is believed
that the more accurate model file will help to catch false positives in
the future.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14048
Alexander [Mon, 17 Oct 2022 18:08:36 +0000 (20:08 +0200)]
Linux compat: fix DECLARE_EVENT_CLASS() test when ZFS is built-in
ZFS_LINUX_TRY_COMPILE_HEADER macro doesn't take CONFIG_ZFS=y into
account. As a result, on several latest Linux versions, configure
script marks DECLARE_EVENT_CLASS() available for non-GPL when ZFS
is being built as a module, but marks it unavailable when ZFS is
built-in.
Follow the logic of the neighbor macros and adjust
ZFS_LINUX_TRY_COMPILE_HEADER accordingly, so that it doesn't try
to look for a .ko when ZFS is built-in.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Closes #14006
Richard Yao [Fri, 14 Oct 2022 20:41:56 +0000 (16:41 -0400)]
Fix theoretical array overflow in lua_typename()
Out of the 12 defects in lua that coverity reports, 5 of them involve
`lua_typename()` and out of the dozens of defects in ZFS that lua
reports, 3 of them involve `lua_typename()` due to the ZCP code. Given
all of the uses of `lua_typename()` in the ZCP code, I was surprised
that there were not more. It appears that only 2 were reported because
only 3 called `lua_type()`, which does a defective sanity check that
allows invalid types to be passed.
lua/lua@d4fb848be77f4b0209acaf37a5b5e1cee741ddce addressed this in
upstream lua 5.3. Unfortunately, we did not get that fix since we use
lua 5.2 and we do not have assertions enabled in lua, so the upstream
solution would not do anything.
While we could adopt the upstream solution and enable assertions, a
simpler solution is to fix the issue by making `lua_typename()` return
`internal_type_error` whenever it is called with an invalid type. This
avoids the array overflow and if we ever see it appear somewhere, we
will know there is a problem with the lua interpreter.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13947
Alan Somers [Fri, 14 Oct 2022 20:40:00 +0000 (14:40 -0600)]
zstream: allow decompress to fix metadata for uncompressed records
If a record is uncompressed on-disk but the block pointer insists
otherwise, reading it will return EIO. This commit adds an "off" type
to the "zstream decompress" command. Using it will set the compression
field in a zfs stream to "off" without changing the record's data.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Alan Somers <asomers@FreeBSD.org>
Sponsored by: Axcient
Closes #13997
* Dead assignment 23
* Dead increment 4
* Dead initialization 6
* Dead nested assignment 18
Most of these are harmless, but since actual issues can hide among them,
we correct them.
That said, there were a few return values that were being ignored that
appeared to merit some correction:
* `destroy_callback()` in `cmd/zfs/zfs_main.c` ignored the error from
`destroy_batched()`. We handle it by returning -1 if there is an
error.
* `zfs_do_upgrade()` in `cmd/zfs/zfs_main.c` ignored the error from
`zfs_for_each()`. We handle it by doing a binary OR of the error
value from the subsequent `zfs_for_each()` call to the existing
value. This is how errors are mostly handled inside `zfs_for_each()`.
The error value here is passed to exit from the zfs command, so doing
a binary or on it is better than what we did previously.
* `get_zap_prop()` in `module/zfs/zcp_get.c` ignored the error from
`dsl_prop_get_ds()` when the property is not of type string. We
return an error when it does. There is a small concern that the
`zfs_get_temporary_prop()` call would handle things, but in the case
that it does not, we would be pushing an uninitialized numval onto
the lua stack. It is expected that `dsl_prop_get_ds()` will succeed
anytime that `zfs_get_temporary_prop()` does, so that not giving it a
chance to fix things is not a problem.
* `draid_merge_impl()` in `tests/zfs-tests/cmd/draid.c` used
`nvlist_add_nvlist()` twice in ways in which errors are expected to
be impossible, so we switch to `fnvlist_add_nvlist()`.
A few notable ones did not merit use of the return value, so we
suppressed it with `(void)`:
* `write_free_diffs()` in `lib/libzfs/libzfs_diff.c` ignored the error
value from `describe_free()`. A look through the commit history
revealed that this was intentional.
* `arc_evict_hdr()` in `module/zfs/arc.c` did not need to use the
returned handle from `arc_hdr_realloc()` because it is already
referenced in lists.
* `spa_vdev_detach()` in `module/zfs/spa.c` has a comment explicitly
saying not to use the error from `vdev_label_init()` because whatever
causes the error could be the reason why a detach is being done.
Unfortunately, I am not presently able to analyze the kernel modules
with Clang's static analyzer, so I could have missed some cases of this.
In cases where reports were present in code that is duplicated between
Linux and FreeBSD, I made a conscious effort to fix the FreeBSD version
too.
After this commit is merged, regressions like dee8934 should become
extremely obvious with Clang's static analyzer since a regression would
appear in the results as the only instance of unused code. That assumes
that Coverity does not catch the issue first.
My local branch with fixes from all of my outstanding non-draft pull
requests shows 118 reports from Clang's static anlayzer after this
patch. That is down by 51 from 169.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Cedric Berger <cedric@precidata.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13986
Richard Yao [Fri, 14 Oct 2022 20:33:22 +0000 (16:33 -0400)]
Fix potential NULL pointer dereference in lzc_ioctl()
Users are allowed to pass NULL to resultp, but we unconditionally assume
that they never do. When an external user does pass NULL to resultp, we
dereference a NULL pointer.
Clang's static analyzer complained about this.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14008
zfs_domount: fix double-disown of dataset / double-free of zfsvfs_t
Before this patch, in zfs_domount, if zfs_root or d_make_root fails, we
leave zfsvfs != NULL. This will lead to execution of the error handling
`if` statement at the `out` label, and hence to a call to
dmu_objset_disown and zfsvfs_free.
However, zfs_umount, which we call upon failure of zfs_root and
d_make_root already does dmu_objset_disown and zfsvfs_free.
I suppose this patch rather adds to the brittleness of this part of the
code base, but I don't want to invest more time in this right now.
To add a regression test, we'd need some kind of fault injection
facility for zfs_root or d_make_root, which doesn't exist right now.
And even then, I think that regression test would be too closely tied
to the implementation.
To repro the double-disown / double-free, do the following:
1. patch zfs_root to always return an error
2. mount a ZFS filesystem
Here's the stack trace you would see then:
VERIFY3(ds->ds_owner == tag) failed (0000000000000000 == ffff9142361e8000)
PANIC at dsl_dataset.c:1003:dsl_dataset_disown()
Showing stack for process 28332
CPU: 2 PID: 28332 Comm: zpool Tainted: G O 5.10.103-1.nutanix.el7.x86_64 #1
Call Trace:
dump_stack+0x74/0x92
spl_dumpstack+0x29/0x2b [spl]
spl_panic+0xd4/0xfc [spl]
dsl_dataset_disown+0xe9/0x150 [zfs]
dmu_objset_disown+0xd6/0x150 [zfs]
zfs_domount+0x17b/0x4b0 [zfs]
zpl_mount+0x174/0x220 [zfs]
legacy_get_tree+0x2b/0x50
vfs_get_tree+0x2a/0xc0
path_mount+0x2fa/0xa70
do_mount+0x7c/0xa0
__x64_sys_mount+0x8b/0xe0
do_syscall_64+0x38/0x50
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Co-authored-by: Christian Schwarz <christian.schwarz@nutanix.com> Signed-off-by: Christian Schwarz <christian.schwarz@nutanix.com>
Closes #14025
ColMelvin [Thu, 13 Oct 2022 18:05:05 +0000 (13:05 -0500)]
cstyle: Allow URLs in C++ comments
If a C++ comment contained a URL, the `://` part of the URL would
trigger an error because there was no trailing blank, but trailing
blanks make for an invalid URL. Modify the check to ignore text
within the C++ comment.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Lindee <chris.lindee+github@gmail.com>
Closes #13987
Richard Yao [Mon, 3 Oct 2022 19:06:54 +0000 (15:06 -0400)]
Cleanup: 64-bit kernel module parameters should use fixed width types
Various module parameters such as `zfs_arc_max` were originally
`uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long`
for Linux compatibility because Linux's kernel default module parameter
implementation did not support 64-bit types on 32-bit platforms. This
caused problems when porting OpenZFS to Windows because its LLP64 memory
model made `unsigned long` a 32-bit type on 64-bit, which created the
undesireable situation that parameters that should accept 64-bit values
could not on 64-bit Windows.
Upon inspection, it turns out that the Linux kernel module parameter
interface is extensible, such that we are allowed to define our own
types. Rather than maintaining the original type change via hacks to to
continue shrinking module parameters on 32-bit Linux, we implement
support for 64-bit module parameters on Linux.
After doing a review of all 64-bit kernel parameters (found via the man
page and also proposed changes by Andrew Innes), the kernel module
parameters fell into a few groups:
Parameters that were originally 64-bit on Illumos:
The documentation for a few parameters was found to be incorrect:
* zfs_deadman_checktime_ms - incorrectly documented as int
* zfs_delete_blocks - not documented as Linux only
* zfs_history_output_max - incorrectly documented as int
* zfs_vnops_read_chunk_size - incorrectly documented as long
* zvol_max_discard_blocks - incorrectly documented as ulong
The documentation for these has been fixed, alongside the changes to
document the switch to fixed width types.
In addition, several kernel module parameters were percentages or held
ashift values, so being 64-bit never made sense for them. They have been
downgraded to 32-bit:
Of special note are `zfs_vdev_max_auto_ashift` and
`zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`,
and passed to the kernel as `ulong`. This is inherently buggy on big
endian 32-bit Linux, since the values would not be written to the
correct locations. 32-bit FreeBSD was unaffected because its sysctl code
correctly treated this as a `uint64_t`.
Lastly, a code comment suggests that `zfs_arc_sys_free` is
Linux-specific, but there is nothing to indicate to me that it is
Linux-specific. Nothing was done about that.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Original-patch-by: Andrew Innes <andrew.c12@gmail.com> Original-patch-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13984
Closes #14004
On older kernels, the definition for `module_param_call()` typecasts
function pointers to `(void *)`, which triggers -Werror, causing the
check to return false when it should return true.
Fixing this breaks the build process on some older kernels because they
define a `__check_old_set_param()` function in their headers that checks
for a non-constified `->set()`. We workaround that through the c
preprocessor by defining `__check_old_set_param(set)` to `(set)`, which
prevents the build failures.
However, it is now apparent that all kernels that we support have
adopted the GRSecurity change, so there is no need to have an explicit
autotools check for it anymore. We therefore remove the autotools check,
while adding the workaround to our headers for the build time
non-constified `->set()` check done by older kernel headers.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13984
Closes #14004
Richard Yao [Wed, 12 Oct 2022 18:25:18 +0000 (14:25 -0400)]
Add defensive assertions
Coverity complains about possible bugs involving referencing NULL return
values and division by zero. The division by zero bugs require that a
block pointer be corrupt, either from in-memory corruption, or on-disk
corruption. The NULL return value complaints are only bugs if
assumptions that we make about the state of data structures are wrong.
Some seem impossible to be wrong and thus are false positives, while
others are hard to analyze.
Rather than dismiss these as false positives by assuming we know better,
we add defensive assertions to let us know when our assumptions are
wrong.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13972
Richard Yao [Tue, 11 Oct 2022 19:32:07 +0000 (15:32 -0400)]
scripts/enum-extract.pl should not hard code perl path
This is a portability issue. The issue had already been fixed for
scripts/cstyle.pl by 2dbf1bf8296f66f24d5e404505c991bfbeec7808.
scripts/enum-extract.pl was added to the repository the following year
without this portability fix.
Michael Bishop informed me that this broke his attempt to build ZFS
2.1.6 on NixOS, since he was building manually outside of their package
manager (that usually rewrites the shebangs to NixOS' unusual paths).
NixOS puts all of the paths into $PATH, so scripts that portably rely
on env to find the interpreter still work.
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14012
Mark Johnston [Tue, 11 Oct 2022 19:29:55 +0000 (15:29 -0400)]
FreeBSD: Fix a pair of bugs in zfs_fhtovp()
- Add a zfs_exit() call in an error path, otherwise a lock is leaked.
- Remove the fid_gen > 1 check. That appears to be Linux-specific:
zfsctl_snapdir_fid() sets fid_gen to 0 or 1 depending on whether the
snapshot directory is mounted. On FreeBSD it fails, making snapshot
dirs inaccessible via NFS.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Andriy Gapon <avg@FreeBSD.org> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Fixes: 43dbf8817808 ("FreeBSD: vfsops: use setgen for error case")
Closes #14001
Closes #13974
While examining a customer's system we noticed unreasonable space
usage from a few snapshots due to gang blocks. Under some further
analysis we discovered that the pool would create gang blocks because
all its disks had non-zero write error counts and they'd be skipped
for normal metaslab allocations due to the following if-clause in
`metaslab_alloc_dva()`:
```
/*
* Avoid writing single-copy data to a failing,
* non-redundant vdev, unless we've already tried all
* other vdevs.
*/
if ((vd->vdev_stat.vs_write_errors > 0 ||
vd->vdev_state < VDEV_STATE_HEALTHY) &&
d == 0 && !try_hard && vd->vdev_children == 0) {
metaslab_trace_add(zal, mg, NULL, psize, d,
TRACE_VDEV_ERROR, allocator);
goto next;
}
```
= Proposed Solution
Get rid of the predicate in the if-clause that checks the past
write errors of the selected vdev. We still try to allocate from
HEALTHY vdevs anyway by checking vdev_state so the past write
errors doesn't seem to help us (quite the opposite - it can cause
issues in long-lived pools like the one from our customer).
= Testing
I first created a pool with 3 vdevs:
```
$ zpool list -v volpool
NAME SIZE ALLOC FREE
volpool 22.5G 117M 22.4G
xvdb 7.99G 40.2M 7.46G
xvdc 7.99G 39.1M 7.46G
xvdd 7.99G 37.8M 7.46G
```
And used `zinject` like so with each one of them:
```
$ sudo zinject -d xvdb -e io -T write -f 0.1 volpool
```
And got the vdevs to the following state:
```
$ zpool status volpool
pool: volpool
state: ONLINE
status: One or more devices has experienced an unrecoverable error.
...<cropped>..
action: Determine if the device needs to be replaced, and clear the
...<cropped>..
config:
I also double-checked their write error counters with sdb:
```
sdb> spa volpool | vdev | member vdev_stat.vs_write_errors
(uint64_t)0 # <---- this is the root vdev
(uint64_t)2
(uint64_t)1
(uint64_t)1
```
Then I checked that I the problem was reproduced in my VM as I the
gang count was growing in zdb as I was writting more data:
```
$ sudo zdb volpool | grep gang
ganged count: 1384
Then I updated my bits with this patch and the gang count stayed the
same.
Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #14003
Richard Yao [Tue, 11 Oct 2022 19:24:36 +0000 (15:24 -0400)]
Fix uninitialized value read in vdev_prop_set()
If no errors are encountered, we read an uninitialized error value.
Clang's static analyzer complained about this.
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14007
Setups that have a lot of zvols may see zvol_wait terminate prematurely
even though the script is still making progress. For example, we have a
customer that called zvol_wait for ~7100 zvols and by the last iteration
of that script it was still waiting on ~2900. Similarly another one
called zvol_wait for 2200 and by the time the script terminated there
were only 50 left.
This patch adjusts the logic to stay within the outer loop of the script
if we are making any progress whatsoever.
Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #13998
Of the two in the btree test, one had previously been caught by Coverity
and Smatch, but GCC flagged it as a false positive. Upon examining how
other test cases handle this, the solution was changed from
`ASSERT3P(node, !=, NULL);` to using `perror()` to be consistent with
the fixes to the other fixes done to the ZTS code.
That approach was also used in ZED since I did not see a better way of
handling this there. Also, upon inspection, additional unchecked
pointers from malloc()/calloc()/strdup() were found in ZED, so those
were handled too.
In other parts of the code, the existing methods to avoid issues from
memory allocators returning NULL were used, such as using
`umem_alloc(size, UMEM_NOFAIL)` or returning `ENOMEM`.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13979
Richard Yao [Thu, 6 Oct 2022 00:09:24 +0000 (20:09 -0400)]
PAM: Fix unchecked return value from zfs_key_config_load()
9a49c6b782443ba6e627f2261c45f082ad843094 was intended to fix this issue,
but I had missed the case in pam_sm_open_session(). Clang's static
analyzer had not reported it and I forgot to look for other cases.
Interestingly, GCC gcc-12.1.1_p20220625's static analyzer had caught
this as multiple double-free bugs, since another failure after the
failure in zfs_key_config_load() will cause us to attempt to free the
memory that zfs_key_config_load() was supposed to allocate, but had
cleaned up upon failure.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13978
Jorgen Lundman [Thu, 6 Oct 2022 00:07:50 +0000 (09:07 +0900)]
Avoid calling rw_destroy() on uninitialized rwlock
First the function `memset(&key, 0, ...)` but
any call to "goto error;" would call zio_crypt_key_destroy(key) which
calls `rw_destroy()`. The `rw_init()` is moved up to be right after the
memset. This way the rwlock can be released.
The ctx does allocate memory, but that is handled by the memset to 0
and icp skips NULL ptrs.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #13976
shodanshok [Tue, 4 Oct 2022 18:00:02 +0000 (20:00 +0200)]
Remove ambiguity on demand vs prefetch stats reported by arc_summary
arc_summary currently list prefetch stats as "demand prefetch"
However, a hit/miss can be due to demand or prefetch, not both.
To remove any confusion, this patch removes the "Demand" word
from the affected lines.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Closes #13985
Finix1979 [Tue, 4 Oct 2022 17:55:35 +0000 (01:55 +0800)]
Avoid unnecessary metaslab_check_free calling
The metaslab_check_free() function only needs to be called in the
GANG|DEDUP|etc case because zio_free_sync() will internally call
metaslab_check_free().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Finix1979 <yancw@info2soft.com>
Closes #13977
In libzutil, for zpool_search_import and zpool_find_config, we use
libpc_handle_t internally, which does not maintain error code and it is
not exposed in the interface. Due to this, the error information is not
propagated to the caller. Instead, an error message is printed on
stderr.
This commit adds lpc_error field in libpc_handle_t and exposes it in
the interface, which can be used by the users of libzutil to get the
appropriate error information and handle it accordingly.
Users of the API can also control if they want to print the error
message on stderr.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #13969
Richard Yao [Mon, 3 Oct 2022 20:41:58 +0000 (16:41 -0400)]
Fix memory leak found by GCC static analyzer
GCC 12.1.1_p20220625's -fanalyzer found and reported this.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13975
Richard Yao [Sat, 1 Oct 2022 00:02:57 +0000 (20:02 -0400)]
Fix userland dereference NULL return value bugs
* `zstream_do_token()` does not handle failures from `libzfs_init()`
* `ztest_global_vars_to_zdb_args()` does not handle failures from
`calloc()`.
* `zfs_snapshot_nvl()` will pass an offset to a NULL pointer as a
source to `strlcpy()` if the provided nvlist is `NULL`.
We handle these by doing what the existing error handling does for other
errors involving these functions.
Coverity complained about these. It had complained about several more,
but one was fixed by 570ca4441e0583c8dcb5c7179f5eb331d1172784 and
another was a false positive. The remaining complaints labelled
"dereferece null return vaue" involve fetching things stored in
in-kernel data structures via `list_head()/list_next()`,
`AVL_PREV()/AVL_NEXT()` and `zfs_btree_find()`. Most of them occur in
void functions that have no error handling. They are much harder to
analyze than the two fixed in this patch, so they are left for a
follow-up patch.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13971
Richard Yao [Fri, 30 Sep 2022 23:59:51 +0000 (19:59 -0400)]
Fix potential NULL pointer dereference in dsl_dataset_promote_check()
If the `list_head()` returns NULL, we dereference it, right before we
check to see if it returned NULL.
We have defined two different pointers that both point to the same
thing, which are `origin_head` and `origin_ds`. Almost everything uses
`origin_ds`, so we switch them to use `origin_ds`.
We also promote `origin_ds` to a const pointer so that the compiler
verifies that nothing modifies it.
Coverity complained about this.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Neal Gompa <ngompa@datto.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13967
So we can use `zio_abd_checksum_func_t` for const declarations now.
It's not needed that we use the `const` qualifier again like this:
`const zio_abd_checksum_func_t *varname;`
This patch solves the double const qualifiers, which were found by
smatch.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes #13961
This makes both Clang's Static Analyzer and Coverity properly identify
assertions. This change reduced Clang's reported defects from 246 to
180. It also reduced the false positives reported by Coverityi by 10,
while enabling Coverity to find 9 more defects that previously were
false negatives.
A couple examples of this would be CID-1524417 and CID-1524423. After
submitting a build to coverity with the modified assertions, CID-1524417
disappeared while the report for CID-1524423 no longer claimed that the
assertion tripped.
Coincidentally, it turns out that it is possible to more accurately
annotate our headers than the Coverity modelling file permits in the
case of format strings. Since we can do that and this patch annotates
headers whenever `__coverity_panic__()` would have been used in the
model file, we drop all models that use `__coverity_panic__()` from the
model file.
Upon seeing the success in eliminating false positives involving
assertions, it occurred to me that we could also modify our headers to
eliminate coverity's false positives involving byte swaps. We now have
coverity specific byteswap macros, that do nothing, to disable
Coverity's false positives when we do byte swaps. This allowed us to
also drop the byteswap definitions from the model file.
Lastly, a model file update has been done beyond the mentioned
deletions:
* The definitions of `umem_alloc_aligned()`, `umem_alloc()` andi
`umem_zalloc()` were originally implemented in a way that was
intended to inform coverity that when KM_SLEEP has been passed these
functions, they do not return NULL. A small error in how this was
done was found, so we correct it.
* Definitions for umem_cache_alloc() and umem_cache_free() have been
added.
In practice, no false positives were avoided by making these changes,
but in the interest of correctness from future coverity builds, we make
them anyway.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13902
Richard Yao [Thu, 29 Sep 2022 17:16:37 +0000 (13:16 -0400)]
Fix unreachable code in zstreamdump
82226e4f44baa3f7c3101caaaf941927aa318e46 was intended to prevent a
warning from being printed in situations where it was inappropriate, but
accidentally disabled it entirely by setting featureflags in the wrong
case statement.
Coverity reported this as dead code.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13946
Recently we hit an assertion panic in `dsl_process_sub_livelist` while
exporting the spa and interrupting `bpobj_iterate_nofree`. In that case
`bpobj_iterate_nofree` stops mid-way returning an EINTR without clearing
the intermediate AVL tree that keeps track of the livelist entries it
has encountered so far. At that point the code has a VERIFY for the
number of elements of the AVL expecting it to be zero (which is not the
case for EINTR).
= Fix
Cleanup any intermediate state before destroying the AVL when
encountering EINTR. Also added a comment documenting the scenario where
the EINTR comes up. There is no need to do anything else for the calles
of `dsl_process_sub_livelist` as they already handle the EINTR case.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #13939
Richard Yao [Thu, 29 Sep 2022 16:02:57 +0000 (12:02 -0400)]
Fix unchecked return values
2a493a4c7127258b14c39e8c71a9d6f01167c5cd was intended to fix all
instances of coverity reported unchecked return values, but
unfortunately, two were missed by mistake. This commit fixes the
unchecked return values that had been missed.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Neal Gompa <ngompa@datto.com> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13945
Richard Yao [Thu, 29 Sep 2022 15:56:42 +0000 (11:56 -0400)]
Miscellaneous ZTS fixes
Coverity had various complaints about minor issues. They are all fairly
straightforward to understand without reading additional files, with the
exception of the draid.c issue. vdev_draid_rand() takes a 128-bit
starting seed, but we were passing a pointer to a 64-bit value, which
understandably made Coverity complain. This is perhaps the only
significant issue fixed in this patch, since it causes stack corruption.
These are not all of the issues in the ZTS that Coverity caught, but a
number of them are already fixed in other PRs. There is also a class of
TOUTOC complaints that involve very minor things in the ZTS (e.g.
access() before unlink()). I have yet to decide whether they are false
positives (since this is not security sensitive code) or something to
cleanup.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Neal Gompa <ngompa@datto.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13943
Ameer Hamza [Wed, 28 Sep 2022 16:48:46 +0000 (21:48 +0500)]
zed: mark disks as REMOVED when they are removed
ZED does not take any action for disk removal events if there is no
spare VDEV available. Added zpool_vdev_remove_wanted() in libzfs
and vdev_remove_wanted() in vdev.c to remove the VDEV through ZED
on removal event. This means that if you are running zed and
remove a disk, it will be properly marked as REMOVED.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #13797
The current value causes significant artificial slowdown during mass
parallel file removal, which can be observed both on FreeBSD and Linux
when running real workloads.
Sample results from Linux doing make -j 96 clean after an allyesconfig
modules build:
before: 4.14s user 6.79s system 48% cpu 22.631 total
after: 4.17s user 6.44s system 153% cpu 6.927 total
FreeBSD results in the ticket.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #13932
Closes #13938
Memory leak/unchecked malloc. We do allocate a bit too early (and
fail to validate the result). From this, smatch is angry when we
overwrite the value of 'node' later.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Toomas Soome <tsoome@me.com>
Closes #13941
Richard Yao [Tue, 27 Sep 2022 23:47:24 +0000 (19:47 -0400)]
Fix unsafe string operations
Coverity caught unsafe use of `strcpy()` in `ztest_dmu_objset_own()`,
`nfs_init_tmpfile()` and `dump_snapshot()`. It also caught an unsafe use
of `strlcat()` in `nfs_init_tmpfile()`.
Inspired by this, I did an audit of every single usage of `strcpy()` and
`strcat()` in the code. If I could not prove that the usage was safe, I
changed the code to use either `strlcpy()` or `strlcat()`, depending on
which function was originally used. In some cases, `snprintf()` was used
to replace multiple uses of `strcat` because it was cleaner.
Whenever I changed a function, I preferred to use `sizeof(dst)` when the
compiler is able to provide the string size via that. When it could not
because the string was passed by a caller, I checked the entire call
tree of the function to find out how big the buffer was and hard coded
it. Hardcoding is less than ideal, but it is safe unless someone shrinks
the buffer sizes being passed.
Additionally, Coverity reported three more string related issues:
* It caught a case where we do an overlapping memory copy in a call to
`snprintf()`. We fix that via `kmem_strdup()` and `kmem_strfree()`.
* It caught `sizeof (buf)` being used instead of `buflen` in
`zdb_nicenum()`'s call to `zfs_nicenum()`, which is passed to
`snprintf()`. We change that to pass `buflen`.
* It caught a theoretical unterminated string passed to `strcmp()`.
This one is likely a false positive, but we have the information
needed to do this more safely, so we change this to silence the false
positive not just in coverity, but potentially other static analysis
tools too. We switch to `strncmp()`.
* There was a false positive in tests/zfs-tests/cmd/dir_rd_update.c. We
suppress it by switching to `snprintf()` since other static analysis
tools might complain about it too. Interestingly, there is a possible
real bug there too, since it assumes that the passed directory path
ends with '/'. We add a '/' to fix that potential bug.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13913
Richard Yao [Tue, 27 Sep 2022 23:45:51 +0000 (19:45 -0400)]
Cleanup spa_export_common()
Coverity complains about a possible NULL pointer dereference. This is
impossible, but it suspects it because we do a NULL check against
`spa->spa_root_vdev`. This NULL check was never necessary and makes the
code harder to understand, so we drop it.
In particular, we dereference `spa->spa_root_vdev` when `new_state !=
POOL_STATE_UNINITIALIZED && !hardforce`. The first is only true when
spa_reset is called, which only occurs under fault injection. The
second is true unless `zpool export -F $POOLNAME` is used. Therefore,
we effectively *always* dereference the pointer. In the cases where we
do not, there is no reason to think it is unsafe. Therefore this change
is safe to make.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13905
It should be noted that exploiting this requires the `SYS_CONFIG`
privilege, and anyone with that privilege likely has other opportunities
to do exploits, so it is unlikely that bad actors could exploit this
unless system administrators are executing untrusted ZFS Channel
Programs.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13949
Richard Yao [Tue, 27 Sep 2022 23:42:41 +0000 (19:42 -0400)]
Cleanup: Specify unsignedness on things that should not be signed
In #13871, zfs_vdev_aggregation_limit_non_rotating and
zfs_vdev_aggregation_limit being signed was pointed out as a possible
reason not to eliminate an unnecessary MAX(unsigned, 0) since the
unsigned value was assigned from them.
There is no reason for these module parameters to be signed and upon
inspection, it was found that there are a number of other module
parameters that are signed, but should not be, so we make them unsigned.
Making them unsigned made it clear that some other variables in the code
should also be unsigned, so we also make those unsigned. This prevents
users from setting negative values that could potentially cause bad
behaviors. It also makes the code slightly easier to understand.
Mostly module parameters that deal with timeouts, limits, bitshifts and
percentages are made unsigned by this. Any that are boolean are left
signed, since whether booleans should be considered signed or unsigned
does not matter.
Making zfs_arc_lotsfree_percent unsigned caused a
`zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was
removed. Removing the check was also necessary to prevent a compiler
error from -Werror=type-limits.
Several end of line comments had to be moved to their own lines because
replacing int with uint_t caused us to exceed the 80 character limit
enforced by cstyle.pl.
The following were kept signed because they are passed to
taskq_create(), which expects signed values and modifying the
OpenSolaris/Illumos DDI is out of scope of this patch:
Also, negative values in those parameters was found to be harmless.
The following were left signed because either negative values make
sense, or more analysis was needed to determine whether negative values
should be disallowed:
zfs_multihost_history was made static to be consistent with other
parameters.
A number of module parameters were marked as signed, but in reality
referenced unsigned variables. upgrade_errlog_limit is one of the
numerous examples. In the case of zfs_vdev_async_read_max_active, it was
already uint32_t, but zdb had an extern int declaration for it.
Interestingly, the documentation in zfs.4 was right for
upgrade_errlog_limit despite the module parameter being wrongly marked,
while the documentation for zfs_vdev_async_read_max_active (and friends)
was wrong. It was also wrong for zstd_abort_size, which was unsigned,
but was documented as signed.
Also, the documentation in zfs.4 incorrectly described the following
parameters as ulong when they were int:
They are now uint_t as of this patch and thus the man page has been
updated to describe them as uint.
dbuf_state_index was left alone since it does nothing and perhaps should
be removed in another patch.
If any module parameters were missed, they were not found by `grep -r
'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep missed,
but only because they were in files that had hits.
This patch intentionally did not attempt to address whether some of
these module parameters should be elevated to 64-bit parameters, because
the length of a long on 32-bit is 32-bit.
Lastly, it was pointed out during review that uint_t is a better match
for these variables than uint32_t because FreeBSD kernel parameter
definitions are designed for uint_t, whose bit width can change in
future memory models. As a result, we change the existing parameters
that are uint32_t to use uint_t.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Neal Gompa <ngompa@datto.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13875
Richard Yao [Tue, 27 Sep 2022 23:35:29 +0000 (19:35 -0400)]
Cleanup: Switch to strlcpy from strncpy
Coverity found a bug in `zfs_secpolicy_create_clone()` where it is
possible for us to pass an unterminated string when `zfs_get_parent()`
returns an error. Upon inspection, it is clear that using `strlcpy()`
would have avoided this issue.
Looking at the codebase, there are a number of other uses of `strncpy()`
that are unsafe and even when it is used safely, switching to
`strlcpy()` would make the code more readable. Therefore, we switch all
instances where we use `strncpy()` to use `strlcpy()`.
Unfortunately, we do not portably have access to `strlcpy()` in
tests/zfs-tests/cmd/zfs_diff-socket.c because it does not link to
libspl. Modifying the appropriate Makefile.am to try to link to it
resulted in an error from the naming choice used in the file. Trying to
disable the check on the file did not work on FreeBSD because Clang
ignores `#undef` when a definition is provided by `-Dstrncpy(...)=...`.
We workaround that by explictly including the C file from libspl into
the test. This makes things build correctly everywhere.
We add a deprecation warning to `config/Rules.am` and suppress it on the
remaining `strncpy()` usage. `strlcpy()` is not portably avaliable in
tests/zfs-tests/cmd/zfs_diff-socket.c, so we use `snprintf()` there as a
substitute.
This patch does not tackle the related problem of `strcpy()`, which is
even less safe. Thankfully, a quick inspection found that it is used far
more correctly than strncpy() was used. A quick inspection did not find
any problems with `strcpy()` usage outside of zhack, but it should be
said that I only checked around 90% of them.
Lastly, some of the fields in kstat_t varied in size by 1 depending on
whether they were in userspace or in the kernel. The origin of this
discrepancy appears to be 04a479f7066ccdaa23a6546955303b172f4a6909 where
it was made for no apparent reason. It conflicts with the comment on
KSTAT_STRLEN, so we shrink the kernel field sizes to match the userspace
field sizes.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13876
Enforce "-F" flag on resuming recv of full/newfs on existing dataset
When receiving full/newfs on existing dataset, then it should be done
with "-F" flag. Its enforced for initial receive in checks done in
zfs_receive_one function of libzfs. Similarly, on resuming full/newfs
recv on existing dataset, it should be done with "-F" flag.
When dataset doesn't exist, then full/new recv is done on newly created
dataset and it's marked INCONSISTENT. But when receiving on existing
dataset, recv is first done on %recv and its marked INCONSISTENT.
Existing dataset is not marked INCONSISTENT. Resume of full/newfs
receive with dataset not INCONSISTENT indicates that its resuming newfs
on existing dataset. So, enforce "-F" flag in this case.
Also return an error from dmu_recv_resume_begin_check() in zfs kernel,
when its resuming full/newfs recv without force.
Richard Yao [Tue, 27 Sep 2022 19:36:58 +0000 (15:36 -0400)]
Fix bad free in skein code
Clang's static analyzer found a bad free caused by skein_mac_atomic().
It will allocate a context on the stack and then pass it to
skein_final(), which attempts to free it. Upon inspection,
skein_digest_atomic() also has the same problem.
These functions were created to match the OpenSolaris ICP API, so I was
curious how we avoided this in other providers and looked at the SHA2
code. It appears that SHA2 has a SHA2Final() helper function that is
called by the exported sha2_mac_final()/sha2_digest_final() as well as
the sha2_mac_atomic() and sha2_digest_atomic() functions. The real work
is done in SHA2Final() while some checks and the free are done in
sha2_mac_final()/sha2_digest_final().
We fix the use after free in the skein code by taking inspiration from
the SHA2 code. We introduce a skein_final_nofree() that does most of the
work, and make skein_final() into a function that calls it and then
frees the memory.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13954
Richard Yao [Tue, 27 Sep 2022 00:18:05 +0000 (20:18 -0400)]
Fix userspace memory leaks found by Clang Static Analzyer
Recently, I have been making a push to fix things that coverity found.
However, I was curious what Clang's static analyzer reported, so I ran
it and found things that coverity had missed.
* contrib/pam_zfs_key/pam_zfs_key.c: If prop_mountpoint is passed more
than once, we leak memory.
* module/zfs/zcp_get.c: We leak memory on temporary properties in
userspace.
* tests/zfs-tests/cmd/draid.c: On error from vdev_draid_rand(), we leak
memory if best_map had been allocated by a prior iteration.
* tests/zfs-tests/cmd/mkfile.c: Memory used by the loop is not freed
before program termination.
Arguably, these are all minor issues, but if we ignore them, then they
could obscure serious bugs, so we fix them.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13955