The 5.15 kernel moved the backing_dev_info structure out of
the request queue structure which causes a build failure.
Rather than look in the new location for the BDI we instead
detect this upstream refactoring by the existance of either
the blk_queue_update_readahead() or disk_update_readahead()
functions. In either case, there's no longer any reason to
manually set the ra_pages value since it will be overridden
with a reasonable default (2x the block size) when
blk_queue_io_opt() is called.
Therefore, we update the compatibility wrapper to do nothing
for 5.9 and newer kernels. While it's tempting to do the
same for older kernels we want to keep the compatibility
code to preserve the existing behavior. Removing it would
effectively increase the default readahead to 128k.
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #12532
George Melikov [Tue, 31 Aug 2021 20:56:45 +0000 (23:56 +0300)]
CI: don't install abigail-tools
We use docker image instead.
Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #12529
George Melikov [Tue, 31 Aug 2021 19:26:30 +0000 (22:26 +0300)]
Update ABI files via new libabigail version
Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #12529
George Melikov [Tue, 31 Aug 2021 18:52:05 +0000 (21:52 +0300)]
Libabigail: make .abi files more consistent
Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #12529
George Melikov [Tue, 31 Aug 2021 17:53:12 +0000 (20:53 +0300)]
CI: use fresh libabigail via docker image
Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #12529
George Melikov [Tue, 31 Aug 2021 17:49:29 +0000 (20:49 +0300)]
Check for libabigail version
We need to use 1.8.0+ version, older versions
may segfault and give inconsistent results.
Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #12529
Ryan Moeller [Wed, 1 Sep 2021 20:20:00 +0000 (16:20 -0400)]
ZTS: Remove exceptions for flaky zhack on FreeBSD
Issue #11854 has been resolved, so we can remove the exceptions for it.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #12527
Ryan Moeller [Mon, 30 Aug 2021 23:01:09 +0000 (19:01 -0400)]
FreeBSD: Don't remove SA xattr if not SA znode
We attempt to remove an existing SA xattr when setting a dir xattr, but
this only makes sense if the znode has been upgraded to the SA format.
Otherwise, we will hit an assert in zfs_sa_get_xattr.
Make sure this is an SA znode before attempting to remove the SA xattr.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #12514
Rich Ercolani [Mon, 30 Aug 2021 21:13:46 +0000 (17:13 -0400)]
Fix cross-endian interoperability of zstd
It turns out that layouts of union bitfields are a pain, and the
current code results in an inconsistent layout between BE and LE
systems, leading to zstd-active datasets on one erroring out on
the other.
Switch everyone over to the LE layout, and add compatibility code
to read both.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #12008
Closes #12022
Ka Ho Ng [Sun, 22 Aug 2021 15:22:07 +0000 (23:22 +0800)]
ZTS: Enable punch-hole tests on FreeBSD
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Ka Ho Ng <khng@FreeBSD.org> Sponsored-by: The FreeBSD Foundation
Closes #12458
Ka Ho Ng [Wed, 4 Aug 2021 15:57:48 +0000 (23:57 +0800)]
FreeBSD: Implement hole-punching support
This adds supports for hole-punching facilities in the FreeBSD kernel
starting from __FreeBSD_version 1400032.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Ka Ho Ng <khng@FreeBSD.org> Sponsored-by: The FreeBSD Foundation
Closes #12458
Brian Behlendorf [Sun, 29 Aug 2021 15:56:58 +0000 (08:56 -0700)]
ZTS: Waiting for zvols to be available
The ZTS block_device_wait helper function should use -e when waiting
for a file to appear since it will be either a block special device
or a symlink. This didn't cause any failures but when a device path
was specified the function would wait longer than needed.
Additionally update the most flakey test cases to pass the file path
to block_device_wait to try and improve the test reliability. The
udev behavior on Fedora in particular can result in frequent false
positives.
Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #12515
Ryan Moeller [Fri, 27 Aug 2021 17:02:54 +0000 (13:02 -0400)]
Correct checking bdev_check_media_change message
We're not looking for bdev_disk_changed.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #12492
Tony Hutter [Fri, 27 Aug 2021 16:26:49 +0000 (09:26 -0700)]
Make 'zpool labelclear -f' work on offlined disks
This patch allows you to clear the label on offlined disks in an active
pool with `-f`. Previously, labelclear wouldn't let you do that.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #12511
Trevor Bautista [Thu, 26 Aug 2021 18:26:49 +0000 (11:26 -0700)]
Extend zpool-iostat to account for ZIO_PRIORITY_REBUILD (#12319)
Previously, zpool-iostat did not display any data regarding rebuild I/Os
in either the latency/size histograms (-w/-l/-r) or the queue data (-q).
This fix essentially utilizes the existing infrastructure for tracking
rebuild queue data and displays this data in the proper places within
zpool-iostat's output.
Signed-off-by: Trevor Bautista <tbautista@newmexicoconsortium.org> Signed-off-by: Trevor Bautista <tbautista@lanl.gov> Co-authored-by: Trevor Bautista <tbautista@newmexicoconsortium.org> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Anton Gubarkov [Wed, 25 Aug 2021 20:01:26 +0000 (23:01 +0300)]
vdev_id: Return an error if config file is not found (#12508)
Signed-off-by: Anton Gubarkov <anton.gubarkov@gmail.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Mark Johnston [Mon, 23 Aug 2021 18:10:17 +0000 (14:10 -0400)]
Initialize parity blocks before RAID-Z reconstruction benchmarking
benchmark_raidz() allocates a row to benchmark parity calculation and
reconstruction. In the latter case, the parity blocks are left
uninitialized, leading to reports from KMSAN.
Initialize parity blocks to 0xAA as we do for the data earlier in the
function. This does not affect the selected RAID-Z implementation on
any of several systems tested.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #12473
Ryan Moeller [Mon, 26 Jul 2021 20:08:52 +0000 (16:08 -0400)]
ZTS: Add tests for creation time
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #12432
Richard Yao [Sun, 17 Mar 2019 00:43:13 +0000 (20:43 -0400)]
Linux 4.11 compat: statx support
Linux 4.11 added a new statx system call that allows us to expose crtime
as btime. We do this by caching crtime in the znode to match how atime,
ctime and mtime are cached in the inode.
statx also introduced a new way of reporting whether the immutable,
append and nodump bits have been set. It adds support for reporting
compression and encryption, but the semantics on other filesystems is
not just to report compression/encryption, but to allow it to be turned
on/off at the file level. We do not support that.
We could implement semantics where we refuse to allow user modification
of the bit, but we would need to do a dnode_hold() in zfs_znode_alloc()
to find out encryption/compression information. That would introduce
locking that will have a minor (although unmeasured) performance cost.
It also would be inferior to zdb, which reports far more detailed
information. We therefore omit reporting of encryption/compression
through statx in favor of recommending that users interested in such
information use zdb.
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #8507
Gordon Bergling [Tue, 17 Aug 2021 17:01:07 +0000 (19:01 +0200)]
zfs.4: Fix typo s/compatiblity/compatibility/
Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Gordon Bergling <gbergling@googlemail.com>
Closes #12464
Alexander Motin [Tue, 17 Aug 2021 16:15:54 +0000 (12:15 -0400)]
Remove b_pabd/b_rabd allocation from arc_hdr_alloc()
When a header is allocated for full overwrite it is a waste of time
to allocate b_pabd/b_rabd for it, since arc_write() will free them
without ever being touched. If it is a read or a partial overwrite
then arc_read() and arc_hdr_decrypt() allocate them explicitly.
Reduced memory allocation in user threads also reduces ARC eviction
throttling there, proportionally increasing it in ZIO threads, that
is not good. To minimize or even avoid it introduce ARC allocation
reserve, allowing certain arc_get_data_abd() callers to allocate a
bit longer in situations where user threads will already throttle.
Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #12398
Alexander Motin [Tue, 17 Aug 2021 15:59:46 +0000 (11:59 -0400)]
Increase default volblocksize from 8KB to 16KB
Many things has changed since previous default was set many years ago.
Nowadays 8KB does not allow adequate compression or even decent space
efficiency on many of pools due to 4KB disk physical block rounding,
especially on RAIDZ and DRAID. It effectively limits write throughput
to only 2-3GB/s (250-350K blocks/s) due to sync thread, allocation,
vdev queue and other block rate bottlenecks. It keeps L2ARC expensive
despite many optimizations and dedup just unrealistic.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #12406
Alexander Motin [Tue, 17 Aug 2021 15:55:34 +0000 (11:55 -0400)]
Optimize arc_l2c_only lists assertions
It is very expensive and not informative to call multilist_is_empty()
for each arc_change_state() on debug builds to check for impossible.
Instead implement special index function for arc_l2c_only->arcs_list,
multilists, panicking on any attempt to use it.
Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #12421
Alexander Motin [Tue, 17 Aug 2021 15:50:31 +0000 (11:50 -0400)]
Fix/improve dbuf hits accounting
Instead of clearing stats inside arc_buf_alloc_impl() do it inside
arc_hdr_alloc() and arc_release(). It fixes statistics being wiped
every time a new dbuf is filled from the ARC.
Remove b_l1hdr.b_l2_hits. L2ARC hits are accounted at b_l2hdr.b_hits.
Since the hits are accounted under hash lock, replace atomics with
simple increments.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #12422
Alexander Motin [Tue, 17 Aug 2021 15:47:00 +0000 (11:47 -0400)]
Avoid vq_lock drop in vdev_queue_aggregate()
vq_lock is already too congested for two more operations per I/O.
Instead of dropping and reacquiring it inside vdev_queue_aggregate()
delegate the zio_vdev_io_bypass() and zio_execute() calls for parent
I/Os to callers, that drop the lock any way to execute the new I/O.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #12297
Alexander Motin [Tue, 17 Aug 2021 15:44:34 +0000 (11:44 -0400)]
Use more atomics in refcounts
Use atomic_load_64() for zfs_refcount_count() to prevent torn reads
on 32-bit platforms. On 64-bit ones it should not change anything.
When built with ZFS_DEBUG but running without tracking enabled use
atomics instead of mutexes same as for builds without ZFS_DEBUG.
Since rc_tracked can't change live we can check it without lock.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #12420
Ryan Moeller [Mon, 16 Aug 2021 23:38:34 +0000 (19:38 -0400)]
ZTS: Avoid unset $tmpdir in redacted_panic
The redacted_send tests make use of a $tmpdir variable, except in
redacted_send/redacted_panic the variable is never defined.
Use $TEST_BASE_DIR instead.
Clean up the stream file after the test.
Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #12455
Allan Jude [Mon, 16 Aug 2021 15:35:19 +0000 (11:35 -0400)]
Restore FreeBSD sysctl processing for arc.min and arc.max
Before OpenZFS 2.0, trying to set the FreeBSD sysctl vfs.zfs.arc_max
to a disallowed value would return an error.
Since the switch, it instead only generates WARN_IF_TUNING_IGNORED
Keep the ability to set the sysctl's specifically to 0, even though
that is less than the minimum, because some tests depend on this.
Also lost, was the ability to set vfs.zfs.arc_max to a value less
than the default vfs.zfs.arc_min at boot time. Restore this as well.
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes #12161
Ryan Moeller [Fri, 13 Aug 2021 20:42:45 +0000 (16:42 -0400)]
zfs: add missed dependency of zfs module on zlib
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Martin Matuska <mm@FreeBSD.org> Co-authored-by: Konstantin Belousov <kib@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
External-issue: https://reviews.freebsd.org/D31207
Closes #12442
Ryan Moeller [Fri, 13 Aug 2021 20:37:46 +0000 (16:37 -0400)]
Add zfs.sh -r flag to reload modules
zfs.sh already can load and unload, so why not both?
This is convenient when developing changes to the module and you want
to rapidly make some changes, rebuild the module, reload the module,
and test the changes.
Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #12450
Ryan Moeller [Fri, 13 Aug 2021 20:13:57 +0000 (16:13 -0400)]
Fix usage of find in tests/Makefile.am
The path is not optional on FreeBSD.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #12453
Tony Nguyen [Tue, 10 Aug 2021 17:36:26 +0000 (11:36 -0600)]
Run arc_evict thread at higher priority
Run arc_evict thread at higher priority, nice=0, to give it more CPU
time which can improve performance for workload with high ARC evict
activities.
On mixed read/write and sequential read workloads, I've seen between
10-40% better performance.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Tony Nguyen <tony.nguyen@delphix.com>
Closes #12397
George Melikov [Thu, 5 Aug 2021 21:30:28 +0000 (00:30 +0300)]
Man zpool-scrub.8: describe sequential scrub
Describe sequential scrub and add examples of scrub status.
Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #12429
hedongzhang [Tue, 3 Aug 2021 17:46:33 +0000 (01:46 +0800)]
Modify checksum obtain method of QAT
CpaDcGeneratefooter function that obtain the checksum code
does not support the CPA_DC_STATELESS mode. So we get the
adler32 chencksum of the end of the zlib from dc_results.
Mark Johnston [Mon, 2 Aug 2021 19:18:24 +0000 (15:18 -0400)]
Allow disabling of unmapped I/O on FreeBSD
We have a tunable which permits one to disable the use of unmapped I/O
for the buffer cache. Respect it in ZFS as well. This is useful for
KMSAN, which cannot easily maintain shadow state for unmapped pages.
No functional change intended, as unmapped I/O is permitted by default
and there's no real reason to disable it in practice except for
debugging.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #12446
Alexander Motin [Tue, 27 Jul 2021 23:05:47 +0000 (19:05 -0400)]
Avoid small buffer copying on write
It is wrong for arc_write_ready() to use zfs_abd_scatter_enabled to
decide whether to reallocate/copy the buffer, because the answer is
OS-specific and depends on the buffer size. Instead of that use
abd_size_alloc_linear(), moved into public header.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #12425
Brian Behlendorf [Tue, 27 Jul 2021 03:22:27 +0000 (20:22 -0700)]
Fix format specifier warnings
Commit 5dbf6c5a66d did not address these format specifier warnings
since they were introduced by an unrelated commit which had not
been rebased on 5dbf6c5a66d when merged. Fix them.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #12435
Brian Behlendorf [Tue, 27 Jul 2021 00:31:00 +0000 (17:31 -0700)]
Remove overlooked __sun_attr__ based macros
The __NORETURN, __CONST, and __PURE macros in the FreeBSD platform
code were based on the __sun_attr__ macro which was removed in
commit 5dbf6c5a6. This caused a build failure because the
__NORETURN macro was still used in one place in kernel code.
The __CONST and __PURE macros were entirely unused.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #12435
Keep check_file_generic() in shared code base, and allow special case
code in check_file() in os section. In future, macOS will have
additional checks in check_file().
Linux and FreeBSD wrappers just calls check_file_generic().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #12385
- Bail out early if we're running the perf tests and forget to
specify disks.
- Allow perf tests to run with any number of disks.
- Remove weekly vs. nightly settings
- Move variables with common values to perf.shlib
- Use zinject to clear the ARC over export/import
- Fix dbuf cache size calculation
When the meaning of `dbuf_cache_max_bytes` changed, the performance
test that covers the dbuf cache started to fail. The test would try to
write files for the test using the max possible size of the cache,
inevitably filling the pool and failing. This change uses
`dbuf_cache_shift` to correctly calculate the dbuf cache size.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: John Kennedy <john.kennedy@delphix.com>
Closes #12408
Matthew Ahrens [Mon, 26 Jul 2021 19:51:39 +0000 (12:51 -0700)]
Read past end of argv array in zpool_do_import()
`zpool_do_import()` passes `argv[0]`, (optionally) `argv[1]`, and
`pool_specified` to `import_pools()`. If `pool_specified==FALSE`, the
`argv[]` arguments are not used. However, these values may be off the
end of the `argv[]` array, so loading them could dereference unmapped
memory. This error is reported by the asan build:
```
=================================================================
==6003==ERROR: AddressSanitizer: heap-buffer-overflow
READ of size 8 at 0x6030000004a8 thread T0
#0 0x562a078b50eb in zpool_do_import zpool_main.c:3796
#1 0x562a078858c5 in main zpool_main.c:10709
#2 0x7f5115231bf6 in __libc_start_main
#3 0x562a07885eb9 in _start
0x6030000004a8 is located 0 bytes to the right of 24-byte region
allocated by thread T0 here:
#0 0x7f5116ac6b40 in __interceptor_malloc
#1 0x562a07885770 in main zpool_main.c:10699
#2 0x7f5115231bf6 in __libc_start_main
```
This commit passes NULL for these arguments if they are off the end
of the `argv[]` array.
Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #12339
George Amanakis [Mon, 26 Jul 2021 19:30:24 +0000 (21:30 +0200)]
Fixes in persistent L2ARC
In l2arc_add_vdev() first decide whether the device is eligible for
L2ARC rebuild or whole device trim and then add it to the list of cache
devices. Otherwise l2arc_feed_thread() might already start writing on
the device invalidating previous content as l2ad_hand = l2ad_start.
However l2arc_rebuild_vdev() needs the device present in the cache
device list to figure out its l2arc_dev_t. Fix this by moving most of
l2arc_rebuild_vdev() in a new function l2arc_rebuild_dev() which does
not need to search in the cache device list.
In contrast to l2arc_add_vdev() we do not have to worry about
l2arc_feed_thread() invalidating previous content when onlining a
cache device. The device parameters (l2ad*) are not cleared when
offlining the device and writing new buffers will not invalidate
all previous content. In worst case only buffers that have not had
their log block written to the device will be lost.
Retire persist_l2arc_00{4,5,8} tests since they cover code already
covered by the remaining ones. Test persist_l2arc_006 is renamed to
persist_l2arc_004 and persist_l2arc_007 is renamed to persist_l2arc_005.
Fix a typo in persist_l2arc_004, and remove an assertion that is not
always true from l2arc_arcstats_pos. Also update an assertion in
persist_l2arc_005 and explain why in a comment.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #12365
Mark Johnston [Fri, 16 Jul 2021 14:12:47 +0000 (10:12 -0400)]
Initialize dn_next_type[] in the dnode constructor
It seems nothing ensures that this array is zeroed when a dnode is
freshly allocated, so in principle it retains the values from the
previous allocation. In practice it seems to be the case that the
fields should end up zeroed, but we can zero the field anyway for
consistency.
This was found using KMSAN.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #12383
Mark Johnston [Fri, 16 Jul 2021 14:05:28 +0000 (10:05 -0400)]
Zero pad bytes following TX_WRITE log data
When logging a TX_WRITE record in the case where file data has to be
copied from the DMU, we pad the log record size to a multiple of 8
bytes. In this case, any padding bytes should be zeroed, otherwise the
contents of uninitialized memory are written to the ZIL.
This was found using KMSAN.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #12383
Mark Johnston [Fri, 16 Jul 2021 13:34:54 +0000 (09:34 -0400)]
Zero pad bytes when allocating a ZIL record
When allocating a record, we round up the allocation size to a multiple
of 8. In this case, any padding bytes should be zeroed, otherwise the
contents of uninitialized memory are written to the ZIL.
This was found using KMSAN.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #12383
Mark Johnston [Fri, 16 Jul 2021 13:27:11 +0000 (09:27 -0400)]
Initialize all fields in zfs_log_xvattr()
When logging TX_SETATTR, we could otherwise fail to initialize part of
the corresponding ZIL record depending on which fields are present in
the xvattr. Initialize the creation time and the AV scan timestamp to
zero so that uninitialized bytes are not written to the ZIL.
This was found using KMSAN.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #12383
Mark Johnston [Fri, 16 Jul 2021 13:19:59 +0000 (09:19 -0400)]
Initialize "autoreplace" in spa_ld_get_props()
spa_prop_find() may fail to find the specified property, in which case
it suppresses ENOENT from zap_lookup(). In this case, the return value
is left uninitialized, so spa_autoreplace was being initialized using an
uninitialized stack variable.
This was found using KMSAN. It appears to be a regression from commit 9eb7b46ed0, which removed the initialization of "autoreplace" from the
definition.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #12383
Coleman Kane [Mon, 26 Jul 2021 17:55:55 +0000 (13:55 -0400)]
Linux 5.14 compat: explicity assign set_page_dirty
Kernel 5.14 introduced a change where set_page_dirty of
struct address_space_operations is no longer implicitly set to
__set_page_dirty_buffers(), which ended up resulting in a NULL
pointer deref in the kernel when it is attempted to be called.
This change sets .set_page_dirty in the structure to
__set_page_dirty_nobuffers(), which was introduced with the
related patch set. The breaking change was introduce in commit 0af573780b0b13fceb7fabd49dc1b073cee9a507 to torvalds/linux.git.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #12427
Brian Behlendorf [Fri, 23 Jul 2021 22:28:03 +0000 (15:28 -0700)]
Linux 5.14 compat: blk_alloc_disk()
In Linux 5.14, blk_alloc_queue is no longer exported, and its usage
has been superseded by blk_alloc_disk, which returns a gendisk struct
from which we can still retrieve the struct request_queue* that is
needed in the one place where it is used. This also replaces the call
to alloc_disk(minors), and minors is now set via struct member
assignment.
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Coleman Kane <ckane@colemankane.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #12362
Closes #12409
Ryan Moeller [Thu, 22 Jul 2021 21:29:27 +0000 (17:29 -0400)]
zloop: Add a max iterations option, use default run/pass times
It is useful to have control over the number of iterations of zloop so
we can easily produce "x core dumps found *in y iterations*" metrics.
Using random values for run/pass times doesn't improve coverage in a
meaningful way.
Randomizing run time could be seen as a compromise between running a
greater variety of shorter tests versus a smaller variety of longer
tests within a fixed time span. However, it is not desirable when
running a fixed number of iterations.
Pass time already incorporates randomness within ztest.
Either parameter can be passed to ztest explicitly if the defaults are
not satisfactory.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #12411
Alexander Motin [Thu, 22 Jul 2021 16:22:14 +0000 (12:22 -0400)]
FreeBSD: Ignore make_dev_s() errors
Since errors returned by zvol_create_minor_impl() are ignored by the
common code, it is more convenient to ignore make_dev_s() errors there.
It allows, for example, to get device created for the zvol after later
rename instead of having it further stuck in half-created state.
zvol_rename_minor() already ignores those errors.
While there, switch from MAXPHYS to maxphys in FreeBSD 13+.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #12375
Possibly required in the past, but is currently fills no purpose.
Ordinarily such tiny cleanup is not generally worth it, however
on the macOS port, in a future commit, we do unspeakable things to the
"fd" for send/recv, and it would be easier to only have to deal with
one "fd" instead of two.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #12404
Alexander Motin [Wed, 21 Jul 2021 12:40:36 +0000 (08:40 -0400)]
Optimize allocation throttling
Remove mc_lock use from metaslab_class_throttle_*(). The math there
is based on refcounts and so atomic, so the only race possible there
is between zfs_refcount_count() and zfs_refcount_add(). But in most
cases metaslab_class_throttle_reserve() is called with the allocator
lock held, which covers the race. In cases where the lock is not
held, GANG_ALLOCATION() or METASLAB_MUST_RESERVE are set, and so we
do not use zfs_refcount_count(). And even if we assume some other
non-existing scenario, the worst that may happen from this race is
few more I/Os get to allocation earlier, that is not a problem.
Move locks and data of different allocators into different cache
lines to avoid false sharing. Group spa_alloc_* arrays together
into single array of aligned struct spa_alloc spa_allocs. Align
struct metaslab_class_allocator.
Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #12314
George Melikov [Tue, 20 Jul 2021 22:21:00 +0000 (01:21 +0300)]
CI: generate ABI files if changed
So commit author can just download them as
artifacts and commit.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #12379
Kevin Jin [Tue, 20 Jul 2021 15:40:24 +0000 (11:40 -0400)]
Add Module Parameter Regarding Log Size Limit
* Add Module Parameters Regarding Log Size Limit
zfs_wrlog_data_max
The upper limit of TX_WRITE log data. Once it is reached,
write operation is blocked, until log data is cleared out
after txg sync. It only counts TX_WRITE log with WR_COPIED
or WR_NEED_COPY.
Alexander Motin [Tue, 20 Jul 2021 14:13:21 +0000 (10:13 -0400)]
Minor ARC optimizations
Remove unneeded global, practically constant, state pointer variables
(arc_anon, arc_mru, etc.), replacing them with macros of real state
variables addresses (&ARC_anon, &ARC_mru, etc.).
Change ARC_EVICT_ALL from -1ULL to UINT64_MAX, not requiring special
handling in inner loop of ARC reclamation. Respectively change bytes
argument of arc_evict_state() from int64_t to uint64_t.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #12348
Alexander [Tue, 20 Jul 2021 14:03:33 +0000 (16:03 +0200)]
A few fixes of callback typecasting (for the upcoming ClangCFI)
* zio: avoid callback typecasting
* zil: avoid zil_itxg_clean() callback typecasting
* zpl: decouple zpl_readpage() into two separate callbacks
* nvpair: explicitly declare callbacks for xdr_array()
* linux/zfs_nvops: don't use external iput() as a callback
* zcp_synctask: don't use fnvlist_free() as a callback
* zvol: don't use ops->zv_free() as a callback for taskq_dispatch()
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Closes #12260
Ryan Moeller [Mon, 19 Jul 2021 16:52:50 +0000 (12:52 -0400)]
Use SET_ERROR for more errors in FreeBSD vnops
We should use SET_ERROR when we first get an error.
Add it in the FreeBSD xattr implementations where missing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #12356
Ryan Moeller [Mon, 19 Jul 2021 16:02:35 +0000 (12:02 -0400)]
Remove unused fields from zvol_task_t
We don't use or need the pool name or value source in the zvol tasks.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #12361
Alexander Motin [Mon, 19 Jul 2021 15:56:58 +0000 (11:56 -0400)]
FreeBSD: Switch from MAXPHYS to maxphys on FreeBSD 13+
Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #12378
Kevin Bowling [Fri, 16 Jul 2021 20:28:55 +0000 (13:28 -0700)]
Detect HAVE_LARGE_STACKS at compile time
Move HAVE_LARGE_STACKS definitions to header and set when appropriate.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Kevin Bowling <kbowling@FreeBSD.org>
Closes #12350
George Melikov [Fri, 16 Jul 2021 20:04:00 +0000 (23:04 +0300)]
zpool_influxdb: fix -Werror=stringop-truncation
Use strlcpy instead of problematic strncpy
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #12344
Alexander Motin [Fri, 16 Jul 2021 19:39:24 +0000 (15:39 -0400)]
Introduce dsl_dir_diduse_transfer_space()
Most of dsl_dir_diduse_space() and dsl_dir_transfer_space() CPU time
is a dd_lock overhead and time spent in dmu_buf_will_dirty(). Calling
them one after another is a waste of time and even more contention.
Doing that twice for each rewritten block within dbuf_write_done()
via dsl_dataset_block_kill() and dsl_dataset_block_born() created one
of the biggest CPU overheads in case of small blocks rewrite.
dsl_dir_diduse_transfer_space() combines functionality of these two
functions for cases where it is needed, but without double overhead,
practically for the cost of dsl_dir_diduse_space() or even cheaper.
While there, optimize dsl_dir_phys() calls in dsl_dir_diduse_space()
and dsl_dir_transfer_space(). It seems Clang detects some aliasing
there, repeating dd->dd_dbuf->db_data dereference multiple times,
increasing dd_lock scope and contention.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Author: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #12300
The same change has already been done for domount(). On macOS platform
we need to have access to zhp to handle devdisks and snapshots.
Also, symmetry is pleasing.
In addition, the code in zpool_disable_datasets which sorts the
mountpoints did not sort the related handle, which meant that the
mountpoint, and the handle that it is paired with, was lost.
You'd get a random handle with the mountpoint.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #12296
Alexander Motin [Tue, 13 Jul 2021 15:41:59 +0000 (11:41 -0400)]
Fix ARC ghost states eviction accounting
arc_evict_hdr() returns number of evicted bytes in scope of specific
state. For ghost states it does not mean the amount of really freed
memory, but the logical buffer size. It is correct for the eviction
process, but not for waking up threads waiting for ARC size reduction,
as added in "Revise ARC shrinker algorithm" commit, causing premature
wakeups while ARC is still overflowed, allowing even bigger overflow,
plus processing overhead when next allocation will also get blocked,
probably also for too short time.
To fix that make arc_evict_hdr() also return the amount of really
freed memory, which for the ghost states is only the header, and use
it to update arc_evict_count instead. Originally I was thinking to
not return it at all, since arc_get_data_impl() does not account for
the headers, but decided that some slow allocation progress is better
than long waits, reaching on my tests up to 100ms.
To reduce negative latency effects of long time periods when reclaim
thread can free little real memory, start reclamation process earlier,
before we actually reached the overflow threshold, when we have to
throttle new allocations. We can also do it without taking global
arc_evict_lock, reducing the contention.
Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #12279
Brian Behlendorf [Mon, 12 Jul 2021 20:05:50 +0000 (13:05 -0700)]
Update bug report template
- Remove the "SPL Version" line, the repositories have been merged
since the 0.8 release and we no longer need to ask about this.
- Simply ask for the kernel version / patch level and add a hint
about how to get this information on Linux and FreeBSD.
- Remove "Status: Triage Needed" from the template, in practice
we really haven't been using this label so let's step setting it.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes: #12340
George Wilson [Sun, 11 Jul 2021 01:00:37 +0000 (20:00 -0500)]
file reference counts can get corrupted
Callers of zfs_file_get and zfs_file_put can corrupt the reference
counts for the file structure resulting in a panic or a soft lockup.
When zfs send/recv runs, it will add a reference count to the
open file, and begin to send or recv the stream. If the file descriptor
is closed, then when dmu_recv_stream() or dmu_send() return we will
call zfs_file_put to remove the reference we placed on the file
structure. Unfortunately, because zfs_file_put() uses the file
descriptor to lookup the file structure, it may end up finding that
the file descriptor table no longer contains the file struct, thus
leaking the file structure. Or it might end up finding a file
descriptor for a different file and blindly updating its reference
counts. Other failure modes probably exists.
This change reworks the zfs_file_[get|put] interface to not rely
on the file descriptor but instead pass the zfs_file_t pointer around.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Co-authored-by: Allan Jude <allan@klarasystems.com> Signed-off-by: George Wilson <gwilson@delphix.com>
External-issue: DLPX-76119
Closes #12299
Could have gone either way with this one, either adding it to
macOS/Windows SPL, or returning it to "classic" usage with strrchr().
Since the new special way isn't really used, and only used once,
we have this commit.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #12312
Alexander Motin [Wed, 7 Jul 2021 23:39:00 +0000 (19:39 -0400)]
FreeBSD: Use unmapped I/O for scattered/gang ABD buffers
Many FreeBSD disk drivers support "unmapped" I/O mode, when data
buffer represented not with a virtually contiguous KVA-mapped address
range, but with a list of physical memory pages. Originally it was
designed to do I/O from buffers without KVA mapping (unmapped). But
moving virtual addresses out of equation allows us to operate even
non-contiguous data buffers with one condition: all buffer discon-
tinuities must be aligned to memory page borders.
Doing I/O to capable GEOM device this patch traverses through non-
linear ABD buffers, validating the chunks borders. If the condition
is met, it supplies GEOM with the list of original physical memory
pages instead of copying the data into temporary contiguous buffer.
On capable hardware on pools with ashift=12 and default ABD chunk of
4KB it should handle all the I/O without additional memory copying.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #12320
Alexander Motin [Wed, 7 Jul 2021 00:39:23 +0000 (20:39 -0400)]
FreeBSD: Hardcode abd_chunk_size to PAGE_SIZE
It makes no sense to set it below PAGE_SIZE, since it increases all
overheads and makes returning memory to OS problematic. It makes no
sense to set it above PAGE_SIZE, since such allocations and especially
frees are too expensive and cause KVA fragmentation to benefit from
fewer chunks. After that it makes no sense to keep more complicated
math here.
What may have sense though is just a tunable border between linear and
scatter ABDs, previously also controlled by this tunable. Retain that
functionality by taking abd_scatter_min_size tunable from Linux, just
with different default value.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #12328
Alexander Motin [Tue, 6 Jul 2021 21:38:00 +0000 (17:38 -0400)]
Move gethrtime() calls out of vdev queue lock
This dramatically reduces the lock contention on systems with slower
(non-TSC) timecounters. With TSC the difference is minimal, but since
this lock is pretty congested, any improvement counts. Plus I don't
see any reason to do it under the lock other than the latency of the
lock itself, which this change actually reduces.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #12281
Udev rules: remove zvol compat symlinks (without the leading zvol/)
This is a potentially arguable change, because it removes some
compatibility cruft that certain systems or people may have come to rely
on (either a very long time ago, or unwisely in recent times).
On the other hand, it's been literally over a decade since OpenZFS
switched to the strategy of using opaque numbered /dev/zd* device nodes,
with the canonical zvol access path being a directory tree of symlinks
created by udev rules inside /dev/zvol/*. (See #102.) Even at the time,
the /dev/* scheme was labeled as being for "compatibility".
This commit removes the second tree of symlinks located directly at
/dev/*, under the assumption that anybody with any sense has been using
the intended /dev/zvol/* path for a very very long time now.
(The more I think about this, the more I anticipate that some large
fraction of people will have been blissfully unaware that the intention
has been for them to use the /dev/zvol/* tree all along, and they will
have come to rely upon the /dev/* tree simply because it's been there
this whole time despite being a compat thing.)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Neal Gompa <ngompa@datto.com> Signed-off-by: Justin Gottula <justin@jgottula.com>
Closes #12303
Justin Gottula [Wed, 30 Jun 2021 04:29:09 +0000 (21:29 -0700)]
Use substantially more robust program exit status logic in zvol_id
Currently, there are several places in zvol_id where the program logic
returns particular errno values, or even particular ioctl return values,
as the program exit status, rather than a straightforward system of
explicit zero on success and explicit nonzero value(s) on failure.
This is problematic for multiple reasons. One particularly interesting
problem that can arise, is that if any of these values happens to have
all 8 least significant bits unset (i.e., it is a positive or negative
multiple of 256), then although the C program sees a nonzero int value
(presumed to be a failure exit status), the actual exit status as seen
by the system is only the bottom 8 bits of that integer: zero.
This can happen in practice, and I have encountered it myself. In a
particularly weird situation, the zvol_open code in the zfs kernel
module was behaving in such a manner that it caused the open() syscall
to fail and for errno to be set to a kernel-private value (ERESTARTSYS,
which happens to be defined as 512). It turns out that 512 is evenly
divisible by 256; or, in other words, its least significant 8 bits are
all-zero. So even though zvol_id believed it was returning a nonzero
(failure) exit status of 512, the system modulo'd that value by 256,
resulting in the actual exit status visible by other programs being 0!
This actually-zero (non-failure) exit status caused problems: udev
believed that the program was operating successfully, when in fact it
was attempting to indicate failure via a nonzero exit status integer.
Combined with another problem, this led to the creation of nonsense
symlinks for zvol dev nodes by udev.
Let's get rid of all this problematic logic, and simply return
EXIT_SUCCESS (0) is everything went fine, and EXIT_FAILURE (1) if
anything went wrong.
Additionally, let's clarify some of the variable names (error is similar
to errno, etc) and clean up the overall program flow a bit.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Signed-off-by: Justin Gottula <justin@jgottula.com>
Closes #12302
Justin Gottula [Wed, 30 Jun 2021 03:14:18 +0000 (20:14 -0700)]
Print zvol_id error messages to stderr rather than stdout
The zvol_id program is invoked by udev, via a PROGRAM key in the
60-zvol.rules.in rule file, to determine the "pretty" /dev/zvol/*
symlink paths paths that should be generated for each opaquely named
/dev/zd* dev node.
The udev rule uses the PROGRAM key, followed by a SYMLINK+= assignment
containing the %c substitution, to collect the program's stdout and then
"paste" it directly into the name of the symlink(s) to be created.
Unfortunately, as currently written, zvol_id outputs both its intended
output (a single string representing the symlink path that should be
created to refer to the name of the dataset whose /dev/zd* path is
given) AND its error messages (if any) to stdout.
When processing PROGRAM keys (and others, such as IMPORT{program}), udev
uses only the data written to stdout for functional purposes. Any data
written to stderr is used solely for the purposes of logging (if udev's
log_level is set to debug).
The unintended consequence of this is as follows: if zvol_id encounters
an error condition; and then udev fails to halt processing of the
current rule (either because zvol_id didn't return a nonzero exit
status, or because the PROGRAM key in the rule wasn't written properly
to result in a "non-match" condition that would stop the current rule on
a nonzero exit); then udev will create a space-delimited list of symlink
names derived directly from the words of the error message string!
I've observed this exact behavior on my own system, in a situation where
the open() syscall on /dev/zd* dev nodes was failing sporadically (for
reasons that aren't especially relevant here). Because the open() call
failed, zvol_id printed "Unable to open device file: /dev/zd736\n" to
stdout and then exited.
The udev rule finished with SYMLINK+="zvol/%c %c". Assuming a volume
name like pool/foo/bar, this would ordinarily expand to
SYMLINK+="zvol/pool/foo/bar pool/foo/bar"
and would cause symlinks to be created like this:
/dev/zvol/pool/foo/bar -> /dev/zd736
/dev/pool/foo/bar -> /dev/zd736
But because of the combination of error messages being printed to
stdout, and the udev syntax freely accepting a space-delimited sequence
of names in this context, the error message string
"Unable to open device file: /dev/zd736\n"
in reality expanded to
SYMLINK+="zvol/Unable to open device file: /dev/zd736"
which caused the following symlinks to actually be created:
/dev/zvol/Unable -> /dev/zd736
/dev/to -> /dev/zd736
/dev/open -> /dev/zd736
/dev/device -> /dev/zd736
/dev/file: -> /dev/zd736
/dev//dev/zd736 -> /dev/zd736
(And, because multiple zvols had open() syscall errors, multiple zvols
attempted to claim several of those symlink names, resulting in numerous
udev errors and timeouts and general chaos.)
This commit rectifies all this silliness by simply printing error
messages to stderr, as Dennis Ritchie originally intended.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Signed-off-by: Justin Gottula <justin@jgottula.com>
Closes #12302
Justin Gottula [Wed, 30 Jun 2021 01:54:38 +0000 (18:54 -0700)]
Udev rules: use match (==) rather than assign (=) for PROGRAM
Assignment syntax (=) can be used for the PROGRAM key. But the PROGRAM
key is really a match key, not an assign key. The internal logic used by
udev to decide whether a PROGRAM key "matched" or not (which determines
whether the remainder of the rule is evaluated) depends on whether the
operator was OP_MATCH (==) or OP_NOMATCH (!=). [1]
The man page claims that '"=", ":=", and "+=" have the same effect as
"=="' for PROGRAM keys. And, after a brief perusal, the udev source code
does seem to confirm that operators other than OP_MATCH (==) or
OP_NOMATCH (!=) are implicitly converted to OP_MATCH (==). [2] But it's
not entirely clear that this is definitely the case: anecdotal testing
seems to indicate that when OP_ASSIGN (=) is used, the program's exit
status is disregarded and the remainder of the rule is processed
regardless of whether it was, in fact, a successful exit.
The bottom line here is that, if zvol_id hits some snag and returns a
nonzero exit status, then we almost certainly do NOT want to continue on
with the rule and use whatever the stdout contents may have been to
mindlessly create /dev/zvol/* symlinks. Therefore, let's be extra-sure
and use the match (==) operator explicitly, to eliminate any possibility
that udev might do the wrong thing, and ensure that a nonzero exit
status will definitely short-circuit the rest of the rule, bypassing the
SYMLINK+= assignments.
[1]
udev,
file src/udev/udev-rules.c,
func udev_rule_apply_token_to_event,
switch case TK_M_PROGRAM if r != 0 (nonzero exit status):
return token->op == OP_NOMATCH;
switch case TK_M_PROGRAM if r == 0 (zero exit status):
return token->op == OP_MATCH;
func retval 0 => key is considered to have matched
func retval 1 => key is considered to have NOT matched
[2]
udev,
file src/udev/udev-rules.c,
func parse_token,
at func start:
bool is_match = IN_SET(op, OP_MATCH, OP_NOMATCH);
in else-if case streq(key, "PROGRAM"):
if (!is_match) op = OP_MATCH;
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Signed-off-by: Justin Gottula <justin@jgottula.com>
Closes #12302
Justin Gottula [Wed, 30 Jun 2021 01:52:33 +0000 (18:52 -0700)]
Udev rules: replace deprecated $tempnode with $devnode
The $tempnode substitution is so old that it's not even mentioned in the
man page anymore. It is still technically supported by udev, but with
plenty of "deprecated" comments surrounding it.
The preferred modern equivalent of $tempnode is $devnode (or
alternatively, %N).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Signed-off-by: Justin Gottula <justin@jgottula.com>
Closes #12302
Justin Gottula [Wed, 30 Jun 2021 01:50:13 +0000 (18:50 -0700)]
Udev rules: use non-ancient comma syntax
This file is old as dirt. It's entirely possible that commas were
optional in udev back at that time. But they're definitely supposed to
be there nowadays.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Signed-off-by: Justin Gottula <justin@jgottula.com>
Closes #12302
Alexander Motin [Thu, 1 Jul 2021 15:32:31 +0000 (11:32 -0400)]
Remove avl_size field from struct avl_tree
This field is used only by illumos mdb. On other platforms it only
increases the struct size from 32 to 40 bytes. For struct vdev_queue
including 13 instances of avl_tree_t size means active cache lines.
Keep the padding in user-space for now to not break the ABI.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #12290
Alexander Motin [Thu, 1 Jul 2021 15:30:31 +0000 (11:30 -0400)]
Compact dbuf/buf hashes and lock arrays
With default dbuf cache size of 1/32 of ARC, it makes no sense to have
hash table of the same size (or even bigger on Linux). Reduce it to
1/8 of ARC's one, still leaving some slack, assuming higher I/O rate
via dbuf cache than via ARC.
Remove padding from ARC hash locks array. The idea behind padding
is to avoid false sharing between locks. It would have sense if
there would be a limited number of very busy locks. But since we
have no limit on the number, using the same memory for more locks we
can achieve even lower lock contention with the same false sharing,
or we can use less memory for the same contention level.
Reduce number of hash locks from 8192 to 2048. The number is still
big enough to not cause contention, but reduced memory size improves
cache hit rate for mutex_tryenter() in ARC eviction thread, saving
about 1% of the thread time.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #12289