Tony Hutter [Fri, 28 Jun 2024 16:52:03 +0000 (09:52 -0700)]
Linux 6.9: Call add_disk() from workqueue to fix zfs_allow_010_pos (#16282)
The 6.9 kernel behaves differently in how it releases block devices. In
the common case it will async release the device only after the return
to userspace. This is different from the 6.8 and older kernels which
release the block devices synchronously. To get around this, call
add_disk() from a workqueue so that the kernel uses a different
codepath to release our zvols in the way we expect. This stops
zfs_allow_010_pos from hanging.
Fixes: #16089 Signed-off-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
bnovkov [Fri, 7 Jun 2024 01:11:00 +0000 (03:11 +0200)]
FreeBSD: Update use of UMA-related symbols in arc_available_memory
Recent UMA changes repurposed the use of UMA_MD_SMALL_ALLOC in a way
that breaks arc_available_memory on -CURRENT. This change
ensures that arc_available_memory uses the new symbol
while maintaining compatibility with older FreeBSD releases.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Bojan Novković <bnovkov@FreeBSD.org>
Closes #16230
Ameer Hamza [Fri, 7 Jun 2024 00:01:26 +0000 (05:01 +0500)]
zdb: fix FreeBSD build failure
This fixes FreeBSD build failure with clang-18 after 23a489a got merged.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #16252
Ameer Hamza [Mon, 3 Jun 2024 23:28:43 +0000 (04:28 +0500)]
zdb: detect cachefile automatically otherwise force import
If a pool is created with the cache file located in a non-default
path /etc/default/zpool.cache, removed, or the cachefile property
is set to none, zdb fails to show the pool unless we specify the
cache file or use the -e option. This PR automates this process.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Akash B <akash-b@hpe.com> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #16071
Rob Norris [Sun, 19 May 2024 11:49:19 +0000 (21:49 +1000)]
icp: remove redundant FreeBSD check
We don't build illumos-crypto for FreeBSD.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16209
Rob Norris [Sun, 19 May 2024 11:40:59 +0000 (21:40 +1000)]
icp: remove unused headers
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16209
Rob Norris [Sun, 19 May 2024 05:00:00 +0000 (15:00 +1000)]
icp: remove skein module
Nothing calls it through the KCF interface, so this is all unused.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16209
Rob Norris [Sun, 19 May 2024 05:00:44 +0000 (15:00 +1000)]
icp: remove unused SHA2 HMAC mechanisms
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16209
Rob Norris [Sun, 19 May 2024 03:18:42 +0000 (13:18 +1000)]
icp: reorganise SHA2 digest mechanisms
sha2_mech_type_t serves double-duty, as the list of MAC providers and
also the algo type for direct callers to SHA2Init. Until we disentangle
that, reorganise it to make the separation more clear. While we're
there, remove the digest mechs we don't use.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16209
Rob Norris [Sun, 19 May 2024 02:58:56 +0000 (12:58 +1000)]
icp: remove digest entry points
For whatever reason, we call digest mechanisms directly, not through the
KCF digest provider. So we can remove those entry points entirely.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16209
Rob Norris [Sun, 19 May 2024 02:24:35 +0000 (12:24 +1000)]
icp: remove unused KCF_ macros
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16209
Rob Norris [Sat, 18 May 2024 12:17:36 +0000 (22:17 +1000)]
icp: remove unusued incremental cipher methods
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16209
Rob Norris [Sat, 18 May 2024 11:57:36 +0000 (21:57 +1000)]
icp: brutally remove unused AES modes
Still retaining the struture, for now.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16209
Rob Norris [Sat, 18 May 2024 11:05:20 +0000 (21:05 +1000)]
icp: remove unused blowfish_ctx and des_ctx
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16209
Tony Hutter [Fri, 31 May 2024 22:11:00 +0000 (15:11 -0700)]
ZTS: Fix redacted_send failures on FreeBSD
We're seeing failures for redacted_deleted and redacted_mount
on FreeBSD 13-15:
09:58:34.74 diff: /dev/fd/3: No such file or directory
09:58:34.74 ERROR: diff /dev/fd/3 /dev/fd/4 exited 2
The test was trying to diff the file listings between two directories to
see if they are the same. The workaround is to do a string comparison
of the directory listings instead of using `diff`.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #16224
The 'zpool status' output assumes that the longest prefix is six
character long plus colon plus space, eg. 'status: ', 'action: '
or 'config: ' (so eight in total). This works well even when we have
messages that requires more than one line, as '\t' is exactly eight
characters, just like the longest prefix.
The 'zpool import' output is a bit different, as it may display the
comment pool property, then the longest prefix is 'comment: ', which is
nine characters long, not eight.
All the prefixes were given an extra space in front, but:
- 'status: ' did not get an extra space.
- Messages that require more than one line should use nine spaces of
indentation, not eight.
- The extra space in front looks redundant if there is no comment
property set on the given pool.
Fix it by adding an extra space to all prefixes, but only if the comment
property is defined. Also, when we need to continue the message in a new
line, use '\t ' for indentation.
While here, apply small corrections to a couple messages.
Before:
pool: tank
id: 7412636063178848859
state: ONLINE
status: Some supported features are not enabled on the pool.
(Note that they may be intentionally disabled if the
'compatibility' property is set.)
action: The pool can be imported using its name or numeric identif[...]
some features will not be available without an explicit 'zp[...]
comment: Example comment.
config:
bclone ONLINE
ada0 ONLINE
After:
pool: tank
id: 10180960571062436759
state: ONLINE
status: Some supported features are not enabled on the pool.
(Note that they may be intentionally disabled if the
'compatibility' property is set.)
action: The pool can be imported using its name or numeric identifi[...]
some features will not be available without an explicit 'zp[...]
config:
tank ONLINE
ada3 ONLINE
pool: dozer
id: 11028319538368222579
state: ONLINE
status: Some supported features are not enabled on the pool.
(Note that they may be intentionally disabled if the
'compatibility' property is set.)
action: The pool can be imported using its name or numeric identif[...]
some features will not be available without an explicit 'z[...]
comment: Example comment.
config:
dozer ONLINE
ada1 ONLINE
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pawel Dawidek <pawel@dawidek.net>
Closes #16128
Brian Behlendorf [Wed, 29 May 2024 17:46:41 +0000 (10:46 -0700)]
zed: Add deadman-slot_off.sh zedlet
Optionally turn off disk's enclosure slot if an I/O is hung
triggering the deadman.
It's possible for outstanding I/O to a misbehaving SCSI disk to
neither promptly complete or return an error. This can occur due
to retry and recovery actions taken by the SCSI layer, driver, or
disk. When it occurs the pool will be unresponsive even though
there may be sufficient redundancy configured to proceeded without
this single disk.
When a hung I/O is detected by the kmods it will be posted as a
deadman event. By default an I/O is considered to be hung after
5 minutes. This value can be changed with the zfs_deadman_ziotime_ms
module parameter. If ZED_POWER_OFF_ENCLOSURE_SLOT_ON_DEADMAN is set
the disk's enclosure slot will be powered off causing the outstanding
I/O to fail. The ZED will then handle this like a normal disk failure.
By default ZED_POWER_OFF_ENCLOSURE_SLOT_ON_DEADMAN is not set.
As part of this change `zfs_deadman_events_per_second` is added
to control the ratelimitting of deadman events independantly of
delay events. In practice, a single deadman event is sufficient
and more aren't particularly useful.
Alphabetize the zfs_deadman_* entries in zfs.4.
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #16226
Alexander Motin [Wed, 29 May 2024 15:53:31 +0000 (11:53 -0400)]
Some improvements to metaslabs eviction
- Add old eviction for special and dedup metaslab classes. Those
vdevs may be potentially big and fragmented with large metaslabs,
while their asynchronous write pattern is not really different
from normal class. It seems an omission to not evict old metaslabs
from them.
- If we have metaslab preload enabled, which means we are not too
low on memory, do not evict active metaslabs even if they are not
used for some time. Eviction of active metaslabs means we won't
be able to write anything until we load them, that may take some
time, that is straight opposite to metaslab preload goals. For
small systems the memory saving should be less important after
recent reduction in number of allocators and so open metaslabs.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16214
Alexander Motin [Sat, 25 May 2024 02:11:18 +0000 (22:11 -0400)]
Destroy ARC buffer in case of fill error
In case of error dmu_buf_fill_done() returns the buffer back into
DB_UNCACHED state. Since during transition from DB_UNCACHED into
DB_FILL state dbuf_noread() allocates an ARC buffer, we must free
it here, otherwise it will be leaked.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15665
Closes #15802
Closes #16216
Rob N [Sat, 25 May 2024 02:00:29 +0000 (12:00 +1000)]
Use memset to zero stack allocations containing unions
C99 6.7.8.17 says that when an undesignated initialiser is used, only
the first element of a union is initialised. If the first element is not
the largest within the union, how the remaining space is initialised is
up to the compiler.
GCC extends the initialiser to the entire union, while Clang treats the
remainder as padding, and so initialises according to whatever
automatic/implicit initialisation rules are currently active.
When Linux is compiled with CONFIG_INIT_STACK_ALL_PATTERN,
-ftrivial-auto-var-init=pattern is added to the kernel CFLAGS. This flag
sets the policy for automatic/implicit initialisation of variables on
the stack.
Taken together, this means that when compiling under
CONFIG_INIT_STACK_ALL_PATTERN on Clang, the "zero" initialiser will only
zero the first element in a union, and the rest will be filled with a
pattern. This is significant for aes_ctx_t, which in
aes_encrypt_atomic() and aes_decrypt_atomic() is initialised to zero,
but then used as a gcm_ctx_t, which is the fifth element in the union,
and thus gets pattern initialisation. Later, it's assumed to be zero,
resulting in a hang.
As confusing and undiscoverable as it is, by the spec, we are at fault
when we initialise a structure containing a union with the zero
initializer. As such, this commit replaces these uses with an explicit
memset(0).
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16135
Closes #16206
Rob N [Sat, 25 May 2024 01:55:47 +0000 (11:55 +1000)]
zap: reuse zap_leaf_t on dbuf reuse after shrink
If a shrink or truncate had recently freed a portion of the ZAP, the
dbuf could still be sitting on the dbuf cache waiting for eviction. If
it is then allocated for a new leaf before it can be evicted, the
zap_leaf_t is still attached as userdata, tripping the VERIFY.
Instead, just check for the userdata, and if we find it, reuse it.
Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16157.
Closes #16204
Rob N [Sat, 25 May 2024 01:54:24 +0000 (11:54 +1000)]
Linux 6.7 compat: detect if kernel defines intptr_t
Since Linux 6.7 the kernel has defined intptr_t. Clang has
-Wtypedef-redefinition by default, which causes the build to fail
because we also have a typedef for intptr_t.
Since its better to use the kernel's if it exists, detect it and skip
our own.
Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16201
Brooks Davis [Sat, 25 May 2024 01:45:58 +0000 (18:45 -0700)]
Avoid a gcc -Wint-to-pointer-cast warning
On 32-bit platforms long long is generally 64-bits. Sufficiently modern
versions of gcc (13 in my testing) complains when casting a pointer to
an integer of a different width so cast to uintptr_t first to avoid the
warning.
Fixes: c183d164aa Parallel pool import Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Don Brady <don.brady@klarasystems.com> Signed-off-by: Brooks Davis <brooks.davis@sri.com>
Closes #16203
Allow block cloning to be interrupted by a signal.
Even though block cloning is much faster than regular copying,
it is not instantaneous - the file might be large and the recordsize
small. It would be nice to be able to interrupt it with a signal
(e.g., SIGINFO on FreeBSD to see the progress).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #16208
Alexander Motin [Fri, 17 May 2024 00:56:55 +0000 (20:56 -0400)]
FreeBSD: Add zfs_link_create() error handling
Originally Solaris didn't expect errors there, but they may happen
if we fail to add entry into ZAP. Linux fixed it in #7421, but it
was never fully ported to FreeBSD.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #13215
Closes #16138
Rob N [Wed, 15 May 2024 20:03:41 +0000 (06:03 +1000)]
dbuf: separate refcount calls for dbuf and dbuf_user
In 92dc4ad83 I updated the dbuf_cache accounting to track the size of
userdata associated with dbufs. This adds the size of the dbuf+userdata
together in a single call to zfs_refcount_add_many(), but sometime
removes them in separate calls to zfs_refcount_remove_many(), if dbuf
and userdata are evicted separately.
What I didn't realise is that when refcount tracking is on,
zfs_refcount_add_many() and zfs_refcount_remove_many() are expected to
be paired, with their second & third args (count & holder) the same on
both sides. Splitting the remove part into two calls means the counts
don't match up, tripping a panic.
This commit fixes that, by always adding and removing the dbuf and
userdata counts separately.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reported-by: Mark Johnston <markj@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16191
Rob Norris [Fri, 10 May 2024 03:58:26 +0000 (13:58 +1000)]
zdb/ztest: send dbgmsg output to stderr
And, make the output fd an arg to zfs_dbgmsg_print(). This is a change
in behaviour, but keeps it consistent with where crash traces go, and
it's easy to argue this is what we want anyway; this is information
about the task, not the actual output of the task.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16181
Rob Norris [Fri, 10 May 2024 03:54:08 +0000 (13:54 +1000)]
zfs_dbgmsg_print: make FreeBSD and Linux consistent
FreeBSD was using fprintf(), which might not be signal-safe. Meanwhile,
Linux's locking did not cover the header output. This two quirks are
unrelated, but both have the same response: be like the other one. So
with this commit, both functions are the same except for the names of
their lock and list variables.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16181
Rob Norris [Fri, 10 May 2024 03:04:14 +0000 (13:04 +1000)]
backtrace: rework for signal safety
Mostly, try a lot harder to not allocate anything.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16181
Rob Norris [Fri, 10 May 2024 01:26:11 +0000 (11:26 +1000)]
libspl: lift backtrace into a separate file
If it's going to be used directly by zdb/ztest, then it sort of doesn't
make sense to carry it with the assert code.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16181
Rob Norris [Fri, 10 May 2024 00:19:48 +0000 (10:19 +1000)]
zdb/ztest: use libspl backtrace for crashes
We can show much nicer backtraces these days, lets use them.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16181
Rob Norris [Thu, 9 May 2024 23:56:48 +0000 (09:56 +1000)]
zdb: bring crash handling over from ztest
ztest has a very nice ability to show a backtrace when there's an
unexpected crash. zdb is used often enough on corrupted data and can
blow up too, so nice output is useful there too.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16181
Rob Norris [Thu, 2 May 2024 02:13:38 +0000 (12:13 +1000)]
spa_taskq_dispatch_ent: simplify arguments
This renames it to spa_taskq_dispatch(), and reduces and simplifies its
arguments based on these observations from its two call sites:
- arg is always the zio, so it can be typed that way, and we don't need
to provide it twice;
- ent is always &zio->io_tqent, and zio is always provided, so we can
use it directly;
- the only flag used is TQ_FRONT, which can just be a bool;
- zio != NULL was part of the "use allocator" test, but it never would
have got that far, because that arg was only set to NULL in the
reexecute path, which is forced to type CLAIM, so the condition would
fail at t == WRITE anyway.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16151
Rob Norris [Thu, 2 May 2024 02:06:58 +0000 (12:06 +1000)]
spa: flatten spa_taskq_dispatch_ent()
It is the only user of spa_taskq_dispatch_select(), so might as well
just carry it directly.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16151
Rob Norris [Thu, 2 May 2024 02:04:24 +0000 (12:04 +1000)]
spa: remove spa_taskq_dispatch_sync()
It has no callers anymore.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16151
Rob Norris [Thu, 2 May 2024 01:57:23 +0000 (11:57 +1000)]
zfs_ioc_send: use a dedicated taskq thread for send
When stack space is tight, the stream is written to its target on a
separate taskq thread to make sure there's enough stack space to
complete it.
This has always used an IO taskq, but that doesn't really make sense for
it, and moving it onto a regular taskq lets us get rid of
spa_taskq_dispatch_sync(), which is not used anywhere else.
Stream writes may block for a long time depending on what the target is,
and we have no way of discovering this, so we can't risk using the
system taskq, as there may be many tens of sends in progress. Instead,
we create a dedicated taskq thread for each send writer to run on, and
clean it up when it's done.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16151
Alan Somers [Wed, 8 May 2024 16:01:22 +0000 (10:01 -0600)]
Better control the thread pool size when mounting datasets
Ever since a10d50f999, ZFS has mounted file systems in parallel when
importing a pool. It uses a fixed size of 512 for the thread pool. But
since c183d164aa1, it has also imported pools in parallel. So the total
number of threads at one time is 513 * npools + 1. That can easily
exceed the system's limit on the number of threads per process, which
will cause one or more pools to be unable to allocate any worker
threads, forcing them to fallback to slow serial mounting . To
forestall that, manage the threadpool size in /sbin/zpool, not libzfs.
Use the same size (512), but divided by the number of pools.
This is a backwards-incompatible change to the libzfs abi.
Sponsored by: Axcient Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <gwilson@delphix.com> Signed-off-by: Alan Somers <asomers@FreeBSD.org>
Closes #16178
Alan Somers [Tue, 7 May 2024 20:21:31 +0000 (14:21 -0600)]
libzfs: Fix mounting datasets under thread limit pressure
During parallel zpool import, /sbin/zpool will create a separate thread
pool for each pool, used to mount that pool's datasets. If the total
thread count exceed's the system's limit on threads per process, then
tpool_dispatch may fail. If it does, directly execute the mount
operation instead.
Sponsored by: Axcient Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <gwilson@delphix.com> Signed-off-by: Alan Somers <asomers@FreeBSD.org>
Closes #16178
Fixes #16172
Alan Somers [Tue, 7 May 2024 19:53:38 +0000 (13:53 -0600)]
tpool_dispatch: fail if it cannot start at least 1 worker.
Sponsored by: Axcient Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <gwilson@delphix.com> Signed-off-by: Alan Somers <asomers@FreeBSD.org>
Closes #16178
Don Brady [Sun, 5 May 2024 14:57:33 +0000 (14:57 +0000)]
Simplified the scope of the namespace lock
If we wait until after we check for no spa references to drop the
namespace lock, then we know that spa consumers will need to call
spa_lookup() and end up waiting on the spa_namespace_cv until we
finish. This narrows the external checks to spa_lookup and we no
longer need to worry about the spa_vdev_enter case.
Sponsored-By: Klara Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <gwilson@delphix.com> Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes #16153
Don Brady [Thu, 2 May 2024 19:28:10 +0000 (19:28 +0000)]
Add support for parallel pool exports
Changed spa_export_common() such that it no longer holds the
spa_namespace_lock for the entire duration and instead sets
spa_export_thread to indicate an import is in progress on the
spa. This allows for an export to a diffent pool to proceed
in parallel while an export is still processing potentially
long operations like spa_unload_log_sm_flush_all().
Calls like spa_lookup() and spa_vdev_enter() that rely on
the spa_namespace_lock to serialize them against a concurrent
export, now wait for any in-progress export thread to complete
before proceeding.
The 'zpool import -a' sub-command also provides multi-threaded
support, using a thread pool to submit the exports in parallel.
Sponsored-By: Klara Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <gwilson@delphix.com> Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes #16153
Brian Behlendorf [Mon, 13 May 2024 22:12:07 +0000 (17:12 -0500)]
Linux: disable lockdep for a couple of locks
When running a debug kernel with lockdep enabled there
are several locks which report false positives. Set
MUTEX_NOLOCKDEP/RW_NOLOCKDEP to disable these warnings.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #16188
Alexander Motin [Fri, 10 May 2024 19:35:20 +0000 (15:35 -0400)]
ZAP: Fix leaf references on zap_expand_leaf() errors
Depending on kind of error zap_expand_leaf() may return with or
without valid leaf reference held. Make sure it returns NULL if
due to error it has no leaf to return. Make its callers to check
the returned leaf pointer, and release the leaf if it is not NULL.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #12366
Closes #16159
chenqiuhao1997 [Fri, 10 May 2024 15:47:21 +0000 (23:47 +0800)]
Replace P2ALIGN with P2ALIGN_TYPED and delete P2ALIGN.
In P2ALIGN, the result would be incorrect when align is unsigned
integer and x is larger than max value of the type of align.
In that case, -(align) would be a positive integer, which means
high bits would be zero and finally stay zero after '&' when
align is converted to a larger integer type.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Youzhong Yang <yyang@mathworks.com> Signed-off-by: Qiuhao Chen <chenqiuhao1997@gmail.com>
Closes #15940
Rob N [Thu, 9 May 2024 14:43:48 +0000 (00:43 +1000)]
libspl_assert: always link -lpthread on FreeBSD
The pthread_* functions are in -lpthread on FreeBSD. Some of them are
implicitly linked through libc, but on FreeBSD 13 at least
pthread_getname_np() is not. Just be explicit, since -lpthread is the
documented interface anyway.
Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16168
Martin Matuška [Thu, 9 May 2024 14:42:51 +0000 (16:42 +0200)]
Unbreak FreeBSD cross-build on MacOS broken in 051460b8b
MacOS used FreeBSD-compatible getprogname() and pthread_getname_np().
But pthread_getthreadid_np() does not exist on MacOS. This implements
libspl_gettid() using pthread_threadid_np() to get the thread id
of the current thread.
Tested with FreeBSD GitHub actions
freebsd-src/.github/workflows/cross-bootstrap-tools.yml
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Signed-off-by: Martin Matuska <mm@FreeBSD.org>
Closes #16167
Alexander Motin [Thu, 9 May 2024 14:39:57 +0000 (10:39 -0400)]
Fix ZIL clone records for legacy holes
Previous code overengineered cloned range calculation by using
BP_GET_LSIZE(). The problem is that legacy holes don't have the
logical size, so result will be wrong. But we also don't need
to look on every block size, since they all must be identical.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16165
Alexander Motin [Thu, 9 May 2024 14:32:59 +0000 (10:32 -0400)]
Fix scn_queue races on very old pools
Code for pools before version 11 uses dmu_objset_find_dp() to scan
for children datasets/clones. It calls enqueue_clones_cb() and
enqueue_cb() callbacks in parallel from multiple taskq threads.
It ends up bad for scan_ds_queue_insert(), corrupting scn_queue
AVL-tree. Fix it by introducing a mutex to protect those two
scan_ds_queue_insert() calls. All other calls are done from the
sync thread and so serialized.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16162
Ameer Hamza [Thu, 9 May 2024 14:31:57 +0000 (19:31 +0500)]
zdb: add missing cleanup for early return
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Don Brady <don.brady@klarasystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #16152
Daniel Perry [Thu, 9 May 2024 14:30:28 +0000 (10:30 -0400)]
Replace usage of schedule_timeout with schedule_timeout_interruptible (#16150)
This commit replaces current usages of schedule_timeout() with
schedule_timeout_interruptible() in code paths that expect the running
task to sleep for a short period of time. When schedule_timeout() is
called without previously calling set_current_state(), the running
task never sleeps because the task state remains in TASK_RUNNING.
By calling schedule_timeout_interruptible() to set the task state to
TASK_INTERRUPTIBLE before calling schedule_timeout() we achieve the
intended/desired behavior of putting the task to sleep for the
specified timeout.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Daniel Perry <dtperry@amazon.com>
Closes #16150
Alexander Motin [Fri, 3 May 2024 16:53:34 +0000 (12:53 -0400)]
Disable high priority ZIO threads on FreeBSD and Linux
High priority threads are handling ZIL writes. While there is no
ZIL compression, there is encryption, checksuming and RAIDZ math.
We've found that on large systems 1 taskq with 5 threads can be
a bottleneck for throughput, IOPS or both. Instead of just bumping
number of threads with a risk of overloading CPUs and increasing
latency, switch to using TQ_FRONT mechanism to increase sync write
requests priority within standard write threads. Do not do it on
Illumos, since its TQ_FRONT implementation is inherently unfair.
FreeBSD and Linux don't have this problem, so we can do it there.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #16146
Rob N [Thu, 2 May 2024 22:18:35 +0000 (08:18 +1000)]
vdev_disk: disable flushes if device does not support it
If the underlying device doesn't have a write-back cache, the kernel
will just return a successful response. This doesn't hurt anything, but
it's extra work on the IO taskqs that are unnecessary. So, detect this
when we open the device for the first time.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16148
Alexander Motin [Wed, 1 May 2024 18:07:20 +0000 (14:07 -0400)]
Improve write issue taskqs utilization
- Reduce number of allocators on small system down to one per 4
CPU cores, keeping maximum at 4 on 16+ core systems. Small systems
should not have the lock contention multiple allocators supposed
to solve, while having several metaslabs open and modified each
TXG is not free.
- Reduce number of write issue taskqs down to one per 16 CPU
cores and an integer fraction of number of allocators. On mid-
sized systems, where multiple allocators already make sense, too
many write issue taskqs may reduce write speed on single-file
workloads, since single file is handled by only one taskq to
reduce fragmentation. On large systems, that can actually benefit
from many taskq's better IOPS, the bottleneck is less important,
since in worst case there will be at least 16 cores to handle it.
- Distribute dnodes between allocators (and taskqs) in a round-
robin fashion instead of relying on sync taskqs to be balanced.
The last is not guarantied and may depend on scheduling.
- Remove io_wr_iss_tq from struct zio. io_allocator is enough.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16130
Alexander Motin [Wed, 1 May 2024 17:59:32 +0000 (13:59 -0400)]
Slightly improve dnode hash
As I understand just for being less predictable dnode hash includes
8 bits of objset pointer, starting at 6. But since objset_t is
more than 1KB in size, its allocations are likely aligned to 2KB,
that means 11 lower bits provide no entropy. Just take the 8 bits
starting from 11.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16131
Rob Norris [Tue, 30 Apr 2024 00:37:29 +0000 (10:37 +1000)]
libspl/assert: use libunwind for backtrace when available
libunwind seems to do a better job of resolving a symbols than
backtrace(), and is also useful on platforms that don't have backtrace()
(eg musl). If it's available, use it.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/
Closes #16140
Rob Norris [Sun, 28 Apr 2024 02:49:58 +0000 (12:49 +1000)]
libspl/assert: add lock around assertion output
If multiple threads trip an assertion at the same moment (quite common),
they can be printing at the same time, and their output gets messy.
This adds a simple lock around the whole thing, to prevent a second task
printing assert output before the first has finished.
Additionally, if libspl_assert_ok is not set, abort() is called without
dropping the lock, so that any other asserting tasks will be killed
before starting any output, rather than only getting part-way through.
This is a tradeoff; it's assumed that multiple threads asserting at the
same moment are likely the same fault in different instances of a
thread, and so there won't be any more useful information from the other
tasks anyway.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/
Closes #16140
Rob Norris [Sun, 21 Apr 2024 11:43:53 +0000 (21:43 +1000)]
libspl/assert: show process/task details in assert output
Makes it much easier to see what thing complained.
Getting thread id, program name and thread name vary wildly between
Linux and FreeBSD, so those are set up in macros. pthread_getname_np()
did not appear in musl until very recently, but the same info has always
been available via prctl(PR_GET_NAME), so we use that instead.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/
Closes #16140
Rob Norris [Tue, 30 Apr 2024 02:35:30 +0000 (12:35 +1000)]
find_system_library: fix var cleanup when library not found
The "not found" path is attempting to clear SOMELIB_CFLAGS and
SOMELIB_LIBS by resetting them in AC_SUBST(). However, the second arg to
AC_SUBST is expanded in autoconf with `m4_ifvaln([$2], [[$1]=$2])`,
which is defined as "if the first arg is non-empty". The m4 "empty"
construction is [], therefore, the existing AC_SUBST calls never modify
the variables at all.
The effect of this is that leftovers from the library test can leak out.
At least, if a library header is found in the first stage, but the
library itself is not, -lsomelib is added to SOMELIB_LIBS and further
tests done. If that library is not found, SOMELIB_LIBS will not be
cleared.
For most of our library tests this hasn't been a problem, as they're
either always found properly via pkg-config or set directly, or the
calling test immediately aborts configure. For an optional dependency
however, an apparent "partial" result where the header is found but no
corresponding library causes link errors later.
I think a complete fix should probably not be setting SOMELIB_xxx until
the final result is known, but for now, adjusting the AC_SUBST calls to
explictly set the empty shell string (which is not "empty" to m4) at
least restores the intent.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/
Closes #16140
Rob N [Mon, 29 Apr 2024 22:57:32 +0000 (08:57 +1000)]
zio: try to execute TYPE_NULL ZIOs on the current task
Many TYPE_NULL ZIOs are used to provide a sync point for child ZIOs, and
do not do any actual work themselves. However, they are still dispatched
to a dedicated, single-thread taskq, which leads to their execution
being entirely task switch and dequeue overhead for no actual reason.
This commit changes it so that when selecting a parent ZIO to execute,
if the parent is TYPE_NULL and has no done function (that is, no
additional work), it is executed on the same thread. This reduces task
switches and frees up CPU cores for other work.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16134
Don Brady [Mon, 29 Apr 2024 21:35:53 +0000 (15:35 -0600)]
vdev probe to slow disk can stall mmp write checker
Simplify vdev probes in the zio_vdev_io_done context to
avoid holding the spa config lock for a long duration.
Also allow zpool clear if no evidence of another host
is using the pool.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes #15839
Ameer Hamza [Mon, 29 Apr 2024 20:28:50 +0000 (01:28 +0500)]
Fix arcstats for FreeBSD after zfetch support
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #16141
Someone came to me and pointed out that you could pretty
readily cause the refreservation calculation to exceed
2**64, given the 2**17 multiplier in it, and produce
refreservations wildly less than the actual volsize in cases where
it should have failed.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #15996
Tony Hutter [Mon, 29 Apr 2024 18:31:50 +0000 (11:31 -0700)]
GCC: Fixes for gcc 14 on Fedora 40
- Workaround dangling pointer in uu_list.c (#16124)
- Fix calloc() transposed arguments in zpool_vdev_os.c
- Make some temp variables unsigned to prevent triggering a
'-Werror=alloc-size-larger-than' error.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #16124
Closes #16125
Alan Somers [Thu, 25 Apr 2024 21:24:52 +0000 (16:24 -0500)]
Fix updating the zvol_htable when renaming a zvol
When renaming a zvol, insert it into zvol_htable using the new name, not
the old name. Otherwise some operations won't work. For example,
"zfs set volsize" while the zvol is open.
Sponsored by: Axcient Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alek Pinchuk <apinchuk@axcient.com>
Signed-off-by: Alan Somers <asomers@FreeBSD.org>
Closes #16127
Closes #16128
Brian Behlendorf [Thu, 25 Apr 2024 20:40:09 +0000 (13:40 -0700)]
Python 3.12 deprecated python3-distutils
As for python-3.12 the distutils package has been deprecated.
The latest ax_python_devel.m4 macro from the autoconf archive
has been updated accordingly so let's pull in the new version.
We can also drop the changes made to our customized version
to continue if the development version is not installed since
this functionality has been included upstream.
Allan Jude [Wed, 24 Apr 2024 21:51:21 +0000 (17:51 -0400)]
Fast Dedup: ZAP Shrinking
This allows ZAPs to shrink. When there are two empty sibling leafs,
one of them is collapsed and its storage space is reused.
This improved performance on directories that at one time contained
a large number of files, but many or all of those files have since
been deleted.
This also applies to all other types of ZAPs as well.
Sponsored-by: iXsystems, Inc. Sponsored-by: Klara, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Alexander Stetsenko <alex.stetsenko@klarasystems.com>
Closes #15888
Alexander Motin [Wed, 24 Apr 2024 21:38:48 +0000 (17:38 -0400)]
Make more taskq parameters writable
There is no reason for these module parameters to be read-only.
Being modified they just apply on next pool import/creation, that
is useful for testing different values.
Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16118
Alexander Motin [Tue, 23 Apr 2024 16:06:00 +0000 (12:06 -0400)]
L2ARC: Cleanup buffer re-compression
When compressed ARC is disabled, we may have to re-compress when
writing into L2ARC. If doing so we can't fit it into the original
physical size, we should just fail immediately, since even if it
may still fit into allocation size, its checksum will never match.
While there, refactor the code similar to other compression places
without using abd_return_buf_copy().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16038
Fix an error in zfs-kmod.spec that causes kmod-zfs packages not to
include the correct RPM requires/conflicts relationships. With this
change applied, RPM correctly no longer allows kmod-zfs & zfs-dkms
packages to be installed together.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Todd Seidelmann <18294602+seidelma@users.noreply.github.com>
Closes #16121
Alexander Motin [Mon, 22 Apr 2024 18:41:03 +0000 (14:41 -0400)]
Refactor dbuf_read() for safer decryption
In dbuf_read_verify_dnode_crypt():
- We don't need original dbuf locked there. Instead take a lock
on a dnode dbuf, that is actually manipulated.
- Block decryption for a dnode dbuf if it is currently being
written. ARC hash lock does not protect anonymous buffers, so
arc_untransform() is unsafe when used on buffers being written,
that may happen in case of encrypted dnode buffers, since they
are not copied by dbuf_dirty()/dbuf_hold_copy().
In dbuf_read():
- If the buffer is in flight, recheck its compression/encryption
status after it is cached, since it may need arc_untransform().
Tested-by: Rich Ercolani <rincebrain@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16104
Ryan [Mon, 22 Apr 2024 17:59:31 +0000 (01:59 +0800)]
zfs get: add '-t fs' and '-t vol' options
Make `zfs get` accept `fs` for `filesystem` and `vol` for `volume`.
Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan <errornointernet@envs.net>
Closes #16117
Brooks Davis [Mon, 22 Apr 2024 17:48:58 +0000 (10:48 -0700)]
ztest: use ASSERT3P to compare pointers
With a sufficiently modern gcc (I saw this with gcc13), gcc complains
when casting pointers to an integer of a different type (even a larger
one). On 32-bt ASSERT3U does this on 32-bit systems by casting a 32-bit
pointer to uint64_t so use ASSERT3P which uses uintptr_t.
Fixes: 5caeef02fa53 RAID-Z expansion feature Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Brooks Davis <brooks.davis@sri.com>
Closes #16115
George Wilson [Mon, 22 Apr 2024 16:42:38 +0000 (12:42 -0400)]
Parallel pool import
This commit allow spa_load() to drop the spa_namespace_lock so
that imports can happen concurrently. Prior to dropping the
spa_namespace_lock, the import logic will set the spa_load_thread
value to track the thread which is doing the import.
Consumers of spa_lookup() retain the same behavior by blocking
when either a thread is holding the spa_namespace_lock or the
spa_load_thread value is set. This will ensure that critical
concurrent operations cannot take place while a pool is being
imported.
The zpool command is also enhanced to provide multi-threaded support
when invoking zpool import -a.
Lastly, zinject provides a mechanism to insert artificial delays
when importing a pool and new zfs tests are added to verify parallel
import functionality.
Contributions-by: Don Brady <don.brady@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Wilson <gwilson@delphix.com>
Closes #16093
Rob N [Fri, 19 Apr 2024 23:41:31 +0000 (09:41 +1000)]
abd_iter_page: rework to handle multipage scatterlists
Previously, abd_iter_page() would assume that every scatterlist would
contain a single page (compound or no), because that's all we ever
create in abd_alloc_chunks(). However, scatterlists can contain multiple
pages of arbitrary provenance, and if we get one of those, we'd get all
the math wrong.
This reworks things to handle multiple pages in a scatterlist, by
properly finding the right page within it for the given offset, and
understanding better where the end of the page is and not crossing it.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reported-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16108
Alexander Motin [Fri, 19 Apr 2024 23:18:54 +0000 (19:18 -0400)]
Handle FLUSH errors as "expected"
Before #16061 zio_vdev_io_done() was not used for FLUSH requests.
Addition of it triggers reprobe each TXG for vdevs not supporting
them. Since those errors are often expected, they are normally
handled by individual vdev drivers and should be ignored here.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16110
Rob Norris [Tue, 16 Apr 2024 04:56:35 +0000 (14:56 +1000)]
tests/quota: consistently clear quota property between tests
When run in isolation, quota_005_pos would fail in cleanup because it
would attempt restore the previous quota, which was 0, and so get an
error (because you can't set quota to '0', you have to use 'none').
It worked as part of the quota tag set because the previous tests did
not clean up their quota, so there was always a non-zero quota to return
to.
This adds a simple quota reset function, and has all quota tests run it
at cleanup. For the ones that weren't cleaning up, they now do, and for
quota_005_pos, which was trying to do the right thing, it now just
resets it.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16097
Rob Norris [Tue, 16 Apr 2024 05:03:33 +0000 (15:03 +1000)]
tests/quota_005_pos: use a long int for doubling the quota size
When run in isolation, quota_005_pos would see an empty ~300G dataset.
Doubling it's space overflows a int32, which meant it was trying to then
set the quota to a negative value, and would fail.
When run as part of the quota tests, the filesystem appears to have
stuff in it, and so a lower available space, which doesn't overflow, and
so succeeds.
The bare minimum fix seems to be to use a int64 for the available space,
so it can be comfortably doubled. Here it is.
(Also a typo fix and a tiny bit of cleanup).
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16097
Ameer Hamza [Fri, 19 Apr 2024 17:19:12 +0000 (22:19 +0500)]
Add zfetch stats in arcstats
arc_summary also reports zfetch stats but it's inconvenient to monitor
contiguously incrementing numbers. Adding them in arcstats allows us to
observe streams more conveniently.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #16094
Tony Hutter [Wed, 17 Apr 2024 16:29:21 +0000 (09:29 -0700)]
Linux 6.8 compat: META (#16099)
Update the META file to reflect compatibility with the 6.8 kernel.
Signed-off-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Rob N [Tue, 16 Apr 2024 16:13:01 +0000 (02:13 +1000)]
zts: add a debug option to get full test output
The test runner accumulates output from individual tests, then writes it
to the log at the end. If a test hangs or crashes the system half way
through, we get no insight into how it got to where it did.
This adds a -D option for "debug". When set, all test output is written
to stdout.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16096
Do no use .cfi_negate_ra_state within the assembly on Arm64
Compiling openzfs on aarch64 with gcc-8 and gcc-9 is failing currently.
See issue #14965 for deeper context.
On platforms without pointer authentication, .cfi_negate_ra_state can be
defined to a no-op:
https://sourceware.org/git/?p=binutils-gdb.git;a=blob;f=gdb/aarch64-tdep.c#l1413
I have tested this on Arm64 FreeBSD 13.2 and AlmaLinux-8.
Andrew Turner [Mon, 15 Apr 2024 20:53:39 +0000 (21:53 +0100)]
Add the BTI elf note to the AArch64 SHA2 assembly
On ELF platforms there is a note to specify when an application or
library supports BTI. When linking one of these the linker needs
all input object files to have the note. If not it will not include
it in the output file.
Normally the compiler would generate it, but for assembly files we
need to do it our selves.
Add the note to the aarch64 sha256 and sha512 assembly files.
Tested by building with BTI enabled and using the -zbti-report=error
flag to lld that makes it an error if the note is missing.
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrew Turner <andrew.turner4@arm.com>
Closes #16086
Rob N [Mon, 15 Apr 2024 20:52:20 +0000 (06:52 +1000)]
zinject: "no-op" error injection
When injected, this causes the matching IO to appear to succeed, but the
actual work is never submitted to the physical device. This can be used
to simulate a write-back cache servicing a write, but the backing device
has failed and the cache cannot complete the operation in the
background.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16085
Rob N [Mon, 15 Apr 2024 20:44:12 +0000 (06:44 +1000)]
zts: allow running a single test by name only
Specifying a single test is kind of a hassle, because the full relative
path under the test suite dir has to be included, but it's not always
clear what that path even is.
This change allows `-t` to take the name of a single test instead of a
full path. If the value has no `/` characters, we search for a file of
that name under the test root, and if found, use that as the full test
path instead.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16088
Kernel documentation for the discard_granularity property says:
A discard_granularity of 0 means that the device does not support
discard functionality.
Some older kernels had drivers (notably loop, but also some USB-SATA
adapters) that would set the QUEUE_FLAG_DISCARD capability flag, but
have discard_granularity=0. Since 5.10 (torvalds/linux@b35fd7422c2f) the
discard entry point blkdev_issue_discard() has had a check for this,
which would immediately reject the call with EOPNOTSUPP, and throw a
scary diagnostic message into the log. See #16068.
Since 6.8, the block layer sets a non-zero default for
discard_granularity (torvalds/linux@3c407dc723bb), and a future kernel
will remove the check entirely[1].
As such, there's no good reason for us to enable discard when
discard_granularity=0. The kernel will never let the request go in
anyway; better that we just disable it so we can report it properly to
the user.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16068
Closes #16082