Rob N [Mon, 23 Oct 2023 15:50:55 +0000 (02:50 +1100)]
spa: document spa_thread() and SDC feature gates
spa_thread() and the "System Duty Cycle" scheduling class are from
Illumos and have not yet been adapted to Linux or FreeBSD.
HAVE_SPA_THREAD has long been explicitly undefined and used to mark
spa_thread(), but there's some related taskq code that can never be
invoked without it, which makes some already-tricky code harder to read.
HAVE_SYSDC is introduced in this commit to mark the SDC parts. SDC
requires spa_thread(), but the inverse is not true, so they are
separate.
I don't want to make the call to just remove it because I still harbour
hopes that OpenZFS could become a first-class citizen on Illumos
someday. But hopefully this will at least make the reason it exists a
bit clearer for people without long memories and/or an interest in
history.
For those that are interested in the history, the original FreeBSD port
of ZFS (before ZFS-on-Linux was adopted there) did have a spa_thread(),
but not SDC. The last version of that before it was removed can be read
here:
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #15406
Sam Atkinson [Fri, 20 Oct 2023 21:22:04 +0000 (17:22 -0400)]
Do not persist user/group/project quota zap objects when unneeded
In the zfs_id_over*quota functions, there is a short-circuit to skip
the zap_lookup when the quota zap does not exist. If quotas are never
used in a zpool, then the quota zap will never exist. But if
user/group/project quotas are ever used, the zap objects will be
created and will persist even if the quotas are deleted.
The quota zap_lookup in the write path can become a bottleneck for
write-heavy small I/O workloads. Before this commit, it was not
possible to remove this lookup without creating a new zpool.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Sam Atkinson <samatk@amazon.com>
Closes #14721
Alexander Motin [Fri, 20 Oct 2023 19:38:37 +0000 (15:38 -0400)]
Trust ARC_BUF_SHARED() more
In my understanding ARC_BUF_SHARED() and arc_buf_is_shared() should
return identical results, except the second also asserts it deeper.
The first is much cheaper though, saving few pointer dereferences.
Replace production arc_buf_is_shared() calls with ARC_BUF_SHARED(),
and call arc_buf_is_shared() in random assertions, while making it
even more strict.
On my tests this in half reduces arc_buf_destroy_impl() time, that
noticeably reduces hash_lock congestion under heavy dbuf eviction.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15397
Alexander Motin [Fri, 20 Oct 2023 19:37:16 +0000 (15:37 -0400)]
Remove lock from dsl_pool_need_dirty_delay()
Torn reads/writes of dp_dirty_total are unlikely: on 64-bit systems
due to register size, while on 32-bit due to memory constraints.
And even if we hit some race, the code implementing the delay takes
the lock any way.
Removal of the poll-wide lock acquisition saves ~1% of CPU time on
8-thread 8KB write workload.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15390
VaibhavB [Fri, 20 Oct 2023 18:57:39 +0000 (00:27 +0530)]
run-zts test procfs/pool_state failed with uncorrectable I/O failure
Once we trigger the zpool scrub, all zpool/zfs command gets stuck for
180 seconds. Post 180 seconds zpool/zfs commands gets start executing
however few more seconds(10s) it take to update the status. hence
sleeping for 200 seconds so that we get the correct status.
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: vaibhav.bhanawat <vaibhav.bhanawat@delphix.com>
Closes #15364
Fix typo in tests/zfs-tests/tests/functional/cli_user/misc/misc.cfg
Reviewed-by: Rob N <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Dennis R. Friedrichsen <dennis.r.friedrichsen@gmail.com>
Closes #15417
Olivier Certner [Fri, 20 Oct 2023 18:49:56 +0000 (20:49 +0200)]
FreeBSD: taskq: Remove unused declaration
Variable 'uma_align_cache' has not been used since commit "FreeBSD: Use
a hash table for taskqid lookups" (3933305ea). Moreover, it is soon
going to become private to FreeBSD's UMA in 15.0-CURRENT (main),
14.0-STABLE (stable/14) and 13.2-STABLE (stable/13). Should accessing
this information become necessary again, one will have to use the new
accessors for recent versions.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olivier Certner <olce.freebsd@certner.fr>
Closes #15416
Colin Percival [Fri, 20 Oct 2023 17:30:32 +0000 (10:30 -0700)]
Set spa_ccw_fail_time=0 when expanding a vdev.
When a vdev is to be expanded -- either via `zpool online -e` or via
the autoexpand option -- a SPA_ASYNC_CONFIG_UPDATE request is queued
to be handled via an asynchronous worker thread (spa_async_thread).
This normally happens almost immediately; but will be delayed up to
zfs_ccw_retry_interval seconds (default 5 minutes) if an attempt to
write the zpool configuration cache failed.
When FreeBSD boots ZFS-root VM images generated using `makefs -t zfs`,
the zpoolupgrade rc.d script runs `zpool upgrade`, which modifies the
pool configuration and triggers an attempt to write to the cache file.
This attempted write fails because the filesystem is still mounted
read-only at this point in the boot process, triggering a 5-minute
cooldown before SPA_ASYNC_CONFIG_UPDATE requests will be handled by
the asynchronous worker thread.
When expanding a vdev, reset the "when did a configuration cache
write last fail" value so that the SPA_ASYNC_CONFIG_UPDATE request
will be handled promptly. A cleaner but more intrusive option would
be to use separate SPA_ASYNC_ flags for "configuration changed" and
"try writing the configuration cache again", but with FreeBSD 14.0
coming very soon I'd prefer to leave such refactoring for a later
date.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Colin Percival <cperciva@FreeBSD.org>
Closes #15405
Don Brady [Fri, 20 Oct 2023 16:29:02 +0000 (10:29 -0600)]
Fix ZED auto-replace for VDEVs using by-id paths
The change is simple -- restore the original code so that the VDEV
path is updated when using by-id paths. The more challenging part
was to devise a second ZTS test, that would test auto-replace for
'by-id' and help prevent a future regression.
With that new test, we can now do an A|B test with , and without,
the fix to confirm that auto-replace for by-id paths works. The
existing auto-replace test, functional/fault/auto_replace_001_pos,
will confirm that we didn't break auto-replace for 'by-vdev' paths.
In the original functional/fault/auto_replace_001_pos test, the disk
wipe (using dd) was not effective in removing the partitioning since
the kernel was never informed of the wipe.
Added a call to wipefs(8) so that the kernel is informed and ZED will
re-partition the device.
Added a validation step that the re-partitioning occurred by
confirming that the GPT partition UUID changes.
Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes #15363
The reason is the `zil_slog_bulk` variable. In `zil_lwb_write_open`,
if a zil block is greater than 768K, the priority of the write is
downgraded from sync to async. Increasing the value allows greater
throughput. To select a value for this PR, I ran an fio workload with
the following values for `zil_slog_bulk`:
None of the other tests in the performance suite (run with a zil or
slog) had a significant change, including the random_write_zil tests,
which use multiple datasets.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: John Wren Kennedy <john.kennedy@delphix.com>
Closes #14378
Alexander Motin [Fri, 13 Oct 2023 17:41:11 +0000 (13:41 -0400)]
FreeBSD: Improve taskq wrapper
- Group tqent_task and tqent_timeout_task into a union. They are
never used same time. This shrinks taskq_ent_t from 192 to 160 bytes.
- Remove tqent_registered. Use tqent_id != 0 instead.
- Remove tqent_cancelled. Use taskqueue pending counter instead.
- Change tqent_type into uint_t. We don't need to pack it any more.
- Change tqent_rc into uint_t, matching refcount(9).
- Take shared locks in taskq_lookup().
- Call proper taskqueue_drain_timeout() for TIMEOUT_TASK in
taskq_cancel_id() and taskq_wait_id().
- Switch from CK_LIST to regular LIST.
Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15356
Jason King [Thu, 12 Oct 2023 18:01:54 +0000 (13:01 -0500)]
Zpool can start allocating from metaslab before TRIMs have completed
When doing a manual TRIM on a zpool, the metaslab being TRIMmed is
potentially re-enabled before all queued TRIM zios for that metaslab
have completed. Since TRIM zios have the lowest priority, it is
possible to get into a situation where allocations occur from the
just re-enabled metaslab and cut ahead of queued TRIMs to the same
metaslab. If the ranges overlap, this will cause corruption.
We were able to trigger this pretty consistently with a small single
top-level vdev zpool (i.e. small number of metaslabs) with heavy
parallel write activity while performing a manual TRIM against a
somewhat 'slow' device (so TRIMs took a bit of time to complete).
With the patch, we've not been able to recreate it since. It was on
illumos, but inspection of the OpenZFS trim code looks like the
relevant pieces are largely unchanged and so it appears it would be
vulnerable to the same issue.
Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jason King <jking@racktopsystems.com>
Illumos-issue: https://www.illumos.org/issues/15939
Closes #15395
Alexander Motin [Wed, 11 Oct 2023 23:37:21 +0000 (19:37 -0400)]
DMU: Do not pre-read holes during write
dmu_tx_check_ioerr() pre-reads blocks that are going to be dirtied
as part of transaction to both prefetch them and check for errors.
But it makes no sense to do it for holes, since there are no disk
reads to prefetch and there can be no errors. On the other side
those blocks are anonymous, and they are freed immediately by the
dbuf_rele() without even being put into dbuf cache, so we just
burn CPU time on decompression and overheads and get absolutely
no result at the end.
Use of dbuf_hold_impl() with fail_sparse parameter allows to skip
the extra work, and on my tests with sequential 8KB writes to empty
ZVOL with 32KB blocks shows throughput increase from 1.7 to 2GB/s.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15371
Brian Behlendorf [Tue, 10 Oct 2023 20:32:33 +0000 (13:32 -0700)]
ZTS: Debug zfs_share_concurrent_shares failure
Update zfs_share_concurrent_shares test case to wait a few seconds
and recheck that the filesystem isn't shared. The intent here is
determine the nature of the error and if it may be a race.
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Umer Saleem <usaleem@ixsystems.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #15379
Brian Behlendorf [Tue, 10 Oct 2023 20:31:15 +0000 (13:31 -0700)]
CI: Move perl script to dist_noinst_DATA
Everything listed in dist_noinst_SCRIPTS is assumed to be a shell
script, this generates a shellcheck SC1071 error since perl is not
supported. Move update_authors.pl to dist_noinst_DATA with the
other perl scripts.
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Rob N <robn@despairlabs.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #15392
Daniel Berlin [Tue, 10 Oct 2023 18:04:32 +0000 (14:04 -0400)]
Ensure we call fput when cloning fails due to different devices.
Right now, zpl_ioctl_ficlone and zpl_ioctl_ficlonerange do not call
put on the src fd if the source and destination are on two different
devices. This leaves the source file held open in this case.
Reviewed-by: Kay Pedersen <mail@mkwg.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Daniel Berlin <dberlin@dberlin.org>
Closes #15386
Tony Hutter [Tue, 10 Oct 2023 15:57:48 +0000 (08:57 -0700)]
zvol: Temporally disable blk-mq
There was a report of zvol data loss (#15351) after enabling blk-mq on a
zvol backed with 16k physical block sized disks. Out of an abundance of
caution, do not allow the user to enable blk-mq until we can look into
the issue.
Note that blk-mq was not enabled by default on zvols. It was always
opt-in via the zvol_use_blk_mq module parameter.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Addresses: #15351
Closes #15378
Rob Norris [Sat, 5 Aug 2023 16:11:19 +0000 (02:11 +1000)]
AUTHORS: update with missing names
This is generated by scripts/update_authors.pl. I've looked over the
results fairly closely and while I don't think they're bad, they could
be improved somewhat, but also, I don't know if its good form to just
update this without explicit consent from those named.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #15374
Rob Norris [Sat, 5 Aug 2023 15:58:45 +0000 (01:58 +1000)]
mailmap: initial, trying to tidy up a lot of the commit history
This comes from the observation that a huge number of commit author
fields look quite strange (to my eyes), but quite often the Signed-off-by: trailer has the correct name. For these I have updated
the name where it was obvious how to do so, however, I have not created
a mapping for the commit email to the Signed-off-by email, as whatever I
choose for email will become the prime candidate for inclusion in the
AUTHORS file, and care needs to be taken when acting without explicit
consent.
There's a small handful of commits that look like they were done on
local machines, or CI hosts, or similar, where the git authorship config
wasn't set up properly. Its obvious what this should look like, so I've
just done them.
The remainder is mapping Github noreply emails to either an
obviously-correct Signed-off-by trailer, or to a an author from another
commit. This was mostly done by hand, so there may be errors, but I
think its close. I do not understand where these come from - I know that
they're what commits made via Github web look like when there's no real
address set on the account, but I find it hard to believe that so many
of these came through the web, especially given the complexity of most
of the changes. I suspect there's some kind of merge helper tool in play
here. Regardless, the history is set now, and this tries to get it back
on track.
Obviously, all of this helps the history look tidy, but this also feeds
into the AUTHORS update script. See next commit.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #15374
We should probably consider using/honouring the standard --with-bashcompletiondir
autoconf option as well, but that's something to do later.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Umer Saleem <usaleem@ixsystems.com> Signed-off-by: Sam James <sam@gentoo.org>
Closes #15372
Alexander Motin [Fri, 6 Oct 2023 17:09:27 +0000 (13:09 -0400)]
ZIL: Reduce maximum size of WR_COPIED to 7.5K
Benchmarks show that at certain write sizes range lock/unlock take
not so much time as extra memory copy. The exact threshold is not
obvious due to other overheads, but it is definitely lower than
~63KB used before. Make it configurable, defaulting at 7.5KB,
that is 8KB of nearest malloc() size minus itx and lr structs.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15353
siv0 [Fri, 6 Oct 2023 16:53:23 +0000 (18:53 +0200)]
rpm: Fix `make rpm` on Debian/Ubuntu
The recent patch to change the bash completion install location based
on the Distribution, ignored that it should still be possible to
create RPMs on Debian derived systems. Additionally `make deb` itself
creates RPMs and converts them via `alien`.
This patch adds the bashcompletiondir variable to the rpm defines and
uses this for the location, where to get the bash completion file.
It still changes the location on Debian/Ubuntu systems in the final
packages from /etc/bash_completion.d to
/usr/share/bash-completion/completions
Rob Norris [Sat, 16 Sep 2023 07:02:02 +0000 (17:02 +1000)]
import: require force when cachefile hostid doesn't match on-disk
Previously, if a cachefile is passed to zpool import, the cached config
is mostly offered as-is to ZFS_IOC_POOL_TRYIMPORT->spa_tryimport(), and
the results are taken as the canonical pool config and handed back to
ZFS_IOC_POOL_IMPORT.
In the course of its operation, spa_load() will inspect the pool and
build a new config from what it finds on disk. However, it then
regenerates a new config ready to import, and so rightly sets the hostid
and hostname for the local host in the config it returns.
Because of this, the "require force" checks always decide the pool is
exported and last touched by the local host, even if this is not true,
which is possible in a HA environment when MMP is not enabled. The pool
may be imported on another head, but the import checks still pass here,
so the pool ends up imported on both.
(This doesn't happen when a cachefile isn't used, because the pool
config is discovered in userspace in zpool_find_import(), and that does
find the on-disk hostid and hostname correctly).
Since the systemd zfs-import-cache.service unit uses cachefile imports,
this can lead to a system returning after a crash with a "valid"
cachefile on disk and automatically, quietly, importing a pool that has
already been taken up by a secondary head.
This commit causes the on-disk hostid and hostname to be included in the
ZPOOL_CONFIG_LOAD_INFO item in the returned config, and then changes the
"force" checks for zpool import to use them if present.
This method should give no change in behaviour for old userspace on new
kernels (they won't know to look for the new config items) and for new
userspace on old kernels (the won't find the new config items).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc.
Closes #15290
Rob Norris [Mon, 18 Sep 2023 01:07:32 +0000 (11:07 +1000)]
tests: add tests for zpool import behaviour when hostid changes
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc.
Closes #15290
Rob N [Fri, 6 Oct 2023 16:06:29 +0000 (03:06 +1100)]
zfsconcepts: add description of block cloning
Here I'm trying to succinctly introduce the concept, the basics of its
construction, how its different to dedup, how to use it, and where its
limitations lie, in four paragraphs and with enough searchable terms to
help the reader find more information both within OpenZFS and elsewhere.
Phew.
Sponsored-By: Klara, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #15362
Alexander Motin [Fri, 6 Oct 2023 16:04:00 +0000 (12:04 -0400)]
Reduce number of metaslab preload taskq threads.
Before this change ZFS created threads for 50% of CPUs for each top-
level vdev. Plus it created the same number of threads for embedded
log groups (that have only one metaslab and don't need any preload).
As result, on system with 80 CPUs and pool of 60 vdevs this resulted
in 4800 metaslab preload threads, that is absolutely insane.
This patch changes the preload threads to 50% of CPUs in one taskq
per pool, so on the mentioned system it will be only 40 threads.
Among other things this fixes zdb on the mentioned system and pool
on FreeBSD, that failed to create so many threads in one process.
Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15319
Alexander Motin [Tue, 3 Oct 2023 15:57:48 +0000 (11:57 -0400)]
ARC: Drop different size headers for crypto
To reduce memory usage ZFS crypto allocated bigger by 56 bytes ARC
headers only when specific block was encrypted on disk. It was a
nice optimization, except in some cases the code reallocated them
on fly, that invalidated header pointers from the buffers. Since
the buffers use different locking, it created number of races, that
were originally covered (at least partially) by b_evict_lock, used
also to protection evictions. But it has gone as part of #14340.
As result, as was found in #15293, arc_hdr_realloc_crypt() ended
up unprotected and causing use-after-free.
Instead of introducing some even more elaborate locking, this patch
just drops the difference between normal and protected headers. It
cost us additional 56 bytes per header, but with couple patches
saving 24 bytes, the net growth is only 32 bytes with total header
size of 232 bytes on FreeBSD, that IMHO is acceptable price for
simplicity. Additional locking would also end up consuming space,
time or both.
Reviewe-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15293
Closes #15347
Alexander Motin [Fri, 6 Oct 2023 15:56:17 +0000 (11:56 -0400)]
ARC: Remove b_bufcnt/b_ebufcnt from ARC headers
In most cases we do not care about exact number of buffers linked
to the header, we just need to know if it is zero, non-zero or one.
That can easily be checked just looking on b_buf pointer or in some
cases derefencing it.
b_ebufcnt is read only once, and in that case we already traverse
the list as part of arc_buf_remove(), so second traverse should not
be expensive.
This reduces L1 ARC header size by 8 bytes and full crypto header by
16 bytes, down to 176 and 232 bytes on FreeBSD respectively.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15350
Rob N [Fri, 6 Oct 2023 15:39:20 +0000 (02:39 +1100)]
tests/block_cloning: sync before write in fallback test
We're still seeing this test fail intermittently (that is, the clone
happens), which must mean the write and the clone can still be happening
on different txgs.
It might be that there's still activity after the pool is created. So
here we force a sync before starting the write.
Sponsored-By: Klara Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #15359
Andrew Turner [Tue, 3 Oct 2023 22:12:36 +0000 (23:12 +0100)]
Add BTI landing pads to the AArch64 SHA2 assembly
The Arm Branch Target Identification (BTI) extension guards against
branching to an unintended instruction.
To support BTI add the landing pad instructions to the SHA2 functions.
These are from the hint space so are a nop on hardware that lacks BTI
support or if BTI isn't enabled.
Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Andrew Turner <andrew.turner4@arm.com>
Closes #14862
Closes #15339
contrib: bash_completion.d: make install destination vendor dependent
Certain Linux distributions (Debian/Ubuntu at least) expect
bash-completion snippets to be installed in
/usr/share/bash-completion/completions instead of
/etc/bash_completion.d.
This patch sets the bashcompletiondir variable based on the vendor,
inspired by similar settings for initdir and initconfdir.
It seems that commit 612b8dff5bc3d827efb864a199a62bda1a419254
caused the file to be installed in the first-place (thus the error
when building debian packages only became apparent when testing a
2.2.0-rc4 build)
The change only sets the variable in Makefile context - the
rpm/zfs.spec.in file has the path hardcoded as
%{_sysconfdir}/bash_completion.d/zfs, but since running
```
./configure --sysconfdir=/myetc ; make rpm
```
also results in all relevant files to be installed in /etc instead of
/myetc I assume this can remain as is.
Umer Saleem [Mon, 2 Oct 2023 23:58:54 +0000 (04:58 +0500)]
Add '-u' - nomount flag for zfs set
This commit adds '-u' flag for zfs set operation. With this flag,
mountpoint, sharenfs and sharesmb properties can be updated
without actually mounting or sharing the dataset.
Previously, if dataset was unmounted, and mountpoint property was
updated, dataset was not mounted after the update. This behavior
is changed in #15240. We mount the dataset whenever mountpoint
property is updated, regardless if it's mounted or not.
To provide the user with option to keep the dataset unmounted and
still update the mountpoint without mounting the dataset, '-u'
flag can be used.
If any of mountpoint, sharenfs or sharesmb properties are updated
with '-u' flag, the property is set to desired value but the
operation to (re/un)mount and/or (re/un)share the dataset is not
performed and dataset remains as it was before.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #15322
Chunwei Chen [Mon, 2 Oct 2023 23:58:01 +0000 (16:58 -0700)]
Fix invalid pointer access in trace_dbuf.h
In dnode_destroy, dn_objset is invalidated. However, it will later call
into dbuf_destroy, in which DTRACE_SET_STATE will try to access spa_name
via dn_objset causing illegal pointer access.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #15333
George Amanakis [Mon, 2 Oct 2023 23:57:09 +0000 (01:57 +0200)]
Report ashift of L2ARC devices in zdb
Commit 8af1104f does not actually store the ashift of cache devices in
their label. However, in order to facilitate reporting the ashift
through zdb, we enable this in the present commit. We also document
how the retrieval of the ashift is done.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #15331
Alexander Motin [Fri, 29 Sep 2023 15:22:46 +0000 (11:22 -0400)]
Restrict short block cloning requests
If we are copying only one block and it is smaller than recordsize
property, do not allow destination to grow beyond one block if it
is not there yet. Otherwise the destination will get stuck with
that block size forever, that can be as small as 512 bytes, no
matter how big the destination grow later.
Reviewed-by: Kay Pedersen <mail@mkwg.de> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15321
Brian Behlendorf [Fri, 29 Sep 2023 15:21:25 +0000 (08:21 -0700)]
Tweak rebuild in-flight hard limit
Vendor testing shows we should be able to get a little more
performance if we further relax the hard limit which we're hitting.
Authored-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #15324
Akash B [Thu, 28 Sep 2023 21:10:07 +0000 (02:40 +0530)]
Fix ENOSPC for extended quota
When unlinking multiple files from a pool at 100% capacity, it
was possible for ENOSPC to be returned after the first few unlinks.
This issue was fixed previously by PR #13172 but then this was
again introduced by PR #13839.
This is resolved using the existing mechanism of returning ERESTART
when over quota as long as we know enough space will shortly be
available after processing the pending deferred frees.
Also, updated the existing testcase which reliably reproduced the
issue without this patch.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Dipak Ghosh <dipak.ghosh@hpe.com> Signed-off-by: Akash B <akash-b@hpe.com>
Closes #15312
Paul Dagnelie [Wed, 27 Sep 2023 19:06:24 +0000 (12:06 -0700)]
Reduce trim min size even lower for tests to reduce flakiness
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #15315
Paul Dagnelie [Tue, 26 Sep 2023 21:37:28 +0000 (14:37 -0700)]
ZTS: Fix introduced test bug in block_cloning_copyfilerange
Reviewed-by: John Wren Kennedy <john.kennedy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #15316
Paul Dagnelie [Mon, 25 Sep 2023 18:14:00 +0000 (11:14 -0700)]
Set timeout before creating pool in test
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #15309
Paul Dagnelie [Fri, 22 Sep 2023 23:08:51 +0000 (16:08 -0700)]
Invoke zdb by guid to avoid import errors
The problem that was occurring is basically that a device was removed
by ztest and replaced with another device. It was then reguided. The
import then failed because there were two possible imports with the
same name; one with the new guid, and one with the old. This can
happen because the label writes from the device removal/replacement
can be subject to ztest's error injection.
The other ways to fix this would be to change the error injection to
not trigger on removals (which may not be technically feasible), or
to change the import code to not report configurations that are so
short on devices (which would potentially have unpleasant end-user
effects when trying to recover from data losses/device configuration
issues).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #15298
Alexander Motin [Fri, 22 Sep 2023 01:40:13 +0000 (21:40 -0400)]
ZIL: Avoid dbuf_read() in ztest_get_data()
While working on similar patches for zfs and zvol in #15153 I've
forgot about ztest. Update it also so that we test the same code
paths as use in production.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15301
Coleman Kane [Fri, 15 Sep 2023 05:07:03 +0000 (01:07 -0400)]
Linux 6.6 compat: fsync_bdev() has been removed in favor of sync_blockdev()
In Linux commit 560e20e4bf6484a0c12f9f3c7a1aa55056948e1e, the
fsync_bdev() function was removed in favor of sync_blockdev() to do
(roughly) the same thing, given the same input. This change
conditionally attempts to call sync_blockdev() if fsync_bdev() isn't
discovered during configure.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #15263
Coleman Kane [Fri, 15 Sep 2023 04:36:39 +0000 (00:36 -0400)]
Linux 6.6 compat: generic_fillattr has a new u32 request_mask added at arg2
In commit 0d72b92883c651a11059d93335f33d65c6eb653b, a new u32 argument
for the request_mask was added to generic_fillattr. This is the same
request_mask for statx that's present in the most recent API implemented
by zpl_getattr_impl. This commit conditionally adds it to the
zpl_generic_fillattr(...) macro, as well as the zfs_getattr_fast(...)
implementation, when configure determines it's present in the kernel's
generic_fillattr(...).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #15263
Coleman Kane [Tue, 12 Sep 2023 03:21:29 +0000 (23:21 -0400)]
Linux 6.6 compat: use inode_get/set_ctime*(...)
In Linux commit 13bc24457850583a2e7203ded05b7209ab4bc5ef, direct access
to the i_ctime member of struct inode was removed. The new approach is
to use accessor methods that exclusively handle passing the timestamp
around by value. This change adds new tests for each of these functions
and introduces zpl_* equivalents in include/os/linux/zfs/sys/zpl.h. In
where the inode_get/set_ctime*() functions exist, these zpl_* calls will
be mapped to the new functions. On older kernels, these macros just wrap
direct-access calls. The code that operated on an address of ip->i_ctime
to call ZFS_TIME_DECODE() now will take a local copy using
zpl_inode_get_ctime(), and then pass the address of the local copy when
performing the ZFS_TIME_DECODE() call, in all cases, rather than
directly accessing the member.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #15263
Closes #15257
Rob N [Fri, 22 Sep 2023 00:54:15 +0000 (10:54 +1000)]
tests/block_cloning: try harder to stay on same txg in fallback test
We've observed this test failing intermittently. When it does, the
"same block" check shows that both files have the same content, that is,
the file was cloned.
The only way this could have happened is if the open txg moved between
the dd and clonefile calls. That's possible because although we set
zfs_txg_timeout to be large, that only affects the wait time in the sync
thread at the start of a new txg; it doesn't change anything if its
currently waiting or working.
So here we just force the txgs to move immediately before, which should
get both operations onto the same txg as intented.
Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris Rob Norris <rob.norris@klarasystems.com>
Closes #15303
Tony Hutter [Thu, 21 Sep 2023 15:36:26 +0000 (08:36 -0700)]
Add zfs_prepare_disk script for disk firmware install
Have libzfs call a special `zfs_prepare_disk` script before a disk is
included into the pool. The user can edit this script to add things
like a disk firmware update or a disk health check. Use of the script
is totally optional. See the zfs_prepare_disk manpage for full details.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #15243
Rob N [Wed, 20 Sep 2023 23:56:45 +0000 (09:56 +1000)]
status: report pool suspension state under failmode=continue
When failmode=continue is set and the pool suspends, both 'zpool status'
and the 'zfs/pool/state' kstat ignore it and report the normal vdev tree
state. There's no clear indicator that the pool is suspended. This is
unlike suspend in failmode=wait, or suspend due to MMP check failure,
which both report "SUSPENDED" explicitly.
This commit changes it so SUSPENDED is reported for failmode=continue
the same as for other modes.
Rationale:
The historical behaviour of failmode=continue is roughly, "press on as
though all is well". To this end, the fact that the pool had suspended
was not shown, to maintain the façade that all is well.
Its unclear why hiding this information was considered appropriate. One
possibility is that it was expected that a true pool fault would always
be reported as DEGRADED or FAULTED, and that the pool could not suspend
without these happening.
That is not necessarily true, as vdev health and suspend state are only
loosely connected, such that a pool in (apparent) good health can be
suspended for good reasons, and of course a degraded pool does not lead
to suspension. Even if that expectation were true, there's still a
difference in urgency - a degraded pool may not need to be attended to
for hours, while a suspended pool is most often unusable until an
operator intervenes.
An operator that has set failmode=continue has presumably done so
because their workload is one that can continue to operate in a useful
way when the pool suspends. In this case the operator still needs a
clear indicator that there is a problem that needs attending to.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #15297
Paul Dagnelie [Wed, 20 Sep 2023 23:39:38 +0000 (16:39 -0700)]
Fix occasional rsend test crashes
We have occasional crashes in the rsend tests. Debugging revealed
that this is because the send_worker thread is getting EINTR from
splice(). This happens when a non-fatal signal is received during
the syscall. We should retry the syscall, rather than exiting failure.
Tweak the loop to only break if the splice is finished or we receive
a non-EINTR error.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #15273
Alexander Motin [Wed, 20 Sep 2023 18:17:11 +0000 (14:17 -0400)]
ZIL: Fix potential race on flush deferring.
zil_lwb_set_zio_dependency() can not set write ZIO dependency on
previous LWB's write ZIO if one is already in done handler and set
state to LWB_STATE_WRITE_DONE. So theoretically done handler of
next LWB's write ZIO may run before done handler of previous LWB
write ZIO completes. In such case we can not defer flushes, since
the flush issue process is not locked.
This may fix some reported assertions of lwb_vdev_tree not being
empty inside zil_free_lwb().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15278
- Remove unnecessary parentheses around variable names.
- Remove spaces between the type and variable in casts.
- Make the panic message for VERIFY0() reflect how the macro is used.
- Use %p to format pointers, except in Linux kernel code.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kay Pedersen <mail@mkwg.de> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Dag-Erling Smørgrav <des@FreeBSD.org>
Closes #15225
Improve the handling of sharesmb,sharenfs properties
For sharesmb and sharenfs properties, the status of setting the
property is tied with whether we succeed to share the dataset or
not. In case sharing the dataset is not successful, this is
treated as overall failure of setting the property. In this case,
if we check the property after the failure, it is set to on.
This commit updates this behavior and the status of setting the
share properties is not returned as failure, when we fail to
share the dataset.
For sharenfs property, if access list is provided, the syntax
errors in access list/host adresses are not validated until after
setting the property during postfix phase while trying to
share the dataset. This is not correct, since the property has
already been set when we reach there.
Syntax errors in access list/host addresses are validated while
validating the property list, before setting the property and
failure is returned to user in this case when there are errors
in access list.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #15240
There are some inconsistencies in the handling of mountpoint
property. This commit updates the behavior and makes it
consistent.
If mountpoint property is set when dataset is unmounted, this
would update the mountpoint property. The mountpoint could be
valid or invalid in this case. Setting the mountpoint property
would result in success in this case. Dataset would still be
unmounted here.
On the other hand, if dataset is mounted and mountpoint
property is updated to something invalid where mount cannot be
successful, for example, setting the mountpoint inside a readonly
directory. This would unmount the dataset, set the mountpoint
property to requested value and tries to mount the dataset. The
mount operation returns error and this error is treated as
overall failure of setting the property while the property is
actually set.
To make the behavior consistent in case dataset is mounted or
unmounted, we should try to mount the dataset whenever mountpoint
property is updated. This would result in mounting the datasets
if canmount property is set to on, regardless if the dataset was
previously unmounted.
The failure in mount operation while setting the mountpoint
property should not be treated as failure, since the property is
actually set now to user requested value.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #15240
Rob N [Tue, 19 Sep 2023 16:06:47 +0000 (02:06 +1000)]
cmd: add 'help' subcommand to zpool and zfs
'program help subcommand' is a reasonably common pattern for
multifunction command-line programs. This commit adds support for that
style to the zpool and zfs commands.
When run as 'zpool help [<topic>]' or 'zfs help [<topic>]', executes the
'man' program on the PATH with the most likely manpage name for the
requested topic: "zpool-<topic>" or "zfs-<topic>" for subcommands, or
"zpool<topic>" or "zfs<topic>" for the "concepts" and "props" topics.
If no topic is supplied, uses the top "zpool" or "zfs" pages.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kay Pedersen <mail@mkwg.de> Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #15288
Paul Dagnelie [Tue, 19 Sep 2023 16:02:23 +0000 (09:02 -0700)]
Fix incorrect expected error in ztest
There is an occasional ztest failure that looks like ztest: attach
(/var/tmp/zloop-run/ztest.13a 570425344, draid1-1-0 532152320, 1)
returned 22, expected 95. This is because the value that we return
is EINVAL, but expected_error is set incorrectly.
Change the expected_error value to match both the comment and the
actual error value.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #15295
Paul Dagnelie [Tue, 19 Sep 2023 15:58:14 +0000 (08:58 -0700)]
Fix l2arc_apply_transforms ztest crash
In #13375 we modified the allocation size of the buffer that we use
to apply l2arc transforms to be the size of the arc hdr we're using,
rather than the allocation size that will be in place on the disk,
because sometimes the hdr size is larger. Unfortunately, sometimes
the allocation size is larger, which means that we overflow the buffer
in that case. This change modifies the allocation to be the max of
the two values
Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #15177
Closes #15248
Rob N [Tue, 19 Sep 2023 15:48:02 +0000 (01:48 +1000)]
tests: install missing PAM tests
'pam_change_unmounted' and 'pam_recursive' both exist and are referenced
by the test run config, but weren't being installed and so are excluded.
This gets them installed so they will run as expected.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kay Pedersen <mail@mkwg.de> Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #15291
George Amanakis [Tue, 19 Sep 2023 00:06:35 +0000 (02:06 +0200)]
Update the MOS directory on spa_upgrade_errlog()
spa_upgrade_errlog() does not update the MOS directory when the
head_errlog feature is enabled. In this case if spa_errlog_sync() is not
called, the MOS dir references the old errlog_last and errlog_sync
objects. Thus when doing a scrub a panic will occur:
Added in ab26409db753 ("Linux 3.1 compat, super_block->s_shrink"), with
the only consumer which needed the count getting retired in 066e82522101
("Linux compat: Minimum kernel version 3.10").
The counter gets in the way of not maintaining the list to begin with.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #15274
update max_variance limit in zdb_block_size_histogram test for CI
Commit 2d7843401a628ef8c483229742dd58bca70bc27e had previously
updated this hardcoded limit to allow for CI testing. As there
is no deterministic pass/fail value, the need has arisen for
one more small increase.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Edmund Nadolski <edmund.nadolski@ixsystems.com>
Closes #15252
Alexander Motin [Sat, 9 Sep 2023 17:22:36 +0000 (13:22 -0400)]
Add more constraints for block cloning.
- We cannot clone into files with smaller block size if there is
more than one block, since we can not grow the block size.
- Block size must be power-of-2 if destination offset != 0, since
there can be no multiple blocks of non-power-of-2 size.
The first should handle the case when destination file has several
blocks but still is not bigger than one block of the source file.
The second fixes panic in dmu_buf_hold_array_by_dnode() on attempt
to concatenate files with equal but non-power-of-2 block sizes.
While there, assert that error is reported if we made no progress.
Reviewed-by: Pawel Jakub Dawidek <pawel@dawidek.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Kay Pedersen <mail@mkwg.de> Reviewed-by: Umer Saleem <usaleem@ixsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15251
Fix by cleaning up all the sub-entries inside "kernel/spl" when the
spl module is unloaded.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Closes #15239
ZFS historically has had several space allocators that were
dynamically selectable. While these have been retained in
OpenZFS, only a single allocator has been statically compiled
in. This patch compiles all allocators for OpenZFS and provides
a module parameter to allow for manual selection between them.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Edmund Nadolski <edmund.nadolski@ixsystems.com>
Closes #15218
Relax error reporting in zpool import and zpool split
For zpool import and zpool split, zpool_enable_datasets is called
to mount and share all datasets in a pool. If there is an error
while mounting or sharing any dataset in the pool, the status of
import or split is reported as failure. However, the changes do
show up in zpool list.
This commit updates the error reporting in zpool import and zpool
split path. More descriptive messages are shown to user in case
there is an error during mount or share. Errors in mount or share
do not effect the overall status of zpool import and zpool split.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #15216
Andrea Righi [Sat, 2 Sep 2023 00:21:40 +0000 (02:21 +0200)]
Linux 6.5 compat: safe cleanup in spl_proc_fini()
If we fail to create a proc entry in spl_proc_init() we may end up
calling unregister_sysctl_table() twice: one in the failure path of
spl_proc_init() and another time during spl_proc_fini().
Avoid the double call to unregister_sysctl_table() and while at it
refactor the code a bit to reduce code duplication.
This was accidentally introduced when the spl code was
updated for Linux 6.5 compatibility.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Closes #15234
Closes #15235
Alexander Motin [Sat, 2 Sep 2023 00:14:50 +0000 (20:14 -0400)]
ZIL: Change ZIOs issue order.
In zil_lwb_write_issue(), after issuing lwb_root_zio/lwb_write_zio,
we have no right to access lwb->lwb_child_zio. If it was not there,
the first two ZIOs may have already completed and freed the lwb.
ZIOs issue in opposite order from children to parent should keep
the lwb valid till the end, since the lwb can be freed only after
lwb_root_zio completion callback.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15233
Alexander Motin [Sat, 2 Sep 2023 00:13:52 +0000 (20:13 -0400)]
ZIL: Revert zl_lock scope reduction.
While I have no reports of it, I suspect possible use-after-free
scenario when zil_commit_waiter() tries to dereference zcw_lwb
for lwb already freed by zil_sync(), while zcw_done is not set.
Extension of zl_lock scope as it was originally should block
zil_sync() from freeing the lwb, closing this race.
This reverts #14959 and couple chunks of #14841.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15228
dmu_buf_will_clone: change assertion to fix 32-bit compiler warning
Building module/zfs/dbuf.c for 32-bit targets can result in a warning:
In file included from
/usr/src/sys/contrib/openzfs/include/sys/zfs_context.h:97,
from /usr/src/sys/contrib/openzfs/module/zfs/dbuf.c:32:
/usr/src/sys/contrib/openzfs/module/zfs/dbuf.c: In function
'dmu_buf_will_clone':
/usr/src/sys/contrib/openzfs/lib/libspl/include/assert.h:116:33: error:
cast from pointer to integer of different size
[-Werror=pointer-to-int-cast]
116 | const uint64_t __left = (uint64_t)(LEFT);
\
| ^
/usr/src/sys/contrib/openzfs/lib/libspl/include/assert.h:148:25: note:
in expansion of macro 'VERIFY0'
148 | #define ASSERT0 VERIFY0
| ^~~~~~~
/usr/src/sys/contrib/openzfs/module/zfs/dbuf.c:2704:9: note: in
expansion of macro 'ASSERT0'
2704 | ASSERT0(dbuf_find_dirty_eq(db, tx->tx_txg));
| ^~~~~~~
This is because dbuf_find_dirty_eq() returns a pointer, which if
pointers are 32-bit results in a warning about the cast to uint64_t.
Instead, use the ASSERT3P() macro, with == and NULL as second and third
arguments, which should work regardless of the target's bitness.
Reviewed-by: Kay Pedersen <mail@mkwg.de> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Dimitry Andric <dimitry@andric.com>
Closes #15224
Paul Dagnelie [Sat, 26 Aug 2023 18:34:43 +0000 (11:34 -0700)]
Increase limit of redaction list by using spill block
Currently redaction bookmarks and their associated redaction lists
have a relatively low limit of 36 redaction snapshots. This is imposed
by the number of snapshot GUIDs that fit in the bonus buffer of the
redaction list object. While this is more than enough for most use
cases, there are some limited cases where larger numbers would be
useful to support.
We tweak the redaction list creation code to use a spill block if
the number of redaction snapshots is above the amount that would fit
in the bonus buffer. We also make a small change to allow spill blocks
to be use for types of data besides SA. In order to fully leverage
this logic, we also change the redaction code to use vmem_alloc, to
handle extremely large allocations if needed. Finally, small tweaks
were made to the zfs commands and the test suite.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #15018
Rich Ercolani [Sat, 26 Aug 2023 18:25:46 +0000 (14:25 -0400)]
Avoid save/restoring AMX registers to avoid a SPR erratum
Intel SPR erratum SPR4 says that if you trip into a vmexit while
doing FPU save/restore, your AMX register state might misbehave...
and by misbehave, I mean save all zeroes incorrectly, leading to
explosions if you restore it.
Since we're not using AMX for anything, the simple way to avoid
this is to just not save/restore those when we do anything, since
we're killing preemption of any sort across our save/restores.
If we ever decide to use AMX, it's not clear that we have any
way to mitigate this, on Linux...but I am not an expert.
Rob N [Fri, 25 Aug 2023 17:31:29 +0000 (03:31 +1000)]
tests/block_cloning: rename and document get_same_blocks helper
`get_same_blocks` is a helper to compare two files and return a list of
the blocks that are clones of each other. Its very necessary for block
cloning tests.
Previously it was incorrectly called `unique_blocks`, which is the
_inverse_ of what it does (an early version did list unique blocks; it
was changed but the name was not). So if nothing else, it should be
called `duplicate_blocks`.
But, keeping the details of a clone operation in your head is actually
quite difficult, without the additional overhead of wondering how the
tools work. So I've renamed it to better describe what it does, added a
usage note, and changed it to return block indexes from 0 instead of 1,
to match how L0 blocks are normally counted.
Reviewed-by: Umer Saleem <usaleem@ixsystems.com> Reviewed-by: Kay Pedersen <mail@mkwg.de> Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #15181
As part of some internal gang block testing within Delphix
we hit the assertion removed by this patch. The assertion
was triggered by a ZIO that had two copies and was a gang
block making the following expression equal to 3:
```
MIN(zp->zp_copies + BP_IS_GANG(bp), spa_max_replication(spa))
```
and failing when we expected the above to be equal to
`BP_GET_NDVAS(bp)`.
The assertion is no longer valid since the following commit:
```
commit 14872aaa4f909d72c6b5e4105dadcfa13c7d9d66
Author: Matthew Ahrens <matthew.ahrens@delphix.com>
Date: Mon Feb 6 09:37:06 2023 -0800
EIO caused by encryption + recursive gang
```
The above commit changed gang block headers so they can't
have more than 2 copies but the assertion in question from
this PR was never updated.
Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #15180
Alexander Motin [Fri, 25 Aug 2023 00:08:49 +0000 (20:08 -0400)]
ZIL: Second attempt to reduce scope of zl_issuer_lock.
The previous patch #14841 appeared to have significant flaw, causing
deadlocks if zl_get_data callback got blocked waiting for TXG sync. I
already handled some of such cases in the original patch, but issue
#14982 shown cases that were impossible to solve in that design.
This patch fixes the problem by postponing log blocks allocation till
the very end, just before the zios issue, leaving nothing blocking after
that point to cause deadlocks. Before that point though any sleeps are
now allowed, not causing sync thread blockage. This require slightly
more complicated lwb state machine to allocate blocks and issue zios
in proper order. But with removal of special early issue workarounds
the new code is much cleaner now, and should even be more efficient.
Since this patch uses null zios between write, I've found that null
zios do not wait for logical children ready status in zio_ready(),
that makes parent write to proceed prematurely, producing incorrect
log blocks. Added ZIO_CHILD_LOGICAL_BIT to zio_wait_for_children()
fixes it.
Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15122