Rob Norris [Tue, 16 Apr 2024 04:56:35 +0000 (14:56 +1000)]
tests/quota: consistently clear quota property between tests
When run in isolation, quota_005_pos would fail in cleanup because it
would attempt restore the previous quota, which was 0, and so get an
error (because you can't set quota to '0', you have to use 'none').
It worked as part of the quota tag set because the previous tests did
not clean up their quota, so there was always a non-zero quota to return
to.
This adds a simple quota reset function, and has all quota tests run it
at cleanup. For the ones that weren't cleaning up, they now do, and for
quota_005_pos, which was trying to do the right thing, it now just
resets it.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16097
Rob Norris [Tue, 16 Apr 2024 05:03:33 +0000 (15:03 +1000)]
tests/quota_005_pos: use a long int for doubling the quota size
When run in isolation, quota_005_pos would see an empty ~300G dataset.
Doubling it's space overflows a int32, which meant it was trying to then
set the quota to a negative value, and would fail.
When run as part of the quota tests, the filesystem appears to have
stuff in it, and so a lower available space, which doesn't overflow, and
so succeeds.
The bare minimum fix seems to be to use a int64 for the available space,
so it can be comfortably doubled. Here it is.
(Also a typo fix and a tiny bit of cleanup).
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16097
Ameer Hamza [Fri, 19 Apr 2024 17:19:12 +0000 (22:19 +0500)]
Add zfetch stats in arcstats
arc_summary also reports zfetch stats but it's inconvenient to monitor
contiguously incrementing numbers. Adding them in arcstats allows us to
observe streams more conveniently.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #16094
Tony Hutter [Wed, 17 Apr 2024 16:29:21 +0000 (09:29 -0700)]
Linux 6.8 compat: META (#16099)
Update the META file to reflect compatibility with the 6.8 kernel.
Signed-off-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Rob N [Tue, 16 Apr 2024 16:13:01 +0000 (02:13 +1000)]
zts: add a debug option to get full test output
The test runner accumulates output from individual tests, then writes it
to the log at the end. If a test hangs or crashes the system half way
through, we get no insight into how it got to where it did.
This adds a -D option for "debug". When set, all test output is written
to stdout.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16096
Do no use .cfi_negate_ra_state within the assembly on Arm64
Compiling openzfs on aarch64 with gcc-8 and gcc-9 is failing currently.
See issue #14965 for deeper context.
On platforms without pointer authentication, .cfi_negate_ra_state can be
defined to a no-op:
https://sourceware.org/git/?p=binutils-gdb.git;a=blob;f=gdb/aarch64-tdep.c#l1413
I have tested this on Arm64 FreeBSD 13.2 and AlmaLinux-8.
Andrew Turner [Mon, 15 Apr 2024 20:53:39 +0000 (21:53 +0100)]
Add the BTI elf note to the AArch64 SHA2 assembly
On ELF platforms there is a note to specify when an application or
library supports BTI. When linking one of these the linker needs
all input object files to have the note. If not it will not include
it in the output file.
Normally the compiler would generate it, but for assembly files we
need to do it our selves.
Add the note to the aarch64 sha256 and sha512 assembly files.
Tested by building with BTI enabled and using the -zbti-report=error
flag to lld that makes it an error if the note is missing.
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrew Turner <andrew.turner4@arm.com>
Closes #16086
Rob N [Mon, 15 Apr 2024 20:52:20 +0000 (06:52 +1000)]
zinject: "no-op" error injection
When injected, this causes the matching IO to appear to succeed, but the
actual work is never submitted to the physical device. This can be used
to simulate a write-back cache servicing a write, but the backing device
has failed and the cache cannot complete the operation in the
background.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16085
Rob N [Mon, 15 Apr 2024 20:44:12 +0000 (06:44 +1000)]
zts: allow running a single test by name only
Specifying a single test is kind of a hassle, because the full relative
path under the test suite dir has to be included, but it's not always
clear what that path even is.
This change allows `-t` to take the name of a single test instead of a
full path. If the value has no `/` characters, we search for a file of
that name under the test root, and if found, use that as the full test
path instead.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16088
Kernel documentation for the discard_granularity property says:
A discard_granularity of 0 means that the device does not support
discard functionality.
Some older kernels had drivers (notably loop, but also some USB-SATA
adapters) that would set the QUEUE_FLAG_DISCARD capability flag, but
have discard_granularity=0. Since 5.10 (torvalds/linux@b35fd7422c2f) the
discard entry point blkdev_issue_discard() has had a check for this,
which would immediately reject the call with EOPNOTSUPP, and throw a
scary diagnostic message into the log. See #16068.
Since 6.8, the block layer sets a non-zero default for
discard_granularity (torvalds/linux@3c407dc723bb), and a future kernel
will remove the check entirely[1].
As such, there's no good reason for us to enable discard when
discard_granularity=0. The kernel will never let the request go in
anyway; better that we just disable it so we can report it properly to
the user.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16068
Closes #16082
Rob Norris [Thu, 4 Apr 2024 11:35:00 +0000 (22:35 +1100)]
zio: rename ZIO_TYPE_IOCTL to ZIO_TYPE_FLUSH
The only possible ioctl is a flush, and any other kind of meta-operation
introduced in the future is likely to have different semantics (much
like trim did). So, lets just call it what it is.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16064
Rob Norris [Wed, 10 Apr 2024 06:07:24 +0000 (16:07 +1000)]
dkio: remove kernel dkio.h compatibility header
Without DKIOCFLUSHWRITECACHE, we no longer need the compat header. Note
that we're keeping the userspace SPL compat header, which is used by
libefi.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16064
Rob Norris [Thu, 4 Apr 2024 11:34:54 +0000 (22:34 +1100)]
zio: remove io_cmd and DKIOCFLUSHWRITECACHE
There's no other options, so we can just always assume its a flush.
Includes some light refactoring where a switch statement was doing
control flow that no longer works.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16064
Rob Norris [Thu, 4 Apr 2024 11:34:42 +0000 (22:34 +1100)]
zio: remove zio_ioctl()
It only had one user, zio_flush(), and there are no other vdev ioctls
anyway.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16064
This commit adds support for mounting a dataset along with all of
it's children with '-R' flag for zfs mount. There can be scenarios
where we want to mount all datasets under one hierarchy instead of
mounting all datasets present on system with '-a' flag.
'-R' flag should work on all root and non-root datasets. Usage
information and man page has been updated for zfs mount. A test
for verifying the behavior for '-R' flag is also added.
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #16015
Rob Norris [Wed, 10 Apr 2024 03:14:13 +0000 (13:14 +1000)]
vdev_disk: fix alignment check when buffer has non-zero starting offset
If a linear buffer spans multiple pages, and the first page has a
non-zero starting offset, the checker would not include the offset, and
so would think there was an alignment gap at the end of the first page,
rather than at the start.
That is, for a 16K buffer spread across five pages with an initial 512B
offset:
Since it's already a linear ABD, the "linearising" copy would just reuse
the buffer as-is, and the second check would failing, tripping the
VERIFY in vdev_disk_io_rw().
This commit fixes all this by including the offset in the check for
end-of-page alignment.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16076
Rob Norris [Wed, 10 Apr 2024 01:19:50 +0000 (11:19 +1000)]
tests: add test for vdev_disk page alignment check
This provides a test driver and a set of test vectors for the page
alignment check callback function vdev_disk_check_pages_cb().
Because there's no good facility for exposing this function to a
userspace test right now, for now I'm just duplicating the function and
adding commentary to remind people to keep them in sync.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16076
Andy Fiddaman [Thu, 11 Apr 2024 21:38:22 +0000 (22:38 +0100)]
Illumos#16463 zfs_ioc_recv leaks nvlist
In https://www.illumos.org/issues/16463 it was observed that
an nvlist was being leaked in zfs_ioc_recv() due a missing
call to nvlist_free for "hidden_args".
For OpenZFS the same issue exists in zfs_ioc_recv_new() and
is addressed by this PR.
This change also properly frees nvlists in the unlikely
event that a call to get_nvlist() fails.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Andy Fiddaman <illumos@fiddaman.net>
Closes #16077
Jason Lee [Wed, 10 Apr 2024 22:01:39 +0000 (16:01 -0600)]
return NULL at end of send_progress_thread
Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jason Lee <jasonlee@lanl.gov>
Closes #16074
Alexander Motin [Tue, 9 Apr 2024 23:23:19 +0000 (19:23 -0400)]
L2ARC: Relax locking during write
Previous code held ARC state sublist lock throughout all L2ARC
write process, which included number of allocations and even ZIO
issues. Being blocked in any of those places the code could also
block ARC eviction, that could cause OOM activation or even dead-
lock if system is low on memory or one is too fragmented.
Fix it by dropping the lock as soon as we see a block eligible
for L2ARC writing and pick it up later using earlier inserted
marker. While there, also reduce scope of hash lock, moving
ZIO allocation and other operations not requiring header access
out of it. All operations requiring header access move under
hash lock, since L2_WRITING flag does not prevent header eviction
only transition to arc_l2c_only state with L1 header.
To be able to manipulate sublist lock and marker as needed add few
more multilist functions and modify one.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16040
Alexander Motin [Tue, 9 Apr 2024 23:14:04 +0000 (19:14 -0400)]
Small fix to prefetch ranges aggregation
When after #16022 adding new range we aggregate more than two
existing ranges, that should be very rare, only if several streams
overlap, we may need to zero not the last range, but some earlier.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16072
etc/init.d: decide which variant to use at build time.
Let Debian use the sysv-rc variant of the script, even when OpenRC is
installed. Unlike on Gentoo, OpenRC on Debian consumes both the
sysv-rc scripts and OpenRC ones. ZFS initscripts on Debian should be
the sysv-rc version to provide most compatibility and to integrate
with the rest of initscripts for dependency tracking.
Restrict the substitution in the Makefile to the dedicated list.
This construct is inspired by Mo Zhou's detection of the execution
shell and follows the strategy of Peter in 6ef28c526ba7.
As of 2024, the initscripts are mostly relevant on Debian, Gentoo and
their derivatives.
In `zpool status -t`, scrub date/time is reported using the C locale,
while trim time is reported using the current one. This is inconsistent.
This patch fixes that.
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Maxim Filimonov <che@bein.link>
Closes #15878
Closes #15879
Alexander Motin [Mon, 8 Apr 2024 22:23:43 +0000 (18:23 -0400)]
Remove db_state DB_NOFILL checks from syncing context
Syncing context should not depend on current state of dbuf, which
could already change several times in later transaction groups,
but rely solely on dirty record for the transaction group being
synced. Some of the checks seem already impossible, while instead
of others I think we should better check for absence of data in
the specific dirty record rather than DB_NOFILL.
Reviewed-by: Robert Evans <evansr@google.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16057
Alexander Motin [Mon, 8 Apr 2024 22:13:27 +0000 (18:13 -0400)]
Speculative prefetch for reordered requests
Before this change speculative prefetcher was able to detect a stream
only if all of its accesses are perfectly sequential. It was easy to
implement and is perfectly fine for single-threaded applications.
Unfortunately multi-threaded network servers, such as iSCSI, SMB or
NFS usually have plenty of threads and may often reorder requests,
preventing successful speculation and prefetch.
This change allows speculative prefetcher to detect streams even if
requests are reordered by introducing a list of 9 non-contiguous
ranges up to 16MB ahead of current stream position and filling the
gaps as more requests arrive. It also allows stream to proceed
even with holes up to a certain configurable threshold (25%).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16022
Alexander Motin [Mon, 8 Apr 2024 19:03:18 +0000 (15:03 -0400)]
Fix read errors race after block cloning
Investigating read errors triggering panic fixed in #16042 I've
found that we have a race in a sync process between the moment
dirty record for cloned block is removed and the moment dbuf is
destroyed. If dmu_buf_hold_array_by_dnode() take a hold on a
cloned dbuf before it is synced/destroyed, then dbuf_read_impl()
may see it still in DB_NOFILL state, but without the dirty record.
Such case is not an error, but equivalent to DB_UNCACHED, since
the dbuf block pointer is already updated by dbuf_write_ready().
Unfortunately it is impossible to safely change the dbuf state
to DB_UNCACHED there, since there may already be another cloning
in progress, that dropped dbuf lock before creating a new dirty
record, protected only by the range lock.
Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Robert Evans <evansr@google.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16052
Rob N [Mon, 8 Apr 2024 18:59:04 +0000 (04:59 +1000)]
zinject: inject device errors into ioctls
Adds 'ioctl' as a valid IO type for device error injection, so we can
simulate a flush error (which OpenZFS currently ignores, but that's by
the by).
To support this, adding ZIO_STAGE_VDEV_IO_DONE to ZIO_IOCTL_PIPELINE,
since that's where device error injection happens. This needs a small
exclusion to avoid the vdev_queue, since flushes are not queued, and I'm
assuming that the various failure responses are still reasonable for
flush failures (probes, media change, etc). This seems reasonable to me,
as a flush failure is not unlike a write failure in this regard, however
this may be too aggressive or subtle to assume in just this change.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16061
Rob N [Mon, 8 Apr 2024 18:50:24 +0000 (04:50 +1000)]
vdev_disk: ensure trim errors are returned immediately
After 06e25f9c4, the discard issuing code was organised such that if
requesting an async discard or secure erase failed before the IO was
issued (that is, calling __blkdev_issue_discard() returned an error),
the failed zio would never be executed, resulting in txg_sync hanging
forever waiting for IO to finish.
This commit fixes that by immediately executing a failed zio on error.
To handle the successful synchronous op case, we fake an async op by,
when not using an asynchronous submission method, queuing the successful
result zio as part of the discard handler.
Since it was hard to understand the differences between discard and
secure erase, and sync and async, across different kernel versions, I've
commented and reorganised the code a bit to try and make everything more
contained and linear.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16070
Rob N [Mon, 8 Apr 2024 18:38:49 +0000 (04:38 +1000)]
zvol_os: fix compile with blk-mq on Linux 4.x
99741bde5 accesses a cached blk-mq hardware context through the mq_hctx
field of struct request. However, this field did not exist until 5.0.
Before that, the private function blk_mq_map_queue() was used to dig it
out of broader queue context. This commit detects this situation, and
handles it with a poor-man's simulation of that function.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16069
Rob N [Mon, 8 Apr 2024 17:13:27 +0000 (03:13 +1000)]
zvol_os: fix build on Linux <3.13
99741bde5 introduced zvol_num_taskqs, but put it behind the HAVE_BLK_MQ
define, preventing builds on versions of Linux that don't have it
(<3.13, incl EL7).
Nothing about it seems dependent on blk-mq, so this just moves it out
from behind that define and so fixes the build.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16062
Ameer Hamza [Thu, 4 Apr 2024 01:21:25 +0000 (06:21 +0500)]
zvol: use multiple taskq
Currently, zvol uses a single taskq, resulting in throughput bottleneck
under heavy load due to lock contention on the single taskq. This patch
addresses the performance bottleneck under heavy load conditions by
utilizing multiple taskqs, thus mitigating lock contention. The number
of taskqs scale dynamically based on the available CPUs in the system,
as illustrated below:
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #15992
Pavel Snajdr [Thu, 4 Apr 2024 01:09:19 +0000 (03:09 +0200)]
Fix panics when truncating/deleting files
There's an union in dbuf_dirty_record_t; dr_brtwrite could evaluate
to B_TRUE if the dirty record is of another type than dl. Adding
more explicit dr type check before trying to access dr_brtwrite.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Closes #15983
The commit b53077a added zfs_prepare_disk.8 to the wrong list
dist_man_MANS, in which @zfsexecdir@ will not be properly substituted.
This leads to wrong path in the manpage in generated release tarballs.
Reported-by: Benda Xu <orv@debian.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Closes #15979
Paul Dagnelie [Wed, 3 Apr 2024 23:34:46 +0000 (16:34 -0700)]
Give a better message from 'zpool get' with invalid pool name
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Don Brady <don.brady@klarasystems.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #15942
Rob Norris [Wed, 27 Mar 2024 05:15:48 +0000 (16:15 +1100)]
tests: simple zinject disk fault arg check
Just making sure the valid values for disk faults are accepted.
Obviously we can do a lot more, but this will do to get us started.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #15953
Rob Norris [Thu, 29 Feb 2024 23:38:41 +0000 (10:38 +1100)]
zinject: show more device fault fields
Once there's a few different kinds injected, its pretty hard to see them
otherwise.
So, lets show IO type, error type and frequency fields in the table too.
Since we now have to convert from error code to pretty string, refactor
the error names into a table and add lookup functions.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #15953
Rob Norris [Tue, 26 Mar 2024 23:07:50 +0000 (10:07 +1100)]
Linux 6.9 compat: bdev handles are now struct file
bdev_open_by_path() is replaced by bdev_file_open_by_path(), which
returns a plain old struct file*. Release function is gone entirely; the
regular file release function fput() will take care of the bdev
specifics.
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/
Closes #16027
Closes #16033
Rob N [Wed, 3 Apr 2024 22:17:07 +0000 (09:17 +1100)]
vdev_disk: don't touch vbio after its handed off to the kernel
After IO is unplugged, it may complete immediately and vbio_completion
be called on interrupt context. That may interrupt or deschedule our
task. If its the last bio, the vbio will be freed. Then, we get
rescheduled, and try to write to freed memory through vbio->.
This patch just removes the the cleanup, and the corresponding assert.
These were leftovers from a previous iteration of vbio_submit() and were
always "belt and suspenders" ops anyway, never strictly required.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc Reported-by: Rich Ercolani <rincebrain@gmail.com> Reviewed-by: Laurențiu Nicola <lnicola@dend.ro> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16045
Closes #16050
Closes #16049
Rob N [Wed, 3 Apr 2024 22:13:27 +0000 (09:13 +1100)]
xdr: header cleanup
#16047 notes that include/os/freebsd/spl/rpc/xdr.h carried an
(apparently) incompatible license. While looking into it, it seems that
this file is actually unnecessary these days - FreeBSD's kernel XDR has
XDR_CONTROL, xdrmem_control and XDR_GET_BYTES_AVAIL, while userspace has
XDR_CONTROL and xdrmem_control, and our implementation of
XDR_GET_BYTES_AVAIL for libspl works nicely with it. So this removes
that file outright.
To keep the includes in nvpair.c tidy, I've made a few small adjustments
to the Linux headers. By definition, rpc/types.h provides bool_t and is
included before rpc/xdr.h, so I've created rpc/types.h for Linux. This
isn't necessary for userspace; both FreeBSD native and tirpc on Linux
already have these headers set up correctly.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/
Closes #16047
Closes #16051
Alexander Motin [Wed, 3 Apr 2024 22:04:26 +0000 (18:04 -0400)]
Improve dbuf_read() error reporting
Previous code reported non-ZIO errors only via return value, but
not via parent ZIO. It could cause NULL-dereference panics due
to dmu_buf_hold_array_by_dnode() ignoring the return value,
relying solely on parent ZIO status.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reported by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16042
Robert Evans [Sat, 30 Mar 2024 00:11:52 +0000 (20:11 -0400)]
Linux 5.18+ compat: Detect filemap_range_has_page
In v5.18 `filemap_range_has_page` moved to `pagemap.h`
`pagemap.h` has been around since 3.10 so just include both
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Robert Evans <evansr@google.com>
Closes #16034
Robert Evans [Fri, 29 Mar 2024 21:59:23 +0000 (17:59 -0400)]
Fix buffer underflow if sysfs file is empty
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Jason Lee <jasonlee@lanl.gov> Signed-off-by: Robert Evans <evansr@google.com>
Closes #16028
Closes #16035
Rob N [Fri, 29 Mar 2024 21:51:33 +0000 (08:51 +1100)]
vdev_disk: clean up spa/bdev mode conversion
43e8f6e37 introduced a subtle API misuse, in that it passed the output
from vdev_bdev_mode() back into itself. Fortunately, the
SPA_MODE_(READ|WRITE) bit values exactly map to the FMODE_(READ|WRITE) &
BLK_OPEN_(READ|WRITE) bit values, so it didn't result in a bug, but it
was hard to read and understand, so I cleaned it up.
In doing so, I noticed that the only call to vdev_bdev_mode() without
the "exclusive" flag set was in that misuse, and actually, we never do a
non-exclusive blkdev_get_by_path(). So I've just made exclusive be
always-on.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #15995
currently, the linux kernel allows 2^20 minor devices per major device
number. ZFS reserves blocks of 2^4 minors per zvol: 1 for the zvol
itself, the other 15 for the first partitions of that zvol. as a result,
only 2^16 such blocks are available for use.
there are no checks in place to avoid overflowing into the major device
number when more than 2^16 zvols are allocated (with volmode=dev or
default). instead of ignoring this limit, which comes with all sorts of
weird knock-on effects, detect this situation and simply fail allocating
the zvol block device early on.
without this safeguard, the kernel will reject the attempt to create an
already existing block device, but ZFS doesn't handle this error and
gets confused about which zvol occupies which minor slot, potentially
resulting in kernel NULL derefs and other issues later on.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Closes #16006
George Wilson [Fri, 29 Mar 2024 19:15:56 +0000 (15:15 -0400)]
Add ashift validation when adding devices to a pool
Currently, zpool add allows users to add top-level vdevs that have
different ashifts but doing so prevents users from being able to
perform a top-level vdev removal. Often times consumers may not realize
that they have mismatched ashifts until the top-level removal fails.
This feature adds ashift validation to the zpool add command and will
fail the operation if the sector size of the specified vdev does not
match the existing pool. This behavior can be disabled by using the -f
flag. In addition, new flags have been added to provide fine-grained
control to disable specific checks. These flags
are:
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Mark Maybee <mmaybee@delphix.com> Signed-off-by: George Wilson <gwilson@delphix.com>
Closes #15509
Robert Evans [Wed, 27 Mar 2024 21:59:16 +0000 (17:59 -0400)]
ZTS: fix flakiness in cp_files_002_pos
Fix RANDOM to not return zero.
Overwriting with `dd ... count=0` does not test anything.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Robert Evans <evansr@google.com>
Closes #16029
Alexander Motin [Mon, 18 Mar 2024 18:19:53 +0000 (14:19 -0400)]
BRT: Fix holes cloning.
- When reading L0 block pointers handle buffers without ones and
without dirty records as a holes. Those appear when dnode size
was increased, but the end was never written, so there are no new
indirection levels to store the pointers. It makes no sense to
return EAGAIN here, since sync won't create new indirection levels
until there will be actual writes.
- When cloning blocks set destination hole logical birth time
to the current TXG. Otherwise if we are cloning over existing
data, newly created holes may not be properly replicated later.
Use BP_SET_BIRTH() when possible to not replicate its logic.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15994
Closes #16007
Alexander Motin [Tue, 26 Mar 2024 00:13:45 +0000 (20:13 -0400)]
BRT: Skip getting length in brt_entry_lookup()
Unlike DDT, where ZAP values may have different lengths due to
compression, all BRT entries are identical 8-byte counters. It
does not make sense to first fetch the length only to assert it.
zap_lookup_uint64() is specifically designed to work with counters
of different size and should return error if something odd found.
Calling it straight allows to save some measurable CPU time.
Reviewed-by: Pawel Jakub Dawidek <pawel@dawidek.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15950
Rob Norris [Wed, 13 Mar 2024 23:57:30 +0000 (10:57 +1100)]
abd_iter_page: don't use compound heads on Linux <4.5
Before 4.5 (specifically, torvalds/linux@ddc58f2), head and tail pages
in a compound page were refcounted separately. This means that using the
head page without taking a reference to it could see it cleaned up later
before we're finished with it. Specifically, bio_add_page() would take a
reference, and drop its reference after the bio completion callback
returns.
If the zio is executed immediately from the completion callback, this is
usually ok, as any data is referenced through the tail page referenced
by the ABD, and so becomes "live" that way. If there's a delay in zio
execution (high load, error injection), then the head page can be freed,
along with any dirty flags or other indicators that the underlying
memory is used. Later, when the zio completes and that memory is
accessed, its either unmapped and an unhandled fault takes down the
entire system, or it is mapped and we end up messing around in someone
else's memory. Both of these are very bad.
The solution on these older kernels is to take a reference to the head
page when we use it, and release it when we're done. There's not really
a sensible way under our current structure to do this; the "best" would
be to keep a list of head page references in the ABD, and release them
when the ABD is freed.
Since this additional overhead is totally unnecessary on 4.5+, where
head and tail pages share refcounts, I've opted to simply not use the
compound head in ABD page iteration there. This is theoretically less
efficient (though cleaning up head page references would add overhead),
but its safe, and we still get the other benefits of not mapping pages
before adding them to a bio and not mis-splitting pages.
There doesn't appear to be an obvious symbol name or config option we
can match on to discover this behaviour in configure (and the mm/page
APIs have changed a lot since then anyway), so I've gone with a simple
version check.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
Rob Norris [Wed, 21 Feb 2024 00:07:21 +0000 (11:07 +1100)]
vdev_disk: use bio_chain() to submit multiple BIOs
Simplifies our code a lot, so we don't have to wait for each and
reassemble them.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
Rob Norris [Tue, 9 Jan 2024 02:28:57 +0000 (13:28 +1100)]
vdev_disk: add module parameter to select BIO submission method
This makes the submission method selectable at module load time via the
`zfs_vdev_disk_classic` parameter, allowing this change to be backported
to 2.2 safely, and disabled in favour of the "classic" submission method
if new problems come up.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
Rob Norris [Tue, 18 Jul 2023 01:11:29 +0000 (11:11 +1000)]
vdev_disk: rewrite BIO filling machinery to avoid split pages
This commit tackles a number of issues in the way BIOs (`struct bio`)
are constructed for submission to the Linux block layer.
The kernel has a hard upper limit on the number of pages/segments that
can be added to a BIO, as well as a separate limit for each device
(related to its queue depth and other scheduling characteristics).
ZFS counts the number of memory pages in the request ABD
(`abd_nr_pages_off()`, and then uses that as the number of segments to
put into the BIO, up to the hard upper limit. If it requires more than
the limit, it will create multiple BIOs.
Leaving aside the fact that page count method is wrong (see below), not
limiting to the device segment max means that the device driver will
need to split the BIO in half. This is alone is not necessarily a
problem, but it interacts with another issue to cause a much larger
problem.
The kernel function to add a segment to a BIO (`bio_add_page()`) takes a
`struct page` pointer, and offset+len within it. `struct page` can
represent a run of contiguous memory pages (known as a "compound page").
In can be of arbitrary length.
The ZFS functions that count ABD pages and load them into the BIO
(`abd_nr_pages_off()`, `bio_map()` and `abd_bio_map_off()`) will never
consider a page to be more than `PAGE_SIZE` (4K), even if the `struct
page` is for multiple pages. In this case, it will load the same `struct
page` into the BIO multiple times, with the offset adjusted each time.
With a sufficiently large ABD, this can easily lead to the BIO being
entirely filled much earlier than it could have been. This is also
further contributes to the problem caused by the incorrect segment limit
calculation, as its much easier to go past the device limit, and so
require a split.
Again, this is not a problem on its own.
The logic for "never submit more than `PAGE_SIZE`" is actually a little
more subtle. It will actually never submit a buffer that crosses a 4K
page boundary.
In practice, this is fine, as most ABDs are scattered, that is a list of
complete 4K pages, and so are loaded in as such.
Linear ABDs are typically allocated from slabs, and for small sizes they
are frequently not aligned to page boundaries. For example, a 12K
allocation can span four pages, eg:
Such an allocation would be loaded into a BIO as you see:
[1K, 4K, 4K, 3K]
This tends not to be a problem in practice, because even if the BIO were
filled and needed to be split, each half would still have either a start
or end aligned to the logical block size of the device (assuming 4K at
least).
---
In ideal circumstances, these shortcomings don't cause any particular
problems. Its when they start to interact with other ZFS features that
things get interesting.
Aggregation will create a "gang" ABD, which is simply a list of other
ABDs. Iterating over a gang ABD is just iterating over each ABD within
it in turn.
Because the segments are simply loaded in order, we can end up with
uneven segments either side of the "gap" between the two ABDs. For
example, two 12K ABDs might be aggregated and then loaded as:
[1K, 4K, 4K, 3K, 2K, 4K, 4K, 2K]
Should a split occur, each individual BIO can end up either having an
start or end offset that is not aligned to the logical block size, which
some drivers (eg SCSI) will reject. However, this tends not to happen
because the default aggregation limit usually keeps the BIO small enough
to not require more than one split, and most pages are actually full 4K
pages, so hitting an uneven gap is very rare anyway.
If the pool is under particular memory pressure, then an IO can be
broken down into a "gang block", a 512-byte block composed of a header
and up to three block pointers. Each points to a fragment of the
original write, or in turn, another gang block, breaking the original
data up over and over until space can be found in the pool for each of
them.
Each gang header is a separate 512-byte memory allocation from a slab,
that needs to be written down to disk. When the gang header is added to
the BIO, its a single 512-byte segment.
Pulling all this together, consider a large aggregated write of gang
blocks. This results a BIO containing lots of 512-byte segments. Given
our tendency to overfill the BIO, a split is likely, and most possible
split points will yield a pair of BIOs that are misaligned. Drivers that
care, like the SCSI driver, will reject them.
---
This commit is a substantial refactor and rewrite of much of `vdev_disk`
to sort all this out.
`vdev_bio_max_segs()` now returns the ideal maximum size for the device,
if available. There's also a tuneable `zfs_vdev_disk_max_segs` to
override this, to assist with testing.
We scan the ABD up front to count the number of pages within it, and to
confirm that if we submitted all those pages to one or more BIOs, it
could be split at any point with creating a misaligned BIO. If the
pages in the BIO are not usable (as in any of the above situations), the
ABD is linearised, and then checked again. This is the same technique
used in `vdev_geom` on FreeBSD, adjusted for Linux's variable page size
and allocator quirks.
`vbio_t` is a cleanup and enhancement of the old `dio_request_t`. The
idea is simply that it can hold all the state needed to create, submit
and return multiple BIOs, including all the refcounts, the ABD copy if
it was needed, and so on. Apart from what I hope is a clearer interface,
the major difference is that because we know how many BIOs we'll need up
front, we don't need the old overflow logic that would grow the BIO
array, throw away all the old work and restart. We can get it right from
the start.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
Rob Norris [Tue, 9 Jan 2024 01:29:19 +0000 (12:29 +1100)]
vdev_disk: make read/write IO function configurable
This is just setting up for the next couple of commits, which will add a
new IO function and a parameter to select it.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
Rob Norris [Tue, 9 Jan 2024 01:23:30 +0000 (12:23 +1100)]
vdev_disk: reorganise vdev_disk_io_start
Light reshuffle to make it a bit more linear to read and get rid of a
bunch of args that aren't needed in all cases.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
Rob Norris [Tue, 9 Jan 2024 01:12:56 +0000 (12:12 +1100)]
vdev_disk: rename existing functions to vdev_classic_*
This is just renaming the existing functions we're about to replace and
grouping them together to make the next commits easier to follow.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
Rob Norris [Mon, 11 Dec 2023 05:05:54 +0000 (16:05 +1100)]
abd: add page iterator
The regular ABD iterators yield data buffers, so they have to map and
unmap pages into kernel memory. If the caller only wants to count
chunks, or can use page pointers directly, then the map/unmap is just
unnecessary overhead.
This adds adb_iterate_page_func, which yields unmapped struct page
instead.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
Rob Norris [Mon, 13 Nov 2023 06:55:29 +0000 (17:55 +1100)]
linux 5.4 compat: page_size()
Before 5.4 we have to do a little math.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
Alexander Motin [Mon, 25 Mar 2024 22:02:38 +0000 (18:02 -0400)]
BRT: Make BRT block sizes configurable
Similar to DDT make BRT data and indirect block sizes configurable
via module parameters. I am not sure what would be the best yet,
but similar to DDT 4KB blocks kill all chances of compression on
vdev with ashift=12 or more, that on my tests reaches 3x.
While here, fix documentation for respective DDT parameters.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15967
George Wilson [Mon, 25 Mar 2024 22:01:54 +0000 (18:01 -0400)]
Provide macros for setting and getting blkptr birth times
There exist a couple of macros that are used to update the blkptr birth
times but they can often be confusing. For example, the
BP_PHYSICAL_BIRTH() macro will provide either the physical birth time
if it is set or else return back the logical birth time. The
complement to this macro is BP_SET_BIRTH() which will set the logical
birth time and set the physical birth time if they are not the same.
Consumers may get confused when they are trying to get the physical
birth time and use the BP_PHYSICAL_BIRTH() macro only to find out that
the logical birth time is what is actually returned.
This change cleans up these macros and makes them symmetrical. The same
functionally is preserved but the name is changed. Instead of calling
BP_PHYSICAL_BIRTH(), consumer can now call BP_GET_BIRTH(). In
additional to cleaning up this naming conventions, two new sets of
macros are introduced -- BP_[SET|GET]_LOGICAL_BIRTH() and
BP_[SET|GET]_PHYSICAL_BIRTH. These new macros allow the consumer to
get and set the specific birth time.
As part of the cleanup, the unused GRID macros have been removed and
that portion of the blkptr are currently unused.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: George Wilson <gwilson@delphix.com>
Closes #15962
Alexander Motin [Mon, 25 Mar 2024 21:59:55 +0000 (17:59 -0400)]
BRT: Relax brt_pending_apply() locking
Since brt_pending_apply() is running in syncing context, no other
brt_pending_tree accesses are possible for the TXG. We don't need
to acquire brt_pending_lock here.
Reviewed-by: Pawel Jakub Dawidek <pawel@dawidek.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15955
Alexander Motin [Mon, 25 Mar 2024 21:58:50 +0000 (17:58 -0400)]
ZAP: Massively switch to _by_dnode() interfaces
Before this change ZAP called dnode_hold() for almost every block
access, that was clearly visible in profiler under heavy load, such
as BRT. This patch makes it always hold the dnode reference between
zap_lockdir() and zap_unlockdir(). It allows to avoid most of dnode
operations between those. It also adds several new _by_dnode() APIs
to ZAP and uses them in BRT code. Also adds dmu_prefetch_by_dnode()
variant and uses it in the ZAP code.
After this there remains only one call to dmu_buf_dnode_enter(),
which seems to be unneeded. So remove the call and the functions.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15951
Alexander Motin [Mon, 25 Mar 2024 21:58:04 +0000 (17:58 -0400)]
BRT: Skip duplicate BRT prefetches
If there is a pending entry for this block, then we've already
issued BRT prefetch for it within this TXG, so don't do it again.
BRT vdev lookup and following zap_prefetch_uint64() call can be
pretty expensive and should be avoided when not necessary.
Reviewed-by: Pawel Jakub Dawidek <pawel@dawidek.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15941
Robert Evans [Mon, 25 Mar 2024 21:56:49 +0000 (17:56 -0400)]
Fix corruption caused by mmap flushing problems
1) Make mmap flushes synchronous. Linux may skip flushing dirty pages
already in writeback unless data-integrity sync is requested.
2) Change zfs_putpage to use TXG_WAIT. Otherwise dirty pages may be
skipped due to DMU pushing back on TX assign.
3) Add missing mmap flush when doing block cloning.
4) While here, pass errors from putpage to writepage/writepages.
This change fixes corruption edge cases, but unfortunately adds
synchronous ZIL flushes for dirty mmap pages to llseek and bclone
operations. It may be possible to avoid these sync writes later
but would need more tricky refactoring of the writeback code.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Robert Evans <evansr@google.com>
Closes #15933
Closes #16019
Alexander Motin [Thu, 21 Mar 2024 23:43:53 +0000 (19:43 -0400)]
ZAP: Some cleanups/micro-optimizations
- Remove custom zap_memset(), use regular memset().
- Use PANIC() instead of opaque cmn_err(CE_PANIC).
- Provide entry parameter to zap_leaf_rehash_entry().
- Reduce branching in zap_leaf_array_create() inner loop.
- Remove signedness where it should not be.
Should be no function changes.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15976
If a zvol has more than 15 partitions, the minor device number exhausts
the slot count reserved for partitions next to the zvol itself. As a
result, the minor number cannot be used to determine the partition
number for the higher partition, and doing so results in wrong named
symlinks being generated by udev.
Since the partition number is encoded in the block device name anyway,
let's just extract it from there instead.
Alexander Motin [Thu, 21 Mar 2024 22:42:21 +0000 (18:42 -0400)]
BRT: Change brt_pending_tree sorting order
It does not look important how exactly brt_pending_tree is sorted.
When cloning large file, it is quite likely that all of its blocks
have identical physical birth times, so comparing them first does
not provide useful entropy, while accesses additional cache line.
In most cases combination of vdev and offset provides unique result
and physical birth time comparison is not even needed. Meanwhile,
when traversing the tree inside brt_pending_apply(), it can be
beneficial for dbuf cache and CPU cache hits to group processing
by vdev and so by the per-VDEV BRT ZAPs.
Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15954
Rob N [Thu, 21 Mar 2024 19:10:04 +0000 (06:10 +1100)]
zio: update ZIO type x stage documentation
- add column for TRIM ZIOs
- remove R from ZIO_STAGE_ISSUE_ASYNC, never happened
- remove I from ZIO_STAGE_VDEV_IO_DONE, never happened
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #15959
Cameron Harr [Thu, 21 Mar 2024 16:00:29 +0000 (09:00 -0700)]
Fix option string, adding -e and fixing order
The recently added '-e' option (PR #15769) missed adding the
new option in the online `zpool status` help command. This
adds the options and reorders a couple of the other options
that were not listed alphabetically.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Cameron Harr <harr1@llnl.gov>
Closes #16008
Alexander Motin [Thu, 21 Mar 2024 00:22:36 +0000 (20:22 -0400)]
Update resume token at object receive.
Before this change resume token was updated only on data receive.
Usually it is enough to resume replication without much overlap.
But we've got a report of a curios case, where replication source
was traversed with recursive grep, which through enabled atime
modified every object without modifying any data. It produced
several gigabytes of replication traffic without a single data
write and so without a single resume point.
While the resume token was not designed to resume from an object,
I've found that the send implementation always sends object before
any data. So by requesting resume from offset 0 we are effectively
resuming from the object, followed (or not) by the data at offset
0, just as we need it.
Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15927
Rob N [Wed, 20 Mar 2024 23:46:15 +0000 (10:46 +1100)]
Linux 6.8 compat: use splice_copy_file_range() for fallback
Linux 6.8 removes generic_copy_file_range(), which had been reduced to a
simple wrapper around splice_copy_file_range(). Detect that function
directly and use it if generic_ is not available.
Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #15930
Closes #15931
Rob Norris [Mon, 19 Feb 2024 10:19:32 +0000 (21:19 +1100)]
ddt: reduce DDT_NAMELEN
This is the buffer size passed to ddt_object_name(), to expand the
DMU_POOL_DDT format. That format inserts the table checksum, class and
type names, which as I write this are max 6, 9 and 3, respectively.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/
Closes #15908
Rob Norris [Mon, 19 Feb 2024 10:13:59 +0000 (21:13 +1100)]
config: use -Wno-format-truncation globally
-Wformat-truncation looks for places where the return code of snprintf()
is unchecked and the provided buffer might be too short. This is based
on a heuristic that can change between compiler versions.
It has been seen to get this wrong in ddt_object_name(), leading to
DDT_NAMELEN being increased somewhat arbitrarily.
There's no good reason to have this warning enabled, so here we disable
it everywhere. Truncation may be undesirable, but snprintf() is
guaranteed to emit a trailing null, so at worst we get a short string,
not a buffer overrun.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/
Closes #15908
Brian Behlendorf [Fri, 16 Feb 2024 17:07:32 +0000 (09:07 -0800)]
Check for minimum partition size
On Linux block devices used for vdevs will by partitioned. The block
device must be large enough for an 64M partition starting at offset
of 2048 sectors (part1), and a second 64M reserved partition at the
end of the device (part9).
This commit adds a capacity check when creating the GPT label to
immediately detect a device which is too small. With the existing
code this would be caught slightly latter when attempting to use
the partition. Catching it sooner let's us print a more useful error.
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #15898
Rob Norris [Mon, 27 Nov 2023 23:43:36 +0000 (10:43 +1100)]
ddt: document the theory and the key data structures
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc.
Closes #15887
Rob Norris [Thu, 1 Feb 2024 00:05:18 +0000 (11:05 +1100)]
ddt: only create tables for dedup-capable checksums
Most values in zio_checksum can never be used for dedup, partly because
the dedup= property only offers a limited list, but also some values (eg
ZIO_CHECKSUM_OFF) aren't real and will never be seen.
A true flag would be better than a hardcoded list, but thats more
cleanup elsewhere than I want to do right now.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc.
Closes #15887
Rob Norris [Tue, 5 Dec 2023 03:28:39 +0000 (14:28 +1100)]
ddt: simplify entry load and flags
Only a single bit is needed to track entry state, and definitely not two
whole bytes. Some light refactoring in ddt_lookup() is needed to support
this, but it reads a lot better now.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc.
Closes #15887
Rob Norris [Mon, 31 Jul 2023 07:42:34 +0000 (17:42 +1000)]
ddt: remove ddt_node
Nothing uses it.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc.
Closes #15887
Rob Norris [Mon, 3 Jul 2023 13:28:46 +0000 (23:28 +1000)]
ddt: rework ops interface in terms of keys and values
Store objects store keys and values, so have them take those types and
nothing more. This way, they don't need to be concerned about the "kind"
of entry being operated on; the dispatch layer can take care of the
appropriate conversions.
This adds a "contains" op to see if a particular entry exists without
loading it, which makes a couple of things easier to do; in particular,
it allows us to avoid an allocation in ddt_class_contains().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc.
Closes #15887
Rob Norris [Thu, 15 Jun 2023 06:10:00 +0000 (16:10 +1000)]
ddt: ensure ddt objects exist before trying to get stats from them
ddt_get_dedup_histogram() was actually checking it, just in an extremely
cursed way. ddt_get_dedup_object_stats() wasn't, but wasn't being called
from a dangerous place so no one noticed.
These checks are necessary, because spa_ddt[] is not populated until
spa_load(), but the spa can exist before that, while being created, and
as vdevs and metaslabs are initialised the space accounting functions
will be called to update pool space counts.
Probably the whole create path doesn't need to go asking for space
accounting from metadata subsystems until after the pool is created.
This will at least catch misuse.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc.
Closes #15887
Rob Norris [Mon, 3 Jul 2023 02:43:37 +0000 (12:43 +1000)]
ddt: remove struct names and forward declarations
Things get confused when there's more than one name for a thing.
Note that we don't do this for ddt_object_t, ddt_histogram_t and
ddt_stat_t because they're part of the public ZFS interface.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc.
Closes #15887
Rob Norris [Mon, 3 Jul 2023 02:32:53 +0000 (12:32 +1000)]
ddt: typedef ddt_type and ddt_class
Mostly for consistency, so the reader is less likely to wonder why these
things look different.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc.
Closes #15887
Rob Norris [Fri, 30 Jun 2023 03:35:18 +0000 (13:35 +1000)]
ddt: split internal DDT API into separate header
Just to make it easier to know which bits to pay attention to.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc.
Closes #15887
Rob Norris [Mon, 3 Jul 2023 05:25:06 +0000 (15:25 +1000)]
ddt: remove DDE_GET_NDVAS macro
It was a weird and confusing name, because it wasn't actually returning
the number of DVAs in the entry (as in, in the value/phys part) but the
maximum number of possible DVAs in a BP generated from the entry, based
on the encrypt bit in the key. This is unlike the similarly named
BP_GET_NDVAS, which really does return the number of DVAs.
Since its only used in this one place, and for a specific purpose, it
seemed more sensible to just write it in-place and remove the name.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc.
Closes #15887
Rob Norris [Tue, 16 May 2023 03:30:26 +0000 (13:30 +1000)]
ddt: lift dedup stats out to separate file
We want to add other kinds of dedup-related objects and keep stats for
them. This makes those functions easier to use from outside ddt.c.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc.
Closes #15887
Rob Norris [Fri, 9 Jun 2023 00:14:42 +0000 (10:14 +1000)]
ddt: compare keys, not entries
We're about to have different kinds of things that we'll compare on key,
so generalise this function to support that.
(It actually worked fine because of the way the casts work out, but it
requires the key to be at the start of the object so the cast through
ddt_entry_t works, and even then it reads strangely for anything that's
not a ddt_entry_t).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc.
Closes #15887
Rob Norris [Wed, 17 Jan 2024 22:51:41 +0000 (09:51 +1100)]
ddt_zap: standardise temp buffer allocations
Always do them on the heap, and when we know how much we need, only that
much.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc.
Closes #15887
Rob Norris [Fri, 30 Jun 2023 02:48:45 +0000 (12:48 +1000)]
ddt: move entry compression into ddt_zap
I think I can say with some confidence that anyone making a new storage
type in 2023 is doing their own thing with compression, not this.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc.
Closes #15887