The current value causes significant artificial slowdown during mass
parallel file removal, which can be observed both on FreeBSD and Linux
when running real workloads.
Sample results from Linux doing make -j 96 clean after an allyesconfig
modules build:
before: 4.14s user 6.79s system 48% cpu 22.631 total
after: 4.17s user 6.44s system 153% cpu 6.927 total
FreeBSD results in the ticket.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #13932
Closes #13938
Akash B [Thu, 20 Oct 2022 00:07:51 +0000 (05:37 +0530)]
Add options to zfs redundant_metadata property
Currently, additional/extra copies are created for metadata in
addition to the redundancy provided by the pool(mirror/raidz/draid),
due to this 2 times more space is utilized per inode and this decreases
the total number of inodes that can be created in the filesystem. By
setting redundant_metadata to none, no additional copies of metadata
are created, hence can reduce the space consumed by the additional
metadata copies and increase the total number of inodes that can be
created in the filesystem. Additionally, this can improve file create
performance due to the reduced amount of metadata which needs
to be written.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Dipak Ghosh <dipak.ghosh@hpe.com> Signed-off-by: Akash B <akash-b@hpe.com>
Closes #13680
Mark Johnston [Tue, 11 Oct 2022 19:29:55 +0000 (15:29 -0400)]
FreeBSD: Fix a pair of bugs in zfs_fhtovp()
- Add a zfs_exit() call in an error path, otherwise a lock is leaked.
- Remove the fid_gen > 1 check. That appears to be Linux-specific:
zfsctl_snapdir_fid() sets fid_gen to 0 or 1 depending on whether the
snapshot directory is mounted. On FreeBSD it fails, making snapshot
dirs inaccessible via NFS.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Andriy Gapon <avg@FreeBSD.org> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Fixes: 43dbf8817808 ("FreeBSD: vfsops: use setgen for error case")
Closes #14001
Closes #13974
(cherry picked from commit ed566bf1cd0bdbf85e8c63c1c119e3d2ef5db1f6)
This patch handles the race condition on simultaneous failure of
2 drives, which misses the vdev_rebuild_reset_wanted signal in
vdev_rebuild_thread. We retry to catch this inside the
vdev_rebuild_complete_sync function.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Dipak Ghosh <dipak.ghosh@hpe.com> Reviewed-by: Akash B <akash-b@hpe.com> Signed-off-by: Samuel Wycliffe J <samwyc@hpe.com>
Closes #14041
Closes #14050
Due to a missing semicolon on the ExecStart line, it wasn't possible
to specify the snapshot name on the bootfs.{rollback,snapshot}
kernel parameters if the boot dataset name was obtained from the
root=zfs:... kernel parameter.
Richard Yao [Wed, 12 Oct 2022 22:47:39 +0000 (18:47 -0400)]
kcfpool_alloc() should have its argument list marked void
This error occurred when building on Gentoo with debugging enabled:
zfs-kmod-2.1.6/work/zfs-2.1.6/module/icp/core/kcf_sched.c:1277:14:
error: a function declaration without a prototype is deprecated
in all versions of C [-Werror,-Wstrict-prototypes]
kcfpool_alloc()
^
void
1 error generated.
This function is not present in master.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14023
The logic above skips completely `mount.zfs` and prevents us
from reading filesystem properties and applying mount options.
For comparison, the coreutils `mount(8)` implementation does:
```
openat(AT_FDCWD, "/proc/filesystems", O_RDONLY|O_CLOEXEC) = 3
// figure out that zfs is a `nodev` filesystem and look for a helper
newfstatat(AT_FDCWD, "/sbin/mount.zfs" ...) = 0
execve("/sbin/mount.zfs" ...) = 0
```
Using `mount.zfs` in initramfs would help circumvent deficiencies
of some of `mount(8)` implementations. `mount -t zfs` translates
to `mount.zfs` invocation, except for cases when explicitly disabled
by `-i`.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: szubersk <szuberskidamian@gmail.com>
Closes #13305
(cherry picked from commit 35d81a75a8c13e011e19fd12cf553d9c5849386e)
Richard Yao [Tue, 27 Sep 2022 19:36:58 +0000 (15:36 -0400)]
Fix bad free in skein code
Clang's static analyzer found a bad free caused by skein_mac_atomic().
It will allocate a context on the stack and then pass it to
skein_final(), which attempts to free it. Upon inspection,
skein_digest_atomic() also has the same problem.
These functions were created to match the OpenSolaris ICP API, so I was
curious how we avoided this in other providers and looked at the SHA2
code. It appears that SHA2 has a SHA2Final() helper function that is
called by the exported sha2_mac_final()/sha2_digest_final() as well as
the sha2_mac_atomic() and sha2_digest_atomic() functions. The real work
is done in SHA2Final() while some checks and the free are done in
sha2_mac_final()/sha2_digest_final().
We fix the use after free in the skein code by taking inspiration from
the SHA2 code. We introduce a skein_final_nofree() that does most of the
work, and make skein_final() into a function that calls it and then
frees the memory.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13954
Tony Hutter [Fri, 23 Sep 2022 17:24:19 +0000 (10:24 -0700)]
zpool: Don't print "repairing" on force faulted drives
If you force fault a drive that's resilvering, it's scan stats can get
frozen in time, giving the false impression that it's being resilvered.
This commit checks the vdev state to see if the vdev is healthy before
reporting "resilvering" or "repairing" in zpool status.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #13927
Closes #13930
Richard Yao [Wed, 14 Sep 2022 19:51:55 +0000 (15:51 -0400)]
FreeBSD: Fix integer conversion for vnlru_free{,_vfsops}()
When reviewing #13875, I noticed that our FreeBSD code has an issue
where it converts from `int64_t` to `int` when calling
`vnlru_free{,_vfsops}()`. The result is that if the int64_t is `1 <<
36`, the int will be 0, since the low bits are 0. Even when some low
bits are set, a value such as `((1 << 36) + 1)` would truncate to 1,
which is wrong.
There is protection against this on 32-bit platforms, but on 64-bit
platforms, there is no check to protect us, so we add a check.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13882
It should be noted that exploiting this requires the `SYS_CONFIG`
privilege, and anyone with that privilege likely has other opportunities
to do exploits, so it is unlikely that bad actors could exploit this
unless system administrators are executing untrusted ZFS Channel
Programs.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13949
Richard Yao [Mon, 26 Sep 2022 23:44:22 +0000 (19:44 -0400)]
Linux: Fix uninitialized variable usage in zio_do_crypt_data()
Coverity complained about this. An error from `hkdf_sha512()` before uio
initialization will cause pointers to uninitialized memory to be passed
to `zio_crypt_destroy_uio()`. This is a regression that was introduced
by cf63739191b6cac629d053930a4aea592bca3819. Interestingly, this never
affected FreeBSD, since the FreeBSD version never had that patch ported.
Since moving uio initialization to the top of this function would slow
down the qat_crypt() path, we only move the `memset()` calls to the top
of the function. This is sufficient to fix this problem.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Neal Gompa <ngompa@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13944
Alexander Motin [Tue, 24 May 2022 16:46:35 +0000 (12:46 -0400)]
Refactor Log Size Limit
Original Log Size Limit implementation blocked all writes in case of
limit reached until the TXG is committed and the log is freed. It
caused huge delays and following speed spikes in application writes.
This implementation instead smoothly throttles writes, using exactly
the same mechanism as used for dirty data.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: jxdking <lostking2008@hotmail.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Issue #12284
Closes #13476
Brian Behlendorf [Mon, 19 Sep 2022 18:07:15 +0000 (11:07 -0700)]
Revert "Reduce dbuf_find() lock contention"
This reverts commit 34dbc618f50cfcd392f90af80c140398c38cbcd1. While this
change resolved the lock contention observed for certain workloads, it
inadventantly reduced the maximum hash inserts/removes per second. This
appears to be due to the slightly higher acquisition cost of a rwlock vs
a mutex.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
I see a few issues in the issue tracker that might be aided by being
able to turn this on. We have no module parameter for it, so I would
like to add one.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13874
Ameer Hamza [Tue, 20 Sep 2022 22:19:05 +0000 (03:19 +0500)]
Delay ZFS_PROP_SHARESMB property to handle it for encrypted raw receive
For encrypted raw receive, objset creation is delayed until a call to
dmu_recv_stream(). ZFS_PROP_SHARESMB property requires objset to be
populated when calling zpl_earlier_version(). To correctly handle the
ZFS_PROP_SHARESMB property for encrypted raw receive, this change
delays setting the property.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #13878
Ameer Hamza [Fri, 16 Sep 2022 20:52:25 +0000 (01:52 +0500)]
zfs recv hangs if max recordsize is less than received recordsize
- Some optimizations for bqueue enqueue/dequeue.
- Added a fix to prevent deadlock when both bqueue_enqueue_impl()
and bqueue_dequeue() waits for signal to be triggered.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #13855
Alexander Motin [Thu, 8 Sep 2022 17:30:53 +0000 (13:30 -0400)]
Improve too large physical ashift handling
When iterating through children physical ashifts for vdev, prefer
ones above the maximum logical ashift, that we can actually use,
but within the administrator defined maximum.
When selecting top-level vdev ashift, do not set it to the defined
maximum in case physical ashift is even higher, but just ignore one.
Using the maximum does not prevent misaligned writes, but reduces
space efficiency. Since ZFS tries to write data sequentially and
aggregates the writes, in many cases large misanigned writes may be
not as bad as the space penalty otherwise.
Allow internal physical ashifts for vdevs higher than SHIFT_MAX.
May be one day allocator or aggregation could benefit from that.
Reduce zfs_vdev_max_auto_ashift default from 16 (64KB) to 14 (16KB),
so that ZFS may still use bigger ashifts up to SHIFT_MAX (64KB),
but only if it really has to or explicitly told to, but not as an
"optimization".
There are some read-intensive NVMe SSDs that report Preferred Write
Alignment of 64KB, and attempt to build RAIDZ2 of those leads to a
space inefficiency that can't be justified. Instead these changes
make ZFS fall back to logical ashift of 12 (4KB) by default and
only warn user that it may be suboptimal for performance.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #13798
Rich Ercolani [Thu, 31 Mar 2022 17:09:18 +0000 (13:09 -0400)]
Ask libtool to stop hiding some errors
For #13083, curiously, it did not print the actual error, just
that the compile failed with "Error 1".
In theory, this flag should cause it to report errors twice sometimes.
In practice, I'm pretty okay with reporting some twice if it avoids
reporting some never.
Kevin Jin [Tue, 20 Jul 2021 15:40:24 +0000 (11:40 -0400)]
Add Module Parameter Regarding Log Size Limit
zfs_wrlog_data_max
The upper limit of TX_WRITE log data. Once it is reached,
write operation is blocked, until log data is cleared out
after txg sync. It only counts TX_WRITE log with WR_COPIED
or WR_NEED_COPY.
Kevin Jin [Thu, 1 Jul 2021 15:20:27 +0000 (11:20 -0400)]
Optimize txg_kick() process (#12274)
Use dp_dirty_pertxg[] for txg_kick(), instead of dp_dirty_total in
original code. Extra parameter "txg" is added for txg_kick(), thus it
knows which txg to kick. Also txg_kick() call is moved from
dsl_pool_need_dirty_delay() to dsl_pool_dirty_space() so that we can
know the txg number assigned for txg_kick().
Some unnecessary code regarding dp_dirty_total in txg_sync_thread() is
also cleaned up.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: jxdking <lostking2008@hotmail.com>
Closes #12274
Ameer Hamza [Fri, 16 Sep 2022 20:52:25 +0000 (01:52 +0500)]
zfs recv hangs if max recordsize is less than received recordsize
- Some optimizations for bqueue enqueue/dequeue.
- Added a fix to prevent deadlock when both bqueue_enqueue_impl()
and bqueue_dequeue() waits for signal to be triggered.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #13855
at the beginning of dbuf_assign_arcbuf can panic
if the object type is a DMU_OT_NEWTYPE that has
DMU_OT_METADATA set.
While we're at it, fix DMU_OT_IS_ENCRYPTED as well.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Christian Schwarz <christian.schwarz@nutanix.com>
Closes #13842
Richard Yao [Mon, 12 Sep 2022 19:51:17 +0000 (15:51 -0400)]
vdev_draid_lookup_map() should not iterate outside draid_maps
Coverity reported this as an out-of-bounds read.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Neal Gompa <ngompa@datto.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13865
Akash B [Wed, 9 Mar 2022 00:20:41 +0000 (05:50 +0530)]
Add physical device size to SIZE column in 'zpool list -v'
Add physical device size/capacity only for physical devices in
'zpool list -v' instead of displaying "-" in the SIZE column.
This would make it easier to see the individual device capacity and
to determine which spares are large enough to replace which devices.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Dipak Ghosh <dipak.ghosh@hpe.com> Signed-off-by: Akash B <akash-b@hpe.com>
Closes #12561
Closes #13106
George Amanakis [Thu, 11 Nov 2021 20:52:16 +0000 (21:52 +0100)]
Introduce a tunable to exclude special class buffers from L2ARC
Special allocation class or dedup vdevs may have roughly the same
performance as L2ARC vdevs. Introduce a new tunable to exclude those
buffers from being cacheable on L2ARC.
Reviewed-by: Don Brady <don.brady@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #11761
Closes #12285
Tony Hutter [Thu, 8 Sep 2022 17:32:30 +0000 (10:32 -0700)]
zed: Fix config_sync autoexpand flood
Users were seeing floods of `config_sync` events when autoexpand was
enabled. This happened because all "disk status change" udev events
invoke the autoexpand codepath, which calls zpool_relabel_disk(),
which in turn cause another "disk status change" event to happen,
in a feedback loop. Note that "disk status change" happens every time
a user calls close() on a block device.
This commit breaks the feedback loop by only allowing an autoexpand
to happen if the disk actually changed size.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes: #7132 Closes: #7366
Closes #13729
Walter Huf [Tue, 6 Sep 2022 17:02:18 +0000 (10:02 -0700)]
Add xattr_handler support for Android kernels
Some ARM BSPs run the Android kernel, which has
a modified xattr_handler->get() function signature.
This adds support to compile against these kernels.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Walter Huf <hufman@gmail.com>
Closes #13824
Samuel [Tue, 6 Sep 2022 16:37:47 +0000 (22:07 +0530)]
Fix column width in 'zpool iostat -v' and 'zpool list -v'
This commit fixes a minor spacing issue caused when
enumerating vdev names, which originated from #13031
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com> Signed-off-by: Samuel Wycliffe <samuelwycliffe@gmail.com>
Closes #13811
Alexander Motin [Fri, 2 Sep 2022 20:21:18 +0000 (16:21 -0400)]
Apply arc_shrink_shift to ARC above arc_c_min
It makes sense to free memory in smaller chunks when approaching
arc_c_min to let other kernel subsystems to free more, since after
that point we can't free anything. This also matches behavior on
Linux, where to shrinker reported only the size above arc_c_min.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #13794
George Wilson [Fri, 26 Aug 2022 21:04:27 +0000 (16:04 -0500)]
Importing from cachefile can trip assertion
When importing from cachefile, it is possible that the builtin retry
logic will trip an assertion because it also fails to find the pool.
This fix addresses that case and returns the correct error message to
the user.
Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Wilson <gwilson@delphix.com>
Closes #13781
Tony Hutter [Tue, 9 Aug 2022 20:26:46 +0000 (13:26 -0700)]
ZTS: Fix zpool_expand_001_pos
`zpool_expand_001_pos` was often failing due to not seeing autoexpand
commands in the `zpool history`. During testing, I found this to be
unreliable (sometimes the "online" wouldn't appear in `zpool history`)
and unnecessary, as we could simply check that the pool increased in
size.
This commit revamps the test to check for the expanded pool size
and corresponding new free space.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #13743
Tony Hutter [Thu, 14 Jul 2022 17:19:37 +0000 (10:19 -0700)]
zed: Look for NVMe DEVPATH if no ID_BUS
We tried replacing an NVMe drive using autoreplace, only
to see zed reject it with:
zed[27955]: zed_udev_monitor: /dev/nvme5n1 no devid source
This happened because ZED saw that ID_BUS was not set by udev
for the NVMe drive, and thus didn't think it was "real drive".
This commit allows NVMe drives to be autoreplaced even if
ID_BUS is not set.
Reviewed-by: Don Brady <don.brady@intel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #13512
Closes #13646
Tony Hutter [Mon, 11 Jul 2022 20:35:19 +0000 (13:35 -0700)]
zed: Ignore false 'atari' partitions in autoreplace
libudev will sometimes falsely identify an 'atari' partition on a
blank disk, preventing it from being used in an autoreplace. This
seems to be a known issue. The workaround is to just ignore the
fake partition and continue with the autoreplace.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #13497
Closes #13632
Richard Yao [Mon, 12 Sep 2022 18:22:15 +0000 (14:22 -0400)]
Fix use-after-free in btree code
Coverty static analysis found these.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Neal Gompa <ngompa@datto.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #10989
Closes #13861
contrib: dracut: zfs-snapshot-bootfs: exit status fix
When the zfs-snapshot-bootfs service attempts to create a snapshot
that already exists, the exit status of the command is non-zero and
the service reports failed to the systemd service manager. This is a
common occurrence if bootfs.snapshot is left set on the kernel command
line and it should not be considered a failure.
This service was originally set to ignore this error by prefixing
the command with - on the ExecStart line, but the leading - appears
to have been dropped in #13359.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gregory Bartholomew <gregory.lee.bartholomew@gmail.com>
Closes #13769
Coleman Kane [Mon, 8 Aug 2022 23:18:30 +0000 (19:18 -0400)]
Linux 6.0 compat: register_shrinker() now var-arg
The 6.0 kernel added a printf-style var-arg for args > 0 to the
register_shrinker function, in order to add names to shrinkers, in
commit e33c267ab70de4249d22d7eab1cc7d68a889bac2. This enables the
shrinkers to have friendly names exposed in /sys/kernel/debug/shrinker/.
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #13748
Paul Zuchowski [Wed, 3 Aug 2022 17:12:23 +0000 (17:12 +0000)]
Fix problem with zdb -d
zdb -d <pool>/<objset ID> does not work when
other command line arguments are included i.e.
zdb -U <cachefile> -d <pool>/<objset ID>
This change fixes the command line parsing
to handle this situation. Also fix issue
where zdb -r <dataset> <file> does not handle
the root <dataset> of the pool. Introduce -N
option to force <objset ID> to be interpreted
as a numeric objsetID.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #12845
Closes #12944
It turns out that short-circuiting the EFAULT behavior on a short read
breaks things on FreeBSD. So until there's a nicer solution, let's
just revert the behavior for not-Linux.
Currently, dmu_read_uio_dnode can read 64K of a requested 1M in one
loop, get EFAULT back from zfs_uiomove() (because the iovec only holds
64k), and return EFAULT, which turns into EAGAIN on the way out. EAGAIN
gets interpreted as "I didn't read anything", the caller tries again
without consuming the 64k we already read, and we're stuck.
This apparently works on newer kernels because the caller which breaks
on older Linux kernels by happily passing along a 1M read request and a
64k iovec just requests 64k at a time.
With this, we now won't return EFAULT if we got a partial read.
Brian Behlendorf [Tue, 26 Jul 2022 21:39:23 +0000 (14:39 -0700)]
ZTS: Fix io_uring support check
Not all Linux distribution kernels enable io_uring support by
default. Update the run time check to verify that the booted
kernel was built with CONFIG_IO_URING=y.
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Co-authored-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13648
Closes #13685
Brian Behlendorf [Mon, 20 Jun 2022 23:36:21 +0000 (23:36 +0000)]
Fix objtool: missing int3 after ret warning
Resolve straight-line speculation warnings reported by objtool
for x86_64 assembly on Linux when CONFIG_SLS is set. See the
following LWN article for the complete details.
https://lwn.net/Articles/877845/
Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13528
Closes #13575
Brian Behlendorf [Mon, 20 Jun 2022 22:27:55 +0000 (22:27 +0000)]
Fix -Wformat-overflow warning in zfs_project_handle_dir()
Switch to using asprintf() to satisfy the compiler and resolve the
potential format-overflow warning. Not the conditional before the
sprintf() would have prevented this regardless.
cmd/zfs/zfs_project.c: In function ‘zfs_project_handle_dir’:
cmd/zfs/zfs_project.c:241:38: error: ‘/’ directive writing
1 byte into a region of size between 0 and 4352
[-Werror=format-overflow=]
cmd/zfs/zfs_project.c:241:17: note: ‘sprintf’ output between
2 and 4609 bytes into a destination of size 4352
Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13528
Closes #13575
Brian Behlendorf [Mon, 20 Jun 2022 21:54:42 +0000 (21:54 +0000)]
Fix -Wformat-truncation warning in upgrade_set_callback()
Extend the buffer slightly resolve the warning.
cmd/zfs/zfs_main.c: In function ‘upgrade_set_callback’:
cmd/zfs/zfs_main.c:2446:22: error: ‘%llu’ directive output
may be truncated writing between 1 and 20 bytes into a
region of size 16 [-Werror=format-truncation=]
cmd/zfs/zfs_main.c:2445:24: note: ‘snprintf’ output between
2 and 21 bytes into a destination of size 16
Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13528
Closes #13575
Brian Behlendorf [Mon, 20 Jun 2022 21:35:38 +0000 (21:35 +0000)]
Fix -Wuse-after-free warning in dbuf_destroy()
Move the use of the db pointer after it is freed. It's only used as
a tag so a dereference would never occur, but there's no reason we
can't invert the order to resolve the warning.
module/zfs/dbuf.c: In function 'dbuf_destroy':
module/zfs/dbuf.c:2953:17: error:
pointer 'db' may be used after 'free' [-Werror=use-after-free]
Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13528
Closes #13575
Brian Behlendorf [Mon, 20 Jun 2022 21:32:03 +0000 (21:32 +0000)]
Fix -Wuse-after-free warning in dbuf_issue_final_prefetch_done()
Move the use of the private pointer after it is freed. It's only
used as a tag so a dereference would never occur, but there's no
harm in inverting the order to resolve the warning.
module/zfs/dbuf.c: In function 'dbuf_issue_final_prefetch_done':
module/zfs/dbuf.c:3204:17: error:
pointer 'private' may be used after 'free' [-Werror=use-after-free]
Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13528
Closes #13575
Brian Behlendorf [Mon, 20 Jun 2022 21:13:26 +0000 (21:13 +0000)]
Fix -Wattribute-warning in dsl layer
The memcpy(), memmove(), and memset() functions have been annotated
to perform bounds checking when using FORTIFY_SOURCE. A warning is
now generted when writing beyond the end of the specified field.
Alternately, the new struct_group() macro could be used to create
an anonymous union member for use by memcpy(). However, since this
is the only place the macro would be helpful it's preferable to
restructure the code slights to avoid the need for additional
compatibility code when the macro does not exist.
Brian Behlendorf [Mon, 20 Jun 2022 19:37:38 +0000 (19:37 +0000)]
Fix -Wattribute-warning in edonr
The wrong union memory was being accessed in EdonRInit resulting in
a write beyond size of field compiler warning. Reference the correct
member to resolve the warning. The warning was correct and this in
case the mistake was harmless.
In function ‘fortify_memcpy_chk’,
inlined from ‘EdonRInit’ at zfs/module/icp/algs/edonr/edonr.c:494:3:
./include/linux/fortify-string.h:344:25: error: call to
‘__write_overflow_field’ declared with attribute warning:
detected write beyond size of field (1st parameter);
maybe use struct_group()? [-Werror=attribute-warning]
Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13528
Closes #13575
Brian Behlendorf [Mon, 20 Jun 2022 22:36:38 +0000 (22:36 +0000)]
Fix -Wattribute-warning in zfs_log_xvattr()
Restructure the code in zfs_log_xvattr() to use a lr_attr_end
structure when accessing lr_attr_t elements located after the
variable sized array. This makes the code more understandable
and resolves the accessing beyond the end of the field warnings.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13528
Closes #13575
Brian Behlendorf [Mon, 20 Jun 2022 19:53:58 +0000 (19:53 +0000)]
Silence -Winfinite-recursion warning in luaD_throw()
This code should be kept inline with the upstream lua version as much
as possible. Therefore, we simply want to silence the warning. This
check was enabled by default as part of -Wall in gcc 12.1.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13528
Closes #13575
Alexander Motin [Fri, 8 Jul 2022 19:53:10 +0000 (15:53 -0400)]
Fix scrub resume from newly created hole.
It may happen that scan bookmark points to a block that was turned
into a part of a big hole. In such case dsl_scan_visitbp() may skip
it and dsl_scan_check_resume() will not be called for it. As result
new scan suspend won't be possible until the end of the object, that
may take hours if the object is a multi-terabyte ZVOL on a slow HDD
pool, stretching TXG to all that time, creating all sorts of problems.
This patch changes the resume condition to any greater or equal block,
so even if we miss the bookmarked block, the next one we find will
delete the bookmark, allowing new suspend.
Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Alexander Motin [Tue, 5 Jul 2022 23:27:29 +0000 (19:27 -0400)]
Avoid memory copy when verifying raidz/draid parity
Before this change for every valid parity column raidz_parity_verify()
allocated new buffer and copied there existing data, then recalculated
the parity and compared the result with the copy. This patch removes
the memory copy, simply swapping original buffer pointers with newly
allocated empty ones for parity recalculation and comparison. Original
buffers with potentially incorrect parity data are then just freed,
while new recalculated ones are used for repair.
On a pool of 12 4-wide raidz vdevs, storing 1.5TB of 16MB blocks, this
change reduces memory traffic during scrub by 17% and total unhalted
CPU time by 25%.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #13613
Alexander Motin [Tue, 5 Jul 2022 23:26:20 +0000 (19:26 -0400)]
Avoid memory copies during mirror scrub
Issuing several scrub reads for a block we may use the parent ZIO
buffer for one of child ZIOs. If that read complete successfully,
then we won't need to copy the data explicitly. If block has only
one copy (typical for root vdev, which is also a mirror inside),
then we never need to copy -- succeed or fail as-is. Previous
code also copied data from buffer of every successfully completed
child ZIO, but that just does not make any sense.
On healthy N-wide mirror this saves all N+1 (or even more in case
of ditto blocks) memory copies for each scrubbed block, allowing
CPU to focus mostly on check-summing. For other vdev types it
should save one memory copy per block copy at root vdev.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #13606
Alexander Motin [Tue, 28 Jun 2022 18:23:31 +0000 (14:23 -0400)]
Fix and disable blocks statistics during scrub
Block statistics calculation during scrub I/O issue in case of sorted
scrub accounted ditto blocks several times. Embedded blocks on other
side were not accounted at all. This change moves the accounting from
issue to scan stage, that fixes both problems and also allows to avoid
pool-wide locking and the lock contention it created.
Since this statistics is quite specific and is not even exposed now
anywhere, disable its calculation by default to not waste CPU time.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #13579
Alexander Motin [Fri, 24 Jun 2022 20:55:58 +0000 (16:55 -0400)]
Several B-tree optimizations
- Introduce first element offset within a leaf. It allows to reduce
by ~50% average memmove() size when adding/removing elements. If the
added/removed element is in the first half of the leaf, we may shift
elements before it and adjust the bth_first instead of moving more
elements after it.
- Use memcpy() instead of memmove() when we know there is no overlap.
- Switch from uint64_t to uint32_t. It does not limit anything,
but 32-bit arches should appreciate it greatly in hot paths.
- Store leaf capacity in struct btree to avoid 64-bit divisions.
- Adjust zfs_btree_insert_into_leaf() to always result in balanced
leaves after splitting, no matter where the new element was inserted.
Not that we care about it much, but it should also allow B-trees with
as little as two elements per leaf instead of 4 previously.
When scrubbing pool of 12 SSDs, storing 1.5TB of 4KB zvol blocks this
reduces amount of time spent in memmove() inside the scan thread from
13.7% to 5.7% and total scrub time by ~15 seconds out of 9 minutes.
It should also reduce spacemaps load time, but I haven't measured it.
Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #13582
Alexander Motin [Fri, 24 Jun 2022 16:50:37 +0000 (12:50 -0400)]
Several sorted scrub optimizations
- Reduce size and comparison complexity of q_exts_by_size B-tree.
Previous code used two 64-bit divisions and many other operations to
compare two B-tree elements. It created enormous overhead. This
implementation moves the math to the upper level and stores the score
in the B-tree elements themselves. Since all that we need to store in
that B-tree is the extent score and offset, those can fit into single
8 byte value instead of 24 bytes of q_exts_by_addr element and can be
compared with single operation.
- Better decouple secondary tree logic from main range_tree by moving
rt_btree_ops and related functions into dsl_scan.c as ext_size_ops.
Those functions are very small to worry about the code duplication and
range_tree does not need to know details such as rt_btree_compare.
- Instead of accounting number of pending bytes per pool, that needs
atomic on global variable per block, account the number of non-empty
per-vdev queues, that change much more rarely.
- When extent scan is interrupted by TXG end, continue it in the next
TXG instead of selecting next best extent. It allows to avoid leaving
one truncated (and so likely not the best any more) extent each TXG.
On top of some other optimizations this saves about 1.5 minutes out of
10 to scrub pool of 12 SSDs, storing 1.5TB of 4KB zvol blocks.
Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <caputit1@tcnj.edu> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #13576
Alexander Motin [Fri, 17 Jun 2022 22:38:51 +0000 (18:38 -0400)]
FreeBSD: Improve crypto_dispatch() handling
Handle crypto_dispatch() return values same as crp->crp_etype errors.
On FreeBSD 12 many drivers returned same errors both ways, and lack
of proper handling for the first ended up in assertion panic later.
It was changed in FreeBSD 13, but there is no reason to not be safe.
While there, skip waiting for completion, including locking and
wakeup() call, for sessions on synchronous crypto drivers, such as
typical aesni and software.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #13563
Alexander Motin [Wed, 15 Jun 2022 21:25:08 +0000 (17:25 -0400)]
Reduce ZIO io_lock contention on sorted scrub
During sorted scrub multiple threads (one per vdev) are issuing many
ZIOs same time, all using the same scn->scn_zio_root ZIO as parent.
It causes huge lock contention on the single global lock on that ZIO.
Improve it by introducing per-queue null ZIOs, children to that one,
and using them instead as proxy.
For 12 SSD pool storing 1.5TB of 4KB blocks on 80-core system this
dramatically reduces lock contention and reduces scrub time from 21
minutes down to 12.5, while actual read stages (not scan) are about
3x faster, reaching 100K blocks per second per vdev.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #13553
Alexander Motin [Thu, 9 Jun 2022 22:27:36 +0000 (18:27 -0400)]
AVL: Remove obsolete branching optimizations
Modern Clang and GCC can successfully implement simple conditions
without branching with math and flag operations. Use of arrays for
translation no longer helps as much as it was 14+ years ago.
Disassemble of the code generated by Clang 13.0.0 on FreeBSD 13.1,
Clang 14.0.4 on FreeBSD 14 and GCC 10.2.1 on Debian 11 with this
change still shows no branching instructions.
Profiling of CPU-bound scan stage of sorted scrub shows reproducible
reduction of time spent inside avl_find() from 6.52% to 4.58%.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #13540
Alexander Motin [Wed, 25 May 2022 17:12:52 +0000 (13:12 -0400)]
More speculative prefetcher improvements
- Make prefetch distance adaptive: up to 4MB prefetch doubles for
every, hit same as before, but after that it grows by 1/8 every time
the prefetch read does not complete in time to satisfy the demand.
My tests show that 4MB is sufficient for wide NVMe pool to saturate
single reader thread at 2.5GB/s, while new 64MB maximum allows the
same thread to reach 1.5GB/s on wide HDD pool. Further distance
increase may increase speed even more, but less dramatic and with
higher latency.
- Allow early reuse of inactive prefetch streams: streams that never
saw hits can be reused immediately if there is a demand, while others
can be reused after 1s of inactivity, starting with the oldest. After
2s of inactivity streams are deleted to free resources same as before.
This allows by several times increase strided read performance on HDD
pool in presence of simultaneous random reads, previously filling the
zfetch_max_streams limit for seconds and so blocking most of prefetch.
- Always issue intermediate indirect block reads with SYNC priority.
Each of those reads if delayed for longer may delay up to 1024 other
block prefetches, that may be not good for wide pools.
Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #13452
Alexander Motin [Wed, 4 May 2022 18:33:42 +0000 (14:33 -0400)]
Improve mg_aliquot math
When calculating mg_aliquot alike to #12046 use number of unique data
disks in the vdev, not the total number of children vdev. Increase
default value of the tunable from 512KB to 1MB to compensate.
Before this change each disk in striped pool was getting 512KB of
sequential data, in 2-wide mirror -- 1MB, in 3-wide RAIDZ1 -- 768KB.
After this change in all the cases each disk should get 1MB.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #13388
Alexander Motin [Tue, 26 Apr 2022 17:44:21 +0000 (13:44 -0400)]
Improve log spacemap load time
Previous flushing algorithm limited only total number of log blocks to
the minimum of 256K and 4x number of metaslabs in the pool. As result,
system with 1500 disks with 1000 metaslabs each, touching several new
metaslabs each TXG could grow spacemap log to huge size without much
benefits. We've observed one of such systems importing pool for about
45 minutes.
This patch improves the situation from five sides:
- By limiting maximum period for each metaslab to be flushed to 1000
TXGs, that effectively limits maximum number of per-TXG spacemap logs
to load to the same number.
- By making flushing more smooth via accounting number of metaslabs
that were touched after the last flush and actually need another flush,
not just ms_unflushed_txg bump.
- By applying zfs_unflushed_log_block_pct to the number of metaslabs
that were touched after the last flush, not all metaslabs in the pool.
- By aggressively prefetching per-TXG spacemap logs up to 16 TXGs in
advance, making log spacemap load process for wide HDD pool CPU-bound,
accelerating it by many times.
- By reducing zfs_unflushed_log_block_max from 256K to 128K, reducing
single-threaded by nature log processing time from ~10 to ~5 minutes.
As further optimization we could skip bumping ms_unflushed_txg for
metaslabs not touched since the last flush, but that would be an
incompatible change, requiring new pool feature.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #12789
Alexander Motin [Fri, 4 Feb 2022 21:06:38 +0000 (16:06 -0500)]
Add more control/visibility to spa_load_verify().
Use error thresholds from policy to control whether to scrub data
and/or metadata. If threshold is set to UINT64_MAX, then caller
probably does not care about result and we may skip that part.
By default import neither set the data error threshold nor read
the error counter, so skip the data scrub for faster import.
Metadata are still scrubbed and fail if even single error found.
While there just for symmetry return number of metadata errors in
case threshold is not set to zero and we haven't reached it.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #13022
Allan Jude [Tue, 26 Oct 2021 23:15:38 +0000 (19:15 -0400)]
spa.c: Replace VERIFY(nvlist_*(...) == 0) with fnvlist_* (#12678)
The fnvlist versions of the functions are fatal if they fail,
saving each call from having to include checking the result.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Allan Jude <allan@klarasystems.com>
Alexander Motin [Tue, 27 Jul 2021 23:05:47 +0000 (19:05 -0400)]
Avoid small buffer copying on write
It is wrong for arc_write_ready() to use zfs_abd_scatter_enabled to
decide whether to reallocate/copy the buffer, because the answer is
OS-specific and depends on the buffer size. Instead of that use
abd_size_alloc_linear(), moved into public header.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #12425
Alexander Motin [Thu, 1 Jul 2021 15:16:54 +0000 (11:16 -0400)]
Remove refcount from spa_config_*()
The only reason for spa_config_*() to use refcount instead of simple
non-atomic (thanks to scl_lock) variable for scl_count is tracking,
hard disabled for the last 8 years. Switch to simple int scl_count
reduces the lock hold time by avoiding atomic, plus makes structure
fit into single cache line, reducing the locks contention.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #12287
Brian Behlendorf [Thu, 14 Jul 2022 17:21:29 +0000 (10:21 -0700)]
Scrub mirror children without BPs
When scrubbing a raidz/draid pool, which contains a replacing or
sparing mirror with multiple online children, only one child will
be read. This is not normally a serious concern because the DTL
records are used to determine where a good copy of the data is.
As long as the data can be read from one child the mirror vdev
will use it to repair gaps in any of its children. Furthermore,
even if the data which was read is corrupt the raidz code will
detect this and issue its own repair I/O to correct the damage
in the mirror vdev.
However, in the scenario where the DTL is wrong due to silent
data corruption (say due to overwriting one child) and the scrub
happens to read from a child with good data, then the other damaged
mirror child will not be detected nor repaired.
While this is possible for both raidz and draid vdevs, it's most
pronounced when using draid. This is because by default the zed
will sequentially rebuild a draid pool to a distributed spare,
and the distributed spare half of the mirror is always preferred
since it delivers better performance. This means the damaged
half of the mirror will go undetected even after scrubbing.
For system administrations this behavior is non-intuitive and in
a worst case scenario could result in the only good copy of the
data being unknowingly detached from the mirror.
This change resolves the issue by reading all replacing/sparing
mirror children when scrubbing. When the BP isn't available for
verification, then compare the data buffers from each child. They
must all be identical, if not there's silent damage and an error
is returned to prompt the top-level vdev to issue a repair I/O to
rewrite the data on all of the mirror children. Since we can't
tell which child was wrong a checksum error is logged against the
replacing or sparing mirror vdev.
Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13555
Alexander Motin [Fri, 10 Jun 2022 17:01:46 +0000 (13:01 -0400)]
Improve sorted scan memory accounting
Since we use two B-trees q_exts_by_size and q_exts_by_addr, we should
count 2x sizeof (range_seg_gap_t) per node. And since average B-tree
memory efficiency is about 75%, we should increase it to 3x.
Previous code under-counted up to 30% of the memory usage.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #13537
Rich Ercolani [Wed, 4 May 2022 18:59:30 +0000 (14:59 -0400)]
Corrected edge case in uncompressed ARC->L2ARC handling
I genuinely don't know why this didn't come up before,
but adding the LZ4 early abort pointed out this flaw,
in which we're allocating a buffer of one size, and
then telling the compressor that we're handing it buffers
of a different size, which may be Very Different - say,
allocating 512b and then telling it the inputs are 128k.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #13375
Alexander Motin [Wed, 1 Jun 2022 16:54:35 +0000 (12:54 -0400)]
Remove wrong assertion in log spacemap
It is typical, but not generally true that if log summary has more
blocks it must also have unflushed metaslabs. Normally with metaslabs
flushed in order it works, but there are known exceptions, such as
device removal or metaslab being loaded during its flush attempt.
Before 600a02b8844 if spa_flush_metaslabs() hit loading metaslab it
usually stopped (unless memlimit is also exceeded), but now it may
flush more metaslabs, just skipping that particular one. This
increased chances of assertion to fire when the skipped metaslab is
flushed on next iteration if all other metaslabs in that summary
entry are already flushed out of order.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #13486
Closes #13513
Ryan Moeller [Fri, 18 Feb 2022 21:09:03 +0000 (16:09 -0500)]
libzfs: Fail making a dataset handle gracefully
When a dataset is in the process of being received it gets marked as
inconsistent and should not be used. We should check for this when
opening a dataset handle in libzfs and return with an appropriate error
set, rather than hitting an abort because of the incomplete data.
zfs_open() passes errno to zfs_standard_error() after observing
make_dataset_handle() fail, which ends up aborting if errno is 0.
Set errno before returning where we know it has not been set already.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes #13077