chrisperedun [Thu, 21 Dec 2023 19:12:30 +0000 (14:12 -0500)]
Don't panic on unencrypted block in encrypted dataset
While 763ca47 closes the situation of block cloning creating
unencrypted records in encrypted datasets, existing data still causes
panic on read. Setting zfs_recover bypasses this but at the cost of
potentially ignoring more serious issues.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Peredun <chris.peredun@ixsystems.com>
Closes #15677
Alexander Motin [Thu, 21 Dec 2023 18:54:44 +0000 (13:54 -0500)]
ZIL: Improve next log block size prediction
Track history in context of bursts, not individual log blocks. It
allows to not blow away all the history by single large burst of
many block, and same time allows optimizations covering multiple
blocks in a burst and even predicted following burst. For each
burst account its optimal block size and minimal first block size.
Use that statistics from the last 8 bursts to predict first block
size of the next burst.
Remove predefined set of block sizes. Allocate any size we see fit,
multiple of 4KB, as required by ZIL now. With compression enabled
by default, ZFS already writes pretty random block sizes, so this
should not surprise space allocator any more.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15635
Tony Hutter [Thu, 21 Dec 2023 18:53:16 +0000 (10:53 -0800)]
zpool: Add slot power control, print power status
Add `zpool` flags to control the slot power to drives. This assumes
your SAS or NVMe enclosure supports slot power control via sysfs.
The new `--power` flag is added to `zpool offline|online|clear`:
zpool offline --power <pool> <device> Turn off device slot power
zpool online --power <pool> <device> Turn on device slot power
zpool clear --power <pool> [device] Turn on device slot power
If the ZPOOL_AUTO_POWER_ON_SLOT env var is set, then the '--power'
option is automatically implied for `zpool online` and `zpool clear`
and does not need to be passed.
zpool status also gets a --power option to print the slot power status.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mart Frauenlob <AllKind@fastest.cc> Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #15662
Rob N [Wed, 20 Dec 2023 22:17:14 +0000 (09:17 +1100)]
spa: make read/write queues configurable
We are finding that as customers get larger and faster machines
(hundreds of cores, large NVMe-backed pools) they keep hitting
relatively low performance ceilings. Our profiling work almost always
finds that they're running into bottlenecks on the SPA IO taskqs.
Unfortunately there's often little we can advise at that point, because
there's very few ways to change behaviour without patching.
This commit adds two load-time parameters `zio_taskq_read` and
`zio_taskq_write` that can configure the READ and WRITE IO taskqs
directly.
This achieves two goals: it gives operators (and those that support
them) a way to tune things without requiring a custom build of OpenZFS,
which is often not possible, and it lets us easily try different config
variations in a variety of environments to inform the development of
better defaults for these kind of systems.
Because tuning the IO taskqs really requires a fairly deep understanding
of how IO in ZFS works, and generally isn't needed without a pretty
serious workload and an ability to identify bottlenecks, only minimal
documentation is provided. Its expected that anyone using this is going
to have the source code there as well.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #15675
Rob Norris [Sat, 16 Dec 2023 13:36:21 +0000 (00:36 +1100)]
Linux 6.7 compat: rework shrinker setup for heap allocations
6.7 changes the shrinker API such that shrinkers must be allocated
dynamically by the kernel. To accomodate this, this commit reworks
spl_register_shrinker() to do something similar against earlier kernels.
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://github.com/sponsors/robn
Closes #15681
Rob Norris [Sat, 16 Dec 2023 06:39:07 +0000 (17:39 +1100)]
Linux 6.7 compat: handle superblock shrinker member change
In 6.7 the superblock shrinker member s_shrink has changed from being an
embedded struct to a pointer. Detect this, and don't take a reference if
it already is one.
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://github.com/sponsors/robn
Closes #15681
Rob Norris [Sat, 16 Dec 2023 11:31:32 +0000 (22:31 +1100)]
Linux 6.7 compat: use inode atime/mtime accessors
6.6 made i_ctime inaccessible; 6.7 has done the same for i_atime and
i_mtime. This extends the method used for ctime in b37f29341 to atime
and mtime as well.
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://github.com/sponsors/robn
Closes #15681
Rob Norris [Sat, 16 Dec 2023 07:01:45 +0000 (18:01 +1100)]
Linux 6.7 compat: simplify current_time() check
6.7 changed the names of the time members in struct inode, so we can't
assign back to it because we don't know its name. In practice this
doesn't matter though - if we're missing current_time(), then we must be
on <4.9, and we know our fallback will need to return timespec.
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://github.com/sponsors/robn
Closes #15681
Umer Saleem [Fri, 15 Dec 2023 22:18:27 +0000 (03:18 +0500)]
Test LWB buffer overflow for block cloning
PR#15634 removes 128K into 2x68K LWB split optimization, since it
was found to cause LWB buffer overflow while trying to write 128KB
TX_CLONE_RANGE record with 1022 block pointers into 68KB buffer,
with multiple VDEVs ZIL.
This commit adds a test for this particular scenario by writing
maximum sizes TX_CLONE_RANE record with 1022 block pointers into
68KB buffer, with two SLOG devices.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #15672
Alexander Motin [Fri, 15 Dec 2023 17:51:41 +0000 (12:51 -0500)]
dmu: Allow buffer fills to fail
When ZFS overwrites a whole block, it does not bother to read the
old content from disk. It is a good optimization, but if the buffer
fill fails due to page fault or something else, the buffer ends up
corrupted, neither keeping old content, nor getting the new one.
On FreeBSD this is additionally complicated by page faults being
blocked by VFS layer, always returning EFAULT on attempt to write
from mmap()'ed but not yet cached address range. Normally it is
not a big problem, since after original failure VFS will retry the
write after reading the required data. The problem becomes worse
in specific case when somebody tries to write into a file its own
mmap()'ed content from the same location. In that situation the
only copy of the data is getting corrupted on the page fault and
the following retries only fixate the status quo. Block cloning
makes this issue easier to reproduce, since it does not read the
old data, unlike traditional file copy, that may work by chance.
This patch provides the fill status to dmu_buf_fill_done(), that
in case of error can destroy the corrupted buffer as if no write
happened. One more complication in case of block cloning is that
if error is possible during fill, dmu_buf_will_fill() must read
the data via fall-back to dmu_buf_will_dirty(). It is required
to allow in case of error restoring the buffer to a state after
the cloning, not not before it, that would happen if we just call
dbuf_undirty().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15665
Alexander Motin [Tue, 12 Dec 2023 20:59:24 +0000 (15:59 -0500)]
dbuf: Set dr_data when unoverriding after clone
Block cloning normally creates dirty record without dr_data. But if
the block is read after cloning, it is moved into DB_CACHED state and
receives the data buffer. If after that we call dbuf_unoverride()
to convert the dirty record into normal write, we should give it the
data buffer from dbuf and release one.
Reviewed-by: Kay Pedersen <mail@mkwg.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15654
Closes #15656
Alexander Motin [Tue, 12 Dec 2023 20:53:59 +0000 (15:53 -0500)]
dbuf: Handle arcbuf assignment after block cloning
In some cases dbuf_assign_arcbuf() may be called on a block that
was recently cloned. If it happened in current TXG we must undo
the block cloning first, since the only one dirty record per TXG
can't and shouldn't mean both cloning and overwrite same time.
Reviewed-by: Kay Pedersen <mail@mkwg.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15653
Chunwei Chen [Mon, 11 Dec 2023 22:42:06 +0000 (14:42 -0800)]
For db_marker inherit the db pointer for AVL comparision.
While evicting dbufs of a dnode, a marker node is added to the AVL.
The marker node should be inserted in AVL tree ahead of the dbuf its
trying to delete. The blkid and level is used to ensure this. However,
this could go wrong there's another dbufs with the same blkid and level
in DB_EVICTING state but not yet removed from AVL tree. dbuf_compare()
could fail to give the right location or could cause confusion and
trigger ASSERTs.
To ensure that the marker is inserted before the deleting dbuf, use
the pointer value of the original dbuf for comparision.
Tony Hutter [Mon, 11 Dec 2023 17:59:59 +0000 (09:59 -0800)]
ZTS: Add dirty dnode stress test
Add a test for the dirty dnode SEEK_HOLE/SEEK_DATA bug described in
https://github.com/openzfs/zfs/issues/15526
The bug was fixed in https://github.com/openzfs/zfs/pull/15571 and
was backported to 2.2.2 and 2.1.14. This test case is just to
make sure it does not come back.
seekflood.c originally written by Rob Norris.
Reviewed-by: Graham Perrin <grahamperrin@freebsd.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #15608
The io_uring test fails on CentOS 9 with the following fio error.
Disable the test for the benefit of the CI until this can be fully
investigated. This basic test passes as expected on newer kernels.
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #15636
Alexander Motin [Sat, 9 Dec 2023 00:43:39 +0000 (19:43 -0500)]
DMU: Fix lock leak on dbuf_hold() error
dmu_assign_arcbuf_by_dnode() should drop dn_struct_rwlock lock in
case dbuf_hold() failed. I don't have reproduction for this, but
it looks inconsistent with dmu_buf_hold_noread_by_dnode() and co.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15644
Ameer Hamza [Thu, 30 Nov 2023 20:14:56 +0000 (01:14 +0500)]
ZTS: Add test cases for block cloning replay
Reviewed-by: Kay Pedersen <mail@mkwg.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #15614
Ameer Hamza [Wed, 6 Dec 2023 20:18:43 +0000 (01:18 +0500)]
ZTS: block_cloning: Use numeric sort for get_same_blocks
Reviewed-by: Kay Pedersen <mail@mkwg.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #15614
zed: fix typo in variable ZED_POWER_OFF_ENCLO*US*RE_SLOT_ON_FAULT
Replace ENCLO_US_RE with ENCLO_SU_RE in the name of the variable.
Note this changes the user-visible string in zed.rc, thus might
break current users with the wrong string, but it's ~2 months
since zfs-2.2.0 tag is out, thus should not be widespread yet.
Mark Johnston [Thu, 7 Dec 2023 16:20:11 +0000 (11:20 -0500)]
FreeBSD: Ensure that zfs_getattr() initializes the va_rdev field
Otherwise the field is left uninitialized, leading to a possible kernel
memory disclosure to userspace or to the network. Use the same
initialization value we use in zfsctl_common_getattr().
Reported-by: KMSAN Sponsored-by: The FreeBSD Foundation Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ed Maste <emaste@FreeBSD.org> Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #15639
Alexander Motin [Wed, 6 Dec 2023 23:37:27 +0000 (18:37 -0500)]
BRT: Limit brt_vdev_dump() to only one vdev
Without this patch on pool of 60 vdevs with ZFS_DEBUG enabled clone
takes much more time than copy, while heavily trashing dbgmsg for
no good reason, repeatedly dumping all vdevs BRTs again and again,
even unmodified ones.
I am generally not sure this dumping is not excessive, but decided
to keep it for now, just restricting its scope to more reasonable.
Reviewed-by: Kay Pedersen <mail@mkwg.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15625
Alexander Motin [Wed, 6 Dec 2023 23:02:05 +0000 (18:02 -0500)]
ZIL: Remove 128K into 2x68K LWB split optimization
To improve 128KB block write performance in case of multiple VDEVs
ZIL used to spit those writes into two 64KB ones. Unfortunately it
was found to cause LWB buffer overflow, trying to write maximum-
sizes 128KB TX_CLONE_RANGE record with 1022 block pointers into
68KB buffer, since unlike TX_WRITE ZIL code can't split it.
This is a minimally-invasive temporary block cloning fix until the
following more invasive prediction code refactoring.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15634
Alexander Motin [Wed, 6 Dec 2023 20:39:12 +0000 (15:39 -0500)]
zdb: Dump encrypted write and clone ZIL records
Block pointers are not encrypted in TX_WRITE and TX_CLONE_RANGE
records, so we can dump them, that may be useful for debugging.
Related to #15543. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15629
Shengqi Chen [Wed, 6 Dec 2023 20:37:50 +0000 (04:37 +0800)]
compact: workaround for GPL-only symbols on riscv from Linux 6.2
Since Linux 6.2, the implementation of flush_dcache_page on riscv
references GPL-only symbol `PageHuge`, breaking the build of zfs.
This patch uses existing mechanism to override flush_dcache_page,
removing the call to `PageHuge`. According to comments in kernel,
it is only used to do some check against HugeTLB pages, which only
exist in userspace. ZFS uses flush_dcache_page only on kernel pages,
thus this patch will not introduce any behaviour change.
Don Brady [Tue, 5 Dec 2023 22:27:56 +0000 (15:27 -0700)]
Extend import_progress kstat with a notes field
Detail the import progress of log spacemaps as they can take a very
long time. Also grab the spa_note() messages to, as they provide
insight into what is happening
Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@klarasystems.com> Co-authored-by: Allan Jude <allan@klarasystems.com>
Closes #15539
Shengqi Chen [Tue, 5 Dec 2023 20:01:09 +0000 (04:01 +0800)]
module/icp/asm-arm/sha2: enable non-SIMD asm kernels on armv5/6
My merged pull request #15557 fixes compilation of sha2 kernels on arm
v5/6. However, the compiler guards only allows sha256/512_armv7_impl to
be used when __ARM_ARCH > 6. This patch enables these ASM kernels on all
arm architectures. Some compiler guards are adjusted accordingly to
avoid the unnecessary compilation of SIMD (e.g., neon, armv8ce) kernels
on old architectures.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Closes #15623
Rob N [Tue, 5 Dec 2023 19:53:14 +0000 (06:53 +1100)]
zpool: flush output before sleeping
Several zpool commands (status, list, iostat) have modes that present
some information, sleep a while, present the current state, sleep, etc.
Some of those had ways to invoke them that when piped would appear to do
nothing for a while, because non-terminals are block-buffered, not
line-buffered, by default. Fix this by forcing a flush before sleeping.
In particular, all of these buffered:
- zpool status <pool> <interval>
- zpool iostat -y<m> <pool> <interval>
- zpool list <pool> <interval>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #15593
oromenahar [Tue, 5 Dec 2023 19:03:48 +0000 (20:03 +0100)]
Allow block cloning across encrypted datasets
When two datasets share the same master encryption key, it is safe
to clone encrypted blocks. Currently only snapshots and clones
of a dataset share with it the same encryption key.
Added a test for:
- Clone from encrypted sibling to encrypted sibling with
non encrypted parent
- Clone from encrypted parent to inherited encrypted child
- Clone from child to sibling with encrypted parent
- Clone from snapshot to the original datasets
- Clone from foreign snapshot to a foreign dataset
- Cloning from non-encrypted to encrypted datasets
- Cloning from encrypted to non-encrypted datasets
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Original-patch-by: Pawel Jakub Dawidek <pawel@dawidek.net> Signed-off-by: Kay Pedersen <mail@mkwg.de>
Closes #15544
Alexander Motin [Tue, 5 Dec 2023 18:58:11 +0000 (13:58 -0500)]
ZIL: Do not clone blocks from the future
ZIL claim can not handle block pointers cloned from the future,
since they are not yet allocated at that point. It may happen
either if the block was just written when it was cloned, or if
the pool was frozen or somehow else rewound on import.
Handle it from two sides: prevent cloning of blocks with physical
birth time from not yet synced or frozen TXG, and abort ZIL claim
if we still detect such blocks due to rewind or something else.
While there, assert that any cloned blocks we claim are really
allocated by calling metaslab_check_free().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15617
Dex Wood [Fri, 1 Dec 2023 23:25:17 +0000 (17:25 -0600)]
Add Ntfy notification support to ZED
This commit adds the zed_notify_ntfy() function and hooks it
into zed_notify(). This will allow ZED to send notifications
to ntfy.sh or a self-hosted Ntfy service, which can be received
on a desktop or mobile device. It is configured with ZED_NTFY_TOPIC,
ZED_NTFY_URL, and ZED_NTFY_ACCESS_TOKEN variables in zed.rc.
Alexander Motin [Fri, 1 Dec 2023 23:23:20 +0000 (18:23 -0500)]
ZIL: Remove TX_CLONE_RANGE replay for ZVOLs.
zil_claim_clone_range() takes references on cloned blocks before ZIL
replay. Later zil_free_clone_range() drops them after replay or on
dataset destroy. The total balance is neutral. It means we do not
need to do anything (drop the references) for not implemented yet
TX_CLONE_RANGE replay for ZVOLs.
This is a logical follow up to #15603.
Reviewed-by: Kay Pedersen <mail@mkwg.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15612
Alexander Motin [Fri, 1 Dec 2023 19:50:10 +0000 (14:50 -0500)]
ZIO: Add overflow checks for linear buffers
Since we use a limited set of kmem caches, quite often we have unused
memory after the end of the buffer. Put there up to a 512-byte canary
when built with debug to detect buffer overflows at the free time.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15553
zil_claim_clone_range() takes references on cloned blocks before ZIL
replay. Later zil_free_clone_range() drops them after replay or on
dataset destroy. The total balance is neutral. It means on actual
replay we must take additional references, which would stay in BRT.
Without this blocks could be freed prematurely when either original
file or its clone are destroyed. I've observed BRT being emptied
and the feature being deactivated after ZIL replay completion, which
should not have happened. With the patch I see expected stats.
Reviewed-by: Kay Pedersen <mail@mkwg.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15603
Wraithh [Wed, 29 Nov 2023 17:55:17 +0000 (19:55 +0200)]
Fix zoneid when USER_NS is disabled
getzoneid() should return GLOBAL_ZONEID instead of 0 when USER_NS is disabled.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ilkka Sovanto <github@ilkka.kapsi.fi>
Closes #15560
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Warner Losh <imp@FreeBSD.org> Signed-off-by: Martin Matuska <mm@FreeBSD.org>
Closes #15606
Shengqi Chen [Wed, 22 Nov 2023 13:58:47 +0000 (21:58 +0800)]
module/icp/asm-arm/sha2: auto detect __ARM_ARCH
This patch uses __ARM_ARCH set by compiler (both
GCC and Clang have this) whenever possible instead
of hardcoding it to 7. This change allows code to
compile on earlier ARM architectures such as armv5te.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Closes #15557
Linux 6.6 compat: fix configure error with clang (#15558)
With Linux v6.6.x and clang 16, a configure step fails on a warning that
later results in an error while building, due to 'ts' being
uninitialized. Add a trivial initialization to silence the warning.
Signed-off-by: Jaron Kent-Dobias <jaron@kent-dobias.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Rob N [Tue, 28 Nov 2023 17:53:04 +0000 (04:53 +1100)]
dmu_buf_will_clone: fix race in transition back to NOFILL
Previously, dmu_buf_will_clone() would roll back any dirty record, but
would not clean out the modified data nor reset the state before
releasing the lock. That leaves the last-written data in db_data, but
the dbuf in the wrong state.
This is eventually corrected when the dbuf state is made NOFILL, and
dbuf_noread() called (which clears out the old data), but at this point
its too late, because the lock was already dropped with that invalid
state.
Any caller acquiring the lock before the call into
dmu_buf_will_not_fill() can find what appears to be a clean, readable
buffer, and would take the wrong state from it: it should be getting the
data from the cloned block, not from earlier (unwritten) dirty data.
Even after the state was switched to NOFILL, the old data was still not
cleaned out until dbuf_noread(), which is another gap for a caller to
take the lock and read the wrong data.
This commit fixes all this by properly cleaning up the previous state
and then setting the new state before dropping the lock. The
DBUF_VERIFY() calls confirm that the dbuf is in a valid state when the
lock is down.
Sponsored-by: Klara, Inc. Sponsored-By: OpenDrives Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Pawel Jakub Dawidek <pawel@dawidek.net> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #15566
Closes #15526
Matthew Ahrens [Tue, 28 Nov 2023 17:20:48 +0000 (09:20 -0800)]
unnecessary alloc/free in dsl_scan_visitbp()
Clean up code in dsl_scan_visitbp() by removing an unnecessary
alloc/free and `goto`. This has the side benefit of reducing CPU usage,
which is only really noticeable if we are not doing i/o for the leaf
blocks, like when `zfs_no_scrub_io` is set.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #15549
Rob N [Tue, 28 Nov 2023 17:07:57 +0000 (04:07 +1100)]
dnode_is_dirty: check dnode and its data for dirtiness
Over its history this the dirty dnode test has been changed between
checking for a dnodes being on `os_dirty_dnodes` (`dn_dirty_link`) and
`dn_dirty_record`.
de198f2d9 Fix lseek(SEEK_DATA/SEEK_HOLE) mmap consistency 2531ce372 Revert "Report holes when there are only metadata changes" ec4f9b8f3 Report holes when there are only metadata changes 454365bba Fix dirty check in dmu_offset_next() 66aca2473 SEEK_HOLE should not block on txg_wait_synced()
In the case of appending data to a newly created file, the dnode proper
is dirtied (at least to change the blocksize) and dirty records are
added. Thus, a single logical operation is represented by separate
dirty indicators, and must not be separated.
The incorrect dirty check becomes a problem when the first block of a
file is being appended to while another process is calling lseek to skip
holes. There is a small window where the dnode part is undirtied while
there are still dirty records. In this case, `lseek(fd, 0, SEEK_DATA)`
would not know that the file is dirty, and would go to
`dnode_next_offset()`. Since the object has no data blocks yet, it
returns `ESRCH`, indicating no data found, which results in `ENXIO`
being returned to `lseek()`'s caller.
Since coreutils 9.2, `cp` performs sparse copies by default, that is, it
uses `SEEK_DATA` and `SEEK_HOLE` against the source file and attempts to
replicate the holes in the target. When it hits the bug, its initial
search for data fails, and it goes on to call `fallocate()` to create a
hole over the entire destination file.
This has come up more recently as users upgrade their systems, getting
OpenZFS 2.2 as well as a newer coreutils. However, this problem has been
reproduced against 2.1, as well as on FreeBSD 13 and 14.
This change simply updates the dirty check to check both types of dirty.
If there's anything dirty at all, we immediately go to the "wait for
sync" stage, It doesn't really matter after that; both changes are on
disk, so the dirty fields should be correct.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #15571
Closes #15526
rmacklem [Tue, 28 Nov 2023 00:31:03 +0000 (16:31 -0800)]
FreeBSD: Fix ZFS so that snapshots under .zfs/snapshot are NFS visible
Call vfs_exjail_clone() for mounts created under .zfs/snapshot
to fill in the mnt_exjail field for the mount. If this is not
done, the snapshots under .zfs/snapshot with not be accessible
over NFS.
This version has the argument name in vfs.h fixed to match that
of the name in spl_vfs.c, although it really does not matter.
External-issue: https://reviews.freebsd.org/D42672 Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca>
Closes #15563
Akash B [Mon, 27 Nov 2023 21:41:58 +0000 (03:11 +0530)]
zdb: Fix zdb '-O|-r' options with -e/exported zpool
zdb with '-e' or exported zpool doesn't work along with
'-O' and '-r' options as we process them before '-e' has
been processed.
Below errors are seen:
~> zdb -e pool-mds65/mdt65 -O oi.9/0x200000009:0x0:0x0
failed to hold dataset 'pool-mds65/mdt65': No such file or directory
~> zdb -e pool-oss0/ost0 -r file1 /tmp/filecopy1 -p.
failed to hold dataset 'pool-oss0/ost0': No such file or directory
zdb: internal error: No such file or directory
We need to make sure to process '-O|-r' options after the
'-e' option has been processed, which imports the pool to
the namespace if it's not in the cachefile.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Akash B <akash-b@hpe.com>
Closes #15532
Rob Norris [Sat, 18 Nov 2023 10:33:45 +0000 (21:33 +1100)]
zdb: show BRT statistics and dump its contents
Same idea as the dedup stats, but for block cloning.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Kay Pedersen <mail@mkwg.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #15541
Rob Norris [Sat, 18 Nov 2023 10:32:16 +0000 (21:32 +1100)]
brt: lift internal definitions into _impl header
So that zdb (and others!) can get at the BRT on-disk structures.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Kay Pedersen <mail@mkwg.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #15541
Tony Hutter [Mon, 27 Nov 2023 21:24:37 +0000 (13:24 -0800)]
ZTS: Fix zfs_load-key failures on F39
The zfs_load-key tests were failing on F39 due to their use of the
deprecated ssl.wrap_socket function. This commit updates the test to
instead use ssl.SSLContext() as described in
https://stackoverflow.com/a/65194957.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #15534
Closes #15550
AllKind [Mon, 27 Nov 2023 21:17:48 +0000 (22:17 +0100)]
zfs-dkms: fix shell-init error message
If all zfs dkms modules have been removed, a shell-init error message
may appear, because /var/lib/dkms/zfs does no longer exist.
Resolve this by leaving the directory earlier on.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mart Frauenlob <AllKind@fastest.cc>
Closes #15576
Alexander Motin [Mon, 27 Nov 2023 21:16:59 +0000 (16:16 -0500)]
ZVOL: Minor code cleanup
- Remove zsda_tx field, it is used only once.
- Remove unneeded string lengths checks, all names are terminated.
- Replace few explicit MAXNAMELEN usages with sizeof().
- Change dsname from MAXNAMELEN to ZFS_MAX_DATASET_NAME_LEN, as
expected by dsl_dataset_name(). Both are 256 bytes now, but it is
better to be safe.
This should have no functional difference.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15535
Alan Somers [Mon, 27 Nov 2023 20:58:03 +0000 (13:58 -0700)]
FreeBSD: Fix the build on FreeBSD 12
It was broken for several reasons:
* VOP_UNLOCK lost an argument in 13.0. So OpenZFS should be using
VOP_UNLOCK1, but a few direct calls to VOP_UNLOCK snuck in.
* The location of the zlib header moved in 13.0 and 12.1. We can drop
support for building on 12.0, which is EoL.
* knlist_init lost an argument in 13.0. OpenZFS change 9d0887402ba
assumed 13.0 or later.
* FreeBSD 13.0 added copy_file_range, and OpenZFS change 67a1b037915
assumed 13.0 or later.
Sponsored-by: Axcient Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Alan Somers <asomers@gmail.com>
Closes #15551
Alexander Motin [Wed, 22 Nov 2023 18:15:32 +0000 (13:15 -0500)]
ZIL: Refactor TX_WRITE encryption similar to TX_CLONE_RANGE
It should be purely textual change to make the code more readable.
Should cause no functional difference.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Tom Caputi <caputit1@tcnj.edu> Reviewed-by: Sean Eric Fagan <sef@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Edmund Nadolski <edmund.nadolski@ixsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15543
Closes #15513
Alexander Motin [Sun, 19 Nov 2023 01:01:03 +0000 (20:01 -0500)]
ZIL: Do not encrypt block pointers in lr_clone_range_t
In case of crash cloned blocks need to be claimed on pool import.
It is only possible if they (lr_bps) and their count (lr_nbps) are
not encrypted but only authenticated, similar to block pointer in
lr_write_t. Few other fields can be and are still encrypted.
This should fix panic on ZIL claim after crash when block cloning
is actively used.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Tom Caputi <caputit1@tcnj.edu> Reviewed-by: Sean Eric Fagan <sef@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Edmund Nadolski <edmund.nadolski@ixsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15543
Closes #15513
Don Brady [Mon, 27 Nov 2023 17:46:38 +0000 (10:46 -0700)]
Don't allow attach to a raidz child vdev
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes #15536
Closes #15564
Tony Hutter [Tue, 21 Nov 2023 00:07:32 +0000 (16:07 -0800)]
ZTS: Fix 'could not unmount datasets' on Alma 9 (#15542)
Many tests are failing on AlmaLinux 9 because ZTS could not destroy the
pool in cleanup. This was due to $PWD being set to '.' instead of the
expected full path. This patch sets $PWD to the full path.
Signed-off-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Don Brady <don.brady@delphix.com>
Brooks Davis [Fri, 17 Nov 2023 22:02:09 +0000 (14:02 -0800)]
freebsd: remove __FBSDID macro use
With FreeBSD's switch to git the $FreeBSD$ string is no longer expanded
and they have mostly been removed upstream. Stop using __FBSDID and
remove the no-longer needed sys/cdefs.h includes.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brooks Davis <brooks.davis@sri.com>
Closes #15527
Alexander Motin [Fri, 17 Nov 2023 22:00:59 +0000 (17:00 -0500)]
ZIO: Optimize zio_flush()
- Generalize vdev_nowritecache handling by traversing through the
VDEV tree and skipping children ZIOs where not supported.
- Remove intermediate zio_null() in case of several VDEV children.
- Remove children handling from zio_ioctl(). There are no other
use cases for this code beside DKIOCFLUSHWRITECACHED, and would there
be, I doubt they would so straightforward apply to all VDEV children.
Comparing to removed previous optimization this should improve cases
of redundant ZILs/SLOGs.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15515
Rob N [Fri, 17 Nov 2023 21:25:53 +0000 (08:25 +1100)]
Consider `dnode_t` allocations in dbuf cache size accounting
Entries in the dbuf cache contribute only the size of the dbuf data to
the cache size. Attached "user" data is not counted. This can lead to
the data currently "owned" by the cache consuming more memory accounting
appears to show. In some cases (eg a metadnode data block with all child
dnode_t slots allocated), the actual size can be as much as 3x as what
the cache believes it to be.
This is arguably correct behaviour, as the cache is only tracking the
size of the dbuf data, not even the overhead of the dbuf_t. On the other
hand, in the above case of dnodes, evicting cached metadnode dbufs is
the only current way to reclaim the dnode objects, and can lead to the
situation where the dbuf cache appears to be comfortably within its
target memory window and yet is holding enormous amounts of slab memory
that cannot be reclaimed.
This commit adds a facility for a dbuf user to artificially inflate the
apparent size of the dbuf for caching purposes. This at least allows for
cache tuning to be adjusted to match something closer to the real memory
overhead.
metadnode dbufs carry a >1KiB allocation per dnode in their user data.
This informs the dbuf cache machinery of that fact, allowing it to make
better decisions when evicting dbufs.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #15511
Alexander Motin [Tue, 14 Nov 2023 21:47:57 +0000 (16:47 -0500)]
L2ARC: Restrict write size to 1/4 of the device
PR #15457 exposed weird logic in L2ARC write sizing. If it appeared
bigger than device size, instead of liming write it reset all the
system-wide tunables to their default. Aside of being excessive,
it did not actually help with the problem, still allowing infinite
loop to happen.
This patch removes the tunables reverting logic, but instead limits
L2ARC writes (or at least eviction/trim) to 1/4 of the capacity.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15519
Tony Hutter [Mon, 13 Nov 2023 00:26:07 +0000 (16:26 -0800)]
Workaround UBSAN errors for variable arrays
This gets around UBSAN errors when using arrays at the end of
structs. It converts some zero-length arrays to variable length
arrays and disables UBSAN checking on certain modules.
It is based off of the patch from #15460.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Tested-by: Thomas Lamprecht <t.lamprecht@proxmox.com> Co-authored-by: Thomas Lamprecht <t.lamprecht@proxmox.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Issue #15145
Closes #15510
Alexander Motin [Fri, 10 Nov 2023 18:34:46 +0000 (13:34 -0500)]
Linux: Reclaim unused spl_kmem_cache_reclaim
It is unused for 3 years since #10576.
Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15507
Umer Saleem [Wed, 8 Nov 2023 11:00:59 +0000 (16:00 +0500)]
ZTS: Test for all known zpool feature sets
zpool_create_features_007_pos only tested for compat-2020 feature
set. It would be useful to test for all known features sets. If
any additional feature is found enabled that is not present in
compatibility list or feature set, it should be caught and
reported earlier.
This commit also removes encryption from openzfsonosx-1.8.1
compatibility list. Encryption enables bookmark_v2, since it is
a dependency of encryption, but not listed in openzfsonoxx-1.8.1
compatibility list.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #15505
shodanshok [Thu, 9 Nov 2023 00:30:47 +0000 (01:30 +0100)]
Increase L2ARC write rate and headroom
Current L2ARC write rate and headroom parameters are very conservative:
l2arc_write_max=8M and l2arc_headroom=2 (ie: a full L2ARC writes at
8 MB/s, scanning 16/32 MB of ARC tail each time; a warming L2ARC runs
at 2x these rates).
These values were selected 15+ years ago based on then-current SSDs
size, performance and endurance. Today we have multi-TB, fast and
cheap SSDs which can sustain much higher read/write rates.
For this reason, this patch increases l2arc_write_max to 32M and
l2arc_headroom to 8 (4x increase for both).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Closes #15457
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Martin Matuska <mm@FreeBSD.org>
Closes #15504
Low-power [Wed, 8 Nov 2023 20:19:38 +0000 (04:19 +0800)]
Linux: reject read/write mapping to immutable file only on VM_SHARED
Private read/write mapping can't be used to modify the mapped files, so
they will remain be immutable. Private read/write mappings are usually
used to load the data segment of executable files, rejecting them will
rendering immutable executable files to stop working.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: WHR <msl0000023508@gmail.com>
Closes #15344
AllKind [Wed, 8 Nov 2023 18:30:46 +0000 (19:30 +0100)]
Workaround to allow openzfs-zfs-dkms install on Ubuntu
As shown in #15404#issuecomment-1765002181, Ubuntu kernel has
'Provides: zfs-dkms', which will cause uninstall of the kernel, when
attempting to install openzfs-zfs-dkms.
As a workaround remove the 'Conflicts: zfs-dkms' definition from
the debian control file.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mart Frauenlob <AllKind@fastest.cc>
Closes #15503
Don Brady [Wed, 8 Nov 2023 18:19:41 +0000 (11:19 -0700)]
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation Sponsored-by: iXsystems, Inc. Sponsored-by: vStack Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Authored-by: Matthew Ahrens <mahrens@delphix.com> Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com> Contributions-by: Stuart Maybee <stuart.maybee@comcast.net> Contributions-by: Thorsten Behrens <tbehrens@outlook.com> Contributions-by: Fmstrat <nospam@nowsci.com> Contributions-by: Don Brady <dev.fs.zfs@gmail.com> Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
Umer Saleem [Tue, 7 Nov 2023 21:24:16 +0000 (02:24 +0500)]
Linux 6.6 compat: fix implicit conversion error with debug build
With Linux v6.6.0 and GCC 12, when debug build is configured,
implicit conversion error is raised while converting
'enum <anonymous>' to 'boolean_t'. Use 'B_TRUE' instead of
'true' to fix the issue.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #15489
Gordon Tetlow [Tue, 7 Nov 2023 21:21:56 +0000 (13:21 -0800)]
Add kern.features.zfs
Add a ZFS feature flag to indicate OpenZFS availability.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gordon Tetlow <gordon@freebsd.org>
Closes #15484
Jason King [Tue, 7 Nov 2023 20:11:48 +0000 (14:11 -0600)]
sa_lookup() ignores buffer size.
When retrieving a system attribute, the size of the supplied
buffer is ignored. If the buffer is too small to hold the attribute,
sa_attr_op() will write past the end of the buffer.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jason King <jking@racktopsystems.com>
Closes #15476
Umer Saleem [Tue, 7 Nov 2023 20:04:56 +0000 (01:04 +0500)]
Remove obsolete_counts from grub2 compatibility list
PR#15459 add all read-only compatible zpool features to grub2
compatibility list. 'obsolete_counts' is a read-only features that
depends on 'device_removal' feature which is not read-only and
is marked as ZFEATURE_FLAG_MOS. Creating a pool with grub2
compatibility enables 'device_removal' feature as well, which is
not desired.
This commit removes the 'obsolete_counts' feature from
grub2 compatibility list, as GRUB only supports read-only
compatible features.
Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #15499
Alexander Motin [Tue, 7 Nov 2023 19:37:18 +0000 (14:37 -0500)]
FreeBSD: Implement taskq_init_ent()
Previously taskq_init_ent() was an empty macro, while actual init
was done by taskq_dispatch_ent(). It could be slightly faster in
case taskq never enqueued. But without it taskq_empty_ent() relied
on the structure being zeroed by somebody else, that is not good.
As a side effect this allows the same task to be queued several
times, that is normal on FreeBSD, that may or may not get useful
here also one day.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15455
AllKind [Tue, 7 Nov 2023 19:27:29 +0000 (20:27 +0100)]
Fix dkms installation of deb packages created with Alien.
Alien does not honour the %posttrans hook.
So move the dkms uninstall/install scripts to the
%pre/%post hooks in case of package install/upgrade.
In case of package removal, handle that in %preun.
Add removal of all old dkms modules.
Add checking for broken 'dkms status'. Handle that as
good as possible and warn the user about it.
Also add more verbose messages about what we are doing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mart Frauenlob <AllKind@fastest.cc>
Closes #15415
Mark Johnston [Tue, 7 Nov 2023 18:24:15 +0000 (13:24 -0500)]
Make abd_raidz_gen_iterate() pass an initialized pointer to the callback
Otherwise callbacks may trigger KMSAN violations in the dlen == 0 case.
For example, raidz_syn_pq_abd() will compare an uninitialized pointer
with itself before returning. This seems harmless, but let's maintain
good hygiene and avoid passing uninitialized variables, if only to
placate KMSAN.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #15491
Tony Hutter [Tue, 7 Nov 2023 17:09:24 +0000 (09:09 -0800)]
zed: misc vdev_enc_sysfs_path fixes
There have been rare cases where the VDEV_ENC_SYSFS_PATH value that zed
gets passed is stale. To mitigate this, dynamically check the sysfs
path at the time of zed event processing, and use the dynamic value if
possible. Note that there will be other times when we can not
dynamically detect the sysfs path (like if a disk disappears) and have
to rely on the old value for things like turning on the fault LED. That
is to say, we can't just blindly use the dynamic path in every case.
Also:
- Add enclosure sysfs entry when running 'zpool add'
- Fix 'slot' and 'enc' zpool.d scripts for nvme
Reviewed-by: Don Brady <dev.fs.zfs@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #15462
MigeljanImeri [Tue, 7 Nov 2023 17:06:14 +0000 (10:06 -0700)]
Fix accounting error for pending sync IO ops in zpool iostat
Currently vdev_queue_class_length is responsible for checking how long
the queue length is, however, it doesn't check the length when a list
is used, rather it just returns whether it is empty or not. To fix this
I added a counter variable to vdev_queue_class to keep track of the sync
IO ops, and changed vdev_queue_class_length to reference this variable
instead.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: MigeljanImeri <ImeriMigel@gmail.com>
Closes #15478
ednadolski-ix [Mon, 6 Nov 2023 18:38:42 +0000 (11:38 -0700)]
Improve ZFS objset sync parallelism
As part of transaction group commit, dsl_pool_sync() sequentially calls
dsl_dataset_sync() for each dirty dataset, which subsequently calls
dmu_objset_sync(). dmu_objset_sync() in turn uses up to 75% of CPU
cores to run sync_dnodes_task() in taskq threads to sync the dirty
dnodes (files).
There are two problems:
1. Each ZVOL in a pool is a separate dataset/objset having a single
dnode. This means the objsets are synchronized serially, which
leads to a bottleneck of ~330K blocks written per second per pool.
2. In the case of multiple dirty dnodes/files on a dataset/objset on a
big system they will be sync'd in parallel taskq threads. However,
it is inefficient to to use 75% of CPU cores of a big system to do
that, because of (a) bottlenecks on a single write issue taskq, and
(b) allocation throttling. In addition, if not for the allocation
throttling sorting write requests by bookmarks (logical address),
writes for different files may reach space allocators interleaved,
leading to unwanted fragmentation.
The solution to both problems is to always sync no more and (if
possible) no fewer dnodes at the same time than there are allocators
the pool.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Edmund Nadolski <edmund.nadolski@ixsystems.com>
Closes #15197
Andrew Innes [Wed, 1 Nov 2023 22:19:44 +0000 (06:19 +0800)]
Use env var for sed
Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Andrew Innes <andrew.c12@gmail.com>
Closes #15470
siv0 [Tue, 31 Oct 2023 20:57:54 +0000 (21:57 +0100)]
Fix nfs_truncate_shares without /etc/exports.d
Calling nfs_reset_shares on Linux prints a warning:
`failed to lock /etc/exports.d/zfs.exports.lock: No such file or
directory`
when /etc/exports.d does not exist. The directory gets created, when a
filesystem is actually exported through nfs_toggle_share and
nfs_init_share. The truncation of /etc/exports.d/zfs.exports happens
unconditionally when calling `zfs mount -a` (via zfs_do_mount and
share_mount in `cmd/zfs/zfs_main.c`).
Fixing the issue only in the Linux part, since the exports file on
freebsd is in `/etc/zfs/`, which seems present on 2 FreeBSD systems I
have access to (through `/etc/zfs/compatibility.d/`), while a Debian
box does not have the directory even if `/usr/sbin/exportfs` is
present through the `nfs-kernel-server` package.
The code for exports_available is copied from nfs_available above.
Fixes: ede037cda73675f42b1452187e8dd3438fafc220
("Make zfs-share service resilient to stale exports")
Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Stoiko Ivanov <s.ivanov@proxmox.com>
Closes #15369
Closes #15468
Martin Matuška [Tue, 31 Oct 2023 20:49:41 +0000 (21:49 +0100)]
Fix block cloning between unencrypted and encrypted datasets
Block cloning from an encrypted dataset into an unencrypted dataset
and vice versa is not possible. The current code did allow cloning
unencrypted files into an encrypted dataset causing a panic when
these were accessed. Block cloning between encrypted and encrypted
is currently supported on the same filesystem only.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Kay Pedersen <mail@mkwg.de> Reviewed-by: Rob N <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Martin Matuska <mm@FreeBSD.org>
Closes #15464
Closes #15465
Umer Saleem [Tue, 31 Oct 2023 16:51:54 +0000 (21:51 +0500)]
Add all read-only compatible zpool features to grub2 compatibility
GRUB opens the boot pool in read-only mode. All read-only
compatible features for zpool can be enabled and added to
grub2 compatibility, as GRUB does not open the boot-pool
for write.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #15459
Ameer Hamza [Tue, 17 Oct 2023 19:19:58 +0000 (00:19 +0500)]
zvol: fix delayed update to block device ro entry
The change in the zvol readonly property does not update the block
device readonly entry until the first IO to the ZVOL. This patch
addresses the issue by updating the block device readonly property
from the set property IOCTL call.
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #15409
Ameer Hamza [Tue, 24 Oct 2023 21:53:27 +0000 (02:53 +0500)]
zvol: Implement zvol threading as a Property
Currently, zvol threading can be switched through the zvol_request_sync
module parameter system-wide. By making it a zvol property, zvol
threading can be switched per zvol.
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #15409
Ameer Hamza [Wed, 11 Oct 2023 22:31:11 +0000 (03:31 +0500)]
zvol: Cleanup set property
zvol_set_volmode() and zvol_set_snapdev() share a common code path.
Merge this shared code path into zvol_set_common().
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #15409