git.proxmox.com Git - mirror_zfs.git/log
14 months agoDo not request data L1 buffers on scan prefetch.
Alexander Motin [Thu, 20 Jul 2023 16:10:04 +0000 (12:10 -0400)]
Do not request data L1 buffers on scan prefetch.

Set ARC_FLAG_NO_BUF when prefetching data L1 buffers for scan.  We
do not prefetch data L0 buffers, so we do not need the L1 buffers
themselves, only want them to be ready in ARC.  This saves some CPU
time on buffer decompression.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15029

14 months agoLinux 6.5 compat: disk_check_media_change() was added
Coleman Kane [Thu, 20 Jul 2023 16:09:25 +0000 (12:09 -0400)]
Linux 6.5 compat: disk_check_media_change() was added

The disk_check_media_change() function was added, replacing
bdev_check_media_change().  This change was introduced in 6.5-rc1
(444aa2c58cb3b6cfe3b7cc7db6c294d73393a894), and the new function takes a
gendisk* as its argument instead of a block_device*.  Thus, bdev->bd_disk
is now used to pass the expected data.
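
A minimal compat-shim sketch of the idea (the wrapper and guard names
here are illustrative, not necessarily the exact macros OpenZFS uses):

    #ifdef HAVE_DISK_CHECK_MEDIA_CHANGE
    /* Linux 6.5+: the helper takes the gendisk behind the block device. */
    #define zfs_check_media_change(bdev) \
            disk_check_media_change((bdev)->bd_disk)
    #else
    /* Older kernels: keep using the block_device based helper. */
    #define zfs_check_media_change(bdev) \
            bdev_check_media_change(bdev)
    #endif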

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #15060

14 months agoset autotrim default to 'off' everywhere
Yuri Pankov [Thu, 20 Jul 2023 16:06:55 +0000 (18:06 +0200)]
set autotrim default to 'off' everywhere

As it turns out, having autotrim default to 'on' on FreeBSD never really
worked due to a mess with defines where userland and the kernel module
were getting different default values (userland was defaulting to 'off',
while the module thought it was 'on').

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Yuri Pankov <yuripv@FreeBSD.org>
Closes #15079

14 months agoLinux 6.5 compat: BLK_STS_NEXUS renamed to BLK_STS_RESV_CONFLICT
Coleman Kane [Fri, 14 Jul 2023 23:33:51 +0000 (19:33 -0400)]
Linux 6.5 compat: BLK_STS_NEXUS renamed to BLK_STS_RESV_CONFLICT

This change was introduced in Linux commit
7ba150834b840f6f5cdd07ca69a4ccf39df59a66
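
A hedged sketch of the usual compat pattern for such a rename (this is an
illustration, not necessarily the exact OpenZFS compat code): define the
new name on older kernels so the rest of the code can use it everywhere.

    /* Kernels before 6.5 only provide the old name. */
    #ifndef BLK_STS_RESV_CONFLICT
    #define BLK_STS_RESV_CONFLICT BLK_STS_NEXUS
    #endif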

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #15059

14 months agointptr_t definition is canonically signed
Coleman Kane [Fri, 14 Jul 2023 23:32:49 +0000 (19:32 -0400)]
intptr_t definition is canonically signed

Make the version here match that elsewhere in the kernel and system
headers.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #15058

14 months agoFix raw receive with different indirect block size.
Alexander Motin [Fri, 14 Jul 2023 23:16:40 +0000 (19:16 -0400)]
Fix raw receive with different indirect block size.

Unlike regular receive, raw receive requires the destination to have the
same block structure as the source.  In case of dnode reclaim this
triggers two special cases, requiring special handling:
 - If dn_nlevels == 1, we can change the ibs, but dnode_set_blksz()
should not dirty the data buffer if the block size does not change, or
during receive dbuf_dirty_lightweight() will trigger an assertion.
 - If dn_nlevels > 1, we just can't change the ibs; dnode_set_blksz()
would fail and receive_object() would trigger an assertion, so we should
destroy and recreate the dnode from scratch.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15039

14 months agoFix the ZFS checksum error histograms with larger record sizes
Alan Somers [Fri, 14 Jul 2023 23:13:15 +0000 (17:13 -0600)]
Fix the ZFS checksum error histograms with larger record sizes

My analysis in PR #14716 was incorrect.  Each histogram bucket contains
the number of incorrect bits, by position in a 64-bit word, over the
entire record.  8-bit buckets can overflow for record sizes above 2k.
To forestall that, saturate each bucket at 255.  That should still get
the point across: either all bits are equally wrong, or just a couple
are.
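
A tiny self-contained illustration of the saturating update described
above (the function name is hypothetical):

    #include <stdint.h>

    /* Increment an 8-bit histogram bucket, clamping at 255 instead of
     * wrapping around on overflow. */
    static inline void
    hist_bucket_inc(uint8_t *bucket)
    {
            if (*bucket < UINT8_MAX)
                    (*bucket)++;
    }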

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alan Somers <asomers@gmail.com>
Sponsored-by: Axcient
Closes #15049

14 months agoAvoid extra snprintf() in dsl_deadlist_merge().
Alexander Motin [Fri, 14 Jul 2023 23:11:46 +0000 (19:11 -0400)]
Avoid extra snprintf() in dsl_deadlist_merge().

Since we are already iterating the ZAP, we have the exact string key to
remove.  We do not need to call zap_remove_int() with the int key we
just converted; we can call zap_remove() with the original string.

This should make no functional change, only a micro-optimization.
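
A hedged before/after fragment of the idea (variable names are
illustrative; za stands for the zap_attribute_t of the ongoing
iteration):

    /* Before: convert the string key to an int, then remove by int key. */
    (void) zap_remove_int(mos, obj, mintxg, tx);

    /* After: reuse the string key we already have from the ZAP cursor. */
    (void) zap_remove(mos, obj, za.za_name, tx);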

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15056

14 months agoAdd missed DMU_PROJECTUSED_OBJECT prefetch.
Alexander Motin [Thu, 13 Jul 2023 16:12:55 +0000 (12:12 -0400)]
Add missed DMU_PROJECTUSED_OBJECT prefetch.

It seems 9c5167d19f "Project Quota on ZFS" failed to add a prefetch
for DMU_PROJECTUSED_OBJECT during scan (scrub/resilver).  It should
not cause visible problems, but may affect scrub/resilver performance.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15024

14 months agoFreeBSD: catch up to __FreeBSD_version 1400093
Mateusz Guzik [Thu, 13 Jul 2023 16:06:57 +0000 (18:06 +0200)]
FreeBSD: catch up to __FreeBSD_version 1400093

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #15036

14 months agoUpdate changelog for 2.2
Umer Saleem [Thu, 13 Jul 2023 15:55:12 +0000 (20:55 +0500)]
Update changelog for 2.2

Add a new changelog entry for native packages to reflect version
2.2.99.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #15054

14 months agoFreeBSD: Fix build on stable/13 after 1302506.
Alexander Motin [Thu, 13 Jul 2023 15:50:34 +0000 (11:50 -0400)]
FreeBSD: Fix build on stable/13 after 1302506.

Starting approximately from version 1302506, vn_lock_pair() grew two
additional arguments, following head.  There is a one-week hole, but
that is the closest reference point we have.

Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by:   iXsystems, Inc.
Closes #15047

14 months agoUpdate META
Brian Behlendorf [Fri, 30 Jun 2023 20:32:18 +0000 (13:32 -0700)]
Update META

Increase the version to 2.2.99 to indicate the master branch is
newer than the 2.2.x release.  This ensures packages built from
master branch are considered to be newer than the last release.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
14 months agoTag 2.2.0-rc1
Brian Behlendorf [Fri, 30 Jun 2023 18:42:14 +0000 (11:42 -0700)]
Tag 2.2.0-rc1

New features:
- Fully adaptive ARC eviction (#14359)
- Block cloning (#13392)
- Scrub error log (#12812, #12355)
- Linux container support (#14070, #14097, #12263)
- BLAKE3 Checksums (#12918)
- Corrective "zfs receive" (#9372)

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
14 months agoEnable tuning of ZVOL open timeout value
Prakash Surya [Fri, 30 Jun 2023 18:34:05 +0000 (11:34 -0700)]
Enable tuning of ZVOL open timeout value

The default timeout for ZVOL opens may not be sufficient for all cases,
so we should enable the value to be more easily tuned to account for
systems where the default value is insufficient.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Closes #15023

14 months agoRevert "spa.h: use IN_BASE instead of IN_FREEBSD_BASE"
Brian Behlendorf [Fri, 30 Jun 2023 17:03:41 +0000 (10:03 -0700)]
Revert "spa.h: use IN_BASE instead of IN_FREEBSD_BASE"

This reverts commit 77a3bb1f47e67c233eb1961b8746748c02bafde1.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
14 months agoPack our DDT ZAPs a bit denser.
Rich Ercolani [Fri, 30 Jun 2023 16:42:02 +0000 (12:42 -0400)]
Pack our DDT ZAPs a bit denser.

The DDT is really inefficient on 4k and up vdevs, because it always
allocates 4k blocks, and while compression could save us somewhat
at ashift 9, that stops being true at larger ashifts.

So let's change the default to 32 KiB, which seems like a reasonable
compromise between improved space savings and inflated write sizes
for DDT updates.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #14654

14 months agoddt_addref: remove unnecessary phys fill when refcount is 0
Rob N [Fri, 30 Jun 2023 16:01:58 +0000 (02:01 +1000)]
ddt_addref: remove unnecessary phys fill when refcount is 0

The previous comment wondered if this case could happen; it turns out
that it really can't.

This block can only be entered if dde_type and dde_class are "real";
that only happens when a ddt entry has been previously synced to a ddt
store, that is, it was created on a previous txg.  Since it has gone
through that sync, its dde_refcount must be >0.

ddt_addref() is called from brt_pending_apply(), which is called at the
beginning of spa_sync(), before pending DMU writes/frees are issued.
Freeing a dedup block is the only thing that can decrement dde_refcount,
so there's no way for it to drop to zero before the clone application
bumps it.

Further, even if it _could_ go to zero, it wouldn't be necessary to fill
the entry from the block. The phys content is not cleared until the free
is issued, which happens when the refcount goes to zero, when the last
real free comes through. The cloned block should be identical to what's
in the phys already, so the fill should be a no-op anyway.

I've replaced this with an assertion because this is all very dependent
on the ordering in which BRT and DDT changes are applied, and that might
change in the future.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-By: Klara, Inc.
Closes #15004

14 months agoAgain fix race between zil_commit() and zil_suspend().
Alexander Motin [Fri, 30 Jun 2023 15:59:39 +0000 (11:59 -0400)]
Again fix race between zil_commit() and zil_suspend().

With the zl_suspend read in zil_commit() not protected by any locks it
is possible for new ZIL writes to be in progress while zil_destroy(),
called by zil_suspend(), is freeing them.  This patch closes the race
by taking zl_issuer_lock in zil_suspend() and adding a second
zl_suspend check to zil_get_commit_list(), protected by the lock.
It allows all already queued transactions to be logged normally,
while blocking any new ones and calling txg_wait_synced() for their TXGs.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14979

14 months agoSome ZIO micro-optimizations.
Alexander Motin [Fri, 30 Jun 2023 15:54:00 +0000 (11:54 -0400)]
Some ZIO micro-optimizations.

- Pack struct zio_prop by 4 bytes from 84 to 80.
 - Skip new child ZIO locking while linking to parent.  The newly
allocated ZIO is not externally visible yet, so nobody should care.
 - Skip io_bp_copy writes when not used (write && non-debug).

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14985

14 months agoDo not report bytes skipped by scan as issued.
Alexander Motin [Fri, 30 Jun 2023 15:47:13 +0000 (11:47 -0400)]
Do not report bytes skipped by scan as issued.

The scan process may skip blocks based on their birth time, DVA, etc.
Traditionally those blocks were accounted as issued, which caused
reporting of hugely over-inflated numbers having nothing to do
with actual disk I/O.  This change utilizes a never-used field in
struct dsl_scan_phys to account for such skipped bytes, allowing us to
report how much data was actually scrubbed/resilvered and what the
actual I/O speed is.  While formally it is an on-disk format
change, it should be compatible both ways, so it should not need a
feature flag.

This should partially address the same issue as c85ac731a0e, but
from a different perspective, complementing it.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by:   iXsystems, Inc.
Closes #15007

14 months agoDon't use hard-coded 'size' value in snprintf()
Arshad Hussain [Fri, 30 Jun 2023 15:37:26 +0000 (21:07 +0530)]
Don't use hard-coded 'size' value in snprintf()

This patch changes the "size" passed to snprintf() from a hard-coded
(open-ended) value to sizeof(errbuf).  This brings it in line with the
rest of the code wherever 'errbuf' is used.
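
A minimal illustration of the change (buffer size and message are just
examples):

    #include <stdio.h>

    static void
    report_open_error(const char *path)
    {
            char errbuf[1024];

            /* Before: the size was passed as a hard-coded constant. */
            (void) snprintf(errbuf, 1024, "cannot open '%s'", path);

            /* After: the size is derived from the buffer itself. */
            (void) snprintf(errbuf, sizeof (errbuf), "cannot open '%s'", path);
            (void) fputs(errbuf, stderr);
    }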

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arshad Hussain <arshad.hussain@aeoncomputing.com>
Closes #15003

14 months agoFix remount when setting multiple properties.
Alexander Motin [Fri, 30 Jun 2023 15:36:43 +0000 (11:36 -0400)]
Fix remount when setting multiple properties.

The previous code was checking zfs_is_namespace_prop() only for the
last property on the list.  If that one was not a "namespace" property,
then remount wasn't called.  To fix that, move zfs_is_namespace_prop()
inside the loop and remount if at least one of the properties was a
"namespace" one.
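
A self-contained sketch of the corrected control flow, assuming a
hypothetical property-id array and predicate (not the actual
zfs_ioctl code):

    #include <stdbool.h>
    #include <stddef.h>

    /* Remount once if ANY property on the list was a namespace property,
     * not only if the last one happened to be. */
    static bool
    any_namespace_prop(const int *props, size_t nprops,
        bool (*is_namespace_prop)(int))
    {
            bool found = false;

            for (size_t i = 0; i < nprops; i++) {
                    if (is_namespace_prop(props[i]))
                            found = true;   /* keep applying the rest */
            }
            return (found);
    }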

Reviewed-by: Umer Saleem <usaleem@ixsystems.com>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15000

14 months agocontrib: dracut: Conditionalize copying of libgcc_s.so.1 to glibc only
vimproved [Thu, 29 Jun 2023 19:54:37 +0000 (19:54 +0000)]
contrib: dracut: Conditionalize copying of libgcc_s.so.1 to glibc only

The issue that this is designed to work around is only applicable to
glibc, since it's caused by glibc's pthread_cancel() implementation
using dlopen on libgcc_s.so.1 (and therefore not triggering dracut to
include it in the initramfs). This commit adds an extra condition to the
workaround that tests for glibc via "ldconfig -p | grep -qF 'libc.so.6'"
(which should only be present on glibc systems).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Violet Purcell <vimproved@inventati.org>
Closes #14992

14 months agospa.h: use IN_BASE instead of IN_FREEBSD_BASE
Yuri Pankov [Thu, 29 Jun 2023 18:50:52 +0000 (20:50 +0200)]
spa.h: use IN_BASE instead of IN_FREEBSD_BASE

Consistently get the proper default value for autotrim.

Currently, only the kernel module is built with IN_FREEBSD_BASE,
and libzfs gets the wrong default value, leading to confusion and
incorrect output when the autotrim value was not set explicitly.

Reviewed-by: Warner Losh <imp@bsdimp.com>
Signed-off-by: Yuri Pankov <yuripv@FreeBSD.org>
Closes #15016

14 months agozdb: Add missing poolname to -C synopsis
Mateusz Piotrowski [Thu, 29 Jun 2023 17:54:43 +0000 (19:54 +0200)]
zdb: Add missing poolname to -C synopsis

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Mateusz Piotrowski <0mp@FreeBSD.org>
Sponsored-by: Klara Inc.
Closes #15014

14 months agoZIL: Fix another use-after-free.
Alexander Motin [Wed, 28 Jun 2023 00:03:37 +0000 (20:03 -0400)]
ZIL: Fix another use-after-free.

lwb->lwb_issued_txg can not be accessed after lwb_state is set to
LWB_STATE_FLUSH_DONE and zl_lock is dropped, since the lwb may be
freed by zil_sync().  We must save the txg number before that.

This is similar to 55b1842f92, but as far as I can see the bug is not
new.  It existed for quite a while and just was not triggered due to a
smaller race window.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14988
Closes #14999

14 months agoUse big transactions for small recordsize writes.
Alexander Motin [Wed, 28 Jun 2023 00:00:30 +0000 (20:00 -0400)]
Use big transactions for small recordsize writes.

When ZFS appends files in chunks bigger than recordsize, it borrows a
buffer from ARC and fills it before opening a transaction.  This is
supposed to help in case of page faults, to not hold the transaction
open indefinitely.  The problem appears when recordsize is set lower
than the default 128KB.  Since each block is committed in a separate
transaction, per-transaction overhead becomes significant, and what is
even worse, active use of per-dataset and per-pool locks to protect
space-use accounting for each transaction badly hurts the code's SMP
scalability.  The same transaction size limitation applies in case of
file rewrite, but without even the excuse of buffer borrowing.

To address the issue, disable the borrowing mechanism if recordsize
is smaller than the default and the write request is 4x bigger than it.
In such a case writes of up to 32MB are executed in a single transaction,
which dramatically reduces overhead and lock contention.  Since the
borrowing mechanism is not used for file rewrites, and it was never
used by zvols, which seem to work fine, I don't think this change
should create significant problems, partially because in addition to
the borrowing mechanism pre-faults are also used.
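
A self-contained sketch of that decision (names and the constant are
assumptions for illustration; the actual check lives in the DMU write
path):

    #include <stdbool.h>
    #include <stdint.h>

    #define ZFS_DEFAULT_RECORDSIZE  (128ULL * 1024)

    /* Borrowing is skipped only when the dataset recordsize is below the
     * default and the write is at least 4x the recordsize; such writes
     * then go through a single transaction (up to 32MB). */
    static bool
    skip_arc_buf_borrowing(uint64_t recordsize, uint64_t write_size)
    {
            return (recordsize < ZFS_DEFAULT_RECORDSIZE &&
                write_size >= 4 * recordsize);
    }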

My tests with 4/8 threads writing several files at the same time on
datasets with 32KB recordsize in 1MB requests show a 25-35% reduction
of CPU usage by the user threads.  I would measure it in GB/s, but at
that block size we are now limited by the lock contention of the single
write issue taskqueue, which is a separate problem we are going to work
on.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14964

14 months agoRemove unnecessary commas in zpool-create.8
Laevos [Tue, 27 Jun 2023 23:58:32 +0000 (16:58 -0700)]
Remove unnecessary commas in zpool-create.8

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Laevos <5572812+Laevos@users.noreply.github.com>
Closes #15011

14 months agoAnother set of vdev queue optimizations.
Alexander Motin [Tue, 27 Jun 2023 16:09:48 +0000 (12:09 -0400)]
Another set of vdev queue optimizations.

Switch FIFO queues (SYNC/TRIM) and active queue of vdev queue from
time-sorted AVL-trees to simple lists.  AVL-trees are too expensive
for such a simple task.  To change I/O priority without searching
through the trees, add io_queue_state field to struct zio.

To avoid checking the number of queued I/Os for each priority, add a
vq_cqueued bitmap to struct vdev_queue.  Update it when adding/removing
I/Os.  Make vq_cactive a separate array instead of a struct
vdev_queue_class member.  Together those allow us to avoid lots of cache
misses when looking for work in vdev_queue_class_to_issue().

Introduce a deadline of ~0.5s for LBA-sorted queues.  Before this I
saw some I/Os waiting in a queue for up to 8 seconds and possibly
more due to starvation.  With this change I no longer see it.  I
had to make the comparison function slightly more complicated, but
since it uses all the same cache lines the difference is minimal.  For
sequential I/Os the new code in vdev_queue_io_to_issue() actually
often uses the simpler avl_first(), falling back to avl_find() and
avl_nearest() only when needed.

Arrange members in struct zio to access only one cache line when
searching through vdev queues.  While there, remove io_alloc_node,
reusing the io_queue_node instead.  Those two are never used at the
same time.

Remove the zfs_vdev_aggregate_trim parameter.  It has been disabled for
4 years since it was implemented, while still wasting time maintaining
the offset-sorted tree of TRIM requests.  Just remove the tree.

Remove locking from txg_all_lists_empty().  It is racy by design,
while two pairs of locks/unlocks take noticeable time under the vdev
queue lock.

With these changes in my tests with volblocksize=4KB I measure vdev
queue lock spin time reduction by 50% on read and 75% on write.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14925

14 months agoAdd a delay to tearing down threads.
Rich Ercolani [Mon, 26 Jun 2023 20:57:12 +0000 (16:57 -0400)]
Add a delay to tearing down threads.

It's been observed that in certain workloads (zvol-related being a
big one), ZFS will end up spending a large amount of time spinning
up taskqs only to tear them down again almost immediately, then
spin them up again...

I noticed this when I looked at what my mostly-idle system was doing
and wondered how on earth taskq creation/destruction was taking up so
much time...

So I added a configurable delay to avoid tearing down taskqs the
first time it notices them idle; the total number of threads at
steady state went up, but the amount of time being burned just
tearing down/spinning up new ones almost vanished.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #14938

15 months agoFix memory leak in zil_parse().
Alexander Motin [Sun, 18 Jun 2023 02:51:37 +0000 (22:51 -0400)]
Fix memory leak in zil_parse().

482da24e2 missed arc_buf_destroy() calls on log parse errors, possibly
leaking up to 128KB of memory per dataset during ZIL replay.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14987

15 months agoShorten arcstat_quiescence sleep time
George Amanakis [Thu, 15 Jun 2023 19:45:36 +0000 (21:45 +0200)]
Shorten arcstat_quiescence sleep time

With the latest L2ARC fixes, 2 seconds is too long to wait for
quiescence of arcstats like l2_size. Shorten this interval to avoid
having the persistent L2ARC tests in ZTS prematurely terminated.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14981

15 months agoRemove ARC/ZIO physdone callbacks.
Alexander Motin [Thu, 15 Jun 2023 17:49:03 +0000 (13:49 -0400)]
Remove ARC/ZIO physdone callbacks.

Those callbacks were introduced many years ago as part of a bigger
patch to smooth the write throttling within a txg.  They allow
accounting for completion of individual physical writes within a
logical one, improving cases when some of the physical writes complete
much sooner than others, gradually opening the write throttle.

A few years after that ZFS got allocation throttling, working at the
level of logical writes and limiting the number of writes queued to
vdevs at any point, and so limiting the latency distribution between
the physical writes and especially writes of multiple copies.
The scheduling deadline I proposed in #14925 should further reduce
the latency distribution.  Grown memory sizes over the past 10 years
should also reduce the importance of the smoothing.

While the use of the physdone callback may still in theory provide
somewhat smoother throttling, there are cases where we simply can not
afford it.  Since dirty data accounting is protected by a pool-wide
lock, in the case of a 6-wide RAIDZ, for example, it requires us to
take it 8 times per logical block write, creating huge lock contention.

My tests of this patch show a radical reduction of the lock spinning
time on workloads where smaller blocks are written to RAIDZ pools,
with each of the disks receiving 8-16KB chunks but the total rate
reaching 100K+ blocks per second.  At the same time, attempts to
measure any write time fluctuations didn't show anything noticeable.

While there, also remove the io_child_count/io_parent_count counters.
They are used only for a couple of assertions that can be avoided.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14948

15 months agoZTS: Skip send_raw_ashift on FreeBSD
Brian Behlendorf [Wed, 14 Jun 2023 15:04:05 +0000 (10:04 -0500)]
ZTS: Skip send_raw_ashift on FreeBSD

On FreeBSD 14 this test runs slowly in the CI environment
and is killed by the 10 minute timeout.  Skip the test on
FreeBSD until the slow down is resolved.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #14961

15 months agoSwitch refcount tracking from lists to AVL-trees.
Alexander Motin [Wed, 14 Jun 2023 15:02:27 +0000 (11:02 -0400)]
Switch refcount tracking from lists to AVL-trees.

With a large number of tracked references, list searches under the lock
become too expensive, creating enormous lock contention.

On my tests with ZFS_DEBUG enabled this increases write throughput
with 32KB blocks from ~1.2GB/s to ~7.5GB/s.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14970

15 months agoStore the L2ARC device ashift in the vdev label
George Amanakis [Wed, 14 Jun 2023 15:01:17 +0000 (17:01 +0200)]
Store the L2ARC device ashift in the vdev label

If this is not done, and the pool has an ashift other than the default
(at the moment 9), then the following happens:

1) vdev_alloc() assigns the ashift of the pool to the L2ARC device, but
   upon export it is not stored anywhere
2) at the first import, vdev_open() sees a vdev_ashift() of 0 and
   assigns the logical_ashift, which is 9
3) reading the contents of L2ARC, including the header, fails
4) L2ARC buffers are not restored in ARC.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14313
Closes #14963

15 months agoFix the L2ARC write size calculating logic (2)
George Amanakis [Sat, 10 Jun 2023 00:05:47 +0000 (02:05 +0200)]
Fix the L2ARC write size calculating logic (2)

While commit bcd5321 adjusts the write size based on the size of the log
block, this happens after comparing the unadjusted write size to the
evicted (target) size.

In this case l2ad_hand will exceed l2ad_evict and violate an assertion
at the end of l2arc_write_buffers().

Fix this by adding the max log block size to the allocated size of the
buffer to be committed before comparing the result to the target
size.

Also reset the l2arc_trim_ahead ZFS module variable when the adjusted
write size exceeds the size of the L2ARC device.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14936
Closes #14954

15 months agoFinally drop long disabled vdev cache.
Alexander Motin [Fri, 9 Jun 2023 19:40:55 +0000 (15:40 -0400)]
Finally drop long disabled vdev cache.

It was a vdev-level read cache, designed to aggregate many small
reads by speculatively issuing bigger reads instead and caching
the result.  But since it has almost no idea about what is going
on, with the exception of the ZIO_FLAG_DONT_CACHE flag set by higher
layers, it was found to do more harm than good, for which reason it
has been disabled for the past 12 years.  These days we have much
better instruments to enlarge the I/Os, such as speculative and
prescient prefetches, I/O scheduler, I/O aggregation etc.

Besides just the dead code removal, this removes one extra mutex
lock/unlock per write inside vdev_cache_write(), which was not
otherwise disabled and was still trying to do some work.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14953

15 months agoZTS: Skip checkpoint_discard_busy
Brian Behlendorf [Fri, 9 Jun 2023 18:10:01 +0000 (11:10 -0700)]
ZTS: Skip checkpoint_discard_busy

Until the ASSERT which is occasionally hit while running
checkpoint_discard_busy is resolved skip this test case.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #12053
Closes #14952

15 months agoImprove l2arc reporting in arc_summary.
Alexander Motin [Fri, 9 Jun 2023 17:14:05 +0000 (13:14 -0400)]
Improve l2arc reporting in arc_summary.

- Do not report L2ARC as FAULTED in presence of in-flight writes.
- Report read and write I/Os, bytes and errors.
- Remove a few numbers not important to the average user.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #12304
Closes #14946

15 months agoUse list_remove_head() where possible.
Alexander Motin [Fri, 9 Jun 2023 17:12:52 +0000 (13:12 -0400)]
Use list_remove_head() where possible.

... instead of list_head() + list_remove().  On FreeBSD the list
functions are not inlined, so in addition to more compact code
this also saves another function call.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14955

15 months agoZIL: Fix race introduced by f63811f0721.
Alexander Motin [Fri, 9 Jun 2023 17:08:05 +0000 (13:08 -0400)]
ZIL: Fix race introduced by f63811f0721.

We are not allowed to access lwb after setting LWB_STATE_FLUSH_DONE
state and dropping zl_lock, since it may be freed by zil_sync().
To free itxs and waiters after dropping the lock we need to move
lwb_itxs and lwb_waiters lists elements to local storage.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14957
Closes #14959

15 months agoRevert "systemd: Use non-absolute paths in Exec* lines"
Rich Ercolani [Wed, 7 Jun 2023 18:14:05 +0000 (14:14 -0400)]
Revert "systemd: Use non-absolute paths in Exec* lines"

This reverts commit 79b20949b25c8db4d379f6486b0835a6613b480c since it
doesn't work with the systemd version shipped with RHEL7-based systems.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #14943
Closes #14945

15 months agoLinux: Never sleep in kmem_cache_alloc(..., KM_NOSLEEP) (#14926)
Brian Behlendorf [Wed, 7 Jun 2023 17:43:43 +0000 (10:43 -0700)]
Linux: Never sleep in kmem_cache_alloc(..., KM_NOSLEEP) (#14926)

When a kmem cache is exhausted and needs to be expanded, a new
slab is allocated.  KM_SLEEP callers can block and wait for the
allocation, but KM_NOSLEEP callers were incorrectly allowed to
block as well.

Resolve this by attempting an emergency allocation as a best
effort.  This may fail but that's fine since any KM_NOSLEEP
consumer is required to handle an allocation failure.
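
A hedged fragment of the caller-side contract this relies on (the cache
and error handling shown are illustrative):

    /* KM_NOSLEEP may legitimately return NULL; the caller must cope
     * without blocking rather than expect the allocator to sleep. */
    obj = kmem_cache_alloc(cache, KM_NOSLEEP);
    if (obj == NULL)
            return (ENOMEM);        /* back off and retry later */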

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
15 months agoFix the L2ARC write size calculating logic
George Amanakis [Tue, 6 Jun 2023 19:32:37 +0000 (21:32 +0200)]
Fix the L2ARC write size calculating logic

l2arc_write_size() should return the write size after adjusting for trim
and overhead of the L2ARC log blocks. Also take into account the
allocated size of log blocks when deciding when to stop writing buffers
to L2ARC.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14939

15 months agozdb: add -B option to generate backup stream
Rob Norris [Wed, 15 Mar 2023 07:18:10 +0000 (18:18 +1100)]
zdb: add -B option to generate backup stream

This is more-or-less like `zfs send`, but specifying the snapshot by its
objset id for situations where it can't be referenced any other way.

Sponsored-By: Klara, Inc.
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: WHR <msl0000023508@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #14642

15 months agoznode: expose zfs_get_zplprop to libzpool
Rob Norris [Sun, 4 Jun 2023 01:14:20 +0000 (11:14 +1000)]
znode: expose zfs_get_zplprop to libzpool

There's no particular reason this function should be kernel-only, and I
want to use it (indirectly) from zdb. I've moved it to zfs_znode.c
because libzpool does not compile in zfs_vfsops.c, and this at least
matches the header it's imported from.

Sponsored-By: Klara, Inc.
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: WHR <msl0000023508@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #14642

15 months agoIntroduce zfs_refcount_(add|remove)_few().
Alexander Motin [Mon, 5 Jun 2023 18:51:44 +0000 (14:51 -0400)]
Introduce zfs_refcount_(add|remove)_few().

There are two places where we need to add/remove several references
with the semantics of zfs_refcount_(add|remove).  But when debug/tracing
is disabled, it is a crime to run multiple atomic_inc() calls in a loop,
especially under the congested pool-wide allocator lock.

The newly introduced functions implement the same semantics as the loop,
but without the overhead in production builds.
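
A generic, self-contained illustration of the non-debug idea in plain
C11 atomics (this is not the OpenZFS implementation): one atomic add
instead of N separate increments.

    #include <stdatomic.h>
    #include <stdint.h>

    /* Add 'number' references with a single atomic operation. */
    static inline uint64_t
    refcount_add_few(_Atomic uint64_t *rc, uint64_t number)
    {
            return (atomic_fetch_add(rc, number) + number);
    }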

Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14934

15 months agoLinux 6.3 compat: META (#14930)
Brian Behlendorf [Mon, 5 Jun 2023 18:08:24 +0000 (11:08 -0700)]
Linux 6.3 compat: META (#14930)

Update the META file to reflect compatibility with the 6.3 kernel.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
15 months agozfs-create(8): ZFS for swap: caution, clarity
Graham Perrin [Fri, 2 Jun 2023 18:25:13 +0000 (19:25 +0100)]
zfs-create(8): ZFS for swap: caution, clarity

Make the section heading more generic (the section relates to ZFS files
as well as ZFS volumes).

Swapping to a ZFS volume is prone to deadlock. Remove the related
instruction, direct readers to OpenZFS FAQ. Related, but not linked
from within the manual page:

<https://openzfs.github.io/openzfs-docs/Project%20and%20Community/FAQ.html#using-a-zvol-for-a-swap-device-on-linux>
(Using a zvol for a swap device on Linux).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Graham Perrin <grahamperrin@freebsd.org>
Issue #7734
Closes #14756

15 months agoZIL: Allow to replay blocks of any size.
Alexander Motin [Fri, 2 Jun 2023 18:01:58 +0000 (14:01 -0400)]
ZIL: Allow to replay blocks of any size.

There seems to be no reason for ZIL blocks to be limited to 128KB
other than that the replay code is written in such a way.  This change
does not increase the limit yet, it just removes the artificial
limitation.

The avoided extra memcpy() may save us a second during replay.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14910

15 months agoPAM: enable testing on FreeBSD
Val Packett [Thu, 11 May 2023 21:16:57 +0000 (18:16 -0300)]
PAM: enable testing on FreeBSD

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Felix Dörre <felix@dogcraft.de>
Signed-off-by: Val Packett <val@packett.cool>
Closes #14834

15 months agoPAM: support password changes even when not mounted
Val Packett [Sat, 6 May 2023 01:17:12 +0000 (22:17 -0300)]
PAM: support password changes even when not mounted

There's usually no requirement that a user be logged in for changing
their password, so let's not be surprising here.

We need to use the fetch_lazy mechanism for the old password to avoid
a double prompt for it, so that mechanism is now generalized a bit.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Felix Dörre <felix@dogcraft.de>
Signed-off-by: Val Packett <val@packett.cool>
Closes #14834

15 months agoPAM: add 'uid_min' and 'uid_max' options for changing the uid range
Val Packett [Sat, 6 May 2023 01:34:58 +0000 (22:34 -0300)]
PAM: add 'uid_min' and 'uid_max' options for changing the uid range

Instead of a fixed >=1000 check, allow the configuration to override
the minimum UID and add a maximum one as well. While here, add the
uid range check to the authenticate method as well, and fix the return
in the chauthtok method (seems very wrong to report success when we've
done absolutely nothing).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Felix Dörre <felix@dogcraft.de>
Signed-off-by: Val Packett <val@packett.cool>
Closes #14834

15 months agoPAM: add 'forceunmount' flag
Val Packett [Sat, 6 May 2023 01:02:13 +0000 (22:02 -0300)]
PAM: add 'forceunmount' flag

Probably not always a good idea, but it's nice to have the option.
It is a workaround for FreeBSD calling the PAM session end earlier than
the last process is actually done touching the mount, for example.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Felix Dörre <felix@dogcraft.de>
Signed-off-by: Val Packett <val@packett.cool>
Closes #14834

15 months agoPAM: add 'recursive_homes' flag to use with 'prop_mountpoint'
Val Packett [Fri, 5 May 2023 22:35:57 +0000 (19:35 -0300)]
PAM: add 'recursive_homes' flag to use with 'prop_mountpoint'

It's not always desirable to have a fixed flat homes directory.
With the 'recursive_homes' flag, 'prop_mountpoint' search would
traverse the whole tree starting at 'homes' (which can now be '*'
to mean all pools) to find a dataset with a mountpoint matching
the home directory.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Felix Dörre <felix@dogcraft.de>
Signed-off-by: Val Packett <val@packett.cool>
Closes #14834

15 months agoPAM: use boolean_t for config flags
Val Packett [Sat, 6 May 2023 00:56:39 +0000 (21:56 -0300)]
PAM: use boolean_t for config flags

Since we already use boolean_t in the file, we can use it here.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Felix Dörre <felix@dogcraft.de>
Signed-off-by: Val Packett <val@packett.cool>
Closes #14834

15 months agoPAM: do not fail to mount if the key's already loaded
Val Packett [Fri, 5 May 2023 23:00:48 +0000 (20:00 -0300)]
PAM: do not fail to mount if the key's already loaded

If we're expecting a working home directory on login, it would be
rather frustrating to not have it mounted just because it e.g. failed to
unmount once on logout.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Felix Dörre <felix@dogcraft.de>
Signed-off-by: Val Packett <val@packett.cool>
Closes #14834

15 months agoRevert "initramfs: use `mount.zfs` instead of `mount`"
Rich Ercolani [Wed, 31 May 2023 23:58:41 +0000 (19:58 -0400)]
Revert "initramfs: use `mount.zfs` instead of `mount`"

This broke mounting of snapshots on / for users.

See https://github.com/openzfs/zfs/issues/9461#issuecomment-1376162949 for more context.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #14908

15 months agoFix NULL pointer dereference when doing concurrent 'send' operations
Luís Henriques [Tue, 30 May 2023 22:15:24 +0000 (23:15 +0100)]
Fix NULL pointer dereference when doing concurrent 'send' operations

A NULL pointer dereference will occur when doing a 'zfs send -S' on a dataset that
is still being received.  The problem is that the new 'send' will
rightfully fail to own the datasets (i.e. dsl_dataset_own_force() will
fail), but then dmu_send() will still do the dsl_dataset_disown().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Luís Henriques <henrix@camandro.org>
Closes #14903
Closes #14890

15 months agoZTS: zvol_misc_trim disable blk mq
Brian Behlendorf [Mon, 29 May 2023 19:55:35 +0000 (12:55 -0700)]
ZTS: zvol_misc_trim disable blk mq

Disable the zvol_misc_fua.ksh and zvol_misc_trim.ksh test cases on impacted
kernels.  This issue is being actively worked on in #14872 and as part of that
fix this commit will be reverted.

    VERIFY(zh->zh_claim_txg == 0) failed
    PANIC at zil.c:904:zil_create()

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #14872
Closes #14870

15 months agoUse __attribute__((malloc)) on memory allocation functions
Richard Yao [Fri, 26 May 2023 22:47:52 +0000 (18:47 -0400)]
Use __attribute__((malloc)) on memory allocation functions

This informs the C compiler that pointers returned from these functions
do not alias other pointers, which allows it to do better code
optimization and should make the compiled code smaller.

References:
https://stackoverflow.com/a/53654773
https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-malloc-function-attribute
https://clang.llvm.org/docs/AttributeReference.html#malloc
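
A minimal example of the attribute on a hypothetical wrapper allocator:

    #include <stdlib.h>

    /* The attribute tells the compiler the returned pointer is "fresh"
     * and does not alias any other live pointer, improving alias
     * analysis around call sites. */
    __attribute__((malloc))
    static void *
    xmalloc(size_t size)
    {
            void *p = malloc(size);

            if (p == NULL)
                    abort();
            return (p);
    }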

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14827

15 months agoZTS: Add zpool_resilver_concurrent exception
Brian Behlendorf [Fri, 26 May 2023 22:39:23 +0000 (15:39 -0700)]
ZTS: Add zpool_resilver_concurrent exception

The zpool_resilver_concurrent test case requires the ZED which is not used
on FreeBSD.  Add this test to the known list of skipped tests for FreeBSD.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14904

15 months agoAdd compatibility symlinks for FreeBSD 12.{3,4} and 13.{0,1,2}
Mike Swanson [Fri, 26 May 2023 22:37:15 +0000 (15:37 -0700)]
Add compatibility symlinks for FreeBSD 12.{3,4} and 13.{0,1,2}

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mike Swanson <mikeonthecomputer@gmail.com>
Closes #14902

15 months agoAdding new read-only compatible zpool features to compatibility.d/grub2
Colm [Fri, 26 May 2023 17:04:19 +0000 (10:04 -0700)]
Adding new read-only compatible zpool features to compatibility.d/grub2

GRUB2 is compatible with all "read-only compatible" features,
so it is safe to add new features of this type to the grub2
compatibility list. We generally want to include all compatible
features, to minimize the differences between grub2-compatible
pools and no-compatibility pools.

Adding new properties `livelist` and `zpool_checkpoint` accordingly.

Also adding them to the man page which references this file as an
example, for consistency.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Colm Buckley <colm@tuatha.org>
Closes #14893

15 months agobtree: Implement faster binary search algorithm
Richard Yao [Fri, 26 May 2023 17:03:12 +0000 (13:03 -0400)]
btree: Implement faster binary search algorithm

This implements a binary search algorithm for B-Trees that reduces
branching to the absolute minimum necessary for a binary search
algorithm. It also enables the compiler to inline the comparator to
ensure that the only slowdown when doing binary search is from waiting
for memory accesses. Additionally, it instructs the compiler to unroll
the loop, which gives an additional 40% improve with Clang and 8%
improvement with GCC.

Consumers must opt into using the faster algorithm. At present, only
B-Trees used inside kernel code have been modified to use the faster
algorithm.

Micro-benchmarks suggest that this can improve binary search performance
by up to 3.5 times when compiling with Clang 16 and up to 1.9 times when
compiling with GCC 12.2.
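
For context, a plain, self-contained binary search over a sorted int
array; the patched zfs_btree version goes further by minimizing
branches, inlining the comparator, and unrolling the loop.

    #include <stddef.h>

    /* Return the index of the first element >= key (n if there is none). */
    static size_t
    lower_bound(const int *a, size_t n, int key)
    {
            size_t lo = 0, hi = n;

            while (lo < hi) {
                    size_t mid = lo + (hi - lo) / 2;

                    if (a[mid] < key)
                            lo = mid + 1;
                    else
                            hi = mid;
            }
            return (lo);
    }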

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14866

15 months agoFix inconsistent definition of zfs_scrub_error_blocks_per_txg
George Amanakis [Fri, 26 May 2023 16:53:00 +0000 (18:53 +0200)]
Fix inconsistent definition of zfs_scrub_error_blocks_per_txg

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14894

15 months agoAdd missing files to Debian DKMS package
Damiano Albani [Thu, 25 May 2023 23:10:54 +0000 (01:10 +0200)]
Add missing files to Debian DKMS package

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Umer Saleem <usaleem@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Damiano Albani <damiano.albani@gmail.com>
Closes #14887
Closes #14889

15 months agoUpdate compatibility.d files
Brian Behlendorf [Thu, 25 May 2023 20:53:08 +0000 (13:53 -0700)]
Update compatibility.d files

Add an openzfs-2.2 compatibility file for the next release.

Edon-R support has been enabled for FreeBSD removing the need
for different FreeBSD and Linux files.  Symlinks for the -linux
and -freebsd names are created for any scripts expecting that
convention.

Additionally, a symlink for ubuntu-22.04 was added.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14833

15 months agozil: Add some more statistics.
Alexander Motin [Thu, 25 May 2023 20:51:53 +0000 (16:51 -0400)]
zil: Add some more statistics.

In addition to the number of actual log bytes written, also account
for the total bytes written including padding and the total allocated
bytes (bytes <= write <= alloc).  It should allow monitoring of ZIL
traffic and space efficiency.

Add dtrace probe for zil block size selection.

Make zilstat report more information and fit it into less width.

Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by:   iXsystems, Inc.
Closes #14863

16 months agoZIL: Reduce scope of per-dataset zl_issuer_lock.
Alexander Motin [Thu, 25 May 2023 16:48:43 +0000 (12:48 -0400)]
ZIL: Reduce scope of per-dataset zl_issuer_lock.

Before this change ZIL copied all log data while holding the lock.
It caused huge lock contention on workloads with many big parallel
writes.  This change splits the process into two parts: first,
zil_lwb_assign() estimates the log space needed for all transactions,
and zil_lwb_write_close() allocates blocks and zios while holding the
lock, then, after the lock is dropped, zil_lwb_commit() copies the
data, and zil_lwb_write_issue() issues the I/Os.

Also, while there, slightly reduce the scope of zl_lock.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by:   iXsystems, Inc.
Closes #14841

16 months agosystemd: Use non-absolute paths in Exec* lines
Dimitri John Ledkov [Wed, 24 May 2023 19:31:28 +0000 (20:31 +0100)]
systemd: Use non-absolute paths in Exec* lines

Since systemd v239, Exec* binaries are resolved from PATH when they
are not-absolute. Switch to this by default for ease of downstream
maintenance. Many downstream distributions move individual binaries
to locations that existing compile-time configurations cannot
accommodate.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Dimitri John Ledkov <dimitri.ledkov@canonical.com>
Closes #14880

16 months agoFix concurrent resilvers initiated at same time
Akash B [Wed, 24 May 2023 19:28:09 +0000 (00:58 +0530)]
Fix concurrent resilvers initiated at same time

For draid vdevs it was possible to initiate both the
sequential and healing resilver at the same time.

This fixes the following two scenarios.
     1) There's a window where a sequential rebuild can
be started via ZED even if a healing resilver has been
scheduled.
- This is fixed by adding an additional check in
spa_vdev_attach() for any scheduled resilver and returning
the appropriate error code when a resilver is already in
progress.

     2) It was possible for zpool clear to start a healing
resilver when it wasn't needed at all. This occurs because
during a vdev_open() the device is presumed to be healthy
until it is validated by vdev_validate() and set
unavailable. However, by this point an async resilver will
have already been requested if the DTL isn't empty.
- This is fixed by cancelling the SPA_ASYNC_RESILVER
request immediately at the end of vdev_reopen() when a resilver
is unneeded.

Finally, a test case was added to ZTS for verification.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Dipak Ghosh <dipak.ghosh@hpe.com>
Signed-off-by: Akash B <akash-b@hpe.com>
Closes #14881
Closes #14892

16 months agoLinux 6.4 compat: reclaimed_slab renamed to reclaimed
youzhongyang [Wed, 24 May 2023 19:23:42 +0000 (15:23 -0400)]
Linux 6.4 compat: reclaimed_slab renamed to reclaimed

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes #14891

16 months agoHold db_mtx when updating db_state
Brian Atkinson [Fri, 19 May 2023 20:05:53 +0000 (16:05 -0400)]
Hold db_mtx when updating db_state

Commit 555ef90 did some general code refactoring for
dmu_buf_will_not_fill() and dmu_buf_will_fill().  However, the db_mtx was
not held when updating db->db_state in those code blocks.  The rest of
the dbuf code always holds the db_mtx when updating db_state.  This is
important because cv_wait() on db_changed is used to check for db_state
changes.

Update dmu_buf_will_not_fill() and dmu_buf_will_fill() to hold the
db_mtx when updating db_state.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #14875

16 months agoProbe vdevs before marking removed
Brian Behlendorf [Fri, 19 May 2023 20:05:09 +0000 (13:05 -0700)]
Probe vdevs before marking removed

Before allowing the ZED to mark a vdev as REMOVED due to a
hotplug event confirm that it is non-responsive with probe.
Any device which can be successfully probed should be left
ONLINE to prevent a healthy pool from being incorrectly
SUSPENDED.  This may occur for at least the following two
scenarios.

1) Drive expansion (zpool online -e) in VMware environments.
   If, during the partition resize operation, a partition is
   removed and re-created then udev will send a removed event.

2) Re-scanning the namespaces of an NVMe device (nvme ns-rescan)
   may result in a udev remove and add event being delivered.

Finally, update the ZED to only kick in a spare when the
removal was successful.

Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #14859
Closes #14861

16 months agoTeach zpool scrub to scrub only blocks in error log
George Amanakis [Fri, 17 Dec 2021 20:35:28 +0000 (21:35 +0100)]
Teach zpool scrub to scrub only blocks in error log

Added a flag '-e' in zpool scrub to scrub only blocks in error log. A
user can pause, resume and cancel the error scrub by passing additional
command line arguments -p -s just like a regular scrub. This involves
adding a new flag, creating new libzfs interfaces, a new ioctl, and the
actual iteration and read-issuing logic. Error scrubbing is executed in
multiple txg to make sure pool performance is not affected.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Co-authored-by: TulsiJain tulsi.jain@delphix.com
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #8995
Closes #12355

16 months agoAdd the ability to uninitialize
Brian Behlendorf [Thu, 18 May 2023 17:02:20 +0000 (10:02 -0700)]
Add the ability to uninitialize

zpool initialize functions well for touching every free byte...once.
But if we want to do it again, we're currently out of luck.

So let's add zpool initialize -u to clear it.

Co-authored-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #12451
Closes #14873

16 months agotest-runner: pass kmemleak and kmsg to Cmd.run
Antonio Russo [Mon, 15 May 2023 23:11:33 +0000 (17:11 -0600)]
test-runner: pass kmemleak and kmsg to Cmd.run

test-runner.py orchestrates all of the ZTS executions. The `Cmd` object
manages these processes, and its `run` method specifically invokes these
possibly long-running processes, possibly retrying in the event of a
timeout. Since its inception, memory leak detection using the kmemleak
infrastructure [1], and kernel logging [2] have been added to this run
mechanism.

However, the callback to cull a process beyond its timeout threshold,
`kill_cmd`, has evaded modernization by both of these changes. As a
result, this function fails to properly invoke `run`, leading to an
untrapped exception and unreported test failure.

This patch extends `kill_cmd` to receive these kernel devices through
the `options` parameter, and regularizes all the `.run` calls from
`Cmd`, and its subclasses, to accept that parameter.

[1] Commit a69765ea5b563e0cd4d15fac4b1ac08c6ccf12d1
[2] Commit fc2c0256c55a2859d1988671b0896d22b75c8aba

Reviewed-by: John Wren Kennedy <john.kennedy@delphix.com>
Signed-off-by: Antonio Russo <aerusso@aerusso.net>
Closes #14849

16 months agoFix undefined behavior in spa_sync_props()
Richard Yao [Fri, 12 May 2023 21:10:14 +0000 (17:10 -0400)]
Fix undefined behavior in spa_sync_props()

8eae2d214cfa53862833eeeda9a5c1e9d5ded47d caused Coverity to begin
complaining about "Improper use of negative value" in two places in
spa_sync_props() because Coverity correctly inferred from `prop ==
ZPOOL_PROP_INVAL` that prop could be -1 while both zpool_prop_to_name()
and zpool_prop_get_type() use it as an array index, which is undefined
behavior.

Assuming that the system does not panic from an attempt to read invalid
memory, the case statement for ZPOOL_PROP_INVAL will ensure that only
user properties will reach this code when prop is ZPOOL_PROP_INVAL, such
that execution will continue safely. However, if we are unlucky enough
to read invalid memory, then the system will panic.

This issue predates the patch that caused coverity to begin complaining.
Thankfully, our userland tools do not pass nonsense to us, so this bug
should not be triggered unless a future userland tool attempts to set a
property that we do not understand.

Reported-by: Coverity (CID-1561129)
Reported-by: Coverity (CID-1561130)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14860

16 months agoFix use after free regression in spa_remove_healed_errors()
Richard Yao [Fri, 12 May 2023 20:47:56 +0000 (16:47 -0400)]
Fix use after free regression in spa_remove_healed_errors()

6839ec6f1098c28ff7b772f1b31b832d05e6b567 placed code in
spa_remove_healed_errors() that uses a pointer after the kmem_free()
call that frees it.

Reported-by: Coverity (CID-1562375)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14860

16 months agozil: Free lwb_buf after write completion.
Alexander Motin [Fri, 12 May 2023 16:49:26 +0000 (12:49 -0400)]
zil: Free lwb_buf after write completion.

There is no sense in keeping that memory allocated during the flush.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by:   iXsystems, Inc.
Closes #14855

16 months agozil: Some micro-optimizations.
Alexander Motin [Fri, 12 May 2023 16:14:29 +0000 (12:14 -0400)]
zil: Some micro-optimizations.

Should not cause functional changes.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by:   iXsystems, Inc.
Closes #14854

16 months agoRefine special_small_blocks property validation
Don Brady [Fri, 12 May 2023 16:12:28 +0000 (10:12 -0600)]
Refine special_small_blocks property validation

When the special_small_blocks property is being set during a pool
create it enforces a limit of 128KiB even if the pool's record size
is larger.

If the recordsize property is being set during a pool create, then
use that value instead of the default SPA_OLD_MAXBLOCKSIZE value.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #13815
Closes #14811

16 months agoZTS: Add auto_replace_001_pos to exceptions
Brian Behlendorf [Fri, 12 May 2023 16:07:58 +0000 (09:07 -0700)]
ZTS: Add auto_replace_001_pos to exceptions

The auto_replace_001_pos test case does not reliably pass on
Fedora 37 and newer.  Until the test case can be updated to make
it reliable add it to the list of "maybe" exceptions on Linux.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #14851
Closes #14852

16 months agoMake sure we are not trying to clone a spill block.
Pawel Jakub Dawidek [Wed, 10 May 2023 05:32:30 +0000 (22:32 -0700)]
Make sure we are not trying to clone a spill block.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14825

16 months agoCorrect comment.
Pawel Jakub Dawidek [Thu, 4 May 2023 23:14:19 +0000 (16:14 -0700)]
Correct comment.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14825

16 months agoRemove badly placed comment.
Pawel Jakub Dawidek [Thu, 4 May 2023 06:25:22 +0000 (23:25 -0700)]
Remove badly placed comment.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14825

16 months agoDon't call zfs_exit_two() before zfs_enter_two().
Pawel Jakub Dawidek [Wed, 3 May 2023 07:24:47 +0000 (00:24 -0700)]
Don't call zfs_exit_two() before zfs_enter_two().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14825

16 months agoDon't use dmu_buf_is_dirty() for unassigned transaction.
Pawel Jakub Dawidek [Tue, 2 May 2023 22:46:14 +0000 (15:46 -0700)]
Don't use dmu_buf_is_dirty() for unassigned transaction.

The dmu_buf_is_dirty() call doesn't make sense here for two reasons:
1. txg is 0 for unassigned tx, so it was a no-op.
2. It is equivalent to checking if we have dirty records, and we are
   doing that a few lines earlier.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14825

16 months agoDeny block cloning if dbuf size doesn't match BP size.
Pawel Jakub Dawidek [Tue, 2 May 2023 21:24:43 +0000 (14:24 -0700)]
Deny block cloning if dbuf size doesn't match BP size.

I don't know an easy way to shrink down dbuf size, so just deny block cloning
into dbufs that don't match our BP's size.

This fixes the following situation:
1. Create a small file, eg. 1kB of random bytes. Its dbuf will be 1kB.
2. Create a larger file, eg. 2kB of random bytes. Its dbuf will be 2kB.
3. Truncate the large file to 0. Its dbuf will remain 2kB.
4. Clone the small file into the large file. Small file's BP lsize is
   1kB, but the large file's dbuf is 2kB.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14825

16 months agoAdditional block cloning fixes.
Pawel Jakub Dawidek [Sun, 30 Apr 2023 09:47:09 +0000 (02:47 -0700)]
Additional block cloning fixes.

Reimplement some of the block cloning vs dbuf logic, mostly to fix
situation where we clone a block and in the same transaction group
we want to partially overwrite the clone.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14825

16 months agozil: Don't expect zio_shrink() to succeed.
Alexander Motin [Thu, 11 May 2023 21:27:12 +0000 (17:27 -0400)]
zil: Don't expect zio_shrink() to succeed.

At least for RAIDZ, zio_shrink() does not reduce the zio size, but a
reduced wsz in that case likely results in writing uninitialized memory.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by:   iXsystems, Inc.
Closes #14853

16 months agoPrevent panic during concurrent snapshot rollback and zvol read
Ameer Hamza [Wed, 10 May 2023 00:56:35 +0000 (05:56 +0500)]
Prevent panic during concurrent snapshot rollback and zvol read

Protect zvol_cdev_read with zv_suspend_lock to prevent concurrent
release of the dnode, avoiding panic when a snapshot is rolled back
in parallel during ongoing zvol read operation.

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #14839

16 months agopam: Fix "buffer overflow" in pam ZTS tests on F38
Tony Hutter [Wed, 10 May 2023 00:55:19 +0000 (17:55 -0700)]
pam: Fix "buffer overflow" in pam ZTS tests on F38

The pam ZTS tests were reporting a buffer overflow on F38, possibly
due to F38 now setting _FORTIFY_SOURCE=3 by default.  gdb and
valgrind narrowed this down to a snprintf() buffer overflow in
zfs_key_config_modify_session_counter().  I'm not clear why this
particular snprintf() was being flagged as an overflow, but when
I replaced it with an asprintf(), the test passed reliably.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #14802
Closes #14842

16 months agoAdd dmu_tx_hold_append() interface
Brian Behlendorf [Tue, 9 May 2023 16:03:10 +0000 (09:03 -0700)]
Add dmu_tx_hold_append() interface

Provides an interface which callers can use to declare a write when
the exact starting offset is not yet known.  Since the full range
being updated is not available, only the first L0 block at the
provided offset will be prefetched.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14819

16 months agoDebug auto_replace_001_pos failures
Brian Behlendorf [Tue, 9 May 2023 15:57:02 +0000 (08:57 -0700)]
Debug auto_replace_001_pos failures

Reduced the timeout to 60 seconds which should be more than
sufficient and allow the test to be marked as FAILED rather
than KILLED.  Also dump the pool status on cleanup.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14829

16 months agoRemove duplicate code in l2arc_evict()
George Amanakis [Tue, 9 May 2023 15:54:41 +0000 (17:54 +0200)]
Remove duplicate code in l2arc_evict()

l2arc_evict() performs the adjustment of the size of buffers to be
written on L2ARC unnecessarily. l2arc_write_size() is called right
before l2arc_evict() and performs those adjustments.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14828

16 months agoRemove single parent assertion from zio_nowait().
Alexander Motin [Tue, 9 May 2023 15:54:01 +0000 (11:54 -0400)]
Remove single parent assertion from zio_nowait().

We only need to know whether the ZIO has any parent there.  We do not
care if it has more than one, but the use of zio_unique_parent() == NULL
asserts that.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14823