git.proxmox.com Git - mirror_zfs.git/log
3 months agoEnable L2 cache of all (MRU+MFU) metadata but MFU data only
shodanshok [Fri, 16 Aug 2024 20:34:07 +0000 (22:34 +0200)]
Enable L2 cache of all (MRU+MFU) metadata but MFU data only

`l2arc_mfuonly` was added to avoid wasting L2 ARC on read-once MRU
data and metadata. However it can be useful to cache as much
metadata as possible while, at the same time, restricting data
cache to MFU buffers only.

This patch allows for such behavior by setting `l2arc_mfuonly` to 2
(or higher). The list of possible values is the following:
0: cache both MRU and MFU for both data and metadata;
1: cache only MFU for both data and metadata;
2: cache both MRU and MFU for metadata, but only MFU for data.
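
The three modes listed above can be read as a small decision function. The following is a minimal sketch only: the enum names and `l2_eligible_sketch()` are hypothetical and are not the identifiers used in the ARC code, and the setting is assumed to be non-negative.

```c
#include <stdbool.h>

/* Hypothetical names for illustration; not the OpenZFS identifiers. */
enum buf_kind { BUF_DATA, BUF_METADATA };
enum arc_list { LIST_MRU, LIST_MFU };

/*
 * Return true if a buffer of the given kind, sitting on the given ARC
 * list, may be written to L2ARC under the given l2arc_mfuonly value.
 * Assumes mfuonly is non-negative.
 */
bool
l2_eligible_sketch(int mfuonly, enum buf_kind kind, enum arc_list list)
{
    if (list == LIST_MFU)
        return (true);      /* MFU buffers are eligible in every mode */
    if (mfuonly == 0)
        return (true);      /* 0: MRU allowed for data and metadata */
    if (mfuonly == 1)
        return (false);     /* 1: MFU only, for data and metadata */
    /* 2 (or higher): MRU allowed for metadata only */
    return (kind == BUF_METADATA);
}
```

Under this reading, the only behaviour that mode 2 adds over mode 1 is that MRU metadata becomes eligible again.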

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Closes #16343
Closes #16402

3 months agoMan page updates for dmu_ddt_copies
Allan Jude [Tue, 23 Jul 2024 20:51:01 +0000 (20:51 +0000)]
Man page updates for dmu_ddt_copies

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes #15895

3 months agoddt: lookup and log stats
Rob Norris [Mon, 25 Sep 2023 01:02:46 +0000 (11:02 +1000)]
ddt: lookup and log stats

Adds per-DDT stats counting lookups and where they were serviced from
(either log or backing zap), number of log entries in memory, and flow
rates.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15895

3 months agoddt: block scan until log is flushed, and flush aggressively
Rob Norris [Mon, 16 Oct 2023 00:52:17 +0000 (11:52 +1100)]
ddt: block scan until log is flushed, and flush aggressively

The dedup log does not have a stable cursor, so it's not possible to
persist our current scan location within it across pool reloads.
Because of this, when walking (scanning), we can't treat it like just
another source of dedup entries.

Instead, when a scan is wanted, we switch to an aggressive flushing
mode, pushing out entries older than the scan start txg as fast as we
can, before starting the scan proper.

Entries after the scan start txg will be handled via other methods; the
DDT ZAPs and logs will be written as normal, and blocks not seen yet
will be offered to the scan machinery as normal.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15895

3 months agoddt: dedup log
Rob Norris [Thu, 22 Jun 2023 07:46:22 +0000 (17:46 +1000)]
ddt: dedup log

Adds a log/journal to dedup. At the end of txg, instead of writing the
entry directly to the ZAP, it is added to an in-memory tree and
appended to an on-disk object. The on-disk object is only read at
import, to reload the in-memory tree.

Lookups first go to the log tree before going to the ZAP, so
recently-used entries will remain close by in memory. This vastly
reduces overhead from dedup IO, as it will not have to do so many
read/update/write cycles on ZAP leaf nodes.
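
The lookup order described here (log first, ZAP only on a miss) could be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: `entry_t`, `store_t`, `hit_t` and `dedup_lookup_sketch()` are hypothetical, and plain arrays stand in for the in-memory tree and the on-disk ZAP.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins: a key/refcount pair and a tiny array store. */
typedef struct {
    uint64_t key;
    uint64_t refcnt;
} entry_t;

typedef struct {
    const entry_t *entries;
    size_t count;
} store_t;

typedef enum { HIT_NONE, HIT_LOG, HIT_ZAP } hit_t;

/* Linear search; the real stores are a tree and a ZAP, not arrays. */
static const entry_t *
store_find(const store_t *s, uint64_t key)
{
    for (size_t i = 0; i < s->count; i++) {
        if (s->entries[i].key == key)
            return (&s->entries[i]);
    }
    return (NULL);
}

/*
 * Consult the in-memory log first, and fall back to the backing store
 * only when the log has no entry for the key. *where records which
 * store serviced the lookup.
 */
const entry_t *
dedup_lookup_sketch(const store_t *log, const store_t *zap, uint64_t key,
    hit_t *where)
{
    const entry_t *e = store_find(log, key);
    if (e != NULL) {
        *where = HIT_LOG;
        return (e);
    }
    e = store_find(zap, key);
    *where = (e != NULL) ? HIT_ZAP : HIT_NONE;
    return (e);
}
```

When a key is present in both stores, the log copy is returned, which matches the idea that the log holds the most recent version of an entry.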

A flushing facility is added at end of txg, to push logged entries out
to the ZAP. There are actually two separate "logs" (in-memory tree and
on-disk object), one active (receiving updated entries) and one flushing
(writing out to disk). These are swapped (i.e. flushing begins) based on
memory used by the in-memory log trees and time since we last flushed
something.

The flushing facility monitors the amount of entries coming in and being
flushed out, and calibrates itself to try to flush enough each txg to
keep up with the ingest rate without competing too much with other IO.
Multiple tuneables are provided to control the flushing facility.

All the histograms and stats are updated to accommodate the log as a
separate entry store. zdb gains knowledge of how to count them and dump
them. Documentation included!

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15895

3 months agoddt: tuneable to override copies= on dedup metadata objects
Rob Norris [Fri, 6 Oct 2023 06:06:34 +0000 (17:06 +1100)]
ddt: tuneable to override copies= on dedup metadata objects

All objects stored in the MOS get copies=3. For a large dedup table,
this requires significant extra IO and disk space, when it's not really
necessary - the dedup table itself isn't needed to read or write data,
only to keep data usage down. Losing the dedup table does not render the
pool unusable, it just messes up the accounting somewhat.

This adds a dmu_ddt_copies tuneable. When set to 0, the existing
behaviour is used. When set higher, dedup table blocks (ZAP and log)
will have this many copies rather than the usual 3, while indirect
blocks will have one more again.
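
As a rough illustration of the rule in the paragraph above, the copy count for a dedup table block might be chosen like this. The function name, the `is_indirect` parameter and the cap at three copies are assumptions made for the sketch; the commit message itself does not state how values above 3 are handled.

```c
/* Assumption for this sketch: a block never has more than three copies. */
#define SKETCH_MAX_COPIES 3

/*
 * Pick the number of copies for a dedup table block (ZAP or log).
 * ddt_copies is the tuneable value; 0 means "keep existing behaviour",
 * which for MOS objects is copies=3.
 */
int
ddt_block_copies_sketch(int ddt_copies, int is_indirect)
{
    if (ddt_copies == 0)
        return (SKETCH_MAX_COPIES);

    /* Indirect blocks get one more copy than the blocks below them. */
    int copies = ddt_copies + (is_indirect ? 1 : 0);

    return (copies > SKETCH_MAX_COPIES ? SKETCH_MAX_COPIES : copies);
}
```

With the tuneable set to 1, this gives one copy for dedup table blocks and two for their indirect blocks.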

This is a tuneable for now mostly for testing. Losing a dedup table can
cause blocks to be leaked, and we currently have no facilities to repair
that.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15895

3 months agoddt: compare keys 64-bits at a time, trying to match ZAP order
Rob Norris [Wed, 11 Oct 2023 01:46:55 +0000 (12:46 +1100)]
ddt: compare keys 64-bits at a time, trying to match ZAP order

This yields substantial performance improvements when we only write out
some small % of entries at a time, as it will cause entries that will go
into "nearby" ZAP leaf nodes to be grouped closer together in the AVL, and
so touch fewer blocks. Without this, the distribution is an even spread,
so we touch a lot more ZAP leaf nodes for any given number of entries.
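
A word-at-a-time comparator of the kind described could look like the sketch below. `KEY_WORDS` and `key_compare_sketch()` are hypothetical, and comparing from the first word onwards is an assumption; the commit message does not spell out which word ordering matches the ZAP.

```c
#include <stddef.h>
#include <stdint.h>

#define KEY_WORDS 5     /* hypothetical key size, in 64-bit words */

/*
 * Three-way compare of two keys, one 64-bit word at a time, starting
 * from word 0. Returns -1, 0 or 1, as an AVL comparator would.
 */
int
key_compare_sketch(const uint64_t a[KEY_WORDS], const uint64_t b[KEY_WORDS])
{
    for (size_t i = 0; i < KEY_WORDS; i++) {
        if (a[i] < b[i])
            return (-1);
        if (a[i] > b[i])
            return (1);
    }
    return (0);
}
```

Note that this is not equivalent to a bytewise `memcmp()` on little-endian machines, where the least significant byte of each word comes first in memory, so the two approaches sort the same keys differently.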

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15895

3 months agoddt: cleanup the stats & histogram code
Rob Norris [Thu, 15 Jun 2023 07:19:41 +0000 (17:19 +1000)]
ddt: cleanup the stats & histogram code

Both the API and the code were kinda mangled and I was really struggling
to follow it. The worst offender was the old ddt_stat_add(); after
fixing it up the rest of the changes are mostly knock-on effects and
targets of opportunity.

Note that the old ddt_stat_add() was safe against overflows - it could
produce crazy numbers, but the compiler wouldn't do anything stupid. The
assertions in ddt_stat_sub() go a lot of the way to protecting against
this; getting in a position where overflows are a problem is definitely
a programming error.

Also expanding ddt_stat_add() and ddt_histogram_empty() produces less
efficient assembly. I'm not bothered about this right now though; these
should not be hot functions, and if they are we'll optimise them later.
If we have to go back to the old form, we'll comment it like crazy.

Finally, I've removed the assertion that the bucket will never be
negative, as it will soon be possible to have entries with zero
refcounts: an entry for a block that is no longer on the pool, but is on
the log waiting to be synced out. It might be better to have a separate
bucket for these, since they're still using real space on disk, but
ultimately these stats are driving UI, and for now I've chosen to keep
them matching how they've looked in the past, as well as match the
operator's mental model - pool usage is managed elsewhere.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15895

3 months agoddt: add "flat phys" feature
Rob Norris [Tue, 20 Jun 2023 01:09:48 +0000 (11:09 +1000)]
ddt: add "flat phys" feature

Traditional dedup keeps a separate ddt_phys_t "type" for each possible
count of DVAs (that is, copies=) parameter. Each of these are tracked
independently of each other, and have their own set of DVAs. This leads
to an (admittedly rare) situation where you can create as many as six
copies of the data, by changing the copies= parameter between copying.
This is both a waste of storage on disk and a waste of space in
the stored DDT entries, since there never needs to be more than three
DVAs to handle all possible values of copies=.

This commit adds a new FDT feature, DDT_FLAG_FLAT. When active, only the
first ddt_phys_t is used. Each time a block is written with the dedup
bit set, this single phys is checked to see if it has enough DVAs to
fulfill the request. If it does, the block is filled with the saved DVAs
as normal. If not, an adjusted write is issued to create as many extra
copies as are needed to fulfill the request, which are then saved into
the entry too.
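
The check described above boils down to comparing the DVAs already saved in the single phys against the number of copies the write asks for. A minimal sketch, with a hypothetical function name and plain integers in place of the real entry structures:

```c
/*
 * Given the number of DVAs already saved in the entry's single phys and
 * the number of copies requested by this write, return how many extra
 * copies must be written. Zero means the saved DVAs already satisfy the
 * request and can be used as-is.
 */
int
flat_extra_copies_sketch(int have_dvas, int want_copies)
{
    if (want_copies > have_dvas)
        return (want_copies - have_dvas);
    return (0);
}
```

For example, an entry first written with copies=1 and later written with copies=3 would need two extra copies, rather than three entirely new ones.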

Because a single phys is no longer an all-or-nothing, but can be
transitioning from fewer to more DVAs, the write path now has to keep a
copy of the previous "known good" DVA set so we can revert to it in case
an error occurs. zio_ddt_write() has been restructured and heavily
commented to make it much easier to see what's happening.

Backwards compatibility is maintained simply by allocating four
ddt_phys_t when the DDT_FLAG_FLAT flag is not set, and updating the phys
selection macros to check the flag. In the old arrangement, each number
of copies gets a whole phys, so it will always have either zero or all
necessary DVAs filled, with no in-between, so the old behaviour
naturally falls out of the new code.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Co-authored-by: Don Brady <don.brady@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15893

3 months agoddt: slim down ddt_entry_t
Rob Norris [Mon, 3 Jul 2023 09:54:40 +0000 (19:54 +1000)]
ddt: slim down ddt_entry_t

This slims down the in-memory entry to as small as it can be. The
IO-related parts are made into a separate entry, since they're
relatively rarely needed.

The variable allocation for dde_phys is to support the upcoming flat
format.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15893

3 months agoddt: introduce lightweight entry
Rob Norris [Mon, 3 Jul 2023 12:16:04 +0000 (22:16 +1000)]
ddt: introduce lightweight entry

The idea here is that sometimes you need the contents of an entry with
no intent to modify it, and/or from a place where it's difficult to get
hold of its originating ddt_t to know how to interpret it.

A lightweight entry contains everything you might need to "read" an
entry - its key, type and phys contents - but none of the extras for
modifying it or using it in a larger context. It also has the full
complement of phys slots, so it can represent any kind of dedup entry
without having to know the specific configuration of the table it came
from.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15893

3 months agoddt: rework access to phys array slots
Rob Norris [Mon, 3 Jul 2023 05:16:02 +0000 (15:16 +1000)]
ddt: rework access to phys array slots

The "flat phys" feature will use only a single phys slot for all
entries, which means the old "single", "double" etc naming now makes no
sense, and more importantly, means that choosing the right slot for a
given block pointer will depend on how many slots are in use for a given
DDT.

This removes the old names, and adds accessor macros to decouple
specific phys array indexes from any particular meaning.

(These macros look strange in isolation, mainly in the way they take the
ddt_t* as an arg but don't use it. This is mostly a separate commit to
introduce the concept to the reader before the "flat phys" commit
extends it).

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15893

3 months agozdb: rework DDT block count and leak check to just count the blocks
Rob Norris [Tue, 18 Jun 2024 04:11:11 +0000 (14:11 +1000)]
zdb: rework DDT block count and leak check to just count the blocks

The upcoming dedup features break the long held assumption that all
blocks on disk with a 'D' dedup bit will always be present in the DDT,
or will have the same set of DVA allocations on disk as in the DDT.

If the DDT is no longer a complete picture of all the dedup blocks that
will be and should be on disk, then it does us no good to walk and prime
it up front, since it won't necessarily match up with every block we'll
see anyway.

Instead, we rework things here to be more like the BRT checks. When we
see a dedup'd block, we look it up in the DDT, consume a refcount, and
for the second-or-later instances, count them as duplicates.
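
The counting scheme in the paragraph above might be sketched like this. `ddt_count_t` and `note_dedup_block_sketch()` are hypothetical and exist only to illustrate "consume a refcount, count later sightings as duplicates"; they are not zdb's data structures.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-entry bookkeeping for the block walk. */
typedef struct {
    uint64_t refcnt;    /* references not yet matched to a block */
    int seen;           /* nonzero once the first instance was seen */
} ddt_count_t;

/*
 * Record one sighting of a dedup'd block. Consumes one reference and
 * returns 1 if this sighting is a duplicate (second or later instance),
 * or 0 if it is the first instance.
 */
int
note_dedup_block_sketch(ddt_count_t *e)
{
    assert(e->refcnt > 0);  /* more sightings than references is an error */
    e->refcnt--;

    if (!e->seen) {
        e->seen = 1;
        return (0);
    }
    return (1);
}
```

After the walk, any entry with references left over in this model would indicate blocks that the table expected but the walk never found.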

The DDT and BRT are moved ahead of the space accounting. This will
become important for the "flat" feature, which may need to count a
modified version of the block.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Co-authored-by: Don Brady <don.brady@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15892

3 months agoZTS: tests for dedup legacy/FDT tables
Rob Norris [Thu, 13 Jun 2024 04:50:33 +0000 (14:50 +1000)]
ZTS: tests for dedup legacy/FDT tables

Very basic coverage to make sure things appear to work, have the right
format on disk, and pool upgrades and mixed table types work as
expected.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15892

3 months agoddt: add FDT feature and support for legacy and new on-disk formats
Rob Norris [Tue, 20 Jun 2023 02:06:13 +0000 (12:06 +1000)]
ddt: add FDT feature and support for legacy and new on-disk formats

This is the supporting infrastructure for the upcoming dedup features.

Traditionally, dedup objects live directly in the MOS root. While their
details vary (checksum, type and class), they are all the same "kind" of
thing - a store of dedup entries.

The new features are more varied than that, and are better thought of as
a set of related stores for the overall state of a dedup table.

This adds a new feature flag, SPA_FEATURE_FAST_DEDUP. Enabling this will
cause new DDTs to be created as a ZAP in the MOS root, named
DDT-<checksum>. This is used as the root object for the normal type/class
store objects, but will also be a place for any storage required by new
features.

This commit adds two new fields to ddt_t, for version and flags. These
are intended to describe the structure and features of the overall dedup
table, and are stored as-is in the DDT root. In this commit, flags are
always zero, but the intent is that they can be used to hang optional
logic or state onto for new dedup features. Version is always 1.

For a "legacy" dedup table, where no DDT root directory exists, the
version will be 0.

ddt_configure() is expected to determine the version and flags features
currently in operation based on whether or not the fast_dedup feature is
enabled, and from what's available on disk. In this way, it's possible to
support both old and new tables.
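
One way to picture the version decision is the sketch below. It is a guess at the shape of the logic based only on the description above, not a copy of ddt_configure(); the function name, its three boolean inputs and the precedence between them are all assumptions.

```c
#include <stdbool.h>

/* Version numbers as described: 0 for legacy tables, 1 for FDT. */
enum { SKETCH_DDT_LEGACY = 0, SKETCH_DDT_FDT = 1 };

/*
 * Decide which table version to operate with.
 *   root_dir_on_disk:       a DDT root ZAP already exists
 *   legacy_objects_on_disk: old-style objects exist in the MOS root
 *   fast_dedup_enabled:     the fast_dedup feature flag is enabled
 */
int
ddt_pick_version_sketch(bool root_dir_on_disk, bool legacy_objects_on_disk,
    bool fast_dedup_enabled)
{
    if (root_dir_on_disk)
        return (SKETCH_DDT_FDT);        /* existing new-style table */
    if (legacy_objects_on_disk)
        return (SKETCH_DDT_LEGACY);     /* keep using the old table */
    /* Nothing on disk yet: a new table follows the feature flag. */
    return (fast_dedup_enabled ? SKETCH_DDT_FDT : SKETCH_DDT_LEGACY);
}
```

In this model an existing legacy table stays at version 0 even after the feature is enabled, which is consistent with the note that no upgrade path is implemented yet.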

This also provides a migration path. A legacy setup can be upgraded to
FDT by creating the DDT root ZAP, moving the existing objects into it,
and setting version and flags appropriately. There's no support for that
here, but it would be straightforward to add later and allows the
possibility that newer features could be applied to existing dedup
tables.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15892

3 months agolinux/zvol_os: fix zvol queue limits initialization
Ameer Hamza [Thu, 15 Aug 2024 21:29:50 +0000 (02:29 +0500)]
linux/zvol_os: fix zvol queue limits initialization

zvol queue limits initialization depends on `zv_volblocksize`, but it is
initialized later, leading to several limits being initialized with
incorrect values, including `max_discard_*` limits. This also causes
`blkdiscard` command to consistently fail, as `blk_ioctl_discard` reads
`bdev_max_discard_sectors()` limits as 0, leading to failure. The fix is
straightforward: initialize `zv->zv_volblocksize` early, before setting
the queue limits. This PR should fix `zvol/zvol_misc/zvol_misc_trim`
failure on recent PRs, as the test case issues `blkdiscard` for a zvol.
Additionally, `zvol_misc_trim` was recently enabled in `6c7d41a`,
which is why the issue wasn't identified earlier.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #16454

3 months agoFix null ptr deref when renaming a zvol with snaps and snapdev=visible (#16316)
Justin Gottula [Thu, 15 Aug 2024 21:13:18 +0000 (14:13 -0700)]
Fix null ptr deref when renaming a zvol with snaps and snapdev=visible (#16316)

If a zvol is renamed, and it has one or more snapshots, and
snapdev=visible is true for the zvol, then the rename causes a kernel
null pointer dereference error. This has the effect (on Linux, anyway)
of killing the z_zvol taskq kthread, with locks still held; which in
turn causes a variety of zvol-related operations afterward to hang
indefinitely (such as udev workers, among other things).

The problem occurs because of an oversight in #15486
(e36ff84c338d2f7b15aef2538f6a9507115bbf4a). As documented in
dataset_kstats_create, some datasets may not actually have kstats
allocated for them; and at least at the present time, this is true for
snapshots. In practical terms, this means that for snapshots,
dk->dk_kstats will be NULL. The dataset_kstats_rename function
introduced in the patch above does not first check whether dk->dk_kstats
is NULL before proceeding, unlike e.g. the nearby
dataset_kstats_update_* functions.

In the very particular circumstance in which a zvol is renamed, AND that
zvol has one or more snapshots, AND that zvol also has snapdev=visible,
zvol_rename_minors_impl will loop over not just the zvol dataset itself,
but each of the zvol's snapshots as well, so that their device nodes
will be renamed as well. This results in dataset_kstats_create being
called for snapshots, where, as we've established, dk->dk_kstats is
NULL.

Fix this by simply adding a NULL check before doing anything in
dataset_kstats_rename.
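
The shape of the fix can be shown with a self-contained sketch. The struct layouts and `kstats_rename_sketch()` below are simplified, hypothetical stand-ins; only the `dk_kstats` field name and the early return on NULL come from the description above.

```c
#include <stdio.h>

/* Simplified stand-ins for the real kstat structures. */
typedef struct {
    char name[64];
} kstats_sketch_t;

typedef struct {
    kstats_sketch_t *dk_kstats;     /* NULL for datasets without kstats */
} dataset_kstats_sketch_t;

/*
 * Update the stored dataset name. Returns 1 if the name was updated, or
 * 0 if there were no kstats to update (as is the case for snapshots).
 */
int
kstats_rename_sketch(dataset_kstats_sketch_t *dk, const char *name)
{
    if (dk->dk_kstats == NULL)
        return (0);     /* the guard: nothing allocated, nothing to do */

    (void) snprintf(dk->dk_kstats->name, sizeof (dk->dk_kstats->name),
        "%s", name);
    return (1);
}
```

Without the guard, the `snprintf()` call would dereference a NULL pointer for any dataset that has no kstats allocated.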

This still allows the dataset_name kstat value for the zvol to be
updated (as was the intent of the original patch), and merely blocks
attempts by the code to act upon the zvol's non-kstat-having snapshots.
If at some future time, kstats are added for snapshots, then things
should work as intended in that case as well.

Signed-off-by: Justin Gottula <justin@jgottula.com>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Alan Somers <asomers@gmail.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>

3 months agoLinux 6.10 compat: Fix zvol NULL pointer dereference
Tony Hutter [Thu, 15 Aug 2024 21:05:58 +0000 (14:05 -0700)]
Linux 6.10 compat: Fix zvol NULL pointer dereference

zvol_alloc_non_blk_mq()->blk_queue_set_write_cache() needs the disk
queue setup to prevent a NULL pointer dereference.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #16453

3 months agoLinux 6.10 compat: fix rpm-kmod and builtin
Tony Hutter [Thu, 15 Aug 2024 21:00:18 +0000 (14:00 -0700)]
Linux 6.10 compat: fix rpm-kmod and builtin

The 6.10 kernel broke our rpm-kmod builds.  The 6.10 kernel really
wants the source files in the same directory as the object files.
This workaround makes rpm-kmod work again.  It also updates
the builtin kernel codepath to work correctly with 6.10.

See kernel commits:

b1992c3772e6 kbuild: use $(src) instead of $(srctree)/$(src) for source
                     directory
9a0ebe5011f4 kbuild: use $(obj)/ instead of $(src)/ for common pattern
                     rules

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #16439
Closes #16450

3 months agoFix incorrect error report on vdev attach/replace
Ameer Hamza [Thu, 15 Aug 2024 19:39:44 +0000 (00:39 +0500)]
Fix incorrect error report on vdev attach/replace

Report the correct error message in libzfs when attaching/replacing a
vdev with a higher ashift.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #16449

3 months agoFreeBSD: fix build without kernel option MAC
Gleb Smirnoff [Thu, 15 Aug 2024 16:08:43 +0000 (09:08 -0700)]
FreeBSD: fix build without kernel option MAC

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Mark Johnston <markj@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Gleb Smirnoff <glebius@FreeBSD.org>
Closes #16446

3 months agoFix projid accounting for xattr objects
Jitendra Patidar [Thu, 15 Aug 2024 00:59:19 +0000 (06:29 +0530)]
Fix projid accounting for xattr objects

A zpool upgraded with 'feature@project_quota' needs a re-layout of SAs
to fix SA_ZPL_PROJID at SA_PROJID_OFFSET (128). This is necessary for
the correct accounting of object usage against its projid.
When an old object (created before the upgrade) gets a projid assigned,
its SA is re-laid out via sa_add_projid(). If the object has an xattr
dir, the SA of the xattr dir is re-laid out too. But the SAs of xattr
objects inside an xattr dir are not re-laid out.

Fix zfs_setattr_dir() to re-layout SAs on xattr objects when setting a
projid on an old xattr object (created before the upgrade).

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com>
Closes #16355
Closes #16356

3 months agoAdd missing kstats to dataset kstats
Paul Dagnelie [Wed, 14 Aug 2024 21:18:46 +0000 (14:18 -0700)]
Add missing kstats to dataset kstats

Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #16431

3 months agoZTS: Use /dev/urandom instead of /dev/random
Tony Hutter [Wed, 14 Aug 2024 19:27:07 +0000 (12:27 -0700)]
ZTS: Use /dev/urandom instead of /dev/random

Use /dev/urandom so we never have to wait on entropy.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #16442

3 months agoLinux 6.11: avoid passing "end" sentinel to register_sysctl()
Rob Norris [Wed, 31 Jul 2024 11:39:31 +0000 (21:39 +1000)]
Linux 6.11: avoid passing "end" sentinel to register_sysctl()

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #16400

3 months agoLinux 6.11: add compat macro for page_mapping()
Rob Norris [Wed, 31 Jul 2024 08:43:39 +0000 (18:43 +1000)]
Linux 6.11: add compat macro for page_mapping()

Since the change to folios it has just been a wrapper anyway. Linux has
removed their wrapper, so we add one.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #16400

3 months agoLinux 6.11: add more queue_limit fields with removed setters
Rob Norris [Wed, 31 Jul 2024 07:22:20 +0000 (17:22 +1000)]
Linux 6.11: add more queue_limit fields with removed setters

These fields are very old, so no detection necessary; we just move them
into the limit setup functions.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #16400

3 months agoLinux 6.11: IO stats is now a queue feature flag
Rob Norris [Wed, 31 Jul 2024 04:48:58 +0000 (14:48 +1000)]
Linux 6.11: IO stats is now a queue feature flag

Apply them with the rest of the settings.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #16400

3 months agoLinux 6.11: first arg to proc_handler is now const
Rob Norris [Wed, 31 Jul 2024 02:15:07 +0000 (12:15 +1000)]
Linux 6.11: first arg to proc_handler is now const

Detect it, and use a macro to make sure we always match the prototype.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #16400

3 months agoLinux 6.11: get backing_dev_info through queue gendisk
Rob Norris [Tue, 30 Jul 2024 12:25:50 +0000 (22:25 +1000)]
Linux 6.11: get backing_dev_info through queue gendisk

It's no longer available directly on the request queue, but it's easy to
get from the attached disk.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #16400

3 months agoLinux 6.11: enable queue flush through queue limits
Rob Norris [Tue, 30 Jul 2024 11:40:35 +0000 (21:40 +1000)]
Linux 6.11: enable queue flush through queue limits

In 6.11 struct queue_limits gains a 'features' field, where, among other
things, flush and write-cache are enabled. Detect it and use it.

Along the way, the blk_queue_set_write_cache() compat wrapper gets a
little cleanup. Since both flags are always set together, it's now a
single bool. Also the very very ancient version that sets q->flush_flags
directly couldn't actually turn it off, so I've fixed that. Not that we
use it, but still.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #16400

3 months agolinux/zvol_os: tidy and document queue limit/config setup
Rob Norris [Wed, 31 Jul 2024 04:35:48 +0000 (14:35 +1000)]
linux/zvol_os: tidy and document queue limit/config setup

It gets hairier again in Linux 6.11, so I want some actual theory of
operation laid out for next time.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #16400

3 months agoGithub workflow: fix typo in `zloop` artifact
Ameer Hamza [Fri, 9 Aug 2024 23:49:19 +0000 (04:49 +0500)]
Github workflow: fix typo in `zloop` artifact

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #16432

3 months agoconfig: don't force shared linkage on FreeBSD
Rob Norris [Fri, 9 Aug 2024 21:34:04 +0000 (07:34 +1000)]
config: don't force shared linkage on FreeBSD

-shared was hardcoded, so when building with --disable-shared it amounts
to trying to do shared linkage against static libs, which naturally
fails.

The fix is straightforward; just don't hardcode it. libtool will work
out what to do.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16427

3 months agoMake txg_wait_synced conditional in zfsvfs_teardown, for FreeBSD
Alan Somers [Fri, 9 Aug 2024 21:32:59 +0000 (15:32 -0600)]
Make txg_wait_synced conditional in zfsvfs_teardown, for FreeBSD

This applies the same change in #9115 to FreeBSD.  This was actually the
old behavior in FreeBSD 12; it only regressed when FreeBSD support was
added to OpenZFS.  As far as I can tell, the timeline went like this:

* Illumos's zfsvfs_teardown used an unconditional txg_wait_synced
* Illumos added the dirty data check [^4]
* FreeBSD merged in Illumos's conditional check [^3]
* OpenZFS forked from Illumos
* OpenZFS removed the dirty data check in #7795 [^5]
* @mattmacy forked the OpenZFS repo and began to add FreeBSD support
* OpenZFS PR #9115[^1] recreated the same dirty data check that Illumos
  used, in slightly different form.  At this point the OpenZFS repo did
  not yet have multi-OS support.
* Matt Macy merged in FreeBSD support in #8987[^2], but it was based on
  slightly outdated OpenZFS code.

In my local testing, this vastly improves the reboot speed of a server
with a large pool that has 1000 datasets and is resilvering an HDD.

[^1]: https://github.com/openzfs/zfs/pull/9115
[^2]: https://github.com/openzfs/zfs/pull/8987
[^3]: https://github.com/freebsd/freebsd-src/commit/10b9d77bf1ccf2f3affafa6261692cb92cf7e992
[^4]: https://github.com/illumos/illumos-gate/commit/5aaeed5c617553c4cec6328c1f4c19079a5a495a
[^5]: https://github.com/openzfs/zfs/pull/7795

Sponsored by: Axcient
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Alan Somers <asomers@gmail.com>
Closes #16268

3 months agozstream: remove duplicate highbit64 definition
Rob Norris [Fri, 9 Aug 2024 21:31:41 +0000 (07:31 +1000)]
zstream: remove duplicate highbit64 definition

When building a static build (--disable-shared), zstream fails to link
because of the duplicate highbit64() in libzpool/kernel.c. Since they're
identical, and the libzpool one is visible to zstream, we remove
zstream's copy and just use the common one.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16426

3 months agoabd: lift ABD zero scan from zio_compress_data() to abd_cmp_zero()
Rob Norris [Fri, 9 Aug 2024 21:30:26 +0000 (07:30 +1000)]
abd: lift ABD zero scan from zio_compress_data() to abd_cmp_zero()

It's now the caller's responsibility to do special handling for holes if
that's something it wants.

This also makes zio_compress_data() and zio_decompress_data() properly
the inverse of each other.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Jason Lee <jasonlee@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16326

3 months agoUpdating bash completion build file
Brian Atkinson [Thu, 8 Aug 2024 22:39:25 +0000 (18:39 -0400)]
Updating bash completion build file

Commit 46ebd0a updated the build system to make a symbolic link for zpool.
However, this commit did not update the automake file to also add the
symbolic link to the CLEANFILES variable. This is necessary so the link
is removed when running make clean/distclean.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #16422

3 months agocontrib: bash_completion.d: force zpool symlink recreation
Rob Norris [Thu, 8 Aug 2024 22:36:09 +0000 (08:36 +1000)]
contrib: bash_completion.d: force zpool symlink recreation

ln will fail if the target already exists, which causes make to bail
out. Adding -f makes it more "compiler-like", overwriting the target
instead.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Mateusz Piotrowski <0mp@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16423

3 months agoLinux: Make zfs_prune() fair on NUMA systems
Alexander Motin [Thu, 8 Aug 2024 22:33:36 +0000 (18:33 -0400)]
Linux: Make zfs_prune() fair on NUMA systems

Previous code evicted nr_to_scan items from each NUMA node.  This
not only multiplied the eviction by the number of nodes, but could
exhaust the smaller ones, evicting inodes used by active workload
and requiring their immediate recreation.  This patch spreads the
requested eviction between all NUMA nodes proportionally to their
evictable counts, which should be closer to expected LRU logic.
See kernel's super_cache_scan() as a similar logic example.
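
The proportional split can be sketched roughly as follows. This is an illustration of the arithmetic only, not the patch itself; spread_scan_sketch and its parameters are hypothetical names, and the product nr_to_scan * evictable[i] is assumed to fit in 64 bits.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Split nr_to_scan across nodes in proportion to each node's evictable
 * count. Integer division rounds down, so the shares may sum to slightly
 * less than nr_to_scan.
 */
static void
spread_scan_sketch(uint64_t nr_to_scan, const uint64_t *evictable,
    size_t nnodes, uint64_t *share)
{
	uint64_t total = 0;

	for (size_t i = 0; i < nnodes; i++)
		total += evictable[i];

	for (size_t i = 0; i < nnodes; i++) {
		share[i] = (total == 0) ? 0 :
		    nr_to_scan * evictable[i] / total;
	}
}
```

For two nodes holding 100 and 300 evictable items and a request for 40, the shares come out as 10 and 30, instead of 40 from each node.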

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16397

3 months agoSoften pruning threshold on not evictable metadata
Alexander Motin [Thu, 8 Aug 2024 22:26:35 +0000 (18:26 -0400)]
Soften pruning threshold on not evictable metadata

Previous code pruned 10% of dnodes once 3/4 of metadata appeared
unevictable.  On workloads with many millions of dnodes and little
other metadata it creates significant load spikes for many seconds
straight.  This change instead gradually increases pruning as
unevictable metadata grows above 3/4, which may allow it to
stabilize at some level.
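
The exact ramp is not stated in this message. Purely as an illustration of "gradual" versus "all at once", the sketch below assumes a linear ramp from 0 at the 3/4 mark up to a caller-chosen maximum at 100%; the function name, the percentages, and the linear shape are all assumptions.

```c
#include <stdint.h>

/*
 * Hypothetical ramp: returns the percentage of dnodes to prune given the
 * unevictable share of metadata, in percent. At or below 75% nothing is
 * pruned; above it the amount rises linearly up to max_pct at 100%.
 */
static uint64_t
prune_pct_sketch(uint64_t unevictable_pct, uint64_t max_pct)
{
	const uint64_t threshold = 75;

	if (unevictable_pct <= threshold)
		return (0);
	if (unevictable_pct > 100)
		unevictable_pct = 100;
	return (max_pct * (unevictable_pct - threshold) / (100 - threshold));
}
```

Under this assumed shape, crossing the threshold by a few percent triggers only a small prune rather than a fixed 10% burst.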

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16401

3 months agoImprove zfs_blkptr_verify()
Alexander Motin [Thu, 8 Aug 2024 22:25:10 +0000 (18:25 -0400)]
Improve zfs_blkptr_verify()

 - Skip config lock enter/exit for embedded blocks.  They have no
DVAs, so there is nothing to check under the lock.
 - Skip CHECKSUM check and properly check PSIZE for embedded blocks.
 - Add static branch predictions for unlikely conditions.
 - Do not verify DVAs for blocks already in ARC.  ARC hit already
"verified" the first (often the only) DVA, and it is not worth
entering/exiting the config lock for nothing.

Some profiles show me up to 3% of CPU saving from this change.
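
As a small illustration of what a static branch prediction looks like, the sketch below wraps the GCC/Clang builtin __builtin_expect in a hypothetical macro. The macro and function names are made up for this example and do not come from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Branch hint; falls back to a plain expression on other compilers. */
#if defined(__GNUC__) || defined(__clang__)
#define	unlikely_sketch(x)	__builtin_expect(!!(x), 0)
#else
#define	unlikely_sketch(x)	(x)
#endif

/* Hypothetical check: sizes above max_size are invalid and rare. */
static bool
size_is_valid_sketch(uint64_t size, uint64_t max_size)
{
	if (unlikely_sketch(size > max_size))
		return (false);	/* error path, expected to be cold */
	return (true);
}
```

The hint does not change the result of the check; it only tells the compiler which path to lay out as the fast one.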

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16387

3 months agolibzfs.h: Set ZFS_MAXPROPLEN and ZPOOL_MAXPROPLEN to ZAP_MAXVALUELEN
Mateusz Piotrowski [Wed, 5 Jun 2024 09:59:11 +0000 (11:59 +0200)]
libzfs.h: Set ZFS_MAXPROPLEN and ZPOOL_MAXPROPLEN to ZAP_MAXVALUELEN

So far, the values of ZFS_MAXPROPLEN and ZPOOL_MAXPROPLEN were equal to
MAXPATHLEN, which is 1024 on FreeBSD and 4096 on Linux. This wasn't
ideal. Some of the surprising outcomes of this implementation are:

1. When creating a pool user property with zpool-set(8), libzfs makes
   sure that the length of the property's value is less than
   ZFS_MAXPROPLEN. However, the ZFS kernel module does not do that.
   Instead, it checks the length against ZAP_MAXVALUELEN. As a result,
   it is possible to create a property the length of which is going to
   be larger than zpool(8) is ready to read.
2. A pool user property created on Linux is too big to be read on
   FreeBSD.

This change sets both ZFS_MAXPROPLEN and ZPOOL_MAXPROPLEN to
ZAP_MAXVALUELEN, which is 8192 at the moment.
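
The mismatch can be illustrated with the numbers quoted above. The constant and function names below are hypothetical; only the values 1024, 4096 and 8192 come from this message.

```c
#include <stdbool.h>
#include <stddef.h>

/* Values quoted in the commit message; names here are illustrative. */
#define	SKETCH_MAXPATHLEN_FREEBSD	1024
#define	SKETCH_MAXPATHLEN_LINUX		4096
#define	SKETCH_ZAP_MAXVALUELEN		8192

/* A value is accepted when its length is below the given limit. */
static bool
value_fits_sketch(size_t value_len, size_t limit)
{
	return (value_len < limit);
}
```

A 3000-byte value fits under the old Linux limit but not the old FreeBSD one, and a 5000-byte value passes the kernel-side limit while exceeding both old userland limits.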

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Mateusz Piotrowski <0mp@FreeBSD.org>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #16248

3 months agozpoolprops.7: Fix max length of name of user property
Mateusz Piotrowski [Mon, 3 Jun 2024 14:46:46 +0000 (16:46 +0200)]
zpoolprops.7: Fix max length of name of user property

The documentation mentioned that the property name can be 256 characters
long. This was incorrect. The last byte is reserved for NUL, so the
name provided by the operator can be only 255 characters long.
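
The off-by-one is the usual C string accounting, sketched below. The buffer size of 256 is taken from the figure in the text; the macro and function names are hypothetical.

```c
#include <stdbool.h>
#include <string.h>

/* Buffer size assumed from the 256 figure in the text above. */
#define	SKETCH_NAME_BUFLEN	256

/* True when name, plus its terminating NUL, fits in the buffer. */
static bool
name_fits_sketch(const char *name)
{
	return (strlen(name) + 1 <= SKETCH_NAME_BUFLEN);
}
```

A 255-character name fits; a 256-character name does not, because the NUL needs the last byte.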

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Mateusz Piotrowski <0mp@FreeBSD.org>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #16248

3 months agotests: user_property_001_pos: Remove unnecessary evals
Mateusz Piotrowski [Mon, 3 Jun 2024 10:17:22 +0000 (12:17 +0200)]
tests: user_property_001_pos: Remove unnecessary evals

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Mateusz Piotrowski <0mp@FreeBSD.org>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #16248

3 months agotests: user_property: Clarify comments
Mateusz Piotrowski [Fri, 31 May 2024 14:02:06 +0000 (16:02 +0200)]
tests: user_property: Clarify comments

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Mateusz Piotrowski <0mp@FreeBSD.org>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #16248

3 months agoSync AUX label during pool import
Ameer Hamza [Thu, 8 Aug 2024 22:16:46 +0000 (03:16 +0500)]
Sync AUX label during pool import

Spare and l2cache vdev labels are not updated during import. Therefore,
if disk paths are updated between pool export and import, the AUX label
still shows the old paths. This patch syncs the AUX label
during import to show the correct path information.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Umer Saleem <usaleem@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #15817

3 months agoZTS: Add a test to verify that copy_file_range obeys RLIMIT_FSIZE
Mark Johnston [Mon, 5 Aug 2024 15:57:44 +0000 (15:57 +0000)]
ZTS: Add a test to verify that copy_file_range obeys RLIMIT_FSIZE

Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
3 months agoFreeBSD: Fix RLIMIT_FSIZE handling for block cloning
Mark Johnston [Tue, 23 Jul 2024 14:20:46 +0000 (10:20 -0400)]
FreeBSD: Fix RLIMIT_FSIZE handling for block cloning

ZFS implements copy_file_range(2) using block cloning when possible.
This implementation must respect the RLIMIT_FSIZE limit.

zfs_clone_range() already checks the limit, so it is safe to remove this
check in zfs_freebsd_copy_file_range().  Moreover, the removed check
produces false positives: the length passed to copy_file_range(2) may be
larger than the input file size; as the man page notes, "for best
performance, call copy_file_range() with the largest len value
possible."  In particular, some existing code passes SSIZE_MAX there.

The check in zfs_clone_range() clamps the length to the input file's
size before checking, but the removed check uses the caller supplied
length, so something like

$ echo a > /tmp/foo
$ limits -f 1024 cat /tmp/foo > /tmp/bar

fails because FreeBSD's cat(1) uses copy_file_range(2) in the manner
described above.
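
The difference between the two checks can be sketched as follows. This is a simplified model, not the code from zfs_clone_range() or zfs_freebsd_copy_file_range(); the function names, the parameter set, and the use of plain uint64_t offsets are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Check the caller-supplied length as-is (the false-positive case). */
static bool
exceeds_limit_unclamped(uint64_t outoff, uint64_t len, uint64_t limit)
{
	/* Written as a subtraction so outoff + len cannot overflow. */
	return (outoff > limit || len > limit - outoff);
}

/* Clamp the length to what the input file can supply, then check. */
static bool
exceeds_limit_clamped(uint64_t inoff, uint64_t outoff, uint64_t len,
    uint64_t insize, uint64_t limit)
{
	uint64_t avail = (inoff < insize) ? insize - inoff : 0;

	if (len > avail)
		len = avail;
	return (exceeds_limit_unclamped(outoff, len, limit));
}
```

With a 2-byte input file and a very large requested length, the unclamped check reports a violation while the clamped check does not.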

Reported-by: Philip Paeps <philip@FreeBSD.org>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
3 months agoFix memory corruption during parallel zpool import with -o cachefile (#16419)
Alan Somers [Wed, 7 Aug 2024 20:44:55 +0000 (14:44 -0600)]
Fix memory corruption during parallel zpool import with -o cachefile (#16419)

When importing multiple pools, the nvlist of properties given with "-o"
is shared amongst the several threads.  So no thread should modify it.
Previously, in the course of validating the cachefile property, the
zpool_valid_proplist function would temporarily modify the value, and
then change it back.  Now it will operate on a clone of the value.
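
The general pattern, working on a private copy instead of temporarily editing shared data, can be sketched as below. This is not the code from zpool_valid_proplist; it assumes, for illustration only, that validation needs the directory part of a path, and dirname_copy_sketch is a hypothetical name.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Return a heap copy of the directory part of path (everything before the
 * last '/'), or NULL if there is no '/' or allocation fails. The input
 * string is never written to, so it is safe to share between threads.
 * The caller frees the result.
 */
static char *
dirname_copy_sketch(const char *path)
{
	const char *slash = strrchr(path, '/');
	size_t len;
	char *copy;

	if (slash == NULL)
		return (NULL);
	len = (size_t)(slash - path);
	copy = malloc(len + 1);
	if (copy == NULL)
		return (NULL);
	memcpy(copy, path, len);
	copy[len] = '\0';
	return (copy);
}
```

Overwriting the '/' in place and restoring it afterwards gives the same answer in a single thread, but another thread reading the string in between sees a truncated value.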

Sponsored by:   Axcient
Fixes #16405
Signed-off-by: Alan Somers <asomers@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
3 months agoZTS: small fix for SEEK_DATA/SEEK_HOLE tests (#16413)
Tino Reichardt [Wed, 7 Aug 2024 16:52:37 +0000 (18:52 +0200)]
ZTS: small fix for SEEK_DATA/SEEK_HOLE tests (#16413)

Some libcs, like uClibc, lack the proper definitions of SEEK_DATA
and SEEK_HOLE. Since we have only two files in ZTS which use
these definitions, let's define them by hand:

```
#ifndef SEEK_DATA
#define SEEK_DATA 3
#endif
#ifndef SEEK_HOLE
#define SEEK_HOLE 4
#endif
```

There should be no failures, because:
- FreeBSD has support for SEEK_DATA/SEEK_HOLE since FreeBSD 8
- Linux has it since Linux 3.1
- the libc will submit the parameters unchanged to the kernel

Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
3 months agoFix the names of some FreeBSD sysctls in include/tunables.cfg (#16395)
Allan Jude [Tue, 6 Aug 2024 23:36:55 +0000 (19:36 -0400)]
Fix the names of some FreeBSD sysctls in include/tunables.cfg (#16395)

Sponsored-by: Klara, Inc.
Signed-off-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
3 months agoZTS: fix io_uring test on RHEL 9 variants (#16411)
Tino Reichardt [Tue, 6 Aug 2024 23:30:11 +0000 (01:30 +0200)]
ZTS: fix io_uring test on RHEL 9 variants (#16411)

Simplify the test, by using the variable "$PLATFORM_ID" in favor
of "$REDHAT_SUPPORT_PRODUCT_VERSION".

Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
3 months agoJSON: Fix class values for mirrored special vdevs
Tony Hutter [Thu, 11 Jul 2024 22:35:40 +0000 (15:35 -0700)]
JSON: Fix class values for mirrored special vdevs

This fixes things so mirrored special vdevs report themselves as
"class=special" rather than "class=normal".

This happens due to the way the vdev nvlists are constructed:

mirrored special devices - The 'mirror' vdev has allocation bias as
"special" and its leaf vdevs are "normal"

single or RAID0 special devices - Leaf vdevs have allocation bias as
"special".

This commit adds in code to check if a leaf's parent is a "special"
vdev to see if it should also report "special".
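
The parent check can be sketched with a simplified tree node. The struct and function names below are hypothetical and do not reflect the real nvlist layout; a NULL bias stands for a vdev with no allocation bias set.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified vdev node; not the real nvlist layout. */
struct vdev_sketch {
	const char *bias;		/* allocation bias, or NULL if unset */
	const struct vdev_sketch *parent;
};

static int
is_special(const struct vdev_sketch *v)
{
	return (v != NULL && v->bias != NULL &&
	    strcmp(v->bias, "special") == 0);
}

/* Report "special" if the leaf or its parent carries the special bias. */
static const char *
leaf_class_sketch(const struct vdev_sketch *leaf)
{
	if (is_special(leaf) || is_special(leaf->parent))
		return ("special");
	return ("normal");
}
```

A leaf under a special mirror and a standalone special leaf both report "special"; a leaf with no bias anywhere reports "normal".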

Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Umer Saleem <usaleem@ixsystems.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #16217

3 months agoZTS: Add zfs/zpool JSON sanity tests
Tony Hutter [Wed, 10 Jul 2024 22:27:33 +0000 (15:27 -0700)]
ZTS: Add zfs/zpool JSON sanity tests

Run basic JSON validation tests on the new `zfs|zpool -j` output.

Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Umer Saleem <usaleem@ixsystems.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #16217

3 months agoJSON output support for zpool status
Umer Saleem [Thu, 9 May 2024 11:54:47 +0000 (16:54 +0500)]
JSON output support for zpool status

This commit adds support for zpool status command to display status
of ZFS pools in JSON format using '-j' option. Status information is
collected in nvlist which is later dumped on stdout in JSON format.
Existing options for zpool status work with '-j' flag. man page for
zpool status is updated accordingly.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #16217

3 months agoJSON output support for zpool list
Umer Saleem [Thu, 25 Apr 2024 12:59:41 +0000 (17:59 +0500)]
JSON output support for zpool list

This commit adds support for zpool list command to output the list of
ZFS pools in JSON format using '-j' option. Information about available
pools is collected in nvlist which is later printed to stdout in JSON
format.

Existing options for zpool list command work with '-j' flag. man page for
zpool list is updated accordingly.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #16217

3 months agoJSON output support for zpool get
Umer Saleem [Wed, 24 Apr 2024 14:58:17 +0000 (19:58 +0500)]
JSON output support for zpool get

This commit adds support for zpool get command to output the list of
properties for ZFS Pools and VDEVS in JSON format using '-j' option.
Man page for zpool get is updated to include '-j' option.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #16217

3 months agoJSON output support for zpool version
Umer Saleem [Mon, 22 Apr 2024 09:22:12 +0000 (14:22 +0500)]
JSON output support for zpool version

This commit adds support for zpool version to output in JSON format
using '-j' option. Userland and kernel module versions are collected in
nvlist which is later displayed in JSON format. man page for zpool is
updated.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #16217

3 months agoJSON output support zfs mount
Umer Saleem [Thu, 18 Apr 2024 06:41:32 +0000 (11:41 +0500)]
JSON output support zfs mount

This commit adds support for zfs mount to display mounted file systems
in JSON format using '-j' option. Data is collected in nvlist which is
printed in JSON format. man page for zfs mount is updated accordingly.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #16217

3 months agoJSON output support for zfs list
Umer Saleem [Thu, 18 Apr 2024 17:09:15 +0000 (22:09 +0500)]
JSON output support for zfs list

This commit adds support for JSON output for zfs list using '-j' option.
Information is collected in JSON format which is later printed in JSON
format. Existing options for zfs list also work with '-j'. man pages are
updated with relevant information.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #16217

3 months agoJSON output support for zfs version and zfs get
Umer Saleem [Fri, 5 Apr 2024 16:02:30 +0000 (21:02 +0500)]
JSON output support for zfs version and zfs get

This commit adds support for JSON output for zfs version and zfs get
commands. '-j' flag can be used to get output in JSON format.

Information is collected in nvlist objects which is later printed in
JSON format. Existing options that work for zfs get and zfs version
also work with '-j' flag.

man pages for zfs get and zfs version are updated accordingly.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #16217

3 months agoZTS: remove skips for zvol_misc tests
Rob Norris [Thu, 18 Jul 2024 09:53:35 +0000 (19:53 +1000)]
ZTS: remove skips for zvol_misc tests

Last commit should fix the underlying problem, so these should be
passing reliably again.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16364

3 months agozvol: ensure device minors are properly cleaned up
Rob Norris [Thu, 18 Jul 2024 03:24:05 +0000 (13:24 +1000)]
zvol: ensure device minors are properly cleaned up

Currently, if a minor is in use when we try to remove it, we'll skip it
and never come back to it again. Since the zvol state is hung off the
minor in the kernel, this can get us into weird situations if something
tries to use it after the removal fails. It's even worse at pool export,
as there's now a vestigial zvol state with no pool under it. It's
weirder again if the pool is subsequently reimported, as the zvol code
(reasonably) assumes the zvol state has been properly setup, when it's
actually left over from the previous import of the pool.

This commit attempts to tackle that by setting a flag on the zvol if its
minor can't be removed, and then checking that flag when a request is
made and rejecting it, thus stopping new work coming in.

The flag also causes a condvar to be signaled when the last client
finishes. For the case where a single minor is being removed (eg
changing volmode), it will wait for this signal before proceeding.
Meanwhile, when removing all minors, a background task is created for
each minor that couldn't be removed on the spot, and those tasks then
wake and clean up.

Since any new tasks are queued on to the pool's spa_zvol_taskq,
spa_export_common() will continue to wait at export until all minors are
removed.
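
The flag-based flow can be modelled in a few lines. This is a single-threaded sketch with no locks, condvars or taskqs, so it only shows the state transitions; the struct, the function names, and the choice of ENXIO as the rejection code are assumptions.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, single-threaded model of the state described above. */
struct zvol_sketch {
	bool removing;		/* set when minor removal was deferred */
	unsigned open_count;	/* clients currently using the zvol */
};

/* New work is rejected once removal is pending. */
static int
zvol_request_sketch(struct zvol_sketch *zv)
{
	if (zv->removing)
		return (ENXIO);
	zv->open_count++;
	return (0);
}

/* Try to remove; if busy, set the flag and report removal as deferred. */
static bool
zvol_try_remove_sketch(struct zvol_sketch *zv)
{
	if (zv->open_count > 0) {
		zv->removing = true;
		return (false);
	}
	return (true);
}

/*
 * A client finishes. Returns true when this was the last client and a
 * removal is pending, which is where real code would signal waiters.
 */
static bool
zvol_release_sketch(struct zvol_sketch *zv)
{
	zv->open_count--;
	return (zv->removing && zv->open_count == 0);
}
```

Once the last client releases, a second removal attempt succeeds, which models the deferred cleanup described above.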

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #14872
Closes #16364

3 months agolinux/zvol_os: fix SET_ERROR with negative return codes
Rob Norris [Thu, 18 Jul 2024 03:13:44 +0000 (13:13 +1000)]
linux/zvol_os: fix SET_ERROR with negative return codes

SET_ERROR is our facility for tracking errors internally. The negation
is to match what the kernel expects from us. Thus, the negation
should happen outside of the SET_ERROR.
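
A minimal sketch of the ordering, using a hypothetical stand-in for the macro that simply records the value it is given. SET_ERROR_SKETCH, last_traced_error and open_busy_sketch are illustrative names, not the real implementation.

```c
#include <errno.h>

/* Hypothetical stand-in: record the error for tracing, then yield it. */
static int last_traced_error;
#define	SET_ERROR_SKETCH(err)	(last_traced_error = (err))

/* Negate outside the macro: the tracer sees EBUSY, the kernel gets -EBUSY. */
static int
open_busy_sketch(void)
{
	return (-SET_ERROR_SKETCH(EBUSY));
}
```

Writing SET_ERROR_SKETCH(-EBUSY) instead would hand the tracer a negative value, which no longer matches the positive error codes used internally.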

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16364

3 months agozvol_impl: document and tidy flags
Rob Norris [Thu, 18 Jul 2024 02:37:43 +0000 (12:37 +1000)]
zvol_impl: document and tidy flags

ZVOL_DUMPIFIED is a vestigial Solaris leftover, and not used anywhere.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16364

3 months agoFreeBSD: remove support for FreeBSD < 13.0-RELEASE (#16372)
Rob Norris [Mon, 5 Aug 2024 23:56:45 +0000 (09:56 +1000)]
FreeBSD: remove support for FreeBSD < 13.0-RELEASE (#16372)

This includes the last 12.x release (now EOL) and 13.0 development
versions (<1300139).

Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
3 months agoZTS: fix zfs_copies_006_pos test on Ubuntu 20.04 (#16409)
Tino Reichardt [Mon, 5 Aug 2024 23:18:07 +0000 (01:18 +0200)]
ZTS: fix zfs_copies_006_pos test on Ubuntu 20.04 (#16409)

This test was failing before:
- FAIL cli_root/zfs_copies/zfs_copies_006_pos (expected PASS)

Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
3 months agoZTS: fix history_007_pos test on Ubuntu 24.04 (#16410)
Tino Reichardt [Mon, 5 Aug 2024 23:17:23 +0000 (01:17 +0200)]
ZTS: fix history_007_pos test on Ubuntu 24.04 (#16410)

The timezone "US/Mountain" isn't supported on newer Linux versions.
Using the correct timezone "America/Denver", as is done on FreeBSD,
will fix this. Older Linux distros should also behave correctly with this.

Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
3 months agocontrib: link zpool to zfs in bash-completion (#16376)
Shengqi Chen [Mon, 5 Aug 2024 16:44:10 +0000 (00:44 +0800)]
contrib: link zpool to zfs in bash-completion (#16376)

Currently, users won't have completion for the `zpool` command until they
trigger completion of `zfs` first. This patch adds a link to `zfs`,
so that either command can be used to initialize the completion.

Fixes: #16320
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
3 months agoOnce more refactor arc_summary output
Alexander Motin [Thu, 1 Aug 2024 19:27:29 +0000 (15:27 -0400)]
Once more refactor arc_summary output

Before this, arc_summary was not reporting any information about
evictable ARC memory.  As a result, I found it difficult to analyze the
behavior of dnode-heavy workloads with lots of unevictable buffers.

This change adds evictable sizes into states breakdown section.
While there, add/refactor sections for global memory statistics,
for ARC breakdown between different structures, for data/metadata.
Add information about memory reclamation requests.

While there, refactor and polish graph mode, neglected for a while.

Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Umer Saleem <usaleem@ixsystems.com>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
3 months agoFreeBSD: Add missing memory reclamation accounting
Alexander Motin [Thu, 1 Aug 2024 19:25:42 +0000 (15:25 -0400)]
FreeBSD: Add missing memory reclamation accounting

Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Umer Saleem <usaleem@ixsystems.com>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
3 months agoBlock cloning conditionally destroy ARC buffer
Brian Atkinson [Fri, 2 Aug 2024 01:22:43 +0000 (21:22 -0400)]
Block cloning conditionally destroy ARC buffer

dmu_buf_will_clone() calls arc_buf_destroy() if there is an associated
ARC buffer with the dbuf. However, this can only be done conditionally.
If the previous dirty record's dr_data points at db_buf, then
destroying it can lead to a NULL pointer dereference when syncing out the
previous dirty record.

This updates dmu_buf_will_clone() to only call arc_buf_destroy() if the
previous dirty record's dr_data is not pointing to db_buf. The block
clone will still set the dbuf's db_buf and db_data to NULL, but this will
not cause any issues as any previous dirty record dr_data will still be
pointing at the ARC buffer.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #16337

3 months agoFix sa.c to build on FreeBSD again. (#16403)
Tino Reichardt [Thu, 1 Aug 2024 20:04:08 +0000 (22:04 +0200)]
Fix sa.c to build on FreeBSD again. (#16403)

Fix multiple build errors on FreeBSD.

The main reason is that the variable 'dxattr_obj' is used
uninitialized within the start of the 'out' label.

Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
3 months agoFix sa_add_projid to lookup and update SA_ZPL_DXATTR (avoid DXATTR loss) (#16288)
Jitendra Patidar [Thu, 1 Aug 2024 01:41:49 +0000 (07:11 +0530)]
Fix sa_add_projid to lookup and update SA_ZPL_DXATTR (avoid DXATTR loss) (#16288)

sa_add_projid() gets called via zfs_setattr() to set the project ID
on old files/dirs that were created before upgrading to the project quota
feature. This function looks up all possible SAs and updates them
all together, along with the project ID, at the needed fixed offset. But it
is missing the lookup and update of SA_ZPL_DXATTR, so it effectively loses
SA_ZPL_DXATTR.

Closes #16287
Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
3 months agoFix zdb_dump_block for little endian (#16310)
Chunwei Chen [Thu, 1 Aug 2024 01:33:39 +0000 (18:33 -0700)]
Fix zdb_dump_block for little endian (#16310)

The endian macros were changed but zdb_dump_block wasn't updated
accordingly.

Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Allan Jude <allan@klarasystems.com>
3 months agozfs: add bounds checking to zil_parse (#16308)
c1ick [Thu, 1 Aug 2024 00:17:04 +0000 (08:17 +0800)]
zfs: add bounds checking to zil_parse (#16308)

Make sure log records don't stray beyond the valid memory region.

There is a lack of verification of the space occupied by fixed members
of lr_t in the zil_parse.

We can create a crafted image to trigger an out of bounds read by
following these steps:
    1) Do some file operations and reboot to simulate abnormal exit
       without umount
    2) zil_chain.zc_nused: 0x1000
    3) First lr_t
       lr_t.lrc_txtype: 0x0
       lr_t.lrc_reclen: 0x1000-0xb8-0x1
       lr_t.lrc_txg: 0x0
       lr_t.lrc_seq: 0x1
    4) Update checksum in zil_chain.zc_eck

Fix:
Add some checks to make sure the remaining bytes are large enough to
hold a log record.
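
The kind of check described can be sketched on a simplified record layout. The struct below is a hypothetical two-field header, not the real lr_t, and walk_records_sketch is not the zil_parse() code; it only shows the two bounds tests: the fixed header must fit in the remaining bytes, and the claimed record length must stay within them.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed record header; the real lr_t has more fields. */
struct rec_hdr_sketch {
	uint64_t txtype;
	uint64_t reclen;	/* total record length, header included */
};

/*
 * Walk records packed in buf[0..nused). Returns false as soon as a record
 * header or body would extend past nused; otherwise stores the number of
 * records seen in *count and returns true.
 */
static bool
walk_records_sketch(const uint8_t *buf, size_t nused, size_t *count)
{
	size_t off = 0;

	*count = 0;
	while (off < nused) {
		struct rec_hdr_sketch hdr;
		size_t remaining = nused - off;

		/* The fixed header itself must fit in what is left. */
		if (remaining < sizeof (hdr))
			return (false);
		memcpy(&hdr, buf + off, sizeof (hdr));

		/* The claimed length must cover the header, in bounds. */
		if (hdr.reclen < sizeof (hdr) || hdr.reclen > remaining)
			return (false);

		off += (size_t)hdr.reclen;
		(*count)++;
	}
	return (true);
}
```

A first record whose length leaves only one trailing byte, as in the crafted image above, is caught by the first test on the next iteration instead of being read as a header.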

Signed-off-by: XDTG <click1799@163.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
3 months agoLinux: Report reclaimable memory to kernel as such (#16385)
Alexander Motin [Tue, 30 Jul 2024 18:40:47 +0000 (14:40 -0400)]
Linux: Report reclaimable memory to kernel as such (#16385)

Linux provides SLAB_RECLAIM_ACCOUNT and __GFP_RECLAIMABLE flags to
mark memory allocations that can be freed via shrinker calls.  It
should allow the kernel to tune and group such allocations for lower
memory fragmentation and better reclamation under pressure.

This patch marks as reclaimable most of ARC memory, directly
evictable via ZFS shrinker, plus also dnode/znode/sa memory,
indirectly evictable via kernel's superblock shrinker.

Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
3 months agodnode: allow storage class to be overridden by object type
Rob Norris [Tue, 27 Jun 2023 02:50:18 +0000 (12:50 +1000)]
dnode: allow storage class to be overridden by object type

spa_preferred_class() selects a storage class based on (among other
things) the DMU object type. This only works for old-style object types
that match only one specific kind of thing. For DMU_OTN_ types we need
another way to signal the storage class.

This commit allows the object type to be overridden in the IO policy for
the purposes of choosing a storage class. It then adds the ability to
set the storage type on a dnode hold, such that all writes generated
under that hold will get it.

This method has two shortcomings:

- it would be better if we could "name" a set of storage class
  preferences rather than it being implied by the object type.
- it would be better if this info were stored in the dnode on disk.

In the absence of those things, this seems like the smallest possible
change.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15894

3 months agospa_preferred_class: pass the entire zio
Rob Norris [Tue, 27 Jun 2023 01:03:29 +0000 (11:03 +1000)]
spa_preferred_class: pass the entire zio

Rather than picking out specific values out of the properties, just pass
the entire zio in, to make it easier in the future to use more of that
info to decide on the storage class.

I would rather have just passed io_prop in, but having spa.h include
zio.h gets a bit tricky.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15894

3 months agoSkip dnode handles use when not needed
Alexander Motin [Mon, 22 Jul 2024 01:13:42 +0000 (21:13 -0400)]
Skip dnode handles use when not needed

Neither FreeBSD nor Linux currently implement kmem_cache_set_move(),
which means dnode_move() is never called.  In such a situation, use of
dnode handles with respective locking to access dnode from dbuf is
a waste of time for no benefit.

This patch implements optional simplified code for such platforms,
saving at least 3 dnode lock/dereference/unlock per dbuf life cycle.
Originally I hoped to drop the handles completely to save memory,
but they are still used in dnodes allocation code, so left for now.

Before this change in CPU profiles of some workloads I saw 4-20% of
CPU time spent in zrl_add_impl()/zrl_remove(), which are gone now.

Reviewed-by: Rob Wing <rob.wing@klarasystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by:   iXsystems, Inc.
Closes #16374

3 months agoCleanup DB_DNODE() macros usage
Alexander Motin [Mon, 22 Jul 2024 01:04:38 +0000 (21:04 -0400)]
Cleanup DB_DNODE() macros usage

 - Use the macros in a few places where they were missed.
 - Reduce scope of DB_DNODE_ENTER/EXIT() and inline some DB_DNODE()
uses to make it more obvious what exactly is protected there and
make unprotected accesses by mistake more difficult.
 - Make use of zrl_owner().

Reviewed-by: Rob Wing <rob.wing@klarasystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16374

3 months agoddt: add support for prefetching tables into the ARC
Allan Jude [Fri, 26 Jul 2024 16:16:18 +0000 (12:16 -0400)]
ddt: add support for prefetching tables into the ARC

This change adds a new `zpool prefetch -t ddt $pool` command which
causes a pool's DDT to be loaded into the ARC. The primary goal is to
remove the need to "warm" a pool's cache before deduplication stops
slowing write performance. It may also provide a way to reload portions
of a DDT if they have been flushed due to inactivity.

Sponsored-by: iXsystems, Inc.
Sponsored-by: Catalogics, Inc.
Sponsored-by: Klara, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Will Andrews <will.andrews@klarasystems.com>
Signed-off-by: Fred Weigel <fred.weigel@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Co-authored-by: Will Andrews <will.andrews@klarasystems.com>
Co-authored-by: Don Brady <don.brady@klarasystems.com>
Closes #15890

3 months agoFix ZDB to dump projid for projectquota enabled (#16291)
Jitendra Patidar [Fri, 26 Jul 2024 00:18:11 +0000 (05:48 +0530)]
Fix ZDB to dump projid for projectquota enabled (#16291)

ZDB is supposed to dump "projid" via dump_znode(), when projectquota
is enabled.
-----------
static void
dump_znode(objset_t *os, uint64_t object, void *data, size_t size)
{
...
	if (dmu_objset_projectquota_enabled(os) && (pflags & ZFS_PROJID)) {
		uint64_t projid;

		if (sa_lookup(hdl, sa_attr_table[ZPL_PROJID], &projid,
		    sizeof (uint64_t)) == 0)
			(void) printf("\tprojid %llu\n", (u_longlong_t)projid);
	}
...
}
----------
But it's not dumping "projid", even when project quota is enabled.

dmu_objset_projectquota_enabled() does the following 3 checks:
----------
boolean_t
dmu_objset_projectquota_enabled(objset_t *os)
{
	return (file_cbs[os->os_phys->os_type] != NULL &&
	    DMU_PROJECTUSED_DNODE(os) != NULL &&
	    spa_feature_is_enabled(os->os_spa,
	    SPA_FEATURE_PROJECT_QUOTA));
}
----------
It fails on the file_cbs[] check. file_cbs[] is initialised via
dmu_objset_register_type(), which is done for the kernel via zfs_init()
but not for ZDB.

Register a dummy callback handle for the DMU_OST_ZFS type in the
ZDB main() function so that projid is dumped when projectquota is
enabled.
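
The effect of the missing registration can be modelled with a small
stand-alone sketch. All `sketch_*` names, the enum values and the table
size are hypothetical; only the shape of the three-part check follows
the function quoted above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the objset types and the callback table. */
enum { SKETCH_OST_ZFS = 2, SKETCH_OST_NUMTYPES = 5 };
typedef void (*sketch_cb_t)(void);

static sketch_cb_t sketch_file_cbs[SKETCH_OST_NUMTYPES];

/* Models the registration that the kernel path performs at init time. */
static void
sketch_register_type(int ost, sketch_cb_t cb)
{
	sketch_file_cbs[ost] = cb;
}

/* Models the three-part check: callback, dnode, pool feature. */
static bool
sketch_projectquota_enabled(int ost, bool has_projectused_dnode,
    bool feature_enabled)
{
	return (sketch_file_cbs[ost] != NULL &&
	    has_projectused_dnode && feature_enabled);
}

/* A do-nothing callback, enough to make the first check pass. */
static void
sketch_dummy_cb(void)
{
}
```

Until `sketch_register_type()` is called, the check returns false even
when the dnode and the feature are both present, which mirrors the
behaviour described for ZDB.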

Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com>
Closes #16290
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
3 months agozil: add stats for commit failure/fallback (#16315)
Rob Norris [Thu, 25 Jul 2024 23:53:59 +0000 (09:53 +1000)]
zil: add stats for commit failure/fallback (#16315)

There's no good way to tell when a ZIL commit fails and falls back to a
transaction sync, other than perhaps a throughput drop. This adds
counters so we can see when it happens and why.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
3 months agoReplace goo.gl style link (#16373)
Alexander Motin [Thu, 25 Jul 2024 18:00:32 +0000 (14:00 -0400)]
Replace goo.gl style link (#16373)

That URL shortening scheme should stop working soon [1], while we
don't really need it here.

1. https://developers.googleblog.com/en/google-url-shortener-links-will-no-longer-be-available/

Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
3 months agoSeveral improvements to ARC shrinking (#16197)
Alexander Motin [Thu, 25 Jul 2024 17:31:14 +0000 (13:31 -0400)]
Several improvements to ARC shrinking (#16197)

 - When receiving a memory pressure signal from the OS, be more strict
about trying to free some memory.  Otherwise the kernel may come again
and request much more.  Return as the result how much arc_c was
actually reduced due to this request, which may be less than requested.
 - On Linux, when receiving direct reclaim from some file system
(that may be ZFS), instead of ignoring the request completely, just
shrink the ARC, but do not wait for eviction.  Waiting there may
cause a deadlock.  Ignoring it as before may put extra pressure on
other caches and/or swap, and cause OOM if nothing helps.  Not
waiting may result in more ARC being evicted later, and may be too
late if the OOM killer activates right now, but I hope it is better
than doing nothing at all.
 - On Linux, set arc_no_grow before waiting for reclaim, not after,
or the ARC may grow back while we are waiting.
 - On Linux, add a new parameter zfs_arc_shrinker_seeks to balance
ARC eviction cost relative to page cache and other subsystems.
 - Slightly update the Linux arc_set_sys_free() math for new kernels.

Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
3 months agoddt: dedup table quota enforcement
Allan Jude [Thu, 25 Jul 2024 16:47:36 +0000 (12:47 -0400)]
ddt: dedup table quota enforcement

This adds two new pool properties:
- dedup_table_size, the total size of all DDTs on the pool; and
- dedup_table_quota, the maximum possible size of all DDTs in the pool

When set, the quota is enforced by checking when a new entry is about
to be created. If the pool is over its dedup quota, the entry won't be
created, and the corresponding write will be converted to a regular
non-dedup write. Note that existing entries can be updated (i.e. their
refcounts changed), as that reuses the space rather than requiring more.

dedup_table_quota can be set to 'auto', which will set it based on the
size of the devices backing the "dedup" allocation device. This makes it
possible to limit the DDTs to the size of a dedup vdev only, such that
when the device fills, no new blocks are deduplicated.
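
The enforcement rule described above can be sketched as a small decision
function. This is an illustration only: the `sketch_*` names are
hypothetical, a quota of 0 is assumed to mean "no quota", and a table at
or above the quota is assumed to count as "over".

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum {
	SKETCH_WRITE_DEDUP,	/* create or update a DDT entry */
	SKETCH_WRITE_REGULAR	/* fall back to a normal non-dedup write */
} sketch_write_t;

/*
 * Decide how a dedup-eligible write is handled.
 * table_size:   current total size of all DDTs (bytes)
 * quota:        configured limit in bytes; 0 is assumed to mean unset
 * entry_exists: true if the block already has a DDT entry
 */
static sketch_write_t
sketch_ddt_write_kind(uint64_t table_size, uint64_t quota, bool entry_exists)
{
	/* Updating an existing entry reuses space, so it is always allowed. */
	if (entry_exists)
		return (SKETCH_WRITE_DEDUP);

	/* A new entry is refused once the table has reached the quota. */
	if (quota != 0 && table_size >= quota)
		return (SKETCH_WRITE_REGULAR);

	return (SKETCH_WRITE_DEDUP);
}
```

The check runs only on the new-entry path, so blocks that are already
deduplicated keep gaining and losing references after the quota is hit.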

Sponsored-by: iXsystems, Inc.
Sponsored-By: Klara Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Co-authored-by: Don Brady <don.brady@klarasystems.com>
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Co-authored-by: Sean Eric Fagan <sean.fagan@klarasystems.com>
Closes #15889

3 months agoZTS: Make do_vol_test() more deterministic (#16379)
Alexander Motin [Wed, 24 Jul 2024 16:33:30 +0000 (12:33 -0400)]
ZTS: Make do_vol_test() more deterministic (#16379)

 - Explicitly disable compression since mkfile uses a zero buffer.
 - Explicitly sync file systems instead of waiting for timeout.

Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
4 months agoLinux 6.9: Fix UBSAN errors in sa.c (#16380)
Tony Hutter [Wed, 24 Jul 2024 00:13:04 +0000 (17:13 -0700)]
Linux 6.9: Fix UBSAN errors in sa.c (#16380)

This is a follow-on to 156a64161b4f9da35f2e0484106173344cf78317
that ignores UBSAN errors in sa.c.

Thank you @thwalker3 for the fix.

Original-patch-by: @thwalker3
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #16278
Closes #16330
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
4 months agoAdd support for multiple lines to the sharenfs property for FreeBSD (#16338)
rmacklem [Tue, 23 Jul 2024 23:38:19 +0000 (16:38 -0700)]
Add support for multiple lines to the sharenfs property for FreeBSD (#16338)

Bugzilla PR#147881 has been requesting this for a long time
(14 years!). It extends the syntax of the ZFS sharenfs
property (for FreeBSD only) to allow multiple sets of
options for different hosts/nets, separated by ';'s.
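
Splitting such a property value into its option sets can be sketched as
below. The helper names are hypothetical and this is not the code from
the change; it only illustrates the ';' separator described above.

```c
#include <stddef.h>
#include <string.h>

/* Count the ';'-separated option sets in a property value. */
static size_t
sketch_count_option_sets(const char *value)
{
	size_t n = 1;

	if (value == NULL || *value == '\0')
		return (0);
	for (const char *p = value; *p != '\0'; p++) {
		if (*p == ';')
			n++;
	}
	return (n);
}

/*
 * Copy the option set at position `index` (0-based) into buf.
 * `value` must be a non-NULL string.
 * Returns 0 on success, -1 if there is no such set or buf is too small.
 */
static int
sketch_get_option_set(const char *value, size_t index, char *buf,
    size_t buflen)
{
	const char *start = value;

	/* Skip `index` separators to reach the wanted set. */
	while (index > 0) {
		start = strchr(start, ';');
		if (start == NULL)
			return (-1);
		start++;
		index--;
	}

	size_t len = strcspn(start, ";");
	if (len >= buflen)
		return (-1);
	memcpy(buf, start, len);
	buf[len] = '\0';
	return (0);
}
```

For an illustrative value such as "-ro;-maproot=root", the first set is
"-ro" and the second is "-maproot=root"; each would then become its own
line of export options.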

Signed-off-by: Rick Macklem <rmacklem@FreeBSD.org>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
4 months agoAdd some missing vdev properties (#16346)
Don Brady [Tue, 23 Jul 2024 23:34:09 +0000 (17:34 -0600)]
Add some missing vdev properties (#16346)

Sponsored-by: Klara, Inc.
Sponsored-By: Wasabi Technology, Inc.
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Co-authored-by: Don Brady <don.brady@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
4 months agoAUTHORS: refresh with recent new contributors (#16362)
Rob Norris [Tue, 23 Jul 2024 18:47:04 +0000 (04:47 +1000)]
AUTHORS: refresh with recent new contributors (#16362)

Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: George Melikov <mail@gmelikov.ru>
4 months agoFix long_free_dirty accounting for small files (#16264)
Chunwei Chen [Tue, 23 Jul 2024 18:34:19 +0000 (11:34 -0700)]
Fix long_free_dirty accounting for small files (#16264)

For files smaller than recordsize, it's most likely that they don't have
L1 blocks. However, the current calculation will always return at least
1 L1 block.

In this change, we check the dnode level to figure out whether it has
L1 blocks, and return 0 if it doesn't. This will reduce the chance of
unnecessary throttling when deleting a large number of small files.
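
A minimal sketch of the corrected estimate, under assumptions: the
function name and parameters are hypothetical, and a level count of 1 is
taken to mean that the dnode's block pointers reference data blocks
directly, so there are no L1 indirect blocks to account for.

```c
#include <stdint.h>

/*
 * Estimate how many L1 indirect blocks cover `length` bytes.
 * nlevels: number of block levels in the dnode (1 means no indirects)
 * l1_span: bytes of file data covered by one L1 block; must be nonzero
 */
static uint64_t
sketch_l1_blocks(unsigned nlevels, uint64_t length, uint64_t l1_span)
{
	/* The fix: a dnode without indirect levels has no L1 blocks. */
	if (nlevels <= 1)
		return (0);

	/* Round up without overflowing for very large lengths. */
	uint64_t blocks = length / l1_span + (length % l1_span != 0);

	/* A dnode that does have indirect levels has at least one L1 block. */
	return (blocks == 0 ? 1 : blocks);
}
```

With the early return, freeing many single-block files contributes
nothing to the dirty accounting in this model, which is what avoids the
unnecessary throttling.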

Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Co-authored-by: Chunwei Chen <david.chen@nutanix.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
4 months agoZTS: Change cp_stress to fit timings (#16369)
Tino Reichardt [Mon, 22 Jul 2024 21:03:22 +0000 (23:03 +0200)]
ZTS: Change cp_stress to fit timings (#16369)

cp_stress is getting killed on the new QEMU-based GitHub runners
we're developing. The problem is that the Linux-based runners
have to do 10 RUNS, whereas the FreeBSD-based runners only need 3
RUNS to succeed.

This patch removes this different handling of Linux and FreeBSD.
The cp_stress test is running fine in around 2 minutes now.

Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
4 months agozdb: fix BRT dump (#16335)
Rob Norris [Thu, 18 Jul 2024 17:51:27 +0000 (03:51 +1000)]
zdb: fix BRT dump (#16335)

BRT refcounts are stored as eight uint8_ts rather than a single
uint64_t. This means that za_first_integer is only the first byte, so
it is at most 255. This fixes it by doing a lookup for the whole value.
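
The difference between reading the first integer and looking up the
whole value can be shown with a stand-alone sketch. The `sketch_*`
helpers are hypothetical and do not use the ZAP API; they only model a
64-bit count stored as eight one-byte integers in host byte order.

```c
#include <stdint.h>
#include <string.h>

/* Store a refcount the way the entry is laid out: eight uint8_t values. */
static void
sketch_store_refcount(uint8_t out[8], uint64_t refcount)
{
	memcpy(out, &refcount, sizeof (refcount));
}

/* What a "first integer" read yields for 1-byte integers: one byte. */
static uint64_t
sketch_first_integer(const uint8_t val[8])
{
	return (val[0]);
}

/* What a full lookup yields: all eight bytes reassembled. */
static uint64_t
sketch_lookup_whole(const uint8_t val[8])
{
	uint64_t refcount;

	memcpy(&refcount, val, sizeof (refcount));
	return (refcount);
}
```

For a refcount of 1000, the single-byte read cannot return 1000 on any
byte order, while the whole-value lookup does.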

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
4 months agoFix printf typo for `zfs receive -cv` (#16295)
glibg10b [Thu, 18 Jul 2024 00:18:12 +0000 (02:18 +0200)]
Fix printf typo for `zfs receive -cv` (#16295)

Current output:
> receiving  correctivefull stream of a into b
New output:
> receiving corrective full stream of a into b

Signed-off-by: glibg10b <56197853+glibg10b@users.noreply.github.com>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
4 months agoMake sure avl_tree.avl_pad is not in kernel module (#16280)
youzhongyang [Wed, 17 Jul 2024 20:54:11 +0000 (16:54 -0400)]
Make sure avl_tree.avl_pad is not in kernel module (#16280)

The commit b192a2c (Remove avl_size field from struct avl_tree) uses
an #ifdef _KERNEL to decide whether to include avl_pad, but _KERNEL is
defined in sys/sysmacros.h. If avl.h and sysmacros.h are not included
in the right order, it can cause a headache when working on a
ZFS-related kernel module.

Add sysmacros.h in avl_impl.h to fix. sysmacros.h is also removed
from spa.h as it's redundant.

Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Co-authored-by: Youzhong Yang <yyang@mathworks.com>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
4 months agozdb: dump ZAP_FLAG_UINT64_KEY ZAPs properly (#16334)
Rob Norris [Wed, 17 Jul 2024 19:02:28 +0000 (05:02 +1000)]
zdb: dump ZAP_FLAG_UINT64_KEY ZAPs properly (#16334)

These are used for DDT and BRT stores. There's limited information
available to produce meaningful output, but at least we can put
something on screen rather than crashing.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
4 months agovdev_open: clear async fault flag after reopen
Rob Norris [Tue, 11 Jun 2024 10:49:10 +0000 (20:49 +1000)]
vdev_open: clear async fault flag after reopen

After c3f2f1aa2, vdev_fault_wanted is set on a vdev after a probe fails.
An end-of-txg async task is charged with actually faulting the vdev.

In a single-disk pool, the probe failure will degrade the last disk, and
then suspend the pool. However, vdev_fault_wanted is not cleared. After
the pool returns, the transaction finishes and the async task runs and
faults the vdev, which suspends the pool again.

The fix is simple: when reopening a vdev, clear the async fault flag. If
the vdev is still failed, the startup probe will quickly notice and
degrade/suspend it again. If not, all is well!
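
The sequence described above can be modelled with a tiny state sketch.
The structure and function names are hypothetical stand-ins, not the
vdev code; only the ordering of events follows the description.

```c
#include <stdbool.h>

/* Hypothetical model of the two pieces of vdev state involved. */
typedef struct sketch_vdev {
	bool fault_wanted;	/* set when a probe fails */
	bool faulted;		/* set by the end-of-txg async task */
} sketch_vdev_t;

/* A failed probe only requests a fault; the async task applies it. */
static void
sketch_probe_failed(sketch_vdev_t *vd)
{
	vd->fault_wanted = true;
}

/* The fix: reopening the vdev drops any stale fault request. */
static void
sketch_reopen(sketch_vdev_t *vd)
{
	vd->fault_wanted = false;
}

/* End-of-txg task: fault the vdev if a request is still pending. */
static void
sketch_async_task(sketch_vdev_t *vd)
{
	if (vd->fault_wanted) {
		vd->faulted = true;
		vd->fault_wanted = false;
	}
}
```

Without the reset in `sketch_reopen()`, the pending request would
survive the pool's return and the task would fault the vdev again.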

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Co-authored-by: Don Brady <don.brady@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Don Brady <don.brady@klarasystems.com>