git.proxmox.com Git - mirror_zfs.git/log
mirror_zfs.git
2 months ago Linux 6.8 compat: use splice_copy_file_range() for fallback
Rob N [Wed, 20 Mar 2024 23:46:15 +0000 (10:46 +1100)]
Linux 6.8 compat: use splice_copy_file_range() for fallback

Linux 6.8 removes generic_copy_file_range(), which had been reduced to a
simple wrapper around splice_copy_file_range(). Detect that function
directly and use it if generic_ is not available.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #15930
Closes #15931

2 months ago freebsd: fix missing headers in distribution tarball
Rob N [Wed, 20 Mar 2024 17:08:50 +0000 (04:08 +1100)]
freebsd: fix missing headers in distribution tarball

arc_os.h and freebsd_event.h aren't included in release tarballs, so the
build fails on FreeBSD. This fixes it.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #15963

2 months ago ddt: reduce DDT_NAMELEN
Rob Norris [Mon, 19 Feb 2024 10:19:32 +0000 (21:19 +1100)]
ddt: reduce DDT_NAMELEN

This is the buffer size passed to ddt_object_name(), to expand the
DMU_POOL_DDT format. That format inserts the table checksum, class and
type names, which as I write this are max 6, 9 and 3, respectively.
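
As a rough illustration of the sizing argument above, the sketch below measures how large a buffer a three-part name needs. The format string `EXAMPLE_DDT_FORMAT` and the helper `example_ddt_name_size()` are hypothetical stand-ins; the real DMU_POOL_DDT definition lives in the ZFS headers and may differ.

```c
#include <stdio.h>

/*
 * Hypothetical stand-in for the DMU_POOL_DDT format. The real definition
 * is in the ZFS headers and may not match this exactly.
 */
#define	EXAMPLE_DDT_FORMAT	"DDT-%s-%s-%s"

/*
 * Return the buffer size, including the trailing NUL, needed to hold a
 * name built from the three components, or 0 on a formatting error.
 */
static size_t
example_ddt_name_size(const char *checksum, const char *type,
    const char *class)
{
	/* snprintf(NULL, 0, ...) returns the length the output would have. */
	int len = snprintf(NULL, 0, EXAMPLE_DDT_FORMAT, checksum, type, class);

	return (len < 0 ? 0 : (size_t)len + 1);
}
```

With components of 6, 3 and 9 characters, this assumed format needs 25 bytes, so a fixed buffer only has to be a little larger than the longest names in use.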

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #15908

2 months ago config: use -Wno-format-truncation globally
Rob Norris [Mon, 19 Feb 2024 10:13:59 +0000 (21:13 +1100)]
config: use -Wno-format-truncation globally

-Wformat-truncation looks for places where the return code of snprintf()
is unchecked and the provided buffer might be too short. This is based
on a heuristic that can change between compiler versions.

It has been seen to get this wrong in ddt_object_name(), leading to
DDT_NAMELEN being increased somewhat arbitrarily.

There's no good reason to have this warning enabled, so here we disable
it everywhere. Truncation may be undesirable, but snprintf() is
guaranteed to emit a trailing null, so at worst we get a short string,
not a buffer overrun.
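
A minimal sketch of the behaviour described above, using only standard C. The helper name `example_format_name()` is made up for illustration. It shows that a too-small buffer yields a shortened but still NUL-terminated string, and that the return value of snprintf() reveals the truncation to callers that care.

```c
#include <stdio.h>
#include <string.h>

/*
 * Format "prefix-suffix" into buf. Returns 1 if the output was truncated,
 * 0 if it fit, and -1 on a formatting error. Whenever size > 0 and no
 * error occurs, buf is NUL-terminated.
 */
static int
example_format_name(char *buf, size_t size, const char *prefix,
    const char *suffix)
{
	int n = snprintf(buf, size, "%s-%s", prefix, suffix);

	if (n < 0)
		return (-1);
	/* n is the untruncated length; it must be below size to fit. */
	return ((size_t)n >= size ? 1 : 0);
}
```

Callers that need the full string can check for a return of 1; callers that do not still get a valid C string.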

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #15908

2 months ago Fixed parameter passing error when calling zfs_acl_chmod
Quartz [Mon, 26 Feb 2024 19:41:44 +0000 (03:41 +0800)]
Fixed parameter passing error when calling zfs_acl_chmod

Follow up to 99495ba6abbf0bb726324d03212c6f5ffa00043e which
accidentally introduced this regression.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Quartz <yyhran@163.com>
Closes #15907

3 months ago Check for minimum partition size
Brian Behlendorf [Fri, 16 Feb 2024 17:07:32 +0000 (09:07 -0800)]
Check for minimum partition size

On Linux, block devices used for vdevs will be partitioned.  The block
device must be large enough for a 64M partition starting at an offset
of 2048 sectors (part1), and a second 64M reserved partition at the
end of the device (part9).

This commit adds a capacity check when creating the GPT label to
immediately detect a device which is too small.  With the existing
code this would be caught slightly later when attempting to use
the partition.  Catching it sooner lets us print a more useful error.
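
The arithmetic behind such a check can be sketched as follows. This is an illustration only: it assumes 512-byte sectors, ignores any space used by the GPT structures themselves, and the names `example_min_device_bytes()` and `example_device_large_enough()` are hypothetical, not taken from the commit.

```c
#include <stdint.h>

#define	EXAMPLE_SECTOR_SIZE	512ULL			/* assumed sector size */
#define	EXAMPLE_START_SECTORS	2048ULL			/* offset of part1 */
#define	EXAMPLE_PART_BYTES	(64ULL * 1024 * 1024)	/* 64M per partition */

/* Smallest device that fits the start offset, part1 and part9. */
static uint64_t
example_min_device_bytes(void)
{
	return (EXAMPLE_START_SECTORS * EXAMPLE_SECTOR_SIZE +
	    2 * EXAMPLE_PART_BYTES);
}

/* Return 1 if the device can hold the layout, 0 if it is too small. */
static int
example_device_large_enough(uint64_t device_bytes)
{
	return (device_bytes >= example_min_device_bytes());
}
```

Under these assumptions the threshold is 1 MiB of leading offset plus two 64 MiB partitions, or 129 MiB in total.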

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #15898

3 months ago ZTS: Skip cross-fs bclone tests if FreeBSD < 14.0
Tony Hutter [Fri, 16 Feb 2024 16:59:56 +0000 (08:59 -0800)]
ZTS: Skip cross-fs bclone tests if FreeBSD < 14.0

Skip cross filesystem block cloning tests on FreeBSD if running
a version older than 14.0.  Cross filesystem copy_file_range() was
added in FreeBSD 14.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #15901

3 months ago ddt: document the theory and the key data structures
Rob Norris [Mon, 27 Nov 2023 23:43:36 +0000 (10:43 +1100)]
ddt: document the theory and the key data structures

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15887

3 months ago ddt: only create tables for dedup-capable checksums
Rob Norris [Thu, 1 Feb 2024 00:05:18 +0000 (11:05 +1100)]
ddt: only create tables for dedup-capable checksums

Most values in zio_checksum can never be used for dedup, partly because
the dedup= property only offers a limited list, but also some values (e.g.
ZIO_CHECKSUM_OFF) aren't real and will never be seen.

A true flag would be better than a hardcoded list, but that's more
cleanup elsewhere than I want to do right now.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15887

3 months ago ddt: simplify entry load and flags
Rob Norris [Tue, 5 Dec 2023 03:28:39 +0000 (14:28 +1100)]
ddt: simplify entry load and flags

Only a single bit is needed to track entry state, and definitely not two
whole bytes. Some light refactoring in ddt_lookup() is needed to support
this, but it reads a lot better now.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15887

3 months ago ddt: remove ddt_node
Rob Norris [Mon, 31 Jul 2023 07:42:34 +0000 (17:42 +1000)]
ddt: remove ddt_node

Nothing uses it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15887

3 months ago ddt: rework ops interface in terms of keys and values
Rob Norris [Mon, 3 Jul 2023 13:28:46 +0000 (23:28 +1000)]
ddt: rework ops interface in terms of keys and values

Store objects store keys and values, so have them take those types and
nothing more. This way, they don't need to be concerned about the "kind"
of entry being operated on; the dispatch layer can take care of the
appropriate conversions.

This adds a "contains" op to see if a particular entry exists without
loading it, which makes a couple of things easier to do; in particular,
it allows us to avoid an allocation in ddt_class_contains().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15887

3 months ago ddt: ensure ddt objects exist before trying to get stats from them
Rob Norris [Thu, 15 Jun 2023 06:10:00 +0000 (16:10 +1000)]
ddt: ensure ddt objects exist before trying to get stats from them

ddt_get_dedup_histogram() was actually checking it, just in an extremely
cursed way. ddt_get_dedup_object_stats() wasn't, but wasn't being called
from a dangerous place so no one noticed.

These checks are necessary, because spa_ddt[] is not populated until
spa_load(), but the spa can exist before that, while being created, and
as vdevs and metaslabs are initialised the space accounting functions
will be called to update pool space counts.

Probably the whole create path doesn't need to go asking for space
accounting from metadata subsystems until after the pool is created.
This will at least catch misuse.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15887

3 months ago ddt: remove struct names and forward declarations
Rob Norris [Mon, 3 Jul 2023 02:43:37 +0000 (12:43 +1000)]
ddt: remove struct names and forward declarations

Things get confused when there's more than one name for a thing.

Note that we don't do this for ddt_object_t, ddt_histogram_t and
ddt_stat_t because they're part of the public ZFS interface.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15887

3 months ago ddt: typedef ddt_type and ddt_class
Rob Norris [Mon, 3 Jul 2023 02:32:53 +0000 (12:32 +1000)]
ddt: typedef ddt_type and ddt_class

Mostly for consistency, so the reader is less likely to wonder why these
things look different.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15887

3 months ago ddt: split internal DDT API into separate header
Rob Norris [Fri, 30 Jun 2023 03:35:18 +0000 (13:35 +1000)]
ddt: split internal DDT API into separate header

Just to make it easier to know which bits to pay attention to.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15887

3 months ago ddt: remove DDE_GET_NDVAS macro
Rob Norris [Mon, 3 Jul 2023 05:25:06 +0000 (15:25 +1000)]
ddt: remove DDE_GET_NDVAS macro

It was a weird and confusing name, because it wasn't actually returning
the number of DVAs in the entry (as in, in the value/phys part) but the
maximum number of possible DVAs in a BP generated from the entry, based
on the encrypt bit in the key. This is unlike the similarly named
BP_GET_NDVAS, which really does return the number of DVAs.

Since it's only used in this one place, and for a specific purpose, it
seemed more sensible to just write it in-place and remove the name.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15887

3 months ago ddt: lift dedup stats out to separate file
Rob Norris [Tue, 16 May 2023 03:30:26 +0000 (13:30 +1000)]
ddt: lift dedup stats out to separate file

We want to add other kinds of dedup-related objects and keep stats for
them. This makes those functions easier to use from outside ddt.c.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15887

3 months ago ddt: compare keys, not entries
Rob Norris [Fri, 9 Jun 2023 00:14:42 +0000 (10:14 +1000)]
ddt: compare keys, not entries

We're about to have different kinds of things that we'll compare on key,
so generalise this function to support that.

(It actually worked fine because of the way the casts work out, but it
requires the key to be at the start of the object so the cast through
ddt_entry_t works, and even then it reads strangely for anything that's
not a ddt_entry_t).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15887

3 months ago ddt_zap: standardise temp buffer allocations
Rob Norris [Wed, 17 Jan 2024 22:51:41 +0000 (09:51 +1100)]
ddt_zap: standardise temp buffer allocations

Always do them on the heap, and when we know how much we need, only that
much.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15887

3 months ago ddt: move entry compression into ddt_zap
Rob Norris [Fri, 30 Jun 2023 02:48:45 +0000 (12:48 +1000)]
ddt: move entry compression into ddt_zap

I think I can say with some confidence that anyone making a new storage
type in 2023 is doing their own thing with compression, not this.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15887

3 months ago ddt: modernise assertions
Rob Norris [Thu, 15 Feb 2024 08:37:38 +0000 (19:37 +1100)]
ddt: modernise assertions

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15887

3 months ago Linux: Cleanup taskq threads spawn/exit
Alexander Motin [Tue, 13 Feb 2024 19:15:16 +0000 (14:15 -0500)]
Linux: Cleanup taskq threads spawn/exit

This changes taskq_thread_should_stop() to limit maximum exit rate
for idle threads to one per 5 seconds.  I believe the previous one
was broken, not allowing any thread exits for tasks arriving more
than one at a time and so completing while others are running.

Also while there:
 - Remove taskq_thread_spawn() calls on task allocation errors.
 - Remove extra taskq_thread_should_stop() call.
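
The rate limit described above could look roughly like the sketch below. This is not the actual taskq code: `example_taskq_t`, `example_thread_should_stop()` and the use of plain seconds are assumptions made for illustration.

```c
#include <stdint.h>

#define	EXAMPLE_EXIT_INTERVAL	5	/* seconds between permitted exits */

/* Hypothetical per-queue state: when a thread was last allowed to exit. */
typedef struct example_taskq {
	uint64_t last_exit;
} example_taskq_t;

/*
 * Return 1 if an idle thread may exit now, 0 otherwise. At most one exit
 * is granted per EXAMPLE_EXIT_INTERVAL seconds, and busy threads never
 * exit. `now` is assumed to be a non-decreasing timestamp in seconds.
 */
static int
example_thread_should_stop(example_taskq_t *tq, uint64_t now, int idle)
{
	if (!idle)
		return (0);
	if (now - tq->last_exit < EXAMPLE_EXIT_INTERVAL)
		return (0);
	tq->last_exit = now;	/* this exit consumes the current slot */
	return (1);
}
```

Recording the time of the last granted exit, rather than the time of the last task, is what keeps bursts of short tasks from blocking exits indefinitely in this sketch.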

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-by: iXsystems, Inc.
Closes #15873

3 months ago zdb: Fix false leak report for BRT objects
Bi11 [Tue, 13 Feb 2024 00:58:47 +0000 (08:58 +0800)]
zdb: Fix false leak report for BRT objects

Fix a misreport in 'zdb -d' where it falsely marked
BRT objects as leaked.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Yuxin Wang <yuxinwang9999@gmail.com>
Closes #15882

3 months ago BRT: Fix slop space calculation with block cloning
Bi11 [Mon, 12 Feb 2024 21:53:33 +0000 (05:53 +0800)]
BRT: Fix slop space calculation with block cloning

Similar to deduplication, the size of data duplicated by block cloning
should not be included in the slop space calculation.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Yuxin Wang <yuxinwang9999@gmail.com>
Closes #15874

3 months ago Allowing PERFPOOL to be defined by zfs-test users
Kevin Greene [Fri, 9 Feb 2024 18:02:46 +0000 (10:02 -0800)]
Allowing PERFPOOL to be defined by zfs-test users

Reviewed-by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Kevin Greene <kevin.greene@delphix.com>
Closes #15868

3 months ago Update zfs-snapshot.8
Shawn Bayern [Thu, 8 Feb 2024 21:06:12 +0000 (16:06 -0500)]
Update zfs-snapshot.8

Fixes a small inaccuracy in the description of snapshot
atomicity.

zfs-snapshot(8) appears to contain a small error.  The existing
version reads "Snapshots are taken atomically, so that all
snapshots correspond to the same moment in time."  Per
zfs_main.c, which in do_snapshot() simply loops over argv, this
does not appear to be correct when multiple snapshots are
specified explicitly on the command line.  I believe the intent
of the man page was to say that *recursive* snapshots are all
created atomically.

This proposed change fixes that error.  Because the existing
statement may confuse some readers anyway, the commit also
adds a small amount of general explanatory information that may
be helpful.

The change also adds an introductory sentence that summarizes
what 'zfs snapshot' does in the first place.  In that sentence,
the text "different datasets" is intended to indicate that
(again per the code) the same dataset cannot be specified
multiple times on the command line.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Shawn Bayern <sbayern@law.fsu.edu>
Closes #15857

3 months ago zfs list: add '-t fs' and '-t vol' options
Rob N [Thu, 8 Feb 2024 18:22:58 +0000 (05:22 +1100)]
zfs list: add '-t fs' and '-t vol' options

Because "filesystem" and "volume" are just too long!

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #15864

3 months ago Add slow disk diagnosis to ZED
Don Brady [Thu, 8 Feb 2024 17:19:52 +0000 (10:19 -0700)]
Add slow disk diagnosis to ZED

Slow disk response times can be indicative of a failing drive. ZFS
currently tracks slow I/Os (slower than zio_slow_io_ms) and generates
events (ereport.fs.zfs.delay).  However, ZED takes no action on them,
as it does for checksum or I/O errors.  This change adds slow disk
diagnosis to ZED, which is opt-in using new VDEV properties:
  VDEV_PROP_SLOW_IO_N
  VDEV_PROP_SLOW_IO_T

If multiple VDEVs in a pool are undergoing slow I/Os, then ZED skips
the zpool_vdev_degrade() call.
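
One plausible reading of the two properties is a threshold of N slow I/O events within a window of T seconds. The sketch below illustrates that kind of sliding-window counter. It is an assumption about the semantics, not ZED's implementation, and every name in it (`example_window_t`, `example_window_record()`, `EXAMPLE_MAX_EVENTS`) is hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define	EXAMPLE_MAX_EVENTS	16	/* capacity; n should not exceed it */

typedef struct example_window {
	uint32_t n;				/* events needed to fire */
	uint64_t t;				/* window length in seconds */
	uint64_t stamps[EXAMPLE_MAX_EVENTS];	/* recent event times */
	uint32_t count;				/* valid entries in stamps */
} example_window_t;

/*
 * Record an event at time `now` (non-decreasing seconds). Returns 1 when
 * at least n events, including this one, fall within the last t seconds.
 */
static int
example_window_record(example_window_t *w, uint64_t now)
{
	uint32_t kept = 0;

	/* Drop events that have aged out of the window. */
	for (uint32_t i = 0; i < w->count; i++) {
		if (now - w->stamps[i] <= w->t)
			w->stamps[kept++] = w->stamps[i];
	}
	/* If the buffer is full, discard the oldest remaining event. */
	if (kept == EXAMPLE_MAX_EVENTS) {
		memmove(&w->stamps[0], &w->stamps[1],
		    (EXAMPLE_MAX_EVENTS - 1) * sizeof (w->stamps[0]));
		kept--;
	}
	w->stamps[kept++] = now;
	w->count = kept;
	return (w->n > 0 && kept >= w->n);
}
```

A caller would keep one such window per vdev and act only when exactly one vdev fires, mirroring the rule that simultaneous slowness across several vdevs should not trigger a degrade.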

Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes #15469

3 months ago LUA: Backport CVE-2020-24370's patch
the-Chain-Warden-thresh [Wed, 7 Feb 2024 19:53:05 +0000 (03:53 +0800)]
LUA: Backport CVE-2020-24370's patch

CVE-2020-24370 is a security vulnerability in Lua. Although the CVE
description says that it only affects Lua 5.4.0, according to the Lua
project the bug has actually existed since Lua 5.2. The root cause is
the negation overflow that occurs when taking the negative of
0x80000000. Thus, this CVE also exists in OpenZFS. This backports the
fix to the Lua in OpenZFS, with adjustments, since the original fix is
for 5.4 and several functions have been changed.
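
The class of bug can be illustrated with a small, self-contained sketch. This is not the Lua patch itself; `example_vararg_index()` is a hypothetical helper that validates a negative selector by comparing it against a negated bound instead of negating the selector, so INT_MIN never gets negated.

```c
#include <limits.h>

/*
 * Map a negative selector n (-1 is the first extra argument) onto a
 * zero-based index into nextra extra arguments. Returns -1 if n is out
 * of range. The value -n is never computed, so n == INT_MIN is safe.
 */
static int
example_vararg_index(int n, int nextra)
{
	/* -nextra cannot overflow because nextra is positive here. */
	if (n >= 0 || nextra <= 0 || n < -nextra)
		return (-1);
	/* n is in [-nextra, -1], so n + 1 is in [-nextra + 1, 0]. */
	return (-(n + 1));
}
```

The range check runs before any negation, which is the general pattern for avoiding this overflow.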

https://github.com/advisories/GHSA-gfr4-c37g-mm3v
https://nvd.nist.gov/vuln/detail/CVE-2020-24370
https://www.lua.org/bugs.html#5.4.0-11
https://github.com/lua/lua/commit/a585eae6e7ada1ca9271607a4f48dfb1786

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: ChenHao Lu <18302010006@fudan.edu.cn>
Closes #15847

3 months ago Add 'zpool status -e' flag to see unhealthy vdevs
Cameron Harr [Wed, 7 Feb 2024 17:12:12 +0000 (09:12 -0800)]
Add 'zpool status -e' flag to see unhealthy vdevs

When very large pools are present, it can be laborious to find
the reason a pool is degraded and/or where an unhealthy vdev
is. This option filters out vdevs that are ONLINE with no errors
to make it easier to see where the issues are. The root and parents of
unhealthy vdevs will always be printed.
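
The filtering rule can be sketched as a recursive walk over a vdev tree. The structure and function names below are hypothetical and much simpler than the real zpool code; the sketch only shows the rule that a vdev is printed if it is the root, is unhealthy itself, or has an unhealthy descendant.

```c
#include <stddef.h>

/* Hypothetical, simplified vdev node. */
typedef struct example_vdev {
	int online;				/* 1 if state is ONLINE */
	unsigned long errors;			/* read + write + cksum */
	const struct example_vdev *children;	/* array of child vdevs */
	size_t nchildren;
} example_vdev_t;

/* Return 1 if the vdev should appear in the filtered listing. */
static int
example_vdev_should_print(const example_vdev_t *vd, int is_root)
{
	/* The root is always shown, as is any vdev with its own problem. */
	if (is_root || !vd->online || vd->errors != 0)
		return (1);
	/* A healthy parent is shown only if something below it is not. */
	for (size_t i = 0; i < vd->nchildren; i++) {
		if (example_vdev_should_print(&vd->children[i], 0))
			return (1);
	}
	return (0);
}
```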

Testing:
ZFS errors and drive failures for multiple vdevs were simulated with
zinject.

Sample vdev listings with '-e' option
- All vdevs healthy
    NAME        STATE     READ WRITE CKSUM
    iron5       ONLINE       0     0     0

- ZFS errors
    NAME        STATE     READ WRITE CKSUM
    iron5       ONLINE       0     0     0
      raidz2-5  ONLINE       1     0     0
        L23     ONLINE       1     0     0
        L24     ONLINE       1     0     0
        L37     ONLINE       1     0     0

- Vdev faulted
    NAME        STATE     READ WRITE CKSUM
    iron5       DEGRADED     0     0     0
      raidz2-6  DEGRADED     0     0     0
        L67     FAULTED      0     0     0  too many errors

- Vdev faults and data errors
    NAME        STATE     READ WRITE CKSUM
    iron5       DEGRADED     0     0     0
      raidz2-1  DEGRADED     0     0     0
        L2      FAULTED      0     0     0  too many errors
      raidz2-5  ONLINE       1     0     0
        L23     ONLINE       1     0     0
        L24     ONLINE       1     0     0
        L37     ONLINE       1     0     0
      raidz2-6  DEGRADED     0     0     0
        L67     FAULTED      0     0     0  too many errors

- Vdev missing
    NAME        STATE     READ WRITE CKSUM
    iron5       DEGRADED     0     0     0
      raidz2-6  DEGRADED     0     0     0
        L67     UNAVAIL      3     1     0

- Slow devices when -s provided with -e
    NAME        STATE     READ WRITE CKSUM  SLOW
    iron5       DEGRADED     0     0     0     -
      raidz2-5  DEGRADED     0     0     0     -
        L10     FAULTED      0     0     0     0  external device fault
        L51     ONLINE       0     0     0    14

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Cameron Harr <harr1@llnl.gov>
Closes #15769

3 months ago BRT: Fix FICLONE/FICLONERANGE shortened copy
Brian Behlendorf [Tue, 6 Feb 2024 00:44:45 +0000 (16:44 -0800)]
BRT: Fix FICLONE/FICLONERANGE shortened copy

On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls
are expected to either fully clone the specified range or return an
error.  The range may be for an entire file.  While internally ZFS
supports cloning partial ranges there's no way to return the length
cloned to the caller so we need to make this all or nothing.

As part of this change, support for the REMAP_FILE_CAN_SHORTEN flag
has been added.  When REMAP_FILE_CAN_SHORTEN is set, zfs_clone_range()
will return a shortened range when encountering pending dirty records.
When the flag is clear, zfs_clone_range() will block and wait for the
records to be written out, allowing the blocks to be cloned.

Furthermore, the file range lock is held over the region being cloned
to prevent it from being modified while cloning.  This doesn't quite
provide atomic semantics, since if an error is encountered only a
portion of the range may be cloned.  This will be converted to an
error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the
caller.  However, the destination file range is left in an undefined
state.

A test case has been added which exercises this functionality by
verifying that `cp --reflink=never|auto|always` works correctly.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #15728
Closes #15842

3 months ago libzdb: Initial breakout of libzdb
Rich Ercolani [Mon, 5 Feb 2024 18:00:41 +0000 (13:00 -0500)]
libzdb: Initial breakout of libzdb

Step 1 in trying to slowly rip the zdb functions out of zdb.c
to allow people to play with more flexible things to leverage
zdb's functionality.

No promises on any functions or structs being stable, now or probably
in general, unless someone builds a more polished abstraction. The
goal at the moment is to slowly untangle the global state usage
in zdb...

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #15804

3 months ago Improve performance for zpool trim on linux
Umer Saleem [Fri, 2 Feb 2024 19:51:51 +0000 (00:51 +0500)]
Improve performance for zpool trim on linux

On Linux, ZFS uses blkdev_issue_discard in vdev_disk_io_trim to issue
trim commands, which is synchronous.

This commit updates vdev_disk_io_trim to use __blkdev_issue_discard,
which is asynchronous. Unfortunately there isn't any asynchronous
version of blkdev_issue_secure_erase, so performance of secure trim
will still suffer.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #15843

3 months ago Linux 6.8 compat: handle mnt_idmap user_namespace change
Rob Norris [Tue, 23 Jan 2024 10:14:06 +0000 (21:14 +1100)]
Linux 6.8 compat: handle mnt_idmap user_namespace change

struct mnt_idmap no longer has a struct user_namespace within it. Work
around this by creating a temporary with the copy of the map we need
taken from the idmap.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Youzhong Yang <yyang@mathworks.com>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #15805

3 months ago Linux 6.8 compat: fix inode permission tests
Rob Norris [Tue, 23 Jan 2024 06:43:20 +0000 (17:43 +1100)]
Linux 6.8 compat: fix inode permission tests

The name inode_permission is now defined in the kernel. Rename ours to
test_permission, in line with most of our other tests.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #15805

3 months ago Linux 6.8 compat: replace MAX_ORDER define
Rob Norris [Tue, 23 Jan 2024 05:41:05 +0000 (16:41 +1100)]
Linux 6.8 compat: replace MAX_ORDER define

MAX_ORDER has been renamed to MAX_PAGE_ORDER. Rather than just
redefining it, instead define our own name and set it consistently from
the start.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #15805

3 months ago Linux 6.8 compat: implement strlcpy fallback
Rob Norris [Tue, 23 Jan 2024 05:34:49 +0000 (16:34 +1100)]
Linux 6.8 compat: implement strlcpy fallback

Linux has removed strlcpy in favour of strscpy. This implements a
fallback implementation of strlcpy for this case.
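
For readers unfamiliar with the semantics, a userspace sketch of a strlcpy-style function is shown below. It is not the code added by this commit. It is named `example_strlcpy()` to avoid clashing with any libc-provided strlcpy, and it follows the conventional contract: copy at most size - 1 bytes, always NUL-terminate when size is nonzero, and return the length of the source.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy src into dst, which has room for `size` bytes. Returns strlen(src),
 * so a return value >= size tells the caller the copy was truncated.
 */
static size_t
example_strlcpy(char *dst, const char *src, size_t size)
{
	size_t srclen = strlen(src);

	if (size != 0) {
		/* Leave one byte for the terminating NUL. */
		size_t n = (srclen >= size) ? size - 1 : srclen;

		memcpy(dst, src, n);
		dst[n] = '\0';
	}
	return (srclen);
}
```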

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #15805

3 months ago Linux 6.8 compat: update for new bdev access functions
Rob Norris [Tue, 23 Jan 2024 04:42:57 +0000 (15:42 +1100)]
Linux 6.8 compat: update for new bdev access functions

blkdev_get_by_path() and blkdev_put() have been replaced by
bdev_open_by_path() and bdev_release(), which return a "handle" object
with the bdev object itself inside.

This adds detection for the new functions, and macros to handle the old
and new forms consistently.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #15805

3 months ago Linux 6.8 compat: make test functions static
Rob Norris [Mon, 22 Jan 2024 23:50:53 +0000 (10:50 +1100)]
Linux 6.8 compat: make test functions static

The kernel is now being compiled with -Wmissing-prototypes. Most of our
test stub functions had no prototype, and failed to compile. Since they
don't need to be visible anywhere else, just make them all static.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #15805

3 months ago Linux 6.7 compat: META
Brian Behlendorf [Mon, 29 Jan 2024 19:35:43 +0000 (11:35 -0800)]
Linux 6.7 compat: META

Update the META file to reflect compatibility with the 6.7 kernel.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #15833

3 months ago Don't assert mg_initialized due to device addition race
Paul Dagnelie [Mon, 29 Jan 2024 18:36:42 +0000 (10:36 -0800)]
Don't assert mg_initialized due to device addition race

During device removal stress tests, we noticed that we were tripping
the assertion that mg_initialized was true. After investigation, it was
determined that the mg in question was the embedded log metaslab
group for a newly added vdev; the normal mg had been initialized (by
metaslab_sync_reassess, via vdev_sync_done). However, because the spa
config alloc lock is not held as writer across both calls to
metaslab_sync_reassess, it is possible for an allocation to happen
between the two metaslab_groups being initialized. Because the metaslab
code doesn't check the group in question, just the vdev's main mg, it
is possible to get past the initial check in vdev_allocatable and
later fail due to the assertion.

We simply remove the assertions. We could also consider locking the
ALLOC lock around the reassess calls in vdev_sync_done, but that risks
deadlocks. We could check the actual target mg in vdev_allocatable,
but that risks racing with a passivation that comes in after that
check but before the assertion. We still won't be able to actually
allocate from the metaslab group if no metaslabs are ready, so this
change shouldn't break anything.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #15818

3 months ago libzfs: use zfs_strerror() in place of strerror()
Richard Kojedzinszky [Mon, 22 Jan 2024 23:28:18 +0000 (00:28 +0100)]
libzfs: use zfs_strerror() in place of strerror()

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Richard Kojedzinszky <richard@kojedz.in>
Closes #15793

3 months ago libzfs: make userquota_propname_decode threadsafe
Richard Kojedzinszky [Wed, 17 Jan 2024 19:48:02 +0000 (20:48 +0100)]
libzfs: make userquota_propname_decode threadsafe

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Richard Kojedzinszky <richard@kojedz.in>
Closes #15793

3 months ago libnvpair.c: replace strstr() with strchr() for a single character
rilysh [Mon, 29 Jan 2024 17:46:13 +0000 (23:16 +0530)]
libnvpair.c: replace strstr() with strchr() for a single character

Since we're looking for a single new-line character in the haystack,
it's better (and slightly more efficient) to use strchr() instead of
strstr().
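
A small illustration of the idiom, using only standard C. The helper `example_first_line_len()` is made up for this sketch and is not part of libnvpair.

```c
#include <string.h>

/* Return the length of the first line of s, excluding any newline. */
static size_t
example_first_line_len(const char *s)
{
	/* strchr() searches for one character; no substring match needed. */
	const char *nl = strchr(s, '\n');

	return (nl != NULL ? (size_t)(nl - s) : strlen(s));
}
```

strstr(s, "\n") would find the same position, but strchr() states the intent directly and avoids the generality of a substring search.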

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: rilysh <nightquick@proton.me>
Closes #15798

3 months ago Update man pages to time(1) from time(2)
Chris Davidson [Mon, 29 Jan 2024 17:44:08 +0000 (12:44 -0500)]
Update man pages to time(1) from time(2)

zpool-iostat.8: Updated time(2) -> time(1) to align to manual page
zpool-list.8: Updated time(2) -> time(1) to align to manual page
zpool-status.8: Updated time(2) -> time(1) to align to manual page
zpool-wait.8: Updated time(2) -> time(1) to align to manual page

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christopher Davidson <christopher.davidson@gmail.com>
Closes #15823

3 months ago ZTS: Allow longer run time for zdb_args_pos
Brian Behlendorf [Mon, 29 Jan 2024 17:41:26 +0000 (09:41 -0800)]
ZTS: Allow longer run time for zdb_args_pos

The zdb_args_pos test may take slightly longer than 600 seconds to run
on some of the CI builders.  To prevent this from causing failures allow
up to 1200 seconds for tests in this group.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #15826

3 months ago Move nodes into correct subgraphs
Andrew Innes [Mon, 29 Jan 2024 17:16:02 +0000 (01:16 +0800)]
Move nodes into correct subgraphs

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Andrew Innes <andrew.c12@gmail.com>
Closes #15828

3 months ago Remove list_size struct member from list implementation
MigeljanImeri [Fri, 26 Jan 2024 22:46:42 +0000 (15:46 -0700)]
Remove list_size struct member from list implementation

Removed the list_size struct member as it was only used in a single
assertion, as mentioned in PR #15478.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: MigeljanImeri <imerimigel@gmail.com>
Closes #15812

3 months ago zpool wait: print timestamp before the header
Rob N [Fri, 26 Jan 2024 22:41:31 +0000 (09:41 +1100)]
zpool wait: print timestamp before the header

list, status and iostat all display the -T timestamp before the header,
but wait showed it after. Make it be like the others.

Reported-by: Kyle Evans <kevans@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #15825

3 months ago Update vdev devid and physpath if changed between imports
Ameer Hamza [Fri, 26 Jan 2024 22:24:35 +0000 (03:24 +0500)]
Update vdev devid and physpath if changed between imports

If devid or physpath for a vdev changes between imports, ensure it is
updated to the new value.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #15816

3 months ago ZTS: Update deprecated Github Action version numbers
Tino Reichardt [Fri, 26 Jan 2024 22:22:26 +0000 (23:22 +0100)]
ZTS: Update deprecated Github Action version numbers

GitHub Actions is transitioning from Node 16 to Node 20.

So we need to update these:
- actions/checkout@v3 -> v4
- actions/download-artifact@v3 -> v4
- actions/upload-artifact@v3 -> v4 and some minor changes

Update also the documentation of the testings workflow.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Andrew Innes <andrew.c12@gmail.com>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes #15820

3 months agoSwitch to CodeQL to detect prohibited function use
Richard Yao [Fri, 26 Jan 2024 22:11:33 +0000 (17:11 -0500)]
Switch to CodeQL to detect prohibited function use

The LLVM/Clang developers pointed out that using the CPP to detect use
of functions that our QA policies prohibit risks invoking undefined
behavior. To resolve this, we configure CodeQL to detect forbidden
function usage.

Note that cpp in the context of CodeQL refers to C/C++, rather than the
C PreProcessor, which C++ also uses. It really should have been written
cxx, but that ship sailed a long time ago. This misuse of the term cpp
is retained in the CodeQL configuration for consistency with upstream
CodeQL.

As a side benefit, verbose make no longer is a wall of text showing a
bunch of CPP macros, which can make debugging slightly easier.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #15819
Closes #14134

3 months agoZTS: Apply small changes for speeding up the tests
Tino Reichardt [Fri, 26 Jan 2024 21:36:59 +0000 (22:36 +0100)]
ZTS: Apply small changes for speeding up the tests

The Github Action Runner got some new hardware metrics.  We should use
the provided and empty disk which is pre-mounted at /mnt now.

Disk1: 89GiB -> rootfs + bootfs with ~80MB/s -> don't care
Disk2: 64GiB -> /mnt with 420MB/s -> new testing ssd

This commit mounts the new disk at /var/tmp, which will hopefully
speed up our tests.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Andrew Innes <andrew.c12@gmail.com>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes #15811

3 months agoFix file descriptor leak on pool import.
Pawel Jakub Dawidek [Tue, 23 Jan 2024 23:03:48 +0000 (15:03 -0800)]
Fix file descriptor leak on pool import.

Descriptor leak can be easily reproduced by doing:

# zpool import tank
# sysctl kern.openfiles
# zpool export tank; zpool import tank
# sysctl kern.openfiles

We were leaking four file descriptors on every import.

Similar leak most likely existed when using file-based VDEVs.

External-issue: https://reviews.freebsd.org/D43529
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #15630

3 months agoZTS: Apply zfs_bclone_enabled to bclone tests
Brian Behlendorf [Tue, 23 Jan 2024 00:14:08 +0000 (16:14 -0800)]
ZTS: Apply zfs_bclone_enabled to bclone tests

If block cloning is disabled by default then enable it when running
the bclone tests.  Follow up to #15529.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #15796

4 months agoFreeBSD: Fix bootstrapping tools under Linux/musl
Val Packett [Fri, 19 Jan 2024 21:01:26 +0000 (18:01 -0300)]
FreeBSD: Fix bootstrapping tools under Linux/musl

musl libc has deprecated LFS64 aliases, so bootstrapping FreeBSD tools
under musl distros has been failing with stat64 errors.

Apply the aliases under non-glibc Linux to fix this problem.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Val Packett <val@packett.cool>
Closes #15780

4 months agofix: variable type with zfs-tests/cmd/clonefile.c
Tino Reichardt [Wed, 17 Jan 2024 17:06:14 +0000 (18:06 +0100)]
fix: variable type with zfs-tests/cmd/clonefile.c

Compiling on arm64 freebsd-13.2 and arm64 almalinux-8 currently fails
with this error:

```
  CC       tests/zfs-tests/cmd/clonefile.o
tests/zfs-tests/cmd/clonefile.c:166:43: error: result of comparison of \
constant -1 with expression of type 'char' is always true \
[-Werror,-Wtautological-constant-out-of-range-compare]
        while ((c = getopt(argc, argv, "crfdq")) != -1) {
               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^  ~~
1 error generated.
gmake[2]: *** [Makefile:8675: tests/zfs-tests/cmd/clonefile.o] Error 1
```

Fix: use correct variable type `int`.
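
A minimal sketch of why the type matters, assuming plain `char` is unsigned on these arm64 targets (which is what the compiler error above suggests). getopt() returns an int, so a -1 stored in an unsigned `char` never compares equal to -1 and the loop cannot terminate. The helper name `count_known_flags()` is hypothetical and is not code from clonefile.c:

```c
#define	_POSIX_C_SOURCE	200809L
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper, not taken from clonefile.c: count how many of the
 * options in "crfdq" appear on the command line.
 */
int
count_known_flags(int argc, char **argv)
{
	int c;		/* int, not char: getopt() returns -1 when done */
	int seen = 0;

	assert(argv != NULL);
	optind = 1;	/* restart parsing so the helper can be called again */
	opterr = 0;	/* keep getopt() quiet about unknown options */
	while ((c = getopt(argc, argv, "crfdq")) != -1) {
		if (c != '?')
			seen++;
	}
	return (seen);
}
```

Called with an argument vector such as `clonefile -c -q src dst`, the helper should count two recognised flags.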

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes #15783

4 months agolinux spl: fix typo in top comment of spl-condvar.c
Tino Reichardt [Wed, 17 Jan 2024 17:05:12 +0000 (18:05 +0100)]
linux spl: fix typo in top comment of spl-condvar.c

Credential Implementation -> Condition Variables Implementation

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes #15782

4 months agoAutotrim High Load Average Fix
Kevin Jin [Wed, 17 Jan 2024 17:03:58 +0000 (12:03 -0500)]
Autotrim High Load Average Fix

Switch from cv_wait() to cv_wait_idle() in vdev_autotrim_wait_kick(),
which should mitigate the high load average while waiting.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: jxdking <lostking2008@hotmail.com>
Closes #15781

4 months agoFix cloning into mmaped and cached file.
Pawel Jakub Dawidek [Wed, 17 Jan 2024 16:51:07 +0000 (08:51 -0800)]
Fix cloning into mmaped and cached file.

If the destination file is mmaped and the mmaped region was already
read, so it is cached, we need to update mmaped pages after successful
clone using update_pages().

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Pointed out by: Ka Ho Ng <khng@freebsd.org>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #15772

4 months agoLinux 6.7 compat: zfs_setattr fix atime update
Rob N [Tue, 16 Jan 2024 22:01:17 +0000 (09:01 +1100)]
Linux 6.7 compat: zfs_setattr fix atime update

In db4fc559c I messed up and changed this bit of code to set the inode
atime to an uninitialised value, when actually it was just supposed to
be loading the atime from the inode to be stored in the SA. This changes
it to what it should have been.

Ensure times change by the right amount. Previously, we only checked
if the times changed at all, which missed a bug where the atime was
being set to an undefined value.

Now ensure the times change by two seconds (or thereabouts), ensuring
we catch cases where we set the time to something bonkers.
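
A check of this kind could look like the following sketch, assuming timestamps are compared in whole seconds. The function `time_advanced_by()` and the slack parameter are hypothetical and are not taken from the test suite:

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/*
 * Hypothetical check: true if "after" is later than "before" by "expected"
 * seconds, give or take "slack" seconds. A bogus timestamp still counts as
 * "changed", but it fails this stricter test.
 */
bool
time_advanced_by(time_t before, time_t after, time_t expected, time_t slack)
{
	assert(expected >= 0 && slack >= 0);
	if (after < before)
		return (false);

	time_t delta = after - before;
	return (delta + slack >= expected && delta <= expected + slack);
}
```

With `expected` set to 2 and `slack` set to 1, a change of one to three seconds passes, while no change at all or a jump of thousands of seconds does not.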

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #15762
Closes #15773

4 months agoMake sure all necessary RPM path macros are defined
Lalufu [Tue, 16 Jan 2024 21:32:59 +0000 (22:32 +0100)]
Make sure all necessary RPM path macros are defined

When building (s)rpm files through the Makefile, a directory structure
is created in /tmp to hold the various files.

In case the user running the command has overridden some of the RPM path
settings through their user profile (for example in `~/.rpmmacros`),
these paths do not line up with the configuration, and the build fails.

Make sure all paths used are properly defined.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ralf Ertzinger <ralf@skytale.net>
Closes #15756

4 months agoMake spl_kmem_cache size check consistent
youzhongyang [Tue, 16 Jan 2024 21:30:58 +0000 (16:30 -0500)]
Make spl_kmem_cache size check consistent

On Linux x86_64, kmem cache can have size up to 4M,
however increasing spl_kmem_cache_slab_limit can lead
to crash due to the size check inconsistency.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes #15757

4 months agoAdd path handling for aux vdevs in `label_path`
Ameer Hamza [Thu, 4 Jan 2024 14:35:04 +0000 (19:35 +0500)]
Add path handling for aux vdevs in `label_path`

If an AUX vdev is added by UUID, importing the pool falls back to
opening it by disk name instead of UUID, because AUX vdevs lack path
information. Since AUX labels now have path information, this PR adds
path handling for them in `label_path`.

Reviewed-by: Umer Saleem <usaleem@ixsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #15737

4 months agoExtend aux label to add path information
Ameer Hamza [Thu, 4 Jan 2024 14:32:53 +0000 (19:32 +0500)]
Extend aux label to add path information

Pool import logic uses vdev paths, so it makes sense to add path
information on AUX vdev as well.

Reviewed-by: Umer Saleem <usaleem@ixsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #15737

4 months agofix: Uber block label not always found for aux vdevs
Ameer Hamza [Thu, 4 Jan 2024 14:02:50 +0000 (19:02 +0500)]
fix: Uber block label not always found for aux vdevs

When spare or l2cache (aux) vdev is added during pool creation,
spa->spa_uberblock is not dumped until that point. Subsequently,
the aux label is never synchronized after its initial creation,
resulting in the uberblock label remaining undumped. The uberblock
is crucial for lib_blkid in identifying the ZFS partition type. To
address this issue, we now ensure sync of the uberblock label once
if it's not dumped initially.

Reviewed-by: Umer Saleem <usaleem@ixsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #15737

4 months agoMake zdb -R a little more sane.
Rich Ercolani [Tue, 16 Jan 2024 21:16:08 +0000 (16:16 -0500)]
Make zdb -R a little more sane.

zdb -R has a minor flaw in which it will not always print the full
output of a decompressed block. Oops.

While I was in there, I also reworked the logic so it won't try
ZLE unless everything else fails, which will hopefully avoid the
problem ZDB_NO_ZLE was intended to mitigate of reporting a lot of
false positives of ZLE compressed blocks...

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #15723

4 months agoZTS: Test for clone, mmap and write for block cloning
Umer Saleem [Tue, 16 Jan 2024 21:15:10 +0000 (02:15 +0500)]
ZTS: Test for clone, mmap and write for block cloning

For block cloning, if we mmap the cloned file and write from the
map into the file, it triggers a panic in dbuf_redirty() on Linux.

The same scenario causes data corruption on FreeBSD. Both these
issues are fixed under PR#15656 and PR#15665.

It would be good to add a test for this scenario in ZTS. The test
program and issue was produced by @robn.

Reviewed-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #15717

4 months agoFix "out of memory" error
Brian Behlendorf [Fri, 12 Jan 2024 20:35:29 +0000 (12:35 -0800)]
Fix "out of memory" error

Drop the no_memory() call from zpool_in_use() when reading the
label fails and instead return the error to the caller.  This
prevents a misleading "internal error: out of memory" error
when the label can't be read.  This will result in is_spare()
returning B_FALSE instead of aborting, which is already safely
handled.

Furthermore, on Linux it's possible for EREMOTEIO to be returned
by an NVMe device if the device has been low-level formatted
and not rescanned.  In this case we want to fallback to the
legacy scanning method and read any of the labels we can.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #13538
Closes #15747

4 months agofix: preserve linux kmod signature in zfs-kmod rpm spec
Benjamin Sherman [Fri, 12 Jan 2024 20:33:41 +0000 (14:33 -0600)]
fix: preserve linux kmod signature in zfs-kmod rpm spec

This change provides rpm spec macros to sign the zfs and spl kmods as
the final step after the %install scriptlet. This is needed since the
find-debuginfo.sh script strips out debug symbols plus signatures.

Kernel module signing only occurs when the required files are present
as typically required in the Linux source tree:
- certs/signing_key.pem
- certs/signing_key.x509

The method for overriding the default __spec_install_post macro is
inspired by (and largely copied from) the Fedora kernel.spec.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Benjamin Sherman <benjamin@holyarmy.org>
Closes #15744

4 months agospa: Let spa_taskq_param_get()'s addition of a newline be optional
Mark Johnston [Fri, 29 Dec 2023 17:56:35 +0000 (12:56 -0500)]
spa: Let spa_taskq_param_get()'s addition of a newline be optional

For FreeBSD sysctls, we don't want the extra newline, since the
sysctl(8) utility will format strings appropriately.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reported-by: Peter Holm <pho@FreeBSD.org>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #15719

4 months agospa: Fix FreeBSD sysctl handlers
Mark Johnston [Fri, 29 Dec 2023 15:22:58 +0000 (10:22 -0500)]
spa: Fix FreeBSD sysctl handlers

sbuf_cpy() resets the sbuf state, which is wrong for sbufs allocated by
sbuf_new_for_sysctl().  In particular, this code triggers an assertion
failure in sbuf_clear().

Simplify by just using sysctl_handle_string() for both reading and
setting the tunable.

Fixes: 6930ecbb7 ("spa: make read/write queues configurable")
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reported-by: Peter Holm <pho@FreeBSD.org>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #15719

4 months agoStop wasting time on malloc in snprintf_zstd_header
Rich Ercolani [Fri, 12 Jan 2024 20:17:26 +0000 (15:17 -0500)]
Stop wasting time on malloc in snprintf_zstd_header

Profiling zdb -vvvvv on datasets with a lot of zstd blocks, we find
ourselves spending quite a lot of time on malloc/free, because we
allocate a 16M abd each call, and never free it, so we're leaking
16M per call as well.

This seems sub-optimal. So let's just keep the buffer around and
reuse it.
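
The idea of keeping one buffer alive across calls can be sketched as follows. This is not the zdb code: `get_scratch_buffer()` is a hypothetical helper, and plain malloc() stands in for the abd allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define	SCRATCH_SIZE	((size_t)16 * 1024 * 1024)	/* 16M, as above */

/*
 * Hypothetical helper: allocate the scratch buffer on first use and hand
 * back the same buffer on every later call, so repeated calls neither pay
 * for malloc()/free() nor leak a fresh allocation each time.
 */
void *
get_scratch_buffer(size_t size)
{
	static void *scratch = NULL;

	assert(size <= SCRATCH_SIZE);
	if (scratch == NULL)
		scratch = malloc(SCRATCH_SIZE);
	return (scratch);	/* NULL only if the one-time malloc() failed */
}
```

A static buffer like this is not safe for concurrent callers; the sketch assumes a single-threaded caller.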

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #15721

4 months agofix(mount): do not truncate shares not zfs mount
Stefan Lendl [Fri, 12 Jan 2024 20:05:11 +0000 (21:05 +0100)]
fix(mount): do not truncate shares not zfs mount

When running zfs share -a, resetting exports.d/zfs.exports makes sense
to get a clean state.
Truncating was also done by zfs mount, which would not populate the
file again.
Add a test to verify shares persist after mount -a.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Stefan Lendl <s.lendl@proxmox.com>
Closes #15607
Closes #15660

4 months agoEnable block_cloning tests on FreeBSD
Brian Behlendorf [Fri, 12 Jan 2024 19:57:13 +0000 (11:57 -0800)]
Enable block_cloning tests on FreeBSD

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #15749

4 months agoMake zdb -R scale less poorly
Rich Ercolani [Fri, 12 Jan 2024 19:55:17 +0000 (14:55 -0500)]
Make zdb -R scale less poorly

zdb -R with :d tries to use gzip decompression 9 times per size.
There's absolutely no reason for that, they're all the same
decompressor.
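
One way to avoid retrying the same decompressor is to skip any table entry whose routine has already been seen. The sketch below illustrates that idea only; the table, the `fake_*` routines and `count_distinct_decompressors()` are hypothetical and do not reflect how zdb stores its compression table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a decompression routine. */
typedef int (*decompress_fn)(const void *src, void *dst, size_t len);

struct comp_entry {
	const char *name;	/* setting name, e.g. "gzip-6" */
	decompress_fn fn;	/* NULL when there is nothing to try */
};

/* Dummy routines with distinct bodies so their addresses differ. */
int
fake_gzip(const void *s, void *d, size_t n)
{
	(void) s; (void) d; (void) n;
	return (1);
}

int
fake_lz4(const void *s, void *d, size_t n)
{
	(void) s; (void) d; (void) n;
	return (2);
}

/* Nine gzip levels share one decompressor, as described above. */
const struct comp_entry sample_table[] = {
	{ "off", NULL }, { "lz4", fake_lz4 },
	{ "gzip-1", fake_gzip }, { "gzip-2", fake_gzip },
	{ "gzip-3", fake_gzip }, { "gzip-4", fake_gzip },
	{ "gzip-5", fake_gzip }, { "gzip-6", fake_gzip },
	{ "gzip-7", fake_gzip }, { "gzip-8", fake_gzip },
	{ "gzip-9", fake_gzip },
};
#define	SAMPLE_TABLE_LEN (sizeof (sample_table) / sizeof (sample_table[0]))

/* Count the routines a caller would actually need to try. */
size_t
count_distinct_decompressors(const struct comp_entry *table, size_t n)
{
	size_t distinct = 0;

	assert(table != NULL);
	for (size_t i = 0; i < n; i++) {
		int seen = 0;

		if (table[i].fn == NULL)
			continue;
		for (size_t j = 0; j < i; j++) {
			if (table[j].fn == table[i].fn) {
				seen = 1;
				break;
			}
		}
		if (!seen)
			distinct++;	/* a real caller would try fn here */
	}
	return (distinct);
}
```

For the sample table, eleven entries collapse to two routines worth trying.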

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #15726

4 months agoFix a potential use-after-free in zfs_setsecattr()
Mark Johnston [Tue, 9 Jan 2024 23:57:09 +0000 (18:57 -0500)]
Fix a potential use-after-free in zfs_setsecattr()

In general, VOPs must not load the "z_log" field until having called
zfs_enter_verify_zp().

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #15752

4 months agoLinux: Defer loading the object set in zfs_setattr()
Mark Johnston [Tue, 9 Jan 2024 15:57:29 +0000 (10:57 -0500)]
Linux: Defer loading the object set in zfs_setattr()

We need to wait until after having done a zfs_enter() to load some
fields from the zfsvfs structure.  Otherwise a use-after-free is
possible in the face of a concurrent rollback.

Other functions in this file are careful to avoid this bug; I believe
this is the only instance.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #15752

4 months agoAdd Gotify notification support to ZED
gofaster [Tue, 9 Jan 2024 17:49:30 +0000 (12:49 -0500)]
Add Gotify notification support to ZED

This commit adds the zed_notify_gotify() function and hooks it
into zed_notify(). This will allow ZED to send notifications
to a self-hosted Gotify service, which can be received
on a desktop or mobile device. It is configured with ZED_GOTIFY_URL,
ZED_GOTIFY_APPTOKEN and ZED_GOTIFY_PRIORITY variables in zed.rc.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: gofaster <felix.gofaster@gmail.com>
Closes #15693

4 months agoFix livelist assertions for dedup and cloning
Alexander Motin [Tue, 9 Jan 2024 17:48:40 +0000 (12:48 -0500)]
Fix livelist assertions for dedup and cloning

Two block pointers in livelist pointing to the same location may
be caused not only by dedup, but also by block cloning. We should
not assert D bit set in them.

Two block pointers in livelist pointing to the same location may
have different logical birth time in case of dedup or cloning. We
should assert identical physical birth time instead.

Assert identical physical block size between pointers in addition
to checksum, since that is what checksums are calculated on.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15732

4 months agoImprove block sizes checks during cloning
Alexander Motin [Tue, 9 Jan 2024 17:46:43 +0000 (12:46 -0500)]
Improve block sizes checks during cloning

 - Fail if source block is smaller than destination.  We can only
grow blocks, not shrink them.
 - Fail if we do not have full znode range lock.  In that case grow
is not even called.  We should improve zfs_rangelock_cb() somehow
to know when cloning needs to grow the block size unlike write.
 - Fail if we tried to resize, but failed.  There are many reasons
for it to fail that we can not predict at this level, so be ready
for them.  Unlike write, that may proceed after growth failure,
block cloning can't and must return error.

This fixes assertion inside dmu_brt_clone() when it sees different
number of blocks held in destination than it got block pointers.
Builds without ZFS_DEBUG returned EXDEV, so are not affected much.

Reviewed-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15724
Closes #15735

4 months agomake zdb_decompress_block check decompression reliably
Kent Ross [Tue, 9 Jan 2024 17:13:52 +0000 (09:13 -0800)]
make zdb_decompress_block check decompression reliably

This function decompresses to two buffers and then compares them to
check whether the (opaque) decompression process filled the whole
buffer. Previously it began with lbuf uninitialized and lbuf2 filled
with pseudorandom data. This neither guarantees that any bytes not
written by the compressor would be different, nor seems incredibly
sound otherwise!

After these changes, instead of filling one buffer with generated
pseudorandom data we overwrite each buffer with completely different
data. This should remove the possibility of low-probability failures,
as well as make the process simpler and cheaper.
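
The two-buffer check can be sketched in user space like this. It assumes the opaque routine is deterministic and reports nothing reliable about how much it wrote. The `fill_fn` type, the `fill_*` stand-ins and `output_fills_buffer()` are hypothetical names, not zdb code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an opaque decompressor writing into dst. */
typedef void (*fill_fn)(void *dst, size_t len);

/* Stand-in that writes every byte of the buffer. */
void
fill_all(void *dst, size_t len)
{
	memset(dst, 'A', len);
}

/* Stand-in that silently stops halfway. */
void
fill_half(void *dst, size_t len)
{
	memset(dst, 'A', len / 2);
}

/*
 * Pre-fill the two buffers so that every byte differs between them, run
 * the routine on both, then compare. Any byte the routine left untouched
 * still differs, so equal buffers mean the whole length was written.
 */
bool
output_fills_buffer(fill_fn fn, void *buf1, void *buf2, size_t len)
{
	assert(fn != NULL && buf1 != NULL && buf2 != NULL);
	memset(buf1, 0x00, len);
	memset(buf2, 0xff, len);
	fn(buf1, len);
	fn(buf2, len);
	return (memcmp(buf1, buf2, len) == 0);
}
```

Because the pre-fill patterns differ at every position, the result does not depend on chance the way a comparison against one pseudorandom buffer does.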

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Kent Ross <k@mad.cash>
Closes #15733

4 months agozpoolprops.7: Remove unnecessary .Ns
Jose Luis Duran [Tue, 9 Jan 2024 01:03:15 +0000 (22:03 -0300)]
zpoolprops.7: Remove unnecessary .Ns

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jose Luis Duran <jlduran@gmail.com>
Closes #15727

4 months agoZIL: Update Linux tracing after #15635
Alexander Motin [Tue, 9 Jan 2024 00:49:39 +0000 (19:49 -0500)]
ZIL: Update Linux tracing after #15635

While picking parts from #14909 I missed the Linux tracing specific
ones. That went unnoticed in default configurations, but breaks the
build in some others.

Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15730

4 months agoLinux 6.2 compat: add check for kernel_neon_* availability
Shengqi Chen [Tue, 9 Jan 2024 00:05:24 +0000 (08:05 +0800)]
Linux 6.2 compat: add check for kernel_neon_* availability

This patch adds check for `kernel_neon_*` symbols on arm and arm64
platforms to address the following issues:

1. Linux 6.2+ on arm64 has exported them with `EXPORT_SYMBOL_GPL`, so
   license compatibility must be checked before use.
2. On both arm and arm64, the definitions of these symbols are guarded
   by `CONFIG_KERNEL_MODE_NEON`, but their declarations are still
   present. Checking in configuration phase only leads to MODPOST
   errors (undefined references).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Closes #15711
Closes #14555
Closes: #15401

4 months agoFix the FreeBSD userspace build (#15716)
Mark Johnston [Wed, 27 Dec 2023 20:17:53 +0000 (15:17 -0500)]
Fix the FreeBSD userspace build (#15716)

- Mark some parameters to zpool_power*() as unused.
- Add a stub zpool_disk_wait().

Fixes: a9520e6e5 ("zpool: Add slot power control, print power status")
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>

4 months agoBlock cloning tests.
Pawel Jakub Dawidek [Tue, 26 Dec 2023 20:01:53 +0000 (12:01 -0800)]
Block cloning tests.

The tests mostly focus on testing various corner cases.
The tests take a long time to run, so for the common.run runfile
we randomly select a hundred tests.
To run all the bclone tests, bclone.run runfile should be used.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #15631

4 months agoLinux 6.5 compat: check BLK_OPEN_EXCL is defined
Brian Behlendorf [Thu, 21 Dec 2023 19:22:56 +0000 (11:22 -0800)]
Linux 6.5 compat: check BLK_OPEN_EXCL is defined

On some systems we already have blkdev_get_by_path() with 4 args
but still the old FMODE_EXCL and not BLK_OPEN_EXCL defined.
The vdev_bdev_mode() function was added to handle this case
but there was no generic way to specify exclusive access.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #15692

4 months agoDon't panic on unencrypted block in encrypted dataset
chrisperedun [Thu, 21 Dec 2023 19:12:30 +0000 (14:12 -0500)]
Don't panic on unencrypted block in encrypted dataset

While 763ca47 closes the situation of block cloning creating
unencrypted records in encrypted datasets, existing data still causes
panic on read. Setting zfs_recover bypasses this but at the cost of
potentially ignoring more serious issues.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Peredun <chris.peredun@ixsystems.com>
Closes #15677

4 months agoZIL: Improve next log block size prediction
Alexander Motin [Thu, 21 Dec 2023 18:54:44 +0000 (13:54 -0500)]
ZIL: Improve next log block size prediction

Track history in context of bursts, not individual log blocks. This
avoids blowing away all the history with a single large burst of
many blocks, and at the same time allows optimizations covering
multiple blocks in a burst and even a predicted following burst.  For
each burst, account its optimal block size and minimal first block
size.  Use those statistics from the last 8 bursts to predict the first
block size of the next burst.

Remove predefined set of block sizes. Allocate any size we see fit,
multiple of 4KB, as required by ZIL now.  With compression enabled
by default, ZFS already writes pretty random block sizes, so this
should not surprise space allocator any more.
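
The "multiple of 4KB" constraint can be illustrated with a small rounding helper. The name `round_up_4k()` is hypothetical; the ZIL code has its own sizing logic, which this sketch does not attempt to reproduce.

```c
#include <assert.h>
#include <stdint.h>

#define	LOG_BLOCK_ALIGN	4096ULL	/* 4KB granularity, as described above */

/* Round a requested size up to the next multiple of 4KB. */
uint64_t
round_up_4k(uint64_t size)
{
	/* Guard against wrapping around when adding the alignment. */
	assert(size <= UINT64_MAX - (LOG_BLOCK_ALIGN - 1));
	return ((size + LOG_BLOCK_ALIGN - 1) & ~(LOG_BLOCK_ALIGN - 1));
}
```

Sizes that are already aligned are returned unchanged; anything else moves up to the next 4KB boundary.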

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15635

4 months agozpool: Add slot power control, print power status
Tony Hutter [Thu, 21 Dec 2023 18:53:16 +0000 (10:53 -0800)]
zpool: Add slot power control, print power status

Add `zpool` flags to control the slot power to drives.  This assumes
your SAS or NVMe enclosure supports slot power control via sysfs.

The new `--power` flag is added to `zpool offline|online|clear`:

    zpool offline --power <pool> <device>    Turn off device slot power
    zpool online --power <pool> <device>     Turn on device slot power
    zpool clear --power <pool> [device]      Turn on device slot power

If the ZPOOL_AUTO_POWER_ON_SLOT env var is set, then the '--power'
option is automatically implied for `zpool online` and `zpool clear`
and does not need to be passed.
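
How an environment variable can imply a command-line flag is sketched below. Treating any non-empty value as "set" is an assumption of this sketch, and `env_is_set()` and `power_requested()` are hypothetical names; the real zpool code may parse the value differently.

```c
#define	_POSIX_C_SOURCE	200809L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumption: a variable counts as set when it has a non-empty value. */
bool
env_is_set(const char *name)
{
	assert(name != NULL);

	const char *val = getenv(name);
	return (val != NULL && val[0] != '\0');
}

/*
 * Hypothetical helper: slot power is requested either by an explicit
 * --power flag or implicitly through ZPOOL_AUTO_POWER_ON_SLOT.
 */
bool
power_requested(bool power_flag)
{
	return (power_flag || env_is_set("ZPOOL_AUTO_POWER_ON_SLOT"));
}
```

Per the description above, only `zpool online` and `zpool clear` would consult the environment variable; `zpool offline` would rely on the explicit flag alone.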

zpool status also gets a --power option to print the slot power status.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mart Frauenlob <AllKind@fastest.cc>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #15662

5 months agospa: make read/write queues configurable
Rob N [Wed, 20 Dec 2023 22:17:14 +0000 (09:17 +1100)]
spa: make read/write queues configurable

We are finding that as customers get larger and faster machines
(hundreds of cores, large NVMe-backed pools) they keep hitting
relatively low performance ceilings. Our profiling work almost always
finds that they're running into bottlenecks on the SPA IO taskqs.
Unfortunately there's often little we can advise at that point, because
there are very few ways to change behaviour without patching.

This commit adds two load-time parameters `zio_taskq_read` and
`zio_taskq_write` that can configure the READ and WRITE IO taskqs
directly.

This achieves two goals: it gives operators (and those that support
them) a way to tune things without requiring a custom build of OpenZFS,
which is often not possible, and it lets us easily try different config
variations in a variety of environments to inform the development of
better defaults for these kind of systems.

Because tuning the IO taskqs really requires a fairly deep understanding
of how IO in ZFS works, and generally isn't needed without a pretty
serious workload and an ability to identify bottlenecks, only minimal
documentation is provided. It's expected that anyone using this is going
to have the source code there as well.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #15675

5 months agoLinux 6.7 compat: rework shrinker setup for heap allocations
Rob Norris [Sat, 16 Dec 2023 13:36:21 +0000 (00:36 +1100)]
Linux 6.7 compat: rework shrinker setup for heap allocations

6.7 changes the shrinker API such that shrinkers must be allocated
dynamically by the kernel. To accommodate this, this commit reworks
spl_register_shrinker() to do something similar against earlier kernels.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://github.com/sponsors/robn
Closes #15681

5 months agoLinux 6.7 compat: handle superblock shrinker member change
Rob Norris [Sat, 16 Dec 2023 06:39:07 +0000 (17:39 +1100)]
Linux 6.7 compat: handle superblock shrinker member change

In 6.7 the superblock shrinker member s_shrink has changed from being an
embedded struct to a pointer. Detect this, and don't take a reference if
it already is one.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://github.com/sponsors/robn
Closes #15681

5 months agoLinux 6.7 compat: use inode atime/mtime accessors
Rob Norris [Sat, 16 Dec 2023 11:31:32 +0000 (22:31 +1100)]
Linux 6.7 compat: use inode atime/mtime accessors

6.6 made i_ctime inaccessible; 6.7 has done the same for i_atime and
i_mtime. This extends the method used for ctime in b37f29341 to atime
and mtime as well.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://github.com/sponsors/robn
Closes #15681

5 months agoLinux 6.7 compat: simplify current_time() check
Rob Norris [Sat, 16 Dec 2023 07:01:45 +0000 (18:01 +1100)]
Linux 6.7 compat: simplify current_time() check

6.7 changed the names of the time members in struct inode, so we can't
assign back to it because we don't know its name. In practice this
doesn't matter though - if we're missing current_time(), then we must be
on <4.9, and we know our fallback will need to return timespec.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://github.com/sponsors/robn
Closes #15681

5 months agoTest LWB buffer overflow for block cloning
Umer Saleem [Fri, 15 Dec 2023 22:18:27 +0000 (03:18 +0500)]
Test LWB buffer overflow for block cloning

PR#15634 removes 128K into 2x68K LWB split optimization, since it
was found to cause LWB buffer overflow while trying to write 128KB
TX_CLONE_RANGE record with 1022 block pointers into 68KB buffer,
with multiple VDEVs ZIL.

This commit adds a test for this particular scenario by writing a
maximum-sized TX_CLONE_RANGE record with 1022 block pointers into a
68KB buffer, with two SLOG devices.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #15672

5 months agodmu: Allow buffer fills to fail
Alexander Motin [Fri, 15 Dec 2023 17:51:41 +0000 (12:51 -0500)]
dmu: Allow buffer fills to fail

When ZFS overwrites a whole block, it does not bother to read the
old content from disk. It is a good optimization, but if the buffer
fill fails due to page fault or something else, the buffer ends up
corrupted, neither keeping old content, nor getting the new one.

On FreeBSD this is additionally complicated by page faults being
blocked by VFS layer, always returning EFAULT on attempt to write
from mmap()'ed but not yet cached address range.  Normally it is
not a big problem, since after original failure VFS will retry the
write after reading the required data.  The problem becomes worse
in specific case when somebody tries to write into a file its own
mmap()'ed content from the same location.  In that situation the
only copy of the data is getting corrupted on the page fault and
the following retries only fixate the status quo.  Block cloning
makes this issue easier to reproduce, since it does not read the
old data, unlike traditional file copy, that may work by chance.
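
The access pattern in question, writing a file's own mmap()'ed content back to the same location, can be sketched in user space as follows. This only shows the shape of the operation: `write_back_own_mapping()` is a hypothetical helper, the cloning step is omitted, and on a healthy filesystem the file content is simply left unchanged.

```c
#define	_POSIX_C_SOURCE	200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: map the first "len" bytes of "path" and write them
 * back to offset 0, using the mapping itself as the source buffer.
 * Returns 0 on success and -1 on any failure.
 */
int
write_back_own_mapping(const char *path, size_t len)
{
	assert(path != NULL && len > 0);

	int fd = open(path, O_RDWR);
	if (fd < 0)
		return (-1);

	void *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		(void) close(fd);
		return (-1);
	}

	/* The source of this write is the file's own mapped content. */
	ssize_t done = pwrite(fd, map, len, 0);

	(void) munmap(map, len);
	(void) close(fd);
	return (done == (ssize_t)len ? 0 : -1);
}
```

The mapping is not touched before the write, so the pages may not be cached yet, which is the situation the message describes.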

This patch provides the fill status to dmu_buf_fill_done(), that
in case of error can destroy the corrupted buffer as if no write
happened.  One more complication in case of block cloning is that
if error is possible during fill, dmu_buf_will_fill() must read
the data via fall-back to dmu_buf_will_dirty().  This is required
so that, in case of error, the buffer can be restored to its state
after the cloning, not before it, which is what would happen if we
just called dbuf_undirty().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15665

5 months agodbuf: Set dr_data when unoverriding after clone
Alexander Motin [Tue, 12 Dec 2023 20:59:24 +0000 (15:59 -0500)]
dbuf: Set dr_data when unoverriding after clone

Block cloning normally creates dirty record without dr_data.  But if
the block is read after cloning, it is moved into DB_CACHED state and
receives the data buffer.  If after that we call dbuf_unoverride()
to convert the dirty record into normal write, we should give it the
data buffer from dbuf and release one.

Reviewed-by: Kay Pedersen <mail@mkwg.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15654
Closes #15656