]> git.proxmox.com Git - mirror_zfs.git/log
mirror_zfs.git
3 years agolibzutil: zfs_isnumber(): return false if input empty
наб [Tue, 6 Apr 2021 19:25:53 +0000 (21:25 +0200)]
libzutil: zfs_isnumber(): return false if input empty

zpool list, which is the only user, would mistakenly try to parse the
empty string as the interval in this case:

  $ zpool list "a"
  cannot open 'a': no such pool
  $ zpool list ""
  interval cannot be zero
  usage: <usage string follows>
which is now symmetric with zpool get:
  $ zpool list ""
  cannot open '': name must begin with a letter

Avoid breaking the  "interval cannot be zero" string.
There simply isn't a need for this, and it's user-facing.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #11841
Closes #11843

3 years agoZTS: pool_checkpoint improvements
Brian Behlendorf [Sat, 3 Apr 2021 15:33:22 +0000 (08:33 -0700)]
ZTS: pool_checkpoint improvements

The pool_checkpoint tests may incorrectly fail because several of
them invoke zdb for an imported pool.  In this scenario it's not
unexpected for zdb to fail if the pool is modified.  To resolve
this these zdb checks are now done after the pool has been exported.

Additionally, the default cleanup functions assumed the pool would
be imported when they were run.  If this was not the case they're
exit early and fail to cleanup all of the test state causing
subsequent tests to fail.  Add a check to only destroy the pool
when it is imported.

Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #11832

3 years agoFix various typos
Andrea Gelmini [Sat, 3 Apr 2021 01:38:53 +0000 (18:38 -0700)]
Fix various typos

Correct an assortment of typos throughout the code base.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #11774

3 years agobash_completion.d: always call zfs/zpool binaries directly
наб [Fri, 2 Apr 2021 23:34:58 +0000 (01:34 +0200)]
bash_completion.d: always call zfs/zpool binaries directly

/dev/zfs is 0:0 666 on most systems, so the [ -w /dev/zfs ] check always
succeeds, but if zfs isn't in $PATH (e.g. when completing from
"/sbin/zfs list" on a regular account) this can lead to error spew like

  nabijaczleweli@szarotka:~$ /sbin/zfs list bash: zfs: command not found
  @ bash: zfs: command not found

We only do read-only commands, and quite general ones at that,
so there's no need to elevate one way or another.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #11828

3 years agoAdd RELEASES.md file
Brian Behlendorf [Fri, 2 Apr 2021 23:33:40 +0000 (16:33 -0700)]
Add RELEASES.md file

Document the project's policy regarding publishing and maintaining
official OpenZFS releases.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #11821

3 years agozed: allow limiting concurrent jobs
наб [Mon, 29 Mar 2021 13:21:54 +0000 (15:21 +0200)]
zed: allow limiting concurrent jobs

200ms time-out is relatively long, but if we already hit the cap,
then we'll likely be able to spawn multiple new jobs when we wake up

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #11807

3 years agozed: remove unused zed_file_close_on_exec()
наб [Sat, 27 Mar 2021 13:18:27 +0000 (14:18 +0100)]
zed: remove unused zed_file_close_on_exec()

The FIXME comment was there since the initial implementation in 2014,
there are no users

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #11807

3 years agozed: use separate reaper thread and collect ZEDLETs asynchronously
наб [Fri, 26 Mar 2021 13:41:38 +0000 (14:41 +0100)]
zed: use separate reaper thread and collect ZEDLETs asynchronously

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #11807

3 years agozed: set names for all threads
наб [Fri, 26 Mar 2021 20:18:18 +0000 (21:18 +0100)]
zed: set names for all threads

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #11807

3 years agoZTS: inheritance/inherit_001_pos is flaky
Ryan Moeller [Fri, 2 Apr 2021 18:11:52 +0000 (14:11 -0400)]
ZTS: inheritance/inherit_001_pos is flaky

Add inheritance/inherit_001_pos to the maybe fails on FreeBSD list.

Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11830

3 years agoAvoid taking global lock to destroy zfsdev state
Ryan Moeller [Fri, 2 Apr 2021 18:09:05 +0000 (14:09 -0400)]
Avoid taking global lock to destroy zfsdev state

We have exclusive access to our zfsdev state object in this section
until it is invalidated by setting zs_minor to -1, so we can destroy
the state without taking a lock if we do the invalidation last, after
a member to ensure correct ordering.

While here, strengthen the assertions that zs_minor is valid when we
enter.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes #11751

3 years agoFreeBSD: Fix stable/12 after AT_BENEATH removal
Ryan Moeller [Fri, 2 Apr 2021 18:06:44 +0000 (14:06 -0400)]
FreeBSD: Fix stable/12 after AT_BENEATH removal

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11827

3 years agoBump libzfs.so and libzpool.so versions
Brian Behlendorf [Thu, 1 Apr 2021 23:53:05 +0000 (16:53 -0700)]
Bump libzfs.so and libzpool.so versions

Bump the library versions as advised by the libtool guidelines.

https://www.gnu.org/software/libtool/manual/html_node/Updating-version-info.html

Two new functions were added but no existing functions were changed,
so we increase the version and the age (version:revision:age).

Added functions (2):
- boolean_t zpool_is_draid_spare(const char *);
- zpool_compat_status_t zpool_load_compat(const char *,
      boolean_t *, char *, char *);

Additionally bump the libzpool.so version information.  This library
is for internal use but we still want to update the version to track
major changes to the interfaces.

The libzfsbootenv, libuutil, libnvpair and libzfs_core libraries
have not been updated.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #11817

3 years agoAllow pool names that look like Solaris disk names
Ryan Moeller [Thu, 1 Apr 2021 15:49:41 +0000 (11:49 -0400)]
Allow pool names that look like Solaris disk names

Nothing bad happens if a prefix of your pool name matches a disk name.
This is a bit of a silly restriction at this point.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes #11781
Closes #11813

3 years agoDon't scale zfs_zevent_len_max by CPU count
Ryan Moeller [Wed, 31 Mar 2021 17:56:37 +0000 (13:56 -0400)]
Don't scale zfs_zevent_len_max by CPU count

The lower bound for this scaling to too low and the upper bound is too
high.  Use a fixed default length of 512 instead, which is a reasonable
value on any system.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11822

3 years agoAtomically check and set dropped zevent count
Ryan Moeller [Mon, 29 Mar 2021 19:44:27 +0000 (15:44 -0400)]
Atomically check and set dropped zevent count

ratelimit_dropped isn't protected by a lock and is expected to
be updated atomically.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11822

3 years agoCI: Increase free space in workflow
Brian Behlendorf [Thu, 1 Apr 2021 15:39:27 +0000 (08:39 -0700)]
CI: Increase free space in workflow

Recently we've been running out of free space in the ubuntu 20.04
environment resulting in test failures.  This appears to be caused
by a change in the default available free space and not because of
any change in OpenZFS. Try and avoid this failure by applying a
suggested workaround which removes some unnecessary files.

https://github.com/actions/virtual-environments/issues/2840

Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #11826

3 years agoFixing m4 iops rename check
Brian Atkinson [Thu, 1 Apr 2021 15:37:41 +0000 (09:37 -0600)]
Fixing m4 iops rename check

The configure check for iops->rename wanting flags was missing the
AC_MSG_CHECKING() so it would just print yes without saying what was
being checked.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #11825

3 years agofsck.zfs: implement 4/8 exit codes as suggested in manpage
наб [Wed, 31 Mar 2021 17:49:56 +0000 (19:49 +0200)]
fsck.zfs: implement 4/8 exit codes as suggested in manpage

Update the fsck.zfs helper to bubble up some already-known-about
errors if they are detected in the pool.

health=degraded => 4/"Filesystem errors left uncorrected"
health=faulted && dataset in /etc/fstab => 8/"Operational error"
pool not found => 8/"Operational error"
everything else => 0

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #11806

3 years agoAdd compatibility file sets (ZoL 0.6.1, 0.6.4, OpenZFS 2.1)
Mike Swanson [Wed, 31 Mar 2021 16:40:25 +0000 (09:40 -0700)]
Add compatibility file sets (ZoL 0.6.1, 0.6.4, OpenZFS 2.1)

ZoL 0.6.1 introduced feature flags with the three features that all
implementations at the time were guaranteed to have.  0.6.4 introduced
a few more until 0.6.5 added two after that.  OpenZFS 2.1 added the
dRAID feature.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mike Swanson <mikeonthecomputer@gmail.com>
Closes #11818

3 years agoUpdate META
Brian Behlendorf [Tue, 30 Mar 2021 17:32:29 +0000 (10:32 -0700)]
Update META

Increase the version to 2.1.99 to indicate the master branch is
newer than the 2.1.x release.  This ensures packages built from
master branch are considered to be newer than the last release.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
3 years agoTag 2.1.0-rc1
Brian Behlendorf [Mon, 29 Mar 2021 23:31:29 +0000 (16:31 -0700)]
Tag 2.1.0-rc1

New features:
- Distributed Spare (dRAID) Feature
- Added "compatibility" property for zpool feature sets
- Added zpool_influxdb command to collect zpool statistics

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
3 years agozed: reap child after killing on time-out
наб [Fri, 26 Mar 2021 21:21:00 +0000 (22:21 +0100)]
zed: reap child after killing on time-out

When a child process is killed waitpid() must be called on the
pid the reap the zombie process.

Update BUGS section to reflect reality by replacing "zedlets
aren't time limited with "zedlets can be interrupted".

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #11769
Closes #11798

3 years agoUse a helper function to clarify gang block size
Matthew Ahrens [Fri, 26 Mar 2021 18:19:35 +0000 (11:19 -0700)]
Use a helper function to clarify gang block size

For gang blocks, `DVA_GET_ASIZE()` is the total space allocated for the
gang DVA including its children BP's.  The space allocated at each DVA's
vdev/offset is `vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE)`.

This commit makes this relationship more clear by using a helper
function, `vdev_gang_header_asize()`, for the space allocated at the
gang block's vdev/offset.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11744

3 years agoWhen specifying raidz vdev name, parity count should match
Matthew Ahrens [Fri, 26 Mar 2021 18:12:22 +0000 (11:12 -0700)]
When specifying raidz vdev name, parity count should match

When specifying the name of a RAIDZ vdev on the command line, it can be
specified as raidz-<vdevID> or raidzP-<vdevID>.
e.g. `zpool clear poolname raidz-0` or `zpool clear poolname raidz2-0`

If the parity is specified in the vdev name, it should match the actual
parity of that RAIDZ vdev, otherwise the command should fail.  This
commit makes it so.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Stuart Maybee <stuart.maybee@comcast.net>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11742

3 years agoFix error code on __zpl_ioctl_setflags()
Luis Henriques [Fri, 26 Mar 2021 17:46:45 +0000 (17:46 +0000)]
Fix error code on __zpl_ioctl_setflags()

Other (all?) Linux filesystems seem to return -EPERM instead of -EACCESS
when trying to set FS_APPEND_FL or FS_IMMUTABLE_FL without the
CAP_LINUX_IMMUTABLE capability.  This was detected by generic/545 test
in the fstest suite.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Luis Henriques <henrix@camandro.org>
Closes #11791

3 years agoSupport running FreeBSD buildworld on Arm-based macOS hosts
Jessica Clarke [Fri, 26 Mar 2021 17:45:12 +0000 (17:45 +0000)]
Support running FreeBSD buildworld on Arm-based macOS hosts

Arm-based Macs are like FreeBSD and provide a full 64-bit stat from the
start, so have no stat64 variants. Thus, define stat64 and fstat64 as
aliases for the normal versions.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Jessica Clarke <jrtc27@jrtc27.com>
Closes #11771

3 years agoRemoved duplicated includes
Andrea Gelmini [Mon, 22 Mar 2021 19:34:58 +0000 (20:34 +0100)]
Removed duplicated includes

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #11775

3 years agoFix typo in Python method name
Andrea Gelmini [Mon, 22 Mar 2021 19:32:38 +0000 (20:32 +0100)]
Fix typo in Python method name

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #11776

3 years agoSplit dmu_zfetch() speculation and execution parts
Alexander Motin [Sat, 20 Mar 2021 05:56:11 +0000 (01:56 -0400)]
Split dmu_zfetch() speculation and execution parts

To make better predictions on parallel workloads dmu_zfetch() should
be called as early as possible to reduce possible request reordering.
In particular, it should be called before dmu_buf_hold_array_by_dnode()
calls dbuf_hold(), which may sleep waiting for indirect blocks, waking
up multiple threads same time on completion, that can significantly
reorder the requests, making the stream look like random.  But we
should not issue prefetch requests before the on-demand ones, since
they may get to the disks first despite the I/O scheduler, increasing
on-demand request latency.

This patch splits dmu_zfetch() into two functions: dmu_zfetch_prepare()
and dmu_zfetch_run().  The first can be executed as early as needed.
It only updates statistics and makes predictions without issuing any
I/Os.  The I/O issuance is handled by dmu_zfetch_run(), which can be
called later when all on-demand I/Os are already issued.  It even
tracks the activity of other concurrent threads, issuing the prefetch
only when _all_ on-demand requests are issued.

For many years it was a big problem for storage servers, handling
deeper request queues from their clients, having to either serialize
consequential reads to make ZFS prefetcher usable, or execute the
incoming requests as-is and get almost no prefetch from ZFS, relying
only on deep enough prefetch by the clients.  Benefits of those ways
varied, but neither was perfect.  With this patch deeper queue
sequential read benchmarks with CrystalDiskMark from Windows via
iSCSI to FreeBSD target show me much better throughput with almost
100% prefetcher hit rate, comparing to almost zero before.

While there, I also removed per-stream zs_lock as useless, completely
covered by parent zf_lock.  Also I reused zs_blocks refcount to track
zf_stream linkage of the stream, since I believe previous zs_fetch ==
NULL check in dmu_zfetch_stream_done() was racy.

Delete prefetch streams when they reach ends of files.  It saves up
to 1KB of RAM per file, plus reduces searches through the stream list.

Block data prefetch (speculation and indirect block prefetch is still
done since they are cheaper) if all dbufs of the stream are already
in DMU cache.  First cache miss immediately fires all the prefetch
that would be done for the stream by that time.  It saves some CPU
time if same files within DMU cache capacity are read over and over.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #11652

3 years agoFix zfs_get_data access to files with wrong generation
Chunwei Chen [Sat, 20 Mar 2021 05:53:31 +0000 (22:53 -0700)]
Fix zfs_get_data access to files with wrong generation

If TX_WRITE is create on a file, and the file is later deleted and a new
directory is created on the same object id, it is possible that when
zil_commit happens, zfs_get_data will be called on the new directory.
This may result in panic as it tries to do range lock.

This patch fixes this issue by record the generation number during
zfs_log_write, so zfs_get_data can check if the object is valid.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #10593
Closes #11682

3 years agoFix regression in POSIX mode behavior
Andrew [Sat, 20 Mar 2021 05:50:46 +0000 (01:50 -0400)]
Fix regression in POSIX mode behavior

Commit 235a85657 introduced a regression in evaluation of POSIX modes
that require group DENY entries in the internal ZFS ACL. An example
of such a POSX mode is 007. When write_implies_delete_child is set,
then ACE_WRITE_DATA is added to `wanted_dirperms` in prior to calling
zfs_zaccess_common(). This occurs is zfs_zaccess_delete().

Unfortunately, when zfs_zaccess_aces_check hits this particular DENY
ACE, zfs_groupmember() is checked to determine whether access should be
denied, and since zfs_groupmember() always returns B_TRUE on Linux and
so this check is failed, resulting ultimately in EPERM being returned.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
Closes #11760

3 years agoZTS: New test for kernel panic induced by redacted send
Palash Gandhi [Sat, 20 Mar 2021 05:47:50 +0000 (22:47 -0700)]
ZTS: New test for kernel panic induced by redacted send

This change adds a new test that covers a bug fix in the binary search
in the redacted send resume logic that causes a kernel panic.
The bug was fixed in https://github.com/openzfs/zfs/pull/11297.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Palash Gandhi <palash.gandhi@delphix.com>
Closes #11764

3 years agoAllow setting bootfs property on pools with indirect vdevs
Martin Matuška [Sat, 20 Mar 2021 05:46:43 +0000 (06:46 +0100)]
Allow setting bootfs property on pools with indirect vdevs

The FreeBSD boot loader relies on the bootfs property and is capable
of booting from removed (indirect) vdevs.

Reviewed-by Eric van Gyzen
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Martin Matuska <mm@FreeBSD.org>
Closes #11763

3 years agoFix typo in zgenhostid.8
Ryan Moeller [Sat, 20 Mar 2021 05:39:42 +0000 (01:39 -0400)]
Fix typo in zgenhostid.8

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11770

3 years agoRemoving old code for k(un)map_atomic
Brian Atkinson [Sat, 20 Mar 2021 05:38:44 +0000 (23:38 -0600)]
Removing old code for k(un)map_atomic

It used to be required to pass a enum km_type to kmap_atomic() and
kunmap_atomic(), however this is no longer necessary and the wrappers
zfs_k(un)map_atomic removed these. This is confusing in the ABD code as
the struct abd_iter member iter_km no longer exists and the wrapper
macros simply compile them out.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam Moss <c@yotes.com>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #11768

3 years agoInitialize metaslab range trees in metaslab_init
Serapheim Dimitropoulos [Sat, 20 Mar 2021 05:36:02 +0000 (22:36 -0700)]
Initialize metaslab range trees in metaslab_init

= Motivation

We've noticed several zloop crashes within Delphix generated
due to the following sequence of events:

- A device gets expanded and new metaslabas are allocated for
  it. These metaslabs go through `metaslab_init()` but haven't
  gone through `metaslab_sync_done()` yet. This meas that the
  only range tree that's actually set is the `ms_allocatable`.
  All the others are NULL.

- A vdev_initialization is issues and `vdev_initialize_thread`
  starts processing one of these new metaslabs of the expanded
  vdev.

- As part of `vdev_initialize_calculate_progress()` we call
  into `metaslab_load()` and `metaslab_load_impl()` which
  in turn tries to dereference the metaslabs trees that
  are still NULL and therefore we crash.

The same failure can come up from the `vdev_trim` code paths.

= This Patch

We considered the following solutions to deal with this issue:

[A] Add logic to `vdev_initialize/trim` to skip those new
    metaslabs. We decided against this as it would be good
    to avoid exposing this lower-level detail to higer-level
    operations.

[B] Have `metaslab_load_impl()` return early for new metaslabs
    and thus never touch those range_trees that are NULL at
    that time. This seemed more of a work-around for the bug
    and not a clear-cut solution.

[C] Refactor our logic so all metaslabs have their range_trees
    created at the time of their creatin in `metaslab_init()`.

In this patch we decided to go with [C] because:

(1) It doesn't expose more metaslab details to higher level
    operations such as vdev initialize and trim.

(2) The current behavior of creating the range trees lazily
    in `metaslab_sync_done()` is unnecessarily complicated.

(3) Always initializing the metaslab range_trees makes other
    parts of the codebase cleaner. For example, we used to
    use `ms_freed` as the reference value for knowing whether
    all the range_trees have been initialized. Now we no
    longer need to do that check in most places (and in the
    few that we do we use the `ms_new` boolean field now
    which is more readable).

= Side Changes

Probably due to a mismerge we set `ms_loaded` to `B_TRUE` twice
in `metasloab_load_impl()`. In this patch we remove the extraneous
assignment.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #11737

3 years agoLinux 5.12 update: bio_max_segs() replaces BIO_MAX_PAGES
Coleman Kane [Sat, 20 Mar 2021 05:33:42 +0000 (01:33 -0400)]
Linux 5.12 update: bio_max_segs() replaces BIO_MAX_PAGES

The BIO_MAX_PAGES macro is being retired in favor of a bio_max_segs()
function that implements the typical MIN(x,y) logic used throughout the
kernel for bounding the allocation, and also the new implementation is
intended to be signed-safe (which the former was not).

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #11765

3 years agoLinux 5.12 compat: idmapped mounts
Coleman Kane [Sat, 20 Mar 2021 04:00:59 +0000 (00:00 -0400)]
Linux 5.12 compat: idmapped mounts

In Linux 5.12, the filesystem API was modified to support ipmapped
mounts by adding a "struct user_namespace *" parameter to a number
functions and VFS handlers. This change adds the needed autoconf
macros to detect the new interfaces and updates the code appropriately.
This change does not add support for idmapped mounts, instead it
preserves the existing behavior by passing the initial user namespace
where needed.  A subsequent commit will be required to add support
for idmapped mounted.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #11712

3 years agoClean up RAIDZ/DRAID ereport code
Matthew Ahrens [Fri, 19 Mar 2021 23:22:10 +0000 (16:22 -0700)]
Clean up RAIDZ/DRAID ereport code

The RAIDZ and DRAID code is responsible for reporting checksum errors on
their child vdevs.  Checksum errors represent events where a disk
returned data or parity that should have been correct, but was not.  In
other words, these are instances of silent data corruption.  The
checksum errors show up in the vdev stats (and thus `zpool status`'s
CKSUM column), and in the event log (`zpool events`).

Note, this is in contrast with the more common "noisy" errors where a
disk goes offline, in which case ZFS knows that the disk is bad and
doesn't try to read it, or the device returns an error on the requested
read or write operation.

RAIDZ/DRAID generate checksum errors via three code paths:

1. When RAIDZ/DRAID reconstructs a damaged block, checksum errors are
reported on any children whose data was not used during the
reconstruction.  This is handled in `raidz_reconstruct()`.  This is the
most common type of RAIDZ/DRAID checksum error.

2. When RAIDZ/DRAID is not able to reconstruct a damaged block, that
means that the data has been lost.  The zio fails and an error is
returned to the consumer (e.g. the read(2) system call).  This would
happen if, for example, three different disks in a RAIDZ2 group are
silently damaged.  Since the damage is silent, it isn't possible to know
which three disks are damaged, so a checksum error is reported against
every child that returned data or parity for this read.  (For DRAID,
typically only one "group" of children is involved in each io.)  This
case is handled in `vdev_raidz_cksum_finish()`. This is the next most
common type of RAIDZ/DRAID checksum error.

3. If RAIDZ/DRAID is not able to reconstruct a damaged block (like in
case 2), but there happens to be additional copies of this block due to
"ditto blocks" (i.e. multiple DVA's in this blkptr_t), and one of those
copies is good, then RAIDZ/DRAID compares each sector of the data or
parity that it retrieved with the good data from the other DVA, and if
they differ then it reports a checksum error on this child.  This
differs from case 2 in that the checksum error is reported on only the
subset of children that actually have bad data or parity.  This case
happens very rarely, since normally only metadata has ditto blocks.  If
the silent damage is extensive, there will be many instances of case 2,
and the pool will likely be unrecoverable.

The code for handling case 3 is considerably more complicated than the
other cases, for two reasons:

1. It needs to run after the main raidz read logic has completed.  The
data RAIDZ read needs to be preserved until after the alternate DVA has
been read, which necessitates refcounts and callbacks managed by the
non-raidz-specific zio layer.

2. It's nontrivial to map the sections of data read by RAIDZ to the
correct data.  For example, the correct data does not include the parity
information, so the parity must be recalculated based on the correct
data, and then compared to the parity that was read from the RAIDZ
children.

Due to the complexity of case 3, the rareness of hitting it, and the
minimal benefit it provides above case 2, this commit removes the code
for case 3.  These types of errors will now be handled the same as case
2, i.e. the checksum error will be reported against all children that
returned data or parity.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11735

3 years agoFreeBSD: make seqc asserts conditional on replay
Mateusz Guzik [Thu, 18 Mar 2021 05:09:45 +0000 (06:09 +0100)]
FreeBSD: make seqc asserts conditional on replay

Avoids tripping on asserts when doing pool recovery.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #11739

3 years agoRemove unused rr_code
Matthew Ahrens [Thu, 18 Mar 2021 04:57:09 +0000 (21:57 -0700)]
Remove unused rr_code

The `rr_code` field in `raidz_row_t` is unused.

This commit removes the field, as well as the code that's used to set
it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11736

3 years agoFreeBSD: Fix memory leaks in kstats
Ryan Moeller [Thu, 18 Mar 2021 04:55:18 +0000 (00:55 -0400)]
FreeBSD: Fix memory leaks in kstats

Don't handle (incorrectly) kmem_zalloc() failure.  With KM_SLEEP,
will never return NULL.

Free the data allocated for non-virtual kstats when deleting the object.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11767

3 years agoLinux: always check or verify return of igrab()
Adam D. Moss [Tue, 16 Mar 2021 23:33:34 +0000 (16:33 -0700)]
Linux: always check or verify return of igrab()

zhold() wraps igrab() on Linux, and igrab() may fail when the inode
is in the process of being deleted.  This means zhold() must only be
called when a reference exists and therefore it cannot be deleted.
This is the case for all existing consumers so add a VERIFY and a
comment explaining this requirement.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Adam Moss <c@yotes.com>
Closes #11704

3 years agoUpdate FreeBSD versions
Dries Michiels [Tue, 16 Mar 2021 22:03:28 +0000 (23:03 +0100)]
Update FreeBSD versions

Update supported FreeBSD versions in documentation.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Dries Michiels <driesm.michiels@gmail.com>
Closes #11718

3 years agoHold and release permissions exist
gldisater [Tue, 16 Mar 2021 22:01:21 +0000 (18:01 -0400)]
Hold and release permissions exist

The man page was missing these two permissions.
Add the missing permissions to the man page.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jeremy Faulkner <gldisater@gldis.ca>
Closes #11727

3 years agoZTS: Add tests for DOS mode attributes
Ryan Moeller [Tue, 16 Mar 2021 22:00:14 +0000 (18:00 -0400)]
ZTS: Add tests for DOS mode attributes

Create a new section of tests to run with acltype=off.

For now the only test we have is for the DOS mode READONLY attribute on
FreeBSD.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11734

3 years agoReference_tracking_enable should be a module param
Don Brady [Tue, 16 Mar 2021 21:56:17 +0000 (15:56 -0600)]
Reference_tracking_enable should be a module param

To make use of zfs_refcount_held tunable it should be a module
parameter in open-zfs.  Also, since the macros will auto-generate OS
specific tunables, removed the existing zfs_refcount_held reference
in module/os/freebsd/zfs/sysctl_os.c.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #11753

3 years agoZTS: Fix incorrect use of libtest in user_run by xattr_003_neg
Ryan Moeller [Mon, 9 Nov 2020 22:57:00 +0000 (17:57 -0500)]
ZTS: Fix incorrect use of libtest in user_run by xattr_003_neg

You can't use user_run to eval ksh functions defined in libtest unless
you include libtest in the user shell.

Fix xattr_003_neg by:
* include libtest in the user shell
* *then* run get_xattr
* assert this fails
* use variables for filenames so they don't change in the user's shell
* don't log the contents of /etc/passwd
* cleanup all byproducts

Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11185

3 years agoZTS: Use ksh and current environment for user_run
Ryan Moeller [Thu, 11 Mar 2021 20:01:58 +0000 (15:01 -0500)]
ZTS: Use ksh and current environment for user_run

The current user_run often does not work as expected.  Commands are run
in a different shell, with a different environment, and all output is
discarded.

Simplify user_run to retain the current environment, eliminate eval,
and feed the command string into ksh.  Enhance the logging for
user_run so we can see out and err.

Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11185

3 years agoFreeBSD: bring back possibility to rewind the checkpoint from bootloader
Mariusz Zaborski [Sat, 13 Mar 2021 00:12:14 +0000 (01:12 +0100)]
FreeBSD: bring back possibility to rewind the checkpoint from bootloader

Add parsing of the rewind options.

When I was upstreaming the change [1], I omitted the part where we
detect that the pool should be rewind. When the FreeBSD repo has
synced with the OpenZFS, this part of the code was removed.

[1] FreeBSD repo: 277f38abffc6a8160b5044128b5b2c620fbb970c
[2] OpenZFS repo: f2c027bd6a003ec5793f8716e6189c389c60f47a

External-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=254152
Originally reviewed by: tsoome, allanjude
Originally reviewed by: kevans (ok from high-level overview)
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mariusz Zaborski <oshogbo@vexillium.org>
Closes #11730

3 years agoFreeBSD: Clean up zfsdev_close to match Linux
Ryan Moeller [Sat, 13 Mar 2021 00:09:15 +0000 (19:09 -0500)]
FreeBSD: Clean up zfsdev_close to match Linux

Resolve some oddities in zfsdev_close() which could result in a
panic and were not present in the equivalent function for Linux.

- Remove unused definition ZFS_MIN_MINOR
- FreeBSD: Simplify zfsdev state destruction
- Assert zs_minor is valid in zfsdev_close
- Make locking around zfsdev state match Linux

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11720

3 years agoFreeBSD: switch teardown lock to rms
Mateusz Guzik [Wed, 4 Nov 2020 22:28:56 +0000 (17:28 -0500)]
FreeBSD: switch teardown lock to rms

This deserializes otherwise non-contending operations.

The previous scheme of using 17 locks hashed by curthread runs into
conflicts very quickly. Check the pull request for sample results.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #11153

3 years agoMacroify teardown lock handling
Mateusz Guzik [Wed, 4 Nov 2020 22:23:48 +0000 (17:23 -0500)]
Macroify teardown lock handling

This will allow platforms to implement it as they see fit, in particular
in a different manner than rrm locks.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #11153

3 years agoFreeBSD: rename teardown inactive macros to mimick rrm convention
Mateusz Guzik [Wed, 4 Nov 2020 22:19:35 +0000 (17:19 -0500)]
FreeBSD: rename teardown inactive macros to mimick rrm convention

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #11153

3 years agoFreeBSD: remove 2 assertions that teardown lock is not held
Mateusz Guzik [Thu, 12 Nov 2020 22:33:14 +0000 (17:33 -0500)]
FreeBSD: remove 2 assertions that teardown lock is not held

They are not very useful and hard to implement in the rms routine
the code is about to start using.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #11153

3 years agoFreeBSD: rework asserts in zfs_dd_lookup
Mateusz Guzik [Mon, 12 Oct 2020 21:27:59 +0000 (21:27 +0000)]
FreeBSD: rework asserts in zfs_dd_lookup

1. even up ifdefs
2. drop the arguably useless teardown lock asserts -- nothing else
   checks for it

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #11153

3 years agoAdd branch prediction to ZFS_ENTER and ZFS_VERIFY_ZP macros
Mateusz Guzik [Thu, 15 Oct 2020 05:45:28 +0000 (05:45 +0000)]
Add branch prediction to ZFS_ENTER and ZFS_VERIFY_ZP macros

They are expected to fail only in corner cases.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #11153

3 years agozpool import cachefile improvements
George Wilson [Fri, 12 Mar 2021 23:42:27 +0000 (17:42 -0600)]
zpool import cachefile improvements

Importing a pool using the cachefile is ideal to reduce the time
required to import a pool. However, if the devices associated with
a pool in the cachefile have changed, then the import would fail.
This can easily be corrected by doing a normal import which would
then read the pool configuration from the labels.

The goal of this change is make importing using a cachefile more
resilient and auto-correcting. This is accomplished by having
the cachefile import logic automatically fallback to reading the
labels of the devices similar to a normal import. The main difference
between the fallback logic and a normal import is that the cachefile
import logic will only look at the device directories that were
originally used when the cachefile was populated. Additionally,
the fallback logic will always import by guid to ensure that only
the pools in the cachefile would be imported.

External-issue: DLPX-71980
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <gwilson@delphix.com>
Closes #11716

3 years agoFix whitespace introduced in ecc277cff
Martin Matuška [Fri, 12 Mar 2021 03:42:04 +0000 (04:42 +0100)]
Fix whitespace introduced in ecc277cff

The manual page change in ecc277c has introduced whitespace on
line ends.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Martin Matuska <mm@FreeBSD.org>
Closes #11722

3 years agoFreeBSD: Fix scope of deadman tunables
Ryan Moeller [Fri, 12 Mar 2021 03:23:24 +0000 (22:23 -0500)]
FreeBSD: Fix scope of deadman tunables

A few deadman tunables ended up in the wrong sysctl node.

Move them to vfs.zfs.deadman.*

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11715

3 years agoMicrooptimizations for VERIFY() and friends
Adam D. Moss [Fri, 12 Mar 2021 01:16:09 +0000 (17:16 -0800)]
Microoptimizations for VERIFY() and friends

Add branch hints and constify the intermediate evaluations of
left/right params in VERIFY3*().

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Adam Moss <c@yotes.com>
Closes #11708

3 years agoAdd missing files to Makefile
Allan Jude [Fri, 12 Mar 2021 01:13:34 +0000 (20:13 -0500)]
Add missing files to Makefile

Some .h files that were added were missed in this Makefile. Since
they are .h files, their being missing only resulted in them
disappeared from the dist archive.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes #11705

3 years agoCI checkstyle: pin ubuntu version
George Melikov [Fri, 12 Mar 2021 01:11:31 +0000 (04:11 +0300)]
CI checkstyle: pin ubuntu version

Our checkstyle doesn't work well on Ubuntu 20.04,
temporary pin it to 18.04.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #11713

3 years agoReturn finer grain errors in libzfs unmount_one
Don Brady [Mon, 8 Mar 2021 16:46:45 +0000 (09:46 -0700)]
Return finer grain errors in libzfs unmount_one

Added errno mappings to unmount_one() in libzfs.  Changed do_unmount()
implementation to return errno errors directly like is done for
do_mount() and others.

Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #11681

3 years agovdev_id: Create symlinks even if no /dev/mapper/
Tony Hutter [Mon, 8 Mar 2021 16:43:30 +0000 (08:43 -0800)]
vdev_id: Create symlinks even if no /dev/mapper/

vdev_id uses the /dev/mapper/ symlinks to resolve a UUID to a dm name
(like dm-1).  However on some multipath setups, there is no /dev/mapper/
entry for the UUID at the time vdev_id is called by udev.  However,
this isn't necessarily needed, as we may be able to resolve the dm
name from the $DEVNAME that udev passes us (like DEVNAME="/dev/dm-1").

This patch tries to resolve the dm name from $DEVNAME first, before
falling back to looking in /dev/mapper/.  This fixed an issue where the
by-vdev names weren't reliably showing up on one of our nodes.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #11698

3 years agoZTS events_002: Improve speed and reliability
Antonio Russo [Mon, 8 Mar 2021 16:42:45 +0000 (09:42 -0700)]
ZTS events_002: Improve speed and reliability

events_002 exercises the ZED, ensuring that it neither misses events,
nor reporting events twice.

On slow test hardware, some of the timeouts are insufficient to allow
the ZED to properly settle.  Conversely, on fast hardware these same
timeouts are too long, unnecessarily slowing the test run.

Instead of using a fixed timeout, wait for the expected final event
before returning.  Additionally, wait with a timeout for unexpected
events to avoid missing them if they show up late.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Antonio Russo <aerusso@aerusso.net>
Closes #11703

3 years agozvol: call zil_replaying() during replay
Christian Schwarz [Sun, 7 Mar 2021 17:49:58 +0000 (18:49 +0100)]
zvol: call zil_replaying() during replay

zil_replaying(zil, tx) has the side-effect of informing the ZIL that an
entry has been replayed in the (still open) tx.  The ZIL uses that
information to record the replay progress in the ZIL header when that
tx's txg syncs.

ZPL log entries are not idempotent and logically dependent and thus
calling zil_replaying() is necessary for correctness.

For ZVOLs the question of correctness is more nuanced: ZVOL logs only
TX_WRITE and TX_TRUNCATE, both of which are idempotent. Logical
dependencies between two records exist only if the write or discard
request had sync semantics or if the ranges affected by the records
overlap.

Thus, at a first glance, it would be correct to restart replay from
the beginning if we crash before replay completes. But this does not
address the following scenario:
Assume one log record per LWB.
The chain on disk is

    HDR -> 1:W(1, "A") -> 2:W(1, "B") -> 3:W(2, "X") -> 4:W(3, "Z")

where N:W(O, C) represents log entry number N which is a TX_WRITE of C
to offset A.
We replay 1, 2 and 3 in one txg, sync that txg, then crash.
Bit flips corrupt 2, 3, and 4.
We come up again and restart replay from the beginning because
we did not call zil_replaying() during replay.
We replay 1 again, then interpret 2's invalid checksum as the end
of the ZIL chain and call replay done.
The replayed zvol content is "AX".

If we had called zil_replaying() the HDR would have pointed to 3
and our resumed replay would not have replayed anything because
3 was corrupted, resulting in zvol content "BX".

If 3 logically depends on 2 then the replay corrupted the ZVOL_OBJ's
contents.

This patch adds the zil_replaying() calls to the replay functions.
Since the callbacks in the replay function need the zilog_t* pointer
so that they can call zil_replaying() we open the ZIL while
replaying in zvol_create_minor(). We also verify that replay has
been done when on-demand-opening the ZIL on the first modifying
bio.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #11667

3 years agoZTS: Improve cleanup in zpool tests
Ryan Moeller [Sun, 7 Mar 2021 17:41:01 +0000 (12:41 -0500)]
ZTS: Improve cleanup in zpool tests

* Restore original kern.corefile value after the test.
* Don't leave behind a frozen pool.
* Clean up leftover vdev files.
* Make zpool_002_pos and zpool_003_pos consistent in their handling of
core files while here.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11694

3 years agoClarify compressed zfs send/recv behavior
manfromafar [Sun, 7 Mar 2021 17:39:16 +0000 (10:39 -0700)]
Clarify compressed zfs send/recv behavior

Docs for send and receive do not explain behavior when sending a
compressed stream then receiving on a host that overrides compression
with -o compress=value.

The data from the send stream is written as it was from the send is
the compressed form but the compression algorithm set on the receiver
is the overridden version which causes some confusion as to what
algorithm was actually used.

Updated man docs to clarify behavior

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed By: Allan Jude <allanjude@freebsd.org>
Signed-off-by: manfromafar <manfromafar@outlook.com>
Closes #11690

3 years agoIntentionally allow ZFS_READONLY in zfs_write
Ryan Moeller [Sun, 7 Mar 2021 17:31:52 +0000 (12:31 -0500)]
Intentionally allow ZFS_READONLY in zfs_write

ZFS_READONLY represents the "DOS R/O" attribute.
When that flag is set, we should behave as if write access
were not granted by anything in the ACL.  In particular:
We _must_ allow writes after opening the file r/w, then
setting the DOS R/O attribute, and writing some more.
(Similar to how you can write after fchmod(fd, 0444).)

Restore these semantics which were lost on FreeBSD when refactoring
zfs_write.  To my knowledge Linux does not actually expose this flag,
but we'll need it to eventually so I've added the supporting checks.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11693

3 years agoSuppress cppcheck invalidSyntax warninigs
Brian Behlendorf [Sat, 6 Mar 2021 01:56:35 +0000 (17:56 -0800)]
Suppress cppcheck invalidSyntax warninigs

For some reason cppcheck 1.90 is generating an invalidSyntax warning
when the BF64_SET macro is used in the zstream source.  The same
warning is not reported by cppcheck 2.3, nor is their any evident
problem with the expanded macro.  This appears to be an issue with
this version of cppcheck.  This commit annotates the source to suppress
the warning.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #11700

3 years agoInitialize ZIL buffers
Brian Behlendorf [Fri, 5 Mar 2021 22:45:13 +0000 (14:45 -0800)]
Initialize ZIL buffers

When populating a ZIL destination buffer ensure it is always
zeroed before its contents are constructed.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tom Caputi <caputit1@tcnj.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #11687

3 years agoFix abd_get_offset_struct() may allocate new abd
Jorgen Lundman [Fri, 5 Mar 2021 20:22:57 +0000 (05:22 +0900)]
Fix abd_get_offset_struct() may allocate new abd

Even when supplied with an abd to abd_get_offset_struct(), the call
to abd_get_offset_impl() can allocate a different abd. Ensure to
call abd_fini_struct() on the abd that is not used.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #11683

3 years agoFreeBSD module --enable-debug --enable-invariants
Ryan Moeller [Fri, 5 Mar 2021 20:16:41 +0000 (15:16 -0500)]
FreeBSD module --enable-debug --enable-invariants

Wire up the --enable-debug flag for configure to the FreeBSD module
build.  Add --enable-invariants.

The running FreeBSD kernel config is used to detect whether to enable
INVARIANTS if not explicitly specified with --enable-invariants or
--disable-invariants.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11678

3 years agozpool: use tab to intend continuation from removal status
Thomas Lamprecht [Fri, 5 Mar 2021 20:15:35 +0000 (21:15 +0100)]
zpool: use tab to intend continuation from removal status

Bring the output of the removal status in line with the other
"fields" that zpool status outputs, and thus allows an parser to
easier detect this as continuation of the 'remove:' output.

Before:
remove: Removal of vdev 0 copied 282G in 0h9m, completed on [...]
    776K memory used for removed device mappings

Now:
remove: Removal of vdev 0 copied 282G in 0h9m, completed on [...]
776K memory used for removed device mappings

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Closes #11674

3 years agoDon't bomb out when using keylocation=file://
James Wah [Wed, 3 Mar 2021 16:28:49 +0000 (03:28 +1100)]
Don't bomb out when using keylocation=file://

Avoid following the error path when the operation in fact succeeded.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: James Wah <james@laird-wah.net>
Closes #11651

3 years agolinux: zvol: avoid heap allocation for zvol_request_sync=1
Christian Schwarz [Wed, 3 Mar 2021 16:15:28 +0000 (17:15 +0100)]
linux: zvol: avoid heap allocation for zvol_request_sync=1

The spl_kmem_alloc showed up in some flamegraphs in a single-threaded
4k sync write workload at 85k IOPS on an
Intel(R) Xeon(R) Silver 4215 CPU @ 2.50GHz.
Certainly not a huge win but I believe the change is clean and
easy to maintain down the road.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #11666

3 years agoAdd "zstd-fast" to help options for "compression" property
Jake Howard [Wed, 3 Mar 2021 16:14:19 +0000 (16:14 +0000)]
Add "zstd-fast" to help options for "compression" property

This value does work as expected, and is documented in the manpage.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jake Howard <git@theorangeone.net>
Closes #11670

3 years agoCancel TRIM / initialize on FAULTED non-writeable vdevs
nssrikanth [Tue, 2 Mar 2021 18:27:27 +0000 (23:57 +0530)]
Cancel TRIM / initialize on FAULTED non-writeable vdevs

When a device which is actively trimming or initializing becomes
FAULTED, and therefore no longer writable, cancel the active
TRIM or initialization.  When the device is merely taken offline
with `zpool offline` then stop the operation but do not cancel it.
When the device is brought back online the operation will be
resumed if possible.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Vipin Kumar Verma <vipin.verma@hpe.com>
Signed-off-by: Srikanth N S <srikanth.nagasubbaraoseetharaman@hpe.com>
Closes #11588

3 years agoFix assert in FreeBSD-specific dmu_read_pages
Andriy Gapon [Sun, 28 Feb 2021 01:23:09 +0000 (03:23 +0200)]
Fix assert in FreeBSD-specific dmu_read_pages

The function has three similar pieces of code: for read-behind pages,
requested pages and read-ahead pages.  All three pieces had an
assert to ensure that the page is not mapped.  Later the assert was
relaxed to require that the page is not mapped for writing.  But that
was done in two places out of three.  This change fixes the third piece,
read-ahead.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andriy Gapon <avg@FreeBSD.org>
Closes #11654

3 years agoZTS: zpool_trim_start_and_cancel_pos.ksh
Brian Behlendorf [Sun, 28 Feb 2021 01:19:50 +0000 (17:19 -0800)]
ZTS: zpool_trim_start_and_cancel_pos.ksh

Several of the TRIM tests were based of the initialize tests and
then adapted for TRIM.  The zpool_trim_start_and_cancel_pos.ksh
test was intended to be one such test but it was overlooked and
actually never adapted.  Update it accordingly.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #11649

3 years agoAdd missing checks for unsupported features
Martin Matuška [Sun, 28 Feb 2021 01:16:02 +0000 (02:16 +0100)]
Add missing checks for unsupported features

After 35ec517 it has become possible to import ZFS pools witn an
active org.illumos:edonr feature on FreeBSD, leading to a panic.

In addition, "zpool status" reported all pools without edonr
as upgradable and "zpool upgrade -v" reported edonr in the list
of upgradable features.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Martin Matuska <mm@FreeBSD.org>
Closes #11653

3 years agoLinux 5.12 compat: replace bio_*_io_acct with disk_*_io_acct
Coleman Kane [Tue, 23 Feb 2021 02:18:41 +0000 (21:18 -0500)]
Linux 5.12 compat: replace bio_*_io_acct with disk_*_io_acct

The bio_*_acct functions became GPL exports, which causes the
kernel modules to refuse to compile. This replaces code with
alternate function calls to the disk_*_io_acct interfaces, which
are not GPL exports. This change was added in kernel commit
99dfc43ecbf67f12a06512918aaba61d55863efc.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #11639

3 years agoLinux 5.12 compat: bio->bi_disk member moved
Coleman Kane [Tue, 23 Feb 2021 02:07:51 +0000 (21:07 -0500)]
Linux 5.12 compat: bio->bi_disk member moved

The struct bio member bi_disk was moved underneath a new member named
bi_bdev. So all attempts to reference bio->bi_disk need to now become
bio->bi_bdev->bd_disk.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #11639

3 years agoFix vdev_rebuild_thread deadlock
Brian Behlendorf [Wed, 24 Feb 2021 18:01:00 +0000 (10:01 -0800)]
Fix vdev_rebuild_thread deadlock

The metaslab_disable() call may block waiting for a txg sync.
Therefore it's important that vdev_rebuild_thread release the
SCL_CONFIG read lock it is holding before this call.  Failure
to do so can result in the txg_sync thread getting blocked
waiting for this lock which results in a deadlock.

Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewd-by: Srikanth N S <srikanth.nagasubbaraoseetharaman@hpe.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #11647

3 years agoFix overly broad locking in spa_vdev_config_exit()
Brian Behlendorf [Wed, 24 Feb 2021 18:00:21 +0000 (10:00 -0800)]
Fix overly broad locking in spa_vdev_config_exit()

Calling vdev_free() only requires the we acquire the spa config
SCL_STATE_ALL locks, not the SCL_ALL locks.  In particular, we need
need to avoid taking the SCL_CONFIG lock (included in SCL_ALL) as a
writer since this can lead to a deadlock.  The txg_sync_thread() may
block in spa_txg_history_init_io() when taking the SCL_CONFIG lock
as a reading when it detects there's a pending writer.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #11585

3 years agovdev_id: Fix partition regular expression
Tony Hutter [Wed, 24 Feb 2021 17:58:46 +0000 (09:58 -0800)]
vdev_id: Fix partition regular expression

Given a DM device name, the old vdev_id script would extract any text
after a 'p' as the partition number.  It then appends "-part" + the
partition number to the name, giving a by-vdev name like "L0-part5".

This works fine if the DM name is like 'dm-2p5', but doesn't work if
the DM name is a multipath name like "mpatha".  In those cases it
incorrectly matches the 'p' in "mpatha", giving by-vdev names like
"L0-partatha".

This patch fixes the issue by making the partition regex match stricter.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #11637

3 years agoLinux: increase max nvlist_src size
Brian Behlendorf [Wed, 24 Feb 2021 17:57:18 +0000 (09:57 -0800)]
Linux: increase max nvlist_src size

On Linux increase the maximum allowed size of the src nvlist which
can be passed to the /dev/zfs ioctl.  Originally, this was set
to a maximum of KMALLOC_MAX_SIZE (4M) because it was kmalloc'd.
Since that time it's been converted to a vmalloc so that's no
longer a hard limit, and it's desirable for `zfs send/recv` to
allow larger nvlists so more snapshots can be sent at once.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6572
Closes #11638

3 years agoAdd upper bound for slop space calculation
Prakash Surya [Wed, 24 Feb 2021 17:52:43 +0000 (09:52 -0800)]
Add upper bound for slop space calculation

This change modifies the behavior of how we determine how much slop
space to use in the pool, such that now it has an upper limit. The
default upper limit is 128G, but is configurable via a tunable.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Closes #11023

3 years agoWrap bare EINVAL returns with SET_ERROR
Ryan Moeller [Wed, 24 Feb 2021 17:51:10 +0000 (12:51 -0500)]
Wrap bare EINVAL returns with SET_ERROR

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11636

3 years agoForce symlink creation for zpool.d compat links
Ryan Moeller [Wed, 24 Feb 2021 17:49:59 +0000 (12:49 -0500)]
Force symlink creation for zpool.d compat links

gmake install fails when zpool.d compat links already exist.

Force the symlinks to be recreated if already present.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11633

3 years agosend_iterate_snap : doall send without fromsnap
Cedric Maunoury [Wed, 24 Feb 2021 17:48:58 +0000 (18:48 +0100)]
send_iterate_snap : doall send without fromsnap

The behavior of a NULL fromsnap was inadvertently changed for a doall
send when the send/recv logic in libzfs was updated.  Restore the
previous behavior by correcting send_iterate_snap() to include all
the snapshots in the nvlist for this case.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Cedric Maunoury <cedric.maunoury@gmail.com>
Closes #11608

3 years agoFix error message when zfs module are already unloaded
Adam D. Moss [Sun, 21 Feb 2021 04:23:10 +0000 (20:23 -0800)]
Fix error message when zfs module are already unloaded

Using zfs-sh -u on linux will fail with inaccurate message when the
zfs modules are already unloaded.  Deal with the case where a module
is already unloaded; its USE_COUNT will be the empty string

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Adam Moss <c@yotes.com>
Closes #11627

3 years agovdev_ops: don't try to call vdev_op_hold or vdev_op_rele when NULL
fbynite [Sun, 21 Feb 2021 04:19:20 +0000 (19:19 -0900)]
vdev_ops: don't try to call vdev_op_hold or vdev_op_rele when NULL

This prevents a panic after a SLOG add/removal on the root pool followed
by a zpool scrub.

When a SLOG is removed, a hole takes its place - the vdev_ops for a hole
is vdev_hole_ops, which defines the handler functions of vdev_op_hold
and vdev_op_rele as NULL.

This bug has been reported in illumos and FreeBSD, a different trigger
in the FreeBSD report though.

Credit for this patch goes to Patrick Mooney <pmooney@pfmooney.com>

Obtained from: illumos-gate commit: c65bd18728f34725
External-issue: https://www.illumos.org/issues/12981
External-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=252396
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Wing <rob.fx907@gmail.com>
Closes #11623

3 years agoBetter zfs_get_enclosure_sysfs_path() enclosure support
Tony Hutter [Sun, 21 Feb 2021 04:17:45 +0000 (20:17 -0800)]
Better zfs_get_enclosure_sysfs_path() enclosure support

A multpathed disk will have several 'underlying' paths to the disk.  For
example, multipath disk 'dm-0' may be made up of paths:
/dev/{sda,sdb,sdc,sdd}.  On many enclosures those underlying sysfs
paths will have a symlink back to their enclosure device entry
(like 'enclosure_device0/slot1').  This is used by the
statechange-led.sh script to set/clear the fault LED for a disk, and
by 'zpool status -c'.

However, on some enclosures, those underlying paths may not all have
symlinks back to the enclosure device.  Maybe only two out of four
of them might.

This patch updates zfs_get_enclosure_sysfs_path() to favor returning
paths that have symlinks back to their enclosure devices, rather
than just returning the first path.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #11617

3 years agoCleaning up uio headers
Brian Atkinson [Sun, 21 Feb 2021 04:16:50 +0000 (21:16 -0700)]
Cleaning up uio headers

Making uio_impl.h the common header interface between Linux and FreeBSD
so both OS's can share a common header file. This also helps reduce code
duplication for zfs_uio_t for each OS.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #11622

3 years agoztest: propagate -o to the zdb child process
Christian Schwarz [Thu, 18 Feb 2021 11:20:09 +0000 (12:20 +0100)]
ztest: propagate -o to the zdb child process

I think this is the behavior that most users expect.

Future work: have a separate flag, e.g., -O, to specify separate
set_global_vars for the zdb child than for the ztest children.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #11602

3 years agoztest: fix -o by calling set_global_var in child processes
Christian Schwarz [Tue, 16 Feb 2021 10:14:44 +0000 (11:14 +0100)]
ztest: fix -o by calling set_global_var in child processes

Without set_global_var() in the child processes the -o option provides
little use.

Before this change set_global_var() was called as a side-effect of
getopt processing which only happens for the parent ztest process.

This change limits the set of options that can be set and makes them
available to the child through ztest_shared_opts_t.

Future work: support arbitrary option count and length.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #11602

3 years agolibzpool: set_global_var: refactor to not modify 'arg'
Christian Schwarz [Tue, 16 Feb 2021 11:27:48 +0000 (12:27 +0100)]
libzpool: set_global_var: refactor to not modify 'arg'

Also fixes leak of the dlopen handle in the error case.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #11602