Rich Ercolani [Fri, 4 Jun 2021 21:00:39 +0000 (17:00 -0400)]
Let zfs diff be more permissive
In the current world, `zfs diff` will die on certain kinds of errors
that come up on ordinary, not-mangled filesystems - like EINVAL,
which can come from a file with multiple hardlinks having the one
whose name is referenced deleted.
Since it should always be safe to continue, let's relax about all
error codes - still print something for most, but don't immediately
abort when we encounter them.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #12072
jharmening [Fri, 4 Jun 2021 20:11:08 +0000 (13:11 -0700)]
FreeBSD: incorporate changes to the VFS_QUOTACTL(9) KPI
VFS_QUOTACTL(9) has been updated to allow each filesystem to indicate
whether it has changed the busy state of the mount. The filesystem
may still assume that its .vfs_quotactl entrypoint is always called
with the mount busied, but only needs to unbusy the mount (and clear
*mp_busy) if it does something that actually requires the mount to be
unbusied. It no longer needs to blindly copy-paste the UFS protocol
for calling vfs_unbusy(9) for the Q_QUOTAOFF and Q_QUOTAON commands.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Jason Harmening <jason.harmening@gmail.com>
Closes #12052
Ryan Moeller [Fri, 4 Jun 2021 19:53:44 +0000 (15:53 -0400)]
Fix error check in nvlist_print_json_string
Move check for errors from mbrtowc() into the loop. The error values
are not actually negative, so we don't break out of the loop when they
are encountered.
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #12175
Closes #12176
Linux: Set spl_kmem_cache_slab_limit when page size !4K
For small objects the kernel's slab implementation is very fast and
space efficient. However, as the allocation size increases to
require multiple pages performance suffers. The SPL kmem cache
allocator was designed to better handle these large allocation
sizes. Therefore, on Linux the kmem_cache_* compatibility wrappers
prefer to use the kernel's slab allocator for small objects and
the custom SPL kmem cache allocator for larger objects.
This logic was effectively disabled for all architectures using
a non-4K page size which caused all kmem caches to only use the
SPL implementation. Functionally this is fine, but the SPL code
which calculates the target number of objects per-slab does not
take in to account that __vmalloc() always returns page-aligned
memory. This can result in a massive amount of wasted space when
allocating tiny objects on a platform using large pages (64k).
To resolve this issue we set the spl_kmem_cache_slab_limit cutoff
to 16K for all architectures.
This particular change does not attempt to update the logic used
to calculate the optimal number of pages per slab. This remains
an issue which should be addressed in a future change.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #12152
Closes #11429
Closes #11574
Closes #12150
наб [Sat, 15 May 2021 09:53:14 +0000 (11:53 +0200)]
libzfs: convert to -fvisibility=hidden
Also mark all printf-like funxions in libzfs_impl.h as printf-like
and add --no-show-locs to storeabi, in hopes diffs will make more sense
in future
This removes these symbols from libzfs:
D nfs_only
T SHA256Init
T SHA2Final
T SHA2Init
T SHA2Update
T SHA384Init
T SHA512Init
D share_all_proto
D smb_only
T zfs_is_shared_proto
W zpool_mount_datasets
W zpool_unmount_datasets
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #12048
Colm [Thu, 3 Jun 2021 15:13:42 +0000 (16:13 +0100)]
A couple of small style cleanups
In `zpool_load_compat()`:
* initialize `l_features[]` with a loop rather than a static
initializer.
* don't redefine system constants; use private names instead
Rationale here:
When an array is initialized using a static {foo}, only the specified
members are initialized to the provided values, the rest are
initialized to zero. While B_FALSE is of course zero, it feels
unsafe to rely on this being true forever, so I'm inclined to sacrifice
a few microseconds of runtime here and initialize using a loop.
When looking for the correct combination of system constants to use
(in open() and mmap()), I prefer to use private constants rather than
redefining system ones; due to the small chance that the system
ones might be referenced later in the file. So rather than defining
O_PATH and MAP_POPULATE, I use distinct constant names.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Colm Buckley <colm@tuatha.org>
Closes #12156
zfs_arc_overflow_shift was never a parameter: ca0bf58d65f77e944b9905571df9a2eae647aeca ("Illumos 5497 - lock
contention on arcs_mtx") is the only result in
git log -Soverflow_shift, and it wasn't exposed then, nor is it now
zfs_read_chunk_size was renamed to zfs_vnops_read_chunk_size in e53d678d4ad596a310d51dab107bb6fa97e2b226 ("Share zfs_fsync, zfs_read,
zfs_write, et al between Linux and FreeBSD")
zio_decompress_fail_fraction was never a parameter: it was added in c3bd3fb4ac49705819666055ff1206a9fa3d1b9e ("OpenZFS 9403 - assertion
failed in arc_buf_destroy()") as a developer aid for setting in zdb, but
it's a dangerous test tunable and has no place in public documentation,
(not to mention that it obviously doesn't work):
> Although this did uncover a few low priority issues, this
unfortuantely also causes ztest to ASSERT in many locations where the
code is working correctly since it is designed to fail on IO errors.
Developers can manually set this variable with the '-o' option to find
and debug issues.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #12157
Rich Ercolani [Tue, 1 Jun 2021 21:20:50 +0000 (17:20 -0400)]
Added another missed case to arc_summary3
It turns out that sometimes, evidently only when run inside the
ZTS handler, arc_summary3 | head > /dev/null will die with ENOTCONN,
and ruin the test run.
Added handling for that.
Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #12160
Ryan Moeller [Tue, 1 Jun 2021 21:13:26 +0000 (17:13 -0400)]
libzfs_core: Fix some style violations
Made function names start on a new line. Added a blank line between
functions. This helps when grepping for functions.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #12137
grembo [Tue, 1 Jun 2021 21:03:49 +0000 (23:03 +0200)]
FreeBSD boot code reminder after zpool upgrade
There used to be a warning after upgrading a zpool in FreeBSD, so users
won't forget to update the boot loader that pool is booted from.
This change brings this warning back, but only if the bootfs property
is set on the pool, which should be sufficient for the vast majority of
FreeBSD installations. People running something custom are most likely
aware of what to do after an upgrade in their specific environment.
Functionality is implemented in an OS specific helper function.
Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Co-authored-by: Michael Gmelin <grembo@FreeBSD.org> Signed-off-by: Michael Gmelin <grembo@FreeBSD.org>
Closes #12099
Closes #12104
Rich Ercolani [Tue, 1 Jun 2021 18:58:08 +0000 (14:58 -0400)]
Remove iov_iter_advance() for iter_write
The additional iter advance is incorrect, as copy_from_iter() has
already done the right thing. This will result in the following
warning being printed to the console as of the 5.12 kernel.
Attempted to advance past end of bvec iter
This change should have been included with #11378 when a
similar change was made on the read side.
Manoj Joseph [Fri, 28 May 2021 22:06:07 +0000 (15:06 -0700)]
long options for ztest
This change introduces long options for ztest. It builds the usage
message as well as the long_options array from a single table. It also
adds #defines for the default values.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Manoj Joseph <manoj.joseph@delphix.com>
Closes #12117
Paul Dagnelie [Fri, 28 May 2021 21:08:41 +0000 (14:08 -0700)]
Don't direct to freenode in issue template
While Libera doesn't yet have a webchat client, we should at least
direct them to the right network. Once a webchat client is available,
we can direct them to it.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #12127
Rich Ercolani [Fri, 28 May 2021 05:39:53 +0000 (01:39 -0400)]
Minor fix to configure on s390x
configure on s390x has a key check fail with an error about
a variable being used uninitialized. So let's initialize it.
Reviewed-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #12126
Alexander Motin [Thu, 27 May 2021 20:27:29 +0000 (16:27 -0400)]
Introduce write-mostly sums
wmsum counters are a reduced version of aggsum counters, optimized for
write-mostly scenarios. They do not provide optimized read functions,
but instead allow much cheaper add function. The primary usage is
infrequently read statistic counters, not requiring exact precision.
The Linux implementation is directly mapped into percpu_counter KPI.
The FreeBSD implementation is directly mapped into counter(9) KPI.
In user-space due to lack of better implementation mapped to aggsum.
Unfortunately neither Linux percpu_counter nor FreeBSD counter(9)
provide sufficient functionality to completelly replace aggsum, so
it still remains to be used for several hot counters.
Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #12114
Alexander Motin [Thu, 27 May 2021 16:11:39 +0000 (12:11 -0400)]
Improve scrub maxinflight_bytes math.
Previously, ZFS scaled maxinflight_bytes based on total number of
disks in the pool. A 3-wide mirror was receiving a queue depth of 3
disks, which it should not, since it reads from all the disks inside.
For wide raidz the situation was slightly better, but still a 3-wide
raidz1 received a depth of 3 disks instead of 2.
The new code counts only unique data disks, i.e. 1 disk for mirrors
and non-parity disks for raidz/draid. For draid the math is still
imperfect, since vdev_get_nparity() returns number of parity disks
per group, not per vdev, but still some better than it was.
This should slightly reduce scrub influence on payload for some pool
topologies by avoiding excessive queuing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closing #12046
A plain rewrite of the shell version, and generates identical
units, save for replacing some empty lines with nothing, having fewer
meaningless spaces in After=s and different spacing in the lock scripts,
for a clean git diff -w
This is a gain of anywhere from 0m0.336s vs 0m0.022s (15.27x)
to 0m0.202s vs 0m0.006s (33.67x), depending on the hardware,
a.k.a. from "absolutely unusable" to "perfectly fine"
This also properly deals with canmount=noauto units across multiple
pools
See PR for detailed timings (of an early version) and diffs
Reviewed-by: Antonio Russo <aerusso@aerusso.net> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: InsanePrawn <insane.prawny@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Issue #11915
Closes #11917
Alexander Motin [Wed, 26 May 2021 18:14:26 +0000 (14:14 -0400)]
FreeBSD: Update dataset_kstats for zvols in dev mode
Previous commit added accounting for geom mode, but not for dev.
In geom mode we actually have GEOM statistics, while in dev mode
additional accounting actually makes more sense.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #12097
Alexander Motin [Wed, 26 May 2021 01:38:34 +0000 (21:38 -0400)]
FreeBSD: avoid memory allocation in arc_prune_async
Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #12049
Rich Ercolani [Wed, 26 May 2021 01:08:27 +0000 (21:08 -0400)]
Add note for printing all dbgmsg entries on FreeBSD
I looked for a bit, and couldn't find any documentation on
how to print all logged dbgmsg entries, just messages since
the DTrace probe started, until @allanjude kindly pointed me
toward the sysctl.
So let's add that note where the DTrace probe is mentioned for
FreeBSD, so other people can find it.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #12113
vermavipinkumar [Tue, 25 May 2021 18:32:07 +0000 (00:02 +0530)]
Propagate vdev state due to invalid label corruption
Propagate vdev child state to parents on invalid label
Add VDEV_AUX_BAD_LABEL to print_import_config()
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Co-authored-by: Srikanth N S <srikanth.nagasubbaraoseetharaman@hpe.com> Signed-off-by: Vipin Kumar Verma <vipin.verma@hpe.com>
Closes #12088
Toomas Soome [Tue, 25 May 2021 17:33:18 +0000 (20:33 +0300)]
zdb: dump_history needs space
One space is missing from zdb -h output causing strings to be concatenated. (fixing #11940)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Toomas Soome <tsoome@me.com>
Closes #12098
verify_slog_support no longer applies to ZFS since slog support
is always available.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #12092
Alexander Motin [Mon, 24 May 2021 20:42:45 +0000 (16:42 -0400)]
FreeBSD: Retry OCF ENOMEM errors.
ZFS does not expect transient errors from crypto. For read they are
counted as checksum errors, while for write end up in panic. To not
panic on random low memory conditions retry ENOMEM errors in the OCF
wrapper function.
While there remove unneeded timeout and priority from msleep().
External-issue: https://reviews.freebsd.org/D30339 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #12077
наб [Sat, 8 May 2021 11:17:04 +0000 (13:17 +0200)]
Don't abuse vfork()
According to POSIX.1, "vfork() has the same effect as fork(2),
except that the behavior is undefined if the process created by vfork()
either modifies any data other than a variable of type pid_t
used to store the return value from vfork(), [...],
or calls any other function before successfully calling _exit(2)
or one of the exec(3) family of functions."
These do all three, and work by pure chance
(or maybe they don't, but we blisfully don't know).
Either way: bad idea to call vfork() from C,
unless you're the standard library, and POSIX.1-2008 removes it entirely
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #12015
наб [Wed, 19 May 2021 11:56:24 +0000 (13:56 +0200)]
libzfs: run_process: set O_NONBLOCK on lines pipe
Without this, we can deadlock: the child is stuck writing to the pipe,
and we are stuck waiting on the child
With this, we the child fills up the pipe (a few hundred kBish)
and starts getting EAGAINs, which allows it to either crash
or ignore them
libzfs_run_process_get_stdout*() is used only by zpool -c scripts,
which output short runs of K=V pairs, so the likelihood of losing
legitimate data there is relatively low
Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #12082
This change addresses two distinct scenarios which are possible
when performing a sequential resilver to a dRAID pool with vdevs
that contain silent unknown damage. Which in this circumstance
took the form of the devices being intentionally overwritten with
zeros. However, it could also result from a device returning incorrect
data while a sequential resilver was in progress.
Scenario 1) A sequential resilver is performed while all of the
dRAID vdevs are ONLINE and there is silent damage present on the
vdev being resilvered. In this case, nothing will be repaired
by vdev_raidz_io_done_reconstruct_known_missing() because
rc->rc_error isn't set on any of the raid columns. To address
this vdev_draid_io_start_read() has been updated to always mark
the resilvering column as ESTALE for sequential resilver IO.
Scenario 2) Multiple columns contain silent damage for the same
block and a sequential resilver is performed. In this case it's
impossible to generate the correct data from parity unless all of
the damaged columns are being sequentially resilvered (and thus
only good data is used to generate parity). This is as expected
and there's nothing which can be done about it. However, we need
to be careful not to make to situation worse. Since we can't
verify the data is actually good without a checksum, we must
only repair the devices which are being sequentially resilvered.
Otherwise, an incorrect repair to a device which previously
contained good data could effectively lock in the damage and
make reconstruction impossible. A check for this was added to
vdev_raidz_io_done_verified() along with a new test case.
Lastly, this change updates the redundancy_draid_spare1 and
redundancy_draid_spare3 test cases to be more representative
of normal dRAID replacement operation. Specifically, what we
care about is that the scrub run after a sequential resilver
does not find additional blocks which need repair. This would
indicate the sequential resilver failed to rebuild a section of
one of the devices. Note also the tests were switched to using
the verify_pool() function which still checks for checksum errors.
Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #12061
Lauri Tirkkonen [Thu, 20 May 2021 16:03:03 +0000 (19:03 +0300)]
zfs-allow.8: mention 'bookmark' permission
Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Lauri Tirkkonen <lauri@hacktheplanet.fi>
Closes #12064
Alexander Motin [Fri, 14 May 2021 16:13:53 +0000 (12:13 -0400)]
Scale worker threads and taskqs with number of CPUs
While use of dynamic taskqs allows to reduce number of idle threads,
hardcoded 8 taskqs of each kind is a big overkill for small systems,
complicating CPU scheduling, increasing I/O reorder, etc, while
providing no real locking benefits, just not needed there.
On another side, 12*8 worker threads per kind are able to overload
almost any system nowadays. For example, pool of several fast SSDs
with SHA256 checksum makes system barely responsive during scrub, or
with dedup enabled barely responsive during large file deletion.
To address both problems this patch introduces ZTI_SCALE macro, alike
to ZTI_BATCH, but with multiple taskqs, depending on number of CPUs,
to be used in places where lock scalability is needed, while request
ordering is not so much. The code is made to create new taskq for
~6 worker threads (less for small systems, but more for very large)
up to 80% of CPU cores (previous 75% was not good for rounding down).
Both number of threads and threads per taskq are now tunable in case
somebody really wants to use all of system power for ZFS.
While obviously some benchmarks show small peak performance reduction
(not so big really, especially on systems with SMT, where use of the
second threads does not give as much performance as the first ones),
they also show dramatic latency reduction and much more smooth user-
space operation in case of high CPU usage by ZFS.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #11966
Brian Behlendorf [Fri, 14 May 2021 16:11:56 +0000 (09:11 -0700)]
ZTS: Increase redundancy test timeout
The redundancy_draid.ksh and redundancy_raidz.ksh tests were updated
by commit 93c8e91fe to additionally verify self-healing. This
additional check increased the run time which can now occasionally
exceed the default maximum timeout in the CI environment. To prevent
this from causing failures increase the default timeout for the
redundancy test cases.
Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #12043
наб [Fri, 14 May 2021 04:47:53 +0000 (06:47 +0200)]
i-t: rewrite hooks
This produces a leaner image, doesn't fail if zdb doesn't exist,
properly handles hostnameless systems, doesn't mention crypto modules
for no reason, doesn't add useless empty executable in hopes an
eight-year-old PR is merged, uses i-t builtins for all copies
Also optimize the checkbashisms filter to spawn one (or a few) awks
instead of one per regular file and remove initramfs/hooks therefrom due
to a command -v false positive
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #12017