autoconf: use include directives instead of recursing down test data
We drop /multiple/ seconds off the generation, a dozen off a clean
rebuild, 185 files, and trivialise the distribution,
which can now be trivially generated via the provided snippets
autoconf: use include directives instead of recursing down tests (mostly)
Only down to tests/zfs-tests/tests, but pull out C programs into the
main Makefile ‒ this means we get correct dependency tracking for all
programs (and parallelise across them)
The order in fs/Makefile doesn't matter,
the order in fs/Kconfig is preserved (ext2 is included as the first
thing in the first if BUILD block, and only once), but I don't think it
matters much either, realistically
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #13316
This was added in 93ce2b4ca5a40c41ac945cd3aaf4a4a22bb751e1 ("Update
build system and packaging"), which merged the SPL and ZFS trees,
and included in 0.8.0; "the next major release" was 2.0.0
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #13316
hping [Mon, 9 May 2022 23:30:16 +0000 (07:30 +0800)]
abd_os: remove redundant refcount creation for abd_children
Refcount creation for abd_zero_scatter->abd_children is redundant in
abd_alloc_zero_scatter, as it has been done in abd_init_struct.
In addition, abd_children is undefined when ZFS_DEBUG is disabled, the
reference of abd_children in abd_alloc_zero_scatter breaks build of
libzpool when ZFS_DEBUG is disabled.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Ping Huang <huangping@smartx.com>
Closes #13429
Aidan Harris [Fri, 6 May 2022 18:57:37 +0000 (18:57 +0000)]
Fix functions without a prototype
clang-15 emits the following error message for functions without
a prototype:
fs/zfs/os/linux/spl/spl-kmem-cache.c:1423:27: error:
a function declaration without a prototype is deprecated
in all versions of C [-Werror,-Wstrict-prototypes]
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Aidan Harris <me@aidanharr.is>
Closes #13421
Rich Ercolani [Wed, 4 May 2022 18:59:30 +0000 (14:59 -0400)]
Corrected edge case in uncompressed ARC->L2ARC handling
I genuinely don't know why this didn't come up before,
but adding the LZ4 early abort pointed out this flaw,
in which we're allocating a buffer of one size, and
then telling the compressor that we're handing it buffers
of a different size, which may be Very Different - say,
allocating 512b and then telling it the inputs are 128k.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #13375
Alexander Motin [Wed, 4 May 2022 18:33:42 +0000 (14:33 -0400)]
Improve mg_aliquot math
When calculating mg_aliquot alike to #12046 use number of unique data
disks in the vdev, not the total number of children vdev. Increase
default value of the tunable from 512KB to 1MB to compensate.
Before this change each disk in striped pool was getting 512KB of
sequential data, in 2-wide mirror -- 1MB, in 3-wide RAIDZ1 -- 768KB.
After this change in all the cases each disk should get 1MB.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #13388
Holding a dbuf is a common operation which can become highly contended
in dbuf_find() when acquiring the dbuf hash mutex. This is particularly
true on Linux when reading/writing volumes since by default up to 32
threads from the zvol_taskq may be taking a hold of the same dbuf.
This should also be observable on FreeBSD as long as there are enough
processes accessing the volume concurrently.
This is further aggregrated by the fact that only the block id will
be unique when calculating the dbuf hash for a single volume. The
objset id, object id, and level will be the same for data blocks.
This has been observed to result in a somehwat less than uniform hash
distribution and a longer than expected max hash chain depth (~20)
on a large memory system (256 GB) using volumes.
This commit improves the siutation by switching the hash mutex to
an rwlock to allow concurrent lookups, and increasing DBUF_RWLOCKS
from 2048 to 8192 to further reduce the odds of a hash collision.
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13405
Shaan Nobee [Tue, 3 May 2022 20:23:26 +0000 (00:23 +0400)]
Speed up WB_SYNC_NONE when a WB_SYNC_ALL occurs simultaneously
Page writebacks with WB_SYNC_NONE can take several seconds to complete
since they wait for the transaction group to close before being
committed. This is usually not a problem since the caller does not
need to wait. However, if we're simultaneously doing a writeback
with WB_SYNC_ALL (e.g via msync), the latter can block for several
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE
writeback since it needs to wait for the transaction to complete
and the PG_writeback bit to be cleared.
This commit deals with 2 cases:
- No page writeback is active. A WB_SYNC_ALL page writeback starts
and even completes. But when it's about to check if the PG_writeback
bit has been cleared, another writeback with WB_SYNC_NONE starts.
The sync page writeback ends up waiting for the non-sync page
writeback to complete.
- A page writeback with WB_SYNC_NONE is already active when a
WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up
waiting for the WB_SYNC_NONE writeback.
The fix works by carefully keeping track of active sync/non-sync
writebacks and committing when beneficial.
- Prefer O_* flags over F* flags that mostly mirror O_* flags anyway,
but O_* flags seem to be preferred.
- Simplify the code as all the F*SYNC flags were defined as FFSYNC flag.
- Don't define FRSYNC flag, so we don't generate unnecessary ZIL commits.
- Remove EXCL define, FreeBSD ignores the excl argument for zfs_create()
anyway.
Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #13400
Add missing replay entry in zvol_replay_vector for TX_SETSAXATTR
Commit 361a7e8 (log xattr=sa create/remove/update to ZIL) introduced a
TX_SETSAXATTR, but missed to add a corresponding entry in
zvol_replay_vector. Adding a missing replay entry in zvol_replay_vector.
Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com>
Closes #13396
Closes #13395
Brian Behlendorf [Fri, 29 Apr 2022 21:21:11 +0000 (14:21 -0700)]
Silence unused-but-set-variable warnings
Clang 13.0.0 added support for `Wunused-but-set-parameter` and
`-Wunused-but-set-variable` which correctly detects two unused
variables in zstd resulting in a build failure. This commit
annotates these instances accordingly.
Increase the default allowed maximum recordsize from 1M to 16M.
As described in the zfs(4) man page, there are significant costs
which need to be considered before using very large blocks.
However, there are scenarios where they make good sense and
it should no longer be necessary to artificially restrict their
use behind a module option.
Note that for 32-bit platforms we continue to leave this
restriction in place due to the limited virtual address space
available (256-512MB). On these systems only a handful
of blocks could be cached at any one time severely impacting
performance and potentially stability.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #12830
Closes #13302
Brian Behlendorf [Wed, 27 Apr 2022 19:56:17 +0000 (12:56 -0700)]
Fix O_APPEND for Linux 3.15 and older kernels
When using a Linux kernel which predates the iov_iter interface the
O_APPEND flag should be applied in zpl_aio_write() via the call to
generic_write_checks(). The updated pos variable was incorrectly
ignored resulting in the current offset being used.
This issue should only realistically impact the RHEL/CentOS 7.x
kernels which are based on Linux 3.10.
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13370
Closes #13377
Linux 5.12 PPC 5.12 get_user() and __copy_from_user_inatomic()
inline helpers very indirectly include a reference to the GPL'd
array mmu_feature_keys[] and fails to build. Workaround this by
using copy_from_user() and throwing EFAULT for any calls to
__copy_from_user_inatomic(). This is a workaround until a fix
for Linux commit 7613f5a66becfd0e43a0f34de8518695888f5458
"powerpc/64s/kuap: Use mmu_has_feature()" is fully addressed.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Authored-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: szubersk <szuberskidamian@gmail.com>
Closes #11958
Closes #12590
Closes #13367
- Unconditionally inject `CONFIG_MODULES` make variable
and `#define CONFIG_MODULES` to Kbuild in `ZFS_LINUX_COMPILE`
autoconf function to emulate loadable kernel modules support.
This allows OpenZFS to perform Linux checks despite
`CONFIG_MODULES=n` in the actual Linux config.
- Add `ZFS_AC_KERNEL_CONFIG_MODULES` check which encompasses
the logic from `ZFS_AC_KERNEL_TEST_MODULE` with additional
diagnostic messages to the user
- Removed `ZFS_AC_KERNEL_TEST_MODULE` as it merely duplicates
every check in `ZFS_AC_KERNEL_CONFIG_DEFINED`
- Moved `ZFS_AC_MODULE_SYMVERS` after `ZFS_AC_KERNEL_CONFIG_DEFINED`
so the user has a chance to see the proper diagnostic from the
steps before.
kbuild: unify modules(_install) for in-tree and external modules
If you attempt to build or install modules ('make modules(_install)'
with CONFIG_MODULES disabled, you will get a clear error message, but
nothing for external module builds.
Factor out the modules and modules_install rules into the common part,
so you will get the same error message when you try to build external
modules with CONFIG_MODULES=n.
Alexander Motin [Tue, 26 Apr 2022 17:44:21 +0000 (13:44 -0400)]
Improve log spacemap load time
Previous flushing algorithm limited only total number of log blocks to
the minimum of 256K and 4x number of metaslabs in the pool. As result,
system with 1500 disks with 1000 metaslabs each, touching several new
metaslabs each TXG could grow spacemap log to huge size without much
benefits. We've observed one of such systems importing pool for about
45 minutes.
This patch improves the situation from five sides:
- By limiting maximum period for each metaslab to be flushed to 1000
TXGs, that effectively limits maximum number of per-TXG spacemap logs
to load to the same number.
- By making flushing more smooth via accounting number of metaslabs
that were touched after the last flush and actually need another flush,
not just ms_unflushed_txg bump.
- By applying zfs_unflushed_log_block_pct to the number of metaslabs
that were touched after the last flush, not all metaslabs in the pool.
- By aggressively prefetching per-TXG spacemap logs up to 16 TXGs in
advance, making log spacemap load process for wide HDD pool CPU-bound,
accelerating it by many times.
- By reducing zfs_unflushed_log_block_max from 256K to 128K, reducing
single-threaded by nature log processing time from ~10 to ~5 minutes.
As further optimization we could skip bumping ms_unflushed_txg for
metaslabs not touched since the last flush, but that would be an
incompatible change, requiring new pool feature.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.
Closes #12789
George Amanakis [Tue, 26 Apr 2022 00:25:42 +0000 (02:25 +0200)]
Improve zpool status output, list all affected datasets
Currently, determining which datasets are affected by corruption is
a manual process.
The primary difficulty in reporting the list of affected snapshots is
that since the error was initially found, the snapshot where the error
originally occurred in, may have been deleted. To solve this issue, we
add the ID of the head dataset of the original snapshot which the error
was detected in, to the stored error report. Then any time a filesystem
is deleted, the errors associated with it are deleted as well. Any time
a clone promote occurs, we modify reports associated with the original
head to refer to the new head. The stored error reports are identified
by this head ID, the birth time of the block which the error occurred
in, as well as some information about the error itself are also stored.
Once this information is stored, we can find the set of datasets
affected by an error by walking back the list of snapshots in the given
head until we find one with the appropriate birth txg, and then traverse
through the snapshots of the clone family, terminating a branch if the
block was replaced in a given snapshot. Then we report this information
back to libzfs, and to the zpool status command, where it is displayed
as follows:
pool: test
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
scan: scrub repaired 0B in 00:00:00 with 800 errors on Fri Dec 3
08:27:57 2021
config:
NAME STATE READ WRITE CKSUM
test ONLINE 0 0 0
sdb ONLINE 0 0 1.58K
errors: Permanent errors have been detected in the following files:
A new feature flag is introduced to mark the presence of this change, as
well as promotion and backwards compatibility logic. This is an updated
version of #9175. Rebase required fixing the tests, updating the ABI of
libzfs, updating the man pages, fixing bugs, fixing the error returns,
and updating the old on-disk error logs to the new format when
activating the feature.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Co-authored-by: TulsiJain <tulsi.jain@delphix.com> Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #9175
Closes #12812
Richard Laager [Fri, 22 Apr 2022 22:37:03 +0000 (17:37 -0500)]
zvol_wait: Ignore locked zvols
"When an encrypted zvol is locked the zfs-volume-wait service does not
start. The /sbin/zvol_wait should not wait for links when the volume
has property keystatus=unavailable."
-- https://bugs.launchpad.net/ubuntu/+source/zfs-linux/+bug/1888405
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com>
Thanks: James Dingwall <james-launchpad@dingwall.me.uk> Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #10662
- Add `CONFIG_BLOCK` Linux config requirement to
`ZFS_AC_KERNEL_CONFIG_DEFINED`. OpenZFS won't compile without
that block device support due to large amount of functional
dependencies on it.
- Remove dependency on `groups_alloc()` in
`ZFS_AC_KERNEL_SRC_GROUP_INFO_GID` to circumvent the missing stub
in Linux 4.X kernel headers.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: szubersk <szuberskidamian@gmail.com>
Closes #13351
contrib; dracut: centralise root= parsing, actually support root=s
So far, everything parsed root= manually, which meant that while
zfs-parse.sh was updated, and supposedly supported + -> ' ' conversion,
it meant nothing
Instead, centralise parsing, and allow:
root=
root=zfs
root=zfs:
root=zfs:AUTO
root=ZFS=data/set
root=zfs:data/set
root=zfs:ZFS=data/set (as a side-effect; allowed but undocumented)
rootfstype=zfs AND root=data/set <=> root=data/set
rootfstype=zfs AND root= <=> root=zfs:AUTO
So rootfstype=zfs /also/ behaves as expected, and + decoding works
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #13291
contrib: dracut: parse-zfs: stop pretending we support FILESYSTEM=
It was added in the original ae26d0465a ("Add dracut support") commit
in 2011, and was then broken a bit later with the advent of
dracut-zfs-generator, or maybe earlier as part of other churn
Either way, it's broken, and has been in 2.0+ as well, and no-one
complained. Stop pretending we support it at all
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #13291
It turns out, no, in fact, ZERO_RANGE and PUNCH_HOLE do
have differing semantics in some ways - in particular,
one requires KEEP_SIZE, and the other does not.
Also added a zero-range test to catch this, corrected a flaw
that made the punch-hole test succeed vacuously, and a typo
in file_write.
Alexander Motin [Wed, 20 Apr 2022 23:05:38 +0000 (19:05 -0400)]
FreeBSD: Fix translation from ABD to physical pages
In hypothetical case of non-linear ABD with single segment, multiple
to page size but not aligned to it, vdev_geom_fill_unmap_cb() could
fill one page less into bio_ma array.
I am not sure it is exploitable, but better to be safe than sorry.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reported-by: Mark Johnston <markj@FreeBSD.org> Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #13345
наб [Thu, 31 Mar 2022 15:20:50 +0000 (17:20 +0200)]
linux: module: weld all but spl.ko into zfs.ko
Originally it was thought it would be useful to split up the kmods
by functionality. This would allow external consumers to only load
what was needed. However, in practice we've never had a case where
this functionality would be needed, and conversely managing multiple
kmods can be awkward. Therefore, this change merges all but the
spl.ko kmod in to a single zfs.ko kmod.
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #13274
Allan Jude [Wed, 20 Apr 2022 20:16:25 +0000 (16:16 -0400)]
Improve the inline descriptions of the ARC module parameters
These are displayed as the descriptions of the sysctl's on FreeBSD
Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes #13334
Paul Dagnelie [Tue, 19 Apr 2022 20:55:04 +0000 (13:55 -0700)]
Fix style checking error introduced by zfs-mount changes
Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #13347
Brian Behlendorf [Tue, 19 Apr 2022 17:38:04 +0000 (10:38 -0700)]
Linux 5.17 compat: GENHD_FL_EXT_DEVT / GENHD_FL_NO_PART_SCAN
As of the 5.17 kernel the GENHD_FL_EXT_DEVT flag has been removed
and the GENHD_FL_NO_PART_SCAN flag renamed GENHD_FL_NO_PART. Update
zvol_alloc() to set GENHD_FL_NO_PART for the newer kernels which
is sufficient. The behavior for prior kernels remains unchanged.
1ebe2e5f ("block: remove GENHD_FL_EXT_DEVT") 46e7eac6 ("block: rename GENHD_FL_NO_PART_SCAN to GENHD_FL_NO_PART")
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13294
Closes #13297
Brian Behlendorf [Tue, 12 Apr 2022 00:26:01 +0000 (17:26 -0700)]
ZTS: Retry auto_spare_multiple.ksh
The auto_spare_multiple.ksh test may incorrectly fail for a similar
reason as the auto_spare_shared.ksh test. Add it to known list of
exceptions which should be retried to prevent failures in the CI.
Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13318
Brian Behlendorf [Mon, 11 Apr 2022 23:18:51 +0000 (16:18 -0700)]
ZTS: Retry redundancy_draid_spare[1,3].ksh
The redundancy_draid_spare1.ksh and redundancy_draid_spare3.ksh test
cases are a little to strict for the sequential resilver case. While
unlikely it is possible that a handful of correctable checksum errors
will be reported resulting in a test failure. Update the zts-report.py
script to allow this the test case to be retried if requested.
Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13318