Val Packett [Thu, 11 May 2023 21:16:57 +0000 (18:16 -0300)]
PAM: enable testing on FreeBSD
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Felix Dörre <felix@dogcraft.de> Signed-off-by: Val Packett <val@packett.cool>
Closes #14834
Val Packett [Sat, 6 May 2023 01:17:12 +0000 (22:17 -0300)]
PAM: support password changes even when not mounted
There's usually no requirement that a user be logged in for changing
their password, so let's not be surprising here.
We need to use the fetch_lazy mechanism for the old password to avoid
a double prompt for it, so that mechanism is now generalized a bit.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Felix Dörre <felix@dogcraft.de> Signed-off-by: Val Packett <val@packett.cool>
Closes #14834
Val Packett [Sat, 6 May 2023 01:34:58 +0000 (22:34 -0300)]
PAM: add 'uid_min' and 'uid_max' options for changing the uid range
Instead of a fixed >=1000 check, allow the configuration to override
the minimum UID and add a maximum one as well. While here, add the
uid range check to the authenticate method as well, and fix the return
in the chauthtok method (seems very wrong to report success when we've
done absolutely nothing).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Felix Dörre <felix@dogcraft.de> Signed-off-by: Val Packett <val@packett.cool>
Closes #14834
Val Packett [Sat, 6 May 2023 01:02:13 +0000 (22:02 -0300)]
PAM: add 'forceunmount' flag
Probably not always a good idea, but it's nice to have the option.
It is a workaround for FreeBSD calling the PAM session end earier than
the last process is actually done touching the mount, for example.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Felix Dörre <felix@dogcraft.de> Signed-off-by: Val Packett <val@packett.cool>
Closes #14834
Val Packett [Fri, 5 May 2023 22:35:57 +0000 (19:35 -0300)]
PAM: add 'recursive_homes' flag to use with 'prop_mountpoint'
It's not always desirable to have a fixed flat homes directory.
With the 'recursive_homes' flag, 'prop_mountpoint' search would
traverse the whole tree starting at 'homes' (which can now be '*'
to mean all pools) to find a dataset with a mountpoint matching
the home directory.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Felix Dörre <felix@dogcraft.de> Signed-off-by: Val Packett <val@packett.cool>
Closes #14834
Val Packett [Sat, 6 May 2023 00:56:39 +0000 (21:56 -0300)]
PAM: use boolean_t for config flags
Since we already use boolean_t in the file, we can use it here.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Felix Dörre <felix@dogcraft.de> Signed-off-by: Val Packett <val@packett.cool>
Closes #14834
Val Packett [Fri, 5 May 2023 23:00:48 +0000 (20:00 -0300)]
PAM: do not fail to mount if the key's already loaded
If we're expecting a working home directory on login, it would be
rather frustrating to not have it mounted just because it e.g. failed to
unmount once on logout.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Felix Dörre <felix@dogcraft.de> Signed-off-by: Val Packett <val@packett.cool>
Closes #14834
Luís Henriques [Tue, 30 May 2023 22:15:24 +0000 (23:15 +0100)]
Fix NULL pointer dereference when doing concurrent 'send' operations
A NULL pointer will occur when doing a 'zfs send -S' on a dataset that
is still being received. The problem is that the new 'send' will
rightfully fail to own the datasets (i.e. dsl_dataset_own_force() will
fail), but then dmu_send() will still do the dsl_dataset_disown().
Brian Behlendorf [Mon, 29 May 2023 19:55:35 +0000 (12:55 -0700)]
ZTS: zvol_misc_trim disable blk mq
Disable the zvol_misc_fua.ksh and zvol_misc_trim.ksh test cases on impacted
kernels. This issue is being actively worked in #14872 and as part of that
fix this commit will be reverted.
VERIFY(zh->zh_claim_txg == 0) failed
PANIC at zil.c:904:zil_create()
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #14872
Closes #14870
Richard Yao [Fri, 26 May 2023 22:47:52 +0000 (18:47 -0400)]
Use __attribute__((malloc)) on memory allocation functions
This informs the C compiler that pointers returned from these functions
do not alias other functions, which allows it to do better code
optimization and should make the compiled code smaller.
Mike Swanson [Fri, 26 May 2023 22:37:15 +0000 (15:37 -0700)]
Add compatibility symlinks for FreeBSD 12.{3,4} and 13.{0,1,2}
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mike Swanson <mikeonthecomputer@gmail.com>
Closes #14902
Colm [Fri, 26 May 2023 17:04:19 +0000 (10:04 -0700)]
Adding new read-only compatible zpool features to compatibility.d/grub2
GRUB2 is compatible with all "read-only compatible" features,
so it is safe to add new features of this type to the grub2
compatibility list. We generally want to include all compatible
features, to minimize the differences between grub2-compatible
pools and no-compatibility pools.
Adding new properties `livelist` and `zpool_checkpoint` accordingly.
Also adding them to the man page which references this file as an
example, for consistency.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Colm Buckley <colm@tuatha.org>
Closes #14893
Richard Yao [Fri, 26 May 2023 17:03:12 +0000 (13:03 -0400)]
btree: Implement faster binary search algorithm
This implements a binary search algorithm for B-Trees that reduces
branching to the absolute minimum necessary for a binary search
algorithm. It also enables the compiler to inline the comparator to
ensure that the only slowdown when doing binary search is from waiting
for memory accesses. Additionally, it instructs the compiler to unroll
the loop, which gives an additional 40% improve with Clang and 8%
improvement with GCC.
Consumers must opt into using the faster algorithm. At present, only
B-Trees used inside kernel code have been modified to use the faster
algorithm.
Micro-benchmarks suggest that this can improve binary search performance
by up to 3.5 times when compiling with Clang 16 and up to 1.9 times when
compiling with GCC 12.2.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14866
George Amanakis [Fri, 26 May 2023 16:53:00 +0000 (18:53 +0200)]
Fix inconsistent definition of zfs_scrub_error_blocks_per_txg
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14894
Brian Behlendorf [Thu, 25 May 2023 20:53:08 +0000 (13:53 -0700)]
Update compatibility.d files
Add an openzfs-2.2 compatibility file for the next release.
Edon-R support has been enabled for FreeBSD removing the need
for different FreeBSD and Linux files. Symlinks for the -linux
and -freebsd names are created for any scripts expecting that
convention.
Additionally, a symlink for ubunutu-22.04 was added.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14833
Alexander Motin [Thu, 25 May 2023 20:51:53 +0000 (16:51 -0400)]
zil: Add some more statistics.
In addition to a number of actual log bytes written, account also a
total written bytes including padding and total allocated bytes (bytes
<= write <= alloc). It should allow to monitor zil traffic and space
efficiency.
Add dtrace probe for zil block size selection.
Make zilstat report more information and fit it into less width.
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14863
Alexander Motin [Thu, 25 May 2023 16:48:43 +0000 (12:48 -0400)]
ZIL: Reduce scope of per-dataset zl_issuer_lock.
Before this change ZIL copied all log data while holding the lock.
It caused huge lock contention on workloads with many big parallel
writes. This change splits the process into two parts: first,
zil_lwb_assign() estimates the log space needed for all transactions,
and zil_lwb_write_close() allocates blocks and zios while holding the
lock, then, after the lock in dropped, zil_lwb_commit() copies the
data, and zil_lwb_write_issue() issues the I/Os.
Also while there slightly reduce scope of zl_lock.
Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14841
Since systemd v239, Exec* binaries are resolved from PATH when they
are not-absolute. Switch to this by default for ease of downstream
maintenance. Many downstream distributions move individual binaries
to locations that existing compile-time configurations cannot
accommodate.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Dimitri John Ledkov <dimitri.ledkov@canonical.com>
Closes #14880
Akash B [Wed, 24 May 2023 19:28:09 +0000 (00:58 +0530)]
Fix concurrent resilvers initiated at same time
For draid vdevs it was possible to initiate both the
sequential and healing resilver at same time.
This fixes the following two scenarios.
1) There's a window where a sequential rebuild can
be started via ZED even if a healing resilver has been
scheduled.
- This is fixed by adding additional check in
spa_vdev_attach() for any scheduled resilver and return
appropriate error code when a resilver is already in
progress.
2) It was possible for zpool clear to start a healing
resilver when it wasn't needed at all. This occurs because
during a vdev_open() the device is presumed to be healthy not
until the device is validated by vdev_validate() and it's set
unavailable. However, by this point an async resilver will
have already been requested if the DTL isn't empty.
- This is fixed by cancelling the SPA_ASYNC_RESILVER
request immediately at the end of vdev_reopen() when a resilver
is unneeded.
Finally, added a testcase in ZTS for verification.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Dipak Ghosh <dipak.ghosh@hpe.com> Signed-off-by: Akash B <akash-b@hpe.com>
Closes #14881
Closes #14892
youzhongyang [Wed, 24 May 2023 19:23:42 +0000 (15:23 -0400)]
Linux 6.4 compat: reclaimed_slab renamed to reclaimed
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes #14891
Brian Atkinson [Fri, 19 May 2023 20:05:53 +0000 (16:05 -0400)]
Hold db_mtx when updating db_state
Commit 555ef90 did some general code refactoring for
dmu_buf_will_not_fill() and dmu_buf_will_fill(). However, the db_mtx was
not held when update db->db_state in those code block. The rest of the
dbuf code always holds the db_mtx when updating db_state. This is
important because cv_wait() db_changed is used to check for db_state
changes.
Updating dmu_buf_will_not_fill() and dmu_buf_will_fill() to hold the
db_mtx when updating db_state.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #14875
Brian Behlendorf [Fri, 19 May 2023 20:05:09 +0000 (13:05 -0700)]
Probe vdevs before marking removed
Before allowing the ZED to mark a vdev as REMOVED due to a
hotplug event confirm that it is non-responsive with probe.
Any device which can be successfully probed should be left
ONLINE to prevent a healthy pool from being incorrectly
SUSPENDED. This may occur for at least the following two
scenarios.
1) Drive expansion (zpool online -e) in VMware environments.
If, during the partition resize operation, a partition is
removed and re-created then udev will send a removed event.
2) Re-scanning the namespaces of an NVMe device (nvme ns-rescan)
may result in a udev remove and add event being delivered.
Finally, update the ZED to only kick in a spare when the
removal was successful.
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #14859
Closes #14861
George Amanakis [Fri, 17 Dec 2021 20:35:28 +0000 (21:35 +0100)]
Teach zpool scrub to scrub only blocks in error log
Added a flag '-e' in zpool scrub to scrub only blocks in error log. A
user can pause, resume and cancel the error scrub by passing additional
command line arguments -p -s just like a regular scrub. This involves
adding a new flag, creating new libzfs interfaces, a new ioctl, and the
actual iteration and read-issuing logic. Error scrubbing is executed in
multiple txg to make sure pool performance is not affected.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Co-authored-by: TulsiJain tulsi.jain@delphix.com Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #8995
Closes #12355
Antonio Russo [Mon, 15 May 2023 23:11:33 +0000 (17:11 -0600)]
test-runner: pass kmemleak and kmsg to Cmd.run
test-runner.py orchestrates all of the ZTS executions. The `Cmd` object
manages these process, and its `run` method specifically invokes these
possibly long-running processes, possibly retrying in the event of a
timeout. Since its inception, memory leak detection using the kmemleak
infrastructure [1], and kernel logging [2] have been added to this run
mechanism.
However, the callback to cull a process beyond its timeout threshold,
`kill_cmd`, has evaded modernization by both of these changes. As a
result, this function fails to properly invoke `run`, leading to an
untrapped exception and unreported test failure.
This patch extends `kill_cmd` to receive these kernel devices through
the `options` parameter, and regularizes all the `.run` calls from
`Cmd`, and its subclasses, to accept that parameter.
Richard Yao [Fri, 12 May 2023 21:10:14 +0000 (17:10 -0400)]
Fix undefined behavior in spa_sync_props()
8eae2d214cfa53862833eeeda9a5c1e9d5ded47d caused Coverity to begin
complaining about "Improper use of negative value" in two places in
spa_sync_props() because Coverity correctly inferred from `prop ==
ZPOOL_PROP_INVAL` that prop could be -1 while both zpool_prop_to_name()
and zpool_prop_get_type() use it an array index, which is undefined
behavior.
Assuming that the system does not panic from an attempt to read invalid
memory, the case statement for ZPOOL_PROP_INVAL will ensure that only
user properties will reach this code when prop is ZPOOL_PROP_INVAL, such
that execution will continue safely. However, if we are unlucky enough
to read invalid memory, then the system will panic.
This issue predates the patch that caused coverity to begin complaining.
Thankfully, our userland tools do not pass nonsense to us, so this bug
should not be triggered unless a future userland tool attempts to set a
property that we do not understand.
Reported-by: Coverity (CID-1561129) Reported-by: Coverity (CID-1561130) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14860
Reported-by: Coverity (CID-1562375) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14860
Brian Behlendorf [Fri, 12 May 2023 16:07:58 +0000 (09:07 -0700)]
ZTS: Add auto_replace_001_pos to exceptions
The auto_replace_001_pos test case does not reliably pass on
Fedora 37 and newer. Until the test case can be updated to make
it reliable add it to the list of "maybe" exceptions on Linux.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #14851
Closes #14852
Don't use dmu_buf_is_dirty() for unassigned transaction.
The dmu_buf_is_dirty() call doesn't make sense here for two reasons:
1. txg is 0 for unassigned tx, so it was a no-op.
2. It is equivalent of checking if we have dirty records and we are doing
this few lines earlier.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14825
Deny block cloning is dbuf size doesn't match BP size.
I don't know an easy way to shrink down dbuf size, so just deny block cloning
into dbufs that don't match our BP's size.
This fixes the following situation:
1. Create a small file, eg. 1kB of random bytes. Its dbuf will be 1kB.
2. Create a larger file, eg. 2kB of random bytes. Its dbuf will be 2kB.
3. Truncate the large file to 0. Its dbuf will remain 2kB.
4. Clone the small file into the large file. Small file's BP lsize is
1kB, but the large file's dbuf is 2kB.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14825
Reimplement some of the block cloning vs dbuf logic, mostly to fix
situation where we clone a block and in the same transaction group
we want to partially overwrite the clone.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14825
Ameer Hamza [Wed, 10 May 2023 00:56:35 +0000 (05:56 +0500)]
Prevent panic during concurrent snapshot rollback and zvol read
Protect zvol_cdev_read with zv_suspend_lock to prevent concurrent
release of the dnode, avoiding panic when a snapshot is rolled back
in parallel during ongoing zvol read operation.
Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #14839
Tony Hutter [Wed, 10 May 2023 00:55:19 +0000 (17:55 -0700)]
pam: Fix "buffer overflow" in pam ZTS tests on F38
The pam ZTS tests were reporting a buffer overflow on F38, possibly
due to F38 now setting _FORTIFY_SOURCE=3 by default. gdb and
valgrind narrowed this down to a snprintf() buffer overflow in
zfs_key_config_modify_session_counter(). I'm not clear why this
particular snprintf() was being flagged as an overflow, but when
I replaced it with an asprintf(), the test passed reliably.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #14802
Closes #14842
Provides an interface which callers can use to declare a write when
the exact starting offset in not yet known. Since the full range
being updated is not available only the first L0 block at the
provided offset will be prefetched.
Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14819
Reduced the timeout to 60 seconds which should be more than
sufficient and allow the test to be marked as FAILED rather
than KILLED. Also dump the pool status on cleanup.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14829
George Amanakis [Tue, 9 May 2023 15:54:41 +0000 (17:54 +0200)]
Remove duplicate code in l2arc_evict()
l2arc_evict() performs the adjustment of the size of buffers to be
written on L2ARC unnecessarily. l2arc_write_size() is called right
before l2arc_evict() and performs those adjustments.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14828
George Amanakis [Tue, 9 May 2023 15:53:27 +0000 (17:53 +0200)]
Enable the head_errlog feature to remove errors
In case check_filesystem() does not error out and does not report
an error, remove that error block from error lists and logs
without requiring a scrub. This can happen when the original file and
all snapshots/clones referencing it have been removed.
Otherwise zpool status will still report that "Permanent errors have
been detected..." without actually reporting any of them.
To implement this change the functions introduced in corrective
receive were modified to take into account the head_errlog feature.
Before this change:
=============================
pool: test
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
config:
NAME STATE READ WRITE CKSUM
test ONLINE 0 0 0
/home/user/vdev_a ONLINE 0 0 2
errors: Permanent errors have been detected in the following files:
=============================
After this change:
=============================
pool: test
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are
unaffected.
action: Determine if the device needs to be replaced, and clear the
errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
config:
NAME STATE READ WRITE CKSUM
test ONLINE 0 0 0
/home/user/vdev_a ONLINE 0 0 2
errors: No known data errors
=============================
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14813
George Amanakis [Mon, 8 May 2023 20:35:03 +0000 (22:35 +0200)]
Fixes in head_errlog feature with encryption
For the head_errlog feature use dsl_dataset_hold_obj_flags() instead of
dsl_dataset_hold_obj() in order to enable access to the encryption keys
(if loaded). This enables reporting of errors in encrypted filesystems
which are not mounted but have their keys loaded.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14837
Matthew Ahrens [Mon, 8 May 2023 18:20:23 +0000 (11:20 -0700)]
Verify block pointers before writing them out
If a block pointer is corrupted (but the block containing it checksums
correctly, e.g. due to a bug that overwrites random memory), we can
often detect it before the block is read, with the `zfs_blkptr_verify()`
function, which is used in `arc_read()`, `zio_free()`, etc.
However, such corruption is not typically recoverable. To recover from
it we would need to detect the memory error before the block pointer is
written to disk.
This PR verifies BP's that are contained in indirect blocks and dnodes
before they are written to disk, in `dbuf_write_ready()`. This way,
we'll get a panic before the on-disk data is corrupted. This will help
us to diagnose what's causing the corruption, as well as being much
easier to recover from.
To minimize performance impact, only checks that can be done without
holding the spa_config_lock are performed.
Additionally, when corruption is detected, the raw words of the block
pointer are logged. (Note that `dprintf_bp()` is a no-op by default,
but if enabled it is not safe to use with invalid block pointers.)
Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Zuchowski <pzuchowski@datto.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #14817
When using zdb to output the value of an xattr only interpret it
as printable characters if the entire byte array is printable.
Additionally, if the --parseable option is set always output the
buffer contents as octal for easy parsing.
Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14830
Alexander Motin [Fri, 5 May 2023 16:17:55 +0000 (12:17 -0400)]
Fix two abd_gang_add_gang() issues.
- There is no reason to assert that added gang is not empty. It
may be weird to add an empty gang, but it is legal.
- When moving chain list from the added gang clear its size, or it
will trigger assertion in abd_verify() when that gang is freed.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14816
On kernel module unload, free all zfsdev state structures, except for
zfsdev_state_listhead, which is statically allocated.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14824
Ameer Hamza [Wed, 3 May 2023 22:10:32 +0000 (03:10 +0500)]
zpool import -m also removing spare and cache when log device is missing
spa_import() relies on a pool config fetched by spa_try_import() for
spare/cache devices. Import flags are not passed to spa_tryimport(),
which makes it return early due to a missing log device and missing
retrieving the cache device and spare eventually. Passing
ZFS_IMPORT_MISSING_LOG to spa_tryimport() makes it fetch the correct
configuration regardless of the missing log device.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #14794
buzzingwires [Wed, 3 May 2023 16:03:57 +0000 (12:03 -0400)]
Allow zhack label repair to restore detached devices.
This commit expands on the zhack label repair command in d04b5c9 by
adding the -u option to undetach a device by regenerating uberblocks,
in addition to the existing functionality of fixing checksums, now
represented by -c. Previous behavior is retained in the case of no
options.
The changes are heavily inspired by Jeff Bonwick's labelfix
utility, as archived at:
Additionally, it is now capable of properly determining the size of
block devices and other media, as well as handling sizes which are
not divisible by 2^18. This should make it viable for use on physical
devices and partitions, in addition to files.
These changes should make it possible to import zpools that have had
their uberblocks erased, such as in the case of pools rendered
inaccessible by erroneous detach commands.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: buzzingwires <buzzingwires@outlook.com>
Closes #14773
George Amanakis [Wed, 3 May 2023 16:00:14 +0000 (18:00 +0200)]
Optimize check_filesystem() and process_error_log()
Integrate check_clones() into check_filesystem() and implement a list
instead of iterating recursively over the clones, thus eliminating the
risk of a stack overflow.
Also use kmem_zalloc() to allocate large structures in
process_error_log() reducing its stack size from ~700 to ~128 bytes.
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14744
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14806
Currently when layering the ABD buffer of each split block on top of
an indirect vdev's ZIO ABD we don't specify the split block's ABD.
This results in those ABDs being incorrectly sized by inheriting
the size of their parent ABD which is larger than what each split
block needs.
The above behavior isn't causing any bugs currently but can lead
to unexpected ABD sizes for people analyzing and/or working on
the ZIO codepath. This patch fixes this behavior by properly setting
the ABD size for split block ZIOs.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #14804
FreeBSD/powerpc64 is all ELFv2 since FreeBSD 13, even big endian. The
existing sha256 and sha512 asm code assumes that BE is all ELFv1, and LE
is ELFv2. Minor changes to add ELFv2 in the BE side gets this working
correctly on FreeBSD with latest OpenZFS import.
Alexander Motin [Thu, 27 Apr 2023 19:32:58 +0000 (15:32 -0400)]
Mark TX_COMMIT transaction with TXG_NOTHROTTLE.
TX_COMMIT has no on-disk representation and does not produce any more
dirty data. It should not wait for anything, and even just skipping
the checks if not waiting gives improvement noticeable in profiler.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14798
Val Packett [Thu, 27 Apr 2023 16:49:03 +0000 (13:49 -0300)]
PAM: support the authentication facility
Implement the pam_sm_authenticate method, using the noop argument of
lzc_load_key to do a passphrase check without actually loading the key.
This allows using ZFS as the source of truth for user passwords,
without storing any password hashes in /etc or using other PAM modules.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Felix Dörre <felix@dogcraft.de> Signed-off-by: Val Packett <val@packett.cool>
Closes #14789
Alexander Motin [Wed, 26 Apr 2023 16:20:43 +0000 (12:20 -0400)]
Fix positive ABD size assertion in abd_verify().
Gang ABDs without childred are legal, and they do have zero size.
For other ABD types zero size doesn't have much sense and likely
not working correctly now.
Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14795
Brian Behlendorf [Wed, 26 Apr 2023 15:43:39 +0000 (08:43 -0700)]
zdb: Fix minor memory leak
Commit 6b6aaf6dc2e65c63c74fbd7840c14627e9a91ce2 introduced a small
memory leak in zdb. This was detected by the LeakSanitizer and was
causing all ztest runs to fail.
Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14796
Brian Behlendorf [Tue, 25 Apr 2023 23:40:55 +0000 (16:40 -0700)]
Revert "Fix data race between zil_commit() and zil_suspend()"
This reverts commit 4c856fb333ac57d9b4a6ddd44407fd022a702f00 to
resolve a newly introduced deadlock which in practice in more
disruptive that the issue this commit intended to address.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #14775
Closes #14790
LoongArch is a new RISC ISA, which is a bit like MIPS or RISC-V.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Han Gao <gaohan@uniontech.com> Signed-off-by: WANG Xuerui <xen0n@gentoo.org>
Closes #13422
People often want estimates of how much of their pool is occupied
by metadata, but they end up using lots of text processing on zdb's
output to get it.
So let's just...provide it for them.
Now, zdb -bbbs will output something like:
Blocks LSIZE PSIZE ASIZE avg comp %Total Type
[...]
68 1.06M 272K 544K 8K 4.00 0.00 L6 Total
1.71K 212M 6.85M 13.7M 8K 30.91 0.00 L5 Total
1.71K 212M 6.85M 13.7M 8K 30.91 0.00 L4 Total
1.73K 214M 6.92M 13.8M 8K 30.89 0.00 L3 Total
18.7K 2.29G 111M 221M 11.8K 21.19 0.00 L2 Total
3.56M 454G 28.4G 56.9G 16.0K 15.97 0.19 L1 Total
308M 36.8T 28.2T 28.6T 95.1K 1.30 99.80 L0 Total
311M 37.3T 28.3T 28.6T 94.2K 1.32 100.00 Total
50.4M 774G 113G 291G 5.77K 6.85 0.99 Metadata Total
It became illegal to not have them as of 5f6df177758b9dff88e4b6069aeb2359e8b0c493 ("vfs: validate that vop
vectors provide all or none fplookup vops") upstream.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #14788
API contract requires VOPs to handle EXDEV internally, worst case by
falling back to the generic copy routine. This broke with the recent
changes.
While here whack custom loop to lock 2 vnodes with vn_lock_pair, which
provides the same functionality internally. write start/finish around
it plays no role so got eliminated.
One difference is that vn_lock_pair always takes an exclusive lock on
both vnodes. I did not patch around it because current code takes an
exclusive lock on the target vnode. zfs supports shared-locking for
writes, so this serializes different calls to the routine as is, despite
range locking inside. At the same time you may notice the source vnode
can get some traffic if only shared-locked, thus once more this goes
the safer route of exclusive-locking. Note this should be patched to
use shared-locking for both once the feature is considered stable.
Technically the switch to vn_lock_pair should be a separate change, but
it would only introduce churn immediately whacked by the rest of the
patch.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Sponsored by: Rubicon Communications, LLC ("Netgate")
Closes #14723
FreeBSD: make zfs_vfs_held() definition consistent with declaration
Noticed while attempting to change FreeBSD's boolean_t into an actual
bool: in include/sys/zfs_ioctl_impl.h, zfs_vfs_held() is declared to
return a boolean_t, but in module/os/freebsd/zfs/zfs_ioctl_os.c it is
defined to return an int. Make the definition match the declaration.
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Dimitry Andric <dimitry@andric.com>
Closes #14776
Richard Yao [Tue, 11 Apr 2023 17:56:16 +0000 (17:56 +0000)]
Linux: Suppress -Wordered-compare-function-pointers in tracepoint code
Clang points out that there is a comparison against -1, but we cannot
fix it because that is from the kernel headers, which we must support.
We can workaround this by using a pragma.
Sponsored-By: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Youzhong Yang <yyang@mathworks.com> Signed-off-by: Richard Yao <richard.yao@klarasystems.com>
Closes #14738
Richard Yao [Tue, 11 Apr 2023 17:50:43 +0000 (17:50 +0000)]
Linux: zfs_zaccess_trivial() should always call generic_permission()
Building with Clang on Linux generates a warning that err could be
uninitialized if mnt_ns is a NULL pointer. However, mnt_ns should never
be NULL, so there is no need to put this behind an if statement. Taking
it outside of the if statement means that the possibility of err being
uninitialized goes from being always zero in a way that the compiler
could not realize to a way that is always zero in a way that the
compiler can realize.
Sponsored-By: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Youzhong Yang <yyang@mathworks.com> Signed-off-by: Richard Yao <richard.yao@klarasystems.com>
Closes #14738
And add it to the AVZ, this is not backwards compatible with older pools
due to an assertion in spa_sync() that verifies the number of ZAPs of
all vdevs matches the number of ZAPs in the AVZ.
Granted, the assertion only applies to #DEBUG builds - still, a feature
flag is introduced to avoid the assertion, com.klarasystems:vdev_zaps_v2
Notably, this allows to get/set properties on the root vdev:
% zpool set user:prop=value <pool> root-0
Before this commit, it was already possible to get/set properties on
top-level vdevs with the syntax <type>-<vdev_id> (e.g. mirror-0):
% zpool set user:prop=value <pool> mirror-0
This syntax also applies to the root vdev as it is is of type 'root'
with a vdev_id of 0, root-0. The keyword 'root' as an alias for
'root-0'.
The following tests have been added:
- zpool get all properties from root vdev
- zpool set a property on root vdev
- verify root vdev ZAP is created
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Wing <rob.wing@klarasystems.com> Sponsored-by: Seagate Technology Submitted-by: Klara, Inc.
Closes #14405
Herb Wartens [Wed, 19 Apr 2023 20:22:59 +0000 (13:22 -0700)]
Allow MMP to bypass waiting for other threads
At our site we have seen cases when multi-modifier protection is enabled
(multihost=on) on our pool and the pool gets suspended due to a single
disk that is failing and responding very slowly. Our pools have 90 disks
in them and we expect disks to fail. The current version of MMP requires
that we wait for other writers before moving on. When a disk is
responding very slowly, we observed that waiting here was bad enough to
cause the pool to suspend. This change allows the MMP thread to bypass
waiting for other threads and reduces the chances the pool gets
suspended.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Herb Wartens <hawartens@gmail.com>
Closes #14659
Paul Dagnelie [Wed, 19 Apr 2023 20:20:02 +0000 (13:20 -0700)]
ZTS: send-c_volume is flaky
We use block_device_wait to wait for the zvol block device to
actually appear, and we log the result of the dd calls by using
an intermediate file.
Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: John Wren Kennedy <john.kennedy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #14767
Ameer Hamza [Wed, 19 Apr 2023 16:04:32 +0000 (21:04 +0500)]
Fix "Detach spare vdev in case if resilvering does not happen"
Spare vdev should detach from the pool when a disk is reinserted.
However, spare detachment depends on the completion of resilvering,
and if resilver does not schedule, the spare vdev keeps attached to
the pool until the next resilvering. When a zfs pool contains
several disks (25+ mirror), resilvering does not always happen when
a disk is reinserted. In this patch, spare vdev is manually detached
from the pool when resilvering does not occur and it has been tested
on both Linux and FreeBSD.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #14722
Work around Raspberry Pi kernel packaging oddities
On Debian and Ubuntu and friends, you get something like
"linux-image-$(uname -r)" and "linux-headers-$(uname -r)" you
can put a Depends on.
On Raspberry Pi OS, you get "raspberrypi-kernel" and
"raspberrypi-kernel-headers", with version numbers like 20230411.
There is not, as far as I can tell, a reasonable way to map that
to a kernel version short of reaching out and digging around in
the changelogs or Makefile, so just special-case it so the packages
don't fail to install at install time. They still might not build
if the versions don't match, but I don't see a way to do anything
about that...
The zfs_log_clone_range() function is never called from the
zfs_clone_range_replay() function, so I assumed it is safe to assert
that zil_replaying() is never TRUE here. It turns out zil_replaying()
also returns TRUE when the sync property is set to disabled.
Fix the problem by just returning if zil_replaying() returns TRUE.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reported by: Florian Smeets Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14758
* Fixed one typo (effects -> affects)
* Re-worded raidz description to make it clearer that it is not
quite the same as RAID5, though similar
* Clarified that data is not necessarily written in a static
stripe width
* Minor grammar consistency improvement
* Noted that "volumes" means zvols
* Fixed a couple of split infinitives
* Clarified that hot spares come from the same pool they were
assigned to
* "we" -> ZFS
* Fixed warnings thrown by mandoc, and removed unnecessary
wordiness in one fixed line.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Brandon Thetford <brandon@dodecatec.com>
Closes #14726
Don't overwrite blk_phys_birth, as for embedded blocks it is part of
the payload.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Issue #13392
Closes #14739
initramfs: source user scripts from /e/z/initramfs-tools-load-key{,.d/*}
By dropping in a file in a directory (for packages) or by making a file
(for local administrators), custom key loading methods may be provided
for the rootfs and necessities.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Nicholas Morris <security@niwamo.com> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Co-authored-by: Nicholas Morris <security@niwamo.com>
Supersedes: #14704 Closes: #13757
Closes #14733
Alan Somers [Mon, 10 Apr 2023 21:24:27 +0000 (15:24 -0600)]
Trim needless zeroes from checksum events
The ereport.fs.zfs.checksum event contains histograms of the bits that
were wrongly set or cleared according to their bit position in a 64-bit
word. So the maximum value that any histogram bucket could have would
be 64. But ZFS currently uses a uint32_t to hold each bucket. As a
result, the event report is full of needless zeroes.
Change the bucket size to uint8_t, stripping 768 needless zeros from
each event.