Josef Bacik [Mon, 24 Oct 2022 18:46:53 +0000 (14:46 -0400)]
btrfs: move the lockdep helpers into locking.h
These more naturally fit in with the locking related code, and they're
all defines so they can easily go anywhere, move them out of ctree.h
into locking.h
Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Omar Sandoval [Thu, 20 Oct 2022 16:58:28 +0000 (12:58 -0400)]
btrfs: extend btrfs_dir_item type to store encryption status
For directories with encrypted files/filenames, we need to store a flag
indicating this fact. There's no room in other fields, so we'll need to
borrow a bit from dir_type. Since it's now a combination of type and
flags, we rename it to dir_flags to reflect its new usage.
The new flag, FT_ENCRYPTED, indicates a directory containing encrypted
data, which is orthogonal to file type; therefore, add the new
flag, and make conversion from directory type to file type strip the
flag.
As the file types almost never change we can afford to use the bits.
Actual usage will be guarded behind an incompat bit, this patch only
adds the support for later use by fscrypt.
Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: use struct fscrypt_str instead of struct qstr
While struct qstr is more natural without fscrypt, since it's provided
by dentries, struct fscrypt_str is provided by the fscrypt handlers
processing dentries, and is thus more natural in the fscrypt world.
Replace all of the struct qstr uses with struct fscrypt_str.
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: setup qstr from dentrys using fscrypt helper
Most places where we get a struct qstr, we are doing so from a dentry.
With fscrypt, the dentry's name may be encrypted on-disk, so fscrypt
provides a helper to convert a dentry name to the appropriate disk name
if necessary. Convert each of the dentry name accesses to use
fscrypt_setup_filename(), then convert the resulting fscrypt_name back
to an unencrypted qstr. This does not work for nokey names, but the
specific locations that could spawn nokey names are noted.
At present, since there are no encrypted directories, nothing goes down
the filename encryption paths.
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: use struct qstr instead of name and namelen pairs
Many functions throughout btrfs take name buffer and name length
arguments. Most of these functions at the highest level are usually
called with these arguments extracted from a supplied dentry's name.
But the entire name can be passed instead, making each function a little
more elegant.
Each function whose arguments are currently the name and length
extracted from a dentry is herein converted to instead take a pointer to
the name in the dentry. The couple of calls to these calls without a
struct dentry are converted to create an appropriate qstr to pass in.
Additionally, every function which is only called with a name/len
extracted directly from a qstr is also converted.
This change has positive effect on stack consumption, frame of many
functions is reduced but this will be used in the future for fscrypt
related structures.
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Wed, 19 Oct 2022 16:21:42 +0000 (21:51 +0530)]
btrfs: merge module cleanup sequence to one helper
The module exit function exit_btrfs_fs() is duplicating a section of code
in init_btrfs_fs(). Add a helper to remove the duplicated code. Due
to the init/exit section requirements the function must be inline and
not a plain static as it could cause section mismatch.
Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 14 Oct 2022 13:59:35 +0000 (15:59 +0200)]
btrfs: switch GFP_NOFS to GFP_KERNEL in scrub_setup_recheck_block
There's only one caller that calls scrub_setup_recheck_block in the
memalloc_nofs_save/_restore protection so it's effectively already
GFP_NOFS and it's safe to use GFP_KERNEL.
Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 19 Oct 2022 14:51:01 +0000 (10:51 -0400)]
btrfs: remove temporary btrfs_map_token declaration in ctree.h
This was added while I was moving this code to its new home, it can be
removed now.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 19 Oct 2022 14:51:00 +0000 (10:51 -0400)]
btrfs: move accessor helpers into accessors.h
This is a large patch, but because they're all macros it's impossible to
split up. Simply copy all of the item accessors in ctree.h and paste
them in accessors.h, and then update any files to include the header so
everything compiles.
Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comments, style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 19 Oct 2022 14:50:59 +0000 (10:50 -0400)]
btrfs: move btrfs_map_token to accessors
This is specific to the item-accessor code, move it out of ctree.h into
accessor.h/.c and then update the users to include the new header file.
This un-inlines btrfs_init_map_token, however this is only called once
per function so it's not critical to be inlined. This also saves 904
bytes of code on a release build.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 19 Oct 2022 14:50:58 +0000 (10:50 -0400)]
btrfs: rename struct-funcs.c to accessors.c
Rename struct-funcs.c to accessors.c so we can move the item accessors
out of ctree.h. accessors.c is a better description of the code that is
contained in these files.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 19 Oct 2022 14:50:57 +0000 (10:50 -0400)]
btrfs: move the compat/incompat flag masks to fs.h
This is fs wide information, move it out of ctree.h into fs.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 19 Oct 2022 14:50:56 +0000 (10:50 -0400)]
btrfs: remove fs_info::pending_changes and related code
Now that we're not using this code anywhere we can remove it as well as
the member from fs_info.
We don't have any mount options or on/off features that would utilize
the pending infrastructure, the last one was inode_cache.
There was a patchset [1] to enable some features from sysfs that would
break things if it would be set immediately. In case we'll need that
kind of logic again the patch can be reverted, but for the current use
it can be replaced by the single state bit to do the commit.
Josef Bacik [Wed, 19 Oct 2022 14:50:55 +0000 (10:50 -0400)]
btrfs: add a BTRFS_FS_NEED_TRANS_COMMIT flag
Currently we are only using fs_info->pending_changes to indicate that we
need a transaction commit. The original users for this were removed
years ago and we don't have more usage in sight, so this is the only
remaining reason to have this field. Add a flag so we can remove this
code.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 19 Oct 2022 14:50:54 +0000 (10:50 -0400)]
btrfs: move fs_info::flags enum to fs.h
These definitions are fs wide, take them out of ctree.h and put them in
fs.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 19 Oct 2022 14:50:53 +0000 (10:50 -0400)]
btrfs: move mount option definitions to fs.h
These are fs wide definitions and helpers, move them out of ctree.h and
into fs.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 19 Oct 2022 14:50:52 +0000 (10:50 -0400)]
btrfs: convert incompat and compat flag test helpers to macros
These helpers use functions not defined in fs.h, they're simply
accessors of the super block in fs_info, convert them to macros so
that we don't have a weird dependency between fs.h and accessors.h.
Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 19 Oct 2022 14:50:51 +0000 (10:50 -0400)]
btrfs: move BTRFS_FS_STATE* definitions and helpers to fs.h
We're going to use fs.h to hold fs wide related helpers and definitions,
move the FS_STATE enum and related helpers to fs.h, and then update all
files that need these definitions to include fs.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 19 Oct 2022 14:50:50 +0000 (10:50 -0400)]
btrfs: push printk index code into their respective helpers
The printk index work can be pushed into the printk helpers themselves,
this allows us to further sanitize messages.h, removing the last
include in the header itself.
Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 19 Oct 2022 14:50:49 +0000 (10:50 -0400)]
btrfs: move the printk helpers out of ctree.h
We have a bunch of printk helpers that are in ctree.h. These have
nothing to do with ctree.c, so move them into their own header.
Subsequent patches will cleanup the printk helpers.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 19 Oct 2022 14:50:48 +0000 (10:50 -0400)]
btrfs: move assert helpers out of ctree.h
These call functions that aren't defined in, or will be moved out of,
ctree.h Move them to super.c where the other assert/error message code
is defined. Drop the __noreturn attribute for btrfs_assertfail as
objtool does not like it and fails with warnings like
fs/btrfs/dir-item.o: warning: objtool: .text.unlikely: unexpected end of section
fs/btrfs/xattr.o: warning: objtool: btrfs_setxattr() falls through to next function btrfs_setxattr_trans.cold()
fs/btrfs/xattr.o: warning: objtool: .text.unlikely: unexpected end of section
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 19 Oct 2022 14:50:47 +0000 (10:50 -0400)]
btrfs: move fs wide helpers out of ctree.h
We have several fs wide related helpers in ctree.h. The bulk of these
are the incompat flag test helpers, but there are things such as
btrfs_fs_closing() and the read only helpers that also aren't directly
related to the ctree code. Move these into a fs.h header, which will
serve as the location for file system wide related helpers.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 18 Oct 2022 14:06:38 +0000 (16:06 +0200)]
btrfs: simplify generation check in btrfs_get_dentry
Callers that pass non-zero generation always want to perform the
generation check, we can simply encode that in one parameter and drop
check_generation. Add function documentation.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 26 Jul 2022 18:54:10 +0000 (20:54 +0200)]
btrfs: auto enable discard=async when possible
There's a request to automatically enable async discard for capable
devices. We can do that, the async mode is designed to wait for larger
freed extents and is not intrusive, with limits to iops, kbps or latency.
The status and tunables will be exported in /sys/fs/btrfs/FSID/discard .
The automatic selection is done if there's at least one discard capable
device in the filesystem (not capable devices are skipped). Mounting
with any other discard option will honor that option, notably mounting
with nodiscard will keep it disabled.
David Sterba [Fri, 30 Sep 2022 15:23:01 +0000 (17:23 +0200)]
btrfs: sysfs: convert remaining scnprintf to sysfs_emit
The sysfs_emit is the safe API for writing to the sysfs files,
previously converted from scnprintf, there's one left to do in
btrfs_read_policy_show.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 14 Oct 2022 14:00:41 +0000 (10:00 -0400)]
btrfs: do not panic if we can't allocate a prealloc extent state
We sometimes have to allocate new extent states when clearing or setting
new bits in an extent io tree. Generally we preallocate this before
taking the tree spin lock, but we can use this preallocated extent state
sometimes and then need to try to do a GFP_ATOMIC allocation under the
lock.
Unfortunately sometimes this fails, and then we hit the BUG_ON() and
bring the box down. This happens roughly 20 times a week in our fleet.
However the vast majority of callers use GFP_NOFS, which means that if
this GFP_ATOMIC allocation fails, we could simply drop the spin lock, go
back and allocate a new extent state with our given gfp mask, and begin
again from where we left off.
For the remaining callers that do not use GFP_NOFS, they are generally
using GFP_NOWAIT, which still allows for some reclaim. So allow these
allocations to attempt to happen outside of the spin lock so we don't
need to rely on GFP_ATOMIC allocations.
This in essence creates an infinite loop for anything that isn't
GFP_NOFS. To address this we may want to migrate to using mempools for
extent states so that we will always have emergency reserves in order to
make our allocations.
Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 14 Oct 2022 14:00:39 +0000 (10:00 -0400)]
btrfs: do not use GFP_ATOMIC in the read endio
We have done read endio in an async thread for a very, very long time,
which makes the use of GFP_ATOMIC and unlock_extent_atomic() unneeded in
our read endio path. We've noticed under heavy memory pressure in our
fleet that we can fail these allocations, and then often trip a
BUG_ON(!allocation), which isn't an ideal outcome. Begin to address
this by simply not using GFP_ATOMIC, which will allow us to do things
like actually allocate a extent state when doing
set_extent_bits(UPTODATE) in the endio handler.
End io handlers are not called in atomic context, besides we have been
allocating failrec with GFP_NOFS so we'd notice there's a problem.
Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: skip update of block group item if used bytes are the same
[BACKGROUND]
When committing a transaction, we will update block group items for all
dirty block groups.
But in fact, dirty block groups don't always need to update their block
group items.
It's pretty common to have a metadata block group which experienced
several COW operations, but still have the same amount of used bytes.
In that case, we may unnecessarily COW a tree block doing nothing.
[ENHANCEMENT]
This patch will introduce btrfs_block_group::commit_used member to
remember the last used bytes, and use that new member to skip
unnecessary block group item update.
This would be more common for large filesystems, where metadata block
group can be as large as 1GiB, containing at most 64K metadata items.
In that case, if COW added and then deleted one metadata item near the
end of the block group, then it's completely possible we don't need to
touch the block group item at all.
[BENCHMARK]
The change itself can have quite a high chance (20~80%) to skip block
group item updates in lot of workloads.
As a result, it would result shorter time spent on
btrfs_write_dirty_block_groups(), and overall reduce the execution time
of the critical section of btrfs_commit_transaction().
Here comes a fio command, which will do random writes in 4K block size,
causing a very heavy metadata updates.
The file size (512M) and number of threads (4) means 2GiB file size in
total, but during the full 300s run time, my dedicated SATA SSD is able
to write around 20~25GiB, which is over 10 times the file size.
Thus after we fill the initial 2G, we should not cause much block group
item updates.
Please note, the fio numbers by themselves don't have much change, but
if we look deeper, there is some reduced execution time, especially for
the critical section of btrfs_commit_transaction().
I added extra trace_printk() to measure the following per-transaction
execution time:
- Critical section of btrfs_commit_transaction()
By re-using the existing update_commit_stats() function, which
has already calculated the interval correctly.
- The while() loop for btrfs_write_dirty_block_groups()
Although this includes the execution time of btrfs_run_delayed_refs(),
it should still be representative overall.
Both result involves transid 7~30, the same amount of transaction
committed.
The result looks like this:
| Before | After | Diff
----------------------+-------------------+----------------+--------
Transaction interval | 229247198.5 | 215016933.6 | -6.2%
Block group interval | 23133.33333 | 18970.83333 | -18.0%
The change in block group item updates is more obvious, as skipped block
group item updates also mean less delayed refs.
And the overall execution time for that block group update loop is
pretty small, thus we can assume the extent tree is already mostly
cached. If we can skip an uncached tree block, it would cause more
obvious change.
Unfortunately the overall reduction in commit transaction critical
section is much smaller, as the block group item updates loop is not
really the major part, at least not for the above fio script.
But still we have a observable reduction in the critical section.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 9 Sep 2022 15:43:06 +0000 (17:43 +0200)]
btrfs: convert __TRANS_* defines to enum bits
The base transaction bits can be defined as bits in a contiguous
sequence, although right now there's a hole from bit 1 to 8.
The bits are used for btrfs_trans_handle::type, and there's another set
of TRANS_STATE_* defines that are for btrfs_transaction::state. They are
mutually exclusive though the hole in the sequence looks like was made
for the states.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 9 Sep 2022 15:20:25 +0000 (17:20 +0200)]
btrfs: add helper for bit enumeration
Define helper macro that can be used in enum {} to utilize the automatic
increment to define all bits without directly defining the values or
using additional linear bits.
1. capture the sequence value, N
2. use the value to define the given enum with N-th bit set
3. reset the sequence back to N
Use for enums that do not require fixed values for symbolic names (like
for on-disk structures):
But it's not the case, some exit functions don't match their init
functions sequence in init_btrfs_fs().
Furthermore in init_btrfs_fs(), we need to have a new error label for
each new init function we added. This is not really expandable,
especially recently we may add several new functions to init_btrfs_fs().
[ENHANCEMENT]
The patch will introduce the following things to enhance the situation:
- struct init_sequence
Just a wrapper of init and exit function pointers.
The init function must use int type as return value, thus some init
functions need to be updated to return 0.
The exit function can be NULL, as there are some init sequence just
outputting a message.
- struct mod_init_seq[] array
This is a const array, recording all the initialization we need to do
in init_btrfs_fs(), and the order follows the old init_btrfs_fs().
- bool mod_init_result[] array
This is a bool array, recording if we have initialized one entry in
mod_init_seq[].
The reason to split mod_init_seq[] and mod_init_result[] is to avoid
section mismatch in reference.
All init function are in .init.text, but if mod_init_seq[] records
the @initialized member it can no longer be const, thus will be put
into .data section, and cause modpost warning.
For init_btrfs_fs() we just call all init functions in their order in
mod_init_seq[] array, and after each call, setting corresponding
mod_init_result[] to true.
For exit_btrfs_fs() and error handling path of init_btrfs_fs(), we just
iterate mod_init_seq[] in reverse order, and skip all uninitialized
entry.
With this patch, init_btrfs_fs()/exit_btrfs_fs() will be much easier to
expand and will always follow the strict order.
Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 14 Oct 2022 13:44:33 +0000 (14:44 +0100)]
btrfs: remove gfp_t flag from btrfs_tree_mod_log_insert_key()
All callers of btrfs_tree_mod_log_insert_key() are now passing a GFP_NOFS
flag to it, so remove the flag from it and from alloc_tree_mod_elem() and
use it directly within alloc_tree_mod_elem().
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 13 Oct 2022 10:36:25 +0000 (11:36 +0100)]
btrfs: switch GFP_ATOMIC to GFP_NOFS when fixing up low keys
When fixing up the first key of each node above the current level, at
fixup_low_keys(), we are doing a GFP_ATOMIC allocation for inserting an
operation record for the tree mod log. However we can do just fine with
GFP_NOFS nowadays. The need for GFP_ATOMIC was for the old days when we
had custom locks with spinning behaviour for extent buffers and we were
in spinning mode while at fixup_low_keys(). Now we use rw semaphores for
extent buffer locks, so we can safely use GFP_NOFS.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Boris Burkov [Thu, 13 Oct 2022 22:52:10 +0000 (15:52 -0700)]
btrfs: re-check reclaim condition in reclaim worker
I have observed the following case play out and lead to unnecessary
relocations:
1. write a file across multiple block groups
2. delete the file
3. several block groups fall below the reclaim threshold
4. reclaim the first, moving extents into the others
5. reclaim the others which are now actually very full, leading to poor
reclaim behavior with lots of writing, allocating new block groups,
etc.
I believe the risk of missing some reasonable reclaims is worth it
when traded off against the savings of avoiding overfull reclaims.
Going forward, it could be interesting to make the check more advanced
(zoned aware, fragmentation aware, etc...) so that it can be a really
strong signal both at extent delete and reclaim time.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
Boris Burkov [Thu, 13 Oct 2022 22:52:09 +0000 (15:52 -0700)]
btrfs: skip reclaim if block_group is empty
As we delete extents from a block group, at some deletion we cross below
the reclaim threshold. It is possible we are still in the middle of
deleting more extents and might soon hit 0. If the block group is empty
by the time the reclaim worker runs, we will still relocate it.
This works just fine, as relocating an empty block group ultimately
results in properly deleting it. However, we have more direct ways of
removing empty block groups in the cleaner thread. Those are either
async discard or the unused_bgs list. In fact, when we decide whether to
relocate a block group during extent deletion, we do check for emptiness
and prefer the discard/unused_bgs mechanisms when possible.
Not using relocation for this case reduces some modest overhead from
empty bg relocation:
- extra transactions
- extra metadata use/churn for creating relocation metadata
- trying to read the extent tree to look for extents (and in this case
finding none)
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 11 Oct 2022 12:17:09 +0000 (13:17 +0100)]
btrfs: avoid unnecessary resolution of indirect backrefs during fiemap
During fiemap, when determining if a data extent is shared or not, if we
don't find the extent is directly shared, then we need to determine if
it's shared through subtrees. For that we need to resolve the indirect
reference we found in order to figure out the path in the inode's fs tree,
which is a path starting at the fs tree's root node and going down to the
leaf that contains the file extent item that points to the data extent.
We then proceed to determine if any extent buffer in that path is shared
with other trees or not.
However when the generation of the data extent is more recent than the
last generation used to snapshot the root, we don't need to determine
the path, since the data extent can not be shared through snapshots.
For this case we currently still determine the leaf of that path (at
find_parent_nodes(), but then stop determining the other nodes in the
path (at btrfs_is_data_extent_shared()) as it's pointless.
So do the check of the data extent's generation earlier, at
find_parent_nodes(), before trying to resolve the indirect reference to
determine the leaf in the path. This saves us from doing one expensive
b+tree search in the fs tree of our target inode, as well as other minor
work.
The following test was run on a non-debug kernel (Debian's default kernel
config):
$ cat test-fiemap.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
umount $DEV &> /dev/null
mkfs.btrfs -f $DEV
# Use compression to quickly create files with a lot of extents
# (each with a size of 128K).
mount -o compress=lzo $DEV $MNT
# 40G gives 327680 extents, each with a size of 128K.
xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar
# Add some more files to increase the size of the fs and extent
# trees (in the real world there's a lot of files and extents
# from other files).
xfs_io -f -c "pwrite -S 0xcd -b 1M 0 20G" $MNT/file1
xfs_io -f -c "pwrite -S 0xef -b 1M 0 20G" $MNT/file2
xfs_io -f -c "pwrite -S 0x73 -b 1M 0 20G" $MNT/file3
(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 1285 milliseconds (metadata not cached)
/mnt/sdi/foobar: 327680 extents found
fiemap took 742 milliseconds (metadata cached)
After applying this patch:
(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 689 milliseconds (metadata not cached)
/mnt/sdi/foobar: 327680 extents found
fiemap took 393 milliseconds (metadata cached)
That's a -46.4% total reduction for the metadata not cached case, and
a -47.0% reduction for the cached metadata case.
The test is somewhat limited in the sense the gains may be higher in
practice, because in the test the filesystem is small, so we have small
fs and extent trees, plus there's no concurrent access to the trees as
well, therefore no lock contention there.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 11 Oct 2022 12:17:08 +0000 (13:17 +0100)]
btrfs: avoid duplicated resolution of indirect backrefs during fiemap
During fiemap, when determining if a data extent is shared or not, if we
don't find the extent is directly shared, then we need to determine if
it's shared through subtrees. For that we need to resolve the indirect
reference we found in order to figure out the path in the inode's fs tree,
which is a path starting at the fs tree's root node and going down to the
leaf that contains the file extent item that points to the data extent.
We then proceed to determine if any extent buffer in that path is shared
with other trees or not.
Currently whenever we find the data extent that a file extent item points
to is not directly shared, we always resolve the path in the fs tree, and
then check if any extent buffer in the path is shared. This is a lot of
work and when we have file extent items that belong to the same leaf, we
have the same path, so we only need to calculate it once.
This change does that, it keeps track of the current and previous leaf,
and when we find that a data extent is not directly shared, we try to
compute the fs tree path only once and then use it for every other file
extent item in the same leaf, using the existing cached path result for
the leaf as long as the cache results are valid.
This saves us from doing expensive b+tree searches in the fs tree of our
target inode, as well as other minor work.
The following test was run on a non-debug kernel (Debian's default kernel
config):
$ cat test-with-snapshots.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
umount $DEV &> /dev/null
mkfs.btrfs -f $DEV
# Use compression to quickly create files with a lot of extents
# (each with a size of 128K).
mount -o compress=lzo $DEV $MNT
# 40G gives 327680 extents, each with a size of 128K.
xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar
# Add some more files to increase the size of the fs and extent
# trees (in the real world there's a lot of files and extents
# from other files).
xfs_io -f -c "pwrite -S 0xcd -b 1M 0 20G" $MNT/file1
xfs_io -f -c "pwrite -S 0xef -b 1M 0 20G" $MNT/file2
xfs_io -f -c "pwrite -S 0x73 -b 1M 0 20G" $MNT/file3
# Create a snapshot so all the extents become indirectly shared
# through subtrees, with a generation less than or equals to the
# generation used to create the snapshot.
btrfs subvolume snapshot -r $MNT $MNT/snap1
(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 1204 milliseconds (metadata not cached)
/mnt/sdi/foobar: 327680 extents found
fiemap took 729 milliseconds (metadata cached)
Result after applying this patch:
(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 732 milliseconds (metadata not cached)
/mnt/sdi/foobar: 327680 extents found
fiemap took 421 milliseconds (metadata cached)
That's a -46.1% total reduction for the metadata not cached case, and
a -42.2% reduction for the cached metadata case.
The test is somewhat limited in the sense the gains may be higher in
practice, because in the test the filesystem is small, so we have small
fs and extent trees, plus there's no concurrent access to the trees as
well, therefore no lock contention there.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 11 Oct 2022 12:17:07 +0000 (13:17 +0100)]
btrfs: move up backref sharedness cache store and lookup functions
Move the static functions to lookup and store sharedness check of an
extent buffer to a location above find_all_parents(), because in the
next patch the lookup function will be used by find_all_parents().
The store function is also moved just because it's the counter part
to the lookup function and it's best to have their definitions close
together.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 11 Oct 2022 12:17:06 +0000 (13:17 +0100)]
btrfs: cache sharedness of the last few data extents during fiemap
During fiemap we process all the file extent items of an inode, by their
file offset order (left to right b+tree order), and then check if the data
extent they point at is shared or not. Until now we didn't cache those
results, we only did it for b+tree nodes/leaves since for each unique
b+tree path we have access to hundreds of file extent items. However, it
is also common to repeat checking the sharedness of a particular data
extent in a very short time window, and the cases that lead to that are
the following:
So during fiemap we well check for the sharedness of the data extent
with bytenr X twice. Typically for COW writes and for at least
moderately updated files, we end up with many file extent items that
point to different sections of the same data extent.
2) Writing into a NOCOW file after a snapshot is taken.
This happens if the target extent was created in a generation older
than the generation where the last snapshot for the root (the tree the
inode belongs to) was made.
This leads to a scenario like the previous one.
3) Writing into sections of a preallocated extent.
So we end up with 4 consecutive file extent items pointing to the data
extent at bytenr X.
4) Hole punching in the middle of an extent.
For example if a file has the following file extent item:
[ bytenr X, offset = 0, num_bytes = 8M ]
0 8M
And then hole is punched for the file range [4M, 6M[, we our file
extent item split into two:
[ bytenr X, offset = 0, num_bytes = 4M ]
0 4M
[ 2M hole, implicit or explicit depending on NO_HOLES feature ]
4M 6M
[ bytenr X, offset = 6M, num_bytes = 2M ]
6M 8M
Again, we end up with two file extent items pointing to the same
data extent.
5) When reflinking (clone and deduplication) within the same file.
This is probably the least common case of all.
In cases 1, 2, 4 and 4, when we have multiple file extent items that point
to the same data extent, their distance is usually short, typically
separated by a few slots in a b+tree leaf (or across sibling leaves). For
case 5, the distance can vary a lot, but it's typically the less common
case.
This change caches the result of the sharedness checks for data extents,
but only for the last 8 extents that we notice that our inode refers to
with multiple file extent items. Whenever we want to check if a data
extent is shared, we lookup the cache which consists of doing a linear
scan of an 8 elements array, and if we find the data extent there, we
return the result and don't check the extent tree and delayed refs.
The array/cache is small so that doing the search has no noticeable
negative impact on the performance in case we don't have file extent items
within a distance of 8 slots that point to the same data extent.
Slots in the cache/array are overwritten in a simple round robin fashion,
as that approach fits very well.
Using this simple approach with only the last 8 data extents seen is
effective as usually when multiple file extents items point to the same
data extent, their distance is within 8 slots. It also uses very little
memory and the time to cache a result or lookup the cache is negligible.
The following test was run on non-debug kernel (Debian's default kernel
config) to measure the impact in the case of COW writes (first example
given above), where we run fiemap after overwriting 33% of the blocks of
a file:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
umount $DEV &> /dev/null
mkfs.btrfs -f $DEV
mount $DEV $MNT
FILE_SIZE=$((1 * 1024 * 1024 * 1024))
# Create the file full of 1M extents.
xfs_io -f -s -c "pwrite -b 1M -S 0xab 0 $FILE_SIZE" $MNT/foobar
block_count=$((FILE_SIZE / 4096))
# Overwrite about 33% of the file blocks.
overwrite_count=$((block_count / 3))
The test is somewhat limited in the sense the gains may be higher in
practice, because in the test the filesystem is small, so we have small
fs and extent trees, plus there's no concurrent access to the trees as
well, therefore no lock contention there.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 11 Oct 2022 12:17:05 +0000 (13:17 +0100)]
btrfs: remove useless logic when finding parent nodes
At find_parent_nodes(), at its last step, when iterating over all direct
references, we are checking if we have a share context and if we have
a reference with a different root from the one in the share context.
However that logic is pointless because of two reasons:
1) After the previous patch in the series (subject "btrfs: remove roots
ulist when checking data extent sharedness"), the roots argument is
always NULL when using a share check context (struct share_check), so
this code is never triggered;
2) Even before that previous patch, we could not hit this code because
if we had a reference with a root different from the one in our share
context, then we would have exited earlier when doing either of the
following:
- Adding a second direct ref to the direct refs red black tree
resulted in extent_is_shared() returning true when called from
add_direct_ref() -> add_prelim_ref(), after processing delayed
references or while processing references in the extent tree;
- When adding a second reference to the indirect refs red black
tree (same as above, extent_is_shared() returns true);
- If we only have one indirect reference and no direct references,
then when resolving it at resolve_indirect_refs() we immediately
return that the target extent is shared, therefore never reaching
that loop that iterates over all direct references at
find_parent_nodes();
- If we have 1 indirect reference and 1 direct reference, then we
also exit early because extent_is_shared() ends up returning true
when called through add_prelim_ref() (by add_direct_ref() or
add_indirect_ref()) or add_delayed_refs(). Same applies as when
having a combination of direct, indirect and indirect with missing
key references.
This logic had been obsoleted since commit 3ec4d3238ab165 ("btrfs:
allow backref search checks for shared extents"), which introduced the
early exits in case an extent is shared.
So just remove that logic, and assert at find_parent_nodes() that when we
have a share context we don't have a roots ulist and that we haven't found
the extent to be directly shared after processing delayed references and
all references from the extent tree.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 11 Oct 2022 12:17:04 +0000 (13:17 +0100)]
btrfs: remove roots ulist when checking data extent sharedness
Currently btrfs_is_data_extent_shared() is passing a ulist for the roots
argument of find_parent_nodes(), however it does not use that ulist for
anything and for this context that list always ends up with at most one
element.
Since find_parent_nodes() is able to deal with a NULL ulist for its roots
argument, make btrfs_is_data_extent_shared() pass it NULL and avoid the
burden of allocating memory for the unnused roots ulist, initializing it,
releasing it and allocating one struct ulist_node for it during the call
to find_parent_nodes().
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 11 Oct 2022 12:17:03 +0000 (13:17 +0100)]
btrfs: move ulists to data extent sharedness check context
When calling btrfs_is_data_extent_shared() we pass two ulists that were
allocated by the caller. This is because the single caller, fiemap, calls
btrfs_is_data_extent_shared() multiple times and the ulists can be reused,
instead of allocating new ones before each call and freeing them after
each call.
Now that we have a context structure/object that we pass to
btrfs_is_data_extent_shared(), we can move those ulists to it, and hide
their allocation and the context's allocation in a helper function, as
well as the freeing of the ulists and the context object. This allows to
reduce the number of parameters passed to btrfs_is_data_extent_shared(),
the need to pass the ulists from extent_fiemap() to fiemap_process_hole()
and having the caller deal with allocating and releasing the ulists.
Also rename one of the ulists from 'tmp' / 'tmp_ulist' to 'refs', since
that's a much better name as it reflects what the list is used for (and
matching the argument name for find_parent_nodes()).
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 11 Oct 2022 12:17:02 +0000 (13:17 +0100)]
btrfs: turn the backref sharedness check cache into a context object
Right now we are using a struct btrfs_backref_shared_cache to pass state
across multiple btrfs_is_data_extent_shared() calls. The structure's name
closely follows its current purpose, which is to cache previous checks
for the sharedness of metadata extents. However we will start using the
structure for more things other than caching sharedness checks, so rename
it to struct btrfs_backref_share_check_ctx.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 11 Oct 2022 12:17:01 +0000 (13:17 +0100)]
btrfs: directly pass the inode to btrfs_is_data_extent_shared()
Currently we pass a root and an inode number as arguments for
btrfs_is_data_extent_shared() and the inode number is always from an
inode that belongs to that root (it wouldn't make sense otherwise).
In every context that we call btrfs_is_data_extent_shared() (fiemap only),
we have an inode available, so directly pass the inode to the function
instead of a root and inode number. This reduces the number of parameters
and it makes the function's signature conform to most other functions we
have.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 11 Oct 2022 12:17:00 +0000 (13:17 +0100)]
btrfs: remove checks for a 0 inode number during backref walking
When doing backref walking to determine if an extent is shared, we are
testing if the inode number, stored in the 'inum' field of struct
share_check, is 0. However that can never be case, since the all instances
of the structure are created at btrfs_is_data_extent_shared(), which
always initializes it with the inode number from a fs tree (and the number
for any inode from any tree can never be 0). So remove the checks.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 11 Oct 2022 12:16:59 +0000 (13:16 +0100)]
btrfs: remove checks for a root with id 0 during backref walking
When doing backref walking to determine if an extent is shared, we are
testing the root_objectid of the given share_check struct is 0, but that
is an impossible case, since btrfs_is_data_extent_shared() always
initializes the root_objectid field with the id of the given root, and
no root can have an objectid of 0. So remove those checks.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 11 Oct 2022 12:16:58 +0000 (13:16 +0100)]
btrfs: drop redundant bflags initialization when allocating extent buffer
When allocating an extent buffer, at __alloc_extent_buffer(), there's no
point in explicitly assigning zero to the bflags field of the new extent
buffer because we allocated it with kmem_cache_zalloc().
So just remove the redundant initialization, it saves one mov instruction
in the generated assembly code for x86_64 ("movq $0x0,0x10(%rax)").
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 11 Oct 2022 12:16:57 +0000 (13:16 +0100)]
btrfs: drop pointless memset when cloning extent buffer
At btrfs_clone_extent_buffer(), before allocating the pages array for the
new extent buffer we are calling memset() to zero out the pages array of
the extent buffer. This is pointless however, because the extent buffer
already has every element in its pages array pointing to NULL, as it was
allocated with kmem_cache_zalloc(). The memset() was introduced with
commit dd137dd1f2d719 ("btrfs: factor out allocating an array of pages"),
but even before that commit we already depended on the pages array being
initialized to NULL for the error paths that need to call
btrfs_release_extent_buffer().
So remove the memset(), it's useless and slightly increases the object
text size.
Before this change:
$ size fs/btrfs/extent_io.o
text data bss dec hex filename
70580 5469 40 76089 12939 fs/btrfs/extent_io.o
After this change:
$ size fs/btrfs/extent_io.o
text data bss dec hex filename
70564 5469 40 76073 12929 fs/btrfs/extent_io.o
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 11 Oct 2022 12:16:56 +0000 (13:16 +0100)]
btrfs: skip unnecessary delalloc search during fiemap and lseek
During fiemap and lseek (hole and data seeking), there's no point in
iterating the inode's io tree to count delalloc bits if the inode's
delalloc bytes counter has a value of zero, as that counter is updated
whenever we set a range for delalloc or clear a range from delalloc.
So skip the counting and io tree iteration if the inode's delalloc bytes
counter has a value of zero. This helps save time when processing a file
range corresponding to a hole or prealloc (unwritten) extent.
This patch is part of a series comprised of the following patches:
btrfs: get the next extent map during fiemap/lseek more efficiently
btrfs: skip unnecessary extent map searches during fiemap and lseek
btrfs: skip unnecessary delalloc search during fiemap and lseek
The following test was performed on a release kernel (Debian's default
kernel config) before and after applying those 3 patches.
# Wrapper to call fiemap in extent count only mode.
# (struct fiemap::fm_extent_count set to 0)
$ cat fiemap.c
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>
int main(int argc, char **argv)
{
struct fiemap fiemap = { 0 };
int fd;
Filipe Manana [Tue, 11 Oct 2022 12:16:55 +0000 (13:16 +0100)]
btrfs: skip unnecessary extent map searches during fiemap and lseek
If we have no outstanding extents it means we don't have any extent maps
corresponding to delalloc that is flushing, as when an ordered extent is
created we increment the number of outstanding extents to 1 and when we
remove the ordered extent we decrement them by 1. So skip extent map tree
searches if the number of outstanding ordered extents is 0, saving time as
the tree is not empty if we have previously made some reads or flushed
delalloc, as in those cases it can have a very large number of extent maps
for files with many extents.
This helps save time when processing a file range corresponding to a hole
or prealloc (unwritten) extent.
The next patch in the series has a performance test in its changelog and
its subject is:
"btrfs: skip unnecessary delalloc search during fiemap and lseek"
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 11 Oct 2022 12:16:54 +0000 (13:16 +0100)]
btrfs: get the next extent map during fiemap/lseek more efficiently
At find_delalloc_subrange(), when we need to get the next extent map, we
do a full search on the extent map tree (a red black tree). This is fine
but it's a lot more efficient to simply use rb_next(), which typically
requires iterating over less nodes of the tree and never needs to compare
the ranges of nodes with the one we are looking for.
So add a public helper to extent_map.{h,c} to get the extent map that
immediately follows another extent map, using rb_next(), and use that
helper at find_delalloc_subrange().
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Mon, 10 Oct 2022 10:36:10 +0000 (18:36 +0800)]
btrfs: raid56: make it more explicit that cache rbio should have all its data sectors uptodate
For Btrfs RAID56, we have a caching system for btrfs raid bios (rbio).
We call cache_rbio_pages() to mark a qualified rbio ready for cache.
The timing happens at:
- finish_rmw()
At this timing, we have already read all necessary sectors, along with
the rbio sectors, we have covered all data stripes.
- __raid_recover_end_io()
At this timing, we have rebuild the rbio, thus all data sectors
involved (either from stripe or bio list) are uptodate now.
Thus at the timing of cache_rbio_pages(), we should have all data
sectors uptodate.
This patch will make it explicit that all data sectors are uptodate at
cache_rbio_pages() timing, mostly to prepare for the incoming
verification at RMW time.
This patch will add:
- Extra ASSERT()s in cache_rbio_pages()
This is to make sure all data sectors, which are not covered by bio,
are already uptodate.
- Extra ASSERT()s in steal_rbio()
Since only cached rbio can be stolen, thus every data sector should
already be uptodate in the source rbio.
- Update __raid_recover_end_io() to update recovered sector->uptodate
Previously __raid_recover_end_io() will only mark failed sectors
uptodate if it's doing an RMW.
But this can trigger new ASSERT()s, as for recovery case, a recovered
failed sector will not be marked uptodate, and trigger ASSERT() in
later cache_rbio_pages() call.
Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Then update rbio pointers to point the extra space after the rbio
structure itself.
Thus it introduced a complex CONSUME_ALLOC() macro to help the thing.
This is too hacky, and is going to make later pointers expansion harder.
This patch will change it to use regular kcalloc() for each pointer
inside btrfs_raid_bio, making the later expansion much easier.
And introduce a helper free_raid_bio_pointers() to free up all the
pointer members in btrfs_raid_bio, which will be used in both
free_raid_bio() and error path of alloc_rbio().
Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 9 Sep 2022 13:35:01 +0000 (09:35 -0400)]
btrfs: introduce BTRFS_RESERVE_FLUSH_EMERGENCY
Inside of FB, as well as some user reports, we've had a consistent
problem of occasional ENOSPC transaction aborts. Inside FB we were
seeing ~100-200 ENOSPC aborts per day in the fleet, which is a really
low occurrence rate given the size of our fleet, but it's not nothing.
There are two causes of this particular problem.
First is delayed allocation. The reservation system for delalloc
assumes that contiguous dirty ranges will result in 1 file extent item.
However if there is memory pressure that results in fragmented writeout,
or there is fragmentation in the block groups, this won't necessarily be
true. Consider the case where we do a single 256MiB write to a file and
then close it. We will have 1 reservation for the inode update, the
reservations for the checksum updates, and 1 reservation for the file
extent item. At some point later we decide to write this entire range
out, but we're so fragmented that we break this into 100 different file
extents. Since we've already closed the file and are no longer writing
to it there's nothing to trigger a refill of the delalloc block rsv to
satisfy the 99 new file extent reservations we need. At this point we
exhaust our delalloc reservation, and we begin to steal from the global
reserve. If you have enough of these cases going in parallel you can
easily exhaust the global reserve, get an ENOSPC at
btrfs_alloc_tree_block() time, and then abort the transaction.
The other case is the delayed refs reserve. The delayed refs reserve
updates its size based on outstanding delayed refs and dirty block
groups. However we only refill this block reserve when returning
excess reservations and when we call btrfs_start_transaction(root, X).
We will reserve 2*X credits at transaction start time, and fill in X
into the delayed refs reserve to make sure it stays topped off.
Generally this works well, but clearly has downsides. If we do a
particularly delayed ref heavy operation we may never catch up in our
reservations. Additionally running delayed refs generates more delayed
refs, and at that point we may be committing the transaction and have no
way to trigger a refill of our delayed refs rsv. Then a similar thing
occurs with the delalloc reserve.
Generally speaking we well over-reserve in all of our block rsvs. If we
reserve 1 credit we're usually reserving around 264k of space, but we'll
often not use any of that reservation, or use a few blocks of that
reservation. We can be reasonably sure that as long as you were able to
reserve space up front for your operation you'll be able to find space
on disk for that reservation.
So introduce a new flushing state, BTRFS_RESERVE_FLUSH_EMERGENCY. This
gets used in the case that we've exhausted our reserve and the global
reserve. It simply forces a reservation if we have enough actual space
on disk to make the reservation, which is almost always the case. This
keeps us from hitting ENOSPC aborts in these odd occurrences where we've
not kept up with the delayed work.
Fixing this in a complete way is going to be relatively complicated and
time consuming. This patch is what I discussed with Filipe earlier this
year, and what I put into our kernels inside FB. With this patch we're
down to 1-2 ENOSPC aborts per week, which is a significant reduction.
This is a decent stop gap until we can work out a more wholistic
solution to these two corner cases.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 14 Sep 2022 15:06:41 +0000 (11:06 -0400)]
btrfs: move the btrfs_verity_descriptor_item defs up in ctree.h
These are wrapped in CONFIG_FS_VERITY, but we can have the definitions
without verity enabled. Move these definitions up with the other
accessor helpers.
Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 14 Sep 2022 15:06:40 +0000 (11:06 -0400)]
btrfs: move btrfs_next_old_item into ctree.c
This uses btrfs_header_nritems, which I will be moving out of ctree.h.
In order to avoid needing to include the relevant header in ctree.h,
simply move this helper function into ctree.c.
Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com>
[ rename parameters ] Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 14 Sep 2022 15:06:39 +0000 (11:06 -0400)]
btrfs: move free space cachep's out of ctree.h
This is local to the free-space-cache.c code, remove it from ctree.h and
inode.c, create new init/exit functions for the cachep, and move it
locally to free-space-cache.c.
Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 14 Sep 2022 15:06:38 +0000 (11:06 -0400)]
btrfs: move btrfs_path_cachep out of ctree.h
This is local to the ctree code, remove it from ctree.h and inode.c,
create new init/exit functions for the cachep, and move it locally to
ctree.c.
Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 14 Sep 2022 15:06:37 +0000 (11:06 -0400)]
btrfs: move trans_handle_cachep out of ctree.h
This is local to the transaction code, remove it from ctree.h and
inode.c, create new helpers in the transaction to handle the init work
and move the cachep locally to transaction.c.
Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 14 Sep 2022 15:06:36 +0000 (11:06 -0400)]
btrfs: move btrfs_print_data_csum_error into inode.c
This isn't used outside of inode.c, there's no reason to define it in
btrfs_inode.h. Drop the inline and add __cold as it's for errors that
are not in any hot path.
Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 14 Sep 2022 15:06:35 +0000 (11:06 -0400)]
btrfs: move flush related definitions to space-info.h
This code is used in space-info.c, move the definitions to space-info.h.
Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 14 Sep 2022 15:06:34 +0000 (11:06 -0400)]
btrfs: move btrfs_should_fragment_free_space into block-group.c
This function uses functions that are not defined in block-group.h, move
it into block-group.c in order to keep the header clean.
Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 14 Sep 2022 15:06:32 +0000 (11:06 -0400)]
btrfs: move discard stat defs to free-space-cache.h
These definitions are used for discard statistics, move them out of
ctree.h and put them in free-space-cache.h.
Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 14 Sep 2022 15:06:31 +0000 (11:06 -0400)]
btrfs: move BTRFS_MAX_MIRRORS into scrub.c
This is only used locally in scrub.c, move it out of ctree.h into
scrub.c.
Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 14 Sep 2022 15:06:29 +0000 (11:06 -0400)]
btrfs: move btrfs_get_block_group helper out of disk-io.h
This inline helper calls btrfs_fs_compat_ro(), which is defined in
another header. To avoid weird header dependency problems move this
helper into disk-io.c with the rest of the global root helpers.
Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 14 Sep 2022 15:06:28 +0000 (11:06 -0400)]
btrfs: move btrfs on-disk definitions out of ctree.h
The bulk of our on-disk definitions exist in btrfs_tree.h, which user
space can use. Keep things consistent and move the rest of the on disk
definitions out of ctree.h into btrfs_tree.h. Note I did have to update
all u8's to __u8, but otherwise this is a strict copy and paste.
Most of the definitions are mainly for internal use and are not
guaranteed stable public API and may change as we need. Compilation
failures by user applications can happen.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comments, style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 14 Sep 2022 15:06:27 +0000 (11:06 -0400)]
btrfs: remove unused BTRFS_IOPRIO_READA
The last user of this definition was removed in patch f26c92386028
("btrfs: remove reada infrastructure") so we can remove this definition.
Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
This hasn't been used since 138a12d86574 ("btrfs: rip out
btrfs_space_info::total_bytes_pinned") so it is safe to remove.
Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
The last users of these helpers were removed in 5297199a8bca ("btrfs:
remove inode number cache feature") so delete these helpers.
The point was for mount options that were applicable after transaction
commit so they could not be applied immediately. We don't have such
options anymore and if we do the patch can be reverted.
Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 30 Sep 2022 20:45:13 +0000 (16:45 -0400)]
btrfs: add cached_state to read_extent_buffer_subpage
We don't use a cached state here at all, which generally makes sense as
async reads are going to unlock at endio time. However for blocking
reads we will call wait_extent_bit() for our range. Since the
lock_extent() stuff will return the cached_state for the start of the
range this is a helpful optimization to have for this case, we'll have
the exact state we want to wait on. Add a cached state here and simply
throw it away if we're a non-blocking read, otherwise we'll get a small
improvement by eliminating some tree searches.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 30 Sep 2022 20:45:12 +0000 (16:45 -0400)]
btrfs: cache the failed state when locking extents
Currently if we fail to lock a range we'll return the start of the range
that we failed to lock. We'll then search down to this range and wait
on any extent states in this range.
However we can avoid this search altogether if we simply cache the
extent_state that had the contention. We can pass this into
wait_extent_bit() and start from that extent_state without doing the
search. In the most optimistic case we can avoid all searches, more
likely we'll avoid the initial search and have to perform the search
after we wait on the failed state, or worst case we must search both
times which is what currently happens.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 30 Sep 2022 20:45:11 +0000 (16:45 -0400)]
btrfs: use a cached_state everywhere in relocation
All of the relocation code avoids using the cached state, despite
everywhere using the normal
lock_extent()
// do something
unlock_extent()
pattern. Fix this by plumbing a cached state throughout all of these
functions in order to allow for less tree searches.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 30 Sep 2022 20:45:10 +0000 (16:45 -0400)]
btrfs: use cached_state for btrfs_check_nocow_lock
Now that try_lock_extent() takes a cached_state, plumb the cached_state
through btrfs_try_lock_ordered_range() and then use a cached_state in
btrfs_check_nocow_lock everywhere to avoid extra tree searches on the
extent_io_tree.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 30 Sep 2022 20:45:09 +0000 (16:45 -0400)]
btrfs: add a cached_state to try_lock_extent
With nowait becoming more pervasive throughout our codebase go ahead and
add a cached_state to try_lock_extent(). This allows us to be faster
about clearing the locked area if we have contention, and then gives us
the same optimization for unlock if we are able to lock the range.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
It has been reported to cause huge performance regressions on some loads
(will-it-scale.per_process_ops, but also building the kernel with
clang).
The commit did speed up gcc builds by a small amount, so it's not an
unambiguous regression, but until the big regressions are understood,
let's revert it.
Jan Dabros [Mon, 28 Nov 2022 19:56:51 +0000 (20:56 +0100)]
char: tpm: Protect tpm_pm_suspend with locks
Currently tpm transactions are executed unconditionally in
tpm_pm_suspend() function, which may lead to races with other tpm
accessors in the system.
Specifically, the hw_random tpm driver makes use of tpm_get_random(),
and this function is called in a loop from a kthread, which means it's
not frozen alongside userspace, and so can race with the work done
during system suspend:
Linus Torvalds [Sun, 4 Dec 2022 20:33:44 +0000 (12:33 -0800)]
Merge tag 'timers_urgent_for_v6.1_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Borislav Petkov:
- Revert a fix to RISC-V timers supposed to address an uncertainty
whether clock events are received during S3 or not which locks up
other RISC-V platforms. The issue will be fixed differently later.
* tag 'timers_urgent_for_v6.1_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert "clocksource/drivers/riscv: Events are stopped during CPU suspend"
Linus Torvalds [Sun, 4 Dec 2022 20:24:58 +0000 (12:24 -0800)]
Merge tag 'powerpc-6.1-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Fix oops in 32-bit BPF tail call tests
- Add missing declaration for machine_check_early_boot()
Thanks to Christophe Leroy and Naveen N. Rao.
* tag 'powerpc-6.1-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s: Add missing declaration for machine_check_early_boot()
powerpc/bpf/32: Fix Oops on tail call tests
Linus Torvalds [Sat, 3 Dec 2022 21:51:37 +0000 (13:51 -0800)]
Merge tag 'i2c-for-6.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"A power state fix in the core for ACPI devices, a regression fix
regarding bus recovery for the cadence driver, a DMA handling fix for
the imx driver, and two error path fixes (npcm7xx and qcom-geni)"
* tag 'i2c-for-6.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: imx: Only DMA messages with I2C_M_DMA_SAFE flag set
i2c: qcom-geni: fix error return code in geni_i2c_gpi_xfer
i2c: cadence: Fix regression with bus recovery
i2c: Restore initial power state if probe fails
i2c: npcm7xx: Fix error handling in npcm_i2c_init()
Linus Torvalds [Sat, 3 Dec 2022 21:43:38 +0000 (13:43 -0800)]
Merge tag 'dax-fixes-6.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull dax fixes from Dan Williams:
"A few bug fixes around the handling of "Soft Reserved" memory and
memory tiering information.
Linux is starting to enounter more real world systems that deploy an
ACPI HMAT to describe different performance classes of memory, as well
the "special purpose" (Linux "Soft Reserved") designation from EFI.
These fixes result from that testing.
It has all appeared in -next for a while with no known issues.
- Fix duplicate overlapping device-dax instances for HMAT described
"Soft Reserved" Memory
- Fix missing node targets in the sysfs representation of memory
tiers
- Remove a confusing variable initialization"
* tag 'dax-fixes-6.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
device-dax: Fix duplicate 'hmem' device registration
ACPI: HMAT: Fix initiator registration for single-initiator systems
ACPI: HMAT: remove unnecessary variable initialization
Linus Torvalds [Sat, 3 Dec 2022 00:27:15 +0000 (16:27 -0800)]
Merge tag 'block-6.1-2022-12-02' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
"Just a small NVMe merge for this week, fixing protection of the name
space list, and a missing clear of a reserved field when unused"
* tag 'block-6.1-2022-12-02' of git://git.kernel.dk/linux:
nvme: fix SRCU protection of nvme_ns_head list
nvme-pci: clear the prp2 field when not used