Josef Bacik [Fri, 12 Mar 2021 20:25:28 +0000 (15:25 -0500)]
btrfs: do proper error handling in create_reloc_inode
We already handle some errors in this function, and the callers do the
correct error handling, so clean up the rest of the function to do the
appropriate error handling.
There's a little extra work that needs to be done here, as we create the
inode item before we create the orphan item. We could potentially add
the orphan item, but if we failed to create the inode item we would have
to abort the transaction.
Instead add a helper to delete the inode item we created in the case
that we're unable to look up the inode (this would likely be caused by
an ENOMEM), which if it succeeds means we can avoid a transaction abort
in this particular error case.
Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 12 Mar 2021 20:25:25 +0000 (15:25 -0500)]
btrfs: handle extent reference errors in do_relocation
We can already deal with errors appropriately from do_relocation, simply
handle any errors that come from changing the refs at this point
cleanly. We have to abort the transaction if we fail here as we've
modified metadata at this point.
Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 12 Mar 2021 20:25:24 +0000 (15:25 -0500)]
btrfs: handle errors in reference count manipulation in replace_path
If any of the reference count manipulation stuff fails in replace_path
we need to abort the transaction, as we've modified the blocks already.
We can simply break at this point and everything will be cleaned up.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 12 Mar 2021 20:25:23 +0000 (15:25 -0500)]
btrfs: handle btrfs_search_slot failure in replace_path
The search can fail for various reasons, in case of errors there's no
cleanup to be done so we can pass the error to the caller, adjusting for
the case where the key is not found and search slot returns 1.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 12 Mar 2021 20:25:20 +0000 (15:25 -0500)]
btrfs: do proper error handling in btrfs_update_reloc_root
We call btrfs_update_root in btrfs_update_reloc_root, which can fail for
all sorts of reasons, including IO errors. Instead of panicing the box
lets return the error, now that all callers properly handle those
errors.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 12 Mar 2021 20:25:15 +0000 (15:25 -0500)]
btrfs: validate root::reloc_root after recording root in trans
If we fail to setup a root->reloc_root in a different thread that path
will error out, however it still leaves root->reloc_root NULL but would
still appear set up in the transaction. Subsequent calls to
btrfs_record_root_in_transaction would succeed without attempting to
create the reloc root, as the transid has already been updated.
Handle this case by making sure we have a root->reloc_root set after a
btrfs_record_root_in_transaction call so we don't end up dereferencing a
NULL pointer.
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 12 Mar 2021 20:25:14 +0000 (15:25 -0500)]
btrfs: do proper error handling in create_reloc_root
We do memory allocations here, read blocks from disk, all sorts of
operations that could easily fail at any given point. Instead of
panicing the box, simply return the error back up the chain, all callers
at this point have proper error handling.
Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 12 Mar 2021 20:25:12 +0000 (15:25 -0500)]
btrfs: return an error from btrfs_record_root_in_trans
We can create a reloc root when we record the root in the trans, which
can fail for all sorts of different reasons. Propagate this error up
the chain of callers. Future patches will fix the callers of
btrfs_record_root_in_trans() to handle the error.
Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 12 Mar 2021 20:25:05 +0000 (15:25 -0500)]
btrfs: handle btrfs_record_root_in_trans failure in btrfs_recover_log_trees
btrfs_record_root_in_trans will return errors in the future, so handle
the error properly in btrfs_recover_log_trees.
This appears tricky, however we have a reference count on the
destination root, so if this fails we need to continue on in the loop to
make sure the proper cleanup is done.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ] Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 12 Mar 2021 20:25:01 +0000 (15:25 -0500)]
btrfs: do proper error handling in record_reloc_root_in_trans
Generally speaking this shouldn't ever fail, the corresponding fs root
for the reloc root will already be in memory, so we won't get ENOMEM
here.
However if there is no corresponding root for the reloc root then we
could get ENOMEM when we try to allocate it or we could get ENOENT
when we look it up and see that it doesn't exist.
Convert these BUG_ON()'s into ASSERT()'s and add proper error handling
for the case of corruption.
Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 12 Mar 2021 20:25:00 +0000 (15:25 -0500)]
btrfs: check record_root_in_trans related failures in select_reloc_root
We will record the fs root or the reloc root in the trans in
select_reloc_root. These will actually return errors in the following
patches, so check their return value here and return it up the stack.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 12 Mar 2021 20:24:59 +0000 (15:24 -0500)]
btrfs: convert BUG_ON()'s in select_reloc_root() to proper errors
We have several BUG_ON()'s in select_reloc_root() that can be tripped if
there is an extent tree corruption. Convert these to ASSERT()'s, because
if we hit it during testing it really is bad, or could indicate a
problem with the backref walking code.
However if users hit these problems it generally indicates corruption,
I've hit a few machines in the fleet that trip over these with clearly
corrupted extent trees, so be nice and print out an error message and
return an error instead of bringing the whole box down.
Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 12 Mar 2021 20:24:58 +0000 (15:24 -0500)]
btrfs: handle errors from select_reloc_root()
Currently select_reloc_root() doesn't return an error, but followup
patches will make it possible for it to return an error. We do have
proper error recovery in do_relocation however, so handle the
possibility of select_reloc_root() having an error properly instead of
BUG_ON(!root).
I've also adjusted select_reloc_root() to return ERR_PTR(-ENOENT) if we
don't find a root, instead of NULL, to make the error case easier to
deal with. I've replaced the BUG_ON(!root) with an ASSERT(0) for this
case as it indicates we messed up the backref walking code, but it could
also indicate corruption.
Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 12 Mar 2021 20:24:57 +0000 (15:24 -0500)]
btrfs: convert BUG_ON()'s in relocate_tree_block
We have a couple of BUG_ON()'s in relocate_tree_block() that can be
tripped if we have file system corruption. Convert these to ASSERT()'s
so developers still get yelled at when they break the backref code, but
error out nicely for users so the whole box doesn't go down.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 12 Mar 2021 20:24:56 +0000 (15:24 -0500)]
btrfs: convert some BUG_ON()'s to ASSERT()'s in do_relocation
A few of these are checking for correctness, and won't be triggered by
corrupted file systems, so convert them to ASSERT() instead of BUG_ON()
and add a comment explaining their existence.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Implement readahead_batch_length() to determine the number of bytes in
the current batch of readahead pages and use it in btrfs. Also use the
readahead_pos to get the offset.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Wan Jiabing [Thu, 1 Apr 2021 08:03:39 +0000 (16:03 +0800)]
btrfs: move forward declarations to the beginning of extent_io.h
There are two forward declarations deep in extent_io.h, move them
to the beginning and remove the duplicate one.
Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Wan Jiabing <wanjiabing@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 25 Mar 2021 07:14:44 +0000 (15:14 +0800)]
btrfs: make set_btree_ioerr accept extent buffer and be subpage compatible
Current set_btree_ioerr() only accepts @page parameter and grabs extent
buffer from page::private. This works fine for sector size == PAGE_SIZE
case, but not for subpage case.
Add an extra parameter, @eb, for callers to pass extent buffer to this
function, so that subpage code can reuse this function.
And also add subpage special handling to update
btrfs_subpage::error_bitmap.
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 25 Mar 2021 07:14:43 +0000 (15:14 +0800)]
btrfs: make set/clear_extent_buffer_dirty() subpage compatible
For set_extent_buffer_dirty() to support subpage sized metadata, just
call btrfs_page_set_dirty() to handle both cases.
For clear_extent_buffer_dirty(), it needs to clear the page dirty if and
only if all extent buffers in the page range are no longer dirty.
Also do the same for page error.
This is pretty different from the existing clear_extent_buffer_dirty()
routine, so add a new helper function,
clear_subpage_extent_buffer_dirty() to do this for subpage metadata.
Also since the main part of clearing page dirty code is still the same,
extract that into btree_clear_page_dirty() so that it can be utilized
for both cases.
But there is a special race between set_extent_buffer_dirty() and
clear_extent_buffer_dirty(), where we can clear the page dirty.
[POSSIBLE RACE WINDOW]
For the race window between clear_subpage_extent_buffer_dirty() and
set_extent_buffer_dirty(), due to the fact that we can't call
clear_page_dirty_for_io() under subpage spin lock, we can race like
below:
T1 (eb1 in the same page) | T2 (eb2 in the same page)
-------------------------------+------------------------------
set_extent_buffer_dirty() | clear_extent_buffer_dirty()
|- was_dirty = false; | |- clear_subpagE_extent_buffer_dirty()
| | |- btrfs_clear_and_test_dirty()
| | | Since eb2 is the last dirty page
| | | we got:
| | | last == true;
| | |
|- btrfs_page_set_dirty() | |
| We set the page dirty and | |
| subpage dirty bitmap | |
| | |- if (last)
| | | Since we don't have subpage lock
| | | held, now @last is no longer
| | | correct
| | |- btree_clear_page_dirty()
| | Now PageDirty == false, even if
| | we have dirty_bitmap not zero.
|- ASSERT(PageDirty()); |
^^^^ CRASH
The solution here is to also lock the eb->pages[0] for subpage case of
set_extent_buffer_dirty(), to prevent racing with
clear_extent_buffer_dirty().
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 25 Mar 2021 07:14:42 +0000 (15:14 +0800)]
btrfs: support page uptodate assertions in subpage mode
There are quite some assert checks on page uptodate in extent buffer
write accessors. They ensure the destination page is already uptodate.
This is fine for regular sector size case, but not for subpage case, as
for subpage we only mark the page uptodate if the page contains no hole
and all its extent buffers are uptodate.
So instead of checking PageUptodate(), for subpage case we check the
uptodate bitmap of btrfs_subpage structure.
To make the check more elegant, introduce a helper,
assert_eb_page_uptodate() to do the check for both subpage and regular
sector size cases.
Qu Wenruo [Thu, 25 Mar 2021 07:14:41 +0000 (15:14 +0800)]
btrfs: make alloc_extent_buffer() check subpage dirty bitmap
In alloc_extent_buffer(), we make sure that the newly allocated page is
never dirty.
This is fine for sector size == PAGE_SIZE case, but for subpage it's
possible that one extent buffer in the page is dirty, thus the whole
page is marked dirty, and could cause false alert.
To support subpage, call btrfs_page_test_dirty() to handle both cases.
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 25 Mar 2021 07:14:40 +0000 (15:14 +0800)]
btrfs: subpage: support metadata checksum calculation at write time
Add a new helper, csum_dirty_subpage_buffers(), to iterate through all
dirty extent buffers in one bvec.
Also extract the code of calculating csum for one extent buffer into
csum_one_extent_buffer(), so that both the existing csum_dirty_buffer()
and the new csum_dirty_subpage_buffers() can reuse the same routine.
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 25 Mar 2021 07:14:39 +0000 (15:14 +0800)]
btrfs: subpage: do more sanity checks on metadata page dirtying
For btree_set_page_dirty(), we should also check the extent buffer
sanity for subpage support.
Unlike the regular sector size case, since one page can contain multiple
extent buffers, we need to make sure there is at least one dirty extent
buffer in the page.
So this patch will iterate through the btrfs_subpage::dirty_bitmap
to get the extent buffers, and check if any dirty extent buffer in the page
range has EXTENT_BUFFER_DIRTY and proper refs.
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 25 Mar 2021 07:14:38 +0000 (15:14 +0800)]
btrfs: subpage: introduce helpers for writeback status
Introduces the following functions to handle subpage writeback status:
- btrfs_subpage_set_writeback()
- btrfs_subpage_clear_writeback()
- btrfs_subpage_test_writeback()
These helpers can only be called when the range is ensured to be
inside the page.
- btrfs_page_set_writeback()
- btrfs_page_clear_writeback()
- btrfs_page_test_writeback()
These helpers can handle both regular sector size and subpage without
problem.
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 25 Mar 2021 07:14:37 +0000 (15:14 +0800)]
btrfs: subpage: introduce helpers for dirty status
Introduce the following functions to handle subpage dirty status:
- btrfs_subpage_set_dirty()
- btrfs_subpage_clear_dirty()
- btrfs_subpage_test_dirty()
These helpers can only be called when the range is ensured to be
inside the page.
- btrfs_page_set_dirty()
- btrfs_page_clear_dirty()
- btrfs_page_test_dirty()
These helpers can handle both regular sector size and subpage without
problem.
Thus they would be used to replace PageDirty() related calls in
later patches.
There is one special point to note here, just like set_page_dirty() and
clear_page_dirty_for_io(), btrfs_*page_set_dirty() and
btrfs_*page_clear_dirty() must be called with page locked.
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 25 Mar 2021 07:14:34 +0000 (15:14 +0800)]
btrfs: use min() to replace open-code in btrfs_invalidatepage()
In btrfs_invalidatepage() we introduce a temporary variable, new_len, to
update ordered->truncated_len. But we can use min() to replace it
completely and no need for the variable.
Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 25 Mar 2021 07:14:33 +0000 (15:14 +0800)]
btrfs: add sysfs interface for supported sectorsize
Export supported sector sizes in /sys/fs/btrfs/features/supported_sectorsizes.
Currently all architectures have PAGE_SIZE, There's some disparity
between read-only and read-write support but that will be unified in the
future so there's only one file exporting the size.
The read-only support for systems with 64K pages also works for 4K
sector size.
This new sysfs interface would help eg. mkfs.btrfs to print more
accurate warnings about potentially incompatible option combinations.
Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 31 Mar 2021 10:56:21 +0000 (11:56 +0100)]
btrfs: improve btree readahead for full send operations
Currently a full send operation uses the standard btree readahead when
iterating over the subvolume/snapshot btree, which despite bringing good
performance benefits, it could be improved in a few aspects for use cases
such as full send operations, which are guaranteed to visit every node
and leaf of a btree, in ascending and sequential order. The limitations
of that standard btree readahead implementation are the following:
1) It only triggers readahead for leaves that are physically close
to the leaf being read, within a 64K range;
2) It only triggers readahead for the next or previous leaves if the
leaf being read is not currently in memory;
3) It never triggers readahead for nodes.
So add a new readahead mode that addresses all these points and use it
for full send operations.
The following test script was used to measure the improvement on a box
using an average, consumer grade, spinning disk and with 16GiB of RAM:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit
MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit
mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
# Create files with inline data to make it easier and faster to create
# large btrees.
add_files()
{
local total=$1
local start_offset=$2
local number_jobs=$3
local total_per_job=$(($total / $number_jobs))
echo "Creating $total new files using $number_jobs jobs"
for ((n = 0; n < $number_jobs; n++)); do
(
local start_num=$(($start_offset + $n * $total_per_job))
for ((i = 1; i <= $total_per_job; i++)); do
local file_num=$((start_num + $i))
local file_path="$MNT/file_${file_num}"
xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
if [ $? -ne 0 ]; then
echo "Failed creating file $file_path"
break
fi
done
) &
worker_pids[$n]=$!
done
echo
echo "Updating 1/50th of the initial files..."
for ((i = 1; i < $initial_file_count; i += 50)); do
xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
done
Filipe Manana [Wed, 31 Mar 2021 10:55:50 +0000 (11:55 +0100)]
btrfs: fix exhaustion of the system chunk array due to concurrent allocations
When we are running out of space for updating the chunk tree, that is,
when we are low on available space in the system space info, if we have
many task concurrently allocating block groups, via fallocate for example,
many of them can end up all allocating new system chunks when only one is
needed. In extreme cases this can lead to exhaustion of the system chunk
array, which has a size limit of 2048 bytes, and results in a transaction
abort with errno EFBIG, producing a trace in dmesg like the following,
which was triggered on a PowerPC machine with a node/leaf size of 64K:
The following steps explain how we can end up in this situation:
1) Task A is at check_system_chunk(), either because it is allocating a
new data or metadata block group, at btrfs_chunk_alloc(), or because
it is removing a block group or turning a block group RO. It does not
matter why;
2) Task A sees that there is not enough free space in the system
space_info object, that is 'left' is < 'thresh'. And at this point
the system space_info has a value of 0 for its 'bytes_may_use'
counter;
3) As a consequence task A calls btrfs_alloc_chunk() in order to allocate
a new system block group (chunk) and then reserves 'thresh' bytes in
the chunk block reserve with the call to btrfs_block_rsv_add(). This
changes the chunk block reserve's 'reserved' and 'size' counters by an
amount of 'thresh', and changes the 'bytes_may_use' counter of the
system space_info object from 0 to 'thresh'.
Also during its call to btrfs_alloc_chunk(), we end up increasing the
value of the 'total_bytes' counter of the system space_info object by
8MiB (the size of a system chunk stripe). This happens through the
call chain:
4) After it finishes the first phase of the block group allocation, at
btrfs_chunk_alloc(), task A unlocks the chunk mutex;
5) At this point the new system block group was added to the transaction
handle's list of new block groups, but its block group item, device
items and chunk item were not yet inserted in the extent, device and
chunk trees, respectively. That only happens later when we call
btrfs_finish_chunk_alloc() through a call to
btrfs_create_pending_block_groups();
Note that only when we update the chunk tree, through the call to
btrfs_finish_chunk_alloc(), we decrement the 'reserved' counter
of the chunk block reserve as we COW/allocate extent buffers,
through:
If we end up COWing less chunk btree nodes/leaves than expected, which
is the typical case since the amount of space we reserve is always
pessimistic to account for the worst possible case, we release the
unused space through:
But before task A gets into btrfs_create_pending_block_groups()...
6) Many other tasks start allocating new block groups through fallocate,
each one does the first phase of block group allocation in a
serialized way, since btrfs_chunk_alloc() takes the chunk mutex
before calling check_system_chunk() and btrfs_alloc_chunk().
However before everyone enters the final phase of the block group
allocation, that is, before calling btrfs_create_pending_block_groups(),
new tasks keep coming to allocate new block groups and while at
check_system_chunk(), the system space_info's 'bytes_may_use' keeps
increasing each time a task reserves space in the chunk block reserve.
This means that eventually some other task can end up not seeing enough
free space in the system space_info and decide to allocate yet another
system chunk.
This may repeat several times if yet more new tasks keep allocating
new block groups before task A, and all the other tasks, finish the
creation of the pending block groups, which is when reserved space
in excess is released. Eventually this can result in exhaustion of
system chunk array in the superblock, with btrfs_add_system_chunk()
returning EFBIG, resulting later in a transaction abort.
Even when we don't reach the extreme case of exhausting the system
array, most, if not all, unnecessarily created system block groups
end up being unused since when finishing creation of the first
pending system block group, the creation of the following ones end
up not needing to COW nodes/leaves of the chunk tree, so we never
allocate and deallocate from them, resulting in them never being
added to the list of unused block groups - as a consequence they
don't get deleted by the cleaner kthread - the only exceptions are
if we unmount and mount the filesystem again, which adds any unused
block groups to the list of unused block groups, if a scrub is
run, which also adds unused block groups to the unused list, and
under some circumstances when using a zoned filesystem or async
discard, which may also add unused block groups to the unused list.
So fix this by:
*) Tracking the number of reserved bytes for the chunk tree per
transaction, which is the sum of reserved chunk bytes by each
transaction handle currently being used;
*) When there is not enough free space in the system space_info,
if there are other transaction handles which reserved chunk space,
wait for some of them to complete in order to have enough excess
reserved space released, and then try again. Otherwise proceed with
the creation of a new system chunk.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 23 Mar 2021 18:39:49 +0000 (18:39 +0000)]
btrfs: make reflinks respect O_SYNC O_DSYNC and S_SYNC flags
If we reflink to or from a file opened with O_SYNC/O_DSYNC or to/from a
file that has the S_SYNC attribute set, we totally ignore that and do not
durably persist the reflink changes. Since a reflink can change the data
readable from a file (and mtime/ctime, or a file size), it makes sense to
durably persist (fsync) the source and destination files/ranges.
The recently introduced test case generic/628, from fstests, exercises
these scenarios and currently fails without this change.
So make sure we fsync the source and destination files/ranges when either
of them was opened with O_SYNC/O_DSYNC or has the S_SYNC attribute set,
just like XFS already does.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Arnd Bergmann [Tue, 23 Mar 2021 14:31:19 +0000 (15:31 +0100)]
btrfs: zoned: bail out in btrfs_alloc_chunk for bad input
gcc complains that the ctl->max_chunk_size member might be used
uninitialized when none of the three conditions for initializing it in
init_alloc_chunk_ctl_policy_zoned() are true:
In function ‘init_alloc_chunk_ctl_policy_zoned’,
inlined from ‘init_alloc_chunk_ctl’ at fs/btrfs/volumes.c:5023:3,
inlined from ‘btrfs_alloc_chunk’ at fs/btrfs/volumes.c:5340:2:
include/linux/compiler-gcc.h:48:45: error: ‘ctl.max_chunk_size’ may be used uninitialized [-Werror=maybe-uninitialized]
4998 | ctl->max_chunk_size = min(limit, ctl->max_chunk_size);
| ^~~
fs/btrfs/volumes.c: In function ‘btrfs_alloc_chunk’:
fs/btrfs/volumes.c:5316:32: note: ‘ctl’ declared here
5316 | struct alloc_chunk_ctl ctl;
| ^~~
If we ever get into this condition, something is seriously
wrong, as validity is checked in the callers
so the same logic as in init_alloc_chunk_ctl_policy_regular()
and a few other places should be applied. This avoids both further
data corruption, and the compile-time warning.
Fixes: 1cd6121f2a38 ("btrfs: zoned: implement zoned chunk allocator") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
BingJing Chang [Thu, 25 Mar 2021 01:56:22 +0000 (09:56 +0800)]
btrfs: fix a potential hole punching failure
In commit d77815461f04 ("btrfs: Avoid trucating page or punching hole
in a already existed hole."), existing holes can be skipped by calling
find_first_non_hole() to adjust start and len. However, if the given len
is invalid and large, when an EXTENT_MAP_HOLE extent is found, len will
not be set to zero because (em->start + em->len) is less than
(start + len). Then the ret will be 1 but len will not be set to 0.
The propagated non-zero ret will result in fallocate failure.
In the while-loop of btrfs_replace_file_extents(), len is not updated
every time before it calls find_first_non_hole(). That is, after
btrfs_drop_extents() successfully drops the last non-hole file extent,
it may fail with ENOSPC when attempting to drop a file extent item
representing a hole. The problem can happen. After it calls
find_first_non_hole(), the cur_offset will be adjusted to be larger
than or equal to end. However, since the len is not set to zero, the
break-loop condition (ret && !len) will not be met. After it leaves the
while-loop, fallocate will return 1, which is an unexpected return
value.
We're not able to construct a reproducible way to let
btrfs_drop_extents() fail with ENOSPC after it drops the last non-hole
file extent but with remaining holes left. However, it's quite easy to
fix. We just need to update and check the len every time before we call
find_first_non_hole(). To make the while loop more readable, we also
pull the variable updates to the bottom of loop like this:
while (cur_offset < end) {
...
// update cur_offset & len
// advance cur_offset & len in hole-punching case if needed
}
Reported-by: Robbie Ko <robbieko@synology.com> Fixes: d77815461f04 ("btrfs: Avoid trucating page or punching hole in a already existed hole.") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: BingJing Chang <bingjingc@synology.com> Signed-off-by: David Sterba <dsterba@suse.com>
Naohiro Aota [Wed, 24 Mar 2021 14:23:11 +0000 (23:23 +0900)]
btrfs: zoned: move log tree node allocation out of log_root_tree->log_mutex
Commit 6e37d2459941 ("btrfs: zoned: fix deadlock on log sync") pointed out
a deadlock warning and removed mutex_{lock,unlock} of fs_info::tree_root->log_mutex.
While it looks like it always cause a deadlock, we didn't see actual
deadlock in fstests runs. The reason is log_root_tree->log_mutex !=
fs_info->tree_root->log_mutex, not taking the same lock. So, the warning
was actually a false-positive.
Since btrfs_alloc_log_tree_node() is protected only by
fs_info->tree_root->log_mutex, we can (and should) move the code out of
the lock scope of log_root_tree->log_mutex and silence the warning.
Fixes: 6e37d2459941 ("btrfs: zoned: fix deadlock on log sync") Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 24 Mar 2021 13:44:21 +0000 (09:44 -0400)]
btrfs: use percpu_read_positive instead of sum_positive for need_preempt
Looking at perf data for a fio workload I noticed that we were spending
a pretty large chunk of time (around 5%) doing percpu_counter_sum() in
need_preemptive_reclaim. This is silly, as we only want to know if we
have more ordered than delalloc to see if we should be counting the
delayed items in our threshold calculation. Change this to
percpu_read_positive() to avoid the overhead.
I ran this through fsperf to validate the changes, obviously the latency
numbers in dbench and fio are quite jittery, so take them as you wish,
but overall the improvements on throughput, iops, and bw are all
positive. Each test was run two times, the given value is the average
of both runs for their respective column.
Filipe Manana [Fri, 26 Mar 2021 13:14:41 +0000 (13:14 +0000)]
btrfs: update outdated comment at btrfs_replace_file_extents()
There is a comment at btrfs_replace_file_extents() that mentions that we
set the full sync flag on an inode when cloning into a file with a size
greater than or equals to 16MiB, through try_release_extent_mapping() when
we truncate the page cache after replacing file extents during a clone
operation.
That is not true anymore since commit 5e548b32018d96 ("btrfs: do not set
the full sync flag on the inode during page release"), so update the
comment to remove that part and rephrase it slightly to make it more
clear why the full sync flag is set at btrfs_replace_file_extents().
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 16 Mar 2021 16:54:13 +0000 (16:54 +0000)]
btrfs: update outdated comment at btrfs_orphan_cleanup()
btrfs_orphan_cleanup() has a comment referring to find_dead_roots, but
function does not exists since commit cb517eabba4f10 ("Btrfs: cleanup the
similar code of the fs root read"). What we use now to find and load dead
roots is btrfs_find_orphan_roots(). So update the comment and make it a
bit more detailed about why we can not delete an orphan item for a root.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 11 Mar 2021 14:31:13 +0000 (14:31 +0000)]
btrfs: update debug message when checking seq number of a delayed ref
We used to encode two different numbers in the tree mod log counter used
for sequence numbers, one in the upper 32 bits and the other one in the
lower 32 bits. However that is no longer the case, we stopped doing that
since commit fcebe4562dec83 ("Btrfs: rework qgroup accounting").
So update the debug message at btrfs_check_delayed_seq to stop extracting
the two 32 bits counters and print instead the 64 bits sequence numbers.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 11 Mar 2021 14:31:12 +0000 (14:31 +0000)]
btrfs: add and use helper to get lowest sequence number for the tree mod log
There are two places outside the tree mod log module that extract the
lowest sequence number of the tree mod log. These places end up
duplicating code and open coding the logic and internal implementation
details of the tree mod log. So add a helper to the tree mod log module
and header that returns the lowest sequence number or 0 if there aren't
any tree mod log users at the moment.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 11 Mar 2021 14:31:11 +0000 (14:31 +0000)]
btrfs: remove unnecessary leaf check at btrfs_tree_mod_log_free_eb()
At btrfs_tree_mod_log_free_eb() we check if we are dealing with a leaf,
and if so, return immediately and do nothing. However this check can be
removed, because after it we call tree_mod_need_log(), which returns
false when given an extent buffer that corresponds to a leaf.
So just remove the leaf check and pass the extent buffer to
tree_mod_need_log().
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 11 Mar 2021 14:31:10 +0000 (14:31 +0000)]
btrfs: use the new bit BTRFS_FS_TREE_MOD_LOG_USERS at btrfs_free_tree_block()
Instead of exposing implementation details of the tree mod log to check
if there are active tree mod log users at btrfs_free_tree_block(), use
the new bit BTRFS_FS_TREE_MOD_LOG_USERS for fs_info->flags instead. This
way extent-tree.c does not need to known about any of the internals of
the tree mod log and avoids taking a lock unnecessarily as well.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 11 Mar 2021 14:31:09 +0000 (14:31 +0000)]
btrfs: use a bit to track the existence of tree mod log users
The tree modification log functions are called very frequently, basically
they are called every time a btree is modified (a pointer added or removed
to a node, a new root for a btree is set, etc). Because of that, to avoid
heavy lock contention on the lock that protects the list of tree mod log
users, we have checks that test the emptiness of the list with a full
memory barrier before the checks, so that when there are no tree mod log
users we avoid taking the lock.
Replace the memory barrier and list emptiness check with a test for a new
bit set at fs_info->flags. This bit is used to indicate when there are
tree mod log users, set whenever a user is added to the list and cleared
when the last user is removed from the list. This makes the intention a
bit more obvious and possibly more efficient (assuming test_bit() may be
cheaper than a full memory barrier on some architectures).
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 11 Mar 2021 14:31:07 +0000 (14:31 +0000)]
btrfs: move the tree mod log code into its own file
The tree modification log, which records modifications done to btrees, is
quite large and currently spread all over ctree.c, which is a huge file
already.
To make things better organized, move all that code into its own separate
source and header files. Functions and definitions that are used outside
of the module (mostly by ctree.c) are renamed so that they start with a
"btrfs_" prefix. Everything else remains unchanged.
This makes it easier to go over the tree modification log code every
time I need to go read it to fix a bug.
Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com>
[ minor comment updates ] Signed-off-by: David Sterba <dsterba@suse.com>
Ira Weiny [Wed, 17 Feb 2021 02:48:26 +0000 (18:48 -0800)]
btrfs: integrity-checker: convert block context kmap's to kmap_local_page
btrfsic_read_block() (which calls kmap()) and
btrfsic_release_block_ctx() (which calls kunmap()) are always called
within a single thread of execution.
Therefore the mappings created within these calls can be a thread local
mapping.
Convert the kmap() of bloc_ctx->pagev to kmap_local_page(). Luckily the
unmap loops backwards through the array pointer so no adjustment needs
to be made to the unmapping order.
Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Ira Weiny [Wed, 17 Feb 2021 02:48:24 +0000 (18:48 -0800)]
btrfs: raid56: convert kmaps to kmap_local_page
These kmaps are thread local and don't need to be atomic. So they can use
the more efficient kmap_local_page(). However, the mapping of pages in
the stripes and the additional parity and qstripe pages are a bit
trickier because the unmapping must occur in the opposite order from the
mapping. Furthermore, the pointer array in __raid_recover_end_io() may
get reordered.
Convert these calls to kmap_local_page() taking care to reverse the
unmappings of any page arrays as well as being careful with the mappings
of any special pages such as the parity and qstripe pages.
Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Ira Weiny [Wed, 17 Feb 2021 02:48:23 +0000 (18:48 -0800)]
btrfs: convert kmap to kmap_local_page, simple cases
Use a simple coccinelle script to help convert the most common
kmap()/kunmap() patterns to kmap_local_page()/kunmap_local().
Note that some kmaps which were caught by this script needed to be
handled by hand because of the strict unmapping order of kunmap_local()
so they are not included in this patch. But this script got us started.
There's another temp variable added for the final length write to the
first page so it does not interfere with cpage_out that is used for
mapping other pages.
The development of this patch was aided by the follow script:
// <smpl>
// SPDX-License-Identifier: GPL-2.0-only
// Find kmap and replace with kmap_local_page then mark kunmap
//
// Confidence: Low
// Copyright: (C) 2021 Intel Corporation
// URL: http://coccinelle.lip6.fr/
Filipe Manana [Tue, 23 Feb 2021 12:08:49 +0000 (12:08 +0000)]
btrfs: remove stale comment and logic from btrfs_inode_in_log()
Currently btrfs_inode_in_log() checks the list of modified extents of the
inode, and has a comment mentioning why, as it used to be necessary to
make sure if we did something like the following:
mmap write range A
mmap write range B
msync range A (ranged fsync)
msync range B (ranged fsync)
we ended up with both ranges being logged.
If we did not check it, then the second fsync would do nothing because
btrfs_inode_in_log() would return true. This was added in 125c4cf9f37c98
("Btrfs: set inode's logged_trans/last_log_commit after ranged fsync") and
test case generic/325 from fstests exercises that scenario.
However, as of commit 487781796d3022 ("btrfs: make fast fsyncs wait only
for writeback"), every ranged fsync is now turned into a full ranged fsync
(operates on the range from 0 to LLONG_MAX), so it is now pointless to
test of emptiness of the list of modified extents, and the comment is
clearly outdated.
So just remove the comment and list emptiness check, while also changing
the function's return type to be a boolean instead of an integer.
In case one day we get support for ranged fsyncs again, it will be easy
to notice the check is necessary again, because it will make generic/325
always fail.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 23 Feb 2021 12:08:48 +0000 (12:08 +0000)]
btrfs: fix race between marking inode needs to be logged and log syncing
We have a race between marking that an inode needs to be logged, either
at btrfs_set_inode_last_trans() or at btrfs_page_mkwrite(), and between
btrfs_sync_log(). The following steps describe how the race happens.
1) We are at transaction N;
2) Inode I was previously fsynced in the current transaction so it has:
inode->logged_trans set to N;
3) The inode's root currently has:
root->log_transid set to 1
root->last_log_commit set to 0
Which means only one log transaction was committed to far, log
transaction 0. When a log tree is created we set ->log_transid and
->last_log_commit of its parent root to 0 (at btrfs_add_log_tree());
4) One more range of pages is dirtied in inode I;
5) Some task A starts an fsync against some other inode J (same root), and
so it joins log transaction 1.
Before task A calls btrfs_sync_log()...
6) Task B starts an fsync against inode I, which currently has the full
sync flag set, so it starts delalloc and waits for the ordered extent
to complete before calling btrfs_inode_in_log() at btrfs_sync_file();
7) During ordered extent completion we have btrfs_update_inode() called
against inode I, which in turn calls btrfs_set_inode_last_trans(),
which does the following:
So ->last_trans is set to N and ->last_sub_trans set to 1.
But before setting ->last_log_commit...
8) Task A is at btrfs_sync_log():
- it increments root->log_transid to 2
- starts writeback for all log tree extent buffers
- waits for the writeback to complete
- writes the super blocks
- updates root->last_log_commit to 1
It's a lot of slow steps between updating root->log_transid and
root->last_log_commit;
9) The task doing the ordered extent completion, currently at
btrfs_set_inode_last_trans(), then finally runs:
Which results in inode->last_log_commit being set to 1.
The ordered extent completes;
10) Task B is resumed, and it calls btrfs_inode_in_log() which returns
true because we have all the following conditions met:
inode->logged_trans == N which matches fs_info->generation &&
inode->last_subtrans (1) <= inode->last_log_commit (1) &&
inode->last_subtrans (1) <= root->last_log_commit (1) &&
list inode->extent_tree.modified_extents is empty
And as a consequence we return without logging the inode, so the
existing logged version of the inode does not point to the extent
that was written after the previous fsync.
It should be impossible in practice for one task be able to do so much
progress in btrfs_sync_log() while another task is at
btrfs_set_inode_last_trans() right after it reads root->log_transid and
before it reads root->last_log_commit. Even if kernel preemption is enabled
we know the task at btrfs_set_inode_last_trans() can not be preempted
because it is holding the inode's spinlock.
However there is another place where we do the same without holding the
spinlock, which is in the memory mapped write path at:
So with preemption happening after setting ->last_sub_trans and before
setting ->last_log_commit, it is less of a stretch to have another task
do enough progress at btrfs_sync_log() such that the task doing the memory
mapped write ends up with ->last_sub_trans and ->last_log_commit set to
the same value. It is still a big stretch to get there, as the task doing
btrfs_sync_log() has to start writeback, wait for its completion and write
the super blocks.
So fix this in two different ways:
1) For btrfs_set_inode_last_trans(), simply set ->last_log_commit to the
value of ->last_sub_trans minus 1;
2) For btrfs_page_mkwrite() only set the inode's ->last_sub_trans, just
like we do for buffered and direct writes at btrfs_file_write_iter(),
which is all we need to make sure multiple writes and fsyncs to an
inode in the same transaction never result in an fsync missing that
the inode changed and needs to be logged. Turn this into a helper
function and use it both at btrfs_page_mkwrite() and at
btrfs_file_write_iter() - this also fixes the problem that at
btrfs_page_mkwrite() we were setting those fields without the
protection of the inode's spinlock.
This is an extremely unlikely race to happen in practice.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 23 Feb 2021 12:08:47 +0000 (12:08 +0000)]
btrfs: fix race between memory mapped writes and fsync
When doing an fsync we flush all delalloc, lock the inode (VFS lock), flush
any new delalloc that might have been created before taking the lock and
then wait either for the ordered extents to complete or just for the
writeback to complete (depending on whether the full sync flag is set or
not). We then start logging the inode and assume that while we are doing it
no one else is touching the inode's file extent items (or adding new ones).
That is generally true because all operations that modify an inode acquire
the inode's lock first, including buffered and direct IO writes. However
there is one exception: memory mapped writes, which do not and can not
acquire the inode's lock.
This can cause two types of issues: ending up logging file extent items
with overlapping ranges, which is detected by the tree checker and will
result in aborting the transaction when starting writeback for a log
tree's extent buffers, or a silent corruption where we log a version of
the file that never existed.
Scenario 1 - logging overlapping extents
The following steps explain how we can end up with file extents items with
overlapping ranges in a log tree due to a race between a fsync and memory
mapped writes:
1) Task A starts an fsync on inode X, which has the full sync runtime flag
set. First it starts by flushing all delalloc for the inode;
2) Task A then locks the inode and flushes any other delalloc that might
have been created after the previous flush and waits for all ordered
extents to complete;
3) In the inode's root we have the following leaf:
The last file extent item in leaf N covers the file range from 640K to
768K;
4) Task B does a memory mapped write for the page corresponding to the
file range from 764K to 768K;
5) Task A starts logging the inode. At copy_inode_items_to_log() it uses
btrfs_search_forward() to search for leafs modified in the current
transaction that contain items for the inode. It finds leaf N and copies
all the inode items from that leaf into the log tree.
Now the log tree has a copy of the last file extent item from leaf N.
At the end of the while loop at copy_inode_items_to_log(), we have the
minimum key set to:
Then we increment the key's offset by 1 so that the next call to
btrfs_search_forward() leaves us at the first key greater than the key
we just processed.
But before btrfs_search_forward() is called again...
6) Dellaloc for the page at offset 764K, dirtied by task B, is started.
It can be started for several reasons:
- The async reclaim task is attempting to satisfy metadata or data
reservation requests, and it has reached a point where it decided
to flush delalloc;
- Due to memory pressure the VMM triggers writeback of dirty pages;
- The system call sync_file_range(2) is called from user space.
7) When the respective ordered extent completes, it trims the length of
the existing file extent item for file offset 640K from 128K to 124K,
and a new file extent item is added with a key offset of 764K and a
length of 4K;
8) Task A calls btrfs_search_forward(), which returns us a path pointing
to the leaf (can be leaf N or some other) containing the new file extent
item for file offset 764K.
We end up copying this item to the log tree, which overlaps with the
last copied file extent item, which covers the file range from 640K to
768K.
When writeback is triggered for log tree's extent buffers, the issue
will be detected by the tree checker which will dump a trace and an
error message on dmesg/syslog. If the writeback is triggered when
syncing the log, which typically is, then we also end up aborting the
current transaction.
This is the same type of problem fixed in 0c713cbab6200b ("Btrfs: fix race
between ranged fsync and writeback of adjacent ranges").
Scenario 2 - logging a version of the file that never existed
This scenario only happens when using the NO_HOLES feature and results in
a silent corruption, in the sense that is not detectable by 'btrfs check'
or the tree checker:
1) We have an inode I with a size of 1M and two file extent items, one
covering an extent with disk_bytenr == X for the file range [0, 512K)
and another one covering another extent with disk_bytenr == Y for the
file range [512K, 1M);
2) A hole is punched for the file range [512K, 1M);
3) Task A starts an fsync of inode I, which has the full sync runtime flag
set. It starts by flushing all existing delalloc, locks the inode (VFS
lock), starts any new delalloc that might have been created before
taking the lock and waits for all ordered extents to complete;
4) Some other task does a memory mapped write for the page corresponding to
the file range [640K, 644K) for example;
5) Task A then logs all items of the inode with the call to
copy_inode_items_to_log();
6) In the meanwhile delalloc for the range [640K, 644K) is started. It can
be started for several reasons:
- The async reclaim task is attempting to satisfy metadata or data
reservation requests, and it has reached a point where it decided
to flush delalloc;
- Due to memory pressure the VMM triggers writeback of dirty pages;
- The system call sync_file_range(2) is called from user space.
7) The ordered extent for the range [640K, 644K) completes and a file
extent item for that range is added to the subvolume tree, pointing
to a 4K extent with a disk_bytenr == Z;
8) Task A then calls btrfs_log_holes(), to scan for implicit holes in
the subvolume tree. It finds two implicit holes:
- one for the file range [512K, 640K)
- one for the file range [644K, 1M)
As a result we end up neither logging a hole for the range [640K, 644K)
nor logging the file extent item with a disk_bytenr == Z.
This means that if we have a power failure and replay the log tree we
end up getting the following file extent layout:
This can be fixed by serializing memory mapped writes with fsync, and there
are two ways to do it:
1) Make a fsync lock the entire file range, from 0 to (u64)-1 / LLONG_MAX
in the inode's io tree. This prevents the race but also blocks any reads
during the duration of the fsync, which has a negative impact for many
common workloads;
2) Make an fsync write lock the i_mmap_lock semaphore in the inode. This
semaphore was recently added by Josef's patch set:
btrfs: add a i_mmap_lock to our inode
btrfs: cleanup inode_lock/inode_unlock uses
btrfs: exclude mmaps while doing remap
btrfs: exclude mmap from happening during all fallocate operations
and is used to solve races between memory mapped writes and
clone/dedupe/fallocate. This also makes us have the same behaviour we
have regarding other writes (buffered and direct IO) and fsync - block
them while the inode logging is in progress.
This change uses the second approach due to the performance impact of the
first one.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 10 Feb 2021 22:14:36 +0000 (17:14 -0500)]
btrfs: exclude mmap from happening during all fallocate operations
There's a small window where a deadlock can happen between fallocate and
mmap. This is described in detail by Filipe:
"""
When doing a fallocate operation we lock the inode, flush delalloc within
the target range, wait for any ordered extents to complete and then lock
the file range. Before we lock the range and after we flush delalloc,
there is a time window where another task can come in and do a memory
mapped write for a page within the fallocate range.
This means that after fallocate locks the range, there can be a dirty page
in the range. More often than not, this does not cause any problem.
The exception is when we are low on available metadata space, because an
fallocate operation needs to start a transaction while holding the file
range locked, either through btrfs_prealloc_file_range() or through the
call to btrfs_fallocate_update_isize(). If that's the case, we can end up
in a deadlock. The following list of steps explains how that happens:
1) A fallocate operation starts, locks the inode, flushes delalloc in the
range and waits for ordered extents in the range to complete;
2) Before the fallocate task locks the file range, another task does a
memory mapped write for a page in the fallocate target range. This is
possible since memory mapped writes do not (and can not) lock the
inode;
3) The fallocate task locks the file range. At this point there is one
dirty page in the range (due to the memory mapped write);
4) When the fallocate task attempts to start a transaction, it blocks when
attempting to reserve metadata space, since we are low on available
metadata space. Before blocking (wait on its reservation ticket), it
starts the async reclaim task (if not running already);
5) The async reclaim task is not able to release space through any other
means, so it decides to flush delalloc for inodes with dirty pages.
It finds that the inode used in the fallocate operation has a dirty
page and therefore queues a job (fs_info->flush_workers workqueue) to
flush delalloc for that inode and waits on that job to complete;
6) The flush job blocks when attempting to lock the file range because
it is currently locked by the fallocate task;
7) The fallocate task keeps waiting for its metadata reservation, waiting
for a wakeup on its reservation ticket. The async reclaim task is
waiting on the flush job, which in turn is waiting for locking the file
range that is currently locked by the fallocate task. So unless some
other task is able to release enough metadata space, for example an
ordered extent for some other inode completes, we end up in a deadlock
between all these tasks.
When this happens stack traces like the following show up in dmesg/syslog:
Josef Bacik [Wed, 10 Feb 2021 22:14:35 +0000 (17:14 -0500)]
btrfs: exclude mmaps while doing remap
Darrick reported a potential issue to me where we could allow mmap
writes after validating a page range matched in the case of dedupe.
Generally we rely on lock page -> lock extent with the ordered flush to
protect us, but this is done after we check the pages because we use the
generic helpers, so we could modify the page in between doing the check
and locking the range.
There also exists a deadlock, as described by Filipe
"""
When cloning a file range, we lock the inodes, flush any delalloc within
the respective file ranges, wait for any ordered extents and then lock the
file ranges in both inodes. This means that right after we flush delalloc
and before we lock the file ranges, memory mapped writes can come in and
dirty pages in the file ranges of the clone operation.
Most of the time this is harmless and causes no problems. However, if we
are low on available metadata space, we can later end up in a deadlock
when starting a transaction to replace file extent items. This happens if
when allocating metadata space for the transaction, we need to wait for
the async reclaim thread to release space and the reclaim thread needs to
flush delalloc for the inode that got the memory mapped write and has its
range locked by the clone task.
Basically what happens is the following:
1) A clone operation locks inodes A and B, flushes delalloc for both
inodes in the respective file ranges and waits for any ordered extents
in those ranges to complete;
2) Before the clone task locks the file ranges, another task does a
memory mapped write (which does not lock the inode) for one of the
inodes of the clone operation. So now we have a dirty page in one of
the ranges used by the clone operation;
3) The clone operation locks the file ranges for inodes A and B;
4) Later, when iterating over the file extents of inode A, the clone
task attempts to start a transaction. There's not enough available
free metadata space, so the async reclaim task is started (if not
running already) and we wait for someone to wake us up on our
reservation ticket;
5) The async reclaim task is not able to release space by any other
means and decides to flush delalloc for the inode of the clone
operation;
6) The workqueue job used to flush the inode blocks when starting
delalloc for the inode, since the file range is currently locked by
the clone task;
7) But the clone task is waiting on its reservation ticket and the async
reclaim task is waiting on the flush job to complete, which can't
progress since the clone task has the file range locked. So unless
some other task is able to release space, for example an ordered
extent for some other inode completes, we have a deadlock between all
these tasks;
When this happens stack traces like the following show up in dmesg/syslog:
Josef Bacik [Wed, 10 Feb 2021 22:14:34 +0000 (17:14 -0500)]
btrfs: use btrfs_inode_lock/btrfs_inode_unlock inode lock helpers
A few places we intermix btrfs_inode_lock with a inode_unlock, and some
places we just use inode_lock/inode_unlock instead of btrfs_inode_lock.
None of these places are using this incorrectly, but as we adjust some
of these callers it would be nice to keep everything consistent, so
convert everybody to use btrfs_inode_lock/btrfs_inode_unlock.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 10 Feb 2021 22:14:33 +0000 (17:14 -0500)]
btrfs: add a i_mmap_lock to our inode
We need to be able to exclude page_mkwrite from happening concurrently
with certain operations. To facilitate this, add a i_mmap_lock to our
inode, down_read() it in our mkwrite, and add a new ILOCK flag to
indicate that we want to take the i_mmap_lock as well. I used pahole to
check the size of the btrfs_inode, the sizes are as follows
no lockdep:
before: 1120 (3 per 4k page)
after: 1160 (3 per 4k page)
lockdep:
before: 2072 (1 per 4k page)
after: 2224 (1 per 4k page)
We're slightly larger but it doesn't change how many objects we can fit
per page.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: remove force argument from run_delalloc_nocow()
force_cow can be calculated from inode and does not need to be passed as
an argument.
This simplifies run_delalloc_nocow() call from btrfs_run_delalloc_range()
A new function, should_nocow() checks if the range should be NOCOWed or
not. The function returns true iff either BTRFS_INODE_NODATA or
BTRFS_INODE_PREALLOC, but is not a defrag extent.
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Tue, 2 Mar 2021 10:44:40 +0000 (12:44 +0200)]
btrfs: don't opencode extent_changeset_free
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 1 Mar 2021 09:26:43 +0000 (09:26 +0000)]
btrfs: add btree read ahead for incremental send operations
Currently we do not do btree read ahead when doing an incremental send,
however we know that we will read and process any node or leaf in the
send root that has a generation greater than the generation of the parent
root. So triggering read ahead for such nodes and leafs is beneficial
for an incremental send.
This change does that, triggers read ahead of any node or leaf in the
send root that has a generation greater then the generation of the
parent root. As for the parent root, no readahead is triggered because
knowing in advance which nodes/leaves are going to be read is not so
linear and there's often a large time window between visiting nodes or
leaves of the parent root. So I opted to leave out the parent root,
and triggering read ahead for its nodes/leaves seemed to have not made
significant difference.
The following test script was used to measure the improvement on a box
using an average, consumer grade, spinning disk and with 16GiB of ram:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit
MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit
mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
# Create files with inline data to make it easier and faster to create
# large btrees.
add_files()
{
local total=$1
local start_offset=$2
local number_jobs=$3
local total_per_job=$(($total / $number_jobs))
echo "Creating $total new files using $number_jobs jobs"
for ((n = 0; n < $number_jobs; n++)); do
(
local start_num=$(($start_offset + $n * $total_per_job))
for ((i = 1; i <= $total_per_job; i++)); do
local file_num=$((start_num + $i))
local file_path="$MNT/file_${file_num}"
xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
if [ $? -ne 0 ]; then
echo "Failed creating file $file_path"
break
fi
done
) &
worker_pids[$n]=$!
done
echo
echo "Updating 1/50th of the initial files..."
for ((i = 1; i < $initial_file_count; i += 50)); do
xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
done
with $initial_file_count == 200000: 51 seconds
with $initial_file_count == 500000: 168 seconds
After this change, incremental send duration:
with $initial_file_count == 200000: 39 seconds (-26.7%)
with $initial_file_count == 500000: 125 seconds (-29.4%)
For $initial_file_count == 200000 there are 62600 nodes and leaves in the
btree of the first snapshot, and 77759 nodes and leaves in the btree of
the second snapshot. The root nodes were at level 2.
While for $initial_file_count == 500000 there are 152476 nodes and leaves
in the btree of the first snapshot, and 190511 nodes and leaves in the
btree of the second snapshot. The root nodes were at level 2 as well.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 1 Mar 2021 09:26:42 +0000 (09:26 +0000)]
btrfs: add btree read ahead for full send operations
When doing a full send we know that we are going to be reading every node
and leaf of the send root, so we benefit from enabling read ahead for the
btree.
This change enables read ahead for full send operations only, incremental
sends will have read ahead enabled in a different way by a separate patch.
The following test script was used to measure the improvement on a box
using an average, consumer grade, spinning disk and with 16GiB of RAM:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit
MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit
mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
# Create files with inline data to make it easier and faster to create
# large btrees.
add_files()
{
local total=$1
local start_offset=$2
local number_jobs=$3
local total_per_job=$(($total / $number_jobs))
echo "Creating $total new files using $number_jobs jobs"
for ((n = 0; n < $number_jobs; n++)); do
(
local start_num=$(($start_offset + $n * $total_per_job))
for ((i = 1; i <= $total_per_job; i++)); do
local file_num=$((start_num + $i))
local file_path="$MNT/file_${file_num}"
xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
if [ $? -ne 0 ]; then
echo "Failed creating file $file_path"
break
fi
done
) &
worker_pids[$n]=$!
done
echo
echo "Updating 1/50th of the initial files..."
for ((i = 1; i < $initial_file_count; i += 50)); do
xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
done
with $initial_file_count == 200000: 165 seconds
with $initial_file_count == 500000: 407 seconds
After this change, full send duration:
with $initial_file_count == 200000: 149 seconds (-10.2%)
with $initial_file_count == 500000: 353 seconds (-14.2%)
For $initial_file_count == 200000 there are 62600 nodes and leaves in the
btree of the first snapshot, while for $initial_file_count == 500000 there
are 152476 nodes and leaves. The roots were at level 2.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Mon, 22 Feb 2021 16:40:47 +0000 (18:40 +0200)]
btrfs: simplify code flow in btrfs_delayed_inode_reserve_metadata
btrfs_block_rsv_add can return only ENOSPC since it's called with
NO_FLUSH modifier. This so simplify the logic in
btrfs_delayed_inode_reserve_metadata to exploit this invariant.
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com>
[ add assert and comment ] Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Mon, 22 Feb 2021 16:40:45 +0000 (18:40 +0200)]
btrfs: simplify commit logic in try_flush_qgroup
It's no longer expected to call this function with an open transaction
so all the workarounds concerning this can be removed. In fact it'll
constitute a bug to call this function with a transaction already held
so WARN in this case.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Thu, 11 Feb 2021 05:25:17 +0000 (21:25 -0800)]
btrfs: scrub: drop a few function declarations
Drop function declarations at the beginning of the file scrub.c. These
functions are defined before they are used in the same file and don't
need forward declaration.
No functional changes.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Wed, 17 Feb 2021 13:12:50 +0000 (15:12 +0200)]
btrfs: replace open coded while loop with proper construct
btrfs_inc_block_group_ro wants to ensure that the current transaction is
not running dirty block groups, if it is it waits and loops again.
That logic is currently implemented using a goto label. Actually using
a proper do {} while() construct doesn't hurt readability nor does it
introduce excessive nesting and makes the relevant code stand out by
being encompassed in the loop construct. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 11 Feb 2021 08:14:05 +0000 (16:14 +0800)]
btrfs: fix comment for btrfs ordered extent flag bits
There is small error in comment about BTRFS_ORDERED_* flags, added in
commit 3c198fe06449 ("btrfs: rework the order of
btrfs_ordered_extent::flags") but the fixup did not get merged in time.
The 4 types are for ordered extent itself, not for direct io.
Only 3 types support direct io, REGULAR/NOCOW/PREALLOC.
Fix the comment to reflect that.
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Merge tag 'arm-fixes-5.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC fixes from Arnd Bergmann:
"Another smaller set of fixes for three of the Arm platforms:
TI OMAP:
Fix swapped mmc device order also for omap3 that got changed with
the recent PROBE_PREFER_ASYNCHRONOUS changes. While eventually the
aliases should be board specific, all the mmc device instances are
all there in the SoC, and we do probe them by default so that PM
runtime can idle the devices if left enabled from the bootloader.
Qualcomm Snapdragon:
This bypasses the recently introduced interconnect handling in
the GENI (serial engine) driver when running off ACPI, as this
causes the GENI probe to fail and the Lenovo Yoga C630 to boot
without keyboard and touchpad.
Allwinner:
One 32kHz clock fix for the beelink gs1, a CD polarity fix for the
SoPine, some MAINTAINERS maintainance, and a clk / reset switch to
our headers"
* tag 'arm-fixes-5.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
arm64: dts: allwinner: h6: beelink-gs1: Remove ext. 32 kHz osc reference
MAINTAINERS: Match on allwinner keyword
MAINTAINERS: Add our new mailing-list
arm64: dts: allwinner: Fix SD card CD GPIO for SOPine systems
arm64: dts: allwinner: h6: Switch to macros for RSB clock/reset indices
ARM: OMAP2+: Fix uninitialized sr_inst
ARM: dts: Fix swapped mmc order for omap3
ARM: OMAP2+: Fix warning for omap_init_time_of()
soc: qcom: geni: shield geni_icc_get() for ACPI boot
Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm
Pull ARM fixes from Russell King:
- Halve maximum number of CPUs if DEBUG_KMAP_LOCAL is enabled
- Fix conversion for_each_membock() to for_each_mem_range()
- Fix footbridge PCI mapping
- Avoid uprobes hooking on thumb instructions
* tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: 9071/1: uprobes: Don't hook on thumb instructions
ARM: footbridge: fix PCI interrupt mapping
ARM: 9069/1: NOMMU: Fix conversion for_each_membock() to for_each_mem_range()
ARM: 9063/1: mm: reduce maximum number of CPUs if DEBUG_KMAP_LOCAL is enabled
Fredrik Strupe [Mon, 5 Apr 2021 20:52:05 +0000 (21:52 +0100)]
ARM: 9071/1: uprobes: Don't hook on thumb instructions
Since uprobes is not supported for thumb, check that the thumb bit is
not set when matching the uprobes instruction hooks.
The Arm UDF instructions used for uprobes triggering
(UPROBE_SWBP_ARM_INSN and UPROBE_SS_ARM_INSN) coincidentally share the
same encoding as a pair of unallocated 32-bit thumb instructions (not
UDF) when the condition code is 0b1111 (0xf). This in effect makes it
possible to trigger the uprobes functionality from thumb, and at that
using two unallocated instructions which are not permanently undefined.
Signed-off-by: Fredrik Strupe <fredrik@strupe.net> Cc: stable@vger.kernel.org Fixes: c7edc9e326d5 ("ARM: add uprobes support") Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Two fixes: the libsas fix is for a problem that occurs when trying to
change the cache type of an ATA device and the libiscsi one is a
regression fix from this merge window"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: libsas: Reset num_scatter if libata marks qc as NODATA
scsi: iscsi: Fix iSCSI cls conn state
Merge tag 'drm-fixes-2021-04-18' of git://anongit.freedesktop.org/drm/drm
Pull vmwgfx fixes from Dave Airlie:
"This contains two regression fixes for vmwgfx, one due to a refactor
which meant locks were being used before initialisation, and the other
in fixing up some warnings from the core when destroying pinned
buffers.
vmwgfx:
- fixed unpinning before destruction
- lockdep init reordering"
* tag 'drm-fixes-2021-04-18' of git://anongit.freedesktop.org/drm/drm:
drm/vmwgfx: Make sure bo's are unpinned before putting them back
drm/vmwgfx: Fix the lockdep breakage
drm/vmwgfx: Make sure we unpin no longer needed buffers
Dave Airlie [Sat, 17 Apr 2021 23:26:54 +0000 (09:26 +1000)]
Merge tag 'vmwgfx-fixes-2021-04-14' of gitlab.freedesktop.org:zack/vmwgfx into drm-fixes
vmwgfx fixes for regressions in 5.12
Here's a set of 3 patches fixing ugly regressions
in the vmwgfx driver. We broke lock initialization
code and ended up using spinlocks before initialization
breaking lockdep.
Also there was a bit of a fallout from drm changes
which made the core validate that unreferenced buffers
have been unpinned. vmwgfx pinning code predates a lot
of the core drm and wasn't written to account for those
semantics. Fortunately changes required to fix it
are not too intrusive.
The changes have been validated by our internal ci.