Nikolay Borisov [Mon, 8 Jan 2018 08:59:43 +0000 (10:59 +0200)]
btrfs: handle failure of add_pending_csums
add_pending_csums was added as part of the new data=ordered
implementation in e6dcd2dc9c48 ("Btrfs: New data=ordered
implementation"). Even back then it called the btrfs_csum_file_blocks
which can fail but it never bothered handling the failure. In ENOMEM
situation this could lead to the filesystem failing to write the
checksums for a particular extent and not detect this. On read this
could lead to the filesystem erroring out due to crc mismatch. Fix it by
propagating failure from add_pending_csums and handling them.
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Jeff Mahoney [Fri, 16 Feb 2018 03:59:47 +0000 (22:59 -0500)]
btrfs: use kvzalloc to allocate btrfs_fs_info
The srcu_struct in btrfs_fs_info scales in size with NR_CPUS. On
kernels built with NR_CPUS=8192, this can result in kmalloc failures
that prevent mounting.
There is work in progress to try to resolve this for every user of
srcu_struct but using kvzalloc will work around the failures until
that is complete.
As an example with NR_CPUS=512 on x86_64: the overall size of
subvol_srcu is 3460 bytes, fs_info is 6496.
Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Tue, 30 Jan 2018 14:07:37 +0000 (16:07 +0200)]
btrfs: Fix use-after-free when cleaning up fs_devs with a single stale device
Commit 4fde46f0cc71 ("Btrfs: free the stale device") introduced
btrfs_free_stale_device which iterates the device lists for all
registered btrfs filesystems and deletes those devices which aren't
mounted. In a btrfs_devices structure has only 1 device attached to it
and it is unused then btrfs_free_stale_devices will proceed to also free
the btrfs_fs_devices struct itself. Currently this leads to a use after
free since list_for_each_entry will try to perform a check on the
already freed memory to see if it has to terminate the loop.
The fix is to use 'break' when we know we are freeing the current
fs_devs.
Fixes: 4fde46f0cc71 ("Btrfs: free the stale device") Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 30 Jan 2018 18:40:22 +0000 (18:40 +0000)]
Btrfs: fix null pointer dereference when replacing missing device
When we are replacing a missing device we mount the filesystem with the
degraded mode option in which case we are allowed to have a btrfs device
structure without a backing device member (its bdev member is NULL) and
therefore we can't dereference that member. Commit 38b5f68e9811
("btrfs: drop btrfs_device::can_discard to query directly") started to
dereference that member when discarding extents, resulting in a null
pointer dereference:
This is trivially reproduced by running the test btrfs/027 from fstests
like this:
$ MOUNT_OPTIONS="-o discard" ./check btrfs/027
Fix this by skipping devices without a backing device before attempting
to discard.
Fixes: 38b5f68e9811 ("btrfs: drop btrfs_device::can_discard to query directly") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Zygo Blaxell [Wed, 24 Jan 2018 03:22:09 +0000 (22:22 -0500)]
btrfs: remove spurious WARN_ON(ref->count < 0) in find_parent_nodes
Until v4.14, this warning was very infrequent:
WARNING: CPU: 3 PID: 18172 at fs/btrfs/backref.c:1391 find_parent_nodes+0xc41/0x14e0
Modules linked in: [...]
CPU: 3 PID: 18172 Comm: bees Tainted: G D W L 4.11.9-zb64+ #1
Hardware name: System manufacturer System Product Name/M5A78L-M/USB3, BIOS 2101 12/02/2014
Call Trace:
dump_stack+0x85/0xc2
__warn+0xd1/0xf0
warn_slowpath_null+0x1d/0x20
find_parent_nodes+0xc41/0x14e0
__btrfs_find_all_roots+0xad/0x120
? extent_same_check_offsets+0x70/0x70
iterate_extent_inodes+0x168/0x300
iterate_inodes_from_logical+0x87/0xb0
? iterate_inodes_from_logical+0x87/0xb0
? extent_same_check_offsets+0x70/0x70
btrfs_ioctl+0x8ac/0x2820
? lock_acquire+0xc2/0x200
do_vfs_ioctl+0x91/0x700
? __fget+0x112/0x200
SyS_ioctl+0x79/0x90
entry_SYSCALL_64_fastpath+0x23/0xc6
? trace_hardirqs_off_caller+0x1f/0x140
Starting with v4.14 (specifically 86d5f9944252 ("btrfs: convert prelimary
reference tracking to use rbtrees")) the WARN_ON occurs three orders of
magnitude more frequently--almost once per second while running workloads
like bees.
Replace the WARN_ON() with a comment rationale for its removal.
The rationale is paraphrased from an explanation by Edmund Nadolski
<enadolski@suse.de> on the linux-btrfs mailing list.
Fixes: 8da6d5815c59 ("Btrfs: added btrfs_find_all_roots()") Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Mon, 29 Jan 2018 13:53:01 +0000 (15:53 +0200)]
btrfs: Ignore errors from btrfs_qgroup_trace_extent_post
Running generic/019 with qgroups on the scratch device enabled is almost
guaranteed to trigger the BUG_ON in btrfs_free_tree_block. It's supposed
to trigger only on -ENOMEM, in reality, however, it's possible to get
-EIO from btrfs_qgroup_trace_extent_post. This function just finds the
roots of the extent being tracked and sets the qrecord->old_roots list.
If this operation fails nothing critical happens except the quota
accounting can be considered wrong. In such case just set the
INCONSISTENT flag for the quota and print a warning, rather than killing
off the system. Additionally, it's possible to trigger a BUG_ON in
btrfs_truncate_inode_items as well.
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com>
[ error message adjustments ] Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Thu, 25 Jan 2018 18:02:56 +0000 (11:02 -0700)]
Btrfs: fix unexpected -EEXIST when creating new inode
The highest objectid, which is assigned to new inode, is decided at
the time of initializing fs roots. However, in cases where log replay
gets processed, the btree which fs root owns might be changed, so we
have to search it again for the highest objectid, otherwise creating
new inode would end up with -EEXIST.
cc: <stable@vger.kernel.org> v4.4-rc6+ Fixes: f32e48e92596 ("Btrfs: Initialize btrfs_root->highest_objectid when loading tree root and subvolume roots") Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
The race is between btrfs_orphan_commit_root and btrfs_orphan_del,
t1 t2
btrfs_orphan_commit_root btrfs_orphan_del
spin_lock
check (&root->orphan_inodes)
root->orphan_block_rsv = NULL;
spin_unlock
atomic_dec(&root->orphan_inodes);
access root->orphan_block_rsv
Accessing root->orphan_block_rsv must be done before decreasing
root->orphan_inodes.
cc: <stable@vger.kernel.org> v3.12+ Fixes: 703c88e03524 ("Btrfs: fix tracking of orphan inode count") Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Thu, 25 Jan 2018 18:02:52 +0000 (11:02 -0700)]
Btrfs: fix extent state leak from tree log
It's possible that btrfs_sync_log() bails out after one of the two
btrfs_write_marked_extents() which convert extent state's state bit into
EXTENT_NEED_WAIT from EXTENT_DIRTY/EXTENT_NEW, however only EXTENT_DIRTY
and EXTENT_NEW are searched by free_log_tree() so that those extent states
with EXTENT_NEED_WAIT lead to memory leak.
cc: <stable@vger.kernel.org> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Thu, 25 Jan 2018 18:02:51 +0000 (11:02 -0700)]
Btrfs: fix crash due to not cleaning up tree log block's dirty bits
In cases that the whole fs flips into readonly status due to failures in
critical sections, then log tree's blocks are still dirty, and this leads
to a crash during umount time, the crash is about use-after-free,
cc: <stable@vger.kernel.org> v3.12+ Fixes: 681ae50917df ("Btrfs: cleanup reserved space when freeing tree log on error") Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Thu, 25 Jan 2018 18:02:50 +0000 (11:02 -0700)]
Btrfs: fix deadlock in run_delalloc_nocow
@cur_offset is not set back to what it should be (@cow_start) if
btrfs_next_leaf() returns something wrong, and the range [cow_start,
cur_offset) remains locked forever.
cc: <stable@vger.kernel.org> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Thu, 18 Jan 2018 14:02:35 +0000 (22:02 +0800)]
btrfs: get device pointer from device_list_add()
Instead of pointer to btrfs_fs_devices as an arg in device_list_add()
better to get pointer to btrfs_device as return value, then we have
both, pointer to btrfs_device and btrfs_fs_devices. btrfs_device is
needed to handle reappearing missing device.
Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Thu, 18 Jan 2018 14:02:34 +0000 (22:02 +0800)]
btrfs: set the total_devices in device_list_add()
There is no other parent for device_list_add() except for
btrfs_scan_one_device(), which would set btrfs_fs_devices::total_devices
if device_list_add is successful and this can be done with in
device_list_add() itself.
Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Thu, 18 Jan 2018 14:02:33 +0000 (22:02 +0800)]
btrfs: move pr_info into device_list_add
Commit 60999ca4b403 ("btrfs: make device scan less noisy")
adds return value 1 to device_list_add(), so that parent function can
call pr_info only when new device is added. Move the pr_info() part
into device_list_add() so that this function can be kept simple.
Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Thu, 18 Jan 2018 14:00:37 +0000 (22:00 +0800)]
btrfs: make btrfs_free_stale_devices() to match the path
The btrfs_free_stale_devices() is updated to match for the given device
path and delete it. (It searches for only unmounted list of devices.)
Also drop the comment about different path being used for the same
device, since now we will have cli to clean any device that's not a
concern any more.
Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Thu, 18 Jan 2018 14:00:34 +0000 (22:00 +0800)]
btrfs: make btrfs_free_stale_device() to iterate all stales
Let the list iterator iterate further and find other stale
devices and delete it. This is in preparation to add support
for user land request-able stale devices cleanup. Also rename
btrfs_free_stale_device() to btrfs_free_stale_devices().
Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Thu, 18 Jan 2018 14:00:33 +0000 (22:00 +0800)]
btrfs: no need to check for btrfs_fs_devices::seeding
There is no need to check for btrfs_fs_devices::seeding when we
have checked for btrfs_fs_devices::opened, because we can't sprout
without its seed FS being opened.
Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Fri, 5 Jan 2018 19:51:17 +0000 (12:51 -0700)]
Btrfs: noinline merge_extent_mapping
In order to debug subtle bugs around merge_extent_mapping(), perf probe
can be used to check the arguments, but sometimes merge_extent_mapping()
got inlined by compiler and couldn't be probed.
This is adding noinline attribute to merge_extent_mapping().
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Fri, 5 Jan 2018 19:51:12 +0000 (12:51 -0700)]
Btrfs: add extent map selftests
We've observed that btrfs_get_extent() and merge_extent_mapping() could
return -EEXIST in several cases, and they are caused by some racy
condition, e.g dio read vs dio write, which makes the problem very tricky
to reproduce.
This adds extent map selftests in order to simulate those racy situations.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
[ minor string adjustments ] Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Fri, 5 Jan 2018 19:51:09 +0000 (12:51 -0700)]
Btrfs: fix unexpected EEXIST from btrfs_get_extent
This fixes a corner case that is caused by a race of dio write vs dio
read/write.
Here is how the race could happen.
Suppose that no extent map has been loaded into memory yet.
There is a file extent [0, 32K), two jobs are running concurrently
against it, t1 is doing dio write to [8K, 32K) and t2 is doing dio
read from [0, 4K) or [4K, 8K).
t1 goes ahead of t2 and splits em [0, 32K) to em [0K, 8K) and [8K 32K).
------------------------------------------------------
t1 t2
btrfs_get_blocks_direct() btrfs_get_blocks_direct()
-> btrfs_get_extent() -> btrfs_get_extent()
-> lookup_extent_mapping()
-> add_extent_mapping() -> lookup_extent_mapping()
# load [0, 32K)
-> btrfs_new_extent_direct()
-> btrfs_drop_extent_cache()
# split [0, 32K) and
# drop [8K, 32K)
-> add_extent_mapping()
# add [8K, 32K)
-> add_extent_mapping()
# handle -EEXIST when adding
# [0, 32K)
------------------------------------------------------
About how t2(dio read/write) runs into -EEXIST:
a) add_extent_mapping() gets -EEXIST for adding em [0, 32k),
b) search_extent_mapping() then returns [0, 8k) as the existing em,
even though start == existing->start, em is [0, 32k) so that
extent_map_end(em) > extent_map_end(existing), i.e. 32k > 8k,
c) then it goes thru merge_extent_mapping() which tries to add a [8k, 8k)
(with a length 0) and returns -EEXIST as [8k, 32k) is already in tree,
d) so btrfs_get_extent() ends up returning -EEXIST to dio read/write,
which is confusing applications.
Here I conclude all the possible situations,
1) start < existing->start
As we can see, it turns out that if start is within existing em (front
inclusive), then the existing em should be returned as is, otherwise,
we try our best to merge candidate em with sibling ems to form a
larger em (in order to reduce the total number of em).
Reported-by: David Vallender <david.vallender@landmark.co.uk> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Fri, 5 Jan 2018 19:51:08 +0000 (12:51 -0700)]
Btrfs: fix incorrect block_len in merge_extent_mapping
%block_len could be checked on deciding if two em are mergeable.
merge_extent_mapping() has only added the front pad if the front part
of em gets truncated, but it's possible that the end part gets
truncated.
For both compressed extent and inline extent, em->block_len is not
adjusted accordingly, and for regular extent, em->block_len always
equals to em->len, hence this sets em->block_len with em->len.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
Matthew Wilcox [Wed, 17 Jan 2018 20:21:49 +0000 (12:21 -0800)]
btrfs: Remove unused readahead spinlock
The reada_lock in struct btrfs_device was only initialised, and not
actually used. That's good because there's another lock also called
reada_lock in the btrfs_fs_info that was quite heavily used. Remove
this one.
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Wed, 10 Jan 2018 01:36:25 +0000 (18:36 -0700)]
Btrfs: raid56: fix race between merge_bio and rbio_orig_end_io
Before rbio_orig_end_io() goes to free rbio, rbio may get merged with
more bios from other rbios and rbio->bio_list becomes non-empty,
in that case, these newly merged bios don't end properly.
Once unlock_stripe() is done, rbio->bio_list will not be updated any
more and we can call bio_endio() on all queued bios.
It should only happen in error-out cases, the normal path of recover
and full stripe write have already set RBIO_RMW_LOCKED_BIT to disable
merge before doing IO, so rbio_orig_end_io() called by them doesn't
have the above issue.
Liu Bo [Sat, 13 Jan 2018 01:07:02 +0000 (18:07 -0700)]
Btrfs: do not cache rbio pages if using raid6 recover
Since raid6 recover tries all possible combinations of failed stripes,
- when raid6 rebuild algorithm is used, i.e. raid6_datap_recov() and
raid6_2data_recov(), it may change the in-memory content of failed
stripes, if such a raid bio is cached, a later raid write rmw or recover
can steal @stripe_pages from it instead of reading from disks, such that
it carries the wrong content to do write rmw or recovery and ends up
with corruption or recovery failures.
- when raid5 rebuild algorithm is used, i.e. xor, raid bio can be cached
because the only failed stripe which contains @rbio->bio_pages gets
modified, others remain the same so that their in-memory content is
consistent with their on-disk content.
This adds a check to skip caching rbio if using raid6 recover.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Sat, 13 Jan 2018 01:07:01 +0000 (18:07 -0700)]
Btrfs: raid56: iterate raid56 internal bio with bio_for_each_segment_all
Bio iterated by set_bio_pages_uptodate() is raid56 internal one, so it
will never be a BIO_CLONED bio, and since this is called by end_io
functions, bio->bi_iter.bi_size is zero, we mustn't use
bio_for_each_segment() as that is a no-op if bi_size is zero.
Fixes: 6592e58c6b68e61f003a01ba29a3716e7e2e9484 ("Btrfs: fix write corruption due to bio cloning on raid5/6") Cc: <stable@vger.kernel.org> # v4.12-rc6+ Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
Su Yue [Fri, 12 Jan 2018 03:08:02 +0000 (11:08 +0800)]
btrfs: correct wrong comment about magic number of index_cnt
There is no function named btrfs_get_inode_index_count.
Explanation for magic number index_cnt=2 in btrfs_new_inode() is
actually located in btrfs_set_inode_index_count().
So replace 'btrfs_get_inode_index_count' in the comment by
'btrfs_set_inode_index_count'.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Wed, 10 Jan 2018 05:15:18 +0000 (13:15 +0800)]
btrfs: cleanup btrfs_free_stale_device() usage
We call btrfs_free_stale_device() only when we alloc a new struct
btrfs_device (ret=1), so move it closer to where we alloc the new
device. Also drop the comments.
Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 10 Jan 2018 14:13:07 +0000 (15:13 +0100)]
btrfs: tree-check: reduce stack consumption in check_dir_item
I've noticed that the updated item checker stack consumption increased
dramatically in 542f5385e20cf97447 ("btrfs: tree-checker: Add checker
for dir item")
tree-checker.c:check_leaf +552 (176 -> 728)
The array is 255 bytes long, dynamic allocation would slow down the
sanity checks so it's more reasonable to keep it on-stack. Moving the
variable to the scope of use reduces the stack usage again
tree-checker.c:check_leaf -264 (728 -> 464)
Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Tue, 9 Jan 2018 01:05:43 +0000 (09:05 +0800)]
btrfs: fail mount when sb flag is not in BTRFS_SUPER_FLAG_SUPP
It appears from the original commit [1] that there isn't any design
specific reason not to fail the mount instead of just warning. This
patch will change it to fail.
Anand Jain [Tue, 9 Jan 2018 01:05:42 +0000 (09:05 +0800)]
btrfs: add support for SUPER_FLAG_CHANGING_FSID
The UUID change by btrfstune sets SUPER_FLAG_CHANGING_FSID and resets it
only when changing fsid is complete. Its not a good idea to mount the
device anything in between, reading metadata blocks would fail with UUID
mismatch.
This patch doesn't add SUPER_FLAG_CHANGING_FSID into
BTRFS_SUPER_FLAG_SUPP list, so mount will fail (along with the fix in
the next patch) when SUPER_FLAG_CHANGING_FSID is set.
Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Wed, 15 Nov 2017 23:28:11 +0000 (16:28 -0700)]
Btrfs: avoid losing data raid profile when deleting a device
We've avoided data losing raid profile when doing balance, but it
turns out that deleting a device could also result in the same
problem.
Say we have 3 disks, and they're created with '-d raid1' profile.
- We have chunk P (the only data chunk on the empty btrfs).
- Suppose that chunk P's two raid1 copies reside in disk A and disk B.
- Now, 'btrfs device remove disk B'
btrfs_rm_device()
-> btrfs_shrink_device()
-> btrfs_relocate_chunk() #relocate any chunk on disk B
to other places.
- Chunk P will be removed and a new chunk will be created to hold
those data, but as chunk P is the only one holding raid1 profile,
after it goes away, the new chunk will be created as single profile
which is our default profile.
This fixes the problem by creating an empty data chunk before
relocating the data chunk.
Metadata/System chunk are supposed to have non-zero bytes all the time
so their raid profile is preserved.
Reported-by: James Alandt <James.Alandt@wdc.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 18 Jan 2018 11:34:31 +0000 (11:34 +0000)]
Btrfs: fix space leak after fallocate and zero range operations
If we do a buffered write after a zero range operation that has an
unaligned (with the filesystem's sector size) end which also falls within
an unwritten (prealloc) extent that is currently beyond the inode's
i_size, and the zero range operation has the flag FALLOC_FL_KEEP_SIZE,
we end up leaking data and metadata space. This happens because when
zeroing a range we call btrfs_truncate_block(), which does delalloc
(loads the page and partially zeroes its content), and in the buffered
write path we only clear existing delalloc space reservation for the
range we are writing into if that range starts at an offset smaller then
the inode's i_size, which makes sense since we can not have delalloc
extents beyond the i_size, only unwritten extents are allowed.
Fix this by ensuring the zero range operation does not call
btrfs_truncate_block() if the corresponding extent is an unwritten one
(it's pointless anyway, since reading from an unwritten extent yields
zeroes).
Signed-off-by: Filipe Manana <fdmanana@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 18 Jan 2018 11:34:20 +0000 (11:34 +0000)]
Btrfs: fix missing inode i_size update after zero range operation
For a fallocate's zero range operation that targets a range with an end
that is not aligned to the sector size, we can end up not updating the
inode's i_size. This happens when the last page of the range maps to an
unwritten (prealloc) extent and before that last page we have either a
hole or a written extent. This is because in this scenario we relied
on a call to btrfs_prealloc_file_range() to update the inode's i_size,
however it can only update the i_size to the "down aligned" end of the
range.
The inode's i_size was left as 428Kb (438272 bytes) when it should have
been updated to 430Kb (440320 bytes).
Fix this by always updating the inode's i_size explicitly after zeroing
the range.
Fixes: ba6d5887946ff86d93dc ("Btrfs: add support for fallocate's zero range operation") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 31 Oct 2017 17:59:54 +0000 (17:59 +0000)]
Btrfs: use cached state when dirtying pages during buffered write
During a buffered IO write, we can have an extent state that we got when
we locked the range (if the range starts at an offset lower than eof), so
always pass it to btrfs_dirty_pages() so that setting the delalloc bit
in the range does not need to do a full search in the inode's io tree,
saving time and reducing the amount of time we hold the io tree's lock.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 25 Oct 2017 10:55:28 +0000 (11:55 +0100)]
Btrfs: add support for fallocate's zero range operation
This implements support the zero range operation of fallocate. For now
at least it's as simple as possible while reusing most of the existing
fallocate and hole punching infrastructure.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Mon, 11 Dec 2017 21:56:31 +0000 (14:56 -0700)]
Btrfs: do not merge rbios if their fail stripe index are not identical
Since fail stripe index in rbio would be used to decide which
algorithm reconstruction would be run, we cannot merge rbios if
their's fail striped indexes are different, otherwise, one of the two
reconstructions would fail.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Wed, 3 Jan 2018 08:08:30 +0000 (16:08 +0800)]
btrfs: rename btrfs_device::scrub_device to scrub_ctx
btrfs_device::scrub_device is not a device which is being scrubbed,
but it holds the scrub context, so rename to reflect the same. No
functional changes here.
Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Thu, 4 Jan 2018 10:01:54 +0000 (18:01 +0800)]
btrfs: remove check for BTRFS_FS_STATE_ERROR which we just set
__btrfs_handle_fs_error() sets BTRFS_FS_STATE_ERROR, and calls
btrfs_handle_error() so no need to check if the BTRFS_FS_STATE_ERROR
is set in btrfs_handle_error(). And there is no other user of
btrfs_handle_error() as well.
Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Tue, 2 Jan 2018 20:36:41 +0000 (13:36 -0700)]
Btrfs: make raid6 rebuild retry more
There is a scenario that can end up with rebuild process failing to
return good content, i.e.
suppose that all disks can be read without problems and if the content
that was read out doesn't match its checksum, currently for raid6
btrfs at most retries twice,
- the 1st retry is to rebuild with all other stripes, it'll eventually
be a raid5 xor rebuild,
- if the 1st fails, the 2nd retry will deliberately fail parity p so
that it will do raid6 style rebuild,
however, the chances are that another non-parity stripe content also
has something corrupted, so that the above retries are not able to
return correct content, and users will think of this as data loss.
More seriouly, if the loss happens on some important internal btree
roots, it could refuse to mount.
This extends btrfs to do more retries and each retry fails only one
stripe. Since raid6 can tolerate 2 disk failures, if there is one
more failure besides the failure on which we're recovering, this can
always work.
The worst case is to retry as many times as the number of raid6 disks,
but given the fact that such a scenario is really rare in practice,
it's still acceptable.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Tue, 2 Jan 2018 20:36:42 +0000 (13:36 -0700)]
Btrfs: fix scrub to repair raid6 corruption
The raid6 corruption is that,
suppose that all disks can be read without problems and if the content
that was read out doesn't match its checksum, currently for raid6
btrfs at most retries twice,
- the 1st retry is to rebuild with all other stripes, it'll eventually
be a raid5 xor rebuild,
- if the 1st fails, the 2nd retry will deliberately fail parity p so
that it will do raid6 style rebuild,
however, the chances are that another non-parity stripe content also
has something corrupted, so that the above retries are not able to
return correct content.
We've fixed normal reads to rebuild raid6 correctly with more retries
in Patch "Btrfs: make raid6 rebuild retry more"[1], this is to fix
scrub to do the exactly same rebuild process.
Anand Jain [Mon, 18 Dec 2017 09:08:59 +0000 (17:08 +0800)]
btrfs: factor btrfs_check_rw_degradable() to check given device
Update btrfs_check_rw_degradable() to check against the given device if
its lost.
We can use this function to know if the volume is going to be in
degraded mode OR failed state, when the given device fails. Which is
needed when we are handling the device failed state.
A preparatory patch does not affect the flow as such.
David Sterba [Tue, 12 Dec 2017 20:43:52 +0000 (21:43 +0100)]
btrfs: sink unlock_extent parameter gfp_flags
All callers pass either GFP_NOFS or GFP_KERNEL now, so we can sink the
parameter to the function, though we lose some of the slightly better
semantics of GFP_KERNEL in some places, it's worth cleaning up the
callchains.
David Sterba [Thu, 7 Dec 2017 17:52:54 +0000 (18:52 +0100)]
btrfs: add separate helper for unlock_extent_cached with GFP_ATOMIC
There's only one instance where we pass different gfp mask to
unlock_extent_cached. Add a separate helper for that and then we can
drop the gfp parameter from unlock_extent_cached.
David Sterba [Tue, 2 Jan 2018 17:19:50 +0000 (18:19 +0100)]
btrfs: drop unused parameters from mount_subvol
Recent patches reworking the mount path left some unused parameters. We
pass a vfsmount to mount_subvol, the flags and data (ie. mount options)
have been already applied and we will not need them.
Misono, Tomohiro [Thu, 14 Dec 2017 08:28:00 +0000 (17:28 +0900)]
btrfs: cleanup unnecessary string dup in btrfs_parse_options()
Long ago, commit edf24abe51493 ("btrfs: sanity mount option parsing and
early mount code") split the btrfs_parse_options() into two parts
(btrfs_parse_early_options() and btrfs_parse_options()). As a result,
btrfs_parse_optins no longer gets called twice and is the last one to
parse mount option string. Therefore there is no need to dup it.
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Wed, 13 Dec 2017 08:25:40 +0000 (10:25 +0200)]
btrfs: Remove redundant pair of bio_get/set in __btrfs_submit_dio_bio
The bio is not referenced after it has been submitted and the endio is
going to consume the sole reference on successful submission. On error,
the callers of __btrfs_submit_dio_bio do invoke bio_put so we don't
leak it either.
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Wed, 13 Dec 2017 08:25:38 +0000 (10:25 +0200)]
btrfs: Remove redundant bio_get/set from submit_dio_repair_bio
The bio that is passsed is the newly created repair bio which already
has a reference count of 1, which is going to be consumed by the
endio routine on successful submission. On error the handler also
calls bio_put.
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Wed, 13 Dec 2017 08:25:37 +0000 (10:25 +0200)]
btrfs: Remove redundant bio_get/set calls in compressed read/write paths
bio_get/set is necessary only if the bio is going to be referenced
following submissions. In the code paths where such calls are made
we don't really need them since the bio is referenced only if
btrfs_map_bio returns an error. And this function can return an error
prior to submission only. So referencing the bio is safe. Furthermore
we do call bio_endio which will consume the last reference. So let's
remove the redundant calls.
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Misono, Tomohiro [Wed, 17 Jan 2018 08:38:31 +0000 (17:38 +0900)]
btrfs: remove unused arg from parse_subvol_options()
Remove unused arg 'holder' from parse_subvol_options(), which has been
forgotten to be cleaned in the commit b99beb110e2d ("btrfs: split
parse_early_options() in two").
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
Misono, Tomohiro [Thu, 14 Dec 2017 08:25:28 +0000 (17:25 +0900)]
btrfs: split parse_early_options() in two
Now parse_early_options() is used by both btrfs_mount() and
btrfs_mount_root(). However, the former only needs subvol related part
and the latter needs the others.
Therefore extract the subvol related parts from parse_early_options() and
move it to new parse function (parse_subvol_options()).
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Misono, Tomohiro [Thu, 14 Dec 2017 08:25:01 +0000 (17:25 +0900)]
btrfs: cleanup btrfs_mount() using btrfs_mount_root()
Cleanup btrfs_mount() by using btrfs_mount_root(). This avoids getting
btrfs_mount() called twice in mount path.
Old btrfs_mount() will do:
0. VFS layer calls vfs_kern_mount() with registered file_system_type
(for btrfs, btrfs_fs_type). btrfs_mount() is called on the way.
1. btrfs_parse_early_options() parses "subvolid=" mount option and set the
value to subvol_objectid. Otherwise, subvol_objectid has the initial
value of 0
2. check subvol_objectid is 5 or not. Assume this time id is not 5, then
btrfs_mount() returns by calling mount_subvol()
3. In mount_subvol(), original mount options are modified to contain
"subvolid=0" in setup_root_args(). Then, vfs_kern_mount() is called with
btrfs_fs_type and new options
4. btrfs_mount() is called again
5. btrfs_parse_early_options() parses "subvolid=0" and set 5 (instead of 0)
to subvol_objectid
6. check subvol_objectid is 5 or not. This time id is 5 and mount_subvol()
is not called. btrfs_mount() finishes mounting a root
7. (in mount_subvol()) with using a return vale of vfs_kern_mount(), it
calls mount_subtree()
8. return subvolume's dentry
Reusing the same file_system_type (and btrfs_mount()) for vfs_kern_mount()
is the cause of complication.
Instead, new btrfs_mount() will do:
1. parse subvol id related options for later use in mount_subvol()
2. mount device's root by calling vfs_kern_mount() with
btrfs_root_fs_type, which is not registered to VFS by
register_filesystem(). As a result, btrfs_mount_root() is called
3. return by calling mount_subvol()
The code of 2. is moved from the first part of mount_subvol().
The semantics of device holder changes from btrfs_fs_type to
btrfs_root_fs_type and has to be used in all contexts. Otherwise we'd
get wrong results when mount and dev scan would not check the same
thing. (this has been found indendently and the fix is folded into this
patch)
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com>
[ fold the btrfs_control_ioctl fixup, extend the comment ] Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 30 Nov 2017 17:00:02 +0000 (18:00 +0100)]
btrfs: unify extent_page_data type passed as void
Functions called from extent_write_cache_pages used void* as generic
callback data, but all of them convert it to extent_page_data, or use it
directly.
David Sterba [Fri, 23 Jun 2017 02:30:28 +0000 (04:30 +0200)]
btrfs: sink writepage parameter to extent_write_cache_pages
The function extent_write_cache_pages is modelled after
write_cache_pages which is a generic interface and the writepage
parameter makes sense there. In btrfs we know exactly which callback
we're going to use, so we can pass it directly.
David Sterba [Fri, 23 Jun 2017 02:16:17 +0000 (04:16 +0200)]
btrfs: merge two flush_write_bio helpers
flush_epd_write_bio is same as flush_write_bio, no point having two such
functions. Merge them to flush_write_bio. The 'noinline' attribute is
removed as it does not have any meaning.
Nikolay Borisov [Fri, 8 Dec 2017 14:27:43 +0000 (16:27 +0200)]
btrfs: Rename bin_search -> btrfs_bin_search
Currently there are 2 function doing binary search on btrfs nodes:
bin_search and btrfs_bin_search. The latter being a simple wrapper for
the former. So eliminate the wrapper and just rename bin_search to
btrfs_bin_search. No functional changes
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Fri, 8 Dec 2017 13:55:59 +0000 (15:55 +0200)]
btrfs: sink extent_write_full_page tree argument
The tree argument passed to extent_write_full_page is referenced from
the page being passed to the same function. Since we already have
enough information to get the reference, remove the function parameter.
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Fri, 8 Dec 2017 13:55:58 +0000 (15:55 +0200)]
btrfs: sink extent_write_locked_range tree parameter
This function is called only from submit_compressed_extents and the
io tree being passed is always that of the inode. But we are also
passing the inode, so just move getting the io tree pointer in
extent_write_locked_range to simplify the signature.
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Mon, 11 Dec 2017 14:38:48 +0000 (16:38 +0200)]
btrfs: Remove pair of bio_get/put in btrfs_schedule_bio
This code was added in 492bb6deee34 ("Btrfs: Hold a reference on bios
during submit_bio, add some extra bio checks"). However, holding a
reference on a bio is necessary only if it's going to be referenced
after the submit_bio returns and the bio is completed. In this
particular instance this is not the case so there is no need to hold
an extra reference since we directly return.
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Tue, 12 Dec 2017 09:14:49 +0000 (11:14 +0200)]
btrfs: Fix out of bounds access in btrfs_search_slot
When modifying a tree where the root is at BTRFS_MAX_LEVEL - 1 then
the level variable is going to be 7 (this is the max height of the
tree). On the other hand btrfs_cow_block is always called with
"level + 1" as an index into the nodes and slots arrays. This leads to
an out of bounds access. Admittdely this will be benign since an OOB
access of the nodes array will likely read the 0th element from the
slots array, which in this case is going to be 0 (since we start CoW at
the top of the tree). The OOB access into the slots array in turn will
read the 0th and 1st values of the locks array, which would both be 0
at the time. However, this benign behavior relies on the fact that the
path being passed hasn't been initialised, if it has already been used to
query a btree then it could potentially have populated the nodes/slots arrays.
Fix it by explicitly checking if we are at level 7 (the maximum allowed
index in nodes/slots arrays) and explicitly call the CoW routine with
NULL for parent's node/slot.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Fixes-coverity-id: 711515 Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Tue, 5 Dec 2017 07:29:19 +0000 (09:29 +0200)]
btrfs: Handle btrfs_set_extent_delalloc failure in fixup worker
This function was introduced by 247e743cbe6e ("Btrfs: Use async helpers
to deal with pages that have been improperly dirtied") and it didn't do
any error handling then. This function might very well fail in ENOMEM
situation, yet it's not handled, this could lead to inconsistent state.
So let's handle the failure by setting the mapping error bit.
Cc: stable@vger.kernel.org Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Fri, 1 Dec 2017 09:19:43 +0000 (11:19 +0200)]
btrfs: remove redundant check in btrfs_get_extent_fiemap
Before returning hole_em in btrfs_get_fiemap_extent we check if it's different
than null. However, by the time this null check is triggered we already know
hole_em is not null because it means it points to the em we found and it
has already been dereferenced.
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Fri, 1 Dec 2017 09:19:40 +0000 (11:19 +0200)]
btrfs: Remove unused variable in btrfs_get_extent
trans was statically assigned to NULL and this never changed over the
course of btrfs_get_extent. So remove any code which checks whether
trans != NULL and just hardcode the fact trans is always NULL.
Resolves-coverity-id: 112806 Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Arnd Bergmann [Wed, 6 Dec 2017 14:18:14 +0000 (15:18 +0100)]
btrfs: tree-checker: use %zu format string for size_t
The return value of sizeof() is of type size_t, so we must print it
using the %z format modifier rather than %l to avoid this warning
on some architectures:
fs/btrfs/tree-checker.c: In function 'check_dir_item':
fs/btrfs/tree-checker.c:273:50: error: format '%lu' expects argument of type 'long unsigned int', but argument 5 has type 'u32' {aka 'unsigned int'} [-Werror=format=]
Fixes: 005887f2e3e0 ("btrfs: tree-checker: Add checker for dir item") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Fri, 1 Dec 2017 00:26:39 +0000 (17:26 -0700)]
Btrfs: use struct completion in scrub_submit_raid56_bio_wait
This changes to use struct completion directly and removes 'struct
scrub_bio_ret' along with the code using it.
This struct is used to get the return value from bio, but the caller can
access bio to get the return value directly and is holding a reference
on it so it won't go away underneath us and can be removed safely.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Timofey Titovets [Mon, 23 Oct 2017 22:29:48 +0000 (01:29 +0300)]
Btrfs: compress_file_range() change page dirty status once
We need to call extent_range_clear_dirty_for_io()
on compression range to prevent application from changing
page content, while pages compressing.
extent_range_clear_dirty_for_io() runs on each loop iteration,
"(end - start)" can be much (up to 1024 times) bigger
then compression range (BTRFS_MAX_UNCOMPRESSED).
The start pointer is advanced each time we manage to compress part of
the range. The end pointer does not change so we could redirty the
remaining parts repeatedly.
Fix that behaviour by call extent_range_clear_dirty_for_io()
only once, the first time it happens.
This is the safest but probably not the best behaviour. Previous
iterations of the patch tried to redirty only the range that we were not
able to compress. This has been refused by David for safety reasons, the
writeout callchain is complex and there could be some path that relies
on redirtying the entire unwritten range.
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com>
[ enhance changelog, the history and safety concerns, add comment ] Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs: compression heuristic: replace heap sort with radix sort
Slowest part of heuristic for now is kernel heap sort()
It's can take up to 55% of runtime on sorting bucket items.
As sorting will always call on most data sets to get correctly
byte_core_set_size, the only way to speed up heuristic, is to
speed up sort on bucket.
Add a general radix_sort function.
Radix sort require 2 buffers, one full size of input array
and one for store counters (jump addresses).
That increase usage per heuristic workspace +1KiB
8KiB + 1KiB -> 8KiB + 2KiB
That is LSD Radix, i use 4 bit as a base for calculating,
to make counters array acceptable small (16 elements * 8 byte).
That Radix sort implementation have several points to adjust,
I added him to make radix sort general usable in kernel,
like heap sort, if needed.
Performance tested in userspace copy of heuristic code,
throughput:
- average <-> random data: ~3500 MiB/s - heap sort
- average <-> random data: ~6000 MiB/s - radix sort