git.proxmox.com Git - mirror_ubuntu-eoan-kernel.git/log

btrfs: Handle btrfs_set_extent_delalloc failure in relocate_file_extent_cluster

Essentially duplicate the error handling from the above block which
handles the !PageUptodate(page) case and additionally clear
EXTENT_BOUNDARY.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: handle failure of add_pending_csums

add_pending_csums was added as part of the new data=ordered
implementation in e6dcd2dc9c48 ("Btrfs: New data=ordered
implementation"). Even back then it called the btrfs_csum_file_blocks
which can fail but it never bothered handling the failure. In ENOMEM
situation this could lead to the filesystem failing to write the
checksums for a particular extent and not detect this. On read this
could lead to the filesystem erroring out due to crc mismatch. Fix it by
propagating failure from add_pending_csums and handling them.
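
As a rough userspace illustration of the propagation described above (not
the kernel code itself): the struct and the csum_file_blocks() helper below
are hypothetical stand-ins, and the only point is that the first failure is
returned to the caller instead of being dropped.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for one pending checksum entry. */
struct pending_sum {
	int fail_with;	/* 0 on success, or a negative errno to simulate failure */
};

/* Hypothetical stand-in for a checksum write that can fail. */
static int csum_file_blocks(const struct pending_sum *sum)
{
	return sum->fail_with;
}

/*
 * Write every pending checksum and stop at the first failure,
 * returning the error to the caller instead of ignoring it.
 */
static int add_pending_csums(const struct pending_sum *sums, size_t count)
{
	for (size_t i = 0; i < count; i++) {
		int ret = csum_file_blocks(&sums[i]);

		if (ret)
			return ret;
	}
	return 0;
}
```

A caller in this sketch would check the return value and abort its own
operation on a non-zero result.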

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use kvzalloc to allocate btrfs_fs_info

The srcu_struct in btrfs_fs_info scales in size with NR_CPUS. On
kernels built with NR_CPUS=8192, this can result in kmalloc failures
that prevent mounting.

There is work in progress to try to resolve this for every user of
srcu_struct but using kvzalloc will work around the failures until
that is complete.

As an example with NR_CPUS=512 on x86_64: the overall size of
subvol_srcu is 3460 bytes, fs_info is 6496.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Fix use-after-free when cleaning up fs_devs with a single stale device

Commit 4fde46f0cc71 ("Btrfs: free the stale device") introduced
btrfs_free_stale_device which iterates the device lists for all
registered btrfs filesystems and deletes those devices which aren't
mounted. If a btrfs_fs_devices structure has only 1 device attached to it
and it is unused then btrfs_free_stale_devices will proceed to also free
the btrfs_fs_devices struct itself. Currently this leads to a use after
free since list_for_each_entry will try to perform a check on the
already freed memory to see if it has to terminate the loop.

The fix is to use 'break' when we know we are freeing the current
fs_devs.
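
The pattern can be sketched in userspace C as follows. This is a simplified
model with a hypothetical singly linked list (struct fs_devs, new_fs_devs()
and free_stale_device() are illustrative names, not the kernel's list_head
API): once the current entry has been freed, the loop must be left before
its increment step reads the freed entry again.

```c
#include <stdlib.h>

/* Hypothetical, simplified stand-in for struct btrfs_fs_devices. */
struct fs_devs {
	struct fs_devs *next;
	int opened;		/* non-zero while the filesystem is mounted */
	int num_devices;
	int has_stale;		/* non-zero if one attached device is stale */
};

/* Helper for the sketch: allocate one list entry. */
static struct fs_devs *new_fs_devs(int opened, int num_devices, int has_stale,
				   struct fs_devs *next)
{
	struct fs_devs *fs = calloc(1, sizeof(*fs));

	if (!fs)
		abort();
	fs->opened = opened;
	fs->num_devices = num_devices;
	fs->has_stale = has_stale;
	fs->next = next;
	return fs;
}

/* Drop stale devices from unopened entries; return the (possibly new) head. */
static struct fs_devs *free_stale_device(struct fs_devs *head)
{
	struct fs_devs **link = &head;
	struct fs_devs *cur;

	for (cur = head; cur; link = &cur->next, cur = cur->next) {
		if (cur->opened || !cur->has_stale)
			continue;
		if (cur->num_devices == 1) {
			/* Last device: the container itself goes away. */
			*link = cur->next;
			free(cur);
			/* cur is freed; leave before the loop reads cur->next. */
			break;
		}
		/* The entry survives, so continuing the iteration is safe. */
		cur->num_devices--;
		cur->has_stale = 0;
	}
	return head;
}
```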

Fixes: 4fde46f0cc71 ("Btrfs: free the stale device")
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: fix null pointer dereference when replacing missing device

When we are replacing a missing device we mount the filesystem with the
degraded mode option in which case we are allowed to have a btrfs device
structure without a backing device member (its bdev member is NULL) and
therefore we can't dereference that member. Commit 38b5f68e9811
("btrfs: drop btrfs_device::can_discard to query directly") started to
dereference that member when discarding extents, resulting in a null
pointer dereference:

[ 3145.322257] BTRFS warning (device sdf): devid 2 uuid 4d922414-58eb-4880-8fed-9c3840f6c5d5 is missing
[ 3145.364116] BTRFS info (device sdf): dev_replace from <missing disk> (devid 2) to /dev/sdg started
[ 3145.413489] BUG: unable to handle kernel NULL pointer dereference at 00000000000000e0
[ 3145.415085] IP: btrfs_discard_extent+0x6a/0xf8 [btrfs]
[ 3145.415085] PGD 0 P4D 0
[ 3145.415085] Oops: 0000 [#1] PREEMPT SMP PTI
[ 3145.415085] Modules linked in: ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper evdev psmouse parport_pc serio_raw i2c_piix4 i2
[ 3145.415085] CPU: 0 PID: 11989 Comm: btrfs Tainted: G        W        4.15.0-rc9-btrfs-next-55+ #1
[ 3145.415085] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
[ 3145.415085] RIP: 0010:btrfs_discard_extent+0x6a/0xf8 [btrfs]
[ 3145.415085] RSP: 0018:ffffc90004813c60 EFLAGS: 00010293
[ 3145.415085] RAX: ffff88020d39cc00 RBX: ffff88020c4ea2a0 RCX: 0000000000000002
[ 3145.415085] RDX: 0000000000000000 RSI: ffff88020c4ea240 RDI: 0000000000000000
[ 3145.415085] RBP: 0000000000000000 R08: 0000000000004000 R09: 0000000000000000
[ 3145.415085] R10: ffffc90004813ae8 R11: 0000000000000000 R12: 0000000000000000
[ 3145.415085] R13: ffff88020c418000 R14: 0000000000000000 R15: 0000000000000000
[ 3145.415085] FS:  00007f565681f8c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
[ 3145.415085] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3145.415085] CR2: 00000000000000e0 CR3: 000000020d208006 CR4: 00000000001606f0
[ 3145.415085] Call Trace:
[ 3145.415085]  btrfs_finish_extent_commit+0x9a/0x1be [btrfs]
[ 3145.415085]  btrfs_commit_transaction+0x649/0x7a0 [btrfs]
[ 3145.415085]  ? start_transaction+0x2b0/0x3b3 [btrfs]
[ 3145.415085]  btrfs_dev_replace_start+0x274/0x30c [btrfs]
[ 3145.415085]  btrfs_dev_replace_by_ioctl+0x45/0x59 [btrfs]
[ 3145.415085]  btrfs_ioctl+0x1a91/0x1d62 [btrfs]
[ 3145.415085]  ? lock_acquire+0x16a/0x1af
[ 3145.415085]  ? vfs_ioctl+0x1b/0x28
[ 3145.415085]  ? trace_hardirqs_on_caller+0x14c/0x1a6
[ 3145.415085]  vfs_ioctl+0x1b/0x28
[ 3145.415085]  do_vfs_ioctl+0x5a9/0x5e0
[ 3145.415085]  ? _raw_spin_unlock_irq+0x34/0x46
[ 3145.415085]  ? entry_SYSCALL_64_fastpath+0x5/0x8b
[ 3145.415085]  ? trace_hardirqs_on_caller+0x14c/0x1a6
[ 3145.415085]  SyS_ioctl+0x52/0x76
[ 3145.415085]  entry_SYSCALL_64_fastpath+0x1e/0x8b
[ 3145.415085] RIP: 0033:0x7f56558b3c47
[ 3145.415085] RSP: 002b:00007ffdcfac4c58 EFLAGS: 00000202
[ 3145.415085] Code: be 02 00 00 00 4c 89 ef e8 b9 e7 03 00 85 c0 89 c5 75 75 48 8b 44 24 08 45 31 f6 48 8d 58 60 eb 52 48 8b 03 48 8b b8 a0 00 00 00 <48> 8b 87 e0 00
[ 3145.415085] RIP: btrfs_discard_extent+0x6a/0xf8 [btrfs] RSP: ffffc90004813c60
[ 3145.415085] CR2: 00000000000000e0
[ 3145.458185] ---[ end trace 06302e7ac31902bf ]---

This is trivially reproduced by running the test btrfs/027 from fstests
like this:

  $ MOUNT_OPTIONS="-o discard" ./check btrfs/027

Fix this by skipping devices without a backing device before attempting
to discard.
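
A minimal userspace sketch of the check, assuming simplified stand-in
structures (struct block_dev, struct device and count_discardable() are
hypothetical and only model the idea of testing bdev for NULL before
dereferencing it):

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct block_dev {
	int supports_discard;
};

struct device {
	struct block_dev *bdev;	/* NULL for a missing device (degraded mount) */
};

/* Return how many devices would be sent a discard request. */
static int count_discardable(struct device **devs, size_t n)
{
	int count = 0;

	for (size_t i = 0; i < n; i++) {
		/* Check for a missing device before touching bdev. */
		if (!devs[i]->bdev)
			continue;
		if (!devs[i]->bdev->supports_discard)
			continue;
		count++;
	}
	return count;
}
```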

Fixes: 38b5f68e9811 ("btrfs: drop btrfs_device::can_discard to query directly")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove spurious WARN_ON(ref->count < 0) in find_parent_nodes

Until v4.14, this warning was very infrequent:

WARNING: CPU: 3 PID: 18172 at fs/btrfs/backref.c:1391 find_parent_nodes+0xc41/0x14e0
Modules linked in: [...]
CPU: 3 PID: 18172 Comm: bees Tainted: G D W L 4.11.9-zb64+ #1
Hardware name: System manufacturer System Product Name/M5A78L-M/USB3, BIOS 2101 12/02/2014
Call Trace:
dump_stack+0x85/0xc2
__warn+0xd1/0xf0
warn_slowpath_null+0x1d/0x20
find_parent_nodes+0xc41/0x14e0
__btrfs_find_all_roots+0xad/0x120
? extent_same_check_offsets+0x70/0x70
iterate_extent_inodes+0x168/0x300
iterate_inodes_from_logical+0x87/0xb0
? iterate_inodes_from_logical+0x87/0xb0
? extent_same_check_offsets+0x70/0x70
btrfs_ioctl+0x8ac/0x2820
? lock_acquire+0xc2/0x200
do_vfs_ioctl+0x91/0x700
? __fget+0x112/0x200
SyS_ioctl+0x79/0x90
entry_SYSCALL_64_fastpath+0x23/0xc6
? trace_hardirqs_off_caller+0x1f/0x140

Starting with v4.14 (specifically 86d5f9944252 ("btrfs: convert prelimary
reference tracking to use rbtrees")) the WARN_ON occurs three orders of
magnitude more frequently--almost once per second while running workloads
like bees.

Replace the WARN_ON() with a comment giving the rationale for its removal.
The rationale is paraphrased from an explanation by Edmund Nadolski
<enadolski@suse.de> on the linux-btrfs mailing list.

Fixes: 8da6d5815c59 ("Btrfs: added btrfs_find_all_roots()")
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Ignore errors from btrfs_qgroup_trace_extent_post

Running generic/019 with qgroups on the scratch device enabled is almost
guaranteed to trigger the BUG_ON in btrfs_free_tree_block. It's supposed
to trigger only on -ENOMEM; in reality, however, it's possible to get
-EIO from btrfs_qgroup_trace_extent_post. This function just finds the
roots of the extent being tracked and sets the qrecord->old_roots list.
If this operation fails nothing critical happens except the quota
accounting can be considered wrong. In such a case just set the
INCONSISTENT flag for the quota and print a warning, rather than killing
off the system. Additionally, it's possible to trigger a BUG_ON in
btrfs_truncate_inode_items as well.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ error message adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: fix unexpected -EEXIST when creating new inode

The highest objectid, which is assigned to a new inode, is decided at
the time of initializing fs roots. However, in cases where log replay
gets processed, the btree which the fs root owns might be changed, so we
have to search it again for the highest objectid, otherwise creating
a new inode would end up with -EEXIST.

cc: <stable@vger.kernel.org> v4.4-rc6+
Fixes: f32e48e92596 ("Btrfs: Initialize btrfs_root->highest_objectid when loading tree root and subvolume roots")
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: fix use-after-free on root->orphan_block_rsv

I got these from running generic/475:

WARNING: CPU: 0 PID: 26384 at fs/btrfs/inode.c:3326 btrfs_orphan_commit_root+0x1ac/0x2b0 [btrfs]
BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
IP: btrfs_block_rsv_release+0x1c/0x70 [btrfs]
Call Trace:
  btrfs_orphan_release_metadata+0x9f/0x200 [btrfs]
  btrfs_orphan_del+0x10d/0x170 [btrfs]
  btrfs_setattr+0x500/0x640 [btrfs]
  notify_change+0x7ae/0x870
  do_truncate+0xca/0x130
  vfs_truncate+0x2ee/0x3d0
  do_sys_truncate+0xaf/0xf0
  SyS_truncate+0xe/0x10
  entry_SYSCALL_64_fastpath+0x1f/0x96

The race is between btrfs_orphan_commit_root and btrfs_orphan_del,
        t1                                        t2
btrfs_orphan_commit_root                     btrfs_orphan_del
   spin_lock
   check (&root->orphan_inodes)
   root->orphan_block_rsv = NULL;
   spin_unlock
                                             atomic_dec(&root->orphan_inodes);
                                             access root->orphan_block_rsv

Accessing root->orphan_block_rsv must be done before decreasing
root->orphan_inodes.

cc: <stable@vger.kernel.org> v3.12+
Fixes: 703c88e03524 ("Btrfs: fix tracking of orphan inode count")
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: fix btrfs_evict_inode to handle abnormal inodes correctly

This regression is introduced in
commit 3d48d9810de4 ("btrfs: Handle uninitialised inode eviction").

There are two problems,

a) it is ->destroy_inode() that does the final free on inode, not
->evict_inode(),
b) clear_inode() must be called before ->evict_inode() returns.

This could end up hitting BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
in evict() because I_CLEAR is set in clear_inode().

Fixes: commit 3d48d9810de4 ("btrfs: Handle uninitialised inode eviction")
Cc: <stable@vger.kernel.org> # v4.7-rc6+
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: fix extent state leak from tree log

It's possible that btrfs_sync_log() bails out after one of the two
btrfs_write_marked_extents() calls, which convert the extent state's bits
from EXTENT_DIRTY/EXTENT_NEW into EXTENT_NEED_WAIT. However, free_log_tree()
searches only for EXTENT_DIRTY and EXTENT_NEW, so extent states left with
EXTENT_NEED_WAIT are never freed and leak memory.
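
The effect of the search mask can be illustrated with a small userspace
model. The bit values below are illustrative only (the kernel's actual
values are not assumed) and count_found() is a hypothetical helper: a
cleanup pass that searches only for DIRTY and NEW never sees states already
converted to NEED_WAIT, while a mask that includes NEED_WAIT finds them
all.

```c
#include <stddef.h>

/* Illustrative bit values; the kernel's actual values are not assumed. */
#define EXTENT_DIRTY		(1U << 0)
#define EXTENT_NEW		(1U << 1)
#define EXTENT_NEED_WAIT	(1U << 2)

/* Count the states that a cleanup pass searching for `mask` would find. */
static size_t count_found(const unsigned int *states, size_t n,
			  unsigned int mask)
{
	size_t found = 0;

	for (size_t i = 0; i < n; i++) {
		if (states[i] & mask)
			found++;
	}
	return found;
}
```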

cc: <stable@vger.kernel.org>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: fix crash due to not cleaning up tree log block's dirty bits

If the whole fs flips to read-only because of a failure in a critical
section, the log tree's blocks are still dirty, and this leads to a
use-after-free crash at umount time:

umount
-> close_ctree
    -> stop workers
    -> iput(btree_inode)
       -> iput_final
          -> write_inode_now
             -> ...
                -> queue job on stop'd workers

cc: <stable@vger.kernel.org> v3.12+
Fixes: 681ae50917df ("Btrfs: cleanup reserved space when freeing tree log on error")
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: fix deadlock in run_delalloc_nocow

@cur_offset is not set back to what it should be (@cow_start) if
btrfs_next_leaf() returns an error, and the range [cow_start,
cur_offset) remains locked forever.

cc: <stable@vger.kernel.org>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: drop devid as device_list_add() arg

Since struct btrfs_disk_super is already passed to device_list_add(),
it can get devid from there the same way its caller does.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: get device pointer from device_list_add()

Instead of passing a pointer to btrfs_fs_devices as an arg to
device_list_add(), it is better to get a pointer to btrfs_device as the
return value; then we have both the btrfs_device and the btrfs_fs_devices
pointers. btrfs_device is needed to handle a reappearing missing device.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: set the total_devices in device_list_add()

There is no caller of device_list_add() other than
btrfs_scan_one_device(), which sets btrfs_fs_devices::total_devices
if device_list_add() is successful; this can be done within
device_list_add() itself.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: move pr_info into device_list_add

Commit 60999ca4b403 ("btrfs: make device scan less noisy")
adds return value 1 to device_list_add(), so that the caller can
call pr_info only when a new device is added. Move the pr_info() part
into device_list_add() so that this function can be kept simple.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: make btrfs_free_stale_devices() to match the path

btrfs_free_stale_devices() is updated to match the given device
path and delete it. (It searches only the list of unmounted devices.)
Also drop the comment about different paths being used for the same
device: since we will now have a CLI to clean up any device, that is not
a concern any more.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: rename btrfs_free_stale_devices() arg to skip_dev

No functional changes.
Rename btrfs_free_stale_devices() arg to skip_dev, so that it
reflects what that arg is for.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: make btrfs_free_stale_devices() argument optional

This updates the btrfs_free_stale_devices() helper function to delete all
unmounted devices when the arg is NULL.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: make btrfs_free_stale_device() to iterate all stales

Let the list iterator iterate further and find other stale
devices and delete them. This is in preparation to add support
for user land request-able stale devices cleanup. Also rename
btrfs_free_stale_device() to btrfs_free_stale_devices().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: no need to check for btrfs_fs_devices::seeding

There is no need to check for btrfs_fs_devices::seeding when we
have checked for btrfs_fs_devices::opened, because we can't sprout
without its seed FS being opened.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Use IS_ALIGNED in btrfs_truncate_block instead of opencoding it

No functional changes, just makes the code more readable
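
For illustration, a userspace sketch of the two equivalent forms. The macro
below is written in the same shape as the kernel's IS_ALIGNED() (it assumes
the alignment is a power of two and uses the GNU __typeof__ extension); the
two wrapper functions are hypothetical and are not the code in
btrfs_truncate_block().

```c
#include <stdint.h>

/* Same shape as the kernel macro; `a` must be a power of two. */
#define IS_ALIGNED(x, a) (((x) & ((__typeof__(x))(a) - 1)) == 0)

/* Open-coded alignment test, as it might look before such a cleanup. */
static int aligned_opencoded(uint64_t offset, uint32_t blocksize)
{
	return (offset & (blocksize - 1)) == 0;
}

/* The same test expressed through the macro. */
static int aligned_macro(uint64_t offset, uint32_t blocksize)
{
	return IS_ALIGNED(offset, blocksize);
}
```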

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: noinline merge_extent_mapping

In order to debug subtle bugs around merge_extent_mapping(), perf probe
can be used to check the arguments, but sometimes merge_extent_mapping()
gets inlined by the compiler and can't be probed.

This adds the noinline attribute to merge_extent_mapping().

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: add WARN_ONCE to detect unexpected error from merge_extent_mapping

This is a subtle case, so in order to understand the problem, it'd be good
to know the content of existing and em when any error occurs.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: extent map selftest: dio write vs dio read

This test case simulates the racy situation of dio write vs dio read,
and checks whether btrfs_get_extent() would return -EEXIST.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: extent map selftest: buffered write vs dio read

This test case simulates the racy situation of buffered write vs dio
read, and checks whether btrfs_get_extent() would return -EEXIST.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: add extent map selftests

We've observed that btrfs_get_extent() and merge_extent_mapping() could
return -EEXIST in several cases, all caused by racy conditions,
e.g. dio read vs dio write, which makes the problem very tricky
to reproduce.

This adds extent map selftests in order to simulate those racy situations.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
[ minor string adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: move extent map specific code to extent_map.c

These helpers are extent map specific, move them to extent_map.c.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: add helper for em merge logic

This is preparatory work for the following extent map selftest, which
runs tests against em merge logic.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: fix unexpected EEXIST from btrfs_get_extent

This fixes a corner case that is caused by a race of dio write vs dio
read/write.

Here is how the race could happen.

Suppose that no extent map has been loaded into memory yet.
There is a file extent [0, 32K), two jobs are running concurrently
against it, t1 is doing dio write to [8K, 32K) and t2 is doing dio
read from [0, 4K) or [4K, 8K).

t1 goes ahead of t2 and splits em [0, 32K) to em [0K, 8K) and [8K, 32K).

------------------------------------------------------
             t1                                t2
      btrfs_get_blocks_direct()         btrfs_get_blocks_direct()
       -> btrfs_get_extent()              -> btrfs_get_extent()
           -> lookup_extent_mapping()
           -> add_extent_mapping()            -> lookup_extent_mapping()
              # load [0, 32K)
       -> btrfs_new_extent_direct()
           -> btrfs_drop_extent_cache()
              # split [0, 32K) and
              # drop [8K, 32K)
           -> add_extent_mapping()
              # add [8K, 32K)
                                              -> add_extent_mapping()
                                                 # handle -EEXIST when adding
                                                 # [0, 32K)
------------------------------------------------------
About how t2(dio read/write) runs into -EEXIST:

a) add_extent_mapping() gets -EEXIST for adding em [0, 32k),

b) search_extent_mapping() then returns [0, 8k) as the existing em,
   even though start == existing->start, em is [0, 32k) so that
   extent_map_end(em) > extent_map_end(existing), i.e. 32k > 8k,

c) then it goes thru merge_extent_mapping() which tries to add a [8k, 8k)
   (with a length 0) and returns -EEXIST as [8k, 32k) is already in tree,

d) so btrfs_get_extent() ends up returning -EEXIST to dio read/write,
   which confuses applications.

Here I enumerate all the possible situations:
1) start < existing->start

            +-----------+em+-----------+
+--prev---+ |     +-------------+      |
|         | |     |             |      |
+---------+ +     +---+existing++      ++
                +
                |
                +
             start

2) start == existing->start

      +------------em------------+
      |     +-------------+      |
      |     |             |      |
      +     +----existing-+      +
            |
            |
            +
         start

3) start > existing->start && start < (existing->start + existing->len)

      +------------em------------+
      |     +-------------+      |
      |     |             |      |
      +     +----existing-+      +
               |
               |
               +
             start

4) start >= (existing->start + existing->len)

+-----------+em+-----------+
|     +-------------+      | +--next---+
|     |             |      | |         |
+     +---+existing++      + +---------+
                      +
                      |
                      +
                   start

As we can see, it turns out that if start is within existing em (front
inclusive), then the existing em should be returned as is, otherwise,
we try our best to merge candidate em with sibling ems to form a
larger em (in order to reduce the total number of em).
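
The rule above reduces to a half-open interval test. A minimal sketch,
assuming plain byte offsets (start_within_existing() is a hypothetical
helper name, not a function from the patch):

```c
#include <stdint.h>

/*
 * Return 1 if start lies in [existing_start, existing_start + existing_len),
 * i.e. cases 2) and 3) above, where the existing em is returned as is.
 * Return 0 for cases 1) and 4), where merging with sibling ems is tried.
 */
static int start_within_existing(uint64_t start, uint64_t existing_start,
				 uint64_t existing_len)
{
	return start >= existing_start &&
	       start < existing_start + existing_len;
}
```

With the existing em [0, 8K) from the race above, start = 0 and start = 4K
are both inside, while start = 8K is not.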

Reported-by: David Vallender <david.vallender@landmark.co.uk>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: fix incorrect block_len in merge_extent_mapping

%block_len may be checked when deciding whether two ems are mergeable.

merge_extent_mapping() has only added the front pad if the front part
of em gets truncated, but it's possible that the end part gets
truncated.

For both compressed extent and inline extent, em->block_len is not
adjusted accordingly, and for regular extent, em->block_len always
equals em->len, hence this sets em->block_len to em->len.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Remove unused readahead spinlock

The reada_lock in struct btrfs_device was only initialised, and not
actually used. That's good because there's another lock also called
reada_lock in the btrfs_fs_info that was quite heavily used. Remove
this one.

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: raid56: fix race between merge_bio and rbio_orig_end_io

Before rbio_orig_end_io() goes to free rbio, rbio may get merged with
more bios from other rbios and rbio->bio_list becomes non-empty,
in that case, these newly merged bios don't end properly.

Once unlock_stripe() is done, rbio->bio_list will not be updated any
more and we can call bio_endio() on all queued bios.

This should only happen in error-out cases; the normal paths of recover
and full stripe write have already set RBIO_RMW_LOCKED_BIT to disable
merging before doing IO, so rbio_orig_end_io() called by them doesn't
have the above issue.

Reported-by: Jérôme Carretero <cJ-ko@zougloub.eu>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: do not cache rbio pages if using raid6 recover

Since raid6 recover tries all possible combinations of failed stripes,

- when raid6 rebuild algorithm is used, i.e. raid6_datap_recov() and
  raid6_2data_recov(), it may change the in-memory content of failed
  stripes, if such a raid bio is cached, a later raid write rmw or recover
  can steal @stripe_pages from it instead of reading from disks, such that
  it carries the wrong content to do write rmw or recovery and ends up
  with corruption or recovery failures.

- when raid5 rebuild algorithm is used, i.e. xor, raid bio can be cached
  because the only failed stripe which contains @rbio->bio_pages gets
  modified, others remain the same so that their in-memory content is
  consistent with their on-disk content.

This adds a check to skip caching rbio if using raid6 recover.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: raid56: iterate raid56 internal bio with bio_for_each_segment_all

The bio iterated by set_bio_pages_uptodate() is a raid56 internal one, so
it will never be a BIO_CLONED bio. Since this is called by end_io
functions, bio->bi_iter.bi_size is zero, so we mustn't use
bio_for_each_segment() as that is a no-op if bi_size is zero.

Fixes: 6592e58c6b68e61f003a01ba29a3716e7e2e9484 ("Btrfs: fix write corruption due to bio cloning on raid5/6")
Cc: <stable@vger.kernel.org> # v4.12-rc6+
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: correct wrong comment about magic number of index_cnt

There is no function named btrfs_get_inode_index_count.
Explanation for magic number index_cnt=2 in btrfs_new_inode() is
actually located in btrfs_set_inode_index_count().

So replace 'btrfs_get_inode_index_count' in the comment by
'btrfs_set_inode_index_count'.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Make btrfs_inode_rsv_release static

It's not used outside of extent-tree so there is no reason to not be
static.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: cleanup btrfs_free_stale_device() usage

We call btrfs_free_stale_device() only when we alloc a new struct
btrfs_device (ret=1), so move it closer to where we alloc the new
device. Also drop the comments.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: tree-check: reduce stack consumption in check_dir_item

I've noticed that the updated item checker stack consumption increased
dramatically in 542f5385e20cf97447 ("btrfs: tree-checker: Add checker
for dir item"):

tree-checker.c:check_leaf +552 (176 -> 728)

The array is 255 bytes long, dynamic allocation would slow down the
sanity checks so it's more reasonable to keep it on-stack. Moving the
variable to the scope of use reduces the stack usage again:

tree-checker.c:check_leaf -264 (728 -> 464)

Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use correct string length in DEV_INFO ioctl

gcc-8 reports:

fs/btrfs/ioctl.c: In function 'btrfs_ioctl':
./include/linux/string.h:245:9: warning: '__builtin_strncpy' specified
bound 1024 equals destination size [-Wstringop-truncation]

We need one less byte or call strlcpy() to make it a nul-terminated
string. This is done on the next line anyway, but we want to avoid the
warning.
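
A small userspace sketch of the pattern, assuming a hypothetical struct
with a 1024-byte path field (struct dev_info and copy_path() are
illustrative, not the ioctl code): copying at most size - 1 bytes and then
writing the terminator keeps the destination nul-terminated.

```c
#include <string.h>

/* Hypothetical stand-in for a structure with a fixed-size path field. */
struct dev_info {
	char path[1024];
};

/* Copy src into di->path, truncating if needed, always nul-terminated. */
static void copy_path(struct dev_info *di, const char *src)
{
	/* Leave room for the terminator instead of filling the whole array. */
	strncpy(di->path, src, sizeof(di->path) - 1);
	di->path[sizeof(di->path) - 1] = '\0';
}
```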

Signed-off-by: Xiongfeng Wang <xiongfeng.wang@linaro.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fail mount when sb flag is not in BTRFS_SUPER_FLAG_SUPP

It appears from the original commit [1] that there isn't any design
specific reason not to fail the mount instead of just warning. This
patch will change it to fail.

[1]
commit 319e4d0661e5323c9f9945f0f8fb5905e5fe74c3
btrfs: Enhance super validation check
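
The check can be sketched in userspace as below. The flag names mirror the
kernel's but every bit position and the contents of the supported mask are
assumptions made for illustration, and check_super_flags() is a
hypothetical helper: any bit outside the supported mask makes the check
fail instead of only warning.

```c
#include <errno.h>
#include <stdint.h>

/* Bit positions below are assumptions for illustration only. */
#define HEADER_FLAG_WRITTEN		(1ULL << 0)
#define HEADER_FLAG_RELOC		(1ULL << 1)
#define SUPER_FLAG_CHANGING_FSID	(1ULL << 35)

/* Assumed set of flags that a mount is allowed to see. */
#define SUPER_FLAG_SUPP	(HEADER_FLAG_WRITTEN | HEADER_FLAG_RELOC)

/* Return 0 if only supported flags are set, -EINVAL otherwise. */
static int check_super_flags(uint64_t flags)
{
	if (flags & ~SUPER_FLAG_SUPP)
		return -EINVAL;
	return 0;
}
```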

Fixes: 319e4d0661e5323 ("btrfs: Enhance super validation check")
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: add support for SUPER_FLAG_CHANGING_FSID

The UUID change by btrfstune sets SUPER_FLAG_CHANGING_FSID and resets it
only when changing fsid is complete. It's not a good idea to mount the
device at any point in between, as reading metadata blocks would fail
with a UUID mismatch.

This patch doesn't add SUPER_FLAG_CHANGING_FSID into
BTRFS_SUPER_FLAG_SUPP list, so mount will fail (along with the fix in
the next patch) when SUPER_FLAG_CHANGING_FSID is set.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: define SUPER_FLAG_METADUMP_V2

btrfs-progs uses super flag bit BTRFS_SUPER_FLAG_METADUMP_V2 (1ULL << 34).
So just define that in the kernel so that we know it's been used.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: avoid losing data raid profile when deleting a device

We've avoided data losing its raid profile when doing balance, but it
turns out that deleting a device could also result in the same
problem.

Say we have 3 disks, and they're created with '-d raid1' profile.

- We have chunk P (the only data chunk on the empty btrfs).

- Suppose that chunk P's two raid1 copies reside in disk A and disk B.

- Now, 'btrfs device remove disk B'
   btrfs_rm_device()
   -> btrfs_shrink_device()
      -> btrfs_relocate_chunk() # relocate any chunk on disk B
                                # to other places.

- Chunk P will be removed and a new chunk will be created to hold
  those data, but as chunk P is the only one holding raid1 profile,
  after it goes away, the new chunk will be created as single profile
  which is our default profile.

This fixes the problem by creating an empty data chunk before
relocating the data chunk.

Metadata/System chunks are supposed to have non-zero bytes all the time
so their raid profile is preserved.

Reported-by: James Alandt <James.Alandt@wdc.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: fix space leak after fallocate and zero range operations

If we do a buffered write after a zero range operation that has an
unaligned (with the filesystem's sector size) end which also falls within
an unwritten (prealloc) extent that is currently beyond the inode's
i_size, and the zero range operation has the flag FALLOC_FL_KEEP_SIZE,
we end up leaking data and metadata space. This happens because when
zeroing a range we call btrfs_truncate_block(), which does delalloc
(loads the page and partially zeroes its content), and in the buffered
write path we only clear existing delalloc space reservation for the
range we are writing into if that range starts at an offset smaller than
the inode's i_size, which makes sense since we can not have delalloc
extents beyond the i_size, only unwritten extents are allowed.

Example reproducer:

$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ xfs_io -f -c "falloc -k 428K 4K" /mnt/foobar
$ xfs_io -c "fzero -k 0 430K" /mnt/foobar
$ xfs_io -c "pwrite -S 0xaa 428K 4K" /mnt/foobar
$ umount /mnt

After the unmount we get the metadata and data space leaks reported in
dmesg/syslog:

[95794.602253] ------------[ cut here ]------------
[95794.603322] WARNING: CPU: 0 PID: 31496 at fs/btrfs/inode.c:9561 btrfs_destroy_inode+0x4e/0x206 [btrfs]
[95794.605167] Modules linked in: btrfs xfs ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper parport_pc psmouse sg i2c_piix4 parport i2c_core evdev pcspkr button serio_raw sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod virtio_scsi ata_generic crc32c_intel ata_piix floppy virtio_pci virtio_ring virtio libata scsi_mod e1000 [last unloaded: btrfs]
[95794.613000] CPU: 0 PID: 31496 Comm: umount Tainted: G        W       4.14.0-rc6-btrfs-next-54+ #1
[95794.614448] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
[95794.615972] task: ffff880075aa0240 task.stack: ffffc90001734000
[95794.617114] RIP: 0010:btrfs_destroy_inode+0x4e/0x206 [btrfs]
[95794.618001] RSP: 0018:ffffc90001737d00 EFLAGS: 00010202
[95794.618721] RAX: 0000000000000000 RBX: ffff880070fa1418 RCX: ffffc90001737c7c
[95794.619645] RDX: 0000000175aa0240 RSI: 0000000000000001 RDI: ffff880070fa1418
[95794.620711] RBP: ffffc90001737d38 R08: 0000000000000000 R09: 0000000000000000
[95794.621932] R10: ffffc90001737c48 R11: ffff88007123e158 R12: ffff880075b6a000
[95794.623124] R13: ffff88006145c000 R14: ffff880070fa1418 R15: ffff880070c3b4a0
[95794.624188] FS:  00007fa6793c92c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
[95794.625578] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[95794.626522] CR2: 000056338670d048 CR3: 00000000610dc005 CR4: 00000000001606f0
[95794.627647] Call Trace:
[95794.628128]  destroy_inode+0x3d/0x55
[95794.628573]  evict+0x177/0x17e
[95794.629010]  dispose_list+0x50/0x71
[95794.629478]  evict_inodes+0x132/0x141
[95794.630289]  generic_shutdown_super+0x3f/0x10b
[95794.630864]  kill_anon_super+0x12/0x1c
[95794.631383]  btrfs_kill_super+0x16/0x21 [btrfs]
[95794.631930]  deactivate_locked_super+0x30/0x68
[95794.632539]  deactivate_super+0x36/0x39
[95794.633200]  cleanup_mnt+0x49/0x67
[95794.633818]  __cleanup_mnt+0x12/0x14
[95794.634416]  task_work_run+0x82/0xa6
[95794.634902]  prepare_exit_to_usermode+0xe1/0x10c
[95794.635525]  syscall_return_slowpath+0x18c/0x1af
[95794.636122]  entry_SYSCALL_64_fastpath+0xab/0xad
[95794.636834] RIP: 0033:0x7fa678cb99a7
[95794.637370] RSP: 002b:00007ffccf0aaed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[95794.638672] RAX: 0000000000000000 RBX: 0000563386706030 RCX: 00007fa678cb99a7
[95794.639596] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000056338670ca90
[95794.640703] RBP: 000056338670ca90 R08: 000056338670c740 R09: 0000000000000015
[95794.641773] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fa6791bae64
[95794.643150] R13: 0000000000000000 R14: 0000563386706210 R15: 00007ffccf0ab160
[95794.644249] Code: ff 4c 8b a8 80 06 00 00 48 8b 87 c0 01 00 00 48 85 c0 74 02 0f ff 48 83 bb e0 02 00 00 00 74 02 0f ff 83 bb 3c ff ff ff 00 74 02 <0f> ff 83 bb 40 ff ff ff 00 74 02 0f ff 48 83 bb f8 fe ff ff 00
[95794.646929] ---[ end trace e95877675c6ec007 ]---
[95794.647751] ------------[ cut here ]------------
[95794.648509] WARNING: CPU: 0 PID: 31496 at fs/btrfs/inode.c:9562 btrfs_destroy_inode+0x59/0x206 [btrfs]
[95794.649842] Modules linked in: btrfs xfs ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper parport_pc psmouse sg i2c_piix4 parport i2c_core evdev pcspkr button serio_raw sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod virtio_scsi ata_generic crc32c_intel ata_piix floppy virtio_pci virtio_ring virtio libata scsi_mod e1000 [last unloaded: btrfs]
[95794.654659] CPU: 0 PID: 31496 Comm: umount Tainted: G        W       4.14.0-rc6-btrfs-next-54+ #1
[95794.655894] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
[95794.657546] task: ffff880075aa0240 task.stack: ffffc90001734000
[95794.658433] RIP: 0010:btrfs_destroy_inode+0x59/0x206 [btrfs]
[95794.659279] RSP: 0018:ffffc90001737d00 EFLAGS: 00010202
[95794.660054] RAX: 0000000000000000 RBX: ffff880070fa1418 RCX: ffffc90001737c7c
[95794.660753] RDX: 0000000175aa0240 RSI: 0000000000000001 RDI: ffff880070fa1418
[95794.661513] RBP: ffffc90001737d38 R08: 0000000000000000 R09: 0000000000000000
[95794.662289] R10: ffffc90001737c48 R11: ffff88007123e158 R12: ffff880075b6a000
[95794.663393] R13: ffff88006145c000 R14: ffff880070fa1418 R15: ffff880070c3b4a0
[95794.664342] FS:  00007fa6793c92c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
[95794.665673] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[95794.666593] CR2: 000056338670d048 CR3: 00000000610dc005 CR4: 00000000001606f0
[95794.667629] Call Trace:
[95794.668065]  destroy_inode+0x3d/0x55
[95794.668637]  evict+0x177/0x17e
[95794.669179]  dispose_list+0x50/0x71
[95794.669830]  evict_inodes+0x132/0x141
[95794.670416]  generic_shutdown_super+0x3f/0x10b
[95794.671103]  kill_anon_super+0x12/0x1c
[95794.671786]  btrfs_kill_super+0x16/0x21 [btrfs]
[95794.672552]  deactivate_locked_super+0x30/0x68
[95794.673393]  deactivate_super+0x36/0x39
[95794.674107]  cleanup_mnt+0x49/0x67
[95794.674706]  __cleanup_mnt+0x12/0x14
[95794.675279]  task_work_run+0x82/0xa6
[95794.675795]  prepare_exit_to_usermode+0xe1/0x10c
[95794.676507]  syscall_return_slowpath+0x18c/0x1af
[95794.677275]  entry_SYSCALL_64_fastpath+0xab/0xad
[95794.678006] RIP: 0033:0x7fa678cb99a7
[95794.678600] RSP: 002b:00007ffccf0aaed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[95794.679739] RAX: 0000000000000000 RBX: 0000563386706030 RCX: 00007fa678cb99a7
[95794.680779] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000056338670ca90
[95794.681837] RBP: 000056338670ca90 R08: 000056338670c740 R09: 0000000000000015
[95794.682867] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fa6791bae64
[95794.683891] R13: 0000000000000000 R14: 0000563386706210 R15: 00007ffccf0ab160
[95794.684843] Code: c0 01 00 00 48 85 c0 74 02 0f ff 48 83 bb e0 02 00 00 00 74 02 0f ff 83 bb 3c ff ff ff 00 74 02 0f ff 83 bb 40 ff ff ff 00 74 02 <0f> ff 48 83 bb f8 fe ff ff 00 74 02 0f ff 48 83 bb 00 ff ff ff
[95794.687156] ---[ end trace e95877675c6ec008 ]---
[95794.687876] ------------[ cut here ]------------
[95794.688579] WARNING: CPU: 0 PID: 31496 at fs/btrfs/inode.c:9565 btrfs_destroy_inode+0x7d/0x206 [btrfs]
[95794.689735] Modules linked in: btrfs xfs ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper parport_pc psmouse sg i2c_piix4 parport i2c_core evdev pcspkr button serio_raw sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod virtio_scsi ata_generic crc32c_intel ata_piix floppy virtio_pci virtio_ring virtio libata scsi_mod e1000 [last unloaded: btrfs]
[95794.695015] CPU: 0 PID: 31496 Comm: umount Tainted: G        W       4.14.0-rc6-btrfs-next-54+ #1
[95794.696396] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
[95794.697956] task: ffff880075aa0240 task.stack: ffffc90001734000
[95794.698925] RIP: 0010:btrfs_destroy_inode+0x7d/0x206 [btrfs]
[95794.699763] RSP: 0018:ffffc90001737d00 EFLAGS: 00010206
[95794.700434] RAX: 0000000000000000 RBX: ffff880070fa1418 RCX: ffffc90001737c7c
[95794.701445] RDX: 0000000175aa0240 RSI: 0000000000000001 RDI: ffff880070fa1418
[95794.702448] RBP: ffffc90001737d38 R08: 0000000000000000 R09: 0000000000000000
[95794.703557] R10: ffffc90001737c48 R11: ffff88007123e158 R12: ffff880075b6a000
[95794.704441] R13: ffff88006145c000 R14: ffff880070fa1418 R15: ffff880070c3b4a0
[95794.705270] FS:  00007fa6793c92c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
[95794.706341] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[95794.707001] CR2: 000056338670d048 CR3: 00000000610dc005 CR4: 00000000001606f0
[95794.708030] Call Trace:
[95794.708466]  destroy_inode+0x3d/0x55
[95794.709071]  evict+0x177/0x17e
[95794.709497]  dispose_list+0x50/0x71
[95794.709973]  evict_inodes+0x132/0x141
[95794.710564]  generic_shutdown_super+0x3f/0x10b
[95794.711200]  kill_anon_super+0x12/0x1c
[95794.711633]  btrfs_kill_super+0x16/0x21 [btrfs]
[95794.712139]  deactivate_locked_super+0x30/0x68
[95794.712608]  deactivate_super+0x36/0x39
[95794.713093]  cleanup_mnt+0x49/0x67
[95794.713514]  __cleanup_mnt+0x12/0x14
[95794.713933]  task_work_run+0x82/0xa6
[95794.714543]  prepare_exit_to_usermode+0xe1/0x10c
[95794.715247]  syscall_return_slowpath+0x18c/0x1af
[95794.715952]  entry_SYSCALL_64_fastpath+0xab/0xad
[95794.716653] RIP: 0033:0x7fa678cb99a7
[95794.721100] RSP: 002b:00007ffccf0aaed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[95794.722052] RAX: 0000000000000000 RBX: 0000563386706030 RCX: 00007fa678cb99a7
[95794.722856] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000056338670ca90
[95794.723698] RBP: 000056338670ca90 R08: 000056338670c740 R09: 0000000000000015
[95794.724736] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fa6791bae64
[95794.725928] R13: 0000000000000000 R14: 0000563386706210 R15: 00007ffccf0ab160
[95794.726728] Code: 40 ff ff ff 00 74 02 0f ff 48 83 bb f8 fe ff ff 00 74 02 0f ff 48 83 bb 00 ff ff ff 00 74 02 0f ff 48 83 bb 30 ff ff ff 00 74 02 <0f> ff 48 83 bb 08 ff ff ff 00 74 02 0f ff 4d 85 e4 0f 84 52 01
[95794.729203] ---[ end trace e95877675c6ec009 ]---
[95794.841054] ------------[ cut here ]------------
[95794.841829] WARNING: CPU: 0 PID: 31496 at fs/btrfs/extent-tree.c:5831 btrfs_free_block_groups+0x235/0x36a [btrfs]
[95794.843425] Modules linked in: btrfs xfs ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper parport_pc psmouse sg i2c_piix4 parport i2c_core evdev pcspkr button serio_raw sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod virtio_scsi ata_generic crc32c_intel ata_piix floppy virtio_pci virtio_ring virtio libata scsi_mod e1000 [last unloaded: btrfs]
[95794.850658] CPU: 0 PID: 31496 Comm: umount Tainted: G        W       4.14.0-rc6-btrfs-next-54+ #1
[95794.852590] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
[95794.854752] task: ffff880075aa0240 task.stack: ffffc90001734000
[95794.855812] RIP: 0010:btrfs_free_block_groups+0x235/0x36a [btrfs]
[95794.856811] RSP: 0018:ffffc90001737d70 EFLAGS: 00010206
[95794.857805] RAX: 0000000080000000 RBX: ffff88006145c000 RCX: 0000000000000001
[95794.859014] RDX: 00000001810af668 RSI: 0000000000000002 RDI: 00000000ffffffff
[95794.860270] RBP: ffffc90001737d98 R08: 0000000000000000 R09: ffffffff817e22b9
[95794.861525] R10: ffffc90001737c80 R11: 00000000000337fd R12: 0000000000000000
[95794.862700] R13: ffff88006145c0c0 R14: ffff88021b61a800 R15: ffff88006145c100
[95794.863810] FS:  00007fa6793c92c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
[95794.865149] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[95794.866099] CR2: 000056338670d048 CR3: 00000000610dc005 CR4: 00000000001606f0
[95794.867198] Call Trace:
[95794.867626]  close_ctree+0x1db/0x2b8 [btrfs]
[95794.868188]  ? evict_inodes+0x132/0x141
[95794.869037]  btrfs_put_super+0x15/0x17 [btrfs]
[95794.870400]  generic_shutdown_super+0x6a/0x10b
[95794.871262]  kill_anon_super+0x12/0x1c
[95794.872046]  btrfs_kill_super+0x16/0x21 [btrfs]
[95794.872746]  deactivate_locked_super+0x30/0x68
[95794.873687]  deactivate_super+0x36/0x39
[95794.874639]  cleanup_mnt+0x49/0x67
[95794.875504]  __cleanup_mnt+0x12/0x14
[95794.876126]  task_work_run+0x82/0xa6
[95794.876788]  prepare_exit_to_usermode+0xe1/0x10c
[95794.877777]  syscall_return_slowpath+0x18c/0x1af
[95794.878381]  entry_SYSCALL_64_fastpath+0xab/0xad
[95794.878888] RIP: 0033:0x7fa678cb99a7
[95794.879307] RSP: 002b:00007ffccf0aaed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[95794.880204] RAX: 0000000000000000 RBX: 0000563386706030 RCX: 00007fa678cb99a7
[95794.881640] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000056338670ca90
[95794.882690] RBP: 000056338670ca90 R08: 000056338670c740 R09: 0000000000000015
[95794.883538] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fa6791bae64
[95794.884562] R13: 0000000000000000 R14: 0000563386706210 R15: 00007ffccf0ab160
[95794.885664] Code: 89 ef e8 07 ec 32 e1 e8 9d c0 ea e0 48 8d b3 28 02 00 00 48 83 c9 ff 31 d2 48 89 df e8 29 c5 ff ff 48 83 bb 80 02 00 00 00 74 02 <0f> ff 48 83 bb 88 02 00 00 00 74 02 0f ff 48 83 bb d8 02 00 00
[95794.887980] ---[ end trace e95877675c6ec00a ]---
[95794.888739] ------------[ cut here ]------------
[95794.889405] WARNING: CPU: 0 PID: 31496 at fs/btrfs/extent-tree.c:5832 btrfs_free_block_groups+0x241/0x36a [btrfs]
[95794.891020] Modules linked in: btrfs xfs ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper parport_pc psmouse sg i2c_piix4 parport i2c_core evdev pcspkr button serio_raw sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod virtio_scsi ata_generic crc32c_intel ata_piix floppy virtio_pci virtio_ring virtio libata scsi_mod e1000 [last unloaded: btrfs]
[95794.897551] CPU: 0 PID: 31496 Comm: umount Tainted: G        W       4.14.0-rc6-btrfs-next-54+ #1
[95794.898509] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
[95794.899685] task: ffff880075aa0240 task.stack: ffffc90001734000
[95794.900592] RIP: 0010:btrfs_free_block_groups+0x241/0x36a [btrfs]
[95794.901387] RSP: 0018:ffffc90001737d70 EFLAGS: 00010206
[95794.902300] RAX: 0000000080000000 RBX: ffff88006145c000 RCX: 0000000000000001
[95794.903260] RDX: 00000001810af668 RSI: 0000000000000002 RDI: 00000000ffffffff
[95794.904332] RBP: ffffc90001737d98 R08: 0000000000000000 R09: ffffffff817e22b9
[95794.905300] R10: ffffc90001737c80 R11: 00000000000337fd R12: 0000000000000000
[95794.906439] R13: ffff88006145c0c0 R14: ffff88021b61a800 R15: ffff88006145c100
[95794.907459] FS:  00007fa6793c92c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
[95794.908625] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[95794.909511] CR2: 000056338670d048 CR3: 00000000610dc005 CR4: 00000000001606f0
[95794.910630] Call Trace:
[95794.911153]  close_ctree+0x1db/0x2b8 [btrfs]
[95794.911837]  ? evict_inodes+0x132/0x141
[95794.912344]  btrfs_put_super+0x15/0x17 [btrfs]
[95794.912975]  generic_shutdown_super+0x6a/0x10b
[95794.913788]  kill_anon_super+0x12/0x1c
[95794.914424]  btrfs_kill_super+0x16/0x21 [btrfs]
[95794.915142]  deactivate_locked_super+0x30/0x68
[95794.915831]  deactivate_super+0x36/0x39
[95794.916433]  cleanup_mnt+0x49/0x67
[95794.917045]  __cleanup_mnt+0x12/0x14
[95794.917665]  task_work_run+0x82/0xa6
[95794.918309]  prepare_exit_to_usermode+0xe1/0x10c
[95794.919021]  syscall_return_slowpath+0x18c/0x1af
[95794.919722]  entry_SYSCALL_64_fastpath+0xab/0xad
[95794.920426] RIP: 0033:0x7fa678cb99a7
[95794.921039] RSP: 002b:00007ffccf0aaed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[95794.922303] RAX: 0000000000000000 RBX: 0000563386706030 RCX: 00007fa678cb99a7
[95794.923335] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000056338670ca90
[95794.924364] RBP: 000056338670ca90 R08: 000056338670c740 R09: 0000000000000015
[95794.925435] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fa6791bae64
[95794.926533] R13: 0000000000000000 R14: 0000563386706210 R15: 00007ffccf0ab160
[95794.927557] Code: 48 8d b3 28 02 00 00 48 83 c9 ff 31 d2 48 89 df e8 29 c5 ff ff 48 83 bb 80 02 00 00 00 74 02 0f ff 48 83 bb 88 02 00 00 00 74 02 <0f> ff 48 83 bb d8 02 00 00 00 74 02 0f ff 48 83 bb e0 02 00 00
[95794.930166] ---[ end trace e95877675c6ec00b ]---
[95794.930961] ------------[ cut here ]------------
[95794.931727] WARNING: CPU: 0 PID: 31496 at fs/btrfs/extent-tree.c:9953 btrfs_free_block_groups+0x2bc/0x36a [btrfs]
[95794.932729] Modules linked in: btrfs xfs ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper parport_pc psmouse sg i2c_piix4 parport i2c_core evdev pcspkr button serio_raw sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod virtio_scsi ata_generic crc32c_intel ata_piix floppy virtio_pci virtio_ring virtio libata scsi_mod e1000 [last unloaded: btrfs]
[95794.938394] CPU: 0 PID: 31496 Comm: umount Tainted: G        W       4.14.0-rc6-btrfs-next-54+ #1
[95794.939842] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
[95794.941455] task: ffff880075aa0240 task.stack: ffffc90001734000
[95794.942336] RIP: 0010:btrfs_free_block_groups+0x2bc/0x36a [btrfs]
[95794.943268] RSP: 0018:ffffc90001737d70 EFLAGS: 00010206
[95794.944127] RAX: ffff8802004fd0e8 RBX: ffff88006145c000 RCX: 0000000000000001
[95794.945211] RDX: 00000001810af668 RSI: 0000000000000002 RDI: 00000000ffffffff
[95794.946316] RBP: ffffc90001737d98 R08: 0000000000000000 R09: ffffffff817e22b9
[95794.947271] R10: ffffc90001737c80 R11: 00000000000337fd R12: ffff8802004fd0e8
[95794.948219] R13: ffff88006145c0c0 R14: ffff88006145e598 R15: ffff88006145c100
[95794.949193] FS:  00007fa6793c92c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
[95794.950495] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[95794.951338] CR2: 000056338670d048 CR3: 00000000610dc005 CR4: 00000000001606f0
[95794.952361] Call Trace:
[95794.952811]  close_ctree+0x1db/0x2b8 [btrfs]
[95794.953522]  ? evict_inodes+0x132/0x141
[95794.954543]  btrfs_put_super+0x15/0x17 [btrfs]
[95794.955231]  generic_shutdown_super+0x6a/0x10b
[95794.955916]  kill_anon_super+0x12/0x1c
[95794.956414]  btrfs_kill_super+0x16/0x21 [btrfs]
[95794.956953]  deactivate_locked_super+0x30/0x68
[95794.957635]  deactivate_super+0x36/0x39
[95794.958256]  cleanup_mnt+0x49/0x67
[95794.958701]  __cleanup_mnt+0x12/0x14
[95794.959181]  task_work_run+0x82/0xa6
[95794.959635]  prepare_exit_to_usermode+0xe1/0x10c
[95794.960182]  syscall_return_slowpath+0x18c/0x1af
[95794.960731]  entry_SYSCALL_64_fastpath+0xab/0xad
[95794.961438] RIP: 0033:0x7fa678cb99a7
[95794.961990] RSP: 002b:00007ffccf0aaed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[95794.963111] RAX: 0000000000000000 RBX: 0000563386706030 RCX: 00007fa678cb99a7
[95794.963975] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000056338670ca90
[95794.964680] RBP: 000056338670ca90 R08: 000056338670c740 R09: 0000000000000015
[95794.965763] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fa6791bae64
[95794.966868] R13: 0000000000000000 R14: 0000563386706210 R15: 00007ffccf0ab160
[95794.967800] Code: 00 00 00 4c 8b a3 98 25 00 00 49 83 bc 24 60 ff ff ff 00 75 16 49 83 bc 24 68 ff ff ff 00 75 0b 49 83 bc 24 70 ff ff ff 00 74 16 <0f> ff 49 8d b4 24 18 ff ff ff 31 c9 31 d2 48 89 df e8 93 7a ff
[95794.970629] ---[ end trace e95877675c6ec00c ]---
[95794.971451] BTRFS info (device sdi): space_info 1 has 7680000 free, is not full
[95794.972351] BTRFS info (device sdi): space_info total=8388608, used=704512, pinned=0, reserved=0, may_use=4096, readonly=0
[95794.973595] ------------[ cut here ]------------
[95794.974353] WARNING: CPU: 0 PID: 31496 at fs/btrfs/extent-tree.c:9953 btrfs_free_block_groups+0x2bc/0x36a [btrfs]
[95794.980163] Modules linked in: btrfs xfs ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper parport_pc psmouse sg i2c_piix4 parport i2c_core evdev pcspkr button serio_raw sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod virtio_scsi ata_generic crc32c_intel ata_piix floppy virtio_pci virtio_ring virtio libata scsi_mod e1000 [last unloaded: btrfs]
[95794.986461] CPU: 0 PID: 31496 Comm: umount Tainted: G        W       4.14.0-rc6-btrfs-next-54+ #1
[95794.987591] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
[95794.988929] task: ffff880075aa0240 task.stack: ffffc90001734000
[95794.989922] RIP: 0010:btrfs_free_block_groups+0x2bc/0x36a [btrfs]
[95794.990715] RSP: 0018:ffffc90001737d70 EFLAGS: 00010206
[95794.991431] RAX: ffff88020f6e70e8 RBX: ffff88006145c000 RCX: ffffffff8115a906
[95794.992455] RDX: ffffffff8115a902 RSI: ffff880075aa0b40 RDI: ffff880075aa0b40
[95794.993535] RBP: ffffc90001737d98 R08: 0000000000000020 R09: fffffffffffffff7
[95794.994573] R10: 00000000ffffffc4 R11: ffff8800633b1bc0 R12: ffff88020f6e70e8
[95794.996250] R13: 0000000000000038 R14: ffff88006145e598 R15: 0000000000000000
[95794.997233] FS:  00007fa6793c92c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
[95794.998592] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[95794.999484] CR2: 000056338670d048 CR3: 00000000610dc005 CR4: 00000000001606f0
[95795.000542] Call Trace:
[95795.001138]  close_ctree+0x1db/0x2b8 [btrfs]
[95795.001885]  ? evict_inodes+0x132/0x141
[95795.002407]  btrfs_put_super+0x15/0x17 [btrfs]
[95795.003093]  generic_shutdown_super+0x6a/0x10b
[95795.003720]  kill_anon_super+0x12/0x1c
[95795.004353]  btrfs_kill_super+0x16/0x21 [btrfs]
[95795.005095]  deactivate_locked_super+0x30/0x68
[95795.005716]  deactivate_super+0x36/0x39
[95795.006388]  cleanup_mnt+0x49/0x67
[95795.006939]  __cleanup_mnt+0x12/0x14
[95795.007512]  task_work_run+0x82/0xa6
[95795.008124]  prepare_exit_to_usermode+0xe1/0x10c
[95795.008994]  syscall_return_slowpath+0x18c/0x1af
[95795.009831]  entry_SYSCALL_64_fastpath+0xab/0xad
[95795.010610] RIP: 0033:0x7fa678cb99a7
[95795.011193] RSP: 002b:00007ffccf0aaed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[95795.012327] RAX: 0000000000000000 RBX: 0000563386706030 RCX: 00007fa678cb99a7
[95795.013432] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000056338670ca90
[95795.014558] RBP: 000056338670ca90 R08: 000056338670c740 R09: 0000000000000015
[95795.015577] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fa6791bae64
[95795.016569] R13: 0000000000000000 R14: 0000563386706210 R15: 00007ffccf0ab160
[95795.017662] Code: 00 00 00 4c 8b a3 98 25 00 00 49 83 bc 24 60 ff ff ff 00 75 16 49 83 bc 24 68 ff ff ff 00 75 0b 49 83 bc 24 70 ff ff ff 00 74 16 <0f> ff 49 8d b4 24 18 ff ff ff 31 c9 31 d2 48 89 df e8 93 7a ff
[95795.020538] ---[ end trace e95877675c6ec00d ]---
[95795.021259] BTRFS info (device sdi): space_info 4 has 1072775168 free, is not full
[95795.022390] BTRFS info (device sdi): space_info total=1073741824, used=114688, pinned=0, reserved=0, may_use=786432, readonly=65536

Fix this by ensuring the zero range operation does not call
btrfs_truncate_block() if the corresponding extent is an unwritten one
(it's pointless anyway, since reading from an unwritten extent yields
zeroes).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: fix missing inode i_size update after zero range operation

For a fallocate's zero range operation that targets a range with an end
that is not aligned to the sector size, we can end up not updating the
inode's i_size. This happens when the last page of the range maps to an
unwritten (prealloc) extent and before that last page we have either a
hole or a written extent. This is because in this scenario we relied
on a call to btrfs_prealloc_file_range() to update the inode's i_size,
however it can only update the i_size to the "down aligned" end of the
range.

Example:

$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ xfs_io -f -c "pwrite -S 0xff 0 428K" /mnt/foobar
$ xfs_io -c "falloc -k 428K 4K" /mnt/foobar
$ xfs_io -c "fzero 0 430K" /mnt/foobar
$ du --bytes /mnt/foobar
438272 /mnt/foobar

The inode's i_size was left as 428Kb (438272 bytes) when it should have
been updated to 430Kb (440320 bytes).

Fix this by always updating the inode's i_size explicitly after zeroing
the range.
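
The arithmetic behind the example can be checked with a small sketch.
It is only an illustration: the helper names are hypothetical, a 4096
byte sector size is assumed, and the zero range call is assumed to be
made without FALLOC_FL_KEEP_SIZE, as in the example above.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_SECTORSIZE 4096ULL	/* assumed sector size */

/* Round x down to a multiple of the sector size. */
static uint64_t toy_round_down(uint64_t x)
{
	return x - (x % TOY_SECTORSIZE);
}

/* Old behaviour: i_size only advances to the down-aligned end of the range. */
static uint64_t toy_isize_before_fix(uint64_t i_size, uint64_t off, uint64_t len)
{
	uint64_t aligned_end = toy_round_down(off + len);

	return aligned_end > i_size ? aligned_end : i_size;
}

/* Fixed behaviour: i_size is explicitly advanced to the exact end. */
static uint64_t toy_isize_after_fix(uint64_t i_size, uint64_t off, uint64_t len)
{
	uint64_t end = off + len;

	return end > i_size ? end : i_size;
}

/* The numbers from the example: 428K written, then fzero of [0, 430K). */
static void toy_isize_example(void)
{
	assert(toy_isize_before_fix(428 * 1024, 0, 430 * 1024) == 438272);
	assert(toy_isize_after_fix(428 * 1024, 0, 430 * 1024) == 440320);
}
```

Since 430K is 2048 bytes past a sector boundary, rounding down lands
back on 428K, which is why the old i_size was never moved.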

Fixes: ba6d5887946ff86d93dc ("Btrfs: add support for fallocate's zero range operation")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: use cached state when dirtying pages during buffered write

During a buffered IO write, we can have an extent state that we got when
we locked the range (if the range starts at an offset lower than eof), so
always pass it to btrfs_dirty_pages() so that setting the delalloc bit
in the range does not need to do a full search in the inode's io tree,
saving time and reducing the amount of time we hold the io tree's lock.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: add support for fallocate's zero range operation

This implements support for the zero range operation of fallocate. For now
at least it's as simple as possible while reusing most of the existing
fallocate and hole punching infrastructure.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: do not merge rbios if their fail stripe index are not identical

Since the fail stripe index in an rbio is used to decide which
reconstruction algorithm will be run, we cannot merge rbios if
their fail stripe indexes are different; otherwise, one of the two
reconstructions would fail.
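
A minimal sketch of such a check, using a toy structure rather than the
kernel's rbio. The names struct toy_rbio, toy_sort_fails() and
toy_rbio_can_merge() are hypothetical, and comparing the failure
indexes for every operation type is a simplification made for the
example.

```c
#include <assert.h>

/* Toy stand-in for a raid bio; only the fields the check needs. */
struct toy_rbio {
	int operation;	/* kind of request */
	int faila;	/* first failed stripe index, -1 if none */
	int failb;	/* second failed stripe index, -1 if none */
};

/* Order a pair of failure indexes so (2, 0) and (0, 2) compare equal. */
static void toy_sort_fails(int *a, int *b)
{
	if (*a > *b) {
		int tmp = *a;

		*a = *b;
		*b = tmp;
	}
}

/* Returns 1 if the two rbios may be merged, 0 otherwise. */
static int toy_rbio_can_merge(const struct toy_rbio *last,
			      const struct toy_rbio *cur)
{
	int la, lb, ca, cb;

	assert(last && cur);
	if (last->operation != cur->operation)
		return 0;

	la = last->faila;
	lb = last->failb;
	ca = cur->faila;
	cb = cur->failb;
	toy_sort_fails(&la, &lb);
	toy_sort_fails(&ca, &cb);

	/* Different failed stripes need different reconstructions. */
	return la == ca && lb == cb;
}
```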

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: remove redundant check in rbio_can_merge

Given the check above,

	if (last->operation != cur->operation)
		return 0;

it's guaranteed that the two operations are the same.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: minor style cleanups in btrfs_scan_one_device

Assign ret = -EINVAL where it is actually required.
Remove { } around single line if else code.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: simplify mutex unlocking code in btrfs_commit_transaction

No functional change; rearrange the mutex_unlock calls.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ edit subject ]
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: rename btrfs_device::scrub_device to scrub_ctx

btrfs_device::scrub_device is not a device which is being scrubbed,
but it holds the scrub context, so rename to reflect the same. No
functional changes here.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: collapse btrfs_handle_error() into __btrfs_handle_fs_error()

There is no consumer of btrfs_handle_error() other than
__btrfs_handle_fs_error(); furthermore, this function is quite small.
Merge it into its parent.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ reformat comment ]
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove check for BTRFS_FS_STATE_ERROR which we just set

__btrfs_handle_fs_error() sets BTRFS_FS_STATE_ERROR, and calls
btrfs_handle_error() so no need to check if the BTRFS_FS_STATE_ERROR
is set in btrfs_handle_error(). And there is no other user of
btrfs_handle_error() as well.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: make raid6 rebuild retry more

There is a scenario that can end up with the rebuild process failing to
return good content. Suppose that all disks can be read without
problems but the content that was read out doesn't match its checksum.
Currently, for raid6, btrfs retries at most twice:

- the 1st retry is to rebuild with all other stripes, it'll eventually
  be a raid5 xor rebuild,
- if the 1st fails, the 2nd retry will deliberately fail parity p so
  that it will do raid6 style rebuild,

however, the chances are that another non-parity stripe content also
has something corrupted, so that the above retries are not able to
return correct content, and users will think of this as data loss.
More seriously, if the loss happens on some important internal btree
roots, the filesystem could refuse to mount.

This extends btrfs to do more retries and each retry fails only one
stripe.  Since raid6 can tolerate 2 disk failures, if there is one
more failure besides the failure on which we're recovering, this can
always work.

The worst case is to retry as many times as the number of raid6 disks,
but given the fact that such a scenario is really rare in practice,
it's still acceptable.
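
The retry order can be pictured with a toy helper. This is a sketch
under assumptions, not the kernel implementation:
toy_retry_fail_stripe() is a hypothetical name, and it assumes that
each extra retry marks exactly one stripe other than the one being
recovered as failed, visiting the remaining stripes in index order.

```c
#include <assert.h>

/*
 * target:      index of the stripe we are trying to recover
 * attempt:     0-based number of the extra retry
 * num_stripes: total number of stripes (data plus P and Q)
 *
 * Returns the index of the additional stripe to treat as failed on this
 * retry, or -1 once every other stripe has been tried.
 */
static int toy_retry_fail_stripe(int target, int attempt, int num_stripes)
{
	assert(num_stripes >= 3);
	assert(target >= 0 && target < num_stripes);

	if (attempt < 0 || attempt >= num_stripes - 1)
		return -1;

	/* Candidates are 0..num_stripes-1 with the target skipped. */
	return attempt < target ? attempt : attempt + 1;
}
```

With 5 stripes and stripe 2 being recovered, the retries fail stripes
0, 1, 3 and 4 in turn, so the number of retries stays bounded by the
number of disks.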

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: fix scrub to repair raid6 corruption

The raid6 corruption is as follows. Suppose that all disks can be read
without problems but the content that was read out doesn't match its
checksum. Currently, for raid6, btrfs retries at most twice:

- the 1st retry is to rebuild with all other stripes, it'll eventually
  be a raid5 xor rebuild,
- if the 1st fails, the 2nd retry will deliberately fail parity p so
  that it will do raid6 style rebuild,

however, the chances are that another non-parity stripe content also
has something corrupted, so that the above retries are not able to
return correct content.

We've fixed normal reads to rebuild raid6 correctly with more retries
in patch "Btrfs: make raid6 rebuild retry more" [1]; this fixes
scrub to do exactly the same rebuild process.

[1]: https://patchwork.kernel.org/patch/10091755/

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: factor btrfs_check_rw_degradable() to check given device

Update btrfs_check_rw_degradable() to check against the given device as
if it were lost.

We can use this function to know if the volume is going to be in
degraded mode OR failed state when the given device fails, which is
needed when we are handling the device failed state.

This is a preparatory patch and does not affect the flow as such.
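
The idea can be sketched with a toy model. This is not the kernel
function: toy_check_rw_degradable() and its parameters are hypothetical,
and the model collapses the check into a single count of missing
devices compared against the number of failures the profile tolerates.

```c
#include <assert.h>

/*
 * dev_missing[i] is nonzero if device i is already missing.
 * failing_dev is the index of a device to treat as lost, or -1 for none.
 * max_tolerated is the number of device failures the profile survives
 * (for example 1 for raid1, 2 for raid6).
 *
 * Returns 1 if a read-write degraded mount would still be possible,
 * 0 otherwise.
 */
static int toy_check_rw_degradable(const int *dev_missing, int num_devs,
				   int failing_dev, int max_tolerated)
{
	int missing = 0;

	assert(failing_dev >= -1 && failing_dev < num_devs);
	for (int i = 0; i < num_devs; i++) {
		/* Count devices already gone plus the one assumed to fail. */
		if (dev_missing[i] || i == failing_dev)
			missing++;
	}
	return missing <= max_tolerated;
}
```

Passing -1 as failing_dev gives the plain "current state" answer, while
passing a device index answers what would happen if that device failed.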

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ enhance comment ]
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: sink unlock_extent parameter gfp_flags

All callers pass either GFP_NOFS or GFP_KERNEL now, so we can sink the
parameter to the function, though we lose some of the slightly better
semantics of GFP_KERNEL in some places, it's worth cleaning up the
callchains.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: add separate helper for unlock_extent_cached with GFP_ATOMIC

There's only one instance where we pass a different gfp mask to
unlock_extent_cached. Add a separate helper for that and then we can
drop the gfp parameter from unlock_extent_cached.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: drop unused parameters from mount_subvol

Recent patches reworking the mount path left some unused parameters. We
pass a vfsmount to mount_subvol, the flags and data (ie. mount options)
have been already applied and we will not need them.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: cleanup unnecessary string dup in btrfs_parse_options()

Long ago, commit edf24abe51493 ("btrfs: sanity mount option parsing and
early mount code") split btrfs_parse_options() into two parts
(btrfs_parse_early_options() and btrfs_parse_options()). As a result,
btrfs_parse_options() no longer gets called twice and is the last one to
parse the mount option string. Therefore there is no need to dup it.

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: remove unused wait in btrfs_stripe_hash

In fact nobody is waiting on @wait's waitqueue, it can be safely
removed.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Remove redundant pair of bio_get/set in __btrfs_submit_dio_bio

The bio is not referenced after it has been submitted and the endio is
going to consume the sole reference on successful submission. On error,
the callers of __btrfs_submit_dio_bio do invoke bio_put so we don't
leak it either.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Remove redundant bio_get/bio_set pair from submit_one_bio

The bio is never referenced after it has been submitted so there is no
point in getting an extra reference.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Remove redundant bio_get/set from submit_dio_repair_bio

The bio that is passed is the newly created repair bio which already
has a reference count of 1, which is going to be consumed by the
endio routine on successful submission. On error the handler also
calls bio_put.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Remove redundant bio_get/set calls in compressed read/write paths

bio_get/set is necessary only if the bio is going to be referenced
following submissions. In the code paths where such calls are made
we don't really need them since the bio is referenced only if
btrfs_map_bio returns an error. And this function can return an error
prior to submission only. So referencing the bio is safe. Furthermore
we do call bio_endio which will consume the last reference. So let's
remove the redundant calls.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Improve btrfs_search_slot description

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: heuristic: call get4bits directly

As it's a single instance and local to the file, we don't need to pass
it as an argument.

Reviewed-by: Timofey Titovets <nefelim4ag@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: heuristic: open code copy_call callback of radix sort

The callback is trivial and we don't need the abstraction for our
purposes. Let's open code it.

Reviewed-by: Timofey Titovets <nefelim4ag@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: heuristic: open code get_num callback of radix sort

The callback is trivial and we don't need the abstraction for our
purposes. Let's open code it and also make the array types explicit.

Reviewed-by: Timofey Titovets <nefelim4ag@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove unused arg from parse_subvol_options()

Remove the unused arg 'holder' from parse_subvol_options(), which was
left over by commit b99beb110e2d ("btrfs: split
parse_early_options() in two").

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove unused setup_root_args()

Since setup_root_args() is not used anymore, just remove it.

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: split parse_early_options() in two

Now parse_early_options() is used by both btrfs_mount() and
btrfs_mount_root(). However, the former only needs subvol related part
and the latter needs the others.

Therefore extract the subvol related parts from parse_early_options() and
move them to a new parse function (parse_subvol_options()).

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: cleanup btrfs_mount() using btrfs_mount_root()

Clean up btrfs_mount() by using btrfs_mount_root(). This avoids getting
btrfs_mount() called twice in the mount path.

Old btrfs_mount() will do:
0. VFS layer calls vfs_kern_mount() with the registered file_system_type
   (for btrfs, btrfs_fs_type). btrfs_mount() is called on the way.
1. btrfs_parse_early_options() parses the "subvolid=" mount option and sets
   the value to subvol_objectid. Otherwise, subvol_objectid has the initial
   value of 0
2. check whether subvol_objectid is 5 or not. Assume this time id is not 5,
   then btrfs_mount() returns by calling mount_subvol()
3. In mount_subvol(), the original mount options are modified to contain
   "subvolid=0" in setup_root_args(). Then, vfs_kern_mount() is called with
   btrfs_fs_type and the new options
4. btrfs_mount() is called again
5. btrfs_parse_early_options() parses "subvolid=0" and sets 5 (instead of 0)
   to subvol_objectid
6. check whether subvol_objectid is 5 or not. This time id is 5 and
   mount_subvol() is not called. btrfs_mount() finishes mounting a root
7. (in mount_subvol()) using the return value of vfs_kern_mount(), it
   calls mount_subtree()
8. return subvolume's dentry

Reusing the same file_system_type (and btrfs_mount()) for vfs_kern_mount()
is the cause of complication.

Instead, new btrfs_mount() will do:
1. parse subvol id related options for later use in mount_subvol()
2. mount device's root by calling vfs_kern_mount() with
   btrfs_root_fs_type, which is not registered to VFS by
   register_filesystem(). As a result, btrfs_mount_root() is called
3. return by calling mount_subvol()

The code of 2. is moved from the first part of mount_subvol().

The semantics of device holder changes from btrfs_fs_type to
btrfs_root_fs_type and has to be used in all contexts. Otherwise we'd
get wrong results when mount and dev scan would not check the same
thing. (This was found independently and the fix is folded into this
patch.)

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ fold the btrfs_control_ioctl fixup, extend the comment ]
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: add btrfs_mount_root() and new file_system_type

Add btrfs_mount_root() and a new file_system_type in preparation for the
cleanup of btrfs_mount(). The code path is not changed yet.

btrfs_mount_root() is almost the same as the current btrfs_mount(), but
doesn't have the subvolume related part.

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: unify extent_page_data type passed as void

Functions called from extent_write_cache_pages used void* as generic
callback data, but all of them convert it to extent_page_data, or use it
directly.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: sink writepage parameter to extent_write_cache_pages

The function extent_write_cache_pages is modelled after
write_cache_pages which is a generic interface and the writepage
parameter makes sense there. In btrfs we know exactly which callback
we're going to use, so we can pass it directly.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: sink flush_fn to extent_write_cache_pages

All callers pass the same value flush_write_bio.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: merge two flush_write_bio helpers

flush_epd_write_bio is same as flush_write_bio, no point having two such
functions. Merge them to flush_write_bio. The 'noinline' attribute is
removed as it does not have any meaning.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Rename bin_search -> btrfs_bin_search

Currently there are 2 functions doing binary search on btrfs nodes:
bin_search and btrfs_bin_search, the latter being a simple wrapper for
the former. So eliminate the wrapper and just rename bin_search to
btrfs_bin_search. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: sink extent_write_full_page tree argument

The tree argument passed to extent_write_full_page is referenced from
the page being passed to the same function. Since we already have
enough information to get the reference, remove the function parameter.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: sink extent_write_locked_range tree parameter

This function is called only from submit_compressed_extents and the
io tree being passed is always that of the inode. But we are also
passing the inode, so just move getting the io tree pointer in
extent_write_locked_range to simplify the signature.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Remove pair of bio_get/put in btrfs_schedule_bio

This code was added in 492bb6deee34 ("Btrfs: Hold a reference on bios
during submit_bio, add some extra bio checks"). However, holding a
reference on a bio is necessary only if it's going to be referenced
after the submit_bio returns and the bio is completed. In this
particular instance this is not the case so there is no need to hold
an extra reference since we directly return.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Fix out of bounds access in btrfs_search_slot

When modifying a tree where the root is at BTRFS_MAX_LEVEL - 1 then
the level variable is going to be 7 (this is the max height of the
tree). On the other hand btrfs_cow_block is always called with
"level + 1" as an index into the nodes and slots arrays. This leads to
an out of bounds access. Admittedly this will be benign since an OOB
access of the nodes array will likely read the 0th element from the
slots array, which in this case is going to be 0 (since we start CoW at
the top of the tree). The OOB access into the slots array in turn will
read the 0th and 1st values of the locks array, which would both be 0
at the time. However, this benign behavior relies on the fact that the
path being passed hasn't been initialised; if it has already been used to
query a btree then it could potentially have populated the nodes/slots arrays.

Fix it by explicitly checking if we are at level 7 (the maximum allowed
index in nodes/slots arrays) and explicitly call the CoW routine with
NULL for parent's node/slot.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Fixes-coverity-id: 711515
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
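
The bounds check described in the commit above can be illustrated with a
small userspace sketch. This is not the kernel code: toy_path,
toy_parent_of and MAX_LEVEL are hypothetical names, and MAX_LEVEL is only
assumed to mirror BTRFS_MAX_LEVEL (8, so the highest valid index is 7).

```c
#include <stddef.h>

#define MAX_LEVEL 8	/* assumption: mirrors BTRFS_MAX_LEVEL */

/* Hypothetical stand-in for a search path: one node and slot per level. */
struct toy_path {
	const void *nodes[MAX_LEVEL];
	int slots[MAX_LEVEL];
};

/*
 * Look up the parent node/slot for `level`. At the top level
 * (MAX_LEVEL - 1) there is no parent entry, so report NULL/0 instead of
 * indexing nodes[level + 1], which would be out of bounds.
 */
static void toy_parent_of(const struct toy_path *p, int level,
			  const void **parent, int *parent_slot)
{
	if (level + 1 < MAX_LEVEL) {
		*parent = p->nodes[level + 1];
		*parent_slot = p->slots[level + 1];
	} else {
		*parent = NULL;
		*parent_slot = 0;
	}
}
```

A caller would then pass the returned parent/slot pair on to its CoW
routine, which has to accept a NULL parent for the root.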

btrfs: remove duplicate includes

These duplicate includes have been found with scripts/checkincludes.pl but
they have been removed manually to avoid removing false positives.

Signed-off-by: Pravin Shedge <pravin.shedge4linux@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Handle btrfs_set_extent_delalloc failure in fixup worker

This function was introduced by 247e743cbe6e ("Btrfs: Use async helpers
to deal with pages that have been improperly dirtied") and it didn't do
any error handling then. This function might very well fail in an ENOMEM
situation, yet the failure is not handled, which could lead to an
inconsistent state. So let's handle the failure by setting the mapping
error bit.

Cc: stable@vger.kernel.org
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: put btrfs_ioctl_vol_args_v2 related defines together

Just a code spatial rearrangement, no functional change.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: show options: use helper to convert compression type string

Use the helper. If the COMPRESS option is set, the result is always
defined and not empty.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: prop: use common helper for type to string conversion

Use the helper for conversion, keep the semantics.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: SETFLAGS ioctl: use helper for compression type conversion

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: compression: add helper for type to string conversion

There are several places open coding this conversion. Add a helper now
that we have 3 compression algorithms.

Signed-off-by: David Sterba <dsterba@suse.com>
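
A type-to-string helper of the kind the commit above describes might look
roughly like the sketch below. The enum, its values and the function name
are hypothetical and only illustrate the pattern; the three algorithm
names are the ones btrfs supported at the time (zlib, lzo, zstd).

```c
#include <stddef.h>

/* Hypothetical enum; the numeric values are illustrative only. */
enum toy_compress_type {
	TOY_COMPRESS_NONE = 0,
	TOY_COMPRESS_ZLIB,
	TOY_COMPRESS_LZO,
	TOY_COMPRESS_ZSTD,
};

/*
 * Map a compression type to its name. Returns NULL when there is no
 * name to report (no compression, or an unknown value).
 */
static const char *toy_compress_type2str(enum toy_compress_type type)
{
	switch (type) {
	case TOY_COMPRESS_ZLIB:
		return "zlib";
	case TOY_COMPRESS_LZO:
		return "lzo";
	case TOY_COMPRESS_ZSTD:
		return "zstd";
	case TOY_COMPRESS_NONE:
		break;
	}
	return NULL;
}
```

Keeping the mapping in one place means that callers such as option
printing or property handling only need to deal with the NULL case.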

btrfs: remove redundant check in btrfs_get_extent_fiemap

Before returning hole_em in btrfs_get_extent_fiemap we check if it's not
NULL. However, by the time this NULL check is reached we already know
hole_em is not NULL because it points to the em we found and it
has already been dereferenced.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Remove unused variable in btrfs_get_extent

trans was statically assigned to NULL and this never changed over the
course of btrfs_get_extent. So remove any code which checks whether
trans != NULL and just hardcode the fact trans is always NULL.

Resolves-coverity-id: 112806
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: tree-checker: use %zu format string for size_t

The return value of sizeof() is of type size_t, so we must print it
using the %z format modifier rather than %l to avoid this warning
on some architectures:

fs/btrfs/tree-checker.c: In function 'check_dir_item':
fs/btrfs/tree-checker.c:273:50: error: format '%lu' expects argument of type 'long unsigned int', but argument 5 has type 'u32' {aka 'unsigned int'} [-Werror=format=]

Fixes: 005887f2e3e0 ("btrfs: tree-checker: Add checker for dir item")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
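
As a minimal userspace illustration of the format fix above: sizeof()
yields a size_t, and %zu is the matching conversion regardless of whether
size_t is 32 or 64 bits wide. The struct and function names here are
hypothetical and not taken from tree-checker.c.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical 30-byte item, used only to have something to measure. */
struct toy_dir_item {
	unsigned char raw[30];
};

/*
 * Format the item size into buf. Using %zu keeps the format string
 * correct on every architecture, unlike %lu, which only matches where
 * size_t happens to be unsigned long.
 */
static int toy_format_size(char *buf, size_t buflen)
{
	return snprintf(buf, buflen, "item size %zu",
			sizeof(struct toy_dir_item));
}
```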

Btrfs: use struct completion in scrub_submit_raid56_bio_wait

This changes to use struct completion directly and removes 'struct
scrub_bio_ret' along with the code using it.

This struct is used to get the return value from the bio, but the caller
can access the bio to get the return value directly and is holding a
reference on it, so the bio won't go away underneath us. The struct can
therefore be removed safely.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: remove unused variable wait in lock_stripe_add

The defined wait is not used anywhere.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: compress_file_range() change page dirty status once

We need to call extent_range_clear_dirty_for_io()
on the compression range to prevent applications from changing
page content while the pages are being compressed.

extent_range_clear_dirty_for_io() runs on each loop iteration, and
"(end - start)" can be much (up to 1024 times) bigger
than the compression range (BTRFS_MAX_UNCOMPRESSED).

The start pointer is advanced each time we manage to compress part of
the range. The end pointer does not change so we could redirty the
remaining parts repeatedly.

Fix that behaviour by calling extent_range_clear_dirty_for_io()
only once, the first time it happens.

This is the safest but probably not the best behaviour. Previous
iterations of the patch tried to redirty only the range that we were not
able to compress. This has been refused by David for safety reasons, the
writeout callchain is complex and there could be some path that relies
on redirtying the entire unwritten range.

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ enhance changelog, the history and safety concerns, add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
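
The "only once" behaviour from the commit above can be sketched in
userspace as a loop guarded by a flag. All names are hypothetical:
toy_clear_dirty_range stands in for extent_range_clear_dirty_for_io(),
and TOY_MAX_CHUNK stands in for the per-iteration compression limit.

```c
#include <stddef.h>

#define TOY_MAX_CHUNK 4	/* assumption: stand-in for the compression range */

static int toy_clear_calls;	/* counts calls, for demonstration only */

/* Hypothetical stand-in for clearing the dirty bits on [start, end). */
static void toy_clear_dirty_range(size_t start, size_t end)
{
	(void)start;
	(void)end;
	toy_clear_calls++;
}

/*
 * Walk [start, end) in chunks. The whole remaining range is cleared on
 * the first iteration only; later iterations skip the call because the
 * end of the range never changes.
 */
static void toy_compress_range(size_t start, size_t end)
{
	int cleared = 0;

	while (start < end) {
		size_t len = end - start;

		if (len > TOY_MAX_CHUNK)
			len = TOY_MAX_CHUNK;
		if (!cleared) {
			toy_clear_dirty_range(start, end);
			cleared = 1;
		}
		/* ... compress [start, start + len) here ... */
		start += len;
	}
}
```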

Btrfs: compression heuristic: replace heap sort with radix sort

The slowest part of the heuristic for now is the kernel heap sort().
It can take up to 55% of runtime on sorting bucket items.

As sorting will always be called on most data sets to get a correct
byte_core_set_size, the only way to speed up the heuristic is to
speed up the sort on the bucket.

Add a general radix_sort function.
Radix sort requires 2 buffers, one the full size of the input array
and one for storing counters (jump addresses).

That increases usage per heuristic workspace by 1KiB:
8KiB + 1KiB -> 8KiB + 2KiB

This is an LSD radix sort; I use 4 bits as the base for calculating,
to make the counters array acceptably small (16 elements * 8 bytes).

This radix sort implementation has several points to adjust;
I added them to make radix sort generally usable in the kernel,
like heap sort, if needed.

Performance tested in userspace copy of heuristic code,
throughput:
- average <-> random data: ~3500 MiB/s - heap sort
- average <-> random data: ~6000 MiB/s - radix sort

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
[ coding style fixes ]
Signed-off-by: David Sterba <dsterba@suse.com>
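
For readers unfamiliar with the technique, here is a minimal userspace
sketch of an LSD radix sort with a 4-bit base, as the commit above
describes: one scratch buffer the size of the input plus 16 counters.
It is not the kernel implementation. The function name is hypothetical,
it sorts plain 32-bit values, and it sorts in ascending order, which is
an assumption made here for simplicity.

```c
#include <stddef.h>
#include <stdint.h>

#define RADIX_BITS 4
#define RADIX_SIZE (1u << RADIX_BITS)	/* 16 counters */

/*
 * Sort arr[0..n) ascending, using tmp[0..n) as scratch space.
 * One stable counting-sort pass per 4-bit digit, least significant first.
 */
static void toy_radix_sort(uint32_t *arr, uint32_t *tmp, size_t n)
{
	uint32_t *src = arr, *dst = tmp, *swap;
	unsigned int shift;
	size_t i, d;

	for (shift = 0; shift < 32; shift += RADIX_BITS) {
		size_t count[RADIX_SIZE] = {0};
		size_t pos = 0;

		/* Histogram of the current digit. */
		for (i = 0; i < n; i++)
			count[(src[i] >> shift) & (RADIX_SIZE - 1)]++;

		/* Turn counts into start offsets (exclusive prefix sum). */
		for (d = 0; d < RADIX_SIZE; d++) {
			size_t c = count[d];

			count[d] = pos;
			pos += c;
		}

		/* Stable scatter into the other buffer. */
		for (i = 0; i < n; i++)
			dst[count[(src[i] >> shift) & (RADIX_SIZE - 1)]++] = src[i];

		swap = src;
		src = dst;
		dst = swap;
	}
	/* 32 / 4 = 8 passes, an even number, so the result ends up in arr. */
}
```

The small base keeps the counter array tiny at the cost of more passes; a
larger base would need fewer passes but a larger counter array.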