Filipe Manana [Sun, 7 Dec 2014 21:31:47 +0000 (21:31 +0000)]
Btrfs: fix fs corruption on transaction abort if device supports discard
When we abort a transaction we iterate over all the ranges marked as dirty
in fs_info->freed_extents[0] and fs_info->freed_extents[1], clear them
from those trees, add them back (unpin) to the free space caches and, if
the fs was mounted with "-o discard", perform a discard on those regions.
Also, after adding the regions to the free space caches, a fitrim ioctl call
can see those ranges in a block group's free space cache and perform a discard
on the ranges, so the same issue can happen without "-o discard" as well.
This causes corruption, affecting one or multiple btree nodes (in the worst
case leaving the fs unmountable) because some of those ranges (the ones in
the fs_info->pinned_extents tree) correspond to btree nodes/leafs that are
referred by the last committed super block - breaking the rule that anything
that was committed by a transaction is untouched until the next transaction
commits successfully.
I ran into this while running in a loop (for several hours) the fstest that
I recently submitted:
[PATCH] fstests: add btrfs test to stress chunk allocation/removal and fstrim
The corruption always happened when a transaction aborted and then fsck complained
like this:
_check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
*** fsck.btrfs output ***
Check tree block failed, want=94945280, have=0
Check tree block failed, want=94945280, have=0
Check tree block failed, want=94945280, have=0
Check tree block failed, want=94945280, have=0
Check tree block failed, want=94945280, have=0
read block failed check_tree_block
Couldn't open file system
In this case 94945280 corresponded to the root of a tree.
Using frace what I observed was the following sequence of steps happened:
1) transaction N started, fs_info->pinned_extents pointed to
fs_info->freed_extents[0];
4) transaction N commit starts, fs_info->pinned_extents now points to
fs_info->freed_extents[1], and transaction N completes;
5) transaction N + 1 starts;
6) eb is COWed, and btrfs_free_tree_block() called for this eb;
7) eb range (94945280 to 94945280 + 16Kb) is added to
fs_info->pinned_extents (fs_info->freed_extents[1]);
8) Something goes wrong in transaction N + 1, like hitting ENOSPC
for example, and the transaction is aborted, turning the fs into
readonly mode. The stack trace I got for example:
[112065.253935] [<ffffffff8140c7b6>] dump_stack+0x4d/0x66
[112065.254271] [<ffffffff81042984>] warn_slowpath_common+0x7f/0x98
[112065.254567] [<ffffffffa0325990>] ? __btrfs_abort_transaction+0x50/0x10b [btrfs]
[112065.261674] [<ffffffff810429e5>] warn_slowpath_fmt+0x48/0x50
[112065.261922] [<ffffffffa032949e>] ? btrfs_free_path+0x26/0x29 [btrfs]
[112065.262211] [<ffffffffa0325990>] __btrfs_abort_transaction+0x50/0x10b [btrfs]
[112065.262545] [<ffffffffa036b1d6>] btrfs_remove_chunk+0x537/0x58b [btrfs]
[112065.262771] [<ffffffffa033840f>] btrfs_delete_unused_bgs+0x1de/0x21b [btrfs]
[112065.263105] [<ffffffffa0343106>] cleaner_kthread+0x100/0x12f [btrfs]
(...)
[112065.264493] ---[ end trace dd7903a975a31a08 ]---
[112065.264673] BTRFS: error (device sdc) in btrfs_remove_chunk:2625: errno=-28 No space left
[112065.264997] BTRFS info (device sdc): forced readonly
9) The clear kthread sees that the BTRFS_FS_STATE_ERROR bit is set in
fs_info->fs_state and calls btrfs_cleanup_transaction(), which in
turn calls btrfs_destroy_pinned_extent();
10) Then btrfs_destroy_pinned_extent() iterates over all the ranges
marked as dirty in fs_info->freed_extents[], and for each one
it calls discard, if the fs was mounted with "-o discard", and
adds the range to the free space cache of the respective block
group;
11) btrfs_trim_block_group(), invoked from the fitrim ioctl code path,
sees the free space entries and performs a discard;
12) After an umount and mount (or fsck), our eb's location on disk was full
of zeroes, and it should have been untouched, because it was marked as
dirty in the fs_info->pinned_extents tree, and therefore used by the
trees that the last committed superblock points to.
Fix this by not performing a discard and not adding the ranges to the free space
caches - it's useless from this point since the fs is now in readonly mode and
we won't write free space caches to disk anymore (otherwise we would leak space)
nor any new superblock. By not adding the ranges to the free space caches, it
prevents other code paths from allocating that space and write to it as well,
therefore being safer and simpler.
Cc: stable@vger.kernel.org # any kernel released after 2011-01-06 Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Thu, 4 Dec 2014 18:38:30 +0000 (18:38 +0000)]
Btrfs: always clear a block group node when removing it from the tree
Always clear a block group's rbnode after removing it from the rbtree to
ensure that any tasks that might be holding a reference on the block group
don't end up accessing stale rbnode left and right child pointers through
next_block_group().
This is a leftover from the change titled:
"Btrfs: fix invalid block group rbtree access after bg is removed"
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Thu, 4 Dec 2014 15:31:01 +0000 (15:31 +0000)]
Btrfs: ensure deletion from pinned_chunks list is protected
The call to remove_extent_mapping() actually deletes the extent map
from the list it's included in - fs_info->pinned_chunks - and that
list is protected by the chunk mutex. Therefore make that call
while holding the chunk mutex and remove the redundant list delete
call because it's a noop.
This fixes an overlook of the patch titled
"Btrfs: fix race between fs trimming and block group remove/allocation"
following the same obvervation from the patch titled
"Btrfs: fix unprotected deletion from pending_chunks list".
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
Josef Bacik [Wed, 26 Nov 2014 16:52:54 +0000 (11:52 -0500)]
Btrfs: make get_caching_control unconditionally return the ctl
This was written when we didn't do a caching control for the fast free space
cache loading. However we started doing that a long time ago, and there is
still a small window of time that we could be caching the block group the fast
way, so if there is a caching_ctl at all on the block group just return it, the
callers all wait properly for what they want. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Tue, 2 Dec 2014 18:07:49 +0000 (18:07 +0000)]
Btrfs: fix unprotected deletion from pending_chunks list
On block group remove if the corresponding extent map was on the
transaction->pending_chunks list, we were deleting the extent map
from that list, through remove_extent_mapping(), without any
synchronization with chunk allocation (which iterates that list
and adds new elements to it). Fix this by ensure that this is done
while the chunk mutex is held, since that's the mutex that protects
the list in the chunk allocation code path.
This applies on top (depends on) of my previous patch titled:
"Btrfs: fix race between fs trimming and block group remove/allocation"
But the issue in fact was already present before that change, it only
became easier to hit after Josef's 3.18 patch that added automatic
removal of empty block groups.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Tue, 2 Dec 2014 18:07:30 +0000 (18:07 +0000)]
Btrfs: fix fs mapping extent map leak
On chunk allocation error (label "error_del_extent"), after adding the
extent map to the tree and to the pending chunks list, we would leave
decrementing the extent map's refcount by 2 instead of 3 (our allocation
+ tree reference + list reference).
Also, on chunk/block group removal, if the block group was on the list
pending_chunks we weren't decrementing the respective list reference.
This applies on top (depends on) of my previous patch titled:
"Btrfs: fix race between fs trimming and block group remove/allocation"
But the issue in fact was already present before that change, it only
became easier to hit after Josef's 3.18 patch that added automatic
removal of empty block groups.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Mon, 1 Dec 2014 17:04:40 +0000 (17:04 +0000)]
Btrfs: fix memory leak after block remove + trimming
There was a free space entry structure memeory leak if a block
group is remove while a free space entry is being trimmed, which
the following diagram explains:
CPU 1 CPU 2
btrfs_trim_block_group()
trim_no_bitmap()
remove free space entry from
block group cache's rbtree
do_trimming()
The free space entry added after doing the discard of its respective
range ends up never being freed.
Detected after doing an "rmmod btrfs" after running the stress test
recently submitted for fstests:
Filipe Manana [Wed, 26 Nov 2014 15:28:55 +0000 (15:28 +0000)]
Btrfs: make btrfs_abort_transaction consider existence of new block groups
If the transaction handle doesn't have used blocks but has created new block
groups make sure we turn the fs into readonly mode too. This is because the
new block groups didn't get all their metadata persisted into the chunk and
device trees, and therefore if a subsequent transaction starts, allocates
space from the new block groups, writes data or metadata into that space,
commits successfully and then after we unmount and mount the filesystem
again, the same space can be allocated again for a new block group,
resulting in file data or metadata corruption.
Example where we don't abort the transaction when we fail to finish the
chunk allocation (add items to the chunk and device trees) and later a
future transaction where the block group is removed fails because it can't
find the chunk item in the chunk tree:
Filipe Manana [Mon, 1 Dec 2014 17:04:09 +0000 (17:04 +0000)]
Btrfs: fix race between writing free space cache and trimming
Trimming is completely transactionless, and the way it operates consists
of hiding free space entries from a block group, perform the trim/discard
and then make the free space entries visible again.
Therefore while a free space entry is being trimmed, we can have free space
cache writing running in parallel (as part of a transaction commit) which
will miss the free space entry. This means that an unmount (or crash/reboot)
after that transaction commit and mount again before another transaction
starts/commits after the discard finishes, we will have some free space
that won't be used again unless the free space cache is rebuilt. After the
unmount, fsck (btrfsck, btrfs check) reports the issue like the following
example:
*** fsck.btrfs output ***
checking extents
checking free space cache
There is no free space entry for 521764864-521781248
There is no free space entry for 521764864-1103101952
cache appears valid but isnt 29360128
Checking filesystem on /dev/sdc
UUID: b4789e27-4774-4626-98e9-ae8dfbfb0fb5
found 1235681286 bytes used err is -22
(...)
Another issue caused by this race is a crash while writing bitmap entries
to the cache, because while the cache writeout task accesses the bitmaps,
the trim task can be concurrently modifying the bitmap or worse might
be freeing the bitmap. The later case results in the following crash:
Fix this by serializing both tasks in such a way that cache writeout
doesn't wait for the trim/discard of free space entries to finish and
doesn't miss any free space entry.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Thu, 27 Nov 2014 21:14:15 +0000 (21:14 +0000)]
Btrfs: fix race between fs trimming and block group remove/allocation
Our fs trim operation, which is completely transactionless (doesn't start
or joins an existing transaction) consists of visiting all block groups
and then for each one to iterate its free space entries and perform a
discard operation against the space range represented by the free space
entries. However before performing a discard, the corresponding free space
entry is removed from the free space rbtree, and when the discard completes
it is added back to the free space rbtree.
If a block group remove operation happens while the discard is ongoing (or
before it starts and after a free space entry is hidden), we end up not
waiting for the discard to complete, remove the extent map that maps
logical address to physical addresses and the corresponding chunk metadata
from the the chunk and device trees. After that and before the discard
completes, the current running transaction can finish and a new one start,
allowing for new block groups that map to the same physical addresses to
be allocated and written to.
So fix this by keeping the extent map in memory until the discard completes
so that the same physical addresses aren't reused before it completes.
If the physical locations that are under a discard operation end up being
used for a new metadata block group for example, and dirty metadata extents
are written before the discard finishes (the VM might call writepages() of
our btree inode's i_mapping for example, or an fsync log commit happens) we
end up overwriting metadata with zeroes, which leads to errors from fsck
like the following:
checking extents
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
read block failed check_tree_block
owner ref check failed [833912832 16384]
Errors found in extent allocation tree or chunk allocation
checking free space cache
checking fs roots
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
read block failed check_tree_block
root 5 root dir 256 error
root 5 inode 260 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref
root 5 inode 262 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref
root 5 inode 263 errors 2001, no inode item, link count wrong
(...)
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Wed, 26 Nov 2014 15:28:52 +0000 (15:28 +0000)]
Btrfs: fix freeing used extents after removing empty block group
There's a race between adding a block group to the list of the unused
block groups and removing an unused block group (cleaner kthread) that
leads to freeing extents that are in use or a crash during transaction
commmit. Basically the cleaner kthread, when executing
btrfs_delete_unused_bgs(), might catch the newly added block group to
the list fs_info->unused_bgs and clear the range representing the whole
group from fs_info->freed_extents[] before the task that added the block
group to the list (running update_block_group()) marked the last freed
extent as dirty in fs_info->freed_extents (pinned_extents).
That is:
CPU 1 CPU 2
btrfs_delete_unused_bgs()
update_block_group()
add block group to
fs_info->unused_bgs
got block group from the list
clear_extent_bits for the whole
block group range in freed_extents[]
set_extent_dirty for the
range covering the freed
extent in freed_extents[]
(fs_info->pinned_extents)
block group deleted, and a new block
group with the same logical address is
created
reserve space from the new block group
for new data or metadata - the reserved
space overlaps the range specified by
CPU 1 for set_extent_dirty()
commit transaction
find all ranges marked as dirty in
fs_info->pinned_extents, clear them
and add them to the free space cache
Alternatively, if CPU 2 doesn't create a new block group with the same
logical address, we get a crash/BUG_ON at transaction commit when unpining
extent ranges because we can't find a block group for the range marked as
dirty by CPU 1. Sample trace:
Filipe Manana [Wed, 26 Nov 2014 15:28:51 +0000 (15:28 +0000)]
Btrfs: fix crash caused by block group removal
If we remove a block group (because it became empty), we might have left
a caching_ctl structure in fs_info->caching_block_groups that points to
the block group and is accessed at transaction commit time. This results
in accessing an invalid or incorrect block group. This issue became visible
after Josef's patch "Btrfs: remove empty block groups automatically".
So if the block group is removed make sure we don't leave a dangling
caching_ctl in caching_block_groups.
Filipe Manana [Wed, 26 Nov 2014 15:28:50 +0000 (15:28 +0000)]
Btrfs: fix invalid block group rbtree access after bg is removed
If we grab a block group, for example in btrfs_trim_fs(), we will be holding
a reference on it but the block group can be removed after we got it (via
btrfs_remove_block_group), which means it will no longer be part of the
rbtree.
However, btrfs_remove_block_group() was only calling rb_erase() which leaves
the block group's rb_node left and right child pointers with the same content
they had before calling rb_erase. This was dangerous because a call to
next_block_group() would access the node's left and right child pointers (via
rb_next), which can be no longer valid.
Fix this by clearing a block group's node after removing it from the tree,
and have next_block_group() do a tree search to get the next block group
instead of using rb_next() if our block group was removed.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Miao Xie [Tue, 25 Nov 2014 08:39:28 +0000 (16:39 +0800)]
Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56
The commit c404e0dc (Btrfs: fix use-after-free in the finishing
procedure of the device replace) fixed a use-after-free problem
which happened when removing the source device at the end of device
replace, but at that time, btrfs didn't support device replace
on raid56, so we didn't fix the problem on the raid56 profile.
Currently, we implemented device replace for raid56, so we need
kick that problem out before we enable that function for raid56.
The fix method is very simple, we just increase the bio per-cpu
counter before we submit a raid56 io, and decrease the counter
when the raid56 io ends.
Miao Xie [Fri, 14 Nov 2014 09:45:42 +0000 (17:45 +0800)]
Btrfs, replace: write raid56 parity into the replace target device
This function reused the code of parity scrub, and we just write
the right parity or corrected parity into the target device before
the parity scrub end.
Miao Xie [Fri, 14 Nov 2014 08:06:25 +0000 (16:06 +0800)]
Btrfs, replace: write dirty pages into the replace target device
The implementation is simple:
- In order to avoid changing the code logic of btrfs_map_bio and
RAID56, we add the stripes of the replace target devices at the
end of the stripe array in btrfs bio, and we sort those target
device stripes in the array. And we keep the number of the target
device stripes in the btrfs bio.
- Except write operation on RAID56, all the other operation don't
take the target device stripes into account.
- When we do write operation, we read the data from the common devices
and calculate the parity. Then write the dirty data and new parity
out, at this time, we will find the relative replace target stripes
and wirte the relative data into it.
Note: The function that copying old data on the source device to
the target device was implemented in the past, it is similar to
the other RAID type.
Miao Xie [Thu, 6 Nov 2014 09:20:58 +0000 (17:20 +0800)]
Btrfs, raid56: support parity scrub on raid56
The implementation is:
- Read and check all the data with checksum in the same stripe.
All the data which has checksum is COW data, and we are sure
that it is not changed though we don't lock the stripe. because
the space of that data just can be reclaimed after the current
transction is committed, and then the fs can use it to store the
other data, but when doing scrub, we hold the current transaction,
that is that data can not be recovered, it is safe that read and check
it out of the stripe lock.
- Lock the stripe
- Read out all the data without checksum and parity
The data without checksum and the parity may be changed if we don't
lock the stripe, so we need read it in the stripe lock context.
- Check the parity
- Re-calculate the new parity and write back it if the old parity
is not right
- Unlock the stripe
If we can not read out the data or the data we read is corrupted,
we will try to repair it. If the repair fails. we will mark the
horizontal sub-stripe(pages on the same horizontal) as corrupted
sub-stripe, and we will skip the parity check and repair of that
horizontal sub-stripe.
And in order to skip the horizontal sub-stripe that has no data, we
introduce a bitmap. If there is some data on the horizontal sub-stripe,
we will the relative bit to 1, and when we check and repair the
parity, we will skip those horizontal sub-stripes that the relative
bits is 0.
Miao Xie [Thu, 6 Nov 2014 08:14:21 +0000 (16:14 +0800)]
Btrfs, raid56: use a variant to record the operation type
We will introduce new operation type later, if we still use integer
variant as bool variant to record the operation type, we would add new
variant and increase the size of raid bio structure. It is not good,
by this patch, we define different number for different operation,
and we can just use a variant to record the operation type.
Miao Xie [Thu, 23 Oct 2014 06:42:50 +0000 (14:42 +0800)]
Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
This patch implement the RAID5/6 common data repair function, the
implementation is similar to the scrub on the other RAID such as
RAID1, the differentia is that we don't read the data from the
mirror, we use the data repair function of RAID5/6.
Miao Xie [Wed, 15 Oct 2014 03:18:44 +0000 (11:18 +0800)]
Btrfs, raid56: don't change bbio and raid_map
Because we will reuse bbio and raid_map during the scrub later, it is
better that we don't change any variant of bbio and don't free it at
the end of IO request. So we introduced similar variants into the raid
bio, and don't access those bbio's variants any more.
Filipe Manana [Wed, 29 Oct 2014 11:57:59 +0000 (11:57 +0000)]
Btrfs: fix snapshot inconsistency after a file write followed by truncate
If right after starting the snapshot creation ioctl we perform a write against a
file followed by a truncate, with both operations increasing the file's size, we
can get a snapshot tree that reflects a state of the source subvolume's tree where
the file truncation happened but the write operation didn't. This leaves a gap
between 2 file extent items of the inode, which makes btrfs' fsck complain about it.
For example, if we perform the following file operations:
and the snapshot creation ioctl was just called before the second write, we often
can get the following inode items in the snapshot's btree:
item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
inode generation 146 transid 7 size 90123 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
inode ref index 282 namelen 10 name: foobar
item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
extent data disk byte 1104855040 nr 32768
extent data offset 0 nr 32768 ram 32768
extent compression 0
item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
extent data disk byte 0 nr 0
extent data offset 0 nr 40960 ram 40960
extent compression 0
There's a file range, corresponding to the interval [32K; ALIGN(16K + 32770, 4096)[
for which there's no file extent item covering it. This is because the file write
and file truncate operations happened both right after the snapshot creation ioctl
called btrfs_start_delalloc_inodes(), which means we didn't start and wait for the
ordered extent that matches the write and, in btrfs_setsize(), we were able to call
btrfs_cont_expand() before being able to commit the current transaction in the
snapshot creation ioctl. So this made it possibe to insert the hole file extent
item in the source subvolume (which represents the region added by the truncate)
right before the transaction commit from the snapshot creation ioctl.
Btrfs' fsck tool complains about such cases with a message like the following:
>From a user perspective, the expectation when a snapshot is created while those
file operations are being performed is that the snapshot will have a file that
either:
1) is empty
2) only the first write was captured
3) only the 2 writes were captured
4) both writes and the truncation were captured
But never capture a state where only the first write and the truncation were
captured (since the second write was performed before the truncation).
A test case for xfstests follows.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Tue, 21 Oct 2014 10:11:41 +0000 (11:11 +0100)]
Btrfs: ensure send always works on roots without orphans
Move the logic from the snapshot creation ioctl into send. This avoids
doing the transaction commit if send isn't used, and ensures that if
a crash/reboot happens after the transaction commit that created the
snapshot and before the transaction commit that switched the commit
root, send will not get a commit root that differs from the main root
(that has orphan items).
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Mon, 3 Nov 2014 14:08:39 +0000 (14:08 +0000)]
Btrfs: fix freeing used extent after removing empty block group
Due to ignoring errors returned by clear_extent_bits (at the moment only
-ENOMEM is possible), we can end up freeing an extent that is actually in
use (i.e. return the extent to the free space cache).
The sequence of steps that lead to this:
1) Cleaner thread starts execution and calls btrfs_delete_unused_bgs(), with
the goal of freeing empty block groups;
2) btrfs_delete_unused_bgs() finds an empty block group, joins the current
transaction (or starts a new one if none is running) and attempts to
clear the EXTENT_DIRTY bit for the block group's range from freed_extents[0]
and freed_extents[1] (of which one corresponds to fs_info->pinned_extents);
3) Clearing the EXTENT_DIRTY bit (via clear_extent_bits()) fails with
-ENOMEM, but such error is ignored and btrfs_delete_unused_bgs() proceeds
to delete the block group and the respective chunk, while pinned_extents
remains with that bit set for the whole (or a part of the) range covered
by the block group;
4) Later while the transaction is still running, the chunk ends up being reused
for a new block group (maybe for different purpose, data or metadata), and
extents belonging to the new block group are allocated for file data or btree
nodes/leafs;
5) The current transaction is committed, meaning that we unpinned one or more
extents from the new block group (through btrfs_finish_extent_commit() and
unpin_extent_range()) which are now being used for new file data or new
metadata (through btrfs_finish_extent_commit() and unpin_extent_range()).
And unpinning means we returned the extents to the free space cache of the
new block group, which implies those extents can be used for future allocations
while they're still in use.
Alternatively, we can hit a BUG_ON() when doing a lookup for a block group's cache
object in unpin_extent_range() if a new block group didn't end up being allocated for
the same chunk (step 4 above).
Fix this by not freeing the block group and chunk if we fail to clear the dirty bit.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
Qu Wenruo [Thu, 30 Oct 2014 08:52:31 +0000 (16:52 +0800)]
btrfs: Fix a lockdep warning when running xfstest.
The following lockdep warning is triggered during xfstests:
[ 1702.980872] =========================================================
[ 1702.981181] [ INFO: possible irq lock inversion dependency detected ]
[ 1702.981482] 3.18.0-rc1 #27 Not tainted
[ 1702.981781] ---------------------------------------------------------
[ 1702.982095] kswapd0/77 just changed the state of lock:
[ 1702.982415] (&delayed_node->mutex){+.+.-.}, at: [<ffffffffa03b0b51>] __btrfs_release_delayed_node+0x41/0x1f0 [btrfs]
[ 1702.982794] but this lock took another, RECLAIM_FS-unsafe lock in the past:
[ 1702.983160] (&fs_info->dev_replace.lock){+.+.+.}
and interrupts could create inverse lock ordering between them.
[ 1702.984675]
other info that might help us debug this:
[ 1702.985524] Chain exists of:
&delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock
[ 1702.986799] Possible interrupt unsafe locking scenario:
It is because the btrfs_kobj_{add/rm}_device() will call memory
allocation with GFP_KERNEL,
which may flush fs page cache to free space, waiting for it self to do
the commit, causing the deadlock.
To solve the problem, move btrfs_kobj_{add/rm}_device() out of the
dev_replace lock range, also involing split the
btrfs_rm_dev_replace_srcdev() function into remove and free parts.
Now only btrfs_rm_dev_replace_remove_srcdev() is called in dev_replace
lock range, and kobj_{add/rm} and btrfs_rm_dev_replace_free_srcdev() are
called out of the lock range.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
Andy Lutomirski [Fri, 21 Nov 2014 21:26:07 +0000 (13:26 -0800)]
uprobes, x86: Fix _TIF_UPROBE vs _TIF_NOTIFY_RESUME
x86 call do_notify_resume on paranoid returns if TIF_UPROBE is set but
not on non-paranoid returns. I suspect that this is a mistake and that
the code only works because int3 is paranoid.
Setting _TIF_NOTIFY_RESUME in the uprobe code was probably a workaround
for the x86 bug. With that bug fixed, we can remove _TIF_NOTIFY_RESUME
from the uprobes code.
Thomas Gleixner [Sun, 23 Nov 2014 22:04:52 +0000 (23:04 +0100)]
sched: Provide update_curr callbacks for stop/idle scheduling classes
Chris bisected a NULL pointer deference in task_sched_runtime() to
commit 6e998916dfe3 'sched/cputime: Fix clock_nanosleep()/clock_gettime()
inconsistency'.
Chris observed crashes in atop or other /proc walking programs when he
started fork bombs on his machine. He assumed that this is a new exit
race, but that does not make any sense when looking at that commit.
What's interesting is that, the commit provides update_curr callbacks
for all scheduling classes except stop_task and idle_task.
While nothing can ever hit that via the clock_nanosleep() and
clock_gettime() interfaces, which have been the target of the commit in
question, the author obviously forgot that there are other code paths
which invoke task_sched_runtime()
If the stats are read for a stomp machine task, aka 'migration/N' and
that task is current on its cpu, this will happily call the NULL pointer
of stop_task->update_curr. Ooops.
Chris observation that this happens faster when he runs the fork bomb
makes sense as the fork bomb will kick migration threads more often so
the probability to hit the issue will increase.
Add the missing update_curr callbacks to the scheduler classes stop_task
and idle_task. While idle tasks cannot be monitored via /proc we have
other means to hit the idle case.
Linus Torvalds [Sun, 23 Nov 2014 21:56:55 +0000 (13:56 -0800)]
Merge branch 'x86-traps' (trap handling from Andy Lutomirski)
Merge x86-64 iret fixes from Andy Lutomirski:
"This addresses the following issues:
- an unrecoverable double-fault triggerable with modify_ldt.
- invalid stack usage in espfix64 failed IRET recovery from IST
context.
- invalid stack usage in non-espfix64 failed IRET recovery from IST
context.
It also makes a good but IMO scary change: non-espfix64 failed IRET
will now report the correct error. Hopefully nothing depended on the
old incorrect behavior, but maybe Wine will get confused in some
obscure corner case"
* emailed patches from Andy Lutomirski <luto@amacapital.net>:
x86_64, traps: Rework bad_iret
x86_64, traps: Stop using IST for #SS
x86_64, traps: Fix the espfix64 #DF fixup and rewrite it in C
Andy Lutomirski [Sun, 23 Nov 2014 02:00:33 +0000 (18:00 -0800)]
x86_64, traps: Rework bad_iret
It's possible for iretq to userspace to fail. This can happen because
of a bad CS, SS, or RIP.
Historically, we've handled it by fixing up an exception from iretq to
land at bad_iret, which pretends that the failed iret frame was really
the hardware part of #GP(0) from userspace. To make this work, there's
an extra fixup to fudge the gs base into a usable state.
This is suboptimal because it loses the original exception. It's also
buggy because there's no guarantee that we were on the kernel stack to
begin with. For example, if the failing iret happened on return from an
NMI, then we'll end up executing general_protection on the NMI stack.
This is bad for several reasons, the most immediate of which is that
general_protection, as a non-paranoid idtentry, will try to deliver
signals and/or schedule from the wrong stack.
This patch throws out bad_iret entirely. As a replacement, it augments
the existing swapgs fudge into a full-blown iret fixup, mostly written
in C. It's should be clearer and more correct.
Signed-off-by: Andy Lutomirski <luto@amacapital.net> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andy Lutomirski [Sun, 23 Nov 2014 02:00:32 +0000 (18:00 -0800)]
x86_64, traps: Stop using IST for #SS
On a 32-bit kernel, this has no effect, since there are no IST stacks.
On a 64-bit kernel, #SS can only happen in user code, on a failed iret
to user space, a canonical violation on access via RSP or RBP, or a
genuine stack segment violation in 32-bit kernel code. The first two
cases don't need IST, and the latter two cases are unlikely fatal bugs,
and promoting them to double faults would be fine.
This fixes a bug in which the espfix64 code mishandles a stack segment
violation.
This saves 4k of memory per CPU and a tiny bit of code.
Signed-off-by: Andy Lutomirski <luto@amacapital.net> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 23 Nov 2014 19:46:01 +0000 (11:46 -0800)]
Merge tag 'armsoc-for-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC fixes from Olof Johansson:
"A collection of fixes this week:
- A set of clock fixes for shmobile platforms
- A fix for tegra that moves serial port labels to be per board.
We're choosing to merge this for 3.18 because the labels will start
being parsed in 3.19, and without this change serial port numbers
that used to be stable since the dawn of time will change numbers.
- A few other DT tweaks for Tegra.
- A fix for multi_v7_defconfig that makes it stop spewing cpufreq
errors on Arndale (Exynos)"
* tag 'armsoc-for-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
ARM: multi_v7_defconfig: fix failure setting CPU voltage by enabling dependent I2C controller
ARM: tegra: roth: Fix SD card VDD_IO regulator
ARM: tegra: Remove eMMC vmmc property for roth/tn7
ARM: dts: tegra: move serial aliases to per-board
ARM: tegra: Add serial port labels to Tegra124 DT
ARM: shmobile: kzm9g legacy: Set i2c clks_per_count to 2
ARM: shmobile: r8a7740 dtsi: Correct IIC0 parent clock
ARM: shmobile: r8a7790: Fix SD3CKCR address to device tree
ARM: shmobile: r8a7740 legacy: Correct IIC0 parent clock
ARM: shmobile: r8a7740 legacy: Add missing INTCA clock for irqpin module
ARM: shmobile: r8a7790: Fix SD3CKCR address
ARM: dts: sun6i: Re-parent ahb1_mux to pll6 as required by dma controller
Linus Torvalds [Sun, 23 Nov 2014 19:33:49 +0000 (11:33 -0800)]
Merge branch 'for-3.18-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull percpu fix from Tejun Heo:
"This contains one patch to fix a race condition which can lead to
percpu_ref using a percpu pointer which is corrupted with a set DEAD
bit. The bug was introduced while separating out the ATOMIC mode flag
from the DEAD flag. The fix is pretty straight forward.
I just committed the patch to the percpu tree but am sending out the
pull request early as I'll be on vacation for a week. The patch
should be fairly safe and while the latency will be higher I'll be
checking emails"
* 'for-3.18-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu-ref: fix DEAD flag contamination of percpu pointer
Linus Torvalds [Sun, 23 Nov 2014 19:16:36 +0000 (11:16 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs deadlock fix from Chris Mason:
"This has a fix for a long standing deadlock that we've been trying to
nail down for a while. It ended up being a bad interaction with the
fair reader/writer locks and the order btrfs reacquires locks in the
btree"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: fix lockups from btrfs_clear_path_blocking
Tejun Heo [Sat, 22 Nov 2014 14:22:42 +0000 (09:22 -0500)]
percpu-ref: fix DEAD flag contamination of percpu pointer
While decoupling ATOMIC and DEAD flags, f47ad4578461 ("percpu_ref:
decouple switching to percpu mode and reinit") updated
__ref_is_percpu() so that it only tests ATOMIC flag to determine
whether the ref is in percpu mode or not; however, while DEAD implies
ATOMIC, the two flags are set separately during percpu_ref_kill() and
if __ref_is_percpu() races percpu_ref_kill(), it may see DEAD w/o
ATOMIC. Because __ref_is_percpu() returns @ref->percpu_count_ptr
value verbatim as the percpu pointer after testing ATOMIC, the pointer
may now be contaminated with the DEAD flag.
This can be fixed by clearing the flag bits before returning the
pointer which was the fix proposed by Shaohua; however, as DEAD
implies ATOMIC, we can just test for both flags at once and avoid the
explicit masking.
Update __ref_is_percpu() so that it tests that both ATOMIC and DEAD
are clear before returning @ref->percpu_count_ptr as the percpu
pointer.
1) Fix BUG when decrypting empty packets in mac80211, from Ronald Wahl.
2) nf_nat_range is not fully initialized and this is copied back to
userspace, from Daniel Borkmann.
3) Fix read past end of b uffer in netfilter ipset, also from Dan
Carpenter.
4) Signed integer overflow in ipv4 address mask creation helper
inet_make_mask(), from Vincent BENAYOUN.
5) VXLAN, be2net, mlx4_en, and qlcnic need ->ndo_gso_check() methods to
properly describe the device's capabilities, from Joe Stringer.
6) Fix memory leaks and checksum miscalculations in openvswitch, from
Pravin B SHelar and Jesse Gross.
7) FIB rules passes back ambiguous error code for unreachable routes,
making behavior confusing for userspace. Fix from Panu Matilainen.
8) ieee802154fake_probe() doesn't release resources properly on error,
from Alexey Khoroshilov.
9) Fix skb_over_panic in add_grhead(), from Daniel Borkmann.
10) Fix access of stale slave pointers in bonding code, from Nikolay
Aleksandrov.
11) Fix stack info leak in PPP pptp code, from Mathias Krause.
12) Cure locking bug in IPX stack, from Jiri Bohac.
13) Revert SKB fclone memory freeing optimization that is racey and can
allow accesses to freed up memory, from Eric Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (71 commits)
tcp: Restore RFC5961-compliant behavior for SYN packets
net: Revert "net: avoid one atomic operation in skb_clone()"
virtio-net: validate features during probe
cxgb4 : Fix DCB priority groups being returned in wrong order
ipx: fix locking regression in ipx_sendmsg and ipx_recvmsg
openvswitch: Don't validate IPv6 label masks.
pptp: fix stack info leak in pptp_getname()
brcmfmac: don't include linux/unaligned/access_ok.h
cxgb4i : Don't block unload/cxgb4 unload when remote closes TCP connection
ipv6: delete protocol and unregister rtnetlink when cleanup
net/mlx4_en: Add VXLAN ndo calls to the PF net device ops too
bonding: fix curr_active_slave/carrier with loadbalance arp monitoring
mac80211: minstrel_ht: fix a crash in rate sorting
vxlan: Inline vxlan_gso_check().
can: m_can: update to support CAN FD features
can: m_can: fix incorrect error messages
can: m_can: add missing delay after setting CCCR_INIT bit
can: m_can: fix not set can_dlc for remote frame
can: m_can: fix possible sleep in napi poll
can: m_can: add missing message RAM initialization
...
Linus Torvalds [Sat, 22 Nov 2014 01:15:28 +0000 (17:15 -0800)]
Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"Just two radeon and two intel fixes: endian and regression fixes"
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
drm/radeon: fix endian swapping in vbios fetch for tdp table
drm/radeon: disable native backlight control on pre-r6xx asics (v2)
drm/i915: Kick fbdev before vgacon
drm/i915: drop WaSetupGtModeTdRowDispatch:snb
Linus Torvalds [Sat, 22 Nov 2014 01:11:56 +0000 (17:11 -0800)]
Merge tag 'sound-3.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"This batch ended up as a relatively high volume due to pending ASoC
fixes. But most of fixes there are trivial and/or device- specific
fixes and quirks, so safe to apply. The only (ASoC) core fixes are
the DPCM race fix and the machine-driver matching fix for
componentization"
* tag 'sound-3.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda - fix the mic mute led problem for Latitude E5550
ALSA: hda - move DELL_WMI_MIC_MUTE_LED to the tail in the quirk chain
ASoC: wm_adsp: Avoid attempt to free buffers that might still be in use
ALSA: usb-audio: Set the Control Selector to SU_SELECTOR_CONTROL for UAC2
ALSA: usb-audio: Add ctrl message delay quirk for Marantz/Denon devices
ASoC: sgtl5000: Fix SMALL_POP bit definition
ASoC: cs42l51: re-hook of_match_table pointer
ASoC: rt5670: change dapm routes of PLL connection
ASoC: rt5670: correct the incorrect default values
ASoC: samsung: Add MODULE_DEVICE_TABLE for Snow
ASoC: max98090: Correct pclk divisor settings
ASoC: dpcm: Fix race between FE/BE updates and trigger
ASoC: Fix snd_soc_find_dai() matching component by name
ASoC: rsnd: remove unsupported PAUSE flag
ASoC: fsi: remove unsupported PAUSE flag
ASoC: rt5645: Mark RT5645_TDM_CTRL_3 as readable
ASoC: rockchip-i2s: fix infinite loop in rockchip_snd_rxctrl
ASoC: es8328-i2c: Fix i2c_device_id name field in es8328_id
ASoC: fsl_asrc: Add reg_defaults for regmap to fix kernel dump
Linus Torvalds [Sat, 22 Nov 2014 00:56:25 +0000 (16:56 -0800)]
Merge tag 'pm+acpi-3.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI power management fix from Rafael Wysocki:
"This is just a one-liner fixing a regression introduced in 3.13 that
broke system suspend on some Chromebooks.
On those machines there are ACPI device objects for some I2C devices
that can wake up the system from sleep states, but that is done via a
platform-specific mechanism and the ACPI objects don't contain any
wakeup-related information. When we started to use ACPI power
management with those devices (which happened during the 3.13 cycle),
their configuration confused the ACPI PM layer that returned error
codes from suspend callbacks for them causing system suspend to fail.
However, the ACPI PM layer can safely ignore the wakeup setting from a
device driver if the ACPI object corresponding to the device in
question doesn't contain wakeup information in which case the driver
itself is responsible for setting up the device for system wakeup"
* tag 'pm+acpi-3.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI / PM: Ignore wakeup setting if the ACPI companion can't wake up
Linus Torvalds [Sat, 22 Nov 2014 00:40:41 +0000 (16:40 -0800)]
Merge tag 'devicetree-fixes-for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
Pull devicetree fixes from Rob Herring:
"DeviceTree fixes for 3.18:
- two fixes for OF selftest code
- fix for PowerPC address parsing to disable work-around except on
old PowerMACs
- fix a crash when earlycon is enabled, but no device is found
- DT documentation fixes and missing vendor prefixes
All but the doc updates are also for stable"
* tag 'devicetree-fixes-for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
of/selftest: Fix testing when /aliases is missing
of/selftest: Fix off-by-one error in removal path
documentation: pinctrl bindings: Fix trivial typo 'abitrary'
devicetree: bindings: Add vendor prefix for Micron Technology, Inc.
of: Add vendor prefix for Chips&Media, Inc.
of/base: Fix PowerPC address parsing hack
devicetree: vendor-prefixes.txt: fix whitespace
of: Fix crash if an earlycon driver is not found
of/irq: Drop obsolete 'interrupts' vs 'interrupts-extended' text
of: Spelling s/stucture/structure/
devicetree: bindings: add sandisk to the vendor prefixes
Linus Torvalds [Sat, 22 Nov 2014 00:36:42 +0000 (16:36 -0800)]
Merge tag 'pci-v3.18-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI fixes from Bjorn Helgaas:
"These are fixes for an issue with 64-bit PCI bus addresses on 32-bit
PAE kernels, an APM X-Gene problem (it depended on a generic change we
removed before merging), a fix for my hotplug device configuration
changes, and a devicetree documentation update.
Resource management:
- Support 64-bit bridge windows if we have 64-bit dma_addr_t (Yinghai Lu)
PCI device hotplug:
- Apply _HPX Link Control settings to all devices with a link (Yinghai Lu)
APM X-Gene:
- Assign resources to bus before adding new devices (Duc Dang)"
* tag 'pci-v3.18-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
PCI: Support 64-bit bridge windows if we have 64-bit dma_addr_t
PCI: Apply _HPX Link Control settings to all devices with a link
PCI: Add missing DT binding for "linux,pci-domain" property
PCI: xgene: Assign resources to bus before adding new devices
* git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
IB/isert: Adjust CQ size to HW limits
target: return CONFLICT only when SA key unmatched
iser-target: Handle DEVICE_REMOVAL event on network portal listener correctly
ib_isert: Add max_send_sge=2 minimum for control PDU responses
srp-target: Retry when QP creation fails with ENOMEM
iscsi-target: return the correct port in SendTargets
vhost-scsi: Take configfs group dependency during VHOST_SCSI_SET_ENDPOINT
target: Don't call TFO->write_pending if data_length == 0
Linus Torvalds [Sat, 22 Nov 2014 00:24:27 +0000 (16:24 -0800)]
Merge branch 'fixes' of git://git.infradead.org/users/vkoul/slave-dma
Pull dmaengine fixes from Vinod Koul:
"We have couple of fixes for dmaengine queued up:
- dma mempcy fix for dma configuration of sun6i by Maxime
- pl330 fixes: First the fixing allocation for data buffers by Liviu
and then Jon's fixe for fifo width and usage"
* 'fixes' of git://git.infradead.org/users/vkoul/slave-dma:
dmaengine: Fix allocation size for PL330 data buffer depth.
dmaengine: pl330: Limit MFIFO usage for memcpy to avoid exhausting entries
dmaengine: pl330: Align DMA memcpy operations to MFIFO width
dmaengine: sun6i: Fix memcpy operation
Linus Torvalds [Sat, 22 Nov 2014 00:14:58 +0000 (16:14 -0800)]
Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus
Pull MIPS fixes from Ralf Baechle:
"More 3.18 fixes for MIPS:
- backtraces were not quite working on on 64-bit kernels
- loongson needs a different cache coherency setting
- Loongson 3 is a MIPS64 R2 version but due to erratum we treat is an
older architecture revision.
- fix build errors due to undefined references to __node_distances
for certain configurations.
- fix instruction decodig in the jump label code.
- for certain configurations copy_{from,to}_user destroy the content
of $3 so that register needs to be marked as clobbed by the calling
code.
- Hardware Table Walker fixes.
- fill the delay slot of the last instruction of memcpy otherwise
whatever ends up there randomly might have undesirable effects.
- ensure get_user/__get_user always zero the variable to be read even
in case of an error"
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
MIPS: jump_label.c: Handle the microMIPS J instruction encoding
MIPS: jump_label.c: Correct the span of the J instruction
MIPS: Zero variable read by get_user / __get_user in case of an error.
MIPS: lib: memcpy: Restore NOP on delay slot before returning to caller
MIPS: tlb-r4k: Add missing HTW stop/start sequences
MIPS: asm: uaccess: Add v1 register to clobber list on EVA
MIPS: oprofile: Fix backtrace on 64-bit kernel
MIPS: Loongson: Set Loongson-3's ISA level to MIPS64R1
MIPS: Loongson: Fix the write-combine CCA value setting
MIPS: IP27: Fix __node_distances undefined error
MIPS: Loongson3: Fix __node_distances undefined error
Linus Torvalds [Fri, 21 Nov 2014 23:46:17 +0000 (15:46 -0800)]
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"Misc fixes:
- gold linker build fix
- noxsave command line parsing fix
- bugfix for NX setup
- microcode resume path bug fix
- _TIF_NOHZ versus TIF_NOHZ bugfix as discussed in the mysterious
lockup thread"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, syscall: Fix _TIF_NOHZ handling in syscall_trace_enter_phase1
x86, kaslr: Handle Gold linker for finding bss/brk
x86, mm: Set NX across entire PMD at boot
x86, microcode: Update BSPs microcode on resume
x86: Require exact match for 'noxsave' command line option
Linus Torvalds [Fri, 21 Nov 2014 23:44:07 +0000 (15:44 -0800)]
Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"Misc fixes: two Intel uncore driver fixes, a CPU-hotplug fix and a
build dependencies fix"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/uncore: Fix boot crash on SBOX PMU on Haswell-EP
perf/x86/intel/uncore: Fix IRP uncore register offsets on Haswell EP
perf: Fix corruption of sibling list with hotplug
perf/x86: Fix embarrasing typo
Nobody seems to currently use GENMASK() to fill every single last bit
(which is what overflows) in-tree, and gcc would warn about it, so we
have that going for us. But apparently there are pending changes that
want this.
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
bitops: Fix shift overflow in GENMASK macros
Calvin Owens [Thu, 20 Nov 2014 23:09:53 +0000 (15:09 -0800)]
tcp: Restore RFC5961-compliant behavior for SYN packets
Commit c3ae62af8e755 ("tcp: should drop incoming frames without ACK
flag set") was created to mitigate a security vulnerability in which a
local attacker is able to inject data into locally-opened sockets by
using TCP protocol statistics in procfs to quickly find the correct
sequence number.
This broke the RFC5961 requirement to send a challenge ACK in response
to spurious RST packets, which was subsequently fixed by commit 7b514a886ba50 ("tcp: accept RST without ACK flag").
Unfortunately, the RFC5961 requirement that spurious SYN packets be
handled in a similar manner remains broken.
RFC5961 section 4 states that:
... the handling of the SYN in the synchronized state SHOULD be
performed as follows:
1) If the SYN bit is set, irrespective of the sequence number, TCP
MUST send an ACK (also referred to as challenge ACK) to the remote
peer:
<SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>
After sending the acknowledgment, TCP MUST drop the unacceptable
segment and stop processing further.
By sending an ACK, the remote peer is challenged to confirm the loss
of the previous connection and the request to start a new connection.
A legitimate peer, after restart, would not have a TCB in the
synchronized state. Thus, when the ACK arrives, the peer should send
a RST segment back with the sequence number derived from the ACK
field that caused the RST.
This RST will confirm that the remote peer has indeed closed the
previous connection. Upon receipt of a valid RST, the local TCP
endpoint MUST terminate its connection. The local TCP endpoint
should then rely on SYN retransmission from the remote end to
re-establish the connection.
This patch lets SYN packets through the discard added in c3ae62af8e755,
so that spurious SYN packets are properly dealt with as per the RFC.
The challenge ACK is sent unconditionally and is rate-limited, so the
original vulnerability is not reintroduced by this patch.
Signed-off-by: Calvin Owens <calvinowens@fb.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 21 Nov 2014 19:47:16 +0000 (11:47 -0800)]
net: Revert "net: avoid one atomic operation in skb_clone()"
Not sure what I was thinking, but doing anything after
releasing a refcount is suicidal or/and embarrassing.
By the time we set skb->fclone to SKB_FCLONE_FREE, another cpu
could have released last reference and freed whole skb.
We potentially corrupt memory or trap if CONFIG_DEBUG_PAGEALLOC is set.
Reported-by: Chris Mason <clm@fb.com> Fixes: ce1a4ea3f1258 ("net: avoid one atomic operation in skb_clone()") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Filipe Manana [Thu, 13 Nov 2014 17:01:45 +0000 (17:01 +0000)]
Btrfs: ensure ordered extent errors aren't missed on fsync
When doing a fsync with a fast path we have a time window where we can miss
the fact that writeback of some file data failed, and therefore we endup
returning success (0) from fsync when we should return an error.
The steps that lead to this are the following:
1) We start all ordered extents by calling filemap_fdatawrite_range();
2) We do some other work like locking the inode's i_mutex, start a transaction,
start a log transaction, etc;
3) We enter btrfs_log_inode(), acquire the inode's log_mutex and collect all the
ordered extents from inode's ordered tree into a list;
4) But by the time we do ordered extent collection, some ordered extents we started
at step 1) might have already completed with an error, and therefore we didn't
found them in the ordered tree and had no idea they finished with an error. This
makes our fsync return success (0) to userspace, but has no bad effects on the log
like for example insertion of file extent items into the log that point to unwritten
extents, because the invalid extent maps were removed before the ordered extent
completed (in inode.c:btrfs_finish_ordered_io).
So after collecting the ordered extents just check if the inode's i_mapping has any
error flags set (AS_EIO or AS_ENOSPC) and leave with an error if it does. Whenever
writeback fails for a page of an ordered extent, we call mapping_set_error (done in
extent_io.c:end_extent_writepage, called by extent_io.c:end_bio_extent_writepage)
that sets one of those error flags in the inode's i_mapping flags.
This change also has the side effect of fixing the issue where for fast fsyncs we
never checked/cleared the error flags from the inode's i_mapping flags, which means
that a full fsync performed after a fast fsync could get such errors that belonged
to the fast fsync - because the full fsync calls btrfs_wait_ordered_range() which
calls filemap_fdatawait_range(), and the later checks for and clears those flags,
while for fast fsyncs we never call filemap_fdatawait_range() or anything else
that checks for and clears the error flags from the inode's i_mapping.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Thu, 13 Nov 2014 17:00:35 +0000 (17:00 +0000)]
Btrfs: collect only the necessary ordered extents on ranged fsync
Instead of collecting all ordered extents from the inode's ordered tree
and then wait for all of them to complete, just collect the ones that
overlap the fsync range.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Thu, 13 Nov 2014 16:59:53 +0000 (16:59 +0000)]
Btrfs: don't ignore log btree writeback errors
If an error happens during writeback of log btree extents, make sure the
error is returned to the caller (fsync), so that it takes proper action
(commit current transaction) instead of writing a superblock that points
to log btrees with all or some nodes that weren't durably persisted.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
Josef Bacik [Fri, 14 Nov 2014 21:16:30 +0000 (16:16 -0500)]
Btrfs: do not move em to modified list when unpinning
We use the modified list to keep track of which extents have been modified so we
know which ones are candidates for logging at fsync() time. Newly modified
extents are added to the list at modification time, around the same time the
ordered extent is created. We do this so that we don't have to wait for ordered
extents to complete before we know what we need to log. The problem is when
something like this happens
log extent 0-4k on inode 1
copy csum for 0-4k from ordered extent into log
sync log
commit transaction
log some other extent on inode 1
ordered extent for 0-4k completes and adds itself onto modified list again
log changed extents
see ordered extent for 0-4k has already been logged
at this point we assume the csum has been copied
sync log
crash
On replay we will see the extent 0-4k in the log, drop the original 0-4k extent
which is the same one that we are replaying which also drops the csum, and then
we won't find the csum in the log for that bytenr. This of course causes us to
have errors about not having csums for certain ranges of our inode. So remove
the modified list manipulation in unpin_extent_cache, any modified extents
should have been added well before now, and we don't want them re-logged. This
fixes my test that I could reliably reproduce this problem with. Thanks,
cc: stable@vger.kernel.org Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Josef Bacik [Fri, 21 Nov 2014 19:52:38 +0000 (14:52 -0500)]
Btrfs: make sure logged extents complete in the current transaction V3
Liu Bo pointed out that my previous fix would lose the generation update in the
scenario I described. It is actually much worse than that, we could lose the
entire extent if we lose power right after the transaction commits. Consider
the following
write extent 0-4k
log extent in log tree
commit transaction
< power fail happens here
ordered extent completes
We would lose the 0-4k extent because it hasn't updated the actual fs tree, and
the transaction commit will reset the log so it isn't replayed. If we lose
power before the transaction commit we are save, otherwise we are not.
Fix this by keeping track of all extents we logged in this transaction. Then
when we go to commit the transaction make sure we wait for all of those ordered
extents to complete before proceeding. This will make sure that if we lose
power after the transaction commit we still have our data. This also fixes the
problem of the improperly updated extent generation. Thanks,
cc: stable@vger.kernel.org Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Al Viro [Fri, 21 Nov 2014 16:51:08 +0000 (11:51 -0500)]
Merge branch 'overlayfs-current' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs into for-linus
"The biggest change is to rename the filesystem from "overlayfs" to "overlay".
This will allow legacy overlayfs to be easily carried by distros alongside the
new mainline one. Also fix a couple of copy-up races and allow escaping comma
character in filenames."
The last bit is about commas in pathname mount options...
Jason Wang [Thu, 20 Nov 2014 09:03:05 +0000 (17:03 +0800)]
virtio-net: validate features during probe
We currently trigger BUG when VIRTIO_NET_F_CTRL_VQ
is not set but one of features depending on it is.
That's not a friendly way to report errors to
hypervisors.
Let's check, and fail probe instead.
Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Cornelia Huck <cornelia.huck@de.ibm.com> Cc: Wanlong Gao <gaowanlong@cn.fujitsu.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The following patchset contains two bugfixes for your net tree, they are:
1) Validate netlink group from nfnetlink to avoid an out of bound array
access. This should only happen with superuser priviledges though.
Discovered by Andrey Ryabinin using trinity.
2) Don't push ethernet header before calling the netfilter output hook
for multicast traffic, this breaks ebtables since it expects to see
skb->data pointing to the network header, patch from Linus Luessing.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Bohac [Wed, 19 Nov 2014 22:05:49 +0000 (23:05 +0100)]
ipx: fix locking regression in ipx_sendmsg and ipx_recvmsg
This fixes an old regression introduced by commit b0d0d915 (ipx: remove the BKL).
When a recvmsg syscall blocks waiting for new data, no data can be sent on the
same socket with sendmsg because ipx_recvmsg() sleeps with the socket locked.
This breaks mars-nwe (NetWare emulator):
- the ncpserv process reads the request using recvmsg
- ncpserv forks and spawns nwconn
- ncpserv calls a (blocking) recvmsg and waits for new requests
- nwconn deadlocks in sendmsg on the same socket
Commit b0d0d915 has simply replaced BKL locking with
lock_sock/release_sock. Unlike now, BKL got unlocked while
sleeping, so a blocking recvmsg did not block a concurrent
sendmsg.
Only keep the socket locked while actually working with the socket data and
release it prior to calling skb_recv_datagram().
Signed-off-by: Jiri Bohac <jbohac@suse.cz> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Stringer [Wed, 19 Nov 2014 21:54:49 +0000 (13:54 -0800)]
openvswitch: Don't validate IPv6 label masks.
When userspace doesn't provide a mask, OVS datapath generates a fully
unwildcarded mask for the flow by copying the flow and setting all bits
in all fields. For IPv6 label, this creates a mask that matches on the
upper 12 bits, causing the following error:
openvswitch: netlink: Invalid IPv6 flow label value (value=ffffffff, max=fffff)
This patch ignores the label validation check for masks, avoiding this
error.
Signed-off-by: Joe Stringer <joestringer@nicira.com> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Mathias Krause [Wed, 19 Nov 2014 17:05:26 +0000 (18:05 +0100)]
pptp: fix stack info leak in pptp_getname()
pptp_getname() only partially initializes the stack variable sa,
particularly only fills the pptp part of the sa_addr union. The code
thereby discloses 16 bytes of kernel stack memory via getsockname().
Fix this by memset(0)'ing the union before.
Cc: Dmitry Kozlov <xeb@mail.ru> Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Dave Airlie [Fri, 21 Nov 2014 02:19:19 +0000 (12:19 +1000)]
Merge branch 'drm-fixes-3.18' of git://people.freedesktop.org/~agd5f/linux into drm-fixes
fix one regression and one endian issue.
* 'drm-fixes-3.18' of git://people.freedesktop.org/~agd5f/linux:
drm/radeon: fix endian swapping in vbios fetch for tdp table
drm/radeon: disable native backlight control on pre-r6xx asics (v2)
Josef Bacik [Thu, 6 Nov 2014 15:19:54 +0000 (10:19 -0500)]
Btrfs: make sure we wait on logged extents when fsycning two subvols
If we have two fsync()'s race on different subvols one will do all of its work
to get into the log_tree, wait on it's outstanding IO, and then allow the
log_tree to finish it's commit. The problem is we were just free'ing that
subvols logged extents instead of waiting on them, so whoever lost the race
wouldn't really have their data on disk. Fix this by waiting properly instead
of freeing the logged extents. Thanks,
cc: stable@vger.kernel.org Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
David Sterba [Fri, 14 Nov 2014 14:05:06 +0000 (15:05 +0100)]
btrfs: fix wrong accounting of raid1 data profile in statfs
The sizes that are obtained from space infos are in raw units and have
to be adjusted according to the raid factor. This was missing for
f_bavail and df reported doubled size for raid1.
Reported-by: Martin Steigerwald <Martin@lichtvoll.de> Fixes: ba7b6e62f420 ("btrfs: adjust statfs calculations according to raid profiles") CC: stable@vger.kernel.org Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
This leads to an ABBA pattern deadlock. To fix it,
o we just change it to an AABB pattern which means to @unlock_extent_bits()
before we @lock_page(), and in this way the @extent_read_full_page_nolock()
is no longer in an locked context, so change it back to @extent_read_full_page()
to regain protection.
o Since we @unlock_extent_bits() earlier, then before @write_page_nocow(),
the extent may not really point at the physical block we want, so we
have to check it before write.
Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Sun, 9 Nov 2014 08:38:39 +0000 (08:38 +0000)]
Btrfs: make xattr replace operations atomic
Replacing a xattr consists of doing a lookup for its existing value, delete
the current value from the respective leaf, release the search path and then
finally insert the new value. This leaves a time window where readers (getxattr,
listxattrs) won't see any value for the xattr. Xattrs are used to store ACLs,
so this has security implications.
This change also fixes 2 other existing issues which were:
*) Deleting the old xattr value without verifying first if the new xattr will
fit in the existing leaf item (in case multiple xattrs are packed in the
same item due to name hash collision);
*) Returning -EEXIST when the flag XATTR_CREATE is given and the xattr doesn't
exist but we have have an existing item that packs muliple xattrs with
the same name hash as the input xattr. In this case we should return ENOSPC.
A test case for xfstests follows soon.
Thanks to Alexandre Oliva for reporting the non-atomicity of the xattr replace
implementation.
Reported-by: Alexandre Oliva <oliva@gnu.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Mon, 3 Nov 2014 14:12:57 +0000 (14:12 +0000)]
Btrfs: avoid premature -ENOMEM in clear_extent_bit()
We try to allocate an extent state structure before acquiring the extent
state tree's spinlock as we might need a new one later and therefore avoid
doing later an atomic allocation while holding the tree's spinlock. However
we returned -ENOMEM if that initial non-atomic allocation failed, which is
a bit excessive since we might end up not needing the pre-allocated extent
state at all - for the case where the tree doesn't have any extent states
that cover the input range and cover too any other range. Therefore don't
return -ENOMEM if that pre-allocation fails.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
Josef Bacik [Mon, 3 Nov 2014 13:56:50 +0000 (08:56 -0500)]
Btrfs: don't take the chunk_mutex/dev_list mutex in statfs V2
Our gluster boxes get several thousand statfs() calls per second, which begins
to suck hardcore with all of the lock contention on the chunk mutex and dev list
mutex. We don't really need to hold these things, if we have transient
weirdness with statfs() because of the chunk allocator we don't care, so remove
this locking.
We still need the dev_list lock if you mount with -o alloc_start however, which
is a good argument for nuking that thing from orbit, but that's a patch for
another day. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Josef Bacik [Fri, 31 Oct 2014 13:49:34 +0000 (09:49 -0400)]
Btrfs: move read only block groups onto their own list V2
Our gluster boxes were spending lots of time in statfs because our fs'es are
huge. The problem is statfs loops through all of the block groups looking for
read only block groups, and when you have several terabytes worth of data that
ends up being a lot of block groups. Move the read only block groups onto a
read only list and only proces that list in
btrfs_account_ro_block_groups_free_space to reduce the amount of churn. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
Stefan Behrens [Fri, 17 Oct 2014 12:10:10 +0000 (14:10 +0200)]
Btrfs: check-int: don't complain about balanced blocks
The xfstest btrfs/014 which tests the balance operation caused that the
check_int module complained that known blocks changed their physical
location. Since this is not an error in this case, only print such
message if the verbose mode was enabled.
Reported-by: Wang Shilong <wangshilong1991@gmail.com> Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Tested-by: Wang Shilong <wangshilong1991@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
Stefan Behrens [Thu, 16 Oct 2014 15:48:48 +0000 (17:48 +0200)]
Btrfs: check_int: use the known block location
The xfstest btrfs/014 which tests the balance operation caused issues with
the check_int module. The attempt was made to use btrfs_map_block() to
find the physical location for a written block. However, this was not
at all needed since the location of the written block was known since
a hook to submit_bio() was the reason for entering the check_int module.
Additionally, after a block relocation it happened that btrfs_map_block()
failed causing misleading error messages afterwards.
This patch changes the check_int module to use the known information of
the physical location from the bio.
Reported-by: Wang Shilong <wangshilong1991@gmail.com> Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Tested-by: Wang Shilong <wangshilong1991@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Mon, 13 Oct 2014 11:28:39 +0000 (12:28 +0100)]
Btrfs: avoid returning -ENOMEM in convert_extent_bit() too early
We try to allocate an extent state before acquiring the tree's spinlock
just in case we end up needing to split an existing extent state into two.
If that allocation failed, we would return -ENOMEM.
However, our only single caller (transaction/log commit code), passes in
an extent state that was cached from a call to find_first_extent_bit() and
that has a very high chance to match exactly the input range (always true
for a transaction commit and very often, but not always, true for a log
commit) - in this case we end up not needing at all that initial extent
state used for an eventual split. Therefore just don't return -ENOMEM if
we can't allocate the temporary extent state, since we might not need it
at all, and if we end up needing one, we'll do it later anyway.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Mon, 13 Oct 2014 11:28:38 +0000 (12:28 +0100)]
Btrfs: make find_first_extent_bit be able to cache any state
Right now the only caller of find_first_extent_bit() that is interested
in caching extent states (transaction or log commit), never gets an extent
state cached. This is because find_first_extent_bit() only caches states
that have at least one of the flags EXTENT_IOBITS or EXTENT_BOUNDARY, and
the transaction/log commit caller always passes a tree that doesn't have
ever extent states with any of those flags (they can only have one of the
following flags: EXTENT_DIRTY, EXTENT_NEW or EXTENT_NEED_WAIT).
This change together with the following one in the patch series (titled
"Btrfs: avoid returning -ENOMEM in convert_extent_bit() too early") will
help reduce significantly the chances of calls to convert_extent_bit()
fail with -ENOMEM when called from the transaction/log commit code.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Mon, 13 Oct 2014 11:28:37 +0000 (12:28 +0100)]
Btrfs: deal with convert_extent_bit errors to avoid fs corruption
When committing a transaction or a log, we look for btree extents that
need to be durably persisted by searching for ranges in a io tree that
have some bits set (EXTENT_DIRTY or EXTENT_NEW). We then attempt to clear
those bits and set the EXTENT_NEED_WAIT bit, with calls to the function
convert_extent_bit, and then start writeback for the extents.
That function however can return an error (at the moment only -ENOMEM
is possible, specially when it does GFP_ATOMIC allocation requests
through alloc_extent_state_atomic) - that means the ranges didn't got
the EXTENT_NEED_WAIT bit set (or at least not for the whole range),
which in turn means a call to btrfs_wait_marked_extents() won't find
those ranges for which we started writeback, causing a transaction
commit or a log commit to persist a new superblock without waiting
for the writeback of extents in that range to finish first.
Therefore if a crash happens after persisting the new superblock and
before writeback finishes, we have a superblock pointing to roots that
weren't fully persisted or roots that point to nodes or leafs that weren't
fully persisted, causing all sorts of unexpected/bad behaviour as we endup
reading garbage from disk or the content of some node/leaf from a past
generation that got cowed or deleted and is no longer valid (for this later
case we end up getting error messages like "parent transid verify failed on
X wanted Y found Z" when reading btree nodes/leafs from disk).
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
Eryu Guan [Mon, 13 Oct 2014 04:42:12 +0000 (12:42 +0800)]
Btrfs: return failure if btrfs_dev_replace_finishing() failed
device replace could fail due to another running scrub process or any
other errors btrfs_scrub_dev() may hit, but this failure doesn't get
returned to userspace.
The following steps could reproduce this issue
mkfs -t btrfs -f /dev/sdb1 /dev/sdb2
mount /dev/sdb1 /mnt/btrfs
while true; do btrfs scrub start -B /mnt/btrfs >/dev/null 2>&1; done &
btrfs replace start -Bf /dev/sdb2 /dev/sdb3 /mnt/btrfs
# if this replace succeeded, do the following and repeat until
# you see this log in dmesg
# BTRFS: btrfs_scrub_dev(/dev/sdb2, 2, /dev/sdb3) failed -115
#btrfs replace start -Bf /dev/sdb3 /dev/sdb2 /mnt/btrfs
# once you see the error log in dmesg, check return value of
# replace
echo $?
Introduce a new dev replace result
BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS
to catch -EINPROGRESS explicitly and return other errors directly to
userspace.
Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Fri, 10 Oct 2014 09:45:12 +0000 (10:45 +0100)]
Btrfs: report error after failure inlining extent in compressed write path
If cow_file_range_inline() failed, when called from compress_file_range(),
we were tagging the locked page for writeback, end its writeback and unlock it,
but not marking it with an error nor setting AS_EIO in inode's mapping flags.
This made it impossible for a caller of filemap_fdatawrite_range (writepages)
or filemap_fdatawait_range() to know that an error happened. And the return
value of compress_file_range() is useless because it's returned to a workqueue
task and not to the task calling filemap_fdatawrite_range (writepages).
This change applies on top of the previous patchset starting at the patch
titled:
"[1/5] Btrfs: set page and mapping error on compressed write failure"
Which changed extent_clear_unlock_delalloc() to use SetPageError and
mapping_set_error().
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Thu, 9 Oct 2014 20:18:55 +0000 (21:18 +0100)]
Btrfs: correctly flush compressed data before/after direct IO
For compressed writes, after doing the first filemap_fdatawrite_range() we
don't get the pages tagged for writeback immediately. Instead we create
a workqueue task, which is run by other kthread, and keep the pages locked.
That other kthread compresses data, creates the respective ordered extent/s,
tags the pages for writeback and unlocks them. Therefore we need a second
call to filemap_fdatawrite_range() if we have compressed writes, as this
second call will wait for the pages to become unlocked, then see they became
tagged for writeback and finally wait for the writeback to finish.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Thu, 9 Oct 2014 20:15:44 +0000 (21:15 +0100)]
Btrfs: make inode.c:compress_file_range() return void
Its return value is useless, its single caller ignores it and can't do
anything with it anyway, since it's a workqueue task and not the task
calling filemap_fdatawrite_range (writepages) nor filemap_fdatawait_range().
Failure is communicated to such functions via start and end of writeback
with the respective pages tagged with an error and AS_EIO flag set in the
inode's imapping.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
after previous steps, inode will be detected as bad compression ratio,
and NOCOMPRESS flag will be set for that inode.
Reason is that compress have a max limit pages every time(128K), if a
132k write in, it will be splitted into two write(128k+4k), this bug
is a leftover for commit 68bb462d42a(Btrfs: don't compress for a small write)
Fix this problem by checking every time before compression, if it is a
small write(<=blocksize), we bail out and fall into nocompression directly.
Signed-off-by: Wang Shilong <wangshilong1991@gmail.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Tue, 7 Oct 2014 00:48:26 +0000 (01:48 +0100)]
Btrfs: don't ignore compressed bio write errors
Our compressed bio write end callback was essentially ignoring the error
parameter. When a write error happens, it must pass a value of 0 to the
inode's write_page_end_io_hook callback, SetPageError on the respective
pages and set AS_EIO in the inode's mapping flags, so that a call to
filemap_fdatawait_range() / filemap_fdatawait() can find out that errors
happened (we surely don't want silent failures on fsync for example).
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Mon, 6 Oct 2014 21:14:26 +0000 (22:14 +0100)]
Btrfs: make inode.c:submit_compressed_extents() return void
Its return value is completely ignored by its single caller and it's
useless anyway, since errors are indicated through SetPageError and
the bit AS_EIO set in the flags of the inode's mapping. The caller
can't do anything with the value, as it's invoked from a workqueue
task and not by the task calling filemap_fdatawrite_range (which calls
the writepages address space callback, which in turn calls the inode's
fill_delalloc callback).
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Mon, 6 Oct 2014 21:14:25 +0000 (22:14 +0100)]
Btrfs: process all async extents on compressed write failure
If we had an error when processing one of the async extents from our list,
we were not processing the remaining async extents, meaning we would leak
those async_extent structs, never release the pages with the compressed
data and never unlock and clear the dirty flag from the inode's pages (those
that correspond to the uncompressed content).
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Mon, 6 Oct 2014 21:14:24 +0000 (22:14 +0100)]
Btrfs: don't leak pages and memory on compressed write error
In inode.c:submit_compressed_extents(), if we fail before calling
btrfs_submit_compressed_write(), or when that function fails, we
were freeing the async_extent structure without releasing its pages
and freeing the pages array.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Mon, 6 Oct 2014 21:14:23 +0000 (22:14 +0100)]
Btrfs: fix hang on compressed write error
In inode.c:submit_compressed_extents(), before calling btrfs_submit_compressed_write()
we start writeback for all pages, clear their dirty flag, unlock them, etc, but if
btrfs_submit_compressed_write() fails (at the moment it can only fail with -ENOMEM),
we never end the writeback on the pages, so any filemap_fdatawait_range() call will
hang forever. We were also not calling the writepage end io hook, which means the
corresponding ordered extent will never complete and all its waiters will block
forever, such as a full fsync (via btrfs_wait_ordered_range()).
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>