Adam Borowski [Fri, 31 Mar 2017 15:19:04 +0000 (17:19 +0200)]
btrfs: drop the nossd flag when remounting with -o ssd
The opposite case was already handled right in the very next switch entry.
And also when turning on nossd, drop ssd_spread.
Reported-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Signed-off-by: Adam Borowski <kilobyte@angband.pl> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Dan Carpenter [Fri, 17 Mar 2017 20:51:20 +0000 (23:51 +0300)]
Btrfs: fix an integer overflow check
This isn't super serious because you need CAP_ADMIN to run this code.
I added this integer overflow check last year but apparently I am
rubbish at writing integer overflow checks... There are two issues.
First, access_ok() works on unsigned long type and not u64 so on 32 bit
systems the access_ok() could be checking a truncated size. The other
issue is that we should be using a stricter limit so we don't overflow
the kzalloc() setting ctx->clone_roots later in the function after the
access_ok():
Liu Bo [Fri, 24 Mar 2017 22:04:50 +0000 (15:04 -0700)]
Btrfs: bring back repair during read
Commit 20a7db8ab3f2 ("btrfs: add dummy callback for readpage_io_failed
and drop checks") made a cleanup around readpage_io_failed_hook, and
it was supposed to keep the original sematics, but it also
unexpectedly disabled repair during read for dup, raid1 and raid10.
This fixes the problem by letting data's inode call the generic
readpage_io_failed callback by returning -EAGAIN from its
readpage_io_failed_hook in order to notify end_bio_extent_readpage to
do the rest. We don't call it directly because the generic one takes
an offset from end_bio_extent_readpage() to calculate the index in the
checksum array and inode's readpage_io_failed_hook doesn't offer that
offset.
Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com>
[ keep the const function attribute ] Signed-off-by: David Sterba <dsterba@suse.com>
Zygo Blaxell [Fri, 10 Mar 2017 21:45:44 +0000 (16:45 -0500)]
btrfs: add missing memset while reading compressed inline extents
This is a story about 4 distinct (and very old) btrfs bugs.
Commit c8b978188c ("Btrfs: Add zlib compression support") added
three data corruption bugs for inline extents (bugs #1-3).
Commit 93c82d5750 ("Btrfs: zero page past end of inline file items")
fixed bug #1: uncompressed inline extents followed by a hole and more
extents could get non-zero data in the hole as they were read. The fix
was to add a memset in btrfs_get_extent to zero out the hole.
Commit 166ae5a418 ("btrfs: fix inline compressed read err corruption")
fixed bug #2: compressed inline extents which contained non-zero bytes
might be replaced with zero bytes in some cases. This patch removed an
unhelpful memset from uncompress_inline, but the case where memset is
required was missed.
There is also a memset in the decompression code, but this only covers
decompressed data that is shorter than the ram_bytes from the extent
ref record. This memset doesn't cover the region between the end of the
decompressed data and the end of the page. It has also moved around a
few times over the years, so there's no single patch to refer to.
This patch fixes bug #3: compressed inline extents followed by a hole
and more extents could get non-zero data in the hole as they were read
(i.e. bug #3 is the same as bug #1, but s/uncompressed/compressed/).
The fix is the same: zero out the hole in the compressed case too,
by putting a memset back in uncompress_inline, but this time with
correct parameters.
The last and oldest bug, bug #0, is the cause of the offending inline
extent/hole/extent pattern. Bug #0 is a subtle and mostly-harmless quirk
of behavior somewhere in the btrfs write code. In a few special cases,
an inline extent and hole are allowed to persist where they normally
would be combined with later extents in the file.
A fast reproducer for bug #0 is presented below. A few offending extents
are also created in the wild during large rsync transfers with the -S
flag. A Linux kernel build (git checkout; make allyesconfig; make -j8)
will produce a handful of offending files as well. Once an offending
file is created, it can present different content to userspace each
time it is read.
Bug #0 is at least 4 and possibly 8 years old. I verified every vX.Y
kernel back to v3.5 has this behavior. There are fossil records of this
bug's effects in commits all the way back to v2.6.32. I have no reason
to believe bug #0 wasn't present at the beginning of btrfs compression
support in v2.6.29, but I can't easily test kernels that old to be sure.
It is not clear whether bug #0 is worth fixing. A fix would likely
require injecting extra reads into currently write-only paths, and most
of the exceptional cases caused by bug #0 are already handled now.
Whether we like them or not, bug #0's inline extents followed by holes
are part of the btrfs de-facto disk format now, and we need to be able
to read them without data corruption or an infoleak. So enough about
bug #0, let's get back to bug #3 (this patch).
An example of on-disk structure leading to data corruption found in
the wild:
Different data appears in userspace during each read of the 11 bytes
between 4085 and 4096. The extent in item 63 is not long enough to
fill the first page of the file, so a memset is required to fill the
space between item 63 (ending at 4085) and item 64 (beginning at 4096)
with zero.
Here is a reproducer from Liu Bo, which demonstrates another method
of creating the same inline extent and hole pattern:
Using 'page_poison=on' kernel command line (or enable
CONFIG_PAGE_POISONING) run the following:
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Chris Mason <clm@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Liu Bo [Tue, 7 Mar 2017 02:20:56 +0000 (18:20 -0800)]
Btrfs: fix regression in lock_delalloc_pages
The bug is a regression after commit
(da2c7009f6ca "btrfs: teach __process_pages_contig about PAGE_LOCK operation")
and commit
(76c0021db8fd "Btrfs: use helper to simplify lock/unlock pages").
So if the dirty pages which are under writeback got truncated partially
before we lock the dirty pages, we couldn't find all pages mapping to the
delalloc range, and the bug didn't return an error so it kept going on and
found that the delalloc range got truncated and got to unlock the dirty
pages, and then the ASSERT could caught the error, and showed
This fixes the bug by returning the proper -EAGAIN.
Cc: David Sterba <dsterba@suse.com> Reported-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
Dmitry V. Levin [Tue, 28 Feb 2017 23:12:50 +0000 (02:12 +0300)]
btrfs: remove btrfs_err_str function from uapi/linux/btrfs.h
btrfs_err_str function is not called from anywhere and is replicated
in the userspace headers for btrfs-progs.
It's removal also fixes the following linux/btrfs.h userspace
compilation error:
/usr/include/linux/btrfs.h: In function 'btrfs_err_str':
/usr/include/linux/btrfs.h:740:11: error: 'NULL' undeclared (first use in this function)
return NULL;
Suggested-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Dmitry V. Levin <ldv@altlinux.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 17 Feb 2017 15:24:29 +0000 (16:24 +0100)]
btrfs: add dummy callback for readpage_io_failed and drop checks
Make extent_io_ops::readpage_io_failed_hook callback mandatory and
define a dummy function for btrfs_extent_io_ops. As the failed IO
callback is not performance critical, the branch vs extra trade off does
not hurt.
David Sterba [Fri, 17 Feb 2017 14:27:44 +0000 (15:27 +0100)]
btrfs: document existence of extent_io ops callbacks
Some of the callbacks defined in btree_extent_io_ops and
btrfs_extent_io_ops do always exist so we don't need to check the
existence before each call. This patch just reorders the definition and
documents which are mandatory/optional.
David Sterba [Fri, 17 Feb 2017 18:42:43 +0000 (19:42 +0100)]
btrfs: do proper error handling in btrfs_insert_xattr_item
The space check in btrfs_insert_xattr_item is duplicated in it's caller
(do_setxattr) so we won't hit the BUG_ON. Continuing without any check
could be disasterous so turn it to a proper error handling.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 14 Feb 2017 18:45:05 +0000 (19:45 +0100)]
btrfs: derive maximum output size in the compression implementation
The value of max_out can be calculated from the parameters passed to the
compressors, which is number of pages and the page size, and we don't
have to needlessly pass it around.
David Sterba [Tue, 14 Feb 2017 18:04:07 +0000 (19:04 +0100)]
btrfs: merge nr_pages input and output parameter in compress_pages
The parameter saying how many pages can be allocated at maximum can be
merged with the output page counter, to save some stack space. The
compression implementation will sink the parameter to a local variable
so everything works as before.
The nr_pages variables can also be simply merged in compress_file_range
into one.
David Sterba [Tue, 14 Feb 2017 18:04:07 +0000 (19:04 +0100)]
btrfs: merge length input and output parameter in compress_pages
The length parameter is basically duplicated for input and output in the
top level caller of the compress_pages chain. We can simply use one
variable for that and reduce stack consumption. The compression
implementation will sink the parameter to a local variable so everything
works as before.
Nikolay Borisov [Mon, 20 Feb 2017 11:51:06 +0000 (13:51 +0200)]
btrfs: Make get_extent_t take btrfs_inode
In addition to changing the signature, this patch also switches
all the functions which are used as an argument to also take btrfs_inode.
Namely those are: btrfs_get_extent and btrfs_get_extent_filemap.
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 17 Feb 2017 18:43:57 +0000 (18:43 +0000)]
Btrfs: try harder to migrate items to left sibling before splitting a leaf
Before attempting to split a leaf we try to migrate items from the leaf to
its right and left siblings. We start by trying to move items into the
rigth sibling and, if the new item is meant to be inserted at the end of
our leaf, we try to free from our leaf an amount of bytes equal to the
number of bytes used by the new item, by setting the variable space_needed
to the byte size of that new item. However if we fail to move enough items
to the right sibling due to lack of space in that sibling, we then try
to move items into the left sibling, and in that case we try to free
an amount equal to the size of the new item from our leaf, when we need
only to free an amount corresponding to the size of the new item minus
the current free space of our leaf. So make sure that before we try to
move items to the left sibling we do set the variable space_needed with
a value corresponding to the new item's size minus the leaf's current
free space.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Filipe Manana [Tue, 14 Feb 2017 16:56:01 +0000 (16:56 +0000)]
Btrfs: fix data loss after truncate when using the no-holes feature
If we have a file with an implicit hole (NO_HOLES feature enabled) that
has an extent following the hole, delayed writes against regions of the
file behind the hole happened before but were not yet flushed and then
we truncate the file to a smaller size that lies inside the hole, we
end up persisting a wrong disk_i_size value for our inode that leads to
data loss after umounting and mounting again the filesystem or after
the inode is evicted and loaded again.
This happens because at inode.c:btrfs_truncate_inode_items() we end up
setting last_size to the offset of the extent that we deleted and that
followed the hole. We then pass that value to btrfs_ordered_update_i_size()
which updates the inode's disk_i_size to a value smaller then the offset
of the buffered (delayed) writes.
Cc: <stable@vger.kernel.org> # 3.14+ Fixes: 16e7549f045d ("Btrfs: incompatible format change to remove hole extents") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Filipe Manana [Tue, 14 Feb 2017 17:56:32 +0000 (17:56 +0000)]
Btrfs: incremental send, fix unnecessary hole writes for sparse files
When using the NO_HOLES feature, during an incremental send we often issue
write operations for holes when we should not, because that range is already
a hole in the destination snapshot. While that does not change the contents
of the file at the receiver, it avoids preservation of file holes, leading
to wasted disk space and extra IO during send/receive.
A couple examples where the holes are not preserved follows.
# Now add one new extent to our first test file, increasing its size and
# leaving a 1Mb hole between the first extent and this new extent.
$ xfs_io -c "pwrite -S 0xbb 1028K 4K" /mnt/foo
# Now overwrite the last extent of our second test file.
$ xfs_io -c "pwrite -S 0xcc 1028K 4K" /mnt/bar
The holes do not exist in the second filesystem and they were replaced
with extents filled with the byte 0x00, making each file take 1032Kb of
space instead of 8Kb.
So fix this by not issuing the write operations consisting of buffers
filled with the byte 0x00 when the destination snapshot already has a
hole for the respective range.
Filipe Manana [Sat, 4 Feb 2017 17:12:00 +0000 (17:12 +0000)]
Btrfs: fix use-after-free due to wrong order of destroying work queues
Before we destroy all work queues (and wait for their tasks to complete)
we were destroying the work queues used for metadata I/O operations, which
can result in a use-after-free problem because most tasks from all work
queues do metadata I/O operations. For example, the tasks from the caching
workers work queue (fs_info->caching_workers), which is destroyed only
after the work queue used for metadata reads (fs_info->endio_meta_workers)
is destroyed, do metadata reads, which result in attempts to queue tasks
into the later work queue, triggering a use-after-free with a trace like
the following:
btrfs_search_slot()
read_block_for_search()
readahead_tree_block()
read_extent_buffer_pages()
submit_one_bio()
btree_submit_bio_hook()
btrfs_bio_wq_end_io()
--> sets the bio's
bi_end_io callback
to end_workqueue_bio()
--> bio is submitted
bio completes
and its bi_end_io callback
is invoked
--> end_workqueue_bio()
--> attempts to queue
a task on fs_info->endio_meta_workers
Filipe Manana [Wed, 1 Feb 2017 22:39:50 +0000 (22:39 +0000)]
Btrfs: fix assertion failure when freeing block groups at close_ctree()
At close_ctree() we free the block groups and then only after we wait for
any running worker kthreads to finish and shutdown the workqueues. This
behaviour is racy and it triggers an assertion failure when freeing block
groups because while we are doing it we can have for example a block group
caching kthread running, and in that case the block group's reference
count can still be greater than 1 by the time we assert its reference count
is 1, leading to an assertion failure:
Filipe Manana [Wed, 1 Feb 2017 14:58:02 +0000 (14:58 +0000)]
Btrfs: do not create explicit holes when replaying log tree if NO_HOLES enabled
We log holes explicitly by using file extent items, however when replaying
a log tree, if a logged file extent item corresponds to a hole and the
NO_HOLES feature is enabled we do not need to copy the file extent item
into the fs/subvolume tree, as the absence of such file extent items is
the purpose of the NO_HOLES feature. So skip the copying of file extent
items representing holes when the NO_HOLES feature is enabled.
Robbie Ko [Fri, 7 Oct 2016 02:01:29 +0000 (10:01 +0800)]
Btrfs: fix leak of subvolume writers counter
When falling back from a nocow write to a regular cow write, we were
leaking the subvolume writers counter in 2 situations, preventing
snapshot creation from ever completing in the future, as it waits
for that counter to go down to zero before the snapshot creation
starts.
Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com>
[Improved changelog and subject] Signed-off-by: Filipe Manana <fdmanana@suse.com>
Filipe Manana [Sat, 28 Jan 2017 01:47:56 +0000 (01:47 +0000)]
Btrfs: bulk delete checksum items in the same leaf
Very often we have the checksums for an extent spread in multiple items
in the checksums tree, and currently the algorithm to delete them starts
by looking for them one by one and then deleting them one by one, which
is not optimal since each deletion involves shifting all the other items
in the leaf and when the leaf reaches some low threshold, to move items
off the leaf into its left and right neighbor leafs. Also, after each
item deletion we release our search path and start a new search for other
checksums items.
So optimize this by deleting in bulk all the items in the same leaf that
contain checksums for the extent being freed.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Robbie Ko [Thu, 5 Jan 2017 08:24:58 +0000 (16:24 +0800)]
Btrfs: incremental send, do not issue invalid rmdir operations
When both the parent and send snapshots have a directory inode with the
same number but different generations (therefore they are different
inodes) and both have an entry with the same name, an incremental send
stream will contain an invalid rmdir operation that refers to the
orphanized name of the inode from the parent snapshot.
The following example scenario shows how this happens.
Parent snapshot:
.
|---- d259_old/ (ino 259, gen 9)
| |---- d1/ (ino 258, gen 9)
|
|---- f (ino 257, gen 9)
Send snapshot:
.
|---- d258/ (ino 258, gen 7)
|---- d259/ (ino 259, gen 7)
|---- d1/ (ino 257, gen 7)
When the kernel is processing inode 258 it notices that in both snapshots
there is an inode numbered 259 that is a parent of an inode 258. However
it ignores the fact that the inodes numbered 259 have different generations
in both snapshots, which means they are effectively different inodes.
Then it checks that both inodes 259 have a dentry named "d1" and because
of that it issues a rmdir operation with orphanized name of the inode 258
from the parent snapshot. This happens at send.c:process_record_refs(),
which calls send.c:did_overwrite_first_ref() that returns true and because
of that later on at process_recorded_refs() such rmdir operation is issued
because the inode being currently processed (258) is a directory and it
was deleted in the send snapshot (and replaced with another inode that has
the same number and is a directory too).
Fix this issue by comparing the generations of parent directory inodes
that have the same number and make send.c:did_overwrite_first_ref() when
the generations are different.
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ mkdir /mnt/d1
$ mkdir /mnt/dir258
$ mkdir /mnt/dir259
$ mv /mnt/d1 /mnt/dir259/d1
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs receive /mnt/ -f /tmp/1.snap
# Take note that once the filesystem is created, its current
# generation has value 7 so the inodes from the second snapshot all have
# a generation value of 7. And after receiving the first snapshot
# the filesystem is at a generation value of 10, because the call to
# create the second snapshot bumps the generation to 8 (the snapshot
# creation ioctl does a transaction commit), the receive command calls
# the snapshot creation ioctl to create the first snapshot, which bumps
# the filesystem's generation to 9, and finally when the receive
# operation finishes it calls an ioctl to transition the first snapshot
# (snap1) from RW mode to RO mode, which does another transaction commit
# and bumps the filesystem's generation to 10. This means all the inodes
# in the first snapshot (snap1) have a generation value of 9.
$ rm -f /tmp/1.snap
$ btrfs send /mnt/snap1 -f /tmp/1.snap
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap
$ umount /mnt
Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com>
[Rewrote changelog to be more precise and clear] Signed-off-by: Filipe Manana <fdmanana@suse.com>
Filipe Manana [Wed, 11 Jan 2017 02:15:39 +0000 (02:15 +0000)]
Btrfs: incremental send, do not delay rename when parent inode is new
When we are checking if we need to delay the rename operation for an
inode we not checking if a parent inode that exists in the send and
parent snapshots is really the same inode or not, that is, we are not
comparing the generation number of the parent inode in the send and
parent snapshots. Not only this results in unnecessarily delaying a
rename operation but also can later on make us generate an incorrect
name for a new inode in the send snapshot that has the same number
as another inode in the parent snapshot but a different generation.
Here follows an example where this happens.
Parent snapshot:
. (ino 256, gen 3)
|--- dir258/ (ino 258, gen 7)
| |--- dir257/ (ino 257, gen 7)
|
|--- dir259/ (ino 259, gen 7)
Send snapshot:
. (ino 256, gen 3)
|--- file258 (ino 258, gen 10)
|
|--- new_dir259/ (ino 259, gen 10)
|--- dir257/ (ino 257, gen 7)
The following steps happen when computing the incremental send stream:
1) When processing inode 257, its new parent is created using its orphan
name (o257-21-0), and the rename operation for inode 257 is delayed
because its new parent (inode 259) was not yet processed - this
decision to delay the rename operation does not make much sense
because the inode 259 in the send snapshot is a new inode, it's not
the same as inode 259 in the parent snapshot.
2) When processing inode 258 we end up delaying its rmdir operation,
because inode 257 was not yet renamed (moved away from the directory
inode 258 represents). We also create the new inode 258 using its
orphan name "o258-10-0", then rename it to its final name of "file258"
and then issue a truncate operation for it. However this truncate
operation contains an incorrect name, which corresponds to the orphan
name and not to the final name, which makes the receiver fail. This
happens because when we attempt to compute the inode's current name
we verify that there's another inode with the same number (258) that
has its rmdir operation pending and because of that we generate an
orphan name for the new inode 258 (we do this in the function
get_cur_path()).
Fix this by not delayed the rename operation of an inode if it has parents
with the same number but different generations in both snapshots.
The following steps reproduce this example scenario.
# Remount the filesystem so that the next created inodes will have the
# numbers 258 and 259. This is because when a filesystem is mounted,
# btrfs sets the subvolume's inode counter to a value corresponding to
# the highest inode number in the subvolume plus 1. This inode counter
# is used to assign a unique number to each new inode and it's
# incremented by 1 after very inode creation.
# Note: we unmount and then mount instead of doing a mount with
# "-o remount" because otherwise the inode counter remains at value 260.
$ umount /mnt
$ mount /dev/sdb /mnt
$ touch /mnt/file258
$ mkdir /mnt/new_dir259
$ mv /mnt/dir257 /mnt/new_dir259/dir257
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
Robbie Ko [Thu, 5 Jan 2017 08:24:55 +0000 (16:24 +0800)]
Btrfs: send, fix failure to rename top level inode due to name collision
Under certain situations, an incremental send operation can fail due to a
premature attempt to create a new top level inode (a direct child of the
subvolume/snapshot root) whose name collides with another inode that was
removed from the send snapshot.
Consider the following example scenario.
Parent snapshot:
. (ino 256, gen 8)
|---- a1/ (ino 257, gen 9)
|---- a2/ (ino 258, gen 9)
Send snapshot:
. (ino 256, gen 3)
|---- a2/ (ino 257, gen 7)
In this scenario, when receiving the incremental send stream, the btrfs
receive command fails like this (ran in verbose mode, -vv argument):
rmdir a1
mkfile o257-7-0
rename o257-7-0 -> a2
ERROR: rename o257-7-0 -> a2 failed: Is a directory
What happens when computing the incremental send stream is:
1) An operation to remove the directory with inode number 257 and
generation 9 is issued.
2) An operation to create the inode with number 257 and generation 7 is
issued. This creates the inode with an orphanized name of "o257-7-0".
3) An operation rename the new inode 257 to its final name, "a2", is
issued. This is incorrect because inode 258, which has the same name
and it's a child of the same parent (root inode 256), was not yet
processed and therefore no rmdir operation for it was yet issued.
The rename operation is issued because we fail to detect that the
name of the new inode 257 collides with inode 258, because their
parent, a subvolume/snapshot root (inode 256) has a different
generation in both snapshots.
So fix this by ignoring the generation value of a parent directory that
matches a root inode (number 256) when we are checking if the name of the
inode currently being processed collides with the name of some other
inode that was not yet processed.
We can achieve this scenario of different inodes with the same number but
different generation values either by mounting a filesystem with the inode
cache option (-o inode_cache) or by creating and sending snapshots across
different filesystems, like in the following example:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ touch /mnt/a2
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs receive /mnt -f /tmp/1.snap
# Take note that once the filesystem is created, its current
# generation has value 7 so the inode from the second snapshot has
# a generation value of 7. And after receiving the first snapshot
# the filesystem is at a generation value of 10, because the call to
# create the second snapshot bumps the generation to 8 (the snapshot
# creation ioctl does a transaction commit), the receive command calls
# the snapshot creation ioctl to create the first snapshot, which bumps
# the filesystem's generation to 9, and finally when the receive
# operation finishes it calls an ioctl to transition the first snapshot
# (snap1) from RW mode to RO mode, which does another transaction commit
# and bumps the filesystem's generation to 10.
$ rm -f /tmp/1.snap
$ btrfs send /mnt/snap1 -f /tmp/1.snap
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap
$ umount /mnt
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt
$ btrfs receive /mnt /tmp/1.snap
# Receive of snapshot snap2 used to fail.
$ btrfs receive /mnt /tmp/2.snap
Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com>
[Rewrote changelog to be more precise and clear] Signed-off-by: Filipe Manana <fdmanana@suse.com>
Liu Bo [Tue, 21 Feb 2017 20:12:58 +0000 (12:12 -0800)]
Btrfs: use the correct type when creating cow dio extent
'BTRFS_ORDERED_REGULAR' was introduced for the cow case in patch
'Btrfs: specify a new ordered extent type for create_io_em',
but it missed the directIO cow case.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Tue, 21 Feb 2017 17:14:52 +0000 (17:14 +0000)]
Btrfs: fix deadlock between dedup on same file and starting writeback
If we are deduping two ranges of the same file we need to make sure that
we lock all pages in ascending order, that is, lock first the pages from
the range with lower offset and then the pages from the other range, as
otherwise we can deadlock with a concurrent task that is starting delalloc
(writeback). Example trace:
btrfs_dedupe_file_range()
--> using same inode as source
and target
--> src range is [768K, 1Mb[
--> dst range is [0, 256K[
btrfs_cmp_data_prepare()
--> calls gather_extent_pages()
for range [768K, 1Mb[ and
locks all pages in that range
do_writepages()
btrfs_writepages()
extent_writepages()
extent_write_cache_pages()
__extent_writepage()
writepage_delalloc()
find_lock_delalloc_range()
--> finds range [0, 1Mb[
lock_delalloc_pages()
--> locks all pages in the
range [0, 768K[
--> tries to lock page at
offset 768K
--> deadlock
--> calls gather_extent_pages()
to lock pages in the range
[0, 256K[
--> deadlock, task at CPU 1
already locked that
range and it's trying
to lock the range we
locked previously
So fix this by making sure that during a dedup we always lock first the
pages from the range with lower offset.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
Jeff Mahoney [Wed, 15 Feb 2017 21:28:34 +0000 (16:28 -0500)]
btrfs: use btrfs_debug instead of pr_debug in transaction abort
Commit e5d6b12fe14 (Btrfs: don't WARN() in btrfs_transaction_abort() for
IO errors) added a pr_debug call to be printed when a transaction is
aborted with -EIO instead of WARN. btrfs_debug prints which file system
the message is associated with so let's use that instead.
Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_truncate_free_space_cache always allocates a btrfs_path structure
but only uses it when the caller passes a block group. Let's move the
allocation and free into the conditional.
Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Jeff Mahoney [Wed, 15 Feb 2017 21:28:29 +0000 (16:28 -0500)]
btrfs: convert btrfs_inc_block_group_ro to accept fs_info
btrfs_inc_block_group_ro is either passed the extent root or the dev
root, but it doesn't do anything with the dev tree. Let's convert
to passing an fs_info and using the extent root.
Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Jeff Mahoney [Wed, 15 Feb 2017 21:28:27 +0000 (16:28 -0500)]
btrfs: pass fs_info to (more) routines that are only called with extent_root
Outside of interactions with qgroups, the roots passed in extent-tree.c
are usually passed to ensure that we don't do refcounts on log trees or
to get the allocation profile for an allocation request. Otherwise, it
operates on the extent root. This patch converts some more routines in
extent-tree.c that are always called with the extent root to accept
an fs_info instead.
Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Wed, 15 Feb 2017 02:43:03 +0000 (10:43 +0800)]
btrfs: qgroup: Move half of the qgroup accounting time out of commit trans
Just as Filipe pointed out, the most time consuming parts of qgroup are
btrfs_qgroup_account_extents() and
btrfs_qgroup_prepare_account_extents().
Which both call btrfs_find_all_roots() to get old_roots and new_roots
ulist.
What makes things worse is, we're calling that expensive
btrfs_find_all_roots() at transaction committing time with
TRANS_STATE_COMMIT_DOING, which will blocks all incoming transaction.
Such behavior is necessary for @new_roots search as current
btrfs_find_all_roots() can't do it correctly so we do call it just
before switch commit roots.
However for @old_roots search, it's not necessary as such search is
based on commit_root, so it will always be correct and we can move it
out of transaction committing.
This patch moves the @old_roots search part out of
commit_transaction(), so in theory we can half the time qgroup time
consumption at commit_transaction().
But please note that, this won't speedup qgroup overall, the total time
consumption is still the same, just reduce the performance stall.
Cc: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>