Colin Ian King [Wed, 21 Feb 2024 11:52:03 +0000 (11:52 +0000)]
bcachefs: remove redundant assignment to variable ret
Variable ret is being assigned a value that is never read, it is
being re-assigned a couple of statements later on. The assignment
is redundant and can be removed.
Cleans up clang scan build warning:
fs/bcachefs/super-io.c:806:2: warning: Value stored to 'ret' is
never read [deadcode.DeadStores]
Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Calvin Owens [Mon, 19 Feb 2024 07:36:08 +0000 (23:36 -0800)]
bcachefs: Silence gcc warnings about arm arch ABI drift
32-bit arm builds emit a lot of spam like this:
fs/bcachefs/backpointers.c: In function ‘extent_matches_bp’:
fs/bcachefs/backpointers.c:15:13: note: parameter passing for argument of type ‘struct bch_backpointer’ changed in GCC 9.1
Apply the change from commit ebcc5928c5d9 ("arm64: Silence gcc warnings
about arch ABI drift") to fs/bcachefs/ to silence them.
Signed-off-by: Calvin Owens <jcalvinowens@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 16 Feb 2024 06:08:25 +0000 (01:08 -0500)]
bcachefs: Split out discard fastpath
Buckets usually can't be discarded until the transaction that made them
empty has been committed in the journal.
Tracing has indicated that we're queuing the discard worker excessively,
only for it to skip over many buckets that are still waiting on a
journal commit, discarding only one or two buckets per iteration.
We want to switch to only queuing the discard worker after a journal
flush write, but there's an important optimization we need to preserve:
if a bucket becomes empty and it was never committed in the journal
while it was in use, we want to discard it and reuse it right away -
since overwriting it before the previous writes are flushed from the
device cache eans those writes only cost bus bandwidth.
So, this patch implements a fast path for buckets that can be discarded
right away. We need new locking between the two discard workers; the new
list of buckets being discarded provides that locking.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 17 Feb 2024 04:50:05 +0000 (23:50 -0500)]
bcachefs: Drop redundant btree_path_downgrade()s
If a path doesn't have any active references, we shouldn't downgrade it;
it'll either be reused, possibly with intent refs again, or dropped at
bch2_trans_begin() time.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 16 Feb 2024 04:59:05 +0000 (23:59 -0500)]
bcachefs: check_path() now only needs to walk up to subvolume root
Now that checking subvolume structure is a separate pass, the main
check_directory_connectivity() pass only needs to walk up to a given
inode's subvolume root.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bcachefs: omit alignment attribute on big endian struct bkey
This is needed for building Rust bindings on big endian architectures
like s390x. Currently this is only done in userspace, but it might
happen in-kernel in the future. When creating a Rust binding for struct
bkey, the "packed" attribute is needed to get a type with the correct
member offsets in the big endian case. However, rustc does not allow
types to have both a "packed" and "align" attribute. Thus, in order to
get a Rust type compatible with the C type, we must omit the "aligned"
attribute in C.
This does not affect the struct's size or member offsets, only its
toplevel alignment, which should be an acceptable impact.
The little endian version can have the "align" attribute because the
"packed" attr is redundant, and rust-bindgen will omit the "packed" attr
when an "align" attr is present and it can do so without changing a
type's layout
Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 16 Feb 2024 02:42:10 +0000 (21:42 -0500)]
bcachefs: bch2_trigger_alloc() handles state changes better
bch2_trigger_alloc() kicks off certain tasks on bucket state changes;
e.g. triggering the bucket discard worker and the invalidate worker.
We've observed the discard worker running too often - most runs it
doesn't do any work, according to the tracepoint - so clearly, we're
kicking it off too often.
This adds an explicit statechange() macro to make these checks more
precise.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 12 Feb 2024 20:17:14 +0000 (15:17 -0500)]
bcachefs: Use kvzalloc() when dynamically allocating btree paths
THis silences a mm/page_alloc.c warning about allocating more than a
page with GFP_NOFAIL - and there's no reason for this to not have a
vmalloc fallback anyways.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 10 Feb 2024 01:15:03 +0000 (20:15 -0500)]
bcachefs: Save key_cache_path in peek_slot()
When bch2_btree_iter_peek_slot() clones the iterator to search for the
next key, and then discovers that the key from the cloned iterator is
the key we want to return - we also want to save the
iter->key_cache_path as well, for the update path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 23 Jan 2024 05:01:07 +0000 (00:01 -0500)]
bcachefs: Pin btree cache in ram for random access in fsck
Various phases of fsck involve checking references from one btree to
another: this means doing a sequential scan of one btree, and then
mostly random access into the second.
This is particularly painful for checking extents <-> backpointers; we
can prefetch btree node access on the sequential scan, but not on the
random access portion, and this is particularly painful on spinning
rust, where we'd like to keep the pipeline fairly full of btree node
reads so that the elevator can reduce seeking.
This patch implements prefetching and pinning of the portion of the
btree that we'll be doing random access to. We already calculate how
much of the random access btree will fit in memory so it's a fairly
straightforward change.
This will put more pressure on system memory usage, so we introduce a
new option, fsck_memory_usage_percent, which is the percentage of total
system ram that fsck is allowed to pin.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 9 Feb 2024 21:04:50 +0000 (16:04 -0500)]
bcachefs: Correctly reattach subvolumes
Subvolumes need special handling to reattach - we always reattach them
in the root subvolume's lost+found, and they need a slightly different
kind of dirent.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 7 Feb 2024 05:45:09 +0000 (00:45 -0500)]
bcachefs: check dirent->d_parent_subvol
Check that d_parent_subvol makes sense - the dirent's snapshot must be
visible in d_parent_subvol (i.e. an ancestor of d_parent_subvol's
snapshot) in order to be visible.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 7 Feb 2024 04:39:08 +0000 (23:39 -0500)]
bcachefs: check_inode_dirent_inode()
check that if an inode has a backpointer, the dirent it points to points
back to it.
We do this in check_dirent_inode_dirent(), but only for inodes that have
dirents that point to them - we also have to do the check starting from
the inode to catch inodes that don't have dirents that point to them.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 6 Feb 2024 22:24:18 +0000 (17:24 -0500)]
bcachefs: Kill more -EIO error codes
This converts -EIOs related to btree node errors to private error codes,
which will help with some ongoing debugging by giving us better error
messages.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Darrick J. Wong [Wed, 7 Feb 2024 19:39:03 +0000 (11:39 -0800)]
bcachefs: thread_with_file: fix various printf problems
Experimentally fix some problems with stdio_redirect_vprintf by creating
a MOO variant with which we can experiment. We can't do a GFP_KERNEL
allocation while holding the spinlock, and I don't like how the printf
function can silently truncate the output if memory allocation fails.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Darrick J. Wong [Wed, 7 Feb 2024 19:43:32 +0000 (11:43 -0800)]
bcachefs: thread_with_file: allow creation of readonly files
Create a new run_thread_with_stdout function that opens a file in
O_RDONLY mode so that the kernel can write things to userspace but
userspace cannot write to the kernel. This will be used to convey xfs
health event information to userspace.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 5 Feb 2024 03:20:40 +0000 (22:20 -0500)]
bcachefs: thread_with_stdio: convert to darray
- eliminate the dependency on printbufs, so that we can lift
thread_with_file for use in xfs
- add a nonblocking parameter to stdio_redirect_printf(), and either
block if the buffer is full or drop it on the floor - don't buffer
infinitely
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The output buffer lock has to be a spinlock so that we can write to it
from interrupt context, so we can't use a direct copy_to_user; this
switches thread_with_file_read() to use fault_in_writeable() and
copy_to_user_nofault(), similar to how thread_with_file_write() works.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 1 Feb 2024 11:28:41 +0000 (06:28 -0500)]
mempool: kvmalloc pool
Add mempool_init_kvmalloc_pool() and mempool_create_kvmalloc_pool(),
which wrap kvmalloc() instead of kmalloc() - kmalloc() with a vmalloc()
fallback.
This is part of a bcachefs cleanup - dropping an internal kvpmalloc()
helper (which predates kvmalloc()) along with mempool helpers; this
replaces the bcachefs-private kvpmalloc_pool.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Cc: linux-mm@kvack.org
Kent Overstreet [Thu, 25 Jan 2024 17:36:37 +0000 (12:36 -0500)]
bcachefs: bch2_lookup() gives better error message on inode not found
When a dirent points to a missing inode, we really should print out the
dirent.
This requires quite a bit of refactoring, but there's some other
benefits: we now do the entire looup (dirent and inode) in a single
btree transaction, and copy to the VFS inode with btree locks still
held, like the create path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Darrick J. Wong <djwong@kernel.org> Cc: linux-mm@kvack.org Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 26 Jan 2024 00:00:24 +0000 (19:00 -0500)]
mm: introduce memalloc_flags_{save,restore}
Our proliferation of memalloc_*_{save,restore} APIs is getting a bit
silly, this adds a generic version and converts the existing
save/restore functions to wrappers.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Darrick J. Wong <djwong@kernel.org> Cc: linux-mm@kvack.org Acked-by: Vlastimil Babka <vbabka@suse.cz>
Kent Overstreet [Tue, 6 Feb 2024 02:44:23 +0000 (21:44 -0500)]
bcachefs: Kill some -EINVALs
Repurposing standard error codes in bcachefs code is banned in new code,
and we need to get rid of the remaining ones - private error codes give
us much better error messages.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 6 Feb 2024 00:28:03 +0000 (19:28 -0500)]
bcachefs: bump max_active on btree_interior_update_worker
WQ_UNBOUND with max_active 1 means ordered workqueue, but we don't
actually need or want ordered semantics - and probably want a higher
concurrency limit anyways.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 21 Jan 2024 21:46:45 +0000 (16:46 -0500)]
bcachefs: Subvolumes may now be renamed
Files within a subvolume cannot be renamed into another subvolume, but
subvolumes themselves were intended to be.
This implements subvolume renaming - we need to ensure that there's only
a single dirent that points to a subvolume key (not multiple versions in
different snapshots), and we need to ensure that dirent.d_parent_subol
and inode.bi_parent_subvol are updated.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 31 Jan 2024 19:26:15 +0000 (14:26 -0500)]
bcachefs: better journal pipelining
Recently a severe performance regression was discovered, which bisected
to
a6548c8b5eb5 bcachefs: Avoid flushing the journal in the discard path
It turns out the old behaviour, which issued excessive journal flushes,
worked around a performance issue where queueing delays would cause the
journal to not be able to write quickly enough and stall.
The journal flushes masked the issue because they periodically flushed
the device write cache, reducing write latency for non flushes.
This patch reworks the journalling code to allow more than one
(non-flush) write to be in flight at a time. With this patch, doing 4k
random writes and an iodepth of 128, we are now able to hit 560k iops to
a Samsung 970 EVO Plus - previously, we were stuck in the ~200k range.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 23 Jan 2024 01:55:08 +0000 (20:55 -0500)]
bcachefs: Workqueues should be WQ_HIGHPRI
Most bcachefs workqueues are used for completions, and should be
WQ_HIGHPRI - this helps reduce queuing delays, we want to complete
quickly once we can no longer signal backpressure by blocking.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 31 Jan 2024 16:28:13 +0000 (11:28 -0500)]
bcachefs: Avoid taking journal lock unnecessarily
Previously, any time we failed to get a journal reservation we'd retry,
with the journal lock held; but this isn't necessary given
wait_event()/wake_up() ordering.
This avoids performance cliffs when the journal starts to get backed up
and lock contention shoots up.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 31 Jan 2024 16:21:46 +0000 (11:21 -0500)]
bcachefs: Split out journal workqueue
We don't want journal write completions to be blocked behind btree
transactions - io_complete_wq is used for btree updates after data and
metadata writes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Guoyu Ou [Tue, 13 Feb 2024 08:20:04 +0000 (16:20 +0800)]
bcachefs: skip invisible entries in empty subvolume checking
When we are checking whether a subvolume is empty in the specified snapshot,
entries that do not belong to this subvolume should be skipped.
This fixes the following case:
$ bcachefs subvolume create ./sub
$ cd sub
$ bcachefs subvolume create ./sub2
$ bcachefs subvolume snapshot . ./snap
$ ls -a snap
. ..
$ rmdir snap
rmdir: failed to remove 'snap': Directory not empty
As Kent suggested, we pass 0 in may_delete_deleted_inode() to ignore subvols
in the subvol we are checking, because inode.bi_subvol is only set on
subvolume roots, and we can't go through every inode in the subvolume and
change bi_subvol when taking a snapshot. It makes the check less strict, but
that's ok, the rest of fsck will still catch it.
Signed-off-by: Guoyu Ou <benogy@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 24 Jan 2024 21:32:12 +0000 (16:32 -0500)]
bcachefs: Set path->uptodate when no node at level
We were failing to set path->uptodate when reaching the end of a btree
node iterator, causing the new prefetch code for backpointers gc to go
into an infinite loop.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 8 Mar 2024 19:53:03 +0000 (14:53 -0500)]
bcachefs: Correctly validate k->u64s in btree node read path
validate_bset_keys() never properly validated k->u64s; it checked if it
was 0, but not if it was smaller than keys for the given packed format;
this fixes that small oversight.
This patch was backported, so it's adding quite a few error enums so
that they don't get renumbered and we don't have confusing gaps.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 9 Mar 2024 00:57:22 +0000 (19:57 -0500)]
bcachefs: Fix journal replay with unreadable btree roots
When a btree root is unreadable, we still might be able to get some data
back by replaying what's in the journal. Previously though, we got
confused when journal replay would attempt to replay a key for a level
that didn't exist.
This adds bch2_btree_increase_depth(), so that journal replay can handle
this.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 8 Mar 2024 21:03:19 +0000 (16:03 -0500)]
bcachefs: no_splitbrain_check option
This adds an option to disable kicking out devices when splitbrain is
detected - it seems there's some issues with splitbrain detection and
we're kicking out devices erronously.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 8 Mar 2024 20:25:27 +0000 (15:25 -0500)]
bcachefs: extent_entry_next_safe()
We need to be able to iterate over extent ptrs that may be corrupted in
order to print them - this fixes a bug where we'd pop an assert in
bch2_bkey_durability_safe().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 7 Mar 2024 17:30:49 +0000 (12:30 -0500)]
bcachefs: journal_seq_blacklist_add() now handles entries being added out of order
bch2_journal_seq_blacklist_add() was bugged when the new entry
overlapped with multiple existing entries, and it also assumed new
entries are being added in increasing order.
This is true on any sane filesystem, but when trying to recover from
very badly mangled filesystems we might end up with the journal sequence
number rewinding vs. what the blacklist list knows about - easiest to
just handle that here.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Li Zetao [Mon, 4 Mar 2024 03:22:03 +0000 (11:22 +0800)]
bcachefs: Fix null-ptr-deref in bch2_fs_alloc()
There is a null-ptr-deref issue reported by kasan:
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
Call Trace:
<TASK>
bch2_fs_alloc+0x1092/0x2170 [bcachefs]
bch2_fs_open+0x683/0xe10 [bcachefs]
...
When initializing the name of bch_fs, it needs to dynamically alloc memory
to meet the length of the name. However, when name allocation failed, it
will cause a null-ptr-deref access exception in subsequent string copy.
Fix this issue by checking if name allocation is successful.
Fixes: 401ec4db6308 ("bcachefs: Printbuf rework") Signed-off-by: Li Zetao <lizetao1@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Linus Torvalds [Sun, 25 Feb 2024 23:31:57 +0000 (15:31 -0800)]
Merge tag 'bcachefs-2024-02-25' of https://evilpiepirate.org/git/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"Some more mostly boring fixes, but some not
User reported ones:
- the BTREE_ITER_FILTER_SNAPSHOTS one fixes a really nasty
performance bug; user reported an untar initially taking two
seconds and then ~2 minutes
- kill a __GFP_NOFAIL in the buffered read path; this was a leftover
from the trickier fix to kill __GFP_NOFAIL in readahead, where we
can't return errors (and have to silently truncate the read
ourselves).
bcachefs can't use GFP_NOFAIL for folio state unlike iomap based
filesystems because our folio state is just barely too big, 2MB
hugepages cause us to exceed the 2 page threshhold for GFP_NOFAIL.
additionally, the flags argument was just buggy, we weren't
supplying GFP_KERNEL previously (!)"
* tag 'bcachefs-2024-02-25' of https://evilpiepirate.org/git/bcachefs:
bcachefs: fix bch2_save_backtrace()
bcachefs: Fix check_snapshot() memcpy
bcachefs: Fix bch2_journal_flush_device_pins()
bcachefs: fix iov_iter count underflow on sub-block dio read
bcachefs: Fix BTREE_ITER_FILTER_SNAPSHOTS on inodes btree
bcachefs: Kill __GFP_NOFAIL in buffered read path
bcachefs: fix backpointer_to_text() when dev does not exist