git.proxmox.com Git - mirror_ubuntu-kernels.git/log
7 months ago bcachefs: If we run merges at a lower watermark, they must be nonblocking
Kent Overstreet [Mon, 22 Apr 2024 03:32:18 +0000 (23:32 -0400)]
bcachefs: If we run merges at a lower watermark, they must be nonblocking

Fix another deadlock related to the merge path; previously, we switched
to always running merges at a lower watermark (because they are
noncritical); but when we run at a lower watermark we also need to run
nonblocking or we've introduced a new deadlock.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reported-and-tested-by: s@m-h.ug
7 months ago bcachefs: Fix inode early destruction path
Kent Overstreet [Sun, 21 Apr 2024 02:26:47 +0000 (22:26 -0400)]
bcachefs: Fix inode early destruction path

discard_new_inode() is the wrong interface to use when we need to free
an inode that was never inserted into the inode hash table; we can
bypass the whole iput() -> evict() path and replace it with
__destroy_inode(); kmem_cache_free() - this fixes a WARN_ON() about
I_NEW.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Fix deadlock in journal write path
Kent Overstreet [Sat, 20 Apr 2024 01:54:32 +0000 (21:54 -0400)]
bcachefs: Fix deadlock in journal write path

bch2_journal_write() was incorrectly waiting on earlier journal writes
synchronously; this usually worked because most of the time we'd be
running in the context of a thread that did a journal_buf_put(), but
sometimes we'd be running out of the same workqueue that completes those
prior journal writes.

Additionally, this makes sure to punt to a workqueue before submitting
preflushes - we really don't want to be calling submit_bio() in the main
transaction commit path.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Tweak btree key cache shrinker so it actually frees
Kent Overstreet [Sat, 20 Apr 2024 19:35:40 +0000 (15:35 -0400)]
bcachefs: Tweak btree key cache shrinker so it actually frees

Freeing key cache items is a multi stage process; we need to wait for an
SRCU grace period to elapse, and we handle this ourselves - partially to
avoid callback overhead, but primarily so that when allocating we can
first allocate from the freed items waiting for an SRCU grace period.

Previously, the shrinker was counting the items on the 'waiting for SRCU
grace period' lists as items being scanned, but this meant that too many
items waiting for an SRCU grace period could prevent it from doing any
work at all.

After this, we're seeing that items skipped due to the accessed bit are
the main cause of the shrinker not making any progress, and we actually
want the key cache shrinker to run quite aggressively because reclaimed
items will still generally be found (more compactly) in the btree node
cache - so we also tweak the shrinker to not count those against
nr_to_scan.
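
The accounting change described above can be illustrated with a toy scan loop. This is a minimal sketch, not the bcachefs shrinker: `struct entry` and `toy_scan()` are hypothetical, and clearing the accessed bit on a skipped entry is an assumption made here to give the entry a second chance. The point it shows is that skipped entries are not charged against `nr_to_scan`, so the budget is spent only on entries that are actually freed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cache entry, not struct bkey_cached. */
struct entry {
	bool accessed;
	bool freed;
};

/*
 * Toy scan loop: entries skipped because of the accessed bit have the bit
 * cleared (an assumption of this sketch) but are not charged against
 * nr_to_scan. Returns the number of entries freed.
 */
static size_t toy_scan(struct entry *e, size_t nr, size_t nr_to_scan)
{
	size_t freed = 0;

	assert(e || !nr);
	for (size_t i = 0; i < nr && freed < nr_to_scan; i++) {
		if (e[i].freed)
			continue;
		if (e[i].accessed) {
			e[i].accessed = false;	/* skipped, not counted */
			continue;
		}
		e[i].freed = true;
		freed++;
	}
	return freed;
}
```

With two recently accessed entries ahead of two idle ones and a budget of two, the first call frees the two idle entries instead of exhausting its budget on the skipped ones.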

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: bkey_cached.btree_trans_barrier_seq needs to be a ulong
Kent Overstreet [Sat, 20 Apr 2024 19:13:20 +0000 (15:13 -0400)]
bcachefs: bkey_cached.btree_trans_barrier_seq needs to be a ulong

This stores the SRCU sequence number, which we use to check if an SRCU
barrier has elapsed; this is a partial fix for the key cache shrinker
not actually freeing.
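
Why the field width matters can be sketched in userspace. The structs and helpers below are hypothetical and are not the bcachefs definitions; fixed-width types stand in for the kernel's `u32` and `unsigned long` so the result does not depend on the build platform. The sketch only shows that a sequence number wider than 32 bits does not survive being stored in a narrow field, and does not model the SRCU API itself.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical records, not bcachefs structs: one narrow, one full width. */
struct item_narrow { uint32_t barrier_seq; };
struct item_wide   { uint64_t barrier_seq; };

static_assert(sizeof(struct item_wide) > sizeof(struct item_narrow),
	      "the wide field must actually be wider");

/* Round-trips a sequence number through the 32-bit field (truncates). */
static uint64_t roundtrip_narrow(uint64_t seq)
{
	struct item_narrow i = { .barrier_seq = (uint32_t) seq };

	return i.barrier_seq;
}

/* Round-trips a sequence number through the full-width field. */
static uint64_t roundtrip_wide(uint64_t seq)
{
	struct item_wide i = { .barrier_seq = seq };

	return i.barrier_seq;
}
```

Any later comparison against the current full-width sequence number is only meaningful in the second case.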

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Fix missing call to bch2_fs_allocator_background_exit()
Kent Overstreet [Sat, 20 Apr 2024 04:31:32 +0000 (00:31 -0400)]
bcachefs: Fix missing call to bch2_fs_allocator_background_exit()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Check for journal entries overruning end of sb clean section
Kent Overstreet [Wed, 17 Apr 2024 19:19:50 +0000 (15:19 -0400)]
bcachefs: Check for journal entries overruning end of sb clean section

Fix a missing bounds check in superblock validation.

Note that we don't yet have repair code for this case - repair code for
individual items is generally low priority, since the whole superblock
is checksummed, validated prior to write, and we have backups.

Reported-by: lei lu <llfamsec@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Fix bio alloc in check_extent_checksum()
Kent Overstreet [Wed, 17 Apr 2024 21:27:43 +0000 (17:27 -0400)]
bcachefs: Fix bio alloc in check_extent_checksum()

if the buffer is virtually mapped it won't be a single bvec

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: fix leak in bch2_gc_write_reflink_key
Kent Overstreet [Wed, 17 Apr 2024 06:17:21 +0000 (02:17 -0400)]
bcachefs: fix leak in bch2_gc_write_reflink_key

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: KEY_TYPE_error is allowed for reflink
Kent Overstreet [Wed, 17 Apr 2024 06:04:23 +0000 (02:04 -0400)]
bcachefs: KEY_TYPE_error is allowed for reflink

KEY_TYPE_error is left behind when we have to delete all pointers in an
extent in fsck; it allows errors to be correctly returned by reads
later.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Fix bch2_dev_btree_bitmap_marked_sectors() shift
Kent Overstreet [Tue, 16 Apr 2024 23:16:45 +0000 (19:16 -0400)]
bcachefs: Fix bch2_dev_btree_bitmap_marked_sectors() shift

Fixes: 27c15ed297cb bcachefs: bch_member.btree_allocated_bitmap
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: make sure to release last journal pin in replay
Kent Overstreet [Tue, 16 Apr 2024 03:53:12 +0000 (23:53 -0400)]
bcachefs: make sure to release last journal pin in replay

This fixes a deadlock when journal replay has many keys to insert that
were from fsck, not the journal.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: node scan: ignore multiple nodes with same seq if interior
Kent Overstreet [Tue, 16 Apr 2024 02:54:10 +0000 (22:54 -0400)]
bcachefs: node scan: ignore multiple nodes with same seq if interior

Interior nodes are not really needed, when we have to scan - but if this
pops up for leaf nodes we'll need a real heuristic.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Fix format specifier in validate_bset_keys()
Nathan Chancellor [Tue, 16 Apr 2024 15:16:02 +0000 (08:16 -0700)]
bcachefs: Fix format specifier in validate_bset_keys()

When building for 32-bit platforms, for which size_t is 'unsigned int',
there is a warning from a format string in validate_bset_keys():

  fs/bcachefs/btree_io.c: In function 'validate_bset_keys':
  fs/bcachefs/btree_io.c:891:34: error: format '%lu' expects argument of type 'long unsigned int', but argument 12 has type 'unsigned int' [-Werror=format=]
    891 |                                  "bad k->u64s %u (min %u max %lu)", k->u64s,
        |                                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  fs/bcachefs/btree_io.c:603:32: note: in definition of macro 'btree_err'
    603 |                                msg, ##__VA_ARGS__);                     \
        |                                ^~~
  fs/bcachefs/btree_io.c:887:21: note: in expansion of macro 'btree_err_on'
    887 |                 if (btree_err_on(!bkeyp_u64s_valid(&b->format, k),
        |                     ^~~~~~~~~~~~
  fs/bcachefs/btree_io.c:891:64: note: format string is defined here
    891 |                                  "bad k->u64s %u (min %u max %lu)", k->u64s,
        |                                                              ~~^
        |                                                                |
        |                                                                long unsigned int
        |                                                              %u
  cc1: all warnings being treated as errors

BKEY_U64s is size_t so the entire expression is promoted to size_t. Use
the '%zu' specifier so that there is no warning regardless of the width
of size_t.
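
A minimal userspace sketch of the portability point above. `format_u64s_limit()` is a hypothetical helper, not bcachefs code; it reuses the message text from the warning and prints the `size_t` argument with `%zu`, which matches `size_t` whether it is 32 or 64 bits wide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper, not bcachefs code: formats a message in the style of
 * the one in validate_bset_keys(). max is a size_t, so it is printed with
 * %zu rather than %lu or %u.
 */
static int format_u64s_limit(char *buf, size_t len,
			     unsigned u64s, unsigned min, size_t max)
{
	assert(buf && len);
	return snprintf(buf, len, "bad k->u64s %u (min %u max %zu)",
			u64s, min, max);
}
```

Building the same call with `%lu` would reproduce the `-Wformat` diagnostic on targets where `size_t` is `unsigned int`.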

Fixes: 031ad9e7dbd1 ("bcachefs: Check for packed bkeys that are too big")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202404130747.wH6Dd23p-lkp@intel.com/
Closes: https://lore.kernel.org/oe-kbuild-all/202404131536.HdAMBOVc-lkp@intel.com/
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Fix null ptr deref in twf from BCH_IOCTL_FSCK_OFFLINE
Kent Overstreet [Tue, 16 Apr 2024 21:55:02 +0000 (17:55 -0400)]
bcachefs: Fix null ptr deref in twf from BCH_IOCTL_FSCK_OFFLINE

We need to initialize the stdio redirects before they're used.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: set_btree_iter_dontneed also clears should_be_locked
Kent Overstreet [Sat, 13 Apr 2024 22:02:15 +0000 (18:02 -0400)]
bcachefs: set_btree_iter_dontneed also clears should_be_locked

This is part of a larger series cleaning up the semantics of
should_be_locked and adding assertions around it; if we don't need an
iterator/path anymore, it clearly doesn't need to be locked.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: fix error path of __bch2_read_super()
Chao Yu [Fri, 12 Apr 2024 06:36:38 +0000 (14:36 +0800)]
bcachefs: fix error path of __bch2_read_super()

In __bch2_read_super(), if kstrdup() fails, the memory in sb->holder
needs to be released; fix this by calling bch2_free_super() in the
error path.

Signed-off-by: Chao Yu <chao@kernel.org>
Reviewed-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Check for backpointer bucket_offset >= bucket size
Kent Overstreet [Sun, 14 Apr 2024 04:51:48 +0000 (00:51 -0400)]
bcachefs: Check for backpointer bucket_offset >= bucket size

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: bch_member.btree_allocated_bitmap
Kent Overstreet [Fri, 12 Apr 2024 22:45:47 +0000 (18:45 -0400)]
bcachefs: bch_member.btree_allocated_bitmap

This adds a small (64 bit) per-device bitmap that tracks ranges that
have btree nodes, for accelerating btree node scan if it is ever needed.

- New helpers, bch2_dev_btree_bitmap_marked() and
  bch2_dev_bitmap_mark(), for checking and updating the bitmap

- Interior btree update path updates the bitmaps when required

- The check_allocations pass has a new fsck_err check,
  btree_bitmap_not_marked

- New on disk format version, mi_btree_bitmap, which indicates the new
  bitmap is present

- Upgrade table lists the required recovery pass and expected fsck error

- Btree node scan uses the bitmap to skip ranges if we're on the new
  version
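
The idea of a 64 bit per-device bitmap that tracks which ranges hold btree nodes can be sketched as follows. This is a hypothetical model, not the on-disk format: `struct dev_bitmap`, its `shift` field, and the helper names are assumptions for illustration. Each bit covers `1 << shift` sectors, marking sets every bit an extent touches, and a scan could skip any range whose bit is clear.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model, not the on-disk format: each of the 64 bits covers
 * (1 << shift) sectors, so the bitmap spans 64 << shift sectors in total.
 */
struct dev_bitmap {
	uint64_t bits;
	unsigned shift;
};

/* Mask of the bits covering sectors [start, start + sectors). */
static uint64_t range_mask(const struct dev_bitmap *b,
			   uint64_t start, uint64_t sectors)
{
	assert(sectors > 0);

	uint64_t first = start >> b->shift;
	uint64_t last  = (start + sectors - 1) >> b->shift;

	assert(last < 64);

	/* (last - first + 1) one bits, shifted up to position 'first'. */
	uint64_t width = last - first + 1;
	uint64_t ones  = width == 64 ? ~UINT64_C(0)
				     : (UINT64_C(1) << width) - 1;
	return ones << first;
}

static void bitmap_mark(struct dev_bitmap *b, uint64_t start, uint64_t sectors)
{
	b->bits |= range_mask(b, start, sectors);
}

/* True only if every range touched by the extent is marked. */
static bool bitmap_marked(const struct dev_bitmap *b,
			  uint64_t start, uint64_t sectors)
{
	uint64_t m = range_mask(b, start, sectors);

	return (b->bits & m) == m;
}
```

Because the whole structure is one 64-bit word plus a shift, checking a range costs a couple of shifts and a mask, which is what makes it cheap enough to consult on every scanned region.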

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: sysfs internal/trigger_journal_flush
Kent Overstreet [Sun, 14 Apr 2024 02:43:11 +0000 (22:43 -0400)]
bcachefs: sysfs internal/trigger_journal_flush

Add a sysfs knob for immediately flushing the entire journal.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Fix bch2_btree_node_fill() for !path
Kent Overstreet [Fri, 12 Apr 2024 19:54:33 +0000 (15:54 -0400)]
bcachefs: Fix bch2_btree_node_fill() for !path

We shouldn't be doing the unlock/relock dance when we're not using a
path - this fixes an assertion pop when called from btree node scan.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: add safety checks in bch2_btree_node_fill()
Kent Overstreet [Fri, 12 Apr 2024 19:34:14 +0000 (15:34 -0400)]
bcachefs: add safety checks in bch2_btree_node_fill()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Interior known are required to have known key types
Kent Overstreet [Sun, 14 Apr 2024 03:59:28 +0000 (23:59 -0400)]
bcachefs: Interior known are required to have known key types

For forwards compatibility, we allow bkeys of unknown type in leaf nodes;
we can simply ignore metadata we don't understand. Pointers to btree
nodes must always be of known types, however.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: add missing bounds check in __bch2_bkey_val_invalid()
Kent Overstreet [Sun, 14 Apr 2024 03:59:06 +0000 (23:59 -0400)]
bcachefs: add missing bounds check in __bch2_bkey_val_invalid()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Fix btree node merging on write buffer btrees
Kent Overstreet [Wed, 27 Dec 2023 03:42:34 +0000 (22:42 -0500)]
bcachefs: Fix btree node merging on write buffer btrees

The btree write buffer flush fastpath that avoids the main transaction
commit path had the unfortunate side effect of not doing btree node
merging.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Disable merges from interior update path
Kent Overstreet [Sat, 13 Apr 2024 22:39:03 +0000 (18:39 -0400)]
bcachefs: Disable merges from interior update path

There's been a bug in the btree write buffer where it wasn't triggering
btree node merges - and leaving behind a bunch of nearly empty btree
nodes.

Then during journal replay, when updates to the backpointers btree
aren't using the btree write buffer (because we require synchronization
with journal replay), we end up doing those merges all at once.

Then if it's the interior update path running them, we deadlock because
those run with the highest watermark.

There's no real need for the interior update path to be doing btree node
merges; other code paths can handle that at lower watermarks.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Run merges at BCH_WATERMARK_btree
Kent Overstreet [Sat, 13 Apr 2024 20:13:13 +0000 (16:13 -0400)]
bcachefs: Run merges at BCH_WATERMARK_btree

This fixes a deadlock where the interior update path during journal
replay ends up doing a ton of merges on the backpointers btree, and
deadlocking.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Fix missing write refs in fs fio paths
Kent Overstreet [Sat, 13 Apr 2024 04:26:01 +0000 (00:26 -0400)]
bcachefs: Fix missing write refs in fs fio paths

bch2_journal_flush_seq requires us to have a write ref

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Fix deadlock in journal replay
Kent Overstreet [Sat, 13 Apr 2024 01:07:05 +0000 (21:07 -0400)]
bcachefs: Fix deadlock in journal replay

btree_key_can_insert_cached() should be checking the watermark -
BCH_TRANS_COMMIT_journal_replay really means nonblocking mode when
watermark < reclaim; it was being used incorrectly.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Go rw if running any explicit recovery passes
Kent Overstreet [Fri, 12 Apr 2024 18:05:36 +0000 (14:05 -0400)]
bcachefs: Go rw if running any explicit recovery passes

This fixes a bug where we fail to start when upgrading/downgrading
because we forgot we needed to go rw.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Standardize helpers for printing enum strs with bounds checks
Kent Overstreet [Fri, 12 Apr 2024 19:17:00 +0000 (15:17 -0400)]
bcachefs: Standardize helpers for printing enum strs with bounds checks

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: don't queue btree nodes for rewrites during scan
Kent Overstreet [Fri, 12 Apr 2024 04:09:08 +0000 (00:09 -0400)]
bcachefs: don't queue btree nodes for rewrites during scan

many nodes found during scan will be old nodes, overwritten by newer
nodes

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: fix race in bch2_btree_node_evict()
Kent Overstreet [Fri, 12 Apr 2024 03:58:36 +0000 (23:58 -0400)]
bcachefs: fix race in bch2_btree_node_evict()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: fix unsafety in bch2_stripe_to_text()
Kent Overstreet [Fri, 12 Apr 2024 03:37:24 +0000 (23:37 -0400)]
bcachefs: fix unsafety in bch2_stripe_to_text()

.to_text() functions need to work on key values that didn't pass .valid

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: fix unsafety in bch2_extent_ptr_to_text()
Kent Overstreet [Fri, 12 Apr 2024 01:20:27 +0000 (21:20 -0400)]
bcachefs: fix unsafety in bch2_extent_ptr_to_text()

Need to check if we have a valid bucket before checking if ptr is stale

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: btree node scan: handle encrypted nodes
Kent Overstreet [Fri, 12 Apr 2024 03:38:07 +0000 (23:38 -0400)]
bcachefs: btree node scan: handle encrypted nodes

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Check for packed bkeys that are too big
Kent Overstreet [Fri, 12 Apr 2024 01:30:43 +0000 (21:30 -0400)]
bcachefs: Check for packed bkeys that are too big

add missing validation; fixes assertion pop in bkey unpack

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Fix UAFs of btree_insert_entry array
Kent Overstreet [Thu, 11 Apr 2024 21:47:42 +0000 (17:47 -0400)]
bcachefs: Fix UAFs of btree_insert_entry array

The btree paths array is now dynamically resizable, and so is the
btree_insert_entries array, as it needs to be the same size.

The merge path (and interior update path) allocates new btree paths,
thus can trigger a resize; thus we need to not retain direct pointers
after invoking merge; similarly when running btree node triggers.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Don't use bch2_btree_node_lock_write_nofail() in btree split path
Kent Overstreet [Thu, 11 Apr 2024 05:01:11 +0000 (01:01 -0400)]
bcachefs: Don't use bch2_btree_node_lock_write_nofail() in btree split path

It turns out - btree splits happen with the rest of the transaction
still locked, to avoid unnecessary restarts, which means using nofail
doesn't work here - we can deadlock.

Fortunately, we now have the ability to return errors here.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Fix __bch2_btree_and_journal_iter_init_node_iter()
Kent Overstreet [Wed, 10 Apr 2024 05:30:22 +0000 (01:30 -0400)]
bcachefs: Fix __bch2_btree_and_journal_iter_init_node_iter()

We weren't respecting trans->journal_replay_not_finished - we shouldn't
be searching the journal keys unless we have a ref on them.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Kill read lock dropping in bch2_btree_node_lock_write_nofail()
Kent Overstreet [Wed, 10 Apr 2024 04:10:18 +0000 (00:10 -0400)]
bcachefs: Kill read lock dropping in bch2_btree_node_lock_write_nofail()

dropping read locks in bch2_btree_node_lock_write_nofail() dates from
before we had the cycle detector; we can now tell the cycle detector
directly when taking a lock may not fail because we can't handle
transaction restarts.

This is needed for adding should_be_locked asserts.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Fix a race in btree_update_nodes_written()
Kent Overstreet [Wed, 10 Apr 2024 16:53:28 +0000 (12:53 -0400)]
bcachefs: Fix a race in btree_update_nodes_written()

One btree update might have terminated in a node update, and then while
it is in flight another btree update might free that original node.

This race has to be handled in btree_update_nodes_written() - we were
missing a READ_ONCE().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: btree_node_scan: Respect member.data_allowed
Kent Overstreet [Tue, 9 Apr 2024 22:50:27 +0000 (18:50 -0400)]
bcachefs: btree_node_scan: Respect member.data_allowed

If a device wasn't used for btree nodes, no need to scan for them.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Don't scan for btree nodes when we can reconstruct
Kent Overstreet [Tue, 9 Apr 2024 04:49:39 +0000 (00:49 -0400)]
bcachefs: Don't scan for btree nodes when we can reconstruct

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Fix check_topology() when using node scan
Kent Overstreet [Tue, 9 Apr 2024 04:02:47 +0000 (00:02 -0400)]
bcachefs: Fix check_topology() when using node scan

shoot down journal keys _before_ populating journal keys with pointers
to scanned nodes

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: fix eytzinger0_find_gt()
Kent Overstreet [Tue, 9 Apr 2024 02:32:08 +0000 (22:32 -0400)]
bcachefs: fix eytzinger0_find_gt()

- fix return types: promoting from unsigned to ssize_t does not do what
  we want here, and was pointless since the rest of the eytzinger code
  is u32
- nr, not size
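
The return-type problem in the first item can be sketched in userspace. The functions below are hypothetical and are not the eytzinger code; `int64_t` stands in for `ssize_t` so the behaviour does not depend on the platform. The sketch shows that widening an unsigned 32-bit "not found" sentinel to a wider signed type zero-extends it, so a `< 0` check on the widened value never fires.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical lookup, not the eytzinger code: returns the index of the
 * first element greater than 'search' in a sorted array, or (uint32_t) -1
 * if there is none.
 */
static uint32_t find_gt_u32(const uint32_t *sorted, uint32_t nr,
			    uint32_t search)
{
	assert(sorted || !nr);
	for (uint32_t i = 0; i < nr; i++)
		if (sorted[i] > search)
			return i;
	return (uint32_t) -1;
}

/* Widening the u32 result zero-extends: the sentinel becomes 4294967295. */
static int64_t find_gt_widened(const uint32_t *sorted, uint32_t nr,
			       uint32_t search)
{
	return find_gt_u32(sorted, nr, search);
}
```

Keeping the return type the same width as the rest of the code avoids having two different encodings of "not found".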

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: fix bch2_get_acl() transaction restart handling
Kent Overstreet [Sun, 7 Apr 2024 20:20:17 +0000 (16:20 -0400)]
bcachefs: fix bch2_get_acl() transaction restart handling

bch2_acl_from_disk() uses allocate_dropping_locks, and can thus return
a transaction restart - this wasn't handled.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: fix the count of nr_freed_pcpu after changing bc->freed_nonpcpu list
Hongbo Li [Tue, 26 Mar 2024 04:04:56 +0000 (12:04 +0800)]
bcachefs: fix the count of nr_freed_pcpu after changing bc->freed_nonpcpu list

When allocating a bkey_cached from the bc->freed_pcpu list, the code
failed to decrement nr_freed_pcpu, causing a mismatch between the value
of nr_freed_pcpu and the number of items on the list. The same problem
exists when moving a new bkey_cached to the bc->freed_pcpu list.
When this happens, the following checks in
bch2_fs_btree_key_cache_exit() may trigger:

   BUG_ON(list_count_nodes(&bc->freed_pcpu) != bc->nr_freed_pcpu);
   BUG_ON(list_count_nodes(&bc->freed_nonpcpu) != bc->nr_freed_nonpcpu);

Fixes: c65c13f0eac6 ("bcachefs: Run btree key cache shrinker less aggressively")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Fix gap buffer bug in bch2_journal_key_insert_take()
Kent Overstreet [Sun, 7 Apr 2024 01:45:46 +0000 (21:45 -0400)]
bcachefs: Fix gap buffer bug in bch2_journal_key_insert_take()

Multiple bug fixes for journal iters:

 - When the journal keys gap buffer is resized, we have to adjust the
   iterators for moving the gap to the end
 - We don't want to rewind iterators to point to the key we just
   inserted if it's not for the correct btree/level

Also, add some new assertions.
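
For readers unfamiliar with gap buffers, a minimal sketch of why iterators need adjusting when the gap moves. `struct gapbuf` and its helpers are hypothetical and are not the journal keys code. A logical index stays valid across a gap move, but the physical slot it maps to changes, so anything that cached a physical position must be fixed up.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical gap buffer of ints, not the journal keys code: 'nr' live
 * elements stored in 'size' slots, with the unused slots kept together
 * as a gap starting at physical position 'gap'.
 */
struct gapbuf {
	int	*d;
	size_t	nr;
	size_t	size;
	size_t	gap;
};

/* Translate a logical index (0..nr-1) into a physical slot, skipping the gap. */
static size_t idx_to_pos(const struct gapbuf *b, size_t idx)
{
	size_t gap_size = b->size - b->nr;

	assert(idx < b->nr);
	return idx < b->gap ? idx : idx + gap_size;
}

/* Slide the gap so that it starts at logical index new_gap. */
static void move_gap(struct gapbuf *b, size_t new_gap)
{
	size_t gap_size = b->size - b->nr;
	size_t gap_end = b->gap + gap_size;

	assert(new_gap <= b->nr);
	if (new_gap < b->gap)
		/* Elements between new_gap and the old gap move to the far side. */
		memmove(b->d + new_gap + gap_size, b->d + new_gap,
			(b->gap - new_gap) * sizeof(b->d[0]));
	else if (new_gap > b->gap)
		/* Elements just past the old gap move down to fill it. */
		memmove(b->d + b->gap, b->d + gap_end,
			(new_gap - b->gap) * sizeof(b->d[0]));
	b->gap = new_gap;
}
```

Moving the gap to the end, as a resize does, shifts every element that used to sit after the gap, which is the case the first fix above addresses.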

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Rename struct field swap to prevent macro naming collision
Thorsten Blum [Sat, 6 Apr 2024 14:19:20 +0000 (16:19 +0200)]
bcachefs: Rename struct field swap to prevent macro naming collision

The struct field swap can collide with the swap() macro defined in
linux/minmax.h. Rename the struct field to prevent such collisions.

Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago MAINTAINERS: Add entry for bcachefs documentation
Bagas Sanjaya [Fri, 5 Apr 2024 07:23:19 +0000 (14:23 +0700)]
MAINTAINERS: Add entry for bcachefs documentation

Now that bcachefs docs exist in Documentation/filesystems/bcachefs/,
cover it in MAINTAINERS entry for the filesystem.

Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago Documentation: filesystems: Add bcachefs toctree
Bagas Sanjaya [Fri, 5 Apr 2024 07:23:18 +0000 (14:23 +0700)]
Documentation: filesystems: Add bcachefs toctree

Commit eb386617be4bdf ("bcachefs: Errcode tracepoint, documentation")
adds initial bcachefs documentation (private error codes) but without
any table of contents tree for the filesystem docs, hence Sphinx warns:

Documentation/filesystems/bcachefs/errorcodes.rst: WARNING: document isn't included in any toctree

Add bcachefs toctree to fix above warning.

Fixes: eb386617be4b ("bcachefs: Errcode tracepoint, documentation")
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: JOURNAL_SPACE_LOW
Kent Overstreet [Sat, 6 Apr 2024 03:27:27 +0000 (23:27 -0400)]
bcachefs: JOURNAL_SPACE_LOW

"bcachefs: Fix deadlock in bch2_btree_update_start()" was a significant
performance regression (nearly 50%) on multithreaded random writes with
fio.

The reason is that the journal watermark checks multiple things,
including the state of the btree write buffer, and on multithreaded
update heavy workloads we're bottlenecked on write buffer flushing - we
don't want kicking off btree updates to depend on the state of the write
buffer.

This isn't strictly correct; the interior btree update path does do
write buffer updates, but it's a tiny fraction of total accounting
updates and we're more concerned with space in the journal itself.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Disable errors=panic for BCH_IOCTL_FSCK_OFFLINE
Kent Overstreet [Sat, 6 Apr 2024 02:30:30 +0000 (22:30 -0400)]
bcachefs: Disable errors=panic for BCH_IOCTL_FSCK_OFFLINE

BCH_IOCTL_FSCK_OFFLINE allows the userspace fsck tool to use the kernel
implementation of fsck - primarily when the kernel version is a better
version match.

It should look and act exactly like the normal userspace fsck that the
user expected to be invoking, so errors should never result in a kernel
panic.

We may want to consider further restricting errors=panic - it's only
intended for debugging in controlled test environments, it should have
no purpose in normal usage.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Fix BCH_IOCTL_FSCK_OFFLINE for encrypted filesystems
Kent Overstreet [Sat, 6 Apr 2024 02:23:29 +0000 (22:23 -0400)]
bcachefs: Fix BCH_IOCTL_FSCK_OFFLINE for encrypted filesystems

To open an encrypted filesystem, we use request_key() to get the
encryption key from the user's keyring - but request_key() needs to
happen in the context of the process that invoked the ioctl.

This is easily fixed by using bch2_fs_open() in nostart mode.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: fix rand_delete unit test
Kent Overstreet [Fri, 5 Apr 2024 20:21:18 +0000 (16:21 -0400)]
bcachefs: fix rand_delete unit test

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: fix ! vs ~ typo in __clear_bit_le64()
Dan Carpenter [Fri, 5 Apr 2024 15:01:02 +0000 (18:01 +0300)]
bcachefs: fix ! vs ~ typo in __clear_bit_le64()

The ! was obviously intended to be ~.  As it is, this function does
the equivalent of: "addr[bit / 64] = 0;".
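
The difference between the two operators can be shown with a small userspace sketch. The helpers below are hypothetical stand-ins, not the kernel function, and they ignore the little-endian conversion that the `_le64` variant implies. Logical NOT of any nonzero mask is 0, so the and-assign wipes the whole word, while bitwise NOT keeps every bit except the target.

```c
#include <assert.h>
#include <stdint.h>

/* Buggy form: !mask is 0 for any nonzero mask, so the whole word is zeroed. */
static void clear_bit_buggy(unsigned bit, uint64_t *addr)
{
	assert(addr);
	addr[bit / 64] &= !(UINT64_C(1) << (bit % 64));
}

/* Intended form: ~mask keeps every bit except the one being cleared. */
static void clear_bit_fixed(unsigned bit, uint64_t *addr)
{
	assert(addr);
	addr[bit / 64] &= ~(UINT64_C(1) << (bit % 64));
}
```

Starting from a word of `0xff`, clearing bit 3 leaves `0` with the buggy form and `0xf7` with the fixed form; neighbouring words are untouched in both cases.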

Fixes: 27fcec6c27ca ("bcachefs: Clear recovery_passes_required as they complete without errors")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Fix rebalance from durability=0 device
Kent Overstreet [Fri, 5 Apr 2024 06:43:08 +0000 (02:43 -0400)]
bcachefs: Fix rebalance from durability=0 device

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Print shutdown journal sequence number
Kent Overstreet [Wed, 21 Feb 2024 02:08:24 +0000 (21:08 -0500)]
bcachefs: Print shutdown journal sequence number

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Further improve btree_update_to_text()
Kent Overstreet [Wed, 3 Apr 2024 23:52:10 +0000 (19:52 -0400)]
bcachefs: Further improve btree_update_to_text()

Print start and end level of the btree update; also a bit of cleanup.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Move btree_updates to debugfs
Kent Overstreet [Wed, 3 Apr 2024 23:15:53 +0000 (19:15 -0400)]
bcachefs: Move btree_updates to debugfs

sysfs is limited to PAGE_SIZE, and when we're debugging strange
deadlocks/priority inversions we need to see the full list.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Bump limit in btree_trans_too_many_iters()
Kent Overstreet [Thu, 4 Apr 2024 20:51:40 +0000 (16:51 -0400)]
bcachefs: Bump limit in btree_trans_too_many_iters()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Make snapshot_is_ancestor() safe
Kent Overstreet [Thu, 4 Apr 2024 19:50:26 +0000 (15:50 -0400)]
bcachefs: Make snapshot_is_ancestor() safe

Snapshot table accesses generally need to be checking for invalid
snapshot ID now, fix one that was missed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: create debugfs dir for each btree
Thomas Bertschinger [Thu, 14 Mar 2024 16:02:18 +0000 (10:02 -0600)]
bcachefs: create debugfs dir for each btree

This creates a subdirectory for each individual btree under the btrees/
debugfs directory.

Directory structure, before:

/sys/kernel/debug/bcachefs/$FS_ID/btrees/
├── alloc
├── alloc-bfloat-failed
├── alloc-formats
├── backpointers
├── backpointers-bfloat-failed
├── backpointers-formats
...

Directory structure, after:

/sys/kernel/debug/bcachefs/$FS_ID/btrees/
├── alloc
│   ├── bfloat-failed
│   ├── formats
│   └── keys
├── backpointers
│   ├── bfloat-failed
│   ├── formats
│   └── keys
...

Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: reconstruct_inode()
Kent Overstreet [Mon, 1 Apr 2024 04:00:56 +0000 (00:00 -0400)]
bcachefs: reconstruct_inode()

If an inode is missing, but corresponding extents and dirent still
exist, it's well worth recreating it - this does so.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Subvolume reconstruction
Kent Overstreet [Sun, 31 Mar 2024 06:03:03 +0000 (02:03 -0400)]
bcachefs: Subvolume reconstruction

We can now recreate missing subvolumes from dirents and/or inodes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Check for extents that point to same space
Kent Overstreet [Sat, 30 Mar 2024 22:43:00 +0000 (18:43 -0400)]
bcachefs: Check for extents that point to same space

In backpointer repair, if we get a missing backpointer - but there's
already a backpointer that points to an existing extent - we've got
multiple extents that point to the same space and need to decide which
to keep.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Reconstruct missing snapshot nodes
Kent Overstreet [Thu, 28 Mar 2024 02:50:19 +0000 (22:50 -0400)]
bcachefs: Reconstruct missing snapshot nodes

When the snapshots btree is gone, we'll have to delete huge amounts of
data - unless we can reconstruct it by looking at the keys that refer to
it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Flag btrees with missing data
Kent Overstreet [Sat, 16 Mar 2024 03:03:42 +0000 (23:03 -0400)]
bcachefs: Flag btrees with missing data

We need this to know when we should attempt to reconstruct the snapshots
btree

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Topology repair now uses nodes found by scanning to fill holes
Kent Overstreet [Sun, 17 Mar 2024 02:45:30 +0000 (22:45 -0400)]
bcachefs: Topology repair now uses nodes found by scanning to fill holes

With the new btree node scan code, we can now recover from corrupt btree
roots - simply create a new fake root at depth 1, and then insert all
the leaves we found.

If the root wasn't corrupt but there's corruption elsewhere in the
btree, we can fill in holes as needed with the newest version of a given
node(s) from the scan; we also check if a given btree node is older than
what we found from the scan.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Repair pass for scanning for btree nodes
Kent Overstreet [Tue, 12 Mar 2024 03:11:46 +0000 (23:11 -0400)]
bcachefs: Repair pass for scanning for btree nodes

If a btree root or interior btree node goes bad, we're going to lose a
lot of data, unless we can recover the nodes that it pointed to by
scanning.

Fortunately, btree node headers are fully self-describing, and
additionally the magic number is xored with the filesystem UUID, so we
can do so safely.

This implements the scanning - next patch will rework topology repair to
make use of the found nodes.
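
As a rough illustration of why xoring the magic with the UUID makes
scanning safe, here is a minimal sketch, not the code from this patch.
BASE_MAGIC, struct node_header, fs_magic() and scan_for_nodes() are
hypothetical names, and the use of the first 8 bytes of the UUID is an
assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Placeholder constant for illustration; not the real on-disk magic. */
#define BASE_MAGIC 0x0123456789abcdefULL

/* Hypothetical, heavily simplified node header. */
struct node_header {
	uint64_t magic;	/* BASE_MAGIC xored with part of the filesystem UUID */
	uint64_t seq;	/* lets a caller tell versions of a node apart */
};

/* Derive the per-filesystem magic from the first 8 bytes of the UUID. */
static uint64_t fs_magic(const uint8_t uuid[16])
{
	uint64_t lo;

	memcpy(&lo, uuid, sizeof(lo));
	return lo ^ BASE_MAGIC;
}

/*
 * Walk buf in stride-sized steps and record the offset of every header
 * whose magic matches this filesystem. Returns the number of matches
 * stored in found[] (at most max_found).
 */
static size_t scan_for_nodes(const uint8_t *buf, size_t len, size_t stride,
			     const uint8_t uuid[16],
			     size_t *found, size_t max_found)
{
	uint64_t want = fs_magic(uuid);
	size_t nr = 0;

	assert(stride >= sizeof(struct node_header));

	for (size_t off = 0;
	     off + sizeof(struct node_header) <= len && nr < max_found;
	     off += stride) {
		struct node_header h;

		/* memcpy avoids unaligned access into the raw buffer */
		memcpy(&h, buf + off, sizeof(h));
		if (h.magic == want)
			found[nr++] = off;
	}
	return nr;
}
```

In this sketch, a node left behind by a different filesystem on the same
device carries a different magic and is skipped by the scan.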

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Don't skip fake btree roots in fsck
Kent Overstreet [Sun, 10 Mar 2024 20:18:41 +0000 (16:18 -0400)]
bcachefs: Don't skip fake btree roots in fsck

When a btree root is unreadable, we might still have keys from the
journal to walk and mark.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: bch2_btree_root_alloc() -> bch2_btree_root_alloc_fake()
Kent Overstreet [Fri, 15 Mar 2024 02:17:40 +0000 (22:17 -0400)]
bcachefs: bch2_btree_root_alloc() -> bch2_btree_root_alloc_fake()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Eytzinger cleanups
Kent Overstreet [Fri, 22 Mar 2024 23:26:33 +0000 (19:26 -0400)]
bcachefs: Eytzinger cleanups

Pull out eytzinger.c and kill eytzinger_cmp_fn. We now provide
eytzinger0_sort and eytzinger0_sort_r, which use the standard cmp_func_t
and cmp_r_func_t callbacks.
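
For readers unfamiliar with the layout, the following is a minimal
sketch of a zero-based Eytzinger array searched through a standard
three-way comparison callback. It is not the kernel's eytzinger0_sort;
cmp_fn, cmp_int, eytz_build and eytz_find are hypothetical names used
only for this example.

```c
#include <stddef.h>

/* Same shape as a standard comparison callback: <0, 0, or >0. */
typedef int (*cmp_fn)(const void *a, const void *b);

static int cmp_int(const void *a, const void *b)
{
	int l = *(const int *)a, r = *(const int *)b;

	return (l > r) - (l < r);
}

/*
 * Copy sorted[] into out[] in zero-based Eytzinger (breadth-first)
 * order: the children of slot i live at 2*i + 1 and 2*i + 2, and an
 * in-order walk of that implicit tree visits sorted[] in order.
 * Call with i = 0 and next = 0; returns the number of elements placed.
 */
static size_t eytz_build(const int *sorted, int *out, size_t n,
			 size_t i, size_t next)
{
	if (i >= n)
		return next;
	next = eytz_build(sorted, out, n, 2 * i + 1, next);
	out[i] = sorted[next++];
	return eytz_build(sorted, out, n, 2 * i + 2, next);
}

/* Return the slot holding *key, or n if it is absent. */
static size_t eytz_find(const int *tree, size_t n, const int *key, cmp_fn cmp)
{
	size_t i = 0;

	while (i < n) {
		int c = cmp(key, &tree[i]);

		if (!c)
			return i;
		/* go left when key is smaller, right when it is larger */
		i = 2 * i + 1 + (c > 0);
	}
	return n;
}
```

Taking the callback as a plain function pointer is what lets one search
routine serve any element ordering, which is the point of switching to
the standard callback types.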

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: bch2_shoot_down_journal_keys()
Kent Overstreet [Tue, 19 Mar 2024 22:56:26 +0000 (18:56 -0400)]
bcachefs: bch2_shoot_down_journal_keys()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Clear recovery_passes_required as they complete without errors
Kent Overstreet [Sun, 31 Mar 2024 02:25:45 +0000 (22:25 -0400)]
bcachefs: Clear recovery_passes_required as they complete without errors

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: ratelimit informational fsck errors
Kent Overstreet [Tue, 2 Apr 2024 22:57:05 +0000 (18:57 -0400)]
bcachefs: ratelimit informational fsck errors

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Check for bad needs_discard before doing discard
Kent Overstreet [Tue, 2 Apr 2024 22:30:14 +0000 (18:30 -0400)]
bcachefs: Check for bad needs_discard before doing discard

In the discard worker, we were failing to validate the bucket state -
meaning a corrupt needs_discard btree could cause us to discard a bucket
that we shouldn't.

If check_alloc_info hasn't run yet we just want to bail out, otherwise
it's a filesystem inconsistent error.
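
The decision described above can be sketched as a small function. This
is an illustration under assumptions, not the discard worker itself:
the enums and discard_check() are hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical bucket states, as read back from the alloc info. */
enum bucket_state {
	BUCKET_free,
	BUCKET_need_discard,
	BUCKET_dirty,
};

/* What the discard worker should do with one needs_discard entry. */
enum discard_action {
	DISCARD_do,		/* state agrees: safe to discard */
	DISCARD_skip,		/* alloc info not checked yet: bail out */
	DISCARD_fs_inconsistent,/* checked, and still wrong: report it */
};

static enum discard_action discard_check(enum bucket_state state,
					 bool check_alloc_info_done)
{
	if (state == BUCKET_need_discard)
		return DISCARD_do;

	/*
	 * The needs_discard entry disagrees with the bucket state. Before
	 * check_alloc_info has run that may simply be unrepaired damage,
	 * so skip quietly; afterwards it is a real inconsistency.
	 */
	return check_alloc_info_done ? DISCARD_fs_inconsistent : DISCARD_skip;
}
```

Either way, in this sketch a bucket whose state does not match is never
discarded.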

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Improve bch2_btree_update_to_text()
Kent Overstreet [Tue, 2 Apr 2024 20:42:27 +0000 (16:42 -0400)]
bcachefs: Improve bch2_btree_update_to_text()

Print out the mode as a string, and also print out the btree and
watermark.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago mean_and_variance: Drop always failing tests
Guenter Roeck [Sun, 25 Feb 2024 16:29:25 +0000 (08:29 -0800)]
mean_and_variance: Drop always failing tests

mean_and_variance_test_2 and mean_and_variance_test_4 always fail.
The input parameters to those tests are identical to the input parameters
to tests 1 and 3, yet the expected result for tests 2 and 4 is different
for the mean and stddev tests. That will always fail.

     Expected mean_and_variance_get_mean(mv) == mean[i], but
        mean_and_variance_get_mean(mv) == 22 (0x16)
        mean[i] == 10 (0xa)

Drop the bad tests.

Fixes: 65bc41090720 ("mean and variance: More tests")
Closes: https://lore.kernel.org/lkml/065b94eb-6a24-4248-b7d7-d3212efb4787@roeck-us.net/
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: fix nocow lock deadlock
Kent Overstreet [Tue, 2 Apr 2024 05:03:58 +0000 (01:03 -0400)]
bcachefs: fix nocow lock deadlock

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: BCH_WATERMARK_interior_updates
Kent Overstreet [Mon, 1 Apr 2024 23:20:36 +0000 (19:20 -0400)]
bcachefs: BCH_WATERMARK_interior_updates

This adds a new watermark, higher priority than BCH_WATERMARK_reclaim,
for interior btree updates. We've seen a deadlock where journal replay
triggers a ton of btree node merges, and these use up all available open
buckets and then interior updates get stuck.

One cause of this is that we're currently lacking btree node merging on
write buffer btrees - that needs to be fixed as well.
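
To make the idea of a higher-priority watermark concrete, here is a
minimal sketch assuming a simplified model: three hypothetical levels,
and a made-up rule where each step up in priority halves the number of
buckets held back. The real watermark list and reserve arithmetic are
not shown in this log.

```c
#include <stdbool.h>

/* Hypothetical levels, lowest priority first. */
enum watermark {
	WM_normal,
	WM_reclaim,
	WM_interior_updates,
};

/*
 * Buckets that must stay free for callers at this watermark.
 * Assumption for illustration: each higher level halves the reserve.
 */
static unsigned reserve_for(enum watermark wm, unsigned base_reserve)
{
	return base_reserve >> wm;
}

/* An allocation may proceed only if it leaves the reserve untouched. */
static bool may_allocate(unsigned free_buckets, enum watermark wm,
			 unsigned base_reserve)
{
	return free_buckets > reserve_for(wm, base_reserve);
}
```

With 20 free buckets and a base reserve of 64, only the interior-update
level is allowed to allocate in this model, so interior updates can
still make progress after lower-priority work has been cut off.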

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Fix btree node reserve
Kent Overstreet [Mon, 1 Apr 2024 23:16:19 +0000 (19:16 -0400)]
bcachefs: Fix btree node reserve

Sign error when checking the watermark - oops.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: On emergency shutdown, print out current journal sequence number
Kent Overstreet [Sat, 30 Mar 2024 19:59:57 +0000 (15:59 -0400)]
bcachefs: On emergency shutdown, print out current journal sequence number

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Fix overlapping extent repair
Kent Overstreet [Sat, 30 Mar 2024 05:00:50 +0000 (01:00 -0400)]
bcachefs: Fix overlapping extent repair

Overlapping extent repair was colliding with the extent past end of inode
checks - don't update "extent ends at" until we know we have an extent.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Fix remove_dirent()
Kent Overstreet [Mon, 1 Apr 2024 04:00:32 +0000 (00:00 -0400)]
bcachefs: Fix remove_dirent()

We were missing an iter_traverse().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Logged op errors should be ignored
Kent Overstreet [Mon, 1 Apr 2024 02:34:45 +0000 (22:34 -0400)]
bcachefs: Logged op errors should be ignored

If something is wrong with a logged op, we just want to delete it -
there's nothing to repair.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Improve -o norecovery; opts.recovery_pass_limit
Kent Overstreet [Fri, 29 Mar 2024 01:34:14 +0000 (21:34 -0400)]
bcachefs: Improve -o norecovery; opts.recovery_pass_limit

This adds opts.recovery_pass_limit, and redoes -o norecovery to make use
of it; this fixes some issues with -o norecovery so it can be safely
used for data recovery.

Norecovery means "don't do journal replay"; it's an important data
recovery tool when we're getting stuck in journal replay.

When using it this way we need to make sure we don't free journal keys
after startup, so we continue to overlay them: thus it needs to imply
retain_recovery_info, as well as nochanges.

recovery_pass_limit is an explicit option for telling recovery to exit
after a specific recovery pass; this is a much cleaner way of
implementing -o norecovery, as well as being a useful debug feature in
its own right.
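
A minimal sketch of the pass-limit idea follows. The pass names, their
order, and the mapping of norecovery to "the pass just before journal
replay" are assumptions for illustration; run_recovery_passes() and
norecovery_limit() are hypothetical helpers, not functions from this
patch.

```c
#include <stdbool.h>

/* Hypothetical pass list; names and order are illustrative only. */
enum recovery_pass {
	PASS_read_journal,
	PASS_check_topology,
	PASS_journal_replay,
	PASS_check_inodes,
	PASS_NR,
};

/*
 * Run passes in order and stop once the pass named by limit has
 * finished. ran[] records which passes executed; the return value is
 * how many of them ran.
 */
static unsigned run_recovery_passes(enum recovery_pass limit,
				    bool ran[PASS_NR])
{
	unsigned nr = 0;

	for (int p = 0; p < PASS_NR; p++)
		ran[p] = false;

	for (int p = 0; p < PASS_NR; p++) {
		ran[p] = true;	/* a real pass would do its work here */
		nr++;
		if (p == (int)limit)
			break;
	}
	return nr;
}

/* Assumption: norecovery means "stop just before journal replay". */
static enum recovery_pass norecovery_limit(void)
{
	return (enum recovery_pass)(PASS_journal_replay - 1);
}
```

Expressing norecovery as a limit keeps a single loop in charge of pass
ordering, and the same knob can stop recovery at any other pass when
debugging.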

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: bch2_run_explicit_recovery_pass_persistent()
Kent Overstreet [Sat, 30 Mar 2024 00:43:39 +0000 (20:43 -0400)]
bcachefs: bch2_run_explicit_recovery_pass_persistent()

Flag that we need to run a recovery pass and run it - persistently, so if
we crash it'll still get run.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Ensure bch_sb_field_ext always exists
Kent Overstreet [Sat, 30 Mar 2024 22:57:53 +0000 (18:57 -0400)]
bcachefs: Ensure bch_sb_field_ext always exists

This makes bch_sb_field_ext more consistent with the rest of -o
nochanges - we don't want to be varying other codepaths based on -o
nochanges, since it's used for testing in dry run mode; also fixes some
potential null ptr derefs.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Flush journal immediately after replay if we did early repair
Kent Overstreet [Thu, 28 Mar 2024 06:36:10 +0000 (02:36 -0400)]
bcachefs: Flush journal immediately after replay if we did early repair

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Resume logged ops after fsck
Kent Overstreet [Sat, 23 Mar 2024 23:31:15 +0000 (19:31 -0400)]
bcachefs: Resume logged ops after fsck

Finishing logged ops requires the filesystem to be in a reasonably
consistent state - and other fsck passes don't require it to have
completed, so just run it last.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Add error messages to logged ops fns
Kent Overstreet [Sat, 23 Mar 2024 23:30:58 +0000 (19:30 -0400)]
bcachefs: Add error messages to logged ops fns

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Split out recovery_passes.c
Kent Overstreet [Sun, 24 Mar 2024 00:07:46 +0000 (20:07 -0400)]
bcachefs: Split out recovery_passes.c

We've grown a fair amount of code for managing recovery passes; tracking
which ones we're running, which ones need to be run, and flagging in the
superblock which ones need to be run on the next recovery.

So it's worth splitting out into its own file; this code is pretty
different from the code in recovery.c.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: fix backpointer for missing alloc key msg
Kent Overstreet [Thu, 28 Mar 2024 05:41:03 +0000 (01:41 -0400)]
bcachefs: fix backpointer for missing alloc key msg

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Fix bch2_btree_increase_depth()
Kent Overstreet [Thu, 14 Mar 2024 23:39:26 +0000 (19:39 -0400)]
bcachefs: Fix bch2_btree_increase_depth()

When we haven't yet allocated any btree nodes for a given btree, we
first need to call the regular split path to allocate one.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Kill bch2_bkey_ptr_data_type()
Kent Overstreet [Mon, 25 Mar 2024 23:26:05 +0000 (19:26 -0400)]
bcachefs: Kill bch2_bkey_ptr_data_type()

Remove some duplication, and inconsistency between check_fix_ptrs and
the main ptr marking paths.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Fix use after free in check_root_trans()
Kent Overstreet [Tue, 26 Mar 2024 22:46:38 +0000 (18:46 -0400)]
bcachefs: Fix use after free in check_root_trans()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Fix repair path for missing indirect extents
Kent Overstreet [Tue, 26 Mar 2024 22:46:20 +0000 (18:46 -0400)]
bcachefs: Fix repair path for missing indirect extents

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
7 months ago bcachefs: Fix use after free in bch2_check_fix_ptrs()
Kent Overstreet [Tue, 26 Mar 2024 21:38:22 +0000 (17:38 -0400)]
bcachefs: Fix use after free in bch2_check_fix_ptrs()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>