git.proxmox.com Git - pve-access-control.git/commit

auth key: fix double rotation in clusters

there is a (hard to trigger) race that can cause a double rotation of
the auth key, with potentially confusing fallout (various processes on
different nodes having an inconsistent view of the current and previous
auth keys, resulting in "random" invalid ticket errors until the next
proper key rotation 24h later).

the underlying cause is that `stat()` calls are excempt from our
otherwise non-cached/direct_io handling of pmxcfs I/O, which allows the
following sequence of events to take place:

LAST: mtime of current auth key

- current epoch advances to LAST + 24h

the following can be arbitrarly interleaved between node A and B:
- LAST+24h node A: pvedaemon/pvestatd on node A calls check_authkey(1)
- LAST+24h node A: it returns 0 (rotation required)
- LAST+24h node A: rotate_key() is called
- LAST+24h node A: cfs_lock_authkey is called
- LAST+24h node B: pvedaemon/pvestatd calls check_authkey(1)
- LAST+24h node B: key is not yet cached in-memory by current process
- LAST+24h node B: key file is opened, stat-ed, read, parsed, and content+mtime
  is cached (the kernel will now cache this stat result for 1s unless
  the path is opened)
- LAST+24h node B: it returns 0 (rotation required)
- LAST+24h node B: rotate_key() is called
- LAST+24h node B: cfs_lock_authkey is called

the following is mutex-ed via a cfs_lock:
- LAST+24h node A: lock is obtained
- LAST+24h node A: check_authkey() is called
- LAST+24h node A: key is stat-ed, mtime is still (correctly) LAST,
  cached mtime and content are returned
- LAST+24h node A: it returns 0 (rotation still required)
- LAST+24h node A: get_pubkey() is called and returns current auth key
- LAST+24h node A: new keypair is generated and persisted
- LAST+24h node A: cfs_lock is released
- LAST+24h node B: changes by node A are processed by pmxcfs
- LAST+24h node B: lock is obtained
- LAST+24h node B: check_authkey() is called
- LAST+24h node B: key is stat-ed, mtime is (incorrectly!) still LAST
  since the stat call is handled by the kernel/page cache, not by
  pmxcfs, cached mtime and content are returned
- LAST+24h node B: it returns 0 (rotation still required)
- LAST+24h node B: get_pubkey() is called and returns either previous or
  key written by node A (depending on whether page cache or pmxcfs
  answers stat call)
- LAST+24h node B: new keypair is generated, key returned by last
  get_pubkey call is written as old key

the end result is that some nodes and process will treat the key
generated by node A as "current", while others will treat the one
generated by nodoe B as "current". both have the same mtime, so the
in-memory cache hash won't be updated unless the service is restarted or
another rotation happens. depending on who generated the ticket and who
attempts validating it, a ticket might be rejected as invalid even
though the generating party would treat it as valid, and time on all
nodes is properly synced.

there seems to be now way for pmxcfs to pro-actively invalidate the page
cache entry safely (since we'd need to do so while writes to the same
path can happen concurrently), so work around by forcing an open/close
at the (stat) call site which does the work for us. regular reads are
not affected since those already bypass the page cache entirely anyway.

thankfully in almost all cases, the following sequence has enough
synchronization overhead baked in to avoid triggering the issue almost
entirely:

- cfs_lock
- generate key
- create tmp file for old key
- write tmp file
- rename tmp file into proper place
- create tmp file for new pub key
- write tmp file
- rename tmp file into proper place
- create tmp file for new priv key
- write tmp file
- rename tmp file into proper place
- release lock

that being said, there has been at least one report where this was
triggered in the wild from time to time.

it is easy to reproduce by increasing the attr_timeout and entry_timeout
fuse settings inside pmxcfs to increase the time stat results are
treated as valid/retained in the page cache:

-----8<-----
diff --git a/data/src/pmxcfs.c b/data/src/pmxcfs.c
index d78a248..e3e807b 100644
--- a/data/src/pmxcfs.c
+++ b/data/src/pmxcfs.c
@@ -935,7 +935,7 @@ int main(int argc, char *argv[])

   mkdir(CFSDIR, 0755);

- char *fa[] = { "-f", "-odefault_permissions", "-oallow_other", NULL};
+ char *fa[] = { "-f", "-odefault_permissions", "-oallow_other", "-oentry_timeout=5", "-oattr_timeout=5", NULL};

   struct fuse_args fuse_args = FUSE_ARGS_INIT(sizeof (fa)/sizeof(gpointer) - 1, fa);

----->8-----

in which case it's even easy to trigger more than double rotation in a
bigger test cluster (stopping all PVE services except for pve-cluster
helps avoiding interference):

on a single node:
$ touch --date yesterday /etc/pve/authkey.pub

in parallel (i.e., via tmux synchronized panes):
-----8<-----
use strict;
use warnings;
use PVE::Cluster;
use PVE::AccessControl;
PVE::Cluster::cfs_update();

# ensure page cache entry is there
PVE::AccessControl::check_authkey(1);
PVE::AccessControl::check_authkey(1);
# now attempt rotation
PVE::AccessControl::rotate_authkey();
----->8-----

Thanks to Wolfgang Bumiller for assistance in triaging and exploring
various avenues of fixing.

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>

author	Fabian Grünbichler <f.gruenbichler@proxmox.com>
	Wed, 13 Jul 2022 13:13:59 +0000 (15:13 +0200)
committer	Thomas Lamprecht <t.lamprecht@proxmox.com>
	Thu, 14 Jul 2022 05:39:05 +0000 (07:39 +0200)
commit	a74d5080b3df4336e363454de2d391663dfe0c6e
tree	26cfb8ded2ad4eed34658ce2d98763f9e05ea53a	tree
parent	fe52bc631c2024fa0d7396830fc99be90905f103	commit \| diff