Dominik Csapak [Fri, 21 Oct 2022 08:31:17 +0000 (10:31 +0200)]
authenticate_user: pass undef instead of empty $tfa_challenge to authenticate_2nd_new
just above, we check & return if $tfa_challenge is set, so there is no
way that it would be set here. To make it clearer that it must be undef
here pass it as such.
Dominik Csapak [Fri, 21 Oct 2022 08:31:15 +0000 (10:31 +0200)]
authenticate_2nd_new: only lock tfa config for recovery keys
we currently only need to lock the tfa config when we got a recovery key
as a tfa challenge response, since that's the only thing that can
actually change the tfa config (every other method only reads from
there).
so to do that, factor out the code that was inside the lock, and call it
with/without lock depending on the tfa challenge response
Stoiko Ivanov [Mon, 29 Aug 2022 16:07:55 +0000 (18:07 +0200)]
auth ldap/ad: compare group member dn case-insensitively
currently we add a user to a group if it's DN is listed in the
member-attributes of a group. The comparison for this is done via
existence check of a hash key, which is case-sensitive.
The equality for DNs is defined in a not straight forward way [0]:
(roughly translating to you need to honor the equality rules for each
'component' (RDN) of the DN) and is implementation-specific (Microsoft
AD is case-insensitive).
While this patch does not address the complete complexity of comparing
DNs it should work fine in practice.
issue with case-sensitive mismatches was reported in our community
forum:
https://forum.proxmox.com/threads/.113387
tested against a local test-vm used for reproducing the issue.
there is a (hard to trigger) race that can cause a double rotation of
the auth key, with potentially confusing fallout (various processes on
different nodes having an inconsistent view of the current and previous
auth keys, resulting in "random" invalid ticket errors until the next
proper key rotation 24h later).
the underlying cause is that `stat()` calls are excempt from our
otherwise non-cached/direct_io handling of pmxcfs I/O, which allows the
following sequence of events to take place:
LAST: mtime of current auth key
- current epoch advances to LAST + 24h
the following can be arbitrarly interleaved between node A and B:
- LAST+24h node A: pvedaemon/pvestatd on node A calls check_authkey(1)
- LAST+24h node A: it returns 0 (rotation required)
- LAST+24h node A: rotate_key() is called
- LAST+24h node A: cfs_lock_authkey is called
- LAST+24h node B: pvedaemon/pvestatd calls check_authkey(1)
- LAST+24h node B: key is not yet cached in-memory by current process
- LAST+24h node B: key file is opened, stat-ed, read, parsed, and content+mtime
is cached (the kernel will now cache this stat result for 1s unless
the path is opened)
- LAST+24h node B: it returns 0 (rotation required)
- LAST+24h node B: rotate_key() is called
- LAST+24h node B: cfs_lock_authkey is called
the following is mutex-ed via a cfs_lock:
- LAST+24h node A: lock is obtained
- LAST+24h node A: check_authkey() is called
- LAST+24h node A: key is stat-ed, mtime is still (correctly) LAST,
cached mtime and content are returned
- LAST+24h node A: it returns 0 (rotation still required)
- LAST+24h node A: get_pubkey() is called and returns current auth key
- LAST+24h node A: new keypair is generated and persisted
- LAST+24h node A: cfs_lock is released
- LAST+24h node B: changes by node A are processed by pmxcfs
- LAST+24h node B: lock is obtained
- LAST+24h node B: check_authkey() is called
- LAST+24h node B: key is stat-ed, mtime is (incorrectly!) still LAST
since the stat call is handled by the kernel/page cache, not by
pmxcfs, cached mtime and content are returned
- LAST+24h node B: it returns 0 (rotation still required)
- LAST+24h node B: get_pubkey() is called and returns either previous or
key written by node A (depending on whether page cache or pmxcfs
answers stat call)
- LAST+24h node B: new keypair is generated, key returned by last
get_pubkey call is written as old key
the end result is that some nodes and process will treat the key
generated by node A as "current", while others will treat the one
generated by nodoe B as "current". both have the same mtime, so the
in-memory cache hash won't be updated unless the service is restarted or
another rotation happens. depending on who generated the ticket and who
attempts validating it, a ticket might be rejected as invalid even
though the generating party would treat it as valid, and time on all
nodes is properly synced.
there seems to be now way for pmxcfs to pro-actively invalidate the page
cache entry safely (since we'd need to do so while writes to the same
path can happen concurrently), so work around by forcing an open/close
at the (stat) call site which does the work for us. regular reads are
not affected since those already bypass the page cache entirely anyway.
thankfully in almost all cases, the following sequence has enough
synchronization overhead baked in to avoid triggering the issue almost
entirely:
- cfs_lock
- generate key
- create tmp file for old key
- write tmp file
- rename tmp file into proper place
- create tmp file for new pub key
- write tmp file
- rename tmp file into proper place
- create tmp file for new priv key
- write tmp file
- rename tmp file into proper place
- release lock
that being said, there has been at least one report where this was
triggered in the wild from time to time.
it is easy to reproduce by increasing the attr_timeout and entry_timeout
fuse settings inside pmxcfs to increase the time stat results are
treated as valid/retained in the page cache:
in which case it's even easy to trigger more than double rotation in a
bigger test cluster (stopping all PVE services except for pve-cluster
helps avoiding interference):
on a single node:
$ touch --date yesterday /etc/pve/authkey.pub
in parallel (i.e., via tmux synchronized panes):
-----8<-----
use strict;
use warnings;
use PVE::Cluster;
use PVE::AccessControl;
PVE::Cluster::cfs_update();
# ensure page cache entry is there
PVE::AccessControl::check_authkey(1);
PVE::AccessControl::check_authkey(1);
# now attempt rotation
PVE::AccessControl::rotate_authkey();
----->8-----
Thanks to Wolfgang Bumiller for assistance in triaging and exploring
various avenues of fixing.
Mira Limbeck [Wed, 15 Jun 2022 14:09:50 +0000 (16:09 +0200)]
fix #4074: increase API OpenID code size limit to 2048
Azure AD seems to have a variable authorization code size, depending on
the browser state according to one report in bug #4074 [0].
Sometimes the size is greater than our current limit of 1024, so
increase it to 2048.
The RFC [1] mentions that there is no limit to the code size, but based on
current experience, a size limit of 2048 might be enough for every
current OpenID Connect provider.
to detect similar issues to that fixed in the previous commit early on
and without the potential for dangerous side-effects.
root@pam is intentionally still allowed before the check in case such
situations can be triggered by misconfiguration - root@pam can then
still clean up the affected configs via the GUI/API, and not just via
manual editing.
the previous version using an ACL path of '/access/users/{userid}' was
broken for non-root users, since the '@' character always contained in a
userid is not allowed in ACL paths.
this effectively meant that creating API tokens only worked for:
- root@pam (ACL checks skipped altogether)
- users with User.Modify on '/' with propagation (the roles/privs for
'/' are propagated to the undefined path in this case)
- users creating their own tokens (first branch of 'or')
the userid-group check is used for all other modifications of user
entities, so it can also be used for creating/modifying/removing API
tokens.
when multiple roles are defined on a path that share a privilege, this
randomly took the propagation flag for the priv from the last role
encountered. since perl hashes are iterated randomly, this means the
propagation flag was sometimes set correctly, and sometimes not.
note that this propagation flag is only used for display/dumping
purposes, and for intersection with token privs (see next commit).
actual handling of propagation happens on the role level in
PVE::AccessControl::roles().
modified test case (spuriously) fails without the fix.
Dominik Csapak [Mon, 28 Mar 2022 12:38:03 +0000 (14:38 +0200)]
fix #3668: realm-sync: replace 'full' & 'purge' with 'remove-vanished'
Both, 'full' & 'purge', configure what is removed when an entry
(or property) is not returned anymore from the sync result. For users
it may be thus easier to understand setting a single property without
wondering about the interaction of the two specific ones.
The new 'remove-vanished' is a list of things to remove, namely:
* 'acl'
* 'entry'
* 'properties'.
The old 'full' gets mapped to 'entry;properties' and an old 'purge'
to the 'acl'
Keep the old properties for backwards-compatibility but transform to
the new list on edit. Add a deprecation notice to the description
with a TODO reminder to remove with 8.0
Note, this commit changes two things slightly:
* 'purge' (now remove-vanished -> acl) does now remove ACLs for
vanished entries even when we keep those entries. Previously those
were only deleted when we also removed the entries (this can be
seen by the changed regression test)
* since 'remove-vanished' is empty by default, only the scope must be
given explicitly on sync or the sync-default. Previously 'full' and
'purge' must have been configured explicitly (no big deal, since
the default is to not remove anything)
Bug #3668 is fixed when not selecting 'properties' for the sync, so
that existing (custom) properties are not deleted on update
Suggested-by: Thomas Lamprecht <t.lamprecht@proxmox.com> Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
[ T: some commit message typos/wording ] Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
unchanged: the user 'test@pve' can allocate new '@pve' users, but
only if the created user will belong to at least one of 'test'
(direct ACL for that user) or 'test3' (indirect ACL via 'test' group)
groups.
changed: if the user 'test@pve' updates an existing user, they need
to (A) have 'User.Modify' on at least one existing group of that
user, and (B) 'User.Modify' on all of the groups passed in via the
'groups' parameter. A is the general rule for 'allowed to modify
user' across the board, but was missing for this specific variant of
the check. B was the case before, but just checking this without also
checking A allows a user to pull off-limits users into groups that
they can modify, which then in turn allows them to modify those users
via A which is now passing.
for example, without this patch 'test@pve' would be able to add
'other@pve' to either 'test' or 'test3', and then in turn call any of
the API endpoints that require 'User.Modify' on a user's group
(change TFA, change password or delete user if realm is pve, ..).
all the other userid-group checks without group_param set remain
unchanged as well, since $check_existing_user is true in that case.
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com> Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
rpcenv: skip undef propagation flag in more places
these are just cosmetic fixes/safeguards against future bugs -
compute_api_permissions is used to set the 'cap' object to hide parts of
the GUI that are not usable without the corresponding privs in the
backend anyway, and get_effective_permissions is only used to return the
permission tree without a specific path query.
this could return undef for the propagation flag instead of 1/0, leading
to confusing displays of permission trees. all the actual checks using
the returned hash check for definedness anyway, so the actual
privileges checked and the displayed ones were not identical.
The userid-* permission check variants work on
$param->{userid} directly which does not exist for this
call. Also, they work on the realm of the user being
checked, rather than the realm provided as parameter.
The result was that as non-root user this always failed
with the message "userid '' too short"
Fix this by making the check explicitly work like in the
description.
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com> Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
instead of restricting listing tfa entries of others to
root@pam, perform the same checks the user-list does and
which also reflect the permissions of the api calls actually
operating on those users, so, `User.Modify` on the user (but
also `Sys.Audit`, since it's only a read-operation, just
like the user index API call)
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
In PBS we don't support this, so the current TFA API in rust
does not support this either (although the config does know
about its *existence*).
For now, yubico authentication will be done in perl. Adding
it to rust the rust TFA crate would not make much sense
anyway as we'd likely not want to use the same http client
crate in pve and pbs anyway (since pve is all blocking code
and pbs is async...)
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
The main difference here is that we have no separate api
path for verification. Instead, the ticket api call gets an
optional 'tfa-challenge' parameter.
This is opt-in: old pve-manager UI with new
pve-access-control will still work as expected, but users
won't be able to *update* their TFA settings.
Since the new tfa config parser will build a compatible
in-perl tfa config object as well, the old authentication
code is left unchanged for compatibility and will be removed
with pve-8, where the `new-format` parameter in the ticket
call will change its default to `1`.
The `change_tfa` call will simply die in this commit. It is
removed later when adding the pbs-style TFA API calls.
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>