Dominik Csapak [Wed, 16 Nov 2022 15:47:58 +0000 (16:47 +0100)]
datacenter config: add options to control tag usage rights
By adding a 'user-tag-privileges' and 'registered-tags' option.
The former sets the general policy by which "normal" users (with just
'VM.Config.Options' on the respective guest) can set or delete tags
on a guest, and the latter is a list of registered tags only settable
by users with admin-like permission ('Sys.Modify' on '/'); it is
designed to be used as source/target for actions like backup jobs in
the future.
Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
[T: drop outdated stuff & reword/work commit message] Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Dominik Csapak [Wed, 16 Nov 2022 15:47:56 +0000 (16:47 +0100)]
cluster: add get_guest_config_properties
Akin to the singular get_guest_config_property, but with the
possibility to query a list of properties.
Uses the CFS_IPC_GET_GUEST_CONFIG_PROPERTIES introduced in the
previous patch.
Note that the same details apply w.r.t. parsing and permissions as
the singular variant, iow. one needs to take caution and filter
allowed guests views on call site when using this.
Dominik Csapak [Wed, 16 Nov 2022 15:47:55 +0000 (16:47 +0100)]
pmxcfs: add IPC call to get multiple guest config properties at once
Previously we used the existing IPC call just for getting the `lock`
property of virtual guests in the cluster resource API call, but for
the tag display we'll to get another one and calling this twice seems
rather non-ideal.
Thus introduce a successor: CFS_IPC_GET_GUEST_CONFIG_PROPERTIES
It allows one to get multiple properties from a single, or all
virtual guest in-memory configs in one go. Keep the existing IPC call
as is for backward compatibility and add this as separate, new one.
The new IPC command basically behaves the same as the previous
CFS_IPC_GET_GUEST_CONFIG_PROPERTY, but takes a list of properties
instead and returns multiple properties per guest.
The existing perl wrapper around the IPC call to get a single
property was switched over to use the new IPC call too, so we'll
be able to drop the old IPC command with the next major release if
nothing comes up.
= Benchmark =
== Setup ==
Proxmox VE in a VM with CPU type host (Intel 12700k) and 4 cores
10000 typical configs with both 'lock' and 'tags' set at the end, and
fairly long tags ('test-tag1,test-tag2,test-tag3') (normal VM with a
snapshot, ~ 1 KiB)
== Test ==
Average of 100 runs each with time in milliseconds
== Previous Results ==
num props total time time per iteration
1 1054.2 10.2
== Results with this Patch ==
num props total time time per iter function
2 1332.2 13.2 get_properties
1 1051.2 10.2 get_properties
2 2082.2 20.2 get_property, 2 separate calls
1 1034.2 10.2 get_property
So, a call with the new code for one property is the same as with the
old code, and adding a second property only adds a bit of additional
time (in this case ~30%).
if processing a corosync.conf update is delayed on a single node,
reloading the config too early can have disastrous results (loss of
token and HA fence). artifically delay the reload command by one second
to allow update propagation in most scenarios until a proper solution
(e.g., using broadcasting/querying of locally deployed config versions)
has been developed and fully tested.
reported on the forum:
https://forum.proxmox.com/threads/expanding-cluster-reboots-all-vms.110903/
reported issue can be reproduced by deploying a patched pmxcfs on
non-reloading node that sleeps before writing out a broadcasted
corosync.conf update and adding a node to the cluster, leading to the
following sequence of events:
- corosync config reload command received
- corosync config update written out
which causes that particular node to have a different view of cluster
topology, causing all corosync communication to fail for all nodes until
corosync on the affected node is restarted (the on-disk config is
correct after all, just not in effect).
Stefan Sterz [Wed, 11 May 2022 09:27:07 +0000 (11:27 +0200)]
cluster config: mark qdevice end point as protected
The qdevice status endpoint retrieves some required info via writing
a status command to the qdevice socket
(/var/run/corosync-qdevice/corosync-qdevice.sock) and then parsing
the response.
Since the socket has 0755 permissions and is owned by root it can not
be written by the unprivileged pveproxy that runs as www-data user.
Relay the endpoint to the privileged pvedaemon by marking it as
`protected` to allow the end point to retrieve data from the qdevice
properly.
Fixes an issue where the api would return an empty object.
Signed-off-by: Stefan Sterz <s.sterz@proxmox.com> Tested-by: Oguz Bektas <o.bektas@proxmox.com>
[ T: extend commit message ] Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Stefan Sterz [Thu, 24 Feb 2022 14:21:48 +0000 (15:21 +0100)]
parse datacenter config: remove "\s*" from comment regex
To be consistent with PBS's implementation of multi-line comments
remove "\s*" here too. Since the regex isn't lazy .* matches
everything \s* would anyway. (Note that new lines occurs after "$").
struct clog_base {
uint32_t size; // total size of this clog_base
uint32_t cpos; // index into data, starts counting at start of clog_base, initially 0
char data[];
};
an entry consists of indices of the next and previous entries and
various fields (fixed-length ones omitted here):
typedef struct {
uint32_t prev; // index of previous entry, or 0 if none exists
uint32_t next; // index of next entry
[..] // fixed-length fields
uint8_t node_len;
uint8_t ident_len;
uint8_t tag_len;
uint32_t msg_len;
char data[]; // node+ident+tag+msg - variable-length fields
} clog_entry_t;
the next and prev indices are calculated when allocating a new entry,
and the position of the current entry 'cpos' is updated accordingly
(clog_alloc_entry):
- size of the entry is padded with up to 7 bytes
- first entry goes to index 8
- second and subsequent entries go to the current entry's 'next' index
- if the current entry's 'next' index is out of bonds, the first entry
is overwritten => wrap-around
- the 'prev' index of the new entry is set to cpos
- cpos is set to the index of the new entry
- the 'next' index of the new entry is set to its index+padded size
when iterating over the entries, the following bounds are used to follow
the 'prev' links starting at the current entry:
while this handles a not-yet-wrapped around ring buffer (cpos would be 0
when reaching the first entry), and tries to handle wrap-arounds by
terminating when reaching a 'red-zone' of 'CLOG_MAX_ENTRY_SIZE' starting
at the current entry (this covers the current entry which was already
visited as first entry during the iteration, and the next entry after it
which might have been overwritten) - but it's possible that entries line
up so that the wrap-around 'prev' index of the first entry points to a
location *before* the current entry.
for example, looking at clog_base with S being the size field, C being
the cpos field, followed by the actual data. N/P are the next/prev
indices of the entry at C, Q denotes the 'prev' index of the first entry
in the data array, and 'R' the red zone used for the loop check in case
of wrap-around.
first, fill up the buffer with six large entries:
Q P C N
| | | |
| | | |
v v v v
+-+-+------+------+------+------+------+------+-+
| | | | | | | | |x|
| | | 1 | 2 | 3 | 4 | 5 | 6 |x|
| | | | | | | | |x|
+-+-+------+------+------+------+------+------+-+
S C RRRRRRRRRRR
iterating from C backwards ends up at Q being 0, terminating the loop
without a wrap-around after having visit 6->1
now the next (in this example, smaller) entry that gets allocated/insert
needs to wrap around, because the empty space at the end (denoted by
XXX) is too small:
C N QP
| | ||
| | ||
v v vv
+-+-+------+------+------+------+------+------+-+
| | | | | | | | |x|
| | | 7 | 2 | 3 | 4 | 5 | 6 |x|
| | | | | | | | |x|
+-+-+------+------+------+------+------+------+-+
S C RRRRRRRRRRR
iterating backwards from C terminates the loop when reaching the red
zone, with the (second) entry no longer being considered since it partly
overlaps it. only 7->3 are visited.
adding more entries we end up with the following layout:
P QC N
| || |
| || |
v vv v
+-+-+------+---+---+---+---+---+---+---+---+--+-+
| | | | | | | | | | | |##|x|
| | | 7 | 8 | 9 |10 |11 |12 |13 |14 |15 |#6|x|
| | | | | | | | | | | |##|x|
+-+-+------+---+---+---+---+---+---+---+---+--+-+
S C RRRRRRRRRRR
with # denoting space previously occupied the last large entry (#6)
which is still unmodified (the rest of that entry's data has been
overwritten by entries #14 and #15).
iterating from C (to the left/P) the loop ends up at entry #7, follows
the link to Q (which satisfies the loop bounds as Q < C), and the data
starting at (invalid index) Q gets interpreted as an entry. it is
possible (though even more unlikely than the partial overwrite case)
that Q and C line up perfectly, which would cause the loop to become an
infinite loop. the loop *should* terminate after having visited 15-7,
without wrapping around.
note that the actual sizes of the entries are not relevant, the
requirements are:
- entry before last wrap-around must be big enough that entry of current
index can overtake it without another wrap-around
- method that does iteration must be called before next wrap-around
the fix is obviously trivial once the issue became apparent - when
wrapping around during iteration, additionally check that we are not
jumping across the red zone into already invalidated parts of data.
clusterlog_merge is technically not affected since it aborts before a
wrap-around anyway, but it doesn't hurt to have the checks consistently
in case this ever changes.
thanks to @kev1904 on our community forums for reporting and providing the data
to nail the cause down fast!
the recommended way is to first shutdown, then delnode, and never let it
come back online, in which case corosync-cfgtool won't be able to kill
the removed (offline) node.
also, the order was wrong - if we first update corosync.conf to remove
the node entry from the nodelist, corosync doesn't know about the nodeid
anymore, so killing will fail even if the node is still online.
currently, when veth or tap interfaces are plugged to bridge,
an igmp v3 report is broadcasted to the network, with the
bridge mac adddress.
Users have reported problems with hetzner for example, blocking the server
because of the unknown mac flooding the network.
https://forum.proxmox.com/threads/proxmox-claiming-mac-address.52601/page-6#post-421676
some traces:
ip addr:
190: fwbr109i0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 22:5f:0b:cb:ac:42 brd ff:ff:ff:ff:ff:ff
ebtable log:
Oct 6 09:46:24 kvmformation3 kernel: [437256.753355] MAC-FLOOD-F IN=fwpr109p0 OUT=eno1 MAC source = 22:5f:0b:cb:ac:42 MAC dest = 01:00:5e:00:00:16 proto = 0x0800 IP SRC=0.0.0.0 IP DST=224.0.0.22, IP tos=0xC0, IP proto=2
Stoiko Ivanov [Thu, 11 Nov 2021 15:42:09 +0000 (16:42 +0100)]
sysctl snippet: move to /usr/lib and prefix with 10-
following best-practices according to `sysctl.d(5)`:
* Packages should install their configuration files in /usr/lib/ ...
* It is recommended to prefix all filenames with a two-digit number
and a dash ...
the conffile removal is inspired by how it was done in `procps` (one
of the few packages in the debian repository, which did this
transition) and by following `dpkg-maintscript-helper(1)` and
`deb-conffiles(5)` (the former recommending the latter)
The choice of 10- as prefix is due to pve-container shipping its
snippet with that prefix already. other packages use higher numbers
(e.g. systemd - 50-)
Tested on 2 VMs (one with modifications, the other without) - worked
as advertised (the modified file was kept as
/etc/sysctl.d/pve.conf.dpkg-old and the upgrade notified me of the
change)
api: join info: return explicit error code for no-cluster
allows an API client to more easily differ between this OK "error"
and an actual exception.
Note that I'd rather now just return undef or an empty object for the
no cluster case (not to sure about the original reasons about the die
anymore), but that would be a breaking change, and in fact it would
break current pve-manager versions out there, so schedule that for
the next major release (if we still want it then)
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Similar to notes for nodes.
datacenter.cfg normally uses key-value pairs defined in the schema.
We bypass this to allow potentially long comments at the top.
We have some users running into issues in some cases, like syncing
huge user base through LDAP into users.cfg or having a few thousands+
of HA services, as then the per-file limit is exhausted.
Bumping that one provides only half of the solution as the total
limit of 30 MiB would only allow a few files getting that big, or
reduce the amount left over for actual guest configurations quite a
bit.
So also bump the total filesystem limit from 30 MiB to 128 MiB, so by
a factor of ~4 and in the same spirit bump the maximal numbers of
inodes (i.e., different files) from 10k to 256k, which pmxcfs can
handle still rather easily (tested with touch) and would allow to max
out the full FS limit with 512 byte files, which fits small guest
configs, so sounds like an OK proportioned limit.
That should give use quite some wiggle room again, and should be
relatively safe as most of our access is rather small and on a few
files only, only root has full access anyway and that user can break
everything already, so not much lost here.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Aaron Lauterer [Mon, 3 May 2021 10:00:11 +0000 (12:00 +0200)]
pve-cluster.service: remove ceph.service
The ceph.service file has been removed in pve-manager commit be244f1.
Therefore, there is no need to reference it anymore. This also avoids
showing the `ceph.service` as a `not found` unit.
```
will get you the following error on perl 5.32
```
garbage after JSON object, at character offset 2 (before "\x{0}") at -e line 1.
```
Note, I did not find anything related in the perldelta aricles for
the 28 -> 30 or 30 -> 32 update, the first one made a bigger jump for
the JSON module version used, so possibly a change there.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Thomas Lamprecht [Thu, 22 Apr 2021 19:38:26 +0000 (21:38 +0200)]
cfs lock: avoid confusing lock prefix on error
we have lots of forum posts where users think that the locking was
the error, not the actual error message from the called code.
This has limited value as general-applied prefix, if a code requires
the lockid or whatever to be included in the error message they can
already do so, so just re-raise the error and be done, at least if it
is a error from the code and not from the lock setup,.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Thomas Lamprecht [Thu, 22 Apr 2021 08:46:42 +0000 (10:46 +0200)]
pmxcfs: db: tell query planner that prepared statement are long living
SQLITE_PREPARE_PERSISTENT
The SQLITE_PREPARE_PERSISTENT flag is a hint to the query planner
that the prepared statement will be retained for a long time and
probably reused many times. Without this flag,
sqlite3_prepare_v3() and sqlite3_prepare16_v3() assume that the
prepared statement will be used just once or at most a few times
and then destroyed using sqlite3_finalize() relatively soon. The
current implementation acts on this hint by avoiding the use of
lookaside memory so as not to deplete the limited store of
lookaside memory. Future versions of SQLite may act on this hint
differently.
-- https://sqlite.org/c3ref/c_prepare_normalize.html#sqlitepreparepersistent
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Thomas Lamprecht [Thu, 22 Apr 2021 08:18:58 +0000 (10:18 +0200)]
pmxcfs: db: use SQLITE_STATIC to avoid memory copies
we can trust that we own *value and *name until the sqlite statement
was executed, so use the STATIC bind flag to tell sqlite that it does
not need to make it's own copy in the bind statement.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
pmxcfs: do not grant LXC configs o+r permissions anymore
This was initially done because of some hook reading the config from
an unprivileged namespace when using unprivileged containers.
But, we nowadays do not do this anymore, either setup stuff before to
or use another source for getting required information (e.g., our
autodev hook uses "/var/lib/lxc/$vmid/devices").
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Removing them now could count as compat breakage, for users which
still depend on some of this weird behavior it's nicer if we do this
more explicitly with 7.0
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
cpg_mcast_joined (and transitively, cpg_join/leave) are not thread-safe.
pmxcfs triggers such operations via FUSE and CPG dispatch callbacks,
which are running in concurrent threads.
accordingly, we need to protect these operations with a mutex, otherwise
they might return CS_OK without actually doing what they were supposed
to do (which in turn can lead to the dfsm taking a wrong turn and
getting stuck in a supposedly short-lived state, blocking access via
FUSE and getting whole clusters fenced).
huge thanks to Alexandre Derumier for providing the initial bug report
and quite a lot of test runs while debugging this issue.