Suggested-by: Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Alex Hudspith [Mon, 6 Nov 2023 09:17:38 +0000 (09:17 +0000)]
proc: Fix swap handling for cgroups v2 (zero limits)
Since memory.swap.max = 0 is valid under v2, limits of 0 must not be
treated differently. Instead, use UINT64_MAX as the default limit. This aligns
with cgroups v1 behaviour anyway since 'limit_in_bytes' files contain a large
number for unspecified limits (2^63).
Resolves: #534 Signed-off-by: Alex Hudspith <alex@hudspith.io>
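
A minimal sketch of the idea, not the actual lxcfs code: `parse_swap_limit()` is a hypothetical helper that maps a missing value or the literal "max" to UINT64_MAX and keeps 0 as a real limit.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (name and signature are assumptions): turn the
 * contents of a cgroup v2 "memory.swap.max" file into a limit in bytes. */
static uint64_t parse_swap_limit(const char *value)
{
	char *end = NULL;
	unsigned long long parsed;

	/* No file content, or the literal "max": no limit is configured. */
	if (!value || strncmp(value, "max", 3) == 0)
		return UINT64_MAX;

	errno = 0;
	parsed = strtoull(value, &end, 10);
	if (errno != 0 || end == value)
		return UINT64_MAX; /* unparsable: treat as unlimited */

	/* 0 is returned unchanged: it is a valid limit, not "unset". */
	return (uint64_t)parsed;
}
```

With this shape, callers compare against UINT64_MAX to detect "no limit" instead of testing for 0.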
Alex Hudspith [Mon, 6 Nov 2023 09:17:38 +0000 (09:17 +0000)]
proc: Fix swap handling for cgroups v2 (can_use_swap)
On cgroups v2, there are no swap current/max files at the cgroup root, so
can_use_swap must look lower in the hierarchy to determine if swap accounting
is enabled. To also account for memory accounting being turned off at some
level, walk the hierarchy upwards from lxcfs' own cgroup.
Signed-off-by: Alex Hudspith <alex@hudspith.io>
[ added check cgroup pointer is not NULL in lxcfs_init() ] Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
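
The upward walk could look roughly like the sketch below. It is illustrative only: the function name, the `mnt` parameter (the cgroup2 mount point) and the choice to probe `memory.swap.current` with access() are assumptions, not the lxcfs implementation.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical sketch: starting at `cgroup` (e.g. "/a/b"), walk towards
 * the root and report whether any non-root level exposes swap accounting. */
static bool swap_accounting_enabled(const char *mnt, const char *cgroup)
{
	char cg[4096], file[8192];
	char *slash;
	int n;

	if (!mnt || !cgroup || strlen(cgroup) >= sizeof(cg))
		return false;
	strcpy(cg, cgroup);

	/* The loop never probes "/" itself: the v2 root has no swap files. */
	while (cg[0] == '/' && cg[1] != '\0') {
		n = snprintf(file, sizeof(file), "%s%s/memory.swap.current", mnt, cg);
		if (n < 0 || (size_t)n >= sizeof(file))
			return false;
		if (access(file, F_OK) == 0)
			return true;

		slash = strrchr(cg, '/');
		if (slash == cg)
			break; /* the next parent would be the root */
		*slash = '\0'; /* step up one level */
	}
	return false;
}
```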
Tycho Andersen [Wed, 29 Nov 2023 18:49:55 +0000 (11:49 -0700)]
systemd: mkdir -p the target mount dir
This is probably in a postinst for a debian package or a snap somewhere,
but we're repackaging it somewhere and I have an ugly sed to fix it up.
Let's do it here instead.
Kyeong Yoo [Tue, 3 Oct 2023 03:36:51 +0000 (16:36 +1300)]
proc: fix MemAvailable in /proc/meminfo to exclude tmpfs files
The "total_cache" from memory.stat of cgroup includes
the memory used by tmpfs files ("total_shmem"). Considering
it as available memory is wrong because files created
on a tmpfs file system cannot be simply reclaimed.
So the available memory is calculated with the sum of:
* Memory the kernel knows is free
* Memory that is contained in the kernel active file LRU,
that can be reclaimed if necessary
* Memory that is contained in the kernel non-active file
LRU, that can be reclaimed if necessary
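
The sum described above can be sketched as follows. The struct and field names are hypothetical stand-ins for values read from the cgroup's memory.stat; they are not the names used in lxcfs.

```c
#include <stdint.h>

/* Hypothetical container for the values the calculation needs. */
struct mem_stats {
	uint64_t free;          /* memory the kernel knows is free */
	uint64_t active_file;   /* active file LRU, reclaimable */
	uint64_t inactive_file; /* non-active file LRU, reclaimable */
	uint64_t shmem;         /* tmpfs files: deliberately not counted */
};

/* MemAvailable as the sum of the three reclaimable/free components. */
static uint64_t mem_available(const struct mem_stats *s)
{
	return s->free + s->active_file + s->inactive_file;
}
```

Note that `shmem` never enters the sum, which is the point of the change: tmpfs-backed pages show up in the cache total but cannot simply be dropped.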
Cleanup start_loadavg code:
- add a new external symbol load_daemon_v2 with the pthread_create-like signature
- make hacky casts of pthread_t to int (and reverse) unnecessary for new API users
Related to: #610
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Brahmajit Das [Tue, 5 Sep 2023 04:15:06 +0000 (04:15 +0000)]
proc_loadavg.c: Fix incompatible integer to pointer conversion
Newer compilers such as Clang 16 and GCC 14 have certain errors enabled by
default, namely -Werror=incompatible-function-pointer-types, which
results in build errors such as:
proc_loadavg.c:606:10: error: incompatible integer to pointer conversion returning int from a function with result type pthread_t
My patch suppresses the error for now, but a proper fix would be better.
First discovered on Gentoo Linux (bug #894348).
Bug: https://bugs.gentoo.org/894348 Closes: https://github.com/lxc/lxcfs/issues/561 Signed-off-by: Brahmajit Das <brahmajit.xyz@gmail.com>
proc: Fix /proc/cpuinfo not respecting personality
It was found that the personality within the container was not being
properly respected, which for large numbers of CPUs would break
reporting of /proc/cpuinfo in arm32 containers running on an arm64 host.
proc_loadavg: fix ABBA deadlock between read/refresh
The idea of this fix is to always take nested locks in
the same order.
At the same time, we are adding an extra check to insert_node()
that prevents adding a new load_node with the same cgroup
(->cg field) value. This is theoretically possible because
we don't hold .rilock/.lock when we call insert_node().
It looks like this issue has been present since the initial
implementation of loadavg virtualization; it is hard to
reproduce, which is why we had not noticed it.
Fixes: #605 Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
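
As a generic illustration of consistent lock ordering (not the code of this fix, which orders its specific locks by hand), the sketch below uses one common technique: ordering two mutexes by address. The helper names are hypothetical.

```c
#include <pthread.h>
#include <stdint.h>

/* Lock two mutexes in a globally consistent order (lowest address first),
 * so two threads can never each hold one lock while waiting for the other. */
static void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
	if ((uintptr_t)a > (uintptr_t)b) {
		pthread_mutex_t *tmp = a;
		a = b;
		b = tmp;
	}
	pthread_mutex_lock(a);
	if (a != b)
		pthread_mutex_lock(b);
}

/* Unlock order does not matter for deadlock avoidance. */
static void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
	pthread_mutex_unlock(a);
	if (a != b)
		pthread_mutex_unlock(b);
}
```

Whether the caller writes `lock_pair(&x, &y)` or `lock_pair(&y, &x)`, the acquisition order is the same, which is the property that rules out an ABBA deadlock.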
cpuview: start to use interruptible lock primitives
Let's start using fuse-interruptible locks in cpuview.
It's better to start from one place instead of converting everything
at once to prevent global degradations.
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Add a few helper functions which provide fuse-interruptible
versions of the classical pthread locking primitives:
extern int mutex_lock_interruptible(pthread_mutex_t *l);
extern int rwlock_rdlock_interruptible(pthread_rwlock_t *l);
extern int rwlock_wrlock_interruptible(pthread_rwlock_t *l);
Does not change behavior.
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Make tests work with non-hybrid cgroup2 configuration.
- skip cgroupfs emulation tests (it's just obsolete and doesn't emulate cgroup2 properly)
- adapt other tests to cgroup2 (tasks -> cgroup.procs, and so on)
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Unfortunately, it's deprecated and not working properly:
https://github.blog/changelog/2022-08-09-github-actions-the-ubuntu-18-04-actions-runner-image-is-being-deprecated-and-will-be-removed-by-12-1-22/
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
We've lost the 15th column (discard) in /proc/diskstats output.
After this fix the /proc/diskstats format is in full agreement with the 4.18 kernel.
In 5.5+ kernels two new fields were introduced: flush_req/flush_time.
Unfortunately, we can't add support for them as cgroup doesn't provide us
with this stat info.
Tycho Andersen [Thu, 27 Oct 2022 16:23:08 +0000 (10:23 -0600)]
sysfs: don't mask cpus in /sys/devices/system/cpu
The kernel does not mask the cpu%d dirs when they are offlined:
(root) /sys/devices/system/cpu # cat online
0-7
(root) /sys/devices/system/cpu # chcpu -d 4
CPU 4 disabled
(root) /sys/devices/system/cpu # cat online
0-3,5-7
(root) /sys/devices/system/cpu # cat offline
4
(root) /sys/devices/system/cpu # ls -al
total 0
drwxr-xr-x 16 root root 0 Oct 25 20:42 .
drwxr-xr-x 10 root root 0 Oct 25 20:42 ..
drwxr-xr-x 7 root root 0 Oct 25 20:42 cpu0
drwxr-xr-x 7 root root 0 Oct 25 20:42 cpu1
drwxr-xr-x 7 root root 0 Oct 25 20:42 cpu2
drwxr-xr-x 7 root root 0 Oct 25 20:42 cpu3
drwxr-xr-x 5 root root 0 Oct 25 20:42 cpu4
drwxr-xr-x 7 root root 0 Oct 25 20:42 cpu5
drwxr-xr-x 7 root root 0 Oct 25 20:42 cpu6
drwxr-xr-x 7 root root 0 Oct 25 20:42 cpu7
drwxr-xr-x 2 root root 0 Oct 25 20:43 cpufreq
drwxr-xr-x 2 root root 0 Oct 26 15:19 cpuidle
drwxr-xr-x 2 root root 0 Oct 26 15:19 hotplug
-r--r--r-- 1 root root 4096 Oct 25 20:42 isolated
-r--r--r-- 1 root root 4096 Oct 25 20:43 kernel_max
-r--r--r-- 1 root root 4096 Oct 26 15:19 modalias
-r--r--r-- 1 root root 4096 Oct 26 15:19 offline
-r--r--r-- 1 root root 4096 Oct 25 20:42 online
-r--r--r-- 1 root root 4096 Oct 25 20:43 possible
drwxr-xr-x 2 root root 0 Oct 26 15:19 power
-r--r--r-- 1 root root 4096 Oct 25 20:43 present
drwxr-xr-x 2 root root 0 Oct 26 15:19 smt
-rw-r--r-- 1 root root 4096 Oct 25 20:42 uevent
drwxr-xr-x 2 root root 0 Oct 26 15:19 vulnerabilities
let's not mask them in lxcfs either. In particular, we have observed this
causing problems with some JVMs' implementation of
Runtime.getRuntime().availableProcessors().
This is a bit of a strange patch: it seems masking this dir was always
incorrect, so we could go back to just not offering it as an lxcfs
endpoint, and having people use sysfs' implementation directly. But maybe
people are expecting it now, so I've left it as a proxy. Perhaps a more
appropriate patch is to just delete it entirely and add an API extension
note?
Tycho Andersen [Fri, 28 Oct 2022 20:24:54 +0000 (14:24 -0600)]
/proc/stat: render physical cpu number in non-view mode
When the kernel has an offline CPU, it only renders the online CPUs in
/proc/stat.
When in non-use_view mode, /sys/devices/system/cpu/online shows the CPU
numbers as they actually are on the physical system, but /proc/stat used
"virtual" (i.e. always zero-indexed) numbers, which causes confusion for
some applications. Let's use the same use_view logic in /proc/stat as well.
It was discovered that with libfuse3 we lost the FOPEN_DIRECT_IO flag
on (struct fuse_file)->open_flags. I'm sure that this is the reason
for all the strange bugs that our users have encountered recently.
cpuview: fix possible use-after-free in find_proc_stat_node
Our current lock design uses 2 sync primitives.
First (pthread_rwlock) protects hash table buckets.
Second (pthread_mutex) protects each struct cg_proc_stat
from concurrent modification. But the problem is that function
find_proc_stat_node() can return a pointer to the node
(struct cg_proc_stat) which can be freed by prune_proc_stat_history()
call *before* we take pthread_mutex. Moreover, we perform
memory release of (struct cg_proc_stat) in prune_proc_stat_list()
without any protection like refcounter or mutex on (struct cg_proc_stat).
An attempt to guess what happens in:
https://github.com/lxc/lxcfs/issues/565
https://discuss.linuxcontainers.org/t/number-of-cpus-reported-by-proc-stat-fluctuates-causing-issues/15780/14
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
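
The message above does not spell out how the window is closed. One generic way to avoid this class of bug is reference counting, sketched below with a one-slot "table". All names here are hypothetical and this is not claimed to be the approach taken in the fix.

```c
#include <pthread.h>
#include <stdlib.h>

struct stat_node {
	int refs;      /* protected by table_lock */
	int cpu_count; /* example payload */
};

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static struct stat_node *slot; /* stand-in for a hash table bucket */

static int node_insert(int cpu_count)
{
	struct stat_node *n = calloc(1, sizeof(*n));

	if (!n)
		return -1;
	n->refs = 1; /* the table's own reference */
	n->cpu_count = cpu_count;
	pthread_mutex_lock(&table_lock);
	slot = n;
	pthread_mutex_unlock(&table_lock);
	return 0;
}

/* Lookup takes a reference *before* dropping the table lock. */
static struct stat_node *node_find(void)
{
	struct stat_node *n;

	pthread_mutex_lock(&table_lock);
	n = slot;
	if (n)
		n->refs++;
	pthread_mutex_unlock(&table_lock);
	return n;
}

/* Drop one reference; the last holder frees the node. */
static void node_put(struct stat_node *n)
{
	int last;

	pthread_mutex_lock(&table_lock);
	last = (--n->refs == 0);
	pthread_mutex_unlock(&table_lock);
	if (last)
		free(n);
}

/* Pruning unlinks the node and drops only the table's reference. */
static void node_prune(void)
{
	struct stat_node *n;

	pthread_mutex_lock(&table_lock);
	n = slot;
	slot = NULL;
	pthread_mutex_unlock(&table_lock);
	if (n)
		node_put(n);
}
```

A reader that obtained the node through `node_find()` can keep using it after `node_prune()` runs, and releases it with `node_put()` when done.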
cpuview: pass through personality when reading cpuinfo
Let's change the processing thread's personality if the caller's personality
is different. This allows /proc/cpuinfo to be read properly in
some cases (arm64 relies on current->personality inside the Linux kernel).
https://github.com/lxc/lxcfs/issues/553
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Mathias Gibbens [Thu, 17 Nov 2022 21:57:58 +0000 (21:57 +0000)]
Fix build on ia64
The relevant code was added in commit 35acc24, but the function/macro
prctl_arg() didn't seem to be defined anywhere in the repo. lxc
currently has a corresponding macro defined in src/lxc/macro.h that
casts the value to an unsigned long. But 0 doesn't require any special
handling, so remove the call to prctl_arg().
Verified that the code compiles properly on Debian's ia64 porterbox
(yttrium).
With fuse3 `fuse_get_context` returns NULL before fuse was
fully initialized, so we must not access it.
Further, we call 'do_reload' for normal initialization as
well, so let's prevent that from re-initializing the
bindings initially and only do this on actual reloads,
otherwise we do it twice on startup.
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Fixes #549
`opathdir` was used to replace `opendir` in order to ensure
`O_NOFOLLOW` and `O_CLOEXEC` were set, however it also added
`O_PATH`, which prevents `readdir`/`getdents` from being used on
it, causing the `/sys/devices/system/cpu/<subdir>`
directories to be empty.
Instead, let's have an `opendir_flags` utility which simply
passes additional flags to the `open(..., O_DIRECTORY)` call
preceding `fdopendir()`.
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
When introducing versioned options, we started using fuse's
"init" callback in order to tell the library to set
`can_use_sys_cpu` and `has_versioned_opts` accordingly.
However, we forgot to also do this on a reload. Fix this by
simply calling `lxcfs_fuse_init()` in `do_reload()` as well.
Additionally: ignore lxcfs_fuse_init()'s return value.
We just "passed through" the private_data from fuse which is
set via the `fuse_main()` call.
It's better to not leave this up to the library anyway in
order to make it easier to be fuse version agnostic in the
future.
Without this, issuing a reload to lxcfs would cause
files in `/sys/devices/system/cpu/` to be visible via
`readdir`, but accessing them would fail:
~ # ls /sys/devices/system/cpu/
ls: /sys/devices/system/cpu/cpuidle: No such file or directory
ls: /sys/devices/system/cpu/uevent: No such file or directory
(...)