zhang2639 [Thu, 24 May 2018 10:42:37 +0000 (18:42 +0800)]
Calculate and read the average load.
Use a load daemon to calculate the loadavg and proc_loadavg_read() to read it.
calc_pid : find the process pids from the cgroup path of a container.
calc_load : calculate the loadavg of a container.
refresh_load : refresh the loadavg of a container.
load_begin : traverse the hash table and update it.
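Roughly speaking, calc_load follows the kernel's fixed-point load-average formula; a minimal sketch (the constants are the kernel's, the surrounding lxcfs code may differ):

    #define FSHIFT  11              /* bits of fixed-point precision */
    #define FIXED_1 (1 << FSHIFT)   /* 1.0 in fixed-point */
    #define EXP_1   1884            /* 1/exp(5s/1min) in fixed-point */

    /* Decay the previous load and mix in the number of currently
     * runnable tasks of the container. */
    static unsigned long calc_load(unsigned long load, unsigned long exp,
                                   unsigned long active)
    {
        unsigned long newload;

        active = active > 0 ? active * FIXED_1 : 0;
        newload = load * exp + active * (FIXED_1 - exp);
        if (active >= load)
            newload += FIXED_1 - 1;
        return newload / FIXED_1;
    }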
zhang2639 [Thu, 24 May 2018 10:27:34 +0000 (18:27 +0800)]
Use hash table to store load information
struct load_head{} contains three locks for thread synchronization
and a pointer to the hash list.
struct load_node{} contains the load information of a specific container and
pointers to the hash node.
static struct load_head *load_hash[LOAD_SIZE] is the hash table.
calc_hash : get the hash of a specific container.
init_load : initialize hash table.
insert_node : insert a container node to the hash table.
locate_node : find the specified container node.
del_node : delete the specified container node and return the node
after it.
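A minimal sketch of such a table (field and helper names here are illustrative, not necessarily the ones used in the patch):

    #include <pthread.h>

    #define LOAD_SIZE 100                 /* illustrative table size */

    struct load_node {
        char *cg_path;                    /* cgroup path of the container */
        unsigned long avenrun[3];         /* 1, 5 and 15 minute load */
        unsigned int run_pid;             /* runnable tasks */
        unsigned int total_pid;           /* total tasks */
        struct load_node *next;           /* hash-chain pointers */
        struct load_node **pre;
    };

    struct load_head {
        pthread_mutex_t lock;             /* insert/delete on the chain */
        pthread_mutex_t rdlock;           /* readers of the load values */
        pthread_rwlock_t rilock;          /* refresh vs. read/insert */
        struct load_node *next;           /* first node of the chain */
    };

    static struct load_head *load_hash[LOAD_SIZE];

    /* Map the cgroup path of a container to a bucket index. */
    static int calc_hash(const char *name)
    {
        unsigned int hash = 0;

        while (*name)
            hash = hash * 31 + (unsigned char)*name++;
        return hash % LOAD_SIZE;
    }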
Aaron Sokoloski [Mon, 4 Dec 2017 18:30:37 +0000 (12:30 -0600)]
Change the MemAvailable figure in /proc/meminfo to include cache memory -- Fixes #175, I think.
MemAvailable represents roughly how much more memory we can use before
we start swapping. Page cache memory can be reclaimed if it's needed
for something else, so it should count as available memory. This
change should also fix the "available" column of the "free" command,
as well as the "avail Mem" value in "top", both of which come from
MemAvailable.
Note that this isn't perfectly accurate. On a physical machine, the
value for MemAvailable is the result of a calculation that takes into
account that when memory gets low (but before it's completely
exhausted), kswapd wakes up and starts paging things out. See:
I tried to think of a way to be more exact, but this calculation
includes figures that we don't have available for a given cgroup
hierarchy, such as reclaimable slab memory and the low watermark for
zones. So it's not really feasible to reproduce it exactly.
Anyway, since the kernel calculation itself is just an estimation, it
doesn't seem too bad that we're a little bit off. Adding in the
amount of memory used for page cache seems much better than what we
were doing before (just copying the free memory figure), because that
can be wrong by gigabytes.
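In terms of the per-cgroup figures lxcfs already computes, the change roughly amounts to the following (a sketch with illustrative variable names, not the literal patch):

    /* Before: MemAvailable simply mirrored the free figure.
     *   memavailable = memlimit - memusage;
     * After: reclaimable page cache counts as available memory too. */
    unsigned long memavailable = memlimit - memusage + cached;   /* kB */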
Aaron Sokoloski [Sat, 2 Dec 2017 18:43:06 +0000 (12:43 -0600)]
Fix inaccurate values in /proc/meminfo for containers with child cgroups
The values for Cached, Active, Inactive, Active(anon), Inactive(anon),
Active(file), Inactive(file), and Unevictable are derived/computed
from these values in the relevant memory.stat:
However, these values apply only to the cgroup of the lxc container
itself. If your container uses memory cgroups internally, and thus
the container cgroup has children, their memory is not counted.
In order to take the memory usage of child cgroups into account, we
need to look at the "total_" prefixed versions of these values.
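Concretely, when parsing the container's memory.stat this means keying on the "total_"-prefixed counters; a small sketch (parse_total_cache is an illustrative helper, not from the patch):

    #include <stdio.h>

    /* Pull the hierarchical page-cache figure (in bytes) out of memory.stat.
     * "total_cache" includes all child cgroups, "cache" only the cgroup itself. */
    static int parse_total_cache(FILE *f, unsigned long *cached)
    {
        char line[512];

        while (fgets(line, sizeof(line), f))
            if (sscanf(line, "total_cache %lu", cached) == 1)
                return 0;
        return -1;
    }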
In order to enable proper unprivileged cgroup delegation on newer kernels we
need to delegate not just the "cgroup.procs" file to the user but also
"cgroup.threads". But don't report an error in case the latter doesn't exist.
Also delegate "cgroup.subtree_control" to enable users to hand over controllers
to descendant cgroups.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
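A sketch of the delegation step described above (delegate_files and the fixed-size path buffer are illustrative simplifications):

    #include <errno.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>

    /* Hand the cgroup interface files over to the logged-in user.
     * cgroup.threads may not exist on older kernels, so ENOENT is ignored. */
    static int delegate_files(const char *cg, uid_t uid, gid_t gid)
    {
        char path[4096];

        snprintf(path, sizeof(path), "%s/cgroup.procs", cg);
        if (chown(path, uid, gid) < 0)
            return -1;

        snprintf(path, sizeof(path), "%s/cgroup.threads", cg);
        if (chown(path, uid, gid) < 0 && errno != ENOENT)
            return -1;

        snprintf(path, sizeof(path), "%s/cgroup.subtree_control", cg);
        if (chown(path, uid, gid) < 0)
            return -1;

        return 0;
    }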
yuwang.yuwang [Fri, 20 Oct 2017 06:28:03 +0000 (14:28 +0800)]
Fix wrong calc of swaptotal and swapfree
The code made swaptotal equal to (memswlimit - memlimit). That is wrong,
because the swap usage of a cgroup/container can range over [0, memswlimit]:
if the non-swap memory used by all tasks in the cgroup/container is very
small, swaptotal can come close to memswlimit.
So make swaptotal min(host swaptotal, memswlimit).
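In code this boils down to a clamp along these lines (variable names are illustrative):

    /* A container's swap can never exceed the host's swap or the cgroup's
     * memsw limit, so report the smaller of the two as SwapTotal. */
    swaptotal = hostswtotal < memswlimit ? hostswtotal : memswlimit;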
In order to not require a user to manually list all cgroup controllers
in their PAM configuration, add an "all" option that effectively just
sets all controllers as read-write.
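With that option a PAM configuration line could look something like this (illustrative; adjust the module path and options to your setup):

    session optional pam_cgfs.so -c all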
When doing subsequent reads of uptime on an open file handle
in the form:
read
lseek 0L, SEEK_SET
read
the second (and later) reads cause the error
"failed to write to cache" to be printed. This
happens for example with "top", which would then print the error:
bad data in /proc/uptime
To fix this problem, use the whole size of the buffer instead of d->size,
because d->size is only set on the first read.
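A sketch of the change (assuming a file_info-style struct where buf/buflen are the cache buffer and its allocated size, and size is what the first read recorded; the format string is illustrative):

    /* Before: the rendered text was limited to d->size bytes, but d->size is
     * only filled in by the first read, so later reads could fail with
     * "failed to write to cache".
     *   total_len = snprintf(d->buf, d->size, ...);
     * After: always render into the full allocated buffer. */
    total_len = snprintf(d->buf, d->buflen, "%ld.00 %ld.00\n", reaperage, idletime);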
Serge Hallyn [Sun, 18 Jun 2017 19:43:22 +0000 (14:43 -0500)]
(temporarily?) revert the virtualization of btime field in /proc/stat
Closes #189
This seems to be responsible for corrupting STIME in process listings
inside containers. Hopefully we can find a reasonable way to fix
both, but compared to an unvirtualized btime field, a bogus STIME field
is the greater evil here.
So far, only proc_stat_read() fully uses BUF_RESERVE_SIZE, so it doesn't
actually benefit from the additional memory assigned to it the way all the other
files in proc do. So double the size and have proc_stat_read() only use half of
it.
Supposedly fixes #176.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
We do this for the memlimit when hitting MemTotal, which
means that if neither is limited we end up subtracting the
host's total memory from the 'unlimited' swap value in the
SwapTotal and SwapFree lines.
Fixes #170
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Jason Baron [Fri, 27 Jan 2017 21:57:54 +0000 (16:57 -0500)]
virtualize the 'btime' field of /proc/stat
Currently, the 'btime' of /proc/stat reflects the boot time of the host.
We would like it to reflect when the guest boots, so use the start time of
init.
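One way to do this (a sketch; guest_btime and its arguments are illustrative) is to take init's start time, which /proc/<pid>/stat reports in clock ticks since host boot as field 22, and add it to the host's btime:

    #include <unistd.h>

    /* Shift the host's boot time forward by init's start time
     * (clock ticks since host boot, field 22 of /proc/<initpid>/stat). */
    static long guest_btime(long host_btime, unsigned long long starttime_ticks)
    {
        long hz = sysconf(_SC_CLK_TCK);

        return host_btime + (long)(starttime_ticks / hz);
    }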
The memory limit was already correctly determined by looking at the
whole cgroup hierarchy and using the minimum value; refactor
that code to support arbitrary files in the memory cgroup
and reuse it for the memsw limit as well.
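Conceptually the refactored helper walks from the container's cgroup up to the root and keeps the smallest value of the requested file, along these lines (a sketch; read_file_ull and the in-place path truncation are assumptions, not the patch's actual code):

    #include <stdint.h>
    #include <string.h>

    /* Walk the cgroup path upwards and return the smallest value of `file`,
     * e.g. "memory.limit_in_bytes" or "memory.memsw.limit_in_bytes".
     * read_file_ull() is an assumed helper that reads the file's value. */
    static uint64_t get_min_limit(char *cgpath, const char *file)
    {
        uint64_t min = UINT64_MAX, cur;
        char *slash;

        for (;;) {
            if (read_file_ull(cgpath, file, &cur) == 0 && cur < min)
                min = cur;
            slash = strrchr(cgpath, '/');
            if (!slash)
                break;
            *slash = '\0';   /* step up one level (modifies cgpath) */
        }
        return min;
    }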
If the file "/sys/devices/system/cpu/isolated" doesn't exist, we can't just
simply bail. We still need to check whether we need to copy the parents cpu
settings.
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
If init has not placed us into our own cgroup on login we will reside in the
root cgroup. In this case cgroup.clone_children will not have been initialized
and so we need to initialize it ourselves. Otherwise users will not be able to start containers
with cpuset limits set.
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
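The initialization itself is just a write of "1"; a minimal sketch (the hierarchy path is an assumption about where cpuset is mounted):

    #include <stdio.h>

    /* Enable clone_children on the cpuset hierarchy so that newly created
     * child cgroups inherit cpuset.cpus and cpuset.mems from their parent. */
    static int enable_clone_children(void)
    {
        FILE *f = fopen("/sys/fs/cgroup/cpuset/cgroup.clone_children", "w");

        if (!f)
            return -1;
        fprintf(f, "1");
        fclose(f);
        return 0;
    }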
pam_cgfs: re-use cgroups that already belong to us
When we detect an already existing cgroup that belongs to our uid and gid, we
simply re-use it. This allows us to avoid creating useless additional cgroups
when e.g. running multiple sudo commands in a script or when we login from
different ttys.
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
This is a rewrite of pam_cgfs which leans on LXC's cgfsng.c. Various codepaths
have been adapted and made more appropriate.
The strategy of pam_cgfs v2 is to support cgroupfs v1, cgroupfs v2, and mixed
mounts where some controllers are mounted into a standard cgroupfs v1 hierarchy
location (/sys/fs/cgroup/<controller>) and other controllers are mounted into
the cgroupfs v2 hierarchy.
The functions and types for cgroupfs v1 and cgroupfs v2 have nearly all been
kept separately even if they do nearly the exact same job. This is on purpose!
Although marked non-experimental, cgroupfs v2 is too much of a moving target.
Extrapolating from current cgroupfs v2 behaviour seems risky and error
prone, even more so when those assumptions complicate or oversimplify the
cgroupfs v1 handling when trying to deal with both cgroupfs v1 and cgroupfs v2
in one function. In short, the code duplication is currently on purpose so that we can
easily adapt to changes in cgroupfs v2 behaviour without having to touch any of
the functions or types that deal with the basically standardized cgroupfs v1
behaviour.
A quick run-through of what the current pam_cgfs does (the same wording can be
found in the preamble/license of pam_cgfs.c):
When a user logs in, this pam module will create cgroups which the user may
administer. It handles both pure cgroupfs v1 and pure cgroupfs v2, as well as
mixed mounts, where some controllers are mounted in a standard cgroupfs v1
hierarchy location (/sys/fs/cgroup/<controller>) and others are in the cgroupfs
v2 hierarchy.
Writeable cgroups are either created for all controllers or, if specified, for
any controllers listed on the command line.
The cgroup created will be "user/$user/0" for the first session, "user/$user/1"
for the second, etc.
Systems with a systemd init system are treated specially, both with respect to
cgroupfs v1 and cgroupfs v2. For both, cgroupfs v1 and cgroupfs v2, we check
whether systemd already placed us in a cgroup it created, e.g.
user.slice/user-uid.slice/session-n.scope
by checking whether uid == our uid. If it did, we simply chown the last
part (session-n.scope). If it did not, we create a cgroup as outlined above
(user/$user/n) and chown it to our uid.
The same holds for cgroupfs v2, where checking this assumption becomes crucial:
if systemd already created a cgroup and placed us in it, we __have to__ be
placed under it on login, otherwise things like starting an X server or
similar will not work.
All requested cgroups must be mounted under /sys/fs/cgroup/$controller,
no messing around with finding mountpoints.
Note that we do not yet deal correctly with weird corner cases like not
mounting the name=systemd cgroupfs v1 controller at /sys/fs/cgroup/systemd but
instead mounting an empty cgroupfs v2 hierarchy at the same location, which
systemd then uses to track processes. This is left for future commits.
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
Unless the file was created with chmod 000, the current check for
!O_RDONLY && !O_WRONLY will always succeed, making it basically a noop. And
even when a file has chmod 000 we still want the user to see that it has no
permissions. So let's remove the check entirely. Whether a user sees a file is
determined by a prior check for O_RDONLY on the directory anyway.
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
We should only deny getting the attributes of a file if it has neither
O_RDONLY nor O_WRONLY permissions. Otherwise ls -al will not show attributes
of O_WRONLY files. Such files are quite common under /proc or /sys.
BEFORE:
root@conventiont:~# ls -al /var/lib/lxcfs/cgroup/devices/
ls: cannot access '/var/lib/lxcfs/cgroup/devices/devices.allow': Permission denied
ls: cannot access '/var/lib/lxcfs/cgroup/devices/devices.deny': Permission denied
total 0
drwxr-xr-x 2 root root 0 Oct 7 01:00 .
drwxr-xr-x 2 root root 0 Oct 7 01:00 ..
-rw-r--r-- 1 root root 0 Oct 7 01:00 cgroup.clone_children
-rw-r--r-- 1 root root 0 Oct 7 01:00 cgroup.procs
-r--r--r-- 1 root root 0 Oct 7 01:00 cgroup.sane_behavior
?????????? ? ? ? ? ? devices.allow
?????????? ? ? ? ? ? devices.deny
-r--r--r-- 1 root root 0 Oct 7 01:00 devices.list
drwxr-xr-x 2 root root 0 Oct 7 01:00 init.scope
drwxr-xr-x 2 root root 0 Oct 7 01:00 lxc
-rw-r--r-- 1 root root 0 Oct 7 01:00 notify_on_release
-rw-r--r-- 1 root root 0 Oct 7 01:00 release_agent
drwxr-xr-x 2 root root 0 Oct 7 01:00 system.slice
-rw-r--r-- 1 root root 0 Oct 7 01:00 tasks
drwxr-xr-x 2 root root 0 Oct 7 01:00 user.slice
AFTER:
root@conventiont:~# ls -al /var/lib/lxcfs/cgroup/devices/
total 0
drwxr-xr-x 2 root root 0 Oct 7 01:01 .
drwxr-xr-x 2 root root 0 Oct 7 01:01 ..
-rw-r--r-- 1 root root 0 Oct 7 01:01 cgroup.clone_children
-rw-r--r-- 1 root root 0 Oct 7 01:01 cgroup.procs
-r--r--r-- 1 root root 0 Oct 7 01:01 cgroup.sane_behavior
--w------- 1 root root 0 Oct 7 01:01 devices.allow
--w------- 1 root root 0 Oct 7 01:01 devices.deny
-r--r--r-- 1 root root 0 Oct 7 01:01 devices.list
drwxr-xr-x 2 root root 0 Oct 7 01:01 init.scope
drwxr-xr-x 2 root root 0 Oct 7 01:01 lxc
-rw-r--r-- 1 root root 0 Oct 7 01:01 notify_on_release
-rw-r--r-- 1 root root 0 Oct 7 01:01 release_agent
drwxr-xr-x 2 root root 0 Oct 7 01:01 system.slice
-rw-r--r-- 1 root root 0 Oct 7 01:01 tasks
drwxr-xr-x 2 root root 0 Oct 7 01:01 user.slice
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
- Detect whether we are on a ramfs. We first try via statfs() and check for
RAMFS_MAGIC. This may report TMPFS_MAGIC even though it should really report
RAMFS_MAGIC. In that case, parse /proc/self/mountinfo and check for
- rootfs rootfs
like we do in LXC (see the sketch below).
- When we are on ramfs use chroot(), otherwise use pivot_root().
Signed-off-by: Christian Brauner <christian.brauner@mailbox.org>
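A sketch of the detection described above (simplified; the real code matches the mountinfo line for "/" more carefully):

    #include <linux/magic.h>   /* RAMFS_MAGIC */
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/vfs.h>

    /* Best-effort check for running on a ramfs root: ask statfs() first,
     * then fall back to scanning /proc/self/mountinfo for "- rootfs rootfs". */
    static bool on_ramfs(void)
    {
        struct statfs fs;
        char line[4096];
        FILE *f;
        bool ret = false;

        if (statfs("/", &fs) == 0 && fs.f_type == RAMFS_MAGIC)
            return true;

        f = fopen("/proc/self/mountinfo", "r");
        if (!f)
            return false;

        while (fgets(line, sizeof(line), f)) {
            if (strstr(line, " - rootfs rootfs ")) {
                ret = true;
                break;
            }
        }
        fclose(f);
        return ret;
    }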