Serge Hallyn [Sun, 18 Jun 2017 19:43:22 +0000 (14:43 -0500)]
(temporarily?) revert the virtualization of btime field in /proc/stat
Closes #189
This seems to be responsible for corrupting the STIME column of process
listings inside containers. Hopefully we can find a reasonable way to fix
both, but compared to an unvirtualized btime field, a bogus STIME field
is the greater evil here.
So far, only proc_stat_read() is fully using BUF_RESERVE_SIZE so it doesn't
actually benefit from the additional memory assigned to it like all the other
files in proc do. So double the size and have proc_stat_read() only use half of
it.
Supposedly fixes #176.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
We do this for the memlimit when hitting MemTotal, which
means if neither is limited we end up subtracting the
host's total memory from the 'unlimited' swap value in the
SwapTotal and SwapFree lines.
Fixes #170
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Jason Baron [Fri, 27 Jan 2017 21:57:54 +0000 (16:57 -0500)]
virtualize the 'btime' field of /proc/stat
Currently, the 'btime' of /proc/stat reflects the boot time of the host.
We would like it to reflect when the guest boots, so use the start time of
init.
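A rough sketch of the idea, not the actual patch: the start time of a process is field 22 of /proc/<pid>/stat, counted in clock ticks since host boot, so the guest's boot time can be approximated as the host btime plus init's start time converted to seconds. The helper names parse_starttime() and guest_btime() are made up for illustration, and the caller is assumed to supply the host btime and the tick rate (for example from sysconf(_SC_CLK_TCK)).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: extract field 22 (starttime, in clock ticks since
 * host boot) from the contents of /proc/<pid>/stat.
 * Returns 0 on success, -1 if the line cannot be parsed. */
static int parse_starttime(const char *stat_line, unsigned long long *ticks)
{
	/* Field 2 (comm) may contain spaces and ')', so resume parsing
	 * after the last ')' in the line. */
	const char *p = strrchr(stat_line, ')');
	int field;

	if (!p)
		return -1;
	p++;

	/* Skip fields 3 through 21. */
	for (field = 3; field < 22; field++) {
		while (*p == ' ')
			p++;
		if (*p == '\0')
			return -1;
		while (*p && *p != ' ')
			p++;
	}

	return sscanf(p, "%llu", ticks) == 1 ? 0 : -1;
}

/* Hypothetical helper: boot time of the guest in seconds since the epoch. */
static long long guest_btime(long long host_btime,
			     unsigned long long start_ticks,
			     long ticks_per_sec)
{
	return host_btime + (long long)(start_ticks / (unsigned long long)ticks_per_sec);
}
```

Integer division drops the sub-second part of the start time, which is good enough for a btime value that is reported in whole seconds.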
The memory limit was already correctly set by looking at the
whole cgroup hierarchy and using the minimum value. Refactor
that code to support arbitrary files in the memory cgroup
and reuse it for the memsw limit as well.
If the file "/sys/devices/system/cpu/isolated" doesn't exist, we can't
simply bail. We still need to check whether we need to copy the parent's cpu
settings.
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
If init has not placed us into our own cgroup on login we will reside in the
root cgroup. In this case cgroup.clone_children will not have been initialized
and so we need to do it. Otherwise users will not be able to start containers
with cpuset limits set.
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
pam_cgfs: re-use cgroups that already belong to us
When we detect an already existing cgroup that belongs to our uid and gid, we
simply re-use it. This allows us to avoid creating useless additional cgroups
when e.g. running multiple sudo commands in a script or when we login from
different ttys.
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
This is a rewrite of pam_cgfs which leans on LXC's cgfsng.c. Various codepaths
have been adapted and made more appropriate.
The strategy of pam_cgfs v2 is to support cgroupfs v1, cgroupfs v2, and mixed
mounts where some controllers are mounted into a standard cgroupfs v1 hierarchy
location (/sys/fs/cgroup/<controller>) and other controllers are mounted into
the cgroupfs v2 hierarchy.
The functions and types for cgroupfs v1 and cgroupfs v2 have nearly all been
kept separate even if they do nearly the exact same job. This is on purpose!
Although marked non-experimental, cgroupfs v2 is too much of a moving target.
Extrapolating from current cgroupfs v2 standard behaviour seems risky and error
prone, even more so when those assumptions complicate or simplify cgroupfs v1
assumptions when trying to handle both cgroupfs v1 and cgroupfs v2 in one
function. In short, code duplication currently is on purpose so that we can
easily adapt to changes in cgroupfs v2 behaviour without having to touch any of
the functions or types that deal with the basically standardized cgroupfs v1
behaviour.
A quick run-through of what current pam_cgfs does (The same wording can be found
in the preamble/license to pam_cgfs.c.):
When a user logs in, this pam module will create cgroups which the user may
administer. It handles both pure cgroupfs v1 and pure cgroupfs v2, as well as
mixed mounts, where some controllers are mounted in a standard cgroupfs v1
hierarchy location (/sys/fs/cgroup/<controller>) and others are in the cgroupfs
v2 hierarchy.
Writeable cgroups are either created for all controllers or, if specified, for
any controllers listed on the command line.
The cgroup created will be "user/$user/0" for the first session, "user/$user/1"
for the second, etc.
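As a small illustration of the naming scheme above (a sketch only; session_cgroup_path() is a hypothetical helper, not a function from pam_cgfs.c):

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: build "user/<user>/<idx>" into buf.
 * Returns 0 on success, -1 if the result does not fit. */
static int session_cgroup_path(char *buf, size_t len,
			       const char *user, unsigned int idx)
{
	int n = snprintf(buf, len, "user/%s/%u", user, idx);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A caller would presumably try idx 0, 1, 2, ... until it finds a cgroup it can create or one it already owns.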
Systems with a systemd init system are treated specially, both with respect to
cgroupfs v1 and cgroupfs v2. For both cgroupfs v1 and cgroupfs v2, we check
whether systemd already placed us in a cgroup it created, e.g.
user.slice/user-uid.slice/session-n.scope
by checking whether uid == our uid. If it did, we simply chown the last
part (session-n.scope). If it did not we create a cgroup as outlined above
(user/$user/n) and chown it to our uid.
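One way the "uid == our uid" check could look, assuming the cgroup path has the user.slice/user-<uid>.slice/session-n.scope shape shown above. cgroup_belongs_to_uid() is a hypothetical name and this is a simplified sketch, not the parsing done in pam_cgfs.c:

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: returns 1 if `cgroup` contains a component of the
 * form "user-<uid>.slice" whose number equals `uid`, 0 otherwise. */
static int cgroup_belongs_to_uid(const char *cgroup, uid_t uid)
{
	const char *p = strstr(cgroup, "/user-");
	unsigned long id;
	char *end;

	if (!p)
		return 0;
	p += strlen("/user-");

	/* Require at least one digit so strtoul() has something to parse. */
	if (*p < '0' || *p > '9')
		return 0;

	id = strtoul(p, &end, 10);
	if (strncmp(end, ".slice", 6) != 0)
		return 0;

	return id == (unsigned long)uid;
}
```

If this returns 1, only the last component (session-n.scope) would need to be chowned; otherwise the user/$user/n path is created instead.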
The same holds for cgroupfs v2 where checking this assumption becomes crucial:
If systemd already created a cgroup and placed us in it, we __have to__ be
placed under it on login, otherwise things like starting an xserver or
similar will not work.
All requested cgroups must be mounted under /sys/fs/cgroup/$controller,
no messing around with finding mountpoints.
Note that we do not yet necessarily deal correctly with weird
corner cases like not mounting the name=systemd cgroupfs v1 controller at
/sys/fs/cgroup/systemd but rather mounting an empty cgroupfs v2 hierarchy at the
same location which is used by systemd to track processes. This is left for
future commits.
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
Unless the file was created with chmod 000 the current check for
!O_RDONLY && !O_WRONLY will always be successful, making the current check
basically a noop. And even in the case where a file has chmod 000 we still want
the user to see that it has no permissions. So let's remove the check entirely.
Whether a user sees a file will be determined by a prior check for O_RDONLY on
the directory anyway.
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
We should only deny getting the attributes of a file if it has neither
O_RDONLY nor O_WRONLY permission. Otherwise ls -al will not show
attributes on O_WRONLY files. Such files are quite common under /proc or /sys.
BEFORE:
root@conventiont:~# ls -al /var/lib/lxcfs/cgroup/devices/
ls: cannot access '/var/lib/lxcfs/cgroup/devices/devices.allow': Permission denied
ls: cannot access '/var/lib/lxcfs/cgroup/devices/devices.deny': Permission denied
total 0
drwxr-xr-x 2 root root 0 Oct 7 01:00 .
drwxr-xr-x 2 root root 0 Oct 7 01:00 ..
-rw-r--r-- 1 root root 0 Oct 7 01:00 cgroup.clone_children
-rw-r--r-- 1 root root 0 Oct 7 01:00 cgroup.procs
-r--r--r-- 1 root root 0 Oct 7 01:00 cgroup.sane_behavior
?????????? ? ? ? ? ? devices.allow
?????????? ? ? ? ? ? devices.deny
-r--r--r-- 1 root root 0 Oct 7 01:00 devices.list
drwxr-xr-x 2 root root 0 Oct 7 01:00 init.scope
drwxr-xr-x 2 root root 0 Oct 7 01:00 lxc
-rw-r--r-- 1 root root 0 Oct 7 01:00 notify_on_release
-rw-r--r-- 1 root root 0 Oct 7 01:00 release_agent
drwxr-xr-x 2 root root 0 Oct 7 01:00 system.slice
-rw-r--r-- 1 root root 0 Oct 7 01:00 tasks
drwxr-xr-x 2 root root 0 Oct 7 01:00 user.slice
AFTER:
root@conventiont:~# ls -al /var/lib/lxcfs/cgroup/devices/
total 0
drwxr-xr-x 2 root root 0 Oct 7 01:01 .
drwxr-xr-x 2 root root 0 Oct 7 01:01 ..
-rw-r--r-- 1 root root 0 Oct 7 01:01 cgroup.clone_children
-rw-r--r-- 1 root root 0 Oct 7 01:01 cgroup.procs
-r--r--r-- 1 root root 0 Oct 7 01:01 cgroup.sane_behavior
--w------- 1 root root 0 Oct 7 01:01 devices.allow
--w------- 1 root root 0 Oct 7 01:01 devices.deny
-r--r--r-- 1 root root 0 Oct 7 01:01 devices.list
drwxr-xr-x 2 root root 0 Oct 7 01:01 init.scope
drwxr-xr-x 2 root root 0 Oct 7 01:01 lxc
-rw-r--r-- 1 root root 0 Oct 7 01:01 notify_on_release
-rw-r--r-- 1 root root 0 Oct 7 01:01 release_agent
drwxr-xr-x 2 root root 0 Oct 7 01:01 system.slice
-rw-r--r-- 1 root root 0 Oct 7 01:01 tasks
drwxr-xr-x 2 root root 0 Oct 7 01:01 user.slice
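The rule behind the BEFORE/AFTER difference can be sketched like this. It is a simplified illustration: getattr_allowed(), the access_check callback type and the two demo_* functions are hypothetical and only stand in for the real permission check on the underlying cgroup file.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>

/* Hypothetical callback: may the caller open `path` with `mode`? */
typedef bool (*access_check)(const char *path, int mode);

/* Deny getattr only when the file is neither readable nor writable. */
static int getattr_allowed(const char *path, access_check may_access)
{
	if (!may_access(path, O_RDONLY) && !may_access(path, O_WRONLY))
		return -EACCES;
	return 0;
}

/* Stand-in for demonstration: a write-only file such as devices.allow. */
static bool demo_write_only(const char *path, int mode)
{
	(void)path;
	return mode == O_WRONLY;
}

/* Stand-in for demonstration: a file with no permissions at all. */
static bool demo_no_access(const char *path, int mode)
{
	(void)path;
	(void)mode;
	return false;
}
```

With a check that required read permission alone, the write-only case would be rejected, which matches the "?????????? devices.allow" lines in the BEFORE listing.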
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
- Detect whether we are on a ramfs. We first try via statfs and check for
  RAMFS_MAGIC. statfs may report TMPFS_MAGIC even where RAMFS_MAGIC would be
  more accurate. In this case, parse /proc/self/mountinfo and check for
      - rootfs rootfs
  like we do in LXC.
- When we are on ramfs use chroot(), otherwise use pivot_root().
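The detection described above might look roughly like the following sketch. It is not the lxcfs or LXC implementation: mountinfo_line_is_rootfs() and is_on_ramfs() are hypothetical names, and the mountinfo check is simplified to a plain substring match on each line instead of first isolating the entry mounted at "/".

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/vfs.h>
#include <linux/magic.h>

/* True if a /proc/self/mountinfo line describes a "rootfs rootfs" mount. */
static bool mountinfo_line_is_rootfs(const char *line)
{
	return strstr(line, " - rootfs rootfs ") != NULL;
}

/* Sketch: decide whether "/" lives on a ramfs. */
static bool is_on_ramfs(void)
{
	struct statfs sb;
	char line[4096];
	bool found = false;
	FILE *f;

	if (statfs("/", &sb) < 0)
		return false;

	if ((unsigned long)sb.f_type == RAMFS_MAGIC)
		return true;

	/* Only fall back to mountinfo for the ambiguous tmpfs answer. */
	if ((unsigned long)sb.f_type != TMPFS_MAGIC)
		return false;

	f = fopen("/proc/self/mountinfo", "r");
	if (!f)
		return false;

	while (!found && fgets(line, sizeof(line), f))
		found = mountinfo_line_is_rootfs(line);

	fclose(f);
	return found;
}
```

A caller would then pick chroot() when is_on_ramfs() returns true and pivot_root() otherwise.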
Signed-off-by: Christian Brauner <christian.brauner@mailbox.org>
Make liblxcfs a libtool module. Also, stop linking lxcfs against liblxcfs. We do
not really need this since we call dlopen() anyway. Furthermore, this allows us
to make sure that functions marked with __attribute__((constructor)) are not run
before we call dlopen() in main() in lxcfs. This has the advantage that we can
show help output without __attribute__((constructor)) functions being run.
Signed-off-by: Christian Brauner <cbrauner@suse.de>
    if (!pick_controller_from_path())
        /* Someone's trying to delete "/cgroup". */
    if (!find_cgroup_in_path())
        /* Someone's trying to delete a controller e.g. "/blkio". */
    if (!get_cgdir_and_path()) {
        /* Someone's trying to delete a cgroup on the same level as the
         * "/lxc" cgroup e.g. rmdir "/cgroup/blkio/lxc" or
         * rmdir "/cgroup/blkio/init.slice".
         */
    }
All other interesting cases are caught further down.
Signed-off-by: Christian Brauner <cbrauner@suse.de>
We do not need to check whether mode & W_OK is passed in. Even if the cgroup
root mount is writeable, operations like cg_mkdir() et al. will fail with e.g.
EPERM. Basically all operations will fail on the cgroup root mount point because
the first operation they perform is pick_controller_from_path(). That is to say,
they try to e.g. pick "blkio" from /var/lib/lxcfs/cgroup/blkio/some/cgroups, and
similarly for all other controllers. If pick_controller_from_path() fails they
all return an appropriate errno. For example, cg_mkdir() does:
This means we do not need to return an errno already in cg_access when
mode & W_OK is passed in. This has the advantage that users are still able to
descend into /var/lib/lxcfs/cgroup via:
cd /var/lib/lxcfs/cgroup
but are still blocked from doing any write operations.
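To illustrate why operations on the cgroup root fail early, here is a rough sketch of picking the controller out of a path. It is not the lxcfs pick_controller_from_path(): controller_from_path() is a hypothetical helper, the path is assumed to be relative to the lxcfs mount (so "/cgroup/blkio/..." rather than "/var/lib/lxcfs/cgroup/blkio/..."), and the choice of -EPERM is an assumption based on the errno mentioned above.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the controller name from a path such as
 * "/cgroup/blkio/some/cgroups" into buf.
 * Returns 0 on success or a negative errno value. */
static int controller_from_path(const char *path, char *buf, size_t len)
{
	static const char prefix[] = "/cgroup/";
	const char *start, *end;
	size_t n;

	/* "/cgroup" itself has no controller component. */
	if (strncmp(path, prefix, sizeof(prefix) - 1) != 0)
		return -EPERM;

	start = path + sizeof(prefix) - 1;
	end = strchr(start, '/');
	n = end ? (size_t)(end - start) : strlen(start);

	if (n == 0)
		return -EPERM;
	if (n >= len)
		return -ENAMETOOLONG;

	memcpy(buf, start, n);
	buf[n] = '\0';
	return 0;
}
```

Any operation that starts with such a lookup rejects the bare "/cgroup" path before it gets to modify anything, so an extra W_OK check in cg_access adds nothing.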
Signed-off-by: Christian Brauner <cbrauner@suse.de>
- replace multiple DEBUG ifdefs with a single ifdef at the top
- define lxcfs_debug() as a function-like macro that expands to nothing when
  -DDEBUG is not given
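A minimal sketch of such a macro, assuming a printf-style interface that is always called with at least one argument after the format string. The exact prefix printed here is an illustration, not necessarily what lxcfs emits.

```c
#include <stdio.h>

#ifdef DEBUG
/* With -DDEBUG: print file, line and function ahead of the message. */
#define lxcfs_debug(format, ...)                                    \
	fprintf(stderr, "%s: %d: %s: " format, __FILE__, __LINE__, \
		__func__, __VA_ARGS__)
#else
/* Without -DDEBUG: the call expands to nothing. */
#define lxcfs_debug(format, ...)
#endif
```

Call sites then stay free of #ifdef clutter, e.g. `lxcfs_debug("read %d bytes\n", n);`, and cost nothing in non-debug builds.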
Signed-off-by: Christian Brauner <cbrauner@suse.de>