- reaper_busy was off by a factor of 10 (possibly originally
for precision?)
- get_pid1_time was expecting a '1' byte like in
the pid_to/from_ns_wrapper functions instead of reading its
value which is what is actually written
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
  /*
     If not the first time through, we require old_size to be
     at least MINSIZE and to have prev_inuse set.
   */
  assert ((old_top == initial_top (av) && old_size == 0) ||
          ((unsigned long) (old_size) >= MINSIZE &&
           prev_inuse (old_top) &&
           ((unsigned long) old_end & pagemask) == 0));
Serge Hallyn [Fri, 13 Nov 2015 23:18:55 +0000 (17:18 -0600)]
Implement privilege check when moving tasks
When writing pids to a tasks file in lxcfs, lxcfs was checking
for privilege over the tasks file but not over the pid being
moved. Since the cgm_movepid request is done as root on the host,
not with the requestor's credentials, we must copy the check which
cgmanager was doing to ensure that the requesting task is allowed
to change the victim task's cgroup membership.
This is CVE-2015-1344
https://bugs.launchpad.net/ubuntu/+source/lxcfs/+bug/1512854
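The kind of check described above can be sketched roughly as follows. This is a simplified illustration, not the lxcfs or cgmanager code: struct task_cred and may_move_pid_sketch() are hypothetical names, and the user-namespace uid mapping is reduced to flags that are assumed to be computed elsewhere.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical, simplified view of the requesting task's credentials. */
struct task_cred {
    uid_t uid;            /* uid as seen on the host */
    bool  root_in_userns; /* task is uid 0 inside its own user namespace */
};

/*
 * Return true if the requestor may change the victim's cgroup membership.
 * victim_uid_mapped is assumed to say whether the victim's uid is mapped
 * into the requestor's user namespace.
 */
bool may_move_pid_sketch(const struct task_cred *req, uid_t victim_uid,
                         bool victim_uid_mapped)
{
    if (req->uid == 0)
        return true;    /* root on the host */
    if (req->uid == victim_uid)
        return true;    /* moving a task owned by the same user */
    if (req->root_in_userns && victim_uid_mapped)
        return true;    /* namespace root over a uid mapped into its namespace */
    return false;
}
```

The point is that the decision uses the requestor's credentials, even though the move itself is later carried out as root on the host.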
Serge Hallyn [Fri, 13 Nov 2015 23:07:36 +0000 (17:07 -0600)]
Fix checking of parent directories
Taken from the justification in the launchpad bug:
To a task in freezer cgroup /a/b/c/d, it should appear that there are no
cgroups other than its descendants. Since this is a filesystem, we must have
the parent directories, but each parent cgroup should only contain the child
which the task can see.
So, when this task looks at /a/b, it should see only directory 'c' and no
files. Attempt to create /a/b/x should result in -EPERM, whether /a/b/x already
exists or not. Attempts to query /a/b/x should result in -ENOENT whether /a/b/x
exists or not. Opening /a/b/tasks should result in -ENOENT.
The caller_may_see_dir checks specifically whether a task may see a cgroup
directory - i.e. /a/b/x if opening /a/b/x/tasks, and /a/b/c/d if doing
opendir('/a/b/c/d').
caller_is_in_ancestor() will return true if the caller in /a/b/c/d looks at
/a/b/c/d/e. If the caller is in a child cgroup of the queried one - i.e. if the
task in /a/b/c/d queries /a/b, then *nextcg will contain the next (the only)
directory which it can see in the path - 'c'.
Beyond this, regular DAC permissions should apply, with the
root-in-user-namespace privilege over its mapped uids being respected. The
fc_may_access check does this check for both directories and files.
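The path logic described above can be sketched with plain string handling. This is an illustration only, assuming absolute cgroup paths; path_in_ancestor() and its helpers are hypothetical names and not the lxcfs implementation.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Length of p without trailing slashes ("/" becomes length 0). */
static size_t trimmed_len(const char *p)
{
    size_t len = strlen(p);
    while (len > 0 && p[len - 1] == '/')
        len--;
    return len;
}

/* True if prefix names path itself or one of its parent directories. */
static bool is_path_prefix(const char *prefix, const char *path)
{
    size_t len = trimmed_len(prefix);
    return strncmp(prefix, path, len) == 0 &&
           (path[len] == '\0' || path[len] == '/');
}

/*
 * Returns true if query_cg is the caller's cgroup or lies below it.
 * Otherwise returns false; if query_cg is a parent of caller_cg, *nextcg
 * is set to a malloc'ed copy of the one child the caller may see there.
 * In every other case *nextcg is NULL. The caller frees *nextcg.
 */
bool path_in_ancestor(const char *caller_cg, const char *query_cg, char **nextcg)
{
    const char *start, *end;
    size_t n;

    *nextcg = NULL;
    if (is_path_prefix(caller_cg, query_cg))
        return true;
    if (!is_path_prefix(query_cg, caller_cg))
        return false;   /* unrelated subtree: nothing visible */

    start = caller_cg + trimmed_len(query_cg);
    while (*start == '/')
        start++;        /* skip the separator(s) */
    end = strchr(start, '/');
    n = end ? (size_t)(end - start) : strlen(start);

    *nextcg = malloc(n + 1);
    if (*nextcg) {
        memcpy(*nextcg, start, n);
        (*nextcg)[n] = '\0';
    }
    return false;
}
```

With the caller in /a/b/c/d, a query for /a/b/c/d/e returns true, while a query for /a/b returns false and sets *nextcg to "c".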
Serge Hallyn [Thu, 12 Nov 2015 07:41:52 +0000 (01:41 -0600)]
Limit caching to 0.5s
If a cgroup is deleted or chmoded using the underlying cgroupfs, then we
want to minimize the amount of time during which we get stale info. At the
same time, we don't want to do away with caching in the fuse kernel module
altogether, since calling out to userspace is expensive.
Serge Hallyn [Fri, 30 Oct 2015 23:30:56 +0000 (18:30 -0500)]
/proc/meminfo: show the lowest limit amongst our ancestors
If we are in /a/b/c, and b is limited to 500k, then c's limit_in_bytes
will not reflect the 500k, although that will be enforced for us. So
check our lineage for the lowest limit.
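A minimal sketch of the lineage walk, under stated assumptions: lowest_limit() and the limit_fn callback are hypothetical names, demo_limit() is a stand-in for reading a cgroup's limit_in_bytes, and UINT64_MAX is used here to mean "unlimited".

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Callback returning the memory limit of one cgroup, in bytes. */
typedef uint64_t (*limit_fn)(const char *cgroup);

/* Stand-in for reading limit_in_bytes: only /a/b is limited (to 500k). */
uint64_t demo_limit(const char *cgroup)
{
    if (strcmp(cgroup, "/a/b") == 0)
        return 500 * 1024;
    return UINT64_MAX;  /* "unlimited" in this sketch */
}

/* Return the smallest limit found on cgroup or any of its ancestors. */
uint64_t lowest_limit(const char *cgroup, limit_fn get_limit)
{
    size_t len = strlen(cgroup);
    char *path = malloc(len + 1);
    uint64_t lowest = UINT64_MAX;

    if (!path)
        return lowest;
    memcpy(path, cgroup, len + 1);

    for (;;) {
        uint64_t limit = get_limit(path);
        char *slash;

        if (limit < lowest)
            lowest = limit;
        slash = strrchr(path, '/');
        if (!slash || slash == path)
            break;          /* reached the top-level cgroup */
        *slash = '\0';      /* step up to the parent */
    }
    free(path);
    return lowest;
}
```

For a task in /a/b/c this visits /a/b/c, /a/b and /a, and reports the 500k limit set on b.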
Serge Hallyn [Fri, 30 Oct 2015 17:29:18 +0000 (12:29 -0500)]
don't let idletime be > reaperage
This is not a good way to do this. We should decide on a proper
heuristic. We could take something like reaperage * (idletime/total_uptime),
but that doesn't scale for how much our own container used the cpu nor
for time.
I will open a github issue to fix this.
However, as it currently stands the test_proc testcase was failing; this
at least lets it pass.
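For illustration, the clamp and the proportional alternative mentioned above might look like this. Both function names are hypothetical, and neither is taken from the lxcfs source.

```c
/* The clamp: reported idle time never exceeds the reaper's age. */
unsigned long clamp_idletime(unsigned long idletime, unsigned long reaperage)
{
    return idletime > reaperage ? reaperage : idletime;
}

/*
 * The proportional idea: reaperage * (idletime / total_uptime).
 * Returns 0 if total_uptime is 0 to avoid dividing by zero.
 */
unsigned long scaled_idletime(unsigned long reaperage, unsigned long idletime,
                              unsigned long total_uptime)
{
    if (total_uptime == 0)
        return 0;
    return (unsigned long)((double)reaperage * (double)idletime /
                           (double)total_uptime);
}
```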
Serge Hallyn [Wed, 28 Oct 2015 20:41:45 +0000 (20:41 +0000)]
fix corner cases in uptime and diskstat read
Closes #33
The code for these (a shortcut version of the other proc_*_read ones) was
doing snprintf(buf, size, ...). If the user only requests one byte, we
just served them a trailing 0. Fix that.
We weren't handling reads with offset in these cases either. Fix that.
/proc/uptime has the format %lu.%02lu %lu.%02lu. The format used by
lxcfs doesn't include the fractional-second portion of the uptime, which
might cause programs that rely on that format to fail.
This commit adapts the uptime format to match the kernel's by adding
trailing dummy values (.0) to the uptime and idle time values.
The parsing of /proc/uptime was updated.
Signed-off-by: Bernhard Miklautz <bernhard.miklautz@shacknet.at>
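The read-side fix described above (short reads and reads at an offset) follows a common pattern: format into a scratch buffer first, then copy out only the requested slice. A minimal sketch, assuming hypothetical helper names read_at_offset() and uptime_read_sketch() rather than the actual lxcfs functions:

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/*
 * Copy up to size bytes of data, starting at offset, into buf.
 * Returns the number of bytes copied; 0 means end of file.
 */
size_t read_at_offset(const char *data, size_t datalen,
                      char *buf, size_t size, off_t offset)
{
    size_t left;

    if (offset < 0 || (size_t)offset >= datalen)
        return 0;
    left = datalen - (size_t)offset;
    if (left > size)
        left = size;
    memcpy(buf, data + offset, left);   /* no trailing NUL is added */
    return left;
}

/* Format an uptime-style line into a scratch buffer, then serve a slice. */
size_t uptime_read_sketch(unsigned long up, unsigned long idle,
                          char *buf, size_t size, off_t offset)
{
    char tmp[64];
    int n = snprintf(tmp, sizeof(tmp), "%lu.0 %lu.0\n", up, idle);

    if (n < 0 || (size_t)n >= sizeof(tmp))
        return 0;
    return read_at_offset(tmp, (size_t)n, buf, size, offset);
}
```

A one-byte read now returns the first character of the data instead of a lone terminating 0, and a read at a non-zero offset continues where the previous one stopped.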
Serge Hallyn [Fri, 16 Oct 2015 19:44:29 +0000 (14:44 -0500)]
swap out libnih and libnih-dbus for glib
The motivation is to make threading possible, to hopefully greatly
speed up systemd startup inside containers.
This required converting all the nih-ified memory tracking. (Some
of this can probably be done smarter in a more glib-friendly way, i.e.
avoiding doing a glib string alloc followed by strdup followed by
freeing the glib string)
We open a single dbus connection for all threads to use. If that
connection is closed (i.e. cgmanager exits / restarts) the first
task to find it so takes a mutex and attempts to reconnect, once
per second, until it is reconnected.
When creating a directory for non-root user, execute a new binary
to get a clean dbus session as that user.
Serge Hallyn [Fri, 5 Jun 2015 04:18:20 +0000 (23:18 -0500)]
Return host's meminfo file if no memory cgroup
If memory cgroup is not available (that is, we can't find a
memory cgroup for the reading task), then just return the
contents of the host's /proc/meminfo. (Same for all proc file
reads)
Serge Hallyn [Wed, 20 May 2015 15:56:46 +0000 (08:56 -0700)]
fix two threading issues
Make sure to prep dbus for threading before we start.
And use _exit() any time we are exiting from a forked child. This is
to avoid calling the atexit() functions. Once a thread in the main
program has called nih_error_init(), this registers an atexit fn which
asserts that the nih_context not be null - but after we fork, if libnih
is built with --enable-threading, then the nih_context is in fact null.
The only way to clear the atexit fns would be to exec(). So call
_exit() instead of exit(), because _exit() avoids calling the atexit
fns.
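The difference between exit() and _exit() in a forked child can be demonstrated with a small standalone sketch. It does not involve libnih; leave_marker() and child_ran_atexit() are hypothetical names used only for this illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int marker_fd = -1;

/* atexit handler: writes a one-byte marker if a pipe has been set up. */
static void leave_marker(void)
{
    if (marker_fd >= 0) {
        ssize_t r = write(marker_fd, "x", 1);
        (void)r;
    }
}

/*
 * Fork a child that terminates with exit() or _exit().
 * Returns 1 if the atexit handler ran in the child, 0 if not, -1 on error.
 */
int child_ran_atexit(int use_underscore_exit)
{
    static int registered;
    int fds[2];
    char c;
    ssize_t n;
    pid_t pid;

    if (!registered) {
        if (atexit(leave_marker) != 0)
            return -1;
        registered = 1;
    }
    if (pipe(fds) < 0)
        return -1;

    fflush(NULL);       /* keep buffered stdio from being flushed twice */
    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        close(fds[0]);
        marker_fd = fds[1];     /* only the child has a marker pipe */
        if (use_underscore_exit)
            _exit(0);           /* skips the inherited atexit handlers */
        exit(0);                /* runs the inherited atexit handlers */
    }

    close(fds[1]);
    n = read(fds[0], &c, 1);    /* returns 0 at EOF if nothing was written */
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return n == 1 ? 1 : 0;
}
```

The child inherits every handler the parent registered, so only _exit() guarantees that none of them run after the fork.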
Serge Hallyn [Tue, 19 May 2015 21:27:05 +0000 (14:27 -0700)]
use threads when safe
libnih, when not built with --enable-threading, cannot be safely
used by a threaded application. Detect whether it is built to be
threadsafe using a new libnih helper, and, if so, run threaded by
(a) not passing '-s' to fuse, and (b) making the dbus connection
and detected api version thread-local.
Also add a testcase to make sure that the new function is correct.
In order to share the cpuset range checking code with the
test, move it into cpuset.c. Not sure whether we want that in a
utils.c instead.