Serge Hallyn [Mon, 1 Feb 2016 11:21:01 +0000 (12:21 +0100)]
Make the bulk of the lxcfs code reloadable
Make the majority of the code (the bits most likely to have security
bugs coming up) reloadable. Sending the USR1 signal to lxcfs will cause
it to reload the shared library so as to immediately start using the
fixed code. This allows us to upgrade lxcfs in the majority of
cases without having to restart containers.
To achieve this, some code was moved around so that lxcfs.c itself
does not risk pinning any symbols from the shared library (which
would prevent it being unloaded). We track the number of threads
currently using the bindings, and do the reload after it hits
zero (specifically, the next time that we turn the count from 0 to 1).
Also add a test case to make sure an updated library does in fact
get loaded.
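The counting scheme described above can be sketched as follows. This is a minimal illustration, not the lxcfs code: all names are hypothetical, and do_reload() only bumps a counter where a real implementation would dlclose() the old library and dlopen() the new one.

```c
#include <assert.h>
#include <pthread.h>
#include <signal.h>

/* All names are hypothetical; only the counting scheme is illustrated. */
static pthread_mutex_t user_count_mutex = PTHREAD_MUTEX_INITIALIZER;
static int users_count;                   /* threads currently inside the bindings */
static volatile sig_atomic_t need_reload; /* set when SIGUSR1 arrives */
static int reload_generation;             /* stand-in for dlclose() + dlopen() */

/* Signal handler: only record the request, the reload happens later. */
static void sigusr1_handler(int sig)
{
	(void)sig;
	need_reload = 1;
}

static void do_reload(void)
{
	/* A real implementation would swap the shared library here. */
	reload_generation++;
	need_reload = 0;
}

/* Called before entering library code. */
static void up_users(void)
{
	pthread_mutex_lock(&user_count_mutex);
	/* Reload only on the 0 -> 1 transition, when nobody is using the old code. */
	if (users_count == 0 && need_reload)
		do_reload();
	users_count++;
	pthread_mutex_unlock(&user_count_mutex);
}

/* Called after leaving library code. */
static void down_users(void)
{
	pthread_mutex_lock(&user_count_mutex);
	assert(users_count > 0);
	users_count--;
	pthread_mutex_unlock(&user_count_mutex);
}
```

Deferring the reload to the next up_users() call keeps the signal handler trivial and guarantees no thread is executing code from the library being unloaded.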
Seth Forshee [Thu, 28 Jan 2016 16:17:42 +0000 (17:17 +0100)]
Remove unused chunks in caching code
Several pieces of code which deal with caching contents for proc
files contain code like this:
if (l >= cache_size) {
...
goto err;
}
if (l < cache_size) {
...
} else {
...
}
When the first condition is false the second condition will
always be true, so the code in the else block is never used.
The second if/else statement can then just be replaced with the
code from the if block.
Serge Hallyn [Thu, 28 Jan 2016 13:48:19 +0000 (14:48 +0100)]
tests: update to handle lxcfs virtualizing based on init
lxcfs used to use $current's cgroups to virtualize proc, but
switched in 0.17 to using $current's init's cgroups. The
tests need to be updated to reflect that.
Serge Hallyn [Fri, 22 Jan 2016 02:21:13 +0000 (18:21 -0800)]
simplify getreaperage
We don't need to switch to their ns, mount their proc, and check /proc/1.
Just find out their init pid using scm credentials and check /proc/$initpid
in our own procfs.
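A rough sketch of the second half of that idea, once the init pid is known in our own pid namespace. The helper names are hypothetical, and using the change time of /proc/<pid> as a proxy for the process start time is an assumption of this sketch. The SCM credentials exchange is not shown.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical helper: format the procfs directory of a pid in our own namespace. */
static int proc_dir_for(pid_t pid, char *buf, size_t size)
{
	int n = snprintf(buf, size, "/proc/%d", (int)pid);

	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/*
 * Hypothetical helper: seconds elapsed since the change time of
 * /proc/<initpid>, or -1 if the directory cannot be examined.
 */
static long reaper_age(pid_t initpid)
{
	char path[32];
	struct stat st;

	if (proc_dir_for(initpid, path, sizeof(path)) < 0)
		return -1;
	if (stat(path, &st) < 0)
		return -1;
	return (long)(time(NULL) - st.st_ctime);
}
```

No namespace switch or extra procfs mount is needed because the pid is already expressed in the namespace whose /proc we read.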
When no limit is specified using lxc.cgroup.memory.memsw.limit_in_bytes,
overflow occurs while calculating Swap{Total,Free}. Commit a2de34b tried
to fix this, but introduced another bug, wherein if
memory.memsw.limit_in_bytes >= memory.limit_in_bytes, then Swap{Total,Free}
are not shown as expected.
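The log does not show the fix itself. As an illustration of the kind of guard involved, the sketch below derives the swap figures only after checking that the unsigned subtractions cannot wrap around. The function name, parameters, and the clamping of the free value are assumptions.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper, all values in bytes. Returns false when no swap
 * figures can be derived from the cgroup limits.
 */
static bool swap_figures(unsigned long memlimit, unsigned long memusage,
			 unsigned long memswlimit, unsigned long memswusage,
			 unsigned long *swaptotal, unsigned long *swapfree)
{
	unsigned long swapused;

	/* Unsigned subtraction wraps around, so compare before subtracting. */
	if (memswlimit < memlimit || memswusage < memusage)
		return false;

	*swaptotal = memswlimit - memlimit;
	swapused = memswusage - memusage;
	/* Clamp instead of wrapping if usage exceeds the computed total. */
	*swapfree = swapused > *swaptotal ? 0 : *swaptotal - swapused;
	return true;
}
```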
cgfs: make dorealloc allocate the first batch, too
With a short first line the case can be
*mem = NULL
oldlen = 0
newlen = 5 (anything < 50)
making newbatches == oldbatches == 1 causing the
(newbatches <= oldbatches)
condition to be true.
Let realloc() handle *mem==NULL and use
(!*mem || newbatches > oldbatches) as the only condition.
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
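A sketch of dorealloc() with the condition described above, assuming BATCH_SIZE is 50 as the example numbers suggest. The batches_for() helper and the retry loop are illustrative assumptions, not a copy of the lxcfs source.

```c
#include <assert.h>
#include <stdlib.h>

#define BATCH_SIZE 50 /* assumed from the "anything < 50" example */

/* Batches needed to hold len bytes; always at least one. */
static size_t batches_for(size_t len)
{
	return (len / BATCH_SIZE) + 1;
}

/* Make sure *mem can hold newlen bytes, given that oldlen bytes were in use. */
static void dorealloc(char **mem, size_t oldlen, size_t newlen)
{
	size_t newbatches = batches_for(newlen);
	size_t oldbatches = batches_for(oldlen);

	/* realloc(NULL, n) behaves like malloc(n), so the first batch is covered. */
	if (!*mem || newbatches > oldbatches) {
		char *tmp;

		do {
			tmp = realloc(*mem, newbatches * BATCH_SIZE);
		} while (!tmp);
		*mem = tmp;
	}
}
```

With *mem == NULL, oldlen == 0 and newlen == 5, both batch counts are 1, but the !*mem test still triggers the first allocation.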
Serge Hallyn [Thu, 7 Jan 2016 19:17:17 +0000 (11:17 -0800)]
dorealloc: avoid extra reallocs
The original check was very wrong, using % instead of /. However
the length we track is the actual used length, not the allocated
length, which is always (len / BATCH_SIZE) + 1 batches. We don't want
to realloc when newlen still fits in the (oldlen / BATCH_SIZE) + 1 batches already allocated.
getline() returns the length which can be passed to
append_line to avoid a strlen() call.
Additionally with the length already known memcpy() can be
used instead of strcpy(). A +1 to the length will include
the terminating null byte as it is included in getline(3)'s
output.
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
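A sketch of the getline()/memcpy() combination described above. The slurp() helper is hypothetical, and a plain realloc() stands in for the batched dorealloc() to keep the example short.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/*
 * Append line (linelen bytes plus its terminating null byte) to *contents,
 * whose used length is *len. The length comes from getline(), so no
 * strlen() is needed.
 */
static void append_line(char **contents, size_t *len, const char *line, ssize_t linelen)
{
	size_t newlen = *len + (size_t)linelen;
	char *tmp = realloc(*contents, newlen + 1);

	if (!tmp)
		abort();
	/* +1 copies the terminating null byte that getline(3) already wrote. */
	memcpy(tmp + *len, line, (size_t)linelen + 1);
	*contents = tmp;
	*len = newlen;
}

/* Hypothetical helper: read a whole stream into one null-terminated buffer. */
static char *slurp(FILE *f, size_t *len)
{
	char *line = NULL, *contents = NULL;
	size_t alloc = 0;
	ssize_t linelen;

	*len = 0;
	while ((linelen = getline(&line, &alloc, f)) != -1)
		append_line(&contents, len, line, linelen);
	free(line);
	return contents;
}
```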
The initial check should compare real lengths: with modulo, a
new required length of e.g. 52 would be considered smaller
than an old length of 48 (2 < 48).
To get the 'batches' count 'newlen' must be divided and not
taken modulo BATCH_SIZE. Otherwise '101', which would need a
3rd batch to reach 150, would end up with two (2*50 = 100
bytes) and thereby be truncated instead.
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
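The arithmetic above can be checked with a small sketch. The function names are hypothetical and the "modulo" variants are a reconstruction of the flawed logic from the numbers given in the message, assuming BATCH_SIZE is 50.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BATCH_SIZE 50 /* assumed from the 2*50 = 100 example */

/* Flawed: remainders say nothing about which length is larger. */
static bool fits_by_modulo(size_t oldlen, size_t newlen)
{
	return newlen % BATCH_SIZE <= oldlen % BATCH_SIZE;
}

/* Comparing the real lengths gives the intended answer. */
static bool fits_by_length(size_t oldlen, size_t newlen)
{
	return newlen <= oldlen;
}

/* Flawed batch count: 101 % 50 + 1 == 2, i.e. only 100 bytes. */
static size_t batches_by_modulo(size_t len)
{
	return (len % BATCH_SIZE) + 1;
}

/* Correct batch count: 101 / 50 + 1 == 3, i.e. 150 bytes. */
static size_t batches_by_division(size_t len)
{
	return (len / BATCH_SIZE) + 1;
}
```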
If the first realloc() call fails then 'd' becomes NULL, so
subsequent realloc() retries behave like malloc() and
the original src pointer is never freed. Furthermore,
the newly allocated buffer then contains uninitialized data
where the previous pids had been stored.
Avoid this by passing the original pointer from '*src'
to realloc().
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
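A minimal sketch of the retry pattern described above. The helper name and the pid_t element type are assumptions for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical helper: grow *src to hold count pids, retrying until it works. */
static void must_grow_pids(pid_t **src, size_t count)
{
	pid_t *d;

	assert(count > 0);
	do {
		/*
		 * Always pass *src, never the previous return value: when
		 * realloc() fails it returns NULL and leaves *src valid, so
		 * a retry with *src keeps the old contents.
		 */
		d = realloc(*src, count * sizeof(pid_t));
	} while (!d);
	*src = d;
}
```

Writing d = realloc(d, ...) in the loop instead would turn every retry after a failure into realloc(NULL, ...), which leaks the old buffer and drops the stored pids.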
- reaper_busy was off by a factor of 10 (possibly originally
for precision?)
- get_pid1_time was expecting a '1' byte like in
the pid_to/from_ns_wrapper functions instead of reading its
value which is what is actually written
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
/*
If not the first time through, we require old_size to be
at least MINSIZE and to have prev_inuse set.
*/
assert ((old_top == initial_top (av) && old_size == 0) ||
((unsigned long) (old_size) >= MINSIZE &&
prev_inuse (old_top) &&
((unsigned long) old_end & pagemask) == 0));
Serge Hallyn [Fri, 13 Nov 2015 23:18:55 +0000 (17:18 -0600)]
Implement privilege check when moving tasks
When writing pids to a tasks file in lxcfs, lxcfs was checking
for privilege over the tasks file but not over the pid being
moved. Since the cgm_movepid request is done as root on the host,
not with the requestor's credentials, we must copy the check which
cgmanager was doing to ensure that the requesting task is allowed
to change the victim task's cgroup membership.
This is CVE-2015-1344
https://bugs.launchpad.net/ubuntu/+source/lxcfs/+bug/1512854
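The log does not spell out the rules of the check. The sketch below shows one plausible shape for it. The struct, the function signature, and the four rules are assumptions made for illustration, not the lxcfs or cgmanager implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical, simplified view of a task. */
struct task_creds {
	pid_t pid;
	uid_t uid; /* uid as seen on the host */
};

/*
 * Assumed rules: a task may move itself, host root may move anything, a
 * task may move tasks owned by the same uid, and root in a user namespace
 * may move tasks whose uid is mapped into that namespace.
 */
static bool may_move_pid(const struct task_creds *req, bool req_is_ns_root,
			 const struct task_creds *victim, bool victim_uid_mapped)
{
	if (req->pid == victim->pid)
		return true;
	if (req->uid == 0)
		return true;
	if (req->uid == victim->uid)
		return true;
	if (req_is_ns_root && victim_uid_mapped)
		return true;
	return false;
}
```

The point of the fix is that this decision uses the requestor's credentials, even though the move itself is carried out as root on the host.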
Serge Hallyn [Fri, 13 Nov 2015 23:07:36 +0000 (17:07 -0600)]
Fix checking of parent directories
Taken from the justification in the launchpad bug:
To a task in freezer cgroup /a/b/c/d, it should appear that there are no
cgroups other than its descendants. Since this is a filesystem, we must have
the parent directories, but each parent cgroup should only contain the child
which the task can see.
So, when this task looks at /a/b, it should see only directory 'c' and no
files. Attempts to create /a/b/x should result in -EPERM, whether /a/b/x already
exists or not. Attempts to query /a/b/x should result in -ENOENT whether /a/b/x
exists or not. Opening /a/b/tasks should result in -ENOENT.
The caller_may_see_dir checks specifically whether a task may see a cgroup
directory - i.e. /a/b/x if opening /a/b/x/tasks, and /a/b/c/d if doing
opendir('/a/b/c/d').
caller_is_in_ancestor() will return true if the caller in /a/b/c/d looks at
/a/b/c/d/e. If the caller is in a child cgroup of the queried one - i.e. if the
task in /a/b/c/d queries /a/b, then *nextcg will contain the next (the only)
directory which it can see in the path - 'c'.
Beyond this, regular DAC permissions should apply, with the
root-in-user-namespace privilege over its mapped uids being respected. The
fc_may_access check does this check for both directories and files.
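The path logic described for caller_is_in_ancestor() can be sketched on plain strings. This is a simplified illustration: the real function looks up the caller's cgroup from its pid, while the version below takes both paths as arguments, and the is_under() helper is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* True if path equals prefix or lies below it, on component boundaries. */
static bool is_under(const char *path, const char *prefix)
{
	size_t n = strlen(prefix);

	if (strcmp(prefix, "/") == 0)
		return path[0] == '/';
	return strncmp(path, prefix, n) == 0 &&
	       (path[n] == '\0' || path[n] == '/');
}

/*
 * Returns true if query_cg is the caller's cgroup or a descendant of it.
 * Otherwise returns false and, if query_cg is an ancestor of the caller's
 * cgroup, stores the single child the caller may see in next (truncated
 * to fit size). next is left empty for unrelated paths.
 */
static bool caller_is_in_ancestor(const char *caller_cg, const char *query_cg,
				  char *next, size_t size)
{
	const char *start;
	size_t len;

	assert(size > 0);
	next[0] = '\0';
	if (is_under(query_cg, caller_cg))
		return true;
	if (!is_under(caller_cg, query_cg))
		return false;

	/* Skip the queried prefix and the separator that follows it. */
	start = caller_cg + strlen(query_cg);
	if (*start == '/')
		start++;
	len = strcspn(start, "/");
	if (len >= size)
		len = size - 1;
	memcpy(next, start, len);
	next[len] = '\0';
	return false;
}
```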
Serge Hallyn [Thu, 12 Nov 2015 07:41:52 +0000 (01:41 -0600)]
Limit caching to 0.5s
If a cgroup is deleted or chmoded using the underlying cgroupfs, then we
want to minimize the amount of time during which we get stale info. At the
same time, we don't want to do away with caching in the fuse kernel module
altogether, since calling out to userspace is expensive.
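The log does not say how the limit is applied. One way to express it, assuming it is done through libfuse's entry_timeout and attr_timeout mount options (both take seconds as a floating-point value), is to build the option string passed after -o. The helper name is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format the "-o" option string that limits how long
 * the fuse kernel module caches name lookups and file attributes.
 * Returns 0 on success, -1 if buf is too small.
 */
static int fuse_cache_opts(char *buf, size_t size, double seconds)
{
	int n = snprintf(buf, size, "entry_timeout=%.1f,attr_timeout=%.1f",
			 seconds, seconds);

	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

A value of 0.5 keeps most of the benefit of kernel-side caching while bounding how long a deleted or chmoded cgroup can appear stale.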