Tycho Andersen [Mon, 21 Mar 2016 22:52:02 +0000 (16:52 -0600)]
c/r: don't fail if there is no console_fd on restore
If we set lxc.console=none, this fd won't exist, so let's not fail if it
doesn't. We already partially handled this case correctly, so let's
actually handle it correctly :)
Tycho Andersen [Mon, 21 Mar 2016 22:50:39 +0000 (16:50 -0600)]
c/r: don't pass --ext-mount-map flag when console=none
We don't pass anything on the restore side since we didn't save anything,
but the restore side will expect something if we pass this. Instead, let's
not pass anything.
Tycho Andersen [Fri, 18 Mar 2016 19:13:17 +0000 (13:13 -0600)]
c/r: print criu's stdout when it fails
In particular, when CRIU fails before it has completely initialized its log
(e.g. if the log directory doesn't exist, or if the argument parser fails),
it prints the error to stdout. Let's log that.
Tycho Andersen [Thu, 17 Mar 2016 11:14:43 +0000 (05:14 -0600)]
autodev: don't always create /dev/console
In particular, only create /dev/console when it is set to "none".
Otherwise, we will bind mount a pts device later, so let's just leave it.
Also, when bind mounting the pts device, let's create /dev/console if it
doesn't exist, since it may not already exist due to the above :)
v2: s/ot/to
v3: add O_EXCL so we actually get EEXIST, use the right condition for
mount_console (we want to compare against console.path, not
console.name, and console.path can be null)
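The O_EXCL point from v3 can be sketched as follows. This is an illustration under assumptions, not the patch itself: ensure_mount_target() is a hypothetical name, and the mode bits are arbitrary.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: make sure `path` exists so it can be used as a
 * bind mount target. Returns 0 if the file was created or was already
 * there, -1 on any other error. */
int ensure_mount_target(const char *path)
{
	/* With O_EXCL, open() fails with EEXIST if the file already exists.
	 * Without it, open() succeeds on an existing file and EEXIST is
	 * never reported. */
	int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);

	if (fd < 0)
		return errno == EEXIST ? 0 : -1;

	close(fd);
	return 0;
}
```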
Serge Hallyn [Wed, 16 Mar 2016 06:02:10 +0000 (23:02 -0700)]
cgroups: try to load cgmanager first
If cgmanager is running, use it. This allows the admin to simply
stop cgmanager if they don't want to use it. With the opposite order,
there would be no way to choose to use cgmanager.
Serge Hallyn [Wed, 16 Mar 2016 21:48:49 +0000 (14:48 -0700)]
Prevent access to pci devices
Prevent privileged containers from messing with the host's pci devices
directly. Refuse access under /proc/bus, and drop cap_sys_rawio. Some
containers may need to re-enable cap_sys_rawio (e.g. if they run an
X server).
It may be desirable to break some of this stuff into files which can be
separately included (or not included), but this patch isn't the right
place for that.
Tycho Andersen [Tue, 15 Mar 2016 18:01:36 +0000 (12:01 -0600)]
build: fix build on android (and ppc)
The problem here is that dev_t on most platforms is `long unsigned`, but on
android (and ppc?) it's `long long unsigned`. Let's just upcast to `long
long unsigned` and use that format string to keep the compilers happy.
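A minimal sketch of the upcast described above. format_dev() is a hypothetical helper, not a function from the LXC tree.

```c
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: print a dev_t portably. The cast makes the
 * argument match %llu regardless of how wide dev_t is on the platform. */
int format_dev(char *buf, size_t len, dev_t dev)
{
	return snprintf(buf, len, "%llu", (unsigned long long)dev);
}
```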
Tycho Andersen [Sat, 12 Mar 2016 01:10:40 +0000 (18:10 -0700)]
c/r: drop lxc.console=none config requirement
There are a few things going on in this patch.
1. /dev/console is an external mount since it is bind mounted from the
host. However, we don't want to use criu's --ext-mount-map auto handling
here, because that will bind mount exactly the same path from the host
on restore, but if the pts device is different on the target host, we'll
bind mount the wrong one.
2. We need to tell CRIU how to restore the TTY. Since we declare the tty as
--external, we need to provide it via --inherit-fd (even though we've
already fixed up the environment).
Tycho Andersen [Sat, 12 Mar 2016 02:01:43 +0000 (19:01 -0700)]
criu: hide more stuff in criu.c
Various other functions/structures are now only used in criu.c, so let's
hide stuff there so as not to pollute headers.
This commit also bumps the required CRIU version to 2.0. While we don't
*require* any features that aren't in 1.8 patchlevel 21 or above, 2.0 is a
vast improvement, and so we should use that instead.
Tycho Andersen [Thu, 10 Mar 2016 18:10:14 +0000 (11:10 -0700)]
cgroup: cgroup_escape takes no arguments
cgroup_escape() is a slight abuse of the cgroup code: what we really want
here is to escape the *current* process, whether it happens to be the LXC
monitor or not, into the / cgroups.
In the case of dump, we can't do an lxc_init(), because:
We don't want to make this a command to send to the handler, because again,
cgroup_escape() is intended to escape the *current* task to the root
cgroups.
So, let's just have cgroup_escape() build its own handler when required.
Serge Hallyn [Wed, 9 Mar 2016 07:04:46 +0000 (23:04 -0800)]
cgfsng: fix real bug and fake libc realloc bug
read_file was using the wrong value for the string length. Also,
realloc on i386 is wonky with small sizes - so use a batch size
to avoid small reallocs.
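The batching idea can be sketched like this. It is an illustration only: read_stream() and BATCH_SIZE are assumed names and values, and the real read_file in cgfsng may be structured differently.

```c
#include <stdio.h>
#include <stdlib.h>

#define BATCH_SIZE 50 /* assumed chunk size, chosen for illustration */

/* Sketch: read a whole stream into a NUL-terminated heap buffer,
 * growing it BATCH_SIZE bytes at a time instead of a few bytes per
 * realloc. Returns NULL on allocation failure or if the stream is
 * empty. The caller frees the result. */
char *read_stream(FILE *f)
{
	char *buf = NULL;
	size_t len = 0, cap = 0;
	int c;

	while ((c = fgetc(f)) != EOF) {
		/* Keep room for the new byte plus the terminator. */
		if (len + 1 >= cap) {
			char *tmp = realloc(buf, cap + BATCH_SIZE);

			if (!tmp) {
				free(buf);
				return NULL;
			}
			buf = tmp;
			cap += BATCH_SIZE;
		}
		buf[len++] = (char)c;
		buf[len] = '\0';
	}
	return buf;
}
```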
Serge Hallyn [Tue, 8 Mar 2016 03:10:58 +0000 (19:10 -0800)]
prevent containers from reading /sys/kernel/debug
Unprivileged containers cannot read it anyway, but also prevent root
owned containers from doing so. Sadly upstart's mountall won't run
if we try to prevent it from being mounted at all.
Serge Hallyn [Mon, 7 Mar 2016 19:16:43 +0000 (11:16 -0800)]
cgfsng - remove the code checking whether devices cgroup lines are already done
We may need to revert this, but I *think* we no longer need this
with default configs. The idea iirc was that if caller cannot
write to devices.allow (i.e. is in a user namespace), then ignore
permission failures if the cgroups are already sufficiently setup.
Serge Hallyn [Thu, 3 Mar 2016 18:31:23 +0000 (10:31 -0800)]
cgfsng: next generation filesystem-backed cgroup implementation
This makes simplifying assumptions: all usable cgroups must be
mounted under /sys/fs/cgroup/controller or /sys/fs/cgroup/contr1,contr2.
Currently this will only work with cgroup namespaces, because
lxc.mount.auto = cgroup is not implemented. So cgfsng_ops_init()
returns NULL if cgroup namespaces are not enabled.
lxc-attach -n a -- sh -c 'echo ERR >&2' > /dev/null
There seems to be no easy way to discern when we need to write to stderr
instead of stdout when we receive an event on the master fd of an allocated
pty. So we're using a "trick"/"hack". We write to STDOUT_FILENO if it refers to
a pty. If STDOUT_FILENO does not refer to a pty we check whether STDERR_FILENO
refers to a pty and if so write to it.
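The selection rule can be sketched as below. pick_output_fd() is a hypothetical name, and isatty() is used as an approximation of "refers to a pty"; the actual check in lxc-attach may differ.

```c
#include <unistd.h>

/* Hypothetical helper: choose where to forward data read from the pty
 * master. Prefer out_fd if it is a terminal, fall back to err_fd if
 * that one is, and return -1 if neither is. */
int pick_output_fd(int out_fd, int err_fd)
{
	if (isatty(out_fd))
		return out_fd;
	if (isatty(err_fd))
		return err_fd;
	return -1;
}
```

A caller would pass STDOUT_FILENO and STDERR_FILENO.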
Signed-off-by: Christian Brauner <christian.brauner@mailbox.org>
Execute script lxc-devsetup also with sysvinit and upstart.
* This script sets up /dev/.lxc, which is needed for autodev containers.
* Previously it was only executed with systemd. Execute it also with
the other init systems (sysvinit and upstart).
Signed-off-by: Carlos Alberto Lopez Perez <clopez@igalia.com>
Serge Hallyn [Thu, 3 Mar 2016 00:11:14 +0000 (16:11 -0800)]
cgfs: don't try to remove cgroups we haven't created
info_ptr->created_paths_count can be 0, so don't blindly dereference
info_ptr->created_paths[created_paths_count - 1]. Apparently we never
had 0 at cleanup_name_on_this_level before, but now that
we can fail with -EPERM and not just -EEXIST, we do.
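A minimal sketch of the guard, using a hypothetical stand-in struct; only the two field names come from the message above.

```c
#include <stddef.h>

/* Hypothetical stand-in for the bookkeeping struct; only the two
 * fields named in the commit message are modelled. */
struct cgroup_info {
	char **created_paths;
	size_t created_paths_count;
};

/* Return the most recently created path, or NULL if nothing has been
 * created yet. Without the check, a count of 0 wraps around when 1 is
 * subtracted, so the index is far out of bounds. */
const char *last_created_path(const struct cgroup_info *info)
{
	if (!info || info->created_paths_count == 0)
		return NULL;
	return info->created_paths[info->created_paths_count - 1];
}
```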