Adrian Reber [Wed, 13 Dec 2017 11:04:02 +0000 (12:04 +0100)]
criu: add feature check capability
For migration optimization features like pre-copy or post-copy migration
the support cannot be determined by simply looking at the CRIU version.
Features like that depend on the architecture/kernel/criu combination
and CRIU offers a feature checking interface to query whether a given
feature is supported.
This adds an LXC interface to query CRIU for those features via the
migrate() API call. For the recent pre-copy migration support in LXD
this can be used to automatically detect if pre-copy migration should be
used.
In addition to the existing migrate() API commands this adds a new
command: 'MIGRATE_FEATURE_CHECK'.
The migrate_opts{} structure is extended by the member features_to_check
which is a bitmask defining which CRIU features should be queried.
Currently only the querying of the features FEATURE_MEM_TRACK and
FEATURE_LAZY_PAGES is supported.
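The bitmask convention described above can be sketched in isolation. This is a minimal, self-contained illustration and not the liblxc code: every `sketch_`/`SKETCH_` name is a hypothetical stand-in, the bit values are invented, and the commit text does not say how results are reported. The sketch assumes unsupported bits are cleared in place.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for FEATURE_MEM_TRACK / FEATURE_LAZY_PAGES.
 * The real values live in the liblxc headers and may differ. */
#define SKETCH_FEATURE_MEM_TRACK  (UINT64_C(1) << 0)
#define SKETCH_FEATURE_LAZY_PAGES (UINT64_C(1) << 1)

/* Hypothetical stand-in for the extended migrate_opts{} structure. */
struct sketch_migrate_opts {
	uint64_t features_to_check; /* in: features to query, out: features found */
};

/* Callback that answers "is this single feature supported?" (non-zero = yes).
 * In the real implementation this role is played by CRIU itself. */
typedef int (*sketch_probe_fn)(uint64_t feature);

/* Walk every requested bit and clear the ones the probe rejects.
 * Assumption: results are returned by clearing bits in place. */
static int sketch_feature_check(struct sketch_migrate_opts *opts,
				sketch_probe_fn probe)
{
	int i;

	assert(opts && probe);

	for (i = 0; i < 64; i++) {
		uint64_t bit = UINT64_C(1) << i;

		if (!(opts->features_to_check & bit))
			continue;

		if (!probe(bit))
			opts->features_to_check &= ~bit;
	}

	return 0;
}

/* Example probe that pretends only memory tracking is available. */
static int sketch_probe_only_mem_track(uint64_t feature)
{
	return feature == SKETCH_FEATURE_MEM_TRACK;
}
```

A caller would set both bits, run the check, and then test `features_to_check & SKETCH_FEATURE_MEM_TRACK` to decide whether pre-copy migration is worth attempting.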
Serge Hallyn [Thu, 14 Dec 2017 19:16:02 +0000 (13:16 -0600)]
dir_detect: warn on eperm
If the user has lxc.rootfs.path = /some/path/foo, but can't access
some piece of that path, then we'll get an unhelpful "failed to
mount" without any indication of the problem.
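A rough sketch of the idea, not the actual dir_detect() code: the function name is hypothetical, and checking EACCES alongside EPERM is an assumption added here because an unsearchable path component is usually reported as EACCES.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper: returns 1 if path is a directory, 0 otherwise.
 * When the failure is a permission problem, say so instead of staying
 * silent, so the later "failed to mount" error has some context. */
static int sketch_dir_detect(const char *path)
{
	struct stat sb;

	assert(path);

	if (stat(path, &sb) < 0) {
		int saved_errno = errno;

		/* Assumption: EACCES is treated like EPERM here. */
		if (saved_errno == EPERM || saved_errno == EACCES)
			fprintf(stderr, "dir_detect: cannot access \"%s\": %s\n",
				path, strerror(saved_errno));

		return 0;
	}

	return S_ISDIR(sb.st_mode) ? 1 : 0;
}
```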
Tycho Andersen [Fri, 8 Dec 2017 23:23:26 +0000 (23:23 +0000)]
init: don't kill(-1) if we aren't in a pid ns
...otherwise we'll kill everyone on the machine. Instead, let's explicitly
try to kill our children. Let's do a best effort against fork bombs by
disabling forking via the pids cgroup if it exists. This is best effort for
a number of reasons:
* the pids cgroup may not be available
* the container may have bind mounted /dev/null over pids.max, so the write
  doesn't do anything
Prior to this patch we raced with a very short-lived init process. Essentially,
the init process could exit before we had time to record the cgroup namespace
causing the container to abort and report ABORTING to the caller when it
actually started just fine. Let's not do this.
(This uses syscall(SYS_getpid) in the child to retrieve the pid just in case
we're on an older glibc version and we end up in the namespace sharing branch
of the actual lxc_clone() call.)
Additionally this fixes the shortlived tests. They were faulty so far and
should have actually failed because of the cgroup namespace recording race but
the ret variable used to return from the function was not correctly
initialized. This fixes it.
Furthermore, the shortlived tests used the c->error_num variable to determine
success or failure but this is actually not correct when the container is
started daemonized.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
In the case the container has a console with a valid slave pty file descriptor
we duplicate std{in,out,err} to the slave file descriptor so console logging
works correctly. When the container does not have a valid slave pty file
descriptor for its console and is started daemonized we should dup to
/dev/null.
Closes #1646.
Signed-off-by: Li Feng <lifeng68@huawei.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
We made std{err,in,out} a duplicate of the slave file descriptor of the console
if it existed. This meant we also duplicated all of them when we executed
application containers in the foreground even if some std{err,in,out} file
descriptor did not refer to a {p,t}ty. This blocked use cases such as:
echo foo | lxc-execute -n -- cat
which are very valid and common with application containers but less common
with system containers where we don't have to care about this. So my suggestion
is to unconditionally duplicate std{err,in,out} to the console file descriptor
if we are either running daemonized - this ensures that daemonized application
containers with a single bash shell keep on working - or when we are not
running an application container. In other cases we only duplicate those file
descriptors that actually refer to a {p,t}ty. This logic is similar to what we
do for lxc-attach already.
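The decision described above can be sketched like this. The function names and boolean parameters are hypothetical and only illustrate the rule: duplicate everything when daemonized or when not running an application container, otherwise duplicate only descriptors that refer to a terminal.

```c
#include <assert.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical predicate for the "duplicate unconditionally" case. */
static bool sketch_dup_unconditionally(bool daemonized, bool app_container)
{
	return daemonized || !app_container;
}

/* Hypothetical helper: point std{in,out,err} at console_fd following the
 * rule above. Returns 0 on success, -1 if a dup2() call fails. */
static int sketch_set_stdfds(int console_fd, bool daemonized,
			     bool app_container)
{
	bool all = sketch_dup_unconditionally(daemonized, app_container);
	int fd;

	assert(console_fd >= 0);

	for (fd = STDIN_FILENO; fd <= STDERR_FILENO; fd++) {
		/* Leave pipes and regular files alone in the foreground
		 * application container case. */
		if (!all && !isatty(fd))
			continue;

		if (dup2(console_fd, fd) < 0)
			return -1;
	}

	return 0;
}
```

With stdin connected to a pipe, as in the `echo foo | lxc-execute` case, a foreground application container keeps its piped stdin and only the terminal-backed descriptors are replaced.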
Refers to #1690.
Closes #2028.
Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Detaching network namespaces as an unprivileged user is currently not possible
and attaching to the user namespace will mean we are not allowed to move the
network device into an ancestor network namespace.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
tools: block using lxc-execute without config file
Moving away from internal symbols we can't do hacks like we currently do in
lxc-start and call internal functions like lxc_conf_init(). This is unsafe
anyway. Instead, we should simply error out if the user didn't give us a
configuration file to use. lxc-start refuses to start in that case already.
Relates to discussion in https://github.com/lxc/go-lxc/pull/96#discussion_r155075560 .
Closes #2023.
Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
When I first solved this problem I went for a fork() + setns() + clone() model.
This works fine but has unnecessary overhead for a couple of reasons:
- doing a full fork() including copying file descriptor table and virtual
  memory
- using pipes to retrieve the pid of the second child (the actual container
  process)
This can all be avoided by being a little smart in how we employ the clone()
syscall:
- using CLONE_VM will let us get rid of using pipes since we can simply write
  to the handler because we share the memory with our parent
- using CLONE_VFORK will also let us get rid of using pipes since the execution
  of the parent is suspended until the child returns
- using CLONE_VM will not cause virtual memory to be copied
- using CLONE_FILES will not cause the file descriptor table to be copied
Note that the intermediate clone() is used with CLONE_VM. Some glibc versions
used to reset the pid/tid to -1 when CLONE_VM was used without CLONE_THREAD.
But since the memory between parent and child is shared on CLONE_VM this would
invalidate the getpid() cache that glibc used to maintain and so getpid() in
the child would return the parent's pid. This is all fixed in newer glibc
versions where the getpid() cache is removed and the pid/tid is not reset
anymore. However, if for whatever reason you - dear committer - somehow need to
get the pid of the dummy intermediate process for do_share_ns() you need to
call syscall(__NR_getpid) directly. The next lxc_clone() call does not employ
CLONE_VM and will be fine.
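The flag combination can be illustrated with a small standalone sketch. It is not the do_share_ns() code: the struct, the function names and the stack size are assumptions, and the setns() step is left out entirely. It only shows the child writing its pid straight into memory shared with the parent, obtained via the raw syscall as advised above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the handler the intermediate process writes to. */
struct sketch_handler {
	pid_t child_pid;
};

/* Runs in the intermediate process. Because of CLONE_VM the store below is
 * visible to the parent without any pipe. The raw syscall sidesteps the
 * getpid() cache of older glibc versions. */
static int sketch_child(void *arg)
{
	struct sketch_handler *h = arg;

	h->child_pid = (pid_t)syscall(SYS_getpid);
	return 0;
}

/* Returns 0 if the child ran and reported its own pid, -1 otherwise. */
static int sketch_clone_shared(struct sketch_handler *h)
{
	const size_t stack_size = 64 * 1024; /* arbitrary size for the sketch */
	char *stack;
	pid_t pid;
	int status;

	assert(h);
	h->child_pid = -1;

	stack = malloc(stack_size);
	if (!stack)
		return -1;

	/* CLONE_VFORK suspends us until the child exits, so h->child_pid is
	 * already filled in when clone() returns in the parent. The stack
	 * grows down, hence the pointer to its top. */
	pid = clone(sketch_child, stack + stack_size,
		    CLONE_VM | CLONE_VFORK | CLONE_FILES | SIGCHLD, h);
	if (pid < 0) {
		free(stack);
		return -1;
	}

	if (waitpid(pid, &status, 0) != pid) {
		free(stack);
		return -1;
	}
	free(stack);

	if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
		return -1;

	return h->child_pid == pid ? 0 : -1;
}
```

Since the parent only resumes once the child is gone, freeing the child's stack right after waitpid() is safe in this sketch.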
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Avoid NULL-pointer dereference. Apparently monitor.{c,h} calls
lxc_check_inherited() with NULL passed for the config. This isn't really a big
issue since monitor.{c,h} is effectively dead for all liblxc versions that have
the state client patch. Also, the patch that introduces the relevant lines into
lxc_check_inherited() is only in master and as yet unreleased.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
It doesn't make sense to error out when an app container doesn't pass explicit
arguments through c->start{l}(). This is especially true since we implemented
lxc.execute.cmd. However, even before that we could always have relied on
lxc.init.cmd and errored out only after that.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>