When we deleted cgroups for unprivileged containers we used to allocate a new
mapping and clone a new user namespace each time we delete a cgroup. This of
course meant - on a cgroup v1 system - doing this >= 10 times when all
controllers were used. Let's not to do this and only allocate and establish a
mapping once.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
When fully unprivileged users run a container that only maps their own {g,u}id
and they do not have access to setuid new{g,u}idmap binaries we will write the
idmapping directly. This however requires us to write "deny" to
/proc/[pid]/setgroups otherwise any write to /proc/[pid]/gid_map will be
denied.
On a sidenote, this patch enables fully unprivileged containers. If you now set
lxc.net.[i].type = empty no privilege whatsoever is required to run a container.
Enhances #2033.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Felix Abecassis <fabecassis@nvidia.com> Cc: Jonathan Calmels <jcalmels@nvidia.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Serge Hallyn [Thu, 4 Jan 2018 03:02:53 +0000 (21:02 -0600)]
configure.ac: fix the check for static libcap
The existing check doesn't work, because when you statically
link a program against libc, any functions not called are not
included. So cap_init() which we check for is not there in
the built binary.
So instead just check whether a "gcc -lcap -static" works.
If libcap.a is not available it will fail, if it is it will
succeed.
- As discussed we will have a proper API extension that will allow updating
various parts of a running container. The prior approach wasn't a good idea.
- Revert this is not a problem since we haven't released any version with the
set_running_config_item() API extension.
- I'm not simply reverting so that master users can still call into new
liblxc's without crashing the container. This is achieved by keeping the
commands callback struct member number identical.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
mainloop: capture output of short-lived init procs
The handler for the signal fd will detect when the init process of a container
has exited and cause the mainloop to close. However, this can happen before the
console handlers - or any other events for that matter - are handled. So in the
case of init exiting we still need to allow for all buffered input to the
console to be handled before exiting. This allows us to capture output from
short-lived init processes.
This is conceptually equivalent to my implementation of ExecReaderToChannel()
https://github.com/lxc/lxd/blob/master/shared/util_linux.go#L527
Closes #1694.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
When set set{u,g}id() the kernel will make us undumpable. This is unnecessary
since we can guarantee that whatever is running inside the child process at
this point this is fully trusted by the parent. Making us dumpable let's users
use debuggers on the child process before the exec as well and also allows us
to open /proc/<child-pid> files in lieu of the child.
Note, that we only need to perform the prctl(PR_SET_DUMPABLE, ...) if our
effective uid on the host is not 0. If our effective uid on the host is 0 then
we will keep all capabilities in the child user namespace across set{g,u}id().
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Receive fd for LSM security module before we set{g,u}id(). The reason is that
on set{g,u}id() the kernel will a) make us undumpable and b) we will change our
effective uid. This means our effective uid will be different from the
effective uid of the process that created us which means that this processs no
longer has capabilities in our namespace including CAP_SYS_PTRACE. This means
we will not be able to read and /proc/<pid> files for the process anymore when
/proc is mounted with hidepid={1,2}. So let's get the lsm label fd before the
set{g,u}id().
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
At this point, macros such DEBUG or ERROR does not take effect because
this code is called from cgroup_ops_init(cgroup.c), which runs with
__attribute__((constructor)), before any log level is set form any tool
like lxc-start, so these messages are lost.
For now on, use the same LXC_DEBUG_CGFSNG environment variable to
control these messages.
Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
Excerpt from dnsmasq(8):
By default, the DHCP server will attempt to ensure that an address in not
in use before allocating it to a host. It does this by sending an ICMP echo
request (aka "ping") to the address in question. If it gets a reply, then the
address must already be in use, and another is tried. This flag disables this check.
This is useful if one expects all the containers to get an IP address
from the LXC authoritative DHCP server and wants to speed up the process
of getting a lease.
Signed-off-by: Jonathan Calmels <jcalmels@nvidia.com>
- Merge dhclient-start and dhclient-stop into a single hook.
- Wait for a lease before returning from the hook.
- Generate a logfile when LXC log level is either DEBUG or TRACE.
- Rely on namespace file descriptors for the stop hook.
- Use settings from /<sysconf>/lxc/dhclient.conf if available.
- Attempt to cleanup if dhclient fails to shutdown properly.
Signed-off-by: Jonathan Calmels <jcalmels@nvidia.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Adrian Reber <areber@redhat.com>
but missed unprivileged btrfs snapshot creation. Fix it too.
Follow-up to #1956.
Closes #2051.
Reported-by: Oleg Freedhom overlayfs@gmail.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
This is based on raw_clone in systemd but adapted to our needs. The main reason
is that we need an implementation of fork()/clone() that does guarantee us that
no pthread_atfork() handlers are run. While clone() in glibc currently doesn't
run pthread_atfork() handlers we should be fine but there's no guarantee that
this won't be the case in the future. So let's do the syscall directly - or as
direct as we can. An additional nice feature is that we get fork() behavior,
i.e. lxc_raw_clone() returns 0 in the child and the child pid in the parent.
Our implementation tries to make sure that we cover all cases according to
kernel sources. Note that we are not interested in any arguments that could be
passed after the stack.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
When we report STOPPED to a caller and then close the command socket it is
technically possible - and I've seen this happen on the test builders - that a
container start() right after a wait() will receive ECONNREFUSED because it
called open() before we close(). So for all new state clients simply close the
command socket. This will inform all state clients that the container is
STOPPED and also prevents a race between a open()/close() on the command socket
causing a new process to get ECONNREFUSED because we haven't yet closed the
command socket.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Adrian Reber [Wed, 13 Dec 2017 11:04:02 +0000 (12:04 +0100)]
criu: add feature check capability
For migration optimization features like pre-copy or post-copy migration
the support cannot be determined by simply looking at the CRIU version.
Features like that depend on the architecture/kernel/criu combination
and CRIU offers a feature checking interface to query if it is
supported.
This adds a LXC interface to query CRIU for those feature via the
migrate() API call. For the recent pre-copy migration support in LXD
this can be used to automatically detect if pre-copy migration should be
used.
In addition to the existing migrate() API commands this adds a new
command: 'MIGRATE_FEATURE_CHECK'.
The migrate_opts{} structure is extended by the member features_to_check
which is a bitmask defining which CRIU features should be queried.
Currently only the querying of the features FEATURE_MEM_TRACK and
FEATURE_LAZY_PAGES is supported.
Serge Hallyn [Thu, 14 Dec 2017 19:16:02 +0000 (13:16 -0600)]
dir_detect: warn on eperm
if user has lxc.rootfs.path = /some/path/foo, but can't access
some piece of that path, then we'll get an unhelpful "failed to
mount" without any indication of the problem.