This fixes a race in liblxc logging which can lead to deadlocks. Before this
fix, the issue could be reproduced by compiling with --enable-tests and then
running:
So far, we opened a file descriptor referring to proc on the host inside the
host namespace and handed that fd to the attached process in
attach_child_main(). This was done to ensure that LSM labels were set up
correctly. However, by exploiting a potential kernel bug, ptrace could be used to
prevent the file descriptor from being closed which in turn could be used by an
unprivileged container to gain access to the host namespace. Aside from this
needing an upstream kernel fix, we should make sure that we don't pass the fd
for proc itself to the attached process. However, we cannot completely prevent
this, as the attached process needs to be able to change its apparmor profile
by writing to /proc/self/attr/exec or /proc/self/attr/current. To minimize the
attack surface, we only send the fd for /proc/self/attr/exec or
/proc/self/attr/current to the attached process. To do this we introduce a
little more IPC between the child and parent:
 *   IPC mechanism: (X is receiver)
 *   initial process        intermediate          attached
 *        X           <---  send pid of
 *                          attached proc,
 *                          then exit
 *    send 0 ------------------------------------>    X
 *                                              [do initialization]
 *        X  <------------------------------------  send 1
 *   [add to cgroup, ...]
 *    send 2 ------------------------------------>    X
 *                                              [set LXC_ATTACH_NO_NEW_PRIVS]
 *        X  <------------------------------------  send 3
 *   [open LSM label fd]
 *    send 4 ------------------------------------>    X
 *                                              [set LSM label]
 *   close socket                                 close socket
 *                                                run program
The attached child tells the parent when it is ready to have its LSM labels set
up. The parent then opens an appropriate fd for the child PID to
/proc/<pid>/attr/exec or /proc/<pid>/attr/current and sends it via SCM_RIGHTS
to the child. The child can then set its LSM label. Both sides then close the
socket fds and the child execs the requested process.
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
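
The fd handoff described above relies on SCM_RIGHTS ancillary data over a
connected AF_UNIX socket. The following is only a minimal sketch of that
mechanism and not the code from attach.c; the helper names send_fd_sketch()
and recv_fd_sketch() are hypothetical, and the single payload byte is an
assumption made so that the control message has data to ride along with.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: pass one open fd to the peer via SCM_RIGHTS. */
static int send_fd_sketch(int sock, int fd)
{
	char byte = 0;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		struct cmsghdr align; /* forces correct alignment of buf */
		char buf[CMSG_SPACE(sizeof(int))];
	} ctrl;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&ctrl, 0, sizeof(ctrl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one fd; returns the new fd or -1. */
static int recv_fd_sketch(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		struct cmsghdr align;
		char buf[CMSG_SPACE(sizeof(int))];
	} ctrl;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd = -1;

	memset(&ctrl, 0, sizeof(ctrl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

In this scheme the parent would call send_fd_sketch() with the fd it opened
for the child's attr file, and the child would write its label to the fd
returned by recv_fd_sketch() before both sides close their socket ends.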
The identifiers for namespaces used with lxc-unshare and lxc-attach as given on
the manpage do not align with the standard identifiers. This affects network,
mount, and uts namespaces. The standard identifiers are: "mnt", "uts", and
"net" whereas lxc-unshare and lxc-attach use "MOUNT", "UTSNAME", and "NETWORK".
I'm wary of hacking this into namespace.{c,h} by e.g. adding additional members
to the ns_info struct or special-casing this in lxc_fill_namespace_flags().
Internally, we should only accept standard identifiers to ensure that we are
always correctly aligned with the kernel. So let's use some cheap memmove()s to
replace them by their standard identifiers in lxc-unshare and lxc-attach.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
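
As an illustration of the in-place replacement described above, here is a
minimal sketch. It is not the code from lxc-unshare or lxc-attach; the helper
names are hypothetical, and it assumes that every standard identifier is
strictly shorter than the name it replaces, so the string only ever shrinks.

```c
#include <string.h>

/*
 * Hypothetical helper: replace every occurrence of @from in @s with the
 * strictly shorter string @to, shifting the tail down with memmove().
 */
static void replace_in_place_sketch(char *s, const char *from, const char *to)
{
	size_t from_len = strlen(from);
	size_t to_len = strlen(to);
	char *hit;

	/* Only shrinking replacements are safe without reallocating. */
	if (to_len >= from_len)
		return;

	while ((hit = strstr(s, from))) {
		memcpy(hit, to, to_len);
		/* Move the remainder, including the trailing '\0'. */
		memmove(hit + to_len, hit + from_len,
			strlen(hit + from_len) + 1);
		s = hit + to_len;
	}
}

/* Hypothetical wrapper: map the manpage names to the standard ones. */
static void normalize_ns_names_sketch(char *namespaces)
{
	replace_in_place_sketch(namespaces, "MOUNT", "mnt");
	replace_in_place_sketch(namespaces, "UTSNAME", "uts");
	replace_in_place_sketch(namespaces, "NETWORK", "net");
}
```

Applied to a writable copy of the argument "MOUNT|PID|UTSNAME|NETWORK", this
sketch leaves "mnt|PID|uts|net" in the buffer.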
This function safely parses an unsigned integer. On success it returns 0 and
stores the unsigned integer in @converted. On error it returns a negative
errno.
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
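
A parser with this contract could look roughly like the sketch below. The
name safe_uint_sketch() is hypothetical, and rejecting a leading minus sign,
rejecting trailing characters, and parsing base 10 only are assumptions of
this sketch rather than statements about the function added by the commit.

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse @numstr as an unsigned int.
 * Returns 0 and stores the value in @converted on success,
 * a negative errno value on failure.
 */
static int safe_uint_sketch(const char *numstr, unsigned int *converted)
{
	char *end = NULL;
	unsigned long value;

	while (isspace((unsigned char)*numstr))
		numstr++;

	/* strtoul() would silently negate the value, so refuse it here. */
	if (*numstr == '-')
		return -EINVAL;

	errno = 0;
	value = strtoul(numstr, &end, 10);
	if (errno == ERANGE && value == ULONG_MAX)
		return -ERANGE;

	/* No digits at all, or trailing garbage after the number. */
	if (end == numstr || *end != '\0')
		return -EINVAL;

	if (value > UINT_MAX)
		return -ERANGE;

	*converted = (unsigned int)value;
	return 0;
}
```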
If the file "/sys/devices/system/cpu/isolated" doesn't exist, we can't simply
bail. We still need to check whether we need to copy the parent's cpu
settings.
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
Move the user namespace to the first position in the array so that we always
attach to it first when iterating over the struct and using setns() to switch
namespaces. This especially affects lxc_attach(): Suppose you cloned a new user
namespace and mount namespace as an unprivileged user on the host and want to
setns() to the mount namespace. This requires you to attach to the user
namespace first; otherwise the kernel will fail this check:
Using custom structs in attach.c risks getting out of sync with the commonly
used ns_info[LXC_NS_MAX] struct and thus attaching to wrong namespaces. Switch
to using ns_info[LXC_NS_MAX].
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
- Allocating an error message that the caller must free seems pointless. We can
just print the error message in preserve_ns() itself. This also allows us to
avoid using the GNU extension asprintf().
- Improve lxc_preserve_ns(): By passing in NULL or "" as the second argument
the function can now also be used to check whether namespaces are supported
by the kernel.
- Use lxc_preserve_ns() in preserve_ns().
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
- So far we blindly called lxc_delete_network() to make sure that we deleted
all network interfaces. This resulted in pointless netlink calls, especially
when a container had multiple networks defined. Let's be smarter and have
lxc_delete_network() return a boolean that indicates whether *all* configured
networks have been deleted. If so, don't needlessly try to delete them again
in start.c. This also reduces the number of confusing error messages a user might see.
- When we receive -ENODEV from one of our lxc_netdev_delete_*() functions,
let's assume that either the network device already got deleted or that it
got moved to a different network namespace. Inform the user about this but do
not report an error in this case.
- When we have explicitly deleted the host side of a veth pair let's
immediately free(priv.veth_attr.pair) and NULL it, or
memset(priv.veth_attr.pair, ...) the corresponding member so we don't
needlessly try to destroy them again when we have to call
lxc_delete_network() again in start.c
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
When we set LXC_DEBUG_CGFSNG=1 we print out info about detected cgroup
hierarchies. When there's no named cgroup mounted we need to make sure that we
don't try to index an unallocated pointer.
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
Adrian Reber [Mon, 14 Nov 2016 14:44:04 +0000 (14:44 +0000)]
lxc-checkpoint: enable dirty memory tracking in criu
CRIU supports dirty memory tracking to take incremental checkpoints.
Incremental checkpoints are one way of reducing downtime during
migration. The first checkpoint dumps all the memory pages and the
second (and third, and fourth, ...) only dumps pages which have changed.
Most of the necessary code has already been implemented. This just adds
the existing functionality to lxc-checkpoint:
-p, --pre-dump Only pre-dump the memory of the container.
Container keeps on running and following
checkpoints will only dump the changes.
--predump-dir=DIR path to images from previous dump (relative to -D)
The following is an example from a container running CentOS 7 with psql
and tomcat:
# lxc-checkpoint -n c7 -D /tmp/cp -p
Container keeps on running
# du -h /tmp/cp
229M /tmp/cp
Sync initial checkpoint to destination
# rsync -a /tmp/cp host2:/tmp/
Sync file-system
# rsync -a /var/lib/lxc/c7 host2:/var/lib/lxc/
Final dump; container is stopped
# lxc-checkpoint -n c7 -D /tmp/cp2 --predump-dir=../cp -s
# du -h /tmp/cp2
90M /tmp/cp2
After transferring the second (incremental) checkpoint and the changes
to the container's file system the container can be restored on the
second host by pointing lxc-checkpoint to the second checkpoint
directory:
Stéphane Graber [Mon, 14 Nov 2016 16:53:07 +0000 (11:53 -0500)]
debian: Don't depend on libui-dialog-perl
This package doesn't exist in stretch anymore, and it's unclear why we
were depending on a library to begin with (as opposed to having it
brought by whatever needs it).
we cannot simply copy the cpuset.cpus file from our parent cgroup. For example,
in the root cgroup cpuset.cpus will contain all of the cpus including the
isolated cpus. Copying the values of the root cgroup into a child cgroup will
lead to a wrong view in /proc/self/status: For the root cgroup
/sys/fs/cgroup/cpuset /proc/self/status will correctly show
Cpus_allowed_list: 0-1,3
even though cpuset.cpus will show
0-3
However, initializing a subcgroup in the cpuset controller by copying the
cpuset.cpus setting from the root cgroup will cause /proc/self/status to
incorrectly show
Cpus_allowed_list: 0-3
Hence, we need to make sure to remove the isolated cpus from cpuset.cpus. Seth
has argued that this is not a kernel bug but by design. So let us be the smart
guys and fix this in liblxc.
The solution is straightforward: To avoid having to work with raw cpulist
strings we create cpumasks based on uint32_t bit arrays.
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
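
To make the bit array idea concrete, here is a minimal sketch that parses two
cpulist strings into uint32_t masks, clears the isolated cpus, and prints the
result as a cpulist again. It is not the liblxc implementation; all names and
the fixed 128 cpu limit are assumptions of this sketch.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_MAX_CPUS 128 /* assumed upper bound for this sketch */
#define SKETCH_WORDS (SKETCH_MAX_CPUS / 32)

static int bit_is_set(const uint32_t *mask, unsigned int cpu)
{
	return (mask[cpu / 32] >> (cpu % 32)) & 1;
}

/* Parse a cpulist such as "0-3,8" into @mask. Returns 0 or -1. */
static int cpulist_to_mask_sketch(const char *list, uint32_t *mask)
{
	memset(mask, 0, SKETCH_WORDS * sizeof(uint32_t));
	while (*list != '\0' && *list != '\n') {
		char *end;
		unsigned long first, last;

		first = last = strtoul(list, &end, 10);
		if (end == list)
			return -1;
		if (*end == '-') {
			list = end + 1;
			last = strtoul(list, &end, 10);
			if (end == list)
				return -1;
		}
		if (first > last || last >= SKETCH_MAX_CPUS)
			return -1;
		for (unsigned long cpu = first; cpu <= last; cpu++)
			mask[cpu / 32] |= UINT32_C(1) << (cpu % 32);
		if (*end == ',')
			end++;
		else if (*end != '\0' && *end != '\n')
			return -1;
		list = end;
	}
	return 0;
}

/* Write @mask back out as a cpulist string. Returns 0 or -1. */
static int mask_to_cpulist_sketch(const uint32_t *mask, char *buf, size_t size)
{
	size_t used = 0;
	unsigned int cpu = 0;

	if (size == 0)
		return -1;
	buf[0] = '\0';
	while (cpu < SKETCH_MAX_CPUS) {
		unsigned int first, last;
		int n;

		if (!bit_is_set(mask, cpu)) {
			cpu++;
			continue;
		}
		first = cpu;
		while (cpu + 1 < SKETCH_MAX_CPUS && bit_is_set(mask, cpu + 1))
			cpu++;
		last = cpu++;
		if (first == last)
			n = snprintf(buf + used, size - used, "%s%u",
				     used ? "," : "", first);
		else
			n = snprintf(buf + used, size - used, "%s%u-%u",
				     used ? "," : "", first, last);
		if (n < 0 || (size_t)n >= size - used)
			return -1;
		used += (size_t)n;
	}
	return 0;
}

/* Compute @possible minus @isolated and store the cpulist in @out. */
static int remove_isolated_sketch(const char *possible, const char *isolated,
				  char *out, size_t size)
{
	uint32_t pmask[SKETCH_WORDS], imask[SKETCH_WORDS];

	if (cpulist_to_mask_sketch(possible, pmask) < 0 ||
	    cpulist_to_mask_sketch(isolated, imask) < 0)
		return -1;
	for (size_t i = 0; i < SKETCH_WORDS; i++)
		pmask[i] &= ~imask[i];
	return mask_to_cpulist_sketch(pmask, out, size);
}
```

With the values from the example above, a parent cpuset.cpus of "0-3" and an
isolated list of "2" yield "0-1,3", which is the value one would then write to
the child cgroup's cpuset.cpus.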
start: CLONE_NEWCGROUP after we have set up cgroups
If we do it earlier we end up with a wrong view of /proc/self/cgroup. For
example, assume we unshare(CLONE_NEWCGROUP) first, and then create the cgroup
for the container, say /sys/fs/cgroup/cpuset/lxc/c, then /proc/self/cgroup
would show us:
8:cpuset:/lxc/c
whereas it should actually show
8:cpuset:/
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>