We need to have lxc_attach() distinguish between a caller specifying specific
namespaces to attach to and a caller not requesting specific namespaces. The
latter is taken by lxc_attach() to mean that all namespaces will be attached.
This also needs to include all inherited namespaces.
Closes #1890.
Closes #1897.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
The configure checks for these use AC_CHECK_DECLS, which define the symbol
to 0 if not available - So adjust the code to match. From the autoconf
manual:
- Implement inheriting user namespaces.
- When inheriting user namespaces make sure to not try and map ids again. The
kernel will not allow you to do this.
- Change clone() logic:
1. If we inherit no namespaces simply call lxc_clone().
2. If we inherit any namespaces call lxc_fork_attach_clone(). Here's why:
- Causes one syscall (fork()) instead of two syscalls (setns() to
inherited namespace and setns() back to parent namespace) to be
performed.
- Allows us to get rid of a bunch of variables and helper functions/code.
- Sharing a user namespaces requires us to setns() to the inherited user
namespace but the kernel does not allow reattaching to a parent user
namespace. So the old logic made user namespace inheritance impossible.
By using the lxc_fork_attach_clone() model we can simply setns() to the
inherited user namespace in the fork()ed child and be done with it.
The only thing we need to do is to specify CLONE_PARENT when calling
clone() in lxc_fork_attach_clone() so that we can wait on the child.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
This commit also gets rid of ~10 unnecessarily file descriptors that were kept
open. Before we kept open:
- A set of file descriptors that refer to the monitor's namespaces. These were
only used to reattach to the monitor's namespace in lxc_spawn() and were
never used anywhere else. So close them and don't keep them around.
- A list of inherited file descriptors.
- A list of file descriptors referring to the containers's namespaces to pass
to lxc.hook.stop. This list duplicated inherited file descriptors.
Let's simply use a single list in the handler that has all file descriptors we
need and get rid of all other ones. As an illustration. Starting a container
1. Without this patch and looking at the fds that the monitor keeps open (26):
There's no obvious need to strdup() the name of the container in the handler.
We can simply make this a pointer to the memory allocated in
lxc_container_new().
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Some toolchains which are not bionic like uclibc does not support
prlimit or prlimit64. In this case, return an error.
Moreover, if prlimit64 is available, use lxc implementation of prlimit.
In case cgroup namespaces are supported but we do not have CAP_SYS_ADMIN we
need to mount cgroups for the container. This patch enables both privileged and
unprivileged containers without CAP_SYS_ADMIN.
Closes #1737.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
When attaching to a container's namespaces we did not handle the case where we
inherited namespaces correctly. In essence, liblxc on start records the
namespaces the container was created with in the handler. But it only records
the clone flags that were passed to clone() and doesn't record the namespaces
we e.g. inherited from other containers. This means that attach only ever
attached to the clone flags. But this is only correct if all other namespaces
not recorded in the handler refer to the namespaces of the caller. However,
this need not be the case if the container has inherited namespaces from
another container. To handle this case we need to check whether caller and
container are in the same namespace. If they are, we know that things are all
good. If they aren't then we need to attach to these namespaces as well.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Antonio Terceiro [Sat, 28 Oct 2017 11:20:35 +0000 (09:20 -0200)]
lxc-debian: don't hardcode valid releases
This avoids the dance of updating the list of valid releases every time
Debian makes a new release.
It also fixes the following bug: even though lxc-debian will default to
creating containers of the latest stable by querying the archive, it
won't allow you to explicitly request `stable` because the current list
of valid releases don't include it.
Last, but not least, avoid hitting the mirror in the case the desired
release is one of the ones we know will always be there, i.e. stable,
testing, sid, and unstable.
Signed-off-by: Antonio Terceiro <terceiro@debian.org>
Antonio Terceiro [Thu, 26 Oct 2017 22:42:49 +0000 (20:42 -0200)]
lxc-debian: allow creating `testing` and `unstable`
Being able to create `testing` containers, regardless of what's the name
of the next stable, is useful in several contexts, included but not
limited to testing purposes. i.e. one won't need to explicitly switch to
`bullseye` once `buster` is released to be able to continue tracking
`testing`. While we are at it, let's also enable `unstable`, which is
exactly the same as `sid`, but there is no reason for not being able to.
Signed-off-by: Antonio Terceiro <terceiro@debian.org>
ringbuf: implement simple and efficient ringbuffer
liblxc will use a ringbuffer implementation that employs mmap()ed memory.
Specifically, the ringbuffer will create an anonymous memory mapping twice the
requested size for the ringbuffer. Afterwards, an in-memory file the requested
size for the ringbuffer will be created. This in-memory file will then be
memory mapped twice into the previously established anonymous memory mapping
thereby effectively splitting the anoymous memory mapping in two halves of
equal size. This will allow the ringbuffer to get rid of any complex boundary
and wrap-around calculation logic. Since the underlying physical memory is the
same in both halves of the memory mapping only a single memcpy() call for both
reads and writes from and to the ringbuffer is needed.
Design Notes:
- Since we're using MAP_FIXED memory mappings to map the same in-memory file
twice into the anonymous memory mapping the kernel requires us to always
operate on properly aligned pages. To guarantee proper page aligment the size
of the ringbuffer must always be a muliple of the kernel's page size. This
also implies that the minimum size of the ringbuffer must be at least equal to
one page size. This additional requirement is reasonably unproblematic.
First, any ringbuffer smaller than the size of a single page is very likely
useless since the standard page size on linux is 4096 bytes.
- Because liblxc is not able to predict the output a user is going to produce
(e.g. users could cat binary files onto the console) and because the
ringbuffer is located in a hotpath and needs to be as performant as possible
liblxc will not parse the buffer.
Use Case:
The ringbuffer is needed by liblxc in order to safely log the output of write
intensive callers that produce unpredictable output or unpredictable amounts of
output. The console output created by a booting system and the user is one of
those cases. Allowing a container to log the console's output to a file it
would be possible for a malicious user to fill up the host filesystem by
producing random ouput on the container's console if quota support is either
not enabled or not available for the underlying filesystem. Using a ringbuffer
is a reliable and secure way to ensure a fixed-size log.
Closes #1857.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Adam Borowski [Sun, 15 Oct 2017 19:20:34 +0000 (19:20 +0000)]
Use the proper type for rlim_t, fixing build failure on x32.
Assuming a particular width of a type (or equivalence with "long") doesn't
work everywhere. On new architectures, LFS/etc is enabled by default,
making rlim_t same as rlim64_t even if long is only 32-bit.
Not sure how you handle too big values -- you may want to re-check the
strtoull part.
Signed-off-by: Adam Borowski <kilobyte@angband.pl>
The kernel only allows 4k writes to most files in /proc including {g,u}id_map
so let's not try to write partial mappings. (This will obviously become a lot
more relevant when my patch to extend the idmap limit in the kernel is merged.)
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
This adds set_running_config_item() which is the analogue of
get_running_config_item(). In essence it allows a caller to livepatch the
container's in-memory configuration. This POC is severly limited. Here are the
most obvious ones:
- Only the container's in-memory config can be updated but no further actions
(e.g. on-disk actions) are made.
- Only keys in the "lxc.net." namespace can be changed. This POC also allows
updating an existing network. For example it allows to change the network
type of an existing network. This is obviously nonsense and in a non-POC
implementation this should be blocked.
Use Case:
Callers can hotplug a new network for the container. For example, LXD can
create a pair of veth devices in the host and in the container and add it to
the container's in-memory config. This means, the container can later be
queried for the name of the device later on etc. Note that liblxc will
currently not delete hotplugged network devices on container shutdown since it
won't have the ifindex of the container.
Relates to https://github.com/lxc/lxd/issues/3920 .
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
With the release LXC 2.1 we started warning users who use LXC through the API
and users who use LXC through the tools equally about updating their config.
This quickly got confusing and annoying to API users who e.g. generate configs
on the fly (e.g. LXD). So instead of unconditionally warning users we make this
opt-in. If LXC detects that the env variable LXC_UPDATE_CONFIG_FORMAT is set
then it will warn the user if any legacy configuration keys are present. If it
is not set however, it will not warn the user. This is ok, since the log will
still log WARN()s for all legacy configuration keys.
The tools will all set LXC_UPDATE_CONFIG_FORMAT since it is very much required
that users update to the new configuration format pre-LXC 3.0.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Serge Hallyn [Wed, 4 Oct 2017 05:14:00 +0000 (05:14 +0000)]
implement lxc_string_split_quoted
lxc_string_split_quoted() splits a string on spaces, but keeps
groups in single or double qoutes together. In other words,
generally what we'd want for argv behavior.
Switch lxc-execute to use this for lxc.execute.cmd.
Switch lxc-oci template to put the lxc.execute.cmd inside single
quotes, because parse_line() will eat those. If we don't do that,
then if we have lxc.execute.cmd = /bin/echo "hello, world", then the
last double quote will disappear.
We need to clear any ifindeces we recorded so liblxc won't have cached stale
data which would cause it to fail on reboot we're we don't re-read the on-disk
config file.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
echo "starting" > /tmp/debug
ip link add host1 type veth peer name peer1
ip link set host1 master lxcbr0
ip link set host1 up
ip link set peer1 netns "${LXC_PID}"
=================================
The nic 'peer1' was placed into the container as expected.
For this to work, we pass the container init's pid as LXC_PID in
an environment variable, since lxc-info cannot work at that point.