In the past, if the console client exited, lxc_console_cb_con returned 1 and
lxc_poll exited, leaving the process waiting in waitpid. At that point the
process could no longer handle any commands (for example, getting the container
state with LXC_CMD_GET_STATE or stopping the container with LXC_CMD_STOP).
I think we should clean up the tty_state and return 0 in this case. That way we
can still use lxc-console to connect to the container's console, we do not
exit lxc_poll, and we can keep handling commands via lxc_cmd_process.
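A minimal sketch of the idea, not the actual liblxc code: the struct and the
function name below are hypothetical stand-ins, and the only assumption about
the event loop is that a handler returning 0 keeps it running while a non-zero
return makes it exit.

```c
#include <unistd.h>

/* Hypothetical stand-in for the console peer state; not liblxc's real struct. */
struct tty_state_sketch {
	int peer_fd; /* fd of the connected console client, -1 if none */
	int busy;    /* non-zero while a client holds the console */
};

/*
 * Assumed handler convention: returning 0 keeps the event loop running,
 * returning non-zero makes the loop exit.
 */
static int console_client_cb(int fd, struct tty_state_sketch *ts)
{
	char buf[1024];
	ssize_t n = read(fd, buf, sizeof(buf));

	if (n <= 0) {
		/*
		 * EOF or error: the client went away. Drop its state instead
		 * of ending the loop, so commands can still be served and a
		 * new client can attach later.
		 */
		close(fd);
		ts->peer_fd = -1;
		ts->busy = 0;
		return 0;
	}

	/* Real code would forward the n bytes to the container's console. */
	return 0;
}
```

The point is only the return value on the hang-up path: the state is reset and
the loop is left alone.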
Reproducer prior to this commit:
- open a new terminal and get its tty device name with the tty command, e.g. /dev/pts/6
- set lxc.console.path = /dev/pts/6
- start the container; the output will be printed to /dev/pts/6
- close /dev/pts/6
- try an operation, e.g. getting the state with lxc-ls; lxc-ls will hang
Closes #1787.
Signed-off-by: LiFeng <lifeng68@huawei.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
A bit of context:
userns_exec_1() is only used to operate based on privileges for the user's own
{g,u}id on the host and for the container root's unmapped {g,u}id. This means
we only need to establish a mapping from:
- the container root {g,u}id as seen from the host -> user's host {g,u}id
- the container root -> some sub{g,u}id
This function however was buggy. It relied on some pointer pointing to the same
memory, namely specific idmap entries in the idmap list in the container's
in-memory configuration. However, due to a stupid mistake of mine, the pointers
to be compared pointed to freshly allocated memory. They were never pointing to
the intended memory locations. To reproduce what I'm talking about prior to
this commit simply place:
We allocate pty {master,slave} file descriptors in the child's namespaces after
we have set up devpts. After we have sent the pty file descriptors to the parent
and set up the pty file descriptors under /dev/tty*, and before we exec the init
binary, we need to delete these file descriptors in the child. However, one of
my commits made the deletion occur before setting up the file descriptors under
/dev/tty*. This caused failures when trying to attach to the container's ttys
since they weren't actually configured although the file descriptors were
available in the in-memory configuration of the parent.
This commit reworks setting up ttys such that deletion occurs after all setup
has been performed. The commit is actually minimal but also needs to move all
the functions into one place since they will now be called from
"lxc_create_ttys()".
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
I thought we could send all ttys at once but this limits the number of ttys
users can use because of iovec_len restrictions. So let's send them in batches
of 2.
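For illustration, batching file descriptors over a unix socket with SCM_RIGHTS
could look roughly like this. This is a sketch under assumptions, not the
helper liblxc uses: send_fd_batch(), send_all_ttys() and TTY_BATCH are
hypothetical names, and one dummy payload byte is sent with each batch.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define TTY_BATCH 2 /* fds per message, matching the batch size above */

/* Send 1..TTY_BATCH fds in one message. Returns 0 on success, -1 on error. */
static int send_fd_batch(int sock, const int *fds, int nfds)
{
	/* The union keeps the control buffer aligned for struct cmsghdr. */
	union {
		struct cmsghdr align;
		char buf[CMSG_SPACE(sizeof(int) * TTY_BATCH)];
	} ctrl;
	char dummy = 0;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	struct msghdr msg;
	struct cmsghdr *cmsg;

	if (nfds < 1 || nfds > TTY_BATCH)
		return -1;

	memset(&msg, 0, sizeof(msg));
	memset(&ctrl, 0, sizeof(ctrl));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = CMSG_SPACE(sizeof(int) * nfds);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int) * nfds);
	memcpy(CMSG_DATA(cmsg), fds, sizeof(int) * nfds);

	return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}

/* Send all fds, TTY_BATCH at a time, so no single message grows with total. */
static int send_all_ttys(int sock, const int *fds, int total)
{
	for (int i = 0; i < total; i += TTY_BATCH) {
		int n = total - i < TTY_BATCH ? total - i : TTY_BATCH;

		if (send_fd_batch(sock, fds + i, n) < 0)
			return -1;
	}
	return 0;
}
```

The receiver would mirror this with one recvmsg() per batch, so the number of
ttys is no longer bounded by what fits into a single message.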
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Since find_line() was changed earlier, count_entries() started counting lines
wrong. It would report that the maximum had been reached before you actually
reached your allotted maximum.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
We use data_sock for all things we need to send around between parent and child
now. It doesn't make sense to have so many different pipes and sockets if one
will do just fine.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
templates/ubuntu: support netplan in newer releases by default
If netplan is present in the container, configure default networking
with netplan instead of ifupdown. Also, do not install ifupdown when
bootstrapping the minbase variant, unless using currently supported
non-netplan releases (trusty, xenial, zesty).
Signed-off-by: Dimitri John Ledkov <xnox@ubuntu.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
network: stop recording saved physical net devices
liblxc will now correctly log any network device names and ifindices in their
respective network namespaces. So there's no need to record physical network
devices any more. This spares us heap allocations and memory we would need to
keep around until the container is shut down.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
On privileged network creation we only retrieved the names and ifindices of
network devices in the host's network namespace. This meant that the monitor
process was acting on possibly incorrect information. With this commit we have
the child send back the correct device names and ifindices in the container's
network namespace.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
- Retrieve the host's veth device ifindex in the host's network namespace.
- Add a note why we retrieve the container's veth device ifindex in the host's
network namespace.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
- On unprivileged veth network creation have lxc-user-nic send the names of the
veth devices and their respective ifindices. The advantage of retrieving this
information from lxc-user-nic is that we spare ourselves sending around more
stuff via the netpipe in start.c. Also, lxc-user-nic operates in both
namespaces (the container's namespace and the host's namespace) via setns and
so is guaranteed to retrieve the correct ifindex via if_nametoindex(), which is
a network namespace aware ioctl() call. While I'm pretty sure the ifindices for
veth devices are identical across network namespaces, I'm wary of relying on
this. We need the ifindices to guarantee safe deletion of unprivileged
network devices via lxc-user-nic later on since we use them to identify the
network devices in their corresponding network namespaces.
- Move the network device logging from the child to the parent. The child does
not have all of the information about the network devices available, only the
few bits it actually needs to know. The monitor process is the only process
that needs all this information.
- The network creation code for privileged and unprivileged networks was
previously mangled into one single function but at the same time some of the
privileged code had additional functions that were called in other places in
start.c. Let's divide and conquer and split out the privileged and
unprivileged network creation into completely separate functions. This makes
what's happening way more clear. This will also have no performance impact
since either you are privileged and only execute the privileged network
creation functions or you are unprivileged and only execute the unprivileged
network creation functions.
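As a small illustration of the first point above, looking up an ifindex in
whatever network namespace the caller currently sits in can be sketched as
follows. lookup_ifindex() is a hypothetical wrapper name; if_nametoindex() is
the only real interface used.

```c
#include <errno.h>
#include <net/if.h>

/*
 * Resolve a device name to its ifindex in the caller's current network
 * namespace. Returns 0 and stores the index on success, a negative errno
 * value on failure.
 */
static int lookup_ifindex(const char *name, unsigned int *ifindex)
{
	unsigned int idx;

	errno = 0;
	idx = if_nametoindex(name);
	if (idx == 0)
		return errno ? -errno : -ENODEV;

	*ifindex = idx;
	return 0;
}
```

Calling this once before and once after setns() into the container's network
namespace yields the index as seen from each side, without assuming the two
values are equal.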
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
We should not just record the ifindex for the container's veth device but also
for the host's veth device. This is useful when {configuring,deconfiguring}
veth devices and becomes crucial when calling our lxc-user-nic setuid helper
where we rely on the ifindex to make decisions about whether we are licensed to
perform certain operations on the veth device in question.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
If the user specified the lxc.net.[i].veth.pair attribute to request that the
host side of a veth pair be given a specific name, let's log it at the trace
level. Otherwise, if the user did not specify lxc.net.[i].veth.pair,
veth_attr.veth1 will contain the name of the host side veth device.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
When lxc-user-nic is called with the "delete" subcommand we need to make sure
that we are actually privileged over the network namespace for which we are
supposed to delete devices on the host. To this end we require that the path to
the affected network namespace is passed. We then setns() to the network
namespace and drop privilege to the caller's real user id. Then we try to delete
the loopback interface, which is never possible. If we are privileged over the
network namespace this operation will fail with ENOTSUP. If we are not
privileged over the network namespace we will get EPERM.
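The decision taken from that failed delete can be sketched like this. The
function name is hypothetical, and the mapping simply restates the two errno
values described above; anything else is treated as inconclusive.

```c
#include <errno.h>

/*
 * Interpret the (positive) errno from the failed attempt to delete "lo".
 * Returns 1 if the caller is privileged over the network namespace,
 * 0 if it is not, and -1 if the result does not tell us either way.
 */
static int netns_privilege_from_errno(int err)
{
	if (err == ENOTSUP)
		return 1; /* allowed to try, the kernel just refuses to remove lo */
	if (err == EPERM)
		return 0; /* not allowed to touch devices in this namespace */
	return -1;
}
```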
This is the first part of the commit. As of now nothing guarantees that the
caller does not just give us a random path to a network namespace it is
privileged over.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
This is the cause of the unnecessary extraneous slashes when creating cgroups.
Our lxc.system.conf page also clearly shows "lxc/%n" as example, not "/lxc%n".
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> cgfsng: try to delete parent cgroups
>
> Say we have
>
> lxc.uts.name = c1
> lxc.cgroup.dir = lxd/a/b/c
>
> the path for the container's cgroup would be
>
> lxd/a/b/c/c1
>
> When the container is shutdown we should not just try to delete "c1" we
> should also try to delete "c", "b", "a", and "lxd". This is to ensure
> that we don't leave empty cgroups around thereby increasing the chance
> that we run into trouble with cgroup limits. The algorithm for this isn't
> too costly since we can simply stop walking upwards at the first rmdir()
> failure.
The algorithm employs recursive_destroy() which opens each directory
specified in lxc.cgroup.dir and tries to delete each directory within that
directory. For example, assume "/sys/fs/cgroup/memory/lxd/a/b/c" only
contains the cgroup "c1" for container "c1". Assume that "c1" calls
recursive_destroy() to clean up its cgroups. It will first delete "c1" and
anything underneath it. This is perfectly fine since anything underneath
that cgroup is under its control. The new algorithm will then tell it to
"recurse upwards". So recursive_destroy() will try to delete
"/sys/fs/cgroup/memory/lxd/a/b/c" next. Now assume that a second container
"c2" has "lxc.cgroup.dir = lxd/a/b/c" set in its config file and calls
cgroup_create(). This will create the *empty* cgroup
"/sys/fs/cgroup/memory/lxd/a/b/c/c2". Now assume that after "c2" has been
created, container "c1"'s call to recursive_destroy() reaches
"/sys/fs/cgroup/memory/lxd/a/b/c/c2" before it is populated. Then the
cgroup "c2" will be removed. Now "c2" calls cgroup_enter() to enter the
cgroup it created. This will fail since c1 deleted the cgroup "c2". (As a
sidenote: this is one of the few race conditions that are actually easy to
describe.)
Possible Solution:
Instead of calling recursive_destroy() on all cgroups specified in
lxc.cgroup.dir we only call recursive_destroy() on the container's own
cgroup "/sys/fs/cgroup/memory/lxd/a/b/c/c1". When we start to recurse
upwards we only call unlinkat(AT_FDCWD, path, AT_REMOVEDIR). This should
avoid the race described above. My argument is as follows. Assume that the
container c1 has created the cgroup "/sys/fs/cgroup/memory/lxd/a/b/c/c1" for
itself. Now c1 calls cgroup_destroy(). First, recursive_destroy() will be
called on the cgroup "c1" which will delete any empty cgroup directories
underneath "c1" and finally "c1" itself. This is fine since everything
under "c1" is container c1's sole property. Now container c1 will call
unlinkat() on "/sys/fs/cgroup/memory/lxd/a/b/c":
- Assume that in the meantime container c2 has created the cgroup
"/sys/fs/cgroup/memory/lxd/a/b/c/c2". Then c1's unlinkat() will fail.
This will stop c1 from recursing upwards. So c2's cgroup_enter() call
will find all its cgroups intact and well. unlinkat() will come with the
appropriate in-kernel locking which will stop it from racing with
mkdir().
- There's still a subtle race left. c2 might be calling an implementation
of mkdir -p to try and create e.g. the cgroup
"/sys/fs/cgroup/memory/lxd/a/b/c". Let's assume "b" exists; then c2 will
receive EEXIST on "b" and move on to create "c". Let's further assume c1
has already deleted "c". c1 will now be able to delete
"/sys/fs/cgroup/memory/lxd/a/b/" and c2's call to create "c" will fail.
The latter subtle race makes me rethink this approach. For now we'll just leave
empty cgroups behind since I don't want to start locking stuff.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
This moves all of the network handling code into network.{c,h}. This makes what
is going on much clearer. Also it's easier to find relevant code if it is all
in one place.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Each occurrence of "lxc.network.type" indicated the definition of a new
network. This syntax is not allowed in newer liblxc instances. Instead, each
network must carry an index. So in new liblxc these two networks would be
translated to:
- lxc-user-nic gains the subcommands {create,delete}
- dup2() STDERR_FILENO as well so that we can show helpful messages in our logs
on failure
- initialize output buffer so that we don't print garbage
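The last two points can be sketched as follows. This is not the code from
start.c: run_and_capture() and failing_helper() are hypothetical, and the child
runs a plain function where the real code would exec lxc-user-nic.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Example child: complains on stderr, as a failing helper would. */
static void failing_helper(void)
{
	static const char msg[] = "helper: something went wrong\n";

	if (write(STDERR_FILENO, msg, sizeof(msg) - 1) < 0)
		_exit(1);
}

/*
 * Fork a child whose stdout and stderr both feed one pipe and collect what it
 * writes. `buf` is zeroed first so a short or empty read never leaves garbage
 * behind. Returns the child's exit status, or -1 on error.
 */
static int run_and_capture(void (*child_fn)(void), char *buf, size_t len)
{
	int pipefd[2], status;
	size_t off = 0;
	ssize_t n;
	pid_t pid;

	if (len == 0 || pipe(pipefd) < 0)
		return -1;
	memset(buf, 0, len);

	pid = fork();
	if (pid < 0) {
		close(pipefd[0]);
		close(pipefd[1]);
		return -1;
	}

	if (pid == 0) {
		close(pipefd[0]);
		/* Redirect both streams so error messages reach the parent. */
		if (dup2(pipefd[1], STDOUT_FILENO) < 0 ||
		    dup2(pipefd[1], STDERR_FILENO) < 0)
			_exit(127);
		if (pipefd[1] > STDERR_FILENO)
			close(pipefd[1]);
		child_fn(); /* real code would exec the helper here */
		_exit(0);
	}

	close(pipefd[1]);
	/* Leave room for the terminating NUL that memset() already put there. */
	while (off < len - 1 &&
	       (n = read(pipefd[0], buf + off, len - 1 - off)) > 0)
		off += (size_t)n;
	close(pipefd[0]);

	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
		return -1;

	return WEXITSTATUS(status);
}
```

After run_and_capture(failing_helper, out, sizeof(out)) the buffer holds the
helper's stderr message, which is what makes the log entry useful on failure.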
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>