Serge Hallyn [Fri, 7 Mar 2014 18:24:27 +0000 (12:24 -0600)]
lxc: manually move NICs back to host after container stops
This prevents things like bridges from being destroyed by the kernel.
My hope is that just doing this will be enough to also ensure that
the device will be available to be renamed immediately, so that
we don't need to do a retry loop.
Tested with a dummy device. renaming dummy0 to dummy5 in container,
then shutting down container, returns dummy0 to the host.
S.Çağlar Onur [Fri, 7 Mar 2014 04:27:05 +0000 (23:27 -0500)]
put shared variables into thread-local storage
This doesn't solve the general design problem of log.c (e.g. some log lines
get lost or scattered into multiple files), but it at least prevents
multithreaded code from crashing.
Before this change, something like the following:
sudo src/tests/lxc-test-concurrent -i 10 -j 20
was crashing nearly all the time due to 3afbcc4600a as we started to
set lxc.loglevel and lxc.logfile with that commit.
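The idea of keeping per-thread copies of formerly shared variables can be sketched as follows. This is a Python illustration rather than the C used in log.c, and the names _log_state, set_log_file, get_log_file and run_workers are hypothetical, not taken from the lxc sources.

```python
import threading

# Hypothetical per-thread logging state; each thread sees its own attributes.
_log_state = threading.local()


def set_log_file(path):
    """Record the log file for the calling thread only."""
    _log_state.path = path


def get_log_file(default="none"):
    """Return the calling thread's log file, or the default if it never set one."""
    return getattr(_log_state, "path", default)


def run_workers(count):
    """Start `count` threads that each set and read back their own log file."""
    results = {}

    def worker(index):
        set_log_file(f"/tmp/container-{index}.log")
        results[index] = get_log_file()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Because every thread writes only to its own copy, concurrent callers cannot overwrite each other's settings, which is the property the change relies on.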
Dwight Engen [Wed, 5 Mar 2014 20:48:39 +0000 (15:48 -0500)]
fix console stdin,stdout,stderr fds
The fds for stdin,stdout,stderr that we were leaving open for /sbin/init
in the container were those from /dev/tty or lxc.console (if given), which
wasn't right. Inside the container it should only have access to the pty
that lxc creates representing the console.
This was noticed because busybox's init was resetting the termio on its
stdin, which was affecting the actual user's terminal instead of the pty.
This meant it was setting icanon, so we were not passing keystrokes
immediately to the pty, and hence command line history/editing wasn't
working.
Fix by dup'ing the console pty to stdin,stdout,stderr just before
exec()ing /sbin/init. Fix fd leak in error handling that I noticed while
going through this code.
Also tested with lxc.console = none, lxc.console = /dev/tty7 and no
lxc.console specified.
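A minimal sketch of the dup step, written in Python rather than the C used by lxc. The function name dup_console and its targets parameter are hypothetical; the defaults stand for stdin, stdout and stderr, and the real code would call exec() right afterwards.

```python
import os


def dup_console(console_fd, targets=(0, 1, 2)):
    """Duplicate console_fd onto every target fd, then close the original
    descriptor unless it is itself one of the targets (avoids an fd leak)."""
    for fd in targets:
        os.dup2(console_fd, fd)
    if console_fd not in targets:
        os.close(console_fd)
```

After the call, the process only reaches the console through the target descriptors, so whatever it does to its stdin applies to the pty and not to the terminal of the user who started the container.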
Serge Hallyn [Tue, 4 Mar 2014 20:54:04 +0000 (14:54 -0600)]
snapshot: fix overlayfs restore
And add a testcase to catch regressions.
Without this patch, restoring a snapshot of an overlayfs based
container fails, because we do not pass in LXC_CLONE_SNAPSHOT,
and overlayfs does not support clone without snapshot.
Serge Hallyn [Tue, 4 Mar 2014 18:18:08 +0000 (12:18 -0600)]
cgmanager: switch to TLS
Drop the thread mutex. Set a (TLS) boolean at container start to
indicate that the connection should be kept open; set it back to false
only when container start is complete. Every cgm_ method opens the
connection if not already open, and closes it if cgm_keep_connection
is false.
Serge Hallyn [Mon, 3 Mar 2014 22:39:00 +0000 (16:39 -0600)]
cgmanager updates
1. remove the cgm_dbus_disconnected handler. We're using a proxy
anyway, and not keeping it around.
2. comment most of the cgm functions to describe when they are called, to
ease locking review
3. the cgmanager mutex is now held for the duration of a connection, from
cgm_dbus_connect to cgm_dbus_disconnect.
3b. so remove the mutex lock/unlock from functions which are called during
container startup with the cgmanager connection already up
4. remove the cgroup_restart(). It's no longer needed since we don't
daemonize while we have the cgmanager socket open.
5. report errors and return early if cgm_dbus_connect() fails
6. don't keep the cgm connection open after cgm_ops_init. I'm a bit torn
on this one as it means that things like lxc-start will always connect
twice. But if we do this there is no good answer, given threaded API
users, on when to drop that initial connection.
7. cgm_unfreeze and nrtasks: grab the dbus connection, as we'll never
have it at that point. (technically I doubt anyone will use
cgmanager and utmp helper on the same host :)
8. lxc_spawn: make sure we only disconnect cgroups if they were already
connected.
Stéphane Graber [Tue, 4 Mar 2014 18:20:10 +0000 (13:20 -0500)]
lxc-ls: Fix support of --nesting for unpriv
This reworks the way lxc-ls works in nesting mode. In the past it'd use
attach_wait's subprocess function to call itself in the container's
namespace, carefully only attaching to the namespaces it needed.
This works great for system containers, but not as soon as you
also need to attach to userns. Instead, this fix moves all of the
container listing code into a get_containers function (hence the massive
diff, sorry), which is then called recursively.
For running containers, the function is called through attach_wait
inside the container's namespace. For stopped containers, the function is
simply called recursively with a base path (the container's rootfs) in an
attempt to find containers that way.
Communication between the parent lxc-ls and the child lxc-ls is done
through a temporary fd and serialized state using json (similar to what
was done using stdout in the previous implementation).
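The parent/child exchange through a temporary fd can be sketched as below. This is an assumption-laden illustration: a plain fork stands in for the attach into the container, and write_state and collect_from_child are hypothetical names, not the functions used by lxc-ls.

```python
import json
import os
import tempfile


def write_state(fd, containers):
    """Child side: serialize the container list as JSON onto the shared fd."""
    os.write(fd, json.dumps(containers).encode("utf-8"))


def collect_from_child(containers):
    """Parent side: fork a child, let it report through a temporary file,
    then read the JSON back once the child has exited."""
    with tempfile.TemporaryFile() as tmp:
        pid = os.fork()
        if pid == 0:
            # In lxc-ls the child would run inside the container's namespaces.
            status = 1
            try:
                write_state(tmp.fileno(), containers)
                status = 0
            finally:
                os._exit(status)
        _, wait_status = os.waitpid(pid, 0)
        if wait_status != 0:
            raise RuntimeError("child failed to report its state")
        tmp.seek(0)
        return json.loads(tmp.read().decode("utf-8"))
```

Using a dedicated fd rather than stdout keeps the serialized state separate from anything else the child might print.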
As get_global_config_item unfortunately caches the values, there's no
easy way to figure out what the lxcpath should be for a root container
when running as non-root, so just use @LXCPATH@ for now and have
python do the parsing itself.
As a result, the following things now work as expected:
- listing nested unprivileged containers (root containers inside unpriv)
- listing nested containers when they're not running
- filtering containers in nesting mode (only the first level is filtered)
- copy with invalid config (used to traceback)
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Natanael Copa [Tue, 4 Mar 2014 09:50:27 +0000 (09:50 +0000)]
lua: respect configure's --prefix
Install lua files under the configured --prefix rather than using
pkg-config's variables LUA_INSTALL_[CL]MOD.
Users will likely want to use --prefix while packagers will use DESTDIR.
Set the default to $datadir/lua/$LUA_VERSION for arch-independent
lua modules and $libdir/lua/$LUA_VERSION for arch-dependent .so modules.
This should work for most distros. If it does not, then packagers
can still do:
make install lualibdir=$(pkg-config lua --variable=INSTALL_CMOD) ...
Serge Hallyn [Mon, 3 Mar 2014 19:57:14 +0000 (13:57 -0600)]
clone: don't set new containers' rootfs to the old
If clone is called from the api, the container object in memory
retains the bad fs. The line is wrong, being a leftover from a
previous attempt before copy_storage was moved earlier.
Stéphane Graber [Mon, 3 Mar 2014 16:31:03 +0000 (11:31 -0500)]
Fix typo I introduced in the bdev change.
When adding the missing return value in Caglar's change (as discussed on
the mailing-list), I set err = -1 instead of ret = -1, causing an
obvious build failure...
Serge Hallyn [Sat, 1 Mar 2014 05:41:12 +0000 (23:41 -0600)]
simpler shared rootfs handling
Only do the funky chroot_into_slave if / is in fact the rootfs.
Rootfs is a special blacklisted case for pivot_root.
If / is not rootfs but is shared, just mount / rslave. We're
already in our own namespace.
This appears to solve the extra /proc/$$/mount entries in
containers and the host directories in lxc-attach which have
been plaguing at least Fedora and Arch.
Serge Hallyn [Fri, 28 Feb 2014 03:49:27 +0000 (21:49 -0600)]
clone: don't ever mark the clone's rootfs as being the old, on disk
Otherwise an interrupted clone can lead to the original rootfs
being deleted.
There is a period during lxcapi_clone during which we have written down
a temporary configuration file on disk, for the new container, using the
old rootfs. Interruption of clone doesn't allow us to do the cleanup we
do in error paths, so a subsequent lxc-destroy removes the old rootfs.
Fix this by doing the copy_storage as early as possible, and not
writing down the rootfs when we write down the temporary configuration
file.
(note - I tested this by putting a series of
'if (strcmp(newname, "u%d") == 0) exit(1)' inline to trigger
interruption between most blocks. If someone has a good idea
for a generic way to regression-test this henceforth that'd be
great)
Serge Hallyn [Fri, 28 Feb 2014 23:50:22 +0000 (17:50 -0600)]
cgmanager: don't stay connected
There are only a few times when we need to be connected to the
cgroup manager:
* when starting a container, from cgm_init until we've set cgroup limits
* when changing a cgroup setting (while running)
* when cleaning up (when shutting down)
* around the cgroup entering at attach
So only connect/disconnect the cgmanager socket on-demand as
needed. This should have a few benefits.
1. Reduce the number of open fds when many containers are running
2. if cgmanager is stopped and restarted, the container
doesn't have to deal with the disconnection.
This is currently RFC. There are a few issues outstanding:
1. the cgm_set and cgm_get may need to be made thread-safe.
2. a non-daemonized start which fails while cgm is connected
will not disconnect.
Stéphane Graber [Wed, 26 Feb 2014 18:00:36 +0000 (13:00 -0500)]
Fix unprivileged containers started by root
This change makes it possible to create unprivileged containers as root.
They will be stored in the usual system wide location, use the usual
system wide cache but will be running using a uid/gid map.
This also updates lxc_usernsexec to use the same function as the rest of
LXC, centralizing all the userns switch in a single function.
That function now detects the presence of newuidmap and newgidmap on the
system. If they are present, they will be used for containers created as
either user or root. If they're not and the user isn't root, an error is
shown. If they're not and the user is root, LXC will directly set the
uid_map and gid_map values.
All that should allow for a consistent experience as well as supporting
distributions that don't yet ship newuidmap/newgidmap.
To make things simpler in the future, a helper function "on_path" is
also introduced and used to detect the presence of newuidmap and
newgidmap.
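The detection logic described above can be sketched as follows. This is a Python illustration, not the C helper from lxc: on_path here only mirrors the name, and pick_idmap_method with its "helpers"/"direct" return values is hypothetical.

```python
import os


def on_path(command, path=None):
    """Return True if an executable named `command` exists in a PATH directory."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, command)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return True
    return False


def pick_idmap_method(is_root, path=None):
    """Decide how the uid/gid maps would be written.

    "helpers": call newuidmap/newgidmap (works for users and root)
    "direct":  root writes uid_map and gid_map itself
    """
    if on_path("newuidmap", path) and on_path("newgidmap", path):
        return "helpers"
    if is_root:
        return "direct"
    raise RuntimeError("newuidmap/newgidmap not found and not running as root")
```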
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Stéphane Graber [Thu, 27 Feb 2014 22:32:39 +0000 (17:32 -0500)]
start: Fix print_top_failing_dir for /var/lib/lxc
In the case where /var/lib/lxc itself was not accessible,
print_top_failing_dir would fail to print the error message.
This fixes it and also changes the initial access check to X_OK instead
of R_OK (to match what we actually need and print_top_failing_dir's own
check).
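A rough Python sketch of the kind of walk involved: check each path component for search (X_OK) permission and report the topmost one that fails, including the final directory itself. The function name top_failing_dir is hypothetical and the real C function prints an error rather than returning a value.

```python
import os


def top_failing_dir(path):
    """Return the topmost component of `path` (possibly `path` itself) that
    cannot be traversed, or None if every component passes the X_OK check."""
    path = os.path.abspath(path)
    current = os.sep
    for part in path.strip(os.sep).split(os.sep):
        current = os.path.join(current, part)
        # X_OK on a directory means "may search it", which is what is
        # needed to reach anything below it; R_OK is not required.
        if not os.access(current, os.X_OK):
            return current
    return None
```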
Stéphane Graber [Thu, 27 Feb 2014 20:46:23 +0000 (15:46 -0500)]
lxc-download: Ignore return code from subshell
The previous change fixed parsing of multiple uid/gid ranges by using a
while loop, however a failure in that loop will cause the script to exit
(due to -e), so we need to ignore the return value of the commands
inside that loop.
Dwight Engen [Wed, 26 Feb 2014 18:54:58 +0000 (13:54 -0500)]
fix attach when cgroups mounted after container start
When booting an OL7 container on OL6, systemd in the OL7 container mounted
some extra cgroup controllers, which are then present in /proc/self/cgroup
of every task on the host. This is the list used by attach to determine
which cgroups to move the attached task into, but when it asks the container
over the command interface for the path to the subsystem this will fail
since the controller didn't exist when the container was first started.
Instead of failing, this change allows the attach to continue, warning that
those cgroups that could not be found won't be attached to.
The problem can be more simply reproduced by starting a busybox container,
mounting a cgroup that was not previously mounted, and then attempting
to attach to the busybox container.
The problem will likely not manifest with cgmanager since it only requests
the path for the first controller, which is likely to always be mounted.
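The warn-and-continue behaviour can be illustrated with a small Python sketch. It parses text in the /proc/<pid>/cgroup format (hierarchy-ID:controller-list:path) and asks a lookup callback for each controller's path in the container; cgroups_to_enter and the lookup_path callback are hypothetical stand-ins for the attach code and the command interface.

```python
def parse_proc_cgroup(text):
    """Return the controller field of every line in /proc/<pid>/cgroup text."""
    controllers = []
    for line in text.splitlines():
        fields = line.split(":", 2)
        if len(fields) == 3 and fields[1]:
            controllers.append(fields[1])
    return controllers


def cgroups_to_enter(proc_cgroup_text, lookup_path):
    """Collect the cgroup paths to attach to.

    lookup_path(controller) returns the container's path for that
    controller, or None if the container does not know it (for example
    because it was mounted after the container started).
    """
    found, warnings = {}, []
    for controller in parse_proc_cgroup(proc_cgroup_text):
        path = lookup_path(controller)
        if path is None:
            # Previously this was fatal; now it is only a warning.
            warnings.append(f"cgroup '{controller}' not found, not attaching to it")
            continue
        found[controller] = path
    return found, warnings
```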
Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Stéphane Graber [Wed, 26 Feb 2014 19:15:27 +0000 (14:15 -0500)]
lxc-download: Detect unpriv created by real root
This adds yet another case in the in_userns function detecting the case
where an unprivileged container is created by the real uid 0, in which
case we want to share the system wide cache but still use the
unprivileged templates and unpack method.
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Stéphane Graber [Wed, 26 Feb 2014 00:15:28 +0000 (19:15 -0500)]
upstart: Don't forward requests for LXC_DOMAIN
Without this change, a request to *.LXC_DOMAIN that doesn't get a local
result from dnsmasq will be forwarded to its upstream server with the
potential of a loop.
Thanks to Ed for the patch on Launchpad (LP: #1246094).
Reported-by: Ed Swierk
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
Stéphane Graber [Tue, 25 Feb 2014 20:50:44 +0000 (15:50 -0500)]
python3: Add support for wlan device add
With this change it's now possible to add wlan devices to the container.
This will track down the right phy device, move it to the right
namespace (we don't care about its name), then if the user asked for a
new device name for the actual interface, we attach to the container and
rename the interface in there using attach.
I have tested this to work with both Intel and Atheros NICs.
This patch is based on the one provided to lxc-devel by Gregor Beck and
has then been updated to do the device renaming as well as minor code
style changes. Thanks!
Reported-by: Gregor Beck <gbeck@sernet.de>
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Serge Hallyn [Tue, 25 Feb 2014 05:08:26 +0000 (23:08 -0600)]
always check whether rootfs is shared
(this expands on Dwight's recent patch, commit c597baa8f9)
After unshare(CLONE_NEWNS) and before doing any mounting, always
check whether rootfs is shared. Otherwise template runs or clone
scripts can bleed mount activity to the host.
Serge Hallyn [Fri, 21 Feb 2014 20:36:06 +0000 (14:36 -0600)]
add dir support
It used to be supported with the lxc-create.in script, and
the manpage says it's supported... So let's just support it.
Now
sudo lxc-create -t download --dir /opt/ab -n ab
works, creating the container rootfs under /opt/ab. This
generally isn't something I'd recommend, however telling users
to use a different lxc-path isn't as friendly as I'd like,
because each lxcpath requires separate lxc-ls and lxc-autostart
runs.
Dwight Engen [Wed, 19 Feb 2014 21:44:19 +0000 (16:44 -0500)]
fix mounts not propagating back to root mntns during create and clone
Systems based on systemd mount the root shared by default. We don't want
mounts done during creation by templates nor those done internally by
bdev during rsync based clones to propagate to the root mntns.
The create case already had the right check, but the mount call was
missing "/", so it was failing.
Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Stéphane Graber [Tue, 18 Feb 2014 22:33:51 +0000 (17:33 -0500)]
Set a reasonable fallback for get_rundir
If get_rundir can't find XDG_RUNTIME_DIR in the environment, it'll
attempt to build a path using ~/.cache/lxc/run/. Should that fail
because of missing $HOME in the environment, it'll then return NULL and
all callers will fail in that case.
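The lookup order described here can be sketched in Python as below. This mirrors only what the message states (XDG_RUNTIME_DIR, then a path under ~/.cache/lxc/run/, then nothing); the environ parameter is a hypothetical addition to make the sketch testable, and None plays the role of NULL.

```python
import os


def get_rundir(environ=None):
    """Return the runtime directory, or None if it cannot be determined."""
    if environ is None:
        environ = os.environ
    rundir = environ.get("XDG_RUNTIME_DIR")
    if rundir:
        return rundir
    home = environ.get("HOME")
    if not home:
        # No XDG_RUNTIME_DIR and no HOME: callers have to handle this.
        return None
    return os.path.join(home, ".cache", "lxc", "run") + os.sep
```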
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Serge Hallyn [Tue, 18 Feb 2014 21:12:52 +0000 (15:12 -0600)]
Fix unprivileged networking
If we are unprivileged and have asked for a veth device, then create
a pipe over which to pass the veth names.
Network-related todos:
1. set mtu on the container side of veth device
2. set mtu in lxc-user-nic. Note that this probably requires an
update to the /etc/lxc/lxc-usernet file :(
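Passing interface names over a pipe can be sketched as follows. This is an illustration under assumptions: a small Python one-liner stands in for the privileged helper, and the colon-separated "name1:name2" output format, the function read_nic_names and the FAKE_HELPER command are all hypothetical.

```python
import subprocess
import sys


def read_nic_names(helper_argv):
    """Run a helper with a pipe on its stdout and parse 'name1:name2' from it."""
    proc = subprocess.run(helper_argv, stdout=subprocess.PIPE, check=True)
    fields = proc.stdout.decode("utf-8").strip().split(":")
    if len(fields) != 2 or not all(fields):
        raise ValueError("unexpected helper output")
    return fields[0], fields[1]


# Stand-in for the real helper; it just prints two made-up interface names.
FAKE_HELPER = [sys.executable, "-c", "print('vethAB12CD:eth0')"]
```

The unprivileged caller never creates the device itself; it only learns the resulting names by reading the helper's end of the pipe.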
Serge Hallyn [Tue, 18 Feb 2014 21:01:38 +0000 (15:01 -0600)]
cache whether 'optional' was in mntopts
After commit 4e4ca16158f91ac1271495638a4e62881169474e we are
checking for optional in mntopts after we forcibly remove it.
Cache whether we had it before removing it.
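The ordering problem is easy to show in a short Python sketch: the flag has to be computed before the option is stripped, then consulted when the mount fails. The helper names split_optional and mount_entry and the do_mount callback are hypothetical.

```python
def split_optional(mntopts):
    """Return (had_optional, options with 'optional' removed)."""
    opts = [opt for opt in mntopts.split(",") if opt]
    had_optional = "optional" in opts  # cache this before removing it
    cleaned = ",".join(opt for opt in opts if opt != "optional")
    return had_optional, cleaned


def mount_entry(mntopts, do_mount):
    """Call do_mount(cleaned_opts); tolerate failure only for optional entries."""
    had_optional, cleaned = split_optional(mntopts)
    try:
        do_mount(cleaned)
        return True
    except OSError:
        if had_optional:
            return False  # optional mount: failure is not fatal
        raise
```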
Serge Hallyn [Mon, 17 Feb 2014 18:47:35 +0000 (12:47 -0600)]
attach: try to use the container's seccomp policy
We can't get the actual policy (in the case where the policy file
has changed) from the container, but at least we can use the
seccomp policy file listed in the container config file.
(If anyone wants to further improve this, it may be better to get
the seccomp policy over the cmd api; not sure that's what we want,
and this seems simpler to hook into the existing code, so I went
this way for now)
Stéphane Graber [Mon, 17 Feb 2014 15:51:53 +0000 (10:51 -0500)]
download: Support nested containers in unpriv
This adds detection for the case where we are root in an unprivileged
container and then run LXC from there. In this case, we want to download
to the system location, ignore the missing uid/gid ranges and run
templates that are userns-ready.
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
S.Çağlar Onur [Sun, 16 Feb 2014 21:20:48 +0000 (16:20 -0500)]
fill missing netdev fields for unprivileged containers
lxc-user-nic now returns the names of the interfaces and
unpriv_assign_nic function parses that information to fill
missing netdev->veth_attr.pair and netdev->name.
With this patch get_running_config_item started to provide
correct information;