Scott Moser [Fri, 16 Aug 2013 20:47:32 +0000 (16:47 -0400)]
ubuntu-cloud-prep: patch /sbin/start for overlayfs
upstart depends on inotify, and overlayfs does not support inotify.
That means that the following results in 'tgt' not running. tgt is simply
used here as an example of a service that installs an upstart job and
starts it on package install.
lxc-clone -s -B overlayfs -o source-precise-amd64 -n test1
lxc-start -n test1
..
apt-get install tgt
The change here is to modify /sbin/start inside the container so that when
something explicitly tries 'start', it results in an explicit call to
'initctl reload-configuration' so that upstart is aware of the newly
placed job.
Should overlayfs ever gain inotify support, this should still not cause
any harm.
Signed-off-by: Scott Moser <smoser@ubuntu.com> Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Serge Hallyn [Thu, 15 Aug 2013 17:22:26 +0000 (12:22 -0500)]
bdev_create: don't default to btrfs if possible
Ideally it would be great to default to a btrfs subvolume for each new
container created. However, this is not, as we previously thought,
without consequence: 'rsync --one-file-system' will not descend into
btrfs subvolumes. This means that 'lxc-create -B _unset' will cause
different behavior for rsync -vax /var/lib/lxc based on whether that
fs is btrfs or not.
So don't do that. If -B is not specified, use -B dir.
lxc-fedora: New patch for systemd detection and init configuration.
Satoshi Matsumoto certainly had the right idea in spotting a bug in
the lxc-fedora template for systemd detection. Heart was in the right
spot, but the patch was not what we needed.
I've looked the patch code over for systemd support and init/upstart
support and modified the logic appropriately. If /etc/systemd/system
exists, we'll do the right thing by systemd. If /etc/rc.sysinit exists,
we'll do the right thing by init / upstart. If both are installed,
we'll try to accommodate both in case someone is playing games with
the two (I've done this).
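The detection rule described above can be sketched as follows. This is only an illustration in Python, not the template's shell code; the function name detect_init_systems and the labels "systemd" and "sysvinit" are made up for the example, while the two paths are the ones named in this message.

```python
import os


def detect_init_systems(rootfs):
    """Return the init setups that appear to be installed under rootfs.

    Hypothetical helper: the labels are illustrative, the two paths are
    the ones the commit message names.
    """
    found = []
    if os.path.isdir(os.path.join(rootfs, "etc/systemd/system")):
        found.append("systemd")
    if os.path.exists(os.path.join(rootfs, "etc/rc.sysinit")):
        found.append("sysvinit")
    # Both may be present; the caller then configures both.
    return found
```

Returning a list rather than a single value keeps the "both are installed" case explicit for the caller.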
The patch was trivial; it took more time to test it, create some
containers with it and verify them than it did to code it.
Signed-off-by: Michael H. Warfield <mhw@WittsEnd.com> Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Christian Seiler [Tue, 13 Aug 2013 21:04:37 +0000 (23:04 +0200)]
attach: implement remaining options of lxc_attach_set_environment
This patch implements the extra_env and extra_keep options of
lxc_attach_set_environment.
The Python implementation, the C container API and the lxc-attach
utility are able to utilize this feature; lxc-attach has gained two new
command line options for this.
Signed-off-by: Christian Seiler <christian@iwakd.de> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Christian Seiler [Tue, 21 May 2013 12:57:06 +0000 (14:57 +0200)]
python: add attach support
Add methods attach() and attach_wait() to the Python API that give
access to the attach functionality of LXC. Both accept two main
arguments:
1. run: A python function that is executed inside the container
2. payload: (optional) A parameter that will be passed to the python
function
Additionally, the following keyword arguments are supported:
attach_flags: How attach should operate, i.e. whether to attach to
cgroups, whether to drop capabilities, etc. The following
constants are defined as part of the lxc module that may
be OR'd together for this option:
LXC_ATTACH_MOVE_TO_CGROUP
LXC_ATTACH_DROP_CAPABILITIES
LXC_ATTACH_SET_PERSONALITY
LXC_ATTACH_APPARMOR
LXC_ATTACH_REMOUNT_PROC_SYS
LXC_ATTACH_DEFAULT
namespaces: Which namespaces to attach to, as defined as the flags that
may be passed to the clone(2) system call. Note: maybe we
should export these flags too.
personality: The personality of the process, it will be passed to the
personality(2) syscall. Note: maybe we should provide
access to the function that converts arch into
personality.
initial_cwd: The initial working directory after attaching.
uid: The user id after attaching.
gid: The group id after attaching.
env_policy: The environment policy, may be one of:
LXC_ATTACH_KEEP_ENV
LXC_ATTACH_CLEAR_ENV
extra_env_vars: A list (or tuple) of environment variables (in the form
KEY=VALUE) that should be set once attach has
succeeded.
extra_keep_env: A list (or tuple) of names of environment variables
that should be kept regardless of policy.
stdin: A file/socket/... object that should be used as stdin for the
attached process. (If not a standard Python object, it has to
implement the fileno() method and provide a fd as the result.)
stdout, stderr: See stdin.
attach() returns the PID of the attached process, or -1 on failure.
attach_wait() returns the return code of the attached process after
that has finished executing, or -1 on failure. Note that if the exit
status of the process is 255, -1 will also be returned, since attach
failures result in an exit code of 255.
Two default run functions are also provided in the lxc module:
attach_run_command: Runs the specified command
attach_run_shell: Runs a shell in the container
Examples (assuming c is a Container object):
    c.attach_wait(lxc.attach_run_command, 'id')
    c.attach_wait(lxc.attach_run_shell)
    def foo():
        print("Hello World")
        # the following line is important, otherwise the exit code of
        # the attached program will be -1
        # sys.exit(0) will also work
        return 0
    c.attach_wait(foo)
    c.attach_wait(lxc.attach_run_command, ['cat', '/proc/self/cgroup'])
    c.attach_wait(lxc.attach_run_command, ['cat', '/proc/self/cgroup'],
                  attach_flags=(lxc.LXC_ATTACH_DEFAULT &
                                ~lxc.LXC_ATTACH_MOVE_TO_CGROUP))
Note that while it is possible to execute Python code inside the
container by passing a function (see example), it is unwise to import
modules, since there is no guarantee that the Python installation
inside the container is in any way compatible with that outside of it.
If you want to run Python code directly, please import all modules
before attaching and only use them within the container.
Signed-off-by: Christian Seiler <christian@iwakd.de> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
convert_tuple_to_char_pointer_array now also accepts lists and not only
tuples when converting to a C array. Other fixes:
- some checking that it's actually a list/tuple before trying to
convert
- off-by-a-few-bytes allocation error
(sizeof(char *)*n+1 vs. sizeof(char *)*(n+1)/calloc(...))
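To make the size difference concrete, here is a small arithmetic sketch in Python. It only evaluates the two expressions quoted above; the function names are invented for the illustration, and ctypes is used solely to obtain the platform's pointer size.

```python
import ctypes

# sizeof(char *) on the platform running this sketch
PTR_SIZE = ctypes.sizeof(ctypes.c_char_p)


def buggy_size(n):
    """sizeof(char *)*n+1: room for n pointers plus one stray byte."""
    return PTR_SIZE * n + 1


def correct_size(n):
    """sizeof(char *)*(n+1): room for n pointers plus a NULL terminator."""
    return PTR_SIZE * (n + 1)
```

For any n, the buggy expression comes up PTR_SIZE - 1 bytes short of what the NULL-terminated array needs.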
Signed-off-by: Christian Seiler <christian@iwakd.de> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
lxc-attach: Completely rework lxc-attach and move to API function
- Move attach functionality to a completely new API function for
attaching to containers. The API function accepts the name of the
container, the lxcpath, a structure indicating options for attaching
and returns the pid of the attached process. The calling thread may
then use waitpid() or similar to wait for the attached process to
finish. lxc-attach itself is just a simple wrapper around the new
API function.
- Use CLONE_PARENT when creating the attached process from the
intermediate process. This allows the intermediate process to exit
immediately after attach and the original thread may supervise the
attached process directly.
- Since the intermediate process exits quickly, its only job is to
send the original process the pid of the attached process (as seen
from outside the pidns) and exit. This allows us to simplify the
synchronisation logic by quite a bit.
- Use O_CLOEXEC / SOCK_CLOEXEC on (hopefully) all FDs opened in the
main thread by the attach logic so that other threads of the same
program may safely fork+exec off. Also, use shutdown() on the
synchronisation socket, so that if another thread forks off without
exec'ing, the synchronisation will not fail. (Not tested whether
this solves this issue.)
- Instead of directly specifying a program to execute on the API
level, one specifies a callback function and a payload. This allows
code using the API to execute a custom function directly inside the
container without having to execute a program. Two default callbacks
are provided directly, one to execute an arbitrary program, another
to execute a shell. The lxc-attach utility will always use either
one of these default callbacks.
- More fine-grained control of the attached process on the API level
(not implemented in lxc-attach utility yet, some may not be sensible):
* Specify which file descriptors should be stdin/stdout/stderr of
the newly created process. If fds other than 0/1/2 are
specified, they will be dup'd in the attached process (and the
originals closed). This allows e.g. threaded applications to
specify pipes for communication with the attached process
without having to modify its own stdin/stdout/stderr before
running lxc-attach.
* Specify user and group id for the newly attached process.
* Specify initial working directory for the newly attached
process.
* Fine-grained control on whether to do any, all or none of the
following: move attached process into the container's init's
cgroup, drop capabilities of the process, set the process's
personality, load the proper apparmor profile and (for partial
attaches to any but not mount-namespaces) whether to unshare the
mount namespace and remount /sys and /proc. If additional
features (SELinux policy, SMACK policy, ...) are implemented,
flags for those may also be provided.
Signed-off-by: Christian Seiler <christian@iwakd.de> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Serge Hallyn [Sat, 10 Aug 2013 04:47:37 +0000 (23:47 -0500)]
cgroups: rework to handle nested containers with multiple and partial mounts
Currently, if you create a container and use the mountcgroup hook,
you get the /lxc/c1/c1.real cgroup mounted to /. If you then try
to start containers inside that container, lxc can get confused.
This patch addresses that, by accepting that the cgroup as found
in /proc/self/cgroup can be partially hidden by bind mounts.
In this patch:
Add optional 'lxc.cgroup.use' to /etc/lxc/lxc.conf to specify which
mounted cgroup filesystems lxc should use. So far only the cgroup
creation respects this.
Keep separate cgroup information for each cgroup mountpoint. So if
the caller is in devices cgroup /a but cpuset cgroup /b that should
now be ok.
Change how we decide whether to ignore failure to set devices cgroup
settings. Actually look to see if our current cgroup already has the
settings. If not, add them.
Finally, the real reason for this patch: in a nested container,
/proc/self/cgroup says nothing about where under /sys/fs/cgroup you
might find yourself. Handle this by searching for our pid in tasks
files, and keep that info in the cgroup handler.
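The "search for our pid in tasks files" step could look roughly like the sketch below. It is a Python illustration rather than the C code in cgroup.c; the function name find_cgroup_of_pid is hypothetical, and it assumes one cgroup hierarchy mounted at the given mountpoint with a 'tasks' file in each cgroup directory.

```python
import os


def find_cgroup_of_pid(mountpoint, pid):
    """Return the cgroup directory (relative to mountpoint) whose tasks
    file lists pid, or None if no tasks file mentions it.

    Hypothetical helper for illustration only.
    """
    wanted = str(pid)
    for dirpath, _dirnames, filenames in os.walk(mountpoint):
        if "tasks" not in filenames:
            continue
        try:
            with open(os.path.join(dirpath, "tasks")) as f:
                if wanted in f.read().split():
                    return os.path.relpath(dirpath, mountpoint)
        except OSError:
            # Unreadable tasks file: skip it and keep searching.
            continue
    return None
```

A result of "." means the pid sits in the root of the mounted hierarchy.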
Also remove all strdupa from cgroup.c (not android-friendly).
Serge Hallyn [Fri, 9 Aug 2013 19:48:35 +0000 (14:48 -0500)]
add lxc-user-nic
It is meant to be run setuid-root to allow unprivileged users to
tunnel veths from a host bridge to their containers. The program
looks at /etc/lxc/lxc-usernet which has entries of the form
user type bridge number
The type currently must be veth. Whenever lxc-user-nic creates a
nic for a user, it records it in /var/lib/lxc/nics (better location
is needed). That way when a container dies lxc-user-nic can cull
the dead nic from the list.
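As an illustration of the 'user type bridge number' entry format, here is a minimal Python parsing sketch. It is not the C parser in lxc-user-nic; parse_usernet is a hypothetical name, and skipping '#' comments and malformed lines is an assumption, not something this message specifies.

```python
def parse_usernet(text):
    """Parse lxc-usernet style entries: 'user type bridge number'.

    Returns a list of (user, type, bridge, number) tuples. Only the
    'veth' type is accepted, as stated in the commit message.
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # assumption: '#' comments
            continue
        fields = line.split()
        if len(fields) != 4:
            continue  # assumption: malformed lines are ignored
        user, nictype, bridge, number = fields
        if nictype != "veth":
            continue
        try:
            entries.append((user, nictype, bridge, int(number)))
        except ValueError:
            continue
    return entries
```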
The -DISTEST allows lxc-user-nic to be compiled so that it uses
files under /tmp and doesn't actually create the nic, so that
unprivileged users can compile and test the code. lxc-test-usernic
is a script which runs a few tests using lxc-usernic-test, which
is a version of lxc-user-nic compiled with -DISTEST.
The next step, after issues with this code are raised and addressed,
is to have lxc-start, when running unprivileged, call out to
lxc-user-nic (will have to exec so that setuid-root is honored).
On top of my previous unprivileged-creation patchset, that should
allow unprivileged users to create and start useful containers.
Scott Moser [Sat, 10 Aug 2013 09:51:21 +0000 (05:51 -0400)]
ubuntu-cloud-prep: cleanup, fix bug with userdata
--userdata was broken, completely missing an implementation.
This adds that implementation back in, makes 'debug' logic
correct, and then also improves the doc at the top.
Signed-off-by: Scott Moser <smoser@ubuntu.com> Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Franz Pletz [Mon, 12 Aug 2013 12:01:39 +0000 (14:01 +0200)]
lxc-destroy: Fix regular expression for getting rootfs
The `lxc-destroy` script was using a simple `grep` for extracting
`lxc.rootfs` from the lxc config. This regex also matches commented lines
and breaks at least removing btrfs subvolumes if the string `lxc.rootfs`
is mentioned in a comment. Furthermore, due to the unescaped dot in the
regex it would also match other wrong strings like `lxc rootfs`.
This patch modifies the regular expression to correctly match the beginning
of the line plus potential whitespace characters and the string
`lxc.rootfs`.
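The difference between the two patterns can be shown with a short Python sketch. The exact expression used in the script is not quoted here, so STRICT below is an assumed equivalent (line start, optional whitespace, literal `lxc.rootfs`, then `=` and the value); the names NAIVE, STRICT and rootfs_from_config are made up for the example.

```python
import re

# Roughly what a plain grep for lxc.rootfs matches: '.' is any character
# and the match may start anywhere in the line.
NAIVE = re.compile(r"lxc.rootfs")

# Assumed stricter form: anchored, escaped dot, captures the value.
STRICT = re.compile(r"^\s*lxc\.rootfs\s*=\s*(.*\S)\s*$")


def rootfs_from_config(text):
    """Return the first lxc.rootfs value in a config, or None."""
    for line in text.splitlines():
        m = STRICT.match(line)
        if m:
            return m.group(1)
    return None
```

With this form, a commented-out `# lxc.rootfs = ...` line and a stray `lxc rootfs` string are both skipped.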
Signed-off-by: Franz Pletz <fpletz@fnordicwalking.de> Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Scott Moser [Thu, 8 Aug 2013 18:16:59 +0000 (19:16 +0100)]
add a clone hook for ubuntu-cloud images
This makes it possible to specify '--userdata' arguments to 'create' or
to 'clone'. So now, the following means very fast start of instances with
different user-data.
Also present here is
* an improvement to the static list of Ubuntu releases. It uses
ubuntu-distro-info if available and falls back to a static list on failure.
* moving of the replacement variables to the top of the create template. This
is just to make it more obvious what is being replaced and put them in a
single location.
Stéphane Graber [Fri, 9 Aug 2013 09:32:55 +0000 (11:32 +0200)]
Replace mktemp() by a new mkifname()
Using mktemp() leads to build time warnings and isn't actually
appropriate for what we want to do as it's checking for the existence of
a file and not a network interface.
Replace those calls by an equivalent mkifname() function which uses the
same template as mktemp but instead checks for existing network
interfaces.
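The idea can be sketched in Python as below. This is not the C implementation: the signature, the retry limit and the character set are assumptions, and the set of existing interface names is passed in by the caller so the sketch stays self-contained.

```python
import random
import string


def mkifname(template, existing, attempts=1000):
    """Fill the trailing X's of template with random characters until
    the result is not an existing interface name.

    Sketch only: 'existing' is any collection of names already in use.
    """
    stem = template.rstrip("X")
    n_random = len(template) - len(stem)
    if n_random == 0:
        raise ValueError("template must end in at least one 'X'")
    alphabet = string.ascii_letters + string.digits
    for _attempt in range(attempts):
        suffix = "".join(random.choice(alphabet) for _i in range(n_random))
        name = stem + suffix
        if name not in existing:
            return name
    raise RuntimeError("no free interface name found")
```

On Linux the existing names could be collected with `{name for _idx, name in socket.if_nameindex()}`, for example `mkifname("vethXXXXXX", names)`.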
Serge Hallyn [Tue, 6 Aug 2013 19:56:48 +0000 (14:56 -0500)]
Logging: don't confuse command line and config file specified values
Currently if loglevel/logfile are specified on command line in a
program using LXC api, and that program does any
container->save_config(), then the new config will be saved with the
loglevel/logfile specified on command line. This is wrong, especially
in the case of
cat > lxc.conf << EOF
lxc.logfile=a
EOF
lxc-create -t cirros -n c1 -o b
which will result in a container config with lxc.logfile=b.
Serge Hallyn [Mon, 5 Aug 2013 20:20:29 +0000 (15:20 -0500)]
lxc-clone: don't s/oldname/newname in the config file and hooks
1. container hooks should use lxcpath and lxcname from the environment.
2. the utsname now gets separately updated
3. the rootfs path gets updated by the bdev backend.
4. the fstab mount targets should be relative
5. the fstab source directories could be separately updated if needed.
This leaves one definite bug: the lxc.logfile does not get updated.
This made me wonder why it was in the configuration file to begin with.
Digging deeper, I realized that whatever '-o outfile' you give
lxc-create gets set in log.c and gets used by the lxc_container object
we create at write_config(). So if you say
lxc-create -t cirros -n c1 -o /tmp/out1
then /var/lib/lxc/c1/config will have lxc.logfile=/tmp/out1 - which is
clearly wrong. Therefore I leave fixing that for later.
I'm looking for candidates for $p/$n expansion. Note we can't expand
these at config_utsname() etc, because then lxc-clone would see the
expanded variable. So we want to read $p/$n verbatim at config_*(),
and expand them only when they are used. lxc.logfile is an obvious
good use case. lxc.utsname can do it too, in case you want container
c1 to be called "c1-whatever". I'm not sure that's worth it though.
Are there any others, or is that it?
It uses the newuidmap and newgidmap program to start a shell in
a mapped user namespace. While newuidmap and newgidmap are
setuid-root, lxc-usernsexec is not.
If new{ug}idmap are not available, then this program is not
built or installed. Otherwise, it will be used to support creating,
starting, destroying, etc containers by unprivileged users using
their authorized subuids and subgids.
Example:
usernsexec -m u:0:100000:1 -- /bin/bash
will, if the user is authorized to use subuid 100000, start a
bash shell in a user namespace where 100000 on the host is
mapped to root in the namespace, and the shell is running as
(privileged) root.
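How a mapping such as u:0:100000:1 reads can be sketched in Python. The field order (type, id inside the namespace, id on the host, range) is inferred from the example above; accepting 'g' for group maps and the names IdMap, parse_map and to_host are assumptions made for this illustration.

```python
from collections import namedtuple

IdMap = namedtuple("IdMap", "kind ns_id host_id count")


def parse_map(spec):
    """Parse 'u:0:100000:1' into an IdMap (field order as inferred)."""
    parts = spec.split(":")
    if len(parts) != 4:
        raise ValueError("expected type:nsid:hostid:range")
    kind, ns_id, host_id, count = parts
    if kind not in ("u", "g"):  # assumption: 'g' denotes a gid map
        raise ValueError("unknown map type: %r" % kind)
    return IdMap(kind, int(ns_id), int(host_id), int(count))


def to_host(idmap, ns_id):
    """Translate an id inside the namespace to the host id, or None."""
    if not idmap.ns_id <= ns_id < idmap.ns_id + idmap.count:
        return None
    return idmap.host_id + (ns_id - idmap.ns_id)
```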
lxclock: use XDG_RUNTIME_DIR for lock if appropriate (v2)
If we are euid==0 or XDG_RUNTIME_DIR is not set, then use
/run/lock/lxc/$lxcpath/$lxcname as before. Otherwise,
use $XDG_RUNTIME_DIR/lock/lxc/$lxcpath/$lxcname.
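The selection rule reads roughly as follows in Python. This is an illustration, not the code in lxclock.c; lock_path is a hypothetical name, an empty XDG_RUNTIME_DIR is treated as unset (an assumption), and leading slashes of lxcpath are stripped only to keep the joined path tidy.

```python
import os


def lock_path(lxcpath, lxcname, euid=None, environ=None):
    """Pick the lock location for a container.

    euid and environ default to the current process; they are
    parameters so the rule can be exercised without being root.
    """
    euid = os.geteuid() if euid is None else euid
    environ = os.environ if environ is None else environ
    runtime = environ.get("XDG_RUNTIME_DIR")
    if euid == 0 or not runtime:
        base = "/run/lock/lxc"
    else:
        base = runtime.rstrip("/") + "/lock/lxc"
    return "%s/%s/%s" % (base, lxcpath.strip("/"), lxcname)
```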
The lock/subsys/lxc-ubuntu-cloud lock is to protect the tarballs
managed under /var/cache/lxc/cloud-$release. Don't lock if we've
been handed a tarball.
fake device creation
Unprivileged users can't create devices, so bind mount null, tty, urandom
and console from the host.
Changelog:
Jul 22: as Stéphane points out, remove a left-over debug line
Serge Hallyn [Thu, 9 May 2013 01:25:06 +0000 (20:25 -0500)]
lxc-create: support unpriv users
Just make sure we are root if we are asked to deal with something other
than a directory, and make sure we have permission to create the
container in the given lxcpath.
Serge Hallyn [Thu, 9 May 2013 01:15:29 +0000 (20:15 -0500)]
templates: require running as root
Up to now lxc-create ensured that you were running as root. Now the
templates which require root need to do it for themselves. Templates
which do mknod definitely require root.
ubuntu templates: add some kernel filesystems to container fstab
The debugfs, fusectl, and securityfs may not be mounted inside a
non-init userns. But mountall hangs waiting for them to be
mounted. So just pre-mount them using $lxcpath/$name/fstab as
bind mounts, which will prevent mountall from trying to mount
them.
If the kernel doesn't provide them, then the bind mount failure
will be ignored, and mountall in the container will proceed
without the mount since it is 'optional'. But without these
bind mounts, starting a container inside a user namespace
hangs.
Otherwise (a) there is a memory leak when using user namespaces and
clearing a config, and (b) saving a container configuration file doesn't
maintain the userns mapping. For instance, if container c1 has
lxc.id_map configuration entries, then
lxc_create: prepend pretty header to config file (v2)
Define a sha1sum_file() function in utils.c. Use that in lxcapi_create
to write out the sha1sum of the template being used. If libgnutls is
not found, then the template sha1sum simply won't be printed into the
container config.
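For readers unfamiliar with the operation, hashing a template file amounts to the following. The sketch uses Python's hashlib instead of the C helper in utils.c; only the function name is borrowed from this message, and the chunk size is arbitrary.

```python
import hashlib


def sha1sum_file(path, chunk_size=65536):
    """Return the hex SHA-1 digest of the file at path."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # Read in chunks so large templates are not loaded at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```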
This patch also trivially fixes some cases where SYSERROR is used after
a fclose (masking errno) and missing consts in mkdir_p.
If set, then fds 0,1,2 will be redirected while the creation
template is executed.
Note, as Dwight has pointed out, if fd 0 is redirected, then if
templates ask for input there will be a problem. We could simply
not redirect fd 0, or we could require that templates work without
interaction. I'm assuming here that we want to do the latter, but
I'm open to changing that.
3.10 kernel comes with proper hierarchical enforcement of devices
cgroup. To keep that code somewhat sane, certain things are not
allowed. Switching from default-allow to default-deny and vice versa
are not allowed when there are children cgroups. (This *could* be
simplified in the kernel by checking that all child cgroups are
unpopulated, but that has not yet been done and may be rejected)
The mountcgroup hook causes lxc-start to break with 3.10 kernels, because
you cannot write 'a' to devices.deny once you have a child cgroup. With
this patch, (a) lxcpath is passed to hooks, (b) the cgroup mount hook sets
the container's devices cgroup, and (c) setup_cgroup() during lxc startup
ignores failures to write to devices subsystem if we are already in a
child of the container's new cgroup.
((a) is not really related to this bug, but is definitely needed.
The followup work of making the other hooks use the passed-in lxcpath
is still to be done)
This hook script updates the hostname in various files under /etc in the
cloned container. In order to do so, the old container name is passed in
the LXC_SRC_NAME environment variable.
lxc-fedora template - Fix retries, use os-release for release, add utsname.
Hey all!
Patch for the Fedora template. Several things...
1) A month or so ago, I floated an idea of adding an option for utsname
which Serge seemed to like but we let it float for more feedback (none
came).
2) In private mail to Serge and Stéphane I mentioned the idea of using
the CPE (Common Platform Enumeration) for host distro and version
identification. I heard back from Serge but not Stéphane. CPE is a
standard promoted by NIST and Mitre (along with CVE and CVSS) as part of
the security community as a common identification mechanism. It's
supported by RedHat based distros and many others (notable exception
Ubuntu). I've patched the Fedora template to parse first
the /etc/os-release file or, alternatively, the /etc/system-release-cpe
file for the distro ID and version instead of the human
readable /etc/redhat-release. There's more that can be done with that
in the realm of cross distro container builds, I suspect.
3) At the time of working on 1&2 I noticed that the retry logic in the
Fedora template just didn't seem right. I believe I posted a message
asking for clarification on that behavior. A recent post in the
-users list indicating that someone could not create a Fedora 19
container (because the release ver string was 19-2 and the template was
only looking for -1) prompted me to rework the retry logic for handling
the mirror list and servers as well as revamp the download logic to
properly identify the correct release package.
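The identification described in point 2 can be sketched as below. This is a Python illustration, not the template's shell code; the function names are invented, the sample strings in the usage note are typical values rather than output captured from a host, and the CPE layout assumed is cpe:/o:vendor:product:version.

```python
def parse_os_release(text):
    """Return (ID, VERSION_ID) from os-release style KEY=value text."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _sep, value = line.partition("=")
        # Values may be wrapped in single or double quotes.
        info[key] = value.strip().strip("\"'")
    return info.get("ID"), info.get("VERSION_ID")


def parse_cpe(text):
    """Return (product, version) from a cpe:/o:vendor:product:version
    string, or (None, None) if it does not have that shape."""
    parts = text.strip().split(":")
    if len(parts) < 5 or parts[0] != "cpe":
        return None, None
    return parts[3], parts[4]
```

For instance, `parse_cpe("cpe:/o:fedoraproject:fedora:19")` yields the product and version fields, which a caller would consult only when os-release is absent.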
The patch for all of the above is attached below the jump. It's been
tested on Fedora 17 through Fedora 19 hosts and has created containers
for F11, F12, F13, F14, F16, F17, F18, and F19. F15 failed for rpm
dependency issues that are not worth fixing (IMHO).
Regards,
Mike
Signed-off-by: Michael H. Warfield <mhw@WittsEnd.com> Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
I noticed that if find_first_wholeword() is called with word at the very
beginning of p, we will deref *(p - 1) to see if it is a word boundary.
Fix by considering p = p0 to be a word boundary.
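The boundary rule can be illustrated with a Python sketch that uses indices instead of pointers. It is not the C function itself: treating only whitespace as a delimiter is an assumption, and the point of the sketch is that position 0 is accepted as a boundary without ever looking at the character before it.

```python
def _is_boundary(ch):
    # None stands for "outside the buffer" (start or end of text).
    return ch is None or ch.isspace()  # assumption: whitespace delimiters


def find_first_wholeword(text, word):
    """Return the index of the first whole-word match, or -1."""
    if not word:
        return -1
    start = 0
    while True:
        i = text.find(word, start)
        if i == -1:
            return -1
        # At i == 0 there is no previous character to inspect.
        before = text[i - 1] if i > 0 else None
        end = i + len(word)
        after = text[end] if end < len(text) else None
        if _is_boundary(before) and _is_boundary(after):
            return i
        start = i + 1
```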
The new openssh uses a different mechanism to start/stop the daemon
which in turn requires a few tweaks in our template to deal with both
the new and old ways of doing that.
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
lxc-start-ephemeral: Fix console() and add storage option
The introduction of the new console() python API broke
lxc-start-ephemeral's console(tty=1) call. It is now changed to
console(), which does the right thing with both API versions.
This also adds a new storage-type option, letting the user choose to use
a standard directory instead of tmpfs for the container (but still have
it ephemeral).
It turns out that most API users want some kind of timeout option for
get_ips, so instead of re-implementing it in every single client
software, let's just have it as a python overlay upstream.
commit 829dd918 added parsing of a -c argument to both the common options
handling and to lxc-start. It is not a common option, and should have only
been added to lxc-start. Because the common code is processing it, no other
command can use -c. Remove -c from being processed by the common code.
Tested that -c still works with lxc-start.
Andrew Gilbert [Thu, 27 Jun 2013 13:09:05 +0000 (08:09 -0500)]
Add -n differentiation to lxc-netstat
lxc-netstat now only processes an -n argument if it has not previously
received a value for $name from --name or -n. If it _has_ received such
a value, it stops processing arguments and leaves the -n for netstat.
This does not apply to the use of --name after a name has been provided
by --name or -n; the current behaviour continues. The new behaviour
makes
netstat -n <container> -n -a
behave like
netstat -n <container> -a -n
which already will act as though there is '--' between '<container>' and
'-a' (see line 91 of lxc-netstat.in).
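The argument handling described above can be approximated by the Python sketch below. The script itself is shell, and this covers only the options discussed here; split_args is a hypothetical name, and the handling of '--' and of unknown options (stop and pass the rest to netstat) is inferred from the description rather than copied from lxc-netstat.in.

```python
def split_args(argv):
    """Split arguments into (container name, arguments left for netstat)."""
    name = None
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg == "--name" and i + 1 < len(argv):
            # --name keeps its behaviour even when a name is already set.
            name = argv[i + 1]
            i += 2
        elif arg == "-n" and name is None and i + 1 < len(argv):
            name = argv[i + 1]
            i += 2
        elif arg == "--":
            i += 1
            break
        else:
            # Includes a second -n once a name is known: leave it for netstat.
            break
    return name, argv[i:]
```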
Signed-off-by: Andrew Gilbert <andrewg800@gmail.com> Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Andrew Gilbert [Thu, 27 Jun 2013 13:07:14 +0000 (08:07 -0500)]
Add double-dash to lxc-netstat re-call arguments
When lxc-netstat was called by lxc-unshare, it would be given the
arguments intended for netstat from the first invocation, but without
anything to separate them from the arguments intended for lxc-netstat.
This meant that netstat arguments like -n would result in lxc-netstat
trying to process them.
Signed-off-by: Andrew Gilbert <andrewg800@gmail.com> Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Serge Hallyn [Fri, 21 Jun 2013 19:15:42 +0000 (14:15 -0500)]
Accept more word delimiters when updating hooks
When updating container names in hook files during a container clone,
we substitute the new container name for the old any time the old name
shows up as a separate word. This patch adds the four characters
'.,_-' as additional delimiters.
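A Python sketch of the substitution with the extended delimiter set follows. It stands in for the C code and assumes the pre-existing delimiters are whitespace, which this message does not state; replace_wholeword is a made-up name.

```python
import re


def replace_wholeword(text, old, new):
    """Replace old with new wherever old stands as a separate word.

    Delimiters: whitespace (assumed) plus the four characters '.,_-'
    named in the commit message; start and end of text also count.
    """
    pattern = re.compile(
        r"(?<![^\s.,_-])" + re.escape(old) + r"(?![^\s.,_-])"
    )
    # A function replacement avoids interpreting backslashes in 'new'.
    return pattern.sub(lambda _m: new, text)
```

With these delimiters, names such as `c1.log` or `my-c1` are updated while `c10` and `xc1` are left alone.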
Serge Hallyn [Tue, 18 Jun 2013 19:52:24 +0000 (14:52 -0500)]
conf.c: always strdup rootfs.mount
The reason is that the generic code which handles reading
lxc.rootfs.mount always frees the old value if not NULL.
So without this, setting lxc.rootfs.mount = /mnt causes a
segfault.
Serge Hallyn [Thu, 13 Jun 2013 15:06:15 +0000 (10:06 -0500)]
don't set up console for lxc-execute
Currently due to some safety checks for !rootfs.path, lxc-execute works
ok if you do not set lxc.rootfs at all in your lxc.conf. But if you
set lxc.rootfs = '/', then it sets up console, and when you do an
lxc-execute, the console appears hung.
However the lxc.rootfs NULL check was just incidental to not dereference
a NULL pointer. In fact we should not be setting up a console if the
container isn't running a full-fledged distro with a getty/login
running on the container's /dev/console.
Have lxc_execute() mark in lxc_conf that this is a lxc-execute and not
an lxc-start, and don't set up the console.
The issue is documented at https://sourceforge.net/p/lxc/bugs/67/ .