Petr Malat [Mon, 19 Jul 2021 10:28:45 +0000 (12:28 +0200)]
bpf: bpf_devices_cgroup_supported() should check if bpf() is available
bpf_devices_cgroup_supported() tries to load a simple BPF program to
test if BPF works. This is problematic because the function used to load
the program - bpf_program_load_kernel() - emits an error to the log if
BPF is not enabled in the kernel, even when the device controller is not
requested in the configuration. Users could interpret that as a problem.
Make bpf_devices_cgroup_supported() check if the BPF syscall is available
before calling bpf_program_load_kernel(). We can do this by passing a NULL
pointer as the syscall argument: the kernel returns ENOSYS when the
syscall is not implemented and EFAULT when it is implemented.
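The probe described above can be sketched as follows. This is a minimal illustration, not the code from the commit: the helper names bpf_errno_means_available() and bpf_syscall_available() are hypothetical, and the sketch assumes that any errno other than ENOSYS means the kernel recognises the syscall.

```c
#include <errno.h>
#include <linux/bpf.h>
#include <stdbool.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: interpret the errno left behind by the probe.
 * ENOSYS means the syscall is not implemented; anything else (EFAULT in
 * the normal case) means the kernel at least knows about bpf(). */
static bool bpf_errno_means_available(int err)
{
	return err != ENOSYS;
}

/* Hypothetical helper: probe for bpf() without loading a program. */
static bool bpf_syscall_available(void)
{
#ifdef __NR_bpf
	/* A NULL attribute pointer can never load anything, so the call is
	 * expected to fail; only the reason for the failure matters. */
	long ret = syscall(__NR_bpf, BPF_PROG_LOAD, NULL,
			   sizeof(union bpf_attr));
	if (ret < 0)
		return bpf_errno_means_available(errno);

	return true; /* not expected with a NULL attribute pointer */
#else
	return false; /* headers do not know the syscall number */
#endif
}
```

A caller would run this check first and only attempt to load the test program when it returns true, so nothing is logged on kernels without BPF.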
Petr Malat [Mon, 19 Jul 2021 19:51:25 +0000 (21:51 +0200)]
lxc_setup_ttys: Handle existing ttyN file without underlying device
If a device file is opened and the underlying device does not exist,
the open call fails with ENXIO, but the path can still be opened with
O_PATH, which is enough for mounting over the device file.
Generalize this idea and use O_PATH for all cases when the file
is there. One still must check for both ENXIO and EEXIST as it's
unspecified what error is reported if multiple error conditions
occur at the same time.
With the changes introduced in: b7b1e3a34ce28b01206c48227930ff83d399e7b6
the hierarchy-struct did not have the path_lim set anymore, which is
needed by setup_limits_legacy (->cg_legacy_set_data->lxc_write_openat)
to actually access the cgroup directory.
The issue can be reproduced with a container config having
```
lxc.cgroup.devices.deny = a
```
(or any lxc.cgroup.devices entry) set on a system booted with
systemd.unified_cgroup_hierarchy=0.
This affects all privileged containers on PVE (due to the default
devices.deny entry).
This is not a fatal error and the fallback codepath is equally safe.
When we use TIOCGPTPEER we're using a stashed fd to the container's
devpts mount's ptmx device and allocating a new fd non-path based
through this ioctl. If this ioctl can't be used we're falling back to
allocating a pts device from the host's devpts mount's ptmx device which
is path-based but is not under control of the container and so that's
safe. The difference is just that the first method gets you a nice
native terminal with all the pleasantries of having tty and friends
working whereas the latter method does not.
Fixes: #3625
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
New versions of LXC always stash a file descriptor for the root of the
cgroup mount at /sys/fs/cgroup and then resolve the current cgroup
parsed from /proc/{1,self}/cgroup relative to that file descriptor. This
doesn't work when the caller's cgroup is mounted over the controllers.
Older versions of LXC simply counted such layouts as having no cgroups
available for delegation at all and moved on provided no cgroup limits
were requested. But mainline LXC would fail such layouts. While I would
argue that failing such layouts is the semantically clean approach we
shouldn't regress users so make mainline LXC treat such cgroup layouts
as having no cgroups available for delegation.
Fixes: #3890
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
conf: improve read-only /sys with read-write /sys/devices/virtual/net
Some tools require /sys/devices/virtual/net to be read-write. At the
same time we want all other parts of /sys to be read-only. To do this we
created a layout where we had a read-only instance of sysfs mounted on
top of a read-write instance of sysfs:
execute: ensure parent is notified about child exec and close all unneeded fds
lxc_container_init() creates the container payload process as its child
so lxc_container_init() itself never really exits and thus the parent
isn't notified about the child exec'ing since the sync file descriptor
is never closed. Make sure it's closed to notify the parent about the
child's exec.
In addition we're currently leaking all file descriptors associated with
the handler into the stub init. Make sure that all file descriptors
other than stderr are closed.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Tycho Andersen [Mon, 28 Jun 2021 14:38:48 +0000 (08:38 -0600)]
execute: don't exec init, call it
Instead of having a statically linked init that we put on the host fs
somewhere via packaging, and then having to either bind mount it in or
detect fexecve() functionality, let's just call it as a library function.
This way we don't have to do any of that.
This also fixes up a bunch of conditions from:
    if (quiet)
        fprintf(stderr, "log message");
to
    if (!quiet)
        fprintf(stderr, "log message");
:)
and it drops all the code for fexecve() detection and bind mounting our
init in, since we no longer need any of that.
A couple other thoughts:
* I left the lxc-init binary in since we ship it, so someone could be using
it outside of the internal uses.
* There are lots of unused arguments to lxc-init (including presumably
--quiet, since nobody noticed the above); those may be part of the API
though and so we don't want to drop them.
Tomasz Blaszczak [Wed, 23 Jun 2021 07:17:05 +0000 (09:17 +0200)]
When an item is added to an array, then the array is realloc()ed (to size+1),
and the item is copied (strdup()) to the array.
Thus, when an item is removed from an array, memory allocated for that item
should be freed, successive items should be left-shifted and the array
realloc()ed again (size-1).
Additional changes:
- If strdup() fails in add_to_array(), then an array should be
realloc()ed again to original size.
- Initialize an array in list_all_containers().
Signed-off-by: Tomasz Blaszczak <tomasz.blaszczak@consult.red>
Tomasz Blaszczak [Fri, 25 Jun 2021 10:04:49 +0000 (12:04 +0200)]
Resize array in remove_from_array() and fix a crash
When an item is added to an array, then the array is realloc()ed (to size+1),
and the item is copied (strdup()) to the array.
Thus, when an item is removed from an array, the allocated memory pointed
to by the item (not the item itself) should be freed, successive items
should be left-shifted and the array realloc()ed again (size-1).
Additional changes:
- Initialize an array in list_all_containers().
Signed-off-by: Tomasz Blaszczak <tomasz.blaszczak@consult.red>
Ruben Jenster [Wed, 2 Jun 2021 14:31:31 +0000 (16:31 +0200)]
Add support for LISTEN_FDS environment variable.
The LISTEN_FDS environment variable defines the number of
file descriptors that should be inherited by the container,
in addition to stdio.
The LISTEN_FDS environment variable is defined in the OCI spec
and used to support socket activation.
Refs #3845
Signed-off-by: Ruben Jenster <r.jenster@drachenfels.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
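A minimal sketch of how such a variable could be interpreted. The helper names listen_fds_count() and highest_inherited_fd() are hypothetical and not taken from the commit; the sketch assumes the passed descriptors start right after stdio, at fd 3.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: number of extra fds announced via LISTEN_FDS.
 * Returns 0 when the variable is unset, empty or not a valid count. */
static int listen_fds_count(void)
{
	const char *val = getenv("LISTEN_FDS");
	char *end = NULL;
	long n;

	if (!val || *val == '\0')
		return 0;

	errno = 0;
	n = strtol(val, &end, 10);
	if (errno != 0 || *end != '\0' || n < 0 || n > INT_MAX - 3)
		return 0;

	return (int)n;
}

/* Highest fd to keep open: stdio (0..2) plus the announced fds (3..). */
static int highest_inherited_fd(void)
{
	return 2 + listen_fds_count();
}
```

With LISTEN_FDS=3, for example, descriptors 0 through 5 would be kept and everything above closed.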
LiFeng [Sat, 12 Jun 2021 06:52:46 +0000 (14:52 +0800)]
string utils: Make sure we don't return uninitialized memory.
The function lxc_string_split_quoted and lxc_string_split_and_trim use
realloc to reduce the memory. But the result may be NULL, in which case
the returned memory will be uninitialized
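One common way to make a shrinking realloc() safe is to keep the original block when the call fails. The sketch below illustrates that pattern with a hypothetical helper shrink_string_array(); it is not claimed to be the fix applied in the commit.

```c
#include <stdlib.h>

/* Hypothetical helper: shrink a NULL-terminated array that holds `used`
 * strings. If realloc() fails, the original block is still valid and is
 * returned unchanged instead of NULL. */
static char **shrink_string_array(char **arr, size_t used)
{
	/* used + 1 keeps room for the terminating NULL entry. */
	char **tmp = realloc(arr, (used + 1) * sizeof(char *));

	return tmp ? tmp : arr;
}
```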