seccomp: send caller pidfd along with proxied requests
On the one hand this should close the race between the
process exiting and the proxy reading the request.
On the other hand it'll help the proxy quickly access info
from /proc (such as ./cwd, ./ns/mnt, ...)
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
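A minimal sketch of how a file descriptor such as a pidfd can travel along with a request over a Unix socket, using SCM_RIGHTS ancillary data. The helper names `send_with_fd()` and `recv_with_fd()` are hypothetical and not taken from LXC; the sketch assumes exactly one fd per message.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical helper (not LXC's): send buf plus one fd as SCM_RIGHTS. */
ssize_t send_with_fd(int sock, const void *buf, size_t len, int fd)
{
	union {
		struct cmsghdr align; /* forces correct alignment of buf */
		char buf[CMSG_SPACE(sizeof(int))];
	} ctrl;
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	struct msghdr msg;
	struct cmsghdr *cmsg;

	assert(fd >= 0);
	memset(&ctrl, 0, sizeof(ctrl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, MSG_NOSIGNAL);
}

/* Hypothetical helper: receive a message; *fd is the passed fd or -1. */
ssize_t recv_with_fd(int sock, void *buf, size_t len, int *fd)
{
	union {
		struct cmsghdr align;
		char buf[CMSG_SPACE(sizeof(int))];
	} ctrl;
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct msghdr msg;
	struct cmsghdr *cmsg;
	ssize_t ret;

	*fd = -1;
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	/* MSG_CMSG_CLOEXEC marks the received fd close-on-exec. */
	ret = recvmsg(sock, &msg, MSG_CMSG_CLOEXEC);
	if (ret < 0)
		return ret;

	for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
		if (cmsg->cmsg_level == SOL_SOCKET &&
		    cmsg->cmsg_type == SCM_RIGHTS &&
		    cmsg->cmsg_len == CMSG_LEN(sizeof(int))) {
			memcpy(fd, CMSG_DATA(cmsg), sizeof(int));
			break;
		}
	}
	return ret;
}
```

The receiver gets its own descriptor referring to the same open file, so it stays valid even if the sender closes its copy afterwards.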
We only read the message without the cookie. For now assert
that the sender also didn't try to send more by letting
`recvmsg()` return the original size of the packet if it was
longer.
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
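A sketch of the length check described above, assuming a Linux AF_UNIX SOCK_SEQPACKET socket, where passing MSG_TRUNC makes `recvmsg()` return the packet's real length even if it did not fit the buffer. The name `recv_checked()` and the choice of EMSGSIZE are illustrative assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Hypothetical helper: read one packet of at most `expected` bytes.
 * Fails with EMSGSIZE if the sender put more into the packet.
 */
ssize_t recv_checked(int sock, void *buf, size_t expected)
{
	struct iovec iov = { .iov_base = buf, .iov_len = expected };
	struct msghdr msg;
	ssize_t ret;

	assert(buf != NULL);
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;

	/* With MSG_TRUNC the return value is the original packet size. */
	ret = recvmsg(sock, &msg, MSG_TRUNC);
	if (ret < 0)
		return -1;

	if ((size_t)ret > expected) {
		errno = EMSGSIZE;
		return -1;
	}
	return ret;
}
```

An oversized packet is consumed and its excess bytes are discarded, so the next call starts at the following packet.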
With the previous commit we now attempt to reconnect to the
proxy in the beginning of the notify handler if we had no
connection.
If the connection fails later on, we now don't really need
to immediately try to reconnect if we send a default
response anyway (particularly if the recv() fails). (This
also gives the proxy more time, for instance if it was just
restarted.)
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
If a syscall happens after we already failed to communicate
with the proxy, proxy_fd was -1.
Before the previous commit we'd then be stuck in the state
where there was no proxy registered. With the previous
commit we'd send a default reply and only then try to
reconnect.
Improve this even further by trying to reconnect right at
the start.
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
seccomp: send default response when there's no proxy
Particularly, when there's no proxy registered (iow. none
configured but the seccomp profile still had a 'notify'
rule), we don't want to leave the calling process hanging.
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
If the first sendmsg() fails, try to reconnect once before
failing. Otherwise if a proxy restarts while no syscall
happens, the next syscall always fails with ENOSYS.
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
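The reconnect-once behaviour can be sketched roughly as follows. This is not LXC's code: `send_retry_once()`, the `reconnect_fn` callback type, and the stand-in callback `take_spare_fd()` are hypothetical names, and the sketch uses plain `send()` for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical callback: returns a freshly connected fd, or -1. */
typedef int (*reconnect_fn)(void *data);

/*
 * Make at most two send attempts: reconnect first if there is no
 * connection, and reconnect once more if the first send fails.
 */
ssize_t send_retry_once(int *fd, const void *buf, size_t len,
			reconnect_fn reconnect, void *data)
{
	ssize_t ret;
	int attempt;

	assert(fd != NULL && reconnect != NULL);
	for (attempt = 0; attempt < 2; attempt++) {
		if (*fd < 0)
			*fd = reconnect(data);
		if (*fd < 0)
			break; /* no proxy reachable */

		ret = send(*fd, buf, len, MSG_NOSIGNAL);
		if (ret >= 0)
			return ret;

		/* Stale connection, e.g. the proxy restarted: drop it. */
		close(*fd);
		*fd = -1;
	}
	return -1;
}

/* Stand-in reconnect callback for demonstration: hands out one fd. */
int take_spare_fd(void *data)
{
	int *spare = data;
	int fd = *spare;

	*spare = -1;
	return fd;
}
```

On failure the caller would fall back to the default seccomp response, as described in the commits around this one.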
When we fail to send a message, we send a default seccomp
response and try to reconnect to the proxy. It doesn't
really make much sense to retry to send the request over the
new connection as the syscall has already been answered. The
same goes for receiving the response - after reconnecting to
the proxy, we're a new client to a potentially new proxy
process, so awaiting a response without having sent a
request doesn't make all too much sense either.
In the future we should probably have a timeout or retry
count for the entire proxy _transaction_ before sending a
response to seccomp at all (and probably handle requests
asynchronously).
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
The seccomp notify API has a few variables: The struct sizes
are queried at runtime, and we now also have a user
configured cookie.
This means that with a SOCK_STREAM connection the proxy
needs to carefully read() the right amount of data based on
the contents of our proxy message struct to avoid ending up
in the middle of a packet.
While for now this may not be too tragic, since we currently
only ever send a single packet and then wait for the
response, we may at some point want to be able to handle
multiple processes simultaneously, hence it makes sense to
switch to a packet based connection.
So switch to using SOCK_SEQPACKET, which is packet based
(and also guarantees ordering). The `MSG_PEEK` flag can be
used with `recvmsg()` to figure out a packet's size on the
other end, and usually the size *should* not change after
that for an existing connection from a running container.
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
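On the proxy side, the size probe mentioned above could look like this sketch. It assumes Linux, where `MSG_PEEK` has to be combined with `MSG_TRUNC` for `recvmsg()` to report the full packet length on a SOCK_SEQPACKET socket; `peek_packet_size()` and `recv_whole_packet()` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Return the size of the next packet without consuming it. */
ssize_t peek_packet_size(int sock)
{
	char probe;
	struct iovec iov = { .iov_base = &probe, .iov_len = sizeof(probe) };
	struct msghdr msg;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;

	/* MSG_PEEK leaves the packet queued, MSG_TRUNC reports its size. */
	return recvmsg(sock, &msg, MSG_PEEK | MSG_TRUNC);
}

/* Receive the next packet into a malloc()ed buffer sized by peeking. */
void *recv_whole_packet(int sock, size_t *len)
{
	ssize_t size, got;
	void *buf;

	assert(len != NULL);
	size = peek_packet_size(sock);
	if (size < 0)
		return NULL;

	buf = malloc(size > 0 ? (size_t)size : 1);
	if (!buf)
		return NULL;

	got = recv(sock, buf, (size_t)size, 0);
	if (got != size) {
		free(buf);
		return NULL;
	}

	*len = (size_t)got;
	return buf;
}
```

A long-running proxy might cache the peeked size per connection, since it is not expected to change for a running container.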
The previous API doesn't reflect the fact that
`seccomp_notif` and `seccomp_notif_resp` are allocated
dynamically with sizes figured out at runtime.
We now query the sizes via the seccomp(2) syscall and change
`struct seccomp_notify_proxy_msg` to contain the sizes
instead of the data, with the data following afterwards.
Additionally it did not provide a convenient way to identify
the container the message originated from, for which we now
include a cookie configured via `lxc.seccomp.notify.cookie`.
Since we currently always send exactly one request and await
the response immediately, verify the `id` in the client's
response.
Finally, the proxy message's "version" field is removed, and
we reserve 64 bits in its place.
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
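To illustrate the "sizes in the header, data afterwards" layout, here is a sketch of how a receiver might compute payload offsets and the total packet length. The struct below is an assumption made for illustration and is not the real `struct seccomp_notify_proxy_msg`; `struct notif_sizes` merely stands in for the sizes the kernel reports, and the payload order (request, response, cookie) is assumed as well.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the runtime-queried struct sizes (assumed layout). */
struct notif_sizes {
	uint16_t notif;      /* size of the request struct */
	uint16_t notif_resp; /* size of the response struct */
};

/* Illustrative header only; field names and order are assumptions. */
struct proxy_msg_hdr {
	uint64_t reserved;   /* the 64 bits reserved in place of "version" */
	struct notif_sizes sizes;
	uint64_t cookie_len;
	/* followed by: request, response, cookie */
};

/* Offset of the request data within the packet. */
size_t proxy_msg_notif_offset(const struct proxy_msg_hdr *h)
{
	(void)h;
	return sizeof(struct proxy_msg_hdr);
}

/* Offset of the response data within the packet. */
size_t proxy_msg_resp_offset(const struct proxy_msg_hdr *h)
{
	return proxy_msg_notif_offset(h) + h->sizes.notif;
}

/* Offset of the cookie within the packet. */
size_t proxy_msg_cookie_offset(const struct proxy_msg_hdr *h)
{
	return proxy_msg_resp_offset(h) + h->sizes.notif_resp;
}

/* Total packet length, or 0 if the cookie length would overflow. */
size_t proxy_msg_total_len(const struct proxy_msg_hdr *h)
{
	size_t fixed = proxy_msg_cookie_offset(h);

	assert(h != NULL);
	if (h->cookie_len > SIZE_MAX - fixed)
		return 0;
	return fixed + (size_t)h->cookie_len;
}
```

A receiver could compare `proxy_msg_total_len()` against the length of the packet it actually got before touching any of the trailing data.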
Thomas Parrott [Thu, 4 Jul 2019 21:38:23 +0000 (22:38 +0100)]
start: call lxc_find_gateway_addresses early
This restores the lxc.net.x.ipv4.gateway = auto and
lxc.net.x.ipv6.gateway = auto functionality.
When the child is created the parent and child have different views of
struct lxc_handler since - obviously - virtual memory is duplicated. So any
changes done by the parent that the child should see need to be IPCed to it.
For anything other than actual device creation this does not make much sense.
This includes finding gateway addresses. Move it back prior to clone().
Fixes #3078
Signed-off-by: Thomas Parrott <thomas.parrott@canonical.com>
[christian.brauner@ubuntu.com: non-functional changes and update commit message]
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Make sure that network creation happens at the same time for containers started
by privileged and unprivileged users. The only reason we didn't do this so far
was to avoid sending network device ifindices around in the privileged case.
Rachid Koucha [Sat, 15 Jun 2019 13:17:50 +0000 (15:17 +0200)]
Fixed file descriptor leak for network namespace
In privileged mode, container startup leaks a file descriptor for
"handler->nsfd[LXC_NS_NET]". At line 1782, we preserve the namespace file
descriptors (in privileged mode, the network namespace is also preserved):
    for (i = 0; i < LXC_NS_MAX; i++)
        if (handler->ns_on_clone_flags & ns_info[i].clone_flag)
            INFO("Cloned %s", ns_info[i].flag_name);
    if (!lxc_try_preserve_namespaces(handler, handler->ns_on_clone_flags, handler->pid)) {
        ERROR("Failed to preserve cloned namespaces for lxc.hook.stop");
        goto out_delete_net;
    }
Then at line 1830, we preserve the network namespace one more time:
    ret = lxc_try_preserve_ns(handler->pid, "net");
    if (ret < 0) {
        if (ret != -EOPNOTSUPP) {
            SYSERROR("Failed to preserve net namespace");
            goto out_delete_net;
        }
The latter overwrites the file descriptor already stored in handler->nsfd[LXC_NS_NET] at line 1786.
So, this fix checks that the entry is not already filled.
seccomp: do not set SECCOMP_FILTER_FLAG_NEW_LISTENER
Do not set SECCOMP_FILTER_FLAG_NEW_LISTENER as seccomp attribute.
Prior to libseccomp merging support for SECCOMP_RET_USER_NOTIF there was a
libseccomp specific attribute that needed to be set before
SECCOMP_RET_USER_NOTIF could be used. This has been removed.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
BugLink: https://bugs.launchpad.net/bugs/1831258
Cc: Dimitri John Ledkov <xnox@ubuntu.com>
Cc: Scott Moser <smoser@ubuntu.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>