btf: add support for VAR and DATASEC in btf_dedup()
This patch adds support for VAR and DATASEC in btf_dedup(). VAR/DATASEC
are never deduplicated, but they need to be processed anyway as types
they refer to might need to be remapped due to deduplication and
compaction.
Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Yonghong Song <yhs@fb.com> Cc: Alexei Starovoitov <ast@fb.com> Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
kbuild: handle old pahole more gracefully when generating BTF
When CONFIG_DEBUG_INFO_BTF is enabled but available version of pahole is too
old to support BTF generation, build script is supposed to emit warning and
proceed with the build. Due to using exit instead of return from BASH function,
existing handling code prematurely exits exit code 0, not completing some of
the build steps. This patch fixes issue by correctly returning just from
gen_btf() function only.
Fixes: e83b9f55448a ("kbuild: add ability to generate BTF type info for vmlinux") Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alexei Starovoitov <ast@fb.com> Cc: Yonghong Song <yhs@fb.com> Cc: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Jiong Wang [Fri, 12 Apr 2019 21:59:37 +0000 (22:59 +0100)]
bpf: refactor "check_reg_arg" to eliminate code redundancy
There are a few "regs[regno]" here are there across "check_reg_arg", this
patch factor it out into a simple "reg" pointer. The intention is to
simplify code indentation and make the later patches in this set look
cleaner.
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Jiong Wang [Fri, 12 Apr 2019 21:59:36 +0000 (22:59 +0100)]
bpf: factor out reg and stack slot propagation into "propagate_liveness_reg"
After code refactor in previous patches, the propagation logic inside the
for loop in "propagate_liveness" becomes clear that they are good enough to
be factored out into a common function "propagate_liveness_reg".
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
bpf: Fix distinct pointer types warning for ARCH=i386
Fix a new warning reported by kbuild for make ARCH=i386:
In file included from kernel/bpf/cgroup.c:11:0:
kernel/bpf/cgroup.c: In function '__cgroup_bpf_run_filter_sysctl':
include/linux/kernel.h:827:29: warning: comparison of distinct pointer types lacks a cast
(!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
^
include/linux/kernel.h:841:4: note: in expansion of macro '__typecheck'
(__typecheck(x, y) && __no_side_effects(x, y))
^~~~~~~~~~~
include/linux/kernel.h:851:24: note: in expansion of macro '__safe_cmp'
__builtin_choose_expr(__safe_cmp(x, y), \
^~~~~~~~~~
include/linux/kernel.h:860:19: note: in expansion of macro '__careful_cmp'
#define min(x, y) __careful_cmp(x, y, <)
^~~~~~~~~~~~~
>> kernel/bpf/cgroup.c:837:17: note: in expansion of macro 'min'
ctx.new_len = min(PAGE_SIZE, *pcount);
^~~
====================
v2->v3:
- simplify C based selftests by relying on variable offset stack access.
v1->v2:
- add fs/proc/proc_sysctl.c mainteners to Cc:.
The patch set introduces new BPF hook for sysctl.
It adds new program type BPF_PROG_TYPE_CGROUP_SYSCTL and attach type
BPF_CGROUP_SYSCTL.
BPF_CGROUP_SYSCTL hook is placed before calling to sysctl's proc_handler so
that accesses (read/write) to sysctl can be controlled for specific cgroup
and either allowed or denied, or traced.
The hook has access to sysctl name, current sysctl value and (on write
only) to new sysctl value via corresponding helpers. New sysctl value can
be overridden by program. Both name and values (current/new) are
represented as strings same way they're visible in /proc/sys/. It is up to
program to parse these strings.
To help with parsing the most common kind of sysctl value, vector of
integers, two new helpers are provided: bpf_strtol and bpf_strtoul with
semantic similar to user space strtol(3) and strtoul(3).
The hook also provides bpf_sysctl context with two fields:
* @write indicates whether sysctl is being read (= 0) or written (= 1);
* @file_pos is sysctl file position to read from or write to, can be
overridden.
The hook allows to make better isolation for containerized applications
that are run as root so that one container can't change a sysctl and affect
all other containers on a host, make changes to allowed sysctl in a safer
way and simplify sysctl tracing for cgroups.
Patch 1 is preliminary refactoring.
Patch 2 adds new program and attach types.
Patches 3-5 implement helpers to access sysctl name and value.
Patch 6 adds file_pos field to bpf_sysctl context.
Patch 7 updates UAPI in tools.
Patches 8-9 add support for the new hook to libbpf and corresponding test.
Patches 10-14 add selftests for the new hook.
Patch 15 adds support for new arg types to verifier: pointer to integer.
Patch 16 adds bpf_strto{l,ul} helpers to parse integers from sysctl value.
Patch 17 updates UAPI in tools.
Patch 18 updates bpf_helpers.h.
Patch 19 adds selftests for pointer to integer in verifier.
Patches 20-21 add selftests for bpf_strto{l,ul}, including integration
C based test for sysctl value parsing.
====================
Andrey Ignatov [Sat, 23 Mar 2019 22:51:00 +0000 (15:51 -0700)]
selftests/bpf: C based test for sysctl and strtoX
Add C based test for a few bpf_sysctl_* helpers and bpf_strtoul.
Make sure that sysctl can be identified by name and that multiple
integers can be parsed from sysctl value with bpf_strtoul.
net/ipv4/tcp_mem is chosen as a testing sysctl, it contains 3 unsigned
longs, they all are parsed and compared (val[0] < val[1] < val[2]).
Example of output:
# ./test_sysctl
...
Test case: C prog: deny all writes .. [PASS]
Test case: C prog: deny access by name .. [PASS]
Test case: C prog: read tcp_mem .. [PASS]
Summary: 39 PASSED, 0 FAILED
Andrey Ignatov [Tue, 19 Mar 2019 01:21:18 +0000 (18:21 -0700)]
selftests/bpf: Test bpf_strtol and bpf_strtoul helpers
Test that bpf_strtol and bpf_strtoul helpers can be used to convert
provided buffer to long or unsigned long correspondingly and return both
correct result and number of consumed bytes, or proper errno.
Example of output:
# ./test_sysctl
..
Test case: bpf_strtoul one number string .. [PASS]
Test case: bpf_strtoul multi number string .. [PASS]
Test case: bpf_strtoul buf_len = 0, reject .. [PASS]
Test case: bpf_strtoul supported base, ok .. [PASS]
Test case: bpf_strtoul unsupported base, EINVAL .. [PASS]
Test case: bpf_strtoul buf with spaces only, EINVAL .. [PASS]
Test case: bpf_strtoul negative number, EINVAL .. [PASS]
Test case: bpf_strtol negative number, ok .. [PASS]
Test case: bpf_strtol hex number, ok .. [PASS]
Test case: bpf_strtol max long .. [PASS]
Test case: bpf_strtol overflow, ERANGE .. [PASS]
Summary: 36 PASSED, 0 FAILED
Andrey Ignatov [Tue, 19 Mar 2019 01:17:03 +0000 (18:17 -0700)]
selftests/bpf: Test ARG_PTR_TO_LONG arg type
Test that verifier handles new argument types properly, including
uninitialized or partially initialized value, misaligned stack access,
etc.
Example of output:
#456/p ARG_PTR_TO_LONG uninitialized OK
#457/p ARG_PTR_TO_LONG half-uninitialized OK
#458/p ARG_PTR_TO_LONG misaligned OK
#459/p ARG_PTR_TO_LONG size < sizeof(long) OK
#460/p ARG_PTR_TO_LONG initialized OK
Andrey Ignatov [Tue, 19 Mar 2019 00:55:26 +0000 (17:55 -0700)]
bpf: Introduce bpf_strtol and bpf_strtoul helpers
Add bpf_strtol and bpf_strtoul to convert a string to long and unsigned
long correspondingly. It's similar to user space strtol(3) and
strtoul(3) with a few changes to the API:
* instead of NUL-terminated C string the helpers expect buffer and
buffer length;
* resulting long or unsigned long is returned in a separate
result-argument;
* return value is used to indicate success or failure, on success number
of consumed bytes is returned that can be used to identify position to
read next if the buffer is expected to contain multiple integers;
* instead of *base* argument, *flags* is used that provides base in 5
LSB, other bits are reserved for future use;
* number of supported bases is limited.
Documentation for the new helpers is provided in bpf.h UAPI.
The helpers are made available to BPF_PROG_TYPE_CGROUP_SYSCTL programs to
be able to convert string input to e.g. "ulongvec" output.
E.g. "net/ipv4/tcp_mem" consists of three ulong integers. They can be
parsed by calling to bpf_strtoul three times.
Implementation notes:
Implementation includes "../../lib/kstrtox.h" to reuse integer parsing
functions. It's done exactly same way as fs/proc/base.c already does.
Unfortunately existing kstrtoX function can't be used directly since
they fail if any invalid character is present right after integer in the
string. Existing simple_strtoX functions can't be used either since
they're obsolete and don't handle overflow properly.
Andrey Ignatov [Mon, 18 Mar 2019 23:57:10 +0000 (16:57 -0700)]
bpf: Introduce ARG_PTR_TO_{INT,LONG} arg types
Currently the way to pass result from BPF helper to BPF program is to
provide memory area defined by pointer and size: func(void *, size_t).
It works great for generic use-case, but for simple types, such as int,
it's overkill and consumes two arguments when it could use just one.
Introduce new argument types ARG_PTR_TO_INT and ARG_PTR_TO_LONG to be
able to pass result from helper to program via pointer to int and long
correspondingly: func(int *) or func(long *).
New argument types are similar to ARG_PTR_TO_MEM with the following
differences:
* they don't require corresponding ARG_CONST_SIZE argument, predefined
access sizes are used instead (32bit for int, 64bit for long);
* it's possible to use more than one such an argument in a helper;
* provided pointers have to be aligned.
It's easy to introduce similar ARG_PTR_TO_CHAR and ARG_PTR_TO_SHORT
argument types. It's not done due to lack of use-case though.
Andrey Ignatov [Fri, 8 Mar 2019 23:25:02 +0000 (15:25 -0800)]
selftests/bpf: Test file_pos field in bpf_sysctl ctx
Test access to file_pos field of bpf_sysctl context, both read (incl.
narrow read) and write.
# ./test_sysctl
...
Test case: ctx:file_pos sysctl:read read ok .. [PASS]
Test case: ctx:file_pos sysctl:read read ok narrow .. [PASS]
Test case: ctx:file_pos sysctl:read write ok .. [PASS]
...
Andrey Ignatov [Fri, 8 Mar 2019 23:08:21 +0000 (15:08 -0800)]
selftests/bpf: Test BPF_CGROUP_SYSCTL
Add unit test for BPF_PROG_TYPE_CGROUP_SYSCTL program type.
Test that program can allow/deny access.
Test both valid and invalid accesses to ctx->write.
Example of output:
# ./test_sysctl
Test case: sysctl wrong attach_type .. [PASS]
Test case: sysctl:read allow all .. [PASS]
Test case: sysctl:read deny all .. [PASS]
Test case: ctx:write sysctl:read read ok .. [PASS]
Test case: ctx:write sysctl:write read ok .. [PASS]
Test case: ctx:write sysctl:read write reject .. [PASS]
Summary: 6 PASSED, 0 FAILED
Andrey Ignatov [Fri, 8 Mar 2019 02:50:52 +0000 (18:50 -0800)]
bpf: Add file_pos field to bpf_sysctl ctx
Add file_pos field to bpf_sysctl context to read and write sysctl file
position at which sysctl is being accessed (read or written).
The field can be used to e.g. override whole sysctl value on write to
sysctl even when sys_write is called by user space with file_pos > 0. Or
BPF program may reject such accesses.
Add helpers to work with new value being written to sysctl by user
space.
bpf_sysctl_get_new_value() copies value being written to sysctl into
provided buffer.
bpf_sysctl_set_new_value() overrides new value being written by user
space with a one from provided buffer. Buffer should contain string
representation of the value, similar to what can be seen in /proc/sys/.
Both helpers can be used only on sysctl write.
File position matters and can be managed by an interface that will be
introduced separately. E.g. if user space calls sys_write to a file in
/proc/sys/ at file position = X, where X > 0, then the value set by
bpf_sysctl_set_new_value() will be written starting from X. If program
wants to override whole value with specified buffer, file position has
to be set to zero.
Documentation for the new helpers is provided in bpf.h UAPI.
Add bpf_sysctl_get_current_value() helper to copy current sysctl value
into provided by BPF_PROG_TYPE_CGROUP_SYSCTL program buffer.
It provides same string as user space can see by reading corresponding
file in /proc/sys/, including new line, etc.
Documentation for the new helper is provided in bpf.h UAPI.
Since current value is kept in ctl_table->data in a parsed form,
ctl_table->proc_handler() with write=0 is called to read that data and
convert it to a string. Such a string can later be parsed by a program
using helpers that will be introduced separately.
Unfortunately it's not trivial to provide API to access parsed data due to
variety of data representations (string, intvec, uintvec, ulongvec,
custom structures, even NULL, etc). Instead it's assumed that user know
how to handle specific sysctl they're interested in and appropriate
helpers can be used.
Since ctl_table->proc_handler() expects __user buffer, conversion to
__user happens for kernel allocated one where the value is stored.
Andrey Ignatov [Wed, 27 Feb 2019 20:59:24 +0000 (12:59 -0800)]
bpf: Sysctl hook
Containerized applications may run as root and it may create problems
for whole host. Specifically such applications may change a sysctl and
affect applications in other containers.
Furthermore in existing infrastructure it may not be possible to just
completely disable writing to sysctl, instead such a process should be
gradual with ability to log what sysctl are being changed by a
container, investigate, limit the set of writable sysctl to currently
used ones (so that new ones can not be changed) and eventually reduce
this set to zero.
The patch introduces new program type BPF_PROG_TYPE_CGROUP_SYSCTL and
attach type BPF_CGROUP_SYSCTL to solve these problems on cgroup basis.
New program type has access to following minimal context:
struct bpf_sysctl {
__u32 write;
};
Where @write indicates whether sysctl is being read (= 0) or written (=
1).
Helpers to access sysctl name and value will be introduced separately.
BPF_CGROUP_SYSCTL attach point is added to sysctl code right before
passing control to ctl_table->proc_handler so that BPF program can
either allow or deny access to sysctl.
Andrey Ignatov [Tue, 12 Mar 2019 16:27:09 +0000 (09:27 -0700)]
bpf: Add base proto function for cgroup-bpf programs
Currently kernel/bpf/cgroup.c contains only one program type and one
proto function cgroup_dev_func_proto(). It'd be useful to have base
proto function that can be reused for new cgroup-bpf program types
coming soon.
David S. Miller [Fri, 12 Apr 2019 17:50:56 +0000 (10:50 -0700)]
Merge branch 'smc-next'
Ursula Braun says:
====================
net/smc: patches 2019-04-12
here are patches for SMC:
* patch 1 improves behavior of non-blocking connect
* patches 2, 3, 5, 7, and 8 improve connecting return codes
* patches 4 and 6 are a cleanups without functional change
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Rework smc_conn_create() to always return a valid DECLINE reason code.
This removes the need to translate the return codes on 4 different
places and allows to easily add more detailed return codes by changing
smc_conn_create() only.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Rework smc_listen_work() to provide improved reason codes when an
SMC connection is declined. This allows better debugging on user side.
This also adds 3 more detailed reason codes in smc_clc.h to indicate
what type of device was not found (ism or rdma or both), or if ism
cannot talk to the peer.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
In smc_listen_work() the variables rc and reason_code are defined which
have the same meaning. Eliminate reason_code in favor of the shorter
name rc. No functional changes.
Rename the functions smc_check_ism() and smc_check_rdma() into
smc_find_ism_device() and smc_find_rdma_device() to make there purpose
more clear. No functional changes.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The vlan_id of the underlying CLC socket was retrieved two times
during processing of the listen handshaking. Change this to get the
vlan id one time in connect and in listen processing, and reuse the id.
And add a new CLC DECLINE return code for the case when the retrieval
of the vlan id failed.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
During initialization of an SMC socket a lot of function parameters need
to get passed down the function call path. Consolidate the parameters
in a helper struct so there are less enough parameters to get all passed
by register.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The check for a matching ip prefix and subnet was only done for SMC-R
in smc_listen_rdma_check() but not when an SMC-D connection was
possible. Rename the function into smc_listen_prfx_check() and move its
call to a place where it is called for both SMC variants.
And add a new CLC DECLINE reason for the case when the IP prefix or
subnet check fails so the reason for the failing SMC connection can be
found out more easily.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Correct the CLC decline reason codes for internal problems to not have
the sign bit set, negative reason codes are interpreted as not eligible
for TCP fallback.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
For nonblocking sockets move the kernel_connect() from the connect
worker into the initial smc_connect part to return kernel_connect()
errors other than -EINPROGRESS to user space.
Reviewed-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Miller [Thu, 11 Apr 2019 22:02:07 +0000 (15:02 -0700)]
sctp: Pass sk_buff_head explicitly to sctp_ulpq_tail_event().
Now the SKB list implementation assumption can be removed.
And now that we know that the list head is always non-NULL
we can remove the code blocks dealing with that as well.
Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Miller [Thu, 11 Apr 2019 22:02:04 +0000 (15:02 -0700)]
sctp: Make sctp_enqueue_event tak an skb list.
Pass this, instead of an event. Then everything trickles down and we
always have events a non-empty list.
Then we needs a list creating stub to place into .enqueue_event for sctp_stream_interleave_1.
Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Miller [Thu, 11 Apr 2019 22:02:01 +0000 (15:02 -0700)]
sctp: Use helper for sctp_ulpq_tail_event() when hooked up to ->enqueue_event
This way we can make sure events sent this way to
sctp_ulpq_tail_event() are on a list as well. Now all such code paths
are fully covered.
Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Miller [Thu, 11 Apr 2019 22:01:57 +0000 (15:01 -0700)]
sctp: Always pass skbs on a list to sctp_ulpq_tail_event().
This way we can simplify the logic and remove assumptions
about the implementation of skb lists.
Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Miller [Thu, 11 Apr 2019 22:01:53 +0000 (15:01 -0700)]
sctp: Remove superfluous test in sctp_ulpq_reasm_drain().
Inside the loop, we always start with event non-NULL.
Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: sched: flower: fix filter net reference counting
Fix net reference counting in fl_change() and remove redundant call to
tcf_exts_get_net() from __fl_delete(). __fl_put() already tries to get net
before releasing exts and deallocating a filter, so this code caused flower
classifier to obtain net twice per filter that is being deleted.
Implementation of __fl_delete() called tcf_exts_get_net() to pass its
result as 'async' flag to fl_mask_put(). However, 'async' flag is redundant
and only complicates fl_mask_put() implementation. This functionality seems
to be copied from filter cleanup code, where it was added by Cong with
following explanation:
This patchset tries to fix the race between call_rcu() and
cleanup_net() again. Without holding the netns refcnt the
tc_action_net_exit() in netns workqueue could be called before
filter destroy works in tc filter workqueue. This patchset
moves the netns refcnt from tc actions to tcf_exts, without
breaking per-netns tc actions.
This doesn't apply to flower mask, which doesn't call any tc action code
during cleanup. Simplify fl_mask_put() by removing the flag parameter and
always use tcf_queue_work() to free mask objects.
Fixes: 061775583e35 ("net: sched: flower: introduce reference counting for filters") Fixes: 1f17f7742eeb ("net: sched: flower: insert filter to ht before offloading it to hw") Fixes: 05cd271fd61a ("cls_flower: Support multiple masks per priority") Reported-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Thu, 11 Apr 2019 18:51:50 +0000 (11:51 -0700)]
selftests: Add debugging options to pmtu.sh
pmtu.sh script runs a number of tests and dumps a summary of pass/fail.
If a test fails, it is near impossible to debug why. For example:
TEST: ipv6: PMTU exceptions [FAIL]
There are a lot of commands run behind the scenes for this test. Which
one is failing?
Add a VERBOSE option to show commands that are run and any output from
those commands. Add a PAUSE_ON_FAIL option to halt the script if a test
fails allowing users to poke around with the setup in the failed state.
In the process, rename tracing to TRACING and move declaration to top
with the new variables.
Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Improve BPF verifier scalability for large programs through two
optimizations: i) remove verifier states that are not useful in pruning,
ii) stop walking parentage chain once first LIVE_READ is seen. Combined
gives approx 20x speedup. Increase limits for accepting large programs
under root, and add various stress tests, from Alexei.
2) Implement global data support in BPF. This enables static global variables
for .data, .rodata and .bss sections to be properly handled which allows
for more natural program development. This also opens up the possibility
to optimize program workflow by compiling ELFs only once and later only
rewriting section data before reload, from Daniel and with test cases and
libbpf refactoring from Joe.
3) Add config option to generate BTF type info for vmlinux as part of the
kernel build process. DWARF debug info is converted via pahole to BTF.
Latter relies on libbpf and makes use of BTF deduplication algorithm which
results in 100x savings compared to DWARF data. Resulting .BTF section is
typically about 2MB in size, from Andrii.
4) Add BPF verifier support for stack access with variable offset from
helpers and add various test cases along with it, from Andrey.
5) Extend bpf_skb_adjust_room() growth BPF helper to mark inner MAC header
so that L2 encapsulation can be used for tc tunnels, from Alan.
6) Add support for input __sk_buff context in BPF_PROG_TEST_RUN so that
users can define a subset of allowed __sk_buff fields that get fed into
the test program, from Stanislav.
7) Add bpf fs multi-dimensional array tests for BTF test suite and fix up
various UBSAN warnings in bpftool, from Yonghong.
8) Generate a pkg-config file for libbpf, from Luca.
9) Dump program's BTF id in bpftool, from Prashant.
10) libbpf fix to use smaller BPF log buffer size for AF_XDP's XDP
program, from Magnus.
11) kallsyms related fixes for the case when symbols are not present in
BPF selftests and samples, from Daniel
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
bpf: explicitly prohibit ctx_{in, out} in non-skb BPF_PROG_TEST_RUN
This should allow us later to extend BPF_PROG_TEST_RUN for non-skb case
and be sure that nobody is erroneously setting ctx_{in,out}.
Fixes: b0b9395d865e ("bpf: support input __sk_buff context in BPF_PROG_TEST_RUN") Reported-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel Borkmann [Tue, 9 Apr 2019 09:44:46 +0000 (11:44 +0200)]
tools: add smp_* barrier variants to include infrastructure
Add the definition for smp_rmb(), smp_wmb(), and smp_mb() to the
tools include infrastructure: this patch adds the implementation
for x86-64 and arm64, and have it fall back as currently is for
other archs which do not have it implemented at this point. The
x86-64 one uses lock + add combination for smp_mb() with address
below red zone.
This is on top of 09d62154f613 ("tools, perf: add and use optimized
ring_buffer_{read_head, write_tail} helpers"), which didn't touch
smp_* barrier implementations. Magnus recently rightfully reported
however that the latter on x86-64 still wrongly falls back to sfence,
lfence and mfence respectively, thus fix that for applications under
tools making use of these to avoid such ugly surprises. The main
header under tools (include/asm/barrier.h) will in that case not
select the fallback implementation.
Reported-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
====================
ipv6: Refactor nexthop selection helpers during a fib lookup
IPv6 has a fib6_nh embedded within each fib6_info and a separate
fib6_info for each path in a multipath route. A side effect is that
a fib6_info is passed all the way down the stack when selecting a path
on a fib lookup. Refactor the fib lookup functions and associated
helper functions to take a fib6_nh when appropriate to enable IPv6
to work with nexthop objects where the fib6_nh is not directly part
of a fib entry.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 9 Apr 2019 21:41:16 +0000 (14:41 -0700)]
ipv6: Be smarter with null_entry handling in ip6_pol_route_lookup
Clean up the fib6_null_entry handling in ip6_pol_route_lookup.
rt6_device_match can return fib6_null_entry, but fib6_multipath_select
can not. Consolidate the fib6_null_entry handling and on the final
null_entry check set rt and goto out - no need to defer to a second
check after rt6_find_cached_rt.
Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 9 Apr 2019 21:41:15 +0000 (14:41 -0700)]
ipv6: Refactor find_rr_leaf
find_rr_leaf has 3 loops over fib_entries calling find_match. The loops
are very similar with differences in start point and whether the metric
is evaluated:
1. start at rr_head, no extra loop compare, check fib metric
2. start at leaf, compare rt against rr_head, check metric
3. start at cont (potential saved point from earlier loops), no
extra loop compare, no metric check
Create 1 loop that is called 3 different times. This will make a
later change with multipath nexthop objects much simpler.
Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 9 Apr 2019 21:41:14 +0000 (14:41 -0700)]
ipv6: Refactor find_match
find_match primarily needs a fib6_nh (and fib6_flags which it passes
through to rt6_score_route). Move fib6_check_expired up to the call
sites so find_match is only called for relevant entries. Remove the
match argument which is mostly a pass through and use the return
boolean to decide if match gets set in the call sites.
The end result is a helper that can be called per fib6_nh struct
which is needed once fib entries reference nexthop objects that
have more than one fib6_nh.
Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 9 Apr 2019 21:41:12 +0000 (14:41 -0700)]
ipv6: Change rt6_probe to take a fib6_nh
rt6_probe sends probes for gateways in a nexthop. As such it really
depends on a fib6_nh, not a fib entry. Move last_probe to fib6_nh and
update rt6_probe to a fib6_nh struct.
Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 9 Apr 2019 21:41:10 +0000 (14:41 -0700)]
ipv6: Only call rt6_check_neigh for nexthop with gateway
Change rt6_check_neigh to take a fib6_nh instead of a fib entry.
Move the check on fib_flags and whether the nexthop has a gateway
up to the one caller.
Remove the inline from the definition as well. Not necessary.
Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Colin Ian King [Tue, 9 Apr 2019 12:59:12 +0000 (13:59 +0100)]
dns: remove redundant zero length namelen check
The zero namelen check is redundant as it has already been checked
for zero at the start of the function. Remove the redundant check.
Addresses-Coverity: ("Logically Dead Code") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Thu, 11 Apr 2019 20:50:58 +0000 (22:50 +0200)]
Merge branch 'bpf-l2-encap'
Alan Maguire says:
====================
Extend bpf_skb_adjust_room growth to mark inner MAC header so
that L2 encapsulation can be used for tc tunnels.
Patch #1 extends the existing test_tc_tunnel to support UDP
encapsulation; later we want to be able to test MPLS over UDP
and MPLS over GRE encapsulation.
Patch #2 adds the BPF_F_ADJ_ROOM_ENCAP_L2(len) macro, which
allows specification of inner mac length. Other approaches were
explored prior to taking this approach. Specifically, I tried
automatically computing the inner mac length on the basis of the
specified flags (so inner maclen for GRE/IPv4 encap is the len_diff
specified to bpf_skb_adjust_room minus GRE + IPv4 header length
for example). Problem with this is that we don't know for sure
what form of GRE/UDP header we have; is it a full GRE header,
or is it a FOU UDP header or generic UDP encap header? My fear
here was we'd end up with an explosion of flags. The other approach
tried was to support inner L2 header marking as a separate room
adjustment, i.e. adjust for L3/L4 encap, then call
bpf_skb_adjust_room for L2 encap. This can be made to work but
because it imposed an order on operations, felt a bit clunky.
Patch #3 syncs tools/ bpf.h.
Patch #4 extends the tests again to support MPLSoverGRE,
MPLSoverUDP, and transparent ethernet bridging (TEB) where
the inner L2 header is an ethernet header. Testing of BPF
encap against tunnels is done for cases where configuration
of such tunnels is possible (MPLSoverGRE[6], MPLSoverUDP,
gre[6]tap), and skipped otherwise. Testing of BPF encap/decap
is always carried out.
Changes since v2:
- updated tools/testing/selftest/bpf/config with FOU/MPLS CONFIG
variables (patches 1, 4)
- reduced noise in patch 1 by avoiding unnecessary movement of code
- eliminated inner_mac variable in bpf_skb_net_grow (patch 2)
Changes since v1:
- fixed formatting of commit references.
- BPF_F_ADJ_ROOM_FIXED_GSO flag enabled on all variants (patch 1)
- fixed fou6 options for UDP encap; checksum errors observed were
due to the fact fou6 tunnel was not set up with correct ipproto
options (41 -6). 0 checksums work fine (patch 1)
- added definitions for mask and shift used in setting L2 length
(patch 2)
- allow udp encap with fixed GSO (patch 2)
- changed "elen" to "l2_len" to be more descriptive (patch 4)
====================
Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Alan Maguire [Tue, 9 Apr 2019 14:06:41 +0000 (15:06 +0100)]
bpf: add layer 2 encap support to bpf_skb_adjust_room
commit 868d523535c2 ("bpf: add bpf_skb_adjust_room encap flags")
introduced support to bpf_skb_adjust_room for GSO-friendly GRE
and UDP encapsulation.
For GSO to work for skbs, the inner headers (mac and network) need to
be marked. For L3 encapsulation using bpf_skb_adjust_room, the mac
and network headers are identical. Here we provide a way of specifying
the inner mac header length for cases where L2 encap is desired. Such
an approach can support encapsulated ethernet headers, MPLS headers etc.
For example to convert from a packet of form [eth][ip][tcp] to
[eth][ip][udp][inner mac][ip][tcp], something like the following could
be done:
Alan Maguire [Tue, 9 Apr 2019 14:06:40 +0000 (15:06 +0100)]
selftests_bpf: extend test_tc_tunnel for UDP encap
commit 868d523535c2 ("bpf: add bpf_skb_adjust_room encap flags")
introduced support to bpf_skb_adjust_room for GSO-friendly GRE
and UDP encapsulation and later introduced associated test_tc_tunnel
tests. Here those tests are extended to cover UDP encapsulation also.
Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Jon Maloy [Thu, 11 Apr 2019 19:56:28 +0000 (21:56 +0200)]
tipc: use standard write_lock & unlock functions when creating node
In the function tipc_node_create() we protect the peer capability field
by using the node rw_lock. However, we access the lock directly instead
of using the dedicated functions for this, as we do everywhere else in
node.c. This cosmetic spot is fixed here.
Fixes: 40999f11ce67 ("tipc: make link capability update thread safe") Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
bpf: fix missing bpf_check_uarg_tail_zero in BPF_PROG_TEST_RUN
Commit b0b9395d865e ("bpf: support input __sk_buff context in
BPF_PROG_TEST_RUN") started using bpf_check_uarg_tail_zero in
BPF_PROG_TEST_RUN. However, bpf_check_uarg_tail_zero is not defined
for !CONFIG_BPF_SYSCALL:
net/bpf/test_run.c: In function ‘bpf_ctx_init’:
net/bpf/test_run.c:142:9: error: implicit declaration of function ‘bpf_check_uarg_tail_zero’ [-Werror=implicit-function-declaration]
err = bpf_check_uarg_tail_zero(data_in, max_size, size);
^~~~~~~~~~~~~~~~~~~~~~~~
Let's not build net/bpf/test_run.c when CONFIG_BPF_SYSCALL is not set.
Reported-by: kbuild test robot <lkp@intel.com> Fixes: b0b9395d865e ("bpf: support input __sk_buff context in BPF_PROG_TEST_RUN") Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
net: sched: flower: use correct ht function to prevent duplicates
Implementation of function rhashtable_insert_fast() check if its internal
helper function __rhashtable_insert_fast() returns non-NULL pointer and
seemingly return -EEXIST in such case. However, since
__rhashtable_insert_fast() is called with NULL key pointer, it never
actually checks for duplicates, which means that -EEXIST is never returned
to the user. Use rhashtable_lookup_insert_fast() hash table API instead. In
order to verify that it works as expected and prevent the problem from
happening in future, extend tc-tests with new test that verifies that no
new filters with existing key can be inserted to flower classifier.
Fixes: 1f17f7742eeb ("net: sched: flower: insert filter to ht before offloading it to hw") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
netns: read NETNSA_NSID as s32 attribute in rtnl_net_getid()
NETNSA_NSID is signed. Use nla_get_s32() to avoid confusion.
Signed-off-by: Guillaume Nault <gnault@redhat.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
selftests: bpf: add selftest for __sk_buff context in BPF_PROG_TEST_RUN
Simple test that sets cb to {1,2,3,4,5} and priority to 6, runs bpf
program that fails if cb is not what we expect and increments cb[i] and
priority. When the test finishes, we check that cb is now {2,3,4,5,6}
and priority is 7.
We also test the sanity checks:
* ctx_in is provided, but ctx_size_in is zero (same for
ctx_out/ctx_size_out)
* unexpected non-zero fields in __sk_buff return EINVAL
Signed-off-by: Stanislav Fomichev <sdf@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
libbpf: add support for ctx_{size, }_{in, out} in BPF_PROG_TEST_RUN
Support recently introduced input/output context for test runs.
We extend only bpf_prog_test_run_xattr. bpf_prog_test_run is
unextendable and left as is.
Signed-off-by: Stanislav Fomichev <sdf@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
bpf: support input __sk_buff context in BPF_PROG_TEST_RUN
Add new set of arguments to bpf_attr for BPF_PROG_TEST_RUN:
* ctx_in/ctx_size_in - input context
* ctx_out/ctx_size_out - output context
The intended use case is to pass some meta data to the test runs that
operate on skb (this has being brought up on recent LPC).
For programs that use bpf_prog_test_run_skb, support __sk_buff input and
output. Initially, from input __sk_buff, copy _only_ cb and priority into
skb, all other non-zero fields are prohibited (with EINVAL).
If the user has set ctx_out/ctx_size_out, copy the potentially modified
__sk_buff back to the userspace.
We require all fields of input __sk_buff except the ones we explicitly
support to be set to zero. The expectation is that in the future we might
add support for more fields and we want to fail explicitly if the user
runs the program on the kernel where we don't yet support them.
The API is intentionally vague (i.e. we don't explicitly add __sk_buff
to bpf_attr, but ctx_in) to potentially let other test_run types use
this interface in the future (this can be xdp_md for xdp types for
example).
v4:
* don't copy more than allowed in bpf_ctx_init [Martin]
v3:
* handle case where ctx_in is NULL, but ctx_out is not [Martin]
* convert size==0 checks to ptr==NULL checks and add some extra ptr
checks [Martin]
v2:
* Addressed comments from Martin Lau
Signed-off-by: Stanislav Fomichev <sdf@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Let's add a way to know whether a program has btf context.
Patch adds 'btf_id' in the output of program listing.
When btf_id is present, it means program has btf context.
Sample output:
user@test# bpftool prog list
25: xdp name xdp_prog1 tag 539ec6ce11b52f98 gpl
loaded_at 2019-04-10T11:44:20+0900 uid 0
xlated 488B not jited memlock 4096B map_ids 23
btf_id 1
Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel Borkmann [Wed, 10 Apr 2019 09:07:14 +0000 (11:07 +0200)]
mailmap: add entry for email addresses
Redirect email addresses from git log to the mainly used ones
for Alexei and myself such that it is consistent with the ones
in MAINTAINERS file. Useful in particular when git mailmap is
enabled on broader scope, for example:
$ git config --global log.mailmap true
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org>
The Credit Based Shaper heavily depends on link speed to calculate
the scheduling credits, we can't properly calculate the credits if the
device has failed to report the link speed.
In that case we can't dequeue packets assuming a wrong port rate that will
result into an inconsistent credit distribution.
This patch makes sure we fail to dequeue case:
1) __ethtool_get_link_ksettings() reports error or 2) the ethernet driver
failed to set the ksettings' speed value (setting link speed to
SPEED_UNKNOWN).
Additionally we properly re calculate the port rate whenever the link speed
is changed.
Fixes: 3d0bd028ffb4a ("net/sched: Add support for HW offloading for CBS") Signed-off-by: Leandro Dorileo <leandro.maciel.dorileo@intel.com> Reviewed-by: Vedang Patel <vedang.patel@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The Time Aware Priority Scheduler is heavily dependent to link speed,
it relies on it to calculate transmission bytes per cycle, we can't
properly calculate the so called budget if the device has failed
to report the link speed.
In that case we can't dequeue packets assuming a wrong budget.
This patch makes sure we fail to dequeue case:
1) __ethtool_get_link_ksettings() reports error or 2) the ethernet
driver failed to set the ksettings' speed value (setting link speed
to SPEED_UNKNOWN).
Additionally we re calculate the budget whenever the link speed is
changed.
Fixes: 5a781ccbd19e4 ("tc: Add support for configuring the taprio scheduler") Signed-off-by: Leandro Dorileo <leandro.maciel.dorileo@intel.com> Reviewed-by: Vedang Patel <vedang.patel@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Wed, 10 Apr 2019 17:05:51 +0000 (10:05 -0700)]
ipv4: Handle RTA_GATEWAY set to 0
Govindarajulu reported a regression with Network Manager which sends an
RTA_GATEWAY attribute with the address set to 0. Fixup the handling of
RTA_GATEWAY to only set fc_gw_family if the gateway address is actually
set.
Fixes: f35b794b3b405 ("ipv4: Prepare fib_config for IPv6 gateway") Reported-by: Govindarajulu Varadarajan <govind.varadar@gmail.com> Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
net: sched: move back qlen to per CPU accounting
The commit 46b1c18f9deb ("net: sched: put back q.qlen into a single location")
introduced some measurable regression in the contended scenarios for
lock qdisc.
As Eric suggested we could replace q.qlen access with calls to qdisc_is_empty()
in the datapath and revert the above commit. The TC subsystem updates
qdisc->is_empty in a somewhat loose way: notably 'is_empty' is set only when
the qdisc dequeue() calls return a NULL ptr. That is, the invocation after
the last packet is dequeued.
The above is good enough for BYPASS implementation - the only downside is that
we end up avoiding the optimization for a very small time-frame - but will
break hard things when internal structures consistency for classful qdisc
relies on child qdisc_is_empty().
A more strict 'is_empty' update adds a relevant complexity to its life-cycle, so
this series takes a different approach: we allow lockless qdisc to switch from
per CPU accounting to global stats accounting when the NOLOCK bit is cleared.
Since most pieces of infrastructure are already in place, this requires very
little changes to the pfifo_fast qdisc, and any later NOLOCK qdisc can hook
there with little effort - no need to maintain two different implementations.
The first 2 patches removes direct qlen access from non core TC code, the 3rd
and 4th patches place and use the infrastructure to allow stats account
switching and the 5th patch is the actual revert.
v1 -> v2:
- fixed build issues
- more descriptive commit message for patch 5/5
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Wed, 10 Apr 2019 12:32:41 +0000 (14:32 +0200)]
Revert: "net: sched: put back q.qlen into a single location"
This revert commit 46b1c18f9deb ("net: sched: put back q.qlen into
a single location").
After the previous patch, when a NOLOCK qdisc is enslaved to a
locking qdisc it switches to global stats accounting. As a consequence,
when a classful qdisc accesses directly a child qdisc's qlen, such
qdisc is not doing per CPU accounting and qlen value is consistent.
In the control path nobody uses directly qlen since commit e5f0e8f8e45 ("net: sched: introduce and use qdisc tree flush/purge
helpers"), so we can remove the contented atomic ops from the
datapath.
v1 -> v2:
- complete the qdisc_qstats_atomic_qlen_dec() ->
qdisc_qstats_cpu_qlen_dec() replacement, fix build issue
- more descriptive commit message
Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Wed, 10 Apr 2019 12:32:39 +0000 (14:32 +0200)]
net: sched: always do stats accounting according to TCQ_F_CPUSTATS
The core sched implementation checks independently for NOLOCK flag
to acquire/release the root spin lock and for qdisc_is_percpu_stats()
to account per CPU values in many places.
This change update the last few places checking the TCQ_F_NOLOCK to
do per CPU stats accounting according to qdisc_is_percpu_stats()
value.
The above allows to clean dev_requeue_skb() implementation a bit
and makes stats update always consistent with a single flag.
v1 -> v2:
- do not move qdisc_is_empty definition, fix build issue
Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Wed, 10 Apr 2019 12:32:38 +0000 (14:32 +0200)]
net: sched: prefer qdisc_is_empty() over direct qlen access
When checking for root qdisc queue length, do not access directly q.qlen.
In the following patches we will move back qlen accounting to per CPU
values for NOLOCK qdiscs.
Instead, prefer the qdisc_is_empty() helper usage.
Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Wed, 10 Apr 2019 12:32:37 +0000 (14:32 +0200)]
net: caif: avoid using qdisc_qlen()
Such helper does not cope correctly with NOLOCK qdiscs.
In the following patches we will move back qlen to per CPU
values for such qdiscs, so qdisc_qlen_sum() is not an option,
too.
Instead, use qlen only for lock qdiscs, and always set
flow off for NOLOCK qdiscs with a not empty tx queue.
Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Magnus Karlsson [Wed, 10 Apr 2019 06:54:16 +0000 (08:54 +0200)]
libbpf: fix crash in XDP socket part with new larger BPF_LOG_BUF_SIZE
In commit da11b417583e ("libbpf: teach libbpf about log_level bit 2"),
the BPF_LOG_BUF_SIZE was increased to 16M. The XDP socket part of
libbpf allocated the log_buf on the stack, but for the new 16M buffer
size this is not going to work. Change the code so it uses a 16K buffer
instead.
Fixes: da11b417583e ("libbpf: teach libbpf about log_level bit 2") Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Yonghong Song [Wed, 10 Apr 2019 00:37:41 +0000 (17:37 -0700)]
bpf, bpftool: fix a few ubsan warnings
The issue is reported at https://github.com/libbpf/libbpf/issues/28.
Basically, per C standard, for
void *memcpy(void *dest, const void *src, size_t n)
if "dest" or "src" is NULL, regardless of whether "n" is 0 or not,
the result of memcpy is undefined. clang ubsan reported three such
instances in bpf.c with the following pattern:
memcpy(dest, 0, 0).
Although in practice, no known compiler will cause issues when
copy size is 0. Let us still fix the issue to silence ubsan
warnings.
Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
====================
This series is a major rework of previously submitted libbpf
patches [0] in order to add global data support for BPF. The
kernel has been extended to add proper infrastructure that allows
for full .bss/.data/.rodata sections on BPF loader side based
upon feedback from LPC discussions [1]. Latter support is then
also added into libbpf in this series which allows for more
natural C-like programming of BPF programs. For more information
on loader, please refer to 'bpf, libbpf: support global data/bss/
rodata sections' patch in this series.
Thanks a lot!
v5 -> v6:
- Removed synchronize_rcu() from map freeze (Jann)
- Rest as-is
v4 -> v5:
- Removed index selection again for ldimm64 (Alexei)
- Adapted related test cases and added new ones to test
rejection of off != 0
v3 -> v4:
- Various fixes in BTF verification e.g. to disallow
Var and DataSec to be an intermediate type during resolve (Martin)
- More BTF test cases added
- Few cleanups in key-less BTF commit (Martin)
- Bump libbpf minor version from 2 to 3
- Renamed and simplified read-only locking
- Various minor improvements all over the place
v2 -> v3:
- Implement BTF support in kernel, libbpf, bpftool, add tests
- Fix idx + off conversion (Andrii)
- Document lower / higher bits for direct value access (Andrii)
- Add tests with small value size (Andrii)
- Add index selection into ldimm64 (Andrii)
- Fix missing fdput() (Jann)
- Reject invalid flags in BPF_F_*_PROG (Jakub)
- Complete rework of libbpf support, includes:
- Add objname to map name (Stanislav)
- Make .rodata map full read-only after setup (Andrii)
- Merge relocation handling into single one (Andrii)
- Store global maps into obj->maps array (Andrii, Alexei)
- Debug message when skipping section (Andrii)
- Reject non-static global data till we have
semantics for sharing them (Yonghong, Andrii, Alexei)
- More test cases and completely reworked prog test (Alexei)
- Fixes, cleanups, etc all over the set
- Not yet addressed:
- Make BTF mandatory for these maps (Alexei)
-> Waiting till BTF support for these lands first
v1 -> v2:
- Instead of 32-bit static data, implement full global
data support (Alexei)
Daniel Borkmann [Tue, 9 Apr 2019 21:20:18 +0000 (23:20 +0200)]
bpf, selftest: add test cases for BTF Var and DataSec
Extend test_btf with various positive and negative tests around
BTF verification of kind Var and DataSec. All passing as well:
# ./test_btf
[...]
BTF raw test[4] (global data test #1): OK
BTF raw test[5] (global data test #2): OK
BTF raw test[6] (global data test #3): OK
BTF raw test[7] (global data test #4, unsupported linkage): OK
BTF raw test[8] (global data test #5, invalid var type): OK
BTF raw test[9] (global data test #6, invalid var type (fwd type)): OK
BTF raw test[10] (global data test #7, invalid var type (fwd type)): OK
BTF raw test[11] (global data test #8, invalid var size): OK
BTF raw test[12] (global data test #9, invalid var size): OK
BTF raw test[13] (global data test #10, invalid var size): OK
BTF raw test[14] (global data test #11, multiple section members): OK
BTF raw test[15] (global data test #12, invalid offset): OK
BTF raw test[16] (global data test #13, invalid offset): OK
BTF raw test[17] (global data test #14, invalid offset): OK
BTF raw test[18] (global data test #15, not var kind): OK
BTF raw test[19] (global data test #16, invalid var referencing sec): OK
BTF raw test[20] (global data test #17, invalid var referencing var): OK
BTF raw test[21] (global data test #18, invalid var loop): OK
BTF raw test[22] (global data test #19, invalid var referencing var): OK
BTF raw test[23] (global data test #20, invalid ptr referencing var): OK
BTF raw test[24] (global data test #21, var included in struct): OK
BTF raw test[25] (global data test #22, array of var): OK
[...]
PASS:167 SKIP:0 FAIL:0
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Joe Stringer [Tue, 9 Apr 2019 21:20:17 +0000 (23:20 +0200)]
bpf, selftest: test global data/bss/rodata sections
Add tests for libbpf relocation of static variable references
into the .data, .rodata and .bss sections of the ELF, also add
read-only test for .rodata. All passing:
Daniel Borkmann [Tue, 9 Apr 2019 21:20:16 +0000 (23:20 +0200)]
bpf, selftest: test {rd, wr}only flags and direct value access
Extend test_verifier with various test cases around the two kernel
extensions, that is, {rd,wr}only map support as well as direct map
value access. All passing, one skipped due to xskmap not present
on test machine:
# ./test_verifier
[...]
#948/p XDP pkt read, pkt_meta' <= pkt_data, bad access 1 OK
#949/p XDP pkt read, pkt_meta' <= pkt_data, bad access 2 OK
#950/p XDP pkt read, pkt_data <= pkt_meta', good access OK
#951/p XDP pkt read, pkt_data <= pkt_meta', bad access 1 OK
#952/p XDP pkt read, pkt_data <= pkt_meta', bad access 2 OK
Summary: 1410 PASSED, 1 SKIPPED, 0 FAILED
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Borkmann [Tue, 9 Apr 2019 21:20:15 +0000 (23:20 +0200)]
bpf: bpftool support for dumping data/bss/rodata sections
Add the ability to bpftool to handle BTF Var and DataSec kinds
in order to dump them out of btf_dumper_type(). The value has a
single object with the section name, which itself holds an array
of variables it dumps. A single variable is an object by itself
printed along with its name. From there further type information
is dumped along with corresponding value information.
Daniel Borkmann [Tue, 9 Apr 2019 21:20:14 +0000 (23:20 +0200)]
bpf, libbpf: add support for BTF Var and DataSec
This adds libbpf support for BTF Var and DataSec kinds. Main point
here is that libbpf needs to do some preparatory work before the
whole BTF object can be loaded into the kernel, that is, fixing up
of DataSec size taken from the ELF section size and non-static
variable offset which needs to be taken from the ELF's string section.
Upstream LLVM doesn't fix these up since at time of BTF emission
it is too early in the compilation process thus this information
isn't available yet, hence loader needs to take care of it.
Note, deduplication handling has not been in the scope of this work
and needs to be addressed in a future commit.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://reviews.llvm.org/D59441 Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Borkmann [Tue, 9 Apr 2019 21:20:13 +0000 (23:20 +0200)]
bpf, libbpf: support global data/bss/rodata sections
This work adds BPF loader support for global data sections
to libbpf. This allows to write BPF programs in more natural
C-like way by being able to define global variables and const
data.
Back at LPC 2018 [0] we presented a first prototype which
implemented support for global data sections by extending BPF
syscall where union bpf_attr would get additional memory/size
pair for each section passed during prog load in order to later
add this base address into the ldimm64 instruction along with
the user provided offset when accessing a variable. Consensus
from LPC was that for proper upstream support, it would be
more desirable to use maps instead of bpf_attr extension as
this would allow for introspection of these sections as well
as potential live updates of their content. This work follows
this path by taking the following steps from loader side:
1) In bpf_object__elf_collect() step we pick up ".data",
".rodata", and ".bss" section information.
2) If present, in bpf_object__init_internal_map() we add
maps to the obj's map array that corresponds to each
of the present sections. Given section size and access
properties can differ, a single entry array map is
created with value size that is corresponding to the
ELF section size of .data, .bss or .rodata. These
internal maps are integrated into the normal map
handling of libbpf such that when user traverses all
obj maps, they can be differentiated from user-created
ones via bpf_map__is_internal(). In later steps when
we actually create these maps in the kernel via
bpf_object__create_maps(), then for .data and .rodata
sections their content is copied into the map through
bpf_map_update_elem(). For .bss this is not necessary
since array map is already zero-initialized by default.
Additionally, for .rodata the map is frozen as read-only
after setup, such that neither from program nor syscall
side writes would be possible.
3) In bpf_program__collect_reloc() step, we record the
corresponding map, insn index, and relocation type for
the global data.
4) And last but not least in the actual relocation step in
bpf_program__relocate(), we mark the ldimm64 instruction
with src_reg = BPF_PSEUDO_MAP_VALUE where in the first
imm field the map's file descriptor is stored as similarly
done as in BPF_PSEUDO_MAP_FD, and in the second imm field
(as ldimm64 is 2-insn wide) we store the access offset
into the section. Given these maps have only single element
ldimm64's off remains zero in both parts.
5) On kernel side, this special marked BPF_PSEUDO_MAP_VALUE
load will then store the actual target address in order
to have a 'map-lookup'-free access. That is, the actual
map value base address + offset. The destination register
in the verifier will then be marked as PTR_TO_MAP_VALUE,
containing the fixed offset as reg->off and backing BPF
map as reg->map_ptr. Meaning, it's treated as any other
normal map value from verification side, only with
efficient, direct value access instead of actual call to
map lookup helper as in the typical case.
Currently, only support for static global variables has been
added, and libbpf rejects non-static global variables from
loading. This can be lifted until we have proper semantics
for how BPF will treat multi-object BPF loads. From BTF side,
libbpf will set the value type id of the types corresponding
to the ".bss", ".data" and ".rodata" names which LLVM will
emit without the object name prefix. The key type will be
left as zero, thus making use of the key-less BTF option in
array maps.
Simple example dump of program using globals vars in each
section:
# bpftool map show id 2237
2237: array name test_glo.bss flags 0x0
key 4B value 64B max_entries 1 memlock 4096B
# bpftool map show id 2235
2235: array name test_glo.data flags 0x0
key 4B value 64B max_entries 1 memlock 4096B
# bpftool map show id 2236
2236: array name test_glo.rodata flags 0x80
key 4B value 96B max_entries 1 memlock 4096B
For Cilium use-case in particular, this enables migrating configuration
constants from Cilium daemon's generated header defines into global
data sections such that expensive runtime recompilations with LLVM can
be avoided altogether. Instead, the ELF file becomes effectively a
"template", meaning, it is compiled only once (!) and the Cilium daemon
will then rewrite relevant configuration data from the ELF's .data or
.rodata sections directly instead of recompiling the program. The
updated ELF is then loaded into the kernel and atomically replaces
the existing program in the networking datapath. More info in [0].
Based upon recent fix in LLVM, commit c0db6b6bd444 ("[BPF] Don't fail
for static variables").
[0] LPC 2018, BPF track, "ELF relocation for static data in BPF",
http://vger.kernel.org/lpc-bpf2018.html#session-3
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Joe Stringer [Tue, 9 Apr 2019 21:20:12 +0000 (23:20 +0200)]
bpf, libbpf: refactor relocation handling
Adjust the code for relocations slightly with no functional changes,
so that upcoming patches that will introduce support for relocations
into the .data, .rodata and .bss sections can be added independent
of these changes.
Signed-off-by: Joe Stringer <joe@wand.net.nz> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>