Andrii Nakryiko [Tue, 7 Mar 2023 23:31:13 +0000 (15:31 -0800)]
Merge branch 'libbpf: usdt arm arg parsing support'
Puranjay Mohan says:
====================
This series add the support of the ARM architecture to libbpf USDT. This
involves implementing the parse_usdt_arg() function for ARM.
It was seen that the last part of parse_usdt_arg() is repeated for all architectures,
so, the first patch in this series refactors these functions and moved the post
processing to parse_usdt_spec()
Changes in V2[1] to V3:
- Use a tabular approach to find register offsets.
- Add the patch for refactoring parse_usdt_arg()
====================
Puranjay Mohan [Tue, 7 Mar 2023 12:04:40 +0000 (12:04 +0000)]
libbpf: USDT arm arg parsing support
Parsing of USDT arguments is architecture-specific; on arm it is
relatively easy since registers used are r[0-10], fp, ip, sp, lr,
pc. Format is slightly different compared to aarch64; forms are
- "size @ [ reg, #offset ]" for dereferences, for example
"-8 @ [ sp, #76 ]" ; " -4 @ [ sp ]"
- "size @ reg" for register values; for example
"-4@r0"
- "size @ #value" for raw values; for example
"-8@#1"
Add support for parsing USDT arguments for ARM architecture.
To test the above changes QEMU's virt[1] board with cortex-a15
CPU was used. libbpf-bootstrap's usdt example[2] was modified to attach
to a test program with DTRACE_PROBE1/2/3/4... probes to test different
combinations.
Puranjay Mohan [Tue, 7 Mar 2023 12:04:39 +0000 (12:04 +0000)]
libbpf: Refactor parse_usdt_arg() to re-use code
The parse_usdt_arg() function is defined differently for each
architecture but the last part of the function is repeated
verbatim for each architecture.
Refactor parse_usdt_arg() to fill the arg_sz and then do the repeated
post-processing in parse_usdt_spec().
Daniel Müller [Tue, 7 Mar 2023 21:55:04 +0000 (21:55 +0000)]
libbpf: Fix theoretical u32 underflow in find_cd() function
Coverity reported a potential underflow of the offset variable used in
the find_cd() function. Switch to using a signed 64 bit integer for the
representation of offset to make sure we can never underflow.
Currently we can't get bpf memory usage reliably either from memcg or
from bpftool.
In memcg, there's not a 'bpf' item in memory.stat, but only 'kernel',
'sock', 'vmalloc' and 'percpu' which may related to bpf memory. With
these items we still can't get the bpf memory usage, because bpf memory
usage may far less than the kmem in a memcg, for example, the dentry may
consume lots of kmem.
bpftool now shows the bpf memory footprint, which is difference with bpf
memory usage. The difference can be quite great in some cases, for example,
- non-preallocated bpf map
The non-preallocated bpf map memory usage is dynamically changed. The
allocated elements count can be from 0 to the max entries. But the
memory footprint in bpftool only shows a fixed number.
- bpf metadata consumes more memory than bpf element
In some corner cases, the bpf metadata can consumes a lot more memory
than bpf element consumes. For example, it can happen when the element
size is quite small.
- some maps don't have key, value or max_entries
For example the key_size and value_size of ringbuf is 0, so its
memlock is always 0.
We need a way to show the bpf memory usage especially there will be more
and more bpf programs running on the production environment and thus the
bpf memory usage is not trivial.
This patchset introduces a new map ops ->map_mem_usage to calculate the
memory usage. Note that we don't intend to make the memory usage 100%
accurate, while our goal is to make sure there is only a small difference
between what bpftool reports and the real memory. That small difference
can be ignored compared to the total usage. That is enough to monitor
the bpf memory usage. For example, the user can rely on this value to
monitor the trend of bpf memory usage, compare the difference in bpf
memory usage between different bpf program versions, figure out which
maps consume large memory, and etc.
This patchset implements the bpf memory usage for all maps, and yet there's
still work to do. We don't want to introduce runtime overhead in the
element update and delete path, but we have to do it for some
non-preallocated maps,
- devmap, xskmap
When we update or delete an element, it will allocate or free memory.
In order to track this dynamic memory, we have to track the count in
element update and delete path.
- cpumap
The element size of each cpumap element is not determinated. If we
want to track the usage, we have to count the size of all elements in
the element update and delete path. So I just put it aside currently.
- local_storage, bpf_local_storage
When we attach or detach a cgroup, it will allocate or free memory. If
we want to track the dynamic memory, we also need to do something in
the update and delete path. So I just put it aside currently.
- offload map
The element update and delete of offload map is via the netdev dev_ops,
in which it may dynamically allocate or free memory, but this dynamic
memory isn't counted in offload map memory usage currently.
The result of each map can be found in the individual patch.
We may also need to track per-container bpf memory usage, that will be
addressed by a different patchset.
Changes:
v3->v4: code improvement on ringbuf (Andrii)
use READ_ONCE() to read lpm_trie (Tao)
explain why we can't get bpf memory usage from memcg.
v2->v3: check callback at map creation time and avoid warning (Alexei)
fix build error under CONFIG_BPF=n (lkp@intel.com)
v1->v2: calculate the memory usage within bpf (Alexei)
- [v1] bpf, mm: bpf memory usage
https://lwn.net/Articles/921991/
- [RFC PATCH v2] mm, bpf: Add BPF into /proc/meminfo
https://lwn.net/Articles/919848/
- [RFC PATCH v1] mm, bpf: Add BPF into /proc/meminfo
https://lwn.net/Articles/917647/
- [RFC PATCH] bpf, mm: Add a new item bpf into memory.stat
https://lore.kernel.org/bpf/20220921170002.29557-1-laoar.shao@gmail].com/
====================
Yafang Shao [Sun, 5 Mar 2023 12:46:15 +0000 (12:46 +0000)]
bpf: enforce all maps having memory usage callback
We have implemented memory usage callback for all maps, and we enforce
any newly added map having a callback as well. We check this callback at
map creation time. If it doesn't have the callback, we will return
EINVAL.
Yafang Shao [Sun, 5 Mar 2023 12:46:14 +0000 (12:46 +0000)]
bpf: offload map memory usage
A new helper is introduced to calculate offload map memory usage. But
currently the memory dynamically allocated in netdev dev_ops, like
nsim_map_update_elem, is not counted. Let's just put it aside now.
Yafang Shao [Sun, 5 Mar 2023 12:46:13 +0000 (12:46 +0000)]
bpf, net: xskmap memory usage
A new helper is introduced to calculate xskmap memory usage.
The xfsmap memory usage can be dynamically changed when we add or remove
a xsk_map_node. Hence we need to track the count of xsk_map_node to get
its memory usage.
The result as follows,
- before
10: xskmap name count_map flags 0x0
key 4B value 4B max_entries 65536 memlock 524288B
- after
10: xskmap name count_map flags 0x0 <<< no elements case
key 4B value 4B max_entries 65536 memlock 524608B
Yafang Shao [Sun, 5 Mar 2023 12:46:12 +0000 (12:46 +0000)]
bpf, net: sock_map memory usage
sockmap and sockhash don't have something in common in allocation, so let's
introduce different helpers to calculate their memory usage.
The reuslt as follows,
- before
28: sockmap name count_map flags 0x0
key 4B value 4B max_entries 65536 memlock 524288B
29: sockhash name count_map flags 0x0
key 4B value 4B max_entries 65536 memlock 524288B
- after
28: sockmap name count_map flags 0x0
key 4B value 4B max_entries 65536 memlock 524608B
29: sockhash name count_map flags 0x0 <<<< no updated elements
key 4B value 4B max_entries 65536 memlock 1048896B
Yafang Shao [Sun, 5 Mar 2023 12:46:11 +0000 (12:46 +0000)]
bpf, net: bpf_local_storage memory usage
A new helper is introduced into bpf_local_storage map to calculate the
memory usage. This helper is also used by other maps like
bpf_cgrp_storage, bpf_inode_storage, bpf_task_storage and etc.
Note that currently the dynamically allocated storage elements are not
counted in the usage, since it will take extra runtime overhead in the
elements update or delete path. So let's put it aside now, and implement
it in the future when someone really need it.
Yafang Shao [Sun, 5 Mar 2023 12:46:10 +0000 (12:46 +0000)]
bpf: local_storage memory usage
A new helper is introduced to calculate local_storage map memory usage.
Currently the dynamically allocated elements are not counted, since it
will take runtime overhead in the element update or delete path. So
let's put it aside currently, and implement it in the future if the user
really needs it.
Yafang Shao [Sun, 5 Mar 2023 12:46:07 +0000 (12:46 +0000)]
bpf: devmap memory usage
A new helper is introduced to calculate the memory usage of devmap and
devmap_hash. The number of dynamically allocated elements are recored
for devmap_hash already, but not for devmap. To track the memory size of
dynamically allocated elements, this patch also count the numbers for
devmap.
The result as follows,
- before
40: devmap name count_map flags 0x80
key 4B value 4B max_entries 65536 memlock 524288B
41: devmap_hash name count_map flags 0x80
key 4B value 4B max_entries 65536 memlock 524288B
- after
40: devmap name count_map flags 0x80 <<<< no elements
key 4B value 4B max_entries 65536 memlock 524608B
41: devmap_hash name count_map flags 0x80 <<<< no elements
key 4B value 4B max_entries 65536 memlock 524608B
Note that the number of buckets is same with max_entries for devmap_hash
in this case.
Yafang Shao [Sun, 5 Mar 2023 12:46:06 +0000 (12:46 +0000)]
bpf: cpumap memory usage
A new helper is introduced to calculate cpumap memory usage. The size of
cpu_entries can be dynamically changed when we update or delete a cpumap
element, but this patch doesn't include the memory size of cpu_entry
yet. We can dynamically calculate the memory usage when we alloc or free
a cpu_entry, but it will take extra runtime overhead, so let just put it
aside currently. Note that the size of different cpu_entry may be
different as well.
The result as follows,
- before
48: cpumap name count_map flags 0x4
key 4B value 4B max_entries 64 memlock 4096B
- after
48: cpumap name count_map flags 0x4
key 4B value 4B max_entries 64 memlock 832B
Yafang Shao [Sun, 5 Mar 2023 12:46:02 +0000 (12:46 +0000)]
bpf: stackmap memory usage
A new helper is introduced to get stackmap memory usage. Some small
memory allocations are ignored as their memory size is quite small
compared to the totol usage.
The result as follows,
- before
16: stack_trace name count_map flags 0x0
key 4B value 8B max_entries 65536 memlock 1048576B
- after
16: stack_trace name count_map flags 0x0
key 4B value 8B max_entries 65536 memlock 2097472B
Yafang Shao [Sun, 5 Mar 2023 12:46:01 +0000 (12:46 +0000)]
bpf: arraymap memory usage
Introduce array_map_mem_usage() to calculate arraymap memory usage. In
this helper, some small memory allocations are ignored, like the
allocation of struct bpf_array_aux in prog_array. The inner_map_meta in
array_of_map is also ignored.
The result as follows,
- before
11: array name count_map flags 0x0
key 4B value 4B max_entries 65536 memlock 524288B
12: percpu_array name count_map flags 0x0
key 4B value 4B max_entries 65536 memlock 8912896B
13: perf_event_array name count_map flags 0x0
key 4B value 4B max_entries 65536 memlock 524288B
14: prog_array name count_map flags 0x0
key 4B value 4B max_entries 65536 memlock 524288B
15: cgroup_array name count_map flags 0x0
key 4B value 4B max_entries 65536 memlock 524288B
- after
11: array name count_map flags 0x0
key 4B value 4B max_entries 65536 memlock 524608B
12: percpu_array name count_map flags 0x0
key 4B value 4B max_entries 65536 memlock 17301824B
13: perf_event_array name count_map flags 0x0
key 4B value 4B max_entries 65536 memlock 524608B
14: prog_array name count_map flags 0x0
key 4B value 4B max_entries 65536 memlock 524608B
15: cgroup_array name count_map flags 0x0
key 4B value 4B max_entries 65536 memlock 524608B
Yafang Shao [Sun, 5 Mar 2023 12:46:00 +0000 (12:46 +0000)]
bpf: hashtab memory usage
htab_map_mem_usage() is introduced to calculate hashmap memory usage. In
this helper, some small memory allocations are ignore, as their size is
quite small compared with the total size. The inner_map_meta in
hash_of_map is also ignored.
The result for hashtab as follows,
- before this change
1: hash name count_map flags 0x1 <<<< no prealloc, fully set
key 16B value 24B max_entries 1048576 memlock 41943040B
2: hash name count_map flags 0x1 <<<< no prealloc, none set
key 16B value 24B max_entries 1048576 memlock 41943040B
3: hash name count_map flags 0x0 <<<< prealloc
key 16B value 24B max_entries 1048576 memlock 41943040B
The memlock is always a fixed size whatever it is preallocated or
not, and whatever the count of allocated elements is.
- after this change
1: hash name count_map flags 0x1 <<<< non prealloc, fully set
key 16B value 24B max_entries 1048576 memlock 117441536B
2: hash name count_map flags 0x1 <<<< non prealloc, non set
key 16B value 24B max_entries 1048576 memlock 16778240B
3: hash name count_map flags 0x0 <<<< prealloc
key 16B value 24B max_entries 1048576 memlock 109056000B
The memlock now is hashtab actually allocated.
The result for percpu hash map as follows,
- before this change
4: percpu_hash name count_map flags 0x0 <<<< prealloc
key 16B value 24B max_entries 1048576 memlock 822083584B
5: percpu_hash name count_map flags 0x1 <<<< no prealloc
key 16B value 24B max_entries 1048576 memlock 822083584B
- after this change
4: percpu_hash name count_map flags 0x0
key 16B value 24B max_entries 1048576 memlock 897582080B
5: percpu_hash name count_map flags 0x1
key 16B value 24B max_entries 1048576 memlock 922748736B
At worst, the difference can be 10x, for example,
- before this change
6: hash name count_map flags 0x0
key 4B value 4B max_entries 1048576 memlock 8388608B
- after this change
6: hash name count_map flags 0x0
key 4B value 4B max_entries 1048576 memlock 83889408B
bpf: Increase size of BTF_ID_LIST without CONFIG_DEBUG_INFO_BTF again
After commit 66e3a13e7c2c ("bpf: Add bpf_dynptr_slice and bpf_dynptr_slice_rdwr"),
clang builds without CONFIG_DEBUG_INFO_BTF warn:
kernel/bpf/verifier.c:10298:24: warning: array index 16 is past the end of the array (that has type 'u32[16]' (aka 'unsigned int[16]')) [-Warray-bounds]
meta.func_id == special_kfunc_list[KF_bpf_dynptr_slice_rdwr]) {
^ ~~~~~~~~~~~~~~~~~~~~~~~~
kernel/bpf/verifier.c:9150:1: note: array 'special_kfunc_list' declared here
BTF_ID_LIST(special_kfunc_list)
^
include/linux/btf_ids.h:207:27: note: expanded from macro 'BTF_ID_LIST'
#define BTF_ID_LIST(name) static u32 __maybe_unused name[16];
^
1 warning generated.
A warning of this nature was previously addressed by
commit beb3d47d1d3d ("bpf: Fix a BTF_ID_LIST bug with CONFIG_DEBUG_INFO_BTF not set")
but there have been new kfuncs added since then.
Quadruple the size of the CONFIG_DEBUG_INFO_BTF=n definition so that
this problem is unlikely to show up for some time.
We've added 85 non-merge commits during the last 13 day(s) which contain
a total of 131 files changed, 7102 insertions(+), 1792 deletions(-).
The main changes are:
1) Add skb and XDP typed dynptrs which allow BPF programs for more
ergonomic and less brittle iteration through data and variable-sized
accesses, from Joanne Koong.
2) Bigger batch of BPF verifier improvements to prepare for upcoming BPF
open-coded iterators allowing for less restrictive looping capabilities,
from Andrii Nakryiko.
3) Rework RCU enforcement in the verifier, add kptr_rcu and enforce BPF
programs to NULL-check before passing such pointers into kfunc,
from Alexei Starovoitov.
4) Add support for kptrs in percpu hashmaps, percpu LRU hashmaps and in
local storage maps, from Kumar Kartikeya Dwivedi.
5) Add BPF verifier support for ST instructions in convert_ctx_access()
which will help new -mcpu=v4 clang flag to start emitting them,
from Eduard Zingerman.
6) Make uprobe attachment Android APK aware by supporting attachment
to functions inside ELF objects contained in APKs via function names,
from Daniel Müller.
7) Add a new flag BPF_F_TIMER_ABS flag for bpf_timer_start() helper
to start the timer with absolute expiration value instead of relative
one, from Tero Kristo.
8) Add a new kfunc bpf_cgroup_from_id() to look up cgroups via id,
from Tejun Heo.
9) Extend libbpf to support users manually attaching kprobes/uprobes
in the legacy/perf/link mode, from Menglong Dong.
10) Implement workarounds in the mips BPF JIT for DADDI/R4000,
from Jiaxun Yang.
11) Enable mixing bpf2bpf and tailcalls for the loongarch BPF JIT,
from Hengqi Chen.
12) Extend BPF instruction set doc with describing the encoding of BPF
instructions in terms of how bytes are stored under big/little endian,
from Jose E. Marchesi.
13) Follow-up to enable kfunc support for riscv BPF JIT, from Pu Lehui.
14) Fix bpf_xdp_query() backwards compatibility on old kernels,
from Yonghong Song.
15) Fix BPF selftest cross compilation with CLANG_CROSS_FLAGS,
from Florent Revest.
16) Improve bpf_cpumask_ma to only allocate one bpf_mem_cache,
from Hou Tao.
17) Fix BPF verifier's check_subprogs to not unnecessarily mark
a subprogram with has_tail_call, from Ilya Leoshkevich.
18) Fix arm syscall regs spec in libbpf's bpf_tracing.h, from Puranjay Mohan.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (85 commits)
selftests/bpf: Add test for legacy/perf kprobe/uprobe attach mode
selftests/bpf: Split test_attach_probe into multi subtests
libbpf: Add support to set kprobe/uprobe attach mode
tools/resolve_btfids: Add /libsubcmd to .gitignore
bpf: add support for fixed-size memory pointer returns for kfuncs
bpf: generalize dynptr_get_spi to be usable for iters
bpf: mark PTR_TO_MEM as non-null register type
bpf: move kfunc_call_arg_meta higher in the file
bpf: ensure that r0 is marked scratched after any function call
bpf: fix visit_insn()'s detection of BPF_FUNC_timer_set_callback helper
bpf: clean up visit_insn()'s instruction processing
selftests/bpf: adjust log_fixup's buffer size for proper truncation
bpf: honor env->test_state_freq flag in is_state_visited()
selftests/bpf: enhance align selftest's expected log matching
bpf: improve regsafe() checks for PTR_TO_{MEM,BUF,TP_BUFFER}
bpf: improve stack slot state printing
selftests/bpf: Disassembler tests for verifier.c:convert_ctx_access()
selftests/bpf: test if pointer type is tracked for BPF_ST_MEM
bpf: allow ctx writes using BPF_ST_MEM instruction
bpf: Use separate RCU callbacks for freeing selem
...
====================
Andrii Nakryiko [Mon, 6 Mar 2023 17:30:21 +0000 (09:30 -0800)]
Merge branch 'libbpf: allow users to set kprobe/uprobe attach mode'
Menglong Dong says:
====================
From: Menglong Dong <imagedong@tencent.com>
By default, libbpf will attach the kprobe/uprobe BPF program in the
latest mode that supported by kernel. In this series, we add the support
to let users manually attach kprobe/uprobe in legacy/perf/link mode in
the 1th patch.
And in the 2th patch, we split the testing 'attach_probe' into multi
subtests, as Andrii suggested.
In the 3th patch, we add the testings for loading kprobe/uprobe in
different mode.
Changes since v3:
- rename eBPF to BPF in the doc
- use OPTS_GET() to get the value of 'force_ioctl_attach'
- error out on attach mode is not supported
- use test_attach_probe_manual__open_and_load() directly
Changes since v2:
- fix the typo in the 2th patch
Changes since v1:
- some small changes in the 1th patch, as Andrii suggested
- split 'attach_probe' into multi subtests
====================
Menglong Dong [Mon, 6 Mar 2023 06:48:32 +0000 (14:48 +0800)]
selftests/bpf: Split test_attach_probe into multi subtests
In order to adapt to the older kernel, now we split the "attach_probe"
testing into multi subtests:
manual // manual attach tests for kprobe/uprobe
auto // auto-attach tests for kprobe and uprobe
kprobe-sleepable // kprobe sleepable test
uprobe-lib // uprobe tests for library function by name
uprobe-sleepable // uprobe sleepable test
uprobe-ref_ctr // uprobe ref_ctr test
As sleepable kprobe needs to set BPF_F_SLEEPABLE flag before loading,
we need to move it to a stand alone skel file, in case of it is not
supported by kernel and make the whole loading fail.
Therefore, we can only enable part of the subtests for older kernel.
Menglong Dong [Mon, 6 Mar 2023 06:48:31 +0000 (14:48 +0800)]
libbpf: Add support to set kprobe/uprobe attach mode
By default, libbpf will attach the kprobe/uprobe BPF program in the
latest mode that supported by kernel. In this patch, we add the support
to let users manually attach kprobe/uprobe in legacy or perf mode.
There are 3 mode that supported by the kernel to attach kprobe/uprobe:
LEGACY: create perf event in legacy way and don't use bpf_link
PERF: create perf event with perf_event_open() and don't use bpf_link
Andrii Nakryiko [Thu, 2 Mar 2023 23:50:10 +0000 (15:50 -0800)]
bpf: add support for fixed-size memory pointer returns for kfuncs
Support direct fixed-size (and for now, read-only) memory access when
kfunc's return type is a pointer to non-struct type. Calculate type size
and let BPF program access that many bytes directly. This is crucial for
numbers iterator.
Andrii Nakryiko [Thu, 2 Mar 2023 23:50:08 +0000 (15:50 -0800)]
bpf: mark PTR_TO_MEM as non-null register type
PTR_TO_MEM register without PTR_MAYBE_NULL is indeed non-null. This is
important for BPF verifier to be able to prune guaranteed not to be
taken branches. This is always the case with open-coded iterators.
Andrii Nakryiko [Thu, 2 Mar 2023 23:50:06 +0000 (15:50 -0800)]
bpf: ensure that r0 is marked scratched after any function call
r0 is important (unless called function is void-returning, but that's
taken care of by print_verifier_state() anyways) in verifier logs.
Currently for helpers we seem to print it in verifier log, but for
kfuncs we don't.
Instead of figuring out where in the maze of code we accidentally set r0
as scratched for helpers and why we don't do that for kfuncs, just
enforce that after any function call r0 is marked as scratched.
Also, perhaps, we should reconsider "scratched" terminology, as it's
mightily confusing. "Touched" would seem more appropriate. But I left
that for follow ups for now.
Andrii Nakryiko [Thu, 2 Mar 2023 23:50:05 +0000 (15:50 -0800)]
bpf: fix visit_insn()'s detection of BPF_FUNC_timer_set_callback helper
It's not correct to assume that any BPF_CALL instruction is a helper
call. Fix visit_insn()'s detection of bpf_timer_set_callback() helper by
also checking insn->code == 0. For kfuncs insn->code would be set to
BPF_PSEUDO_KFUNC_CALL, and for subprog calls it will be BPF_PSEUDO_CALL.
Andrii Nakryiko [Thu, 2 Mar 2023 23:50:04 +0000 (15:50 -0800)]
bpf: clean up visit_insn()'s instruction processing
Instead of referencing processed instruction repeatedly as insns[t]
throughout entire visit_insn() function, take a local insn pointer and
work with it in a cleaner way.
It makes enhancing this function further a bit easier as well.
Andrii Nakryiko [Thu, 2 Mar 2023 23:50:03 +0000 (15:50 -0800)]
selftests/bpf: adjust log_fixup's buffer size for proper truncation
Adjust log_fixup's expected buffer length to fix the test. It's pretty
finicky in its length expectation, but it doesn't break often. So just
adjust the length to work on current kernel and with follow up iterator
changes as well.
Andrii Nakryiko [Thu, 2 Mar 2023 23:50:02 +0000 (15:50 -0800)]
bpf: honor env->test_state_freq flag in is_state_visited()
env->test_state_freq flag can be set by user by passing
BPF_F_TEST_STATE_FREQ program flag. This is used in a bunch of selftests
to have predictable state checkpoints at every jump and so on.
Currently, bounded loop handling heuristic ignores this flag if number
of processed jumps and/or number of processed instructions is below some
thresholds, which throws off that reliable state checkpointing.
Honor this flag in all circumstances by disabling heuristic if
env->test_state_freq is set.
Allow to search for expected register state in all the verifier log
output that's related to specified instruction number.
See added comment for an example of possible situation that is happening
due to a simple enhancement done in the next patch, which fixes handling
of env->test_state_freq flag in state checkpointing logic.
Andrii Nakryiko [Thu, 2 Mar 2023 23:50:00 +0000 (15:50 -0800)]
bpf: improve regsafe() checks for PTR_TO_{MEM,BUF,TP_BUFFER}
Teach regsafe() logic to handle PTR_TO_MEM, PTR_TO_BUF, and
PTR_TO_TP_BUFFER similarly to PTR_TO_MAP_{KEY,VALUE}. That is, instead of
exact match for var_off and range, use tnum_in() and range_within()
checks, allowing more general verified state to subsume more specific
current state. This allows to match wider range of valid and safe
states, speeding up verification and detecting wider range of equivalent
states for upcoming open-coded iteration looping logic.
Merge branch 'bpf: allow ctx writes using BPF_ST_MEM instruction'
Eduard Zingerman says:
====================
Changes v1 -> v2, suggested by Alexei:
- Resolved conflict with recent commit: 6fcd486b3a0a ("bpf: Refactor RCU enforcement in the verifier");
- Variable `ctx_access` removed in function `convert_ctx_accesses()`;
- Macro `BPF_COPY_STORE` renamed to `BPF_EMIT_STORE` and fixed to
correctly extract original store instruction class from code.
Original message follows:
The function verifier.c:convert_ctx_access() applies some rewrites to BPF
instructions that read from or write to the BPF program context.
For example, the write instruction for the `struct bpf_sockopt::retval`
field:
Currently, the verifier only supports such transformations for LDX
(memory-to-register read) and STX (register-to-memory write) instructions.
Error is reported for ST instructions (immediate-to-memory write).
This is fine because clang does not currently emit ST instructions.
However, new `-mcpu=v4` clang flag is planned, which would allow to emit
ST instructions (discussed in [1]).
This patch-set adjusts the verifier to support ST instructions in
`verifier.c:convert_ctx_access()`.
The patches #1 and #2 were previously shared as part of RFC [2]. The
changes compared to that RFC are:
- In patch #1, a bug in the handling of the
`struct __sk_buff::queue_mapping` field was fixed.
- Patch #3 is added, which is a set of disassembler-based test cases for
context access rewrites. The test cases cover all fields for which the
handling code is modified in patch #1.
[1] Propose some new instructions for -mcpu=v4
https://lore.kernel.org/bpf/4bfe98be-5333-1c7e-2f6d-42486c8ec039@meta.com/
[2] RFC Support for BPF_ST instruction in LLVM C compiler
https://lore.kernel.org/bpf/20221231163122.1360813-1-eddyz87@gmail.com/
[3] v1
https://lore.kernel.org/bpf/20230302225507.3413720-1-eddyz87@gmail.com/
====================
selftests/bpf: Disassembler tests for verifier.c:convert_ctx_access()
Function verifier.c:convert_ctx_access() applies some rewrites to BPF
instructions that read or write BPF program context. This commit adds
machinery to allow test cases that inspect BPF program after these
rewrites are applied.
An example of a test case:
{
// Shorthand for field offset and size specification
N(CGROUP_SOCKOPT, struct bpf_sockopt, retval),
For each test case, up to three programs are created:
- One that uses BPF_LDX_MEM to read the context field.
- One that uses BPF_STX_MEM to write to the context field.
- One that uses BPF_ST_MEM to write to the context field.
The disassembly of each program is compared with the pattern specified
in the test case.
Kernel code for disassembly is reused (as is in the bpftool).
To keep Makefile changes to the minimum, symbolic links to
`kernel/bpf/disasm.c` and `kernel/bpf/disasm.h ` are added.
bpf: allow ctx writes using BPF_ST_MEM instruction
Lift verifier restriction to use BPF_ST_MEM instructions to write to
context data structures. This requires the following changes:
- verifier.c:do_check() for BPF_ST updated to:
- no longer forbid writes to registers of type PTR_TO_CTX;
- track dst_reg type in the env->insn_aux_data[...].ptr_type field
(same way it is done for BPF_STX and BPF_LDX instructions).
- verifier.c:convert_ctx_access() and various callbacks invoked by
it are updated to handled BPF_ST instruction alongside BPF_STX.
Martin suggested that instead of using a byte in the hole (which he has
a use for in his future patch) in bpf_local_storage_elem, we can
dispatch a different call_rcu callback based on whether we need to free
special fields in bpf_local_storage_elem data. The free path, described
in commit 9db44fdd8105 ("bpf: Support kptrs in local storage maps"),
only waits for call_rcu callbacks when there are special (kptrs, etc.)
fields in the map value, hence it is necessary that we only access
smap in this case.
Therefore, dispatch different RCU callbacks based on the BPF map has a
valid btf_record, which dereference and use smap's btf_record only when
it is valid.
Daniel Borkmann [Fri, 3 Mar 2023 16:42:20 +0000 (17:42 +0100)]
Merge branch 'bpf-kptr-rcu'
Alexei Starovoitov says:
====================
v4->v5:
fix typos, add acks.
v3->v4:
- patch 3 got much cleaner after BPF_KPTR_RCU was removed as suggested by David.
- make KF_RCU stronger and require that bpf program checks for NULL
before passing such pointers into kfunc. The prog has to do that anyway
to access fields and it aligns with BTF_TYPE_SAFE_RCU allowlist.
- New patch 6: refactor RCU enforcement in the verifier.
The patches 2,3,6 are part of one feature.
The 2 and 3 alone are incomplete, since RCU pointers are barely useful
without bpf_rcu_read_lock/unlock in GCC compiled kernel.
Even if GCC lands support for btf_type_tag today it will take time
to mandate that version for kernel builds. Hence go with allow list
approach. See patch 6 for details.
This allows to start strict enforcement of TRUSTED | UNTRUSTED
in one part of PTR_TO_BTF_ID accesses.
One step closer to KF_TRUSTED_ARGS by default.
v2->v3:
- Instead of requiring bpf progs to tag fields with __kptr_rcu
teach the verifier to infer RCU properties based on the type.
BPF_KPTR_RCU becomes kernel internal type of struct btf_field.
- Add patch 2 to tag cgroups and dfl_cgrp as trusted.
That bug was spotted by BPF CI on clang compiler kernels,
since patch 3 is doing:
static bool in_rcu_cs(struct bpf_verifier_env *env)
{
return env->cur_state->active_rcu_lock || !env->prog->aux->sleepable;
}
which makes all non-sleepable programs behave like they have implicit
rcu_read_lock around them. Which is the case in practice.
It was fine on gcc compiled kernels where task->cgroup deference was producing
PTR_TO_BTF_ID, but on clang compiled kernels task->cgroup deference was
producing PTR_TO_BTF_ID | MEM_RCU | MAYBE_NULL, which is more correct,
but selftests were failing. Patch 2 fixes this discrepancy.
With few more patches like patch 2 we can make KF_TRUSTED_ARGS default
for kfuncs and helpers.
- Add comment in selftest patch 5 that it's verifier only check.
v1->v2:
Instead of agressively allow dereferenced kptr_rcu pointers into KF_TRUSTED_ARGS
kfuncs only allow them into KF_RCU funcs.
The KF_RCU flag is a weaker version of KF_TRUSTED_ARGS. The kfuncs marked with
KF_RCU expect either PTR_TRUSTED or MEM_RCU arguments. The verifier guarantees
that the objects are valid and there is no use-after-free, but the pointers
maybe NULL and pointee object's reference count could have reached zero, hence
kfuncs must do != NULL check and consider refcnt==0 case when accessing such
arguments.
No changes in patch 1.
Patches 2,3,4 adjusted with above behavior.
v1:
The __kptr_ref turned out to be too limited, since any "trusted" pointer access
requires bpf_kptr_xchg() which is impractical when the same pointer needs
to be dereferenced by multiple cpus.
The __kptr "untrusted" only access isn't very useful in practice.
Rename __kptr to __kptr_untrusted with eventual goal to deprecate it,
and rename __kptr_ref to __kptr, since that looks to be more common use of kptrs.
Introduce __kptr_rcu that can be directly dereferenced and used similar
to native kernel C code.
Once bpf_cpumask and task_struct kfuncs are converted to observe RCU GP
when refcnt goes to zero, both __kptr and __kptr_untrusted can be deprecated
and __kptr_rcu can become the only __kptr tag.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
bpf_rcu_read_lock/unlock() are only available in clang compiled kernels. Lack
of such key mechanism makes it impossible for sleepable bpf programs to use RCU
pointers.
Allow bpf_rcu_read_lock/unlock() in GCC compiled kernels (though GCC doesn't
support btf_type_tag yet) and allowlist certain field dereferences in important
data structures like tast_struct, cgroup, socket that are used by sleepable
programs either as RCU pointer or full trusted pointer (which is valid outside
of RCU CS). Use BTF_TYPE_SAFE_RCU and BTF_TYPE_SAFE_TRUSTED macros for such
tagging. They will be removed once GCC supports btf_type_tag.
With that refactor check_ptr_to_btf_access(). Make it strict in enforcing
PTR_TRUSTED and PTR_UNTRUSTED while deprecating old PTR_TO_BTF_ID without
modifier flags. There is a chance that this strict enforcement might break
existing programs (especially on GCC compiled kernels), but this cleanup has to
start sooner than later. Note PTR_TO_CTX access still yields old deprecated
PTR_TO_BTF_ID. Once it's converted to strict PTR_TRUSTED or PTR_UNTRUSTED the
kfuncs and helpers will be able to default to KF_TRUSTED_ARGS. KF_RCU will
remain as a weaker version of KF_TRUSTED_ARGS where obj refcnt could be 0.
Adjust rcu_read_lock selftest to run on gcc and clang compiled kernels.
The life time of certain kernel structures like 'struct cgroup' is protected by RCU.
Hence it's safe to dereference them directly from __kptr tagged pointers in bpf maps.
The resulting pointer is MEM_RCU and can be passed to kfuncs that expect KF_RCU.
Derefrence of other kptr-s returns PTR_UNTRUSTED.
For example:
struct map_value {
struct cgroup __kptr *cgrp;
};
cg = bpf_cgroup_acquire(cgrp_arg); // cg is PTR_TRUSTED and ref_obj_id > 0
bpf_kptr_xchg(&v->cgrp, cg);
cg2 = v->cgrp; // This is new feature introduced by this patch.
// cg2 is PTR_MAYBE_NULL | MEM_RCU.
// When cg2 != NULL, it's a valid cgroup, but its percpu_ref could be zero
if (cg2)
bpf_cgroup_ancestor(cg2, level); // safe to do.
}
bpf programs sometimes do:
bpf_cgrp_storage_get(&map, task->cgroups->dfl_cgrp, ...);
It is safe to do, because cgroups->dfl_cgrp pointer is set diring init and
never changes. The task->cgroups is also never NULL. It is also set during init
and will change when task switches cgroups. For any trusted task pointer
dereference of cgroups and dfl_cgrp should yield trusted pointers. The verifier
wasn't aware of this. Hence in gcc compiled kernels task->cgroups dereference
was producing PTR_TO_BTF_ID without modifiers while in clang compiled kernels
the verifier recognizes __rcu tag in cgroups field and produces
PTR_TO_BTF_ID | MEM_RCU | MAYBE_NULL.
Tag cgroups and dfl_cgrp as trusted to equalize clang and gcc behavior.
When GCC supports btf_type_tag such tagging will done directly in the type.
bpf: Rename __kptr_ref -> __kptr and __kptr -> __kptr_untrusted.
__kptr meant to store PTR_UNTRUSTED kernel pointers inside bpf maps.
The concept felt useful, but didn't get much traction,
since bpf_rdonly_cast() was added soon after and bpf programs received
a simpler way to access PTR_UNTRUSTED kernel pointers
without going through restrictive __kptr usage.
Rename __kptr_ref -> __kptr and __kptr -> __kptr_untrusted to indicate
its intended usage.
The main goal of __kptr_untrusted was to read/write such pointers
directly while bpf_kptr_xchg was a mechanism to access refcnted
kernel pointers. The next patch will allow RCU protected __kptr access
with direct read. At that point __kptr_untrusted will be deprecated.
Tero Kristo [Thu, 2 Mar 2023 11:46:14 +0000 (13:46 +0200)]
selftests/bpf: Add absolute timer test
Add test for the absolute BPF timer under the existing timer tests. This
will run the timer two times with 1us expiration time, and then re-arm
the timer at ~35s in the future. At the end, it is verified that the
absolute timer expired exactly two times.
Tero Kristo [Thu, 2 Mar 2023 11:46:13 +0000 (13:46 +0200)]
bpf: Add support for absolute value BPF timers
Add a new flag BPF_F_TIMER_ABS that can be passed to bpf_timer_start()
to start an absolute value timer instead of the default relative value.
This makes the timer expire at an exact point in time, instead of a time
with latencies induced by both the BPF and timer subsystems.
Dave Marchevsky [Fri, 3 Mar 2023 00:55:00 +0000 (16:55 -0800)]
selftests/bpf: Add -Wuninitialized flag to bpf prog flags
Per C99 standard [0], Section 6.7.8, Paragraph 10:
If an object that has automatic storage duration is not initialized
explicitly, its value is indeterminate.
And in the same document, in appendix "J.2 Undefined behavior":
The behavior is undefined in the following circumstances:
[...]
The value of an object with automatic storage duration is used while
it is indeterminate (6.2.4, 6.7.8, 6.8).
This means that use of an uninitialized stack variable is undefined
behavior, and therefore that clang can choose to do a variety of scary
things, such as not generating bytecode for "bunch of useful code" in
the below example:
void some_func()
{
int i;
if (!i)
return;
// bunch of useful code
}
To add insult to injury, if some_func above is a helper function for
some BPF program, clang can choose to not generate an "exit" insn,
causing verifier to fail with "last insn is not an exit or jmp". Going
from that verification failure to the root cause of uninitialized use
is certain to be frustrating.
This patch adds -Wuninitialized to the cflags for selftest BPF progs and
fixes up existing instances of uninitialized use.
Tejun Heo [Thu, 2 Mar 2023 19:42:59 +0000 (09:42 -1000)]
bpf: Make bpf_get_current_[ancestor_]cgroup_id() available for all program types
These helpers are safe to call from any context and there's no reason to
restrict access to them. Remove them from bpf_trace and filter lists and add
to bpf_base_func_proto() under perfmon_capable().
v2: After consulting with Andrii, relocated in bpf_base_func_proto() so that
they require bpf_capable() but not perfomon_capable() as it doesn't read
from or affect others on the system.
David Vernet [Thu, 2 Mar 2023 18:39:18 +0000 (12:39 -0600)]
bpf, docs: Fix final bpf docs build failure
maps.rst in the BPF documentation links to the
/userspace-api/ebpf/syscall document
(Documentation/userspace-api/ebpf/syscall.rst). For some reason, if you
try to reference the document with :doc:, the docs build emits the
following warning:
It appears that other places in the docs tree also don't support using
:doc:. Elsewhere in the BPF documentation, we just reference the kernel
docs page directly. Let's do that here to clean up the last remaining
noise in the docs build.
David Vernet [Thu, 2 Mar 2023 18:39:17 +0000 (12:39 -0600)]
bpf, docs: Fix link to netdev-FAQ target
The BPF devel Q&A documentation page makes frequent reference to the
netdev-QA page via the netdev-FAQ rst link. This link is currently
broken, as is evidenced by the build output when making BPF docs:
Andrii Nakryiko [Thu, 2 Mar 2023 00:05:35 +0000 (16:05 -0800)]
Merge branch 'Make uprobe attachment APK aware'
Daniel Müller says:
====================
On Android, APKs (android packages; zip packages with somewhat
prescriptive contents) are first class citizens in the system: the
shared objects contained in them don't exist in unpacked form on the
file system. Rather, they are mmaped directly from within the archive
and the archive is also what the kernel is aware of.
For users that complicates the process of attaching a uprobe to a
function contained in a shared object in one such APK: they'd have to
find the byte offset of said function from the beginning of the archive.
That is cumbersome to do manually and can be fragile, because various
changes could invalidate said offset.
That is why for uprobes inside ELF files (not inside an APK), commit d112c9ce249b ("libbpf: Support function name-based attach uprobes") added
support for attaching to symbols by name. On Android, that mechanism
currently does not work, because this logic is not APK aware.
This patch set introduces first class support for attaching uprobes to
functions inside ELF objects contained in APKs via function names. We
add support for recognizing the following syntax for a binary path:
<archive>!/<binary-in-archive>
This syntax is common in the Android eco system and used by tools such
as simpleperf. It is also what is being proposed for bcc [0].
If the user provides such a binary path, we find <binary-in-archive>
(lib/arm64-v8a/libc++.so in the example) inside of <archive>
(/system/app/test-app.apk). We perform the regular ELF offset search
inside the binary and add that to the offset within the archive itself,
to retrieve the offset at which to attach the uprobe.
[0] https://github.com/iovisor/bcc/pull/4440
Changelog
---------
v3->v4:
- use ERR_PTR instead of libbpf_err_ptr() in zip_archive_open()
- eliminated err variable from elf_find_func_offset_from_archive()
v2->v3:
- adjusted zip_archive_open() to report errno
- fixed provided libbpf_strlcpy() buffer size argument
- adjusted find_cd() to handle errors better
- use fewer local variables in get_entry_at_offset()
v1->v2:
- removed unaligned_* types
- switched to using __u32 and __u16
- switched to using errno constants instead of hard-coded negative values
- added another pr_debug() message
- shortened central_directory_* to cd_*
- inlined cd_file_header_at_offset() function
- bunch of syntactical changes
====================
Daniel Müller [Wed, 1 Mar 2023 21:23:08 +0000 (21:23 +0000)]
libbpf: Add support for attaching uprobes to shared objects in APKs
This change adds support for attaching uprobes to shared objects located
in APKs, which is relevant for Android systems where various libraries
may reside in APKs. To make that happen, we extend the syntax for the
"binary path" argument to attach to with that supported by various
Android tools:
<archive>!/<binary-in-archive>
For example:
/system/app/test-app/test-app.apk!/lib/arm64-v8a/libc++_shared.so
APKs need to be specified via full path, i.e., we do not attempt to
resolve mere file names by searching system directories.
We cannot currently test this functionality end-to-end in an automated
fashion, because it relies on an Android system being present, but there
is no support for that in CI. I have tested the functionality manually,
by creating a libbpf program containing a uretprobe, attaching it to a
function inside a shared object inside an APK, and verifying the sanity
of the returned values.
Daniel Müller [Wed, 1 Mar 2023 21:23:07 +0000 (21:23 +0000)]
libbpf: Introduce elf_find_func_offset_from_file() function
This change splits the elf_find_func_offset() function in two:
elf_find_func_offset(), which now accepts an already opened Elf object
instead of a path to a file that is to be opened, as well as
elf_find_func_offset_from_file(), which opens a binary based on a
path and then invokes elf_find_func_offset() on the Elf object. Having
this split in responsibilities will allow us to call
elf_find_func_offset() from other code paths on Elf objects that did not
necessarily come from a file on disk.
Daniel Müller [Wed, 1 Mar 2023 21:23:06 +0000 (21:23 +0000)]
libbpf: Implement basic zip archive parsing support
This change implements support for reading zip archives, including
opening an archive, finding an entry based on its path and name in it,
and closing it.
The code was copied from https://github.com/iovisor/bcc/pull/4440, which
implements similar functionality for bcc. The author confirmed that he
is fine with this usage and the corresponding relicensing. I adjusted it
to adhere to libbpf coding standards.
David Vernet [Wed, 1 Mar 2023 19:49:10 +0000 (13:49 -0600)]
bpf, docs: Fix __uninit kfunc doc section
In commit d96d937d7c5c ("bpf: Add __uninit kfunc annotation"), the
__uninit kfunc annotation was documented in kfuncs.rst. You have to
fully underline a section in rst, or the build will issue a warning that
the title underline is too short:
./Documentation/bpf/kfuncs.rst:104: WARNING: Title underline too short.
David Vernet [Wed, 1 Mar 2023 19:49:09 +0000 (13:49 -0600)]
bpf: Fix doxygen comments for dynptr slice kfuncs
In commit 66e3a13e7c2c ("bpf: Add bpf_dynptr_slice and
bpf_dynptr_slice_rdwr"), the bpf_dynptr_slice() and
bpf_dynptr_slice_rdwr() kfuncs were added to BPF. These kfuncs included
doxygen headers, but unfortunately those headers are not properly
formatted according to [0], and causes the following warnings during the
docs build:
./kernel/bpf/helpers.c:2225: warning: \
Excess function parameter 'returns' description in 'bpf_dynptr_slice'
./kernel/bpf/helpers.c:2303: warning: \
Excess function parameter 'returns' description in 'bpf_dynptr_slice_rdwr'
...
It was developed by Andrii Nakryiko ([1]), I reused it in a
"test_verifier tests migration to inline assembly" patch series ([2]),
but the series is currently stuck on my side.
Andrii asked to spin this particular patch separately ([3]).
Andrii Nakryiko [Wed, 1 Mar 2023 17:54:17 +0000 (19:54 +0200)]
selftests/bpf: Support custom per-test flags and multiple expected messages
Extend __flag attribute by allowing to specify one of the following:
* BPF_F_STRICT_ALIGNMENT
* BPF_F_ANY_ALIGNMENT
* BPF_F_TEST_RND_HI32
* BPF_F_TEST_STATE_FREQ
* BPF_F_SLEEPABLE
* BPF_F_XDP_HAS_FRAGS
* Some numeric value
Extend __msg attribute by allowing to specify multiple exepcted messages.
All messages are expected to be present in the verifier log in the
order of application.
Viktor Malik [Wed, 1 Mar 2023 08:53:54 +0000 (09:53 +0100)]
libbpf: Remove several dead assignments
Clang Static Analyzer (scan-build) reports several dead assignments in
libbpf where the assigned value is unconditionally overridden by another
value before it is read. Remove these assignments.
Viktor Malik [Wed, 1 Mar 2023 08:53:53 +0000 (09:53 +0100)]
libbpf: Remove unnecessary ternary operator
Coverity reports that the first check of 'err' in bpf_object__init_maps
is always false as 'err' is initialized to 0 at that point. Remove the
unnecessary ternary operator.
Merge branch 'Add support for kptrs in more BPF maps'
Kumar Kartikeya Dwivedi says:
====================
This set adds support for kptrs in percpu hashmaps, percpu LRU hashmaps,
and local storage maps (covering sk, cgrp, task, inode).
Tests are expanded to test more existing maps at runtime and also test
the code path for the local storage maps (which is shared by all
implementations).
A question for reviewers is what the position of the BPF runtime should
be on dealing with reference cycles that can be created by BPF programs
at runtime using this additional support. For instance, one can store
the kptr of the task in its own task local storage, creating a cycle
which prevents destruction of task local storage. Cycles can be formed
using arbitrarily long kptr ownership chains. Therefore, just preventing
storage of such kptrs in some maps is not a sufficient solution, and is
more likely to hurt usability.
There is precedence in existing runtimes which promise memory safety,
like Rust, where reference cycles and memory leaks are permitted.
However, traditionally the safety guarantees of BPF have been stronger.
Thus, more discussion and thought is invited on this topic to ensure we
cover all usage aspects.
* Fix a use-after-free bug in local storage patch
* Fix selftest for aarch64 (don't use fentry/fmod_ret)
* Wait for RCU Tasks Trace GP along with RCU GP in selftest
Firstly, ensure programs successfully load when using all of the
supported maps. Then, extend existing tests to test more cases at
runtime. We are currently testing both the synchronous freeing of items
and asynchronous destruction when map is freed, but the code needs to be
adjusted a bit to be able to also accomodate support for percpu maps.
We now do a delete on the item (and update for array maps which has a
similar effect for kptrs) to perform a synchronous free of the kptr, and
test destruction both for the synchronous and asynchronous deletion.
Next time the program runs, it should observe the refcount as 1 since
all existing references should have been released by then. By running
the program after both possible paths freeing kptrs, we establish that
they correctly release resources. Next, we augment the existing test to
also test the same code path shared by all local storage maps using a
task local storage map.
Enable support for kptrs in local storage maps by wiring up the freeing
of these kptrs from map value. Freeing of bpf_local_storage_map is only
delayed in case there are special fields, therefore bpf_selem_free_*
path can also only dereference smap safely in that case. This is
recorded using a bool utilizing a hole in bpF_local_storage_elem. It
could have been tagged in the pointer value smap using the lowest bit
(since alignment > 1), but since there was already a hole I went with
the simpler option. Only the map structure freeing is delayed using RCU
barriers, as the buckets aren't used when selem is being freed, so they
can be freed once all readers of the bucket lists can no longer access
it.
Cc: Martin KaFai Lau <martin.lau@kernel.org> Cc: KP Singh <kpsingh@kernel.org> Cc: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230225154010.391965-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patchset is the 2nd in the dynptr series. The 1st can be found here [0].
This patchset adds skb and xdp type dynptrs, which have two main benefits for
packet parsing:
* allowing operations on sizes that are not statically known at
compile-time (eg variable-sized accesses).
* more ergonomic and less brittle iteration through data (eg does not need
manual if checking for being within bounds of data_end)
When comparing the differences in runtime for packet parsing without dynptrs
vs. with dynptrs, there is no noticeable difference. Patch 9 contains more
details as well as examples of how to use skb and xdp dynptrs.
v12 = https://lore.kernel.org/bpf/20230226085120.3907863-1-joannelkoong@gmail.com/
v12 -> v13:
* Fix missing { } for case statement
v11 = https://lore.kernel.org/bpf/20230222060747.2562549-1-joannelkoong@gmail.com/
v11 -> v12:
* Change constant mem size checking to use "__szk" kfunc annotation
for slices
* Use autoloading for success selftests
v10 = https://lore.kernel.org/bpf/20230216225524.1192789-1-joannelkoong@gmail.com/
v10 -> v11:
* Reject bpf_dynptr_slice_rdwr() for non-writable progs at load time
instead of runtime
* Add additional patch (__uninit kfunc annotation)
* Expand on documentation
* Add bpf_dynptr_write() calls for persisting writes in tests
v9 = https://lore.kernel.org/bpf/20230127191703.3864860-1-joannelkoong@gmail.com/
v9 -> v10:
* Add bpf_dynptr_slice and bpf_dynptr_slice_rdwr interface
* Add some more tests
* Split up patchset into more parts to make it easier to review
v7 = https://lore.kernel.org/bpf/20221021011510.1890852-1-joannelkoong@gmail.com/
v7 -> v8:
* Change helpers to kfuncs
* Add 2 new patches (1/5 and 2/5)
v6 = https://lore.kernel.org/bpf/20220907183129.745846-1-joannelkoong@gmail.com/
v6 -> v7
* Change bpf_dynptr_data() to return read-only data slices if the skb prog
is read-only (Martin)
* Add test "skb_invalid_write" to test that writes to rd-only data slices
are rejected
v5 = https://lore.kernel.org/bpf/20220831183224.3754305-1-joannelkoong@gmail.com/
v5 -> v6
* Address kernel test robot errors by static inlining
v4 = https://lore.kernel.org/bpf/20220822235649.2218031-1-joannelkoong@gmail.com/
v4 -> v5
* Address kernel test robot errors for configs w/out CONFIG_NET set
* For data slices, return PTR_TO_MEM instead of PTR_TO_PACKET (Kumar)
* Split selftests into subtests (Andrii)
* Remove insn patching. Use rdonly and rdwr protos for dynptr skb
construction (Andrii)
* bpf_dynptr_data() returns NULL for rd-only dynptrs. There will be a
separate bpf_dynptr_data_rdonly() added later (Andrii and Kumar)
v3 = https://lore.kernel.org/bpf/20220822193442.657638-1-joannelkoong@gmail.com/
v3 -> v4
* Forgot to commit --amend the kernel test robot error fixups
v2 = https://lore.kernel.org/bpf/20220811230501.2632393-1-joannelkoong@gmail.com/
v2 -> v3
* Fix kernel test robot build test errors
v1 = https://lore.kernel.org/bpf/20220726184706.954822-1-joannelkoong@gmail.com/
v1 -> v2
* Return data slices to rd-only skb dynptrs (Martin)
* bpf_dynptr_write allows writes to frags for skb dynptrs, but always
invalidates associated data slices (Martin)
* Use switch casing instead of ifs (Andrii)
* Use 0xFD for experimental kind number in the selftest (Zvi)
* Put selftest conversions w/ dynptrs into new files (Alexei)
* Add new selftest "test_cls_redirect_dynptr.c"
====================
Joanne Koong [Wed, 1 Mar 2023 15:49:53 +0000 (07:49 -0800)]
selftests/bpf: tests for using dynptrs to parse skb and xdp buffers
Test skb and xdp dynptr functionality in the following ways:
1) progs/test_cls_redirect_dynptr.c
* Rewrite "progs/test_cls_redirect.c" test to use dynptrs to parse
skb data
* This is a great example of how dynptrs can be used to simplify a
lot of the parsing logic for non-statically known values.
When measuring the user + system time between the original version
vs. using dynptrs, and averaging the time for 10 runs (using
"time ./test_progs -t cls_redirect"):
original version: 0.092 sec
with dynptrs: 0.078 sec
2) progs/test_xdp_dynptr.c
* Rewrite "progs/test_xdp.c" test to use dynptrs to parse xdp data
When measuring the user + system time between the original version
vs. using dynptrs, and averaging the time for 10 runs (using
"time ./test_progs -t xdp_attach"):
original version: 0.118 sec
with dynptrs: 0.094 sec
3) progs/test_l4lb_noinline_dynptr.c
* Rewrite "progs/test_l4lb_noinline.c" test to use dynptrs to parse
skb data
When measuring the user + system time between the original version
vs. using dynptrs, and averaging the time for 10 runs (using
"time ./test_progs -t l4lb_all"):
original version: 0.062 sec
with dynptrs: 0.081 sec
For number of processed verifier instructions:
original version: 6268 insns
with dynptrs: 2588 insns
4) progs/test_parse_tcp_hdr_opt_dynptr.c
* Add sample code for parsing tcp hdr opt lookup using dynptrs.
This logic is lifted from a real-world use case of packet parsing
in katran [0], a layer 4 load balancer. The original version
"progs/test_parse_tcp_hdr_opt.c" (not using dynptrs) is included
here as well, for comparison.
When measuring the user + system time between the original version
vs. using dynptrs, and averaging the time for 10 runs (using
"time ./test_progs -t parse_tcp_hdr_opt"):
original version: 0.031 sec
with dynptrs: 0.045 sec
5) progs/dynptr_success.c
* Add test case "test_skb_readonly" for testing attempts at writes
on a prog type with read-only skb ctx.
* Add "test_dynptr_skb_data" for testing that bpf_dynptr_data isn't
supported for skb progs.
6) progs/dynptr_fail.c
* Add test cases "skb_invalid_data_slice{1,2,3,4}" and
"xdp_invalid_data_slice{1,2}" for testing that helpers that modify the
underlying packet buffer automatically invalidate the associated
data slice.
* Add test cases "skb_invalid_ctx" and "xdp_invalid_ctx" for testing
that prog types that do not support bpf_dynptr_from_skb/xdp don't
have access to the API.
* Add test case "dynptr_slice_var_len{1,2}" for testing that
variable-sized len can't be passed in to bpf_dynptr_slice
* Add test case "skb_invalid_slice_write" for testing that writes to a
read-only data slice are rejected by the verifier.
* Add test case "data_slice_out_of_bounds_skb" for testing that
writes to an area outside the slice are rejected.
* Add test case "invalid_slice_rdwr_rdonly" for testing that prog
types that don't allow writes to packet data don't accept any calls
to bpf_dynptr_slice_rdwr.
Joanne Koong [Wed, 1 Mar 2023 15:49:52 +0000 (07:49 -0800)]
bpf: Add bpf_dynptr_slice and bpf_dynptr_slice_rdwr
Two new kfuncs are added, bpf_dynptr_slice and bpf_dynptr_slice_rdwr.
The user must pass in a buffer to store the contents of the data slice
if a direct pointer to the data cannot be obtained.
For skb and xdp type dynptrs, these two APIs are the only way to obtain
a data slice. However, for other types of dynptrs, there is no
difference between bpf_dynptr_slice(_rdwr) and bpf_dynptr_data.
For skb type dynptrs, the data is copied into the user provided buffer
if any of the data is not in the linear portion of the skb. For xdp type
dynptrs, the data is copied into the user provided buffer if the data is
between xdp frags.
If the skb is cloned and a call to bpf_dynptr_data_rdwr is made, then
the skb will be uncloned (see bpf_unclone_prologue()).
Please note that any bpf_dynptr_write() automatically invalidates any prior
data slices of the skb dynptr. This is because the skb may be cloned or
may need to pull its paged buffer into the head. As such, any
bpf_dynptr_write() will automatically have its prior data slices
invalidated, even if the write is to data in the skb head of an uncloned
skb. Please note as well that any other helper calls that change the
underlying packet buffer (eg bpf_skb_pull_data()) invalidates any data
slices of the skb dynptr as well, for the same reasons.
Joanne Koong [Wed, 1 Mar 2023 15:49:51 +0000 (07:49 -0800)]
bpf: Add xdp dynptrs
Add xdp dynptrs, which are dynptrs whose underlying pointer points
to a xdp_buff. The dynptr acts on xdp data. xdp dynptrs have two main
benefits. One is that they allow operations on sizes that are not
statically known at compile-time (eg variable-sized accesses).
Another is that parsing the packet data through dynptrs (instead of
through direct access of xdp->data and xdp->data_end) can be more
ergonomic and less brittle (eg does not need manual if checking for
being within bounds of data_end).
For reads and writes on the dynptr, this includes reading/writing
from/to and across fragments. Data slices through the bpf_dynptr_data
API are not supported; instead bpf_dynptr_slice() and
bpf_dynptr_slice_rdwr() should be used.
For examples of how xdp dynptrs can be used, please see the attached
selftests.
Joanne Koong [Wed, 1 Mar 2023 15:49:50 +0000 (07:49 -0800)]
bpf: Add skb dynptrs
Add skb dynptrs, which are dynptrs whose underlying pointer points
to a skb. The dynptr acts on skb data. skb dynptrs have two main
benefits. One is that they allow operations on sizes that are not
statically known at compile-time (eg variable-sized accesses).
Another is that parsing the packet data through dynptrs (instead of
through direct access of skb->data and skb->data_end) can be more
ergonomic and less brittle (eg does not need manual if checking for
being within bounds of data_end).
For bpf prog types that don't support writes on skb data, the dynptr is
read-only (bpf_dynptr_write() will return an error)
For reads and writes through the bpf_dynptr_read() and bpf_dynptr_write()
interfaces, reading and writing from/to data in the head as well as from/to
non-linear paged buffers is supported. Data slices through the
bpf_dynptr_data API are not supported; instead bpf_dynptr_slice() and
bpf_dynptr_slice_rdwr() (added in subsequent commit) should be used.
For examples of how skb dynptrs can be used, please see the attached
selftests.
Joanne Koong [Wed, 1 Mar 2023 15:49:49 +0000 (07:49 -0800)]
bpf: Add __uninit kfunc annotation
This patch adds __uninit as a kfunc annotation.
This will be useful for scenarios such as for example in dynptrs,
indicating whether the dynptr should be checked by the verifier as an
initialized or an uninitialized dynptr.
Without this annotation, the alternative would be needing to hard-code
in the verifier the specific kfunc to indicate that arg should be
treated as an uninitialized arg.
Joanne Koong [Wed, 1 Mar 2023 15:49:48 +0000 (07:49 -0800)]
bpf: Refactor verifier dynptr into get_dynptr_arg_reg
This commit refactors the logic for determining which register in a
function is the dynptr into "get_dynptr_arg_reg". This will be used
in the future when the dynptr reg for BPF_FUNC_dynptr_write will need
to be obtained in order to support writes for skb dynptrs.
Joanne Koong [Wed, 1 Mar 2023 15:49:47 +0000 (07:49 -0800)]
bpf: Define no-ops for externally called bpf dynptr functions
Some bpf dynptr functions will be called from places where
if CONFIG_BPF_SYSCALL is not set, then the dynptr function is
undefined. For example, when skb type dynptrs are added in the
next commit, dynptr functions are called from net/core/filter.c
This patch defines no-op implementations of these dynptr functions
so that they do not break compilation by being an undefined reference.
Joanne Koong [Wed, 1 Mar 2023 15:49:46 +0000 (07:49 -0800)]
bpf: Allow initializing dynptrs in kfuncs
This change allows kfuncs to take in an uninitialized dynptr as a
parameter. Before this change, only helper functions could successfully
use uninitialized dynptrs. This change moves the memory access check
(including stack state growing and slot marking) into
process_dynptr_func(), which both helpers and kfuncs call into.
Joanne Koong [Wed, 1 Mar 2023 15:49:44 +0000 (07:49 -0800)]
bpf: Support "sk_buff" and "xdp_buff" as valid kfunc arg types
The bpf mirror of the in-kernel sk_buff and xdp_buff data structures are
__sk_buff and xdp_md. Currently, when we pass in the program ctx to a
kfunc where the program ctx is a skb or xdp buffer, we reject the
program if the in-kernel definition is sk_buff/xdp_buff instead of
__sk_buff/xdp_md.
This change allows "sk_buff <--> __sk_buff" and "xdp_buff <--> xdp_md"
to be recognized as valid matches. The user program may pass in their
program ctx as a __sk_buff or xdp_md, and the in-kernel definition
of the kfunc may define this arg as a sk_buff or xdp_buff.
Jose E. Marchesi [Tue, 28 Feb 2023 09:51:29 +0000 (10:51 +0100)]
bpf, docs: Document BPF insn encoding in term of stored bytes
[Changes from V4:
- s/regs:16/regs:8 in figure.]
[Changes from V3:
- Back to src_reg and dst_reg, since they denote register numbers
as opposed to the values stored in these registers.]
[Changes from V2:
- Use src and dst consistently in the document.
- Use a more graphical depiction of the 128-bit instruction.
- Remove `Where:' fragment.
- Clarify that unused bits are reserved and shall be zeroed.]
[Changes from V1:
- Use rst literal blocks for figures.
- Avoid using | in the basic instruction/pseudo instruction figure.
- Rebased to today's bpf-next master branch.]
This patch modifies instruction-set.rst so it documents the encoding
of BPF instructions in terms of how the bytes are stored (be it in an
ELF file or as bytes in a memory buffer to be loaded into the kernel
or some other BPF consumer) as opposed to how the instruction looks
like once loaded.
This is hopefully easier to understand by implementors looking to
generate and/or consume bytes conforming BPF instructions.
The patch also clarifies that the unused bytes in a pseudo-instruction
shall be cleared with zeros.
Signed-off-by: Jose E. Marchesi <jose.marchesi@oracle.com> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/87h6v6i0da.fsf_-_@oracle.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
David Vernet [Tue, 28 Feb 2023 15:28:45 +0000 (09:28 -0600)]
bpf: Fix bpf_cgroup_from_id() doxygen header
In commit 332ea1f697be ("bpf: Add bpf_cgroup_from_id() kfunc"), a new
bpf_cgroup_from_id() kfunc was added which allows a BPF program to
lookup and acquire a reference to a cgroup from a cgroup id. The
commit's doxygen comment seems to have copy-pasted fields, which causes
BPF kfunc helper documentation to fail to render:
<snip>/helpers.c:2114: warning: Excess function parameter 'cgrp'...
<snip>/helpers.c:2114: warning: Excess function parameter 'level'...
<snip>
<snip>/helpers.c:2114: warning: Excess function parameter 'level'...
Jiaxun Yang [Tue, 28 Feb 2023 11:33:05 +0000 (11:33 +0000)]
bpf, mips: Implement R4000 workarounds for JIT
For R4000 erratas around multiplication and division instructions,
as our use of those instructions are always followed by mflo/mfhi
instructions, the only issue we need care is
"MIPS R4000PC/SC Errata, Processor Revision 2.2 and 3.0" Errata 28:
"A double-word or a variable shift may give an incorrect result if
executed while an integer multiplication is in progress."
We just emit a mfhi $0 to ensure the operation is completed after
every multiplication instruction according to workaround suggestion
in the document.
Signed-off-by: Jiaxun Yang <jiaxun.yang@flygoat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Acked-by: Johan Almbladh <johan.almbladh@anyfinetworks.com> Link: https://lore.kernel.org/bpf/20230228113305.83751-3-jiaxun.yang@flygoat.com
Jiaxun Yang [Tue, 28 Feb 2023 11:33:04 +0000 (11:33 +0000)]
bpf, mips: Implement DADDI workarounds for JIT
For DADDI errata we just workaround by disable immediate operation
for BPF_ADD / BPF_SUB to avoid generation of DADDIU.
All other use cases in JIT won't cause overflow thus they are all safe.
Signed-off-by: Jiaxun Yang <jiaxun.yang@flygoat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Acked-by: Johan Almbladh <johan.almbladh@anyfinetworks.com> Link: https://lore.kernel.org/bpf/20230228113305.83751-2-jiaxun.yang@flygoat.com
Yonghong Song [Mon, 27 Feb 2023 22:49:43 +0000 (14:49 -0800)]
libbpf: Fix bpf_xdp_query() in old kernels
Commit 04d58f1b26a4("libbpf: add API to get XDP/XSK supported features")
added feature_flags to struct bpf_xdp_query_opts. If a user uses
bpf_xdp_query_opts with feature_flags member, the bpf_xdp_query()
will check whether 'netdev' family exists or not in the kernel.
If it does not exist, the bpf_xdp_query() will return -ENOENT.
But 'netdev' family does not exist in old kernels as it is
introduced in the same patch set as Commit 04d58f1b26a4.
So old kernel with newer libbpf won't work properly with
bpf_xdp_query() api call.
To fix this issue, if the return value of
libbpf_netlink_resolve_genl_family_id() is -ENOENT, bpf_xdp_query()
will just return 0, skipping the rest of xdp feature query.
This preserves backward compatibility.
Linus Torvalds [Mon, 27 Feb 2023 22:05:08 +0000 (14:05 -0800)]
Merge tag 'net-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from wireless and netfilter.
The notable fixes here are the EEE fix which restores boot for many
embedded platforms (real and QEMU); WiFi warning suppression and the
ICE Kconfig cleanup.
Current release - regressions:
- phy: multiple fixes for EEE rework
- wifi: wext: warn about usage only once
- wifi: ath11k: allow system suspend to survive ath11k
Current release - new code bugs:
- mlx5: Fix memory leak in IPsec RoCE creation
- ibmvnic: assign XPS map to correct queue index
Previous releases - regressions:
- netfilter: ip6t_rpfilter: Fix regression with VRF interfaces
- netfilter: ctnetlink: make event listener tracking global
- nf_tables: allow to fetch set elements when table has an owner
- mlx5:
- fix skb leak while fifo resync and push
- fix possible ptp queue fifo use-after-free
Previous releases - always broken:
- sched: fix action bind logic
- ptp: vclock: use mutex to fix "sleep on atomic" bug if driver also
uses a mutex
Puranjay Mohan [Thu, 23 Feb 2023 09:53:46 +0000 (09:53 +0000)]
libbpf: Fix arm syscall regs spec in bpf_tracing.h
The syscall register definitions for ARM in bpf_tracing.h doesn't define
the fifth parameter for the syscalls. Because of this some KPROBES based
selftests fail to compile for ARM architecture.
Define the fifth parameter that is passed in the R5 register (uregs[4]).
Rong Tao [Fri, 24 Feb 2023 15:10:02 +0000 (23:10 +0800)]
selftests/bpf: Fix compilation errors: Assign a value to a constant
Commit bc292ab00f6c("mm: introduce vma->vm_flags wrapper functions")
turns the vm_flags into a const variable.
Added bpf_find_vma test in commit f108662b27c9("selftests/bpf: Add tests
for bpf_find_vma") to assign values to variables that declare const in
find_vma_fail1.c programs, which is an error to the compiler and does not
test BPF verifiers. It is better to replace 'const vm_flags_t vm_flags'
with 'unsigned long vm_start' for testing.
$ make -C tools/testing/selftests/bpf/ -j8
...
progs/find_vma_fail1.c:16:16: error: cannot assign to non-static data
member 'vm_flags' with const-qualified type 'const vm_flags_t' (aka
'const unsigned long')
vma->vm_flags |= 0x55;
~~~~~~~~~~~~~ ^
../tools/testing/selftests/bpf/tools/include/vmlinux.h:1898:20:
note: non-static data member 'vm_flags' declared const here
const vm_flags_t vm_flags;
~~~~~~~~~~~`~~~~~~^~~~~~~~