Andrii Nakryiko [Wed, 2 Dec 2020 06:52:43 +0000 (22:52 -0800)]
tools/bpftool: Auto-detect split BTFs in common cases
When working with a module's split BTF from /sys/kernel/btf/*,
auto-substitute /sys/kernel/btf/vmlinux as the base BTF. This makes using
bpftool with module BTFs faster and simpler.
Merge branch 'switch to memcg-based memory accounting'
Roman Gushchin says:
====================
Currently bpf is using the memlock rlimit for the memory accounting.
This approach has its downsides and over time has created a significant
amount of problems:
1) The limit is per-user, but because most bpf operations are performed
as root, the limit has little value.
2) It's hard to come up with a specific maximum value, especially because
the counter is shared with non-bpf use cases (e.g. mlock()).
Any specific value is either too low and creates false failures
or too high and useless.
3) Charging is not connected to the actual memory allocation. Bpf code
should manually calculate the estimated cost and charge the counter,
and then take care of uncharging, including all fail paths.
It adds to the code complexity and makes it easy to leak a charge.
4) There is no simple way of getting the current value of the counter.
We've used drgn for it, but it's far from being convenient.
5) Cryptic -EPERM is returned on exceeding the limit. Libbpf even had
a function to "explain" this case for users.
6) rlimits are generally considered (at least partially) obsolete.
They do not provide a comprehensive system for the control of physical
resources: memory, cpu, io, etc. All resource control development
in recent years has been related to cgroups.
In order to overcome these problems let's switch to the memory cgroup-based
memory accounting of bpf objects. With the recent addition of the percpu
memory accounting, now it's possible to provide a comprehensive accounting
of the memory used by bpf programs and maps.
This approach has the following advantages:
1) The limit is per-cgroup and hierarchical. It's way more flexible and allows
a better control over memory usage by different workloads.
2) The actual memory consumption is taken into account. It happens automatically
at allocation time if the __GFP_ACCOUNT flag is passed. Uncharging is also
performed automatically when the memory is released, so the code on the bpf side
becomes simpler and safer.
3) There is a simple way to get the current value and statistics.
Cgroup-based accounting adds new requirements:
1) The kernel config should have CONFIG_CGROUPS and CONFIG_MEMCG_KMEM enabled.
These options are usually enabled, maybe excluding tiny builds for embedded
devices.
2) The system should have a configured cgroup hierarchy, including reasonable
memory limits and/or guarantees. Modern systems usually delegate this task
to systemd or similar task managers.
Without meeting these requirements there are no limits on how much memory bpf
can use, and a non-root user is able to hurt the system by allocating too much.
But because per-user rlimits do not provide a functional system for protecting
and managing physical resources anyway, anyone who seriously depends on such
limits should use cgroups.
When a bpf map is created, the memory cgroup of the process which creates
the map is recorded. Subsequently all memory allocations related to the bpf map
are charged to the same cgroup. This includes allocations made from interrupts
and by any process. Bpf program memory is charged to the memory cgroup of
the process which loads the program.
The patchset consists of the following parts:
1) 4 mm patches are required on the mm side, otherwise vmallocs cannot be mapped
to userspace
2) memcg-based accounting for various bpf objects: progs and maps
3) removal of the rlimit-based accounting
4) removal of rlimit adjustments in userspace samples
v9:
- always charge the saved memory cgroup, by Daniel, Toke and Alexei
- added bpf_map_kzalloc()
- rebase and minor fixes
v8:
- extended the cover letter to be more clear on new requirements, by Daniel
- an approximate value is provided by map memlock info, by Alexei
v7:
- introduced bpf_map_kmalloc_node() and bpf_map_alloc_percpu(), by Alexei
- switched allocations made from an interrupt context to new helpers,
by Daniel
- rebase and minor fixes
v6:
- rebased to the latest version of the remote charging API
- fixed signatures, added acks
v5:
- rebased to the latest version of the remote charging API
- implemented kmem accounting from an interrupt context, by Shakeel
- rebased to latest changes in mm allowed to map vmallocs to userspace
- fixed a build issue in kselftests, by Alexei
- fixed a use-after-free bug in bpf_map_free_deferred()
- added bpf line info coverage, by Shakeel
- split bpf map charging preparations into a separate patch
v4:
- covered allocations made from an interrupt context, by Daniel
- added some clarifications to the cover letter
v3:
- dropped the userspace part for further discussions/refinements,
by Andrii and Song
v2:
- fixed build issue, caused by the remaining rlimit-based accounting
for sockhash maps
====================
Roman Gushchin [Tue, 1 Dec 2020 21:58:58 +0000 (13:58 -0800)]
bpf: Eliminate rlimit-based memory accounting infra for bpf maps
Remove rlimit-based accounting infrastructure code, which is not used
anymore.
To provide backward compatibility, use an approximation of the
bpf map memory footprint as the "memlock" value, available to a user
via map info. The approximation is based on the maximal number of
elements and the key and value sizes.
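For illustration only, the kind of approximation described above can be computed like this (the exact rounding and constants used by the kernel may differ):

/*
 * Illustrative sketch: approximate a map's memory footprint from its
 * maximal number of elements and its key/value sizes.
 */
static unsigned long bpf_map_memlock_approx(const struct bpf_map *map)
{
        unsigned long elem_size;

        elem_size = round_up(map->key_size + map->value_size, 8);
        return round_up((unsigned long)map->max_entries * elem_size, PAGE_SIZE);
}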
Roman Gushchin [Tue, 1 Dec 2020 21:58:53 +0000 (13:58 -0800)]
bpf: Eliminate rlimit-based memory accounting for bpf ringbuffer
Do not use rlimit-based memory accounting for bpf ringbuffer.
It has been replaced with the memcg-based memory accounting.
bpf_ringbuf_alloc() can't return anything except ERR_PTR(-ENOMEM)
or a valid pointer, so to simplify the code make it return NULL
in the first case. This allows dropping a couple of lines in
ringbuf_map_alloc() and also makes it look similar to other
memory-allocating functions like kmalloc().
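Sketched, the resulting caller pattern inside ringbuf_map_alloc() becomes the familiar one (cleanup of the containing map structure is elided here):

        struct bpf_ringbuf *rb;

        rb = bpf_ringbuf_alloc(attr->max_entries, numa_node);
        if (!rb)        /* previously propagated via ERR_PTR/IS_ERR */
                return ERR_PTR(-ENOMEM);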
Roman Gushchin [Tue, 1 Dec 2020 21:58:33 +0000 (13:58 -0800)]
bpf: Memcg-based memory accounting for bpf maps
This patch enables memcg-based memory accounting for memory allocated
by __bpf_map_area_alloc(), which is used by many types of bpf maps for
large initial memory allocations.
Please note that __bpf_map_area_alloc() should not be used outside of
map creation paths without setting the active memory cgroup to the
map's memory cgroup.
Following patches in the series will refine the accounting for
some of the map types.
Roman Gushchin [Tue, 1 Dec 2020 21:58:32 +0000 (13:58 -0800)]
bpf: Prepare for memcg-based memory accounting for bpf maps
Bpf maps can be updated from an interrupt context, and in such a
case there is no process which can be charged. This makes the memory
accounting of bpf maps non-trivial.
Fortunately, after commit 4127c6504f25 ("mm: kmem: enable kernel
memcg accounting from interrupt contexts") and commit b87d8cefe43c
("mm, memcg: rework remote charging API to support nesting")
it's finally possible.
To make the ownership model simple and consistent, when the map
is created, the memory cgroup of the current process is recorded.
All subsequent allocations related to the bpf map are charged to
the same memory cgroup. This includes allocations made by any process
(even one belonging to a different cgroup) and from interrupts.
This commit introduces 3 new helpers, which will be used by following
commits to enable the accounting of bpf maps memory:
- bpf_map_kmalloc_node()
- bpf_map_kzalloc()
- bpf_map_alloc_percpu()
They are wrapping popular memory allocation functions. They set
the active memory cgroup to the map's memory cgroup and add
__GFP_ACCOUNT to the passed gfp flags. Then they call into
the corresponding memory allocation function and restore
the original active memory cgroup.
These helpers are supposed to be used everywhere except the map creation
path. During map creation, when the map structure itself is allocated, it
cannot be passed to those helpers. In those cases the default memory
allocation functions are used with the __GFP_ACCOUNT flag.
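The wrapping pattern roughly looks as follows (a sketch; the field holding the saved memory cgroup is assumed to be map->memcg):

void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size,
                           gfp_t flags, int node)
{
        struct mem_cgroup *old_memcg;
        void *ptr;

        /* charge the map's memory cgroup instead of the current task's */
        old_memcg = set_active_memcg(map->memcg);
        ptr = kmalloc_node(size, flags | __GFP_ACCOUNT, node);
        set_active_memcg(old_memcg);

        return ptr;
}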
Roman Gushchin [Tue, 1 Dec 2020 21:58:31 +0000 (13:58 -0800)]
bpf: Memcg-based memory accounting for bpf progs
Include memory used by bpf programs into the memcg-based accounting.
This includes the memory used by the programs themselves, auxiliary data,
statistics and bpf line info. The memory cgroup containing the
process which loads the program gets charged.
Roman Gushchin [Tue, 1 Dec 2020 21:58:30 +0000 (13:58 -0800)]
mm: Convert page kmemcg type to a page memcg flag
The PageKmemcg flag is currently defined as a page type (like buddy, offline,
table and guard). Semantically it means that the page was accounted as
kernel memory by the page allocator and has to be uncharged on
release.
As a side effect of defining the flag as a page type, the accounted page
can't be mapped to userspace (look at page_has_type() and comments above).
In particular, this blocks the accounting of vmalloc-backed memory used
by some bpf maps, because these maps do map the memory to userspace.
One option is to fix it by complicating the access to page->mapcount,
which provides some free bits for page->page_type.
But it's way better to move this flag into page->memcg_data flags.
Indeed, the flag makes no sense without enabled memory cgroups and memory
cgroup pointer set in particular.
This commit replaces PageKmemcg() and __SetPageKmemcg() with
PageMemcgKmem() and an open-coded OR operation setting the memcg pointer
with the MEMCG_DATA_KMEM bit. __ClearPageKmemcg() can be simply deleted,
as the whole memcg_data is zeroed at once.
As a bonus, on !CONFIG_MEMCG builds the PageMemcgKmem() check will be
compiled out.
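A sketch of the replacement (illustrative; the real kernel code open-codes the OR at the charge site and carries additional checks):

static inline bool PageMemcgKmem(struct page *page)
{
        return page->memcg_data & MEMCG_DATA_KMEM;
}

/* setting side: store the memcg pointer together with the kmem flag */
static inline void page_set_memcg_kmem(struct page *page,
                                       struct mem_cgroup *memcg)
{
        page->memcg_data = (unsigned long)memcg | MEMCG_DATA_KMEM;
}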
Roman Gushchin [Tue, 1 Dec 2020 21:58:29 +0000 (13:58 -0800)]
mm: Introduce page memcg flags
The lowest bit in page->memcg_data is used to distinguish between a struct
mem_cgroup pointer and a pointer to an objcgs array. All checks and
modifications of this bit are open-coded.
Let's formalize it using page memcg flags, defined in enum
page_memcg_data_flags.
They are similar to the corresponding API for generic pages, except that
the setter can return false, indicating that the value has already been
set from a different thread.
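A sketch of the page_memcg_data_flags enum (bit assignments here are illustrative):

enum page_memcg_data_flags {
        /* page->memcg_data is a pointer to an objcgs vector */
        MEMCG_DATA_OBJCGS = (1UL << 0),
        /* page has been accounted as a non-slab kernel page */
        MEMCG_DATA_KMEM   = (1UL << 1),
        /* the next bit after the last actual flag */
        __NR_MEMCG_DATA_FLAGS = (1UL << 2),
};

#define MEMCG_DATA_FLAGS_MASK (__NR_MEMCG_DATA_FLAGS - 1)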
Roman Gushchin [Tue, 1 Dec 2020 21:58:27 +0000 (13:58 -0800)]
mm: memcontrol: Use helpers to read page's memcg data
Patch series "mm: allow mapping accounted kernel pages to userspace", v6.
Currently a non-slab kernel page which has been charged to a memory cgroup
can't be mapped to userspace. The underlying reason is simple: the PageKmemcg
flag is defined as a page type (like buddy, offline, etc), so it takes a
bit from the page->mapcount counter. Pages with a type set can't be mapped to
userspace.
But in general the kmemcg flag has nothing to do with mapping to
userspace. It only means that the page has been accounted by the page
allocator, so it has to be properly uncharged on release.
Some bpf maps are mapping the vmalloc-based memory to userspace, and their
memory can't be accounted because of this implementation detail.
This patchset removes this limitation by moving the PageKmemcg flag into
one of the free bits of the page->mem_cgroup pointer. It also formalizes
accesses to page->mem_cgroup and page->obj_cgroups using new helpers,
adds several checks and removes a couple of obsolete functions. As a
result the code becomes more robust, with fewer open-coded bit tricks.
This patch (of 4):
Currently there are many open-coded reads of the page->mem_cgroup pointer,
as well as a couple of read helpers, which are barely used.
This creates an obstacle to reusing some bits of the pointer for
storing additional information. In fact, we already do this for
slab pages, where the last bit indicates that the pointer has an attached
vector of objcg pointers instead of a regular memcg pointer.
This commit uses 2 existing helpers and introduces a new one to
convert all read sides to calls of these helpers:
struct mem_cgroup *page_memcg(struct page *page);
struct mem_cgroup *page_memcg_rcu(struct page *page);
struct mem_cgroup *page_memcg_check(struct page *page);
page_memcg_check() is intended to be used in cases when the page can be a
slab page and have a memcg pointer pointing at objcg vector. It does
check the lowest bit, and if set, returns NULL. page_memcg() contains a
VM_BUG_ON_PAGE() check for the page not being a slab page.
To make sure nobody uses direct access anymore, struct page's
mem_cgroup/obj_cgroups fields are converted to an unsigned long memcg_data.
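Once the low bits of memcg_data can carry flags (see the other patches in this series), such a read helper masks them off before returning the pointer. A simplified sketch:

static inline struct mem_cgroup *page_memcg(struct page *page)
{
        VM_BUG_ON_PAGE(PageSlab(page), page);
        return (struct mem_cgroup *)(page->memcg_data &
                                     ~MEMCG_DATA_FLAGS_MASK);
}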
Merge branch 'bpf: expose bpf_{s,g}etsockopt helpers to bind{4,6} hooks'
Stanislav Fomichev says:
====================
This might be useful for the listener sockets to pre-populate
some options. Since those helpers require locked sockets,
I'm changing bind hooks to lock/unlock the sockets. This
should not cause any performance overhead because at this
point there shouldn't be any socket lock contention and the
locking/unlocking should be cheap.
Also, as part of the series, I convert test_sock_addr bpf
assembly into C (and preserve the narrow load tests) to
make it easier to extend with bpf_setsockopt later on.
v2:
* remove version from bpf programs (Andrii Nakryiko)
====================
bpf: Allow bpf_{s,g}etsockopt from cgroup bind{4,6} hooks
I now have to lock/unlock the socket for the bind hook execution.
That shouldn't cause any overhead because the socket is unbound
and shouldn't receive any traffic.
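A minimal sketch of what such a hook can now do (SO_MARK is assumed to be among the options bpf_setsockopt() accepts in this context; the option and value are purely illustrative):

#include <linux/bpf.h>
#include <sys/socket.h>
#include <bpf/bpf_helpers.h>

SEC("cgroup/bind4")
int bind_v4_prog(struct bpf_sock_addr *ctx)
{
        int mark = 42;

        /* pre-populate an option on the soon-to-be listener socket */
        bpf_setsockopt(ctx, SOL_SOCKET, SO_MARK, &mark, sizeof(mark));

        return 1;       /* allow the bind() to proceed */
}

char _license[] SEC("license") = "GPL";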
Stephen reported the following build error for !CONFIG_NET_RX_BUSY_POLL
built kernels:
In file included from fs/select.c:32:
include/net/busy_poll.h: In function 'sk_mark_napi_id_once':
include/net/busy_poll.h:150:36: error: 'const struct sk_buff' has no member named 'napi_id'
150 | __sk_mark_napi_id_once_xdp(sk, skb->napi_id);
| ^~
Fix it by wrapping the helpers in CONFIG_NET_RX_BUSY_POLL ifdefs.
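The shape of the fix, based on the error above (a sketch, not the verbatim hunk):

#ifdef CONFIG_NET_RX_BUSY_POLL
static inline void sk_mark_napi_id_once(struct sock *sk,
                                        const struct sk_buff *skb)
{
        __sk_mark_napi_id_once_xdp(sk, skb->napi_id);
}
#else
static inline void sk_mark_napi_id_once(struct sock *sk,
                                        const struct sk_buff *skb)
{
}
#endif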
Fixes: b02e5a0ebb17 ("xsk: Propagate napi_id to XDP socket Rx path") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Björn Töpel <bjorn.topel@intel.com> Link: https://lore.kernel.org/linux-next/20201201190746.7d3357fb@canb.auug.org.au
Daniel Borkmann [Mon, 30 Nov 2020 23:09:26 +0000 (00:09 +0100)]
Merge branch 'xdp-preferred-busy-polling'
Björn Töpel says:
====================
This series introduces three new features:
1. A new "heavy traffic" busy-polling variant that works in concert
with the existing napi_defer_hard_irqs and gro_flush_timeout knobs.
2. A new socket option that lets a user change the busy-polling NAPI
budget.
3. Allow busy-polling to be performed on XDP sockets.
The existing busy-polling mode, enabled by the SO_BUSY_POLL socket
option or system-wide using the /proc/sys/net/core/busy_read knob, is
opportunistic. That means that if the NAPI context is not
scheduled, it will poll it. If, after busy-polling, the budget is
exceeded, the busy-polling logic will schedule the NAPI onto the
regular softirq handling.
One implication of the behavior above is that a busy/heavily loaded NAPI
context will never enter/allow for busy-polling. Some applications
prefer that most NAPI processing be done by busy-polling.
This series adds a new socket option, SO_PREFER_BUSY_POLL, that works
in concert with the napi_defer_hard_irqs and gro_flush_timeout
knobs. The napi_defer_hard_irqs and gro_flush_timeout knobs were
introduced in commit 6f8b12d661d0 ("net: napi: add hard irqs deferral
feature"), and allow a user to defer enabling interrupts and
instead schedule the NAPI context from a watchdog timer. When a user
enables SO_PREFER_BUSY_POLL, again with the other knobs enabled,
and the NAPI context is being processed by a softirq, the softirq NAPI
processing will exit early to allow the busy-polling to be performed.
If the application stops performing busy-polling via a system call,
the watchdog timer defined by gro_flush_timeout will timeout, and
regular softirq handling will resume.
In summary: heavy traffic applications that prefer busy-polling over
softirq processing should use this option.
Patch 6 touches a lot of drivers, so the Cc: list is grossly long.
Example usage:
$ echo 2 | sudo tee /sys/class/net/ens785f1/napi_defer_hard_irqs
$ echo 200000 | sudo tee /sys/class/net/ens785f1/gro_flush_timeout
Note that the timeout should be larger than the userspace processing
window, otherwise the watchdog will timeout and fall back to regular
softirq processing.
Enable the SO_BUSY_POLL/SO_PREFER_BUSY_POLL options on your socket.
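In C, enabling the options looks roughly like this (the SO_PREFER_BUSY_POLL / SO_BUSY_POLL_BUDGET values below match recent UAPI headers but should be treated as illustrative, as should the timeout and budget numbers):

#include <sys/socket.h>

#ifndef SO_PREFER_BUSY_POLL
#define SO_PREFER_BUSY_POLL 69
#endif
#ifndef SO_BUSY_POLL_BUDGET
#define SO_BUSY_POLL_BUDGET 70
#endif

static int enable_preferred_busy_poll(int fd)
{
        int on = 1;
        int usecs = 20;         /* busy-poll for up to 20 us per syscall */
        int budget = 64;        /* NAPI budget used while busy-polling */

        if (setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL, &on, sizeof(on)))
                return -1;
        if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usecs, sizeof(usecs)))
                return -1;
        return setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET, &budget,
                          sizeof(budget));
}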
Performance, simple UDP ping-pong:
A packet generator blasts UDP packets to a certain {src,dst} IP/port,
so a dedicated ksoftirqd will be busy handling the packets on a certain
core.
A simple UDP test program that simply does recvfrom/sendto is running
at the host end. Throughput in pps and RTT latency is measured at the
packet generator.
Scenarios 2 and 5 show when the new option should be used. Throughput
goes from 155 to 420 Kpps, average latency is similar, but the tail
latencies are much better for the latter.
Performance, XDP sockets:
Again, a packet generator blasts UDP packets to a certain
{src,dst} IP/port.
Today, when running the XDP sockets sample on the same core as the softirq
handling, performance tanks mainly because we do not yield to
user-space when the XDP socket Rx queue is full.
Using the preferred busy-polling mode does not impact performance.
The above tests were done for the 'ice' driver.
Thanks to Jakub for suggesting this busy-polling addition [1], and
Eric for all input/review!
Changes:
rfc-v1 [2] -> rfc-v2:
* Changed name from bias to prefer.
* Base the work on Eric's/Luigi's defer irq/gro timeout work.
* Proper GRO flushing.
* Build issues for some XDP drivers.
rfc-v2 [3] -> v1:
* Fixed broken qlogic build.
* Do not trigger an IPI (XDP socket wakeup) when busy-polling is
enabled.
v1 [4] -> v2:
* Added napi_id to socionext driver, and added Ilias' Acked-by. (Ilias)
* Added a samples patch to improve busy-polling for xdpsock/l2fwd.
* Correctly mark atomic operations with {WRITE,READ}_ONCE, to make
KCSAN and the code readers happy. (Eric)
* Check NAPI budget not to exceed U16_MAX. (Eric)
* Added kdoc.
v2 [5] -> v3:
* Collected Acked-by.
* Check NAPI disable prior prefer busy-polling. (Jakub)
* Added napi_id registration for virtio-net. (Michael)
* Added napi_id registration for veth.
Björn Töpel [Mon, 30 Nov 2020 18:52:01 +0000 (19:52 +0100)]
xsk: Propagate napi_id to XDP socket Rx path
Add napi_id to the xdp_rxq_info structure, and make sure the XDP
socket picks up the napi_id in the Rx path. The napi_id is used to find
the corresponding NAPI structure for socket busy polling.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/bpf/20201130185205.196029-7-bjorn.topel@gmail.com
Björn Töpel [Mon, 30 Nov 2020 18:51:59 +0000 (19:51 +0100)]
xsk: Check need wakeup flag in sendmsg()
Add a check for the need wakeup flag in sendmsg(), so that if a user calls
sendmsg() when no wakeup is needed, no wakeup is triggered.
To simplify the need wakeup check in the syscall, unconditionally
enable the need wakeup flag for Tx. This has a side-effect for poll():
if poll() is called for a socket without need wakeup enabled, a Tx
wakeup is unconditionally performed.
The wakeup matrix for AF_XDP now looks like:
need wakeup | poll()       | sendmsg()   | recvmsg()
------------+--------------+-------------+------------
disabled    | wake Tx      | wake Tx     | nop
enabled     | check flag;  | check flag; | check flag;
            | wake Tx/Rx   | wake Tx     | wake Rx
Björn Töpel [Mon, 30 Nov 2020 18:51:58 +0000 (19:51 +0100)]
xsk: Add support for recvmsg()
Add support for non-blocking recvmsg() to XDP sockets. Previously,
only sendmsg() was supported by XDP sockets. Now, for symmetry and the
upcoming busy-polling support, recvmsg() is added.
Björn Töpel [Mon, 30 Nov 2020 18:51:56 +0000 (19:51 +0100)]
net: Introduce preferred busy-polling
The existing busy-polling mode, enabled by the SO_BUSY_POLL socket
option or system-wide using the /proc/sys/net/core/busy_read knob, is
opportunistic. That means that if the NAPI context is not
scheduled, it will poll it. If, after busy-polling, the budget is
exceeded, the busy-polling logic will schedule the NAPI onto the
regular softirq handling.
One implication of the behavior above is that a busy/heavily loaded NAPI
context will never enter/allow for busy-polling. Some applications
prefer that most NAPI processing be done by busy-polling.
This series adds a new socket option, SO_PREFER_BUSY_POLL, that works
in concert with the napi_defer_hard_irqs and gro_flush_timeout
knobs. The napi_defer_hard_irqs and gro_flush_timeout knobs were
introduced in commit 6f8b12d661d0 ("net: napi: add hard irqs deferral
feature"), and allow a user to defer enabling interrupts and
instead schedule the NAPI context from a watchdog timer. When a user
enables SO_PREFER_BUSY_POLL, again with the other knobs enabled,
and the NAPI context is being processed by a softirq, the softirq NAPI
processing will exit early to allow the busy-polling to be performed.
If the application stops performing busy-polling via a system call,
the watchdog timer defined by gro_flush_timeout will timeout, and
regular softirq handling will resume.
In summary: heavy traffic applications that prefer busy-polling over
softirq processing should use this option.
Example usage:
$ echo 2 | sudo tee /sys/class/net/ens785f1/napi_defer_hard_irqs
$ echo 200000 | sudo tee /sys/class/net/ens785f1/gro_flush_timeout
Note that the timeout should be larger than the userspace processing
window, otherwise the watchdog will timeout and fall back to regular
softirq processing.
Enable the SO_BUSY_POLL/SO_PREFER_BUSY_POLL options on your socket.
KP Singh [Thu, 26 Nov 2020 18:49:46 +0000 (18:49 +0000)]
selftests/bpf: Fix flavored variants of test_ima
Flavored variants of test_progs (e.g. test_progs-no_alu32) change their
working directory to the corresponding subdirectory (e.g. no_alu32).
Since the setup script required by test_ima (ima_setup.sh) is not
mentioned in the dependencies, it does not get copied to these
subdirectories and causes flavored variants of test_ima to fail.
Adding the script to TRUNNER_EXTRA_FILES ensures that the file is also
copied to the subdirectories for the flavored variants of test_progs.
Fixes: 34b82d3ac105 ("bpf: Add a selftest for bpf_ima_inode_hash") Reported-by: Yonghong Song <yhs@fb.com> Suggested-by: Yonghong Song <yhs@fb.com> Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20201126184946.1708213-1-kpsingh@chromium.org
Zhu Yanjun [Thu, 26 Nov 2020 15:03:18 +0000 (23:03 +0800)]
xdp: Remove the functions xsk_map_inc and xsk_map_put
The functions xsk_map_put() and xsk_map_inc() are simple wrappers; as
such, replace them with bpf_map_inc() and bpf_map_put() and
remove some error testing code.
Magnus Karlsson [Thu, 26 Nov 2020 09:37:35 +0000 (10:37 +0100)]
libbpf: Replace size_t with __u32 in xsk interfaces
Replace size_t with __u32 in the xsk interfaces that use it.
There is no reason to have size_t since the internal variable that
is manipulated is a __u32. The following APIs are affected:
The numerous refactorings that rewrite BPF programs written with bpf_load
to use the libbpf loader have finally been completed, so no BPF
programs within the kernel tree use bpf_load anymore.
This patchset refactors the remaining bpf programs with libbpf and
completely removes bpf_load, an outdated bpf loader that is difficult
to keep in sync with the latest kernel BPF features and causes confusion.
Changes in v2:
- drop 'move tracing helpers to trace_helper' patch
- add link pinning to prevent cleaning up on process exit
- add static at global variable and remove unused variable
- change to destroy link even after link__pin()
- fix return error code on exit
- merge commit with changing Makefile
Changes in v3:
- cleanup bpf_link, bpf_object and cgroup fd both on success and error
====================
Daniel T. Lee [Tue, 24 Nov 2020 09:03:10 +0000 (09:03 +0000)]
samples: bpf: Remove bpf_load loader completely
The numerous refactorings that rewrite BPF programs written with bpf_load
to use the libbpf loader have finally been completed, so no BPF
programs within the kernel tree use bpf_load anymore.
This commit removes bpf_load, an outdated bpf loader that is difficult
to keep in sync with the latest kernel BPF features and causes confusion.
Also, this commit removes the unused trace_helper and bpf_load from
the samples/bpf target objects in the Makefile.
Currently, lwt_len_hist's map lwt_len_hist_map uses pinning, and the
map isn't cleared at the end of the test. This leads to reuse of that map
for each test, which prevents the results of the test from being accurate.
This commit fixes the problem by removing the pinned map from bpffs.
Also, this commit adds the executable permission to the shell script
files.
Daniel T. Lee [Tue, 24 Nov 2020 09:03:08 +0000 (09:03 +0000)]
samples: bpf: Refactor test_overhead program with libbpf
This commit refactors the existing program to use the libbpf bpf loader.
Since kprobe, tracepoint and raw_tracepoint bpf programs can all be
attached with the single bpf_program__attach() interface, the
corresponding libbpf function is used here.
Rather than hard-coding the number of cpus inside the code, this commit
uses the number of available cpus obtained with _SC_NPROCESSORS_ONLN.
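Condensed, the two points above look like this (program handling is simplified):

#include <unistd.h>
#include <stdio.h>
#include <bpf/libbpf.h>

/* number of online cpus, instead of a hard-coded constant */
static long online_cpus(void)
{
        return sysconf(_SC_NPROCESSORS_ONLN);
}

/* kprobe, tracepoint and raw_tracepoint programs all attach the same way */
static struct bpf_link *attach_one(struct bpf_program *prog)
{
        struct bpf_link *link = bpf_program__attach(prog);

        if (libbpf_get_error(link)) {
                fprintf(stderr, "bpf_program__attach failed\n");
                return NULL;
        }
        return link;
}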
Daniel T. Lee [Tue, 24 Nov 2020 09:03:07 +0000 (09:03 +0000)]
samples: bpf: Refactor ibumad program with libbpf
This commit refactors the existing ibumad program to use the libbpf bpf
loader. Attach/detach of the tracepoint bpf programs is now managed
with the generic bpf_program__attach() and bpf_link__destroy() from
libbpf.
Also, instead of using the previous BPF map definition, this commit
refactors the ibumad map definitions to the new BTF-defined map format.
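For reference, a BTF-defined map declaration has this shape (the map name, type and sizes below are illustrative, not the actual ibumad maps):

#include <linux/types.h>
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 256);
        __type(key, __u32);
        __type(value, __u64);
} counters SEC(".maps");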
To verify that this bpf program works without an infiniband device,
try loading the ib_umad kernel module and test the program as follows:
# modprobe ib_umad
# ./ibumad
Moreover, TRACE_HELPERS has been removed from the Makefile since it is
not used by this program.
Daniel T. Lee [Tue, 24 Nov 2020 09:03:06 +0000 (09:03 +0000)]
samples: bpf: Refactor task_fd_query program with libbpf
This commit refactors the existing kprobe program to use the libbpf bpf
loader. To attach the bpf program, it uses the generic bpf_program__attach()
approach rather than bpf_load's load_bpf_file().
To attach bpf to a perf_event, instead of using the previous ioctl method,
this commit uses bpf_program__attach_perf_event(), since it manages
enabling the perf_event and attaching the BPF program to it, which is a
much more intuitive way to achieve this.
Also, the explicit close(fd) has been removed, since the event will be
closed inside bpf_link__destroy() automatically.
Furthermore, to prevent conflicts between uprobe events of the same name,
the O_TRUNC flag is used to clear the 'uprobe_events' interface.
Daniel T. Lee [Tue, 24 Nov 2020 09:03:05 +0000 (09:03 +0000)]
samples: bpf: Refactor test_cgrp2_sock2 program with libbpf
This commit refactors the existing cgroup program to use the libbpf bpf
loader. The original test_cgrp2_sock2 kept the bpf program
attached to the cgroup hierarchy even after the exit of the user program.
To implement the same functionality with libbpf, this commit uses
BPF_LINK_PINNING to pin the link attachment even after it is closed.
Since this uses a LINK instead of ATTACH, detaching the bpf program from the
cgroup with 'test_cgrp2_sock' is not used anymore.
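The pinning step itself is a one-liner (the pin path is illustrative):

#include <bpf/libbpf.h>

static int pin_cgroup_link(struct bpf_link *link)
{
        /* keep the bpf program attached to the cgroup hierarchy even
         * after this process exits
         */
        return bpf_link__pin(link, "/sys/fs/bpf/test_cgrp2_sock2_link");
}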
The code to mount bpffs was added to the .sh file in case bpffs
was not mounted on /sys/fs/bpf. Additionally, to fix the problem that the
shell script cannot find the binary object in the current path, the
relative path './' has been added in front of the binary.
Daniel T. Lee [Tue, 24 Nov 2020 09:03:04 +0000 (09:03 +0000)]
samples: bpf: Refactor hbm program with libbpf
This commit refactors the existing cgroup programs to use the libbpf
bpf loader. Since bpf_program__attach() doesn't support cgroup program
attachment, this explicitly attaches the cgroup bpf program with
bpf_program__attach_cgroup(bpf_prog, cg1).
Also, to change the attach_type of the bpf program, this uses libbpf's
bpf_program__set_expected_attach_type() helper to switch EGRESS to
INGRESS. To keep the bpf program attached to the cgroup hierarchy even
after the exit, this commit uses BPF_LINK_PINNING to pin the link
attachment even after it is closed.
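The attach sequence described above, roughly (variable names are illustrative):

#include <stdbool.h>
#include <bpf/libbpf.h>

static struct bpf_link *attach_hbm_prog(struct bpf_program *prog, int cg_fd,
                                        bool ingress)
{
        /* flip the program from EGRESS to INGRESS before attaching */
        if (ingress)
                bpf_program__set_expected_attach_type(prog,
                                                      BPF_CGROUP_INET_INGRESS);

        /* bpf_program__attach() has no cgroup support, so attach explicitly */
        return bpf_program__attach_cgroup(prog, cg_fd);
}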
Besides, this program was broken due to a typo in the BPF map definition.
This commit solves the problem by fixing the 'queue_stats' map's struct
name from hvm_queue_stats to hbm_queue_stats.
Andrei Matei [Wed, 25 Nov 2020 03:52:55 +0000 (22:52 -0500)]
bpf: Fix selftest compilation on clang 11
Before this patch, profiler.inc.h wouldn't compile with clang-11 (before
the __builtin_preserve_enum_value LLVM builtin was introduced in
https://reviews.llvm.org/D83242).
Another test that uses this builtin (test_core_enumval) is conditionally
skipped if the compiler is too old. In that spirit, this patch inhibits
part of populate_cgroup_info(), which needs this CO-RE builtin. The
selftests build again on clang-11.
The affected test (the profiler test) doesn't pass on clang-11 because
it's missing https://reviews.llvm.org/D85570, but at least the test suite
as a whole compiles. The test's expected failure is already called out in
the README.
KP Singh [Tue, 24 Nov 2020 15:12:10 +0000 (15:12 +0000)]
bpf: Add a selftest for bpf_ima_inode_hash
The test does the following:
- Mounts a loopback filesystem and appends the IMA policy to measure
executions only on this file-system. Restricting the IMA policy to
a particular filesystem prevents a system-wide IMA policy change.
- Executes an executable copied to this loopback filesystem.
- Calls bpf_ima_inode_hash in the bprm_committed_creds hook, checks
whether the call succeeded, and checks whether a hash was calculated.
The test shells out to the added ima_setup.sh script, as the setup is
better handled in a shell script and would be more complicated to do in the
test program or even by shelling out individual commands from C.
The list of required configs (i.e. IMA, SECURITYFS,
IMA_{WRITE,READ}_POLICY) for running this test are also updated.
Suggested-by: Mimi Zohar <zohar@linux.ibm.com> (limit policy rule to loopback mount) Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20201124151210.1081188-4-kpsingh@chromium.org
KP Singh [Tue, 24 Nov 2020 15:12:09 +0000 (15:12 +0000)]
bpf: Add a BPF helper for getting the IMA hash of an inode
Provide a wrapper function to get the IMA hash of an inode. This helper
is useful for fingerprinting files (e.g. executables on execution) and
using these fingerprints in detections, such as an executable unlinking
itself.
Since ima_inode_hash() can sleep, the helper is only allowed in sleepable
LSM hooks.
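A sketch of using the helper from a sleepable LSM hook (this mirrors the shape of the selftest rather than reproducing it; the global variables are illustrative plumbing):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

u64 ima_hash;           /* first 8 bytes of the hash, for illustration */
long ima_hash_ret;

SEC("lsm.s/bprm_committed_creds")       /* .s: sleepable, required here */
int BPF_PROG(measure_exec, struct linux_binprm *bprm)
{
        ima_hash_ret = bpf_ima_inode_hash(bprm->file->f_inode,
                                          &ima_hash, sizeof(ima_hash));
        return 0;
}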
KP Singh [Tue, 24 Nov 2020 15:12:08 +0000 (15:12 +0000)]
ima: Implement ima_inode_hash
This is in preparation to add a helper for BPF LSM programs to use
IMA hashes when attached to LSM hooks. There are LSM hooks like
inode_unlink which do not have a struct file * argument and cannot
use the existing ima_file_hash API.
An inode-based API is, therefore, useful in LSM-based detections, like an
executable trying to delete itself, which rely on the inode_unlink LSM
hook.
Moreover, the ima_file_hash function does nothing with the struct file
pointer apart from calling file_inode on it and converting it to an
inode.
Li RongQing [Tue, 24 Nov 2020 07:21:14 +0000 (15:21 +0800)]
libbpf: Add support for canceling cached_cons advance
Add a new function for returning descriptors the user received
after an xsk_ring_cons__peek call. After the application has
gotten a number of descriptors from a ring, it might not be able
to or want to process them all for various reasons. Therefore,
it would be useful to have an interface for returning or
cancelling a number of them so that they are returned to the ring.
This patch adds a new function called xsk_ring_cons__cancel that
performs this operation on nb descriptors counted from the end of
the batch of descriptors that was received through the peek call.
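Typical usage, sketched (struct fields and process_one() are hypothetical; error handling elided):

static void rx_batch(struct xsk_socket_info *xsk)
{
        __u32 idx, i, rcvd, done = 0;

        rcvd = xsk_ring_cons__peek(&xsk->rx, 64, &idx);
        for (i = 0; i < rcvd; i++) {
                if (!process_one(xsk, idx + i))
                        break;          /* can't take any more right now */
                done++;
        }
        /* return the descriptors we did not process to the ring ... */
        xsk_ring_cons__cancel(&xsk->rx, rcvd - done);
        /* ... and release only the ones we actually consumed */
        xsk_ring_cons__release(&xsk->rx, done);
}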
The current implementation uses a number of gotos to implement a loop
and different paths within the loop, which makes the code less readable
than it would be with an explicit while-loop. This patch also replaces a
chain of if/if-elses keyed on the same expression with a switch
statement.
Andrii Nakryiko [Sat, 21 Nov 2020 07:08:29 +0000 (23:08 -0800)]
bpf: Sanitize BTF data pointer after module is loaded
Given the .BTF section is not allocatable, it will get trimmed after the module
is loaded. The BPF system handles that properly by creating an independent copy
of the data. But prevent any accidental misuse by resetting the pointer to the
BTF data.
Andrii Nakryiko [Sat, 21 Nov 2020 07:08:28 +0000 (23:08 -0800)]
kbuild: Skip module BTF generation for out-of-tree external modules
In some modes of operation, Kbuild allows building modules without having the
vmlinux image around. In such cases, generation of module BTF is impossible.
This patch changes the behavior to emit a warning about the impossibility of
generating kernel module BTF, instead of breaking the build. This is especially
important for out-of-tree external module builds.
In vmlinux-less mode:
$ make clean
$ make modules_prepare
$ touch drivers/acpi/button.c
$ make M=drivers/acpi
...
CC [M] drivers/acpi/button.o
MODPOST drivers/acpi/Module.symvers
LD [M] drivers/acpi/button.ko
BTF [M] drivers/acpi/button.ko
Skipping BTF generation for drivers/acpi/button.ko due to unavailability of vmlinux
...
$ readelf -S ~/linux-build/default/drivers/acpi/button.ko | grep BTF -A1
... empty ...
Yonghong Song [Thu, 19 Nov 2020 07:30:39 +0000 (23:30 -0800)]
bpftool: Add {i,d}tlb_misses support for bpftool profile
Commit 47c09d6a9f67 ("bpftool: Introduce "prog profile" command")
introduced the "bpftool prog profile" command, which can be used
to profile a bpf program with metrics like the number of instructions.
This patch adds support for itlb_misses and dtlb_misses.
During an internal bpf program performance evaluation,
I found these two metrics are also very useful. The following
is an example output:
$ bpftool prog profile id 324 duration 3 cycles itlb_misses
Björn Töpel [Wed, 18 Nov 2020 07:16:40 +0000 (08:16 +0100)]
selftests/bpf: Mark tests that require unaligned memory access
A lot of tests require unaligned memory access to work. Mark the tests
as such, so that they can be avoided on unsupported architectures such
as RISC-V.
Björn Töpel [Wed, 18 Nov 2020 07:16:39 +0000 (08:16 +0100)]
selftests/bpf: Avoid running unprivileged tests with alignment requirements
Some architectures have strict alignment requirements. In that case,
the BPF verifier detects if a program has unaligned accesses and
rejects them. A user can pass BPF_F_ANY_ALIGNMENT to a program to
override this check. That, however, will only work when a privileged
user loads a program. An unprivileged user loading a program with this
flag will be rejected prior entering the verifier.
Hence, it does not make sense to load unprivileged programs without
strict alignment when testing the verifier. This patch avoids exactly
that.
Björn Töpel [Wed, 18 Nov 2020 07:16:38 +0000 (08:16 +0100)]
selftests/bpf: Fix broken riscv build
The selftests/bpf Makefile includes system include directories from
the host, when building BPF programs. On RISC-V glibc requires that
__riscv_xlen is defined. This is not the case for "clang -target bpf",
which messes up __WORDSIZE (errno.h -> ... -> wordsize.h) and breaks
the build.
By explicitly defining __riscv_xlen correctly for riscv, we can
work around this.
Fixes: 167381f3eac0 ("selftests/bpf: Makefile fix "missing" headers on build with -idirafter") Signed-off-by: Björn Töpel <bjorn.topel@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Luke Nelson <luke.r.nels@gmail.com> Link: https://lore.kernel.org/bpf/20201118071640.83773-2-bjorn.topel@gmail.com
The helper uses the CLOCK_MONOTONIC_COARSE time source, which is less
accurate but more performant.
We have a BPF CGROUP_SKB firewall that supports event logging through
bpf_perf_event_output(). Each event has a timestamp and currently we use
bpf_ktime_get_ns() for it. Use of bpf_ktime_get_coarse_ns() saves ~15-20
ns in time required for event logging.
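Usage is a drop-in replacement, e.g.:

        u64 ts;

        /* before: ts = bpf_ktime_get_ns(); */
        ts = bpf_ktime_get_coarse_ns();  /* CLOCK_MONOTONIC_COARSE based, cheaper */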
KP Singh [Tue, 17 Nov 2020 23:29:29 +0000 (23:29 +0000)]
bpf: Add tests for bpf_bprm_opts_set helper
The test forks a child process and updates the local storage to set/unset
the secureexec bit.
The BPF program in the test attaches to bprm_creds_for_exec which checks
the local storage of the current task to set the secureexec bit on the
binary parameters (bprm).
The child then execs a bash command with the environment variable
TMPDIR set in the envp. The bash command returns a different exit code
based on its observed value of the TMPDIR variable.
Since TMPDIR is one of the variables that is ignored by the dynamic
loader when the secureexec bit is set, one should expect the
child execution to not see this value when the secureexec bit is set.
KP Singh [Tue, 17 Nov 2020 23:29:28 +0000 (23:29 +0000)]
bpf: Add bpf_bprm_opts_set helper
The helper allows modification of certain bits on the linux_binprm
struct starting with the secureexec bit which can be updated using the
BPF_F_BPRM_SECUREEXEC flag.
secureexec can be set by the LSM for privilege gaining executions to set
the AT_SECURE auxv for glibc. When set, the dynamic linker disables the
use of certain environment variables (like LD_PRELOAD).
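A sketch of an LSM program using the helper (the global toggle is illustrative plumbing; the selftest uses task local storage instead):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

int make_secure;        /* set from user space before the exec */

SEC("lsm/bprm_creds_for_exec")
int BPF_PROG(set_secureexec, struct linux_binprm *bprm)
{
        if (make_secure)
                /* sets AT_SECURE for the new image, so the dynamic linker
                 * ignores LD_PRELOAD, TMPDIR and friends
                 */
                bpf_bprm_opts_set(bprm, BPF_F_BPRM_SECUREEXEC);
        return 0;
}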
Daniel Borkmann [Tue, 17 Nov 2020 21:07:40 +0000 (22:07 +0100)]
Merge branch 'af-xdp-tx-batch'
Magnus Karlsson says:
====================
This patch set mainly improves the performance of the Tx processing of
AF_XDP sockets, though patch 3 also improves the Rx path. All in all,
this patch set improves the throughput of the l2fwd xdpsock application
by around 11%. If we just take a look at the Tx processing part, it is
improved by 35% to 40%.
Hopefully the new batched Tx interfaces should be of value to other
drivers implementing AF_XDP zero-copy support. But patch #3 is generic
and will improve performance of all drivers when using AF_XDP sockets
(under the premises explained in that patch).
@Daniel. In patch 3, I apply all the padding required to hinder the
adjacency prefetcher from prefetching the wrong things. After this patch
set, I will submit another patch set that introduces
____cacheline_padding_in_smp in include/linux/cache.h according to your
suggestions. The last patch in that patch set will then convert the
explicit paddings that we have now to ____cacheline_padding_in_smp.
v2 -> v3:
* Fixed #pragma warning with clang and defined a loop_unrolled_for macro
for easier readability [lkp, Nick]
* Simplified invalid descriptor handling in xskq_cons_read_desc_batch()
v1 -> v2:
* Removed added parameter in i40e_setup_tx_descriptors and adopted a
simpler solution [Maciej]
* Added test for !xs in xsk_tx_peek_release_desc_batch() [John]
* Simplified return path in xsk_tx_peek_release_desc_batch() [John]
* Dropped patch #1 in v1 that introduced lazy completions. Hopefully
this is not needed when we get busy poll [Jakub]
* Iterate over local variable in xskq_prod_reserve_addr_batch() for
improved performance
* Fixed the fallback path in xsk_tx_peek_release_desc_batch() so that
it also produces a batch of descriptors, albeit by using the slower
(but more general) older code. This improves the performance of the
case when multiple sockets are sharing the same device and queue id.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Magnus Karlsson [Mon, 16 Nov 2020 11:12:47 +0000 (12:12 +0100)]
i40e: Use batched xsk Tx interfaces to increase performance
Use the new batched xsk interfaces for the Tx path in the i40e driver
to improve performance. On my machine, this yields a throughput
increase of 4% for the l2fwd sample app in xdpsock. If we instead just
look at the Tx part, this patch set increases throughput by more than
20% for Tx.
Note that I had to explicitly unroll the inner loop to get to
this performance level, by using a pragma. It is honored by both clang
and gcc and should be ignored by versions that do not support
it. Using the -funroll-loops compiler command line switch on the
source file resulted in loop unrolling at a higher level, which
led to a performance decrease instead of an increase.
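Such a pragma-based macro can look like this (the unroll factor and compiler guards are illustrative):

#if defined(__clang__)
#define loop_unrolled_for _Pragma("clang loop unroll_count(4)") for
#elif defined(__GNUC__) && __GNUC__ >= 8
#define loop_unrolled_for _Pragma("GCC unroll 4") for
#else
#define loop_unrolled_for for
#endif

/* usage: the hot inner loop gets unrolled on compilers honoring the pragma */
loop_unrolled_for (i = 0; i < nb_pkts; i++)
        fill_one_tx_desc(ring, i);      /* hypothetical per-descriptor work */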
Magnus Karlsson [Mon, 16 Nov 2020 11:12:46 +0000 (12:12 +0100)]
xsk: Introduce batched Tx descriptor interfaces
Introduce batched descriptor interfaces in the xsk core code for the
Tx path to be used in the driver to write a code path with higher
performance. This interface will be used by the i40e driver in the
next patch. Though other drivers would likely benefit from this new
interface too.
Note that batching is only implemented for the common case when
there is only one socket bound to the same device and queue id. When
this is not the case, we fall back to the old non-batched version of
the function.
Magnus Karlsson [Mon, 16 Nov 2020 11:12:45 +0000 (12:12 +0100)]
xsk: Introduce padding between more ring pointers
Introduce one cache line worth of padding between the consumer pointer
and the flags field as well as between the flags field and the start
of the descriptors in all the lockless rings. This is so that the x86 HW
adjacency prefetcher will not prefetch the adjacent pointer/field when
only one pointer/field is going to be used. This improves throughput
performance for the l2fwd sample app by 1% on my machine with HW
prefetching turned on in the BIOS.
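The layout change, sketched (field and pad names follow the description above; the exact struct differs):

struct xdp_ring {
        u32 producer ____cacheline_aligned_in_smp;
        /* a cache line of padding keeps the adjacency prefetcher from
         * pulling in the neighboring pointer/field when only one of
         * them is touched
         */
        u32 pad1 ____cacheline_aligned_in_smp;
        u32 consumer ____cacheline_aligned_in_smp;
        u32 pad2 ____cacheline_aligned_in_smp;
        u32 flags;
        u32 pad3 ____cacheline_aligned_in_smp;
        /* descriptors start in the following cache line */
};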
Magnus Karlsson [Mon, 16 Nov 2020 11:12:44 +0000 (12:12 +0100)]
i40e: Remove unnecessary sw_ring access from xsk Tx
Remove the unnecessary access to the software ring for the AF_XDP
zero-copy driver. This was used to record the length of the packet so
that the driver Tx completion code could sum this up to produce the
total bytes sent. This is now performed during the transmission of the
packet, so no need to record this in the software ring.
Magnus Karlsson [Mon, 16 Nov 2020 11:12:43 +0000 (12:12 +0100)]
samples/bpf: Increment Tx stats at sending
Increment the statistics over how many Tx packets have been sent at
the time of sending instead of at the time of completion. This is because a
completion event means that the buffer has been sent AND returned to
user space. The packet always gets sent shortly after sendto() is
called. The kernel might, for performance reasons, decide to not
return every single buffer to user space immediately after sending,
for example, only after a batch of packets have been
transmitted. Incrementing the number of packets sent at completion
will in that case be confusing: if you send a single packet, the
counter might show zero for a while even though the packet has been
transmitted.
Alan Maguire [Sun, 15 Nov 2020 10:46:35 +0000 (10:46 +0000)]
libbpf: bpf__find_by_name[_kind] should use btf__get_nr_types()
When operating on split BTF, btf__find_by_name[_kind] will not
iterate over all types since they use btf->nr_types to show
the number of types to iterate over. For split BTF this is
the number of types _on top of base BTF_, so it will
underestimate the number of types to iterate over, especially
for vmlinux + module BTF, where the latter is much smaller.
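The fix is to size the iteration with btf__get_nr_types(), which counts base and split types together. A simplified sketch of the lookup loop:

#include <string.h>
#include <errno.h>
#include <bpf/btf.h>

__s32 btf__find_by_name(const struct btf *btf, const char *type_name)
{
        __u32 i, nr_types = btf__get_nr_types(btf);     /* base + split */

        for (i = 1; i <= nr_types; i++) {
                const struct btf_type *t = btf__type_by_id(btf, i);
                const char *name = btf__name_by_offset(btf, t->name_off);

                if (name && !strcmp(type_name, name))
                        return i;
        }

        return -ENOENT;
}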
Martin KaFai Lau [Mon, 16 Nov 2020 20:01:13 +0000 (12:01 -0800)]
bpf: Fix the irq and nmi check in bpf_sk_storage for tracing usage
The intention of the current check is to avoid using bpf_sk_storage
in irq and nmi. Jakub pointed out that the current check cannot
do that. For example, in_serving_softirq() returns true
if the softirq handling is interrupted by hard irq.
Fixes: 8e4597c627fb ("bpf: Allow using bpf_sk_storage in FENTRY/FEXIT/RAW_TP") Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201116200113.2868539-1-kafai@fb.com
selftest/bpf: Fix IPV6FR handling in flow dissector
From the second fragment on, the IPV6FR program must stop the dissection of
IPv6 fragmented packets. This is the same approach used for IPv4 fragmentation.
This fixes the flow keys calculation for the upper-layer protocols.
Note that according to RFC8200, the first fragment packet must include
the upper-layer header.
Jakub Kicinski [Sat, 14 Nov 2020 21:23:01 +0000 (13:23 -0800)]
Merge branch 'ionic-updates'
Shannon Nelson says:
====================
ionic updates
These updates are a bit of code cleaning and a minor
bit of performance tweaking.
v3: convert ionic_lif_quiesce() to void
v2: added void cast on call to ionic_lif_quiesce()
lowered batching threshold
added patch to flatten calls to ionic_lif_rx_mode
added patch to change from_ndo to can_sleep
====================
Shannon Nelson [Thu, 12 Nov 2020 18:22:08 +0000 (10:22 -0800)]
ionic: useful names for booleans
With a few more uses of true and false in function calls, we
need to give them some useful names so we can tell from the
calling point what we're doing.
Signed-off-by: Shannon Nelson <snelson@pensando.io> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Shannon Nelson [Thu, 12 Nov 2020 18:22:07 +0000 (10:22 -0800)]
ionic: change set_rx_mode from_ndo to can_sleep
Instead of having two different ways of expressing the same
sleepability concept, using opposite logic, we can rework the
from_ndo to can_sleep for a more consistent usage.
Fixes: 1800eee16676 ("net: ionic: Replace in_interrupt() usage.") Signed-off-by: Shannon Nelson <snelson@pensando.io> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Shannon Nelson [Thu, 12 Nov 2020 18:22:05 +0000 (10:22 -0800)]
ionic: use mc sync for multicast filters
We should be using the multicast sync routines for the multicast
filters. Also, let's just flatten the logic a bit and pull
the small unicast routine back into ionic_set_rx_mode().
Fixes: 1800eee16676 ("net: ionic: Replace in_interrupt() usage.") Signed-off-by: Shannon Nelson <snelson@pensando.io> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Shannon Nelson [Thu, 12 Nov 2020 18:22:04 +0000 (10:22 -0800)]
ionic: batch rx buffer refilling
We don't need to refill the rx descriptors on every napi
if only a few were handled. Waiting until we can batch up
a few together will save us a few Rx cycles.
Signed-off-by: Shannon Nelson <snelson@pensando.io> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Shannon Nelson [Thu, 12 Nov 2020 18:22:03 +0000 (10:22 -0800)]
ionic: add lif quiesce
After the queues are stopped, expressly quiesce the lif.
This assures that even if the queues were in an odd state,
the firmware will close up everything cleanly.
Signed-off-by: Shannon Nelson <snelson@pensando.io> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>