The bpf selftest send_signal() is flaky for its subtests trying to
send signals in softirq/nmi context. To reduce flakiness, the
signal-targetted process priority is boosted, which should minimize
preemption of that process and improve the possibility that
the underlying task in softirq/nmi context is the bpf_send_signal()
wanted task.
Patch #1 did a refactoring to use ASSERT_* instead of old CHECK macros.
Patch #2 did actual change of boosting priority.
Changelog:
v1 -> v2:
remove skip logic where the underlying task in interrupt context
is not the intended one.
====================
Yonghong Song [Tue, 17 Aug 2021 19:09:23 +0000 (12:09 -0700)]
selftests/bpf: Fix flaky send_signal test
libbpf CI has reported send_signal test is flaky although
I am not able to reproduce it in my local environment.
But I am able to reproduce with on-demand libbpf CI ([1]).
Through code analysis, the following is possible reason.
The failed subtest runs bpf program in softirq environment.
Since bpf_send_signal() only sends to a fork of "test_progs"
process. If the underlying current task is
not "test_progs", bpf_send_signal() will not be triggered
and the subtest will fail.
To reduce the chances where the underlying process is not
the intended one, this patch boosted scheduling priority to
-20 (highest allowed by setpriority() call). And I did
10 runs with on-demand libbpf CI with this patch and I
didn't observe any failures.
Andrii Nakryiko [Tue, 17 Aug 2021 18:16:07 +0000 (11:16 -0700)]
Merge branch 'selftests/bpf: Improve the usability of test_progs'
Yucong Sun says:
====================
This short series adds two new switches to test_progs, "-a" and "-d",
adding support for both exact string matching, as well as '*' wildcards.
It also cleans up the output to make it possible to generate
allowlist/denylist using common cli tools.
====================
Yucong Sun [Tue, 17 Aug 2021 04:47:32 +0000 (21:47 -0700)]
selftests/bpf: Support glob matching for test selector.
This patch adds '-a' and '-d' arguments supporting both exact string match as
well as using '*' wildcard in test/subtests selection. '-a' and '-t' can
co-exists, same as '-d' and '-b', in which case they just add to the list of
allowed or denied test selectors.
Caveat: Same as the current substring matching mechanism, test and subtest
selector applies independently, 'a*/b*' will execute all tests matching "a*",
and with subtest name matching "b*", but tests matching "a*" that has no
subtests will also be executed.
Yucong Sun [Tue, 17 Aug 2021 04:47:31 +0000 (21:47 -0700)]
selftests/bpf: Also print test name in subtest status message
This patch add test name in subtest status message line, making it possible to
grep ':OK' in the output to generate a list of passed test+subtest names, which
can be processed to generate argument list to be used with "-a", "-d" exact
string matching.
Yucong Sun [Tue, 17 Aug 2021 04:47:30 +0000 (21:47 -0700)]
selftests/bpf: Correctly display subtest skip status
In skip_account(), test->skip_cnt is set to 0 at the end, this makes next print
statement never display SKIP status for the subtest. This patch moves the
accounting logic after the print statement, fixing the issue.
This patch also added SKIP status display for normal tests.
Yucong Sun [Tue, 17 Aug 2021 04:47:29 +0000 (21:47 -0700)]
selftests/bpf: Skip loading bpf_testmod when using -l to list tests.
When using "-l", test_progs often is executed as non-root user,
load_bpf_testmod() will fail and output errors. This patch skips loading bpf
testmod when "-l" is specified, making output cleaner.
Yucong Sun [Tue, 17 Aug 2021 04:57:13 +0000 (21:57 -0700)]
selftests/bpf: Add exponential backoff to map_delete_retriable in test_maps
Using a fixed delay of 1 microsecond has proven flaky in slow CPU environment,
e.g. Github Actions CI system. This patch adds exponential backoff with a cap
of 50ms to reduce the flakiness of the test. Initial delay is chosen at random
in the range [0ms, 5ms).
Yucong Sun [Mon, 16 Aug 2021 17:52:50 +0000 (10:52 -0700)]
selftests/bpf: Add exponential backoff to map_update_retriable in test_maps
Using a fixed delay of 1 microsecond has proven flaky in slow CPU environment,
e.g. Github Actions CI system. This patch adds exponential backoff with a cap
of 50ms to reduce the flakiness of the test. Initial delay is chosen at random
in the range [0ms, 5ms).
Andrii Nakryiko [Tue, 17 Aug 2021 01:42:18 +0000 (18:42 -0700)]
Merge branch 'sockmap: add sockmap support for unix stream socket'
Jiang Wang says:
====================
This patch series add support for unix stream type
for sockmap. Sockmap already supports TCP, UDP,
unix dgram types. The unix stream support is similar
to unix dgram.
Also add selftests for unix stream type in sockmap tests.
====================
Jiang Wang [Mon, 16 Aug 2021 19:03:21 +0000 (19:03 +0000)]
af_unix: Add unix_stream_proto for sockmap
Previously, sockmap for AF_UNIX protocol only supports
dgram type. This patch add unix stream type support, which
is similar to unix_dgram_proto. To support sockmap, dgram
and stream cannot share the same unix_proto anymore, because
they have different implementations, such as unhash for stream
type (which will remove closed or disconnected sockets from the map),
so rename unix_proto to unix_dgram_proto and add a new
unix_stream_proto.
Also implement stream related sockmap functions.
And add dgram key words to those dgram specific functions.
Signed-off-by: Jiang Wang <jiang.wang@bytedance.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Cong Wang <cong.wang@bytedance.com> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20210816190327.2739291-3-jiang.wang@bytedance.com
Hengqi Chen [Sun, 15 Aug 2021 08:10:35 +0000 (16:10 +0800)]
selftests/bpf: Test btf__load_vmlinux_btf/btf__load_module_btf APIs
Add test for btf__load_vmlinux_btf/btf__load_module_btf APIs. The test
loads bpf_testmod module BTF and check existence of a symbol which is
known to exist.
grantseltzer [Tue, 10 Aug 2021 02:05:08 +0000 (22:05 -0400)]
bpf: Reconfigure libbpf docs to remove unversioned API
This removes the libbpf_api.rst file from the kernel documentation.
The intention for this file was to pull documentation from comments
above API functions in libbpf. However, due to limitations of the
kernel documentation system, this API documentation could not be
versioned, which is counterintuative to how users expect to use it.
There is also currently no doc comments, making this a blank page.
Once the kernel comment documentation is actually contributed, it
will still exist in the kernel repository, just in the code itself.
A seperate site is being spun up to generate documentaiton from those
comments in a way in which it can be versioned properly.
This also reconfigures the bpf documentation index page to make it
easier to sync to the previously mentioned documentaiton site.
Daniel Borkmann [Mon, 16 Aug 2021 22:45:08 +0000 (00:45 +0200)]
Merge branch 'bpf-perf-link'
Andrii Nakryiko says:
====================
This patch set implements an ability for users to specify custom black box u64
value for each BPF program attachment, bpf_cookie, which is available to BPF
program at runtime. This is a feature that's critically missing for cases when
some sort of generic processing needs to be done by the common BPF program
logic (or even exactly the same BPF program) across multiple BPF hooks (e.g.,
many uniformly handled kprobes) and it's important to be able to distinguish
between each BPF hook at runtime (e.g., for additional configuration lookup).
The choice of restricting this to a fixed-size 8-byte u64 value is an explicit
design decision. Making this configurable by users adds unnecessary complexity
(extra memory allocations, extra complications on the verifier side to validate
accesses to variable-sized data area) while not really opening up new
possibilities. If user's use case requires storing more data per attachment,
it's possible to use either global array, or ARRAY/HASHMAP BPF maps, where
bpf_cookie would be used as an index into respective storage, populated by
user-space code before creating BPF link. This gives user all the flexibility
and control while keeping BPF verifier and BPF helper API simple.
Currently, similar functionality can only be achieved through:
- code-generation and BPF program cloning, which is very complicated and
unmaintainable;
- on-the-fly C code generation and further runtime compilation, which is
what BCC uses and allows to do pretty simply. The big downside is a very
heavy-weight Clang/LLVM dependency and inefficient memory usage (due to
many BPF program clones and the compilation process itself);
- in some cases (kprobes and sometimes uprobes) it's possible to do function
IP lookup to get function-specific configuration. This doesn't work for
all the cases (e.g., when attaching uprobes to shared libraries) and has
higher runtime overhead and additional programming complexity due to
BPF_MAP_TYPE_HASHMAP lookups. Up until recently, before bpf_get_func_ip()
BPF helper was added, it was also very complicated and unstable (API-wise)
to get traced function's IP from fentry/fexit and kretprobe.
With libbpf and BPF CO-RE, runtime compilation is not an option, so to be able
to build generic tracing tooling simply and efficiently, ability to provide
additional bpf_cookie value for each *attachment* (as opposed to each BPF
program) is extremely important. Two immediate users of this functionality are
going to be libbpf-based USDT library (currently in development) and retsnoop
([0]), but I'm sure more applications will come once users get this feature in
their kernels.
To achieve above described, all perf_event-based BPF hooks are made available
through a new BPF_LINK_TYPE_PERF_EVENT BPF link, which allows to use common
LINK_CREATE command for program attachments and generally brings
perf_event-based attachments into a common BPF link infrastructure.
With that, LINK_CREATE gets ability to pass throught bpf_cookie value during
link creation (BPF program attachment) time. bpf_get_attach_cookie() BPF
helper is added to allow fetching this value at runtime from BPF program side.
BPF cookie is stored either on struct perf_event itself and fetched from the
BPF program context, or is passed through ambient BPF run context, added in c7603cfa04e7 ("bpf: Add ambient BPF runtime context stored in current").
On the libbpf side of things, BPF perf link is utilized whenever is supported
by the kernel instead of using PERF_EVENT_IOC_SET_BPF ioctl on perf_event FD.
All the tracing attach APIs are extended with OPTS and bpf_cookie is passed
through corresponding opts structs.
Last part of the patch set adds few self-tests utilizing new APIs.
There are also a few refactorings along the way to make things cleaner and
easier to work with, both in kernel (BPF_PROG_RUN and BPF_PROG_RUN_ARRAY), and
throughout libbpf and selftests.
Follow-up patches will extend bpf_cookie to fentry/fexit programs.
While adding uprobe_opts, also extend it with ref_ctr_offset for specifying
USDT semaphore (reference counter) offset. Update attach_probe selftests to
validate its functionality. This is another feature (along with bpf_cookie)
required for implementing libbpf-based USDT solution.
[0] https://github.com/anakryiko/retsnoop
v4->v5:
- rebase on latest bpf-next to resolve merge conflict;
- add ref_ctr_offset to uprobe_opts and corresponding selftest;
v3->v4:
- get rid of BPF_PROG_RUN macro in favor of bpf_prog_run() (Daniel);
- move #ifdef CONFIG_BPF_SYSCALL check into bpf_set_run_ctx (Daniel);
v2->v3:
- user_ctx -> bpf_cookie, bpf_get_user_ctx -> bpf_get_attach_cookie (Peter);
- fix BPF_LINK_TYPE_PERF_EVENT value fix (Jiri);
- use bpf_prog_run() from bpf_prog_run_pin_on_cpu() (Yonghong);
v1->v2:
- fix build failures on non-x86 arches by gating on CONFIG_PERF_EVENTS.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Andrii Nakryiko [Sun, 15 Aug 2021 07:06:09 +0000 (00:06 -0700)]
selftests/bpf: Add ref_ctr_offset selftests
Extend attach_probe selftests to specify ref_ctr_offset for uprobe/uretprobe
and validate that its value is incremented from zero.
Turns out that once uprobe is attached with ref_ctr_offset, uretprobe for the
same location/function *has* to use ref_ctr_offset as well, otherwise
perf_event_open() fails with -EINVAL. So this test uses ref_ctr_offset for
both uprobe and uretprobe, even though for the purpose of test uprobe would be
enough.
Andrii Nakryiko [Sun, 15 Aug 2021 07:06:08 +0000 (00:06 -0700)]
libbpf: Add uprobe ref counter offset support for USDT semaphores
When attaching to uprobes through perf subsystem, it's possible to specify
offset of a so-called USDT semaphore, which is just a reference counted u16,
used by kernel to keep track of how many tracers are attached to a given
location. Support for this feature was added in [0], so just wire this through
uprobe_opts. This is important to enable implementing USDT attachment and
tracing through libbpf's bpf_program__attach_uprobe_opts() API.
[0] a6ca88b241d5 ("trace_uprobe: support reference counter in fd-based uprobe")
Andrii Nakryiko [Sun, 15 Aug 2021 07:06:07 +0000 (00:06 -0700)]
selftests/bpf: Add bpf_cookie selftests for high-level APIs
Add selftest with few subtests testing proper bpf_cookie usage.
Kprobe and uprobe subtests are pretty straightforward and just validate that
the same BPF program attached with different bpf_cookie will be triggered with
those different bpf_cookie values.
Tracepoint subtest is a bit more interesting, as it is the only
perf_event-based BPF hook that shares bpf_prog_array between multiple
perf_events internally. This means that the same BPF program can't be attached
to the same tracepoint multiple times. So we have 3 identical copies. This
arrangement allows to test bpf_prog_array_copy()'s handling of bpf_prog_array
list manipulation logic when programs are attached and detached. The test
validates that bpf_cookie isn't mixed up and isn't lost during such list
manipulations.
Perf_event subtest validates that two BPF links can be created against the
same perf_event (but not at the same time, only one BPF program can be
attached to perf_event itself), and that for each we can specify different
bpf_cookie value.
Andrii Nakryiko [Sun, 15 Aug 2021 07:06:06 +0000 (00:06 -0700)]
selftests/bpf: Extract uprobe-related helpers into trace_helpers.{c,h}
Extract two helpers used for working with uprobes into trace_helpers.{c,h} to
be re-used between multiple uprobe-using selftests. Also rename get_offset()
into more appropriate get_uprobe_offset().
Andrii Nakryiko [Sun, 15 Aug 2021 07:06:04 +0000 (00:06 -0700)]
libbpf: Add bpf_cookie to perf_event, kprobe, uprobe, and tp attach APIs
Wire through bpf_cookie for all attach APIs that use perf_event_open under the
hood:
- for kprobes, extend existing bpf_kprobe_opts with bpf_cookie field;
- for perf_event, uprobe, and tracepoint APIs, add their _opts variants and
pass bpf_cookie through opts.
For kernel that don't support BPF_LINK_CREATE for perf_events, and thus
bpf_cookie is not supported either, return error and log warning for user.
Andrii Nakryiko [Sun, 15 Aug 2021 07:06:03 +0000 (00:06 -0700)]
libbpf: Add bpf_cookie support to bpf_link_create() API
Add ability to specify bpf_cookie value when creating BPF perf link with
bpf_link_create() low-level API.
Given BPF_LINK_CREATE command is growing and keeps getting new fields that are
specific to the type of BPF_LINK, extend libbpf side of bpf_link_create() API
and corresponding OPTS struct to accomodate such changes. Add extra checks to
prevent using incompatible/unexpected combinations of fields.
Andrii Nakryiko [Sun, 15 Aug 2021 07:06:02 +0000 (00:06 -0700)]
libbpf: Use BPF perf link when supported by kernel
Detect kernel support for BPF perf link and prefer it when attaching to
perf_event, tracepoint, kprobe/uprobe. Underlying perf_event FD will be kept
open until BPF link is destroyed, at which point both perf_event FD and BPF
link FD will be closed.
This preserves current behavior in which perf_event FD is open for the
duration of bpf_link's lifetime and user is able to "disconnect" bpf_link from
underlying FD (with bpf_link__disconnect()), so that bpf_link__destroy()
doesn't close underlying perf_event FD.When BPF perf link is used, disconnect
will keep both perf_event and bpf_link FDs open, so it will be up to
(advanced) user to close them. This approach is demonstrated in bpf_cookie.c
selftests, added in this patch set.
Andrii Nakryiko [Sun, 15 Aug 2021 07:06:01 +0000 (00:06 -0700)]
libbpf: Remove unused bpf_link's destroy operation, but add dealloc
bpf_link->destroy() isn't used by any code, so remove it. Instead, add ability
to override deallocation procedure, with default doing plain free(link). This
is necessary for cases when we want to "subclass" struct bpf_link to keep
extra information, as is the case in the next patch adding struct
bpf_link_perf.
Andrii Nakryiko [Sun, 15 Aug 2021 07:06:00 +0000 (00:06 -0700)]
libbpf: Re-build libbpf.so when libbpf.map changes
Ensure libbpf.so is re-built whenever libbpf.map is modified. Without this,
changes to libbpf.map are not detected and versioned symbols mismatch error
will be reported until `make clean && make` is used, which is a suboptimal
developer experience.
Andrii Nakryiko [Sun, 15 Aug 2021 07:05:59 +0000 (00:05 -0700)]
bpf: Add bpf_get_attach_cookie() BPF helper to access bpf_cookie value
Add new BPF helper, bpf_get_attach_cookie(), which can be used by BPF programs
to get access to a user-provided bpf_cookie value, specified during BPF
program attachment (BPF link creation) time.
Naming is hard, though. With the concept being named "BPF cookie", I've
considered calling the helper:
- bpf_get_cookie() -- seems too unspecific and easily mistaken with socket
cookie;
- bpf_get_bpf_cookie() -- too much tautology;
- bpf_get_link_cookie() -- would be ok, but while we create a BPF link to
attach BPF program to BPF hook, it's still an "attachment" and the
bpf_cookie is associated with BPF program attachment to a hook, not a BPF
link itself. Technically, we could support bpf_cookie with old-style
cgroup programs.So I ultimately rejected it in favor of
bpf_get_attach_cookie().
Currently all perf_event-backed BPF program types support
bpf_get_attach_cookie() helper. Follow-up patches will add support for
fentry/fexit programs as well.
While at it, mark bpf_tracing_func_proto() as static to make it obvious that
it's only used from within the kernel/trace/bpf_trace.c.
Andrii Nakryiko [Sun, 15 Aug 2021 07:05:58 +0000 (00:05 -0700)]
bpf: Allow to specify user-provided bpf_cookie for BPF perf links
Add ability for users to specify custom u64 value (bpf_cookie) when creating
BPF link for perf_event-backed BPF programs (kprobe/uprobe, perf_event,
tracepoints).
This is useful for cases when the same BPF program is used for attaching and
processing invocation of different tracepoints/kprobes/uprobes in a generic
fashion, but such that each invocation is distinguished from each other (e.g.,
BPF program can look up additional information associated with a specific
kernel function without having to rely on function IP lookups). This enables
new use cases to be implemented simply and efficiently that previously were
possible only through code generation (and thus multiple instances of almost
identical BPF program) or compilation at runtime (BCC-style) on target hosts
(even more expensive resource-wise). For uprobes it is not even possible in
some cases to know function IP before hand (e.g., when attaching to shared
library without PID filtering, in which case base load address is not known
for a library).
This is done by storing u64 bpf_cookie in struct bpf_prog_array_item,
corresponding to each attached and run BPF program. Given cgroup BPF programs
already use two 8-byte pointers for their needs and cgroup BPF programs don't
have (yet?) support for bpf_cookie, reuse that space through union of
cgroup_storage and new bpf_cookie field.
Make it available to kprobe/tracepoint BPF programs through bpf_trace_run_ctx.
This is set by BPF_PROG_RUN_ARRAY, used by kprobe/uprobe/tracepoint BPF
program execution code, which luckily is now also split from
BPF_PROG_RUN_ARRAY_CG. This run context will be utilized by a new BPF helper
giving access to this user-provided cookie value from inside a BPF program.
Generic perf_event BPF programs will access this value from perf_event itself
through passed in BPF program context.
Andrii Nakryiko [Sun, 15 Aug 2021 07:05:57 +0000 (00:05 -0700)]
bpf: Implement minimal BPF perf link
Introduce a new type of BPF link - BPF perf link. This brings perf_event-based
BPF program attachments (perf_event, tracepoints, kprobes, and uprobes) into
the common BPF link infrastructure, allowing to list all active perf_event
based attachments, auto-detaching BPF program from perf_event when link's FD
is closed, get generic BPF link fdinfo/get_info functionality.
BPF_LINK_CREATE command expects perf_event's FD as target_fd. No extra flags
are currently supported.
Force-detaching and atomic BPF program updates are not yet implemented, but
with perf_event-based BPF links we now have common framework for this without
the need to extend ioctl()-based perf_event interface.
One interesting consideration is a new value for bpf_attach_type, which
BPF_LINK_CREATE command expects. Generally, it's either 1-to-1 mapping from
bpf_attach_type to bpf_prog_type, or many-to-1 mapping from a subset of
bpf_attach_types to one bpf_prog_type (e.g., see BPF_PROG_TYPE_SK_SKB or
BPF_PROG_TYPE_CGROUP_SOCK). In this case, though, we have three different
program types (KPROBE, TRACEPOINT, PERF_EVENT) using the same perf_event-based
mechanism, so it's many bpf_prog_types to one bpf_attach_type. I chose to
define a single BPF_PERF_EVENT attach type for all of them and adjust
link_create()'s logic for checking correspondence between attach type and
program type.
The alternative would be to define three new attach types (e.g., BPF_KPROBE,
BPF_TRACEPOINT, and BPF_PERF_EVENT), but that seemed like unnecessary overkill
and BPF_KPROBE will cause naming conflicts with BPF_KPROBE() macro, defined by
libbpf. I chose to not do this to avoid unnecessary proliferation of
bpf_attach_type enum values and not have to deal with naming conflicts.
Andrii Nakryiko [Sun, 15 Aug 2021 07:05:56 +0000 (00:05 -0700)]
bpf: Refactor perf_event_set_bpf_prog() to use struct bpf_prog input
Make internal perf_event_set_bpf_prog() use struct bpf_prog pointer as an
input argument, which makes it easier to re-use for other internal uses
(coming up for BPF link in the next patch). BPF program FD is not as
convenient and in some cases it's not available. So switch to struct bpf_prog,
move out refcounting outside and let caller do bpf_prog_put() in case of an
error. This follows the approach of most of the other BPF internal functions.
Andrii Nakryiko [Sun, 15 Aug 2021 07:05:55 +0000 (00:05 -0700)]
bpf: Refactor BPF_PROG_RUN_ARRAY family of macros into functions
Similar to BPF_PROG_RUN, turn BPF_PROG_RUN_ARRAY macros into proper functions
with all the same readability and maintainability benefits. Making them into
functions required shuffling around bpf_set_run_ctx/bpf_reset_run_ctx
functions. Also, explicitly specifying the type of the BPF prog run callback
required adjusting __bpf_prog_run_save_cb() to accept const void *, casted
internally to const struct sk_buff.
Further, split out a cgroup-specific BPF_PROG_RUN_ARRAY_CG and
BPF_PROG_RUN_ARRAY_CG_FLAGS from the more generic BPF_PROG_RUN_ARRAY due to
the differences in bpf_run_ctx used for those two different use cases.
I think BPF_PROG_RUN_ARRAY_CG would benefit from further refactoring to accept
struct cgroup and enum bpf_attach_type instead of bpf_prog_array, fetching
cgrp->bpf.effective[type] and RCU-dereferencing it internally. But that
required including include/linux/cgroup-defs.h, which I wasn't sure is ok with
everyone.
The remaining generic BPF_PROG_RUN_ARRAY function will be extended to
pass-through user-provided context value in the next patch.
Andrii Nakryiko [Sun, 15 Aug 2021 07:05:54 +0000 (00:05 -0700)]
bpf: Refactor BPF_PROG_RUN into a function
Turn BPF_PROG_RUN into a proper always inlined function. No functional and
performance changes are intended, but it makes it much easier to understand
what's going on with how BPF programs are actually get executed. It's more
obvious what types and callbacks are expected. Also extra () around input
parameters can be dropped, as well as `__` variable prefixes intended to avoid
naming collisions, which makes the code simpler to read and write.
This refactoring also highlighted one extra issue. BPF_PROG_RUN is both
a macro and an enum value (BPF_PROG_RUN == BPF_PROG_TEST_RUN). Turning
BPF_PROG_RUN into a function causes naming conflict compilation error. So
rename BPF_PROG_RUN into lower-case bpf_prog_run(), similar to
bpf_prog_run_xdp(), bpf_prog_run_pin_on_cpu(), etc. All existing callers of
BPF_PROG_RUN, the macro, are switched to bpf_prog_run() explicitly.
Andrii Nakryiko [Sun, 15 Aug 2021 07:13:33 +0000 (00:13 -0700)]
Merge branch 'BPF iterator for UNIX domain socket.'
Kuniyuki Iwashima says:
====================
This patch set adds BPF iterator support for UNIX domain socket. The first
patch implements it, and the second adds "%c" support for BPF_SEQ_PRINTF().
Thanks to Yonghong Song for the fix [0] for the LLVM code gen. The fix
prevents the LLVM compiler from transforming the loop exit condition '<' to
'!=', where the upper bound is not a constant. The transformation leads
the verifier to interpret it as an infinite loop.
And thanks to Andrii Nakryiko for its workaround [1].
Changelog:
v6:
- Align the header "Inde" column
- Change int vars to __u64 not to break test_progs-no_alu32
- Move the if statement into the for loop not to depend on the fix [0]
- Drop the README change
- Modify "%c" positive test patterns
v5:
https://lore.kernel.org/netdev/20210812164557.79046-1-kuniyu@amazon.co.jp/
- Align header line of bpf_iter_unix.c
- Add test for "%c"
v4:
https://lore.kernel.org/netdev/20210810092807.13190-1-kuniyu@amazon.co.jp/
- Check IS_BUILTIN(CONFIG_UNIX)
- Support "%c" in BPF_SEQ_PRINTF()
- Uncomment the code to print the name of the abstract socket
- Mention the LLVM fix in README.rst
- Remove the 'aligned' attribute in bpf_iter.h
- Keep the format string on a single line
v3:
https://lore.kernel.org/netdev/20210804070851.97834-1-kuniyu@amazon.co.jp/
- Export some functions for CONFIG_UNIX=m
v2:
https://lore.kernel.org/netdev/20210803011110.21205-1-kuniyu@amazon.co.jp/
- Implement bpf_iter specific seq_ops->stop()
- Add bpf_iter__unix in bpf_iter.h
- Move common definitions in selftest to bpf_tracing_net.h
- Include the code for abstract UNIX domain socket as comment in selftest
- Use ASSERT_OK_PTR() instead of CHECK()
- Make ternary operators on single line
The iterator can output almost the same result compared to /proc/net/unix.
The header line is aligned, and the Inode column uses "%8lu" because "%5lu"
can be easily overflown.
/proc/net/unix uses "%c" to print a single-byte character to escape '\0' in
the name of the abstract UNIX domain socket. The following selftest uses
it, so this patch adds support for "%c". Note that it does not support
wide character ("%lc" and "%llc") for simplicity.
bpf: af_unix: Implement BPF iterator for UNIX domain socket.
This patch implements the BPF iterator for the UNIX domain socket.
Currently, the batch optimisation introduced for the TCP iterator in the
commit 04c7820b776f ("bpf: tcp: Bpf iter batching and lock_sock") is not
used for the UNIX domain socket. It will require replacing the big lock
for the hash table with small locks for each hash list not to block other
processes.
Andrii Nakryiko [Sat, 14 Aug 2021 00:49:24 +0000 (17:49 -0700)]
Merge branch 'bpf: Allow bpf_get_netns_cookie in BPF_PROG_TYPE_CGROUP_SOCKOPT'
Stanislav Fomichev says:
====================
We'd like to be able to identify netns from setsockopt hooks
to be able to do the enforcement of some options only in the
"initial" netns (to give users the ability to create clear/isolated
sandboxes if needed without any enforcement by doing unshare(net)).
v3:
- remove extra 'ctx->skb == NULL' check (Martin KaFai Lau)
- rework test to make sure the helper is really called, not just
verified
Ilya Leoshkevich [Thu, 12 Aug 2021 22:48:14 +0000 (00:48 +0200)]
selftests/bpf: Fix test_core_autosize on big-endian machines
The "probed" part of test_core_autosize copies an integer using
bpf_core_read() into an integer of a potentially different size.
On big-endian machines a destination offset is required for this to
produce a sensible result.
Hao Luo [Thu, 12 Aug 2021 00:38:19 +0000 (17:38 -0700)]
libbpf: Support weak typed ksyms.
Currently weak typeless ksyms have default value zero, when they don't
exist in the kernel. However, weak typed ksyms are rejected by libbpf
if they can not be resolved. This means that if a bpf object contains
the declaration of a nonexistent weak typed ksym, it will be rejected
even if there is no program that references the symbol.
Nonexistent weak typed ksyms can also default to zero just like
typeless ones. This allows programs that access weak typed ksyms to be
accepted by verifier, if the accesses are guarded. For example,
extern const int bpf_link_fops3 __ksym __weak;
/* then in BPF program */
if (&bpf_link_fops3) {
/* use bpf_link_fops3 */
}
If actual use of nonexistent typed ksym is not guarded properly,
verifier would see that register is not PTR_TO_BTF_ID and wouldn't
allow to use it for direct memory reads or passing it to BPF helpers.
Jussi Maki [Wed, 11 Aug 2021 12:36:27 +0000 (12:36 +0000)]
selftests/bpf: Fix running of XDP bonding tests
An "innocent" cleanup in the last version of the XDP bonding patchset moved
the "test__start_subtest" calls to the test main function, but I forgot to
reverse the condition, which lead to all tests being skipped. Fix it.
Jussi Maki [Thu, 12 Aug 2021 14:52:41 +0000 (14:52 +0000)]
net, bonding: Disallow vlan+srcmac with XDP
The new vlan+srcmac xmit policy is not implementable with XDP since
in many cases the 802.1Q payload is not present in the packet. This
can be for example due to hardware offload or in the case of veth
due to use of skbuffs internally.
This also fixes the NULL deref with the vlan+srcmac xmit policy
reported by Jonathan Toppins by additionally checking the skb
pointer.
Fixes: a815bde56b15 ("net, bonding: Refactor bond_xmit_hash for use with xdp_buff") Reported-by: Jonathan Toppins <jtoppins@redhat.com> Signed-off-by: Jussi Maki <joamaki@gmail.com> Reviewed-by: Jonathan Toppins <jtoppins@redhat.com> Link: https://lore.kernel.org/r/20210812145241.12449-1-joamaki@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
drivers/net/ethernet/broadcom/bnxt/bnxt_ptp.h 9e26680733d5 ("bnxt_en: Update firmware call to retrieve TX PTP timestamp") 9e518f25802c ("bnxt_en: 1PPS functions to configure TSIO pins") 099fdeda659d ("bnxt_en: Event handler for PPS events")
kernel/bpf/helpers.c
include/linux/bpf-cgroup.h a2baf4e8bb0f ("bpf: Fix potentially incorrect results with bpf_get_local_storage()") c7603cfa04e7 ("bpf: Add ambient BPF runtime context stored in current")
drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c 5957cc557dc5 ("net/mlx5: Set all field of mlx5_irq before inserting it to the xarray") 2d0b41a37679 ("net/mlx5: Refcount mlx5_irq with integer")
MAINTAINERS 7b637cd52f02 ("MAINTAINERS: fix Microchip CAN BUS Analyzer Tool entry typo") 7d901a1e878a ("net: phy: add Maxlinear GPY115/21x/24x driver")
Linus Torvalds [Fri, 13 Aug 2021 02:24:03 +0000 (16:24 -1000)]
Merge tag 'net-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Networking fixes, including fixes from netfilter, bpf, can and
ieee802154.
The size of this is pretty normal, but we got more fixes for 5.14
changes this week than last week. Nothing major but the trend is the
opposite of what we like. We'll see how the next week goes..
Current release - regressions:
- r8169: fix ASPM-related link-up regressions
- bridge: fix flags interpretation for extern learn fdb entries
- phy: micrel: fix link detection on ksz87xx switch
- Revert "tipc: Return the correct errno code"
- ptp: fix possible memory leak caused by invalid cast
Current release - new code bugs:
- bpf: add missing bpf_read_[un]lock_trace() for syscall program
- bpf: fix potentially incorrect results with bpf_get_local_storage()
- page_pool: mask the page->signature before the checking, avoid dma
mapping leaks
- netfilter: nfnetlink_hook: 5 fixes to information in netlink dumps
- bnxt_en: fix firmware interface issues with PTP
- mlx5: Bridge, fix ageing time
Previous releases - regressions:
- linkwatch: fix failure to restore device state across
suspend/resume
- bareudp: fix invalid read beyond skb's linear data
Previous releases - always broken:
- bpf: fix integer overflow involving bucket_size
- ppp: fix issues when desired interface name is specified via
netlink
- wwan: mhi_wwan_ctrl: fix possible deadlock
- dsa: microchip: ksz8795: fix number of VLAN related bugs
- dsa: drivers: fix broken backpressure in .port_fdb_dump
- dsa: qca: ar9331: make proper initial port defaults
Misc:
- bpf: add lockdown check for probe_write_user helper
- netfilter: conntrack: remove offload_pickup sysctl before 5.14 is
out
- netfilter: conntrack: collect all entries in one cycle,
heuristically slow down garbage collection scans on idle systems to
prevent frequent wake ups"
* tag 'net-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (87 commits)
vsock/virtio: avoid potential deadlock when vsock device remove
wwan: core: Avoid returning NULL from wwan_create_dev()
net: dsa: sja1105: unregister the MDIO buses during teardown
Revert "tipc: Return the correct errno code"
net: mscc: Fix non-GPL export of regmap APIs
net: igmp: increase size of mr_ifc_count
MAINTAINERS: switch to my OMP email for Renesas Ethernet drivers
tcp_bbr: fix u32 wrap bug in round logic if bbr_init() called after 2B packets
net: pcs: xpcs: fix error handling on failed to allocate memory
net: linkwatch: fix failure to restore device state across suspend/resume
net: bridge: fix memleak in br_add_if()
net: switchdev: zero-initialize struct switchdev_notifier_fdb_info emitted by drivers towards the bridge
net: bridge: fix flags interpretation for extern learn fdb entries
net: dsa: sja1105: fix broken backpressure in .port_fdb_dump
net: dsa: lantiq: fix broken backpressure in .port_fdb_dump
net: dsa: lan9303: fix broken backpressure in .port_fdb_dump
net: dsa: hellcreek: fix broken backpressure in .port_fdb_dump
bpf, core: Fix kernel-doc notation
net: igmp: fix data-race in igmp_ifc_timer_expire()
net: Fix memory leak in ieee802154_raw_deliver
...
Linus Torvalds [Fri, 13 Aug 2021 02:16:01 +0000 (16:16 -1000)]
Merge tag 'ceph-for-5.14-rc6' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"A patch to avoid a soft lockup in ceph_check_delayed_caps() from Luis
and a reference handling fix from Jeff that should address some memory
corruption reports in the snaprealm area.
Both marked for stable"
* tag 'ceph-for-5.14-rc6' of git://github.com/ceph/ceph-client:
ceph: take snap_empty_lock atomically with snaprealm refcount change
ceph: reduce contention in ceph_check_delayed_caps()
i915:
- GVT fix for Windows VM hang.
- Display fix of 12 BPC bits for display 12 and newer.
- Don't try to access some media register for fused off domains.
- Fix kerneldoc build warnings.
* tag 'drm-fixes-2021-08-13' of git://anongit.freedesktop.org/drm/drm:
drm/doc/rfc: drop lmem uapi section
drm/i915: Only access SFC_DONE when media domain is not fused off
drm/i915/display: Fix the 12 BPC bits for PIPE_MISC reg
drm/amd/display: use GFP_ATOMIC in amdgpu_dm_irq_schedule_work
drm/amd/display: Remove invalid assert for ODM + MPC case
drm/amd/pm: bug fix for the runtime pm BACO
drm/amdgpu: handle VCN instances when harvesting (v2)
drm/meson: fix colour distortion from HDR set during vendor u-boot
drm/i915/gvt: Fix cached atomics setting for Windows VM
drm/amdgpu: Add preferred mode in modeset when freesync video mode's enabled.
drm/amd/pm: Fix a memory leak in an error handling path in 'vangogh_tables_init()'
drm/amdgpu: don't enable baco on boco platforms in runpm
drm/amdgpu: set RAS EEPROM address from VBIOS
drm/amd/pm: update smu v13.0.1 firmware header
drm/mediatek: Fix cursor plane no update
drm/mediatek: mtk-dpi: Set out_fmt from config if not the last bridge
drm/mediatek: dpi: Fix NULL dereference in mtk_dpi_bridge_atomic_check
Alex Elder [Wed, 11 Aug 2021 14:18:02 +0000 (09:18 -0500)]
dt-bindings: net: qcom,ipa: make imem interconnect optional
On some newer SoCs, the interconnect between IPA and SoC internal
memory (imem) is not used. Update the binding to indicate that
having just the memory and config interconnects is another allowed
configuration.
It isn't required, but all callers of ipa_aggr_granularity_val()
pass a constant value (IPA_AGGR_GRANULARITY) as the usec argument.
Two of those callers are in ipa_validate_build(), with the result
being passed to BUILD_BUG_ON().
Evidently the "sparc64-linux-gcc" compiler (at least) doesn't always
inline ipa_aggr_granularity_val(), so the result of the function is
not constant at compile time, and that leads to build errors.
Define the function with the __always_inline attribute to avoid the
errors. We can see by inspection that the value passed is never
zero, so we can just remove its WARN_ON() call.
Fixes: 5bc5588466a1f ("net: ipa: use WARN_ON() rather than assertions") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Alex Elder <elder@linaro.org> Link: https://lore.kernel.org/r/20210811135948.2634264-1-elder@linaro.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Dave Airlie [Thu, 12 Aug 2021 20:29:12 +0000 (06:29 +1000)]
Merge tag 'drm-intel-fixes-2021-08-12' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes
- GVT fix for Windows VM hang.
- Display fix of 12 BPC bits for display 12 and newer.
- Don't try to access some media register for fused off domains.
- Fix kerneldoc build warnings.
Jakub Kicinski [Thu, 12 Aug 2021 18:50:16 +0000 (11:50 -0700)]
Merge tag 'ieee802154-for-davem-2021-08-12' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan
Stefan Schmidt says:
====================
ieee802154 for net 2021-08-12
Mostly fixes coming from bot reports. Dongliang Mu tackled some syzkaller
reports in hwsim again and Takeshi Misawa a memory leak in ieee802154 raw.
* tag 'ieee802154-for-davem-2021-08-12' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan:
net: Fix memory leak in ieee802154_raw_deliver
ieee802154: hwsim: fix GPF in hwsim_new_edge_nl
ieee802154: hwsim: fix GPF in hwsim_set_edge_lqi
====================
lock_sock() may do initiative schedule when the 'sk' is owned by
other thread at the same time, we would receivce a warning message
that "scheduling while atomic".
Even worse, if the next task (selected by the scheduler) try to
release a 'sk', it need to request vsock_table_lock and the deadlock
occur, cause the system into softlockup state.
Call trace:
queued_spin_lock_slowpath
vsock_remove_bound
vsock_remove_sock
virtio_transport_release
__vsock_release
vsock_release
__sock_release
sock_close
__fput
____fput
So we should not require sk_lock in this case, just like the behavior
in vhost_vsock or vmci.
Linus Torvalds [Thu, 12 Aug 2021 17:20:16 +0000 (07:20 -1000)]
Merge branch 'for-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull ucounts fix from Eric Biederman:
"This fixes the ucount sysctls on big endian architectures.
The counts were expanded to be longs instead of ints, and the sysctl
code was overlooked, so only the low 32bit were being processed. On
litte endian just processing the low 32bits is fine, but on 64bit big
endian processing just the low 32bits results in the high order bits
instead of the low order bits being processed and nothing works
proper.
This change took a little bit to mature as we have the SYSCTL_ZERO,
and SYSCTL_INT_MAX macros that are only usable for sysctls operating
on ints, but unfortunately are not obviously broken. Which resulted in
the versions of this change working on big endian and not on little
endian, because the int SYSCTL_ZERO when extended 64bit wound up being
0x100000000. So we only allowed values greater than 0x100000000 and
less than 0faff. Which unfortunately broken everything that tried to
set the sysctls. (First reported with the windows subsystem for
linux).
I have tested this on x86_64 64bit after first reproducing the
problems with the earlier version of this change, and then verifying
the problems do not exist when we use appropriate long min and max
values for extra1 and extra2"
* 'for-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
ucounts: add missing data type changes
Andy Shevchenko [Wed, 11 Aug 2021 12:48:45 +0000 (15:48 +0300)]
wwan: core: Avoid returning NULL from wwan_create_dev()
Make wwan_create_dev() to return either valid or error pointer,
In some cases it may return NULL. Prevent this by converting
it to the respective error pointer.
Fixes: 9a44c1cc6388 ("net: Add a WWAN subsystem") Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Sergey Ryazanov <ryazanov.s.a@gmail.com> Reviewed-by: Loic Poulain <loic.poulain@linaro.org> Link: https://lore.kernel.org/r/20210811124845.10955-1-andriy.shevchenko@linux.intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David S. Miller [Thu, 12 Aug 2021 11:45:41 +0000 (12:45 +0100)]
Merge tag 'mlx5-updates-2021-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5 updates 2021-08-11
This series provides misc updates to mlx5.
For more information please see tag log below.
Please pull and let me know if there is any problem.
mlx5-updates-2021-08-11
Misc. cleanup for mlx5.
1) Typos and use of netdev_warn()
2) smatch cleanup
3) Minor fix to inner TTC table creation
4) Dynamic capability cache allocation
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 12 Aug 2021 10:46:21 +0000 (11:46 +0100)]
Merge branch 'dsa-cross-chip-notifiers'
Vladimir Oltean says:
====================
Improvements to the DSA tag_8021q cross-chip notifiers
This series improves cross-chip notifier error messages and addresses a
benign error message seen during reboot on a system with disjoint DSA
trees.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Wed, 11 Aug 2021 13:46:06 +0000 (16:46 +0300)]
net: dsa: tag_8021q: don't broadcast during setup/teardown
Currently, on my board with multiple sja1105 switches in disjoint trees
described in commit f66a6a69f97a ("net: dsa: permit cross-chip bridging
between all trees in the system"), rebooting the board triggers the
following benign warnings:
[ 12.345566] sja1105 spi2.0: port 0 failed to notify tag_8021q VLAN 1088 deletion: -ENOENT
[ 12.353804] sja1105 spi2.0: port 0 failed to notify tag_8021q VLAN 2112 deletion: -ENOENT
[ 12.362019] sja1105 spi2.0: port 1 failed to notify tag_8021q VLAN 1089 deletion: -ENOENT
[ 12.370246] sja1105 spi2.0: port 1 failed to notify tag_8021q VLAN 2113 deletion: -ENOENT
[ 12.378466] sja1105 spi2.0: port 2 failed to notify tag_8021q VLAN 1090 deletion: -ENOENT
[ 12.386683] sja1105 spi2.0: port 2 failed to notify tag_8021q VLAN 2114 deletion: -ENOENT
Basically switch 1 calls dsa_tag_8021q_unregister, and switch 1's TX and
RX VLANs cannot be found on switch 2's CPU port.
But why would switch 2 even attempt to delete switch 1's TX and RX
tag_8021q VLANs from its CPU port? Well, because we use dsa_broadcast,
and it is supposed that it had added those VLANs in the first place
(because in dsa_port_tag_8021q_vlan_match, all CPU ports match
regardless of their tree index or switch index).
The two trees probe asynchronously, and when switch 1 probed, it called
dsa_broadcast which did not notify the tree of switch 2, because that
didn't probe yet. But during unbind, switch 2's tree _is_ probed, so it
_is_ notified of the deletion.
Before jumping to introduce a synchronization mechanism between the
probing across disjoint switch trees, let's take a step back and see
whether we _need_ to do that in the first place.
The RX and TX VLANs of switch 1 would be needed on switch 2's CPU port
only if switch 1 and 2 were part of a cross-chip bridge. And
dsa_tag_8021q_bridge_join takes care precisely of that (but if probing
was synchronous, the bridge_join would just end up bumping the VLANs'
refcount, because they are already installed by the setup path).
Since by the time the ports are bridged, all DSA trees are already set
up, and we don't need the tag_8021q VLANs of one switch installed on the
other switches during probe time, the answer is that we don't need to
fix the synchronization issue.
So make the setup and teardown code paths call dsa_port_notify, which
notifies only the local tree, and the bridge code paths call
dsa_broadcast, which let the other trees know as well.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Wed, 11 Aug 2021 13:46:05 +0000 (16:46 +0300)]
net: dsa: print more information when a cross-chip notifier fails
Currently this error message does not say a lot:
[ 32.693498] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT
[ 32.699725] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT
[ 32.705931] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT
[ 32.712139] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT
[ 32.718347] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT
[ 32.724554] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT
but in this form, it is immediately obvious (at least to me) what the
problem is, even without further looking at the code:
[ 12.345566] sja1105 spi2.0: port 0 failed to notify tag_8021q VLAN 1088 deletion: -ENOENT
[ 12.353804] sja1105 spi2.0: port 0 failed to notify tag_8021q VLAN 2112 deletion: -ENOENT
[ 12.362019] sja1105 spi2.0: port 1 failed to notify tag_8021q VLAN 1089 deletion: -ENOENT
[ 12.370246] sja1105 spi2.0: port 1 failed to notify tag_8021q VLAN 2113 deletion: -ENOENT
[ 12.378466] sja1105 spi2.0: port 2 failed to notify tag_8021q VLAN 1090 deletion: -ENOENT
[ 12.386683] sja1105 spi2.0: port 2 failed to notify tag_8021q VLAN 2114 deletion: -ENOENT
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Vetter [Tue, 10 Aug 2021 14:27:48 +0000 (16:27 +0200)]
drm/doc/rfc: drop lmem uapi section
We still have quite a bit more work to do with overall reworking of
the ttm-based dg1 code, but the uapi stuff is now finalized with the
latest pull. So remove that.
This also fixes kerneldoc build warnings because we've included the
same headers in two places, resulting in sphinx complaining about
duplicated symbols. This regression has been created when we moved the
uapi definitions to the real include/uapi/ folder in 727ecd99a4c9
("drm/doc/rfc: drop the i915_gem_lmem.h header")
v2: Fix a few references that I missed, the htmldocs build took
forever.
Acked-by: Jason Ekstrand <jason@jlekstrand.net> Acked-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Tested-by Stephen Rothwell <sfr@canb.auug.org.au> (v1)
References: https://lore.kernel.org/dri-devel/20210603193242.1ce99344@canb.auug.org.au/ Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Fixes: 727ecd99a4c9 ("drm/doc/rfc: drop the i915_gem_lmem.h header") Cc: Matthew Auld <matthew.auld@intel.com> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20210810142748.1983271-1-daniel.vetter@ffwll.ch
(cherry picked from commit dae2d28832968751f7731336b560a4a84a197b76) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Matt Roper [Fri, 6 Aug 2021 17:41:30 +0000 (10:41 -0700)]
drm/i915: Only access SFC_DONE when media domain is not fused off
The SFC_DONE register lives within the corresponding VD0/VD2/VD4/VD6
forcewake domain and is not accessible if the vdbox in that domain is
fused off and the forcewake is not initialized.
This mistake went unnoticed because until recently we were using the
wrong register offset for the SFC_DONE register; once the register
offset was corrected, we started hitting errors like
Andy Shevchenko [Wed, 11 Aug 2021 13:39:32 +0000 (16:39 +0300)]
wwan: core: Unshadow error code returned by ida_alloc_range()
ida_alloc_range() may return other than -ENOMEM error code.
Unshadow it in the wwan_create_port().
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Sergey Ryazanov <ryazanov.s.a@gmail.com> Reviewed-by: Loic Poulain <loic.poulain@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Ankit Nautiyal [Wed, 11 Aug 2021 05:18:57 +0000 (10:48 +0530)]
drm/i915/display: Fix the 12 BPC bits for PIPE_MISC reg
Till DISPLAY12 the PIPE_MISC bits 5-7 are used to set the
Dithering BPC, with valid values of 6, 8, 10 BPC.
For ADLP+ these bits are used to set the PORT OUTPUT BPC, with valid
values of: 6, 8, 10, 12 BPC, and need to be programmed whether
dithering is enabled or not.
This patch:
-corrects the bits 5-7 for PIPE MISC register for 12 BPC.
-renames the bits and mask to have generic names for these bits for
dithering bpc and port output bpc.
v3: Added a note for MIPI DSI which uses the PIPE_MISC for readout
for pipe_bpp. (Uma Shankar)
v2: Added 'display' to the subject and fixes tag. (Uma Shankar)
Fixes: 756f85cffef2 ("drm/i915/bdw: Broadwell has PIPEMISC") Cc: Paulo Zanoni <paulo.r.zanoni@intel.com> (v1) Cc: Ville Syrjälä <ville.syrjala@linux.intel.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: intel-gfx@lists.freedesktop.org Cc: <stable@vger.kernel.org> # v3.13+ Signed-off-by: Ankit Nautiyal <ankit.k.nautiyal@intel.com> Reviewed-by: Uma Shankar <uma.shankar@intel.com> Signed-off-by: Uma Shankar <uma.shankar@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20210811051857.109723-1-ankit.k.nautiyal@intel.com
(cherry picked from commit 70418a68713c13da3f36c388087d0220b456a430) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Vladimir Oltean [Wed, 11 Aug 2021 11:59:45 +0000 (14:59 +0300)]
net: dsa: sja1105: unregister the MDIO buses during teardown
The call to sja1105_mdiobus_unregister is present in the error path but
absent from the main driver unbind path.
Fixes: 5a8f09748ee7 ("net: dsa: sja1105: register the MDIO buses for 100base-T1 and 100base-TX") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
DENG Qingfang [Wed, 11 Aug 2021 09:50:43 +0000 (17:50 +0800)]
net: dsa: mt7530: fix VLAN traffic leaks again
When a port leaves a VLAN-aware bridge, the current code does not clear
other ports' matrix field bit. If the bridge is later set to VLAN-unaware
mode, traffic in the bridge may leak to that port.
Remove the VLAN filtering check in mt7530_port_bridge_leave.
Fixes: 474a2ddaa192 ("net: dsa: mt7530: fix VLAN traffic leaks") Fixes: 83163f7dca56 ("net: dsa: mediatek: add VLAN support for MT7530") Signed-off-by: DENG Qingfang <dqfext@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 12 Aug 2021 09:50:33 +0000 (10:50 +0100)]
Merge branch 'pktgen-imix'
Nick Richardson says:
====================
pktgen: Add IMIX mode
Adds internet mix (IMIX) mode to pktgen. Internet mix is
included in many user-space network perf testing tools. It allows
for the user to specify a distribution of discrete packet sizes to be
generated. This type of test is common among vendors when perf testing
their devices. link: https://datatracker.ietf.org/doc/html/rfc2544#section-9.1]
This allows users to get a
more complete picture of how their device will perform in the
real-world.
This feature adds a command that allows users to specify an imix
distribution in the following format:
imix_weights size_1,weight_1 size_2,weight_2 ... size_n,weight_n
The distribution of packets with size_i will be
(weight_i / total_weights) where
total_weights = weight_1 + weight_2 + ... + weight_n
For example:
imix_weights 40,7 576,4 1500,1
The pkt_size "40" will account for 7 / (7 + 4 + 1) = ~58% of the total
packets sent.
This patch was tested with the following:
1. imix_weights = 40,7 576,4 1500,1
2. imix_weights = 0,7 576,4 1500,1
- Packet size of 0 is resized to the minimum, 42
3. imix_weights = 40,7 576,4 1500,1 count = 0
- Zero count.
- Runs until user stops pktgen.
Invalid Configurations
1. clone_skb = 200 imix_weights = 40,7 576,4 1500,1
- Returns error code -524 (-ENOTSUPP) when setting imix_weights
2. len(imix_weights) > MAX_IMIX_ENTRIES
- Returns -7 (-E2BIG)
This patch is split into three parts, each provide different aspects of
required functionality:
1. Parse internet mix input.
2. Add IMIX Distribution representation.
3. Process and output IMIX results.
Changes in v2:
* Remove __ prefix outside of uAPI.
* Use seq_puts instead of seq_printf where necessary.
* Reorder variable declaration.
* Return -EINVAL instead of -ENOTSUPP when using IMIX with clone_skb > 0
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Nick Richardson [Tue, 10 Aug 2021 19:01:55 +0000 (19:01 +0000)]
pktgen: Add output for imix results
The bps for imix mode is calculated by:
sum(imix_entry.size) / time_elapsed
The actual counts of each imix_entry are displayed under the
"Current:" section of the interface output in the following format:
imix_size_counts: size_1,count_1 size_2,count_2 ... size_n,count_n
Nick Richardson [Tue, 10 Aug 2021 19:01:54 +0000 (19:01 +0000)]
pktgen: Add imix distribution bins
In order to represent the distribution of imix packet sizes, a
pre-computed data structure is used. It features 100 (IMIX_PRECISION)
"bins". Contiguous ranges of these bins represent the respective
packet size of each imix entry. This is done to avoid the overhead of
selecting the correct imix packet size based on the corresponding weights.
pkt_size 40 occurs 7/total_weight = 58% of the time
pkt_size 576 occurs 4/total_weight = 33% of the time
pkt_size 1500 occurs 1/total_weight = 9% of the time
We generate a random number between 0-100 and select the corresponding
packet size based on the specified weights.
Eg. random number = 358723895 % 100 = 65
Selects the packet size corresponding to index:65 in the pre-computed
imix_distribution array.
An example of the pre-computed array is below:
The imix_distribution will look like the following:
0 -> 0 (index of imix_entry.size == 40)
1 -> 0 (index of imix_entry.size == 40)
2 -> 0 (index of imix_entry.size == 40)
[...] -> 0 (index of imix_entry.size == 40)
57 -> 0 (index of imix_entry.size == 40)
58 -> 1 (index of imix_entry.size == 576)
[...] -> 1 (index of imix_entry.size == 576)
90 -> 1 (index of imix_entry.size == 576)
91 -> 2 (index of imix_entry.size == 1500)
[...] -> 2 (index of imix_entry.size == 1500)
99 -> 2 (index of imix_entry.size == 1500)
Create and use "bin" representation of the imix distribution.
Signed-off-by: Nick Richardson <richardsonnick@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Nick Richardson [Tue, 10 Aug 2021 19:01:53 +0000 (19:01 +0000)]
pktgen: Parse internet mix (imix) input
Adds "imix_weights" command for specifying internet mix distribution.
The command is in this format:
"imix_weights size_1,weight_1 size_2,weight_2 ... size_n,weight_n"
where the probability that packet size_i is picked is:
weight_i / (weight_1 + weight_2 + .. + weight_n)
The user may provide up to 100 imix entries (size_i,weight_i) in this
command.
The user specified imix entries will be displayed in the "Params"
section of the interface output.
Values for clone_skb > 0 is not supported in IMIX mode.
Summary of changes:
Add flag for enabling internet mix mode.
Add command (imix_weights) for internet mix input.
Return -ENOTSUPP when clone_skb > 0 in IMIX mode.
Display imix_weights in Params.
Create data structures to store imix entries and distribution.
Signed-off-by: Nick Richardson <richardsonnick@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Hoang Le [Wed, 11 Aug 2021 01:22:09 +0000 (08:22 +0700)]
Revert "tipc: Return the correct errno code"
This reverts commit 0efea3c649f0 because of:
- The returning -ENOBUF error is fine on socket buffer allocation.
- There is side effect in the calling path
tipc_node_xmit()->tipc_link_xmit() when checking error code returning.
Fixes: 0efea3c649f0 ("tipc: Return the correct errno code") Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Mark Brown [Tue, 10 Aug 2021 12:37:48 +0000 (13:37 +0100)]
net: mscc: Fix non-GPL export of regmap APIs
The ocelot driver makes use of regmap, wrapping it with driver specific
operations that are thin wrappers around the core regmap APIs. These are
exported with EXPORT_SYMBOL, dropping the _GPL from the core regmap
exports which is frowned upon. Add _GPL suffixes to at least the APIs that
are doing register I/O.
Signed-off-by: Mark Brown <broonie@kernel.org> Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Thu, 12 Aug 2021 05:56:10 +0000 (19:56 -1000)]
Merge tag 'seccomp-v5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull seccomp fixes from Kees Cook:
- Fix typo in user notification documentation (Rodrigo Campos)
- Fix userspace counter report when using TSYNC (Hsuan-Chi Kuo, Wiktor
Garbacz)
* tag 'seccomp-v5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
seccomp: Fix setting loaded filter count during TSYNC
Documentation: seccomp: Fix typo in user notification
net: bridge: vlan: fix global vlan option range dumping
When global vlan options are equal sequentially we compress them in a
range to save space and reduce processing time. In order to have the
proper range end id we need to update range_end if the options are equal
otherwise we get ranges with the same end vlan id as the start.
Jeremy Kerr [Tue, 10 Aug 2021 02:38:34 +0000 (10:38 +0800)]
mctp: Specify route types, require rtm_type in RTM_*ROUTE messages
This change adds a 'type' attribute to routes, which can be parsed from
a RTM_NEWROUTE message. This will help to distinguish local vs. peer
routes in a future change.
This means userspace will need to set a correct rtm_type in RTM_NEWROUTE
and RTM_DELROUTE messages; we currently only accept RTN_UNICAST.
Yufeng Mo [Tue, 10 Aug 2021 13:28:48 +0000 (21:28 +0800)]
net: hns3: add support for triggering reset by ethtool
Currently, four reset types are supported for the HNS3 ethernet
driver: IMP reset, global reset, function reset, and FLR. Only
FLR can now be triggered by the user. To restore the device when
an exception occurs, add support for triggering reset by ethtool.
Run the "ethtool --reset DEVNAME mgmt | all | dedicated" to
trigger the IMP | global | function reset manually.
Sergey Shtylyov [Tue, 10 Aug 2021 20:17:12 +0000 (23:17 +0300)]
MAINTAINERS: switch to my OMP email for Renesas Ethernet drivers
I'm still going to continue looking after the Renesas Ethernet drivers and
device tree bindings. Now my new employer, Open Mobile Platform (OMP), will
pay for all my upstream work. Let's switch to my OMP email for the reviews.
Neal Cardwell [Wed, 11 Aug 2021 02:40:56 +0000 (22:40 -0400)]
tcp_bbr: fix u32 wrap bug in round logic if bbr_init() called after 2B packets
Currently if BBR congestion control is initialized after more than 2B
packets have been delivered, depending on the phase of the
tp->delivered counter the tracking of BBR round trips can get stuck.
The bug arises because if tp->delivered is between 2^31 and 2^32 at
the time the BBR congestion control module is initialized, then the
initialization of bbr->next_rtt_delivered to 0 will cause the logic to
believe that the end of the round trip is still billions of packets in
the future. More specifically, the following check will fail
repeatedly:
and thus the connection will take up to 2B packets delivered before
that check will pass and the connection will set:
bbr->round_start = 1;
This could cause many mechanisms in BBR to fail to trigger, for
example bbr_check_full_bw_reached() would likely never exit STARTUP.
This bug is 5 years old and has not been observed, and as a practical
matter this would likely rarely trigger, since it would require
transferring at least 2B packets, or likely more than 3 terabytes of
data, before switching congestion control algorithms to BBR.
This patch is a stable candidate for kernels as far back as v4.9,
when tcp_bbr.c was added.
Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control") Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Kevin Yang <yyd@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20210811024056.235161-1-ncardwell@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jonathan Toppins [Wed, 11 Aug 2021 02:53:30 +0000 (22:53 -0400)]
bonding: remove extraneous definitions from bonding.h
All of the symbols either only exist in bond_options.c or nowhere at
all. These symbols were verified to not exist in the code base by
using `git grep` and their removal was verified by compiling bonding.ko.
Signed-off-by: Jonathan Toppins <jtoppins@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Wong Vee Khee [Tue, 10 Aug 2021 08:58:12 +0000 (16:58 +0800)]
net: pcs: xpcs: fix error handling on failed to allocate memory
Drivers such as sja1105 and stmmac that call xpcs_create() expects an
error returned by the pcs-xpcs module, but this was not the case on
failed to allocate memory.
Fixed this by returning an -ENOMEM instead of a NULL pointer.
Mark Brown [Tue, 10 Aug 2021 12:37:48 +0000 (13:37 +0100)]
net: mscc: Fix non-GPL export of regmap APIs
The ocelot driver makes use of regmap, wrapping it with driver specific
operations that are thin wrappers around the core regmap APIs. These are
exported with EXPORT_SYMBOL, dropping the _GPL from the core regmap
exports which is frowned upon. Add _GPL suffixes to at least the APIs that
are doing register I/O.
Willy Tarreau [Mon, 9 Aug 2021 16:06:28 +0000 (18:06 +0200)]
net: linkwatch: fix failure to restore device state across suspend/resume
After migrating my laptop from 4.19-LTS to 5.4-LTS a while ago I noticed
that my Ethernet port to which a bond and a VLAN interface are attached
appeared to remain up after resuming from suspend with the cable unplugged
(and that problem still persists with 5.10-LTS).
It happens that the following happens:
- the network driver (e1000e here) prepares to suspend, calls e1000e_down()
which calls netif_carrier_off() to signal that the link is going down.
- netif_carrier_off() adds a link_watch event to the list of events for
this device
- the device is completely stopped.
- the machine suspends
- the cable is unplugged and the machine brought to another location
- the machine is resumed
- the queued linkwatch events are processed for the device
- the device doesn't yet have the __LINK_STATE_PRESENT bit and its events
are silently dropped
- the device is resumed with its link down
- the upper VLAN and bond interfaces are never notified that the link had
been turned down and remain up
- the only way to provoke a change is to physically connect the machine
to a port and possibly unplug it.
The state after resume looks like this:
$ ip -br li | egrep 'bond|eth'
bond0 UP e8:6a:64:64:64:64 <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP>
eth0 DOWN e8:6a:64:64:64:64 <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP>
eth0.2@eth0 UP e8:6a:64:64:64:64 <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP>
Placing an explicit call to netdev_state_change() either in the suspend
or the resume code in the NIC driver worked around this but the solution
is not satisfying.
The issue in fact really is in link_watch that loses events while it
ought not to. It happens that the test for the device being present was
added by commit 124eee3f6955 ("net: linkwatch: add check for netdevice
being present to linkwatch_do_dev") in 4.20 to avoid an access to
devices that are not present.
Instead of dropping events, this patch proceeds slightly differently by
postponing their handling so that they happen after the device is fully
resumed.
Fixes: 124eee3f6955 ("net: linkwatch: add check for netdevice being present to linkwatch_do_dev") Link: https://lists.openwall.net/netdev/2018/03/15/62 Cc: Heiner Kallweit <hkallweit1@gmail.com> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Willy Tarreau <w@1wt.eu> Link: https://lore.kernel.org/r/20210809160628.22623-1-w@1wt.eu Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A recent change in LLVM causes module_{c,d}tor sections to appear when
CONFIG_K{A,C}SAN are enabled, which results in orphan section warnings
because these are not handled anywhere:
ld.lld: warning: arch/x86/pci/built-in.a(legacy.o):(.text.asan.module_ctor) is being placed in '.text.asan.module_ctor'
ld.lld: warning: arch/x86/pci/built-in.a(legacy.o):(.text.asan.module_dtor) is being placed in '.text.asan.module_dtor'
ld.lld: warning: arch/x86/pci/built-in.a(legacy.o):(.text.tsan.module_ctor) is being placed in '.text.tsan.module_ctor'
Fangrui explains: "the function asan.module_ctor has the SHF_GNU_RETAIN
flag, so it is in a separate section even with -fno-function-sections
(default)".
Place them in the TEXT_TEXT section so that these technologies continue
to work with the newer compiler versions. All of the KASAN and KCSAN
KUnit tests continue to pass after this change.
Hsuan-Chi Kuo [Thu, 4 Mar 2021 23:37:08 +0000 (17:37 -0600)]
seccomp: Fix setting loaded filter count during TSYNC
The desired behavior is to set the caller's filter count to thread's.
This value is reported via /proc, so this fixes the inaccurate count
exposed to userspace; it is not used for reference counting, etc.