Jakub Kicinski [Thu, 18 Jan 2024 17:54:24 +0000 (09:54 -0800)]
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2024-01-18
We've added 10 non-merge commits during the last 5 day(s) which contain
a total of 12 files changed, 806 insertions(+), 51 deletions(-).
The main changes are:
1) Fix an issue in bpf_iter_udp under backward progress which prevents
user space process from finishing iteration, from Martin KaFai Lau.
2) Fix BPF verifier to reject variable offset alu on registers with a type
of PTR_TO_FLOW_KEYS to prevent oob access, from Hao Sun.
3) Follow up fixes for kernel- and libbpf-side logic around handling
arg:ctx tagged arguments of BPF global subprogs, from Andrii Nakryiko.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
libbpf: warn on unexpected __arg_ctx type when rewriting BTF
selftests/bpf: add tests confirming type logic in kernel for __arg_ctx
bpf: enforce types for __arg_ctx-tagged arguments in global subprogs
bpf: extract bpf_ctx_convert_map logic and make it more reusable
libbpf: feature-detect arg:ctx tag support in kernel
selftests/bpf: Add test for alu on PTR_TO_FLOW_KEYS
bpf: Reject variable offset alu on PTR_TO_FLOW_KEYS
selftests/bpf: Test udp and tcp iter batching
bpf: Avoid iter->offset making backward progress in bpf_iter_udp
bpf: iter_udp: Retry with a larger batch size without going back to the previous bucket
====================
Tony Nguyen [Wed, 17 Jan 2024 17:25:32 +0000 (09:25 -0800)]
i40e: Include types.h to some headers
Commit 56df345917c0 ("i40e: Remove circular header dependencies and fix
headers") redistributed a number of includes from one large header file
to the locations they were needed. In some environments, types.h is not
included and causing compile issues. The driver should not rely on
implicit inclusion from other locations; explicitly include it to these
files.
Snippet of issue. Entire log can be seen through the Closes: link.
In file included from drivers/net/ethernet/intel/i40e/i40e_diag.h:7,
from drivers/net/ethernet/intel/i40e/i40e_diag.c:4:
drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h:33:9: error: unknown type name '__le16'
33 | __le16 flags;
| ^~~~~~
drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h:34:9: error: unknown type name '__le16'
34 | __le16 opcode;
| ^~~~~~
...
drivers/net/ethernet/intel/i40e/i40e_diag.h:22:9: error: unknown type name 'u32'
22 | u32 elements; /* number of elements if array */
| ^~~
drivers/net/ethernet/intel/i40e/i40e_diag.h:23:9: error: unknown type name 'u32'
23 | u32 stride; /* bytes between each element */
Reported-by: Martin Zaharinov <micron10@gmail.com> Closes: https://lore.kernel.org/netdev/21BBD62A-F874-4E42-B347-93087EEA8126@gmail.com/ Fixes: 56df345917c0 ("i40e: Remove circular header dependencies and fix headers") Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Arpana Arland <arpanax.arland@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://lore.kernel.org/r/20240117172534.3555162-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ipv6: mcast: fix data-race in ipv6_mc_down / mld_ifc_work
idev->mc_ifc_count can be written over without proper locking.
Originally found by syzbot [1], fix this issue by encapsulating calls
to mld_ifc_stop_work() (and mld_gq_stop_work() for good measure) with
mutex_lock() and mutex_unlock() accordingly as these functions
should only be called with mc_lock per their declarations.
[1]
BUG: KCSAN: data-race in ipv6_mc_down / mld_ifc_work
write to 0xffff88813a80c832 of 1 bytes by task 22 on cpu 1:
mld_ifc_work+0x54c/0x7b0 net/ipv6/mcast.c:2653
process_one_work kernel/workqueue.c:2627 [inline]
process_scheduled_works+0x5b8/0xa30 kernel/workqueue.c:2700
worker_thread+0x525/0x730 kernel/workqueue.c:2781
...
Amit Cohen [Wed, 17 Jan 2024 15:04:21 +0000 (16:04 +0100)]
selftests: mlxsw: qos_pfc: Adjust the test to support 8 lanes
'qos_pfc' test checks PFC behavior. The idea is to limit the traffic
using a shaper somewhere in the flow of the packets. In this area, the
buffer is smaller than the buffer at the beginning of the flow, so it fills
up until there is no more space left. The test configures there PFC
which is supposed to notice that the headroom is filling up and send PFC
Xoff to indicate the transmitter to stop sending traffic for the priorities
sharing this PG.
The Xon/Xoff threshold is auto-configured and always equal to
2*(MTU rounded up to cell size). Even after sending the PFC Xoff packet,
traffic will keep arriving until the transmitter receives and processes
the PFC packet. This amount of traffic is known as the PFC delay allowance.
Currently the buffer for the delay traffic is configured as 100KB. The
MTU in the test is 10KB, therefore the threshold for Xoff is about 20KB.
This allows 80KB extra to be stored in this buffer.
8-lane ports use two buffers among which the configured buffer is split,
the Xoff threshold then applies to each buffer in parallel.
The test does not take into account the behavior of 8-lane ports, when the
ports are configured to 400Gbps with 8 lanes or 800Gbps with 8 lanes,
packets are dropped and the test fails.
Check if the relevant ports use 8 lanes, in such case double the size of
the buffer, as the headroom is split half-half.
Cc: Shuah Khan <shuah@kernel.org> Fixes: bfa804784e32 ("selftests: mlxsw: Add a PFC test") Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/23ff11b7dff031eb04a41c0f5254a2b636cd8ebb.1705502064.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Petr Machata [Wed, 17 Jan 2024 15:04:19 +0000 (16:04 +0100)]
mlxsw: spectrum_router: Register netdevice notifier before nexthop
If there are IPIP nexthops at the time when the driver is loaded (or the
devlink instance reloaded), the driver looks up the corresponding IPIP
entry. But IPIP entries are only created as a result of netdevice
notifications. Since the netdevice notifier is registered after the nexthop
notifier, mlxsw_sp_nexthop_type_init() never finds the IPIP entry,
registers the nexthop MLXSW_SP_NEXTHOP_TYPE_ETH, and fails to assign a CRIF
to the nexthop. Later on when the CRIF is necessary, the WARN_ON in
mlxsw_sp_nexthop_rif() triggers, causing the splat [1].
In order to fix the issue, reorder the netdevice notifier to be registered
before the nexthop one.
Ido Schimmel [Wed, 17 Jan 2024 15:04:18 +0000 (16:04 +0100)]
mlxsw: spectrum_acl_tcam: Fix stack corruption
When tc filters are first added to a net device, the corresponding local
port gets bound to an ACL group in the device. The group contains a list
of ACLs. In turn, each ACL points to a different TCAM region where the
filters are stored. During forwarding, the ACLs are sequentially
evaluated until a match is found.
One reason to place filters in different regions is when they are added
with decreasing priorities and in an alternating order so that two
consecutive filters can never fit in the same region because of their
key usage.
In Spectrum-2 and newer ASICs the firmware started to report that the
maximum number of ACLs in a group is more than 16, but the layout of the
register that configures ACL groups (PAGT) was not updated to account
for that. It is therefore possible to hit stack corruption [1] in the
rare case where more than 16 ACLs in a group are required.
Fix by limiting the maximum ACL group size to the minimum between what
the firmware reports and the maximum ACLs that fit in the PAGT register.
Add a test case to make sure the machine does not crash when this
condition is hit.
Ido Schimmel [Wed, 17 Jan 2024 15:04:17 +0000 (16:04 +0100)]
mlxsw: spectrum_acl_tcam: Fix NULL pointer dereference in error path
When calling mlxsw_sp_acl_tcam_region_destroy() from an error path after
failing to attach the region to an ACL group, we hit a NULL pointer
dereference upon 'region->group->tcam' [1].
Fix by retrieving the 'tcam' pointer using mlxsw_sp_acl_to_tcam().
Amit Cohen [Wed, 17 Jan 2024 15:04:16 +0000 (16:04 +0100)]
mlxsw: spectrum_acl_erp: Fix error flow of pool allocation failure
Lately, a bug was found when many TC filters are added - at some point,
several bugs are printed to dmesg [1] and the switch is crashed with
segmentation fault.
The issue starts when gen_pool_free() fails because of unexpected
behavior - a try to free memory which is already freed, this leads to BUG()
call which crashes the switch and makes many other bugs.
Trying to track down the unexpected behavior led to a bug in eRP code. The
function mlxsw_sp_acl_erp_table_alloc() gets a pointer to the allocated
index, sets the value and returns an error code. When gen_pool_alloc()
fails it returns address 0, we track it and return -ENOBUFS outside, BUT
the call for gen_pool_alloc() already override the index in erp_table
structure. This is a problem when such allocation is done as part of
table expansion. This is not a new table, which will not be used in case
of allocation failure. We try to expand eRP table and override the
current index (non-zero) with zero. Then, it leads to an unexpected
behavior when address 0 is freed twice. Note that address 0 is valid in
erp_table->base_index and indeed other tables use it.
gen_pool_alloc() fails in case that there is no space left in the
pre-allocated pool, in our case, the pool is limited to
ACL_MAX_ERPT_BANK_SIZE, which is read from hardware. When more than max
erp entries are required, we exceed the limit and return an error, this
error leads to "Failed to migrate vregion" print.
Fix this by changing erp_table->base_index only in case of a successful
allocation.
Add a test case for such a scenario. Without this fix it causes
segmentation fault:
$ TESTS="max_erp_entries_test" ./tc_flower.sh
./tc_flower.sh: line 988: 1560 Segmentation fault tc filter del dev $h2 ingress chain $i protocol ip pref $i handle $j flower &>/dev/null
Accessing an ethernet device that is powered off or clock gated might
cause the CPU to hang. Add ethnl_ops_begin/complete in
ethnl_set_features() to protect against this.
Benjamin Poirier [Tue, 16 Jan 2024 15:49:26 +0000 (10:49 -0500)]
selftests: bonding: Add more missing config options
As a followup to commit 03fb8565c880 ("selftests: bonding: add missing
build configs"), add more networking-specific config options which are
needed for bonding tests.
For testing, I used the minimal config generated by virtme-ng and I added
the options in the config file. All bonding tests passed.
Fixes: bbb774d921e2 ("net: Add tests for bonding and team address list management") # for ipv6 Fixes: 6cbe791c0f4e ("kselftest: bonding: add num_grat_arp test") # for tc options Fixes: 222c94ec0ad4 ("selftests: bonding: add tests for ether type changes") # for nlmon Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Link: https://lore.kernel.org/r/20240116154926.202164-1-bpoirier@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Tue, 16 Jan 2024 15:43:11 +0000 (07:43 -0800)]
selftests: netdevsim: add a config file
netdevsim tests aren't very well integrated with kselftest,
which has its advantages and disadvantages. But regardless
of the intended integration - a config file to know what kernel
to build is very useful, add one.
====================
Tighten up arg:ctx type enforcement
Follow up fixes for kernel-side and libbpf-side logic around handling arg:ctx
(__arg_ctx) tagged arguments of BPF global subprogs.
Patch #1 adds libbpf feature detection of kernel-side __arg_ctx support to
avoid unnecessary rewriting BTF types. With stricter kernel-side type
enforcement this is now mandatory to avoid problems with using `struct
bpf_user_pt_regs_t` instead of actual typedef. For __arg_ctx tagged arguments
verifier is now supporting either `bpf_user_pt_regs_t` typedef or resolves it
down to the actual struct (pt_regs/user_pt_regs/user_regs_struct), depending
on architecture), but for old kernels without __arg_ctx support it's more
backwards compatible for libbpf to use `struct bpf_user_pt_regs_t` rewrite
which will work on wider range of kernels. So feature detection prevent libbpf
accidentally breaking global subprogs on new kernels.
We also adjust selftests to do similar feature detection (much simpler, but
potentially breaking due to kernel source code refactoring, which is fine for
selftests), and skip tests expecting libbpf's BTF type rewrites.
Patch #2 is preparatory refactoring for patch #3 which adds type enforcement
for arg:ctx tagged global subprog args. See the patch for specifics.
Patch #4 adds many new cases to ensure type logic works as expected.
Finally, patch #5 adds a relevant subset of kernel-side type checks to
__arg_ctx cases that libbpf supports rewrite of. In libbpf's case, type
violations are reported as warnings and BTF rewrite is not performed, which
will eventually lead to BPF verifier complaining at program verification time.
Good care was taken to avoid conflicts between bpf and bpf-next tree (which
has few follow up refactorings in the same code area). Once trees converge
some of the code will be moved around a bit (and some will be deleted), but
with no change to functionality or general shape of the code.
v2->v3:
- support `bpf_user_pt_regs_t` typedef for KPROBE and PERF_EVENT (CI);
v1->v2:
- add user_pt_regs and user_regs_struct support for PERF_EVENT (CI);
- drop FEAT_ARG_CTX_TAG enum leftover from patch #1;
- fix warning about default: without break in the switch (CI).
====================
Andrii Nakryiko [Thu, 18 Jan 2024 03:31:43 +0000 (19:31 -0800)]
libbpf: warn on unexpected __arg_ctx type when rewriting BTF
On kernel that don't support arg:ctx tag, before adjusting global
subprog BTF information to match kernel's expected canonical type names,
make sure that types used by user are meaningful, and if not, warn and
don't do BTF adjustments.
This is similar to checks that kernel performs, but narrower in scope,
as only a small subset of BPF program types can be accommodated by
libbpf using canonical type names.
Libbpf unconditionally allows `struct pt_regs *` for perf_event program
types, unlike kernel, which supports that conditionally on architecture.
This is done to keep things simple and not cause unnecessary false
positives. This seems like a minor and harmless deviation, which in
real-world programs will be caught by kernels with arg:ctx tag support
anyways. So KISS principle.
This logic is hard to test (especially on latest kernels), so manual
testing was performed instead. Libbpf emitted the following warning for
perf_event program with wrong context argument type:
libbpf: prog 'arg_tag_ctx_perf': subprog 'subprog_ctx_tag' arg#0 is expected to be of `struct bpf_perf_event_data *` type
Andrii Nakryiko [Thu, 18 Jan 2024 03:31:41 +0000 (19:31 -0800)]
bpf: enforce types for __arg_ctx-tagged arguments in global subprogs
Add enforcement of expected types for context arguments tagged with
arg:ctx (__arg_ctx) tag.
First, any program type will accept generic `void *` context type when
combined with __arg_ctx tag.
Besides accepting "canonical" struct names and `void *`, for a bunch of
program types for which program context is actually a named struct, we
allows a bunch of pragmatic exceptions to match real-world and expected
usage:
- for both kprobes and perf_event we allow `bpf_user_pt_regs_t *` as
canonical context argument type, where `bpf_user_pt_regs_t` is a
*typedef*, not a struct;
- for kprobes, we also always accept `struct pt_regs *`, as that's what
actually is passed as a context to any kprobe program;
- for perf_event, we resolve typedefs (unless it's `bpf_user_pt_regs_t`)
down to actual struct type and accept `struct pt_regs *`, or
`struct user_pt_regs *`, or `struct user_regs_struct *`, depending
on the actual struct type kernel architecture points `bpf_user_pt_regs_t`
typedef to; otherwise, canonical `struct bpf_perf_event_data *` is
expected;
- for raw_tp/raw_tp.w programs, `u64/long *` are accepted, as that's
what's expected with BPF_PROG() usage; otherwise, canonical
`struct bpf_raw_tracepoint_args *` is expected;
- tp_btf supports both `struct bpf_raw_tracepoint_args *` and `u64 *`
formats, both are coded as expections as tp_btf is actually a TRACING
program type, which has no canonical context type;
- iterator programs accept `struct bpf_iter__xxx *` structs, currently
with no further iterator-type specific enforcement;
- fentry/fexit/fmod_ret/lsm/struct_ops all accept `u64 *`;
- classic tracepoint programs, as well as syscall and freplace
programs allow any user-provided type.
In all other cases kernel will enforce exact match of struct name to
expected canonical type. And if user-provided type doesn't match that
expectation, verifier will emit helpful message with expected type name.
Note a bit unnatural way the check is done after processing all the
arguments. This is done to avoid conflict between bpf and bpf-next
trees. Once trees converge, a small follow up patch will place a simple
btf_validate_prog_ctx_type() check into a proper ARG_PTR_TO_CTX branch
(which bpf-next tree patch refactored already), removing duplicated
arg:ctx detection logic.
Andrii Nakryiko [Thu, 18 Jan 2024 03:31:40 +0000 (19:31 -0800)]
bpf: extract bpf_ctx_convert_map logic and make it more reusable
Refactor btf_get_prog_ctx_type() a bit to allow reuse of
bpf_ctx_convert_map logic in more than one places. Simplify interface by
returning btf_type instead of btf_member (field reference in BTF).
To do the above we need to touch and start untangling
btf_translate_to_vmlinux() implementation. We do the bare minimum to
not regress anything for btf_translate_to_vmlinux(), but its
implementation is very questionable for what it claims to be doing.
Mapping kfunc argument types to kernel corresponding types conceptually
is quite different from recognizing program context types. Fixing this
is out of scope for this change though.
Andrii Nakryiko [Thu, 18 Jan 2024 03:31:39 +0000 (19:31 -0800)]
libbpf: feature-detect arg:ctx tag support in kernel
Add feature detector of kernel-side arg:ctx (__arg_ctx) tag support. If
this is detected, libbpf will avoid doing any __arg_ctx-related BTF
rewriting and checks in favor of letting kernel handle this completely.
test_global_funcs/ctx_arg_rewrite subtest is adjusted to do the same
feature detection (albeit in much simpler, though round-about and
inefficient, way), and skip the tests. This is done to still be able to
execute this test on older kernels (like in libbpf CI).
Note, BPF token series ([0]) does a major refactor and code moving of
libbpf-internal feature detection "framework", so to avoid unnecessary
conflicts we keep newly added feature detection stand-alone with ad-hoc
result caching. Once things settle, there will be a small follow up to
re-integrate everything back and move code into its final place in
newly-added (by BPF token series) features.c file.
Fixes: b63e78fca889 ("net: netdevsim: use mock PHC driver") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Tue, 16 Jan 2024 17:18:47 +0000 (18:18 +0100)]
mptcp: relax check on MPC passive fallback
While testing the blamed commit below, I was able to miss (!)
packetdrill failures in the fastopen test-cases.
On passive fastopen the child socket is created by incoming TCP MPC syn,
allow for both MPC_SYN and MPC_ACK header.
Fixes: 724b00c12957 ("mptcp: refine opt_mp_capable determination") Reviewed-by: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Romain Gantois [Tue, 16 Jan 2024 12:19:17 +0000 (13:19 +0100)]
net: stmmac: Prevent DSA tags from breaking COE
Some DSA tagging protocols change the EtherType field in the MAC header
e.g. DSA_TAG_PROTO_(DSA/EDSA/BRCM/MTK/RTL4C_A/SJA1105). On TX these tagged
frames are ignored by the checksum offload engine and IP header checker of
some stmmac cores.
On RX, the stmmac driver wrongly assumes that checksums have been computed
for these tagged packets, and sets CHECKSUM_UNNECESSARY.
Add an additional check in the stmmac TX and RX hotpaths so that COE is
deactivated for packets with ethertypes that will not trigger the COE and
IP header checks.
Fixes: 6b2c6e4a938f ("net: stmmac: propagate feature flags to vlan") Cc: <stable@vger.kernel.org> Reported-by: Richard Tresidder <rtresidd@electromag.com.au> Link: https://lore.kernel.org/netdev/e5c6c75f-2dfa-4e50-a1fb-6bf4cdb617c2@electromag.com.au/ Reported-by: Romain Gantois <romain.gantois@bootlin.com> Link: https://lore.kernel.org/netdev/c57283ed-6b9b-b0e6-ee12-5655c1c54495@bootlin.com/ Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Romain Gantois <romain.gantois@bootlin.com> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Nicolas Dichtel [Mon, 15 Jan 2024 13:59:22 +0000 (14:59 +0100)]
selftests: rtnetlink: use setup_ns in bonding test
This is a follow-up of commit a159cbe81d3b ("selftests: rtnetlink: check
enslaving iface in a bond") after the merge of net-next into net.
The goal is to follow the new convention,
see commit d3b6b1116127 ("selftests/net: convert rtnetlink.sh to run it in
unique namespace") for more details.
Let's use also the generic dummy name instead of defining a new one.
The referenced commit moved the setting of the Autoneg and pause bits
early in sfp_parse_support(). However, we check whether the modes are
empty before using the bitrate to set some modes. Setting these bits
so early causes that test to always be false, preventing this working,
and thus some modules that used to work no longer do.
Move them just before the call to the quirk.
Fixes: 8110633db49d ("net: sfp-bus: allow SFP quirks to override Autoneg and pause bits") Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://lore.kernel.org/r/E1rPMJW-001Ahf-L0@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Hao Sun [Mon, 15 Jan 2024 08:20:28 +0000 (09:20 +0100)]
selftests/bpf: Add test for alu on PTR_TO_FLOW_KEYS
Add a test case for PTR_TO_FLOW_KEYS alu. Testing if alu with variable
offset on flow_keys is rejected. For the fixed offset success case, we
already have C code coverage to verify (e.g. via bpf_flow.c).
Hao Sun [Mon, 15 Jan 2024 08:20:27 +0000 (09:20 +0100)]
bpf: Reject variable offset alu on PTR_TO_FLOW_KEYS
For PTR_TO_FLOW_KEYS, check_flow_keys_access() only uses fixed off
for validation. However, variable offset ptr alu is not prohibited
for this ptr kind. So the variable offset is not checked.
Fix this by rejecting ptr alu with variable offset on flow_keys.
Applying the patch rejects the program with "R7 pointer arithmetic
on flow_keys prohibited".
Fixes: d58e468b1112 ("flow_dissector: implements flow dissector BPF hook") Signed-off-by: Hao Sun <sunhao.th@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/bpf/20240115082028.9992-1-sunhao.th@gmail.com
Jakub Kicinski [Tue, 16 Jan 2024 02:02:01 +0000 (18:02 -0800)]
selftests: bonding: add missing build configs
bonding tests also try to create bridge, veth and dummy
interfaces. These are not currently listed in config.
Fixes: bbb774d921e2 ("net: Add tests for bonding and team address list management") Fixes: c078290a2b76 ("selftests: include bonding tests into the kselftest infra") Acked-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Link: https://lore.kernel.org/r/20240116020201.1883023-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Sun, 14 Jan 2024 22:47:26 +0000 (14:47 -0800)]
selftests: netdevsim: sprinkle more udevadm settle
Number of tests are failing when netdev renaming is active
on the system. Add udevadm settle in logic determining
the names.
Fixes: 242aaf03dc9b ("selftests: add a test for ethtool pause stats") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240114224726.1210532-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Multiple disable_irq_wake() calls will keep decreasing the IRQ
wake_depth, When wake_depth is 0, calling disable_irq_wake() again,
will report the above calltrace.
Due to the need to appear in pairs, we cannot call disable_irq_wake()
without calling enable_irq_wake(). Fix this by making sure there are
no unbalanced disable_irq_wake() calls.
Fixes: 3172d3afa998 ("stmmac: support wake up irq from external sources (v3)") Signed-off-by: Qiang Ma <maqianga@uniontech.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240112021249.24598-1-maqianga@uniontech.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Benjamin Poirier [Wed, 10 Jan 2024 14:14:35 +0000 (09:14 -0500)]
selftests: bonding: Change script interpreter
The tests changed by this patch, as well as the scripts they source, use
features which are not part of POSIX sh (ex. 'source' and 'local'). As a
result, these tests fail when /bin/sh is dash such as on Debian. Change the
interpreter to bash so that these tests can run successfully.
Fixes: d43eff0b85ae ("selftests: bonding: up/down delay w/ slave link flapping") Tested-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
net: ravb: Fix dma_addr_t truncation in error case
In ravb_start_xmit(), ravb driver uses u32 variable to store result of
dma_map_single() call. Since ravb hardware has 32-bit address fields in
descriptors, this works properly when mapping is successful - it is
platform's job to provide mapping addresses that fit into hardware
limitations.
However, in failure case dma_map_single() returns DMA_MAPPING_ERROR
constant that is 64-bit when dma_addr_t is 64-bit. Storing this constant
in u32 leads to truncation, and further call to dma_mapping_error()
fails to notice the error.
Fix that by storing result of dma_map_single() in a dma_addr_t
variable.
Fixes: c156633f1353 ("Renesas Ethernet AVB driver proper") Signed-off-by: Nikita Yushchenko <nikita.yoush@cogentembedded.com> Reviewed-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 14 Jan 2024 12:17:14 +0000 (12:17 +0000)]
Merge branch 'tls-splice-hint-fixes'
John Fastabend says:
====================
tls fixes for SPLICE with more hint
Syzbot found a splat where it tried to splice data over a tls socket
with the more hint and sending greater than the number of frags that
fit in a msg scatterlist. This resulted in an error where we do not
correctly send the data when the msg sg is full. The more flag being
just a hint not a strict contract. This then results in the syzbot
warning on the next send.
Edward generated an initial patch for this which checked for a full
msg on entry to the sendmsg hook. This fixed the WARNING, but didn't
fully resolve the issue because the full msg_pl scatterlist was never
sent resulting in a stuck socket. In this series instead avoid the
situation by forcing the send on the splice that fills the scatterlist.
Also in original thread I mentioned it didn't seem to be enough to
simply fix the send on full sg problem. That was incorrect and was
really a bug in my test program that was hanging the test program.
I had setup a repair socket and wasn't handling it correctly so my
tester got stuck.
Thanks. Please review. Fix in patch 1 and test in patch 2.
v2: use SPLICE_F_ flag names instead of cryptic 0xe
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend [Sat, 13 Jan 2024 00:32:58 +0000 (16:32 -0800)]
net: tls, add test to capture error on large splice
syzbot found an error with how splice() is handled with a msg greater
than 32. This was fixed in previous patch, but lets add a test for
it to ensure it continues to work.
Signed-off-by: John Fastabend <john.fastabend@gmail.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend [Sat, 13 Jan 2024 00:32:57 +0000 (16:32 -0800)]
net: tls, fix WARNIING in __sk_msg_free
A splice with MSG_SPLICE_PAGES will cause tls code to use the
tls_sw_sendmsg_splice path in the TLS sendmsg code to move the user
provided pages from the msg into the msg_pl. This will loop over the
msg until msg_pl is full, checked by sk_msg_full(msg_pl). The user
can also set the MORE flag to hint stack to delay sending until receiving
more pages and ideally a full buffer.
If the user adds more pages to the msg than can fit in the msg_pl
scatterlist (MAX_MSG_FRAGS) we should ignore the MORE flag and send
the buffer anyways.
What actually happens though is we abort the msg to msg_pl scatterlist
setup and then because we forget to set 'full record' indicating we
can no longer consume data without a send we fallthrough to the 'continue'
path which will check if msg_data_left(msg) has more bytes to send and
then attempts to fit them in the already full msg_pl. Then next
iteration of sender doing send will encounter a full msg_pl and throw
the warning in the syzbot report.
To fix simply check if we have a full_record in splice code path and
if not send the msg regardless of MORE flag.
Reported-and-tested-by: syzbot+f2977222e0e95cec15c8@syzkaller.appspotmail.com Reported-by: Edward Adam Davis <eadavis@qq.com> Fixes: fe1e81d4f73b ("tls/sw: Support MSG_SPLICE_PAGES") Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
bpf: Fix backward progress bug in bpf_iter_udp
From: Martin KaFai Lau <martin.lau@kernel.org>
This patch set fixes an issue in bpf_iter_udp that makes backward
progress and prevents the user space process from finishing. There is
a test at the end to reproduce the bug.
Please see individual patches for details.
v3:
- Fixed the iter_fd check and local_port check in the
patch 3 selftest. (Yonghong)
- Moved jhash2 to test_jhash.h in the patch 3. (Yonghong)
- Added explanation in the bucket selection in the patch 3. (Yonghong)
v2:
- Added patch 1 to fix another bug that goes back to
the previous bucket
- Simplify the fix in patch 2 to always reset iter->offset to 0
- Add a test case to close all udp_sk in a bucket while
in the middle of the iteration.
====================
Martin KaFai Lau [Fri, 12 Jan 2024 19:05:30 +0000 (11:05 -0800)]
selftests/bpf: Test udp and tcp iter batching
The patch adds a test to exercise the bpf_iter_udp batching
logic. It specifically tests the case that there are multiple
so_reuseport udp_sk in a bucket of the udp_table.
The test creates two sets of so_reuseport sockets and
each set on a different port. Meaning there will be
two buckets in the udp_table.
The test does the following:
1. read() 3 out of 4 sockets in the first bucket.
2. close() all sockets in the first bucket. This
will ensure the current bucket's offset in
the kernel does not affect the read() of the
following bucket.
3. read() all 4 sockets in the second bucket.
The test also reads one udp_sk at a time from
the bpf_iter_udp prog. The true case in
"do_test(..., bool onebyone)". This is the buggy case
that the previous patch fixed.
It also tests the "false" case in "do_test(..., bool onebyone)",
meaning the userspace reads the whole bucket. There is
no bug in this case but adding this test also while
at it.
Considering the way to have multiple tcp_sk in the same
bucket is similar (by using so_reuseport),
this patch also tests the bpf_iter_tcp even though the
bpf_iter_tcp batching logic works correctly.
Both IP v4 and v6 are exercising the same bpf_iter batching
code path, so only v6 is tested.
Martin KaFai Lau [Fri, 12 Jan 2024 19:05:29 +0000 (11:05 -0800)]
bpf: Avoid iter->offset making backward progress in bpf_iter_udp
There is a bug in the bpf_iter_udp_batch() function that stops
the userspace from making forward progress.
The case that triggers the bug is the userspace passed in
a very small read buffer. When the bpf prog does bpf_seq_printf,
the userspace read buffer is not enough to capture the whole bucket.
When the read buffer is not large enough, the kernel will remember
the offset of the bucket in iter->offset such that the next userspace
read() can continue from where it left off.
The kernel will skip the number (== "iter->offset") of sockets in
the next read(). However, the code directly decrements the
"--iter->offset". This is incorrect because the next read() may
not consume the whole bucket either and then the next-next read()
will start from offset 0. The net effect is the userspace will
keep reading from the beginning of a bucket and the process will
never finish. "iter->offset" must always go forward until the
whole bucket is consumed.
This patch fixes it by using a local variable "resume_offset"
and "resume_bucket". "iter->offset" is always reset to 0 before
it may be used. "iter->offset" will be advanced to the
"resume_offset" when it continues from the "resume_bucket" (i.e.
"state->bucket == resume_bucket"). This brings it closer to
the bpf_iter_tcp's offset handling which does not suffer
the same bug.
Cc: Aditi Ghag <aditi.ghag@isovalent.com> Fixes: c96dac8d369f ("bpf: udp: Implement batching for sockets iterator") Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Aditi Ghag <aditi.ghag@isovalent.com> Link: https://lore.kernel.org/r/20240112190530.3751661-3-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Martin KaFai Lau [Fri, 12 Jan 2024 19:05:28 +0000 (11:05 -0800)]
bpf: iter_udp: Retry with a larger batch size without going back to the previous bucket
The current logic is to use a default size 16 to batch the whole bucket.
If it is too small, it will retry with a larger batch size.
The current code accidentally does a state->bucket-- before retrying.
This goes back to retry with the previous bucket which has already
been done. This patch fixed it.
It is hard to create a selftest. I added a WARN_ON(state->bucket < 0),
forced a particular port to be hashed to the first bucket,
created >16 sockets, and observed the for-loop went back
to the "-1" bucket.
Cc: Aditi Ghag <aditi.ghag@isovalent.com> Fixes: c96dac8d369f ("bpf: udp: Implement batching for sockets iterator") Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Aditi Ghag <aditi.ghag@isovalent.com> Link: https://lore.kernel.org/r/20240112190530.3751661-2-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
net: netdev_queue: netdev_txq_completed_mb(): fix wake condition
netif_txq_try_stop() uses "get_desc >= start_thrs" as the check for
the call to netif_tx_start_queue().
Use ">=" i netdev_txq_completed_mb(), too.
Fixes: c91c46de6bbc ("net: provide macros for commonly copied lockless queue stop/wake code") Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 12 Jan 2024 12:28:16 +0000 (12:28 +0000)]
net: add more sanity check in virtio_net_hdr_to_skb()
syzbot/KMSAN reports access to uninitialized data from gso_features_check() [1]
The repro use af_packet, injecting a gso packet and hdrlen == 0.
We could fix the issue making gso_features_check() more careful
while dealing with NETIF_F_TSO_MANGLEID in fast path.
Or we can make sure virtio_net_hdr_to_skb() pulls minimal network and
transport headers as intended.
Note that for GSO packets coming from untrusted sources, SKB_GSO_DODGY
bit forces a proper header validation (and pull) before the packet can
hit any device ndo_start_xmit(), thus we do not need a precise disection
at virtio_net_hdr_to_skb() stage.
CPU: 0 PID: 5025 Comm: syz-executor279 Not tainted 6.7.0-rc7-syzkaller-00003-gfbafc3e621c3 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
Reported-by: syzbot+7f4d0ea3df4d4fa9a65f@syzkaller.appspotmail.com Link: https://lore.kernel.org/netdev/0000000000005abd7b060eb160cd@google.com/ Fixes: 9274124f023b ("net: stricter validation of untrusted gso packets") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Fri, 12 Jan 2024 11:39:30 +0000 (12:39 +0100)]
net: sched: track device in tcf_block_get/put_ext() only for clsact binder types
Clsact/ingress qdisc is not the only one using shared block,
red is also using it. The device tracking was originally introduced
by commit 913b47d3424e ("net/sched: Introduce tc block netdev
tracking infra") for clsact/ingress only. Commit 94e2557d086a ("net:
sched: move block device tracking into tcf_block_get/put_ext()")
mistakenly enabled that for red as well.
Fix that by adding a check for the binder type being clsact when adding
device to the block->ports xarray.
Reported-by: Ido Schimmel <idosch@idosch.org> Closes: https://lore.kernel.org/all/ZZ6JE0odnu1lLPtu@shredder/ Fixes: 94e2557d086a ("net: sched: move block device tracking into tcf_block_get/put_ext()") Signed-off-by: Jiri Pirko <jiri@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 12 Jan 2024 10:44:27 +0000 (10:44 +0000)]
udp: annotate data-races around up->pending
up->pending can be read without holding the socket lock,
as pointed out by syzbot [1]
Add READ_ONCE() in lockless contexts, and WRITE_ONCE()
on write side.
[1]
BUG: KCSAN: data-race in udpv6_sendmsg / udpv6_sendmsg
write to 0xffff88814e5eadf0 of 4 bytes by task 15547 on cpu 1:
udpv6_sendmsg+0x1405/0x1530 net/ipv6/udp.c:1596
inet6_sendmsg+0x63/0x80 net/ipv6/af_inet6.c:657
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg net/socket.c:745 [inline]
__sys_sendto+0x257/0x310 net/socket.c:2192
__do_sys_sendto net/socket.c:2204 [inline]
__se_sys_sendto net/socket.c:2200 [inline]
__x64_sys_sendto+0x78/0x90 net/socket.c:2200
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0x44/0x110 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x63/0x6b
read to 0xffff88814e5eadf0 of 4 bytes by task 15551 on cpu 0:
udpv6_sendmsg+0x22c/0x1530 net/ipv6/udp.c:1373
inet6_sendmsg+0x63/0x80 net/ipv6/af_inet6.c:657
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg net/socket.c:745 [inline]
____sys_sendmsg+0x37c/0x4d0 net/socket.c:2586
___sys_sendmsg net/socket.c:2640 [inline]
__sys_sendmmsg+0x269/0x500 net/socket.c:2726
__do_sys_sendmmsg net/socket.c:2755 [inline]
__se_sys_sendmmsg net/socket.c:2752 [inline]
__x64_sys_sendmmsg+0x57/0x60 net/socket.c:2752
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0x44/0x110 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x63/0x6b
value changed: 0x00000000 -> 0x0000000a
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 15551 Comm: syz-executor.1 Tainted: G W 6.7.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot+8d482d0e407f665d9d10@syzkaller.appspotmail.com Link: https://lore.kernel.org/netdev/0000000000009e46c3060ebcdffd@google.com/ Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sneh Shah [Tue, 9 Jan 2024 14:47:29 +0000 (20:17 +0530)]
net: stmmac: Fix ethool link settings ops for integrated PCS
Currently get/set_link_ksettings ethtool ops are dependent on PCS.
When PCS is integrated, it will not have separate link config.
Bypass configuring and checking PCS for integrated PCS.
Fixes: aa571b6275fb ("net: stmmac: add new switch to struct plat_stmmacenet_data") Tested-by: Andrew Halaney <ahalaney@redhat.com> # sa8775p-ride Signed-off-by: Sneh Shah <quic_snehshah@quicinc.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 11 Jan 2024 19:49:17 +0000 (19:49 +0000)]
mptcp: refine opt_mp_capable determination
OPTIONS_MPTCP_MPC is a combination of three flags.
It would be better to be strict about testing what
flag is expected, at least for code readability.
mptcp_parse_option() already makes the distinction.
- subflow_check_req() should use OPTION_MPTCP_MPC_SYN.
- mptcp_subflow_init_cookie_req() should use OPTION_MPTCP_MPC_ACK.
- subflow_finish_connect() should use OPTION_MPTCP_MPC_SYNACK
- subflow_syn_recv_sock should use OPTION_MPTCP_MPC_ACK
Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Fixes: 74c7dfbee3e1 ("mptcp: consolidate in_opt sub-options fields in a bitmask") Link: https://lore.kernel.org/r/20240111194917.4044654-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Thu, 11 Jan 2024 19:49:15 +0000 (19:49 +0000)]
mptcp: use OPTION_MPTCP_MPJ_SYNACK in subflow_finish_connect()
subflow_finish_connect() uses four fields (backup, join_id, thmac, none)
that may contain garbage unless OPTION_MPTCP_MPJ_SYNACK has been set
in mptcp_parse_option()
Fixes: f296234c98a8 ("mptcp: Add handling of incoming MP_JOIN requests") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Florian Westphal <fw@strlen.de> Cc: Peter Krystad <peter.krystad@linux.intel.com> Cc: Matthieu Baerts <matttbe@kernel.org> Cc: Mat Martineau <martineau@kernel.org> Cc: Geliang Tang <geliang.tang@linux.dev> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20240111194917.4044654-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Claudiu Beznea [Fri, 5 Jan 2024 08:52:42 +0000 (10:52 +0200)]
net: phy: micrel: populate .soft_reset for KSZ9131
The RZ/G3S SMARC Module has 2 KSZ9131 PHYs. In this setup, the KSZ9131 PHY
is used with the ravb Ethernet driver. It has been discovered that when
bringing the Ethernet interface down/up continuously, e.g., with the
following sh script:
$ while :; do ifconfig eth0 down; ifconfig eth0 up; done
the link speed and duplex are wrong after interrupting the bring down/up
operation even though the Ethernet interface is up. To recover from this
state the following configuration sequence is necessary (executed
manually):
$ ifconfig eth0 down
$ ifconfig eth0 up
The behavior has been identified also on the Microchip SAMA7G5-EK board
which runs the macb driver and uses the same PHY.
The order of PHY-related operations in ravb_open() is as follows:
ravb_open() ->
ravb_phy_start() ->
ravb_phy_init() ->
of_phy_connect() ->
phy_connect_direct() ->
phy_attach_direct() ->
phy_init_hw() ->
phydev->drv->soft_reset()
phydev->drv->config_init()
phydev->drv->config_intr()
phy_resume()
kszphy_resume()
The order of PHY-related operations in ravb_close is as follows:
ravb_close() ->
phy_stop() ->
phy_suspend() ->
kszphy_suspend() ->
genphy_suspend()
// set BMCR_PDOWN bit in MII_BMCR
In genphy_suspend() setting the BMCR_PDWN bit in MII_BMCR switches the PHY
to Software Power-Down (SPD) mode (according to the KSZ9131 datasheet).
Thus, when opening the interface after it has been previously closed (via
ravb_close()), the phydev->drv->config_init() and
phydev->drv->config_intr() reach the KSZ9131 PHY driver via the
ksz9131_config_init() and kszphy_config_intr() functions.
KSZ9131 specifies that the MII management interface remains operational
during SPD (Software Power-Down), but (according to manual):
- Only access to the standard registers (0 through 31) is supported.
- Access to MMD address spaces other than MMD address space 1 is possible
if the spd_clock_gate_override bit is set.
- Access to MMD address space 1 is not possible.
The spd_clock_gate_override bit is not used in the KSZ9131 driver.
ksz9131_config_init() configures RGMII delay, pad skews and LEDs by
accessesing MMD registers other than those in address space 1.
The datasheet for the KSZ9131 does not specify what happens if registers
from an unsupported address space are accessed while the PHY is in SPD.
To fix the issue the .soft_reset method has been instantiated for KSZ9131,
too. This resets the PHY to the default state before doing any
configurations to it, thus switching it out of SPD.
Fixes: bff5b4b37372 ("net: phy: micrel: add Microchip KSZ9131 initial driver") Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Horatiu Vultur [Wed, 10 Jan 2024 11:37:30 +0000 (12:37 +0100)]
net: micrel: Fix PTP frame parsing for lan8841
The HW has the capability to check each frame if it is a PTP frame,
which domain it is, which ptp frame type it is, different ip address in
the frame. And if one of these checks fail then the frame is not
timestamp. Most of these checks were disabled except checking the field
minorVersionPTP inside the PTP header. Meaning that once a partner sends
a frame compliant to 8021AS which has minorVersionPTP set to 1, then the
frame was not timestamp because the HW expected by default a value of 0
in minorVersionPTP.
Fix this issue by removing this check so the userspace can decide on this.
Fixes: cafc3662ee3f ("net: micrel: Add PHC support for lan8841") Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Reviewed-by: Divya Koppera <divya.koppera@microchip.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Taehee Yoo [Sun, 7 Jan 2024 14:42:41 +0000 (14:42 +0000)]
amt: do not use overwrapped cb area
amt driver uses skb->cb for storing tunnel information.
This job is worked before TC layer and then amt driver load tunnel info
from skb->cb after TC layer.
So, its cb area should not be overwrapped with CB area used by TC.
In order to not use cb area used by TC, it skips the biggest cb
structure used by TC, which was qdisc_skb_cb.
But it's not anymore.
Currently, biggest structure of TC's CB is tc_skb_cb.
So, it should skip size of tc_skb_cb instead of qdisc_skb_cb.
Fixes: ec624fe740b4 ("net/sched: Extend qdisc control block with tc control block") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/20240107144241.4169520-1-ap420073@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
net: ethernet: ti: am65-cpsw: Allow for MTU values
The am65-cpsw-nuss driver has a fixed definition for the maximum ethernet
frame length of 1522 bytes (AM65_CPSW_MAX_PACKET_SIZE). This limits the switch
ports to only operate at a maximum MTU of 1500 bytes. When combining this CPSW
switch with a DSA switch connected to one of its ports this limitation shows up.
The extra 8 bytes the DSA subsystem adds internally to the ethernet frame
create resulting frames bigger than 1522 bytes (1518 for non VLAN + 8 for DSA
stuff) so they get dropped by the switch.
One of the issues with the the am65-cpsw-nuss driver is that the network device
max_mtu was being set to the same fixed value defined for the max total frame
length (1522 bytes). This makes the DSA subsystem believe that the MTU of the
interface can be set to 1508 bytes to make room for the extra 8 bytes of the DSA
headers. However, all packages created assuming the 1500 bytes payload get
dropped by the switch as oversized.
This series offers a solution to this problem. The max_mtu advertised on the
network device and the actual max frame size configured on the switch registers
are made consistent by letting the extra room needed for the ethernet headers
and the frame checksum (22 bytes including VLAN).
====================
net: ethernet: ti: am65-cpsw: Fix max mtu to fit ethernet frames
The value of AM65_CPSW_MAX_PACKET_SIZE represents the maximum length
of a received frame. This value is written to the register
AM65_CPSW_PORT_REG_RX_MAXLEN.
The maximum MTU configured on the network device should then leave
some room for the ethernet headers and frame check. Otherwise, if
the network interface is configured to its maximum mtu possible,
the frames will be larger than AM65_CPSW_MAX_PACKET_SIZE and will
get dropped as oversized.
The switch supports ethernet frame sizes between 64 and 2024 bytes
(including VLAN) as stated in the technical reference manual, so
define AM65_CPSW_MAX_PACKET_SIZE with that maximum size.
Zhu Yanjun [Thu, 4 Jan 2024 02:09:02 +0000 (10:09 +0800)]
virtio_net: Fix "‘%d’ directive writing between 1 and 11 bytes into a region of size 10" warnings
Fix the warnings when building virtio_net driver.
"
drivers/net/virtio_net.c: In function ‘init_vqs’:
drivers/net/virtio_net.c:4551:48: warning: ‘%d’ directive writing between 1 and 11 bytes into a region of size 10 [-Wformat-overflow=]
4551 | sprintf(vi->rq[i].name, "input.%d", i);
| ^~
In function ‘virtnet_find_vqs’,
inlined from ‘init_vqs’ at drivers/net/virtio_net.c:4645:8:
drivers/net/virtio_net.c:4551:41: note: directive argument in the range [-2147483643, 65534]
4551 | sprintf(vi->rq[i].name, "input.%d", i);
| ^~~~~~~~~~
drivers/net/virtio_net.c:4551:17: note: ‘sprintf’ output between 8 and 18 bytes into a destination of size 16
4551 | sprintf(vi->rq[i].name, "input.%d", i);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/net/virtio_net.c: In function ‘init_vqs’:
drivers/net/virtio_net.c:4552:49: warning: ‘%d’ directive writing between 1 and 11 bytes into a region of size 9 [-Wformat-overflow=]
4552 | sprintf(vi->sq[i].name, "output.%d", i);
| ^~
In function ‘virtnet_find_vqs’,
inlined from ‘init_vqs’ at drivers/net/virtio_net.c:4645:8:
drivers/net/virtio_net.c:4552:41: note: directive argument in the range [-2147483643, 65534]
4552 | sprintf(vi->sq[i].name, "output.%d", i);
| ^~~~~~~~~~~
drivers/net/virtio_net.c:4552:17: note: ‘sprintf’ output between 9 and 19 bytes into a destination of size 16
4552 | sprintf(vi->sq[i].name, "output.%d", i);
octeontx2-af: CN10KB: Fix FIFO length calculation for RPM2
RPM0 and RPM1 on the CN10KB SoC have 8 LMACs each, whereas RPM2
has only 4 LMACs. Similarly, the RPM0 and RPM1 have 256KB FIFO,
whereas RPM2 has 128KB FIFO. This patch fixes an issue with
improper TX credit programming for the RPM2 link.
Fixes: b9d0fedc6234 ("octeontx2-af: cn10kb: Add RPM_USX MAC support") Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com> Signed-off-by: Naveen Mamindlapalli <naveenm@marvell.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240108073036.8766-1-naveenm@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
rtnetlink: allow to enslave with one msg an up interface
The first patch fixes a regression, introduced in linux v6.1, by reverting
a patch. The second patch adds a test to verify this API.
====================
The patch broke:
> ip link set dummy0 up
> ip link set dummy0 master bond0 down
This last command is useful to be able to enslave an interface with only
one netlink message.
After discussion, there is no good reason to support:
> ip link set dummy0 down
> ip link set dummy0 master bond0 up
because the bond interface already set the slave up when it is up.
Cc: stable@vger.kernel.org Fixes: a4abfa627c38 ("net: rtnetlink: Enslave device before bringing it up") Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://lore.kernel.org/r/20240108094103.2001224-2-nicolas.dichtel@6wind.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Tue, 9 Jan 2024 15:10:48 +0000 (15:10 +0000)]
rxrpc: Fix use of Don't Fragment flag
rxrpc normally has the Don't Fragment flag set on the UDP packets it
transmits, except when it has decided that DATA packets aren't getting
through - in which case it turns it off just for the DATA transmissions.
This can be a problem, however, for RESPONSE packets that convey
authentication and crypto data from the client to the server as ticket may
be larger than can fit in the MTU.
In such a case, rxrpc gets itself into an infinite loop as the sendmsg
returns an error (EMSGSIZE), which causes rxkad_send_response() to return
-EAGAIN - and the CHALLENGE packet is put back on the Rx queue to retry,
leading to the I/O thread endlessly attempting to perform the transmission.
Fix this by disabling DF on RESPONSE packets for now. The use of DF and
best data MTU determination needs reconsidering at some point in the
future.
Fixes: 17926a79320a ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both") Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/1581852.1704813048@warthog.procyon.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Which is obviously bogus, because not all net_devices have a netdev_priv()
of type struct dsa_user_priv. But struct dsa_user_priv is fairly small,
and p->dp means dereferencing 8 bytes starting with offset 16. Most
drivers allocate that much private memory anyway, making our access not
fault, and we discard the bogus data quickly afterwards, so this wasn't
caught.
But the dummy interface is somewhat special in that it calls
alloc_netdev() with a priv size of 0. So every netdev_priv() dereference
is invalid, and we get this when we emit a NETDEV_PRECHANGEUPPER event
with a VLAN as its new upper:
$ ip link add dummy1 type dummy
$ ip link add link dummy1 name dummy1.100 type vlan id 100
[ 43.309174] ==================================================================
[ 43.316456] BUG: KASAN: slab-out-of-bounds in dsa_user_prechangeupper+0x30/0xe8
[ 43.323835] Read of size 8 at addr ffff3f86481d2990 by task ip/374
[ 43.330058]
[ 43.342436] Call trace:
[ 43.366542] dsa_user_prechangeupper+0x30/0xe8
[ 43.371024] dsa_user_netdevice_event+0xb38/0xee8
[ 43.375768] notifier_call_chain+0xa4/0x210
[ 43.379985] raw_notifier_call_chain+0x24/0x38
[ 43.384464] __netdev_upper_dev_link+0x3ec/0x5d8
[ 43.389120] netdev_upper_dev_link+0x70/0xa8
[ 43.393424] register_vlan_dev+0x1bc/0x310
[ 43.397554] vlan_newlink+0x210/0x248
[ 43.401247] rtnl_newlink+0x9fc/0xe30
[ 43.404942] rtnetlink_rcv_msg+0x378/0x580
Avoid the kernel oops by dereferencing after the type check, as customary.
Fixes: 4c3f80d22b2e ("net: dsa: walk through all changeupper notifier functions") Reported-and-tested-by: syzbot+d81bcd883824180500c8@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/0000000000001d4255060e87545c@google.com/ Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240110003354.2796778-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Lin Ma [Wed, 10 Jan 2024 06:14:00 +0000 (14:14 +0800)]
net: qualcomm: rmnet: fix global oob in rmnet_policy
The variable rmnet_link_ops assign a *bigger* maxtype which leads to a
global out-of-bounds read when parsing the netlink attributes. See bug
trace below:
==================================================================
BUG: KASAN: global-out-of-bounds in validate_nla lib/nlattr.c:386 [inline]
BUG: KASAN: global-out-of-bounds in __nla_validate_parse+0x24af/0x2750 lib/nlattr.c:600
Read of size 1 at addr ffffffff92c438d0 by task syz-executor.6/84207
The intel test robot uses only selftest's Makefile, not the top linux
Makefile:
> make W=1 O=/tmp/kselftest -C tools/testing/selftests
So, $(LINK.c) is determined by environment, rather than by kernel
Makefiles. On my machine (as well as other people that ran tcp-ao
selftests) GNU/Make implicit definition does use $(LDFLAGS):
But, according to build robot report, it's not the case for them.
While I could just avoid using pre-defined $(LINK.c), it's also used by
selftests/lib.mk by default.
Anyways, according to GNU/Make documentation [1], I should have used
$(LDLIBS) instead of $(LDFLAGS) in the first place, so let's just do it:
> LDFLAGS
> Extra flags to give to compilers when they are supposed to invoke
> the linker, ‘ld’, such as -L. Libraries (-lfoo) should be added
> to the LDLIBS variable instead.
> LDLIBS
> Library flags or names given to compilers when they are supposed
> to invoke the linker, ‘ld’. LOADLIBES is a deprecated (but still
> supported) alternative to LDLIBS. Non-library linker flags, such
> as -L, should go in the LDFLAGS variable.
Arnd Bergmann [Thu, 11 Jan 2024 16:27:53 +0000 (17:27 +0100)]
wangxunx: select CONFIG_PHYLINK where needed
The ngbe driver needs phylink:
arm-linux-gnueabi-ld: drivers/net/ethernet/wangxun/libwx/wx_ethtool.o: in function `wx_nway_reset':
wx_ethtool.c:(.text+0x458): undefined reference to `phylink_ethtool_nway_reset'
arm-linux-gnueabi-ld: drivers/net/ethernet/wangxun/ngbe/ngbe_main.o: in function `ngbe_remove':
ngbe_main.c:(.text+0x7c): undefined reference to `phylink_destroy'
arm-linux-gnueabi-ld: drivers/net/ethernet/wangxun/ngbe/ngbe_main.o: in function `ngbe_open':
ngbe_main.c:(.text+0xf90): undefined reference to `phylink_connect_phy'
arm-linux-gnueabi-ld: drivers/net/ethernet/wangxun/ngbe/ngbe_mdio.o: in function `ngbe_mdio_init':
ngbe_mdio.c:(.text+0x314): undefined reference to `phylink_create'
Add the missing Kconfig description for this.
Fixes: bc2426d74aa3 ("net: ngbe: convert phylib to phylink") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20240111162828.68564-1-arnd@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 9 Jan 2024 16:45:17 +0000 (08:45 -0800)]
MAINTAINERS: ibmvnic: drop Dany from reviewers
I missed that Dany uses a different email address
when tagging patches (drt@linux.ibm.com)
and asked him if he's still actively working on ibmvnic.
He doesn't really fall under our removal criteria,
but he admitted that he already moved on to other projects.
Jakub Kicinski [Tue, 9 Jan 2024 16:45:16 +0000 (08:45 -0800)]
MAINTAINERS: mark ax25 as Orphan
We haven't heard from Ralf for two years, according to lore.
We get a constant stream of "fixes" to ax25 from people using
code analysis tools. Nobody is reviewing those, let's reflect
this reality in MAINTAINERS.
Jakub Kicinski [Tue, 9 Jan 2024 16:45:15 +0000 (08:45 -0800)]
MAINTAINERS: Bluetooth: retire Johan (for now?)
Johan moved to maintaining the Zephyr Bluetooth stack,
and we haven't heard from him on the ML in 3 years
(according to lore), and seen any tags in git in 4 years.
Trade the MAINTAINER entry for CREDITS, we can revert
whenever Johan comes back to Linux hacking :)
Subsystem BLUETOOTH SUBSYSTEM
Changes 173 / 986 (17%)
Last activity: 2023-12-22
Marcel Holtmann <marcel@holtmann.org>:
Author 91cb4c19118a 2022-01-27 00:00:00 52
Committer edcb185fa9c4 2022-05-23 00:00:00 446
Tags 000c2fa2c144 2023-04-23 00:00:00 523
Johan Hedberg <johan.hedberg@gmail.com>:
Luiz Augusto von Dentz <luiz.dentz@gmail.com>:
Author d03376c18592 2023-12-22 00:00:00 241
Committer da9065caa594 2023-12-22 00:00:00 341
Tags da9065caa594 2023-12-22 00:00:00 493
Top reviewers:
[33]: alainm@chromium.org
[31]: mcchou@chromium.org
[27]: abhishekpandit@chromium.org
INACTIVE MAINTAINER Johan Hedberg <johan.hedberg@gmail.com>
Jakub Kicinski [Tue, 9 Jan 2024 16:45:12 +0000 (08:45 -0800)]
MAINTAINERS: eth: mt7530: move Landen Chao to CREDITS
mt7530 is a pretty active driver and last we have heard
from Landen Chao on the list was March. There were total
of 4 message from them in the last 2.5 years.
I think it's time to move to CREDITS.
Linus Torvalds [Thu, 11 Jan 2024 18:07:29 +0000 (10:07 -0800)]
Merge tag 'net-next-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"The most interesting thing is probably the networking structs
reorganization and a significant amount of changes is around
self-tests.
Core & protocols:
- Analyze and reorganize core networking structs (socks, netdev,
netns, mibs) to optimize cacheline consumption and set up build
time warnings to safeguard against future header changes
This improves TCP performances with many concurrent connections up
to 40%
- Add page-pool netlink-based introspection, exposing the memory
usage and recycling stats. This helps indentify bad PP users and
possible leaks
- Refine TCP/DCCP source port selection to no longer favor even
source port at connect() time when IP_LOCAL_PORT_RANGE is set. This
lowers the time taken by connect() for hosts having many active
connections to the same destination
- Refactor the TCP bind conflict code, shrinking related socket
structs
- Refactor TCP SYN-Cookie handling, as a preparation step to allow
arbitrary SYN-Cookie processing via eBPF
- Tune optmem_max for 0-copy usage, increasing the default value to
128KB and namespecifying it
- Allow coalescing for cloned skbs coming from page pools, improving
RX performances with some common configurations
- Reduce extension header parsing overhead at GRO time
- Add bridge MDB bulk deletion support, allowing user-space to
request the deletion of matching entries
- Reorder nftables struct members, to keep data accessed by the
datapath first
- Introduce TC block ports tracking and use. This allows supporting
multicast-like behavior at the TC layer
- Remove UAPI support for retired TC qdiscs (dsmark, CBQ and ATM) and
classifiers (RSVP and tcindex)
- More data-race annotations
- Extend the diag interface to dump TCP bound-only sockets
- Conditional notification of events for TC qdisc class and actions
- Support for WPAN dynamic associations with nearby devices, to form
a sub-network using a specific PAN ID
- Implement SMCv2.1 virtual ISM device support
- Add support for Batman-avd mulicast packet type
BPF:
- Tons of verifier improvements:
- BPF register bounds logic and range support along with a large
test suite
- log improvements
- complete precision tracking support for register spills
- track aligned STACK_ZERO cases as imprecise spilled registers.
This improves the verifier "instructions processed" metric from
single digit to 50-60% for some programs
- support for user's global BPF subprogram arguments with few
commonly requested annotations for a better developer
experience
- support tracking of BPF_JNE which helps cases when the compiler
transforms (unsigned) "a > 0" into "if a == 0 goto xxx" and the
like
- several fixes
- Add initial TX metadata implementation for AF_XDP with support in
mlx5 and stmmac drivers. Two types of offloads are supported right
now, that is, TX timestamp and TX checksum offload
- Fix kCFI bugs in BPF all forms of indirect calls from BPF into
kernel and from kernel into BPF work with CFI enabled. This allows
BPF to work with CONFIG_FINEIBT=y
- Change BPF verifier logic to validate global subprograms lazily
instead of unconditionally before the main program, so they can be
guarded using BPF CO-RE techniques
- Support uid/gid options when mounting bpffs
- Add a new kfunc which acquires the associated cgroup of a task
within a specific cgroup v1 hierarchy where the latter is
identified by its id
- Extend verifier to allow bpf_refcount_acquire() of a map value
field obtained via direct load which is a use-case needed in
sched_ext
- Add BPF link_info support for uprobe multi link along with bpftool
integration for the latter
- Support for VLAN tag in XDP hints
- Remove deprecated bpfilter kernel leftovers given the project is
developed in user-space (https://github.com/facebook/bpfilter)
Misc:
- Support for parellel TC self-tests execution
- Increase MPTCP self-tests coverage
- Updated the bridge documentation, including several so-far
undocumented features
- Convert all the net self-tests to run in unique netns, to avoid
random failures due to conflict and allow concurrent runs
- Add TCP-AO self-tests
- Add kunit tests for both cfg80211 and mac80211
- Autogenerate Netlink families documentation from YAML spec
- Add yml-gen support for fixed headers and recursive nests, the tool
can now generate user-space code for all genetlink families for
which we have specs
- A bunch of additional module descriptions fixes
- Catch incorrect freeing of pages belonging to a page pool
Driver API:
- Rust abstractions for network PHY drivers; do not cover yet the
full C API, but already allow implementing functional PHY drivers
in rust
- Introduce queue and NAPI support in the netdev Netlink interface,
allowing complete access to the device <> NAPIs <> queues
relationship
- Introduce notifications filtering for devlink to allow control
application scale to thousands of instances
- Improve PHY validation, requesting rate matching information for
each ethtool link mode supported by both the PHY and host
- Add support for ethtool symmetric-xor RSS hash
- ACPI based Wifi band RFI (WBRF) mitigation feature for the AMD
platform
- Expose pin fractional frequency offset value over new DPLL generic
netlink attribute
- Convert older drivers to platform remove callback returning void
- Ethernet high-speed NICs:
- Intel (100G, ice, idpf):
- allow one by one port representors creation and removal
- add temperature and clock information reporting
- add get/set for ethtool's header split ringparam
- add again FW logging
- adds support switchdev hardware packet mirroring
- iavf: implement symmetric-xor RSS hash
- igc: add support for concurrent physical and free-running
timers
- i40e: increase the allowable descriptors
- nVidia/Mellanox:
- Preparation for Socket-Direct multi-dev netdev. That will
allow in future releases combining multiple PFs devices
attached to different NUMA nodes under the same netdev
- Broadcom (bnxt):
- TX completion handling improvements
- add basic ntuple filter support
- reduce MSIX vectors usage for MQPRIO offload
- add VXLAN support, USO offload and TX coalesce completion
for P7
- Marvell Octeon EP:
- xmit-more support
- add PF-VF mailbox support and use it for FW notifications
for VFs
- Wangxun (ngbe/txgbe):
- implement ethtool functions to operate pause param, ring
param, coalesce channel number and msglevel
- Netronome/Corigine (nfp):
- add flow-steering support
- support UDP segmentation offload
- Ethernet NICs embedded, slower, virtual:
- Xilinx AXI: remove duplicate DMA code adopting the dma engine
driver
- stmmac: add support for HW-accelerated VLAN stripping
- TI AM654x sw: add mqprio, frame preemption & coalescing
- gve: add support for non-4k page sizes.
- virtio-net: support dynamic coalescing moderation
- nVidia/Mellanox Ethernet datacenter switches:
- allow firmware upgrade without a reboot
- more flexible support for bridge flooding via the compressed
FID flooding mode
- Ethernet embedded switches:
- Microchip:
- fine-tune flow control and speed configurations in KSZ8xxx
- KSZ88X3: enable setting rmii reference
- Renesas:
- add jumbo frames support
- Marvell:
- 88E6xxx: add "eth-mac" and "rmon" stats support
- Ethernet PHYs:
- aquantia: add firmware load support
- at803x: refactor the driver to simplify adding support for more
chip variants
- NXP C45 TJA11xx: Add MACsec offload support
- Wifi:
- MediaTek (mt76):
- NVMEM EEPROM improvements
- mt7996 Extremely High Throughput (EHT) improvements
- mt7996 Wireless Ethernet Dispatcher (WED) support
- mt7996 36-bit DMA support
- Qualcomm (ath12k):
- support for a single MSI vector
- WCN7850: support AP mode
- Intel (iwlwifi):
- new debugfs file fw_dbg_clear
- allow concurrent P2P operation on DFS channels
- Bluetooth:
- QCA2066: support HFP offload
- ISO: more broadcast-related improvements
- NXP: better recovery in case receiver/transmitter get out of sync"
* tag 'net-next-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1714 commits)
lan78xx: remove redundant statement in lan78xx_get_eee
lan743x: remove redundant statement in lan743x_ethtool_get_eee
bnxt_en: Fix RCU locking for ntuple filters in bnxt_rx_flow_steer()
bnxt_en: Fix RCU locking for ntuple filters in bnxt_srxclsrldel()
bnxt_en: Remove unneeded variable in bnxt_hwrm_clear_vnic_filter()
tcp: Revert no longer abort SYN_SENT when receiving some ICMP
Revert "mlx5 updates 2023-12-20"
Revert "net: stmmac: Enable Per DMA Channel interrupt"
ipvlan: Remove usage of the deprecated ida_simple_xx() API
ipvlan: Fix a typo in a comment
net/sched: Remove ipt action tests
net: stmmac: Use interrupt mode INTM=1 for per channel irq
net: stmmac: Add support for TX/RX channel interrupt
net: stmmac: Make MSI interrupt routine generic
dt-bindings: net: snps,dwmac: per channel irq
net: phy: at803x: make read_status more generic
net: phy: at803x: add support for cdt cross short test for qca808x
net: phy: at803x: refactor qca808x cable test get status function
net: phy: at803x: generalize cdt fault length function
net: ethernet: cortina: Drop TSO support
...
Linus Torvalds [Thu, 11 Jan 2024 02:18:20 +0000 (18:18 -0800)]
Merge tag 's390-6.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Alexander Gordeev:
- Add machine variable capacity information to /proc/sysinfo.
- Limit the waste of page tables and always align vmalloc area size and
base address on segment boundary.
- Fix a memory leak when an attempt to register interruption sub class
(ISC) for the adjunct-processor (AP) guest failed.
- Reset response code AP_RESPONSE_INVALID_GISA to understandable by
guest AP_RESPONSE_INVALID_ADDRESS in response to a failed
interruption sub class (ISC) registration attempt.
- Improve reaction to adjunct-processor (AP)
AP_RESPONSE_OTHERWISE_CHANGED response code when enabling interrupts
on behalf of a guest.
- Fix incorrect sysfs 'status' attribute of adjunct-processor (AP)
queue device bound to the vfio_ap device driver when the mediated
device is attached to a guest, but the queue device is not passed
through.
- Rework struct ap_card to hold the whole adjunct-processor (AP) card
hardware information. As result, all the ugly bit checks are replaced
by simple evaluations of the required bit fields.
- Improve handling of some weird scenarios between service element (SE)
host and SE guest with adjunct-processor (AP) pass-through support.
- Change local_ctl_set_bit() and local_ctl_clear_bit() so they return
the previous value of the to be changed control register. This is
useful if a bit is only changed temporarily and the previous content
needs to be restored.
- The kernel starts with machine checks disabled and is expected to
enable it once trap_init() is called. However the implementation
allows machine checks early. Consistently enable it in trap_init()
only.
- local_mcck_disable() and local_mcck_enable() assume that machine
checks are always enabled. Instead implement and use
local_mcck_save() and local_mcck_restore() to disable machine checks
and restore the previous state.
- Modification of floating point control (FPC) register of a traced
process using ptrace interface may lead to corruption of the FPC
register of the tracing process. Fix this.
- kvm_arch_vcpu_ioctl_set_fpu() allows to set the floating point
control (FPC) register in vCPU, but may lead to corruption of the FPC
register of the host process. Fix this.
- Use READ_ONCE() to read a vCPU floating point register value from the
memory mapped area. This avoids that, depending on code generation, a
different value is tested for validity than the one that is used.
- Get rid of test_fp_ctl(), since it is quite subtle to use it
correctly. Instead copy a new floating point control register value
into its save area and test the validity of the new value when
loading it.
- Remove superfluous save_fpu_regs() call.
- Remove s390 support for ARCH_WANTS_DYNAMIC_TASK_STRUCT. All machines
provide the vector facility since many years and the need to make the
task structure size dependent on the vector facility does not exist.
- Remove the "novx" kernel command line option, as the vector code runs
without any problems since many years.
- Add the vector facility to the z13 architecture level set (ALS). All
hypervisors support the vector facility since many years. This allows
compile time optimizations of the kernel.
- Get rid of MACHINE_HAS_VX and replace it with cpu_has_vx(). As
result, the compiled code will have less runtime checks and less
code.
- Convert pgste_get_lock() and pgste_set_unlock() ASM inlines to C.
- Convert the struct subchannel spinlock from pointer to member.
* tag 's390-6.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (24 commits)
Revert "s390: update defconfigs"
s390/cio: make sch->lock spinlock pointer a member
s390: update defconfigs
s390/mm: convert pgste locking functions to C
s390/fpu: get rid of MACHINE_HAS_VX
s390/als: add vector facility to z13 architecture level set
s390/fpu: remove "novx" option
s390/fpu: remove ARCH_WANTS_DYNAMIC_TASK_STRUCT support
KVM: s390: remove superfluous save_fpu_regs() call
s390/fpu: get rid of test_fp_ctl()
KVM: s390: use READ_ONCE() to read fpc register value
KVM: s390: fix setting of fpc register
s390/ptrace: handle setting of fpc register correctly
s390/nmi: implement and use local_mcck_save() / local_mcck_restore()
s390/nmi: consistently enable machine checks in trap_init()
s390/ctlreg: return old register contents when changing bits
s390/ap: handle outband SE bind state change
s390/ap: store TAPQ hwinfo in struct ap_card
s390/vfio-ap: fix sysfs status attribute for AP queue devices
s390/vfio-ap: improve reaction to response code 07 from PQAP(AQIC) command
...
Linus Torvalds [Thu, 11 Jan 2024 02:13:44 +0000 (18:13 -0800)]
Merge tag 'asm-generic-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull asm-generic cleanups from Arnd Bergmann:
"A series from Baoquan He cleans up the asm-generic/io.h to remove the
ioremap_uc() definition from everything except x86, which still needs
it for pre-PAT systems. This series notably contains a patch from
Jiaxun Yang that converts MIPS to use asm-generic/io.h like every
other architecture does, enabling future cleanups.
Some of my own patches fix -Wmissing-prototype warnings in
architecture specific code across several architectures. This is now
needed as the warning is enabled by default. There are still some
remaining warnings in minor platforms, but the series should catch
most of the widely used ones make them more consistent with one
another.
David McKay fixes a bug in __generic_cmpxchg_local() when this is used
on 64-bit architectures. This could currently only affect parisc64 and
sparc64.
Additional cleanups address from Linus Walleij, Uwe Kleine-König,
Thomas Huth, and Kefeng Wang help reduce unnecessary inconsistencies
between architectures"
* tag 'asm-generic-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
asm-generic: Fix 32 bit __generic_cmpxchg_local
Hexagon: Make pfn accessors statics inlines
ARC: mm: Make virt_to_pfn() a static inline
mips: remove extraneous asm-generic/iomap.h include
sparc: Use $(kecho) to announce kernel images being ready
arm64: vdso32: Define BUILD_VDSO32_64 to correct prototypes
csky: fix arch_jump_label_transform_static override
arch: add do_page_fault prototypes
arch: add missing prepare_ftrace_return() prototypes
arch: vdso: consolidate gettime prototypes
arch: include linux/cpu.h for trap_init() prototype
arch: fix asm-offsets.c building with -Wmissing-prototypes
arch: consolidate arch_irq_work_raise prototypes
hexagon: Remove CONFIG_HEXAGON_ARCH_VERSION from uapi header
asm/io: remove unnecessary xlate_dev_mem_ptr() and unxlate_dev_mem_ptr()
mips: io: remove duplicated codes
arch/*/io.h: remove ioremap_uc in some architectures
mips: add <asm-generic/io.h> including
Linus Torvalds [Thu, 11 Jan 2024 02:00:18 +0000 (18:00 -0800)]
Merge tag 'modules-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
Pull module updates from Luis Chamberlain:
"Just one cleanup and one documentation improvement change. No
functional changes"
* tag 'modules-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
kernel/module: improve documentation for try_module_get()
module: Remove redundant TASK_UNINTERRUPTIBLE
Linus Torvalds [Thu, 11 Jan 2024 01:44:36 +0000 (17:44 -0800)]
Merge tag 'sysctl-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
Pull sysctl updates from Luis Chamberlain:
"To help make the move of sysctls out of kernel/sysctl.c not incur a
size penalty sysctl has been changed to allow us to not require the
sentinel, the final empty element on the sysctl array. Joel Granados
has been doing all this work.
In the v6.6 kernel we got the major infrastructure changes required to
support this. For v6.7 we had all arch/ and drivers/ modified to
remove the sentinel. For v6.8-rc1 we get a few more updates for fs/
directory only.
The kernel/ directory is left but we'll save that for v6.9-rc1 as
those patches are still being reviewed. After that we then can expect
also the removal of the no longer needed check for procname == NULL.
Let us recap the purpose of this work:
- this helps reduce the overall build time size of the kernel and run
time memory consumed by the kernel by about ~64 bytes per array
- the extra 64-byte penalty is no longer inncurred now when we move
sysctls out from kernel/sysctl.c to their own files
Thomas Weißschuh also sent a few cleanups, for v6.9-rc1 we expect to
see further work by Thomas Weißschuh with the constificatin of the
struct ctl_table.
Due to Joel Granados's work, and to help bring in new blood, I have
suggested for him to become a maintainer and he's accepted. So for
v6.9-rc1 I look forward to seeing him sent you a pull request for
further sysctl changes. This also removes Iurii Zaikin as a maintainer
as he has moved on to other projects and has had no time to help at
all"
* tag 'sysctl-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
sysctl: remove struct ctl_path
sysctl: delete unused define SYSCTL_PERM_EMPTY_DIR
coda: Remove the now superfluous sentinel elements from ctl_table array
sysctl: Remove the now superfluous sentinel elements from ctl_table array
fs: Remove the now superfluous sentinel elements from ctl_table array
cachefiles: Remove the now superfluous sentinel element from ctl_table array
sysclt: Clarify the results of selftest run
sysctl: Add a selftest for handling empty dirs
sysctl: Fix out of bounds access for empty sysctl registers
MAINTAINERS: Add Joel Granados as co-maintainer for proc sysctl
MAINTAINERS: remove Iurii Zaikin from proc sysctl
Linus Torvalds [Thu, 11 Jan 2024 00:43:55 +0000 (16:43 -0800)]
Merge tag 'header_cleanup-2024-01-10' of https://evilpiepirate.org/git/bcachefs
Pull header cleanups from Kent Overstreet:
"The goal is to get sched.h down to a type only header, so the main
thing happening in this patchset is splitting out various _types.h
headers and dependency fixups, as well as moving some things out of
sched.h to better locations.
This is prep work for the memory allocation profiling patchset which
adds new sched.h interdepencencies"
* tag 'header_cleanup-2024-01-10' of https://evilpiepirate.org/git/bcachefs: (51 commits)
Kill sched.h dependency on rcupdate.h
kill unnecessary thread_info.h include
Kill unnecessary kernel.h include
preempt.h: Kill dependency on list.h
rseq: Split out rseq.h from sched.h
LoongArch: signal.c: add header file to fix build error
restart_block: Trim includes
lockdep: move held_lock to lockdep_types.h
sem: Split out sem_types.h
uidgid: Split out uidgid_types.h
seccomp: Split out seccomp_types.h
refcount: Split out refcount_types.h
uapi/linux/resource.h: fix include
x86/signal: kill dependency on time.h
syscall_user_dispatch.h: split out *_types.h
mm_types_task.h: Trim dependencies
Split out irqflags_types.h
ipc: Kill bogus dependency on spinlock.h
shm: Slim down dependencies
workqueue: Split out workqueue_types.h
...
Linus Torvalds [Thu, 11 Jan 2024 00:34:17 +0000 (16:34 -0800)]
Merge tag 'bcachefs-2024-01-10' of https://evilpiepirate.org/git/bcachefs
Pull bcachefs updates from Kent Overstreet:
- btree write buffer rewrite: instead of adding keys to the btree write
buffer at transaction commit time, we now journal them with a
different journal entry type and copy them from the journal to the
write buffer just prior to journal write.
This reduces the number of atomic operations on shared cachelines in
the transaction commit path and is a signicant performance
improvement on some workloads: multithreaded 4k random writes went
from ~650k iops to ~850k iops.
- Bring back optimistic spinning for six locks: the new implementation
doesn't use osq locks; instead we add to the lock waitlist as normal,
and then spin on the lock_acquired bit in the waitlist entry, _not_
the lock itself.
- New ioctls:
- BCH_IOCTL_DEV_USAGE_V2, which allows for new data types
- BCH_IOCTL_OFFLINE_FSCK, which runs the kernel implementation of
fsck but without mounting: useful for transparently using the
kernel version of fsck from 'bcachefs fsck' when the kernel
version is a better match for the on disk filesystem.
- BCH_IOCTL_ONLINE_FSCK: online fsck. Not all passes are supported
yet, but the passes that are supported are fully featured - errors
may be corrected as normal.
The new ioctls use the new 'thread_with_file' abstraction for kicking
off a kthread that's tied to a file descriptor returned to userspace
via the ioctl.
- btree_paths within a btree_trans are now dynamically growable,
instead of being limited to 64. This is important for the
check_directory_structure phase of fsck, and also fixes some issues
we were having with btree path overflow in the reflink btree.
- Trigger refactoring; prep work for the upcoming disk space accounting
rewrite
- Numerous bugfixes :)
* tag 'bcachefs-2024-01-10' of https://evilpiepirate.org/git/bcachefs: (226 commits)
bcachefs: eytzinger0_find() search should be const
bcachefs: move "ptrs not changing" optimization to bch2_trigger_extent()
bcachefs: fix simulateously upgrading & downgrading
bcachefs: Restart recovery passes more reliably
bcachefs: bch2_dump_bset() doesn't choke on u64s == 0
bcachefs: improve checksum error messages
bcachefs: improve validate_bset_keys()
bcachefs: print sb magic when relevant
bcachefs: __bch2_sb_field_to_text()
bcachefs: %pg is banished
bcachefs: Improve would_deadlock trace event
bcachefs: fsck_err()s don't need to manually check c->sb.version anymore
bcachefs: Upgrades now specify errors to fix, like downgrades
bcachefs: no thread_with_file in userspace
bcachefs: Don't autofix errors we can't fix
bcachefs: add missing bch2_latency_acct() call
bcachefs: increase max_active on io_complete_wq
bcachefs: add time_stats for btree_node_read_done()
bcachefs: don't clear accessed bit in btree node fill
bcachefs: Add an option to control btree node prefetching
...
Linus Torvalds [Thu, 11 Jan 2024 00:23:30 +0000 (16:23 -0800)]
Merge tag 'v6.8-rc-part1-smb-client' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
"Various smb client fixes, most related to better handling special file
types:
- Improve handling of special file types:
- performance improvement (better compounding and better caching
of readdir entries that are reparse points)
- extend support for creating special files (sockets, fifos,
block/char devices)
- fix renaming and hardlinking of reparse points
- extend support for creating symlinks with IO_REPARSE_TAG_SYMLINK
- Multichannel logging improvement
- Exception handling fix
- Minor cleanups"
* tag 'v6.8-rc-part1-smb-client' of git://git.samba.org/sfrench/cifs-2.6:
cifs: update internal module version number for cifs.ko
cifs: remove unneeded return statement
cifs: make cifs_chan_update_iface() a void function
cifs: delete unnecessary NULL checks in cifs_chan_update_iface()
cifs: get rid of dup length check in parse_reparse_point()
smb: client: stop revalidating reparse points unnecessarily
cifs: Pass unbyteswapped eof value into SMB2_set_eof()
smb3: Improve exception handling in allocate_mr_list()
cifs: fix in logging in cifs_chan_update_iface
smb: client: handle special files and symlinks in SMB3 POSIX
smb: client: cleanup smb2_query_reparse_point()
smb: client: allow creating symlinks via reparse points
smb: client: fix hardlinking of reparse points
smb: client: fix renaming of reparse points
smb: client: optimise reparse point querying
smb: client: allow creating special files via reparse points
smb: client: extend smb2_compound_op() to accept more commands
smb: client: Fix minor whitespace errors and warnings
Linus Torvalds [Thu, 11 Jan 2024 00:13:57 +0000 (16:13 -0800)]
Merge tag 'nfs-for-6.8-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull nfs client updates from Anna Schumaker:
"New Features:
- Always ask for type with READDIR
- Remove nfs_writepage()
Bugfixes:
- Fix a suspicious RCU usage warning
- Fix a blocklayoutdriver reference leak
- Fix the block driver's calculation of layoutget size
- Fix handling NFS4ERR_RETURNCONFLICT
- Fix _xprt_switch_find_current_entry()
- Fix v4.1 backchannel request timeouts
- Don't add zero-length pnfs block devices
- Use the parent cred in nfs_access_login_time()
Cleanups:
- A few improvements when dealing with referring calls from the
server
- Clean up various unused variables, struct fields, and function
calls
- Various tracepoint improvements"
* tag 'nfs-for-6.8-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (21 commits)
NFSv4.1: Use the nfs_client's rpc timeouts for backchannel
SUNRPC: Fixup v4.1 backchannel request timeouts
rpc_pipefs: Replace one label in bl_resolve_deviceid()
nfs: Remove writepage
NFS: drop unused nfs_direct_req bytes_left
pNFS: Fix the pnfs block driver's calculation of layoutget size
nfs: print fileid in lookup tracepoints
nfs: rename the nfs_async_rename_done tracepoint
nfs: add new tracepoint at nfs4 revalidate entry point
SUNRPC: fix _xprt_switch_find_current_entry logic
NFSv4.1/pnfs: Ensure we handle the error NFS4ERR_RETURNCONFLICT
NFSv4.1: if referring calls are complete, trust the stateid argument
NFSv4: Track the number of referring calls in struct cb_process_state
NFS: Use parent's objective cred in nfs_access_login_time()
NFSv4: Always ask for type with READDIR
pnfs/blocklayout: Don't add zero-length pnfs_block_dev
blocklayoutdriver: Fix reference leak of pnfs_device_node
SUNRPC: Fix a suspicious RCU usage warning
SUNRPC: Create a helper function for accessing the rpc_clnt's xprt_switch
SUNRPC: Remove unused function rpc_clnt_xprt_switch_put()
...
Linus Torvalds [Thu, 11 Jan 2024 00:09:14 +0000 (16:09 -0800)]
Merge tag 'ext4_for_linus-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Various ext4 bug fixes and cleanups. The fixes are mostly in the
fstrim and mballoc code paths.
Also enable dioread_nolock in the case where the block size is less
than the page size (dioread_nolock has been default in the bs == ps
case for quite some time)"
* tag 'ext4_for_linus-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix inconsistent between segment fstrim and full fstrim
ext4: fallback to complex scan if aligned scan doesn't work
ext4: convert ext4_da_do_write_end() to take a folio
ext4: allow for the last group to be marked as trimmed
ext4: move ext4_check_bdev_write_error() into nojournal mode
jbd2: abort journal when detecting metadata writeback error of fs dev
jbd2: remove unused 'JBD2_CHECKPOINT_IO_ERROR' and 'j_atomic_flags'
jbd2: replace journal state flag by checking errseq
jbd2: add errseq to detect client fs's bdev writeback error
ext4: improving calculation of 'fe_{len|start}' in mb_find_extent()
ext4: clarify handling of unwritten bh in __ext4_block_zero_page_range()
ext4: treat end of range as exclusive in ext4_zero_range()
ext4: enable dioread_nolock as default for bs < ps case
ext4: delete redundant calculations in ext4_mb_get_buddy_page_lock()
ext4: reduce unnecessary memory allocation in alloc_flex_gd()
ext4: avoid online resizing failures due to oversized flex bg
ext4: remove unnecessary check from alloc_flex_gd()
ext4: unify the type of flexbg_size to unsigned int
Linus Torvalds [Thu, 11 Jan 2024 00:06:58 +0000 (16:06 -0800)]
Merge tag 'unicode-next-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode
Pull unicode updates from Gabriel Krisman Bertazi:
"Other than the update to MAINTAINERS, this PR has only a fix to stop
ecryptfs from inadvertently mounting case-insensitive filesystems that
it cannot handle, which would otherwise caused post-mount failures"
* tag 'unicode-next-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode:
MAINTAINERS: update unicode maintainer e-mail address
ecryptfs: Reject casefold directory inodes
David Howells [Wed, 10 Jan 2024 21:11:40 +0000 (21:11 +0000)]
keys, dns: Fix size check of V1 server-list header
Fix the size check added to dns_resolver_preparse() for the V1 server-list
header so that it doesn't give EINVAL if the size supplied is the same as
the size of the header struct (which should be valid).
Linus Torvalds [Wed, 10 Jan 2024 19:37:38 +0000 (11:37 -0800)]
Merge tag 'tpmdd-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd
Pull tpm updates from Jarkko Sakkinen:
"Just a couple fixes and no new features"
* tag 'tpmdd-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd:
tpm: cr50: fix kernel-doc warning and spelling
tpm: nuvoton: Use i2c_get_match_data()
Linus Torvalds [Wed, 10 Jan 2024 19:03:52 +0000 (11:03 -0800)]
Merge tag 'hardening-v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening updates from Kees Cook:
- Introduce the param_unknown_fn type and other clean ups (Andy
Shevchenko)
- Various __counted_by annotations (Christophe JAILLET, Gustavo A. R.
Silva, Kees Cook)
- Add KFENCE test to LKDTM (Stephen Boyd)
- Various strncpy() refactorings (Justin Stitt)
- Fix qnx4 to avoid writing into the smaller of two overlapping buffers
- Various strlcpy() refactorings
* tag 'hardening-v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
qnx4: Use get_directory_fname() in qnx4_match()
qnx4: Extract dir entry filename processing into helper
atags_proc: Add __counted_by for struct buffer and use struct_size()
tracing/uprobe: Replace strlcpy() with strscpy()
params: Fix multi-line comment style
params: Sort headers
params: Use size_add() for kmalloc()
params: Do not go over the limit when getting the string length
params: Introduce the param_unknown_fn type
lkdtm: Add kfence read after free crash type
nvme-fc: replace deprecated strncpy with strscpy
nvdimm/btt: replace deprecated strncpy with strscpy
nvme-fabrics: replace deprecated strncpy with strscpy
drm/modes: replace deprecated strncpy with strscpy_pad
afs: Add __counted_by for struct afs_acl and use struct_size()
VMCI: Annotate struct vmci_handle_arr with __counted_by
i40e: Annotate struct i40e_qvlist_info with __counted_by
HID: uhid: replace deprecated strncpy with strscpy
samples: Replace strlcpy() with strscpy()
SUNRPC: Replace strlcpy() with strscpy()