====================
net: dsa: microchip: use phylink_mac_ops for ksz driver
This four patch series switches the Microchip KSZ DSA driver to use
phylink_mac_ops support, and for this one we go a little further
beyond a simple conversion. This driver has four distinct cases:
lan937x
ksz9477
ksz8
ksz8830
Three of these cases are handled by shimming the existing DSA calls
through ksz_dev_ops, and the final case is handled through a
conditional in ksz_phylink_mac_config(). These can all be handled
with separate phylink_mac_ops.
To get there, we do a progressive conversion.
Patch 1 removes ksz_dev_ops' phylink_mac_config() method which is
not populated in any of the arrays - and is thus redundant.
Patch 2 switches the driver to use a common set of phylink_mac_ops
for all cases, doing the simple conversion to avoid the DSA shim.
Patch 3 pushes the phylink_mac_ops down to the first three classes
(lan937x, ksz9477, ksz8) adding an appropriate pointer to the
phylink_mac_ops to struct ksz_chip_data, and using that to
populate DSA's ds->phylink_mac_ops pointer. The difference between
each of these are the mac_link_up() method. mac_config() and
mac_link_down() remain common between each at this stage.
Patch 4 splits out ksz8830, which needs different mac_config()
handling, and thus means we have a difference in mac_config()
methods between the now four phylink_mac_ops structures.
Build tested only, with additional -Wunused-const-variable flag.
====================
We've added 147 non-merge commits during the last 32 day(s) which contain
a total of 158 files changed, 9400 insertions(+), 2213 deletions(-).
The main changes are:
1) Add an internal-only BPF per-CPU instruction for resolving per-CPU
memory addresses and implement support in x86 BPF JIT. This allows
inlining per-CPU array and hashmap lookups
and the bpf_get_smp_processor_id() helper, from Andrii Nakryiko.
2) Add BPF link support for sk_msg and sk_skb programs, from Yonghong Song.
3) Optimize x86 BPF JIT's emit_mov_imm64, and add support for various
atomics in bpf_arena which can be JITed as a single x86 instruction,
from Alexei Starovoitov.
4) Add support for passing mark with bpf_fib_lookup helper,
from Anton Protopopov.
5) Add a new bpf_wq API for deferring events and refactor sleepable
bpf_timer code to keep common code where possible,
from Benjamin Tissoires.
6) Fix BPF_PROG_TEST_RUN infra with regards to bpf_dummy_struct_ops programs
to check when NULL is passed for non-NULLable parameters,
from Eduard Zingerman.
7) Harden the BPF verifier's and/or/xor value tracking,
from Harishankar Vishwanathan.
8) Introduce crypto kfuncs to make BPF programs able to utilize the kernel
crypto subsystem, from Vadim Fedorenko.
9) Various improvements to the BPF instruction set standardization doc,
from Dave Thaler.
10) Extend libbpf APIs to partially consume items from the BPF ringbuffer,
from Andrea Righi.
11) Bigger batch of BPF selftests refactoring to use common network helpers
and to drop duplicate code, from Geliang Tang.
12) Support bpf_tail_call_static() helper for BPF programs with GCC 13,
from Jose E. Marchesi.
13) Add bpf_preempt_{disable,enable}() kfuncs in order to allow a BPF
program to have code sections where preemption is disabled,
from Kumar Kartikeya Dwivedi.
14) Allow invoking BPF kfuncs from BPF_PROG_TYPE_SYSCALL programs,
from David Vernet.
15) Extend the BPF verifier to allow different input maps for a given
bpf_for_each_map_elem() helper call in a BPF program, from Philo Lu.
16) Add support for PROBE_MEM32 and bpf_addr_space_cast instructions
for riscv64 and arm64 JITs to enable BPF Arena, from Puranjay Mohan.
17) Shut up a false-positive KMSAN splat in interpreter mode by unpoison
the stack memory, from Martin KaFai Lau.
18) Improve xsk selftest coverage with new tests on maximum and minimum
hardware ring size configurations, from Tushar Vyavahare.
19) Various ReST man pages fixes as well as documentation and bash completion
improvements for bpftool, from Rameez Rehman & Quentin Monnet.
20) Fix libbpf with regards to dumping subsequent char arrays,
from Quentin Deslandes.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (147 commits)
bpf, docs: Clarify PC use in instruction-set.rst
bpf_helpers.h: Define bpf_tail_call_static when building with GCC
bpf, docs: Add introduction for use in the ISA Internet Draft
selftests/bpf: extend BPF_SOCK_OPS_RTT_CB test for srtt and mrtt_us
bpf: add mrtt and srtt as BPF_SOCK_OPS_RTT_CB args
selftests/bpf: dummy_st_ops should reject 0 for non-nullable params
bpf: check bpf_dummy_struct_ops program params for test runs
selftests/bpf: do not pass NULL for non-nullable params in dummy_st_ops
selftests/bpf: adjust dummy_st_ops_success to detect additional error
bpf: mark bpf_dummy_struct_ops.test_1 parameter as nullable
selftests/bpf: Add ring_buffer__consume_n test.
bpf: Add bpf_guard_preempt() convenience macro
selftests: bpf: crypto: add benchmark for crypto functions
selftests: bpf: crypto skcipher algo selftests
bpf: crypto: add skcipher to bpf crypto
bpf: make common crypto API for TC/XDP programs
bpf: update the comment for BTF_FIELDS_MAX
selftests/bpf: Fix wq test.
selftests/bpf: Use make_sockaddr in test_sock_addr
selftests/bpf: Use connect_to_addr in test_sock_addr
...
====================
net: phy: micrel: Add support for PTP_PF_EXTTS for lan8814
Extend the PTP programmable gpios to implement also PTP_PF_EXTTS
function. The pins can be configured to capture both of rising
and falling edge. Once the event is seen, then an interrupt is
generated and the LTC is saved in the registers.
On lan8814 only GPIO 3 can be configured for this.
This was tested using:
ts2phc -m -l 7 -s generic -f ts2phc.cfg
Where the configuration was the following:
---
[global]
ts2phc.pin_index 3
[eth0]
---
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 29 Apr 2024 12:35:41 +0000 (13:35 +0100)]
Merge branch 'dsa-realtek-leds'
Luiz Angelo Daros de Luca says:
====================
net: dsa: realtek: fix LED support for rtl8366
This series fixes the LED support for rtl8366. The existing code was not
tested in a device with switch LEDs and it was using a flawed logic.
The driver now keeps the default LED configuration if nothing requests a
different behavior. This may be enough for most devices. This can be
achieved either by omitting the LED from the device-tree or configuring
all LEDs in a group with the default state set to "keep".
The hardware trigger for LEDs in Realtek switches is shared among all
LEDs in a group. This behavior doesn't align well with the Linux LED
API, which controls LEDs individually. Once the OS changes the
brightness of a LED in a group still triggered by the hardware, the
entire group switches to software-controlled LEDs, even for those not
metioned in the device-tree. This shared behavior also prevents
offloading the trigger to the hardware as it would require an
orchestration between LEDs in a group, not currently present in the LED
API.
The assertion of device hardware reset during driver removal was removed
because it was causing an issue with the LED release code. Devres
devices are released after the driver's removal is executed. Asserting
the reset at that point was causing timeout errors during LED release
when it attempted to turn off the LED.
To: Linus Walleij <linus.walleij@linaro.org>
To: Alvin Šipraga <alsi@bang-olufsen.dk>
To: Andrew Lunn <andrew@lunn.ch>
To: Florian Fainelli <f.fainelli@gmail.com>
To: Vladimir Oltean <olteanv@gmail.com>
To: David S. Miller <davem@davemloft.net>
To: Eric Dumazet <edumazet@google.com>
To: Jakub Kicinski <kuba@kernel.org>
To: Paolo Abeni <pabeni@redhat.com>
To: Rob Herring <robh+dt@kernel.org>
To: Krzysztof Kozlowski <krzysztof.kozlowski+dt@linaro.org>
To: Conor Dooley <conor+dt@kernel.org> Cc: netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: devicetree@vger.kernel.org Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
Changes in v2:
- Fixed commit message formatting
- Added GROUP to LED group enum values. With that, moved the code that
disables LED into a new function to keep 80-collumn limit.
- Dropped unused enable argument in rb8366rb_get_port_led()
- Fixed variable order in rtl8366rb_setup_led()
- Removed redundant led group test in rb8366rb_{g,s}et_port_led()
- Initialize ret as 0 in rtl8366rb_setup_leds()
- Updated comments related to LED blinking and setup
- Link to v1: https://lore.kernel.org/r/20240310-realtek-led-v1-0-4d9813ce938e@gmail.com
Changes in v1:
- Rebased on new relatek DSA drivers
- Improved commit messages
- Added commit to remove the reset assert during .remove
- Link to RFC: https://lore.kernel.org/r/20240106184651.3665-1-luizluca@gmail.com
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit introduces LED drivers for rtl8366rb, enabling LEDs to be
described in the device tree using the same format as qca8k. Each port
can configure up to 4 LEDs.
If all LEDs in a group use the default state "keep", they will use the
default behavior after a reset. Changing the brightness of one LED,
either manually or by a trigger, will disable the default hardware
trigger and switch the entire LED group to manually controlled LEDs.
Once in this mode, there is no way to revert to hardware-controlled LEDs
(except by resetting the switch).
Software triggers function as expected with manually controlled LEDs.
Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
The necessity of asserting the reset on removal was previously
questioned, as DSA's own cleanup methods should suffice to prevent
traffic leakage[1].
When a driver has subdrivers controlled by devres, they will be
unregistered after the main driver's .remove is executed. If it asserts
a reset, the subdrivers will be unable to communicate with the hardware
during their cleanup. For LEDs, this means that they will fail to turn
off, resulting in a timeout error.
Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
net: dsa: realtek: keep default LED state in rtl8366rb
This switch family supports four LEDs for each of its six ports. Each
LED group is composed of one of these four LEDs from all six ports. LED
groups can be configured to display hardware information, such as link
activity, or manually controlled through a bitmap in registers
RTL8366RB_LED_0_1_CTRL_REG and RTL8366RB_LED_2_3_CTRL_REG.
After a reset, the default LED group configuration for groups 0 to 3
indicates, respectively, link activity, link at 1000M, 100M, and 10M, or
RTL8366RB_LED_CTRL_REG as 0x5432. These configurations are commonly used
for LED indications. However, the driver was replacing that
configuration to use manually controlled LEDs (RTL8366RB_LED_FORCE)
without providing a way for the OS to control them. The default
configuration is deemed more useful than fixed, uncontrollable turned-on
LEDs.
The driver was enabling/disabling LEDs during port_enable/disable.
However, these events occur when the port is administratively controlled
(up or down) and are not related to link presence. Additionally, when a
port N was disabled, the driver was turning off all LEDs for group N,
not only the corresponding LED for port N in any of those 4 groups. In
such cases, if port 0 was brought down, the LEDs for all ports in LED
group 0 would be turned off. As another side effect, the driver was
wrongly warning that port 5 didn't have an LED ("no LED for port 5").
Since showing the administrative state of ports is not an orthodox way
to use LEDs, it was not worth it to fix it and all this code was
dropped.
The code to disable LEDs was simplified only changing each LED group to
the RTL8366RB_LED_OFF state. Registers RTL8366RB_LED_0_1_CTRL_REG and
RTL8366RB_LED_2_3_CTRL_REG are only used when the corresponding LED
group is configured with RTL8366RB_LED_FORCE and they don't need to be
cleaned. The code still references an LED controlled by
RTL8366RB_INTERRUPT_CONTROL_REG, but as of now, no test device has
actually used it. Also, some magic numbers were replaced by macros.
Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Spectrum ASICs only support a single interrupt, it means that all the
events are handled by one IRQ (interrupt request) handler.
Currently, we schedule a tasklet to handle events in EQ, then we also use
tasklet for CQ, SDQ and RDQ. Tasklet runs in softIRQ (software IRQ)
context, and will be run on the same CPU which scheduled it. It means that
today we have one CPU which handles all the packets (both network packets
and EMADs) from hardware.
The existing implementation is not efficient and can be improved.
Measuring latency of EMADs in the driver (without the time in FW) shows
that latency is increased by factor of 28 (x28) when network traffic is
handled by the driver.
Measuring throughput in CPU shows that CPU can handle ~35% less packets
of specific flow when corrupted packets are also handled by the driver.
There are cases that these values even worse, we measure decrease of ~44%
packet rate.
This can be improved if network packet and EMADs will be handled in
parallel by several CPUs, and more than that, if different types of traffic
will be handled in parallel. We can achieve this using NAPI.
This set converts the driver to process completions from hardware via NAPI.
The idea is to add NAPI instance per CQ (which is mapped 1:1 to SDQ/RDQ),
which means that each DQ can be handled separately. we have DQ for EMADs
and DQs for each trap group (like LLDP, BGP, L3 drops, etc..). See more
details in commit messages.
An additional improvement which is done as part of this set is related to
doorbells' ring. The idea is to handle small chunks of Rx packets (which
is also recommended using NAPI) and ring doorbells once per chunk. This
reduces the access to hardware which is expensive (time wise) and might
take time because of memory barriers.
With this set we can see better performance.
To summerize:
EMADs latency:
+------------------------------------------------------------------------+
| | Before this set | Now |
|------------------|---------------------------|-------------------------|
| Increased factor | x28 | x1.5 |
+------------------------------------------------------------------------+
Note that we can see even measurements that show better latency when
traffic is handled by the driver.
Throughput:
+------------------------------------------------------------------------+
| | Before this set | Now |
|-------------|----------------------------|-----------------------------|
| Reduced | 35% | 6% |
| packet rate | | |
+------------------------------------------------------------------------+
Additional improvements are planned - use page pool for buffer allocations
and avoid cache miss of each SKB using napi_build_skb().
Patch set overview:
Patches #1-#2 improve access to hardware by reducing dorbells' rings
Patch #3-#4 are preaparations for NAPI usage
Patch #5 converts the driver to use NAPI
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Amit Cohen [Fri, 26 Apr 2024 12:42:26 +0000 (14:42 +0200)]
mlxsw: pci: Use NAPI for event processing
Spectrum ASICs only support a single interrupt, that means that all the
events are handled by one IRQ (interrupt request) handler. Once an
interrupt is received, we schedule tasklet to handle events from EQ and
then schedule tasklets to handle completions from CQs. Tasklet runs in
softIRQ (software IRQ) context, and will be run on the same CPU which
scheduled it. That means that today we use only one CPU to handle all the
packets (both network packets and EMADs) from hardware.
This can be improved using NAPI. The idea is to use NAPI instance per
CQ, which is mapped 1:1 to DQ (RDQ or SDQ). NAPI poll method can be run
in kernel thread, so then the driver will be able to handle WQEs in several
CPUs. Convert the existing code to use NAPI APIs.
Add NAPI instance as part of 'struct mlxsw_pci_queue' and initialize it
as part of CQs initialization. Set the appropriate poll method and dummy
net device, according to queue number, similar to tasklet setup. For CQs
which are used for completions of RDQ, use Rx poll method and
'napi_dev_rx', which is set as 'threaded'. It means that Rx poll method
will run in kernel context, so several RDQs will be handled in parallel.
For CQs which are used for completions of SDQ, use Tx poll method and
'napi_dev_tx', this method will run in softIRQ context, as it is
recommended in NAPI documentation, as Tx packets' processing is short task.
Convert mlxsw_pci_cq_{rx,tx}_tasklet() to poll methods. Handle 'budget'
argument - ignore it in Tx poll method, as it is recommended to not limit
Tx processing. For Rx processing, handle up to 'budget' completions.
Return 'work_done' which is the amount of completions that were handled.
Handle the following cases:
1. After processing 'budget' completions, the driver still has work to do:
Return work-done = budget. In that case, the NAPI instance will be
polled again (without the need to be rescheduled). Do not re-arm the
queue, as NAPI will handle the reschedule, so we do not have to involve
hardware to send an additional interrupt for the completions that should
be processed.
2. Event processing has been completed:
Call napi_complete_done() to mark NAPI processing as completed, which
means that the poll method will not be rescheduled. Re-arm the queue,
as all completions were handled.
In case that poll method handled exactly 'budget' completions, return
work-done = budget -1, to distinguish from the case that driver still
has completions to handle. Otherwise, return the amount of completions
that were handled.
Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The next patch will set the driver to use NAPI for event processing. Then
tasklet mechanism will be used only for EQ. Reorganize 'mlxsw_pci_queue'
to hold EQ and CQ attributes in a union. For now, add tasklet for both EQ
and CQ. This will be changed in the next patch, as 'tasklet_struct' will be
replaced with NAPI instance.
Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Amit Cohen [Fri, 26 Apr 2024 12:42:24 +0000 (14:42 +0200)]
mlxsw: pci: Initialize dummy net devices for NAPI
mlxsw will use NAPI for event processing in a next patch. As preparation,
add two dummy net devices and initialize them.
NAPI instance should be attached to net device. Usually each queue is used
by a single net device in network drivers, so the mapping between net
device to NAPI instance is intuitive. In our case, Rx queues are not per
port, they are per trap-group. Tx queues are mapped to net devices, but we
do not have a separate queue for each local port, several ports share the
same queue.
Use init_dummy_netdev() to initialize dummy net devices for NAPI.
To run NAPI poll method in a kernel thread, the net device which NAPI
instance is attached to should be marked as 'threaded'. It is
recommended to handle Tx packets in softIRQ context, as usually this is
a short task - just free the Tx packet which has been transmitted.
Rx packets handling is more complicated task, so drivers can use a
dedicated kernel thread to process them. It allows processing packets from
different Rx queues in parallel. We would like to handle only Rx packets in
kernel threads, which means that we will use two dummy net devices
(one for Rx and one for Tx). Set only one of them with 'threaded' as it
will be used for Rx processing. Do not fail in case that setting 'threaded'
fails, as it is better to use regular softIRQ NAPI rather than preventing
the driver from loading.
Note that the net devices are initialized with init_dummy_netdev(), so
they are not registered, which means that they will not be visible to user.
It will not be possible to change 'threaded' configuration from user
space, but it is reasonable in our case, as there is no another
configuration which makes sense, considering that user has no influence
on the usage of each queue.
Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Amit Cohen [Fri, 26 Apr 2024 12:42:23 +0000 (14:42 +0200)]
mlxsw: pci: Ring RDQ and CQ doorbells once per several completions
Currently, for each CQE in CQ, we ring CQ doorbell, then handle RDQ and
ring RDQ doorbell. Finally we ring CQ arm doorbell - once per CQ tasklet.
The idea of ringing CQ doorbell before RDQ doorbell, is to be sure that
when we post new WQE (after RDQ is handled), there is an available CQE.
This was done because of a hardware bug as part of
commit c9ebea04cb1b ("mlxsw: pci: Ring CQ's doorbell before RDQ's").
There is no real reason to ring RDQ and CQ doorbells for each completion,
it is better to handle several completions and reduce number of ringings,
as access to hardware is expensive (time wise) and might take time because
of memory barriers.
A previous patch changed CQ tasklet to handle up to 64 Rx packets. With
this limitation, we can ring CQ and RDQ doorbells once per CQ tasklet.
The counters of the doorbells are increased by the amount of packets
that we handled, then the device will know for which completion to send
an additional event.
To avoid reordering CQ and RDQ doorbells' ring, let the tasklet to ring
also RDQ doorbell, mlxsw_pci_cqe_rdq_handle() handles the counter but
does not ring the doorbell.
Note that with this change there is no need to copy the CQE, as we ring CQ
doorbell only after Rx packet processing (which uses the CQE) is done.
Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Amit Cohen [Fri, 26 Apr 2024 12:42:22 +0000 (14:42 +0200)]
mlxsw: pci: Handle up to 64 Rx completions in tasklet
We can get many completions in one interrupt. Currently, the CQ tasklet
handles up to half queue size completions, and then arms the hardware to
generate additional events, which means that in case that there were
additional completions that we did not handle, we will get immediately an
additional interrupt to handle the rest.
The decision to handle up to half of the queue size is arbitrary and was
determined in 2015, when mlxsw driver was added to the kernel. One
additional fact that should be taken into account is that while WQEs
from RDQ are handled, the CPU that handles the tasklet is dedicated for
this task, which means that we might hold the CPU for a long time.
Handle WQEs in smaller chucks, then arm CQ doorbell to notify the hardware
to send additional notifications. Set the chunk size to 64 as this number
is recommended using NAPI and the driver will use NAPI in a next patch.
Note that for now we use ARM doorbell to retrigger CQ tasklet, but with
NAPI it will be more efficient as software will reschedule the poll
method and we will not involve hardware for that.
Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 26 Apr 2024 10:47:22 +0000 (10:47 +0000)]
ipv6: use call_rcu_hurry() in fib6_info_release()
This is a followup of commit c4e86b4363ac ("net: add two more
call_rcu_hurry()")
fib6_info_destroy_rcu() is calling nexthop_put() or fib6_nh_release()
We must not delay it too much or risk unregister_netdevice/ref_tracker
traces because references to netdev are not released in time.
This should speedup device/netns dismantles when CONFIG_RCU_LAZY=y
Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 26 Apr 2024 06:42:22 +0000 (06:42 +0000)]
net: give more chances to rcu in netdev_wait_allrefs_any()
This came while reviewing commit c4e86b4363ac ("net: add two more
call_rcu_hurry()").
Paolo asked if adding one synchronize_rcu() would help.
While synchronize_rcu() does not help, making sure to call
rcu_barrier() before msleep(wait) is definitely helping
to make sure lazy call_rcu() are completed.
Instead of waiting ~100 seconds in my tests, the ref_tracker
splats occurs one time only, and netdev_wait_allrefs_any()
latency is reduced to the strict minimum.
Ideally we should audit our call_rcu() users to make sure
no refcount (or cascading call_rcu()) is held too long,
because rcu_barrier() is quite expensive.
net: ethernet: ti: am65-cpsw-qos: Add support to taprio for past base_time
If the base-time for taprio is in the past, start the schedule at the time
of the form "base_time + N*cycle_time" where N is the smallest possible
integer such that the above time is in the future.
Signed-off-by: Tanmay Patil <t-patil@ti.com> Signed-off-by: Chintan Vankar <c-vankar@ti.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Fri, 26 Apr 2024 00:31:11 +0000 (17:31 -0700)]
tools: ynl: don't append doc of missing type directly to the type
When using YNL in tests appending the doc string to the type
name makes it harder to check that we got the correct error.
Put the doc under a separate key.
Jakub Kicinski [Thu, 25 Apr 2024 22:23:41 +0000 (15:23 -0700)]
selftests: drv-net: validate the environment
Throw a slightly more helpful exception when env variables
are partially populated. Prior to this change we'd get
a dictionary key exception somewhere later on.
net: dsa: lan9303: use ethtool_puts() for lan9303_get_strings()
This pattern of strncpy with some pointer arithmetic setting fixed-sized
intervals with string literal data is a bit weird so let's use
ethtool_puts() as this has more obvious behavior and is less-error
prone.
Nicely, we also get to drop a usage of the now deprecated strncpy() [1].
Jose E. Marchesi [Fri, 26 Apr 2024 14:51:58 +0000 (16:51 +0200)]
bpf_helpers.h: Define bpf_tail_call_static when building with GCC
The definition of bpf_tail_call_static in tools/lib/bpf/bpf_helpers.h
is guarded by a preprocessor check to assure that clang is recent
enough to support it. This patch updates the guard so the function is
compiled when using GCC 13 or later as well.
====================
Implement reset reason mechanism to detect
From: Jason Xing <kernelxing@tencent.com>
In production, there are so many cases about why the RST skb is sent but
we don't have a very convenient/fast method to detect the exact underlying
reasons.
RST is implemented in two kinds: passive kind (like tcp_v4_send_reset())
and active kind (like tcp_send_active_reset()). The former can be traced
carefully 1) in TCP, with the help of drop reasons, which is based on
Eric's idea[1], 2) in MPTCP, with the help of reset options defined in
RFC 8684. The latter is relatively independent, which should be
implemented on our own, such as active reset reasons which can not be
replace by skb drop reason or something like this.
In this series, I focus on the fundamental implement mostly about how
the rstreason mechanism works and give the detailed passive part as an
example, not including the active reset part. In future, we can go
further and refine those NOT_SPECIFIED reasons.
Here are some examples when tracing:
<idle>-0 [002] ..s1. 1830.262425: tcp_send_reset: skbaddr=x
skaddr=x src=x dest=x state=x reason=NOT_SPECIFIED
<idle>-0 [002] ..s1. 1830.262425: tcp_send_reset: skbaddr=x
skaddr=x src=x dest=x state=x reason=NO_SOCKET
Jason Xing [Thu, 25 Apr 2024 03:13:40 +0000 (11:13 +0800)]
rstreason: make it work in trace world
At last, we should let it work by introducing this reset reason in
trace world.
One of the possible expected outputs is:
... tcp_send_reset: skbaddr=xxx skaddr=xxx src=xxx dest=xxx
state=TCP_ESTABLISHED reason=NOT_SPECIFIED
Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jason Xing [Thu, 25 Apr 2024 03:13:39 +0000 (11:13 +0800)]
mptcp: introducing a helper into active reset logic
Since we have mapped every mptcp reset reason definition in enum
sk_rst_reason, introducing a new helper can cover some missing places
where we have already set the subflow->reset_reason.
Note: using SK_RST_REASON_NOT_SPECIFIED is the same as
SK_RST_REASON_MPTCP_RST_EUNSPEC. They are both unknown. So we can convert
it directly.
Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jason Xing [Thu, 25 Apr 2024 03:13:38 +0000 (11:13 +0800)]
mptcp: support rstreason for passive reset
It relies on what reset options in the skb are as rfc8684 says. Reusing
this logic can save us much energy. This patch replaces most of the prior
NOT_SPECIFIED reasons.
Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jason Xing [Thu, 25 Apr 2024 03:13:37 +0000 (11:13 +0800)]
tcp: support rstreason for passive reset
Reuse the dropreason logic to show the exact reason of tcp reset,
so we can finally display the corresponding item in enum sk_reset_reason
instead of reinventing new reset reasons. This patch replaces all
the prior NOT_SPECIFIED reasons.
Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jason Xing [Thu, 25 Apr 2024 03:13:34 +0000 (11:13 +0800)]
net: introduce rstreason to detect why the RST is sent
Add a new standalone file for the easy future extension to support
both active reset and passive reset in the TCP/DCCP/MPTCP protocols.
This patch only does the preparations for reset reason mechanism,
nothing else changes.
The reset reasons are divided into three parts:
1) reuse drop reasons for passive reset in TCP
2) our own independent reasons which aren't relying on other reasons at all
3) reuse MP_TCPRST option for MPTCP
The benefits of a standalone reset reason are listed here:
1) it can cover more than one case, such as reset reasons in MPTCP,
active reset reasons.
2) people can easily/fastly understand and maintain this mechanism.
3) we get unified format of output with prefix stripped.
4) more new reset reasons are on the way
...
I will implement the basic codes of active/passive reset reason in
those three protocols, which are not complete for this moment. For
passive reset part in TCP, I only introduce the NO_SOCKET common case
which could be set as an example.
After this series applied, it will have the ability to open a new
gate to let other people contribute more reasons into it :)
Signed-off-by: Jason Xing <kernelxing@tencent.com> Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
igc: Add Tx hardware timestamp request for AF_XDP zero-copy packet
This patch adds support to per-packet Tx hardware timestamp request to
AF_XDP zero-copy packet via XDP Tx metadata framework. Please note that
user needs to enable Tx HW timestamp capability via igc_ioctl() with
SIOCSHWTSTAMP cmd before sending xsk Tx hardware timestamp request.
Same as implementation in RX timestamp XDP hints kfunc metadata, Timer 0
(adjustable clock) is used in xsk Tx hardware timestamp. i225/i226 have
four sets of timestamping registers. *skb and *xsk_tx_buffer pointers
are used to indicate whether the timestamping register is already occupied.
Furthermore, a boolean variable named xsk_pending_ts is used to hold the
transmit completion until the tx hardware timestamp is ready. This is
because, for i225/i226, the timestamp notification event comes some time
after the transmit completion event. The driver will retrigger hardware irq
to clean the packet after retrieve the tx hardware timestamp.
Besides, xsk_meta is added into struct igc_tx_timestamp_request as a hook
to the metadata location of the transmit packet. When the Tx timestamp
interrupt is fired, the interrupt handler will copy the value of Tx hwts
into metadata location via xsk_tx_metadata_complete().
This patch is tested with tools/testing/selftests/bpf/xdp_hw_metadata
on Intel ADL-S platform. Below are the test steps and results.
Test Step 1: Run xdp_hw_metadata app
./xdp_hw_metadata <iface> > /dev/shm/result.log
Test Step 3: Run ptp4l and phc2sys for time synchronization
Test Step 4: Generate UDP packets with 1ms interval for 10s
trafgen --dev <iface> '{eth(da=<addr>), udp(dp=9091)}' -t 1ms -n 10000
Test Step 5: Rerun Step 1-3 with 10s iperf3 as background traffic
Test Step 6: Rerun Step 1-4 with 10s iperf3 as background traffic
Based on iperf3 results below, the impact of holding tx completion to
throughput is not observable.
Result of last UDP packet (no. 10000) in Step 4:
poll: 1 (0) skip=99 fail=0 redir=10000
xsk_ring_cons__peek: 1
0x5640a37972d0: rx_desc[9999]->addr=f2110 addr=f2110 comp_addr=f2110 EoP
rx_hash: 0x2049BE1D with RSS type:0x1
HW RX-time: 1679819246792971268 (sec:1679819246.7930) delta to User RX-time sec:0.0000 (14.990 usec)
XDP RX-time: 1679819246792981987 (sec:1679819246.7930) delta to User RX-time sec:0.0000 (4.271 usec)
No rx_vlan_tci or rx_vlan_proto, err=-95
0x5640a37972d0: ping-pong with csum=ab19 (want 315b) csum_start=34 csum_offset=6
0x5640a37972d0: complete tx idx=9999 addr=f010
HW TX-complete-time: 1679819246793036971 (sec:1679819246.7930) delta to User TX-complete-time sec:0.0001 (77.656 usec)
XDP RX-time: 1679819246792981987 (sec:1679819246.7930) delta to User TX-complete-time sec:0.0001 (132.640 usec)
HW RX-time: 1679819246792971268 (sec:1679819246.7930) delta to HW TX-complete-time sec:0.0001 (65.703 usec)
0x5640a37972d0: complete rx idx=10127 addr=f2110
Result of iperf3 without tx hwts request in step 5:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 2.74 GBytes 2.36 Gbits/sec 0 sender
[ 5] 0.00-10.05 sec 2.74 GBytes 2.34 Gbits/sec receiver
Result of iperf3 running parallel with trafgen command in step 6:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 2.74 GBytes 2.36 Gbits/sec 0 sender
[ 5] 0.00-10.04 sec 2.74 GBytes 2.34 Gbits/sec receiver
Co-developed-by: Lai Peter Jun Ann <jun.ann.lai@intel.com> Signed-off-by: Lai Peter Jun Ann <jun.ann.lai@intel.com> Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Tested-by: Naama Meir <naamax.meir@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://lore.kernel.org/r/20240424210256.3440903-1-anthony.l.nguyen@intel.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This patchset aims at introducing very basic initial infrastructure
for virtio_net testing, namely it focuses on virtio feature testing.
The first patch adds support for debugfs for virtio devices, allowing
user to filter features to pretend to be driver that is not capable
of the filtered feature.
Leverage that in the last patch that lays ground for virtio_net
selftests testing, including very basic F_MAC feature test.
To run this, do:
$ make -C tools/testing/selftests/ TARGETS=drivers/net/virtio_net/ run_tests
It is assumed, as with lot of other selftests in the net group,
that there are netdevices connected back-to-back. In this case,
two virtio_net devices connected back to back. If you use "tap" qemu
netdevice type, to configure this loop on a hypervisor, one may use
this script:
DEV1="$1"
DEV2="$2"
sudo tc qdisc add dev $DEV1 clsact
sudo tc qdisc add dev $DEV2 clsact
sudo tc filter add dev $DEV1 ingress protocol all pref 1 matchall action mirred egress redirect dev $DEV2
sudo tc filter add dev $DEV2 ingress protocol all pref 1 matchall action mirred egress redirect dev $DEV1
sudo ip link set $DEV1 up
sudo ip link set $DEV2 up
Another possibility is to use virtme-ng like this:
$ vng --network=loop
or directly:
$ vng --network=loop -- make -C tools/testing/selftests/ TARGETS=drivers/net/virtio_net/ run_tests
"loop" network type will take care of creating two "hubport" qemu netdevs
putting them into a single hub.
To do it manually with qemu, pass following command line options:
-nic hubport,hubid=1,id=nd0,model=virtio-net-pci
-nic hubport,hubid=1,id=nd1,model=virtio-net-pci
====================
The existing setup_wait*() helper family check the status of the
interface to be up. Introduce wait_for_dev() to wait for the netdevice
to appear, for example after test script does manual device bind.
Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Benjamin Poirier <bpoirier@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
selftests: forwarding: add ability to assemble NETIFS array by driver name
Allow driver tests to work without specifying the netdevice names.
Introduce a possibility to search for available netdevices according to
set driver name. Allow test to specify the name by setting
NETIF_FIND_DRIVER variable.
Note that user overrides this either by passing netdevice names on the
command line or by declaring NETIFS array in custom forwarding.config
configuration file.
Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Benjamin Poirier <bpoirier@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
virtio: add debugfs infrastructure to allow to debug virtio features
Currently there is no way for user to set what features the driver
should obey or not, it is hard wired in the code.
In order to be able to debug the device behavior in case some feature is
disabled, introduce a debugfs infrastructure with couple of files
allowing user to see what features the device advertises and
to set filter for features used by driver.
test: hsr: Extract version agnostic information from ping command output
Current code checks if ping command output match hardcoded pattern:
"10 packets transmitted, 10 received, 0% packet loss,".
Such approach will work only from one ping program version (for which
this test has been originally written).
This patch address problem when ping with different summary output
like "10 packets transmitted, 10 packets received, 0% packet" is
used to run this test - for example one from busybox (as the test
system runs in QEMU with rootfs created with buildroot).
The fix is to modify output of ping command to be agnostic to ping
version used on the platform.
Signed-off-by: Lukasz Majewski <lukma@denx.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Some of the code already present in the hsr_ping.sh test program can be
moved to a separate script file, so it can be reused by other HSR
functionality (like HSR-SAN) tests.
Signed-off-by: Lukasz Majewski <lukma@denx.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Introduce RedBox support (HSR-SAN to be more precise) for HSR networks.
Following traffic reduction optimizations have been implemented:
- Do not send HSR supervisory frames to Port C (interlink)
- Do not forward to HSR ring frames addressed to Port C
- Do not forward to Port C frames from HSR ring
- Do not send duplicate HSR frame to HSR ring when destination is Port C
The corresponding patch to modify iptable2 sources has already been sent:
https://lore.kernel.org/netdev/20240308145729.490863-1-lukma@denx.de/T/
Testing procedure (veth and netns):
-----------------------------------
One shall run:
linux-vanila/tools/testing/selftests/net/hsr/hsr_redbox.sh
(Detailed description of the setup one can find in the test
script file).
Testing procedure (real hardware):
----------------------------------
The EVB-KSZ9477 has been used for testing on net-next branch
(SHA1: 5fc68320c1fb3c7d456ddcae0b4757326a043e6f).
Ports 4/5 were used for SW managed HSR (hsr1) as first hsr0 for ports 1/2
(with HW offloading for ksz9477) was created. Port 3 has been used as
interlink port (single USB-ETH dongle).
Configuration - RedBox (EVB-KSZ9477):
if link set lan1 down;ip link set lan2 down
ip link add name hsr0 type hsr slave1 lan1 slave2 lan2 supervision 45 version 1
ip link add name hsr1 type hsr slave1 lan4 slave2 lan5 interlink lan3 supervision 45 version 1
ip link set lan4 up;ip link set lan5 up
ip link set lan3 up
ip addr add 192.168.0.11/24 dev hsr1
ip link set hsr1 up
Configuration - DAN-H (EVB-KSZ9477):
ip link set lan1 down;ip link set lan2 down
ip link add name hsr0 type hsr slave1 lan1 slave2 lan2 supervision 45 version 1
ip link add name hsr1 type hsr slave1 lan4 slave2 lan5 supervision 45 version 1
ip link set lan4 up;ip link set lan5 up
ip addr add 192.168.0.12/24 dev hsr1
ip link set hsr1 up
This approach uses only SW based HSR devices (hsr1).
This happens when TC does a mirred egress redirect from the root qdisc of
device A to the root qdisc of device B. As long as these two locks aren't
protecting the same qdisc, they can be acquired in chain: add a per-qdisc
lockdep key to silence false warnings.
This dynamic key should safely replace the static key we have in sch_htb:
it was added to allow enqueueing to the device "direct qdisc" while still
holding the qdisc root lock.
v2: don't use static keys anymore in HTB direct qdiscs (thanks Eric Dumazet)
Jakub Kicinski [Fri, 26 Apr 2024 03:00:54 +0000 (20:00 -0700)]
Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
net: intel: start The Great Code Dedup + Page Pool for iavf
Alexander Lobakin says:
Here's a two-shot: introduce {,Intel} Ethernet common library (libeth and
libie) and switch iavf to Page Pool. Details are in the commit messages;
here's a summary:
Not a secret there's a ton of code duplication between two and more Intel
ethernet modules. Before introducing new changes, which would need to be
copied over again, start decoupling the already existing duplicate
functionality into a new module, which will be shared between several
Intel Ethernet drivers. The first name that came to my mind was
"libie" -- "Intel Ethernet common library". Also this sounds like
"lovelie" (-> one word, no "lib I E" pls) and can be expanded as
"lib Internet Explorer" :P
The "generic", pure-software part is placed separately, so that it can be
easily reused in any driver by any vendor without linking to the Intel
pre-200G guts. In a few words, it's something any modern driver does the
same way, but nobody moved it level up (yet).
The series is only the beginning. From now on, adding every new feature
or doing any good driver refactoring will remove much more lines than add
for quite some time. There's a basic roadmap with some deduplications
planned already, not speaking of that touching every line now asks:
"can I share this?". The final destination is very ambitious: have only
one unified driver for at least i40e, ice, iavf, and idpf with a struct
ops for each generation. That's never gonna happen, right? But you still
can at least try.
PP conversion for iavf lands within the same series as these two are tied
closely. libie will support Page Pool model only, so that a driver can't
use much of the lib until it's converted. iavf is only the example, the
rest will eventually be converted soon on a per-driver basis. That is
when it gets really interesting. Stay tech.
* '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
MAINTAINERS: add entry for libeth and libie
iavf: switch to Page Pool
iavf: pack iavf_ring more efficiently
libeth: add Rx buffer management
page_pool: add DMA-sync-for-CPU inline helper
page_pool: constify some read-only function arguments
slab: introduce kvmalloc_array_node() and kvcalloc_node()
iavf: drop page splitting and recycling
iavf: kill "legacy-rx" for good
net: intel: introduce {, Intel} Ethernet common library
====================
====================
net: lan966x: flower: validate control flags
This series adds flower control flags validation to the
lan966x driver, and changes it from assuming that it handles
all control flags, to instead reject rules if they have
masked any unknown/unsupported control flags.
====================
net: sparx5: flower: validate control flags
This series adds flower control flags validation to the
sparx5 driver, and changes it from assuming that it handles
all control flags, to instead reject rules if they have
masked any unknown/unsupported control flags. Reviewed-by: Daniel Machon <daniel.machon@microchip.com> Tested-by: Daniel Machon <daniel.machon@microchip.com>
v1: https://lore.kernel.org/netdev/20240423102728.228765-1-ast@fiberby.net/
====================
net: sparx5: flower: only do lookup if fragment flags are set
The fragment lookup should only be performed, when
at least one of the fragment flags are set.
This change was deliberately not included in commit 68aba00483c7 ("net: sparx5: flower: fix fragment flags handling")
as it's only needed for future proffing the code, since
"mask" is currently only set in conjunction with the
fragment flags.
(The 3rd flag FLOW_DIS_ENCAPSULATION is only used with "key")
Only compile tested.
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Reviewed-by: Daniel Machon <daniel.machon@microchip.com> Tested-by: Daniel Machon <daniel.machon@microchip.com> Link: https://lore.kernel.org/r/20240424121632.459022-2-ast@fiberby.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Embedding net_device into structures prohibits the usage of flexible
arrays in the net_device structure. For more details, see the discussion
at [1].
Un-embed the net_device from the private struct by converting it
into a pointer. Then use the leverage the new alloc_netdev_dummy()
helper to allocate and initialize dummy devices.
Hayes Wang [Wed, 24 Apr 2024 08:45:32 +0000 (16:45 +0800)]
r8152: replace dev_info with dev_dbg for loading firmware
Someone complains the message appears continuously. This occurs
because the device is woken from UPS mode, and the driver re-loads
the firmware.
When the device enters runtime suspend and cable is unplugged, the
device would enter UPS mode. If the runtime resume occurs, and the
device is woken from UPS mode, the driver has to re-load the firmware
and causes the message. If someone wakes the device continuously, the
message would be shown continuously, too. Use dev_dbg to avoid it.
Note that, the function could be called before register_netdev(), so I
don't use netif_info() or netif_dbg().
Daniel Golle [Tue, 23 Apr 2024 09:00:25 +0000 (11:00 +0200)]
net: sfp: add quirk for ATS SFP-GE-T 1000Base-TX module
Add quirk for ATS SFP-GE-T 1000Base-TX module.
This copper module comes with broken TX_FAULT indicator which must be
ignored for it to work.
Co-authored-by: Josef Schlehofer <pepe.schlehofer@gmail.com> Signed-off-by: Daniel Golle <daniel@makrotopia.org>
[ rebased on top of net-next ] Signed-off-by: Marek Behún <kabel@kernel.org> Link: https://lore.kernel.org/r/20240423090025.29231-1-kabel@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Marek Behún [Tue, 23 Apr 2024 08:50:39 +0000 (10:50 +0200)]
net: sfp: enhance quirk for Fibrestore 2.5G copper SFP module
Enhance the quirk for Fibrestore 2.5G copper SFP module. The original
commit e27aca3760c0 ("net: sfp: add quirk for FS's 2.5G copper SFP")
introducing the quirk says that the PHY is inaccessible, but that is
not true.
The module uses Rollball protocol to talk to the PHY, and needs a 4
second wait before probing it, same as FS 10G module.
The PHY inside the module is Realtek RTL8221B-VB-CG PHY. The realtek
driver recently gained support to set it up via clause 45 accesses.
Marek Behún [Tue, 23 Apr 2024 08:50:38 +0000 (10:50 +0200)]
net: sfp: update comment for FS SFP-10G-T quirk
Update the comment for the Fibrestore SFP-10G-T module: since commit e9301af385e7 ("net: sfp: fix PHY discovery for FS SFP-10G-T module")
we also do a 4 second wait before probing the PHY.
Fixes: e9301af385e7 ("net: sfp: fix PHY discovery for FS SFP-10G-T module") Signed-off-by: Marek Behún <kabel@kernel.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20240423085039.26957-1-kabel@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240423205408.39632-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Philo Lu [Thu, 25 Apr 2024 16:17:23 +0000 (00:17 +0800)]
bpf: add mrtt and srtt as BPF_SOCK_OPS_RTT_CB args
Two important arguments in RTT estimation, mrtt and srtt, are passed to
tcp_bpf_rtt(), so that bpf programs get more information about RTT
computation in BPF_SOCK_OPS_RTT_CB.
The difference between bpf_sock_ops->srtt_us and the srtt here is: the
former is an old rtt before update, while srtt passed by tcp_bpf_rtt()
is that after update.
====================
check bpf_dummy_struct_ops program params for test runs
When doing BPF_PROG_TEST_RUN for bpf_dummy_struct_ops programs,
execution should be rejected when NULL is passed for non-nullable
params, because for such params verifier assumes that such params are
never NULL and thus might optimize out NULL checks.
This problem was reported by Jose E. Marchesi in off-list discussion.
The code generated by GCC for dummy_st_ops_success/test_1() function
differs from LLVM variant in a way that allows verifier to remove the
NULL check. The test dummy_st_ops/dummy_init_ret_value actually sets
the 'state' parameter to NULL, thus GCC-generated version of the test
triggers NULL pointer dereference when BPF program is executed.
This patch-set addresses the issue in the following steps:
- patch #1 marks bpf_dummy_struct_ops.test_1 parameter as nullable,
for verifier to have correct assumptions about test_1() programs;
- patch #2 modifies dummy_st_ops/dummy_init_ret_value to trigger NULL
dereference with both GCC and LLVM (if patch #1 is not applied);
- patch #3 adjusts a few dummy_st_ops test cases to avoid passing NULL
for 'state' parameter of test_2() and test_sleepable() functions,
as parameters of these functions are not marked as nullable;
- patch #4 adjusts bpf_dummy_struct_ops to reject test execution of
programs if NULL is passed for non-nullable parameter;
- patch #5 adds a test to verify logic from patch #4.
====================
Eduard Zingerman [Wed, 24 Apr 2024 01:28:20 +0000 (18:28 -0700)]
bpf: check bpf_dummy_struct_ops program params for test runs
When doing BPF_PROG_TEST_RUN for bpf_dummy_struct_ops programs,
reject execution when NULL is passed for non-nullable params.
For programs with non-nullable params verifier assumes that
such params are never NULL and thus might optimize out NULL checks.
Eduard Zingerman [Wed, 24 Apr 2024 01:28:19 +0000 (18:28 -0700)]
selftests/bpf: do not pass NULL for non-nullable params in dummy_st_ops
dummy_st_ops.test_2 and dummy_st_ops.test_sleepable do not have their
'state' parameter marked as nullable. Update dummy_st_ops.c to avoid
passing NULL for such parameters, as the next patch would allow kernel
to enforce this restriction.
If the 'state' argument is not marked as nullable in
net/bpf/bpf_dummy_struct_ops.c, the verifier would assume that
'r1 == 0x0' is never true:
- for the GCC version, this means that instructions #5-6 would be
marked as dead and removed;
- for the LLVM version, all instructions would be marked as live.
The test dummy_st_ops/dummy_init_ret_value actually sets the 'state'
parameter to NULL.
Therefore, when the 'state' argument is not marked as nullable,
the GCC-generated version of the code would trigger a NULL pointer
dereference at instruction #3.
This patch updates the test_1() test case to always follow a shape
similar to the GCC-generated version above, in order to verify whether
the 'state' nullability is marked correctly.
Eduard Zingerman [Wed, 24 Apr 2024 01:28:17 +0000 (18:28 -0700)]
bpf: mark bpf_dummy_struct_ops.test_1 parameter as nullable
Test case dummy_st_ops/dummy_init_ret_value passes NULL as the first
parameter of the test_1() function. Mark this parameter as nullable to
make verifier aware of such possibility.
Otherwise, NULL check in the test_1() code:
SEC("struct_ops/test_1")
int BPF_PROG(test_1, struct bpf_dummy_ops_state *state)
{
if (!state)
return ...;
... access state ...
}
Might be removed by verifier, thus triggering NULL pointer dereference
under certain conditions.
net/unix/garbage.c 1971d13ffa84 ("af_unix: Suppress false-positive lockdep splat for spin_lock() in __unix_gc().") 4090fa373f0e ("af_unix: Replace garbage collection algorithm.")
drivers/net/ethernet/ti/icssg/icssg_prueth.c
drivers/net/ethernet/ti/icssg/icssg_common.c 4dcd0e83ea1d ("net: ti: icssg-prueth: Fix signedness bug in prueth_init_rx_chns()") e2dc7bfd677f ("net: ti: icssg-prueth: Move common functions into a separate file")
Jakub Kicinski [Thu, 25 Apr 2024 18:49:35 +0000 (11:49 -0700)]
Merge tag 'wireless-next-2024-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Kalle Valo says:
====================
wireless-next patches for v6.10
The second "new features" pull request for v6.10 with changes both in
stack and in drivers. This time the pull request is rather small and
nothing special standing out except maybe that we have several
kernel-doc fixes. Great to see that we are getting warning free
wireless code (until new warnings are added).
Major changes:
rtl8xxxu:
* enable Management Frame Protection (MFP) support
rtw88:
* disable unsupported interface type of mesh point for all chips, and only
support station mode for SDIO chips.
* tag 'wireless-next-2024-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (63 commits)
wifi: mac80211: handle link ID during management Tx
wifi: mac80211: handle sdata->u.ap.active flag with MLO
wifi: cfg80211: add return docs for regulatory functions
wifi: cfg80211: make some regulatory functions void
wifi: mac80211: add return docs for sta_info_flush()
wifi: mac80211: keep mac80211 consistent on link activation failure
wifi: mac80211: simplify ieee80211_assign_link_chanctx()
wifi: mac80211: reserve chanctx during find
wifi: cfg80211: fix cfg80211 function kernel-doc
wifi: mac80211_hwsim: Use wider regulatory for custom for 6GHz tests
wifi: iwlwifi: mvm: Don't allow EMLSR when the RSSI is low
wifi: iwlwifi: mvm: disable EMLSR when we suspend with wowlan
wifi: iwlwifi: mvm: get periodic statistics in EMLSR
wifi: iwlwifi: mvm: don't recompute EMLSR mode in can_activate_links
wifi: iwlwifi: mvm: implement EMLSR prevention mechanism.
wifi: iwlwifi: mvm: exit EMLSR upon missed beacon
wifi: iwlwifi: mvm: init vif works only once
wifi: iwlwifi: mvm: Add helper functions to update EMLSR status
wifi: iwlwifi: mvm: Implement new link selection algorithm
wifi: iwlwifi: mvm: move EMLSR/links code
...
====================
b53 is now the only remaining driver that uses both PHYLIB's adjust_link
and PHYLINK's mac_ops callbacks, convert entirely to PHYLINK.
====================
net: dsa: b53: Call b53_eee_init() from b53_mac_link_up()
And make sure this is done for the MLO_AN_PHY case, where it actually
makes sense, contrary to b53_adjust_link() which only did it for
fixed-PHY configurations where it does not make sense.
net: dsa: b53: Configure RGMII for 531x5 and MII for 5325
Call b53_adjust_531x5_rgmii() and b53_adjust_5325_mii() from
b53_phylink_mac_config() when we have a fixed PHY in preparation for removing
b53_adjust_link(). Also move b53_adjust_63xx_rgmii() to
b53_phylink_mac_config() where it logically belongs.
net: dsa: b53: Force flow control for BCM5301X CPU port(s)
Just like what b53_adjust_link() does, force flow control for the
BCM5301X CPU port(s) by forcing rx_pause and tx_pause in
b53_phylink_mac_link_up(). Preparatory step for getting rid of
b53_adjust_link().
Takes care of doing the 5325 switch series specific MII programming and
is called from b53_adjust_link() to allow the future removal of
b53_adjust_link().
Takes care of doing the 531x5 switch series specific RGMII programming
and is called from b53_adjust_link() to allow the future removal of
b53_adjust_link().
Andrea Righi [Thu, 25 Apr 2024 14:06:27 +0000 (16:06 +0200)]
selftests/bpf: Add ring_buffer__consume_n test.
Add a testcase for the ring_buffer__consume_n() API.
The test produces multiple samples in a ring buffer, using a
sys_getpid() fentry prog, and consumes them from user-space in batches,
rather than consuming all of them greedily, like ring_buffer__consume()
does.
Merge tag 'net-6.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from netfilter, wireless and bluetooth.
Nothing major, regression fixes are mostly in drivers, two more of
those are flowing towards us thru various trees. I wish some of the
changes went into -rc5, we'll try to keep an eye on frequency of PRs
from sub-trees.
Also disproportional number of fixes for bugs added in v6.4, strange
coincidence.
Current release - regressions:
- igc: fix LED-related deadlock on driver unbind
- wifi: mac80211: small fixes to recent clean up of the connection
process
- Revert "wifi: iwlwifi: bump FW API to 90 for BZ/SC devices", kernel
doesn't have all the code to deal with that version, yet
- Bluetooth:
- set power_ctrl_enabled on NULL returned by gpiod_get_optional()
- qca: fix invalid device address check, again
- Bluetooth:
- lots of fixes for the command submission rework from v6.4
- qca: fix NULL-deref on non-serdev suspend
Misc:
- tools: ynl: don't ignore errors in NLMSG_DONE messages"
* tag 'net-6.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits)
af_unix: Suppress false-positive lockdep splat for spin_lock() in __unix_gc().
net: b44: set pause params only when interface is up
tls: fix lockless read of strp->msg_ready in ->poll
dpll: fix dpll_pin_on_pin_register() for multiple parent pins
net: ravb: Fix registered interrupt names
octeontx2-af: fix the double free in rvu_npc_freemem()
net: ethernet: ti: am65-cpts: Fix PTPv1 message type on TX packets
ice: fix LAG and VF lock dependency in ice_reset_vf()
iavf: Fix TC config comparison with existing adapter TC config
i40e: Report MFS in decimal base instead of hex
i40e: Do not use WQ_MEM_RECLAIM flag for workqueue
net: ti: icssg-prueth: Fix signedness bug in prueth_init_rx_chns()
net/mlx5e: Advertise mlx5 ethernet driver updates sk_buff md_dst for MACsec
macsec: Detect if Rx skb is macsec-related for offloading devices that update md_dst
ethernet: Add helper for assigning packet type when dest address does not match device address
macsec: Enable devices to advertise whether they update sk_buff md_dst during offloads
net: phy: dp83869: Fix MII mode failure
netfilter: nf_tables: honor table dormant flag from netdev release event path
eth: bnxt: fix counting packets discarded due to OOM and netpoll
igc: Fix LED-related deadlock on driver unbind
...