net: dsa: mt7530: rename p5_intf_sel and use only for MT7530 switch
The p5_intf_sel pointer is used to store the information of whether PHY
muxing is used or not. PHY muxing is a feature specific to port 5 of the
MT7530 switch. Do not use it for other switch models.
Rename the pointer to p5_mode to store the mode the port is being used in.
Rename the p5_interface_select enum to mt7530_p5_mode, the string
representation to mt7530_p5_mode_str, and the enum elements.
If PHY muxing is not detected, the default mode, GMAC5, will be used.
Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The MT7530_PMCR_P() registers are on MT7530, MT7531, and the switch on the
MT7988 SoC. Rename the definition for them to MT753X_PMCR_P(). Bit 15 is
for MT7530 only. Add MT7530 prefix to the definition for bit 15.
Use GENMASK and FIELD_PREP for PMCR_IFG_XMIT().
Rename PMCR_TX_EN and PMCR_RX_EN to PMCR_MAC_TX_EN and PMCR_MAC_TX_EN to
follow the naming on the "MT7621 Giga Switch Programming Guide v0.3",
"MT7531 Reference Manual for Development Board v1.0", and "MT7988A Wi-Fi 7
Generation Router Platform: Datasheet (Open Version) v0.1" documents.
These documents show that PMCR_RX_FC_EN is at bit 5. Correct this along
with renaming it to PMCR_FORCE_RX_FC_EN, and the same for PMCR_TX_FC_EN.
Remove PMCR_SPEED_MASK which doesn't have a use.
Rename the force mode definitions for MT7531 to FORCE_MODE. Add MASK at the
end for the mask that includes all force mode definitions.
Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: dsa: mt7530: disable EEE abilities on failure on MT7531 and MT7988
The MT7531_FORCE_EEE1G and MT7531_FORCE_EEE100 bits let the
PMCR_FORCE_EEE1G and PMCR_FORCE_EEE100 bits determine the 1G/100 EEE
abilities of the MAC. If MT7531_FORCE_EEE1G and MT7531_FORCE_EEE100 are
unset, the abilities are left to be determined by PHY auto polling.
The commit 40b5d2f15c09 ("net: dsa: mt7530: Add support for EEE features")
made it so that the PMCR_FORCE_EEE1G and PMCR_FORCE_EEE100 bits are set on
mt753x_phylink_mac_link_up(). But it did not set the MT7531_FORCE_EEE1G and
MT7531_FORCE_EEE100 bits. Because of this, the EEE abilities will be
determined by PHY auto polling, regardless of the result of phy_init_eee().
Define these bits and add them to the MT7531_FORCE_MODE mask which is set
in mt7531_setup_common(). With this, there won't be any EEE abilities set
when phy_init_eee() returns a negative value.
Thanks to Russell for explaining when phy_init_eee() could return a
negative value below.
Looking at phy_init_eee(), it could return a negative value when:
1. phydev->drv is NULL
2. if genphy_c45_eee_is_active() returns negative
3. if genphy_c45_eee_is_active() returns zero, it returns -EPROTONOSUPPORT
4. if phy_set_bits_mmd() fails (e.g. communication error with the PHY)
If we then look at genphy_c45_eee_is_active(), then:
genphy_c45_read_eee_adv() and genphy_c45_read_eee_lpa() propagate their
non-zero return values, otherwise this function returns zero or positive
integer.
If we then look at genphy_c45_read_eee_adv(), then a failure of
phy_read_mmd() would cause a negative value to be returned.
Looking at genphy_c45_read_eee_lpa(), the same is true.
So, it can be summarised as:
- phydev->drv is NULL
- there is a communication error accessing the PHY
- EEE is not active
otherwise, it returns zero on success.
If one wishes to determine whether an error occurred vs EEE not being
supported through negotiation for the negotiated speed, if it returns
-EPROTONOSUPPORT in the latter case. Other error codes mean either the
driver has been unloaded or communication error.
In conclusion, determining the EEE abilities by PHY auto polling shouldn't
result in having any EEE abilities enabled, when one of the last two
situations in the summary happens. And it seems that if phydev->drv is
NULL, there would be bigger problems with the device than a broken link. So
this is not a bugfix.
Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sun, 21 Apr 2024 18:57:53 +0000 (18:57 +0000)]
neighbour: fix neigh_master_filtered()
If we no longer hold RTNL, we must use netdev_master_upper_dev_get_rcu()
instead of netdev_master_upper_dev_get().
Fixes: ba0f78069423 ("neighbour: no longer hold RTNL in neigh_dump_info()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20240421185753.1808077-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
selftests: drv-net: support testing with a remote system
Implement support for tests which require access to a remote system /
endpoint which can generate traffic.
This series concludes the "groundwork" for upstream driver tests.
I wanted to support the three models which came up in discussions:
- SW testing with netdevsim
- "local" testing with two ports on the same system in a loopback
- "remote" testing via SSH
so there is a tiny bit of an abstraction which wraps up how "remote"
commands are executed. Otherwise hopefully there's nothing surprising.
I'm only adding a ping test. I had a bigger one written but I was
worried we'll get into discussing the details of the test itself
and how I chose to hack up netdevsim, instead of the test infra...
So that test will be a follow up :)
Jakub Kicinski [Sat, 20 Apr 2024 02:52:36 +0000 (19:52 -0700)]
selftests: drv-net: add a TCP ping test case (and useful helpers)
More complex tests often have to spawn a background process,
like a server which will respond to requests or tcpdump.
Add support for creating such processes using the with keyword:
with bkg("my-daemon", ..):
# my-daemon is alive in this block
My initial thought was to add this support to cmd() directly
but it runs the command in the constructor, so by the time
we __enter__ it's too late to make sure we used "background=True".
Second useful helper transplanted from net_helper.sh is
wait_port_listen().
The test itself uses socat, which insists on v6 addresses
being wrapped in [], it's not the only command which requires
this format, so add the wrapped address to env. The hope
is to save test code from checking if address is v6.
Jakub Kicinski [Sat, 20 Apr 2024 02:52:35 +0000 (19:52 -0700)]
selftests: net: support matching cases by name prefix
While writing tests with a lot more cases I got tired of having
to jump back and forth to add the name of the test to the ksft_run()
list. Most unittest frameworks do some name matching, e.g. assume
that functions with names starting with test_ are test cases.
Support similar flow in ksft_run(). Let the author list the desired
prefixes. globals() need to be passed explicitly, IDK how to work
around that.
Jakub Kicinski [Sat, 20 Apr 2024 02:52:34 +0000 (19:52 -0700)]
selftests: drv-net: add a trivial ping test
Add a very simple test for testing with a remote system.
Both IPv4 and IPv6 connectivity is optional, later change
will add checks to skip tests based on available addresses.
Using netdevsim:
$ ./run_kselftest.sh -t drivers/net:ping.py
TAP version 13
1..1
# timeout set to 45
# selftests: drivers/net: ping.py
# KTAP version 1
# 1..2
# ok 1 ping.test_v4
# ok 2 ping.test_v6
# # Totals: pass:2 fail:0 xfail:0 xpass:0 skip:0 error:0
ok 1 selftests: drivers/net: ping.py
Command line SSH:
$ NETIF=virbr0 REMOTE_TYPE=ssh REMOTE_ARGS=root@192.168.122.123 \
LOCAL_V4=192.168.122.1 REMOTE_V4=192.168.122.123 \
./tools/testing/selftests/drivers/net/ping.py
KTAP version 1
1..2
ok 1 ping.test_v4
ok 2 ping.test_v6 # SKIP Test requires IPv6 connectivity
# Totals: pass:1 fail:0 xfail:1 xpass:0 skip:0 error:0
Existing devices placed in netns (and using net.config):
Jakub Kicinski [Sat, 20 Apr 2024 02:52:31 +0000 (19:52 -0700)]
selftests: drv-net: define endpoint structures
Define the remote endpoint "model". To execute most meaningful device
driver tests we need to be able to communicate with a remote system,
and have it send traffic to the device under test.
Various test environments will have different requirements.
0) "Local" netdevsim-based testing can simply use net namespaces.
netdevsim supports connecting two devices now, to form a veth-like
construct.
1) Similarly on hosts with multiple NICs, the NICs may be connected
together with a loopback cable or internal device loopback.
One interface may be placed into separate netns, and tests
would proceed much like in the netdevsim case. Note that
the loopback config or the moving of one interface
into a netns is not expected to be part of selftest code.
2) Some systems may need to communicate with the remote endpoint
via SSH.
3) Last but not least environment may have its own custom communication
method.
Fundamentally we only need two operations:
- run a command remotely
- deploy a binary (if some tool we need is built as part of kselftests)
Wrap these two in a class. Use dynamic loading to load the Remote
class. This will allow very easy definition of other communication
methods without bothering upstream code base.
Stick to the "simple" / "no unnecessary abstractions" model for
referring to the remote endpoints. The host / remote object are
passed as an argument to the usual cmd() or ip() invocation.
For example:
====================
netdev: support dumping a single netdev in qstats
I was writing a test for page pool which depended on qstats,
and got tired of having to filter dumps in user space.
Add support for dumping stats for a single netdev.
To get there we first need to add full support for extack
in dumps (and fix a dump error handling bug in YNL, sent
separately to the net tree).
====================
Jakub Kicinski [Sat, 20 Apr 2024 02:35:41 +0000 (19:35 -0700)]
netlink: support all extack types in dumps
Note that when this commit message refers to netlink dump
it only means the actual dumping part, the parsing / dump
start is handled by the same code as "doit".
Commit 4a19edb60d02 ("netlink: Pass extack to dump handlers")
added support for returning extack messages from dump handlers,
but left out other extack info, e.g. bad attribute.
This used to be fine because until YNL we had little practical
use for the machine readable attributes, and only messages were
used in practice.
YNL flips the preference 180 degrees, it's now much more useful
to point to a bad attr with NL_SET_BAD_ATTR() than type
an English message saying "attribute XYZ is $reason-why-bad".
Support all of extack. The fact that extack only gets added if
it fits remains unaddressed.
Here, two skb are created, and every unix_edge->successor is the first
socket. Then, __unix_gc() will garbage-collect the two skb:
(a) free skb with self-referencing fd
(b) free skb holding other sockets
After (a), the self-referencing socket will be scheduled to be freed
later by the delayed_fput() task.
syzbot repeated the sequences above (1. ~ 3.) quickly and triggered
the task concurrently while GC was running.
So, at (b), the socket was already freed, and accessing it was illegal.
unix_del_edges() accesses the receiver socket as edge->successor to
optimise GC. However, we should not do it during GC.
Garbage-collecting sockets does not change the shape of the rest
of the graph, so we need not call unix_update_graph() to update
unix_graph_grouped when we purge skb.
However, if we clean up all loops in the unix_walk_scc_fast() path,
unix_graph_maybe_cyclic remains unchanged (true), and __unix_gc()
will call unix_walk_scc_fast() continuously even though there is no
socket to garbage-collect.
To keep that optimisation while fixing UAF, let's add the same
updating logic of unix_graph_maybe_cyclic in unix_walk_scc_fast()
as done in unix_walk_scc() and __unix_walk_scc().
Note that when unix_del_edges() is called from other places, the
receiver socket is always alive:
- sendmsg: the successor's sk_refcnt is bumped by sock_hold()
unix_find_other() for SOCK_DGRAM, connect() for SOCK_STREAM
- recvmsg: the successor is the receiver, and its fd is alive
[0]:
BUG: KASAN: slab-use-after-free in unix_edge_successor net/unix/garbage.c:109 [inline]
BUG: KASAN: slab-use-after-free in unix_del_edge net/unix/garbage.c:165 [inline]
BUG: KASAN: slab-use-after-free in unix_del_edges+0x148/0x630 net/unix/garbage.c:237
Read of size 8 at addr ffff888079c6e640 by task kworker/u8:6/1099
The buggy address belongs to the object at ffff888079c6e000
which belongs to the cache UNIX of size 1920
The buggy address is located 1600 bytes inside of
freed 1920-byte region [ffff888079c6e000, ffff888079c6e780)
Reported-by: syzbot+f3f3eef1d2100200e593@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=f3f3eef1d2100200e593 Fixes: 77e5593aebba ("af_unix: Skip GC if no cycle exists.") Fixes: fd86344823b5 ("af_unix: Try not to hold unix_gc_lock during accept().") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240419235102.31707-1-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This series contains a mix of cleanups, some dating back to
December, 2022. Version 1 was based on an older version of
net-next/main; this version has simply been rebased.
The first two make it so the IPA SUSPEND interrupt only gets enabled
when necessary. That make it possible in the third patch to call
device_init_wakeup() during an earlier phase of initialization, and
remove two functions.
The next patch removes IPA register definitions that are never used.
The fifth patch makes ipa_table_hash_support() a real function, so
the IPA structure only needs to be declared rather than defined when
that file is parsed.
The sixth patch fixes improper argument names in two function
declarations. The seventh removes the declaration for a function
that does not exist, and makes ipa_cmd_init() actually get called.
And the last one eliminates ipa_version_supported(), in favor of
just deciding that if a device is probed because its compatible
matches, that device is assumed to be supported.
====================
Alex Elder [Fri, 19 Apr 2024 15:18:00 +0000 (10:18 -0500)]
net: ipa: kill ipa_version_supported()
The only place ipa_version_supported() is called is in the probe
function. The version comes from the match data. Rather than
checking the version validity separately, just consider anything
that has match data to be supported.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Alex Elder [Fri, 19 Apr 2024 15:17:59 +0000 (10:17 -0500)]
net: ipa: fix two minor ipa_cmd problems
In "ipa_cmd.h", ipa_cmd_data_valid() is declared, but that function
does not exist. So delete that declaration.
Also, for some reason ipa_cmd_init() never gets called. It isn't
really critical--it just validates that some memory offsets and a
size can be represented in some register fields, and they won't fail
with current data. Regardless, call the function in ipa_probe().
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The FILT_ROUT_HASH_EN register is only used for IPA v4.2. There,
routing and filter table hashing are not supported, and so the
register must be written to disable the feature. No other version
uses this register, so its definition can be removed. If we need to
use these some day (for example, explicitly enable the feature) this
commit can be reverted.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Alex Elder [Fri, 19 Apr 2024 15:17:55 +0000 (10:17 -0500)]
net: ipa: call device_init_wakeup() earlier
Currently, enabling wakeup for the IPA device doesn't occur until
the setup phase of initialization (in ipa_power_setup()).
There is no need to delay doing that, however. We can conveniently
do it during the config phase, in ipa_interrupt_config(), where we
enable power management wakeup mode for the IPA interrupt.
Moving the device_init_wakeup() out of ipa_power_setup() leaves that
function empty, so it can just be eliminated.
Similarly, rearrange all of the matching inverse calls, disabling
device wakeup in ipa_interrupt_deconfig() and removing that function
as well.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Alex Elder [Fri, 19 Apr 2024 15:17:53 +0000 (10:17 -0500)]
net: ipa: maintain bitmap of suspend-enabled endpoints
Keep track of which endpoints have the SUSPEND IPA interrupt enabled
in a variable-length bitmap. This will be used in the next patch to
allow the SUSPEND interrupt type to be disabled except when needed.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The series got born as a result of the discussions around the recent
Yanteng' series adding the Loongson LS7A1000, LS2K1000, LS7A2000, LS2K2000
MACs support: Link: https://lore.kernel.org/netdev/fu3f6uoakylnb6eijllakeu5i4okcyqq7sfafhp5efaocbsrwe@w74xe7gb6x7p
In particular the Yanteng' patchset needed to implement the Loongson
MAC-specific constraints applied to the link speed and link duplex mode.
As a result of the discussion with Russel the next preliminary patch was
born: Link: https://lore.kernel.org/netdev/df31e8bcf74b3b4ddb7ddf5a1c371390f16a2ad5.1712917541.git.siyanteng@loongson.cn
The patch above was a temporal solution utilized by Yanteng for further
developments and to move on with the on-going review. This patchset is a
refactored version of that single patch with formatting required for the
fixes patches.
The main part of the series has already been merged in on v1 stage. The
leftover is the cleanup patches which rename
stmmac_ops::phylink_get_caps() callback to stmmac_ops::update_caps() and
move the MAC-capabilities init/re-init to the phylink MAC-capabilities
getter.
net: stmmac: Move MAC caps init to phylink MAC caps getter
After a set of recent fixes the stmmac_phy_setup() and
stmmac_reinit_queues() methods have turned to having some duplicated code.
Let's get rid from the duplication by moving the MAC-capabilities
initialization to the PHYLINK MAC-capabilities getter. The getter is
called during each network device interface open/close cycle. So the
MAC-capabilities will be initialized in generic device open procedure and
in case of the Tx/Rx queues re-initialization as the original code
semantics implies.
net: stmmac: Rename phylink_get_caps() callback to update_caps()
Since recent commits the stmmac_ops::phylink_get_caps() callback has no
longer been responsible for the phylink MAC capabilities getting, but
merely updates the MAC capabilities in the mac_device_info::link::caps
field. Rename the callback to comply with the what the method does now.
====================
Enable RX HW timestamp for PTP packets using CPTS FIFO
The CPSW offers two mechanisms for communicating packet ingress timestamp
information to the host.
The first mechanism is via the CPTS Event FIFO which records timestamp
when triggered by certain events. One such event is the reception of an
Ethernet packet with a specified EtherType field. This is used to capture
ingress timestamps for PTP packets. With this mechanism the host must
read the timestamp (from the CPTS FIFO) separately from the packet payload
which is delivered via DMA.
In the second mechanism of timestamping, CPSW driver enables hardware
timestamping for all received packets by setting the TSTAMP_EN bit in
CPTS_CONTROL register, which directs the CPTS module to timestamp all
received packets, followed by passing timestamp via DMA descriptors.
This mechanism is responsible for triggering errata i2401:
"CPSW: Host Timestamps Cause CPSW Port to Lock up."
The errata affects all K3 SoCs. Link to errata for AM64x:
https://www.ti.com/lit/er/sprz457h/sprz457h.pdf
As a workaround we can use first mechanism to timestamp received
packets.
====================
net: ethernet: ti: am65-cpsw/ethtool: Enable RX HW timestamp only for PTP packets
In the current mechanism of timestamping, am65-cpsw-nuss driver
enables hardware timestamping for all received packets by setting
the TSTAMP_EN bit in CPTS_CONTROL register, which directs the CPTS
module to timestamp all received packets, followed by passing
timestamp via DMA descriptors. This mechanism causes CPSW Port to
Lock up.
To prevent port lock up, don't enable rx packet timestamping by
setting TSTAMP_EN bit in CPTS_CONTROL register. The workaround for
timestamping received packets is to utilize the CPTS Event FIFO
that records timestamps corresponding to certain events. The CPTS
module is configured to generate timestamps for Multicast Ethernet,
UDP/IPv4 and UDP/IPv6 PTP packets.
Update supported hwtstamp_rx_filters values for CPSW's timestamping
capability.
====================
Read PHY address of switch from device tree on MT7530 DSA subdriver
This patch series makes the driver read the PHY address the switch listens
on from the device tree which, in result, brings support for MT7530
switches listening on a different PHY address than 31. And the patch series
simplifies the core operations.
The core_rmw() function calls core_read_mmd_indirect() to read the
requested register, and then calls core_write_mmd_indirect() to write the
requested value to the register. Because Clause 22 is used to access Clause
45 registers, some operations on core_write_mmd_indirect() are
unnecessarily run. Get rid of core_read_mmd_indirect() and
core_write_mmd_indirect(), and run only the necessary operations on
core_write() and core_rmw().
Reviewed-by: Daniel Golle <daniel@makrotopia.org> Tested-by: Daniel Golle <daniel@makrotopia.org> Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
net: dsa: mt7530-mdio: read PHY address of switch from device tree
Read the PHY address the switch listens on from the reg property of the
switch node on the device tree. This change brings support for MT7530
switches on boards with such bootstrapping configuration where the switch
listens on a different PHY address than the hardcoded PHY address on the
driver, 31.
As described on the "MT7621 Programming Guide v0.4" document, the MT7530
switch and its PHYs can be configured to listen on the range of 7-12,
15-20, 23-28, and 31 and 0-4 PHY addresses.
There are operations where the switch PHY registers are used. For the PHY
address of the control PHY, transform the MT753X_CTRL_PHY_ADDR constant
into a macro and use it. The PHY address for the control PHY is 0 when the
switch listens on 31. In any other case, it is one greater than the PHY
address the switch listens on.
Reviewed-by: Daniel Golle <daniel@makrotopia.org> Tested-by: Daniel Golle <daniel@makrotopia.org> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
====================
netlink: Add nftables spec w/ multi messages
This series adds a ynl spec for nftables and extends ynl with a --multi
command line option that makes it possible to send transactional batches
for nftables.
This series includes a patch for nfnetlink which adds ACK processing for
batch begin/end messages. If you'd prefer that to be sent separately to
nf-next then I can do so, but I included it here so that it gets seen in
context.
- ynl reports errors by raising an NlError exception so only the first
error gets reported. This could be changed to add errors to the list
of responses so that multiple errors could be reported.
- If any message does not get a response (e.g. batch-begin w/o patch 2)
then ynl waits indefinitely. A recv timeout could be added which
would allow ynl to terminate.
====================
Donald Hunter [Thu, 18 Apr 2024 10:47:37 +0000 (11:47 +0100)]
netfilter: nfnetlink: Handle ACK flags for batch messages
The NLM_F_ACK flag is ignored for nfnetlink batch begin and end
messages. This is a problem for ynl which wants to receive an ack for
every message it sends, not just the commands in between the begin/end
messages.
Add processing for ACKs for begin/end messages and provide responses
when requested.
I have checked that iproute2, pyroute2 and systemd are unaffected by
this change since none of them use NLM_F_ACK for batch begin/end.
Donald Hunter [Thu, 18 Apr 2024 10:47:36 +0000 (11:47 +0100)]
tools/net/ynl: Add multi message support to ynl
Add a "--multi <do-op> <json>" command line to ynl that makes it
possible to add several operations to a single netlink request payload.
The --multi command line option is repeated for each operation.
This is used by the nftables family for transaction batches. For
example:
Donald Hunter [Thu, 18 Apr 2024 10:47:35 +0000 (11:47 +0100)]
tools/net/ynl: Fix extack decoding for directional ops
NetlinkProtocol.decode() was looking up ops by response value which breaks
when it is used for extack decoding of directional ops. Instead, pass
the op to decode().
To have per request buffer notifications each zerocopy io_uring send
request allocates a new ubuf_info. However, as an skb can carry only
one uarg, it may force the stack to create many small skbs hurting
performance in many ways.
The patchset implements notification, i.e. an io_uring's ubuf_info
extension, stacking. It attempts to link ubuf_info's into a list,
allowing to have multiple of them per skb.
liburing/examples/send-zerocopy shows up 6 times performance improvement
for TCP with 4KB bytes per send, and levels it with MSG_ZEROCOPY. Without
the patchset it requires much larger sends to utilise all potential.
Pavel Begunkov [Fri, 19 Apr 2024 11:08:40 +0000 (12:08 +0100)]
net: add callback for setting a ubuf_info to skb
At the moment an skb can only have one ubuf_info associated with it,
which might be a performance problem for zerocopy sends in cases like
TCP via io_uring. Add a callback for assigning ubuf_info to skb, this
way we will implement smarter assignment later like linking ubuf_info
together.
Note, it's an optional callback, which should be compatible with
skb_zcopy_set(), that's because the net stack might potentially decide
to clone an skb and take another reference to ubuf_info whenever it
wishes. Also, a correct implementation should always be able to bind to
an skb without prior ubuf_info, otherwise we could end up in a situation
when the send would not be able to progress.
Pavel Begunkov [Fri, 19 Apr 2024 11:08:39 +0000 (12:08 +0100)]
net: extend ubuf_info callback to ops structure
We'll need to associate additional callbacks with ubuf_info, introduce
a structure holding ubuf_info callbacks. Apart from a more smarter
io_uring notification management introduced in next patches, it can be
used to generalise msg_zerocopy_put_abort() and also store
->sg_from_iter, which is currently passed in struct msghdr.
====================
tcp: avoid sending too small packets
tcp_sendmsg() cooks 'large' skbs, that are later split
if needed from tcp_write_xmit().
After a split, the leftover skb size is smaller than the optimal
size, and this causes a performance drop.
In this series, tcp_grow_skb() helper is added to shift
payload from the second skb in the write queue to the first
skb to always send optimal sized skbs.
This increases TSO efficiency, and decreases number of ACK
packets.
====================
tcp_tso_should_defer() heuristic is defeated, because after a split of
a packet in write queue for whatever reason (this might be a too small
CWND or a small enough pacing_rate),
the leftover packet in the queue is smaller than the optimal size.
It is time to try to make 'leftover packets' bigger so that
tcp_tso_should_defer() can give its full potential.
After this patch, we can see the following output:
Eric Dumazet [Thu, 18 Apr 2024 21:45:59 +0000 (21:45 +0000)]
tcp: call tcp_set_skb_tso_segs() from tcp_write_xmit()
tcp_write_xmit() calls tcp_init_tso_segs()
to set gso_size and gso_segs on the packet.
tcp_init_tso_segs() requires the stack to maintain
an up to date tcp_skb_pcount(), and this makes sense
for packets in rtx queue. Not so much for packets
still in the write queue.
In the following patch, we don't want to deal with
tcp_skb_pcount() when moving payload from 2nd
skb to 1st skb in the write queue.
Jakub Kicinski [Mon, 22 Apr 2024 21:22:22 +0000 (14:22 -0700)]
Merge branch 'mlx5e-per-queue-coalescing'
Tariq Toukan says:
====================
mlx5e per-queue coalescing
This patchset adds ethtool per-queue coalescing support for the mlx5e
driver.
The series introduce some changes needed as preparations for the final
patch which adds the support and implements the callbacks. Main
changes:
- DIM code movements into its own header file.
- Switch to dynamic allocation of the DIM struct in the RQs/SQs.
- Allow coalescing config change without channels reset when possible.
====================
net/mlx5e: Support updating coalescing configuration without resetting channels
When CQE mode or DIM state is changed, gracefully reconfigure channels to
handle new configuration. Previously, would create new channels that would
reflect the changes rather than update the original channels.
Co-developed-by: Nabil S. Alramli <dev@nalramli.com> Signed-off-by: Nabil S. Alramli <dev@nalramli.com> Co-developed-by: Joe Damato <jdamato@fastly.com> Signed-off-by: Joe Damato <jdamato@fastly.com> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240419080445.417574-5-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net/mlx5e: Dynamically allocate DIM structure for SQs/RQs
Make it possible for the DIM structure to be torn down while an SQ or RQ is
still active. Changing the CQ period mode is an example where the previous
sampling done with the DIM structure would need to be invalidated.
Co-developed-by: Nabil S. Alramli <dev@nalramli.com> Signed-off-by: Nabil S. Alramli <dev@nalramli.com> Co-developed-by: Joe Damato <jdamato@fastly.com> Signed-off-by: Joe Damato <jdamato@fastly.com> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240419080445.417574-4-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net/mlx5e: Use DIM constants for CQ period mode parameter
Use core DIM CQ period mode enum values for the CQ parameter for the period
mode. Translate the value to the specific mlx5 device constant for the
selected period mode when creating a CQ. Avoid needing to translate mlx5
device constants to DIM constants for core DIM functionality.
Co-developed-by: Nabil S. Alramli <dev@nalramli.com> Signed-off-by: Nabil S. Alramli <dev@nalramli.com> Co-developed-by: Joe Damato <jdamato@fastly.com> Signed-off-by: Joe Damato <jdamato@fastly.com> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240419080445.417574-3-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net/mlx5e: Move DIM function declarations to en/dim.h
Create a header specifically for DIM-related declarations. Move existing
DIM-specific functionality from en.h. Future DIM-related functionality will
be declared in en/dim.h in subsequent patches.
Co-developed-by: Nabil S. Alramli <dev@nalramli.com> Signed-off-by: Nabil S. Alramli <dev@nalramli.com> Co-developed-by: Joe Damato <jdamato@fastly.com> Signed-off-by: Joe Damato <jdamato@fastly.com> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240419080445.417574-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
net: dsa: vsc73xx: convert to PHYLINK and do some cleanup
This patch series is a result of splitting a larger patch series [0],
where some parts needed to be refactored.
The first patch switches from a poll loop to read_poll_timeout.
The second patch is a simple conversion to phylink because adjust_link
won't work anymore.
The third patch is preparation for future use. Using the
"phy_interface_mode_is_rgmii" macro allows for the proper recognition
of all RGMII modes.
Patches 4-5 involve some cleanup: The fourth patch introduces
a definition with the maximum number of ports to avoid using
magic numbers. The next one fills in documentation.
net: dsa: vsc73xx: use macros for rgmii recognition
It's preparation for future use. At this moment, the RGMII port is used
only for a connection to the MAC interface, but in the future, someone
could connect a PHY to it. Using the "phy_interface_mode_is_rgmii" macro
allows for the proper recognition of all RGMII modes.
Suggested-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Pawel Dembicki <paweldembicki@gmail.com> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://lore.kernel.org/r/20240417205048.3542839-4-paweldembicki@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: dsa: vsc73xx: use read_poll_timeout instead delay loop
Switch the delay loop during the Arbiter empty check from
vsc73xx_adjust_link() to use read_poll_timeout(). Functionally,
one msleep() call is eliminated at the end of the loop in the timeout
case.
As Russell King suggested:
"This [change] avoids the issue that on the last iteration, the code reads
the register, tests it, finds the condition that's being waiting for is
false, _then_ waits and end up printing the error message - that last
wait is rather useless, and as the arbiter state isn't checked after
waiting, it could be that we had success during the last wait."
Suggested-by: Russell King <linux@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: Pawel Dembicki <paweldembicki@gmail.com> Link: https://lore.kernel.org/r/20240417205048.3542839-2-paweldembicki@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Fri, 19 Apr 2024 07:19:42 +0000 (07:19 +0000)]
tcp: do not export tcp_twsk_purge()
After commit 1eeb50435739 ("tcp/dccp: do not care about
families in inet_twsk_purge()") tcp_twsk_purge() is
no longer potentially called from a module.
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
octeontx2-pf: Add support for offload tc with skbedit mark action
Support offloading of skbedit mark action.
For example, to mark with 0x0008, with dest ip 60.60.60.2 on eth2
interface:
# tc qdisc add dev eth2 ingress
# tc filter add dev eth2 ingress protocol ip flower \
dst_ip 60.60.60.2 action skbedit mark 0x0008 skip_sw
Signed-off-by: Geetha sowjanya <gakula@marvell.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
net: ethernet: ti: am65-cpsw: Fix xdp_rxq error for disabled port
When an ethX port is disabled in the device tree, an error is returned
by xdp_rxq_info_reg() function while transitioning the CPSW device to
the up state. The message 'Missing net_device from driver' is output.
This patch fixes the issue by registering xdp_rxq info only if ethX
port is enabled (i.e. ndev pointer is not NULL).
Fixes: 8acacc40f733 ("net: ethernet: ti: am65-cpsw: Add minimal XDP support") Link: https://lore.kernel.org/all/260d258f-87a1-4aac-8883-aab4746b32d8@ti.com/ Reported-by: Siddharth Vadapalli <s-vadapalli@ti.com> Closes: https://gist.github.com/Siddharth-Vadapalli-at-TI/5ed0e436606001c247a7da664f75edee Signed-off-by: Julien Panis <jpanis@baylibre.com> Signed-off-by: David S. Miller <davem@davemloft.net>
To be able to constify instances of struct ctl_tables it is necessary to
remove ways through which non-const versions are exposed from the
sysctl core.
One of these is the ctl_table_arg member of struct ctl_table_header.
Constify this reference as a prerequisite for the full constification of
struct ctl_table instances.
No functional change.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
testing: make netfilter selftests functional in vng environment
This is the second batch of the netfilter selftest move.
Changes since v1:
- makefile and kernel config are updated to have all required features
- fix makefile with missing bits to make kselftest-install work
- test it via vng as per
https://github.com/linux-netdev/nipa/wiki/How-to-run-netdev-selftests-CI-style
(Thanks Jakub!)
- squash a few fixes, e.g. nft_queue.sh v1 had a race w. NFNETLINK_QUEUE=m
- add a settings file with 8m timeout, for nft_concat_range.sh sake.
That script can be sped up a bit, I think, but its not contained in
this batch yet.
- toss the first two bogus rebase artifacts (Matthieu Baerts)
scripts are moved to lib.sh infra. This allows to use busywait helper
and ditch various 'sleep 2' all over the place.
selftests: netfilter: update makefiles and kernel config
Jakub reports the Makefile missed a few updates to make kselftest-install
work for the netfilter tests and points out that config file lacks many
dependencies such as VETH support.
The settings file (timeout 8m) is added for nft_concat_range.sh script
which can take several minutes to complete.
Fixes: 3f189349e52a ("selftests: netfilter: move to net subdir") Reported-by: Jakub Kicinski <kuba@kernel.org> Closes: https://lore.kernel.org/all/20240412175413.04e5e616@kernel.org/ Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20240418152744.15105-13-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
selftests: netfilter: xt_string.sh: move to lib.sh infra
Intentional changes:
- Use socat instead of netcat
- Use a temporary file instead of pipe, else packets do not match
"-m string" rules, multiple writes to the pipe cause multiple packets,
but this needs only one to work.
selftests: netfilter: nft_zones_many.sh: move to lib.sh infra
Also do shellcheck cleanups here, no functional changes intended.
When running tests via vng tool, the packetpath insertion test fails:
dd: failed to open '/dev/stdout': Device or resource busy
selftests: netfilter: nft_queue.sh: move to lib.sh infra
- switch to socat, like other tests
- use buswait helper to test once listener netns is ready
- do not generate multiple input test files, only generate
one and use cleanup hook to remove it, like other temporary files.
net: dsa: xrs700x: fix missing initialisation of ds->phylink_mac_ops
The kernel build bot identified the following mistake in the recently
merged 860a9bed2651 ("net: dsa: xrs700x: provide own phylink MAC
operations") patch:
drivers/net/dsa/xrs700x/xrs700x.c:714:37: warning: 'xrs700x_phylink_mac_ops' defined but not used [-Wunused-const-variable=]
714 | static const struct phylink_mac_ops xrs700x_phylink_mac_ops = {
| ^~~~~~~~~~~~~~~~~~~~~~~
Fix the omitted assignment of ds->phylink_mac_ops.
Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 19 Apr 2024 10:38:04 +0000 (11:38 +0100)]
Merge branch 'net-rps-lockless'
Jason Xing says:
====================
locklessly protect left members in struct rps_dev_flow
From: Jason Xing <kernelxing@tencent.com>
Since Eric did a more complicated locklessly change to last_qtail
member[1] in struct rps_dev_flow, the left members are easier to change
as the same.
One thing important I would like to share by qooting Eric:
"rflow is located in rxqueue->rps_flow_table, it is thus private to current
thread. Only one cpu can service an RX queue at a time."
So we only pay attention to the reader in the rps_may_expire_flow() and
writer in the set_rps_cpu(). They are in the two different contexts.
Jason Xing [Thu, 18 Apr 2024 07:36:03 +0000 (15:36 +0800)]
net: rps: locklessly access rflow->cpu
This is the last member in struct rps_dev_flow which should be
protected locklessly. So finish it.
Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Xing [Thu, 18 Apr 2024 07:36:02 +0000 (15:36 +0800)]
net: rps: protect filter locklessly
As we can see, rflow->filter can be written/read concurrently, so
lockless access is needed.
Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Xing [Thu, 18 Apr 2024 07:36:01 +0000 (15:36 +0800)]
net: rps: protect last_qtail with rps_input_queue_tail_save() helper
Removing one unnecessary reader protection and add another writer
protection to finish the locklessly proctection job.
Note: the removed READ_ONCE() is not needed because we only have to protect
the locklessly reader in the different context (rps_may_expire_flow()).
Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 19 Apr 2024 10:34:08 +0000 (11:34 +0100)]
Merge branch 'net_sched-dump-no-rtnl'
Eric Dumazet says:
====================
net_sched: first series for RTNL-less qdisc dumps
Medium term goal is to implement "tc qdisc show" without needing
to acquire RTNL.
This first series makes the requested changes in 14 qdisc.
Notes :
- RTNL is still held in "tc qdisc show", more changes are needed.
- Qdisc returning many attributes might want/need to provide
a consistent set of attributes. If that is the case, their
dump() method could acquire the qdisc spinlock, to pair the
spinlock acquision in their change() method.
V2: Addressed Simon feedback (Thanks a lot Simon)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>