David S. Miller [Wed, 6 Mar 2024 10:30:08 +0000 (10:30 +0000)]
Merge branch '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
From: Tony Nguyen <anthony.l.nguyen@intel.com>
To: davem@davemloft.net, kuba@kernel.org, pabeni@redhat.com,
edumazet@google.com, netdev@vger.kernel.org
Cc: Tony Nguyen <anthony.l.nguyen@intel.com>, alan.brady@intel.com
Tony Nguyen says:
This series has two primary goals: enable support for multiple
simultaneous messages and make the channel more robust. As it works
right now, the driver can only send and receive a single message at a
time, and if something goes really wrong, it can lead to data
corruption and strange bugs.
To start the series, we introduce an idpf_virtchnl.h file. This reduces
the burden on idpf.h which is overloaded with struct and function
declarations.
The conversion works by conceptualizing a send and receive as a
"virtchnl transaction" (idpf_vc_xn) and introducing a "transaction
manager" (idpf_vc_xn_manager). The vcxn_mngr will init a ring of
transactions, and the driver will pop entries from a bitmap of free
transactions to track in-flight messages. Instead of needing to handle a
complicated send/recv for every message, the driver now just needs to
fill out an xn_params struct and hand it over to idpf_vc_xn_exec which
will take care of all the messy bits. Once a message is sent and
receives a reply, we leverage the completion API to signal the received
buffer is ready to be used (assuming success, or an error code
otherwise).
At a low level, this implements the "sw cookie" field of the virtchnl
message descriptor to enable this. We have 16 bits in which we can put
whatever we want, and the recipient must apply the same cookie to the
reply for that message. We use the first 8 bits as an index into the
array of transactions to enable fast lookups and we use the second 8
bits as a salt to make sure each cookie is unique for that message. As
transactions are received in arbitrary order, it's possible to reuse a
transaction index and the salt guards against index conflicts to make
certain the lookup is correct. As a primitive example, say index 1 is
used with salt 1. The message times out without receiving a reply so
index 1 is renewed to be ready for a new transaction, we report the
timeout, and send the message again. Since index 1 is free to be used
again now, index 1 is again sent but now salt is 2. This time we do get
a reply, however it could be that the reply is _actually_ for the
previous send index 1 with salt 1. Without the salt we would have no
way of knowing for sure if it's the correct reply, but with it we will know
for certain.
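The index/salt scheme above can be sketched in plain C. This is only an
illustration, not the driver's code: the toy_* names and the ring length
are hypothetical, and it assumes "first 8 bits" means the low byte of the
16-bit cookie.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_XN_RING_LEN 256u	/* hypothetical: an 8-bit index allows 256 slots */

/* Hypothetical per-slot state; the real transaction struct holds much more. */
struct toy_xn {
	uint8_t salt;		/* salt of the transaction currently using the slot */
	bool in_flight;
};

/* Pack index (assumed low byte) and salt (assumed high byte) into a cookie. */
static inline uint16_t toy_cookie_make(uint8_t idx, uint8_t salt)
{
	return (uint16_t)(((unsigned int)salt << 8) | idx);
}

static inline uint8_t toy_cookie_idx(uint16_t cookie)
{
	return (uint8_t)(cookie & 0xffu);
}

static inline uint8_t toy_cookie_salt(uint16_t cookie)
{
	return (uint8_t)(cookie >> 8);
}

/* Start a transaction on a slot: bump the salt so the cookie differs per use. */
static inline uint16_t toy_xn_begin(struct toy_xn *ring, unsigned int idx)
{
	assert(idx < TOY_XN_RING_LEN);
	ring[idx].salt++;
	ring[idx].in_flight = true;
	return toy_cookie_make((uint8_t)idx, ring[idx].salt);
}

/* Give the slot back after a timeout; a late reply may still arrive. */
static inline void toy_xn_timeout(struct toy_xn *ring, unsigned int idx)
{
	assert(idx < TOY_XN_RING_LEN);
	ring[idx].in_flight = false;
}

/* A reply is accepted only if its salt matches the slot's current salt. */
static inline bool toy_xn_reply_matches(const struct toy_xn *ring, uint16_t cookie)
{
	const struct toy_xn *xn = &ring[toy_cookie_idx(cookie)];

	return xn->in_flight && xn->salt == toy_cookie_salt(cookie);
}
```

Replaying the example from the text: begin on index 1 (salt 1), time out,
begin again on index 1 (salt 2). A late reply carrying the first cookie is
rejected by toy_xn_reply_matches() while the second cookie is accepted.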
Through this conversion we also get several other benefits. We can now
more appropriately handle asynchronously sent messages by providing
space for a callback to be defined. This notably allows us to handle MAC
filter failures better; previously we could potentially have stale,
failed filters in our list, which shouldn't really have a major impact
but is obviously not correct. I also managed to remove significantly
more lines than I added, which is a win in my book.
Additionally, this converts some variables to use auto-variables where
appropriate. This makes the alloc paths much cleaner and less prone to
memory leaks. We also fix a few virtchnl related bugs while we're here.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 6 Mar 2024 08:07:45 +0000 (08:07 +0000)]
Merge branch 'netlink-emsgsize'
Jakub Kicinski says:
====================
netlink: handle EMSGSIZE errors in the core
Ido discovered some time back that we usually force NLMSG_DONE
to be delivered in a separate recv() syscall, even if it would
fit into the same skb as data messages. He made nexthop try
to fit DONE with data in commit 8743aeff5bc4 ("nexthop: Fix
infinite nexthop bucket dump when using maximum nexthop ID"),
and nobody has complained so far.
We have since also tried to follow the same pattern in new
genetlink families, but explaining to people, or even remembering
the correct handling ourselves is tedious.
Let the netlink socket layer consume -EMSGSIZE errors.
Practically speaking most families use this error code
as "dump needs more space", anyway.
v2:
- init err to 0 in last patch
v1: https://lore.kernel.org/all/20240301012845.2951053-1-kuba@kernel.org/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Sun, 3 Mar 2024 05:24:08 +0000 (21:24 -0800)]
genetlink: fit NLMSG_DONE into same read() as families
Make sure ctrl_fill_info() returns sensible error codes and
propagate them out to netlink core. Let netlink core decide
when to return skb->len and when to treat the exit as an
error. Netlink core does a better job at it: if we always
return skb->len, the core doesn't know when we're done
dumping and NLMSG_DONE ends up in a separate read().
Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Sun, 3 Mar 2024 05:24:07 +0000 (21:24 -0800)]
netdev: let netlink core handle -EMSGSIZE errors
Previous change added -EMSGSIZE handling to af_netlink, we don't
have to hide these errors any longer.
Theoretically the error handling changes from:
if (err == -EMSGSIZE)
to
if (err == -EMSGSIZE && skb->len)
everywhere, but in practice it doesn't matter.
All messages fit into NLMSG_GOODSIZE, so overflow of an empty
skb cannot happen.
Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Sun, 3 Mar 2024 05:24:06 +0000 (21:24 -0800)]
netlink: handle EMSGSIZE errors in the core
Eric points out that our current suggested way of handling
EMSGSIZE errors ((err == -EMSGSIZE) ? skb->len : err) will
break if we didn't fit even a single object into the buffer
provided by the user. This should not happen for well behaved
applications, but we can fix that, and free netlink families
from dealing with that completely by moving error handling
into the core.
Let's assume from now on that all EMSGSIZE errors in dumps are
because we run out of skb space. Families can now propagate
the error nla_put_*() etc generated and not worry about any
return value magic. If some family really wants to send EMSGSIZE
to user space, assuming it generates the same error on the next
dump iteration the skb->len should be 0, and user space should
still see the EMSGSIZE.
This should simplify families and prevent mistakes in return
values which lead to DONE being forced into a separate recv()
call as discovered by Ido some time ago.
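As a rough userspace illustration of the difference (not the af_netlink
code; struct toy_skb and both helpers are hypothetical stand-ins), the old
per-family idiom and the check now assumed to live in the core can be
compared like this:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-in for the dump skb; only the fill level matters here. */
struct toy_skb {
	size_t len;
};

/* Old per-family idiom quoted above: always hide EMSGSIZE behind skb->len. */
static inline int toy_family_old_ret(const struct toy_skb *skb, int err)
{
	assert(skb != NULL);
	return (err == -EMSGSIZE) ? (int)skb->len : err;
}

/*
 * Sketch of the core-side rule: EMSGSIZE with data already in the skb
 * means "buffer full, call me again"; EMSGSIZE on an empty skb is a real
 * error and is passed through to user space.
 */
static inline int toy_core_dump_ret(const struct toy_skb *skb, int err)
{
	assert(skb != NULL);
	if (err == -EMSGSIZE && skb->len)
		return (int)skb->len;
	return err;
}
```

With an empty skb the old idiom returns 0, which is indistinguishable from
a finished dump, while the core-side rule still reports -EMSGSIZE.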
Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Mon, 4 Mar 2024 23:36:20 +0000 (15:36 -0800)]
selftests: avoid using SKIP(exit()) in harness fixure setup
selftest harness uses various exit codes to signal test
results. Avoid calling exit() directly, otherwise tests
may get broken by harness refactoring (like the commit
under Fixes). SKIP() will instruct the harness that the
test shouldn't run, it used to not be the case, but that
has been fixed. So just return, no need to exit.
Note that for hmm-tests this actually changes the result
from pass to skip. Which seems fair, the test is skipped,
after all.
Jakub Kicinski [Wed, 6 Mar 2024 03:21:19 +0000 (19:21 -0800)]
Merge branch 'net-ethernet-rework-eee'
Oleksij Rempel says:
====================
net: ethernet: Rework EEE
With Andrew's permission I'll continue mainlining these patches:
==============================================================
Most MAC drivers get EEE wrong. The API to the PHY is not very
obvious, which is probably why. Rework the API, pushing most of the
EEE handling into phylib core, leaving the MAC drivers to just
enable/disable support for EEE in their change_link callback.
MAC drivers are now expected to indicate to phylib if they support
EEE. This will allow future patches to configure the PHY to advertise
no EEE link modes when EEE is not supported. The information could
also be used to enable SmartEEE if the PHY supports it.
With these changes, the uAPI configuration eee_enable becomes a global
on/off. tx-lpi must also be enabled before EEE is enabled. This fits
the discussion here:
This patchset puts in place all the infrastructure, and converts one
MAC driver to the new API. Following patchsets will convert other MAC
drivers, extend support into phylink, and when all MAC drivers are
converted to the new scheme, clean up some unneeded code.
====================
Andrew Lunn [Sat, 2 Mar 2024 19:53:06 +0000 (20:53 +0100)]
net: fec: Fixup EEE
The enabling/disabling of EEE in the MAC should happen as a result of
auto negotiation. So move the enable/disable into
fec_enet_adjust_link() which gets called by phylib when there is a
change in link status.
fec_enet_set_eee() now just stores away the LPI timer value.
Everything else is passed to phylib, so it can correctly set up the
PHY.
fec_enet_get_eee() relies on phylib doing most of the work,
the MAC driver just adds the LPI timer value.
Call phy_support_eee() if the quirk is present to indicate the MAC
actually supports EEE.
Signed-off-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Oleksij Rempel <o.rempel@pengutronix.de> (On iMX8MP debix) Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Reviewed-by: Wei Fang <wei.fang@nxp.com> Link: https://lore.kernel.org/r/20240302195306.3207716-8-o.rempel@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Andrew Lunn [Sat, 2 Mar 2024 19:53:04 +0000 (20:53 +0100)]
net: phy: Add phy_support_eee() indicating MAC support EEE
In order for EEE to operate, both the MAC and the PHY need to support
it, similar to how pause works. There are some exceptions: a number of
PHYs have SmartEEE or AutoGrEEEn support in order to provide some
EEE-like power savings with non-EEE capable MACs.
Copy the pause concept and add the call phy_support_eee() which the MAC
makes after connecting the PHY to indicate it supports EEE. phylib will
then advertise EEE when auto-neg is performed.
Andrew Lunn [Sat, 2 Mar 2024 19:53:03 +0000 (20:53 +0100)]
net: phy: Immediately call adjust_link if only tx_lpi_enabled changes
The MAC driver changes its EEE hardware configuration in its
adjust_link callback. This is called when auto-neg
completes. Disabling EEE via eee_enabled false will trigger an
autoneg, and as a result the adjust_link callback will be called with
phydev->enable_tx_lpi set to false. Similarly, eee_enabled set to true
and with a change of advertised link modes will result in a new
autoneg, and a call to the adjust_link callback.
If set_eee is called with only a change to tx_lpi_enabled which does
not trigger an auto-neg, it is necessary to call the adjust_link
callback so that the MAC is reconfigured to take this change into
account.
When setting phydev->enable_tx_lpi, take both eee_enabled and
tx_lpi_enabled into account, so the MAC driver just needs to act on
phydev->enable_tx_lpi and not the whole EEE configuration.
The same check should be done for tx_lpi_timer too.
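A minimal sketch of the decision described above, using a hypothetical
struct toy_eee_cfg rather than the real phylib types. It ignores the
negotiation result with the link partner, and the exact autoneg trigger
conditions are an assumption read from the text:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the user-visible EEE settings. */
struct toy_eee_cfg {
	bool eee_enabled;
	bool tx_lpi_enabled;
	uint32_t tx_lpi_timer;
	uint32_t advertised;	/* simplified link-mode bitmap */
};

/* Administrative part of enable_tx_lpi: both knobs must be on. */
static inline bool toy_can_tx_lpi(const struct toy_eee_cfg *cfg)
{
	assert(cfg != NULL);
	return cfg->eee_enabled && cfg->tx_lpi_enabled;
}

/* Assumed triggers for a new autoneg: eee_enabled flips, or adverts change. */
static inline bool toy_restarts_autoneg(const struct toy_eee_cfg *old_cfg,
					const struct toy_eee_cfg *new_cfg)
{
	assert(old_cfg != NULL && new_cfg != NULL);
	if (old_cfg->eee_enabled != new_cfg->eee_enabled)
		return true;
	return new_cfg->eee_enabled &&
	       old_cfg->advertised != new_cfg->advertised;
}

/*
 * When no autoneg is triggered, adjust_link would never run on its own,
 * so a change to tx_lpi_enabled (or, as the text suggests, tx_lpi_timer)
 * needs a direct call.
 */
static inline bool toy_needs_direct_adjust_link(const struct toy_eee_cfg *old_cfg,
						const struct toy_eee_cfg *new_cfg)
{
	if (toy_restarts_autoneg(old_cfg, new_cfg))
		return false;
	return old_cfg->tx_lpi_enabled != new_cfg->tx_lpi_enabled ||
	       old_cfg->tx_lpi_timer != new_cfg->tx_lpi_timer;
}
```

The MAC side then only needs the single boolean from toy_can_tx_lpi()
(combined with the negotiated result) instead of the whole configuration.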
Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20240302195306.3207716-5-o.rempel@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Andrew Lunn [Sat, 2 Mar 2024 19:53:01 +0000 (20:53 +0100)]
net: phy: Add phydev->enable_tx_lpi to simplify adjust link callbacks
MAC drivers which support EEE need to know the results of the EEE
auto-neg in order to program the hardware to perform EEE or not. The
oddly named phy_init_eee() can be used to determine this, it returns 0
if EEE should be used, or a negative error code,
e.g. -EPROTONOSUPPORT, if the PHY does not support EEE or negotiation
resulted in it not being used.
However, many MAC drivers get this wrong. Add phydev->enable_tx_lpi
which indicates the result of the autoneg for EEE, including if EEE is
administratively disabled with ethtool. The MAC driver can then access
this in the same way as link speed and duplex in the adjust link
callback. If enable_tx_lpi is true, the MAC should send low power
indications and does not need to consider anything else with respect
to EEE.
Russell King [Sat, 2 Mar 2024 19:53:00 +0000 (20:53 +0100)]
net: add helpers for EEE configuration
Add helpers that phylib and phylink can use to manage EEE configuration
and determine whether the MAC should be permitted to use LPI based on
that configuration.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://lore.kernel.org/r/20240302195306.3207716-2-o.rempel@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Heiner Kallweit [Sat, 2 Mar 2024 14:18:27 +0000 (15:18 +0100)]
ethtool: ignore unused/unreliable fields in set_eee op
This function is used with the set_eee() ethtool operation. Certain
fields of struct ethtool_keee are relevant only for the get_eee()
operation. In addition, in case of the ioctl interface, we have no
guarantee that userspace sends sane values in struct ethtool_eee.
Therefore explicitly ignore all fields not needed for set_eee().
This protects from drivers trying to use unchecked and unreliable
data, relying on specific userspace behavior.
Note: Such unsafe driver behavior has been found and fixed in the
tg3 driver.
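The idea can be sketched as copying only the settable fields into a
zeroed structure. struct toy_eee and toy_eee_for_set() are hypothetical,
and the split between get-only and settable fields below is an assumption
for illustration, not a statement of the exact kernel layout:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the EEE settings passed in from user space. */
struct toy_eee {
	uint32_t supported;	/* assumed get-only */
	uint32_t advertised;
	uint32_t lp_advertised;	/* assumed get-only */
	uint32_t eee_active;	/* assumed get-only */
	uint32_t eee_enabled;
	uint32_t tx_lpi_enabled;
	uint32_t tx_lpi_timer;
};

/*
 * Build the structure handed to the driver's set_eee(): start from zero
 * and copy only what a set operation needs, so a driver can never act on
 * unchecked values in the get-only fields.
 */
static inline struct toy_eee toy_eee_for_set(const struct toy_eee *from_user)
{
	struct toy_eee clean = {0};

	assert(from_user != NULL);
	clean.advertised = from_user->advertised;
	clean.eee_enabled = !!from_user->eee_enabled;	/* normalise to 0/1 */
	clean.tx_lpi_enabled = !!from_user->tx_lpi_enabled;
	clean.tx_lpi_timer = from_user->tx_lpi_timer;
	return clean;
}
```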
Kees Cook [Mon, 4 Mar 2024 21:29:31 +0000 (13:29 -0800)]
sock: Use unsafe_memcpy() for sock_copy()
While testing for places where zero-sized destinations were still showing
up in the kernel, sock_copy() and inet_reqsk_clone() were found, which
are using very specific memcpy() offsets for both avoiding a portion of
struct sock, and copying beyond the end of it (since struct sock is really
just a common header before the protocol-specific allocation). Instead
of trying to unravel this historical lack of container_of(), just switch
to unsafe_memcpy(), since that's effectively what was happening already
(memcpy() wasn't checking 0-sized destinations while the code base was
being converted away from fake flexible arrays).
Avoid the following false positive warning with future changes to
CONFIG_FORTIFY_SOURCE:
memcpy: detected field-spanning write (size 3068) of destination "&nsk->__sk_common.skc_dontcopy_end" at net/core/sock.c:2057 (size 0)
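To show the copy pattern being discussed, here is a userspace toy with a
hypothetical struct toy_sock. It copies everything before a "don't copy"
region, skips the region, and then copies from its end up to the size of
the larger protocol-specific object. In the kernel the second destination
is a zero-sized marker field, which is what trips the fortified memcpy();
the toy sidesteps that by using byte offsets.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical common header, loosely modelled on the idea of struct sock. */
struct toy_sock {
	int family;
	int state;
	int dontcopy_refcnt;	/* start of the region a clone must not inherit */
	int dontcopy_lock;
	int priority;		/* copying resumes here */
	int mark;
};

/* Protocol-specific object: common header followed by private data. */
struct toy_proto_sock {
	struct toy_sock sk;
	char proto_priv[16];
};

/* Copy obj_size bytes from osk to nsk, leaving the dontcopy region alone. */
static inline void toy_sock_copy(void *nsk, const void *osk, size_t obj_size)
{
	const size_t skip_begin = offsetof(struct toy_sock, dontcopy_refcnt);
	const size_t skip_end = offsetof(struct toy_sock, priority);

	assert(obj_size >= sizeof(struct toy_sock));
	memcpy(nsk, osk, skip_begin);
	/* Deliberately runs past the end of struct toy_sock. */
	memcpy((char *)nsk + skip_end, (const char *)osk + skip_end,
	       obj_size - skip_end);
}
```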
Breno Leitao [Mon, 4 Mar 2024 18:38:08 +0000 (10:38 -0800)]
net: tap: Remove generic .ndo_get_stats64
Commit 3e2f544dd8a33 ("net: get stats64 if device if driver is
configured") moved the callback to dev_get_tstats64() to net core, so,
unless the driver is doing some custom stats collection, it does not
need to set .ndo_get_stats64.
Since this driver is now relying on NETDEV_PCPU_STAT_TSTATS, it
doesn't need to set the dev_get_tstats64() generic .ndo_get_stats64
function pointer.
Breno Leitao [Mon, 4 Mar 2024 18:38:07 +0000 (10:38 -0800)]
net: tuntap: Leverage core stats allocator
With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core and
convert veth & vrf"), stats allocation can be done in the net core
instead of in this driver.
With this new approach, the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc). This is the core's responsibility now.
Remove the allocation in the tun/tap driver and leverage the network
core allocation instead.
ptp: fc3: Convert to platform remove callback returning void
The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is ignored (apart
from emitting a warning) and this typically results in resource leaks.
To improve here there is a quest to make the remove callback return
void. In the first step of this quest all drivers are converted to
.remove_new(), which already returns void. Eventually after all drivers
are converted, .remove_new() will be renamed to .remove().
Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.
Since commit 43a7206b0963 ("driver core: class: make class_register() take
a const *"), the driver core allows for struct class to be in read-only
memory, so move the nfc_class structure to be declared at build time
placing it into read-only memory, instead of having to be dynamically
allocated at boot time.
Since commit 43a7206b0963 ("driver core: class: make class_register() take
a const *"), the driver core allows for struct class to be in read-only
memory, so move the wwan_class structure to be declared at build time
placing it into read-only memory, instead of having to be dynamically
allocated at boot time.
Since commit 43a7206b0963 ("driver core: class: make class_register() take
a const *"), the driver core allows for struct class to be in read-only
memory, so move the wwan_hwsim_class structure to be declared at build time
placing it into read-only memory, instead of having to be dynamically
allocated at boot time.
Since commit 43a7206b0963 ("driver core: class: make class_register() take
a const *"), the driver core allows for struct class to be in read-only
memory, so move the ppp_class structure to be declared at build time
placing it into read-only memory, instead of having to be dynamically
allocated at boot time.
Since commit 43a7206b0963 ("driver core: class: make class_register() take
a const *"), the driver core allows for struct class to be in read-only
memory, so move the framer_class structure to be declared at build time
placing it into read-only memory, instead of having to be dynamically
allocated at boot time.
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Herve Codina <herve.codina@bootlin.com> Link: https://lore.kernel.org/r/20240302-class_cleanup-net-next-v1-2-8fa378595b93@marliere.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since commit 43a7206b0963 ("driver core: class: make class_register() take
a const *"), the driver core allows for struct class to be in read-only
memory, so move the hnae_class structure to be declared at build time
placing it into read-only memory, instead of having to be dynamically
allocated at boot time.
Add two errata fixes for lan8814. The first one fixes the LED, which might
stay on even though there is no link. The second one increases the
length of the cable that can be used in 1000Base-T.
====================
When the length of the cable is more than 100m and the lan8814 is
configured to run in 1000Base-T Slave then the register of the device
needs to be optimized.
Work around this by setting the measure time to a value of 0xb. This
value can be set regardless of the configuration.
This issue is described in 'LAN8814 Silicon Errata and Data Sheet
Clarification' and according to that, this will not be corrected in a
future silicon revision.
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Acked-by: Arun Ramadoss <arun.ramadoss@microchip.com> Link: https://lore.kernel.org/r/20240304091548.1386022-3-horatiu.vultur@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Horatiu Vultur [Mon, 4 Mar 2024 09:15:47 +0000 (10:15 +0100)]
net: phy: micrel: lan8814 led errata
Lan8814 phy led behavior is not correct. It was noticed that the led
still remains ON when the cable is unplugged while there was traffic
passing at that time.
The fix consists of clearing bit 10 of register 0x38; this way the
LED behaviour is correct and it turns OFF when there is no link.
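The read-modify-write itself is small. The sketch below uses a
hypothetical flat register array in place of real MDIO accesses; how
register 0x38 is actually reached on the PHY (page, bank, access helper)
is not stated here, so that part is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_LED_ERRATA_REG 0x38u	/* register number from the text */
#define TOY_LED_ERRATA_BIT (1u << 10)	/* bit number from the text */

/* Hypothetical register file standing in for the PHY's register space. */
struct toy_phy {
	uint16_t regs[0x40];
};

/* Clear bit 10 of register 0x38 and leave every other bit untouched. */
static inline void toy_lan8814_led_fix(struct toy_phy *phy)
{
	assert(phy != NULL);
	phy->regs[TOY_LED_ERRATA_REG] &= (uint16_t)~TOY_LED_ERRATA_BIT;
}
```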
====================
selftests: forwarding: Various improvements
This patchset speeds up the multipath tests (patches #1-#2) and makes
other tests more stable (patches #3-#6) so that they will not randomly
fail in the netdev CI.
On my system, after applying the first two patches, the run time of
gre_multipath_nh_res.sh is reduced by over 90%.
====================
Ido Schimmel [Mon, 4 Mar 2024 09:56:12 +0000 (11:56 +0200)]
selftests: forwarding: Make {, ip6}gre-inner-v6-multipath tests more robust
These tests generate various IPv6 flows, encapsulate them in GRE packets
and check that the encapsulated packets are distributed between the
available nexthops according to the configured weights.
Unlike the corresponding IPv4 tests, these tests sometimes fail in the
netdev CI because of large discrepancies between the expected and
measured ratios [1]. This can be explained by the fact that the IPv4
tests generate about 3,600 different flows whereas the IPv6 tests only
generate about 784 different flows (potentially by mistake).
Fix by aligning the IPv6 tests to the IPv4 ones and increase the number
of generated flows.
[1]
[...]
# TEST: ping [ OK ]
# INFO: Running IPv6 over GRE over IPv4 multipath tests
# TEST: ECMP [FAIL]
# Too large discrepancy between expected and measured ratios
# INFO: Expected ratio 1.00 Measured ratio 1.18
[...]
Ido Schimmel [Mon, 4 Mar 2024 09:56:09 +0000 (11:56 +0200)]
selftests: forwarding: Make tc-police pass on debug kernels
The test configures a policer with a rate of 80Mbps and expects to
measure a rate close to it. This rate is too high for debug kernels,
causing the test to fail [1].
Fix by reducing the rate to 10Mbps.
[1]
# ./tc_police.sh
TEST: police on rx [FAIL]
Expected rate 76.2Mbps, got 29.6Mbps, which is -61% off. Required accuracy is +-10%.
TEST: police on tx [FAIL]
Expected rate 76.2Mbps, got 30.4Mbps, which is -60% off. Required accuracy is +-10%.
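The percentages in the output above can be reproduced with simple integer
arithmetic. This is only a sketch of the calculation, not the helper the
selftest uses; the function names and the use of kbps are assumptions.

```c
#include <assert.h>
#include <stdbool.h>

/* Deviation of the measured rate from the expected one, in whole percent. */
static inline int toy_rate_pct_off(long expected_kbps, long measured_kbps)
{
	assert(expected_kbps > 0);
	return (int)((measured_kbps - expected_kbps) * 100 / expected_kbps);
}

/* True if the deviation is within +-tolerance_pct. */
static inline bool toy_rate_ok(long expected_kbps, long measured_kbps,
			       int tolerance_pct)
{
	int off = toy_rate_pct_off(expected_kbps, measured_kbps);

	return off >= -tolerance_pct && off <= tolerance_pct;
}
```

For 76.2Mbps expected, 29.6Mbps gives (29600 - 76200) * 100 / 76200, which
truncates to -61, and 30.4Mbps gives -60, both far outside the +-10%
window.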
The various multipath tests use mausezahn to generate different flows
and check how they are distributed between the available nexthops. The
tool is currently invoked with a hard-coded transmission delay of 1 ms.
This is unnecessary when the tests are run with veth pairs and
needlessly prolongs the tests.
Parametrize this delay and default it to 0 us. It can be overridden
using the forwarding.config file. On my system, this reduces the run
time of router_multipath.sh by 93%.
The multipath tests currently test both the L3 and L4 multipath hash
policies for IPv6, but only the L4 policy for IPv4. The reason is mostly
historic: When the initial multipath test was added
(router_multipath.sh) the IPv6 L4 policy did not exist and was later
added to the test. The other multipath tests copied this pattern
although there is little value in testing both policies.
Align the IPv4 and IPv6 tests and only test the L4 policy. On my system,
this reduces the run time of router_multipath.sh by 89% because of the
repeated ping6 invocations to randomize the flow label.
Eric Dumazet [Sat, 2 Mar 2024 10:07:44 +0000 (10:07 +0000)]
net/smc: reduce rtnl pressure in smc_pnet_create_pnetids_list()
Many syzbot reports show extreme rtnl pressure, and many of them hint
that smc acquires rtnl in netns creation for no good reason [1]
This patch returns early from smc_pnet_net_init()
if there is no netdevice yet.
I am not even sure why smc_pnet_create_pnetids_list() even exists,
because smc_pnet_netdev_event() is also calling
smc_pnet_add_base_pnetid() when handling NETDEV_UP event.
* tag 'linux-can-next-for-6.9-20240304' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next:
can: mcp251xfd: __mcp251xfd_get_berr_counter(): use CAN_BUS_OFF_THRESHOLD instead of open coding it
can: gs_usb: gs_cmd_reset(): use cpu_to_le32() to assign mode
can: kvaser_pciefd: Add support for Kvaser PCIe 8xCAN
can: kvaser_usb: Add support for Leaf v3
====================
net: Re-use and set mono_delivery_time bit for userspace tstamp packets
The bridge driver currently has no support for forwarding packets that
carry a userspace timestamp and ends up resetting the timestamp. The ETF
qdisc checks the timestamp of packets coming from userspace, finds it to
be 0, and therefore drops time-sensitive packets. These changes allow
packets with userspace timestamps to be forwarded from the bridge to NIC
drivers.
Set the same bit (mono_delivery_time) to avoid dropping userspace
tstamp packets in the forwarding path.
The existing functionality of mono_delivery_time remains unaltered; it
is just extended with userspace tstamp support for the bridge
forwarding path.
====================
MT7530 DSA Subdriver Improvements Act III
This is the third patch series with the goal of simplifying the MT7530 DSA
subdriver and improving support for MT7530, MT7531, and the switch on the
MT7988 SoC.
I have done a simple ping test to confirm basic communication on all switch
ports on MCM and standalone MT7530, and MT7531 switch with this patch
series applied.
Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
---
Changes in v3:
- Patch 8
- Explain properly the behaviour of setting link down on all ports at
setup.
- Split the changes for simplifying the link settings operations out to
another patch.
- Link to v2: https://lore.kernel.org/r/20240216-for-netnext-mt7530-improvements-3-v2-0-094cae3ff23b@arinc9.com
Changes in v2:
- Patch 8
- Use a single mt7530_rmw() instead of two mt7530_clear() and
mt7530_set() commands.
- Link to v1: https://lore.kernel.org/r/20240208-for-netnext-mt7530-improvements-3-v1-0-d7c1cfd502ca@arinc9.com
====================
Arınç ÜNAL [Fri, 1 Mar 2024 10:43:05 +0000 (12:43 +0200)]
net: dsa: mt7530: simplify link operations
The "MT7621 Giga Switch Programming Guide v0.3", "MT7531 Reference Manual
for Development Board v1.0", and "MT7988A Wi-Fi 7 Generation Router
Platform: Datasheet (Open Version) v0.1" documents show that these bits are
enabled at reset:
PMCR_IFG_XMIT(1) (not part of PMCR_LINK_SETTINGS_MASK)
PMCR_MAC_MODE (not part of PMCR_LINK_SETTINGS_MASK)
PMCR_TX_EN
PMCR_RX_EN
PMCR_BACKOFF_EN (not part of PMCR_LINK_SETTINGS_MASK)
PMCR_BACKPR_EN (not part of PMCR_LINK_SETTINGS_MASK)
PMCR_TX_FC_EN
PMCR_RX_FC_EN
These bits also don't exist on the MT7530_PMCR_P(6) register of the switch
on the MT7988 SoC:
Remove the setting of the bits not part of PMCR_LINK_SETTINGS_MASK on
phylink_mac_config as they're already set.
The bit for setting the port to force mode is already set in
mt7530_setup() and mt7531_setup_common(). So get rid of
PMCR_FORCE_MODE_ID() which helped determine which bit to use for the switch
model.
Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Arınç ÜNAL [Fri, 1 Mar 2024 10:43:04 +0000 (12:43 +0200)]
net: dsa: mt7530: sort link settings ops and force link down on all ports
port_enable and port_disable clear the link settings. Move that to
mt7530_setup() and mt7531_setup_common() which set up the switches. This
way, the link settings are cleared on all ports at setup, and then only
once with phylink_mac_link_down() when a link goes down.
Enable force mode at setup to apply the force part of the link settings.
This ensures that disabled ports will have their link down.
Suggested-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Arınç ÜNAL [Fri, 1 Mar 2024 10:43:03 +0000 (12:43 +0200)]
net: dsa: mt7530: put initialising PCS devices code back to original order
Commit fae463084032 ("net: dsa: mt753x: fix pcs conversion regression")
fixes a regression caused by cpu_port_config manually calling phylink
operations. cpu_port_config was deemed useless and was removed. Therefore,
put initialising PCS devices code back to its original order.
Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Arınç ÜNAL [Fri, 1 Mar 2024 10:43:02 +0000 (12:43 +0200)]
net: dsa: mt7530: get rid of mt753x_mac_config()
There is no need for a separate function to call
priv->info->mac_port_config(). Call it from mt753x_phylink_mac_config()
instead and remove mt753x_mac_config().
Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Arınç ÜNAL [Fri, 1 Mar 2024 10:43:01 +0000 (12:43 +0200)]
net: dsa: mt7530: get rid of priv->info->cpu_port_config()
priv->info->cpu_port_config() is used for MT7531 and the switch on the
MT7988 SoC. It sets up the ports described as a CPU port earlier than the
phylink code path would do.
This function is useless as:
- Configuring the MACs can be done from the phylink_mac_config code path
instead.
- All the link configuration it does on the CPU ports is later undone with
the port_enable, phylink_mac_config, and then phylink_mac_link_up code
path [1].
priv->p5_interface and priv->p6_interface were being used to prevent
configuring the MACs from the phylink_mac_config code path. Remove them now
that they hold no purpose.
Remove priv->info->cpu_port_config(). On mt753x_phylink_mac_config, switch
to if statements to simplify the code.
Remove the overwriting of the speed and duplex interfaces for certain
interface modes. Phylink already provides the speed and duplex variables
with proper values. Phylink already sets the max speed of TRGMII to
SPEED_1000. Add SPEED_2500 for PHY_INTERFACE_MODE_2500BASEX to where the
speed and EEE bits are set instead.
On the switch on the MT7988 SoC, PHY_INTERFACE_MODE_INTERNAL is being used
to describe the interface mode of the 10G MAC, which is of port 6. On
mt7988_cpu_port_config() PMCR_FORCE_SPEED_1000 was set via the
PMCR_CPU_PORT_SETTING() mask. Add SPEED_10000 case to where the speed bits
are set to cover this. No need to add it to where the EEE bits are set as
the "MT7988A Wi-Fi 7 Generation Router Platform: Datasheet (Open Version)
v0.1" document shows that these bits don't exist on the MT7530_PMCR_P(6)
register.
Remove the definition of PMCR_CPU_PORT_SETTING() now that it holds no
purpose.
Change mt753x_cpu_port_enable() to void now that there're no error cases
left.
Arınç ÜNAL [Fri, 1 Mar 2024 10:43:00 +0000 (12:43 +0200)]
net: dsa: mt7530: get rid of useless error returns on phylink code path
Remove error returns on the cases where they are already handled with the
function the mac_port_get_caps member in mt753x_table points to.
mt7531_mac_config() is also called from mt7531_cpu_port_config() outside of
phylink but the port and interface modes are already handled there.
Change the functions and the mac_port_config function pointer to void now
that there're no error returns anymore.
Remove mt753x_is_mac_port() that used to help the said error returns.
On mt7531_mac_config(), switch to if statements to simplify the code.
Remove internal phy cases from mt753x_phylink_mac_config(), there is no
need to check the interface mode as that's already handled with the
function the mac_port_get_caps member in mt753x_table points to.
Acked-by: Daniel Golle <daniel@makrotopia.org> Tested-by: Daniel Golle <daniel@makrotopia.org> Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Arınç ÜNAL [Fri, 1 Mar 2024 10:42:59 +0000 (12:42 +0200)]
net: dsa: mt7530: do not use SW_PHY_RST to reset MT7531 switch
According to the document MT7531 Reference Manual for Development Board
v1.0, the SW_PHY_RST bit on the SYS_CTRL register doesn't exist for
MT7531. This is likely why forcing link down on all ports is necessary for
MT7531.
Therefore, do not set SW_PHY_RST on mt7531_setup().
Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Arınç ÜNAL [Fri, 1 Mar 2024 10:42:57 +0000 (12:42 +0200)]
net: dsa: mt7530: remove .mac_port_config for MT7988 and make it optional
For the switch on the MT7988 SoC, the mac_port_config member for ID_MT7988
in mt753x_table is not needed as the interfaces of all MACs are already
handled on mt7988_mac_port_get_caps().
Therefore, remove the mac_port_config member from ID_MT7988 in
mt753x_table. Before calling priv->info->mac_port_config(), if there's no
mac_port_config member in mt753x_table, exit mt753x_mac_config()
successfully.
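
The optional-callback pattern described above can be sketched roughly as follows. This is a minimal userspace illustration, not the driver's code: struct ex_chip_info, ex_mac_config() and ex_port_config() are hypothetical stand-ins for the mt753x_table entry, mt753x_mac_config() and a per-chip callback.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry of the per-chip info table. */
struct ex_chip_info {
	/* Optional: left NULL when the chip needs no extra MAC setup. */
	void (*mac_port_config)(int port, int *configured);
};

/* Hypothetical per-chip callback: just records that the port was set up. */
static void ex_port_config(int port, int *configured)
{
	configured[port] = 1;
}

/* Returns 0 on success; a missing callback is treated as success. */
static int ex_mac_config(const struct ex_chip_info *info, int port,
			 int *configured)
{
	assert(info != NULL);

	if (!info->mac_port_config)
		return 0;	/* nothing to do for this chip */

	info->mac_port_config(port, configured);
	return 0;
}
```

With this shape, a sanity check that insists on a non-NULL callback would reject a chip whose table entry leaves it unset, which is why such a check has to be relaxed.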
Remove calling priv->info->mac_port_config() from the sanity check as the
sanity check requires a pointer to a mac_port_config function to be
non-NULL. This will fail for MT7988 as mac_port_config won't be a member of
its info table.
Co-developed-by: Daniel Golle <daniel@makrotopia.org> Signed-off-by: Daniel Golle <daniel@makrotopia.org> Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
====================
remove page frag implementation in vhost_net
Currently there are three implementations for page frag:
1. mm/page_alloc.c: net stack seems to be using it in the
rx part with 'struct page_frag_cache' and the main API
being page_frag_alloc_align().
2. net/core/sock.c: net stack seems to be using it in the
tx part with 'struct page_frag' and the main API being
skb_page_frag_refill().
3. drivers/vhost/net.c: vhost seems to be using it to build
xdp frame, and its implementation seems to be a mix of
the above two.
This patchset tries to unify the page frag implementation a
little bit by unifying gfp bit for order 3 page allocation
and replacing page frag implementation in vhost.c with the
one in page_alloc.c.
After this patchset, we are not only able to unify the page
frag implementation a little, but also able to have about
0.5% performance boost testing by using the vhost_net_test
introduced in the last patch.
Before this patchset:
Performance counter stats for './vhost_net_test' (10 runs):
174.764 +- 0.214 seconds time elapsed ( +- 0.12% )
Changelog:
V6: Add timeout for poll() and simplify some logic as suggested
by Jason.
V5: Address the comment from jason in vhost_net_test.c and the
comment about leaving out the gfp change for page frag in
sock.c as suggested by Paolo.
V4: Resend based on latest net-next branch.
V3:
1. Add __page_frag_alloc_align() which is passed with the align mask
the original function expected as suggested by Alexander.
2. Drop patch 3 in v2 suggested by Alexander.
3. Reorder patch 4 & 5 in v2 suggested by Alexander.
Note that placing this gfp flags handling for order 3 page in an inline
function is not considered, as we may be able to unify the page_frag
and page_frag_cache handling.
V2: Change 'xor'd' to 'masked off', add vhost tx testing for
vhost_net_test.
V1: Fix some typo, drop RFC tag and rebase on latest net-next.
====================
Yunsheng Lin [Wed, 28 Feb 2024 09:30:12 +0000 (17:30 +0800)]
tools: virtio: introduce vhost_net_test
Introduce vhost_net_test, based on virtio_test, to test vhost_net
tx and rx changes in the kernel.
Steps for vhost_net tx testing:
1. Prepare an out buf.
2. Kick the vhost_net to do tx processing.
3. Do the receiving in the tun side.
4. Verify the data received by tun is correct.
Steps for vhost_net rx testing:
1. Prepare an in buf.
2. Do the sending in the tun side.
3. Kick the vhost_net to do rx processing.
4. Verify the data received by vhost_net is correct.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Yunsheng Lin [Wed, 28 Feb 2024 09:30:11 +0000 (17:30 +0800)]
vhost/net: remove vhost_net_page_frag_refill()
The page frag in vhost_net_page_frag_refill() uses the
'struct page_frag' from skb_page_frag_refill(), but its
implementation is similar to page_frag_alloc_align() now.
This patch removes vhost_net_page_frag_refill() by using
'struct page_frag_cache' instead of 'struct page_frag',
and allocating frag using page_frag_alloc_align().
The added benefit is not only a slightly more unified page frag
implementation, but also a performance boost of about 0.5% when
tested with the vhost_net_test introduced in the last patch.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Yunsheng Lin [Wed, 28 Feb 2024 09:30:10 +0000 (17:30 +0800)]
net: introduce page_frag_cache_drain()
When draining a page_frag_cache, most users perform
similar steps, so introduce an API to avoid code
duplication.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Acked-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Yunsheng Lin [Wed, 28 Feb 2024 09:30:09 +0000 (17:30 +0800)]
page_frag: unify gfp bits for order 3 page allocation
Currently there seem to be three page frag implementations,
which all try to allocate an order 3 page and, if that fails,
fall back to allocating an order 0 page. Each of them
allows the order 3 page allocation to fail under certain
conditions by using specific gfp bits.
The gfp bits for order 3 page allocation differ between
implementations: __GFP_NOMEMALLOC is or'd in to forbid
access to emergency memory reserves in
__page_frag_cache_refill(), but it is not or'd in the other
implementations; __GFP_DIRECT_RECLAIM is masked off to avoid
direct reclaim in vhost_net_page_frag_refill(), but it is
not masked off in __page_frag_cache_refill().
This patch unifies the gfp bits used by the different
implementations by or'ing in __GFP_NOMEMALLOC and masking off
__GFP_DIRECT_RECLAIM for order 3 page allocation, to avoid
possible pressure on mm.
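
A rough sketch of the flag adjustment for the order 3 attempt is shown below. The EX_GFP_* constants are hypothetical placeholder bit values chosen for this illustration; the real __GFP_* flags are defined by the kernel and have different values, and the real code may set additional bits.

```c
#include <assert.h>

/* Hypothetical placeholder bits, NOT the kernel's real gfp values. */
#define EX_GFP_DIRECT_RECLAIM	0x01u
#define EX_GFP_NOMEMALLOC	0x02u
#define EX_GFP_OTHER		0x04u	/* stands for any unrelated flag */

static_assert((EX_GFP_DIRECT_RECLAIM & EX_GFP_NOMEMALLOC) == 0,
	      "flags must be distinct bits");

/* Derive the gfp bits used for the opportunistic order 3 allocation. */
static unsigned int ex_order3_gfp(unsigned int gfp)
{
	/*
	 * Mask off direct reclaim so the high-order attempt fails fast,
	 * and or in NOMEMALLOC so it never touches emergency reserves.
	 * Unrelated flags pass through unchanged.
	 */
	return (gfp & ~EX_GFP_DIRECT_RECLAIM) | EX_GFP_NOMEMALLOC;
}
```

The caller's original gfp value would still be used for the order 0 fallback, so only the high-order attempt is made cheaper to fail.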
Leave the gfp unifying for page frag implementation in sock.c
for now as suggested by Paolo Abeni.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> CC: Alexander Duyck <alexander.duyck@gmail.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Yunsheng Lin [Wed, 28 Feb 2024 09:30:08 +0000 (17:30 +0800)]
mm/page_alloc: modify page_frag_alloc_align() to accept align as an argument
napi_alloc_frag_align() and netdev_alloc_frag_align() accept
align as an argument, and they are thin wrappers around the
__napi_alloc_frag_align() and __netdev_alloc_frag_align() APIs
doing the alignment checking and align mask conversion, in order
to call page_frag_alloc_align() directly. The intention here is
to keep the alignment checking and the alignmask conversion in an
inline wrapper to avoid those kinds of operations at execution
time, since they can usually be handled at compile time.
We are going to use page_frag_alloc_align() in vhost_net.c, and it
needs the same kind of alignment checking and alignmask conversion,
so split up page_frag_alloc_align() into an inline wrapper doing the
above operations, and add __page_frag_alloc_align(), which is passed
the align mask the original function expected, as suggested by
Alexander.
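
The wrapper/worker split can be illustrated with a small userspace sketch. ex_offset_masked() and ex_offset_align() are hypothetical names that only model the alignment arithmetic on an offset, not the actual page frag allocation.

```c
#include <assert.h>

/* Worker: expects an already converted mask, e.g. ~(64 - 1) for 64. */
static unsigned int ex_offset_masked(unsigned int offset,
				     unsigned int align_mask)
{
	return offset & align_mask;	/* round down to the alignment */
}

/*
 * Inline wrapper: checks that align is a power of two and converts it
 * to a mask. With a constant align, a compiler can typically fold both
 * steps away at compile time.
 */
static inline unsigned int ex_offset_align(unsigned int offset,
					   unsigned int align)
{
	assert(align != 0 && (align & (align - 1u)) == 0);
	return ex_offset_masked(offset, ~(align - 1u));
}
```

Callers that already hold a mask can use the worker directly, while callers that think in terms of an alignment value go through the wrapper.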
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> CC: Alexander Duyck <alexander.duyck@gmail.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jiawen Wu [Fri, 1 Mar 2024 09:29:56 +0000 (17:29 +0800)]
net: txgbe: fix to clear interrupt status after handling IRQ
GPIO EOI is not set to clear interrupt status after handling the
interrupt. It should be done in irq_chip->irq_ack, but this function
is not called in handle_nested_irq(). So execute
txgbe_gpio_irq_ack() manually in txgbe_gpio_irq_handler().
Jiawen Wu [Fri, 1 Mar 2024 09:29:55 +0000 (17:29 +0800)]
net: txgbe: fix GPIO interrupt blocking
The GPIO interrupt status register is masked before the MAC IRQ
is enabled. This is because of a hardware deficiency. So manually
clear the interrupt status before using the interrupts. Otherwise,
GPIO interrupts will never be reported again. The workaround for
clearing interrupts is to set GPIO EOI in txgbe_up_complete().
Vitaly Lifshits [Fri, 1 Mar 2024 18:48:05 +0000 (10:48 -0800)]
e1000e: Minor flow correction in e1000_shutdown function
Add curly braces to avoid entering an if statement where it is not
always required in the e1000_shutdown function.
This improves code readability and might prevent non-deterministic
behaviour in the future.
Arnd Bergmann [Fri, 1 Mar 2024 18:48:04 +0000 (10:48 -0800)]
igc: fix LEDS_CLASS dependency
When IGC is built-in but LEDS_CLASS is a loadable module, there is
a link failure:
x86_64-linux-ld: drivers/net/ethernet/intel/igc/igc_leds.o: in function `igc_led_setup':
igc_leds.c:(.text+0x75c): undefined reference to `devm_led_classdev_register_ext'
Add another dependency that prevents this combination.
Fixes: ea578703b03d ("igc: Add support for LEDs on i225/i226") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://lore.kernel.org/r/20240301184806.2634508-4-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Added support for 1000BASE-BX, i.e. Gigabit Ethernet over single strand
of single-mode fiber.
The initialization of a 1000BASE-BX SFP is the same as 1000BASE-SX/LX
with the only difference that the Bit Rate Nominal Value must be
checked to make sure it is a Gigabit Ethernet transceiver, as described
by the SFF-8472 specification.
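
As a hedged illustration of the check described above, the sketch below combines the BASE-BX10 compliance bit with the nominal bit rate. All names are hypothetical, and the accepted window of 1200 to 1300 MBd is an assumption made for this example; the driver's actual constants and comparison may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; 0x40 is the compliance bit that ethtool decodes
 * as BASE-BX10 in the module dump that accompanies this change. */
#define EX_SFP_1G_BASEBX10	0x40u
#define EX_BR_UNIT_MBD		100u	/* BR nominal is stored in 100 MBd units */
#define EX_BR_MIN_MBD		1200u	/* assumed lower bound */
#define EX_BR_MAX_MBD		1300u	/* assumed upper bound */

static_assert(EX_BR_MIN_MBD <= 1250u && 1250u <= EX_BR_MAX_MBD,
	      "window must contain the 1.25 GBd Gigabit Ethernet line rate");

/* True if the module advertises BASE-BX10 at a Gigabit Ethernet rate. */
static bool ex_is_1000base_bx(uint8_t comp_codes_1g, uint8_t br_nominal_raw)
{
	unsigned int br_mbd = br_nominal_raw * EX_BR_UNIT_MBD;

	if (!(comp_codes_1g & EX_SFP_1G_BASEBX10))
		return false;

	return br_mbd >= EX_BR_MIN_MBD && br_mbd <= EX_BR_MAX_MBD;
}
```

The rate check matters because the BASE-BX10 bit alone does not say that the module runs at Gigabit Ethernet speed.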
This was tested with the FS.com SFP-GE-BX 1310/1490nm 10km transceiver:
$ ethtool -m eth4
Identifier : 0x03 (SFP)
Extended identifier : 0x04 (GBIC/SFP defined by 2-wire interface ID)
Connector : 0x07 (LC)
Transceiver codes : 0x00 0x00 0x00 0x40 0x00 0x00 0x00 0x00 0x00
Transceiver type : Ethernet: BASE-BX10
Encoding : 0x01 (8B/10B)
BR, Nominal : 1300MBd
Rate identifier : 0x00 (unspecified)
Length (SMF,km) : 10km
Length (SMF) : 10000m
Length (50um) : 0m
Length (62.5um) : 0m
Length (Copper) : 0m
Length (OM3) : 0m
Laser wavelength : 1310nm
Vendor name : FS
Vendor OUI : 64:9d:99
Vendor PN : SFP-GE-BX
Vendor rev :
Option values : 0x20 0x0a
Option : RX_LOS implemented
Option : TX_FAULT implemented
Option : Power level 3 requirement
BR margin, max : 0%
BR margin, min : 0%
Vendor SN : S2202359108
Date code : 220307
Optical diagnostics support : Yes
Laser bias current : 17.650 mA
Laser output power : 0.2132 mW / -6.71 dBm
Receiver signal average optical power : 0.2740 mW / -5.62 dBm
Module temperature : 47.30 degrees C / 117.13 degrees F
Module voltage : 3.2576 V
Alarm/warning flags implemented : Yes
Laser bias current high alarm : Off
Laser bias current low alarm : Off
Laser bias current high warning : Off
Laser bias current low warning : Off
Laser output power high alarm : Off
Laser output power low alarm : Off
Laser output power high warning : Off
Laser output power low warning : Off
Module temperature high alarm : Off
Module temperature low alarm : Off
Module temperature high warning : Off
Module temperature low warning : Off
Module voltage high alarm : Off
Module voltage low alarm : Off
Module voltage high warning : Off
Module voltage low warning : Off
Laser rx power high alarm : Off
Laser rx power low alarm : Off
Laser rx power high warning : Off
Laser rx power low warning : Off
Laser bias current high alarm threshold : 110.000 mA
Laser bias current low alarm threshold : 1.000 mA
Laser bias current high warning threshold : 100.000 mA
Laser bias current low warning threshold : 1.000 mA
Laser output power high alarm threshold : 0.7079 mW / -1.50 dBm
Laser output power low alarm threshold : 0.0891 mW / -10.50 dBm
Laser output power high warning threshold : 0.6310 mW / -2.00 dBm
Laser output power low warning threshold : 0.1000 mW / -10.00 dBm
Module temperature high alarm threshold : 90.00 degrees C / 194.00 degrees F
Module temperature low alarm threshold : -45.00 degrees C / -49.00 degrees F
Module temperature high warning threshold : 85.00 degrees C / 185.00 degrees F
Module temperature low warning threshold : -40.00 degrees C / -40.00 degrees F
Module voltage high alarm threshold : 3.7950 V
Module voltage low alarm threshold : 2.8050 V
Module voltage high warning threshold : 3.4650 V
Module voltage low warning threshold : 3.1350 V
Laser rx power high alarm threshold : 0.7079 mW / -1.50 dBm
Laser rx power low alarm threshold : 0.0028 mW / -25.53 dBm
Laser rx power high warning threshold : 0.6310 mW / -2.00 dBm
Laser rx power low warning threshold : 0.0032 mW / -24.95 dBm
Signed-off-by: Ernesto Castellotti <ernesto@castellotti.net> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://lore.kernel.org/r/20240301184806.2634508-3-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jon Maxwell [Fri, 1 Mar 2024 18:48:02 +0000 (10:48 -0800)]
intel: make module parameters readable in sys filesystem
Linux users sometimes need an easy way to check current values of module
parameters. For example the module may be manually reloaded with different
parameters. Make these visible and readable in the /sys filesystem to allow
that. But don't make the "debug" module parameter visible as debugging is
enabled via ethtool msglvl.
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://lore.kernel.org/r/20240301184806.2634508-2-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pedro Tammela [Thu, 29 Feb 2024 14:38:25 +0000 (11:38 -0300)]
selftests/tc-testing: require an up to date iproute2 for blockcast tests
Add the dependsOn test check for all the mirred blockcast tests.
It will prevent the issue reported by LKFT which happens when an older
iproute2 is used to run the current tdc.
selftests: net: Correct couple of spelling mistakes
Changes:
- "excercise" is corrected to "exercise" in drivers/net/mlxsw/spectrum-2/tc_flower.sh
- "mutliple" is corrected to "multiple" in drivers/net/netdevsim/ethtool-fec.sh
Alan Brady [Thu, 22 Feb 2024 19:04:41 +0000 (11:04 -0800)]
idpf: remove dealloc vector msg err in idpf_intr_rel
This error message is at best not really helpful and at worst
misleading. If we're here in idpf_intr_rel we're likely trying to do
remove or reset. If we're in reset, this message will fail because we
lose the virtchnl on reset and HW is going to clean up those resources
regardless in that case. If we're in remove and we get an error here,
we're going to reset the device at the end of remove anyway so not a big
deal. Just remove this message; it's not useful.
Tested-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Alan Brady <alan.brady@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Alan Brady [Thu, 22 Feb 2024 19:04:40 +0000 (11:04 -0800)]
idpf: fix minor controlq issues
While we're here improving virtchnl we can include two minor fixes for
the lower level ctrlq flow.
This adds a memory barrier to idpf_post_rx_buffs before we update tail
on the controlq. We should make sure our writes have had a chance to
finish before we tell HW it can touch them.
This also removes some defensive programming in idpf_ctrlq_recv. The
caller should not be using a num_q_msg value of zero or more than the
ring size and it's their responsibility to call functions sanely.
Tested-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Alan Brady <alan.brady@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Alan Brady [Thu, 22 Feb 2024 19:04:39 +0000 (11:04 -0800)]
idpf: prevent deinit uninitialized virtchnl core
In idpf_remove we need to tear down the virtchnl core with
idpf_vc_core_deinit so we can free up resources and leave things in a
good state. However, in the case where we failed to establish VC
communications we may not have ever actually successfully initialized
the virtchnl core.
This fixes it by setting a bit once we successfully init the virtchnl
core. Then, in deinit, we'll check for it before going on further,
otherwise we just return. Also clear the bit at the end of deinit so we
know it's gone now.
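
A minimal sketch of this guard is shown below, assuming a simple boolean in place of the driver's flag bit. struct ex_adapter and both functions are hypothetical stand-ins for the adapter and for the virtchnl core init and deinit paths.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical adapter state for illustration. */
struct ex_adapter {
	bool vc_core_initialized;	/* set only after a successful init */
	int teardowns;			/* counts real teardown passes */
};

/* Returns 0 on success; on failure the flag stays clear. */
static int ex_vc_core_init(struct ex_adapter *a, bool comms_ok)
{
	assert(a != NULL);

	if (!comms_ok)
		return -1;

	a->vc_core_initialized = true;
	return 0;
}

static void ex_vc_core_deinit(struct ex_adapter *a)
{
	assert(a != NULL);

	if (!a->vc_core_initialized)
		return;		/* never initialized, nothing to free */

	a->teardowns++;		/* stand-in for releasing resources */
	a->vc_core_initialized = false;	/* a repeated deinit is a no-op */
}
```

Clearing the flag at the end also makes a second deinit call harmless, which keeps the remove path simple.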
Tested-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Alan Brady <alan.brady@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Alan Brady [Thu, 22 Feb 2024 19:04:37 +0000 (11:04 -0800)]
idpf: refactor idpf_recv_mb_msg
Now that all the messages are using the transaction API, we can rework
idpf_recv_mb_msg quite a lot to simplify it. Due to this, we remove
idpf_find_vport as no longer used and alter idpf_recv_event_msg
slightly.
Tested-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Alan Brady <alan.brady@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Alan Brady [Thu, 22 Feb 2024 19:04:36 +0000 (11:04 -0800)]
idpf: add async_handler for MAC filter messages
There are situations where the driver needs to add a MAC filter but
we're explicitly not allowed to sleep, so we cannot wait for a virtchnl
message to complete.
This adds an async_handler for asynchronously sent messages for MAC
filters so that we can better handle if there's an error of some kind.
On success we don't need to do anything else, but if we failed to
program the new filter we really should remove it from our list of MAC
filters. If we don't remove bad filters, what I expect to happen is
that after a reset of some kind we try to program the MAC filter again
and it fails again. This is clearly wrong and I would expect it to be
confusing for the user.
It could also be the failure is for a delete MAC filter message but
those filters get deleted regardless. Not much we can do about a delete
failure.
Tested-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Alan Brady <alan.brady@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Alan Brady [Thu, 22 Feb 2024 19:04:35 +0000 (11:04 -0800)]
idpf: refactor remaining virtchnl messages
This takes care of RSS/SRIOV/MAC and other misc virtchnl messages. This
again is mostly mechanical.
In absence of an async_handler for MAC filters, this will simply
generically report any errors from idpf_vc_xn_forward_async. This
maintains the existing behavior. Follow up patch will add an async
handler for MAC filters to remove bad filters from our list.
While we're here we can also make the code much nicer by converting some
variables to auto-variables where appropriate. This makes it cleaner and
less prone to memory leaking.
There's still a bit more cleanup we can do here to remove stuff that's
not being used anymore now; follow-up patches will take care of loose
ends.
Tested-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Alan Brady <alan.brady@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Alan Brady [Thu, 22 Feb 2024 19:04:34 +0000 (11:04 -0800)]
idpf: refactor queue related virtchnl messages
This reworks queue specific virtchnl messages to use the added
transaction API. It is fairly mechanical and generally makes the
functions using it simpler. Functions using the transaction API no
longer need to take the vc_buf_lock since they no longer use it.
After filling out an idpf_vc_xn_params struct, idpf_vc_xn_exec takes
care of the send and recv handling.
This also converts those functions where appropriate to use
auto-variables instead of manually calling kfree. This greatly
simplifies the memory alloc paths and makes them less prone to memory
leaks.
Tested-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Igor Bagnucki <igor.bagnucki@intel.com> Signed-off-by: Alan Brady <alan.brady@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Alan Brady [Thu, 22 Feb 2024 19:04:33 +0000 (11:04 -0800)]
idpf: refactor vport virtchnl messages
This reworks the way vport related virtchnl messages work to take
advantage of the added transaction API. It is fairly mechanical as, to
use the transaction API, the function just needs to fill out an
appropriate idpf_vc_xn_params struct to pass to idpf_vc_xn_exec which
will take care of the actual send and recv.
Tested-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Igor Bagnucki <igor.bagnucki@intel.com> Co-developed-by: Joshua Hay <joshua.a.hay@intel.com> Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> Signed-off-by: Alan Brady <alan.brady@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Alan Brady [Thu, 22 Feb 2024 19:04:32 +0000 (11:04 -0800)]
idpf: implement virtchnl transaction manager
This starts refactoring how virtchnl messages are handled by adding a
transaction manager (idpf_vc_xn_manager).
There are two primary motivations here which are to enable handling of
multiple messages at once and to make it more robust in general. As it
is right now, the driver may only have one pending message at a time and
there's no guarantee that the response we receive was actually intended
for the message we sent prior.
This works by utilizing a "cookie" field of the message descriptor. It
is arbitrary what data we put in the cookie and the response is required
to have the same cookie the original message was sent with. We then use a
"transaction" abstraction built on the completion API to pair each
response with the message it belongs to.
The cookie works such that the first half is the index to the
transaction in our array, and the second half is a "salt" that gets
incremented every message. This enables quick lookups into the array and
also ensures we have the correct message. The salt is necessary because
after, for example, a message times out and we deem the response was
lost for some reason, we could theoretically reuse the same index but
using a different salt ensures that when we do actually get a response
it's not the old message that timed out previously finally coming in.
Since the number of transactions allocated is U8_MAX and the salt is 8
bits, we can never have a conflict because we can't roll over the salt
without using more transactions than we have available.
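
The cookie layout can be sketched as below. The commit text only says that one half is the index and the other half is the salt; placing the index in the low 8 bits is an assumption of this sketch, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: index in the low byte, salt in the high byte. */
#define EX_XN_IDX_MASK		0x00FFu
#define EX_XN_SALT_SHIFT	8

static_assert((EX_XN_IDX_MASK >> EX_XN_SALT_SHIFT) == 0,
	      "index and salt fields must not overlap");

/* Hypothetical transaction slot: its array index and current salt. */
struct ex_xn {
	uint8_t idx;
	uint8_t salt;
};

static uint16_t ex_cookie_make(uint8_t idx, uint8_t salt)
{
	return (uint16_t)(((unsigned int)salt << EX_XN_SALT_SHIFT) | idx);
}

static uint8_t ex_cookie_idx(uint16_t cookie)
{
	return (uint8_t)(cookie & EX_XN_IDX_MASK);
}

static uint8_t ex_cookie_salt(uint16_t cookie)
{
	return (uint8_t)(cookie >> EX_XN_SALT_SHIFT);
}

/* A reply is accepted only if both index and salt match the slot. */
static bool ex_reply_matches(const struct ex_xn *xn, uint16_t cookie)
{
	return ex_cookie_idx(cookie) == xn->idx &&
	       ex_cookie_salt(cookie) == xn->salt;
}
```

A late reply that still carries the salt of a timed-out message fails the comparison once the slot has been reused with a newer salt.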
This starts by only converting the VIRTCHNL2_OP_VERSION message to use
this new transaction API. Follow up patches will convert all virtchnl
messages to use the API.
Tested-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Igor Bagnucki <igor.bagnucki@intel.com> Co-developed-by: Joshua Hay <joshua.a.hay@intel.com> Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> Signed-off-by: Alan Brady <alan.brady@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Alan Brady [Thu, 22 Feb 2024 19:04:31 +0000 (11:04 -0800)]
idpf: add idpf_virtchnl.h
idpf.h is quite heavy. We can reduce the burden a fair bit by
introducing an idpf_virtchnl.h file. This mostly just moves function
declarations but there are many of them. This also makes an attempt to
group those declarations in a way that makes some sense instead of
leaving them mishmashed.
Suggested-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Alan Brady <alan.brady@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Geliang Tang [Fri, 1 Mar 2024 18:18:39 +0000 (19:18 +0100)]
selftests: mptcp: userspace pm get addr tests
This patch adds a new helper userspace_pm_get_addr() in mptcp_join.sh.
In it, parse the token value from the output of 'pm_nl_ctl events', then
pass it to pm_nl_ctl get_addr command. Use this helper in userspace pm
dump tests.
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Geliang Tang [Fri, 1 Mar 2024 18:18:37 +0000 (19:18 +0100)]
mptcp: get addr in userspace pm list
This patch renames mptcp_pm_nl_get_addr_doit() to mptcp_pm_nl_get_addr(),
a dedicated in-kernel netlink PM get addr function, and invokes a new
wrapper mptcp_pm_get_addr() in mptcp_pm_nl_get_addr_doit().
If a token is gotten in the wrapper, that means a userspace PM is used.
So invoke mptcp_userspace_pm_get_addr() to get addr in userspace PM list.
Otherwise, invoke mptcp_pm_nl_get_addr().
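
The dispatch rule in the wrapper can be sketched as follows. struct ex_request and ex_pm_select() are hypothetical; they only model the decision and do not parse netlink attributes or call the real MPTCP helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Which path manager should serve the request. */
enum ex_pm {
	EX_PM_KERNEL,		/* in-kernel netlink PM */
	EX_PM_USERSPACE,	/* userspace PM, selected by a token */
};

/* Hypothetical, already parsed get-addr request. */
struct ex_request {
	bool has_token;		/* was a token attribute supplied? */
	uint32_t token;		/* identifies the MPTCP connection */
	uint8_t id;		/* address id to look up */
};

/* A supplied token means the userspace PM list must be consulted. */
static enum ex_pm ex_pm_select(const struct ex_request *req)
{
	assert(req != NULL);

	return req->has_token ? EX_PM_USERSPACE : EX_PM_KERNEL;
}
```

Keeping the decision in one wrapper lets the netlink entry point stay unchanged for callers that never pass a token.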
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Geliang Tang [Fri, 1 Mar 2024 18:18:36 +0000 (19:18 +0100)]
mptcp: implement mptcp_userspace_pm_get_addr
This patch implements mptcp_userspace_pm_get_addr() to get an address
from userspace pm address list according to the given 'token' and 'id'.
Use nla_get_u32() to get the u32 value of 'token', then pass it to
mptcp_token_get_sock() to get the msk. Pass 'msk' and 'id' to the helper
mptcp_userspace_pm_lookup_addr_by_id() to get the address entry. Put
this entry to userspace using mptcp_pm_nl_put_entry_info().
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Geliang Tang [Fri, 1 Mar 2024 18:18:35 +0000 (19:18 +0100)]
mptcp: add userspace_pm_lookup_addr_by_id helper
Corresponding to the __lookup_addr_by_id() helper in the in-kernel
netlink PM, this patch adds a new helper
mptcp_userspace_pm_lookup_addr_by_id() to look up the address entry
with the given id on the userspace pm local address list.
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Geliang Tang [Fri, 1 Mar 2024 18:18:34 +0000 (19:18 +0100)]
selftests: mptcp: dump userspace addrs list
This patch adds a new helper userspace_pm_dump() to dump addresses
for the userspace PM. Use this helper to check whether an ID 0 subflow
is listed in the output of dump command after creating an ID 0 subflow
in "userspace pm create id 0 subflow" test. Dump userspace PM addresses
list in "userspace pm add & remove address" test and in "userspace pm
create destroy subflow" test.
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Geliang Tang [Fri, 1 Mar 2024 18:18:30 +0000 (19:18 +0100)]
mptcp: check userspace pm flags
Just like MPTCP_PM_ADDR_FLAG_SIGNAL flag is checked in userspace PM
announce mptcp_pm_nl_announce_doit(), PM flags should be checked in
mptcp_pm_nl_subflow_create_doit() too.
If MPTCP_PM_ADDR_FLAG_SUBFLOW flag is not set, there's no flags field
in the output of dump_addr. This looks a bit strange:
id 10 flags 10.0.3.2
This patch uses mptcp_pm_parse_entry() instead of mptcp_pm_parse_addr()
to get the PM flags of the entry and check it. MPTCP_PM_ADDR_FLAG_SIGNAL
flag shouldn't be set here, and if MPTCP_PM_ADDR_FLAG_SUBFLOW flag is
missing from the netlink attribute, always set this flag.
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Geliang Tang [Fri, 1 Mar 2024 18:18:29 +0000 (19:18 +0100)]
mptcp: dump addrs in userspace pm list
This patch renames mptcp_pm_nl_get_addr_dumpit() to mptcp_pm_nl_dump_addr(),
a dedicated in-kernel netlink PM dump addrs function, and invokes a newly
added wrapper mptcp_pm_dump_addr() in mptcp_pm_nl_get_addr_dumpit().
Invoke in-kernel PM dump addrs function mptcp_pm_nl_dump_addr() or
userspace PM dump addrs function mptcp_userspace_pm_dump_addr() based on
whether the token parameter is passed in or not in the wrapper.
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Geliang Tang [Fri, 1 Mar 2024 18:18:28 +0000 (19:18 +0100)]
mptcp: add token for get-addr in yaml
This patch adds token parameter together with addr in get-addr section in
mptcp_pm.yaml, then use the following commands to update mptcp_pm_gen.c
and mptcp_pm_gen.h:
Geliang Tang [Fri, 1 Mar 2024 18:18:27 +0000 (19:18 +0100)]
mptcp: implement mptcp_userspace_pm_dump_addr
This patch implements mptcp_userspace_pm_dump_addr() to dump addresses
from userspace pm address list. Use mptcp_token_get_sock() to get the
msk from the given token, if userspace PM is enabled in it, traverse
each address entry in address list, put every entry to userspace using
mptcp_pm_nl_put_entry_msg().
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Geliang Tang [Fri, 1 Mar 2024 18:18:25 +0000 (19:18 +0100)]
mptcp: make pm_remove_addrs_and_subflows static
mptcp_pm_remove_addrs_and_subflows() is only used in pm_netlink.c; it's
no longer used in pm_userspace.c since commit 8b1c94da1e48
("mptcp: only send RM_ADDR in nl_cmd_remove"). So this patch changes it
to a static function.
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
This version of this patch series fixes the bugs in the first patch
(which were fixed in the second), where ipa_interrupt_config() had
two remaining spots that returned a pointer rather than an integer.
Outside of initialization, all uses of the platform device pointer
stored in the IPA structure determine the address of the device
structure embedded within the platform device structure.
By changing some of the initialization functions to take a platform
device as argument we can simplify getting at the device structure
address by storing it (instead of the platform device pointer) in
the IPA structure.
The first two patches split the interrupt initialization code into
two parts--one done earlier than before. The next four patches
update some initialization functions to take a platform device
pointer as argument. And the last patch replaces the platform
device pointer with a device pointer, and converts all remaining
references to the &ipa->pdev->dev to use ipa->dev.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Fri, 1 Mar 2024 17:02:42 +0000 (11:02 -0600)]
net: ipa: don't save the platform device
The IPA platform device is now only used as the structure containing
the IPA device structure. Replace the platform device pointer with
a pointer to the device structure.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Fri, 1 Mar 2024 17:02:41 +0000 (11:02 -0600)]
net: ipa: pass a platform device to ipa_smp2p_init()
Rather than using the platform device pointer field in the IPA
pointer, pass a platform device pointer to ipa_smp2p_init(). Use
that pointer throughout that function.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>