net: mana: Enable RX path to handle various MTU sizes
Update RX data path to allocate and use RX queue DMA buffers with
proper size based on potentially various MTU sizes.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: mana: Refactor RX buffer allocation code to prepare for various MTU
Move out common buffer allocation code from mana_process_rx_cqe() and
mana_alloc_rx_wqe() to helper functions.
Refactor related variables so they can be changed in one place, and buffer
sizes are in sync.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Use napi_build_skb() instead of build_skb() to take advantage of the
NAPI percpu caches to obtain skbuff_head.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Fri, 14 Apr 2023 05:28:03 +0000 (22:28 -0700)]
Merge tag 'mlx5-updates-2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2023-04-11
1) Vlad adds the support for linux bridge multicast offload support
Patches #1 through #9
Synopsis
Vlad Says:
==============
Implement support of bridge multicast offload in mlx5. Handle port object
attribute SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED notification to toggle multicast
offload and bridge snooping support on bridge. Handle port object
SWITCHDEV_OBJ_ID_PORT_MDB notification to attach a bridge port to MDB.
Steering architecture
Existing offload infrastructure relies on two levels of flow tables - bridge
ingress and egress. For multicast offload the architecture is extended with
additional layer of per-port multicast replication tables. Such tables filter
loopback traffic (so packets are not replicated to their source port) and pop
VLAN headers for "untagged" VLANs. The tables are referenced by the MDB rules in
egress table. MDB egress rule can point to multiple per-port multicast tables,
which causes matching multicast traffic to be replicated to all of them, and,
consecutively, to several bridge ports:
- Patch 1 adds hardware definition bits for capabilities required to replicate
multicast packets to multiple per-port tables. These bits are used by
following patches to only attempt multicast offload if firmware and hardware
provide necessary support.
- Pathces 2-4 patches are preparations and refactoring.
- Patch 5 implements necessary infrastructure to toggle multicast offload
via SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED port object attribute notification.
This also enabled IGMP and MLD snooping.
- Patch 6 implements per-port multicast replication tables. It only supports
filtering of loopback packets.
- Patch 7 extends per-port multicast tables with VLAN pop support for 'untagged'
VLANs.
- Patch 8 handles SWITCHDEV_OBJ_ID_PORT_MDB port object notifications. It
creates MDB replication rules in egress table that can replicate packets to
multiple per-port multicast tables.
- Patch 9 adds tracepoints for MDB events.
==============
2) Parav Create a new allocation profile for SFs, to save on memory
3) Yevgeny provides some initial patches for upcoming software steering
support new pattern/arguments type of modify_header actions.
Starting with ConnectX-6 DX, we use a new design of modify_header FW object.
The current modify_header object allows for having only limited number of
these FW objects, which means that we are limited in the number of offloaded
flows that require modify_header action.
As a preparation Yevgeny provides the following 4 patches:
- Patch 1: Add required mlx5_ifc HW bits
- Patch 2, 3: Add new WQE type and opcode that is required for pattern/arg
support and adds appropriate support in dr_send.c
- Patch 4: Add ICM pool for modify-header-pattern objects and implement
patterns cache, allowing patterns reuse for different flows
* tag 'mlx5-updates-2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5: DR, Add modify-header-pattern ICM pool
net/mlx5: DR, Prepare sending new WQE type
net/mlx5: Add new WQE for updating flow table
net/mlx5: Add mlx5_ifc bits for modify header argument
net/mlx5: DR, Set counter ID on the last STE for STEv1 TX
net/mlx5: Create a new profile for SFs
net/mlx5: Bridge, add tracepoints for multicast
net/mlx5: Bridge, implement mdb offload
net/mlx5: Bridge, support multicast VLAN pop
net/mlx5: Bridge, add per-port multicast replication tables
net/mlx5: Bridge, snoop igmp/mld packets
net/mlx5: Bridge, extract code to lookup parent bridge of port
net/mlx5: Bridge, move additional data structures to priv header
net/mlx5: Bridge, increase bridge tables sizes
net/mlx5: Add mlx5_ifc definitions for bridge multicast support
====================
====================
Add kernel tc-mqprio and tc-taprio support for preemptible traffic classes
The last RFC in August 2022 contained a proposal for the UAPI of both
TSN standards which together form Frame Preemption (802.1Q and 802.3):
https://lore.kernel.org/netdev/20220816222920.1952936-1-vladimir.oltean@nxp.com/
It wasn't clear at the time whether the 802.1Q portion of Frame Preemption
should be exposed via the tc qdisc (mqprio, taprio) or via some other
layer (perhaps also ethtool like the 802.3 portion, or dcbnl), even
though the options were discussed extensively, with pros and cons:
https://lore.kernel.org/netdev/20220816222920.1952936-3-vladimir.oltean@nxp.com/
So the 802.3 portion got submitted separately and finally was accepted:
https://lore.kernel.org/netdev/20230119122705.73054-1-vladimir.oltean@nxp.com/
leaving the only remaining question: how do we expose the 802.1Q bits?
This series proposes that we use the Qdisc layer, through separate
(albeit very similar) UAPI in mqprio and taprio, and that both these
Qdiscs pass the information down to the offloading device driver through
the common mqprio offload structure (which taprio also passes).
An implementation is provided for the NXP LS1028A on-board Ethernet
endpoint (enetc). Previous versions also contained support for its
embedded switch (felix), but this needs more work and will be submitted
separately.
Vladimir Oltean [Tue, 11 Apr 2023 18:01:57 +0000 (21:01 +0300)]
net: enetc: add support for preemptible traffic classes
PFs which support the MAC Merge layer also have a set of 8 registers
called "Port traffic class N frame preemption register (PTC0FPR - PTC7FPR)".
Through these, a traffic class (group of TX rings of same dequeue
priority) can be mapped to the eMAC or to the pMAC.
There's nothing particularly spectacular here. We should probably only
commit the preemptible TCs to hardware once the MAC Merge layer became
active, but unlike Felix, we don't have an IRQ that notifies us of that.
We'd have to sleep for up to verifyTime (127 ms) to wait for a
resolution coming from the verification state machine; not only from the
ndo_setup_tc() code path, but also from enetc_mm_link_state_update().
Since it's relatively complicated and has a relatively small benefit,
I'm not doing it.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Tue, 11 Apr 2023 18:01:56 +0000 (21:01 +0300)]
net: enetc: rename "mqprio" to "qopt"
To gain access to the larger encapsulating structure which has the type
tc_mqprio_qopt_offload, rename just the "qopt" field as "qopt".
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Tue, 11 Apr 2023 18:01:55 +0000 (21:01 +0300)]
net/sched: taprio: allow per-TC user input of FP adminStatus
This is a duplication of the FP adminStatus logic introduced for
tc-mqprio. Offloading is done through the tc_mqprio_qopt_offload
structure embedded within tc_taprio_qopt_offload. So practically, if a
device driver is written to treat the mqprio portion of taprio just like
standalone mqprio, it gets unified handling of frame preemption.
I would have reused more code with taprio, but this is mostly netlink
attribute parsing, which is hard to transform into generic code without
having something that stinks as a result. We have the same variables
with the same semantics, just different nlattr type values
(TCA_MQPRIO_TC_ENTRY=5 vs TCA_TAPRIO_ATTR_TC_ENTRY=12;
TCA_MQPRIO_TC_ENTRY_FP=2 vs TCA_TAPRIO_TC_ENTRY_FP=3, etc) and
consequently, different policies for the nest.
Every time nla_parse_nested() is called, an on-stack table "tb" of
nlattr pointers is allocated statically, up to the maximum understood
nlattr type. That array size is hardcoded as a constant, but when
transforming this into a common parsing function, it would become either
a VLA (which the Linux kernel rightfully doesn't like) or a call to the
allocator.
Having FP adminStatus in tc-taprio can be seen as addressing the 802.1Q
Annex S.3 "Scheduling and preemption used in combination, no HOLD/RELEASE"
and S.4 "Scheduling and preemption used in combination with HOLD/RELEASE"
use cases. HOLD and RELEASE events are emitted towards the underlying
MAC Merge layer when the schedule hits a Set-And-Hold-MAC or a
Set-And-Release-MAC gate operation. So within the tc-taprio UAPI space,
one can distinguish between the 2 use cases by choosing whether to use
the TC_TAPRIO_CMD_SET_AND_HOLD and TC_TAPRIO_CMD_SET_AND_RELEASE gate
operations within the schedule, or just TC_TAPRIO_CMD_SET_GATES.
A small part of the change is dedicated to refactoring the max_sdu
nlattr parsing to put all logic under the "if" that tests for presence
of that nlattr.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Tue, 11 Apr 2023 18:01:54 +0000 (21:01 +0300)]
net/sched: mqprio: allow per-TC user input of FP adminStatus
IEEE 802.1Q-2018 clause 6.7.2 Frame preemption specifies that each
packet priority can be assigned to a "frame preemption status" value of
either "express" or "preemptible". Express priorities are transmitted by
the local device through the eMAC, and preemptible priorities through
the pMAC (the concepts of eMAC and pMAC come from the 802.3 MAC Merge
layer).
The FP adminStatus is defined per packet priority, but 802.1Q clause
12.30.1.1.1 framePreemptionAdminStatus also says that:
| Priorities that all map to the same traffic class should be
| constrained to use the same value of preemption status.
It is impossible to ignore the cognitive dissonance in the standard
here, because it practically means that the FP adminStatus only takes
distinct values per traffic class, even though it is defined per
priority.
I can see no valid use case which is prevented by having the kernel take
the FP adminStatus as input per traffic class (what we do here).
In addition, this also enforces the above constraint by construction.
User space network managers which wish to expose FP adminStatus per
priority are free to do so; they must only observe the prio_tc_map of
the netdev (which presumably is also under their control, when
constructing the mqprio netlink attributes).
The reason for configuring frame preemption as a property of the Qdisc
layer is that the information about "preemptible TCs" is closest to the
place which handles the num_tc and prio_tc_map of the netdev. If the
UAPI would have been any other layer, it would be unclear what to do
with the FP information when num_tc collapses to 0. A key assumption is
that only mqprio/taprio change the num_tc and prio_tc_map of the netdev.
Not sure if that's a great assumption to make.
Having FP in tc-mqprio can be seen as an implementation of the use case
defined in 802.1Q Annex S.2 "Preemption used in isolation". There will
be a separate implementation of FP in tc-taprio, for the other use
cases.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Tue, 11 Apr 2023 18:01:53 +0000 (21:01 +0300)]
net/sched: pass netlink extack to mqprio and taprio offload
With the multiplexed ndo_setup_tc() model which lacks a first-class
struct netlink_ext_ack * argument, the only way to pass the netlink
extended ACK message down to the device driver is to embed it within the
offload structure.
Do this for struct tc_mqprio_qopt_offload and struct tc_taprio_qopt_offload.
Since struct tc_taprio_qopt_offload also contains a tc_mqprio_qopt_offload
structure, and since device drivers might effectively reuse their mqprio
implementation for the mqprio portion of taprio, we make taprio set the
extack in both offload structures to point at the same netlink extack
message.
In fact, the taprio handling is a bit more tricky, for 2 reasons.
First is because the offload structure has a longer lifetime than the
extack structure. The driver is supposed to populate the extack
synchronously from ndo_setup_tc() and leave it alone afterwards.
To not have any use-after-free surprises, we zero out the extack pointer
when we leave taprio_enable_offload().
The second reason is because taprio does overwrite the extack message on
ndo_setup_tc() error. We need to switch to the weak form of setting an
extack message, which preserves a potential message set by the driver.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Tue, 11 Apr 2023 18:01:52 +0000 (21:01 +0300)]
net/sched: mqprio: add an extack message to mqprio_parse_opt()
Ferenc reports that a combination of poor iproute2 defaults and obscure
cases where the kernel returns -EINVAL make it difficult to understand
what is wrong with this command:
$ ip link add veth0 numtxqueues 8 numrxqueues 8 type veth peer name veth1
$ tc qdisc add dev veth0 root mqprio num_tc 8 map 0 1 2 3 4 5 6 7 \
queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7
RTNETLINK answers: Invalid argument
Hopefully with this patch, the cause is clearer:
Error: Device does not support hardware offload.
The kernel was (and still is) rejecting this because iproute2 defaults
to "hw 1" if this command line option is not specified.
Vladimir Oltean [Tue, 11 Apr 2023 18:01:51 +0000 (21:01 +0300)]
net/sched: mqprio: add extack to mqprio_parse_nlattr()
Netlink attribute parsing in mqprio is a minesweeper game, with many
options having the possibility of being passed incorrectly and the user
being none the wiser.
Try to make errors less sour by giving user space some information
regarding what went wrong.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu> Reviewed-by: Simon Horman <simon.horman@corigine.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Tue, 11 Apr 2023 18:01:50 +0000 (21:01 +0300)]
net/sched: mqprio: simplify handling of nlattr portion of TCA_OPTIONS
In commit 4e8b86c06269 ("mqprio: Introduce new hardware offload mode and
shaper in mqprio"), the TCA_OPTIONS format of mqprio was extended to
contain a fixed portion (of size NLA_ALIGN(sizeof struct tc_mqprio_qopt))
and a variable portion of other nlattrs (in the TCA_MQPRIO_* type space)
following immediately afterwards.
In commit feb2cf3dcfb9 ("net/sched: mqprio: refactor nlattr parsing to a
separate function"), we've moved the nlattr handling to a smaller
function, but yet, a small parse_attr() still remains, and the larger
mqprio_parse_nlattr() still does not have access to the beginning, and
the length, of the TCA_OPTIONS region containing these other nlattrs.
In a future change, the mqprio qdisc will need to iterate through this
nlattr region to discover other attributes, so eliminate parse_attr()
and add 2 variables in mqprio_parse_nlattr() which hold the beginning
and the length of the nlattr range.
We avoid the need to memset when nlattr_opt_len has insufficient length
by pre-initializing the table "tb".
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu> Reviewed-by: Simon Horman <simon.horman@corigine.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Tue, 11 Apr 2023 18:01:49 +0000 (21:01 +0300)]
net: ethtool: create and export ethtool_dev_mm_supported()
Create a wrapper over __ethtool_dev_mm_supported() which also calls
ethnl_ops_begin() and ethnl_ops_complete(). It can be used by other code
layers, such as tc, to make sure that preemptible TCs are supported
(this is true if an underlying MAC Merge layer exists).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Make it explicit that this tool is not a drop-in replacement for ethtool.
This tool is intended for testing ethtool functionality implemented in the
kernel and should use a name that differentiates it from the ethtool
utility.
tools: ynl: Remove absolute paths to yaml files from ethtool testing tool
Absolute paths for the spec and schema files make the ethtool testing tool
unusable with freshly checked-out source trees. Replace absolute paths with
relative paths for both files in the Documentation/ directory.
Issue seen before the change
Traceback (most recent call last):
File "/home/binary-eater/Documents/mlx/linux/tools/net/ynl/./ethtool", line 424, in <module>
main()
File "/home/binary-eater/Documents/mlx/linux/tools/net/ynl/./ethtool", line 158, in main
ynl = YnlFamily(spec, schema)
File "/home/binary-eater/Documents/mlx/linux/tools/net/ynl/lib/ynl.py", line 342, in __init__
super().__init__(def_path, schema)
File "/home/binary-eater/Documents/mlx/linux/tools/net/ynl/lib/nlspec.py", line 333, in __init__
with open(spec_path, "r") as stream:
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/google/home/sdf/src/linux/Documentation/netlink/specs/ethtool.yaml'
The seconds input from BD (6 bits) just needs to be ORed with the
upper bits from timer in this function. Avoid addition operation
every single time. Seconds rollover handling is left untouched.
Signed-off-by: Harini Katakam <harini.katakam@xilinx.com> Signed-off-by: Michal Simek <michal.simek@xilinx.com> Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@xilinx.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Enable transmission and reception of PTP unicast packets by
updating PTP unicast config bit and setting current HW mac
address as allowed address in PTP unicast filter registers.
Signed-off-by: Harini Katakam <harini.katakam@xilinx.com> Signed-off-by: Michal Simek <michal.simek@xilinx.com> Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@xilinx.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There are currently two checks for PTP functionality - one on GEM
capability and another on the kernel config option. Combine them
into a single function as there's no use case where gem_has_ptp is
TRUE and MACB_USE_HWSTAMP is false.
Signed-off-by: Harini Katakam <harini.katakam@amd.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 14 Apr 2023 04:56:08 +0000 (21:56 -0700)]
Merge branch 'ocelot-felix-driver-cleanup'
Vladimir Oltean says:
====================
Ocelot/Felix driver cleanup
The cleanup mostly handles the statistics code path - some issues
regarding understandability became apparent after the series
"Fix trainwreck with Ocelot switch statistics counters":
https://lore.kernel.org/netdev/20230321010325.897817-1-vladimir.oltean@nxp.com/
There is also one patch which cleans up a misleading comment
in the DSA felix_setup().
====================
Vladimir Oltean [Wed, 12 Apr 2023 12:47:37 +0000 (15:47 +0300)]
net: mscc: ocelot: fix ineffective WARN_ON() in ocelot_stats.c
Since it is hopefully now clear that, since "last" and "layout[i].reg"
are enum types and not addresses, the existing WARN_ON() is ineffective
in checking that the _addresses_ are sorted in the proper order.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Wed, 12 Apr 2023 12:47:36 +0000 (15:47 +0300)]
net: mscc: ocelot: strengthen type of "int i" in ocelot_stats.c
The "int i" used to index the struct ocelot_stat_layout array actually
has a specific type: enum ocelot_stat. Use it, so that the WARN()
comment from ocelot_prepare_stats_regions() makes more sense.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Wed, 12 Apr 2023 12:47:35 +0000 (15:47 +0300)]
net: mscc: ocelot: strengthen type of "u32 reg" and "u32 base" in ocelot_stats.c
Use the specific enum ocelot_reg to make it clear that the region
registers are encoded and not plain addresses.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Wed, 12 Apr 2023 12:47:34 +0000 (15:47 +0300)]
net: dsa: felix: remove confusing/incorrect comment from felix_setup()
That comment was written prior to knowing that what I was actually
seeing was a manifestation of the bug fixed in commit b4024c9e5c57
("felix: Fix initialization of ioremap resources").
There isn't any particular reason now why the hardware initialization is
done in felix_setup(), so just delete that comment to avoid spreading
misinformation.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Wed, 12 Apr 2023 12:47:33 +0000 (15:47 +0300)]
net: mscc: ocelot: remove blank line at the end of ocelot_stats.c
Commit a3bb8f521fd8 ("net: mscc: ocelot: remove unnecessary exposure of
stats structures") made an unnecessary change which was to add a new
line at the end of ocelot_stats.c. Remove it.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Colin Foster <colin.foster@in-advantage.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Wed, 12 Apr 2023 12:47:32 +0000 (15:47 +0300)]
net: mscc: ocelot: debugging print for statistics regions
To make it easier to debug future issues with statistics counters not
getting aggregated properly into regions, like what happened in commit 6acc72a43eac ("net: mscc: ocelot: fix stats region batching"), add some
dev_dbg() prints which show the regions that were dynamically
determined.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Wed, 12 Apr 2023 12:47:31 +0000 (15:47 +0300)]
net: mscc: ocelot: refactor enum ocelot_reg decoding to helper
ocelot_io.c duplicates the decoding of an enum ocelot_reg (which holds
an enum ocelot_target in the upper bits and an index into a regmap array
in the lower bits) 4 times.
We'd like to reuse that logic once more, from ocelot.c. In order to do
that, let's consolidate the existing 4 instances into a header
accessible both by ocelot.c as well as by ocelot_io.c.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Wed, 12 Apr 2023 12:47:30 +0000 (15:47 +0300)]
net: mscc: ocelot: strengthen type of "u32 reg" in I/O accessors
The "u32 reg" argument that is passed to these functions is not a plain
address, but rather a driver-specific encoding of another enum
ocelot_target target in the upper bits, and an index into the
u32 ocelot->map[target][] array in the lower bits. That encoded value
takes the type "enum ocelot_reg" and is what is passed to these I/O
functions, so let's actually use that to prevent type confusion.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We've added 260 non-merge commits during the last 36 day(s) which contain
a total of 356 files changed, 21786 insertions(+), 11275 deletions(-).
The main changes are:
1) Rework BPF verifier log behavior and implement it as a rotating log
by default with the option to retain old-style fixed log behavior,
from Andrii Nakryiko.
2) Adds support for using {FOU,GUE} encap with an ipip device operating
in collect_md mode and add a set of BPF kfuncs for controlling encap
params, from Christian Ehrig.
3) Allow BPF programs to detect at load time whether a particular kfunc
exists or not, and also add support for this in light skeleton,
from Alexei Starovoitov.
4) Optimize hashmap lookups when key size is multiple of 4,
from Anton Protopopov.
5) Enable RCU semantics for task BPF kptrs and allow referenced kptr
tasks to be stored in BPF maps, from David Vernet.
6) Add support for stashing local BPF kptr into a map value via
bpf_kptr_xchg(). This is useful e.g. for rbtree node creation
for new cgroups, from Dave Marchevsky.
7) Fix BTF handling of is_int_ptr to skip modifiers to work around
tracing issues where a program cannot be attached, from Feng Zhou.
8) Migrate a big portion of test_verifier unit tests over to
test_progs -a verifier_* via inline asm to ease {read,debug}ability,
from Eduard Zingerman.
9) Several updates to the instruction-set.rst documentation
which is subject to future IETF standardization
(https://lwn.net/Articles/926882/), from Dave Thaler.
10) Fix BPF verifier in the __reg_bound_offset's 64->32 tnum sub-register
known bits information propagation, from Daniel Borkmann.
11) Add skb bitfield compaction work related to BPF with the overall goal
to make more of the sk_buff bits optional, from Jakub Kicinski.
12) BPF selftest cleanups for build id extraction which stand on its own
from the upcoming integration work of build id into struct file object,
from Jiri Olsa.
13) Add fixes and optimizations for xsk descriptor validation and several
selftest improvements for xsk sockets, from Kal Conley.
14) Add BPF links for struct_ops and enable switching implementations
of BPF TCP cong-ctls under a given name by replacing backing
struct_ops map, from Kui-Feng Lee.
15) Remove a misleading BPF verifier env->bypass_spec_v1 check on variable
offset stack read as earlier Spectre checks cover this,
from Luis Gerhorst.
16) Fix issues in copy_from_user_nofault() for BPF and other tracers
to resemble copy_from_user_nmi() from safety PoV, from Florian Lehner
and Alexei Starovoitov.
17) Add --json-summary option to test_progs in order for CI tooling to
ease parsing of test results, from Manu Bretelle.
18) Batch of improvements and refactoring to prep for upcoming
bpf_local_storage conversion to bpf_mem_cache_{alloc,free} allocator,
from Martin KaFai Lau.
19) Improve bpftool's visual program dump which produces the control
flow graph in a DOT format by adding C source inline annotations,
from Quentin Monnet.
20) Fix attaching fentry/fexit/fmod_ret/lsm to modules by extracting
the module name from BTF of the target and searching kallsyms of
the correct module, from Viktor Malik.
21) Improve BPF verifier handling of '<const> <cond> <non_const>'
to better detect whether in particular jmp32 branches are taken,
from Yonghong Song.
22) Allow BPF TCP cong-ctls to write app_limited of struct tcp_sock.
A built-in cc or one from a kernel module is already able to write
to app_limited, from Yixin Shen.
Conflicts:
Documentation/bpf/bpf_devel_QA.rst b7abcd9c656b ("bpf, doc: Link to submitting-patches.rst for general patch submission info") 0f10f647f455 ("bpf, docs: Use internal linking for link to netdev subsystem doc")
https://lore.kernel.org/all/20230307095812.236eb1be@canb.auug.org.au/
include/net/ip_tunnels.h bc9d003dc48c3 ("ip_tunnel: Preserve pointer const in ip_tunnel_info_opts") ac931d4cdec3d ("ipip,ip_tunnel,sit: Add FOU support for externally controlled ipip devices")
https://lore.kernel.org/all/20230413161235.4093777-1-broonie@kernel.org/
net/bpf/test_run.c e5995bc7e2ba ("bpf, test_run: fix crashes due to XDP frame overwriting/corruption") 294635a8165a ("bpf, test_run: fix &xdp_frame misplacement for LIVE_FRAMES")
https://lore.kernel.org/all/20230320102619.05b80a98@canb.auug.org.au/
====================
Merge tag 'net-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf, and bluetooth.
Not all that quiet given spring celebrations, but "current" fixes are
thinning out, which is encouraging. One outstanding regression in the
mlx5 driver when using old FW, not blocking but we're pushing for a
fix.
Current release - new code bugs:
- eth: enetc: workaround for unresponsive pMAC after receiving
express traffic
Previous releases - regressions:
- rtnetlink: restore RTM_NEW/DELLINK notification behavior, keep the
pid/seq fields 0 for backward compatibility
Previous releases - always broken:
- sctp: fix a potential overflow in sctp_ifwdtsn_skip
- mptcp:
- use mptcp_schedule_work instead of open-coding it and make the
worker check stricter, to avoid scheduling work on closed
sockets
- fix NULL pointer dereference on fastopen early fallback
- skbuff: fix memory corruption due to a race between skb coalescing
and releasing clones confusing page_pool reference counting
- bonding: fix neighbor solicitation validation on backup slaves
- bpf: tcp: use sock_gen_put instead of sock_put in bpf_iter_tcp
- bpf: arm64: fixed a BTI error on returning to patched function
- openvswitch: fix race on port output leading to inf loop
- sfp: initialize sfp->i2c_block_size at sfp allocation to avoid
returning a different errno than expected
- phy: nxp-c45-tja11xx: unregister PTP, purge queues on remove
- Bluetooth: fix printing errors if LE Connection times out
- Bluetooth: assorted UaF, deadlock and data race fixes
- adjust the XDP Rx flow hash API to also include the protocol layers
over which the hash was computed"
* tag 'net-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (50 commits)
selftests/bpf: Adjust bpf_xdp_metadata_rx_hash for new arg
mlx4: bpf_xdp_metadata_rx_hash add xdp rss hash type
veth: bpf_xdp_metadata_rx_hash add xdp rss hash type
mlx5: bpf_xdp_metadata_rx_hash add xdp rss hash type
xdp: rss hash types representation
selftests/bpf: xdp_hw_metadata remove bpf_printk and add counters
skbuff: Fix a race between coalescing and releasing SKBs
net: macb: fix a memory corruption in extended buffer descriptor mode
selftests: add the missing CONFIG_IP_SCTP in net config
udp6: fix potential access to stale information
selftests: openvswitch: adjust datapath NL message declaration
selftests: mptcp: userspace pm: uniform verify events
mptcp: fix NULL pointer dereference on fastopen early fallback
mptcp: stricter state check in mptcp_worker
mptcp: use mptcp_schedule_work instead of open-coding it
net: enetc: workaround for unresponsive pMAC after receiving express traffic
sctp: fix a potential overflow in sctp_ifwdtsn_skip
net: qrtr: Fix an uninit variable access bug in qrtr_tx_resume()
rtnetlink: Restore RTM_NEW/DELLINK notification behavior
net: ti/cpsw: Add explicit platform_device.h and of_platform.h includes
...
Merge tag 'devicetree-fixes-for-6.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
Pull devicetree fixes from Rob Herring:
- Fix interaction between fw_devlink and DT overlays causing devices to
not be probed
- Fix the compatible string for loongson,cpu-interrupt-controller
* tag 'devicetree-fixes-for-6.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
treewide: Fix probing of devices in DT overlays
dt-bindings: interrupt-controller: loongarch: Fix mismatched compatible
Merge tag 'pinctrl-v6.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control fix from Linus Walleij:
"This is just a revert of the AMD fix, because the fix broke some
laptops. We are working on a proper solution"
* tag 'pinctrl-v6.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
Revert "pinctrl: amd: Disable and mask interrupts on resume"
Merge tag 'drm-fixes-2023-04-13' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Daniel Vetter:
- two fbcon regressions
- amdgpu: dp mst, smu13
- i915: dual link dsi for tgl+
- armada, nouveau, drm/sched, fbmem
* tag 'drm-fixes-2023-04-13' of git://anongit.freedesktop.org/drm/drm:
fbcon: set_con2fb_map needs to set con2fb_map!
fbcon: Fix error paths in set_con2fb_map
drm/amd/pm: correct the pcie link state check for SMU13
drm/amd/pm: correct SMU13.0.7 max shader clock reporting
drm/amd/pm: correct SMU13.0.7 pstate profiling clock settings
drm/amd/display: Pass the right info to drm_dp_remove_payload
drm/armada: Fix a potential double free in an error handling path
fbmem: Reject FB_ACTIVATE_KD_TEXT from userspace
drm/nouveau/fb: add missing sysmen flush callbacks
drm/i915/dsi: fix DSS CTL register offsets for TGL+
drm/scheduler: Fix UAF race in drm_sched_entity_push_job()
Jakub Kicinski [Thu, 13 Apr 2023 20:04:44 +0000 (13:04 -0700)]
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2023-04-13
We've added 6 non-merge commits during the last 1 day(s) which contain
a total of 14 files changed, 205 insertions(+), 38 deletions(-).
The main changes are:
1) One late straggler fix on the XDP hints side which fixes
bpf_xdp_metadata_rx_hash kfunc API before the release goes out
in order to provide information on the RSS hash type,
from Jesper Dangaard Brouer.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Adjust bpf_xdp_metadata_rx_hash for new arg
mlx4: bpf_xdp_metadata_rx_hash add xdp rss hash type
veth: bpf_xdp_metadata_rx_hash add xdp rss hash type
mlx5: bpf_xdp_metadata_rx_hash add xdp rss hash type
xdp: rss hash types representation
selftests/bpf: xdp_hw_metadata remove bpf_printk and add counters
====================
Current API for bpf_xdp_metadata_rx_hash() returns the raw RSS hash value,
but doesn't provide information on the RSS hash type (part of 6.3-rc).
This patchset proposal is to change the function call signature via adding
a pointer value argument for providing the RSS hash type.
Patchset also removes all bpf_printk's from xdp_hw_metadata program
that we expect driver developers to use. Instead counters are introduced
for relaying e.g. skip and fail info.
====================
veth: bpf_xdp_metadata_rx_hash add xdp rss hash type
Update API for bpf_xdp_metadata_rx_hash() with arg for xdp rss hash type.
The veth driver currently only support XDP-hints based on SKB code path.
The SKB have lost information about the RSS hash type, by compressing
the information down to a single bitfield skb->l4_hash, that only knows
if this was a L4 hash value.
In preparation for veth, the xdp_rss_hash_type have an L4 indication
bit that allow us to return a meaningful L4 indication when working
with SKB based packets.
The RSS hash type specifies what portion of packet data NIC hardware used
when calculating RSS hash value. The RSS types are focused on Internet
traffic protocols at OSI layers L3 and L4. L2 (e.g. ARP) often get hash
value zero and no RSS type. For L3 focused on IPv4 vs. IPv6, and L4
primarily TCP vs UDP, but some hardware supports SCTP.
Hardware RSS types are differently encoded for each hardware NIC. Most
hardware represent RSS hash type as a number. Determining L3 vs L4 often
requires a mapping table as there often isn't a pattern or sorting
according to ISO layer.
The patch introduce a XDP RSS hash type (enum xdp_rss_hash_type) that
contains both BITs for the L3/L4 types, and combinations to be used by
drivers for their mapping tables. The enum xdp_rss_type_bits get exposed
to BPF via BTF, and it is up to the BPF-programmer to match using these
defines.
This proposal change the kfunc API bpf_xdp_metadata_rx_hash() adding
a pointer value argument for provide the RSS hash type.
Change signature for all xmo_rx_hash calls in drivers to make it compile.
The RSS type implementations for each driver comes as separate patches.
Daniel Vetter [Wed, 12 Apr 2023 15:31:46 +0000 (17:31 +0200)]
fbcon: set_con2fb_map needs to set con2fb_map!
I got really badly confused in d443d9386472 ("fbcon: move more common
code into fb_open()") because we set the con2fb_map before the failure
points, which didn't look good.
But in trying to fix that I moved the assignment into the wrong path -
we need to do it for _all_ vc we take over, not just the first one
(which additionally requires the call to con2fb_acquire_newinfo).
I've figured this out because of a KASAN bug report, where the
fbcon_registered_fb and fbcon_display arrays went out of sync in
fbcon_mode_deleted() because the con2fb_map pointed at the old
fb_info, but the modes and everything was updated for the new one.
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> Reviewed-by: Javier Martinez Canillas <javierm@redhat.com> Acked-by: Helge Deller <deller@gmx.de> Tested-by: Xingyuan Mo <hdthky0@gmail.com> Fixes: d443d9386472 ("fbcon: move more common code into fb_open()") Reported-by: Xingyuan Mo <hdthky0@gmail.com> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Xingyuan Mo <hdthky0@gmail.com> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: Helge Deller <deller@gmx.de> Cc: <stable@vger.kernel.org> # v5.19+
Daniel Vetter [Wed, 12 Apr 2023 15:23:49 +0000 (17:23 +0200)]
fbcon: Fix error paths in set_con2fb_map
This is a regressoin introduced in b07db3958485 ("fbcon: Ditch error
handling for con2fb_release_oldinfo"). I failed to realize what the if
(!err) checks. The mentioned commit was dropping the
con2fb_release_oldinfo() return value but the if (!err) was also
checking whether the con2fb_acquire_newinfo() function call above
failed or not.
Fix this with an early return statement.
Note that there's still a difference compared to the orginal state of
the code, the below lines are now also skipped on error:
if (!search_fb_in_map(info_idx))
info_idx = newidx;
These are only needed when we've actually thrown out an old fb_info
from the console mappings, which only happens later on.
Also move the fbcon_add_cursor_work() call into the same if block,
it's all protected by console_lock so doesn't matter when we set up
the blinking cursor delayed work anyway. This further simplifies the
control flow and allows us to ditch the found local variable.
v2: Clarify commit message (Javier)
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> Reviewed-by: Javier Martinez Canillas <javierm@redhat.com> Acked-by: Helge Deller <deller@gmx.de> Tested-by: Xingyuan Mo <hdthky0@gmail.com> Fixes: b07db3958485 ("fbcon: Ditch error handling for con2fb_release_oldinfo") Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Xingyuan Mo <hdthky0@gmail.com> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: Helge Deller <deller@gmx.de> Cc: <stable@vger.kernel.org> # v5.19+
skbuff: Fix a race between coalescing and releasing SKBs
Commit 1effe8ca4e34 ("skbuff: fix coalescing for page_pool fragment
recycling") allowed coalescing to proceed with non page pool page and page
pool page when @from is cloned, i.e.
However, it actually requires skb_cloned(@from) to hold true until
coalescing finishes in this situation. If the other cloned SKB is
released while the merging is in process, from_shinfo->nr_frags will be
set to 0 toward the end of the function, causing the increment of frag
page _refcount to be unexpectedly skipped resulting in inconsistent
reference counts. Later when SKB(@to) is released, it frees the page
directly even though the page pool page is still in use, leading to
use-after-free or double-free errors. So it should be prohibited.
Roman Gushchin [Wed, 12 Apr 2023 23:21:44 +0000 (16:21 -0700)]
net: macb: fix a memory corruption in extended buffer descriptor mode
For quite some time we were chasing a bug which looked like a sudden
permanent failure of networking and mmc on some of our devices.
The bug was very sensitive to any software changes and even more to
any kernel debug options.
Finally we got a setup where the problem was reproducible with
CONFIG_DMA_API_DEBUG=y and it revealed the issue with the rx dma:
Lars was fast to find an explanation: according to the datasheet
bit 2 of the rx buffer descriptor entry has a different meaning in the
extended mode:
Address [2] of beginning of buffer, or
in extended buffer descriptor mode (DMA configuration register [28] = 1),
indicates a valid timestamp in the buffer descriptor entry.
The macb driver didn't mask this bit while getting an address and it
eventually caused a memory corruption and a dma failure.
The problem is resolved by explicitly clearing the problematic bit
if hw timestamping is used.
Fixes: 7b4296148066 ("net: macb: Add support for PTP timestamps in DMA descriptors") Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Co-developed-by: Lars-Peter Clausen <lars@metafoo.de> Signed-off-by: Lars-Peter Clausen <lars@metafoo.de> Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20230412232144.770336-1-roman.gushchin@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Wed, 12 Apr 2023 13:03:08 +0000 (13:03 +0000)]
udp6: fix potential access to stale information
lena wang reported an issue caused by udpv6_sendmsg()
mangling msg->msg_name and msg->msg_namelen, which
are later read from ____sys_sendmsg() :
/*
* If this is sendmmsg() and sending to current destination address was
* successful, remember it.
*/
if (used_address && err >= 0) {
used_address->name_len = msg_sys->msg_namelen;
if (msg_sys->msg_name)
memcpy(&used_address->name, msg_sys->msg_name,
used_address->name_len);
}
udpv6_sendmsg() wants to pretend the remote address family
is AF_INET in order to call udp_sendmsg().
A fix would be to modify the address in-place, instead
of using a local variable, but this could have other side effects.
Instead, restore initial values before we return from udpv6_sendmsg().
Fixes: c71d8ebe7a44 ("net: Fix security_socket_sendmsg() bypass problem.") Reported-by: lena wang <lena.wang@mediatek.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Maciej Żenczykowski <maze@google.com> Link: https://lore.kernel.org/r/20230412130308.1202254-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The netlink message for creating a new datapath takes an array
of ports for the PID creation. This shouldn't cause much issue
but correct it for future cases where we need to do decode of
datapath information that could include the per-cpu PID map.
Simply adding a "sleep" before checking something is usually not a good
idea because the time that has been picked can not be enough or too
much. The best is to wait for events with a timeout.
In this selftest, 'sleep 0.5' is used more than 40 times. It is always
used before calling a 'verify_*' function except for this
verify_listener_events which has been added later.
At the end, using all these 'sleep 0.5' seems to work: the slow CIs
don't complain so far. Also because it doesn't take too much time, we
can just add two more 'sleep 0.5' to uniform what is done before calling
a 'verify_*' function. For the same reasons, we can also delay a bigger
refactoring to replace all these 'sleep 0.5' by functions waiting for
events instead of waiting for a fix time and hope for the best.
Fixes: 6c73008aa301 ("selftests: mptcp: listener test for userspace PM") Cc: stable@vger.kernel.org Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
xsk: Elide base_addr comparison in xp_unaligned_validate_desc
Remove redundant (base_addr >= pool->addrs_cnt) comparison from the
conditional.
In particular, addr is computed as:
addr = base_addr + offset
... where base_addr and offset are stored as 48-bit and 16-bit unsigned
integers, respectively. The above sum cannot overflow u64 since base_addr
has a maximum value of 0x0000ffffffffffff and offset has a maximum value
of 0xffff (implying a maximum sum of 0x000100000000fffe). Since overflow
is impossible, it follows that addr >= base_addr.
Now if (base_addr >= pool->addrs_cnt), then clearly:
addr >= base_addr
>= pool->addrs_cnt
Thus, (base_addr >= pool->addrs_cnt) implies (addr >= pool->addrs_cnt).
Subsequently, the former comparison is unnecessary in the conditional
since for any boolean expressions A and B, (A || B) && (A -> B) is
equivalent to B.
Perform the chunk boundary check like the page boundary check in
xp_desc_crosses_non_contig_pg(). This simplifies the implementation and
reduces the number of branches.
selftests/bpf: Remove stand-along test_verifier_log test binary
test_prog's prog_tests/verifier_log.c is superseding test_verifier_log
stand-alone test. It cover same checks and adds more, and is also
integrated into test_progs test runner.
Song Liu [Wed, 12 Apr 2023 21:04:21 +0000 (14:04 -0700)]
selftests/bpf: Use read_perf_max_sample_freq() in perf_event_stackmap
Currently, perf_event sample period in perf_event_stackmap is set too low
that the test fails randomly. Fix this by using the max sample frequency,
from read_perf_max_sample_freq().
Move read_perf_max_sample_freq() to testing_helpers.c. Replace the CHECK()
with if-printf, as CHECK is not available in testing_helpers.c.
====================
net: use READ_ONCE/WRITE_ONCE for ring index accesses
Small follow up to the lockless ring stop/start macros.
Update the doc and the drivers suggested by Eric:
https://lore.kernel.org/all/CANn89iJrBGSybMX1FqrhCEMWT3Nnz2=2+aStsbbwpWzKHjk51g@mail.gmail.com/
Jakub Kicinski [Wed, 12 Apr 2023 01:50:37 +0000 (18:50 -0700)]
bnxt: use READ_ONCE/WRITE_ONCE for ring indexes
Eric points out that we should make sure that ring index updates
are wrapped in the appropriate READ_ONCE/WRITE_ONCE macros.
Suggested-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Wed, 12 Apr 2023 01:50:36 +0000 (18:50 -0700)]
net: docs: update the sample code in driver.rst
The sample code talks about single-queue devices and uses locks.
Update it to something resembling more modern code.
Make sure we mention use of READ_ONCE() / WRITE_ONCE().
Change the comment which talked about consumer on the xmit side.
AFAIU xmit is the producer and completions are a consumer.
Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Paolo Abeni [Thu, 13 Apr 2023 10:50:47 +0000 (12:50 +0200)]
Merge branch 'add-emac3-support-for-sa8540p-ride'
Andrew Halaney says:
====================
Add EMAC3 support for sa8540p-ride
This is a forward port / upstream refactor of code delivered
downstream by Qualcomm over at [0] to enable the DWMAC5 based
implementation called EMAC3 on the sa8540p-ride dev board.
From what I can tell with the board schematic in hand,
as well as the code delivered, the main changes needed are:
1. A new address space layout for dwmac5/EMAC3 MTL/DMA regs
2. A new programming sequence required for the EMAC3 based platforms
This series makes the changes above as well as other housekeeping items
such as converting dt-bindings to yaml, etc.
As requested[1], it has been split up by compilation deps / maintainer tree.
I will post a link to the associated devicetree changes that together
with this series get the hardware functioning.
Patches 1-3 are clean ups of the currently supported dt-bindings and
IMO could be picked up as is independent of the rest of the series to
improve the current codebase. They've all been reviewed in prior
versions of the series.
Patches 5-7 are also clean ups of the driver and are worth picking up
independently as well. They don't all have explicit reviews but should
be good to go (trivial changes on non-reviewed bits).
The rest of the patches have new changes, lack review, or are specificly
being made to support the new hardware, so they should wait until the
series as a whole is deemed ready to go by the community.
Andrew Halaney [Tue, 11 Apr 2023 20:04:07 +0000 (15:04 -0500)]
net: stmmac: dwmac-qcom-ethqos: Respect phy-mode and TX delay
The driver currently sets a MAC TX delay of 2 ns no matter what the
phy-mode is. If the phy-mode indicates the phy is in charge of the
TX delay (rgmii-txid, rgmii-id), don't do it in the MAC.
Signed-off-by: Andrew Halaney <ahalaney@redhat.com> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Brian Masney <bmasney@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Andrew Halaney [Tue, 11 Apr 2023 20:04:06 +0000 (15:04 -0500)]
net: stmmac: dwmac4: Allow platforms to specify some DMA/MTL offsets
Some platforms have dwmac4 implementations that have a different
address space layout than the default, resulting in the need to define
their own DMA/MTL offsets.
Extend the functions to allow a platform driver to indicate what its
addresses are, overriding the defaults.
Signed-off-by: Andrew Halaney <ahalaney@redhat.com> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Brian Masney <bmasney@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Andrew Halaney [Tue, 11 Apr 2023 20:04:05 +0000 (15:04 -0500)]
net: stmmac: Pass stmmac_priv in some callbacks
Passing stmmac_priv to some of the callbacks allows hwif implementations
to grab some data that platforms can customize. Adjust the callbacks
accordingly in preparation of such a platform customization.
Signed-off-by: Andrew Halaney <ahalaney@redhat.com> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Brian Masney <bmasney@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Andrew Halaney [Tue, 11 Apr 2023 20:04:04 +0000 (15:04 -0500)]
net: stmmac: Remove some unnecessary void pointers
There's a few spots in the hardware interface where a void pointer is
used, but what's passed in and later cast out is always the same type.
Just use the proper type directly.
Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Andrew Halaney <ahalaney@redhat.com> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Brian Masney <bmasney@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The sc8280xp has a new version of the ETHQOS hardware in it, EMAC v3.
Add a compatible for this.
Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Andrew Halaney <ahalaney@redhat.com> Tested-by: Brian Masney <bmasney@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
dt-bindings: net: qcom,ethqos: Convert bindings to yaml
Convert Qualcomm ETHQOS Ethernet devicetree binding to YAML.
In doing so add a new property for iommus since newer platforms support
using one, and without such make dtbs_check fails on them.
While at it, also update the MAINTAINERS file to point to the yaml
version of the bindings.
Signed-off-by: Bhupesh Sharma <bhupesh.sharma@linaro.org>
[halaney: Remove duplicated properties, add MAINTAINERS and iommus] Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Andrew Halaney <ahalaney@redhat.com> Tested-by: Brian Masney <bmasney@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Bhupesh Sharma <bhupesh.sharma@linaro.org> Signed-off-by: Andrew Halaney <ahalaney@redhat.com> Tested-by: Brian Masney <bmasney@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
As commit fc191af1bb0d ("net: stmmac: platform: Fix misleading
interrupt error msg") noted, not every stmmac based platform
makes use of the 'eth_wake_irq' or 'eth_lpi' interrupts.
So, update the 'interrupt-names' inside 'snps,dwmac' YAML
bindings to reflect the same.
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Bhupesh Sharma <bhupesh.sharma@linaro.org> Signed-off-by: Andrew Halaney <ahalaney@redhat.com> Tested-by: Brian Masney <bmasney@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Vladimir Oltean [Tue, 11 Apr 2023 19:26:45 +0000 (22:26 +0300)]
net: enetc: workaround for unresponsive pMAC after receiving express traffic
I have observed an issue where the RX direction of the LS1028A ENETC pMAC
seems unresponsive. The minimal procedure to reproduce the issue is:
1. Connect ENETC port 0 with a loopback RJ45 cable to one of the Felix
switch ports (0).
2. Bring the ports up (MAC Merge layer is not enabled on either end).
3. Send a large quantity of unidirectional (express) traffic from Felix
to ENETC. I tried altering frame size and frame count, and it doesn't
appear to be specific to either of them, but rather, to the quantity
of octets received. Lowering the frame count, the minimum quantity of
packets to reproduce relatively consistently seems to be around 37000
frames at 1514 octets (w/o FCS) each.
4. Using ethtool --set-mm, enable the pMAC in the Felix and in the ENETC
ports, in both RX and TX directions, and with verification on both
ends.
5. Wait for verification to complete on both sides.
6. Configure a traffic class as preemptible on both ends.
7. Send some packets again.
The issue is at step 5, where the verification process of ENETC ends
(meaning that Felix responds with an SMD-R and ENETC sees the response),
but the verification process of Felix never ends (it remains VERIFYING).
If step 3 is skipped or if ENETC receives less traffic than
approximately that threshold, the test runs all the way through
(verification succeeds on both ends, preemptible traffic passes fine).
If, between step 4 and 5, the step below is also introduced:
4.1. Disable and re-enable PM0_COMMAND_CONFIG bit RX_EN
then again, the sequence of steps runs all the way through, and
verification succeeds, even if there was the previous RX traffic
injected into ENETC.
Traffic sent *by* the ENETC port prior to enabling the MAC Merge layer
does not seem to influence the verification result, only received
traffic does.
The LS1028A manual does not mention any relationship between
PM0_COMMAND_CONFIG and MMCSR, and the hardware people don't seem to
know for now either.
The bit that is toggled to work around the issue is also toggled
by enetc_mac_enable(), called from phylink's mac_link_down() and
mac_link_up() methods - which is how the workaround was found:
verification would work after a link down/up.
Fixes: c7b9e8086902 ("net: enetc: add support for MAC Merge layer") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20230411192645.1896048-1-vladimir.oltean@nxp.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Ivan Vecera [Tue, 11 Apr 2023 12:04:42 +0000 (14:04 +0200)]
bnxt_en: Allow to set switchdev mode without existing VFs
Remove an inability of bnxt_en driver to set eswitch to switchdev
mode without existing VFs by:
1. Allow to set switchdev mode in bnxt_dl_eswitch_mode_set() so
representors are created only when num_vfs > 0 otherwise just
set bp->eswitch_mode
2. Do not automatically change bp->eswitch_mode during
bnxt_vf_reps_create() and bnxt_vf_reps_destroy() calls so
the eswitch mode is managed only by an user by devlink.
Just set temporarily bp->eswitch_mode to legacy to avoid
re-opening of representors during destroy.
3. Create representors in bnxt_sriov_enable() if current eswitch
mode is switchdev one
Tested by this sequence:
1. Set PF interface up
2. Set PF's eswitch mode to switchdev
3. Created N VFs
4. Checked that N representors were created
5. Set eswitch mode to legacy
6. Checked that representors were deleted
7. Set eswitch mode back to switchdev
8. Checked that representors exist again for VFs
9. Deleted all VFs
10. Checked that all representors were deleted as well
11. Checked that current eswitch mode is still switchdev
Xin Long [Mon, 10 Apr 2023 19:43:30 +0000 (15:43 -0400)]
sctp: fix a potential overflow in sctp_ifwdtsn_skip
Currently, when traversing ifwdtsn skips with _sctp_walk_ifwdtsn, it only
checks the pos against the end of the chunk. However, the data left for
the last pos may be < sizeof(struct sctp_ifwdtsn_skip), and dereference
it as struct sctp_ifwdtsn_skip may cause coverflow.
This patch fixes it by checking the pos against "the end of the chunk -
sizeof(struct sctp_ifwdtsn_skip)" in sctp_ifwdtsn_skip, similar to
sctp_fwdtsn_skip.
It is because that skb->len requires at least sizeof(struct qrtr_ctrl_pkt)
in qrtr_tx_resume(). And skb->len equals to size in qrtr_endpoint_post().
But size is less than sizeof(struct qrtr_ctrl_pkt) when qrtr_cb->type
equals to QRTR_TYPE_RESUME_TX in qrtr_endpoint_post() under the syzbot
scenario. This triggers the uninit variable access bug.
Add size check when qrtr_cb->type equals to QRTR_TYPE_RESUME_TX in
qrtr_endpoint_post() to fix the bug.
====================
net: thunderbolt: Fix for sparse warnings and typos
This series tries to fix the rest of the sparse warnings generated
against the driver. While there fix the two typos in comments as well.
The previous version of the series can be found here:
https://lore.kernel.org/netdev/20230404053636.51597-1-mika.westerberg@linux.intel.com/
====================
Mika Westerberg [Tue, 11 Apr 2023 09:10:49 +0000 (12:10 +0300)]
net: thunderbolt: Fix typos in comments
Fix two typos in comments:
blongs -> belongs
UPD -> UDP
No functional changes.
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Mika Westerberg [Tue, 11 Apr 2023 09:10:48 +0000 (12:10 +0300)]
net: thunderbolt: Fix sparse warnings in tbnet_xmit_csum_and_map()
Fixes the following warning when the driver is built with sparse checks
enabled:
main.c:993:23: warning: incorrect type in initializer (different base types)
main.c:993:23: expected restricted __wsum [usertype] wsum
main.c:993:23: got restricted __be32 [usertype]
No functional changes intended.
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Mika Westerberg [Tue, 11 Apr 2023 09:10:47 +0000 (12:10 +0300)]
net: thunderbolt: Fix sparse warnings in tbnet_check_frame() and tbnet_poll()
Fixes the following warnings when the driver is built with sparse
checks enabled:
main.c:767:47: warning: restricted __le32 degrades to integer
main.c:775:47: warning: restricted __le16 degrades to integer
main.c:776:44: warning: restricted __le16 degrades to integer
main.c:876:40: warning: incorrect type in assignment (different base types)
main.c:876:40: expected restricted __le32 [usertype] frame_size
main.c:876:40: got unsigned int [assigned] [usertype] frame_size
main.c:877:41: warning: incorrect type in assignment (different base types)
main.c:877:41: expected restricted __le32 [usertype] frame_count
main.c:877:41: got unsigned int [usertype]
main.c:878:41: warning: incorrect type in assignment (different base types)
main.c:878:41: expected restricted __le16 [usertype] frame_index
main.c:878:41: got unsigned short [usertype]
main.c:879:38: warning: incorrect type in assignment (different base types)
main.c:879:38: expected restricted __le16 [usertype] frame_id
main.c:879:38: got unsigned short [usertype]
main.c:880:62: warning: restricted __le32 degrades to integer
main.c:880:35: warning: restricted __le16 degrades to integer
No functional changes intended.
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The commits referenced below allows userspace to use the NLM_F_ECHO flag
for RTM_NEW/DELLINK operations to receive unicast notifications for the
affected link. Prior to these changes, applications may have relied on
multicast notifications to learn the same information without specifying
the NLM_F_ECHO flag.
For such applications, the mentioned commits changed the behavior for
requests not using NLM_F_ECHO. Multicast notifications are still received,
but now use the portid of the requester and the sequence number of the
request instead of zero values used previously. For the application, this
message may be unexpected and likely handled as a response to the
NLM_F_ACKed request, especially if it uses the same socket to handle
requests and notifications.
To fix existing applications relying on the old notification behavior,
set the portid and sequence number in the notification only if the
request included the NLM_F_ECHO flag. This restores the old behavior
for applications not using it, but allows unicasted notifications for
others.
Fixes: f3a63cce1b4f ("rtnetlink: Honour NLM_F_ECHO flag in rtnl_delete_link") Fixes: d88e136cab37 ("rtnetlink: Honour NLM_F_ECHO flag in rtnl_newlink_create") Signed-off-by: Martin Willi <martin@strongswan.org> Acked-by: Guillaume Nault <gnault@redhat.com> Acked-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://lore.kernel.org/r/20230411074319.24133-1-martin@strongswan.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Andrew Lunn [Sun, 9 Apr 2023 15:02:04 +0000 (17:02 +0200)]
net: ethernet: Add missing depends on MDIO_DEVRES
A number of MDIO drivers make use of devm_mdiobus_alloc_size(). This
is only available when CONFIG_MDIO_DEVRES is enabled. Add missing
depends or selects, depending on if there are circular dependencies or
not. This avoids linker errors, especially for randconfig builds.
There are several issues with copy_from_user_nofault():
- access_ok() is designed for user context only and for that reason
it has WARN_ON_IN_IRQ() which triggers when bpf, kprobe, eprobe
and perf on ppc are calling it from irq.
- it's missing nmi_uaccess_okay() which is a nop on all architectures
except x86 where it's required.
The comment in arch/x86/mm/tlb.c explains the details why it's necessary.
Calling copy_from_user_nofault() from bpf, [ke]probe without this check is not safe.
- __copy_from_user_inatomic() under CONFIG_HARDENED_USERCOPY is calling
check_object_size()->__check_object_size()->check_heap_object()->find_vmap_area()->spin_lock()
which is not safe to do from bpf, [ke]probe and perf due to potential deadlock.
Fix all three issues. At the end the copy_from_user_nofault() becomes
equivalent to copy_from_user_nmi() from safety point of view with
a difference in the return value.
Merge tag 'for-linus-2023041201' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid
Pull HID fixes from Jiri Kosina:
- kernel panic fix for intel-ish-hid driver (Tanu Malhotra)
- buffer overflow fix in hid-sensor-custom driver (Todd Brandt)
- two device specific quirks (Alessandro Manca, Philippe Troin)
* tag 'for-linus-2023041201' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
HID: intel-ish-hid: Fix kernel panic during warm reset
HID: hid-sensor-custom: Fix buffer overrun in device name
HID: topre: Add support for 87 keys Realforce R2
HID: add HP 13t-aw100 & 14t-ea100 digitizer battery quirks
Merge branch 'Add FOU support for externally controlled ipip devices'
Christian Ehrig says:
====================
This patch set adds support for using FOU or GUE encapsulation with
an ipip device operating in collect-metadata mode and a set of kfuncs
for controlling encap parameters exposed to a BPF tc-hook.
BPF tc-hooks allow us to read tunnel metadata (like remote IP addresses)
in the ingress path of an externally controlled tunnel interface via
the bpf_skb_get_tunnel_{key,opt} bpf-helpers. Packets can then be
redirected to the same or a different externally controlled tunnel
interface by overwriting metadata via the bpf_skb_set_tunnel_{key,opt}
helpers and a call to bpf_redirect. This enables us to redirect packets
between tunnel interfaces - and potentially change the encapsulation
type - using only a single BPF program.
Today this approach works fine for a couple of tunnel combinations.
For example: redirecting packets between Geneve and GRE interfaces or
GRE and plain ipip interfaces. However, redirecting using FOU or GUE is
not supported today. The ip_tunnel module does not allow us to egress
packets using additional UDP encapsulation from an ipip device in
collect-metadata mode.
Patch 1 lifts this restriction by adding a struct ip_tunnel_encap to
the tunnel metadata. It can be filled by a new BPF kfunc introduced
in Patch 2 and evaluated by the ip_tunnel egress path. This will allow
us to use FOU and GUE encap with externally controlled ipip devices.
Patch 2 introduces two new BPF kfuncs: bpf_skb_{set,get}_fou_encap.
These helpers can be used to set and get UDP encap parameters from the
BPF tc-hook doing the packet redirect.
Patch 3 adds BPF tunnel selftests using the two kfuncs.
---
v3:
- Integrate selftest into test_progs (Alexei)
v2:
- Fixes for checkpatch.pl
- Fixes for kernel test robot
====================