This patch implemented the retransmition of ADD_ADDR when no ADD_ADDR echo
is received. It added a timer with the announced address. When timeout
occurs, ADD_ADDR will be retransmitted.
Suggested-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Suggested-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch added a new helper sk_stop_timer_sync, it deactivates a timer
like sk_stop_timer, but waits for the handler to finish.
Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new struct mptcp_pm_add_entry to describe add_addr's entry.
Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
selftests: mptcp: add remove addr and subflow test cases
This patch added the remove addr and subflow test cases and two new
functions.
The first function run_remove_tests calls do_transfer with two new
arguments, rm_nr_ns1 and rm_nr_ns2, for the numbers of addresses should be
removed during the transfer process in namespace 1 and namespace 2.
If both these two arguments are 0, we do the join test cases with
"mptcp_connect -j" command. Otherwise, do the remove test cases with
"mptcp_connect -r" command.
The second function chk_rm_nr checks the RM_ADDR related mibs's counters.
The output of the test cases looks like this:
11 remove single subflow syn[ ok ] - synack[ ok ] - ack[ ok ]
rm [ ok ] - sf [ ok ]
12 remove multiple subflows syn[ ok ] - synack[ ok ] - ack[ ok ]
rm [ ok ] - sf [ ok ]
13 remove single address syn[ ok ] - synack[ ok ] - ack[ ok ]
add[ ok ] - echo [ ok ]
rm [ ok ] - sf [ ok ]
14 remove subflow and signal syn[ ok ] - synack[ ok ] - ack[ ok ]
add[ ok ] - echo [ ok ]
rm [ ok ] - sf [ ok ]
15 remove subflows and signal syn[ ok ] - synack[ ok ] - ack[ ok ]
add[ ok ] - echo [ ok ]
rm [ ok ] - sf [ ok ]
Suggested-by: Matthieu Baerts <matthieu.baerts@tessares.net> Suggested-by: Paolo Abeni <pabeni@redhat.com> Suggested-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch added a new cfg, named cfg_remove in mptcp_connect. This new
cfg_remove is copied from cfg_join. The only difference between them is in
the do_rnd_write function. Here we slow down the transfer process of all
data to let the RM_ADDR suboption can be sent and received completely.
Otherwise the remove address and subflow test cases don't work.
Suggested-by: Matthieu Baerts <matthieu.baerts@tessares.net> Suggested-by: Paolo Abeni <pabeni@redhat.com> Suggested-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch added a new helper named mptcp_destroy_common containing the
shared code between mptcp_destroy() and mptcp_sock_destruct().
Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch added two new mibs for RM_ADDR, named MPTCP_MIB_RMADDR and
MPTCP_MIB_RMSUBFLOW, when the RM_ADDR suboption is received, increase
the first mib counter, when the local subflow is removed, increase the
second mib counter.
Suggested-by: Matthieu Baerts <matthieu.baerts@tessares.net> Suggested-by: Paolo Abeni <pabeni@redhat.com> Suggested-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implemented the local subflow removing function,
mptcp_pm_remove_subflow, it simply called mptcp_pm_nl_rm_subflow_received
under the PM spin lock.
We use mptcp_pm_remove_subflow to remove a local subflow, so change it's
argument from remote_id to local_id.
We check subflow->local_id in mptcp_pm_nl_rm_subflow_received to remove
a subflow.
Suggested-by: Matthieu Baerts <matthieu.baerts@tessares.net> Suggested-by: Paolo Abeni <pabeni@redhat.com> Suggested-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements the remove announced addr and subflow logic in PM
netlink.
When the PM netlink removes an address, we traverse all the existing msk
sockets to find the relevant sockets.
We add a new list named anno_list in mptcp_pm_data, to record all the
announced addrs. In the traversing, we check if it has been recorded.
If it has been, we trigger the RM_ADDR signal.
We also check if this address is in conn_list. If it is, we remove the
subflow which using this local address.
Since we call mptcp_pm_free_anno_list in mptcp_destroy, we need to move
__mptcp_init_sock before the mptcp_is_enabled check in mptcp_init_sock.
Suggested-by: Matthieu Baerts <matthieu.baerts@tessares.net> Suggested-by: Paolo Abeni <pabeni@redhat.com> Suggested-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The re-check of pm->accept_subflow with pm->lock held was missing, this
patch fixed it.
Suggested-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
selftests: mptcp: add ADD_ADDR mibs check function
This patch added the ADD_ADDR related mibs counter check function
chk_add_nr(). This function check both ADD_ADDR and ADD_ADDR with
echo flag.
The output looks like this:
07 unused signal address syn[ ok ] - synack[ ok ] - ack[ ok ]
add[ ok ] - echo [ ok ]
08 signal address syn[ ok ] - synack[ ok ] - ack[ ok ]
add[ ok ] - echo [ ok ]
09 subflow and signal syn[ ok ] - synack[ ok ] - ack[ ok ]
add[ ok ] - echo [ ok ]
10 multiple subflows and signal syn[ ok ] - synack[ ok ] - ack[ ok ]
add[ ok ] - echo [ ok ]
11 remove subflow and signal syn[ ok ] - synack[ ok ] - ack[ ok ]
add[ ok ] - echo [ ok ]
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch added two mibs for ADD_ADDR, MPTCP_MIB_ADDADDR for receiving
of the ADD_ADDR suboption with echo-flag=0, and MPTCP_MIB_ECHOADD for
receiving the ADD_ADDR suboption with echo-flag=1.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
When the ADD_ADDR suboption has been received, we need to send out the same
ADD_ADDR suboption with echo-flag=1, and no HMAC.
Suggested-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch added the RM_ADDR option parsing logic:
We parsed the incoming options to find if the rm_addr option is received,
and called mptcp_pm_rm_addr_received to schedule PM work to a new status,
named MPTCP_PM_RM_ADDR_RECEIVED.
PM work got this status, and called mptcp_pm_nl_rm_addr_received to handle
it.
In mptcp_pm_nl_rm_addr_received, we closed the subflow matching the rm_id,
and updated PM counter.
Suggested-by: Matthieu Baerts <matthieu.baerts@tessares.net> Suggested-by: Paolo Abeni <pabeni@redhat.com> Suggested-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch added a new signal named rm_addr_signal in PM. On outgoing path,
we called mptcp_pm_should_rm_signal to check if rm_addr_signal has been
set. If it has been, we sent out the RM_ADDR option.
Suggested-by: Matthieu Baerts <matthieu.baerts@tessares.net> Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
mptcp: rename addr_signal and the related functions
This patch renamed addr_signal and the related functions with the explicit
word "add".
Suggested-by: Matthieu Baerts <matthieu.baerts@tessares.net> Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 25 Sep 2020 02:54:40 +0000 (19:54 -0700)]
Merge tag 'mlx5-updates-2020-09-22' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2020-09-22
This series includes mlx5 updates
1) Add support for Connection Tracking offload in NIC mode.
Supporting CT offload in NIC mode on Mellanox cards is useful for
scenarios where the dual port NIC serves as a gateway between 2
networks and forwards traffic between these networks.
Since the traffic is not terminated on the host in this case,
no use of SRIOV VFs and/or switchdev mode is required.
Today Mellanox NIC cards already support offloading of packet forwarding
between physical ports without going to the host so combining it with CT
offloading allows users to create a gateway with forwarding and CT
(Including NAT) offloading capabilities in non-switchdev mode.
To support connection tracking in non-Switchdev mode (Single NIC mode),
we need to make use of the current Connection tracking infrastructure
implemented on top of E-Switch and the mlx5 generic flow table chains
APIs, to make it work on non-Eswitch steering domain e.g. NIC RX domain,
the following was performed:
1.1) Refactor current flow steering chains infrastructure and
updates TC nic mode implementation to use flow table chains.
1.2) Refactor current Connection Tracking (CT) infrastructure to not
assume E-switch backend, and make the CT layer agnostic to
underlying steering mode (E-Switch/NIC)
1.3) Plumbing to support CT offload in NIC mode.
2) Trivial code cleanups.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
====================
dpaa2-mac: add PCS support through the Lynx module
This patch set aims to add PCS support in the dpaa2-eth driver by
leveraging the Lynx PCS module.
The first two patches are some missing pieces: the first one adding
support for 10GBASER in Lynx PCS while the second one adds a new
function - of_mdio_find_device - which is helpful in retrieving the PCS
represented as a mdio_device. The final patch adds the glue logic
between phylink and the Lynx PCS module: it retrieves the PCS
represented as an mdio_device and registers it to Lynx and phylink.
From that point on, any PCS callbacks are treated by Lynx, without
dpaa2-eth interaction.
Changes in v2:
- move put_device() after destroy - 3/3
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
dpaa2-mac: add PCS support through the Lynx module
Include PCS support in the dpaa2-eth driver by integrating it with the
new Lynx PCS module. There is not much to talk about in terms of changes
needed in the dpaa2-eth driver since the only steps necessary are to
find the MDIO device representing the PCS, register it to the Lynx PCS
module and then let phylink know if its existence also.
After this, the PCS callbacks will be treated directly by Lynx, without
interraction from dpaa2-eth's part.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Russell King [Wed, 23 Sep 2020 15:41:22 +0000 (18:41 +0300)]
of: add of_mdio_find_device() api
Add a helper function which finds the mdio_device structure given a
device tree node. This is helpful for finding the PCS device based on a
DTS node but managing it as a mdio_device instead of a phy_device.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Wed, 23 Sep 2020 11:24:20 +0000 (14:24 +0300)]
net: mscc: ocelot: always pass skb clone to ocelot_port_add_txtstamp_skb
Currently, ocelot switchdev passes the skb directly to the function that
enqueues it to the list of skb's awaiting a TX timestamp. Whereas the
felix DSA driver first clones the skb, then passes the clone to this
queue.
This matters because in the case of felix, the common IRQ handler, which
is ocelot_get_txtstamp(), currently clones the clone, and frees the
original clone. This is useless and can be simplified by using
skb_complete_tx_timestamp() instead of skb_tstamp_tx().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
net: dsa: b53: Configure VLANs while not filtering
These two patches allow the b53 driver which always configures its CPU
port as egress tagged to behave correctly with VLANs being always
configured whenever a port is added to a bridge.
Vladimir provides a patch that aligns the bridge with vlan_filtering=0
receive path to behave the same as vlan_filtering=1. Per discussion with
Nikolay, this behavior is deemed to be too DSA specific to be done in
the bridge proper.
This is a preliminary series for Vladimir to make
configure_vlan_while_filtering the default behavior for all DSA drivers
in the future.
Thanks!
Changes in v3:
- added Vladimir's Acked-by tag to patch #2
- removed unnecessary if_vlan.h inclusion in patch #2
- reworded commit message to be accurate with the code changes
Changes in v2:
- moved the call to dsa_untag_bridge_pvid() into net/dsa/tag_brcm.c
since we have a single user for now
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
net: dsa: b53: Configure VLANs while not filtering
Update the B53 driver to support VLANs while not filtering. This
requires us to enable VLAN globally within the switch upon driver
initial configuration (dev->vlan_enabled).
We also need to remove the code that dealt with PVID re-configuration in
b53_vlan_filtering() since that function worked under the assumption
that it would only be called to make a bridge VLAN filtering, or not
filtering, and we would attempt to move the port's PVID accordingly.
Now that VLANs are programmed all the time, even in the case of a
non-VLAN filtering bridge, we would be programming a default_pvid for
the bridged switch ports.
We need the DSA receive path to pop the VLAN tag if it is the bridge's
default_pvid because the CPU port is always programmed tagged in the
programmed VLANs. In order to do so we utilize the
dsa_untag_bridge_pvid() helper introduced in the commit before within
net/dsa/tag_brcm.c.
Acked-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Wed, 23 Sep 2020 21:40:37 +0000 (14:40 -0700)]
net: dsa: untag the bridge pvid from rx skbs
Currently the bridge untags VLANs present in its VLAN groups in
__allowed_ingress() only when VLAN filtering is enabled.
But when a skb is seen on the RX path as tagged with the bridge's pvid,
and that bridge has vlan_filtering=0, and there isn't any 8021q upper
with that VLAN either, then we have a problem. The bridge will not untag
it (since it is supposed to remain VLAN-unaware), and pvid-tagged
communication will be broken.
There are 2 situations where we can end up like that:
1. When installing a pvid in egress-tagged mode, like this:
ip link add dev br0 type bridge vlan_filtering 0
ip link set swp0 master br0
bridge vlan del dev swp0 vid 1
bridge vlan add dev swp0 vid 1 pvid
This happens because DSA configures the VLAN membership of the CPU port
using the same flags as swp0 (in this case "pvid and not untagged"), in
an attempt to copy the frame as-is from ingress to the CPU.
However, in this case, the packet may arrive untagged on ingress, it
will be pvid-tagged by the ingress port, and will be sent as
egress-tagged towards the CPU. Otherwise stated, the CPU will see a VLAN
tag where there was none to speak of on ingress.
When vlan_filtering is 1, this is not a problem, as stated in the first
paragraph, because __allowed_ingress() will pop it. But currently, when
vlan_filtering is 0 and we have such a VLAN configuration, we need an
8021q upper (br0.1) to be able to ping over that VLAN, which is not
symmetrical with the vlan_filtering=1 case, and therefore, confusing for
users.
Basically what DSA attempts to do is simply an approximation: try to
copy the skb with (or without) the same VLAN all the way up to the CPU.
But DSA drivers treat CPU port VLAN membership in various ways (which is
a good segue into situation 2). And some of those drivers simply tell
the CPU port to copy the frame unmodified, which is the golden standard
when it comes to VLAN processing (therefore, any driver which can
configure the hardware to do that, should do that, and discard the VLAN
flags requested by DSA on the CPU port).
2. Some DSA drivers always configure the CPU port as egress-tagged, in
an attempt to recover the classified VLAN from the skb. These drivers
cannot work at all with untagged traffic when bridged in
vlan_filtering=0 mode. And they can't go for the easy "just keep the
pvid as egress-untagged towards the CPU" route, because each front port
can have its own pvid, and that might require conflicting VLAN
membership settings on the CPU port (swp1 is pvid for VID 1 and
egress-tagged for VID 2; swp2 is egress-taggeed for VID 1 and pvid for
VID 2; with this simplistic approach, the CPU port, which is really a
separate hardware entity and has its own VLAN membership settings, would
end up being egress-untagged in both VID 1 and VID 2, therefore losing
the VLAN tags of ingress traffic).
So the only thing we can do is to create a helper function for resolving
the problematic case (that is, a function which untags the bridge pvid
when that is in vlan_filtering=0 mode), which taggers in need should
call. It isn't called from the generic DSA receive path because there
are drivers that fall neither in the first nor second category.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 24 Sep 2020 01:02:49 +0000 (18:02 -0700)]
Merge branch 'PHY-subsystem-kernel-doc'
Andrew Lunn says:
====================
PHY subsystem kernel doc
The first patches fix existing warnings in the kerneldoc for the PHY
subsystem, and then the 2nd extend the kernel documentation for the
major structures and functions in the PHY subsystem.
v2:
Drop the other fixes which have already been merged.
s/phy/PHY/g
TBI
TypOs
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch series includes some additional changes to the bcm_sf2 in
order to support the Device Tree firmwares provided on such platforms.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
net: dsa: bcm_sf2: Disallow port 5 to be a DSA CPU port
While the switch driver is written such that port 5 or 8 could be CPU
ports, the use case on Broadcom STB chips is to use port 8 exclusively.
The platform firmware does make port 5 comply to a proper DSA CPU port
binding by specifiying an "ethernet" phandle. This is undesirable for
now until we have an user-space configuration mechanism (such as
devlink) which could support dynamically changing the port flavor at
run time.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
octeontx2: Add support for VLAN based flow distribution
This series add support for VLAN based flow distribution for octeontx2
netdev driver. This adds support for configuring the same via ethtool.
Following tests have been done.
- Multi VLAN flow with same SD
- Multi VLAN flow with same SDFN
- Single VLAN flow with multi SD
- Single VLAN flow with multi SDFN
All tests done for udp/tcp both v4 and v6
====================
Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
George Cherian [Tue, 22 Sep 2020 13:07:27 +0000 (18:37 +0530)]
octeontx2-pf: Support to change VLAN based RSS hash options via ethtool
Add support to control rx-flow-hash based on VLAN.
By default VLAN plus 4-tuple based hashing is enabled.
Changes can be done runtime using ethtool
To enable 2-tuple plus VLAN based flow distribution
# ethtool -N <intf> rx-flow-hash <prot> sdv
To enable 4-tuple plus VLAN based flow distribution
# ethtool -N <intf> rx-flow-hash <prot> sdfnv
Signed-off-by: George Cherian <george.cherian@marvell.com> Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
George Cherian [Tue, 22 Sep 2020 13:07:26 +0000 (18:37 +0530)]
octeontx2-af: Add support for VLAN based RSS hashing
Added support for PF/VF drivers to choose RSS flow key algorithm
with VLAN tag included in hashing input data. Only CTAG is considered.
Signed-off-by: George Cherian <george.cherian@marvell.com> Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
kernel-doc expects the function prototype to be just after
the kernel-doc markup, as otherwise it will get it all wrong:
./net/core/dev.c:10036: warning: Excess function parameter 'dev' description in 'WAIT_REFS_MIN_MSECS'
Fixes: 0e4be9e57e8c ("net: use exponential backoff in netdev_wait_allrefs") Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Reviewed-by: Francesco Ruggeri <fruggeri@arista.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Marko [Tue, 22 Sep 2020 10:16:32 +0000 (12:16 +0200)]
net: mdio-ipq4019: add Clause 45 support
While up-streaming the IPQ4019 driver it was thought that the controller had no Clause 45 support,
but it actually does and its activated by writing a bit to the mode register.
So lets add it as newer SoC-s use the same controller and Clause 45 compliant PHY-s.
Signed-off-by: Robert Marko <robert.marko@sartura.hr> Cc: Luka Perkov <luka.perkov@sartura.hr> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Marko [Tue, 22 Sep 2020 10:16:31 +0000 (12:16 +0200)]
net: mdio-ipq4019: change defines to upper case
In the commit adding the IPQ4019 MDIO driver, defines for timeout and sleep partially used lower case.
Lets change it to upper case in line with the rest of driver defines.
Signed-off-by: Robert Marko <robert.marko@sartura.hr> Cc: Luka Perkov <luka.perkov@sartura.hr> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
Introduce mbox tracepoints for Octeontx2
This patchset adds tracepoints support for mailbox.
In Octeontx2, PFs and VFs need to communicate with AF
for allocating and freeing resources. Once all the
configuration is done by AF for a PF/VF then packet I/O
can happen on PF/VF queues. When an interface
is brought up many mailbox messages are sent
to AF for initializing queues. Say a VF is brought up
then each message is sent to PF and PF forwards to
AF and response also traverses from AF to PF and then VF.
To aid debugging, tracepoints are added at places where
messages are allocated, sent and message interrupts.
Below is the trace of one of the messages from VF to AF
and AF response back to VF:
~ # echo 1 > /sys/kernel/tracing/events/rvu/enable
~ # ifconfig eth20 up
[ 279.379559] eth20 NIC Link is UP 10000 Mbps Full duplex
~ # cat /sys/kernel/tracing/trace
ifconfig-171 [000] .... 275.753345: otx2_msg_alloc: [0002:02:00.1] msg:(0x400) size:40
ifconfig-171 [000] ...1 275.753347: otx2_msg_send: [0002:02:00.1] sent 1 msg(s) of size:48
v3 changes:
Removed EXPORT_TRACEPOINT_SYMBOLS of otx2_msg_send and otx2_msg_check
since they are called locally only
v2 changes:
Removed otx2_msg_err tracepoint since it is similar to devlink_hwerr
and it will be used instead when devlink supported is added.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
With tracepoints support present in the mailbox
code this patch adds tracepoints in PF and VF drivers
at places where mailbox messages are allocated,
sent and at message interrupts.
Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com> Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Added tracepoints in mailbox code so that
the mailbox operations like message allocation,
sending message and message interrupts are traced.
Also the mailbox errors occurred like timeout
or wrong responses are traced.
These will help in debugging mailbox issues.
Here's an example output showing one of the mailbox
messages sent by PF to AF and AF responding to it:
~# mount -t tracefs none /sys/kernel/tracing/
~# echo 1 > /sys/kernel/tracing/events/rvu/enable
~# ifconfig eth0 up
~# cat /sys/kernel/tracing/trace
kworker/u49:0-1165 [000] ...1 756.162051: otx2_msg_send: [0002:01:00.0] sent 1 msg(s) of size:32
kworker/u49:0-1165 [000] d.h. 756.162056: otx2_msg_interrupt: [0002:02:00.0] mbox interrupt AF to PF (0x1)
Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com> Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Barry Song [Tue, 22 Sep 2020 01:56:15 +0000 (13:56 +1200)]
net: allwinner: remove redundant irqsave and irqrestore in hardIRQ
The comment "holders of db->lock must always block IRQs" and related
code to do irqsave and irqrestore don't make sense since we are in a
IRQ-disabled hardIRQ context.
Cc: Maxime Ripard <mripard@kernel.org> Cc: Chen-Yu Tsai <wens@csie.org> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Acked-by: Maxime Ripard <mripard@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
A number of static variables were not modified. Make them const to allow
the compiler to put them in read-only memory. In order to do so,
constify a couple of input pointers as well as some local pointers.
This moves about 35Kb to read-only memory as seen by the output of the
size command.
Before:
text data bss dec hex filename
404938 111534 640 517112 7e3f8 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge.ko
After:
text data bss dec hex filename
439499 76974 640 517113 7e3f9 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge.ko
Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ariel Levkovich [Sun, 30 Aug 2020 19:38:11 +0000 (22:38 +0300)]
net/mlx5e: Keep direct reference to mlx5_core_dev in tc ct
Keep and use a direct reference to the mlx5 core device in all of
tc_ct code instead of accessing it via a pointer to mlx5 eswitch
in order to support nic mode ct offload for VF devices that don't
have a valid eswitch pointer set.
Oz Shlomo [Sun, 19 Jul 2020 13:51:39 +0000 (13:51 +0000)]
net/mlx5e: CT: Use the same counter for both directions
A connection is represented by two 5-tuple entries, one for each direction.
Currently, each direction allocates its own hw counter, which is
inefficient as ct aging is managed per connection.
Share the counter that was allocated for the original direction with the
reverse direction.
Signed-off-by: Oz Shlomo <ozsh@mellanox.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Adding support to perform CT related tc actions and
matching on CT states for nic flows.
The ct flows management and handling will be done using a new
instance of the ct database that is declared in this patch to
keep it separate from the eswitch ct flows database.
Offloading and unoffloading ct flows will be done using the
existing ct offload api by providing it the relevant ct
database reference in each mode.
In addition, refactoring the tc ct api is introduced to make it
agnostic to the flow type and perform the resource allocations
and rule insertion to the proper steering domain in the device.
In the initialization call, the api requests and stores in the ct
database instance all the relevant information that distinguishes
between nic flows and esw flows, such as chains database, steering
namespace and mod hdr table.
This way the operations of adding and removing ct flows to the device
can later performed agnostically to the flow type.
net/mlx5e: Add tc chains offload support for nic flows
Allow adding nic tc flow rules with goto chain action.
Connecting the nic flows to the mlx5 chains infrastructure in previous
patches allows us to support the creation of chained flow tables and
rules that direct to another chain for further packet processing.
This is a required preparation to support CT offloads for nic tc flows.
We allow the creation of 256 different chains for nic flows since we
have 8 bits available for the chain restore tag in case of a miss.
In order to support chains and connection tracking offload for
nic flows, there's a need to introduce a common flow attributes
struct so that these features can be agnostic and have access to
a single attributes struct, regardless of the flow type.
Therefore, a new tc flow attributes format is introduced to allow
access to attributes that are common to eswitch and nic flows.
The common attributes will always get allocated for the new flows,
regardless of their type, while the type specific attributes are
separated into different structs and will be allocated based on the
flow type to avoid memory waste.
When allocating the flow attributes the caller provides the flow
steering namespace and according the namespace type the additional
space for the extra, type specific, attributes is determined and
added to the total attribute allocation size.
In addition, the attributes that are going to be common to both
flow types are moved to the common attributes struct.
net/mlx5e: Split nic tc flow allocation and creation
For future support of CT offload with nic tc flows, where
the flow rule is not created immediately but rather following
a future event, the patch is splitting the nic rule creation
and deletion into 2 parts:
1. Creating/Deleting and setting the rule attributes.
2. Creating/Deleting the flow table and flow rule itself.
This way the attributes can be prepared and stored in the
flow handle when the tc flow is created but the rule can
actually be created at any point in the future, using these
pre allocated attributes.
net/mlx5e: Tc nic flows to use mlx5_chains flow tables
Change nic tc flows offload path to use the chains and prios
infrastructure for the flow table creation as a preparation to
support tc multi chains and priorities for nic flows.
Adding an instance of the table chaining database to the nic tc struct
and perform the root table creation and desctuction via the chains api
while keeping the limit of a single chain (0) in nic tc mode.
This will be extendable to supporting multiple chains in the following
patches.
The flow table sizes and default miss table parameters that are provided
to the chains creation api are kept the same.
Allow setting a flow table with a lower level
as a rule destination in nic rx tables.
This is required in order to support table chaining
of tc nic flows.
Decouple the chains infrastructure from eswitch and make
it generic to support other steering namespaces.
The change defines an agnostic data structure to keep
all the relevant information for maintaining flow table
chaining in any steering namespace. Each namespace that
requires table chaining will be required to allocate
such data structure.
The chains creation code will receive the steering namespace
and flow table parameters from the caller so it will operate
agnosticly when creating the required resources to
maintain the table chaining function while Parts of the code
that are relevant to eswitch specific functionality are moved
to eswitch files.
This is the second part of the IGMPv3/MLDv2 support which adds support
for the fast-path. In order to be able to handle source entries we add
mdb support for S,G entries (i.e. we add source address support to
br_ip), that requires to extend the current mdb netlink API, fortunately
we just add another attribute which will contain nested future mdb
attributes, then we use it to add support for S,G user- add, del and
dump. The lookup sequence is simple: when IGMPv3/MLDv2 are enabled do
the S,G lookup first and if it fails fallback to *,G. The more complex
part is when we begin handling source lists and auto-installing S,G entries
and *,G filter mode transitions. We have the following cases:
1) *,G INCLUDE -> EXCLUDE transition: we need to install the port in
all of *,G's installed S,G entries for proper replication (except
the ones explicitly blocked), this is also necessary when adding a
new *,G EXCLUDE port group
2) *,G EXCLUDE -> INCLUDE transition: we need to remove the port from
all of *,G's installed S,G entries, this is also necessary when
removing a *,G port group
3) New S,G port entry: we need to install all current *,G EXCLUDE ports
4) Remove S,G port entry: if all other port groups were auto-installed we
can safely remove them and delete the whole S,G entry
Currently we compute these operations from the available ports, their
source lists and their filter mode. In the future we can extend the port
group structure and reduce the running time of these ops. Also one
current limitation is that host-joined S,G entries are not supported.
I.e. one cannot add "dev bridge port bridge" mdb S,G entries. The host
join is currently considered an EXCLUDE {} join, so it's reflected in
all of *,G's installed S,G entries. If an S,G,port entry is added as
temporary then the kernel can take it over if a source shows up from a
report, permanent entries are skipped. In order to properly handle
blocked sources we add a new port group blocked flag to avoid forwarding
to that port group in the S,G. Finally when forwarding we use the port
group filter mode (if it's INCLUDE and the port group is from a *,G then
don't replicate to it, respectively if it's EXCLUDE then forward) and the
blocked flag (obviously if it's set - skip that port unless it's a
router port) to decide if the port should be skipped. Another limitation
is that we can't do some of the above transitions without small traffic
drop while installing/removing entries. That will be taken care of when
we add atomic swap of port replication lists later.
Patch break down:
patches 1-3: prepare the mdb code for better extack support which is
used in future patches to return a more meaningful error
patches 4-6: add the source address field to struct br_ip, and do minor
cleanups around it
patches 7-8: extend the mdb netlink API so we can send new mdb
attributes and uses the new API for S,G entry add/del/dump
support
patch 9: takes care of S,G entries when doing a lookup (first S,G
then *,G lookup)
patch 10: adds a new port group field and attribute for origin protocol
we use the already available RTPROT_ definitions,
currently user-space entries are added as RTPROT_STATIC and
kernel entries are added as RTPROT_KERNEL, we may allow
user-space to set custom values later (e.g. for FRR, clag)
patch 11: adds an internal S,G,port rhashtable to speed up filter
mode transitions
patch 12: initial automatic install of S,G entries based on port
groups' source lists
patch 13: handles port group modes on transitions or when new
port group entries are added
patch 14: self-explanatory - adds support for blocked port group
entries needed to stop forwarding to particular S,G,port
entries
patch 15: handles host-join/leave state changes, treats host-joins
as EXCLUDE {} groups (reflected in all *,G's S,G entries)
patch 16: finally adds the fast-path filter mode and block flag
support
Here're the sets that will come next (in order):
- iproute2 support for IGMPv3/MLDv2
- selftests for all mode transitions and group flags
- explicit host tracking for proper fast-leave support
- atomic port replication lists (these are also needed for broadcast
forwarding optimizations)
- mode transition optimization and removal of open-coded sorted lists
Not implemented yet:
- Host IGMPv3/MLDv2 filter support (currently we handle only join/leave
as before)
- Proper other querier source timer and value updates
- IGMPv3/v2 MLDv2/v1 compat (I have a few rough patches for this one)
v2: fix build with CONFIG_BATMAN_ADV_MCAST in patch 6
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Since host joins are considered as EXCLUDE {} joins we need to reflect
that in all of *,G ports' S,G entries. Since the S,Gs can have
host_joined == true only set automatically we can safely set it to false
when removing all automatically added entries upon S,G delete.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: bridge: mcast: add support for blocked port groups
When excluding S,G entries we need a way to block a particular S,G,port.
The new port group flag is managed based on the source's timer as per
RFCs 3376 and 3810. When a source expires and its port group is in
EXCLUDE mode, it will be blocked.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: bridge: mcast: handle port group filter modes
We need to handle group filter mode transitions and initial state.
To change a port group's INCLUDE -> EXCLUDE mode (or when we have added
a new port group in EXCLUDE mode) we need to add that port to all of
*,G ports' S,G entries for proper replication. When the EXCLUDE state is
changed from IGMPv3 report, br_multicast_fwd_filter_exclude() must be
called after the source list processing because the assumption is that
all of the group's S,G entries will be created before transitioning to
EXCLUDE mode, i.e. most importantly its blocked entries will already be
added so it will not get automatically added to them.
The transition EXCLUDE -> INCLUDE happens only when a port group timer
expires, it requires us to remove that port from all of *,G ports' S,G
entries where it was automatically added previously.
Finally when we are adding a new S,G entry we must add all of *,G's
EXCLUDE ports to it.
In order to distinguish automatically added *,G EXCLUDE ports we have a
new port group flag - MDB_PG_FLAGS_STAR_EXCL.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: bridge: mcast: install S,G entries automatically based on reports
This patch adds support for automatic install of S,G mdb entries based
on the port group's source list and the source entry's timer.
Once installed the S,G will be used when forwarding packets if the
approprate multicast/mld versions are set. A new source flag called
BR_SGRP_F_INSTALLED denotes if the source has a forwarding mdb entry
installed.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
To speedup S,G forward handling we need to be able to quickly find out
if a port is a member of an S,G group. To do that add a global S,G port
rhashtable with key: source addr, group addr, protocol, vid (all br_ip
fields) and port pointer.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: bridge: mcast: add rt_protocol field to the port group struct
We need to be able to differentiate between pg entries created by
user-space and the kernel when we start generating S,G entries for
IGMPv3/MLDv2's fast path. User-space entries are created by default as
RTPROT_STATIC and the kernel entries are RTPROT_KERNEL. Later we can
allow user-space to provide the entry rt_protocol so we can
differentiate between who added the entries specifically (e.g. clag,
admin, frr etc).
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: bridge: mdb: add support for add/del/dump of entries with source
Add new mdb attributes (MDBE_ATTR_SOURCE for setting,
MDBA_MDB_EATTR_SOURCE for dumping) to allow add/del and dump of mdb
entries with a source address (S,G). New S,G entries are created with
filter mode of MCAST_INCLUDE. The same attributes are used for IPv4 and
IPv6, they're validated and parsed based on their protocol.
S,G host joined entries which are added by user are not allowed yet.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: bridge: mdb: add support to extend add/del commands
Since the MDB add/del code expects an exact struct br_mdb_entry we can't
really add any extensions, thus add a new nested attribute at the level of
MDBA_SET_ENTRY called MDBA_SET_ENTRY_ATTRS which will be used to pass
all new options via netlink attributes. This patch doesn't change
anything functionally since the new attribute is not used yet, only
parsed.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: bridge: mcast: rename br_ip's u member to dst
Since now we have src in br_ip, u no longer makes sense so rename
it to dst. No functional changes.
v2: fix build with CONFIG_BATMAN_ADV_MCAST
CC: Marek Lindner <mareklindner@neomailbox.ch> CC: Simon Wunderlich <sw@simonwunderlich.de> CC: Antonio Quartulli <a@unstable.cc> CC: Sven Eckelmann <sven@narfation.org> CC: b.a.t.m.a.n@lists.open-mesh.org Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: bridge: mcast: use br_ip's src for src groups and querier address
Now that we have src and dst in br_ip it is logical to use the src field
for the cases where we need to work with a source address such as
querier source address and group source address.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: bridge: mdb: move all port and bridge checks to br_mdb_add
To avoid doing duplicate device checks and searches (the same were done
in br_mdb_add and __br_mdb_add) pass the already found port to __br_mdb_add
and pull the bridge's netif_running and enabled multicast checks to
br_mdb_add. This would also simplify the future extack errors.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
drivers/net/ethernet/realtek/8139cp.c: In function cp_tx_timeout:
drivers/net/ethernet/realtek/8139cp.c:1242:6: warning: variable ‘rc’ set but not used [-Wunused-but-set-variable]
`rc` is never used, so remove it.
Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Olsa [Wed, 23 Sep 2020 18:57:34 +0000 (20:57 +0200)]
bpf: Check CONFIG_BPF option for resolve_btfids
Currently all the resolve_btfids 'users' are under CONFIG_BPF
code, so if we have CONFIG_BPF disabled, resolve_btfids will
fail, because there's no data to resolve.
Disabling resolve_btfids if there's CONFIG_BPF disabled,
so we won't fail such builds.
this is a pull request of 20 patches for net-next.
The complete series target the flexcan driver and is created by Joakim
Zhang and me.
The first six patches are cleanup (sort include files alphabetically,
remove stray empty line, get rid of long lines) and adding more
registers and documentation (registers and wakeup interrupt).
Then in two patches the transceiver regulator is made optional, and a
check for maximum transceiver bitrate is added.
Then the ECC support for HW thats supports this is added.
The next three patches improve suspend and low power mode handling.
Followed by six patches that add CAN-FD support and CAN-FD related
features.
The last two patches add support for the flexcan IP core on the imx8qm
and lx2160ar1.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
please apply the following patch series for qeth to netdev's net-next tree.
This brings all sorts of cleanups. Highlights are more code sharing in
the init/teardown paths, and more fine-grained rollback on errors during
initialization (instead of a full-blown teardown).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Clarify which discipline-specific steps are needed to roll back after
error in qeth_l?_set_online(), and which are common to roll back
from qeth_hardsetup_card().
Some steps (cancelling the RX modeset, draining the TX queues) are only
necessary if the netdev was potentially UP before, so move them to the
common qeth_set_offline().
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Originators of cmd IO typically hold the rtnl or conf_mutex to protect
against a concurrent teardown.
Since qeth_set_offline() already holds the conf_mutex, the main reason
why we still care about cancelling pending cmds is so that they release
the rtnl when we need it ourselves.
So move this step a little earlier into the teardown sequence.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The programming of ucast IPs via qeth_l3_modify_ip() is driven
independently from any of our typical locking mechanisms (eg. detaching
the netdevice, or holding the conf_mutex).
So when we inspect the card state to check whether the required cmd IO
should be deferred, there is no protection against concurrent state
changes.
But by slightly re-ordering the teardown sequence, we can rely on the
ip_lock to sufficiently serialize things:
1. when running concurrently to qeth_l3_set_online(), any instance of
qeth_l3_modify_ip() that aquires the ip_lock _after_
qeth_l3_recover_ip() will observe the state as CARD_STATE_SOFTSETUP
and not defer the IO.
2. when running concurrently to qeth_l3_set_offline(), any instance of
qeth_l3_modify_ip() that aquires the ip_lock _after_
qeth_l3_clear_ip_htable() will observe the state as CARD_STATE_DOWN
and defer the IO.
These guarantees in mind, we can now drop the conf_mutex from the
qeth_l3_modify_rxip_vipa() wrapper.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Convert the remaining occurences in sysfs code to kstrtouint().
While at it move some input parsing out of locked sections, replace an
open-coded clamp() and remove some unnecessary run-time checks for
ipatoe->mask_bits that are already enforced when creating the object.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
card->ipato is currently protected by the conf_mutex. But most users
also hold the ip_lock - in particular qeth_l3_add_ip().
So slightly expand the sections under ip_lock in a few places (to
effectively cover a few error & no-op cases), and then drop the
conf_mutex where it's no longer needed.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Arrays with designated initializers have an implicit length of the highest
initialized value plus one. I used this to ensure that newly added entries
in enum bpf_reg_type get a NULL entry in compatible_reg_types.
This is difficult to understand since it requires knowledge of the
peculiarities of designated initializers. Use __BPF_ARG_TYPE_MAX to size
the array instead.
net: microchip: Make `lan743x_pm_suspend` function return right value
drivers/net/ethernet/microchip/lan743x_main.c: In function lan743x_pm_suspend:
`ret` is set but not used. In fact, `pci_prepare_to_sleep` function value should
be the right value of `lan743x_pm_suspend` function, therefore, fix it.
Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 23 Sep 2020 00:44:59 +0000 (17:44 -0700)]
Merge tag 'mlx5-updates-2020-09-21' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2020-09-21
Multi packet TX descriptor support for SKBs.
This series introduces some refactoring of the regular TX data path in
mlx5 and adds the Enhanced TX MPWQE feature support. MPWQE stands for
multi-packet work queue element, and it can serve multiple packets,
reducing the PCI bandwidth spent on control traffic. It should improve
performance in scenarios where PCI is the bottleneck, and xmit_more is
signaled by the kernel. The refactoring done in this series also
improves the packet rate on its own.
MPWQE is already implemented in the XDP tx path, this series adds the
support of MPWQE for regular kernel SKB tx path.
MPWQE is supported from ConnectX-5 and onward, for legacy devices we need
to keep backward compatibility for regular (Single packet) WQE descriptor.
MPWQE is not compatible with certain offloads and features, such as TLS
offload, TSO, nonlinear SKBs. If such incompatible features are in use,
the driver gracefully falls back to non-MPWQE per SKB.
Prior to the final patch "net/mlx5e: Enhanced TX MPWQE for SKBs" that adds
the actual support, Maxim did some refactoring to the tx data path to
split it into stages and smaller helper functions that can be utilized and
reused for both legacy and new MPWQE feature.
Performance testing:
UDP performance is improved in a single stream pktgen test:
Packet rate: 16.86 Mpps (±0.15 Mpps) -> 20.94 Mpps (±0.33 Mpps)
Instructions per packet: 434 -> 329
Cycles per packet: 158 -> 123
Instructions per cycle: 2.75 -> 2.67
TCP and XDP_TX single stream tests show no performance difference.
MPWQE can reduce PCI bandwidth:
PCI Gen2, pktgen at fixed rate of 36864000 pps on 24 CPU cores:
Inbound PCI utilization with MPWQE off: 80.3%
Inbound PCI utilization with MPWQE on: 59.0%
PCI Gen3, pktgen at fixed rate of 56064000 pps on 24 CPU cores:
Inbound PCI utilization with MPWQE off: 65.4%
Inbound PCI utilization with MPWQE on: 49.3%
MPWQE can also reduce CPU load, increasing the packet rate in case of
CPU bottleneck:
PCI Gen2, pktgen at full rate on 24 CPU cores:
Packet rate with MPWQE off: 37.5 Mpps
Packet rate with MPWQE on: 49.0 Mpps
PCI Gen3, pktgen at full rate on 24 CPU cores:
Packet rate with MPWQE off: 57.0 Mpps
Packet rate with MPWQE on: 66.8 Mpps
====================
devlink: Use nla_policy to validate range
This two small patches uses nla_policy to validate user specified
fields are in valid range or not.
Patch summary:
Patch-1 checks the range of eswitch mode field
Patch-2 checks for the port type field. It eliminates a check in
code by using nla policy infrastructure.
====================
Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
1) net/ipv4/route.c, adding a new local variable while
moving another local variable and removing it's
initial assignment.
2) drivers/net/dsa/microchip/ksz9477.c, overlapping changes.
One pretty prints the port mode differently, whilst another
changes the driver to try and obtain the port mode from
the port node rather than the switch node.
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
"No common topic, just assorted fixes"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fuse: fix the ->direct_IO() treatment of iov_iter
fs: fix cast in fsparam_u32hex() macro
vboxsf: Fix the check for the old binary mount-arguments struct
- fix failure to add bond interfaces to a bridge, the offload-handling
code was too defensive there and recent refactoring unearthed that.
Users complained (Ido)
- fix unnecessarily reflecting ECN bits within TOS values / QoS marking
in TCP ACK and reset packets (Wei)
- fix a deadlock with bpf iterator. Hopefully we're in the clear on
this front now... (Yonghong)
- BPF fix for clobbering r2 in bpf_gen_ld_abs (Daniel)
- fix AQL on mt76 devices with FW rate control and add a couple of AQL
issues in mac80211 code (Felix)
- fix authentication issue with mwifiex (Maximilian)
- WiFi connectivity fix: revert IGTK support in ti/wlcore (Mauro)
- fix exception handling for multipath routes via same device (David
Ahern)
- revert back to a BH spin lock flavor for nsid_lock: there are paths
which do require the BH context protection (Taehee)
- fix interrupt / queue / NAPI handling in the lantiq driver (Hauke)
- fix ife module load deadlock (Cong)
- make an adjustment to netlink reply message type for code added in
this release (the sole change touching uAPI here) (Michal)
- a number of fixes for small NXP and Microchip switches (Vladimir)
[ Pull request acked by David: "you can expect more of this in the
future as I try to delegate more things to Jakub" ]
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (167 commits)
net: mscc: ocelot: fix some key offsets for IP4_TCP_UDP VCAP IS2 entries
net: dsa: seville: fix some key offsets for IP4_TCP_UDP VCAP IS2 entries
net: dsa: felix: fix some key offsets for IP4_TCP_UDP VCAP IS2 entries
inet_diag: validate INET_DIAG_REQ_PROTOCOL attribute
net: bridge: br_vlan_get_pvid_rcu() should dereference the VLAN group under RCU
net: Update MAINTAINERS for MediaTek switch driver
net/mlx5e: mlx5e_fec_in_caps() returns a boolean
net/mlx5e: kTLS, Avoid kzalloc(GFP_KERNEL) under spinlock
net/mlx5e: kTLS, Fix leak on resync error flow
net/mlx5e: kTLS, Add missing dma_unmap in RX resync
net/mlx5e: kTLS, Fix napi sync and possible use-after-free
net/mlx5e: TLS, Do not expose FPGA TLS counter if not supported
net/mlx5e: Fix using wrong stats_grps in mlx5e_update_ndo_stats()
net/mlx5e: Fix multicast counter not up-to-date in "ip -s"
net/mlx5e: Fix endianness when calculating pedit mask first bit
net/mlx5e: Enable adding peer miss rules only if merged eswitch is supported
net/mlx5e: CT: Fix freeing ct_label mapping
net/mlx5e: Fix memory leak of tunnel info when rule under multipath not ready
net/mlx5e: Use synchronize_rcu to sync with NAPI
net/mlx5e: Use RCU to protect rq->xdp_prog
...