Brett Creeley [Wed, 13 Oct 2021 16:02:19 +0000 (09:02 -0700)]
ice: Add support to print error on PHY FW load failure
Some devices have support for loading the PHY FW and in some cases this
can fail. When this fails, the FW will set the corresponding bit in the
link info structure. Also, the FW will send a link event if the correct
link event mask bit is set. Add support for printing an error message
when the PHY FW load fails during any link configuration flow and the
link event flow.
Since ice_check_module_power() is already doing something very similar
add a new function ice_check_link_cfg_err() so any failures reported in
the link info's link_cfg_err member can be printed in this one function.
Also, add the new ICE_FLAG_PHY_FW_LOAD_FAILED bit to the PF's flags so
we don't constantly print this error message during link polling if the
value never changed.
Marcin Szycik [Mon, 18 Oct 2021 11:30:32 +0000 (13:30 +0200)]
ice: Add support for changing MTU on PR in switchdev mode
This change adds support for changing MTU on port representor in
switchdev mode, by setting the min/max MTU values on port representor
netdev. Before it was possible to change the MTU only in a limited,
default range (68-1500).
Signed-off-by: Marcin Szycik <marcin.szycik@intel.com> Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Part of virtchannel messages are treated in different way in switchdev
mode to block configuring VFs from iavf driver side. This blocking was
done by doing nothing and returning success, event without sending
response.
Not sending response for opcodes that aren't supported in switchdev mode
leads to block iavf driver message handling. This happens for example
when vlan is configured at VF config time (VLAN module is already
loaded).
To get rid of it ice driver should answer for each VF message. In
switchdev mode:
- for adding/deleting VLAN driver should answer success without doing
anything to allow creating vlan device on VFs
- for enabling/disabling VLAN stripping and promiscuous mode driver
should answer not supported, this feature in switchdev can be only
set from host side
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Mostly reuse code from Geneve and VXLAN in TC parsing code. Add new GRE
header to match on correct fields. Create new dummy packets with GRE
fields.
Instead of checking if any encap values are presented in TC flower,
check if device is tunnel type or redirect is to tunnel device. This
will allow adding all combination of rules. For example filters only
with inner fields.
Return error in case device isn't tunnel but encap values are presented.
gre example:
- create tunnel device
ip l add $NVGRE_DEV type gretap remote $NVGRE_REM_IP local $VF1_IP \
dev $PF
- add tc filter (in switchdev mode)
tc filter add dev $NVGRE_DEV protocol ip parent ffff: flower dst_ip \
$NVGRE1_IP action mirred egress redirect dev $VF1_PR
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Add definition of UDP tunnel dummy packets. Fill destination port value
in filter based on UDP tunnel port. Append tunnel flags to switch filter
definition in case of matching the tunnel.
Both VXLAN and Geneve are UDP tunnels, so only one new header is needed.
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Add definition for VXLAN and Geneve dummy packet. Define VXLAN and
Geneve type of fields to match on correct UDP tunnel header.
Parse tunnel specific fields from TC tool like outer MACs, outer IPs,
outer destination port and VNI. Save values and masks in outer header
struct and move header pointer to inner to simplify parsing inner
values.
There are two cases for redirect action:
- from uplink to VF - TC filter is added on tunnel device
- from VF to uplink - TC filter is added on PR, for this case check if
redirect device is tunnel device
VXLAN example:
- create tunnel device
ip l add $VXLAN_DEV type vxlan id $VXLAN_VNI dstport $VXLAN_PORT \
dev $PF
- add TC filter (in switchdev mode)
tc filter add dev $VXLAN_DEV protocol ip parent ffff: flower \
enc_dst_ip $VF1_IP enc_key_id $VXLAN_VNI action mirred egress \
redirect dev $VF1_PR
Geneve example:
- create tunnel device
ip l add $GENEVE_DEV type geneve id $GENEVE_VNI dstport $GENEVE_PORT \
remote $GENEVE_IP
- add TC filter (in switchdev mode)
tc filter add dev $GENEVE_DEV protocol ip parent ffff: flower \
enc_key_id $GENEVE_VNI dst_ip $GENEVE1_IP action mirred egress \
redirect dev $VF1_PR
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Implement indirect notification mechanism to support offloading TC rules
on tunnel devices.
Keep indirect block list in netdev priv. Notification will call setting
tc cls flower function. For now we can offload only ingress type. Return
not supported for other flow block binder.
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Acked-by: Paul Menzel <pmenzel@molgen.mpg.de> Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Jakub Kicinski [Wed, 27 Oct 2021 15:20:12 +0000 (08:20 -0700)]
net: virtio: use eth_hw_addr_set()
Commit 406f42fa0d3c ("net-next: When a bond have a massive amount
of VLANs...") introduced a rbtree for faster Ethernet address look
up. To maintain netdev->dev_addr in this tree we need to make all
the writes to it go through appropriate helpers.
Even though the current code uses dev->addr_len the we can switch
to eth_hw_addr_set() instead of dev_addr_set(). The netdev is
always allocated by alloc_etherdev_mq() and there are at least two
places which assume Ethernet address:
- the line below calling eth_hw_addr_random()
- virtnet_set_mac_address() -> eth_commit_mac_addr_change()
Reduce extra indirection from devlink_params_*() API. Such change
makes it clear that we can drop devlink->lock from these flows, because
everything is executed when the devlink is not registered yet.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 28 Oct 2021 13:49:29 +0000 (14:49 +0100)]
Merge branch 'octeontx2-debugfs-updates'
Rakesh Babu Saladi says:
====================
RVU Debugfs updates.
Patch 1: Few minor changes such as spelling mistakes, deleting unwanted
characters, etc.
Patch 2: Add debugfs dump for lmtst map table
Patch 3: Add channel and channel mask in debugfs.
Changes made from v2 to v3:
1. In patch 1 moved few lines and submitted those changes as a
different patch to net branch
2. Patch 2 is left unchanged.
3. Patch 3 is left unchanged.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Rakesh Babu [Wed, 27 Oct 2021 18:07:45 +0000 (23:37 +0530)]
octeontx2-af: debugfs: Add channel and channel mask.
This patch is to dispaly channel and channel_mask for each RX
interface of NPC MCAM rule.
Signed-off-by: Rakesh Babu <rsaladi2@marvell.com> Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Harman Kalra [Wed, 27 Oct 2021 18:07:44 +0000 (23:37 +0530)]
octeontx2-af: cn10k: debugfs for dumping LMTST map table
CN10k SoCs use atomic stores of up to 128 bytes to submit
packets/instructions into co-processor cores. The enqueueing is performed
using Large Memory Transaction Store (LMTST) operations. They allow for
lockless enqueue operations - i.e., two different CPU cores can submit
instructions to the same queue without needing to lock the queue or
synchronize their accesses.
This patch implements a new debugfs entry for dumping LMTST map
table present on CN10K, as this might be very useful to debug any issue
in case of shared LMTST region among multiple pci functions.
Signed-off-by: Harman Kalra <hkalra@marvell.com> Signed-off-by: Bhaskara Budiredla <bbudiredla@marvell.com> Signed-off-by: Rakesh Babu <rsaladi2@marvell.com> Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Few changes in rvu_debugfs.c file to remove unwanted characters,
indenting the code, added a new comment line etc.
Signed-off-by: Rakesh Babu Saladi <rsaladi2@marvell.com> Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Carlos Llamas [Wed, 27 Oct 2021 23:18:02 +0000 (23:18 +0000)]
ptp: fix code indentation issues
This fixes the following checkpatch.pl errors:
ERROR: code indent should use tabs where possible
+^I if (ptp->pps_source)$
ERROR: code indent should use tabs where possible
+^I pps_unregister_source(ptp->pps_source);$
ERROR: code indent should use tabs where possible
+^I kthread_destroy_worker(ptp->kworker);$
Fixes: 4225fea1cb28 ("ptp: Fix possible memory leak in ptp_clock_register()") Signed-off-by: Carlos Llamas <cmllamas@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
luo penghao [Thu, 28 Oct 2021 03:15:51 +0000 (03:15 +0000)]
sky2: Remove redundant assignment and parentheses
The variable err will be reassigned on subsequent branches, and this
assignment does not perform related value operations. This will cause
the double parentheses to be redundant, so the inner parentheses should
be deleted.
net: ipconfig: Release the rtnl_lock while waiting for carrier
While waiting for a carrier to come on one of the netdevices, some
devices will require to take the rtnl lock at some point to fully
initialize all parts of the link.
That's the case for SFP, where the rtnl is taken when a module gets
detected. This prevents mounting an NFS rootfs over an SFP link.
This means that while ipconfig waits for carriers to be detected, no SFP
modules can be detected in the meantime, it's only detected after
ipconfig times out.
This commit releases the rtnl_lock while waiting for the carrier to come
up, and re-takes it to check the for the init device and carrier status.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Add a file to document devlink support for octeontx2
driver. Driver-specific parameters implemented by
AF, PF and VF drivers are documented.
Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com> Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
sch_htb: Add extack messages for EOPNOTSUPP errors
In order to make the "Operation not supported" message clearer to the
user, add extack messages explaining why exactly adding offloaded HTB
could be not supported in each case.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 28 Oct 2021 11:55:44 +0000 (12:55 +0100)]
Merge branch 'mvpp2-phylink'
Russell King says:
====================
Convert mvpp2 to phylink supported_interfaces
This patch series converts mvpp2 to use phylinks supported_interfaces
bitmap to simplify the validate() implementation. The patches:
1) Add the supported interface modes the supported_interfaces bitmap.
2) Removes the checks for the interface type being supported from
the validate callback
3) Removes the now unnecessary checks and call to
phylink_helper_basex_speed() to support switching between
1000base-X and 2500base-X for SFPs
4) Cleans up the resulting validate() code.
(3) becomes possible because when asking the MAC for its complete
support, we walk all supported interfaces which will include 1000base-X
and 2500base-X only if the comphy is present.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
mvpp2_phylink_validate() no longer needs to check for
PHY_INTERFACE_MODE_NA as phylink will walk the supported interface
types to discover the link mode capabilities. Remove these checks.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
net: mvpp2: drop use of phylink_helper_basex_speed()
Now that we have a better method to select SFP interface modes, we
no longer need to use phylink_helper_basex_speed() in a driver's
validation function, and we can also get rid of our hack to indicate
both 1000base-X and 2500base-X if the comphy is present to make that
work. Remove this hack and use of phylink_helper_basex_speed().
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
net: mvpp2: remove interface checks in mvpp2_phylink_validate()
As phylink checks the interface mode against the supported_interfaces
bitmap, we no longer need to validate the interface mode in the
validation function. Remove this to simplify it.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6: enable net.ipv6.route.max_size sysctl in network namespace
We want to increase route cache size in network namespace
created with user namespace. Currently ipv6 route settings
are disabled for non-initial network namespaces.
We can allow this sysctl and it will be safe since
commit <6126891c6d4f> because route cache account to kmem,
that is why users from user namespace can not DOS system.
Signed-off-by: Alexander Kuznetsov <wwfq@yandex-team.ru> Acked-by: Dmitry Yakunin <zeil@yandex-team.ru> Acked-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Tue, 26 Oct 2021 17:55:00 +0000 (10:55 -0700)]
mpt fusion: use dev_addr_set()
Commit 406f42fa0d3c ("net-next: When a bond have a massive amount
of VLANs...") introduced a rbtree for faster Ethernet address look
up. To maintain netdev->dev_addr in this tree we need to make all
the writes to it go through appropriate helpers.
Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Tue, 26 Oct 2021 17:53:52 +0000 (10:53 -0700)]
firewire: don't write directly to netdev->dev_addr
Commit 406f42fa0d3c ("net-next: When a bond have a massive amount
of VLANs...") introduced a rbtree for faster Ethernet address look
up. To maintain netdev->dev_addr in this tree we need to make all
the writes to it go through appropriate helpers.
Prepare fwnet_hwaddr on the stack and use dev_addr_set() to copy
it to netdev->dev_addr. We no longer need to worry about alignment.
union fwnet_hwaddr does not have any padding and we set all fields
so we don't need to zero it upfront.
Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Tue, 26 Oct 2021 17:52:50 +0000 (10:52 -0700)]
media: use eth_hw_addr_set()
Commit 406f42fa0d3c ("net-next: When a bond have a massive amount
of VLANs...") introduced a rbtree for faster Ethernet address look
up. To maintain netdev->dev_addr in this tree we need to make all
the writes to it go through appropriate helpers.
Convert media from memcpy(... 6) and memcpy(... addr_len) to
eth_hw_addr_set():
Eric Dumazet [Wed, 27 Oct 2021 20:19:21 +0000 (13:19 -0700)]
tcp: factorize ip_summed setting
Setting skb->ip_summed to CHECKSUM_PARTIAL can be centralized
in tcp_stream_alloc_skb() and __mptcp_do_alloc_tx_skb()
instead of being done multiple times.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
mptcp: Rework fwd memory allocation and one cleanup
These patches from the MPTCP tree rework forward memory allocation for
MPTCP (with some supporting changes in the net core), and also clean up
an unused function parameter.
Patch 1 updates TCP code but does not change any behavior, and creates
some macros for reclaim thresholds that will be reused in the MPTCP
code.
Patch 2 adds sk_forward_alloc_get() to the networking core to support
MPTCP's forward allocation with the diag interface.
Patch 3 reworks forward memory for MPTCP.
Patch 4 removes an unused arg and has no functional changes.
====================
Geliang Tang [Tue, 26 Oct 2021 23:29:16 +0000 (16:29 -0700)]
mptcp: drop unused sk in mptcp_push_release
Since mptcp_set_timeout() had removed from mptcp_push_release() in
commit 33d41c9cd74c5 ("mptcp: more accurate timeout"), the argument
sk in mptcp_push_release() became useless. Let's drop it.
Fixes: 33d41c9cd74c5 ("mptcp: more accurate timeout") Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Tue, 26 Oct 2021 23:29:15 +0000 (16:29 -0700)]
mptcp: allocate fwd memory separately on the rx and tx path
All the mptcp receive path is protected by the msk socket
spinlock. As consequences, the tx path has to play a few tricks to
allocate the forward memory without acquiring the spinlock multiple
times, making the overall TX path quite complex.
This patch tries to clean-up a bit the tx path, using completely
separated fwd memory allocation, for the rx and the tx path.
The forward memory allocated in the rx path is now accounted in
msk->rmem_fwd_alloc and is (still) protected by the msk socket spinlock.
To cope with the above we provide a few MPTCP-specific variants for
the helpers to charge, uncharge, reclaim and free the forward memory
in the receive path.
msk->sk_forward_alloc now accounts only the forward memory for the tx
path, we can use the plain core sock helper to manipulate it and drop
quite a bit of complexity.
On memory pressure, both rx and tx fwd memories are reclaimed.
Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Tue, 26 Oct 2021 23:29:14 +0000 (16:29 -0700)]
net: introduce sk_forward_alloc_get()
A later patch will change the MPTCP memory accounting schema
in such a way that MPTCP sockets will encode the total amount of
forward allocated memory in two separate fields (one for tx and
one for rx).
MPTCP sockets will use their own helper to provide the accurate
amount of fwd allocated memory.
To allow the above, this patch adds a new, optional, sk method to
fetch the fwd memory, wrap the call in a new helper and use it
where it is appropriate.
Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Tue, 26 Oct 2021 23:29:13 +0000 (16:29 -0700)]
tcp: define macros for a couple reclaim thresholds
A following patch is going to implement a similar reclaim schema
for the MPTCP protocol, with different locking.
Let's define a couple of macros for the used thresholds, so
that the latter code will be more easily maintainable.
Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Tue, 26 Oct 2021 21:30:14 +0000 (14:30 -0700)]
inet: remove races in inet{6}_getname()
syzbot reported data-races in inet_getname() multiple times,
it is time we fix this instead of pretending applications
should not trigger them.
getsockname() and getpeername() are not really considered fast path.
v2: added the missing BPF_CGROUP_RUN_SA_PROG() declaration
needed when CONFIG_CGROUP_BPF=n, as reported by
kernel test robot <lkp@intel.com>
syzbot typical report:
BUG: KCSAN: data-race in __inet_hash_connect / inet_getname
write to 0xffff888136d66cf8 of 2 bytes by task 14374 on cpu 1:
__inet_hash_connect+0x7ec/0x950 net/ipv4/inet_hashtables.c:831
inet_hash_connect+0x85/0x90 net/ipv4/inet_hashtables.c:853
tcp_v4_connect+0x782/0xbb0 net/ipv4/tcp_ipv4.c:275
__inet_stream_connect+0x156/0x6e0 net/ipv4/af_inet.c:664
inet_stream_connect+0x44/0x70 net/ipv4/af_inet.c:728
__sys_connect_file net/socket.c:1896 [inline]
__sys_connect+0x254/0x290 net/socket.c:1913
__do_sys_connect net/socket.c:1923 [inline]
__se_sys_connect net/socket.c:1920 [inline]
__x64_sys_connect+0x3d/0x50 net/socket.c:1920
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x44/0xa0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
read to 0xffff888136d66cf8 of 2 bytes by task 14408 on cpu 0:
inet_getname+0x11f/0x170 net/ipv4/af_inet.c:790
__sys_getsockname+0x11d/0x1b0 net/socket.c:1946
__do_sys_getsockname net/socket.c:1961 [inline]
__se_sys_getsockname net/socket.c:1958 [inline]
__x64_sys_getsockname+0x3e/0x50 net/socket.c:1958
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x44/0xa0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
value changed: 0x0000 -> 0xdee0
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 14408 Comm: syz-executor.3 Not tainted 5.15.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Yajun Deng [Wed, 27 Oct 2021 01:38:56 +0000 (09:38 +0800)]
xdp: Remove redundant warning
There is a warning in xdp_rxq_info_unreg_mem_model() when reg_state isn't
equal to REG_STATE_REGISTERED, so the warning in xdp_rxq_info_unreg() is
redundant.
Jakub Kicinski [Tue, 26 Oct 2021 17:55:47 +0000 (10:55 -0700)]
net: thunderbolt: use eth_hw_addr_set()
Commit 406f42fa0d3c ("net-next: When a bond have a massive amount
of VLANs...") introduced a rbtree for faster Ethernet address look
up. To maintain netdev->dev_addr in this tree we need to make all
the writes to it go through appropriate helpers.
Guenter Roeck [Tue, 26 Oct 2021 17:39:50 +0000 (10:39 -0700)]
net: macb: Fix mdio child node detection
Commit 4d98bb0d7ec2 ("net: macb: Use mdio child node for MDIO bus if it
exists") added code to detect if a 'mdio' child node exists to the macb
driver. Ths added code does, however, not actually check if the child node
exists, but if the parent node exists. This results in errors such as
macb 10090000.ethernet eth0: Could not attach PHY (-19)
if there is no 'mdio' child node. Fix the code to actually check for
the child node.
Fixes: 4d98bb0d7ec2 ("net: macb: Use mdio child node for MDIO bus if it exists") Cc: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Sean Anderson <sean.anderson@seco.com> Tested-by: Claudiu Beznea <claudiu.beznea@microchip.com> Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com> Link: https://lore.kernel.org/r/20211026173950.353636-1-linux@roeck-us.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Seth Forshee [Tue, 26 Oct 2021 18:37:21 +0000 (13:37 -0500)]
net: sch: simplify condtion for selecting mini_Qdisc_pair buffer
The only valid values for a miniq pointer are NULL or a pointer to
miniq1 or miniq2, so testing for miniq_old != &miniq1 is functionally
equivalent to testing that it is NULL or equal to &miniq2.
Seth Forshee [Tue, 26 Oct 2021 13:06:59 +0000 (08:06 -0500)]
net: sch: eliminate unnecessary RCU waits in mini_qdisc_pair_swap()
Currently rcu_barrier() is used to ensure that no readers of the
inactive mini_Qdisc buffer remain before it is reused. This waits for
any pending RCU callbacks to complete, when all that is actually
required is to wait for one RCU grace period to elapse after the buffer
was made inactive. This means that using rcu_barrier() may result in
unnecessary waits.
To improve this, store the current RCU state when a buffer is made
inactive and use poll_state_synchronize_rcu() to check whether a full
grace period has elapsed before reusing it. If a full grace period has
not elapsed, wait for a grace period to elapse, and in the non-RT case
use synchronize_rcu_expedited() to hasten it.
Since this approach eliminates the RCU callback it is no longer
necessary to synchronize_rcu() in the tp_head==NULL case. However, the
RCU state should still be saved for the previously active buffer.
Before this change I would typically see mini_qdisc_pair_swap() take
tens of milliseconds to complete. After this change it typcially
finishes in less than 1 ms, and often it takes just a few microseconds.
Thanks to Paul for walking me through the options for improving this.
This reverts commit 22849b5ea5952d853547cc5e0651f34a246b2a4f as it
revealed that mlxsw and netdevsim (copy/paste from mlxsw) reregisters
devlink objects during another devlink user triggered command.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Leon Romanovsky [Tue, 26 Oct 2021 19:40:41 +0000 (22:40 +0300)]
Revert "devlink: Remove not-executed trap group notifications"
This reverts commit 8bbeed4858239ac956a78e5cbaf778bd6f3baef8 as it
revealed that mlxsw and netdevsim (copy/paste from mlxsw) reregisters
devlink objects during another devlink user triggered command.
Fixes: 22849b5ea595 ("devlink: Remove not-executed trap policer notifications") Reported-by: syzbot+93d5accfaefceedf43c1@syzkaller.appspotmail.com Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David S. Miller [Wed, 27 Oct 2021 13:54:02 +0000 (14:54 +0100)]
Merge branch 'br-fdb-refactoring'
Vladimir Oltean says:
====================
Bridge FDB refactoring
This series refactors the br_fdb.c, br_switchdev.c and switchdev.c files
to offer the same level of functionality with a bit less code, and to
clarify the purpose of some functions.
No functional change intended.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
To reduce code churn, the same patch makes multiple changes, since they
all touch the same lines:
1. The implementations for these two are identical, just with different
function pointers. Reduce duplications and name the function pointers
"mod_cb" instead of "add_cb" and "del_cb". Pass the event as argument.
2. Drop the "const" attribute from "orig_dev". If the driver needs to
check whether orig_dev belongs to itself and then
call_switchdev_notifiers(orig_dev, SWITCHDEV_FDB_OFFLOADED), it
can't, because call_switchdev_notifiers takes a non-const struct
net_device *.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Tue, 26 Oct 2021 14:27:42 +0000 (17:27 +0300)]
net: bridge: create a common function for populating switchdev FDB entries
There are two places where a switchdev FDB entry is constructed, one is
br_switchdev_fdb_notify() and the other is br_fdb_replay(). One uses a
struct initializer, and the other declares the structure as
uninitialized and populates the elements one by one.
One problem when introducing new members of struct
switchdev_notifier_fdb_info is that there is a risk for one of these
functions to run with an uninitialized value.
So centralize the logic of populating such structure into a dedicated
function. Being the primary location where these structures are created,
using an uninitialized variable and populating the members one by one
should be fine, since this one function is supposed to assign values to
all its members.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
br_fdb_replay is only called from switchdev code paths, so it makes
sense to be disabled if switchdev is not enabled in the first place.
As opposed to br_mdb_replay and br_vlan_replay which might be turned off
depending on bridge support for multicast and VLANs, FDB support is
always on. So moving br_mdb_replay and br_vlan_replay inside
br_switchdev.c would mean adding some #ifdef's in br_switchdev.c, so we
keep those where they are.
The reason for the movement is that in future changes there will be some
code reuse between br_switchdev_fdb_notify and br_fdb_replay.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Tue, 26 Oct 2021 14:27:40 +0000 (17:27 +0300)]
net: bridge: reduce indentation level in fdb_create
We can express the same logic without an "if" condition as big as the
function, just return early if the kmem_cache_alloc() call fails.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Tue, 26 Oct 2021 14:27:39 +0000 (17:27 +0300)]
net: bridge: rename br_fdb_insert to br_fdb_add_local
br_fdb_insert() is a wrapper over fdb_insert() that also takes the
bridge hash_lock.
With fdb_insert() being renamed to fdb_add_local(), rename
br_fdb_insert() to br_fdb_add_local().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Tue, 26 Oct 2021 14:27:38 +0000 (17:27 +0300)]
net: bridge: rename fdb_insert to fdb_add_local
fdb_insert() is not a descriptive name for this function, and also easy
to confuse with __br_fdb_add(), fdb_add_entry(), br_fdb_update().
Even more confusingly, it is not even related in any way with those
functions, neither one calls the other.
Since fdb_insert() basically deals with the creation of a BR_FDB_LOCAL
entry and is called only from functions where that is the intention:
then rename it to fdb_add_local(), because its removal counterpart is
called fdb_delete_local().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
fdb_insert() has a forward declaration because its first caller,
br_fdb_changeaddr(), is declared before fdb_create(), a function which
fdb_insert() needs.
This patch moves the 2 functions above br_fdb_changeaddr() and deletes
the forward declaration for fdb_insert().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
fdb_notify() has a forward declaration because its first caller,
fdb_delete(), is declared before 3 functions that fdb_notify() needs:
fdb_to_nud(), fdb_fill_info() and fdb_nlmsg_size().
This patch moves the aforementioned 4 functions above fdb_delete() and
deletes the forward declaration.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 27 Oct 2021 13:50:11 +0000 (14:50 +0100)]
Merge branch 'mvneta-phylink'
Russell King says:
====================
Convert mvneta to phylink supported_interfaces
This patch series converts mvneta to use phylinks supported_interfaces
bitmap to simplify the validate() implementation. The patches:
1) Add the supported interface modes the supported_interfaces bitmap.
2) Removes the checks for the interface type being supported from
the validate callback
3) Removes the now unnecessary checks and call to
phylink_helper_basex_speed() to support switching between
1000base-X and 2500base-X for SFPs
(3) becomes possible because when asking the MAC for its complete
support, we walk all supported interfaces which will include 1000base-X
and 2500base-X only if the comphy is present.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
net: mvneta: drop use of phylink_helper_basex_speed()
Now that we have a better method to select SFP interface modes, we
no longer need to use phylink_helper_basex_speed() in a driver's
validation function, and we can also get rid of our hack to indicate
both 1000base-X and 2500base-X if the comphy is present to make that
work. Remove this hack and use of phylink_helper_basex_speed().
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
net: mvneta: remove interface checks in mvneta_validate()
As phylink checks the interface mode against the supported_interfaces
bitmap, we no longer need to validate the interface mode in the
validation function. Remove this to simplify it.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 27 Oct 2021 13:39:57 +0000 (14:39 +0100)]
Merge tag 'mlx5-updates-2021-10-26' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2021-10-26
HW-GRO support in mlx5
Beside the HW GRO this series includes two trivial non-mlx5 patches:
- net: Prevent HW-GRO and LRO features operate together
- lib: bitmap: Introduce node-aware alloc API
Khalid Manaa Says:
==================
This series implements the HW-GRO offload using the HW feature SHAMPO.
HW-GRO: Hardware offload for the Generic Receive Offload feature.
SHAMPO: Split Headers And Merge Payload Offload.
This feature performs headers data split for each received packed and
merge the payloads of the packets of the same session.
There are new HW components for this feature:
The headers buffer:
– cyclic buffer where the packets headers will be located
Reservation buffer:
– capability to divide RQ WQEs to reservations, a definite size in
granularity of 4KB, the reservation is used to define the largest segment
that we can create by packets stitching.
Each reservation will have a session and the new received packet can be merged
to the session, terminate it, or open a new one according to the match criteria.
When a new packet is received the headers will be written to the headers buffer
and the data will be written to the reservation, in case the packet matches
the session the data will be written continuously otherwise it will be written
after performing an alignment.
SHAMPO RQ, WQ and CQE changes:
-----------------------------
RQ (receive queue) new params:
-shampo_no_match_alignment_granularity: the HW alignment granularity in case
the received packet doesn't match the current session.
-shampo_match_criteria_type: the type of match criteria.
-reservation_timeout: the maximum time that the HW will hold the reservation.
-Each RQ has SKB that represents the current opened flow.
WQ (work queue) new params:
-headers_mkey: mkey that represents the headers buffer, where the packets
headers will be written by the HW.
-shampo_enable: flag to verify if the WQ supports SHAMPO feature.
-log_reservation_size: the log of the reservation size where the data of
the packet will be written by the HW.
-log_max_num_of_packets_per_reservation: log of the maximum number of packets
that can be written to the same reservation.
-log_headers_entry_size: log of the header entry size of the headers buffer.
-log_headers_buffer_entry_num: log of the entries number of the headers buffer.
CQEs (Completion queue entry) SHAMPO fields:
-match: in case it is set, then the current packet matches the opened session.
-flush: in case it is set, the opened session must be flushed.
-header_size: the size of the packet’s headers.
-header_entry_index: the entry index in the headers buffer of the received
packet headers.
-data_offset: the offset of the received packet data in the WQE.
HW-GRO works as follow:
----------------------
The feature can be enabled on the interface using the ethtool command by
setting on rx-gro-hw. When the feature is on the mlx5 driver will reopen
the RQ to support the SHAMPO feature:
Will allocate the headers buffer and fill the parameters regarding the
reservation and the match criteria.
Receive packet flow:
each RQ will hold SKB that represents the current GRO opened session.
The driver has a new CQE handler mlx5e_handle_rx_cqe_mpwrq_shampo which will
use the CQE SHAMPO params to extract the location of the packet’s headers
in the headers buffer and the location of the packets data in the RQ.
Also, the CQE has two flags flush and match that indicate if the current
packet matches the current session or not and if we need to close the session.
In case there is an opened session, and we receive a matched packet then the
handler will merge the packet's payload to the current SKB, in case we receive
no match then the handler will flush the SKB and create a new one for the new packet.
In case the flash flag is set then the driver will close the session, the SKB
will be passed to the network stack.
In case the driver merges packets in the SKB, before passing the SKB to the network
stack the driver will update the checksum of the packet’s headers.
SKB build:
---------
The driver will build a new SKB in the following situations:
in case there is no current opened session.
In case the current packet doesn’t match the current session.
In case there is no place to add the packets data to the SKB that represents the
current session.
Otherwise, the driver will add the packet’s data to the SKB.
When the driver builds a new SKB, the linear area will contain only the packet headers
and the data will be added to the SKB fragments.
In case the entry size of the headers buffer is sufficient to build the SKB
it will be used, otherwise the driver will allocate new memory to build the SKB.
==================
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Khalid Manaa [Tue, 13 Oct 2020 07:34:35 +0000 (10:34 +0300)]
net/mlx5e: Add HW_GRO statistics
This patch adds HW_GRO counters to RX packets statistics:
- gro_match_packets: counter of received packets with set match flag.
- gro_packets: counter of received packets over the HW_GRO feature,
this counter is increased by one for every received
HW_GRO cqe.
- gro_bytes: counter of received bytes over the HW_GRO feature,
this counter is increased by the received bytes for every
received HW_GRO cqe.
- gro_skbs: counter of built HW_GRO skbs,
increased by one when we flush HW_GRO skb
(when we call a napi_gro_receive with hw_gro skb).
- gro_large_hds: counter of received packets with large headers size,
in case the packet needs new SKB, the driver will allocate
new one and will not use the headers entry to build it.
this patch updates the SHAMPO CQE handler to support HW_GRO,
changes in the SHAMPO CQE handler:
- CQE match and flush fields are used to determine if to build new skb
using the new received packet,
or to add the received packet data to the existing RQ.hw_gro_skb,
also this fields are used to determine when to flush the skb.
- in the end of the function mlx5e_poll_rx_cq the RQ.hw_gro_skb is flushed.
Ben Ben-Ishay [Mon, 14 Sep 2020 09:51:57 +0000 (12:51 +0300)]
net/mlx5e: Add data path for SHAMPO feature
The header buffer is used to store the headers of the rx packets.
The header buffer size deduced from WorkQueue size + restriction
of max packets per WorkQueueElement.
This commit adds the functionality for posting/updating memory for
the header buffer during the posting/updating of WQEs.
Khalid Manaa [Tue, 19 May 2020 12:45:38 +0000 (15:45 +0300)]
net/mlx5e: Add handle SHAMPO cqe support
This patch adds the new CQE SHAMPO fields:
- flush: indicates that we must close the current session and pass the SKB
to the network stack.
- match: indicates that the current packet matches the oppened session,
the packet will be merge into the current SKB.
- header_size: the size of the packet headers that written into the headers
buffer.
- header_entry_index: the entry index in the headers buffer.
- data_offset: packets data offset in the WQE.
Also new cqe handler is added to handle SHAMPO packets:
- The new handler uses CQE SHAMPO fields to build the SKB.
CQE's Flush and match fields are not used in this patch, packets are not
merged in this patch.
Ben Ben-Ishay [Wed, 9 Jun 2021 09:28:57 +0000 (12:28 +0300)]
net/mlx5e: Add control path for SHAMPO feature
This commit introduces the control path infrastructure for SHAMPO feature.
SHAMPO feature enables packet stitching by splitting packets to
header and payload, the header is placed on a dedicated buffer
and the payload on the RX ring, this allows stitching the data part
of a flow together continuously in the receive buffer.
SHAMPO feature is implemented as linked list striding RQ feature.
To support packets splitting and payload stitching:
- Enlarge the ICOSQ and the correspond CQ to support the header buffer
memory regions.
- Add support to create linked list striding RQ with SHAMPO feature set
in the open_rq function.
- Add deallocation function and corresponded calls for SHAMPO header
buffer.
- Add mlx5e_create_umr_klm_mkey to support KLM mkey for the header
buffer.
- Rename mlx5e_create_umr_mkey to mlx5e_create_umr_mtt_mkey.
Ben Ben-Ishay [Tue, 14 Jul 2020 11:40:32 +0000 (14:40 +0300)]
net/mlx5e: Add support to klm_umr_wqe
This commit adds the needed definitions for using the klm_umr_wqe.
UMR stands for user-mode memory registration, is a mechanism to alter
address translation properties of MKEY by posting WorkQueueElement
aka WQE on send queue.
MKEY stands for memory key, MKEY are used to describe a region in memory that
can be later used by HW.
KLM stands for {Key, Length, MemVa}, KLM_MKEY is indirect MKEY that enables
to map multiple memory spaces with different sizes in unified MKEY.
klm_umr_wqe is a UMR that use to update a KLM_MKEY.
SHAMPO feature uses KLM_MKEY for memory registration of his header buffer.
Khalid Manaa [Wed, 9 Jun 2021 09:27:32 +0000 (12:27 +0300)]
net/mlx5e: Rename TIR lro functions to TIR packet merge functions
This series introduces new packet merge type, therefore rename lro
functions to packet merge to support the new merge type:
- Generalize + rename mlx5e_build_tir_ctx_lro to
mlx5e_build_tir_ctx_packet_merge.
- Rename mlx5e_modify_tirs_lro to mlx5e_modify_tirs_packet_merge.
- Rename lro bit in mlx5_ifc_modify_tir_bitmask_bits to packet_merge.
- Rename lro_en in mlx5e_params to packet_merge_type type and combine
packet_merge params into one struct mlx5e_packet_merge_param.
Ben Ben-Ishay [Wed, 9 Sep 2020 14:36:39 +0000 (17:36 +0300)]
net/mlx5: Add SHAMPO caps, HW bits and enumerations
This commit adds SHAMPO bit to hca_cap and SHAMPO capabilities structure,
SHAMPO related HW spec hardware fields and enumerations.
SHAMPO stands for: split headers and merge payload offload.
SHAMPO new fields:
WQ:
- headers_mkey: mkey that represents the headers buffer, where the packets
headers will be written by the HW.
- shampo_enable: flag to verify if the WQ supports SHAMPO feature.
- log_reservation_size: the log of the reservation size where the data of
the packet will be written by the HW.
- log_max_num_of_packets_per_reservation: log of the maximum number of
packets that can be written to the same reservation.
- log_headers_entry_size: log of the header entry size of the headers buffer.
- log_headers_buffer_entry_num: log of the entries number of the headers buffer.
RQ:
- shampo_no_match_alignment_granularity: the HW alignment granularity
in case the received packet doesn't match the current session.
- shampo_match_criteria_type: the type of match criteria.
- reservation_timeout: the maximum time that the HW will hold the
reservation.
mlx5_ifc_shampo_cap_bits, the capabilities of the SHAMPO feature:
- shampo_log_max_reservation_size: the maximum allowed value of the field
WQ.log_reservation_size.
- log_reservation_size: the minimum allowed value of the field
WQ.log_reservation_size.
- shampo_min_mss_size: the minimum payload size of packet that can open
a new session or be merged to a session.
- shampo_max_log_headers_entry_size: the maximum allowed value of the field
WQ.log_headers_entry_size
Ben Ben-Ishay [Thu, 2 Jul 2020 14:22:45 +0000 (17:22 +0300)]
net/mlx5e: Rename lro_timeout to packet_merge_timeout
TIR stands for transport interface receive, the TIR object is
responsible for performing all transport related operations on
the receive side like packet processing, demultiplexing the packets
to different RQ's, etc.
lro_timeout is a field in the TIR that is used to set the timeout for lro
session, this series introduces new packet merge type, therefore rename
lro_timeout to packet_merge_timeout for all packet merge types.
Ben Ben-ishay [Wed, 16 Dec 2020 10:32:24 +0000 (12:32 +0200)]
net: Prevent HW-GRO and LRO features operate together
LRO and HW-GRO are mutually exclusive, this commit adds this restriction
in netdev_fix_feature. HW-GRO is preferred, that means in case both
HW-GRO and LRO features are requested, LRO is cleared.
Luo Jie [Tue, 26 Oct 2021 10:29:57 +0000 (18:29 +0800)]
net: phy: fixed warning: Function parameter not described
Fixed warning: Function parameter or member 'enable' not
described in 'genphy_c45_fast_retrain'
Signed-off-by: Luo Jie <luoj@codeaurora.org> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20211026102957.17100-1-luoj@codeaurora.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 26 Oct 2021 15:29:39 +0000 (08:29 -0700)]
net/mlx5: remove the recent devlink params
revert commit 46ae40b94d88 ("net/mlx5: Let user configure io_eq_size param")
revert commit a6cb08daa3b4 ("net/mlx5: Let user configure event_eq_size param")
revert commit 554604061979 ("net/mlx5: Let user configure max_macs param")
The EQE parameters are applicable to more drivers, they should
be configured via standard API, probably ethtool. Example of
another driver needing something similar:
The last param for "max_macs" is probably fine but the documentation
is severely lacking. The meaning and implications for changing the
param need to be stated.
This series introduces a new bitmap to allow us to indicate which
phy_interface_t modes are supported.
Currently, phylink will call ->validate with PHY_INTERFACE_MODE_NA to
request all link mode capabilities from the MAC driver before choosing
an interface to use. This leads in some cases to some rather hairly
code. This can be simplified if phylink is aware of the interface modes
that the MAC supports, and it can instead walk those modes, calling
->validate for each one, and combining the results.
This series merely introduces the support; there is no change of
behaviour until MAC drivers populate their supported_interfaces bitmap.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
net: phylink: use supported_interfaces for phylink validation
If the network device supplies a supported interface bitmap, we can use
that during phylink's validation to simplify MAC drivers in two ways by
using the supported_interfaces bitmap to:
1. reject unsupported interfaces before calling into the MAC driver.
2. generate the set of all supported link modes across all supported
interfaces (used mainly for SFP, but also some 10G PHYs.)
Suggested-by: Sean Anderson <sean.anderson@seco.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for a bitmap for phy interface modes, which includes:
- a macro to declare the interface bitmap
- an inline helper to zero the interface bitmap
- an inline helper to detect an empty interface bitmap
- inline helpers to do a bitwise AND and OR operations on two interface
bitmaps
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 26 Oct 2021 14:07:36 +0000 (15:07 +0100)]
Merge branch 'dsa-isolation-prep'
Vladimir Oltean says:
====================
DSA preparations for FDB isolation between bridges
This series makes 2 small changes to DSA's SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE
handler, which will make it possible to offer switch drivers a stable
association between a FDB entry and a bridge device in a future series.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Tue, 26 Oct 2021 09:25:56 +0000 (12:25 +0300)]
net: dsa: stop calling dev_hold in dsa_slave_fdb_event
Now that we guarantee that SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE events have
finished executing by the time we leave our bridge upper interface,
we've established a stronger boundary condition for how long the
dsa_slave_switchdev_event_work() might run.
As such, it is no longer possible for DSA slave interfaces to become
unregistered, since they are still bridge ports.
So delete the unnecessary dev_hold() and dev_put().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Tue, 26 Oct 2021 09:25:55 +0000 (12:25 +0300)]
net: dsa: flush switchdev workqueue when leaving the bridge
DSA is preparing to offer switch drivers an API through which they can
associate each FDB entry with a struct net_device *bridge_dev. This can
be used to perform FDB isolation (the FDB lookup performed on the
ingress of a standalone, or bridged port, should not find an FDB entry
that is present in the FDB of another bridge).
In preparation of that work, DSA needs to ensure that by the time we
call the switch .port_fdb_add and .port_fdb_del methods, the
dp->bridge_dev pointer is still valid, i.e. the port is still a bridge
port.
This is not guaranteed because the SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE API
requires drivers that must have sleepable context to handle those events
to schedule the deferred work themselves. DSA does this through the
dsa_owq.
It can happen that a port leaves a bridge, del_nbp() flushes the FDB on
that port, SWITCHDEV_FDB_DEL_TO_DEVICE is notified in atomic context,
DSA schedules its deferred work, but del_nbp() finishes unlinking the
bridge as a master from the port before DSA's deferred work is run.
Fundamentally, the port must not be unlinked from the bridge until all
FDB deletion deferred work items have been flushed. The bridge must wait
for the completion of these hardware accesses.
An attempt has been made to address this issue centrally in switchdev by
making SWITCHDEV_FDB_DEL_TO_DEVICE deferred (=> blocking) at the switchdev
level, which would offer implicit synchronization with del_nbp:
but it seems that any attempt to modify switchdev's behavior and make
the events blocking there would introduce undesirable side effects in
other switchdev consumers.
The most undesirable behavior seems to be that
switchdev_deferred_process_work() takes the rtnl_mutex itself, which
would be worse off than having the rtnl_mutex taken individually from
drivers which is what we have now (except DSA which has removed that
lock since commit 0faf890fc519 ("net: dsa: drop rtnl_lock from
dsa_slave_switchdev_event_work")).
So to offer the needed guarantee to DSA switch drivers, I have come up
with a compromise solution that does not require switchdev rework:
we already have a hook at the last moment in time when the bridge is
still an upper of ours: the NETDEV_PRECHANGEUPPER handler. We can flush
the dsa_owq manually from there, which makes all FDB deletions
synchronous.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Lukas Wunner [Tue, 26 Oct 2021 05:15:32 +0000 (07:15 +0200)]
ifb: Depend on netfilter alternatively to tc
IFB originally depended on NET_CLS_ACT for traffic redirection.
But since v4.5, that may be achieved with NFT_FWD_NETDEV as well.
Fixes: 39e6dea28adc ("netfilter: nf_tables: add forward expression to the netdev family") Signed-off-by: Lukas Wunner <lukas@wunner.de> Cc: <stable@vger.kernel.org> # v4.5+: bcfabee1afd9: netfilter: nft_fwd_netdev: allow to redirect to ifb via ingress Cc: <stable@vger.kernel.org> # v4.5+ Signed-off-by: David S. Miller <davem@davemloft.net>
Jeremy Kerr [Tue, 26 Oct 2021 01:57:28 +0000 (09:57 +0800)]
mctp: Implement extended addressing
This change allows an extended address struct - struct sockaddr_mctp_ext
- to be passed to sendmsg/recvmsg. This allows userspace to specify
output ifindex and physical address information (for sendmsg) or receive
the input ifindex/physaddr for incoming messages (for recvmsg). This is
typically used by userspace for MCTP address discovery and assignment
operations.
The extended addressing facility is conditional on a new sockopt:
MCTP_OPT_ADDR_EXT; userspace must explicitly enable addressing before
the kernel will consume/populate the extended address data.
Includes a fix for an uninitialised var: Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
net: ax88796c: Remove pointless check in ax88796c_open()
Clang warns:
drivers/net/ethernet/asix/ax88796c_main.c:851:24: error: address of
array 'ax_local->phydev->advertising' will always evaluate to 'true'
[-Werror,-Wpointer-bool-conversion]
if (ax_local->phydev->advertising &&
~~~~~~~~~~~~~~~~~~^~~~~~~~~~~ ~~
advertising cannot be NULL here if ax_local is not NULL, which cannot
happen due to the check in ax88796c_probe(). Remove the check.
net: ax88796c: Fix clang -Wimplicit-fallthrough in ax88796c_set_mac()
Clang warns:
drivers/net/ethernet/asix/ax88796c_main.c:696:2: error: unannotated fall-through between switch labels [-Werror,-Wimplicit-fallthrough]
case SPEED_10:
^
drivers/net/ethernet/asix/ax88796c_main.c:696:2: note: insert 'break;' to avoid fall-through
case SPEED_10:
^
break;
drivers/net/ethernet/asix/ax88796c_main.c:706:2: error: unannotated fall-through between switch labels [-Werror,-Wimplicit-fallthrough]
case DUPLEX_HALF:
^
drivers/net/ethernet/asix/ax88796c_main.c:706:2: note: insert 'break;' to avoid fall-through
case DUPLEX_HALF:
^
break;
Clang is a little more pedantic than GCC, which permits implicit
fallthroughs to cases that contain just break or return. Clang's version
is more in line with the kernel's own stance in deprecated.rst, which
states that all switch/case blocks must end in either break,
fallthrough, continue, goto, or return. Add the missing breaks to fix
the warning.