git.proxmox.com Git - mirror_ubuntu-kernels.git/log

Merge tag 'rxrpc-iothread-20240305' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
Here are some changes to AF_RXRPC:

(1) Cache the transmission serial number of ACK and DATA packets in the
     rxrpc_txbuf struct and log this in the retransmit tracepoint.

(2) Don't use atomics on rxrpc_txbuf::flags[*] and cache the intended wire
     header flags there too to avoid duplication.

(3) Cache the wire checksum in rxrpc_txbuf to make it easier to create
     jumbo packets in future (which will require altering the wire header
     to a jumbo header and restoring it back again for retransmission).

(4) Fix the protocol names in the wire ACK trailer struct.

(5) Strip all the barriers and atomics out of the call timer tracking[*].

(6) Remove atomic handling from call->tx_transmitted and
     call->acks_prev_seq[*].

(7) Don't bother resetting the DF flag after UDP packet transmission.  To
     change it, we now call directly into UDP code, so it's quick just to
     set it every time.

(8) Merge together the DF/non-DF branches of the DATA transmission to
     reduce duplication in the code.

(9) Add a kvec array into rxrpc_txbuf and start moving things over to it.
     This paves the way for using page frags.

(10) Split (sub)packet preparation and timestamping out of the DATA
     transmission function.  This helps pave the way for future jumbo
     packet generation.

(11) In rxkad, don't pick values out of the wire header stored in
     rxrpc_txbuf, buf rather find them elsewhere so we can remove the wire
     header from there.

(12) Move rxrpc_send_ACK() to output.c so that it can be merged with
     rxrpc_send_ack_packet().

(13) Use rxrpc_txbuf::kvec[0] to access the wire header for the packet
     rather than directly accessing the copy in rxrpc_txbuf.  This will
     allow that to be removed to a page frag.

(14) Switch from keeping the transmission buffers in rxrpc_txbuf allocated
     in the slab to allocating them using page fragment allocators.  There
     are separate allocators for DATA packets (which persist for a while)
     and control packets (which are discarded immediately).

     We can then turn on MSG_SPLICE_PAGES when transmitting DATA and ACK
     packets.

     We can also get rid of the RCU cleanup on rxrpc_txbufs, preferring
     instead to release the page frags as soon as possible.

(15) Parse received packets before handling timeouts as the former may
     reset the latter.

(16) Make sure we don't retransmit DATA packets after all the packets have
     been ACK'd.

(17) Differentiate traces for PING ACK transmission.

(18) Switch to keeping timeouts as ktime_t rather than a number of jiffies
     as the latter is too coarse a granularity.  Only set the call timer at
     the end of the call event function from the aggregate of all the
     timeouts, thereby reducing the number of timer calls made.  In future,
     it might be possible to reduce the number of timers from one per call
     to one per I/O thread and to use a high-precision timer.

(19) Record RTT probes after successful transmission rather than recording
     it before and then cancelling it after if unsuccessful[*].  This
     allows a number of calls to get the current time to be removed.

(20) Clean up the resend algorithm as there's now no need to walk the
     transmission buffer under lock[*].  DATA packets can be retransmitted
     as soon as they're found rather than being queued up and transmitted
     when the locked is dropped.

(21) When initially parsing a received ACK packet, extract some of the
     fields from the ack info to the skbuff private data.  This makes it
     easier to do path MTU discovery in the future when the call to which a
     PING RESPONSE ACK refers has been deallocated.

[*] Possible with the move of almost all code from softirq context to the
    I/O thread.

Link: https://lore.kernel.org/r/20240301163807.385573-1-dhowells@redhat.com/
Link: https://lore.kernel.org/r/20240304084322.705539-1-dhowells@redhat.com/
* tag 'rxrpc-iothread-20240305' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (21 commits)
  rxrpc: Extract useful fields from a received ACK to skb priv data
  rxrpc: Clean up the resend algorithm
  rxrpc: Record probes after transmission and reduce number of time-gets
  rxrpc: Use ktimes for call timeout tracking and set the timer lazily
  rxrpc: Differentiate PING ACK transmission traces.
  rxrpc: Don't permit resending after all Tx packets acked
  rxrpc: Parse received packets before dealing with timeouts
  rxrpc: Do zerocopy using MSG_SPLICE_PAGES and page frags
  rxrpc: Use rxrpc_txbuf::kvec[0] instead of rxrpc_txbuf::wire
  rxrpc: Move rxrpc_send_ACK() to output.c with rxrpc_send_ack_packet()
  rxrpc: Don't pick values out of the wire header when setting up security
  rxrpc: Split up the DATA packet transmission function
  rxrpc: Add a kvec[] to the rxrpc_txbuf struct
  rxrpc: Merge together DF/non-DF branches of data Tx function
  rxrpc: Do lazy DF flag resetting
  rxrpc: Remove atomic handling on some fields only used in I/O thread
  rxrpc: Strip barriers and atomics off of timer tracking
  rxrpc: Fix the names of the fields in the ACK trailer struct
  rxrpc: Note cksum in txbuf
  rxrpc: Convert rxrpc_txbuf::flags into a mask and don't use atomics
  ...
====================

Link: https://lore.kernel.org/r/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: usbnet: Remove generic .ndo_get_stats64

Commit 3e2f544dd8a33 ("net: get stats64 if device if driver is
configured") moved the callback to dev_get_tstats64() to net core, so,
unless the driver is doing some custom stats collection, it does not
need to set .ndo_get_stats64.

Since this driver is now relying in NETDEV_PCPU_STAT_TSTATS, then, it
doesn't need to set the dev_get_tstats64() generic .ndo_get_stats64
function pointer.

Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Oliver Neukum <oneukum@suse.com>
Link: https://lore.kernel.org/r/20240306142643.2429409-2-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: usbnet: Leverage core stats allocator

With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core and
convert veth & vrf"), stats allocation could be done on net core
instead of in this driver.

With this new approach, the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc). This is core responsibility now.

Remove the allocation in the usbnet driver and leverage the network
core allocation instead.

Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Oliver Neukum <oneukum@suse.com>
Link: https://lore.kernel.org/r/20240306142643.2429409-1-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dpll: spec: use proper enum for pin capabilities attribute

The enum is defined, however the pin capabilities attribute does
refer to it. Add this missing enum field.

This fixes ynl cli output:

Example current output:
$ sudo ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/dpll.yaml --do pin-get --json '{"id": 0}'
{'capabilities': 4,
...
Example new output:
$ sudo ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/dpll.yaml --do pin-get --json '{"id": 0}'
{'capabilities': {'state-can-change'},
...

Fixes: 3badff3a25d8 ("dpll: spec: Add Netlink spec in YAML")
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20240306120739.1447621-1-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: mv88e6xxx: update 88e6185 PCS driver to use neg_mode

Update the Marvell 88e6185 PCS driver to use neg_mode rather than the
mode argument to match the other updated PCS drivers.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://lore.kernel.org/r/E1rhosE-003yuc-FM@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: pcs: rzn1-miic: update PCS driver to use neg_mode

Update the RZN1-MIIC PCS driver to use neg_mode rather than the mode
argument to match the other updated PCS drivers.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/E1rhos9-003yuW-Az@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: marvell: add comment about m88e1111_config_init_1000basex()

The comment in m88e1111_config_init_1000basex() is wrong - it claims
that Autoneg will be enabled, but this doesn't actually happen.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/E1rhos4-003yuQ-5p@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netlink: let core handle error cases in dump operations

After commit b5a899154aa9 ("netlink: handle EMSGSIZE errors
in the core"), we can remove some code that was not 100 % correct
anyway.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306102426.245689-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/net: fix waiting time for ipv6_gc test in fib_tests.sh.

ipv6_gc fails occasionally. According to the study, fib6_run_gc() using
jiffies_round() to round the GC interval could increase the waiting time up
to 750ms (3/4 seconds). The timer has a granularity of 512ms at the range
4s to 32s. That means a route with an expiration time E seconds can wait
for more than E * 2 + 1 seconds if the GC interval is also E seconds.

E * 2 + 2 seconds should be enough for waiting for removing routes.

Also remove a check immediately after replacing 5 routes since it is very
likely to remove some of routes before completing the last route with a
slow environment.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240305183949.258473-1-thinker.li@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mpls: Do not orphan the skb

We observed that TCP-pacing was falling back to the TCP-layer pacing
instead of utilizing sch_fq for the pacing. This causes significant
CPU-usage due to the hrtimer running on a per-TCP-connection basis.

The issue is that mpls_xmit() calls skb_orphan() and thus sets
skb->sk to NULL. Which implies that many of the goodies of TCP won't
work. Pacing falls back to TCP-layer pacing. TCP Small Queues does not
work, ...

It is safe to remove this call to skb_orphan() in mpls_xmit() as there
really is not reason for it to be there. It appears that this call to
skb_orphan comes from the very initial implementation of MPLS.

Cc: Roopa Prabhu <roopa@nvidia.com>
Reported-by: Craig Taylor <cmtaylor@apple.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Link: https://lore.kernel.org/r/20240306181117.77419-1-cpaasch@apple.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: Leverage core stats allocator

With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core
and convert veth & vrf"), stats allocation could be done on net core
instead of in this driver.

With this new approach, the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc). This is core responsibility now.

Remove the allocation in the DSA user network device code and leverage
the network core allocation instead.

Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240306200416.2973179-1-florian.fainelli@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

atm: fore200e: Convert to platform remove callback returning void

The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is ignored (apart
from emitting a warning) and this typically results in resource leaks.

To improve here there is a quest to make the remove callback return
void. In the first step of this quest all drivers are converted to
.remove_new(), which already returns void. Eventually after all drivers
are converted, .remove_new() will be renamed to .remove().

Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240306212344.97985-2-u.kleine-koenig@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'tools-net-ynl-add-support-for-nlctrl-netlink-family'

Donald Hunter says:

====================
tools/net/ynl: Add support for nlctrl netlink family

This series adds a new YNL spec for the nlctrl family, plus some fixes
and enhancements for ynl.

Patch 1 fixes an extack decoding bug
Patch 2 gives cleaner netlink error reporting
Patch 3 fixes an array-nest codegen bug
Patch 4 adds nest-type-value support to ynl
Patch 5 fixes the ynl schemas to allow empty enum-name attrs
Patch 6 contains the nlctrl spec
====================

Link: https://lore.kernel.org/r/20240306231046.97158-1-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

doc/netlink/specs: Add spec for nlctrl netlink family

Add a spec for the nlctrl family.

Example usage:

./tools/net/ynl/cli.py \
    --spec Documentation/netlink/specs/nlctrl.yaml \
    --do getfamily --json '{"family-name": "nlctrl"}'

./tools/net/ynl/cli.py \
    --spec Documentation/netlink/specs/nlctrl.yaml \
    --dump getpolicy --json '{"family-name": "nlctrl"}'

Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20240306231046.97158-7-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

doc/netlink: Allow empty enum-name in ynl specs

Update the ynl schemas to allow the specification of empty enum names
for all enum code generation.

Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20240306231046.97158-6-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools/net/ynl: Add nest-type-value decoding

The nlctrl genetlink-legacy family uses nest-type-value encoding as
described in Documentation/userspace-api/netlink/genetlink-legacy.rst

Add nest-type-value decoding to ynl.

Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20240306231046.97158-5-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools/net/ynl: Fix c codegen for array-nest

ynl-gen-c generates e.g. 'calloc(mcast_groups, sizeof(*dst->mcast_groups))'
for array-nest attrs when it should be 'n_mcast_groups'.

Add a 'n_' prefix in the generated code for array-nests.

Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20240306231046.97158-4-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools/net/ynl: Report netlink errors without stacktrace

ynl does not handle NlError exceptions so they get reported like program
failures. Handle the NlError exceptions and report the netlink errors
more cleanly.

Example now:

Netlink error: No such file or directory
nl_len = 44 (28) nl_flags = 0x300 nl_type = 2
error: -2 extack: {'bad-attr': '.op'}

Example before:

Traceback (most recent call last):
  File "/home/donaldh/net-next/./tools/net/ynl/cli.py", line 81, in <module>
    main()
  File "/home/donaldh/net-next/./tools/net/ynl/cli.py", line 69, in main
    reply = ynl.dump(args.dump, attrs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/donaldh/net-next/tools/net/ynl/lib/ynl.py", line 906, in dump
    return self._op(method, vals, [], dump=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/donaldh/net-next/tools/net/ynl/lib/ynl.py", line 872, in _op
    raise NlError(nl_msg)
lib.ynl.NlError: Netlink error: No such file or directory
nl_len = 44 (28) nl_flags = 0x300 nl_type = 2
error: -2 extack: {'bad-attr': '.op'}

Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20240306231046.97158-3-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools/net/ynl: Fix extack decoding for netlink-raw

Extack decoding was using a hard-coded msg header size of 20 but
netlink-raw has a header size of 16.

Use a protocol specific msghdr_size() when decoding the attr offssets.

Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20240306231046.97158-2-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'isdn-constify-struct-class-usage'

Ricardo B. Marliere says:

====================
isdn: constify struct class usage

This is a simple and straight forward cleanup series that aims to make the
class structures in isdn constant. This has been possible since 2023 [1].

[1]: https://lore.kernel.org/all/2023040248-customary-release-4aec@gregkh/
====================

Link: https://lore.kernel.org/r/20240305-class_cleanup-isdn-v1-0-6f0edca75b61@marliere.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

isdn: capi: make capi_class constant

Since commit 43a7206b0963 ("driver core: class: make class_register() take
a const *"), the driver core allows for struct class to be in read-only
memory, so move the capi_class structure to be declared at build time
placing it into read-only memory, instead of having to be dynamically
allocated at boot time.

Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240305-class_cleanup-isdn-v1-2-6f0edca75b61@marliere.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

isdn: mISDN: make elements_class constant

Since commit 43a7206b0963 ("driver core: class: make class_register() take
a const *"), the driver core allows for struct class to be in read-only
memory, so move the elements_class structure to be declared at build time
placing it into read-only memory, instead of having to be dynamically
allocated at boot time.

Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240305-class_cleanup-isdn-v1-1-6f0edca75b61@marliere.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: dp83822: change ti,rmii-mode description

Drop reference to the 25MHz clock as it has nothing to do with connecting
the PHY and the MAC.
Add info about the reference clock direction between the PHY and the MAC
as it depends on the selected rmii mode.

Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jérémie Dautheribes <jeremie.dautheribes@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20240305141309.127669-1-jeremie.dautheribes@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: x25: remove dead links from Kconfig

Remove the "You can read more about X.25 at" links provided in
Kconfig as they have not pointed at any relevant pages for quite
a while.

An old copy of https://www.sangoma.com/tutorials/x25/ can be
retrieved via https://archive.org/web/ but nothing useful seems
to have been preserved for http://docwiki.cisco.com/wiki/X.25

For the sake of necromancy and those who really did want to
read more about X.25, a previous incarnation of Kconfig included
a link to:
http://www.cisco.com/univercd/cc/td/doc/product/software/ios11/cbook/cx25.htm

Which can still be read at:
https://web.archive.org/web/20071013101232/http://cisco.com/en/US/docs/ios/11_0/router/configuration/guide/cx25.html

Signed-off-by: Justin Swartz <justin.swartz@risingedge.co.za>
Acked-by: Martin Schiller <ms@dev.tdt.de>
Link: https://lore.kernel.org/r/20240306112659.25375-1-justin.swartz@risingedge.co.za
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl: check for overflow of constructed messages

Donald points out that we don't check for overflows.
Stash the length of the message on nlmsg_pid (nlmsg_seq would
do as well). This allows the attribute helpers to remain
self-contained (no extra arguments). Also let the put
helpers continue to return nothing. The error is checked
only in (newly introduced) ynl_msg_end().

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20240305185000.964773-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR.

No conflicts.

Adjacent changes:

net/core/page_pool_user.c
0b11b1c5c320 ("netdev: let netlink core handle -EMSGSIZE errors")
429679dcf7d9 ("page_pool: fix netlink dump stop/resume")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'net-6.8-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
"Including fixes from bpf, ipsec and netfilter.

  No solution yet for the stmmac issue mentioned in the last PR, but it
  proved to be a lockdep false positive, not a blocker.

  Current release - regressions:

   - dpll: move all dpll<>netdev helpers to dpll code, fix build
     regression with old compilers

  Current release - new code bugs:

   - page_pool: fix netlink dump stop/resume

  Previous releases - regressions:

   - bpf: fix verifier to check bpf_func_state->callback_depth when
     pruning states as otherwise unsafe programs could get accepted

   - ipv6: avoid possible UAF in ip6_route_mpath_notify()

   - ice: reconfig host after changing MSI-X on VF

   - mlx5:
       - e-switch, change flow rule destination checking
       - add a memory barrier to prevent a possible null-ptr-deref
       - switch to using _bh variant of of spinlock where needed

  Previous releases - always broken:

   - netfilter: nf_conntrack_h323: add protection for bmp length out of
     range

   - bpf: fix to zero-initialise xdp_rxq_info struct before running XDP
     program in CPU map which led to random xdp_md fields

   - xfrm: fix UDP encapsulation in TX packet offload

   - netrom: fix data-races around sysctls

   - ice:
       - fix potential NULL pointer dereference in ice_bridge_setlink()
       - fix uninitialized dplls mutex usage

   - igc: avoid returning frame twice in XDP_REDIRECT

   - i40e: disable NAPI right after disabling irqs when handling
     xsk_pool

   - geneve: make sure to pull inner header in geneve_rx()

   - sparx5: fix use after free inside sparx5_del_mact_entry

   - dsa: microchip: fix register write order in ksz8_ind_write8()

  Misc:

   - selftests: mptcp: fixes for diag.sh"

* tag 'net-6.8-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (63 commits)
  net: pds_core: Fix possible double free in error handling path
  netrom: Fix data-races around sysctl_net_busy_read
  netrom: Fix a data-race around sysctl_netrom_link_fails_count
  netrom: Fix a data-race around sysctl_netrom_routing_control
  netrom: Fix a data-race around sysctl_netrom_transport_no_activity_timeout
  netrom: Fix a data-race around sysctl_netrom_transport_requested_window_size
  netrom: Fix a data-race around sysctl_netrom_transport_busy_delay
  netrom: Fix a data-race around sysctl_netrom_transport_acknowledge_delay
  netrom: Fix a data-race around sysctl_netrom_transport_maximum_tries
  netrom: Fix a data-race around sysctl_netrom_transport_timeout
  netrom: Fix data-races around sysctl_netrom_network_ttl_initialiser
  netrom: Fix a data-race around sysctl_netrom_obsolescence_count_initialiser
  netrom: Fix a data-race around sysctl_netrom_default_path_quality
  netfilter: nf_conntrack_h323: Add protection for bmp length out of range
  netfilter: nf_tables: mark set as dead when unbinding anonymous set with timeout
  netfilter: nft_ct: fix l3num expectations with inet pseudo family
  netfilter: nf_tables: reject constant set with timeout
  netfilter: nf_tables: disallow anonymous set with timeout flag
  net/rds: fix WARNING in rds_conn_connect_if_down
  net: dsa: microchip: fix register write order in ksz8_ind_write8()
  ...

Merge branch 'tcp-add-two-missing-addresses-when-using-trace'

Jason Xing says:

====================
tcp: add two missing addresses when using trace

When I reviewed other people's patch [1], I noticed that similar things
also happen in tcp_event_skb class and tcp_event_sk_skb class. They
don't print those two addrs of skb/sk which already exist.

In this patch, I just do as other trace functions do, like
trace_net_dev_start_xmit(), to know the exact flow or skb we would like
to know in case some systems doesn't support BPF programs well or we
have to use /sys/kernel/debug/tracing only for some reasons.

[1]
Link: https://lore.kernel.org/netdev/CAL+tcoAhvFhXdr1WQU8mv_6ZX5nOoNpbOLAB6=C+DB-qXQ11Ew@mail.gmail.com/
v2
Link: https://lore.kernel.org/netdev/CANn89iJcScraKAUk1GzZFoOO20RtC9iXpiJ4LSOWT5RUAC_QQA@mail.gmail.com/
1. change the description.
====================

Link: https://lore.kernel.org/r/20240304092934.76698-1-kerneljasonxing@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

tcp: add tracing of skbaddr in tcp_event_skb class

Use the existing parameter and print the address of skbaddr
as other trace functions do.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

tcp: add tracing of skb/skaddr in tcp_event_sk_skb class

Printing the addresses can help us identify the exact skb/sk
for those system in which it's not that easy to run BPF program.
As we can see, it already fetches those, then use it directly
and it will print like below:

...tcp_retransmit_skb: skbaddr=XXX skaddr=XXX family=AF_INET...

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'doc-sfp-phylink-update-the-porting-guide'

Maxime Chevallier says:

====================
doc: sfp-phylink: update the porting guide

Here's a V3 for an update on the phylink porting guide. The only
difference with V2 is a whitespace fix along with a line-wrap.

The main point of the update is the description of a basic process to
follow to expose one or more PCS to phylink. Let me know if you spot any
inaccuracies in the guide.

The second patch is a simple fixup on some in-code doc that was spotted
while updating the guide.

Link to V2: https://lore.kernel.org/netdev/20240228095755.1499577-1-maxime.chevallier@bootlin.com/
Link to V1: https://lore.kernel.org/netdev/20240220160406.3363002-1-maxime.chevallier@bootlin.com/
====================

Link: https://lore.kernel.org/r/20240301164309.3643849-1-maxime.chevallier@bootlin.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: phylink: clean the pcs_get_state documentation

commit 4d72c3bb60dd ("net: phylink: strip out pre-March 2020 legacy code")
dropped the mac_pcs_get_state ops in phylink_mac_ops in favor of
dedicated PCS operation pcs_get_state. However, the documentation for
the pcs_get_state ops was incorrectly converted and now self-references.

Drop the extra comment.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

doc: sfp-phylink: update the porting guide with PCS handling

Now that phylink has a comprehensive PCS support, update the porting
guide to explain the process of supporting the PCS configuration. This
also removed outdated references to phylink_config fields that no longer
exists.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: pds_core: Fix possible double free in error handling path

When auxiliary_device_add() returns error and then calls
auxiliary_device_uninit(), Callback function pdsc_auxbus_dev_release
calls kfree(padev) to free memory. We shouldn't call kfree(padev)
again in the error handling path.

Fix this by cleaning up the redundant kfree() and putting
the error handling back to where the errors happened.

Fixes: 4569cce43bc6 ("pds_core: add auxiliary_bus devices")
Signed-off-by: Yongzhi Liu <hyperlyzcs@gmail.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
Link: https://lore.kernel.org/r/20240306105714.20597-1-hyperlyzcs@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge tag 'nf-24-03-07' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains fixes for net:

Patch #1 disallows anonymous sets with timeout, except for dynamic sets.
         Anonymous sets with timeouts using the pipapo set backend makes
         no sense from userspace perspective.

Patch #2 rejects constant sets with timeout which has no practical usecase.
         This kind of set, once bound, contains elements that expire but
         no new elements can be added.

Patch #3 restores custom conntrack expectations with NFPROTO_INET,
         from Florian Westphal.

Patch #4 marks rhashtable anonymous set with timeout as dead from the
         commit path to avoid that async GC collects these elements. Rules
         that refers to the anonymous set get released with no mutex held
         from the commit path.

Patch #5 fixes a UBSAN shift overflow in H.323 conntrack helper,
         from Lena Wang.

netfilter pull request 24-03-07

* tag 'nf-24-03-07' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nf_conntrack_h323: Add protection for bmp length out of range
  netfilter: nf_tables: mark set as dead when unbinding anonymous set with timeout
  netfilter: nft_ct: fix l3num expectations with inet pseudo family
  netfilter: nf_tables: reject constant set with timeout
  netfilter: nf_tables: disallow anonymous set with timeout flag
====================

Link: https://lore.kernel.org/r/20240307021545.149386-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'netrom-fix-all-the-data-races-around-sysctls'

Jason Xing says:

====================
netrom: Fix all the data-races around sysctls

As the title said, in this patchset I fix the data-race issues because
the writer and the reader can manipulate the same value concurrently.
====================

Link: https://lore.kernel.org/r/20240304082046.64977-1-kerneljasonxing@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

netrom: Fix data-races around sysctl_net_busy_read

We need to protect the reader reading the sysctl value because the
value can be changed concurrently.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

netrom: Fix a data-race around sysctl_netrom_link_fails_count

We need to protect the reader reading the sysctl value because the
value can be changed concurrently.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

netrom: Fix a data-race around sysctl_netrom_routing_control

We need to protect the reader reading the sysctl value because the
value can be changed concurrently.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

netrom: Fix a data-race around sysctl_netrom_transport_no_activity_timeout

We need to protect the reader reading the sysctl value because the
value can be changed concurrently.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

netrom: Fix a data-race around sysctl_netrom_transport_requested_window_size

We need to protect the reader reading the sysctl value because the
value can be changed concurrently.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

netrom: Fix a data-race around sysctl_netrom_transport_busy_delay

We need to protect the reader reading the sysctl value because the
value can be changed concurrently.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

netrom: Fix a data-race around sysctl_netrom_transport_acknowledge_delay

We need to protect the reader reading the sysctl value because the
value can be changed concurrently.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

netrom: Fix a data-race around sysctl_netrom_transport_maximum_tries

We need to protect the reader reading the sysctl value because the
value can be changed concurrently.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

netrom: Fix a data-race around sysctl_netrom_transport_timeout

We need to protect the reader reading the sysctl value because the
value can be changed concurrently.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

netrom: Fix data-races around sysctl_netrom_network_ttl_initialiser

We need to protect the reader reading the sysctl value because the
value can be changed concurrently.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

netrom: Fix a data-race around sysctl_netrom_obsolescence_count_initialiser

We need to protect the reader reading the sysctl value
because the value can be changed concurrently.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

netrom: Fix a data-race around sysctl_netrom_default_path_quality

We need to protect the reader reading sysctl_netrom_default_path_quality
because the value can be changed concurrently.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge tag 'ipsec-2024-03-06' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec

Steffen Klassert says:

====================
pull request (net): ipsec 2024-03-06

1) Clear the ECN bits flowi4_tos in decode_session4().
   This was already fixed but the bug was reintroduced
   when decode_session4() switched to us the flow dissector.
   From Guillaume Nault.

2) Fix UDP encapsulation in the TX path with packet offload mode.
   From Leon Romanovsky,

3) Avoid clang fortify warning in copy_to_user_tmpl().
   From Nathan Chancellor.

4) Fix inter address family tunnel in packet offload mode.
   From Mike Yu.

* tag 'ipsec-2024-03-06' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
  xfrm: set skb control buffer based on packet offload as well
  xfrm: fix xfrm child route lookup for packet offload
  xfrm: Avoid clang fortify warning in copy_to_user_tmpl()
  xfrm: Pass UDP encapsulation in TX packet offload
  xfrm: Clear low order bits of ->flowi4_tos in decode_session4().
====================

Link: https://lore.kernel.org/r/20240306100438.3953516-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: remove ethtool_eee_use_linkmodes

After 292fac464b01 ("net: ethtool: eee: Remove legacy _u32 from keee")
this function has no user any longer.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/b4ff9b51-092b-4d44-bfce-c95342a05b51@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxbf_gige: add support to display pause frame counters

This patch updates the mlxbf_gige driver to support the
"get_pause_stats()" callback, which enables display of
pause frame counters via "ethtool -I -a oob_net0".

The pause frame counters are only enabled if the "counters_en"
bit is asserted in the LLU general config register. The driver
will only report stats, and thus overwrite the default stats
state of ETHTOOL_STAT_NOT_SET, if "counters_en" is asserted.

Reviewed-by: Asmaa Mnebhi <asmaa@nvidia.com>
Signed-off-by: David Thompson <davthompson@nvidia.com>
Link: https://lore.kernel.org/r/20240305212137.3525-1-davthompson@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: qca807x: fix compilation when CONFIG_GPIOLIB is not set

Kernel bot has discovered that if CONFIG_GPIOLIB is not set compilation
will fail.

Upon investigation the issue is that qca807x_gpio() is guarded by a
preprocessor check but then it is called under
if (IS_ENABLED(CONFIG_GPIOLIB)) in the probe call so the compiler will
error out since qca807x_gpio() has not been declared if CONFIG_GPIOLIB has
not been set.

Fixes: d1cb613efbd3 ("net: phy: qcom: add support for QCA807x PHY Family")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202403031332.IGAbZzwq-lkp@intel.com/
Signed-off-by: Robert Marko <robimarko@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Simon Horman <horms@kernel.org> # build-tested
Link: https://lore.kernel.org/r/20240305142113.795005-1-robimarko@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: geneve: Remove generic .ndo_get_stats64

Commit 3e2f544dd8a33 ("net: get stats64 if device if driver is
configured") moved the callback to dev_get_tstats64() to net core, so,
unless the driver is doing some custom stats collection, it does not
need to set .ndo_get_stats64.

Since this driver is now relying in NETDEV_PCPU_STAT_TSTATS, then, it
doesn't need to set the dev_get_tstats64() generic .ndo_get_stats64
function pointer.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240305172911.502058-2-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: geneve: Leverage core stats allocator

With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core and
convert veth & vrf"), stats allocation could be done on net core
instead of in this driver.

With this new approach, the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc). This is core responsibility now.

Remove the allocation in the geneve driver and leverage the network
core allocation instead.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240305172911.502058-1-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: gtp: Move net_device assigned in setup

Assign netdev to gtp->dev at setup time, so, we can get rid of
gtp_dev_init() completely.

Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Link: https://lore.kernel.org/r/20240305121524.2254533-3-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: gtp: Remove generic .ndo_get_stats64

Commit 3e2f544dd8a33 ("net: get stats64 if device if driver is
configured") moved the callback to dev_get_tstats64() to net core, so,
unless the driver is doing some custom stats collection, it does not
need to set .ndo_get_stats64.

Since this driver is now relying in NETDEV_PCPU_STAT_TSTATS, then, it
doesn't need to set the dev_get_tstats64() generic .ndo_get_stats64
function pointer.

Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Link: https://lore.kernel.org/r/20240305121524.2254533-2-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: gtp: Leverage core stats allocator

With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core and
convert veth & vrf"), stats allocation could be done on net core
instead of in this driver.

With this new approach, the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc). This is core responsibility now.

Remove the allocation in the gtp driver and leverage the network
core allocation instead.

Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Link: https://lore.kernel.org/r/20240305121524.2254533-1-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: macsec: Leverage core stats allocator

With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core and
convert veth & vrf"), stats allocation could be done on net core
instead of in this driver.

With this new approach, the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc). This is core responsibility now.

Remove the allocation in the macsec driver and leverage the network
core allocation instead.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/20240305113728.1974944-1-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: renesas,etheravb: Add support for R-Car V4M

Document support for the Renesas Ethernet AVB (EtherAVB-IF) block in the
Renesas R-Car V4M (R8A779H0) SoC.

Signed-off-by: Thanh Quan <thanh.quan.xn@renesas.com>
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Acked-by: Rob Herring <robh@kernel.org>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Link: https://lore.kernel.org/r/0212b57ba1005bb9b5a922f8f25cc67a7bc15f30.1709631152.git.geert+renesas@glider.be
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sr9800: Add check for usbnet_get_endpoints

Add check for usbnet_get_endpoints() and return the error if it fails
in order to transfer the error.

Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Reviewed-by: Simon Horman <horms@kernel.org>
Fixes: 19a38d8e0aa3 ("USB2NET : SR9800 : One chip USB2.0 USB2NET SR9800 Device Driver Support")
Link: https://lore.kernel.org/r/20240305075927.261284-1-nichen@iscas.ac.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/harness: Fix TEST_F()'s vfork handling

Always run fixture setup in the grandchild process, and by default also
run the teardown in the same process.  However, this change makes it
possible to run the teardown in a parent process when
_metadata->teardown_parent is set to true (e.g. in fixture setup).

Fix TEST_SIGNAL() by forwarding grandchild's signal to its parent.  Fix
seccomp tests by running the test setup in the parent of the test
thread, as expected by the related test code.  Fix Landlock tests by
waiting for the grandchild before processing _metadata.

Use of exit(3) in tests should be OK because the environment in which
the vfork(2) call happen is already dedicated to the running test (with
flushed stdio, setpgrp() call), see __run_test() and the call to fork(2)
just before running the setup/test/teardown.  Even if the test
configures its own exit handlers, they will not be run by the parent
because it never calls exit(3), and the test function either ends with a
call to _exit(2) or a signal.

Cc: Günther Noack <gnoack@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Will Drewry <wad@chromium.org>
Fixes: 0710a1a73fb4 ("selftests/harness: Merge TEST_F_FORK() into TEST_F()")
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Reported-by: Mark Brown <broonie@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20240305201029.1331333-1-mic@digikod.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'mptcp-some-clean-up-patches'

Matthieu Baerts says:

====================
mptcp: some clean-up patches

Here are some clean-up patches for MPTCP:

- Patch 1 drops duplicated header inclusions.

- Patch 2 updates PM 'set_flags' interface, to make it more similar to
others.

- Patch 3 adds some error messages for the PM 'set_flags' command to
help the userspace understanding what's wrong in case of error.

- Patch 4 simplifies __lookup_addr() function from pm_netlink.c.

Except for the 3rd patch, the behaviour is not supposed to be modified.
====================

Link: https://lore.kernel.org/r/20240305-upstream-net-next-20240304-mptcp-misc-cleanup-v1-0-c436ba5e569b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: drop lookup_by_id in lookup_addr

When the lookup_by_id parameter of __lookup_addr() is true, it's the same
as __lookup_addr_by_id(), it can be replaced by __lookup_addr_by_id()
directly. So drop this parameter, let __lookup_addr() only looks up address
on the local address list by comparing addresses in it, not address ids.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://lore.kernel.org/r/20240305-upstream-net-next-20240304-mptcp-misc-cleanup-v1-4-c436ba5e569b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: set error messages for set_flags

In addition to returning the error value, this patch also sets an error
messages with GENL_SET_ERR_MSG or NL_SET_ERR_MSG_ATTR both for pm_netlink.c
and pm_userspace.c. It will help the userspace to identify the issue.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://lore.kernel.org/r/20240305-upstream-net-next-20240304-mptcp-misc-cleanup-v1-3-c436ba5e569b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: update set_flags interfaces

This patch updates set_flags interfaces, make it more similar to the
interfaces of dump_addr and get_addr:

mptcp_pm_set_flags(struct sk_buff *skb, struct genl_info *info)
mptcp_pm_nl_set_flags(struct sk_buff *skb, struct genl_info *info)
mptcp_userspace_pm_set_flags(struct sk_buff *skb, struct genl_info *info)

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://lore.kernel.org/r/20240305-upstream-net-next-20240304-mptcp-misc-cleanup-v1-2-c436ba5e569b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: drop duplicate header inclusions

The headers net/tcp.h, net/genetlink.h and uapi/linux/mptcp.h are included
in protocol.h already, no need to include them again directly. This patch
removes these duplicate header inclusions.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://lore.kernel.org/r/20240305-upstream-net-next-20240304-mptcp-misc-cleanup-v1-1-c436ba5e569b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2024-03-06

We've added 5 non-merge commits during the last 1 day(s) which contain
a total of 5 files changed, 77 insertions(+), 4 deletions(-).

The main changes are:

1) Fix BPF verifier to check bpf_func_state->callback_depth when pruning
   states as otherwise unsafe programs could get accepted,
   from Eduard Zingerman.

2) Fix to zero-initialise xdp_rxq_info struct before running XDP program in
   CPU map which led to random xdp_md fields, from Toke Høiland-Jørgensen.

3) Fix bonding XDP feature flags calculation when bonding device has no
   slave devices anymore, from Daniel Borkmann.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  cpumap: Zero-initialise xdp_rxq_info struct before running XDP program
  selftests/bpf: Fix up xdp bonding test wrt feature flags
  xdp, bonding: Fix feature flags when there are no slave devs anymore
  selftests/bpf: test case for callback_depth states pruning logic
  bpf: check bpf_func_state->callback_depth when pruning states
====================

Link: https://lore.kernel.org/r/20240306220309.13534-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netfilter: nf_conntrack_h323: Add protection for bmp length out of range

UBSAN load reports an exception of BRK#5515 SHIFT_ISSUE:Bitwise shifts
that are out of bounds for their data type.

vmlinux   get_bitmap(b=75) + 712
<net/netfilter/nf_conntrack_h323_asn1.c:0>
vmlinux   decode_seq(bs=0xFFFFFFD008037000, f=0xFFFFFFD008037018, level=134443100) + 1956
<net/netfilter/nf_conntrack_h323_asn1.c:592>
vmlinux   decode_choice(base=0xFFFFFFD0080370F0, level=23843636) + 1216
<net/netfilter/nf_conntrack_h323_asn1.c:814>
vmlinux   decode_seq(f=0xFFFFFFD0080371A8, level=134443500) + 812
<net/netfilter/nf_conntrack_h323_asn1.c:576>
vmlinux   decode_choice(base=0xFFFFFFD008037280, level=0) + 1216
<net/netfilter/nf_conntrack_h323_asn1.c:814>
vmlinux   DecodeRasMessage() + 304
<net/netfilter/nf_conntrack_h323_asn1.c:833>
vmlinux   ras_help() + 684
<net/netfilter/nf_conntrack_h323_main.c:1728>
vmlinux   nf_confirm() + 188
<net/netfilter/nf_conntrack_proto.c:137>

Due to abnormal data in skb->data, the extension bitmap length
exceeds 32 when decoding ras message then uses the length to make
a shift operation. It will change into negative after several loop.
UBSAN load could detect a negative shift as an undefined behaviour
and reports exception.
So we add the protection to avoid the length exceeding 32. Or else
it will return out of range error and stop decoding.

Fixes: 5e35941d9901 ("[NETFILTER]: Add H.323 conntrack/NAT helper")
Signed-off-by: Lena Wang <lena.wang@mediatek.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: mark set as dead when unbinding anonymous set with timeout

While the rhashtable set gc runs asynchronously, a race allows it to
collect elements from anonymous sets with timeouts while it is being
released from the commit path.

Mingi Cho originally reported this issue in a different path in 6.1.x
with a pipapo set with low timeouts which is not possible upstream since
7395dfacfff6 ("netfilter: nf_tables: use timestamp to check for set
element timeout").

Fix this by setting on the dead flag for anonymous sets to skip async gc
in this case.

According to 08e4c8c5919f ("netfilter: nf_tables: mark newset as dead on
transaction abort"), Florian plans to accelerate abort path by releasing
objects via workqueue, therefore, this sets on the dead flag for abort
path too.

Cc: stable@vger.kernel.org
Fixes: 5f68718b34a5 ("netfilter: nf_tables: GC transaction API to avoid race with control plane")
Reported-by: Mingi Cho <mgcho.minic@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_ct: fix l3num expectations with inet pseudo family

Following is rejected but should be allowed:

table inet t {
        ct expectation exp1 {
                [..]
                l3proto ip

Valid combos are:
table ip t, l3proto ip
table ip6 t, l3proto ip6
table inet t, l3proto ip OR l3proto ip6

Disallow inet pseudeo family, the l3num must be a on-wire protocol known
to conntrack.

Retain NFPROTO_INET case to make it clear its rejected
intentionally rather as oversight.

Fixes: 8059918a1377 ("netfilter: nft_ct: sanitize layer 3 and 4 protocol number in custom expectations")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: reject constant set with timeout

This set combination is weird: it allows for elements to be
added/deleted, but once bound to the rule it cannot be updated anymore.
Eventually, all elements expire, leading to an empty set which cannot
be updated anymore. Reject this flags combination.

Cc: stable@vger.kernel.org
Fixes: 761da2935d6e ("netfilter: nf_tables: add set timeout API support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: disallow anonymous set with timeout flag

Anonymous sets are never used with timeout from userspace, reject this.
Exception to this rule is NFT_SET_EVAL to ensure legacy meters still work.

Cc: stable@vger.kernel.org
Fixes: 761da2935d6e ("netfilter: nf_tables: add set timeout API support")
Reported-by: lonial con <kongln9170@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

Merge tag 'vfs-6.8-release.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:

- Get rid of copy_mc flag in iov_iter which really only makes sense for
   the core dumping code so move it out of the generic iov iter code and
   make it coredump's problem. See the detailed commit description.

- Revert fs/aio: Make io_cancel() generate completions again

   The initial fix here was predicated on the assumption that calling
   ki_cancel() didn't complete aio requests. However, that turned out to
   be wrong since the two drivers that actually make use of this set a
   cancellation function that performs the cancellation correctly. So
   revert this change.

- Ensure that the test for IOCB_AIO_RW always happens before the read
   from ki_ctx.

* tag 'vfs-6.8-release.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  iov_iter: get rid of 'copy_mc' flag
  fs/aio: Check IOCB_AIO_RW before the struct aio_kiocb conversion
  Revert "fs/aio: Make io_cancel() generate completions again"

Merge tag 'arm-fixes-6.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc

Pull ARM SoC fixes from Arnd Bergmann:
"These should be the final fixes for the soc tree for 6.8, as usual
  they mostly deal wtih dts files:

   - Qualcomm fixes for pcie4 on sc8280xp, a revert of msm8996 mpm
     support, sm6115 interconnect and sm8650 gpio.

   - Two fixes for Tegra234 ethernet

   - A Makefile fix to actually build the allwinner based orange pi zero
     2w device tree

   - Fixes for clocks and reset on imx8mp and a DSI display regression
     on imx7.

  The non-DT fixes are:

   - Firmware fixes addressing a kernel panic in op-tee and a minor
     regression in microchip/riscv.

   - A defconfig change to bring back backlight support after a Kconfig
     change"

* tag 'arm-fixes-6.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
  firmware: microchip: Fix over-requested allocation size
  tee: optee: Fix kernel panic caused by incorrect error handling
  Revert "arm64: dts: qcom: msm8996: Hook up MPM"
  arm64: dts: qcom: sc8280xp-x13s: limit pcie4 link speed
  arm64: dts: qcom: sc8280xp-crd: limit pcie4 link speed
  arm64: dts: imx8mp: Fix LDB clocks property
  arm64: dts: imx8mp: Fix TC9595 reset GPIO on DH i.MX8M Plus DHCOM SoM
  MAINTAINERS: Use a proper mailinglist for NXP i.MX development
  ARM: dts: imx7: remove DSI port endpoints
  arm64: dts: allwinner: h616: Add Orange Pi Zero 2W to Makefile
  ARM: imx_v6_v7_defconfig: Restore CONFIG_BACKLIGHT_CLASS_DEVICE
  arm64: tegra: Fix Tegra234 MGBE power-domains
  arm64: tegra: Set the correct PHY mode for MGBE
  arm64: dts: qcom: sm6115: Fix missing interconnect-names
  arm64: dts: qcom: sm8650-mtp: add gpio74 as reserved gpio
  arm64: dts: qcom: sm8650-qrd: add gpio74 as reserved gpio

Merge tag 'v6.8-p6' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto fixes from Herbert Xu:
"Fix potential use-after-frees in rk3288 and sun8i-ce"

* tag 'v6.8-p6' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: rk3288 - Fix use after free in unprepare
crypto: sun8i-ce - Fix use after free in unprepare

inet: Add getsockopt support for IP_ROUTER_ALERT and IPV6_ROUTER_ALERT

Currently getsockopt does not support IP_ROUTER_ALERT and
IPV6_ROUTER_ALERT, and we are unable to get the values of these two
socket options through getsockopt.

This patch adds getsockopt support for IP_ROUTER_ALERT and
IPV6_ROUTER_ALERT.

Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'ynl-small-recv'

Jakub Kicinski says:

====================
tools: ynl: add --dbg-small-recv for easier kernel testing

When testing netlink dumps I usually hack some user space up
to constrain its user space buffer size (iproute2, ethtool or ynl).
Netlink will try to fill the messages up, so since these apps use
large buffers by default, the dumps are rarely fragmented.

I was hoping to figure out a way to create a selftest for dump
testing, but so far I have no idea how to do that in a useful
and generic way.

Until someone does that, make manual dump testing easier with YNL.
Create a special option for limiting the buffer size, so I don't
have to make the same edits each time, and maybe others will benefit,
too :)

Example:

  $ ./cli.py [...] --dbg-small-recv >/dev/null
  Recv: read 3712 bytes, 29 messages
     nl_len = 128 (112) nl_flags = 0x0 nl_type = 19
    [...]
     nl_len = 128 (112) nl_flags = 0x0 nl_type = 19
  Recv: read 3968 bytes, 31 messages
     nl_len = 128 (112) nl_flags = 0x0 nl_type = 19
    [...]
     nl_len = 128 (112) nl_flags = 0x0 nl_type = 19
  Recv: read 532 bytes, 5 messages
     nl_len = 128 (112) nl_flags = 0x0 nl_type = 19
    [...]
     nl_len = 128 (112) nl_flags = 0x0 nl_type = 19
     nl_len = 20 (4) nl_flags = 0x2 nl_type = 3

Now let's make the DONE not fit in the last message:

  $ ./cli.py [...] --dbg-small-recv 4499 >/dev/null
  Recv: read 3712 bytes, 29 messages
     nl_len = 128 (112) nl_flags = 0x0 nl_type = 19
    [...]
     nl_len = 128 (112) nl_flags = 0x0 nl_type = 19
  Recv: read 4480 bytes, 35 messages
     nl_len = 128 (112) nl_flags = 0x0 nl_type = 19
    [...]
     nl_len = 128 (112) nl_flags = 0x0 nl_type = 19
  Recv: read 20 bytes, 1 messages
     nl_len = 20 (4) nl_flags = 0x2 nl_type = 3

A real test would also have to check the messages are complete
and not duplicated. That part has to be done manually right now.

Note that the first message is always conservatively sized by the kernel.
Still, I think this is good enough to be useful.

v2:
- patch 2:
   - move the recv_size setting up
   - change the default to 0 so that cli.py doesn't have to worry
     what the "unset" value is
v1: https://lore.kernel.org/all/20240301230542.116823-1-kuba@kernel.org/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

tools: ynl: add --dbg-small-recv for easier kernel testing

Most "production" netlink clients use large buffers to
make dump efficient, which means that handling of dump
continuation in the kernel is not very well tested.

Add an option for debugging / testing handling of dumps.
It enables printing of extra netlink-level debug and
lowers the recv() buffer size in one go. When used
without any argument (--dbg-small-recv) it picks
a very small default (4000), explicit size can be set,
too (--dbg-small-recv 5000).

Example:

$ ./cli.py [...] --dbg-small-recv
Recv: read 3712 bytes, 29 messages
   nl_len = 128 (112) nl_flags = 0x0 nl_type = 19
[...]
   nl_len = 128 (112) nl_flags = 0x0 nl_type = 19
Recv: read 3968 bytes, 31 messages
   nl_len = 128 (112) nl_flags = 0x0 nl_type = 19
[...]
   nl_len = 128 (112) nl_flags = 0x0 nl_type = 19
Recv: read 532 bytes, 5 messages
   nl_len = 128 (112) nl_flags = 0x0 nl_type = 19
[...]
   nl_len = 128 (112) nl_flags = 0x0 nl_type = 19
   nl_len = 20 (4) nl_flags = 0x2 nl_type = 3

(the [...] are edits to shorten the commit message).

Note that the first message of the dump is sized conservatively
by the kernel.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tools: ynl: support debug printing messages

For manual debug, allow printing the netlink level messages
to stderr.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

tools: ynl: allow setting recv() size

Make the size of the buffer we use for recv() configurable.
The details of the buffer sizing in netlink are somewhat
arcane, we could spend a lot of time polishing this API.
Let's just leave some hopefully helpful comments for now.
This is a for-developers-only feature, anyway.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tools: ynl: move the new line in NlMsg __repr__

We add the new line even if message has no error or extack,
which leads to print(nl_msg) ending with two new lines.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'tools-ynl-make-clean'

Jakub Kicinski says:

====================
tools: ynl: clean up make clean

First change renames the clean target which removes build results,
to a more common name. Second one add missing .PHONY targets.
Third one ensures that clean deletes __pycache__.

v2: add patch 2
v1: https://lore.kernel.org/all/20240301235609.147572-1-kuba@kernel.org/

====================

Signed-off-by: David S. Miller <davem@davemloft.net>

tools: ynl: remove __pycache__ during clean

Build process uses python to generate the user space code.
Remove __pycache__ on make clean.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tools: ynl: add distclean to .PHONY in all makefiles

Donald points out most YNL makefiles are missing distclean
in .PHONY, even tho generated/Makefile does list it.

Suggested-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tools: ynl: rename make hardclean -> distclean

The make target to remove all generated files used to be called
"hardclean" because it deleted files which were tracked by git.
We no longer track generated user space files, so use the more
common "distclean" name.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/rds: fix WARNING in rds_conn_connect_if_down

If connection isn't established yet, get_mr() will fail, trigger connection after
get_mr().

Fixes: 584a8279a44a ("RDS: RDMA: return appropriate error on rdma map failures")
Reported-and-tested-by: syzbot+d4faee732755bba9838e@syzkaller.appspotmail.com
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2024-03-04 (ice)

This series contains updates to ice driver only.

Jake changes the driver to use relative VSI index for VF VSIs as the VF
driver has no direct use of the VSI number on ice hardware. He also
reworks some Tx/Rx functions to clarify their uses, cleans up some style
issues, and utilizes kernel helper functions.

Maciej removes a redundant call to disable Tx queues on ifdown and
removes some unnecessary devm usages.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'ravb-cleanups'

Niklas Söderlund says:

====================
ravb: Align Rx descriptor setup and maintenance

When RZ/G2L support was added the Rx code path was split in two, one to
support R-Car and one to support RZ/G2L. One reason for this is that
R-Car uses the extended Rx descriptor format, while RZ/G2L uses the
normal descriptor format.

In many aspects this is not needed as the extended descriptor format is
just a normal descriptor with extra metadata (timestamsp) appended. And
the R-Car SoCs can also use normal descriptors if hardware timestamps
were not desired. This split has led to RZ/G2L gaining support for
split descriptors in the Rx path while R-Car still lacks this.

This series is the first step in trying to merge the R-Car and RZ/G2L Rx
paths so features and bugs corrected in one will benefit the other.

The first patch in the series clarifies that the driver now supports
either normal or extended descriptors, not both at the same time by
grouping them in a union. This is the foundation that later patches will
build on the aligning the two Rx paths.

Patches 2-5 deals with correcting small issues in the Rx frame and
descriptor sizes that either were incorrect at the time they were added
in 2017 (my bad) or concepts built on-top of this initial incorrect
design.

While finally patch 6 merges the R-Car and RZ/G2L for Rx descriptor
setup and maintenance.

When this work has landed I plan to follow up with more work aligning
the rest of the Rx code paths and hopefully bring split descriptor
support to the R-Car SoCs.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

ravb: Unify Rx ring maintenance code paths

The R-Car and RZ/G2L Rx code paths were split in two separate
implementations when support for RZ/G2L was added due to the fact that
R-Car uses the extended descriptor format while RZ/G2L uses normal
descriptors. This has led to a duplication of Rx logic with the only
difference being the different Rx descriptors types used. The
implementation however neglects to take into account that extended
descriptors are normal descriptors with additional metadata at the end
to carry hardware timestamp information.

The hardware timestamp information is only consumed in the R-Car Rx
loop and all the maintenance code around the Rx ring can be shared
between the two implementations if the difference in descriptor length
is carefully considered.

This change merges the two implementations for Rx ring maintenance by
adding a method to access both types of descriptors as normal
descriptors, as this part covers all the fields needed for Rx ring
maintenance the only difference between using normal or extended
descriptor is the size of the memory region to allocate/free and the
step size between each descriptor in the ring.

Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Paul Barker <paul.barker.ct@bp.renesas.com>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>

ravb: Move maximum Rx descriptor data usage to info struct

To make it possible to merge the R-Car and RZ/G2L code paths move the
maximum usable size of a single Rx descriptor data slice into the
hardware information instead of using two different defines in the two
different code paths.

Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Paul Barker <paul.barker.ct@bp.renesas.com>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>

ravb: Use the max frame size from hardware info for RZ/G2L

Remove the define describing the RZ/G2L maximum frame size and only use
the information in the hardware information struct. This will make it
easier to merge the R-Car and RZ/G2L code paths.

There is no functional change as both the define and the maximum frame
length in the hardware information is set to 8K.

Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Paul Barker <paul.barker.ct@bp.renesas.com>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>

ravb: Create helper to allocate skb and align it

The EtherAVB device requires the SKB data to be aligned to 128 bytes.
The alignment is done by allocating an skb 128 bytes larger than the
maximum frame size supported by the device and adjusting the headroom to
fit the requirement.

This code has been refactored a few times and small issues have been
added along the way. The issues are not harmful but prevent merging
parts of the Rx code which have been split in two implementations with
the addition of RZ/G2L support, a device that supports larger frame
sizes.

This change removes the need for duplicated and somewhat inaccurate
hardware alignment constrains stored in the hardware information struct
by creating a helper to handle the allocation of an skb and alignment of
an skb data.

For the R-Car class of devices the maximum frame size is 4K and each
descriptor is limited to 2K of data. The current implementation does not
support split descriptors, this limits the frame size to 2K. The
current hardware information however records the descriptor size just
under 2K due to bad understanding of the device when larger MTUs where
added.

For the RZ/G2L device the maximum frame size is 8K and each descriptor
is limited to 4K of data. The current hardware information records this
correctly, but it gets the alignment constrains wrong as just aligns it
by 128, it does not extend it by 128 bytes to allow the full frame to be
stored. This works because the RZ/G2L device supports split descriptors
and allocates each skb to 8K and aligns each 4K descriptor in this
space.

Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Paul Barker <paul.barker.ct@bp.renesas.com>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>

ravb: Make it clear the information relates to maximum frame size

The struct member rx_max_buf_size was added before split descriptor
support was added. It is unclear if the value describes the full skb
frame buffer or the data descriptor buffer which can be combined into a
single skb.

Rename it to make it clear it referees to the maximum frame size and can
cover multiple descriptors.

Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Paul Barker <paul.barker.ct@bp.renesas.com>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>

ravb: Group descriptor types used in Rx ring

The Rx ring can either be made up of normal or extended descriptors, not
a mix of the two at the same time. Make this explicit by grouping the
two variables in a rx_ring union.

The extension of the storage for more than one queue of normal
descriptors from a single to NUM_RX_QUEUE queues have no practical
effect. But aids in making the code readable as the code that uses it
already piggyback on other members of struct ravb_private that are
arrays of max length NUM_RX_QUEUE, e.g. rx_desc_dma. This will also make
further refactoring easier.

While at it, rename the normal descriptor Rx ring to make it clear it's
not strictly related to the GbEthernet E-MAC IP found in RZ/G2L, normal
descriptors could be used on R-Car SoCs too.

Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Paul Barker <paul.barker.ct@bp.renesas.com>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue

From: Tony Nguyen <anthony.l.nguyen@intel.com>
To: davem@davemloft.net, kuba@kernel.org, pabeni@redhat.com,
edumazet@google.com, netdev@vger.kernel.org
Cc: Tony Nguyen <anthony.l.nguyen@intel.com>, alan.brady@intel.com
Tony Nguyen says:

====================
idpf: refactor virtchnl messages

Alan Brady says:

The motivation for this series has two primary goals. We want to enable
support of multiple simultaneous messages and make the channel more
robust. The way it works right now, the driver can only send and receive
a single message at a time and if something goes really wrong, it can
lead to data corruption and strange bugs.

To start the series, we introduce an idpf_virtchnl.h file. This reduces
the burden on idpf.h which is overloaded with struct and function
declarations.

The conversion works by conceptualizing a send and receive as a
"virtchnl transaction" (idpf_vc_xn) and introducing a "transaction
manager" (idpf_vc_xn_manager). The vcxn_mngr will init a ring of
transactions from which the driver will pop from a bitmap of free
transactions to track in-flight messages. Instead of needing to handle a
complicated send/recv for every a message, the driver now just needs to
fill out a xn_params struct and hand it over to idpf_vc_xn_exec which
will take care of all the messy bits. Once a message is sent and
receives a reply, we leverage the completion API to signal the received
buffer is ready to be used (assuming success, or an error code
otherwise).

At a low-level, this implements the "sw cookie" field of the virtchnl
message descriptor to enable this. We have 16 bits we can put whatever
we want and the recipient is required to apply the same cookie to the
reply for that message. We use the first 8 bits as an index into the
array of transactions to enable fast lookups and we use the second 8
bits as a salt to make sure each cookie is unique for that message. As
transactions are received in arbitrary order, it's possible to reuse a
transaction index and the salt guards against index conflicts to make
certain the lookup is correct. As a primitive example, say index 1 is
used with salt 1. The message times out without receiving a reply so
index 1 is renewed to be ready for a new transaction, we report the
timeout, and send the message again. Since index 1 is free to be used
again now, index 1 is again sent but now salt is 2. This time we do get
a reply, however it could be that the reply is _actually_ for the
previous send index 1 with salt 1. Without the salt we would have no
way of knowing for sure if it's the correct reply, but with we will know
for certain.

Through this conversion we also get several other benefits. We can now
more appropriately handle asynchronously sent messages by providing
space for a callback to be defined. This notably allows us to handle MAC
filter failures better; previously we could potentially have stale,
failed filters in our list, which shouldn't really have a major impact
but is obviously not correct. I also managed to remove fairly
significant more lines than I added which is a win in my book.

Additionally, this converts some variables to use auto-variables where
appropriate. This makes the alloc paths much cleaner and less prone to
memory leaks. We also fix a few virtchnl related bugs while we're here.

====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2024-03-05 (idpf, ice, i40e, igc, e1000e)

This series contains updates to idpf, ice, i40e, igc and e1000e drivers.

Emil disables local BH on NAPI schedule for proper handling of softirqs
on idpf.

Jake stops reporting of virtchannel RSS option which in unsupported on
ice.

Rand Deeb adds null check to prevent possible null pointer dereference
on ice.

Michal Schmidt moves DPLL mutex initialization to resolve uninitialized
mutex usage for ice.

Jesse fixes incorrect variable usage for calculating Tx stats on ice.

Ivan Vecera corrects logic for firmware equals check on i40e.

Florian Kauer prevents memory corruption for XDP_REDIRECT on igc.

Sasha reverts an incorrect use of FIELD_GET which caused a regression
for Wake on LAN on e1000e.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

iov_iter: get rid of 'copy_mc' flag

This flag is only set by one single user: the magical core dumping code
that looks up user pages one by one, and then writes them out using
their kernel addresses (by using a BVEC_ITER).

That actually ends up being a huge problem, because while we do use
copy_mc_to_kernel() for this case and it is able to handle the possible
machine checks involved, nothing else is really ready to handle the
failures caused by the machine check.

In particular, as reported by Tong Tiangen, we don't actually support
fault_in_iov_iter_readable() on a machine check area.

As a result, the usual logic for writing things to a file under a
filesystem lock, which involves doing a copy with page faults disabled
and then if that fails trying to fault pages in without holding the
locks with fault_in_iov_iter_readable() does not work at all.

We could decide to always just make the MC copy "succeed" (and filling
the destination with zeroes), and that would then create a core dump
file that just ignores any machine checks.

But honestly, this single special case has been problematic before, and
means that all the normal iov_iter code ends up slightly more complex
and slower.

See for example commit c9eec08bac96 ("iov_iter: Don't deal with
iter->copy_mc in memcpy_from_iter_mc()") where David Howells
re-organized the code just to avoid having to check the 'copy_mc' flags
inside the inner iov_iter loops.

So considering that we have exactly one user, and that one user is a
non-critical special case that doesn't actually ever trigger in real
life (Tong found this with manual error injection), the sane solution is
to just decide that the onus on handling the machine check lines on that
user instead.

Ergo, do the copy_mc_to_kernel() in the core dump logic itself, copying
the user data to a stable kernel page before writing it out.

Fixes: f1982740f5e7 ("iov_iter: Convert iterate*() to inline funcs")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Tong Tiangen <tongtiangen@huawei.com>
Link: https://lore.kernel.org/r/20240305133336.3804360-1-tongtiangen@huawei.com
Link: https://lore.kernel.org/all/4e80924d-9c85-f13a-722a-6a5d2b1c225a@huawei.com/
Tested-by: David Howells <dhowells@redhat.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reported-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>

Merge branch 'Improve packet offload for dual stack'

Mike Yu says:
====================
In the XFRM stack, whether a packet is forwarded to the IPv4
or IPv6 stack depends on the family field of the matched SA.
This does not completely work for IPsec packet offload in some
scenario, for example, sending an IPv6 packet that will be
encrypted and encapsulated as an IPv4 packet in HW.

Here are the patches to make IPsec packet offload work on the
mentioned scenario.
====================

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

Merge branch 'netlink-emsgsize'

Jakub Kicinski says:

====================
netlink: handle EMSGSIZE errors in the core

Ido discovered some time back that we usually force NLMSG_DONE
to be delivered in a separate recv() syscall, even if it would
fit into the same skb as data messages. He made nexthop try
to fit DONE with data in commit 8743aeff5bc4 ("nexthop: Fix
infinite nexthop bucket dump when using maximum nexthop ID"),
and nobody has complained so far.

We have since also tried to follow the same pattern in new
genetlink families, but explaining to people, or even remembering
the correct handling ourselves is tedious.

Let the netlink socket layer consume -EMSGSIZE errors.
Practically speaking most families use this error code
as "dump needs more space", anyway.

v2:
- init err to 0 in last patch
v1: https://lore.kernel.org/all/20240301012845.2951053-1-kuba@kernel.org/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

genetlink: fit NLMSG_DONE into same read() as families

Make sure ctrl_fill_info() returns sensible error codes and
propagate them out to netlink core. Let netlink core decide
when to return skb->len and when to treat the exit as an
error. Netlink core does better job at it, if we always
return skb->len the core doesn't know when we're done
dumping and NLMSG_DONE ends up in a separate read().

Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>