Jiri Pirko [Thu, 26 Jan 2023 07:58:30 +0000 (08:58 +0100)]
devlink: don't work with possible NULL pointer in devlink_param_unregister()
There is a WARN_ON checking the param_item for being NULL when the param
is not inserted in the list. That indicates a driver BUG. Instead of
continuing to work with NULL pointer with its consequences, return.
Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 26 Jan 2023 07:58:29 +0000 (08:58 +0100)]
devlink: make devlink_param_register/unregister static
There is no user outside the devlink code, so remove the export and make
the functions static. Move them before callers to avoid forward
declarations.
Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 26 Jan 2023 07:58:28 +0000 (08:58 +0100)]
net/mlx5: Covert devlink params registration to use devlink_params_register/unregister()
Since mlx5 is the only user of devlink API to register/unregister a
single param, convert it to use array registration function allowing to
simplify the devlink API by removing the single param registration
functions.
Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 26 Jan 2023 07:58:27 +0000 (08:58 +0100)]
net/mlx5: Change devlink param register/unregister function names
The functions are registering and unregistering devlink params, so
change the names accordingly.
Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 27 Jan 2023 12:24:32 +0000 (12:24 +0000)]
Merge branch 'ethtool-netlink-next'
Jakub Kicinski says:
====================
ethtool: netlink: handle SET intro/outro in the common code
Factor out the boilerplate code from SET handlers to common code.
I volunteered to refactor the extack in GET in a conversation
with Vladimir but I gave up.
The handling of failures during dump in GET handlers is a bit
unclear to me. Some code uses presence of info as indication
of dump and tries to avoid reporting errors altogether
(including extack messages).
There's also the question of whether we should have a validation
callback (similar to .set_validate here) for GET. It looks like
.parse_request was expected to perform the validation. It takes
the extack and tb directly, not via info:
int (*parse_request)(struct ethnl_req_info *req_info,
struct nlattr **tb,
struct netlink_ext_ack *extack);
But .parse_request doesn't run under rtnl nor ethnl_ops_begin().
As a result some implementations defer validation until .prepare_data
where all the locks are held and they can call out to the driver.
All this makes me think that maybe we should refactor GET in the
same direction I'm refactoring SET. Split .prepare_data, take
more locks in the core, and add a validation helper which would
take extack directly:
- ret = ops->prepare_data(req_info, reply_data, info);
+ ret = ops->prepare_data_validate(req_info, reply_data, attrs, extack);
+ if (ret < 1) // if 0 -> skip for dump; -EOPNOTSUPP in do
+ goto err1;
+
+ ret = ethnl_ops_begin(dev);
+ if (ret)
+ goto err1;
+
+ ret = ops->prepare_data(req_info, reply_data); // no extack
+ ethnl_ops_complete(dev);
I'll file that away as a TODO for posterity / older me.
v2:
- invert checks for coalescing to avoid error code changes
- rebase and convert MM as well
This leads to a lot of copy / pasted code an bugs when people
mis-handle the error path.
Add a generic implementation of this pattern with a .set callback
in struct ethnl_request_ops called to "do stuff".
Also add an optional .set_validate which is called before
ethnl_ops_begin() -- a lot of implementations do basic request
capability / sanity checking at that point.
Because we want to avoid generating the notification when
no change happened - adopt a slightly hairy return values:
- 0 means nothing to do (no notification)
- 1 means done / continue
- negative error codes on error
Reuse .hdr_attr from struct ethnl_request_ops, GET and SET
use the same attr spaces in all cases.
Convert pause as an example (and to avoid unused function warnings).
Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Convert qca8k to regmap read/write bulk API. The mgmt eth can write up
to 32 bytes of data at times. Currently we use a custom function to do
it but regmap now supports declaration of read/write bulk even without a
bus.
Drop the custom function and rework the regmap function to this new
implementation.
Rework the qca8k_fdb_read/write function to use the new
regmap_bulk_read/write as the old qca8k_bulk_read/write are now dropped.
Cc: Mark Brown <broonie@kernel.org> Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 26 Jan 2023 07:14:23 +0000 (23:14 -0800)]
net: skbuff: drop the linux/hrtimer.h include
linux/hrtimer.h include was added because apparently it used
to contain ktime related code. This is no longer the case
and we include linux/time.h explicitly.
Sadly this change is currently a noop because linux/dma-mapping.h
and net/page_pool.h pull in half of the universe.
Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 26 Jan 2023 07:14:22 +0000 (23:14 -0800)]
net: skbuff: drop the linux/splice.h include
splice.h is included since commit a60e3cc7c929 ("net: make
skb_splice_bits more configureable") but really even then
all we needed is some forward declarations. Most of that
code is now gone, and remaining has fwd declarations.
Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 26 Jan 2023 07:14:20 +0000 (23:14 -0800)]
net: skbuff: drop the linux/sched.h include
linux/sched.h was added for skb_mstamp_* (all the way back
before linux/sched.h got split and linux/sched/clock.h created).
We don't need it in skbuff.h any more.
Sadly this change is currently a noop because linux/dma-mapping.h
and net/page_pool.h pull in half of the universe.
Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 27 Jan 2023 11:16:29 +0000 (11:16 +0000)]
Merge branch 'ipa-abstract-status'
Alex Elder says:
====================
net: ipa: abstract status parsing
Under some circumstances, IPA generates a "packet status" structure
that describes information about a packet. This is used, for
example, when offload hardware detects an error in a packet, or
otherwise discovers a packet needs special handling. In this case,
the status is delivered (along with the packet it describes) to a
"default" endpoint so that it can be handled by the AP.
Until now, the structure of this status information hasn't changed.
However, to support more than 32 endpoints, this structure required
some changes, such that some fields are rearranged in ways that are
tricky to represent using C code.
This series updates code related to the IPA status structure. The
first patch uses a local variable to avoid recomputing a packet
length more than once. The second stops using sizeof() to determine
the size of an IPA packet status structure. Patches 3-5 extend the
definitions for values held in packet status fields. Patch 6 does a
little general cleanup to make patch 7 simpler. Patch 7 stops using
a C structure to represent packet status; instead, a new function
fetches values "by name" from a buffer containing such a structure.
The last patch updates this function so it also supports IPA v5.0+.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 25 Jan 2023 20:45:45 +0000 (14:45 -0600)]
net: ipa: add IPA v5.0 packet status support
Update ipa_status_extract() to support IPA v5.0 and beyond. Because
the format of the IPA packet status depends on the version, pass an
IPA pointer to the function.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 25 Jan 2023 20:45:44 +0000 (14:45 -0600)]
net: ipa: introduce generalized status decoder
Stop assuming the IPA packet status has a fixed format (defined by
a C structure). Instead, use a function to extract each field from
a block of data interpreted as an IPA packet status. Define an
enumerated type that identifies the fields that can be extracted.
The current function extracts fields based on the existing
ipa_status structure format (which is no longer used).
Define IPA_STATUS_RULE_MISS, to replace the calls to field_max() to
represent that condition; those depended on the knowing the width of
a filter or router rule in the IPA packet status structure.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 25 Jan 2023 20:45:43 +0000 (14:45 -0600)]
net: ipa: IPA status preparatory cleanups
The next patch reworks how the IPA packet status structure is
interpreted. This patch does some preparatory work, to make it
easier to see the effect of that change:
- Change a few functions that access fields in a IPA packet status
structure to store field values in local variables with names
related to the field.
- Pass a void pointer rather than an (equivalent) status pointer
to two functions called by ipa_endpoint_status_parse().
- Use "rule" rather than "val" as the name of a variable that
holds a routing rule ID.
- Consistently use "IPA packet status" rather than "status
element" when referring to this data structure.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 25 Jan 2023 20:45:42 +0000 (14:45 -0600)]
net: ipa: define remaining IPA status field values
Define the remaining values for opcode and exception fields in the
IPA packet status structure. Most of these values are powers-of-2,
suggesting they are meant to be used as bitmasks, but that is not
the case. Add comments to be clear about this, and express the
values in decimal format.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 25 Jan 2023 20:45:41 +0000 (14:45 -0600)]
net: ipa: rename the NAT enumerated type
Rename the ipa_nat_en enumerated type to be ipa_nat_type, and rename
its symbols accordingly. Add a comment indicating those values are
also used in the IPA status nat_type field.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 25 Jan 2023 20:45:40 +0000 (14:45 -0600)]
net: ipa: define all IPA status mask bits
There is a 16 bit status mask defined in the IPA packet status
structure, of which only one (TAG_VALID) is currently used.
Define all other IPA status mask values in an enumerated type whose
numeric values are bit mask values (in CPU byte order) in the status
mask. Use the TAG_VALID value from that type rather than defining a
separate field mask.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 25 Jan 2023 20:45:39 +0000 (14:45 -0600)]
net: ipa: stop using sizeof(status)
The IPA packet status structure changes in IPA v5.0 in ways that are
difficult to represent cleanly. As a small step toward redefining
it as a parsed block of data, use a constant to define its size,
rather than the size of the IPA status structure type.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 25 Jan 2023 20:45:38 +0000 (14:45 -0600)]
net: ipa: refactor status buffer parsing
The packet length encoded in an IPA packet status buffer is computed
more than once in ipa_endpoint_status_parse(). It is also checked
again in ipa_endpoint_status_skip(), which that function calls.
Compute the length once, and use that computed value later rather
than recomputing it. Check for it being zero in the parse function
rather than in ipa_endpoint_status_skip().
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Wed, 25 Jan 2023 14:57:16 +0000 (16:57 +0200)]
net: dsa: ocelot: build felix.c into a dedicated kernel module
The build system currently complains:
scripts/Makefile.build:252: drivers/net/dsa/ocelot/Makefile:
felix.o is added to multiple modules: mscc_felix mscc_seville
Since felix.c holds the DSA glue layer, create a mscc_felix_dsa_lib.ko.
This is similar to how mscc_ocelot_switch_lib.ko holds a library for
configuring the hardware.
Jakub Kicinski [Fri, 27 Jan 2023 07:29:12 +0000 (23:29 -0800)]
Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
virtchnl: update and refactor
Jesse Brandeburg says:
The virtchnl.h file is used by i40e/ice physical function (PF) drivers
and irdma when talking to the iavf driver. This series cleans up the
header file by removing unused elements, adding/cleaning some comments,
fixing the data structures so they are explicitly defined, including
padding, and finally does a long overdue rename of the IWARP members in
the structures to RDMA, since the ice driver and it's associated Intel
Ethernet E800 series adapters support both RDMA and IWARP.
The whole series should result in no functional change, but hopefully
clearer code.
* '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
virtchnl: i40e/iavf: rename iwarp to rdma
virtchnl: do structure hardening
virtchnl: update header and increase header clarity
virtchnl: remove unused structure declaration
====================
====================
tools: ynl: prevent reorder and fix flags
Some codegen improvements for YAML specs.
First, Lorenzon discovered when switching the XDP feature family
to use flags instead of pure enum that the kdoc got garbled.
The support for enum and flags is therefore unified.
Second when regenerating all families we discussed so far I noticed
that some netlink policies jumped around. We need to ensure we don't
render code based on their ordering in a hash.
====================
Jakub Kicinski [Thu, 26 Jan 2023 00:02:35 +0000 (16:02 -0800)]
tools: ynl: store ops in ordered dict to avoid random ordering
When rendering code we should walk the ops in the order in which
they are declared in the spec. This is both more intuitive and
prevents code from jumping around when hashing in the dict changes.
Jakub Kicinski [Thu, 26 Jan 2023 00:02:34 +0000 (16:02 -0800)]
tools: ynl: rename ops_list -> msg_list
ops_list contains all the operations, but the main iteration use
case is to walk only ops which define attrs. Rename ops_list to
msg_list, because now it looks like the contents are the same,
just the format is different. While at it convert from tuple
to just keys, none of the users care about the name of the op.
====================
Convert drivers to return XFRM configuration errors through extack
This series continues effort started by Sabrina to return XFRM configuration
errors through extack. It allows for user space software stack easily present
driver failure reasons to users.
As a note, Intel drivers have a path where extack is equal to NULL, and error
prints won't be available in current patchset. If it is needed, it can be
changed by adding special to Intel macro to print to dmesg in case of
extack == NULL.
====================
Leon Romanovsky [Tue, 24 Jan 2023 11:55:02 +0000 (13:55 +0200)]
nfp: fill IPsec state validation failure reason
Rely on extack to return failure reason.
Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Wed, 25 Jan 2023 11:02:14 +0000 (13:02 +0200)]
net: ethtool: provide shims for stats aggregation helpers when CONFIG_ETHTOOL_NETLINK=n
ethtool_aggregate_*_stats() are implemented in net/ethtool/stats.c, a
file which is compiled out when CONFIG_ETHTOOL_NETLINK=n. In order to
avoid adding Kbuild dependencies from drivers (which call these helpers)
on CONFIG_ETHTOOL_NETLINK, let's add some shim definitions which simply
make the helpers dead code.
This means the function prototypes should have been located in
include/linux/ethtool_netlink.h rather than include/linux/ethtool.h.
Fixes: 449c5459641a ("net: ethtool: add helpers for aggregate statistics") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20230125110214.4127759-1-vladimir.oltean@nxp.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
====================
mptcp: add mixed v4/v6 support for the in-kernel PM
Before these patches, the in-kernel Path-Manager would not allow, for
the same MPTCP connection, having a mix of subflows in v4 and v6.
MPTCP's RFC 8684 doesn't forbid that and it is even recommended to do so
as the path in v4 and v6 are likely different. Some networks are also
v4 or v6 only, we cannot assume they all have both v4 and v6 support.
Patch 1 then removes this artificial constraint in the in-kernel PM
currently enforcing there are no mixed subflows in place, either in
address announcement or in subflow creation areas.
Patch 2 makes sure the sk_ipv6only attribute is also propagated to
subflows, just in case a new PM wouldn't respect it.
Some selftests have also been added for the in-kernel PM (patch 3).
Patches 4 to 8 are just some cleanups and small improvements in the
printed messages in the userspace PM. It is not linked to the rest but
identified when working on a related patch modifying this selftest,
already in -net:
Matthieu Baerts [Wed, 25 Jan 2023 10:47:28 +0000 (11:47 +0100)]
selftests: mptcp: userspace: avoid read errors
During the cleanup phase, the server pids were killed with a SIGTERM
directly, not using a SIGUSR1 first to quit safely. As a result, this
test was often ending with two error messages:
read: Connection reset by peer
While at it, use a for-loop to terminate all the PIDs the same way.
Also the different files are now removed after having killed the PIDs
using them. It makes more sense to do that in this order.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Paolo Abeni [Wed, 25 Jan 2023 10:47:21 +0000 (11:47 +0100)]
mptcp: let the in-kernel PM use mixed IPv4 and IPv6 addresses
Currently the in-kernel PM arbitrary enforces that created subflow's
family must match the main MPTCP socket while the RFC allows mixing
IPv4 and IPv6 subflows.
This patch changes the in-kernel PM logic to create subflows matching
the currently selected source (or destination) address. IPv4 sockets
can pick only IPv4 addresses (and v4 mapped in v6), while IPv6 sockets
not restricted to V6ONLY can pick either IPv4 and IPv6 addresses as
long as the source and destination matches.
A helper, previously introduced is used to ease family matching checks,
taking care of IPv4 vs IPv4-mapped-IPv6 vs IPv6 only addresses.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/269 Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
However, when ICMP output is limited, there is no way to tell
which limit has been hit or even if the limits are responsible
for the lack of ICMP output.
Add counters for each of the cases above. As we are within
local_bh_disable(), use the __INC stats variant.
Paolo Abeni [Thu, 26 Jan 2023 09:08:05 +0000 (10:08 +0100)]
Merge branch 'adding-sparx5-is0-vcap-support'
Steen Hegelund says:
====================
Adding Sparx5 IS0 VCAP support
This provides the Ingress Stage 0 (IS0) VCAP (Versatile Content-Aware
Processor) support for the Sparx5 platform.
The IS0 VCAP (also known in the datasheet as CLM) is a classifier VCAP that
mainly extracts frame information to metadata that follows the frame in the
Sparx5 processing flow all the way to the egress port.
The IS0 VCAP has 4 lookups and they are accessible with a TC chain id:
Each of these lookups have their own port keyset configuration that decides
which keys will be used for matching on which traffic type.
The IS0 VCAP has these traffic classifications:
- IPv4 frames
- IPv6 frames
- Unicast MPLS frames (ethertype = 0x8847)
- Multicast MPLS frames (ethertype = 0x8847)
- Other frame types than MPLS, IPv4 and IPv6
The IS0 VCAP has an action that allows setting the value of a PAG (Policy
Association Group) key field in the frame metadata, and this can be used
for matching in an IS2 VCAP rule.
This allow rules in the IS0 VCAP to be linked to rules in the IS2 VCAP.
The linking is exposed by using the TC "goto chain" action with an offset
from the IS2 chain ids.
As an example a "goto chain 8000001" will use a PAG value of 1 to chain to
a rule in IS2 Lookup 0.
====================
Steen Hegelund [Tue, 24 Jan 2023 10:45:09 +0000 (11:45 +0100)]
net: microchip: sparx5: Add automatic selection of VCAP rule actionset
With more than one possible actionset in a VCAP instance, the VCAP API will
now use the actions in a VCAP rule to select the actionset that fits these
actions the best possible way.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Steen Hegelund [Tue, 24 Jan 2023 10:45:08 +0000 (11:45 +0100)]
net: microchip: sparx5: Add TC filter chaining support for IS0 and IS2 VCAPs
This allows rules to be chained between VCAP instances, e.g. from IS0
Lookup 0 to IS0 Lookup 1, or from one of the IS0 Lookups to one of the IS2
Lookups.
Chaining from an IS2 Lookup to another IS2 Lookup is not supported in the
hardware.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This patch set is a follow up to the "How to share IPv4 addresses by
partitioning the port space" talk given at LPC 2022 [1].
Please see patch #1 for the motivation & the use case description.
Patch #2 adds tests exercising the new option in various scenarios.
Documentation
-------------
Proposed update to the ip(7) man-page:
IP_LOCAL_PORT_RANGE (since Linux X.Y)
Set or get the per-socket default local port range. This
option can be used to clamp down the global local port
range, defined by the ip_local_port_range /proc interface
described below, for a given socket.
The option takes an uint32_t value with the high 16 bits
set to the upper range bound, and the low 16 bits set to
the lower range bound. Range bounds are inclusive. The
16-bit values should be in host byte order.
The lower bound has to be less than the upper bound when
both bounds are not zero. Otherwise, setting the option
fails with EINVAL.
If either bound is outside of the global local port range,
or is zero, then that bound has no effect.
To reset the setting, pass zero as both the upper and the
lower bound.
Interaction with SELinux bind() hook
------------------------------------
SELinux bind() hook - selinux_socket_bind() - performs a permission check
if the requested local port number lies outside of the netns ephemeral port
range.
The proposed socket option cannot be used change the ephemeral port range
to extend beyond the per-netns port range, as set by
net.ipv4.ip_local_port_range.
Hence, there is no interaction with SELinux, AFAICT.
Jakub Sitnicki [Tue, 24 Jan 2023 13:36:44 +0000 (14:36 +0100)]
selftests/net: Cover the IP_LOCAL_PORT_RANGE socket option
Exercise IP_LOCAL_PORT_RANGE socket option in various scenarios:
1. pass invalid values to setsockopt
2. pass a range outside of the per-netns port range
3. configure a single-port range
4. exhaust a configured multi-port range
5. check interaction with late-bind (IP_BIND_ADDRESS_NO_PORT)
6. set then get the per-socket port range
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Sitnicki [Tue, 24 Jan 2023 13:36:43 +0000 (14:36 +0100)]
inet: Add IP_LOCAL_PORT_RANGE socket option
Users who want to share a single public IP address for outgoing connections
between several hosts traditionally reach for SNAT. However, SNAT requires
state keeping on the node(s) performing the NAT.
A stateless alternative exists, where a single IP address used for egress
can be shared between several hosts by partitioning the available ephemeral
port range. In such a setup:
1. Each host gets assigned a disjoint range of ephemeral ports.
2. Applications open connections from the host-assigned port range.
3. Return traffic gets routed to the host based on both, the destination IP
and the destination port.
An application which wants to open an outgoing connection (connect) from a
given port range today can choose between two solutions:
1. Manually pick the source port by bind()'ing to it before connect()'ing
the socket.
This approach has a couple of downsides:
a) Search for a free port has to be implemented in the user-space. If
the chosen 4-tuple happens to be busy, the application needs to retry
from a different local port number.
Detecting if 4-tuple is busy can be either easy (TCP) or hard
(UDP). In TCP case, the application simply has to check if connect()
returned an error (EADDRNOTAVAIL). That is assuming that the local
port sharing was enabled (REUSEADDR) by all the sockets.
# Assume desired local port range is 60_000-60_511
s = socket(AF_INET, SOCK_STREAM)
s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
s.bind(("192.0.2.1", 60_000))
s.connect(("1.1.1.1", 53))
# Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy
# Application must retry with another local port
In case of UDP, the network stack allows binding more than one socket
to the same 4-tuple, when local port sharing is enabled
(REUSEADDR). Hence detecting the conflict is much harder and involves
querying sock_diag and toggling the REUSEADDR flag [1].
b) For TCP, bind()-ing to a port within the ephemeral port range means
that no connecting sockets, that is those which leave it to the
network stack to find a free local port at connect() time, can use
the this port.
IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port
will be skipped during the free port search at connect() time.
2. Isolate the app in a dedicated netns and use the use the per-netns
ip_local_port_range sysctl to adjust the ephemeral port range bounds.
The per-netns setting affects all sockets, so this approach can be used
only if:
- there is just one egress IP address, or
- the desired egress port range is the same for all egress IP addresses
used by the application.
For TCP, this approach avoids the downsides of (1). Free port search and
4-tuple conflict detection is done by the network stack:
s = socket(AF_INET, SOCK_STREAM)
s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1)
s.bind(("192.0.2.1", 0))
s.connect(("1.1.1.1", 53))
# Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy
For UDP this approach has limited applicability. Setting the
IP_BIND_ADDRESS_NO_PORT socket option does not result in local source
port being shared with other connected UDP sockets.
Hence relying on the network stack to find a free source port, limits the
number of outgoing UDP flows from a single IP address down to the number
of available ephemeral ports.
To put it another way, partitioning the ephemeral port range between hosts
using the existing Linux networking API is cumbersome.
To address this use case, add a new socket option at the SOL_IP level,
named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the
ephemeral port range for each socket individually.
The option can be used only to narrow down the per-netns local port
range. If the per-socket range lies outside of the per-netns range, the
latter takes precedence.
UAPI-wise, the low and high range bounds are passed to the kernel as a pair
of u16 values in host byte order packed into a u32. This avoids pointer
passing.
PORT_LO = 40_000
PORT_HI = 40_511
s = socket(AF_INET, SOCK_STREAM)
v = struct.pack("I", PORT_HI << 16 | PORT_LO)
s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v)
s.bind(("127.0.0.1", 0))
s.getsockname()
# Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511),
# if there is a free port. EADDRINUSE otherwise.
Reviewed-by: Marek Majkowski <marek@cloudflare.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jesse Brandeburg [Fri, 16 Dec 2022 20:06:58 +0000 (12:06 -0800)]
virtchnl: i40e/iavf: rename iwarp to rdma
Since the latest Intel hardware does both IWARP and ROCE, rename the
term IWARP in the virtchnl header to be RDMA. Do this for both upper and
lower case instances. Many of the non-virtchnl.h changes were done with
regular expression replacements using perl like:
perl -p -i -e 's/_IWARP/_RDMA/' <files>
perl -p -i -e 's/_iwarp/_rdma/' <files>
and I had to pick up a few instances manually.
The virtchnl.h header has some comments and clarity added around when to
use certain defines.
note: had to fix a checkpatch warning for a long line by wrapping one of
the lines I changed.
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Jakub Andrysiak <jakub.andrysiak@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Jesse Brandeburg [Fri, 16 Dec 2022 20:06:57 +0000 (12:06 -0800)]
virtchnl: do structure hardening
The virtchnl interface can have a bunch of "soft" defined structures
hardened by using explicit sizes for declarations, and then referring to
the enum type that uses them in a comment. None of these changes should
change any of the structure sizes.
Also, remove a duplicate line in a switch statement and let two cases
uses the same code.
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Marek Szlosek <marek.szlosek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
struct genl_info *info will be passed as NULL, and pause_prepare_data()
dereferences it while getting the extended ack pointer.
To avoid that, just set the extack to NULL if "info" is NULL, since the
netlink extack handling messages know how to deal with that.
The pattern "info ? info->extack : NULL" is present in quite a few other
"prepare_data" implementations, so it's clear that it's a more general
problem to be dealt with at a higher level, but the code should have at
least adhered to the current conventions to avoid the NULL dereference.
Fixes: 04692c9020b7 ("net: ethtool: netlink: retrieve stats from multiple sources (eMAC, pMAC)") Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reported-by: syzbot+9d44aae2720fc40b8474@syzkaller.appspotmail.com Signed-off-by: David S. Miller <davem@davemloft.net>
struct genl_info *info will be passed as NULL, and stats_prepare_data()
dereferences it while getting the extended ack pointer.
To avoid that, just set the extack to NULL if "info" is NULL, since the
netlink extack handling messages know how to deal with that.
The pattern "info ? info->extack : NULL" is present in quite a few other
"prepare_data" implementations, so it's clear that it's a more general
problem to be dealt with at a higher level, but the code should have at
least adhered to the current conventions to avoid the NULL dereference.
Fixes: 04692c9020b7 ("net: ethtool: netlink: retrieve stats from multiple sources (eMAC, pMAC)") Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Previously, there was no clean separation between SMC-D code and the ISM
device driver.This patch series addresses the situation to make ISM available
for uses outside of SMC-D.
In detail: SMC-D offers an interface via struct smcd_ops, which only the
ISM module implements so far. However, there is no real separation between
the smcd and ism modules, which starts right with the ISM device
initialization, which calls directly into the SMC-D code.
This patch series introduces a new API in the ISM module, which allows
registration of arbitrary clients via include/linux/ism.h: struct ism_client.
Furthermore, it introduces a "pure" struct ism_dev (i.e. getting rid of
dependencies on SMC-D in the device structure), and adds a number of API
calls for data transfers via ISM (see ism_register_dmb() & friends).
Still, the ISM module implements the SMC-D API, and therefore has a number
of internal helper functions for that matter.
Note that the ISM API is consciously kept thin for now (as compared to the
SMC-D API calls), as a number of API calls are only used with SMC-D and
hardly have any meaningful usage beyond SMC-D, e.g. the VLAN-related calls.
v1 -> v2:
Removed s390x dependency which broke config for other archs.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Stefan Raspl [Mon, 23 Jan 2023 18:17:52 +0000 (19:17 +0100)]
net/smc: De-tangle ism and smc device initialization
The struct device for ISM devices was part of struct smcd_dev. Move to
struct ism_dev, provide a new API call in struct smcd_ops, and convert
existing SMCD code accordingly.
Furthermore, remove struct smcd_dev from struct ism_dev.
This is the final part of a bigger overhaul of the interfaces between SMC
and ISM.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Stefan Raspl [Mon, 23 Jan 2023 18:17:51 +0000 (19:17 +0100)]
s390/ism: Consolidate SMC-D-related code
The ism module had SMC-D-specific code sprinkled across the entire module.
We are now consolidating the SMC-D-specific parts into the latter parts
of the module, so it becomes more clear what code is intended for use with
ISM, and which parts are glue code for usage in the context of SMC-D.
This is the fourth part of a bigger overhaul of the interfaces between SMC
and ISM.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Stefan Raspl [Mon, 23 Jan 2023 18:17:50 +0000 (19:17 +0100)]
net/smc: Separate SMC-D and ISM APIs
We separate the code implementing the struct smcd_ops API in the ISM
device driver from the functions that may be used by other exploiters of
ISM devices.
Note: We start out small, and don't offer the whole breadth of the ISM
device for public use, as many functions are specific to or likely only
ever used in the context of SMC-D.
This is the third part of a bigger overhaul of the interfaces between SMC
and ISM.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Stefan Raspl [Mon, 23 Jan 2023 18:17:49 +0000 (19:17 +0100)]
net/smc: Register SMC-D as ISM client
Register the smc module with the new ism device driver API.
This is the second part of a bigger overhaul of the interfaces between SMC
and ISM.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Stefan Raspl [Mon, 23 Jan 2023 18:17:48 +0000 (19:17 +0100)]
net/ism: Add new API for client registration
Add a new API that allows other drivers to concurrently access ISM devices.
To do so, we introduce a new API that allows other modules to register for
ISM device usage. Furthermore, we move the GID to struct ism, where it
belongs conceptually, and rename and relocate struct smcd_event to struct
ism_event.
This is the first part of a bigger overhaul of the interfaces between SMC
and ISM.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Stefan Raspl [Mon, 23 Jan 2023 18:17:47 +0000 (19:17 +0100)]
s390/ism: Introduce struct ism_dmb
Conceptually, a DMB is a structure that belongs to ISM devices. However,
SMC currently 'owns' this structure. So future exploiters of ISM devices
would be forced to include SMC headers to work - which is just weird.
Therefore, we switch ISM to struct ism_dmb, introduce a new public header
with the definition (will be populated with further API calls later on),
and, add a thin wrapper to please SMC. Since structs smcd_dmb and ism_dmb
are identical, we can simply convert between the two for now.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Stefan Raspl [Mon, 23 Jan 2023 18:17:46 +0000 (19:17 +0100)]
net/ism: Add missing calls to disable bus-mastering
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Stefan Raspl [Mon, 23 Jan 2023 18:17:45 +0000 (19:17 +0100)]
net/smc: Terminate connections prior to device removal
Removing an ISM device prior to terminating its associated connections
doesn't end well.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Parav Pandit [Mon, 23 Jan 2023 03:55:11 +0000 (05:55 +0200)]
virtio-net: Reduce debug name field size to 16 bytes
virtio queue index can be maximum of 65535. 16 bytes are enough to store
the vq name with the existing string prefix.
With this change, send queue struct saves 24 bytes and receive
queue saves whole cache line worth 64 bytes per structure
due to saving in alignment bytes.
Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Tue, 24 Jan 2023 03:52:31 +0000 (19:52 -0800)]
devlink: remove a dubious assumption in fmsg dumping
Build bot detects that err may be returned uninitialized in
devlink_fmsg_prepare_skb(). This is not really true because
all fmsgs users should create at least one outer nest, and
therefore fmsg can't be completely empty.
That said the assumption is not trivial to confirm, so let's
follow the bots advice, anyway.
This code does not seem to have changed since its inception in
commit 1db64e8733f6 ("devlink: Add devlink formatted message (fmsg) API")
Vladimir Oltean [Mon, 23 Jan 2023 18:45:38 +0000 (20:45 +0200)]
net: mscc: ocelot: fix incorrect verify_enabled reporting in ethtool get_mm()
We don't read the verify_enabled variable from hardware in the MAC Merge
layer state GET operation, instead we always leave it set to "false".
The user may think something is wrong if they set verify_enabled to
true, then read it back and see it's still false, even though the
configuration took place.
James Hershaw [Mon, 23 Jan 2023 13:41:35 +0000 (14:41 +0100)]
nfp: flower: change get/set_eeprom logic and enable for flower reps
The changes in this patch are as follows:
- Alter the logic of get/set_eeprom functions to use the helper function
nfp_app_from_netdev() which handles differentiating between an nfp_net
and a nfp_repr. This allows us to get an agnostic backpointer to the
pdev.
- Enable the various eeprom commands by adding the 'get_eeprom_len',
'get_eeprom', 'set_eeprom' callbacks to the nfp_port_ethtool_ops struct.
This allows the eeprom commands to work on representor interfaces,
similar to a previous patch which added it to the vnics.
Currently these are being used to configure persistent MAC addresses for
the physical ports on the nfp.
Paolo Abeni [Tue, 24 Jan 2023 10:02:03 +0000 (11:02 +0100)]
Merge branch 'netlink-protocol-specs'
Jakub Kicinski says:
====================
Netlink protocol specs
I think the Netlink proto specs are far along enough to merge.
Filling in all attribute types and quirks will be an ongoing
effort but we have enough to cover FOU so it's somewhat complete.
I fully intend to continue polishing the code but at the same
time I'd like to start helping others base their work on the
specs (e.g. DPLL) and need to start working on some new families
myself.
That's the progress / motivation for merging. The RFC [1] has more
of a high level blurb, plus I created a lot of documentation, I'm
not going to repeat it here. There was also the talk at LPC [2].
Jakub Kicinski [Fri, 20 Jan 2023 17:50:39 +0000 (09:50 -0800)]
net: fou: rename the source for linking
We'll need to link two objects together to form the fou module.
This means the source can't be called fou, the build system expects
fou.o to be the combined object.
Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Fri, 20 Jan 2023 17:50:35 +0000 (09:50 -0800)]
netlink: add schemas for YAML specs
Add schemas for Netlink spec files. As described in the docs
we have 4 "protocols" or compatibility levels, and each one
comes with its own schema, but the more general / legacy
schemas are superset of more modern ones: genetlink is
the smallest followed by genetlink-c and genetlink-legacy.
There is no schema for raw netlink, yet, I haven't found the time..
I don't know enough jsonschema to do inheritance or something
but the repetition is not too bad. I hope.
Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>