Jan Sokolowski [Mon, 9 Jan 2023 14:11:18 +0000 (15:11 +0100)]
i40e: use int for i40e_status
To prepare for removal of i40e_status, change the variables
from i40e_status to int. This eases the transition when values
are changed to return standard int error codes over enum i40e_status.
As such changes often also change variable orders, a cleanup
is also applied here to make variables conform to RCT and
some lines are also reformatted where applicable.
Signed-off-by: Jan Sokolowski <jan.sokolowski@intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Jan Sokolowski [Mon, 9 Jan 2023 14:11:17 +0000 (15:11 +0100)]
i40e: Remove string printing for i40e_status
Remove the i40e_stat_str() function which prints the string
representation of the i40e_status error code. With upcoming changes
moving away from i40e_status, there will be no need for this function
Signed-off-by: Jan Sokolowski <jan.sokolowski@intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Jan Sokolowski [Mon, 9 Jan 2023 14:11:16 +0000 (15:11 +0100)]
i40e: Remove unused i40e status codes
In an effort to remove i40e status codes and replace them
with standard kernel errornums, unused values of i40e_status_code
were removed.
Signed-off-by: Jan Sokolowski <jan.sokolowski@intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Jakub Kicinski [Thu, 9 Feb 2023 05:32:19 +0000 (21:32 -0800)]
Merge tag 'linux-can-next-for-6.3-20230208' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
Marc Kleine-Budde says:
====================
can-next 2023-02-08
The 1st patch is by Oliver Hartkopp and cleans up the CAN_RAW's
raw_setsockopt() for CAN_RAW_FD_FRAMES.
The 2nd patch is by me and fixes the compilation if
CONFIG_CAN_CALC_BITTIMING is disabled. (Problem introduced in last
pull request to next-next.)
* tag 'linux-can-next-for-6.3-20230208' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next:
can: bittiming: can_calc_bittiming(): add missing parameter to no-op function
can: raw: use temp variable instead of rolling back config
====================
Jakub Kicinski [Thu, 9 Feb 2023 05:00:54 +0000 (21:00 -0800)]
Merge tag 'mlx5-next-netdev-deadlock' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Saeed Mahameed says:
====================
mlx5-next-netdev-deadlock
This series from Jiri solves a deadlock when removing a network namespace
with mlx5 devlink instance being in it.
The deadlock is between:
1) mlx5_ib->unregister_netdevice_notifier()
AND
2) mlx5_core->devlink_reload->cleanup_net()
To slove this introduced mlx5 netdev added/removed events to track uplink
netdev to be used for register_netdevice_notifier_dev_net() purposes.
* tag 'mlx5-next-netdev-deadlock' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
RDMA/mlx5: Track netdev to avoid deadlock during netdev notifier unregister
net/mlx5e: Propagate an internal event in case uplink netdev changes
net/mlx5e: Fix trap event handling
net/mlx5: Introduce CQE error syndrome
====================
Resolve this by converting to register_netdevice_notifier_dev_net()
which does not take pernet_ops_rwsem and moves the notifier block around
according to netdev it takes as arg.
Use previously introduced netdev added/removed events to track uplink
netdev to be used for register_netdevice_notifier_dev_net() purposes.
Jiri Pirko [Tue, 1 Nov 2022 14:27:43 +0000 (15:27 +0100)]
net/mlx5e: Propagate an internal event in case uplink netdev changes
Whenever uplink netdev is set/cleared, propagate newly introduced event
to inform notifier blocks netdev was added/removed.
Move the set() helper to core.c from header, introduce clear() and
netdev_added_event_replay() helpers. The last one is going to be called
from rdma driver, so export it.
Jiri Pirko [Thu, 24 Nov 2022 12:05:53 +0000 (13:05 +0100)]
net/mlx5e: Fix trap event handling
Current code does not return correct return value from event handler.
Fix it by returning NOTIFY_* and propagate err over newly introduce ctx
structure.
can: bittiming: can_calc_bittiming(): add missing parameter to no-op function
In commit 286c0e09e8e0 ("can: bittiming: can_changelink() pass extack
down callstack") a new parameter was added to can_calc_bittiming(),
however the static inline no-op (which is used if
CONFIG_CAN_CALC_BITTIMING is disabled) wasn't converted.
Add the new parameter to the static inline no-op of
can_calc_bittiming().
Oliver Hartkopp [Fri, 3 Feb 2023 09:08:07 +0000 (10:08 +0100)]
can: raw: use temp variable instead of rolling back config
Introduce a temporary variable to check for an invalid configuration
attempt from user space. Before this patch the value was copied to
the real config variable and rolled back in the case of an error.
David S. Miller [Wed, 8 Feb 2023 09:48:53 +0000 (09:48 +0000)]
Merge branch 'taprio-auto-qmaxsdu-new-tx'
Vladimir Oltean says:
====================
taprio automatic queueMaxSDU and new TXQ selection procedure
This patch set addresses 2 design limitations in the taprio software scheduler:
1. Software scheduling fundamentally prioritizes traffic incorrectly,
in a way which was inspired from Intel igb/igc drivers and does not
follow the inputs user space gives (traffic classes and TC to TXQ
mapping). Patch 05/15 handles this, 01/15 - 04/15 are preparations
for this work.
2. Software scheduling assumes that the gate for a traffic class closes
as soon as the next interval begins. But this isn't true.
If consecutive schedule entries have that traffic class gate open,
there is no "gate close" event and taprio should keep dequeuing from
that TC without interruptions. Patches 06/15 - 15/15 handle this.
Patch 10/15 is a generic Qdisc change required for this to work.
Future development directions which depend on this patch set are:
- Propagating the automatic queueMaxSDU calculation down to offloading
device drivers, instead of letting them calculate this, as
vsc9959_tas_guard_bands_update() does today.
- A software data path for tc-taprio with preemptible traffic and
Hold/Release events.
Vladimir Oltean [Tue, 7 Feb 2023 13:54:40 +0000 (15:54 +0200)]
net/sched: taprio: don't segment unnecessarily
Improve commit 497cc00224cf ("taprio: Handle short intervals and large
packets") to only perform segmentation when skb->len exceeds what
taprio_dequeue() expects.
In practice, this will make the biggest difference when a traffic class
gate is always open in the schedule. This is because the max_frm_len
will be U32_MAX, and such large skb->len values as Kurt reported will be
sent just fine unsegmented.
What I don't seem to know how to handle is how to make sure that the
segmented skbs themselves are smaller than the maximum frame size given
by the current queueMaxSDU[tc]. Nonetheless, we still need to drop
those, otherwise the Qdisc will hang.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Tue, 7 Feb 2023 13:54:39 +0000 (15:54 +0200)]
net/sched: taprio: split segmentation logic from qdisc_enqueue()
The majority of the taprio_enqueue()'s function is spent doing TCP
segmentation, which doesn't look right to me. Compilers shouldn't have a
problem in inlining code no matter how we write it, so move the
segmentation logic to a separate function.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Tue, 7 Feb 2023 13:54:38 +0000 (15:54 +0200)]
net/sched: taprio: automatically calculate queueMaxSDU based on TC gate durations
taprio today has a huge problem with small TC gate durations, because it
might accept packets in taprio_enqueue() which will never be sent by
taprio_dequeue().
Since not much infrastructure was available, a kludge was added in
commit 497cc00224cf ("taprio: Handle short intervals and large
packets"), which segmented large TCP segments, but the fact of the
matter is that the issue isn't specific to large TCP segments (and even
worse, the performance penalty in segmenting those is absolutely huge).
In commit a54fc09e4cba ("net/sched: taprio: allow user input of per-tc
max SDU"), taprio gained support for queueMaxSDU, which is precisely the
mechanism through which packets should be dropped at qdisc_enqueue() if
they cannot be sent.
After that patch, it was necessary for the user to manually limit the
maximum MTU per TC. This change adds the necessary logic for taprio to
further limit the values specified (or not specified) by the user to
some minimum values which never allow oversized packets to be sent.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Tue, 7 Feb 2023 13:54:37 +0000 (15:54 +0200)]
net/sched: keep the max_frm_len information inside struct sched_gate_list
I have one practical reason for doing this and one concerning correctness.
The practical reason has to do with a follow-up patch, which aims to mix
2 sources of max_sdu (one coming from the user and the other automatically
calculated based on TC gate durations @current link speed). Among those
2 sources of input, we must always select the smaller max_sdu value, but
this can change at various link speeds. So the max_sdu coming from the
user must be kept separated from the value that is operationally used
(the minimum of the 2), because otherwise we overwrite it and forget
what the user asked us to do.
To solve that, this patch proposes that struct sched_gate_list contains
the operationally active max_frm_len, and q->max_sdu contains just what
was requested by the user.
The reason having to do with correctness is based on the following
observation: the admin sched_gate_list becomes operational at a given
base_time in the future. Until then, it is inactive and applies no
shaping, all gates are open, etc. So the queueMaxSDU dropping shouldn't
apply either (this is a mechanism to ensure that packets smaller than
the largest gate duration for that TC don't hang the port; clearly it
makes little sense if the gates are always open).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Tue, 7 Feb 2023 13:54:36 +0000 (15:54 +0200)]
net/sched: taprio: warn about missing size table
Vinicius intended taprio to take the L1 overhead into account when
estimating packet transmission time through user input, specifically
through the qdisc size table (man tc-stab).
Without the overhead being specified, transmission times will be
underestimated and will cause late transmissions. For an offloading
driver, it might even cause TX hangs if there is no open gate large
enough to send the maximum sized packets for that TC (including L1
overhead). Properly knowing the L1 overhead will ensure that we are able
to auto-calculate the queueMaxSDU per traffic class just right, and
avoid these hangs due to head-of-line blocking.
We can't make the stab mandatory due to existing setups, but we can warn
the user that it's important with a warning netlink extack.
Vladimir Oltean [Tue, 7 Feb 2023 13:54:35 +0000 (15:54 +0200)]
net/sched: make stab available before ops->init() call
Some qdiscs like taprio turn out to be actually pretty reliant on a well
configured stab, to not underestimate the skb transmission time (by
properly accounting for L1 overhead).
In a future change, taprio will need the stab, if configured by the
user, to be available at ops->init() time. It will become even more
important in upcoming work, when the overhead will be used for the
queueMaxSDU calculation that is passed to an offloading driver.
However, rcu_assign_pointer(sch->stab, stab) is called right after
ops->init(), making it unavailable, and I don't really see a good reason
for that.
Move it earlier, which nicely seems to simplify the error handling path
as well.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Tue, 7 Feb 2023 13:54:34 +0000 (15:54 +0200)]
net/sched: taprio: calculate guard band against actual TC gate close time
taprio_dequeue_from_txq() looks at the entry->end_time to determine
whether the skb will overrun its traffic class gate, as if at the end of
the schedule entry there surely is a "gate close" event for it. Hint:
maybe there isn't.
For each schedule entry, introduce an array of kernel times which
actually tracks when in the future will there be an *actual* gate close
event for that traffic class, and use that in the guard band overrun
calculation.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Tue, 7 Feb 2023 13:54:33 +0000 (15:54 +0200)]
net/sched: taprio: calculate budgets per traffic class
Currently taprio assumes that the budget for a traffic class expires at
the end of the current interval as if the next interval contains a "gate
close" event for this traffic class.
This is, however, an unfounded assumption. Allow schedule entry
intervals to be fused together for a particular traffic class by
calculating the budget until the gate *actually* closes.
This means we need to keep budgets per traffic class, and we also need
to update the budget consumption procedure.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Tue, 7 Feb 2023 13:54:32 +0000 (15:54 +0200)]
net/sched: taprio: rename close_time to end_time
There is a confusion in terms in taprio which makes what is called
"close_time" to be actually used for 2 things:
1. determining when an entry "closes" such that transmitted skbs are
never allowed to overrun that time (?!)
2. an aid for determining when to advance and/or restart the schedule
using the hrtimer
It makes more sense to call this so-called "close_time" "end_time",
because it's not clear at all to me what "closes". Future patches will
hopefully make better use of the term "to close".
This is an absolutely mechanical change.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Tue, 7 Feb 2023 13:54:31 +0000 (15:54 +0200)]
net/sched: taprio: calculate tc gate durations
Current taprio code operates on a very simplistic (and incorrect)
assumption: that egress scheduling for a traffic class can only take
place for the duration of the current interval, or i.o.w., it assumes
that at the end of each schedule entry, there is a "gate close" event
for all traffic classes.
As an example, traffic sent with the schedule below will be jumpy, even
though all 8 TC gates are open, so there is absolutely no "gate close"
event (effectively a transition from BIT(tc)==1 to BIT(tc)==0 in
consecutive schedule entries):
This qdisc simply does not have what it takes in terms of logic to
*actually* compute the durations of traffic classes. Also, it does not
recognize the need to use this information on a per-traffic-class basis:
it always looks at entry->interval and entry->close_time.
This change proposes that each schedule entry has an array called
tc_gate_duration[tc]. This holds the information: "for how long will
this traffic class gate remain open, starting from *this* schedule
entry". If the traffic class gate is always open, that value is equal to
the cycle time of the schedule.
We'll also need to keep track, for the purpose of queueMaxSDU[tc]
calculation, what is the maximum time duration for a traffic class
having an open gate. This gives us directly what is the maximum sized
packet that this traffic class will have to accept. For everything else
it has to qdisc_drop() it in qdisc_enqueue().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Tue, 7 Feb 2023 13:54:30 +0000 (15:54 +0200)]
net/sched: taprio: give higher priority to higher TCs in software dequeue mode
Current taprio software implementation is haunted by the shadow of the
igb/igc hardware model. It iterates over child qdiscs in increasing
order of TXQ index, therefore giving higher xmit priority to TXQ 0 and
lower to TXQ N. According to discussions with Vinicius, that is the
default (perhaps even unchangeable) prioritization scheme used for the
NICs that taprio was first written for (igb, igc), and we have a case of
two bugs canceling out, resulting in a functional setup on igb/igc, but
a less sane one on other NICs.
To the best of my understanding, taprio should prioritize based on the
traffic class, so it should really dequeue starting with the highest
traffic class and going down from there. We get to the TXQ using the
tc_to_txq[] netdev property.
TXQs within the same TC have the same (strict) priority, so we should
pick from them as fairly as we can. We can achieve that by implementing
something very similar to q->curband from multiq_dequeue().
Since igb/igc really do have TXQ 0 of higher hardware priority than
TXQ 1 etc, we need to preserve the behavior for them as well. We really
have no choice, because in txtime-assist mode, taprio is essentially a
software scheduler towards offloaded child tc-etf qdiscs, so the TXQ
selection really does matter (not all igb TXQs support ETF/SO_TXTIME,
says Kurt Kanzenbach).
To preserve the behavior, we need a capability bit so that taprio can
determine if it's running on igb/igc, or on something else. Because igb
doesn't offload taprio at all, we can't piggyback on the
qdisc_offload_query_caps() call from taprio_enable_offload(), but
instead we need a separate call which is also made for software
scheduling.
Introduce two static keys to minimize the performance penalty on systems
which only have igb/igc NICs, and on systems which only have other NICs.
For mixed systems, taprio will have to dynamically check whether to
dequeue using one prioritization algorithm or using the other.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Simplify taprio_dequeue_from_txq() by noticing that we can goto one call
earlier than the previous skb_found label. This is possible because
we've unified the treatment of the child->ops->dequeue(child) return
call, we always try other TXQs now, instead of abandoning the root
dequeue completely if we failed in the peek() case.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Tue, 7 Feb 2023 13:54:28 +0000 (15:54 +0200)]
net/sched: taprio: refactor one skb dequeue from TXQ to separate function
Future changes will refactor the TXQ selection procedure, and a lot of
stuff will become messy, the indentation of the bulk of the dequeue
procedure would increase, etc.
Break out the bulk of the function into a new one, which knows the TXQ
(child qdisc) we should perform a dequeue from.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Tue, 7 Feb 2023 13:54:27 +0000 (15:54 +0200)]
net/sched: taprio: continue with other TXQs if one dequeue() failed
This changes the handling of an unlikely condition to not stop dequeuing
if taprio failed to dequeue the peeked skb in taprio_dequeue().
I've no idea when this can happen, but the only side effect seems to be
that the atomic_sub_return() call right above will have consumed some
budget. This isn't a big deal, since either that made us remain without
any budget (and therefore, we'd exit on the next peeked skb anyway), or
we could send some packets from other TXQs.
I'm making this change because in a future patch I'll be refactoring the
dequeue procedure to simplify it, and this corner case will have to go
away.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Tue, 7 Feb 2023 13:54:26 +0000 (15:54 +0200)]
net/sched: taprio: delete peek() implementation
There isn't any code in the network stack which calls taprio_peek().
We only see qdisc->ops->peek() being called on child qdiscs of other
classful qdiscs, never from the generic qdisc code. Whereas taprio is
never a child qdisc, it is always root.
This snippet of a comment from qdisc_peek_dequeued() seems to confirm:
/* we can reuse ->gso_skb because peek isn't called for root qdiscs */
Since I've been known to be wrong many times though, I'm not completely
removing it, but leaving a stub function in place which emits a warning.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 8 Feb 2023 09:16:07 +0000 (09:16 +0000)]
Merge branch 'micrel-lan8841-support'
Horatiu Vultur says:
====================
net: micrel: Add support for lan8841 PHY
Add support for lan8841 PHY.
The first patch add the support for lan8841 PHY which can run at
10/100/1000Mbit. It also has support for other features, but they are not
added in this series.
The second patch updates the documentation for the dt-bindings which is
similar to the ksz9131.
v3->v4:
- add space between defines and function names
- inside lan8841_config_init use only ret variable
v2->v3:
- reuse ksz9131_config_init
- allow only open-drain configuration
- change from single patch to a patch series
Horatiu Vultur [Tue, 7 Feb 2023 10:52:12 +0000 (11:52 +0100)]
dt-bindings: net: micrel-ksz90x1.txt: Update for lan8841
The lan8841 has the same bindings as ksz9131, so just reuse the entire
section of ksz9131.
Reviewed-by: Andrew Lunn <andrew@lunn.ch> Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Horatiu Vultur [Tue, 7 Feb 2023 10:52:11 +0000 (11:52 +0100)]
net: micrel: Add support for lan8841 PHY
The LAN8841 is completely integrated triple-speed (10BASE-T/ 100BASE-TX/
1000BASE-T) Ethernet physical layer transceivers for transmission and
reception of data on standard CAT-5, as well as CAT-5e and CAT-6,
unshielded twisted pair (UTP) cables.
The LAN8841 offers the industry-standard GMII/MII as well as the RGMII.
Some of the features of the PHY are:
- Wake on LAN
- Auto-MDIX
- IEEE 1588-2008 (V2)
- LinkMD Capable diagnosis
Currently the patch offers support only for link configuration.
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Horatiu Vultur [Tue, 7 Feb 2023 10:35:49 +0000 (11:35 +0100)]
net: lan966x: Add support for TC flower filter statistics
Add flower filter packet statistics. This will just read the TCAM
counter of the rule, which mention how many packages were hit by this
rule.
Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Wed, 8 Feb 2023 05:40:40 +0000 (21:40 -0800)]
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
ice: various virtualization cleanups
Jacob Keller says:
This series contains a variety of refactors and cleanups in the VF code for
the ice driver. Its primary focus is cleanup and simplification of the VF
operations and addition of a few new operations that will be required by
Scalable IOV, as well as some other refactors needed for the handling of VF
subfunctions.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
ice: remove unnecessary virtchnl_ether_addr struct use
ice: introduce .irq_close VF operation
ice: introduce clear_reset_state operation
ice: convert vf_ops .vsi_rebuild to .create_vsi
ice: introduce ice_vf_init_host_cfg function
ice: add a function to initialize vf entry
ice: Pull common tasks into ice_vf_post_vsi_rebuild
ice: move ice_vf_vsi_release into ice_vf_lib.c
ice: move vsi_type assignment from ice_vsi_alloc to ice_vsi_cfg
ice: refactor VSI setup to use parameter structure
ice: drop unnecessary VF parameter from several VSI functions
ice: fix function comment referring to ice_vsi_alloc
ice: Add more usage of existing function ice_get_vf_vsi(vf)
====================
James Hershaw [Mon, 6 Feb 2023 15:48:36 +0000 (16:48 +0100)]
nfp: flower: add check for flower VF netdevs for get/set_eeprom
Move the nfp_net_get_port_mac_by_hwinfo() check to ahead in the
get/set_eeprom() functions to in order to check for a VF netdev, which
this function does not support.
It is debatable if this is a fix or an enhancement, and we have chosen
to go for the latter. It does address a problem introduced by
commit 74b4f1739d4e ("nfp: flower: change get/set_eeprom logic and enable for flower reps").
However, the ethtool->len == 0 check avoids the problem manifesting as a
run-time bug (NULL pointer dereference of app).
Signed-off-by: James Hershaw <james.hershaw@corigine.com> Reviewed-by: Louis Peens <louis.peens@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20230206154836.2803995-1-simon.horman@corigine.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 8 Feb 2023 04:18:51 +0000 (20:18 -0800)]
Merge branch 'mlxsw-misc-devlink-changes'
Petr Machata says:
====================
mlxsw: Misc devlink changes
This patchset adjusts mlxsw to recent devlink changes in net-next.
Patch #1 removes a devl_param_driverinit_value_set() call that was
unnecessary, but now additionally triggers a WARN_ON.
Patches #2-#4 are non-functional preparations for the following patches.
Patch #5 fixes a use-after-free that is triggered while changing network
namespaces.
Patch #6 makes mlxsw consistent with netdevsim by having mlxsw register
its devlink instance before its sub-objects. It helps us avoid a warning
described in the commit message.
====================
Ido Schimmel [Mon, 6 Feb 2023 15:39:23 +0000 (16:39 +0100)]
mlxsw: core: Register devlink instance before sub-objects
Recent changes made it possible to register the devlink instance before
its sub-objects and under the instance lock. Among other things, it
allows us to avoid warnings such as this one [1]. The warning is
generated because a buggy firmware is generating a health event during
driver initialization, before the devlink instance is registered.
Move the registration of the devlink instance to the beginning of the
initialization flow to avoid such problems.
A similar change was implemented in netdevsim in commit 82a3aef2e6af
("netdevsim: move devlink registration under the instance lock").
Ido Schimmel [Mon, 6 Feb 2023 15:39:22 +0000 (16:39 +0100)]
mlxsw: spectrum_acl_tcam: Move devlink param to TCAM code
Cited commit added 'DEVLINK_CMD_PARAM_DEL' notifications whenever the
network namespace of the devlink instance is changed. Specifically, the
notifications are generated after calling reload_down(), but before
calling reload_up(). At this stage, the data structures accessed while
reading the value of the "acl_region_rehash_interval" devlink parameter
are uninitialized, resulting in a use-after-free [1].
Fix by moving the registration and unregistration of the devlink
parameter to the TCAM code where it is actually used. This means that
the parameter is unregistered during reload_down() and then
re-registered during reload_up(), avoiding the use-after-free between
these two operations.
Reproducer:
# ip netns add test123
# devlink dev reload pci/0000:06:00.0 netns test123
[1]
BUG: KASAN: use-after-free in mlxsw_sp_acl_tcam_vregion_rehash_intrvl_get+0xb2/0xd0
Read of size 4 at addr ffff888162fd37d8 by task devlink/1323
[...]
Call Trace:
<TASK>
dump_stack_lvl+0x95/0xbd
print_report+0x181/0x4a1
kasan_report+0xdb/0x200
mlxsw_sp_acl_tcam_vregion_rehash_intrvl_get+0xb2/0xd0
mlxsw_sp_params_acl_region_rehash_intrvl_get+0x32/0x80
devlink_nl_param_fill.constprop.0+0x29a/0x11e0
devlink_param_notify.constprop.0+0xb9/0x250
devlink_notify_unregister+0xbc/0x470
devlink_reload+0x1aa/0x440
devlink_nl_cmd_reload+0x559/0x11b0
genl_family_rcv_msg_doit.isra.0+0x1f8/0x2e0
genl_rcv_msg+0x558/0x7f0
netlink_rcv_skb+0x170/0x440
genl_rcv+0x2d/0x40
netlink_unicast+0x53f/0x810
netlink_sendmsg+0x961/0xe80
__sys_sendto+0x2a4/0x420
__x64_sys_sendto+0xe5/0x1c0
do_syscall_64+0x38/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Fixes: 7d7e9169a3ec ("devlink: move devlink reload notifications back in between _down() and _up() calls") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Danielle Ratson [Mon, 6 Feb 2023 15:39:18 +0000 (16:39 +0100)]
mlxsw: spectrum: Remove pointless call to devlink_param_driverinit_value_set()
The "acl_region_rehash_interval" devlink parameter is a "runtime"
parameter, making the call to devl_param_driverinit_value_set()
pointless. Before cited commit the function simply returned an error
(that was not checked), but now it emits a WARNING [1].
Vladimir Oltean [Mon, 6 Feb 2023 09:45:30 +0000 (11:45 +0200)]
net: enetc: add support for MAC Merge layer
Add PF driver support for viewing and changing the MAC Merge sublayer
parameters, and seeing the verification state machine's current state.
The verification handshake with the link partner is driven by hardware.
====================
sched: cpumask: improve on cpumask_local_spread() locality
cpumask_local_spread() currently checks local node for presence of i'th
CPU, and then if it finds nothing makes a flat search among all non-local
CPUs. We can do it better by checking CPUs per NUMA hops.
This has significant performance implications on NUMA machines, for example
when using NUMA-aware allocated memory together with NUMA-aware IRQ
affinity hints.
Performance tests from patch 8 of this series for mellanox network
driver show:
TCP multi-stream, using 16 iperf3 instances pinned to 16 cores (with aRFS on).
Active cores: 64,65,72,73,80,81,88,89,96,97,104,105,112,113,120,121
+-------------------------+-----------+------------------+------------------+
| | BW (Gbps) | TX side CPU util | RX side CPU util |
+-------------------------+-----------+------------------+------------------+
| Baseline | 52.3 | 6.4 % | 17.9 % |
+-------------------------+-----------+------------------+------------------+
| Applied on TX side only | 52.6 | 5.2 % | 18.5 % |
+-------------------------+-----------+------------------+------------------+
| Applied on RX side only | 94.9 | 11.9 % | 27.2 % |
+-------------------------+-----------+------------------+------------------+
| Applied on both sides | 95.1 | 8.4 % | 27.3 % |
+-------------------------+-----------+------------------+------------------+
Bottleneck in RX side is released, reached linerate (~1.8x speedup).
~30% less cpu util on TX.
====================
Yury Norov [Sat, 21 Jan 2023 04:24:36 +0000 (20:24 -0800)]
lib/cpumask: update comment for cpumask_local_spread()
Now that we have an iterator-based alternative for a very common case
of using cpumask_local_spread for all cpus in a row, it's worth to
mention that in comment to cpumask_local_spread().
Tariq Toukan [Sat, 21 Jan 2023 04:24:35 +0000 (20:24 -0800)]
net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints
In the IRQ affinity hints, replace the binary NUMA preference (local /
remote) with the improved for_each_numa_hop_cpu() API that minds the
actual distances, so that remote NUMAs with short distance are preferred
over farther ones.
This has significant performance implications when using NUMA-aware
allocated memory (follow [1] and derivatives for example).
[1]
drivers/net/ethernet/mellanox/mlx5/core/en_main.c :: mlx5e_open_channel()
int cpu = cpumask_first(mlx5_comp_irq_get_affinity_mask(priv->mdev, ix));
Performance tests:
TCP multi-stream, using 16 iperf3 instances pinned to 16 cores (with aRFS on).
Active cores: 64,65,72,73,80,81,88,89,96,97,104,105,112,113,120,121
+-------------------------+-----------+------------------+------------------+
| | BW (Gbps) | TX side CPU util | RX side CPU util |
+-------------------------+-----------+------------------+------------------+
| Baseline | 52.3 | 6.4 % | 17.9 % |
+-------------------------+-----------+------------------+------------------+
| Applied on TX side only | 52.6 | 5.2 % | 18.5 % |
+-------------------------+-----------+------------------+------------------+
| Applied on RX side only | 94.9 | 11.9 % | 27.2 % |
+-------------------------+-----------+------------------+------------------+
| Applied on both sides | 95.1 | 8.4 % | 27.3 % |
+-------------------------+-----------+------------------+------------------+
Bottleneck in RX side is released, reached linerate (~1.8x speedup).
~30% less cpu util on TX.
* CPU util on active cores only.
Setups details (similar for both sides):
NIC: ConnectX6-DX dual port, 100 Gbps each.
Single port used in the tests.
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 16
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7763 64-Core Processor
Stepping: 1
CPU MHz: 2594.804
BogoMIPS: 4890.73
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 32768K
NUMA node0 CPU(s): 0-7,128-135
NUMA node1 CPU(s): 8-15,136-143
NUMA node2 CPU(s): 16-23,144-151
NUMA node3 CPU(s): 24-31,152-159
NUMA node4 CPU(s): 32-39,160-167
NUMA node5 CPU(s): 40-47,168-175
NUMA node6 CPU(s): 48-55,176-183
NUMA node7 CPU(s): 56-63,184-191
NUMA node8 CPU(s): 64-71,192-199
NUMA node9 CPU(s): 72-79,200-207
NUMA node10 CPU(s): 80-87,208-215
NUMA node11 CPU(s): 88-95,216-223
NUMA node12 CPU(s): 96-103,224-231
NUMA node13 CPU(s): 104-111,232-239
NUMA node14 CPU(s): 112-119,240-247
NUMA node15 CPU(s): 120-127,248-255
..
The recently introduced sched_numa_hop_mask() exposes cpumasks of CPUs
reachable within a given distance budget, wrap the logic for iterating over
all (distance, mask) values inside an iterator macro.
Tariq has pointed out that drivers allocating IRQ vectors would benefit
from having smarter NUMA-awareness - cpumask_local_spread() only knows
about the local node and everything outside is in the same bucket.
sched_domains_numa_masks is pretty much what we want to hand out (a cpumask
of CPUs reachable within a given distance budget), introduce
sched_numa_hop_mask() to export those cpumasks.
Now after moving all NUMA logic into sched_numa_find_nth_cpu(),
else-branch of cpumask_local_spread() is just a function call, and
we can simplify logic by using ternary operator.
While here, replace BUG() with WARN_ON().
Signed-off-by: Yury Norov <yury.norov@gmail.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Peter Lafreniere <peter@n8pjl.ca> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Yury Norov [Sat, 21 Jan 2023 04:24:30 +0000 (20:24 -0800)]
sched: add sched_numa_find_nth_cpu()
The function finds Nth set CPU in a given cpumask starting from a given
node.
Leveraging the fact that each hop in sched_domains_numa_masks includes the
same or greater number of CPUs than the previous one, we can use binary
search on hops instead of linear walk, which makes the overall complexity
of O(log n) in terms of number of cpumask_weight() calls.
Signed-off-by: Yury Norov <yury.norov@gmail.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Peter Lafreniere <peter@n8pjl.ca> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Yury Norov [Sat, 21 Jan 2023 04:24:29 +0000 (20:24 -0800)]
cpumask: introduce cpumask_nth_and_andnot
Introduce cpumask_nth_and_andnot() based on find_nth_and_andnot_bit().
It's used in the following patch to traverse cpumasks without storing
intermediate result in temporary cpumask.
Signed-off-by: Yury Norov <yury.norov@gmail.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Peter Lafreniere <peter@n8pjl.ca> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Gal Pressman [Thu, 19 Jan 2023 12:09:00 +0000 (14:09 +0200)]
net/mlx5e: Remove incorrect debugfs_create_dir NULL check in TLS
Remove the NULL check on debugfs_create_dir() return value as the
function returns an ERR pointer on failure, not NULL.
The check is not replaced with a IS_ERR_OR_NULL() as
debugfs_create_file(), and debugfs functions in general don't need error
checking.
Gal Pressman [Thu, 19 Jan 2023 12:09:00 +0000 (14:09 +0200)]
net/mlx5e: Remove incorrect debugfs_create_dir NULL check in hairpin
Remove the NULL check on debugfs_create_dir() return value as the
function returns an ERR pointer on failure, not NULL.
The check is not replaced with a IS_ERR_OR_NULL() as
debugfs_create_file(), and debugfs functions in general don't need error
checking.
Leon Romanovsky [Mon, 9 Jan 2023 12:11:01 +0000 (14:11 +0200)]
net/mlx5e: Don't listen to remove flows event
remove_flow_enable event isn't really needed as it will be
triggered once user and/or XFRM core explicitly asked to delete
state. In such situation, we won't be interested in any event
at all.
So don't trigger and listen to remove_flow_enable event.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Moshe Shemesh [Thu, 26 Jan 2023 06:58:33 +0000 (08:58 +0200)]
net/mlx5: fw reset: Skip device ID check if PCI link up failed
In case where after reset the PCI link is not ready within timeout, skip
reading device ID as if there is no PCI link up we can't have FW
response to pci config cycles either.
This also fixes err value not used and overwritten in such flow.
Shay Drory [Thu, 5 Jan 2023 08:03:46 +0000 (10:03 +0200)]
net/mlx5: Remove redundant health work lock
Commit 90e7cb78b815 ("net/mlx5: fix missing mutex_unlock in
mlx5_fw_fatal_reporter_err_work()") introduced another checking of
MLX5_DROP_HEALTH_NEW_WORK. At this point, the first check of
MLX5_DROP_HEALTH_NEW_WORK is redundant and so is the lock that
protects it.
Remove the lock and rename MLX5_DROP_HEALTH_NEW_WORK to reflect these
changes.
Arnd Bergmann [Tue, 17 Jan 2023 21:01:55 +0000 (22:01 +0100)]
mlx5: reduce stack usage in mlx5_setup_tc
Clang warns about excessive stack usage on 32-bit targets:
drivers/net/ethernet/mellanox/mlx5/core/en_main.c:3597:12: error: stack frame size (1184) exceeds limit (1024) in 'mlx5e_setup_tc' [-Werror,-Wframe-larger-than]
static int mlx5e_setup_tc(struct net_device *dev, enum tc_setup_type type,
It turns out that both the mlx5e_setup_tc_mqprio_dcb() function and
the mlx5e_safe_switch_params() function it calls have a copy of
'struct mlx5e_params' on the stack, and this structure is fairly
large.
- After Kees Cook patches and this one, we might
be able to revert commit dbae2b062824 ("net: skb: introduce and use a single page frag cache")
because GRO_MAX_HEAD is also small.
- This patch is a NOP for CONFIG_SLOB=y builds.
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
this is a pull request of 47 patches for net-next/master.
The first two patch is by Oliver Hartkopp. One adds missing error
checking to the CAN_GW protocol, the other adds a missing CAN address
family check to the CAN ISO TP protocol.
Thomas Kopp contributes a performance optimization to the mcp251xfd
driver.
The next 11 patches are by Geert Uytterhoeven and add support for
R-Car V4H systems to the rcar_canfd driver.
Stephane Grosjean and Lukas Magel contribute 8 patches to the peak_usb
driver, which add support for configurable CAN channel ID.
The last 17 patches are by me and target the CAN bit timing
configuration. The bit timing is cleaned up, error messages are
improved and forwarded to user space via NL_SET_ERR_MSG_FMT() instead
of netdev_err(), and the SJW handling is updated, including the
definition of a new default value that will benefit CAN-FD
controllers, by increasing their oscillator tolerance.
* tag 'linux-can-next-for-6.3-20230206' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next: (47 commits)
can: bittiming: can_validate_bitrate(): report error via netlink
can: bittiming: can_calc_bittiming(): convert from netdev_err() to NL_SET_ERR_MSG_FMT()
can: bittiming: can_calc_bittiming(): clean up SJW handling
can: bittiming: can_sjw_set_default(): use Phase Seg2 / 2 as default for SJW
can: bittiming: can_sjw_check(): check that SJW is not longer than either Phase Buffer Segment
can: bittiming: can_sjw_check(): report error via netlink and harmonize error value
can: bittiming: can_fixup_bittiming(): report error via netlink and harmonize error value
can: bittiming: factor out can_sjw_set_default() and can_sjw_check()
can: bittiming: can_changelink() pass extack down callstack
can: netlink: can_changelink(): convert from netdev_err() to NL_SET_ERR_MSG_FMT()
can: netlink: can_validate(): validate sample point for CAN and CAN-FD
can: dev: register_candev(): bail out if both fixed bit rates and bit timing constants are provided
can: dev: register_candev(): ensure that bittiming const are valid
can: bittiming: can_get_bittiming(): use direct return and remove unneeded else
can: bittiming: can_fixup_bittiming(): set effective tq
can: bittiming: can_fixup_bittiming(): use CAN_SYNC_SEG instead of 1
can: bittiming(): replace open coded variants of can_bit_time()
can: peak_usb: Reorder include directives alphabetically
can: peak_usb: align CAN channel ID format in log with sysfs attribute
can: peak_usb: export PCAN CAN channel ID as sysfs device attribute
...
====================
Vladimir Oltean [Mon, 6 Feb 2023 09:49:32 +0000 (11:49 +0200)]
ethtool: mm: fix get_mm() return code not propagating to user space
If ops->get_mm() returns a non-zero error code, we goto out_complete,
but there, we return 0. Fix that to propagate the "ret" variable to the
caller. If ops->get_mm() succeeds, it will always return 0.
Fixes: 2b30f8291a30 ("net: ethtool: add support for MAC Merge layer") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20230206094932.446379-1-vladimir.oltean@nxp.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Eddy Tao [Sun, 5 Feb 2023 01:35:37 +0000 (09:35 +0800)]
net: openvswitch: reduce cpu_used_mask memory
Use actual CPU number instead of hardcoded value to decide the size
of 'cpu_used_mask' in 'struct sw_flow'. Below is the reason.
'struct cpumask cpu_used_mask' is embedded in struct sw_flow.
Its size is hardcoded to CONFIG_NR_CPUS bits, which can be
8192 by default, it costs memory and slows down ovs_flow_alloc.
To address this:
Redefine cpu_used_mask to pointer.
Append cpumask_size() bytes after 'stat' to hold cpumask.
Initialization cpu_used_mask right after stats_last_writer.
APIs like cpumask_next and cpumask_set_cpu never access bits
beyond cpu count, cpumask_size() bytes of memory is enough.
Arnd Bergmann [Fri, 3 Feb 2023 12:15:36 +0000 (13:15 +0100)]
amd-xgbe: fix mismatched prototype
The forward declaration was introduced with a prototype that does
not match the function definition:
drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c:2166:13: error: conflicting types for 'xgbe_phy_perform_ratechange' due to enum/integer mismatch; have 'void(struct xgbe_prv_data *, enum xgbe_mb_cmd, enum xgbe_mb_subcmd)' [-Werror=enum-int-mismatch]
2166 | static void xgbe_phy_perform_ratechange(struct xgbe_prv_data *pdata,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c:391:13: note: previous declaration of 'xgbe_phy_perform_ratechange' with type 'void(struct xgbe_prv_data *, unsigned int, unsigned int)'
391 | static void xgbe_phy_perform_ratechange(struct xgbe_prv_data *pdata,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
Ideally there should not be any forward declarations here, which
would make it easier to show that there is no unbounded recursion.
I tried fixing this but could not figure out how to avoid the
recursive call.
As a hotfix, address only the broken prototype to fix the build
problem instead.
Fixes: 4f3b20bfbb75 ("amd-xgbe: add support for rx-adaptation") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Simon Horman <simon.horman@corigine.com> Acked-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com> Link: https://lore.kernel.org/r/20230203121553.2871598-1-arnd@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There are no external users of the vsc7514_*_regmap[] symbols or
vsc7514_vcap_* functions. They were exported in commit 32ecd22ba60b ("net:
mscc: ocelot: split register definitions to a separate file") with the
intention of being used, but the actual structure used in commit 2efaca411c96 ("net: mscc: ocelot: expose vsc7514_regmap definition") ended
up being all that was needed.
Bury these unnecessary symbols.
Signed-off-by: Colin Foster <colin.foster@in-advantage.com> Suggested-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Link: https://lore.kernel.org/r/20230204182056.25502-1-colin.foster@in-advantage.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 7 Feb 2023 06:24:33 +0000 (22:24 -0800)]
Merge branch 'aux-bus-v11' of https://github.com/ajitkhaparde1/linux
Ajit Khaparde says:
====================
bnxt: Add Auxiliary driver support
Add auxiliary device driver for Broadcom devices.
The bnxt_en driver will register and initialize an aux device
if RDMA is enabled in the underlying device.
The bnxt_re driver will then probe and initialize the
RoCE interfaces with the infiniband stack.
We got rid of the bnxt_en_ops which the bnxt_re driver used to
communicate with bnxt_en.
Similarly We have tried to clean up most of the bnxt_ulp_ops.
In most of the cases we used the functions and entry points provided
by the auxiliary bus driver framework.
And now these are the minimal functions needed to support the functionality.
We will try to work on getting rid of the remaining if we find any
other viable option in future.
* 'aux-bus-v11' of https://github.com/ajitkhaparde1/linux:
bnxt_en: Remove runtime interrupt vector allocation
RDMA/bnxt_re: Remove the sriov config callback
bnxt_en: Remove struct bnxt access from RoCE driver
bnxt_en: Use auxiliary bus calls over proprietary calls
bnxt_en: Use direct API instead of indirection
bnxt_en: Remove usage of ulp_id
RDMA/bnxt_re: Use auxiliary driver interface
bnxt_en: Add auxiliary driver support
====================
Jacob Keller [Thu, 19 Jan 2023 01:16:53 +0000 (17:16 -0800)]
ice: remove unnecessary virtchnl_ether_addr struct use
The dev_lan_addr and hw_lan_addr members of ice_vf are used only to store
the MAC address for the VF. They are defined using virtchnl_ether_addr, but
only the .addr sub-member is actually used. Drop the use of
virtchnl_ether_addr and just use a u8 array of length [ETH_ALEN].
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Marek Szlosek <marek.szlosek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Jacob Keller [Thu, 19 Jan 2023 01:16:52 +0000 (17:16 -0800)]
ice: introduce .irq_close VF operation
The Scalable IOV implementation will require notifying the VDCM driver when
an IRQ must be closed. This allows the VDCM to handle releasing stale IRQ
context values and properly reconfigure.
To handle this, introduce a new optional .irq_close callback to the VF
operations structure. This will be implemented by Scalable IOV to handle
the shutdown of the IRQ context.
Since the SR-IOV implementation does not need this, we must check that its
non-NULL before calling it.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Marek Szlosek <marek.szlosek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Jacob Keller [Thu, 19 Jan 2023 01:16:51 +0000 (17:16 -0800)]
ice: introduce clear_reset_state operation
When hardware is reset, the VF relies on the VFGEN_RSTAT register to detect
when the VF is finished resetting. This is a tri-state register where 0
indicates a reset is in progress, 1 indicates the hardware is done
resetting, and 2 indicates that the software is done resetting.
Currently the PF driver relies on the device hardware resetting VFGEN_RSTAT
when a global reset occurs. This works ok, but it does mean that the VF
might not immediately notice a reset when the driver first detects that the
global reset is occurring.
This is also problematic for Scalable IOV, because there is no read/write
equivalent VFGEN_RSTAT register for the Scalable VSI type. Instead, the
Scalable IOV VFs will need to emulate this register.
To support this, introduce a new VF operation, clear_reset_state, which is
called when the PF driver first detects a global reset. The Single Root IOV
implementation can just write to VFGEN_RSTAT to ensure it's cleared
immediately, without waiting for the actual hardware reset to begin. The
Scalable IOV implementation will use this as part of its tracking of the
reset status to allow properly reporting the emulated VFGEN_RSTAT to the VF
driver.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Tested-by: Marek Szlosek <marek.szlosek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Jacob Keller [Thu, 19 Jan 2023 01:16:50 +0000 (17:16 -0800)]
ice: convert vf_ops .vsi_rebuild to .create_vsi
The .vsi_rebuild function exists for ice_reset_vf. It is used to release
and re-create the VSI during a single-VF reset.
This function is only called when we need to re-create the VSI, and not
when rebuilding an existing VSI. This makes the single-VF reset process
different from the process used to restore functionality after a
hardware reset such as the PF reset or EMP reset.
When we add support for Scalable IOV VFs, the implementation will be very
similar. The primary difference will be in the fact that each VF type uses
a different underlying VSI type in hardware.
Move the common functionality into a new ice_vf_recreate VSI function. This
will allow the two IOV paths to share this functionality. Rework the
.vsi_rebuild vf_op into .create_vsi, only performing the task of creating a
new VSI.
This creates a nice dichotomy between the ice_vf_rebuild_vsi and
ice_vf_recreate_vsi, and should make it more clear why the two flows atre
distinct.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Marek Szlosek <marek.szlosek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Jacob Keller [Thu, 19 Jan 2023 01:16:49 +0000 (17:16 -0800)]
ice: introduce ice_vf_init_host_cfg function
Introduce a new generic helper ice_vf_init_host_cfg which performs common
host configuration initialization tasks that will need to be done for both
Single Root IOV and the new Scalable IOV implementation.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Marek Szlosek <marek.szlosek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Jacob Keller [Thu, 19 Jan 2023 01:16:48 +0000 (17:16 -0800)]
ice: add a function to initialize vf entry
Some of the initialization code for Single Root IOV VFs will need to be
reused when we introduce Scalable IOV. Pull this code out into a new
ice_initialize_vf_entry helper function.
Co-developed-by: Harshitha Ramamurthy <harshitha.ramamurthy@intel.com> Signed-off-by: Harshitha Ramamurthy <harshitha.ramamurthy@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Marek Szlosek <marek.szlosek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Jacob Keller [Thu, 19 Jan 2023 01:16:47 +0000 (17:16 -0800)]
ice: Pull common tasks into ice_vf_post_vsi_rebuild
The Single Root IOV implementation of .post_vsi_rebuild performs some tasks
that will ultimately need to be shared with the Scalable IOV implementation
such as rebuilding the host configuration.
Refactor by introducing a new wrapper function, ice_vf_post_vsi_rebuild
which performs the tasks that will be shared between SR-IOV and Scalable
IOV. Move the ice_vf_rebuild_host_cfg and ice_vf_set_initialized calls into
this wrapper. Then call the implementation specific post_vsi_rebuild
handler afterwards.
This ensures that we will properly re-initialize filters and expected
settings for both SR-IOV and Scalable IOV.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Marek Szlosek <marek.szlosek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Jacob Keller [Thu, 19 Jan 2023 01:16:46 +0000 (17:16 -0800)]
ice: move ice_vf_vsi_release into ice_vf_lib.c
The ice_vf_vsi_release function will be used in a future change to
refactor the .vsi_rebuild function. Move this over to ice_vf_lib.c so
that it can be used there.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Marek Szlosek <marek.szlosek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Jacob Keller [Thu, 19 Jan 2023 01:16:44 +0000 (17:16 -0800)]
ice: move vsi_type assignment from ice_vsi_alloc to ice_vsi_cfg
The ice_vsi_alloc and ice_vsi_cfg functions are used together to allocate
and configure a new VSI, called as part of the ice_vsi_setup function.
In the future with the addition of the subfunction code the ice driver
will want to be able to allocate a VSI while delaying the configuration to
a later point of the port activation.
Currently this requires that the port code know what type of VSI should
be allocated. This is required because ice_vsi_alloc assigns the VSI type.
Refactor the ice_vsi_alloc and ice_vsi_cfg functions so that VSI type
assignment isn't done until the configuration stage. This will allow the
devlink port addition logic to reserve a VSI as early as possible before
the type of the port is known. In this way, the port add can fail in the
event that all hardware VSI resources are exhausted.
Since the ice_vsi_cfg function already takes the ice_vsi_cfg_params
structure, this is relatively straight forward.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Jacob Keller [Thu, 19 Jan 2023 01:16:43 +0000 (17:16 -0800)]
ice: refactor VSI setup to use parameter structure
The ice_vsi_setup function, ice_vsi_alloc, and ice_vsi_cfg functions have
grown a large number of parameters. These parameters are used to initialize
a new VSI, as well as re-configure an existing VSI
Any time we want to add a new parameter to this function chain, even if it
will usually be unset, we have to change many call sites due to changing
the function signature.
A future change is going to refactor ice_vsi_alloc and ice_vsi_cfg to move
the VSI configuration and initialization all into ice_vsi_cfg.
Before this, refactor the VSI setup flow to use a new ice_vsi_cfg_params
structure. This will contain the configuration (mainly pointers) used to
initialize a VSI.
Pass this from ice_vsi_setup into the related functions such as
ice_vsi_alloc, ice_vsi_cfg, and ice_vsi_cfg_def.
Introduce a helper, ice_vsi_to_params to convert an existing VSI to the
parameters used to initialize it. This will aid in the flows where we
rebuild an existing VSI.
Since we also pass the ICE_VSI_FLAG_INIT to more functions which do not
need (or cannot yet have) the VSI parameters, lets make this clear by
renaming the function parameter to vsi_flags and using a u32 instead of a
signed integer. The name vsi_flags also makes it clear that we may extend
the flags in the future.
This change will make it easier to refactor the setup flow in the future,
and will reduce the complexity required to add a new parameter for
configuration in the future.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Jacob Keller [Thu, 19 Jan 2023 01:16:42 +0000 (17:16 -0800)]
ice: drop unnecessary VF parameter from several VSI functions
The vsi->vf pointer gets assigned early on during ice_vsi_alloc. Several
functions currently take a VF pointer, but they can just use the existing
vsi->vf pointer as needed. Modify these functions to drop the unnecessary
VF parameter.
Note that ice_vsi_cfg is not changed as a following change will refactor so
that the VF pointer is assigned during ice_vsi_cfg rather than
ice_vsi_alloc.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Marek Szlosek <marek.szlosek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Jacob Keller [Thu, 19 Jan 2023 01:16:41 +0000 (17:16 -0800)]
ice: fix function comment referring to ice_vsi_alloc
Since commit 1d2e32275de7 ("ice: split ice_vsi_setup into smaller
functions") ice_vsi_alloc has not been responsible for all of the behavior
implied by the comment for ice_vsi_setup_vector_base.
Fix the comment to refer to the new function ice_vsi_alloc_def().
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Brett Creeley [Thu, 1 Dec 2022 11:58:59 +0000 (12:58 +0100)]
ice: Add more usage of existing function ice_get_vf_vsi(vf)
Extend the usage of function ice_get_vf_vsi(vf) in multiple places
instead of VF's VSI by using a long string of dereferences
(i.e. vf->pf->vsi[vf->lan_vsi_idx]).
Signed-off-by: Brett Creeley <brett.creeley@intel.com> Signed-off-by: Kalyan Kodamagula <kalyan.kodamagula@intel.com> Tested-by: Piotr Tyda <piotr.tyda@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Merge patch series "can: bittiming: cleanups and rework SJW handling"
Marc Kleine-Budde <mkl@pengutronix.de> says:
several people noticed that on modern CAN controllers with wide bit
timing registers the default SJW of 1 can result in unstable or no
synchronization to the CAN network. See Patch 14/17 for details.
During review of v1 Vincent pointed out that the original code and the
series doesn't always check user provided bit timing parameters,
sometimes silently limits them and the return error values are not
consistent.
This series first cleans up some code in bittiming.c, replacing
open-coded variants by macros or functions (Patches 1, 2).
Patch 3 adds the missing assignment of the effective TQ if the
interface is configured with low level timing parameters.
Patch 4 is another code cleanup.
Patches 5, 6 check the bit timing parameter during interface
registration.
Patch 7 adds a validation of the sample point.
The patches 8-13 convert the error messages from netdev_err() to
NL_SET_ERR_MSG_FMT, factor out the SJW handling from
can_fixup_bittiming(), add checking and error messages for the
individual limits and harmonize the error return values.
Patch 14 changes the default SJW value from 1 to min(Phase Seg1, Phase
Seg2 / 2).
Patch 15 switches can_calc_bittiming() to use the new SJW handling.
Patch 16 converts can_calc_bittiming() to NL_SET_ERR_MSG_FMT().
And patch 16 adds a NL_SET_ERR_MSG_FMT() error message to
can_validate_bitrate().
can: bittiming: can_calc_bittiming(): convert from netdev_err() to NL_SET_ERR_MSG_FMT()
Replace the netdev_err() by NL_SET_ERR_MSG_FMT() to better inform the
user about the problem. While there, use %u to print unsigned values
and improve error message a bit.
In case of an error, return -EINVAL instead of -EDOM, this corresponds
better to the actual meaning of the error value.
can: bittiming: can_calc_bittiming(): clean up SJW handling
In the current code, if the user configures a bitrate, a default SJW
value of 1 is used. If the user configures both a bitrate and a SJW
value, can_calc_bittiming() silently limits the SJW value to SJW max
and TSEG2.
We came to the conclusion that if the user provided an invalid SJW
value, it's best to bail out and inform the user [1].
Further the ISO 11898-1:2015 standard mandates that "SJW shall be less
than or equal to the minimum of these two items: Phase_Seg1 and
Phase_Seg2." [2] The current code is missing that check.
The previous patches introduced
1) can_sjw_set_default() - sets a default value for SJW if unset
2) can_sjw_check() - implements a SJW check against SJW max, Phase
Seg1 and Phase Seg2. In the error case this function reports the error
to user space via netlink.
Replace both the open-coded SJW default setting and the open-coded and
insufficient checks of SJW with the helper functions
can_sjw_set_default() and can_sjw_check().
can: bittiming: can_sjw_set_default(): use Phase Seg2 / 2 as default for SJW
"The (Re-)Synchronization Jump Width (SJW) defines how far a
resynchronization may move the Sample Point inside the limits defined
by the Phase Buffer Segments to compensate for edge phase errors." [1]
In other words, this means that the SJW parameter controls the
tolerance of the CAN controller to frequency errors compared to other
CAN controllers.
If the user space does not provide an SJW parameter, the kernel
chooses a default value of 1. This has proven to be a good default
value for classic CAN controllers, but no longer for modern CAN-FD
controllers.
In the past there were CAN controllers like the sja1000 with a rather
limited range of bit timing parameters. For the standard bit rates
this results in the following bit timing parameters:
The attentive reader will notice that the SJW is 1 in most cases,
while the Seg2 phase is 2. Both values are given in TQ units, which in
turn is a duration in nanoseconds.
The TQ is 25ns, the Phase Seg 2 is "10" (== 250ns), the SJW is "1" (==
25ns).
Since the kernel chooses a default SJW of 1 regardless of the TQ, this
leads to a much smaller SJW and thus much smaller tolerances to
frequency errors.
To maintain the same oscillator tolerances on controllers with wide
bit timing registers, select a default SJW value of Phase Seg2 / 2
unless Phase Seg 1 is less. This results in the following bit timing
parameters: