Dmitry Bezrukov [Tue, 22 Oct 2019 09:53:38 +0000 (09:53 +0000)]
net: aquantia: rx filters for ptp
We implement HW filter reservation for PTP traffic. Special location
in filters table is marked as reserved, because incoming ptp traffic
should be directed only to PTP designated queue. This way HW will do PTP
timestamping and proper processing.
Co-developed-by: Egor Pomozov <epomozov@marvell.com> Signed-off-by: Egor Pomozov <epomozov@marvell.com> Co-developed-by: Sergey Samoilenko <sergey.samoilenko@aquantia.com> Signed-off-by: Sergey Samoilenko <sergey.samoilenko@aquantia.com> Signed-off-by: Dmitry Bezrukov <dmitry.bezrukov@aquantia.com> Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry Bezrukov [Tue, 22 Oct 2019 09:53:32 +0000 (09:53 +0000)]
net: aquantia: styling fixes on ptp related functions
Checkpatch and styling fixes on parts of code touched by ptp
Signed-off-by: Dmitry Bezrukov <dmitry.bezrukov@aquantia.com> Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Egor Pomozov [Tue, 22 Oct 2019 09:53:27 +0000 (09:53 +0000)]
net: aquantia: add basic ptp_clock callbacks
Basic HW functions implemented for adjusting frequency,
adjusting time, getting and setting time.
With these callbacks we now do register ptp clock in the system.
Firmware interface parts are defined for PTP requests and interactions.
Enable/disable PTP counters in HW on clock register/unregister.
Signed-off-by: Egor Pomozov <epomozov@marvell.com> Co-developed-by: Sergey Samoilenko <sergey.samoilenko@aquantia.com> Signed-off-by: Sergey Samoilenko <sergey.samoilenko@aquantia.com> Co-developed-by: Dmitry Bezrukov <dmitry.bezrukov@aquantia.com> Signed-off-by: Dmitry Bezrukov <dmitry.bezrukov@aquantia.com> Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry Bezrukov [Tue, 22 Oct 2019 09:53:25 +0000 (09:53 +0000)]
net: aquantia: unify styling of bit enums
Make some other bit-enums more clear about positioning,
this helps on debugging and development
Signed-off-by: Dmitry Bezrukov <dmitry.bezrukov@aquantia.com> Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Egor Pomozov [Tue, 22 Oct 2019 09:53:22 +0000 (09:53 +0000)]
net: aquantia: PTP skeleton declarations and callbacks
Here we add basic function for PTP clock register/unregister.
We also declare FW/HW capability bits used to control PTP feature on device.
PTP device is created if network card has appropriate FW that has PTP
enabled in config. HW supports timestamping for PTPv2 802.AS1 and
PTPv2 IPv4 UDP packets.
It also supports basic PTP callbacks for getting/setting time, adjusting
frequency and time as well.
Signed-off-by: Egor Pomozov <epomozov@marvell.com> Co-developed-by: Sergey Samoilenko <sergey.samoilenko@aquantia.com> Signed-off-by: Sergey Samoilenko <sergey.samoilenko@aquantia.com> Co-developed-by: Dmitry Bezrukov <dmitry.bezrukov@aquantia.com> Signed-off-by: Dmitry Bezrukov <dmitry.bezrukov@aquantia.com> Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
mlxsw: Update main pool computation and pool size limits
Petr says:
In Spectrum ASICs, the shared buffer is an area of memory where packets are
kept until they can be transmitted. There are two resources associated with
shared buffer size: cap_total_buffer_size and cap_guaranteed_shared_buffer.
So far, mlxsw has been using the former as a limit when validating shared
buffer pool size configuration. However, the total size also includes
headrooms and reserved space, which really cannot be used for shared buffer
pools. Patch #1 mends this and has mlxsw use the guaranteed size.
To configure default pool sizes, mlxsw has historically hard-coded one or
two smallish pools, and one "main" pool that took most of the shared buffer
(that would be pool 0 on ingress and pool 4 on egress). During the
development of Spectrum-2, it became clear that the shared buffer size
keeps shrinking as bugs are identified and worked around. In order to
prevent having to tweak the size of pools 0 and 4 to catch up with updates
to values reported by the FW, patch #2 changes the way these pools are set.
Instead of hard-coding a fixed value, the main pool now takes whatever is
left from the guaranteed size after the smaller pool(s) are taken into
account.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 23 Oct 2019 06:05:00 +0000 (09:05 +0300)]
mlxsw: spectrum_buffers: Calculate the size of the main pool
Instead of hard-coding the size of the largest pool, calculate it from the
reported guaranteed shared buffer size and sizes of other pools (currently
only the CPU port pool).
Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 23 Oct 2019 06:04:59 +0000 (09:04 +0300)]
mlxsw: spectrum: Use guaranteed buffer size as pool size limit
There are two resources associated with shared buffer size:
cap_total_buffer_size, and cap_guaranteed_shared_buffer. So far, mlxsw has
been using the former as a limit to determine how large a pool size is
allowed to be. However, the total size also includes headrooms and reserved
space, which really cannot be used for shared buffer pools.
Therefore convert mlxsw to use the latter resource as a limit. Adjust
hard-coded pool sizes to be the guaranteed size minus 256000 bytes for CPU
port pool. On Spectrum-1 that actually leads to an increase. A follow-up
patch will have this size calculated automatically.
Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Tue, 22 Oct 2019 19:30:57 +0000 (21:30 +0200)]
r8169: never set PCI_EXP_DEVCTL_NOSNOOP_EN
Setting PCI_EXP_DEVCTL_NOSNOOP_EN for certain chip versions had been
added to the vendor driver more than 10 years ago, and copied from
there to r8169. It has been removed from the vendor driver meanwhile
and I think we can safely remove this too.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
net: phy: support 1000Base-X auto-negotiation for BCM54616S
This patch series aims at supporting auto negotiation when BCM54616S is
running in 1000Base-X mode: without the patch series, BCM54616S PHY driver
would report incorrect link speed in 1000Base-X mode.
Patch #1 (of 3) modifies assignment to OR when dealing with dev_flags in
phy_attach_direct function, so that dev_flags updated in BCM54616S PHY's
probe callback won't be lost.
Patch #2 (of 3) adds several genphy_c37_* functions to support clause 37
1000Base-X auto-negotiation, and these functions are called in BCM54616S
PHY driver.
Patch #3 (of 3) detects BCM54616S PHY's operation mode and calls according
genphy_c37_* functions to configure auto-negotiation and parse link
attributes (speed, duplex, and etc.) in 1000Base-X mode.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Tao Ren [Tue, 22 Oct 2019 18:31:08 +0000 (11:31 -0700)]
net: phy: broadcom: add 1000Base-X support for BCM54616S
The BCM54616S PHY cannot work properly in RGMII->1000Base-X mode, mainly
because genphy functions are designed for copper links, and 1000Base-X
(clause 37) auto negotiation needs to be handled differently.
This patch enables 1000Base-X support for BCM54616S by customizing 3
driver callbacks, and it's verified to be working on Facebook CMM BMC
platform (RGMII->1000Base-KX):
- probe: probe callback detects PHY's operation mode based on
INTERF_SEL[1:0] pins and 1000X/100FX selection bit in SerDES 100-FX
Control register.
- config_aneg: calls genphy_c37_config_aneg when the PHY is running in
1000Base-X mode; otherwise, genphy_config_aneg will be called.
- read_status: calls genphy_c37_read_status when the PHY is running in
1000Base-X mode; otherwise, genphy_read_status will be called.
Note: BCM54616S PHY can also be configured in RGMII->100Base-FX mode, and
100Base-FX support is not available as of now.
Signed-off-by: Tao Ren <taoren@fb.com> Acked-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tao Ren [Tue, 22 Oct 2019 18:31:06 +0000 (11:31 -0700)]
net: phy: modify assignment to OR for dev_flags in phy_attach_direct
Modify the assignment to OR when dealing with phydev->dev_flags in
phy_attach_direct function, and this is to make sure dev_flags set in
driver's probe callback won't be lost.
Suggested-by: Andrew Lunn <andrew@lunn.ch> CC: Heiner Kallweit <hkallweit1@gmail.com> CC: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Tao Ren <taoren@fb.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
The dsa_switch structure represents the physical switch device itself,
and is allocated by the driver. The dsa_switch_tree and dsa_port structures
represent the logical switch fabric (eventually composed of multiple switch
devices) and its ports, and are allocated by the DSA core.
This branch lists the logical ports directly in the fabric which simplifies
the iteration over all ports when assigning the default CPU port or configuring
the D in DSA in drivers like mv88e6xxx.
This also removes the unique dst->cpu_dp pointer and is a first step towards
supporting multiple CPU ports and dropping the DSA_MAX_PORTS limitation.
Because the dsa_port structures are not tied to the dsa_switch structure
anymore, we do not need to provide an helper for the drivers to allocate a
switch structure. Like in many other subsystems, drivers can now embed their
dsa_switch structure as they wish into their private structure. This will
be particularly interesting for the Broadcom drivers which were currently
limited by the dynamically allocated array of DSA ports.
The series implements the list of dsa_port structures, makes use of it,
then drops dst->cpu_dp and the dsa_switch_alloc helper.
====================
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Vivien Didelot [Mon, 21 Oct 2019 20:51:30 +0000 (16:51 -0400)]
net: dsa: remove dsa_switch_alloc helper
Now that ports are dynamically listed in the fabric, there is no need
to provide a special helper to allocate the dsa_switch structure. This
will give more flexibility to drivers to embed this structure as they
wish in their private structure.
Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Vivien Didelot [Mon, 21 Oct 2019 20:51:28 +0000 (16:51 -0400)]
net: dsa: sja1105: register switch before assigning port private data
Like the dsa_switch_tree structures, the dsa_port structures will be
allocated on switch registration.
The SJA1105 driver is the only one accessing the dsa_port structure
after the switch allocation and before the switch registration.
For that reason, move switch registration prior to assigning the priv
member of the dsa_port structures.
Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Vivien Didelot [Mon, 21 Oct 2019 20:51:27 +0000 (16:51 -0400)]
net: dsa: mv88e6xxx: use ports list to map bridge
Instead of digging into the other dsa_switch structures of the fabric
and relying too much on the dsa_to_port helper, use the new list
of switch fabric ports to remap the Port VLAN Map of local bridge
group members or remap the Port VLAN Table entry of external bridge
group members.
Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Vivien Didelot [Mon, 21 Oct 2019 20:51:26 +0000 (16:51 -0400)]
net: dsa: mv88e6xxx: use ports list to map port VLAN
Instead of digging into the other dsa_switch structures of the fabric
and relying too much on the dsa_to_port helper, use the new list of
switch fabric ports to define the mask of the local ports allowed to
receive frames from another port of the fabric.
Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Vivien Didelot [Mon, 21 Oct 2019 20:51:16 +0000 (16:51 -0400)]
net: dsa: add ports list in the switch fabric
Add a list of switch ports within the switch fabric. This will help the
lookup of a port inside the whole fabric, and it is the first step
towards supporting multiple CPU ports, before deprecating the usage of
the unique dst->cpu_dp pointer.
In preparation for a future allocation of the dsa_port structures,
return -ENOMEM in case no structure is returned, even though this
error cannot be reached yet.
Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
====================
The attempt to improve performance by changing the PCIe max read request
size was added in the vendor driver more than 10 years back and copied
to r8169 driver. In the vendor driver this has been removed long ago.
Obviously it had no effect, also in my tests I didn't see any
difference. Typically the max payload size is less than 512 bytes
anyway, and the PCI core takes care that the maximum supported value
is set. So let's remove fiddling with PCIe max read request size from
r8169 too. This change allows to simplify the driver in the subsequent
three patches of this series.
====================
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Heiner Kallweit [Mon, 21 Oct 2019 19:24:15 +0000 (21:24 +0200)]
r8169: remove rtl_hw_start_8168bef
We can remove rtl_hw_start_8168bef() and use rtl_hw_start_8168b()
instead because setting register Config4 is done in
rtl_jumbo_config(), being called from rtl_hw_start().
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Heiner Kallweit [Mon, 21 Oct 2019 19:22:42 +0000 (21:22 +0200)]
r8169: simplify setting PCI_EXP_DEVCTL_NOSNOOP_EN
r8168b_0_hw_jumbo_enable() and r8168b_0_hw_jumbo_disable() both do the
same and just set PCI_EXP_DEVCTL_NOSNOOP_EN. We can simplify the code
by moving this setting for RTL8168B to rtl_hw_start_8168().
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Heiner Kallweit [Mon, 21 Oct 2019 19:22:07 +0000 (21:22 +0200)]
r8169: remove fiddling with the PCIe max read request size
The attempt to improve performance by changing the PCIe max read request
size was added in the vendor driver more than 10 years back and copied
to r8169 driver. In the vendor driver this has been removed long ago.
Obviously it had no effect, also in my tests I didn't see any
difference. Typically the max payload size is less than 512 bytes
anyway, and the PCI core takes care that the maximum supported value
is set. So let's remove fiddling with PCIe max read request size from
r8169 too.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Ursula Braun [Mon, 21 Oct 2019 14:13:15 +0000 (16:13 +0200)]
net/smc: remove close abort worker
With the introduction of the link group termination worker there is
no longer a need to postpone smc_close_active_abort() to a worker.
To protect socket destruction due to normal and abnormal socket
closing, the socket refcount is increased.
Ursula Braun [Mon, 21 Oct 2019 14:13:13 +0000 (16:13 +0200)]
net/smc: improve abnormal termination of link groups
If a link group and its connections must be terminated,
* wake up socket waiters
* do not enable buffer reuse
A linkgroup might be terminated while normal connection closing
is running. Avoid buffer reuse and its related LLC DELETE RKEY
call, if linkgroup termination has started. And use the earliest
indication of linkgroup termination possible, namely the removal
from the linkgroup list.
Ursula Braun [Mon, 21 Oct 2019 14:13:12 +0000 (16:13 +0200)]
net/smc: tell peers about abnormal link group termination
There are lots of link group termination scenarios. Most of them
still allow to inform the peer of the terminating sockets about aborting.
This patch tries to call smc_close_abort() for terminating sockets.
And the internal TCP socket is reset with tcp_abort().
Ursula Braun [Mon, 21 Oct 2019 14:13:11 +0000 (16:13 +0200)]
net/smc: improve link group freeing
Usually link groups are freed delayed to enable quick connection
creation for a follow-on SMC socket. Terminated link groups are
freed faster. This patch makes sure, fast schedule of link group
freeing is not rescheduled by a delayed schedule. And it makes sure
link group freeing is not rescheduled, if the real freeing is already
running.
Ursula Braun [Mon, 21 Oct 2019 14:13:10 +0000 (16:13 +0200)]
net/smc: improve abnormal termination locking
Locking hierarchy requires that the link group conns_lock can be
taken if the socket lock is held, but not vice versa. Nevertheless
socket termination during abnormal link group termination should
be protected by the socket lock.
This patch reduces the time segments the link group conns_lock is
held to enable usage of lock_sock in smc_lgr_terminate().
Ursula Braun [Mon, 21 Oct 2019 14:13:09 +0000 (16:13 +0200)]
net/smc: terminate link group without holding lgr lock
When a link group is to be terminated, it is sufficient to hold
the lgr lock when unlinking the link group from its list.
Move the lock-protected link group unlinking into smc_lgr_terminate().
Ursula Braun [Mon, 21 Oct 2019 14:13:08 +0000 (16:13 +0200)]
net/smc: cancel send and receive for terminated socket
The resources for a terminated socket are being cleaned up.
This patch makes sure
* no more data is received for an actively terminated socket
* no more data is sent for an actively or passively terminated socket
Jakub Kicinski [Tue, 22 Oct 2019 17:31:54 +0000 (10:31 -0700)]
Merge branch 'mlxsw-core-extend-qsfp-eeprom-size'
Ido Schimmel says:
====================
Vadim says:
This patch set extends the size of QSFP EEPROM for the cable types
SSF-8436 and SFF-8636 from 256 bytes to 640 bytes. This allows ethtool
to show correct information for these cable types (more details below).
Patch #1 adds a macro that computes the EEPROM page number from the
provided offset specified in the request.
Patch #2 teaches the driver to access the information stored in the
upper pages of the QSFP memory map.
Details and examples:
SFF-8436 specification defines pages 0, 1, 2 and 3. Page 0 contains
lower memory page offsets (from 0x00 to 0x7f) and upper page offsets
(from 0x80 to 0xfe). Upper pages 1, 2 and 3 are optional and can be
empty.
Page 1 is provided if upper page 0 byte 0xc3 bit 6 is set.
Page 2 is provided if upper page 0 byte 0xc3 bit 7 is set.
Page 3 is provided if lower page 0 byte 0x02 bit 2 is cleared.
Offset 0xc3 for the upper page is provided as 0x43 = 0xc3 - 0x80.
As a result of exposing 256 bytes only, ethtool shows wrong information
for pages 1, 2 and 3. In the below hex dump from ethtool for a cable
compliant to SFF-8636 specification, it can be seen that EEPROM of this
device contains optical diagnostic page (lower page 0 byte 0x02 bit 2 is
cleared), but it is not exposed, as the length defined for this type is
256 bytes.
After changing the length returned by get_module_info() callback from
256 bytes to 640 bytes, the upper pages 1, 2 and 3 are exposed by
ethtool. In the below hex dump from the same cable it can be seen that
the optical diagnostic page (page 3, from offset 0x0200) has non-zero
data.
And 'ethtool -m sfp42' shows the real values for the below fields, while
before it exposed zeros for these fields:
Laser bias current high alarm threshold : 8.500 mA
Laser bias current low alarm threshold : 5.492 mA
Laser bias current high warning threshold : 8.000 mA
Laser bias current low warning threshold : 6.000 mA
Laser output power high alarm threshold : 3.4673 mW / 5.40 dBm
Laser output power low alarm threshold : 0.0724 mW / -11.40 dBm
Laser output power high warning threshold : 1.7378 mW / 2.40 dBm
Laser output power low warning threshold : 0.1445 mW / -8.40 dBm
Module temperature high alarm threshold : 80.00 degrees C / 176.00 F
Module temperature low alarm threshold : -10.00 degrees C / 14.00 F
Module temperature high warning threshold : 70.00 degrees C / 158.00 F
Module temperature low warning threshold : 0.00 degrees C / 32.00 F
Module voltage high alarm threshold : 3.5000 V
Module voltage low alarm threshold : 3.1000 V
====================
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Vadim Pasternak [Mon, 21 Oct 2019 10:30:31 +0000 (13:30 +0300)]
mlxsw: core: Extend QSFP EEPROM size for ethtool
Extend the size of QSFP EEPROM for the cable types SSF8436 and SFF8636
from 256 to 640 bytes in order to expose all the EEPROM pages by
ethtool.
For SFF-8636 and SFF-8436 specifications, the driver exposes 256 bytes
of data for ethtool's get_module_eeprom() callback. This is because the
driver uses the below defines to specify SFF module length in ethtool's
get_module_info() callback:
'ETH_MODULE_SFF_8636_LEN' and 'ETH_MODULE_SFF_8436_LEN' (both are 256).
As a result of exposing 256 bytes only, ethtool shows wrong "zero" info
for pages 1, 2, 3.
The patch changes the length returned by callback for get_module_info()
to the values from the next defines: 'ETH_MODULE_SFF_8636_MAX_LEN' and
'ETH_MODULE_SFF_8436_MAX_LEN' (both are 640) to allow exposing of upper
page 1, 2 and 3.
Juergen Gross [Mon, 21 Oct 2019 05:30:52 +0000 (07:30 +0200)]
xen/netback: cleanup init and deinit code
Do some cleanup of the netback init and deinit code:
- add an omnipotent queue deinit function usable from
xenvif_disconnect_data() and the error path of xenvif_connect_data()
- only install the irq handlers after initializing all relevant items
(especially the kthreads related to the queue)
- there is no need to use get_task_struct() after creating a kthread
and using put_task_struct() again after having stopped it.
- use kthread_run() instead of kthread_create() to spare the call of
wake_up_process().
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Paul Durrant <pdurrant@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Jakub Kicinski [Tue, 22 Oct 2019 03:16:12 +0000 (20:16 -0700)]
Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:
====================
1GbE Intel Wired LAN Driver Updates 2019-10-21
This series contains updates to e1000e and igc only.
Sasha adds stream control transmission protocol (SCTP) CRC checksum
support for igc. Also added S0ix support to the e1000e driver. Then
added multicast support by adding the address list to the MTA table and
providing the option for IPv6 address for igc. In addition, added
receive checksum support to igc as well. Lastly, cleaned up some code
that was not fully implemented yet for the VLAN filter table array.
v2: Dropped patch 1 & 2 from the original series. Patch 1 is being sent
to 'net' tree as a fix and patch 2 implementation needs to be
re-worked. Updated the patch to add support for S0ix to fix the
reverse Xmas tree issues and made the entry/exit functions void
since they constantly returned success. All based on community
feedback.
v3: Cleaned up patch 4 of the series based on feedback from the
community. Cleaned up a stray comma in a code comment and removed
the 'inline' of a function that would be inlined by the compiler
anyways.
====================
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
====================
net: phy: marvell: support downshift as PHY tunable
So far downshift is implemented for one small use case only and can't
be controlled from userspace. So let's implement this feature properly
as a PHY tunable so that it can be controlled via ethtool.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Sat, 19 Oct 2019 13:58:19 +0000 (15:58 +0200)]
net: phy: marvell: remove superseded function marvell_set_downshift
Instead of superseded function marvell_set_downshift() we can use new
function m88e1111_set_downshift() in m88e1116r_config_init().
For this m88e1116r_config_init() has to be moved in the code.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Sat, 19 Oct 2019 13:57:33 +0000 (15:57 +0200)]
net: phy: marvell: support downshift as PHY tunable
So far downshift is implemented for one small use case only and can't
be controlled from userspace. So let's implement this feature properly
as a PHY tunable so that it can be controlled via ethtool.
More Marvell PHY's may support downshift, but I restricted it for now
to the ones where I have the datasheet.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 21 Oct 2019 17:36:08 +0000 (10:36 -0700)]
Merge branch 'mvneta-xdp'
Lorenzo Bianconi says:
====================
add XDP support to mvneta driver
Add XDP support to mvneta driver for devices that rely on software
buffer management. Supported verdicts are:
- XDP_DROP
- XDP_PASS
- XDP_REDIRECT
- XDP_TX
Moreover set ndo_xdp_xmit net_device_ops function pointer in order
to support redirecting from other device (e.g. virtio-net).
Convert mvneta driver to page_pool API.
This series is based on previous work done by Jesper and Ilias.
We will send follow-up patches to reduce DMA-sync operations.
Changes since v4:
- reset page_pool pointer to NULL in mvneta_rxq_drop_pkts and in
mvneta_create_page_pool error path
- move dma sync in mvneta_rx_refill() in patch 2/7
- verify bpf prog pointer in mvneta_xdp_setup to double-check if
stop/start is really necessary
- coding style fixes
Changes since v3:
- rename MVNETA_XDP_CONSUMED in MVNETA_XDP_DROPPED
- squash patch 4/8 and patch 3/8
- fix dma sync for XDP_TX verdict
- fix queue_index in xdp_rxq_info_reg
- cosmetics
Changes since v2:
- rely on page_pool_recycle_direct instead of xdp_return_buff for XDP_DROP
- define xdp buffer in mvneta_rx_swbm and avoid default initializations
- use dma_sync_single_for_cpu instead of dma_sync_single_range_for_cpu
- run page_pool_release_page in mvneta_swbm_add_rx_fragment even if
the buffer contains just ETH_FCS
Changes since v1:
- sync dma buffers before refilling hw queues
- fix stats accounting
Changes since RFC:
- implement XDP_TX
- make tx pending buffer list agnostic
- code refactoring
- check if device is running in mvneta_xdp_setup
====================
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Lorenzo Bianconi [Sat, 19 Oct 2019 08:13:25 +0000 (10:13 +0200)]
net: mvneta: move header prefetch in mvneta_swbm_rx_frame
Move data buffer prefetch in mvneta_swbm_rx_frame after
dma_sync_single_range_for_cpu
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Lorenzo Bianconi [Sat, 19 Oct 2019 08:13:24 +0000 (10:13 +0200)]
net: mvneta: add basic XDP support
Add basic XDP support to mvneta driver for devices that rely on software
buffer management. Currently supported verdicts are:
- XDP_DROP
- XDP_PASS
- XDP_REDIRECT
- XDP_ABORTED
- XDP_DROP via xdp1
$./samples/bpf/xdp1 3
proto 0: 421419 pkt/s
proto 0: 421444 pkt/s
proto 0: 421393 pkt/s
proto 0: 421440 pkt/s
proto 0: 421184 pkt/s
Tested-by: Matteo Croce <mcroce@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Lorenzo Bianconi [Sat, 19 Oct 2019 08:13:23 +0000 (10:13 +0200)]
net: mvneta: rely on build_skb in mvneta_rx_swbm poll routine
Refactor mvneta_rx_swbm code introducing mvneta_swbm_rx_frame and
mvneta_swbm_add_rx_fragment routines. Rely on build_skb in oreder to
allocate skb since the previous patch introduced buffer recycling using
the page_pool API.
This patch fixes even an issue in the original driver where dma buffers
are accessed before dma sync.
mvneta driver can run on not cache coherent devices so it is
necessary to sync DMA buffers before sending them to the device
in order to avoid memory corruptions. Running perf analysis we can
see a performance cost associated with this DMA-sync (anyway it is
already there in the original driver code). In follow up patches we
will add more logic to reduce DMA-sync as much as possible.
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Lorenzo Bianconi [Sat, 19 Oct 2019 08:13:22 +0000 (10:13 +0200)]
net: mvneta: introduce page pool API for sw buffer manager
Use the page_pool api for allocations and DMA handling instead of
__dev_alloc_page()/dma_map_page() and free_page()/dma_unmap_page().
Pages are unmapped using page_pool_release_page before packets
go into the network stack.
The page_pool API offers buffer recycling capabilities for XDP but
allocates one page per packet, unless the driver splits and manages
the allocated page.
This is a preliminary patch to add XDP support to mvneta driver
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce mvneta_update_stats routine to collect {rx/tx} statistics
(packets and bytes). This is a preliminary patch to add XDP support to
mvneta driver
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Implement flow for S0ix support. Modern SoCs support S0ix low power
states during idle periods, which are sub-states of ACPI S0 that increase
power saving while supporting an instant-on experience for providing
lower latency that ACPI S0. The S0ix states shut off parts of the SoC
when they are not in use, while still maintaning optimal performance.
This patch add support for S0ix started from an Ice Lake platform.
Suggested-by: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Signed-off-by: Vitaly Lifshits <vitaly.lifshits@intel.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@linux.intel.com> Signed-off-by: Sasha Neftin <sasha.neftin@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jian Shen [Sat, 19 Oct 2019 08:03:56 +0000 (16:03 +0800)]
net: hns3: log and clear hardware error after reset complete
When device is resetting, the CMDQ service may be stopped until
reset completed. If a new RAS error occurs at this moment, it
will no be able to clear the RAS source. This patch fixes it
by clear the RAS source after reset complete.
Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Yunsheng Lin [Sat, 19 Oct 2019 08:03:55 +0000 (16:03 +0800)]
net: hns3: do not allocate linear data for fraglist skb
Currently, napi_alloc_skb() is used to allocate skb for fraglist
when the head skb is not enough to hold the remaining data, and
the remaining data is added to the frags part of the fraglist skb,
leaving the linear part unused.
So this patch passes length of 0 to allocate fraglist skb with
zero size of linear data.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Yunsheng Lin [Sat, 19 Oct 2019 08:03:54 +0000 (16:03 +0800)]
net: hns3: minor cleanup for hns3_handle_rx_bd()
Since commit e55970950556 ("net: hns3: Add handling of GRO Pkts
not fully RX'ed in NAPI poll"), ring->skb is used to record the
current SKB when processing the RX BD in hns3_handle_rx_bd(),
so the parameter out_skb is unnecessary.
This patch also adjusts the err checking to reduce duplication
in hns3_handle_rx_bd(), and "err == -ENXIO" is rare case, so put
it in the unlikely annotation.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Yunsheng Lin [Sat, 19 Oct 2019 08:03:53 +0000 (16:03 +0800)]
net: hns3: make struct hns3_enet_ring cacheline aligned
Since struct hns3_enet_ring is a frequently used in critical data
path, so make it cacheline aligned as struct hns3_enet_tqp_vector.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Yunsheng Lin [Sat, 19 Oct 2019 08:03:52 +0000 (16:03 +0800)]
net: hns3: introduce ring_to_netdev() in enet module
There are a few places that need to access the netdev of a ring
through ring->tqp->handle->kinfo.netdev, and ring->tqp is a struct
which both in enet and hclge modules, it is better to use the
struct that is only used in enet module.
This patch adds the ring_to_netdev() to access the netdev of ring
through ring->tqp_vector->napi.dev.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Yunsheng Lin [Sat, 19 Oct 2019 08:03:51 +0000 (16:03 +0800)]
net: hns3: minor optimization for barrier in IO path
Currently, the TX and RX ring in a queue is bounded to the
same IRQ, there may be unnecessary barrier op when only one of
the ring need to be processed.
This patch adjusts the location of rmb() in hns3_clean_tx_ring()
and adds a checking in hns3_clean_rx_ring() to avoid unnecessary
barrier op when there is nothing to do for the ring.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Guojia Liao [Sat, 19 Oct 2019 08:03:50 +0000 (16:03 +0800)]
net: hns3: optimized MAC address in management table.
mac_addr_hi32 and mac_addr_lo16 are used to store the MAC address
for management table. But using array of mac_addr[ETH_ALEN] would
be more general and not need to care about the big-endian mode of
the CPU.
Signed-off-by: Guojia Liao <liaoguojia@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Yunsheng Lin [Sat, 19 Oct 2019 08:03:49 +0000 (16:03 +0800)]
net: hns3: remove struct hns3_nic_ring_data in hns3_enet module
Only the queue_index field in struct hns3_nic_ring_data is
used, other field is unused and unnecessary for hns3 driver,
so this patch removes it and move the queue_index field to
hns3_enet_ring.
This patch also removes an unused struct hns_queue declaration.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
"I was battling a cold after some recent trips, so quite a bit piled up
meanwhile, sorry about that.
Highlights:
1) Fix fd leak in various bpf selftests, from Brian Vazquez.
2) Fix crash in xsk when device doesn't support some methods, from
Magnus Karlsson.
3) Fix various leaks and use-after-free in rxrpc, from David Howells.
4) Fix several SKB leaks due to confusion of who owns an SKB and who
should release it in the llc code. From Eric Biggers.
5) Kill a bunc of KCSAN warnings in TCP, from Eric Dumazet.
6) Jumbo packets don't work after resume on r8169, as the BIOS resets
the chip into non-jumbo mode during suspend. From Heiner Kallweit.
7) Corrupt L2 header during MPLS push, from Davide Caratti.
8) Prevent possible infinite loop in tc_ctl_action, from Eric
Dumazet.
9) Get register bits right in bcmgenet driver, based upon chip
version. From Florian Fainelli.
10) Fix mutex problems in microchip DSA driver, from Marek Vasut.
11) Cure race between route lookup and invalidation in ipv4, from Wei
Wang.
12) Fix performance regression due to false sharing in 'net'
structure, from Eric Dumazet"
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (145 commits)
net: reorder 'struct net' fields to avoid false sharing
net: dsa: fix switch tree list
net: ethernet: dwmac-sun8i: show message only when switching to promisc
net: aquantia: add an error handling in aq_nic_set_multicast_list
net: netem: correct the parent's backlog when corrupted packet was dropped
net: netem: fix error path for corrupted GSO frames
macb: propagate errors when getting optional clocks
xen/netback: fix error path of xenvif_connect_data()
net: hns3: fix mis-counting IRQ vector numbers issue
net: usb: lan78xx: Connect PHY before registering MAC
vsock/virtio: discard packets if credit is not respected
vsock/virtio: send a credit update when buffer size is changed
mlxsw: spectrum_trap: Push Ethernet header before reporting trap
net: ensure correct skb->tstamp in various fragmenters
net: bcmgenet: reset 40nm EPHY on energy detect
net: bcmgenet: soft reset 40nm EPHYs before MAC init
net: phy: bcm7xxx: define soft_reset for 40nm EPHY
net: bcmgenet: don't set phydev->link from MAC
net: Update address for MediaTek ethernet driver in MAINTAINERS
ipv4: fix race condition between route lookup and invalidation
...
Eric Dumazet [Fri, 18 Oct 2019 22:20:05 +0000 (15:20 -0700)]
net: reorder 'struct net' fields to avoid false sharing
Intel test robot reported a ~7% regression on TCP_CRR tests
that they bisected to the cited commit.
Indeed, every time a new TCP socket is created or deleted,
the atomic counter net->count is touched (via get_net(net)
and put_net(net) calls)
So cpus might have to reload a contended cache line in
net_hash_mix(net) calls.
We need to reorder 'struct net' fields to move @hash_mix
in a read mostly cache line.
We move in the first cache line fields that can be
dirtied often.
We probably will have to address in a followup patch
the __randomize_layout that was added in linux-4.13,
since this might break our placement choices.
Fixes: 355b98553789 ("netns: provide pure entropy for net_hash_mix()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Fri, 18 Oct 2019 21:02:46 +0000 (17:02 -0400)]
net: dsa: fix switch tree list
If there are multiple switch trees on the device, only the last one
will be listed, because the arguments of list_add_tail are swapped.
Fixes: 83c0afaec7b7 ("net: dsa: Add new binding implementation") Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Mans Rullgard [Fri, 18 Oct 2019 16:56:58 +0000 (17:56 +0100)]
net: ethernet: dwmac-sun8i: show message only when switching to promisc
Printing the info message every time more than the max number of mac
addresses are requested generates unnecessary log spam. Showing it only
when the hw is not already in promiscous mode is equally informative
without being annoying.
Signed-off-by: Mans Rullgard <mans@mansr.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Chenwandun [Fri, 18 Oct 2019 10:20:37 +0000 (18:20 +0800)]
net: aquantia: add an error handling in aq_nic_set_multicast_list
add an error handling in aq_nic_set_multicast_list, it may not
work when hw_multicast_list_set error; and at the same time
it will remove gcc Wunused-but-set-variable warning.
Signed-off-by: Chenwandun <chenwandun@huawei.com> Reviewed-by: Igor Russkikh <igor.russkikh@aquantia.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
net: netem: fix further issues with packet corruption
This set is fixing two more issues with the netem packet corruption.
First patch (which was previously posted) avoids NULL pointer dereference
if the first frame gets freed due to allocation or checksum failure.
v2 improves the clarity of the code a little as requested by Cong.
Second patch ensures we don't return SUCCESS if the frame was in fact
dropped. Thanks to this commit message for patch 1 no longer needs the
"this will still break with a single-frame failure" disclaimer.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Fri, 18 Oct 2019 16:16:58 +0000 (09:16 -0700)]
net: netem: correct the parent's backlog when corrupted packet was dropped
If packet corruption failed we jump to finish_segs and return
NET_XMIT_SUCCESS. Seeing success will make the parent qdisc
increment its backlog, that's incorrect - we need to return
NET_XMIT_DROP.
Fixes: 6071bd1aa13e ("netem: Segment GSO packets on enqueue") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Fri, 18 Oct 2019 16:16:57 +0000 (09:16 -0700)]
net: netem: fix error path for corrupted GSO frames
To corrupt a GSO frame we first perform segmentation. We then
proceed using the first segment instead of the full GSO skb and
requeue the rest of the segments as separate packets.
If there are any issues with processing the first segment we
still want to process the rest, therefore we jump to the
finish_segs label.
Commit 177b8007463c ("net: netem: fix backlog accounting for
corrupted GSO frames") started using the pointer to the first
segment in the "rest of segments processing", but as mentioned
above the first segment may had already been freed at this point.
Backlog corrections for parent qdiscs have to be adjusted.
Fixes: 177b8007463c ("net: netem: fix backlog accounting for corrupted GSO frames") Reported-by: kbuild test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reported-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Tretter [Fri, 18 Oct 2019 14:11:43 +0000 (16:11 +0200)]
macb: propagate errors when getting optional clocks
The tx_clk, rx_clk, and tsu_clk are optional. Currently the macb driver
marks clock as not available if it receives an error when trying to get
a clock. This is wrong, because a clock controller might return
-EPROBE_DEFER if a clock is not available, but will eventually become
available.
In these cases, the driver would probe successfully but will never be
able to adjust the clocks, because the clocks were not available during
probe, but became available later.
For example, the clock controller for the ZynqMP is implemented in the
PMU firmware and the clocks are only available after the firmware driver
has been probed.
Use devm_clk_get_optional() in instead of devm_clk_get() to get the
optional clock and propagate all errors to the calling function.
Signed-off-by: Michael Tretter <m.tretter@pengutronix.de> Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com> Tested-by: Nicolas Ferre <nicolas.ferre@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Juergen Gross [Fri, 18 Oct 2019 07:45:49 +0000 (09:45 +0200)]
xen/netback: fix error path of xenvif_connect_data()
xenvif_connect_data() calls module_put() in case of error. This is
wrong as there is no related module_get().
Remove the superfluous module_put().
Fixes: 279f438e36c0a7 ("xen-netback: Don't destroy the netdev until the vif is shut down") Cc: <stable@vger.kernel.org> # 3.12 Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Wei Liu <wei.liu@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the num_msi_left means the vector numbers of NIC,
but if the PF supported RoCE, it contains the vector numbers
of NIC and RoCE(Not expected).
This may cause interrupts lost in some case, because of the
NIC module used the vector resources which belongs to RoCE.
This patch adds a new variable num_nic_msi to store the vector
numbers of NIC, and adjust the default TQP numbers and rss_size
according to the value of num_nic_msi.
Fixes: 46a3df9f9718 ("net: hns3: Add HNS3 Acceleration Engine & Compatibility Layer Support") Signed-off-by: Yonglong Liu <liuyonglong@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sat, 19 Oct 2019 10:53:59 +0000 (06:53 -0400)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"Rather a lot of fixes, almost all affecting mm/"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (26 commits)
scripts/gdb: fix debugging modules on s390
kernel/events/uprobes.c: only do FOLL_SPLIT_PMD for uprobe register
mm/thp: allow dropping THP from page cache
mm/vmscan.c: support removing arbitrary sized pages from mapping
mm/thp: fix node page state in split_huge_page_to_list()
proc/meminfo: fix output alignment
mm/init-mm.c: include <linux/mman.h> for vm_committed_as_batch
mm/filemap.c: include <linux/ramfs.h> for generic_file_vm_ops definition
mm: include <linux/huge_mm.h> for is_vma_temporary_stack
zram: fix race between backing_dev_show and backing_dev_store
mm/memcontrol: update lruvec counters in mem_cgroup_move_account
ocfs2: fix panic due to ocfs2_wq is null
hugetlbfs: don't access uninitialized memmaps in pfn_range_valid_gigantic()
mm: memblock: do not enforce current limit for memblock_phys* family
mm: memcg: get number of pages on the LRU list in memcgroup base on lru_zone_size
mm/gup: fix a misnamed "write" argument, and a related bug
mm/gup_benchmark: add a missing "w" to getopt string
ocfs2: fix error handling in ocfs2_setattr()
mm: memcg/slab: fix panic in __free_slab() caused by premature memcg pointer release
mm/memunmap: don't access uninitialized memmap in memunmap_pages()
...