git.proxmox.com Git - mirror_ubuntu-kernels.git/log

Merge branch 'HWMON-support-for-SFP-modules'

Andrew Lunn says:

====================
HWMON support for SFP modules

This patchset adds HWMON support to SFP modules. The two patches add
some attributes for temperature and power sensors which are currently
missing from the hwmon core. The third patch adds a helper for
filtering out characters in hwmon names which are invalid. The last
patch then extends the core SFP code to export the sensors found in
SFP modules.

This code has been tested with two SFP modules:

module OEM SFP-7000-85 rev 11.0 sn M1512220075 dc 160221
module FINISAR CORP. FTLF8524E2GNL rev A sn PW40MNN dc 160725

The anonymous module uses external calibration, while the FINISAR uses
internal calibration. Thus both code paths have been tested.

Due to the cross subsystem nature of these patches, as discussed with
the RFC, it is hoped Guenter Roeck will ACK the patches, and then Dave
Miller will merge them all via net-next.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: sfp: Add HWMON support for module sensors

SFP modules can contain a number of sensors. The EEPROM also contains
recommended alarm and critical values for each sensor, and indications
of if these have been exceeded. Export this information via
HWMON. Currently temperature, VCC, bias current, transmit power, and
possibly receiver power is supported.

The sensors in the modules can either return calibrate or uncalibrated
values. Uncalibrated values need to be manipulated, using coefficients
provided in the SFP EEPROM. Uncalibrated receive power values require
floating point maths in order to calibrate them. Performing this in
the kernel is hard. So if the SFP module indicates it uses
uncalibrated values, RX power is not made available.

With this hwmon device, it is possible to view the sensor values using
lm-sensors programs:

in0:          +3.29 V  (crit min =  +2.90 V, min =  +3.00 V)
                       (max =  +3.60 V, crit max =  +3.70 V)
temp1:        +33.0°C  (low  =  -5.0°C, high = +80.0°C)
                       (crit low = -10.0°C, crit = +85.0°C)
power1:      1000.00 nW (max = 794.00 uW, min =  50.00 uW)  ALARM (LCRIT)
                       (lcrit =  40.00 uW, crit = 1000.00 uW)
curr1:        +0.00 A  (crit min =  +0.00 A, min =  +0.00 A)  ALARM (LCRIT, MIN)
                       (max =  +0.01 A, crit max =  +0.01 A)

The scaling sensors performs on the bias current is not particularly
good. The raw values are more useful:

curr1:
  curr1_input: 0.000
  curr1_min: 0.002
  curr1_max: 0.010
  curr1_lcrit: 0.000
  curr1_crit: 0.011
  curr1_min_alarm: 1.000
  curr1_max_alarm: 0.000
  curr1_lcrit_alarm: 1.000
  curr1_crit_alarm: 0.000

In order to keep the I2C overhead to a minimum, the constant values,
such as limits and calibration coefficients are read once at module
insertion time. Thus only reading *_input and *_alarm properties
requires i2c read operations.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

hwmon: Add helper to tell if a char is invalid in a name

HWMON device names are not allowed to contain "-* \t\n". Add a helper
which will return true if passed an invalid character. It can be used
to massage a string into a hwmon compatible name by replacing invalid
characters with '_'.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

hwmon: Add support for power min, lcrit, min_alarm and lcrit_alarm

Some sensors support reporting minimal and lower critical power, as
well as alarms when these thresholds are reached. Add support for
these attributes to the hwmon core.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

hwmon: Add missing HWMON_T_LCRIT_ALARM define

The enum hwmon_temp_lcrit_alarm exists, but the BIT definition is
missing.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'r8169-add-phylib-support'

Heiner Kallweit says:

====================
r8169: add phylib support

Now that all the basic refactoring has been done we can add phylib
support. This patch series was successfully tested on:
RTL8168h
RTL8168evl
RTL8169sb

Changes in v2:
- return error in mdio ops if phyaddr > 0
- advertise pause modes
- added reviewed-by for several patches

Changes in v3:
- return ENODEV for unused phy addresses in mdio ops
- remove unneeded PHY suspend in patch 2
- use recently added phy_speed_down and phy_speed_up in patch 7
- other minor changes based on review comments
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: don't read chip phy status register

Instead of accessing the PHYstatus register we can use the information
phylib stores in the phy_device structure.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: remove mii_if_info member from struct rtl8169_private

The only remaining usage of the struct mii_if_info member is to store the
information whether the chip is GMII-capable. So we can replace it with
a simple flag.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: remove rtl8169_set_speed_xmii

We can remove rtl8169_set_speed_xmii() now that phylib handles all this.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: use phy_speed_down / phy_speed_up

Use new phylib functions phy_speed_down() and phy_speed_up().

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: use phy_mii_ioctl

Switch to using phy_mii_ioctl().

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: use phy_ethtool_nway_reset

Switch to using phy_ethtool_nway_reset().

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: use phy_ethtool_(g|s)et_link_ksettings

Use phy_ethtool_(g|s)et_link_ksettings() for the respective ethtool_ops
callbacks.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: replace open-coded PHY soft reset with genphy_soft_reset

Use genphy_soft_reset() instead of open-coding a PHY soft reset. We have
to do an explicit PHY soft reset because some chips use the genphy driver
which uses a no-op as soft_reset callback.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: use phy_resume/phy_suspend

Use phy_resume() / phy_suspend() instead of open coding this functionality.
The chip version specific differences are handled by the respective PHY
drivers.

The call to r8168_phy_power_down() in r8168_pll_power_down() can be
removed because phylib takes care now. The relevant scenarios are:
- rtl8169_close(): phy_disconnect() powers down PHY
- suspend: mdio_bus_phy_suspend() takes care
- runtime-suspend: WoL is active, don't suspend PHY
- rtl_shutdown(): no need to power down PHY

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: add basic phylib support

Add basic phylib support to r8169. All now unneeded old PHY handling code
will be removed in subsequent patches.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

liquidio: correct error msg text when removing VLAN ID

Signed-off-by: Rick Farrington <ricardo.farrington@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Fix GRO_HASH_BUCKETS assertion.

FIELD_SIZEOF() is in bytes, but we want bits.

Fixes: d9f37d01e294 ("net: convert gro_count to bitmask")
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sch_cake: Fix tin order when set through skb->priority

In diffserv mode, CAKE stores tins in a different order internally than
the logical order exposed to userspace. The order remapping was missing
in the handling of 'tc filter' priority mappings through skb->priority,
resulting in bulk and best effort mappings being reversed relative to
how they are displayed.

Fix this by adding the missing mapping when reading skb->priority.

Fixes: 83f8fd69af4f ("sch_cake: Add DiffServ handling")
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: collect ASIC LA dumps from ULP TX

Signed-off-by: Surendra Mobiya <surendra@chelsio.com>
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Expose counters for various packet sizes

Expose counters ASIC has in the group of RFC 2819 counters that count
number of packets within specific size range.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

liquidio: fix hang when re-binding VF host drv after running DPDK VF driver

When configuring SLI_PKTn_OUTPUT_CONTROL, VF driver was assuming that IPTR
mode was disabled by reset, which was not true. Since DPDK driver had
set IPTR mode previously, the VF driver (which uses buf-ptr-only mode) was
not properly handling DROQ packets (i.e. it saw zero-length packets).

This represented an invalid hardware configuration which the driver could
not handle.

Signed-off-by: Rick Farrington <ricardo.farrington@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: cavium: Drop dependency of NET_VENDOR_CAVIUM on PCI

Octeon Ethernet drivers work perfectly without PCI.

Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mscc: simplify retrieving the tag type from the frame header

The tag type in the frame extraction header is only a bit wide. There's
no need to use GENMASK when retrieving the information. This patch
simplify the code by dropping GENMASK and using BIT instead.

Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: do not return DUPLEX_UNKNOWN when link is down

We were returning DUPLEX_UNKNOWN in get_link_ksettings() when
the link was down. Unfortunately, this causes a problem when
"ethtool -s autoneg on" is issued for a link which is down because
the ethtool code first reads the settings and then reapplies them
with only the changes provided on the command line. Which results
in us diving into set_link_ksettings() with DUPLEX_UNKNOWN which is
not DUPLEX_FULL, so set_link_ksettings() throws an -EINVAL error.
do not return DUPLEX_UNKNOWN to fix the issue.

Signed-off-by: Casey Leedom <leedom@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: convert gro_count to bitmask

gro_hash size is 192 bytes, and uses 3 cache lines, if there is few
flows, gro_hash may be not fully used, so it is unnecessary to iterate
all gro_hash in napi_gro_flush(), to occupy unnecessary cacheline.

convert gro_count to a bitmask, and rename it as gro_bitmask, each bit
represents a element of gro_hash, only flush a gro_hash element if the
related bit is set, to speed up napi_gro_flush().

and update gro_bitmask only if it will be changed, to reduce cache
update

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Cc: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnxt_en: remove redundant debug register dma mem allocation

hwrm_dbg_resp_addr and hwrm_dbg_resp_dma_addr are never used
and can be removed.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

liquidio: Use %pad printk format for dma_addr_t values

Use the existing %pad printk format to print dma_addr_t values.
This avoids the following warnings when compiling on the parisc platform:

warning: format '%llx' expects argument of type 'long long unsigned int', but argument 2 has type 'dma_addr_t {aka unsigned int}' [-Wformat=]

Signed-off-by: Helge Deller <deller@gmx.de>
Acked-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: realtek: add missing entry for RTL8211C to mdio_device_id table

Add missing entry for RTL8211C to mdio_device_id table.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Fixes: cf87915cb9f8 ("net: phy: realtek: add support for RTL8211C")
Signed-off-by: David S. Miller <davem@davemloft.net>

net: usb: hso: use swap macro in hso_kick_transmit

Make use of the swap macro and remove unnecessary variable *temp*.
This makes the code easier to read and maintain. Also, slightly
refactor some code due to the removal of *temp*.

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'phy-helpers'

Heiner Kallweit says:

====================
net: phy: add functionality to speed down PHY when waiting for WoL packet

Some network drivers include functionality to speed down the PHY when
suspending and just waiting for a WoL packet because this saves energy.

This patch is based on our recent discussion about factoring out this
functionality to phylib. First user will be the r8169 driver.

v2:
- add warning comment to phy_speed_down regarding usage of sync = false
- remove sync parameter from phy_speed_up
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: add phy_speed_down and phy_speed_up

Some network drivers include functionality to speed down the PHY when
suspending and just waiting for a WoL packet because this saves energy.
This functionality is quite generic, therefore let's factor it out to
phylib.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: add helper phy_config_aneg

This functionality will also be needed in subsequent patches of this
series, therefore factor it out to a helper.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: tls: add selftests for TLS sockets

Add selftests for tls socket. Tests various iov and message options,
poll blocking and nonblocking behavior, partial message sends / receives,
and control message data. Tests should pass regardless of if TLS
is enabled in the kernel or not, and print a warning message if not.

Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'docs-Fix-failover-build-warnings'

Tobin C. Harding says:

====================
docs: Fix failover build warnings

This is my first patch set to net-next.  Please shout loud and clear if
I've botched anything.

Recently failover and net_failover modules were added to the mainline.
Documentation was included in rst format but they were not added to the
toctree in `networking/index.rst`.  Also building docs for net_failover
is currently emitting a few warnings.

Patch 1 adds failover and net_failover to the index toctree
Patch 2 fixes the build warnings for net_failover

I haven't been super active on netdev list so if there is some reason I
missed why these files are not in the index please do say so.

Has there been any discussion on preferred order for the toctree index
list?  I just added them to the bottom of the list.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

docs: networking: Fix failover build warnings

Currently building the net_failover docs causes a bunch of warnings to
be emitted. These warnings are all related to indentation and correctly
highlight missing '::' (for code sections). It looks, from other rst
files in Documentation, that the first column should be indented 2
spaces.

Add '::' before code snippets and indent all snippets uniformly starting
with 2 spaces.

Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Tobin C. Harding <me@tobin.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>

docs: networking: Add failover docs to index

Currently we have rst format docs for the failover and net_failover
modules however these docs are not linked to within the index.

Add `failover` and `net_failover` to the networking documentation index.

Signed-off-by: Tobin C. Harding <me@tobin.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'hns3-next'

Salil Mehta says:

====================
Bug fixes and some minor changes to HNS3 driver

This patch-set presents some fixes and minor changes to the HNS3 Ethernet Driver.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: Fix comments for hclge_get_ring_chain_from_mbx

Actually, hclge_get_ring_chain_from_mbx is used to get ring type, tqp id,
and int_gl index from mailbox message. So the comments is incorrect. This
patch fixes it.

Fixes: dde1a86e93ca ("net: hns3: Add mailbox support to PF driver")
Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: Fix for using wrong mask and shift in hclge_get_ring_chain_from_mbx

HCLGE_INT_GL_IDX_M and HCLGE_INT_GL_IDX_S are used to set fireware
cmd. When getting int_gl value from mailbox message, we should use
HNAE3_RING_GL_IDX_M and HNAE3_RING_GL_IDX_S.

Fixes: 79eee4108541 ("net: hns3: add int_gl_idx setup for VF")
Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: Fix for reset_level default assignment probelm

handle->reset_level is assigned to HNAE3_NONE_RESET when client is
initialized, if a tx timeout happens right after initialization,
then handle->reset_level is not resetted to HNAE3_FUNC_RESET in
hclge_reset_event, which will cause reset event not properly
handled problem.

This patch fixes it by setting handle->reset_level properly when
client is initialized.

Fixes: 6d4c3981a8d8 ("net: hns3: Changes to make enet watchdog timeout func common for PF/VF")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: remove unnecessary ring configuration operation while resetting

The configuration of the ring will be used to reinitialize the
ring after the hardware reset is completed. So we should not
release and reacquire this configuration during reset.

Fixes: bb6b94a896d4 ("net: hns3: Add reset interface implementation in client")
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: Fix return value error in hns3_reset_notify_down_enet

When doing reset, netdev has not been brought up is not an error,
it means that we do not need do the stop operation, so just return
zero.

Fixes: 76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC")
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: Correct reset event status register

According to hardware's description, driver should get reset event
from VECTOR0_PF_OTHER_INT_ST(0x20800) instead of
VECTOR0_PF_OTHER_INT_SRC(0x20700).

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: Prevent to request reset frequently

Netdevice reset should not be requested frequently, a new one
must wait a moment since there may be some work not completed.

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: Reset net device with rtnl_lock

Since current locking was not covering certain code where
netdev was being accessed or manipulated, this patch fixes
it.

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: Modify the order of initializing command queue register

According to hardware's description, the head pointer register should
be written before the tail pointer register while doing command queue
initialization.

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'TLS-offload-rx-netdev-and-mlx5'

Boris Pismenny says:

====================
TLS offload rx, netdev & mlx5

The following series provides TLS RX inline crypto offload.

v5->v4:
    - Remove the Kconfig to mutually exclude both IPsec and TLS

v4->v3:
    - Remove the iov revert for zero copy send flow

v2->v3:
    - Fix typo
    - Adjust cover letter
    - Fix bug in zero copy flows
    - Use network byte order for the record number in resync
    - Adjust the sequence provided in resync

v1->v2:
    - Fix bisectability problems due to variable name changes
    - Fix potential uninitialized return value

This series completes the generic infrastructure to offload TLS crypto to
a network devices. It enables the kernel TLS socket to skip decryption and
authentication operations for SKBs marked as decrypted on the receive
side of the data path. Leaving those computationally expensive operations
to the NIC.

This infrastructure doesn't require a TCP offload engine. Instead, the
NIC decrypts a packet's payload if the packet contains the expected TCP
sequence number. The TLS record authentication tag remains unmodified
regardless of decryption. If the packet is decrypted successfully and it
contains an authentication tag, then the authentication check has passed.
Otherwise, if the authentication fails, then the packet is provided
unmodified and the KTLS layer is responsible for handling it.
Out-Of-Order TCP packets are provided unmodified. As a result,
in the slow path some of the SKBs are decrypted while others remain as
ciphertext.

The GRO and TCP layers must not coalesce decrypted and non-decrypted SKBs.
At the worst case a received TLS record consists of both plaintext
and ciphertext packets. These partially decrypted records must be
reencrypted, only to be decrypted.

The notable differences between SW KTLS and NIC offloaded TLS
implementations are as follows:
1. Partial decryption - Software must handle the case of a TLS record
that was only partially decrypted by HW. This can happen due to packet
reordering.
2. Resynchronization - tls_read_size calls the device driver to
resynchronize HW whenever it lost track of the TLS record framing in
the TCP stream.

The infrastructure should be extendable to support various NIC offload
implementations.  However it is currently written with the
implementation below in mind:
The NIC identifies packets that should be offloaded according to
the 5-tuple and the TCP sequence number. If these match and the
packet is decrypted and authenticated successfully, then a syndrome
is provided to software. Otherwise, the packet is unmodified.
Decrypted and non-decrypted packets aren't coalesced by the network stack,
and the KTLS layer decrypts and authenticates partially decrypted records.
The NIC provides an indication whenever a resync is required. The resync
operation is triggered by the KTLS layer while parsing TLS record headers.

Finally, we measure the performance obtained by running single stream
iperf with two Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz machines connected
back-to-back with Innova TLS (40Gbps) NICs. We compare TCP (upper bound)
and KTLS-Offload running both in Tx and Rx. The results show that the
performance of offload is comparable to TCP.

                          | Bandwidth (Gbps) | CPU Tx (%) | CPU rx (%)
TCP                       | 28.8             | 5          | 12
KTLS-Offload-Tx-Rx   | 28.6      | 7          | 14

Paper: https://netdevconf.org/2.2/papers/pismenny-tlscrypto-talk.pdf
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: IPsec, fix byte count in CQE

This patch fixes the byte count indication in CQE for processed IPsec
packets that contain a metadata header.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: Accel, add common metadata functions

This patch adds common functions to handle mellanox metadata headers.
These functions are used by IPsec and TLS to process FPGA metadata.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: TLS, build TLS netdev from capabilities

This patch enables TLS Rx based on available HW capabilities.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: TLS, add software statistics

This patch adds software statistics for TLS to count important
events.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: TLS, add Innova TLS rx data path

Implement the TLS rx offload data path according to the
requirements of the TLS generic NIC offload infrastructure.

Special metadata ethertype is used to pass information to
the hardware.

When hardware loses synchronization a special resync request
metadata message is used to request resync.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: TLS, add innova rx support

Add the mlx5 implementation of the TLS Rx routines to add/del TLS
contexts, also add the tls_dev_resync_rx routine
to work with the TLS inline Rx crypto offload infrastructure.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: Accel, add TLS rx offload routines

In Innova TLS, TLS contexts are added or deleted
via a command message over the SBU connection.
The HW then sends a response message over the same connection.

Complete the implementation for Innova TLS (FPGA-based) hardware by
adding support for rx inline crypto offload.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: TLS, refactor variable names

For symmetry, we rename mlx5e_tls_offload_context to
mlx5e_tls_offload_context_tx before we add mlx5e_tls_offload_context_rx.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Reviewed-by: Aviad Yehezkel <aviadye@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tls: Fix zerocopy_from_iter iov handling

zerocopy_from_iter iterates over the message, but it doesn't revert the
updates made by the iov iteration. This patch fixes it. Now, the iov can
be used after calling zerocopy_from_iter.

Fixes: 3c4d75591 ("tls: kernel TLS support")
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tls: Add rx inline crypto offload

This patch completes the generic infrastructure to offload TLS crypto to a
network device. It enables the kernel to skip decryption and
authentication of some skbs marked as decrypted by the NIC. In the fast
path, all packets received are decrypted by the NIC and the performance
is comparable to plain TCP.

This infrastructure doesn't require a TCP offload engine. Instead, the
NIC only decrypts packets that contain the expected TCP sequence number.
Out-Of-Order TCP packets are provided unmodified. As a result, at the
worst case a received TLS record consists of both plaintext and ciphertext
packets. These partially decrypted records must be reencrypted,
only to be decrypted.

The notable differences between SW KTLS Rx and this offload are as
follows:
1. Partial decryption - Software must handle the case of a TLS record
that was only partially decrypted by HW. This can happen due to packet
reordering.
2. Resynchronization - tls_read_size calls the device driver to
resynchronize HW after HW lost track of TLS record framing in
the TCP stream.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tls: Fill software context without allocation

This patch allows tls_set_sw_offload to fill the context in case it was
already allocated previously.

We will use it in TLS_DEVICE to fill the RX software context.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tls: Split tls_sw_release_resources_rx

This patch splits tls_sw_release_resources_rx into two functions one
which releases all inner software tls structures and another that also
frees the containing structure.

In TLS_DEVICE we will need to release the software structures without
freeeing the containing structure, which contains other information.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tls: Split decrypt_skb to two functions

Previously, decrypt_skb also updated the TLS context.
Now, decrypt_skb only decrypts the payload using the current context,
while decrypt_skb_update also updates the state.

Later, in the tls_device Rx flow, we will use decrypt_skb directly.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tls: Refactor tls_offload variable names

For symmetry, we rename tls_offload_context to
tls_offload_context_tx before we add tls_offload_context_rx.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: Don't coalesce decrypted and encrypted SKBs

Prevent coalescing of decrypted and encrypted SKBs in GRO
and TCP layer.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Add TLS rx resync NDO

Add new netdev tls op for resynchronizing HW tls context

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Add TLS RX offload feature

This patch adds a netdev feature to configure TLS RX inline crypto offload.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Add decrypted field to skb

The decrypted bit is propogated to cloned/copied skbs.
This will be used later by the inline crypto receive side offload
of tls.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mvpp2-add-debugfs-interface'

Maxime Chevallier says:

====================
net: mvpp2: add debugfs interface

The PPv2 Header Parser and Classifier are not straightforward to debug,
having easy access to some of the many lookup tables configuration is
helpful during development and debug.

This series adds a basic debugfs interface, allowing to read data from
the Header Parser and some of the Classifier tables.

For now, the interface is read-only, and contains only some basic info.

This was actually used during RSS development, and might be useful to
troubleshoot some issues we might find.

The first patch of the series converts the mvpp2 files to SPDX, which
eases adding the new debugfs dedicated file.

The second patch adds the interface, and exposes basic Header Parser data.

The 3rd patch adds a hit counter for the Header Parser TCAM.

The 4th patch exposes classifier info.

The 5th patch adds some hit counters for some of the classifier engines.

Changes since V1:
- Rebased on the lastest net-next
- Made cls_flow_get non static so that it can be used in mvpp2_debugfs
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: mvpp2: debugfs: add classifier hit counters

The classification operations that are used for RSS make use of several
lookup tables. Having hit counters for these tables is really helpful
to determine what flows were matched by ingress traffic, and see the
path of packets among all the classifier tables.

This commit adds hit counters for the 3 tables used at the moment :

- The decoding table (also called lookup_id table), that links flows
   identified by the Header Parser to the flow table.

   There's one entry per flow, located at :
   .../mvpp2/<controller>/flows/XX/dec_hits

   Note that there are 21 flows in the decoding table, whereas there are
   52 flows in the Header Parser. That's because there are several kind
   of traffic that will match a given flow. Reading the hit counter from
   one sub-flow will clear all hit counter that have the same flow_id.

   This also applies to the flow_hits.

- The flow table, that contains all the different lookups to be
   performed by the classifier for each packet of a given flow. The match
   is done on the first entry of the flow sequence.

- The C2 engine entries, that are used to assign the default rx queue,
   and enable or disable RSS for a given port.

   There's one entry per flow, located at:
   .../mvpp2/<controller>/flows/XX/flow_hits

   There is one C2 entry per port, so the c2 hit counter is located at :
   .../mvpp2/<controller>/ethX/c2_hits

All hit counter values are 16-bits clear-on-read values.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mvpp2: debugfs: add entries for classifier flows

The classifier configuration for RSS is quite complex, with several
lookup tables being used. This commit adds useful info in debugfs to
see how the different tables are configured :

Added 2 new entries in the per-port directory :

  - .../eth0/default_rxq : The default rx queue on that port
  - .../eth0/rss_enable : Indicates if RSS is enabled in the C2 entry

Added the 'flows' directory :

  It contains one entry per sub-flow. a 'sub-flow' is a unique path from
  Header Parser to the flow table. Multiple sub-flows can point to the
  same 'flow' (each flow has an id from 8 to 29, which is its index in the
  Lookup Id table) :

  - .../flows/00/...
             /01/...
             ...
             /51/id : The flow id. There are 21 unique flows. There's one
                       flow per combination of the following parameters :
                       - L4 protocol (TCP, UDP, none)
                       - L3 protocol (IPv4, IPv6)
                       - L3 parameters (Fragmented or not)
                       - L2 parameters (Vlan tag presence or not)
              .../type : The flow type. This is an even higher level flow,
                         that we manipulate with ethtool. It can be :
                         "udp4" "tcp4" "udp6" "tcp6" "ipv4" "ipv6" "other".
              .../eth0/...
              .../eth1/engine : The hash generation engine used for this
                        flow on the given port
                  .../hash_opts : The hash generation options indicating on
                                  what data we base the hash (vlan tag, src
                                  IP, src port, etc.)

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mvpp2: debugfs: add hit counter stats for Header Parser entries

One helpful feature to help debug the Header Parser TCAM filter in PPv2
is to be able to see if the entries did match something when a packet
comes in. This can be done by using the built-in hit counter for TCAM
entries.

This commit implements reading the counter, and exposing its value on
debugfs for each filter entry.

The counter is a 16-bits clear-on-read value, located at:
.../mvpp2/<controller>/parser/XXX/hits

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mvpp2: add a debugfs interface for the Header Parser

Marvell PPv2 Packer Header Parser has a TCAM based filter, that is not
trivial to configure and debug. Being able to dump TCAM entries from
userspace can be really helpful to help development of new features
and debug existing ones.

This commit adds a basic debugfs interface for the PPv2 driver, focusing
on TCAM related features.

<mnt>/mvpp2/ --- f2000000.ethernet
              \- f4000000.ethernet --- parser --- 000 ...
                                    |          \- 001
                                    |          \- ...
                                    |          \- 255 --- ai
                                    |                  \- header_data
                                    |                  \- lookup_id
                                    |                  \- sram
                                    |                  \- valid
                                    \- eth1 ...
                                    \- eth2 --- mac_filter
                                             \- parser_entries
                                             \- vid_filter

There's one directory per PPv2 instance, named after pdev->name to make
sure names are uniques. In each of these directories, there's :

- one directory per interface on the controller, each containing :

   - "mac_filter", which lists all filtered addresses for this port
     (based on TCAM, not on the kernel's uc / mc lists)

   - "parser_entries", which lists the indices of all valid TCAM
      entries that have this port in their port map

   - "vid_filter", which lists the vids allowed on this port, based on
     TCAM

- one "parser" directory (the parser is common to all ports), containing :

   - one directory per TCAM entry (256 of them, from 0 to 255), each
     containing :

     - "ai" : Contains the 1 byte Additional Info field from TCAM, and

     - "header_data" : Contains the 8 bytes Header Data extracted from
       the packet

     - "lookup_id" : Contains the 4 bits LU_ID

     - "sram" : contains the raw SRAM data, which is the result of the TCAM
lookup. This readonly at the moment.

     - "valid" : Indicates if the entry is valid of not.

All entries are read-only, and everything is output in hex form.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mvpp2: switch to SPDX identifiers

Use the appropriate SPDX license identifiers and drop the license text.
This patch is only cosmetic.

Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2018-07-15

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Various different arm32 JIT improvements in order to optimize code emission
   and make the JIT code itself more robust, from Russell.

2) Support simultaneous driver and offloaded XDP in order to allow for advanced
   use-cases where some work is offloaded to the NIC and some to the host. Also
   add ability for bpftool to load programs and maps beyond just the cgroup case,
   from Jakub.

3) Add BPF JIT support in nfp for multiplication as well as division. For the
   latter in particular, it uses the reciprocal algorithm to emulate it, from Jiong.

4) Add BTF pretty print functionality to bpftool in plain and JSON output
   format, from Okash.

5) Add build and installation to the BPF helper man page into bpftool, from Quentin.

6) Add a TCP BPF callback for listening sockets which is triggered right after
   the socket transitions to TCP_LISTEN state, from Andrey.

7) Add a new cgroup tree command to bpftool which iterates over the whole cgroup
   tree and prints all attached programs, from Roman.

8) Improve xdp_redirect_cpu sample to support parsing of double VLAN tagged
   packets, from Jesper.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'bpf-tcp-listen-cb'

Andrey Ignatov says:

====================
This patchset adds TCP-BPF callback for listening sockets.

Patch 0001 provides more details and is the main patch in the set.

Patch 0006 adds selftest for the new callback.

Other patches are bug fixes and improvements in TCP-BPF selftest
to make it easier to extend in 0006.
====================

Acked-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

selftests/bpf: Test case for BPF_SOCK_OPS_TCP_LISTEN_CB

Cover new TCP-BPF callback in test_tcpbpf: when listen() is called on
socket, set BPF_SOCK_OPS_STATE_CB_FLAG so that BPF_SOCK_OPS_STATE_CB
callback can be called on future state transition, and when such a
transition happens (TCP_LISTEN -> TCP_CLOSE), track it in the map and
verify it in user space later.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

selftests/bpf: Better verification in test_tcpbpf

Reduce amount of copy/paste for debug info when result is verified in
the test and keep that info together with values being checked so that
they won't get out of sync.

It also improves debug experience: instead of checking manually what
doesn't match in debug output for all fields, only unexpected field is
printed.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

selftests/bpf: Switch test_tcpbpf_user to cgroup_helpers

Switch to cgroup_helpers to simplify the code and fix cgroup cleanup:
before cgroup was not cleaned up after the test.

It also removes SYSTEM macro, that only printed error, but didn't
terminate the test.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

selftests/bpf: Fix const'ness in cgroup_helpers

Lack of const in cgroup helpers signatures forces to write ugly client
code. Fix it.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: Sync bpf.h to tools/

Sync BPF_SOCK_OPS_TCP_LISTEN_CB related UAPI changes to tools/.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: Add BPF_SOCK_OPS_TCP_LISTEN_CB

Add new TCP-BPF callback that is called on listen(2) right after socket
transition to TCP_LISTEN state.

It fills the gap for listening sockets in TCP-BPF. For example BPF
program can set BPF_SOCK_OPS_STATE_CB_FLAG when socket becomes listening
and track later transition from TCP_LISTEN to TCP_CLOSE with
BPF_SOCK_OPS_STATE_CB callback.

Before there was no way to do it with TCP-BPF and other options were
much harder to work with. E.g. socket state tracking can be done with
tracepoints (either raw or regular) but they can't be attached to cgroup
and their lifetime has to be managed separately.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

Merge branch 'mlxsw-VRRP'

Ido Schimmel says:

====================
mlxsw: Add VRRP support

When a router that is acting as the default gateway of a host stops
functioning, the host will encounter packet loss until the router starts
functioning again.

To increase the reliability of the default gateway without performing
reconfiguration on the host, a host can use a Virtual Router Redundancy
Protocol (VRRP) Router. This virtual router is composed from several
routers where only one is actually forwarding packets from the host (the
master router) while the other routers act as backup routers. The
election of the master router is determined by the VRRP protocol [1].

Packets addressed to the virtual router are always sent to the virtual
router MAC address (IPv4: 00-00-5E-00-01-XX, IPv6: 00-00-5E-00-02-XX).
Such packets can only be accepted by the master router and must be
discarded by the backup routers.

In Linux, VRRP is usually implemented by configuring a macvlan with the
virtual router MAC on top of the router interface that is connected to
the host / LAN. The macvlan on the master router is assigned the virtual
IP (VIP) that the host uses as its gateway.

In order to support VRRP in mlxsw, we first need to enable macvlan upper
devices on top of mlxsw netdevs and their uppers. This is done by the
first patch, which also takes care of sanitizing macvlan configurations
that are not currently supported by the driver.

The second patch directs packets with destination MAC addresses as the
macvlans to the router so that they will undergo an L3 lookup. This is
consistent with the kernel's behavior where the macvlan's Rx handler
will re-inject such packets to the Rx path so that they will be picked
up by the IPvX protocol handlers and undergo an L3 lookup. Note that the
driver prevents the macvlans from being enslaved to other devices, to
ensure the packets will be picked up by the protocol handler and not by
another Rx handler.

The third patch adds packet traps for VRRP control packets for both IPv4
and IPv6. Finally, the last patch optimizes the reception of VRRP MACs
by potentially skipping one L2 lookup for them.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_router: Optimize processing of VRRP MACs

Hosts using a VRRP router send their packets with a destination MAC of
the VRRP router which is of the following form [1]:

IPv4 - 00-00-5E-00-01-{VRID}
IPv6 - 00-00-5E-00-02-{VRID}

Where VRID is the ID of the virtual router. Such packets are directed to
the router block in the ASIC by an FDB entry that was added in the
previous patch.

However, in certain cases it is possible to skip this FDB lookup and
send such packets directly to the router. This is accomplished by adding
these special MAC addresses to the RIF cache. If the cache is hit, the
packet will skip the L2 lookup and ingress the router with the RIF
specified in the cache entry.

1. https://tools.ietf.org/html/rfc5798#section-7.3

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Add VRRP traps

Virtual Router Redundancy Protocol packets are used to communicate the
state of the Master router associated with the virtual router ID (VRID).

These are link-local multicast packets sent with IP protocol 112 that
are trapped in the router block in the ASIC.

Add a trap for these packets and mark the trapped packets to prevent
them from potentially being re-flooded by the bridge driver.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_router: Direct macvlans' MACs to router

An IP packet received on a netdev with a macvlan upper whose MAC matches
the packet's destination MAC will be re-injected to the Rx path as if it
was received by the macvlan, and perform an L3 lookup.

Reflect this functionality to the ASIC by programming FDB entries that
will direct MACs of macvlan uppers to the router.

In a similar fashion to router interfaces (RIFs) that are programmed
upon the addition of the first IP address on an interface and destroyed
upon the removal of the last IP address, the FDB entries for the macvlan
are added and destroyed based on the addition of the first and removal
of the last IP address on the macvlan.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Enable macvlan upper devices

In order to allow more unicast MAC addresses (e.g., VRRP virtual MAC) to
be directed to the router we need to enable macvlan uppers on top of
mlxsw netdevs.

Allow macvlan upper devices on top of mlxsw netdevs and sanitize
configurations that can't work. For example, a macvlan can't be enslaved
to a bridge as without ACLs the device doesn't take the destination MAC
into account when classifying a packet to a bridge instance (i.e., a
FID).

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: remove redundant rcv_nxt update

tcp_rcv_nxt_update() is already executed in tcp_data_queue().
This line is redundant.

See bellow,
tcp_queue_rcv
tcp_rcv_nxt_update(tcp_sk(sk), TCP_SKB_CB(skb)->end_seq);
tcp_rcv_nxt_update(tp, TCP_SKB_CB(skb)->end_seq); <<<< redundant

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bpf: btf: print map dump and lookup with btf info

This patch augments the output of bpftool's map dump and map lookup
commands to print data along side btf info, if the correspondin btf
info is available. The outputs for each of  map dump and map lookup
commands are augmented in two ways:

1. when neither of -j and -p are supplied, btf-ful map data is printed
whose aim is human readability. This means no commitments for json- or
backward- compatibility.

2. when either -j or -p are supplied, a new json object named
"formatted" is added for each key-value pair. This object contains the
same data as the key-value pair, but with btf info. "formatted" object
promises json- and backward- compatibility. Below is a sample output.

$ bpftool map dump -p id 8
[{
        "key": ["0x0f","0x00","0x00","0x00"
        ],
        "value": ["0x03", "0x00", "0x00", "0x00", ...
        ],
        "formatted": {
                "key": 15,
                "value": {
                        "int_field":  3,
                        ...
                }
        }
}
]

This patch calls btf_dumper introduced in previous patch to accomplish
the above. Indeed, btf-ful info is only displayed if btf data for the
given map is available. Otherwise existing output is displayed as-is.

Signed-off-by: Okash Khawaja <osk@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: btf: add btf print functionality

This consumes functionality exported in the previous patch. It does the
main job of printing with BTF data. This is used in the following patch
to provide a more readable output of a map's dump. It relies on
json_writer to do json printing. Below is sample output where map keys
are ints and values are of type struct A:

typedef int int_type;
enum E {
        E0,
        E1,
};

struct B {
        int x;
        int y;
};

struct A {
        int m;
        unsigned long long n;
        char o;
        int p[8];
        int q[4][8];
        enum E r;
        void *s;
        struct B t;
        const int u;
        int_type v;
        unsigned int w1: 3;
        unsigned int w2: 3;
};

$ sudo bpftool map dump id 14
[{
        "key": 0,
        "value": {
            "m": 1,
            "n": 2,
            "o": "c",
            "p": [15,16,17,18,15,16,17,18
            ],
            "q": [[25,26,27,28,25,26,27,28
                ],[35,36,37,38,35,36,37,38
                ],[45,46,47,48,45,46,47,48
                ],[55,56,57,58,55,56,57,58
                ]
            ],
            "r": 1,
            "s": 0x7ffd80531cf8,
            "t": {
                "x": 5,
                "y": 10
            },
            "u": 100,
            "v": 20,
            "w1": 0x7,
            "w2": 0x3
        }
    }
]

This patch uses json's {} and [] to imply struct/union and array. More
explicit information can be added later. For example, a command line
option can be introduced to print whether a key or value is struct
or union, name of a struct etc. This will however come at the expense
of duplicating info when, for example, printing an array of structs.
enums are printed as ints without their names.

Signed-off-by: Okash Khawaja <osk@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: btf: export btf types and name by offset from lib

This patch introduces btf__resolve_type() function and exports two
existing functions from libbpf. btf__resolve_type follows modifier
types like const and typedef until it hits a type which actually takes
up memory, and then returns it. This function follows similar pattern
to btf__resolve_size but instead of computing size, it just returns
the type.

These functions will be used in the followig patch which parses
information inside array of `struct btf_type *`. btf_name_by_offset is
used for printing variable names.

Signed-off-by: Okash Khawaja <osk@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

tools: include reallocarray feature test in FEATURE_TESTS_BASIC

perf propagates its feature check results to libbpf. This means
features for which perf probes must be a superset of libbpf's
required features. perf depends on FEATURE_TESTS_BASIC for its list
of features.

commit 531b014e7a2f ("tools: bpf: make use of reallocarray") added
reallocarray use to libbpf, make perf also perform the reallocarray
feature check.

Fixes: 531b014e7a2f ("tools: bpf: make use of reallocarray")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

net: mvpp2: mvpp2_cls_flow_get() can be static

Fixes: f9358e12a0af ("net: mvpp2: split ingress traffic into multiple flows")
Signed-off-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

of: mdio: Support fixed links in of_phy_get_and_connect()

By a simple extension of of_phy_get_and_connect() drivers
that have a fixed link on e.g. RGMII can support also
fixed links, so in addition to:

ethernet-port {
phy-mode = "rgmii";
phy-handle = <&foo>;
};

This setup with a fixed-link node and no phy-handle will
now also work just fine:

ethernet-port {
phy-mode = "rgmii";
fixed-link {
speed = <1000>;
full-duplex;
pause;
};
};

This is very helpful for connecting random ethernet ports
to e.g. DSA switches that typically reside on fixed links.

The phy-mode is still there as the fixes link in this case
is still an RGMII link.

Tested on the Cortina Gemini driver with the Vitesse DSA
router chip on a fixed 1Gbit link.

Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: refactor flower walk to iterate over idr

Extend struct tcf_walker with additional 'cookie' field. It is intended to
be used by classifier walk implementations to continue iteration directly
from particular filter, instead of iterating 'skip' number of times.

Change flower walk implementation to save filter handle in 'cookie'. Each
time flower walk is called, it looks up filter with saved handle directly
with idr, instead of iterating over filter linked list 'skip' number of
times. This change improves complexity of dumping flower classifier from
quadratic to linearithmic. (assuming idr lookup has logarithmic complexity)

Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reported-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

samples/bpf: xdp_redirect_cpu handle parsing of double VLAN tagged packets

People noticed that the code match on IEEE 802.1ad (ETH_P_8021AD) ethertype,
and this implies Q-in-Q or double tagged VLANs.  Thus, we better parse
the next VLAN header too.  It is even marked as a TODO.

This is relevant for real world use-cases, as XDP cpumap redirect can be
used when the NIC RSS hashing is broken.  E.g. the ixgbe driver HW cannot
handle double tagged VLAN packets, and places everything into a single
RX queue.  Using cpumap redirect, users can redistribute traffic across
CPUs to solve this, which is faster than the network stacks RPS solution.

It is left as an exerise how to distribute the packets across CPUs.  It
would be convenient to use the RX hash, but that is not _yet_ exposed
to XDP programs. For now, users can code their own hash, as I've demonstrated
in the Suricata code (where Q-in-Q is handled correctly).

Reported-by: Florian Maury <florian.maury-cv@x-cli.eu>
Reported-by: Marek Majkowski <marek@cloudflare.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

net: ipmr: add support for passing full packet on wrong vif

This patch adds support for IGMPMSG_WRVIFWHOLE which is used to pass
full packet and real vif id when the incoming interface is wrong.
While the RP and FHR are setting up state we need to be sending the
registers encapsulated with all the data inside otherwise we lose it.
The RP then decapsulates it and forwards it to the interested parties.
Currently with WRONGVIF we can only be sending empty register packets
and will lose that data.
This behaviour can be enabled by using MRT_PIM with
val == IGMPMSG_WRVIFWHOLE. This doesn't prevent IGMPMSG_WRONGVIF from
happening, it happens in addition to it, also it is controlled by the same
throttling parameters as WRONGVIF (i.e. 1 packet per 3 seconds currently).
Both messages are generated to keep backwards compatibily and avoid
breaking someone who was enabling MRT_PIM with val == 4, since any
positive val is accepted and treated the same.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'bpf-xdp-driver-and-hw'

Jakub Kicinski says:

====================
This set is adding support for loading driver and offload XDP
at the same time. This enables advanced use cases where some
of the work is offloaded to the NIC and some is done by the host.
Separate netlink attributes are added for each mode of operation.
Driver callbacks for offload are cleaned up a little, including
removal of .prog_attached flag.
====================

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

nfp: add support for simultaneous driver and hw XDP

Split handling of offloaded and driver programs completely. Since
offloaded programs always come with XDP_FLAGS_HW_MODE set in reality
there could be no sharing, anyway, programs would only be installed
in driver or in hardware. Splitting the handling allows us to install
programs in HW and in driver at the same time.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

selftests/bpf: add test for multiple programs

Add tests for having an XDP program attached in the driver and
another one attached in HW simultaneously.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

netdevsim: add support for simultaneous driver and hw XDP

Allow netdevsim to accept driver and offload attachment of XDP
BPF programs at the same time.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

xdp: support simultaneous driver and hw XDP attachment

Split the query of HW-attached program from the software one.
Introduce new .ndo_bpf command to query HW-attached program.
This will allow drivers to install different programs in HW
and SW at the same time. Netlink can now also carry multiple
programs on dump (in which case mode will be set to
XDP_ATTACHED_MULTI and user has to check per-attachment point
attributes, IFLA_XDP_PROG_ID will not be present). We reuse
IFLA_XDP_PROG_ID skb space for second mode, so rtnl_xdp_size()
doesn't need to be updated.

Note that the installation side is still not there, since all
drivers currently reject installing more than one program at
the time.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>