git.proxmox.com Git - mirror_ubuntu-jammy-kernel.git/log
Petr Machata [Mon, 19 Nov 2018 16:11:07 +0000 (16:11 +0000)]
net: skb_scrub_packet(): Scrub offload_fwd_mark

When a packet is trapped and the corresponding SKB marked as
already-forwarded, it retains this marking even after it is forwarded
across veth links into another bridge. There, since it ingresses the
bridge over veth, which doesn't have offload_fwd_mark, it triggers a
warning in nbp_switchdev_frame_mark().

Then nbp_switchdev_allowed_egress() decides not to allow egress from
this bridge through another veth, because the SKB is already marked, and
the mark (of 0) of course matches. Thus the packet is incorrectly
blocked.

Solve by resetting offload_fwd_mark in skb_scrub_packet(). That
function is called from tunnels and also from veth, and thus catches the
cases where traffic is forwarded between bridges and transformed in a
way that invalidates the marking.
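
As a rough illustration of the failure and the fix, the sketch below models
a packet with a hypothetical struct pkt. The struct, its ingress_mark field
and both helper names are assumptions made for this example; they are not
the kernel's definitions and only mimic the behaviour described above.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the SKB state relevant to this commit. */
struct pkt {
	bool offload_fwd_mark;	/* "already forwarded by hardware" flag */
	int ingress_mark;	/* mark recorded at bridge ingress; 0 for veth */
};

/* Simplified model of the egress check: a marked packet is blocked
 * when its recorded mark matches the mark of the egress port. */
static bool egress_allowed(const struct pkt *p, int port_mark)
{
	return !p->offload_fwd_mark || p->ingress_mark != port_mark;
}

/* Model of scrubbing: the forwarding mark is no longer valid once the
 * packet crosses a veth or tunnel boundary, so clear it. */
static void pkt_scrub(struct pkt *p)
{
	p->offload_fwd_mark = false;
}
```

In this model a packet that keeps its stale mark is blocked on a veth port
whose mark is 0, while the same packet is allowed out after pkt_scrub().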

Signed-off-by: Petr Machata <petrm@mellanox.com>
Suggested-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 20 Nov 2018 01:56:09 +0000 (17:56 -0800)]
Merge branch 'octeontx2-af-NPC-MCAM-support-and-FLR-handling'

Sunil Goutham says:

====================
octeontx2-af: NPC MCAM support and FLR handling

This patchset is a continuation of the three earlier patch series
that add a new driver for the Resource Virtualization Unit (RVU)
admin function of Marvell's OcteonTX2 SoC.

1. octeontx2-af: Add RVU Admin Function driver
   https://www.spinics.net/lists/netdev/msg528272.html
2. octeontx2-af: NPA and NIX blocks initialization
   https://www.spinics.net/lists/netdev/msg529163.html
3. octeontx2-af: NPC parser and NIX blocks initialization
   https://www.spinics.net/lists/netdev/msg530252.html

This patch series adds support for the following.
RVU generic:
- Function Level Reset irq handler
  When FLR is triggered for PFs, AF receives interrupt.
  This patchset adds logic for cleaning up of NPA, NIX
  and NPC block resources being used by PF.

- Mailbox communication between AF and its VFs.
  Unlike the VFs of PF1-PFn, AF's own VFs (AF being PF0) can
  communicate with AF directly. Added support for the same.

- AF's VFs IO configuration
  These VFs are mapped to use internal HW loopback channels
  instead of CGX LMACs. Each pair of VFs works as the two ends
  of a hardwired interface: VF0's TX is VF1's RX and vice versa.

NPC block:
- MCAM entry management
  Alloc/Free of contiguous/non-contiguous and lower/higher
  priority MCAM entry allocation and programming support.
- MCAM counters management and map/unmap with MCAM entries
- Default KEY extract profile
- HW errata workarounds

NIX block:
- Minimum and maximum allowed packet length config
- HW errata workarounds

A few more changes, such as switching from spinlock to mutex, are
also made in this patchset.

Changes from v2:
 1 Fixed commit message of patch 'Relax resource lock into mutex'
   to make it less ambiguous.
   - Suggested by David Miller.

Changes from v1:
 1 Converted all mailbox message handler API names from mixed
   case to lowercase.
   - Suggested by David Miller.
 2 Fixed endian issues in patch 'Add support for stripping STAG/CTAG'
   - Suggested by Arnd Bergmann
 3 Elaborated commit message of patch 'Add FLR interrupt handler'
   to make it easier to understand.
   - Suggested by Arnd Bergmann

 Will fix the padding and alignment in mailbox message structure
 in a follow-up patch.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Mon, 19 Nov 2018 10:47:43 +0000 (16:17 +0530)]
octeontx2-af: Workarounds for HW errata

Errata 35038
  Software sets NIX_AF_RX_SW_SYNC[ENA] to sync (flush) in-flight packets
  in the RX data path before configuration changes (e.g. disabling one or
  more RQs). Hardware clears [ENA] to indicate sync is done.

  An issue exists whereby NIX may clear NIX_AF_RX_SW_SYNC [ENA] too
  early.

Errata 35057
  NIX may corrupt internal state when conditional clocks turn off.
  So turn on all clocks by default.

Errata 35786
 Parse nibble enable NPC configuration for KEY generation has to be
 identical for both Rx and Tx interfaces.

Also corrected endianness configuration for NIX, i.e. NIX_AF_CFG[AF_BE]
is bit 8 and not bit 1.

Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: Jerin Jacob <jerinj@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linu Cherian [Mon, 19 Nov 2018 10:47:42 +0000 (16:17 +0530)]
octeontx2-af: Add interrupt handlers for Master Enable event

- Add interrupt handlers for Master Enable events from PFs
  and Master Enable events from VFs of AF
- Master Enable is required for the MSIX delivery to work
- Master Enable bit trap handler doesn't have to do anything
  other than clearing the TRPEND bit, since the enable/disable
  requirements are already taken care of by mbox requests and the
  FLR handler.

Signed-off-by: Linu Cherian <lcherian@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Mon, 19 Nov 2018 10:47:41 +0000 (16:17 +0530)]
octeontx2-af: Add FLR handling support for AF's VFs

Added support to handle FLR for AF's VFs (i.e. LBK VFs).
This only adds the FLR interrupt enable/disable, handler registration
etc.; the actual HW resource cleanup and LF teardown logic is
already there.

Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tomasz Duszynski [Mon, 19 Nov 2018 10:47:40 +0000 (16:17 +0530)]
octeontx2-af: Configure AF VFs to talk over LBK channels

Configure AF VFs such that they are able to talk over consecutive
loopback channels.

If 8 VFs are attached to AF then communication will work as below:

TX      RX
lbk0 -> lbk1
lbk1 -> lbk0

lbk2 -> lbk3
lbk3 -> lbk2

lbk4 -> lbk5
lbk5 -> lbk4

lbk6 -> lbk7
lbk7 -> lbk6
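
The pairing in the table above amounts to flipping the lowest bit of the
channel index. A minimal sketch, assuming channels are numbered from 0 and
always pair up as (0,1), (2,3) and so on; the helper name is hypothetical
and does not come from the driver:

```c
/* Hypothetical helper: index of the LBK channel that receives what
 * channel tx_chan transmits, assuming pairs (0,1), (2,3), ... */
static unsigned int lbk_rx_peer(unsigned int tx_chan)
{
	return tx_chan ^ 1u;	/* even -> next odd, odd -> previous even */
}
```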

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tomasz Duszynski [Mon, 19 Nov 2018 10:47:39 +0000 (16:17 +0530)]
octeontx2-af: Enable sriov on AF to create VFs

Enable all AF VFs during probe. Since AF's VFs work in pairs
(e.g. packets sent on VF0 are received by VF1 and vice versa),
enable only an even number of VFs out of totalVFs, which in turn
should be less than the number of loopback (LBK) channels.
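
One way to picture the VF count selection is the sketch below. It is only
an illustration: the function name is hypothetical, and capping the count
at the number of LBK channels is an assumption about how the limit is
applied, not something taken from the driver.

```c
/* Hypothetical helper: number of AF VFs to enable.
 * Assumption: the count is capped at the number of LBK channels and
 * then rounded down to an even number so that every VF has a peer. */
static unsigned int af_vfs_to_enable(unsigned int total_vfs,
				     unsigned int lbk_chans)
{
	unsigned int vfs = total_vfs;

	if (vfs > lbk_chans)
		vfs = lbk_chans;
	return vfs & ~1u;	/* drop an unpaired odd VF */
}
```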

Also enable VF's mailbox interrupts.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tomasz Duszynski [Mon, 19 Nov 2018 10:47:38 +0000 (16:17 +0530)]
octeontx2-af: Mbox communication support btw AF and it's VFs

VFs attached to PFs other than AF cannot communicate with AF
directly. Instead they are supposed to first send the message to
the PF they reside on, and the PF forwards it to the AF.
Responses to messages are handled in the reverse order.

On the other hand if VFs are on AF (PF0) itself then direct mailbox
communication is possible since there's no other PF in the way.

This patch addresses this particular case and adds support for
handling it.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
Signed-off-by: Marko Kallio <mkallio@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Geetha sowjanya [Mon, 19 Nov 2018 10:47:37 +0000 (16:17 +0530)]
octeontx2-af: Teardown NPA, NIX LF upon receiving FLR

Upon receiving FLR IRQ for a RVU PF, tear down or clean up
resources held by that PF_FUNC. This patch cleans up:
NIX LF
 - Stop ingress/egress traffic
 - Disable NPC MCAM entries being used.
 - Free Tx scheduler queues
 - Disable RQ/SQ/CQ HW contexts
NPA LF
 - Disable Pool/Aura HW contexts
In the future, teardown of SSO/SSOW/TIM/CPT will be added.

Also added a mailbox message for a RVU PF to request
that AF perform FLR for a RVU VF under it.

Signed-off-by: Geetha sowjanya <gakula@marvell.com>
Signed-off-by: Stanislaw Kardach <skardach@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Geetha sowjanya [Mon, 19 Nov 2018 10:47:36 +0000 (16:17 +0530)]
octeontx2-af: Add FLR interrupt handler

RVU admin function (AF) has all the privileges to clean up
HW state when VFIO triggers a PCIe function level reset (FLR)
due to either reset or a VM crash. FLR for RVU PF1-PFn will
trigger an IRQ to AF.

This patch enables all RVU PF's FLR interrupts and registers a
handler. Upon receiving an interrupt, a workqueue is scheduled
to clean up all hardware blocks being used by the PF which
received the FLR.

Signed-off-by: Geetha sowjanya <gakula@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Mon, 19 Nov 2018 10:47:35 +0000 (16:17 +0530)]
octeontx2-af: Verify NPA/SSO/NIX PF_FUNC mapping

While mapping a NIX LF to a NPA LF attached PF_FUNC or
SSO LF attached PF_FUNC, verify if PF_FUNC is valid and
if that PF_FUNC has a LF of that block attached to it or not.

Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tomasz Duszynski [Mon, 19 Nov 2018 10:47:34 +0000 (16:17 +0530)]
octeontx2-af: Add support for stripping STAG/CTAG

This works by shadowing existing UCAST MCAM entry
with a new one additionally matching either NPC_LT_LB_CTAG
or NPC_LT_LB_STAG. For this to fully work one needs to
send a properly configured NIX_VTAG_CFG message afterwards, i.e. with
strip and capture enabled and type set to 0.

On receiving a tagged packet, NIX will remove the outer VLAN and capture
the TCI in NIX_RX_PARSE_S.

Also simplified the RX Vtag configuration flow. With this, setting
STRIP/CAPTURE VTAG actions separately becomes possible. The following
combinations are possible: STRIP, STRIP and CAPTURE, CAPTURE, or
nothing (0 disables the respective actions).

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Mon, 19 Nov 2018 10:47:33 +0000 (16:17 +0530)]
octeontx2-af: Support to enable/disable default MCAM entries

A PF/VF with a NIXLF attached has default/reserved MCAM entries
for receiving Ucast/Bcast/Promisc traffic. Ideally traffic should be
forwarded to NIXLF only after its contexts are initialized. This
patch keeps these default entries disabled and adds mbox messages
for a PF/VF to enable these once NPA/NIXLF initialization is done.
Likewise while PF/VF is being torn down, it can send the disable
mailbox message to stop receiving traffic.

Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Santosh Shukla [Mon, 19 Nov 2018 10:47:32 +0000 (16:17 +0530)]
octeontx2-af: Add MKEX default profile

Added basic default MKEX profile. This profile tells
hardware what data to extract from packet and where to
place it (bit offset) in final KEY generated for the
parsed packet. Based on the bit placement of the packet
data, MCAM entries have to be programmed for matching.

Also added a msg that lets a PF/VF retrieve this MKEX profile,
which it can in turn process to determine how an MCAM entry
has to be populated.

Signed-off-by: Santosh Shukla <sshukla@marvell.com>
Signed-off-by: Yuri Tolstov <ytolstov@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Mon, 19 Nov 2018 10:47:31 +0000 (16:17 +0530)]
octeontx2-af: Alloc and config NPC MCAM entry at a time

A new mailbox message is added to support allocating a MCAM entry
along with a counter and configuring it in one go. This reduces
the amount of mailbox communication involved in installing a new
MCAM rule.

Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Mon, 19 Nov 2018 10:47:30 +0000 (16:17 +0530)]
octeontx2-af: Map or unmap NPC MCAM entry and counter

Alloc memory to save MCAM 'entry to counter' mapping and since
multiple entries can map to same counter, added counter's reference
count tracking.

Do 'entry to counter' mapping when an entry is being installed
and the mbox msg sender requested to configure a counter as well.
Mapping is removed when an entry or counter is being freed or
an explicit mbox msg is received to unmap them.
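
The bookkeeping described above can be sketched as a per-entry map plus a
per-counter reference count. All names, sizes and the NO_COUNTER sentinel
below are assumptions chosen for illustration; the driver's own data
structures may differ.

```c
#include <stdint.h>

#define NUM_ENTRIES	8	/* hypothetical table sizes */
#define NUM_COUNTERS	4
#define NO_COUNTER	0xFFFF	/* hypothetical "unmapped" sentinel */

struct mcam_map {
	uint16_t entry2cntr[NUM_ENTRIES];	/* counter used by each entry */
	uint16_t cntr_refcnt[NUM_COUNTERS];	/* entries mapped per counter */
};

static void map_init(struct mcam_map *m)
{
	for (unsigned int i = 0; i < NUM_ENTRIES; i++)
		m->entry2cntr[i] = NO_COUNTER;
	for (unsigned int i = 0; i < NUM_COUNTERS; i++)
		m->cntr_refcnt[i] = 0;
}

/* Drop the mapping of an entry, if any, and release its reference. */
static void unmap_entry(struct mcam_map *m, unsigned int entry)
{
	uint16_t cntr = m->entry2cntr[entry];

	if (cntr == NO_COUNTER)
		return;
	m->cntr_refcnt[cntr]--;
	m->entry2cntr[entry] = NO_COUNTER;
}

/* Map an entry to a counter; several entries may share one counter. */
static void map_entry(struct mcam_map *m, unsigned int entry, uint16_t cntr)
{
	unmap_entry(m, entry);
	m->entry2cntr[entry] = cntr;
	m->cntr_refcnt[cntr]++;
}
```

With a reference count per counter, a counter can be treated as unused only
once the last entry mapped to it has been unmapped.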

Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Mon, 19 Nov 2018 10:47:29 +0000 (16:17 +0530)]
octeontx2-af: Support for NPC MCAM counters

NPC HW has counters which can be mapped to MCAM
entries to gather entry match statistics. This
patch adds support to allocate, free, clear and retrieve
stats of NPC MCAM counters. New mailbox messages have
been added for this. Similar to MCAM entries both
contiguous and non-contiguous counter allocation is
supported.

Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Mon, 19 Nov 2018 10:47:28 +0000 (16:17 +0530)]
octeontx2-af: MCAM entry installation support

Add support for a RVU PF/VF to enable, disable, configure
and shuffle MCAM entries via mbox commands. This patch adds
mailbox message formats and handling of these commands.

As of now, other than validating the MCAM entry index, info like
channel number etc. in MCAM config data sent by PF/VF is
not validated.

Also a max of 64 MCAM entries can be shuffled with a single
mbox command.

Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Mon, 19 Nov 2018 10:47:27 +0000 (16:17 +0530)]
octeontx2-af: NPC MCAM entry alloc/free support

This patch adds NPC MCAM entry management and support for
allocating and freeing them via mailbox. Both contiguous and
non-contiguous allocations are supported. In case of contiguous,
if the request cannot be met then the max contiguous number of available
entries is allocated.
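
The contiguous fallback can be pictured with a simple scan over a "used"
array. This is a sketch under assumptions: the function name, the boolean
array representation and the first-fit search order are all illustrative
and are not taken from the driver.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical allocator: claim a free run of `want` entries if one
 * exists, otherwise claim the longest free run available.
 * Returns the number of entries claimed and stores the first index
 * of the run in *start. */
static size_t alloc_contig(bool *used, size_t n, size_t want, size_t *start)
{
	size_t best_len = 0, best_start = 0, run = 0;

	for (size_t i = 0; i < n; i++) {
		run = used[i] ? 0 : run + 1;
		if (run > best_len) {
			best_len = run;
			best_start = i + 1 - run;
		}
		if (best_len >= want)
			break;		/* request fully satisfied */
	}
	if (best_len > want)
		best_len = want;
	for (size_t i = 0; i < best_len; i++)
		used[best_start + i] = true;
	*start = best_start;
	return best_len;
}
```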

High or low priority index allocation w.r.t a reference MCAM index
is also supported.

Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stanislaw Kardach [Mon, 19 Nov 2018 10:47:26 +0000 (16:17 +0530)]
octeontx2-af: Relax resource lock into mutex

Mailbox message handling is done in a workqueue context scheduled
from interrupt handler. So resource locks do not need to be spinlocks.
Therefore relax them into a mutex so that later on we may use them
in routines that might sleep.

Signed-off-by: Stanislaw Kardach <skardach@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kiran Kumar [Mon, 19 Nov 2018 10:47:25 +0000 (16:17 +0530)]
octeontx2-af: Support to get NIX HW constants from AF

This patch adds support for reading HW limits, like the number of
Rx/Tx stats and the number of queue IRQs supported per NIX LF, from
AF registers and syncing them to PF/VF.

Signed-off-by: Kiran Kumar <kirankumark@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Mon, 19 Nov 2018 10:47:24 +0000 (16:17 +0530)]
octeontx2-af: Support to modify min/max allowed packet lengths

This patch adds support for RVU PF/VFs to modify min/max
packet lengths allowed by HW. For VFs on PF0, settings will
be automatically applied on LBK link. RX link's min/maxlen
is configured to min/max of PF and all its VFs. On the TX side,
if requested, all SMQs attached to the requesting NIXLF will be
updated with new min/max lengths.
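
For the RX link, "min/max of PF and all its VFs" can be read as taking the
smallest minimum and the largest maximum over all functions sharing the
link. A sketch under that reading; the struct and function names are
hypothetical:

```c
#include <stddef.h>

/* Hypothetical per-function packet length limits. */
struct pkt_len_cfg {
	unsigned int minlen;
	unsigned int maxlen;
};

/* Combine the limits of a PF and its VFs (n >= 1 entries) into one
 * link-wide setting: smallest minlen and largest maxlen. */
static struct pkt_len_cfg link_len_cfg(const struct pkt_len_cfg *funcs,
				       size_t n)
{
	struct pkt_len_cfg link = funcs[0];

	for (size_t i = 1; i < n; i++) {
		if (funcs[i].minlen < link.minlen)
			link.minlen = funcs[i].minlen;
		if (funcs[i].maxlen > link.maxlen)
			link.maxlen = funcs[i].maxlen;
	}
	return link;
}
```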

Also updates transmit credits for Tx links based on new maxlen.

Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Mon, 19 Nov 2018 10:47:23 +0000 (16:17 +0530)]
octeontx2-af: Convert mbox handlers APIs to lowercase

This patch converts all mailbox message handler API
names to lowercase.

Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 20 Nov 2018 01:32:15 +0000 (17:32 -0800)]
Merge branch 'r8169-series-with-further-smaller-improvements'

Heiner Kallweit says:

====================
r8169: series with further smaller improvements

Again nothing exciting, just smaller improvements.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Mon, 19 Nov 2018 21:41:35 +0000 (22:41 +0100)]
r8169: improve chip version identification

Only the upper 12 bits are used for chip identification; this helps
to reduce the size of array mac_info.
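
Extracting the upper 12 bits of a 32-bit register value is a single shift.
The sketch below is illustrative only: the function name is hypothetical
and the example values in the usage note are arbitrary, not real chip IDs.

```c
#include <stdint.h>

/* Hypothetical helper: keep only the upper 12 bits of a 32-bit
 * register value, so a lookup table can store 16-bit keys. */
static uint16_t chip_id_bits(uint32_t reg)
{
	return (uint16_t)(reg >> 20);
}
```

For example, an arbitrary register value of 0xABC12345 yields 0xABC, so a
table entry only needs to hold the 12-bit value rather than a full 32-bit
mask and value pair.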

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Mon, 19 Nov 2018 21:40:04 +0000 (22:40 +0100)]
r8169: simplify ocp functions

rtl8168_oob_notify is used in rtl8168dp_driver_start and
rtl8168dp_driver_stop only, so we can rename it to r8168dp_oob_notify.
The same applies to condition rtl_ocp_read_cond which can be renamed
to rtl_dp_ocp_read_cond. This allows the code to be simplified.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Mon, 19 Nov 2018 21:39:14 +0000 (22:39 +0100)]
r8169: remove workaround for ancient gcc bug

The kernel can't be built any longer with this ancient GCC version.
Eventually it becomes clear what this statement actually does.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Mon, 19 Nov 2018 21:38:22 +0000 (22:38 +0100)]
r8169: remove manual padding in struct ring_info

The compiler takes care of alignment and padding, I see no need to
bother him with manual hints.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Mon, 19 Nov 2018 21:37:34 +0000 (22:37 +0100)]
r8169: remove "not PCI Express" message

The ones who want to know can easily identify whether chip is PCI or
PCIe based on the chip name. I doubt there's any benefit in this
message, so remove it.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Mon, 19 Nov 2018 21:36:15 +0000 (22:36 +0100)]
r8169: remove print_mac_version

The syslog message printed on driver load makes it easy to identify
the mac version number (based on chip name and XID). So we don't
need this extra debug message which is wrong anyway because e.g.
RTL_GIGA_MAC_VER_01 has value 0.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Mon, 19 Nov 2018 21:35:08 +0000 (22:35 +0100)]
r8169: use PCI_VDEVICE macro

Using macro PCI_VDEVICE helps to simplify the PCI ID table.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Mon, 19 Nov 2018 21:34:17 +0000 (22:34 +0100)]
r8169: replace event_slow with irq_mask

Recently the "slow event" handler was removed, therefore the member
name isn't appropriate any longer. In addition store the full mask,
including the RTL_EVENT_NAPI interrupt source bits.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Mon, 19 Nov 2018 21:33:00 +0000 (22:33 +0100)]
r8169: remove unused interrupt sources

Setting PCSTimeout interrupt source was copied from the vendor driver
which uses the chip programmable timer interrupt. The mainline driver
doesn't use this timer interrupt.

SYSErr indicates a PCI error and isn't defined on the PCIe models.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Mon, 19 Nov 2018 21:32:18 +0000 (22:32 +0100)]
r8169: use dev_get_drvdata where possible

Using dev_get_drvdata directly is simpler here.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Mon, 19 Nov 2018 21:31:32 +0000 (22:31 +0100)]
r8169: merge rtl_irq_enable and rtl_irq_enable_all

After the recent changes to the interrupt handler rtl_irq_enable and
rtl_irq_enable_all can be merged.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 19 Nov 2018 20:25:43 +0000 (12:25 -0800)]
Merge branch 'sctp-add-subscribe-per-asoc-and-sockopt-SCTP_EVENT'

Xin Long says:

====================
sctp: add subscribe per asoc and sockopt SCTP_EVENT

This patchset mainly adds the Event Subscription sockopt described in
rfc6525#section-6.2:

"Subscribing to events as described in [RFC6458] uses a setsockopt()
call with the SCTP_EVENT socket option.  This option takes the
following structure, which specifies the association, the event type
(using the same value found in the event type field), and an on/off
boolean.

  struct sctp_event {
    sctp_assoc_t se_assoc_id;
    uint16_t     se_type;
    uint8_t      se_on;
  };

The user fills in the se_type field with the same value found in the
strreset_type field, i.e., SCTP_STREAM_RESET_EVENT.  The user will
also fill in the se_assoc_id field with either the association to set
this event on (this field is ignored for one-to-one style sockets) or
one of the reserved constant values defined in [RFC6458].  Finally,
the se_on field is set with a 1 to enable the event or a 0 to disable
the event."
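
A minimal sketch of how an application might fill in and pass such a
structure follows. It is deliberately header-independent: the struct is a
local mirror of the one quoted above (real code would use the definition
from the platform's SCTP header), sctp_assoc_t is assumed to be a 32-bit
integer, and the socket level and option name are passed in by the caller
rather than hard-coded. The function name is hypothetical.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Local mirror of the quoted structure, for illustration only. */
struct example_sctp_event {
	int32_t  se_assoc_id;	/* assumption: sctp_assoc_t is 32-bit */
	uint16_t se_type;
	uint8_t  se_on;
};

/* Enable (on != 0) or disable (on == 0) one event type on one
 * association. `level` and `optname` stand for the SCTP protocol
 * level and the SCTP_EVENT option of the platform's headers. */
static int example_subscribe(int fd, int level, int optname,
			     int32_t assoc_id, uint16_t type, int on)
{
	struct example_sctp_event ev;

	memset(&ev, 0, sizeof(ev));
	ev.se_assoc_id = assoc_id;
	ev.se_type = type;
	ev.se_on = on ? 1 : 0;
	return setsockopt(fd, level, optname, &ev, sizeof(ev));
}
```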

As for the old SCTP_EVENTS Option with struct sctp_event_subscribe,
it's being DEPRECATED.

v1->v2:
  - fix some keywords in changelog that triggered the filters at
    vger.kernel.org.
v2->v3:
  - fix an array out of bounds noticed by Neil in patch 1/4.
====================

Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Sun, 18 Nov 2018 08:08:54 +0000 (16:08 +0800)]
sctp: add sockopt SCTP_EVENT

This patch adds sockopt SCTP_EVENT described in rfc6525#section-6.2.
With this sockopt users can subscribe to an event from a specified
asoc.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Sun, 18 Nov 2018 08:08:53 +0000 (16:08 +0800)]
sctp: rename enum sctp_event to sctp_event_type

sctp_event is a structure name defined in RFC for sockopt
SCTP_EVENT. To avoid the conflict, rename it.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Sun, 18 Nov 2018 08:08:52 +0000 (16:08 +0800)]
sctp: add subscribe per asoc

The member subscribe should be per asoc, so that sockopt SCTP_EVENT
in the next patch can subscribe to an event from one asoc only.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Sun, 18 Nov 2018 08:08:51 +0000 (16:08 +0800)]
sctp: define subscribe in sctp_sock as __u16

The member subscribe in sctp_sock is used to indicate to which of
the events it is subscribed, more like a group of flags. So it's
better to be defined as __u16 (2 bytes), instead of struct
sctp_event_subscribe (13 bytes).

Note that sctp_event_subscribe is a UAPI struct, used on sockopt
calls, and thus it will not be removed. This patch only changes
the internal storage of the flags.
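
Storing one on/off flag per event in a 16-bit mask could look like the
sketch below. The helper names and the idea of addressing events by a small
index (0 to 15) are assumptions for illustration; the patch's own helpers
may be shaped differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers: one bit per event, indexed 0..15. Thirteen
 * one-byte on/off fields fit comfortably into a 16-bit mask. */
static void subscribe_set(uint16_t *mask, unsigned int ev, bool on)
{
	if (on)
		*mask |= (uint16_t)(1u << ev);
	else
		*mask &= (uint16_t)~(1u << ev);
}

static bool subscribe_test(uint16_t mask, unsigned int ev)
{
	return (mask & (1u << ev)) != 0;
}
```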

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 19 Nov 2018 18:55:00 +0000 (10:55 -0800)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Linus Torvalds [Mon, 19 Nov 2018 17:24:04 +0000 (09:24 -0800)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Pull networking fixes from David Miller:

 1) Fix some potentially uninitialized variables and use-after-free in
    kvaser_usb CAN driver, from Jimmy Assarsson.

 2) Fix leaks in qed driver, from Denis Bolotin.

 3) Socket leak in l2tp, from Xin Long.

 4) RSS context allocation fix in bnxt_en from Michael Chan.

 5) Fix cxgb4 build errors, from Ganesh Goudar.

 6) Route leaks in ipv6 when removing exceptions, from Xin Long.

 7) Memory leak in IDR allocation handling of act_pedit, from Davide
    Caratti.

 8) Use-after-free of bridge vlan stats, from Nikolay Aleksandrov.

 9) When MTU is locked, do not force DF bit on ipv4 tunnels. From
    Sabrina Dubroca.

10) When NAPI cached skb is reused, we must set it to the proper initial
    state which includes skb->pkt_type. From Eric Dumazet.

11) Lockdep and non-linear SKB handling fix in tipc from Jon Maloy.

12) Set RX queue properly in various tuntap receive paths, from Matthew
    Cover.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (61 commits)
  tuntap: fix multiqueue rx
  ipv6: Fix PMTU updates for UDP/raw sockets in presence of VRF
  tipc: don't assume linear buffer when reading ancillary data
  tipc: fix lockdep warning when reinitilaizing sockets
  net-gro: reset skb->pkt_type in napi_reuse_skb()
  tc-testing: tdc.py: Guard against lack of returncode in executed command
  tc-testing: tdc.py: ignore errors when decoding stdout/stderr
  ip_tunnel: don't force DF when MTU is locked
  MAINTAINERS: Add entry for CAKE qdisc
  net: bridge: fix vlan stats use-after-free on destruction
  socket: do a generic_file_splice_read when proto_ops has no splice_read
  net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs
  Revert "net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs"
  net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs
  net/sched: act_pedit: fix memory leak when IDR allocation fails
  net: lantiq: Fix returned value in case of error in 'xrx200_probe()'
  ipv6: fix a dst leak when removing its exception
  net: mvneta: Don't advertise 2.5G modes
  drivers/net/ethernet/qlogic/qed/qed_rdma.h: fix typo
  net/mlx4: Fix UBSAN warning of signed integer overflow
  ...

Matthew Cover [Sun, 18 Nov 2018 07:46:00 +0000 (00:46 -0700)]
tuntap: fix multiqueue rx

When writing packets to a descriptor associated with a combined queue, the
packets should end up on that queue.

Before this change all packets written to any descriptor associated with a
tap interface end up on rx-0, even when the descriptor is associated with a
different queue.

The rx traffic can be generated by either of the following.
  1. a simple tap program which spins up multiple queues and writes packets
     to each of the file descriptors
  2. tx from a qemu vm with a tap multiqueue netdev

The queue for rx traffic can be observed by either of the following (done
on the hypervisor in the qemu case).
  1. a simple netmap program which opens and reads from per-queue
     descriptors
  2. configuring RPS and doing per-cpu captures with rxtxcpu

Alternatively, if you printk() the return value of skb_get_rx_queue() just
before each instance of netif_receive_skb() in tun.c, you will get 65535
for every skb.

Calling skb_record_rx_queue() to set the rx queue to the queue_index fixes
the association between descriptor and rx queue.
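
The 65535 value mentioned above is what one would see if the queue index is
stored as "index plus one" in a 16-bit field, with 0 meaning that nothing
was recorded. The sketch below models that encoding; the struct and helper
names are hypothetical stand-ins, not the kernel's definitions.

```c
#include <stdint.h>

/* Hypothetical stand-in for the one SKB field of interest. */
struct pkt_q {
	uint16_t queue_mapping;	/* 0 means "no rx queue recorded" */
};

/* Record the rx queue as index + 1 so that 0 stays "unset". */
static void record_rx_queue(struct pkt_q *p, uint16_t rx_queue)
{
	p->queue_mapping = (uint16_t)(rx_queue + 1);
}

/* Read the queue back; an unset field wraps around to 65535. */
static uint16_t get_rx_queue(const struct pkt_q *p)
{
	return (uint16_t)(p->queue_mapping - 1);
}
```

In this model a packet for which record_rx_queue() was never called reads
back as 65535, and a packet recorded on queue 3 reads back as 3.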

Signed-off-by: Matthew Cover <matthew.cover@stackpath.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Sun, 18 Nov 2018 18:45:30 +0000 (10:45 -0800)]
ipv6: Fix PMTU updates for UDP/raw sockets in presence of VRF

Preethi reported that PMTU discovery for UDP/raw applications is not
working in the presence of VRF when the socket is not bound to a device.
The problem is that ip6_sk_update_pmtu does not consider the L3 domain
of the skb device if the socket is not bound. Update the function to
set oif to the L3 master device if relevant.

Fixes: ca254490c8df ("net: Add VRF support to IPv6 stack")
Reported-by: Preethi Ramachandra <preethir@juniper.net>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shalom Toledo [Sun, 18 Nov 2018 16:43:03 +0000 (16:43 +0000)]
mlxsw: spectrum: Expose discard counters via ethtool

Expose packet discard counters via ethtool to help with debugging.

Signed-off-by: Shalom Toledo <shalomt@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sun, 18 Nov 2018 15:37:33 +0000 (07:37 -0800)]
tun: use netdev_alloc_frag() in tun_napi_alloc_frags()

In order to cook skbs in the same way as Ethernet drivers,
it is probably better to not use GFP_KERNEL, but rather
use the GFP_ATOMIC and PFMEMALLOC mechanisms provided by
netdev_alloc_frag().

This would allow the tun driver to be used even in memory stress
situations, especially if swap is used over this tun channel.

Fixes: 90e33d459407 ("tun: enable napi_gro_frags() for TUN/TAP driver")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Petar Penkov <peterpenkov96@gmail.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 19 Nov 2018 00:16:20 +0000 (16:16 -0800)]
Merge branch 'IP101GR-devicetree-based-configuration-of-SEL_INTR32'

Martin Blumenstingl says:

====================
IP101GR: devicetree based configuration of SEL_INTR32

The IP101GR is a 32-pin QFN package variant of the IP101G/IP101GA
Ethernet PHY. Due to its limited number of pins the RXER (receive
error) and INTR32 (interrupt) functions share pin 21.

The goal of this series is:
- some small cleanups in patches 3, 4 and 5
- allowing the kernel to detect IRQ floods on boards where the IP101GR
  is configured in RXER mode but the RXER line is configured on the
  host SoC as interrupt line (patch 6)
- configuration of the SEL_INTR32 register so we can use the interrupt
  function on boards where the RXER/INTR32 pin (pin 21) is routed to
  one of the host SoC's interrupt inputs (patches 1, 2, 7)

A use-case where this is needed is the Endless Mini (EC-100). I have
tested my changes on that board. This also confirms that Heiner
Kallweit's recent icplus.c PHY driver changes are working (at least on
my setup).

This series is based on net-next commit 7c460cf9cd1a ("net: aquantia:
fix spelling mistake "specfield" -> "specified"")

Changes since v1 at [0]:
- collected Andrew's Reviewed-by's (thank you!)
- updated description of patch #2 to explain why two properties were
  added instead of adding a "this is an IP101GR" property
- validate that there's no conflicting configuration in patch #7
- rebased on top of latest net-next

[0] https://patchwork.ozlabs.org/cover/999371/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: phy: icplus: allow configuring the interrupt function on IP101GR
Martin Blumenstingl [Sun, 18 Nov 2018 21:23:59 +0000 (22:23 +0100)]
net: phy: icplus: allow configuring the interrupt function on IP101GR

The IP101GR is a 32-pin QFN package variant of the IP101G/IP101GA
Ethernet PHY. Due to its limited number of pins the RXER (receive
error) and INTR32 (interrupt) functions share pin 21.
By default the PHY is configured to output the "receive error" status on
pin 21. Depending on the board layout and requirements we may want to
re-configure the PHY to output the interrupt signal there.

The mode of pin 21 can be configured in the "Digital I/O Specific
Control Register" (register 29), bit 2:
- 0 = RXER function
- 1 = INTR(32) function

Depending on the devicetree configuration we will now:
- change the mode to either the RXER or INTR32 function
- keep the SEL_INTR32 value set by the bootloader (default) if no
  configuration is provided (to ensure that we're not breaking existing
  boards)
- error out if conflicting configuration is given (RXER and INTR32 mode
  are enabled at the same time)
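
The three cases above can be sketched as a small user-space C model. This is
only an illustration, not the driver code: the function name, the
SEL_INTR32_BIT macro name and the two boolean flags are hypothetical stand-ins
for whatever the devicetree properties are parsed into. Only the bit position
(register 29, bit 2) is taken from the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Register 29, bit 2 selects the function of pin 21 (name is hypothetical). */
#define SEL_INTR32_BIT (1u << 2)

/*
 * Compute the new register 29 value from two hypothetical configuration
 * flags. Returns 0 and stores the result in *out, or -EINVAL when both
 * modes are requested at the same time.
 */
static int ip101gr_pin21_mode(uint16_t cur, bool want_rxer, bool want_intr32,
			      uint16_t *out)
{
	assert(out != NULL);

	if (want_rxer && want_intr32)
		return -EINVAL;		/* conflicting configuration */

	if (want_intr32)
		*out = (uint16_t)(cur | SEL_INTR32_BIT);	/* 1 = INTR(32) */
	else if (want_rxer)
		*out = (uint16_t)(cur & ~SEL_INTR32_BIT);	/* 0 = RXER */
	else
		*out = cur;		/* keep the bootloader's setting */

	return 0;
}
```

Leaving the register untouched when neither flag is set mirrors the stated
goal of not breaking existing boards.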

Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: phy: icplus: implement .did_interrupt for IP101A/G
Martin Blumenstingl [Sun, 18 Nov 2018 21:23:58 +0000 (22:23 +0100)]
net: phy: icplus: implement .did_interrupt for IP101A/G

The IP101A_G_IRQ_CONF_STATUS register has bits to detect which
interrupts have fired. Implement the .did_interrupt callback to let the
PHY core know whether the interrupt was for this specific PHY.

This is useful for debugging interrupt problems with 32-pin IP101GR PHYs
where the interrupt line is shared with the RX_ERR (receive error
status) signal. The default values are:
- RX_ERR is enabled by default (LOW means that there is no receive
  error)
- the PHY's interrupt line is configured "active low" by default

Without any additional changes there is a flood of interrupts if the
RX_ERR/INTR32 signal is configured in RX_ERR mode (which is the
default). Having a did_interrupt ensures that the PHY core returns
IRQ_NONE instead of endlessly triggering the PHY state machine.
Additionally the kernel will report this after a while:
  irq 28: nobody cared (try booting with the "irqpoll" option)

Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: phy: icplus: rename IP101A_G_NO_IRQ to IP101A_G_IRQ_ALL_MASK
Martin Blumenstingl [Sun, 18 Nov 2018 21:23:57 +0000 (22:23 +0100)]
net: phy: icplus: rename IP101A_G_NO_IRQ to IP101A_G_IRQ_ALL_MASK

The datasheet uses the name "All Mask" for this bit. Change the name of
our #define to be consistent with the datasheet. While here also replace
the tab between the #define and IP101A_G_IRQ_ALL_MASK with a space.
No functional changes.

Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: phy: icplus: use the BIT macro where possible
Martin Blumenstingl [Sun, 18 Nov 2018 21:23:56 +0000 (22:23 +0100)]
net: phy: icplus: use the BIT macro where possible

This makes the code consistent by using the BIT() macro instead of
manual bit-shifting for some of the fields. No functional changes.
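
As a rough user-space illustration of why this is a pure cleanup, the sketch
below redefines a BIT() helper of the same shape as the kernel's and compares
it with a manual shift. The EXAMPLE_FLAG_* names and reg_bit_is_set() are
hypothetical and do not correspond to fields in icplus.c.

```c
#include <assert.h>
#include <stdint.h>

/* Same shape as the kernel helper, redefined here for user space. */
#define BIT(nr) (1UL << (nr))

/* Hypothetical field, written both ways for comparison. */
#define EXAMPLE_FLAG_SHIFTED (1 << 11)
#define EXAMPLE_FLAG_BIT     BIT(11)

/* Return nonzero when bit number nr is set in a 16-bit register value. */
static int reg_bit_is_set(uint16_t reg, unsigned int nr)
{
	assert(nr < 16);
	return (reg & BIT(nr)) != 0;
}
```

Both spellings yield the same mask value, so switching between them does not
change behaviour.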

Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: phy: icplus: keep all ip101a_g functions together
Martin Blumenstingl [Sun, 18 Nov 2018 21:23:55 +0000 (22:23 +0100)]
net: phy: icplus: keep all ip101a_g functions together

This simply moves ip101a_g_config_init right above
ip101a_g_config_intr so all functions for the ICPlus IP101A/G PHYs are
grouped together.
No functional changes.

Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agodt-bindings: net: phy: add bindings for the IC Plus Corp. IP101A/G PHYs
Martin Blumenstingl [Sun, 18 Nov 2018 21:23:54 +0000 (22:23 +0100)]
dt-bindings: net: phy: add bindings for the IC Plus Corp. IP101A/G PHYs

The IP101A and IP101G series both have various models. Depending on the
board implementation we need a special property for the IP101GR (32-pin
LQFP package) PHY:
pin 21 ("RXER/INTR_32") outputs the "receive error" signal by default
(LOW means "normal operation", HIGH means that there's either a decoding
error of the received signal or that the PHY is receiving LPI). This pin
can also be switched to INTR32 mode, where the interrupt signal is
routed to this pin. The other PHYs don't need this special handling
because they have more pins available so the interrupt function gets a
dedicated pin.

This adds two properties to either select the "receive error" or
"interrupt" function of pin 21. Not specifying any function means that
the default set by the bootloader is used. This is required because the
IP101GR cannot be distinguished from other IP101 PHYs as the PHY
identification registers on all of these read 0x02430c54.

The IP101G (sold as die only, without package) may suffer from the same
issue depending on how it's integrated into a multi chip package by
another manufacturer. If only the RXER/INTR_32 pin is routed then the
users of the die-only variant may also have to explicitly configure the
mode of the RXER/INTR_32 pin. This is the reason why no "is-ip101gr"
property was added. I have no evidence though which would confirm this
theory - so the binding itself is independent of that.

Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agodt-bindings: vendor-prefix: add prefix for IC Plus Corp.
Martin Blumenstingl [Sun, 18 Nov 2018 21:23:53 +0000 (22:23 +0100)]
dt-bindings: vendor-prefix: add prefix for IC Plus Corp.

IC Plus Corp. has various Ethernet related products such as Ethernet
transceivers, Ethernet controllers, Ethernet switches, etc.

Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoLinux 4.20-rc3
Linus Torvalds [Sun, 18 Nov 2018 21:33:44 +0000 (13:33 -0800)]
Linux 4.20-rc3

5 years agotg3: optionally use eth_platform_get_mac_address() to get mac address
thesven73@gmail.com [Sat, 17 Nov 2018 15:56:18 +0000 (10:56 -0500)]
tg3: optionally use eth_platform_get_mac_address() to get mac address

This function will try to determine the mac address via the devicetree,
or via an architecture-specific method (e.g. a PROM on SPARC).

The SPARC-specific code in this driver (#ifdef SPARC) did exactly this,
and is therefore removed.

Note that you can now specify the tg3 mac address via the devicetree,
on any platform, not just SPARC:

Devicetree example:
(see Documentation/devicetree/bindings/pci/pci.txt)

&pcie {
	host@0 {
		#address-cells = <3>;
		#size-cells = <2>;
		reg = <0 0 0 0 0>;
		bcm5778: bcm5778@0 {
			reg = <0 0 0 0 0>;
			mac-address = [CA 11 AB 1E 10 01];
		};
	};
};

Signed-off-by: Sven Van Asbroeck <svendev@arcx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: Add part of TCP counts explanations in snmp_counters.rst
yupeng [Fri, 16 Nov 2018 19:17:40 +0000 (11:17 -0800)]
net: Add part of TCP counts explanations in snmp_counters.rst

Add explanations of some generic TCP counters, fast open
related counters and TCP abort related counters and several
examples.

Signed-off-by: yupeng <yupeng0921@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge tag 'libnvdimm-fixes-4.20-rc3' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 18 Nov 2018 20:21:09 +0000 (12:21 -0800)]
Merge tag 'libnvdimm-fixes-4.20-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm fixes from Dan Williams:
 "A small batch of fixes for v4.20-rc3.

  The overflow continuation fix addresses something that has been broken
  for several releases. Arguably it could wait even longer, but it's a
  one line fix and this finishes the last of the known address range
  scrub bug reports. The revert addresses a lockdep regression. The unit
  tests are not critical to fix, but no reason to hold this fix back.

  Summary:

   - Address Range Scrub overflow continuation handling has been broken
     since it was initially merged. It was only recently that error
     injection and platform-BIOS support enabled this corner case to be
     exercised.

   - The recent attempt to provide more isolation for the kernel Address
     Range Scrub state machine from userspace initiated sessions
     triggers a lockdep report. Revert and try again at the next merge
     window.

   - Fix a kasan reported buffer overflow in libnvdimm unit test
     infrastructure (nfit_test)"

* tag 'libnvdimm-fixes-4.20-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  Revert "acpi, nfit: Further restrict userspace ARS start requests"
  acpi, nfit: Fix ARS overflow continuation
  tools/testing/nvdimm: Fix the array size for dimm devices.

5 years agoMerge branch 'akpm' (patches from Andrew)
Linus Torvalds [Sun, 18 Nov 2018 19:31:26 +0000 (11:31 -0800)]
Merge branch 'akpm' (patches from Andrew)

Merge misc fixes from Andrew Morton:
 "16 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm/memblock.c: fix a typo in __next_mem_pfn_range() comments
  mm, page_alloc: check for max order in hot path
  scripts/spdxcheck.py: make python3 compliant
  tmpfs: make lseek(SEEK_DATA/SEK_HOLE) return ENXIO with a negative offset
  lib/ubsan.c: don't mark __ubsan_handle_builtin_unreachable as noreturn
  mm/vmstat.c: fix NUMA statistics updates
  mm/gup.c: fix follow_page_mask() kerneldoc comment
  ocfs2: free up write context when direct IO failed
  scripts/faddr2line: fix location of start_kernel in comment
  mm: don't reclaim inodes with many attached pages
  mm, memory_hotplug: check zone_movable in has_unmovable_pages
  mm/swapfile.c: use kvzalloc for swap_info_struct allocation
  MAINTAINERS: update OMAP MMC entry
  hugetlbfs: fix kernel BUG at fs/hugetlbfs/inode.c:444!
  kernel/sched/psi.c: simplify cgroup_move_task()
  z3fold: fix possible reclaim races

5 years agoMerge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 18 Nov 2018 18:58:20 +0000 (10:58 -0800)]
Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fix from Ingo Molnar:
 "Fix an exec() related scalability/performance regression, which was
  caused by incorrectly calculating load and migrating tasks on exec()
  when they shouldn't be"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix cpu_util_wake() for 'execl' type workloads

5 years agoMerge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 18 Nov 2018 18:54:59 +0000 (10:54 -0800)]
Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Ingo Molnar:
 "Fix uncore PMU enumeration for CoffeeLake CPUs"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel/uncore: Support CoffeeLake 8th CBOX
  perf/x86/intel/uncore: Add more IMC PCI IDs for KabyLake and CoffeeLake CPUs

5 years agoMerge branch 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 18 Nov 2018 18:52:26 +0000 (10:52 -0800)]
Merge branch 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull EFI fixes from Ingo Molnar:
 "Misc fixes: two warning splat fixes, a leak fix and persistent memory
  allocation fixes for ARM"

* 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  efi: Permit calling efi_mem_reserve_persistent() from atomic context
  efi/arm: Defer persistent reservations until after paging_init()
  efi/arm/libstub: Pack FDT after populating it
  efi/arm: Revert deferred unmap of early memmap mapping
  efi: Fix debugobjects warning on 'efi_rts_work'

5 years agoMerge branch 'spectre' of git://git.armlinux.org.uk/~rmk/linux-arm
Linus Torvalds [Sun, 18 Nov 2018 18:45:09 +0000 (10:45 -0800)]
Merge branch 'spectre' of git://git.armlinux.org.uk/~rmk/linux-arm

Pull ARM spectre updates from Russell King:
 "These are the currently known final bits that resolve the Spectre
  issues. big.Little systems used to be sufficiently identical in that
  there were no differences between individual CPUs in the system that
  mattered to the kernel. With the advent of the Spectre problem, the
  CPUs now have differences in how the workaround is applied.

  As a result of previous Spectre patches, these systems ended up
  reporting quite a lot of:

     "CPUx: Spectre v2: incorrect context switching function, system vulnerable"

  messages due to the action of the big.Little switcher causing the CPUs
  to be re-initialised regularly. This series resolves that issue by
  making the CPU vtable unique to each CPU.

  However, since this is used very early, before per-cpu is setup,
  per-cpu can't be used. We also have a problem that two of the methods
  are not called from preempt-safe paths, but thankfully these remain
  identical between all CPUs in the system. To make sure, we validate
  that these are identical during boot"

* 'spectre' of git://git.armlinux.org.uk/~rmk/linux-arm:
  ARM: spectre-v2: per-CPU vtables to work around big.Little systems
  ARM: add PROC_VTABLE and PROC_TABLE macros
  ARM: clean up per-processor check_bugs method call
  ARM: split out processor lookup
  ARM: make lookup_processor_type() non-__init

5 years agomm/memblock.c: fix a typo in __next_mem_pfn_range() comments
Chen Chang [Fri, 16 Nov 2018 23:08:57 +0000 (15:08 -0800)]
mm/memblock.c: fix a typo in __next_mem_pfn_range() comments

Link: http://lkml.kernel.org/r/20181107100247.13359-1-rainccrun@gmail.com
Signed-off-by: Chen Chang <rainccrun@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm, page_alloc: check for max order in hot path
Michal Hocko [Fri, 16 Nov 2018 23:08:53 +0000 (15:08 -0800)]
mm, page_alloc: check for max order in hot path

Konstantin has noticed that kvmalloc might trigger the following
warning:

  WARNING: CPU: 0 PID: 6676 at mm/vmstat.c:986 __fragmentation_index+0x54/0x60
  [...]
  Call Trace:
   fragmentation_index+0x76/0x90
   compaction_suitable+0x4f/0xf0
   shrink_node+0x295/0x310
   node_reclaim+0x205/0x250
   get_page_from_freelist+0x649/0xad0
   __alloc_pages_nodemask+0x12a/0x2a0
   kmalloc_large_node+0x47/0x90
   __kmalloc_node+0x22b/0x2e0
   kvmalloc_node+0x3e/0x70
   xt_alloc_table_info+0x3a/0x80 [x_tables]
   do_ip6t_set_ctl+0xcd/0x1c0 [ip6_tables]
   nf_setsockopt+0x44/0x60
   SyS_setsockopt+0x6f/0xc0
   do_syscall_64+0x67/0x120
   entry_SYSCALL_64_after_hwframe+0x3d/0xa2

The problem is that we only check for an out of bound order in the slow
path and the node reclaim might happen from the fast path already.  This
is fixable by making sure that kvmalloc doesn't ever use kmalloc for
requests that are larger than KMALLOC_MAX_SIZE but this also shows that
the code is rather fragile.  A recent UBSAN report underlines that:

  UBSAN: Undefined behaviour in mm/page_alloc.c:3117:19
  shift exponent 51 is too large for 32-bit type 'int'
  CPU: 0 PID: 6520 Comm: syz-executor1 Not tainted 4.19.0-rc2 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  Call Trace:
   __dump_stack lib/dump_stack.c:77 [inline]
   dump_stack+0xd2/0x148 lib/dump_stack.c:113
   ubsan_epilogue+0x12/0x94 lib/ubsan.c:159
   __ubsan_handle_shift_out_of_bounds+0x2b6/0x30b lib/ubsan.c:425
   __zone_watermark_ok+0x2c7/0x400 mm/page_alloc.c:3117
   zone_watermark_fast mm/page_alloc.c:3216 [inline]
   get_page_from_freelist+0xc49/0x44c0 mm/page_alloc.c:3300
   __alloc_pages_nodemask+0x21e/0x640 mm/page_alloc.c:4370
   alloc_pages_current+0xcc/0x210 mm/mempolicy.c:2093
   alloc_pages include/linux/gfp.h:509 [inline]
   __get_free_pages+0x12/0x60 mm/page_alloc.c:4414
   dma_mem_alloc+0x36/0x50 arch/x86/include/asm/floppy.h:156
   raw_cmd_copyin drivers/block/floppy.c:3159 [inline]
   raw_cmd_ioctl drivers/block/floppy.c:3206 [inline]
   fd_locked_ioctl+0xa00/0x2c10 drivers/block/floppy.c:3544
   fd_ioctl+0x40/0x60 drivers/block/floppy.c:3571
   __blkdev_driver_ioctl block/ioctl.c:303 [inline]
   blkdev_ioctl+0xb3c/0x1a30 block/ioctl.c:601
   block_ioctl+0x105/0x150 fs/block_dev.c:1883
   vfs_ioctl fs/ioctl.c:46 [inline]
   do_vfs_ioctl+0x1c0/0x1150 fs/ioctl.c:687
   ksys_ioctl+0x9e/0xb0 fs/ioctl.c:702
   __do_sys_ioctl fs/ioctl.c:709 [inline]
   __se_sys_ioctl fs/ioctl.c:707 [inline]
   __x64_sys_ioctl+0x7e/0xc0 fs/ioctl.c:707
   do_syscall_64+0xc4/0x510 arch/x86/entry/common.c:290
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

Note that this is not a kvmalloc path.  It is just that the fast path
really depends on having a sanitized order as well.  Therefore move the
order check to the fast path.
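
The idea of validating the order before anything shifts by it can be sketched
in user-space C as follows. This is a simplified model, not the page
allocator: SKETCH_MAX_ORDER and pages_for_order() are hypothetical names, and
the limit of 11 is an assumed value.

```c
#include <assert.h>

/* Assumed upper bound on the allocation order for this sketch. */
#define SKETCH_MAX_ORDER 11u

/* The limit must stay below the width of the shifted type. */
static_assert(SKETCH_MAX_ORDER < 32, "order limit must fit the shift type");

/*
 * Number of pages in an allocation of the given order, or 0 when the order
 * is out of range. Checking first keeps the shift exponent in range, which
 * is the condition the UBSAN report above complains about.
 */
static unsigned long pages_for_order(unsigned int order)
{
	if (order >= SKETCH_MAX_ORDER)
		return 0;	/* reject before any shift happens */
	return 1UL << order;
}
```

With the check at the entry point, every later user of the order can rely on
it being in range, whichever path it takes.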

Link: http://lkml.kernel.org/r/20181113094305.GM15120@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reported-by: Kyungtae Kim <kt0755@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Byoungyoung Lee <lifeasageek@gmail.com>
Cc: "Dae R. Jeong" <threeearcat@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoscripts/spdxcheck.py: make python3 compliant
Uwe Kleine-König [Fri, 16 Nov 2018 23:08:43 +0000 (15:08 -0800)]
scripts/spdxcheck.py: make python3 compliant

Without this change the following happens when using Python3 (3.6.6):

$ echo "GPL-2.0" | python3 scripts/spdxcheck.py -
FAIL: 'str' object has no attribute 'decode'
Traceback (most recent call last):
  File "scripts/spdxcheck.py", line 253, in <module>
    parser.parse_lines(sys.stdin, args.maxlines, '-')
  File "scripts/spdxcheck.py", line 171, in parse_lines
    line = line.decode(locale.getpreferredencoding(False), errors='ignore')
AttributeError: 'str' object has no attribute 'decode'

So as the line is already a string, there is no need to decode it and
the line can be dropped.

/usr/bin/python on Arch is Python 3.  So this would indeed be worth
going into 4.19.

Link: http://lkml.kernel.org/r/20181023070802.22558-1-u.kleine-koenig@pengutronix.de
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Joe Perches <joe@perches.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agotmpfs: make lseek(SEEK_DATA/SEK_HOLE) return ENXIO with a negative offset
Yufen Yu [Fri, 16 Nov 2018 23:08:39 +0000 (15:08 -0800)]
tmpfs: make lseek(SEEK_DATA/SEK_HOLE) return ENXIO with a negative offset

Other filesystems such as ext4, f2fs and ubifs all return ENXIO when
lseek (SEEK_DATA or SEEK_HOLE) requests a negative offset.

man 2 lseek says

:      EINVAL whence  is  not  valid.   Or: the resulting file offset would be
:             negative, or beyond the end of a seekable device.
:
:      ENXIO  whence is SEEK_DATA or SEEK_HOLE, and the file offset is  beyond
:             the end of the file.

Make tmpfs return ENXIO under these circumstances as well.  After this,
tmpfs also passes xfstests's generic/448.
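
The behaviour described above can be modelled with a small C helper. It is a
sketch of the offset validation only, under the assumption that a negative
offset is treated like an offset beyond the end of the file. The function name
and parameters are hypothetical and the real tmpfs code is structured
differently.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Model of the offset check for SEEK_DATA / SEEK_HOLE on a file of
 * inode_size bytes. Returns 0 when the search may proceed and -ENXIO
 * when the offset is negative or not inside the file.
 */
static int seek_data_hole_check(int64_t offset, int64_t inode_size)
{
	assert(inode_size >= 0);

	if (offset < 0 || offset >= inode_size)
		return -ENXIO;
	return 0;
}
```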

[akpm@linux-foundation.org: rewrite changelog]
Link: http://lkml.kernel.org/r/1540434176-14349-1-git-send-email-yuyufen@huawei.com
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agolib/ubsan.c: don't mark __ubsan_handle_builtin_unreachable as noreturn
Arnd Bergmann [Fri, 16 Nov 2018 23:08:35 +0000 (15:08 -0800)]
lib/ubsan.c: don't mark __ubsan_handle_builtin_unreachable as noreturn

gcc-8 complains about the prototype for this function:

  lib/ubsan.c:432:1: error: ignoring attribute 'noreturn' in declaration of a built-in function '__ubsan_handle_builtin_unreachable' because it conflicts with attribute 'const' [-Werror=attributes]

This is actually a GCC bug. In GCC internals,
__ubsan_handle_builtin_unreachable() is declared with both 'noreturn' and
'const' attributes instead of only 'noreturn':

   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84210

Work around this by removing the noreturn attribute.

[aryabinin: add information about GCC bug in changelog]
Link: http://lkml.kernel.org/r/20181107144516.4587-1-aryabinin@virtuozzo.com
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Olof Johansson <olof@lixom.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/vmstat.c: fix NUMA statistics updates
Janne Huttunen [Fri, 16 Nov 2018 23:08:32 +0000 (15:08 -0800)]
mm/vmstat.c: fix NUMA statistics updates

Scan through the whole array to see if an update is needed.  While we're
at it, use sizeof() to be safe against any possible type changes in the
future.

The bug here is that we wouldn't sync per-cpu counters into global ones
if there was an update of numa_stats for higher cpus.  Highly
theoretical one though because it is much more probable that zone_stats
are updated so we would refresh anyway.  So I wouldn't bother to mark
this for stable, yet something nice to fix.
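
One way such a partial scan can arise is sketched below. The assumption, which
the changelog does not spell out, is that a byte-wise scan was given an
element count after the counter type grew beyond one byte. All names
(SKETCH_NR_ITEMS, struct percpu_diff, the helpers) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NR_ITEMS 6	/* hypothetical number of counters */

struct percpu_diff {
	uint16_t numa_diff[SKETCH_NR_ITEMS];
};

/* Return nonzero when any of the first len bytes at p is nonzero. */
static int any_nonzero(const void *p, size_t len)
{
	const unsigned char *b = p;

	assert(p != NULL);
	for (size_t i = 0; i < len; i++)
		if (b[i])
			return 1;
	return 0;
}

/* Short scan: an element count is passed where a byte count is expected. */
static int need_update_short(const struct percpu_diff *d)
{
	return any_nonzero(d->numa_diff, SKETCH_NR_ITEMS);
}

/* Full scan: sizeof() covers the array whatever the element type is. */
static int need_update_full(const struct percpu_diff *d)
{
	return any_nonzero(d->numa_diff, sizeof(d->numa_diff));
}
```

With 16-bit counters the short scan only covers the first half of the array,
so an update in a later slot goes unnoticed.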

[mhocko@suse.com: changelog enhancement]
Link: http://lkml.kernel.org/r/1541601517-17282-1-git-send-email-janne.huttunen@nokia.com
Fixes: 1d90ca897cb0 ("mm: update NUMA counter threshold size")
Signed-off-by: Janne Huttunen <janne.huttunen@nokia.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/gup.c: fix follow_page_mask() kerneldoc comment
Mike Rapoport [Fri, 16 Nov 2018 23:08:29 +0000 (15:08 -0800)]
mm/gup.c: fix follow_page_mask() kerneldoc comment

Commit df06b37ffe5a ("mm/gup: cache dev_pagemap while pinning pages")
modified the signature of follow_page_mask() but left the parameter
description behind.

Update the description to make the code and comments agree again.

While at it, update formatting of the return value description to match
Documentation/doc-guide/kernel-doc.rst guidelines.

Link: http://lkml.kernel.org/r/1541603316-27832-1-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoocfs2: free up write context when direct IO failed
Wengang Wang [Fri, 16 Nov 2018 23:08:25 +0000 (15:08 -0800)]
ocfs2: free up write context when direct IO failed

The write context should also be freed even when direct IO failed.
Otherwise a memory leak is introduced and entries remain in
oi->ip_unwritten_list causing the following BUG later in unlink path:

  ERROR: bug expression: !list_empty(&oi->ip_unwritten_list)
  ERROR: Clear inode of 215043, inode has unwritten extents
  ...
  Call Trace:
  ? __set_current_blocked+0x42/0x68
  ocfs2_evict_inode+0x91/0x6a0 [ocfs2]
  ? bit_waitqueue+0x40/0x33
  evict+0xdb/0x1af
  iput+0x1a2/0x1f7
  do_unlinkat+0x194/0x28f
  SyS_unlinkat+0x1b/0x2f
  do_syscall_64+0x79/0x1ae
  entry_SYSCALL_64_after_hwframe+0x151/0x0

This patch also logs, with frequency limit, direct IO failures.

Link: http://lkml.kernel.org/r/20181102170632.25921-1-wen.gang.wang@oracle.com
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoscripts/faddr2line: fix location of start_kernel in comment
Randy Dunlap [Fri, 16 Nov 2018 23:08:22 +0000 (15:08 -0800)]
scripts/faddr2line: fix location of start_kernel in comment

Fix a source file reference location to the correct path name.

Link: http://lkml.kernel.org/r/1d50bd3d-178e-dcd8-779f-9711887440eb@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm: don't reclaim inodes with many attached pages
Roman Gushchin [Fri, 16 Nov 2018 23:08:18 +0000 (15:08 -0800)]
mm: don't reclaim inodes with many attached pages

Spock reported that commit 172b06c32b94 ("mm: slowly shrink slabs with a
relatively small number of objects") leads to a regression on his setup:
periodically the majority of the pagecache is evicted without an obvious
reason, while before the change the amount of free memory was balancing
around the watermark.

The reason is that the above-mentioned change created some minimal
background pressure on the inode cache.  The problem is that if an inode
is considered for reclaim, all of its pagecache pages are stripped, no
matter how many of them there are.  So, if a huge multi-gigabyte file is
cached in memory, and the goal is to reclaim only a few slab objects
(unused inodes), we can still eventually evict all gigabytes of the
pagecache at once.

The workload described by Spock has a few large non-mapped files in the
pagecache, so it's especially noticeable.

To solve the problem let's postpone the reclaim of inodes which have
more than 1 attached page.  Let's wait until the pagecache pages are
evicted naturally by scanning the corresponding LRU lists, and only then
reclaim the inode structure.

Link: http://lkml.kernel.org/r/20181023164302.20436-1-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reported-by: Spock <dairinin@gmail.com>
Tested-by: Spock <dairinin@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: <stable@vger.kernel.org> [4.19.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm, memory_hotplug: check zone_movable in has_unmovable_pages
Michal Hocko [Fri, 16 Nov 2018 23:08:15 +0000 (15:08 -0800)]
mm, memory_hotplug: check zone_movable in has_unmovable_pages

Page state checks are racy.  Under a heavy memory workload (e.g.  stress
-m 200 -t 2h) it is quite easy to hit a race window when the page is
allocated but its state is not fully populated yet.  A debugging patch to
dump the struct page state shows

  has_unmovable_pages: pfn:0x10dfec00, found:0x1, count:0x0
  page:ffffea0437fb0000 count:1 mapcount:1 mapping:ffff880e05239841 index:0x7f26e5000 compound_mapcount: 1
  flags: 0x5fffffc0090034(uptodate|lru|active|head|swapbacked)

Note that the state has been checked for both PageLRU and PageSwapBacked
already.  Closing this race completely would require some sort of retry
logic.  This can be tricky and error prone (think of potential endless
or long taking loops).

Work around this problem for movable zones at least.  Such a zone should
only contain movable pages.  Commit 15c30bc09085 ("mm, memory_hotplug:
make has_unmovable_pages more robust") has told us that this is not
strictly true though.  Bootmem pages should be marked reserved though so
we can move the original check after the PageReserved check.  Pages from
other zones are still prone to races but we do not even pretend that
memory hotremove works for those so premature failure doesn't hurt that
much.

Link: http://lkml.kernel.org/r/20181106095524.14629-1-mhocko@kernel.org
Fixes: 15c30bc09085 ("mm, memory_hotplug: make has_unmovable_pages more robust")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/swapfile.c: use kvzalloc for swap_info_struct allocation
Vasily Averin [Fri, 16 Nov 2018 23:08:11 +0000 (15:08 -0800)]
mm/swapfile.c: use kvzalloc for swap_info_struct allocation

Commit a2468cc9bfdf ("swap: choose swap device according to numa node")
changed 'avail_lists' field of 'struct swap_info_struct' to an array.
In popular Linux distros it increased the size of swap_info_struct up to
40 Kbytes and now swap_info_struct allocation requires an order-4 page.
Switching to kvzalloc avoids unexpected allocation failures.
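
The order-4 figure follows from simple arithmetic, assuming 4 KiB pages: 40
Kbytes needs 10 pages, and the next power of two is 16 pages, which is order
4. The helper below is a user-space sketch of that calculation.
SKETCH_PAGE_SIZE and order_for_size() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed 4 KiB pages */

/* Smallest order such that (SKETCH_PAGE_SIZE << order) >= size. */
static unsigned int order_for_size(size_t size)
{
	unsigned int order = 0;
	size_t span = SKETCH_PAGE_SIZE;

	assert(size > 0);
	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}
```

A physically contiguous order-4 block is much harder to find on a fragmented
system than sixteen separate pages, which is why a fallback to non-contiguous
memory helps.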

Link: http://lkml.kernel.org/r/fc23172d-3c75-21e2-d551-8b1808cbe593@virtuozzo.com
Fixes: a2468cc9bfdf ("swap: choose swap device according to numa node")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Aaron Lu <aaron.lu@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoMAINTAINERS: update OMAP MMC entry
Aaro Koskinen [Fri, 16 Nov 2018 23:08:08 +0000 (15:08 -0800)]
MAINTAINERS: update OMAP MMC entry

Jarkko's e-mail address hasn't worked for a long time.  We still want to
keep this driver working as it is critical for some of the OMAP boards.
I use and test this driver frequently, so add myself as a maintainer
with "Odd Fixes" status.

Link: http://lkml.kernel.org/r/20181106222750.12939-1-aaro.koskinen@iki.fi
Signed-off-by: Aaro Koskinen <aaro.koskinen@iki.fi>
Acked-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agohugetlbfs: fix kernel BUG at fs/hugetlbfs/inode.c:444!
Mike Kravetz [Fri, 16 Nov 2018 23:08:04 +0000 (15:08 -0800)]
hugetlbfs: fix kernel BUG at fs/hugetlbfs/inode.c:444!

This bug has been experienced several times by the Oracle DB team.  The
BUG is in remove_inode_hugepages() as follows:

/*
 * If page is mapped, it was faulted in after being
 * unmapped in caller.  Unmap (again) now after taking
 * the fault mutex.  The mutex will prevent faults
 * until we finish removing the page.
 *
 * This race can only happen in the hole punch case.
 * Getting here in a truncate operation is a bug.
 */
if (unlikely(page_mapped(page))) {
	BUG_ON(truncate_op);

In this case, the elevated map count is not the result of a race.
Rather it was incorrectly incremented as the result of a bug in the huge
pmd sharing code.  Consider the following:

 - Process A maps a hugetlbfs file of sufficient size and alignment
   (PUD_SIZE) that a pmd page could be shared.

 - Process B maps the same hugetlbfs file with the same size and
   alignment such that a pmd page is shared.

 - Process B then calls mprotect() to change protections for the mapping
   with the shared pmd. As a result, the pmd is 'unshared'.

 - Process B then calls mprotect() again to change protections for the
   mapping back to their original value. pmd remains unshared.

 - Process B then forks and process C is created. During the fork
   process, we do dup_mm -> dup_mmap -> copy_page_range to copy page
   tables. Copying page tables for hugetlb mappings is done in the
   routine copy_hugetlb_page_range.

In copy_hugetlb_page_range(), the destination pte is obtained by:

dst_pte = huge_pte_alloc(dst, addr, sz);

If pmd sharing is possible, the returned pointer will be to a pte in an
existing page table.  In the situation above, process C could share with
either process A or process B.  Since process A is first in the list,
the returned pte is a pointer to a pte in process A's page table.

However, the check for pmd sharing in copy_hugetlb_page_range is:

/* If the pagetables are shared don't copy or take references */
if (dst_pte == src_pte)
	continue;

Since process C is sharing with process A instead of process B, the
above test fails.  The code in copy_hugetlb_page_range which follows
assumes dst_pte points to a huge_pte_none pte.  It copies the pte entry
from src_pte to dst_pte and increments the map count of the associated
page.  This is how we end up with an elevated map count.

To solve, check the dst_pte entry for huge_pte_none.  If !none, this
implies PMD sharing so do not copy.
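
The scenario can be modelled in userspace C.  This is a sketch under
stated assumptions, not the kernel code: the sketch_* types and
functions are hypothetical, and a pte is reduced to a pointer that is
either NULL (none) or maps a page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: a pte is either empty (none) or maps a page. */
struct sketch_page { int mapcount; };
struct sketch_pte { struct sketch_page *page; };

static bool sketch_pte_none(const struct sketch_pte *pte)
{
	return pte->page == NULL;
}

/*
 * Copy one entry at fork time.  'check_none' selects the fixed behaviour;
 * without it only the dst == src test guards against shared tables.
 */
static void sketch_copy_entry(struct sketch_pte *dst, struct sketch_pte *src,
			      bool check_none)
{
	assert(dst && src);
	if (dst == src)
		return;         /* shared with the source: skip */
	if (check_none && !sketch_pte_none(dst))
		return;         /* already populated, so shared: skip */
	if (sketch_pte_none(src))
		return;
	*dst = *src;
	dst->page->mapcount++;  /* counted as a new mapping of the page */
}
```

When the destination entry lives in process A's table and the source in
process B's, the pointer comparison alone lets the copy through and the
map count grows; with the extra none check it stays unchanged.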

Link: http://lkml.kernel.org/r/20181105212315.14125-1-mike.kravetz@oracle.com
Fixes: c5c99429fa57 ("fix hugepages leak due to pagetable page sharing")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agokernel/sched/psi.c: simplify cgroup_move_task()
Olof Johansson [Fri, 16 Nov 2018 23:08:00 +0000 (15:08 -0800)]
kernel/sched/psi.c: simplify cgroup_move_task()

The existing code triggered an invalid warning about 'rq' possibly being
used uninitialized.  Instead of doing the silly warning suppression by
initializing it to NULL, refactor the code to bail out early instead.

Warning was:

  kernel/sched/psi.c: In function `cgroup_move_task':
  kernel/sched/psi.c:639:13: warning: `rq' may be used uninitialized in this function [-Wmaybe-uninitialized]
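
The early bail-out pattern can be sketched as follows.  All names here
(sketch_rq, sketch_move_task() and friends) are hypothetical and the
locking is reduced to a counter; the point is only that the pointer is
assigned on every path that uses it, so no dummy initializer is needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_rq { int locked; };       /* hypothetical run queue */
static struct sketch_rq sketch_the_rq;

static struct sketch_rq *sketch_rq_lock(void)
{
	sketch_the_rq.locked++;
	return &sketch_the_rq;
}

static void sketch_rq_unlock(struct sketch_rq *rq)
{
	assert(rq->locked > 0);
	rq->locked--;
}

/* Returns the number of accounting passes performed (0 or 1). */
static int sketch_move_task(bool psi_disabled, int *cgroup, int to)
{
	struct sketch_rq *rq;   /* no dummy "= NULL" needed */

	if (psi_disabled) {
		/* Bail out early: rq is never touched on this path. */
		*cgroup = to;
		return 0;
	}

	rq = sketch_rq_lock();
	*cgroup = to;
	sketch_rq_unlock(rq);
	return 1;
}
```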

Link: http://lkml.kernel.org/r/20181103183339.8669-1-olof@lixom.net
Fixes: 2ce7135adc9ad ("psi: cgroup support")
Signed-off-by: Olof Johansson <olof@lixom.net>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoz3fold: fix possible reclaim races
Vitaly Wool [Fri, 16 Nov 2018 23:07:56 +0000 (15:07 -0800)]
z3fold: fix possible reclaim races

Reclaim and free can race on an object, which is basically fine, but in
order for reclaim to be able to map a "freed" object we need to encode
the object length in the handle.  handle_to_chunks() is then introduced
to extract the object length from a handle and use it during mapping.
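
One way to picture a length-carrying handle is shown below.  The bit
layout is an assumption made for illustration (it is not taken from
z3fold), and every sketch_* name is hypothetical: the low 12 bits of a
4 KiB-aligned page address are split into a 2-bit buddy number and a
10-bit length in chunks.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical layout: the page address is aligned to 4 KiB, so its low
 * 12 bits are free.  Bits 0-1 hold the buddy number, bits 2-11 the length.
 */
#define SKETCH_BUDDY_BITS  2
#define SKETCH_BUDDY_MASK  ((1UL << SKETCH_BUDDY_BITS) - 1)
#define SKETCH_PAGE_MASK   (~0xfffUL)

static uintptr_t sketch_encode_handle(uintptr_t page_addr, unsigned int buddy,
				      unsigned int chunks)
{
	assert((page_addr & ~SKETCH_PAGE_MASK) == 0);
	assert(buddy <= SKETCH_BUDDY_MASK && chunks < (1U << 10));
	return page_addr | ((uintptr_t)chunks << SKETCH_BUDDY_BITS) | buddy;
}

/* Recover the object length (in chunks) from the handle alone. */
static unsigned int sketch_handle_to_chunks(uintptr_t handle)
{
	return (unsigned int)((handle & ~SKETCH_PAGE_MASK) >> SKETCH_BUDDY_BITS);
}

static unsigned int sketch_handle_to_buddy(uintptr_t handle)
{
	return (unsigned int)(handle & SKETCH_BUDDY_MASK);
}
```

Because the length travels with the handle, a mapper does not have to
consult per-page metadata that a concurrent free may already have
cleared.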

Moreover, to avoid racing on a z3fold "headless" page release, we should
not try to free that page in z3fold_free() if the reclaim bit is set.
Also, in the unlikely case of trying to reclaim a page being freed, we
should not proceed with that page.

While at it, fix the page accounting in reclaim function.

This patch supersedes "[PATCH] z3fold: fix reclaim lock-ups".

Link: http://lkml.kernel.org/r/20181105162225.74e8837d03583a9b707cf559@gmail.com
Signed-off-by: Vitaly Wool <vitaly.vul@sony.com>
Signed-off-by: Jongseok Kim <ks77sj@gmail.com>
Reported-by: Jongseok Kim <ks77sj@gmail.com>
Reviewed-by: Snild Dolkow <snild@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agotipc: don't assume linear buffer when reading ancillary data
Jon Maloy [Sat, 17 Nov 2018 17:17:06 +0000 (12:17 -0500)]
tipc: don't assume linear buffer when reading ancillary data

The code for reading ancillary data from a received buffer assumes
the buffer is linear. To make this assumption true we have to linearize
the buffer before message data is read.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'bcmgenet-fix-aborted-suspend'
David S. Miller [Sun, 18 Nov 2018 06:04:39 +0000 (22:04 -0800)]
Merge branch 'bcmgenet-fix-aborted-suspend'

Doug Berger says:

====================
net: bcmgenet: fix aborted suspend

It is not enough to return an error code from the driver suspend
routine. The driver must also restore the device functionality.

This commit corrects the issue introduced by commit 0db55093b566
("net: bcmgenet: return correct value 'ret' from bcmgenet_power_down")
by calling the driver resume function if the suspend function returns
an error.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: bcmgenet: abort suspend on error
Doug Berger [Sat, 17 Nov 2018 02:00:22 +0000 (18:00 -0800)]
net: bcmgenet: abort suspend on error

If an error occurs during suspension of the driver the driver should
restore the hardware configuration and return an error to force the
system to resume.
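
A minimal sketch of the control flow, assuming a hypothetical device
model (struct sketch_dev and the sketch_* functions are illustrative and
are not the bcmgenet driver's code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_dev {
	bool running;        /* device is operational */
	bool power_down_ok;  /* test knob: does powering down succeed? */
};

static int sketch_power_down(struct sketch_dev *dev)
{
	return dev->power_down_ok ? 0 : -5;   /* -5 stands in for -EIO */
}

static int sketch_resume(struct sketch_dev *dev)
{
	dev->running = true;
	return 0;
}

static int sketch_suspend(struct sketch_dev *dev)
{
	int ret;

	assert(dev != NULL);
	dev->running = false;             /* quiesce the hardware */
	ret = sketch_power_down(dev);
	if (ret)
		sketch_resume(dev);       /* undo the partial suspend */
	return ret;                       /* non-zero aborts the suspend */
}
```

Returning the error alone would leave the device quiesced; calling the
resume path first is what restores it to a working state.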

Fixes: 0db55093b566 ("net: bcmgenet: return correct value 'ret' from bcmgenet_power_down")
Signed-off-by: Doug Berger <opendmb@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: bcmgenet: code movement
Doug Berger [Sat, 17 Nov 2018 02:00:21 +0000 (18:00 -0800)]
net: bcmgenet: code movement

This commit switches the order of bcmgenet_suspend and bcmgenet_resume
in the file to prevent the need for a forward declaration in the next
commit and to make the review of that commit easier.

Signed-off-by: Doug Berger <opendmb@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agogeneve: Initialize addr6 with memset
Nathan Chancellor [Sat, 17 Nov 2018 01:36:27 +0000 (18:36 -0700)]
geneve: Initialize addr6 with memset

Clang warns:

drivers/net/geneve.c:428:29: error: suggest braces around initialization
of subobject [-Werror,-Wmissing-braces]
                struct in6_addr addr6 = { 0 };
                                          ^
                                          {}

Rather than trying to appease the various compilers that support the
kernel, use memset, which is unambiguous.
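
For illustration, the memset form looks like this on a nested type.
struct sketch_in6_addr is a hypothetical stand-in that mimics a struct
wrapping a union of arrays; it is not the kernel's definition.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct in6_addr: a struct wrapping a union. */
struct sketch_in6_addr {
	union {
		uint8_t  addr8[16];
		uint16_t addr16[8];
		uint32_t addr32[4];
	} u;
};

/* Returns 1 if every byte of a memset-initialized address is zero. */
static int sketch_addr_is_zeroed(void)
{
	struct sketch_in6_addr addr6;
	size_t i;

	assert(sizeof(addr6) == 16);
	/* memset does not depend on how deeply the members are nested. */
	memset(&addr6, 0, sizeof(addr6));
	for (i = 0; i < sizeof(addr6.u.addr8); i++)
		if (addr6.u.addr8[i] != 0)
			return 0;
	return 1;
}
```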

Fixes: a07966447f39 ("geneve: ICMP error lookup handler")
Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotipc: fix lockdep warning when reinitilaizing sockets
Jon Maloy [Fri, 16 Nov 2018 21:55:04 +0000 (16:55 -0500)]
tipc: fix lockdep warning when reinitilaizing sockets

We get the following warning:

[   47.926140] 32-bit node address hash set to 2010a0a
[   47.927202]
[   47.927433] ================================
[   47.928050] WARNING: inconsistent lock state
[   47.928661] 4.19.0+ #37 Tainted: G            E
[   47.929346] --------------------------------
[   47.929954] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[   47.930116] swapper/3/0 [HC0[0]:SC1[3]:HE1:SE0] takes:
[   47.930116] 00000000af8bc31e (&(&ht->lock)->rlock){+.?.}, at: rhashtable_walk_enter+0x36/0xb0
[   47.930116] {SOFTIRQ-ON-W} state was registered at:
[   47.930116]   _raw_spin_lock+0x29/0x60
[   47.930116]   rht_deferred_worker+0x556/0x810
[   47.930116]   process_one_work+0x1f5/0x540
[   47.930116]   worker_thread+0x64/0x3e0
[   47.930116]   kthread+0x112/0x150
[   47.930116]   ret_from_fork+0x3a/0x50
[   47.930116] irq event stamp: 14044
[   47.930116] hardirqs last  enabled at (14044): [<ffffffff9a07fbba>] __local_bh_enable_ip+0x7a/0xf0
[   47.938117] hardirqs last disabled at (14043): [<ffffffff9a07fb81>] __local_bh_enable_ip+0x41/0xf0
[   47.938117] softirqs last  enabled at (14028): [<ffffffff9a0803ee>] irq_enter+0x5e/0x60
[   47.938117] softirqs last disabled at (14029): [<ffffffff9a0804a5>] irq_exit+0xb5/0xc0
[   47.938117]
[   47.938117] other info that might help us debug this:
[   47.938117]  Possible unsafe locking scenario:
[   47.938117]
[   47.938117]        CPU0
[   47.938117]        ----
[   47.938117]   lock(&(&ht->lock)->rlock);
[   47.938117]   <Interrupt>
[   47.938117]     lock(&(&ht->lock)->rlock);
[   47.938117]
[   47.938117]  *** DEADLOCK ***
[   47.938117]
[   47.938117] 2 locks held by swapper/3/0:
[   47.938117]  #0: 0000000062c64f90 ((&d->timer)){+.-.}, at: call_timer_fn+0x5/0x280
[   47.938117]  #1: 00000000ee39619c (&(&d->lock)->rlock){+.-.}, at: tipc_disc_timeout+0xc8/0x540 [tipc]
[   47.938117]
[   47.938117] stack backtrace:
[   47.938117] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G            E     4.19.0+ #37
[   47.938117] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   47.938117] Call Trace:
[   47.938117]  <IRQ>
[   47.938117]  dump_stack+0x5e/0x8b
[   47.938117]  print_usage_bug+0x1ed/0x1ff
[   47.938117]  mark_lock+0x5b5/0x630
[   47.938117]  __lock_acquire+0x4c0/0x18f0
[   47.938117]  ? lock_acquire+0xa6/0x180
[   47.938117]  lock_acquire+0xa6/0x180
[   47.938117]  ? rhashtable_walk_enter+0x36/0xb0
[   47.938117]  _raw_spin_lock+0x29/0x60
[   47.938117]  ? rhashtable_walk_enter+0x36/0xb0
[   47.938117]  rhashtable_walk_enter+0x36/0xb0
[   47.938117]  tipc_sk_reinit+0xb0/0x410 [tipc]
[   47.938117]  ? mark_held_locks+0x6f/0x90
[   47.938117]  ? __local_bh_enable_ip+0x7a/0xf0
[   47.938117]  ? lockdep_hardirqs_on+0x20/0x1a0
[   47.938117]  tipc_net_finalize+0xbf/0x180 [tipc]
[   47.938117]  tipc_disc_timeout+0x509/0x540 [tipc]
[   47.938117]  ? call_timer_fn+0x5/0x280
[   47.938117]  ? tipc_disc_msg_xmit.isra.19+0xa0/0xa0 [tipc]
[   47.938117]  ? tipc_disc_msg_xmit.isra.19+0xa0/0xa0 [tipc]
[   47.938117]  call_timer_fn+0xa1/0x280
[   47.938117]  ? tipc_disc_msg_xmit.isra.19+0xa0/0xa0 [tipc]
[   47.938117]  run_timer_softirq+0x1f2/0x4d0
[   47.938117]  __do_softirq+0xfc/0x413
[   47.938117]  irq_exit+0xb5/0xc0
[   47.938117]  smp_apic_timer_interrupt+0xac/0x210
[   47.938117]  apic_timer_interrupt+0xf/0x20
[   47.938117]  </IRQ>
[   47.938117] RIP: 0010:default_idle+0x1c/0x140
[   47.938117] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 41 54 55 53 65 8b 2d d8 2b 74 65 0f 1f 44 00 00 e8 c6 2c 8b ff fb f4 <65> 8b 2d c5 2b 74 65 0f 1f 44 00 00 5b 5d 41 5c c3 65 8b 05 b4 2b
[   47.938117] RSP: 0018:ffffaf6ac0207ec8 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
[   47.938117] RAX: ffff8f5b3735e200 RBX: 0000000000000003 RCX: 0000000000000001
[   47.938117] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8f5b3735e200
[   47.938117] RBP: 0000000000000003 R08: 0000000000000001 R09: 0000000000000000
[   47.938117] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[   47.938117] R13: 0000000000000000 R14: ffff8f5b3735e200 R15: ffff8f5b3735e200
[   47.938117]  ? default_idle+0x1a/0x140
[   47.938117]  do_idle+0x1bc/0x280
[   47.938117]  cpu_startup_entry+0x19/0x20
[   47.938117]  start_secondary+0x187/0x1c0
[   47.938117]  secondary_startup_64+0xa4/0xb0

The reason seems to be that tipc_net_finalize()->tipc_sk_reinit() is
calling the function rhashtable_walk_enter() within a timer interrupt.
We fix this by executing tipc_net_finalize() in work queue context.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet-gro: reset skb->pkt_type in napi_reuse_skb()
Eric Dumazet [Sun, 18 Nov 2018 05:57:02 +0000 (21:57 -0800)]
net-gro: reset skb->pkt_type in napi_reuse_skb()

eth_type_trans() assumes initial value for skb->pkt_type
is PACKET_HOST.

This is indeed the value right after a fresh skb allocation.

However, it is possible that GRO merged a packet with a different
value (like PACKET_OTHERHOST in case macvlan is used), so
we need to make sure napi->skb will have pkt_type set back to
PACKET_HOST.

Otherwise, valid packets might be dropped by the stack because
their pkt_type is not PACKET_HOST.
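
A sketch of the reset, using a hypothetical two-field buffer instead of
struct sk_buff.  The numeric values mirror PACKET_HOST and
PACKET_OTHERHOST; everything else is an assumption for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PACKET_HOST      0  /* mirrors PACKET_HOST */
#define SKETCH_PACKET_OTHERHOST 3  /* mirrors PACKET_OTHERHOST */

struct sketch_skb {
	unsigned int pkt_type;
	unsigned int len;
};

/* Prepare a previously used buffer for the next received frame. */
static void sketch_reuse_skb(struct sketch_skb *skb)
{
	assert(skb != NULL);
	skb->len = 0;
	/* Restore the value a freshly allocated buffer would have. */
	skb->pkt_type = SKETCH_PACKET_HOST;
}
```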

napi_reuse_skb() was added in commit 96e93eab2033 ("gro: Add
internal interfaces for VLAN"), but this bug has always
been there.

Fixes: 96e93eab2033 ("gro: Add internal interfaces for VLAN")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'net-hns3-Add-vf-mtu-support'
David S. Miller [Sun, 18 Nov 2018 05:57:30 +0000 (21:57 -0800)]
Merge branch 'net-hns3-Add-vf-mtu-support'

Salil Mehta says:

====================
net: hns3: Add vf mtu support

This patchset adds vf mtu support to HNS3 driver.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: hns3: up/down netdev in hclge module when setting mtu
Yunsheng Lin [Sun, 18 Nov 2018 03:19:14 +0000 (03:19 +0000)]
net: hns3: up/down netdev in hclge module when setting mtu

Currently the netdev is brought down in the enet module, before the
mtu range check in the hclge module, which may cause the netdev
to be brought down unnecessarily.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: hns3: Add mtu setting support for vf
Yunsheng Lin [Sun, 18 Nov 2018 03:19:13 +0000 (03:19 +0000)]
net: hns3: Add mtu setting support for vf

The patch adds mtu setting support for vf. Currently
vf and pf share the same hardware mtu setting. Mtu set
by vf must be less than or equal to pf's mtu, and mtu
set by pf must be greater than or equal to vf's mtu.
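
The two constraints can be written as a pair of checks.  This is a
sketch with hypothetical function names, not the hns3 implementation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A VF may only pick an MTU that fits within the PF's current MTU. */
static bool sketch_vf_mtu_ok(unsigned int vf_mtu, unsigned int pf_mtu)
{
	return vf_mtu <= pf_mtu;
}

/* The PF may only pick an MTU that still covers every VF's MTU. */
static bool sketch_pf_mtu_ok(unsigned int pf_mtu,
			     const unsigned int *vf_mtus, size_t n)
{
	size_t i;

	assert(vf_mtus != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (pf_mtu < vf_mtus[i])
			return false;
	return true;
}
```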

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: hns3: Add vport alive state checking support
Yunsheng Lin [Sun, 18 Nov 2018 03:19:12 +0000 (03:19 +0000)]
net: hns3: Add vport alive state checking support

Currently there is no way for pf to know if a vf device is
alive or not, so PF does not know which vf to notify when
reset happens, or which vf's mtu is invalid when vf and pf
share the same hardware mtu setting.

This patch adds vport alive state checking support, in order
to support the above scenario.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: hns3: Refactor mac mtu setting related functions
Yunsheng Lin [Sun, 18 Nov 2018 03:19:11 +0000 (03:19 +0000)]
net: hns3: Refactor mac mtu setting related functions

This patch refactors mac mtu setting related functions,
normalizes the use of mps and mtu.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: hns3: Support two vlan header when setting mtu
Yunsheng Lin [Sun, 18 Nov 2018 03:19:10 +0000 (03:19 +0000)]
net: hns3: Support two vlan header when setting mtu

This patch adds support for two vlan headers when setting mtu.
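
As a worked example of what two vlan headers mean for the frame size,
the sketch below adds the Ethernet header, the FCS and two 802.1Q tags
to the mtu.  The header lengths are the standard ones; that the driver
uses exactly this formula is an assumption, and sketch_max_frame_size()
is a hypothetical name.

```c
#include <assert.h>

#define SKETCH_ETH_HLEN    14  /* Ethernet header */
#define SKETCH_ETH_FCS_LEN  4  /* frame check sequence */
#define SKETCH_VLAN_HLEN    4  /* one 802.1Q tag */

/* Largest frame on the wire for a given MTU, allowing two VLAN tags. */
static unsigned int sketch_max_frame_size(unsigned int mtu)
{
	assert(mtu > 0);
	return mtu + SKETCH_ETH_HLEN + SKETCH_ETH_FCS_LEN +
	       2 * SKETCH_VLAN_HLEN;
}
```

For an mtu of 1500 this gives 1500 + 14 + 4 + 8 = 1526 bytes.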

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'tdc-fixes'
David S. Miller [Sun, 18 Nov 2018 05:54:53 +0000 (21:54 -0800)]
Merge branch 'tdc-fixes'

Lucas Bates says:

====================
Prevent uncaught exceptions in tdc

This patch series addresses two potential bugs in tdc that can
cause exceptions to be raised in certain circumstances.  These
exceptions are generally not handled, so instead we will prevent
them from being raised.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotc-testing: tdc.py: Guard against lack of returncode in executed command
Brenda J. Butler [Fri, 16 Nov 2018 22:37:56 +0000 (17:37 -0500)]
tc-testing: tdc.py: Guard against lack of returncode in executed command

Add some defensive coding in case one of the subprocesses created by tdc
returns nothing. If no object is returned from exec_cmd, then tdc will
halt with an unhandled exception.

Signed-off-by: Brenda J. Butler <bjb@mojatatu.com>
Signed-off-by: Lucas Bates <lucasb@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotc-testing: tdc.py: ignore errors when decoding stdout/stderr
Lucas Bates [Fri, 16 Nov 2018 22:37:55 +0000 (17:37 -0500)]
tc-testing: tdc.py: ignore errors when decoding stdout/stderr

Prevent exceptions from being raised while decoding output
from an executed command. There is no impact on tdc's
execution and the verify command phase would fail the pattern
match.

Signed-off-by: Lucas Bates <lucasb@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: fsl: Use device_type helpers to access the node type
Rob Herring [Fri, 16 Nov 2018 22:11:03 +0000 (16:11 -0600)]
net: fsl: Use device_type helpers to access the node type

Remove direct accesses to the device_node.type pointer and use the
accessors instead. This will eventually allow removing the type pointer.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Rob Herring <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoatm: Convert to using %pOFn instead of device_node.name
Rob Herring [Fri, 16 Nov 2018 22:05:37 +0000 (16:05 -0600)]
atm: Convert to using %pOFn instead of device_node.name

In preparation to remove the node name pointer from struct device_node,
convert printf users to use the %pOFn format specifier.

Cc: Chas Williams <3chas3@gmail.com>
Cc: linux-atm-general@lists.sourceforge.net
Cc: netdev@vger.kernel.org
Signed-off-by: Rob Herring <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoip_tunnel: don't force DF when MTU is locked
Sabrina Dubroca [Fri, 16 Nov 2018 15:58:19 +0000 (16:58 +0100)]
ip_tunnel: don't force DF when MTU is locked

The various types of tunnels running over IPv4 can ask to set the DF
bit to do PMTU discovery. However, PMTU discovery is subject to the
threshold set by the net.ipv4.route.min_pmtu sysctl, and is also
disabled on routes with "mtu lock". In those cases, we shouldn't set
the DF bit.

This patch makes setting the DF bit conditional on the route's MTU
locking state.
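
The conditional can be sketched as below.  sketch_tunnel_frag_off() and
its parameters are hypothetical; only the DF flag value (0x4000 in host
byte order) is taken from the IPv4 header definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_IP_DF 0x4000  /* "don't fragment" flag, host byte order */

/* Fragment-offset field for the outer header of a tunnelled packet. */
static uint16_t sketch_tunnel_frag_off(bool want_df, bool mtu_locked)
{
	/* With a locked MTU there is no PMTU discovery, so never set DF. */
	if (mtu_locked)
		return 0;
	return want_df ? SKETCH_IP_DF : 0;
}
```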

This issue seems to be older than git history.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMAINTAINERS: Add entry for CAKE qdisc
Toke Høiland-Jørgensen [Fri, 16 Nov 2018 20:13:59 +0000 (12:13 -0800)]
MAINTAINERS: Add entry for CAKE qdisc

We would like the existing community to be kept in the loop for any new
developments on CAKE; and I certainly plan to keep maintaining it. Reflect
this in MAINTAINERS.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: bridge: fix vlan stats use-after-free on destruction
Nikolay Aleksandrov [Fri, 16 Nov 2018 16:50:01 +0000 (18:50 +0200)]
net: bridge: fix vlan stats use-after-free on destruction

Syzbot reported a use-after-free of the global vlan context on port vlan
destruction. When I added per-port vlan stats I missed the fact that the
global vlan context can be freed before the per-port vlan rcu callback.
There are a few different ways to deal with this; I've chosen to add a
new private flag that is set only when per-port stats are allocated so
we can directly check it on destruction without dereferencing the global
context at all. The new field in net_bridge_vlan uses a hole.

v2: cosmetic change, move the check to br_process_vlan_info where the
    other checks are done
v3: add change log in the patch, add private (in-kernel only) flags in a
    hole in net_bridge_vlan struct and use that instead of mixing
    user-space flags with private flags
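
The private-flag approach can be sketched in userspace C.  The types,
the flag name and the helpers below are hypothetical; the sketch only
shows that the destruction path decides from the vlan's own flag and
never looks at the global context.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define SKETCH_PRIV_PERPORT_STATS 0x1  /* hypothetical private flag */

struct sketch_stats { unsigned long rx, tx; };

struct sketch_vlan {
	struct sketch_stats *stats;   /* own, or borrowed from the global vlan */
	unsigned int priv_flags;      /* in-kernel only, never user-visible */
};

static int sketch_vlan_init(struct sketch_vlan *v, struct sketch_stats *global,
			    bool per_port)
{
	v->priv_flags = 0;
	v->stats = global;
	if (per_port) {
		v->stats = calloc(1, sizeof(*v->stats));
		if (!v->stats)
			return -1;
		v->priv_flags |= SKETCH_PRIV_PERPORT_STATS;
	}
	return 0;
}

/* Decide from the flag alone; the global context may already be gone. */
static bool sketch_vlan_destroy(struct sketch_vlan *v)
{
	bool owned = (v->priv_flags & SKETCH_PRIV_PERPORT_STATS) != 0;

	if (owned)
		free(v->stats);
	v->stats = NULL;
	return owned;
}
```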

Fixes: 9163a0fc1f0c ("net: bridge: add support for per-port vlan stats")
Reported-by: syzbot+04681da557a0e49a52e5@syzkaller.appspotmail.com
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>