git.proxmox.com Git - mirror_ubuntu-eoan-kernel.git/log

bridge: disable softirqs around br_fdb_update to avoid lockup

br_fdb_update() can be called in process context in the following way:
br_fdb_add() -> __br_fdb_add() -> br_fdb_update() (if NTF_USE flag is set)
so we need to disable softirqs because there are softirq users of the
hash_lock. One easy way to reproduce this is to modify the bridge utility
to set NTF_USE, enable stp and then set maxageing to a low value so
br_fdb_cleanup() is called frequently and then just add new entries in
a loop. This happens because br_fdb_cleanup() is called from timer/softirq
context. The spin locks in br_fdb_update were _bh before commit f8ae737deea1
("[BRIDGE]: forwarding remove unneeded preempt and bh diasables")
and at the time that commit was correct because br_fdb_update() couldn't be
called from process context, but that changed after commit:
292d1398983f ("bridge: add NTF_USE support")
Using local_bh_disable/enable around br_fdb_update() allows us to keep
using the spin_lock/unlock in br_fdb_update for the fast-path.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: 292d1398983f ("bridge: add NTF_USE support")
Signed-off-by: David S. Miller <davem@davemloft.net>

Revert "bridge: use _bh spinlock variant for br_fdb_update to avoid lockup"

This reverts commit 1d7c49037b12016e7056b9f2c990380e2187e766.

Nikolay Aleksandrov has a better version of this fix.

Signed-off-by: David S. Miller <davem@davemloft.net>

mpls: fix possible use after free of device

The mpls device is used in an RCU read context without a lock being
held. As the memory is freed without waiting for the RCU grace period
to elapse, the freed memory could still be in use.

Address this by using kfree_rcu to free the memory for the mpls device
after the RCU grace period has elapsed.

Fixes: 03c57747a702 ("mpls: Per-device MPLS state")
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: Replace dma/pci_alloc_coherent() calls with dma_zalloc_coherent()

There are several places in the driver (all in control paths) where
coherent dma memory is being allocated using either dma_alloc_coherent()
or the deprecated pci_alloc_consistent(). All these calls should be
changed to use dma_zalloc_coherent() to avoid uninitialized fields in
data structures backed by this memory.

Reported-by: Joerg Roedel <jroedel@suse.de>
Tested-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@avagotech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: use _bh spinlock variant for br_fdb_update to avoid lockup

br_fdb_update() can be called in process context in the following way:
br_fdb_add() -> __br_fdb_add() -> br_fdb_update() (if NTF_USE flag is set)
so we need to use spin_lock_bh because there are softirq users of the
hash_lock. One easy way to reproduce this is to modify the bridge utility
to set NTF_USE, enable stp and then set maxageing to a low value so
br_fdb_cleanup() is called frequently and then just add new entries in
a loop. This happens because br_fdb_cleanup() is called from timer/softirq
context. These locks were _bh before commit f8ae737deea1
("[BRIDGE]: forwarding remove unneeded preempt and bh diasables")
and at the time that commit was correct because br_fdb_update() couldn't be
called from process context, but that changed after commit:
292d1398983f ("bridge: add NTF_USE support")

Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: 292d1398983f ("bridge: add NTF_USE support")
Signed-off-by: David S. Miller <davem@davemloft.net>

amd-xgbe: Use disable_irq_nosync from within timer function

Since the Tx timer function runs in softirq context the driver needs
to call disable_irq_nosync instead of a disable_irq.

Reported-by: Josh Stone <jistone@redhat.com>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rhashtable: add missing import <linux/export.h>

rhashtable uses EXPORT_SYMBOL_GPL() without importing linux/export.h
directly it is only imported indirectly through some other includes.

Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue

Jeff Kirsher says:

====================
Intel Wired LAN Driver Updates 2015-06-04

This series contains updates to i40e and i40evf.

Anjali provides three fixes, first to resolve a Tx queue hang if mixed
size frags are passed to the driver while using TSO.  There was a corner
case where we needed to linearize but we were not.  Next fixes a bug in
the default configuration which prevented a software bridge loaded on the
PF interface from working correctly because broadcast packets are
incorrectly looped back.  Lastly fixes an NPAR bug when SRIOV is enabled,
where we need to be in VEB mode, not VEPA mode at probe.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

i40e: Make sure to be in VEB mode if SRIOV is enabled at probe

If SRIOV is enabled we need to be in VEB mode not VEPA mode at probe.
This fixes an NPAR bug when SRIOV is enabled in the BIOS.

Change-ID: Ibf006abafd9a0ca3698ec24848cd771cf345cbbc
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Tested-by: Jim Young <james.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: start up in VEPA mode by default

The patch fixes a bug in the default configuration which
prevented a software bridge loaded on the PF interface from
working correctly because broadcast packets are incorrectly
looped back.

Fix the general case, by loading the driver in VEPA mode Until a
VF or VMDq VSI is added. This way loopback on the Main VSI is
turned off until needed and can resolve the issue of unnecessary
reflection for users that do not have VF or VMDq VSIs setup.

The driver must now coordinate the loopback setting for the Flow
Director (FDIR) VSI to make sure it is in sync with the current
VEB or VEPA mode setting.

The user can still switch bridge modes from the bridge commands and
choose to be in VEPA mode with VF VSIs. Because of hardware
requirements, the call to switch to VEB mode when no VF/VMDqs are
present will be rejected.

NOTE: This patch uses BIT_ULL as that is preferred going forward,
a followup patch in the lower priority queue to net-next will fix
up the remaining 1 << usages.

Change-ID: Ib121ddb18fe4b3c4f52e9deda6fcbeb9105683d1
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Jim Young <james.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e/i40evf: Fix mixed size frags and linearization

This patch fixes a bug where the i40e Tx queue will hang if this
skb is passed to the driver.

With mixed size fragments while using TSO there was a corner case
where we needed to linearize but we were not. This was seen with
iSCSI traffic and could be reproduced with a frag list that looks
like this:

num_frags = 17, gso_segs = 17, hdr_len = 66,
skb_shinfo(skb)->gso_size = 1448
size = 3002, j = 1, frag_size = 2936, num_frags = 17
size = 4268, j = 1, frag_size = 4096, num_frags = 16
size = 5534, j = 1, frag_size = 4096, num_frags = 15
size = 5352, j = 1, frag_size = 4096, num_frags = 14
size = 5170, j = 1, frag_size = 4096, num_frags = 13
size = 3468, j = 1, frag_size = 2576, num_frags = 12
size = 750, j = 1, frag_size = 112, num_frags = 11
size = 862, j = 2, frag_size = 112, num_frags = 10
size = 974, j = 3, frag_size = 112, num_frags = 9
size = 1126, j = 4, frag_size = 152, num_frags = 8
size = 1330, j = 5, frag_size = 204, num_frags = 7
size = 1534, j = 6, frag_size = 204, num_frags = 6
size = 356, j = 1, frag_size = 204, num_frags = 5
size = 560, j = 2, frag_size = 204, num_frags = 4
size = 764, j = 3, frag_size = 204, num_frags = 3
size = 968, j = 4, frag_size = 204, num_frags = 2
size = 1140, j = 5, frag_size = 172, num_frags = 1
result: linearize = 0, j = 6

Change-ID: I79bb1aeab0af255fe2ce28e93672a85d85bf47e8
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ipv4/udp: Verify multicast group is ours in upd_v4_early_demux()

421b3885bf6d56391297844f43fb7154a6396e12 "udp: ipv4: Add udp early
demux" introduced a regression that allowed sockets bound to INADDR_ANY
to receive packets from multicast groups that the socket had not joined.
For example a socket that had joined 224.168.2.9 could also receive
packets from 225.168.2.9 despite not having joined that group if
ip_early_demux is enabled.

Fix this by calling ip_check_mc_rcu() in udp_v4_early_demux() to verify
that the multicast packet is indeed ours.

Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Reported-by: Yurij M. Plotnikov <Yurij.Plotnikov@oktetlabs.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>

openvswitch: disable LRO

Currently, openvswitch tries to disable LRO from the user space. This does
not work correctly when the device added is a vlan interface, though.
Instead of dealing with possibly complex stacked cross name space relations
in the user space, do the same as bridging does and call dev_disable_lro in
the kernel.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Flavio Leitner <fbl@redhat.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

s390/bpf: fix bpf frame pointer setup

Currently the bpf frame pointer is set to the old r15. This is
wrong because of packed stack. Fix this and adjust the frame pointer
to respect packed stack. This now generates a prolog like the following:

3ff8001c3fa: eb67f0480024   stmg    %r6,%r7,72(%r15)
3ff8001c400: ebcff0780024   stmg    %r12,%r15,120(%r15)
3ff8001c406: b904001f       lgr     %r1,%r15      <- load backchain
3ff8001c40a: 41d0f048       la      %r13,72(%r15) <- load adjusted bfp
3ff8001c40e: a7fbfd98       aghi    %r15,-616
3ff8001c412: e310f0980024   stg     %r1,152(%r15) <- save backchain

Fixes: 054623105728 ("s390/bpf: Add s390x eBPF JIT compiler backend")
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

s390/bpf: fix stack allocation

On s390x we have to provide 160 bytes stack space before we can call
the next function. From the 160 bytes that we got from the previous
function we only use 11 * 8 bytes and have 160 - 11 * 8 bytes left.
Currently for BPF we allocate additional 160 - 11 * 8 bytes for the
next function. This is wrong because then the next function only gets:

(160 - 11 * 8) + (160 - 11 * 8) = 2 * 72 = 144 bytes

Fix this and allocate enough memory for the next function.

Fixes: 054623105728 ("s390/bpf: Add s390x eBPF JIT compiler backend")
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Pull networking fixes from David Miller:

1) Various VTI tunnel (mark handling, PMTU) bug fixes from Alexander
    Duyck and Steffen Klassert.

2) Revert ethtool PHY query change, it wasn't correct.  The PHY address
    selected by the driver running the PHY to MAC connection decides
    what PHY address GET ethtool operations return information from.

3) Fix handling of sequence number bits for encryption IV generation in
    ESP driver, from Herbert Xu.

4) UDP can return -EAGAIN when we hit a bad checksum on receive, even
    when there are other packets in the receive queue which is wrong.
    Just respect the error returned from the generic socket recv
    datagram helper.  From Eric Dumazet.

5) Fix BNA driver firmware loading on big-endian systems, from Ivan
    Vecera.

6) Fix regression in that we were inheriting the congestion control of
    the listening socket for new connections, the intended behavior
    always was to use the default in this case.  From Neal Cardwell.

7) Fix NULL deref in brcmfmac driver, from Arend van Spriel.

8) OTP parsing fix in iwlwifi from Liad Kaufman.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (26 commits)
  vti6: Add pmtu handling to vti6_xmit.
  Revert "net: core: 'ethtool' issue with querying phy settings"
  bnx2x: Move statistics implementation into semaphores
  xen: netback: read hotplug script once at start of day.
  xen: netback: fix printf format string warning
  Revert "netfilter: ensure number of counters is >0 in do_replace()"
  net: dsa: Properly propagate errors from dsa_switch_setup_one
  tcp: fix child sockets to use system default congestion control if not set
  udp: fix behavior of wrong checksums
  sfc: free multiple Rx buffers when required
  bna: fix soft lock-up during firmware initialization failure
  bna: remove unreasonable iocpf timer start
  bna: fix firmware loading on big-endian machines
  bridge: fix br_multicast_query_expired() bug
  via-rhine: Resigning as maintainer
  brcmfmac: avoid null pointer access when brcmf_msgbuf_get_pktid() fails
  mac80211: Fix mac80211.h docbook comments
  iwlwifi: nvm: fix otp parsing in 8000 hw family
  iwlwifi: pcie: fix tracking of cmd_in_flight
  ip_vti/ip6_vti: Preserve skb->mark after rcv_cb call
  ...

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc

Pull Sparc fixes from David Miller:

1) Setup the core/threads/sockets bitmaps correctly so that 'lscpus'
    and friends operate properly.  Frtom Chris Hyser.

2) The bit that normally means "Cached Virtually" on sun4v systems,
    actually changes meaning in M7 and later chips.  Fix from Khalid
    Aziz.

3) One some PCI-E systems we need to probe different OF properties to
    fill in the PCI slot information properly, from Eric Snowberg.

4) Kill an extraneous memset after kzalloc(), from Christophe Jaillet.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
  sparc: Resolve conflict between sparc v9 and M7 on usage of bit 9 of TTE
  sparc64: pci slots information is not populated in sysfs
  sparc: kernel: GRPCI2: Remove a useless memset
  sparc64: Setup sysfs to mark LDOM sockets, cores and threads correctly

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio fix from Michael Tsirkin:
"Last-minute virtio fix for 4.1

  This tweaks an exported user-space header to fix build breakage for
  userspace using it"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  include/uapi/linux/virtio_balloon.h: include linux/virtio_types.h

Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf

Pablo Neira Ayuso says:

====================
Netfilter fix for net

The following patch reverts the ebtables chunk that enforces counters that was
introduced in the recently applied d26e2c9ffa38 ('Revert "netfilter: ensure
number of counters is >0 in do_replace()"') since this breaks ebtables.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'wireless-drivers-for-davem-2015-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers

Kalle Valo says:

====================
iwlwifi:

* fix OTP parsing 8260
* fix powersave handling for 8260

brcmfmac:

* fix null pointer crash
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

vti6: Add pmtu handling to vti6_xmit.

We currently rely on the PMTU discovery of xfrm.
However if a packet is localy sent, the PMTU mechanism
of xfrm tries to to local socket notification what
might not work for applications like ping that don't
check for this. So add pmtu handling to vti6_xmit to
report MTU changes immediately.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Revert "net: core: 'ethtool' issue with querying phy settings"

This reverts commit f96dee13b8e10f00840124255bed1d8b4c6afd6f.

It isn't right, ethtool is meant to manage one PHY instance
per netdevice at a time, and this is selected by the SET
command. Therefore by definition the GET command must only
return the settings for the configured and selected PHY.

Reported-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnx2x: Move statistics implementation into semaphores

Commit dff173de84958 ("bnx2x: Fix statistics locking scheme") changed the
bnx2x locking around statistics state into using a mutex - but the lock
is being accessed via a timer which is forbidden.

[If compiled with CONFIG_DEBUG_MUTEXES, logs show a warning about
accessing the mutex in interrupt context]

This moves the implementation into using a semaphore [with size '1']
instead.

Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

xen: netback: read hotplug script once at start of day.

When we come to tear things down in netback_remove() and generate the
uevent it is possible that the xenstore directory has already been
removed (details below).

In such cases netback_uevent() won't be able to read the hotplug
script and will write a xenstore error node.

A recent change to the hypervisor exposed this race such that we now
sometimes lose it (where apparently we didn't ever before).

Instead read the hotplug script configuration during setup and use it
for the lifetime of the backend device.

The apparently more obvious fix of moving the transition to
state=Closed in netback_remove() to after the uevent does not work
because it is possible that we are already in state=Closed (in
reaction to the guest having disconnected as it shutdown). Being
already in Closed means the toolstack is at liberty to start tearing
down the xenstore directories. In principal it might be possible to
arrange to unregister the device sooner (e.g on transition to Closing)
such that xenstore would still be there but this state machine is
fragile and prone to anger...

A modern Xen system only relies on the hotplug uevent for driver
domains, when the backend is in the same domain as the toolstack it
will run the necessary setup/teardown directly in the correct sequence
wrt xenstore changes.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

xen: netback: fix printf format string warning

drivers/net/xen-netback/netback.c: In function ‘xenvif_tx_build_gops’:
drivers/net/xen-netback/netback.c:1253:8: warning: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 5 has type ‘int’ [-Wformat=]
(txreq.offset&~PAGE_MASK) + txreq.size);
^

PAGE_MASK's type can vary by arch, so a cast is needed.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
----
v2: Cast to unsigned long, since PAGE_MASK can vary by arch.
Acked-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Revert "netfilter: ensure number of counters is >0 in do_replace()"

This partially reverts commit 1086bbe97a07 ("netfilter: ensure number of
counters is >0 in do_replace()") in net/bridge/netfilter/ebtables.c.

Setting rules with ebtables does not work any more with 1086bbe97a07 place.

There is an error message and no rules set in the end.

e.g.

~# ebtables -t nat -A POSTROUTING --src 12:34:56:78:9a:bc -j DROP
Unable to update the kernel. Two possible causes:
1. Multiple ebtables programs were executing simultaneously. The ebtables
userspace tool doesn't by default support multiple ebtables programs
running

Reverting the ebtables part of 1086bbe97a07 makes this work again.

Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

include/uapi/linux/virtio_balloon.h: include linux/virtio_types.h

Fixes userspace compilation error:

error: unknown type name ‘__virtio16’
__virtio16 tag;

Signed-off-by: Mikko Rapeli <mikko.rapeli@iki.fi>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

sparc: Resolve conflict between sparc v9 and M7 on usage of bit 9 of TTE

sparc: Resolve conflict between sparc v9 and M7 on usage of bit 9 of TTE

Bit 9 of TTE is CV (Cacheable in V-cache) on sparc v9 processor while
the same bit 9 is MCDE (Memory Corruption Detection Enable) on M7
processor. This creates a conflicting usage of the same bit. Kernel
sets TTE.cv bit on all pages for sun4v architecture which works well
for sparc v9 but enables memory corruption detection on M7 processor
which is not the intent. This patch adds code to determine if kernel
is running on M7 processor and takes steps to not enable memory
corruption detection in TTE erroneously.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sparc64: pci slots information is not populated in sysfs

Add PCI slot numbers within sysfs for PCIe hardware. Larger
PCIe systems with nested PCI bridges and slots further
down on these bridges were not being populated within sysfs.
This will add ACPI style PCI slot numbers for these systems
since the OF 'slot-names' information is not available on
all PCIe platforms.

Signed-off-by: Eric Snowberg <eric.snowberg@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sparc: kernel: GRPCI2: Remove a useless memset

grpci2priv is allocated using kzalloc, so there is no need to memset it.

Signed-off-by: Christophe Jaillet <christophe.jaillet@wanadoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: Properly propagate errors from dsa_switch_setup_one

While shuffling some code around, dsa_switch_setup_one() was introduced,
and it was modified to return either an error code using ERR_PTR() or a
NULL pointer when running out of memory or failing to setup a switch.

This is a problem for its caler: dsa_switch_setup() which uses IS_ERR()
and expects to find an error code, not a NULL pointer, so we still try
to proceed with dsa_switch_setup() and operate on invalid memory
addresses. This can be easily reproduced by having e.g: the bcm_sf2
driver built-in, but having no such switch, such that drv->setup will
fail.

Fix this by using PTR_ERR() consistently which is both more informative
and avoids for the caller to use IS_ERR_OR_NULL().

Fixes: df197195a5248 ("net: dsa: split dsa_switch_setup into two functions")
Reported-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: fix child sockets to use system default congestion control if not set

Linux 3.17 and earlier are explicitly engineered so that if the app
doesn't specifically request a CC module on a listener before the SYN
arrives, then the child gets the system default CC when the connection
is established. See tcp_init_congestion_control() in 3.17 or earlier,
which says "if no choice made yet assign the current value set as
default". The change ("net: tcp: assign tcp cong_ops when tcp sk is
created") altered these semantics, so that children got their parent
listener's congestion control even if the system default had changed
after the listener was created.

This commit returns to those original semantics from 3.17 and earlier,
since they are the original semantics from 2007 in 4d4d3d1e8 ("[TCP]:
Congestion control initialization."), and some Linux congestion
control workflows depend on that.

In summary, if a listener socket specifically sets TCP_CONGESTION to
"x", or the route locks the CC module to "x", then the child gets
"x". Otherwise the child gets current system default from
net.ipv4.tcp_congestion_control. That's the behavior in 3.17 and
earlier, and this commit restores that.

Fixes: 55d8694fa82c ("net: tcp: assign tcp cong_ops when tcp sk is created")
Cc: Florian Westphal <fw@strlen.de>
Cc: Daniel Borkmann <dborkman@redhat.com>
Cc: Glenn Judd <glenn.judd@morganstanley.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

udp: fix behavior of wrong checksums

We have two problems in UDP stack related to bogus checksums :

1) We return -EAGAIN to application even if receive queue is not empty.
This breaks applications using edge trigger epoll()

2) Under UDP flood, we can loop forever without yielding to other
processes, potentially hanging the host, especially on non SMP.

This patch is an attempt to make things better.

We might in the future add extra support for rt applications
wanting to better control time spent doing a recv() in a hostile
environment. For example we could validate checksums before queuing
packets in socket receive queue.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Linux 4.1-rc6

sfc: free multiple Rx buffers when required

When Rx packet data must be dropped, all the buffers
associated with that Rx packet must be freed. Extend
and rename efx_free_rx_buffer() to efx_free_rx_buffers()
and loop through all the fragments.
By doing so this patch fixes a possible memory leak.

Signed-off-by: Shradha Shah <sshah@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs fix from Al Viro:
"Off-by-one in d_walk()/__dentry_kill() race fix.

  It's very hard to hit; possible in the same conditions as the original
  bug, except that you need the skipped branch to contain all the
  remaining evictables, so that the d_walk()-calling loop in
  d_invalidate() decides there's nothing more to do and doesn't go for
  another pass - otherwise that next pass will sweep the sucker.

  So it's not too urgent, but seeing that the fix is obvious and the
  original commit has spread into all -stable branches..."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  d_walk() might skip too much

Merge branch 'fixes' of git://ftp.arm.linux.org.uk/~rmk/linux-arm

Pull ARM fixes from Russell King:
"Three fixes this time around:

   - fix a memory leak which occurs when probing performance monitoring
     unit interrupts

   - fix handling of non-PMD aligned end of RAM causing boot failures

   - fix missing syscall trace exit path with syscall tracing enabled
     causing a kernel oops in the audit code"

* 'fixes' of git://ftp.arm.linux.org.uk/~rmk/linux-arm:
  ARM: 8357/1: perf: fix memory leak when probing PMU PPIs
  ARM: fix missing syscall trace exit
  ARM: 8356/1: mm: handle non-pmd-aligned end of RAM

Merge branch 'upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/ralf/linux

Pull MIPS fixes from Ralf Baechle:
"MIPS fixes for 4.1 all across the tree"

* 'upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/ralf/linux:
  MIPS: strnlen_user.S: Fix a CPU_DADDI_WORKAROUNDS regression
  MIPS: BMIPS: Fix bmips_wr_vec()
  MIPS: ath79: fix build problem if CONFIG_BLK_DEV_INITRD is not set
  MIPS: Fuloong 2E: Replace CONFIG_USB_ISP1760_HCD by CONFIG_USB_ISP1760
  MIPS: irq: Use DECLARE_BITMAP
  ttyFDC: Fix to use native endian MMIO reads
  MIPS: Fix CDMM to use native endian MMIO reads

Merge branch 'turbostat' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux

Pull turbostat tool fixes from Len Brown:
"Just one minor kernel dependency in this batch -- added a #define to
  msr-index.h"

* 'turbostat' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
  tools/power turbostat: update version number to 4.7
  tools/power turbostat: allow running without cpu0
  tools/power turbostat: correctly decode of ENERGY_PERFORMANCE_BIAS
  tools/power turbostat: enable turbostat to support Knights Landing (KNL)
  tools/power turbostat: correctly display more than 2 threads/core

Merge git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending

Pull SCSI target fixes from Nicholas Bellinger:
"These are mostly minor fixes, with the exception of the following that
  address fall-out from recent v4.1-rc1 changes:

   - regression fix related to the big fabric API registration changes
     and configfs_depend_item() usage, that required cherry-picking one
     of HCH's patches from for-next to address the issue for v4.1 code.

   - remaining TCM-USER -v2 related changes to enforce full CDB
     passthrough from Andy + Ilias.

  Also included is a target_core_pscsi driver fix from Andy that
  addresses a long standing issue with a Scsi_Host reference being
  leaked on PSCSI device shutdown"

* git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
  iser-target: Fix error path in isert_create_pi_ctx()
  target: Use a PASSTHROUGH flag instead of transport_types
  target: Move passthrough CDB parsing into a common function
  target/user: Only support full command pass-through
  target/user: Update example code for new ABI requirements
  target/pscsi: Don't leak scsi_host if hba is VIRTUAL_HOST
  target: Fix se_tpg_tfo->tf_subsys regression + remove tf_subsystem
  target: Drop signal_pending checks after interruptible lock acquire
  target: Add missing parentheses
  target: Fix bidi command handling
  target/user: Disallow full passthrough (pass_level=0)
  ISCSI: fix minor memory leak

Merge tag 'hwmon-for-linus-v4.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging

Pull hwmon fixes from Guenter Roeck:
"Some late hwmon patches, all headed for -stable

   - fix sysfs attribute initialization in nct6775 and nct6683 drivers

   - do not attempt to auto-detect tmp435 on I2C address 0x37

   - ensure iio channel is of type IIO_VOLTAGE in ntc_thermistor driver"

* tag 'hwmon-for-linus-v4.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
  hwmon: (nct6683) Add missing sysfs attribute initialization
  hwmon: (nct6775) Add missing sysfs attribute initialization
  hwmon: (tmp401) Do not auto-detect chip on I2C address 0x37
  hwmon: (ntc_thermistor) Ensure iio channel is of type IIO_VOLTAGE

Merge branch 'bna-fixes'

Ivan Vecera says:

====================
bna: misc bugfixes

These patches fix several bugs found during device initialization debugging.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

bna: fix soft lock-up during firmware initialization failure

Bug in the driver initialization causes soft-lockup if firmware
initialization timeout is reached. Polling function bfa_ioc_poll_fwinit()
incorrectly calls bfa_nw_iocpf_timeout() when the timeout is reached.
The problem is that bfa_nw_iocpf_timeout() calls again
bfa_ioc_poll_fwinit()... etc. The bfa_ioc_poll_fwinit() should directly
send timeout event for iocpf and the same should be done if firmware
download into HW fails.

Cc: Rasesh Mody <rasesh.mody@qlogic.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bna: remove unreasonable iocpf timer start

Driver starts iocpf timer prior bnad_ioceth_enable() call and this is
unreasonable. This piece of code probably originates from Brocade/Qlogic
out-of-box driver during initial import into upstream. This driver uses
only one timer and queue to implement multiple timers and this timer is
started at this place. The upstream driver uses multiple timers instead
of this.

Cc: Rasesh Mody <rasesh.mody@qlogic.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bna: fix firmware loading on big-endian machines

Firmware required by bna is stored in appropriate files as sequence
of LE32 integers. After loading by request_firmware() they need to be
byte-swapped on big-endian arches. Without this conversion the NIC
is unusable on big-endian machines.

Cc: Rasesh Mody <rasesh.mody@qlogic.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'mac80211-for-davem-2015-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
This just has a single docbook build fix. In my confusion
I'd already sent the same fix for -next, but Ben Hutchings
noted it's necessary in 4.1.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: fix br_multicast_query_expired() bug

br_multicast_query_expired() querier argument is a pointer to
a struct bridge_mcast_querier :

struct bridge_mcast_querier {
struct br_ip addr;
struct net_bridge_port __rcu *port;
};

Intent of the code was to clear port field, not the pointer to querier.

Fixes: 2cd4143192e8 ("bridge: memorize and export selected IGMP/MLD querier port")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
Acked-by: Linus Lüssing <linus.luessing@c0d3.blue>
Cc: Linus Lüssing <linus.luessing@web.de>
Cc: Steinar H. Gunderson <sesse@samfundet.no>
Signed-off-by: David S. Miller <davem@davemloft.net>

iser-target: Fix error path in isert_create_pi_ctx()

We don't assign pi_ctx to desc->pi_ctx until we're certain to succeed
in the function. That means the cleanup path should use the local
pi_ctx variable, not desc->pi_ctx.

This was detected by Coverity (CID 1260062).

Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: Use a PASSTHROUGH flag instead of transport_types

It seems like we only care if a transport is passthrough or not. Convert
transport_type to a flags field and replace TRANSPORT_PLUGIN_* with a
flag, TRANSPORT_FLAG_PASSTHROUGH.

Signed-off-by: Andy Grover <agrover@redhat.com>
Reviewed-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: Move passthrough CDB parsing into a common function

Aside from whether they handle BIDI ops or not, parsing of the CDB by
kernel and user SCSI passthrough modules should be identical. Move this
into a new passthrough_parse_cdb() and call it from tcm-pscsi and tcm-user.

Reported-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Andy Grover <agrover@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target/user: Only support full command pass-through

After much discussion, give up on only passing a subset of SCSI commands
to userspace and pass them all. Based on what pscsi is doing, make sure
to set SCF_SCSI_DATA_CDB for I/O ops, and define attributes identical to
pscsi.

Make hw_block_size configurable via dev param.

Remove mention of command filtering from tcmu-design.txt.

Signed-off-by: Andy Grover <agrover@redhat.com>
Reviewed-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target/user: Update example code for new ABI requirements

We now require that the userspace handler set a bit if the command is not
handled.

Update calls to tcmu_hdr_get_op for v2.

Signed-off-by: Andy Grover <agrover@redhat.com>
Reviewed-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target/pscsi: Don't leak scsi_host if hba is VIRTUAL_HOST

See https://bugzilla.redhat.com/show_bug.cgi?id=1025672

We need to put() the reference to the scsi host that we got in
pscsi_configure_device(). In VIRTUAL_HOST mode it is associated with
the dev_virt, not the hba_virt.

Signed-off-by: Andy Grover <agrover@redhat.com>
Cc: stable@vger.kernel.org # 2.6.38+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: Fix se_tpg_tfo->tf_subsys regression + remove tf_subsystem

There is just one configfs subsystem in the target code, so we might as
well add two helpers to reference / unreference it from the core code
instead of passing pointers to it around.

This fixes a regression introduced for v4.1-rc1 with commit 9ac8928e6,
where configfs_depend_item() callers using se_tpg_tfo->tf_subsys would
fail, because the assignment from the original target_core_subsystem[]
is no longer happening at target_register_template() time.

(Fix target_core_exit_configfs pointer dereference - Sagi)

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Himanshu Madhani <himanshu.madhani@qlogic.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

hwmon: (nct6683) Add missing sysfs attribute initialization

The following error message is seen when loading the nct6683 driver
with DEBUG_LOCK_ALLOC enabled.

BUG: key ffff88040b2f0030 not in .data!
------------[ cut here ]------------
WARNING: CPU: 0 PID: 186 at kernel/locking/lockdep.c:2988
lockdep_init_map+0x469/0x630()
DEBUG_LOCKS_WARN_ON(1)

Caused by a missing call to sysfs_attr_init() when initializing
sysfs attributes.

Reported-by: Alexey Orishko <alexey.orishko@gmail.com>
Reviewed-by: Jean Delvare <jdelvare@suse.de>
Cc: stable@vger.kernel.org # v3.18+
Signed-off-by: Guenter Roeck <linux@roeck-us.net>

hwmon: (nct6775) Add missing sysfs attribute initialization

The following error message is seen when loading the nct6775 driver
with DEBUG_LOCK_ALLOC enabled.

BUG: key ffff88040b2f0030 not in .data!
------------[ cut here ]------------
WARNING: CPU: 0 PID: 186 at kernel/locking/lockdep.c:2988
lockdep_init_map+0x469/0x630()
DEBUG_LOCKS_WARN_ON(1)

Caused by a missing call to sysfs_attr_init() when initializing
sysfs attributes.

Reported-by: Alexey Orishko <alexey.orishko@gmail.com>
Reviewed-by: Jean Delvare <jdelvare@suse.de>
Cc: stable@vger.kernel.org # v3.12+
Signed-off-by: Guenter Roeck <linux@roeck-us.net>

Merge tag 'acpi-pci-4.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull PCI / ACPI fix from Rafael Wysocki:
"This fixes a bug uncovered by a recent driver core change that
  modified the implementation of the ACPI_COMPANION_SET() macro to
  strictly rely on its second argument to be either NULL or a valid
  pointer to struct acpi_device.

  As it turns out, pcibios_root_bridge_prepare() on x86 and ia64 works
  with the assumption that the only code path calling pci_create_root_bus()
  is pci_acpi_scan_root() and therefore the sysdata argument passed to
  it will always match the expectations of pcibios_root_bridge_prepare().

  That need not be the case, however, and in particular it is not the
  case for the Xen pcifront driver that passes a pointer to its own
  private data strcture as sysdata to pci_scan_bus_parented() which then
  passes it to pci_create_root_bus() and it ends up being used incorrectly
  by pcibios_root_bridge_prepare()"

* tag 'acpi-pci-4.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PCI / ACPI: Do not set ACPI companions for host bridges with parents

Merge tag 'xfs-for-linus-4.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pull xfs fixes from Dave Chinner:
"This is a little larger than I'd like late in the release cycle, but
  all the fixes are for regressions introduced in the 4.1-rc1 merge, or
  are needed back in -stable kernels fairly quickly as they are
  filesystem corruption or userspace visible correctness issues.

  Changes in this update:

   - regression fix for new rename whiteout code

   - regression fixes for new superblock generic per-cpu counter code

   - fix for incorrect error return sign introduced in 3.17

   - metadata corruption fixes that need to go back to -stable kernels"

* tag 'xfs-for-linus-4.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
  xfs: fix broken i_nlink accounting for whiteout tmpfile inode
  xfs: xfs_iozero can return positive errno
  xfs: xfs_attr_inactive leaves inconsistent attr fork state behind
  xfs: extent size hints can round up extents past MAXEXTLEN
  xfs: inode and free block counters need to use __percpu_counter_compare
  percpu_counter: batch size aware __percpu_counter_compare()
  xfs: use percpu_counter_read_positive for mp->m_icount

Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc

Pull ARM SoC fixes from Arnd Bergmann:
"Two weeks worth of small bug fixes this time, nothing sticking out
  this time:

   - one defconfig change to adapt to a modified Kconfig symbol

   - two fixes for i.MX for backwards compatibility with older DT files
     that was accidentally broken

   - one regression fix for irq handling on pxa

   - three small dt files on omap, and one each for imx and exynos"

* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
  ARM: multi_v7_defconfig: Replace CONFIG_USB_ISP1760_HCD by CONFIG_USB_ISP1760
  ARM: imx6: gpc: don't register power domain if DT data is missing
  ARM: imx6: allow booting with old DT
  ARM: dts: set display clock correctly for exynos4412-trats2
  ARM: pxa: pxa_cplds: signedness bug in probe
  ARM: dts: Fix WLAN interrupt line for AM335x EVM-SK
  ARM: dts: omap3-devkit8000: Fix NAND DT node
  ARM: dts: am335x-boneblack: disable RTC-only sleep
  ARM: dts: fix imx27 dtb build rule
  ARM: dts: imx27: only map 4 Kbyte for fec registers

Merge tag 'dm-4.1-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device-mapper fixes from Mike Snitzer:
"Quite a few fixes for DM's blk-mq support thanks to extra DM multipath
  testing from Junichi Nomura and Bart Van Assche.

  Also fix a casting bug in dm_merge_bvec() that could cause only a
  single page to be added to a bio (Joe identified this while testing
  dm-cache writeback)"

* tag 'dm-4.1-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: fix casting bug in dm_merge_bvec()
  dm: fix reload failure of 0 path multipath mapping on blk-mq devices
  dm: fix false warning in free_rq_clone() for unmapped requests
  dm: requeue from blk-mq dm_mq_queue_rq() using BLK_MQ_RQ_QUEUE_BUSY
  dm mpath: fix leak of dm_mpath_io structure in blk-mq .queue_rq error path
  dm: fix NULL pointer when clone_and_map_rq returns !DM_MAPIO_REMAPPED
  dm: run queue on re-queue

hwmon: (tmp401) Do not auto-detect chip on I2C address 0x37

I2C address 0x37 may be used by EEPROMs, which can result in false
positives. Do not attempt to detect a chip at this address.

Reviewed-by: Jean Delvare <jdelvare@suse.de>
Cc: stable@vger.kernel.org # v4.0+
Signed-off-by: Guenter Roeck <linux@roeck-us.net>

Merge branch 'akpm' (patches from Andrew)

Merge misc fixes from Andrew Morton:
"10 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  scripts/gdb: fix lx-lsmod refcnt
  omfs: fix potential integer overflow in allocator
  omfs: fix sign confusion for bitmap loop counter
  omfs: set error return when d_make_root() fails
  fs, omfs: add NULL terminator in the end up the token list
  MAINTAINERS: update CAPABILITIES pattern
  fs/binfmt_elf.c:load_elf_binary(): return -EINVAL on zero-length mappings
  tracing/mm: don't trace mm_page_pcpu_drain on offline cpus
  tracing/mm: don't trace mm_page_free on offline cpus
  tracing/mm: don't trace kmem_cache_free on offline cpus

Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux

Pull fixes for cpumask and modules from Rusty Russell:
"** NOW WITH TESTING! **

  Two fixes which got lost in my recent distraction.  One is a weird
  cpumask function which needed to be rewritten, the other is a module
  bug which is cc:stable"

* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
  cpumask_set_cpu_local_first => cpumask_local_spread, lament
  module: Call module notifier on failure after complete_formation()

MIPS: strnlen_user.S: Fix a CPU_DADDI_WORKAROUNDS regression

Correct a regression introduced with 8453eebd [MIPS: Fix strnlen_user()
return value in case of overlong strings.] causing assembler warnings
and broken code generated in __strnlen_kernel_nocheck_asm:

arch/mips/lib/strnlen_user.S: Assembler messages:
arch/mips/lib/strnlen_user.S:64: Warning: Macro instruction expanded into multiple instructions in a branch delay slot

with the CPU_DADDI_WORKAROUNDS option set, resulting in the function
looping indefinitely upon mounting NFS root.

Use conditional assembly to avoid a microMIPS code size regression.
Using $at unconditionally would cause such a regression as there are no
16-bit instruction encodings available for ALU operations using this
register. Using $v1 unconditionally would produce short microMIPS
encodings, but would prevent this register from being used across calls
to this function.

The extra LI operation introduced is free, replacing a NOP originally
scheduled into the delay slot of the branch that follows.

Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/10205/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>

MIPS: BMIPS: Fix bmips_wr_vec()

bmips_wr_vec() copies exception vector code from start to dst.

The call to dma_cache_wback() needs to flush (end-start) bytes,
starting at dst, from write-back cache to memory.

Signed-off-by: Petri Gynther <pgynther@google.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Kevin Cernekee <cernekee@gmail.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/10193/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>

MIPS: ath79: fix build problem if CONFIG_BLK_DEV_INITRD is not set

initrd_start is defined in init/do_mounts_initrd.c, which is only
included in kernel if CONFIG_BLK_DEV_INITRD=y.

Signed-off-by: Laurent Fasnacht <l@libres.ch>
Cc: linux-mips@linux-mips.org
Cc: trivial@kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/10198/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>

Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux

Pull drm fixes from Dave Airlie:
"This is made up 4 groups of fixes detailed below.

  vgem:
      Due to some misgivings about possible bad use cases this allow,
      backout a chunk of the interface to stop those use cases for now.

  radeon:
      Fix for an oops regression in the audio code, and a partial revert
      for a fix that was cauing problems.

  nouveau:
      regression fix for Fermi, and display-less Maxwell boot fixes.

  drm core:
      a fix for i915 cursor vblank waiting in the atomic helpers"

* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
  drm/nouveau/gr/gm204: remove a stray printk
  drm/nouveau/devinit/gm100-: force devinit table execution on boards without PDISP
  drm/nouveau/devinit/gf100: make the force-post condition more obvious
  drm/nouveau/gr/gf100-: fix wrong constant definition
  drm/radeon: partially revert "fix VM_CONTEXT*_PAGE_TABLE_END_ADDR handling"
  drm/radeon/audio: make sure connector is valid in hotplug case
  Revert "drm/radeon: only mark audio as connected if the monitor supports it (v3)"
  drm/radeon: don't share plls if monitors differ in audio support
  drm/vgem: drop DRIVER_PRIME (v2)
  drm/plane-helper: Adapt cursor hack to transitional helpers

Merge tag 'sound-4.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
"No big surprise here, just a bunch of small fixes for HD-audio and
  USB-audio:

   - partial revert of widget power-saving for IDT codecs
   - revert mute-LED enum ctl for Thinkpads due to confusion
   - a quirk for a new Radeon HDMI controller
   - Realtek codec name fix for Dell
   - a workaround for headphone mic boost on some laptops
   - stream_pm ops setup (and its fix for regression)
   - another quirk for MS LifeCam USB-audio"

* tag 'sound-4.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: hda - Fix lost sound due to stream_pm ops cleanup
  ALSA: hda - Disable Headphone Mic boost for ALC662
  ALSA: hda - Disable power_save_node for IDT92HD71bxx
  ALSA: hda - Fix noise on AMD radeon 290x controller
  ALSA: hda - Set stream_pm ops automatically by generic parser
  ALSA: hda/realtek - Add ALC256 alias name for Dell
  Revert "ALSA: hda - Add mute-LED mode control to Thinkpad"
  ALSA: usb-audio: Add quirk for MS LifeCam HD-3000

dm: fix casting bug in dm_merge_bvec()

dm_merge_bvec() was originally added in f6fccb ("dm: introduce
merge_bvec_fn").  In that commit a value in sectors is converted to
bytes using << 9, and then assigned to an int.  This code made
assumptions about the value of BIO_MAX_SECTORS.

A later commit 148e51 ("dm: improve documentation and code clarity in
dm_merge_bvec") was meant to have no functional change but it removed
the use of BIO_MAX_SECTORS in favor of using queue_max_sectors().  At
this point the cast from sector_t to int resulted in a zero value.  The
fallout being dm_merge_bvec() would only allow a single page to be added
to a bio.

This interim fix is minimal for the benefit of stable@ because the more
comprehensive cleanup of passing a sector_t to all DM targets' merge
function will impact quite a few DM targets.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.19+

dm: fix reload failure of 0 path multipath mapping on blk-mq devices

dm-multipath accepts 0 path mapping.

  # echo '0 2097152 multipath 0 0 0 0' | dmsetup create newdev

Such a mapping can be used to release underlying devices while still
holding requests in its queue until working paths come back.

However, once the multipath device is created over blk-mq devices,
it rejects reloading of 0 path mapping:

  # echo '0 2097152 multipath 0 0 1 1 queue-length 0 1 1 /dev/sda 1' \
      | dmsetup create mpath1
  # echo '0 2097152 multipath 0 0 0 0' | dmsetup load mpath1
  device-mapper: reload ioctl on mpath1 failed: Invalid argument
  Command failed

With following kernel message:
  device-mapper: ioctl: can't change device type after initial table load.

DM tries to inherit the current table type using dm_table_set_type()
but it doesn't work as expected because of unnecessary check about
whether the target type is hybrid or not.

Hybrid type is for targets that work as either request-based or bio-based
and not required for blk-mq or non blk-mq checking.

Fixes: 65803c205983 ("dm table: train hybrid target type detection to select blk-mq if appropriate")
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

Merge tag 'md/4.1-rc5-fixes' of git://neil.brown.name/md

Pull m,ore md bugfixes gfrom Neil Brown:
"Assorted fixes for new RAID5 stripe-batching functionality.

  Unfortunately this functionality was merged a little prematurely.  The
  necessary testing and code review is now complete (or as complete as
  it can be) and to code passes a variety of tests and looks quite
  sensible.

  Also a fix for some recent locking changes - a race was introduced
  which causes a reshape request to sometimes fail.  No data safety
  issues"

* tag 'md/4.1-rc5-fixes' of git://neil.brown.name/md:
  md: fix race when unfreezing sync_action
  md/raid5: break stripe-batches when the array has failed.
  md/raid5: call break_stripe_batch_list from handle_stripe_clean_event
  md/raid5: be more selective about distributing flags across batch.
  md/raid5: add handle_flags arg to break_stripe_batch_list.
  md/raid5: duplicate some more handle_stripe_clean_event code in break_stripe_batch_list
  md/raid5: remove condition test from check_break_stripe_batch_list.
  md/raid5: Ensure a batch member is not handled prematurely.
  md/raid5: close race between STRIPE_BIT_DELAY and batching.
  md/raid5: ensure whole batch is delayed for all required bitmap updates.

dm: fix false warning in free_rq_clone() for unmapped requests

When stacking request-based dm device on non blk-mq device and
device-mapper target could not map the request (error target is used,
multipath target with all paths down, etc), the WARN_ON_ONCE() in
free_rq_clone() will trigger when it shouldn't.

The warning was added by commit aa6df8d ("dm: fix free_rq_clone() NULL
pointer when requeueing unmapped request"). But free_rq_clone() with
clone->q == NULL is valid usage for the case where
dm_kill_unmapped_request() initiates request cleanup.

Fix this false warning by just removing the WARN_ON -- it only generated
false positives and was never useful in catching the intended case
(completing clone request not being mapped e.g. clone->q being NULL).

Fixes: aa6df8d ("dm: fix free_rq_clone() NULL pointer when requeueing unmapped request")
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reported-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

ARM: multi_v7_defconfig: Replace CONFIG_USB_ISP1760_HCD by CONFIG_USB_ISP1760

Since commit 100832abf065bc18 ("usb: isp1760: Make HCD support
optional"), CONFIG_USB_ISP1760_HCD is automatically selected when
needed. Enabling that option in the defconfig is now a no-op, and no
longer enables ISP1760 HCD support.

Re-enable the ISP1760 driver in the defconfig by enabling
USB_ISP1760_HOST_ROLE instead.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Merge tag 'imx-fixes-4.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux into fixes

Merge "The i.MX fixes for 4.1, 3rd round" from Shawn Guo:

It includes a couple of fixes for i.MX6 GPC code to let the new kernel
be able to boot with old DTBs:

- Booting v4.1-rc kernel with old DTBs will fail with a fat warning
   (require low-level debug to be seen), due to the adoption of stacked
   IRQ domain.  The first fix improves the situation by allowing kernel
   boot up with old DTBs, although suspend/resume still breaks.

- Booting new kernel with old DTBs that do not have power-domain info
   will result in a hang.  The second patch fixes the hang by skipping
   the kernel power-domain registration if DTB has no power-domain info.

* tag 'imx-fixes-4.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux:
  ARM: imx6: gpc: don't register power domain if DT data is missing
  ARM: imx6: allow booting with old DT

Merge tag 'samsung-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/kgene/linux-samsung into fixes

Merge "Samsung fix for v4.1" from Kukjin Kim:

- Set display clock correctly for exynos4412-trats2
  : fix the following error

    exynos-drm: No connectors reported connected with modes
    [drm] Cannot find any crtc or sizes - going 1024x768

* tag 'samsung-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/kgene/linux-samsung:
  ARM: dts: set display clock correctly for exynos4412-trats2

ALSA: hda - Fix lost sound due to stream_pm ops cleanup

The commit [49fb18972581: ALSA: hda - Set stream_pm ops automatically
by generic parser] resulted in regressions on some Realtek and VIA
codecs because these drivers set patch_ops after calling the generic
parser, thus stream_pm got cleared to NULL again. I haven't noticed
since I tested with IDT codec.

Restore (partial revert) the stream_pm ops for them to fix the
regression.

Fixes: 49fb18972581 ('ALSA: hda - Set stream_pm ops automatically by generic parser')
Reported-by: Jeremiah Mahler <jmmahler@gmail.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>

d_walk() might skip too much

when we find that a child has died while we'd been trying to ascend,
we should go into the first live sibling itself, rather than its sibling.

Off-by-one in question had been introduced in "deal with deadlock in
d_walk()" and the fix needs to be backported to all branches this one
has been backported to.

Cc: stable@vger.kernel.org # 3.2 and later
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec

Steffen Klassert says:

====================
pull request (net): ipsec 2015-05-28

1) Fix a race in xfrm_state_lookup_byspi, we need to take
   the refcount before we release xfrm_state_lock.
   From Li RongQing.

2) Fix IV generation on ESN state. We used just the
   low order sequence numbers for IV generation on
   ESN, as a result the IV can repeat on the same
   state. Fix this by using the  high order sequence
   number bits too and make sure to always initialize
   the high order bits with zero. These patches are
   serious stable candidates. Fixes from Herbert Xu.

3) Fix the skb->mark handling on vti. We don't
   reset skb->mark in skb_scrub_packet anymore,
   so vti must care to restore the original
   value back after it was used to lookup the
   vti policy and state. Fixes from Alexander Duyck.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

via-rhine: Resigning as maintainer

I don't have enough time to look after via-rhine anymore.

Signed-off-by: Roger Luethi <rl@hellgate.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

scripts/gdb: fix lx-lsmod refcnt

Commit 2f35c41f58a9 ("module: Replace module_ref with atomic_t refcnt")
changes the way refcnt is handled but did not update the gdb script to
use the new variable.

Since refcnt is not per-cpu anymore, we can directly read its value.

Signed-off-by: Adrien Schildknecht <adrien+dev@schischi.me>
Reviewed-by: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Pantelis Koukousoulas <pktoss@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

omfs: fix potential integer overflow in allocator

Both 'i' and 'bits_per_entry' are signed integers but the result is a
u64 block number. Cast i to u64 to avoid truncation on 32-bit targets.

Found by Coverity (CID 200679).

Signed-off-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

omfs: fix sign confusion for bitmap loop counter

The count variable is used to iterate down to (below) zero from the size
of the bitmap and handle the one-filling the remainder of the last
partial bitmap block. The loop conditional expects count to be signed
in order to detect when the final block is processed, after which count
goes negative.

Unfortunately, a recent change made this unsigned along with some other
related fields. The result of is this is that during mount,
omfs_get_imap will overrun the bitmap array and corrupt memory unless
number of blocks happens to be a multiple of 8 * blocksize.

Fix by changing count back to signed: it is guaranteed to fit in an s32
without overflow due to an enforced limit on the number of blocks in the
filesystem.

Signed-off-by: Bob Copeland <me@bobcopeland.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

omfs: set error return when d_make_root() fails

A static checker found the following issue in the error path for
omfs_fill_super:

fs/omfs/inode.c:552 omfs_fill_super()
warn: missing error code here? 'd_make_root()' failed. 'ret' = '0'

Fix by returning -ENOMEM in this case.

Signed-off-by: Bob Copeland <me@bobcopeland.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fs, omfs: add NULL terminator in the end up the token list

match_token() expects a NULL terminator at the end of the token list so
that it would know where to stop. Not having one causes it to overrun
to invalid memory.

In practice, passing a mount option that omfs didn't recognize would
sometimes panic the system.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Bob Copeland <me@bobcopeland.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

MAINTAINERS: update CAPABILITIES pattern

Commit 1ddd3b4e07a4 ("LSM: Remove unused capability.c") removed the
file, remove the file pattern.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fs/binfmt_elf.c:load_elf_binary(): return -EINVAL on zero-length mappings

load_elf_binary() returns `retval', not `error'.

Fixes: a87938b2e246b81b4fb ("fs/binfmt_elf.c: fix bug in loading of PIE binaries")
Reported-by: James Hogan <james.hogan@imgtec.com>
Cc: Michael Davidson <md@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

tracing/mm: don't trace mm_page_pcpu_drain on offline cpus

Since tracepoints use RCU for protection, they must not be called on
offline cpus.  trace_mm_page_pcpu_drain can be called on an offline cpu
in this scenario caught by LOCKDEP:

     ===============================
     [ INFO: suspicious RCU usage. ]
     4.1.0-rc1+ #9 Not tainted
     -------------------------------
     include/trace/events/kmem.h:265 suspicious rcu_dereference_check() usage!

    other info that might help us debug this:

    RCU used illegally from offline CPU!
    rcu_scheduler_active = 1, debug_locks = 1
     1 lock held by swapper/5/0:
      #0:  (&(&zone->lock)->rlock){..-...}, at: [<c0000000002073b0>] .free_pcppages_bulk+0x70/0x920

    stack backtrace:
     CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.1.0-rc1+ #9
     Call Trace:
       .dump_stack+0x98/0xd4 (unreliable)
       .lockdep_rcu_suspicious+0x108/0x170
       .free_pcppages_bulk+0x60c/0x920
       .free_hot_cold_page+0x208/0x280
       .destroy_context+0x90/0xd0
       .__mmdrop+0x58/0x160
       .idle_task_exit+0xf0/0x100
       .pnv_smp_cpu_kill_self+0x58/0x2c0
       .cpu_die+0x34/0x50
       .arch_cpu_idle_dead+0x20/0x40
       .cpu_startup_entry+0x708/0x7a0
       .start_secondary+0x36c/0x3a0
       start_secondary_prolog+0x10/0x14

Fix this by converting mm_page_pcpu_drain trace point into
TRACE_EVENT_CONDITION where condition is cpu_online(smp_processor_id())

Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

tracing/mm: don't trace mm_page_free on offline cpus

Since tracepoints use RCU for protection, they must not be called on
offline cpus.  trace_mm_page_free can be called on an offline cpu in this
scenario caught by LOCKDEP:

     ===============================
     [ INFO: suspicious RCU usage. ]
     4.1.0-rc1+ #9 Not tainted
     -------------------------------
     include/trace/events/kmem.h:170 suspicious rcu_dereference_check() usage!

    other info that might help us debug this:

    RCU used illegally from offline CPU!
    rcu_scheduler_active = 1, debug_locks = 1
     no locks held by swapper/1/0.

    stack backtrace:
     CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.1.0-rc1+ #9
     Call Trace:
       .dump_stack+0x98/0xd4 (unreliable)
       .lockdep_rcu_suspicious+0x108/0x170
       .free_pages_prepare+0x494/0x680
       .free_hot_cold_page+0x50/0x280
       .destroy_context+0x90/0xd0
       .__mmdrop+0x58/0x160
       .idle_task_exit+0xf0/0x100
       .pnv_smp_cpu_kill_self+0x58/0x2c0
       .cpu_die+0x34/0x50
       .arch_cpu_idle_dead+0x20/0x40
       .cpu_startup_entry+0x708/0x7a0
       .start_secondary+0x36c/0x3a0
       start_secondary_prolog+0x10/0x14

Fix this by converting mm_page_free trace point into TRACE_EVENT_CONDITION
where condition is cpu_online(smp_processor_id())

Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

tracing/mm: don't trace kmem_cache_free on offline cpus

Since tracepoints use RCU for protection, they must not be called on
offline cpus.  trace_kmem_cache_free can be called on an offline cpu in
this scenario caught by LOCKDEP:

    ===============================
    [ INFO: suspicious RCU usage. ]
    4.1.0-rc1+ #9 Not tainted
    -------------------------------
    include/trace/events/kmem.h:148 suspicious rcu_dereference_check() usage!

    other info that might help us debug this:

    RCU used illegally from offline CPU!
    rcu_scheduler_active = 1, debug_locks = 1
    no locks held by swapper/1/0.

    stack backtrace:
    CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.1.0-rc1+ #9
    Call Trace:
      .dump_stack+0x98/0xd4 (unreliable)
      .lockdep_rcu_suspicious+0x108/0x170
      .kmem_cache_free+0x344/0x4b0
      .__mmdrop+0x4c/0x160
      .idle_task_exit+0xf0/0x100
      .pnv_smp_cpu_kill_self+0x58/0x2c0
      .cpu_die+0x34/0x50
      .arch_cpu_idle_dead+0x20/0x40
      .cpu_startup_entry+0x708/0x7a0
      .start_secondary+0x36c/0x3a0
      start_secondary_prolog+0x10/0x14

Fix this by converting kmem_cache_free trace point into
TRACE_EVENT_CONDITION where condition is cpu_online(smp_processor_id())

Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'linux-4.1' of git://anongit.freedesktop.org/git/nouveau/linux-2.6 into drm-fixes

Regression fix for Fermi acceleration, and fixes important to bringing
up display-less Maxwell boards.

* 'linux-4.1' of git://anongit.freedesktop.org/git/nouveau/linux-2.6:
  drm/nouveau/gr/gm204: remove a stray printk
  drm/nouveau/devinit/gm100-: force devinit table execution on boards without PDISP
  drm/nouveau/devinit/gf100: make the force-post condition more obvious
  drm/nouveau/gr/gf100-: fix wrong constant definition

drm/nouveau/gr/gm204: remove a stray printk

Signed-off-by: Ben Skeggs <bskeggs@redhat.com>

drm/nouveau/devinit/gm100-: force devinit table execution on boards without PDISP

Should fix fdo#89558

Signed-off-by: Ben Skeggs <bskeggs@redhat.com>

drm/nouveau/devinit/gf100: make the force-post condition more obvious

And also more generic, so it can be used on newer chipsets.

Signed-off-by: Ben Skeggs <bskeggs@redhat.com>

drm/nouveau/gr/gf100-: fix wrong constant definition

Commit 3740c82590d8 ("drm/nouveau/gr/gf100-: add symbolic names for
classes") introduced a wrong macro definition causing acceleration setup
to fail. Fix it.

Signed-off-by: Lars Seipel <ls@slrz.net>
Fixes: 3740c82590d8 ("drm/nouveau/gr/gf100-: add symbolic names for classes")
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>

Merge branch 'drm-fixes-4.1' of git://people.freedesktop.org/~agd5f/linux into drm-fixes

one more regression fix, partial revert.

* 'drm-fixes-4.1' of git://people.freedesktop.org/~agd5f/linux:
drm/radeon: partially revert "fix VM_CONTEXT*_PAGE_TABLE_END_ADDR handling"

xfs: fix broken i_nlink accounting for whiteout tmpfile inode

XFS uses the internal tmpfile() infrastructure for the whiteout inode
used for RENAME_WHITEOUT operations. For tmpfile inodes, XFS allocates
the inode, drops di_nlink, adds the inode to the agi unlinked list,
calls d_tmpfile() which correspondingly drops i_nlink of the vfs inode,
and then finishes the common inode setup (e.g., clear I_NEW and unlock).

The d_tmpfile() call was originally made inxfs_create_tmpfile(), but was
pulled up out of that function as part of the following commit to
resolve a deadlock issue:

330033d6 xfs: fix tmpfile/selinux deadlock and initialize security

As a result, callers of xfs_create_tmpfile() are responsible for either
calling d_tmpfile() or fixing up i_nlink appropriately. The whiteout
tmpfile allocation helper does neither. As a result, the vfs ->i_nlink
becomes inconsistent with the on-disk ->di_nlink once xfs_rename() links
it back into the source dentry and calls xfs_bumplink().

Update the assert in xfs_rename() to help detect this problem in the
future and update xfs_rename_alloc_whiteout() to decrement the link
count as part of the manual tmpfile inode setup.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: xfs_iozero can return positive errno

It was missed when we converted everything in XFs to use negative error
numbers, so fix it now. Bug introduced in 3.17 by commit 2451337 ("xfs: global
error sign conversion"), and should go back to stable kernels.

Thanks to Brian Foster for noticing it.

cc: <stable@vger.kernel.org> # 3.17, 3.18, 3.19, 4.0
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: xfs_attr_inactive leaves inconsistent attr fork state behind

xfs_attr_inactive() is supposed to clean up the attribute fork when
the inode is being freed. While it removes attribute fork extents,
it completely ignores attributes in local format, which means that
there can still be active attributes on the inode after
xfs_attr_inactive() has run.

This leads to problems with concurrent inode writeback - the in-core
inode attribute fork is removed without locking on the assumption
that nothing will be attempting to access the attribute fork after a
call to xfs_attr_inactive() because it isn't supposed to exist on
disk any more.

To fix this, make xfs_attr_inactive() completely remove all traces
of the attribute fork from the inode, regardless of it's state.
Further, also remove the in-core attribute fork structure safely so
that there is nothing further that needs to be done by callers to
clean up the attribute fork. This means we can remove the in-core
and on-disk attribute forks atomically.

Also, on error simply remove the in-memory attribute fork. There's
nothing that can be done with it once we have failed to remove the
on-disk attribute fork, so we may as well just blow it away here
anyway.

cc: <stable@vger.kernel.org> # 3.12 to 4.0
Reported-by: Waiman Long <waiman.long@hp.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: extent size hints can round up extents past MAXEXTLEN

This results in BMBT corruption, as seen by this test:

# mkfs.xfs -f -d size=40051712b,agcount=4 /dev/vdc
....
# mount /dev/vdc /mnt/scratch
# xfs_io -ft -c "extsize 16m" -c "falloc 0 30g" -c "bmap -vp" /mnt/scratch/foo

which results in this failure on a debug kernel:

XFS: Assertion failed: (blockcount & xfs_mask64hi(64-BMBT_BLOCKCOUNT_BITLEN)) == 0, file: fs/xfs/libxfs/xfs_bmap_btree.c, line: 211
....
Call Trace:
[<ffffffff814cf0ff>] xfs_bmbt_set_allf+0x8f/0x100
[<ffffffff814cf18d>] xfs_bmbt_set_all+0x1d/0x20
[<ffffffff814f2efe>] xfs_iext_insert+0x9e/0x120
[<ffffffff814c7956>] ? xfs_bmap_add_extent_hole_real+0x1c6/0xc70
[<ffffffff814c7956>] xfs_bmap_add_extent_hole_real+0x1c6/0xc70
[<ffffffff814caaab>] xfs_bmapi_write+0x72b/0xed0
[<ffffffff811c72ac>] ? kmem_cache_alloc+0x15c/0x170
[<ffffffff814fe070>] xfs_alloc_file_space+0x160/0x400
[<ffffffff81ddcc29>] ? down_write+0x29/0x60
[<ffffffff815063eb>] xfs_file_fallocate+0x29b/0x310
[<ffffffff811d2bc8>] ? __sb_start_write+0x58/0x120
[<ffffffff811e3e18>] ? do_vfs_ioctl+0x318/0x570
[<ffffffff811cd680>] vfs_fallocate+0x140/0x260
[<ffffffff811ce6f8>] SyS_fallocate+0x48/0x80
[<ffffffff81ddec09>] system_call_fastpath+0x12/0x17

The tracepoint that indicates the extent that triggered the assert
failure is:

xfs_iext_insert:   idx 0 offset 0 block 16777224 count 2097152 flag 1

Clearly indicating that the extent length is greater than MAXEXTLEN,
which is 2097151. A prior trace point shows the allocation was an
exact size match and that a length greater than MAXEXTLEN was asked
for:

xfs_alloc_size_done:  agno 1 agbno 8 minlen 2097152 maxlen 2097152
    ^^^^^^^        ^^^^^^^

We don't see this problem with extent size hints through the IO path
because we can't do single IOs large enough to trigger MAXEXTLEN
allocation. fallocate(), OTOH, is not limited in it's allocation
sizes and so needs help here.

The issue is that the extent size hint alignment is rounding up the
extent size past MAXEXTLEN, because xfs_bmapi_write() is not taking
into account extent size hints when calculating the maximum extent
length to allocate. xfs_bmapi_reserve_delalloc() is already doing
this, but direct extent allocation is not.

Unfortunately, the calculation in xfs_bmapi_reserve_delalloc() is
wrong, and it works only because delayed allocation extents are not
limited in size to MAXEXTLEN in the in-core extent tree. hence this
calculation does not work for direct allocation, and the delalloc
code needs fixing. This may, in fact be the underlying bug that
occassionally causes transaction overruns in delayed allocation
extent conversion, so now we know it's wrong we should fix it, too.
Many thanks to Brian Foster for finding this problem during review
of this patch.

Hence the fix, after much code reading, is to allow
xfs_bmap_extsize_align() to align partial extents when full
alignment would extend the alignment past MAXEXTLEN. We can safely
do this because all callers have higher layer allocation loops that
already handle short allocations, and so will simply run another
allocation to cover the remainder of the requested allocation range
that we ignored during alignment. The advantage of this approach is
that it also removes the need for callers to do anything other than
limit their requests to MAXEXTLEN - they don't really need to be
aware of extent size hints at all.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: inode and free block counters need to use __percpu_counter_compare

Because the counters use a custom batch size, the comparison
functions need to be aware of that batch size otherwise the
comparison does not work correctly. This leads to ASSERT failures
on generic/027 like this:

XFS: Assertion failed: 0, file: fs/xfs/xfs_mount.c, line: 1099
------------[ cut here ]------------
....
Call Trace:
  [<ffffffff81522a39>] xfs_mod_icount+0x99/0xc0
  [<ffffffff815285cb>] xfs_trans_unreserve_and_mod_sb+0x28b/0x5b0
  [<ffffffff8152f941>] xfs_log_commit_cil+0x321/0x580
  [<ffffffff81528e17>] xfs_trans_commit+0xb7/0x260
  [<ffffffff81503d4d>] xfs_bmap_finish+0xcd/0x1b0
  [<ffffffff8151da41>] xfs_inactive_ifree+0x1e1/0x250
  [<ffffffff8151dbe0>] xfs_inactive+0x130/0x200
  [<ffffffff81523a21>] xfs_fs_evict_inode+0x91/0xf0
  [<ffffffff811f3958>] evict+0xb8/0x190
  [<ffffffff811f433b>] iput+0x18b/0x1f0
  [<ffffffff811e8853>] do_unlinkat+0x1f3/0x320
  [<ffffffff811d548a>] ? filp_close+0x5a/0x80
  [<ffffffff811e999b>] SyS_unlinkat+0x1b/0x40
  [<ffffffff81e0892e>] system_call_fastpath+0x12/0x71

This is a regression introduced by commit 501ab32 ("xfs: use generic
percpu counters for inode counter").

This patch fixes the same problem for both the inode counter and the
free block counter in the superblocks.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>