ovs: do not allocate memory from offline numa node
When openvswitch tries allocate memory from offline numa node 0:
stats = kmem_cache_alloc_node(flow_stats_cache, GFP_KERNEL | __GFP_ZERO, 0)
It catches VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid))
[ replaced with VM_WARN_ON(!node_online(nid)) recently ] in linux/gfp.h
This patch disables numa affinity in this case.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Fri, 2 Oct 2015 10:06:03 +0000 (12:06 +0200)]
bpf: fix panic in SO_GET_FILTER with native ebpf programs
When sockets have a native eBPF program attached through
setsockopt(sk, SOL_SOCKET, SO_ATTACH_BPF, ...), and then try to
dump these over getsockopt(sk, SOL_SOCKET, SO_GET_FILTER, ...),
the following panic appears:
The underlying issue is the very same as in commit b382c0865600
("sock, diag: fix panic in sock_diag_put_filterinfo"), that is,
native eBPF programs don't store an original program since this
is only needed in cBPF ones.
However, sk_get_filter() wasn't updated to test for this at the
time when eBPF could be attached. Just throw an error to the user
to indicate that eBPF cannot be dumped over this interface.
That way, it can also be known that a program _is_ attached (as
opposed to just return 0), and a different (future) method needs
to be consulted for a dump.
Fixes: 89aa075832b0 ("net: sock: allow eBPF programs to be attached to sockets") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Stringer [Thu, 1 Oct 2015 22:00:37 +0000 (15:00 -0700)]
openvswitch: Rename LABEL->LABELS
Conntrack LABELS (plural) are exposed by conntrack; rename the OVS name
for these to be consistent with conntrack.
Fixes: c2ac667 "openvswitch: Allow matching on conntrack label" Signed-off-by: Joe Stringer <joestringer@nicira.com> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrey Vagin [Thu, 1 Oct 2015 21:05:36 +0000 (00:05 +0300)]
net/unix: fix logic about sk_peek_offset
Now send with MSG_PEEK can return data from multiple SKBs.
Unfortunately we take into account the peek offset for each skb,
that is wrong. We need to apply the peek offset only once.
In addition, the peek offset should be used only if MSG_PEEK is set.
Cc: "David S. Miller" <davem@davemloft.net> (maintainer:NETWORKING Cc: Eric Dumazet <edumazet@google.com> (commit_signer:1/14=7%) Cc: Aaron Conole <aconole@bytheb.org> Fixes: 9f389e35674f ("af_unix: return data from multiple SKBs on recv() with MSG_PEEK flag") Signed-off-by: Andrey Vagin <avagin@openvz.org> Tested-by: Aaron Conole <aconole@bytheb.org> Signed-off-by: David S. Miller <davem@davemloft.net>
WANG Cong [Thu, 1 Oct 2015 18:37:43 +0000 (11:37 -0700)]
act_mirred: always release tcf hash
Align with other tc actions.
Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <cwang@twopensource.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
WANG Cong [Thu, 1 Oct 2015 18:37:42 +0000 (11:37 -0700)]
act_mirred: fix a race condition on mirred_list
After commit 1ce87720d456 ("net: sched: make cls_u32 lockless")
we began to release tc actions in a RCU callback. However,
mirred action relies on RTNL lock to protect the global
mirred_list, therefore we could have a race condition
between RCU callback and netdevice event, which caused
a list corruption as reported by Vinson.
Instead of relying on RTNL lock, introduce a spinlock to
protect this list.
Note, in non-bind case, it is still called with RTNL lock,
therefore should disable BH too.
Reported-by: Vinson Lee <vlee@twopensource.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <cwang@twopensource.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The driver still was not offloading TSO on GRE tunnels because
it forgot to set the GSO_GRE flag, causing lots of retransmits.
This fixes generic GRE traffic (like a tunnel added like below)
whereas before it would get 1Gb/s or less, now on a 10G adapter
it gets 8.7Gb/s.
ip ad ad 11.1.0.2/24 dev ens2f0
ip l set ens2f0 up
ip link add gre2 type gretap remote 11.1.0.1 local 11.1.0.2 dev ens2f0
ip l set gre2 up
ip ad ad 192.168.124.2/24 dev gre2
ping 192.168.124.1
netperf -H 192.168.124.1
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Benc [Thu, 1 Oct 2015 14:25:43 +0000 (16:25 +0200)]
ipv4: fix reply_dst leakage on arp reply
There are cases when the created metadata reply is not used. Ensure the
allocated memory is freed also in such cases.
Fixes: 63d008a4e9ee ("ipv4: send arp replies to the correct tunnel") Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 1 Oct 2015 12:39:26 +0000 (05:39 -0700)]
inet: fix race in reqsk_queue_unlink()
reqsk_timer_handler() tests if icsk_accept_queue.listen_opt
is NULL at its beginning.
By the time it calls inet_csk_reqsk_queue_drop() and
reqsk_queue_unlink(), listener might have been closed and
inet_csk_listen_stop() had called reqsk_queue_yank_acceptq()
which sets icsk_accept_queue.listen_opt to NULL
We therefore need to correctly check listen_opt being NULL
after holding syn_wait_lock for proper synchronization.
Fixes: fa76ce7328b2 ("inet: get rid of central tcp/dccp listener timer") Fixes: b357a364c57c ("inet: fix possible panic in reqsk_queue_unlink()") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Raanan Avargil [Thu, 1 Oct 2015 11:48:53 +0000 (04:48 -0700)]
tcp/dccp: fix old style declarations
I’m using the compilation flag -Werror=old-style-declaration, which
requires that the “inline” word would come at the beginning of the code
line.
$ make drivers/net/ethernet/intel/e1000e/e1000e.ko
...
include/net/inet_timewait_sock.h:116:1: error: ‘inline’ is not at
beginning of declaration [-Werror=old-style-declaration]
static void inline inet_twsk_schedule(struct inet_timewait_sock *tw, int
timeo)
include/net/inet_timewait_sock.h:121:1: error: ‘inline’ is not at
beginning of declaration [-Werror=old-style-declaration]
static void inline inet_twsk_reschedule(struct inet_timewait_sock *tw,
int timeo)
Fixes: ed2e92394589 ("tcp/dccp: fix timewait races in timer handling") Signed-off-by: Raanan Avargil <raanan.avargil@intel.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David B. Robins [Wed, 30 Sep 2015 20:20:04 +0000 (16:20 -0400)]
net: usb: asix: Fix crash on skb alloc failure
If asix_rx_fixup_internal() fails to allocate rx->ax_skb, it will return
but not clear rx->size. rx points to driver private data. A later call
assumes that nonzero size means ax_skb was allocated and passes a null
ax_skb to skb_put. Changed allocation failure return to clear size first.
Found testing board with AX88772B devices.
Signed-off-by: David B. Robins <linux@davidrobins.net> Signed-off-by: David S. Miller <davem@davemloft.net>
amd-xgbe: fix potential memory leak in xgbe-debugfs
Added kfree() to avoid the memory leak when debugfs_create_dir() fails.
Signed-off-by: Geliang Tang <geliangtang@163.com> Acked-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
ppp: don't override sk->sk_state in pppoe_flush_dev()
Since commit 2b018d57ff18 ("pppoe: drop PPPOX_ZOMBIEs in pppoe_release"),
pppoe_release() calls dev_put(po->pppoe_dev) if sk is in the
PPPOX_ZOMBIE state. But pppoe_flush_dev() can set sk->sk_state to
PPPOX_ZOMBIE _and_ reset po->pppoe_dev to NULL. This leads to the
following oops:
pppoe_flush_dev() has no reason to override sk->sk_state with
PPPOX_ZOMBIE. pppox_unbind_sock() already sets sk->sk_state to
PPPOX_DEAD, which is the correct state given that sk is unbound and
po->pppoe_dev is NULL.
Fixes: 2b018d57ff18 ("pppoe: drop PPPOX_ZOMBIEs in pppoe_release") Tested-by: Oleksii Berezhniak <core@irc.lg.ua> Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Fri, 2 Oct 2015 12:03:04 +0000 (08:03 -0400)]
Merge tag 'mmc-v4.3-rc3' of git://git.linaro.org/people/ulf.hansson/mmc
Pull MMC fixes from Ulf Hansson:
"Here are some mmc fixes intended for v4.3 rc4:
MMC core:
- Allow users of mmc_of_parse() to succeed when CONFIG_GPIOLIB is
unset
- Prevent infinite loop of re-tuning for CRC-errors for CMD19 and
CMD21
* tag 'mmc-v4.3-rc3' of git://git.linaro.org/people/ulf.hansson/mmc:
mmc: core: fix dead loop of mmc_retune
mmc: pxamci: fix card detect with slot-gpio API
mmc: sunxi: Fix clk-delay settings
mmc: core: Don't return an error for CD/WP GPIOs when GPIOLIB is unset
Linus Torvalds [Fri, 2 Oct 2015 11:59:29 +0000 (07:59 -0400)]
Merge git://git.infradead.org/intel-iommu
Pull IOVA fixes from David Woodhouse:
"The main fix here is the first one, fixing the over-allocation of
size-aligned requests. The other patches simply make the existing
IOVA code available to users other than the Intel VT-d driver, with no
functional change.
I concede the latter really *should* have been submitted during the
merge window, but since it's basically risk-free and people are
waiting to build on top of it and it's my fault I didn't get it in, I
(and they) would be grateful if you'd take it"
* git://git.infradead.org/intel-iommu:
iommu: Make the iova library a module
iommu: iova: Export symbols
iommu: iova: Move iova cache management to the iova library
iommu/iova: Avoid over-allocating when size-aligned
Linus Torvalds [Fri, 2 Oct 2015 02:20:11 +0000 (22:20 -0400)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"12 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
dmapool: fix overflow condition in pool_find_page()
thermal: avoid division by zero in power allocator
memcg: remove pcp_counter_lock
kprobes: use _do_fork() in samples to make them work again
drivers/input/joystick/Kconfig: zhenhua.c needs BITREVERSE
memcg: make mem_cgroup_read_stat() unsigned
memcg: fix dirty page migration
dax: fix NULL pointer in __dax_pmd_fault()
mm: hugetlbfs: skip shared VMAs when unmapping private pages to satisfy a fault
mm/slab: fix unexpected index mapping result of kmalloc_size(INDEX_NODE+1)
userfaultfd: remove kernel header include from uapi header
arch/x86/include/asm/efi.h: fix build failure
Linus Torvalds [Fri, 2 Oct 2015 02:06:40 +0000 (22:06 -0400)]
Merge tag 'pm+acpi-4.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management and ACPI fixes from Rafael Wysocki:
"These are fixes mostly, for a few changes made in this cycle (the
intel_idle driver, the OPP library, the ACPI EC driver, turbostat) and
for some issues that have just been discovered (ACPI PCI IRQ
management, PCI power management documentation, turbostat), with a
couple of cleanups on top of them.
Specifics:
- intel_idle driver fixup for the recently added Skylake chips
support (Len Brown).
- Operating Performance Points (OPP) library fix related to the
recently added support for new DT bindings and a fix for a typo in
a comment (Viresh Kumar, Stephen Boyd).
- ACPI EC driver fix for a recently introduced memory leak in an
error code path (Lv Zheng).
- ACPI PCI IRQ management fix for the issue where an ISA IRQ is
shared with a PCI device which requires it to be configured in a
different way and may cause an interrupt storm to happen as a
result with an extra ACPI SCI IRQ handling simplification on top of
it (Jiang Liu).
- Update of the PCI power management documentation that became
outdated and started to actively confuse the readers to make it
actually reflect the code (Rafael J Wysocki).
- turbostat fixes including an IVB Xeon regression fix (related to
the --debug command line option), Skylake adjustment for the TSC
running at a frequency that doesn't match the base one exactly, and
a Knights Landing quirk to account for the fact that it only
updates APERF and MPERF every 1024 clock cycles plus bumping up the
turbostat version number (Len Brown, Hubert Chrzaniuk)"
* tag 'pm+acpi-4.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
tools/power turbosat: update version number
tools/power turbostat: SKL: Adjust for TSC difference from base frequency
tools/power turbostat: KNL workaround for %Busy and Avg_MHz
tools/power turbostat: IVB Xeon: fix --debug regression
ACPI / PCI: Remove duplicated penalty on SCI IRQ
ACPI, PCI, irq: Do not share PCI IRQ with ISA IRQ
ACPI / EC: Fix a memory leak issue in acpi_ec_query()
PM / OPP: Fix typo modifcation -> modification
PCI / PM: Update runtime PM documentation for PCI devices
PM / OPP: of_property_count_u32_elems() can return errors
intel_idle: Skylake Client Support - updated
1) Fix regression in SKB partial checksum handling, from Pravin B
Shalar.
2) Fix VLAN inside of VXLAN handling in i40e driver, from Jesse
Brandeburg.
3) Cure softlockups during accept() in SCTP, from Karl Heiss.
4) MSG_PEEK should return multiple SKBs worth of data in AF_UNIX, from
Aaron Conole.
5) IPV6 erroneously ignores output interface specifier in lookup key for
route lookups, fix from David Ahern.
6) In Marvell DSA driver, forward unknown frames to CPU port, from
Andrew Lunn.
7) Mission flow flag initializations in some code paths, from David
Ahern.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
net: Initialize flow flags in input path
net: dsa: fix preparation of a port STP update
testptp: Silence compiler warnings on ppc64
net/mlx4: Handle return codes in mlx4_qp_attach_common
dsa: mv88e6xxx: Enable forwarding for unknown to the CPU port
skbuff: Fix skb checksum partial check.
net: ipv6: Add RT6_LOOKUP_F_IFACE flag if oif is set
net sysfs: Print link speed as signed integer
bna: fix error handling
af_unix: return data from multiple SKBs on recv() with MSG_PEEK flag
af_unix: Convert the unix_sk macro to an inline function for type safety
net: sctp: Don't use 64 kilobyte lookup table for four elements
l2tp: protect tunnel->del_work by ref_count
net/ibm/emac: bump version numbers for correct work with ethtool
sctp: Prevent soft lockup when sctp_accept() is called during a timeout event
sctp: Whitespace fix
i40e/i40evf: check for stopped admin queue
i40e: fix VLAN inside VXLAN
r8169: fix handling rtl_readphy result
net: hisilicon: fix handling platform_get_irq result
Robin Murphy [Thu, 1 Oct 2015 22:37:19 +0000 (15:37 -0700)]
dmapool: fix overflow condition in pool_find_page()
If a DMA pool lies at the very top of the dma_addr_t range (as may
happen with an IOMMU involved), the calculated end address of the pool
wraps around to zero, and page lookup always fails.
Tweak the relevant calculation to be overflow-proof.
Signed-off-by: Robin Murphy <robin.murphy@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Sakari Ailus <sakari.ailus@iki.fi> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Greg Thelen [Thu, 1 Oct 2015 22:37:13 +0000 (15:37 -0700)]
memcg: remove pcp_counter_lock
Commit 733a572e66d2 ("memcg: make mem_cgroup_read_{stat|event}() iterate
possible cpus instead of online") removed the last use of the per memcg
pcp_counter_lock but forgot to remove the variable.
Kill the vestigial variable.
Signed-off-by: Greg Thelen <gthelen@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Greg Thelen [Thu, 1 Oct 2015 22:37:05 +0000 (15:37 -0700)]
memcg: make mem_cgroup_read_stat() unsigned
mem_cgroup_read_stat() returns a page count by summing per cpu page
counters. The summing is racy wrt. updates, so a transient negative
sum is possible. Callers don't want negative values:
- mem_cgroup_wb_stats() doesn't want negative nr_dirty or nr_writeback.
This could confuse dirty throttling.
- oom reports and memory.stat shouldn't show confusing negative usage.
- tree_usage() already avoids negatives.
Avoid returning negative page counts from mem_cgroup_read_stat() and
convert it to unsigned.
[akpm@linux-foundation.org: fix old typo while we're in there] Signed-off-by: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> [4.2+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Greg Thelen [Thu, 1 Oct 2015 22:37:02 +0000 (15:37 -0700)]
memcg: fix dirty page migration
The problem starts with a file backed dirty page which is charged to a
memcg. Then page migration is used to move oldpage to newpage.
Migration:
- copies the oldpage's data to newpage
- clears oldpage.PG_dirty
- sets newpage.PG_dirty
- uncharges oldpage from memcg
- charges newpage to memcg
Clearing oldpage.PG_dirty decrements the charged memcg's dirty page
count.
However, because newpage is not yet charged, setting newpage.PG_dirty
does not increment the memcg's dirty page count. After migration
completes newpage.PG_dirty is eventually cleared, often in
account_page_cleaned(). At this time newpage is charged to a memcg so
the memcg's dirty page count is decremented which causes underflow
because the count was not previously incremented by migration. This
underflow causes balance_dirty_pages() to see a very large unsigned
number of dirty memcg pages which leads to aggressive throttling of
buffered writes by processes in non root memcg.
This issue:
- can harm performance of non root memcg buffered writes.
- can report too small (even negative) values in
memory.stat[(total_)dirty] counters of all memcg, including the root.
To avoid polluting migrate.c with #ifdef CONFIG_MEMCG checks, introduce
page_memcg() and set_page_memcg() helpers.
Test:
0) setup and enter limited memcg
mkdir /sys/fs/cgroup/test
echo 1G > /sys/fs/cgroup/test/memory.limit_in_bytes
echo $$ > /sys/fs/cgroup/test/cgroup.procs
Notice that unpatched baseline performance (1) fell after
migration (3): 841 -> 114 MB/s. In the patched kernel, post
migration performance matches baseline.
Fixes: c4843a7593a9 ("memcg: add per cgroup dirty page accounting") Signed-off-by: Greg Thelen <gthelen@google.com> Reported-by: Dave Hansen <dave.hansen@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> [4.2+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ross Zwisler [Thu, 1 Oct 2015 22:36:59 +0000 (15:36 -0700)]
dax: fix NULL pointer in __dax_pmd_fault()
Commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for
DAX") moved some code in __dax_pmd_fault() that was responsible for
zeroing newly allocated PMD pages. The new location didn't properly set
up 'kaddr', so when run this code resulted in a NULL pointer BUG.
Fix this by getting the correct 'kaddr' via bdev_direct_access().
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reported-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I think I find a linux bug, I have the test cases is constructed. I
can stable recurring problems in fedora22(4.0.4) kernel version,
arch for x86_64. I construct transparent huge page, when the parent
and child process with MAP_SHARE, MAP_PRIVATE way to access the same
huge page area, it has the opportunity to lead to huge page copy on
write failure, and then it will munmap the child corresponding mmap
area, but then the child mmap area with VM_MAYSHARE attributes, child
process munmap this area can trigger VM_BUG_ON in set_vma_resv_flags
functions (vma - > vm_flags & VM_MAYSHARE).
There were a number of problems with the report (e.g. it's hugetlbfs that
triggers this, not transparent huge pages) but it was fundamentally
correct in that a VM_BUG_ON in set_vma_resv_flags() can be triggered that
looks like this
vma ffff8804651fd0d0 start 00007fc474e00000 end 00007fc475e00000
next ffff8804651fd018 prev ffff8804651fd188 mm ffff88046b1b1800
prot 8000000000000027 anon_vma (null) vm_ops ffffffff8182a7a0
pgoff 0 file ffff88106bdb9800 private_data (null)
flags: 0x84400fb(read|write|shared|mayread|maywrite|mayexec|mayshare|dontexpand|hugetlb)
------------
kernel BUG at mm/hugetlb.c:462!
SMP
Modules linked in: xt_pkttype xt_LOG xt_limit [..]
CPU: 38 PID: 26839 Comm: map Not tainted 4.0.4-default #1
Hardware name: Dell Inc. PowerEdge R810/0TT6JF, BIOS 2.7.4 04/26/2012
set_vma_resv_flags+0x2d/0x30
The VM_BUG_ON is correct because private and shared mappings have
different reservation accounting but the warning clearly shows that the
VMA is shared.
When a private COW fails to allocate a new page then only the process
that created the VMA gets the page -- all the children unmap the page.
If the children access that data in the future then they get killed.
The problem is that the same file is mapped shared and private. During
the COW, the allocation fails, the VMAs are traversed to unmap the other
private pages but a shared VMA is found and the bug is triggered. This
patch identifies such VMAs and skips them.
Kernels after v3.9 use kmalloc_size(INDEX_NODE + 1) to get the next
larger cache size than the size index INDEX_NODE mapping. In kernels
3.9 and earlier we used malloc_sizes[INDEX_L3 + 1].cs_size.
However, sometimes we can't get the right output we expected via
kmalloc_size(INDEX_NODE + 1), causing a BUG().
The mapping table in the latest kernel is like:
index = {0, 1, 2 , 3, 4, 5, 6, n}
size = {0, 96, 192, 8, 16, 32, 64, 2^n}
The mapping table before 3.10 is like this:
index = {0 , 1 , 2, 3, 4 , 5 , 6, n}
size = {32, 64, 96, 128, 192, 256, 512, 2^(n+3)}
The problem on my mips64 machine is as follows:
(1) When configured DEBUG_SLAB && DEBUG_PAGEALLOC && DEBUG_LOCK_ALLOC
&& DEBUG_SPINLOCK, the sizeof(struct kmem_cache_node) will be "150",
and the macro INDEX_NODE turns out to be "2": #define INDEX_NODE
kmalloc_index(sizeof(struct kmem_cache_node))
(2) Then the result of kmalloc_size(INDEX_NODE + 1) is 8.
(3) Then "if(size >= kmalloc_size(INDEX_NODE + 1)" will lead to "size
= PAGE_SIZE".
(4) Then "if ((size >= (PAGE_SIZE >> 3))" test will be satisfied and
"flags |= CFLGS_OFF_SLAB" will be covered.
(5) if (flags & CFLGS_OFF_SLAB)" test will be satisfied and will go to
"cachep->slabp_cache = kmalloc_slab(slab_size, 0u)", and the result
here may be NULL while kernel bootup.
(6) Finally,"BUG_ON(ZERO_OR_NULL_PTR(cachep->slabp_cache));" causes the
BUG info as the following shows (may be only mips64 has this problem):
This patch fixes the problem of kmalloc_size(INDEX_NODE + 1) and removes
the BUG by adding 'size >= 256' check to guarantee that all necessary
small sized slabs are initialized regardless sequence of slab size in
mapping table.
Fixes: e33660165c90 ("slab: Use common kmalloc_index/kmalloc_size...") Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Liuhailong <liu.hailong6@zte.com.cn> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andre Przywara [Thu, 1 Oct 2015 22:36:51 +0000 (15:36 -0700)]
userfaultfd: remove kernel header include from uapi header
As include/uapi/linux/userfaultfd.h is a user visible header file, it
should not include kernel-exclusive header files.
So trying to build the userfaultfd test program from the selftests
directory fails, since it contains a reference to linux/compiler.h. As
it turns out, that header is not really needed there, so we can simply
remove it to fix that issue.
Signed-off-by: Andre Przywara <andre.przywara@arm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Shuah Khan <shuahkh@osg.samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrey Ryabinin [Thu, 1 Oct 2015 22:36:48 +0000 (15:36 -0700)]
arch/x86/include/asm/efi.h: fix build failure
With KMEMCHECK=y, KASAN=n:
arch/x86/platform/efi/efi.c:673:3: error: implicit declaration of function `memcpy' [-Werror=implicit-function-declaration]
arch/x86/platform/efi/efi_64.c:139:2: error: implicit declaration of function `memcpy' [-Werror=implicit-function-declaration]
arch/x86/include/asm/desc.h:121:2: error: implicit declaration of function `memcpy' [-Werror=implicit-function-declaration]
Linus Torvalds [Thu, 1 Oct 2015 20:43:25 +0000 (16:43 -0400)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"(Relatively) a lot of reverts, mostly.
Bugs have trickled in for a new feature in 4.2 (MTRR support in
guests) so I'm reverting it all; let's not make this -rc period busier
for KVM than it's been so far. This covers the four reverts from me.
The fifth patch is being reverted because Radim found a bug in the
implementation of stable scheduler clock, *but* also managed to
implement the feature entirely without hypervisor support. So instead
of fixing the hypervisor side we can remove it completely; 4.4 will
get the new implementation"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
Use WARN_ON_ONCE for missing X86_FEATURE_NRIPS
Update KVM homepage Url
Revert "KVM: SVM: use NPT page attributes"
Revert "KVM: svm: handle KVM_X86_QUIRK_CD_NW_CLEARED in svm_get_mt_mask"
Revert "KVM: SVM: Sync g_pat with guest-written PAT value"
Revert "KVM: x86: apply guest MTRR virtualization on host reserved pages"
Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR"
Linus Torvalds [Thu, 1 Oct 2015 20:38:52 +0000 (16:38 -0400)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull rdma fixes from Doug Ledford:
- Fixes for mlx5 related issues
- Fixes for ipoib multicast handling
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
IB/ipoib: increase the max mcast backlog queue
IB/ipoib: Make sendonly multicast joins create the mcast group
IB/ipoib: Expire sendonly multicast joins
IB/mlx5: Remove pa_lkey usages
IB/mlx5: Remove support for IB_DEVICE_LOCAL_DMA_LKEY
IB/iser: Add module parameter for always register memory
xprtrdma: Replace global lkey with lkey local to PD
* pm-tools:
tools/power turbosat: update version number
tools/power turbostat: SKL: Adjust for TSC difference from base frequency
tools/power turbostat: KNL workaround for %Busy and Avg_MHz
tools/power turbostat: IVB Xeon: fix --debug regression
Dirk Müller [Thu, 1 Oct 2015 11:43:42 +0000 (13:43 +0200)]
Use WARN_ON_ONCE for missing X86_FEATURE_NRIPS
The cpu feature flags are not ever going to change, so warning
everytime can cause a lot of kernel log spam
(in our case more than 10GB/hour).
The warning seems to only occur when nested virtualization is
enabled, so it's probably triggered by a KVM bug. This is a
sensible and safe change anyway, and the KVM bug fix might not
be suitable for stable releases anyway.
Cc: stable@vger.kernel.org Signed-off-by: Dirk Mueller <dmueller@suse.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Linus Torvalds [Thu, 1 Oct 2015 11:57:27 +0000 (07:57 -0400)]
Merge tag 'upstream-4.3-rc4' of git://git.infradead.org/linux-ubifs
Pull UBI/UBIFS fixes from Richard Weinberger:
"This contains three bug fixes for both UBI and UBIFS"
* tag 'upstream-4.3-rc4' of git://git.infradead.org/linux-ubifs:
UBI: return ENOSPC if no enough space available
UBI: Validate data_size
UBIFS: Kill unneeded locking in ubifs_init_security
Linus Torvalds [Thu, 1 Oct 2015 11:50:08 +0000 (07:50 -0400)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull key signing fixes from James Morris:
"Keyrings and modsign fixes from David Howells"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
MODSIGN: Change from CMS to PKCS#7 signing if the openssl is too old
X.509: Don't strip leading 00's from key ID when constructing key description
KEYS: Remove unnecessary header #inclusions from extract-cert.c
KEYS: Fix race between key destruction and finding a keyring by name
Paolo Bonzini [Thu, 1 Oct 2015 11:20:22 +0000 (13:20 +0200)]
Revert "KVM: SVM: use NPT page attributes"
This reverts commit 3c2e7f7de3240216042b61073803b61b9b3cfb22.
Initializing the mapping from MTRR to PAT values was reported to
fail nondeterministically, and it also caused extremely slow boot
(due to caching getting disabled---bug 103321) with assigned devices.
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Reported-by: Sebastian Schuette <dracon@ewetel.net> Cc: stable@vger.kernel.org # 4.2+ Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Merge tag 'hwmon-for-linus-v4.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
Pull hwmin fixes from Guenter Roeck:
"Fix module autoload for various drivers"
* tag 'hwmon-for-linus-v4.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (pwm-fan) Fix module autoload for OF platform driver
hwmon: (gpio-fan) Fix module autoload for OF platform driver
hwmon: (abx500) Fix module autoload for OF platform driver
Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU fixes from Ingo Molnar:
"Two RCU fixes:
- work around bug with recent GCC versions.
- fix false positive lockdep splat"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rcu: Suppress lockdep false positive for rcp->exp_funnel_mutex
rcu: Change _wait_rcu_gp() to work around GCC bug 67055
Initialize msg/shm IPC objects before doing ipc_addid()
As reported by Dmitry Vyukov, we really shouldn't do ipc_addid() before
having initialized the IPC object state. Yes, we initialize the IPC
object in a locked state, but with all the lockless RCU lookup work,
that IPC object lock no longer means that the state cannot be seen.
We already did this for the IPC semaphore code (see commit e8577d1f0329:
"ipc/sem.c: fully initialize sem_array before making it visible") but we
clearly forgot about msg and shm.
When get a CRC error, start the mmc_retune, it will issue CMD19/CMD21
to do tune, assume there were 10 clock phase need to try, phase 0 to
phase 6 is ok, phase 7 to phase 9 is NG, we try it from 0 to 9, so
the last CMD19/CMD21 will get CRC error, host->need_retune was set and
cause mmc_retune was called, then dead loop of mmc_retune
Signed-off-by: Chaotian Jing <chaotian.jing@mediatek.com> Acked-by: Adrian Hunter <adrian.hunter@intel.com> Fixes: bd11e8bd03ca ("mmc: core: Flag re-tuning is needed on CRC errors") Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Vivien Didelot [Tue, 29 Sep 2015 18:17:54 +0000 (14:17 -0400)]
net: dsa: fix preparation of a port STP update
Because of the default 0 value of ret in dsa_slave_port_attr_set, a
driver may return -EOPNOTSUPP from the commit phase of a STP state,
which triggers a WARN() from switchdev.
This happened on a 6185 switch which does not support hardware bridging.
Fixes: 3563606258cf ("switchdev: convert STP update to switchdev attr set") Reported-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Huth [Tue, 29 Sep 2015 15:45:28 +0000 (17:45 +0200)]
testptp: Silence compiler warnings on ppc64
When compiling Documentation/ptp/testptp.c the following compiler
warnings are printed out:
Documentation/ptp/testptp.c: In function ‘main’:
Documentation/ptp/testptp.c:367:11: warning: format ‘%lld’ expects argument
of type ‘long long int’, but argument 3 has type ‘__s64’ [-Wformat=]
event.t.sec, event.t.nsec);
^
Documentation/ptp/testptp.c:505:5: warning: format ‘%lld’ expects argument
of type ‘long long int’, but argument 2 has type ‘__s64’ [-Wformat=]
(pct+2*i)->sec, (pct+2*i)->nsec);
^
Documentation/ptp/testptp.c:507:5: warning: format ‘%lld’ expects argument
of type ‘long long int’, but argument 2 has type ‘__s64’ [-Wformat=]
(pct+2*i+1)->sec, (pct+2*i+1)->nsec);
^
Documentation/ptp/testptp.c:509:5: warning: format ‘%lld’ expects argument
of type ‘long long int’, but argument 2 has type ‘__s64’ [-Wformat=]
(pct+2*i+2)->sec, (pct+2*i+2)->nsec);
This happens because __s64 is by default defined as "long" on ppc64,
not as "long long". However, to fix these warnings, it's possible to
define the __SANE_USERSPACE_TYPES__ so that __s64 gets defined to
"long long" on ppc64, too.
Signed-off-by: Thomas Huth <thuth@redhat.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net/mlx4: Handle return codes in mlx4_qp_attach_common
Both new_steering_entry() and existing_steering_entry() return values
based on their success or failure, but currently they fall through
silently. This can make troubleshooting difficult, as we were unable
to tell which one of these two functions returned errors or
specifically what code was returned. This patch remedies that
situation by passing the return codes to err, which is returned by
mlx4_qp_attach_common() itself.
This also addresses a leak in the call to mlx4_bitmap_free() as well.
Signed-off-by: Robb Manes <rmanes@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Earlier patch 6ae459bda tried to detect void ckecksum partial
skb by comparing pull length to checksum offset. But it does
not work for all cases since checksum-offset depends on
updates to skb->data.
Following patch fixes it by validating checksum start offset
after skb-data pointer is updated. Negative value of checksum
offset start means there is no need to checksum.
Fixes: 6ae459bda ("skbuff: Fix skb checksum flag on skb pull") Reported-by: Andrew Vagin <avagin@odin.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Mon, 28 Sep 2015 17:12:13 +0000 (10:12 -0700)]
net: ipv6: Add RT6_LOOKUP_F_IFACE flag if oif is set
Wolfgang reported that IPv6 stack is ignoring oif in output route lookups:
With ipv6, ip -6 route get always returns the specific route.
$ ip -6 r
2001:db8:e2::1 dev enp2s0 proto kernel metric 256
2001:db8:e2::/64 dev enp2s0 metric 1024
2001:db8:e3::1 dev enp3s0 proto kernel metric 256
2001:db8:e3::/64 dev enp3s0 metric 1024
fe80::/64 dev enp3s0 proto kernel metric 256
default via 2001:db8:e3::255 dev enp3s0 metric 1024
$ ip -6 r get 2001:db8:e2::100
2001:db8:e2::100 from :: dev enp2s0 src 2001:db8:e3::1 metric 0
cache
$ ip -6 r get 2001:db8:e2::100 oif enp3s0
2001:db8:e2::100 from :: dev enp2s0 src 2001:db8:e3::1 metric 0
cache
The stack does consider the oif but a mismatch in rt6_device_match is not
considered fatal because RT6_LOOKUP_F_IFACE is not set in the flags.
Cc: Wolfgang Nothdurft <netdev@linux-dude.de> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Stein [Mon, 28 Sep 2015 13:05:33 +0000 (15:05 +0200)]
net sysfs: Print link speed as signed integer
Otherwise 4294967295 (MBit/s) (-1) will be printed when there is no link.
Documentation/ABI/testing/sysfs-class-net does not state if this shall be
signed or unsigned.
Also remove the now unused variable fmt_udec.
Signed-off-by: Alexander Stein <alexander.stein@systec-electronic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrzej Hajda [Mon, 28 Sep 2015 08:49:48 +0000 (10:49 +0200)]
bna: fix error handling
Several functions can return negative value in case of error,
so their return type should be fixed as well as type of variables
to which this value is assigned.
The problem has been detected using proposed semantic patch
scripts/coccinelle/tests/assign_signed_to_unsigned.cocci [1].
David S. Miller [Tue, 29 Sep 2015 20:47:08 +0000 (13:47 -0700)]
Merge branch 'af_unix_MSG_PEEK'
Aaron Conole says:
====================
af_unix: return data from multiple SKBs on recv() with MSG_PEEK flag
This patch set implements a bugfix for kernel.org bugzilla #12323, allowing
MSG_PEEK to return all queued data on the unix domain socket, not just the
data contained in a single SKB.
This is the v3 version of this patch, which includes a suggested modification
by Eric Dumazet to convert the unix_sk() conversion macro to a static inline
function. These patches are independent and can be applied separately.
This set was tested over a 24-hour period, utilizing a loop continually
executing the bugzilla issue attached python code. It was instrumented with
a pr_err_once() ([ 13.798683] unix: went there at least one time).
v2->v3:
- Added Eric Dumazet's suggestion for #define to static inline
- Fixed an issue calling unix_state_lock() with an invalid argument
v3->v4:
- Eliminated an XXX comment
- Changed from goto unlock to explicit unix_state_unlock() and break
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
af_unix: return data from multiple SKBs on recv() with MSG_PEEK flag
AF_UNIX sockets now return multiple skbs from recv() when MSG_PEEK flag
is set.
This is referenced in kernel bugzilla #12323 @
https://bugzilla.kernel.org/show_bug.cgi?id=12323
As described both in the BZ and lkml thread @
http://lkml.org/lkml/2008/1/8/444 calling recv() with MSG_PEEK on an
AF_UNIX socket only reads a single skb, where the desired effect is
to return as much skb data has been queued, until hitting the recv
buffer size (whichever comes first).
The modified MSG_PEEK path will now move to the next skb in the tree
and jump to the again: label, rather than following the natural loop
structure. This requires duplicating some of the loop head actions.
This was tested using the python socketpair python code attached to
the bugzilla issue.
Signed-off-by: Aaron Conole <aconole@bytheb.org> Signed-off-by: David S. Miller <davem@davemloft.net>
UBI: attaching mtd1 to ubi0
UBI: scanning is finished
UBI error: init_volumes: not enough PEBs, required 706, available 686
UBI error: ubi_wl_init: no enough physical eraseblocks (-20, need 1)
UBI error: ubi_attach_mtd_dev: failed to attach mtd1, error -12 <= NOT ENOMEM
UBI error: ubi_init: cannot attach mtd1
If available PEBs are not enough when initializing volumes, return -ENOSPC
directly. If available PEBs are not enough when initializing WL, return
-ENOSPC instead of -ENOMEM.
Cc: stable@vger.kernel.org Signed-off-by: Sheng Yong <shengyong1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at> Reviewed-by: David Gstir <david@sigma-star.at>
Make sure that data_size is less than LEB size.
Otherwise a handcrafted UBI image is able to trigger
an out of bounds memory access in ubi_compare_lebs().
Cc: stable@vger.kernel.org Signed-off-by: Richard Weinberger <richard@nod.at> Reviewed-by: David Gstir <david@sigma-star.at>
While the lockdep splat is a false positive, becuase path_openat holds i_mutex
of the parent directory and ubifs_init_security() tries to acquire i_mutex
of a new inode, it reveals that taking i_mutex in ubifs_init_security() is
in vain because it is only being called in the inode allocation path
and therefore nobody else can see the inode yet.
Cc: stable@vger.kernel.org # 3.20- Reported-and-tested-by: Boris Brezillon <boris.brezillon@free-electrons.com> Reviewed-and-tested-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: dedekind1@gmail.com
Robert Jarzmik [Sat, 26 Sep 2015 19:41:01 +0000 (21:41 +0200)]
mmc: pxamci: fix card detect with slot-gpio API
Move pxamci to mmc slot-gpio API to fix interrupt request.
It fixes the case where the card detection is on a gpio expander, on I2C
for example on zylonite board. In this case, the card detect netsted
interrupt is called from a threaded interrupt. The request_irq() fails,
because a hard irq cannot be a nested interrupt from a threaded
interrupt (set __setup_irq()).
This was tested on zylonite and mioa701 boards.
Signed-off-by: Robert Jarzmik <robert.jarzmik@free.fr> Cc: Petr Cvek <petr.cvek@tul.cz> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Hans de Goede [Wed, 23 Sep 2015 20:06:48 +0000 (22:06 +0200)]
mmc: sunxi: Fix clk-delay settings
In recent allwinner kernel sources the mmc clk-delay settings have been
slightly tweaked, and for sun9i they are completely different then what
we are using.
This commit brings us in sync with what allwinner does, fixing problems
accessing sdcards on some A33 devices (and likely others).
For pre sun9i hardware this makes the following changes:
-At 400Khz change the sample delay from 7 to 0 (introduced in A31 sdk)
-At 50 Mhz change the sample delay from 5 to 4 (introduced in A23 sdk)
This also drops the clk-delay calculation for clocks > 50 MHz, we do
not need this as we've: mmc->f_max = 50000000, and the delays in the
old code were not correct (at 100 MHz the delay must be a multiple of 60,
at 200 MHz a multiple of 120).
Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
mmc: core: Don't return an error for CD/WP GPIOs when GPIOLIB is unset
When CONFIG_GPIOLIB is unset, its stubs will return -ENOSYS. That means
when the mmc core parses DT for CD/WP GPIOs via mmc_of_parse(), -ENOSYS
becomes propagated to the caller. Typically this means that the mmc host
driver fails to probe.
As the CD/WP GPIOs are already treated as optional, let's extend that to
cover the case when CONFIG_GPIOLIB is unset.
Reported-by: Michal Simek <michal.simek@xilinx.com> Fixes: 16b23787fc70 ("mmc: sdhci-of-arasan: Call OF parsing for MMC") Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Tested-by: Michal Simek <michal.simek@xilinx.com> Acked-by: Venu Byravarasu <vbyravarasu@nvidia.com>
Ivan Mikhaylov [Fri, 25 Sep 2015 07:52:27 +0000 (11:52 +0400)]
net/ibm/emac: bump version numbers for correct work with ethtool
The size of the MAC register dump used to be the size specified by the
reg property in the device tree. Userland has no good way of finding
out that size, and it was not specified consistently for each MAC type,
so ethtool would end up printing junk at the end of the register dump
if the device tree didn't match the size it assumed.
Using the new version numbers indicates unambiguously that the size of
the MAC register dump is dependent only on the MAC type.
Fixes: 5369c71f7ca2 ("net/ibm/emac: fix size of emac dump memory areas") Signed-off-by: Ivan Mikhaylov <ivan@ru.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Karl Heiss [Thu, 24 Sep 2015 16:15:07 +0000 (12:15 -0400)]
sctp: Prevent soft lockup when sctp_accept() is called during a timeout event
A case can occur when sctp_accept() is called by the user during
a heartbeat timeout event after the 4-way handshake. Since
sctp_assoc_migrate() changes both assoc->base.sk and assoc->ep, the
bh_sock_lock in sctp_generate_heartbeat_event() will be taken with
the listening socket but released with the new association socket.
The result is a deadlock on any future attempts to take the listening
socket lock.
Note that this race can occur with other SCTP timeouts that take
the bh_lock_sock() in the event sctp_accept() is called.
=====================================
[ BUG: bad unlock balance detected! ]
-------------------------------------
CslRx/12087 is trying to release lock (slock-AF_INET) at:
[<ffffffffa01bcae0>] sctp_generate_timeout_event+0x40/0xe0 [sctp]
but there are no more locks to release!
other info that might help us debug this:
2 locks held by CslRx/12087:
#0: (&asoc->timers[i]){+.-...}, at: [<ffffffff8108ce1f>] run_timer_softirq+0x16f/0x3e0
#1: (slock-AF_INET){+.-...}, at: [<ffffffffa01bcac3>] sctp_generate_timeout_event+0x23/0xe0 [sctp]
Ensure the socket taken is also the same one that is released by
saving a copy of the socket before entering the timeout event
critical section.
Signed-off-by: Karl Heiss <kheiss@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Mitch Williams [Tue, 29 Sep 2015 00:31:26 +0000 (17:31 -0700)]
i40e/i40evf: check for stopped admin queue
It's possible that while we are waiting for the spinlock, another
entity (that owns the spinlock) has shut down the admin queue.
If we then attempt to use the queue, we will panic.
Add a check for this condition on the receive side. This matches
an existing check on the send queue side.
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Acked-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Previously to this patch, the hardware was removing
VLAN tags from the inner header of VXLAN packets. The
hardware configuration can be changed to leave the
packet alone since that is what the linux stack
expects for this type of VLAN in VXLAN packet.
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
When building with allmodconfig the build was failing with the error:
arch/tile/kernel/usb.c:70:1: warning: data definition has no type or storage class [enabled by default]
arch/tile/kernel/usb.c:70:1: error: type defaults to 'int' in declaration of 'arch_initcall' [-Werror=implicit-int]
arch/tile/kernel/usb.c:70:1: warning: parameter names (without types) in function declaration [enabled by default]
arch/tile/kernel/usb.c:63:19: warning: 'tilegx_usb_init' defined but not used [-Wunused-function]
Include linux/module.h to resolve the build failure.
Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org> Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR"
Shifting pvclock_vcpu_time_info.system_time on write to KVM system time
MSR is a change of ABI. Probably only 2.6.16 based SLES 10 breaks due
to its custom enhancements to kvmclock, but KVM never declared the MSR
only for one-shot initialization. (Doc says that only one write is
needed.)
If I2C is built as module, the iTCO watchdog driver must be built as module
as well. I2C_I801 must only be selected if I2C is configured.
This fixes the following build errors, seen if I2C=m and ITCO_WDT=y.
i2c-i801.c:(.text+0x2bf055): undefined reference to `i2c_del_adapter'
i2c-i801.c:(.text+0x2c13e0): undefined reference to `i2c_add_adapter'
i2c-i801.c:(.text+0x2c17bd): undefined reference to `i2c_new_device'
Fixes: 2a7a0e9bf7b3 ("watchdog: iTCO_wdt: Add support for TCO on Intel Sunrisepoint") Reviewed-by: Matt Fleming <matt.fleming@intel.com> Cc: Lee Jones <lee.jones@linaro.org> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
Noralf Trønnes [Wed, 17 Jun 2015 14:04:04 +0000 (16:04 +0200)]
watchdog: bcm2835: Fix poweroff behaviour
Currently poweroff/halt results in a reboot on the Raspberry Pi.
The firmware uses the RSTS register to know which partiton to
boot from. The partiton value is spread into bits
0, 2, 4, 6, 8, 10. Partiton 63 is a special partition used by
the firmware to indicate halt.
The firmware made this change in 19 Aug 2013 and was matched
by the downstream commit:
Changes for new NOOBS multi partition booting from gsh
Signed-off-by: Noralf Trønnes <noralf@tronnes.org> Tested-by: Stephen Warren <swarren@wwwdotorg.org> Acked-by: Stephen Warren <swarren@wwwdotorg.org> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
watchdog: Fix module autoload for OF platform driver
These platform drivers have a OF device ID table but the OF module
alias information is not created so module autoloading won't work.
Signed-off-by: Luis de Bethencourt <luis@debethencourt.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus
Pull MIPS fixes from Ralf Baechle:
- Properly setup irq handling for ATH79 platforms
- Fix bootmem mapstart calculation for contiguous maps
- Handle little endian and older CPUs correct in BPF
- Fix console for Fulong 2E systems
- Handle FTLB correctly on R6 CPUs
- Fixes for CM, GIC and MAAR support code
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
MIPS: Initialise MAARs on secondary CPUs
MIPS: print MAAR configuration during boot
MIPS: mm: compile maar_init unconditionally
irqchip: mips-gic: Fix pending & mask reads for MIPS64 with 32b GIC.
irqchip: mips-gic: Convert CPU numbers to VP IDs.
MIPS: CM: Provide a function to map from CPU to VP ID.
MIPS: Fix FTLB detection for R6
MIPS: cpu-features: Add cpu_has_ftlb
MIPS: ATH79: Add irq chip ar7240-misc-intc
MIPS: ATH79: Set missing irq ack handler for ar7100-misc-intc irq chip
MIPS: BPF: Fix build on pre-R2 little endian CPUs
MIPS: BPF: Avoid unreachable code on little endian
MIPS: bootmem: Fix mapstart calculation for contiguous maps
MIPS: Fix console output for Fulong2e system
Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
"Another pile of fixes for perf:
- Plug overflows and races in the core code
- Sanitize the flow of the perf syscall so we error out before
handling the more complex and hard to undo setups
- Improve and fix Broadwell and Skylake hardware support
- Revert a fix which broke what it tried to fix in perf tools
- A couple of smaller fixes in various places of perf tools"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf tools: Fix copying of /proc/kcore
perf intel-pt: Remove no_force_psb from documentation
perf probe: Use existing routine to look for a kernel module by dso->short_name
perf/x86: Change test_aperfmperf() and test_intel() to static
tools lib traceevent: Fix string handling in heterogeneous arch environments
perf record: Avoid infinite loop at buildid processing with no samples
perf: Fix races in computing the header sizes
perf: Fix u16 overflows
perf: Restructure perf syscall point of no return
perf/x86/intel: Fix Skylake FRONTEND MSR extrareg mask
perf/x86/intel/pebs: Add PEBS frontend profiling for Skylake
perf/x86/intel: Make the CYCLE_ACTIVITY.* constraint on Broadwell more specific
perf tools: Bool functions shouldn't return -1
tools build: Add test for presence of __get_cpuid() gcc builtin
tools build: Add test for presence of numa_num_possible_cpus() in libnuma
Revert "perf symbols: Fix mismatched declarations for elf_getphdrnum"
perf stat: Fix per-pkg event reporting bug
Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Thomas Gleixner:
"A single bug fix for the scheduler to prevent dequeueing of the idle
task when setting the cpus allowed mask"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Fix crash trying to dequeue/enqueue the idle thread
Merge branch 'turbostat' of https://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux into pm-tools
Pull turbostat updates for v4.3 from Len Brown.
* 'turbostat' of https://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
tools/power turbosat: update version number
tools/power turbostat: SKL: Adjust for TSC difference from base frequency
tools/power turbostat: KNL workaround for %Busy and Avg_MHz
tools/power turbostat: IVB Xeon: fix --debug regression
Paul Burton [Fri, 25 Sep 2015 15:59:38 +0000 (08:59 -0700)]
MIPS: Initialise MAARs on secondary CPUs
MAARs should be initialised on each CPU (or rather, core) in the system
in order to achieve consistent behaviour & performance. Previously they
have only been initialised on the boot CPU which leads to performance
problems if tasks are later scheduled on a secondary CPU, particularly
if those tasks make use of unaligned vector accesses where some CPUs
don't handle any cases in hardware for non-speculative memory regions.
Fix this by recording the MAAR configuration from the boot CPU and
applying it to secondary CPUs as part of their bringup.
Reported-by: Doug Gilmore <doug.gilmore@imgtec.com> Signed-off-by: Paul Burton <paul.burton@imgtec.com> Cc: linux-mips@linux-mips.org Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Steven J. Hill <Steven.Hill@imgtec.com> Cc: Andrew Bresticker <abrestic@chromium.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: David Hildenbrand <dahi@linux.vnet.ibm.com> Cc: linux-kernel@vger.kernel.org Cc: Aaro Koskinen <aaro.koskinen@iki.fi> Cc: James Hogan <james.hogan@imgtec.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Markos Chandras <markos.chandras@imgtec.com> Cc: Hemmo Nieminen <hemmo.nieminen@iki.fi> Cc: Alex Smith <alex.smith@imgtec.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Patchwork: https://patchwork.linux-mips.org/patch/11239/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Paul Burton [Fri, 25 Sep 2015 15:59:37 +0000 (08:59 -0700)]
MIPS: print MAAR configuration during boot
Verifying that the MAAR configuration is as expected is useful when
debugging the performance of a system. Print out the memory regions
configured via MAAR along with their attributes.
Signed-off-by: Paul Burton <paul.burton@imgtec.com> Cc: linux-mips@linux-mips.org Cc: Steven J. Hill <Steven.Hill@imgtec.com> Cc: David Hildenbrand <dahi@linux.vnet.ibm.com> Cc: linux-kernel@vger.kernel.org Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Patchwork: https://patchwork.linux-mips.org/patch/11238/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Paul Burton [Fri, 25 Sep 2015 15:59:36 +0000 (08:59 -0700)]
MIPS: mm: compile maar_init unconditionally
maar_init was previously only compiled when CONFIG_NEED_MULTIPLE_NODES
was not set, which has been fine since it is only called from the
standard implementation of mem_init which has the same condition. In
preparation for calling it from the SMP startup code on secondary CPUs,
move maar_init outside of the #ifndef such that it is always compiled.
Signed-off-by: Paul Burton <paul.burton@imgtec.com> Cc: Markos Chandras <markos.chandras@imgtec.com> Cc: linux-mips@linux-mips.org Cc: Steven J. Hill <Steven.Hill@imgtec.com> Cc: David Hildenbrand <dahi@linux.vnet.ibm.com> Cc: linux-kernel@vger.kernel.org Cc: Ingo Molnar <mingo@kernel.org>
Patchwork: https://patchwork.linux-mips.org/patch/11237/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Paul Burton [Tue, 22 Sep 2015 18:29:11 +0000 (11:29 -0700)]
irqchip: mips-gic: Fix pending & mask reads for MIPS64 with 32b GIC.
gic_handle_shared_int reads the GIC interrupt pending & mask registers
directly into a bitmap, which is defined as an array of unsigned longs.
The GIC pending registers may be 32 bits wide if the CM is older than
CM3, regardless of the bit width of the CPU, but for MIPS64 kernels
the unsigned longs in the bitmap will be 64 bits wide. In this case we
need to perform 2 x 32 bit reads per 64 bit unsigned long in order to
avoid missing interrupts.
Signed-off-by: Paul Burton <paul.burton@imgtec.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mips@linux-mips.org Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Jason Cooper <jason@lakedaemon.net> Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/11213/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Paul Burton [Tue, 22 Sep 2015 18:29:10 +0000 (11:29 -0700)]
irqchip: mips-gic: Convert CPU numbers to VP IDs.
Make use of the mips_cm_vp_id function to convert from Linux CPU numbers
to the VP IDs used by hardware, which are not identical in all systems.
Without doing so we map interrupts to incorrect VP(E)s.
Signed-off-by: Paul Burton <paul.burton@imgtec.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mips@linux-mips.org Cc: Paul Burton <paul.burton@imgtec.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Jason Cooper <jason@lakedaemon.net> Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/11212/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Paul Burton [Tue, 22 Sep 2015 18:29:09 +0000 (11:29 -0700)]
MIPS: CM: Provide a function to map from CPU to VP ID.
The VP ID of a given CPU may not match up with the CPU number used by
Linux. For example, if the width of the VP part of the VP ID is wider
than log2(number of VPs per core) and the system has multiple cores then
this will be the case. Alternatively, if a pre-r6 system implements the
MT ASE with multiple VPEs per core and Linux is built without support
for the MT ASE then the numbers won't match up either. Provide a
function to convert from CPU number to VP ID.
Signed-off-by: Paul Burton <paul.burton@imgtec.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mips@linux-mips.org Cc: linux-kernel@vger.kernel.org Cc: James Hogan <james.hogan@imgtec.com> Cc: Markos Chandras <markos.chandras@imgtec.com>
Patchwork: https://patchwork.linux-mips.org/patch/11211/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"Two bugfixes from Andy addressing at least some of the subtle NMI
related wreckage which has been reported by Sasha Levin"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/nmi/64: Fix a paravirt stack-clobbering bug in the NMI code
x86/paravirt: Replace the paravirt nop with a bona fide empty function