Jérôme Glisse [Sat, 13 Oct 2018 04:34:36 +0000 (21:34 -0700)]
mm/thp: fix call to mmu_notifier in set_pmd_migration_entry() v2
Inside set_pmd_migration_entry() we are holding page table locks and thus
we can not sleep so we can not call invalidate_range_start/end()
So remove call to mmu_notifier_invalidate_range_start/end() because they
are call inside the function calling set_pmd_migration_entry() (see
try_to_unmap_one()).
Link: http://lkml.kernel.org/r/20181012181056.7864-1-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reported-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Zi Yan <zi.yan@cs.rutgers.edu> Acked-by: Michal Hocko <mhocko@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Jann Horn [Sat, 13 Oct 2018 04:34:32 +0000 (21:34 -0700)]
mm/mmap.c: don't clobber partially overlapping VMA with MAP_FIXED_NOREPLACE
Daniel Micay reports that attempting to use MAP_FIXED_NOREPLACE in an
application causes that application to randomly crash. The existing check
for handling MAP_FIXED_NOREPLACE looks up the first VMA that either
overlaps or follows the requested region, and then bails out if that VMA
overlaps *the start* of the requested region. It does not bail out if the
VMA only overlaps another part of the requested region.
Fix it by checking that the found VMA only starts at or after the end of
the requested region, in which case there is no overlap.
zhong jiang [Sat, 13 Oct 2018 04:34:26 +0000 (21:34 -0700)]
ocfs2: fix a GCC warning
Fix the following compile warning:
fs/ocfs2/dlmglue.c:99:30: warning: \91lockdep_keys\92 defined but not used [-Wunused-variable]
static struct lock_class_key lockdep_keys[OCFS2_NUM_LOCK_TYPES];
Merge tag 'armsoc-fixes-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Arnd writes:
"ARM: SoC fixes for 4.19
Two last minute bugfixes, both for NXP platforms:
* The Layerscape 'qbman' infrastructure suffers from probe ordering
bugs in some configurations, a two-patch series adds a hotfix for
this. 4.20 will have a longer set of patches to rework it.
* The old imx53-qsb board regressed in 4.19 after the addition
of cpufreq support, adding a set of explicit operating points
fixes this."
* tag 'armsoc-fixes-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
soc: fsl: qman_portals: defer probe after qman's probe
soc: fsl: qbman: add APIs to retrieve the probing status
ARM: dts: imx53-qsb: disable 1.2GHz OPP
David Howells [Fri, 12 Oct 2018 13:00:57 +0000 (14:00 +0100)]
afs: Fix afs_server struct leak
Fix a leak of afs_server structs. The routine that installs them in the
various lookup lists and trees gets a ref on leaving the function, whether
it added the server or a server already exists. It shouldn't increment
the refcount if it added the server.
The effect of this that "rmmod kafs" will hang waiting for the leaked
server to become unused.
Fixes: d2ddc776a458 ("afs: Overhaul volume and server record caching and fileserver rotation") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Merge tag 'mmc-v4.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc
Ulf writes:
"MMC core:
- Avoid fragile multiblock reads for the last sector in SPI mode
WIFI/SDIO:
- libertas: Fixup suspend sequence for the SDIO card"
* tag 'mmc-v4.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
libertas: call into generic suspend code before turning off power
mmc: block: avoid multiblock reads for the last sector in SPI mode
Merge tag 'for-gkh' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Doug writes:
"RDMA fixes:
Final for-rc pull request for 4.19
We only have one bug to submit this time around. It fixes a DMA
unmap issue where we unmapped the DMA address from the IOMMU before
we did from the card, resulting in a DMAR error with IOMMU enabled,
or possible crash without."
* tag 'for-gkh' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
IB/mlx5: Unmap DMA addr from HCA before IOMMU
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Dmitry writes:
"Input updates for v4.19-rc7
- we added a few scheduling points into various input interfaces to
ensure that large writes will not cause RCU stalls
- fixed configuring PS/2 keyboards as wakeup devices on newer
platforms
- added a new Xbox gamepad ID."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: uinput - add a schedule point in uinput_inject_events()
Input: evdev - add a schedule point in evdev_write()
Input: mousedev - add a schedule point in mousedev_write()
Input: i8042 - enable keyboard wakeups by default when s2idle is used
Input: xpad - add support for Xbox1 PDP Camo series gamepad
Merge tag 'next-fixes-20181012' of git://git.kernel.org/pub/scm/linux/kernel/git/sfr/next-fixes
Stephen writes:
"A couple of warning fixes:
Two fixes from Peter Oberparleiter <oberpar@linux.ibm.com>:
Commit 6b7dca401cb1 ("tracing: Allow gcov profiling on only ftrace subsystem")
uncovered linker problems when using gcov kernel profiling on some
architectures. These problems were likely introduced earlier, and are
possibly related to compiler changes."
* tag 'next-fixes-20181012' of git://git.kernel.org/pub/scm/linux/kernel/git/sfr/next-fixes:
vmlinux.lds.h: Fix linker warnings about orphan .LPBX sections
vmlinux.lds.h: Fix incomplete .text.exit discards
Arnd Bergmann [Thu, 11 Oct 2018 11:06:17 +0000 (13:06 +0200)]
lib/bch: fix possible stack overrun
The previous patch introduced very large kernel stack usage and a Makefile
change to hide the warning about it.
From what I can tell, a number of things went wrong here:
- The BCH_MAX_T constant was set to the maximum value for 'n',
not the maximum for 't', which is much smaller.
- The stack usage is actually larger than the entire kernel stack
on some architectures that can use 4KB stacks (m68k, sh, c6x), which
leads to an immediate overrun.
- The justification in the patch description claimed that nothing
changed, however that is not the case even without the two points above:
the configuration is machine specific, and most boards never use the
maximum BCH_ECC_WORDS() length but instead have something much smaller.
That maximum would only apply to machines that use both the maximum
block size and the maximum ECC strength.
The largest value for 't' that I could find is '32', which in turn leads
to a 60 byte array instead of 2048 bytes. Making it '64' for future
extension seems also worthwhile, with 120 bytes for the array. Anything
larger won't fit into the OOB area on NAND flash.
With that changed, the warning can be enabled again.
Only linux-4.19+ contains the breakage, so this is only needed
as a stable backport if it does not make it into the release.
3) Fix refcounting in u32 classificer, from Al Viro.
4) Userspace netlink ABI fixes from Eugene Syromiatnikov.
5) Don't double iounmap on rmmod in ena driver, from Arthur
Kiyanovski.
6) Fix devlink string attribute handling, we must pull a copy into a
kernel buffer if the lifetime extends past the netlink request.
From Moshe Shemesh.
7) Fix hangs in RDS, from Ka-Cheong Poon.
8) Fix recursive locking lockdep warnings in tipc, from Ying Xue.
9) Clear RX irq correctly in socionext, from Ilias Apalodimas.
10) bcm_sf2 fixes from Florian Fainelli."
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (38 commits)
net: dsa: bcm_sf2: Call setup during switch resume
net: dsa: bcm_sf2: Fix unbind ordering
net: phy: sfp: remove sfp_mutex's definition
r8169: set RX_MULTI_EN bit in RxConfig for 8168F-family chips
net: socionext: clear rx irq correctly
net/mlx4_core: Fix warnings during boot on driverinit param set failures
tipc: eliminate possible recursive locking detected by LOCKDEP
selftests: udpgso_bench.sh explicitly requires bash
selftests: rtnetlink.sh explicitly requires bash.
qmi_wwan: Added support for Gemalto's Cinterion ALASxx WWAN interface
tipc: queue socket protocol error messages into socket receive buffer
tipc: set link tolerance correctly in broadcast link
net: ipv4: don't let PMTU updates increase route MTU
net: ipv4: update fnhe_pmtu when first hop's MTU changes
net/ipv6: stop leaking percpu memory in fib6 info
rds: RDS (tcp) hangs on sendto() to unresponding address
net: make skb_partial_csum_set() more robust against overflows
devlink: Add helper function for safely copy string param
devlink: Fix param cmode driverinit for string type
devlink: Fix param set handling for string type
...
David S. Miller [Thu, 11 Oct 2018 22:20:00 +0000 (15:20 -0700)]
Merge branch 'net-dsa-bcm_sf2-Couple-of-fixes'
Florian Fainelli says:
====================
net: dsa: bcm_sf2: Couple of fixes
Here are two fixes for the bcm_sf2 driver that were found during
testing unbind and analysing another issue during system
suspend/resume.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
net: dsa: bcm_sf2: Call setup during switch resume
There is no reason to open code what the switch setup function does, in
fact, because we just issued a switch reset, we would make all the
register get their default values, including for instance, having unused
port be enabled again and wasting power and leading to an inappropriate
switch core clock being selected.
Fixes: 8cfa94984c9c ("net: dsa: bcm_sf2: add suspend/resume callbacks") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The order in which we release resources is unfortunately leading to bus
errors while dismantling the port. This is because we set
priv->wol_ports_mask to 0 to tell bcm_sf2_sw_suspend() that it is now
permissible to clock gate the switch. Later on, when dsa_slave_destroy()
comes in from dsa_unregister_switch() and calls
dsa_switch_ops::port_disable, we perform the same dismantling again, and
this time we hit registers that are clock gated.
Make sure that dsa_unregister_switch() is the first thing that happens,
which takes care of releasing all user visible resources, then proceed
with clock gating hardware. We still need to set priv->wol_ports_mask to
0 to make sure that an enabled port properly gets disabled in case it
was previously used as part of Wake-on-LAN.
Fixes: d9338023fb8e ("net: dsa: bcm_sf2: Make it a real platform device driver") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
vmlinux.lds.h: Fix linker warnings about orphan .LPBX sections
Enabling both CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=y and
CONFIG_GCOV_PROFILE_ALL=y results in linker warnings:
warning: orphan section `.data..LPBX1' being placed in
section `.data..LPBX1'.
LD_DEAD_CODE_DATA_ELIMINATION adds compiler flag -fdata-sections. This
option causes GCC to create separate data sections for data objects,
including those generated by GCC internally for gcov profiling. The
names of these objects start with a dot (.LPBX0, .LPBX1), resulting in
section names starting with 'data..'.
As section names starting with 'data..' are used for specific purposes
in the Linux kernel, the linker script does not automatically include
them in the output data section, resulting in the "orphan section"
linker warnings.
Fix this by specifically including sections named "data..LPBX*" in the
data section.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Tested-by: Stephen Rothwell <sfr@canb.auug.org.au> Tested-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Peter Oberparleiter <oberpar@linux.ibm.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Enabling CONFIG_GCOV_PROFILE_ALL=y causes linker errors on ARM:
`.text.exit' referenced in section `.ARM.exidx.text.exit':
defined in discarded section `.text.exit'
`.text.exit' referenced in section `.fini_array.00100':
defined in discarded section `.text.exit'
And related errors on NDS32:
`.text.exit' referenced in section `.dtors.65435':
defined in discarded section `.text.exit'
The gcov compiler flags cause certain compiler versions to generate
additional destructor-related sections that are not yet handled by the
linker script, resulting in references between discarded and
non-discarded sections.
Since destructors are not used in the Linux kernel, fix this by
discarding these additional sections.
Reported-by: Arnd Bergmann <arnd@arndb.de> Tested-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Arnd Bergmann <arnd@arndb.de> Reported-by: Greentime Hu <green.hu@gmail.com> Tested-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Peter Oberparleiter <oberpar@linux.ibm.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
The sfp_mutex variable is defined but never used in this file. Not even
in the commit that introduced that variable.
Remove sfp_mutex, it has no purpose.
Cc: Andrew Lunn <andrew@lunn.ch> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
r8169: set RX_MULTI_EN bit in RxConfig for 8168F-family chips
It has been reported that since
commit 05212ba8132b42 ("r8169: set RxConfig after tx/rx is enabled for RTL8169sb/8110sb devices")
at least RTL_GIGA_MAC_VER_38 NICs work erratically after a resume from
suspend.
The problem has been traced to a missing RX_MULTI_EN bit in the RxConfig
register.
We already set this bit for RTL_GIGA_MAC_VER_35 NICs of the same 8168F
chip family so let's do it also for its other siblings: RTL_GIGA_MAC_VER_36
and RTL_GIGA_MAC_VER_38.
Curiously, the NIC seems to work fine after a system boot without having
this bit set as long as the system isn't suspended and resumed.
Fixes: 05212ba8132b42 ("r8169: set RxConfig after tx/rx is enabled for RTL8169sb/8110sb devices") Reported-by: Chris Clayton <chris2553@googlemail.com> Signed-off-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name> Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com> Tested-by: Chris Clayton <chris2553@googlemail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ilias Apalodimas [Thu, 11 Oct 2018 12:28:26 +0000 (15:28 +0300)]
net: socionext: clear rx irq correctly
commit 63ae7949e94a ("net: socionext: Use descriptor info instead of MMIO reads on Rx")
removed constant mmio reads from the driver and started using a descriptor
field to check if packet should be processed.
This lead the napi rx handler being constantly called while no packets
needed processing and ksoftirq getting 100% cpu usage. Issue one mmio read
to clear the irq correcty after processing packets
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Reported-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Moshe Shemesh [Thu, 11 Oct 2018 12:01:19 +0000 (15:01 +0300)]
net/mlx4_core: Fix warnings during boot on driverinit param set failures
During boot, mlx4_core sets the driverinit configuration parameters and
updates the devlink module on the initial values calling
devlink_param_driverinit_value_set().
If devlink_param_driverinit_value_set() returns an error mlx4_core
reports kernel module warning.
This caused false alarm during boot in case kernel was compiled with
CONFIG_NET_DEVLINK off.
Fix by removing warning reported in case
devlink_param_driverinit_value_set() fails.
This actually makes the function mlx4_devlink_set_init_value()
redundant to using directly devlink_param_driverinit_value_set() and so
removed.
It fixes the following kernel trace:
mlx4_core 0000:00:06.0: devlink set parameter 0 value failed (err = -95)
mlx4_core 0000:00:06.0: devlink set parameter 1 value failed (err = -95)
mlx4_core 0000:00:06.0: devlink set parameter 4 value failed (err = -95)
mlx4_core 0000:00:06.0: devlink set parameter 5 value failed (err = -95)
mlx4_core 0000:00:06.0: devlink set parameter 3 value failed (err = -95)
Fixes: bd1b51dc66df ("mlx4: Add mlx4 initial parameters table and register it") Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Merge branch 'for-4.19-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Tejun writes:
"cgroup fixes for v4.19-rc7
One cgroup2 threaded mode fix for v4.19-rc7. While threaded mode
isn't used widely (yet) and the bug requires somewhat convoluted
sequence of operations, it causes a userland visible malfunction -
EINVAL on a valid attempt to enable threaded mode. This pull request
contains the fix"
* 'for-4.19-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: Fix dom_cgrp propagation when enabling threaded mode
The reason why the noise above was complained by LOCKDEP is because we
nested to hold l->wakeupq.lock and l->inputq->lock in tipc_link_reset
function. In fact it's unnecessary to move skb buffer from l->wakeupq
queue to l->inputq queue while holding the two locks at the same time.
Instead, we can move skb buffers in l->wakeupq queue to a temporary
list first and then move the buffers of the temporary list to l->inputq
queue, which is also safe for us.
Fixes: 3f32d0be6c16 ("tipc: lock wakeup & inputq at tipc_link_reset()") Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Merge tag 'kbuild-fixes-v4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Masahiro writes:
"Kbuild fixes for v4.19 (2nd)
- Fix warnings from recordmcount.pl when building with Clang
- Allow Clang to use GNU toolchains correctly
- Disable CONFIG_SAMPLES for UML to avoid build error"
* tag 'kbuild-fixes-v4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
samples: disable CONFIG_SAMPLES for UML
kbuild: allow to use GCC toolchain not in Clang search path
ftrace: Build with CPPFLAGS to get -Qunused-arguments
====================
net: explicitly requires bash when needed.
Some test scripts require bash-only features but use the default shell.
This may cause random failures if the default shell is not bash.
Instead of doing a potentially complex rewrite of such scripts, these patches
require the bash interpreter, where needed.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The udpgso_bench.sh script requires several bash-only features. This
may cause random failures if the default shell is not bash.
Address the above explicitly requiring bash as the script interpreter
Fixes: 3a687bef148d ("selftests: udp gso benchmark") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Thu, 11 Oct 2018 08:54:52 +0000 (10:54 +0200)]
selftests: rtnetlink.sh explicitly requires bash.
the script rtnetlink.sh requires a bash-only features (sleep with sub-second
precision). This may cause random test failure if the default shell is not
bash.
Address the above explicitly requiring bash as the script interpreter.
Fixes: 33b01b7b4f19 ("selftests: add rtnetlink test script") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Merge tag 'alloc-args-v4.19-rc8' of https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Kees writes:
"Fix open-coded multiplication arguments to allocators
- Fixes several new open-coded multiplications added in the 4.19
merge window."
* tag 'alloc-args-v4.19-rc8' of https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
treewide: Replace more open-coded allocation size multiplications
Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Ingo, a man of few words, writes:
"perf fixes:
misc perf tooling fixes"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf record: Use unmapped IP for inline callchain cursors
perf python: Use -Wno-redundant-decls to build with PYTHON=python3
perf report: Don't try to map ip to invalid map
perf script python: Fix export-to-sqlite.py sample columns
perf script python: Fix export-to-postgresql.py occasional failure
tipc: queue socket protocol error messages into socket receive buffer
In tipc_sk_filter_rcv(), when we detect protocol messages with error we
call tipc_sk_conn_proto_rcv() and let it reset the connection and notify
the socket by calling sk->sk_state_change().
However, tipc_sk_filter_rcv() may have been called from the function
tipc_backlog_rcv(), in which case the socket lock is held and the socket
already awake. This means that the sk_state_change() call is ignored and
the error notification lost. Now the receive queue will remain empty and
the socket sleeps forever.
In this commit, we convert the protocol message into a connection abort
message and enqueue it into the socket's receive queue. By this addition
to the above state change we cover all conditions.
Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Maloy [Wed, 10 Oct 2018 15:34:01 +0000 (17:34 +0200)]
tipc: set link tolerance correctly in broadcast link
In the patch referred to below we added link tolerance as an additional
criteria for declaring broadcast transmission "stale" and resetting the
affected links.
However, the 'tolerance' field of the broadcast link is never set, and
remains at zero. This renders the whole commit without the intended
improving effect, but luckily also with no negative effect.
In this commit we add the missing initialization.
Fixes: a4dc70d46cf1 ("tipc: extend link reset criteria for stale packet retransmission") Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
net: ipv4: fixes for PMTU when link MTU changes
The first patch adapts the changes that commit e9fa1495d738 ("ipv6:
Reflect MTU changes on PMTU of exceptions for MTU-less routes") did in
IPv6 to IPv4: lower PMTU when the first hop's MTU drops below it, and
raise PMTU when the first hop was limiting PMTU discovery and its MTU
is increased.
The second patch fixes bugs introduced in commit d52e5a7e7ca4 ("ipv4:
lock mtu in fnhe when received PMTU < net.ipv4.route.min_pmtu") that
only appear once the first patch is applied.
Selftests for these cases were introduced in net-next commit e44e428f59e4 ("selftests: pmtu: add basic IPv4 and IPv6 PMTU tests")
v2: add cover letter, and fix a few small things in patch 1
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sabrina Dubroca [Tue, 9 Oct 2018 15:48:15 +0000 (17:48 +0200)]
net: ipv4: don't let PMTU updates increase route MTU
When an MTU update with PMTU smaller than net.ipv4.route.min_pmtu is
received, we must clamp its value. However, we can receive a PMTU
exception with PMTU < old_mtu < ip_rt_min_pmtu, which would lead to an
increase in PMTU.
To fix this, take the smallest of the old MTU and ip_rt_min_pmtu.
Before this patch, in case of an update, the exception's MTU would
always change. Now, an exception can have only its lock flag updated,
but not the MTU, so we need to add a check on locking to the following
"is this exception getting updated, or close to expiring?" test.
Fixes: d52e5a7e7ca4 ("ipv4: lock mtu in fnhe when received PMTU < net.ipv4.route.min_pmtu") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sabrina Dubroca [Tue, 9 Oct 2018 15:48:14 +0000 (17:48 +0200)]
net: ipv4: update fnhe_pmtu when first hop's MTU changes
Since commit 5aad1de5ea2c ("ipv4: use separate genid for next hop
exceptions"), exceptions get deprecated separately from cached
routes. In particular, administrative changes don't clear PMTU anymore.
As Stefano described in commit e9fa1495d738 ("ipv6: Reflect MTU changes
on PMTU of exceptions for MTU-less routes"), the PMTU discovered before
the local MTU change can become stale:
- if the local MTU is now lower than the PMTU, that PMTU is now
incorrect
- if the local MTU was the lowest value in the path, and is increased,
we might discover a higher PMTU
Similarly to what commit e9fa1495d738 did for IPv6, update PMTU in those
cases.
If the exception was locked, the discovered PMTU was smaller than the
minimal accepted PMTU. In that case, if the new local MTU is smaller
than the current PMTU, let PMTU discovery figure out if locking of the
exception is still needed.
To do this, we need to know the old link MTU in the NETDEV_CHANGEMTU
notifier. By the time the notifier is called, dev->mtu has been
changed. This patch adds the old MTU as additional information in the
notifier structure, and a new call_netdevice_notifiers_u32() function.
Fixes: 5aad1de5ea2c ("ipv4: use separate genid for next hop exceptions") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Mike Rapoport [Tue, 9 Oct 2018 04:02:01 +0000 (07:02 +0300)]
net/ipv6: stop leaking percpu memory in fib6 info
The fib6_info_alloc() function allocates percpu memory to hold per CPU
pointers to rt6_info, but this memory is never freed. Fix it.
Fixes: a64efe142f5e ("net/ipv6: introduce fib6_info struct and helpers") Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Here are a set of patches that prepares for and fix problems in rxrpc's
package reception code. There serious problems are:
(A) There's a window between binding the socket and setting the data_ready
hook in which packets can find their way into the UDP socket's receive
queues.
(B) The skb_recv_udp() will return an error (and clear the error state) if
there was an error on the Tx side. rxrpc doesn't handle this.
(C) The rxrpc data_ready handler doesn't fully drain the UDP receive
queue.
(D) The rxrpc data_ready handler assumes it is called in a non-reentrant
state.
The second patch fixes (A) - (C); the third patch renders (B) and (C)
non-issues by using the recap_rcv hook instead of data_ready - and the
final patch fixes (D). That last is the most complex.
The preparatory patches are:
(1) Fix some places that are doing things in the wrong net namespace.
(2) Stop taking the rcu read lock as it's held by the IP input routine in
the call chain.
(3) Only end the Tx phase if *we* rotated the final packet out of the Tx
buffer.
(4) Don't assume that the call state won't change after dropping the
call_state lock.
(5) Only take receive window and MTU suze parameters from an ACK packet if
it's the latest ACK packet.
(6) Record connection-level abort information correctly.
(7) Fix a trace line.
And then there are three main patches - note that these are mixed in with
the preparatory patches somewhat:
(1) Fix the setup window (A), skb_recv_udp() error check (B) and packet
drainage (C).
(2) Switch to using the encap_rcv instead of data_ready to cut out the
effects of the UDP read queues and get the packets delivered directly.
(3) Add more locking into the various packet input paths to defend against
re-entrance (D).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ka-Cheong Poon [Mon, 8 Oct 2018 16:17:11 +0000 (09:17 -0700)]
rds: RDS (tcp) hangs on sendto() to unresponding address
In rds_send_mprds_hash(), if the calculated hash value is non-zero and
the MPRDS connections are not yet up, it will wait. But it should not
wait if the send is non-blocking. In this case, it should just use the
base c_path for sending the message.
Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Merge tag 'xfs-fixes-for-4.19-rc7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Dave writes:
"xfs: fixes for 4.19-rc7
Update for 4.19-rc7 to fix numerous file clone and deduplication issues."
* tag 'xfs-fixes-for-4.19-rc7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix data corruption w/ unaligned reflink ranges
xfs: fix data corruption w/ unaligned dedupe ranges
xfs: update ctime and remove suid before cloning files
xfs: zero posteof blocks when cloning above eof
xfs: refactor clonerange preparation into a separate helper
The dm-linear target is independent of the dm-zoned target. For code
requiring support for zoned block devices, use CONFIG_BLK_DEV_ZONED
instead of CONFIG_DM_ZONED.
While at it, similarly to dm linear, also enable the DM_TARGET_ZONED_HM
feature in dm-flakey only if CONFIG_BLK_DEV_ZONED is defined.
Fixes: beb9caac211c1 ("dm linear: eliminate linear_end_io call if CONFIG_DM_ZONED disabled") Fixes: 0be12c1c7fce7 ("dm linear: add support for zoned block devices") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Merge tag 'for-4.19/dm-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Mike writes:
"device mapper fixes for 4.19 final
- Fix a DM cache module init error path bug that doesn't properly
cleanup a KMEM_CACHE if target registration fails.
- Two stable@ fixes for DM zoned target; 4.20 will have changes that
eliminate this code entirely but <= 4.19 needs these changes."
* tag 'for-4.19/dm-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm linear: eliminate linear_end_io call if CONFIG_DM_ZONED disabled
dm: fix report zone remapping to account for partition offset
dm cache: destroy migration_cache if cache target registration failed
Merge tag 'trace-v4.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Steven writes:
"vsprint fix:
It was reported that trace_printk() was not reporting properly
values that came after a dereference pointer.
trace_printk() utilizes vbin_printf() and bstr_printf() to keep the
overhead of tracing down. vbin_printf() does not do any conversions
and just stors the string format and the raw arguments into the
buffer. bstr_printf() is used to read the buffer and does the
conversions to complete the printf() output.
This can be troublesome with dereferenced pointers because the
reference may be different from the time vbin_printf() is called to
the time bstr_printf() is called. To fix this, a prior commit changed
vbin_printf() to convert dereferenced pointers into strings and load
the converted string into the buffer. But the change to bstr_printf()
had an off-by-one error and didn't account for the nul character at
the end of the string and this corrupted the rest of the values in
the format that came after a dereferenced pointer."
* tag 'trace-v4.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
vsprintf: Fix off-by-one bug in bstr_printf() processing dereferenced pointers
Merge tag 'devicetree-fixes-for-4.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
Rob writes:
"Devicetree fixes for 4.19, part 3:
- Fix DT unittest on Oldworld MAC systems"
* tag 'devicetree-fixes-for-4.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
of: unittest: Disable interrupt node tests for old world MAC systems
Valentine Fatiev [Wed, 10 Oct 2018 06:56:25 +0000 (09:56 +0300)]
IB/mlx5: Unmap DMA addr from HCA before IOMMU
The function that puts back the MR in cache also removes the DMA address
from the HCA. Therefore we need to call this function before we remove
the DMA mapping from MMU. Otherwise the HCA may access a memory that
is no longer DMA mapped.
Root cause is the following check in skb_partial_csum_set()
if (unlikely(start > skb_headlen(skb)) ||
unlikely((int)start + off > skb_headlen(skb) - 2))
return false;
If skb_headlen(skb) is 1, then (skb_headlen(skb) - 2) becomes 0xffffffff
and the check fails to detect that ((int)start + off) is off the limit,
since the compare is unsigned.
When we fix that, then the first condition (start > skb_headlen(skb))
becomes obsolete.
Then we should also check that (skb_headroom(skb) + start) wont
overflow 16bit field.
Fixes: 5ff8dda3035d ("net: Ensure partial checksum offset is inside the skb head") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 10 Oct 2018 17:19:10 +0000 (10:19 -0700)]
Merge branch 'devlink-param-type-string-fixes'
Moshe Shemesh says:
====================
devlink param type string fixes
This patchset fixes devlink param infrastructure for string param type.
The devlink param infrastructure doesn't handle copying the string data
correctly. The first two patches fix it and the third patch adds helper
function to safely copy string value without exceeding
DEVLINK_PARAM_MAX_STRING_VALUE.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Moshe Shemesh [Wed, 10 Oct 2018 13:09:27 +0000 (16:09 +0300)]
devlink: Add helper function for safely copy string param
Devlink string param buffer is allocated at the size of
DEVLINK_PARAM_MAX_STRING_VALUE. Add helper function which makes sure
this size is not exceeded.
Renamed DEVLINK_PARAM_MAX_STRING_VALUE to
__DEVLINK_PARAM_MAX_STRING_VALUE to emphasize that it should be used by
devlink only. The driver should use the helper function instead to
verify it doesn't exceed the allowed length.
Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Moshe Shemesh [Wed, 10 Oct 2018 13:09:26 +0000 (16:09 +0300)]
devlink: Fix param cmode driverinit for string type
Driverinit configuration mode value is held by devlink to enable the
driver fetch the value after reload command. In case the param type is
string devlink should copy the value from driver string buffer to
devlink string buffer on devlink_param_driverinit_value_set() and
vice-versa on devlink_param_driverinit_value_get().
Fixes: ec01aeb1803e ("devlink: Add support for get/set driverinit value") Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Moshe Shemesh [Wed, 10 Oct 2018 13:09:25 +0000 (16:09 +0300)]
devlink: Fix param set handling for string type
In case devlink param type is string, it needs to copy the string value
it got from the input to devlink_param_value.
Fixes: e3b7ca18ad7b ("devlink: Add param set command") Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Some samples require headers installation, so commit 3fca1700c4c3
("kbuild: make samples really depend on headers_install") added
such dependency in the top Makefile. However, UML fails to build
with CONFIG_SAMPLES=y because UML does not support headers_install.
Fixes: 3fca1700c4c3 ("kbuild: make samples really depend on headers_install") Reported-by: Kees Cook <keescook@chromium.org> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Mike Snitzer [Wed, 10 Oct 2018 16:01:55 +0000 (12:01 -0400)]
dm linear: eliminate linear_end_io call if CONFIG_DM_ZONED disabled
It is best to avoid any extra overhead associated with bio completion.
DM core will indirectly call a DM target's .end_io if it is defined.
In the case of DM linear, there is no need to do so (for every bio that
completes) if CONFIG_DM_ZONED is not enabled.
Avoiding an extra indirect call for every bio completion is very
important for ensuring DM linear doesn't incur more overhead that
further widens the performance gap between dm-linear and raw block
devices.
Fixes: 0be12c1c7fce7 ("dm linear: add support for zoned block devices") Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Marco Felsch [Tue, 2 Oct 2018 08:06:46 +0000 (10:06 +0200)]
pinctrl: mcp23s08: fix irq and irqchip setup order
Since 'commit 02e389e63e35 ("pinctrl: mcp23s08: fix irq setup order")' the
irq request isn't the last devm_* allocation. Without a deeper look at
the irq and testing this isn't a good solution. Since this driver relies
on the devm mechanism, requesting a interrupt should be the last thing
to avoid memory corruptions during unbinding.
'Commit 02e389e63e35 ("pinctrl: mcp23s08: fix irq setup order")' fixed the
order for the interrupt-controller use case only. The
mcp23s08_irq_setup() must be split into two to fix it for the
interrupt-controller use case and to register the irq at last. So the
irq will be freed first during unbind.
Stephen Boyd [Mon, 8 Oct 2018 16:32:13 +0000 (09:32 -0700)]
gpio: Assign gpio_irq_chip::parents to non-stack pointer
gpiochip_set_cascaded_irqchip() is passed 'parent_irq' as an argument
and then the address of that argument is assigned to the gpio chips
gpio_irq_chip 'parents' pointer shortly thereafter. This can't ever
work, because we've just assigned some stack address to a pointer that
we plan to dereference later in gpiochip_irq_map(). I ran into this
issue with the KASAN report below when gpiochip_irq_map() tried to setup
the parent irq with a total junk pointer for the 'parents' array.
BUG: KASAN: stack-out-of-bounds in gpiochip_irq_map+0x228/0x248
Read of size 4 at addr ffffffc0dde472e0 by task swapper/0/1
Let's leave around one unsigned int in the gpio_irq_chip struct for the
single parent irq case and repoint the 'parents' array at it. This way
code is left mostly intact to setup parents and we waste an extra few
bytes per structure of which there should be only a handful in a system.
Daniel Mack [Mon, 8 Oct 2018 20:03:57 +0000 (22:03 +0200)]
libertas: call into generic suspend code before turning off power
When powering down a SDIO connected card during suspend, make sure to call
into the generic lbs_suspend() function before pulling the plug. This will
make sure the card is successfully deregistered from the system to avoid
communication to the card starving out.
Fixes: 7444a8092906 ("libertas: fix suspend and resume for SDIO connected cards") Signed-off-by: Daniel Mack <daniel@zonque.org> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Acked-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
of: unittest: Disable interrupt node tests for old world MAC systems
On systems with OF_IMAP_OLDWORLD_MAC set in of_irq_workarounds, the
devicetree interrupt parsing code is different, causing unit tests of
devicetree interrupt nodes to fail. Due to a bug in unittest code, which
tries to dereference an uninitialized pointer, this results in a crash.
Merge tag 'tag-chrome-platform-fixes-for-v4.19-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/bleung/chrome-platform
Benson writes:
"chrome-platform fix for v4.19-rc8
This contains a fix to 57e94c8b974d ("mfd: cros-ec: Increase maximum
mkbp event size"), which caused cros_ec based chromebooks to truncate
an entire column of their built-in keyboard."
* tag 'tag-chrome-platform-fixes-for-v4.19-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/bleung/chrome-platform:
mfd: cros-ec: copy the whole event in get_next_event_xfer
Merge branch 'for-4.19-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu
Dennis writes:
"percpu fixes for-4.19-rc8
The new percpu allocator introduced in 4.14 had a missing free for
the percpu metadata. This caused a memory leak when percpu memory is
being churned resulting in the allocation and deallocation of percpu
memory chunks"
* 'for-4.19-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
percpu: stop leaking bitmap metadata blocks
Merge tag 's390-4.19-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Martin writes:
"s390 fixes for 4.19-rc8
Four more patches for 4.19:
- Fix resume after suspend-to-disk if resume-CPU != suspend-CPU
- Fix vfio-ccw check for pinned pages
- Two patches to avoid a usercopy-whitelist warning in vfio-ccw"
* tag 's390-4.19-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/cio: Fix how vfio-ccw checks pinned pages
s390/cio: Refactor alloc of ccw_io_region
s390/cio: Convert ccw_io_region to pointer
s390/hibernate: fix error handling when suspend cpu != resume cpu
Merge tag 'mips_fixes_4.19_2' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux
Paul writes:
"A few MIPS fixes for 4.19:
- Avoid suboptimal placement of our VDSO when using the legacy mmap
layout, which can prevent statically linked programs that were able
to allocate large amounts of memory using the brk syscall prior to
the introduction of our VDSO from functioning correctly.
- Fix up CONFIG_CMDLINE handling for platforms which ought to ignore
DT arguments but have incorrectly used them & lost other arguments
since v3.16.
- Fix a path in MAINTAINERS to use valid wildcards.
- Fixup a regression from v4.17 in memset() for systems using
CPU_DADDI_WORKAROUNDS."
* tag 'mips_fixes_4.19_2' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
MIPS: memset: Fix CPU_DADDI_WORKAROUNDS `small_fixup' regression
MAINTAINERS: MIPS/LOONGSON2 ARCHITECTURE - Use the normal wildcard style
MIPS: Fix CONFIG_CMDLINE handling
MIPS: VDSO: Always map near top of user memory
Emil Karlson [Wed, 3 Oct 2018 18:43:18 +0000 (21:43 +0300)]
mfd: cros-ec: copy the whole event in get_next_event_xfer
Commit 57e94c8b974db2d83c60e1139c89a70806abbea0 caused cros-ec keyboard events
be truncated on many chromebooks so that Left and Right keys on Column 12 were
always 0. Use ret as memcpy len to fix this.
The old code was using ec_dev->event_size, which is the event payload/data size
excluding event_type header, for the length of the memcpy operation. Use ret
as memcpy length to avoid the off by one and copy the whole msg->data.
Fixes: 57e94c8b974d ("mfd: cros-ec: Increase maximum mkbp event size") Acked-by: Enric Balletbo i Serra <enric.balletbo@collabora.com> Tested-by: Emil Renner Berthing <kernel@esmil.dk> Signed-off-by: Emil Karlson <jekarlson@gmail.com> Signed-off-by: Benson Leung <bleung@chromium.org>
Damien Le Moal [Tue, 9 Oct 2018 05:24:31 +0000 (14:24 +0900)]
dm: fix report zone remapping to account for partition offset
If dm-linear or dm-flakey are layered on top of a partition of a zoned
block device, remapping of the start sector and write pointer position
of the zones reported by a report zones BIO must be modified to account
for the target table entry mapping (start offset within the device and
entry mapping with the dm device). If the target's backing device is a
partition of a whole disk, the start sector on the physical device of
the partition must also be accounted for when modifying the zone
information. However, dm_remap_zone_report() was not considering this
last case, resulting in incorrect zone information remapping with
targets using disk partitions.
Fix this by calculating the target backing device start sector using
the position of the completed report zones BIO and the unchanged
position and size of the original report zone BIO. With this value
calculated, the start sector and write pointer position of the target
zones can be correctly remapped.
Fixes: 10999307c14e ("dm: introduce dm_remap_zone_report()") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Shenghui Wang [Sun, 7 Oct 2018 06:45:41 +0000 (14:45 +0800)]
dm cache: destroy migration_cache if cache target registration failed
Commit 7e6358d244e47 ("dm: fix various targets to dm_register_target
after module __init resources created") inadvertently introduced this
bug when it moved dm_register_target() after the call to KMEM_CACHE().
Fixes: 7e6358d244e47 ("dm: fix various targets to dm_register_target after module __init resources created") Cc: stable@vger.kernel.org Signed-off-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
David S. Miller [Tue, 9 Oct 2018 17:49:50 +0000 (10:49 -0700)]
Merge branch 'ena-fixes'
Arthur Kiyanovski says:
====================
minor bug fixes for ENA Ethernet driver
Arthur Kiyanovski (4):
net: ena: fix warning in rmmod caused by double iounmap
net: ena: fix rare bug when failed restart/resume is followed by
driver removal
net: ena: fix NULL dereference due to untimely napi initialization
net: ena: fix auto casting to boolean
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eliminate potential auto casting compilation error.
Fixes: 1738cd3ed342 ("net: ena: Add a driver for Amazon Elastic Network Adapters (ENA)") Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: ena: fix NULL dereference due to untimely napi initialization
napi poll functions should be initialized before running request_irq(),
to handle a rare condition where there is a pending interrupt, causing
the ISR to fire immediately while the poll function wasn't set yet,
causing a NULL dereference.
Fixes: 1738cd3ed342 ("net: ena: Add a driver for Amazon Elastic Network Adapters (ENA)") Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: ena: fix rare bug when failed restart/resume is followed by driver removal
In a rare scenario when ena_device_restore() fails, followed by device
remove, an FLR will not be issued. In this case, the device will keep
sending asynchronous AENQ keep-alive events, even after driver removal,
leading to memory corruption.
Fixes: 8c5c7abdeb2d ("net: ena: add power management ops to the ENA driver") Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: ena: fix warning in rmmod caused by double iounmap
Memory mapped with devm_ioremap is automatically freed when the driver
is disconnected from the device. Therefore there is no need to
explicitly call devm_iounmap.
Fixes: 0857d92f71b6 ("net: ena: add missing unmap bars on device removal") Fixes: 411838e7b41c ("net: ena: fix rare kernel crash when bar memory remap fails") Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
gfs2: Fix iomap buffered write support for journaled files
Commit 64bc06bb32ee broke buffered writes to journaled files (chattr
+j): we'll try to journal the buffer heads of the page being written to
in gfs2_iomap_journaled_page_done. However, the iomap code no longer
creates buffer heads, so we'll BUG() in gfs2_page_add_databufs. Fix
that by creating buffer heads ourself when needed.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Merge tag 'arc-4.19-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc
Vineet writes:
"ARC updates for 4.19-rc8
- Fix clone syscall to update Thread pointer register
- Make/build updates (needed for AGL/OE builds) [Alexey]
- Typo fix [Colin Ian King]"
* tag 'arc-4.19-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
ARC: clone syscall to setp r25 as thread pointer
ARC: build: Don't set CROSS_COMPILE in arch's Makefile
ARC: fix spelling mistake "entires" -> "entries"
ARC: build: Get rid of toolchain check
ARCv2: build: use mcpu=hs38 iso generic mcpu=archs
Reinette Chatre [Thu, 4 Oct 2018 21:05:23 +0000 (14:05 -0700)]
x86/intel_rdt: Fix out-of-bounds memory access in CBM tests
While the DOC at the beginning of lib/bitmap.c explicitly states that
"The number of valid bits in a given bitmap does _not_ need to be an
exact multiple of BITS_PER_LONG.", some of the bitmap operations do
indeed access BITS_PER_LONG portions of the provided bitmap no matter
the size of the provided bitmap. For example, if bitmap_intersects()
is provided with an 8 bit bitmap the operation will access
BITS_PER_LONG bits from the provided bitmap. While the operation
ensures that these extra bits do not affect the result, the memory
is still accessed.
The capacity bitmasks (CBMs) are typically stored in u32 since they
can never exceed 32 bits. A few instances exist where a bitmap_*
operation is performed on a CBM by simply pointing the bitmap operation
to the stored u32 value.
The consequence of this pattern is that some bitmap_* operations will
access out-of-bounds memory when interacting with the provided CBM. This
is confirmed with a KASAN test that reports:
BUG: KASAN: stack-out-of-bounds in __bitmap_intersects+0xa2/0x100
and
BUG: KASAN: stack-out-of-bounds in __bitmap_weight+0x58/0x90
Fix this by moving any CBM provided to a bitmap operation needing
BITS_PER_LONG to an 'unsigned long' variable.
[ tglx: Changed related function arguments to unsigned long and got rid
of the _cbm extra step ]
Fixes: 72d505056604 ("x86/intel_rdt: Add utilities to test pseudo-locked region possibility") Fixes: 49f7b4efa110 ("x86/intel_rdt: Enable setting of exclusive mode") Fixes: d9b48c86eb38 ("x86/intel_rdt: Display resource groups' allocations' size in bytes") Fixes: 95f0b77efa57 ("x86/intel_rdt: Initialize new resource group with sane defaults") Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: fenghua.yu@intel.com Cc: tony.luck@intel.com Cc: gavin.hindman@intel.com Cc: jithu.joseph@intel.com Cc: dave.hansen@intel.com Cc: hpa@zytor.com Link: https://lkml.kernel.org/r/69a428613a53f10e80594679ac726246020ff94f.1538686926.git.reinette.chatre@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
David Howells [Mon, 8 Oct 2018 14:46:25 +0000 (15:46 +0100)]
rxrpc: Fix the packet reception routine
The rxrpc_input_packet() function and its call tree was built around the
assumption that data_ready() handler called from UDP to inform a kernel
service that there is data to be had was non-reentrant. This means that
certain locking could be dispensed with.
This, however, turns out not to be the case with a multi-queue network card
that can deliver packets to multiple cpus simultaneously. Each of those
cpus can be in the rxrpc_input_packet() function at the same time.
Fix by adding or changing some structure members:
(1) Add peer->rtt_input_lock to serialise access to the RTT buffer.
(2) Make conn->service_id into a 32-bit variable so that it can be
cmpxchg'd on all arches.
(3) Add call->input_lock to serialise access to the Rx/Tx state. Note
that although the Rx and Tx states are (almost) entirely separate,
there's no point completing the separation and having separate locks
since it's a bi-phasal RPC protocol rather than a bi-direction
streaming protocol. Data transmission and data reception do not take
place simultaneously on any particular call.
and making the following functional changes:
(1) In rxrpc_input_data(), hold call->input_lock around the core to
prevent simultaneous producing of packets into the Rx ring and
updating of tracking state for a particular call.
(2) In rxrpc_input_ping_response(), only read call->ping_serial once, and
check it before checking RXRPC_CALL_PINGING as that's a cheaper test.
The bit test and bit clear can then be combined. No further locking
is needed here.
(3) In rxrpc_input_ack(), take call->input_lock after we've parsed much of
the ACK packet. The superseded ACK check is then done both before and
after the lock is taken.
The handing of ackinfo data is split, parsing before the lock is taken
and processing with it held. This is keyed on rxMTU being non-zero.
Congestion management is also done within the locked section.
(4) In rxrpc_input_ackall(), take call->input_lock around the Tx window
rotation. The ACKALL packet carries no information and is only really
useful after all packets have been transmitted since it's imprecise.
(5) In rxrpc_input_implicit_end_call(), we use rx->incoming_lock to
prevent calls being simultaneously implicitly ended on two cpus and
also to prevent any races with incoming call setup.
(6) In rxrpc_input_packet(), use cmpxchg() to effect the service upgrade
on a connection. It is only permitted to happen once for a
connection.
(7) In rxrpc_new_incoming_call(), we have to recheck the routing inside
rx->incoming_lock to see if someone else set up the call, connection
or peer whilst we were getting there. We can't trust the values from
the earlier routing check unless we pin refs on them - which we want
to avoid.
Further, we need to allow for an incoming call to have its state
changed on another CPU between us making it live and us adjusting it
because the conn is now in the RXRPC_CONN_SERVICE state.
(8) In rxrpc_peer_add_rtt(), take peer->rtt_input_lock around the access
to the RTT buffer. Don't need to lock around setting peer->rtt.
For reference, the inventory of state-accessing or state-altering functions
used by the packet input procedure is:
* CONNECTION-LEVEL PROCESSING
- Service upgrade
- Can only happen once per conn
! Changed to use cmpxchg
> rxrpc_post_packet_to_conn()
- Setting conn->hi_serial
- Probably safe not using locks
- Maybe use cmpxchg
* CALL-LEVEL PROCESSING
> Old-call checking
> rxrpc_input_implicit_end_call()
> rxrpc_call_completed()
> rxrpc_queue_call()
! Need to take rx->incoming_lock
> __rxrpc_disconnect_call()
> rxrpc_notify_socket()
> rxrpc_new_incoming_call()
- Uses rx->incoming_lock for the entire process
- Might be able to drop this earlier in favour of the call lock
> rxrpc_incoming_call()
! Conflicts with rxrpc_input_implicit_end_call()
> rxrpc_send_ping()
- Don't need locks to check rtt state
> rxrpc_propose_ACK
* PACKET DISTRIBUTION
> rxrpc_input_call_packet()
> rxrpc_input_data()
* QUEUE DATA PACKET ON CALL
> rxrpc_reduce_call_timer()
- Uses timer_reduce()
! Needs call->input_lock()
> rxrpc_receiving_reply()
! Needs locking around ack state
> rxrpc_rotate_tx_window()
> rxrpc_end_tx_phase()
> rxrpc_proto_abort()
> rxrpc_input_dup_data()
- Fills the Rx buffer
- rxrpc_propose_ACK()
- rxrpc_notify_socket()
> rxrpc_input_ack()
* APPLY ACK PACKET TO CALL AND DISCARD PACKET
> rxrpc_input_ping_response()
- Probably doesn't need any extra locking
! Need READ_ONCE() on call->ping_serial
> rxrpc_input_check_for_lost_ack()
- Takes call->lock to consult Tx buffer
> rxrpc_peer_add_rtt()
! Needs to take a lock (peer->rtt_input_lock)
! Could perhaps manage with cmpxchg() and xadd() instead
> rxrpc_input_requested_ack
- Consults Tx buffer
! Probably needs a lock
> rxrpc_peer_add_rtt()
> rxrpc_propose_ack()
> rxrpc_input_ackinfo()
- Changes call->tx_winsize
! Use cmpxchg to handle change
! Should perhaps track serial number
- Uses peer->lock to record MTU specification changes
> rxrpc_proto_abort()
! Need to take call->input_lock
> rxrpc_rotate_tx_window()
> rxrpc_end_tx_phase()
> rxrpc_input_soft_acks()
- Consults the Tx buffer
> rxrpc_congestion_management()
- Modifies the Tx annotations
! Needs call->input_lock()
> rxrpc_queue_call()
> rxrpc_input_abort()
* APPLY ABORT PACKET TO CALL AND DISCARD PACKET
> rxrpc_set_call_completion()
> rxrpc_notify_socket()
> rxrpc_input_ackall()
* APPLY ACKALL PACKET TO CALL AND DISCARD PACKET
! Need to take call->input_lock
> rxrpc_rotate_tx_window()
> rxrpc_end_tx_phase()
> rxrpc_reject_packet()
There are some functions used by the above that queue the packet, after
which the procedure is terminated:
- rxrpc_post_packet_to_local()
- local->event_queue is an sk_buff_head
- local->processor is a work_struct
- rxrpc_post_packet_to_conn()
- conn->rx_queue is an sk_buff_head
- conn->processor is a work_struct
- rxrpc_reject_packet()
- local->reject_queue is an sk_buff_head
- local->processor is a work_struct
And some that offload processing to process context:
- rxrpc_notify_socket()
- Uses RCU lock
- Uses call->notify_lock to call call->notify_rx
- Uses call->recvmsg_lock to queue recvmsg side
- rxrpc_queue_call()
- call->processor is a work_struct
- rxrpc_propose_ACK()
- Uses call->lock to wrap __rxrpc_propose_ACK()
And a bunch that complete a call, all of which use call->state_lock to
protect the call state:
Fixes: 17926a79320a ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both") Signed-off-by: David Howells <dhowells@redhat.com>
David Howells [Mon, 8 Oct 2018 14:46:17 +0000 (15:46 +0100)]
rxrpc: Fix connection-level abort handling
Fix connection-level abort handling to cache the abort and error codes
properly so that a new incoming call can be properly aborted if it races
with the parent connection being aborted by another CPU.
The abort_code and error parameters can then be dropped from
rxrpc_abort_calls().
Fixes: f5c17aaeb2ae ("rxrpc: Calls should only have one terminal state") Signed-off-by: David Howells <dhowells@redhat.com>
David Howells [Mon, 8 Oct 2018 14:46:11 +0000 (15:46 +0100)]
rxrpc: Only take the rwind and mtu values from latest ACK
Move the out-of-order and duplicate ACK packet check to before the call to
rxrpc_input_ackinfo() so that the receive window size and MTU size are only
checked in the latest ACK packet and don't regress.
Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code") Signed-off-by: David Howells <dhowells@redhat.com>
David Howells [Mon, 8 Oct 2018 14:46:05 +0000 (15:46 +0100)]
rxrpc: Carry call state out of locked section in rxrpc_rotate_tx_window()
Carry the call state out of the locked section in rxrpc_rotate_tx_window()
rather than sampling it afterwards. This is only used to select tracepoint
data, but could have changed by the time we do the tracepoint.
Signed-off-by: David Howells <dhowells@redhat.com>
David Howells [Mon, 8 Oct 2018 14:46:01 +0000 (15:46 +0100)]
rxrpc: Don't check RXRPC_CALL_TX_LAST after calling rxrpc_rotate_tx_window()
We should only call the function to end a call's Tx phase if we rotated the
marked-last packet out of the transmission buffer.
Make rxrpc_rotate_tx_window() return an indication of whether it just
rotated the packet marked as the last out of the transmit buffer, carrying
the information out of the locked section in that function.
We can then check the return value instead of examining RXRPC_CALL_TX_LAST.
Fixes: 70790dbe3f66 ("rxrpc: Pass the last Tx packet marker in the annotation buffer") Signed-off-by: David Howells <dhowells@redhat.com>
David Howells [Mon, 8 Oct 2018 14:45:56 +0000 (15:45 +0100)]
rxrpc: Don't need to take the RCU read lock in the packet receiver
We don't need to take the RCU read lock in the rxrpc packet receive
function because it's held further up the stack in the IP input routine
around the UDP receive routines.
Fix this by dropping the RCU read lock calls from rxrpc_input_packet().
This simplifies the code.
Fixes: 70790dbe3f66 ("rxrpc: Pass the last Tx packet marker in the annotation buffer") Signed-off-by: David Howells <dhowells@redhat.com>
David Howells [Thu, 4 Oct 2018 10:10:51 +0000 (11:10 +0100)]
rxrpc: Use the UDP encap_rcv hook
Use the UDP encap_rcv hook to cut the bit out of the rxrpc packet reception
in which a packet is placed onto the UDP receive queue and then immediately
removed again by rxrpc. Going via the queue in this manner seems like it
should be unnecessary.
This does, however, require the invention of a value to place in encap_type
as that's one of the conditions to switch packets out to the encap_rcv
hook. Possibly the value doesn't actually matter for anything other than
sockopts on the UDP socket, which aren't accessible outside of rxrpc
anyway.
This seems to cut a bit of time out of the time elapsed between each
sk_buff being timestamped and turning up in rxrpc (the final number in the
following trace excerpts). I measured this by making the rxrpc_rx_packet
trace point print the time elapsed between the skb being timestamped and
the current time (in ns), e.g.:
... 424.278721: rxrpc_rx_packet: ... ACK 25026
So doing a 512MiB DIO read from my test server, with an unmodified kernel:
N min max sum mean stddev
27605 2626 7581 7.83992e+07 2840.04 181.029
and with the patch applied:
N min max sum mean stddev
27547 1895 12165 6.77461e+07 2459.29 255.02
Signed-off-by: David Howells <dhowells@redhat.com>
In the quest to remove all stack VLA usage from the kernel[1], this
allocates a fixed size array for the maximum number of cookies and
adds a runtime sanity check.