Linus Torvalds [Fri, 5 May 2017 20:36:10 +0000 (13:36 -0700)]
Merge tag 'for-linus-4.12-ofs-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux
Pull orangefs updates from Mike Marshall:
"Orangefs cleanups, fixes and statx support.
Some cleanups:
- remove unused get_fsid_from_ino
- fix bounds check for listxattr
- clean up oversize xattr validation
- do not set getattr_time on orangefs_lookup
- return from orangefs_devreq_read quickly if possible
- do not wait for timeout if umounting
- handle zero size write in debugfs
Bug fixes:
- do not check possibly stale size on truncate
- ensure the userspace component is unmounted if mount fails
- total reimplementation of dir.c
New feature:
- implement statx
The new implementation of dir.c is kind of a big deal, all new code.
It has been posted to fs-devel during the previous rc period, we
didn't get much review or feedback from there, but it has been
reviewed very heavily here, so much so that we have two entire
versions of the reimplementation.
Not only does the new implementation fix some xfstests, but it passes
all the new tests we made here that involve seeking and rewinding and
giant directories and long file names. The new dir code has three
patches itself:
- skip forward to the next directory entry if seek is short
- invalidate stored directory on seek
- count directory pieces correctly"
* tag 'for-linus-4.12-ofs-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
orangefs: count directory pieces correctly
orangefs: invalidate stored directory on seek
orangefs: skip forward to the next directory entry if seek is short
orangefs: handle zero size write in debugfs
orangefs: do not wait for timeout if umounting
orangefs: return from orangefs_devreq_read quickly if possible
orangefs: ensure the userspace component is unmounted if mount fails
orangefs: do not check possibly stale size on truncate
orangefs: implement statx
orangefs: remove ORANGEFS_READDIR macros
orangefs: support very large directories
orangefs: support llseek on directories
orangefs: rewrite readdir to fix several bugs
orangefs: do not set getattr_time on orangefs_lookup
orangefs: clean up oversize xattr validation
orangefs: fix bounds check for listxattr
orangefs: remove unused get_fsid_from_ino
Linus Torvalds [Fri, 5 May 2017 20:30:11 +0000 (13:30 -0700)]
Merge tag 'initramfs-fix-4.12-rc1' of git://github.com/stffrdhrn/linux
Pull initramfs fix from Stafford Horne:
"This is a fix for an issue that has caused 4.11 to not boot on
OpenRISC. I should have caught this during the 4.11 cycle but I had
been busy on testing some other series of patches.
I would have considered pushing it though a different path but Al Viro
suggested submitting directly to you.
Also, its just one as I havent really got anything else ready on my
queue for 4.12.
Summary:
- Ensure fput() flush is done even for builtin initramfs"
* tag 'initramfs-fix-4.12-rc1' of git://github.com/stffrdhrn/linux:
initramfs: Always do fput() and load modules after rootfs populate
Linus Torvalds [Fri, 5 May 2017 19:11:37 +0000 (12:11 -0700)]
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
- kdump support, including two necessary memblock additions:
memblock_clear_nomap() and memblock_cap_memory_range()
- ARMv8.3 HWCAP bits for JavaScript conversion instructions, complex
numbers and weaker release consistency
- arm64 ACPI platform MSI support
- arm perf updates: ACPI PMU support, L3 cache PMU in some Qualcomm
SoCs, Cortex-A53 L2 cache events and DTLB refills, MAINTAINERS update
for DT perf bindings
- architected timer errata framework (the arch/arm64 changes only)
- support for DMA_ATTR_FORCE_CONTIGUOUS in the arm64 iommu DMA API
- arm64 KVM refactoring to use common system register definitions
- remove support for ASID-tagged VIVT I-cache (no ARMv8 implementation
using it and deprecated in the architecture) together with some
I-cache handling clean-up
- PE/COFF EFI header clean-up/hardening
- define BUG() instruction without CONFIG_BUG
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (92 commits)
arm64: Fix the DMA mmap and get_sgtable API with DMA_ATTR_FORCE_CONTIGUOUS
arm64: Print DT machine model in setup_machine_fdt()
arm64: pmu: Wire-up Cortex A53 L2 cache events and DTLB refills
arm64: module: split core and init PLT sections
arm64: pmuv3: handle pmuv3+
arm64: Add CNTFRQ_EL0 trap handler
arm64: Silence spurious kbuild warning on menuconfig
arm64: pmuv3: use arm_pmu ACPI framework
arm64: pmuv3: handle !PMUv3 when probing
drivers/perf: arm_pmu: add ACPI framework
arm64: add function to get a cpu's MADT GICC table
drivers/perf: arm_pmu: split out platform device probe logic
drivers/perf: arm_pmu: move irq request/free into probe
drivers/perf: arm_pmu: split cpu-local irq request/free
drivers/perf: arm_pmu: rename irq request/free functions
drivers/perf: arm_pmu: handle no platform_device
drivers/perf: arm_pmu: simplify cpu_pmu_request_irqs()
drivers/perf: arm_pmu: factor out pmu registration
drivers/perf: arm_pmu: fold init into alloc
drivers/perf: arm_pmu: define armpmu_init_fn
...
Linus Torvalds [Fri, 5 May 2017 18:36:44 +0000 (11:36 -0700)]
Merge tag 'powerpc-4.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Highlights include:
- Larger virtual address space on 64-bit server CPUs. By default we
use a 128TB virtual address space, but a process can request access
to the full 512TB by passing a hint to mmap().
- Support for the new Power9 "XIVE" interrupt controller.
- TLB flushing optimisations for the radix MMU on Power9.
- Support for CAPI cards on Power9, using the "Coherent Accelerator
Interface Architecture 2.0".
- The ability to configure the mmap randomisation limits at build and
runtime.
- Several small fixes and cleanups to the kprobes code, as well as
support for KPROBES_ON_FTRACE.
- Major improvements to handling of system reset interrupts,
correctly treating them as NMIs, giving them a dedicated stack and
using a new hypervisor call to trigger them, all of which should
aid debugging and robustness.
- Many fixes and other minor enhancements.
Thanks to: Alastair D'Silva, Alexey Kardashevskiy, Alistair Popple,
Andrew Donnellan, Aneesh Kumar K.V, Anshuman Khandual, Anton
Blanchard, Balbir Singh, Ben Hutchings, Benjamin Herrenschmidt,
Bhupesh Sharma, Chris Packham, Christian Zigotzky, Christophe Leroy,
Christophe Lombard, Daniel Axtens, David Gibson, Gautham R. Shenoy,
Gavin Shan, Geert Uytterhoeven, Guilherme G. Piccoli, Hamish Martin,
Hari Bathini, Kees Cook, Laurent Dufour, Madhavan Srinivasan, Mahesh J
Salgaonkar, Mahesh Salgaonkar, Masami Hiramatsu, Matt Brown, Matthew
R. Ochs, Michael Neuling, Naveen N. Rao, Nicholas Piggin, Oliver
O'Halloran, Pan Xinhui, Paul Mackerras, Rashmica Gupta, Russell
Currey, Sukadev Bhattiprolu, Thadeu Lima de Souza Cascardo, Tobin C.
Harding, Tyrel Datwyler, Uma Krishnan, Vaibhav Jain, Vipin K Parashar,
Yang Shi"
* tag 'powerpc-4.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (214 commits)
powerpc/64s: Power9 has no LPCR[VRMASD] field so don't set it
powerpc/powernv: Fix TCE kill on NVLink2
powerpc/mm/radix: Drop support for CPUs without lockless tlbie
powerpc/book3s/mce: Move add_taint() later in virtual mode
powerpc/sysfs: Move #ifdef CONFIG_HOTPLUG_CPU out of the function body
powerpc/smp: Document irq enable/disable after migrating IRQs
powerpc/mpc52xx: Don't select user-visible RTAS_PROC
powerpc/powernv: Document cxl dependency on special case in pnv_eeh_reset()
powerpc/eeh: Clean up and document event handling functions
powerpc/eeh: Avoid use after free in eeh_handle_special_event()
cxl: Mask slice error interrupts after first occurrence
cxl: Route eeh events to all drivers in cxl_pci_error_detected()
cxl: Force context lock during EEH flow
powerpc/64: Allow CONFIG_RELOCATABLE if COMPILE_TEST
powerpc/xmon: Teach xmon oops about radix vectors
powerpc/mm/hash: Fix off-by-one in comment about kernel contexts ids
powerpc/pseries: Enable VFIO
powerpc/powernv: Fix iommu table size calculation hook for small tables
powerpc/powernv: Check kzalloc() return value in pnv_pci_table_alloc
powerpc: Add arch/powerpc/tools directory
...
Linus Torvalds [Fri, 5 May 2017 18:08:43 +0000 (11:08 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace updates from Eric Biederman:
"This is a set of small fixes that were mostly stumbled over during
more significant development. This proc fix and the fix to
posix-timers are the most significant of the lot.
There is a lot of good development going on but unfortunately it
didn't quite make the merge window"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
proc: Fix unbalanced hard link numbers
signal: Make kill_proc_info static
rlimit: Properly call security_task_setrlimit
signal: Remove unused definition of sig_user_definied
ia64: Remove unused IA64_TASK_SIGHAND_OFFSET and IA64_SIGHAND_SIGLOCK_OFFSET
ipc: Remove unused declaration of recompute_msgmni
posix-timers: Correct sanity check in posix_cpu_nsleep
sysctl: Remove dead register_sysctl_root
arm64: Fix the DMA mmap and get_sgtable API with DMA_ATTR_FORCE_CONTIGUOUS
While honouring the DMA_ATTR_FORCE_CONTIGUOUS on arm64 (commit 44176bb38fa4: "arm64: Add support for DMA_ATTR_FORCE_CONTIGUOUS to
IOMMU"), the existing uses of dma_mmap_attrs() and dma_get_sgtable()
have been broken by passing a physically contiguous vm_struct with an
invalid pages pointer through the common iommu API.
Since the coherent allocation with DMA_ATTR_FORCE_CONTIGUOUS uses CMA,
this patch simply reuses the existing swiotlb logic for mmap and
get_sgtable.
Note that the current implementation of get_sgtable (both swiotlb and
iommu) is broken if dma_declare_coherent_memory() is used since such
memory does not have a corresponding struct page. To be addressed in a
subsequent patch.
Fixes: 44176bb38fa4 ("arm64: Add support for DMA_ATTR_FORCE_CONTIGUOUS to IOMMU") Reported-by: Andrzej Hajda <a.hajda@samsung.com> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Acked-by: Robin Murphy <robin.murphy@arm.com> Tested-by: Andrzej Hajda <a.hajda@samsung.com> Reviewed-by: Andrzej Hajda <a.hajda@samsung.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Stafford Horne [Thu, 4 May 2017 12:15:56 +0000 (21:15 +0900)]
initramfs: Always do fput() and load modules after rootfs populate
In OpenRISC we do not have a bootloader passed initrd, but the built in
initramfs does contain the /init and other binaries, including modules.
The previous commit 08865514805d2 ("initramfs: finish fput() before
accessing any binary from initramfs") made a change to only call fput()
if the bootloader initrd was available, this caused intermittent crashes
for OpenRISC.
This patch changes the fput() to happen unconditionally if any rootfs is
loaded. Also, I added some comments to make it a bit more clear why we
call unpack_to_rootfs() multiple times.
Fixes: 08865514805d2 ("initramfs: finish fput() before accessing any binary from initramfs") Cc: stable@vger.kernel.org Cc: Lokesh Vutla <lokeshvutla@ti.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Stafford Horne <shorne@gmail.com>
Linus Torvalds [Fri, 5 May 2017 02:07:10 +0000 (19:07 -0700)]
Merge tag 'char-misc-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver updates from Greg KH:
"Here is the big set of new char/misc driver drivers and features for
4.12-rc1.
There's lots of new drivers added this time around, new firmware
drivers from Google, more auxdisplay drivers, extcon drivers, fpga
drivers, and a bunch of other driver updates. Nothing major, except if
you happen to have the hardware for these drivers, and then you will
be happy :)
All of these have been in linux-next for a while with no reported
issues"
* tag 'char-misc-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (136 commits)
firmware: google memconsole: Fix return value check in platform_memconsole_init()
firmware: Google VPD: Fix return value check in vpd_platform_init()
goldfish_pipe: fix build warning about using too much stack.
goldfish_pipe: An implementation of more parallel pipe
fpga fr br: update supported version numbers
fpga: region: release FPGA region reference in error path
fpga altera-hps2fpga: disable/unprepare clock on error in alt_fpga_bridge_probe()
mei: drop the TODO from samples
firmware: Google VPD sysfs driver
firmware: Google VPD: import lib_vpd source files
misc: lkdtm: Add volatile to intentional NULL pointer reference
eeprom: idt_89hpesx: Add OF device ID table
misc: ds1682: Add OF device ID table
misc: tsl2550: Add OF device ID table
w1: Remove unneeded use of assert() and remove w1_log.h
w1: Use kernel common min() implementation
uio_mf624: Align memory regions to page size and set correct offsets
uio_mf624: Refactor memory info initialization
uio: Allow handling of non page-aligned memory regions
hangcheck-timer: Fix typo in comment
...
Linus Torvalds [Fri, 5 May 2017 01:27:46 +0000 (18:27 -0700)]
Merge tag 'driver-core-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Very tiny pull request for 4.12-rc1 for the driver core this time
around.
There are some documentation fixes, an eventpoll.h fixup to make it
easier for the libc developers to take our header files directly, and
some very minor driver core fixes and changes.
All have been in linux-next for a very long time with no reported
issues"
* tag 'driver-core-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
Revert "kref: double kref_put() in my_data_handler()"
driver core: don't initialize 'parent' in device_add()
drivers: base: dma-mapping: use nth_page helper
Documentation/ABI: add information about cpu_capacity
debugfs: set no_llseek in DEFINE_DEBUGFS_ATTRIBUTE
eventpoll.h: add missing epoll event masks
eventpoll.h: fix epoll event masks
Linus Torvalds [Fri, 5 May 2017 01:03:51 +0000 (18:03 -0700)]
Merge tag 'usb-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB updates from Greg KH:
"Here is the big USB patchset for 4.12-rc1.
Lots of good stuff here, after many many many attempts, the kernel
finally has a working typeC interface, many thanks to Heikki and
Guenter and others who have taken the time to get this merged. It
wasn't an easy path for them at all.
There's also a staging driver that uses this new api, which is why
it's coming in through this tree.
Along with that, there's the usual huge number of changes for gadget
drivers, xhci, and other stuff. Johan also finally refactored pretty
much every driver that was looking at USB endpoints to do it in a
common way, which will help prevent any "badly-formed" devices from
causing problems in drivers. That too wasn't a simple task.
All of these have been in linux-next for a while with no reported
issues"
* tag 'usb-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (263 commits)
staging: typec: Fairchild FUSB302 Type-c chip driver
staging: typec: Type-C Port Controller Interface driver (tcpci)
staging: typec: USB Type-C Port Manager (tcpm)
usb: host: xhci: remove #ifdef around PM functions
usb: musb: don't mark of_dev_auxdata as initdata
usb: misc: legousbtower: Fix buffers on stack
USB: Revert "cdc-wdm: fix "out-of-sync" due to missing notifications"
usb: Make sure usb/phy/of gets built-in
USB: storage: e-mail update in drivers/usb/storage/unusual_devs.h
usb: host: xhci: print correct command ring address
usb: host: xhci: delete sp_dma_buffers for scratchpad
usb: host: xhci: using correct specification chapter reference for DCBAAP
xhci: switch to pci_alloc_irq_vectors
usb: host: xhci-plat: set resume_quirk() for R-Car controllers
usb: host: xhci-plat: add resume_quirk()
usb: host: xhci-plat: enable clk in resume timing
usb: host: plat: Enable xHCI plat runtime PM
USB: serial: ftdi_sio: add device ID for Microsemi/Arrow SF2PLUS Dev Kit
USB: serial: constify static arrays
usb: fix some references for /proc/bus/usb
...
2) When a RAW socket is in hdrincl mode, we need to make sure that the
user provided at least a minimally sized ipv4/ipv6 header. Fix from
Alexander Potapenko.
3) We must emit IFLA_PHYS_PORT_NAME netlink attributes using
nla_put_string() so that it is NULL terminated.
4) Fix a bug in TCP fastopen handling, wherein child sockets
erroneously inherit the fastopen_req from the parent, and later can
end up derefencing freed memory or doing a double free. From Eric
Dumazet.
5) Don't clear out netdev stats at close time in tg3 driver, from
YueHaibing.
6) Fix refcount leak in xt_CT, from Gao Feng.
7) In nft_set_bitmap() don't leak dummy elements, from Liping Zhang.
8) Fix deadlock due to taking the expectation lock twice, also from
Liping Zhang.
9) Make xt_socket work again with ipv6, from Peter Tirsek.
10) Don't allow IPV6 to be used with IPVS if ipv6.disable=1, from Paolo
Abeni.
11) Make the BPF loader more flexible wrt. changes to the bpf MAP entry
layout. From Jesper Dangaard Brouer.
12) Fix ethtool reported device name in aquantia driver, from Pavel
Belous.
13) Fix build failures due to the compile time size test not working in
netfilter conntrack. From Geert Uytterhoeven.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (52 commits)
cfg80211: make RATE_INFO_BW_20 the default
ipv6: initialize route null entry in addrconf_init()
qede: Fix possible misconfiguration of advertised autoneg value.
qed: Fix overriding of supported autoneg value.
qed*: Fix possible overflow for status block id field.
rtnetlink: NUL-terminate IFLA_PHYS_PORT_NAME string
netvsc: make sure napi enabled before vmbus_open
aquantia: Fix driver name reported by ethtool
ipv4, ipv6: ensure raw socket message is big enough to hold an IP header
net/sched: remove redundant null check on head
tcp: do not inherit fastopen_req from parent
forcedeth: remove unnecessary carrier status check
ibmvnic: Move queue restarting in ibmvnic_tx_complete
ibmvnic: Record SKB RX queue during poll
ibmvnic: Continue skb processing after skb completion error
ibmvnic: Check for driver reset first in ibmvnic_xmit
ibmvnic: Wait for any pending scrqs entries at driver close
ibmvnic: Clean up tx pools when closing
ibmvnic: Whitespace correction in release_rx_pools
ibmvnic: Delete napi's when releasing driver resources
...
Linus Torvalds [Thu, 4 May 2017 19:19:44 +0000 (12:19 -0700)]
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This update includes the usual round of major driver updates
(hisi_sas, ufs, fnic, cxlflash, be2iscsi, ipr, stex). There's also the
usual amount of cosmetic and spelling stuff"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (155 commits)
scsi: qla4xxx: fix spelling mistake: "Tempalate" -> "Template"
scsi: stex: make S6flag static
scsi: mac_esp: fix to pass correct device identity to free_irq()
scsi: aacraid: pci_alloc_consistent() failures on ARM64
scsi: ufs: make ufshcd_get_lists_status() register operation obvious
scsi: ufs: use MASK_EE_STATUS
scsi: mac_esp: Replace bogus memory barrier with spinlock
scsi: fcoe: make fcoe_e_d_tov and fcoe_r_a_tov static
scsi: sd_zbc: Do not write lock zones for reset
scsi: sd_zbc: Remove superfluous assignments
scsi: sd: sd_zbc: Rename sd_zbc_setup_write_cmnd
scsi: Improve scsi_get_sense_info_fld
scsi: sd: Cleanup sd_done sense data handling
scsi: sd: Improve sd_completed_bytes
scsi: sd: Fix function descriptions
scsi: mpt3sas: remove redundant wmb
scsi: mpt: Move scsi_remove_host() out of mptscsih_remove_host()
scsi: sg: reset 'res_in_use' after unlinking reserved array
scsi: mvumi: remove code handling zero scsi_sg_count(scmd) case
scsi: fusion: fix spelling mistake: "Persistancy" -> "Persistency"
...
Linus Torvalds [Thu, 4 May 2017 19:05:32 +0000 (12:05 -0700)]
Merge tag 'gpio-v4.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio
Pull GPIO updates from Linus Walleij:
"This is the bulk of GPIO changes for the v4.12 kernel cycle.
Core changes:
- Return NULL from gpiod_get_optional() when GPIOLIB is disabled.
This was a much discussed change. It affects use cases where people
write drivers that might or might not be using GPIO resources. I
have decided that this is the lesser evil right now.
- Make gpiod_count() behave consistently across different hardware
descriptions.
- Fix the syntax around open drain/open source to not infer active
high/low semantics.
New drivers:
- A new single-register fixed-direction framework driver for hardware
that have lines controlled by a single register that just work in
one direction (out or in), including IRQ support.
- Support the Fintek F71889A GPIO SuperIO controller.
- Support the National NI 169445 MMIO GPIO.
- Support for the X-Gene derivative of the DWC GPIO controller
- Support for the Rohm BD9571MWV-M PMIC GPIO controller.
- Refactor the Gemini GPIO driver to a generic Faraday FTGPIO driver
and replace both the Gemini and the Moxa ART custom drivers with
this driver.
Driver improvements:
- A whole slew of drivers have their spinlocks chaned to raw
spinlocks as they provide irqchips, and thus we are progressing on
realtime compliance.
- Use devm_irq_alloc_descs() in a slew of drivers, getting managed
resources.
- Support for the embedded PWM controller inside the MVEBU driver.
- Debounce, open source and open drain support for the Aspeed driver.
- Misc smaller fixes like spelling and syntax and whatnot"
* tag 'gpio-v4.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: (77 commits)
gpio: f7188x: Add a missing break
gpio: omap: return error if requested debounce time is not possible
gpio: Add ROHM BD9571MWV-M PMIC GPIO driver
gpio: gpio-wcove: fix GPIO IRQ status mask
gpio: DT bindings, move tca9554 from pcf857x to pca953x
gpio: move tca9554 from pcf857x to pca953x
gpio: arizona: Correct check whether the pin is an input
gpio: Add XRA1403 DTS binding documentation
dt-bindings: add exar to vendor prefixes list
gpio: gpio-wcove: fix irq pending status bit width
gpio: dwapb: use dwapb_read instead of readl_relaxed
gpio: aspeed: Add open-source and open-drain support
gpio: aspeed: Add debounce support
gpio: aspeed: dt: Add optional clocks property
gpio: aspeed: dt: Fix description alignment in bindings document
gpio: mvebu: Add limited PWM support
gpio: Use unsigned int for interrupt numbers
gpio: f7188x: Add F71889A GPIO support.
gpio: core: Decouple open drain/source flag with active low/high
gpio: arizona: Correct handling for reading input GPIOs
...
Linus Torvalds [Thu, 4 May 2017 18:56:59 +0000 (11:56 -0700)]
Merge tag 'platform-drivers-x86-v4.12-1' of git://git.infradead.org/linux-platform-drivers-x86
Pull x86 platform-drivers update from Darren Hart:
"This represents a significantly larger and more complex set of changes
than those of prior merge windows.
In particular, we had several changes with dependencies on other
subsystems which we felt were best managed through merges of immutable
branches, including one each from input, i2c, and leds. Two patches
for the watchdog subsystem are included after discussion with Wim and
Guenter following a collision in linux-next (this should be resolved
and you should only see these two appear in this pull request). These
are called out in the "External" section below.
Summary of changes:
- significant further cleanup of fujitsu-laptop and hp-wmi
- new model support for ideapad, asus, silead, and xiaomi
- new hotkeys for thinkpad and models using intel-vbtn
- dell keyboard backlight improvements
- build and dependency improvements
- intel * ipc fixes, cleanups, and api updates
- single isolated fixes noted below
platform/x86:
- Add Intel Cherry Trail ACPI INT33FE device driver
- remove sparse_keymap_free() calls
- Make SILEAD_DMI depend on TOUCHSCREEN_SILEAD
asus-wmi:
- try to set als by default
- fix cpufv sysfs file permission
acer-wmi:
- setup accelerometer when ACPI device was found
ideapad-laptop:
- Add IdeaPad V310-15ISK to no_hw_rfkill
- Add IdeaPad 310-15IKB to no_hw_rfkill
intel_pmc_ipc:
- use gcr mem base for S0ix counter read
- Fix iTCO_wdt GCS memory mapping failure
- Add pmc gcr read/write/update api's
- fix gcr offset
dell-laptop:
- Add keyboard backlight timeout AC settings
- Handle return error form dell_get_intensity.
- Protect kbd_state against races
- Refactor kbd_led_triggers_store()
hp-wireless:
- reuse module_acpi_driver
- add Xiaomi's hardware id to the supported list
intel-vbtn:
- add volume up and down
INT33FE:
- add i2c dependency
hp-wmi:
- Cleanup exit paths
- Do not shadow errors in sysfs show functions
- Use DEVICE_ATTR_(RO|RW) helper macros
- Refactor dock and tablet state fetchers
- Cleanup wireless get_(hw|sw)state functions
- Refactor redundant HPWMI_READ functions
- Standardize enum usage for constants
- Cleanup local variable declarations
- Do not shadow error values
- Fix detection for dock and tablet mode
- Fix error value for hp_wmi_tablet_state
fujitsu-laptop:
- simplify error handling in acpi_fujitsu_laptop_add()
- do not log LED registration failures
- switch to managed LED class devices
- reorganize LED-related code
- refactor LED registration
- select LEDS_CLASS
- remove redundant fields from struct fujitsu_bl
- account for backlight power when determining brightness
- do not log set_lcd_level() failures in bl_update_status()
- ignore errors when setting backlight power
- make disable_brightness_adjust a boolean
- clean up use_alt_lcd_levels handling
- sync brightness in set_lcd_level()
- simplify set_lcd_level()
- merge set_lcd_level_alt() into set_lcd_level()
- switch to a managed backlight device
- only handle backlight when appropriate
- update debug message logged by call_fext_func()
- rename call_fext_func() arguments
- simplify call_fext_func()
- clean up local variables in call_fext_func()
- remove keycode fields from struct fujitsu_bl
- model-dependent sparse keymap overrides
- use a sparse keymap for hotkey event generation
- switch to a managed hotkey input device
- refactor hotkey input device setup
- use a sparse keymap for brightness key events
- switch to a managed backlight input device
- refactor backlight input device setup
- remove pf_device field from struct fujitsu_bl
- only register platform device if FUJ02E3 is present
- add and remove platform device in separate functions
- simplify platform device attribute definitions
- remove backlight-related attributes from the platform device
- cleanup error labels in fujitsu_init()
- only register backlight device if FUJ02B1 is present
- sync backlight power status in acpi_fujitsu_laptop_add()
- register backlight device in a separate function
- simplify brightness key event generation logic
- decrease indentation in acpi_fujitsu_bl_notify()
intel-hid:
- Add missing ->thaw callback
- do not set parents of input devices explicitly
- remove redundant set_bit() call
- use devm_input_allocate_device() for HID events input device
- make intel_hid_set_enable() take a boolean argument
- simplify enabling/disabling HID events
silead_dmi:
- Add touchscreen info for Surftab Wintron 7.0
- Abort early if DMI does not match
- Do not treat all devices as i2c_clients
- Add entry for Insyde 7W tablets
- Constify properties arrays
thinkpad_acpi:
- add mapping for new hotkeys
- guard generic hotkey case"
* tag 'platform-drivers-x86-v4.12-1' of git://git.infradead.org/linux-platform-drivers-x86: (108 commits)
platform/x86: Make SILEAD_DMI depend on TOUCHSCREEN_SILEAD
platform/x86: asus-wmi: try to set als by default
platform/x86: asus-wmi: fix cpufv sysfs file permission
platform/x86: acer-wmi: setup accelerometer when ACPI device was found
platform/x86: ideapad-laptop: Add IdeaPad V310-15ISK to no_hw_rfkill
platform/x86: intel_pmc_ipc: use gcr mem base for S0ix counter read
platform/x86: intel_pmc_ipc: Fix iTCO_wdt GCS memory mapping failure
watchdog: iTCO_wdt: Add PMC specific noreboot update api
watchdog: iTCO_wdt: cleanup set/unset no_reboot_bit functions
platform/x86: intel_pmc_ipc: Add pmc gcr read/write/update api's
platform/x86: intel_pmc_ipc: fix gcr offset
platform/x86: dell-laptop: Add keyboard backlight timeout AC settings
platform/x86: dell-laptop: Handle return error form dell_get_intensity.
platform/x86: hp-wireless: reuse module_acpi_driver
platform/x86: intel-vbtn: add volume up and down
platform/x86: INT33FE: add i2c dependency
platform/x86: hp-wmi: Cleanup exit paths
platform/x86: hp-wmi: Do not shadow errors in sysfs show functions
platform/x86: hp-wmi: Use DEVICE_ATTR_(RO|RW) helper macros
platform/x86: hp-wmi: Refactor dock and tablet state fetchers
...
A large directory full of differently sized file names triggered this.
Most directories, even very large directories with shorter names, would
be lucky enough to fit in one server response.
Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
If an application seeks to a position before the point which has been
read, it must want updates which have been made to the directory. So
delete the copy stored in the kernel so it will be fetched again.
Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
orangefs: skip forward to the next directory entry if seek is short
If userspace seeks to a position in the stream which is not correct, it
would have returned EIO because the data in the buffer at that offset
would be incorrect. This and the userspace daemon returning a corrupt
directory are indistinguishable.
Now if the data does not look right, skip forward to the next chunk and
try again. The motivation is that if the directory changes, an
application may seek to a position that was valid and no longer is valid.
It is not yet possible for a directory to change.
Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Linus Torvalds [Thu, 4 May 2017 18:37:09 +0000 (11:37 -0700)]
Merge tag 'for-linus-4.12b-rc0b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from Juergen Gross:
"Xen fixes and featrues for 4.12. The main changes are:
- enable building the kernel with Xen support but without enabling
paravirtualized mode (Vitaly Kuznetsov)
- add a new 9pfs xen frontend driver (Stefano Stabellini)
- simplify Xen's cpuid handling by making use of cpu capabilities
(Juergen Gross)
- add/modify some headers for new Xen paravirtualized devices
(Oleksandr Andrushchenko)
- EFI reset_system support under Xen (Julien Grall)
- and the usual cleanups and corrections"
* tag 'for-linus-4.12b-rc0b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (57 commits)
xen: Move xen_have_vector_callback definition to enlighten.c
xen: Implement EFI reset_system callback
arm/xen: Consolidate calls to shutdown hypercall in a single helper
xen: Export xen_reboot
xen/x86: Call xen_smp_intr_init_pv() on BSP
xen: Revert commits da72ff5bfcb0 and 72a9b186292d
xen/pvh: Do not fill kernel's e820 map in init_pvh_bootparams()
xen/scsifront: use offset_in_page() macro
xen/arm,arm64: rename __generic_dma_ops to xen_get_dma_ops
xen/arm,arm64: fix xen_dma_ops after 815dd18 "Consolidate get_dma_ops..."
xen/9pfs: select CONFIG_XEN_XENBUS_FRONTEND
x86/cpu: remove hypervisor specific set_cpu_features
vmware: set cpu capabilities during platform initialization
x86/xen: use capabilities instead of fake cpuid values for xsave
x86/xen: use capabilities instead of fake cpuid values for x2apic
x86/xen: use capabilities instead of fake cpuid values for mwait
x86/xen: use capabilities instead of fake cpuid values for acpi
x86/xen: use capabilities instead of fake cpuid values for acc
x86/xen: use capabilities instead of fake cpuid values for mtrr
x86/xen: use capabilities instead of fake cpuid values for aperf
...
Johannes Berg [Thu, 4 May 2017 06:42:30 +0000 (08:42 +0200)]
cfg80211: make RATE_INFO_BW_20 the default
Due to the way I did the RX bitrate conversions in mac80211 with
spatch, going setting flags to setting the value, many drivers now
don't set the bandwidth value for 20 MHz, since with the flags it
wasn't necessary to (there was no 20 MHz flag, only the others.)
Rather than go through and try to fix up all the drivers, instead
renumber the enum so that 20 MHz, which is the typical bandwidth,
actually has the value 0, making those drivers all work again.
If VHT was hit used with a driver not reporting it, e.g. iwlmvm,
this manifested in hitting the bandwidth warning in
cfg80211_calculate_bitrate_vht().
Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Tested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
WANG Cong [Thu, 4 May 2017 05:07:31 +0000 (22:07 -0700)]
ipv6: initialize route null entry in addrconf_init()
Andrey reported a crash on init_net.ipv6.ip6_null_entry->rt6i_idev
since it is always NULL.
This is clearly wrong, we have code to initialize it to loopback_dev,
unfortunately the order is still not correct.
loopback_dev is registered very early during boot, we lose a chance
to re-initialize it in notifier. addrconf_init() is called after
ip6_route_init(), which means we have no chance to correct it.
Fix it by moving this initialization explicitly after
ipv6_add_dev(init_net.loopback_dev) in addrconf_init().
Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
qede: Fix possible misconfiguration of advertised autoneg value.
Fail the configuration of advertised speed-autoneg value if the config
update is not supported.
Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Driver currently uses advertised-autoneg value to populate the
supported-autoneg field. When advertised field is updated, user gets
the same value for supported field. Supported-autoneg value need to be
populated from the link capabilities value returned by the MFW.
Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
qed*: Fix possible overflow for status block id field.
Value for status block id could be more than 256 in 100G mode, need to
update its data type from u8 to u16.
Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
IFLA_PHYS_PORT_NAME is a string attribute, so terminate it with \0.
Otherwise libnl3 fails to validate netlink messages with this attribute.
"ip -detail a" assumes too that the attribute is NUL-terminated when
printing it. It often was, due to padding.
I noticed this as libvirtd failing to start on a system with sfc driver
after upgrading it to Linux 4.11, i.e. when sfc added support for
phys_port_name.
Signed-off-by: Michal Schmidt <mschmidt@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
ipv4, ipv6: ensure raw socket message is big enough to hold an IP header
raw_send_hdrinc() and rawv6_send_hdrinc() expect that the buffer copied
from the userspace contains the IPv4/IPv6 header, so if too few bytes are
copied, parts of the header may remain uninitialized.
Eric Dumazet [Wed, 3 May 2017 13:39:31 +0000 (06:39 -0700)]
tcp: do not inherit fastopen_req from parent
Under fuzzer stress, it is possible that a child gets a non NULL
fastopen_req pointer from its parent at accept() time, when/if parent
morphs from listener to active session.
We need to make sure this can not happen, by clearing the field after
socket cloning.
Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets") Fixes: 7db92362d2fe ("tcp: fix potential double free issue for fastopen_req") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Acked-by: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Thu, 4 May 2017 02:12:27 +0000 (19:12 -0700)]
Merge tag 'modules-for-v4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux
Pull modules updates from Jessica Yu:
- Minor code cleanups
- Fix section alignment for .init_array
* tag 'modules-for-v4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
kallsyms: Use bounded strnchr() when parsing string
module: Unify the return value type of try_module_get
module: set .init_array alignment to 8
Linus Torvalds [Thu, 4 May 2017 01:41:21 +0000 (18:41 -0700)]
Merge tag 'trace-v4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"New features for this release:
- Pretty much a full rewrite of the processing of function plugins.
i.e. echo do_IRQ:stacktrace > set_ftrace_filter
- The rewrite was needed to add plugins to be unique to tracing
instances. i.e. mkdir instance/foo; cd instances/foo; echo
do_IRQ:stacktrace > set_ftrace_filter The old way was written very
hacky. This removes a lot of those hacks.
- New "function-fork" tracing option. When set, pids in the
set_ftrace_pid will have their children added when the processes
with their pids listed in the set_ftrace_pid file forks.
- Exposure of "maxactive" for kretprobe in kprobe_events
- Allow for builtin init functions to be traced by the function
tracer (via the kernel command line). Module init function tracing
will come in the next release.
- Added more selftests, and have selftests also test in an instance"
* tag 'trace-v4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (60 commits)
ring-buffer: Return reader page back into existing ring buffer
selftests: ftrace: Allow some event trigger tests to run in an instance
selftests: ftrace: Have some basic tests run in a tracing instance too
selftests: ftrace: Have event tests also run in an tracing instance
selftests: ftrace: Make func_event_triggers and func_traceonoff_triggers tests do instances
selftests: ftrace: Allow some tests to be run in a tracing instance
tracing/ftrace: Allow for instances to trigger their own stacktrace probes
tracing/ftrace: Allow for the traceonoff probe be unique to instances
tracing/ftrace: Enable snapshot function trigger to work with instances
tracing/ftrace: Allow instances to have their own function probes
tracing/ftrace: Add a better way to pass data via the probe functions
ftrace: Dynamically create the probe ftrace_ops for the trace_array
tracing: Pass the trace_array into ftrace_probe_ops functions
tracing: Have the trace_array hold the list of registered func probes
ftrace: If the hash for a probe fails to update then free what was initialized
ftrace: Have the function probes call their own function
ftrace: Have each function probe use its own ftrace_ops
ftrace: Have unregister_ftrace_function_probe_func() return a value
ftrace: Add helper function ftrace_hash_move_and_update_ops()
ftrace: Remove data field from ftrace_func_probe structure
...
Linus Torvalds [Thu, 4 May 2017 01:29:28 +0000 (18:29 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk
Pull printk updates from Petr Mladek:
- There is a situation when early console is not deregistered because
the preferred one matches a wrong entry. It caused messages to appear
twice.
This is the 2nd attempt to fix it. The first one was wrong, see the
commit c6c7d83b9c9e ('Revert "console: don't prefer first registered
if DT specifies stdout-path"').
The fix is coupled with some small code clean up. Well, the console
registration code would deserve a big one. We need to think about it.
- Do not lose information about the preemtive context when the console
semaphore is re-taken.
- Do not block CPU hotplug when someone else is already pushing
messages to the console.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
printk: fix double printing with earlycon
printk: rename selected_console -> preferred_console
printk: fix name/type/scope of preferred_console var
printk: Correctly handle preemption in console_unlock()
printk: use console_trylock() in console_cpu_notify()
Linus Torvalds [Thu, 4 May 2017 00:55:59 +0000 (17:55 -0700)]
Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:
- a few misc things
- most of MM
- KASAN updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (102 commits)
kasan: separate report parts by empty lines
kasan: improve double-free report format
kasan: print page description after stacks
kasan: improve slab object description
kasan: change report header
kasan: simplify address description logic
kasan: change allocation and freeing stack traces headers
kasan: unify report headers
kasan: introduce helper functions for determining bug type
mm: hwpoison: call shake_page() after try_to_unmap() for mlocked page
mm: hwpoison: call shake_page() unconditionally
mm/swapfile.c: fix swap space leak in error path of swap_free_entries()
mm/gup.c: fix access_ok() argument type
mm/truncate: avoid pointless cleancache_invalidate_inode() calls.
mm/truncate: bail out early from invalidate_inode_pages2_range() if mapping is empty
fs/block_dev: always invalidate cleancache in invalidate_bdev()
fs: fix data invalidation in the cleancache during direct IO
zram: reduce load operation in page_same_filled
zram: use zram_free_page instead of open-coded
zram: introduce zram data accessor
...
BUG: Double free or freeing an invalid pointer
Unexpected shadow byte: 0xFB
to
BUG: KASAN: double-free or invalid-free in kmalloc_oob_left+0xe5/0xef
This makes a bug uniquely identifiable by the first report line. To
account for removing of the unexpected shadow value, print shadow bytes
at the end of the report as in reports for other kinds of bugs.
The buggy address belongs to the object at ffff880068388540
which belongs to the cache kmalloc-128 of size 128
The buggy address is located 123 bytes inside of
128-byte region [ffff880068388540, ffff8800683885c0)
Makes it more explanatory and adds information about relative offset of
the accessed address to the start of the object.
kasan: introduce helper functions for determining bug type
Patch series "kasan: improve error reports", v2.
This patchset improves KASAN reports by making them easier to read and a
little more detailed. Also improves mm/kasan/report.c readability.
Effectively changes a use-after-free report to:
==================================================================
BUG: KASAN: use-after-free in kmalloc_uaf+0xaa/0xb6 [test_kasan]
Write of size 1 at addr ffff88006aa59da8 by task insmod/3951
Memory state around the buggy address: ffff88006aa59c80: 00 00 fc fc 00 00 fc fc 00 00 fc fc 00 00 fc fc ffff88006aa59d00: 00 00 fc fc 00 00 fc fc 00 00 fc fc 00 00 fc fc
>ffff88006aa59d80: fb fb fc fc fb fb fc fc fb fb fc fc fb fb fc fc
^ ffff88006aa59e00: fb fb fc fc fb fb fc fc fb fb fc fc fb fb fc fc ffff88006aa59e80: fb fb fc fc 00 00 fc fc 00 00 fc fc 00 00 fc fc
==================================================================
from:
==================================================================
BUG: KASAN: use-after-free in kmalloc_uaf+0xaa/0xb6 [test_kasan] at addr ffff88006c4dcb28
Write of size 1 by task insmod/3984
CPU: 1 PID: 3984 Comm: insmod Tainted: G B 4.10.0+ #83
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
dump_stack+0x292/0x398
kasan_object_err+0x1c/0x70
kasan_report.part.1+0x20e/0x4e0
__asan_report_store1_noabort+0x2c/0x30
kmalloc_uaf+0xaa/0xb6 [test_kasan]
kmalloc_tests_init+0x4f/0xa48 [test_kasan]
do_one_initcall+0xf3/0x390
do_init_module+0x215/0x5d0
load_module+0x54de/0x82b0
SYSC_init_module+0x3be/0x430
SyS_init_module+0x9/0x10
entry_SYSCALL_64_fastpath+0x1f/0xc2
RIP: 0033:0x7feca0f779da
RSP: 002b:00007ffdfeae5218 EFLAGS: 00000206 ORIG_RAX: 00000000000000af
RAX: ffffffffffffffda RBX: 000055a064c13090 RCX: 00007feca0f779da
RDX: 00007feca1236f88 RSI: 000000000004df7e RDI: 00007feca1605000
RBP: 00007feca1236f88 R08: 0000000000000003 R09: 0000000000000000
R10: 00007feca0f73d0a R11: 0000000000000206 R12: 000055a064c14190
R13: 000000000001fe81 R14: 0000000000000000 R15: 0000000000000004
Object at ffff88006c4dcb20, in cache kmalloc-16 size: 16
Allocated:
PID = 3984
save_stack_trace+0x16/0x20
save_stack+0x43/0xd0
kasan_kmalloc+0xad/0xe0
kmem_cache_alloc_trace+0x82/0x270
kmalloc_uaf+0x56/0xb6 [test_kasan]
kmalloc_tests_init+0x4f/0xa48 [test_kasan]
do_one_initcall+0xf3/0x390
do_init_module+0x215/0x5d0
load_module+0x54de/0x82b0
SYSC_init_module+0x3be/0x430
SyS_init_module+0x9/0x10
entry_SYSCALL_64_fastpath+0x1f/0xc2
Freed:
PID = 3984
save_stack_trace+0x16/0x20
save_stack+0x43/0xd0
kasan_slab_free+0x73/0xc0
kfree+0xe8/0x2b0
kmalloc_uaf+0x85/0xb6 [test_kasan]
kmalloc_tests_init+0x4f/0xa48 [test_kasan]
do_one_initcall+0xf3/0x390
do_init_module+0x215/0x5d0
load_module+0x54de/0x82b0
SYSC_init_module+0x3be/0x430
SyS_init_module+0x9/0x10
entry_SYSCALL_64_fastpath+0x1f/0xc2
Memory state around the buggy address: ffff88006c4dca00: fb fb fc fc fb fb fc fc fb fb fc fc fb fb fc fc ffff88006c4dca80: fb fb fc fc fb fb fc fc fb fb fc fc fb fb fc fc
>ffff88006c4dcb00: fb fb fc fc fb fb fc fc fb fb fc fc fb fb fc fc
^ ffff88006c4dcb80: fb fb fc fc 00 00 fc fc fb fb fc fc fb fb fc fc ffff88006c4dcc00: fb fb fc fc fb fb fc fc fb fb fc fc fb fb fc fc
==================================================================
This patch (of 9):
Introduce get_shadow_bug_type() function, which determines bug type
based on the shadow value for a particular kernel address. Introduce
get_wild_bug_type() function, which determines bug type for addresses
which don't have a corresponding shadow value.
Naoya Horiguchi [Wed, 3 May 2017 21:56:22 +0000 (14:56 -0700)]
mm: hwpoison: call shake_page() after try_to_unmap() for mlocked page
Memory error handler calls try_to_unmap() for error pages in various
states. If the error page is a mlocked page, error handling could fail
with "still referenced by 1 users" message. This is because the page is
linked to and stays in lru cache after the following call chain.
memory_failure() calls shake_page() to hanlde the similar issue, but
current code doesn't cover because shake_page() is called only before
try_to_unmap(). So this patches adds shake_page().
Fixes: 23a003bfd23ea9ea0b7756b920e51f64b284b468 ("mm/madvise: pass return code of memory_failure() to userspace") Link: http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop Link: http://lkml.kernel.org/r/1493197841-23986-3-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reported-by: kernel test robot <lkp@intel.com> Cc: Xiaolong Ye <xiaolong.ye@intel.com> Cc: Chen Gong <gong.chen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Naoya Horiguchi [Wed, 3 May 2017 21:56:19 +0000 (14:56 -0700)]
mm: hwpoison: call shake_page() unconditionally
shake_page() is called before going into core error handling code in
order to ensure that the error page is flushed from lru_cache lists
where pages stay during transferring among LRU lists.
But currently it's not fully functional because when the page is linked
to lru_cache by calling activate_page(), its PageLRU flag is set and
shake_page() is skipped. The result is to fail error handling with
"still referenced by 1 users" message.
When the page is linked to lru_cache by isolate_lru_page(), its PageLRU
is clear, so that's fine.
This patch makes shake_page() unconditionally called to avoild the
failure.
Fixes: 23a003bfd23ea9ea0b7756b920e51f64b284b468 ("mm/madvise: pass return code of memory_failure() to userspace") Link: http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop Link: http://lkml.kernel.org/r/1493197841-23986-2-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reported-by: kernel test robot <lkp@intel.com> Cc: Xiaolong Ye <xiaolong.ye@intel.com> Cc: Chen Gong <gong.chen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Huang Ying [Wed, 3 May 2017 21:56:16 +0000 (14:56 -0700)]
mm/swapfile.c: fix swap space leak in error path of swap_free_entries()
In swapcache_free_entries(), if swap_info_get_cont() returns NULL,
something wrong occurs for the swap entry. But we should still continue
to free the following swap entries in the array instead of skip them to
avoid swap space leak. This is just problem in error path, where system
may be in an inconsistent state, but it is still good to fix it.
Link: http://lkml.kernel.org/r/20170421124739.24534-1-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Shaohua Li <shli@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Arnd Bergmann [Wed, 3 May 2017 21:56:12 +0000 (14:56 -0700)]
mm/gup.c: fix access_ok() argument type
MIPS just got changed to only accept a pointer argument for access_ok(),
causing one warning in drivers/scsi/pmcraid.c. I tried changing x86 the
same way and found the same warning in __get_user_pages_fast() and
nowhere else in the kernel during randconfig testing:
mm/gup.c: In function '__get_user_pages_fast':
mm/gup.c:1578:6: error: passing argument 1 of '__chk_range_not_ok' makes pointer from integer without a cast [-Werror=int-conversion]
It would probably be a good idea to enforce type-safety in general, so
let's change this file to not cause a warning if we do that.
I don't know why the warning did not appear on MIPS.
Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()") Link: http://lkml.kernel.org/r/20170421162659.3314521-1-arnd@arndb.de Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cleancache_invalidate_inode() called truncate_inode_pages_range() and
invalidate_inode_pages2_range() twice - on entry and on exit. It's
stupid and waste of time. It's enough to call it once at exit.
Link: http://lkml.kernel.org/r/20170424164135.22350-5-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Alexey Kuznetsov <kuznet@virtuozzo.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrey Ryabinin [Wed, 3 May 2017 21:56:06 +0000 (14:56 -0700)]
mm/truncate: bail out early from invalidate_inode_pages2_range() if mapping is empty
If mapping is empty (both ->nrpages and ->nrexceptional is zero) we can
avoid pointless lookups in empty radix tree and bail out immediately
after cleancache invalidation.
Link: http://lkml.kernel.org/r/20170424164135.22350-4-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Alexey Kuznetsov <kuznet@virtuozzo.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrey Ryabinin [Wed, 3 May 2017 21:55:59 +0000 (14:55 -0700)]
fs: fix data invalidation in the cleancache during direct IO
Patch series "Properly invalidate data in the cleancache", v2.
We've noticed that after direct IO write, buffered read sometimes gets
stale data which is coming from the cleancache. The reason for this is
that some direct write hooks call call invalidate_inode_pages2[_range]()
conditionally iff mapping->nrpages is not zero, so we may not invalidate
data in the cleancache.
Another odd thing is that we check only for ->nrpages and don't check
for ->nrexceptional, but invalidate_inode_pages2[_range] also
invalidates exceptional entries as well. So we invalidate exceptional
entries only if ->nrpages != 0? This doesn't feel right.
- Patch 1 fixes direct IO writes by removing ->nrpages check.
- Patch 2 fixes similar case in invalidate_bdev().
Note: I only fixed conditional cleancache_invalidate_inode() here.
Do we also need to add ->nrexceptional check in into invalidate_bdev()?
- Patches 3-4: some optimizations.
This patch (of 4):
Some direct IO write fs hooks call invalidate_inode_pages2[_range]()
conditionally iff mapping->nrpages is not zero. This can't be right,
because invalidate_inode_pages2[_range]() also invalidate data in the
cleancache via cleancache_invalidate_inode() call. So if page cache is
empty but there is some data in the cleancache, buffered read after
direct IO write would get stale data from the cleancache.
Also it doesn't feel right to check only for ->nrpages because
invalidate_inode_pages2[_range] invalidates exceptional entries as well.
Fix this by calling invalidate_inode_pages2[_range]() regardless of
nrpages state.
Note: nfs,cifs,9p doesn't need similar fix because the never call
cleancache_get_page() (nor directly, nor via mpage_readpage[s]()), so
they are not affected by this bug.
Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache") Link: http://lkml.kernel.org/r/20170424164135.22350-2-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Alexey Kuznetsov <kuznet@virtuozzo.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Nikolay Borisov <n.borisov.lkml@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sangwoo Park [Wed, 3 May 2017 21:55:56 +0000 (14:55 -0700)]
zram: reduce load operation in page_same_filled
In page_same_filled function, all elements in the page is compared with
next index value. The current comparison routine compares the (i)th and
(i+1)th values of the page.
In this case, two load operaions occur for each comparison. But if we
store first value of the page stores at 'val' variable and using it to
compare with others, the load opearation is reduced. It reduce load
operation per page by up to 64times.
Link: http://lkml.kernel.org/r/1488428104-7257-1-git-send-email-sangwoo2.park@lge.com Signed-off-by: Sangwoo Park <sangwoo2.park@lge.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Wed, 3 May 2017 21:55:53 +0000 (14:55 -0700)]
zram: use zram_free_page instead of open-coded
The zram_free_page already handles NULL handle case and same page so use
it to reduce error probability. (Acutaully, I made a mistake when I
handled same page feature)
Link: http://lkml.kernel.org/r/1492052365-16169-7-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Wed, 3 May 2017 21:55:50 +0000 (14:55 -0700)]
zram: introduce zram data accessor
With element, sometime I got confused handle and element access. It
might be my bad but I think it's time to introduce accessor to prevent
future idiot like me. This patch is just clean-up patch so it shouldn't
change any behavior.
Link: http://lkml.kernel.org/r/1492052365-16169-6-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Wed, 3 May 2017 21:55:41 +0000 (14:55 -0700)]
zram: partial IO refactoring
For architecture(PAGE_SIZE > 4K), zram have supported partial IO.
However, the mixed code for handling normal/partial IO is too mess,
error-prone to modify IO handler functions with upcoming feature so this
patch aims for cleaning up zram's IO handling functions.
Link: http://lkml.kernel.org/r/1492052365-16169-3-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Thumshirn reported system goes the panic when using NVMe over
Fabrics loopback target with zram.
The reason is zram expects each bvec in bio contains a single page
but nvme can attach a huge bulk of pages attached to the bio's bvec
so that zram's index arithmetic could be wrong so that out-of-bound
access makes system panic.
[1] in mainline solved solved the problem by limiting max_sectors with
SECTORS_PER_PAGE but it makes zram slow because bio should split with
each pages so this patch makes zram aware of multiple pages in a bvec
so it could solve without any regression(ie, bio split).
[1] 0bc315381fe9, zram: set physical queue limits to avoid array out of
bounds accesses
Link: http://lkml.kernel.org/r/20170413134057.GA27499@bbox Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Johannes Thumshirn <jthumshirn@suse.de> Tested-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Hannes Reinecke <hare@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tetsuo Handa [Wed, 3 May 2017 21:55:34 +0000 (14:55 -0700)]
mm, page_alloc: remove debug_guardpage_minorder() test in warn_alloc()
Commit c0a32fc5a2e4 ("mm: more intensive memory corruption debugging")
changed to check debug_guardpage_minorder() > 0 when reporting
allocation failures. The reasoning was
When we use guard page to debug memory corruption, it shrinks
available pages to 1/2, 1/4, 1/8 and so on, depending on parameter
value. In such case memory allocation failures can be common and
printing errors can flood dmesg. If somebody debug corruption,
allocation failures are not the things he/she is interested about.
but this is misguided.
Allocation requests with __GFP_NOWARN flag by definition do not cause
flooding of allocation failure messages. Allocation requests with
__GFP_NORETRY flag likely also have __GFP_NOWARN flag. Costly
allocation requests likely also have __GFP_NOWARN flag.
Allocation requests without __GFP_DIRECT_RECLAIM flag likely also have
__GFP_NOWARN flag or __GFP_HIGH flag. Non-costly allocation requests
with __GFP_DIRECT_RECLAIM flag basically retry forever due to the "too
small to fail" memory-allocation rule.
Therefore, as a whole, shrinking available pages by
debug_guardpage_minorder= kernel boot parameter might cause flooding of
OOM killer messages but unlikely causes flooding of allocation failure
messages. Let's remove debug_guardpage_minorder() > 0 check which would
likely be pointless.
Link: http://lkml.kernel.org/r/1491910035-4231-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Rafael J . Wysocki" <rafael.j.wysocki@intel.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/madvise: move up the behavior parameter validation
madvise_behavior_valid() should be called before acting upon the
behavior parameter. Hence move up the function. This also includes
MADV_SOFT_OFFLINE and MADV_HWPOISON options as valid behavior parameter
for the system call madvise().
mm/madvise.c: clean up MADV_SOFT_OFFLINE and MADV_HWPOISON
This cleans up handling MADV_SOFT_OFFLINE and MADV_HWPOISON called
through madvise() system call.
* madvise_memory_failure() was misleading to accommodate handling of
both memory_failure() as well as soft_offline_page() functions.
Basically it handles memory error injection from user space which
can go either way as memory failure or soft offline. Renamed as
madvise_inject_error() instead.
* Renamed struct page pointer 'p' to 'page'.
* pr_info() was essentially printing PFN value but it said 'page'
which was misleading. Made the process virtual address explicit.
Before the patch:
Soft offlining page 0x15e3e at 0x3fff8c230000
Soft offlining page 0x1f3 at 0x3fffa0da0000
Soft offlining page 0x744 at 0x3fff7d200000
Soft offlining page 0x1634d at 0x3fff95e20000
Soft offlining page 0x16349 at 0x3fff95e30000
Soft offlining page 0x1d6 at 0x3fff9e8b0000
Soft offlining page 0x5f3 at 0x3fff91bd0000
Injecting memory failure for page 0x15c8b at 0x3fff83280000
Injecting memory failure for page 0x16190 at 0x3fff83290000
Injecting memory failure for page 0x740 at 0x3fff9a2e0000
Injecting memory failure for page 0x741 at 0x3fff9a2f0000
After the patch:
Soft offlining pfn 0x1484e at process virtual address 0x3fff883c0000
Soft offlining pfn 0x1484f at process virtual address 0x3fff883d0000
Soft offlining pfn 0x14850 at process virtual address 0x3fff883e0000
Soft offlining pfn 0x14851 at process virtual address 0x3fff883f0000
Soft offlining pfn 0x14852 at process virtual address 0x3fff88400000
Soft offlining pfn 0x14853 at process virtual address 0x3fff88410000
Soft offlining pfn 0x14854 at process virtual address 0x3fff88420000
Soft offlining pfn 0x1521c at process virtual address 0x3fff6bc70000
Injecting memory failure for pfn 0x10fcf at process virtual address 0x3fff86310000
Injecting memory failure for pfn 0x10fd0 at process virtual address 0x3fff86320000
Injecting memory failure for pfn 0x10fd1 at process virtual address 0x3fff86330000
Injecting memory failure for pfn 0x10fd2 at process virtual address 0x3fff86340000
Injecting memory failure for pfn 0x10fd3 at process virtual address 0x3fff86350000
Injecting memory failure for pfn 0x10fd4 at process virtual address 0x3fff86360000
Injecting memory failure for pfn 0x10fd5 at process virtual address 0x3fff86370000
Johannes Weiner [Wed, 3 May 2017 21:55:03 +0000 (14:55 -0700)]
mm: vmscan: fix IO/refault regression in cache workingset transition
Since commit 59dc76b0d4df ("mm: vmscan: reduce size of inactive file
list") we noticed bigger IO spikes during changes in cache access
patterns.
The patch in question shrunk the inactive list size to leave more room
for the current workingset in the presence of streaming IO. However,
workingset transitions that previously happened on the inactive list are
now pushed out of memory and incur more refaults to complete.
This patch disables active list protection when refaults are being
observed. This accelerates workingset transitions, and allows more of
the new set to establish itself from memory, without eating into the
ability to protect the established workingset during stable periods.
The workloads that were measurably affected for us were hit pretty bad
by it, with refault/majfault rates doubling and tripling during cache
transitions, and the machines sustaining half-hour periods of 100% IO
utilization, where they'd previously have sub-minute peaks at 60-90%.
Stateful services that handle user data tend to be more conservative
with kernel upgrades. As a result we hit most page cache issues with
some delay, as was the case here.
The severity seemed to warrant a stable tag.
Fixes: 59dc76b0d4df ("mm: vmscan: reduce size of inactive file list") Link: http://lkml.kernel.org/r/20170404220052.27593-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: <stable@vger.kernel.org> [4.7+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/mmap: replace SHM_HUGE_MASK with MAP_HUGE_MASK inside mmap_pgoff
Commit 091d0d55b286 ("shm: fix null pointer deref when userspace
specifies invalid hugepage size") had replaced MAP_HUGE_MASK with
SHM_HUGE_MASK. Though both of them contain the same numeric value of
0x3f, MAP_HUGE_MASK flag sounds more appropriate than the other one in
the context. Hence change it back.
Link: http://lkml.kernel.org/r/20170404045635.616-1-khandual@linux.vnet.ibm.com Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Wed, 3 May 2017 21:54:57 +0000 (14:54 -0700)]
oom: improve oom disable handling
Tetsuo has reported that sysrq triggered OOM killer will print a
misleading information when no tasks are selected:
sysrq: SysRq : Manual OOM execution
Out of memory: Kill process 4468 ((agetty)) score 0 or sacrifice child
Killed process 4468 ((agetty)) total-vm:43704kB, anon-rss:1760kB, file-rss:0kB, shmem-rss:0kB
sysrq: SysRq : Manual OOM execution
Out of memory: Kill process 4469 (systemd-cgroups) score 0 or sacrifice child
Killed process 4469 (systemd-cgroups) total-vm:10704kB, anon-rss:120kB, file-rss:0kB, shmem-rss:0kB
sysrq: SysRq : Manual OOM execution
sysrq: OOM request ignored because killer is disabled
sysrq: SysRq : Manual OOM execution
sysrq: OOM request ignored because killer is disabled
sysrq: SysRq : Manual OOM execution
sysrq: OOM request ignored because killer is disabled
The real reason is that there are no eligible tasks for the OOM killer
to select but since commit 7c5f64f84483 ("mm: oom: deduplicate victim
selection code for memcg and global oom") the semantic of out_of_memory
has changed without updating moom_callback.
This patch updates moom_callback to tell that no task was eligible which
is the case for both oom killer disabled and no eligible tasks. In
order to help distinguish first case from the second add printk to both
oom_killer_{enable,disable}. This information is useful on its own
because it might help debugging potential memory allocation failures.
Fixes: 7c5f64f84483 ("mm: oom: deduplicate victim selection code for memcg and global oom") Link: http://lkml.kernel.org/r/20170404134705.6361-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Rapoport [Wed, 3 May 2017 21:54:54 +0000 (14:54 -0700)]
userfaultfd: selftest: combine all cases into a single executable
Currently, selftest for userfaultfd is compiled three times: for
anonymous, shared and hugetlb memory. Let's combine all the cases into
a single executable which will have a command line option for selection
of the test type.
Link: http://lkml.kernel.org/r/1490869741-5913-1-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vinayak Menon [Wed, 3 May 2017 21:54:42 +0000 (14:54 -0700)]
mm: enable page poisoning early at boot
On SPARSEMEM systems page poisoning is enabled after buddy is up,
because of the dependency on page extension init. This causes the pages
released by free_all_bootmem not to be poisoned. This either delays or
misses the identification of some issues because the pages have to
undergo another cycle of alloc-free-alloc for any corruption to be
detected.
Enable page poisoning early by getting rid of the PAGE_EXT_DEBUG_POISON
flag. Since all the free pages will now be poisoned, the flag need not
be verified before checking the poison during an alloc.
Huang Ying [Wed, 3 May 2017 21:54:39 +0000 (14:54 -0700)]
mm, swap: avoid lock swap_avail_lock when held cluster lock
Cluster lock is used to protect the swap_cluster_info and corresponding
elements in swap_info_struct->swap_map[]. But it is found that now in
scan_swap_map_slots(), swap_avail_lock may be acquired when cluster lock
is held. This does no good except making the locking more complex and
improving the potential locking contention, because the
swap_info_struct->lock is used to protect the data structure operated in
the code already. Fix this via moving the corresponding operations in
scan_swap_map_slots() out of cluster lock.
Link: http://lkml.kernel.org/r/20170317064635.12792-3-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Tim Chen <tim.c.chen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Huang Ying [Wed, 3 May 2017 21:54:36 +0000 (14:54 -0700)]
mm, swap: improve readability via make spin_lock/unlock balanced
This is just a cleanup patch, no functionality change.
In cluster_list_add_tail(), spin_lock_nested() is used to lock the
cluster, while unlock_cluster() is used to unlock the cluster. To
improve the code readability, use spin_unlock() directly to unlock the
cluster.
Link: http://lkml.kernel.org/r/20170317064635.12792-2-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Tim Chen <tim.c.chen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Huang Ying [Wed, 3 May 2017 21:54:33 +0000 (14:54 -0700)]
mm, swap: fix comment in __read_swap_cache_async
Commit cbab0e4eec29 ("swap: avoid read_swap_cache_async() race to
deadlock while waiting on discard I/O completion") fixed a deadlock in
read_swap_cache_async(). Because at that time, in swap allocation path,
a swap entry may be set as SWAP_HAS_CACHE, then wait for discarding to
complete before the page for the swap entry is added to the swap cache.
But in commit 815c2c543d3a ("swap: make swap discard async"), the
discarding for swap become asynchronous, waiting for discarding to
complete will be done before the swap entry is set as SWAP_HAS_CACHE.
So the comments in code is incorrect now. This patch fixes the
comments.
The cond_resched() added in the commit cbab0e4eec29 is not necessary now
too. But if we added some sleep in swap allocation path in the future,
there may be some hard to debug/reproduce deadlock bug. So it is kept.
Link: http://lkml.kernel.org/r/20170317064635.12792-1-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Shaohua Li <shli@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Wed, 3 May 2017 21:54:27 +0000 (14:54 -0700)]
mm: make rmap_one boolean function
rmap_one's return value controls whether rmap_work should contine to
scan other ptes or not so it's target for changing to boolean. Return
true if the scan should be continued. Otherwise, return false to stop
the scanning.
This patch makes rmap_one's return value to boolean.
Link: http://lkml.kernel.org/r/1489555493-14659-10-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Wed, 3 May 2017 21:54:17 +0000 (14:54 -0700)]
mm: remove SWAP_AGAIN in ttu
In 2002, [1] introduced SWAP_AGAIN. At that time, try_to_unmap_one used
spin_trylock(&mm->page_table_lock) so it's really easy to contend and
fail to hold a lock so SWAP_AGAIN to keep LRU status makes sense.
However, now we changed it to mutex-based lock and be able to block
without skip pte so there is few of small window to return SWAP_AGAIN so
remove SWAP_AGAIN and just return SWAP_FAIL.
Minchan Kim [Wed, 3 May 2017 21:54:13 +0000 (14:54 -0700)]
mm: remove SWAP_MLOCK in ttu
ttu doesn't need to return SWAP_MLOCK. Instead, just return SWAP_FAIL
because it means the page is not-swappable so it should move to another
LRU list(active or unevictable). putback friends will move it to right
list depending on the page's LRU flag.
Link: http://lkml.kernel.org/r/1489555493-14659-6-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Wed, 3 May 2017 21:54:10 +0000 (14:54 -0700)]
mm: make try_to_munlock() return void
try_to_munlock returns SWAP_MLOCK if the one of VMAs mapped the page has
VM_LOCKED flag. In that time, VM set PG_mlocked to the page if the page
is not pte-mapped THP which cannot be mlocked, either.
With that, __munlock_isolated_page can use PageMlocked to check whether
try_to_munlock is successful or not without relying on try_to_munlock's
retval. It helps to make try_to_unmap/try_to_unmap_one simple with
upcoming patches.
Minchan Kim [Wed, 3 May 2017 21:54:07 +0000 (14:54 -0700)]
mm: remove SWAP_MLOCK check for SWAP_SUCCESS in ttu
If the page is mapped and rescue in try_to_unmap_one, the
page_mapcount() of a page cannot be zero, so the page_mapcount check in
try_to_unmap is enough to return SWAP_SUCCESS. IOW, SWAP_MLOCK check is
redundant so remove it.
Link: http://lkml.kernel.org/r/1489555493-14659-4-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Wed, 3 May 2017 21:54:04 +0000 (14:54 -0700)]
mm: remove SWAP_DIRTY in ttu
If we found lazyfree page is dirty, try_to_unmap_one can just
SetPageSwapBakced in there like PG_mlocked page and just return with
SWAP_FAIL which is very natural because the page is not swappable right
now so that vmscan can activate it. There is no point to introduce new
return value SWAP_DIRTY in try_to_unmap at the moment.
Link: http://lkml.kernel.org/r/1489555493-14659-3-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yisheng Xie [Wed, 3 May 2017 21:53:57 +0000 (14:53 -0700)]
mm/vmscan: more restrictive condition for retry in do_try_to_free_pages
By reviewing code, I find that when enter do_try_to_free_pages, the
may_thrash is always clear, and it will retry shrink zones to tap
cgroup's reserves memory by setting may_thrash when the former
shrink_zones reclaim nothing.
However, when memcg is disabled or on legacy hierarchy, or there do not
have any memcg protected by low limit, it should not do this useless
retry at all, for we do not have any cgroup's reserves memory to tap,
and we have already done hard work but made no progress, which as Michal
pointed out in former version, we are trying hard to control the retry
logical of page alloctor, and the current additional round of reclaim is
just lame.
Therefore, to avoid this unneeded retrying and make code more readable,
we remove the may_thrash field in scan_control, instead, introduce
memcg_low_reclaim and memcg_low_skipped, and only retry when
memcg_low_skipped, by setting memcg_low_reclaim.
[xieyisheng1@huawei.com: remove may_thrash field, introduce mem_cgroup_reclaim] Link: http://lkml.kernel.org/r/1490191893-5923-1-git-send-email-ysxie@foxmail.com Link: http://lkml.kernel.org/r/1490191893-5923-1-git-send-email-ysxie@foxmail.com Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Michal Hocko <mhocko@kernel.org> Suggested-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yisheng Xie [Wed, 3 May 2017 21:53:54 +0000 (14:53 -0700)]
mm/compaction: ignore block suitable after check large free page
By reviewing code, I find that if the migrate target is a large free
page and we ignore suitable, it may splite large target free page into
smaller block which is not good for defrag. So move the ignore block
suitable after check large free page.
As Vlastimil pointed out in RFC version that this patch is just based on
logical analyses which might be better for future-proofing the function
and it is most likely won't have any visible effect right now, for
direct compaction shouldn't have to be called if there's a
>=pageblock_order page already available.
Link: http://lkml.kernel.org/r/1489490743-5364-1-git-send-email-xieyisheng1@huawei.com Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wei Yang [Wed, 3 May 2017 21:53:51 +0000 (14:53 -0700)]
mm/sparse: refine usemap_size() a little
The current implementation calculates usemap_size in two steps:
* calculate number of bytes to cover these bits
* calculate number of "unsigned long" to cover these bytes
It would be more clear by:
* calculate number of "unsigned long" to cover these bits
* multiple it with sizeof(unsigned long)
This patch refine usemap_size() a little to make it more easy to
understand.
__GFP_NOWARN, which is usually added to avoid warnings from callsites
that expect to fail and have fallbacks, currently also suppresses
allocation stall warnings. These trigger when an allocation is stuck
inside the allocator for 10 seconds or longer.
But there is no class of allocations that can get legitimately stuck in
the allocator for this long. This always indicates a problem.
Always emit stall warnings. Restrict __GFP_NOWARN to alloc failures.
Link: http://lkml.kernel.org/r/20170125181150.GA16398@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Wed, 3 May 2017 21:53:45 +0000 (14:53 -0700)]
mm, vmscan: prevent kswapd sleeping prematurely due to mismatched classzone_idx
kswapd is woken to reclaim a node based on a failed allocation request
from any eligible zone. Once reclaiming in balance_pgdat(), it will
continue reclaiming until there is an eligible zone available for the
zone it was woken for. kswapd tracks what zone it was recently woken
for in pgdat->kswapd_classzone_idx. If it has not been woken recently,
this zone will be 0.
However, the decision on whether to sleep is made on
kswapd_classzone_idx which is 0 without a recent wakeup request and that
classzone does not account for lowmem reserves. This allows kswapd to
sleep when a low small zone such as ZONE_DMA is balanced for a GFP_DMA
request even if a stream of allocations cannot use that zone. While
kswapd may be woken again shortly in the near future there are two
consequences -- the pgdat bits that control congestion are cleared
prematurely and direct reclaim is more likely as kswapd slept
prematurely.
This patch flips kswapd_classzone_idx to default to MAX_NR_ZONES (an
invalid index) when there has been no recent wakeups. If there are no
wakeups, it'll decide whether to sleep based on the highest possible
zone available (MAX_NR_ZONES - 1). It then becomes critical that the
"pgdat balanced" decisions during reclaim and when deciding to sleep are
the same. If there is a mismatch, kswapd can stay awake continually
trying to balance tiny zones.
simoop was used to evaluate it again. Two of the preparation patches
regressed the workload so they are included as the second set of
results. Otherwise this patch looks artifically excellent
With this patch on top, write and allocation latencies are massively
improved. The read latencies are slightly impaired but it's worth
noting that this is mostly due to the IO scheduler and not directly
related to reclaim. The vmstats are a bit of a mix but the relevant
ones are as follows;
Direct reclaim is reduced but not eliminated. It's worth noting that
there are two periods of direct reclaim for this workload. The first is
when it switches from preparing the files for the actual test itself.
It's a lot of file IO followed by a lot of allocs that reclaims heavily
for a brief window. While direct reclaim is lower with clear-v2, it is
due to kswapd scanning aggressively and trying to reclaim the world
which is not the right thing to do. With the patches applied, there is
still direct reclaim but the phase change from "creating work files" to
starting multiple threads that allocate a lot of anonymous memory faster
than kswapd can reclaim.
Scanning/reclaim efficiency is restored by this patch.
Page writes from reclaim context are back at 0 which is ideal.
Pages immediately reclaimed after IO completes is slightly improved but
it is expected this will vary slightly.
On UMA, there is almost no change so this is not expected to be a
universal win.
Mel Gorman [Wed, 3 May 2017 21:53:41 +0000 (14:53 -0700)]
mm, vmscan: only clear pgdat congested/dirty/writeback state when balanced
A pgdat tracks if recent reclaim encountered too many dirty, writeback
or congested pages. The flags control whether kswapd writes pages back
from reclaim context, tags pages for immediate reclaim when IO
completes, whether processes block on wait_iff_congested and whether
kswapd blocks when too many pages marked for immediate reclaim are
encountered.
The state is cleared in a check function with side-effects. With the
patch "mm, vmscan: fix zone balance check in prepare_kswapd_sleep", the
timing of when the bits get cleared changed. Due to the way the check
works, it'll clear the bits if ZONE_DMA is balanced for a GFP_DMA
allocation because it does not account for lowmem reserves properly.
For the simoop workload, kswapd is not stalling when it should due to
the premature clearing, writing pages from reclaim context like crazy
and generally being unhelpful.
This patch resets the pgdat bits related to page reclaim only when
kswapd is going to sleep. The comparison with simoop is then
Read latency is improved, write latency is mostly improved but
allocation latency is regressed. kswapd is still reclaiming
inefficiently, pages are being written back from writeback context and a
host of other issues. However, given the change, it needed to be
spelled out why the side-effect was moved.
Shantanu Goel [Wed, 3 May 2017 21:53:38 +0000 (14:53 -0700)]
mm, vmscan: fix zone balance check in prepare_kswapd_sleep
Patch series "Reduce amount of time kswapd sleeps prematurely", v2.
The series is unusual in that the first patch fixes one problem and
introduces other issues that are noted in the changelog. Patch 2 makes
a minor modification that is worth considering on its own but leaves the
kernel in a state where it behaves badly. It's not until patch 3 that
there is an improvement against baseline.
This was mostly motivated by examining Chris Mason's "simoop" benchmark
which puts the VM under similar pressure to HADOOP. It has been
reported that the benchmark has regressed severely during the last
number of releases. While I cannot reproduce all the same problems
Chris experienced due to hardware limitations, there was a number of
problems on a 2-socket machine with a single disk.
These are latencies. Read/write are threads reading fixed-size random
blocks from a simulated database. The allocation latency is mmaping and
faulting regions of memory. The p50, 95 and p99 reports the worst
latencies for 50% of the samples, 95% and 99% respectively.
For example, the report indicates that while the test was running 99% of
writes completed 99.75% faster. It's worth noting that on a UMA machine
that no difference in performance with simoop was observed so milage
will vary.
It's noted that there is a slight impact to read latencies but it's
mostly due to IO scheduler decisions and offset by the large reduction
in other latencies.
This patch (of 3):
The check in prepare_kswapd_sleep needs to match the one in
balance_pgdat since the latter will return as soon as any one of the
zones in the classzone is above the watermark. This is specially
important for higher order allocations since balance_pgdat will
typically reset the order to zero relying on compaction to create the
higher order pages. Without this patch, prepare_kswapd_sleep fails to
wake up kcompactd since the zone balance check fails.
It was first reported against 4.9.7 that kswapd is failing to wake up
kcompactd due to a mismatch in the zone balance check between
balance_pgdat() and prepare_kswapd_sleep().
balance_pgdat() returns as soon as a single zone satisfies the
allocation but prepare_kswapd_sleep() requires all zones to do +the
same. This causes prepare_kswapd_sleep() to never succeed except in the
order == 0 case and consequently, wakeup_kcompactd() is never called.
For the machine that originally motivated this patch, the state of
compaction from /proc/vmstat looked this way after a day and a half +of
uptime:
Further notes from Mel that motivated him to pick this patch up and
resend it;
It was observed for the simoop workload (pressures the VM similar to
HADOOP) that kswapd was failing to keep ahead of direct reclaim. The
investigation noted that there was a need to rationalise kswapd
decisions to reclaim with kswapd decisions to sleep. With this patch on
a 2-socket box, there was a 49% reduction in direct reclaim scanning.
However, the impact otherwise is extremely negative. Kswapd reclaim
efficiency dropped from 98% to 76%. simoop has three latency-related
metrics for read, write and allocation (an anonymous mmap and fault).
Of greater concern is that the patch causes swapping and page writes
from kswapd context rose from 0 pages to 4189753 pages during the hour
the workload ran for. By and large, the patch has very bad behaviour
but easily missed as the impact on a UMA machine is negligible.
This patch is included with the data in case a bisection leads to this
area. This patch is also a pre-requisite for the rest of the series.