Merge tag 'arc-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc
Pull ARC updates from Vineet Gupta:
- Support for HSDK board hosting a Quad core HS38x4 based SoC running
@1GHz (and some prerrquisite changes such as ability to scoot the
kernel code/data from start of memory map etc)
- Quite a few updates for EZChip (Mellanox) platform
- Fixes to fault/exception printing
* tag 'arc-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc: (26 commits)
ARC: Re-enable MMU upon Machine Check exception
ARC: Show fault information passed to show_kernel_fault_diag()
ARC: [plat-hsdk] initial port for HSDK board
ARC: mm: Decouple RAM base address from kernel link address
ARCv2: IOC: Tighten up the contraints (specifically base / size alignment)
ARC: [plat-axs103] refactor the DT fudging code
ARC: [plat-axs103] use clk driver #2: Add core pll node to DT to manage cpu clk
ARC: [plat-axs103] use clk driver #1: Get rid of platform specific cpu clk setting
ARCv2: SLC: provide a line based flush routine for debugging
ARC: Hardcode ARCH_DMA_MINALIGN to max line length we may have
ARC: [plat-eznps] handle extra aux regs #2: kernel/entry exit
ARC: [plat-eznps] handle extra aux regs #1: save/restore on context switch
ARC: [plat-eznps] avoid toggling of DPC register
ARC: [plat-eznps] Update the init sequence of aux regs per cpu.
ARC: [plat-eznps] new command line argument for HW scheduler at MTM
ARC: set boot print log level to PR_INFO
ARC: [plat-eznps] Handle user memory error same in simulation and silicon
ARC: [plat-eznps] use schd.wft instruction instead of sleep at idle task
ARC: create cpu specific version of arch_cpu_idle()
ARC: [plat-eznps] spinlock aware for MTM
...
Merge tag 'pci-v4.14-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI updates from Bjorn Helgaas:
- add enhanced Downstream Port Containment support, which prints more
details about Root Port Programmed I/O errors (Dongdong Liu)
- add Layerscape ls1088a and ls2088a support (Hou Zhiqiang)
- add MediaTek MT2712 and MT7622 support (Ryder Lee)
- add MediaTek MT2712 and MT7622 MSI support (Honghui Zhang)
- add Qualcom IPQ8074 support (Varadarajan Narayanan)
- add R-Car r8a7743/5 device tree support (Biju Das)
- add Rockchip per-lane PHY support for better power management (Shawn
Lin)
- fix IRQ mapping for hot-added devices by replacing the
pci_fixup_irqs() boot-time design with a host bridge hook called at
probe-time (Lorenzo Pieralisi, Matthew Minter)
- fix race when enabling two devices that results in upstream bridge
not being enabled correctly (Srinath Mannam)
- fix pciehp power fault infinite loop (Keith Busch)
- fix SHPC bridge MSI hotplug events by enabling bus mastering
(Aleksandr Bezzubikov)
- fix a VFIO issue by correcting PCIe capability sizes (Alex
Williamson)
- fix an INTD issue on Xilinx and possibly other drivers by unifying
INTx IRQ domain support (Paul Burton)
- avoid IOMMU stalls by marking AMD Stoney GPU ATS as broken (Joerg
Roedel)
- allow APM X-Gene device assignment to guests by adding an ACS quirk
(Feng Kan)
- fix driver crashes by disabling Extended Tags on Broadcom HT2100
(Extended Tags support is required for PCIe Receivers but not
Requesters, and we now enable them by default when Requesters support
them) (Sinan Kaya)
- fix MSIs for devices that use phantom RIDs for DMA by assuming MSIs
use the real Requester ID (not a phantom RID) (Robin Murphy)
- prevent assignment of Intel VMD children to guests (which may be
supported eventually, but isn't yet) by not associating an IOMMU with
them (Jon Derrick)
- fix Intel VMD suspend/resume by releasing IRQs on suspend (Scott
Bauer)
- fix a Function-Level Reset issue with Intel 750 NVMe by waiting
longer (up to 60sec instead of 1sec) for device to become ready
(Sinan Kaya)
- fix a Function-Level Reset issue on iProc Stingray by working around
hardware defects in the CRS implementation (Oza Pawandeep)
- fix an issue with Intel NVMe P3700 after an iProc reset by adding a
delay during shutdown (Oza Pawandeep)
- fix a Microsoft Hyper-V lockdep issue by polling instead of blocking
in compose_msi_msg() (Stephen Hemminger)
- fix a wireless LAN driver timeout by clearing DesignWare MSI
interrupt status after it is handled, not before (Faiz Abbas)
- reduce Layerscape dependencies on the bootloader by doing more
initialization in the driver (Hou Zhiqiang)
- improve Intel VMD performance allowing allocation of more IRQ vectors
than present CPUs (Keith Busch)
- improve endpoint framework support for initial DMA mask, different
BAR sizes, configurable page sizes, MSI, test driver, etc (Kishon
Vijay Abraham I, Stan Drozd)
- rework CRS support to add periodic messages while we poll during
enumeration and after Function-Level Reset and prepare for possible
other uses of CRS (Sinan Kaya)
- clean up Root Port AER handling by removing unnecessary code and
moving error handler methods to struct pcie_port_service_driver
(Christoph Hellwig)
- clean up error handling paths in various drivers (Bjorn Andersson,
Fabio Estevam, Gustavo A. R. Silva, Harunobu Kurokawa, Jeffy Chen,
Lorenzo Pieralisi, Sergei Shtylyov)
- clean up SR-IOV resource handling by disabling VF decoding before
updating the corresponding resource structs (Gavin Shan)
- clean up DesignWare-based drivers by unifying quirks to update Class
Code and Interrupt Pin and related handling of write-protected
registers (Hou Zhiqiang)
- clean up by adding empty generic pcibios_align_resource() and
pcibios_fixup_bus() and removing empty arch-specific implementations
(Palmer Dabbelt)
- request exclusive reset control for several drivers to allow cleanup
elsewhere (Philipp Zabel)
- constify various structures (Arvind Yadav, Bhumika Goyal)
Merge tag 'kvm-4.14-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Radim Krčmář:
"First batch of KVM changes for 4.14
Common:
- improve heuristic for boosting preempted spinlocks by ignoring
VCPUs in user mode
ARM:
- fix for decoding external abort types from guests
- added support for migrating the active priority of interrupts when
running a GICv2 guest on a GICv3 host
- minor cleanup
PPC:
- expose storage keys to userspace
- merge kvm-ppc-fixes with a fix that missed 4.13 because of
vacations
- fixes
s390:
- merge of kvm/master to avoid conflicts with additional sthyi fixes
- wire up the no-dat enhancements in KVM
- multiple epoch facility (z14 feature)
- Configuration z/Architecture Mode
- more sthyi fixes
- gdb server range checking fix
- small code cleanups
x86:
- emulate Hyper-V TSC frequency MSRs
- add nested INVPCID
- emulate EPTP switching VMFUNC
- support Virtual GIF
- support 5 level page tables
- speedup nested VM exits by packing byte operations
- speedup MMIO by using hardware provided physical address
- a lot of fixes and cleanups, especially nested"
* tag 'kvm-4.14-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (67 commits)
KVM: arm/arm64: Support uaccess of GICC_APRn
KVM: arm/arm64: Extract GICv3 max APRn index calculation
KVM: arm/arm64: vITS: Drop its_ite->lpi field
KVM: arm/arm64: vgic: constify seq_operations and file_operations
KVM: arm/arm64: Fix guest external abort matching
KVM: PPC: Book3S HV: Fix memory leak in kvm_vm_ioctl_get_htab_fd
KVM: s390: vsie: cleanup mcck reinjection
KVM: s390: use WARN_ON_ONCE only for checking
KVM: s390: guestdbg: fix range check
KVM: PPC: Book3S HV: Report storage key support to userspace
KVM: PPC: Book3S HV: Fix case where HDEC is treated as 32-bit on POWER9
KVM: PPC: Book3S HV: Fix invalid use of register expression
KVM: PPC: Book3S HV: Fix H_REGISTER_VPA VPA size validation
KVM: PPC: Book3S HV: Fix setting of storage key in H_ENTER
KVM: PPC: e500mc: Fix a NULL dereference
KVM: PPC: e500: Fix some NULL dereferences on error
KVM: PPC: Book3S HV: Protect updates to spapr_tce_tables list
KVM: s390: we are always in czam mode
KVM: s390: expose no-DAT to guest and migration support
KVM: s390: sthyi: remove invalid guest write access
...
Merge tag 'linux-kselftest-4.14-rc1-update' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kselftest updates from Shuah Khan:
- TAP13 framework API and converting tests to TAP13 continues. A few
more tests are converted and kselftest common RUN_TESTS in lib.mk is
enhanced to print TAP13 to cover test shell scripts that won't be
able to use kselftest API.
- Several fixes to existing tests to not fail in unsupported cases.
This has been an ongoing work based on the feedback from stable
release kselftest users.
- A new watchdog test and much needed cleanups to the existing tests
from Eugeniu Rosca.
- Changes to kselftest common lib.mk framework to make RUN_TESTS a
function to be called from individual test make files to run stress
and destructive sub-tests.
* tag 'linux-kselftest-4.14-rc1-update' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: (41 commits)
selftests: Enhance kselftest_harness.h to print which assert failed
selftests: lib.mk: change RUN_TESTS to print messages in TAP13 format
selftests: change lib.mk RUN_TESTS to take test list as an argument
selftests: lib.mk: suppress "cd" output from run_tests target
selftests: kselftest framework: change skip exit code to 0
selftests/timers: make loop consistent with array size
selftests: timers: remove rtctest_setdate from run_destructive_tests
selftests: timers: Fix run_destructive_tests target to handle skipped tests
kselftests: timers: leap-a-day: Change default arguments to help test runs
selftests: timers: drop support for !KTEST case
rtc: rtctest: Improve support detection
selftests/cpu-hotplug: Skip test when there is only one online cpu
selftests/cpu-hotplug: exit with failure when test occured unexpected behaviors
selftests: futex: convert test to use ksft TAP13 framework
selftests: capabilities: convert error output to TAP13 ksft framework
selftests: memfd: Align STACK_SIZE for ARM AArch64 system
selftests: warn if failure is due to lack of executable bit
selftests: kselftest framework: add error counter
selftests: capabilities: convert the test to use TAP13 ksft framework
selftests: capabilities: fix to run Non-root +ia, sgidroot => i test
...
Merge tag 'trace-v4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"Nothing new in development for this release. These are mostly fixes
that were found during development of changes for the next merge
window and fixes that were sent to me late in the last cycle"
* tag 'trace-v4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Apply trace_clock changes to instance max buffer
tracing: Fix clear of RECORDED_TGID flag when disabling trace event
tracing: Add barrier to trace_printk() buffer nesting modification
ftrace: Fix memleak when unregistering dynamic ops when tracing disabled
ftrace: Fix selftest goto location on error
ftrace: Zero out ftrace hashes when a module is removed
tracing: Only have rmmod clear buffers that its events were active in
ftrace: Fix debug preempt config name in stack_tracer_{en,dis}able
I had stupidly missed one special use of 'is_reserved_word()' when I
converted the code to avoid gperf.
I had changed that function to return the token ID directly rather than
a pointer to the token descriptor structure, but that meant that the
test for "is this a reserved word" changed from checking the return
value against NULL, to checking that it wasn't negative.
And while I had converted the main token parser over, I missed the
special case of the typeof phrase handling. And since our dependency
chain for genksyms does not include the genksyms program itself
changing, my kernel rebuild didn't show the problem.
Merge branch 'kvm-ppc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
This fix was intended for 4.13, but didn't get in because both
maintainers were on vacation.
Paul Mackerras:
"It adds mutual exclusion between list_add_rcu and list_del_rcu calls
on the kvm->arch.spapr_tce_tables list. Without this, userspace could
potentially trigger corruption of the list and cause a host crash or
worse."
Remove our use of 'gperf' for generating perfect hashes from some of our
build tools.
This removal was prompted by Masahiro Yamada sending out a patch that
removes all our pre-generated files, and when I tested it, I noticed
that the gperf version I have (3.1) apparently generates code that no
longer works with out code-base because the function interfaces
generated by gperf have changed.
We really don't care that much, and the gperf people changed their
interfaces in ways that makes it annoying to work with them. Tools that
make it hard to use them should not be used, and the kernel is not at
all interested in some autoconf mess. So remove the gperf dependency
entirely.
It turns out that if you ignore the pre-generated files, the use of
gperf apparently saved us a whopping fifteen lines of code. It
obviously wasn't worth it, considering that the pre-generated files are
about 500 lines.
I sent this out as a patch about three weeks ago, and got absolutely
zero responses. So let's see if anybody notices now that I merge it.
Because there might be serious bugs here, but it WorksForMe(tm).
* gperf-removal:
Remove gperf usage from toolchain
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This is mostly updates of the usual suspects: lpfc, qla2xxx, hisi_sas,
megaraid_sas, zfcp and a host of minor updates.
The major driver change here is the elimination of the block based
cciss driver in favour of the SCSI based hpsa driver (which now drives
all the legacy cases cciss used to be required for). Plus a reset
handler clean up and the redo of the SAS SMP handler to use bsg lib"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (279 commits)
scsi: scsi-mq: Always unprepare before requeuing a request
scsi: Show .retries and .jiffies_at_alloc in debugfs
scsi: Improve requeuing behavior
scsi: Call scsi_initialize_rq() for filesystem requests
scsi: qla2xxx: Reset the logo flag, after target re-login.
scsi: qla2xxx: Fix slow mem alloc behind lock
scsi: qla2xxx: Clear fc4f_nvme flag
scsi: qla2xxx: add missing includes for qla_isr
scsi: qla2xxx: Fix an integer overflow in sysfs code
scsi: aacraid: report -ENOMEM to upper layer from aac_convert_sgraw2()
scsi: aacraid: get rid of one level of indentation
scsi: aacraid: fix indentation errors
scsi: storvsc: fix memory leak on ring buffer busy
scsi: scsi_transport_sas: switch to bsg-lib for SMP passthrough
scsi: smartpqi: remove the smp_handler stub
scsi: hpsa: remove the smp_handler stub
scsi: bsg-lib: pass the release callback through bsg_setup_queue
scsi: Rework handling of scsi_device.vpd_pg8[03]
scsi: Rework the code for caching Vital Product Data (VPD)
scsi: rcu: Introduce rcu_swap_protected()
...
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk
Pull printk updates from Petr Mladek:
- Do not allow use of freed init data and code even when boot consoles
are forced to stay. Also check for the init memory more precisely.
- Some code clean up by starting contributors.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
printk: Clean up do_syslog() error handling
printk/console: Enhance the check for consoles using init memory
printk/console: Always disable boot consoles that use init memory before it is freed
printk: Modify operators of printed_len and text_len
Merge tag 'audit-pr-20170907' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit updates from Paul Moore:
"A small pull request for audit this time, only four patches and only
two with any real code changes.
Those two changes are the removal of a pointless SELinux AVC
initialization audit event and a fix to improve the audit timestamp
overhead.
The other two patches are comment cleanup and administrative updates,
nothing very exciting.
Everything passes our tests"
* tag 'audit-pr-20170907' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: update the function comments
selinux: remove AVC init audit log message
audit: update the audit info in MAINTAINERS
audit: Reduce overhead using a coarse clock
Merge tag 'secureexec-v4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull secureexec update from Kees Cook:
"This series has the ultimate goal of providing a sane stack rlimit
when running set*id processes.
To do this, the bprm_secureexec LSM hook is collapsed into the
bprm_set_creds hook so the secureexec-ness of an exec can be
determined early enough to make decisions about rlimits and the
resulting memory layouts. Other logic acting on the secureexec-ness of
an exec is similarly consolidated. Capabilities needed some special
handling, but the refactoring removed other special handling, so that
was a wash"
* tag 'secureexec-v4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
exec: Consolidate pdeath_signal clearing
exec: Use sane stack rlimit under secureexec
exec: Consolidate dumpability logic
smack: Remove redundant pdeath_signal clearing
exec: Use secureexec for clearing pdeath_signal
exec: Use secureexec for setting dumpability
LSM: drop bprm_secureexec hook
commoncap: Move cap_elevated calculation into bprm_set_creds
commoncap: Refactor to remove bprm_secureexec hook
smack: Refactor to remove bprm_secureexec hook
selinux: Refactor to remove bprm_secureexec hook
apparmor: Refactor to remove bprm_secureexec hook
binfmt: Introduce secureexec flag
exec: Correct comments about "point of no return"
exec: Rename bprm->cred_prepared to called_set_creds
Merge tag 'gcc-plugins-v4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull gcc plugins update from Kees Cook:
"This finishes the porting work on randstruct, and introduces a new
option to structleak, both noted below:
- For the randstruct plugin, enable automatic randomization of
structures that are entirely function pointers (along with a couple
designated initializer fixes).
- For the structleak plugin, provide an option to perform zeroing
initialization of all otherwise uninitialized stack variables that
are passed by reference (Ard Biesheuvel)"
* tag 'gcc-plugins-v4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
gcc-plugins: structleak: add option to init all vars used as byref args
randstruct: Enable function pointer struct detection
drivers/net/wan/z85230.c: Use designated initializers
drm/amd/powerplay: rv: Use designated initializers
Merge tag 'pstore-v4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull pstore update from Kees Cook:
"Make pstore permissions more versatile by removing CAP_SYSLOG
requirement and defining more restrictive root directory DAC
permissions default (0750, which can be adjust after boot unlike the
CAP_SYSLOG check).
Suggested by Nick Kralevich"
* tag 'pstore-v4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
Revert "pstore: Honor dmesg_restrict sysctl on dmesg dumps"
pstore: Make default pstorefs root dir perms 0750
Merge tag '4.14-smb3-xattr-enable' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs update from Steve French:
"Enable xattr support for smb3 and also a bugfix"
* tag '4.14-smb3-xattr-enable' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Check for timeout on Negotiate stage
cifs: Add support for writing attributes on SMB2+
cifs: Add support for reading attributes on SMB2+
Merge branch 'quota_scaling' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull quota scaling updates from Jan Kara:
"This contains changes to make the quota subsystem more scalable.
Reportedly it improves number of files created per second on ext4
filesystem on fast storage by about a factor of 2x"
* 'quota_scaling' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (28 commits)
quota: Add lock annotations to struct members
quota: Reduce contention on dq_data_lock
fs: Provide __inode_get_bytes()
quota: Inline dquot_[re]claim_reserved_space() into callsite
quota: Inline inode_{incr,decr}_space() into callsites
quota: Inline functions into their callsites
ext4: Disable dirty list tracking of dquots when journalling quotas
quota: Allow disabling tracking of dirty dquots in a list
quota: Remove dq_wait_unused from dquot
quota: Move locking into clear_dquot_dirty()
quota: Do not dirty bad dquots
quota: Fix possible corruption of dqi_flags
quota: Propagate ->quota_read errors from v2_read_file_info()
quota: Fix error codes in v2_read_file_info()
quota: Push dqio_sem down to ->read_file_info()
quota: Push dqio_sem down to ->write_file_info()
quota: Push dqio_sem down to ->get_next_id()
quota: Push dqio_sem down to ->release_dqblk()
quota: Remove locking for writing to the old quota format
quota: Do not acquire dqio_sem for dquot overwrites in v2 format
...
Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull UDF, reiserfs, quota, fsnotify cleanups from Jan Kara:
"Several UDF, reiserfs, quota and fsnotify cleanups.
Note that there is also a patch updating MAINTAINERS entry for
notification subsystem to point to me as a maintainer since current
entries are stale"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fsnotify: make dnotify_fsnotify_ops const
isofs: Delete an unnecessary variable initialisation in isofs_read_inode()
isofs: Adjust four checks for null pointers
isofs: Delete an error message for a failed memory allocation in isofs_read_inode()
quota_v2: Delete an error message for a failed memory allocation in v2_read_file_info()
fs-udf: Delete an error message for a failed memory allocation in two functions
fs-udf: Improve six size determinations
fs-udf: Adjust two checks for null pointers
reiserfs: fix spelling mistake: "tranasction" -> "transaction"
MAINTAINERS: Update entries for notification subsystem
uapi/linux/quota.h: Do not include linux/errno.h
Merge tag 'devicetree-for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
Pull DeviceTree updates from Rob Herring:
"There's a few orphans in the conversion to %pOF printf specifiers
included here that no one else picked up.
Summary:
- Convert more DT code to use of_property_read_* API.
- Improve DT overlay support when adding multiple overlays
- Convert printk's to %pOF format specifiers. Most went via subsystem
trees, but picked up the remaining orphans
- Correct unittests to use preferred "okay" for "status" property
value
- Add a KASLR seed property
- Vendor prefixes for Mellanox, Theobroma System, Adaptrum, Moxa
- Fix modalias buffer handling
- Clean-up of include paths for building dtbs
- Add bindings for amc6821, isl1208, tsl2x7x, srf02, and srf10
devices
- Add nvmem bindings for MediaTek MT7623 and MT7622 SoC
- Add compatible string for Allwinner H5 Mali-450 GPU
- Fix links to old OpenFirmware docs with new mirror on
devicetree.org
- Remove status property from binding doc examples"
* tag 'devicetree-for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: (45 commits)
devicetree: Adjust status "ok" -> "okay" under drivers/of/
dt-bindings: Remove "status" from examples
dt-bindings: pinctrl: sh-pfc: Use generic node name
dt-bindings: Add vendor Mellanox
dt-binding: net/phy: fix interrupts description
virt: Convert to using %pOF instead of full_name
macintosh: Convert to using %pOF instead of full_name
ide: pmac: Convert to using %pOF instead of full_name
microblaze: Convert to using %pOF instead of full_name
dt-bindings: usb: musb: Grammar s/the/to/, s/is/are/
of: Use PLATFORM_DEVID_NONE definition
of/device: Fix of_device_get_modalias() buffer handling
of/device: Prevent buffer overflow in of_device_modalias()
dt-bindings: add amc6821, isl1208 trivial bindings
dt-bindings: add vendor prefix for Theobroma Systems
of: search scripts/dtc/include-prefixes path for both CPP and DTC
of: remove arch/$(SRCARCH)/boot/dts from include search path for CPP
of: remove drivers/of/testcase-data from include search path for CPP
of: return of_get_cpu_node from of_cpu_device_node_get if CPUs are not registered
iio: srf08: add device tree binding for srf02 and srf10
...
Merge tag 'drm-intel-next-fixes-2017-09-07' of git://anongit.freedesktop.org/git/drm-intel
Pull i916 drm fixes from Rodrigo Vivi:
"Since Dave is on paternity leave we are sending drm/i915 fixes for
v4.14-rc1 directly to you as he had asked us to do.
The most critical ones are the GPU reset fix for gen2-4 and GVT fix
for a regression that is blocking gvt init to work on your tree.
The rest is general fixes for patches coming from drm-next"
Acked-by: Dave Airlie <airlied@redhat.com>
* tag 'drm-intel-next-fixes-2017-09-07' of git://anongit.freedesktop.org/git/drm-intel:
drm/i915: Re-enable GTT following a device reset
drm/i915: Annotate user relocs with __user
drm/i915: Silence sparse by using gfp_t
drm/i915: Add __rcu to radix tree slot pointer
drm/i915: Fix the missing PPAT cache attributes on CNL
drm/i915/gvt: Remove one duplicated MMIO
drm/i915: Fix enum pipe vs. enum transcoder for the PCH transcoder
drm/i915: Make i2c lock ops static
drm/i915: Make i9xx_load_ycbcr_conversion_matrix() static
drm/i915/edp: Increase T12 panel delay to 900 ms to fix DP AUX CH timeouts
drm/i915: Ignore duplicate VMA stored within the per-object handle LUT
drm/i915: Skip fence alignemnt check for the CCS plane
drm/i915: Treat fb->offsets[] as a raw byte offset instead of a linear offset
drm/i915: Always wake the device to flush the GTT
drm/i915: Recreate vmapping even when the object is pinned
drm/i915: Quietly cancel FBC activation if CRTC is turned off before worker
Merge tag 'leds_for_4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds
Pull LED updates from Jacek Anaszewski:
"LED class drivers improvements:
leds-pca955x:
- add Device Tree support and bindings
- use devm_led_classdev_register()
- add GPIO support
- prevent crippled LED class device name
- check for I2C errors
leds-gpio:
- add optional retain-state-shutdown DT property
- allow LED to retain state at shutdown
Make several arrays static const in:
- leds-aat1290
- leds-lp5521
- leds-lp5562
- leds-lp8501"
* tag 'leds_for_4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds:
leds: pca955x: check for I2C errors
leds: gpio: Allow LED to retain state at shutdown
dt-bindings: leds: gpio: Add optional retain-state-shutdown property
leds: powernv: Delete an error message for a failed memory allocation in powernv_led_create()
leds: lp8501: make several arrays static const
leds: lp5562: make several arrays static const
leds: lp5521: make several arrays static const
leds: aat1290: make array max_mm_current_percent static const
leds: pca955x: Prevent crippled LED device name
leds: lm3533: constify attribute_group structure
dt-bindings: leds: add pca955x
leds: pca955x: add GPIO support
leds: pca955x: use devm_led_classdev_register
leds: pca955x: add device tree support
leds: Convert to using %pOF instead of full_name
leds: blinkm: constify attribute_group structures.
leds: tlc591xx: add missing of_node_put
leds: tlc591xx: merge conditional tests
Merge tag 'dmaengine-4.14-rc1' of git://git.infradead.org/users/vkoul/slave-dma
Pull dmaengine updates from Vinod Koul:
"This one features the usual updates to the drivers and one good part
of removing DA_SG from core as it has no users.
Summary:
- Remove DMA_SG support as we have no users for this feature
- New driver for Altera / Intel mSGDMA IP core
- Support for memset in dmatest and qcom_hidma driver
- Update for non cyclic mode in k3dma, bunch of update in bam_dma,
bcm sba-raid
- Constify device ids across drivers"
* tag 'dmaengine-4.14-rc1' of git://git.infradead.org/users/vkoul/slave-dma: (52 commits)
dmaengine: sun6i: support V3s SoC variant
dmaengine: sun6i: make gate bit in sun8i's DMA engines a common quirk
dmaengine: rcar-dmac: document R8A77970 bindings
dmaengine: xilinx_dma: Fix error code format specifier
dmaengine: altera: Use macros instead of structs to describe the registers
dmaengine: ti-dma-crossbar: Fix dra7 reserve function
dmaengine: pl330: constify amba_id
dmaengine: pl08x: constify amba_id
dmaengine: bcm-sba-raid: Remove redundant SBA_REQUEST_STATE_COMPLETED
dmaengine: bcm-sba-raid: Explicitly ACK mailbox message after sending
dmaengine: bcm-sba-raid: Add debugfs support
dmaengine: bcm-sba-raid: Remove redundant SBA_REQUEST_STATE_RECEIVED
dmaengine: bcm-sba-raid: Re-factor sba_process_deferred_requests()
dmaengine: bcm-sba-raid: Pre-ack async tx descriptor
dmaengine: bcm-sba-raid: Peek mbox when we have no free requests
dmaengine: bcm-sba-raid: Alloc resources before registering DMA device
dmaengine: bcm-sba-raid: Improve sba_issue_pending() run duration
dmaengine: bcm-sba-raid: Increase number of free sba_request
dmaengine: bcm-sba-raid: Allow arbitrary number free sba_request
dmaengine: bcm-sba-raid: Remove reqs_free_count from sba_device
...
* tag 'backlight-next-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight:
backlight: gpio_backlight: Delete pdata inversion
backlight: gpio_backlight: Convert to use GPIO descriptor
backlight: pwm_bl: Make of_device_ids const
backlight: lm3630a: Bump REG_MAX value to 0x50 instead of 0x1F
Merge tag 'mfd-next-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd
Pull MFD updates from Lee Jones:
"New Drivers
- RK805 Power Management IC (PMIC)
- ROHM BD9571MWV-M MFD Power Management IC (PMIC)
- Texas Instruments TPS68470 Power Management IC (PMIC) & LEDs
New Device Support:
- Add support for HiSilicon Hi6421v530 to hi6421-pmic-core
- Add support for X-Powers AXP806 to axp20x
- Add support for X-Powers AXP813 to axp20x
- Add support for Intel Sunrise Point LPSS to intel-lpss-pci
New Functionality:
- Amend API to provide register layout; atmel-smc
Bug Fixes:
- Fix counter duplication issue; da9052-core
- Fix potential NULL deference issue; max8998
- Leave SPI-NOR write-protection bit alone; lpc_ich
- Ensure device is put into reset during suspend; intel-lpss
- Correct register offset variable size; omap-usb-tll"
* tag 'mfd-next-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (61 commits)
mfd: intel_soc_pmic: Differentiate between Bay and Cherry Trail CRC variants
mfd: intel_soc_pmic: Export separate mfd-cell configs for BYT and CHT
dt-bindings: mfd: Add bindings for ZII RAVE devices
mfd: omap-usb-tll: Fix register offsets
mfd: da9052: Constify spi_device_id
mfd: intel-lpss: Put I2C and SPI controllers into reset state on suspend
mfd: da9055: Constify i2c_device_id
mfd: intel-lpss: Add missing PCI ID for Intel Sunrise Point LPSS devices
mfd: t7l66xb: Handle return value of clk_prepare_enable
mfd: Add ROHM BD9571MWV-M PMIC DT bindings
mfd: intel_soc_pmic_chtwc: Turn Kconfig option into a bool
mfd: lp87565: Convert to use devm_mfd_add_devices()
mfd: Add support for TPS68470 device
mfd: lpc_ich: Do not touch SPI-NOR write protection bit on Haswell/Broadwell
mfd: syscon: atmel-smc: Add helper to retrieve register layout
mfd: axp20x: Use correct platform device ID for many PEK
dt-bindings: mfd: axp20x: Introduce bindings for AXP813
mfd: axp20x: Add support for AXP813 PMIC
dt-bindings: mfd: axp20x: Add AXP806 to supported list of chips
mfd: Add ROHM BD9571MWV-M MFD PMIC driver
...
Merge tag 'mailbox-v4.14' of git://git.linaro.org/landing-teams/working/fujitsu/integration
Pull mailbox updates from Jassi Brar:
"Just behavorial changes to a controller driver: the Broadcom's Flexrm
mailbox driver has been modifified to support debugfs and TX-Done
mechanism by ACK.
Nothing for the core mailbox stack"
* tag 'mailbox-v4.14' of git://git.linaro.org/landing-teams/working/fujitsu/integration:
mailbox: bcm-flexrm-mailbox: Use txdone_ack instead of txdone_poll
mailbox: bcm-flexrm-mailbox: Use bitmap instead of IDA
mailbox: bcm-flexrm-mailbox: Fix mask used in CMPL_START_ADDR_VALUE()
mailbox: bcm-flexrm-mailbox: Add debugfs support
mailbox: bcm-flexrm-mailbox: Set IRQ affinity hint for FlexRM ring IRQs
Merge tag 'media/v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media updates from Mauro Carvalho Chehab:
"Brazil's Independence Day pull request :-)
This is one of the biggest media pull requests, with 625 patches
affecting almost all parts of media (RC, DVB, V4L2, CEC, docs).
This contains:
- A lot of new drivers:
* DVB frontends: mxl5xx, stv0910, stv6111;
* camera flash: as3645a led driver;
* HDMI receiver: adv748X;
* camera sensor: Omnivision 6650 5M driver (ov6650);
* HDMI CEC: ao-cec meson driver;
* V4L2: Qualcom camss driver;
* Remote controller: gpio-ir-tx, pwm-ir-tx and zx-irdec drivers.
- The DDbridge DVB driver got a massive update, with makes it in sync
with modern hardware from that vendor;
- There's an important milestone on this series: the DVB
documentation was written in 2003, but only started to be updated
in 2007. It also used to contain several gaps from the time it was
kept out of tree, mentioning error codes and device nodes that
never existed upstream. On this series, it received a massive
update: all non-deprecated digital TV APIs are now in sync with the
current implementation;
- Some DVB APIs that aren't used by any upstream driver got removed;
- Other parts of the media documentation algo got updated, fixing
some bugs on its PDF output and making it compatible with Sphinx
version 1.6.
As the number of hacks required to build PDF output reduced, I hope
we'll have less troubles as newer versions of our documentation
toolchain are released (famous last words);
- As usual, lots of driver cleanups and improvements"
* tag 'media/v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (624 commits)
media: leds: as3645a: add V4L2_FLASH_LED_CLASS dependency
media: get rid of removed DMX_GET_CAPS and DMX_SET_SOURCE leftovers
media: Revert "[media] v4l: async: make v4l2 coexist with devicetree nodes in a dt overlay"
media: staging: atomisp: sh_css_calloc shall return a pointer to the allocated space
media: Revert "[media] lirc_dev: remove superfluous get/put_device() calls"
media: add qcom_camss.rst to v4l-drivers rst file
media: dvb headers: make checkpatch happier
media: dvb uapi: move frontend legacy API to another part of the book
media: pixfmt-srggb12p.rst: better format the table for PDF output
media: docs-rst: media: Don't use \small for V4L2_PIX_FMT_SRGGB10 documentation
media: index.rst: don't write "Contents:" on PDF output
media: pixfmt*.rst: replace a two dots by a comma
media: vidioc-g-fmt.rst: adjust table format
media: vivid.rst: add a blank line to correct ReST format
media: v4l2 uapi book: get rid of driver programming's chapter
media: format.rst: use the right markup for important notes
media: docs-rst: cardlists: change their format to flat-tables
media: em28xx-cardlist.rst: update to reflect last changes
media: v4l2-event.rst: adjust table to fit on PDF output
media: docs: don't show ToC for each part on PDF output
...
Merge tag 'sound-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound updates from Takashi Iwai:
"We have touched quite a lot of files but with fewer changes at this
cycle; as you can see, most of changes are trivial fixes, especially
constification patches.
Among the massive attacks by constification gangs, we had a few core
changes (mostly for ASoC core), as well the fixes and the updates by
major vendors.
Some highlights:
ALSA core:
- Fix possible races in control API user-TLV codes
- Small cleanup of PCM core
ASoC:
- Continued work for componentization; still half-baked, but we're
certainly progressing
- Use of devres for jack detection GPIOs, rather as a cleanup
- Jack detection support for Qualcomm MSM8916
- Support for Allwinner H3, Cirrus Logic CS43130, Intel Kabylake
systems with RT5663, Realtek RT274, TI TLV320AIC32x6 and Wolfson
WM8523"
* tag 'sound-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (512 commits)
ALSA: hda/ca0132 - Fix memory leak at error path
ALSA: hda: Fix forget to free resource in error handling code path in hda_codec_driver_probe
ASoC: cs43130: Fix unused compiler warnings for PM runtime
ASoC: cs43130: Fix possible Oops with invalid dev_id
ASoC: cs43130: fix spelling mistake: "irq_occurrance" -> "irq_occurrence"
ALSA: atmel: Remove leftovers of AVR32 removal
ALSA: atmel: convert AC97c driver to GPIO descriptor API
ALSA: hda/realtek - Enable jack detection function for Intel ALC700
ALSA: hda: Fix regression of hdmi eld control created based on invalid pcm
ASoC: Intel: Skylake: Add IPC to configure the copier secondary pins
ASoC: add missing compile rule for max98371
ASoC: add missing compile rule for sirf-audio-codec
ASoC: add missing compile rule for max98371
ASoC: cs43130: Add devicetree bindings for CS43130
ASoC: cs43130: Add support for CS43130 codec
ASoC: make clock direction configurable in asoc-simple
ALSA: ctxfi: Remove null check before kfree
ASoC: max98927: Changed device property read function
ASoC: max98927: Modified DAPM widget and map to enable/disable VI sense path
ASoC: max98927: Added PM suspend and resume function
...
* tag 'md/4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
md/bitmap: disable bitmap_resize for file-backed bitmaps.
raid5-ppl: Recovery support for multiple partial parity logs
md: Runtime support for multiple ppls
md/raid0: attach correct cgroup info in bio
lib/raid6: align AVX512 constants to 512 bits, not bytes
raid5: remove raid5_build_block
md/r5cache: call mddev_lock/unlock() in r5c_journal_mode_show
md: replace seq_release_private with seq_release
md: notify about new spare disk in the container
md/raid1/10: reset bio allocated from mempool
md/raid5: release/flush io in raid5_do_work()
md/bitmap: copy correct data for bitmap super
Merge tag 'mmc-v4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc
Pull MMC updates from Ulf Hansson:
"MMC core:
- Continue to refactor the mmc block code to prepare for blkmq
- Move mmc block debugfs into block module
- Next step for eMMC CMDQ by adding a new mmc host interface for it
- Move Kconfig option MMC_DEBUG from core to host
- Some additional minor improvements
MMC host:
- Declare structs as const when applicable
- Explicitly request exclusive reset control when applicable
- Improve some error paths and other various cleanups
- sdhci: Preparations to support SDHCI OMAP
- sdhci: Improve some PM related code
- sdhci: Re-factoring and modernizations
- sdhci-xenon: Add runtime PM and system sleep support
- sdhci-xenon: Add support for eMMC HS400 Enhanced Strobe
- sdhci-cadence: Add system sleep support
- sdhci-of-at91: Improve system sleep support
- dw_mmc: Add support for Hisilicon hi3660
- sunxi: Add support for A83T eMMC
- sunxi: Add support for DDR52 mode
- meson-gx: Add support for UHS-I SD-cards
- meson-gx: Cleanups and improvements
- tmio: Fix CMD12 (STOP) handling
- tmio: Cleanups and improvements
- renesas_sdhi: Add r8a7743/5 support
- renesas-sdhi: Add support for R-Car Gen3 SDHI DMAC
- renesas_sdhi: Cleanups and improvements"
* tag 'mmc-v4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc: (145 commits)
mmc: renesas_sdhi: Add r8a7743/5 support
mmc: meson-gx: fix __ffsdi2 undefined on arm32
mmc: sdhci-xenon: add runtime pm support and reimplement standby
mmc: core: Move mmc_start_areq() declaration
mmc: mmci: stop building qcom dml as module
mmc: sunxi: Reset the device at probe time
clk: sunxi-ng: Provide a default reset hook
mmc: meson-gx: rework tuning function
mmc: meson-gx: change default tx phase
mmc: meson-gx: implement voltage switch callback
mmc: meson-gx: use CCF to handle the clock phases
mmc: meson-gx: implement card_busy callback
mmc: meson-gx: simplify interrupt handler
mmc: meson-gx: work around clk-stop issue
mmc: meson-gx: fix dual data rate mode frequencies
mmc: meson-gx: rework clock init function
mmc: meson-gx: rework clk_set function
mmc: meson-gx: rework set_ios function
mmc: meson-gx: cfg init overwrite values
mmc: meson-gx: initialize sane clk default before clock register
...
Merge branch 'for-4.14/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
"This is the first pull request for 4.14, containing most of the code
changes. It's a quiet series this round, which I think we needed after
the churn of the last few series. This contains:
- Fix for a registration race in loop, from Anton Volkov.
- Overflow complaint fix from Arnd for DAC960.
- Series of drbd changes from the usual suspects.
- Conversion of the stec/skd driver to blk-mq. From Bart.
- A few BFQ improvements/fixes from Paolo.
- CFQ improvement from Ritesh, allowing idling for group idle.
- A few fixes found by Dan's smatch, courtesy of Dan.
- A warning fixup for a race between changing the IO scheduler and
device remova. From David Jeffery.
- A few nbd fixes from Josef.
- Support for cgroup info in blktrace, from Shaohua.
- Also from Shaohua, new features in the null_blk driver to allow it
to actually hold data, among other things.
- Various corner cases and error handling fixes from Weiping Zhang.
- Improvements to the IO stats tracking for blk-mq from me. Can
drastically improve performance for fast devices and/or big
machines.
- Series from Christoph removing bi_bdev as being needed for IO
submission, in preparation for nvme multipathing code.
- Series from Bart, including various cleanups and fixes for switch
fall through case complaints"
* 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
kernfs: checking for IS_ERR() instead of NULL
drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
drbd: Fix allyesconfig build, fix recent commit
drbd: switch from kmalloc() to kmalloc_array()
drbd: abort drbd_start_resync if there is no connection
drbd: move global variables to drbd namespace and make some static
drbd: rename "usermode_helper" to "drbd_usermode_helper"
drbd: fix race between handshake and admin disconnect/down
drbd: fix potential deadlock when trying to detach during handshake
drbd: A single dot should be put into a sequence.
drbd: fix rmmod cleanup, remove _all_ debugfs entries
drbd: Use setup_timer() instead of init_timer() to simplify the code.
drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
drbd: new disk-option disable-write-same
drbd: Fix resource role for newly created resources in events2
drbd: mark symbols static where possible
drbd: Send P_NEG_ACK upon write error in protocol != C
drbd: add explicit plugging when submitting batches
drbd: change list_for_each_safe to while(list_first_entry_or_null)
drbd: introduce drbd_recv_header_maybe_unplug
...
* pci/enumeration:
PCI: Warn periodically while waiting for non-CRS ("device ready") status
PCI: Wait up to 60 seconds for device to become ready after FLR
PCI: Factor out pci_bus_wait_crs()
PCI: Add pci_bus_crs_vendor_id() to detect CRS response data
PCI: Always check for non-CRS response before timeout
PCI: Avoid race while enabling upstream bridges
PCI: Mark Broadcom HT2100 Root Port Extended Tags as broken
* pci/endpoint:
tools: PCI: Add a missing option help line
misc: pci_endpoint_test: Enable/Disable MSI using module param
misc: pci_endpoint_test: Avoid using hard-coded BAR sizes
misc: pci_endpoint_test: Add support to not enable MSI interrupts
misc: pci_endpoint_test: Add support to provide aligned buffer addresses
misc: pci_endpoint_test: Add support for PCI_ENDPOINT_TEST regs to be mapped to any BAR
PCI: designware-ep: Do not disable BARs during initialization
PCI: dra7xx: Reset all BARs during initialization
PCI: dwc: designware: Provide page_size to pci_epc_mem
PCI: endpoint: Remove the ->remove() callback
PCI: endpoint: Add support to poll early for host commands
PCI: endpoint: Add support to use _any_ BAR to map PCI_ENDPOINT_TEST regs
PCI: endpoint: Do not reset *command* inadvertently
PCI: endpoint: Add "volatile" to pci_epf_test_reg
PCI: endpoint: Add support for configurable page size
PCI: endpoint: Make ->remove() callback optional
PCI: endpoint: Add an API to get matching "pci_epf_device_id"
PCI: endpoint: Use of_dma_configure() to set initial DMA mask
* pci/host-vmd:
iommu/vt-d: Prevent VMD child devices from being remapping targets
x86/PCI: Use is_vmd() rather than relying on the domain number
x86/PCI: Move VMD quirk to x86 fixups
MAINTAINERS: Add Jonathan Derrick as VMD maintainer
PCI: vmd: Remove IRQ affinity so we can allocate more IRQs
PCI: vmd: Free up IRQs on suspend path
PCI: vmd: Assign vector zero to all bridges
PCI: vmd: Reserve IRQ pre-vector for better affinity
* pci/host-rockchip:
PCI: rockchip: Fix platform_get_irq() error handling
PCI: rockchip: Umap IO space if probe fails
PCI: rockchip: Remove IRQ domain if probe fails
PCI: rockchip: Disable vpcie0v9 if resume_noirq fails
PCI: rockchip: Clean up PHY if driver probe or resume fails
PCI: rockchip: Factor out rockchip_pcie_deinit_phys()
PCI: rockchip: Factor out rockchip_pcie_disable_clocks()
PCI: rockchip: Factor out rockchip_pcie_enable_clocks()
PCI: rockchip: Factor out rockchip_pcie_setup_irq()
PCI: rockchip: Use gpiod_set_value_cansleep() to allow reset via expanders
PCI: rockchip: Use PCI_NUM_INTX
PCI: rockchip: Explicitly request exclusive reset control
dt-bindings: phy-rockchip-pcie: Convert to per-lane PHY model
dt-bindings: PCI: rockchip: Convert to per-lane PHY model
arm64: dts: rockchip: convert PCIe to use per-lane PHYs for rk3339
PCI: rockchip: Idle inactive PHY(s)
phy: rockchip-pcie: Reconstruct driver to support per-lane PHYs
PCI: rockchip: Add per-lane PHY support
PCI: rockchip: Factor out rockchip_pcie_get_phys()
PCI: rockchip: Control optional 12v power supply
dt-bindings: PCI: rockchip: Add vpcie12v-supply for Rockchip PCIe controller
* pci/host-rcar:
PCI: rcar: Add device tree support for r8a7743/5
PCI: rcar: Fix memory leak when no PCIe card is inserted
PCI: rcar: Fix error exit path
* pci/host-qcom:
PCI: qcom: Add support for IPQ8074 PCIe controller
dt-bindings: PCI: qcom: Add support for IPQ8074
PCI: qcom: Use block IP version for operations
PCI: qcom: Explicitly request exclusive reset control
PCI: qcom: Use gpiod_set_value_cansleep() to allow reset via expanders
* pci/host-mediatek:
PCI: mediatek: Use PCI_NUM_INTX
PCI: mediatek: Add MSI support for MT2712 and MT7622
PCI: mediatek: Use bus->sysdata to get host private data
dt-bindings: PCI: Add support for MT2712 and MT7622
PCI: mediatek: Add controller support for MT2712 and MT7622
dt-bindings: PCI: Cleanup MediaTek binding text
dt-bindings: PCI: Rename MediaTek binding
PCI: mediatek: Switch to use platform_get_resource_byname()
PCI: mediatek: Add a structure to abstract the controller generations
PCI: mediatek: Rename port->index and mtk_pcie_parse_ports()
PCI: mediatek: Use readl_poll_timeout() to wait for Gen2 training
PCI: mediatek: Explicitly request exclusive reset control
* pci/host-layerscape:
PCI: layerscape: Add support for ls1088a
PCI: layerscape: Add support for ls2088a
PCI: artpec6: Stop enabling writes to DBI read-only registers
PCI: layerscape: Remove unnecessary class code fixup
PCI: dwc: Enable write permission for Class Code, Interrupt Pin updates
PCI: dwc: Add accessors for write permission of DBI read-only registers
PCI: layerscape: Disable outbound windows configured by bootloader
PCI: layerscape: Refactor ls1021_pcie_host_init()
PCI: layerscape: Move generic init functions earlier in file
PCI: layerscape: Add class code and multifunction fixups for ls1021a
PCI: layerscape: Move STRFMR1 access out from the DBI write-enable bracket
PCI: layerscape: Call dw_pcie_setup_rc() from ls_pcie_host_init()
* pci/host-designware:
PCI: dwc: Clear MSI interrupt status after it is handled, not before
PCI: qcom: Allow ->post_init() to fail
PCI: qcom: Don't unroll init if ->init() fails
PCI: dwc: designware: Handle ->host_init() failures
PCI: dwc: designware: Test PCIE_ATU_ENABLE bit specifically
PCI: dwc: designware: Make dw_pcie_prog_*_atu_unroll() static
Merge tag 'powerpc-4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Nothing really major this release, despite quite a lot of activity.
Just lots of things all over the place.
Some things of note include:
- Access via perf to a new type of PMU (IMC) on Power9, which can
count both core events as well as nest unit events (Memory
controller etc).
- Optimisations to the radix MMU TLB flushing, mostly to avoid
unnecessary Page Walk Cache (PWC) flushes when the structure of the
tree is not changing.
- Reworks/cleanups of do_page_fault() to modernise it and bring it
closer to other architectures where possible.
- Rework of our page table walking so that THP updates only need to
send IPIs to CPUs where the affected mm has run, rather than all
CPUs.
- The size of our vmalloc area is increased to 56T on 64-bit hash MMU
systems. This avoids problems with the percpu allocator on systems
with very sparse NUMA layouts.
- STRICT_KERNEL_RWX support on PPC32.
- A new sched domain topology for Power9, to capture the fact that
pairs of cores may share an L2 cache.
- Power9 support for VAS, which is a new mechanism for accessing
coprocessors, and initial support for using it with the NX
compression accelerator.
- Major work on the instruction emulation support, adding support for
many new instructions, and reworking it so it can be used to
implement the emulation needed to fixup alignment faults.
- Support for guests under PowerVM to use the Power9 XIVE interrupt
controller.
And probably that many things again that are almost as interesting,
but I had to keep the list short. Plus the usual fixes and cleanups as
always.
Thanks to: Alexey Kardashevskiy, Alistair Popple, Andreas Schwab,
Aneesh Kumar K.V, Anju T Sudhakar, Arvind Yadav, Balbir Singh,
Benjamin Herrenschmidt, Bhumika Goyal, Breno Leitao, Bryant G. Ly,
Christophe Leroy, Cédric Le Goater, Dan Carpenter, Dou Liyang,
Frederic Barrat, Gautham R. Shenoy, Geliang Tang, Geoff Levand, Hannes
Reinecke, Haren Myneni, Ivan Mikhaylov, John Allen, Julia Lawall,
LABBE Corentin, Laurentiu Tudor, Madhavan Srinivasan, Markus Elfring,
Masahiro Yamada, Matt Brown, Michael Neuling, Murilo Opsfelder Araujo,
Nathan Fontenot, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran,
Paul Mackerras, Rashmica Gupta, Rob Herring, Rui Teng, Sam Bobroff,
Santosh Sivaraj, Scott Wood, Shilpasri G Bhat, Sukadev Bhattiprolu,
Suraj Jitindar Singh, Tobin C. Harding, Victor Aoqui"
* tag 'powerpc-4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (321 commits)
powerpc/xive: Fix section __init warning
powerpc: Fix kernel crash in emulation of vector loads and stores
powerpc/xive: improve debugging macros
powerpc/xive: add XIVE Exploitation Mode to CAS
powerpc/xive: introduce H_INT_ESB hcall
powerpc/xive: add the HW IRQ number under xive_irq_data
powerpc/xive: introduce xive_esb_write()
powerpc/xive: rename xive_poke_esb() in xive_esb_read()
powerpc/xive: guest exploitation of the XIVE interrupt controller
powerpc/xive: introduce a common routine xive_queue_page_alloc()
powerpc/sstep: Avoid used uninitialized error
axonram: Return directly after a failed kzalloc() in axon_ram_probe()
axonram: Improve a size determination in axon_ram_probe()
axonram: Delete an error message for a failed memory allocation in axon_ram_probe()
powerpc/powernv/npu: Move tlb flush before launching ATSD
powerpc/macintosh: constify wf_sensor_ops structures
powerpc/iommu: Use permission-specific DEVICE_ATTR variants
powerpc/eeh: Delete an error out of memory message at init time
powerpc/mm: Use seq_putc() in two functions
macintosh: Convert to using %pOF instead of full_name
...
* 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi/bgrt: Use efi_mem_type()
efi: Move efi_mem_type() to common code
efi/reboot: Make function pointer orig_pm_power_off static
efi/random: Increase size of firmware supplied randomness
efi/libstub: Enable reset attack mitigation
firmware/efi/esrt: Constify attribute_group structures
firmware/efi: Constify attribute_group structures
firmware/dcdbas: Constify attribute_group structures
arm/efi: Split zImage code and data into separate PE/COFF sections
arm/efi: Replace open coded constants with symbolic ones
arm/efi: Remove pointless dummy .reloc section
arm/efi: Remove forbidden values from the PE/COFF header
drivers/fbdev/efifb: Allow BAR to be moved instead of claiming it
efi/reboot: Fall back to original power-off method if EFI_RESET_SHUTDOWN returns
efi/arm/arm64: Add missing assignment of efi.config_table
efi/libstub/arm64: Set -fpie when building the EFI stub
efi/libstub/arm64: Force 'hidden' visibility for section markers
efi/libstub/arm64: Use hidden attribute for struct screen_info reference
efi/arm: Don't mark ACPI reclaim memory as MEMBLOCK_NOMAP
Merge branch 'kvm-ppc-next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
KVM/PPC update for 4.14
There are various minor fixes and cleanups. The only new feature is
that we now export information about storage key support to userspace,
so it can advertise it to the guest.
I have pulled in Michael Ellerman's topic/ppc-kvm branch from the
powerpc tree to get a couple of fixes that touch both KVM PPC code and
other PPC code. That's why there is some arch/powerpc stuff in the
diffstat that isn't arch/powerpc/kvm.
This patch uses the original value of nr_events (from userspace) to
increment aio-nr and count against aio-max-nr, which resolves those.
Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com> Reported-by: Lekshmi C. Pillai <lekshmi.cpillai@in.ibm.com> Tested-by: Lekshmi C. Pillai <lekshmi.cpillai@in.ibm.com> Tested-by: Paul Nguyen <nguyenp@us.ibm.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Merge branch 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 platform updates from Ingo Molnar:
"The main changes include various Hyper-V optimizations such as faster
hypercalls and faster/better TLB flushes - and there's also some
Intel-MID cleanups"
* 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tracing/hyper-v: Trace hyperv_mmu_flush_tlb_others()
x86/hyper-v: Support extended CPU ranges for TLB flush hypercalls
x86/platform/intel-mid: Make several arrays static, to make code smaller
MAINTAINERS: Add missed file for Hyper-V
x86/hyper-v: Use hypercall for remote TLB flush
hyper-v: Globalize vp_index
x86/hyper-v: Implement rep hypercalls
hyper-v: Use fast hypercall for HVCALL_SIGNAL_EVENT
x86/hyper-v: Introduce fast hypercall implementation
x86/hyper-v: Make hv_do_hypercall() inline
x86/hyper-v: Include hyperv/ only when CONFIG_HYPERV is set
x86/platform/intel-mid: Make 'bt_sfi_data' const
x86/platform/intel-mid: Make IRQ allocation a bit more flexible
x86/platform/intel-mid: Group timers callbacks together
Merge tag 'kvm-arm-for-v4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm
KVM/ARM Changes for v4.14
Two minor cleanups and improvements, a fix for decoding external abort
types from guests, and added support for migrating the active priority
of interrupts when running a GICv2 guest on a GICv3 host.
Merge tag 'kvm-s390-next-4.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux
KVM: s390: Fixes and features for 4.14
- merge of topic branch tlb-flushing from the s390 tree to get the
no-dat base features
- merge of kvm/master to avoid conflicts with additional sthyi fixes
- wire up the no-dat enhancements in KVM
- multiple epoch facility (z14 feature)
- Configuration z/Architecture Mode
- more sthyi fixes
- gdb server range checking fix
- small code cleanups
PCI: xgene: Define XGENE_PCI_EXP_CAP and use generic PCI_EXP_RTCTL offset
Apparently the PCIe capability is at address 0x40 in config space of X-Gene
v1 Root Ports. Add a definition of that and use the generic PCI_EXP_RTCTL
offset into the capability. No functional change intended.
Merge branch 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata
Pull libata updates from Tejun Heo:
"Except for the ahci fix that fixes a boot issue, nothing major in this
pull request. Some new platform controller support and device specific
changes"
* 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
libata: zpodd: make arrays cdb static, reduces object code size
ahci: don't use MSI for devices with the silly Intel NVMe remapping scheme
dt-bindings: ata: add DT bindings for MediaTek SATA controller
ata: mediatek: add support for MediaTek SATA controller
pata_octeon_cf: use of_property_read_{bool|u32}()
cs5536: add support for IDE controller variant
ata: sata_gemini: Introduce explicit IDE pin control
ata: sata_gemini: Retire custom pin control
ata: ahci_platform: Add shutdown handler
ata: sata_gemini: explicitly request exclusive reset control
ata: Drop unnecessary static
ata: Convert to using %pOF instead of full_name
Merge branch 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
"Several notable changes this cycle:
- Thread mode was merged. This will be used for cgroup2 support for
CPU and possibly other controllers. Unfortunately, CPU controller
cgroup2 support didn't make this pull request but most contentions
have been resolved and the support is likely to be merged before
the next merge window.
- cgroup.stat now shows the number of descendant cgroups.
- cpuset now can enable the easier-to-configure v2 behavior on v1
hierarchy"
* 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
cpuset: Allow v2 behavior in v1 cgroup
cgroup: Add mount flag to enable cpuset to use v2 behavior in v1 cgroup
cgroup: remove unneeded checks
cgroup: misc changes
cgroup: short-circuit cset_cgroup_from_root() on the default hierarchy
cgroup: re-use the parent pointer in cgroup_destroy_locked()
cgroup: add cgroup.stat interface with basic hierarchy stats
cgroup: implement hierarchy limits
cgroup: keep track of number of descent cgroups
cgroup: add comment to cgroup_enable_threaded()
cgroup: remove unnecessary empty check when enabling threaded mode
cgroup: update debug controller to print out thread mode information
cgroup: implement cgroup v2 thread support
cgroup: implement CSS_TASK_ITER_THREADED
cgroup: introduce cgroup->dom_cgrp and threaded css_set handling
cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS
cgroup: reorganize cgroup.procs / task write path
cgroup: replace css_set walking populated test with testing cgrp->nr_populated_csets
cgroup: distinguish local and children populated states
cgroup: remove now unused list_head @pending in cgroup_apply_cftypes()
...
Merge branch 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
"Nothing major. I introduced a flag collsion bug during v4.13 cycle
which is fixed in this pull request. Fortunately, the flag is for
debugging / verification and the bug isn't critical"
* 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Fix flag collision
workqueue: Use TASK_IDLE
workqueue: fix path to documentation
workqueue: doc change for ST behavior on NUMA systems
Merge branch 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull percpu updates from Tejun Heo:
"A lot of changes for percpu this time around. percpu inherited the
same area allocator from the original pre-virtual-address-mapped
implementation. This was from the time when percpu allocator wasn't
used all that much and the implementation was focused on simplicity,
with the unfortunate computational complexity of O(number of areas
allocated from the chunk) per alloc / free.
With the increase in percpu usage, we're hitting cases where the lack
of scalability is hurting. The most prominent one right now is bpf
perpcu map creation / destruction which may allocate and free a lot of
entries consecutively and it's likely that the problem will become
more prominent in the future.
To address the issue, Dennis replaced the area allocator with hinted
bitmap allocator which is more consistent. While the new allocator
does perform a bit worse in some cases, it outperforms the old
allocator way more than an order of magnitude in other more common
scenarios while staying mostly flat in CPU overhead and completely
flat in memory consumption"
* 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (27 commits)
percpu: update header to contain bitmap allocator explanation.
percpu: update pcpu_find_block_fit to use an iterator
percpu: use metadata blocks to update the chunk contig hint
percpu: update free path to take advantage of contig hints
percpu: update alloc path to only scan if contig hints are broken
percpu: keep track of the best offset for contig hints
percpu: skip chunks if the alloc does not fit in the contig hint
percpu: add first_bit to keep track of the first free in the bitmap
percpu: introduce bitmap metadata blocks
percpu: replace area map allocator with bitmap
percpu: generalize bitmap (un)populated iterators
percpu: increase minimum percpu allocation size and align first regions
percpu: introduce nr_empty_pop_pages to help empty page accounting
percpu: change the number of pages marked in the first_chunk pop bitmap
percpu: combine percpu address checks
percpu: modify base_addr to be region specific
percpu: setup_first_chunk rename schunk/dchunk to chunk
percpu: end chunk area maps page aligned for the populated bitmap
percpu: unify allocation of schunk and dchunk
percpu: setup_first_chunk remove dyn_size and consolidate logic
...
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (119 commits)
mm,fork: introduce MADV_WIPEONFORK
x86,mpx: make mpx depend on x86-64 to free up VMA flag
mm: add /proc/pid/smaps_rollup
mm: hugetlb: clear target sub-page last when clearing huge page
mm: oom: let oom_reap_task and exit_mmap run concurrently
swap: choose swap device according to numa node
mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
mm, oom: do not rely on TIF_MEMDIE for memory reserves access
z3fold: use per-cpu unbuddied lists
mm, swap: don't use VMA based swap readahead if HDD is used as swap
mm, swap: add sysfs interface for VMA based swap readahead
mm, swap: VMA based swap readahead
mm, swap: fix swap readahead marking
mm, swap: add swap readahead hit statistics
mm/vmalloc.c: don't reinvent the wheel but use existing llist API
mm/vmstat.c: fix wrong comment
selftests/memfd: add memfd_create hugetlbfs selftest
mm/shmem: add hugetlbfs support to memfd_create()
mm, devm_memremap_pages: use multi-order radix for ZONE_DEVICE lookups
mm/vmalloc.c: halve the number of comparisons performed in pcpu_get_vm_areas()
...
Andy Lutomirski [Thu, 7 Sep 2017 02:54:54 +0000 (19:54 -0700)]
x86/mm: Document how CR4.PCIDE restore works
While debugging a problem, I thought that using
cr4_set_bits_and_update_boot() to restore CR4.PCIDE would be
helpful. It turns out to be counterproductive.
Add a comment documenting how this works.
Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andy Lutomirski [Thu, 7 Sep 2017 02:54:53 +0000 (19:54 -0700)]
x86/mm: Reinitialize TLB state on hotplug and resume
When Linux brings a CPU down and back up, it switches to init_mm and then
loads swapper_pg_dir into CR3. With PCID enabled, this has the side effect
of masking off the ASID bits in CR3.
This can result in some confusion in the TLB handling code. If we
bring a CPU down and back up with any ASID other than 0, we end up
with the wrong ASID active on the CPU after resume. This could
cause our internal state to become corrupt, although major
corruption is unlikely because init_mm doesn't have any user pages.
More obviously, if CONFIG_DEBUG_VM=y, we'll trip over an assertion
in the next context switch. The result of *that* is a failure to
resume from suspend with probability 1 - 1/6^(cpus-1).
Fix it by reinitializing cpu_tlbstate on resume and CPU bringup.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Reported-by: Jiri Kosina <jikos@kernel.org> Fixes: 10af6235e0d3 ("x86/mm: Implement PCID based optimization: try to preserve old TLB entries using PCID") Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Baohong Liu [Tue, 5 Sep 2017 21:57:19 +0000 (16:57 -0500)]
tracing: Apply trace_clock changes to instance max buffer
Currently trace_clock timestamps are applied to both regular and max
buffers only for global trace. For instance trace, trace_clock
timestamps are applied only to regular buffer. But, regular and max
buffers can be swapped, for example, following a snapshot. So, for
instance trace, bad timestamps can be seen following a snapshot.
Let's apply trace_clock timestamps to instance max buffer as well.
Introduce MADV_WIPEONFORK semantics, which result in a VMA being empty
in the child process after fork. This differs from MADV_DONTFORK in one
important way.
If a child process accesses memory that was MADV_WIPEONFORK, it will get
zeroes. The address ranges are still valid, they are just empty.
If a child process accesses memory that was MADV_DONTFORK, it will get a
segmentation fault, since those address ranges are no longer valid in
the child after fork.
Since MADV_DONTFORK also seems to be used to allow very large programs
to fork in systems with strict memory overcommit restrictions, changing
the semantics of MADV_DONTFORK might break existing programs.
MADV_WIPEONFORK only works on private, anonymous VMAs.
The use case is libraries that store or cache information, and want to
know that they need to regenerate it in the child process after fork.
Examples of this would be:
- systemd/pulseaudio API checks (fail after fork) (replacing a getpid
check, which is too slow without a PID cache)
- PKCS#11 API reinitialization check (mandated by specification)
- glibc's upcoming PRNG (reseed after fork)
- OpenSSL PRNG (reseed after fork)
The security benefits of a forking server having a re-inialized PRNG in
every child process are pretty obvious. However, due to libraries
having all kinds of internal state, and programs getting compiled with
many different versions of each library, it is unreasonable to expect
calling programs to re-initialize everything manually after fork.
A further complication is the proliferation of clone flags, programs
bypassing glibc's functions to call clone directly, and programs calling
unshare, causing the glibc pthread_atfork hook to not get called.
It would be better to have the kernel take care of this automatically.
The patch also adds MADV_KEEPONFORK, to undo the effects of a prior
MADV_WIPEONFORK.
This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:
https://man.openbsd.org/minherit.2
[akpm@linux-foundation.org: numerically order arch/parisc/include/uapi/asm/mman.h #defines] Link: http://lkml.kernel.org/r/20170811212829.29186-3-riel@redhat.com Signed-off-by: Rik van Riel <riel@redhat.com> Reported-by: Florian Weimer <fweimer@redhat.com> Reported-by: Colm MacCártaigh <colm@allcosts.net> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: Kees Cook <keescook@chromium.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Drewry <wad@chromium.org> Cc: <linux-api@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
x86,mpx: make mpx depend on x86-64 to free up VMA flag
Patch series "mm,fork,security: introduce MADV_WIPEONFORK", v4.
If a child process accesses memory that was MADV_WIPEONFORK, it will get
zeroes. The address ranges are still valid, they are just empty.
If a child process accesses memory that was MADV_DONTFORK, it will get a
segmentation fault, since those address ranges are no longer valid in
the child after fork.
Since MADV_DONTFORK also seems to be used to allow very large programs
to fork in systems with strict memory overcommit restrictions, changing
the semantics of MADV_DONTFORK might break existing programs.
The use case is libraries that store or cache information, and want to
know that they need to regenerate it in the child process after fork.
Examples of this would be:
- systemd/pulseaudio API checks (fail after fork) (replacing a getpid
check, which is too slow without a PID cache)
- PKCS#11 API reinitialization check (mandated by specification)
- glibc's upcoming PRNG (reseed after fork)
- OpenSSL PRNG (reseed after fork)
The security benefits of a forking server having a re-inialized PRNG in
every child process are pretty obvious. However, due to libraries
having all kinds of internal state, and programs getting compiled with
many different versions of each library, it is unreasonable to expect
calling programs to re-initialize everything manually after fork.
A further complication is the proliferation of clone flags, programs
bypassing glibc's functions to call clone directly, and programs calling
unshare, causing the glibc pthread_atfork hook to not get called.
It would be better to have the kernel take care of this automatically.
The patchset also adds MADV_KEEPONFORK, to undo the effects of a prior
MADV_WIPEONFORK.
This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:
https://man.openbsd.org/minherit.2
This patch (of 2):
MPX only seems to be available on 64 bit CPUs, starting with Skylake and
Goldmont. Move VM_MPX into the 64 bit only portion of vma->vm_flags, in
order to free up a VMA flag.
Link: http://lkml.kernel.org/r/20170811212829.29186-2-riel@redhat.com Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Will Drewry <wad@chromium.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Colm MacCártaigh <colm@allcosts.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/proc/pid/smaps_rollup is a new proc file that improves the performance
of user programs that determine aggregate memory statistics (e.g., total
PSS) of a process.
Android regularly "samples" the memory usage of various processes in
order to balance its memory pool sizes. This sampling process involves
opening /proc/pid/smaps and summing certain fields. For very large
processes, sampling memory use this way can take several hundred
milliseconds, due mostly to the overhead of the seq_printf calls in
task_mmu.c.
smaps_rollup improves the situation. It contains most of the fields of
/proc/pid/smaps, but instead of a set of fields for each VMA,
smaps_rollup instead contains one synthetic smaps-format entry
representing the whole process. In the single smaps_rollup synthetic
entry, each field is the summation of the corresponding field in all of
the real-smaps VMAs. Using a common format for smaps_rollup and smaps
allows userspace parsers to repurpose parsers meant for use with
non-rollup smaps for smaps_rollup, and it allows userspace to switch
between smaps_rollup and smaps at runtime (say, based on the
availability of smaps_rollup in a given kernel) with minimal fuss.
By using smaps_rollup instead of smaps, a caller can avoid the
significant overhead of formatting, reading, and parsing each of a large
process's potentially very numerous memory mappings. For sampling
system_server's PSS in Android, we measured a 12x speedup, representing
a savings of several hundred milliseconds.
One alternative to a new per-process proc file would have been including
PSS information in /proc/pid/status. We considered this option but
thought that PSS would be too expensive (by a few orders of magnitude)
to collect relative to what's already emitted as part of
/proc/pid/status, and slowing every user of /proc/pid/status for the
sake of readers that happen to want PSS feels wrong.
The code itself works by reusing the existing VMA-walking framework we
use for regular smaps generation and keeping the mem_size_stats
structure around between VMA walks instead of using a fresh one for each
VMA. In this way, summation happens automatically. We let seq_file
walk over the VMAs just as it does for regular smaps and just emit
nothing to the seq_file until we hit the last VMA.
Benchmarks:
using smaps:
iterations:1000 pid:1163 pss:220023808
0m29.46s real 0m08.28s user 0m20.98s system
using smaps_rollup:
iterations:1000 pid:1163 pss:220702720
0m04.39s real 0m00.03s user 0m04.31s system
We're using the PSS samples we collect asynchronously for
system-management tasks like fine-tuning oom_adj_score, memory use
tracking for debugging, application-level memory-use attribution, and
deciding whether we want to kill large processes during system idle
maintenance windows. Android has been using PSS for these purposes for
a long time; as the average process VMA count has increased and and
devices become more efficiency-conscious, PSS-collection inefficiency
has started to matter more. IMHO, it'd be a lot safer to optimize the
existing PSS-collection model, which has been fine-tuned over the years,
instead of changing the memory tracking approach entirely to work around
smaps-generation inefficiency.
Tim said:
: There are two main reasons why Android gathers PSS information:
:
: 1. Android devices can show the user the amount of memory used per
: application via the settings app. This is a less important use case.
:
: 2. We log PSS to help identify leaks in applications. We have found
: an enormous number of bugs (in the Android platform, in Google's own
: apps, and in third-party applications) using this data.
:
: To do this, system_server (the main process in Android userspace) will
: sample the PSS of a process three seconds after it changes state (for
: example, app is launched and becomes the foreground application) and about
: every ten minutes after that. The net result is that PSS collection is
: regularly running on at least one process in the system (usually a few
: times a minute while the screen is on, less when screen is off due to
: suspend). PSS of a process is an incredibly useful stat to track, and we
: aren't going to get rid of it. We've looked at some very hacky approaches
: using RSS ("take the RSS of the target process, subtract the RSS of the
: zygote process that is the parent of all Android apps") to reduce the
: accounting time, but it regularly overestimated the memory used by 20+
: percent. Accordingly, I don't think that there's a good alternative to
: using PSS.
:
: We started looking into PSS collection performance after we noticed random
: frequency spikes while a phone's screen was off; occasionally, one of the
: CPU clusters would ramp to a high frequency because there was 200-300ms of
: constant CPU work from a single thread in the main Android userspace
: process. The work causing the spike (which is reasonable governor
: behavior given the amount of CPU time needed) was always PSS collection.
: As a result, Android is burning more power than we should be on PSS
: collection.
:
: The other issue (and why I'm less sure about improving smaps as a
: long-term solution) is that the number of VMAs per process has increased
: significantly from release to release. After trying to figure out why we
: were seeing these 200-300ms PSS collection times on Android O but had not
: noticed it in previous versions, we found that the number of VMAs in the
: main system process increased by 50% from Android N to Android O (from
: ~1800 to ~2700) and varying increases in every userspace process. Android
: M to N also had an increase in the number of VMAs, although not as much.
: I'm not sure why this is increasing so much over time, but thinking about
: ASLR and ways to make ASLR better, I expect that this will continue to
: increase going forward. I would not be surprised if we hit 5000 VMAs on
: the main Android process (system_server) by 2020.
:
: If we assume that the number of VMAs is going to increase over time, then
: doing anything we can do to reduce the overhead of each VMA during PSS
: collection seems like the right way to go, and that means outputting an
: aggregate statistic (to avoid whatever overhead there is per line in
: writing smaps and in reading each line from userspace).
Link: http://lkml.kernel.org/r/20170812022148.178293-1-dancol@google.com Signed-off-by: Daniel Colascione <dancol@google.com> Cc: Tim Murray <timmurray@google.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sonny Rao <sonnyrao@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: hugetlb: clear target sub-page last when clearing huge page
Huge page helps to reduce TLB miss rate, but it has higher cache
footprint, sometimes this may cause some issue. For example, when
clearing huge page on x86_64 platform, the cache footprint is 2M. But
on a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only 45M
LLC (last level cache). That is, in average, there are 2.5M LLC for
each core and 1.25M LLC for each thread.
If the cache pressure is heavy when clearing the huge page, and we clear
the huge page from the begin to the end, it is possible that the begin
of huge page is evicted from the cache after we finishing clearing the
end of the huge page. And it is possible for the application to access
the begin of the huge page after clearing the huge page.
To help the above situation, in this patch, when we clear a huge page,
the order to clear sub-pages is changed. In quite some situation, we
can get the address that the application will access after we clear the
huge page, for example, in a page fault handler. Instead of clearing
the huge page from begin to end, we will clear the sub-pages farthest
from the the sub-page to access firstly, and clear the sub-page to
access last. This will make the sub-page to access most cache-hot and
sub-pages around it more cache-hot too. If we cannot know the address
the application will access, the begin of the huge page is assumed to be
the the address the application will access.
With this patch, the throughput increases ~28.3% in vm-scalability
anon-w-seq test case with 72 processes on a 2 socket Xeon E5 v3 2699
system (36 cores, 72 threads). The test case creates 72 processes, each
process mmap a big anonymous memory area and writes to it from the begin
to the end. For each process, other processes could be seen as other
workload which generates heavy cache pressure. At the same time, the
cache miss rate reduced from ~33.4% to ~31.7%, the IPC (instruction per
cycle) increased from 0.56 to 0.74, and the time spent in user space is
reduced ~7.9%
Christopher Lameter suggests to clear bytes inside a sub-page from end
to begin too. But tests show no visible performance difference in the
tests. May because the size of page is small compared with the cache
size.
Thanks Andi Kleen to propose to use address to access to determine the
order of sub-pages to clear.
The hugetlbfs access address could be improved, will do that in another
patch.
[ying.huang@intel.com: improve readability of clear_huge_page()] Link: http://lkml.kernel.org/r/20170830051842.1397-1-ying.huang@intel.com Link: http://lkml.kernel.org/r/20170815014618.15842-1-ying.huang@intel.com Suggested-by: Andi Kleen <andi.kleen@intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Jan Kara <jack@suse.cz> Reviewed-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Nadia Yvette Chambers <nyc@holomorphy.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shaohua Li <shli@fb.com> Cc: Christopher Lameter <cl@linux.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: oom: let oom_reap_task and exit_mmap run concurrently
This is purely required because exit_aio() may block and exit_mmap() may
never start, if the oom_reap_task cannot start running on a mm with
mm_users == 0.
At the same time if the OOM reaper doesn't wait at all for the memory of
the current OOM candidate to be freed by exit_mmap->unmap_vmas, it would
generate a spurious OOM kill.
If it wasn't because of the exit_aio or similar blocking functions in
the last mmput, it would be enough to change the oom_reap_task() in the
case it finds mm_users == 0, to wait for a timeout or to wait for
__mmput to set MMF_OOM_SKIP itself, but it's not just exit_mmap the
problem here so the concurrency of exit_mmap and oom_reap_task is
apparently warranted.
It's a non standard runtime, exit_mmap() runs without mmap_sem, and
oom_reap_task runs with the mmap_sem for reading as usual (kind of
MADV_DONTNEED).
The race between the two is solved with a combination of
tsk_is_oom_victim() (serialized by task_lock) and MMF_OOM_SKIP
(serialized by a dummy down_write/up_write cycle on the same lines of
the ksm_exit method).
If the oom_reap_task() may be running concurrently during exit_mmap,
exit_mmap will wait it to finish in down_write (before taking down mm
structures that would make the oom_reap_task fail with use after free).
If exit_mmap comes first, oom_reap_task() will skip the mm if
MMF_OOM_SKIP is already set and in turn all memory is already freed and
furthermore the mm data structures may already have been taken down by
free_pgtables.
Aaron Lu [Wed, 6 Sep 2017 23:24:57 +0000 (16:24 -0700)]
swap: choose swap device according to numa node
If the system has more than one swap device and swap device has the node
information, we can make use of this information to decide which swap
device to use in get_swap_pages() to get better performance.
The current code uses a priority based list, swap_avail_list, to decide
which swap device to use and if multiple swap devices share the same
priority, they are used round robin. This patch changes the previous
single global swap_avail_list into a per-numa-node list, i.e. for each
numa node, it sees its own priority based list of available swap
devices. Swap device's priority can be promoted on its matching node's
swap_avail_list.
The current swap device's priority is set as: user can set a >=0 value,
or the system will pick one starting from -1 then downwards. The
priority value in the swap_avail_list is the negated value of the swap
device's due to plist being sorted from low to high. The new policy
doesn't change the semantics for priority >=0 cases, the previous
starting from -1 then downwards now becomes starting from -2 then
downwards and -1 is reserved as the promoted value.
Take 4-node EX machine as an example, suppose 4 swap devices are
available, each sit on a different node:
swapA on node 0
swapB on node 1
swapC on node 2
swapD on node 3
After they are all swapped on in the sequence of ABCD.
Current behaviour:
their priorities will be:
swapA: -1
swapB: -2
swapC: -3
swapD: -4
And their position in the global swap_avail_list will be:
swapA -> swapB -> swapC -> swapD
prio:1 prio:2 prio:3 prio:4
New behaviour:
their priorities will be(note that -1 is skipped):
swapA: -2
swapB: -3
swapC: -4
swapD: -5
And their positions in the 4 swap_avail_lists[nid] will be:
swap_avail_lists[0]: /* node 0's available swap device list */
swapA -> swapB -> swapC -> swapD
prio:1 prio:3 prio:4 prio:5
swap_avali_lists[1]: /* node 1's available swap device list */
swapB -> swapA -> swapC -> swapD
prio:1 prio:2 prio:4 prio:5
swap_avail_lists[2]: /* node 2's available swap device list */
swapC -> swapA -> swapB -> swapD
prio:1 prio:2 prio:3 prio:5
swap_avail_lists[3]: /* node 3's available swap device list */
swapD -> swapA -> swapB -> swapC
prio:1 prio:2 prio:3 prio:4
To see the effect of the patch, a test that starts N process, each mmap
a region of anonymous memory and then continually write to it at random
position to trigger both swap in and out is used.
On a 2 node Skylake EP machine with 64GiB memory, two 170GB SSD drives
are used as swap devices with each attached to a different node, the
result is:
runtime=30m/processes=32/total test size=128G/each process mmap region=4G
kernel throughput
vanilla 13306
auto-binding 15169 +14%
runtime=30m/processes=64/total test size=128G/each process mmap region=2G
kernel throughput
vanilla 11885
auto-binding 14879 +25%
Michal Hocko [Wed, 6 Sep 2017 23:24:53 +0000 (16:24 -0700)]
mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
TIF_MEMDIE is set only to the tasks whick were either directly selected
by the OOM killer or passed through mark_oom_victim from the allocator
path. tsk_is_oom_victim is more generic and allows to identify all
tasks (threads) which share the mm with the oom victim.
Please note that the freezer still needs to check TIF_MEMDIE because we
cannot thaw tasks which do not participage in oom_victims counting
otherwise a !TIF_MEMDIE task could interfere after oom_disbale returns.
Link: http://lkml.kernel.org/r/20170810075019.28998-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Wed, 6 Sep 2017 23:24:50 +0000 (16:24 -0700)]
mm, oom: do not rely on TIF_MEMDIE for memory reserves access
For ages we have been relying on TIF_MEMDIE thread flag to mark OOM
victims and then, among other things, to give these threads full access
to memory reserves. There are few shortcomings of this implementation,
though.
First of all and the most serious one is that the full access to memory
reserves is quite dangerous because we leave no safety room for the
system to operate and potentially do last emergency steps to move on.
Secondly this flag is per task_struct while the OOM killer operates on
mm_struct granularity so all processes sharing the given mm are killed.
Giving the full access to all these task_structs could lead to a quick
memory reserves depletion. We have tried to reduce this risk by giving
TIF_MEMDIE only to the main thread and the currently allocating task but
that doesn't really solve this problem while it surely opens up a room
for corner cases - e.g. GFP_NO{FS,IO} requests might loop inside the
allocator without access to memory reserves because a particular thread
was not the group leader.
Now that we have the oom reaper and that all oom victims are reapable
after 1b51e65eab64 ("oom, oom_reaper: allow to reap mm shared by the
kthreads") we can be more conservative and grant only partial access to
memory reserves because there are reasonable chances of the parallel
memory freeing. We still want some access to reserves because we do not
want other consumers to eat up the victim's freed memory. oom victims
will still contend with __GFP_HIGH users but those shouldn't be so
aggressive to starve oom victims completely.
Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to
the half of the reserves. This makes the access to reserves independent
on which task has passed through mark_oom_victim. Also drop any usage
of TIF_MEMDIE from the page allocator proper and replace it by
tsk_is_oom_victim as well which will make page_alloc.c completely
TIF_MEMDIE free finally.
CONFIG_MMU=n doesn't have oom reaper so let's stick to the original
ALLOC_NO_WATERMARKS approach.
There is a demand to make the oom killer memcg aware which will imply
many tasks killed at once. This change will allow such a usecase
without worrying about complete memory reserves depletion.
Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's been noted that z3fold doesn't scale well when it's run in a large
number of threads on many cores, which can be easily reproduced with fio
'randrw' test with --numjobs=32. E.g. the result for 1 cluster (4 cores)
is:
Run status group 0 (all jobs):
READ: io=244785MB, aggrb=496883KB/s, minb=15527KB/s, ...
WRITE: io=246735MB, aggrb=500841KB/s, minb=15651KB/s, ...
While for 8 cores (2 clusters) the result is:
Run status group 0 (all jobs):
READ: io=244785MB, aggrb=265942KB/s, minb=8310KB/s, ...
WRITE: io=246735MB, aggrb=268060KB/s, minb=8376KB/s, ...
The bottleneck here is the pool lock which many threads become waiting
upon. To reduce that spin lock contention, z3fold can operate only on
the lists local to the current CPU whenever possible. Due to the nature
of z3fold unbuddied list handling (it only takes the first entry off the
list on a hot path), if the z3fold pool is big enough and balanced well
enough, limiting search to only local unbuddied list doesn't lead to a
significant compression ratio degrade (2.57x vs 2.65x in our
measurements).
This patch also introduces two worker threads: one for async in-page
object layout optimization and one for releasing freed pages. This is
done to speed up z3fold_free() which is often on a hot path.
The fio results for 8-core case are now the following:
Run status group 0 (all jobs):
READ: io=244785MB, aggrb=1568.3MB/s, minb=50182KB/s, ...
WRITE: io=246735MB, aggrb=1580.8MB/s, minb=50582KB/s, ...