Pramod Gurav [Wed, 19 Nov 2014 10:09:04 +0000 (15:39 +0530)]
ARM: qcom: add description of KPSS WDT for APQ8064
Describe the Krait Processor Sub-system (KPSS) Watchdog timer in the
APQ8064 device tree. Also, add a fixed-clock description of SLEEP_CLK,
which will do for now.
Rob Clark [Tue, 28 Jul 2015 12:54:09 +0000 (13:54 +0100)]
ARM: dts: apq8064: Add MDP support
This patch adds MDP node to APQ8064 dt.
Signed-off-by: Rob Clark <robdclark@gmail.com>
[Srinivas Kandagatla] : updated with new style rpm regulators Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
Currently the rates of the xo and sleep clocks are hard-coded in the
GCC driver, but this is a board layout description that actually should
be in the DT. Moving them into DT also allows us to insert the RPM
controlled clocks between the DT and GCC clocks.
Signed-off-by: Georgi Djakov <georgi.djakov@linaro.org> Signed-off-by: Andy Gross <agross@codeaurora.org>
drm/msm: mdp4 lvds: Check the panel node in detect_panel()
This patch checks if the panel node is disabled in DT or not, this would
let us return proper error code so that the driver could stop panel
specific intialization.
drm/msm: mdp4 lvds: continue if the panel is not connected
Two issues:
1> Intializing panel specific bits without actual panel presence.
2> Bailing out if the detect_panel() return -ENODEV.
With the existing code if detect_panel() returns an error code the
driver would bail out without doing anything, However it could continue
intializing hdmi related things. This patch adds two things.
1> moves the panel specific intialization only if the panel is detected
2> let the driver continue with hdmi intialization if detect_panel()
return -ENODEV.
This patch adds panel picker support to simple-panel.
The idea of panel picker is to select the correct panel timings if it
supports probing edid via DDC bus, edid contains manufacture ID
and Manufacturer product code, so it can match against the panel_picker
entries to get the correct panel timings.
From DT point of view the panel picker uses generic compatible string
"panel-simple", keeping the panel specific compatible strings still
supported.
Panels can be static entry in the DT, but practically development boards
like IFC6410 where developers can connect any LVDS panel which makes it
difficult to maintian the dt support for those panels in dts file.
With this dynamic probing via panel picker makes it easy to support such
use-cases.
This patch also adds panel presence detection based, if there is no
panel detected or panel picker could not find the panel then the driver
would mark the panel DT node as disabled so that the drm driver would
be able to take right decision based on that panel node status.
This patch adds support to get edid way early before the connector is
created, this is mainly used for panel drivers to auto-probe the panel
based on the vendor and product id from EDID.
PCI: designware: add memory barrier after enabling region
Add 'write memory' barrier after enable region in PCIE_ATU_CR2
register. The barrier is needed to ensure that the region enable
request has been reached it's destination at time when we
read/write to PCI configuration space.
Without this barrier PCI device enumeration during kernel boot
is not reliable, and reading configuration space for particular
PCI device on the bus returns zero aka no device.
As part of Rob Clarks cleanup one of the flags have been renamed in
header which introduced a compilation failure.
drivers/iommu/msm_iommu.c: In function ‘msm_iommu_map’:
drivers/iommu/msm_iommu.c:426:7: error: ‘FL_AP_READ’ undeclared (first use in this function)
drivers/iommu/msm_iommu.c:426:7: note: each undeclared identifier is reported only once for each function it appears in
drivers/iommu/msm_iommu.c:426:20: error: ‘FL_AP_WRITE’ undeclared (first use in this function)
make[3]: *** [drivers/iommu/msm_iommu.o] Error 1
make[2]: *** [drivers/iommu] Error 2
make[1]: *** [drivers] Error 2
iThis patch fixes the error by using the new flags in the driver.
UBUNTU: SAUCE: s390/mm: fix race on mm->context.flush_mm
BugLink: http://bugs.launchpad.net/bugs/1708399
The order in __tlb_flush_mm_lazy is to flush TLB first and then clear
the mm->context.flush_mm bit. This can lead to missed flushes as the
bit can be set anytime, the order needs to be the other way aronud.
But this leads to a different race, __tlb_flush_mm_lazy may be called
on two CPUs concurrently. If mm->context.flush_mm is cleared first then
another CPU can bypass __tlb_flush_mm_lazy although the first CPU has
not done the flush yet. In a virtualized environment the time until the
flush is finally completed can be arbitrarily long.
Add a spinlock to serialize __tlb_flush_mm_lazy and use the function
in finish_arch_post_lock_switch as well.
Cc: <stable@vger.kernel.org> Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
(backported from 60f07c8ec5fae06c23e9fd7bab67dabce92b3414 linux-next)
[context adaption] Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Colin King <colin.king@canonical.com> Acked-by: Brad Figg <brad.figg@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
UBUNTU: SAUCE: s390/mm: fix local TLB flushing vs. detach of an mm address space
BugLink: http://bugs.launchpad.net/bugs/1708399
The local TLB flushing code keeps an additional mask in the mm.context,
the cpu_attach_mask. At the time a global flush of an address space is
done the cpu_attach_mask is copied to the mm_cpumask in order to avoid
future global flushes in case the mm is used by a single CPU only after
the flush.
Trouble is that the reset of the mm_cpumask is racy against the detach
of an mm address space by switch_mm. The current order is first the
global TLB flush and then the copy of the cpu_attach_mask to the
mm_cpumask. The order needs to be the other way around.
Cc: <stable@vger.kernel.org> Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
(backported from b3e5dc45fd1ec2aa1de6b80008f9295eb17e0659 linux-next)
[merged with "s390/mm,kvm: flush gmap address space with IDTE"] Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Colin King <colin.king@canonical.com> Acked-by: Brad Figg <brad.figg@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
s390/mm: no local TLB flush for clearing-by-ASCE IDTE
BugLink: http://bugs.launchpad.net/bugs/1708399
The local-clearing control of the IDTE instruction does not have any effect
for the clearing-by-ASCE operation. Only the invalidation-and-clearing
operation respects the local-clearing bit.
Remove __tlb_flush_idte_local and simplify the batched TLB flushing code.
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
(backported from commit d5dcafee5f183e9aedddb147a89cb46ab038f26b)
[Kept MACHINE_HAS_TLB_LC checks as 64f31d5802af11fd is not applied.] Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Colin King <colin.king@canonical.com> Acked-by: Brad Figg <brad.figg@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Kees Cook [Fri, 18 Aug 2017 22:16:31 +0000 (15:16 -0700)]
mm: revert x86_64 and arm64 ELF_ET_DYN_BASE base changes
BugLink: http://bugs.launchpad.net/bugs/1715636
Moving the x86_64 and arm64 PIE base from 0x555555554000 to 0x000100000000
broke AddressSanitizer. This is a partial revert of:
eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE") 02445990a96e ("arm64: move ELF_ET_DYN_BASE to 4GB / 4MB")
The AddressSanitizer tool has hard-coded expectations about where
executable mappings are loaded.
The motivation for changing the PIE base in the above commits was to
avoid the Stack-Clash CVEs that allowed executable mappings to get too
close to heap and stack. This was mainly a problem on 32-bit, but the
64-bit bases were moved too, in an effort to proactively protect those
systems (proofs of concept do exist that show 64-bit collisions, but
other recent changes to fix stack accounting and setuid behaviors will
minimize the impact).
The new 32-bit PIE base is fine for ASan (since it matches the ET_EXEC
base), so only the 64-bit PIE base needs to be reverted to let x86 and
arm64 ASan binaries run again. Future changes to the 64-bit PIE base on
these architectures can be made optional once a more dynamic method for
dealing with AddressSanitizer is found. (e.g. always loading PIE into
the mmap region for marked binaries.)
Link: http://lkml.kernel.org/r/20170807201542.GA21271@beast Fixes: eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE") Fixes: 02445990a96e ("arm64: move ELF_ET_DYN_BASE to 4GB / 4MB") Signed-off-by: Kees Cook <keescook@chromium.org> Reported-by: Kostya Serebryany <kcc@google.com> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c715b72c1ba406f133217b509044c38d8e714a37) Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Acked-by: Colin King <colin.king@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Juerg Haefliger [Thu, 17 Aug 2017 11:32:00 +0000 (13:32 +0200)]
UBUNTU: SAUCE: bnxt_en_bpo: Remove PCI_IDs handled by the regular driver
BugLink: http://bugs.launchpad.net/bugs/1711056
The backported bnxt_en_bpo driver should only load for NICs that are not
handled by the current bnxt_en driver.
Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Juerg Haefliger [Thu, 17 Aug 2017 11:32:00 +0000 (13:32 +0200)]
UBUNTU: SAUCE: bnxt_en_bpo: Drop distro out-of-tree detection logic
BugLink: http://bugs.launchpad.net/bugs/1711056
The provided Makefile is a generic out-of-tree Makefile that tries
to be smart and detect the source code location plus a few other
things based on the current distro. We don't need any of this since
we're adding the driver as an in-tree module, so get rid of it.
Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
This patch adds ALPS PTP sticks with pid/device id 0x120A to the list of
devices supported by hid-multitouch.
Signed-off-by: Shrirang Bagul <shrirang.bagul@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Colin King <colin.king@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Support PTP Stick and Touchpad device. This Touchpad is Precision Touchpad
(PTP), and Stick Pointer data is the same as Mouse; Stick Pointer works as
Mouse.
[jkosina@suse.cz: changelog deuglification] Signed-off-by: Masaki Ota <masaki.ota@jp.alps.com> Reviewed-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
(cherry picked from commit 504c932c7c47a6b4c572302a13873f7d83af1ff3) Signed-off-by: Shrirang Bagul <shrirang.bagul@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Colin King <colin.king@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
UBUNTU: SAUCE: igb: add support for using Broadcom 54616 as PHY
BugLink: https://launchpad.net/bugs/1712024
Ported from packages/base/any/kernels/3.18.25/patches/driver-support-intel-igb-bcm54616-phy.patch
in OpenNetworkLinux https://github.com/opencomputeproject/OpenNetworkLinux/
Signed-off-by: Wen-chien Jesse Sung <jesse.sung@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-By: AceLan Kao <acelan.kao@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
scsi: ipr: do not set DID_PASSTHROUGH on CHECK CONDITION
BugLink: http://bugs.launchpad.net/bugs/1682644
On a dual controller setup with multipath enabled, some MEDIUM ERRORs
caused both paths to be failed, thus I/O got queued/blocked since the
'queue_if_no_path' feature is enabled by default on IPR controllers.
This example disabled 'queue_if_no_path' so the I/O failure is seen at
the sg_dd program. Notice that after the sg_dd test-case, both paths
are in 'failed' state, and both path/priority groups are in 'enabled'
state (not 'active') -- which would block I/O with 'queue_if_no_path'.
This is not the desired behavior. The dm-multipath explicitly checks
for the MEDIUM ERROR case (and a few others) so not to fail the path
(e.g., I/O to other sectors could potentially happen without problems).
See dm-mpath.c :: do_end_io_bio() -> noretry_error() !->! fail_path().
The problem trace is:
1) ipr_scsi_done() // SENSE KEY/CHECK CONDITION detected, go to..
2) ipr_erp_start() // ipr_is_gscsi() and masked_ioasc OK, go to..
3) ipr_gen_sense() // masked_ioasc is IPR_IOASC_MED_DO_NOT_REALLOC,
// so set DID_PASSTHROUGH.
4) scsi_decide_disposition() // check for DID_PASSTHROUGH and return
// early on, faking a DID_OK.. *instead*
// of reaching scsi_check_sense().
// Had it reached the latter, that would
// set host_byte to DID_MEDIUM_ERROR.
5) scsi_finish_command()
6) scsi_io_completion()
7) __scsi_error_from_host_byte() // That would be converted to -ENODATA
<...>
8) dm_softirq_done()
9) multipath_end_io()
10) do_end_io()
11) noretry_error() // And that is checked in dm-mpath :: noretry_error()
// which would cause fail_path() not to be called.
With this patch applied, the I/O is failed but the paths are not. This
multipath device continues accepting more I/O requests without blocking.
(and notice the different host byte/driver byte handling per SCSI layer).
# dmesg
[...] sd 2:2:7:0: [sdaf] Done: SUCCESS Result: hostbyte=0x13 driverbyte=DRIVER_OK
[...] sd 2:2:7:0: [sdaf] CDB: Read(10) 28 00 00 00 00 00 00 00 40 00
[...] sd 2:2:7:0: [sdaf] Sense Key : Medium Error [current]
[...] sd 2:2:7:0: [sdaf] Add. Sense: Unrecovered read error - recommend rewrite the data
[...] blk_update_request: critical medium error, dev sdaf, sector 0
[...] blk_update_request: critical medium error, dev dm-6, sector 0
[...] sd 2:2:7:0: [sdaf] Done: SUCCESS Result: hostbyte=0x13 driverbyte=DRIVER_OK
[...] sd 2:2:7:0: [sdaf] CDB: Read(10) 28 00 00 00 00 00 00 00 10 00
[...] sd 2:2:7:0: [sdaf] Sense Key : Medium Error [current]
[...] sd 2:2:7:0: [sdaf] Add. Sense: Unrecovered read error - recommend rewrite the data
[...] blk_update_request: critical medium error, dev sdaf, sector 0
[...] blk_update_request: critical medium error, dev dm-6, sector 0
[...] Buffer I/O error on dev dm-6, logical block 0, async page read
Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com> Acked-by: Brian King <brking@linux.vnet.ibm.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 785a470496d8e0a32e3d39f376984eb2c98ca5b3) Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Colin King <colin.king@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
BugLink: http://bugs.launchpad.net/bugs/1711401
Commit 2def86a7200c ("hvc: Convert to using interrupts instead of opal
events") enabled the use of interrupts in the hvc_driver for OPAL
platforms. However on machines with more than one hvc console, any
console after the first will fail to register an interrupt handler in
notifier_add_irq() since all consoles share the same IRQ number but do
not set the IRQF_SHARED flag:
genirq: Flags mismatch irq 31. 00000000 (hvc_console) vs. 00000000 (hvc_console)
hvc_open: request_irq failed with rc -16.
This error propagates up to hvc_open() and the console is closed, but
OPAL will still generate interrupts that are not handled, leading to
rcu_sched stall warnings.
Set IRQF_SHARED when calling request_irq(), allowing additional consoles
to start properly. This is only set for consoles handled by
hvc_opal_probe(), leaving other types unaffected.
Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit bbc3dfe8805de86874b1a1b1429a002e8670043e) Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Po-Hsu Lin <po-hsu.lin@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Po-Hsu Lin [Thu, 10 Aug 2017 06:32:00 +0000 (08:32 +0200)]
selftests: fix memory-hotplug test
BugLink: http://bugs.launchpad.net/bugs/1710868
In the memory offline test, the $ration was used with RANDOM as the
possibility to get it offlined, correct it to become the portion of
available removable memory blocks.
Also ask the tool to try to offline the next available memory block
if the attempt is unsuccessful. It will only fail if all removable
memory blocks are busy.
A nice example:
$ sudo ./test.sh
Test scope: 10% hotplug memory
online all hot-pluggable memory in offline state:
SKIPPED - no hot-pluggable memory in offline state
offline 10% hot-pluggable memory in online state
trying to offline 3 out of 28 memory block(s):
online->offline memory1
online->offline memory10
./test.sh: line 74: echo: write error: Resource temporarily unavailable
offline_memory_expect_success 10: unexpected fail
online->offline memory100
online->offline memory101
online all hot-pluggable memory in offline state:
offline->online memory1
offline->online memory100
offline->online memory101
skip extra tests: debugfs is not mounted
$ echo $?
0
Signed-off-by: Po-Hsu Lin <po-hsu.lin@canonical.com> Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com>
(cherry picked from commit 5ff0c60b0e5e8c2a2139c304d7620b0ea70d721a) Signed-off-by: Po-Hsu Lin <po-hsu.lin@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Po-Hsu Lin [Thu, 10 Aug 2017 06:32:00 +0000 (08:32 +0200)]
selftests: add missing test name in memory-hotplug test
BugLink: http://bugs.launchpad.net/bugs/1710868
There is no prompt for testing memory notifier error injection,
added with the same echo format of other tests above.
Signed-off-by: Po-Hsu Lin <po-hsu.lin@canonical.com> Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com>
(cherry picked from commit 02d8f075ac44f6f0dc46881965f815576921f2c0) Signed-off-by: Po-Hsu Lin <po-hsu.lin@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Tore Anderson [Thu, 17 Aug 2017 11:16:00 +0000 (13:16 +0200)]
net: cdc_mbim: apply "NDP to end" quirk to HP lt4132
BugLink: https://bugs.launchpad.net/bugs/1707643
The HP lt4132 LTE/HSPA+ 4G Module (03f0:a31d) is a rebranded Huawei
ME906s-158 device. It, like the ME906s-158, requires the "NDP to end"
quirk for correct operation.
Signed-off-by: Tore Anderson <tore@fud.no> Signed-off-by: David S. Miller <davem@davemloft.net>
(backported from commit a68491f895a937778bb25b0795830797239de31f) Signed-off-by: Aurimas Fišeras <aurimas@members.fsf.org> Tested-by: Aurimas Fišeras <aurimas@members.fsf.org> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Andrea Arcangeli [Tue, 18 Jul 2017 12:05:00 +0000 (14:05 +0200)]
ksm: optimize refile of stable_node_dup at the head of the chain
BugLink: http://bugs.launchpad.net/bugs/1680513
If a candidate stable_node_dup has been found and it can accept further
merges it can be refiled to the head of the list to speedup next
searches without altering which dup is found and how the dups accumulate
in the chain.
We already refiled it back to the head in the prune_stale_stable_nodes
case, but we didn't refile it if not pruning (which is more common).
And we also refiled it when it was already at the head which is
unnecessary (in the prune_stale_stable_nodes case, nr > 1 means there's
more than one dup in the chain, it doesn't mean it's not already at the
head of the chain).
The stable_node_chain list is single threaded and there's no SMP locking
contention so it should be faster to refile it to the head of the list
also if prune_stale_stable_nodes is false.
Profiling shows the refile happens 1.9% of the time when a dup is found
with a max_page_sharing limit setting of 3 (with max_page_sharing of 2
the refile never happens of course as there's never space for one more
merge) which is reasonably low. At higher max_page_sharing values it
should be much less frequent.
This is just an optimization.
Link: http://lkml.kernel.org/r/20170518173721.22316-4-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Evgheni Dereveanchin <ederevea@redhat.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Gavin Guo <gavin.guo@canonical.com> Cc: Jay Vosburgh <jay.vosburgh@canonical.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 80b18dfa53bbb03085eba6401f5d29dad49205b7) Signed-off-by: Gavin Guo <gavin.guo@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
There are two output parameters to chain/chain_prune, one is tree_page
the other is stable_node_dup. Like in get_ksm_page the caller has to
check tree_page is NULL before touching the stable_node. Similarly in
chain/chain_prune the caller has to check tree_page before touching the
stable_node_dup returned or the original stable_node passed as
parameter.
Because the tree_page is never returned as a stale pointer, it may be
more intuitive to return tree_page and to pass stable_node_dup for
reference instead of the reverse.
This patch purely swaps the two output parameters of chain/chain_prune
as a cleanup for the static checker and to mimic the get_ksm_page
behavior more closely. There's no change to the caller at all except
the swap, it's purely a cleanup and it is a noop from the caller point
of view.
Link: http://lkml.kernel.org/r/20170518173721.22316-3-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Tested-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Evgheni Dereveanchin <ederevea@redhat.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Gavin Guo <gavin.guo@canonical.com> Cc: Jay Vosburgh <jay.vosburgh@canonical.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 8dc5ffcd5a74da39ed2c533d786a3f78671a38b8) Signed-off-by: Gavin Guo <gavin.guo@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
There are no fixes here it's just minor cleanups and optimizations.
1/3 removes makes the "fix" for the stale stable_node fall in the
standard case without introducing new cases. Setting stable_node to
NULL was marginally safer, but stale pointer is still wiped from the
caller, this looks cleaner.
2/3 should fix the false positive from Dan's static checker.
3/3 is a microoptimization to apply the the refile of future merge
candidate dups at the head of the chain in all cases and to skip it in
one case where we did it and but it was a noop (to avoid checking if
it was already at the head but now we've to check it anyway so it got
optimized away).
This patch (of 3):
When the stable_node chain is collapsed we can as well set the caller
stable_node to match the returned stable_node_dup in chain_prune().
This way the collapse case becomes indistinguishable from the regular
stable_node case and we can remove two branches from the KSM page
migration handling slow paths.
While it was all correct this looks cleaner (and faster) as the caller has
to deal with fewer special cases.
Link: http://lkml.kernel.org/r/20170518173721.22316-2-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Evgheni Dereveanchin <ederevea@redhat.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Gavin Guo <gavin.guo@canonical.com> Cc: Jay Vosburgh <jay.vosburgh@canonical.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 0ba1d0f7c41cdab306a3d30e036bc393c3ebba7e) Signed-off-by: Gavin Guo <gavin.guo@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Andrea Arcangeli [Tue, 18 Jul 2017 12:05:00 +0000 (14:05 +0200)]
ksm: fix use after free with merge_across_nodes = 0
BugLink: http://bugs.launchpad.net/bugs/1680513
If merge_across_nodes was manually set to 0 (not the default value) by
the admin or a tuned profile on NUMA systems triggering cross-NODE page
migrations, a stable_node use after free could materialize.
If the chain is collapsed stable_node would point to the old chain that
was already freed. stable_node_dup would be the stable_node dup now
converted to a regular stable_node and indexed in the rbtree in
replacement of the freed stable_node chain (not anymore a dup).
This special case where the chain is collapsed in the NUMA replacement
path, is now detected by setting stable_node to NULL by the chain_prune
callee if it decides to collapse the chain. This tells the NUMA
replacement code that even if stable_node and stable_node_dup are
different, this is not a chain if stable_node is NULL, as the
stable_node_dup was converted to a regular stable_node and the chain was
collapsed.
It is generally safer for the callee to force the caller stable_node to
NULL the moment it become stale so any other mistake like this would
result in an instant Oops easier to debug than an use after free.
Otherwise the replace logic would act like if stable_node was a valid
chain, when in fact it was freed. Notably
stable_node_chain_add_dup(page_node, stable_node) would run on a stable
stable_node.
Andrey Ryabinin found the source of the use after free in chain_prune().
Link: http://lkml.kernel.org/r/20170512193805.8807-2-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reported-by: Evgheni Dereveanchin <ederevea@redhat.com> Tested-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Gavin Guo <gavin.guo@canonical.com> Cc: Jay Vosburgh <jay.vosburgh@canonical.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit b4fecc67cc569b14301f5a1111363d5818b8da5e) Signed-off-by: Gavin Guo <gavin.guo@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Andrea Arcangeli [Tue, 18 Jul 2017 12:05:00 +0000 (14:05 +0200)]
ksm: introduce ksm_max_page_sharing per page deduplication limit
BugLink: http://bugs.launchpad.net/bugs/1680513
Without a max deduplication limit for each KSM page, the list of the
rmap_items associated to each stable_node can grow infinitely large.
During the rmap walk each entry can take up to ~10usec to process
because of IPIs for the TLB flushing (both for the primary MMU and the
secondary MMUs with the MMU notifier). With only 16GB of address space
shared in the same KSM page, that would amount to dozens of seconds of
kernel runtime.
A ~256 max deduplication factor will reduce the latencies of the rmap
walks on KSM pages to order of a few msec. Just doing the
cond_resched() during the rmap walks is not enough, the list size must
have a limit too, otherwise the caller could get blocked in (schedule
friendly) kernel computations for seconds, unexpectedly.
There's room for optimization to significantly reduce the IPI delivery
cost during the page_referenced(), but at least for page_migration in
the KSM case (used by hard NUMA bindings, compaction and NUMA balancing)
it may be inevitable to send lots of IPIs if each rmap_item->mm is
active on a different CPU and there are lots of CPUs. Even if we ignore
the IPI delivery cost, we've still to walk the whole KSM rmap list, so
we can't allow millions or billions (ulimited) number of entries in the
KSM stable_node rmap_item lists.
The limit is enforced efficiently by adding a second dimension to the
stable rbtree. So there are three types of stable_nodes: the regular
ones (identical as before, living in the first flat dimension of the
stable rbtree), the "chains" and the "dups".
Every "chain" and all "dups" linked into a "chain" enforce the invariant
that they represent the same write protected memory content, even if
each "dup" will be pointed by a different KSM page copy of that content.
This way the stable rbtree lookup computational complexity is unaffected
if compared to an unlimited max_sharing_limit. It is still enforced
that there cannot be KSM page content duplicates in the stable rbtree
itself.
Adding the second dimension to the stable rbtree only after the
max_page_sharing limit hits, provides for a zero memory footprint
increase on 64bit archs. The memory overhead of the per-KSM page
stable_tree and per virtual mapping rmap_item is unchanged. Only after
the max_page_sharing limit hits, we need to allocate a stable_tree
"chain" and rb_replace() the "regular" stable_node with the newly
allocated stable_node "chain". After that we simply add the "regular"
stable_node to the chain as a stable_node "dup" by linking hlist_dup in
the stable_node_chain->hlist. This way the "regular" (flat) stable_node
is converted to a stable_node "dup" living in the second dimension of
the stable rbtree.
During stable rbtree lookups the stable_node "chain" is identified as
stable_node->rmap_hlist_len == STABLE_NODE_CHAIN (aka
is_stable_node_chain()).
When dropping stable_nodes, the stable_node "dup" is identified as
stable_node->head == STABLE_NODE_DUP_HEAD (aka is_stable_node_dup()).
The STABLE_NODE_DUP_HEAD must be an unique valid pointer never used
elsewhere in any stable_node->head/node to avoid a clashes with the
stable_node->node.rb_parent_color pointer, and different from
&migrate_nodes. So the second field of &migrate_nodes is picked and
verified as always safe with a BUILD_BUG_ON in case the list_head
implementation changes in the future.
The STABLE_NODE_DUP is picked as a random negative value in
stable_node->rmap_hlist_len. rmap_hlist_len cannot become negative when
it's a "regular" stable_node or a stable_node "dup".
The stable_node_chain->nid is irrelevant. The stable_node_chain->kpfn
is aliased in a union with a time field used to rate limit the
stable_node_chain->hlist prunes.
The garbage collection of the stable_node_chain happens lazily during
stable rbtree lookups (as for all other kind of stable_nodes), or while
disabling KSM with "echo 2 >/sys/kernel/mm/ksm/run" while collecting the
entire stable rbtree.
While the "regular" stable_nodes and the stable_node "dups" must wait
for their underlying tree_page to be freed before they can be freed
themselves, the stable_node "chains" can be freed immediately if the
stable_node->hlist turns empty. This is because the "chains" are never
pointed by any page->mapping and they're effectively stable rbtree KSM
self contained metadata.
[akpm@linux-foundation.org: fix non-NUMA build] Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Tested-by: Petr Holasek <pholasek@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Evgheni Dereveanchin <ederevea@redhat.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Gavin Guo <gavin.guo@canonical.com> Cc: Jay Vosburgh <jay.vosburgh@canonical.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(backported from commit 2c653d0ee2ae78ff3a174cc877a057c8afac7069) Signed-off-by: Gavin Guo <gavin.guo@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>