Joonsoo Kim [Tue, 15 Mar 2016 21:54:24 +0000 (14:54 -0700)]
mm/slab: alternative implementation for DEBUG_SLAB_LEAK
DEBUG_SLAB_LEAK is a debug option. It's current implementation requires
status buffer so we need more memory to use it. And, it cause
kmem_cache initialization step more complex.
To remove this extra memory usage and to simplify initialization step,
this patch implement this feature with another way.
When user requests to get slab object owner information, it marks that
getting information is started. And then, all free objects in caches
are flushed to corresponding slab page. Now, we can distinguish all
freed object so we can know all allocated objects, too. After
collecting slab object owner information on allocated objects, mark is
checked that there is no free during the processing. If true, we can be
sure that our information is correct so information is returned to user.
Although this way is rather complex, it has two important benefits
mentioned above. So, I think it is worth changing.
There is one drawback that it takes more time to get slab object owner
information but it is just a debug option so it doesn't matter at all.
To help review, this patch implements new way only. Following patch
will remove useless code.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Tue, 15 Mar 2016 21:54:18 +0000 (14:54 -0700)]
mm/slab: use more appropriate condition check for debug_pagealloc
debug_pagealloc debugging is related to SLAB_POISON flag rather than
FORCED_DEBUG option, although FORCED_DEBUG option will enable
SLAB_POISON. Fix it.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Tue, 15 Mar 2016 21:54:12 +0000 (14:54 -0700)]
mm/slab: remove the checks for slab implementation bug
Some of "#if DEBUG" are for reporting slab implementation bug rather
than user usecase bug. It's not really needed because slab is stable
for a quite long time and it makes code too dirty. This patch remove
it.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Tue, 15 Mar 2016 21:54:06 +0000 (14:54 -0700)]
mm/slab: fix stale code comment
This patchset implements a new freed object management way, that is,
OBJFREELIST_SLAB. Purpose of it is to reduce memory overhead in SLAB.
SLAB needs a array to manage freed objects in a slab. If there is
leftover after objects are packed into a slab, we can use it as a
management array, and, in this case, there is no memory waste. But, in
the other cases, we need to allocate extra memory for a management array
or utilize dedicated internal memory in a slab for it. Both cases
causes memory waste so it's not good.
With this patchset, freed object itself can be used for a management
array. So, memory waste could be reduced. Detailed idea and numbers
are described in last patch's commit description. Please refer it.
In fact, I tested another idea implementing OBJFREELIST_SLAB with
extendable linked array through another freed object. It can remove
memory waste completely but it causes more computational overhead in
critical lock path and it seems that overhead outweigh benefit. So,
this patchset doesn't include it. I will attach prototype just for a
reference.
This patch (of 16):
We use freelist_idx_t type for free object management whose size would be
smaller than size of unsigned int. Fix it.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch introduce a new API call kfree_bulk() for bulk freeing memory
objects not bound to a single kmem_cache.
Christoph pointed out that it is possible to implement freeing of
objects, without knowing the kmem_cache pointer as that information is
available from the object's page->slab_cache. Proposing to remove the
kmem_cache argument from the bulk free API.
Jesper demonstrated that these extra steps per object comes at a
performance cost. It is only in the case CONFIG_MEMCG_KMEM is compiled
in and activated runtime that these steps are done anyhow. The extra
cost is most visible for SLAB allocator, because the SLUB allocator does
the page lookup (virt_to_head_page()) anyhow.
Thus, the conclusion was to keep the kmem_cache free bulk API with a
kmem_cache pointer, but we can still implement a kfree_bulk() API fairly
easily. Simply by handling if kmem_cache_free_bulk() gets called with a
kmem_cache NULL pointer.
This does increase the code size a bit, but implementing a separate
kfree_bulk() call would likely increase code size even more.
Below benchmarks cost of alloc+free (obj size 256 bytes) on CPU i7-4790K
@ 4.00GHz, no PREEMPT and CONFIG_MEMCG_KMEM=y.
Code size increase for SLAB:
add/remove: 0/0 grow/shrink: 1/0 up/down: 74/0 (74)
function old new delta
kmem_cache_free_bulk 660 734 +74
This patch implements the free side of bulk API for the SLAB allocator
kmem_cache_free_bulk(), and concludes the implementation of optimized
bulk API for SLAB allocator.
Benchmarked[1] cost of alloc+free (obj size 256 bytes) on CPU i7-4790K @
4.00GHz, with no debug options, no PREEMPT and CONFIG_MEMCG_KMEM=y but
no active user of kmemcg.
SLAB single alloc+free cost: 87 cycles(tsc) 21.814 ns with this
optimized config.
This patch implements the alloc side of bulk API for the SLAB allocator.
Further optimization are still possible by changing the call to
__do_cache_alloc() into something that can return multiple objects.
This optimization is left for later, given end results already show in
the area of 80% speedup.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
slab: use slab_post_alloc_hook in SLAB allocator shared with SLUB
Reviewers notice that the order in slab_post_alloc_hook() of
kmemcheck_slab_alloc() and kmemleak_alloc_recursive() gets swapped
compared to slab.c / SLAB allocator.
Also notice memset now occurs before calling kmemcheck_slab_alloc() and
kmemleak_alloc_recursive().
I assume this reordering of kmemcheck, kmemleak and memset is okay
because this is the order they are used by the SLUB allocator.
This patch completes the sharing of alloc_hook's between SLUB and SLAB.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: kmemcheck skip object if slab allocation failed
In the SLAB allocator kmemcheck_slab_alloc() is guarded against being
called in case the object is NULL. In SLUB allocator this NULL pointer
invocation can happen, which seems like an oversight.
Move the NULL pointer check into kmemcheck code (kmemcheck_slab_alloc)
so the check gets moved out of the fastpath, when not compiled with
CONFIG_KMEMCHECK.
This is a step towards sharing post_alloc_hook between SLUB and SLAB,
because slab_post_alloc_hook() does not perform this check before
calling kmemcheck_slab_alloc().
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
slab: use slab_pre_alloc_hook in SLAB allocator shared with SLUB
Deduplicate code in SLAB allocator functions slab_alloc() and
slab_alloc_node() by using the slab_pre_alloc_hook() call, which is now
shared between SLUB and SLAB.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: fault-inject take over bootstrap kmem_cache check
Remove the SLAB specific function slab_should_failslab(), by moving the
check against fault-injection for the bootstrap slab, into the shared
function should_failslab() (used by both SLAB and SLUB).
This is a step towards sharing alloc_hook's between SLUB and SLAB.
This bootstrap slab "kmem_cache" is used for allocating struct
kmem_cache objects to the allocator itself.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/slab: move SLUB alloc hooks to common mm/slab.h
First step towards sharing alloc_hook's between SLUB and SLAB
allocators. Move the SLUB allocators *_alloc_hook to the common
mm/slab.h for internal slab definitions.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
slub: clean up code for kmem cgroup support to kmem_cache_free_bulk
This change is primarily an attempt to make it easier to realize the
optimizations the compiler performs in-case CONFIG_MEMCG_KMEM is not
enabled.
Performance wise, even when CONFIG_MEMCG_KMEM is compiled in, the
overhead is zero. This is because, as long as no process have enabled
kmem cgroups accounting, the assignment is replaced by asm-NOP
operations. This is possible because memcg_kmem_enabled() uses a
static_key_false() construct.
It also helps readability as it avoid accessing the p[] array like:
p[size - 1] which "expose" that the array is processed backwards inside
helper function build_detached_freelist().
Lastly this also makes the code more robust, in error case like passing
NULL pointers in the array. Which were previously handled before commit 033745189b1b ("slub: add missing kmem cgroup support to
kmem_cache_free_bulk").
Fixes: 033745189b1b ("slub: add missing kmem cgroup support to kmem_cache_free_bulk") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Arnd Bergmann [Tue, 15 Mar 2016 21:53:29 +0000 (14:53 -0700)]
paride: make 'verbose' parameter an 'int' again
gcc-6.0 found an ancient bug in the paride driver, which had a
"module_param(verbose, bool, 0);" since before 2.6.12, but actually uses
it to accept '0', '1' or '2' as arguments:
drivers/block/paride/pd.c: In function 'pd_init_dev_parms':
drivers/block/paride/pd.c:298:29: warning: comparison of constant '1' with boolean expression is always false [-Wbool-compare]
#define DBMSG(msg) ((verbose>1)?(msg):NULL)
In 2012, Rusty did a cleanup patch that also changed the type of the
variable to 'bool', which introduced what is now a gcc warning.
This changes the type back to 'int' and adapts the module_param() line
instead, so it should work as documented in case anyone ever cares about
running the ancient driver with debugging.
San Mehat [Tue, 15 Mar 2016 21:53:26 +0000 (14:53 -0700)]
block: partition: add partition specific uevent callbacks for partition info
This patch has been carried in the Android tree for quite some time and
is one of the few patches required to get a mainline kernel up and
running with an exsiting Android userspace. So I wanted to submit it
for review and consideration if it should be merged.
For partitions, add new uevent parameters 'PARTN' which specifies the
partitions index in the table, and 'PARTNAME', which specifies PARTNAME
specifices the partition name of a partition device.
Android's userspace uses this for creating device node links from the
partition name and number, ie:
/dev/block/platform/soc/by-name/system
or
/dev/block/platform/soc/by-num/p1
One can see its usage here:
https://android.googlesource.com/platform/system/core/+/master/init/devices.cpp#355
and
https://android.googlesource.com/platform/system/core/+/master/init/devices.cpp#494
[john.stultz@linaro.org: dropped NPARTS and reworded commit message for context] Signed-off-by: Dima Zavin <dima@android.com> Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Rom Lemarchand <romlem@google.com> Cc: Android Kernel Team <kernel-team@android.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: <harald@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kay Sievers <kay@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jun Piao [Tue, 15 Mar 2016 21:53:23 +0000 (14:53 -0700)]
ocfs2/dlm: fix a variable overflow problem in dlmdomain.c
In dlm_send_join_cancels(), node is defined with type unsigned int, but
initialized with -1, this will lead variable overflow. Although this
won't cause any runtime problem, the code looks a little uncoordinated.
Signed-off-by: Jun Piao <piaojun@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jiufei Xue [Tue, 15 Mar 2016 21:53:20 +0000 (14:53 -0700)]
ocfs2: fix a tiny race that leads file system read-only
when o2hb detect a node down, it first set the dead node to recovery map
and create ocfs2rec which will replay journal for dead node. o2hb
thread then call dlm_do_local_recovery_cleanup() to delete the lock for
dead node. After the lock of dead node is gone, locks for other nodes
can be granted and may modify the meta data without replaying journal of
the dead node. The detail is described as follows.
N1 N2 N3(master)
modify the extent tree of
inode, and commit
dirty metadata to journal,
then goes down.
o2hb thread detects
N1 goes down, set
recovery map and
delete the lock of N1.
dlm_thread flush ast
for the lock of N2.
do not detect the death
of N1, so recovery map is
empty.
read inode from disk
without replaying
the journal of N1 and
modify the extent tree
of the inode that N1
had modified.
ocfs2rec recover the
journal of N1.
The modification of N2
is lost.
The modification of N1 and N2 are not serial, and it will lead to
read-only file system. We can set recovery_waiting flag to the lock
resource after delete the lock for dead node to prevent other node from
getting the lock before dlm recovery. After dlm recovery, the recovery
map on N2 is not empty, ocfs2_inode_lock_full_nested() will wait for ocfs2
recovery.
Signed-off-by: Jiufei Xue <xuejiufei@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
xuejiufei [Tue, 15 Mar 2016 21:53:17 +0000 (14:53 -0700)]
ocfs2/dlm: return EINVAL when the lockres on migration target is in DROPPING_REF state
If master migrate this lock resource to node when it happened to purge
it, a new lock resource will be created and inserted into hash list. If
then master goes down, the lock resource being purged is recovered, so
there exist two lock resource with different owner. So return error to
master if the lock resource is in DROPPING state, master will retry to
migrate this lock resource.
Signed-off-by: xuejiufei <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
xuejiufei [Tue, 15 Mar 2016 21:53:14 +0000 (14:53 -0700)]
ocfs2/dlm: clear DROPPING_REF flag when the master goes down
If the master goes down after return in-progress for deref message. The
lock resource on non-master node can not be purged. Clear the
DROPPING_REF flag and recovery it.
Signed-off-by: xuejiufei <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
xuejiufei [Tue, 15 Mar 2016 21:53:11 +0000 (14:53 -0700)]
ocfs2/dlm: return in progress if master can not clear the refmap bit right now
Master returns in-progress to non-master node when it can not clear the
refmap bit right now. And non-master node will not purge the lock
resource until receiving deref done message.
Signed-off-by: xuejiufei <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dlm_purge_lockres
dlm_deref_handler
-> find lock resource is in
DLM_LOCK_RES_SETREF_INPROG state,
so dispatch a deref work
dlm_purge_lockres succeed.
call dlmlock again
dlm_do_master_request
dlm_master_request_handler
-> dlm_lockres_set_refmap_bit
deref work trigger, call
dlm_lockres_clear_refmap_bit
to clear Node 1 from refmap
dlm_purge_lockres succeed
dlm_send_remote_lock_request
return DLM_IVLOCKID because
the lockres is not exist
BUG if the lockres is $RECOVERY
This series of patches add a new message to keep the order of set and
clear. Other nodes can purge the lock resource only after the refmap bit
on master is cleared.
This patch is to add DEREF_DONE message and corresponding handler. Node
can purge the lock resource after receiving this message. As a new
message is added, so increase the minor number of dlm protocol version.
Signed-off-by: xuejiufei <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joseph Qi [Tue, 15 Mar 2016 21:53:05 +0000 (14:53 -0700)]
ocfs2/dlm: fix a typo in dlmcommon.h
Refer to cluster/tcp.h, NET_MAX_PAYLOAD_BYTES is a typo for
O2NET_MAX_PAYLOAD_BYTES.
Since currently DLM_MIG_LOCKRES_RESERVED is not actually used, it won't
cause any problem. But we'd better correct it for further use.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
jiangyiwen [Tue, 15 Mar 2016 21:53:01 +0000 (14:53 -0700)]
ocfs2: use spinlock_irqsave() to downconvert lock in ocfs2_osb_dump()
Commit a75e9ccabd92 ("ocfs2: use spinlock irqsave for downconvert lock")
missed an unmodified place in ocfs2_osb_dump(), so it still exists a
deadlock scenario.
jiangyiwen [Tue, 15 Mar 2016 21:52:58 +0000 (14:52 -0700)]
ocfs2/cluster: replace the interrupt safe spinlocks with common ones
There actually no hardware or software interrupts in the context which
using o2hb_live_lock, so we don't need to worry about race conditions
caused by irq/softirq with spinlock held. Turning off irq is not good
for system performance after all. Just replace them with a non
interrupt safe function.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sudip Mukherjee [Tue, 15 Mar 2016 21:52:55 +0000 (14:52 -0700)]
blackfin: define dummy pgprot_writecombine for !MMU
blackfin allmodconfig build fails with the error:
../sound/core/pcm_native.c: In function 'snd_pcm_lib_default_mmap':
../sound/core/pcm_native.c:3386:24: error: implicit declaration of function 'pgprot_writecombine' [-Werror=implicit-function-declaration]
area->vm_page_prot = pgprot_writecombine(area->vm_page_prot);
^
../sound/core/pcm_native.c:3386:22: error: incompatible types when assigning to type 'pgprot_t {aka struct <anonymous>}' from type 'int'
area->vm_page_prot = pgprot_writecombine(area->vm_page_prot);
^
When !MMU, asm-generic will not define default pgprot_writecombine, so
blackfin needs to define it by itself.
The patch idea is from commit 65b9ab888cd7 ("arch/c6x/include/asm/pgtable.h:
define dummy pgprot_writecombine for !MMU")
Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org> Cc: Steven Miao <realmz6@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peter Zijlstra [Tue, 15 Mar 2016 21:52:49 +0000 (14:52 -0700)]
tags: Fix DEFINE_PER_CPU expansions
$ make tags
GEN tags
ctags: Warning: drivers/acpi/processor_idle.c:64: null expansion of name pattern "\1"
ctags: Warning: drivers/xen/events/events_2l.c:41: null expansion of name pattern "\1"
ctags: Warning: kernel/locking/lockdep.c:151: null expansion of name pattern "\1"
ctags: Warning: kernel/rcu/rcutorture.c:133: null expansion of name pattern "\1"
ctags: Warning: kernel/rcu/rcutorture.c:135: null expansion of name pattern "\1"
ctags: Warning: kernel/workqueue.c:323: null expansion of name pattern "\1"
ctags: Warning: net/ipv4/syncookies.c:53: null expansion of name pattern "\1"
ctags: Warning: net/ipv6/syncookies.c:44: null expansion of name pattern "\1"
ctags: Warning: net/rds/page.c:45: null expansion of name pattern "\1"
Which are all the result of the DEFINE_PER_CPU pattern:
The below cures them. All except the workqueue one are within reasonable
distance of the 80 char limit. TJ do you have any preference on how to
fix the wq one, or shall we just not care its too long?
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Tejun Heo <tj@kernel.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Tue, 15 Mar 2016 02:44:38 +0000 (19:44 -0700)]
Merge branch 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull NOHZ updates from Ingo Molnar:
"NOHZ enhancements, by Frederic Weisbecker, which reorganizes/refactors
the NOHZ 'can the tick be stopped?' infrastructure and related code to
be data driven, and harmonizes the naming and handling of all the
various properties"
[ This makes the ugly "fetch_or()" macro that the scheduler used
internally a new generic helper, and does a bad job at it.
I'm pulling it, but I've asked Ingo and Frederic to get this
fixed up ]
* 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched-clock: Migrate to use new tick dependency mask model
posix-cpu-timers: Migrate to use new tick dependency mask model
sched: Migrate sched to use new tick dependency mask model
sched: Account rr tasks
perf: Migrate perf to use new tick dependency mask model
nohz: Use enum code for tick stop failure tracing message
nohz: New tick dependency mask
nohz: Implement wide kick on top of irq work
atomic: Export fetch_or()
Linus Torvalds [Tue, 15 Mar 2016 02:14:06 +0000 (19:14 -0700)]
Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle are:
- Make schedstats a runtime tunable (disabled by default) and
optimize it via static keys.
As most distributions enable CONFIG_SCHEDSTATS=y due to its
instrumentation value, this is a nice performance enhancement.
(Mel Gorman)
- Implement 'simple waitqueues' (swait): these are just pure
waitqueues without any of the more complex features of full-blown
waitqueues (callbacks, wake flags, wake keys, etc.). Simple
waitqueues have less memory overhead and are faster.
Use simple waitqueues in the RCU code (in 4 different places) and
for handling KVM vCPU wakeups.
(Peter Zijlstra, Daniel Wagner, Thomas Gleixner, Paul Gortmaker,
Marcelo Tosatti)
- sched/numa enhancements (Rik van Riel)
- NOHZ performance enhancements (Rik van Riel)
- Various sched/deadline enhancements (Steven Rostedt)
- Various fixes (Peter Zijlstra)
- ... and a number of other fixes, cleanups and smaller enhancements"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits)
sched/cputime: Fix steal_account_process_tick() to always return jiffies
sched/deadline: Remove dl_new from struct sched_dl_entity
Revert "kbuild: Add option to turn incompatible pointer check into error"
sched/deadline: Remove superfluous call to switched_to_dl()
sched/debug: Fix preempt_disable_ip recording for preempt_disable()
sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity
time, acct: Drop irq save & restore from __acct_update_integrals()
acct, time: Change indentation in __acct_update_integrals()
sched, time: Remove non-power-of-two divides from __acct_update_integrals()
sched/rt: Kick RT bandwidth timer immediately on start up
sched/debug: Add deadline scheduler bandwidth ratio to /proc/sched_debug
sched/debug: Move sched_domain_sysctl to debug.c
sched/debug: Move the /sys/kernel/debug/sched_features file setup into debug.c
sched/rt: Fix PI handling vs. sched_setscheduler()
sched/core: Remove duplicated sched_group_set_shares() prototype
sched/fair: Consolidate nohz CPU load update code
sched/fair: Avoid using decay_load_missed() with a negative value
sched/deadline: Always calculate end of period on sched_yield()
sched/cgroup: Fix cgroup entity load tracking tear-down
rcu: Use simple wait queues where possible in rcutree
...
Linus Torvalds [Tue, 15 Mar 2016 01:43:51 +0000 (18:43 -0700)]
Merge branch 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RAS updates from Ingo Molnar:
"Various RAS updates:
- AMD MCE support updates for future CPUs, fixes and 'SMCA' (Scalable
MCA) error decoding support (Aravind Gopalakrishnan)
- x86 memcpy_mcsafe() support, to enable smart(er) hardware error
recovery in NVDIMM drivers, based on an extension of the x86
exception handling code. (Tony Luck)"
* 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
EDAC/sb_edac: Fix computation of channel address
x86/mm, x86/mce: Add memcpy_mcsafe()
x86/mce/AMD: Document some functionality
x86/mce: Clarify comments regarding deferred error
x86/mce/AMD: Fix logic to obtain block address
x86/mce/AMD, EDAC: Enable error decoding of Scalable MCA errors
x86/mce: Move MCx_CONFIG MSR definitions
x86/mce: Check for faults tagged in EXTABLE_CLASS_FAULT exception table entries
x86/mm: Expand the exception table logic to allow new handling options
x86/mce/AMD: Set MCAX Enable bit
x86/mce/AMD: Carve out threshold block preparation
x86/mce/AMD: Fix LVT offset configuration for thresholding
x86/mce/AMD: Reduce number of blocks scanned per bank
x86/mce/AMD: Do not perform shared bank check for future processors
x86/mce: Fix order of AMD MCE init function call
Linus Torvalds [Tue, 15 Mar 2016 00:58:53 +0000 (17:58 -0700)]
Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
"Main kernel side changes:
- Big reorganization of the x86 perf support code. The old code grew
organically deep inside arch/x86/kernel/cpu/perf* and its naming
became somewhat messy.
The new location is under arch/x86/events/, using the following
cleaner hierarchy of source code files:
- Add 'L' hotkey to dynamicly set the percent threshold for
histogram entries and callchains, i.e. dynamicly do what the
--percent-limit command line option to 'top' and 'report' does.
(Namhyung Kim)
perf mem:
- Allow specifying events via -e in 'perf mem record', also listing
what events can be specified via 'perf mem record -e list' (Jiri
Olsa)
perf record:
- Add 'perf record' --all-user/--all-kernel options, so that one
can tell that all the events in the command line should be
restricted to the user or kernel levels (Jiri Olsa), i.e.:
perf record -e cycles:u,instructions:u
is equivalent to:
perf record --all-user -e cycles,instructions
- Make 'perf record' collect CPU cache info in the perf.data file header:
$ perf record usleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.017 MB perf.data (7 samples) ]
$ perf report --header-only -I | tail -10 | head -8
# CPU cache info:
# L1 Data 32K [0-1]
# L1 Instruction 32K [0-1]
# L1 Data 32K [2-3]
# L1 Instruction 32K [2-3]
# L2 Unified 256K [0-1]
# L2 Unified 256K [2-3]
# L3 Unified 4096K [0-3]
Will be used in 'perf c2c' and eventually in 'perf diff' to
allow, for instance running the same workload in multiple
machines and then when using 'diff' show the hardware difference.
(Jiri Olsa)
- Improved support for Java, using the JVMTI agent library to do
jitdumps that then will be inserted in synthesized
PERF_RECORD_MMAP2 events via 'perf inject' pointed to synthesized
ELF files stored in ~/.debug and keyed with build-ids, to allow
symbol resolution and even annotation with source line info, see
the changeset comments to see how to use it (Stephane Eranian)
perf script/trace:
- Decode data_src values (e.g. perf.data files generated by 'perf
mem record') in 'perf script': (Jiri Olsa)
# perf script
perf 693 [1] 4.088652: 1 cpu/mem-loads,ldlat=30/P: ffff88007d0b0f4068100142 L1 hit|SNP None|TLB L1 or L2 hit|LCK No <SNIP>
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- Improve support to 'data_src', 'weight' and 'addr' fields in
'perf script' (Jiri Olsa)
- Handle empty print fmts in 'perf script -s' i.e. when running
python or perl scripts (Taeung Song)
perf stat:
- 'perf stat' now shows shadow metrics (insn per cycle, etc) in
interval mode too. E.g:
# perf stat -I 1000 -e instructions,cycles sleep 1
# time counts unit events
1.000215928 519,620 instructions # 0.69 insn per cycle
1.000215928 752,003 cycles
<SNIP>
- Port 'perf kvm stat' to PowerPC (Hemant Kumar)
- Implement CSV metrics output in 'perf stat' (Andi Kleen)
perf BPF support:
- Support converting data from bpf events in 'perf data' (Wang Nan)
- Print bpf-output events in 'perf script': (Wang Nan).
- Add API to set values of map entries in a BPF object, be it
individual map slots or ranges (Wang Nan)
- Introduce support for the 'bpf-output' event (Wang Nan)
- Add glue to read perf events in a BPF program (Wang Nan)
- Improve support for bpf-output events in 'perf trace' (Wang Nan)
... and tons of other changes as well - see the shortlog and git log
for details!"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (342 commits)
perf stat: Add --metric-only support for -A
perf stat: Implement --metric-only mode
perf stat: Document CSV format in manpage
perf hists browser: Check sort keys before hot key actions
perf hists browser: Allow thread filtering for comm sort key
perf tools: Add sort__has_comm variable
perf tools: Recalc total periods using top-level entries in hierarchy
perf tools: Remove nr_sort_keys field
perf hists browser: Cleanup hist_browser__fprintf_hierarchy_entry()
perf tools: Remove hist_entry->fmt field
perf tools: Fix command line filters in hierarchy mode
perf tools: Add more sort entry check functions
perf tools: Fix hist_entry__filter() for hierarchy
perf jitdump: Build only on supported archs
tools lib traceevent: Add '~' operation within arg_num_eval()
perf tools: Omit unnecessary cast in perf_pmu__parse_scale
perf tools: Pass perf_hpp_list all the way through setup_sort_list
perf tools: Fix perf script python database export crash
perf jitdump: DWARF is also needed
perf bench mem: Prepare the x86-64 build for upstream memcpy_mcsafe() changes
...
Linus Torvalds [Mon, 14 Mar 2016 23:58:50 +0000 (16:58 -0700)]
Merge branch 'mm-readonly-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull read-only kernel memory updates from Ingo Molnar:
"This tree adds two (security related) enhancements to the kernel's
handling of read-only kernel memory:
- extend read-only kernel memory to a new class of formerly writable
kernel data: 'post-init read-only memory' via the __ro_after_init
attribute, and mark the ARM and x86 vDSO as such read-only memory.
This kind of attribute can be used for data that requires a once
per bootup initialization sequence, but is otherwise never modified
after that point.
This feature was based on the work by PaX Team and Brad Spengler.
(by Kees Cook, the ARM vDSO bits by David Brown.)
- make CONFIG_DEBUG_RODATA always enabled on x86 and remove the
Kconfig option. This simplifies the kernel and also signals that
read-only memory is the default model and a first-class citizen.
(Kees Cook)"
* 'mm-readonly-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
ARM/vdso: Mark the vDSO code read-only after init
x86/vdso: Mark the vDSO code read-only after init
lkdtm: Verify that '__ro_after_init' works correctly
arch: Introduce post-init read-only memory
x86/mm: Always enable CONFIG_DEBUG_RODATA and remove the Kconfig option
mm/init: Add 'rodata=off' boot cmdline parameter to disable read-only kernel mappings
asm-generic: Consolidate mark_rodata_ro()
Linus Torvalds [Mon, 14 Mar 2016 23:31:41 +0000 (16:31 -0700)]
Merge branch 'mm-pat-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull dma_*_writecombine rename from Ingo Molnar:
"Rename dma_*_writecombine() to dma_*_wc()
This is a tree-wide API rename, to move the dma_*() write-combining
APIs closer in name to their usual API families. (The old API names
are kept as compatibility wrappers to not introduce extra breakage.)
The patch was Coccinelle generated"
* 'mm-pat-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
dma, mm/pat: Rename dma_*_writecombine() to dma_*_wc()
Linus Torvalds [Mon, 14 Mar 2016 22:50:44 +0000 (15:50 -0700)]
Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking changes from Ingo Molnar:
"Various updates:
- Futex scalability improvements: remove page lock use for shared
futex get_futex_key(), which speeds up 'perf bench futex hash'
benchmarks by over 40% on a 60-core Westmere. This makes anon-mem
shared futexes perform close to private futexes. (Mel Gorman)
- lockdep hash collision detection and fix (Alfredo Alvarez
Fernandez)
Linus Torvalds [Mon, 14 Mar 2016 22:15:51 +0000 (15:15 -0700)]
Merge branch 'core-resources-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull ram resource handling changes from Ingo Molnar:
"Core kernel resource handling changes to support NVDIMM error
injection.
This tree introduces a new I/O resource type, IORESOURCE_SYSTEM_RAM,
for System RAM while keeping the current IORESOURCE_MEM type bit set
for all memory-mapped ranges (including System RAM) for backward
compatibility.
With this resource flag it no longer takes a strcmp() loop through the
resource tree to find "System RAM" resources.
The new resource type is then used to extend ACPI/APEI error injection
facility to also support NVDIMM"
* 'core-resources-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
ACPI/EINJ: Allow memory error injection to NVDIMM
resource: Kill walk_iomem_res()
x86/kexec: Remove walk_iomem_res() call with GART type
x86, kexec, nvdimm: Use walk_iomem_res_desc() for iomem search
resource: Add walk_iomem_res_desc()
memremap: Change region_intersects() to take @flags and @desc
arm/samsung: Change s3c_pm_run_res() to use System RAM type
resource: Change walk_system_ram() to use System RAM type
drivers: Initialize resource entry to zero
xen, mm: Set IORESOURCE_SYSTEM_RAM to System RAM
kexec: Set IORESOURCE_SYSTEM_RAM for System RAM
arch: Set IORESOURCE_SYSTEM_RAM flag for System RAM
ia64: Set System RAM type and descriptor
x86/e820: Set System RAM type and descriptor
resource: Add I/O resource descriptor
resource: Handle resource flags properly
resource: Add System RAM resource type
Linus Torvalds [Sun, 13 Mar 2016 20:04:46 +0000 (13:04 -0700)]
Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus
Pull MIPS fixes from Ralf Baechle:
"Another round of MIPS fixes for 4.5:
- Fix JZ4780 build with DEBUG_ZBOOT and MACH_JZ4780
- Fix build with DEBUG_ZBOOT and MACH_JZ4780
- Fix issue with uninitialised temp_foreign_map
- Fix awk regex compile failure with certain versions of awk. At
this time, the sole user, ld-ifversion, is only used on MIPS"
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
MIPS: smp.c: Fix uninitialised temp_foreign_map
MIPS: Fix build error when SMP is used without GIC
ld-version: Fix awk regex compile failure
MIPS: Fix build with DEBUG_ZBOOT and MACH_JZ4780
James Hogan [Fri, 4 Mar 2016 10:10:51 +0000 (10:10 +0000)]
MIPS: smp.c: Fix uninitialised temp_foreign_map
When calculate_cpu_foreign_map() recalculates the cpu_foreign_map
cpumask it uses the local variable temp_foreign_map without initialising
it to zero. Since the calculation only ever sets bits in this cpumask
any existing bits at that memory location will remain set and find their
way into cpu_foreign_map too. This could potentially lead to cache
operations suboptimally doing smp calls to multiple VPEs in the same
core, even though the VPEs share primary caches.
Therefore initialise temp_foreign_map using cpumask_clear() before use.
Fixes: cccf34e9411c ("MIPS: c-r4k: Fix cache flushing for MT cores") Signed-off-by: James Hogan <james.hogan@imgtec.com> Cc: Paul Burton <paul.burton@imgtec.com> Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/12759/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Hauke Mehrtens [Sun, 6 Mar 2016 21:28:56 +0000 (22:28 +0100)]
MIPS: Fix build error when SMP is used without GIC
The MIPS_GIC_IPI should only be selected when MIPS_GIC is also
selected, otherwise it results in a compile error. smp-gic.c uses some
functions from include/linux/irqchip/mips-gic.h like
plat_ipi_call_int_xlate() which are only added to the header file when
MIPS_GIC is set. The Lantiq SoC does not use the GIC, but supports SMP.
The calls top the functions from smp-gic.c are already protected by
some #ifdefs
The first part of this was introduced in commit 72e20142b2bf ("MIPS:
Move GIC IPI functions out of smp-cmp.c")
James Hogan [Tue, 8 Mar 2016 16:47:53 +0000 (16:47 +0000)]
ld-version: Fix awk regex compile failure
The ld-version.sh script fails on some versions of awk with the
following error, resulting in build failures for MIPS:
awk: scripts/ld-version.sh: line 4: regular expression compile failed (missing '(')
This is due to the regular expression ".*)", meant to strip off the
beginning of the ld version string up to the close bracket, however
brackets have a meaning in regular expressions, so lets escape it so
that awk doesn't expect a corresponding open bracket.
Fixes: ccbef1674a15 ("Kbuild, lto: add ld-version and ld-ifversion ...") Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: James Hogan <james.hogan@imgtec.com> Tested-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Sudip Mukherjee <sudip.mukherjee@codethink.co.uk> Cc: Michal Marek <mmarek@suse.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: linux-mips@linux-mips.org Cc: linux-kbuild@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: stable@vger.kernel.org # 4.4.x-
Patchwork: https://patchwork.linux-mips.org/patch/12838/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Aaro Koskinen [Mon, 7 Mar 2016 22:06:36 +0000 (00:06 +0200)]
MIPS: Fix build with DEBUG_ZBOOT and MACH_JZ4780
Ingenic SoC declares ZBOOT support, but debug definitions are missing
for MACH_JZ4780 resulting in a build failure when DEBUG_ZBOOT is set.
The UART addresses are same as with JZ4740, so fix by covering JZ4780
with those as well.
Linus Torvalds [Sun, 13 Mar 2016 04:09:25 +0000 (20:09 -0800)]
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"This fixes 3 FPU handling related bugs, an EFI boot crash and a
runtime warning.
The EFI fix arrived late but I didn't want to delay it to after v4.5
because the effects are pretty bad for the systems that are affected
by it"
[ Actually, I don't think the EFI fix really matters yet, because we
haven't switched to the separate EFI page tables in mainline yet ]
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/efi: Fix boot crash by always mapping boot service regions into new EFI page tables
x86/fpu: Fix eager-FPU handling on legacy FPU machines
x86/delay: Avoid preemptible context checks in delay_mwaitx()
x86/fpu: Revert ("x86/fpu: Disable AVX when eagerfpu is off")
x86/fpu: Fix 'no387' regression
Pull SCSI target bug fix from Nicholas Bellinger:
"Here is an outstanding target-core bug-fix for v4.5 code."
This patch addresses a recent Task Management (TMR) regression related
to larger set of multi-port LUN_RESET bug-fixes in v4.5-rc5.
It drops a left-over target_put_sess_cmd() of se_cmd->cmd_kref within
ABORT_TASK failure path, once a se_cmd descriptor has already
completed posting response to fabric driver logic, and must be skipped
during normal ABORT_TASK se_cmd->tag lookup"
* git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
target: Drop incorrect ABORT_TASK put for completed commands
Ming Lei [Sat, 12 Mar 2016 14:56:19 +0000 (22:56 +0800)]
block: don't optimize for non-cloned bio in bio_get_last_bvec()
For !BIO_CLONED bio, we can use .bi_vcnt safely, but it
doesn't mean we can just simply return .bi_io_vec[.bi_vcnt - 1]
because the start postion may have been moved in the middle of
the bvec, such as splitting in the middle of bvec.
Fixes: 7bcd79ac50d9(block: bio: introduce helpers to get the 1st and last bvec) Cc: stable@vger.kernel.org Reported-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Matt Fleming [Fri, 11 Mar 2016 11:19:23 +0000 (11:19 +0000)]
x86/efi: Fix boot crash by always mapping boot service regions into new EFI page tables
Some machines have EFI regions in page zero (physical address
0x00000000) and historically that region has been added to the e820
map via trim_bios_range(), and ultimately mapped into the kernel page
tables. It was not mapped via efi_map_regions() as one would expect.
Alexis reports that with the new separate EFI page tables some boot
services regions, such as page zero, are not mapped. This triggers an
oops during the SetVirtualAddressMap() runtime call.
For the EFI boot services quirk on x86 we need to memblock_reserve()
boot services regions until after SetVirtualAddressMap(). Doing that
while respecting the ownership of regions that may have already been
reserved by the kernel was the motivation behind this commit:
7d68dc3f1003 ("x86, efi: Do not reserve boot services regions within reserved areas")
That patch was merged at a time when the EFI runtime virtual mappings
were inserted into the kernel page tables as described above, and the
trick of setting ->numpages (and hence the region size) to zero to
track regions that should not be freed in efi_free_boot_services()
meant that we never mapped those regions in efi_map_regions(). Instead
we were relying solely on the existing kernel mappings.
Now that we have separate page tables we need to make sure the EFI
boot services regions are mapped correctly, even if someone else has
already called memblock_reserve(). Instead of stashing a tag in
->numpages, set the EFI_MEMORY_RUNTIME bit of ->attribute. Since it
generally makes no sense to mark a boot services region as required at
runtime, it's pretty much guaranteed the firmware will not have
already set this bit.
For the record, the specific circumstances under which Alexis
triggered this bug was that an EFI runtime driver on his machine was
responding to the EVT_SIGNAL_VIRTUAL_ADDRESS_CHANGE event during
SetVirtualAddressMap().
The event handler for this driver looks like this,
Which is pretty typical code for an EVT_SIGNAL_VIRTUAL_ADDRESS_CHANGE
handler. The "mov r11, QWORD PTR [rip+0x2424]" was the faulting
instruction because ConvertPointer() was being called to convert the
address 0x0000000000000000, which when converted is left unchanged and
remains 0x0000000000000000.
The output of the oops trace gave the impression of a standard NULL
pointer dereference bug, but because we're accessing physical
addresses during ConvertPointer(), it wasn't. EFI boot services code
is stored at that address on Alexis' machine.
Reported-by: Alexis Murzeau <amurzeau@gmail.com> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Maarten Lankhorst <maarten.lankhorst@canonical.com> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Raphael Hertzog <hertzog@debian.org> Cc: Roger Shimizu <rogershimizu@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-efi@vger.kernel.org Link: http://lkml.kernel.org/r/1457695163-29632-2-git-send-email-matt@codeblueprint.co.uk Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=815125 Signed-off-by: Ingo Molnar <mingo@kernel.org>
Borislav Petkov [Fri, 11 Mar 2016 11:32:06 +0000 (12:32 +0100)]
x86/fpu: Fix eager-FPU handling on legacy FPU machines
i486 derived cores like Intel Quark support only the very old,
legacy x87 FPU (FSAVE/FRSTOR, CPUID bit FXSR is not set), and
our FPU code wasn't handling the saving and restoring there
properly in the 'eagerfpu' case.
So after we made eagerfpu the default for all CPU types:
58122bf1d856 x86/fpu: Default eagerfpu=on on all CPUs
these old FPU designs broke. First, Andy Shevchenko reported a splat:
WARNING: CPU: 0 PID: 823 at arch/x86/include/asm/fpu/internal.h:163 fpu__clear+0x8c/0x160
which was us trying to execute FXRSTOR on those machines even though
they don't support it.
After taking care of that, Bryan O'Donoghue reported that a simple FPU
test still failed because we weren't initializing the FPU state properly
on those machines.
Take care of all that.
Reported-and-tested-by: Bryan O'Donoghue <pure.logic@nexus-software.ie> Reported-by: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yu-cheng <yu-cheng.yu@intel.com> Link: http://lkml.kernel.org/r/20160311113206.GD4312@pd.tnic Signed-off-by: Ingo Molnar <mingo@kernel.org>
Linus Torvalds [Sat, 12 Mar 2016 00:34:18 +0000 (16:34 -0800)]
Merge tag 'for-linus-20160311' of git://git.infradead.org/linux-mtd
Pull MTD fixes from Brian Norris:
"Late MTD fix for v4.5:
- A simple error code handling fix for the NAND ECC test; this was a
regression in v4.5-rc1
- A MAINTAINERS update, which might as well go in ASAP"
* tag 'for-linus-20160311' of git://git.infradead.org/linux-mtd:
MAINTAINERS: add a maintainer for the NAND subsystem
mtd: nand: tests: fix regression introduced in mtd_nandectest
Linus Torvalds [Sat, 12 Mar 2016 00:19:23 +0000 (16:19 -0800)]
Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
Pull drm/i915 fixes from Dave Airlie:
"Just two i915 regression fixes, that should be it from me"
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
drm/i915: Actually retry with bit-banging after GMBUS timeout
drm/i915: Fix bogus dig_port_map[] assignment for pre-HSW
Matthew Dawson [Fri, 11 Mar 2016 21:08:07 +0000 (13:08 -0800)]
mm/mempool: avoid KASAN marking mempool poison checks as use-after-free
When removing an element from the mempool, mark it as unpoisoned in KASAN
before verifying its contents for SLUB/SLAB debugging. Otherwise KASAN
will flag the reads checking the element use-after-free writes as
use-after-free reads.
Signed-off-by: Matthew Dawson <matthew@mjdsystems.ca> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 11 Mar 2016 20:35:54 +0000 (12:35 -0800)]
Merge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC fixes from Olof Johansson:
"Two more fixes for 4.5:
- One is a fix for OMAP that is urgently needed to avoid DRA7xx chips
from premature aging, by always keeping the Ethernet clock enabled.
- The other solves a I/O memory layout issue on Armada, where SROM
and PCI memory windows were conflicting in some configurations"
* tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
ARM: mvebu: fix overlap of Crypto SRAM with PCIe memory window
ARM: dts: dra7: do not gate cpsw clock due to errata i877
ARM: OMAP2+: hwmod: Introduce ti,no-idle dt property
Linus Torvalds [Fri, 11 Mar 2016 20:32:02 +0000 (12:32 -0800)]
Merge tag 'media/v4.5-5' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media fix from Mauro Carvalho Chehab:
"One last time fix: It adds a code that prevents some media tools like
media-ctl to hide some entities that have their IDs out of the range
expected by those apps"
* tag 'media/v4.5-5' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
[media] media-device: map new functions into old types for legacy API
ARM: mvebu: fix overlap of Crypto SRAM with PCIe memory window
When the Crypto SRAM mappings were added to the Device Tree files
describing the Armada XP boards in commit c466d997bb16 ("ARM: mvebu:
define crypto SRAM ranges for all armada-xp boards"), the fact that
those mappings were overlaping with the PCIe memory aperture was
overlooked. Due to this, we currently have for all Armada XP platforms
a situation that looks like this:
Memory mapping on Armada XP boards with internal registers at
0xf1000000:
The overlap means that when PCIe devices are added, depending on their
memory window needs, they might or might not be mapped into the
physical address space. Indeed, they will not be mapped if the area
allocated in the PCIe memory aperture by the PCI core overlaps with
one of the Crypto SRAM. Typically, a Intel IGB PCIe NIC that needs 8MB
of PCIe memory will see its PCIe memory window allocated from
0xf80000000 for 8MB, which overlaps with the Crypto SRAM windows. Due
to this, the PCIe window is not created, and any attempt to access the
PCIe window makes the kernel explode:
[ 3.302213] igb: Copyright (c) 2007-2014 Intel Corporation.
[ 3.307841] pci 0000:00:09.0: enabling device (0140 -> 0143)
[ 3.313539] mvebu_mbus: cannot add window '4:f8', conflicts with another window
[ 3.320870] mvebu-pcie soc:pcie-controller: Could not create MBus window at [mem 0xf8000000-0xf87fffff]: -22
[ 3.330811] Unhandled fault: external abort on non-linefetch (0x1008) at 0xf08c0018
This problem does not occur on Armada 370 boards, because we use the
following memory mapping (for boards that have internal registers at
0xf1000000):
Obviously, the solution is to align the location of the Crypto SRAM
mappings of Armada XP to be similar with the ones on Armada 370, i.e
have them between the "internal registers" area and the beginning of
the PCIe aperture.
However, we have a special case with the OpenBlocks AX3-4 platform,
which has a 128 MB NOR flash. Currently, this NOR flash is mapped from
0xf0000000 to 0xf8000000. This is possible because on OpenBlocks
AX3-4, the internal registers are not at 0xf1000000. And this explains
why the Crypto SRAM mappings were not configured at the same place on
Armada XP.
Hence, the solution is two-fold:
(1) Move the NOR flash mapping on Armada XP OpenBlocks AX3-4 from
0xe8000000 to 0xf0000000. This frees the 0xf0000000 ->
0xf80000000 space.
(2) Move the Crypto SRAM mappings on Armada XP to be similar to
Armada 370 (except of course that Armada XP has two Crypto SRAM
and not one).
After this patch, the memory mapping on Armada XP boards with
registers at 0xf1 is:
Linus Torvalds [Fri, 11 Mar 2016 18:45:03 +0000 (10:45 -0800)]
Merge tag 'pm+acpi-4.5-final' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management and ACPI fixes from Rafael Wysocki:
"Two more fixes for issues introduced recently, one in the generic
device properties framework and one in ACPICA.
Specifics:
- Revert a recent ACPICA commit that has been reverted upstream,
because it caused problems to happen on user systems and the
problem it attempted to address will not be relevant any more after
upcoming ACPI specification changes (Bob Moore).
- Fix crash in the generic device properties framework introduced by
a recent change that forgot to check pointers against error values
in addition to checking them against NULL (Heikki Krogerus)"
* tag 'pm+acpi-4.5-final' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
device property: fwnode->secondary may contain ERR_PTR(-ENODEV)
ACPICA: Revert "Parser: Fix for SuperName method invocation"
Linus Torvalds [Fri, 11 Mar 2016 18:21:32 +0000 (10:21 -0800)]
Merge tag 'xfs-for-linus-4.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs
Pull xfs fixes from Dave Chinner:
"This is a fix for a regression introduced in 4.5-rc1 by the new torn
log write detection code. The regression only affects people moving a
clean filesystem between machines/kernels of different architecture
(such as changing between 32 bit and 64 bit kernels), but this is the
recommended (and only!) safe way to migrate a filesystem between
architectures so we really need to ensure it works.
The changes are larger than I'd prefer right at the end of the release
cycle, but the majority of the change is just factoring code to enable
the detection of a clean log at the correct time to avoid this issue.
Changes:
- Only perform torn log write detection on dirty logs. This prevents
failures being detected due to a clean filesystem being moved
between machines or kernels of different architectures (e.g. 32 ->
64 bit, BE -> LE, etc). This fixes a regression introduced by the
torn log write detection in 4.5-rc1"
* tag 'xfs-for-linus-4.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
xfs: only run torn log write detection on dirty logs
xfs: refactor in-core log state update to helper
xfs: refactor unmount record detection into helper
xfs: separate log head record discovery from verification
Linus Torvalds [Fri, 11 Mar 2016 18:13:49 +0000 (10:13 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
"A couple of fixes: Fix for my dumb braino in ncpfs and a long-standing
breakage on recovery from failed rename() in jffs2"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
jffs2: reduce the breakage on recovery from halfway failed rename()
ncpfs: fix a braino in OOM handling in ncp_fill_cache()
Ville SyrjÀlÀ [Mon, 7 Mar 2016 15:56:57 +0000 (17:56 +0200)]
drm/i915: Actually retry with bit-banging after GMBUS timeout
After the GMBUS transfer times out, we set force_bit=1 and
return -EAGAIN expecting the i2c core to call the .master_xfer
hook again so that we will retry the same transfer via bit-banging.
This is in case the gmbus hardware is somehow faulty.
Unfortunately we left adapter->retries to 0, meaning the i2c core
didn't actually do the retry. Let's tell the core we want one retry
when we return -EAGAIN.
Note that i2c-algo-bit also uses this retry count for some internal
retries, so we'll end up increasing those a bit as well.
Andi Kleen [Thu, 3 Mar 2016 23:57:36 +0000 (15:57 -0800)]
perf stat: Implement --metric-only mode
Add a new mode to only print metrics. Sometimes we don't care about the
raw values, just want the computed metrics. This allows more compact
printing, so with -I each sample is only a single line. This also
allows easier plotting and processing with other tools.
The main target is with using --topdown, but it also works with -T and
standard perf stat. A few metrics are not supported.
To avoiding having to hardcode all the metrics in the code it uses a two
pass approach: first compute dummy metrics and only print the headers in
the print_metric callback. Then use the callback to print the actual
values.
There are some additional changes in the stat printout code to handle
all metrics being on a single line.
One issue is that the column code doesn't know in advance what events
are not supported by the CPU, and it would be hard to find out as this
could change based on dynamic conditions. That causes empty columns in
some cases.
The output can be fairly wide, often you may need more than 80 columns.
Example:
% perf stat -a -I 1000 --metric-only
1.001452803 frontend cycles idle insn per cycle stalled cycles per insn branch-misses of all branches
1.001452803 158.91% 0.66 2.39 2.92%
2.002192321 180.63% 0.76 2.08 2.96%
3.003088282 150.59% 0.62 2.57 2.84%
4.004369835 196.20% 0.98 1.62 3.79%
5.005227314 231.98% 0.84 1.90 4.71%
v2: Lots of updates.
v3: Use slightly narrower columns
v4: Add comment
Andi Kleen [Thu, 3 Mar 2016 23:57:35 +0000 (15:57 -0800)]
perf stat: Document CSV format in manpage
With all the recently added fields in the perf stat CSV output we should
finally document them in the man page. Do this here.
v2: Fix fields in documentation (Jiri)
v3: fix order of fields again (Jiri)
v4: Change order again.
v5: Document more fields (Jiri)
v6: Move time stamp first
v7: More fixes (Jiri)
Namhyung Kim [Wed, 9 Mar 2016 14:20:53 +0000 (23:20 +0900)]
perf hists browser: Check sort keys before hot key actions
The context menu in TUI hists browser checks corresponding sort keys
when creating the menu item. But hotkey actions lacks these checks so
it can filter using incorrect info.
For example, default sort key of 'perf top' doesn't contain 'comm' or
'pid' sort key so each hist entry's thread info is not reliable. Thus
it should prohibit using thread filter on 't' key.
Namhyung Kim [Wed, 9 Mar 2016 15:14:50 +0000 (00:14 +0900)]
perf hists browser: Allow thread filtering for comm sort key
The commit 2eafd410e669 ("perf hists browser: Only 'Zoom into thread'
only when sort order has 'pid'") disabled thread filtering in hist
browser for the default sort key. However the he->thread is still valid
even if 'pid' sort key is not given. Only thing it should not use is
the pid (or tid) of the thread. So allow to filter by thread when
'comm' sort key is given and show pid only if 'pid' sort key is given.
Namhyung Kim [Wed, 9 Mar 2016 14:20:51 +0000 (23:20 +0900)]
perf tools: Add sort__has_comm variable
The sort__has_comm variable is to check whether the comm sort key is
given. This is necessary to support thread filtering in the TUI hists
browser later.
Namhyung Kim [Wed, 9 Mar 2016 13:47:02 +0000 (22:47 +0900)]
perf tools: Recalc total periods using top-level entries in hierarchy
When hierarchy mode is enabled, each entry in a hierarchy level shares
the period. IOW an upper level entry's period is the sum of lower level
entries. Thus perf uses only one of them to calculate the total period
of hists. It was lowest-level (leaf) entries but it has a problem when
it comes to filters.
If a filter is applied, entries in the same level will be filtered or
not. But upper level entries still have period of their sum including
filtered one. So total sum of upper level entries will not be same as
sum of lower level entries.
This resulted in entries having more than 100% of overhead and it can be
produced using perf top with filter(s).
Reported-and-Tested-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Ahern <dsahern@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/1457531222-18130-8-git-send-email-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Namhyung Kim [Wed, 9 Mar 2016 13:47:01 +0000 (22:47 +0900)]
perf tools: Remove nr_sort_keys field
The nr_sort_keys field is to carry the number of sort entries in a
hpp_list or hists to determine the depth of indentation of a hist entry.
As it's only used in hierarchy mode and now we have used nr_hpp_node for
this reason, there's no need to keep it anymore. Let's get rid of it.
Signed-off-by: Namhyung Kim <namhyung@kernel.org> Tested-by: Jiri Olsa <jolsa@kernel.org> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Ahern <dsahern@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/1457531222-18130-7-git-send-email-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The hist_browser__fprintf_hierarchy_entry() if to dump current output
into a file so it needs to be sync-ed with the corresponding function
hist_browser__show_hierarchy_entry(). So use hists->nr_hpp_node to
indent width and use first fmt_node to print overhead columns instead of
checking whether it's a sort entry (or dynamic entry).
Signed-off-by: Namhyung Kim <namhyung@kernel.org> Tested-by: Jiri Olsa <jolsa@kernel.org> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Ahern <dsahern@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/1457531222-18130-6-git-send-email-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Namhyung Kim [Wed, 9 Mar 2016 13:46:58 +0000 (22:46 +0900)]
perf tools: Fix command line filters in hierarchy mode
When a command-line filter is applied in hierarchy mode, output is
broken especially when filtering on lower level. The higher level
entries doesn't show up so it's hard to see the results.
Also it needs to handle multi sort keys in a single hierarchy level.
Namhyung Kim [Wed, 9 Mar 2016 13:46:57 +0000 (22:46 +0900)]
perf tools: Add more sort entry check functions
Those functions are for checkinf if a given perf_hpp_fmt is a
filter-related sort entry. With hierarchy mode, it needs to check
filters on the hist entries with its own hpp format list.
Signed-off-by: Namhyung Kim <namhyung@kernel.org> Tested-by: Jiri Olsa <jolsa@kernel.org> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Ahern <dsahern@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/1457531222-18130-3-git-send-email-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Namhyung Kim [Wed, 9 Mar 2016 13:46:56 +0000 (22:46 +0900)]
perf tools: Fix hist_entry__filter() for hierarchy
When hierarchy mode is enabled each output format is in a separate hpp
list. So when applying a filter it should check all formats in the
list. Currently it only checks a single ->fmt field which was not set
properly.
Signed-off-by: Namhyung Kim <namhyung@kernel.org> Tested-by: Jiri Olsa <jolsa@kernel.org> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Ahern <dsahern@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/1457531222-18130-2-git-send-email-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jiri Olsa [Thu, 10 Mar 2016 16:41:13 +0000 (17:41 +0100)]
perf jitdump: Build only on supported archs
Build jitdump only on architectures defined in util/genelf.h file, to avoid
breaking the build on such arches.
Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Borislav Petkov <bp@suse.de> Cc: Colin Ian King <colin.king@canonical.com> Cc: David Ahern <dsahern@gmail.com> Cc: Davidlohr Bueso <dbueso@suse.com> Cc: He Kuang <hekuang@huawei.com> Cc: Mel Gorman <mgorman@suse.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/20160310164113.GA11357@krava.redhat.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Boris BREZILLON [Wed, 9 Mar 2016 20:57:24 +0000 (21:57 +0100)]
MAINTAINERS: add a maintainer for the NAND subsystem
Add myself as the maintainer of the NAND subsystem.
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com> Acked-by: Richard Weinberger <richard@nod.at> Acked-by: Brian Norris <computersforpeace@gmail.com> Acked-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Brian Norris <computersforpeace@gmail.com>
Linus Torvalds [Thu, 10 Mar 2016 18:42:15 +0000 (10:42 -0800)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"A few simple fixes for ARM, x86, PPC and generic code.
The x86 MMU fix is a bit larger because the surrounding code needed a
cleanup, but nothing worrisome"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: MMU: fix reserved bit check for ept=0/CR0.WP=0/CR4.SMEP=1/EFER.NX=0
KVM: MMU: fix ept=0/pte.u=1/pte.w=0/CR0.WP=0/CR4.SMEP=1/EFER.NX=0 combo
kvm: cap halt polling at exactly halt_poll_ns
KVM: s390: correct fprs on SIGP (STOP AND) STORE STATUS
KVM: VMX: disable PEBS before a guest entry
KVM: PPC: Book3S HV: Sanitize special-purpose register values on guest exit
Linus Torvalds [Thu, 10 Mar 2016 18:39:04 +0000 (10:39 -0800)]
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"I thought we were done for 4.5, but then the 64k-page chaps came
crawling out of the woodwork. *sigh*
The vmemmap fix I sent for -rc7 caused a regression with 64k pages and
sparsemem and at some point during the release cycle the new hugetlb
code using contiguous ptes started failing the libhugetlbfs tests with
64k pages enabled.
So here are a couple of patches that fix the vmemmap alignment and
disable the new hugetlb page sizes whilst a proper fix is being
developed:
- Temporarily disable huge pages built using contiguous ptes
- Ensure vmemmap region is sufficiently aligned for sparsemem
sections"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: hugetlb: partial revert of 66b3923a1a0f
arm64: account for sparsemem section alignment when choosing vmemmap offset
Linus Torvalds [Thu, 10 Mar 2016 18:36:07 +0000 (10:36 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Martin Schwidefsky:
"Three bug fixes:
- The fix for the page table corruption (CVE-2016-2143)
- The diagnose statistics introduced a regression for the dasd diag
driver
- Boot crash on systems without the set-program-parameters facility"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/mm: four page table levels vs. fork
s390/cpumf: Fix lpp detection
s390/dasd: fix diag 0x250 inline assembly
[media] media-device: map new functions into old types for legacy API
The legacy media controller userspace API exposes entity types that
carry both type and function information. The new API replaces the type
with a function. It preserves backward compatibility by defining legacy
functions for the existing types and using them in drivers.
This works fine, as long as newer entity functions won't be added.
Unfortunately, some tools, like media-ctl with --print-dot argument
rely on the now legacy MEDIA_ENT_T_V4L2_SUBDEV and MEDIA_ENT_T_DEVNODE
numeric ranges to identify what entities will be shown.
Also, if the entity doesn't match those ranges, it will ignore the
major/minor information on devnodes, and won't be getting the devnode
name via udev or sysfs.
As we're now adding devices outside the old range, the legacy ioctl
needs to map the new entity functions into a type at the old range,
or otherwise we'll have a regression.
Detected on all released media-ctl versions (e. g. versions <= 1.10).
Fix this by deriving the type from the function to emulate the legacy
API if the function isn't in the legacy functions range.
Luck, Tony [Thu, 10 Mar 2016 00:40:48 +0000 (16:40 -0800)]
EDAC/sb_edac: Fix computation of channel address
Large memory Haswell-EX systems with multiple DIMMs per channel were
sometimes reporting the wrong DIMM.
Found three problems:
1) Debug printouts for socket and channel interleave were not interpreting
the register fields correctly. The socket interleave field is a 2^X
value (0=1, 1=2, 2=4, 3=8). The channel interleave is X+1 (0=1, 1=2,
2=3. 3=4).
2) Actual use of the socket interleave value didn't interpret as 2^X
3) Conversion of address to channel address was complicated, and wrong.
When computing the residue we need two pieces of information: the current
descriptor and the remaining data of the current descriptor. To get
that information, we need to read consecutively two registers but we
can't do it in an atomic way. For that reason, we have to check manually
that current descriptor has not changed.
Signed-off-by: Ludovic Desroches <ludovic.desroches@atmel.com> Suggested-by: Cyrille Pitchen <cyrille.pitchen@atmel.com> Reported-by: David Engraf <david.engraf@sysgo.com> Tested-by: David Engraf <david.engraf@sysgo.com> Fixes: e1f7c9eee707 ("dmaengine: at_xdmac: creation of the atmel
eXtended DMA Controller driver") Cc: stable@vger.kernel.org #4.1 and later Signed-off-by: Vinod Koul <vinod.koul@intel.com>
Borislav Petkov [Wed, 9 Mar 2016 20:56:22 +0000 (21:56 +0100)]
x86/delay: Avoid preemptible context checks in delay_mwaitx()
We do use this_cpu_ptr(&cpu_tss) as a cacheline-aligned, seldomly
accessed per-cpu var as the MONITORX target in delay_mwaitx(). However,
when called in preemptible context, this_cpu_ptr -> smp_processor_id() ->
debug_smp_processor_id() fires:
BUG: using smp_processor_id() in preemptible [00000000] code: udevd/312
caller is delay_mwaitx+0x40/0xa0
But we don't care about that check - we only need cpu_tss as a MONITORX
target and it doesn't really matter which CPU's var we're touching as
we're going idle anyway. Fix that.
Suggested-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Huang Rui <ray.huang@amd.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: spg_linux_kernel@amd.com Link: http://lkml.kernel.org/r/20160309205622.GG6564@pd.tnic Signed-off-by: Ingo Molnar <mingo@kernel.org>
Paolo Bonzini [Wed, 9 Mar 2016 13:28:02 +0000 (14:28 +0100)]
KVM: MMU: fix reserved bit check for ept=0/CR0.WP=0/CR4.SMEP=1/EFER.NX=0
KVM has special logic to handle pages with pte.u=1 and pte.w=0 when
CR0.WP=1. These pages' SPTEs flip continuously between two states:
U=1/W=0 (user and supervisor reads allowed, supervisor writes not allowed)
and U=0/W=1 (supervisor reads and writes allowed, user writes not allowed).
When SMEP is in effect, however, U=0 will enable kernel execution of
this page. To avoid this, KVM also sets NX=1 in the shadow PTE together
with U=0, making the two states U=1/W=0/NX=gpte.NX and U=0/W=1/NX=1.
When guest EFER has the NX bit cleared, the reserved bit check thinks
that the latter state is invalid; teach it that the smep_andnot_wp case
will also use the NX bit of SPTEs.
Yes, all of these are needed. :) This is admittedly a bit odd, but
kvm-unit-tests access.flat tests this if you run it with "-cpu host"
and of course ept=0.
KVM runs the guest with CR0.WP=1, so it must handle supervisor writes
specially when pte.u=1/pte.w=0/CR0.WP=0. Such writes cause a fault
when U=1 and W=0 in the SPTE, but they must succeed because CR0.WP=0.
When KVM gets the fault, it sets U=0 and W=1 in the shadow PTE and
restarts execution. This will still cause a user write to fault, while
supervisor writes will succeed. User reads will fault spuriously now,
and KVM will then flip U and W again in the SPTE (U=1, W=0). User reads
will be enabled and supervisor writes disabled, going back to the
originary situation where supervisor writes fault spuriously.
When SMEP is in effect, however, U=0 will enable kernel execution of
this page. To avoid this, KVM also sets NX=1 in the shadow PTE together
with U=0. If the guest has not enabled NX, the result is a continuous
stream of page faults due to the NX bit being reserved.
The fix is to force EFER.NX=1 even if the CPU is taking care of the EFER
switch. (All machines with SMEP have the CPU_LOAD_IA32_EFER vm-entry
control, so they do not use user-return notifiers for EFER---if they did,
EFER.NX would be forced to the same value as the host).
There is another bug in the reserved bit check, which I've split to a
separate patch for easier application to stable kernels.
Cc: stable@vger.kernel.org Cc: Andy Lutomirski <luto@amacapital.net> Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Fixes: f6577a5fa15d82217ca73c74cd2dcbc0f6c781dd Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Davidlohr Bueso [Thu, 10 Mar 2016 01:55:36 +0000 (17:55 -0800)]
locking/csd_lock: Use smp_cond_acquire() in csd_lock_wait()
We can micro-optimize this call and mildly relax the
barrier requirements by relying on ctrl + rmb, keeping
the acquire semantics. In addition, this is pretty much
the now standard for busy-waiting under such restraints.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dave@stgolabs.net Link: http://lkml.kernel.org/r/1457574936-19065-3-git-send-email-dbueso@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
While the compiler tends to already to it for us (except for
csd_unlock), make it explicit. These helpers mainly deal with
the ->flags, are short-lived and can be called, for example,
from smp_call_function_many().
Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dave@stgolabs.net Link: http://lkml.kernel.org/r/1457574936-19065-2-git-send-email-dbueso@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
Yu-cheng Yu [Thu, 10 Mar 2016 00:28:54 +0000 (16:28 -0800)]
x86/fpu: Revert ("x86/fpu: Disable AVX when eagerfpu is off")
Leonid Shatz noticed that the SDM interpretation of the following
recent commit:
394db20ca240741 ("x86/fpu: Disable AVX when eagerfpu is off")
... is incorrect and that the original behavior of the FPU code was correct.
Because AVX is not stated in CR0 TS bit description, it was mistakenly
believed to be not supported for lazy context switch. This turns out
to be false:
Intel Software Developer's Manual Vol. 3A, Sec. 2.5 Control Registers:
'TS Task Switched bit (bit 3 of CR0) -- Allows the saving of the x87 FPU/
MMX/SSE/SSE2/SSE3/SSSE3/SSE4 context on a task switch to be delayed until
an x87 FPU/MMX/SSE/SSE2/SSE3/SSSE3/SSE4 instruction is actually executed
by the new task.'
The fork of a process with four page table levels is broken since
git commit 6252d702c5311ce9 "[S390] dynamic page tables."
All new mm contexts are created with three page table levels and
an asce limit of 4TB. If the parent has four levels dup_mmap will
add vmas to the new context which are outside of the asce limit.
The subsequent call to copy_page_range will walk the three level
page table structure of the new process with non-zero pgd and pud
indexes. This leads to memory clobbers as the pgd_index *and* the
pud_index is added to the mm->pgd pointer without a pgd_deref
in between.
The init_new_context() function is selecting the number of page
table levels for a new context. The function is used by mm_init()
which in turn is called by dup_mm() and mm_alloc(). These two are
used by fork() and exec(). The init_new_context() function can
distinguish the two cases by looking at mm->context.asce_limit,
for fork() the mm struct has been copied and the number of page
table levels may not change. For exec() the mm_alloc() function
set the new mm structure to zero, in this case a three-level page
table is created as the temporary stack space is located at
STACK_TOP_MAX = 4TB.
This fixes CVE-2016-2143.
Reported-by: Marcin KoĆcielnicki <koriakin@0x04.net> Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: stable@vger.kernel.org Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Linus Torvalds [Thu, 10 Mar 2016 04:24:23 +0000 (20:24 -0800)]
Merge tag 'spi-fix-v4.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"A few driver specific fixes for the Rockchip and i.MX SPI controllers,
especially for the i.MX they're annoying bugs if you run into them"
* tag 'spi-fix-v4.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: imx: fix spi resource leak with dma transfer
spi: imx: allow only WML aligned transfers to use DMA
spi: rockchip: add missing spi_master_put
spi: rockchip: disable runtime pm when in err case