Chen Gang [Fri, 6 Nov 2015 02:48:41 +0000 (18:48 -0800)]
mm/mmap.c: change __install_special_mapping() args order
Make __install_special_mapping() args order match the caller, so the
caller can pass their register args directly to callee with no touch.
For most of architectures, args (at least the first 5th args) are in
registers, so this change will have effect on most of architectures.
For -O2, __install_special_mapping() may be inlined under most of
architectures, but for -Os, it should not. So this change can get a
little better performance for -Os, at least.
Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Geliang Tang [Fri, 6 Nov 2015 02:48:38 +0000 (18:48 -0800)]
mm/nommu.c: drop unlikely inside BUG_ON()
(1) For !CONFIG_BUG cases, the bug call is a no-op, so we couldn't
care less and the change is ok.
(2) ppc and mips, which HAVE_ARCH_BUG_ON, do not rely on branch
predictions as it seems to be pointless[1] and thus callers should not
be trying to push an optimization in the first place.
(3) For CONFIG_BUG and !HAVE_ARCH_BUG_ON cases, BUG_ON() contains an
unlikely compiler flag already.
Vineet Gupta [Fri, 6 Nov 2015 02:48:29 +0000 (18:48 -0800)]
mm: optimize PageHighMem() check
This came up when implementing HIHGMEM/PAE40 for ARC. The kmap() /
kmap_atomic() generated code seemed needlessly bloated due to the way
PageHighMem() macro is implemented. It derives the exact zone for page
and then does pointer subtraction with first zone to infer the zone_type.
The pointer arithmatic in turn generates the code bloat.
Oleg Nesterov [Fri, 6 Nov 2015 02:48:26 +0000 (18:48 -0800)]
mm/oom_kill: fix the wrong task->mm == mm checks in oom_kill_process()
Both "child->mm == mm" and "p->mm != mm" checks in oom_kill_process() are
wrong. task->mm can be NULL if the task is the exited group leader. This
means in particular that "kill sharing same memory" loop can miss a
process with a zombie leader which uses the same ->mm.
Note: the process_has_mm(child, p->mm) check is still not 100% correct,
p->mm can be NULL too. This is minor, but probably deserves a fix or a
comment anyway.
[akpm@linux-foundation.org: document process_shares_mm() a bit] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Kyle Walker <kwalker@redhat.com> Cc: Stanislav Kozina <skozina@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Fri, 6 Nov 2015 02:48:23 +0000 (18:48 -0800)]
mm/oom_kill: cleanup the "kill sharing same memory" loop
Purely cosmetic, but the complex "if" condition looks annoying to me.
Especially because it is not consistent with OOM_SCORE_ADJ_MIN check
which adds another if/continue.
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Kyle Walker <kwalker@redhat.com> Cc: Stanislav Kozina <skozina@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Fri, 6 Nov 2015 02:48:20 +0000 (18:48 -0800)]
mm/oom_kill: remove the wrong fatal_signal_pending() check in oom_kill_process()
The fatal_signal_pending() was added to suppress unnecessary "sharing same
memory" message, but it can't 100% help anyway because it can be
false-negative; SIGKILL can be already dequeued.
And worse, it can be false-positive due to exec or coredump. exec is
mostly fine, but coredump is not. It is possible that the group leader
has the pending SIGKILL because its sub-thread originated the coredump, in
this case we must not skip this process.
We could probably add the additional ->group_exit_task check but this
patch just removes the wrong check along with pr_info().
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Kyle Walker <kwalker@redhat.com> Cc: Stanislav Kozina <skozina@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Fri, 6 Nov 2015 02:48:14 +0000 (18:48 -0800)]
mm: fix the racy mm->locked_vm change in
"mm->locked_vm += grow" and vm_stat_account() in acct_stack_growth() are
not safe; multiple threads using the same ->mm can do this at the same
time trying to expans different vma's under down_read(mmap_sem). This
means that one of the "locked_vm += grow" changes can be lost and we can
miss munlock_vma_pages_all() later.
Move this code into the caller(s) under mm->page_table_lock. All other
updates to ->locked_vm hold mmap_sem for writing.
Alexandru Moise [Fri, 6 Nov 2015 02:48:08 +0000 (18:48 -0800)]
mm/vmscan.c: fix types of some locals
In zone_reclaimable_pages(), `nr' is returned by a function which is
declared as returning "unsigned long", so declare it such. Negative
values are meaningless here.
In zone_pagecache_reclaimable() we should also declare `delta' and
`nr_pagecache_reclaimable' as being unsigned longs because they're used to
store the values returned by zone_page_state() and
zone_unmapped_file_pages() which also happen to return unsigned integers.
[akpm@linux-foundation.org: make zone_pagecache_reclaimable() return ulong rather than long] Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Fri, 6 Nov 2015 02:48:05 +0000 (18:48 -0800)]
mm, oom: remove task_lock protecting comm printing
The oom killer takes task_lock() in a couple of places solely to protect
printing the task's comm.
A process's comm, including current's comm, may change due to
/proc/pid/comm or PR_SET_NAME.
The comm will always be NULL-terminated, so the worst race scenario would
only be during update. We can tolerate a comm being printed that is in
the middle of an update to avoid taking the lock.
Other locations in the kernel have already dropped task_lock() when
printing comm, so this is consistent.
Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka [Fri, 6 Nov 2015 02:48:02 +0000 (18:48 -0800)]
mm, compaction: distinguish contended status in tracepoints
Compaction returns prematurely with COMPACT_PARTIAL when contended or has
fatal signal pending. This is ok for the callers, but might be misleading
in the traces, as the usual reason to return COMPACT_PARTIAL is that we
think the allocation should succeed. After this patch we distinguish the
premature ending condition in the mm_compaction_finished and
mm_compaction_end tracepoints.
The contended status covers the following reasons:
- lock contention or need_resched() detected in async compaction
- fatal signal pending
- too many pages isolated in the zone (only for async compaction)
Further distinguishing the exact reason seems unnecessary for now.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka [Fri, 6 Nov 2015 02:47:59 +0000 (18:47 -0800)]
mm, compaction: export tracepoints zone names to userspace
Some compaction tracepoints use zone->name to print which zone is being
compacted. This works for in-kernel printing, but not userspace trace
printing of raw captured trace such as via trace-cmd report.
This patch uses zone_idx() instead of zone->name as the raw value, and
when printing, converts the zone_type to string using the appropriate EM()
macros and some ugly tricks to overcome the problem that half the values
depend on CONFIG_ options and one does not simply use #ifdef inside of
#define.
Vlastimil Babka [Fri, 6 Nov 2015 02:47:56 +0000 (18:47 -0800)]
mm, compaction: export tracepoints status strings to userspace
Some compaction tracepoints convert the integer return values to strings
using the compaction_status_string array. This works for in-kernel
printing, but not userspace trace printing of raw captured trace such as
via trace-cmd report.
This patch converts the private array to appropriate tracepoint macros
that result in proper userspace support.
Tetsuo Handa [Fri, 6 Nov 2015 02:47:54 +0000 (18:47 -0800)]
mm/oom_kill.c: suppress unnecessary "sharing same memory" message
oom_kill_process() sends SIGKILL to other thread groups sharing victim's
mm. But printing
"Kill process %d (%s) sharing same memory\n"
lines makes no sense if they already have pending SIGKILL. This patch
reduces the "Kill process" lines by printing that line with info level
only if SIGKILL is not pending.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tetsuo Handa [Fri, 6 Nov 2015 02:47:51 +0000 (18:47 -0800)]
mm/oom_kill.c: fix potentially killing unrelated process
At the for_each_process() loop in oom_kill_process(), we are comparing
address of OOM victim's mm without holding a reference to that mm. If
there are a lot of processes to compare or a lot of "Kill process %d (%s)
sharing same memory" messages to print, for_each_process() loop could take
very long time.
It is possible that meanwhile the OOM victim exits and releases its mm,
and then mm is allocated with the same address and assigned to some
unrelated process. When we hit such race, the unrelated process will be
killed by error. To make sure that the OOM victim's mm does not go away
until for_each_process() loop finishes, get a reference on the OOM
victim's mm before calling task_unlock(victim).
[oleg@redhat.com: several fixes] Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tetsuo Handa [Fri, 6 Nov 2015 02:47:44 +0000 (18:47 -0800)]
mm/oom_kill.c: reverse the order of setting TIF_MEMDIE and sending SIGKILL
It was confirmed that a local unprivileged user can consume all memory
reserves and hang up that system using time lag between the OOM killer
sets TIF_MEMDIE on an OOM victim and sends SIGKILL to that victim, for
printk() inside for_each_process() loop at oom_kill_process() can consume
many seconds when there are many thread groups sharing the same memory.
The oom-depleter's thread group leader which got TIF_MEMDIE started
memset() in user space after the OOM killer set TIF_MEMDIE, and it was
free to abuse ALLOC_NO_WATERMARKS by TIF_MEMDIE for memset() in user space
until SIGKILL is delivered. If SIGKILL is delivered before TIF_MEMDIE is
set, the oom-depleter can terminate without touching memory reserves.
Although the possibility of hitting this time lag is very small for 3.19
and earlier kernels because TIF_MEMDIE is set immediately before sending
SIGKILL, preemption or long interrupts (an extreme example is SysRq-t) can
step between and allow memory allocations which are not needed for
terminating the OOM victim.
Fixes: 83363b917a29 ("oom: make sure that TIF_MEMDIE is set under task_lock") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [4.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A higher value can cause excessive swap IO and waste memory. A lower
value can prevent THPs from being collapsed, resulting fewer pages being
collapsed into THPs, and lower memory access performance.
Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jerome Marchand [Fri, 6 Nov 2015 02:47:29 +0000 (18:47 -0800)]
mm/memcontrol.c: fix order calculation in try_charge()
Since commit 6539cc053869 ("mm: memcontrol: fold mem_cgroup_do_charge()"),
the order to pass to mem_cgroup_oom() is calculated by passing the
number of pages to get_order() instead of the expected size in bytes.
AFAICT, it only affects the value displayed in the oom warning message.
This patch fix this.
Michal said:
: We haven't noticed that just because the OOM is enabled only for page
: faults of order-0 (single page) and get_order work just fine. Thanks for
: noticing this. If we ever start triggering OOM on different orders this
: would be broken.
Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Naoya Horiguchi [Fri, 6 Nov 2015 02:47:26 +0000 (18:47 -0800)]
mm: hwpoison: ratelimit messages from unpoison_memory()
Currently kernel prints out results of every single unpoison event, which
i= s not necessary because unpoison is purely a testing feature and
testers can = get little or no information from lots of lines of unpoison
log storm. So this patch ratelimits printk in unpoison_memory().
This patch introduces a file local ratelimit_state, which adds 64 bytes to
memory-failure.o. If we apply pr_info_ratelimited() for 8 callsite below,
2= 56 bytes is added, so it's a win.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Junichi Nomura [Fri, 6 Nov 2015 02:47:23 +0000 (18:47 -0800)]
mm/filemap.c: make global sync not clear error status of individual inodes
filemap_fdatawait() is a function to wait for on-going writeback to
complete but also consume and clear error status of the mapping set during
writeback.
The latter functionality is critical for applications to detect writeback
error with system calls like fsync(2)/fdatasync(2).
However filemap_fdatawait() is also used by sync(2) or FIFREEZE ioctl,
which don't check error status of individual mappings.
As a result, fsync() may not be able to detect writeback error if events
happen in the following order:
Application System admin
----------------------------------------------------------
write data on page cache
Run sync command
writeback completes with error
filemap_fdatawait() clears error
fsync returns success
(but the data is not on disk)
This patch adds filemap_fdatawait_keep_errors() for call sites where
writeback error is not handled so that they don't clear error status.
Naoya Horiguchi [Fri, 6 Nov 2015 02:47:14 +0000 (18:47 -0800)]
mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status
Currently there's no easy way to get per-process usage of hugetlb pages,
which is inconvenient because userspace applications which use hugetlb
typically want to control their processes on the basis of how much memory
(including hugetlb) they use. So this patch simply provides easy access
to the info via /proc/PID/status.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Joern Engel <joern@logfs.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Naoya Horiguchi [Fri, 6 Nov 2015 02:47:11 +0000 (18:47 -0800)]
mm: hugetlb: proc: add hugetlb-related fields to /proc/PID/smaps
Currently /proc/PID/smaps provides no usage info for vma(VM_HUGETLB),
which is inconvenient when we want to know per-task or per-vma base
hugetlb usage. To solve this, this patch adds new fields for hugetlb
usage like below:
Roman Gushchin [Fri, 6 Nov 2015 02:47:08 +0000 (18:47 -0800)]
mm: use only per-device readahead limit
Maximal readahead size is limited now by two values:
1) by global 2Mb constant (MAX_READAHEAD in max_sane_readahead())
2) by configurable per-device value* (bdi->ra_pages)
There are devices, which require custom readahead limit.
For instance, for RAIDs it's calculated as number of devices
multiplied by chunk size times 2.
Readahead size can never be larger than bdi->ra_pages * 2 value
(POSIX_FADV_SEQUNTIAL doubles readahead size).
If so, why do we need two limits?
I suggest to completely remove this max_sane_readahead() stuff and
use per-device readahead limit everywhere.
Also, using right readahead size for RAID disks can significantly
increase i/o performance:
before:
dd if=/dev/md2 of=/dev/null bs=100M count=100
100+0 records in
100+0 records out 10485760000 bytes (10 GB) copied, 12.9741 s, 808 MB/s
after:
$ dd if=/dev/md2 of=/dev/null bs=100M count=100
100+0 records in
100+0 records out 10485760000 bytes (10 GB) copied, 8.91317 s, 1.2 GB/s
(It's an 8-disks RAID5 storage).
This patch doesn't change sys_readahead and madvise(MADV_WILLNEED)
behavior introduced by 6d2be915e589b58 ("mm/readahead.c: fix readahead
failure for memoryless NUMA nodes and limit readahead pages").
Signed-off-by: Roman Gushchin <klamm@yandex-team.ru> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: onstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yaowei Bai [Fri, 6 Nov 2015 02:47:06 +0000 (18:47 -0800)]
mm/page_alloc: remove unused parameter in init_currently_empty_zone()
Commit a2f3aa025766 ("[PATCH] Fix sparsemem on Cell") fixed an oops
experienced on the Cell architecture when init-time functions,
early_*(), are called at runtime by introducing an 'enum memmap_context'
parameter to memmap_init_zone() and init_currently_empty_zone(). This
parameter is intended to be used to tell whether the call of these two
functions is being made on behalf of a hotplug event, or happening at
boot-time. However, init_currently_empty_zone() does not use this
parameter at all, so remove it.
Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka [Fri, 6 Nov 2015 02:47:03 +0000 (18:47 -0800)]
mm, migrate: count pages failing all retries in vmstat and tracepoint
Migration tries up to 10 times to migrate pages that return -EAGAIN until
it gives up. If some pages fail all retries, they are counted towards the
number of failed pages that migrate_pages() returns. They should also be
counted in the /proc/vmstat pgmigrate_fail and in the mm_migrate_pages
tracepoint.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Raghavendra K T [Fri, 6 Nov 2015 02:46:29 +0000 (18:46 -0800)]
arch/powerpc/mm/numa.c: do not allocate bootmem memory for non existing nodes
With the setup_nr_nodes(), we have already initialized
node_possible_map. So it is safe to use for_each_node here.
There are many places in the kernel that use hardcoded 'for' loop with
nr_node_ids, because all other architectures have numa nodes populated
serially. That should be reason we had maintained the same for
powerpc.
But, since sparse numa node ids possible on powerpc, we unnecessarily
allocate memory for non existent numa nodes.
For e.g., on a system with 0,1,16,17 as numa nodes nr_node_ids=18 and
we allocate memory for nodes 2-14. This patch we allocate memory for
only existing numa nodes.
The patch is boot tested on a 4 node tuleta, confirming with printks
that it works as expected.
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Anton Blanchard <anton@samba.org> Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Greg Kurz <gkurz@linux.vnet.ibm.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Raghavendra K T [Fri, 6 Nov 2015 02:46:26 +0000 (18:46 -0800)]
mm/list_lru.c: replace nr_node_ids for loop with for_each_node()
The functions used in the patch are in slowpath, which gets called
whenever alloc_super is called during mounts.
Though this should not make difference for the architectures with
sequential numa node ids, for the powerpc which can potentially have
sparse node ids (for e.g., 4 node system having numa ids, 0,1,16,17 is
common), this patch saves some unnecessary allocations for non existing
numa nodes.
Even without that saving, perhaps patch makes code more readable.
[vdavydov@parallels.com: take memcg_aware check outside for_each loop] Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Anton Blanchard <anton@samba.org> Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Greg Kurz <gkurz@linux.vnet.ibm.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jonathan Corbet [Fri, 6 Nov 2015 02:46:23 +0000 (18:46 -0800)]
mm: fix docbook comment for get_vaddr_frames()
get_vaddr_frames() has a comment that's *almost* a docbook comment; add
the missing star so that the tools will find it properly.
Signed-off-by: Jonathan Corbet <corbet@lwn.net> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Fri, 6 Nov 2015 02:46:20 +0000 (18:46 -0800)]
memcg: drop unnecessary cold-path tests from __memcg_kmem_bypass()
__memcg_kmem_bypass() decides whether a kmem allocation should be bypassed
to the root memcg. Some conditions that it tests are valid criteria
regarding who should be held accountable; however, there are a couple
unnecessary tests for cold paths - __GFP_FAIL and fatal_signal_pending().
The previous patch updated try_charge() to handle both __GFP_FAIL and
dying tasks correctly and the only thing these two tests are doing is
making accounting less accurate and sprinkling tests for cold path
conditions in the hot paths. There's nothing meaningful gained by these
extra tests.
This patch removes the two unnecessary tests from __memcg_kmem_bypass().
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Fri, 6 Nov 2015 02:46:17 +0000 (18:46 -0800)]
memcg: ratify and consolidate over-charge handling
try_charge() is the main charging logic of memcg. When it hits the limit
but either can't fail the allocation due to __GFP_NOFAIL or the task is
likely to free memory very soon, being OOM killed, has SIGKILL pending or
exiting, it "bypasses" the charge to the root memcg and returns -EINTR.
While this is one approach which can be taken for these situations, it has
several issues.
* It unnecessarily lies about the reality. The number itself doesn't
go over the limit but the actual usage does. memcg is either forced
to or actively chooses to go over the limit because that is the
right behavior under the circumstances, which is completely fine,
but, if at all avoidable, it shouldn't be misrepresenting what's
happening by sneaking the charges into the root memcg.
* Despite trying, we already do over-charge. kmemcg can't deal with
switching over to the root memcg by the point try_charge() returns
-EINTR, so it open-codes over-charing.
* It complicates the callers. Each try_charge() user has to handle
the weird -EINTR exception. memcg_charge_kmem() does the manual
over-charging. mem_cgroup_do_precharge() performs unnecessary
uncharging of root memcg, which BTW is inconsistent with what
memcg_charge_kmem() does but not broken as [un]charging are noops on
root memcg. mem_cgroup_try_charge() needs to switch the returned
cgroup to the root one.
The reality is that in memcg there are cases where we are forced and/or
willing to go over the limit. Each such case needs to be scrutinized and
justified but there definitely are situations where that is the right
thing to do. We alredy do this but with a superficial and inconsistent
disguise which leads to unnecessary complications.
This patch updates try_charge() so that it over-charges and returns 0 when
deemed necessary. -EINTR return is removed along with all special case
handling in the callers.
While at it, remove the local variable @ret, which was initialized to zero
and never changed, along with done: label which just returned the always
zero @ret.
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Fri, 6 Nov 2015 02:46:14 +0000 (18:46 -0800)]
memcg: collect kmem bypass conditions into __memcg_kmem_bypass()
memcg_kmem_newpage_charge() and memcg_kmem_get_cache() are testing the
same series of conditions to decide whether to bypass kmem accounting.
Collect the tests into __memcg_kmem_bypass().
This is pure refactoring.
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Fri, 6 Nov 2015 02:46:11 +0000 (18:46 -0800)]
memcg: punt high overage reclaim to return-to-userland path
Currently, try_charge() tries to reclaim memory synchronously when the
high limit is breached; however, if the allocation doesn't have
__GFP_WAIT, synchronous reclaim is skipped. If a process performs only
speculative allocations, it can blow way past the high limit. This is
actually easily reproducible by simply doing "find /". slab/slub
allocator tries speculative allocations first, so as long as there's
memory which can be consumed without blocking, it can keep allocating
memory regardless of the high limit.
This patch makes try_charge() always punt the over-high reclaim to the
return-to-userland path. If try_charge() detects that high limit is
breached, it adds the overage to current->memcg_nr_pages_over_high and
schedules execution of mem_cgroup_handle_over_high() which performs
synchronous reclaim from the return-to-userland path.
As long as kernel doesn't have a run-away allocation spree, this should
provide enough protection while making kmemcg behave more consistently.
It also has the following benefits.
- All over-high reclaims can use GFP_KERNEL regardless of the specific
gfp mask in use, e.g. GFP_NOFS, when the limit was breached.
- It copes with prio inversion. Previously, a low-prio task with
small memory.high might perform over-high reclaim with a bunch of
locks held. If a higher prio task needed any of these locks, it
would have to wait until the low prio task finished reclaim and
released the locks. By handing over-high reclaim to the task exit
path this issue can be avoided.
Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@kernel.org> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Fri, 6 Nov 2015 02:46:09 +0000 (18:46 -0800)]
memcg: flatten task_struct->memcg_oom
task_struct->memcg_oom is a sub-struct containing fields which are used
for async memcg oom handling. Most task_struct fields aren't packaged
this way and it can lead to unnecessary alignment paddings. This patch
flattens it.
Andrew Morton [Fri, 6 Nov 2015 02:46:03 +0000 (18:46 -0800)]
uaccess: reimplement probe_kernel_address() using probe_kernel_read()
probe_kernel_address() is basically the same as the (later added)
probe_kernel_read().
The return value on EFAULT is a bit different: probe_kernel_address()
returns number-of-bytes-not-copied whereas probe_kernel_read() returns
-EFAULT. All callers have been checked, none cared.
probe_kernel_read() can be overridden by the architecture whereas
probe_kernel_address() cannot. parisc, blackfin and um do this, to insert
additional checking. Hence this patch possibly fixes obscure bugs,
although there are only two probe_kernel_address() callsites outside
arch/.
My first attempt involved removing probe_kernel_address() entirely and
converting all callsites to use probe_kernel_read() directly, but that got
tiresome.
This patch shrinks mm/slab_common.o by 218 bytes. For a single
probe_kernel_address() callsite.
Cc: Steven Miao <realmz6@gmail.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Helge Deller <deller@gmx.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Catalin Marinas [Fri, 6 Nov 2015 02:45:54 +0000 (18:45 -0800)]
mm: slab: only move management objects off-slab for sizes larger than KMALLOC_MIN_SIZE
On systems with a KMALLOC_MIN_SIZE of 128 (arm64, some mips and powerpc
configurations defining ARCH_DMA_MINALIGN to 128), the first
kmalloc_caches[] entry to be initialised after slab_early_init = 0 is
"kmalloc-128" with index 7. Depending on the debug kernel configuration,
sizeof(struct kmem_cache) can be larger than 128 resulting in an
INDEX_NODE of 8.
Commit 8fc9cf420b36 ("slab: make more slab management structure off the
slab") enables off-slab management objects for sizes starting with
PAGE_SIZE >> 5 (128 bytes for a 4KB page configuration) and the creation
of the "kmalloc-128" cache would try to place the management objects
off-slab. However, since KMALLOC_MIN_SIZE is already 128 and
freelist_size == 32 in __kmem_cache_create(), kmalloc_slab(freelist_size)
returns NULL (kmalloc_caches[7] not populated yet). This triggers the
following bug on arm64:
kernel BUG at /work/Linux/linux-2.6-aarch64/mm/slab.c:2283!
Internal error: Oops - BUG: 0 [#1] SMP
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 4.3.0-rc4+ #540
Hardware name: Juno (DT)
PC is at __kmem_cache_create+0x21c/0x280
LR is at __kmem_cache_create+0x210/0x280
[...]
Call trace:
__kmem_cache_create+0x21c/0x280
create_boot_cache+0x48/0x80
create_kmalloc_cache+0x50/0x88
create_kmalloc_caches+0x4c/0xf4
kmem_cache_init+0x100/0x118
start_kernel+0x214/0x33c
This patch introduces an OFF_SLAB_MIN_SIZE definition to avoid off-slab
management objects for sizes equal to or smaller than KMALLOC_MIN_SIZE.
Fixes: 8fc9cf420b36 ("slab: make more slab management structure off the slab") Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <stable@vger.kernel.org> [3.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wei Yang [Fri, 6 Nov 2015 02:45:51 +0000 (18:45 -0800)]
mm/slub: calculate start order with reserved in consideration
In slub_order(), the order starts from max(min_order,
get_order(min_objects * size)). When (min_objects * size) has different
order from (min_objects * size + reserved), it will skip this order via a
check in the loop.
This patch optimizes this a little by calculating the start order with
`reserved' in consideration and removing the check in loop.
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wei Yang [Fri, 6 Nov 2015 02:45:48 +0000 (18:45 -0800)]
mm/slub: use get_order() instead of fls()
get_order() is more easy to understand.
This patch just replaces it.
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wei Yang [Fri, 6 Nov 2015 02:45:46 +0000 (18:45 -0800)]
mm/slub: correct the comment in calculate_order()
In calculate_order(), it tries to calculate the best order by adjusting
the fraction and min_objects. On each iteration on min_objects, fraction
iterates on 16, 8, 4. Which means the acceptable waste increases with
1/16, 1/8, 1/4.
This patch corrects the comment according to the code.
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexandru Moise [Fri, 6 Nov 2015 02:45:43 +0000 (18:45 -0800)]
mm/slab_common.c: initialize kmem_cache pointer to NULL
The assignment to NULL within the error condition was written in a 2014
patch to suppress a compiler warning. However it would be cleaner to just
initialize the kmem_cache to NULL and just return it in case of an error
condition.
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Per Cache Average Min Max Total
----------------------------------------------------------------------------
#Objects 5147 1 89068 324301
#Slabs 199 1 3886 12537
#PartSlab 12 0 240 778
%PartSlab 32% 0% 100% 6%
PartObjs 5 0 4569 18151
% PartObj 26% 0% 100% 5%
Memory 3171409 8192 127336448199798784
Used 3001736 160 121429728189109408
Loss 169672 0 590672010689376
Per Object Average Min Max
-----------------------------------------------------------
Memory 585 8 8192
User 583 8 8192
Loss 2 0 64
Slabs sorted by size
--------------------
Name Objects Objsize Space Slabs/Part/Cpu O/S O %Fr %Ef Flg
ext4_inode_cache 69948 1736 127336448 3871/0/15 18 3 0 95 a
dentry 89068 288 26058752 3164/0/17 28 1 0 98 a
Slabs sorted by loss
--------------------
Name Objects Objsize Loss Slabs/Part/Cpu O/S O %Fr %Ef Flg
ext4_inode_cache 69948 1736 5906720 3871/0/15 18 3 0 95 a
inode_cache 11628 864 537472 642/0/4 18 2 0 94 a
Besides, store_size() does not use powers of two for G/M/K
Per Cache Average Min Max Total
---------------------------------------------------------
#Objects 14.1K 1 227.8K 920.1K
#Slabs 533 1 11.7K 34.7K
#PartSlab 86 0 4.3K 5.6K
%PartSlab 24% 0% 100% 16%
PartObjs 17 0 129.3K 161.2K
% PartObj 17% 0% 100% 17%
Memory 8.7M 8.1K 384.7M 568.3M
Used 8.2M 160 366.5M 537.9M
Loss 468.8K 0 18.2M 30.4M
Per Object Average Min Max
---------------------------------------------
Memory 587 8 8.1K
User 584 8 8.1K
Loss 2 0 64
Slabs sorted by size
----------------------
Name Objects Objsize Space Slabs/Part/Cpu O/S O %Fr %Ef Flg
ext4_inode_cache 211142 1736 384.7M 11732/40/10 18 3 0 95 a
Slabs sorted by loss
----------------------
Name Objects Objsize Loss Slabs/Part/Cpu O/S O %Fr %Ef Flg
ext4_inode_cache 211142 1736 18.2M 11732/40/10 18 3 0 95 a
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Per Cache Average Min Max Total
----------------------------------------------------------------------------
#Objects 5147 1 89068 324301
#Slabs 199 1 3886 12537
#PartSlab 12 0 240 778
%PartSlab 32% 0% 100% 6%
PartObjs 5 0 4569 18151
% PartObj 26% 0% 100% 5%
Memory 3171409 8192 127336448199798784
Used 3001736 160 121429728189109408
Loss 169672 0 590672010689376
Per Object Average Min Max
-----------------------------------------------------------
Memory 585 8 8192
User 583 8 8192
Loss 2 0 64
Slabs sorted by size
--------------------
Name Objects Objsize Space Slabs/Part/Cpu O/S O %Fr %Ef Flg
ext4_inode_cache 69948 1736 127336448 3871/0/15 18 3 0 95 a
dentry 89068 288 26058752 3164/0/17 28 1 0 98 a
Slabs sorted by loss
--------------------
Name Objects Objsize Loss Slabs/Part/Cpu O/S O %Fr %Ef Flg
ext4_inode_cache 69948 1736 5906720 3871/0/15 18 3 0 95 a
inode_cache 11628 864 537472 642/0/4 18 2 0 94 a
The last patch in the series addresses Linus' comment from
http://marc.info/?l=linux-mm&m=144148518703321&w=2
(well, it's been some time. sorry.)
gnuplot script takes the slabinfo records file, where every record is a `slabinfo -X'
output. So the basic workflow is, for example, as follows:
while [ 1 ]; do slabinfo -X -N 2 >> stats; sleep 1; done
^C
slabinfo-gnuplot.sh stats
The last command will produce 3 png files (and 3 stats files)
-- graph of slabinfo totals
-- graph of slabs by size
-- graph of slabs by loss
It's also possible to select a range of records for plotting (a range of collected
slabinfo outputs) via `-r 10,100` (for example); and compare totals from several
measurements (to visially compare slabs behaviour (10,50 range)) using
pre-parsed totals files:
slabinfo-gnuplot.sh -r 10,50 -t stats-totals1 .. stats-totals2
This also, technically, supports ktest. Upload new slabinfo to target,
collect the stats and give the resulting stats file to slabinfo-gnuplot
This patch (of 8):
Use getopt constants in `struct option' ->has_arg instead of numerical
representations.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/slab_common.c: do not warn that cache is busy on destroy more than once
Currently, when kmem_cache_destroy() is called for a global cache, we
print a warning for each per memcg cache attached to it that has active
objects (see shutdown_cache). This is redundant, because it gives no new
information and only clutters the log. If a cache being destroyed has
active objects, there must be a memory leak in the module that created the
cache, and it does not matter if the cache was used by users in memory
cgroups or not.
This patch moves the warning from shutdown_cache(), which is called for
shutting down both global and per memcg caches, to kmem_cache_destroy(),
so that the warning is only printed once if there are objects left in the
cache being destroyed.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/slab_common.c: clear pointers to per memcg caches on destroy
Currently, we do not clear pointers to per memcg caches in the
memcg_params.memcg_caches array when a global cache is destroyed with
kmem_cache_destroy.
This is fine if the global cache does get destroyed. However, a cache can
be left on the list if it still has active objects when kmem_cache_destroy
is called (due to a memory leak). If this happens, the entries in the
array will point to already freed areas, which is likely to result in data
corruption when the cache is reused (via slab merging).
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
do_kmem_cache_create(), do_kmem_cache_shutdown(), and
do_kmem_cache_release() sound awkward for static helper functions that are
not supposed to be used outside slab_common.c. Rename them to
create_cache(), shutdown_cache(), and release_caches(), respectively.
This patch is a pure cleanup and does not introduce any functional
changes.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sparse apparently pretends to be gcc >= 4.9, yet isn't prepared to handle
all the function attributes supported by those gccs and complains loudly.
So hide the definition of __assume_aligned from it (so that the generic
one in compiler.h gets used).
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Tested-By: Valdis Kletnieks <valdis.kletnieks@vt.edu> Cc: Christopher Li <sparse@chrisli.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
compiler.h: add support for function attribute assume_aligned
gcc 4.9 added the function attribute assume_aligned, indicating to the
caller that the returned pointer may be assumed to have a certain minimal
alignment. This is useful if, for example, the return value is passed to
memset(). Add a shorthand macro for that.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kernel/watchdog.c: fix race between proc_watchdog_thresh() and watchdog_timer_fn()
Theoretically it is possible that the watchdog timer expires right at the
time when a user sets 'watchdog_thresh' to zero (note: this disables the
lockup detectors). In this scenario, the is_softlockup() function - which
is called by the timer - could produce a false positive.
Fix this by checking the current value of 'watchdog_thresh'.
kernel/watchdog.c: remove {get|put}_online_cpus() from watchdog_{park|unpark}_threads()
watchdog_{park|unpark}_threads() are now called in code paths that protect
themselves against CPU hotplug, so {get|put}_online_cpus() calls are
redundant and can be removed.
kernel/watchdog.c: avoid races between /proc handlers and CPU hotplug
The handler functions for watchdog parameters in /proc/sys/kernel do not
protect themselves against races with CPU hotplug. Hence, theoretically
it is possible that a new watchdog thread is started on a hotplugged CPU
while a parameter is being modified, and the thread could thus use a
parameter value that is 'in transition'.
For example, if 'watchdog_thresh' is being set to zero (note: this
disables the lockup detectors) the thread would erroneously use the value
zero as the sample period.
To avoid such races and to keep the /proc handler code consistent,
call
{get|put}_online_cpus() in proc_watchdog_common()
{get|put}_online_cpus() in proc_watchdog_thresh()
{get|put}_online_cpus() in proc_watchdog_cpumask()
kernel/watchdog.c: avoid race between lockup detector suspend/resume and CPU hotplug
The lockup detector suspend/resume interface that was introduced by
commit 8c073d27d7ad ("watchdog: introduce watchdog_suspend() and
watchdog_resume()") does not protect itself against races with CPU
hotplug. Hence, theoretically it is possible that a new watchdog thread
is started on a hotplugged CPU while the lockup detector is suspended,
and the thread could thus interfere unexpectedly with the code that
requested to suspend the lockup detector.
Avoid the race by calling
get_online_cpus() in lockup_detector_suspend()
put_online_cpus() in lockup_detector_resume()
Jiri Kosina [Fri, 6 Nov 2015 02:44:41 +0000 (18:44 -0800)]
kernel/watchdog.c: perform all-CPU backtrace in case of hard lockup
In many cases of hardlockup reports, it's actually not possible to know
why it triggered, because the CPU that got stuck is usually waiting on a
resource (with IRQs disabled) in posession of some other CPU is holding.
IOW, we are often looking at the stacktrace of the victim and not the
actual offender.
Introduce sysctl / cmdline parameter that makes it possible to have
hardlockup detector perform all-CPU backtrace.
watchdog: do not unpark threads in watchdog_park_threads() on error
If kthread_park() returns an error, watchdog_park_threads() should not
blindly 'roll back' the already parked threads to the unparked state.
Instead leave it up to the callers to handle such errors appropriately in
their context. For example, it is redundant to unpark the threads if the
lockup detectors will soon be disabled by the callers anyway.
watchdog: implement error handling in update_watchdog_all_cpus() and callers
update_watchdog_all_cpus() now passes errors from watchdog_park_threads()
up to functions in the call chain. This allows watchdog_enable_all_cpus()
and proc_watchdog_update() to handle such errors too.
watchdog: move watchdog_disable_all_cpus() outside of ifdef
Move watchdog_disable_all_cpus() outside of the ifdef so that it is
available if CONFIG_SYSCTL is not defined. This is preparation for
"watchdog: implement error handling in update_watchdog_all_cpus() and
callers".
watchdog: fix error handling in proc_watchdog_thresh()
The original watchdog_park_threads() function that was introduced by
commit 81a4beef91ba ("watchdog: introduce watchdog_park_threads() and
watchdog_unpark_threads()") takes a very simple approach to handle
errors returned by kthread_park(): It attempts to roll back all watchdog
threads to the unparked state. However, this may be undesired behaviour
from the perspective of the caller which may want to handle errors as
appropriate in its specific context. Currently, there are two possible
call chains:
Instead of 'blindly' attempting to unpark the watchdog threads if a
kthread_park() call fails, the new approach is to disable the lockup
detectors in the above call chains. Failure becomes visible to the user
as follows:
- error messages from lockup_detector_suspend()
or watchdog_enable_all_cpus()
- the state that can be read from /proc/sys/kernel/watchdog_enabled
- the 'write' system call in the latter call chain returns an error
I did not experience kthread_park() failures in practice, I used some
instrumentation to fake error returns from kthread_park() in order to test
the patches.
This patch (of 5):
Restore the previous value of watchdog_thresh _and_ sample_period if
proc_watchdog_update() returns an error. The variables must be consistent
to avoid false positives of the lockup detectors.
9p: do not overwrite return code when locking fails
If the remote locking fail, we run a local vfs unlock that should work and
return success to userland when we didn't actually lock at all. We need
to tell the application that tried to lock that it didn't get it, not that
all went well.
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
rcu: force alignment on struct callback_head/rcu_head
Make struct callback_head aligned to size of pointer. On most
architectures it happens naturally due ABI requirements, but some
architectures (like CRIS) have weird ABI and we need to ask it explicitly.
The alignment is required to guarantee that bits 0 and 1 of @next will be
clear under normal conditions -- as long as we use call_rcu(),
call_rcu_bh(), call_rcu_sched(), or call_srcu() to queue callback.
This guarantee is important for few reasons:
- future call_rcu_lazy() will make use of lower bits in the pointer;
- the structure shares storage spacer in struct page with @compound_head,
which encode PageTail() in bit 0. The guarantee is needed to avoid
false-positive PageTail().
False postive PageTail() caused crash on crisv32[1]. It happend due
misaligned task_struct->rcu, which was byte-aligned.
Joseph Qi [Fri, 6 Nov 2015 02:44:16 +0000 (18:44 -0800)]
ocfs2: clean up unused variable in ocfs2_duplicate_clusters_by_page()
readahead_pages in ocfs2_duplicate_clusters_by_page is defined but not
used, so clean it up.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joseph Qi [Fri, 6 Nov 2015 02:44:13 +0000 (18:44 -0800)]
ocfs2: add uuid to ocfs2 thread name for problem analysis
A node can mount multiple ocfs2 volumes. And if thread names are same for
each volume/domain, it will bring inconvenience when analyzing problems
because we have to identify which volume/domain the messages belong to.
Since thread name will be printed to messages, so add volume uuid or dlm
name to thread name can benefit problem analysis.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Gang He <ghe@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
alex chen [Fri, 6 Nov 2015 02:44:10 +0000 (18:44 -0800)]
ocfs2: should reclaim the inode if '__ocfs2_mknod_locked' returns an error
In ocfs2_mknod_locked if '__ocfs2_mknod_locke d' returns an error, we
should reclaim the inode successfully claimed above, otherwise, the
inode never be reused. The case is described below:
ocfs2_mknod
ocfs2_mknod_locked
ocfs2_claim_new_inode
Successfully claim the inode
__ocfs2_mknod_locked
ocfs2_journal_access_di
Failed because of -ENOMEM or other reasons, the inode
lockres has not been initialized yet.
iput(inode)
ocfs2_evict_inode
ocfs2_delete_inode
ocfs2_inode_lock
ocfs2_inode_lock_full_nested
__ocfs2_cluster_lock
Return -EINVAL because of the inode
lockres has not been initialized.
So the following operations are not performed
ocfs2_wipe_inode
ocfs2_remove_inode
ocfs2_free_dinode
ocfs2_free_suballoc_bits
Signed-off-by: Alex Chen <alex.chen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joseph Qi [Fri, 6 Nov 2015 02:44:07 +0000 (18:44 -0800)]
ocfs2: fix race between mount and delete node/cluster
There is a race case between mount and delete node/cluster, which will
lead o2hb_thread to malfunctioning dead loop.
o2hb_thread
{
o2nm_depend_this_node();
<<<<<< race window, node may have already been deleted, and then
enter the loop, o2hb thread will be malfunctioning
because of no configured nodes found.
while (!kthread_should_stop() &&
!reg->hr_unclean_stop && !reg->hr_aborted_start) {
}
So check the return value of o2nm_depend_this_node() is needed. If node
has been deleted, do not enter the loop and let mount fail.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joseph Qi [Fri, 6 Nov 2015 02:44:04 +0000 (18:44 -0800)]
ocfs2: only take lock if dio entry when recover orphans
We have no need to take inode mutex, rw and inode lock if it is not dio
entry when recover orphans. Optimize it by adding a flag
OCFS2_INODE_DIO_ORPHAN_ENTRY to ocfs2_inode_info to reduce contention.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joseph Qi [Fri, 6 Nov 2015 02:44:01 +0000 (18:44 -0800)]
ocfs2: do not include dio entry in case of orphan scan
dio entry will only do truncate in case of ORPHAN_NEED_TRUNCATE. So do
not include it when doing normal orphan scan to reduce contention.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joseph Qi [Fri, 6 Nov 2015 02:43:58 +0000 (18:43 -0800)]
ocfs2: improve performance for localalloc
Currently cluster allocation is always trying to find a victim chain (a
chian has most space), and this may lead to poor performance because of
discontiguous allocation in some scenarios.
Our test case is block size 4k, cluster size 1M and mount option with
localalloc=2048 (2G), since a gd is 32256M (about 31.5G) and a localalloc
window is only 2G, creating 50G file will result in 2G from gd0, 2G from
gd1, ...
One way to improve performance is enlarge localalloc window size (max
31104M), but this will make end user feel that about 30G is suddenly
"missing", and localalloc currently do not support steal, which means one
node cannot use another node's localalloc even it is not used in fact. So
using the last gd to record the allocation and continues with the gd if it
has enough space for a localalloc window can make the allocation as more
contiguous as possible.
Our test result is below (evaluated in IOPS), which is using iometer
running in VM, dynamic vhd virtual disk stored in ocfs2.
IO model Original After Improved(%)
16K60%Write100%Random 703 876 24.59%
8K90%Write100%Random 735 827 12.59%
4K100%Write100%Random 859 915 6.52%
4K100%Read100%Random 2092 2600 24.30%
Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Tested-by: Norton Zhu <norton.zhu@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
jiangyiwen [Fri, 6 Nov 2015 02:43:55 +0000 (18:43 -0800)]
ocfs2: fill in the unused portion of the block with zeros by dio_zero_block()
A simplified test case is (this case from Ryan):
1) dd if=/dev/zero of=/mnt/hello bs=512 count=1 oflag=direct;
2) truncate /mnt/hello -s 2097152
file 'hello' is not exist before test. After this command,
file 'hello' should be all zero. But 512~4096 is some random data.
Setting bh state to new when get a new block, if so,
direct_io_worker()->dio_zero_block() will fill-in the unused portion
of the block with zero.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sudip Mukherjee [Fri, 6 Nov 2015 02:43:49 +0000 (18:43 -0800)]
logfs: fix build warning
fs/logfs/dev_bdev.c: In function '__bdev_writeseg':
include/linux/kernel.h:601:17: warning: comparison of distinct pointer types lacks a cast [enabled by default]
(void) (&_min1 == &_min2); \
fs/logfs/dev_bdev.c:84:14: note: in expansion of macro 'min'
max_pages = min(nr_pages, BIO_MAX_PAGES);
fs/logfs/dev_bdev.c: In function 'do_erase':
include/linux/kernel.h:601:17: warning: comparison of distinct pointer types lacks a cast [enabled by default]
(void) (&_min1 == &_min2); \
fs/logfs/dev_bdev.c:174:14: note: in expansion of macro 'min'
max_pages = min(nr_pages, BIO_MAX_PAGES);
Lets use min_t and mention the type.
Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org> Cc: Joern Engel <joern@logfs.org> Cc: Prasad Joshi <prasadjoshi.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dave Hansen [Fri, 6 Nov 2015 02:43:46 +0000 (18:43 -0800)]
inotify: actually check for invalid bits in sys_inotify_add_watch()
The comment here says that it is checking for invalid bits. But, the mask
is *actually* checking to ensure that _any_ valid bit is set, which is
quite different.
Without this check, an unexpected bit could get set on an inotify object.
Since these bits are also interpreted by the fsnotify/dnotify code, there
is the potential for an object to be mishandled inside the kernel. For
instance, can we be sure that setting the dnotify flag FS_DN_RENAME on an
inotify watch is harmless?
Add the actual check which was intended. Retain the existing inotify bits
are being added to the watch. Plus, this is existing behavior which would
be nice to preserve.
I did a quick sniff test that inotify functions and that my
'inotify-tools' package passes 'make check'.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: John McCutchan <john@johnmccutchan.com> Cc: Robert Love <rlove@rlove.org> Cc: Eric Paris <eparis@parisplace.org> Cc: Josh Boyer <jwboyer@fedoraproject.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>