hfs,hfsplus: cache pages correctly between bnode_create and bnode_free
Pages looked up by __hfs_bnode_create() (called by hfs_bnode_create() and
hfs_bnode_find() for finding or creating pages corresponding to an inode)
are immediately kmap()'ed and used (both read and write) and kunmap()'ed,
and should not be page_cache_release()'ed until hfs_bnode_free().
This patch fixes a problem I first saw in July 2012: merely running "du"
on a large hfsplus-mounted directory a few times on a reasonably loaded
system would get the hfsplus driver all confused, complaining about
B-tree inconsistencies and generating a "BUG: Bad page state". Most
recently, I could reproduce this problem on an up-to-date Fedora 22 with the
shipped kernel 4.0.5, by running "du /" (="/" + "/home" + "/mnt" + other
smaller mounts) and "du /mnt" simultaneously in two windows, where /mnt is a
lightly-used QEMU VM image of the full Mac OS X 10.9.
After applying the patch, I was able to run "du /" (60+ times) and "du
/mnt" (150+ times) continuously and simultaneously for 6+ hours.
There are many reports of the hfsplus driver getting confused under load
and generating "BUG: Bad page state" or other similar issues over the
years. [1]
The unpatched code [2] has always been wrong since it entered the kernel
tree. The only reason it gets away with it is that the
kmap/memcpy/kunmap follow very quickly after the page_cache_release(), so
most of the time the kernel has not yet had a chance to reuse the memory for
something else.
The current RW driver appears to have followed the design and development
of the earlier read-only hfsplus driver [3], whereby version 0.1 (Dec
2001) had a B-tree node-centric approach of
read_cache_page()/page_cache_release() per bnode_get()/bnode_put(),
migrating towards version 0.2 (June 2002), which cached and released pages
per inode extent. When the current RW code first entered the kernel [2]
in 2005, there was an REF_PAGES conditional (and "//" commented-out code)
to switch between B-node-centric and inode-centric paging. There
was a mistake in the direction of one of the REF_PAGES conditionals in
__hfs_bnode_create(). In a subsequent "remove debug code" commit [4], the
read_cache_page()/page_cache_release() per bnode_get()/bnode_put() were
removed, but a page_cache_release() was mistakenly left in (propagating
the "REF_PAGES <-> !REF_PAGES" mistake), and the commented-out
page_cache_release() in bnode_release() (which should be guarded by
!REF_PAGES) was never enabled.
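Condensed into a sketch (the shape of the fix, not the literal patch): the
page-cache reference taken when a bnode's page is looked up is kept for the
bnode's whole lifetime and only dropped in hfs_bnode_free().

  /* In __hfs_bnode_create(): keep the reference returned by the page cache;
   * do NOT page_cache_release() it right after the kmap()/kunmap() users. */
  page = read_mapping_page(mapping, block, NULL);
  node->page[i] = page;

  /* In hfs_bnode_free(): drop the references taken above, now that nothing
   * will map these pages through the bnode any more. */
  void hfs_bnode_free(struct hfs_bnode *node)
  {
      int i;

      for (i = 0; i < node->tree->pages_per_bnode; i++)
          if (node->page[i])
              page_cache_release(node->page[i]);
      kfree(node);
  }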
References:
[1]:
Michael Fox, Apr 2013
http://www.spinics.net/lists/linux-fsdevel/msg63807.html
("hfsplus volume suddenly inaccessable after 'hfs: recoff %d too large'")
Sasha Levin, Feb 2015
http://lkml.org/lkml/2015/2/20/85 ("use after free")
Jan Harkes [Wed, 9 Sep 2015 22:38:01 +0000 (15:38 -0700)]
fs/coda: fix readlink buffer overflow
Dan Carpenter discovered a buffer overflow in the Coda file system
readlink code. A userspace file system daemon can return a 4096 byte
result which then triggers a one byte write past the allocated readlink
result buffer.
This does not trigger with an unmodified Coda implementation because Coda
has a 1024-byte limit for symbolic links; however, other userspace file
systems using the Coda kernel module could be affected.
Although this is an obvious overflow, I don't think it needs to be treated
as particularly security-sensitive, because the overflow is on
the Coda userspace daemon side, which already needs root to open Coda's
kernel device and to mount the file system before we get to the point where
links can be read.
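The shape of the fix, as a small illustrative sketch (the names below are
hypothetical, not the actual Coda functions): the copy is clamped so the
terminating NUL always fits inside the allocated buffer.

  #include <stddef.h>
  #include <string.h>

  /* Illustrative only: copy a daemon-supplied result of retlen bytes into a
   * buffer of buflen bytes, always leaving room for the terminating NUL. */
  static size_t copy_link_result(char *buf, size_t buflen,
                                 const char *result, size_t retlen)
  {
      if (retlen >= buflen)        /* a 4096-byte answer into a 4096-byte buffer */
          retlen = buflen - 1;     /* must lose one byte to the '\0' */
      memcpy(buf, result, retlen);
      buf[retlen] = '\0';
      return retlen;
  }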
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Wed, 9 Sep 2015 22:37:58 +0000 (15:37 -0700)]
checkpatch: add constant comparison on left side test
"CONST <comparison> variable" checks like:
if (NULL != foo)
and
while (0 < bar(...))
where a constant (or what appears to be a constant like an upper case
identifier) is on the left of a comparison are generally preferred to be
written using the constant on the right side like:
if (foo != NULL)
and
while (bar(...) > 0)
Add a test for this.
Add a --fix option too, but only do it when the code is immediately
surrounded by parentheses to avoid misfixing things like "(0 < bar() +
constant)"
Signed-off-by: Joe Perches <joe@perches.com> Cc: Nicolas Morey Chaisemartin <nmorey@kalray.eu> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Wed, 9 Sep 2015 22:37:55 +0000 (15:37 -0700)]
checkpatch: add __pmem to $Sparse annotations
commit 61031952f4c8 ("arch, x86: pmem api for ensuring durability of
persistent memory updates") added a new __pmem annotation for sparse
verification. Add __pmem to the $Sparse variable so checkpatch can
appropriately ignore uses of this attribute too.
Signed-off-by: Joe Perches <joe@perches.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Andy Whitcroft <apw@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Wed, 9 Sep 2015 22:37:41 +0000 (15:37 -0700)]
checkpatch: always check block comment styles
Some of the block comment tests that are used only for networking are
appropriate for all patches.
For example, these styles are not encouraged:
/*
block comment without introductory *
*/
and
/*
* block comment with line terminating */
Remove the networking specific test and add comments.
There are some infrequent false positives where code is lazily
commented out using /* and */ rather than using #if 0/#endif blocks
like:
/* case foo:
case bar: */
case baz:
Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Wed, 9 Sep 2015 22:37:27 +0000 (15:37 -0700)]
checkpatch: add warning on BUG/BUG_ON use
Using BUG/BUG_ON crashes the kernel and is just unfriendly.
Enable code that emits a warning on BUG/BUG_ON use.
Make the code emit the message at WARNING level when scanning a patch and
at CHECK level when scanning files so that script users don't feel an
obligation to fix code that might be above their pay grade.
Signed-off-by: Joe Perches <joe@perches.com> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wang Long [Wed, 9 Sep 2015 22:37:22 +0000 (15:37 -0700)]
lib/test_kasan.c: make kmalloc_oob_krealloc_less more correctly
In kmalloc_oob_krealloc_less, I think it is better to test
the size2 boundary.
If we do not call krealloc, the access at position size1 is still
out of bounds while the access at position size2 is not. After the krealloc
call, the access at position size2 becomes out of bounds, so using size2 is
more correct.
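A sketch in the spirit of the updated test (not the exact lib/test_kasan.c
code) makes the boundary explicit:

  static noinline void kmalloc_oob_krealloc_less_sketch(void)
  {
      char *ptr1, *ptr2;
      size_t size1 = 17, size2 = 15;

      ptr1 = kmalloc(size1, GFP_KERNEL);
      if (!ptr1)
          return;
      ptr2 = krealloc(ptr1, size2, GFP_KERNEL);
      if (!ptr2) {
          kfree(ptr1);
          return;
      }
      /* After shrinking to size2, index size2 is the first byte past the
       * object, so this is the access KASAN should flag; index size1 would
       * have been out of bounds even without the krealloc. */
      ptr2[size2] = 'x';
      kfree(ptr2);
  }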
Signed-off-by: Wang Long <long.wanglong@huawei.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To further clarify the purpose of the "esc" argument, rename it to "only"
to reflect that it is a limit, not a list of additional characters to
escape.
hexdump: do not print debug dumps for !CONFIG_DEBUG
print_hex_dump_debug() is likely supposed to be analogous to pr_debug() or
dev_dbg() & friends. Currently it will adhere to dynamic debug, but will
not stub out prints if CONFIG_DEBUG is not set. Let's make it do the
right thing, because I am tired of having my dmesg buffer full of hex
dumps on production systems.
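A rough sketch of the intended result in include/linux/printk.h (simplified):
dynamic debug keeps its hook, DEBUG builds print at KERN_DEBUG, and
everything else compiles to a no-op.

  #if defined(CONFIG_DYNAMIC_DEBUG)
  #define print_hex_dump_debug(prefix_str, prefix_type, rowsize,      \
                               groupsize, buf, len, ascii)            \
      dynamic_hex_dump(prefix_str, prefix_type, rowsize,              \
                       groupsize, buf, len, ascii)
  #elif defined(DEBUG)
  #define print_hex_dump_debug(prefix_str, prefix_type, rowsize,      \
                               groupsize, buf, len, ascii)            \
      print_hex_dump(KERN_DEBUG, prefix_str, prefix_type, rowsize,    \
                     groupsize, buf, len, ascii)
  #else
  static inline void print_hex_dump_debug(const char *prefix_str,
                                          int prefix_type, int rowsize,
                                          int groupsize, const void *buf,
                                          size_t len, bool ascii)
  {
  }
  #endif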
Pan Xinhui [Wed, 9 Sep 2015 22:37:08 +0000 (15:37 -0700)]
lib/bitmap.c: bitmap_parselist can accept string with whitespaces on head or tail
In __bitmap_parselist we can accept whitespace at the head or tail of each
chunk during parsing. If the input contains valid ranges, there is no reason
to reject the user.
For example, take bitmap_parselist(" 1-3, 5, ", &mask, nmaskbits). After
separating the string, we get " 1-3", " 5", and " ". It is possible and
reasonable to accept such a string as long as the parsing result is correct.
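A minimal kernel-side usage sketch of the now-accepted input:

  static int parselist_example(void)
  {
      DECLARE_BITMAP(mask, 16);

      /* Leading/trailing whitespace around each chunk is tolerated;
       * on success bits 1, 2, 3 and 5 end up set in mask. */
      return bitmap_parselist(" 1-3, 5, ", mask, 16);
  }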
Signed-off-by: Pan Xinhui <xinhuix.pan@intel.com> Cc: Yury Norov <yury.norov@gmail.com> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pan Xinhui [Wed, 9 Sep 2015 22:37:05 +0000 (15:37 -0700)]
lib/bitmap.c: fix a special string handling bug in __bitmap_parselist
If the string ends with '-', for example bitmap_parselist("1,0-", &mask,
nmaskbits), it is not a valid pattern, so add a check after the loop and
return -EINVAL in that case.
Signed-off-by: Pan Xinhui <xinhuix.pan@intel.com> Cc: Yury Norov <yury.norov@gmail.com> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
include/linux/printk.h: include pr_fmt in pr_debug_ratelimited
The other two implementations of pr_debug_ratelimited include pr_fmt,
as does every other pr_* function, but the CONFIG_DYNAMIC_DEBUG
implementation of pr_debug_ratelimited omitted it.
This patch unifies the behavior.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit e0e817392b9a ("CRED: Add some configurable debugging [try #6]")
added the kdebug mechanism to this file back in 2009.
The kdebug macro calls no_printk, which always evaluates its arguments, and
most of the kdebug uses contain an unnecessary call to
atomic_read(&cred->usage)
Make the kdebug macro do nothing when not enabled by defining it as
do { if (0) no_printk(...); } while (0)
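For illustration, a minimal sketch of the before/after pattern; CRED_KDEBUG
below stands in for the file's actual debug switch, and the prefix follows
the existing kdebug format:

  #ifdef CRED_KDEBUG
  #define kdebug(FMT, ...)                                            \
      printk("[%-5.5s%5u] " FMT "\n",                                 \
             current->comm, current->pid, ##__VA_ARGS__)
  #else
  /* Old form: a bare no_printk(FMT, ##__VA_ARGS__) still evaluated its
   * arguments, including the atomic_read(&cred->usage) calls.
   * New form: the if (0) keeps format checking, but the arguments are
   * never evaluated, so those calls compile away. */
  #define kdebug(FMT, ...)                                            \
  do {                                                                \
      if (0)                                                          \
          no_printk("[%-5.5s%5u] " FMT "\n",                          \
                    current->comm, current->pid, ##__VA_ARGS__);      \
  } while (0)
  #endif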
$ size kernel/cred.o* (defconfig x86-64)
text data bss dec hex filename
2748 336 8 3092 c14 kernel/cred.o.new
2788 336 8 3132 c3c kernel/cred.o.old
Miscellanea:
o Neaten the #define kdebug macros while there
Signed-off-by: Joe Perches <joe@perches.com> Cc: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Poison pointer values should be small enough to fit into the
non-mmap'able (or hardly mmap'able) space. E.g. on x86 the "poison pointer
space" starts at 0x0. Given that unprivileged users cannot mmap
anything below mmap_min_addr, it should be safe to use poison pointers
lower than mmap_min_addr.
The current poison pointer values of LIST_POISON{1,2} might be too big for
mmap_min_addr values equal to or less than 1 MB (a common case; e.g. Ubuntu
uses only 0x10000). There is little point in using such big values given
that the "poison pointer space" below 1 MB is not yet exhausted. Changing
them to smaller values solves the problem for small mmap_min_addr setups.
The values are suggested by Solar Designer:
http://www.openwall.com/lists/oss-security/2015/05/02/6
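Roughly, the resulting definitions look like the sketch below (cf.
include/linux/poison.h; POISON_POINTER_DELTA is the architecture-configurable
offset into a known-illegal area):

  /* Keep the list poison pointers inside the low, non-mmap'able range;
   * the per-arch delta can still shift them into a well-known illegal area. */
  #define LIST_POISON1  ((void *) 0x100 + POISON_POINTER_DELTA)
  #define LIST_POISON2  ((void *) 0x200 + POISON_POINTER_DELTA)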
Signed-off-by: Vasily Kulikov <segoon@openwall.com> Cc: Solar Designer <solar@openwall.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Waiman Long [Wed, 9 Sep 2015 22:35:57 +0000 (15:35 -0700)]
proc: change proc_subdir_lock to a rwlock
The proc_subdir_lock spinlock is used to allow only one task at a time to
make changes to the proc directory structure as well as to look up
information in it. However, the information-lookup part can actually be
entered by more than one task, as the pde_get() and pde_put() reference count
updates in the critical sections are atomic increments and decrements
respectively and so are safe under concurrent access.
The x86 architecture already uses qrwlock, which is fair, and other
architectures like ARM are in the process of switching to it, so
unfairness shouldn't be a concern in this conversion.
This patch changes proc_subdir_lock to an rwlock in order to enable
concurrent lookups (see the sketch after the lists below). The following
functions were modified to take a write lock:
- proc_register()
- remove_proc_entry()
- remove_proc_subtree()
The following functions were modified to take a read lock:
- xlate_proc_name()
- proc_lookup_de()
- proc_readdir_de()
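A minimal sketch of the resulting pattern, using the generic rwlock API (not
the literal fs/proc code):

  static DEFINE_RWLOCK(proc_subdir_lock);

  /* Structural changes (register/remove entries) remain exclusive. */
  static void proc_register_sketch(struct proc_dir_entry *dir,
                                   struct proc_dir_entry *de)
  {
      write_lock(&proc_subdir_lock);
      /* ... link de into dir ... */
      write_unlock(&proc_subdir_lock);
  }

  /* Lookups only read the tree and bump reference counts atomically,
   * so several of them may now run in parallel. */
  static struct proc_dir_entry *proc_lookup_sketch(struct proc_dir_entry *dir,
                                                   const char *name)
  {
      struct proc_dir_entry *de = NULL;

      read_lock(&proc_subdir_lock);
      /* ... find de by name, then pde_get(de) ... */
      read_unlock(&proc_subdir_lock);
      return de;
  }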
A parallel /proc filesystem search with the "find" command (1000 threads)
was run on a 4-socket Haswell-EX box (144 threads). Before the patch, the
parallel search took about 39s. After the patch, the parallel find took
only 25s, a saving of about 14s.
The micro-benchmark I used was artificial, but it reproduced an exit-hang
problem that I saw in a real application. In fact, allowing only one task
at a time to do a lookup seems too limiting to me.
Signed-off-by: Waiman Long <Waiman.Long@hp.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Scott J Norton <scott.norton@hp.com> Cc: Douglas Hatch <doug.hatch@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
procfs: always expose /proc/<pid>/map_files/ and make it readable
Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and is
only exposed if CONFIG_CHECKPOINT_RESTORE is set.
Each mapped file region gets a symlink in /proc/<pid>/map_files/
corresponding to the virtual address range at which it is mapped. The
symlinks work like the symlinks in /proc/<pid>/fd/, so you can follow them
to the backing file even if that backing file has been unlinked.
Currently, files which are mapped, unlinked, and closed are impossible to
stat() from userspace. Exposing /proc/<pid>/map_files/ closes this
functionality "hole".
Not being able to stat() such files makes noticing and explicitly
accounting for the space they use on the filesystem impossible. You can
work around this by summing up the space used by every file in the
filesystem and subtracting that total from what statfs() tells you, but
that obviously isn't great, and it becomes unworkable once your filesystem
becomes large enough.
This patch moves map_files/ out from behind CONFIG_CHECKPOINT_RESTORE, and
adjusts the permissions enforced on it as follows:
Remove the CAP_SYS_ADMIN restriction, leaving only the current
restriction requiring PTRACE_MODE_READ. The information made
available to userspace by these three functions is already
available in /proc/PID/maps with MODE_READ, so I don't see any
reason to limit them any further (see below for more detail).
* proc_map_files_follow_link()
This stub has been added, and requires that the user have
CAP_SYS_ADMIN in order to follow the links in map_files/,
since there was concern on LKML both about the potential for
bypassing permissions on ancestor directories in the path to
files pointed to, and about what happens with more exotic
memory mappings created by some drivers (ie dma-buf).
In older versions of this patch, I changed every permission check in
the four functions above to enforce MODE_ATTACH instead of MODE_READ.
This was an oversight on my part, and after revisiting the discussion
it seems that nobody was concerned about anything outside of what is
made possible by ->follow_link(). So in this version, I've left the
checks for PTRACE_MODE_READ as-is.
[akpm@linux-foundation.org: catch up with concurrent proc_pid_follow_link() changes] Signed-off-by: Calvin Owens <calvinowens@fb.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Joe Perches <joe@perches.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
proc: add cond_resched to /proc/kpage* read/write loop
Reading/writing a /proc/kpage* file may take a long time on machines with a
lot of RAM installed.
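The change amounts to dropping a cond_resched() into the per-page copy loop;
a hedged sketch of the pattern (kpage_value_for() is a hypothetical stand-in
for the real per-PFN lookup):

  static ssize_t kpage_read_sketch(u64 __user *out, unsigned long pfn,
                                   size_t count)
  {
      ssize_t copied = 0;

      while (count >= sizeof(u64)) {
          if (put_user(kpage_value_for(pfn), out))
              return -EFAULT;
          pfn++;
          out++;
          count -= sizeof(u64);
          copied += sizeof(u64);

          /* Yield periodically so a huge read cannot hog the CPU. */
          cond_resched();
      }
      return copied;
  }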
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Suggested-by: Andres Lagar-Cavilla <andreslc@google.com> Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As noted by Minchan, a benefit of reading idle flag from /proc/kpageflags
is that one can easily filter dirty and/or unevictable pages while
estimating the size of unused memory.
Note that idle flag read from /proc/kpageflags may be stale in case the
page was accessed via a PTE, because it would be too costly to iterate
over all page mappings on each /proc/kpageflags read to provide an
up-to-date value. To make sure the flag is up-to-date one has to read
/sys/kernel/mm/page_idle/bitmap first.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Knowing the portion of memory that is not used by a certain application or
memory cgroup (idle memory) can be useful for partitioning the system
efficiently, e.g. by setting memory cgroup limits appropriately.
Currently, the only means to estimate the amount of idle memory provided
by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
access bit for all pages mapped to a particular process by writing 1 to
clear_refs, wait for some time, and then count smaps:Referenced. However,
this method has two serious shortcomings:
- it does not count unmapped file pages
- it affects the reclaimer logic
To overcome these drawbacks, this patch introduces two new page flags,
Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
A page's Idle flag can only be set from userspace, by setting the bit in
/sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
and it is cleared whenever the page is accessed either through page tables
(it is cleared in page_referenced() in this case) or using the read(2)
system call (mark_page_accessed()). Thus, by setting the Idle flag for
the pages of a particular workload (which can be found e.g. by reading
/proc/PID/pagemap), waiting for some time to let the workload access its
working set, and then reading the bitmap file, one can estimate the number
of pages that are not used by the workload.
The Young page flag is used to avoid interference with the memory
reclaimer. A page's Young flag is set whenever the Access bit of a page
table entry pointing to the page is cleared by writing to the bitmap file.
If page_referenced() is called on a Young page, it will add 1 to its
return value, therefore concealing the fact that the Access bit was
cleared.
Note, since there is no room for extra page flags on 32 bit, this feature
uses extended page flags when compiled on 32 bit.
[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: kpageidle requires an MMU]
[akpm@linux-foundation.org: decouple from page-flags rework] Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the scope of the idle memory tracking feature, which is introduced by
the following patch, we need to clear the referenced/accessed bit not only
in primary, but also in secondary ptes. The latter is required in order
to estimate the working set size (wss) of KVM VMs. At the same time we want
to avoid flushing the TLB, because it is quite expensive and it won't really
affect the final result.
Currently, there is no function for clearing the pte young bit that meets
our requirements, so this patch introduces one. To achieve that we have
to add a new mmu-notifier callback, clear_young, since there is no method
for testing-and-clearing a secondary pte without flushing the TLB. The new
method is not mandatory and is currently only implemented by KVM.
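A sketch of the new callback's likely shape, assuming it mirrors the existing
clear_flush_young signature minus the flush (the exact prototype may differ):

  struct mmu_notifier_ops {
      /* ... existing callbacks ... */

      /* Clear the young/accessed state of any secondary mapping of the
       * range WITHOUT flushing the TLB. Optional; returns non-zero if any
       * mapping in the range was young. */
      int (*clear_young)(struct mmu_notifier *mn,
                         struct mm_struct *mm,
                         unsigned long start,
                         unsigned long end);
  };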
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/proc/kpagecgroup contains a 64-bit inode number of the memory cgroup each
page is charged to, indexed by PFN. Having this information is useful for
estimating a cgroup working set size.
The file is present if CONFIG_PROC_PAGE_MONITOR && CONFIG_MEMCG.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is only used in mem_cgroup_try_charge, so fold it in and zap it.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hwpoison: use page_cgroup_ino for filtering by memcg
Hwpoison allows filtering pages by memory cgroup ino. Currently, it
calls try_get_mem_cgroup_from_page to obtain the cgroup from a page and
then its ino using cgroup_ino, but now we have a helper method for
that, page_cgroup_ino, so use it instead.
This patch also loosens the hwpoison memcg filter dependency rules: it
makes the filter depend on CONFIG_MEMCG instead of CONFIG_MEMCG_SWAP, because
the hwpoison memcg filter does not require anything (nor did it ever) from
the CONFIG_MEMCG_SWAP side.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patchset introduces a new user API for tracking user memory pages
that have not been used for a given period of time. The purpose of this
is to provide the userspace with the means of tracking a workload's
working set, i.e. the set of pages that are actively used by the
workload. Knowing the working set size can be useful for partitioning the
system more efficiently, e.g. by tuning memory cgroup limits
appropriately, or for job placement within a compute cluster.
==== USE CASES ====
The unified cgroup hierarchy has memory.low and memory.high knobs, which
are defined as the low and high boundaries for the workload working set
size. However, the working set size of a workload may be unknown or
change in time. With this patch set, one can periodically estimate the
amount of memory unused by each cgroup and tune their memory.low and
memory.high parameters accordingly, therefore optimizing the overall
memory utilization.
Another use case is balancing workloads within a compute cluster. Knowing
how much memory is not really used by a workload unit may help take a more
optimal decision when considering migrating the unit to another node
within the cluster.
Also, as noted by Minchan, this would be useful for per-process reclaim
(https://lwn.net/Articles/545668/). With idle tracking, a smart userspace
memory manager could reclaim only idle pages.
==== USER API ====
The user API consists of two new files:
* /sys/kernel/mm/page_idle/bitmap. This file implements a bitmap where each
bit corresponds to a page, indexed by PFN. When the bit is set, the
corresponding page is idle. A page is considered idle if it has not been
accessed since it was marked idle. To mark a page idle one should set the
bit corresponding to the page by writing to the file. A value written to the
file is OR-ed with the current bitmap value. Only user memory pages can be
marked idle; for other page types the input is silently ignored. Writing to
this file beyond the max PFN results in the ENXIO error. Only available when
CONFIG_IDLE_PAGE_TRACKING is set.
This file can be used to estimate the number of pages that are not
used by a particular workload, as follows (a small userspace sketch follows below):
1. mark all pages of interest idle by setting corresponding bits in the
/sys/kernel/mm/page_idle/bitmap
2. wait until the workload accesses its working set
3. read /sys/kernel/mm/page_idle/bitmap and count the number of bits set
* /proc/kpagecgroup. This file contains a 64-bit inode number of the
memory cgroup each page is charged to, indexed by PFN. Only available when
CONFIG_MEMCG is set.
This file can be used to find all pages (including unmapped file pages)
accounted to a particular cgroup. Using /sys/kernel/mm/page_idle/bitmap, one
can then estimate the cgroup working set size.
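A hedged userspace sketch of steps 1 and 3 of the bitmap procedure above,
assuming each 64-bit word of the file covers 64 consecutive PFNs and that
written values are OR-ed into the current bitmap:

  #include <fcntl.h>
  #include <stdint.h>
  #include <unistd.h>

  #define BITMAP_PATH "/sys/kernel/mm/page_idle/bitmap"

  /* Mark 64*nwords pages starting at pfn_start idle (pfn_start must be a
   * multiple of 64), then count how many of them are still idle. */
  static long count_idle(uint64_t pfn_start, uint64_t nwords)
  {
      uint64_t word, i;
      long idle = 0;
      int fd = open(BITMAP_PATH, O_RDWR);

      if (fd < 0)
          return -1;
      word = ~0ULL;
      for (i = 0; i < nwords; i++)
          if (pwrite(fd, &word, sizeof(word), (pfn_start / 64 + i) * 8) != 8)
              break;

      /* ... step 2: let the workload run for a while ... */

      for (i = 0; i < nwords; i++)
          if (pread(fd, &word, sizeof(word), (pfn_start / 64 + i) * 8) == 8)
              idle += __builtin_popcountll(word);
      close(fd);
      return idle;
  }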
For an example of using these files to estimate the number of unused
memory pages per memory cgroup, please see the script attached
below.
==== REASONING ====
The reason to introduce the new user API instead of using
/proc/PID/{clear_refs,smaps} is that the latter has two serious
drawbacks:
- it does not count unmapped file pages
- it affects the reclaimer logic
The new API attempts to overcome them both. For more details on how it
is achieved, please see the comment to patch 6.
==== PATCHSET STRUCTURE ====
The patch set is organized as follows:
- patch 1 adds page_cgroup_ino() helper for the sake of
/proc/kpagecgroup and patches 2-3 do related cleanup
- patch 4 adds /proc/kpagecgroup, which reports cgroup ino each page is
charged to
- patch 5 introduces a new mmu notifier callback, clear_young, which is
a lightweight version of clear_flush_young; it is used in patch 6
- patch 6 implements the idle page tracking feature, including the
userspace API, /sys/kernel/mm/page_idle/bitmap
- patch 7 exports idle flag via /proc/kpageflags
==== SIMILAR WORKS ====
Originally, the patch for tracking idle memory was proposed back in 2011
by Michel Lespinasse (see http://lwn.net/Articles/459269/). The main
difference between Michel's patch and this one is that Michel implemented
a kernel space daemon for estimating idle memory size per cgroup while
this patch only provides the userspace with the minimal API for doing the
job, leaving the rest up to the userspace. However, they both share the
same idea of Idle/Young page flags to avoid affecting the reclaimer logic.
==== PERFORMANCE EVALUATION ====
SPECjvm2008 (https://www.spec.org/jvm2008/) was used to evaluate the
performance impact introduced by this patch set. Three runs were carried
out:
- base: kernel without the patch
- patched: patched kernel, the feature is not used
- patched-active: patched kernel, 1 minute-period daemon is used for
tracking idle memory
For tracking idle memory, idlememstat utility was used:
https://github.com/locker/idlememstat
for dir, subdirs, files in os.walk(CGROUP_MOUNT):
ino = os.stat(dir)[stat.ST_INO]
print dir + ": " + str(idlememsz.get(ino, 0) / 1024) + " kB"
==== END SCRIPT ====
This patch (of 8):
Add page_cgroup_ino() helper to memcg.
This function returns the inode number of the closest online ancestor of
the memory cgroup a page is charged to. It is required for exporting
information about which page is charged to which cgroup to userspace,
which will be introduced by a following patch.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dan Streetman [Wed, 9 Sep 2015 22:35:21 +0000 (15:35 -0700)]
zswap: change zpool/compressor at runtime
Update the zpool and compressor parameters to be changeable at runtime.
When changed, a new pool is created with the requested zpool/compressor,
and added as the current pool at the front of the pool list. Previous
pools remain in the list only so that existing compressed pages can still be
removed from them; the old pool(s) are removed once they become empty.
Signed-off-by: Dan Streetman <ddstreet@ieee.org> Acked-by: Seth Jennings <sjennings@variantweb.net> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dan Streetman [Wed, 9 Sep 2015 22:35:16 +0000 (15:35 -0700)]
zpool: add zpool_has_pool()
This series makes creation of the zpool and compressor dynamic, so that
they can be changed at runtime. This makes using and configuring zswap
easier, as before this change zswap had to be configured at boot time using
boot params.
This uses a single list to track both the zpool and compressor together,
although Seth had mentioned an alternative which is to track the zpools
and compressors using separate lists. In the most common case, only a
single zpool and single compressor, using one list is slightly simpler
than using two lists, and for the uncommon case of multiple zpools and/or
compressors, using one list is slightly less simple (and uses slightly
more memory, probably) than using two lists.
This patch (of 4):
Add a zpool_has_pool() function, indicating whether the specified type of
zpool is available (i.e. zsmalloc or zbud). This allows checking whether a
pool is available, without actually trying to allocate it, similar to
crypto_has_alg().
This is used by a following patch to zswap that enables the dynamic
runtime creation of zswap zpools.
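A hedged usage sketch of what a zpool consumer can then do (parameter
handling simplified):

  /* Sketch: verify the requested zpool type is usable before trying to
   * create a pool with it; callers can fall back to a default otherwise. */
  static bool zswap_zpool_ok(char *type)
  {
      if (!zpool_has_pool(type)) {
          pr_err("zpool %s not available\n", type);
          return false;
      }
      return true;
  }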
Signed-off-by: Dan Streetman <ddstreet@ieee.org> Acked-by: Seth Jennings <sjennings@variantweb.net> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull infiniband/rdma updates from Doug Ledford:
"This is a fairly sizeable set of changes. I've put them through a
decent amount of testing prior to sending the pull request due to
that.
There are still a few fixups that I know are coming, but I wanted to
go ahead and get the big, sizable chunk into your hands sooner rather
than waiting for those last few fixups.
Of note is the fact that this creates what is intended to be a
temporary area in the drivers/staging tree specifically for some
cleanups and additions that are coming for the RDMA stack. We
deprecated two drivers (ipath and amso1100) and are waiting to hear
back if we can deprecate another one (ehca). We also put Intel's new
hfi1 driver into this area because it needs to be refactored and a
transfer library created out of the factored out code, and then it and
the qib driver and the soft-roce driver should all be modified to use
that library.
I expect drivers/staging/rdma to be around for three or four kernel
releases and then to go away as all of the work is completed and final
deletions of deprecated drivers are done.
Summary of changes for 4.3:
- Create drivers/staging/rdma
- Move amso1100 driver to staging/rdma and schedule for deletion
- Move ipath driver to staging/rdma and schedule for deletion
- Add hfi1 driver to staging/rdma and set TODO for move to regular
tree
- Initial support for namespaces to be used on RDMA devices
- Add RoCE GID table handling to the RDMA core caching code
- Infrastructure to support handling of devices with differing read
and write scatter gather capabilities
- Various iSER updates
- Kill off unsafe usage of global mr registrations
- Update SRP driver
- Misc mlx4 driver updates
- Support for the mr_alloc verb
- Support for a netlink interface between kernel and user space cache
daemon to speed path record queries and route resolution
- Initial support for safe hot removal of verbs devices"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (136 commits)
IB/ipoib: Suppress warning for send only join failures
IB/ipoib: Clean up send-only multicast joins
IB/srp: Fix possible protection fault
IB/core: Move SM class defines from ib_mad.h to ib_smi.h
IB/core: Remove unnecessary defines from ib_mad.h
IB/hfi1: Add PSM2 user space header to header_install
IB/hfi1: Add CSRs for CONFIG_SDMA_VERBOSITY
mlx5: Fix incorrect wc pkey_index assignment for GSI messages
IB/mlx5: avoid destroying a NULL mr in reg_user_mr error flow
IB/uverbs: reject invalid or unknown opcodes
IB/cxgb4: Fix if statement in pick_local_ip6adddrs
IB/sa: Fix rdma netlink message flags
IB/ucma: HW Device hot-removal support
IB/mlx4_ib: Disassociate support
IB/uverbs: Enable device removal when there are active user space applications
IB/uverbs: Explicitly pass ib_dev to uverbs commands
IB/uverbs: Fix race between ib_uverbs_open and remove_one
IB/uverbs: Fix reference counting usage of event files
IB/core: Make ib_dealloc_pd return void
IB/srp: Create an insecure all physical rkey only if needed
...
Merge tag 'for-linus-4.3' of git://git.code.sf.net/p/openipmi/linux-ipmi
Pull IPMI updates from Corey Minyard:
"Most of these have been sitting in linux-next for more than a release,
particularly commit 0fbcf4af7c83 ("ipmi: Convert the IPMI SI ACPI
handling to a platform device") which is probably the most complex
patch.
That is also the one that changes drivers/acpi/acpi_pnp.c. The change
in that file is only removing IPMI from a "special platform devices"
list, since I convert it to the standard PNP interface. I posted this
one to the ACPI list twice and got no response, and it seems to work
well in my testing, so I'm hoping it's good.
Hidehiro Kawai posted a set of changes that improves the panic time
handling in the IPMI driver.
The rest of the changes are minor bug fixes or cleanups and some
documentation"
* tag 'for-linus-4.3' of git://git.code.sf.net/p/openipmi/linux-ipmi:
ipmi:ssif: Add a module parm to specify that SMBus alerts don't work
ipmi: add of_device_id in MODULE_DEVICE_TABLE
ipmi: Compensate for BMCs that wont set the irq enable bit
ipmi: Don't call receive handler in the panic context
ipmi: Avoid touching possible corrupted lists in the panic context
ipmi: Don't flush messages in sender() in run-to-completion mode
ipmi: Factor out message flushing procedure
ipmi: Remove unneeded set_run_to_completion call
ipmi: Make some data const that was only read
ipmi: constify SSIF ACPI device ids
ipmi: Delete an unnecessary check before the function call "cleanup_one_si"
char:ipmi - Change 1 to true for bool type variables during initialization.
impi:Remove unneeded setting of module owner to THIS_MODULE in the platform structure, powernv_ipmi_driver
ipmi: Add a comment in how messages are delivered from the lower layer
ipmi/powernv: Fix potential invalid pointer dereference
ipmi: Convert the IPMI SI ACPI handling to a platform device
ipmi: Add device tree bindings information
Merge second patch-bomb from Andrew Morton:
"Almost all of the rest of MM. There was an unusually large amount of
MM material this time"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (141 commits)
zpool: remove no-op module init/exit
mm: zbud: constify the zbud_ops
mm: zpool: constify the zpool_ops
mm: swap: zswap: maybe_preload & refactoring
zram: unify error reporting
zsmalloc: remove null check from destroy_handle_cache()
zsmalloc: do not take class lock in zs_shrinker_count()
zsmalloc: use class->pages_per_zspage
zsmalloc: consider ZS_ALMOST_FULL as migrate source
zsmalloc: partial page ordering within a fullness_list
zsmalloc: use shrinker to trigger auto-compaction
zsmalloc: account the number of compacted pages
zsmalloc/zram: introduce zs_pool_stats api
zsmalloc: cosmetic compaction code adjustments
zsmalloc: introduce zs_can_compact() function
zsmalloc: always keep per-class stats
zsmalloc: drop unused variable `nr_to_migrate'
mm/memblock.c: fix comment in __next_mem_range()
mm/page_alloc.c: fix type information of memoryless node
memory-hotplug: fix comments in zone_spanned_pages_in_node() and zone_spanned_pages_in_node()
...
Merge branch 'parisc-4.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc updates from Helge Deller:
"The most important changes in this patchset are:
- re-enable 64bit PCI bus addresses which were temporarily disabled
for PA-RISC in kernel 4.2
- fix the 64bit CAS operation in the LWS path which now enables us to
enable the 64bit gcc atomic builtins even on 32bit userspace with
64bit kernel
- fix a long-standing bug which sometimes crashed kernel at bootup
while serial interrupt wasn't registered yet"
* 'parisc-4.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Use platform_device_register_simple("rtc-generic")
parisc: Drop CONFIG_SMP around update_cr16_clocksource()
parisc: Use double word condition in 64bit CAS operation
parisc: Filter out spurious interrupts in PA-RISC irq handler
parisc: Additionally check for in_atomic() in page fault handler
PCI,parisc: Enable 64-bit bus addresses on PA-RISC
parisc: Define ioremap_uc and ioremap_wc
Merge tag 'linux-kselftest-4.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kselftest update from Shuah Khan:
"This update adds new zram test and fixes to problems found during
testing this new zram test. In addition, there are a few bug fixes
and ksefltest improvement patches from Linaro developers.
I will send another update later on this week to fix kselftest
breakage due to commit 2bf9e0ab08c6 ("locking/static_keys: Provide a
selftest") after the fix soaks in next for a couple of days"
* tag 'linux-kselftest-4.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
selftests/zram: Makefile fix
selftests/zram: must be run as root
selftests: breakpoints: fix installing error on the architecture except x86
selftests: check before install
selftests/zram: Adding zram tests
Merge tag 'iommu-updates-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu updates from Joerg Roedel:
"This time the IOMMU updates are mostly cleanups or fixes. No big new
features or drivers this time. In particular the changes include:
- Bigger cleanup of the Domain<->IOMMU data structures and the code
that manages them in the Intel VT-d driver. This makes the code
easier to understand and maintain, and also easier to keep the data
structures in sync. It is also a preparation step to make use of
default domains from the IOMMU core in the Intel VT-d driver.
- Fixes for a couple of DMA-API misuses in ARM IOMMU drivers, namely
in the ARM and Tegra SMMU drivers.
- Fix for a potential buffer overflow in the OMAP iommu driver's
debug code
- A couple of smaller fixes and cleanups in various drivers
- One small new feature: Report domain-id usage in the Intel VT-d
driver to easier detect bugs where these are leaked"
* tag 'iommu-updates-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (83 commits)
iommu/vt-d: Really use upper context table when necessary
x86/vt-d: Fix documentation of DRHD
iommu/fsl: Really fix init section(s) content
iommu/io-pgtable-arm: Unmap and free table when overwriting with block
iommu/io-pgtable-arm: Move init-fn declarations to io-pgtable.h
iommu/msm: Use BUG_ON instead of if () BUG()
iommu/vt-d: Access iomem correctly
iommu/vt-d: Make two functions static
iommu/vt-d: Use BUG_ON instead of if () BUG()
iommu/vt-d: Return false instead of 0 in irq_remapping_cap()
iommu/amd: Use BUG_ON instead of if () BUG()
iommu/amd: Make a symbol static
iommu/amd: Simplify allocation in irq_remapping_alloc()
iommu/tegra-smmu: Parameterize number of TLB lines
iommu/tegra-smmu: Factor out tegra_smmu_set_pde()
iommu/tegra-smmu: Extract tegra_smmu_pte_get_use()
iommu/tegra-smmu: Use __GFP_ZERO to allocate zeroed pages
iommu/tegra-smmu: Remove PageReserved manipulation
iommu/tegra-smmu: Convert to use DMA API
iommu/tegra-smmu: smmu_flush_ptc() wants device addresses
...
Merge tag 'regmap-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap
Pull regmap updates from Mark Brown:
"This has been a busy release for regmap.
By far the biggest set of changes here are those from Markus Pargmann
which implement support for block transfers in smbus devices. This
required quite a bit of refactoring but leaves us better able to
handle odd restrictions that controllers may have and with better
performance on smbus.
Other new features include:
- Fix interactions with lockdep for nested regmaps (eg, when a device
using regmap is connected to a bus where the bus controller has a
separate regmap). Lockdep's default class identification is too
crude to work without help.
- Support for must-write bitfield operations, useful for operations
which require writing a bit to trigger them, from Kuninori Morimoto.
- Support for delaying during register patch application from Nariman
Poushin.
- Support for overriding cache state via the debugfs implementation
from Richard Fitzgerald"
* tag 'regmap-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap: (25 commits)
regmap: fix a NULL pointer dereference in __regmap_init
regmap: Support bulk reads for devices without raw formatting
regmap-i2c: Add smbus i2c block support
regmap: Add raw_write/read checks for max_raw_write/read sizes
regmap: regmap max_raw_read/write getter functions
regmap: Introduce max_raw_read/write for regmap_bulk_read/write
regmap: Add missing comments about struct regmap_bus
regmap: No multi_write support if bus->write does not exist
regmap: Split use_single_rw internally into use_single_read/write
regmap: Fix regmap_bulk_write for bus writes
regmap: regmap_raw_read return error on !bus->read
regulator: core: Print at debug level on debugfs creation failure
regmap: Fix regmap_can_raw_write check
regmap: fix typos in regmap.c
regmap: Fix integertypes for register address and value
regmap: Move documentation to regmap.h
regmap: Use different lockdep class for each regmap init call
thermal: sti: Add parentheses around bridge->ops->regmap_init call
mfd: vexpress: Add parentheses around bridge->ops->regmap_init call
regmap: debugfs: Fix misuse of IS_ENABLED
...
Merge tag 'fbdev-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux
Pull fbdev updates from Tomi Valkeinen:
"Minor fixes and cleanups"
* tag 'fbdev-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux:
video: fbdev: atmel_lcdfb: remove useless include
video: fbdev: pxa168fb: Use devm_clk_get
fbdev: ssd1307fb: fix error return code
fbdev: fix snprintf() limit in show_bl_curve()
video: fbdev: s3c-fb: Constify platform_device_id
video: fbdev: atmel: fix warning for const return value
video: fbdev: Drop owner assignment from platform_driver
video: fbdev: Drop owner assignment from i2c_driver
fbdev: remove unnecessary memset in vfb
framebuffer: disable vgacon on microblaze arch
fbdev: udlfb: remove unneeded initialization in few places
fbdev: Allow compile test of GPIO consumers if !GPIOLIB
fbdev: fix cea_modes array size
Merge tag 'mmc-v4.3' of git://git.linaro.org/people/ulf.hansson/mmc
Pull MMC updates from Ulf Hansson:
"MMC core:
- Fix a race condition in the request handling
- Skip trim commands for some buggy kingston eMMCs
- An optimization and a correction for erase groups
- Set CMD23 quirk for some Sandisk cards
MMC host:
- sdhci: Give GPIO CD higher precedence and don't poll when it's used
- sdhci: Fix DMA memory leakage
- sdhci: Some updates for clock management
- sdhci-of-at91: introduce driver for the Atmel SDMMC
- sdhci-of-arasan: Add support for sdhci-5.1
- sdhci-esdhc-imx: Add support for imx7d which also supports HS400
- sdhci: A collection of fixes and improvements for various sdhci hosts
- omap_hsmmc: Modernization of the regulator code
- dw_mmc: A couple of fixes for DMA and PIO mode
- usdhi6rol0: A few fixes and support probe deferral for regulators
- pxamci: Convert to use dmaengine
- sh_mmcif: Fix the suspend process in a short term solution
- tmio: Adjust timeout for commands
- sunxi: Fix timeout while gating/ungating clock"
* tag 'mmc-v4.3' of git://git.linaro.org/people/ulf.hansson/mmc: (67 commits)
mmc: android-goldfish: remove incorrect __iomem annotation
mmc: core: fix race condition in mmc_wait_data_done
mmc: host: omap_hsmmc: remove CONFIG_REGULATOR check
mmc: host: omap_hsmmc: use ios->vdd for setting vmmc voltage
mmc: host: omap_hsmmc: use regulator_is_enabled to find pbias status
mmc: host: omap_hsmmc: enable/disable vmmc_aux regulator based on previous state
mmc: host: omap_hsmmc: don't use ->set_power to set initial regulator state
mmc: host: omap_hsmmc: avoid pbias regulator enable on power off
mmc: host: omap_hsmmc: add separate function to set pbias
mmc: host: omap_hsmmc: add separate functions for enable/disable supply
mmc: host: omap_hsmmc: return error if any of the regulator APIs fail
mmc: host: omap_hsmmc: remove unnecessary pbias set_voltage
mmc: host: omap_hsmmc: use mmc_host's vmmc and vqmmc
mmc: host: omap_hsmmc: use the ocrmask provided by the vmmc regulator
mmc: host: omap_hsmmc: cleanup omap_hsmmc_reg_get()
mmc: host: omap_hsmmc: return on fatal errors from omap_hsmmc_reg_get
mmc: host: omap_hsmmc: use devm_regulator_get_optional() for vmmc
mmc: sdhci-of-at91: fix platform_no_drv_owner.cocci warnings
mmc: sh_mmcif: Fix suspend process
mmc: usdhi6rol0: fix error return code
...
Merge tag 'platform-drivers-x86-v4.3-1' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86
Pull x86 platform driver updates from Darren Hart:
"Significant work on toshiba_acpi, including new hardware support,
refactoring, and cleanups. Extend device support for asus, ideapad,
and acer systems. New surface pro 3 buttons driver. Misc minor
cleanups for thinkpad and hp-wireless.
acer-wmi:
- No rfkill on HP Omen 15 wifi
thinkpad_acpi:
- Remove side effects from vdbg_printk -> no_printk macro
surface pro 3:
- Add support driver for Surface Pro 3 buttons
hp-wireless:
- remove unneeded goto/label in hpwl_init
ideapad-laptop:
- add alternative representation for Yoga 2 to DMI table
- Add Lenovo Yoga 3 14 to no_hw_rfkill dmi list
asus-laptop:
- Add key found on Asus F3M
MAINTAINERS:
- Remove Toshiba Linux mailing list address
toshiba_acpi:
- Bump driver version to 0.23
- Remove unnecessary checks and returns in HCI/SCI functions
- Refactor *{get, set} functions return value
- Remove "*not supported" feature prints
- Change *available functions return type
- Add set_fan_status function
- Change some variables to avoid warnings from ninja-check
- Reorder toshiba_acpi_alt_keymap entries
- Remove unused wireless defines
- Transflective backlight updates
- Avoid registering input device on WMI event laptops
- Add /dev/toshiba_acpi device
- Adapt /proc/acpi/toshiba/keys to TOS1900 devices"
* tag 'platform-drivers-x86-v4.3-1' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86: (21 commits)
acer-wmi: No rfkill on HP Omen 15 wifi
thinkpad_acpi: Remove side effects from vdbg_printk -> no_printk macro
surface pro 3: Add support driver for Surface Pro 3 buttons
hp-wireless: remove unneeded goto/label in hpwl_init
ideapad-laptop: add alternative representation for Yoga 2 to DMI table
asus-laptop: Add key found on Asus F3M
MAINTAINERS: Remove Toshiba Linux mailing list address
ideapad-laptop: Add Lenovo Yoga 3 14 to no_hw_rfkill dmi list
toshiba_acpi: Bump driver version to 0.23
toshiba_acpi: Remove unnecessary checks and returns in HCI/SCI functions
toshiba_acpi: Refactor *{get, set} functions return value
toshiba_acpi: Remove "*not supported" feature prints
toshiba_acpi: Change *available functions return type
toshiba_acpi: Add set_fan_status function
toshiba_acpi: Change some variables to avoid warnings from ninja-check
toshiba_acpi: Reorder toshiba_acpi_alt_keymap entries
toshiba_acpi: Remove unused wireless defines
toshiba_acpi: Transflective backlight updates
toshiba_acpi: Avoid registering input device on WMI event laptops
toshiba_acpi: Add /dev/toshiba_acpi device
...
Merge branch 'i2c/for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c updates from Wolfram Sang:
"Features:
- new drivers: Renesas EMEV2, register based MUX, NXP LPC2xxx
- core: scans DT and assigns wakeup interrupts. no driver changes needed.
- core: some refcounting issues fixed and better API for that
- core: new helper function for best effort block read emulation
- slave framework: proper DT bindings and userspace instantiation
- some bigger work for xiic, pxa, omap drivers
.. and quite a number of smaller driver fixes, cleanups, improvements"
* 'i2c/for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (65 commits)
i2c: mux: reg Change ioread endianness for readback
i2c: mux: reg: fix compilation warnings
i2c: mux: reg: simplify register size checking
i2c: muxes: fix leaked i2c adapter device node references
i2c: allow specifying separate wakeup interrupt in device tree
of/irq: export of_get_irq_byname()
i2c: xgene-slimpro: dma_mapping_error() doesn't return an error code
i2c: Replace I2C_CROS_EC_TUNNEL dependency
eeprom: at24: use i2c_smbus_read_i2c_block_data_or_emulated
i2c: core: Add support for best effort block read emulation
i2c: lpc2k: add driver
i2c: mux: Add register-based mux i2c-mux-reg
i2c: dt: describe generic bindings
i2c: slave: print warning if slave flag not set
i2c: support 10 bit and slave addresses in sysfs 'new_device'
i2c: take address space into account when checking for used addresses
i2c: apply DT flags when probing
i2c: make address check indpendent from client struct
i2c: rename address check functions
i2c: apply address offset for slaves, too
...
Merge tag 'rtc-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux
Pull RTC updates from Alexandre Belloni:
"Core:
- use is_visible() to control sysfs attributes
- switch wakealarm attribute to DEVICE_ATTR_RW
- make rtc_does_wakealarm() return boolean
- properly manage lifetime of dev and cdev in rtc device
- remove unnecessary device_get() in rtc_device_unregister
- fix double free in rtc_register_device() error path
Subsystem wide cleanups:
- fix drivers that consider 0 as a valid IRQ in client->irq
- Drop (un)likely before IS_ERR(_OR_NULL)
- drop the remaining owner assignment for i2c_driver and
platform_driver
- module autoload fixes
Drivers:
- 88pm80x: add device tree support
- abx80x: fix RTC write bit
- ab8500: Add a sentinel to ab85xx_rtc_ids[]
- armada38x: Align RTC set time procedure with the official errata
- as3722: correct month value
- at91sam9: cleanups
- at91rm9200: get and use slow clock and cleanups
- bq32k: remove redundant check
- cmos: century support, proper fix for the spurious wakeup
- ds1307: cleanups and wakeup irq support
- ds1374: Remove unused variable
- ds1685: Use module_platform_driver
- ds3232: fix WARNING trace in resume function
- gemini: fix ptr_ret.cocci warnings
- mt6397: implement suspend/resume
- omap: support internal and external clock enabling
- opal: Enable alarms only when opal supports tpo
- pcf2127: use OFS flag to detect unreliable date and warn the user
- pl031: fix typo for author email
- rx8025: huge cleanup and fixes
- sa1100/pxa: share common code
- s5m: fix to update ctrl register
- s3c: fix clocks and wakeup, cleanup
- sirfsoc: use regmap
- nvram_read()/nvram_write() functions for cmos, ds1305, ds1307,
ds1343, ds1511, ds1553, ds1742, m48t59, rp5c01, stk17ta8, tx4939
- use rtc_valid_tm() error code when reading date/time instead of 0
for isl12022, pcf2123, pcf2127"
* tag 'rtc-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (90 commits)
rtc: abx80x: fix RTC write bit
rtc: ab8500: Add a sentinel to ab85xx_rtc_ids[]
rtc: ds1374: Remove unused variable
rtc: Fix module autoload for OF platform drivers
rtc: Fix module autoload for rtc-{ab8500,max8997,s5m} drivers
rtc: omap: Add external clock enabling support
rtc: omap: Add internal clock enabling support
ARM: dts: AM437x: Add the internal and external clock nodes for rtc
rtc: s5m: fix to update ctrl register
rtc: add xilinx zynqmp rtc driver
devicetree: bindings: rtc: add bindings for xilinx zynqmp rtc
rtc: as3722: correct month value
ARM: config: Switch PXA27x platforms to use PXA RTC driver
ARM: mmp: remove unused RTC register definitions
ARM: sa1100: remove unused RTC register definitions
rtc: sa1100/pxa: convert to run-time register mapping
ARM: pxa: add memory resource to SA1100 RTC device
rtc: pxa: convert to use shared sa1100 functions
rtc: sa1100: prepare to share sa1100_rtc_ops
rtc: ds3232: fix WARNING trace in resume function
...
zswap_get_swap_cache_page and read_swap_cache_async have pretty much the
same code, with the only significant differences being the return value and
the usage of swap_readpage.
Introduce a helper, __read_swap_cache_async(), with the common code. Behavior
change: zswap_get_swap_cache_page now uses radix_tree_maybe_preload
instead of radix_tree_preload; it looks like this difference existed only
because of the code duplication.
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: David Herrmann <dh.herrmann@gmail.com> Cc: Seth Jennings <sjennings@variantweb.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make zram syslog error reporting more consistent. We have random
error levels in some places. For example, critical errors like
"Error allocating memory for compressed page"
and
"Unable to allocate temp memory"
are reported as KERN_INFO messages.
a) Reassign error levels
Error messages that directly affect zram
functionality -- pr_err():
Error allocating zram address table
Error creating memory pool
Decompression failed! err=%d, page=%u
Unable to allocate temp memory
Compression failed! err=%d
Error allocating memory for compressed page: %u, size=%zu
Cannot initialise %s compressing backend
Error allocating disk queue for device %d
Error allocating disk structure for device %d
Error creating sysfs group for device %d
Unable to register zram-control class
Unable to get major number
Messages that do not affect functionality, but the user
must be warned (because sysfs attrs will be removed in
this particular case) -- pr_warn():
%d (%s) Attribute %s (and others) will be removed. %s
Messages that do not affect functionality and mostly are
informative -- pr_info():
Cannot change max compression streams
Can't change algorithm for initialized device
Cannot change disksize for initialized device
Added device: %s
Removed device: %s
b) Update sysfs_create_group() error message
First, it lacks a trailing newline; add it. Second, every error message
in zram_add() has a "for device %d" part, which makes errors more
informative. Add the missing part to the "Error creating sysfs group" message.
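For illustration, a sketch of the zram_add() error path after both changes
(the device_id variable and the out_free_disk label are assumptions based on
the message text, not verbatim driver code):

  ret = sysfs_create_group(&disk_to_dev(zram->disk)->kobj,
                           &zram_disk_attr_group);
  if (ret < 0) {
      /* now pr_err(), with the trailing newline and the device id */
      pr_err("Error creating sysfs group for device %d\n", device_id);
      goto out_free_disk;
  }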
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
zsmalloc: do not take class lock in zs_shrinker_count()
We can avoid taking class ->lock around zs_can_compact() in
zs_shrinker_count(), because the number we return is, by design, outdated
in the general case. Different sources can change a class's state right
after we return from zs_can_compact(): ongoing I/O operations, manually
triggered compaction, or both happening simultaneously.
We redo this calculation during compaction on a per-class basis anyway.
zs_unregister_shrinker() will not return while the shrinker is active, so
classes won't unexpectedly disappear while zs_shrinker_count() iterates
them.
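A minimal sketch of the lock-free counting callback, assuming the
zs_can_compact() helper sketched further below and a zs_size_classes class
count:

  static unsigned long zs_shrinker_count(struct shrinker *shrinker,
                                         struct shrink_control *sc)
  {
      int i;
      unsigned long pages_to_free = 0;
      struct zs_pool *pool = container_of(shrinker, struct zs_pool,
                                          shrinker);

      for (i = 0; i < zs_size_classes; i++) {
          struct size_class *class = pool->size_class[i];

          if (class->index != i)      /* skip merged classes */
              continue;
          /* no class->lock here: the estimate is allowed to be stale */
          pages_to_free += zs_can_compact(class);
      }

      return pages_to_free;
  }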
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
zsmalloc: partial page ordering within a fullness_list
We want to see more ZS_FULL pages and fewer ZS_ALMOST_{FULL, EMPTY}
pages. Put a page with a higher ->inuse count first within its
->fullness_list, which gives us better chances to fill up this page
with new objects (find_get_zspage() returns the ->fullness_list head for
new object allocation), so some zspages become ZS_ALMOST_FULL/ZS_FULL
quicker.
This is a trivial and cheap ->inuse comparison which does not slow down
zsmalloc and, in the worst case, leaves the list pages in no particular
order.
A more expensive solution could sort fullness_list by ->inuse count.
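A hypothetical sketch of the cheap comparison (not the driver's actual
page-based bookkeeping; it assumes a struct zspage with an embedded
list_head and an ->inuse counter purely for illustration):

  static void insert_zspage(struct size_class *class, struct zspage *zspage,
                            int fullness)
  {
      struct list_head *list = &class->fullness_list[fullness];
      struct zspage *head;

      if (list_empty(list)) {
          list_add(&zspage->list, list);
          return;
      }

      head = list_first_entry(list, struct zspage, list);
      if (zspage->inuse > head->inuse)
          list_add(&zspage->list, list);          /* becomes the new head */
      else
          list_add(&zspage->list, &head->list);   /* right after the head */
  }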
[minchan@kernel.org: code adjustments] Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Perform automatic pool compaction via a shrinker when the system is getting
tight on memory.
User space has very little knowledge of zsmalloc fragmentation and basically
no mechanism to tell whether compaction will result in any memory gain.
Another issue is that user space is not always aware that the system is
getting tight on memory, which leads to very uncomfortable scenarios where
user space may start issuing compaction 'randomly' or from a crontab, for
example. Fragmentation is not necessarily bad: allocated but unused objects
may, after all, be filled with data later, without the need to allocate a
new zspage. On the other hand, we obviously don't want to waste memory when
the system needs it.
Compaction now has a relatively quick pool scan, so we can easily estimate
the number of pages that will be freed, which makes it possible to call this
function from a shrinker->count_objects() callback. We also abort compaction
as soon as we detect that we can't free any more pages, preventing wasteful
object migrations.
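A sketch of wiring the pool into the shrinker interface; zs_compact() is
assumed here to return the number of pages freed by this run, as described
in the next entry below:

  static unsigned long zs_shrinker_scan(struct shrinker *shrinker,
                                        struct shrink_control *sc)
  {
      struct zs_pool *pool = container_of(shrinker, struct zs_pool,
                                          shrinker);

      /* assumed to return the pages freed by this compaction run */
      return zs_compact(pool);
  }

  static int zs_register_shrinker(struct zs_pool *pool)
  {
      pool->shrinker.scan_objects = zs_shrinker_scan;
      pool->shrinker.count_objects = zs_shrinker_count;
      pool->shrinker.batch = 0;
      pool->shrinker.seeks = DEFAULT_SEEKS;

      return register_shrinker(&pool->shrinker);
  }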
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Suggested-by: Minchan Kim <minchan@kernel.org> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Compaction currently returns to zram the number of migrated objects, which
is quite uninformative: we have objects of different sizes, so user space
cannot derive any valuable data from that number. Change compaction to
operate in terms of pages and return to the compaction issuer the number of
pages that were freed during compaction. From now on we export a more
meaningful value in zram<id>/mm_stat: the number of freed (compacted) pages.
This requires:
(a) renaming `num_migrated' to `pages_compacted'
(b) an internal API change: return first_page's fullness_group from
    putback_zspage(), so we know when putback_zspage() did
    free_zspage(). This lets us account compaction stats correctly.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
`zs_compact_control' accounts the number of migrated objects, but it has a
limited lifespan: we lose it as soon as zs_compact() returns to zram. This
worked fine because (a) zram had its own counter of migrated objects and
(b) only zram could trigger compaction. However, it does not work for
automatic pool compaction (not issued by zram). To account for objects
migrated during auto-compaction (issued by the shrinker) we need to store
this number in zs_pool.
Define a new `struct zs_pool_stats' structure to keep zs_pool's stats there.
It provides only `num_migrated', as of this writing, but it can surely be
extended.
A new zsmalloc zs_pool_stats() symbol exports zs_pool's stats back to the
caller.
Use zs_pool_stats() in zram and remove `num_migrated' from zram_stats.
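A sketch of the new structure and accessor, assuming the pool embeds the
stats as pool->stats (names follow the description above, not necessarily
the verbatim code):

  struct zs_pool_stats {
      /* how many objects were migrated by compaction */
      unsigned long num_migrated;
  };

  void zs_pool_stats(struct zs_pool *pool, struct zs_pool_stats *stats)
  {
      memcpy(stats, &pool->stats, sizeof(struct zs_pool_stats));
  }
  EXPORT_SYMBOL_GPL(zs_pool_stats);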
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Suggested-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change the zs_object_copy() argument order to (DST, SRC) rather than
(SRC, DST): copy/move functions usually take their arguments in (to, from)
order.
Rename alloc_target_page() to isolate_target_page(). This function doesn't
allocate anything; it isolates the target page, pretty much like
isolate_source_page().
Tweak __zs_compact() comment.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce zs_can_compact(), which checks whether class compaction will free
any pages. Rephrasing: do we have enough unused objects to form at least one
ZS_EMPTY zspage and free it? Compaction is aborted if class compaction will
not result in any (further) savings.
EXAMPLE (this debug output is not part of this patch set). For each class we
print:
  - class size
  - number of allocated objects
  - number of used objects
  - max objects per zspage
  - pages per zspage
  - estimated number of pages that will be freed
Every "compaction is useless" line indicates that we saved CPU cycles.
For instance, class-512 has:
  544 objects allocated
  540 objects used
  8 objects per zspage
Even if we have an ALMOST_EMPTY zspage, we still don't have enough room to
migrate all of its objects and free this zspage, so compaction would not
make much sense here; it's better to just leave it as is.
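A sketch of the check, assuming per-class OBJ_ALLOCATED/OBJ_USED counters
readable through a zs_stat_get()-style helper (see the stats entry below)
and a helper, called get_maxobj_per_zspage() here, giving the maximum number
of objects per zspage:

  static unsigned long zs_can_compact(struct size_class *class)
  {
      unsigned long obj_wasted;
      unsigned long obj_allocated = zs_stat_get(class, OBJ_ALLOCATED);
      unsigned long obj_used = zs_stat_get(class, OBJ_USED);

      /* allocated-but-unused objects, expressed in whole zspages */
      obj_wasted = obj_allocated - obj_used;
      obj_wasted /= get_maxobj_per_zspage(class->size,
                                          class->pages_per_zspage);

      /* estimated number of pages compaction can free */
      return obj_wasted * class->pages_per_zspage;
  }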
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Always account per-class `zs_size_stat' stats. This data will help us
make better decisions during compaction. We are especially interested
in OBJ_ALLOCATED and OBJ_USED, which can tell us if class compaction
will result in any memory gain.
For instance, we know the number of allocated objects in the class, the
number of objects being used (and therefore how many objects are not used),
and the number of objects per zspage. So we can tell whether we have enough
unused objects to form at least one ZS_EMPTY zspage during compaction.
We calculate this value on a per-class basis, so we can compute the total
number of zspages that can be released, which is exactly what a shrinker
wants to know.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patchset tweaks compaction and makes it possible to trigger pool
compaction automatically when the system is getting low on memory.
zsmalloc can in some cases suffer from notable fragmentation, and compaction
can release a considerable amount of memory. The problem is that currently
we rely entirely on user space to perform compaction when needed. However,
performing zsmalloc compaction is not always an obvious thing to do. For
example, suppose we have an `idle' fragmented (compaction was never
performed) zram device and the system is getting low on memory due to some
third-party user processes (gcc LTO, or firefox, etc.). It's quite unlikely
that user space will issue zpool compaction in this case. Besides, user
space cannot tell for sure how badly the pool is fragmented; this
information is, however, known to zsmalloc and, hence, to a shrinker.
This patch (of 7):
__zs_compact() does not use `nr_to_migrate', drop it.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
memory-hotplug: fix comments in zone_spanned_pages_in_node() and zone_absent_pages_in_node()
When hot adding a node from add_memory(), we will add memblock first, so
the node is not empty. But when called from cpu_up(), the node should
be empty.
Yaowei Bai [Tue, 8 Sep 2015 22:04:13 +0000 (15:04 -0700)]
mm/page_alloc.c: change sysctl_lower_zone_reserve_ratio to sysctl_lowmem_reserve_ratio in comments
We use sysctl_lowmem_reserve_ratio rather than
sysctl_lower_zone_reserve_ratio to determine how aggressively the kernel
defends lowmem from the possibility of being captured into pinned user
memory. To avoid confusion, correct the name in the comments.
Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yaowei Bai [Tue, 8 Sep 2015 22:04:10 +0000 (15:04 -0700)]
mm/page_alloc.c: fix a misleading comment
The comment says that the per-cpu batchsize and zone watermarks are
determined by present_pages, which is definitely wrong; they are both
calculated from managed_pages. Fix it.
Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Petr Mladek [Tue, 8 Sep 2015 22:04:05 +0000 (15:04 -0700)]
mm/khugepaged: allow interruption of allocation sleep again
Commit 1dfb059b9438 ("thp: reduce khugepaged freezing latency") fixed
khugepaged so that it does not block system suspend. But as a result it
cannot be interrupted before the given timeout expires, because the
condition for the wait event is "false".
This patch puts back the original approach, but uses
freezable_schedule_timeout_interruptible() instead of
schedule_timeout_interruptible(), which does the right thing. I am pretty
sure that the freezable variant was not used in the original fix only
because it was not available at that time.
The regression has been there for ages and was not critical. It just did the
allocation throttling a little bit more aggressively.
I found this problem when converting the kthread to the kthread worker API
and trying to understand the code.
This bug is thought to have minimal userspace-visible impact. Somebody
could set a high alloc_sleep value by mistake, and then try to fix it
back, but khugepaged would keep sleeping until the high value expires.
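A sketch of the restored throttling sleep, assuming the wait-queue and
sysfs-tunable names used by khugepaged:

  static void khugepaged_alloc_sleep(void)
  {
      DEFINE_WAIT(wait);

      add_wait_queue(&khugepaged_wait, &wait);
      /*
       * Freezer-aware and interruptible: does not block suspend and can be
       * woken up before the timeout expires.
       */
      freezable_schedule_timeout_interruptible(
          msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
      remove_wait_queue(&khugepaged_wait, &wait);
  }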
Signed-off-by: Petr Mladek <pmladek@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Tue, 8 Sep 2015 22:03:59 +0000 (15:03 -0700)]
mm/compaction: correct to flush migrated pages if pageblock skip happens
We cache isolate_start_pfn before entering isolate_migratepages(). If a
pageblock is skipped in isolate_migratepages() for whatever reason,
cc->migrate_pfn can end up far from isolate_start_pfn, and hence we flush
pages that were freed. For example, the following scenario is possible:
- assume order-9 compaction, pageblock order is 9
- start_isolate_pfn is 0x200
- isolate_migratepages()
- skip a number of pageblocks
- start to isolate from pfn 0x600
- cc->migrate_pfn = 0x620
- return
- last_migrated_pfn is set to 0x200
- check flushing condition
- current_block_start is set to 0x600
- last_migrated_pfn < current_block_start then do useless flush
This wrong flush does not help performance or the success rate, so this
patch fixes it. A simple way to know the exact position where we start to
isolate migratable pages is to cache it in isolate_migratepages() before
entering the actual isolation. This patch implements that and fixes the
problem.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
alloc_pages_node() might fail when called with NUMA_NO_NODE and
__GFP_THISNODE on a CPU belonging to a memoryless node. To make the
local-node fallback more robust and prevent such situations, use
numa_mem_id(), which was introduced for similar scenarios in the slab
context.
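A minimal sketch of that fallback, using the helper names from
include/linux/gfp.h as described above (an illustration, not the verbatim
header):

  static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
                                              unsigned int order)
  {
      if (nid == NUMA_NO_NODE)
          nid = numa_mem_id();    /* nearest node that has memory */

      return __alloc_pages_node(nid, gfp_mask, order);
  }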
Suggested-by: Christoph Lameter <cl@linux.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: unify checks in alloc_pages_node() and __alloc_pages_node()
Perform the same debug checks in alloc_pages_node() as are done in
__alloc_pages_node(), by making the former function a wrapper of the
latter one.
In addition to better diagnostics in DEBUG_VM builds for situations
which have been already fatal (e.g. out-of-bounds node id), there are
two visible changes for potential existing buggy callers of
alloc_pages_node():
- calling alloc_pages_node() with any negative nid (e.g. due to arithmetic
  overflow) was treated as passing NUMA_NO_NODE, and the fallback to the
  local node was applied. This will now be fatal.
- calling alloc_pages_node() with an offline node will now be checked in
  DEBUG_VM builds. Since it's not fatal if the node has been previously
  online, and this patch may expose some existing buggy callers, change the
  VM_BUG_ON in __alloc_pages_node() to VM_WARN_ON.
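A sketch of the checks that end up in __alloc_pages_node() after this
change; alloc_pages_node() becomes the thin wrapper sketched in the previous
entry (treat this as an illustration, not the verbatim header):

  static inline struct page *__alloc_pages_node(int nid, gfp_t gfp_mask,
                                                unsigned int order)
  {
      VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);  /* still fatal in DEBUG_VM */
      VM_WARN_ON(!node_online(nid));              /* warn, don't crash */

      return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
  }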
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: rename alloc_pages_exact_node() to __alloc_pages_node()
alloc_pages_exact_node() was introduced in commit 6484eb3e2a81 ("page
allocator: do not check NUMA node ID when the caller knows the node is
valid") as an optimized variant of alloc_pages_node(), that doesn't
fallback to current node for nid == NUMA_NO_NODE. Unfortunately the
name of the function can easily suggest that the allocation is
restricted to the given node and fails otherwise. In truth, the node is
only preferred, unless __GFP_THISNODE is passed among the gfp flags.
The misleading name has led to mistakes in the past, see for example
commits 5265047ac301 ("mm, thp: really limit transparent hugepage
allocation to local node") and b360edb43f8e ("mm, mempolicy:
migrate_to_node should only migrate to node").
Another issue with the name is that there's a family of
alloc_pages_exact*() functions where 'exact' means exact size (instead
of page order), which leads to more confusion.
To prevent further mistakes, this patch effectively renames
alloc_pages_exact_node() to __alloc_pages_node() to better convey that
it's an optimized variant of alloc_pages_node() not intended for general
usage. Both functions get described in comments.
We also considered providing a true convenience function for allocations
restricted to a node, but the prevailing opinion seems to be that
__GFP_THISNODE already provides that functionality and we shouldn't
duplicate the API needlessly. The number of users would be small anyway.
Existing callers of alloc_pages_exact_node() are simply converted to call
__alloc_pages_node(), with the exception of sba_alloc_coherent(), which
open-codes the check for NUMA_NO_NODE and is therefore converted to use
alloc_pages_node() instead. This means it no longer performs some VM_BUG_ON
checks, and since the current check for nid in alloc_pages_node() uses a
'nid < 0' comparison (which includes NUMA_NO_NODE), it may hide wrong values
which would previously have been exposed.
Both differences will be rectified by the next patch.
To sum up, this patch makes no functional changes, except temporarily
hiding potentially buggy callers. Restricting the checks in
alloc_pages_node() is left for the next patch which can in turn expose
more existing buggy callers.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Robin Holt <robinmholt@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Gleb Natapov <gleb@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Cliff Whickman <cpw@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm, vmscan: unlock page while waiting on writeback
This is merely a politeness: I've not found that shrink_page_list()
leads to deadlock with the page it holds locked across
wait_on_page_writeback(); but nevertheless, why hold others off by
keeping the page locked there?
And while we're at it: remove the mistaken "not " from the commentary on
this Case 3 (and a distracting blank line from Case 2, if I may).
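A simplified fragment sketching the resulting Case 3 handling inside the
shrink_page_list() loop (surrounding control flow omitted):

  if (PageWriteback(page)) {
      /*
       * Case 3: wait for writeback, but don't hold others off by
       * keeping the page locked while we wait.
       */
      unlock_page(page);
      wait_on_page_writeback(page);
      /* then go back and try the same page again */
      list_add_tail(&page->lru, page_list);
      continue;
  }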
Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wang Kai [Tue, 8 Sep 2015 22:03:41 +0000 (15:03 -0700)]
kmemleak: record accurate early log buffer count and report when exceeded
In the log_early() function, crt_early_log should also be incremented when
'crt_early_log >= ARRAY_SIZE(early_log)'. Otherwise the count reported from
kmemleak_init() is one less than the actual number.
Then, in kmemleak_init(), if the early_log buffer size equals the actual
number, kmemleak initializes successfully, so change the warning condition
to 'crt_early_log > ARRAY_SIZE(early_log)'.
Signed-off-by: Wang Kai <morgan.wang@huawei.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shmem uses shmem_recalc_inode() to update i_blocks when it allocates a
page, undoes a range, or swaps. But the mm can drop a clean page without
notifying shmem, which makes fstat() sometimes return an out-of-date block
size.
The problem can be partially solved by adding an inode_operations->getattr
which calls shmem_recalc_inode() to update i_blocks for fstat().
shmem_recalc_inode() also updates the counters used by statfs() and
vm_committed_as. For those, the situation is unchanged: they still suffer
from the discrepancy after a clean page is dropped and before the function
is called by the aforementioned triggers.
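A sketch of such a getattr implementation, using the ->getattr prototype of
that era and the existing shmem helpers (the locking detail is an
assumption, not the verbatim patch):

  static int shmem_getattr(struct vfsmount *mnt, struct dentry *dentry,
                           struct kstat *stat)
  {
      struct inode *inode = d_inode(dentry);
      struct shmem_inode_info *info = SHMEM_I(inode);

      spin_lock(&info->lock);
      shmem_recalc_inode(inode);      /* refresh i_blocks for fstat() */
      spin_unlock(&info->lock);

      generic_fillattr(inode, stat);
      return 0;
  }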
mm/memblock.c: rename local variable of memblock_type to 'type'
Since commit e3239ff92a17 ("memblock: Rename memblock_region to
memblock_type and memblock_property to memblock_region"), local variables
of type memblock_type have been named 'type'. This commit renames the
remaining local variables of that type in the same way.
Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/hwpoison: don't try to unpoison containment-failed pages
memory_failure() can be called on any page at any time, which means that we
can't eliminate the possibility of containment failure. In such a case the
best option is to leak the page intentionally (and never touch it later).
We have an unpoison function for testing, and it cannot handle
containment-failed pages, which results in a kernel panic (visible with
various calltraces). So this patch limits the unpoisonable pages to properly
contained pages and ignores all others.
Testers should keep in mind that there are pages that cannot be unpoisoned
when writing test programs.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Tested-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a race between soft_offline_page() and unpoison_memory():
soft_offline_page
  __soft_offline_page
    TestSetPageHWPoison
                            unpoison_memory
                              PageHWPoison check (true)
                              TestClearPageHWPoison
                              put_page -> release refcount held by get_hwpoison_page in unpoison_memory
                              put_page -> release refcount held by isolate_lru_page in __soft_offline_page
    migrate_pages
The second put_page() releases the refcount held by isolate_lru_page(),
which leads to unmap_and_move() dropping the last refcount of the page while
its mapcount is still 1, since try_to_unmap() is not called when only one
user maps the page. In any case, the page refcount and mapcount still end up
inconsistent if the page is mapped by multiple users.
This race was introduced by commit 4491f71260 ("mm/memory-failure: set
PageHWPoison before migrate_pages()"), which focuses on preventing the reuse
of a successfully migrated page. Before that commit we prevented the reuse
by changing the migratetype to MIGRATE_ISOLATE during soft offlining, which
has the following problems, so simply reverting the commit is not the best
option:
1) it doesn't eliminate the reuse completely, because
set_migratetype_isolate() can fail to set MIGRATE_ISOLATE to the
target page if the pageblock of the page contains one or more
unmovable pages (i.e. has_unmovable_pages() returns true).
2) the original code changes migratetype to MIGRATE_ISOLATE
forcibly, and sets it to MIGRATE_MOVABLE forcibly after soft offline,
regardless of the original migratetype state, which could impact
other subsystems like memory hotplug or compaction.
This patch moves SetPageHWPoison to just after put_page() in
unmap_and_move(), which closes the reported race window and minimizes
another race window between SetPageHWPoison and reallocation (which causes
the reuse of a soft-offlined page). The latter race window still exists, but
it is acceptable: it is rare and, even if it happens, is effectively the
same as the ordinary "containment failure" case, so keeping the window open
is acceptable.
Fixes: 4491f71260 ("mm/memory-failure: set PageHWPoison before migrate_pages()") Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reported-by: Wanpeng Li <wanpeng.li@hotmail.com> Tested-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wanpeng Li [Tue, 8 Sep 2015 22:03:18 +0000 (15:03 -0700)]
mm/hwpoison: fix refcount of THP head page in no-injection case
Hwpoison injection takes a refcount on the target page and another refcount
on the head page of a THP if the target page is a tail page of that THP.
However, the current code doesn't release the refcount of the head page if
the THP is rejected for injection by the hwpoison filter.
Fix it by dropping the refcount of the head page if the target page is the
tail page of a THP and injection is not allowed for it.
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>