Joe Perches [Wed, 11 Sep 2013 21:23:37 +0000 (14:23 -0700)]
MAINTAINERS: ARM: spear: consolidate sections
Commit a7ed099ffc8e ("ARM: spear: move all files to mach-spear") moved
all the files into a single directory, delete the now unnecessary
duplicate sections and update the pattern.
Signed-off-by: Joe Perches <joe@perches.com> Cc: Arnd Bergmann <arnd@arndb.de> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Wed, 11 Sep 2013 21:23:32 +0000 (14:23 -0700)]
MAINTAINERS: EXYNOS: remove board files
Commit ca9143501c30 ("ARM: EXYNOS: Remove unused board files") removed
the files, remove the patterns too.
Signed-off-by: Joe Perches <joe@perches.com> Cc: Tomasz Figa <t.figa@samsung.com> Acked-by: Kyungmin Park <kyungmin.park@samsung.com> Cc: Kukjin Kim <kgene.kim@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Daney [Wed, 11 Sep 2013 21:23:29 +0000 (14:23 -0700)]
kernel/smp.c: quit unconditionally enabling irqs in on_each_cpu_mask().
As in commit f21afc25f9ed ("smp.h: Use local_irq_{save,restore}() in
!SMP version of on_each_cpu()"), we don't want to enable irqs if they
are not already enabled.
I don't know of any bugs currently caused by this unconditional
local_irq_enable(), but I want to use this function in MIPS/OCTEON early
boot (when we have early_boot_irqs_disabled). This also makes this
function have similar semantics to on_each_cpu() which is good in
itself.
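As a rough userspace illustration of why save/restore matters here, the sketch below models the interrupt-enable flag with a plain boolean. All names (model_irqs_enabled, model_irq_save() and so on) are hypothetical stand-ins, not kernel APIs; the point is only that the unconditional-enable pattern turns interrupts on for a caller that had them off, while the save/restore pattern does not.

```c
#include <stdbool.h>

/* Toy stand-in for the CPU interrupt-enable flag; not kernel code. */
static bool model_irqs_enabled;

/* Models local_irq_save(): remember the current state, then disable. */
static bool model_irq_save(void)
{
	bool was_enabled = model_irqs_enabled;

	model_irqs_enabled = false;
	return was_enabled;
}

/* Models local_irq_restore(): put back whatever state was saved. */
static void model_irq_restore(bool was_enabled)
{
	model_irqs_enabled = was_enabled;
}

/* Old pattern: disable, call, then enable unconditionally. */
static void call_then_enable(void (*func)(void))
{
	model_irqs_enabled = false;
	func();
	model_irqs_enabled = true;	/* wrong if the caller had irqs off */
}

/* New pattern: the caller's irq state survives the call. */
static void call_with_save_restore(void (*func)(void))
{
	bool flags = model_irq_save();

	func();
	model_irq_restore(flags);
}

/* Placeholder callback for the demonstration. */
static void model_noop(void)
{
}
```

Starting with the flag cleared (the early-boot case), call_then_enable(model_noop) leaves it set, whereas call_with_save_restore(model_noop) leaves it cleared.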
Signed-off-by: David Daney <david.daney@cavium.com> Cc: Gilad Ben-Yossef <gilad@benyossef.com> Cc: Christoph Lameter <cl@linux.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
syscalls.h: add forward declarations for inplace syscall wrappers
Unclutter -Wmissing-prototypes warning types (enabled at make W=1)
linux/include/linux/syscalls.h:190:18: warning: no previous prototype for 'SyS_semctl' [-Wmissing-prototypes]
asmlinkage long SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__)) \
^
linux/include/linux/syscalls.h:183:2: note: in expansion of macro '__SYSCALL_DEFINEx'
__SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
^
by adding forward declarations right before definitions.
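The technique can be shown with a much smaller macro than the real one. DEFINE_WRAPPER and wrapper_twice below are hypothetical names chosen for illustration; the sketch only assumes that the macro expands to a non-static function definition, which is what triggers -Wmissing-prototypes when no declaration precedes it.

```c
/*
 * Hypothetical wrapper macro, loosely modelled on the shape of a
 * syscall-defining macro: it emits a prototype immediately before the
 * definition, so -Wmissing-prototypes has nothing to report.
 */
#define DEFINE_WRAPPER(name)			\
	long wrapper_##name(long arg);		\
	long wrapper_##name(long arg)

/* Expands to a declaration followed by the definition below. */
DEFINE_WRAPPER(twice)
{
	return 2 * arg;
}
```

Removing the first line of the macro body should bring the warning back when compiling with -Wmissing-prototypes.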
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
At least on ARM no-MMU the extable is empty and so there is nothing to
sort. So add a check for the table to be empty which effectively only
changes that the misleading pr_notice is suppressed.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: David Daney <david.daney@cavium.com> Cc: "H. Peter Anvin" <hpa@linux.intel.com> Cc: Borislav Petkov <bp@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Daney [Wed, 11 Sep 2013 21:23:26 +0000 (14:23 -0700)]
smp.h: move !SMP version of on_each_cpu() out-of-line
All of the other non-trivial !SMP versions of functions in smp.h are
out-of-line in up.c. Move on_each_cpu() there as well.
This allows us to get rid of the #include <linux/irqflags.h>. The
drawback is that this makes both the x86_64 and i386 defconfig !SMP
kernels about 200 bytes larger each.
Signed-off-by: David Daney <david.daney@cavium.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Daney [Wed, 11 Sep 2013 21:23:25 +0000 (14:23 -0700)]
up.c: use local_irq_{save,restore}() in smp_call_function_single.
The SMP version of this function doesn't unconditionally enable irqs, so
neither should this !SMP version. There are no known problems caused by
this, but we make the change for consistency's sake.
Signed-off-by: David Daney <david.daney@cavium.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Daney [Wed, 11 Sep 2013 21:23:24 +0000 (14:23 -0700)]
smp: quit unconditionally enabling irq in on_each_cpu_mask and on_each_cpu_cond
As in commit f21afc25f9ed ("smp.h: Use local_irq_{save,restore}() in
!SMP version of on_each_cpu()"), we don't want to enable irqs if they
are not already enabled. There are currently no known problematical
callers of these functions, but since it is a known failure pattern, we
preemptively fix them.
Since they are not trivial functions, make them non-inline by moving
them to up.c. This also makes it so we don't have to fix #include
dependencies for preempt_{disable,enable}.
Signed-off-by: David Daney <david.daney@cavium.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Will Deacon [Wed, 11 Sep 2013 21:23:23 +0000 (14:23 -0700)]
kernel/spinlock.c: add default arch_*_relax definitions for GENERIC_LOCKBREAK
When running with GENERIC_LOCKBREAK=y, the locking implementations emit
calls to arch_{read,write,spin}_relax when spinning on a contended lock
in order to allow architectures to favour the CPU owning the lock if
possible.
In reality, everybody apart from PowerPC and S390 just does cpu_relax()
here, so make that the default behaviour and allow it to be overridden
if required.
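The "default unless overridden" arrangement is the usual #ifndef pattern. The sketch below is a userspace approximation: cpu_relax() is replaced by a counting stub, and spin_until_clear() is a hypothetical loop added only so the macro has a caller.

```c
/* Userspace stand-in for the kernel's cpu_relax(); counts its calls. */
static unsigned long relax_calls;

static void cpu_relax(void)
{
	relax_calls++;
}

/*
 * Default definitions: used unless an architecture header has already
 * defined its own arch_*_relax macro before this point.
 */
#ifndef arch_read_relax
# define arch_read_relax(l)	cpu_relax()
#endif
#ifndef arch_write_relax
# define arch_write_relax(l)	cpu_relax()
#endif
#ifndef arch_spin_relax
# define arch_spin_relax(l)	cpu_relax()
#endif

/* Hypothetical spin loop: back off until the flag clears or we give up. */
static int spin_until_clear(const volatile int *locked, int max_spins)
{
	int spins = 0;

	while (*locked && spins < max_spins) {
		arch_spin_relax(locked);
		spins++;
	}
	return spins;
}
```

An architecture that wants different behaviour would define arch_spin_relax (and friends) in its own header before this code is reached, and the defaults would then be skipped.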
Signed-off-by: Will Deacon <will.deacon@arm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
lto, watchdog/hpwdt.c: make assembler label global
We cannot assume that the inline assembler code always ends up in the same
file as the original C file. So make any assembler labels that C code
references via "extern" global.
Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Wim Van Sebroeck <wim@iguana.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The __put_user() calls in compat_ioctl.c, ptrace compat and signal compat
all appear in compat code, so we could probably expect the kernel
addresses not to be reachable in the lower 32-bit range; I think they
might not be exploitable.
For the "__get_user" cases, I don't think those are exploitable: the worst
that can happen is that the kernel will copy kernel memory into in-kernel
buffers, and will fail immediately afterward.
The alpha csum_partial_copy_from_user() seems to be missing the
access_ok() check entirely. The fix is inspired by x86. This could
lead to an information leak on alpha. I also noticed that many
architectures map csum_partial_copy_from_user() to
csum_partial_copy_generic(), but I wonder if the latter is performing the
access checks on every architecture.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jingoo Han [Wed, 11 Sep 2013 21:23:17 +0000 (14:23 -0700)]
drivers/firmware/google/gsmi.c: replace strict_strtoul() with kstrtoul()
The use of strict_strtoul() is not preferred, because strict_strtoul() is
obsolete. Thus, kstrtoul() should be used.
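kstrtoul() itself is only available inside the kernel, so the following is a userspace approximation of the same strict-parsing idea built on strtoul(). parse_ulong_strict() is a hypothetical helper, and its exact acceptance rules (one optional trailing newline, no leading whitespace, no minus sign) are assumptions made for this sketch rather than a statement of the kernel implementation.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Userspace approximation of a strict unsigned long parser.
 * Returns 0 and stores the value on success, -EINVAL for malformed
 * input, -ERANGE on overflow. Not the kernel implementation.
 */
static int parse_ulong_strict(const char *s, unsigned int base,
			      unsigned long *res)
{
	char *end;
	unsigned long val;

	/* strtoul() would accept leading whitespace and '-'; reject them. */
	if (*s == '\0' || *s == '-' || isspace((unsigned char)*s))
		return -EINVAL;

	errno = 0;
	val = strtoul(s, &end, (int)base);
	if (errno == ERANGE)
		return -ERANGE;
	if (end == s)
		return -EINVAL;		/* no digits consumed */
	/* Tolerate one trailing newline, as written by "echo". */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;		/* trailing garbage */

	*res = val;
	return 0;
}
```

The caller checks the return code rather than inspecting an end pointer, which is the main ergonomic difference from plain strtoul().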
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Cc: Matt Fleming <matt.fleming@intel.com> Cc: Tom Gundersen <teg@jklm.no> Cc: Mike Waychison <mikew@google.com> Acked-by: Mike Waychison <mikew@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
platform: convert apple-gmux driver to dev_pm_ops from legacy pm_ops
Convert drivers/platform/x86/apple-gmux to use dev_pm_ops instead of
legacy pm_ops. This patch depends on pnp driver bus ops change to invoke
pnp_driver dev_pm_ops.
Signed-off-by: Shuah Khan <shuah.kh@samsung.com> Cc: Matthew Garrett <matthew.garrett@nebula.com> Cc: Leonidas Da Silva Barbosa <leosilva@linux.vnet.ibm.com> Cc: Ashley Lai <ashley@ashleylai.com> Cc: Rajiv Andrade <mail@srajiv.net> Cc: Marcel Selhorst <tpmdd@selhorst.net> Cc: Sirrix AG <tpmdd@sirrix.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Rob Herring <rob.herring@calxeda.com> Cc: Peter Hüwe <PeterHuewe@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
tpm: convert tpm_tis driver to use dev_pm_ops from legacy pm_ops
Convert drivers/char/tpm/tpm_tis.c to use dev_pm_ops instead of legacy
pm_ops. This patch depends on pnp driver bus ops change to invoke
pnp_driver dev_pm_ops.
Signed-off-by: Shuah Khan <shuah.kh@samsung.com> Cc: Matthew Garrett <matthew.garrett@nebula.com> Cc: Leonidas Da Silva Barbosa <leosilva@linux.vnet.ibm.com> Cc: Ashley Lai <ashley@ashleylai.com> Cc: Rajiv Andrade <mail@srajiv.net> Cc: Marcel Selhorst <tpmdd@selhorst.net> Cc: Sirrix AG <tpmdd@sirrix.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Rob Herring <rob.herring@calxeda.com> Cc: Peter Hüwe <PeterHuewe@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
rtc: convert rtc-cmos to dev_pm_ops from legacy pm_ops
Convert drivers/rtc/rtc-cmos to use dev_pm_ops instead of legacy pm_ops.
This patch depends on pnp driver bus ops change to invoke pnp_driver
dev_pm_ops.
Signed-off-by: Shuah Khan <shuah.kh@samsung.com> Cc: Matthew Garrett <matthew.garrett@nebula.com> Cc: Leonidas Da Silva Barbosa <leosilva@linux.vnet.ibm.com> Cc: Ashley Lai <ashley@ashleylai.com> Cc: Rajiv Andrade <mail@srajiv.net> Cc: Marcel Selhorst <tpmdd@selhorst.net> Cc: Sirrix AG <tpmdd@sirrix.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Rob Herring <rob.herring@calxeda.com> Cc: Peter Hüwe <PeterHuewe@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pnp: change pnp bus pm_ops to invoke pnp driver dev_pm_ops if specified
pnp_bus_suspend() and pnp_bus_resume() invoke legacy pm_ops from
pnp_driver. Change pnp_bus_suspend() and pnp_bus_resume() to check whether
the pnp driver has dev_pm_ops and, if so, call them. If dev_pm_ops don't
exist, fall back to the legacy pm_ops. Without this change, pnp_driver
dev_pm_ops will not get called.
In addition to the pnp driver bus pm_ops change to invoke driver
dev_pm_ops, this patch set contains changes to rtc-cmos, tpm_tis, and
apple-gmux pnp drivers to convert from legacy pm_ops to dev_pm_ops.
This patch (of 4):
pnp_bus_suspend() and pnp_bus_resume() invoke legacy pm_ops from
pnp_driver. Change pnp_bus_suspend() and pnp_bus_resume() to check whether
the pnp driver has dev_pm_ops and, if so, call them. If dev_pm_ops don't
exist, fall back to the legacy pm_ops. Without this change, pnp_driver
dev_pm_ops will not get called.
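The dispatch order described here (new-style callbacks first, legacy callbacks as a fallback) can be sketched with a pair of simplified structures. The model_* types and the marker return values are hypothetical and exist only for this illustration; the real pnp_driver and dev_pm_ops structures carry many more fields.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
struct model_pm_ops {
	int (*suspend)(void *dev);
};

struct model_pnp_driver {
	const struct model_pm_ops *pm;		/* new-style ops, may be NULL */
	int (*legacy_suspend)(void *dev);	/* legacy hook, may be NULL */
};

/* Prefer the dev_pm_ops-style callback; fall back to the legacy one. */
static int model_bus_suspend(const struct model_pnp_driver *drv, void *dev)
{
	if (drv->pm && drv->pm->suspend)
		return drv->pm->suspend(dev);
	if (drv->legacy_suspend)
		return drv->legacy_suspend(dev);
	return 0;	/* the driver provides neither callback */
}

/* Marker callbacks: the return value tells which path was taken. */
static int new_style_suspend(void *dev)
{
	(void)dev;
	return 1;
}

static int old_style_suspend(void *dev)
{
	(void)dev;
	return 2;
}
```

A driver that sets both callbacks gets only the new-style one invoked, which is why converted drivers can drop their legacy hooks.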
Signed-off-by: Shuah Khan <shuah.kh@samsung.com> Cc: Matthew Garrett <matthew.garrett@nebula.com> Cc: Leonidas Da Silva Barbosa <leosilva@linux.vnet.ibm.com> Cc: Ashley Lai <ashley@ashleylai.com> Cc: Rajiv Andrade <mail@srajiv.net> Cc: Marcel Selhorst <tpmdd@selhorst.net> Cc: Sirrix AG <tpmdd@sirrix.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Rob Herring <rob.herring@calxeda.com> Cc: Peter Hüwe <PeterHuewe@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A memory cgroup with (1) multiple threshold notifications and (2) at least
one threshold >=2G was not reliable. Specifically the notifications would
either not fire or would not fire in the proper order.
The __mem_cgroup_threshold() signaling logic depends on keeping 64 bit
thresholds in sorted order. mem_cgroup_usage_register_event() sorts them
with compare_thresholds(), which returns the difference of two 64 bit
thresholds as an int. If the difference is positive but has bit[31] set,
then sort() treats the difference as negative and breaks sort order.
This fix compares the two arbitrary 64 bit thresholds returning the
classic -1, 0, 1 result.
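A minimal sketch of the two comparison styles, outside the kernel: compare_truncating() models the old behaviour by narrowing the 64-bit difference to int (the narrowing is implementation-defined in C; the usual two's-complement wraparound is assumed here), and compare_three_way() models the fixed behaviour. Both function names are hypothetical.

```c
#include <stdint.h>

/*
 * Old behaviour (modelled): the 64-bit difference is narrowed to int,
 * so a positive difference with bit 31 set comes out negative.
 */
static int compare_truncating(uint64_t a, uint64_t b)
{
	return (int)(a - b);
}

/* Fixed behaviour: compare the full 64-bit values, return -1, 0 or 1. */
static int compare_three_way(uint64_t a, uint64_t b)
{
	if (a > b)
		return 1;
	if (a < b)
		return -1;
	return 0;
}
```

For the thresholds 0x81001000 and 0x1000 the difference is 0x81000000: the truncating version reports the larger threshold as smaller, while the three-way version orders them correctly.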
The test below sets two notifications (at 0x1000 and 0x81001000):
cd /sys/fs/cgroup/memory
mkdir x
for x in 4096 2164264960; do
cgroup_event_listener x/memory.usage_in_bytes $x | sed "s/^/$x listener:/" &
done
echo $$ > x/cgroup.procs
anon_leaker 500M
v3.11-rc7 fails to signal the 4096 event listener:
Leaking...
Done leaking pages.
The fixed bug is old. It appears to date back to the introduction of
memcg threshold notifications in v2.6.34-rc1-116-g2e72b6347c94 "memcg:
implement memory thresholds"
Signed-off-by: Greg Thelen <gthelen@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Junxiao Bi [Wed, 11 Sep 2013 21:23:04 +0000 (14:23 -0700)]
writeback: fix race that cause writeback hung
There is a race between marking an inode dirty and the writeback thread;
see the following scenario. In this case, the writeback thread will not
run even though there is dirty_io.
    __mark_inode_dirty()                        bdi_writeback_workfn()
        ...                                         ...
        spin_lock(&inode->i_lock);
        ...
        if (bdi_cap_writeback_dirty(bdi)) {
            <<< assume wb has dirty_io, so wakeup_bdi is false.
            <<< the following inode_dirty also have wakeup_bdi false.
            if (!wb_has_dirty_io(&bdi->wb))
                wakeup_bdi = true;
        }
        spin_unlock(&inode->i_lock);
                                                    <<< assume last dirty_io is removed here.
                                                    pages_written = wb_do_writeback(wb);
                                                    ...
                                                    <<< work_list empty and wb has no dirty_io,
                                                    <<< delayed_work will not be queued.
                                                    if (!list_empty(&bdi->work_list) ||
                                                        (wb_has_dirty_io(wb) && dirty_writeback_interval))
                                                        queue_delayed_work(bdi_wq, &wb->dwork,
                                                            msecs_to_jiffies(dirty_writeback_interval * 10));
        spin_lock(&bdi->wb.list_lock);
        inode->dirtied_when = jiffies;
        <<< new dirty_io is added.
        list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
        spin_unlock(&bdi->wb.list_lock);
        <<< though there is dirty_io, but wakeup_bdi is false,
        <<< so writeback thread will not be waked up and
        <<< the new dirty_io will not be flushed.
        if (wakeup_bdi)
            bdi_wakeup_thread_delayed(bdi);
Writeback will not run until new flush work is queued. This may cause
a lot of dirty pages to stay in memory for a long time.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wanpeng Li [Wed, 11 Sep 2013 21:23:02 +0000 (14:23 -0700)]
mm/madvise.c: fix return value of madvise_hwpoison()
The return value outside the for loop is always zero, which means
madvise_hwpoison() returns success; however, this is not true when
soft_offline_page() returns a failure value.
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
if (madvise(mem, PAGES_TO_TEST * PAGE_SIZE, MADV_HWPOISON) == -1)
return -1;
munmap(mem, PAGES_TO_TEST * PAGE_SIZE);
return 0;
}
There is one page reference count for the default empty zero page, and
madvise_hwpoison adds another one via get_user_pages_fast. memory_hwpoison
reduces the page reference count by one since it's a non-LRU page.
unpoison_memory releases the last page reference count and frees the empty
zero page to the buddy system, which is not correct since the empty zero
page has the PG_reserved flag. This patch fixes it by not reducing the
page reference count below 1 for the empty zero page.
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wanpeng Li [Wed, 11 Sep 2013 21:22:59 +0000 (14:22 -0700)]
mm/hwpoison.c: fix held reference count after unpoisoning empty zero page
madvise hwpoison injection will poison the read-only empty zero page if
there is no write access before the poison. The empty zero page reference
count is increased for hwpoison; a subsequent poison of the zero page
returns directly since the page already has PG_hwpoison set, but the page
reference count is still increased by get_user_pages_fast. The unpoison
process will unpoison the empty zero page and decrease the reference count
successfully the first time; however, a subsequent unpoison of the empty
zero page returns directly since the page has already been unpoisoned,
without decreasing the page reference count of the empty zero page.
This patch fixes it by making madvise_hwpoison() put the page and return
immediately (without calling memory_failure() or soft_offline_page()) when
the page is already hwpoisoned.
Wanpeng Li [Wed, 11 Sep 2013 21:22:55 +0000 (14:22 -0700)]
mm/hwpoison: don't set migration type twice to avoid holding heavily contend zone->lock
Setting the pageblock migration type takes zone->lock, which is heavily
contended in the system, to avoid races. However, soft offline page sets
the pageblock migration type twice during get page if the page is in use,
is not a hugetlbfs page and is not on the lru list. It is unnecessary to
set the pageblock migration type and take the heavily contended zone->lock
again if the first round of get page has already set the pageblock to the
right migration type.
The trick here is that the migration type is MIGRATE_ISOLATE. There are
two other parts that can change MIGRATE_ISOLATE besides hwpoison. One is
memory hotplug; however, we hold lock_memory_hotplug(), which avoids the
race. The second is CMA, which unmovable page allocation requests can't
fall back to. So it's safe here.
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wanpeng Li [Wed, 11 Sep 2013 21:22:53 +0000 (14:22 -0700)]
mm/hwpoison: fix race against poison thp
There is a race between hwpoisoning a page and unpoisoning it.
memory_failure sets the page hwpoison and increases num_poisoned_pages
without holding the page lock, and one page count is accounted against the
thp for num_poisoned_pages. However, unpoison can occur before
memory_failure takes the page lock and splits the transparent hugepage;
unpoison then decreases num_poisoned_pages by 1 << compound_order since
memory_failure has not yet split the transparent hugepage with the page
lock held. That means we account one page for hwpoison and
1 << compound_order for unpoison. This patch fixes it by inserting a
PageTransHuge check before doing TestClearPageHWPoison, so that unpoison
fails without clearing PageHWPoison and decreasing num_poisoned_pages.
        A                                       B
    memory_failure
    TestSetPageHWPoison(p);
    if (PageHuge(p))
        nr_pages = 1 << compound_order(hpage);
    else
        nr_pages = 1;
    atomic_long_add(nr_pages, &num_poisoned_pages);
                                            unpoison_memory
                                            nr_pages = 1 << compound_trans_order(page);
                                            if (TestClearPageHWPoison(p))
                                                atomic_long_sub(nr_pages, &num_poisoned_pages);
    lock page
    if (!PageHWPoison(p))
        unlock page and return
    hwpoison_user_mappings
    if (PageTransHuge(hpage))
        split_huge_page(hpage);
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Suggested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wanpeng Li [Wed, 11 Sep 2013 21:22:52 +0000 (14:22 -0700)]
mm/hwpoison: don't need to hold compound lock for hugetlbfs page
The compound lock was introduced by commit e9da73d67 ("thp:
compound_lock."); it is used to serialize put_page against
__split_huge_page_refcount(). In addition, transparent hugepages will be
split in the hwpoison handler and just one subpage will be poisoned. It is
unnecessary to hold the compound lock for a hugetlbfs page. This patch
replaces compound_trans_order with compound_order in the places where the
page is a hugetlbfs page.
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wanpeng Li [Wed, 11 Sep 2013 21:22:50 +0000 (14:22 -0700)]
mm/hwpoison: fix loss of PG_dirty for errors on mlocked pages
memory_failure() stores the page flags of the error page before doing the
unmap, and (only) if the first check with the page flags at the time
decides the error page is unknown, it does a second check with the stored
page flags, since memory_failure() unmaps the error pages before calling
page_action(). This unmapping changes the page state; in particular,
page_remove_rmap() (called from try_to_unmap_one()) clears PG_mlocked, so
page_action() can't catch mlocked pages after that.
However, memory_failure() can't handle memory errors on dirty mlocked
pages correctly. try_to_unmap_one will move the dirty bit from the pte to
the physical page, and the second check loses it since it checks the
stored page flags. This patch fixes it by restoring the PG_dirty flag to
the stored page flags if the page is dirty.
hwpoison: always unset MIGRATE_ISOLATE before returning from soft_offline_page()
Soft offline code expects that MIGRATE_ISOLATE is set on the target page
only during soft offlining work. But currently it doesn't work as expected
when get_any_page() fails and returns a negative value. As a result, end
users can have unexpectedly isolated pages. This patch just fixes it.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The feature prevents mistrusted filesystems (i.e. FUSE mounts created by
unprivileged users) from growing a large number of dirty pages before
throttling. For such filesystems balance_dirty_pages always checks bdi
counters against bdi limits. I.e. even if global "nr_dirty" is under
"freerun", it's not allowed to skip bdi checks. The only use case for now
is fuse: it sets bdi max_ratio to 1% by default and system administrators
are supposed to expect that this limit won't be exceeded.
The feature is on if a BDI is marked by BDI_CAP_STRICTLIMIT flag. A
filesystem may set the flag when it initializes its BDI.
The problematic scenario comes from the fact that nobody pays attention to
the NR_WRITEBACK_TEMP counter (i.e. number of pages under fuse
writeback). The implementation of fuse writeback releases the original
page (by calling end_page_writeback) almost immediately. A fuse request
queued for real processing bears a copy of the original page. Hence, if
the userspace fuse daemon doesn't finalize write requests in a timely
manner, an aggressive mmap writer can pollute virtually all memory with
those temporary fuse page copies. They are carefully accounted in
NR_WRITEBACK_TEMP, but nobody cares.
To make further explanations shorter, let me use "NR_WRITEBACK_TEMP
problem" as a shortcut for "a possibility of uncontrolled growth of the
amount of RAM consumed by temporary pages allocated by kernel fuse to
process writeback".
The problem was very easy to reproduce. There is a trivial example
filesystem implementation in the fuse userspace distribution:
fusexmp_fh.c. I added "sleep(1);" to the write methods, then recompiled
and mounted it. Then I created a huge file on the mount point and ran a
simple program which mmap-ed the file to a memory region, then wrote data
to the region. An hour later I observed almost all RAM consumed by fuse
writeback. Since then some unrelated changes in kernel fuse have made it
more difficult to reproduce, but it is still possible now.
Putting this theoretical happens-in-the-lab thing aside, there is another
thing that really hurts real world (FUSE) users. This is the write-through
page cache policy FUSE currently uses. I.e. when handling write(2), kernel
fuse populates the page cache and flushes user data to the server
synchronously. This is excessively suboptimal. Pavel Emelyanov's patches
("writeback cache policy") solve the problem, but they also make resolving
the NR_WRITEBACK_TEMP problem absolutely necessary. Otherwise, simply
copying a huge file to a fuse mount would result in memory starvation.
Miklos, the maintainer of FUSE, believes the strictlimit feature is the
way to go.
And finally, putting FUSE topics aside, there is one more use case for
the strictlimit feature. Using a slow USB stick (mass storage) in a
machine with a huge amount of RAM installed is a well-known pain. Let's
make simple computations. Assuming 64GB of RAM installed, the existing
implementation of balance_dirty_pages will start throttling only after
9.6GB of RAM becomes dirty (freerun == 15% of total RAM). So, the command
"cp 9GB_file /media/my-usb-storage/" may return in a few seconds, but a
subsequent "umount /media/my-usb-storage/" will take more than two hours
if the effective throughput of the storage is, say, 1MB/sec.
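The arithmetic behind those figures can be checked with a few lines of integer math. The helper names below are hypothetical; sizes are in MiB and the inputs (64GB of RAM, a 15% freerun threshold, a 9GB file, 1MB/sec throughput) are the ones quoted in the text.

```c
/* Dirty-page threshold in MiB for a given RAM size and percentage. */
static unsigned long freerun_mib(unsigned long ram_mib, unsigned int percent)
{
	return ram_mib * percent / 100;
}

/* Seconds needed to flush dirty_mib at a sustained mib_per_sec rate. */
static unsigned long flush_seconds(unsigned long dirty_mib,
				   unsigned long mib_per_sec)
{
	return dirty_mib / mib_per_sec;
}
```

With these inputs, freerun_mib(64 * 1024, 15) yields 9830 MiB (about 9.6GB), and flush_seconds(9 * 1024, 1) yields 9216 seconds, a little over two and a half hours, which is consistent with the "more than two hours" estimate.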
After inclusion of the strictlimit feature, it will be trivial to add a
knob (e.g. /sys/devices/virtual/bdi/x:y/strictlimit) to enable it on
demand, manually or via a udev rule. Maybe I'm wrong, but it seems to be
quite a natural desire to limit the amount of dirty memory for some
devices we do not fully trust (in the sense of sustainable throughput).
[akpm@linux-foundation.org: fix warning in page-writeback.c] Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com> Cc: Jan Kara <jack@suse.cz> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chen Gang [Wed, 11 Sep 2013 21:22:44 +0000 (14:22 -0700)]
mm/backing-dev.c: check user buffer length before copying data to the related user buffer
'*lenp' may be less than "sizeof(kbuf)" so we must check this before the
next copy_to_user().
pdflush_proc_obsolete() is called by the sysctl whose 'procname' is
"nr_pdflush_threads"; if the user passes a buffer length less than
"sizeof(kbuf)", it will cause an issue.
Signed-off-by: Chen Gang <gang.chen@asianux.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
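The length check described in this commit amounts to clamping the copy to the smaller of the two buffer sizes. The sketch below is a userspace model only: memcpy() stands in for copy_to_user(), and copy_bounded() is a hypothetical helper, not the code from the patch.

```c
#include <stddef.h>
#include <string.h>

/*
 * Userspace model of a bounded copy-out. Copies at most *lenp bytes
 * from kbuf, stores the number of bytes actually copied back in *lenp
 * and returns it. memcpy() stands in for copy_to_user().
 */
static size_t copy_bounded(char *ubuf, size_t *lenp,
			   const char *kbuf, size_t kbuf_size)
{
	size_t n = *lenp < kbuf_size ? *lenp : kbuf_size;

	memcpy(ubuf, kbuf, n);
	*lenp = n;
	return n;
}
```

With a one-byte destination and a three-byte source, only the first byte is copied and the reported length is 1, so the destination is never overrun.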
Wanpeng Li [Wed, 11 Sep 2013 21:22:38 +0000 (14:22 -0700)]
mm/sparse: introduce alloc_usemap_and_memmap
After commit 9bdac9142407 ("sparsemem: Put mem map for one node
together."), vmemmap for one node will be allocated together; its logic
is similar to the memory allocation for pageblock flags. This patch
introduces alloc_usemap_and_memmap to extract the common logic of memory
allocation for pageblock flags and vmemmap.
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lisa Du [Wed, 11 Sep 2013 21:22:36 +0000 (14:22 -0700)]
mm: vmscan: fix do_try_to_free_pages() livelock
This patch is based on KOSAKI's work and I add a little more description;
please refer to https://lkml.org/lkml/2012/6/14/74.
Currently, I found the system can enter a state in which there are lots of
free pages in a zone but only order-0 and order-1 pages, which means the
zone is heavily fragmented; a high-order allocation can then cause a long
stall in the direct reclaim path (e.g. 60 seconds), especially in an
environment with no swap and no compaction. This problem happened on v3.4,
but it seems the issue still lives in the current tree. The reason is that
do_try_to_free_pages enters a livelock:
kswapd will go to sleep if the zones have been fully scanned and are still
not balanced, as kswapd thinks there's little point in trying all over
again and wants to avoid an infinite loop. Instead it changes order from
high-order to 0-order because kswapd thinks order-0 is the most important.
Look at 73ce02e9 in detail. If watermarks are ok, kswapd will go back to
sleep and may leave zone->all_unreclaimable = 0. It assumes high-order
users can still perform direct reclaim if they wish.
Direct reclaim continues to reclaim for a high order which is not a
COSTLY_ORDER without invoking the oom-killer until kswapd turns on
zone->all_unreclaimable. This is to avoid a too-early oom-kill.
So direct reclaim depends on kswapd to break this loop.
In the worst case, direct reclaim may continue to reclaim pages forever
while kswapd sleeps forever, until something like a watchdog detects it
and finally kills the process. As described in:
http://thread.gmane.org/gmane.linux.kernel.mm/103737
We can't turn on zone->all_unreclaimable from the direct reclaim path
because the direct reclaim path doesn't take any lock, so doing it there
would be racy. Thus this patch removes the zone->all_unreclaimable field
completely and recalculates the zone reclaimable state every time.
Note: we can't have direct reclaim look at zone->pages_scanned directly
while kswapd continues to use zone->all_unreclaimable, because that is
racy. Commit 929bea7c71 (vmscan: all_unreclaimable() use
zone->all_unreclaimable as a name) describes the details.
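Recalculating the reclaimable state on demand, instead of caching it in a flag, can be sketched as a pure function of the zone's counters. Everything below is a hypothetical model: the struct is trimmed to two fields, and the threshold factor is an assumption chosen for illustration rather than a value taken from this text.

```c
#include <stdbool.h>

/* Hypothetical, trimmed-down zone: only the counters the check needs. */
struct model_zone {
	unsigned long pages_scanned;
	unsigned long reclaimable_pages;
};

/* Assumed threshold factor; treat it as illustrative only. */
#define MODEL_SCAN_FACTOR 6UL

/*
 * Derived on every call from the current counters, so there is no
 * cached flag that one reclaim path can set behind another's back.
 */
static bool model_zone_reclaimable(const struct model_zone *zone)
{
	return zone->pages_scanned <
	       zone->reclaimable_pages * MODEL_SCAN_FACTOR;
}
```

Because the result depends only on values read at call time, kswapd and direct reclaim can both evaluate it without coordinating over a shared flag.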
[akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()] Cc: Aaditya Kumar <aaditya.kumar.30@gmail.com> Cc: Ying Han <yinghan@google.com> Cc: Nick Piggin <npiggin@gmail.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux.com> Cc: Bob Liu <lliubbo@gmail.com> Cc: Neil Zhang <zhangwm@marvell.com> Cc: Russell King - ARM Linux <linux@arm.linux.org.uk> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Lisa Du <cldu@marvell.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: munlock: manual pte walk in fast path instead of follow_page_mask()
Currently munlock_vma_pages_range() calls follow_page_mask() to obtain
each individual struct page. This entails repeated full page table
translations, with the page table lock taken for each page separately.
This patch avoids the costly follow_page_mask() where possible, by
iterating over ptes within a single pmd under a single page table lock.
The first pte is obtained by get_locked_pte() for the non-THP page acquired
by the initial follow_page_mask(). The rest of the on-stack pagevec for munlock
is filled up using pte_walk as long as pte_present() and vm_normal_page()
are sufficient to obtain the struct page.
After this patch, a 14% speedup was measured for munlocking a 56GB large
memory area with THP disabled.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jörn Engel <joern@logfs.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: munlock: remove redundant get_page/put_page pair on the fast path
The performance of the fast path in munlock_vma_range() can be further
improved by avoiding atomic ops of a redundant get_page()/put_page() pair.
When calling get_page() during page isolation, we already have the pin
from follow_page_mask(). This pin will be then returned by
__pagevec_lru_add(), after which we do not reference the pages anymore.
After this patch, an 8% speedup was measured for munlocking a 56GB large
memory area with THP disabled.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Jörn Engel <joern@logfs.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: munlock: bypass per-cpu pvec for putback_lru_page
After introducing batching by pagevecs into munlock_vma_range(), we can
further improve performance by bypassing the copying into per-cpu pagevec
and the get_page/put_page pair associated with that. Instead we perform
LRU putback directly from our pagevec. However, this is possible only for
single-mapped pages that are evictable after munlock. Unevictable pages
require rechecking after putting on the unevictable list, so for those we
fall back to putback_lru_page(), which handles that.
After this patch, a 13% speedup was measured for munlocking a 56GB large
memory area with THP disabled.
[akpm@linux-foundation.org:clarify comment] Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Jörn Engel <joern@logfs.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Building on the previous patch, which introduced batched isolation in
munlock_vma_range(), we can also batch the updates of NR_MLOCK page stats.
After the whole pagevec is processed for page isolation, the stats are
updated only once with the number of successful isolations. There were,
however, no measurable performance gains.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Jörn Engel <joern@logfs.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: munlock: batch non-THP page isolation and munlock+putback using pagevec
Currently, munlock_vma_range() calls munlock_vma_page on each page in a
loop, which results in repeated taking and releasing of the lru_lock
spinlock for isolating pages one by one. This patch batches the munlock
operations using an on-stack pagevec, so that isolation is done under
single lru_lock. For THP pages, the old behavior is preserved as they
might be split while putting them into the pagevec. After this patch, a
9% speedup was measured for munlocking a 56GB large memory area with THP
disabled.
A new function __munlock_pagevec() is introduced that takes a pagevec and:
1) It clears PageMlocked and isolates all pages under lru_lock. Zone page
   stats can be also updated using the variant which assumes disabled
   interrupts.
2) It finishes the munlock and lru putback on all pages under their
   lock_page. Note that previously, lock_page covered also the PageMlocked
   clearing and page isolation, but it is not needed for those operations.
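The effect of batching on lock traffic can be illustrated with a small
user-space sketch. Nothing here is kernel code: BATCH_SIZE and both function
names are hypothetical, and the "lock" is only a counter. It shows that
isolating pages in batches takes the lock once per batch instead of once per
page.

```c
#include <stddef.h>

/* Hypothetical batch capacity; the kernel's pagevec has its own fixed size. */
#define BATCH_SIZE 14

/* One lock/unlock cycle per page, as in the unbatched loop. */
size_t lock_acquisitions_per_page(size_t nr_pages)
{
	size_t locks = 0;

	for (size_t i = 0; i < nr_pages; i++)
		locks++;	/* lock; isolate one page; unlock */
	return locks;
}

/* One lock/unlock cycle per full or partial batch. */
size_t lock_acquisitions_batched(size_t nr_pages)
{
	size_t locks = 0;
	size_t in_batch = 0;

	for (size_t i = 0; i < nr_pages; i++) {
		if (++in_batch == BATCH_SIZE) {
			locks++;	/* lock; isolate the whole batch; unlock */
			in_batch = 0;
		}
	}
	if (in_batch)
		locks++;	/* flush the final partial batch */
	return locks;
}
```

With these assumed numbers, 100 pages need 100 lock cycles unbatched and 8
when batched.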
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Jörn Engel <joern@logfs.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: munlock: remove unnecessary call to lru_add_drain()
In munlock_vma_range(), lru_add_drain() is currently called in a loop
before each munlock_vma_page() call.
This is suboptimal for performance when munlocking many pages. The
benefits of per-cpu pagevec for batching the LRU putback are removed since
the pagevec only holds at most one page from the previous loop's
iteration.
The lru_add_drain() call also does not serve any purpose for correctness
- it does not even drain pagevecs of all CPUs. The munlock code already
expects and handles situations where a page cannot be isolated from the
LRU (e.g. because it is on some per-cpu pagevec).
The history of the (not commented) call also suggests that it is there
as an oversight rather than by intention. Before commit ff6a6da6 ("mm:
accelerate munlock() treatment of THP pages") the call happened only once
upon entering the function. The commit moved the call into the while
loop. So while the other changes in the commit improved munlock
performance for THP pages, it introduced the abovementioned suboptimal
per-cpu pagevec usage.
Further in history, before commit 408e82b7 ("mm: munlock use
follow_page"), munlock_vma_pages_range() was just a wrapper around
__mlock_vma_pages_range which performed both mlock and munlock depending
on a flag. However, before ba470de4 ("mmap: handle mlocked pages during
map, remap, unmap") the function handled only mlock, not munlock. The
lru_add_drain call thus comes from the implementation in commit b291f000
("mlock: mlocked pages are unevictable") and was intended only for
mlocking, not munlocking. The original intention of draining the LRU
pagevec at mlock time was to ensure the pages were on the LRU before the
lock operation so that they could be placed on the unevictable list
immediately. There is very little motivation to do the same in the
munlock path, particularly for every single page.
This patch therefore removes the call completely. After removing the
call, a 10% speedup was measured for munlock() of a 56GB large memory area
with THP disabled.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Jörn Engel <joern@logfs.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: putback_lru_page: remove unnecessary call to page_lru_base_type()
The goal of this patch series is to improve performance of munlock() of
large mlocked memory areas on systems without THP. This is motivated by
reported very long times of crash recovery of processes with such areas,
where munlock() can take several seconds. See
http://lwn.net/Articles/548108/
The work was driven by a simple benchmark (to be included in mmtests) that
mmaps() e.g. 56GB with MAP_LOCKED | MAP_POPULATE and measures the time of
munlock(). Profiling was performed by attaching operf --pid to the
process and sending a signal to trigger the munlock() part and then notify
back the monitoring wrapper to stop operf, so that only munlock() appears
in the profile.
The profiles have shown that CPU time is spent mostly on atomic operations
and on locking repeated for each individual page. This series aims to reduce
both, starting from simpler to more complex changes.
Patch 1 performs a simple cleanup in putback_lru_page() so that page lru base
type is not determined without being actually needed.
Patch 2 removes an unnecessary call to lru_add_drain() which drains the per-cpu
pagevec after each munlocked page is put there.
Patch 3 changes munlock_vma_range() to use an on-stack pagevec for isolating
multiple non-THP pages under a single lru_lock instead of locking and
processing each page separately.
Patch 4 changes the NR_MLOCK accounting to be called only once per the pvec
introduced by previous patch.
Patch 5 uses the introduced pagevec to batch also the work of putback_lru_page
when possible, bypassing the per-cpu pvec and associated overhead.
Patch 6 removes a redundant get_page/put_page pair which saves costly atomic
operations.
Patch 7 avoids calling follow_page_mask() on each individual page, and obtains
multiple page references under a single page table lock where possible.
Measurements were made using 3.11-rc3 as a baseline. The first set of
measurements shows the possibly ideal conditions where batching should
help the most. All memory is allocated from a single NUMA node and THP is
disabled.
The second set of measurements simulates the worst possible conditions for
batching by using numactl --interleave, so that there is in fact only one
page per pagevec. Even in this case the series seems to improve
performance thanks to reduced atomic operations and removal of
lru_add_drain().
For completeness, a third set of measurements shows the situation where
THP is enabled and allocations are again done on a single NUMA node. Here
munlock() is already very fast thanks to huge pages, and this series does
not compromise that performance. It seems that the removal of call to
lru_add_drain() still helps a bit.
In putback_lru_page() since commit c53954a092 ("mm: remove lru parameter
from __lru_cache_add and lru_cache_add_lru") it is no longer needed to
determine lru list via page_lru_base_type().
This patch replaces it with a simple flag is_unevictable which says that the
page was put on the unevictable list. This is the only information that
matters in subsequent tests.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Jörn Engel <joern@logfs.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel reported that if a vma area gets unmapped and then mapped (or
expanded) in-place, the soft dirty tracker won't be able to recognize this
situation since it works on the pte level and ptes get zapped on unmap,
losing the soft dirty bit of course.
So to resolve this situation we need to track actions on the vma level; this
is where the VM_SOFTDIRTY flag comes in. When a new vma area is created (or
an old one expanded) we set this bit, and keep it there until the
application calls for clearing the soft dirty bit.
Thus when a user space application tracks memory changes, it can now detect
if a vma area is renewed.
Reported-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Matt Mackall <mpm@selenic.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Rob Landley <rob@landley.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Wed, 11 Sep 2013 21:22:22 +0000 (14:22 -0700)]
writeback: fix occasional slow sync(1)
When the system contains no dirty pages, wakeup_flusher_threads() will
submit WB_SYNC_NONE writeback for 0 pages so wb_writeback() exits
immediately without doing anything, even though there are dirty inodes in
the system. Thus sync(1) will write all the dirty inodes from a
WB_SYNC_ALL writeback pass which is slow.
Fix the problem by using get_nr_dirty_pages() in wakeup_flusher_threads()
instead of calculating the number of dirty pages manually. That function
also takes the number of dirty inodes into account.
Signed-off-by: Jan Kara <jack@suse.cz> Reported-by: Paul Taysom <taysom@chromium.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Khalid Aziz [Wed, 11 Sep 2013 21:22:20 +0000 (14:22 -0700)]
mm: fix aio performance regression for database caused by THP
I am working with a tool that simulates oracle database I/O workload.
This tool (orion to be specific -
<http://docs.oracle.com/cd/E11882_01/server.112/e16638/iodesign.htm#autoId24>)
allocates hugetlbfs pages using shmget() with SHM_HUGETLB flag. It then
does aio into these pages from flash disks using various common block
sizes used by database. I am looking at performance with two of the most
common block sizes - 1M and 64K. aio performance with these two block
sizes plunged after Transparent HugePages was introduced in the kernel.
Here are performance numbers:
I have narrowed the performance impact down to the overheads introduced by
THP in __get_page_tail() and put_compound_page() routines. perf top shows
>40% of cycles being spent in these two routines. Every time direct I/O
to hugetlbfs pages starts, kernel calls get_page() to grab a reference to
the pages and calls put_page() when I/O completes to put the reference
away. THP introduced significant amount of locking overhead to get_page()
and put_page() when dealing with compound pages because hugepages can be
split underneath get_page() and put_page(). It added this overhead
irrespective of whether it is dealing with hugetlbfs pages or transparent
hugepages. This resulted in 20%-45% drop in aio performance when using
hugetlbfs pages.
Since hugetlbfs pages can not be split, there is no reason to go through
all the locking overhead for these pages from what I can see. I added
code to __get_page_tail() and put_compound_page() to bypass all the
locking code when working with hugetlbfs pages. This improved performance
significantly. Performance numbers with this patch:
Performance with 64K read is still lower than what it was before THP, but
still a 53% improvement. It does mean there is more work to be done but I
will take a 53% improvement for now.
Please take a look at the following patch and let me know if it looks
reasonable.
[akpm@linux-foundation.org: tweak comments] Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com> Cc: Pravin B Shelar <pshelar@nicira.com> Cc: Christoph Lameter <cl@linux.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Andi Kleen <andi@firstfloor.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If kswapd was reclaiming for a high order and resets it to 0 due to
fragmentation it will still call compact_pgdat. For the most part, this
will fail a compaction_suitable() test and not compact but it is
unnecessarily sloppy. It could be fixed in the caller but fix it in the
API instead.
[dhillf@gmail.com: pointed out that it was a potential problem] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Hillf Danton <dhillf@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kmemcg: don't allocate extra memory for root memcg_cache_params
The memcg_cache_params structure contains the common part and the union,
which represents two different types of data: one for root caches and
another for child caches.
The size of child data is fixed. The size of the memcg_caches array is
calculated at runtime.
Currently the size of memcg_cache_params for root caches is calculated
incorrectly, because it includes the size of parameters for child caches.
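The size problem can be sketched in user space. The struct below is a
hypothetical, simplified stand-in for memcg_cache_params (its member names and
both function names are assumptions, not the kernel definitions). It contrasts
a size that includes the child part of the union with one that counts only the
common part plus the runtime-sized array.

```c
#include <stddef.h>

/* Hypothetical, simplified layout: a common part followed by a union. */
struct cache_params {
	int is_root_cache;			/* common part */
	union {
		void *memcg_caches[1];		/* root: array sized at runtime */
		struct {			/* child: fixed-size data */
			void *memcg;
			void *root_cache;
			long nr_pages;
			int dead;
		} child;
	} u;
};

/* Oversized: the whole struct (child data included) plus the array. */
size_t root_params_size_oversized(size_t nr_caches)
{
	return sizeof(struct cache_params) + nr_caches * sizeof(void *);
}

/* Tight: only the common part plus the array. */
size_t root_params_size_tight(size_t nr_caches)
{
	return offsetof(struct cache_params, u) + nr_caches * sizeof(void *);
}
```

The difference between the two results is the space a root cache would
allocate for child data it never uses.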
Yinghai Lu [Wed, 11 Sep 2013 21:22:17 +0000 (14:22 -0700)]
memblock, numa: binary search node id
The current early_pfn_to_nid() on architectures that support memblock walks
memblock.memory one entry at a time, so it takes too many tries for pfns
near the end.
We can use the existing memblock_search to find the node id for a given
pfn, which could save some time on bigger systems that have many entries
in the memblock.memory array.
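The lookup idea can be sketched as a plain binary search over sorted,
non-overlapping regions. This is an illustrative user-space sketch, not the
kernel's memblock code: struct region and region_search_nid() are hypothetical
names.

```c
#include <stddef.h>

/* Hypothetical region descriptor: [base, base + size) belongs to node nid. */
struct region {
	unsigned long base;
	unsigned long size;
	int nid;
};

/*
 * Binary search over regions sorted by base and non-overlapping.
 * Returns the node id of the region containing addr, or -1 if none does.
 */
int region_search_nid(const struct region *regions, size_t count,
		      unsigned long addr)
{
	size_t left = 0, right = count;

	while (left < right) {
		size_t mid = left + (right - left) / 2;

		if (addr < regions[mid].base)
			right = mid;		/* addr lies below this region */
		else if (addr - regions[mid].base >= regions[mid].size)
			left = mid + 1;		/* addr lies above this region */
		else
			return regions[mid].nid;
	}
	return -1;
}
```

A linear walk needs up to count comparisons for an address near the end of
the array, while this search needs about log2(count).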
Here are the timing differences for several machines. In each case with
the patch less time was spent in __early_pfn_to_nid().
new_vma_page() is called only by page migration called from do_mbind(),
where pages to be migrated are queued into a pagelist by
queue_pages_range(). queue_pages_range() confirms that a queued page
belongs to some vma, so the !vma case is not supposed to happen. This
patch adds BUG_ON() to catch this unexpected case.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/mempolicy: rename check_*range to queue_pages_*range
The function check_range() (and its family) is not well-named, because it
does not only check something, but also moves pages from list to list to
do page migration for them. So queue_pages_*range is a more desirable name.
mm: prepare to remove /proc/sys/vm/hugepages_treat_as_movable
Now hugepage migration is enabled, although restricted on pmd-based
hugepages for now (due to lack of testing.) So we should allocate
migratable hugepages from ZONE_MOVABLE if possible.
This patch makes GFP flags in hugepage allocation dependent on migration
support, not only the value of hugepages_treat_as_movable. It provides no
change on the behavior for architectures which do not support hugepage
migration,
mm: migrate: check movability of hugepage in unmap_and_move_huge_page()
Currently hugepage migration works well only for pmd-based hugepages
(mainly due to lack of testing,) so we had better not enable migration of
other levels of hugepages until we are ready for it.
Some users of hugepage migration (mbind, move_pages, and migrate_pages) do
page table walk and check pud/pmd_huge() there, so they are safe. But the
other users (softoffline and memory hotremove) don't do this, so without
this patch they can try to migrate unexpected types of hugepages.
To prevent this, we introduce hugepage_migration_support() as an
architecture dependent check of whether hugepages are implemented on a pmd
basis or not. And on some architectures multiple sizes of hugepages are
available, so hugepage_migration_support() also checks hugepage size.
mm: memory-hotplug: enable memory hotplug to handle hugepage
Until now we can't offline memory blocks which contain hugepages because a
hugepage is considered as an unmovable page. But now with this patch
series, a hugepage has become movable, so by using hugepage migration we
can offline such memory blocks.
What's different from other users of hugepage migration is that we need to
decompose all the hugepages inside the target memory block into free buddy
pages after hugepage migration, because otherwise free hugepages remaining
in the memory block interfere with the memory offlining. For this reason we
introduce new functions dissolve_free_huge_page() and
dissolve_free_huge_pages().
Other than that, this patch straightforwardly adds hugepage migration code,
that is, it adds hugepage code to the functions which scan over pfns and
collect hugepages to be migrated, and adds a hugepage allocation function
to alloc_migrate_target().
As for larger hugepages (1GB for x86_64), it's not easy to do hotremove
over them because they are larger than a memory block. So we now simply
leave it to fail as it is.
Extend do_mbind() to handle vma with VM_HUGETLB set. We will be able to
migrate hugepage with mbind(2) after applying the enablement patch which
comes later in this series.
mm: migrate: add hugepage migration code to move_pages()
Extend move_pages() to handle vma with VM_HUGETLB set. We will be able to
migrate hugepage with move_pages(2) after applying the enablement patch
which comes later in this series.
We avoid getting refcount on tail pages of hugepage, because unlike thp,
hugepage is not split and we need not care about races with splitting.
And migration of larger (1GB for x86_64) hugepages is not enabled.
migrate: add hugepage migration code to migrate_pages()
Extend check_range() to handle vma with VM_HUGETLB set. We will be able
to migrate hugepage with migrate_pages(2) after applying the enablement
patch which comes later in this series.
Note that for larger hugepages (covered by pud entries, 1GB for x86_64 for
example), we simply skip it now.
Note that using pmd_huge/pud_huge assumes that hugepages are pointed to by
pmd/pud. This is not true in some architectures implementing hugepage
with other mechanisms like ia64, but it's OK because pmd_huge/pud_huge
simply return 0 in such arch and page walker simply ignores such
hugepages.
mm: soft-offline: use migrate_pages() instead of migrate_huge_page()
Currently migrate_huge_page() takes a pointer to a hugepage to be migrated
as an argument, instead of taking a pointer to the list of hugepages to be
migrated. This behavior was introduced in commit 189ebff28 ("hugetlb:
simplify migrate_huge_page()"), and was OK because until now hugepage
migration is enabled only for soft-offlining which migrates only one
hugepage in a single call.
But the situation will change in the later patches in this series which
enable other users of page migration to support hugepage migration. They
can kick migration for both of normal pages and hugepages in a single
call, so we need to go back to original implementation which uses linked
lists to collect the hugepages to be migrated.
With this patch, soft_offline_huge_page() switches to use migrate_pages(),
and migrate_huge_page() is not used any more. So let's remove it.
mm: migrate: make core migration code aware of hugepage
Currently hugepage migration is available only for soft offlining, but
it's also useful for some other users of page migration (clearly because
users of hugepage can enjoy the benefit of mempolicy and memory hotplug.)
So this patchset tries to extend such users to support hugepage migration.
The target of this patchset is to enable hugepage migration for NUMA
related system calls (migrate_pages(2), move_pages(2), and mbind(2)), and
memory hotplug.
This patchset does not add hugepage migration for memory compaction,
because users of memory compaction mainly expect to construct thp by
arranging raw pages, and there's little or no need to compact hugepages.
CMA, another user of page migration, can have benefit from hugepage
migration, but is not enabled to support it for now (just because of lack
of testing and expertise in CMA.)
Hugepage migration of non pmd-based hugepage (for example 1GB hugepage in
x86_64, or hugepages in architectures like ia64) is not enabled for now
(again, because of lack of testing.)
As for how these are achieved, I extended the API (migrate_pages()) to
handle hugepage (with patch 1 and 2) and adjusted code of each caller to
check and collect movable hugepages (with patch 3-7). Remaining 2 patches
are kind of miscellaneous ones to avoid unexpected behavior. Patch 8 is
about making sure that we only migrate pmd-based hugepages. And patch 9
is about choosing appropriate zone for hugepage allocation.
My testing is mainly functional, simply kicking hugepage migration via
each entry point and confirming that migration is done correctly. Test code
is available here:
And I always run libhugetlbfs test when changing hugetlbfs's code. With
this patchset, no regression was found in the test.
This patch (of 9):
Before enabling each user of page migration to support hugepage,
this patch enables the list of pages for migration to link not only
LRU pages, but also hugepages. As a result, putback_movable_pages()
and migrate_pages() can handle both of LRU pages and hugepages.
Joonsoo Kim [Wed, 11 Sep 2013 21:21:58 +0000 (14:21 -0700)]
mm, hugetlb: return a reserved page to a reserved pool if failed
If we fail with a reserved page, just calling put_page() is not
sufficient, because put_page() invokes free_huge_page() as its last step and
it doesn't know whether a page comes from a reserved pool or not. So it
doesn't do anything related to the reserved count. This makes the reserve
count lower than it should be, because the reserve count was already
decreased in dequeue_huge_page_vma(). This patch fixes this situation.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Wed, 11 Sep 2013 21:21:54 +0000 (14:21 -0700)]
mm, hugetlb: fix subpool accounting handling
If we allocate a hugepage with avoid_reserve, we don't dequeue a reserved
one. So, we should check the subpool counter when avoid_reserve is set. This
patch implements it.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Wed, 11 Sep 2013 21:21:53 +0000 (14:21 -0700)]
mm, hugetlb: change variable name reservations to resv
'reservations' is too long a name for a variable, and we use 'resv_map' to
represent 'struct resv_map' in other places. To reduce confusion and
improve readability, change it.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Davidlohr Bueso <davidlohr@hp.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Wed, 11 Sep 2013 21:21:51 +0000 (14:21 -0700)]
mm, hugetlb: protect reserved pages when soft offlining a hugepage
Don't use the reserve pool when soft offlining a hugepage. Check we have
free pages outside the reserve pool before we dequeue the huge page.
Otherwise, we can steal someone else's reserve page.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Davidlohr Bueso <davidlohr@hp.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/hotplug: remove stop_machine() from try_offline_node()
lock_device_hotplug() serializes hotplug & online/offline operations. The
lock is held in common sysfs online/offline interfaces and ACPI hotplug
code paths.
And here are the code paths:
 - CPU & Mem online/offline via sysfs online
	store_online()->lock_device_hotplug()
 - Mem online via sysfs state:
	store_mem_state()->lock_device_hotplug()
 - ACPI CPU & Mem hot-add:
	acpi_scan_bus_device_check()->lock_device_hotplug()
 - ACPI CPU & Mem hot-delete:
	acpi_scan_hot_remove()->lock_device_hotplug()
try_offline_node() off-lines a node if all memory sections and cpus are
removed on the node. It is called from acpi_processor_remove() and
acpi_memory_remove_memory()->remove_memory() paths, both of which are in
the ACPI hotplug code.
try_offline_node() calls stop_machine() to stop all cpus while checking
all cpu status with the assumption that the caller is not protected from
CPU hotplug or CPU online/offline operations. However, the caller is
always serialized with lock_device_hotplug(). Also, the code needs to be
properly serialized with a lock, not by stopping all cpus at a random
place with stop_machine().
This patch removes the use of stop_machine() in try_offline_node() and
adds comments to try_offline_node() and remove_memory() that
lock_device_hotplug() is required.
Signed-off-by: Toshi Kani <toshi.kani@hp.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
add_memory() and remove_memory() can only handle a memory range aligned
with section. There are problems when an unaligned range is added and
then deleted as follows:
- add_memory() with an unaligned range succeeds, but __add_pages()
called from add_memory() adds a whole section of pages even though
a given memory range is less than the section size.
- remove_memory() to the added unaligned range hits BUG_ON() in
__remove_pages().
This patch changes add_memory() and remove_memory() to check up front
whether a given memory range is aligned with section. As a result,
add_memory() fails with -EINVAL when a given range is unaligned, and does
not add such a memory range. This prevents remove_memory() from being
called with an unaligned range as well. Note that remove_memory() has to
use BUG_ON() since this function cannot fail.
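The alignment test itself is simple mask arithmetic. The following user-space
sketch assumes a hypothetical 128 MiB section size and a hypothetical helper
name, check_hotplug_range(); the real section size is architecture-specific.
Rejecting an empty range is a choice of this sketch.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical section size (128 MiB); the real value depends on the arch. */
#define SECTION_SIZE (UINT64_C(1) << 27)

/* Returns 0 if [start, start + size) is section-aligned, -EINVAL otherwise. */
int check_hotplug_range(uint64_t start, uint64_t size)
{
	/* An empty range is rejected as well (assumption of this sketch). */
	if (size == 0)
		return -EINVAL;
	/* Both the start and the length must be multiples of the section size. */
	if ((start & (SECTION_SIZE - 1)) || (size & (SECTION_SIZE - 1)))
		return -EINVAL;
	return 0;
}
```

An add path would return this error to its caller, while a remove path that
cannot fail would have to assert on it instead.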
Explicitly mention/recommend using the libhugetlbfs test cases when
changing related kernel code. Developers that are unaware of the project
can easily miss this and introduce potential regressions that may or may
not be caught by community review.
Also do some cleanups that make the document visually easier to view at a
first glance.
readahead: make context readahead more conservative
This helps performance on moderately dense random reads on SSD.
Transaction-Per-Second numbers provided by Taobao:
               QPS     case
               -------------------------------------------------------
               7536    disable context readahead totally
    w/ patch:  7129    slower size rampup and start RA on the 3rd read
               6717    slower size rampup
    w/o patch: 5581    unmodified context readahead
Before, readahead would be started whenever reading page N+1 when page N
happened to have been read recently. After the patch, we'll only start
readahead when *three* random reads happen to access pages N, N+1, N+2. The
probability of this happening is extremely low for pure random reads, unless
they are very dense, which actually deserves some readahead.
Also start with a smaller readahead window. The impact on interleaved
sequential reads should be small, because for a long run stream, the
small readahead window rampup phase is negligible.
The context readahead actually benefits clustered random reads on HDD
whose seek cost is pretty high. However as SSD is increasingly used for
random read workloads it's better for the context readahead to concentrate
on interleaved sequential reads.
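The change in the trigger rule can be sketched in user space. This is only an
illustration under stated assumptions: the page cache is modelled as a bool
array, and history_pages(), start_readahead_old() and start_readahead_new()
are hypothetical names. The real readahead code also considers request and
window sizes, which are left out here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Count consecutive cached pages immediately before index. */
size_t history_pages(const bool *cached, size_t index)
{
	size_t n = 0;

	while (n < index && cached[index - 1 - n])
		n++;
	return n;
}

/* Old rule as described: one recently read neighbour (page N) is enough. */
bool start_readahead_old(const bool *cached, size_t index)
{
	return history_pages(cached, index) >= 1;
}

/* New rule as described: pages N and N+1 must precede the read of N+2. */
bool start_readahead_new(const bool *cached, size_t index)
{
	return history_pages(cached, index) >= 2;
}
```

Under the old rule a single cached neighbour starts readahead; under the new
rule an isolated pair of adjacent random reads no longer does.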
lib/genalloc.c: fix overflow of ending address of memory chunk
In struct gen_pool_chunk, end_addr means the end address of memory chunk
(inclusive), but in the implementation it is treated as address + size of
memory chunk (exclusive), so it points to the address plus one instead of
correct ending address.
The ending address of memory chunk plus one will cause overflow on the
memory chunk including the last address of memory map, e.g. when starting
address is 0xFFF00000 and size is 0x100000 on 32bit machine, ending
address will be 0x100000000.
Use correct ending address like starting address + size - 1.
[akpm@linux-foundation.org: add comment to struct gen_pool_chunk:end_addr] Signed-off-by: Joonyoung Shim <jy0922.shim@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
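The overflow is easy to reproduce with fixed-width arithmetic. The sketch
below uses uint32_t to stand in for a 32-bit address; the three function names
are hypothetical and are not part of the genalloc API.

```c
#include <stdint.h>

/* Exclusive end: one past the last byte. Wraps to 0 at the top of 32 bits. */
uint32_t chunk_end_exclusive(uint32_t start, uint32_t size)
{
	return start + size;
}

/* Inclusive end: address of the last byte. Assumes size > 0. */
uint32_t chunk_end_inclusive(uint32_t start, uint32_t size)
{
	return start + size - 1;
}

/* Membership test against an inclusive end address. */
int addr_in_chunk(uint32_t addr, uint32_t start, uint32_t end_inclusive)
{
	return addr >= start && addr <= end_inclusive;
}
```

For a start of 0xFFF00000 and a size of 0x100000, the exclusive end does not
fit in 32 bits and wraps to 0, while the inclusive end is 0xFFFFFFFF and range
checks keep working.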