Jan Kara [Thu, 16 Nov 2017 01:37:33 +0000 (17:37 -0800)]
mm: batch radix tree operations when truncating pages
Currently we remove pages from the radix tree one by one. To speed up
page cache truncation, lock several pages at once and free them in one
go. This allows us to batch radix tree operations in a more efficient
way and also save round-trips on mapping->tree_lock. As a result we
gain about 20% speed improvement in page cache truncation.
Data from a simple benchmark timing 10000 truncates of 1024 pages (on
ext4 on ramdisk but the filesystem is barely visible in the profiles).
The range shows 1% and 95% percentiles of the measured times:
Jan Kara [Thu, 16 Nov 2017 01:37:29 +0000 (17:37 -0800)]
mm: factor out checks and accounting from __delete_from_page_cache()
Move checks and accounting updates from __delete_from_page_cache() into
a separate function. We will reuse it when batching page cache
truncation operations.
Link: http://lkml.kernel.org/r/20171010151937.26984-7-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Andi Kleen <ak@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Thu, 16 Nov 2017 01:37:22 +0000 (17:37 -0800)]
mm: move accounting updates before page_cache_tree_delete()
Move updates of various counters before page_cache_tree_delete() call.
It will be easier to batch things this way and there is no difference
whether the counters get updated before or after removal from the radix
tree.
Link: http://lkml.kernel.org/r/20171010151937.26984-5-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Andi Kleen <ak@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Thu, 16 Nov 2017 01:37:18 +0000 (17:37 -0800)]
mm: factor out page cache page freeing into a separate function
Factor out page freeing from delete_from_page_cache() into a separate
function. We will need to call the same when batching pagecache
deletion operations.
invalidate_complete_page2() and replace_page_cache_page() might want to
call this function as well however they currently don't seem to handle
THPs so it's unnecessary for them to take the hit of checking whether a
page is THP or not.
Link: http://lkml.kernel.org/r/20171010151937.26984-4-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Andi Kleen <ak@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Thu, 16 Nov 2017 01:37:15 +0000 (17:37 -0800)]
mm: refactor truncate_complete_page()
Move call of delete_from_page_cache() and page->mapping check out of
truncate_complete_page() into the single caller - truncate_inode_page().
Also move page_mapped() check into truncate_complete_page(). That way
it will be easier to batch operations.
Also rename truncate_complete_page() to truncate_cleanup_page().
Link: http://lkml.kernel.org/r/20171010151937.26984-3-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Andi Kleen <ak@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Thu, 16 Nov 2017 01:37:11 +0000 (17:37 -0800)]
mm: speed up cancel_dirty_page() for clean pages
Patch series "Speed up page cache truncation", v1.
When rebasing our enterprise distro to a newer kernel (from 4.4 to 4.12)
we have noticed a regression in bonnie++ benchmark when deleting files.
Eventually we have tracked this down to a fact that page cache
truncation got slower by about 10%. There were both gains and losses in
the above interval of kernels but we have been able to identify that
commit 83929372f629 ("filemap: prepare find and delete operations for
huge pages") caused about 10% regression on its own.
After some investigation it didn't seem easily possible to fix the
regression while maintaining the THP in page cache functionality so
we've decided to optimize the page cache truncation path instead to make
up for the change. This series is a result of that effort.
Patch 1 is an easy speedup of cancel_dirty_page(). Patches 2-6 refactor
page cache truncation code so that it is easier to batch radix tree
operations. Patch 7 implements batching of deletes from the radix tree
which more than makes up for the original regression.
This patch (of 7):
cancel_dirty_page() does quite some work even for clean pages (fetching
of mapping, locking of memcg, atomic bit op on page flags) so it
accounts for ~2.5% of cost of truncation of a clean page. That is not
much but still dumb for something we don't need at all. Check whether a
page is actually dirty and avoid any work if not.
Link: http://lkml.kernel.org/r/20171010151937.26984-2-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Andi Kleen <ak@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Colin Ian King [Thu, 16 Nov 2017 01:37:08 +0000 (17:37 -0800)]
drivers/block/zram/zram_drv.c: make zram_page_end_io() static
zram_page_end_io() is local to the source and does not need to be in
global scope, so make it static.
Cleans up sparse warning:
symbol 'zram_page_end_io' was not declared. Should it be static?
Link: http://lkml.kernel.org/r/20171016173336.20320-1-colin.king@canonical.com Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kees Cook [Thu, 16 Nov 2017 01:37:04 +0000 (17:37 -0800)]
mm/page-writeback.c: convert timers to use timer_setup()
In preparation for unconditionally passing the struct timer_list pointer
to all timer callbacks, switch to using the new timer_setup() and
from_timer() to pass the timer pointer explicitly.
Link: http://lkml.kernel.org/r/20171016225913.GA99214@beast Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Jeff Layton <jlayton@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Rapoport [Thu, 16 Nov 2017 01:36:56 +0000 (17:36 -0800)]
userfaultfd: use mmgrab instead of open-coded increment of mm_count
Link: http://lkml.kernel.org/r/1508132478-7738-1-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Aaron Lu [Thu, 16 Nov 2017 01:36:53 +0000 (17:36 -0800)]
mm/page_alloc: make sure __rmqueue() etc are always inline
__rmqueue(), __rmqueue_fallback(), __rmqueue_smallest() and
__rmqueue_cma_fallback() are all in page allocator's hot path and better
be finished as soon as possible. One way to make them faster is by making
them inline. But as Andrew Morton and Andi Kleen pointed out:
To make sure they are inlined, we should use __always_inline for them.
With the will-it-scale/page_fault1/process benchmark, when using nr_cpu
processes to stress buddy, the results for will-it-scale.processes with
and without the patch are:
The above 4 compilers are used because I've done the tests through
Intel's Linux Kernel Performance(LKP) infrastructure and they are the
available compilers there.
The benefit being less on 4 sockets machine is due to the lock
contention there(perf-profile/native_queued_spin_lock_slowpath=81%) is
less severe than on the 2 sockets machine(85%).
What the benchmark does is: it forks nr_cpu processes and then each
process does the following:
1 mmap() 128M anonymous space;
2 writes to each page there to trigger actual page allocation;
3 munmap() it.
in a loop.
Pavel Tatashin [Thu, 16 Nov 2017 01:36:48 +0000 (17:36 -0800)]
sparc64: optimize struct page zeroing
Add an optimized mm_zero_struct_page(), so struct page's are zeroed
without calling memset(). We do eight to ten regular stores based on
the size of struct page. Compiler optimizes out the conditions of
switch() statement.
SPARC-M6 with 15T of memory, single thread performance:
BASE FIX OPTIMIZED_FIX
bootmem_init 28.440467985s 2.305674818s 2.305161615s
free_area_init_nodes 202.845901673s 225.343084508s 172.556506560s
--------------------------------------------
Total 231.286369658s 227.648759326s 174.861668175s
BASE: current linux
FIX: This patch series without "optimized struct page zeroing"
OPTIMIZED_FIX: This patch series including the current patch.
bootmem_init() is where memory for struct pages is zeroed during
allocation. Note, about two seconds in this function is a fixed time:
it does not increase as memory is increased.
Link: http://lkml.kernel.org/r/20171013173214.27300-11-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Bob Picco <bob.picco@oracle.com> Acked-by: David S. Miller <davem@davemloft.net> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Tatashin [Thu, 16 Nov 2017 01:36:44 +0000 (17:36 -0800)]
mm: stop zeroing memory during allocation in vmemmap
vmemmap_alloc_block() will no longer zero the block, so zero memory at
its call sites for everything except struct pages. Struct page memory
is zero'd by struct page initialization.
Replace allocators in sparse-vmemmap to use the non-zeroing version.
So, we will get the performance improvement by zeroing the memory in
parallel when struct pages are zeroed.
Add struct page zeroing as a part of initialization of other fields in
__init_single_page().
This single thread performance collected on: Intel(R) Xeon(R) CPU E7-8895
v3 @ 2.60GHz with 1T of memory (268400646 pages in 8 nodes):
BASE FIX
sparse_init 11.244671836s 0.007199623s
zone_sizes_init 4.879775891s 8.355182299s
--------------------------
Total 16.124447727s 8.362381922s
sparse_init is where memory for struct pages is zeroed, and the zeroing
part is moved later in this patch into __init_single_page(), which is
called from zone_sizes_init().
[akpm@linux-foundation.org: make vmemmap_alloc_block_zero() private to sparse-vmemmap.c] Link: http://lkml.kernel.org/r/20171013173214.27300-10-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Bob Picco <bob.picco@oracle.com> Tested-by: Bob Picco <bob.picco@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Will Deacon [Thu, 16 Nov 2017 01:36:40 +0000 (17:36 -0800)]
arm64/mm/kasan: don't use vmemmap_populate() to initialize shadow
The kasan shadow is currently mapped using vmemmap_populate() since that
provides a semi-convenient way to map pages into init_top_pgt. However,
since that no longer zeroes the mapped pages, it is not suitable for
kasan, which requires zeroed shadow memory.
Add kasan_populate_shadow() interface and use it instead of
vmemmap_populate(). Besides, this allows us to take advantage of
gigantic pages and use them to populate the shadow, which should save us
some memory wasted on page tables and reduce TLB pressure.
Link: http://lkml.kernel.org/r/20171103185147.2688-3-pasha.tatashin@oracle.com Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Bob Picco <bob.picco@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Alexander Potapenko <glider@google.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrey Ryabinin [Thu, 16 Nov 2017 01:36:35 +0000 (17:36 -0800)]
x86/mm/kasan: don't use vmemmap_populate() to initialize shadow
The kasan shadow is currently mapped using vmemmap_populate() since that
provides a semi-convenient way to map pages into init_top_pgt. However,
since that no longer zeroes the mapped pages, it is not suitable for
kasan, which requires zeroed shadow memory.
Add kasan_populate_shadow() interface and use it instead of
vmemmap_populate(). Besides, this allows us to take advantage of
gigantic pages and use them to populate the shadow, which should save us
some memory wasted on page tables and reduce TLB pressure.
Link: http://lkml.kernel.org/r/20171103185147.2688-2-pasha.tatashin@oracle.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Bob Picco <bob.picco@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Alexander Potapenko <glider@google.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Tatashin [Thu, 16 Nov 2017 01:36:31 +0000 (17:36 -0800)]
mm: zero reserved and unavailable struct pages
Some memory is reserved but unavailable: not present in memblock.memory
(because not backed by physical pages), but present in memblock.reserved.
Such memory has backing struct pages, but they are not initialized by
going through __init_single_page().
In some cases these struct pages are accessed even if they do not
contain any data. One example is page_to_pfn() might access page->flags
if this is where section information is stored (CONFIG_SPARSEMEM,
SECTION_IN_PAGE_FLAGS).
One example of such memory: trim_low_memory_range() unconditionally
reserves from pfn 0, but e820__memblock_setup() might provide the
exiting memory from pfn 1 (i.e. KVM).
Since struct pages are zeroed in __init_single_page(), and not during
allocation time, we must zero such struct pages explicitly.
The patch involves adding a new memblock iterator:
for_each_resv_unavail_range(i, p_start, p_end)
Which iterates through reserved && !memory lists, and we zero struct pages
explicitly by calling mm_zero_struct_page().
===
Here is more detailed example of problem that this patch is addressing:
Run tested on qemu with the following arguments:
-enable-kvm -cpu kvm64 -m 512 -smp 2
This patch reports that there are 98 unavailable pages.
They are: pfn 0 and pfns in range [159, 255].
Note, trim_low_memory_range() reserves only pfns in range [0, 15], it does
not reserve [159, 255] ones.
e820__memblock_setup() reports linux that the following physical ranges are
available:
[1 , 158]
[256, 130783]
Notice, that exactly unavailable pfns are missing!
Now, lets check what we have in zone 0: [1, 131039]
pfn 0, is not part of the zone, but pfns [1, 158], are.
However, the bigger problem we have if we do not initialize these struct
pages is with memory hotplug. Because, that path operates at 2M
boundaries (section_nr). And checks if 2M range of pages is hot
removable. It starts with first pfn from zone, rounds it down to 2M
boundary (sturct pages are allocated at 2M boundaries when vmemmap is
created), and checks if that section is hot removable. In this case
start with pfn 1 and convert it down to pfn 0. Later pfn is converted
to struct page, and some fields are checked. Now, if we do not zero
struct pages, we get unpredictable results.
In fact when CONFIG_VM_DEBUG is enabled, and we explicitly set all
vmemmap memory to ones, the following panic is observed with kernel test
without this patch applied:
Pavel Tatashin [Thu, 16 Nov 2017 01:36:27 +0000 (17:36 -0800)]
mm: define memblock_virt_alloc_try_nid_raw
* A new variant of memblock_virt_alloc_* allocations:
memblock_virt_alloc_try_nid_raw()
- Does not zero the allocated memory
- Does not panic if request cannot be satisfied
* optimize early system hash allocations
Clients can call alloc_large_system_hash() with flag: HASH_ZERO to
specify that memory that was allocated for system hash needs to be
zeroed, otherwise the memory does not need to be zeroed, and client will
initialize it.
If memory does not need to be zero'd, call the new
memblock_virt_alloc_raw() interface, and thus improve the boot
performance.
* debug for raw alloctor
When CONFIG_DEBUG_VM is enabled, this patch sets all the memory that is
returned by memblock_virt_alloc_try_nid_raw() to ones to ensure that no
places excpect zeroed memory.
Link: http://lkml.kernel.org/r/20171013173214.27300-6-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Bob Picco <bob.picco@oracle.com> Tested-by: Bob Picco <bob.picco@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Tatashin [Thu, 16 Nov 2017 01:36:18 +0000 (17:36 -0800)]
sparc64/mm: set fields in deferred pages
Without deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
flags and other fields in "struct page"es are never changed prior to
first initializing struct pages by going through __init_single_page().
With deferred struct page feature enabled there is a case where we set
some fields prior to initializing:
When register_page_bootmem_info() is called only non-deferred struct
pages are initialized. But, this function goes through some reserved
pages which might be part of the deferred, and thus are not yet
initialized.
mem_init
register_page_bootmem_info
register_page_bootmem_info_node
get_page_bootmem
.. setting fields here ..
such as: page->freelist = (void *)type;
free_all_bootmem()
free_low_memory_core_early()
for_each_reserved_mem_region()
reserve_bootmem_region()
init_reserved_page() <- Only if this is deferred reserved page
__init_single_pfn()
__init_single_page()
memset(0) <-- Loose the set fields here
We end up with similar issue as in the previous patch, where currently
we do not observe problem as memory is zeroed. But, if flag asserts are
changed we can start hitting issues.
Also, because in this patch series we will stop zeroing struct page
memory during allocation, we must make sure that struct pages are
properly initialized prior to using them.
The deferred-reserved pages are initialized in free_all_bootmem().
Therefore, the fix is to switch the above calls.
Link: http://lkml.kernel.org/r/20171013173214.27300-4-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Bob Picco <bob.picco@oracle.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Tatashin [Thu, 16 Nov 2017 01:36:14 +0000 (17:36 -0800)]
x86/mm: set fields in deferred pages
Without deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
flags and other fields in "struct page"es are never changed prior to
first initializing struct pages by going through __init_single_page().
With deferred struct page feature enabled, however, we set fields in
register_page_bootmem_info that are subsequently clobbered right after
in free_all_bootmem:
When register_page_bootmem_info() is called only non-deferred struct
pages are initialized. But, this function goes through some reserved
pages which might be part of the deferred, and thus are not yet
initialized.
mem_init
register_page_bootmem_info
register_page_bootmem_info_node
get_page_bootmem
.. setting fields here ..
such as: page->freelist = (void *)type;
free_all_bootmem()
free_low_memory_core_early()
for_each_reserved_mem_region()
reserve_bootmem_region()
init_reserved_page() <- Only if this is deferred reserved page
__init_single_pfn()
__init_single_page()
memset(0) <-- Loose the set fields here
We end up with issue where, currently we do not observe problem as
memory is explicitly zeroed. But, if flag asserts are changed we can
start hitting issues.
Also, because in this patch series we will stop zeroing struct page
memory during allocation, we must make sure that struct pages are
properly initialized prior to using them.
The deferred-reserved pages are initialized in free_all_bootmem().
Therefore, the fix is to switch the above calls.
Link: http://lkml.kernel.org/r/20171013173214.27300-3-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Bob Picco <bob.picco@oracle.com> Tested-by: Bob Picco <bob.picco@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Tatashin [Thu, 16 Nov 2017 01:36:09 +0000 (17:36 -0800)]
mm: deferred_init_memmap improvements
Patch series "complete deferred page initialization", v12.
SMP machines can benefit from the DEFERRED_STRUCT_PAGE_INIT config
option, which defers initializing struct pages until all cpus have been
started so it can be done in parallel.
However, this feature is sub-optimal, because the deferred page
initialization code expects that the struct pages have already been
zeroed, and the zeroing is done early in boot with a single thread only.
Also, we access that memory and set flags before struct pages are
initialized. All of this is fixed in this patchset.
In this work we do the following:
- Never read access struct page until it was initialized
- Never set any fields in struct pages before they are initialized
- Zero struct page at the beginning of struct page initialization
==========================================================================
Performance improvements on x86 machine with 8 nodes:
Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz and 1T of memory:
TIME SPEED UP
base no deferred: 95.796233s
fix no deferred: 79.978956s 19.77%
base deferred: 77.254713s
fix deferred: 55.050509s 40.34%
==========================================================================
SPARC M6 3600 MHz with 15T of memory
TIME SPEED UP
base no deferred: 358.335727s
fix no deferred: 302.320936s 18.52%
base deferred: 237.534603s
fix deferred: 182.103003s 30.44%
==========================================================================
Raw dmesg output with timestamps:
x86 base no deferred: https://hastebin.com/ofunepurit.scala
x86 base deferred: https://hastebin.com/ifazegeyas.scala
x86 fix no deferred: https://hastebin.com/pegocohevo.scala
x86 fix deferred: https://hastebin.com/ofupevikuk.scala
sparc base no deferred: https://hastebin.com/ibobeteken.go
sparc base deferred: https://hastebin.com/fariqimiyu.go
sparc fix no deferred: https://hastebin.com/muhegoheyi.go
sparc fix deferred: https://hastebin.com/xadinobutu.go
This patch (of 11):
deferred_init_memmap() is called when struct pages are initialized later
in boot by slave CPUs. This patch simplifies and optimizes this
function, and also fixes a couple issues (described below).
The main change is that now we are iterating through free memblock areas
instead of all configured memory. Thus, we do not have to check if the
struct page has already been initialized.
=====
In deferred_init_memmap() where all deferred struct pages are
initialized we have a check like this:
if (page->flags) {
VM_BUG_ON(page_zone(page) != zone);
goto free_range;
}
This way we are checking if the current deferred page has already been
initialized. It works, because memory for struct pages has been zeroed,
and the only way flags are not zero if it went through
__init_single_page() before. But, once we change the current behavior
and won't zero the memory in memblock allocator, we cannot trust
anything inside "struct page"es until they are initialized. This patch
fixes this.
The deferred_init_memmap() is re-written to loop through only free
memory ranges provided by memblock.
Note, this first issue is relevant only when the following change is
merged:
=====
This patch fixes another existing issue on systems that have holes in
zones i.e CONFIG_HOLES_IN_ZONE is defined.
In for_each_mem_pfn_range() we have code like this:
if (!pfn_valid_within(pfn)
goto free_range;
Note: 'page' is not set to NULL and is not incremented but 'pfn'
advances. Thus means if deferred struct pages are enabled on systems
with these kind of holes, linux would get memory corruptions. I have
fixed this issue by defining a new macro that performs all the necessary
operations when we free the current set of pages.
[pasha.tatashin@oracle.com: buddy page accessed before initialized] Link: http://lkml.kernel.org/r/20171102170221.7401-2-pasha.tatashin@oracle.com Link: http://lkml.kernel.org/r/20171013173214.27300-2-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Bob Picco <bob.picco@oracle.com> Tested-by: Bob Picco <bob.picco@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KASan is a replacement that is able to work without the limitation of
kmemcheck (single CPU, slow). KASan is already upstream.
We are also not aware of any users of kmemcheck (or users who don't
consider KASan as a suitable replacement).
The only objection was that since KASAN wasn't supported by all GCC
versions provided by distros at that time we should hold off for 2
years, and try again.
Now that 2 years have passed, and all distros provide gcc that supports
KASAN, kill kmemcheck again for the very same reasons.
This patch (of 4):
Remove kmemcheck annotations, and calls to kmemcheck from the kernel.
[alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs] Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Cc: Alexander Potapenko <glider@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tim Hansen <devtimhansen@gmail.com> Cc: Vegard Nossum <vegardno@ifi.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Colin Ian King [Thu, 16 Nov 2017 01:35:47 +0000 (17:35 -0800)]
mm/rmap.c: remove redundant variable cend
Variable cend is set but never read, hence it is redundant and can be
removed.
Cleans up clang build warning: Value stored to 'cend' is never read
Link: http://lkml.kernel.org/r/20171011174942.1372-1-colin.king@canonical.com Fixes: 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2") Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shakeel Butt [Thu, 16 Nov 2017 01:35:44 +0000 (17:35 -0800)]
fs, mm: account filp cache to kmemcg
The allocations from filp cache can be directly triggered by userspace
applications. A buggy application can consume a significant amount of
unaccounted system memory. Though we have not noticed such buggy
applications in our production but upon close inspection, we found that
a lot of machines spend very significant amount of memory on these
caches.
One way to limit allocations from filp cache is to set system level
limit of maximum number of open files. However this limit is shared
between different users on the system and one user can hog this
resource. To cater that, we can charge filp to kmemcg and set the
maximum limit very high and let the memory limit of each user limit the
number of files they can open and indirectly limiting their allocations
from filp cache.
One side effect of this change is that it will allow _sysctl() to return
ENOMEM and the man page of _sysctl() does not specify that. However the
man page also discourages to use _sysctl() at all.
Link: http://lkml.kernel.org/r/20171011190359.34926-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, we account page tables separately for each page table level,
but that's redundant -- we only make use of total memory allocated to
page tables for oom_badness calculation. We also provide the
information to userspace, but it has dubious value there too.
This patch switches page table accounting to single counter.
mm->pgtables_bytes is now used to account all page table levels. We use
bytes, because page table size for different levels of page table tree
may be different.
The change has user-visible effect: we don't have VmPMD and VmPUD
reported in /proc/[pid]/status. Not sure if anybody uses them. (As
alternative, we can always report 0 kB for them.)
OOM-killer report is also slightly changed: we now report pgtables_bytes
instead of nr_ptes, nr_pmd, nr_puds.
Apart from reducing number of counters per-mm, the benefit is that we
now calculate oom_badness() more correctly for machines which have
different size of page tables depending on level or where page tables
are less than a page in size.
The only downside can be debuggability because we do not know which page
table level could leak. But I do not remember many bugs that would be
caught by separate counters so I wouldn't lose sleep over this.
On a machine with 5-level paging support a process can allocate
significant amount of memory and stay unnoticed by oom-killer and memory
cgroup. The trick is to allocate a lot of PUD page tables. We don't
account PUD page tables, only PMD and PTE.
We already addressed the same issue for PMD page tables, see commit dc6c9a35b66b ("mm: account pmd page tables to the process").
Introduction of 5-level paging brings the same issue for PUD page
tables.
kmemleak: change /sys/kernel/debug/kmemleak permissions from 0444 to 0644
Kmemleak can be tweaked at runtime by writing commands into debugfs
file. Root can use it anyway, but without the write-bit this interface
isn't obvious.
Link: http://lkml.kernel.org/r/150728996582.744328.11541332857988399411.stgit@buzz Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Thu, 16 Nov 2017 01:35:26 +0000 (17:35 -0800)]
cifs: use find_get_pages_range_tag()
wdata_alloc_and_fillpages() needlessly iterates calls to
find_get_pages_tag(). Also it wants only pages from given range. Make
it use find_get_pages_range_tag().
Link: http://lkml.kernel.org/r/20171009151359.31984-17-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Suggested-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Steve French <sfrench@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Thu, 16 Nov 2017 01:35:23 +0000 (17:35 -0800)]
afs: use find_get_pages_range_tag()
Use find_get_pages_range_tag() in afs_writepages_region() as we are
interested only in pages from given range. Remove unnecessary code
after this conversion.
Link: http://lkml.kernel.org/r/20171009151359.31984-16-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Thu, 16 Nov 2017 01:35:19 +0000 (17:35 -0800)]
mm: remove nr_pages argument from pagevec_lookup_{,range}_tag()
All users of pagevec_lookup() and pagevec_lookup_range() now pass
PAGEVEC_SIZE as a desired number of pages. Just drop the argument.
Link: http://lkml.kernel.org/r/20171009151359.31984-15-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Thu, 16 Nov 2017 01:35:16 +0000 (17:35 -0800)]
ceph: use pagevec_lookup_range_nr_tag()
Use new function for looking up pages since nr_pages argument from
pagevec_lookup_range_tag() is going away.
Link: http://lkml.kernel.org/r/20171009151359.31984-14-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Thu, 16 Nov 2017 01:35:12 +0000 (17:35 -0800)]
mm: add variant of pagevec_lookup_range_tag() taking number of pages
Currently pagevec_lookup_range_tag() takes number of pages to look up
but most users don't need this. Create a new function
pagevec_lookup_range_nr_tag() that takes maximum number of pages to
lookup for Ceph which wants this functionality so that we can drop
nr_pages argument from pagevec_lookup_range_tag().
Link: http://lkml.kernel.org/r/20171009151359.31984-13-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Thu, 16 Nov 2017 01:35:09 +0000 (17:35 -0800)]
mm: use pagevec_lookup_range_tag() in write_cache_pages()
Use pagevec_lookup_range_tag() in write_cache_pages() as it is
interested only in pages from given range. Remove unnecessary code
resulting from this.
Link: http://lkml.kernel.org/r/20171009151359.31984-12-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Thu, 16 Nov 2017 01:35:05 +0000 (17:35 -0800)]
mm: use pagevec_lookup_range_tag() in __filemap_fdatawait_range()
Use pagevec_lookup_range_tag() in __filemap_fdatawait_range() as it is
interested only in pages from given range. Remove unnecessary code
resulting from this.
Link: http://lkml.kernel.org/r/20171009151359.31984-11-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Thu, 16 Nov 2017 01:35:02 +0000 (17:35 -0800)]
nilfs2: use pagevec_lookup_range_tag()
We want only pages from given range in nilfs_lookup_dirty_data_buffers().
Use pagevec_lookup_range_tag() instead of pagevec_lookup_tag() and
remove unnecessary code.
Link: http://lkml.kernel.org/r/20171009151359.31984-10-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Thu, 16 Nov 2017 01:34:58 +0000 (17:34 -0800)]
gfs2: use pagevec_lookup_range_tag()
We want only pages from given range in gfs2_write_cache_jdata(). Use
pagevec_lookup_range_tag() instead of pagevec_lookup_tag() and remove
unnecessary code.
Link: http://lkml.kernel.org/r/20171009151359.31984-9-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Thu, 16 Nov 2017 01:34:55 +0000 (17:34 -0800)]
f2fs: use find_get_pages_tag() for looking up single page
__get_first_dirty_index() wants to lookup only the first dirty page
after given index. There's no point in using pagevec_lookup_tag() for
that. Just use find_get_pages_tag() directly.
Link: http://lkml.kernel.org/r/20171009151359.31984-8-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Chao Yu <yuchao0@huawei.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Thu, 16 Nov 2017 01:34:51 +0000 (17:34 -0800)]
f2fs: simplify page iteration loops
In several places we want to iterate over all tagged pages in a mapping.
However the code was apparently copied from places that iterate only
over a limited range and thus it checks for index <= end, optimizes the
case where we are coming close to range end which is all pointless when
end == ULONG_MAX. So just remove this dead code.
[akpm@linux-foundation.org: fix warnings] Link: http://lkml.kernel.org/r/20171009151359.31984-7-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Thu, 16 Nov 2017 01:34:48 +0000 (17:34 -0800)]
f2fs: use pagevec_lookup_range_tag()
We want only pages from given range in f2fs_write_cache_pages(). Use
pagevec_lookup_range_tag() instead of pagevec_lookup_tag() and remove
unnecessary code.
Link: http://lkml.kernel.org/r/20171009151359.31984-6-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Chao Yu <yuchao0@huawei.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Thu, 16 Nov 2017 01:34:44 +0000 (17:34 -0800)]
ext4: use pagevec_lookup_range_tag()
We want only pages from given range in ext4_writepages(). Use
pagevec_lookup_range_tag() instead of pagevec_lookup_tag() and remove
unnecessary code.
Link: http://lkml.kernel.org/r/20171009151359.31984-5-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Thu, 16 Nov 2017 01:34:41 +0000 (17:34 -0800)]
ceph: use pagevec_lookup_range_tag()
We want only pages from given range in ceph_writepages_start(). Use
pagevec_lookup_range_tag() instead of pagevec_lookup_tag() and remove
unnecessary code.
Link: http://lkml.kernel.org/r/20171009151359.31984-4-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Thu, 16 Nov 2017 01:34:37 +0000 (17:34 -0800)]
btrfs: use pagevec_lookup_range_tag()
We want only pages from given range in btree_write_cache_pages() and
extent_write_cache_pages(). Use pagevec_lookup_range_tag() instead of
pagevec_lookup_tag() and remove unnecessary code.
Link: http://lkml.kernel.org/r/20171009151359.31984-3-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: David Sterba <dsterba@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Thu, 16 Nov 2017 01:34:33 +0000 (17:34 -0800)]
mm: implement find_get_pages_range_tag()
Patch series "Ranged pagevec tagged lookup", v3.
In this series I provide a ranged variant of pagevec_lookup_tag() and
use it in places where it makes sense. This series removes some common
code and it also has a potential for speeding up some operations
similarly as for pagevec_lookup_range() (but for now I can think of only
artificial cases where this happens).
This patch (of 16):
Implement a variant of find_get_pages_tag() that stops iterating at
given index. Lots of users of this function (through pagevec_lookup())
actually want a range lookup and all of them are currently open-coding
this.
Also create corresponding pagevec_lookup_range_tag() function.
Link: http://lkml.kernel.org/r/20171009151359.31984-2-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Chao Yu <yuchao0@huawei.com> Cc: David Howells <dhowells@redhat.com> Cc: David Sterba <dsterba@suse.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: Steve French <sfrench@samba.org> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ayush Mittal [Thu, 16 Nov 2017 01:34:30 +0000 (17:34 -0800)]
mm/page_owner.c: reduce page_owner structure size
Maximum page order can be at max 10 which can be accomodated in short
data type(2 bytes). last_migrate_reason is defined as enum type whose
values can be accomodated in short data type (2 bytes).
Total structure size is currently 16 bytes but after changing structure
size it goes to 12 bytes.
Vlastimil said:
"Looks like it works, so why not.
Before:
[ 0.001000] allocated 50331648 bytes of page_ext
After:
[ 0.001000] allocated 41943040 bytes of page_ext"
Pintu Agarwal [Thu, 16 Nov 2017 01:34:26 +0000 (17:34 -0800)]
mm/cma.c: change pr_info to pr_err for cma_alloc fail log
It was observed that under cma_alloc fail log, pr_info was used instead
of pr_err. This will lead to problems if printk debug level is set to
below 7. In this case the cma_alloc failure log will not be captured in
the log and it will be difficult to debug.
Simply replace the pr_info with pr_err to capture failure log.
Link: http://lkml.kernel.org/r/1507650633-4430-1-git-send-email-pintu.ping@gmail.com Signed-off-by: Pintu Agarwal <pintu.ping@gmail.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jaewon Kim <jaewon31.kim@samsung.com> Cc: Doug Berger <opendmb@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Thu, 16 Nov 2017 01:34:22 +0000 (17:34 -0800)]
mm, arch: remove empty_bad_page*
empty_bad_page() and empty_bad_pte_table() seem to be relics from old
days which is not used by any code for a long time. I have tried to
find when exactly but this is not really all that straightforward due to
many code movements - traces disappear around 2.4 times.
Anyway no code really references neither empty_bad_page nor
empty_bad_pte_table. We only allocate the storage which is not used by
anybody so remove them.
Link: http://lkml.kernel.org/r/20171004150045.30755-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Ralf Baechle <ralf@linus-mips.org> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: David Howells <dhowells@redhat.com> Cc: Rich Felker <dalias@libc.org> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tim Chen [Thu, 16 Nov 2017 01:34:18 +0000 (17:34 -0800)]
mm/swap_slots.c: fix race conditions in swap_slots cache init
Memory allocations can happen before the swap_slots cache initialization
is completed during cpu bring up. If we are low on memory, we could
call get_swap_page() and access swap_slots_cache before it is fully
initialized.
Add a check in get_swap_page() for initialized swap_slots_cache to
prevent this condition. Similar check already exists in free_swap_slot.
Also annotate the checks to indicate the likely condition.
We also added a memory barrier to make sure that the locks
initialization are done before the assignment of cache->slots and
cache->slots_ret pointers. This ensures the assumption that it is safe
to acquire the slots cache locks and use the slots cache when the
corresponding cache->slots or cache->slots_ret pointers are non null.
[akpm@linux-foundation.org: tidy up comment]
[akpm@linux-foundation.org: fix spello in comment] Link: http://lkml.kernel.org/r/65a9d0f133f63e66bba37b53b2fd0464b7cae771.1500677066.git.tim.c.chen@linux.intel.com Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Reported-by: Wenwei Tao <wenwei.tww@alibaba-inc.com> Acked-by: Ying Huang <ying.huang@intel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrey Ryabinin [Thu, 16 Nov 2017 01:34:15 +0000 (17:34 -0800)]
mm: remove unused pgdat->inactive_ratio
Since commit 59dc76b0d4df ("mm: vmscan: reduce size of inactive file
list") 'pgdat->inactive_ratio' is not used, except for printing
"node_inactive_ratio: 0" in /proc/zoneinfo output.
Remove it.
Link: http://lkml.kernel.org/r/20171003152611.27483-1-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/mmu_notifier: avoid call to invalidate_range() in range_end()
This is an optimization patch that only affect mmu_notifier users which
rely on the invalidate_range() callback. This patch avoids calling that
callback twice in a row from inside __mmu_notifier_invalidate_range_end
Existing pattern (before this patch):
mmu_notifier_invalidate_range_start()
pte/pmd/pud_clear_flush_notify()
mmu_notifier_invalidate_range()
mmu_notifier_invalidate_range_end()
mmu_notifier_invalidate_range()
New pattern (after this patch):
mmu_notifier_invalidate_range_start()
pte/pmd/pud_clear_flush_notify()
mmu_notifier_invalidate_range()
mmu_notifier_invalidate_range_only_end()
We call the invalidate_range callback after clearing the page table
under the page table lock and we skip the call to invalidate_range
inside the __mmu_notifier_invalidate_range_end() function.
mm/mmu_notifier: avoid double notification when it is useless
This patch only affects users of mmu_notifier->invalidate_range callback
which are device drivers related to ATS/PASID, CAPI, IOMMUv2, SVM ...
and it is an optimization for those users. Everyone else is unaffected
by it.
When clearing a pte/pmd we are given a choice to notify the event under
the page table lock (notify version of *_clear_flush helpers do call the
mmu_notifier_invalidate_range). But that notification is not necessary
in all cases.
This patch removes almost all cases where it is useless to have a call
to mmu_notifier_invalidate_range before
mmu_notifier_invalidate_range_end. It also adds documentation in all
those cases explaining why.
Below is a more in depth analysis of why this is fine to do this:
For secondary TLB (non CPU TLB) like IOMMU TLB or device TLB (when
device use thing like ATS/PASID to get the IOMMU to walk the CPU page
table to access a process virtual address space). There is only 2 cases
when you need to notify those secondary TLB while holding page table
lock when clearing a pte/pmd:
A) page backing address is free before mmu_notifier_invalidate_range_end
B) a page table entry is updated to point to a new page (COW, write fault
on zero page, __replace_page(), ...)
Case A is obvious you do not want to take the risk for the device to write
to a page that might now be used by something completely different.
Case B is more subtle. For correctness it requires the following sequence
to happen:
- take page table lock
- clear page table entry and notify (pmd/pte_huge_clear_flush_notify())
- set page table entry to point to new page
If clearing the page table entry is not followed by a notify before setting
the new pte/pmd value then you can break memory model like C11 or C++11 for
the device.
Consider the following scenario (device use a feature similar to ATS/
PASID):
Two address addrA and addrB such that |addrA - addrB| >= PAGE_SIZE we
assume they are write protected for COW (other case of B apply too).
[Time N] -----------------------------------------------------------------
CPU-thread-0 {try to write to addrA}
CPU-thread-1 {try to write to addrB}
CPU-thread-2 {}
CPU-thread-3 {}
DEV-thread-0 {read addrA and populate device TLB}
DEV-thread-2 {read addrB and populate device TLB}
[Time N+1] ---------------------------------------------------------------
CPU-thread-0 {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
CPU-thread-1 {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
CPU-thread-2 {}
CPU-thread-3 {}
DEV-thread-0 {}
DEV-thread-2 {}
[Time N+2] ---------------------------------------------------------------
CPU-thread-0 {COW_step1: {update page table point to new page for addrA}}
CPU-thread-1 {COW_step1: {update page table point to new page for addrB}}
CPU-thread-2 {}
CPU-thread-3 {}
DEV-thread-0 {}
DEV-thread-2 {}
[Time N+3] ---------------------------------------------------------------
CPU-thread-0 {preempted}
CPU-thread-1 {preempted}
CPU-thread-2 {write to addrA which is a write to new page}
CPU-thread-3 {}
DEV-thread-0 {}
DEV-thread-2 {}
[Time N+3] ---------------------------------------------------------------
CPU-thread-0 {preempted}
CPU-thread-1 {preempted}
CPU-thread-2 {}
CPU-thread-3 {write to addrB which is a write to new page}
DEV-thread-0 {}
DEV-thread-2 {}
[Time N+4] ---------------------------------------------------------------
CPU-thread-0 {preempted}
CPU-thread-1 {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
CPU-thread-2 {}
CPU-thread-3 {}
DEV-thread-0 {}
DEV-thread-2 {}
[Time N+5] ---------------------------------------------------------------
CPU-thread-0 {preempted}
CPU-thread-1 {}
CPU-thread-2 {}
CPU-thread-3 {}
DEV-thread-0 {read addrA from old page}
DEV-thread-2 {read addrB from new page}
So here because at time N+2 the clear page table entry was not pair with a
notification to invalidate the secondary TLB, the device see the new value
for addrB before seing the new value for addrA. This break total memory
ordering for the device.
When changing a pte to write protect or to point to a new write protected
page with same content (KSM) it is ok to delay invalidate_range callback
to mmu_notifier_invalidate_range_end() outside the page table lock. This
is true even if the thread doing page table update is preempted right
after releasing page table lock before calling
mmu_notifier_invalidate_range_end
Thanks to Andrea for thinking of a problematic scenario for COW.
zsmalloc: calling zs_map_object() from irq is a bug
Use BUG_ON(in_interrupt()) in zs_map_object(). This is not a new
BUG_ON(), it's always been there, but was recently changed to
VM_BUG_ON(). There are several problems there. First, we use use
per-CPU mappings both in zsmalloc and in zram, and interrupt may easily
corrupt those buffers. Second, and more importantly, we believe it's
possible to start leaking sensitive information. Consider the following
case:
-> process P
swap out
zram
per-cpu mapping CPU1
compress page A
-> IRQ
swap out
zram
per-cpu mapping CPU1
compress page B
write page from per-cpu mapping CPU1 to zsmalloc pool
iret
-> process P
write page from per-cpu mapping CPU1 to zsmalloc pool [*]
return
* so we store overwritten data that actually belongs to another
page (task) and potentially contains sensitive data. And when
process P will page fault it's going to read (swap in) that
other task's data.
Link: http://lkml.kernel.org/r/20170929045140.4055-1-sergey.senozhatsky@gmail.com Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fs/hugetlbfs/inode.c: remove redundant -ENIVAL return from hugetlbfs_setattr()
There is no need to have a local return code set with -EINVAL when both
the conditions following it return error codes appropriately. Just
remove the redundant one.
Link: http://lkml.kernel.org/r/20170929145444.17611-1-khandual@linux.vnet.ibm.com Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Over all, ZSTD has slower WRITE, but much faster READ (perhaps
a static compression buffer used during the test helped ZSTD a
lot), which results in faster test results.
: I did test with my sample data and compared zstd with deflate. zstd's
: compress ratio is lower a little bit but compression speed is much faster
: 3 times more and decompress speed is too 2 times more. With different
: data, it is different but overall, zstd would be better for speed at the
: cost of a little lower compress ratio(about 5%) so I believe it's worth to
: replace deflate.
[1] https://github.com/facebook/zstd
Link: http://lkml.kernel.org/r/20170912050005.3247-1-sergey.senozhatsky@gmail.com Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Tested-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yafang Shao [Thu, 16 Nov 2017 01:33:45 +0000 (17:33 -0800)]
mm/page-writeback.c: print a warning if the vm dirtiness settings are illogical
The vm direct limit setting must be set greater than vm background limit
setting. Otherwise print a warning to help the operator to figure out
that the vm dirtiness settings is in illogical state.
Link: http://lkml.kernel.org/r/1506592464-30962-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Gioh Kim [Thu, 16 Nov 2017 01:33:42 +0000 (17:33 -0800)]
mm/memblock.c: make the index explicit argument of for_each_memblock_type
for_each_memblock_type macro function relies on idx variable defined in
the caller context. Silent macro arguments are almost always wrong
thing to do. They make code harder to read and easier to get wrong.
Let's use an explicit iterator parameter for for_each_memblock_type and
make the code more obious. This patch is a mere cleanup and it
shouldn't introduce any functional change.
Michal Hocko [Thu, 16 Nov 2017 01:33:38 +0000 (17:33 -0800)]
mm, memory_hotplug: remove timeout from __offline_memory
We have a hardcoded 120s timeout after which the memory offline fails
basically since the hot remove has been introduced. This is essentially
a policy implemented in the kernel. Moreover there is no way to adjust
the timeout and so we are sometimes facing memory offline failures if
the system is under a heavy memory pressure or very intensive CPU
workload on large machines.
It is not very clear what purpose the timeout actually serves. The
offline operation is interruptible by a signal so if userspace wants
some timeout based termination this can be done trivially by sending a
signal.
If there is a strong usecase to do this from the kernel then we should
do it properly and have a it tunable from the userspace with the timeout
disabled by default along with the explanation who uses it and for what
purporse.
Michal Hocko [Thu, 16 Nov 2017 01:33:34 +0000 (17:33 -0800)]
mm, memory_hotplug: do not fail offlining too early
Patch series "mm, memory_hotplug: redefine memory offline retry logic", v2.
While testing memory hotplug on a large 4TB machine we have noticed that
memory offlining is just too eager to fail. The primary reason is that
the retry logic is just too easy to give up. We have 4 ways out of the
offline
- we have a permanent failure (isolation or memory notifiers fail,
or hugetlb pages cannot be dropped)
- userspace sends a signal
- a hardcoded 120s timeout expires
- page migration fails 5 times
This is way too convoluted and it doesn't scale very well. We have seen
both temporary migration failures as well as 120s being triggered.
After removing those restrictions we were able to pass stress testing
during memory hot remove without any other negative side effects
observed. Therefore I suggest dropping both hard coded policies. I
couldn't have found any specific reason for them in the changelog. I
neither didn't get any response [1] from Kamezawa. If we need some
upper bound - e.g. timeout based - then we should have a proper and
user defined policy for that. In any case there should be a clear use
case when introducing it.
This patch (of 2):
Memory offlining can fail too eagerly under heavy memory pressure.
Isolation has failed here because the page is not on LRU. Most probably
because it was on the pcp LRU cache or it has been removed from the LRU
already but it hasn't been freed yet. In both cases the page doesn't
look non-migrable so retrying more makes sense.
__offline_pages seems rather cluttered when it comes to the retry logic.
We have 5 retries at maximum and a timeout. We could argue whether the
timeout makes sense but failing just because of a race when somebody
isoltes a page from LRU or puts it on a pcp LRU lists is just wrong. It
only takes it to race with a process which unmaps some pages and remove
them from the LRU list and we can fail the whole offline because of
something that is a temporary condition and actually not harmful for the
offline.
Please note that unmovable pages should be already excluded during
start_isolate_page_range. We could argue that has_unmovable_pages is
racy and MIGRATE_MOVABLE check doesn't provide any hard guarantee either
but kernel zones (aka < ZONE_MOVABLE) will very likely detect unmovable
pages in most cases and movable zone shouldn't contain unmovable pages
at all. Some of those pages might be pinned but not for ever because
that would be a bug on its own. In any case the context is still
interruptible and so the userspace can easily bail out when the
operation takes too long. This is certainly better behavior than a
hardcoded retry loop which is racy.
Fix this by removing the max retry count and only rely on the timeout
resp. interruption by a signal from the userspace. Also retry rather
than fail when check_pages_isolated sees some !free pages because those
could be a result of the race as well.
Michal Hocko [Thu, 16 Nov 2017 01:33:30 +0000 (17:33 -0800)]
mm, page_alloc: fail has_unmovable_pages when seeing reserved pages
Reserved pages should be completely ignored by the core mm because they
have a special meaning for their owners. has_unmovable_pages doesn't
check those so we rely on other tests (reference count, or PageLRU) to
fail on such pages. Althought this happens to work it is safer to
simply check for those explicitly and do not rely on the owner of the
page to abuse those fields for special purposes.
Please note that this is more of a further fortification of the code
rahter than a fix of an existing issue.
Michal Hocko [Thu, 16 Nov 2017 01:33:26 +0000 (17:33 -0800)]
mm: distinguish CMA and MOVABLE isolation in has_unmovable_pages()
Joonsoo has noticed that "mm: drop migrate type checks from
has_unmovable_pages" would break CMA allocator because it relies on
has_unmovable_pages returning false even for CMA pageblocks which in
fact don't have to be movable:
This is a result of the code sharing between CMA and memory hotplug
while each one has a different idea of what has_unmovable_pages should
return. This is unfortunate but fixing it properly would require a lot
of code duplication.
Fix the issue by introducing the requested migrate type argument and
special case MIGRATE_CMA case where CMA page blocks are handled
properly. This will work for memory hotplug because it requires
MIGRATE_MOVABLE.
Link: http://lkml.kernel.org/r/20171019122118.y6cndierwl2vnguj@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Tested-by: Stefan Wahren <stefan.wahren@i2se.com> Tested-by: Ran Wang <ran.wang_1@nxp.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Igor Mammedov <imammedo@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current implementation will fail the operation after several failed
page migration attempts but we shouldn't even attempt to migrate that
memory and fail right away because this memory is clearly not
migrateable. This will become a real problem when we drop the retry
loop counter resp. timeout.
The real problem is in has_unmovable_pages in fact. We should fail if
there are any non migrateable pages in the area. In orther to guarantee
that remove the migrate type checks because MIGRATE_MOVABLE is not
guaranteed to contain only migrateable pages. It is merely a heuristic.
Similarly MIGRATE_CMA does guarantee that the page allocator doesn't
allocate any non-migrateable pages from the block but CMA allocations
themselves are unlikely to migrateable. Therefore remove both checks.
[akpm@linux-foundation.org: remove unused local `mt'] Link: http://lkml.kernel.org/r/20171013120013.698-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Michael Ellerman <mpe@ellerman.id.au> Tested-by: Michael Ellerman <mpe@ellerman.id.au> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Tony Lindgren <tony@atomide.com> Tested-by: Ran Wang <ran.wang_1@nxp.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tahsin Erdogan [Thu, 16 Nov 2017 01:33:19 +0000 (17:33 -0800)]
mm/page-writeback.c: remove unused parameter from balance_dirty_pages()
"mapping" parameter to balance_dirty_pages() is not used anymore.
Fixes: dfb8ae567835 ("writeback: let balance_dirty_pages() work on the matching cgroup bdi_writeback") Link: http://lkml.kernel.org/r/20170927221311.23263-1-tahsin@google.com Signed-off-by: Tahsin Erdogan <tahsin@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Huang Ying [Thu, 16 Nov 2017 01:33:15 +0000 (17:33 -0800)]
mm, swap: fix false error message in __swp_swapcount()
When a page fault occurs for a swap entry, the physical swap readahead
(not the VMA base swap readahead) may readahead several swap entries
after the fault swap entry. The readahead algorithm calculates some of
the swap entries to readahead via increasing the offset of the fault
swap entry without checking whether they are beyond the end of the swap
device and it relys on the __swp_swapcount() and swapcache_prepare() to
check it. Although __swp_swapcount() checks for the swap entry passed
in, it will complain with the error message as follow for the expected
invalid swap entry. This may make the end users confused.
To fix the false error message, the swap entry checking is added in
swapin_readahead() to avoid to pass the out-of-bound swap entries and
the swap entry reserved for the swap header to __swp_swapcount() and
swapcache_prepare().
Link: http://lkml.kernel.org/r/20171102054225.22897-1-ying.huang@intel.com Fixes: e8c26ab60598 ("mm/swap: skip readahead for unreferenced swap slots") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reported-by: Christian Kujau <lists@nerdbynature.de> Acked-by: Minchan Kim <minchan@kernel.org> Suggested-by: Minchan Kim <minchan@kernel.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> [4.11+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Thu, 16 Nov 2017 01:33:11 +0000 (17:33 -0800)]
mm: swap: SWP_SYNCHRONOUS_IO: skip swapcache only if swapped page has no other reference
When SWP_SYNCHRONOUS_IO swapped-in pages are shared by several
processes, it can cause unnecessary memory wastage by skipping swap
cache. Because, with swapin fault by read, they could share a page if
the page were in swap cache. Thus, it avoids allocating same content
new pages.
This patch makes the swapcache skipping work only if the swap pte is
non-sharable.
[akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/1507620825-5537-1-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Huang Ying <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Thu, 16 Nov 2017 01:33:07 +0000 (17:33 -0800)]
mm, swap: skip swapcache for swapin of synchronous device
With fast swap storage, the platforms want to use swap more aggressively
and swap-in is crucial to application latency.
The rw_page() based synchronous devices like zram, pmem and btt are such
fast storage. When I profile swapin performance with zram lz4
decompress test, S/W overhead is more than 70%. Maybe, it would be
bigger in nvdimm.
This patch aims to reduce swap-in latency by skipping swapcache if the
swap device is synchronous device like rw_page based device. It
enhances 45% my swapin test(5G sequential swapin, no readahead, from
2.41sec to 1.64sec).
Link: http://lkml.kernel.org/r/1505886205-9671-5-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Huang Ying <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Thu, 16 Nov 2017 01:33:04 +0000 (17:33 -0800)]
mm, swap: introduce SWP_SYNCHRONOUS_IO
If rw-page based fast storage is used for swap devices, we need to
detect it to enhance swap IO operations. This patch is preparation for
optimizing of swap-in operation with next patch.
Link: http://lkml.kernel.org/r/1505886205-9671-4-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Huang Ying <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
someday we will remove rw_page(). If so, we need something to detect
such super-fast storage on which synchronous IO operations like the
current rw_page are always a win.
Introduces BDI_CAP_SYNCHRONOUS_IO to indicate such devices. With it, we
could use various optimization techniques.
Link: http://lkml.kernel.org/r/1505886205-9671-3-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Huang Ying <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Thu, 16 Nov 2017 01:32:56 +0000 (17:32 -0800)]
zram: set BDI_CAP_STABLE_WRITES once
With fast swap storage, the platform wants to use swap more aggressively
and swap-in is crucial to application latency.
The rw_page() based synchronous devices like zram, pmem and btt are such
fast storage. When I profile swapin performance with zram lz4
decompress test, S/W overhead is more than 70%. Maybe, it would be
bigger in nvdimm.
This patchset reduces swap-in latency by skipping swapcache if the swap
device is a synchronous device like a rw_page() based device.
It enhances by 45% my swapin test (5G sequential swapin, no readahead)
from 2.41sec to 1.64sec.
This patch (of 4):
Commit 19b7ccf8651d ("block: get rid of blk_integrity_revalidate()")
fixed a weird thing (i.e., reset BDI_CAP_STABLE_WRITES flag
unconditionally whenever revalidat_disk is called) so zram doesn't need
to reset the flag any more when revalidating the bdev. Instead, set the
flag just once when the zram device is created.
It shouldn't change any behavior.
Link: http://lkml.kernel.org/r/1505886205-9671-2-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Hugh Dickins <hughd@google.com> Cc: Huang Ying <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
include/linux/slab.h: add kmalloc_array_node() and kcalloc_node()
Patch series "Add kmalloc_array_node() and kcalloc_node()".
Our current memeory allocation routines suffer form an API imbalance,
for one we have kmalloc_array() and kcalloc() which check for overflows
in size multiplication and we have kmalloc_node() and kzalloc_node()
which allow for memory allocation on a certain NUMA node but don't check
for eventual overflows.
This patch (of 6):
We have kmalloc_array() and kcalloc() wrappers on top of kmalloc() which
ensure us overflow free multiplication for the size of a memory
allocation but these implementations are not NUMA-aware.
Likewise we have kmalloc_node() which is a NUMA-aware version of
kmalloc() but the implementation is not aware of any possible overflows
in eventual size calculations.
Introduce a combination of the two above cases to have a NUMA-node aware
version of kmalloc_array() and kcalloc().
Link: http://lkml.kernel.org/r/20170927082038.3782-2-jthumshirn@suse.de Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: David Rientjes <rientjes@google.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Doug Ledford <dledford@redhat.com> Cc: Hal Rosenstock <hal.rosenstock@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mike Marciniszyn <infinipath@intel.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com> Cc: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miles Chen [Thu, 16 Nov 2017 01:32:25 +0000 (17:32 -0800)]
slub: fix sysfs duplicate filename creation when slub_debug=O
When slub_debug=O is set. It is possible to clear debug flags for an
"unmergeable" slab cache in kmem_cache_open(). It makes the "unmergeable"
cache became "mergeable" in sysfs_slab_add().
These caches will generate their "unique IDs" by create_unique_id(), but
it is possible to create identical unique IDs. In my experiment,
sgpool-128, names_cache, biovec-256 generate the same ID ":Ft-0004096" and
the kernel reports "sysfs: cannot create duplicate filename
'/kernel/slab/:Ft-0004096'".
To repeat my experiment, set disable_higher_order_debug=1,
CONFIG_SLUB_DEBUG_ON=y in kernel-4.14.
Fix this issue by setting unmergeable=1 if slub_debug=O and the the
default slub_debug contains any no-merge flags.
call path:
kmem_cache_create()
__kmem_cache_alias() -> we set SLAB_NEVER_MERGE flags here
create_cache()
__kmem_cache_create()
kmem_cache_open() -> clear DEBUG_METADATA_FLAGS
sysfs_slab_add() -> the slab cache is mergeable now
sysfs: cannot create duplicate filename '/kernel/slab/:Ft-0004096'
------------[ cut here ]------------
WARNING: CPU: 0 PID: 1 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x60/0x7c
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Tainted: G W 4.14.0-rc7ajb-00131-gd4c2e9f-dirty #123
Hardware name: linux,dummy-virt (DT)
task: ffffffc07d4e0080 task.stack: ffffff8008008000
PC is at sysfs_warn_dup+0x60/0x7c
LR is at sysfs_warn_dup+0x60/0x7c
pc : lr : pstate: 60000145
Call trace:
sysfs_warn_dup+0x60/0x7c
sysfs_create_dir_ns+0x98/0xa0
kobject_add_internal+0xa0/0x294
kobject_init_and_add+0x90/0xb4
sysfs_slab_add+0x90/0x200
__kmem_cache_create+0x26c/0x438
kmem_cache_create+0x164/0x1f4
sg_pool_init+0x60/0x100
do_one_initcall+0x38/0x12c
kernel_init_freeable+0x138/0x1d4
kernel_init+0x10/0xfc
ret_from_fork+0x10/0x18
Link: http://lkml.kernel.org/r/1510365805-5155-1-git-send-email-miles.chen@mediatek.com Signed-off-by: Miles Chen <miles.chen@mediatek.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Thu, 16 Nov 2017 01:32:14 +0000 (17:32 -0800)]
mm/slab.c: only set __GFP_RECLAIMABLE once
SLAB_RECLAIM_ACCOUNT is a permanent attribute of a slab cache. Set
__GFP_RECLAIMABLE as part of its ->allocflags rather than check the
cachep flag on every page allocation.
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1710171527560.140898@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miles Chen [Thu, 16 Nov 2017 01:32:10 +0000 (17:32 -0800)]
mm/slob.c: remove an unnecessary check for __GFP_ZERO
Current flow guarantees a valid pointer when handling the __GFP_ZERO
case. So remove the unnecessary NULL pointer check.
Link: http://lkml.kernel.org/r/1507203141-11959-1-git-send-email-miles.chen@mediatek.com Signed-off-by: Miles Chen <miles.chen@mediatek.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yang Shi [Thu, 16 Nov 2017 01:32:07 +0000 (17:32 -0800)]
mm: oom: show unreclaimable slab info when unreclaimable slabs > user memory
The kernel may panic when an oom happens without killable process
sometimes it is caused by huge unreclaimable slabs used by kernel.
Although kdump could help debug such problem, however, kdump is not
available on all architectures and it might be malfunction sometime.
And, since kernel already panic it is worthy capturing such information
in dmesg to aid touble shooting.
Print out unreclaimable slab info (used size and total size) which
actual memory usage is not zero (num_objs * size != 0) when
unreclaimable slabs amount is greater than total user memory (LRU
pages).
Yang Shi [Thu, 16 Nov 2017 01:32:03 +0000 (17:32 -0800)]
mm: slabinfo: remove CONFIG_SLABINFO
According to discussion with Christoph
(https://marc.info/?l=linux-kernel&m=150695909709711&w=2), it sounds like
it is pointless to keep CONFIG_SLABINFO around.
This patch removes the CONFIG_SLABINFO config option, but /proc/slabinfo
is still available.
Yang Shi [Thu, 16 Nov 2017 01:31:59 +0000 (17:31 -0800)]
tools: slabinfo: add "-U" option to show unreclaimable slabs only
Patch series "oom: capture unreclaimable slab info in oom message", v10.
Recently we ran into a oom issue, kernel panic due to no killable
process. The dmesg shows huge unreclaimable slabs used almost 100%
memory, but kdump doesn't capture vmcore due to some reason.
So, it may sound better to capture unreclaimable slab info in oom
message when kernel panic to aid trouble shooting and cover the corner
case. Since kernel already panic, so capturing more information sounds
worthy and doesn't bother normal oom killer.
With the patchset, tools/vm/slabinfo has a new option, "-U", to show
unreclaimable slab only.
And, oom will print all non zero (num_objs * size != 0) unreclaimable
slabs in oom killer message.
This patch (of 3):
Add "-U" option to show unreclaimable slabs only.
"-U" and "-S" together can tell us what unreclaimable slabs use the most
memory to help debug huge unreclaimable slabs issue.
Link: http://lkml.kernel.org/r/1507152550-46205-2-git-send-email-yang.s@alibaba-inc.com Signed-off-by: Yang Shi <yang.s@alibaba-inc.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Changwei Ge [Thu, 16 Nov 2017 01:31:52 +0000 (17:31 -0800)]
ocfs2/dlm: get mle inuse only when it is initialized
When dlm_add_migration_mle returns -EEXIST, previously input mle will
not be initialized. So we can't use its associated dlm object. And we
truly don't need this mle for already launched migration progress, since
oldmle has taken this role.
Link: http://lkml.kernel.org/r/63ADC13FD55D6546B7DECE290D39E373CED7AA61@H3CMLB14-EX.srv.huawei-3com.com Signed-off-by: Changwei Ge <ge.changwei@h3c.com> Reviewed-by: Joseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
alex chen [Thu, 16 Nov 2017 01:31:48 +0000 (17:31 -0800)]
ocfs2: subsystem.su_mutex is required while accessing the item->ci_parent
The subsystem.su_mutex is required while accessing the item->ci_parent,
otherwise, NULL pointer dereference to the item->ci_parent will be
triggered in the following situation:
add node delete node
sys_write
vfs_write
configfs_write_file
o2nm_node_store
o2nm_node_local_write
do_rmdir
vfs_rmdir
configfs_rmdir
mutex_lock(&subsys->su_mutex);
unlink_obj
item->ci_group = NULL;
item->ci_parent = NULL;
to_o2nm_cluster_from_node
node->nd_item.ci_parent->ci_parent
BUG since of NULL pointer dereference to nd_item.ci_parent
Moreover, the o2nm_cluster also should be protected by the
subsystem.su_mutex.
[alex.chen@huawei.com: v2] Link: http://lkml.kernel.org/r/59EEAA69.9080703@huawei.com Link: http://lkml.kernel.org/r/59E9B36A.10700@huawei.com Signed-off-by: Alex Chen <alex.chen@huawei.com> Reviewed-by: Jun Piao <piaojun@huawei.com> Reviewed-by: Joseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
alex chen [Thu, 16 Nov 2017 01:31:44 +0000 (17:31 -0800)]
ocfs2: ip_alloc_sem should be taken in ocfs2_get_block()
ip_alloc_sem should be taken in ocfs2_get_block() when reading file in
DIRECT mode to prevent concurrent access to extent tree with
ocfs2_dio_end_io_write(), which may cause BUGON in the following
situation:
read file 'A' end_io of writing file 'A'
vfs_read
__vfs_read
ocfs2_file_read_iter
generic_file_read_iter
ocfs2_direct_IO
__blockdev_direct_IO
do_blockdev_direct_IO
do_direct_IO
get_more_blocks
ocfs2_get_block
ocfs2_extent_map_get_blocks
ocfs2_get_clusters
ocfs2_get_clusters_nocache()
ocfs2_search_extent_list
return the index of record which
contains the v_cluster, that is
v_cluster > rec[i]->e_cpos.
ocfs2_dio_end_io
ocfs2_dio_end_io_write
down_write(&oi->ip_alloc_sem);
ocfs2_mark_extent_written
ocfs2_change_extent_flag
ocfs2_split_extent
...
--> modify the rec[i]->e_cpos, resulting
in v_cluster < rec[i]->e_cpos.
BUG_ON(v_cluster < le32_to_cpu(rec->e_cpos))
[alex.chen@huawei.com: v3] Link: http://lkml.kernel.org/r/59EF3614.6050008@huawei.com Link: http://lkml.kernel.org/r/59EF3614.6050008@huawei.com Fixes: c15471f79506 ("ocfs2: fix sparse file & data ordering issue in direct io") Signed-off-by: Alex Chen <alex.chen@huawei.com> Reviewed-by: Jun Piao <piaojun@huawei.com> Reviewed-by: Joseph Qi <jiangqi903@gmail.com> Reviewed-by: Gang He <ghe@suse.com> Acked-by: Changwei Ge <ge.changwei@h3c.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
alex chen [Thu, 16 Nov 2017 01:31:40 +0000 (17:31 -0800)]
ocfs2: should wait dio before inode lock in ocfs2_setattr()
we should wait dio requests to finish before inode lock in
ocfs2_setattr(), otherwise the following deadlock will happen:
process 1 process 2 process 3
truncate file 'A' end_io of writing file 'A' receiving the bast messages
ocfs2_setattr
ocfs2_inode_lock_tracker
ocfs2_inode_lock_full
inode_dio_wait
__inode_dio_wait
-->waiting for all dio
requests finish
dlm_proxy_ast_handler
dlm_do_local_bast
ocfs2_blocking_ast
ocfs2_generic_handle_bast
set OCFS2_LOCK_BLOCKED flag
dio_end_io
dio_bio_end_aio
dio_complete
ocfs2_dio_end_io
ocfs2_dio_end_io_write
ocfs2_inode_lock
__ocfs2_cluster_lock
ocfs2_wait_for_mask
-->waiting for OCFS2_LOCK_BLOCKED
flag to be cleared, that is waiting
for 'process 1' unlocking the inode lock
inode_dio_end
-->here dec the i_dio_count, but will never
be called, so a deadlock happened.
Link: http://lkml.kernel.org/r/59F81636.70508@huawei.com Signed-off-by: Alex Chen <alex.chen@huawei.com> Reviewed-by: Jun Piao <piaojun@huawei.com> Reviewed-by: Joseph Qi <jiangqi903@gmail.com> Acked-by: Changwei Ge <ge.changwei@h3c.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
piaojun [Thu, 16 Nov 2017 01:31:37 +0000 (17:31 -0800)]
ocfs2: clean up some unused function declarations
Link: http://lkml.kernel.org/r/59C5D7D6.9050106@huawei.com Signed-off-by: Jun Piao <piaojun@huawei.com> Reviewed-by: Alex Chen <alex.chen@huawei.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Changwei Ge [Thu, 16 Nov 2017 01:31:33 +0000 (17:31 -0800)]
ocfs2: fix cluster hang after a node dies
When a node dies, other live nodes have to choose a new master for an
existed lock resource mastered by the dead node.
As for ocfs2/dlm implementation, this is done by function -
dlm_move_lockres_to_recovery_list which marks those lock rsources as
DLM_LOCK_RES_RECOVERING and manages them via a list from which DLM
changes lock resource's master later.
So without invoking dlm_move_lockres_to_recovery_list, no master will be
choosed after dlm recovery accomplishment since no lock resource can be
found through ::resource list.
What's worse is that if DLM_LOCK_RES_RECOVERING is not marked for lock
resources mastered a dead node, it will break up synchronization among
nodes.
So invoke dlm_move_lockres_to_recovery_list again.
Fixs: 'commit ee8f7fcbe638 ("ocfs2/dlm: continue to purge recovery lockres when recovery master goes down")' Link: http://lkml.kernel.org/r/63ADC13FD55D6546B7DECE290D39E373CED6E0F9@H3CMLB14-EX.srv.huawei-3com.com Signed-off-by: Changwei Ge <ge.changwei@h3c.com> Reported-by: Vitaly Mayatskih <v.mayatskih@gmail.com> Tested-by: Vitaly Mayatskikh <v.mayatskih@gmail.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
piaojun [Thu, 16 Nov 2017 01:31:29 +0000 (17:31 -0800)]
ocfs2: cleanup unused func declaration and assignment
Link: http://lkml.kernel.org/r/59E064BB.8000005@huawei.com Signed-off-by: Jun Piao <piaojun@huawei.com> Reviewed-by: Joseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
piaojun [Thu, 16 Nov 2017 01:31:25 +0000 (17:31 -0800)]
ocfs2: no need flush workqueue before destroying it
destroy_workqueue() will do flushing work for us.
Link: http://lkml.kernel.org/r/59E06476.3090502@huawei.com Signed-off-by: Jun Piao <piaojun@huawei.com> Reviewed-by: Joseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>