Minchan Kim [Wed, 5 May 2021 01:37:31 +0000 (18:37 -0700)]
mm: cma: add the CMA instance name to cma trace events
There were missing places to add cma instance name. To identify each CMA
instance, let's add the name for every cma trace. This patch also changes
the existing cma_trace_alloc to cma_trace_finish since we have
cma_alloc_start[1].
Minchan Kim [Wed, 5 May 2021 01:37:28 +0000 (18:37 -0700)]
mm: cma: support sysfs
Since CMA is getting used more widely, it's more important to keep
monitoring CMA statistics for system health since it's directly related to
user experience.
This patch introduces sysfs statistics for CMA, in order to provide some
basic monitoring of the CMA allocator.
* the number of CMA page successful allocations
* the number of CMA page allocation failures
These two values allow the user to calcuate the allocation
failure rate for each CMA area.
The cma_stat was intentionally allocated by dynamic allocation
to harmonize with kobject lifetime management.
https://lore.kernel.org/linux-mm/YCOAmXqt6dZkCQYs@kroah.com/
Link: https://lkml.kernel.org/r/20210324230759.2213957-1-minchan@kernel.org Link: https://lore.kernel.org/linux-mm/20210316100433.17665-1-colin.king@canonical.com/ Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Colin Ian King <colin.king@canonical.com> Tested-by: Dmitry Osipenko <digetx@gmail.com> Reviewed-by: Dmitry Osipenko <digetx@gmail.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Tested-by: Anders Roxell <anders.roxell@linaro.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: John Dias <joaodias@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Colin Ian King <colin.king@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Baolin Wang [Wed, 5 May 2021 01:37:22 +0000 (18:37 -0700)]
mm: cma: use pr_err_ratelimited for CMA warning
If we did not reserve extra CMA memory, the log buffer can be easily
filled up by CMA failure warning when the devices calling
dmam_alloc_coherent() to alloc DMA memory. Thus we can use
pr_err_ratelimited() instead to reduce the duplicate CMA warning.
Minchan Kim [Wed, 5 May 2021 01:37:19 +0000 (18:37 -0700)]
mm: vmstat: add cma statistics
Since CMA is used more widely, it's worth to have CMA allocation
statistics into vmstat. With it, we could know how agressively system
uses cma allocation and how often it fails.
Link: https://lkml.kernel.org/r/20210302183346.3707237-1-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: John Dias <joaodias@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit c77c5cbafe54 ("mm: migrate: skip shared exec THP for NUMA
balancing"), the NUMA balancing would skip shared exec transhuge page.
But this enhancement is not suitable for transhuge page. Because it's
required that page_mapcount() must be 1 due to no migration pte dance is
done here. On the other hand, the shared exec transhuge page will leave
the migrate_misplaced_page() with pte entry untouched and page locked.
Thus pagefault for NUMA will be triggered again and deadlock occurs when
we start waiting for the page lock held by ourselves.
Yang Shi said:
"Thanks for catching this. By relooking the code I think the other
important reason for removing this is
migrate_misplaced_transhuge_page() actually can't see shared exec
file THP at all since page_lock_anon_vma_read() is called before
and if page is not anonymous page it will just restore the PMD
without migrating anything.
The pages for private mapped file vma may be anonymous pages due to
COW but they can't be THP so it won't trigger THP numa fault at all. I
think this is why no bug was reported. I overlooked this in the first
place."
Link: https://lkml.kernel.org/r/20210325131524.48181-6-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miaohe Lin [Wed, 5 May 2021 01:37:13 +0000 (18:37 -0700)]
mm/migrate.c: use helper migrate_vma_collect_skip() in migrate_vma_collect_hole()
It's more recommended to use helper function migrate_vma_collect_skip() to
skip the unexpected case and it also helps remove some duplicated codes.
Move migrate_vma_collect_skip() above migrate_vma_collect_hole() to avoid
compiler warning.
Link: https://lkml.kernel.org/r/20210325131524.48181-5-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miaohe Lin [Wed, 5 May 2021 01:37:10 +0000 (18:37 -0700)]
mm/migrate.c: fix potential indeterminate pte entry in migrate_vma_insert_page()
If the zone device page does not belong to un-addressable device memory,
the variable entry will be uninitialized and lead to indeterminate pte
entry ultimately. Fix this unexpected case and warn about it.
Link: https://lkml.kernel.org/r/20210325131524.48181-4-linmiaohe@huawei.com Fixes: df6ad69838fc ("mm/device-public-memory: device memory cache coherent with CPU") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miaohe Lin [Wed, 5 May 2021 01:37:07 +0000 (18:37 -0700)]
mm/migrate.c: remove unnecessary rc != MIGRATEPAGE_SUCCESS check in 'else' case
It's guaranteed that in the 'else' case of the rc == MIGRATEPAGE_SUCCESS
check, rc does not equal to MIGRATEPAGE_SUCCESS. Remove this unnecessary
check.
Link: https://lkml.kernel.org/r/20210325131524.48181-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miaohe Lin [Wed, 5 May 2021 01:37:04 +0000 (18:37 -0700)]
mm/migrate.c: make putback_movable_page() static
Patch series "Cleanup and fixup for mm/migrate.c", v3.
This series contains cleanups to remove unnecessary VM_BUG_ON_PAGE and rc
!= MIGRATEPAGE_SUCCESS check. Also use helper function to remove some
duplicated codes. What's more, this fixes potential deadlock in NUMA
balancing shared exec THP case and so on. More details can be found in
the respective changelogs.
This patch (of 5):
The putback_movable_page() is just called by putback_movable_pages() and
we know the page is locked and both PageMovable() and PageIsolated() is
checked right before calling putback_movable_page(). So we make it static
and remove all the 3 VM_BUG_ON_PAGE().
Minchan Kim [Wed, 5 May 2021 01:37:00 +0000 (18:37 -0700)]
mm: fs: invalidate BH LRU during page migration
Pages containing buffer_heads that are in one of the per-CPU buffer_head
LRU caches will be pinned and thus cannot be migrated. This can prevent
CMA allocations from succeeding, which are often used on platforms with
co-processors (such as a DSP) that can only use physically contiguous
memory. It can also prevent memory hot-unplugging from succeeding,
which involves migrating at least MIN_MEMORY_BLOCK_SIZE bytes of memory,
which ranges from 8 MiB to 1 GiB based on the architecture in use.
Correspondingly, invalidate the BH LRU caches before a migration starts
and stop any buffer_head from being cached in the LRU caches, until
migration has finished.
Link: https://lkml.kernel.org/r/20210319175127.886124-3-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Chris Goldsworthy <cgoldswo@codeaurora.org> Reported-by: Laura Abbott <labbott@kernel.org> Tested-by: Oliver Sang <oliver.sang@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: John Dias <joaodias@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Wed, 5 May 2021 01:36:57 +0000 (18:36 -0700)]
mm: replace migrate_[prep|finish] with lru_cache_[disable|enable]
Currently, migrate_[prep|finish] is merely a wrapper of
lru_cache_[disable|enable]. There is not much to gain from having
additional abstraction.
Use lru_cache_[disable|enable] instead of migrate_[prep|finish], which
would be more descriptive.
note: migrate_prep_local in compaction.c changed into lru_add_drain to
avoid CPU schedule cost with involving many other CPUs to keep old
behavior.
Link: https://lkml.kernel.org/r/20210319175127.886124-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Chris Goldsworthy <cgoldswo@codeaurora.org> Cc: John Dias <joaodias@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oliver Sang <oliver.sang@intel.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Wed, 5 May 2021 01:36:54 +0000 (18:36 -0700)]
mm: disable LRU pagevec during the migration temporarily
LRU pagevec holds refcount of pages until the pagevec are drained. It
could prevent migration since the refcount of the page is greater than
the expection in migration logic. To mitigate the issue, callers of
migrate_pages drains LRU pagevec via migrate_prep or lru_add_drain_all
before migrate_pages call.
However, it's not enough because pages coming into pagevec after the
draining call still could stay at the pagevec so it could keep
preventing page migration. Since some callers of migrate_pages have
retrial logic with LRU draining, the page would migrate at next trail
but it is still fragile in that it doesn't close the fundamental race
between upcoming LRU pages into pagvec and migration so the migration
failure could cause contiguous memory allocation failure in the end.
To close the race, this patch disables lru caches(i.e, pagevec) during
ongoing migration until migrate is done.
Since it's really hard to reproduce, I measured how many times
migrate_pages retried with force mode(it is about a fallback to a sync
migration) with below debug code.
int migrate_pages(struct list_head *from, new_page_t get_new_page,
..
..
The test was repeating android apps launching with cma allocation in
background every five seconds. Total cma allocation count was about 500
during the testing. With this patch, the dump_page count was reduced
from 400 to 30.
The new interface is also useful for memory hotplug which currently
drains lru pcp caches after each migration failure. This is rather
suboptimal as it has to disrupt others running during the operation.
With the new interface the operation happens only once. This is also in
line with pcp allocator cache which are disabled for the offlining as
well.
Link: https://lkml.kernel.org/r/20210319175127.886124-1-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Chris Goldsworthy <cgoldswo@codeaurora.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: John Dias <joaodias@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Oliver Sang <oliver.sang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: compaction: update the COMPACT[STALL|FAIL] events properly
By definition, COMPACT[STALL|FAIL] events needs to be counted when there
is 'At least in one zone compaction wasn't deferred or skipped from the
direct compaction'. And when compaction is skipped or deferred,
COMPACT_SKIPPED will be returned but it will still go and update these
compaction events which is wrong in the sense that COMPACT[STALL|FAIL]
is counted without even trying the compaction.
Correct this by skipping the counting of these events when
COMPACT_SKIPPED is returned for compaction. This indirectly also avoid
the unnecessary try into the get_page_from_freelist() when compaction is
not even tried.
There is a corner case where compaction is skipped but still count
COMPACTSTALL event, which is that IRQ came and freed the page and the
same is captured in capture_control.
The sysctl_compact_memory is mostly unused in mm/compaction.c It just
acts as a place holder for sysctl to store .data.
But the .data itself is not needed here.
So we can get ride of this variable completely and make .data as NULL.
This will also eliminate the extern declaration from header file. No
functionality is broken or changed this way.
Yang Shi [Wed, 5 May 2021 01:36:45 +0000 (18:36 -0700)]
mm: vmscan: shrink deferred objects proportional to priority
The number of deferred objects might get windup to an absurd number, and
it results in clamp of slab objects. It is undesirable for sustaining
workingset.
So shrink deferred objects proportional to priority and cap nr_deferred
to twice of cache items.
The idea is borrowed from Dave Chinner's patch:
https://lore.kernel.org/linux-xfs/20191031234618.15403-13-david@fromorbit.com/
Tested with kernel build and vfs metadata heavy workload in our
production environment, no regression is spotted so far.
Link: https://lkml.kernel.org/r/20210311190845.9708-14-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yang Shi [Wed, 5 May 2021 01:36:39 +0000 (18:36 -0700)]
mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers
Now nr_deferred is available on per memcg level for memcg aware
shrinkers, so don't need allocate shrinker->nr_deferred for such
shrinkers anymore.
The prealloc_memcg_shrinker() would return -ENOSYS if !CONFIG_MEMCG or
memcg is disabled by kernel command line, then shrinker's
SHRINKER_MEMCG_AWARE flag would be cleared. This makes the
implementation of this patch simpler.
Link: https://lkml.kernel.org/r/20210311190845.9708-12-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yang Shi [Wed, 5 May 2021 01:36:33 +0000 (18:36 -0700)]
mm: vmscan: add per memcg shrinker nr_deferred
Currently the number of deferred objects are per shrinker, but some
slabs, for example, vfs inode/dentry cache are per memcg, this would
result in poor isolation among memcgs.
The deferred objects typically are generated by __GFP_NOFS allocations,
one memcg with excessive __GFP_NOFS allocations may blow up deferred
objects, then other innocent memcgs may suffer from over shrink,
excessive reclaim latency, etc.
For example, two workloads run in memcgA and memcgB respectively,
workload in B is vfs heavy workload. Workload in A generates excessive
deferred objects, then B's vfs cache might be hit heavily (drop half of
caches) by B's limit reclaim or global reclaim.
We observed this hit in our production environment which was running vfs
heavy workload shown as the below tracing log:
The vfs cache and page cache ratio was 10:1 on this machine, and half of
caches were dropped. This also resulted in significant amount of page
caches were dropped due to inodes eviction.
Make nr_deferred per memcg for memcg aware shrinkers would solve the
unfairness and bring better isolation.
The following patch will add nr_deferred to parent memcg when memcg
offline. To preserve nr_deferred when reparenting memcgs to root, root
memcg needs shrinker_info allocated too.
When memcg is not enabled (!CONFIG_MEMCG or memcg disabled), the
shrinker's nr_deferred would be used. And non memcg aware shrinkers use
shrinker's nr_deferred all the time.
Link: https://lkml.kernel.org/r/20210311190845.9708-10-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yang Shi [Wed, 5 May 2021 01:36:29 +0000 (18:36 -0700)]
mm: vmscan: use a new flag to indicate shrinker is registered
Currently registered shrinker is indicated by non-NULL
shrinker->nr_deferred. This approach is fine with nr_deferred at the
shrinker level, but the following patches will move MEMCG_AWARE
shrinkers' nr_deferred to memcg level, so their shrinker->nr_deferred
would always be NULL. This would prevent the shrinkers from
unregistering correctly.
Remove SHRINKER_REGISTERING since we could check if shrinker is
registered successfully by the new flag.
Link: https://lkml.kernel.org/r/20210311190845.9708-9-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yang Shi [Wed, 5 May 2021 01:36:26 +0000 (18:36 -0700)]
mm: vmscan: add shrinker_info_protected() helper
The shrinker_info is dereferenced in a couple of places via
rcu_dereference_protected with different calling conventions, for
example, using mem_cgroup_nodeinfo helper or dereferencing
memcg->nodeinfo[nid]->shrinker_info. And the later patch will add more
dereference places.
So extract the dereference into a helper to make the code more readable.
No functional change.
[akpm@linux-foundation.org: retain rcu_dereference_protected() in free_shrinker_info(), per Hugh]
Link: https://lkml.kernel.org/r/20210311190845.9708-8-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yang Shi [Wed, 5 May 2021 01:36:23 +0000 (18:36 -0700)]
mm: memcontrol: rename shrinker_map to shrinker_info
The following patch is going to add nr_deferred into shrinker_map, the
change will make shrinker_map not only include map anymore, so rename it
to "memcg_shrinker_info". And this should make the patch adding
nr_deferred cleaner and readable and make review easier. Also remove the
"memcg_" prefix.
Link: https://lkml.kernel.org/r/20210311190845.9708-7-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yang Shi [Wed, 5 May 2021 01:36:17 +0000 (18:36 -0700)]
mm: vmscan: remove memcg_shrinker_map_size
Both memcg_shrinker_map_size and shrinker_nr_max is maintained, but
actually the map size can be calculated via shrinker_nr_max, so it seems
unnecessary to keep both. Remove memcg_shrinker_map_size since
shrinker_nr_max is also used by iterating the bit map.
Link: https://lkml.kernel.org/r/20210311190845.9708-5-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yang Shi [Wed, 5 May 2021 01:36:14 +0000 (18:36 -0700)]
mm: vmscan: use shrinker_rwsem to protect shrinker_maps allocation
Since memcg_shrinker_map_size just can be changed under holding
shrinker_rwsem exclusively, the read side can be protected by holding read
lock, so it sounds superfluous to have a dedicated mutex.
Kirill Tkhai suggested use write lock since:
* We want the assignment to shrinker_maps is visible for shrink_slab_memcg().
* The rcu_dereference_protected() dereferrencing in shrink_slab_memcg(), but
in case of we use READ lock in alloc_shrinker_maps(), the dereferrencing
is not actually protected.
* READ lock makes alloc_shrinker_info() racy against memory allocation fail.
alloc_shrinker_info()->free_shrinker_info() may free memory right after
shrink_slab_memcg() dereferenced it. You may say
shrink_slab_memcg()->mem_cgroup_online() protects us from it? Yes, sure,
but this is not the thing we want to remember in the future, since this
spreads modularity.
And a test with heavy paging workload didn't show write lock makes things worse.
Link: https://lkml.kernel.org/r/20210311190845.9708-4-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The shrinker map management is not purely memcg specific, it is at the
intersection between memory cgroup and shrinkers. It's allocation and
assignment of a structure, and the only memcg bit is the map is being
stored in a memcg structure. So move the shrinker_maps handling code
into vmscan.c for tighter integration with shrinker code, and remove the
"memcg_" prefix. There is no functional change.
Link: https://lkml.kernel.org/r/20210311190845.9708-3-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yang Shi [Wed, 5 May 2021 01:36:08 +0000 (18:36 -0700)]
mm: vmscan: use nid from shrink_control for tracepoint
Patch series "Make shrinker's nr_deferred memcg aware", v10.
Recently huge amount one-off slab drop was seen on some vfs metadata
heavy workloads, it turned out there were huge amount accumulated
nr_deferred objects seen by the shrinker.
On our production machine, I saw absurd number of nr_deferred shown as
the below tracing result:
There are 2.5 trillion deferred objects on one node, assuming all of them
are dentry (192 bytes per object), so the total size of deferred on one
node is ~480TB. It is definitely ridiculous.
I managed to reproduce this problem with kernel build workload plus
negative dentry generator.
First step, run the below kernel build test script:
There were huge number of deferred objects before the shrinker was called,
the behavior does match the code but it might be not desirable from the
user's stand of point.
The excessive amount of nr_deferred might be accumulated due to various
reasons, for example:
* GFP_NOFS allocation
* Significant times of small amount scan (< scan_batch, 1024 for vfs
metadata)
However the LRUs of slabs are per memcg (memcg-aware shrinkers) but the
deferred objects is per shrinker, this may have some bad effects:
* Poor isolation among memcgs. Some memcgs which happen to have
frequent limit reclaim may get nr_deferred accumulated to a huge number,
then other innocent memcgs may take the fall. In our case the main
workload was hit.
* Unbounded deferred objects. There is no cap for deferred objects, it
can outgrow ridiculously as the tracing result showed.
* Easy to get out of control. Although shrinkers take into account
deferred objects, but it can go out of control easily. One
misconfigured memcg could incur absurd amount of deferred objects in a
period of time.
* Sort of reclaim problems, i.e. over reclaim, long reclaim latency,
etc. There may be hundred GB slab caches for vfe metadata heavy
workload, shrink half of them may take minutes. We observed latency
spike due to the prolonged reclaim.
These issues also have been discussed in
https://lore.kernel.org/linux-mm/20200916185823.5347-1-shy828301@gmail.com/.
The patchset is the outcome of that discussion.
So this patchset makes nr_deferred per-memcg to tackle the problem. It
does:
* Have memcg_shrinker_deferred per memcg per node, just like what
shrinker_map does. Instead it is an atomic_long_t array, each element
represent one shrinker even though the shrinker is not memcg aware, this
simplifies the implementation. For memcg aware shrinkers, the deferred
objects are just accumulated to its own memcg. The shrinkers just see
nr_deferred from its own memcg. Non memcg aware shrinkers still use
global nr_deferred from struct shrinker.
* Once the memcg is offlined, its nr_deferred will be reparented to its
parent along with LRUs.
* The root memcg has memcg_shrinker_deferred array too. It simplifies
the handling of reparenting to root memcg.
* Cap nr_deferred to 2x of the length of lru. The idea is borrowed from
Dave Chinner's series
(https://lore.kernel.org/linux-xfs/20191031234618.15403-1-david@fromorbit.com/)
The downside is each memcg has to allocate extra memory to store the
nr_deferred array. On our production environment, there are typically
around 40 shrinkers, so each memcg needs ~320 bytes. 10K memcgs would
need ~3.2MB memory. It seems fine.
We have been running the patched kernel on some hosts of our fleet (test
and production) for months, it works very well. The monitor data shows
the working set is sustained as expected.
This patch (of 13):
The tracepoint's nid should show what node the shrink happens on, the
start tracepoint uses nid from shrinkctl, but the nid might be set to 0
before end tracepoint if the shrinker is not NUMA aware, so the tracing
log may show the shrink happens on one node but end up on the other node.
It seems confusing. And the following patch will remove using nid
directly in do_shrink_slab(), this patch also helps cleanup the code.
Link: https://lkml.kernel.org/r/20210311190845.9708-1-shy828301@gmail.com Link: https://lkml.kernel.org/r/20210311190845.9708-2-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dave Hansen [Wed, 5 May 2021 01:36:04 +0000 (18:36 -0700)]
mm/vmscan: replace implicit RECLAIM_ZONE checks with explicit checks
RECLAIM_ZONE was assumed to be unused because it was never explicitly
used in the kernel. However, there were a number of places where it was
checked implicitly by checking 'node_reclaim_mode' for a zero value.
These zero checks are not great because it is not obvious what a zero
mode *means* in the code. Replace them with a helper which makes it
more obvious: node_reclaim_enabled().
This helper also provides a handy place to explicitly check the
RECLAIM_ZONE bit itself. Check it explicitly there to make it more
obvious where the bit can affect behavior.
This should have no functional impact.
Link: https://lkml.kernel.org/r/20210219172559.BF589C44@viggo.jf.intel.com Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Ben Widawsky <ben.widawsky@intel.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: "Tobin C. Harding" <tobin@kernel.org> Cc: Huang Ying <ying.huang@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Qian Cai <cai@lca.pw> Cc: Daniel Wagner <dwagner@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dave Hansen [Wed, 5 May 2021 01:36:01 +0000 (18:36 -0700)]
mm/vmscan: move RECLAIM* bits to uapi header
It is currently not obvious that the RECLAIM_* bits are part of the uapi
since they are defined in vmscan.c. Move them to a uapi header to make it
obvious.
This should have no functional impact.
Link: https://lkml.kernel.org/r/20210219172557.08074910@viggo.jf.intel.com Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Ben Widawsky <ben.widawsky@intel.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Daniel Wagner <dwagner@suse.de> Cc: "Tobin C. Harding" <tobin@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Axel Rasmussen [Wed, 5 May 2021 01:35:57 +0000 (18:35 -0700)]
userfaultfd/selftests: add test exercising minor fault handling
Fix a dormant bug in userfaultfd_events_test(), where we did `return
faulting_process(0)` instead of `exit(faulting_process(0))`. This
caused the forked process to keep running, trying to execute any further
test cases after the events test in parallel with the "real" process.
Add a simple test case which exercises minor faults. In short, it does
the following:
1. "Sets up" an area (area_dst) and a second shared mapping to the same
underlying pages (area_dst_alias).
2. Register one of these areas with userfaultfd, in minor fault mode.
3. Start a second thread to handle any minor faults.
4. Populate the underlying pages with the non-UFFD-registered side of
the mapping. Basically, memset() each page with some arbitrary
contents.
5. Then, using the UFFD-registered mapping, read all of the page
contents, asserting that the contents match expectations (we expect
the minor fault handling thread can modify the page contents before
resolving the fault).
The minor fault handling thread, upon receiving an event, flips all the
bits (~) in that page, just to prove that it can modify it in some
arbitrary way. Then it issues a UFFDIO_CONTINUE ioctl, to setup the
mapping and resolve the fault. The reading thread should wake up and
see this modification.
Currently the minor fault test is only enabled in hugetlb_shared mode,
as this is the only configuration the kernel feature supports.
Link: https://lkml.kernel.org/r/20210301222728.176417-7-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Adam Ruprecht <ruprecht@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Cannon Matthews <cannonmatthews@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chinwen Chang <chinwen.chang@mediatek.com> Cc: David Rientjes <rientjes@google.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michal Koutn" <mkoutny@suse.com> Cc: Michel Lespinasse <walken@google.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oliver Upton <oupton@google.com> Cc: Shaohua Li <shli@fb.com> Cc: Shawn Anastasio <shawn@anastas.io> Cc: Steven Price <steven.price@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Axel Rasmussen [Wed, 5 May 2021 01:35:53 +0000 (18:35 -0700)]
userfaultfd: update documentation to describe minor fault handling
Reword / reorganize things a little bit into "lists", so new features /
modes / ioctls can sort of just be appended.
Describe how UFFDIO_REGISTER_MODE_MINOR and UFFDIO_CONTINUE can be used to
intercept and resolve minor faults. Make it clear that COPY and ZEROPAGE
are used for MISSING faults, whereas CONTINUE is used for MINOR faults.
Link: https://lkml.kernel.org/r/20210301222728.176417-6-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Adam Ruprecht <ruprecht@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Cannon Matthews <cannonmatthews@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chinwen Chang <chinwen.chang@mediatek.com> Cc: David Rientjes <rientjes@google.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michal Koutn" <mkoutny@suse.com> Cc: Michel Lespinasse <walken@google.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oliver Upton <oupton@google.com> Cc: Shaohua Li <shli@fb.com> Cc: Shawn Anastasio <shawn@anastas.io> Cc: Steven Price <steven.price@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Axel Rasmussen [Wed, 5 May 2021 01:35:49 +0000 (18:35 -0700)]
userfaultfd: add UFFDIO_CONTINUE ioctl
This ioctl is how userspace ought to resolve "minor" userfaults. The
idea is, userspace is notified that a minor fault has occurred. It
might change the contents of the page using its second non-UFFD mapping,
or not. Then, it calls UFFDIO_CONTINUE to tell the kernel "I have
ensured the page contents are correct, carry on setting up the mapping".
Note that it doesn't make much sense to use UFFDIO_{COPY,ZEROPAGE} for
MINOR registered VMAs. ZEROPAGE maps the VMA to the zero page; but in
the minor fault case, we already have some pre-existing underlying page.
Likewise, UFFDIO_COPY isn't useful if we have a second non-UFFD mapping.
We'd just use memcpy() or similar instead.
It turns out hugetlb_mcopy_atomic_pte() already does very close to what
we want, if an existing page is provided via `struct page **pagep`. We
already special-case the behavior a bit for the UFFDIO_ZEROPAGE case, so
just extend that design: add an enum for the three modes of operation,
and make the small adjustments needed for the MCOPY_ATOMIC_CONTINUE
case. (Basically, look up the existing page, and avoid adding the
existing page to the page cache or calling set_page_huge_active() on
it.)
Link: https://lkml.kernel.org/r/20210301222728.176417-5-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Adam Ruprecht <ruprecht@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Cannon Matthews <cannonmatthews@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chinwen Chang <chinwen.chang@mediatek.com> Cc: David Rientjes <rientjes@google.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michal Koutn" <mkoutny@suse.com> Cc: Michel Lespinasse <walken@google.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oliver Upton <oupton@google.com> Cc: Shaohua Li <shli@fb.com> Cc: Shawn Anastasio <shawn@anastas.io> Cc: Steven Price <steven.price@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Axel Rasmussen [Wed, 5 May 2021 01:35:45 +0000 (18:35 -0700)]
userfaultfd: hugetlbfs: only compile UFFD helpers if config enabled
For background, mm/userfaultfd.c provides a general mcopy_atomic
implementation. But some types of memory (i.e., hugetlb and shmem) need
a slightly different implementation, so they provide their own helpers
for this. In other words, userfaultfd is the only caller of these
functions.
This patch achieves two things:
1. Don't spend time compiling code which will end up never being
referenced anyway (a small build time optimization).
2. In patches later in this series, we extend the signature of these
helpers with UFFD-specific state (a mode enumeration). Once this
happens, we *have to* either not compile the helpers, or
unconditionally define the UFFD-only state (which seems messier to me).
This includes the declarations in the headers, as otherwise they'd
yield warnings about implicitly defining the type of those arguments.
Link: https://lkml.kernel.org/r/20210301222728.176417-4-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Adam Ruprecht <ruprecht@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Cannon Matthews <cannonmatthews@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chinwen Chang <chinwen.chang@mediatek.com> Cc: David Rientjes <rientjes@google.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michal Koutn" <mkoutny@suse.com> Cc: Michel Lespinasse <walken@google.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oliver Upton <oupton@google.com> Cc: Shaohua Li <shli@fb.com> Cc: Shawn Anastasio <shawn@anastas.io> Cc: Steven Price <steven.price@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Axel Rasmussen [Wed, 5 May 2021 01:35:40 +0000 (18:35 -0700)]
userfaultfd: disable huge PMD sharing for MINOR registered VMAs
As the comment says: for the MINOR fault use case, although the page
might be present and populated in the other (non-UFFD-registered) half
of the mapping, it may be out of date, and we explicitly want userspace
to get a minor fault so it can check and potentially update the page's
contents.
Huge PMD sharing would prevent these faults from occurring for suitably
aligned areas, so disable it upon UFFD registration.
Link: https://lkml.kernel.org/r/20210301222728.176417-3-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Adam Ruprecht <ruprecht@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Cannon Matthews <cannonmatthews@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chinwen Chang <chinwen.chang@mediatek.com> Cc: David Rientjes <rientjes@google.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michal Koutn" <mkoutny@suse.com> Cc: Michel Lespinasse <walken@google.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oliver Upton <oupton@google.com> Cc: Shaohua Li <shli@fb.com> Cc: Shawn Anastasio <shawn@anastas.io> Cc: Steven Price <steven.price@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Axel Rasmussen [Wed, 5 May 2021 01:35:36 +0000 (18:35 -0700)]
userfaultfd: add minor fault registration mode
Patch series "userfaultfd: add minor fault handling", v9.
Overview
========
This series adds a new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS.
When enabled (via the UFFDIO_API ioctl), this feature means that any
hugetlbfs VMAs registered with UFFDIO_REGISTER_MODE_MISSING will *also*
get events for "minor" faults. By "minor" fault, I mean the following
situation:
Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
memory). One of the mappings is registered with userfaultfd (in minor
mode), and the other is not. Via the non-UFFD mapping, the underlying
pages have already been allocated & filled with some contents. The UFFD
mapping has not yet been faulted in; when it is touched for the first
time, this results in what I'm calling a "minor" fault. As a concrete
example, when working with hugetlbfs, we have huge_pte_none(), but
find_lock_page() finds an existing page.
We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea
is, userspace resolves the fault by either a) doing nothing if the
contents are already correct, or b) updating the underlying contents using
the second, non-UFFD mapping (via memcpy/memset or similar, or something
fancier like RDMA, or etc...). In either case, userspace issues
UFFDIO_CONTINUE to tell the kernel "I have ensured the page contents are
correct, carry on setting up the mapping".
Use Case
========
Consider the use case of VM live migration (e.g. under QEMU/KVM):
1. While a VM is still running, we copy the contents of its memory to a
target machine. The pages are populated on the target by writing to the
non-UFFD mapping, using the setup described above. The VM is still running
(and therefore its memory is likely changing), so this may be repeated
several times, until we decide the target is "up to date enough".
2. We pause the VM on the source, and start executing on the target machine.
During this gap, the VM's user(s) will *see* a pause, so it is desirable to
minimize this window.
3. Between the last time any page was copied from the source to the target, and
when the VM was paused, the contents of that page may have changed - and
therefore the copy we have on the target machine is out of date. Although we
can keep track of which pages are out of date, for VMs with large amounts of
memory, it is "slow" to transfer this information to the target machine. We
want to resume execution before such a transfer would complete.
4. So, the guest begins executing on the target machine. The first time it
touches its memory (via the UFFD-registered mapping), userspace wants to
intercept this fault. Userspace checks whether or not the page is up to date,
and if not, copies the updated page from the source machine, via the non-UFFD
mapping. Finally, whether a copy was performed or not, userspace issues a
UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
are correct, carry on setting up the mapping".
We don't have to do all of the final updates on-demand. The userfaultfd manager
can, in the background, also copy over updated pages once it receives the map of
which pages are up-to-date or not.
Interaction with Existing APIs
==============================
Because this is a feature, a registered VMA could potentially receive both
missing and minor faults. I spent some time thinking through how the
existing API interacts with the new feature:
UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
allocate a new page. If UFFDIO_CONTINUE is used on a non-minor fault:
- For non-shared memory or shmem, -EINVAL is returned.
- For hugetlb, -EFAULT is returned.
UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults.
Without modifications, the existing codepath assumes a new page needs to
be allocated. This is okay, since userspace must have a second
non-UFFD-registered mapping anyway, thus there isn't much reason to want
to use these in any case (just memcpy or memset or similar).
- If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
- If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
- UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
-ENOENT in that case (regardless of the kind of fault).
Future Work
===========
This series only supports hugetlbfs. I have a second series in flight to
support shmem as well, extending the functionality. This series is more
mature than the shmem support at this point, and the functionality works
fully on hugetlbfs, so this series can be merged first and then shmem
support will follow.
This patch (of 6):
This feature allows userspace to intercept "minor" faults. By "minor"
faults, I mean the following situation:
Let there exist two mappings (i.e., VMAs) to the same page(s). One of the
mappings is registered with userfaultfd (in minor mode), and the other is
not. Via the non-UFFD mapping, the underlying pages have already been
allocated & filled with some contents. The UFFD mapping has not yet been
faulted in; when it is touched for the first time, this results in what
I'm calling a "minor" fault. As a concrete example, when working with
hugetlbfs, we have huge_pte_none(), but find_lock_page() finds an existing
page.
This commit adds the new registration mode, and sets the relevant flag on
the VMAs being registered. In the hugetlb fault path, if we find that we
have huge_pte_none(), but find_lock_page() does indeed find an existing
page, then we have a "minor" fault, and if the VMA has the userfaultfd
registration flag, we call into userfaultfd to handle it.
This is implemented as a new registration mode, instead of an API feature.
This is because the alternative implementation has significant drawbacks
[1].
However, doing it this was requires we allocate a VM_* flag for the new
registration mode. On 32-bit systems, there are no unused bits, so this
feature is only supported on architectures with
CONFIG_ARCH_USES_HIGH_VMA_FLAGS. When attempting to register a VMA in
MINOR mode on 32-bit architectures, we return -EINVAL.
Oscar Salvador [Wed, 5 May 2021 01:35:33 +0000 (18:35 -0700)]
mm,page_alloc: drop unnecessary checks from pfn_range_valid_contig
pfn_range_valid_contig() bails out when it finds an in-use page or a
hugetlb page, among other things. We can drop the in-use page check since
__alloc_contig_pages can migrate away those pages, and the hugetlb page
check can go too since isolate_migratepages_range is now capable of
dealing with hugetlb pages. Either way, those checks are racy so let the
end function handle it when the time comes.
Link: https://lkml.kernel.org/r/20210419075413.1064-8-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oscar Salvador [Wed, 5 May 2021 01:35:29 +0000 (18:35 -0700)]
mm: make alloc_contig_range handle in-use hugetlb pages
alloc_contig_range() will fail if it finds a HugeTLB page within the
range, without a chance to handle them. Since HugeTLB pages can be
migrated as any LRU or Movable page, it does not make sense to bail out
without trying. Enable the interface to recognize in-use HugeTLB pages so
we can migrate them, and have much better chances to succeed the call.
Link: https://lkml.kernel.org/r/20210419075413.1064-7-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oscar Salvador [Wed, 5 May 2021 01:35:26 +0000 (18:35 -0700)]
mm: make alloc_contig_range handle free hugetlb pages
alloc_contig_range will fail if it ever sees a HugeTLB page within the
range we are trying to allocate, even when that page is free and can be
easily reallocated.
This has proved to be problematic for some users of alloc_contic_range,
e.g: CMA and virtio-mem, where those would fail the call even when those
pages lay in ZONE_MOVABLE and are free.
We can do better by trying to replace such page.
Free hugepages are tricky to handle so as to no userspace application
notices disruption, we need to replace the current free hugepage with a
new one.
In order to do that, a new function called alloc_and_dissolve_huge_page is
introduced. This function will first try to get a new fresh hugepage, and
if it succeeds, it will replace the old one in the free hugepage pool.
The free page replacement is done under hugetlb_lock, so no external users
of hugetlb will notice the change. To allocate the new huge page, we use
alloc_buddy_huge_page(), so we do not have to deal with any counters, and
prep_new_huge_page() is not called. This is valulable because in case we
need to free the new page, we only need to call __free_pages().
Once we know that the page to be replaced is a genuine 0-refcounted huge
page, we remove the old page from the freelist by remove_hugetlb_page().
Then, we can call __prep_new_huge_page() and
__prep_account_new_huge_page() for the new huge page to properly
initialize it and increment the hstate->nr_huge_pages counter (previously
decremented by remove_hugetlb_page()). Once done, the page is enqueued by
enqueue_huge_page() and it is ready to be used.
There is one tricky case when page's refcount is 0 because it is in the
process of being released. A missing PageHugeFreed bit will tell us that
freeing is in flight so we retry after dropping the hugetlb_lock. The
race window should be small and the next retry should make a forward
progress.
E.g:
CPU0 CPU1
free_huge_page() isolate_or_dissolve_huge_page
PageHuge() == T
alloc_and_dissolve_huge_page
alloc_buddy_huge_page()
spin_lock_irq(hugetlb_lock)
// PageHuge() && !PageHugeFreed &&
// !PageCount()
spin_unlock_irq(hugetlb_lock)
spin_lock_irq(hugetlb_lock)
1) update_and_free_page
PageHuge() == F
__free_pages()
2) enqueue_huge_page
SetPageHugeFreed()
spin_unlock_irq(&hugetlb_lock)
spin_lock_irq(hugetlb_lock)
1) PageHuge() == F (freed by case#1 from CPU0)
2) PageHuge() == T
PageHugeFreed() == T
- proceed with replacing the page
In the case above we retry as the window race is quite small and we have
high chances to succeed next time.
With regard to the allocation, we restrict it to the node the page belongs
to with __GFP_THISNODE, meaning we do not fallback on other node's zones.
Note that gigantic hugetlb pages are fenced off since there is a cyclic
dependency between them and alloc_contig_range.
Link: https://lkml.kernel.org/r/20210419075413.1064-6-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, prep_new_huge_page() performs two functions. It sets the
right state for a new hugetlb, and increases the hstate's counters to
account for the new page.
Let us split its functionality into two separate functions, decoupling
the handling of the counters from initializing a hugepage. The outcome
is having __prep_new_huge_page(), which only initializes the page , and
__prep_account_new_huge_page(), which adds the new page to the hstate's
counters.
This allows us to be able to set a hugetlb without having to worry about
the counter/locking. It will prove useful in the next patch.
prep_new_huge_page() still calls both functions.
Link: https://lkml.kernel.org/r/20210419075413.1064-5-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oscar Salvador [Wed, 5 May 2021 01:35:20 +0000 (18:35 -0700)]
mm,hugetlb: drop clearing of flag from prep_new_huge_page
Pages allocated via the page allocator or CMA get its private field
cleared by means of post_alloc_hook().
Pages allocated during boot, that is directly from the memblock
allocator, get cleared by paging_init()-> .. ->memmap_init_zone-> ..
->__init_single_page() before any memblock allocation.
Based on this ground, let us remove the clearing of the flag from
prep_new_huge_page() as it is not needed. This was a leftover from
commit 6c0371490140 ("hugetlb: convert PageHugeFreed to HPageFreed
flag").
Previously the explicit clearing was necessary because compound
allocations do not get this initialization (see prep_compound_page).
Link: https://lkml.kernel.org/r/20210419075413.1064-4-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oscar Salvador [Wed, 5 May 2021 01:35:17 +0000 (18:35 -0700)]
mm,compaction: let isolate_migratepages_{range,block} return error codes
Currently, isolate_migratepages_{range,block} and their callers use a pfn
== 0 vs pfn != 0 scheme to let the caller know whether there was any error
during isolation.
This does not work as soon as we need to start reporting different error
codes and make sure we pass them down the chain, so they are properly
interpreted by functions like e.g: alloc_contig_range.
Let us rework isolate_migratepages_{range,block} so we can report error
codes. Since isolate_migratepages_block will stop returning the next pfn
to be scanned, we reuse the cc->migrate_pfn field to keep track of that.
Link: https://lkml.kernel.org/r/20210419075413.1064-3-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oscar Salvador [Wed, 5 May 2021 01:35:14 +0000 (18:35 -0700)]
mm,page_alloc: bail out earlier on -ENOMEM in alloc_contig_migrate_range
Patch series "Make alloc_contig_range handle Hugetlb pages", v10.
alloc_contig_range lacks the ability to handle HugeTLB pages. This can
be problematic for some users, e.g: CMA and virtio-mem, where those
users will fail the call if alloc_contig_range ever sees a HugeTLB page,
even when those pages lay in ZONE_MOVABLE and are free. That problem
can be easily solved by replacing the page in the free hugepage pool.
In-use HugeTLB are no exception though, as those can be isolated and
migrated as any other LRU or Movable page.
This aims to improve alloc_contig_range->isolate_migratepages_block, so
that HugeTLB pages can be recognized and handled.
Since we also need to start reporting errors down the chain (e.g:
-ENOMEM due to not be able to allocate a new hugetlb page),
isolate_migratepages_{range,block} interfaces need to change to start
reporting error codes instead of the pfn == 0 vs pfn != 0 scheme it is
using right now. From now on, isolate_migratepages_block will not
return the next pfn to be scanned anymore, but -EINTR, -ENOMEM or 0, so
we the next pfn to be scanned will be recorded in cc->migrate_pfn field
(as it is already done in isolate_migratepages_range()).
Below is an insight from David (thanks), where the problem can clearly be
seen:
"Start a VM with 4G. Hotplug 1G via virtio-mem and online it to
ZONE_MOVABLE. Allocate 512 huge pages.
Most probably because it happened to try migrating a huge page
while it was busy. As virtio-mem retries on ZONE_MOVABLE a couple of
times, it can deal with this temporary failure.
Currently, __alloc_contig_migrate_range can generate -EINTR, -ENOMEM or
-EBUSY, and report them down the chain. The problem is that when
migrate_pages() reports -ENOMEM, we keep going till we exhaust all the
try-attempts (5 at the moment) instead of bailing out.
migrate_pages() bails out right away on -ENOMEM because it is considered a
fatal error. Do the same here instead of keep going and retrying. Note
that this is not fixing a real issue, just a cosmetic change. Although we
can save some cycles by backing off ealier
Link: https://lkml.kernel.org/r/20210419075413.1064-1-osalvador@suse.de Link: https://lkml.kernel.org/r/20210419075413.1064-2-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Kravetz [Wed, 5 May 2021 01:35:10 +0000 (18:35 -0700)]
hugetlb: add lockdep_assert_held() calls for hugetlb_lock
After making hugetlb lock irq safe and separating some functionality
done under the lock, add some lockdep_assert_held to help verify
locking.
Link: https://lkml.kernel.org/r/20210409205254.242291-9-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hillf Danton <hdanton@sina.com> Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mina Almasry <almasrymina@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Kravetz [Wed, 5 May 2021 01:35:07 +0000 (18:35 -0700)]
hugetlb: make free_huge_page irq safe
Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in
non-task context") was added to address the issue of free_huge_page being
called from irq context. That commit hands off free_huge_page processing
to a workqueue if !in_task. However, this doesn't cover all the cases as
pointed out by 0day bot lockdep report [1].
Shakeel has later explained that this is very likely TCP TX zerocopy from
hugetlb pages scenario when the networking code drops a last reference to
hugetlb page while having IRQ disabled. Hugetlb freeing path doesn't
disable IRQ while holding hugetlb_lock so a lock dependency chain can lead
to a deadlock.
This commit addresses the issue by doing the following:
- Make hugetlb_lock irq safe. This is mostly a simple process of
changing spin_*lock calls to spin_*lock_irq* calls.
- Make subpool lock irq safe in a similar manner.
- Revert the !in_task check and workqueue handoff.
Link: https://lkml.kernel.org/r/20210409205254.242291-8-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hillf Danton <hdanton@sina.com> Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Kravetz [Wed, 5 May 2021 01:35:03 +0000 (18:35 -0700)]
hugetlb: change free_pool_huge_page to remove_pool_huge_page
free_pool_huge_page was called with hugetlb_lock held. It would remove
a hugetlb page, and then free the corresponding pages to the lower level
allocators such as buddy. free_pool_huge_page was called in a loop to
remove hugetlb pages and these loops could hold the hugetlb_lock for a
considerable time.
Create new routine remove_pool_huge_page to replace free_pool_huge_page.
remove_pool_huge_page will remove the hugetlb page, and it must be
called with the hugetlb_lock held. It will return the removed page and
it is the responsibility of the caller to free the page to the lower
level allocators. The hugetlb_lock is dropped before freeing to these
allocators which results in shorter lock hold times.
Add new helper routine to call update_and_free_page for a list of pages.
Note: Some changes to the routine return_unused_surplus_pages are in
need of explanation. Commit e5bbc8a6c992 ("mm/hugetlb.c: fix
reservation race when freeing surplus pages") modified this routine to
address a race which could occur when dropping the hugetlb_lock in the
loop that removes pool pages. Accounting changes introduced in that
commit were subtle and took some thought to understand. This commit
removes the cond_resched_lock() and the potential race. Therefore,
remove the subtle code and restore the more straight forward accounting
effectively reverting the commit.
Link: https://lkml.kernel.org/r/20210409205254.242291-7-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hillf Danton <hdanton@sina.com> Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Kravetz [Wed, 5 May 2021 01:34:59 +0000 (18:34 -0700)]
hugetlb: call update_and_free_page without hugetlb_lock
With the introduction of remove_hugetlb_page(), there is no need for
update_and_free_page to hold the hugetlb lock. Change all callers to
drop the lock before calling.
With additional code modifications, this will allow loops which decrease
the huge page pool to drop the hugetlb_lock with each page to reduce
long hold times.
The ugly unlock/lock cycle in free_pool_huge_page will be removed in a
subsequent patch which restructures free_pool_huge_page.
Link: https://lkml.kernel.org/r/20210409205254.242291-6-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hillf Danton <hdanton@sina.com> Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mina Almasry <almasrymina@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Kravetz [Wed, 5 May 2021 01:34:55 +0000 (18:34 -0700)]
hugetlb: create remove_hugetlb_page() to separate functionality
The new remove_hugetlb_page() routine is designed to remove a hugetlb
page from hugetlbfs processing. It will remove the page from the active
or free list, update global counters and set the compound page
destructor to NULL so that PageHuge() will return false for the 'page'.
After this call, the 'page' can be treated as a normal compound page or
a collection of base size pages.
update_and_free_page no longer decrements h->nr_huge_pages{_node} as
this is performed in remove_hugetlb_page. The only functionality
performed by update_and_free_page is to free the base pages to the lower
level allocators.
update_and_free_page is typically called after remove_hugetlb_page.
remove_hugetlb_page is to be called with the hugetlb_lock held.
Creating this routine and separating functionality is in preparation for
restructuring code to reduce lock hold times. This commit should not
introduce any changes to functionality.
Link: https://lkml.kernel.org/r/20210409205254.242291-5-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hillf Danton <hdanton@sina.com> Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mina Almasry <almasrymina@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Kravetz [Wed, 5 May 2021 01:34:52 +0000 (18:34 -0700)]
hugetlb: add per-hstate mutex to synchronize user adjustments
The helper routine hstate_next_node_to_alloc accesses and modifies the
hstate variable next_nid_to_alloc. The helper is used by the routines
alloc_pool_huge_page and adjust_pool_surplus. adjust_pool_surplus is
called with hugetlb_lock held. However, alloc_pool_huge_page can not be
called with the hugetlb lock held as it will call the page allocator.
Two instances of alloc_pool_huge_page could be run in parallel or
alloc_pool_huge_page could run in parallel with adjust_pool_surplus
which may result in the variable next_nid_to_alloc becoming invalid for
the caller and pages being allocated on the wrong node.
Both alloc_pool_huge_page and adjust_pool_surplus are only called from
the routine set_max_huge_pages after boot. set_max_huge_pages is only
called as the reusult of a user writing to the proc/sysfs nr_hugepages,
or nr_hugepages_mempolicy file to adjust the number of hugetlb pages.
It makes little sense to allow multiple adjustment to the number of
hugetlb pages in parallel. Add a mutex to the hstate and use it to only
allow one hugetlb page adjustment at a time. This will synchronize
modifications to the next_nid_to_alloc variable.
Link: https://lkml.kernel.org/r/20210409205254.242291-4-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: David Rientjes <rientjes@google.com> Cc: Hillf Danton <hdanton@sina.com> Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mina Almasry <almasrymina@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Kravetz [Wed, 5 May 2021 01:34:48 +0000 (18:34 -0700)]
hugetlb: no need to drop hugetlb_lock to call cma_release
Now that cma_release is non-blocking and irq safe, there is no need to
drop hugetlb_lock before calling.
Link: https://lkml.kernel.org/r/20210409205254.242291-3-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: David Rientjes <rientjes@google.com> Cc: Hillf Danton <hdanton@sina.com> Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Kravetz [Wed, 5 May 2021 01:34:44 +0000 (18:34 -0700)]
mm/cma: change cma mutex to irq safe spinlock
Patch series "make hugetlb put_page safe for all calling contexts", v5.
This effort is the result a recent bug report [1]. Syzbot found a
potential deadlock in the hugetlb put_page/free_huge_page_path. WARNING:
SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected Since the
free_huge_page_path already has code to 'hand off' page free requests to a
workqueue, a suggestion was proposed to make the in_irq() detection
accurate by always enabling PREEMPT_COUNT [2]. The outcome of that
discussion was that the hugetlb put_page path (free_huge_page) path should
be properly fixed and safe for all calling contexts.
cma_release is currently a sleepable operatation because the bitmap
manipulation is protected by cma->lock mutex. Hugetlb code which relies
on cma_release for CMA backed (giga) hugetlb pages, however, needs to be
irq safe.
The lock doesn't protect any sleepable operation so it can be changed to a
(irq aware) spin lock. The bitmap processing should be quite fast in
typical case but if cma sizes grow to TB then we will likely need to
replace the lock by a more optimized bitmap implementation.
Link: https://lkml.kernel.org/r/20210409205254.242291-1-mike.kravetz@oracle.com Link: https://lkml.kernel.org/r/20210409205254.242291-2-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Muchun Song <songmuchun@bytedance.com> Cc: David Rientjes <rientjes@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Cc: Waiman Long <longman@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miaohe Lin [Wed, 5 May 2021 01:34:38 +0000 (18:34 -0700)]
mm/hugeltb: handle the error case in hugetlb_fix_reserve_counts()
A rare out of memory error would prevent removal of the reserve map region
for a page. hugetlb_fix_reserve_counts() handles this rare case to avoid
dangling with incorrect counts. Unfortunately, hugepage_subpool_get_pages
and hugetlb_acct_memory could possibly fail too. We should correctly
handle these cases.
Link: https://lkml.kernel.org/r/20210410072348.20437-5-linmiaohe@huawei.com Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of pages") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Feilong Lin <linfeilong@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miaohe Lin [Wed, 5 May 2021 01:34:35 +0000 (18:34 -0700)]
mm/hugeltb: clarify (chg - freed) won't go negative in hugetlb_unreserve_pages()
The resv_map could be NULL since this routine can be called in the evict
inode path for all hugetlbfs inodes and we will have chg = 0 in this case.
But (chg - freed) won't go negative as Mike pointed out:
"If resv_map is NULL, then no hugetlb pages can be allocated/associated
with the file. As a result, remove_inode_hugepages will never find any
huge pages associated with the inode and the passed value 'freed' will
always be zero."
Add a comment clarifying this to make it clear and also avoid confusion.
Link: https://lkml.kernel.org/r/20210410072348.20437-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Feilong Lin <linfeilong@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miaohe Lin [Wed, 5 May 2021 01:34:32 +0000 (18:34 -0700)]
mm/hugeltb: simplify the return code of __vma_reservation_common()
It's guaranteed that the vma is associated with a resv_map, i.e. either
VM_MAYSHARE or HPAGE_RESV_OWNER, when the code reaches here or we would
have returned via !resv check above. So it's unneeded to check whether
HPAGE_RESV_OWNER is set here. Simplify the return code to make it more
clear.
Link: https://lkml.kernel.org/r/20210410072348.20437-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Feilong Lin <linfeilong@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miaohe Lin [Wed, 5 May 2021 01:34:30 +0000 (18:34 -0700)]
mm/hugeltb: remove redundant VM_BUG_ON() in region_add()
Patch series "Cleanup and fixup for hugetlb", v2.
This series contains cleanups to remove redundant VM_BUG_ON() and simplify
the return code. Also this handles the error case in
hugetlb_fix_reserve_counts() correctly. More details can be found in the
respective changelogs.
This patch (of 5):
The same VM_BUG_ON() check is already done in the callee. Remove this
extra one to simplify the code slightly.
Zi Yan [Wed, 5 May 2021 01:34:26 +0000 (18:34 -0700)]
mm: huge_memory: debugfs for file-backed THP split
Further extend <debugfs>/split_huge_pages to accept
"<path>,<pgoff_start>,<pgoff_end>" for file-backed THP split tests since
tmpfs may have file backed by THP that mapped nowhere.
Update selftest program to test file-backed THP split too.
Link: https://lkml.kernel.org/r/20210331235309.332292-2-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Shuah Khan <shuah@kernel.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Sandipan Das <sandipan@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mika Penttila <mika.penttila@nextfour.com> Cc: David Rientjes <rientjes@google.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Zi Yan [Wed, 5 May 2021 01:34:23 +0000 (18:34 -0700)]
mm: huge_memory: a new debugfs interface for splitting THP tests
We did not have a direct user interface of splitting the compound page
backing a THP and there is no need unless we want to expose the THP
implementation details to users. Make <debugfs>/split_huge_pages accept a
new command to do that.
By writing "<pid>,<vaddr_start>,<vaddr_end>" to
<debugfs>/split_huge_pages, THPs within the given virtual address range
from the process with the given pid are split. It is used to test
split_huge_page function. In addition, a selftest program is added to
tools/testing/selftests/vm to utilize the interface by splitting
PMD THPs and PTE-mapped THPs.
This does not change the old behavior, i.e., writing 1 to the interface
to split all THPs in the system.
Link: https://lkml.kernel.org/r/20210331235309.332292-1-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mika Penttila <mika.penttila@nextfour.com> Cc: Sandipan Das <sandipan@linux.ibm.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miaohe Lin [Wed, 5 May 2021 01:34:20 +0000 (18:34 -0700)]
khugepaged: remove meaningless !pte_present() check in khugepaged_scan_pmd()
We know it must meet the !is_swap_pte() and !pte_none() condition if we
reach here. Since !is_swap_pte() indicates pte_none() or pte_present()
is met, it's guaranteed that pte must be present here.
Link: https://lkml.kernel.org/r/20210325135647.64106-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miaohe Lin [Wed, 5 May 2021 01:34:17 +0000 (18:34 -0700)]
khugepaged: remove unnecessary out label in collapse_huge_page()
The out label here is unneeded because it just goes to out_up_write label.
Remove it to make code more concise.
Link: https://lkml.kernel.org/r/20210325135647.64106-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miaohe Lin [Wed, 5 May 2021 01:34:15 +0000 (18:34 -0700)]
khugepaged: use helper function range_in_vma() in collapse_pte_mapped_thp()
Patch series "Cleanup for khugepaged".
This series contains cleanups to remove unnecessary out label and
meaningless !pte_present() check. Also use helper function to simplify
the code. More details can be found in the respective changelogs.
This patch (of 3):
We could use helper function range_in_vma() to check whether the desired
range is inside the vma to simplify the code.
Yanfei Xu [Wed, 5 May 2021 01:34:12 +0000 (18:34 -0700)]
mm/khugepaged.c: replace barrier() with READ_ONCE() for a selective variable
READ_ONCE() is more selective and lightweight. It is more appropriate
that using a READ_ONCE() for the certain variable to prevent the
compiler from reordering.
Link: https://lkml.kernel.org/r/20210323092730.247583-1-yanfei.xu@windriver.com Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miaohe Lin [Wed, 5 May 2021 01:34:08 +0000 (18:34 -0700)]
mm/huge_memory.c: use helper function migration_entry_to_page()
It's more recommended to use helper function migration_entry_to_page()
to get the page via migration entry. We can also enjoy the PageLocked()
check there.
Link: https://lkml.kernel.org/r/20210318122722.13135-7-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michel Lespinasse <walken@google.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Thomas Hellstrm (Intel) <thomas_os@shipmail.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: yuleixzhang <yulei.kernel@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 4958e4d86ecb ("mm: thp: remove debug_cow switch") forgot to
remove TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG macro. Remove it here.
Link: https://lkml.kernel.org/r/20210318122722.13135-6-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michel Lespinasse <walken@google.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Thomas Hellstrm (Intel) <thomas_os@shipmail.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: yuleixzhang <yulei.kernel@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The !PageCompound() check limits the page must be head or tail while
!PageHead() further limits it to page head only. So !PageHead() check is
equivalent here.
Link: https://lkml.kernel.org/r/20210318122722.13135-5-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michel Lespinasse <walken@google.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Thomas Hellstrm (Intel) <thomas_os@shipmail.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: yuleixzhang <yulei.kernel@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miaohe Lin [Wed, 5 May 2021 01:33:59 +0000 (18:33 -0700)]
mm/huge_memory.c: rework the function do_huge_pmd_numa_page() slightly
The current code that checks if migrating misplaced transhuge page is
needed is pretty hard to follow. Rework it and add a comment to make
its logic more clear and improve readability.
Link: https://lkml.kernel.org/r/20210318122722.13135-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michel Lespinasse <walken@google.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Thomas Hellstrm (Intel) <thomas_os@shipmail.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: yuleixzhang <yulei.kernel@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miaohe Lin [Wed, 5 May 2021 01:33:55 +0000 (18:33 -0700)]
mm/huge_memory.c: make get_huge_zero_page() return bool
It's guaranteed that huge_zero_page will not be NULL if
huge_zero_refcount is increased successfully.
When READ_ONCE(huge_zero_page) is returned, there must be a
huge_zero_page and it can be replaced with returning
'true' when we do not care about the value of huge_zero_page.
We can thus make it return bool to save READ_ONCE cpu cycles as the
return value is just used to check if huge_zero_page exists.
Link: https://lkml.kernel.org/r/20210318122722.13135-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michel Lespinasse <walken@google.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Thomas Hellstrm (Intel) <thomas_os@shipmail.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: yuleixzhang <yulei.kernel@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miaohe Lin [Wed, 5 May 2021 01:33:52 +0000 (18:33 -0700)]
mm/huge_memory.c: rework the function vma_adjust_trans_huge()
Patch series "Some cleanups for huge_memory", v3.
This series contains cleanups to rework some function logics to make it
more readable, use helper function and so on. More details can be found
in the respective changelogs.
This patch (of 6):
The current implementation of vma_adjust_trans_huge() contains some
duplicated codes. Add helper function to get rid of these codes to make
it more succinct.
Link: https://lkml.kernel.org/r/20210318122722.13135-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210318122722.13135-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Peter Xu <peterx@redhat.com> Cc: yuleixzhang <yulei.kernel@gmail.com> Cc: Michel Lespinasse <walken@google.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Thomas Hellstrm (Intel) <thomas_os@shipmail.org> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miaohe Lin [Wed, 5 May 2021 01:33:46 +0000 (18:33 -0700)]
khugepaged: fix wrong result value for trace_mm_collapse_huge_page_isolate()
In writable and !referenced case, the result value should be
SCAN_LACK_REFERENCED_PAGE for trace_mm_collapse_huge_page_isolate()
instead of default 0 (SCAN_FAIL) here.
Link: https://lkml.kernel.org/r/20210306032947.35921-5-linmiaohe@huawei.com Fixes: 7d2eba0557c1 ("mm: add tracepoint for scanning pages") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miaohe Lin [Wed, 5 May 2021 01:33:43 +0000 (18:33 -0700)]
khugepaged: use helper khugepaged_test_exit() in __khugepaged_enter()
Commit 4d45e75a9955 ("mm: remove the now-unnecessary mmget_still_valid()
hack") have made khugepaged_test_exit() suitable for check mm->mm_users
against 0. Use this helper here.
Link: https://lkml.kernel.org/r/20210306032947.35921-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miaohe Lin [Wed, 5 May 2021 01:33:40 +0000 (18:33 -0700)]
khugepaged: reuse the smp_wmb() inside __SetPageUptodate()
smp_wmb() is needed to avoid the copy_huge_page writes to become visible
after the set_pmd_at() write here. But we can reuse the smp_wmb() inside
__SetPageUptodate() to remove this redundant one.
Link: https://lkml.kernel.org/r/20210306032947.35921-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miaohe Lin [Wed, 5 May 2021 01:33:37 +0000 (18:33 -0700)]
khugepaged: remove unneeded return value of khugepaged_collapse_pte_mapped_thps()
Patch series "Cleanup and fixup for khugepaged", v2.
This series contains cleanups to remove unneeded return value, use
helper function and so on. And there is one fix to correct the wrong
result value for trace_mm_collapse_huge_page_isolate().
This patch (of 4):
The return value of khugepaged_collapse_pte_mapped_thps() is never checked
since it's introduced. We should remove such unneeded return value.
Miaohe Lin [Wed, 5 May 2021 01:33:22 +0000 (18:33 -0700)]
mm/hugetlb: use some helper functions to cleanup code
Patch series "Some cleanups for hugetlb".
This series contains cleanups to remove unnecessary VM_BUG_ON_PAGE, use
helper function and so on. I also collect some previous patches into this
series in case they are forgotten.
This patch (of 5):
We could use pages_per_huge_page to get the number of pages per hugepage,
use get_hstate_idx to calculate hstate index, and use hstate_is_gigantic
to check if a hstate is gigantic to make code more succinct.
Miaohe Lin [Wed, 5 May 2021 01:33:16 +0000 (18:33 -0700)]
mm/hugetlb: remove redundant reservation check condition in alloc_huge_page()
vma_resv_map(vma) checks if a reserve map is associated with the vma.
The routine vma_needs_reservation() will check vma_resv_map(vma) and
return 1 if no reserv map is present. map_chg is set to the return
value of vma_needs_reservation(). Therefore, !vma_resv_map(vma) is
redundant in the expression:
map_chg || avoid_reserve || !vma_resv_map(vma);
Remove the redundant check.
[Thanks Mike Kravetz for reshaping this commit message!]
Link: https://lkml.kernel.org/r/20210301104726.45159-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peter Xu [Wed, 5 May 2021 01:33:04 +0000 (18:33 -0700)]
hugetlb/userfaultfd: forbid huge pmd sharing when uffd enabled
Huge pmd sharing could bring problem to userfaultfd. The thing is that
userfaultfd is running its logic based on the special bits on page table
entries, however the huge pmd sharing could potentially share page table
entries for different address ranges. That could cause issues on
either:
- When sharing huge pmd page tables for an uffd write protected range,
the newly mapped huge pmd range will also be write protected
unexpectedly, or,
- When we try to write protect a range of huge pmd shared range, we'll
first do huge_pmd_unshare() in hugetlb_change_protection(), however
that also means the UFFDIO_WRITEPROTECT could be silently skipped for
the shared region, which could lead to data loss.
While at it, a few other things are done altogether:
- Move want_pmd_share() from mm/hugetlb.c into linux/hugetlb.h, because
that's definitely something that arch code would like to use too
- ARM64 currently directly check against
CONFIG_ARCH_WANT_HUGE_PMD_SHARE when trying to share huge pmd. Switch
to the want_pmd_share() helper.
- Move vma_shareable() from huge_pmd_share() into want_pmd_share().
[peterx@redhat.com: fix build with !ARCH_WANT_HUGE_PMD_SHARE] Link: https://lkml.kernel.org/r/20210310185359.88297-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20210218231202.15426-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Cc: Adam Ruprecht <ruprecht@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Cannon Matthews <cannonmatthews@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chinwen Chang <chinwen.chang@mediatek.com> Cc: David Rientjes <rientjes@google.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michal Koutn" <mkoutny@suse.com> Cc: Michel Lespinasse <walken@google.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oliver Upton <oupton@google.com> Cc: Shaohua Li <shli@fb.com> Cc: Shawn Anastasio <shawn@anastas.io> Cc: Steven Price <steven.price@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peter Xu [Wed, 5 May 2021 01:33:00 +0000 (18:33 -0700)]
hugetlb: pass vma into huge_pte_alloc() and huge_pmd_share()
Patch series "hugetlb: Disable huge pmd unshare for uffd-wp", v4.
This series tries to disable huge pmd unshare of hugetlbfs backed memory
for uffd-wp. Although uffd-wp of hugetlbfs is still during rfc stage,
the idea of this series may be needed for multiple tasks (Axel's uffd
minor fault series, and Mike's soft dirty series), so I picked it out
from the larger series.
This patch (of 4):
It is a preparation work to be able to behave differently in the per
architecture huge_pte_alloc() according to different VMA attributes.
Pass it deeper into huge_pmd_share() so that we can avoid the find_vma() call.
[peterx@redhat.com: build fix] Link: https://lkml.kernel.org/r/20210304164653.GB397383@xz-x1Link: Link: https://lkml.kernel.org/r/20210218230633.15028-2-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Suggested-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Adam Ruprecht <ruprecht@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Cannon Matthews <cannonmatthews@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chinwen Chang <chinwen.chang@mediatek.com> Cc: David Rientjes <rientjes@google.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michal Koutn" <mkoutny@suse.com> Cc: Michel Lespinasse <walken@google.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oliver Upton <oupton@google.com> Cc: Shaohua Li <shli@fb.com> Cc: Shawn Anastasio <shawn@anastas.io> Cc: Steven Price <steven.price@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Wed, 5 May 2021 01:32:57 +0000 (18:32 -0700)]
mm: remove nrexceptional from inode: remove BUG_ON
clear_inode()'s BUG_ON(!mapping_empty(&inode->i_data)) is unsafe: we
know of two ways in which nodes can and do (on rare occasions) get left
behind. Until those are fixed, do not BUG_ON() nor even WARN_ON().
Yes, this will then leak those nodes (or the next user of the struct
inode may use them); but this has been happening for years, and the new
BUG_ON(!mapping_empty) was only guilty of revealing that. A proper fix
will follow, but no hurry.
We actually use nrexceptional for very little these days. It's a minor
pain to keep in sync with nrpages, but the pain becomes much bigger with
the THP patches because we don't know how many indices a shadow entry
occupies. It's easier to just remove it than keep it accurate.
Also, we save 8 bytes per inode which is nothing to sneeze at; on my
laptop, it would improve shmem_inode_cache from 22 to 23 objects per
16kB, and inode_cache from 26 to 27 objects. Combined, that saves
a megabyte of memory from a combined usage of 25MB for both caches.
Unfortunately, ext4 doesn't cross a magic boundary, so it doesn't save
any memory for ext4.
This patch (of 4):
Instead of checking the two counters (nrpages and nrexceptional), we can
just check whether i_pages is empty.
Jane Chu [Fri, 30 Apr 2021 06:02:19 +0000 (23:02 -0700)]
mm/memory-failure: unnecessary amount of unmapping
It appears that unmap_mapping_range() actually takes a 'size' as its third
argument rather than a location, the current calling fashion causes
unnecessary amount of unmapping to occur.
Link: https://lkml.kernel.org/r/20210420002821.2749748-1-jane.chu@oracle.com Fixes: 6100e34b2526e ("mm, memory_failure: Teach memory_failure() about dev_pagemap pages") Signed-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: page_alloc: ignore init_on_free=1 for debug_pagealloc=1
On !ARCH_SUPPORTS_DEBUG_PAGEALLOC (like ia64) debug_pagealloc=1 implies
page_poison=on:
if (page_poisoning_enabled() ||
(!IS_ENABLED(CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC) &&
debug_pagealloc_enabled()))
static_branch_enable(&_page_poisoning_enabled);
page_poison=on needs to override init_on_free=1.
Before the change it did not work as expected for the following case:
- have PAGE_POISONING=y
- have page_poison unset
- have !ARCH_SUPPORTS_DEBUG_PAGEALLOC arch (like ia64)
- have init_on_free=1
- have debug_pagealloc=1
That way we get both keys enabled:
- static_branch_enable(&init_on_free);
- static_branch_enable(&_page_poisoning_enabled);
which leads to poisoned pages returned for __GFP_ZERO pages.
After the change we execute only:
- static_branch_enable(&_page_poisoning_enabled);
and ignore init_on_free=1.
Link: https://lkml.kernel.org/r/20210329222555.3077928-1-slyfox@gentoo.org Link: https://lkml.org/lkml/2021/3/26/443 Fixes: 8db26a3d4735 ("mm, page_poison: use static key more efficiently") Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
net: page_pool: use alloc_pages_bulk in refill code path
There are cases where the page_pool need to refill with pages from the
page allocator. Some workloads cause the page_pool to release pages
instead of recycling these pages.
For these workload it can improve performance to bulk alloc pages from the
page-allocator to refill the alloc cache.
For XDP-redirect workload with 100G mlx5 driver (that use page_pool)
redirecting xdp_frame packets into a veth, that does XDP_PASS to create an
SKB from the xdp_frame, which then cannot return the page to the
page_pool.
Performance results under GitHub xdp-project[1]:
[1] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org
Mel: The patch "net: page_pool: convert to use alloc_pages_bulk_array
variant" was squashed with this patch. From the test page, the array
variant was superior with one of the test results as follows.
Kernel XDP stats CPU pps Delta
Baseline XDP-RX CPU total 3,771,046 n/a
List XDP-RX CPU total 3,940,242 +4.49%
Array XDP-RX CPU total 4,249,224 +12.68%
Link: https://lkml.kernel.org/r/20210325114228.27719-10-mgorman@techsingularity.net Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Alexander Lobakin <alobakin@pm.me> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: David Miller <davem@davemloft.net> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chuck Lever [Fri, 30 Apr 2021 06:02:01 +0000 (23:02 -0700)]
SUNRPC: refresh rq_pages using a bulk page allocator
Reduce the rate at which nfsd threads hammer on the page allocator. This
improves throughput scalability by enabling the threads to run more
independently of each other.
[mgorman: Update interpretation of alloc_pages_bulk return value]
Link: https://lkml.kernel.org/r/20210325114228.27719-8-mgorman@techsingularity.net Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Alexander Lobakin <alobakin@pm.me> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Miller <davem@davemloft.net> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The patches change SUNRPC to invoke the array-based bulk allocator
instead of alloc_page().
The micro-benchmark results are promising. I ran a mixture of 256KB
reads and writes over NFSv3. The server's kernel is built with KASAN
enabled, so the comparison is exaggerated but I believe it is still
valid.
I instrumented svc_recv() to measure the latency of each call to
svc_alloc_arg() and report it via a trace point. The following results
are averages across the trace events.
Single page: 25.007 us per call over 532,571 calls
Bulk list: 6.258 us per call over 517,034 calls
Bulk array: 4.590 us per call over 517,442 calls
This patch (of 2)
Refactor:
I'm about to use the loop variable @i for something else.
As far as the "i++" is concerned, that is a post-increment. The
value of @i is not used subsequently, so the increment operator
is unnecessary and can be removed.
Also note that nfsd_read_actor() was renamed nfsd_splice_actor()
by commit cf8208d0eabd ("sendfile: convert nfsd to
splice_direct_to_actor()").
Link: https://lkml.kernel.org/r/20210325114228.27719-7-mgorman@techsingularity.net Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Alexander Lobakin <alobakin@pm.me> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Miller <davem@davemloft.net> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/page_alloc: optimize code layout for __alloc_pages_bulk
Looking at perf-report and ASM-code for __alloc_pages_bulk() it is clear
that the code activated is suboptimal. The compiler guesses wrong and
places unlikely code at the beginning. Due to the use of WARN_ON_ONCE()
macro the UD2 asm instruction is added to the code, which confuse the
I-cache prefetcher in the CPU.
[mgorman@techsingularity.net: minor changes and rebasing]
Link: https://lkml.kernel.org/r/20210325114228.27719-5-mgorman@techsingularity.net Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Alexander Lobakin <alobakin@pm.me> Acked-By: Vlastimil Babka <vbabka@suse.cz> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: David Miller <davem@davemloft.net> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/page_alloc: add an array-based interface to the bulk page allocator
The proposed callers for the bulk allocator store pages from the bulk
allocator in an array. This patch adds an array-based interface to the
API to avoid multiple list iterations. The page list interface is
preserved to avoid requiring all users of the bulk API to allocate and
manage enough storage to store the pages.
[akpm@linux-foundation.org: remove now unused local `allocated']
Link: https://lkml.kernel.org/r/20210325114228.27719-4-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Alexander Lobakin <alobakin@pm.me> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: David Miller <davem@davemloft.net> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds a new page allocator interface via alloc_pages_bulk, and
__alloc_pages_bulk_nodemask. A caller requests a number of pages to be
allocated and added to a list.
The API is not guaranteed to return the requested number of pages and
may fail if the preferred allocation zone has limited free memory, the
cpuset changes during the allocation or page debugging decides to fail
an allocation. It's up to the caller to request more pages in batch if
necessary.
Note that this implementation is not very efficient and could be
improved but it would require refactoring. The intent is to make it
available early to determine what semantics are required by different
callers. Once the full semantics are nailed down, it can be refactored.
[mgorman@techsingularity.net: fix alloc_pages_bulk() return type, per Matthew] Link: https://lkml.kernel.org/r/20210325123713.GQ3697@techsingularity.net
[mgorman@techsingularity.net: fix uninit var warning] Link: https://lkml.kernel.org/r/20210330114847.GX3697@techsingularity.net
[mgorman@techsingularity.net: fix comment, per Vlastimil] Link: https://lkml.kernel.org/r/20210412110255.GV3697@techsingularity.net Link: https://lkml.kernel.org/r/20210325114228.27719-3-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Alexander Lobakin <alobakin@pm.me> Tested-by: Colin Ian King <colin.king@canonical.com> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: David Miller <davem@davemloft.net> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>