git.proxmox.com Git - mirror_ubuntu-kernels.git/log
22 months ago  jbd2,ocfs2: move jbd2_journal_submit_inode_data_buffers to ocfs2
Christoph Hellwig [Thu, 29 Dec 2022 16:10:29 +0000 (06:10 -1000)]
jbd2,ocfs2: move jbd2_journal_submit_inode_data_buffers to ocfs2

jbd2_journal_submit_inode_data_buffers is only used by ocfs2, so move it
there to prepare for removing generic_writepages.

Link: https://lkml.kernel.org/r/20221229161031.391878-5-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago  ntfs3: remove ->writepage
Christoph Hellwig [Thu, 29 Dec 2022 16:10:28 +0000 (06:10 -1000)]
ntfs3: remove ->writepage

->writepage is a very inefficient method to write back data, and only used
through write_cache_pages or as a fallback when no ->migrate_folio method
is present.

Set ->migrate_folio to the generic buffer_head based helper, and remove
the ->writepage implementation.

Link: https://lkml.kernel.org/r/20221229161031.391878-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago  ntfs3: stop using generic_writepages
Christoph Hellwig [Thu, 29 Dec 2022 16:10:27 +0000 (06:10 -1000)]
ntfs3: stop using generic_writepages

Open code the resident inode handling in ntfs_writepages by directly using
write_cache_pages to prepare for removing the ->writepage handler in ntfs3.

Link: https://lkml.kernel.org/r/20221229161031.391878-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago  fs: remove an outdated comment on mpage_writepages
Christoph Hellwig [Thu, 29 Dec 2022 16:10:26 +0000 (06:10 -1000)]
fs: remove an outdated comment on mpage_writepages

Patch series "remove generic_writepages"

This series removes generic_writepages by open coding the current
functionality in the three remaining callers.  Besides removing some
code, the main benefit is that one of the few remaining ->writepage
callers from outside the core page cache code goes away.

This patch (of 6):

mpage_writepages doesn't do any of the page locking itself, so remove an
outdated comment on the locking pattern there.

Link: https://lkml.kernel.org/r/20221229161031.391878-1-hch@lst.de
Link: https://lkml.kernel.org/r/20221229161031.391878-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago  mm/thp: check and bail out if page in deferred queue already
Yin Fengwei [Fri, 23 Dec 2022 13:52:07 +0000 (21:52 +0800)]
mm/thp: check and bail out if page in deferred queue already

A kernel build regression with LLVM was reported here:
https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/ with
commit f35b5d7d676e ("mm: align larger anonymous mappings on THP
boundaries"), and commit f35b5d7d676e was reverted.

It turned out that the regression is related to ld.lld calling
madvise(MADV_DONTNEED) with a len parameter that is not PMD_SIZE aligned.
trace-bpfcc captured:
531607  531732  ld.lld          do_madvise.part.0 start: 0x7feca9000000, len: 0x7fb000, behavior: 0x4
531607  531793  ld.lld          do_madvise.part.0 start: 0x7fec86a00000, len: 0x7fb000, behavior: 0x4

If the underlying physical page is a THP, madvise(MADV_DONTNEED) can
significantly raise split_queue_lock contention. perf showed the
following data:
    14.85%     0.00%  ld.lld           [kernel.kallsyms]           [k]
       entry_SYSCALL_64_after_hwframe
           11.52%
                entry_SYSCALL_64_after_hwframe
                do_syscall_64
                __x64_sys_madvise
                do_madvise.part.0
                zap_page_range
                unmap_single_vma
                unmap_page_range
                page_remove_rmap
                deferred_split_huge_page
                __lock_text_start
                native_queued_spin_lock_slowpath

If a THP can't be removed from rmap as a whole, it is removed partially,
sub-page by sub-page.  Even if the THP head page has already been added to
the deferred queue, split_queue_lock is still acquired to check whether
the THP head page is in the queue.  This is what raises the contention on
split_queue_lock.

Before acquiring split_queue_lock, check and bail out early if the THP
head page is in the queue already. The check without holding
split_queue_lock could race with deferred_split_scan, but it doesn't
impact the correctness here.
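
The check-before-lock idea can be illustrated with a small userspace toy.
This is only a sketch under stated assumptions, not the kernel code: the
toy_* names are hypothetical, the "lock" is just a counter of acquisitions,
and the model is single-threaded, so it says nothing about the race with
deferred_split_scan mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; none of these are the kernel's real types. */
struct toy_queue {
    int lock_acquisitions;   /* how often the "lock" was taken */
    size_t len;              /* number of queued THPs */
};

struct toy_thp {
    bool queued;             /* stands in for "head page is on the list" */
};

/* Unpatched behaviour: always take the lock, then check membership. */
void toy_deferred_split_locked(struct toy_queue *q, struct toy_thp *thp)
{
    assert(q != NULL && thp != NULL);
    q->lock_acquisitions++;          /* models spin_lock(split_queue_lock) */
    if (!thp->queued) {
        thp->queued = true;
        q->len++;
    }
    /* models spin_unlock(split_queue_lock) */
}

/* Patched behaviour: bail out on an unlocked check before locking. */
void toy_deferred_split_checked(struct toy_queue *q, struct toy_thp *thp)
{
    assert(q != NULL && thp != NULL);
    if (thp->queued)                 /* already queued: nothing to do */
        return;
    toy_deferred_split_locked(q, thp);
}
```

Calling toy_deferred_split_checked() once per sub-page of the same THP
takes the toy lock only for the first sub-page, whereas the locked variant
takes it every time.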

Test result of building kernel with ld.lld:
commit 7b5a0b664ebe (parent commit of f35b5d7d676e):
time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
        6:07.99 real,   26367.77 user,  5063.35 sys

commit f35b5d7d676e:
time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
        7:22.15 real,   26235.03 user,  12504.55 sys

commit f35b5d7d676e with the fixing patch:
time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
        6:08.49 real,   26520.15 user,  5047.91 sys

Link: https://lkml.kernel.org/r/20221223135207.2275317-1-fengwei.yin@intel.com
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago  mm/page_reporting: replace rcu_access_pointer() with rcu_dereference_protected()
SeongJae Park [Wed, 28 Dec 2022 17:59:42 +0000 (17:59 +0000)]
mm/page_reporting: replace rcu_access_pointer() with rcu_dereference_protected()

Page reporting fetches pr_dev_info using rcu_access_pointer(), which is
for safely fetching a pointer that will not be dereferenced but could be
concurrently updated.  The code indeed does not dereference pr_dev_info
after fetching it using rcu_access_pointer(), but it fetches the pointer
while concurrent updates to the pointer are prevented by holding the
update side lock, page_reporting_mutex.

In this case, rcu_dereference_protected() should be used instead because
it provides better readability and, in some cases, better performance, as
rcu_dereference_protected() avoids use of READ_ONCE().  Replace the
rcu_access_pointer() calls with rcu_dereference_protected().

Link: https://lkml.kernel.org/r/20221228175942.149491-1-sj@kernel.org
Fixes: 36e66c554b5c ("mm: introduce Reported pages")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago  mm: fix comment of page table counter
Kele Huang [Sat, 24 Dec 2022 06:02:33 +0000 (01:02 -0500)]
mm: fix comment of page table counter

Commit af5b0f6a09e42 ("mm: consolidate page table accounting")
consolidates page table accounting to a single counter in struct mm_struct
{} as mm->pgtables_bytes.  So the meaning of this counter should be the
size of all page tables now.

Link: https://lkml.kernel.org/r/20221224060233.417827-1-kele.huang@columbia.edu
Signed-off-by: Kele Huang <kele.huang@columbia.edu>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Colin Cross <ccross@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago  mm/mprotect: drop pgprot_t parameter from change_protection()
David Hildenbrand [Fri, 23 Dec 2022 15:56:16 +0000 (16:56 +0100)]
mm/mprotect: drop pgprot_t parameter from change_protection()

Being able to provide a custom protection opens the door for
inconsistencies and BUGs: for example, accidentally allowing for more
permissions than desired by other mechanisms (e.g., softdirty tracking).
vma->vm_page_prot should be the single source of truth.

Only PROT_NUMA is special: there is no way we can erroneously allow
for more permissions when removing all permissions. Special-case using
the MM_CP_PROT_NUMA flag.
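
As a rough illustration of the interface change, the toy below derives the
new protection from the VMA itself and special-cases only the NUMA flag.
It is a userspace sketch: the toy_* types, the TOY_* constants and the bit
encoding of protections are assumptions made for the example, not the
kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical protection bits; TOY_PROT_NONE stands in for PAGE_NONE. */
enum { TOY_PROT_NONE = 0, TOY_PROT_READ = 1, TOY_PROT_WRITE = 2 };

/* Hypothetical stand-in for the MM_CP_PROT_NUMA flag. */
#define TOY_CP_PROT_NUMA 0x1u

struct toy_vma {
    unsigned int vm_page_prot;   /* the single source of truth */
};

/*
 * No protection argument: callers can no longer pass a value that
 * disagrees with the VMA. Only NUMA hinting overrides it, and it can
 * only remove permissions.
 */
unsigned int toy_new_prot(const struct toy_vma *vma, unsigned int cp_flags)
{
    assert(vma != NULL);
    if (cp_flags & TOY_CP_PROT_NUMA)
        return TOY_PROT_NONE;
    return vma->vm_page_prot;
}
```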

[david@redhat.com: PAGE_NONE might not be defined without CONFIG_NUMA_BALANCING]
Link: https://lkml.kernel.org/r/5084ff1c-ebb3-f918-6a60-bacabf550a88@redhat.com
Link: https://lkml.kernel.org/r/20221223155616.297723-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago  mm/userfaultfd: rely on vma->vm_page_prot in uffd_wp_range()
David Hildenbrand [Fri, 23 Dec 2022 15:56:15 +0000 (16:56 +0100)]
mm/userfaultfd: rely on vma->vm_page_prot in uffd_wp_range()

Patch series "mm: uffd-wp + change_protection() cleanups".

Clean up page protection handling in uffd-wp when calling
change_protection() and improve unprotecting uffd-wp in private mappings,
trying to set PTEs writable again if possible just like we do during
mprotect() when upgrading write permissions.  Make the change_protection()
interface harder to get wrong :)

I consider both patches primarily cleanups, although patch #1 fixes a corner
case with uffd-wp and softdirty tracking for shmem.  @Peter, please let me
know if we should flag patch #1 as pure cleanup -- I have no idea how
important softdirty tracking on shmem is.

This patch (of 2):

uffd_wp_range() currently calculates page protection manually using
vm_get_page_prot().  This will ignore any other reason for active
writenotify: one mechanism applicable to shmem is softdirty tracking.

For example, the following sequence

1) Write to mapped shmem page
2) Clear softdirty
3) Register uffd-wp covering the mapped page
4) Unregister uffd-wp covering the mapped page
5) Write to page again

will not set the modified page softdirty, because uffd_wp_range() will
ignore that writenotify is required for softdirty tracking and simply map
the page writable again using change_protection().  Similarly, instead of
unregistering, protecting followed by un-protecting the page using uffd-wp
would result in the same situation.

Now that we enable writenotify whenever enabling uffd-wp on a VMA,
vma->vm_page_prot will already properly reflect our requirements: the
default is to write-protect all PTEs.  However, for shared mappings we
would now not remap the PTEs writable if possible when unprotecting, just
like for private mappings (COW).  To compensate, set
MM_CP_TRY_CHANGE_WRITABLE just like mprotect() does to try mapping
individual PTEs writable.

For private mappings, this change implies that we will now always try
setting PTEs writable when un-protecting, just like when upgrading write
permissions using mprotect(), which is an improvement.

For shared mappings, we will only set PTEs writable if
can_change_pte_writable()/can_change_pmd_writable() indicates that it's
ok.  For ordinary shmem, this will be the case when PTEs are dirty, which
should usually be the case -- otherwise we could special-case shmem in
can_change_pte_writable()/can_change_pmd_writable() easily, because shmem
itself doesn't require writenotify.

Note that hugetlb does not yet implement MM_CP_TRY_CHANGE_WRITABLE, so we
won't try setting PTEs writable when unprotecting or when unregistering
uffd-wp.  This can be added later on top by implementing
MM_CP_TRY_CHANGE_WRITABLE.

While commit ffd05793963a ("userfaultfd: wp: support write protection for
userfault vma range") introduced that code, it should only be applicable
to uffd-wp on shared mappings -- shmem (hugetlb does not support softdirty
tracking).  I don't think this corner case justifies cc'ing stable.  Let's
just handle it correctly and prepare for change_protection() cleanups.

[david@redhat.com: no need for additional harmless checks if we're wr-protecting either way]
Link: https://lkml.kernel.org/r/71412742-a71f-9c74-865f-773ad83db7a5@redhat.com
Link: https://lkml.kernel.org/r/20221223155616.297723-1-david@redhat.com
Link: https://lkml.kernel.org/r/20221223155616.297723-2-david@redhat.com
Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs")
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago  selftests/vm: ksm_functional_tests: fix a typo in comment
Xu Panda [Fri, 23 Dec 2022 02:50:24 +0000 (10:50 +0800)]
selftests/vm: ksm_functional_tests: fix a typo in comment

Fix a typo of "comaring" which should be "comparing".

Link: https://lkml.kernel.org/r/202212231050245952617@zte.com.cn
Signed-off-by: Xu Panda <xu.panda@zte.com.cn>
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago  mm: multi-gen LRU: simplify arch_has_hw_pte_young() check
Yu Zhao [Thu, 22 Dec 2022 04:19:06 +0000 (21:19 -0700)]
mm: multi-gen LRU: simplify arch_has_hw_pte_young() check

Scanning page tables when hardware does not set the accessed bit has
no real use cases.

Link: https://lkml.kernel.org/r/20221222041905.2431096-9-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago  mm: multi-gen LRU: clarify scan_control flags
Yu Zhao [Thu, 22 Dec 2022 04:19:05 +0000 (21:19 -0700)]
mm: multi-gen LRU: clarify scan_control flags

Among the flags in scan_control:
1. sc->may_swap, which indicates swap constraint due to memsw.max, is
   supported as usual.
2. sc->proactive, which indicates reclaim by memory.reclaim, may not
   opportunistically skip the aging path, since it is considered less
   latency sensitive.
3. !(sc->gfp_mask & __GFP_IO), which indicates IO constraint, lowers
   swappiness to prioritize file LRU, since clean file folios are more
   likely to exist.
4. sc->may_writepage and sc->may_unmap, which indicate opportunistic
   reclaim, are rejected, since unmapped clean folios are already
   prioritized. Scanning for more of them is likely futile and can
   cause high reclaim latency when there is a large number of memcgs.

The rest are handled by the existing code.
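
Items 1, 3 and 4 above can be sketched as two small decision functions.
This is a userspace toy under stated assumptions: the toy_* names are
hypothetical, io_allowed stands in for (sc->gfp_mask & __GFP_IO), and
lowering swappiness to exactly 1 is an assumption of the sketch. Item 2
(sc->proactive) is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, trimmed-down stand-in for struct scan_control. */
struct toy_scan_control {
    bool may_swap;
    bool may_writepage;
    bool may_unmap;
    bool io_allowed;     /* models sc->gfp_mask & __GFP_IO */
};

/* Item 4: requests that forbid writepage or unmap are rejected. */
bool toy_reclaim_allowed(const struct toy_scan_control *sc)
{
    return sc->may_writepage && sc->may_unmap;
}

/* Items 1 and 3: derive the effective swappiness for this request. */
int toy_swappiness(const struct toy_scan_control *sc, int configured)
{
    assert(configured >= 0);
    if (!sc->may_swap)
        return 0;                    /* swap constraint: no anon reclaim */
    if (configured && !sc->io_allowed)
        return 1;                    /* IO constraint: favour the file LRU */
    return configured;
}
```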

Link: https://lkml.kernel.org/r/20221222041905.2431096-8-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago  mm: multi-gen LRU: per-node lru_gen_folio lists
Yu Zhao [Thu, 22 Dec 2022 04:19:04 +0000 (21:19 -0700)]
mm: multi-gen LRU: per-node lru_gen_folio lists

For each node, memcgs are divided into two generations: the old and
the young. For each generation, memcgs are randomly sharded into
multiple bins to improve scalability. For each bin, an RCU hlist_nulls
is virtually divided into three segments: the head, the tail and the
default.

An onlining memcg is added to the tail of a random bin in the old
generation. The eviction starts at the head of a random bin in the old
generation. The per-node memcg generation counter, whose remainder (mod
2) indexes the old generation, is incremented when all its bins become
empty.

There are four operations:
1. MEMCG_LRU_HEAD, which moves an memcg to the head of a random bin in
   its current generation (old or young) and updates its "seg" to
   "head";
2. MEMCG_LRU_TAIL, which moves an memcg to the tail of a random bin in
   its current generation (old or young) and updates its "seg" to
   "tail";
3. MEMCG_LRU_OLD, which moves an memcg to the head of a random bin in
   the old generation, updates its "gen" to "old" and resets its "seg"
   to "default";
4. MEMCG_LRU_YOUNG, which moves an memcg to the tail of a random bin
   in the young generation, updates its "gen" to "young" and resets
   its "seg" to "default".

The events that trigger the above operations are:
1. Exceeding the soft limit, which triggers MEMCG_LRU_HEAD;
2. The first attempt to reclaim an memcg below low, which triggers
   MEMCG_LRU_TAIL;
3. The first attempt to reclaim an memcg below reclaimable size
   threshold, which triggers MEMCG_LRU_TAIL;
4. The second attempt to reclaim an memcg below reclaimable size
   threshold, which triggers MEMCG_LRU_YOUNG;
5. Attempting to reclaim an memcg below min, which triggers
   MEMCG_LRU_YOUNG;
6. Finishing the aging on the eviction path, which triggers
   MEMCG_LRU_YOUNG;
7. Offlining an memcg, which triggers MEMCG_LRU_OLD.
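
The four operations can be modelled in userspace with plain doubly linked
lists. The sketch below is a toy under stated assumptions, not the kernel
implementation: all toy_* names are hypothetical, the bin count of 8 is
made up, the "random" bin is passed in by the caller so the behaviour is
deterministic, and RCU and locking are left out entirely.

```c
#include <assert.h>

enum { TOY_GENS = 2, TOY_BINS = 8 };   /* bin count is an assumption */
enum toy_seg { TOY_SEG_DEFAULT, TOY_SEG_HEAD, TOY_SEG_TAIL };
enum toy_op { TOY_LRU_HEAD, TOY_LRU_TAIL, TOY_LRU_OLD, TOY_LRU_YOUNG };

struct toy_node { struct toy_node *prev, *next; };

struct toy_memcg {
    struct toy_node link;
    int gen;              /* 0 or 1: generation the memcg sits in */
    enum toy_seg seg;
};

struct toy_memcg_lru {
    unsigned long seq;    /* seq % TOY_GENS indexes the old generation */
    int nr[TOY_GENS];     /* memcgs per generation, over all bins */
    struct toy_node bins[TOY_GENS][TOY_BINS];
};

void toy_lru_init(struct toy_memcg_lru *lru)
{
    lru->seq = 0;
    for (int g = 0; g < TOY_GENS; g++) {
        lru->nr[g] = 0;
        for (int b = 0; b < TOY_BINS; b++)
            lru->bins[g][b].prev = lru->bins[g][b].next = &lru->bins[g][b];
    }
}

static void toy_unlink(struct toy_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

/* Insert n right after pos; pos == list head means head insertion. */
static void toy_insert_after(struct toy_node *n, struct toy_node *pos)
{
    n->next = pos->next;
    n->prev = pos;
    pos->next->prev = n;
    pos->next = n;
}

/* Onlining: tail of the chosen bin in the old generation. */
void toy_online(struct toy_memcg_lru *lru, struct toy_memcg *m, int bin)
{
    assert(bin >= 0 && bin < TOY_BINS);
    m->gen = (int)(lru->seq % TOY_GENS);
    m->seg = TOY_SEG_DEFAULT;
    lru->nr[m->gen]++;
    toy_insert_after(&m->link, lru->bins[m->gen][bin].prev);
}

/* Apply one of the four operations to a memcg that is already online. */
void toy_rotate(struct toy_memcg_lru *lru, struct toy_memcg *m,
                enum toy_op op, int bin)
{
    int old = (int)(lru->seq % TOY_GENS);
    int from = m->gen;
    int to_head = 0;

    assert(bin >= 0 && bin < TOY_BINS);
    switch (op) {
    case TOY_LRU_HEAD:  m->seg = TOY_SEG_HEAD; to_head = 1; break;
    case TOY_LRU_TAIL:  m->seg = TOY_SEG_TAIL; break;
    case TOY_LRU_OLD:   m->gen = old; m->seg = TOY_SEG_DEFAULT; to_head = 1; break;
    case TOY_LRU_YOUNG: m->gen = (old + 1) % TOY_GENS; m->seg = TOY_SEG_DEFAULT; break;
    }

    toy_unlink(&m->link);
    struct toy_node *head = &lru->bins[m->gen][bin];
    toy_insert_after(&m->link, to_head ? head : head->prev);

    lru->nr[from]--;
    lru->nr[m->gen]++;
    if (lru->nr[old] == 0)    /* old generation drained: advance counter */
        lru->seq++;
}
```

Once every memcg has been moved to the young generation, the old
generation is empty, seq advances, and the former young generation becomes
the old one.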

Note that memcg LRU only applies to global reclaim, and the
round-robin incrementing of their max_seq counters ensures the
eventual fairness to all eligible memcgs. For memcg reclaim, it still
relies on mem_cgroup_iter().

Link: https://lkml.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago  mm: multi-gen LRU: shuffle should_run_aging()
Yu Zhao [Thu, 22 Dec 2022 04:19:03 +0000 (21:19 -0700)]
mm: multi-gen LRU: shuffle should_run_aging()

Move should_run_aging() next to its only caller left.

Link: https://lkml.kernel.org/r/20221222041905.2431096-6-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago  mm: multi-gen LRU: remove aging fairness safeguard
Yu Zhao [Thu, 22 Dec 2022 04:19:02 +0000 (21:19 -0700)]
mm: multi-gen LRU: remove aging fairness safeguard

Recall that the aging produces the youngest generation: first it scans
for accessed folios and updates their gen counters; then it increments
lrugen->max_seq.

The current aging fairness safeguard for kswapd uses two passes to
ensure the fairness to multiple eligible memcgs. On the first pass,
which is shared with the eviction, it checks whether all eligible
memcgs are low on cold folios. If so, it requires a second pass, on
which it ages all those memcgs at the same time.

With memcg LRU, the aging, while ensuring eventual fairness, will run
when necessary. Therefore the current aging fairness safeguard for
kswapd will not be needed.

Note that memcg LRU only applies to global reclaim. For memcg reclaim,
the aging can be unfair to different memcgs, i.e., their
lrugen->max_seq can be incremented at different paces.

Link: https://lkml.kernel.org/r/20221222041905.2431096-5-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago  mm: multi-gen LRU: remove eviction fairness safeguard
Yu Zhao [Thu, 22 Dec 2022 04:19:01 +0000 (21:19 -0700)]
mm: multi-gen LRU: remove eviction fairness safeguard

Recall that the eviction consumes the oldest generation: first it
bucket-sorts folios whose gen counters were updated by the aging and
reclaims the rest; then it increments lrugen->min_seq.

The current eviction fairness safeguard for global reclaim has a
dilemma: when there are multiple eligible memcgs, should it continue
or stop upon meeting the reclaim goal? If it continues, it overshoots
and increases direct reclaim latency; if it stops, it loses fairness
between memcgs it has taken memory away from and those it has yet to.

With memcg LRU, the eviction, while ensuring eventual fairness, will
stop upon meeting its goal. Therefore the current eviction fairness
safeguard for global reclaim will not be needed.

Note that memcg LRU only applies to global reclaim. For memcg reclaim,
the eviction will continue, even if it is overshooting. This becomes
unconditional due to code simplification.

Link: https://lkml.kernel.org/r/20221222041905.2431096-4-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago  mm: multi-gen LRU: rename lrugen->lists[] to lrugen->folios[]
Yu Zhao [Thu, 22 Dec 2022 04:19:00 +0000 (21:19 -0700)]
mm: multi-gen LRU: rename lrugen->lists[] to lrugen->folios[]

lru_gen_folio will be chained into per-node lists by the coming
lrugen->list.

Link: https://lkml.kernel.org/r/20221222041905.2431096-3-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago  mm: multi-gen LRU: rename lru_gen_struct to lru_gen_folio
Yu Zhao [Thu, 22 Dec 2022 04:18:59 +0000 (21:18 -0700)]
mm: multi-gen LRU: rename lru_gen_struct to lru_gen_folio

Patch series "mm: multi-gen LRU: memcg LRU", v3.

Overview
========

An memcg LRU is a per-node LRU of memcgs.  It is also an LRU of LRUs,
since each node and memcg combination has an LRU of folios (see
mem_cgroup_lruvec()).

Its goal is to improve the scalability of global reclaim, which is
critical to system-wide memory overcommit in data centers.  Note that
memcg reclaim is currently out of scope.

Its memory bloat is a pointer to each lruvec and negligible to each
pglist_data.  In terms of traversing memcgs during global reclaim, it
improves the best-case complexity from O(n) to O(1) and does not affect
the worst-case complexity O(n).  Therefore, on average, it has a sublinear
complexity in contrast to the current linear complexity.

The basic structure of an memcg LRU can be understood by an analogy to
the active/inactive LRU (of folios):
1. It has the young and the old (generations), i.e., the counterparts
   to the active and the inactive;
2. The increment of max_seq triggers promotion, i.e., the counterpart
   to activation;
3. Other events trigger similar operations, e.g., offlining an memcg
   triggers demotion, i.e., the counterpart to deactivation.

In terms of global reclaim, it has two distinct features:
1. Sharding, which allows each thread to start at a random memcg (in
   the old generation) and improves parallelism;
2. Eventual fairness, which allows direct reclaim to bail out at will
   and reduces latency without affecting fairness over some time.

The commit message in patch 6 details the workflow:
https://lore.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com/

The following is a simple test to quickly verify its effectiveness.

  Test design:
  1. Create multiple memcgs.
  2. Each memcg contains a job (fio).
  3. All jobs access the same amount of memory randomly.
  4. The system does not experience global memory pressure.
  5. Periodically write to the root memory.reclaim.

  Desired outcome:
  1. All memcgs have similar pgsteal counts, i.e., stddev(pgsteal)
     over mean(pgsteal) is close to 0%.
  2. The total pgsteal is close to the total requested through
     memory.reclaim, i.e., sum(pgsteal) over sum(requested) is close
     to 100%.

  Actual outcome [1]:
                                     MGLRU off    MGLRU on
  stddev(pgsteal) / mean(pgsteal)    75%          20%
  sum(pgsteal) / sum(requested)      425%         95%

  ####################################################################
  MEMCGS=128

  for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
      mkdir /sys/fs/cgroup/memcg$memcg
  done

  start() {
      echo $BASHPID > /sys/fs/cgroup/memcg$memcg/cgroup.procs

      fio -name=memcg$memcg --numjobs=1 --ioengine=mmap \
          --filename=/dev/zero --size=1920M --rw=randrw \
          --rate=64m,64m --random_distribution=random \
          --fadvise_hint=0 --time_based --runtime=10h \
          --group_reporting --minimal
  }

  for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
      start &
  done

  sleep 600

  for ((i = 0; i < 600; i++)); do
      echo 256m >/sys/fs/cgroup/memory.reclaim
      sleep 6
  done

  for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
      grep "pgsteal " /sys/fs/cgroup/memcg$memcg/memory.stat
  done
  ####################################################################

[1]: This was obtained from running the above script (touches less
     than 256GB memory) on an EPYC 7B13 with 512GB DRAM for over an
     hour.

This patch (of 8):

The new name lru_gen_folio will be more distinct from the coming
lru_gen_memcg.

Link: https://lkml.kernel.org/r/20221222041905.2431096-1-yuzhao@google.com
Link: https://lkml.kernel.org/r/20221222041905.2431096-2-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago  mm: vmalloc: replace BUG_ON() by WARN_ON_ONCE()
Uladzislau Rezki (Sony) [Thu, 22 Dec 2022 19:00:22 +0000 (20:00 +0100)]
mm: vmalloc: replace BUG_ON() by WARN_ON_ONCE()

Currently the vm_unmap_ram() function triggers a BUG() if an area is not
found.  Replace it with a WARN_ON_ONCE() warning and keep the machine
alive instead of stopping it.

The worst case is a memory leak.
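
The difference in failure handling can be sketched in userspace as follows.
This is a toy, not the kernel macro: toy_warn_on_once() and
toy_unmap_ram() are hypothetical names, and the sketch keeps one global
"already warned" flag, whereas the kernel's WARN_ON_ONCE() tracks this per
call site.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

int toy_warnings;    /* number of warnings actually printed */

/* Report a broken invariant the first time only; always return cond. */
bool toy_warn_on_once(bool cond, const char *what)
{
    static bool warned;

    assert(what != NULL);
    if (cond && !warned) {
        warned = true;
        toy_warnings++;
        fprintf(stderr, "WARNING: %s\n", what);
    }
    return cond;
}

/* area == NULL models "no vmap area found for this address". */
bool toy_unmap_ram(const void *area)
{
    if (toy_warn_on_once(area == NULL, "unmap of an unknown area"))
        return false;    /* worst case is a leak, but execution continues */
    /* ... the real unmapping work would go here ... */
    return true;
}
```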

Link: https://lkml.kernel.org/r/20221222190022.134380-3-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago  mm: vmalloc: avoid calling __find_vmap_area() twice in __vunmap()
Uladzislau Rezki (Sony) [Thu, 22 Dec 2022 19:00:20 +0000 (20:00 +0100)]
mm: vmalloc: avoid calling __find_vmap_area() twice in __vunmap()

Currently the __vunmap() path calls __find_vmap_area() twice.  Once on
entry to check that the area exists, then inside the remove_vm_area()
function which also performs a new search for the VA.

In order to improve it from a performance point of view we split
remove_vm_area() into two new parts:
  - find_unlink_vmap_area() that does a search and unlink from tree;
  - __remove_vm_area() that removes without searching.

In this case there is no functional change for remove_vm_area()
whereas vm_remove_mappings(), where a second search happens, switches to
the __remove_vm_area() variant where the already detached VA is passed as
a parameter, so there is no need to find it again.
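
The effect of merging the search and the unlink can be shown with a
userspace toy that counts lock round trips. It is a sketch under stated
assumptions: the toy_* names are hypothetical, a fixed array replaces the
kernel's tree, and the "lock" is only a counter.

```c
#include <assert.h>
#include <stddef.h>

enum { TOY_MAX_AREAS = 16 };

struct toy_vmap {
    unsigned long addr[TOY_MAX_AREAS];
    int used[TOY_MAX_AREAS];
    int lock_acquisitions;     /* models trips through vmap_area_lock */
};

/* Lookup only: one lock round trip, the area stays linked. */
int toy_find(struct toy_vmap *v, unsigned long addr)
{
    assert(v != NULL);
    v->lock_acquisitions++;
    for (int i = 0; i < TOY_MAX_AREAS; i++)
        if (v->used[i] && v->addr[i] == addr)
            return i;
    return -1;
}

/* Lookup and unlink under the same lock round trip. */
int toy_find_unlink(struct toy_vmap *v, unsigned long addr)
{
    assert(v != NULL);
    v->lock_acquisitions++;
    for (int i = 0; i < TOY_MAX_AREAS; i++) {
        if (v->used[i] && v->addr[i] == addr) {
            v->used[i] = 0;
            return i;
        }
    }
    return -1;
}

/* Old shape: check that the area exists, then search again to remove it. */
int toy_vunmap_two_step(struct toy_vmap *v, unsigned long addr)
{
    if (toy_find(v, addr) < 0)
        return -1;
    return toy_find_unlink(v, addr);
}

/* New shape: a single search that also detaches the area. */
int toy_vunmap_one_step(struct toy_vmap *v, unsigned long addr)
{
    return toy_find_unlink(v, addr);
}
```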

Performance-wise, I use test_vmalloc.sh with 32 threads doing alloc/free
on a 64-CPU x86_64 box:

perf without this patch:
-   31.41%     0.50%  vmalloc_test/10  [kernel.vmlinux]    [k] __vunmap
   - 30.92% __vunmap
      - 17.67% _raw_spin_lock
           native_queued_spin_lock_slowpath
      - 12.33% remove_vm_area
         - 11.79% free_vmap_area_noflush
            - 11.18% _raw_spin_lock
                 native_queued_spin_lock_slowpath
        0.76% free_unref_page

perf with this patch:
-   11.35%     0.13%  vmalloc_test/14  [kernel.vmlinux]    [k] __vunmap
   - 11.23% __vunmap
      - 8.28% find_unlink_vmap_area
         - 7.95% _raw_spin_lock
              7.44% native_queued_spin_lock_slowpath
      - 1.93% free_vmap_area_noflush
         - 0.56% _raw_spin_lock
              0.53% native_queued_spin_lock_slowpath
        0.60% __vunmap_range_noflush

__vunmap() consumes around 20% fewer CPU cycles on this test.

Also, switch from find_vmap_area() to find_unlink_vmap_area() to prevent a
double access to the vmap_area_lock: one for finding the area, and a second
one for unlinking it from the tree.

[urezki@gmail.com: switch to find_unlink_vmap_area() in vm_unmap_ram()]
Link: https://lkml.kernel.org/r/20221222190022.134380-2-urezki@gmail.com
Link: https://lkml.kernel.org/r/20221222190022.134380-1-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reported-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago  mm: move FOLL_* defs to mm_types.h
David Howells [Wed, 21 Dec 2022 21:24:54 +0000 (21:24 +0000)]
mm: move FOLL_* defs to mm_types.h

Move FOLL_* definitions to linux/mm_types.h to make them more accessible
without having to drag in all of linux/mm.h and everything that drags in
too[1].

Link: https://lkml.kernel.org/r/2161258.1671657894@warthog.procyon.org.uk
Signed-off-by: David Howells <dhowells@redhat.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago mm: new primitive kvmemdup()
Hao Sun [Wed, 21 Dec 2022 14:42:45 +0000 (22:42 +0800)]
mm: new primitive kvmemdup()

Similar to kmemdup(), but supports a large number of bytes by using
kvmalloc(), and does *not* guarantee that the result will be physically
contiguous.  Use it only in cases where kvmalloc() is needed, and free the
result with kvfree().  Also adapt policy_unpack.c in case someone bisects
into this.
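
In userspace terms, the helper follows the familiar allocate-then-copy
pattern.  A minimal sketch, assuming malloc()/free() as stand-ins for
kvmalloc()/kvfree() and using the hypothetical name memdup_sketch(); the
kernel function also takes allocation flags, which have no equivalent
here.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Userspace analogue of the allocate-and-copy pattern (hypothetical name).
 * Returns a newly allocated copy of src, or NULL if the allocation fails.
 */
static void *memdup_sketch(const void *src, size_t len)
{
	void *p = malloc(len);	/* stands in for kvmalloc() */

	if (p)
		memcpy(p, src, len);
	return p;
}
```

The caller owns the returned buffer and releases it with free(), the
counterpart of kvfree() in this analogy.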

Link: https://lkml.kernel.org/r/20221221144245.27164-1-sunhao.th@gmail.com
Signed-off-by: Hao Sun <sunhao.th@gmail.com>
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Nick Terrell <terrelln@fb.com>
Cc: John Johansen <john.johansen@canonical.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: James Morris <jmorris@namei.org>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago mm/swap: convert deactivate_page() to folio_deactivate()
Vishal Moola (Oracle) [Wed, 21 Dec 2022 18:08:48 +0000 (10:08 -0800)]
mm/swap: convert deactivate_page() to folio_deactivate()

deactivate_page() has already been converted to use folios.  This change
converts it to take a folio argument instead of calling page_folio().  It
also renames the function to folio_deactivate() to be more consistent with
other folio functions.

[akpm@linux-foundation.org: fix left-over comments, per Yu Zhao]
Link: https://lkml.kernel.org/r/20221221180848.20774-5-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago mm/damon: convert damon_pa_mark_accessed_or_deactivate() to use folios
Vishal Moola (Oracle) [Wed, 21 Dec 2022 18:08:47 +0000 (10:08 -0800)]
mm/damon: convert damon_pa_mark_accessed_or_deactivate() to use folios

This change replaces 2 calls to compound_head() from put_page() and 1 call
from mark_page_accessed() with one from page_folio().  This is in
preparation for the conversion of deactivate_page() to folio_deactivate().

Link: https://lkml.kernel.org/r/20221221180848.20774-4-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago madvise: convert madvise_cold_or_pageout_pte_range() to use folios
Vishal Moola (Oracle) [Wed, 21 Dec 2022 18:08:46 +0000 (10:08 -0800)]
madvise: convert madvise_cold_or_pageout_pte_range() to use folios

This change removes a number of calls to compound_head(), and saves
1729 bytes of kernel text.

Link: https://lkml.kernel.org/r/20221221180848.20774-3-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago mm/memory: add vm_normal_folio()
Vishal Moola (Oracle) [Wed, 21 Dec 2022 18:08:45 +0000 (10:08 -0800)]
mm/memory: add vm_normal_folio()

Patch series "Convert deactivate_page() to folio_deactivate()", v4.

Deactivate_page() has already been converted to use folios.  This patch
series modifies the callers of deactivate_page() to use folios.  It also
introduces vm_normal_folio() to assist with folio conversions, and
converts deactivate_page() to folio_deactivate() which takes in a folio.

This patch (of 4):

Introduce a wrapper function called vm_normal_folio().  This function
calls vm_normal_page() and returns the folio of the page found, or null if
no page is found.

This function allows callers to get a folio from a pte, which will
eventually allow them to completely replace their struct page variables
with struct folio instead.
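
The shape of such a wrapper can be sketched in userspace.  The types and
helpers below (struct page, struct folio, lookup_page(), lookup_folio())
are toy stand-ins assumed for illustration only; they model "look up the
page, then return its folio or NULL" and are not the kernel definitions.

```c
#include <stddef.h>

/* Toy stand-ins for the kernel types; only what the sketch needs. */
struct folio {
	int id;
};

struct page {
	struct folio *folio;	/* the folio this page belongs to */
};

/* Stand-in for the page lookup: NULL models "no normal page here". */
static struct page *lookup_page(struct page *table, size_t nr, size_t idx)
{
	return idx < nr ? &table[idx] : NULL;
}

/* Wrapper: perform the page lookup, then return its folio or NULL. */
static struct folio *lookup_folio(struct page *table, size_t nr, size_t idx)
{
	struct page *page = lookup_page(table, nr, idx);

	return page ? page->folio : NULL;
}
```

Callers written against the wrapper only ever see a folio pointer, so
they no longer need a struct page variable of their own.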

Link: https://lkml.kernel.org/r/20221221180848.20774-1-vishal.moola@gmail.com
Link: https://lkml.kernel.org/r/20221221180848.20774-2-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago maple_tree: refine mab_calc_split function
Vernon Yang [Wed, 21 Dec 2022 06:00:58 +0000 (14:00 +0800)]
maple_tree: refine mab_calc_split function

Invert the mid_split condition so that the return statement is the last
statement of the function, which is easier to read and understand.

Link: https://lkml.kernel.org/r/20221221060058.609003-8-vernon2gm@gmail.com
Signed-off-by: Vernon Yang <vernon2gm@gmail.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago maple_tree: refine ma_state init from mas_start()
Vernon Yang [Wed, 21 Dec 2022 06:00:57 +0000 (14:00 +0800)]
maple_tree: refine ma_state init from mas_start()

If mas->node is MAS_START, there are three cases, and they all assign
different values to mas->node and mas->offset.  So there is no need to set
them to a default value before updating.

Update them directly, which makes the code easier to read and understand.

Link: https://lkml.kernel.org/r/20221221060058.609003-7-vernon2gm@gmail.com
Signed-off-by: Vernon Yang <vernon2gm@gmail.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago maple_tree: remove the redundant code
Vernon Yang [Wed, 21 Dec 2022 06:00:56 +0000 (14:00 +0800)]
maple_tree: remove the redundant code

No one uses the macro CONFIG_DEBUG_MAPLE_TREE_VERBOSE, and the functions
mas_dup_tree() and mas_dup_store() are only declared, never implemented,
so drop them.

Link: https://lkml.kernel.org/r/20221221060058.609003-6-vernon2gm@gmail.com
Signed-off-by: Vernon Yang <vernon2gm@gmail.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago maple_tree: use macro MA_ROOT_PARENT instead of number
Vernon Yang [Wed, 21 Dec 2022 06:00:55 +0000 (14:00 +0800)]
maple_tree: use macro MA_ROOT_PARENT instead of number

When checking whether node->parent is the parent of the root node, using
the macro MA_ROOT_PARENT instead of a bare number is easier to read and
understand.

Link: https://lkml.kernel.org/r/20221221060058.609003-5-vernon2gm@gmail.com
Signed-off-by: Vernon Yang <vernon2gm@gmail.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago maple_tree: use mt_node_max() instead of direct operations mt_max[]
Vernon Yang [Wed, 21 Dec 2022 06:00:54 +0000 (14:00 +0800)]
maple_tree: use mt_node_max() instead of direct operations mt_max[]

Use mt_node_max() to get the maximum number of slots for a node rather
than indexing mt_max[] directly, which improves portability.

Link: https://lkml.kernel.org/r/20221221060058.609003-4-vernon2gm@gmail.com
Signed-off-by: Vernon Yang <vernon2gm@gmail.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago maple_tree: remove extra return statement
Vernon Yang [Wed, 21 Dec 2022 06:00:53 +0000 (14:00 +0800)]
maple_tree: remove extra return statement

For functions with a return type of void, it is unnecessary to
add a return statement at the end of the function, so drop it.

Link: https://lkml.kernel.org/r/20221221060058.609003-3-vernon2gm@gmail.com
Signed-off-by: Vernon Yang <vernon2gm@gmail.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago maple_tree: remove extra space and blank line
Vernon Yang [Wed, 21 Dec 2022 06:00:52 +0000 (14:00 +0800)]
maple_tree: remove extra space and blank line

Patch series "Clean up and refinement for maple tree", v2.

This patchset cleans up and refines some maple tree code.  A few small
changes make the code easier to read and understand.

This patch (of 7):

These extra spaces and blank lines are unnecessary, so drop them.

Link: https://lkml.kernel.org/r/20221221060058.609003-1-vernon2gm@gmail.com
Link: https://lkml.kernel.org/r/20221221060058.609003-2-vernon2gm@gmail.com
Signed-off-by: Vernon Yang <vernon2gm@gmail.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago mm: vmalloc: correct use of __GFP_NOWARN mask in __vmalloc_area_node()
Lorenzo Stoakes [Mon, 19 Dec 2022 12:36:59 +0000 (12:36 +0000)]
mm: vmalloc: correct use of __GFP_NOWARN mask in __vmalloc_area_node()

This function sets __GFP_NOWARN in the gfp_mask rendering the warn_alloc()
invocations no-ops.  Remove this and instead rely on this flag being set
only for the vm_area_alloc_pages() function, ensuring it is cleared for
each of the warn_alloc() calls.
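
The difference between the two shapes can be sketched in userspace.  The
flag value, the always-failing allocator, and all names below
(SK_GFP_NOWARN, warn_alloc_sketch(), area_alloc_buggy(),
area_alloc_fixed()) are hypothetical and exist only to show how a
suppress-warnings bit in a shared mask silences the caller's own warning.

```c
#include <stdbool.h>

#define SK_GFP_NOWARN 0x1u	/* hypothetical flag bit for the sketch */

static unsigned int warnings_emitted;

/* Stand-in for a warning helper that honors the NOWARN bit. */
static void warn_alloc_sketch(unsigned int gfp_mask)
{
	if (gfp_mask & SK_GFP_NOWARN)
		return;		/* warning suppressed by the flag */
	warnings_emitted++;
}

/* Stand-in for the page allocation; always fails for demonstration. */
static bool alloc_pages_sketch(unsigned int gfp_mask)
{
	(void)gfp_mask;
	return false;
}

/* Buggy shape: the flag is ORed into the mask that the warning also sees. */
static bool area_alloc_buggy(unsigned int gfp_mask)
{
	gfp_mask |= SK_GFP_NOWARN;
	if (!alloc_pages_sketch(gfp_mask)) {
		warn_alloc_sketch(gfp_mask);	/* becomes a no-op */
		return false;
	}
	return true;
}

/* Fixed shape: the flag is set only for the page allocation call. */
static bool area_alloc_fixed(unsigned int gfp_mask)
{
	if (!alloc_pages_sketch(gfp_mask | SK_GFP_NOWARN)) {
		warn_alloc_sketch(gfp_mask & ~SK_GFP_NOWARN);
		return false;
	}
	return true;
}
```

With the buggy shape the failure goes unreported; with the fixed shape
exactly one warning is emitted for the same failure.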

Link: https://lkml.kernel.org/r/20221219123659.90614-1-lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago tools/vm/page_owner_sort: free memory before exit
Jianlin Lv [Mon, 19 Dec 2022 16:49:17 +0000 (16:49 +0000)]
tools/vm/page_owner_sort: free memory before exit

Although the kernel reclaims the memory associated with a process when it
terminates, it is neither good style nor proper design to leave that to
the kernel.  This patch frees allocated memory before the process exits.

Link: https://lkml.kernel.org/r/20221219164917.14132-1-iecedge@gmail.com
Signed-off-by: Jianlin Lv <iecedge@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago kasan: allow sampling page_alloc allocations for HW_TAGS
Andrey Konovalov [Mon, 19 Dec 2022 18:09:18 +0000 (19:09 +0100)]
kasan: allow sampling page_alloc allocations for HW_TAGS

As Hardware Tag-Based KASAN is intended to be used in production, its
performance impact is crucial.  As page_alloc allocations tend to be big,
tagging and checking all such allocations can introduce a significant
slowdown.

Add two new boot parameters that help alleviate that slowdown:

- kasan.page_alloc.sample, which makes Hardware Tag-Based KASAN tag only
  every Nth page_alloc allocation with the order configured by the second
  added parameter (default: tag every such allocation).

- kasan.page_alloc.sample.order, which makes sampling enabled by the first
  parameter only affect page_alloc allocations with the order equal or
  greater than the specified value (default: 3, see below).
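
How the two parameters interact can be sketched as a single decision
function.  This is a userspace illustration under stated assumptions: the
variable and function names are hypothetical, a single global counter is
used for simplicity, and the kernel's own bookkeeping may differ.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the two boot parameters. */
static unsigned long sample_interval = 10;	/* kasan.page_alloc.sample */
static unsigned int sample_order = 3;		/* kasan.page_alloc.sample.order */

/* Allocations still to be skipped before the next one is tagged. */
static unsigned long skip_left;

/* Returns true if an allocation of the given order should be tagged. */
static bool should_tag_page_alloc(unsigned int order)
{
	if (sample_interval <= 1)	/* sampling disabled: tag everything */
		return true;
	if (order < sample_order)	/* below the order threshold: always tag */
		return true;
	if (skip_left == 0) {		/* tag this one, then skip interval - 1 */
		skip_left = sample_interval - 1;
		return true;
	}
	skip_left--;
	return false;
}
```

With an interval of 10 and an order threshold of 3, every allocation of
order 0 to 2 is tagged, while only one in ten allocations of order 3 or
higher is.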

The exact performance improvement caused by using the new parameters
depends on their values and the applied workload.

The chosen default value for kasan.page_alloc.sample.order is 3, which
matches both PAGE_ALLOC_COSTLY_ORDER and SKB_FRAG_PAGE_ORDER.  This is
done for two reasons:

1. PAGE_ALLOC_COSTLY_ORDER is "the order at which allocations are deemed
   costly to service", which corresponds to the idea that only large and
   thus costly allocations are supposed to be sampled.

2. One of the workloads targeted by this patch is a benchmark that sends
   a large amount of data over a local loopback connection. Most multi-page
   data allocations in the networking subsystem have the order of
   SKB_FRAG_PAGE_ORDER (or PAGE_ALLOC_COSTLY_ORDER).

When running a local loopback test on a testing MTE-enabled device in sync
mode, enabling Hardware Tag-Based KASAN introduces a ~50% slowdown.
Applying this patch and setting kasan.page_alloc.sample to a value higher
than 1 reduces the slowdown.  The performance improvement saturates around
a sampling interval of 10 with the default sampling page order of 3, which
lowers the slowdown to ~20%.  The slowdown in real scenarios involving the
network will likely be smaller.

Enabling page_alloc sampling has a downside: KASAN misses bad accesses to
a page_alloc allocation that has not been tagged.  This lowers the value
of KASAN as a security mitigation.

However, based on measuring the number of page_alloc allocations of
different orders during boot in a test build, sampling with the default
kasan.page_alloc.sample.order value affects only ~7% of allocations.  The
remaining ~93% of allocations are still checked deterministically.

Link: https://lkml.kernel.org/r/129da0614123bb85ed4dd61ae30842b2dd7c903f.1671471846.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Mark Brand <markbrand@google.com>
Cc: Peter Collingbourne <pcc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago swap: avoid holding swap reference in swap_cache_get_folio
Kairui Song [Mon, 19 Dec 2022 18:58:40 +0000 (02:58 +0800)]
swap: avoid holding swap reference in swap_cache_get_folio

All its callers either already hold a reference to, or lock the swap
device while calling this function.  The only exception is
shmem_swapin_folio; make this caller hold a reference to the swap device
as well, so that this helper can be simplified and save a few cycles.

This also provides finer control of error handling in shmem_swapin_folio:
on a race (with swapoff), it can just try again, and for an invalid swap
entry, it can fail with a proper error code.

Link: https://lkml.kernel.org/r/20221219185840.25441-5-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago swap: fold swap_ra_clamp_pfn into swap_ra_info
Kairui Song [Mon, 19 Dec 2022 18:58:39 +0000 (02:58 +0800)]
swap: fold swap_ra_clamp_pfn into swap_ra_info

This makes the code cleaner.  This helper consists of only two lines of
self-explanatory code and is not reused anywhere else.

This also makes the compiled object slightly smaller.

bloat-o-meter results on x86_64 of mm/swap_state.o:

add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-35 (-35)
Function                                     old     new   delta
swap_ra_info.constprop                       512     477     -35
Total: Before=8388, After=8353, chg -0.42%

Link: https://lkml.kernel.org/r/20221219185840.25441-4-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago swap: avoid a redundant pte map if ra window is 1
Kairui Song [Mon, 19 Dec 2022 18:58:38 +0000 (02:58 +0800)]
swap: avoid a redundant pte map if ra window is 1

Avoid a redundant pte map/unmap when swap readahead window is 1.

Link: https://lkml.kernel.org/r/20221219185840.25441-3-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago swapfile: get rid of volatile and avoid redundant read
Kairui Song [Mon, 19 Dec 2022 18:58:37 +0000 (02:58 +0800)]
swapfile: get rid of volatile and avoid redundant read

Patch series "Clean up and fixes for swap", v2.

This series cleans up some code paths, saves a few cycles and reduces the
object size by a bit.  It also fixes some rare race issue with statistics.

This patch (of 4):

Convert a volatile variable to the more readable READ_ONCE().  This also
keeps the code from redundantly reading the variable twice when it races.
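
The read-once idea can be sketched in userspace.  The macro below is a
simplified stand-in, assuming a GCC or Clang compiler for __typeof__, and
the variable and function names are hypothetical; it is not the kernel's
READ_ONCE() definition.

```c
/* Simplified read-once helper: forces exactly one volatile load of x. */
#define READ_ONCE_SKETCH(x) (*(const volatile __typeof__(x) *)&(x))

/* Shared value that other threads may change concurrently. */
static unsigned char shared_count;

/*
 * Load the shared value once into a local and base every later decision
 * on that snapshot, instead of re-reading the variable for each use.
 */
static int count_snapshot_is_free(unsigned char *snapshot)
{
	unsigned char count = READ_ONCE_SKETCH(shared_count);

	*snapshot = count;
	return count == 0;
}
```

Because the test and the reported value both come from the same local
copy, a concurrent update cannot make them disagree, and the variable is
loaded only once.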

Link: https://lkml.kernel.org/r/20221219185840.25441-1-ryncsn@gmail.com
Link: https://lkml.kernel.org/r/20221219185840.25441-2-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago Docs/ABI/damon: document scheme filters files
SeongJae Park [Mon, 5 Dec 2022 23:08:30 +0000 (23:08 +0000)]
Docs/ABI/damon: document scheme filters files

Document the newly added DAMON sysfs interface files for DAMOS filtering in
the DAMON ABI document.

Link: https://lkml.kernel.org/r/20221205230830.144349-12-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago Docs/admin-guide/mm/damon/usage: document DAMOS filters of sysfs
SeongJae Park [Mon, 5 Dec 2022 23:08:29 +0000 (23:08 +0000)]
Docs/admin-guide/mm/damon/usage: document DAMOS filters of sysfs

Document the newly added files for DAMOS filters in the DAMON usage
document.

Link: https://lkml.kernel.org/r/20221205230830.144349-11-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago selftests/damon/sysfs: test filters directory
SeongJae Park [Mon, 5 Dec 2022 23:08:28 +0000 (23:08 +0000)]
selftests/damon/sysfs: test filters directory

Add simple test cases for scheme filters of DAMON sysfs interface.  The
test cases check that the files are populated as expected, accept some
valid inputs, and reject some invalid inputs.

Link: https://lkml.kernel.org/r/20221205230830.144349-10-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago mm/damon/sysfs-schemes: implement scheme filters
SeongJae Park [Mon, 5 Dec 2022 23:08:27 +0000 (23:08 +0000)]
mm/damon/sysfs-schemes: implement scheme filters

Implement the scheme filters functionality of the DAMON sysfs interface by
making the code read the values of the files under the filter directories
and pass them to DAMON using the DAMON kernel API.

[sj@kernel.org: fix leaking a filter for wrong cgroup path]
Link: https://lkml.kernel.org/r/20221219171807.55708-2-sj@kernel.org
[sj@kernel.org: return an error for filter memcg path id lookup failure]
Link: https://lkml.kernel.org/r/20221219171807.55708-3-sj@kernel.org
Link: https://lkml.kernel.org/r/20221205230830.144349-9-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago mm/damon/sysfs-schemes: connect filter directory and filters directory
SeongJae Park [Mon, 5 Dec 2022 23:08:26 +0000 (23:08 +0000)]
mm/damon/sysfs-schemes: connect filter directory and filters directory

Implement the 'nr_filters' file under the 'filters' directory, which will
be used to populate a specific number of 'filter' directories under that
directory, similar to other 'nr_*' files in the DAMON sysfs interface.

Link: https://lkml.kernel.org/r/20221205230830.144349-8-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago mm/damon/sysfs-schemes: implement filter directory
SeongJae Park [Mon, 5 Dec 2022 23:08:25 +0000 (23:08 +0000)]
mm/damon/sysfs-schemes: implement filter directory

Implement the DAMOS filter directory, which will be located under the
filters directory.  The directory provides three files, namely type,
matching, and memcg_path.  'type' and 'matching' will be directly connected
to the fields of 'struct damos_filter' that have the same names.
'memcg_path' will receive the path of the memory cgroup of interest, which
is later converted to a memcg id when it's committed.

Link: https://lkml.kernel.org/r/20221205230830.144349-7-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago mm/damon/sysfs-schemes: implement filters directory
SeongJae Park [Mon, 5 Dec 2022 23:08:24 +0000 (23:08 +0000)]
mm/damon/sysfs-schemes: implement filters directory

DAMOS filters are currently supported only by the DAMON kernel API.  To
expose the feature to user space, implement a DAMON sysfs directory named
'filters' under each scheme directory.  Please note that this implements
only the directory.  Following commits will implement more files and
directories, and finally connect the DAMOS filters feature.

Link: https://lkml.kernel.org/r/20221205230830.144349-6-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago Docs/admin-guide/damon/reclaim: document 'skip_anon' parameter
SeongJae Park [Mon, 5 Dec 2022 23:08:23 +0000 (23:08 +0000)]
Docs/admin-guide/damon/reclaim: document 'skip_anon' parameter

Document the newly added 'skip_anon' parameter of DAMON_RECLAIM, which can
be used to avoid reclaiming anonymous pages.

Link: https://lkml.kernel.org/r/20221205230830.144349-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago mm/damon/reclaim: add a parameter called skip_anon for avoiding anonymous pages recla...
SeongJae Park [Mon, 5 Dec 2022 23:08:22 +0000 (23:08 +0000)]
mm/damon/reclaim: add a parameter called skip_anon for avoiding anonymous pages reclamation

In some cases, for example if users have confidence in anonymous pages
management or the swap device is too slow, users would want to keep
DAMON_RECLAIM from swapping the anonymous pages out.  For such cases, add
yet another DAMON_RECLAIM parameter, namely 'skip_anon'.  When it is set to
'Y', DAMON_RECLAIM will avoid reclaiming anonymous pages using a DAMOS
filter.

Link: https://lkml.kernel.org/r/20221205230830.144349-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago mm/damon/paddr: support DAMOS filters
SeongJae Park [Mon, 5 Dec 2022 23:08:21 +0000 (23:08 +0000)]
mm/damon/paddr: support DAMOS filters

Implement support of the DAMOS filters in the physical address space
monitoring operations set, for all DAMOS actions that it supports
including 'pageout', 'lru_prio', and 'lru_deprio'.

Link: https://lkml.kernel.org/r/20221205230830.144349-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago mm/damon/core: implement damos filter
SeongJae Park [Mon, 5 Dec 2022 23:08:20 +0000 (23:08 +0000)]
mm/damon/core: implement damos filter

Patch series "implement DAMOS filtering for anon pages and/or specific
memory cgroups"

DAMOS lets users do system operations in a data access pattern oriented
way.  The data access pattern, which is extracted by DAMON, is somewhat
more accurate than what user space could know in many cases.  However, in
some situations, users could know something more than the kernel about the
pattern or some special requirements for some types of memory or
processes.  For example, some users would have slow swap devices and know
which processes are latency-critical, and therefore want to use DAMON-based
proactive reclamation (DAMON_RECLAIM) for only non-anonymous pages of
non-latency-critical processes.

For such restrictions, users could exclude the memory regions from the
initial monitoring regions and use a monitoring operations set that does
not dynamically update the monitoring regions, such as fvaddr or paddr.
They could also adjust the DAMOS target access pattern.  For dynamically
changing memory layouts and access patterns, those would not be enough.

To help with this case, add an interface, namely DAMOS filters, to the
DAMON kernel API (damon.h).  It can be used to keep the DAMOS actions from
being applied to specific types of memory.  At the moment, it supports
filtering anonymous pages and/or specific memory cgroups in or out for each
DAMOS scheme.

This patchset adds the support for all DAMOS actions that the 'paddr'
monitoring operations set supports ('pageout', 'lru_prio', and
'lru_deprio'), and the functionality is exposed via the DAMON kernel API
(damon.h), the DAMON sysfs interface (/sys/kernel/mm/damon/admin/), and
DAMON_RECLAIM module parameters.

Patches Sequence
----------------

First patch implements DAMOS filter interface to DAMON kernel API.  Second
patch makes the physical address space monitoring operations set to
support the filters from all supporting DAMOS actions.  Third patch adds
anonymous pages filter support to DAMON_RECLAIM, and the fourth patch
documents the DAMON_RECLAIM's new feature.  Fifth to seventh patches
implement DAMON sysfs files for support of the filters, and eighth patch
connects the file to use DAMOS filters feature.  Ninth patch adds simple
self test cases for DAMOS filters of the sysfs interface.  Finally,
following two patches (tenth and eleventh) document the new features and
interfaces.

This patch (of 11):

DAMOS lets users do system operations in a data access pattern oriented
way.  The data access pattern, which is extracted by DAMON, is somewhat
more accurate than what user space could know in many cases.  However, in
some situations, users could know something more than the kernel about the
pattern or some special requirements for some types of memory or
processes.  For example, some users would have slow swap devices and know
which processes are latency-critical, and therefore want to use DAMON-based
proactive reclamation (DAMON_RECLAIM) for only non-anonymous pages of
non-latency-critical processes.

For such restrictions, users could exclude the memory regions from the
initial monitoring regions and use a monitoring operations set that does
not dynamically update the monitoring regions, such as fvaddr or paddr.
They could also adjust the DAMOS target access pattern.  For dynamically
changing memory layouts and access patterns, those would not be enough.

To help with this case, add an interface, namely DAMOS filters, to the
DAMON kernel API (damon.h).  It can be used to keep the DAMOS actions from
being applied to specific types of memory.  At the moment, it supports
filtering anonymous pages and/or specific memory cgroups in or out for each
DAMOS scheme.
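
The filtering decision can be sketched as follows.  The types and names
(struct filter_sketch, struct page_sketch, filter_out()) are hypothetical
simplifications, and reading 'matching' as "filter out pages that match
(true) or pages that do not match (false)" is an assumption of this
sketch rather than a statement of the kernel's exact semantics.

```c
#include <stdbool.h>

/* Hypothetical, simplified mirror of a DAMOS filter. */
enum filter_type {
	FILTER_ANON,	/* match anonymous pages */
	FILTER_MEMCG,	/* match pages of one memory cgroup */
};

struct filter_sketch {
	enum filter_type type;
	bool matching;			/* filter out matching (true) or non-matching (false) pages */
	unsigned short memcg_id;	/* used only by FILTER_MEMCG */
};

/* Toy page description; a real implementation derives this from the page. */
struct page_sketch {
	bool anon;
	unsigned short memcg_id;
};

/* Returns true if the scheme's action should skip this page. */
static bool filter_out(const struct filter_sketch *f,
		       const struct page_sketch *p)
{
	bool matched = false;

	switch (f->type) {
	case FILTER_ANON:
		matched = p->anon;
		break;
	case FILTER_MEMCG:
		matched = p->memcg_id == f->memcg_id;
		break;
	}
	return matched == f->matching;
}
```

In this reading, an anon filter with matching set skips anonymous pages,
while a memcg filter with matching unset restricts the action to pages of
the given cgroup.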

Note that this commit adds only the interface to the DAMON kernel API.
The implementation should be made in the monitoring operations sets, and
following commits will add that.

Link: https://lkml.kernel.org/r/20221205230830.144349-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20221205230830.144349-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago mm: memcontrol: deprecate charge moving
Johannes Weiner [Wed, 7 Dec 2022 13:00:39 +0000 (14:00 +0100)]
mm: memcontrol: deprecate charge moving

Charge moving mode in cgroup1 allows memory to follow tasks as they
migrate between cgroups.  This is, and always has been, a questionable
thing to do - for several reasons.

First, it's expensive.  Pages need to be identified, locked and isolated
from various MM operations, and reassigned, one by one.

Second, it's unreliable.  Once pages are charged to a cgroup, there isn't
always a clear owner task anymore.  Cache isn't moved at all, for example.
Mapped memory is moved - but if trylocking or isolating a page fails,
it's arbitrarily left behind.  Frequent moving between domains may leave a
task's memory scattered all over the place.

Third, it isn't really needed.  Launcher tasks can kick off workload tasks
directly in their target cgroup.  Using dedicated per-workload groups
allows fine-grained policy adjustments - no need to move tasks and their
physical pages between control domains.  The feature was never
forward-ported to cgroup2, and it hasn't been missed.

Despite it being a niche usecase, the maintenance overhead of supporting
it is enormous.  Because pages are moved while they are live and subject
to various MM operations, the synchronization rules are complicated.
There are lock_page_memcg() in MM and FS code, which non-cgroup people
don't understand.  In some cases we've been able to shift code and cgroup
API calls around such that we can rely on native locking as much as
possible.  But that's fragile, and sometimes we need to hold MM locks for
longer than we otherwise would (pte lock e.g.).

Mark the feature deprecated. Hopefully we can remove it soon.

And backport into -stable kernels so that people who develop against
earlier kernels are warned about this deprecation as early as possible.

[akpm@linux-foundation.org: fix memory.rst underlining]
Link: https://lkml.kernel.org/r/Y5COd+qXwk/S+n8N@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago mm: rmap: remove lock_page_memcg()
Johannes Weiner [Tue, 6 Dec 2022 17:13:40 +0000 (18:13 +0100)]
mm: rmap: remove lock_page_memcg()

The previous patch made sure charge moving only touches pages for which
page_mapped() is stable.  lock_page_memcg() is no longer needed.

Link: https://lkml.kernel.org/r/20221206171340.139790-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago mm: memcontrol: skip moving non-present pages that are mapped elsewhere
Johannes Weiner [Tue, 6 Dec 2022 17:13:39 +0000 (18:13 +0100)]
mm: memcontrol: skip moving non-present pages that are mapped elsewhere

Patch series "mm: push down lock_page_memcg()", v2.

This patch (of 3):

During charge moving, the pte lock and the page lock cover nearly all
cases of stabilizing page_mapped().  The only exception is when we're
looking at a non-present pte and find a page in the page cache or in the
swapcache: if the page is mapped elsewhere, it can become unmapped outside
of our control.  For this reason, rmap needs lock_page_memcg().

We don't like cgroup-specific locks in generic MM code - especially in
performance-critical MM code - and for a legacy feature that's unlikely to
have many users left - if any.

So remove the exception.  Arguably that's better semantics anyway: the
page is shared, and another process seems to be the more active user.

Once we stop moving such pages, rmap doesn't need lock_page_memcg()
anymore.  The next patch will remove it.

Link: https://lkml.kernel.org/r/20221206171340.139790-1-hannes@cmpxchg.org
Link: https://lkml.kernel.org/r/20221206171340.139790-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Hugh Dickins <hughd@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago hugetlb: initialize variable to avoid compiler warning
Mike Kravetz [Fri, 16 Dec 2022 22:45:07 +0000 (14:45 -0800)]
hugetlb: initialize variable to avoid compiler warning

With the gcc 'maybe-uninitialized' warning enabled, gcc will produce:

  mm/hugetlb.c:6896:20: warning: `chg' may be used uninitialized

This is a false positive, but may be difficult for the compiler to
determine.  maybe-uninitialized is disabled by default, but this gets
flagged as a 0-DAY build regression.

Initialize the variable to silence the warning.
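
A minimal sketch of the general shape of code that can lead to this kind of
warning: a variable assigned under one condition and read under a second,
correlated condition.  The function and its names are hypothetical and are
not taken from mm/hugetlb.c; whether gcc actually warns depends on the
compiler version and how hard the two conditions are to correlate.

```c
#include <stdbool.h>

/*
 * Hypothetical helper, not the code in mm/hugetlb.c.  'chg' is written only
 * when need_reserve is true and read only when need_reserve is true, so it
 * is never used uninitialized.  A compiler that cannot correlate the two
 * branches may still warn, and the initializer below silences that.
 */
long reserve_or_release(bool need_reserve, long pages)
{
    long chg = 0;   /* initialized only to quiet -Wmaybe-uninitialized */

    if (need_reserve)
        chg = pages;

    /* ... unrelated work would sit here in real code ... */

    if (need_reserve)
        return chg;

    return -1;
}
```

The initial value is never observed on any path, so the initialization
changes no behavior; it only removes the diagnostic.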

Link: https://lkml.kernel.org/r/20221216224507.106789-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago mm: swap: convert mark_page_lazyfree() to folio_mark_lazyfree()
Kefeng Wang [Fri, 9 Dec 2022 02:06:18 +0000 (10:06 +0800)]
mm: swap: convert mark_page_lazyfree() to folio_mark_lazyfree()

mark_page_lazyfree() and its callers have been converted to use folios, so
rename it to folio_mark_lazyfree() and make it take a folio argument
instead of calling page_folio().

Link: https://lkml.kernel.org/r/20221209020618.190306-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago mm: huge_memory: convert madvise_free_huge_pmd to use a folio
Kefeng Wang [Wed, 7 Dec 2022 02:34:30 +0000 (10:34 +0800)]
mm: huge_memory: convert madvise_free_huge_pmd to use a folio

Using folios instead of pages removes several calls to compound_head().

Link: https://lkml.kernel.org/r/20221207023431.151008-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago cma: tracing: print alloc result in trace_cma_alloc_finish
Wenchao Hao [Thu, 8 Dec 2022 14:21:30 +0000 (22:21 +0800)]
cma: tracing: print alloc result in trace_cma_alloc_finish

The result of the allocation attempt is not printed in
trace_cma_alloc_finish, but it's important to do it so we can set filters
to catch specific errors on allocation or to trigger some operations on
specific errors.

The result is already printed in the kernel log, but that log message is
conditional and cannot be filtered by tracing events.

Printing this result introduces little overhead.  The result of the
allocation is named `errorno' in the trace.

Link: https://lkml.kernel.org/r/20221208142130.1501195-1-haowenchao@huawei.com
Signed-off-by: Wenchao Hao <haowenchao@huawei.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago lib/test_vmalloc.c: add parameter use_huge for fix_size_alloc_test
Qinglin Pan [Mon, 12 Dec 2022 05:56:57 +0000 (13:56 +0800)]
lib/test_vmalloc.c: add parameter use_huge for fix_size_alloc_test

Add a parameter `use_huge' for fix_size_alloc_test(), which can be used to
test allocation via vmalloc_huge for both functionality and performance.

Link: https://lkml.kernel.org/r/20221212055657.698420-1-panqinglin2020@iscas.ac.cn
Signed-off-by: Qinglin Pan <panqinglin2020@iscas.ac.cn>
Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago mm/mempolicy: do not duplicate policy if it is not applicable for set_mempolicy_home_node
Michal Hocko [Fri, 16 Dec 2022 19:45:37 +0000 (14:45 -0500)]
mm/mempolicy: do not duplicate policy if it is not applicable for set_mempolicy_home_node

set_mempolicy_home_node tries to duplicate a memory policy before checking
whether it is applicable for the operation.  There is no real reason for
doing that and it might actually be a pointless memory allocation and
deallocation exercise for MPOL_INTERLEAVE.

Not a big problem but we can do better. Simply check the policy before
acting on it.
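
The "check before duplicating" ordering can be sketched in user space as
below.  This is an illustration only: the toy_* types, the set of accepted
modes, and the error code are assumptions made for the example and are not
the kernel's mempolicy code.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel's policy modes and policy object. */
enum toy_mode { TOY_BIND, TOY_PREFERRED_MANY, TOY_INTERLEAVE };

struct toy_policy {
    enum toy_mode mode;
    int home_node;
};

/* Counts duplications so the effect of the ordering is observable. */
int toy_alloc_count;

/*
 * Validate the existing policy first, and duplicate it only when the
 * operation can actually proceed.  Assumed for this sketch: only the
 * BIND-like and PREFERRED_MANY-like modes accept a home node.
 */
int toy_set_home_node(const struct toy_policy *old, int node,
                      struct toy_policy **out)
{
    struct toy_policy *new;

    if (old->mode != TOY_BIND && old->mode != TOY_PREFERRED_MANY)
        return -EOPNOTSUPP;     /* rejected before any allocation */

    new = malloc(sizeof(*new));
    if (!new)
        return -ENOMEM;
    toy_alloc_count++;

    *new = *old;                /* the "duplicate" step */
    new->home_node = node;
    *out = new;
    return 0;
}
```

With this ordering an inapplicable policy is rejected without ever touching
the allocator, which is the saving the commit describes.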

Link: https://lkml.kernel.org/r/20221216194537.238047-2-mathieu.desnoyers@efficios.com
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago mpage: use b_folio in do_mpage_readpage()
Matthew Wilcox (Oracle) [Thu, 15 Dec 2022 21:44:02 +0000 (21:44 +0000)]
mpage: use b_folio in do_mpage_readpage()

Remove this conversion of a folio back to a page.

Link: https://lkml.kernel.org/r/20221215214402.3522366-13-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago reiserfs: replace obvious uses of b_page with b_folio
Matthew Wilcox (Oracle) [Thu, 15 Dec 2022 21:44:01 +0000 (21:44 +0000)]
reiserfs: replace obvious uses of b_page with b_folio

These places just use b_page to get to the buffer's address_space or call
page_folio() on b_page to get a folio.

Link: https://lkml.kernel.org/r/20221215214402.3522366-12-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago nilfs2: replace obvious uses of b_page with b_folio
Matthew Wilcox (Oracle) [Thu, 15 Dec 2022 21:44:00 +0000 (21:44 +0000)]
nilfs2: replace obvious uses of b_page with b_folio

These places just use b_page to get to the buffer's address_space or the
index of the page the buffer is in.

Link: https://lkml.kernel.org/r/20221215214402.3522366-11-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago jbd2: replace obvious uses of b_page with b_folio
Matthew Wilcox (Oracle) [Thu, 15 Dec 2022 21:43:59 +0000 (21:43 +0000)]
jbd2: replace obvious uses of b_page with b_folio

These places just use b_page to get to the buffer's address_space or have
already been converted to folio.

Link: https://lkml.kernel.org/r/20221215214402.3522366-10-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago gfs2: replace obvious uses of b_page with b_folio
Matthew Wilcox (Oracle) [Thu, 15 Dec 2022 21:43:58 +0000 (21:43 +0000)]
gfs2: replace obvious uses of b_page with b_folio

These places just use b_page to get to the buffer's address_space.

Link: https://lkml.kernel.org/r/20221215214402.3522366-9-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago buffer: use b_folio in mark_buffer_dirty()
Matthew Wilcox (Oracle) [Thu, 15 Dec 2022 21:43:57 +0000 (21:43 +0000)]
buffer: use b_folio in mark_buffer_dirty()

Removes about four calls to compound_head().  Two of them are inline which
removes 132 bytes from the kernel text.

Link: https://lkml.kernel.org/r/20221215214402.3522366-8-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago page_io: remove buffer_head include
Matthew Wilcox (Oracle) [Thu, 15 Dec 2022 21:43:56 +0000 (21:43 +0000)]
page_io: remove buffer_head include

page_io never uses buffer_heads to do I/O.

Link: https://lkml.kernel.org/r/20221215214402.3522366-7-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago buffer: use b_folio in end_buffer_async_write()
Matthew Wilcox (Oracle) [Thu, 15 Dec 2022 21:43:55 +0000 (21:43 +0000)]
buffer: use b_folio in end_buffer_async_write()

Save 76 bytes by avoiding the call to compound_head() in SetPageError().
Also avoid the call to compound_head() in end_page_writeback().

Link: https://lkml.kernel.org/r/20221215214402.3522366-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago buffer: use b_folio in end_buffer_async_read()
Matthew Wilcox (Oracle) [Thu, 15 Dec 2022 21:43:54 +0000 (21:43 +0000)]
buffer: use b_folio in end_buffer_async_read()

Removes a call to compound_head() in SetPageError(), saving 76 bytes of
text.

Link: https://lkml.kernel.org/r/20221215214402.3522366-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago buffer: use b_folio in touch_buffer()
Matthew Wilcox (Oracle) [Thu, 15 Dec 2022 21:43:53 +0000 (21:43 +0000)]
buffer: use b_folio in touch_buffer()

Removes a call to compound_head() in this path.

Link: https://lkml.kernel.org/r/20221215214402.3522366-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago buffer: replace obvious uses of b_page with b_folio
Matthew Wilcox (Oracle) [Thu, 15 Dec 2022 21:43:52 +0000 (21:43 +0000)]
buffer: replace obvious uses of b_page with b_folio

These cases just check whether b_page is NULL, or use it to get to the
page's address space.  They already assume that b_page never points to a
tail page.

Link: https://lkml.kernel.org/r/20221215214402.3522366-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago buffer: add b_folio as an alias of b_page
Matthew Wilcox (Oracle) [Thu, 15 Dec 2022 21:43:51 +0000 (21:43 +0000)]
buffer: add b_folio as an alias of b_page

Patch series "Start converting buffer_heads to use folios".

I was hoping that filesystems would convert from buffer_heads to iomap,
but that's not happening particularly quickly.  So the buffer_head
infrastructure needs to be converted from being page-based to being
folio-based.

This patch (of 12):

Buffer heads point to the allocation (i.e. the folio), not the page.  This
is currently the same thing for all filesystems that use buffer heads, so
this is a safe transitional step.
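
One way to picture such an alias is a union of two pointer members, sketched
below with toy types.  The toy_* structures are hypothetical stand-ins
invented for this example; the sketch only illustrates why both names refer
to the same address as long as the buffer points at the first page of the
allocation.

```c
/* Hypothetical stand-ins; not the kernel's struct page / struct folio. */
struct toy_page {
    unsigned long flags;
};

struct toy_folio {
    struct toy_page page;   /* first page of the allocation */
};

struct toy_buffer_head {
    union {
        struct toy_page *b_page;    /* old name */
        struct toy_folio *b_folio;  /* new alias, same storage */
    };
};

/* Store through the old name; readers may use either member afterwards. */
void toy_set_bh_page(struct toy_buffer_head *bh, struct toy_folio *folio)
{
    bh->b_page = &folio->page;
}
```

Because the first page sits at offset zero of the toy folio, a pointer
written through one member reads back as the matching pointer through the
other, so callers can be converted one at a time.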

Link: https://lkml.kernel.org/r/20221215214402.3522366-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20221215214402.3522366-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago mm/hugetlb: introduce hugetlb_walk()
Peter Xu [Fri, 16 Dec 2022 15:52:29 +0000 (10:52 -0500)]
mm/hugetlb: introduce hugetlb_walk()

huge_pte_offset() is the main walker function for hugetlb pgtables.  The
name does not really represent what it does, though.

Instead of renaming it, introduce a wrapper function called hugetlb_walk()
which will use huge_pte_offset() inside.  Assert on the locks when walking
the pgtable.

Note, the vma lock assertion will be a no-op for private mappings.

Document the last special case in the page_vma_mapped_walk() path where we
don't need any additional lock to call hugetlb_walk().

Taking the vma lock there is not needed because either: (1) potential
callers of hugetlb pvmw hold i_mmap_rwsem already (from one rmap_walk()),
or (2) the caller will not walk a hugetlb vma at all, so the hugetlb code
path is not reachable (e.g. in ksm or uprobe paths).

The lock requirement is slightly implicit for future
page_vma_mapped_walk() callers.  But anyway, when one day this rule breaks,
one will get a straightforward warning in hugetlb_walk() with lockdep, then
there'll be a way out.
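
The wrapper-with-assertion idea can be sketched in user space as follows.
Everything here is a toy model: the toy_* names and the boolean "lock held"
flags are assumptions made for illustration, standing in for lockdep-style
checks, and this is not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_PTES 4

/* Hypothetical model of a vma plus the two locks that make a walk safe. */
struct toy_vma {
    bool vma_lock_held;
    bool i_mmap_rwsem_held;
    int ptes[TOY_NR_PTES];
};

/* The raw walker: performs the lookup and checks nothing about locking. */
int *toy_pte_offset(struct toy_vma *vma, size_t idx)
{
    return idx < TOY_NR_PTES ? &vma->ptes[idx] : NULL;
}

/*
 * The wrapper: assert that one of the two protecting locks is held, then
 * delegate to the raw walker.  A caller that forgets the lock fails loudly
 * here instead of racing silently.
 */
int *toy_walk(struct toy_vma *vma, size_t idx)
{
    assert(vma->vma_lock_held || vma->i_mmap_rwsem_held);
    return toy_pte_offset(vma, idx);
}
```

Keeping the raw walker unchanged and putting the check in a thin wrapper
means existing callers can be switched over individually, and any caller
that violates the locking rule is reported at the call site.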

[akpm@linux-foundation.org: coding-style cleanups]
Link: https://lkml.kernel.org/r/20221216155229.2043750-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago mm/hugetlb: make walk_hugetlb_range() safe to pmd unshare
Peter Xu [Fri, 16 Dec 2022 15:52:26 +0000 (10:52 -0500)]
mm/hugetlb: make walk_hugetlb_range() safe to pmd unshare

Since walk_hugetlb_range() walks the pgtable, it needs the vma lock to
make sure the pgtable page will not be freed concurrently.

Link: https://lkml.kernel.org/r/20221216155226.2043738-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago mm/hugetlb: make follow_hugetlb_page() safe to pmd unshare
Peter Xu [Fri, 16 Dec 2022 15:52:23 +0000 (10:52 -0500)]
mm/hugetlb: make follow_hugetlb_page() safe to pmd unshare

Since follow_hugetlb_page() walks the pgtable, it needs the vma lock to
make sure the pgtable page will not be freed concurrently.

Link: https://lkml.kernel.org/r/20221216155223.2043727-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago mm/hugetlb: make hugetlb_follow_page_mask() safe to pmd unshare
Peter Xu [Fri, 16 Dec 2022 15:52:19 +0000 (10:52 -0500)]
mm/hugetlb: make hugetlb_follow_page_mask() safe to pmd unshare

Since hugetlb_follow_page_mask() walks the pgtable, it needs the vma lock
to make sure the pgtable page will not be freed concurrently.

Link: https://lkml.kernel.org/r/20221216155219.2043714-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago mm/hugetlb: make userfaultfd_huge_must_wait() safe to pmd unshare
Peter Xu [Fri, 16 Dec 2022 15:52:17 +0000 (10:52 -0500)]
mm/hugetlb: make userfaultfd_huge_must_wait() safe to pmd unshare

We can take the hugetlb walker lock; here, take the vma lock directly.

Link: https://lkml.kernel.org/r/20221216155217.2043700-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago mm/hugetlb: move swap entry handling into vma lock when faulted
Peter Xu [Fri, 16 Dec 2022 15:50:55 +0000 (10:50 -0500)]
mm/hugetlb: move swap entry handling into vma lock when faulted

In hugetlb_fault(), there used to be a special path to handle swap entries
at the entrance using huge_pte_offset().  That's unsafe because
huge_pte_offset() for a pmd sharable range can access freed pgtables if
there is no lock to protect the pgtable from being freed after pmd
unshare.

Here the simplest solution to make it safe is to move the swap handling
after the point where the vma lock is taken.  We may now need to take the
fault mutex on either migration or hwpoison entries (also the vma lock, but
that's really needed); however, neither of them is a hot path.

Note that the vma lock cannot be released in hugetlb_fault() when the
migration entry is detected, because in migration_entry_wait_huge() the
pgtable page will be used again (by taking the pgtable lock), so that also
needs to be protected by the vma lock.  Modify migration_entry_wait_huge()
so that it must be called with vma read lock held, and properly release
the lock in __migration_entry_wait_huge().

Link: https://lkml.kernel.org/r/20221216155100.2043537-5-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago mm/hugetlb: document huge_pte_offset usage
Peter Xu [Fri, 16 Dec 2022 15:50:54 +0000 (10:50 -0500)]
mm/hugetlb: document huge_pte_offset usage

huge_pte_offset() is potentially a pgtable walker, looking up pte_t* for a
hugetlb address.

Normally, it's always safe to walk a generic pgtable as long as we hold
the mmap lock for either read or write, because that guarantees the
pgtable pages will always be valid during the process.

But that's not true for hugetlbfs, especially for shared mappings:
hugetlbfs can have its pgtable freed by pmd unsharing.  This means that
even with the mmap lock held for the current mm, the PMD pgtable page can
still go away from under us if pmd unsharing is possible during the walk.

So we have two ways to make it safe even for a shared mapping:

  (1) If we hold the hugetlb vma lock for either read or write, it's
      okay because pmd unshare cannot happen at all.

  (2) If we hold the i_mmap_rwsem lock for either read or write, it's
      okay because even if pmd unshare can happen, the pgtable page cannot
      be freed from under us.

Document it.

Link: https://lkml.kernel.org/r/20221216155100.2043537-4-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago mm/hugetlb: don't wait for migration entry during follow page
Peter Xu [Fri, 16 Dec 2022 15:50:53 +0000 (10:50 -0500)]
mm/hugetlb: don't wait for migration entry during follow page

That's what the code does with !hugetlb pages, so we should logically do
the same for hugetlb, so that a migration entry will also be treated as no
page.

This is probably also the last piece in follow_page code that may sleep;
the previous one should be removed by cf994dd8af27 ("mm/gup: remove
FOLL_MIGRATION", 2022-11-16).

Link: https://lkml.kernel.org/r/20221216155100.2043537-3-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months ago mm/hugetlb: let vma_offset_start() to return start
Peter Xu [Fri, 16 Dec 2022 15:50:52 +0000 (10:50 -0500)]
mm/hugetlb: let vma_offset_start() to return start

Patch series "mm/hugetlb: Make huge_pte_offset() thread-safe for pmd
unshare", v4.

Problem
=======

huge_pte_offset() is a major helper used by hugetlb code paths to walk a
hugetlb pgtable.  It's used almost everywhere, since it's needed even
before taking the pgtable lock.

huge_pte_offset() is always called with the mmap lock held for either read
or write.  That was assumed to be safe, but it actually is not.  One race
condition can easily be triggered as follows: (1) trigger pmd sharing on a
memory range, (2) do huge_pte_offset() on the range, and in the meantime,
(3) have another thread unshare the pmd range; the pgtable page is then
prone to being lost if the other shared process wants to free it completely
(by either munmap or exit mm).

The recent work from Mike on the vma lock can resolve most of this already.
It's achieved by forbidding pmd unsharing while the lock is held, so there
is no further risk of the pgtable page being freed.  It means that if we
can take the vma lock around all huge_pte_offset() callers, it'll be safe.

There are already a bunch of them that we did as per the latest
mm-unstable, but also quite a few others that we didn't, for various
reasons, especially on huge_pte_offset() usage.

One more thing to mention is that besides the vma lock, i_mmap_rwsem can
also be used to protect the pgtable page (along with its pgtable lock) from
being freed from under us.  IOW, huge_pte_offset() callers need to either
hold the vma lock or i_mmap_rwsem to safely walk the pgtables.

A reproducer of this problem, based on hugetlb GUP (NOTE: since the race is
very hard to trigger, one needs to apply another kernel delay patch too,
see below):

======8<=======
  #define _GNU_SOURCE
  #include <stdio.h>
  #include <stdlib.h>
  #include <errno.h>
  #include <unistd.h>
  #include <sys/mman.h>
  #include <fcntl.h>
  #include <linux/memfd.h>
  #include <assert.h>
  #include <pthread.h>

  #define  MSIZE  (1UL << 30)     /* 1GB */
  #define  PSIZE  (2UL << 20)     /* 2MB */

  #define  HOLD_SEC  (1)

  int pipefd[2];
  void *buf;

  void *do_map(int fd)
  {
      unsigned char *tmpbuf, *p;
      int ret;

      ret = posix_memalign((void **)&tmpbuf, MSIZE, MSIZE);
      if (ret) {
          perror("posix_memalign() failed");
          return NULL;
      }

      tmpbuf = mmap(tmpbuf, MSIZE, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_FIXED, fd, 0);
      if (tmpbuf == MAP_FAILED) {
          perror("mmap() failed");
          return NULL;
      }
      printf("mmap() -> %p\n", tmpbuf);

      for (p = tmpbuf; p < tmpbuf + MSIZE; p += PSIZE) {
          *p = 1;
      }

      return tmpbuf;
  }

  void do_unmap(void *buf)
  {
      munmap(buf, MSIZE);
  }

  void proc2(int fd)
  {
      unsigned char c;

      buf = do_map(fd);
      if (!buf)
          return;

      read(pipefd[0], &c, 1);
      /*
       * This frees the shared pgtable page, causing use-after-free in
       * proc1_thread1 when soft walking hugetlb pgtable.
       */
      do_unmap(buf);

      printf("Proc2 quitting\n");
  }

  void *proc1_thread1(void *data)
  {
      /*
       * Trigger follow-page on the 1st 2M page.  A kernel hack patch is
       * needed to delay this procedure so the race is easier to reproduce.
       */
      madvise(buf, PSIZE, MADV_POPULATE_WRITE);
      printf("Proc1-thread1 quitting\n");
      return NULL;
  }

  void *proc1_thread2(void *data)
  {
      unsigned char c;

      /*
       * Wait a while until proc1_thread1() starts to wait.  sleep() takes
       * whole seconds, so use usleep() for the half-second delay.
       */
      usleep(500 * 1000);
      /* Trigger pmd unshare */
      madvise(buf, PSIZE, MADV_DONTNEED);
      /* Kick off proc2 to release the pgtable */
      write(pipefd[1], &c, 1);

      printf("Proc1-thread2 quitting\n");
      return NULL;
  }

  void proc1(int fd)
  {
      pthread_t tid1, tid2;
      int ret;

      buf = do_map(fd);
      if (!buf)
          return;

      ret = pthread_create(&tid1, NULL, proc1_thread1, NULL);
      assert(ret == 0);
      ret = pthread_create(&tid2, NULL, proc1_thread2, NULL);
      assert(ret == 0);

      /* Kick the child to share the PUD entry */
      pthread_join(tid1, NULL);
      pthread_join(tid2, NULL);

      do_unmap(buf);
  }

  int main(void)
  {
      int fd, ret;

      fd = memfd_create("test-huge", MFD_HUGETLB | MFD_HUGE_2MB);
      if (fd < 0) {
          perror("memfd_create() failed");
          return -1;
      }

      ret = ftruncate(fd, MSIZE);
      if (ret) {
          perror("ftruncate() failed");
          return -1;
      }

      ret = pipe(pipefd);
      if (ret) {
          perror("pipe() failed");
          return -1;
      }

      if (fork()) {
          proc1(fd);
      } else {
          proc2(fd);
      }

      close(pipefd[0]);
      close(pipefd[1]);
      close(fd);

      return 0;
  }
======8<=======

The kernel patch needed to expose the race so that it triggers 100% of the time:

======8<=======
: diff --git a/mm/hugetlb.c b/mm/hugetlb.c
: index 9d97c9a2a15d..f8d99dad5004 100644
: --- a/mm/hugetlb.c
: +++ b/mm/hugetlb.c
: @@ -38,6 +38,7 @@
:  #include <asm/page.h>
:  #include <asm/pgalloc.h>
:  #include <asm/tlb.h>
: +#include <asm/delay.h>
:
:  #include <linux/io.h>
:  #include <linux/hugetlb.h>
: @@ -6290,6 +6291,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
:                 bool unshare = false;
:                 int absent;
:                 struct page *page;
: +               unsigned long c = 0;
:
:                 /*
:                  * If we have a pending SIGKILL, don't keep faulting pages and
: @@ -6309,6 +6311,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
:                  */
:                 pte = huge_pte_offset(mm, vaddr & huge_page_mask(h),
:                                       huge_page_size(h));
: +
: +               pr_info("%s: withhold 1 sec...\n", __func__);
: +               for (c = 0; c < 100; c++) {
: +                       udelay(10000);
: +               }
: +               pr_info("%s: withhold 1 sec...done\n", __func__);
: +
:                 if (pte)
:                         ptl = huge_pte_lock(h, mm, pte);
:                 absent = !pte || huge_pte_none(huge_ptep_get(pte));
: ======8<=======

It'll trigger use-after-free of the pgtable spinlock:

======8<=======
[   16.959907] follow_hugetlb_page: withhold 1 sec...
[   17.960315] follow_hugetlb_page: withhold 1 sec...done
[   17.960550] ------------[ cut here ]------------
[   17.960742] DEBUG_LOCKS_WARN_ON(1)
[   17.960756] WARNING: CPU: 3 PID: 542 at kernel/locking/lockdep.c:231 __lock_acquire+0x955/0x1fa0
[   17.961264] Modules linked in:
[   17.961394] CPU: 3 PID: 542 Comm: hugetlb-pmd-sha Not tainted 6.1.0-rc4-peterx+ #46
[   17.961704] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[   17.962266] RIP: 0010:__lock_acquire+0x955/0x1fa0
[   17.962516] Code: c0 0f 84 5f fe ff ff 44 8b 1d 0f 9a 29 02 45 85 db 0f 85 4f fe ff ff 48 c7 c6 75 50 83 82 48 c7 c7 1b 4b 7d 82 e8 d3 22 d8 00 <0f> 0b 31 c0 4c 8b 54 24 08 4c 8b 04 24 e9
[   17.963494] RSP: 0018:ffffc90000e4fba8 EFLAGS: 00010096
[   17.963704] RAX: 0000000000000016 RBX: fffffffffd3925a8 RCX: 0000000000000000
[   17.963989] RDX: 0000000000000002 RSI: ffffffff82863ccf RDI: 00000000ffffffff
[   17.964276] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffc90000e4fa58
[   17.964557] R10: 0000000000000003 R11: ffffffff83162688 R12: 0000000000000000
[   17.964839] R13: 0000000000000001 R14: ffff888105eac748 R15: 0000000000000001
[   17.965123] FS:  00007f17c0a00640(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000
[   17.965443] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   17.965672] CR2: 00007f17c09ffef8 CR3: 000000010c87a005 CR4: 0000000000770ee0
[   17.965956] PKRU: 55555554
[   17.966068] Call Trace:
[   17.966172]  <TASK>
[   17.966268]  ? tick_nohz_tick_stopped+0x12/0x30
[   17.966455]  lock_acquire+0xbf/0x2b0
[   17.966603]  ? follow_hugetlb_page.cold+0x75/0x5c4
[   17.966799]  ? _printk+0x48/0x4e
[   17.966934]  _raw_spin_lock+0x2f/0x40
[   17.967087]  ? follow_hugetlb_page.cold+0x75/0x5c4
[   17.967285]  follow_hugetlb_page.cold+0x75/0x5c4
[   17.967473]  __get_user_pages+0xbb/0x620
[   17.967635]  faultin_vma_page_range+0x9a/0x100
[   17.967817]  madvise_vma_behavior+0x3c0/0xbd0
[   17.967998]  ? mas_prev+0x11/0x290
[   17.968141]  ? find_vma_prev+0x5e/0xa0
[   17.968304]  ? madvise_vma_anon_name+0x70/0x70
[   17.968486]  madvise_walk_vmas+0xa9/0x120
[   17.968650]  do_madvise.part.0+0xfa/0x270
[   17.968813]  __x64_sys_madvise+0x5a/0x70
[   17.968974]  do_syscall_64+0x37/0x90
[   17.969123]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[   17.969329] RIP: 0033:0x7f1840f0efdb
[   17.969477] Code: c3 66 0f 1f 44 00 00 48 8b 15 39 6e 0e 00 f7 d8 64 89 02 b8 ff ff ff ff eb bc 0f 1f 44 00 00 f3 0f 1e fa b8 1c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 0d 68
[   17.970205] RSP: 002b:00007f17c09ffe38 EFLAGS: 00000202 ORIG_RAX: 000000000000001c
[   17.970504] RAX: ffffffffffffffda RBX: 00007f17c0a00640 RCX: 00007f1840f0efdb
[   17.970786] RDX: 0000000000000017 RSI: 0000000000200000 RDI: 00007f1800000000
[   17.971068] RBP: 00007f17c09ffe50 R08: 0000000000000000 R09: 00007ffd3954164f
[   17.971353] R10: 00007f1840e10348 R11: 0000000000000202 R12: ffffffffffffff80
[   17.971709] R13: 0000000000000000 R14: 00007ffd39541550 R15: 00007f17c0200000
[   17.972083]  </TASK>
[   17.972199] irq event stamp: 2353
[   17.972372] hardirqs last  enabled at (2353): [<ffffffff8117fe4e>] __up_console_sem+0x5e/0x70
[   17.972869] hardirqs last disabled at (2352): [<ffffffff8117fe33>] __up_console_sem+0x43/0x70
[   17.973365] softirqs last  enabled at (2330): [<ffffffff810f763d>] __irq_exit_rcu+0xed/0x160
[   17.973857] softirqs last disabled at (2323): [<ffffffff810f763d>] __irq_exit_rcu+0xed/0x160
[   17.974341] ---[ end trace 0000000000000000 ]---
[   17.974614] BUG: kernel NULL pointer dereference, address: 00000000000000b8
[   17.975012] #PF: supervisor read access in kernel mode
[   17.975314] #PF: error_code(0x0000) - not-present page
[   17.975615] PGD 103f7b067 P4D 103f7b067 PUD 106cd7067 PMD 0
[   17.975943] Oops: 0000 [#1] PREEMPT SMP NOPTI
[   17.976197] CPU: 3 PID: 542 Comm: hugetlb-pmd-sha Tainted: G        W          6.1.0-rc4-peterx+ #46
[   17.976712] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[   17.977370] RIP: 0010:__lock_acquire+0x190/0x1fa0
[   17.977655] Code: 98 00 00 00 41 89 46 24 81 e2 ff 1f 00 00 48 0f a3 15 e4 ba dd 02 0f 83 ff 05 00 00 48 8d 04 52 48 c1 e0 06 48 05 c0 d2 f4 83 <44> 0f b6 a0 b8 00 00 00 41 0f b7 46 20 6f
[   17.979170] RSP: 0018:ffffc90000e4fba8 EFLAGS: 00010046
[   17.979787] RAX: 0000000000000000 RBX: fffffffffd3925a8 RCX: 0000000000000000
[   17.980838] RDX: 0000000000000002 RSI: ffffffff82863ccf RDI: 00000000ffffffff
[   17.982048] RBP: 0000000000000000 R08: ffff888105eac720 R09: ffffc90000e4fa58
[   17.982892] R10: ffff888105eab900 R11: ffffffff83162688 R12: 0000000000000000
[   17.983771] R13: 0000000000000001 R14: ffff888105eac748 R15: 0000000000000001
[   17.984815] FS:  00007f17c0a00640(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000
[   17.985924] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   17.986265] CR2: 00000000000000b8 CR3: 000000010c87a005 CR4: 0000000000770ee0
[   17.986674] PKRU: 55555554
[   17.986832] Call Trace:
[   17.987012]  <TASK>
[   17.987266]  ? tick_nohz_tick_stopped+0x12/0x30
[   17.987770]  lock_acquire+0xbf/0x2b0
[   17.988118]  ? follow_hugetlb_page.cold+0x75/0x5c4
[   17.988575]  ? _printk+0x48/0x4e
[   17.988889]  _raw_spin_lock+0x2f/0x40
[   17.989243]  ? follow_hugetlb_page.cold+0x75/0x5c4
[   17.989687]  follow_hugetlb_page.cold+0x75/0x5c4
[   17.990119]  __get_user_pages+0xbb/0x620
[   17.990500]  faultin_vma_page_range+0x9a/0x100
[   17.990928]  madvise_vma_behavior+0x3c0/0xbd0
[   17.991354]  ? mas_prev+0x11/0x290
[   17.991678]  ? find_vma_prev+0x5e/0xa0
[   17.992024]  ? madvise_vma_anon_name+0x70/0x70
[   17.992421]  madvise_walk_vmas+0xa9/0x120
[   17.992793]  do_madvise.part.0+0xfa/0x270
[   17.993166]  __x64_sys_madvise+0x5a/0x70
[   17.993539]  do_syscall_64+0x37/0x90
[   17.993879]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
======8<=======

Resolution
==========

This patchset makes all the huge_pte_offset() callers take the vma lock
properly.

Patch Layout
============

Patch 1-2:         cleanup, or dependency of the follow up patches
Patch 3:           before fixing, document huge_pte_offset() on lock required
Patch 4-8:         each patch resolves one possible race condition
Patch 9:           introduce hugetlb_walk() to replace huge_pte_offset()

Tests
=====

The series is verified with the above reproducer so the race cannot
trigger anymore.  It also passes all hugetlb kselftests.

This patch (of 9):

Even though vma_offset_start() is named like that, it's not returning "the
start address of the range" but rather the offset we should use to offset
the vma->vm_start address.

Make it return the real value of the start vaddr.  This also helps all
the callers, because whenever the retval is used it is ultimately added
to vma->vm_start anyway.
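
The change can be pictured with a small userspace model.  This is only a
sketch: the function names with the _old/_new suffixes, the plain integer
arguments and the 4 KiB page size are assumptions made for illustration,
and the offset condition is how I understand the helper, not a copy of the
kernel source.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB base pages


def vma_offset_start_old(vm_pgoff, start):
    """Old behaviour: return only a byte offset relative to vm_start."""
    if vm_pgoff < start:
        return (start - vm_pgoff) << PAGE_SHIFT
    return 0


def vma_offset_start_new(vm_start, vm_pgoff, start):
    """New behaviour: return the start virtual address itself."""
    return vm_start + vma_offset_start_old(vm_pgoff, start)
```

With the old form every caller had to add vm_start itself; with the new
form the addition happens once, inside the helper.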

Link: https://lkml.kernel.org/r/20221216155100.2043537-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20221216155100.2043537-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months agohugetlb: update vma flag check for hugetlb vma lock
Mike Kravetz [Mon, 12 Dec 2022 23:50:42 +0000 (15:50 -0800)]
hugetlb: update vma flag check for hugetlb vma lock

The check for whether a hugetlb vma lock exists partially depends on the
vma's flags.  Currently, it checks for either VM_MAYSHARE or VM_SHARED.
Both flags are checked because VM_MAYSHARE was previously cleared in
hugetlb vmas as they were torn down.  This is no longer the case, so only
the VM_MAYSHARE check is required.

Link: https://lkml.kernel.org/r/20221212235042.178355-2-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months agoselftests/memfd: add tests for MFD_NOEXEC_SEAL MFD_EXEC
Jeff Xu [Thu, 15 Dec 2022 00:12:05 +0000 (00:12 +0000)]
selftests/memfd: add tests for MFD_NOEXEC_SEAL MFD_EXEC

Tests to verify MFD_NOEXEC_SEAL, MFD_EXEC and the vm.memfd_noexec sysctl.

Link: https://lkml.kernel.org/r/20221215001205.51969-6-jeffxu@google.com
Signed-off-by: Jeff Xu <jeffxu@google.com>
Co-developed-by: Daniel Verkamp <dverkamp@chromium.org>
Signed-off-by: Daniel Verkamp <dverkamp@chromium.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
Cc: kernel test robot <lkp@intel.com>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months agomm/memfd: add write seals when apply SEAL_EXEC to executable memfd
Jeff Xu [Thu, 15 Dec 2022 00:12:04 +0000 (00:12 +0000)]
mm/memfd: add write seals when apply SEAL_EXEC to executable memfd

In order to avoid WX mappings, add F_SEAL_WRITE when applying F_SEAL_EXEC
to an executable memfd, so that it is W^X from the start.

This implies that the application needs to fill in the content of the
memfd first; after F_SEAL_EXEC is applied, the application can no longer
modify the content of the memfd.

Typically, an application seals the memfd right after writing to it.
For example:
1. memfd_create(MFD_EXEC).
2. write() code to the memfd.
3. fcntl(F_ADD_SEALS, F_SEAL_EXEC) to convert the memfd to W^X.
4. call exec() on the memfd.
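
The effect of step 3 can be sketched as a tiny state model.  It is not the
kernel code: add_seals() and can_write() are hypothetical helpers, the
constant values are the uapi values as far as I know, and only the
F_SEAL_WRITE implication named in this message is modelled.

```python
F_SEAL_WRITE = 0x0008  # assumed to match the Linux uapi value
F_SEAL_EXEC = 0x0020   # assumed to match the Linux uapi value


def add_seals(mode, seals, new_seals):
    """Model F_ADD_SEALS: sealing exec on an executable memfd also seals writes."""
    if (new_seals & F_SEAL_EXEC) and (mode & 0o111):
        new_seals |= F_SEAL_WRITE
    return seals | new_seals


def can_write(seals):
    """A memfd stays writable only while F_SEAL_WRITE is absent."""
    return not (seals & F_SEAL_WRITE)
```

In this model a memfd created with mode 0777 stops being writable once
F_SEAL_EXEC is added, while a non-executable one (mode 0666) is unaffected.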

Link: https://lkml.kernel.org/r/20221215001205.51969-5-jeffxu@google.com
Signed-off-by: Jeff Xu <jeffxu@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Daniel Verkamp <dverkamp@chromium.org>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
Cc: kernel test robot <lkp@intel.com>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months agomm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC
Jeff Xu [Thu, 15 Dec 2022 00:12:03 +0000 (00:12 +0000)]
mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC

The new MFD_NOEXEC_SEAL and MFD_EXEC flags allow an application to set the
executable bit at creation time (memfd_create).

When MFD_NOEXEC_SEAL is set, the memfd is created without the executable
bit (mode: 0666) and sealed with F_SEAL_EXEC, so it can't be chmod-ed to
be executable (mode: 0777) after creation.

When the MFD_EXEC flag is set, the memfd is created with the executable
bit (mode: 0777); this is the same as the old behavior of memfd_create.

The new pid namespaced sysctl vm.memfd_noexec has 3 values:
0: memfd_create() without either MFD_EXEC or MFD_NOEXEC_SEAL acts as if
        MFD_EXEC was set.
1: memfd_create() without either MFD_EXEC or MFD_NOEXEC_SEAL acts as if
        MFD_NOEXEC_SEAL was set.
2: memfd_create() without MFD_NOEXEC_SEAL will be rejected.

The sysctl allows finer control of memfd_create() for old software that
doesn't set the executable bit.  For example, a container with
vm.memfd_noexec=1 means that old software will create non-executable
memfds by default.  Also, the value of memfd_noexec is passed to a child
namespace at creation time.  For example, if the init namespace has
vm.memfd_noexec=2, all its child namespaces will be created with 2.
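
The three sysctl values can be read as a small decision table.  The sketch
below models only what is described above: resolve_memfd_flags() and
initial_mode() are hypothetical helpers, the flag values are the uapi
values as far as I know, and the use of PermissionError for "rejected" is
an assumption (the message does not name the errno).

```python
MFD_NOEXEC_SEAL = 0x0008  # assumed to match the Linux uapi value
MFD_EXEC = 0x0010         # assumed to match the Linux uapi value


def resolve_memfd_flags(flags, memfd_noexec):
    """Return the effective flags for a given vm.memfd_noexec value."""
    if not (flags & (MFD_EXEC | MFD_NOEXEC_SEAL)):
        # Old software that sets neither flag gets the sysctl default.
        if memfd_noexec == 0:
            flags |= MFD_EXEC
        elif memfd_noexec == 1:
            flags |= MFD_NOEXEC_SEAL
    if memfd_noexec == 2 and not (flags & MFD_NOEXEC_SEAL):
        raise PermissionError("memfd_create() without MFD_NOEXEC_SEAL rejected")
    return flags


def initial_mode(flags):
    """Mode of the new memfd: 0777 with MFD_EXEC, 0666 otherwise."""
    return 0o777 if flags & MFD_EXEC else 0o666
```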

[akpm@linux-foundation.org: add stub functions to fix build]
[akpm@linux-foundation.org: remove unneeded register_pid_ns_ctl_table_vm() stub, per Jeff]
[akpm@linux-foundation.org: s/pr_warn_ratelimited/pr_warn_once/, per review]
[akpm@linux-foundation.org: fix CONFIG_SYSCTL=n warning]
Link: https://lkml.kernel.org/r/20221215001205.51969-4-jeffxu@google.com
Signed-off-by: Jeff Xu <jeffxu@google.com>
Co-developed-by: Daniel Verkamp <dverkamp@chromium.org>
Signed-off-by: Daniel Verkamp <dverkamp@chromium.org>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months agoselftests/memfd: add tests for F_SEAL_EXEC
Daniel Verkamp [Thu, 15 Dec 2022 00:12:02 +0000 (00:12 +0000)]
selftests/memfd: add tests for F_SEAL_EXEC

Basic tests to ensure that user/group/other execute bits cannot be changed
after applying F_SEAL_EXEC to a memfd.

Link: https://lkml.kernel.org/r/20221215001205.51969-3-jeffxu@google.com
Signed-off-by: Daniel Verkamp <dverkamp@chromium.org>
Co-developed-by: Jeff Xu <jeffxu@google.com>
Signed-off-by: Jeff Xu <jeffxu@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
Cc: kernel test robot <lkp@intel.com>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months agomm/memfd: add F_SEAL_EXEC
Daniel Verkamp [Thu, 15 Dec 2022 00:12:01 +0000 (00:12 +0000)]
mm/memfd: add F_SEAL_EXEC

Patch series "mm/memfd: introduce MFD_NOEXEC_SEAL and MFD_EXEC", v8.

Since Linux introduced the memfd feature, memfds have always had their
execute bit set, and the memfd_create() syscall doesn't allow setting it
differently.

However, in a secure-by-default system such as ChromeOS (where all
executables should come from the rootfs, which is protected by Verified
boot), the executable nature of memfds opens a door for NoExec bypass and
enables a “confused deputy attack”.  E.g., in VRP bug [1], the cros_vm
process created a memfd to share content with an external process, but
the memfd was overwritten and used for executing arbitrary code and root
escalation.  [2] lists more VRP bugs of this kind.

On the other hand, executable memfds have legitimate uses: runc uses
memfd’s seal and executable features to copy the contents of the binary
and then execute them.  For such systems, we need a solution to
differentiate runc's use of executable memfds from an attacker's [3].

To address the above, this set of patches adds the following:
1> Let memfd_create() set the X bit at creation time.
2> Let a memfd be sealed against modification of the X bit.
3> A new pid namespace sysctl, vm.memfd_noexec, to control the behavior of
   the X bit.  For example, if a container has vm.memfd_noexec=2, then
   memfd_create() without MFD_NOEXEC_SEAL will be rejected.
4> A new security hook in memfd_create().  This makes it possible for a
   new LSM to reject or allow executable memfds based on its security policy.

This patch (of 5):

The new F_SEAL_EXEC flag will prevent modification of the exec bits:
written as traditional octal mask, 0111, or as named flags, S_IXUSR |
S_IXGRP | S_IXOTH.  Any chmod(2) or similar call that attempts to modify
any of these bits after the seal is applied will fail with errno EPERM.

This will preserve the execute bits as they are at the time of sealing, so
the memfd will become either permanently executable or permanently
un-executable.
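
As a rough illustration of the rule, the check can be modelled in a few
lines.  chmod_memfd() is a hypothetical stand-in for the chmod(2) path on
a memfd, and the F_SEAL_EXEC value is the uapi value as far as I know;
only the "any exec bit changes after sealing gives EPERM" behaviour
described above is modelled.

```python
import errno
import stat

F_SEAL_EXEC = 0x0020  # assumed to match the Linux uapi value
EXEC_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH  # octal 0111


def chmod_memfd(mode, new_mode, seals):
    """Return the new mode, or fail with EPERM if a sealed exec bit changes."""
    if (seals & F_SEAL_EXEC) and ((mode ^ new_mode) & EXEC_BITS):
        raise PermissionError(errno.EPERM, "exec bits are sealed")
    return new_mode
```

Changes that leave all three execute bits as they were (for example 0666
to 0600) still succeed in this model, which matches the "permanently
executable or permanently un-executable" wording.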

Link: https://lkml.kernel.org/r/20221215001205.51969-1-jeffxu@google.com
Link: https://lkml.kernel.org/r/20221215001205.51969-2-jeffxu@google.com
Signed-off-by: Daniel Verkamp <dverkamp@chromium.org>
Co-developed-by: Jeff Xu <jeffxu@google.com>
Signed-off-by: Jeff Xu <jeffxu@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months agomm/uffd: always wr-protect pte in pte|pmd_mkuffd_wp()
Peter Xu [Wed, 14 Dec 2022 20:15:33 +0000 (15:15 -0500)]
mm/uffd: always wr-protect pte in pte|pmd_mkuffd_wp()

This patch is a cleanup to always wr-protect pte/pmd in mkuffd_wp paths.

The reasons I still think this patch is worthwhile, are:

  (1) It is a cleanup already; diffstat tells.

  (2) It just feels natural after I thought about this, if the pte is uffd
      protected, let's remove the write bit no matter what it was.

  (3) Since x86 is the only arch that supports uffd-wp, it also redefines
      pte|pmd_mkuffd_wp() in that it should always contain removals of
      write bits.  It means any future arch that wants to implement
      uffd-wp should naturally follow this rule too.  It's good to make it
      a default, even with the vm_page_prot changes on VM_UFFD_WP.

  (4) It covers more than vm_page_prot.  So no chance of any potential
      future "accident" (like pte_mkdirty() on sparc64 or loongarch, even
      though it just got its pte_mkdirty fixed <1 month ago).  It'll be
      fairly clear when reading the code too that we need not worry about
      the state of the write bit before a pte_mkuffd_wp().

We may call pte_wrprotect() one more time in some paths (e.g.  thp split),
but that should be fully local bitop instruction so the overhead should be
negligible.
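
The idea can be shown on a toy pte represented as an integer.  The bit
positions below are hypothetical and chosen only for illustration; the
point of the sketch is that setting the uffd-wp bit and clearing the write
bit happen together, and that doing the write-protect twice is harmless.

```python
PTE_RW = 1 << 1        # hypothetical bit position, illustration only
PTE_UFFD_WP = 1 << 10  # hypothetical bit position, illustration only


def pte_wrprotect(pte):
    """Clear the write bit; a purely local bit operation."""
    return pte & ~PTE_RW


def pte_mkuffd_wp(pte):
    """Mark the pte uffd write-protected and always drop the write bit."""
    return pte_wrprotect(pte | PTE_UFFD_WP)
```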

Although this patch should logically also fix all the recently known
uffd-wp issues on page migration (but not numa hint recovery, which may
need another explicit pte_wrprotect), this is not the plan for that fix.
So no Fixes tag, and stable doesn't need this.

Link: https://lkml.kernel.org/r/20221214201533.1774616-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ives van Hoorne <ives@codesandbox.io>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months agomm: move folio_set_compound_order() to mm/internal.h
Sidhartha Kumar [Thu, 15 Dec 2022 06:17:57 +0000 (22:17 -0800)]
mm: move folio_set_compound_order() to mm/internal.h

folio_set_compound_order() is moved to an mm-internal location so external
folio users cannot misuse this function.  Change the name of the function
to folio_set_order() and use WARN_ON_ONCE() rather than BUG_ON.  Also,
handle the case if a non-large folio is passed and add clarifying comments
to the function.

Link: https://lore.kernel.org/lkml/20221207223731.32784-1-sidhartha.kumar@oracle.com/T/
Link: https://lkml.kernel.org/r/20221215061757.223440-1-sidhartha.kumar@oracle.com
Fixes: 9fd330582b2f ("mm: add folio dtor and order setter functions")
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
Suggested-by: Muchun Song <songmuchun@bytedance.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Suggested-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months agoPull mm-hotfixes-stable dependencies into mm-stable.
Andrew Morton [Thu, 19 Jan 2023 01:03:20 +0000 (17:03 -0800)]
Pull mm-hotfixes-stable dependencies into mm-stable.

Merge branch 'mm-hotfixes-stable' into mm-stable

22 months agomm: fix a few rare cases of using swapin error pte marker
Peter Xu [Wed, 14 Dec 2022 20:04:53 +0000 (15:04 -0500)]
mm: fix a few rare cases of using swapin error pte marker

This patch should harden commit 15520a3f0469 ("mm: use pte markers for
swap errors") on using pte markers for swapin errors on a few corner
cases.

1. Propagate swapin errors across fork()s: if there're swapin errors in
   the parent mm, after fork()s the child should sigbus too when an error
   page is accessed.

2. Fix a rare race condition in pte_marker_clear() where a uffd-wp pte
   marker can be quickly switched to a swapin error.

3. Explicitly ignore swapin error pte markers in change_protection().

I mostly don't worry about (2) or (3) at all, but we should still have
them.  Case (1) is special because it can potentially cause silent data
corruption in the child when the parent has a swapin error triggered with
swapoff, but since swapin errors are themselves rare it's probably not
easy to trigger either.
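
A minimal model of the fork-time rule in case (1), under stated
assumptions: the marker bit values are hypothetical, and
copy_pte_marker() here is an illustrative function written for this note
that takes a boolean instead of a vma.  A swapin error marker is always
carried over to the child, while a uffd-wp marker is carried over only
when the destination vma has uffd-wp enabled.

```python
PTE_MARKER_UFFD_WP = 1 << 0       # hypothetical bit value
PTE_MARKER_SWAPIN_ERROR = 1 << 1  # hypothetical bit value


def copy_pte_marker(src_marker, dst_vma_has_uffd_wp):
    """Return the marker bits the child pte should receive (0 means empty)."""
    # Swapin errors always propagate so the child gets SIGBUS too.
    dst = src_marker & PTE_MARKER_SWAPIN_ERROR
    # The uffd-wp marker is inherited only when the child vma needs it.
    if (src_marker & PTE_MARKER_UFFD_WP) and dst_vma_has_uffd_wp:
        dst |= PTE_MARKER_UFFD_WP
    return dst
```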

Currently there is a priority difference between the uffd-wp bit and the
swapin error entry, in which the swapin error always has higher priority
(e.g.  we don't need to wr-protect a swapin error pte marker).

If a 3rd bit is introduced, we'll probably need to consider a more
involved approach, so we may need to start operating on the bits.  Let's
leave that for later.

This patch is tested with case (1) explicitly: before the patch, the child
gets corrupted data if there are existing swapin error pte markers; with
the patch applied, the child is rightfully killed.

We don't need to copy stable for this one since 15520a3f0469 just landed
as part of v6.2-rc1, only "Fixes" applied.

Link: https://lkml.kernel.org/r/20221214200453.1772655-3-peterx@redhat.com
Fixes: 15520a3f0469 ("mm: use pte markers for swap errors")
Signed-off-by: Peter Xu <peterx@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Pengfei Xu <pengfei.xu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months agomm/uffd: fix pte marker when fork() without fork event
Peter Xu [Wed, 14 Dec 2022 20:04:52 +0000 (15:04 -0500)]
mm/uffd: fix pte marker when fork() without fork event

Patch series "mm: Fixes on pte markers".

Patch 1 resolves the syzkiller report from Pengfei.

Patch 2 further hardens pte markers when used with the recent swapin error
markers.  The major case is that we should persist a swapin error marker
after fork(), so that the child doesn't read a corrupted page.

This patch (of 2):

On fork(), dst_vma is not guaranteed to have VM_UFFD_WP even if src_vma
has it and has a pte marker installed.  The warning is improper, along
with the comment.  The right thing is to inherit the pte marker when
needed, or keep the dst pte empty.

A vague guess is that this happened by accident: an earlier patch
introduced src/dst vma into this helper while the uffd-wp feature was
being developed, and I probably messed up in the rebase, since the warning
and comment would make sense if we replaced dst_vma with src_vma.

Hugetlb did exactly the right thing here (copy_hugetlb_page_range()).  Fix
the general path.

Reproducer:

https://github.com/xupengfe/syzkaller_logs/blob/main/221208_115556_copy_page_range/repro.c

Bugzilla report: https://bugzilla.kernel.org/show_bug.cgi?id=216808

Link: https://lkml.kernel.org/r/20221214200453.1772655-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20221214200453.1772655-2-peterx@redhat.com
Fixes: c56d1b62cce8 ("mm/shmem: handle uffd-wp during fork()")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: <stable@vger.kernel.org> # 5.19+
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
22 months agoSync with v6.2-rc4
Andrew Morton [Thu, 19 Jan 2023 00:52:20 +0000 (16:52 -0800)]
Sync with v6.2-rc4

Merge branch 'master' into mm-hotfixes-stable

22 months agoSync with v6.2-rc4
Andrew Morton [Thu, 19 Jan 2023 00:51:53 +0000 (16:51 -0800)]
Sync with v6.2-rc4

Merge branch 'master' into mm-stable

22 months agoLinux 6.2-rc4
Linus Torvalds [Sun, 15 Jan 2023 15:22:43 +0000 (09:22 -0600)]
Linux 6.2-rc4

22 months agoMerge tag 'x86_urgent_for_v6.2_rc4' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 15 Jan 2023 13:17:44 +0000 (07:17 -0600)]
Merge tag 'x86_urgent_for_v6.2_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Make sure the poking PGD is pinned for Xen PV as it requires it this
   way

 - Fixes for two resctrl races when moving a task or creating a new
   monitoring group

 - Fix SEV-SNP guests running under HyperV where MTRRs are disabled to
   not return a UC- type mapping type on memremap() and thus cause a
   serious slowdown

 - Fix insn mnemonics in bioscall.S now that binutils is starting to fix
   confusing insn suffixes

* tag 'x86_urgent_for_v6.2_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: fix poking_init() for Xen PV guests
  x86/resctrl: Fix event counts regression in reused RMIDs
  x86/resctrl: Fix task CLOSID/RMID update race
  x86/pat: Fix pat_x_mtrr_type() for MTRR disabled case
  x86/boot: Avoid using Intel mnemonics in AT&T syntax asm

22 months agoMerge tag 'edac_urgent_for_v6.2_rc4' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 15 Jan 2023 13:12:58 +0000 (07:12 -0600)]
Merge tag 'edac_urgent_for_v6.2_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras

Pull EDAC fixes from Borislav Petkov:

 - Fix the EDAC device's confusion in the polling setting units

 - Fix a memory leak in highbank's probing function

* tag 'edac_urgent_for_v6.2_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
  EDAC/highbank: Fix memory leak in highbank_mc_probe()
  EDAC/device: Fix period calculation in edac_device_reset_delay_period()

22 months agoMerge tag 'powerpc-6.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc...
Linus Torvalds [Sun, 15 Jan 2023 13:09:41 +0000 (07:09 -0600)]
Merge tag 'powerpc-6.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:

 - Fix a build failure with some versions of ld that have an odd version
   string

 - Fix incorrect use of mutex in the IMC PMU driver

Thanks to Kajol Jain, Michael Petlan, Ojaswin Mujoo, Peter Zijlstra, and
Yang Yingliang.

* tag 'powerpc-6.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/64s/hash: Make stress_hpt_timer_fn() static
  powerpc/imc-pmu: Fix use of mutex in IRQs disabled section
  powerpc/boot: Fix incorrect version calculation issue in ld_version

22 months agoMerge tag 'iommu-fixes-v6.2-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sat, 14 Jan 2023 16:48:15 +0000 (10:48 -0600)]
Merge tag 'iommu-fixes-v6.2-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu

Pull iommu fixes from Joerg Roedel:

 - Core: Fix an iommu-group refcount leak

 - Fix overflow issue in IOVA alloc path

 - ARM-SMMU fixes from Will:
    - Fix VFIO regression on NXP SoCs by reporting IOMMU_CAP_CACHE_COHERENCY
    - Fix SMMU shutdown paths to avoid device unregistration race

 - Error handling fix for Mediatek IOMMU driver

* tag 'iommu-fixes-v6.2-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
  iommu/mediatek-v1: Fix an error handling path in mtk_iommu_v1_probe()
  iommu/iova: Fix alloc iova overflows issue
  iommu: Fix refcount leak in iommu_device_claim_dma_owner
  iommu/arm-smmu-v3: Don't unregister on shutdown
  iommu/arm-smmu: Don't unregister on shutdown
  iommu/arm-smmu: Report IOMMU_CAP_CACHE_COHERENCY even betterer

22 months agoMerge tag 'fixes-2023-01-14' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt...
Linus Torvalds [Sat, 14 Jan 2023 16:08:08 +0000 (10:08 -0600)]
Merge tag 'fixes-2023-01-14' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock

Pull memblock fix from Mike Rapoport:
 "memblock: always release pages to the buddy allocator in
  memblock_free_late()

  If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, memblock_free_pages()
  only releases pages to the buddy allocator if they are not in the
  deferred range. This is correct for free pages (as defined by
  for_each_free_mem_pfn_range_in_zone()) because free pages in the
  deferred range will be initialized and released as part of the
  deferred init process.

  memblock_free_pages() is called by memblock_free_late(), which is used
  to free reserved ranges after memblock_free_all() has run. All pages
  in reserved ranges have been initialized at that point, and
  accordingly, those pages are not touched by the deferred init process.

  This means that currently, if the pages that memblock_free_late()
  intends to release are in the deferred range, they will never be
  released to the buddy allocator. They will forever be reserved.

  In addition, memblock_free_pages() calls kmsan_memblock_free_pages(),
  which is also correct for free pages but is not correct for reserved
  pages. KMSAN metadata for reserved pages is initialized by
  kmsan_init_shadow(), which runs shortly before memblock_free_all().

  For both of these reasons, memblock_free_pages() should only be called
  for free pages, and memblock_free_late() should call
  __free_pages_core() directly instead.

  One case where this issue can occur in the wild is EFI boot on x86_64.
  The x86 EFI code reserves all EFI boot services memory ranges via
  memblock_reserve() and frees them later via memblock_free_late()
  (efi_reserve_boot_services() and efi_free_boot_services(),
  respectively).

  If any of those ranges happens to fall within the deferred init range,
  the pages will not be released and that memory will be unavailable.

  For example, on an Amazon EC2 t3.micro VM (1 GB) booting via EFI:

    v6.2-rc2:
    Node 0, zone      DMA
          spanned  4095
          present  3999
          managed  3840
    Node 0, zone    DMA32
          spanned  246652
          present  245868
          managed  178867

    v6.2-rc2 + patch:
    Node 0, zone      DMA
          spanned  4095
          present  3999
          managed  3840
    Node 0, zone    DMA32
          spanned  246652
          present  245868
          managed  222816   # +43,949 pages"
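
To put the DMA32 difference in more familiar units, the arithmetic can be
spelled out as below.  The 4 KiB page size is an assumption (the usual
base page size on x86_64); the two "managed" counts are taken from the
listing above.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB base pages on x86_64

managed_before = 178867  # DMA32 "managed", v6.2-rc2
managed_after = 222816   # DMA32 "managed", v6.2-rc2 + patch

recovered_pages = managed_after - managed_before
recovered_mib = recovered_pages * PAGE_SIZE / 2**20

print(recovered_pages, "pages recovered")
print(round(recovered_mib, 1), "MiB recovered")
```

Under that assumption the patch returns a bit over 170 MiB to the buddy
allocator on a VM with 1 GB of memory.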

* tag 'fixes-2023-01-14' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
  mm: Always release pages to the buddy allocator in memblock_free_late().