Jason Gunthorpe [Tue, 24 Jan 2023 20:34:23 +0000 (16:34 -0400)]
mm/gup: remove obsolete FOLL_LONGTERM comment
These days FOLL_LONGTERM is not allowed at all on any of the get_user_pages*()
functions; it may only be used with pin_user_pages*(), and it now has
universal support across all the pin_user_pages*() functions.
Link: https://lkml.kernel.org/r/2-v2-987e91b59705+36b-gup_tidy_jgg@nvidia.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jason Gunthorpe [Tue, 24 Jan 2023 20:34:22 +0000 (16:34 -0400)]
mm/gup: have internal functions get the mmap_read_lock()
Patch series "Simplify the external interface for GUP", v2.
It is quite a maze of EXPORTED symbols leading up to the three actual
worker functions of GUP. Simplify this by reorganizing some of the code so
the EXPORTED symbols directly call the correct internal function with
validated and consistent arguments.
Consolidate all the assertions into one place at the top of the call
chains.
Remove some dead code.
Move more things into the mm/internal.h header.
This patch (of 13):
__get_user_pages_locked() and __gup_longterm_locked() both require the
mmap lock to be held. They have a slightly unusual locked parameter that
is used to allow these functions to unlock and relock the mmap lock and
convey that fact to the caller.
Several places wrap these functions with a simple mmap_read_lock() just so
they can follow the optimized locked protocol.
Consolidate this locking inside the functions themselves: allow internal
callers to set locked = 0 to make the functions acquire and release the lock
on their own.
Reorganize __gup_longterm_locked() to use the autolocking in
__get_user_pages_locked().
Replace all the places obtaining the mmap_read_lock() just to call
__get_user_pages_locked() with the new mechanism. Replace all the
internal callers of get_user_pages_unlocked() with direct calls to
__gup_longterm_locked() using the new mechanism.
A following patch will add assertions ensuring the external interface
continues to always pass in locked = 1.
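A minimal sketch of the autolocking convention, assuming identifiers from
mm/gup.c of this era; the body is abbreviated and illustrative, not the
actual implementation:

    static long gup_autolock_sketch(struct mm_struct *mm, int *locked)
    {
            bool must_unlock = false;
            long ret;

            if (!*locked) {
                    /* internal caller passed locked = 0: take the lock ourselves */
                    if (mmap_read_lock_killable(mm))
                            return -EAGAIN;
                    must_unlock = true;
                    *locked = 1;
            }

            ret = 0;        /* ... the real GUP work runs here with the lock held ... */

            if (must_unlock && *locked) {
                    mmap_read_unlock(mm);
                    *locked = 0;
            }
            return ret;
    }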
Baoquan He [Mon, 6 Feb 2023 08:40:18 +0000 (16:40 +0800)]
mm/vmalloc: skip the uninitialized vmalloc areas
For areas allocated via the vmalloc_xxx() APIs, an unmapped area is searched
for and reserved, and new pages are allocated to map into it; please see
__vmalloc_node_range(). During this process, the flag VM_UNINITIALIZED is set
in vm->flags to indicate that page allocation and mapping are not yet done,
until clear_vm_uninitialized_flag() is called to clear VM_UNINITIALIZED.
For this kind of area, if VM_UNINITIALIZED is still set, ignore it in vread(),
because the pages newly allocated and being mapped into that area contain
only zero data; reading them out with aligned_vread() is a waste of time.
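A minimal sketch of the check this adds to the vread() loop, assuming names
from mm/vmalloc.c:

    if (vm && (vm->flags & VM_UNINITIALIZED))
            continue;       /* allocation/mapping not finished; contents are all zero */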
Link: https://lkml.kernel.org/r/20230206084020.174506-6-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Dan Carpenter <error27@gmail.com> Cc: Stephen Brennan <stephen.s.brennan@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baoquan He [Mon, 6 Feb 2023 08:40:17 +0000 (16:40 +0800)]
mm/vmalloc: explicitly identify vm_map_ram area when shown in /proc/vmallocinfo
Now, by marking VMAP_RAM in vmap_area->flags for a vm_map_ram area, we can
clearly differentiate it from other vmalloc areas. So identify vm_map_ram
areas by checking VMAP_RAM in vmap_area->flags when they are shown in
/proc/vmallocinfo.
Meanwhile, the code comment above the vm_map_ram area check in s_show() is
no longer needed; remove it here.
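A hedged sketch of what the s_show() change looks like; the exact output
format is illustrative:

    /* in s_show(), for a vmap_area without an associated vm_struct */
    if (!va->vm && (va->flags & VMAP_RAM))
            seq_printf(m, "0x%pK-0x%pK %7ld vm_map_ram\n",
                       (void *)va->va_start, (void *)va->va_end,
                       va->va_end - va->va_start);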
Link: https://lkml.kernel.org/r/20230206084020.174506-5-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Dan Carpenter <error27@gmail.com> Cc: Stephen Brennan <stephen.s.brennan@oracle.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baoquan He [Mon, 6 Feb 2023 08:40:16 +0000 (16:40 +0800)]
mm/vmalloc.c: allow vread() to read out vm_map_ram areas
Currently, vread() can read out vmalloc areas which are associated with a
vm_struct. This doesn't work for areas created by the vm_map_ram() interface,
however, because they have no associated vm_struct; in vread(), these areas
are all skipped.
Here, add a new function vmap_ram_vread() to read out vm_map_ram areas.
An area created directly with the vm_map_ram() interface can be handled like
the other normal vmap areas with aligned_vread(), while areas that are
further subdivided and managed with vmap_block need to be carefully read
out as page-aligned small regions, with holes zero-filled.
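A skeleton of the new reader, assuming the VMAP_BLOCK flag distinguishes the
two cases; abbreviated, not the actual code:

    static void vmap_ram_vread(char *buf, char *addr, int count,
                               unsigned long flags)
    {
            /* a whole vm_map_ram area reads like a normal vmap area */
            if (!(flags & VMAP_BLOCK)) {
                    aligned_vread(buf, addr, count);
                    return;
            }

            /*
             * vmap_block-managed areas: walk the block's used_map, copy out
             * the used page-aligned regions, and zero-fill the dirty and
             * free holes in between.
             */
    }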
Link: https://lkml.kernel.org/r/20230206084020.174506-4-bhe@redhat.com Reported-by: Stephen Brennan <stephen.s.brennan@oracle.com> Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Tested-by: Stephen Brennan <stephen.s.brennan@oracle.com> Cc: Dan Carpenter <error27@gmail.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baoquan He [Mon, 6 Feb 2023 08:40:15 +0000 (16:40 +0800)]
mm/vmalloc.c: add flags to mark vm_map_ram area
Through the vmalloc API, a virtual kernel area is reserved for physical
address mapping. A vmap_area is used to track it, while a vm_struct is
allocated and associated with the vmap_area to store more information and
be passed out.
However, an area reserved via vm_map_ram() is an exception: it has no
vm_struct associated with its vmap_area. And we can't recognize a vmap_area
with '->vm == NULL' as a vm_map_ram() area, because the normal freeing path
sets va->vm = NULL before unmapping; please see remove_vm_area().
Meanwhile, there are two kinds of vm_map_ram areas. In one, the whole
vmap_area is reserved and mapped at one time through the vm_map_ram()
interface; in the other, a whole vmap_area of VMAP_BLOCK_SIZE is reserved
and then mapped as split regions of smaller size via vb_alloc().
To mark areas reserved through vm_map_ram(), add a flags field into struct
vmap_area. Bit 0 indicates a vm_map_ram area created through the
vm_map_ram() interface, while bit 1 marks out the type of vm_map_ram area
that makes use of vmap_block to manage split regions via vb_alloc/free().
This is a preparation for later use.
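The flag layout described above, sketched as it would appear in mm/vmalloc.c
(values follow the bit assignment in this description):

    #define VMAP_RAM        0x1     /* bit 0: area reserved via vm_map_ram() */
    #define VMAP_BLOCK      0x2     /* bit 1: subdivided and managed via vmap_block */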
Link: https://lkml.kernel.org/r/20230206084020.174506-3-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Dan Carpenter <error27@gmail.com> Cc: Stephen Brennan <stephen.s.brennan@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baoquan He [Mon, 6 Feb 2023 08:40:14 +0000 (16:40 +0800)]
mm/vmalloc.c: add used_map into vmap_block to track space of vmap_block
Patch series "mm/vmalloc.c: allow vread() to read out vm_map_ram areas", v5.
Problem:
***
Stephen reported that vread() skips vm_map_ram areas when reading out
/proc/kcore with the drgn utility; please see the link below for more details.
/proc/kcore reads 0's for vmap_block
https://lore.kernel.org/all/87ilk6gos2.fsf@oracle.com/T/#u
Root cause:
***
The normal vmalloc API uses struct vmap_area to manage the virtual kernel
area allocated, and associates a vm_struct with it to store more information
and pass it out. However, an area reserved through the vm_map_ram() interface
has no vm_struct allocated and associated with it, so the current code in
vread() skips vm_map_ram areas via the 'if (!va->vm)' check.
Solution:
***
To mark an area reserved through the vm_map_ram() interface, add a field
'flags' into struct vmap_area. Bit 0 indicates a vm_map_ram area created
through the vm_map_ram() interface; bit 1 marks out the type of vm_map_ram
area which makes use of vmap_block to manage split regions via
vb_alloc/free().
And also add a bitmap field 'used_map' into struct vmap_block to mark the
further subdivided regions being used, differentiating them from the dirty
and free regions in the vmap_block.
With the help of the above vmap_area->flags and vmap_block->used_map, we can
recognize and handle vm_map_ram areas successfully. All this is done in
patches 1~3.
Meanwhile, make some improvements to code related to vm_map_ram areas in
patches 4 and 5, and change the area flag from VM_ALLOC to VM_IOREMAP in
patches 6 and 7, because that shows them as 'ioremap' in /proc/vmallocinfo
and excludes them from /proc/kcore.
This patch (of 7):
In one vmap_block area, there can be three types of regions: regions in use,
allocated through vb_alloc(); dirty regions, freed via vb_free(); and free
regions. Among them, only the regions in use contain usable data, yet there
is currently no way to track them.
Here, add a bitmap field used_map into vmap_block, and set/clear it when
allocating or freeing regions of the vmap_block area.
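A hedged sketch of the tracking, assuming the vb_alloc()/vb_free() call sites
and a bitmap sized like the existing vmap_block bitmaps:

    /* in struct vmap_block */
    DECLARE_BITMAP(used_map, VMAP_BBMAP_BITS);  /* regions handed out by vb_alloc() */

    /* vb_alloc(): region [pages_off, pages_off + (1 << order)) is now used */
    bitmap_set(vb->used_map, pages_off, 1UL << order);

    /* vb_free(): the region becomes dirty, not used */
    bitmap_clear(vb->used_map, offset, 1UL << order);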
Yajun Deng [Fri, 3 Feb 2023 10:01:32 +0000 (18:01 +0800)]
mm/page_alloc: reduce fallbacks to (MIGRATE_PCPTYPES - 1)
Commit 1dd214b8f21c ("mm: page_alloc: avoid merging non-fallbackable
pageblocks with others") removed MIGRATE_CMA and MIGRATE_ISOLATE from the
fallbacks list, so there is no need for an extra element at the end of
every type.
Reduce fallbacks to (MIGRATE_PCPTYPES - 1).
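The resulting table, sketched from the description above; the exact
initializer contents follow mm/page_alloc.c and are illustrative here:

    static int fallbacks[MIGRATE_TYPES][MIGRATE_PCPTYPES - 1] = {
            [MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE },
            [MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE },
            [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE },
    };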
Link: https://lkml.kernel.org/r/20230203100132.1627787-1-yajun.deng@linux.dev Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: David Hildenbrand <david@redhat.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hyunmin Lee [Wed, 1 Feb 2023 11:51:42 +0000 (20:51 +0900)]
mm/vmalloc: replace BUG_ON with a simple if statement
As per the coding standards, in the event of an abnormal condition that
should not occur under normal circumstances, the kernel should attempt
recovery and proceed with execution, rather than halting the machine.
Specifically, in the alloc_vmap_area() function, use a simple if() instead
of a BUG_ON() that halts the machine.
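The shape of the change in alloc_vmap_area(), as a sketch of the pattern
described:

    /* before: BUG_ON(!size); BUG_ON(offset_in_page(size)); BUG_ON(!is_power_of_2(align)); */
    if (unlikely(!size || offset_in_page(size) || !is_power_of_2(align)))
            return ERR_PTR(-EINVAL);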
Link: https://lkml.kernel.org/r/20230201115142.GA7772@min-iamroot Co-developed-by: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com> Signed-off-by: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com> Co-developed-by: Jeungwoo Yoo <casionwoo@gmail.com> Signed-off-by: Jeungwoo Yoo <casionwoo@gmail.com> Co-developed-by: Sangyun Kim <sangyun.kim@snu.ac.kr> Signed-off-by: Sangyun Kim <sangyun.kim@snu.ac.kr> Signed-off-by: Hyunmin Lee <hn.min.lee@gmail.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Longlong Xia [Tue, 31 Jan 2023 07:10:35 +0000 (07:10 +0000)]
mm/swapfile: remove pr_debug in get_swap_pages()
It's known that get_swap_pages() may fail to find available space in some
extreme cases, but the pr_debug() there provides useless information. Let's
remove it.
Link: https://lkml.kernel.org/r/20230131071035.1085968-1-xialonglong1@huawei.com Signed-off-by: Longlong Xia <xialonglong1@huawei.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Chen Wandun <chenwandun@huawei.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
arm: include asm-generic/memory_model.h from page.h rather than memory.h
Patch series "mm, arch: add generic implementation of pfn_valid() for
FLATMEM", v2.
Every architecture that supports the FLATMEM memory model defines its own
version of pfn_valid() that essentially compares a pfn to max_mapnr.
Use the mips/powerpc version, implemented as a static inline, as the generic
implementation of pfn_valid() and drop the per-architecture definitions.
This patch (of 4):
This makes arm consistent with other architectures and allows for a generic
definition of pfn_valid() in asm-generic/memory_model.h, with a clear
override in arch/arm/include/asm/page.h.
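A sketch of the generic FLATMEM pfn_valid() the series introduces in
asm-generic/memory_model.h, modeled on the mips/powerpc version:

    #ifndef pfn_valid
    static inline int pfn_valid(unsigned long pfn)
    {
            /* avoid <linux/mm.h> include hell */
            extern unsigned long max_mapnr;
            unsigned long pfn_offset = ARCH_PFN_OFFSET;

            return pfn >= pfn_offset && (pfn - pfn_offset) < max_mapnr;
    }
    #define pfn_valid pfn_valid
    #endif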
Link: https://lkml.kernel.org/r/20230129124235.209895-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20230129124235.209895-2-rppt@kernel.org Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Brian Cain <bcain@quicinc.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Huacai Chen <chenhuacai@loongson.cn> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vineet Gupta <vgupta@kernel.org> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
As an example of the confusing behavior, the report below hints that the
allocation size was 192, while the kernel actually called kmalloc(184):
==================================================================
BUG: KASAN: slab-out-of-bounds in _find_next_bit+0x143/0x160 lib/find_bit.c:109
Read of size 8 at addr ffff8880175766b8 by task kworker/1:1/26
...
The buggy address belongs to the object at ffff888017576600
which belongs to the cache kmalloc-192 of size 192
The buggy address is located 184 bytes inside of
192-byte region [ffff888017576600, ffff8880175766c0)
...
Memory state around the buggy address:
 ffff888017576580: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
 ffff888017576600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff888017576680: 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc
                                        ^
 ffff888017576700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff888017576780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
With this patch, the report shows:
==================================================================
...
The buggy address belongs to the object at ffff888017576600
which belongs to the cache kmalloc-192 of size 192
The buggy address is located 0 bytes to the right of
allocated 184-byte region [ffff888017576600, ffff8880175766b8)
...
==================================================================
Also report slab use-after-free bugs as "slab-use-after-free" and print
"freed" instead of "allocated" in the report when describing the accessed
memory region.
Also improve the metadata-related comment in kasan_find_first_bad_addr
and use addr_has_metadata across KASAN code instead of open-coding
KASAN_SHADOW_START checks.
mmap_assert_write_locked() is used in the vm_flags modifiers. Because
mmap_assert_write_locked() uses dump_mm(), and vm_flags are sometimes
modified from inside a module, it's necessary to export the dump_mm()
function.
Link: https://lkml.kernel.org/r/20230126193752.297968-8-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjun Roy <arjunroy@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David Rientjes <rientjes@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@google.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Oskolkov <posk@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Sebastian Reichel <sebastian.reichel@collabora.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: introduce __vm_flags_mod and use it in untrack_pfn
There are scenarios when vm_flags can be modified without exclusive
mmap_lock, such as:
- after VMA was isolated and mmap_lock was downgraded or dropped
- in exit_mmap when there are no other mm users and locking is unnecessary
Introduce __vm_flags_mod to avoid assertions when the caller takes
responsibility for the required locking.
Pass a hint to untrack_pfn so it can conditionally use __vm_flags_mod for
flags modification and thereby avoid the assertion.
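A sketch of the helper, assuming it wraps the unasserted initializer as
described:

    static inline void __vm_flags_mod(struct vm_area_struct *vma,
                                      vm_flags_t set, vm_flags_t clear)
    {
            /* no mmap_assert_write_locked(): the caller owns the locking */
            vm_flags_init(vma, (vma->vm_flags | set) & ~clear);
    }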
Link: https://lkml.kernel.org/r/20230126193752.297968-7-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjun Roy <arjunroy@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David Rientjes <rientjes@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@google.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Oskolkov <posk@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Sebastian Reichel <sebastian.reichel@collabora.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
vm_flags are among VMA attributes which affect decisions like VMA merging
and splitting. Therefore all vm_flags modifications are performed after
taking exclusive mmap_lock to prevent vm_flags updates racing with such
operations. Introduce modifier functions for vm_flags to be used whenever
flags are updated. This way we can better check and control correct
locking behavior during these updates.
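A minimal sketch of the modifier pair, with the locking assertion that
motivates them; names follow the series, the bodies are illustrative:

    static inline void vm_flags_set(struct vm_area_struct *vma, vm_flags_t flags)
    {
            mmap_assert_write_locked(vma->vm_mm);
            vma->vm_flags |= flags;
    }

    static inline void vm_flags_clear(struct vm_area_struct *vma, vm_flags_t flags)
    {
            mmap_assert_write_locked(vma->vm_mm);
            vma->vm_flags &= ~flags;
    }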
Link: https://lkml.kernel.org/r/20230126193752.297968-3-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjun Roy <arjunroy@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@google.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Oskolkov <posk@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Sebastian Reichel <sebastian.reichel@collabora.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "introduce vm_flags modifier functions", v4.
This patchset was originally published as a part of per-VMA locking [1]
and was split after suggestion that it's viable on its own and to
facilitate the review process. It is now a prerequisite for the next
version of per-VMA lock patchset, which reuses vm_flags modifier functions
to lock the VMA when vm_flags are being updated.
VMA vm_flags modifications are usually done under exclusive mmap_lock
protection because this attribute affects other decisions like VMA merging
or splitting, and races should be prevented. Introduce vm_flags modifier
functions to enforce correct locking.
This patch (of 7):
Convert vma assignment in vm_area_dup() to a memcpy() to prevent compiler
errors when we add a const modifier to vma->vm_flags.
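The conversion itself is one line; a sketch, assuming vm_area_dup()'s 'orig'
and 'new' naming:

    /* was: *new = *orig;  (struct assignment trips a const vm_flags member) */
    memcpy(new, orig, sizeof(*new));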
Link: https://lkml.kernel.org/r/20230126193752.297968-1-surenb@google.com Link: https://lkml.kernel.org/r/20230126193752.297968-2-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjun Roy <arjunroy@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David Rientjes <rientjes@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@google.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Oskolkov <posk@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Sebastian Reichel <sebastian.reichel@collabora.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Fri, 20 Jan 2023 16:26:50 +0000 (11:26 -0500)]
vma_merge: set vma iterator to correct position.
When merging the previous value, set the vma iterator to the previous
slot. Don't use the vma iterator to get the next/prev so that it is in
the correct position for a write.
Liam R. Howlett [Fri, 20 Jan 2023 16:26:49 +0000 (11:26 -0500)]
mm/mmap: remove __vma_adjust()
Inline the work of __vma_adjust() into vma_merge(). This reduces the code
size and has the added benefit that the comments for the various cases are
now located with the code.
Change the comments referencing vma_adjust() accordingly.
Liam R. Howlett [Fri, 20 Jan 2023 16:26:46 +0000 (11:26 -0500)]
mm/mmap: don't use __vma_adjust() in shift_arg_pages()
Introduce shrink_vma() which uses the vma_prepare() and vma_complete()
functions to reduce the vma coverage.
Convert shift_arg_pages() to use expand_vma() and the new shrink_vma()
function. Remove support for reducing a vma's size from __vma_adjust(),
since shift_arg_pages() is the only user that shrinks a VMA in this way.
Liam R. Howlett [Fri, 20 Jan 2023 16:26:44 +0000 (11:26 -0500)]
mm: don't use __vma_adjust() in __split_vma()
Use the abstracted locking and maple tree operations. Since __split_vma()
is the only user of the __vma_adjust() function to use the insert
argument, drop that argument. Remove the NULL passed through from
fs/exec's shift_arg_pages() and mremap() at the same time.
Liam R. Howlett [Fri, 20 Jan 2023 16:26:43 +0000 (11:26 -0500)]
mm/mmap: introduce init_vma_prep() and init_multi_vma_prep()
Add init_vma_prep() and init_multi_vma_prep() to set up the struct
vma_prepare. This is to abstract the locking when adjusting the VMAs.
Also replace __vma_adjust()'s int variable remove_next with a pointer to the
VMA to remove, and rename next_next to remove2, since this better reflects
its use.
Liam R. Howlett [Fri, 20 Jan 2023 16:26:39 +0000 (11:26 -0500)]
mm: change munmap splitting order and move_vma()
Splitting can be more efficient when the order is not of concern. Change
do_vmi_align_munmap() to reduce walking of the tree during split
operations.
move_vma() must also be altered to remove the dependency on keeping the
original VMA as the active part of the split. Transition to using the vma
iterator to look up the prev and/or next vma after munmap.
Liam R. Howlett [Fri, 20 Jan 2023 16:26:38 +0000 (11:26 -0500)]
mmap: clean up mmap_region() unrolling
Move the unrolling logic to the error path, as opposed to duplicating it
within the function body. This reduces the potential for missing an update
to one path when making changes.
Liam R. Howlett [Fri, 20 Jan 2023 16:26:36 +0000 (11:26 -0500)]
mm: pass vma iterator through to __vma_adjust()
Pass the iterator through to be used in __vma_adjust(). The state of the
iterator needs to be correct for the operation that will occur, so make the
adjustments.
Liam R. Howlett [Fri, 20 Jan 2023 16:26:35 +0000 (11:26 -0500)]
mm: remove unnecessary write to vma iterator in __vma_adjust()
If the vma start address is going to change due to an insert, then it is
safe to not write the vma to the tree. The write of the insert vma will
alter the tree as necessary.
Liam R. Howlett [Fri, 20 Jan 2023 16:26:31 +0000 (11:26 -0500)]
mm/damon/vaddr-test.h: stop using vma_mas_store() for maple tree store
Prepare for the removal of the vma_mas_store() function by open coding the
maple tree store in this test code. Set the range of the maple state and
call the store function directly.
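A sketch of the open-coded store, using the maple tree's public mas_* API:

    mas_set_range(&mas, vma->vm_start, vma->vm_end - 1);
    mas_store_gfp(&mas, vma, GFP_KERNEL);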
Liam R. Howlett [Fri, 20 Jan 2023 16:26:29 +0000 (11:26 -0500)]
nommu: pass through vma iterator to shrink_vma()
Rename the function to vmi_shrink_vma() to indicate it takes the vma
iterator. Use the iterator to preallocate, and drop the delete function.
The maple tree is able to do the modification easier than the linked list
and rbtree, so just clear the necessary area in the tree.
add_vma_to_mm() is no longer used, so drop this function.
vmi_add_vma_to_mm() is now only used once, so inline this function into
do_mmap().
Liam R. Howlett [Thu, 26 Jan 2023 21:20:49 +0000 (16:20 -0500)]
ipc/shm: introduce new do_vma_munmap() to munmap
The shm already has the vma iterator in position for a write.
do_vmi_munmap() searches for the correct position and aligns the write, so
it is not the right function to use in this case.
The shm VMA tree modification is similar to the brk munmap situation: the
vma iterator is in position and the VMA is already known. This patch
generalizes the brk munmap function, do_brk_munmap(), so it can be used by
any other caller with the vma iterator already in position to munmap a VMA.
Liam R. Howlett [Fri, 20 Jan 2023 16:26:12 +0000 (11:26 -0500)]
mm/mmap: remove preallocation from do_mas_align_munmap()
In preparation for passing the vma state through split, the pre-allocation
that occurs before the split has to be moved to after it. Since the
preallocation would then live right next to the store, just call store
instead of preallocating. This effectively restores the potential error
path of splitting and not munmap'ing, which pre-dates the maple tree.
Liam R. Howlett [Fri, 20 Jan 2023 16:26:06 +0000 (11:26 -0500)]
maple_tree: fix handle of invalidated state in mas_wr_store_setup()
If an invalidated maple state is encountered during write, reset the maple
state to MAS_START. This will result in a re-walk of the tree to the
correct location for the write.
Liam R. Howlett [Fri, 20 Jan 2023 16:26:04 +0000 (11:26 -0500)]
maple_tree: reduce user error potential
When iterating, a user may operate on the tree and cause the maple state
to be altered and left in an unintuitive state. Detect this scenario and
correct it by setting to the limit and invalidating the state.
Liam R. Howlett [Fri, 20 Jan 2023 16:26:02 +0000 (11:26 -0500)]
maple_tree: add mas_init() function
Patch series "VMA tree type safety and remove __vma_adjust()", v4.
This patchset does two things: 1. cleans up, including removing
__vma_adjust(), and 2. extends the VMA iterator API to provide type safety
for the VMA operations using the maple tree, as requested by Linus [1].
It also addresses another issue of usability brought up by Linus about
needing to modify the maple state within the loops. The maple state has
been replaced by the VMA iterator and the iterator is now modified within
the MM code so the caller should not need to worry about doing the work
themselves when tree modifications occur.
This brought up a potential inconsistency of the iterator state and what
the user expects, so the inconsistency is addressed to keep the VMA
iterator safe for use after the looping over a VMA range. This is
addressed in patch 3 ("maple_tree: Reduce user error potential") and 4
("test_maple_tree: Test modifications while iterating").
While cleaning up the state, the duplicate locking code in mm/mmap.c
introduced by the maple tree has been addressed by abstracting it into two
functions: vma_prepare() and vma_complete(). These abstractions allowed
for a much simpler __vma_adjust(), which eventually leads to the removal
of the __vma_adjust() function by placing the logic into the vma_merge()
function itself.
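A sketch of the new mas_init() helper, assuming it simply primes a maple
state for a fresh walk from the root:

    static inline void mas_init(struct ma_state *mas, struct maple_tree *tree,
                                unsigned long addr)
    {
            memset(mas, 0, sizeof(struct ma_state));
            mas->tree = tree;
            mas->index = mas->last = addr;
            mas->max = ULONG_MAX;
            mas->node = MAS_START;
    }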
If we have a HIGHMEM system with a large folio, 'offset' may be larger
than PAGE_SIZE, and so min_t will cap at 'len' instead of the intended
end-of-page. That can overflow into the next page which is likely to be
unmapped and fault, but could theoretically copy the wrong data.
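A hedged sketch of the capping logic the fix wants, in
memcpy_from_file_folio() terms; on HIGHMEM the copy is bounded at the end of
the current page rather than at 'len':

    if (folio_test_highmem(folio)) {
            offset = offset_in_page(offset);
            len = min_t(size_t, len, PAGE_SIZE - offset);
    } else {
            len = min_t(size_t, len, folio_size(folio) - offset);
    }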
Link: https://lkml.kernel.org/r/Y919vmSrtAgsf6K3@casper.infradead.org Fixes: 00cdf76012ab ("mm: add memcpy_from_file_folio()") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: "Fabio M. De Francesco" <fmdefrancesco@gmail.com> Cc: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We're masking with the number of type bits instead of the type mask, which
is obviously wrong.
Link: https://lkml.kernel.org/r/39fd91e3-c93b-23c6-afc6-cbe473bb0ca9@redhat.com Fixes: 20aae9eff5ac ("arm/mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Mark Brown <broonie@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Cc: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mpage: convert __mpage_writepage() to use a folio more fully
This is just a conversion to the folio API. While there are some nods
towards supporting multi-page folios in here, the blocks array is still
sized for one page's worth of blocks, and there are other assumptions such
as the blocks_per_page variable.
Patch series "Convert writepage_t to use a folio".
More folioisation. I split out the mpage work from everything else
because it completely dominated the patch, but some implementations I just
converted outright.
This patch (of 2):
We always write back an entire folio, but that's currently passed as the
head page. Convert all filesystems that use write_cache_pages() to expect
a folio instead of a page.
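The resulting callback signature, a sketch of the typedef described:

    typedef int (*writepage_t)(struct folio *folio,
                               struct writeback_control *wbc, void *data);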
This is the equivalent of memcpy_from_page(). It differs in that it takes
the position in a file instead of the offset in a folio, it accepts the
total number of bytes to be copied (instead of the number of bytes to be
copied from this folio), and it returns how many bytes were copied from the
folio, rather than making the caller calculate that and then checking if
the caller got it right.
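Its shape, as a hedged sketch of the signature implied by the description:

    size_t memcpy_from_file_folio(char *to, struct folio *folio,
                                  loff_t pos, size_t len);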
[akpm@linux-foundation.org: fix typo in comment] Link: https://lkml.kernel.org/r/20230126201552.1681588-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: "Fabio M. De Francesco" <fmdefrancesco@gmail.com> Cc: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The ->rw_page method is a special-purpose bypass of the usual bio handling
path that is limited to single-page, synchronous reads and writes, and it
causes a lot of extra code in the drivers, callers, and the block layer.
The only remaining user is the MM swap code. Switch that swap code to
simply submit a single-vec on-stack bio and synchronously wait on it, based
on a newly added QUEUE_FLAG_SYNCHRONOUS flag set instead by the drivers that
currently implement ->rw_page. While this touches one extra cache line and
executes extra code, it simplifies the block layer and drivers and ensures
that all features are properly supported by all drivers; for example, right
now ->rw_page bypasses cgroup writeback entirely.
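A sketch of the driver side: instead of implementing ->rw_page, a synchronous
driver marks its queue and the swap code picks the on-stack-bio path:

    blk_queue_flag_set(QUEUE_FLAG_SYNCHRONOUS, disk->queue);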
[akpm@linux-foundation.org: fix comment typo, per Dan] Link: https://lkml.kernel.org/r/20230125133436.447864-8-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Keith Busch <kbusch@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This series removes the ->rw_page block_device_operation, which is an old
and clumsy attempt at a simple read/write fast path for the block layer.
It isn't actually used by the fastest block layer operations that we
support (polled I/O through io_uring), but only by the mpage buffered I/O
helpers, which are some of the slowest I/O we have and where it makes no
difference at all, and by zram, a block device abused to duplicate the
zswap functionality.
Given that zram is heavily used, we need to make sure there is a good
replacement for synchronous I/O, so this series adds a new flag for drivers
that complete I/O synchronously and uses that flag to switch the swap code
to on-stack bios and synchronous submission for them.
This patch (of 7):
These are micro-optimizations for synchronous I/O, which do not matter
compared to all the other inefficiencies in the legacy buffer_head based
mpage code.
Link: https://lkml.kernel.org/r/20230125133436.447864-1-hch@lst.de Link: https://lkml.kernel.org/r/20230125133436.447864-2-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Keith Busch <kbusch@kernel.org> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: move __remove_vm_area out of va_remove_mappings
__remove_vm_area is the only part of va_remove_mappings that requires a
vmap_area. Move the call out to the caller and only pass the vm_struct to
va_remove_mappings.
Link: https://lkml.kernel.org/r/20230121071051.1143058-7-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Fold __vfree_deferred into vfree_atomic, and call vfree_atomic early on
from vfree if called from interrupt context so that the extra low-level
helper can be avoided.
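A sketch of the early dispatch, assuming vfree()'s shape after the fold:

    void vfree(const void *addr)
    {
            if (unlikely(in_interrupt())) {
                    vfree_atomic(addr);
                    return;
            }
            /* ... normal, possibly-sleeping free path ... */
    }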
Link: https://lkml.kernel.org/r/20230121071051.1143058-4-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
__vfree is a subset of vfree that just skips a few checks, and which is
only used by vfree and an error cleanup path. Fold __vfree into vfree and
switch the only other caller to call vfree() instead.
Link: https://lkml.kernel.org/r/20230121071051.1143058-3-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>