Hao Sun [Wed, 21 Dec 2022 14:42:45 +0000 (22:42 +0800)]
mm: new primitive kvmemdup()
Similar to kmemdup(), but support large amount of bytes with kvmalloc()
and does *not* guarantee that the result will be physically contiguous.
Use only in cases where kvmalloc() is needed and free it with kvfree().
Also adapt policy_unpack.c in case someone bisect into this.
Link: https://lkml.kernel.org/r/20221221144245.27164-1-sunhao.th@gmail.com Signed-off-by: Hao Sun <sunhao.th@gmail.com> Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Nick Terrell <terrelln@fb.com> Cc: John Johansen <john.johansen@canonical.com> Cc: Paul Moore <paul@paul-moore.com> Cc: James Morris <jmorris@namei.org> Cc: "Serge E. Hallyn" <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/swap: convert deactivate_page() to folio_deactivate()
Deactivate_page() has already been converted to use folios, this change
converts it to take in a folio argument instead of calling page_folio().
It also renames the function folio_deactivate() to be more consistent with
other folio functions.
[akpm@linux-foundation.org: fix left-over comments, per Yu Zhao] Link: https://lkml.kernel.org/r/20221221180848.20774-5-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/damon: convert damon_pa_mark_accessed_or_deactivate() to use folios
This change replaces 2 calls to compound_head() from put_page() and 1 call
from mark_page_accessed() with one from page_folio(). This is in
preparation for the conversion of deactivate_page() to folio_deactivate().
Link: https://lkml.kernel.org/r/20221221180848.20774-4-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Convert deactivate_page() to folio_deactivate()", v4.
Deactivate_page() has already been converted to use folios. This patch
series modifies the callers of deactivate_page() to use folios. It also
introduces vm_normal_folio() to assist with folio conversions, and
converts deactivate_page() to folio_deactivate() which takes in a folio.
This patch (of 4):
Introduce a wrapper function called vm_normal_folio(). This function
calls vm_normal_page() and returns the folio of the page found, or null if
no page is found.
This function allows callers to get a folio from a pte, which will
eventually allow them to completely replace their struct page variables
with struct folio instead.
Vernon Yang [Wed, 21 Dec 2022 06:00:58 +0000 (14:00 +0800)]
maple_tree: refine mab_calc_split function
Invert the conditional judgment of the mid_split, to focus the return
statement in the last statement, which is easier to understand and for
better readability.
Vernon Yang [Wed, 21 Dec 2022 06:00:57 +0000 (14:00 +0800)]
maple_tree: refine ma_state init from mas_start()
If mas->node is an MAS_START, there are three cases, and they all assign
different values to mas->node and mas->offset. So there is no need to set
them to a default value before updating.
Update them directly to make them easier to understand and for better
readability.
Vernon Yang [Wed, 21 Dec 2022 06:00:56 +0000 (14:00 +0800)]
maple_tree: remove the redundant code
The macros CONFIG_DEBUG_MAPLE_TREE_VERBOSE no one uses, functions
mas_dup_tree() and mas_dup_store() are not implemented, just function
declaration, so drop it.
Vernon Yang [Wed, 21 Dec 2022 06:00:55 +0000 (14:00 +0800)]
maple_tree: use macro MA_ROOT_PARENT instead of number
When you need to compare whether node->parent is parent of the
root node, using macro MA_ROOT_PARENT is easier to understand
and for better readability.
Lorenzo Stoakes [Mon, 19 Dec 2022 12:36:59 +0000 (12:36 +0000)]
mm: vmalloc: correct use of __GFP_NOWARN mask in __vmalloc_area_node()
This function sets __GFP_NOWARN in the gfp_mask rendering the warn_alloc()
invocations no-ops. Remove this and instead rely on this flag being set
only for the vm_area_alloc_pages() function, ensuring it is cleared for
each of the warn_alloc() calls.
Link: https://lkml.kernel.org/r/20221219123659.90614-1-lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jianlin Lv [Mon, 19 Dec 2022 16:49:17 +0000 (16:49 +0000)]
tools/vm/page_owner_sort: free memory before exit
Although when a process terminates, the kernel will removes memory
associated with that process, It's neither good style nor proper design to
leave it to kernel. This patch free allocated memory before process exit.
Andrey Konovalov [Mon, 19 Dec 2022 18:09:18 +0000 (19:09 +0100)]
kasan: allow sampling page_alloc allocations for HW_TAGS
As Hardware Tag-Based KASAN is intended to be used in production, its
performance impact is crucial. As page_alloc allocations tend to be big,
tagging and checking all such allocations can introduce a significant
slowdown.
Add two new boot parameters that allow to alleviate that slowdown:
- kasan.page_alloc.sample, which makes Hardware Tag-Based KASAN tag only
every Nth page_alloc allocation with the order configured by the second
added parameter (default: tag every such allocation).
- kasan.page_alloc.sample.order, which makes sampling enabled by the first
parameter only affect page_alloc allocations with the order equal or
greater than the specified value (default: 3, see below).
The exact performance improvement caused by using the new parameters
depends on their values and the applied workload.
The chosen default value for kasan.page_alloc.sample.order is 3, which
matches both PAGE_ALLOC_COSTLY_ORDER and SKB_FRAG_PAGE_ORDER. This is
done for two reasons:
1. PAGE_ALLOC_COSTLY_ORDER is "the order at which allocations are deemed
costly to service", which corresponds to the idea that only large and
thus costly allocations are supposed to sampled.
2. One of the workloads targeted by this patch is a benchmark that sends
a large amount of data over a local loopback connection. Most multi-page
data allocations in the networking subsystem have the order of
SKB_FRAG_PAGE_ORDER (or PAGE_ALLOC_COSTLY_ORDER).
When running a local loopback test on a testing MTE-enabled device in sync
mode, enabling Hardware Tag-Based KASAN introduces a ~50% slowdown.
Applying this patch and setting kasan.page_alloc.sampling to a value
higher than 1 allows to lower the slowdown. The performance improvement
saturates around the sampling interval value of 10 with the default
sampling page order of 3. This lowers the slowdown to ~20%. The slowdown
in real scenarios involving the network will likely be better.
Enabling page_alloc sampling has a downside: KASAN misses bad accesses to
a page_alloc allocation that has not been tagged. This lowers the value
of KASAN as a security mitigation.
However, based on measuring the number of page_alloc allocations of
different orders during boot in a test build, sampling with the default
kasan.page_alloc.sample.order value affects only ~7% of allocations. The
rest ~93% of allocations are still checked deterministically.
Link: https://lkml.kernel.org/r/129da0614123bb85ed4dd61ae30842b2dd7c903f.1671471846.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Jann Horn <jannh@google.com> Cc: Mark Brand <markbrand@google.com> Cc: Peter Collingbourne <pcc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Mon, 19 Dec 2022 18:58:40 +0000 (02:58 +0800)]
swap: avoid holding swap reference in swap_cache_get_folio
All its callers either already hold a reference to, or lock the swap
device while calling this function. There is only one exception in
shmem_swapin_folio, just make this caller also hold a reference of the
swap device, so this helper can be simplified and saves a few cycles.
This also provides finer control of error handling in shmem_swapin_folio,
on race (with swap off), it can just try again. For invalid swap entry,
it can fail with a proper error code.
Link: https://lkml.kernel.org/r/20221219185840.25441-5-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Mon, 5 Dec 2022 23:08:28 +0000 (23:08 +0000)]
selftests/damon/sysfs: test filters directory
Add simple test cases for scheme filters of DAMON sysfs interface. The
test cases check if the files are populated as expected, receives some
valid inputs, and refuses some invalid inputs.
SeongJae Park [Mon, 5 Dec 2022 23:08:27 +0000 (23:08 +0000)]
mm/damon/sysfs-schemes: implement scheme filters
Implement scheme filters functionality of DAMON sysfs interface by making
the code reads the values of files under the filter directories and pass
that to DAMON using DAMON kernel API.
SeongJae Park [Mon, 5 Dec 2022 23:08:26 +0000 (23:08 +0000)]
mm/damon/sysfs-schemes: connect filter directory and filters directory
Implement 'nr_filters' file under 'filters' directory, which will be used
to populate specific number of 'filter' directory under the directory,
similar to other 'nr_*' files in DAMON sysfs interface.
Implement DAMOS filter directory which will be located under the filters
directory. The directory provides three files, namely type, matching, and
memcg_path. 'type' and 'matching' will be directly connected to the
fields of 'struct damos_filter' having same name. 'memcg_path' will
receive the path of the memory cgroup of the interest and later converted
to memcg id when it's committed.
DAMOS filters are currently supported by only DAMON kernel API. To expose
the feature to user space, implement a DAMON sysfs directory named
'filters' under each scheme directory. Please note that this is
implementing only the directory. Following commits will implement more
files and directories, and finally connect the DAMOS filters feature.
SeongJae Park [Mon, 5 Dec 2022 23:08:22 +0000 (23:08 +0000)]
mm/damon/reclaim: add a parameter called skip_anon for avoiding anonymous pages reclamation
In some cases, for example if users have confidence at anonymous pages
management or the swap device is too slow, users would want to avoid
DAMON_RECLAIM swapping the anonymous pages out. For such case, add yet
another DAMON_RECLAIM parameter, namely 'skip_anon'. When it is set as
'Y', DAMON_RECLAIM will avoid reclaiming anonymous pages using a DAMOS
filter.
SeongJae Park [Mon, 5 Dec 2022 23:08:21 +0000 (23:08 +0000)]
mm/damon/paddr: support DAMOS filters
Implement support of the DAMOS filters in the physical address space
monitoring operations set, for all DAMOS actions that it supports
including 'pageout', 'lru_prio', and 'lru_deprio'.
SeongJae Park [Mon, 5 Dec 2022 23:08:20 +0000 (23:08 +0000)]
mm/damon/core: implement damos filter
Patch series "implement DAMOS filtering for anon pages and/or specific
memory cgroups"
DAMOS let users do system operations in a data access pattern oriented
way. The data access pattern, which is extracted by DAMON, is somewhat
accurate more than what user space could know in many cases. However, in
some situation, users could know something more than the kernel about the
pattern or some special requirements for some types of memory or
processes. For example, some users would have slow swap devices and knows
latency-ciritical processes and therefore want to use DAMON-based
proactive reclamation (DAMON_RECLAIM) for only non-anonymous pages of
non-latency-critical processes.
For such restriction, users could exclude the memory regions from the
initial monitoring regions and use non-dynamic monitoring regions update
monitoring operations set including fvaddr and paddr. They could also
adjust the DAMOS target access pattern. For dynamically changing memory
layout and access pattern, those would be not enough.
To help the case, add an interface, namely DAMOS filters, which can be
used to avoid the DAMOS actions be applied to specific types of memory, to
DAMON kernel API (damon.h). At the moment, it supports filtering
anonymous pages and/or specific memory cgroups in or out for each DAMOS
scheme.
This patchset adds the support for all DAMOS actions that 'paddr'
monitoring operations set supports ('pageout', 'lru_prio', and
'lru_deprio'), and the functionality is exposed via DAMON kernel API
(damon.h) the DAMON sysfs interface (/sys/kernel/mm/damon/admins/), and
DAMON_RECLAIM module parameters.
Patches Sequence
----------------
First patch implements DAMOS filter interface to DAMON kernel API. Second
patch makes the physical address space monitoring operations set to
support the filters from all supporting DAMOS actions. Third patch adds
anonymous pages filter support to DAMON_RECLAIM, and the fourth patch
documents the DAMON_RECLAIM's new feature. Fifth to seventh patches
implement DAMON sysfs files for support of the filters, and eighth patch
connects the file to use DAMOS filters feature. Ninth patch adds simple
self test cases for DAMOS filters of the sysfs interface. Finally,
following two patches (tenth and eleventh) document the new features and
interfaces.
This patch (of 11):
DAMOS lets users do system operation in a data access pattern oriented
way. The data access pattern, which is extracted by DAMON, is somewhat
accurate more than what user space could know in many cases. However, in
some situation, users could know something more than the kernel about the
pattern or some special requirements for some types of memory or
processes. For example, some users would have slow swap devices and knows
latency-ciritical processes and therefore want to use DAMON-based
proactive reclamation (DAMON_RECLAIM) for only non-anonymous pages of
non-latency-critical processes.
For such restriction, users could exclude the memory regions from the
initial monitoring regions and use non-dynamic monitoring regions update
monitoring operations set including fvaddr and paddr. They could also
adjust the DAMOS target access pattern. For dynamically changing memory
layout and access pattern, those would be not enough.
To help the case, add an interface, namely DAMOS filters, which can be
used to avoid the DAMOS actions be applied to specific types of memory, to
DAMON kernel API (damon.h). At the moment, it supports filtering
anonymous pages and/or specific memory cgroups in or out for each DAMOS
scheme.
Note that this commit adds only the interface to the DAMON kernel API.
The impelmentation should be made in the monitoring operations sets, and
following commits will add that.
Johannes Weiner [Wed, 7 Dec 2022 13:00:39 +0000 (14:00 +0100)]
mm: memcontrol: deprecate charge moving
Charge moving mode in cgroup1 allows memory to follow tasks as they
migrate between cgroups. This is, and always has been, a questionable
thing to do - for several reasons.
First, it's expensive. Pages need to be identified, locked and isolated
from various MM operations, and reassigned, one by one.
Second, it's unreliable. Once pages are charged to a cgroup, there isn't
always a clear owner task anymore. Cache isn't moved at all, for example.
Mapped memory is moved - but if trylocking or isolating a page fails,
it's arbitrarily left behind. Frequent moving between domains may leave a
task's memory scattered all over the place.
Third, it isn't really needed. Launcher tasks can kick off workload tasks
directly in their target cgroup. Using dedicated per-workload groups
allows fine-grained policy adjustments - no need to move tasks and their
physical pages between control domains. The feature was never
forward-ported to cgroup2, and it hasn't been missed.
Despite it being a niche usecase, the maintenance overhead of supporting
it is enormous. Because pages are moved while they are live and subject
to various MM operations, the synchronization rules are complicated.
There are lock_page_memcg() in MM and FS code, which non-cgroup people
don't understand. In some cases we've been able to shift code and cgroup
API calls around such that we can rely on native locking as much as
possible. But that's fragile, and sometimes we need to hold MM locks for
longer than we otherwise would (pte lock e.g.).
Mark the feature deprecated. Hopefully we can remove it soon.
And backport into -stable kernels so that people who develop against
earlier kernels are warned about this deprecation as early as possible.
[akpm@linux-foundation.org: fix memory.rst underlining] Link: https://lkml.kernel.org/r/Y5COd+qXwk/S+n8N@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Tue, 6 Dec 2022 17:13:39 +0000 (18:13 +0100)]
mm: memcontrol: skip moving non-present pages that are mapped elsewhere
Patch series "mm: push down lock_page_memcg()", v2.
This patch (of 3):
During charge moving, the pte lock and the page lock cover nearly all
cases of stabilizing page_mapped(). The only exception is when we're
looking at a non-present pte and find a page in the page cache or in the
swapcache: if the page is mapped elsewhere, it can become unmapped outside
of our control. For this reason, rmap needs lock_page_memcg().
We don't like cgroup-specific locks in generic MM code - especially in
performance-critical MM code - and for a legacy feature that's unlikely to
have many users left - if any.
So remove the exception. Arguably that's better semantics anyway: the
page is shared, and another process seems to be the more active user.
Once we stop moving such pages, rmap doesn't need lock_page_memcg()
anymore. The next patch will remove it.
Mike Kravetz [Fri, 16 Dec 2022 22:45:07 +0000 (14:45 -0800)]
hugetlb: initialize variable to avoid compiler warning
With the gcc 'maybe-uninitialized' warning enabled, gcc will produce:
mm/hugetlb.c:6896:20: warning: `chg' may be used uninitialized
This is a false positive, but may be difficult for the compiler to
determine. maybe-uninitialized is disabled by default, but this gets
flagged as a 0-DAY build regression.
Wenchao Hao [Thu, 8 Dec 2022 14:21:30 +0000 (22:21 +0800)]
cma: tracing: print alloc result in trace_cma_alloc_finish
The result of the allocation attempt is not printed in
trace_cma_alloc_finish, but it's important to do it so we can set filters
to catch specific errors on allocation or to trigger some operations on
specific errors.
We have printed the result in log, but the log is conditional and could
not be filtered by tracing events.
It introduces little overhead to print this result. The result of
allocation is named `errorno' in the trace.
Link: https://lkml.kernel.org/r/20221208142130.1501195-1-haowenchao@huawei.com Signed-off-by: Wenchao Hao <haowenchao@huawei.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Fri, 16 Dec 2022 19:45:37 +0000 (14:45 -0500)]
mm/mempolicy: do not duplicate policy if it is not applicable for set_mempolicy_home_node
set_mempolicy_home_node tries to duplicate a memory policy before checking
it whether it is applicable for the operation. There is no real reason
for doing that and it might actually be a pointless memory allocation and
deallocation exercise for MPOL_INTERLEAVE.
Not a big problem but we can do better. Simply check the policy before
acting on it.
Link: https://lkml.kernel.org/r/20221216194537.238047-2-mathieu.desnoyers@efficios.com Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Huang Ying <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Start converting buffer_heads to use folios".
I was hoping that filesystems would convert from buffer_heads to iomap,
but that's not happening particularly quickly. So the buffer_head
infrastructure needs to be converted from being page-based to being
folio-based.
This patch (of 12):
Buffer heads point to the allocation (ie the folio), not the page. This
is currently the same thing for all filesystems that use buffer heads, so
this is a safe transitional step.
Peter Xu [Fri, 16 Dec 2022 15:52:29 +0000 (10:52 -0500)]
mm/hugetlb: introduce hugetlb_walk()
huge_pte_offset() is the main walker function for hugetlb pgtables. The
name is not really representing what it does, though.
Instead of renaming it, introduce a wrapper function called hugetlb_walk()
which will use huge_pte_offset() inside. Assert on the locks when walking
the pgtable.
Note, the vma lock assertion will be a no-op for private mappings.
Document the last special case in the page_vma_mapped_walk() path where we
don't need any more lock to call hugetlb_walk().
Taking vma lock there is not needed because either: (1) potential callers
of hugetlb pvmw holds i_mmap_rwsem already (from one rmap_walk()), or (2)
the caller will not walk a hugetlb vma at all so the hugetlb code path not
reachable (e.g. in ksm or uprobe paths).
It's slightly implicit for future page_vma_mapped_walk() callers on that
lock requirement. But anyway, when one day this rule breaks, one will get
a straightforward warning in hugetlb_walk() with lockdep, then there'll be
a way out.
[akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/20221216155229.2043750-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Jann Horn <jannh@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Xu [Fri, 16 Dec 2022 15:52:26 +0000 (10:52 -0500)]
mm/hugetlb: make walk_hugetlb_range() safe to pmd unshare
Since walk_hugetlb_range() walks the pgtable, it needs the vma lock to
make sure the pgtable page will not be freed concurrently.
Link: https://lkml.kernel.org/r/20221216155226.2043738-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Jann Horn <jannh@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Xu [Fri, 16 Dec 2022 15:52:23 +0000 (10:52 -0500)]
mm/hugetlb: make follow_hugetlb_page() safe to pmd unshare
Since follow_hugetlb_page() walks the pgtable, it needs the vma lock to
make sure the pgtable page will not be freed concurrently.
Link: https://lkml.kernel.org/r/20221216155223.2043727-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Jann Horn <jannh@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Xu [Fri, 16 Dec 2022 15:52:19 +0000 (10:52 -0500)]
mm/hugetlb: make hugetlb_follow_page_mask() safe to pmd unshare
Since hugetlb_follow_page_mask() walks the pgtable, it needs the vma lock
to make sure the pgtable page will not be freed concurrently.
Link: https://lkml.kernel.org/r/20221216155219.2043714-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Jann Horn <jannh@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Xu [Fri, 16 Dec 2022 15:52:17 +0000 (10:52 -0500)]
mm/hugetlb: make userfaultfd_huge_must_wait() safe to pmd unshare
We can take the hugetlb walker lock, here taking vma lock directly.
Link: https://lkml.kernel.org/r/20221216155217.2043700-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Jann Horn <jannh@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Xu [Fri, 16 Dec 2022 15:50:55 +0000 (10:50 -0500)]
mm/hugetlb: move swap entry handling into vma lock when faulted
In hugetlb_fault(), there used to have a special path to handle swap entry
at the entrance using huge_pte_offset(). That's unsafe because
huge_pte_offset() for a pmd sharable range can access freed pgtables if
without any lock to protect the pgtable from being freed after pmd
unshare.
Here the simplest solution to make it safe is to move the swap handling to
be after the vma lock being held. We may need to take the fault mutex on
either migration or hwpoison entries now (also the vma lock, but that's
really needed), however neither of them is hot path.
Note that the vma lock cannot be released in hugetlb_fault() when the
migration entry is detected, because in migration_entry_wait_huge() the
pgtable page will be used again (by taking the pgtable lock), so that also
need to be protected by the vma lock. Modify migration_entry_wait_huge()
so that it must be called with vma read lock held, and properly release
the lock in __migration_entry_wait_huge().
Link: https://lkml.kernel.org/r/20221216155100.2043537-5-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Jann Horn <jannh@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Xu [Fri, 16 Dec 2022 15:50:54 +0000 (10:50 -0500)]
mm/hugetlb: document huge_pte_offset usage
huge_pte_offset() is potentially a pgtable walker, looking up pte_t* for a
hugetlb address.
Normally, it's always safe to walk a generic pgtable as long as we're with
the mmap lock held for either read or write, because that guarantees the
pgtable pages will always be valid during the process.
But it's not true for hugetlbfs, especially shared: hugetlbfs can have its
pgtable freed by pmd unsharing, it means that even with mmap lock held for
current mm, the PMD pgtable page can still go away from under us if pmd
unsharing is possible during the walk.
So we have two ways to make it safe even for a shared mapping:
(1) If we're with the hugetlb vma lock held for either read/write, it's
okay because pmd unshare cannot happen at all.
(2) If we're with the i_mmap_rwsem lock held for either read/write, it's
okay because even if pmd unshare can happen, the pgtable page cannot
be freed from under us.
Document it.
Link: https://lkml.kernel.org/r/20221216155100.2043537-4-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Jann Horn <jannh@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Xu [Fri, 16 Dec 2022 15:50:53 +0000 (10:50 -0500)]
mm/hugetlb: don't wait for migration entry during follow page
That's what the code does with !hugetlb pages, so we should logically do
the same for hugetlb, so migration entry will also be treated as no page.
This is probably also the last piece in follow_page code that may sleep,
the last one should be removed in cf994dd8af27 ("mm/gup: remove
FOLL_MIGRATION", 2022-11-16).
Link: https://lkml.kernel.org/r/20221216155100.2043537-3-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Jann Horn <jannh@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Xu [Fri, 16 Dec 2022 15:50:52 +0000 (10:50 -0500)]
mm/hugetlb: let vma_offset_start() to return start
Patch series "mm/hugetlb: Make huge_pte_offset() thread-safe for pmd
unshare", v4.
Problem
=======
huge_pte_offset() is a major helper used by hugetlb code paths to walk a
hugetlb pgtable. It's used mostly everywhere since that's needed even
before taking the pgtable lock.
huge_pte_offset() is always called with mmap lock held with either read or
write. It was assumed to be safe but it's actually not. One race
condition can easily trigger by: (1) firstly trigger pmd share on a memory
range, (2) do huge_pte_offset() on the range, then at the meantime, (3)
another thread unshare the pmd range, and the pgtable page is prone to lost
if the other shared process wants to free it completely (by either munmap
or exit mm).
The recent work from Mike on vma lock can resolve most of this already.
It's achieved by forbidden pmd unsharing during the lock being taken, so no
further risk of the pgtable page being freed. It means if we can take the
vma lock around all huge_pte_offset() callers it'll be safe.
There're already a bunch of them that we did as per the latest mm-unstable,
but also quite a few others that we didn't for various reasons especially
on huge_pte_offset() usage.
One more thing to mention is that besides the vma lock, i_mmap_rwsem can
also be used to protect the pgtable page (along with its pgtable lock) from
being freed from under us. IOW, huge_pte_offset() callers need to either
hold the vma lock or i_mmap_rwsem to safely walk the pgtables.
A reproducer of such problem, based on hugetlb GUP (NOTE: since the race is
very hard to trigger, one needs to apply another kernel delay patch too,
see below):
/* Wait a while until proc1_thread1() start to wait */
sleep(0.5);
/* Trigger pmd unshare */
madvise(buf, PSIZE, MADV_DONTNEED);
/* Kick off proc2 to release the pgtable */
write(pipefd[1], &c, 1);
This patchset protects all the huge_pte_offset() callers to also take the
vma lock properly.
Patch Layout
============
Patch 1-2: cleanup, or dependency of the follow up patches
Patch 3: before fixing, document huge_pte_offset() on lock required
Patch 4-8: each patch resolves one possible race condition
Patch 9: introduce hugetlb_walk() to replace huge_pte_offset()
Tests
=====
The series is verified with the above reproducer so the race cannot
trigger anymore. It also passes all hugetlb kselftests.
This patch (of 9):
Even though vma_offset_start() is named like that, it's not returning "the
start address of the range" but rather the offset we should use to offset
the vma->vm_start address.
Make it return the real value of the start vaddr, and it also helps for
all the callers because whenever the retval is used, it'll be ultimately
added into the vma->vm_start anyway, so it's better.
Link: https://lkml.kernel.org/r/20221216155100.2043537-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20221216155100.2043537-2-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Jann Horn <jannh@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Kravetz [Mon, 12 Dec 2022 23:50:42 +0000 (15:50 -0800)]
hugetlb: update vma flag check for hugetlb vma lock
The check for whether a hugetlb vma lock exists partially depends on the
vma's flags. Currently, it checks for either VM_MAYSHARE or VM_SHARED.
The reason both flags are used is because VM_MAYSHARE was previously
cleared in hugetlb vmas as they are tore down. This is no longer the
case, and only the VM_MAYSHARE check is required.
Link: https://lkml.kernel.org/r/20221212235042.178355-2-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jeff Xu [Thu, 15 Dec 2022 00:12:04 +0000 (00:12 +0000)]
mm/memfd: add write seals when apply SEAL_EXEC to executable memfd
In order to avoid WX mappings, add F_SEAL_WRITE when apply F_SEAL_EXEC to
an executable memfd, so W^X from start.
This implys application need to fill the content of the memfd first, after
F_SEAL_EXEC is applied, application can no longer modify the content of
the memfd.
Typically, application seals the memfd right after writing to it.
For example:
1. memfd_create(MFD_EXEC).
2. write() code to the memfd.
3. fcntl(F_ADD_SEALS, F_SEAL_EXEC) to convert the memfd to W^X.
4. call exec() on the memfd.
Link: https://lkml.kernel.org/r/20221215001205.51969-5-jeffxu@google.com Signed-off-by: Jeff Xu <jeffxu@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Daniel Verkamp <dverkamp@chromium.org> Cc: David Herrmann <dh.herrmann@gmail.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: kernel test robot <lkp@intel.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jeff Xu [Thu, 15 Dec 2022 00:12:03 +0000 (00:12 +0000)]
mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC
The new MFD_NOEXEC_SEAL and MFD_EXEC flags allows application to set
executable bit at creation time (memfd_create).
When MFD_NOEXEC_SEAL is set, memfd is created without executable bit
(mode:0666), and sealed with F_SEAL_EXEC, so it can't be chmod to be
executable (mode: 0777) after creation.
when MFD_EXEC flag is set, memfd is created with executable bit
(mode:0777), this is the same as the old behavior of memfd_create.
The new pid namespaced sysctl vm.memfd_noexec has 3 values:
0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
MFD_EXEC was set.
1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
MFD_NOEXEC_SEAL was set.
2: memfd_create() without MFD_NOEXEC_SEAL will be rejected.
The sysctl allows finer control of memfd_create for old-software that
doesn't set the executable bit, for example, a container with
vm.memfd_noexec=1 means the old-software will create non-executable memfd
by default. Also, the value of memfd_noexec is passed to child namespace
at creation time. For example, if the init namespace has
vm.memfd_noexec=2, all its children namespaces will be created with 2.
[akpm@linux-foundation.org: add stub functions to fix build]
[akpm@linux-foundation.org: remove unneeded register_pid_ns_ctl_table_vm() stub, per Jeff]
[akpm@linux-foundation.org: s/pr_warn_ratelimited/pr_warn_once/, per review]
[akpm@linux-foundation.org: fix CONFIG_SYSCTL=n warning] Link: https://lkml.kernel.org/r/20221215001205.51969-4-jeffxu@google.com Signed-off-by: Jeff Xu <jeffxu@google.com> Co-developed-by: Daniel Verkamp <dverkamp@chromium.org> Signed-off-by: Daniel Verkamp <dverkamp@chromium.org> Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: David Herrmann <dh.herrmann@gmail.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Daniel Verkamp [Thu, 15 Dec 2022 00:12:01 +0000 (00:12 +0000)]
mm/memfd: add F_SEAL_EXEC
Patch series "mm/memfd: introduce MFD_NOEXEC_SEAL and MFD_EXEC", v8.
Since Linux introduced the memfd feature, memfd have always had their
execute bit set, and the memfd_create() syscall doesn't allow setting it
differently.
However, in a secure by default system, such as ChromeOS, (where all
executables should come from the rootfs, which is protected by Verified
boot), this executable nature of memfd opens a door for NoExec bypass and
enables “confused deputy attack”. E.g, in VRP bug [1]: cros_vm
process created a memfd to share the content with an external process,
however the memfd is overwritten and used for executing arbitrary code and
root escalation. [2] lists more VRP in this kind.
On the other hand, executable memfd has its legit use, runc uses memfd’s
seal and executable feature to copy the contents of the binary then
execute them, for such system, we need a solution to differentiate runc's
use of executable memfds and an attacker's [3].
To address those above, this set of patches add following:
1> Let memfd_create() set X bit at creation time.
2> Let memfd to be sealed for modifying X bit.
3> A new pid namespace sysctl: vm.memfd_noexec to control the behavior of
X bit.For example, if a container has vm.memfd_noexec=2, then
memfd_create() without MFD_NOEXEC_SEAL will be rejected.
4> A new security hook in memfd_create(). This make it possible to a new
LSM, which rejects or allows executable memfd based on its security policy.
This patch (of 5):
The new F_SEAL_EXEC flag will prevent modification of the exec bits:
written as traditional octal mask, 0111, or as named flags, S_IXUSR |
S_IXGRP | S_IXOTH. Any chmod(2) or similar call that attempts to modify
any of these bits after the seal is applied will fail with errno EPERM.
This will preserve the execute bits as they are at the time of sealing, so
the memfd will become either permanently executable or permanently
un-executable.
Link: https://lkml.kernel.org/r/20221215001205.51969-1-jeffxu@google.com Link: https://lkml.kernel.org/r/20221215001205.51969-2-jeffxu@google.com Signed-off-by: Daniel Verkamp <dverkamp@chromium.org> Co-developed-by: Jeff Xu <jeffxu@google.com> Signed-off-by: Jeff Xu <jeffxu@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: David Herrmann <dh.herrmann@gmail.com> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Xu [Wed, 14 Dec 2022 20:15:33 +0000 (15:15 -0500)]
mm/uffd: always wr-protect pte in pte|pmd_mkuffd_wp()
This patch is a cleanup to always wr-protect pte/pmd in mkuffd_wp paths.
The reasons I still think this patch is worthwhile, are:
(1) It is a cleanup already; diffstat tells.
(2) It just feels natural after I thought about this, if the pte is uffd
protected, let's remove the write bit no matter what it was.
(2) Since x86 is the only arch that supports uffd-wp, it also redefines
pte|pmd_mkuffd_wp() in that it should always contain removals of
write bits. It means any future arch that want to implement uffd-wp
should naturally follow this rule too. It's good to make it a
default, even if with vm_page_prot changes on VM_UFFD_WP.
(3) It covers more than vm_page_prot. So no chance of any potential
future "accident" (like pte_mkdirty() sparc64 or loongarch, even
though it just got its pte_mkdirty fixed <1 month ago). It'll be
fairly clear when reading the code too that we don't worry anything
before a pte_mkuffd_wp() on uncertainty of the write bit.
We may call pte_wrprotect() one more time in some paths (e.g. thp split),
but that should be fully local bitop instruction so the overhead should be
negligible.
Although this patch should logically also fix all the known issues on
uffd-wp too recently on page migration (not for numa hint recovery - that
may need another explcit pte_wrprotect), but this is not the plan for that
fix. So no fixes, and stable doesn't need this.
Link: https://lkml.kernel.org/r/20221214201533.1774616-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ives van Hoorne <ives@codesandbox.io> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sidhartha Kumar [Thu, 15 Dec 2022 06:17:57 +0000 (22:17 -0800)]
mm: move folio_set_compound_order() to mm/internal.h
folio_set_compound_order() is moved to an mm-internal location so external
folio users cannot misuse this function. Change the name of the function
to folio_set_order() and use WARN_ON_ONCE() rather than BUG_ON. Also,
handle the case if a non-large folio is passed and add clarifying comments
to the function.
Link: https://lore.kernel.org/lkml/20221207223731.32784-1-sidhartha.kumar@oracle.com/T/ Link: https://lkml.kernel.org/r/20221215061757.223440-1-sidhartha.kumar@oracle.com Fixes: 9fd330582b2f ("mm: add folio dtor and order setter functions") Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Suggested-by: Mike Kravetz <mike.kravetz@oracle.com> Suggested-by: Muchun Song <songmuchun@bytedance.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Suggested-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Xu [Wed, 14 Dec 2022 20:04:53 +0000 (15:04 -0500)]
mm: fix a few rare cases of using swapin error pte marker
This patch should harden commit 15520a3f0469 ("mm: use pte markers for
swap errors") on using pte markers for swapin errors on a few corner
cases.
1. Propagate swapin errors across fork()s: if there're swapin errors in
the parent mm, after fork()s the child should sigbus too when an error
page is accessed.
2. Fix a rare condition race in pte_marker_clear() where a uffd-wp pte
marker can be quickly switched to a swapin error.
3. Explicitly ignore swapin error pte markers in change_protection().
I mostly don't worry on (2) or (3) at all, but we should still have them.
Case (1) is special because it can potentially cause silent data corrupt
on child when parent has swapin error triggered with swapoff, but since
swapin error is rare itself already it's probably not easy to trigger
either.
Currently there is a priority difference between the uffd-wp bit and the
swapin error entry, in which the swapin error always has higher priority
(e.g. we don't need to wr-protect a swapin error pte marker).
If there will be a 3rd bit introduced, we'll probably need to consider a
more involved approach so we may need to start operate on the bits. Let's
leave that for later.
This patch is tested with case (1) explicitly where we'll get corrupted
data before in the child if there's existing swapin error pte markers, and
after patch applied the child can be rightfully killed.
We don't need to copy stable for this one since 15520a3f0469 just landed
as part of v6.2-rc1, only "Fixes" applied.
Link: https://lkml.kernel.org/r/20221214200453.1772655-3-peterx@redhat.com Fixes: 15520a3f0469 ("mm: use pte markers for swap errors") Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Pengfei Xu <pengfei.xu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Xu [Wed, 14 Dec 2022 20:04:52 +0000 (15:04 -0500)]
mm/uffd: fix pte marker when fork() without fork event
Patch series "mm: Fixes on pte markers".
Patch 1 resolves the syzkiller report from Pengfei.
Patch 2 further harden pte markers when used with the recent swapin error
markers. The major case is we should persist a swapin error marker after
fork(), so child shouldn't read a corrupted page.
This patch (of 2):
When fork(), dst_vma is not guaranteed to have VM_UFFD_WP even if src may
have it and has pte marker installed. The warning is improper along with
the comment. The right thing is to inherit the pte marker when needed, or
keep the dst pte empty.
A vague guess is this happened by an accident when there's the prior patch
to introduce src/dst vma into this helper during the uffd-wp feature got
developed and I probably messed up in the rebase, since if we replace
dst_vma with src_vma the warning & comment it all makes sense too.
Hugetlb did exactly the right here (copy_hugetlb_page_range()). Fix the
general path.
Linus Torvalds [Sun, 15 Jan 2023 13:17:44 +0000 (07:17 -0600)]
Merge tag 'x86_urgent_for_v6.2_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Make sure the poking PGD is pinned for Xen PV as it requires it this
way
- Fixes for two resctrl races when moving a task or creating a new
monitoring group
- Fix SEV-SNP guests running under HyperV where MTRRs are disabled to
not return a UC- type mapping type on memremap() and thus cause a
serious slowdown
- Fix insn mnemonics in bioscall.S now that binutils is starting to fix
confusing insn suffixes
* tag 'x86_urgent_for_v6.2_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: fix poking_init() for Xen PV guests
x86/resctrl: Fix event counts regression in reused RMIDs
x86/resctrl: Fix task CLOSID/RMID update race
x86/pat: Fix pat_x_mtrr_type() for MTRR disabled case
x86/boot: Avoid using Intel mnemonics in AT&T syntax asm
Linus Torvalds [Sun, 15 Jan 2023 13:12:58 +0000 (07:12 -0600)]
Merge tag 'edac_urgent_for_v6.2_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC fixes from Borislav Petkov:
- Fix the EDAC device's confusion in the polling setting units
- Fix a memory leak in highbank's probing function
* tag 'edac_urgent_for_v6.2_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/highbank: Fix memory leak in highbank_mc_probe()
EDAC/device: Fix period calculation in edac_device_reset_delay_period()
Linus Torvalds [Sun, 15 Jan 2023 13:09:41 +0000 (07:09 -0600)]
Merge tag 'powerpc-6.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Fix a build failure with some versions of ld that have an odd version
string
- Fix incorrect use of mutex in the IMC PMU driver
Thanks to Kajol Jain, Michael Petlan, Ojaswin Mujoo, Peter Zijlstra, and
Yang Yingliang.
* tag 'powerpc-6.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s/hash: Make stress_hpt_timer_fn() static
powerpc/imc-pmu: Fix use of mutex in IRQs disabled section
powerpc/boot: Fix incorrect version calculation issue in ld_version
Linus Torvalds [Sat, 14 Jan 2023 16:48:15 +0000 (10:48 -0600)]
Merge tag 'iommu-fixes-v6.2-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu fixes from Joerg Roedel:
- Core: Fix an iommu-group refcount leak
- Fix overflow issue in IOVA alloc path
- ARM-SMMU fixes from Will:
- Fix VFIO regression on NXP SoCs by reporting IOMMU_CAP_CACHE_COHERENCY
- Fix SMMU shutdown paths to avoid device unregistration race
- Error handling fix for Mediatek IOMMU driver
* tag 'iommu-fixes-v6.2-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/mediatek-v1: Fix an error handling path in mtk_iommu_v1_probe()
iommu/iova: Fix alloc iova overflows issue
iommu: Fix refcount leak in iommu_device_claim_dma_owner
iommu/arm-smmu-v3: Don't unregister on shutdown
iommu/arm-smmu: Don't unregister on shutdown
iommu/arm-smmu: Report IOMMU_CAP_CACHE_COHERENCY even betterer
Linus Torvalds [Sat, 14 Jan 2023 16:08:08 +0000 (10:08 -0600)]
Merge tag 'fixes-2023-01-14' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock
Pull memblock fix from Mike Rapoport:
"memblock: always release pages to the buddy allocator in
memblock_free_late()
If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, memblock_free_pages()
only releases pages to the buddy allocator if they are not in the
deferred range. This is correct for free pages (as defined by
for_each_free_mem_pfn_range_in_zone()) because free pages in the
deferred range will be initialized and released as part of the
deferred init process.
memblock_free_pages() is called by memblock_free_late(), which is used
to free reserved ranges after memblock_free_all() has run. All pages
in reserved ranges have been initialized at that point, and
accordingly, those pages are not touched by the deferred init process.
This means that currently, if the pages that memblock_free_late()
intends to release are in the deferred range, they will never be
released to the buddy allocator. They will forever be reserved.
In addition, memblock_free_pages() calls kmsan_memblock_free_pages(),
which is also correct for free pages but is not correct for reserved
pages. KMSAN metadata for reserved pages is initialized by
kmsan_init_shadow(), which runs shortly before memblock_free_all().
For both of these reasons, memblock_free_pages() should only be called
for free pages, and memblock_free_late() should call
__free_pages_core() directly instead.
One case where this issue can occur in the wild is EFI boot on x86_64.
The x86 EFI code reserves all EFI boot services memory ranges via
memblock_reserve() and frees them later via memblock_free_late()
(efi_reserve_boot_services() and efi_free_boot_services(),
respectively).
If any of those ranges happens to fall within the deferred init range,
the pages will not be released and that memory will be unavailable.
For example, on an Amazon EC2 t3.micro VM (1 GB) booting via EFI:
* tag 'fixes-2023-01-14' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
mm: Always release pages to the buddy allocator in memblock_free_late().
Linus Torvalds [Sat, 14 Jan 2023 16:04:00 +0000 (10:04 -0600)]
Merge tag 'hardening-v6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull kernel hardening fixes from Kees Cook:
- Fix CFI hash randomization with KASAN (Sami Tolvanen)
- Check size of coreboot table entry and use flex-array
* tag 'hardening-v6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
kbuild: Fix CFI hash randomization with KASAN
firmware: coreboot: Check size of table entry and use flex-array
Linus Torvalds [Sat, 14 Jan 2023 14:08:25 +0000 (08:08 -0600)]
Merge tag '6.2-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
- memory leak and double free fix
- two symlink fixes
- minor cleanup fix
- two smb1 fixes
* tag '6.2-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Fix uninitialized memory read for smb311 posix symlink create
cifs: fix potential memory leaks in session setup
cifs: do not query ifaces on smb1 mounts
cifs: fix double free on failed kerberos auth
cifs: remove redundant assignment to the variable match
cifs: fix file info setting in cifs_open_file()
cifs: fix file info setting in cifs_query_path_info()
Linus Torvalds [Sat, 14 Jan 2023 13:57:25 +0000 (07:57 -0600)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Two minor fixes in the hisi_sas driver which only impact enterprise
style multi-expander and shared disk situations and no core changes"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: hisi_sas: Set a port invalid only if there are no devices attached when refreshing port id
scsi: hisi_sas: Use abort task set to reset SAS disks when discovered
Linus Torvalds [Sat, 14 Jan 2023 13:52:11 +0000 (07:52 -0600)]
Merge tag 'ata-6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata
Pull ATA fix from Damien Le Moal:
"A single fix to prevent building the pata_cs5535 driver with user mode
linux as it uses msr operations that are not defined with UML"
* tag 'ata-6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata:
ata: pata_cs5535: Don't build on UML
Linus Torvalds [Fri, 13 Jan 2023 23:41:19 +0000 (17:41 -0600)]
Merge tag 'block-6.2-2023-01-13' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
"Nothing major in here, just a collection of NVMe fixes and dropping a
wrong might_sleep() that static checkers tripped over but which isn't
valid"
* tag 'block-6.2-2023-01-13' of git://git.kernel.dk/linux:
MAINTAINERS: stop nvme matching for nvmem files
nvme: don't allow unprivileged passthrough on partitions
nvme: replace the "bool vec" arguments with flags in the ioctl path
nvme: remove __nvme_ioctl
nvme-pci: fix error handling in nvme_pci_enable()
nvme-pci: add NVME_QUIRK_IDENTIFY_CNS quirk to Apple T2 controllers
nvme-apple: add NVME_QUIRK_IDENTIFY_CNS quirk to fix regression
block: Drop spurious might_sleep() from blk_put_queue()
Linus Torvalds [Fri, 13 Jan 2023 23:37:09 +0000 (17:37 -0600)]
Merge tag 'io_uring-6.2-2023-01-13' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
"A fix for a regression that happened last week, rest is fixes that
will be headed to stable as well. In detail:
- Fix for a regression added with the leak fix from last week (me)
- In writing a test case for that leak, inadvertently discovered a
case where we a poll request can race. So fix that up and mark it
for stable, and also ensure that fdinfo covers both the poll tables
that we have. The latter was an oversight when the split poll table
were added (me)
- Fix for a lockdep reported issue with IOPOLL (Pavel)"
* tag 'io_uring-6.2-2023-01-13' of git://git.kernel.dk/linux:
io_uring: lock overflowing for IOPOLL
io_uring/poll: attempt request issue after racy poll wakeup
io_uring/fdinfo: include locked hash table in fdinfo output
io_uring/poll: add hash if ready poll request can't complete inline
io_uring/io-wq: only free worker if it was allocated for creation
Linus Torvalds [Fri, 13 Jan 2023 23:32:22 +0000 (17:32 -0600)]
Merge tag 'pci-v6.2-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull pci fixes from Bjorn Helgaas:
- Work around apparent firmware issue that made Linux reject MMCONFIG
space, which broke PCI extended config space (Bjorn Helgaas)
- Fix CONFIG_PCIE_BT1 dependency due to mid-air collision between a
PCI_MSI_IRQ_DOMAIN -> PCI_MSI change and addition of PCIE_BT1 (Lukas
Bulwahn)
* tag 'pci-v6.2-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
x86/pci: Treat EfiMemoryMappedIO as reservation of ECAM space
x86/pci: Simplify is_mmconf_reserved() messages
PCI: dwc: Adjust to recent removal of PCI_MSI_IRQ_DOMAIN
Sami Tolvanen [Thu, 12 Jan 2023 22:49:48 +0000 (22:49 +0000)]
kbuild: Fix CFI hash randomization with KASAN
Clang emits a asan.module_ctor constructor to each object file
when KASAN is enabled, and these functions are indirectly called
in do_ctors. With CONFIG_CFI_CLANG, the compiler also emits a CFI
type hash before each address-taken global function so they can
pass indirect call checks.
However, in commit 0c3e806ec0f9 ("x86/cfi: Add boot time hash
randomization"), x86 implemented boot time hash randomization,
which relies on the .cfi_sites section generated by objtool. As
objtool is run against vmlinux.o instead of individual object
files with X86_KERNEL_IBT (enabled by default), CFI types in
object files that are not part of vmlinux.o end up not being
included in .cfi_sites, and thus won't get randomized and trip
CFI when called.
Only .vmlinux.export.o and init/version-timestamp.o are linked
into vmlinux separately from vmlinux.o. As these files don't
contain any functions, disable KASAN for both of them to avoid
breaking hash randomization.
Kees Cook [Thu, 12 Jan 2023 23:03:16 +0000 (15:03 -0800)]
firmware: coreboot: Check size of table entry and use flex-array
The memcpy() of the data following a coreboot_table_entry couldn't
be evaluated by the compiler under CONFIG_FORTIFY_SOURCE. To make it
easier to reason about, add an explicit flexible array member to struct
coreboot_device so the entire entry can be copied at once. Additionally,
validate the sizes before copying. Avoids this run-time false positive
warning:
memcpy: detected field-spanning write (size 168) of single field "&device->entry" at drivers/firmware/google/coreboot_table.c:103 (size 8)
Nicholas Piggin [Thu, 12 Jan 2023 10:54:26 +0000 (20:54 +1000)]
kallsyms: Fix scheduling with interrupts disabled in self-test
kallsyms_on_each* may schedule so must not be called with interrupts
disabled. The iteration function could disable interrupts, but this
also changes lookup_symbol() to match the change to the other timing
code.
Peter Foley [Fri, 13 Jan 2023 04:37:06 +0000 (23:37 -0500)]
ata: pata_cs5535: Don't build on UML
This driver uses MSR functions that aren't implemented under UML.
Avoid building it to prevent tripping up allyesconfig.
e.g.
/usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: pata_cs5535.c:(.text+0x3a3): undefined reference to `__tracepoint_read_msr'
/usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: pata_cs5535.c:(.text+0x3d2): undefined reference to `__tracepoint_write_msr'
/usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: pata_cs5535.c:(.text+0x457): undefined reference to `__tracepoint_write_msr'
/usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: pata_cs5535.c:(.text+0x481): undefined reference to `do_trace_write_msr'
/usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: pata_cs5535.c:(.text+0x4d5): undefined reference to `do_trace_write_msr'
/usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: pata_cs5535.c:(.text+0x4f5): undefined reference to `do_trace_read_msr'
/usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: pata_cs5535.c:(.text+0x51c): undefined reference to `do_trace_write_msr'
Signed-off-by: Peter Foley <pefoley2@pefoley.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Linus Torvalds [Fri, 13 Jan 2023 20:41:50 +0000 (14:41 -0600)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Fix the PMCR_EL0 reset value after the PMU rework
- Correctly handle S2 fault triggered by a S1 page table walk by not
always classifying it as a write, as this breaks on R/O memslots
- Document why we cannot exit with KVM_EXIT_MMIO when taking a write
fault from a S1 PTW on a R/O memslot
- Put the Apple M2 on the naughty list for not being able to
correctly implement the vgic SEIS feature, just like the M1 before
it
- Reviewer updates: Alex is stepping down, replaced by Zenghui
x86:
- Fix various rare locking issues in Xen emulation and teach lockdep
to detect them
- Documentation improvements
- Do not return host topology information from KVM_GET_SUPPORTED_CPUID"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86/xen: Avoid deadlock by adding kvm->arch.xen.xen_lock leaf node lock
KVM: Ensure lockdep knows about kvm->lock vs. vcpu->mutex ordering rule
KVM: x86/xen: Fix potential deadlock in kvm_xen_update_runstate_guest()
KVM: x86/xen: Fix lockdep warning on "recursive" gpc locking
Documentation: kvm: fix SRCU locking order docs
KVM: x86: Do not return host topology information from KVM_GET_SUPPORTED_CPUID
KVM: nSVM: clarify recalc_intercepts() wrt CR8
MAINTAINERS: Remove myself as a KVM/arm64 reviewer
MAINTAINERS: Add Zenghui Yu as a KVM/arm64 reviewer
KVM: arm64: vgic: Add Apple M2 cpus to the list of broken SEIS implementations
KVM: arm64: Convert FSC_* over to ESR_ELx_FSC_*
KVM: arm64: Document the behaviour of S1PTW faults on RO memslots
KVM: arm64: Fix S1PTW handling on RO memslots
KVM: arm64: PMU: Fix PMCR_EL0 reset value
Mateusz Guzik [Fri, 13 Jan 2023 18:44:47 +0000 (19:44 +0100)]
lockref: stop doing cpu_relax in the cmpxchg loop
On the x86-64 architecture even a failing cmpxchg grants exclusive
access to the cacheline, making it preferable to retry the failed op
immediately instead of stalling with the pause instruction.
To illustrate the impact, below are benchmark results obtained by
running various will-it-scale tests on top of the 6.2-rc3 kernel and
Cascade Lake (2 sockets * 24 cores * 2 threads) CPU.
All results in ops/s. Note there is some variance in re-runs, but the
code is consistently faster when contention is present.
That is, fixing the problems in access itself *reduces* scalability
after the cacheline ping-pong only happens in lockref with the pause
instruction.
Note that fstat and access benchmarks are not currently integrated into
will-it-scale, but interested parties can find them in pull requests to
said project.
Code at hand has a rather tortured history. First modification showed
up in commit d472d9d98b46 ("lockref: Relax in cmpxchg loop"), written
with Itanium in mind. Later it got patched up to use an arch-dependent
macro to stop doing it on s390 where it caused a significant regression.
Said macro had undergone revisions and was ultimately eliminated later,
going back to cpu_relax.
While I intended to only remove cpu_relax for x86-64, I got the
following comment from Linus:
I would actually prefer just removing it entirely and see if
somebody else hollers. You have the numbers to prove it hurts on
real hardware, and I don't think we have any numbers to the
contrary.
So I think it's better to trust the numbers and remove it as a
failure, than say "let's just remove it on x86-64 and leave
everybody else with the potentially broken code"
Additionally, Will Deacon (maintainer of the arm64 port, one of the
architectures previously benchmarked):
So, from the arm64 side of the fence, I'm perfectly happy just
removing the cpu_relax() calls from lockref.
As such, come back full circle in history and whack it altogether.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/all/CAGudoHHx0Nqg6DE70zAVA75eV-HXfWyhVMWZ-aSeOofkA_=WdA@mail.gmail.com/ Acked-by: Tony Luck <tony.luck@intel.com> # ia64 Acked-by: Nicholas Piggin <npiggin@gmail.com> # powerpc Acked-by: Will Deacon <will@kernel.org> # arm64 Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bjorn Helgaas [Tue, 10 Jan 2023 18:02:43 +0000 (12:02 -0600)]
x86/pci: Treat EfiMemoryMappedIO as reservation of ECAM space
Normally we reject ECAM space unless it is reported as reserved in the E820
table or via a PNP0C02 _CRS method (PCI Firmware, r3.3, sec 4.1.2).
07eab0901ede ("efi/x86: Remove EfiMemoryMappedIO from E820 map"), removes
E820 entries that correspond to EfiMemoryMappedIO regions because some
other firmware uses EfiMemoryMappedIO for PCI host bridge windows, and the
E820 entries prevent Linux from allocating BAR space for hot-added devices.
Some firmware doesn't report ECAM space via PNP0C02 _CRS methods, but does
mention it as an EfiMemoryMappedIO region via EFI GetMemoryMap(), which is
normally converted to an E820 entry by a bootloader or EFI stub. After 07eab0901ede, that E820 entry is removed, so we reject this ECAM space,
which makes PCI extended config space (offsets 0x100-0xfff) inaccessible.
The lack of extended config space breaks anything that relies on it,
including perf, VSEC telemetry, EDAC, QAT, SR-IOV, etc.
Allow use of ECAM for extended config space when the region is covered by
an EfiMemoryMappedIO region, even if it's not included in E820 or PNP0C02
_CRS.
Link: https://lore.kernel.org/r/ac2693d8-8ba3-72e0-5b66-b3ae008d539d@linux.intel.com Link: https://bugzilla.kernel.org/show_bug.cgi?id=216891 Fixes: 07eab0901ede ("efi/x86: Remove EfiMemoryMappedIO from E820 map") Link: https://lore.kernel.org/r/20230110180243.1590045-3-helgaas@kernel.org Reported-by: Kan Liang <kan.liang@linux.intel.com> Reported-by: Tony Luck <tony.luck@intel.com> Reported-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com> Reported-by: Yunying Sun <yunying.sun@intel.com> Reported-by: Baowen Zheng <baowen.zheng@corigine.com> Reported-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Reported-by: Yang Lixiao <lixiao.yang@intel.com> Tested-by: Tony Luck <tony.luck@intel.com> Tested-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com> Tested-by: Kan Liang <kan.liang@linux.intel.com> Tested-by: Yunying Sun <yunying.sun@intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Rafael J. Wysocki <rafael@kernel.org>
Linus Torvalds [Fri, 13 Jan 2023 16:35:26 +0000 (10:35 -0600)]
Merge tag 'docs-6.2-fixes' of git://git.lwn.net/linux
Pull documentation fixes from Jonathan Corbet:
"Three documentation fixes (or rather two and one warning):
- Sphinx 6.0 broke our configuration mechanism, so fix it
- I broke our configuration for non-Alabaster themes; Akira fixed it
- Deprecate Sphinx < 2.4 with an eye toward future removal"
* tag 'docs-6.2-fixes' of git://git.lwn.net/linux:
docs/conf.py: Use about.html only in sidebar of alabaster theme
docs: Deprecate use of Sphinx < 2.4.x
docs: Fix the docs build with Sphinx 6.0
Ard Biesheuvel [Mon, 9 Jan 2023 09:44:31 +0000 (10:44 +0100)]
efi: tpm: Avoid READ_ONCE() for accessing the event log
Nathan reports that recent kernels built with LTO will crash when doing
EFI boot using Fedora's GRUB and SHIM. The culprit turns out to be a
misaligned load from the TPM event log, which is annotated with
READ_ONCE(), and under LTO, this gets translated into a LDAR instruction
which does not tolerate misaligned accesses.
Interestingly, this does not happen when booting the same kernel
straight from the UEFI shell, and so the fact that the event log may
appear misaligned in memory may be caused by a bug in GRUB or SHIM.
However, using READ_ONCE() to access firmware tables is slightly unusual
in any case, and here, we only need to ensure that 'event' is not
dereferenced again after it gets unmapped, but this is already taken
care of by the implicit barrier() semantics of the early_memunmap()
call.
Cc: <stable@vger.kernel.org> Cc: Peter Jones <pjones@redhat.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Reported-by: Nathan Chancellor <nathan@kernel.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Link: https://github.com/ClangBuiltLinux/linux/issues/1782 Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
There is no real problem for normal IOPOLL as flush is also called with
uring_lock taken, but it's getting more complicated for IOPOLL|SQPOLL,
for which __io_cqring_overflow_flush() happens from the CQ waiting path.
Linus Torvalds [Fri, 13 Jan 2023 14:20:29 +0000 (08:20 -0600)]
Merge tag 'sound-6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"This became a slightly big update, but it's more or less expected, as
the first batch after holidays.
All changes (but for the last two last-minute fixes) have been stewed
in linux-next long enough, so it's fairly safe to take:
- PCM UAF fix in 32bit compat layer
- ASoC board-specific fixes for Intel, AMD, Medathek, Qualcomm
- SOF power management fixes
- ASoC Intel link failure fixes
- A series of fixes for USB-audio regressions
- CS35L41 HD-audio codec regression fixes
- HD-audio device-specific fixes / quirks
Note that one SPI patch has been taken in ASoC subtree mistakenly, and
the same fix is found in spi tree, but it should be OK to apply"
* tag 'sound-6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (39 commits)
ALSA: pcm: Move rwsem lock inside snd_ctl_elem_read to prevent UAF
ALSA: usb-audio: Fix possible NULL pointer dereference in snd_usb_pcm_has_fixed_rate()
ALSA: hda/realtek: Enable mute/micmute LEDs on HP Spectre x360 13-aw0xxx
ASoC: fsl-asoc-card: Fix naming of AC'97 CODEC widgets
ASoC: fsl_ssi: Rename AC'97 streams to avoid collisions with AC'97 CODEC
ALSA: hda/hdmi: Add a HP device 0x8715 to force connect list
ALSA: control-led: use strscpy in set_led_id()
ALSA: usb-audio: Always initialize fixed_rate in snd_usb_find_implicit_fb_sync_format()
ASoC: dt-bindings: qcom,lpass-tx-macro: correct clocks on SC7280
ASoC: dt-bindings: qcom,lpass-wsa-macro: correct clocks on SM8250
ASoC: qcom: Fix building APQ8016 machine driver without SOUNDWIRE
ALSA: hda: cs35l41: Check runtime suspend capability at runtime_idle
ALSA: hda: cs35l41: Don't return -EINVAL from system suspend/resume
ASoC: fsl_micfil: Correct the number of steps on SX controls
ALSA: hda/realtek: fix mute/micmute LEDs don't work for a HP platform
Revert "ALSA: usb-audio: Drop superfluous interface setup at parsing"
ALSA: usb-audio: More refactoring of hw constraint rules
ALSA: usb-audio: Relax hw constraints for implicit fb sync
ALSA: usb-audio: Make sure to stop endpoints before closing EPs
ALSA: hda - Enable headset mic on another Dell laptop with ALC3254
...
Linus Torvalds [Fri, 13 Jan 2023 13:38:14 +0000 (07:38 -0600)]
Merge tag 'pm-6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix assorted issues in the ARM cpufreq drivers and in the AMD
P-state driver.
Specifics:
- Fix cpufreq policy reference counting in amd-pstate to prevent it
from crashing on removal (Perry Yuan)
- Fix double initialization and set suspend-freq for Apple's cpufreq
driver (Arnd Bergmann, Hector Martin)
- Fix reading of "reg" property, update cpufreq-dt's blocklist and
update DT documentation for Qualcomm's cpufreq driver (Konrad
Dybcio, Krzysztof Kozlowski)
- Replace 0 with NULL in the Armada cpufreq driver (Miles Chen)
- Fix potential overflows in the CPPC cpufreq driver (Pierre Gondois)
- Update blocklist for the Tegra234 Soc cpufreq driver (Sumit Gupta)"
* tag 'pm-6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: amd-pstate: fix kernel hang issue while amd-pstate unregistering
cpufreq: armada-37xx: stop using 0 as NULL pointer
cpufreq: apple-soc: Switch to the lowest frequency on suspend
dt-bindings: cpufreq: cpufreq-qcom-hw: document interrupts
cpufreq: Add SM6375 to cpufreq-dt-platdev blocklist
cpufreq: Add Tegra234 to cpufreq-dt-platdev blocklist
cpufreq: qcom-hw: Fix reading "reg" with address/size-cells != 2
cpufreq: CPPC: Add u64 casts to avoid overflowing
cpufreq: apple: remove duplicate intializer