Yafang Shao [Wed, 12 Aug 2020 01:31:22 +0000 (18:31 -0700)]
mm, oom: make the calculation of oom badness more accurate
Recently we found an issue on our production environment that when memcg
oom is triggered the oom killer doesn't chose the process with largest
resident memory but chose the first scanned process. Note that all
processes in this memcg have the same oom_score_adj, so the oom killer
should chose the process with largest resident memory.
We can find that the first scanned process 5740 (pause) was killed, but
its rss is only one page. That is because, when we calculate the oom
badness in oom_badness(), we always ignore the negtive point and convert
all of these negtive points to 1. Now as oom_score_adj of all the
processes in this targeted memcg have the same value -998, the points of
these processes are all negtive value. As a result, the first scanned
process will be killed.
The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a
a Guaranteed pod, which has higher priority to prevent from being killed
by system oom.
To fix this issue, we should make the calculation of oom point more
accurate. We can achieve it by convert the chosen_point from 'unsigned
long' to 'long'.
[cai@lca.pw: reported a issue in the previous version]
[mhocko@suse.com: fixed the issue reported by Cai]
[mhocko@suse.com: add the comment in proc_oom_score()]
[laoar.shao@gmail.com: v3] Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Qian Cai <cai@lca.pw> Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wenchao Hao [Wed, 12 Aug 2020 01:31:16 +0000 (18:31 -0700)]
mm/mempolicy.c: check parameters first in kernel_get_mempolicy
Previous implementatoin calls untagged_addr() before error check, while if
the error check failed and return EINVAL, the untagged_addr() call is just
useless work.
Signed-off-by: Wenchao Hao <haowenchao22@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200801090825.5597-1-haowenchao22@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: mempolicy: fix kerneldoc of numa_map_to_online_node()
Fix W=1 compile warnings (invalid kerneldoc):
mm/mempolicy.c:137: warning: Function parameter or member 'node' not described in 'numa_map_to_online_node'
mm/mempolicy.c:137: warning: Excess function parameter 'nid' description in 'numa_map_to_online_node'
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200728171109.28687-3-krzk@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alex Shi [Wed, 12 Aug 2020 01:31:10 +0000 (18:31 -0700)]
mm/compaction: correct the comments of compact_defer_shift
There is no compact_defer_limit. It should be compact_defer_shift in
use. and add compact_order_failed explanation.
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Link: http://lkml.kernel.org/r/3bd60e1b-a74e-050d-ade4-6e8f54e00b92@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nitin Gupta [Wed, 12 Aug 2020 01:31:07 +0000 (18:31 -0700)]
mm: use unsigned types for fragmentation score
Proactive compaction uses per-node/zone "fragmentation score" which is
always in range [0, 100], so use unsigned type of these scores as well as
for related constants.
Signed-off-by: Nitin Gupta <nigupta@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Link: http://lkml.kernel.org/r/20200618010319.13159-1-nigupta@nvidia.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nitin Gupta [Wed, 12 Aug 2020 01:31:04 +0000 (18:31 -0700)]
mm: fix compile error due to COMPACTION_HPAGE_ORDER
Fix compile error when COMPACTION_HPAGE_ORDER is assigned to
HUGETLB_PAGE_ORDER. The correct way to check if this constant is defined
is to check for CONFIG_HUGETLBFS.
Nitin Gupta [Wed, 12 Aug 2020 01:31:00 +0000 (18:31 -0700)]
mm: proactive compaction
For some applications, we need to allocate almost all memory as hugepages.
However, on a running system, higher-order allocations can fail if the
memory is fragmented. Linux kernel currently does on-demand compaction as
we request more hugepages, but this style of compaction incurs very high
latency. Experiments with one-time full memory compaction (followed by
hugepage allocations) show that kernel is able to restore a highly
fragmented memory state to a fairly compacted memory state within <1 sec
for a 32G system. Such data suggests that a more proactive compaction can
help us allocate a large fraction of memory as hugepages keeping
allocation latencies low.
For a more proactive compaction, the approach taken here is to define a
new sysctl called 'vm.compaction_proactiveness' which dictates bounds for
external fragmentation which kcompactd tries to maintain.
The tunable takes a value in range [0, 100], with a default of 20.
Note that a previous version of this patch [1] was found to introduce too
many tunables (per-order extfrag{low, high}), but this one reduces them to
just one sysctl. Also, the new tunable is an opaque value instead of
asking for specific bounds of "external fragmentation", which would have
been difficult to estimate. The internal interpretation of this opaque
value allows for future fine-tuning.
Currently, we use a simple translation from this tunable to [low, high]
"fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
The score for a node is defined as weighted mean of per-zone external
fragmentation. A zone's present_pages determines its weight.
To periodically check per-node score, we reuse per-node kcompactd threads,
which are woken up every 500 milliseconds to check the same. If a node's
score exceeds its high threshold (as derived from user-provided
proactiveness value), proactive compaction is started until its score
reaches its low threshold value. By default, proactiveness is set to 20,
which implies threshold values of low=80 and high=90.
This patch is largely based on ideas from Michal Hocko [2]. See also the
LWN article [3].
Performance data
================
System: x64_64, 1T RAM, 80 CPU threads.
Kernel: 5.6.0-rc3 + this patch
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
Before starting the driver, the system was fragmented from a userspace
program that allocates all memory and then for each 2M aligned section,
frees 3/4 of base pages using munmap. The workload is mainly anonymous
userspace pages, which are easy to move around. I intentionally avoided
unmovable pages in this test to see how much latency we incur when
hugepage allocations hit direct compaction.
1. Kernel hugepage allocation latencies
With the system in such a fragmented state, a kernel driver then allocates
as many hugepages as possible and measures allocation latency:
Total 2M hugepages allocated = 384105 (750G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)
2. JAVA heap allocation
In this test, we first fragment memory using the same method as for (1).
Then, we start a Java process with a heap size set to 700G and request the
heap to be allocated with THP hugepages. We also set THP to madvise to
allow hugepage backing of this heap.
The above command allocates 700G of Java heap using hugepages.
- With vanilla 5.6.0-rc3
17.39user 1666.48system 27:37.89elapsed
- With 5.6.0-rc3 + this patch, with proactiveness=20
8.35user 194.58system 3:19.62elapsed
Elapsed time remains around 3:15, as proactiveness is further increased.
Note that proactive compaction happens throughout the runtime of these
workloads. The situation of one-time compaction, sufficient to supply
hugepages for following allocation stream, can probably happen for more
extreme proactiveness values, like 80 or 90.
In the above Java workload, proactiveness is set to 20. The test starts
with a node's score of 80 or higher, depending on the delay between the
fragmentation step and starting the benchmark, which gives more-or-less
time for the initial round of compaction. As t he benchmark consumes
hugepages, node's score quickly rises above the high threshold (90) and
proactive compaction starts again, which brings down the score to the low
threshold level (80). Repeat.
bpftrace also confirms proactive compaction running 20+ times during the
runtime of this Java benchmark. kcompactd threads consume 100% of one of
the CPUs while it tries to bring a node's score within thresholds.
Backoff behavior
================
Above workloads produce a memory state which is easy to compact. However,
if memory is filled with unmovable pages, proactive compaction should
essentially back off. To test this aspect:
- Created a kernel driver that allocates almost all memory as hugepages
followed by freeing first 3/4 of each hugepage.
- Set proactiveness=40
- Note that proactive_compact_node() is deferred maximum number of times
with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
(=> ~30 seconds between retries).
Michal Koutný [Wed, 12 Aug 2020 01:30:57 +0000 (18:30 -0700)]
/proc/PID/smaps: consistent whitespace output format
The keys in smaps output are padded to fixed width with spaces. All
except for THPeligible that uses tabs (only since commit c06306696f83
("mm: thp: fix false negative of shmem vma's THP eligibility")).
Unify the output formatting to save time debugging some naïve parsers.
(Part of the unification is also aligning FilePmdMapped with others.)
Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Yang Shi <yang.shi@linux.alibaba.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Link: http://lkml.kernel.org/r/20200728083207.17531-1-mkoutny@suse.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Wed, 12 Aug 2020 01:30:54 +0000 (18:30 -0700)]
mm/vmscan: restore active/inactive ratio for anonymous LRU
Now that workingset detection is implemented for anonymous LRU, we don't
need large inactive list to allow detecting frequently accessed pages
before they are reclaimed, anymore. This effectively reverts the
temporary measure put in by commit "mm/vmscan: make active/inactive ratio
as 1:1 for anon lru".
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Link: http://lkml.kernel.org/r/1595490560-15117-7-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Wed, 12 Aug 2020 01:30:50 +0000 (18:30 -0700)]
mm/swap: implement workingset detection for anonymous LRU
This patch implements workingset detection for anonymous LRU. All the
infrastructure is implemented by the previous patches so this patch just
activates the workingset detection by installing/retrieving the shadow
entry and adding refault calculation.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Link: http://lkml.kernel.org/r/1595490560-15117-6-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Wed, 12 Aug 2020 01:30:47 +0000 (18:30 -0700)]
mm/swapcache: support to handle the shadow entries
Workingset detection for anonymous page will be implemented in the
following patch and it requires to store the shadow entries into the
swapcache. This patch implements an infrastructure to store the shadow
entry in the swapcache.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/1595490560-15117-5-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Wed, 12 Aug 2020 01:30:43 +0000 (18:30 -0700)]
mm/workingset: prepare the workingset detection infrastructure for anon LRU
To prepare the workingset detection for anon LRU, this patch splits
workingset event counters for refault, activate and restore into anon and
file variants, as well as the refaults counter in struct lruvec.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Link: http://lkml.kernel.org/r/1595490560-15117-4-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Wed, 12 Aug 2020 01:30:40 +0000 (18:30 -0700)]
mm/vmscan: protect the workingset on anonymous LRU
In current implementation, newly created or swap-in anonymous page is
started on active list. Growing active list results in rebalancing
active/inactive list so old pages on active list are demoted to inactive
list. Hence, the page on active list isn't protected at all.
Following is an example of this situation.
Assume that 50 hot pages on active list. Numbers denote the number of
pages on active/inactive list (active | inactive).
1. 50 hot pages on active list
50(h) | 0
2. workload: 50 newly created (used-once) pages
50(uo) | 50(h)
3. workload: another 50 newly created (used-once) pages
50(uo) | 50(uo), swap-out 50(h)
This patch tries to fix this issue. Like as file LRU, newly created or
swap-in anonymous pages will be inserted to the inactive list. They are
promoted to active list if enough reference happens. This simple
modification changes the above example as following.
1. 50 hot pages on active list
50(h) | 0
2. workload: 50 newly created (used-once) pages
50(h) | 50(uo)
3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(uo)
As you can see, hot pages on active list would be protected.
Note that, this implementation has a drawback that the page cannot be
promoted and will be swapped-out if re-access interval is greater than the
size of inactive list but less than the size of total(active+inactive).
To solve this potential issue, following patch will apply workingset
detection similar to the one that's already applied to file LRU.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Link: http://lkml.kernel.org/r/1595490560-15117-3-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Wed, 12 Aug 2020 01:30:36 +0000 (18:30 -0700)]
mm/vmscan: make active/inactive ratio as 1:1 for anon lru
Patch series "workingset protection/detection on the anonymous LRU list", v7.
* PROBLEM
In current implementation, newly created or swap-in anonymous page is
started on the active list. Growing the active list results in
rebalancing active/inactive list so old pages on the active list are
demoted to the inactive list. Hence, hot page on the active list isn't
protected at all.
Following is an example of this situation.
Assume that 50 hot pages on active list and system can contain total 100
pages. Numbers denote the number of pages on active/inactive list (active
| inactive). (h) stands for hot pages and (uo) stands for used-once
pages.
1. 50 hot pages on active list
50(h) | 0
2. workload: 50 newly created (used-once) pages
50(uo) | 50(h)
3. workload: another 50 newly created (used-once) pages
50(uo) | 50(uo), swap-out 50(h)
As we can see, hot pages are swapped-out and it would cause swap-in later.
* SOLUTION
Since this is what we want to avoid, this patchset implements workingset
protection. Like as the file LRU list, newly created or swap-in anonymous
page is started on the inactive list. Also, like as the file LRU list, if
enough reference happens, the page will be promoted. This simple
modification changes the above example as following.
1. 50 hot pages on active list
50(h) | 0
2. workload: 50 newly created (used-once) pages
50(h) | 50(uo)
3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(uo)
hot pages remains in the active list. :)
* EXPERIMENT
I tested this scenario on my test bed and confirmed that this problem
happens on current implementation. I also checked that it is fixed by
this patchset.
* SUBJECT
workingset detection
* PROBLEM
Later part of the patchset implements the workingset detection for the
anonymous LRU list. There is a corner case that workingset protection
could cause thrashing. If we can avoid thrashing by workingset detection,
we can get the better performance.
Following is an example of thrashing due to the workingset protection.
1. 50 hot pages on active list
50(h) | 0
2. workload: 50 newly created (will be hot) pages
50(h) | 50(wh)
3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(wh)
5. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(wh)
6. repeat 4, 5
Without workingset detection, this kind of workload cannot be promoted and
thrashing happens forever.
* SOLUTION
Therefore, this patchset implements workingset detection. All the
infrastructure for workingset detecion is already implemented, so there is
not much work to do. First, extend workingset detection code to deal with
the anonymous LRU list. Then, make swap cache handles the exceptional
value for the shadow entry. Lastly, install/retrieve the shadow value
into/from the swap cache and check the refault distance.
* EXPERIMENT
I made a test program to imitates above scenario and confirmed that
problem exists. Then, I checked that this patchset fixes it.
My test setup is a virtual machine with 8 cpus and 6100MB memory. But,
the amount of the memory that the test program can use is about 280 MB.
This is because the system uses large ram-backed swap and large ramdisk to
capture the trace.
Since hot-1 memory (96MB) is on the active list, the inactive list can
contains roughly 190MB pages. hot-2 memory's re-access interval (96+128
MB) is more 190MB, so it cannot be promoted without workingset detection
and swap-in/out happens repeatedly. With this patchset, workingset
detection works and promotion happens. Therefore, swap-in/out occurs
less.
Here is the result. (average of 5 runs)
type swap-in swap-out
base 863240 989945
patch 681565 809273
As we can see, patched kernel do less swap-in/out.
* OVERALL TEST (ebizzy using modified random function)
ebizzy is the test program that main thread allocates lots of memory and
child threads access them randomly during the given times. Swap-in will
happen if allocated memory is larger than the system memory.
The random function that represents the zipf distribution is used to make
hot/cold memory. Hot/cold ratio is controlled by the parameter. If the
parameter is high, hot memory is accessed much larger than cold one. If
the parameter is low, the number of access on each memory would be
similar. I uses various parameters in order to show the effect of
patchset on various hot/cold ratio workload.
My test setup is a virtual machine with 8 cpus, 1024 MB memory and 5120 MB
ram swap.
Result format is as following.
param: 1-1024-0.1
- 1 (number of thread)
- 1024 (allocated memory size, MB)
- 0.1 (zipf distribution alpha,
0.1 works like as roughly uniform random,
1.3 works like as small portion of memory is hot and the others are cold)
pswpin: smaller is better
std: standard deviation
improvement: negative is better
As we can see, all the cases show improvement. Especially, test case with
zipf distribution 1.3 show more improvements. It means that if there is a
hot/cold tendency in anon pages, this patchset works better.
This patch (of 6):
Current implementation of LRU management for anonymous page has some
problems. Most important one is that it doesn't protect the workingset,
that is, pages on the active LRU list. Although, this problem will be
fixed in the following patchset, the preparation is required and this
patch does it.
What following patch does is to implement workingset protection. After
the following patchset, newly created or swap-in pages will start their
lifetime on the inactive list. If inactive list is too small, there is
not enough chance to be referenced and the page cannot become the
workingset.
In order to provide the newly anonymous or swap-in pages enough chance to
be referenced again, this patch makes active/inactive LRU ratio as 1:1.
This is just a temporary measure. Later patch in the series introduces
workingset detection for anonymous LRU that will be used to better decide
if pages should start on the active and inactive list. Afterwards this
patch is effectively reverted.
Muchun Song [Wed, 12 Aug 2020 01:30:32 +0000 (18:30 -0700)]
mm/hugetlb: add mempolicy check in the reservation routine
In the reservation routine, we only check whether the cpuset meets the
memory allocation requirements. But we ignore the mempolicy of MPOL_BIND
case. If someone mmap hugetlb succeeds, but the subsequent memory
allocation may fail due to mempolicy restrictions and receives the SIGBUS
signal. This can be reproduced by the follow steps.
1) Compile the test case.
cd tools/testing/selftests/vm/
gcc map_hugetlb.c -o map_hugetlb
2) Pre-allocate huge pages. Suppose there are 2 numa nodes in the
system. Each node will pre-allocate one huge page.
echo 2 > /proc/sys/vm/nr_hugepages
3) Run test case(mmap 4MB). We receive the SIGBUS signal.
numactl --membind=3D0 ./map_hugetlb 4
With this patch applied, the mmap will fail in the step 3) and throw
"mmap: Cannot allocate memory".
[akpm@linux-foundation.org: include sched.h for `current']
Reported-by: Jianchao Guo <guojianchao@bytedance.com> Suggested-by: Michal Hocko <mhocko@kernel.org> Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Baoquan He <bhe@redhat.com> Link: http://lkml.kernel.org/r/20200728034938.14993-1-songmuchun@bytedance.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Roman Gushchin [Wed, 12 Aug 2020 01:30:29 +0000 (18:30 -0700)]
kselftests: cgroup: add perpcu memory accounting test
Add a simple test to check the percpu memory accounting. The test creates
a cgroup tree with 1000 child cgroups and checks values of memory.current
and memory.stat::percpu.
Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Tobin C. Harding <tobin@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Bixuan Cui <cuibixuan@huawei.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/20200608230819.832349-6-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Roman Gushchin [Wed, 12 Aug 2020 01:30:25 +0000 (18:30 -0700)]
mm: memcg: charge memcg percpu memory to the parent cgroup
Memory cgroups are using large chunks of percpu memory to store vmstat
data. Yet this memory is not accounted at all, so in the case when there
are many (dying) cgroups, it's not exactly clear where all the memory is.
Because the size of memory cgroup internal structures can dramatically
exceed the size of object or page which is pinning it in the memory, it's
not a good idea to simply ignore it. It actually breaks the isolation
between cgroups.
Let's account the consumed percpu memory to the parent cgroup.
[guro@fb.com: add WARN_ON_ONCE()s, per Johannes] Link: http://lkml.kernel.org/r/20200811170611.GB1507044@carbon.DHCP.thefacebook.com Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Dennis Zhou <dennis@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Tobin C. Harding <tobin@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: Bixuan Cui <cuibixuan@huawei.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/20200623184515.4132564-5-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Percpu memory can represent a noticeable chunk of the total memory
consumption, especially on big machines with many CPUs. Let's track
percpu memory usage for each memcg and display it in memory.stat.
A percpu allocation is usually scattered over multiple pages (and nodes),
and can be significantly smaller than a page. So let's add a byte-sized
counter on the memcg level: MEMCG_PERCPU_B. Byte-sized vmstat infra
created for slabs can be perfectly reused for percpu case.
[guro@fb.com: v3] Link: http://lkml.kernel.org/r/20200623184515.4132564-4-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Dennis Zhou <dennis@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Tobin C. Harding <tobin@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: Bixuan Cui <cuibixuan@huawei.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/20200608230819.832349-4-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Roman Gushchin [Wed, 12 Aug 2020 01:30:17 +0000 (18:30 -0700)]
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning] Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Bixuan Cui <cuibixuan@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Dennis Zhou <dennis@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Tobin C. Harding <tobin@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: Bixuan Cui <cuibixuan@huawei.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Roman Gushchin [Wed, 12 Aug 2020 01:30:14 +0000 (18:30 -0700)]
percpu: return number of released bytes from pcpu_free_area()
Patch series "mm: memcg accounting of percpu memory", v3.
This patchset adds percpu memory accounting to memory cgroups. It's based
on the rework of the slab controller and reuses concepts and features
introduced for the per-object slab accounting.
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
Percpu allocations by their nature are scattered over multiple pages, so
they can't be tracked on the per-page basis. So the per-object tracking
introduced by the new slab controller is reused.
The patchset implements charging of percpu allocations, adds memcg-level
statistics, enables accounting for percpu allocations made by memory
cgroup internals and provides some basic tests.
To implement the accounting of percpu memory without a significant memory
and performance overhead the following approach is used: all accounted
allocations are placed into a separate percpu chunk (or chunks). These
chunks are similar to default chunks, except that they do have an attached
vector of pointers to obj_cgroup objects, which is big enough to save a
pointer for each allocated object. On the allocation, if the allocation
has to be accounted (__GFP_ACCOUNT is passed, the allocating process
belongs to a non-root memory cgroup, etc), the memory cgroup is getting
charged and if the maximum limit is not exceeded the allocation is
performed using a memcg-aware chunk. Otherwise -ENOMEM is returned or the
allocation is forced over the limit, depending on gfp (as any other kernel
memory allocation). The memory cgroup information is saved in the
obj_cgroup vector at the corresponding offset. On the release time the
memcg information is restored from the vector and the cgroup is getting
uncharged. Unaccounted allocations (at this point the absolute majority
of all percpu allocations) are performed in the old way, so no additional
overhead is expected.
To avoid pinning dying memory cgroups by outstanding allocations,
obj_cgroup API is used instead of directly saving memory cgroup pointers.
obj_cgroup is basically a pointer to a memory cgroup with a standalone
reference counter. The trick is that it can be atomically swapped to
point at the parent cgroup, so that the original memory cgroup can be
released prior to all objects, which has been charged to it. Because all
charges and statistics are fully recursive, it's perfectly correct to
uncharge the parent cgroup instead. This scheme is used in the slab
memory accounting, and percpu memory can just follow the scheme.
This patch (of 5):
To implement accounting of percpu memory we need the information about the
size of freed object. Return it from pcpu_free_area().
Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tobin C. Harding <tobin@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com>
cC: Michal Koutnýutny@suse.com> Cc: Bixuan Cui <cuibixuan@huawei.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/20200623184515.4132564-1-guro@fb.com Link: http://lkml.kernel.org/r/20200608230819.832349-1-guro@fb.com Link: http://lkml.kernel.org/r/20200608230819.832349-2-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Tue, 11 Aug 2020 02:21:38 +0000 (19:21 -0700)]
Merge tag 'perf-tools-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
Pull perf tools updates from Arnaldo Carvalho de Melo:
"New features:
- Introduce controlling how 'perf stat' and 'perf record' works via a
control file descriptor, allowing starting with events configured
but disabled until commands are received via the control file
descriptor. This allows, for instance for tools such as Intel VTune
to make further use of perf as its Linux platform driver.
- Improve 'perf record' to to register in a perf.data file header the
clockid used to help later correlate things like syslog files and
perf events recorded.
- Add basic syscall and find_next_bit benchmarks to 'perf bench'.
- Allow using computed metrics in calculating other metrics. For
instance:
- Add suport for 'd_ratio', '>' and '<' operators to the expression
resolver used in calculating metrics in 'perf stat'.
Support for new kernel features:
- Support TEXT_POKE and KSYMBOL_TYPE_OOL perf metadata events to cope
with things like ftrace, trampolines, i.e. changes in the kernel
text that gets in the way of properly decoding Intel PT hardware
traces, for instance.
Intel PT:
- Add various knobs to reduce the volume of Intel PT traces by
reducing the level of details such as decoding just some types of
packets (e.g., FUP/TIP, PSB+), also filtering by time range.
- Add new itrace options (log flags to the 'd' option, error flags to
the 'e' one, etc), controlling how Intel PT is transformed into
perf events, document some missing options (e.g., how to synthesize
callchains).
BPF:
- Properly report BPF errors when parsing events.
- Do not setup side-band events if LIBBPF is not linked, fixing a
segfault.
Libraries:
- Improvements to the libtraceevent plugin mechanism.
- Improve libtracevent support for KVM trace events SVM exit reasons.
- Add a libtracevent plugins for decoding syscalls/sys_enter_futex
and for tlb_flush.
- Ensure sample_period is set libpfm4 events in 'perf test'.
- Fixup libperf namespacing, to make sure what is in libperf has the
perf_ namespace while what is now only in tools/perf/ doesn't use
that prefix.
Arch specific:
- Improve the testing of vendor events and metrics in 'perf test'.
- Allow no ARM CoreSight hardware tracer sink to be specified on
command line.
- Fix arm_spe_x recording when mixed with other perf events.
- Add s390 idle functions 'psw_idle' and 'psw_idle_exit' to list of
idle symbols.
- List kernel supplied event aliases for arm64 in 'perf list'.
- Add support for extended register capability in PowerPC 9 and 10.
- Added nest IMC power9 metric events.
Miscellaneous:
- No need to setup sample_regs_intr/sample_regs_user for dummy
events.
- Update various copies of kernel headers, some causing perf to
handle new syscalls, MSRs, etc.
- Improve usage of flex and yacc, enabling warnings and addressing
the fallout.
- Add missing '--output' option to 'perf kmem' so that it can pass it
along to 'perf record'.
- 'perf probe' fixes related to adding multiple probes on the same
address for the same event.
- Make 'perf probe' warn if the target function is a GNU indirect
function.
- Remove //anon mmap events from 'perf inject jit' to fix supporting
both using ELF files for generated functions and the perf-PID.map
approaches"
* tag 'perf-tools-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: (144 commits)
perf record: Skip side-band event setup if HAVE_LIBBPF_SUPPORT is not set
perf tools powerpc: Add support for extended regs in power10
perf tools powerpc: Add support for extended register capability
tools headers UAPI: Sync drm/i915_drm.h with the kernel sources
tools arch x86: Sync asm/cpufeatures.h with the kernel sources
tools arch x86: Sync the msr-index.h copy with the kernel sources
tools headers UAPI: update linux/in.h copy
tools headers API: Update close_range affected files
perf script: Add 'tod' field to display time of day
perf script: Change the 'enum perf_output_field' enumerators to be 64 bits
perf data: Add support to store time of day in CTF data conversion
perf tools: Move clockid_res_ns under clock struct
perf header: Store clock references for -k/--clockid option
perf tools: Add clockid_name function
perf clockid: Move parse_clockid() to new clockid object
tools lib traceevent: Handle possible strdup() error in tep_add_plugin_path() API
libtraceevent: Fixed description of tep_add_plugin_path() API
libtraceevent: Fixed type in PRINT_FMT_STING
libtraceevent: Fixed broken indentation in parse_ip4_print_args()
libtraceevent: Improve error handling of tep_plugin_add_option() API
...
Linus Torvalds [Tue, 11 Aug 2020 02:16:26 +0000 (19:16 -0700)]
Merge tag 'ktest-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-ktest
Pull ktest updates from Steven Rostedt:
- Have config-bisect save the good/bad configs at each step.
- Show log file location even on success
- Add PRE_TEST_DIE to kill test if the PRE_TEST fails
- Add a NOT operator for conditionals in config file
- Add the log output of the last test when emailing on failure.
- Other minor clean ups and small fixes.
* tag 'ktest-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-ktest:
ktest.pl: Fix spelling mistake "Cant" -> "Can't"
ktest.pl: Change the logic to control the size of the log file emailed
ktest.pl: Add MAIL_MAX_SIZE to limit the amount of log emailed
ktest.pl: Add the log of last test in email on failure
ktest.pl: Turn off buffering to the log file
ktest.pl: Just open up the log file once
ktest.pl: Add a NOT operator
ktest.pl: Define PRE_TEST_DIE to kill the test if the PRE_TEST fails
ktest.pl: Always show log file location if defined even on success
ktest.pl: Have config-bisect save each config used in the bisect
Linus Torvalds [Tue, 11 Aug 2020 02:07:44 +0000 (19:07 -0700)]
Merge tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Thomas Gleixner:
"A set of locking fixes and updates:
- Untangle the header spaghetti which causes build failures in
various situations caused by the lockdep additions to seqcount to
validate that the write side critical sections are non-preemptible.
- The seqcount associated lock debug addons which were blocked by the
above fallout.
seqcount writers contrary to seqlock writers must be externally
serialized, which usually happens via locking - except for strict
per CPU seqcounts. As the lock is not part of the seqcount, lockdep
cannot validate that the lock is held.
This new debug mechanism adds the concept of associated locks.
sequence count has now lock type variants and corresponding
initializers which take a pointer to the associated lock used for
writer serialization. If lockdep is enabled the pointer is stored
and write_seqcount_begin() has a lockdep assertion to validate that
the lock is held.
Aside of the type and the initializer no other code changes are
required at the seqcount usage sites. The rest of the seqcount API
is unchanged and determines the type at compile time with the help
of _Generic which is possible now that the minimal GCC version has
been moved up.
Adding this lockdep coverage unearthed a handful of seqcount bugs
which have been addressed already independent of this.
While generally useful this comes with a Trojan Horse twist: On RT
kernels the write side critical section can become preemtible if
the writers are serialized by an associated lock, which leads to
the well known reader preempts writer livelock. RT prevents this by
storing the associated lock pointer independent of lockdep in the
seqcount and changing the reader side to block on the lock when a
reader detects that a writer is in the write side critical section.
- Conversion of seqcount usage sites to associated types and
initializers"
* tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
locking/seqlock, headers: Untangle the spaghetti monster
locking, arch/ia64: Reduce <asm/smp.h> header dependencies by moving XTP bits into the new <asm/xtp.h> header
x86/headers: Remove APIC headers from <asm/smp.h>
seqcount: More consistent seqprop names
seqcount: Compress SEQCNT_LOCKNAME_ZERO()
seqlock: Fold seqcount_LOCKNAME_init() definition
seqlock: Fold seqcount_LOCKNAME_t definition
seqlock: s/__SEQ_LOCKDEP/__SEQ_LOCK/g
hrtimer: Use sequence counter with associated raw spinlock
kvm/eventfd: Use sequence counter with associated spinlock
userfaultfd: Use sequence counter with associated spinlock
NFSv4: Use sequence counter with associated spinlock
iocost: Use sequence counter with associated spinlock
raid5: Use sequence counter with associated spinlock
vfs: Use sequence counter with associated spinlock
timekeeping: Use sequence counter with associated raw spinlock
xfrm: policy: Use sequence counters with associated lock
netfilter: nft_set_rbtree: Use sequence counter with associated rwlock
netfilter: conntrack: Use sequence counter with associated spinlock
sched: tasks: Use sequence counter with associated spinlock
...
Linus Torvalds [Tue, 11 Aug 2020 01:33:22 +0000 (18:33 -0700)]
Merge tag 'f2fs-for-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, we've added two small interfaces: (a) GC_URGENT_LOW
mode for performance and (b) F2FS_IOC_SEC_TRIM_FILE ioctl for
security.
The new GC mode allows Android to run some lower priority GCs in
background, while new ioctl discards user information without race
condition when the account is removed.
In addition, some patches were merged to address latency-related
issues. We've fixed some compression-related bug fixes as well as edge
race conditions.
Enhancements:
- add GC_URGENT_LOW mode in gc_urgent
- introduce F2FS_IOC_SEC_TRIM_FILE ioctl
- bypass racy readahead to improve read latencies
- shrink node_write lock coverage to avoid long latency
Bug fixes:
- fix missing compression flag control, i_size, and mount option
- fix deadlock between quota writes and checkpoint
- remove inode eviction path in synchronous path to avoid deadlock
- fix to wait GCed compressed page writeback
- fix a kernel panic in f2fs_is_compressed_page
- check page dirty status before writeback
- wait page writeback before update in node page write flow
- fix a race condition between f2fs_write_end_io and f2fs_del_fsync_node_entry
We've added some minor sanity checks and refactored trivial code
blocks for better readability and debugging information"
* tag 'f2fs-for-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (52 commits)
f2fs: prepare a waiter before entering io_schedule
f2fs: update_sit_entry: Make the judgment condition of f2fs_bug_on more intuitive
f2fs: replace test_and_set/clear_bit() with set/clear_bit()
f2fs: make file immutable even if releasing zero compression block
f2fs: compress: disable compression mount option if compression is off
f2fs: compress: add sanity check during compressed cluster read
f2fs: use macro instead of f2fs verity version
f2fs: fix deadlock between quota writes and checkpoint
f2fs: correct comment of f2fs_exist_written_data
f2fs: compress: delay temp page allocation
f2fs: compress: fix to update isize when overwriting compressed file
f2fs: space related cleanup
f2fs: fix use-after-free issue
f2fs: Change the type of f2fs_flush_inline_data() to void
f2fs: add F2FS_IOC_SEC_TRIM_FILE ioctl
f2fs: should avoid inode eviction in synchronous path
f2fs: segment.h: delete a duplicated word
f2fs: compress: fix to avoid memory leak on cc->cpages
f2fs: use generic names for generic ioctls
f2fs: don't keep meta inode pages used for compressed block migration
...
Linus Torvalds [Tue, 11 Aug 2020 01:22:43 +0000 (18:22 -0700)]
Merge tag 'gfs2-for-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:
- Make sure transactions won't be started recursively in
gfs2_block_zero_range (bug introduced in 5.4 when switching to
iomap_zero_range)
- Fix a glock holder refcount leak introduced in the iopen glock
locking scheme rework merged in 5.8.
- A few other small improvements (debugging, stack usage, comment
fixes).
* tag 'gfs2-for-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: When gfs2_dirty_inode gets a glock error, dump the glock
gfs2: Never call gfs2_block_zero_range with an open transaction
gfs2: print details on transactions that aren't properly ended
gfs2: Fix inaccurate comment
fs: Fix typo in comment
gfs2: Fix refcount leak in gfs2_glock_poke
gfs2: Pass glock holder to gfs2_file_direct_{read,write}
gfs2: Add some flags missing from glock output
Linus Torvalds [Tue, 11 Aug 2020 01:20:04 +0000 (18:20 -0700)]
Merge tag 'for-linus-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
Pull JFFS2, UBI and UBIFS updates from Richard Weinberger:
"JFFS2:
- Fix for a corner case while mounting
- Fix for an use-after-free issue
UBI:
- Fix for a memory load while attaching
- Don't produce an anchor PEB with fastmap being disabled
UBIFS:
- Fix for orphan inode logic
- Spelling fixes
- New mount option to specify filesystem version"
* tag 'for-linus-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
jffs2: fix UAF problem
jffs2: fix jffs2 mounting failure
ubifs: Fix wrong orphan node deletion in ubifs_jnl_update|rename
ubi: fastmap: Free fastmap next anchor peb during detach
ubi: fastmap: Don't produce the initial next anchor PEB when fastmap is disabled
ubifs: misc.h: delete a duplicated word
ubifs: add option to specify version for new file systems
Linus Torvalds [Mon, 10 Aug 2020 23:35:57 +0000 (16:35 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input updates from Dmitry Torokhov:
- an update to Elan touchpad controller driver supporting newer ICs
with enhanced precision reports and a new firmware update process
- an update to EXC3000 touch controller supporting additional parts
- assorted driver fixups
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (27 commits)
Input: exc3000 - add support to query model and fw_version
Input: exc3000 - add reset gpio support
Input: exc3000 - add EXC80H60 and EXC80H84 support
dt-bindings: touchscreen: Convert EETI EXC3000 touchscreen to json-schema
Input: sentelic - fix error return when fsp_reg_write fails
Input: alps - remove redundant assignment to variable ret
Input: ims-pcu - return error code rather than -ENOMEM
Input: elan_i2c - add ic type 0x15
Input: atmel_mxt_ts - only read messages in mxt_acquire_irq() when necessary
Input: uinput - fix typo in function name documentation
Input: ati_remote2 - add missing newlines when printing module parameters
Input: psmouse - add a newline when printing 'proto' by sysfs
Input: synaptics-rmi4 - drop a duplicated word
Input: elan_i2c - add support for high resolution reports
Input: elan_i2c - do not constantly re-query pattern ID
Input: elan_i2c - add firmware update info for ICs 0x11, 0x13, 0x14
Input: elan_i2c - handle firmware updated on newer ICs
Input: elan_i2c - add support for different firmware page sizes
Input: elan_i2c - fix detecting IAP version on older controllers
Input: elan_i2c - handle devices with patterns above 1
...
Linus Torvalds [Mon, 10 Aug 2020 23:33:54 +0000 (16:33 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid
Pull HID updates from Jiri Kosina:
- fix for some modern devices that return multi-byte battery report,
from Grant Likely
- fix for devices with Resolution Multiplier, from Peter Hutterer
- device probing speed increase, from Dmitry Torokhov
- ThinkPad 10 Ultrabook Keyboard support, from Hans de Goede
- other small assorted fixes and device ID additions
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
HID: quirks: add NOGET quirk for Logitech GROUP
HID: Replace HTTP links with HTTPS ones
HID: udraw-ps3: Replace HTTP links with HTTPS ones
HID: mcp2221: Replace HTTP links with HTTPS ones
HID: input: Fix devices that return multiple bytes in battery report
HID: lenovo: Fix spurious F23 key press report during resume from suspend
HID: lenovo: Add ThinkPad 10 Ultrabook Keyboard fn_lock support
HID: lenovo: Add ThinkPad 10 Ultrabook Keyboard support
HID: lenovo: Rename fn_lock sysfs attr handlers to make them generic
HID: lenovo: Factor out generic parts of the LED code
HID: lenovo: Merge tpkbd and cptkbd data structures
HID: intel-ish-hid: Replace PCI_DEV_FLAGS_NO_D3 with pci_save_state
HID: Wiimote: Treat the d-pad as an analogue stick
HID: input: do not run GET_REPORT unless there's a Resolution Multiplier
HID: usbhid: remove redundant assignment to variable retval
HID: usbhid: do not sleep when opening device
ktest.pl: Change the logic to control the size of the log file emailed
If the log file for a given test is larger than the max size given then use
set the seek from the end of the log file instead of from the start of the
test.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Jiri Kosina [Mon, 10 Aug 2020 09:19:41 +0000 (11:19 +0200)]
Merge branch 'for-5.9/core-v2' into for-linus
- fix for some modern devices that return multi-byte battery report, from
Grant Likely
- fix for devices with Resolution Multiplier, from Peter Hutterer
- device probing speed increase, from Dmitry Torokhov
Linus Torvalds [Sun, 9 Aug 2020 21:10:26 +0000 (14:10 -0700)]
Merge tag 'kbuild-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- run the checker (e.g. sparse) after the compiler
- remove unneeded cc-option tests for old compiler flags
- fix tar-pkg to install dtbs
- introduce ccflags-remove-y and asflags-remove-y syntax
- allow to trace functions in sub-directories of lib/
- introduce hostprogs-always-y and userprogs-always-y syntax
- various Makefile cleanups
* tag 'kbuild-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kbuild: stop filtering out $(GCC_PLUGINS_CFLAGS) from cc-option base
kbuild: include scripts/Makefile.* only when relevant CONFIG is enabled
kbuild: introduce hostprogs-always-y and userprogs-always-y
kbuild: sort hostprogs before passing it to ifneq
kbuild: move host .so build rules to scripts/gcc-plugins/Makefile
kbuild: Replace HTTP links with HTTPS ones
kbuild: trace functions in subdirectories of lib/
kbuild: introduce ccflags-remove-y and asflags-remove-y
kbuild: do not export LDFLAGS_vmlinux
kbuild: always create directories of targets
powerpc/boot: add DTB to 'targets'
kbuild: buildtar: add dtbs support
kbuild: remove cc-option test of -ffreestanding
kbuild: remove cc-option test of -fno-stack-protector
Revert "kbuild: Create directory for target DTB"
kbuild: run the checker after the compiler
Linus Torvalds [Sun, 9 Aug 2020 20:58:04 +0000 (13:58 -0700)]
Merge tag 'nfsd-5.9' of git://git.linux-nfs.org/projects/cel/cel-2.6
Pull NFS server updates from Chuck Lever:
"Highlights:
- Support for user extended attributes on NFS (RFC 8276)
- Further reduce unnecessary NFSv4 delegation recalls
Notable fixes:
- Fix recent krb5p regression
- Address a few resource leaks and a rare NULL dereference
Other:
- De-duplicate RPC/RDMA error handling and other utility functions
- Replace storage and display of kernel memory addresses by tracepoints"
* tag 'nfsd-5.9' of git://git.linux-nfs.org/projects/cel/cel-2.6: (38 commits)
svcrdma: CM event handler clean up
svcrdma: Remove transport reference counting
svcrdma: Fix another Receive buffer leak
SUNRPC: Refresh the show_rqstp_flags() macro
nfsd: netns.h: delete a duplicated word
SUNRPC: Fix ("SUNRPC: Add "@len" parameter to gss_unwrap()")
nfsd: avoid a NULL dereference in __cld_pipe_upcall()
nfsd4: a client's own opens needn't prevent delegations
nfsd: Use seq_putc() in two functions
svcrdma: Display chunk completion ID when posting a rw_ctxt
svcrdma: Record send_ctxt completion ID in trace_svcrdma_post_send()
svcrdma: Introduce Send completion IDs
svcrdma: Record Receive completion ID in svc_rdma_decode_rqst
svcrdma: Introduce Receive completion IDs
svcrdma: Introduce infrastructure to support completion IDs
svcrdma: Add common XDR encoders for RDMA and Read segments
svcrdma: Add common XDR decoders for RDMA and Read segments
SUNRPC: Add helpers for decoding list discriminators symbolically
svcrdma: Remove declarations for functions long removed
svcrdma: Clean up trace_svcrdma_send_failed() tracepoint
...
Linus Torvalds [Sun, 9 Aug 2020 20:33:54 +0000 (13:33 -0700)]
Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull regset conversion fix from Al Viro:
"Fix a regression from an unnoticed bisect hazard in the regset series.
A bunch of old (aout, originally) primitives used by coredumps became
dead code after fdpic conversion to regsets. Removal of that dead code
had been the first commit in the followups to regset series;
unfortunately, it happened to hide the bisect hazard on sh (extern for
fpregs_get() had not been updated in the main series when it should
have been; followup simply made fpregs_get() static). And without that
followup commit this bisect hazard became breakage in the mainline"
Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
kill unused dump_fpu() instances
Linus Torvalds [Sun, 9 Aug 2020 19:52:28 +0000 (12:52 -0700)]
Merge tag 'pinctrl-v5.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control updates from Linus Walleij:
"This is the bulk of the pin control changes for the v5.9 kernel
series:
Core changes:
- The GPIO patch "gpiolib: Introduce for_each_requested_gpio_in_range()
macro" was put in an immutable branch and merged into the pinctrl
tree as well. We see these changes also here.
- Improved debug output for pins used as GPIO.
New drivers:
- Ocelot Sparx5 SoC driver.
- Intel Emmitsburg SoC subdriver.
- Intel Tiger Lake-H SoC subdriver.
- Qualcomm PM660 SoC subdriver.
- Renesas SH-PFC R8A774E1 subdriver.
Driver improvements:
- Linear improvement and cleanups of the Intel drivers for
Cherryview, Lynxpoint, Baytrail etc. Improved locking among other
things.
- Renesas SH-PFC has added support for RPC pins, groups, and
functions to r8a77970 and r8a77980.
- The newere Freescale (now NXP) i.MX8 pin controllers have been
modularized. This is driven by the Google Android GKI initiative I
think.
- Open drain support for pins on the Qualcomm IPQ4019.
- The Ingenic driver can handle both edges IRQ detection.
- A big slew of documentation fixes all over the place.
- A few irqchip template conversions by yours truly.
* tag 'pinctrl-v5.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (107 commits)
dt-bindings: pinctrl: add bindings for MediaTek MT6779 SoC
pinctrl: stmfx: Use irqchip template
pinctrl: amd: Use irqchip template
pinctrl: mediatek: fix build for tristate changes
pinctrl: samsung: Use bank name as irqchip name
pinctrl: core: print gpio in pins debugfs file
pinctrl: mediatek: add mt6779 eint support
pinctrl: mediatek: add pinctrl support for MT6779 SoC
pinctrl: mediatek: avoid virtual gpio trying to set reg
pinctrl: mediatek: update pinmux definitions for mt6779
pinctrl: stm32: use the hwspin_lock_timeout_in_atomic() API
pinctrl: mcp23s08: Use irqchip template
pinctrl: sx150x: Use irqchip template
dt-bindings: ingenic,pinctrl: Support pinmux/pinconf nodes
pinctrl: intel: Add Intel Emmitsburg pin controller support
pinctl: ti: iodelay: Replace HTTP links with HTTPS ones
Revert "gpio: omap: handle pin config bias flags"
pinctrl: single: Use fallthrough pseudo-keyword
pinctrl: qcom: spmi-gpio: Use fallthrough pseudo-keyword
pinctrl: baytrail: Use fallthrough pseudo-keyword
...
Linus Torvalds [Sun, 9 Aug 2020 19:38:51 +0000 (12:38 -0700)]
Merge tag 'mtd/for-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux
Pull mtd updates from Miquel Raynal:
"MTD core changes:
- Spelling
- http to https updates
NAND core changes:
- Drop useless 'depends on' in Kconfig
- Add an extra level in the Kconfig hierarchy
- Trivial spellings
- Dynamic allocation of the interface configurations
- Dropping the default ONFI timing mode
- Various cleanup (types, structures, naming, comments)
- Hide the chip->data_interface indirection
- Add the generic rb-gpios property
- Add the ->choose_interface_config() hook
- Introduce nand_choose_best_sdr_timings()
- Use default values for tPROG_max and tBERS_max
- Avoid redefining tR_max and tCCS_min
- Add a helper to find the closest ONFI mode
- bcm63xx MTD parsers: simplify CFE detection
Raw NAND controller drivers changes:
- fsl-upm: Deprecation of specific DT properties
- fsl_upm: Driver rework and cleanup in favor of ->exec_op()
- Ingenic: Cleanup ARRAY_SIZE() vs sizeof() use
- brcmnand: ECC error handling on EDU transfers
- brcmnand: Don't default to EDU transfers
- qcom: Set BAM mode only if not set already
- qcom: Avoid write to unavailable register
- gpio: Driver rework in favor of ->exec_op()
- tango: ->exec_op() conversion
- mtk: ->exec_op() conversion
Raw NAND chip drivers changes:
- toshiba: Implement ->choose_interface_config() for TH58NVG2S3HBAI4,
TC58NVG0S3E, and TC58TEG5DCLTA00
- hynix: Implement ->choose_interface_config() for H27UCG8T2ATR-BC
SPI NOR core changes:
- Disable Quad Mode in spi_nor_restore().
- Don't abort BFPT parsing when QER reserved value is used.
- Add support/update capabilities for few flashes.
- Drop s70fl01gs flash: it does not support RDSR(05h) which is
critical for erase/write.
- Merge the SPIMEM DTR bits in spi-nor/next to avoid conflicts during
the release cycle.
SPI NOR controller drivers changes:
- Move the cadence-quadspi driver to spi-mem. The series was taken
through the SPI tree. Merge it also in spi-nor/next to avoid
conflicts during the release cycle.
- intel-spi:
- Add new PCI IDs.
- Ignore the Write Disable command, the controller doesn't support
it.
- Fix performance regression"
* tag 'mtd/for-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux: (79 commits)
MTD: pfow.h: drop a duplicated word
MTD: mtd-abi.h: drop a duplicated word
mtd: rawnand: omap_elm: Replace HTTP links with HTTPS ones
mtd: Replace HTTP links with HTTPS ones
mtd: hyperbus: Replace HTTP links with HTTPS ones
mtd: revert "spi-nor: intel: provide a range for poll_timout"
mtd: spi-nor: update read capabilities for w25q64 and s25fl064k
mtd: spi-nor: micron: Add SPI_NOR_DUAL_READ flag on mt25qu02g
mtd: spi-nor: macronix: Add support for mx66u2g45g
mtd: spi-nor: intel-spi: Simulate WRDI command
mtd: spi-nor: Disable the flash quad mode in spi_nor_restore()
mtd: spi-nor: Add capability to disable flash quad mode
mtd: spi-nor: spansion: Remove s70fl01gs from flash_info
mtd: spi-nor: sfdp: do not make invalid quad enable fatal
dt-bindings: mtd: fsl-upm-nand: Deprecate chip-delay and fsl, upm-wait-flags
mtd: rawnand: stm32_fmc2: get resources from parent node
mtd: rawnand: stm32_fmc2: use regmap APIs
memory: stm32-fmc2-ebi: add STM32 FMC2 EBI controller driver
dt-bindings: memory-controller: add STM32 FMC2 EBI controller documentation
dt-bindings: mtd: update STM32 FMC2 NAND controller documentation
...
Masahiro Yamada [Sat, 1 Aug 2020 12:27:18 +0000 (21:27 +0900)]
kbuild: introduce hostprogs-always-y and userprogs-always-y
To build host programs, you need to add the program names to 'hostprogs'
to use the necessary build rule, but it is not enough to build them
because there is no dependency.
There are two types of host programs: built as the prerequisite of
another (e.g. gen_crc32table in lib/Makefile), or always built when
Kbuild visits the Makefile (e.g. genksyms in scripts/genksyms/Makefile).
The latter is typical in Makefiles under scripts/, which contains host
programs globally used during the kernel build. To build them, you need
to add them to both 'hostprogs' and 'always-y'.
This commit adds hostprogs-always-y as a shorthand.
The same applies to user programs. net/bpfilter/Makefile builds
bpfilter_umh on demand, hence always-y is unneeded. In contrast,
programs under samples/ are added to both 'userprogs' and 'always-y'
so they are always built when Kbuild visits the Makefiles.
userprogs-always-y works as a shorthand.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Acked-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
... is evaluated to true if $(hostprogs) does not contain any word but
whitespace characters.
ifneq ($(strip $(hostprogs)),)
... is a safe way to avoid interpreting whitespace as a non-empty value,
but I'd rather want to use the side-effect of $(sort ...) to do the
equivalent.
$(sort ...) is used in scripts/Makefile.host in order to drop duplication
in $(hostprogs). It is also useful to strip excessive spaces.
kbuild: move host .so build rules to scripts/gcc-plugins/Makefile
The host shared library rules are currently implemented in
scripts/Makefile.host, but actually GCC-plugin is the only user of
them. (The VDSO .so files are built for the target by different
build rules) Hence, they do not need to be treewide available.
Move all the relevant build rules to scripts/gcc-plugins/Makefile.
I also optimized the build steps so *.so is directly built from .c
because every upstream plugin is compiled from a single source file.
I am still keeping the multi-file plugin support, which Kees Cook
mentioned might be needed by out-of-tree plugins.
(https://lkml.org/lkml/2019/1/11/1107)
If the plugin, foo.so, is compiled from two files foo.c and foo2.c,
then you can do like follows:
foo-objs := foo.o foo2.o
Single-file plugins do not need the *-objs notation.
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.
Deterministic algorithm:
For each file:
If not .svg:
For each line:
If doesn't contain `\bxmlns\b`:
For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
If neither `\bgnu\.org/license`, nor `\bmozilla\.org/MPL\b`:
If both the HTTP and HTTPS versions
return 200 OK and serve the same content:
Replace HTTP with HTTPS.
Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
exists here in sub-directories of lib/ to keep the behavior of
commit 2464a609ded0 ("ftrace: do not trace library functions").
Since that commit, not only the objects in lib/ but also the ones in
the sub-directories are excluded from ftrace (although the commit
description did not explicitly mention this).
However, most of library functions in sub-directories are not so hot.
Re-add them to ftrace.
Going forward, only the objects right under lib/ will be excluded.
kbuild: introduce ccflags-remove-y and asflags-remove-y
CFLAGS_REMOVE_<file>.o filters out flags when compiling a particular
object, but there is no convenient way to do that for every object in
a directory.
Add ccflags-remove-y and asflags-remove-y to make it easily.
Use ccflags-remove-y to clean up some Makefiles.
The add/remove order works as follows:
[1] KBUILD_CFLAGS specifies compiler flags used globally
[2] ccflags-y adds compiler flags for all objects in the
current Makefile
[3] ccflags-remove-y removes compiler flags for all objects in the
current Makefile (New feature)
[4] CFLAGS_<file> adds compiler flags per file.
[5] CFLAGS_REMOVE_<file> removes compiler flags per file.
Having [3] before [4] allows us to remove flags from most (but not all)
objects in the current Makefile.
For example, kernel/trace/Makefile removes $(CC_FLAGS_FTRACE)
from all objects in the directory, then adds it back to
trace_selftest_dynamic.o and CFLAGS_trace_kprobe_selftest.o
The same applies to lib/livepatch/Makefile.
Please note ccflags-remove-y has no effect to the sub-directories.
In contrast, the previous notation got rid of compiler flags also from
all the sub-directories.
The following are not affected because they have no sub-directories:
To keep the behavior, I added ccflags-remove-y to all Makefiles
in subdirectories of lib/, except the following:
lib/vdso/Makefile - Kbuild does not descend into this Makefile
lib/raid/test/Makefile - This is not used for the kernel build
I think commit 2464a609ded0 ("ftrace: do not trace library functions")
excluded too much. In the next commit, I will remove ccflags-remove-y
from the sub-directories of lib/.
Suggested-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Acked-by: Brendan Higgins <brendanhiggins@google.com> (KUnit) Tested-by: Anders Roxell <anders.roxell@linaro.org>
Even if you rerun the same command, the error message will not be
shown despite vmlinux is already gone.
To reproduce it, the parallel option -j is needed. Single thread
cleaning always executes 'archclean', 'vmlinuxclean' in this order,
so vmlinux still exists when arch/arm/boot/compressed/ is cleaned.
Looking at arch/arm/boot/compressed/Makefile does not help understand
the reason of the error message. Both KBSS_SZ and LDFLAGS_vmlinux are
assigned with '=' operator, hence, they are not expanded unless used.
Obviously, 'make clean' does not use them.
In fact, the root cause exists in the top Makefile:
export LDFLAGS_vmlinux
Since LDFLAGS_vmlinux is an exported variable, LDFLAGS_vmlinux in
arch/arm/boot/compressed/Makefile is expanded when scripts/Makefile.clean
has a command to execute. This is why the error message shows up only
when there exist build artifacts in arch/arm/boot/compressed/.
Adding 'unexport LDFLAGS_vmlinux' to arch/arm/boot/compressed/Makefile
will fix it as far as ARCH=arm is concerned, but I think the proper fix
is to get rid of 'export LDFLAGS_vmlinux' from the top Makefile.
LDFLAGS_vmlinux in the top Makefile contains linker flags for the top
vmlinux. LDFLAGS_vmlinux in arch/arm/boot/compressed/Makefile is for
arch/arm/boot/compressed/vmlinux. They just happen to have the same
variable name, but are used for different purposes. Stop shadowing
LDFLAGS_vmlinux.
This commit passes LDFLAGS_vmlinux to scripts/link-vmlinux.sh via a
command line parameter instead of via an environment variable. LD and
KBUILD_LDFLAGS are exported, but I did the same for consistency. Anyway,
they must be included in cmd_link-vmlinux to allow if_changed to detect
the changes in LD or KBUILD_LDFLAGS.
They use ':=' or '=' to clear the LDFLAGS_vmlinux inherited from the
top Makefile.
We need to take a closer look at the impact to unicore32 and xtensa.
arch/unicore32/boot/compressed/Makefile only uses '+=' operator for
LDFLAGS_vmlinux. So, the decompressor previously inherited the linker
flags from the top Makefile.
However, commit 70fac51feaf2 ("unicore32 additional architecture files:
boot process") was merged before commit 1f2bfbd00e46 ("kbuild: link of
vmlinux moved to a script"). So, I rather consider this is a bug fix of 1f2bfbd00e46.
arch/xtensa/boot/boot-elf/Makefile is also affected, but this is also
considered a fix for the same reason. It did not inherit LDFLAGS_vmlinux
when commit 4bedea945451 ("[PATCH] xtensa: Architecture support for
Tensilica Xtensa Part 2") was merged. I deleted $(LDFLAGS_vmlinux),
which is now empty.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Linus Torvalds [Sat, 8 Aug 2020 21:16:12 +0000 (14:16 -0700)]
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
- Fix tegra194-cpufreq module build failure caused by __cpu_logical_map
not being exported.
- Improve fixed_addresses comment regarding the fixmap buffer sizes.
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: Fix __cpu_logical_map undefined issue
arm64/fixmap: make notes of fixed_addresses more precisely
The driver using cpu_logical_map() macro which will expand to
__cpu_logical_map, we can't access it in a drvier. Let's turn
cpu_logical_map() into a C wrapper and export it to fix the
build issue.
Also create a function set_cpu_logical_map(cpu, hwid) when assign
a value to cpu_logical_map(cpu).
Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Linus Torvalds [Sat, 8 Aug 2020 16:32:18 +0000 (09:32 -0700)]
Merge tag 'for-linus-5.9-1' of git://github.com/cminyard/linux-ipmi
Pull IPMI updates from Corey Minyard:
"Minor cleanups to the IPMI driver for 5.9
Nothing of any major consequence. Duplicate code, some missing \n's in
sysfs files, some documentation and comment changes"
* tag 'for-linus-5.9-1' of git://github.com/cminyard/linux-ipmi:
ipmi/watchdog: add missing newlines when printing parameters by sysfs
ipmi: remve duplicate code in __ipmi_bmc_register()
ipmi: ssif: Remove finished TODO comment about SMBus alert
Doc: driver-api: ipmi: Add description of alerts_broken module param
Linus Torvalds [Sat, 8 Aug 2020 04:27:37 +0000 (21:27 -0700)]
Merge tag 'for-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply
Pull power supply and reset updates from Sebastian Reichel:
"Power-supply core:
- add COOL/WARM/HOT state from JEITA JISC8712:2015 specification
- convert simple-battery DT binding to YAML
- add long-life charging mode
Battery/charger drivers:
- bq25150: new charger driver
- bq27xxx: add support for BQ27z561 and BQ28z610
- max17040: support CAPACITY_ALERT_MIN
- sbs-battery: add PEC support
- wilco-ec: support long-life charging mode
- bq25890: fix DT binding
- misc. fixes and cleanups
Reset drivers:
- linkstation: new reset driver"
* tag 'for-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply: (32 commits)
power: supply: wilco_ec: Add long life charging mode
power: supply: bq27xxx_battery: Add the BQ28z610 Battery monitor
dt-bindings: power: Add BQ28z610 compatible
power: supply: bq27xxx_battery: Add the BQ27Z561 Battery monitor
dt-bindings: power: Add BQ27Z561 compatible
power: supply: test_power: Fix battery_current initial value
power: supply: Fix kerneldoc of power_supply_temp2resist_simple()
power: supply: cpcap-battery: Fix kerneldoc of cpcap_battery_read_accumulated()
dt-bindings: power: Convert battery.txt to battery.yaml
power: supply: rt5033_battery: Fix error code in rt5033_battery_probe()
power: supply: max17040: Add POWER_SUPPLY_PROP_CAPACITY_ALERT_MIN
power: supply: check if calc_soc succeeded in pm860x_init_battery
power: supply: bq2xxxx: Replace HTTP links with HTTPS ones
power: reset: add driver for LinkStation power off
power: supply: sc27xx: prevent adc * 1000 from overflow
math64: New DIV_S64_ROUND_CLOSEST helper
power: fix duplicated words in bq2415x_charger.h
power: Convert to DEFINE_SHOW_ATTRIBUTE
power: reset: keystone-reset: Replace HTTP links with HTTPS ones
power: supply: bq25150 introduce the bq25150
...
Linus Torvalds [Sat, 8 Aug 2020 04:14:30 +0000 (21:14 -0700)]
Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
"No common topic whatsoever in those, sorry"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: define inode flags using bit numbers
iov_iter: Move unnecessary inclusion of crypto/hash.h
dlmfs: clean up dlmfs_file_{read,write}() a bit
Linus Torvalds [Sat, 8 Aug 2020 01:48:15 +0000 (18:48 -0700)]
Merge tag 'pci-v5.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI updates from Bjorn Helgaas:
"Enumeration:
- Fix pci_cfg_wait queue locking problem (Bjorn Helgaas)
- Convert PCIe capability PCIBIOS errors to errno (Bolarinwa Olayemi
Saheed)
- Align PCIe capability and PCI accessor return values (Bolarinwa
Olayemi Saheed)
- Fix pci_create_slot() reference count leak (Qiushi Wu)
- Announce device after early fixups (Tiezhu Yang)
PCI device hotplug:
- Make rpadlpar functions static (Wei Yongjun)
Driver binding:
- Add device even if driver attach failed (Rajat Jain)
Virtualization:
- xen: Remove redundant initialization of irq (Colin Ian King)
IOMMU:
- Add pci_pri_supported() to check device or associated PF (Ashok Raj)
- Release IVRS table in AMD ACS quirk (Hanjun Guo)
- Mark AMD Navi10 GPU rev 0x00 ATS as broken (Kai-Heng Feng)
- Treat "external-facing" devices themselves as internal (Rajat Jain)
MSI:
- Forward MSI-X error code in pci_alloc_irq_vectors_affinity() (Piotr
Stankiewicz)
Error handling:
- Clear PCIe Device Status errors only if OS owns AER (Jonathan
Cameron)
- Log correctable errors as warning, not error (Matt Jolly)
- Use 'pci_channel_state_t' instead of 'enum pci_channel_state' (Luc
Van Oostenryck)
Peer-to-peer DMA:
- Allow P2PDMA on AMD Zen and newer CPUs (Logan Gunthorpe)
ASPM:
- Add missing newline in sysfs 'policy' (Xiongfeng Wang)
Native PCIe controllers:
- Convert to devm_platform_ioremap_resource_byname() (Dejin Zheng)
- Convert to devm_platform_ioremap_resource() (Dejin Zheng)
- Remove duplicate error message from devm_pci_remap_cfg_resource()
callers (Dejin Zheng)
- Fix runtime PM imbalance on error (Dinghao Liu)
- Remove dev_err() when handing an error from platform_get_irq()
(Krzysztof Wilczyński)
- Use pci_host_bridge.windows list directly instead of splicing in a
temporary list for cadence, mvebu, host-common (Rob Herring)
- Use pci_host_probe() instead of open-coding all the pieces for
altera, brcmstb, iproc, mobiveil, rcar, rockchip, tegra, v3,
versatile, xgene, xilinx, xilinx-nwl (Rob Herring)
- Default host bridge parent device to the platform device (Rob
Herring)
- Use pci_is_root_bus() instead of tracking root bus number
separately in aardvark, designware (imx6, keystone,
designware-host), mobiveil, xilinx-nwl, xilinx, rockchip, rcar (Rob
Herring)
- Set host bridge bus number in pci_scan_root_bus_bridge() instead of
each driver for aardvark, designware-host, host-common, mediatek,
rcar, tegra, v3-semi (Rob Herring)
- Move DT resource setup into devm_pci_alloc_host_bridge() (Rob
Herring)
- Set bridge map_irq and swizzle_irq to default functions; drivers
that don't support legacy IRQs (iproc) need to undo this (Rob
Herring)
ARM Versatile PCIe controller driver:
- Drop flag PCI_ENABLE_PROC_DOMAINS (Rob Herring)
Cadence PCIe controller driver:
- Use "dma-ranges" instead of "cdns,no-bar-match-nbits" property
(Kishon Vijay Abraham I)
- Remove "mem" from reg binding (Kishon Vijay Abraham I)
- Fix cdns_pcie_{host|ep}_setup() error path (Kishon Vijay Abraham I)
- Convert all r/w accessors to perform only 32-bit accesses (Kishon
Vijay Abraham I)
- Add support to start link and verify link status (Kishon Vijay
Abraham I)
- Allow pci_host_bridge to have custom pci_ops (Kishon Vijay Abraham I)
- Add new *ops* for CPU addr fixup (Kishon Vijay Abraham I)
- Fix updating Vendor ID and Subsystem Vendor ID register (Kishon
Vijay Abraham I)
- Use bridge resources for outbound window setup (Rob Herring)
- Remove private bus number and range storage (Rob Herring)
Cadence PCIe endpoint driver:
- Add MSI-X support (Alan Douglas)
Qualcomm PCIe controller driver:
- Change duplicate PCI reset to phy reset (Abhishek Sahu)
- Add missing ipq806x clocks in PCIe driver (Ansuel Smith)
- Add missing reset for ipq806x (Ansuel Smith)
- Add ext reset (Ansuel Smith)
- Use bulk clk API and assert on error (Ansuel Smith)
- Add support for tx term offset for rev 2.1.0 (Ansuel Smith)
- Define some PARF params needed for ipq8064 SoC (Ansuel Smith)
- Add ipq8064 rev2 variant (Ansuel Smith)
- Support PCI speed set for ipq806x (Sham Muthayyan)
Renesas R-Car PCIe controller driver:
- Use devm_pci_alloc_host_bridge() (Rob Herring)
- Use struct pci_host_bridge.windows list directly (Rob Herring)
- Convert rcar-gen2 to use modern host bridge probe functions (Rob
Herring)
TI J721E PCIe driver:
- Add TI J721E PCIe host and endpoint driver (Kishon Vijay Abraham I)
Xilinx Versal CPM PCIe controller driver:
- Add Versal CPM Root Port driver and YAML schema (Bharat Kumar
Gogada)
MicroSemi Switchtec management driver:
- Add missing __iomem and __user tags to fix sparse warnings (Logan
Gunthorpe)
Miscellaneous:
- Replace http:// links with https:// (Alexander A. Klimov)
- Replace lkml.org, spinics, gmane with lore.kernel.org (Bjorn
Helgaas)
- Remove unused pci_lost_interrupt() (Heiner Kallweit)
- Move PCI_VENDOR_ID_REDHAT definition to pci_ids.h (Huacai Chen)
- Fix kerneldoc warnings (Krzysztof Kozlowski)"
* tag 'pci-v5.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (113 commits)
PCI: Fix kerneldoc warnings
PCI: xilinx-cpm: Add Versal CPM Root Port driver
PCI: xilinx-cpm: Add YAML schemas for Versal CPM Root Port
PCI: Set bridge map_irq and swizzle_irq to default functions
PCI: Move DT resource setup into devm_pci_alloc_host_bridge()
PCI: rcar-gen2: Convert to use modern host bridge probe functions
PCI: Remove dev_err() when handing an error from platform_get_irq()
MAINTAINERS: Add Kishon Vijay Abraham I for TI J721E SoC PCIe
misc: pci_endpoint_test: Add J721E in pci_device_id table
PCI: j721e: Add TI J721E PCIe driver
PCI: switchtec: Add missing __iomem tag to fix sparse warnings
PCI: switchtec: Add missing __iomem and __user tags to fix sparse warnings
PCI: rpadlpar: Make functions static
PCI/P2PDMA: Allow P2PDMA on AMD Zen and newer CPUs
PCI: Release IVRS table in AMD ACS quirk
PCI: Announce device after early fixups
PCI: Mark AMD Navi10 GPU rev 0x00 ATS as broken
PCI: Remove unused pci_lost_interrupt()
dt-bindings: PCI: Add EP mode dt-bindings for TI's J721E SoC
dt-bindings: PCI: Add host mode dt-bindings for TI's J721E SoC
...
Linus Torvalds [Sat, 8 Aug 2020 01:29:15 +0000 (18:29 -0700)]
Merge tag 'trace-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- The biggest news in that the tracing ring buffer can now time events
that interrupted other ring buffer events.
Before this change, if an interrupt came in while recording another
event, and that interrupt also had an event, those events would all
have the same time stamp as the event it interrupted.
Now, with the new design, those events will have a unique time stamp
and rightfully display the time for those events that were recorded
while interrupting another event.
- Bootconfig how has an "override" operator that lets the users have a
default config, but then add options to override the default.
- A fix was made to properly filter function graph tracing to the
ftrace PIDs. This came in at the end of the -rc cycle, and needs to
be backported.
- Several clean ups, performance updates, and minor fixes as well.
* tag 'trace-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (39 commits)
tracing: Add trace_array_init_printk() to initialize instance trace_printk() buffers
kprobes: Fix compiler warning for !CONFIG_KPROBES_ON_FTRACE
tracing: Use trace_sched_process_free() instead of exit() for pid tracing
bootconfig: Fix to find the initargs correctly
Documentation: bootconfig: Add bootconfig override operator
tools/bootconfig: Add testcases for value override operator
lib/bootconfig: Add override operator support
kprobes: Remove show_registers() function prototype
tracing/uprobe: Remove dead code in trace_uprobe_register()
kprobes: Fix NULL pointer dereference at kprobe_ftrace_handler
ftrace: Fix ftrace_trace_task return value
tracepoint: Use __used attribute definitions from compiler_attributes.h
tracepoint: Mark __tracepoint_string's __used
trace : Have tracing buffer info use kvzalloc instead of kzalloc
tracing: Remove outdated comment in stack handling
ftrace: Do not let direct or IPMODIFY ftrace_ops be added to module and set trampolines
ftrace: Setup correct FTRACE_FL_REGS flags for module
tracing/hwlat: Honor the tracing_cpumask
tracing/hwlat: Drop the duplicate assignment in start_kthread()
tracing: Save one trace_event->type by using __TRACE_LAST_TYPE
...
The merge resolution in commit 25d8d4eecace left ret no longer used,
leading to:
arch/powerpc/kernel/ptrace/ptrace-view.c: In function ‘pkey_get’:
arch/powerpc/kernel/ptrace/ptrace-view.c:473:6: error: unused variable ‘ret’
473 | int ret;
Fix it by removing ret.
Fixes: 25d8d4eecace ("Merge tag 'powerpc-5.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
tracing: Add trace_array_init_printk() to initialize instance trace_printk() buffers
As trace_array_printk() used with not global instances will not add noise to
the main buffer, they are OK to have in the kernel (unlike trace_printk()).
This require the subsystem to create their own tracing instance, and the
trace_array_printk() only writes into those instances.
Add trace_array_init_printk() to initialize the trace_printk() buffers
without printing out the WARNING message.
Reported-by: Sean Paul <sean@poorly.run> Reviewed-by: Sean Paul <sean@poorly.run> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Linus Torvalds [Fri, 7 Aug 2020 20:35:51 +0000 (13:35 -0700)]
Merge tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull clk updates from Stephen Boyd:
"It looks like a smaller batch of clk updates this time around.
In the core framework we just have some minor tweaks and a debugfs
feature, so not much to see there. The driver updates are fairly well
split between AT91 and Qualcomm clk support. Adding those two drivers
together equals about 50% of the diffstat.
Otherwise, the big amount of work this time was on supporting
Broadcom's Raspberry Pi firmware clks.
Highlights:
Core:
- Document clk_hw_round_rate() so it gets some more use
- Remove unused __clk_get_flags()
- Add a prepare/enable debugfs feature similar to rate setting
New Drivers:
- Add support for SAMA7G5 SoC clks
- Enable CPU clks on Qualcomm IPQ6018 SoCs
- Enable CPU clks on Qualcomm MSM8996 SoCs
- GPU clk support for Qualcomm SM8150 and SM8250 SoCs
- Audio clks on Qualcomm SC7180 SoCs
- Microchip Sparx5 DPLL clk
- Add support for the new Renesas RZ/G2H (R8A774E1) SoC
Updates:
- Make defines for bcm63xx-gate clks to use in DT
- Support BCM2711 SoC firmware clks
- Add HDMI clks for BCM2711 SoCs
- Add RTC related clks on Ingenic SoCs
- Support USB PHY clks on Ingenic SoCs
- Support gate clks on BCM6318 SoCs
- RMU and DMAC/GPIO clock support for Actions Semi S500 SoCs
- Use poll_timeout functions in Rockchip clk driver
- Support Rockchip rk3288w SoC variant
- Mark mac_lbtest critical on Rockchip rk3188
- Add CAAM clock support for i.MX vf610 driver
- Add MU root clock support for i.MX imx8mp driver
- Amlogic g12: add neural network accelerator clock sources
- Amlogic meson8: remove critical flag for main PLL divider
- Amlogic meson8: add video decoder clock gates
- Convert one more Renesas DT binding to json-schema
- Enhance critical clock handling on Renesas platforms to only
consider clocks that were enabled at boot time"
* tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: (79 commits)
clk: qcom: gcc: Make disp gpll0 branch aon for sc7180/sdm845
ipq806x: gcc: add support for child probe
clk: qcom: msm8996: Make symbol 'cpu_msm8996_clks' static
clk: qcom: ipq8074: Add correct index for PCIe clocks
clk: <linux/clk-provider.h>: drop a duplicated word
clk: renesas: cpg-mssr: Add r8a774e1 support
dt-bindings: clock: renesas,cpg-mssr: Document r8a774e1
clk: Drop duplicate selection in Kconfig
clk: qcom: smd: Add support for MSM8992/4 rpm clocks
clk: qcom: ipq8074: Add missing clocks for pcie
dt-bindings: clock: qcom: ipq8074: Add missing bindings for PCIe
Replace HTTP links with HTTPS ones: Common CLK framework
clk: qcom: Add CPU clock driver for msm8996
dt-bindings: clk: qcom: Add bindings for CPU clock for msm8996
soc: qcom: Separate kryo l2 accessors from PMU driver
clk: meson: meson8b: add the vclk2_en gate clock
clk: meson: meson8b: add the vclk_en gate clock
clk: qcom: Fix return value check in apss_ipq6018_probe()
clk: bcm: dvp: Add missing module informations
clk: meson: meson8b: Drop CLK_IS_CRITICAL from fclk_div2
...
Linus Torvalds [Fri, 7 Aug 2020 20:29:39 +0000 (13:29 -0700)]
Merge branch 'work.fdpic' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull fdpick coredump update from Al Viro:
"Switches fdpic coredumps away from original aout dumping primitives to
the same kind of regset use as regular elf coredumps do"
* 'work.fdpic' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
[elf-fdpic] switch coredump to regsets
[elf-fdpic] use elf_dump_thread_status() for the dumper thread as well
[elf-fdpic] move allocation of elf_thread_status into elf_dump_thread_status()
[elf-fdpic] coredump: don't bother with cyclic list for per-thread objects
kill elf_fpxregs_t
take fdpic-related parts of elf_prstatus out
unexport linux/elfcore.h
About a month after my kallsyms_show_value() refactoring landed, 0day
noticed that there was a path through the kernfs binattr read handlers
that did not have PAGE_SIZEd buffers, and the module "sections" read
handler made a bad assumption about this, resulting in it stomping on
memory when reached through small-sized splice() calls.
I've added a set of tests to find these kinds of regressions more
quickly in the future as well"
Sefltests-acked-by: Shuah Khan <skhan@linuxfoundation.org>
* tag 'kallsyms_show_value-fix-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
selftests: splice: Check behavior of full and short splices
module: Correctly truncate sysfs sections output
Linus Torvalds [Fri, 7 Aug 2020 20:13:09 +0000 (13:13 -0700)]
Merge tag 'pm-5.9-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management updates from Rafael Wysocki:
"These are mostly ARM cpufreq driver updates plus a cpufreq core
cleanup, an ARM-wide change to make schedutil the default scaling
governor, an intel_pstate driver fix and some runtime PM changes
regarding kerneldoc comments.
Specifics:
- Add adaptive voltage scaling (AVS) support to the brcmstb cpufreq
driver and clean it up (Florian Fainelli, Markus Mayer).
- Add a new Tegra cpufreq driver and clean up the existing one (Jon
Hunter, Sumit Gupta).
- Add bandwidth level support to the Qcom cpufreq driver along with
OPP changes (Sibi Sankar).
- Clean up the sti, cpufreq-dt, ap806, CPPC cpufreq drivers (Viresh
Kumar, Lee Jones, Ivan Kokshaysky, Sven Auhagen, Xin Hao).
- Make schedutil the default governor for ARM (Valentin Schneider).
- Fix dependency issues for the imx cpufreq driver (Walter Lozano).
- Clean up cached_resolved_idx handlihng in the cpufreq core (Viresh
Kumar).
- Fix the intel_pstate driver to use the correct maximum frequency
value when MSR_TURBO_RATIO_LIMIT is 0 (Srinivas Pandruvada).
- Provide kenrneldoc comments for multiple runtime PM helpers and
improve the pm_runtime_get_if_active() kerneldoc (Rafael Wysocki)"
* tag 'pm-5.9-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (22 commits)
cpufreq: intel_pstate: Fix cpuinfo_max_freq when MSR_TURBO_RATIO_LIMIT is 0
PM: runtime: Improve kerneldoc of pm_runtime_get_if_active()
PM: runtime: Add kerneldoc comments to multiple helpers
cpufreq: make schedutil the default for arm and arm64
cpufreq: cached_resolved_idx can not be negative
cpufreq: Add Tegra194 cpufreq driver
dt-bindings: arm: Add NVIDIA Tegra194 CPU Complex binding
cpufreq: imx: Select NVMEM_IMX_OCOTP
cpufreq: sti-cpufreq: Fix some formatting and misspelling issues
cpufreq: tegra186: Simplify probe return path
cpufreq: CPPC: Reuse caps variable in few routines
cpufreq: ap806: fix cpufreq driver needs ap cpu clk
cpufreq: cppc: Reorder code and remove apply_hisi_workaround variable
cpufreq: dt: fix oops on armada37xx
cpufreq: brcmstb-avs-cpufreq: send S2_ENTER / S2_EXIT commands to AVS
cpufreq: brcmstb-avs-cpufreq: Support polling AVS firmware
cpufreq: brcmstb-avs-cpufreq: more flexible interface for __issue_avs_command()
cpufreq: qcom: Disable fast switch when scaling DDR/L3
cpufreq: qcom: Update the bandwidth levels on frequency change
OPP: Add and export helper to set bandwidth
...
Linus Torvalds [Fri, 7 Aug 2020 20:08:09 +0000 (13:08 -0700)]
Merge tag 'for-5.9/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- DM multipath locking fixes around m->flags tests and improvements to
bio-based code so that it follows patterns established by
request-based code.
- Request-based DM core improvement to eliminate unnecessary call to
blk_mq_queue_stopped().
- Add "panic_on_corruption" error handling mode to DM verity target.
- DM bufio fix to to perform buffer cleanup from a workqueue rather
than wait for IO in reclaim context from shrinker.
- DM crypt improvement to optionally avoid async processing via
workqueues for reads and/or writes -- via "no_read_workqueue" and
"no_write_workqueue" features. This more direct IO processing
improves latency and throughput with faster storage. Avoiding
workqueue IO submission for writes (DM_CRYPT_NO_WRITE_WORKQUEUE) is a
requirement for adding zoned block device support to DM crypt.
- Add zoned block device support to DM crypt. Makes use of
DM_CRYPT_NO_WRITE_WORKQUEUE and a new optional feature
(DM_CRYPT_WRITE_INLINE) that allows write completion to wait for
encryption to complete. This allows write ordering to be preserved,
which is needed for zoned block devices.
- Fix DM ebs target's check for REQ_OP_FLUSH.
- Fix DM core's report zones support to not report more zones than were
requested.
- A few small compiler warning fixes.
- DM dust improvements to return output directly to the user rather
than require they scrape the system log for output.
* tag 'for-5.9/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm: don't call report zones for more than the user requested
dm ebs: Fix incorrect checking for REQ_OP_FLUSH
dm init: Set file local variable static
dm ioctl: Fix compilation warning
dm raid: Remove empty if statement
dm verity: Fix compilation warning
dm crypt: Enable zoned block device support
dm crypt: add flags to optionally bypass kcryptd workqueues
dm bufio: do buffer cleanup from a workqueue
dm rq: don't call blk_mq_queue_stopped() in dm_stop_queue()
dm dust: add interface to list all badblocks
dm dust: report some message results directly back to user
dm verity: add "panic_on_corruption" error handling mode
dm mpath: use double checked locking in fast path
dm mpath: rename current_pgpath to pgpath in multipath_prepare_ioctl
dm mpath: rework __map_bio()
dm mpath: factor out multipath_queue_bio
dm mpath: push locking down to must_push_back_rq()
dm mpath: take m->lock spinlock when testing QUEUE_IF_NO_PATH
dm mpath: changes from initial m->flags locking audit
Linus Torvalds [Fri, 7 Aug 2020 20:00:53 +0000 (13:00 -0700)]
Merge tag 'media/v5.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media updates from Mauro Carvalho Chehab:
- Legacy soc_camera driver was removed from staging
- New I2C sensor related drivers: dw9768, ch7322, max9271, rdacm20
- TI vpe driver code was re-organized and had new features added
- Added Xilinx MIPI CSI-2 Rx Subsystem driver
- Added support for Infrared Toy and IR Droid devices
- Lots of random driver fixes, new features and cleanups
* tag 'media/v5.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (318 commits)
media: camss: fix memory leaks on error handling paths in probe
media: davinci: vpif_capture: fix potential double free
media: radio: remove redundant assignment to variable retval
media: allegro: fix potential null dereference on header
media: mtk-mdp: Fix a refcounting bug on error in init
media: allegro: fix an error pointer vs NULL check
media: meye: fix missing pm_mchip_mode field
media: cafe-driver: use generic power management
media: saa7164: use generic power management
media: v4l2-dev/ioctl: Fix document for VIDIOC_QUERYCAP
media: v4l2: Correct kernel-doc inconsistency
media: v4l2: Correct kernel-doc inconsistency
media: dvbdev.h: keep * together with the type
media: v4l2-subdev.h: keep * together with the type
media: videobuf2: Print videobuf2 buffer state by name
media: colorspaces-details.rst: fix V4L2_COLORSPACE_JPEG description
media: tw68: use generic power management
media: meye: use generic power management
media: cx88: use generic power management
media: cx25821: use generic power management
...
Kees Cook [Fri, 7 Aug 2020 17:53:54 +0000 (10:53 -0700)]
net/scm: Fix typo in SCM_RIGHTS compat refactoring
When refactoring the SCM_RIGHTS code, I accidentally mis-merged my
native/compat diffs, which entirely broke using SCM_RIGHTS in compat
mode. Use the correct helper.
Reported-by: Christian Zigotzky <chzigotzky@xenosoft.de> Link: https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-August/216156.html Reported-by: "Alex Xu (Hello71)" <alex_y_xu@yahoo.ca> Link: https://lore.kernel.org/lkml/1596812929.lz7fuo8r2w.none@localhost/ Suggested-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Fixes: c0029de50982 ("net/scm: Regularize compat handling of scm_detach_fds()") Tested-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca> Acked-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: Kees Cook <keescook@chromium.org>
Linus Torvalds [Fri, 7 Aug 2020 18:39:33 +0000 (11:39 -0700)]
Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:
- a few MM hotfixes
- kthread, tools, scripts, ntfs and ocfs2
- some of MM
Subsystems affected by this patch series: kthread, tools, scripts, ntfs,
ocfs2 and mm (hofixes, pagealloc, slab-generic, slab, slub, kcsan,
debug, pagecache, gup, swap, shmem, memcg, pagemap, mremap, mincore,
sparsemem, vmalloc, kasan, pagealloc, hugetlb and vmscan).
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (162 commits)
mm: vmscan: consistent update to pgrefill
mm/vmscan.c: fix typo
khugepaged: khugepaged_test_exit() check mmget_still_valid()
khugepaged: retract_page_tables() remember to test exit
khugepaged: collapse_pte_mapped_thp() protect the pmd lock
khugepaged: collapse_pte_mapped_thp() flush the right range
mm/hugetlb: fix calculation of adjust_range_if_pmd_sharing_possible
mm: thp: replace HTTP links with HTTPS ones
mm/page_alloc: fix memalloc_nocma_{save/restore} APIs
mm/page_alloc.c: skip setting nodemask when we are in interrupt
mm/page_alloc: fallbacks at most has 3 elements
mm/page_alloc: silence a KASAN false positive
mm/page_alloc.c: remove unnecessary end_bitidx for [set|get]_pfnblock_flags_mask()
mm/page_alloc.c: simplify pageblock bitmap access
mm/page_alloc.c: extract the common part in pfn_to_bitidx()
mm/page_alloc.c: replace the definition of NR_MIGRATETYPE_BITS with PB_migratetype_bits
mm/shuffle: remove dynamic reconfiguration
mm/memory_hotplug: document why shuffle_zone() is relevant
mm/page_alloc: remove nr_free_pagecache_pages()
mm: remove vm_total_pages
...
Shakeel Butt [Fri, 7 Aug 2020 06:26:32 +0000 (23:26 -0700)]
mm: vmscan: consistent update to pgrefill
The vmstat pgrefill is useful together with pgscan and pgsteal stats to
measure the reclaim efficiency. However vmstat's pgrefill is not updated
consistently at system level. It gets updated for both global and memcg
reclaim however pgscan and pgsteal are updated for only global reclaim.
So, update pgrefill only for global reclaim. If someone is interested in
the stats representing both system level as well as memcg level reclaim,
then consult the root memcg's memory.stat instead of /proc/vmstat.
Signed-off-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Chris Down <chris@chrisdown.name> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Link: http://lkml.kernel.org/r/20200711011459.1159929-1-shakeelb@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move collapse_huge_page()'s mmget_still_valid() check into
khugepaged_test_exit() itself. collapse_huge_page() is used for anon THP
only, and earned its mmget_still_valid() check because it inserts a huge
pmd entry in place of the page table's pmd entry; whereas
collapse_file()'s retract_page_tables() or collapse_pte_mapped_thp()
merely clears the page table's pmd entry. But core dumping without mmap
lock must have been as open to mistaking a racily cleared pmd entry for a
page table at physical page 0, as exit_mmap() was. And we certainly have
no interest in mapping as a THP once dumping core.
Fixes: 59ea6d06cfa9 ("coredump: fix race condition between collapse_huge_page() and core dumping") Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Song Liu <songliubraving@fb.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> [4.8+] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021217020.27773@eggly.anvils Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Fri, 7 Aug 2020 06:26:22 +0000 (23:26 -0700)]
khugepaged: retract_page_tables() remember to test exit
Only once have I seen this scenario (and forgot even to notice what forced
the eventual crash): a sequence of "BUG: Bad page map" alerts from
vm_normal_page(), from zap_pte_range() servicing exit_mmap();
pmd:00000000, pte values corresponding to data in physical page 0.
The pte mappings being zapped in this case were supposed to be from a huge
page of ext4 text (but could as well have been shmem): my belief is that
it was racing with collapse_file()'s retract_page_tables(), found *pmd
pointing to a page table, locked it, but *pmd had become 0 by the time
start_pte was decided.
In most cases, that possibility is excluded by holding mmap lock; but
exit_mmap() proceeds without mmap lock. Most of what's run by khugepaged
checks khugepaged_test_exit() after acquiring mmap lock:
khugepaged_collapse_pte_mapped_thps() and hugepage_vma_revalidate() do so,
for example. But retract_page_tables() did not: fix that.
The fix is for retract_page_tables() to check khugepaged_test_exit(),
after acquiring mmap lock, before doing anything to the page table.
Getting the mmap lock serializes with __mmput(), which briefly takes and
drops it in __khugepaged_exit(); then the khugepaged_test_exit() check on
mm_users makes sure we don't touch the page table once exit_mmap() might
reach it, since exit_mmap() will be proceeding without mmap lock, not
expecting anyone to be racing with it.
Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages") Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Song Liu <songliubraving@fb.com> Cc: <stable@vger.kernel.org> [4.8+] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021215400.27773@eggly.anvils Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Fri, 7 Aug 2020 06:26:18 +0000 (23:26 -0700)]
khugepaged: collapse_pte_mapped_thp() protect the pmd lock
When retract_page_tables() removes a page table to make way for a huge
pmd, it holds huge page lock, i_mmap_lock_write, mmap_write_trylock and
pmd lock; but when collapse_pte_mapped_thp() does the same (to handle the
case when the original mmap_write_trylock had failed), only
mmap_write_trylock and pmd lock are held.
That's not enough. One machine has twice crashed under load, with "BUG:
spinlock bad magic" and GPF on 6b6b6b6b6b6b6b6b. Examining the second
crash, page_vma_mapped_walk_done()'s spin_unlock of pvmw->ptl (serving
page_referenced() on a file THP, that had found a page table at *pmd)
discovers that the page table page and its lock have already been freed by
the time it comes to unlock.
Follow the example of retract_page_tables(), but we only need one of huge
page lock or i_mmap_lock_write to secure against this: because it's the
narrower lock, and because it simplifies collapse_pte_mapped_thp() to know
the hpage earlier, choose to rely on huge page lock here.
Fixes: 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP") Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Song Liu <songliubraving@fb.com> Cc: <stable@vger.kernel.org> [5.4+] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021213070.27773@eggly.anvils Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Fri, 7 Aug 2020 06:26:15 +0000 (23:26 -0700)]
khugepaged: collapse_pte_mapped_thp() flush the right range
pmdp_collapse_flush() should be given the start address at which the huge
page is mapped, haddr: it was given addr, which at that point has been
used as a local variable, incremented to the end address of the extent.
Found by source inspection while chasing a hugepage locking bug, which I
then could not explain by this. At first I thought this was very bad;
then saw that all of the page translations that were not flushed would
actually still point to the right pages afterwards, so harmless; then
realized that I know nothing of how different architectures and models
cache intermediate paging structures, so maybe it matters after all -
particularly since the page table concerned is immediately freed.
Much easier to fix than to think about.
Fixes: 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP") Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Song Liu <songliubraving@fb.com> Cc: <stable@vger.kernel.org> [5.4+] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021204390.27773@eggly.anvils Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peter Xu [Fri, 7 Aug 2020 06:26:11 +0000 (23:26 -0700)]
mm/hugetlb: fix calculation of adjust_range_if_pmd_sharing_possible
This is found by code observation only.
Firstly, the worst case scenario should assume the whole range was covered
by pmd sharing. The old algorithm might not work as expected for ranges
like (1g-2m, 1g+2m), where the adjusted range should be (0, 1g+2m) but the
expected range should be (0, 2g).
Since at it, remove the loop since it should not be required. With that,
the new code should be faster too when the invalidating range is huge.
Mike said:
: With range (1g-2m, 1g+2m) within a vma (0, 2g) the existing code will only
: adjust to (0, 1g+2m) which is incorrect.
:
: We should cc stable. The original reason for adjusting the range was to
: prevent data corruption (getting wrong page). Since the range is not
: always adjusted correctly, the potential for corruption still exists.
:
: However, I am fairly confident that adjust_range_if_pmd_sharing_possible
: is only gong to be called in two cases:
:
: 1) for a single page
: 2) for range == entire vma
:
: In those cases, the current code should produce the correct results.
:
: To be safe, let's just cc stable.
Fixes: 017b1660df89 ("mm: migration: fix migration of huge PMD shared pages") Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200730201636.74778-1-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.
Deterministic algorithm:
For each file:
If not .svg:
For each line:
If doesn't contain `xmlns`:
For each link, `http://[^# ]*(?:\w|/)`:
If neither `gnu\.org/license`, nor `mozilla\.org/MPL`:
If both the HTTP and HTTPS versions
return 200 OK and serve the same content:
Replace HTTP with HTTPS.
[akpm@linux-foundation.org: fix amd.com URL, per Vlastimil]
Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/20200713164345.36088-1-grandmaster@al2klimov.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, memalloc_nocma_{save/restore} API that prevents CMA area
in page allocation is implemented by using current_gfp_context(). However,
there are two problems of this implementation.
First, this doesn't work for allocation fastpath. In the fastpath,
original gfp_mask is used since current_gfp_context() is introduced in
order to control reclaim and it is on slowpath. So, CMA area can be
allocated through the allocation fastpath even if
memalloc_nocma_{save/restore} APIs are used. Currently, there is just
one user for these APIs and it has a fallback method to prevent actual
problem.
Second, clearing __GFP_MOVABLE in current_gfp_context() has a side effect
to exclude the memory on the ZONE_MOVABLE for allocation target.
To fix these problems, this patch changes the implementation to exclude
CMA area in page allocation. Main point of this change is using the
alloc_flags. alloc_flags is mainly used to control allocation so it fits
for excluding CMA area in allocation.
Fixes: d7fefcc8de91 (mm/cma: add PF flag to force non cma alloc) Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Hellwig <hch@infradead.org> Cc: Roman Gushchin <guro@fb.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@suse.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Link: http://lkml.kernel.org/r/1595468942-29687-1-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Muchun Song [Fri, 7 Aug 2020 06:26:01 +0000 (23:26 -0700)]
mm/page_alloc.c: skip setting nodemask when we are in interrupt
When we are in the interrupt context, it is irrelevant to the current task
context. If we use current task's mems_allowed, we can be fair to alloc
pages in the fast path and fall back to slow path memory allocation when
the current node(which is the current task mems_allowed) does not have
enough memory to allocate. In this case, it slows down the memory
allocation speed of interrupt context. So we can skip setting the
nodemask to allow any node to allocate memory, so that fast path
allocation can success.
Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Pekka Enberg <penberg@kernel.org> Cc: David Hildenbrand <david@redhat.com> Link: http://lkml.kernel.org/r/20200706025921.53683-1-songmuchun@bytedance.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Qian Cai [Fri, 7 Aug 2020 06:25:54 +0000 (23:25 -0700)]
mm/page_alloc: silence a KASAN false positive
kernel_init_free_pages() will use memset() on s390 to clear all pages from
kmalloc_order() which will override KASAN redzones because a redzone was
setup from the end of the allocation size to the end of the last page.
Silence it by not reporting it there. An example of the report is,
BUG: KASAN: slab-out-of-bounds in __free_pages_ok
Write of size 4096 at addr 000000014beaa000
Call Trace:
show_stack+0x152/0x210
dump_stack+0x1f8/0x248
print_address_description.isra.13+0x5e/0x4d0
kasan_report+0x130/0x178
check_memory_region+0x190/0x218
memset+0x34/0x60
__free_pages_ok+0x894/0x12f0
kfree+0x4f2/0x5e0
unpack_to_rootfs+0x60e/0x650
populate_rootfs+0x56/0x358
do_one_initcall+0x1f4/0xa20
kernel_init_freeable+0x758/0x7e8
kernel_init+0x1c/0x170
ret_from_fork+0x24/0x28
Memory state around the buggy address: 000000014bea9f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 000000014bea9f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>000000014beaa000: 03 fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
^ 000000014beaa080: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe 000000014beaa100: fe fe fe fe fe fe fe fe fe fe fe fe fe fe
Wei Yang [Fri, 7 Aug 2020 06:25:48 +0000 (23:25 -0700)]
mm/page_alloc.c: simplify pageblock bitmap access
Due to commit e58469bafd05 ("mm: page_alloc: use word-based accesses for
get/set pageblock bitmaps"), pageblock bitmap is accessed with word-based
access. This operation could be simplified a little.
Intuitively, if we want to get a bit range [start_idx, end_idx] in a word,
we can do like this:
Commit e900a918b098 ("mm: shuffle initial free memory to improve
memory-side-cache utilization") promised "autodetection of a
memory-side-cache (to be added in a follow-on patch)" over a year ago.
The original series included patches [1], however, they were dropped
during review [2] to be followed-up later.
Due to lack of platforms that publish an HMAT, autodetection is currently
not implemented. However, manual activation is actively used [3]. Let's
simplify for now and re-add when really (ever?) needed.
Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Huang Ying <ying.huang@intel.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Dan Williams <dan.j.williams@intel.com> Link: http://lkml.kernel.org/r/20200624094741.9918-4-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/memory_hotplug: document why shuffle_zone() is relevant
It's not completely obvious why we have to shuffle the complete zone -
introduced in commit e900a918b098 ("mm: shuffle initial free memory to
improve memory-side-cache utilization") - because some sort of shuffling
is already performed when onlining pages via __free_one_page(), placing
MAX_ORDER-1 pages either to the head or the tail of the freelist. Let's
document why we have to shuffle the complete zone when exposing larger,
contiguous physical memory areas to the buddy.
Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Michal Hocko <mhocko@suse.com> Link: http://lkml.kernel.org/r/20200624094741.9918-3-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
nr_free_pagecache_pages() isn't used outside page_alloc.c anymore - and
the name does not really help to understand what's going on. Let's
open-code it instead and add a comment.
Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Huang Ying <ying.huang@intel.com> Link: http://lkml.kernel.org/r/20200619132410.23859-3-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The global variable "vm_total_pages" is a relic from older days. There is
only a single user that reads the variable - build_all_zonelists() - and
the first thing it does is update it.
Use a local variable in build_all_zonelists() instead and remove the
global variable.
Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Huang Ying <ying.huang@intel.com> Cc: Minchan Kim <minchan@kernel.org> Link: http://lkml.kernel.org/r/20200619132410.23859-2-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm, page_alloc: skip ->waternark_boost for atomic order-0 allocations
When boosting is enabled, it is observed that rate of atomic order-0
allocation failures are high due to the fact that free levels in the
system are checked with ->watermark_boost offset. This is not a problem
for sleepable allocations but for atomic allocations which looks like
regression.
This problem is seen frequently on system setup of Android kernel running
on Snapdragon hardware with 4GB RAM size. When no extfrag event occurred
in the system, ->watermark_boost factor is zero, thus the watermark
configurations in the system are:
After launching some memory hungry applications in Android which can cause
extfrag events in the system to an extent that ->watermark_boost can be
set to max i.e. default boost factor makes it to 150% of high watermark.
With default system configuration, for an atomic order-0 allocation to
succeed, having free memory of ~2MB will suffice. But boosting makes the
min_wmark to ~61MB thus for an atomic order-0 allocation to be successful
system should have minimum of ~23MB of free memory(from calculations of
zone_watermark_ok(), min = 3/4(min/2)). But failures are observed despite
system is having ~20MB of free memory. In the testing, this is
reproducible as early as first 300secs since boot and with furtherlowram
configurations(<2GB) it is observed as early as first 150secs since boot.
These failures can be avoided by excluding the ->watermark_boost in
watermark caluculations for atomic order-0 allocations.
Jaewon Kim [Fri, 7 Aug 2020 06:25:20 +0000 (23:25 -0700)]
page_alloc: consider highatomic reserve in watermark fast
zone_watermark_fast was introduced by commit 48ee5f3696f6 ("mm,
page_alloc: shortcut watermark checks for order-0 pages"). The commit
simply checks if free pages is bigger than watermark without additional
calculation such like reducing watermark.
It considered free cma pages but it did not consider highatomic reserved.
This may incur exhaustion of free pages except high order atomic free
pages.
Assume that reserved_highatomic pageblock is bigger than watermark min,
and there are only few free pages except high order atomic free. Because
zone_watermark_fast passes the allocation without considering high order
atomic free, normal reclaimable allocation like GFP_HIGHUSER will consume
all the free pages. Then finally order-0 atomic allocation may fail on
allocation.
This means watermark min is not protected against non-atomic allocation.
The order-0 atomic allocation with ALLOC_HARDER unwantedly can be failed.
Additionally the __GFP_MEMALLOC allocation with ALLOC_NO_WATERMARKS also
can be failed.
To avoid the problem, zone_watermark_fast should consider highatomic
reserve. If the actual size of high atomic free is counted accurately
like cma free, we may use it. On this patch just use
nr_reserved_highatomic. Additionally introduce
__zone_watermark_unusable_free to factor out common parts between
zone_watermark_fast and __zone_watermark_ok.
This is an example of ALLOC_HARDER allocation failure using v4.19 based
kernel.
Vlastimil Babka [Fri, 7 Aug 2020 06:25:16 +0000 (23:25 -0700)]
mm, page_alloc: use unlikely() in task_capc()
Hugh noted that task_capc() could use unlikely(), as most of the time
there is no capture in progress and we are in page freeing hot path.
Indeed adding unlikely() produces assembly that better matches the
assumption and moves all the tests away from the hot path.
I have also noticed that we don't need to test for cc->direct_compaction
as the only place we set current->task_capture is compact_zone_order()
which also always sets cc->direct_compaction true.
Suggested-by: Hugh Dickins <hughd@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Hugh Dickins <hughd@googlecom> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Li Wang <liwang@redhat.com> Link: http://lkml.kernel.org/r/4a24f7af-3aa5-6e80-4ae6-8f253b562039@suse.cz Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>