mm, memcg: throttle allocators when failing reclaim over memory.high
We're trying to use memory.high to limit workloads, but have found that
containment can frequently fail completely and cause OOM situations
outside of the cgroup. This happens especially with swap space -- either
when none is configured, or swap is full. These failures often also don't
have enough warning to allow one to react, whether for a human or for a
daemon monitoring PSI.
Here is output from a simple program showing how long it takes in usec
(column 2) to allocate a megabyte of anonymous memory (column 1) when a
cgroup is already beyond its memory high setting, and no swap is
available:
[root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
> --wait -t timeout 300 /root/mdf
[...]
95 1035
96 1038
97 1000
98 1036
99 1048
100 1590
101 1968
102 1776
103 1863
104 1757
105 1921
106 1893
107 1760
108 1748
109 1843
110 1716
111 1924
112 1776
113 1831
114 1766
115 1836
116 1588
117 1912
118 1802
119 1857
120 1731
[...]
[System OOM in 2-3 seconds]
The delay does go up extremely marginally past the 100MB memory.high
threshold, as now we spend time scanning before returning to usermode, but
it's nowhere near enough to contain growth. It also doesn't get worse the
more pages you have, since it only considers nr_pages.
The current situation goes against both the expectations of users of
memory.high, and our intentions as cgroup v2 developers. In
cgroup-v2.txt, we claim that we will throttle and only under "extreme
conditions" will memory.high protection be breached. Likewise, cgroup v2
users generally also expect that memory.high should throttle workloads as
they exceed their high threshold. However, as seen above, this isn't
always how it works in practice -- even on banal setups like those with no
swap, or where swap has become exhausted, we can end up with memory.high
being breached and us having no weapons left in our arsenal to combat
runaway growth with, since reclaim is futile.
It's also hard for system monitoring software or users to tell how bad the
situation is, as "high" events for the memcg may in some cases be benign,
and in others be catastrophic. The current status quo is that we fail
containment in a way that doesn't provide any advance warning that things
are about to go horribly wrong (for example, we are about to invoke the
kernel OOM killer).
This patch introduces explicit throttling when reclaim is failing to keep
memcg size contained at the memory.high setting. It does so by applying
an exponential delay curve derived from the memcg's overage compared to
memory.high. In the normal case where the memcg is either below or only
marginally over its memory.high setting, no throttling will be performed.
This composes well with system health monitoring and remediation, as these
allocator delays are factored into PSI's memory pressure calculations.
This both creates a mechanism system administrators or applications
consuming the PSI interface to trivially see that the memcg in question is
struggling and use that to make more reasonable decisions, and permits
them enough time to act. Either of these can act with significantly more
nuance than that we can provide using the system OOM killer.
This is a similar idea to memory.oom_control in cgroup v1 which would put
the cgroup to sleep if the threshold was violated, but it's also
significantly improved as it results in visible memory pressure, and also
doesn't schedule indefinitely, which previously made tracing and other
introspection difficult (ie. it's clamped at 2*HZ per allocation through
MEMCG_MAX_HIGH_DELAY_JIFFIES).
Contrast the previous results with a kernel with this patch:
[root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
> --wait -t timeout 300 /root/mdf
[...]
95 1002
96 1000
97 1002
98 1003
99 1000
100 1043
101 84724
102 330628
103 610511
104
1016265
105
1503969
106
2391692
107
2872061
108
3248003
109
4791904
110
5759832
111
6912509
112
8127818
113
9472203
114
12287622
115
12480079
116
14144008
117
15808029
118
16384500
119
16383242
120
16384979
[...]
As you can see, in the normal case, memory allocation takes around 1000
usec. However, as we exceed our memory.high, things start to increase
exponentially, but fairly leniently at first. Our first megabyte over
memory.high takes us 0.16 seconds, then the next is 0.46 seconds, then the
next is almost an entire second. This gets worse until we reach our
eventual 2*HZ clamp per batch, resulting in 16 seconds per megabyte.
However, this is still making forward progress, so permits tracing or
further analysis with programs like GDB.
We use an exponential curve for our delay penalty for a few reasons:
1. We run mem_cgroup_handle_over_high to potentially do reclaim after
we've already performed allocations, which means that temporarily
going over memory.high by a small amount may be perfectly legitimate,
even for compliant workloads. We don't want to unduly penalise such
cases.
2. An exponential curve (as opposed to a static or linear delay) allows
ramping up memory pressure stats more gradually, which can be useful
to work out that you have set memory.high too low, without destroying
application performance entirely.
This patch expands on earlier work by Johannes Weiner. Thanks!
[akpm@linux-foundation.org: fix max() warning]
[akpm@linux-foundation.org: fix __udivdi3 ref on 32-bit]
[akpm@linux-foundation.org: fix it even more]
[chris@chrisdown.name: fix 64-bit divide even more]
Link: http://lkml.kernel.org/r/20190723180700.GA29459@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>