From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Andy Lutomirski <luto@kernel.org>
Date: Mon, 9 Oct 2017 09:50:49 -0700
Subject: [PATCH] x86/mm: Flush more aggressively in lazy TLB mode
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

CVE-2017-5754

Since commit:

  94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")

x86's lazy TLB mode has been all the way lazy: when running a kernel thread
(including the idle thread), the kernel keeps using the last user mm's
page tables without attempting to maintain user TLB coherence at all.

From a pure semantic perspective, this is fine -- kernel threads won't
attempt to access user pages, so having stale TLB entries doesn't matter.

Unfortunately, I forgot about a subtlety. By skipping TLB flushes,
we also allow any paging-structure caches that may exist on the CPU
to become incoherent. This means that we can have a
paging-structure cache entry that references a freed page table, and
the CPU is within its rights to do a speculative page walk starting
at the freed page table.

I can imagine this causing two different problems:

 - A speculative page walk starting from a bogus page table could read
   IO addresses. I haven't seen any reports of this causing problems.

 - A speculative page walk that involves a bogus page table can install
   garbage in the TLB. Such garbage would always be at a user VA, but
   some AMD CPUs have logic that triggers a machine check when it notices
   these bogus entries. I've seen a couple reports of this.

Boris further explains the failure mode:

> It is actually more of an optimization which assumes that paging-structure
> entries are in WB DRAM:
>
> "TlbCacheDis: cacheable memory disable. Read-write. 0=Enables
> performance optimization that assumes PML4, PDP, PDE, and PTE entries
> are in cacheable WB-DRAM; memory type checks may be bypassed, and
> addresses outside of WB-DRAM may result in undefined behavior or NB
> protocol errors. 1=Disables performance optimization and allows PML4,
> PDP, PDE and PTE entries to be in any memory type. Operating systems
> that maintain page tables in memory types other than WB- DRAM must set
> TlbCacheDis to insure proper operation."
>
> The MCE generated is an NB protocol error to signal that
>
> "Link: A specific coherent-only packet from a CPU was issued to an
> IO link. This may be caused by software which addresses page table
> structures in a memory type other than cacheable WB-DRAM without
> properly configuring MSRC001_0015[TlbCacheDis]. This may occur, for
> example, when page table structure addresses are above top of memory. In
> such cases, the NB will generate an MCE if it sees a mismatch between
> the memory operation generated by the core and the link type."
>
> I'm assuming coherent-only packets don't go out on IO links, thus the
> error.

To fix this, reinstate TLB coherence in lazy mode. With this patch
applied, we do it in one of two ways (sketched in code after this list):

 - If we have PCID, we simply switch back to init_mm's page tables
   when we enter a kernel thread -- this seems to be quite cheap
   except for the cost of serializing the CPU.

 - If we don't have PCID, then we set a flag and switch to init_mm
   the first time we would otherwise need to flush the TLB.

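The following is a condensed, non-authoritative sketch of how those two
cases map onto the code this patch adds; it only restates the new
enter_lazy_tlb() and flush_tlb_func_common() hunks in arch/x86/mm/tlb.c
below, with the surrounding parameters and unchanged logic elided:

  void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
  {
          if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
                  return;                 /* already running on init_mm */

          if (static_branch_unlikely(&tlb_use_lazy_mode))
                  /* no PCID (the default): keep the old CR3, just mark lazy */
                  this_cpu_write(cpu_tlbstate.is_lazy, true);
          else
                  /* PCID: reloading CR3 is cheap, leave the user mm now */
                  switch_mm(NULL, &init_mm, NULL);
  }

  /* In flush_tlb_func_common(): the first flush that reaches a lazy CPU
   * switches it to init_mm instead of flushing the stale user mm. */
  if (this_cpu_read(cpu_tlbstate.is_lazy)) {
          switch_mm_irqs_off(NULL, &init_mm, NULL);
          return;
  }
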
The /sys/kernel/debug/x86/tlb_use_lazy_mode debug switch can be changed
to override the default mode for benchmarking.
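From userspace that amounts to writing '0' or '1' to the debugfs file
created below. A minimal, hypothetical toggle helper (illustrative only,
assuming debugfs is mounted at /sys/kernel/debug and root privileges):

  /* toggle_tlb_lazy.c -- not part of this patch */
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
          int fd = open("/sys/kernel/debug/x86/tlb_use_lazy_mode", O_WRONLY);

          if (fd < 0) {
                  perror("open");
                  return 1;
          }
          /* "0": always switch to init_mm; "1": use the lazy flag. */
          if (write(fd, "0", 1) != 1)
                  perror("write");
          close(fd);
          return 0;
  }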

In theory, we could optimize this better by only flushing the TLB in
lazy CPUs when a page table is freed. Doing that would require
auditing the mm code to make sure that all page table freeing goes
through tlb_remove_page() as well as reworking some data structures
to implement the improved flush logic.

Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Reported-by: Adam Borowski <kilobyte@angband.pl>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Johannes Hirte <johannes.hirte@datenkhaos.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Roman Kagan <rkagan@virtuozzo.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")
Link: http://lkml.kernel.org/r/20171009170231.fkpraqokz6e4zeco@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(backported from commit b956575bed91ecfb136a8300742ecbbf451471ab)
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
(cherry picked from commit a4bb9409c548ece51ec246fc5113a32b8d130142)
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
---
 arch/x86/include/asm/mmu_context.h |   8 +-
 arch/x86/include/asm/tlbflush.h    |  24 ++++++
 arch/x86/mm/tlb.c                  | 160 +++++++++++++++++++++++++------------
 3 files changed, 136 insertions(+), 56 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index c120b5db178a..3c856a15b98e 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -126,13 +126,7 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
         DEBUG_LOCKS_WARN_ON(preemptible());
 }
 
-static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
-{
-        int cpu = smp_processor_id();
-
-        if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
-                cpumask_clear_cpu(cpu, mm_cpumask(mm));
-}
+void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
 
 static inline int init_new_context(struct task_struct *tsk,
                                    struct mm_struct *mm)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index d23e61dc0640..6533da3036c9 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -82,6 +82,13 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 #define __flush_tlb_single(addr) __native_flush_tlb_single(addr)
 #endif
 
+/*
+ * If tlb_use_lazy_mode is true, then we try to avoid switching CR3 to point
+ * to init_mm when we switch to a kernel thread (e.g. the idle thread). If
+ * it's false, then we immediately switch CR3 when entering a kernel thread.
+ */
+DECLARE_STATIC_KEY_TRUE(tlb_use_lazy_mode);
+
 /*
  * 6 because 6 should be plenty and struct tlb_state will fit in
  * two cache lines.
@@ -104,6 +111,23 @@ struct tlb_state {
         u16 loaded_mm_asid;
         u16 next_asid;
 
+        /*
+         * We can be in one of several states:
+         *
+         *  - Actively using an mm. Our CPU's bit will be set in
+         *    mm_cpumask(loaded_mm) and is_lazy == false;
+         *
+         *  - Not using a real mm. loaded_mm == &init_mm. Our CPU's bit
+         *    will not be set in mm_cpumask(&init_mm) and is_lazy == false.
+         *
+         *  - Lazily using a real mm. loaded_mm != &init_mm, our bit
+         *    is set in mm_cpumask(loaded_mm), but is_lazy == true.
+         *    We're heuristically guessing that the CR3 load we
+         *    skipped more than makes up for the overhead added by
+         *    lazy mode.
+         */
+        bool is_lazy;
+
         /*
          * Access to this CR4 shadow and to H/W CR4 is protected by
          * disabling interrupts when modifying either one.
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 440400316c8a..b27aceaf7ed1 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -30,6 +30,8 @@
 
 atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
 
+DEFINE_STATIC_KEY_TRUE(tlb_use_lazy_mode);
+
 static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
                             u16 *new_asid, bool *need_flush)
 {
@@ -80,7 +82,7 @@ void leave_mm(int cpu)
                 return;
 
         /* Warn if we're not lazy. */
-        WARN_ON(cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm)));
+        WARN_ON(!this_cpu_read(cpu_tlbstate.is_lazy));
 
         switch_mm(NULL, &init_mm, NULL);
 }
@@ -140,52 +142,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
                 __flush_tlb_all();
         }
 #endif
+        this_cpu_write(cpu_tlbstate.is_lazy, false);
 
         if (real_prev == next) {
                 VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
                           next->context.ctx_id);
 
-                if (cpumask_test_cpu(cpu, mm_cpumask(next))) {
-                        /*
-                         * There's nothing to do: we weren't lazy, and we
-                         * aren't changing our mm. We don't need to flush
-                         * anything, nor do we need to update CR3, CR4, or
-                         * LDTR.
-                         */
-                        return;
-                }
-
-                /* Resume remote flushes and then read tlb_gen. */
-                cpumask_set_cpu(cpu, mm_cpumask(next));
-                next_tlb_gen = atomic64_read(&next->context.tlb_gen);
-
-                if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) <
-                    next_tlb_gen) {
-                        /*
-                         * Ideally, we'd have a flush_tlb() variant that
-                         * takes the known CR3 value as input. This would
-                         * be faster on Xen PV and on hypothetical CPUs
-                         * on which INVPCID is fast.
-                         */
-                        this_cpu_write(cpu_tlbstate.ctxs[prev_asid].tlb_gen,
-                                       next_tlb_gen);
-                        write_cr3(build_cr3(next, prev_asid));
-
-                        /*
-                         * This gets called via leave_mm() in the idle path
-                         * where RCU functions differently. Tracing normally
-                         * uses RCU, so we have to call the tracepoint
-                         * specially here.
-                         */
-                        trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
-                                                TLB_FLUSH_ALL);
-                }
-
                 /*
-                 * We just exited lazy mode, which means that CR4 and/or LDTR
-                 * may be stale. (Changes to the required CR4 and LDTR states
-                 * are not reflected in tlb_gen.)
+                 * We don't currently support having a real mm loaded without
+                 * our cpu set in mm_cpumask(). We have all the bookkeeping
+                 * in place to figure out whether we would need to flush
+                 * if our cpu were cleared in mm_cpumask(), but we don't
+                 * currently use it.
                  */
+                if (WARN_ON_ONCE(real_prev != &init_mm &&
+                                 !cpumask_test_cpu(cpu, mm_cpumask(next))))
+                        cpumask_set_cpu(cpu, mm_cpumask(next));
+
+                return;
         } else {
                 u16 new_asid;
                 bool need_flush;
@@ -204,10 +178,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
         }
 
         /* Stop remote flushes for the previous mm */
-        if (cpumask_test_cpu(cpu, mm_cpumask(real_prev)))
-                cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
-
-        VM_WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
+        VM_WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(real_prev)) &&
+                        real_prev != &init_mm);
+        cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
 
         /*
          * Start remote flushes and then read tlb_gen.
@@ -237,6 +210,37 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
         switch_ldt(real_prev, next);
 }
 
+/*
+ * enter_lazy_tlb() is a hint from the scheduler that we are entering a
+ * kernel thread or other context without an mm. Acceptable implementations
+ * include doing nothing whatsoever, switching to init_mm, or various clever
+ * lazy tricks to try to minimize TLB flushes.
+ *
+ * The scheduler reserves the right to call enter_lazy_tlb() several times
+ * in a row. It will notify us that we're going back to a real mm by
+ * calling switch_mm_irqs_off().
+ */
+void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
+{
+        if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
+                return;
+
+        if (static_branch_unlikely(&tlb_use_lazy_mode)) {
+                /*
+                 * There's a significant optimization that may be possible
+                 * here. We have accurate enough TLB flush tracking that we
+                 * don't need to maintain coherence of TLB per se when we're
+                 * lazy. We do, however, need to maintain coherence of
+                 * paging-structure caches. We could, in principle, leave our
+                 * old mm loaded and only switch to init_mm when
+                 * tlb_remove_page() happens.
+                 */
+                this_cpu_write(cpu_tlbstate.is_lazy, true);
+        } else {
+                switch_mm(NULL, &init_mm, NULL);
+        }
+}
+
 /*
  * Call this when reinitializing a CPU. It fixes the following potential
  * problems:
@@ -308,16 +312,20 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
         /* This code cannot presently handle being reentered. */
         VM_WARN_ON(!irqs_disabled());
 
+        if (unlikely(loaded_mm == &init_mm))
+                return;
+
         VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
                    loaded_mm->context.ctx_id);
 
-        if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm))) {
+        if (this_cpu_read(cpu_tlbstate.is_lazy)) {
                 /*
-                 * We're in lazy mode -- don't flush. We can get here on
-                 * remote flushes due to races and on local flushes if a
-                 * kernel thread coincidentally flushes the mm it's lazily
-                 * still using.
+                 * We're in lazy mode. We need to at least flush our
+                 * paging-structure cache to avoid speculatively reading
+                 * garbage into our TLB. Since switching to init_mm is barely
+                 * slower than a minimal flush, just switch to init_mm.
                  */
+                switch_mm_irqs_off(NULL, &init_mm, NULL);
                 return;
         }
 
@@ -616,3 +624,57 @@ static int __init create_tlb_single_page_flush_ceiling(void)
         return 0;
 }
 late_initcall(create_tlb_single_page_flush_ceiling);
+
+static ssize_t tlblazy_read_file(struct file *file, char __user *user_buf,
+                                 size_t count, loff_t *ppos)
+{
+        char buf[2];
+
+        buf[0] = static_branch_likely(&tlb_use_lazy_mode) ? '1' : '0';
+        buf[1] = '\n';
+
+        return simple_read_from_buffer(user_buf, count, ppos, buf, 2);
+}
+
+static ssize_t tlblazy_write_file(struct file *file,
+                 const char __user *user_buf, size_t count, loff_t *ppos)
+{
+        bool val;
+
+        if (kstrtobool_from_user(user_buf, count, &val))
+                return -EINVAL;
+
+        if (val)
+                static_branch_enable(&tlb_use_lazy_mode);
+        else
+                static_branch_disable(&tlb_use_lazy_mode);
+
+        return count;
+}
+
+static const struct file_operations fops_tlblazy = {
+        .read = tlblazy_read_file,
+        .write = tlblazy_write_file,
+        .llseek = default_llseek,
+};
+
+static int __init init_tlb_use_lazy_mode(void)
+{
+        if (boot_cpu_has(X86_FEATURE_PCID)) {
+                /*
+                 * Heuristic: with PCID on, switching to and from
+                 * init_mm is reasonably fast, but remote flush IPIs
+                 * as expensive as ever, so turn off lazy TLB mode.
+                 *
+                 * We can't do this in setup_pcid() because static keys
+                 * haven't been initialized yet, and it would blow up
+                 * badly.
+                 */
+                static_branch_disable(&tlb_use_lazy_mode);
+        }
+
+        debugfs_create_file("tlb_use_lazy_mode", S_IRUSR | S_IWUSR,
+                            arch_debugfs_dir, NULL, &fops_tlblazy);
+        return 0;
+}
+late_initcall(init_tlb_use_lazy_mode);
-- 
2.14.2
