From d1ffadc67e2eee2d5f8626dca6646e70e3aa9d76 Mon Sep 17 00:00:00 2001
From: Andy Lutomirski <luto@kernel.org>
Date: Mon, 9 Oct 2017 09:50:49 -0700
Subject: [PATCH 045/233] x86/mm: Flush more aggressively in lazy TLB mode
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

CVE-2017-5754

Since commit:

  94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")

x86's lazy TLB mode has been all the way lazy: when running a kernel thread
(including the idle thread), the kernel keeps using the last user mm's
page tables without attempting to maintain user TLB coherence at all.

From a pure semantic perspective, this is fine -- kernel threads won't
attempt to access user pages, so having stale TLB entries doesn't matter.

Unfortunately, I forgot about a subtlety.  By skipping TLB flushes,
we also allow any paging-structure caches that may exist on the CPU
to become incoherent.  This means that we can have a
paging-structure cache entry that references a freed page table, and
the CPU is within its rights to do a speculative page walk starting
at the freed page table.

I can imagine this causing two different problems:

 - A speculative page walk starting from a bogus page table could read
   IO addresses.  I haven't seen any reports of this causing problems.

 - A speculative page walk that involves a bogus page table can install
   garbage in the TLB.  Such garbage would always be at a user VA, but
   some AMD CPUs have logic that triggers a machine check when it notices
   these bogus entries.  I've seen a couple reports of this.

Boris further explains the failure mode:

> It is actually more of an optimization which assumes that paging-structure
> entries are in WB DRAM:
>
> "TlbCacheDis: cacheable memory disable. Read-write. 0=Enables
> performance optimization that assumes PML4, PDP, PDE, and PTE entries
> are in cacheable WB-DRAM; memory type checks may be bypassed, and
> addresses outside of WB-DRAM may result in undefined behavior or NB
> protocol errors. 1=Disables performance optimization and allows PML4,
> PDP, PDE and PTE entries to be in any memory type. Operating systems
> that maintain page tables in memory types other than WB- DRAM must set
> TlbCacheDis to insure proper operation."
>
> The MCE generated is an NB protocol error to signal that
>
> "Link: A specific coherent-only packet from a CPU was issued to an
> IO link. This may be caused by software which addresses page table
> structures in a memory type other than cacheable WB-DRAM without
> properly configuring MSRC001_0015[TlbCacheDis]. This may occur, for
> example, when page table structure addresses are above top of memory. In
> such cases, the NB will generate an MCE if it sees a mismatch between
> the memory operation generated by the core and the link type."
>
> I'm assuming coherent-only packets don't go out on IO links, thus the
> error.

To fix this, reinstate TLB coherence in lazy mode.  With this patch
applied, we do it in one of two ways:

 - If we have PCID, we simply switch back to init_mm's page tables
   when we enter a kernel thread -- this seems to be quite cheap
   except for the cost of serializing the CPU.

 - If we don't have PCID, then we set a flag and switch to init_mm
   the first time we would otherwise need to flush the TLB.

The /sys/kernel/debug/x86/tlb_use_lazy_mode debug switch can be changed
to override the default mode for benchmarking.

In theory, we could optimize this better by only flushing the TLB in
lazy CPUs when a page table is freed.  Doing that would require
auditing the mm code to make sure that all page table freeing goes
through tlb_remove_page() as well as reworking some data structures
to implement the improved flush logic.

Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Reported-by: Adam Borowski <kilobyte@angband.pl>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Johannes Hirte <johannes.hirte@datenkhaos.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Roman Kagan <rkagan@virtuozzo.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")
Link: http://lkml.kernel.org/r/20171009170231.fkpraqokz6e4zeco@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(backported from commit b956575bed91ecfb136a8300742ecbbf451471ab)
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
(cherry picked from commit a4bb9409c548ece51ec246fc5113a32b8d130142)
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
---
 arch/x86/include/asm/mmu_context.h |   8 +-
 arch/x86/include/asm/tlbflush.h    |  24 ++++++
 arch/x86/mm/tlb.c                  | 160 +++++++++++++++++++++++++------------
 3 files changed, 136 insertions(+), 56 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index c120b5db178a..3c856a15b98e 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -126,13 +126,7 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
 	DEBUG_LOCKS_WARN_ON(preemptible());
 }
 
-static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
-{
-	int cpu = smp_processor_id();
-
-	if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
-		cpumask_clear_cpu(cpu, mm_cpumask(mm));
-}
+void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
 
 static inline int init_new_context(struct task_struct *tsk,
 				   struct mm_struct *mm)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index d23e61dc0640..6533da3036c9 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -82,6 +82,13 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 #define __flush_tlb_single(addr) __native_flush_tlb_single(addr)
 #endif
 
+/*
+ * If tlb_use_lazy_mode is true, then we try to avoid switching CR3 to point
+ * to init_mm when we switch to a kernel thread (e.g. the idle thread).  If
+ * it's false, then we immediately switch CR3 when entering a kernel thread.
+ */
+DECLARE_STATIC_KEY_TRUE(tlb_use_lazy_mode);
+
 /*
  * 6 because 6 should be plenty and struct tlb_state will fit in
  * two cache lines.
@@ -104,6 +111,23 @@ struct tlb_state {
 	u16 loaded_mm_asid;
 	u16 next_asid;
 
+	/*
+	 * We can be in one of several states:
+	 *
+	 *  - Actively using an mm.  Our CPU's bit will be set in
+	 *    mm_cpumask(loaded_mm) and is_lazy == false;
+	 *
+	 *  - Not using a real mm.  loaded_mm == &init_mm.  Our CPU's bit
+	 *    will not be set in mm_cpumask(&init_mm) and is_lazy == false.
+	 *
+	 *  - Lazily using a real mm.  loaded_mm != &init_mm, our bit
+	 *    is set in mm_cpumask(loaded_mm), but is_lazy == true.
+	 *    We're heuristically guessing that the CR3 load we
+	 *    skipped more than makes up for the overhead added by
+	 *    lazy mode.
+	 */
+	bool is_lazy;
+
 	/*
 	 * Access to this CR4 shadow and to H/W CR4 is protected by
 	 * disabling interrupts when modifying either one.
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 440400316c8a..b27aceaf7ed1 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -30,6 +30,8 @@
 
 atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
 
+DEFINE_STATIC_KEY_TRUE(tlb_use_lazy_mode);
+
 static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
 			    u16 *new_asid, bool *need_flush)
 {
@@ -80,7 +82,7 @@ void leave_mm(int cpu)
 		return;
 
 	/* Warn if we're not lazy. */
-	WARN_ON(cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm)));
+	WARN_ON(!this_cpu_read(cpu_tlbstate.is_lazy));
 
 	switch_mm(NULL, &init_mm, NULL);
 }
@@ -140,52 +142,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		__flush_tlb_all();
 	}
 #endif
+	this_cpu_write(cpu_tlbstate.is_lazy, false);
 
 	if (real_prev == next) {
 		VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
 			  next->context.ctx_id);
 
-		if (cpumask_test_cpu(cpu, mm_cpumask(next))) {
-			/*
-			 * There's nothing to do: we weren't lazy, and we
-			 * aren't changing our mm.  We don't need to flush
-			 * anything, nor do we need to update CR3, CR4, or
-			 * LDTR.
-			 */
-			return;
-		}
-
-		/* Resume remote flushes and then read tlb_gen. */
-		cpumask_set_cpu(cpu, mm_cpumask(next));
-		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
-
-		if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) <
-		    next_tlb_gen) {
-			/*
-			 * Ideally, we'd have a flush_tlb() variant that
-			 * takes the known CR3 value as input.  This would
-			 * be faster on Xen PV and on hypothetical CPUs
-			 * on which INVPCID is fast.
-			 */
-			this_cpu_write(cpu_tlbstate.ctxs[prev_asid].tlb_gen,
-				       next_tlb_gen);
-			write_cr3(build_cr3(next, prev_asid));
-
-			/*
-			 * This gets called via leave_mm() in the idle path
-			 * where RCU functions differently.  Tracing normally
-			 * uses RCU, so we have to call the tracepoint
-			 * specially here.
-			 */
-			trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
-						TLB_FLUSH_ALL);
-		}
-
 		/*
-		 * We just exited lazy mode, which means that CR4 and/or LDTR
-		 * may be stale.  (Changes to the required CR4 and LDTR states
-		 * are not reflected in tlb_gen.)
+		 * We don't currently support having a real mm loaded without
+		 * our cpu set in mm_cpumask().  We have all the bookkeeping
+		 * in place to figure out whether we would need to flush
+		 * if our cpu were cleared in mm_cpumask(), but we don't
+		 * currently use it.
 		 */
+		if (WARN_ON_ONCE(real_prev != &init_mm &&
+				 !cpumask_test_cpu(cpu, mm_cpumask(next))))
+			cpumask_set_cpu(cpu, mm_cpumask(next));
+
+		return;
 	} else {
 		u16 new_asid;
 		bool need_flush;
@@ -204,10 +178,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		}
 
 		/* Stop remote flushes for the previous mm */
-		if (cpumask_test_cpu(cpu, mm_cpumask(real_prev)))
-			cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
-
-		VM_WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
+		VM_WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(real_prev)) &&
+				real_prev != &init_mm);
+		cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
 
 		/*
 		 * Start remote flushes and then read tlb_gen.
@@ -237,6 +210,37 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	switch_ldt(real_prev, next);
 }
 
+/*
+ * enter_lazy_tlb() is a hint from the scheduler that we are entering a
+ * kernel thread or other context without an mm.  Acceptable implementations
+ * include doing nothing whatsoever, switching to init_mm, or various clever
+ * lazy tricks to try to minimize TLB flushes.
+ *
+ * The scheduler reserves the right to call enter_lazy_tlb() several times
+ * in a row.  It will notify us that we're going back to a real mm by
+ * calling switch_mm_irqs_off().
+ */
+void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
+{
+	if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
+		return;
+
+	if (static_branch_unlikely(&tlb_use_lazy_mode)) {
+		/*
+		 * There's a significant optimization that may be possible
+		 * here.  We have accurate enough TLB flush tracking that we
+		 * don't need to maintain coherence of TLB per se when we're
+		 * lazy.  We do, however, need to maintain coherence of
+		 * paging-structure caches.  We could, in principle, leave our
+		 * old mm loaded and only switch to init_mm when
+		 * tlb_remove_page() happens.
+		 */
+		this_cpu_write(cpu_tlbstate.is_lazy, true);
+	} else {
+		switch_mm(NULL, &init_mm, NULL);
+	}
+}
+
 /*
  * Call this when reinitializing a CPU.  It fixes the following potential
  * problems:
@@ -308,16 +312,20 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 	/* This code cannot presently handle being reentered. */
 	VM_WARN_ON(!irqs_disabled());
 
+	if (unlikely(loaded_mm == &init_mm))
+		return;
+
 	VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
 		   loaded_mm->context.ctx_id);
 
-	if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm))) {
+	if (this_cpu_read(cpu_tlbstate.is_lazy)) {
 		/*
-		 * We're in lazy mode -- don't flush.  We can get here on
-		 * remote flushes due to races and on local flushes if a
-		 * kernel thread coincidentally flushes the mm it's lazily
-		 * still using.
+		 * We're in lazy mode.  We need to at least flush our
+		 * paging-structure cache to avoid speculatively reading
+		 * garbage into our TLB.  Since switching to init_mm is barely
+		 * slower than a minimal flush, just switch to init_mm.
 		 */
+		switch_mm_irqs_off(NULL, &init_mm, NULL);
 		return;
 	}
 
@@ -616,3 +624,57 @@ static int __init create_tlb_single_page_flush_ceiling(void)
 	return 0;
 }
 late_initcall(create_tlb_single_page_flush_ceiling);
+
+static ssize_t tlblazy_read_file(struct file *file, char __user *user_buf,
+				 size_t count, loff_t *ppos)
+{
+	char buf[2];
+
+	buf[0] = static_branch_likely(&tlb_use_lazy_mode) ? '1' : '0';
+	buf[1] = '\n';
+
+	return simple_read_from_buffer(user_buf, count, ppos, buf, 2);
+}
+
+static ssize_t tlblazy_write_file(struct file *file,
+		 const char __user *user_buf, size_t count, loff_t *ppos)
+{
+	bool val;
+
+	if (kstrtobool_from_user(user_buf, count, &val))
+		return -EINVAL;
+
+	if (val)
+		static_branch_enable(&tlb_use_lazy_mode);
+	else
+		static_branch_disable(&tlb_use_lazy_mode);
+
+	return count;
+}
+
+static const struct file_operations fops_tlblazy = {
+	.read = tlblazy_read_file,
+	.write = tlblazy_write_file,
+	.llseek = default_llseek,
+};
+
+static int __init init_tlb_use_lazy_mode(void)
+{
+	if (boot_cpu_has(X86_FEATURE_PCID)) {
+		/*
+		 * Heuristic: with PCID on, switching to and from
+		 * init_mm is reasonably fast, but remote flush IPIs
+		 * as expensive as ever, so turn off lazy TLB mode.
+		 *
+		 * We can't do this in setup_pcid() because static keys
+		 * haven't been initialized yet, and it would blow up
+		 * badly.
+		 */
+		static_branch_disable(&tlb_use_lazy_mode);
+	}
+
+	debugfs_create_file("tlb_use_lazy_mode", S_IRUSR | S_IWUSR,
+			    arch_debugfs_dir, NULL, &fops_tlblazy);
+	return 0;
+}
+late_initcall(init_tlb_use_lazy_mode);
-- 
2.14.2

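The final hunk exposes the tlb_use_lazy_mode switch as a debugfs file under arch_debugfs_dir, which is /sys/kernel/debug/x86 on a typical setup. As a rough sketch of the benchmarking workflow mentioned in the commit message -- illustrative only, not part of the patch, and assuming a kernel built with this patch, debugfs mounted at /sys/kernel/debug, and root privileges -- the switch could be read and toggled from userspace like this:

/*
 * Illustrative userspace helper, not part of the patch: print the current
 * tlb_use_lazy_mode setting and force the eager (non-lazy) behaviour for a
 * benchmarking run.  Assumes debugfs is mounted at /sys/kernel/debug and
 * that the running kernel carries this patch; run as root.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/kernel/debug/x86/tlb_use_lazy_mode";
	char cur[2] = { 0 };
	int fd = open(path, O_RDWR);

	if (fd < 0) {
		perror(path);
		return 1;
	}

	/* The read handler above returns "0\n" or "1\n". */
	if (read(fd, cur, sizeof(cur)) > 0)
		printf("lazy TLB mode is currently %s\n",
		       cur[0] == '1' ? "on" : "off");

	/* The write handler parses the value with kstrtobool_from_user(). */
	if (write(fd, "0", 1) != 1)
		perror("write");

	close(fd);
	return 0;
}

Writing "1" back re-enables lazy mode; on PCID-capable CPUs the init routine above disables it by default, so toggling the file is the only way to compare both modes on such hardware.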