From d1ffadc67e2eee2d5f8626dca6646e70e3aa9d76 Mon Sep 17 00:00:00 2001
From: Andy Lutomirski <luto@kernel.org>
Date: Mon, 9 Oct 2017 09:50:49 -0700
Subject: [PATCH 045/241] x86/mm: Flush more aggressively in lazy TLB mode
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

CVE-2017-5754

Since commit:

  94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")

x86's lazy TLB mode has been all the way lazy: when running a kernel thread
(including the idle thread), the kernel keeps using the last user mm's
page tables without attempting to maintain user TLB coherence at all.

From a pure semantic perspective, this is fine -- kernel threads won't
attempt to access user pages, so having stale TLB entries doesn't matter.

Unfortunately, I forgot about a subtlety.  By skipping TLB flushes,
we also allow any paging-structure caches that may exist on the CPU
to become incoherent.  This means that we can have a
paging-structure cache entry that references a freed page table, and
the CPU is within its rights to do a speculative page walk starting
at the freed page table.

I can imagine this causing two different problems:

 - A speculative page walk starting from a bogus page table could read
   IO addresses.  I haven't seen any reports of this causing problems.

 - A speculative page walk that involves a bogus page table can install
   garbage in the TLB.  Such garbage would always be at a user VA, but
   some AMD CPUs have logic that triggers a machine check when it notices
   these bogus entries.  I've seen a couple of reports of this.

Boris further explains the failure mode:

> It is actually more of an optimization which assumes that paging-structure
> entries are in WB DRAM:
>
> "TlbCacheDis: cacheable memory disable. Read-write. 0=Enables
> performance optimization that assumes PML4, PDP, PDE, and PTE entries
> are in cacheable WB-DRAM; memory type checks may be bypassed, and
> addresses outside of WB-DRAM may result in undefined behavior or NB
> protocol errors. 1=Disables performance optimization and allows PML4,
> PDP, PDE and PTE entries to be in any memory type. Operating systems
> that maintain page tables in memory types other than WB-DRAM must set
> TlbCacheDis to insure proper operation."
>
> The MCE generated is an NB protocol error to signal that
>
> "Link: A specific coherent-only packet from a CPU was issued to an
> IO link. This may be caused by software which addresses page table
> structures in a memory type other than cacheable WB-DRAM without
> properly configuring MSRC001_0015[TlbCacheDis]. This may occur, for
> example, when page table structure addresses are above top of memory. In
> such cases, the NB will generate an MCE if it sees a mismatch between
> the memory operation generated by the core and the link type."
>
> I'm assuming coherent-only packets don't go out on IO links, thus the
> error.

To fix this, reinstate TLB coherence in lazy mode.  With this patch
applied, we do it in one of two ways (see the condensed sketch after
this list):

 - If we have PCID, we simply switch back to init_mm's page tables
   when we enter a kernel thread -- this seems to be quite cheap
   except for the cost of serializing the CPU.

 - If we don't have PCID, then we set a flag and switch to init_mm
   the first time we would otherwise need to flush the TLB.

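A condensed sketch of the resulting policy (the full version, with its
longer comments, is the enter_lazy_tlb() added in the diff below):

    void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
    {
        if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
            return;        /* already on init_mm; nothing to defer */

        if (static_branch_unlikely(&tlb_use_lazy_mode))
            /* Default without PCID: defer; the first needed flush switches us. */
            this_cpu_write(cpu_tlbstate.is_lazy, true);
        else
            /* Default with PCID: a CR3 switch is cheap; do it now. */
            switch_mm(NULL, &init_mm, NULL);
    }
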
The /sys/kernel/debug/x86/tlb_use_lazy_mode debug switch can be changed
to override the default mode for benchmarking.

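For benchmarking, the switch can be flipped from userspace once debugfs
is mounted.  A minimal illustrative helper -- not part of this patch --
might look like this (the write path accepts the usual kstrtobool
inputs such as "0" and "1", and the file is root-only):

    /* Toggle tlb_use_lazy_mode via debugfs; run as root. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char *path = "/sys/kernel/debug/x86/tlb_use_lazy_mode";
        const char *val = (argc > 1) ? argv[1] : "0";    /* "0" or "1" */
        int fd = open(path, O_WRONLY);

        if (fd < 0) {
            perror("open");    /* is debugfs mounted? are we root? */
            return 1;
        }
        if (write(fd, val, 1) != 1)
            perror("write");
        close(fd);
        return 0;
    }
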
In theory, we could optimize this better by only flushing the TLB in
lazy CPUs when a page table is freed.  Doing that would require
auditing the mm code to make sure that all page table freeing goes
through tlb_remove_page() as well as reworking some data structures
to implement the improved flush logic.

Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Reported-by: Adam Borowski <kilobyte@angband.pl>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Johannes Hirte <johannes.hirte@datenkhaos.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Roman Kagan <rkagan@virtuozzo.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")
Link: http://lkml.kernel.org/r/20171009170231.fkpraqokz6e4zeco@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(backported from commit b956575bed91ecfb136a8300742ecbbf451471ab)
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
(cherry picked from commit a4bb9409c548ece51ec246fc5113a32b8d130142)
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
---
 arch/x86/include/asm/mmu_context.h |   8 +-
 arch/x86/include/asm/tlbflush.h    |  24 ++++++
 arch/x86/mm/tlb.c                  | 160 +++++++++++++++++++++++++------------
 3 files changed, 136 insertions(+), 56 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index c120b5db178a..3c856a15b98e 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -126,13 +126,7 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
 	DEBUG_LOCKS_WARN_ON(preemptible());
 }

-static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
-{
-	int cpu = smp_processor_id();
-
-	if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
-		cpumask_clear_cpu(cpu, mm_cpumask(mm));
-}
+void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);

 static inline int init_new_context(struct task_struct *tsk,
 				   struct mm_struct *mm)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index d23e61dc0640..6533da3036c9 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -82,6 +82,13 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 #define __flush_tlb_single(addr) __native_flush_tlb_single(addr)
 #endif

+/*
+ * If tlb_use_lazy_mode is true, then we try to avoid switching CR3 to point
+ * to init_mm when we switch to a kernel thread (e.g. the idle thread).  If
+ * it's false, then we immediately switch CR3 when entering a kernel thread.
+ */
+DECLARE_STATIC_KEY_TRUE(tlb_use_lazy_mode);
+
 /*
  * 6 because 6 should be plenty and struct tlb_state will fit in
  * two cache lines.
@@ -104,6 +111,23 @@ struct tlb_state {
 	u16 loaded_mm_asid;
 	u16 next_asid;

+	/*
+	 * We can be in one of several states:
+	 *
+	 *  - Actively using an mm.  Our CPU's bit will be set in
+	 *    mm_cpumask(loaded_mm) and is_lazy == false;
+	 *
+	 *  - Not using a real mm.  loaded_mm == &init_mm.  Our CPU's bit
+	 *    will not be set in mm_cpumask(&init_mm) and is_lazy == false.
+	 *
+	 *  - Lazily using a real mm.  loaded_mm != &init_mm, our bit
+	 *    is set in mm_cpumask(loaded_mm), but is_lazy == true.
+	 *    We're heuristically guessing that the CR3 load we
+	 *    skipped more than makes up for the overhead added by
+	 *    lazy mode.
+	 */
+	bool is_lazy;
+
 	/*
 	 * Access to this CR4 shadow and to H/W CR4 is protected by
 	 * disabling interrupts when modifying either one.
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 440400316c8a..b27aceaf7ed1 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -30,6 +30,8 @@

 atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);

+DEFINE_STATIC_KEY_TRUE(tlb_use_lazy_mode);
+
 static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
 			    u16 *new_asid, bool *need_flush)
 {
@@ -80,7 +82,7 @@ void leave_mm(int cpu)
 		return;

 	/* Warn if we're not lazy. */
-	WARN_ON(cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm)));
+	WARN_ON(!this_cpu_read(cpu_tlbstate.is_lazy));

 	switch_mm(NULL, &init_mm, NULL);
 }
@@ -140,52 +142,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		__flush_tlb_all();
 	}
 #endif
+	this_cpu_write(cpu_tlbstate.is_lazy, false);

 	if (real_prev == next) {
 		VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
 			  next->context.ctx_id);

-		if (cpumask_test_cpu(cpu, mm_cpumask(next))) {
-			/*
-			 * There's nothing to do: we weren't lazy, and we
-			 * aren't changing our mm.  We don't need to flush
-			 * anything, nor do we need to update CR3, CR4, or
-			 * LDTR.
-			 */
-			return;
-		}
-
-		/* Resume remote flushes and then read tlb_gen. */
-		cpumask_set_cpu(cpu, mm_cpumask(next));
-		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
-
-		if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) <
-		    next_tlb_gen) {
-			/*
-			 * Ideally, we'd have a flush_tlb() variant that
-			 * takes the known CR3 value as input.  This would
-			 * be faster on Xen PV and on hypothetical CPUs
-			 * on which INVPCID is fast.
-			 */
-			this_cpu_write(cpu_tlbstate.ctxs[prev_asid].tlb_gen,
-				       next_tlb_gen);
-			write_cr3(build_cr3(next, prev_asid));
-
-			/*
-			 * This gets called via leave_mm() in the idle path
-			 * where RCU functions differently.  Tracing normally
-			 * uses RCU, so we have to call the tracepoint
-			 * specially here.
-			 */
-			trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
-						TLB_FLUSH_ALL);
-		}
-
 		/*
-		 * We just exited lazy mode, which means that CR4 and/or LDTR
-		 * may be stale.  (Changes to the required CR4 and LDTR states
-		 * are not reflected in tlb_gen.)
+		 * We don't currently support having a real mm loaded without
+		 * our cpu set in mm_cpumask().  We have all the bookkeeping
+		 * in place to figure out whether we would need to flush
+		 * if our cpu were cleared in mm_cpumask(), but we don't
+		 * currently use it.
 		 */
+		if (WARN_ON_ONCE(real_prev != &init_mm &&
+				 !cpumask_test_cpu(cpu, mm_cpumask(next))))
+			cpumask_set_cpu(cpu, mm_cpumask(next));
+
+		return;
 	} else {
 		u16 new_asid;
 		bool need_flush;
@@ -204,10 +178,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		}

 		/* Stop remote flushes for the previous mm */
-		if (cpumask_test_cpu(cpu, mm_cpumask(real_prev)))
-			cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
-
-		VM_WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
+		VM_WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(real_prev)) &&
+				real_prev != &init_mm);
+		cpumask_clear_cpu(cpu, mm_cpumask(real_prev));

 		/*
 		 * Start remote flushes and then read tlb_gen.
@@ -237,6 +210,37 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	switch_ldt(real_prev, next);
 }

+/*
+ * enter_lazy_tlb() is a hint from the scheduler that we are entering a
+ * kernel thread or other context without an mm.  Acceptable implementations
+ * include doing nothing whatsoever, switching to init_mm, or various clever
+ * lazy tricks to try to minimize TLB flushes.
+ *
+ * The scheduler reserves the right to call enter_lazy_tlb() several times
+ * in a row.  It will notify us that we're going back to a real mm by
+ * calling switch_mm_irqs_off().
+ */
+void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
+{
+	if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
+		return;
+
+	if (static_branch_unlikely(&tlb_use_lazy_mode)) {
+		/*
+		 * There's a significant optimization that may be possible
+		 * here.  We have accurate enough TLB flush tracking that we
+		 * don't need to maintain coherence of TLB per se when we're
+		 * lazy.  We do, however, need to maintain coherence of
+		 * paging-structure caches.  We could, in principle, leave our
+		 * old mm loaded and only switch to init_mm when
+		 * tlb_remove_page() happens.
+		 */
+		this_cpu_write(cpu_tlbstate.is_lazy, true);
+	} else {
+		switch_mm(NULL, &init_mm, NULL);
+	}
+}
+
 /*
  * Call this when reinitializing a CPU.  It fixes the following potential
  * problems:
@@ -308,16 +312,20 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 	/* This code cannot presently handle being reentered. */
 	VM_WARN_ON(!irqs_disabled());

+	if (unlikely(loaded_mm == &init_mm))
+		return;
+
 	VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
 		   loaded_mm->context.ctx_id);

-	if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm))) {
+	if (this_cpu_read(cpu_tlbstate.is_lazy)) {
 		/*
-		 * We're in lazy mode -- don't flush.  We can get here on
-		 * remote flushes due to races and on local flushes if a
-		 * kernel thread coincidentally flushes the mm it's lazily
-		 * still using.
+		 * We're in lazy mode.  We need to at least flush our
+		 * paging-structure cache to avoid speculatively reading
+		 * garbage into our TLB.  Since switching to init_mm is barely
+		 * slower than a minimal flush, just switch to init_mm.
 		 */
+		switch_mm_irqs_off(NULL, &init_mm, NULL);
 		return;
 	}

@@ -616,3 +624,57 @@ static int __init create_tlb_single_page_flush_ceiling(void)
 	return 0;
 }
 late_initcall(create_tlb_single_page_flush_ceiling);
+
+static ssize_t tlblazy_read_file(struct file *file, char __user *user_buf,
+				 size_t count, loff_t *ppos)
+{
+	char buf[2];
+
+	buf[0] = static_branch_likely(&tlb_use_lazy_mode) ? '1' : '0';
+	buf[1] = '\n';
+
+	return simple_read_from_buffer(user_buf, count, ppos, buf, 2);
+}
+
+static ssize_t tlblazy_write_file(struct file *file,
+				  const char __user *user_buf,
+				  size_t count, loff_t *ppos)
+{
+	bool val;
+
+	if (kstrtobool_from_user(user_buf, count, &val))
+		return -EINVAL;
+
+	if (val)
+		static_branch_enable(&tlb_use_lazy_mode);
+	else
+		static_branch_disable(&tlb_use_lazy_mode);
+
+	return count;
+}
+
+static const struct file_operations fops_tlblazy = {
+	.read = tlblazy_read_file,
+	.write = tlblazy_write_file,
+	.llseek = default_llseek,
+};
+
+static int __init init_tlb_use_lazy_mode(void)
+{
+	if (boot_cpu_has(X86_FEATURE_PCID)) {
+		/*
+		 * Heuristic: with PCID on, switching to and from
+		 * init_mm is reasonably fast, but remote flush IPIs
+		 * are as expensive as ever, so turn off lazy TLB mode.
+		 *
+		 * We can't do this in setup_pcid() because static keys
+		 * haven't been initialized yet, and it would blow up
+		 * badly.
+		 */
+		static_branch_disable(&tlb_use_lazy_mode);
+	}
+
+	debugfs_create_file("tlb_use_lazy_mode", S_IRUSR | S_IWUSR,
+			    arch_debugfs_dir, NULL, &fops_tlblazy);
+	return 0;
+}
+late_initcall(init_tlb_use_lazy_mode);
--
2.14.2