From d1ffadc67e2eee2d5f8626dca6646e70e3aa9d76 Mon Sep 17 00:00:00 2001
From: Andy Lutomirski <luto@kernel.org>
Date: Mon, 9 Oct 2017 09:50:49 -0700
Subject: [PATCH 045/233] x86/mm: Flush more aggressively in lazy TLB mode
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

CVE-2017-5754

Since commit:

  94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")

x86's lazy TLB mode has been all the way lazy: when running a kernel thread
(including the idle thread), the kernel keeps using the last user mm's
page tables without attempting to maintain user TLB coherence at all.

From a pure semantic perspective, this is fine -- kernel threads won't
attempt to access user pages, so having stale TLB entries doesn't matter.

Unfortunately, I forgot about a subtlety.  By skipping TLB flushes,
we also allow any paging-structure caches that may exist on the CPU
to become incoherent.  This means that we can have a
paging-structure cache entry that references a freed page table, and
the CPU is within its rights to do a speculative page walk starting
at the freed page table.

I can imagine this causing two different problems:

 - A speculative page walk starting from a bogus page table could read
   IO addresses.  I haven't seen any reports of this causing problems.

 - A speculative page walk that involves a bogus page table can install
   garbage in the TLB.  Such garbage would always be at a user VA, but
   some AMD CPUs have logic that triggers a machine check when it notices
   these bogus entries.  I've seen a couple reports of this.

Boris further explains the failure mode:

> It is actually more of an optimization which assumes that paging-structure
> entries are in WB DRAM:
>
> "TlbCacheDis: cacheable memory disable. Read-write. 0=Enables
> performance optimization that assumes PML4, PDP, PDE, and PTE entries
> are in cacheable WB-DRAM; memory type checks may be bypassed, and
> addresses outside of WB-DRAM may result in undefined behavior or NB
> protocol errors. 1=Disables performance optimization and allows PML4,
> PDP, PDE and PTE entries to be in any memory type. Operating systems
> that maintain page tables in memory types other than WB- DRAM must set
> TlbCacheDis to insure proper operation."
>
> The MCE generated is an NB protocol error to signal that
>
> "Link: A specific coherent-only packet from a CPU was issued to an
> IO link. This may be caused by software which addresses page table
> structures in a memory type other than cacheable WB-DRAM without
> properly configuring MSRC001_0015[TlbCacheDis]. This may occur, for
> example, when page table structure addresses are above top of memory. In
> such cases, the NB will generate an MCE if it sees a mismatch between
> the memory operation generated by the core and the link type."
>
> I'm assuming coherent-only packets don't go out on IO links, thus the
> error.

To fix this, reinstate TLB coherence in lazy mode.  With this patch
applied, we do it in one of two ways:

 - If we have PCID, we simply switch back to init_mm's page tables
   when we enter a kernel thread -- this seems to be quite cheap
   except for the cost of serializing the CPU.

 - If we don't have PCID, then we set a flag and switch to init_mm
   the first time we would otherwise need to flush the TLB.

The /sys/kernel/debug/x86/tlb_use_lazy_mode debug switch can be changed
to override the default mode for benchmarking.

In theory, we could optimize this better by only flushing the TLB in
lazy CPUs when a page table is freed.  Doing that would require
auditing the mm code to make sure that all page table freeing goes
through tlb_remove_page() as well as reworking some data structures
to implement the improved flush logic.

Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Reported-by: Adam Borowski <kilobyte@angband.pl>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Johannes Hirte <johannes.hirte@datenkhaos.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Roman Kagan <rkagan@virtuozzo.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")
Link: http://lkml.kernel.org/r/20171009170231.fkpraqokz6e4zeco@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(backported from commit b956575bed91ecfb136a8300742ecbbf451471ab)
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
(cherry picked from commit a4bb9409c548ece51ec246fc5113a32b8d130142)
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
---
 arch/x86/include/asm/mmu_context.h |   8 +-
 arch/x86/include/asm/tlbflush.h    |  24 ++++++
 arch/x86/mm/tlb.c                  | 160 +++++++++++++++++++++++++------------
 3 files changed, 136 insertions(+), 56 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index c120b5db178a..3c856a15b98e 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -126,13 +126,7 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
 	DEBUG_LOCKS_WARN_ON(preemptible());
 }
 
-static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
-{
-	int cpu = smp_processor_id();
-
-	if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
-		cpumask_clear_cpu(cpu, mm_cpumask(mm));
-}
+void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
 
 static inline int init_new_context(struct task_struct *tsk,
 				   struct mm_struct *mm)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index d23e61dc0640..6533da3036c9 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -82,6 +82,13 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 #define __flush_tlb_single(addr) __native_flush_tlb_single(addr)
 #endif
 
+/*
+ * If tlb_use_lazy_mode is true, then we try to avoid switching CR3 to point
+ * to init_mm when we switch to a kernel thread (e.g. the idle thread).  If
+ * it's false, then we immediately switch CR3 when entering a kernel thread.
+ */
+DECLARE_STATIC_KEY_TRUE(tlb_use_lazy_mode);
+
 /*
  * 6 because 6 should be plenty and struct tlb_state will fit in
  * two cache lines.
@@ -104,6 +111,23 @@ struct tlb_state {
 	u16 loaded_mm_asid;
 	u16 next_asid;
 
+	/*
+	 * We can be in one of several states:
+	 *
+	 *  - Actively using an mm.  Our CPU's bit will be set in
+	 *    mm_cpumask(loaded_mm) and is_lazy == false;
+	 *
+	 *  - Not using a real mm.  loaded_mm == &init_mm.  Our CPU's bit
+	 *    will not be set in mm_cpumask(&init_mm) and is_lazy == false.
+	 *
+	 *  - Lazily using a real mm.  loaded_mm != &init_mm, our bit
+	 *    is set in mm_cpumask(loaded_mm), but is_lazy == true.
+	 *    We're heuristically guessing that the CR3 load we
+	 *    skipped more than makes up for the overhead added by
+	 *    lazy mode.
+	 */
+	bool is_lazy;
+
 	/*
 	 * Access to this CR4 shadow and to H/W CR4 is protected by
 	 * disabling interrupts when modifying either one.
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 440400316c8a..b27aceaf7ed1 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -30,6 +30,8 @@
 
 atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
 
+DEFINE_STATIC_KEY_TRUE(tlb_use_lazy_mode);
+
 static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
 			    u16 *new_asid, bool *need_flush)
 {
@@ -80,7 +82,7 @@ void leave_mm(int cpu)
 		return;
 
 	/* Warn if we're not lazy. */
-	WARN_ON(cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm)));
+	WARN_ON(!this_cpu_read(cpu_tlbstate.is_lazy));
 
 	switch_mm(NULL, &init_mm, NULL);
 }
@@ -140,52 +142,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		__flush_tlb_all();
 	}
 #endif
+	this_cpu_write(cpu_tlbstate.is_lazy, false);
 
 	if (real_prev == next) {
 		VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
 			  next->context.ctx_id);
 
-		if (cpumask_test_cpu(cpu, mm_cpumask(next))) {
-			/*
-			 * There's nothing to do: we weren't lazy, and we
-			 * aren't changing our mm.  We don't need to flush
-			 * anything, nor do we need to update CR3, CR4, or
-			 * LDTR.
-			 */
-			return;
-		}
-
-		/* Resume remote flushes and then read tlb_gen. */
-		cpumask_set_cpu(cpu, mm_cpumask(next));
-		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
-
-		if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) <
-		    next_tlb_gen) {
-			/*
-			 * Ideally, we'd have a flush_tlb() variant that
-			 * takes the known CR3 value as input.  This would
-			 * be faster on Xen PV and on hypothetical CPUs
-			 * on which INVPCID is fast.
-			 */
-			this_cpu_write(cpu_tlbstate.ctxs[prev_asid].tlb_gen,
-				       next_tlb_gen);
-			write_cr3(build_cr3(next, prev_asid));
-
-			/*
-			 * This gets called via leave_mm() in the idle path
-			 * where RCU functions differently.  Tracing normally
-			 * uses RCU, so we have to call the tracepoint
-			 * specially here.
-			 */
-			trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
-						TLB_FLUSH_ALL);
-		}
-
 		/*
-		 * We just exited lazy mode, which means that CR4 and/or LDTR
-		 * may be stale.  (Changes to the required CR4 and LDTR states
-		 * are not reflected in tlb_gen.)
+		 * We don't currently support having a real mm loaded without
+		 * our cpu set in mm_cpumask().  We have all the bookkeeping
+		 * in place to figure out whether we would need to flush
+		 * if our cpu were cleared in mm_cpumask(), but we don't
+		 * currently use it.
 		 */
+		if (WARN_ON_ONCE(real_prev != &init_mm &&
+				 !cpumask_test_cpu(cpu, mm_cpumask(next))))
+			cpumask_set_cpu(cpu, mm_cpumask(next));
+
+		return;
 	} else {
 		u16 new_asid;
 		bool need_flush;
@@ -204,10 +178,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		}
 
 		/* Stop remote flushes for the previous mm */
-		if (cpumask_test_cpu(cpu, mm_cpumask(real_prev)))
-			cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
-
-		VM_WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
+		VM_WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(real_prev)) &&
+				real_prev != &init_mm);
+		cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
 
 		/*
 		 * Start remote flushes and then read tlb_gen.
@@ -237,6 +210,37 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	switch_ldt(real_prev, next);
 }
 
+/*
+ * enter_lazy_tlb() is a hint from the scheduler that we are entering a
+ * kernel thread or other context without an mm.  Acceptable implementations
+ * include doing nothing whatsoever, switching to init_mm, or various clever
+ * lazy tricks to try to minimize TLB flushes.
+ *
+ * The scheduler reserves the right to call enter_lazy_tlb() several times
+ * in a row.  It will notify us that we're going back to a real mm by
+ * calling switch_mm_irqs_off().
+ */
+void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
+{
+	if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
+		return;
+
+	if (static_branch_unlikely(&tlb_use_lazy_mode)) {
+		/*
+		 * There's a significant optimization that may be possible
+		 * here.  We have accurate enough TLB flush tracking that we
+		 * don't need to maintain coherence of TLB per se when we're
+		 * lazy.  We do, however, need to maintain coherence of
+		 * paging-structure caches.  We could, in principle, leave our
+		 * old mm loaded and only switch to init_mm when
+		 * tlb_remove_page() happens.
+		 */
+		this_cpu_write(cpu_tlbstate.is_lazy, true);
+	} else {
+		switch_mm(NULL, &init_mm, NULL);
+	}
+}
+
 /*
  * Call this when reinitializing a CPU.  It fixes the following potential
  * problems:
@@ -308,16 +312,20 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 	/* This code cannot presently handle being reentered. */
 	VM_WARN_ON(!irqs_disabled());
 
+	if (unlikely(loaded_mm == &init_mm))
+		return;
+
 	VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
 		   loaded_mm->context.ctx_id);
 
-	if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm))) {
+	if (this_cpu_read(cpu_tlbstate.is_lazy)) {
 		/*
-		 * We're in lazy mode -- don't flush.  We can get here on
-		 * remote flushes due to races and on local flushes if a
-		 * kernel thread coincidentally flushes the mm it's lazily
-		 * still using.
+		 * We're in lazy mode.  We need to at least flush our
+		 * paging-structure cache to avoid speculatively reading
+		 * garbage into our TLB.  Since switching to init_mm is barely
+		 * slower than a minimal flush, just switch to init_mm.
 		 */
+		switch_mm_irqs_off(NULL, &init_mm, NULL);
 		return;
 	}
 
@@ -616,3 +624,57 @@ static int __init create_tlb_single_page_flush_ceiling(void)
 	return 0;
 }
 late_initcall(create_tlb_single_page_flush_ceiling);
+
+static ssize_t tlblazy_read_file(struct file *file, char __user *user_buf,
+				 size_t count, loff_t *ppos)
+{
+	char buf[2];
+
+	buf[0] = static_branch_likely(&tlb_use_lazy_mode) ? '1' : '0';
+	buf[1] = '\n';
+
+	return simple_read_from_buffer(user_buf, count, ppos, buf, 2);
+}
+
+static ssize_t tlblazy_write_file(struct file *file,
+		 const char __user *user_buf, size_t count, loff_t *ppos)
+{
+	bool val;
+
+	if (kstrtobool_from_user(user_buf, count, &val))
+		return -EINVAL;
+
+	if (val)
+		static_branch_enable(&tlb_use_lazy_mode);
+	else
+		static_branch_disable(&tlb_use_lazy_mode);
+
+	return count;
+}
+
+static const struct file_operations fops_tlblazy = {
+	.read = tlblazy_read_file,
+	.write = tlblazy_write_file,
+	.llseek = default_llseek,
+};
+
+static int __init init_tlb_use_lazy_mode(void)
+{
+	if (boot_cpu_has(X86_FEATURE_PCID)) {
+		/*
+		 * Heuristic: with PCID on, switching to and from
+		 * init_mm is reasonably fast, but remote flush IPIs
+		 * as expensive as ever, so turn off lazy TLB mode.
+		 *
+		 * We can't do this in setup_pcid() because static keys
+		 * haven't been initialized yet, and it would blow up
+		 * badly.
+		 */
+		static_branch_disable(&tlb_use_lazy_mode);
+	}
+
+	debugfs_create_file("tlb_use_lazy_mode", S_IRUSR | S_IWUSR,
+			    arch_debugfs_dir, NULL, &fops_tlblazy);
+	return 0;
+}
+late_initcall(init_tlb_use_lazy_mode);
-- 
2.14.2

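The final hunk exposes the tlb_use_lazy_mode switch as a debugfs file under arch_debugfs_dir, which is /sys/kernel/debug/x86 on a typical setup. As a rough sketch of the benchmarking workflow mentioned in the commit message -- illustrative only, not part of the patch, and assuming a kernel built with this patch, debugfs mounted at /sys/kernel/debug, and root privileges -- the switch could be read and toggled from userspace like this:

/*
 * Illustrative userspace helper, not part of the patch: print the current
 * tlb_use_lazy_mode setting and force the eager (non-lazy) behaviour for a
 * benchmarking run.  Assumes debugfs is mounted at /sys/kernel/debug and
 * that the running kernel carries this patch; run as root.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/kernel/debug/x86/tlb_use_lazy_mode";
	char cur[2] = { 0 };
	int fd = open(path, O_RDWR);

	if (fd < 0) {
		perror(path);
		return 1;
	}

	/* The read handler above returns "0\n" or "1\n". */
	if (read(fd, cur, sizeof(cur)) > 0)
		printf("lazy TLB mode is currently %s\n",
		       cur[0] == '1' ? "on" : "off");

	/* The write handler parses the value with kstrtobool_from_user(). */
	if (write(fd, "0", 1) != 1)
		perror("write");

	close(fd);
	return 0;
}

Writing "1" back re-enables lazy mode; on PCID-capable CPUs the init routine above disables it by default, so toggling the file is the only way to compare both modes on such hardware.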