From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Andy Lutomirski <luto@kernel.org>
Date: Mon, 9 Oct 2017 09:50:49 -0700
Subject: [PATCH] x86/mm: Flush more aggressively in lazy TLB mode
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

CVE-2017-5754

Since commit:

  94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")

x86's lazy TLB mode has been all the way lazy: when running a kernel thread
(including the idle thread), the kernel keeps using the last user mm's
page tables without attempting to maintain user TLB coherence at all.

From a pure semantic perspective, this is fine -- kernel threads won't
attempt to access user pages, so having stale TLB entries doesn't matter.

Unfortunately, I forgot about a subtlety. By skipping TLB flushes,
we also allow any paging-structure caches that may exist on the CPU
to become incoherent. This means that we can have a
paging-structure cache entry that references a freed page table, and
the CPU is within its rights to do a speculative page walk starting
at the freed page table.

I can imagine this causing two different problems:

 - A speculative page walk starting from a bogus page table could read
   IO addresses. I haven't seen any reports of this causing problems.

 - A speculative page walk that involves a bogus page table can install
   garbage in the TLB. Such garbage would always be at a user VA, but
   some AMD CPUs have logic that triggers a machine check when it notices
   these bogus entries. I've seen a couple reports of this.

Boris further explains the failure mode:

> It is actually more of an optimization which assumes that paging-structure
> entries are in WB DRAM:
>
> "TlbCacheDis: cacheable memory disable. Read-write. 0=Enables
> performance optimization that assumes PML4, PDP, PDE, and PTE entries
> are in cacheable WB-DRAM; memory type checks may be bypassed, and
> addresses outside of WB-DRAM may result in undefined behavior or NB
> protocol errors. 1=Disables performance optimization and allows PML4,
> PDP, PDE and PTE entries to be in any memory type. Operating systems
> that maintain page tables in memory types other than WB- DRAM must set
> TlbCacheDis to insure proper operation."
>
> The MCE generated is an NB protocol error to signal that
>
> "Link: A specific coherent-only packet from a CPU was issued to an
> IO link. This may be caused by software which addresses page table
> structures in a memory type other than cacheable WB-DRAM without
> properly configuring MSRC001_0015[TlbCacheDis]. This may occur, for
> example, when page table structure addresses are above top of memory. In
> such cases, the NB will generate an MCE if it sees a mismatch between
> the memory operation generated by the core and the link type."
>
> I'm assuming coherent-only packets don't go out on IO links, thus the
> error.

To fix this, reinstate TLB coherence in lazy mode. With this patch
applied, we do it in one of two ways (sketched in code after this list):

 - If we have PCID, we simply switch back to init_mm's page tables
   when we enter a kernel thread -- this seems to be quite cheap
   except for the cost of serializing the CPU.

 - If we don't have PCID, then we set a flag and switch to init_mm
   the first time we would otherwise need to flush the TLB.

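The following is a condensed, non-authoritative sketch of how those two
cases map onto the code this patch adds; it only restates the new
enter_lazy_tlb() and flush_tlb_func_common() hunks in arch/x86/mm/tlb.c
below, with the surrounding parameters and unchanged logic elided:

  void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
  {
          if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
                  return;                 /* already running on init_mm */

          if (static_branch_unlikely(&tlb_use_lazy_mode))
                  /* no PCID (the default): keep the old CR3, just mark lazy */
                  this_cpu_write(cpu_tlbstate.is_lazy, true);
          else
                  /* PCID: reloading CR3 is cheap, leave the user mm now */
                  switch_mm(NULL, &init_mm, NULL);
  }

  /* In flush_tlb_func_common(): the first flush that reaches a lazy CPU
   * switches it to init_mm instead of flushing the stale user mm. */
  if (this_cpu_read(cpu_tlbstate.is_lazy)) {
          switch_mm_irqs_off(NULL, &init_mm, NULL);
          return;
  }
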
The /sys/kernel/debug/x86/tlb_use_lazy_mode debug switch can be changed
to override the default mode for benchmarking.
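From userspace that amounts to writing '0' or '1' to the debugfs file
created below. A minimal, hypothetical toggle helper (illustrative only,
assuming debugfs is mounted at /sys/kernel/debug and root privileges):

  /* toggle_tlb_lazy.c -- not part of this patch */
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
          int fd = open("/sys/kernel/debug/x86/tlb_use_lazy_mode", O_WRONLY);

          if (fd < 0) {
                  perror("open");
                  return 1;
          }
          /* "0": always switch to init_mm; "1": use the lazy flag. */
          if (write(fd, "0", 1) != 1)
                  perror("write");
          close(fd);
          return 0;
  }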

In theory, we could optimize this better by only flushing the TLB in
lazy CPUs when a page table is freed. Doing that would require
auditing the mm code to make sure that all page table freeing goes
through tlb_remove_page() as well as reworking some data structures
to implement the improved flush logic.

Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Reported-by: Adam Borowski <kilobyte@angband.pl>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Johannes Hirte <johannes.hirte@datenkhaos.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Roman Kagan <rkagan@virtuozzo.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")
Link: http://lkml.kernel.org/r/20171009170231.fkpraqokz6e4zeco@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(backported from commit b956575bed91ecfb136a8300742ecbbf451471ab)
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
(cherry picked from commit a4bb9409c548ece51ec246fc5113a32b8d130142)
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
---
 arch/x86/include/asm/mmu_context.h |   8 +-
 arch/x86/include/asm/tlbflush.h    |  24 ++++++
 arch/x86/mm/tlb.c                  | 160 +++++++++++++++++++++++++------------
 3 files changed, 136 insertions(+), 56 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index c120b5db178a..3c856a15b98e 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -126,13 +126,7 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
         DEBUG_LOCKS_WARN_ON(preemptible());
 }
 
-static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
-{
-        int cpu = smp_processor_id();
-
-        if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
-                cpumask_clear_cpu(cpu, mm_cpumask(mm));
-}
+void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
 
 static inline int init_new_context(struct task_struct *tsk,
                                    struct mm_struct *mm)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index d23e61dc0640..6533da3036c9 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -82,6 +82,13 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 #define __flush_tlb_single(addr) __native_flush_tlb_single(addr)
 #endif
 
+/*
+ * If tlb_use_lazy_mode is true, then we try to avoid switching CR3 to point
+ * to init_mm when we switch to a kernel thread (e.g. the idle thread). If
+ * it's false, then we immediately switch CR3 when entering a kernel thread.
+ */
+DECLARE_STATIC_KEY_TRUE(tlb_use_lazy_mode);
+
 /*
  * 6 because 6 should be plenty and struct tlb_state will fit in
  * two cache lines.
@@ -104,6 +111,23 @@ struct tlb_state {
         u16 loaded_mm_asid;
         u16 next_asid;
 
+        /*
+         * We can be in one of several states:
+         *
+         *  - Actively using an mm. Our CPU's bit will be set in
+         *    mm_cpumask(loaded_mm) and is_lazy == false;
+         *
+         *  - Not using a real mm. loaded_mm == &init_mm. Our CPU's bit
+         *    will not be set in mm_cpumask(&init_mm) and is_lazy == false.
+         *
+         *  - Lazily using a real mm. loaded_mm != &init_mm, our bit
+         *    is set in mm_cpumask(loaded_mm), but is_lazy == true.
+         *    We're heuristically guessing that the CR3 load we
+         *    skipped more than makes up for the overhead added by
+         *    lazy mode.
+         */
+        bool is_lazy;
+
         /*
          * Access to this CR4 shadow and to H/W CR4 is protected by
          * disabling interrupts when modifying either one.
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 440400316c8a..b27aceaf7ed1 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -30,6 +30,8 @@
 
 atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
 
+DEFINE_STATIC_KEY_TRUE(tlb_use_lazy_mode);
+
 static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
                             u16 *new_asid, bool *need_flush)
 {
@@ -80,7 +82,7 @@ void leave_mm(int cpu)
                 return;
 
         /* Warn if we're not lazy. */
-        WARN_ON(cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm)));
+        WARN_ON(!this_cpu_read(cpu_tlbstate.is_lazy));
 
         switch_mm(NULL, &init_mm, NULL);
 }
@@ -140,52 +142,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
                 __flush_tlb_all();
         }
 #endif
+        this_cpu_write(cpu_tlbstate.is_lazy, false);
 
         if (real_prev == next) {
                 VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
                           next->context.ctx_id);
 
-                if (cpumask_test_cpu(cpu, mm_cpumask(next))) {
-                        /*
-                         * There's nothing to do: we weren't lazy, and we
-                         * aren't changing our mm. We don't need to flush
-                         * anything, nor do we need to update CR3, CR4, or
-                         * LDTR.
-                         */
-                        return;
-                }
-
-                /* Resume remote flushes and then read tlb_gen. */
-                cpumask_set_cpu(cpu, mm_cpumask(next));
-                next_tlb_gen = atomic64_read(&next->context.tlb_gen);
-
-                if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) <
-                    next_tlb_gen) {
-                        /*
-                         * Ideally, we'd have a flush_tlb() variant that
-                         * takes the known CR3 value as input. This would
-                         * be faster on Xen PV and on hypothetical CPUs
-                         * on which INVPCID is fast.
-                         */
-                        this_cpu_write(cpu_tlbstate.ctxs[prev_asid].tlb_gen,
-                                       next_tlb_gen);
-                        write_cr3(build_cr3(next, prev_asid));
-
-                        /*
-                         * This gets called via leave_mm() in the idle path
-                         * where RCU functions differently. Tracing normally
-                         * uses RCU, so we have to call the tracepoint
-                         * specially here.
-                         */
-                        trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
-                                                TLB_FLUSH_ALL);
-                }
-
                 /*
-                 * We just exited lazy mode, which means that CR4 and/or LDTR
-                 * may be stale. (Changes to the required CR4 and LDTR states
-                 * are not reflected in tlb_gen.)
+                 * We don't currently support having a real mm loaded without
+                 * our cpu set in mm_cpumask(). We have all the bookkeeping
+                 * in place to figure out whether we would need to flush
+                 * if our cpu were cleared in mm_cpumask(), but we don't
+                 * currently use it.
                  */
+                if (WARN_ON_ONCE(real_prev != &init_mm &&
+                                 !cpumask_test_cpu(cpu, mm_cpumask(next))))
+                        cpumask_set_cpu(cpu, mm_cpumask(next));
+
+                return;
         } else {
                 u16 new_asid;
                 bool need_flush;
@@ -204,10 +178,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
         }
 
         /* Stop remote flushes for the previous mm */
-        if (cpumask_test_cpu(cpu, mm_cpumask(real_prev)))
-                cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
-
-        VM_WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
+        VM_WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(real_prev)) &&
+                        real_prev != &init_mm);
+        cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
 
         /*
          * Start remote flushes and then read tlb_gen.
@@ -237,6 +210,37 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
         switch_ldt(real_prev, next);
 }
 
+/*
+ * enter_lazy_tlb() is a hint from the scheduler that we are entering a
+ * kernel thread or other context without an mm. Acceptable implementations
+ * include doing nothing whatsoever, switching to init_mm, or various clever
+ * lazy tricks to try to minimize TLB flushes.
+ *
+ * The scheduler reserves the right to call enter_lazy_tlb() several times
+ * in a row. It will notify us that we're going back to a real mm by
+ * calling switch_mm_irqs_off().
+ */
+void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
+{
+        if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
+                return;
+
+        if (static_branch_unlikely(&tlb_use_lazy_mode)) {
+                /*
+                 * There's a significant optimization that may be possible
+                 * here. We have accurate enough TLB flush tracking that we
+                 * don't need to maintain coherence of TLB per se when we're
+                 * lazy. We do, however, need to maintain coherence of
+                 * paging-structure caches. We could, in principle, leave our
+                 * old mm loaded and only switch to init_mm when
+                 * tlb_remove_page() happens.
+                 */
+                this_cpu_write(cpu_tlbstate.is_lazy, true);
+        } else {
+                switch_mm(NULL, &init_mm, NULL);
+        }
+}
+
 /*
  * Call this when reinitializing a CPU. It fixes the following potential
  * problems:
@@ -308,16 +312,20 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
         /* This code cannot presently handle being reentered. */
         VM_WARN_ON(!irqs_disabled());
 
+        if (unlikely(loaded_mm == &init_mm))
+                return;
+
         VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
                    loaded_mm->context.ctx_id);
 
-        if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm))) {
+        if (this_cpu_read(cpu_tlbstate.is_lazy)) {
                 /*
-                 * We're in lazy mode -- don't flush. We can get here on
-                 * remote flushes due to races and on local flushes if a
-                 * kernel thread coincidentally flushes the mm it's lazily
-                 * still using.
+                 * We're in lazy mode. We need to at least flush our
+                 * paging-structure cache to avoid speculatively reading
+                 * garbage into our TLB. Since switching to init_mm is barely
+                 * slower than a minimal flush, just switch to init_mm.
                  */
+                switch_mm_irqs_off(NULL, &init_mm, NULL);
                 return;
         }
 
@@ -616,3 +624,57 @@ static int __init create_tlb_single_page_flush_ceiling(void)
         return 0;
 }
 late_initcall(create_tlb_single_page_flush_ceiling);
+
+static ssize_t tlblazy_read_file(struct file *file, char __user *user_buf,
+                                 size_t count, loff_t *ppos)
+{
+        char buf[2];
+
+        buf[0] = static_branch_likely(&tlb_use_lazy_mode) ? '1' : '0';
+        buf[1] = '\n';
+
+        return simple_read_from_buffer(user_buf, count, ppos, buf, 2);
+}
+
+static ssize_t tlblazy_write_file(struct file *file,
+                 const char __user *user_buf, size_t count, loff_t *ppos)
+{
+        bool val;
+
+        if (kstrtobool_from_user(user_buf, count, &val))
+                return -EINVAL;
+
+        if (val)
+                static_branch_enable(&tlb_use_lazy_mode);
+        else
+                static_branch_disable(&tlb_use_lazy_mode);
+
+        return count;
+}
+
+static const struct file_operations fops_tlblazy = {
+        .read = tlblazy_read_file,
+        .write = tlblazy_write_file,
+        .llseek = default_llseek,
+};
+
+static int __init init_tlb_use_lazy_mode(void)
+{
+        if (boot_cpu_has(X86_FEATURE_PCID)) {
+                /*
+                 * Heuristic: with PCID on, switching to and from
+                 * init_mm is reasonably fast, but remote flush IPIs
+                 * as expensive as ever, so turn off lazy TLB mode.
+                 *
+                 * We can't do this in setup_pcid() because static keys
+                 * haven't been initialized yet, and it would blow up
+                 * badly.
+                 */
+                static_branch_disable(&tlb_use_lazy_mode);
+        }
+
+        debugfs_create_file("tlb_use_lazy_mode", S_IRUSR | S_IWUSR,
+                            arch_debugfs_dir, NULL, &fops_tlblazy);
+        return 0;
+}
+late_initcall(init_tlb_use_lazy_mode);
-- 
2.14.2
