From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Dave Hansen <dave.hansen@linux.intel.com>
Date: Fri, 5 Jan 2018 09:44:36 -0800
Subject: [PATCH] x86/Documentation: Add PTI description
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

CVE-2017-5754

Add some details about how PTI works, what some of the downsides
are, and how to debug it when things go wrong.

Also document the kernel parameter: 'pti/nopti'.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andi Lutomirsky <luto@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180105174436.1BC6FA2B@viggo.jf.intel.com

(cherry picked from commit 01c9b17bf673b05bb401b76ec763e9730ccf1376)
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
(cherry picked from commit 1acf87c45b0170e717fc1b06a2d6fef47e07f79b)
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  21 ++-
 Documentation/x86/pti.txt                       | 186 ++++++++++++++++++++++++
 2 files changed, 200 insertions(+), 7 deletions(-)
 create mode 100644 Documentation/x86/pti.txt

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index b4d2edf316db..1a6ebc6cdf26 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2677,8 +2677,6 @@
 			steal time is computed, but won't influence scheduler
 			behaviour
 
-	nopti		[X86-64] Disable kernel page table isolation
-
 	nolapic		[X86-32,APIC] Do not enable or use the local APIC.
 
 	nolapic_timer	[X86-32,APIC] Do not use the local APIC timer.
@@ -3247,11 +3245,20 @@
 	pt.		[PARIDE]
 			See Documentation/blockdev/paride.txt.
 
-	pti=		[X86_64]
-			Control user/kernel address space isolation:
-			on - enable
-			off - disable
-			auto - default setting
+	pti=		[X86_64] Control Page Table Isolation of user and
+			kernel address spaces. Disabling this feature
+			removes hardening, but improves performance of
+			system calls and interrupts.
+
+			on   - unconditionally enable
+			off  - unconditionally disable
+			auto - kernel detects whether your CPU model is
+			       vulnerable to issues that PTI mitigates
+
+			Not specifying this option is equivalent to pti=auto.
+
+	nopti		[X86_64]
+			Equivalent to pti=off
 
 	pty.legacy_count=
 			[KNL] Number of legacy pty's. Overwrites compiled-in
diff --git a/Documentation/x86/pti.txt b/Documentation/x86/pti.txt
new file mode 100644
index 000000000000..d11eff61fc9a
--- /dev/null
+++ b/Documentation/x86/pti.txt
@@ -0,0 +1,186 @@
+Overview
+========
+
+Page Table Isolation (pti, previously known as KAISER[1]) is a
+countermeasure against attacks on the shared user/kernel address
+space such as the "Meltdown" approach[2].
+
+To mitigate this class of attacks, we create an independent set of
+page tables for use only when running userspace applications. When
+the kernel is entered via syscalls, interrupts or exceptions, the
+page tables are switched to the full "kernel" copy. When the system
+switches back to user mode, the user copy is used again.
+
+The userspace page tables contain only a minimal amount of kernel
+data: only what is needed to enter/exit the kernel such as the
+entry/exit functions themselves and the interrupt descriptor table
+(IDT). There are a few strictly unnecessary things that get mapped
+such as the first C function when entering an interrupt (see
+comments in pti.c).
+
+This approach helps to ensure that side-channel attacks leveraging
+the paging structures do not function when PTI is enabled. It can be
+enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time.
+Once enabled at compile-time, it can be disabled at boot with the
+'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt).
+
+Page Table Management
+=====================
+
+When PTI is enabled, the kernel manages two sets of page tables.
+The first set is very similar to the single set which is present in
+kernels without PTI. This includes a complete mapping of userspace
+that the kernel can use for things like copy_to_user().
+
+Although _complete_, the user portion of the kernel page tables is
+crippled by setting the NX bit in the top level. This ensures
+that any missed kernel->user CR3 switch will immediately crash
+userspace upon executing its first instruction.
+
+The userspace page tables map only the kernel data needed to enter
+and exit the kernel. This data is entirely contained in the 'struct
+cpu_entry_area' structure which is placed in the fixmap which gives
+each CPU's copy of the area a compile-time-fixed virtual address.
+
+For new userspace mappings, the kernel makes the entries in its
+page tables like normal. The only difference is when the kernel
+makes entries in the top (PGD) level. In addition to setting the
+entry in the main kernel PGD, a copy of the entry is made in the
+userspace page tables' PGD.
+
+This sharing at the PGD level also inherently shares all the lower
+layers of the page tables. This leaves a single, shared set of
+userspace page tables to manage. One PTE to lock, one set of
+accessed bits, dirty bits, etc...
+
+Overhead
+========
+
+Protection against side-channel attacks is important. But,
+this protection comes at a cost:
+
+1. Increased Memory Use
+  a. Each process now needs an order-1 PGD instead of order-0.
+     (Consumes an additional 4k per process).
+  b. The 'cpu_entry_area' structure must be 2MB in size and 2MB
+     aligned so that it can be mapped by setting a single PMD
+     entry. This consumes nearly 2MB of RAM once the kernel
+     is decompressed, but no space in the kernel image itself.
+
+2. Runtime Cost
+  a. CR3 manipulation to switch between the page table copies
+     must be done at interrupt, syscall, and exception entry
+     and exit (it can be skipped when the kernel is interrupted,
+     though.) Moves to CR3 are on the order of a hundred
+     cycles, and are required at every entry and exit.
+  b. A "trampoline" must be used for SYSCALL entry. This
+     trampoline depends on a smaller set of resources than the
+     non-PTI SYSCALL entry code, so requires mapping fewer
+     things into the userspace page tables. The downside is
+     that stacks must be switched at entry time.
+  c. Global pages are disabled for all kernel structures not
+     mapped into both kernel and userspace page tables. This
+     feature of the MMU allows different processes to share TLB
+     entries mapping the kernel. Losing the feature means more
+     TLB misses after a context switch. The actual loss of
+     performance is very small, however, never exceeding 1%.
+  d. Process Context IDentifiers (PCID) is a CPU feature that
+     allows us to skip flushing the entire TLB when switching page
+     tables by setting a special bit in CR3 when the page tables
+     are changed. This makes switching the page tables (at context
+     switch, or kernel entry/exit) cheaper. But, on systems with
+     PCID support, the context switch code must flush both the user
+     and kernel entries out of the TLB. The user PCID TLB flush is
+     deferred until the exit to userspace, minimizing the cost.
+     See intel.com/sdm for the gory PCID/INVPCID details.
+  e. The userspace page tables must be populated for each new
+     process. Even without PTI, the shared kernel mappings
+     are created by copying top-level (PGD) entries into each
+     new process. But, with PTI, there are now *two* kernel
+     mappings: one in the kernel page tables that maps everything
+     and one for the entry/exit structures. At fork(), we need to
+     copy both.
+  f. In addition to the fork()-time copying, there must also
+     be an update to the userspace PGD any time a set_pgd() is done
+     on a PGD used to map userspace. This ensures that the kernel
+     and userspace copies always map the same userspace
+     memory.
+  g. On systems without PCID support, each CR3 write flushes
+     the entire TLB. That means that each syscall, interrupt
+     or exception flushes the TLB.
+  h. INVPCID is a TLB-flushing instruction which allows flushing
+     of TLB entries for non-current PCIDs. Some systems support
+     PCIDs, but do not support INVPCID. On these systems, addresses
+     can only be flushed from the TLB for the current PCID. When
+     flushing a kernel address, we need to flush all PCIDs, so a
+     single kernel address flush will require a TLB-flushing CR3
+     write upon the next use of every PCID.
+
+Possible Future Work
+====================
+1. We can be more careful about not actually writing to CR3
+   unless its value is actually changed.
+2. Allow PTI to be enabled/disabled at runtime in addition to the
+   boot-time switching.
+
+Testing
+========
+
+To test stability of PTI, the following test procedure is recommended,
+ideally doing all of these in parallel:
+
+1. Set CONFIG_DEBUG_ENTRY=y
+2. Run several copies of all of the tools/testing/selftests/x86/ tests
+   (excluding MPX and protection_keys) in a loop on multiple CPUs for
+   several minutes. These tests frequently uncover corner cases in the
+   kernel entry code. In general, old kernels might cause these tests
+   themselves to crash, but they should never crash the kernel.
+3. Run the 'perf' tool in a mode (top or record) that generates many
+   frequent performance monitoring non-maskable interrupts (see "NMI"
+   in /proc/interrupts). This exercises the NMI entry/exit code which
+   is known to trigger bugs in code paths that did not expect to be
+   interrupted, including nested NMIs. Using "-c" boosts the rate of
+   NMIs, and using two -c with separate counters encourages nested NMIs
+   and less deterministic behavior.
+
+  while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done
+
+4. Launch a KVM virtual machine.
+5. Run 32-bit binaries on systems supporting the SYSCALL instruction.
+   This has been a lightly-tested code path and needs extra scrutiny.
+
+Debugging
+=========
+
+Bugs in PTI cause a few different signatures of crashes
+that are worth noting here.
+
+ * Failures of the selftests/x86 code. Usually a bug in one of the
+   more obscure corners of entry_64.S
+ * Crashes in early boot, especially around CPU bringup. Bugs
+   in the trampoline code or mappings cause these.
+ * Crashes at the first interrupt. Caused by bugs in entry_64.S,
+   like screwing up a page table switch. Also caused by
+   incorrectly mapping the IRQ handler entry code.
+ * Crashes at the first NMI. The NMI code is separate from main
+   interrupt handlers and can have bugs that do not affect
+   normal interrupts. Also caused by incorrectly mapping NMI
+   code. NMIs that interrupt the entry code must be very
+   careful and can be the cause of crashes that show up when
+   running perf.
+ * Kernel crashes at the first exit to userspace. entry_64.S
+   bugs, or failing to map some of the exit code.
+ * Crashes at first interrupt that interrupts userspace. The paths
+   in entry_64.S that return to userspace are sometimes separate
+   from the ones that return to the kernel.
+ * Double faults: overflowing the kernel stack because of page
+   faults upon page faults. Caused by touching non-pti-mapped
+   data in the entry code, or forgetting to switch to kernel
+   CR3 before calling into C functions which are not pti-mapped.
+ * Userspace segfaults early in boot, sometimes manifesting
+   as mount(8) failing to mount the rootfs. These have
+   tended to be TLB invalidation issues. Usually invalidating
+   the wrong PCID, or otherwise missing an invalidation.
+
+1. https://gruss.cc/files/kaiser.pdf
+2. https://meltdownattack.com/meltdown.pdf
-- 
2.14.2
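A rough sketch of exercising the 'pti=' / 'nopti' parameters documented in the
hunk above, assuming a Debian/Proxmox-style GRUB setup; the file name, the
dmesg message text and the sysfs path are assumptions to verify against the
running kernel, not something guaranteed by this patch.

  # 1) Append pti=off (or nopti) to the kernel command line, e.g. in
  #    /etc/default/grub: GRUB_CMDLINE_LINUX_DEFAULT="quiet pti=off"
  # 2) Regenerate the bootloader configuration and reboot:
  update-grub && reboot
  # 3) After reboot, check what the kernel actually decided:
  dmesg | grep -i 'page tables isolation'
  # On kernels that expose the vulnerabilities sysfs interface:
  cat /sys/devices/system/cpu/vulnerabilities/meltdown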
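To complement the perf loop from the Testing section above, a selftest stress
loop along these lines can run in parallel; the make invocation is an
assumption about the kernel selftest build system and may need adjusting for
the tree in use (the section also suggests excluding the MPX and
protection_keys tests, which this loop does not filter out).

  # Run from the top of a kernel source tree, ideally pinned to several CPUs
  # (e.g. via taskset) while the perf NMI loop runs elsewhere:
  while true; do
          make -C tools/testing/selftests TARGETS=x86 run_tests
  done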