[pve-kernel.git] / patches / kernel / 0251-x86-Documentation-Add-PTI-description.patch
1 From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
2 From: Dave Hansen <dave.hansen@linux.intel.com>
3 Date: Fri, 5 Jan 2018 09:44:36 -0800
4 Subject: [PATCH] x86/Documentation: Add PTI description
5 MIME-Version: 1.0
6 Content-Type: text/plain; charset=UTF-8
7 Content-Transfer-Encoding: 8bit
8
9 CVE-2017-5754
10
11 Add some details about how PTI works, what some of the downsides
12 are, and how to debug it when things go wrong.
13
14 Also document the kernel parameter: 'pti/nopti'.
15
16 Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
17 Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
18 Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
19 Reviewed-by: Kees Cook <keescook@chromium.org>
20 Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
21 Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
22 Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
23 Cc: Richard Fellner <richard.fellner@student.tugraz.at>
24 Cc: Andy Lutomirski <luto@kernel.org>
25 Cc: Linus Torvalds <torvalds@linux-foundation.org>
26 Cc: Hugh Dickins <hughd@google.com>
27 Cc: Andi Lutomirsky <luto@kernel.org>
28 Cc: stable@vger.kernel.org
29 Link: https://lkml.kernel.org/r/20180105174436.1BC6FA2B@viggo.jf.intel.com
30
31 (cherry picked from commit 01c9b17bf673b05bb401b76ec763e9730ccf1376)
32 Signed-off-by: Andy Whitcroft <apw@canonical.com>
33 Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
34 (cherry picked from commit 1acf87c45b0170e717fc1b06a2d6fef47e07f79b)
35 Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
36 ---
37 Documentation/admin-guide/kernel-parameters.txt | 21 ++-
38 Documentation/x86/pti.txt | 186 ++++++++++++++++++++++++
39 2 files changed, 200 insertions(+), 7 deletions(-)
40 create mode 100644 Documentation/x86/pti.txt
41
42 diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
43 index b4d2edf316db..1a6ebc6cdf26 100644
44 --- a/Documentation/admin-guide/kernel-parameters.txt
45 +++ b/Documentation/admin-guide/kernel-parameters.txt
46 @@ -2677,8 +2677,6 @@
47 steal time is computed, but won't influence scheduler
48 behaviour
49
50 - nopti [X86-64] Disable kernel page table isolation
51 -
52 nolapic [X86-32,APIC] Do not enable or use the local APIC.
53
54 nolapic_timer [X86-32,APIC] Do not use the local APIC timer.
55 @@ -3247,11 +3245,20 @@
56 pt. [PARIDE]
57 See Documentation/blockdev/paride.txt.
58
59 - pti= [X86_64]
60 - Control user/kernel address space isolation:
61 - on - enable
62 - off - disable
63 - auto - default setting
64 + pti= [X86_64] Control Page Table Isolation of user and
65 + kernel address spaces. Disabling this feature
66 + removes hardening, but improves performance of
67 + system calls and interrupts.
68 +
69 + on - unconditionally enable
70 + off - unconditionally disable
71 + auto - kernel detects whether your CPU model is
72 + vulnerable to issues that PTI mitigates
73 +
74 + Not specifying this option is equivalent to pti=auto.
75 +
76 + nopti [X86_64]
77 + Equivalent to pti=off
78
79 pty.legacy_count=
80 [KNL] Number of legacy pty's. Overwrites compiled-in
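
Note: the pti= and nopti semantics documented in the hunk above can be sketched in a few lines of Python. This is only an illustration, not the kernel's own parser. The function name pti_mode is hypothetical, and letting the last matching option win when several are given is an assumption that the patch text does not state.

```python
def pti_mode(cmdline: str) -> str:
    """Return 'on', 'off', or 'auto' for a kernel command line string.

    Hypothetical helper: it mirrors the documented semantics only.
    """
    mode = "auto"  # not specifying the option is equivalent to pti=auto
    for token in cmdline.split():
        if token == "nopti":
            mode = "off"  # documented as equivalent to pti=off
        elif token.startswith("pti="):
            value = token[len("pti="):]
            # Assumption: unknown values are ignored and the previous
            # setting is kept; last recognized option wins.
            if value in ("on", "off", "auto"):
                mode = value
    return mode
```

On a running system the string could be read from /proc/cmdline and passed in, for example pti_mode(open("/proc/cmdline").read()).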
81 diff --git a/Documentation/x86/pti.txt b/Documentation/x86/pti.txt
82 new file mode 100644
83 index 000000000000..d11eff61fc9a
84 --- /dev/null
85 +++ b/Documentation/x86/pti.txt
86 @@ -0,0 +1,186 @@
87 +Overview
88 +========
89 +
90 +Page Table Isolation (pti, previously known as KAISER[1]) is a
91 +countermeasure against attacks on the shared user/kernel address
92 +space such as the "Meltdown" approach[2].
93 +
94 +To mitigate this class of attacks, we create an independent set of
95 +page tables for use only when running userspace applications. When
96 +the kernel is entered via syscalls, interrupts or exceptions, the
97 +page tables are switched to the full "kernel" copy. When the system
98 +switches back to user mode, the user copy is used again.
99 +
100 +The userspace page tables contain only a minimal amount of kernel
101 +data: only what is needed to enter/exit the kernel such as the
102 +entry/exit functions themselves and the interrupt descriptor table
103 +(IDT). There are a few strictly unnecessary things that get mapped
104 +such as the first C function when entering an interrupt (see
105 +comments in pti.c).
106 +
107 +This approach helps to ensure that side-channel attacks leveraging
108 +the paging structures do not function when PTI is enabled. It can be
109 +enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time.
110 +Once enabled at compile-time, it can be disabled at boot with the
111 +'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt).
112 +
113 +Page Table Management
114 +=====================
115 +
116 +When PTI is enabled, the kernel manages two sets of page tables.
117 +The first set is very similar to the single set which is present in
118 +kernels without PTI. This includes a complete mapping of userspace
119 +that the kernel can use for things like copy_to_user().
120 +
121 +Although _complete_, the user portion of the kernel page tables is
122 +crippled by setting the NX bit in the top level. This ensures
123 +that any missed kernel->user CR3 switch will immediately crash
124 +userspace upon executing its first instruction.
125 +
126 +The userspace page tables map only the kernel data needed to enter
127 +and exit the kernel. This data is entirely contained in the 'struct
128 +cpu_entry_area' structure, which is placed in the fixmap. The fixmap gives
129 +each CPU's copy of the area a compile-time-fixed virtual address.
130 +
131 +For new userspace mappings, the kernel makes the entries in its
132 +page tables like normal. The only difference is when the kernel
133 +makes entries in the top (PGD) level. In addition to setting the
134 +entry in the main kernel PGD, a copy of the entry is made in the
135 +userspace page tables' PGD.
136 +
137 +This sharing at the PGD level also inherently shares all the lower
138 +layers of the page tables. This leaves a single, shared set of
139 +userspace page tables to manage. One PTE to lock, one set of
140 +accessed bits, dirty bits, etc...
141 +
142 +Overhead
143 +========
144 +
145 +Protection against side-channel attacks is important. But,
146 +this protection comes at a cost:
147 +
148 +1. Increased Memory Use
149 + a. Each process now needs an order-1 PGD instead of order-0.
150 + (Consumes an additional 4k per process).
151 + b. The 'cpu_entry_area' structure must be 2MB in size and 2MB
152 + aligned so that it can be mapped by setting a single PMD
153 + entry. This consumes nearly 2MB of RAM once the kernel
154 + is decompressed, but no space in the kernel image itself.
155 +
156 +2. Runtime Cost
157 + a. CR3 manipulation to switch between the page table copies
158 + must be done at interrupt, syscall, and exception entry
159 + and exit (it can be skipped when the kernel is interrupted,
160 + though.) Moves to CR3 are on the order of a hundred
161 + cycles, and are required at every entry and exit.
162 + b. A "trampoline" must be used for SYSCALL entry. This
163 + trampoline depends on a smaller set of resources than the
164 + non-PTI SYSCALL entry code, so requires mapping fewer
165 + things into the userspace page tables. The downside is
166 + that stacks must be switched at entry time.
167 + c. Global pages are disabled for all kernel structures not
168 + mapped into both kernel and userspace page tables. This
169 + feature of the MMU allows different processes to share TLB
170 + entries mapping the kernel. Losing the feature means more
171 + TLB misses after a context switch. The actual loss of
172 + performance is very small, however, never exceeding 1%.
173 + d. Process Context IDentifiers (PCID) is a CPU feature that
174 + allows us to skip flushing the entire TLB when switching page
175 + tables by setting a special bit in CR3 when the page tables
176 + are changed. This makes switching the page tables (at context
177 + switch, or kernel entry/exit) cheaper. But, on systems with
178 + PCID support, the context switch code must flush both the user
179 + and kernel entries out of the TLB. The user PCID TLB flush is
180 + deferred until the exit to userspace, minimizing the cost.
181 + See intel.com/sdm for the gory PCID/INVPCID details.
182 + e. The userspace page tables must be populated for each new
183 + process. Even without PTI, the shared kernel mappings
184 + are created by copying top-level (PGD) entries into each
185 + new process. But, with PTI, there are now *two* kernel
186 + mappings: one in the kernel page tables that maps everything
187 + and one for the entry/exit structures. At fork(), we need to
188 + copy both.
189 + f. In addition to the fork()-time copying, there must also
190 + be an update to the userspace PGD any time a set_pgd() is done
191 + on a PGD used to map userspace. This ensures that the kernel
192 + and userspace copies always map the same userspace
193 + memory.
194 + g. On systems without PCID support, each CR3 write flushes
195 + the entire TLB. That means that each syscall, interrupt
196 + or exception flushes the TLB.
197 + h. INVPCID is a TLB-flushing instruction which allows flushing
198 + of TLB entries for non-current PCIDs. Some systems support
199 + PCIDs, but do not support INVPCID. On these systems, addresses
200 + can only be flushed from the TLB for the current PCID. When
201 + flushing a kernel address, we need to flush all PCIDs, so a
202 + single kernel address flush will require a TLB-flushing CR3
203 + write upon the next use of every PCID.
204 +
205 +Possible Future Work
206 +====================
207 +1. We can be more careful about not actually writing to CR3
208 + unless its value is actually changed.
209 +2. Allow PTI to be enabled/disabled at runtime in addition to the
210 + boot-time switching.
211 +
212 +Testing
213 +=======
214 +
215 +To test stability of PTI, the following test procedure is recommended,
216 +ideally doing all of these in parallel:
217 +
218 +1. Set CONFIG_DEBUG_ENTRY=y
219 +2. Run several copies of all of the tools/testing/selftests/x86/ tests
220 + (excluding MPX and protection_keys) in a loop on multiple CPUs for
221 + several minutes. These tests frequently uncover corner cases in the
222 + kernel entry code. In general, old kernels might cause these tests
223 + themselves to crash, but they should never crash the kernel.
224 +3. Run the 'perf' tool in a mode (top or record) that generates many
225 + frequent performance monitoring non-maskable interrupts (see "NMI"
226 + in /proc/interrupts). This exercises the NMI entry/exit code which
227 + is known to trigger bugs in code paths that did not expect to be
228 + interrupted, including nested NMIs. Using "-c" boosts the rate of
229 + NMIs, and using two -c with separate counters encourages nested NMIs
230 + and less deterministic behavior.
231 +
232 + while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done
233 +
234 +4. Launch a KVM virtual machine.
235 +5. Run 32-bit binaries on systems supporting the SYSCALL instruction.
236 + This has been a lightly-tested code path and needs extra scrutiny.
237 +
238 +Debugging
239 +=========
240 +
241 +Bugs in PTI cause a few different signatures of crashes
242 +that are worth noting here.
243 +
244 + * Failures of the selftests/x86 code. Usually a bug in one of the
245 + more obscure corners of entry_64.S
246 + * Crashes in early boot, especially around CPU bringup. Bugs
247 + in the trampoline code or mappings cause these.
248 + * Crashes at the first interrupt. Caused by bugs in entry_64.S,
249 + like screwing up a page table switch. Also caused by
250 + incorrectly mapping the IRQ handler entry code.
251 + * Crashes at the first NMI. The NMI code is separate from main
252 + interrupt handlers and can have bugs that do not affect
253 + normal interrupts. Also caused by incorrectly mapping NMI
254 + code. NMIs that interrupt the entry code must be very
255 + careful and can be the cause of crashes that show up when
256 + running perf.
257 + * Kernel crashes at the first exit to userspace. entry_64.S
258 + bugs, or failing to map some of the exit code.
259 + * Crashes at first interrupt that interrupts userspace. The paths
260 + in entry_64.S that return to userspace are sometimes separate
261 + from the ones that return to the kernel.
262 + * Double faults: overflowing the kernel stack because of page
263 + faults upon page faults. Caused by touching non-pti-mapped
264 + data in the entry code, or forgetting to switch to kernel
265 + CR3 before calling into C functions which are not pti-mapped.
266 + * Userspace segfaults early in boot, sometimes manifesting
267 + as mount(8) failing to mount the rootfs. These have
268 + tended to be TLB invalidation issues. Usually invalidating
269 + the wrong PCID, or otherwise missing an invalidation.
270 +
271 +1. https://gruss.cc/files/kaiser.pdf
272 +2. https://meltdownattack.com/meltdown.pdf
273 --
274 2.14.2
275
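
Note: the figures in the Overhead section of pti.txt lend themselves to a rough back-of-the-envelope estimate. The sketch below is not a measurement tool, and the function names are hypothetical. It assumes 4 KiB pages (the "4k" in the text), counts cpu_entry_area once at the full 2 MB as an upper bound (the text says "nearly 2MB"), and uses 100 cycles per CR3 write as a round stand-in for "on the order of a hundred cycles".

```python
PAGE_SIZE = 4096  # assumes 4 KiB pages, matching "4k" in the text
CPU_ENTRY_AREA_BYTES = 2 * 1024 * 1024  # upper bound; text says "nearly 2MB"


def pti_extra_memory_bytes(num_processes: int) -> int:
    """Upper-bound estimate of the extra RAM PTI consumes."""
    # An order-1 PGD is two pages instead of one, so each process
    # costs one additional page.
    return num_processes * PAGE_SIZE + CPU_ENTRY_AREA_BYTES


def cr3_switch_cycles(user_entries: int, cycles_per_write: int = 100) -> int:
    """Rough cycle cost of CR3 writes for kernel entries from userspace."""
    # One CR3 write on entry and one on exit. Entries that interrupt
    # the kernel itself can skip the switch and are not counted here.
    return 2 * user_entries * cycles_per_write
```

Under these assumptions, 1000 processes add at most pti_extra_memory_bytes(1000) = 6193152 bytes, and one million syscalls spend roughly cr3_switch_cycles(1_000_000) = 200000000 cycles on CR3 writes.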