Commit | Line | Data |
---|---|---|
035dbe67 FG |
1 | From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001 |
2 | From: Dave Hansen <dave.hansen@linux.intel.com> | |
3 | Date: Fri, 5 Jan 2018 09:44:36 -0800 | |
4 | Subject: [PATCH] x86/Documentation: Add PTI description | |
5 | MIME-Version: 1.0 | |
6 | Content-Type: text/plain; charset=UTF-8 | |
7 | Content-Transfer-Encoding: 8bit | |
8 | ||
9 | CVE-2017-5754 | |
10 | ||
11 | Add some details about how PTI works, what some of the downsides | |
12 | are, and how to debug it when things go wrong. | |
13 | ||
14 | Also document the kernel parameter: 'pti/nopti'. | |
15 | ||
16 | Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> | |
17 | Signed-off-by: Thomas Gleixner <tglx@linutronix.de> | |
18 | Reviewed-by: Randy Dunlap <rdunlap@infradead.org> | |
19 | Reviewed-by: Kees Cook <keescook@chromium.org> | |
20 | Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at> | |
21 | Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at> | |
22 | Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at> | |
23 | Cc: Richard Fellner <richard.fellner@student.tugraz.at> | |
24 | Cc: Andy Lutomirski <luto@kernel.org> | |
25 | Cc: Linus Torvalds <torvalds@linux-foundation.org> | |
26 | Cc: Hugh Dickins <hughd@google.com> | |
27 | Cc: Andi Lutomirsky <luto@kernel.org> | |
28 | Cc: stable@vger.kernel.org | |
29 | Link: https://lkml.kernel.org/r/20180105174436.1BC6FA2B@viggo.jf.intel.com | |
30 | ||
31 | (cherry picked from commit 01c9b17bf673b05bb401b76ec763e9730ccf1376) | |
32 | Signed-off-by: Andy Whitcroft <apw@canonical.com> | |
33 | Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> | |
34 | (cherry picked from commit 1acf87c45b0170e717fc1b06a2d6fef47e07f79b) | |
35 | Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com> | |
36 | --- | |
37 | Documentation/admin-guide/kernel-parameters.txt | 21 ++- | |
38 | Documentation/x86/pti.txt | 186 ++++++++++++++++++++++++ | |
39 | 2 files changed, 200 insertions(+), 7 deletions(-) | |
40 | create mode 100644 Documentation/x86/pti.txt | |
41 | ||
42 | diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt | |
43 | index b4d2edf316db..1a6ebc6cdf26 100644 | |
44 | --- a/Documentation/admin-guide/kernel-parameters.txt | |
45 | +++ b/Documentation/admin-guide/kernel-parameters.txt | |
46 | @@ -2677,8 +2677,6 @@ | |
47 | steal time is computed, but won't influence scheduler | |
48 | behaviour | |
49 | ||
50 | - nopti [X86-64] Disable kernel page table isolation | |
51 | - | |
52 | nolapic [X86-32,APIC] Do not enable or use the local APIC. | |
53 | ||
54 | nolapic_timer [X86-32,APIC] Do not use the local APIC timer. | |
55 | @@ -3247,11 +3245,20 @@ | |
56 | pt. [PARIDE] | |
57 | See Documentation/blockdev/paride.txt. | |
58 | ||
59 | - pti= [X86_64] | |
60 | - Control user/kernel address space isolation: | |
61 | - on - enable | |
62 | - off - disable | |
63 | - auto - default setting | |
64 | + pti= [X86_64] Control Page Table Isolation of user and | |
65 | + kernel address spaces. Disabling this feature | |
66 | + removes hardening, but improves performance of | |
67 | + system calls and interrupts. | |
68 | + | |
69 | + on - unconditionally enable | |
70 | + off - unconditionally disable | |
71 | + auto - kernel detects whether your CPU model is | |
72 | + vulnerable to issues that PTI mitigates | |
73 | + | |
74 | + Not specifying this option is equivalent to pti=auto. | |
75 | + | |
76 | + nopti [X86_64] | |
77 | + Equivalent to pti=off | |
78 | ||
79 | pty.legacy_count= | |
80 | [KNL] Number of legacy pty's. Overwrites compiled-in | |
81 | diff --git a/Documentation/x86/pti.txt b/Documentation/x86/pti.txt | |
82 | new file mode 100644 | |
83 | index 000000000000..d11eff61fc9a | |
84 | --- /dev/null | |
85 | +++ b/Documentation/x86/pti.txt | |
86 | @@ -0,0 +1,186 @@ | |
87 | +Overview | |
88 | +======== | |
89 | + | |
90 | +Page Table Isolation (pti, previously known as KAISER[1]) is a | |
91 | +countermeasure against attacks on the shared user/kernel address | |
92 | +space such as the "Meltdown" approach[2]. | |
93 | + | |
94 | +To mitigate this class of attacks, we create an independent set of | |
95 | +page tables for use only when running userspace applications. When | |
96 | +the kernel is entered via syscalls, interrupts or exceptions, the | |
97 | +page tables are switched to the full "kernel" copy. When the system | |
98 | +switches back to user mode, the user copy is used again. | |
99 | + | |
100 | +The userspace page tables contain only a minimal amount of kernel | |
101 | +data: only what is needed to enter/exit the kernel such as the | |
102 | +entry/exit functions themselves and the interrupt descriptor table | |
103 | +(IDT). There are a few strictly unnecessary things that get mapped | |
104 | +such as the first C function when entering an interrupt (see | |
105 | +comments in pti.c). | |
106 | + | |
107 | +This approach helps to ensure that side-channel attacks leveraging | |
108 | +the paging structures do not function when PTI is enabled. It can be | |
109 | +enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time. | |
110 | +Once enabled at compile-time, it can be disabled at boot with the | |
111 | +'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt). | |
112 | + | |
113 | +Page Table Management | |
114 | +===================== | |
115 | + | |
116 | +When PTI is enabled, the kernel manages two sets of page tables. | |
117 | +The first set is very similar to the single set which is present in | |
118 | +kernels without PTI. This includes a complete mapping of userspace | |
119 | +that the kernel can use for things like copy_to_user(). | |
120 | + | |
121 | +Although _complete_, the user portion of the kernel page tables is | |
122 | +crippled by setting the NX bit in the top level. This ensures | |
123 | +that any missed kernel->user CR3 switch will immediately crash | |
124 | +userspace upon executing its first instruction. | |
125 | + | |
126 | +The userspace page tables map only the kernel data needed to enter | |
127 | +and exit the kernel. This data is entirely contained in the 'struct | |
128 | +cpu_entry_area' structure which is placed in the fixmap which gives | |
129 | +each CPU's copy of the area a compile-time-fixed virtual address. | |
130 | + | |
131 | +For new userspace mappings, the kernel makes the entries in its | |
132 | +page tables like normal. The only difference is when the kernel | |
133 | +makes entries in the top (PGD) level. In addition to setting the | |
134 | +entry in the main kernel PGD, a copy of the entry is made in the | |
135 | +userspace page tables' PGD. | |
136 | + | |
137 | +This sharing at the PGD level also inherently shares all the lower | |
138 | +layers of the page tables. This leaves a single, shared set of | |
139 | +userspace page tables to manage. One PTE to lock, one set of | |
140 | +accessed bits, dirty bits, etc... | |
141 | + | |
142 | +Overhead | |
143 | +======== | |
144 | + | |
145 | +Protection against side-channel attacks is important. But, | |
146 | +this protection comes at a cost: | |
147 | + | |
148 | +1. Increased Memory Use | |
149 | + a. Each process now needs an order-1 PGD instead of order-0. | |
150 | + (Consumes an additional 4k per process). | |
151 | + b. The 'cpu_entry_area' structure must be 2MB in size and 2MB | |
152 | + aligned so that it can be mapped by setting a single PMD | |
153 | + entry. This consumes nearly 2MB of RAM once the kernel | |
154 | + is decompressed, but no space in the kernel image itself. | |
155 | + | |
156 | +2. Runtime Cost | |
157 | + a. CR3 manipulation to switch between the page table copies | |
158 | + must be done at interrupt, syscall, and exception entry | |
159 | + and exit (it can be skipped when the kernel is interrupted, | |
160 | + though.) Moves to CR3 are on the order of a hundred | |
161 | + cycles, and are required at every entry and exit. | |
162 | + b. A "trampoline" must be used for SYSCALL entry. This | |
163 | + trampoline depends on a smaller set of resources than the | |
164 | + non-PTI SYSCALL entry code, so requires mapping fewer | |
165 | + things into the userspace page tables. The downside is | |
166 | + that stacks must be switched at entry time. | |
167 | + c. Global pages are disabled for all kernel structures not | |
168 | + mapped into both kernel and userspace page tables. This | |
169 | + feature of the MMU allows different processes to share TLB | |
170 | + entries mapping the kernel. Losing the feature means more | |
171 | + TLB misses after a context switch. The actual loss of | |
172 | + performance is very small, however, never exceeding 1%. | |
173 | + d. Process Context IDentifiers (PCID) is a CPU feature that | |
174 | + allows us to skip flushing the entire TLB when switching page | |
175 | + tables by setting a special bit in CR3 when the page tables | |
176 | + are changed. This makes switching the page tables (at context | |
177 | + switch, or kernel entry/exit) cheaper. But, on systems with | |
178 | + PCID support, the context switch code must flush both the user | |
179 | + and kernel entries out of the TLB. The user PCID TLB flush is | |
180 | + deferred until the exit to userspace, minimizing the cost. | |
181 | + See intel.com/sdm for the gory PCID/INVPCID details. | |
182 | + e. The userspace page tables must be populated for each new | |
183 | + process. Even without PTI, the shared kernel mappings | |
184 | + are created by copying top-level (PGD) entries into each | |
185 | + new process. But, with PTI, there are now *two* kernel | |
186 | + mappings: one in the kernel page tables that maps everything | |
187 | + and one for the entry/exit structures. At fork(), we need to | |
188 | + copy both. | |
189 | + f. In addition to the fork()-time copying, there must also | |
190 | + be an update to the userspace PGD any time a set_pgd() is done | |
191 | + on a PGD used to map userspace. This ensures that the kernel | |
192 | + and userspace copies always map the same userspace | |
193 | + memory. | |
194 | + g. On systems without PCID support, each CR3 write flushes | |
195 | + the entire TLB. That means that each syscall, interrupt | |
196 | + or exception flushes the TLB. | |
197 | + h. INVPCID is a TLB-flushing instruction which allows flushing | |
198 | + of TLB entries for non-current PCIDs. Some systems support | |
199 | + PCIDs, but do not support INVPCID. On these systems, addresses | |
200 | + can only be flushed from the TLB for the current PCID. When | |
201 | + flushing a kernel address, we need to flush all PCIDs, so a | |
202 | + single kernel address flush will require a TLB-flushing CR3 | |
203 | + write upon the next use of every PCID. | |
204 | + | |
205 | +Possible Future Work | |
206 | +==================== | |
207 | +1. We can be more careful about not actually writing to CR3 | |
208 | + unless its value is actually changed. | |
209 | +2. Allow PTI to be enabled/disabled at runtime in addition to the | |
210 | + boot-time switching. | |
211 | + | |
212 | +Testing | |
213 | +======= | |
214 | + | |
215 | +To test stability of PTI, the following test procedure is recommended, | |
216 | +ideally doing all of these in parallel: | |
217 | + | |
218 | +1. Set CONFIG_DEBUG_ENTRY=y | |
219 | +2. Run several copies of all of the tools/testing/selftests/x86/ tests | |
220 | + (excluding MPX and protection_keys) in a loop on multiple CPUs for | |
221 | + several minutes. These tests frequently uncover corner cases in the | |
222 | + kernel entry code. In general, old kernels might cause these tests | |
223 | + themselves to crash, but they should never crash the kernel. | |
224 | +3. Run the 'perf' tool in a mode (top or record) that generates many | |
225 | + frequent performance monitoring non-maskable interrupts (see "NMI" | |
226 | + in /proc/interrupts). This exercises the NMI entry/exit code which | |
227 | + is known to trigger bugs in code paths that did not expect to be | |
228 | + interrupted, including nested NMIs. Using "-c" boosts the rate of | |
229 | + NMIs, and using two -c with separate counters encourages nested NMIs | |
230 | + and less deterministic behavior. | |
231 | + | |
232 | + while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done | |
233 | + | |
234 | +4. Launch a KVM virtual machine. | |
235 | +5. Run 32-bit binaries on systems supporting the SYSCALL instruction. | |
236 | + This has been a lightly-tested code path and needs extra scrutiny. | |
237 | + | |
238 | +Debugging | |
239 | +========= | |
240 | + | |
241 | +Bugs in PTI cause a few different signatures of crashes | |
242 | +that are worth noting here. | |
243 | + | |
244 | + * Failures of the selftests/x86 code. Usually a bug in one of the | |
245 | + more obscure corners of entry_64.S | |
246 | + * Crashes in early boot, especially around CPU bringup. Bugs | |
247 | + in the trampoline code or mappings cause these. | |
248 | + * Crashes at the first interrupt. Caused by bugs in entry_64.S, | |
249 | + like screwing up a page table switch. Also caused by | |
250 | + incorrectly mapping the IRQ handler entry code. | |
251 | + * Crashes at the first NMI. The NMI code is separate from main | |
252 | + interrupt handlers and can have bugs that do not affect | |
253 | + normal interrupts. Also caused by incorrectly mapping NMI | |
254 | + code. NMIs that interrupt the entry code must be very | |
255 | + careful and can be the cause of crashes that show up when | |
256 | + running perf. | |
257 | + * Kernel crashes at the first exit to userspace. entry_64.S | |
258 | + bugs, or failing to map some of the exit code. | |
259 | + * Crashes at first interrupt that interrupts userspace. The paths | |
260 | + in entry_64.S that return to userspace are sometimes separate | |
261 | + from the ones that return to the kernel. | |
262 | + * Double faults: overflowing the kernel stack because of page | |
263 | + faults upon page faults. Caused by touching non-pti-mapped | |
264 | + data in the entry code, or forgetting to switch to kernel | |
265 | + CR3 before calling into C functions which are not pti-mapped. | |
266 | + * Userspace segfaults early in boot, sometimes manifesting | |
267 | + as mount(8) failing to mount the rootfs. These have | |
268 | + tended to be TLB invalidation issues. Usually invalidating | |
269 | + the wrong PCID, or otherwise missing an invalidation. | |
270 | + | |
271 | +1. https://gruss.cc/files/kaiser.pdf | |
272 | +2. https://meltdownattack.com/meltdown.pdf | |
273 | -- | |
274 | 2.14.2 | |
275 |
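
The 'pti=' and 'nopti' semantics documented in the kernel-parameters.txt hunk above can be summarized in a small sketch. This is not the kernel's parser: the function name resolve_pti_mode is hypothetical, and two behaviors are assumptions the patch does not specify, namely that the last matching option on the command line wins and that an unrecognized 'pti=' value is rejected. Only the three values (on, off, auto), the 'nopti' alias and the default come from the patch text.

```python
def resolve_pti_mode(cmdline: str) -> str:
    """Return 'on', 'off' or 'auto' for a kernel command line string.

    Hypothetical helper for illustration only; it does not mirror the
    kernel's actual option parsing.
    """
    # Documented: not specifying the option is equivalent to pti=auto.
    mode = "auto"
    for token in cmdline.split():
        if token == "nopti":
            # Documented: nopti is equivalent to pti=off.
            mode = "off"
        elif token.startswith("pti="):
            value = token[len("pti="):]
            # Assumption: values other than the three documented ones
            # are treated as errors here.
            if value not in ("on", "off", "auto"):
                raise ValueError(f"unsupported pti value: {value!r}")
            # Assumption: a later option overrides an earlier one.
            mode = value
    return mode


if __name__ == "__main__":
    print(resolve_pti_mode("root=/dev/sda1 ro quiet"))  # auto
    print(resolve_pti_mode("root=/dev/sda1 ro nopti"))  # off
    print(resolve_pti_mode("root=/dev/sda1 pti=on"))    # on
```

On a running system, the string to feed into such a helper would typically come from /proc/cmdline. With 'auto', whether PTI ends up active still depends on the kernel's own CPU detection, which this sketch does not model.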