.. SPDX-License-Identifier: GPL-2.0

==========================
Page Table Isolation (PTI)
==========================

Overview
========

Page Table Isolation (pti, previously known as KAISER [1]_) is a
countermeasure against attacks on the shared user/kernel address
space such as the "Meltdown" approach [2]_.

To mitigate this class of attacks, we create an independent set of
page tables for use only when running userspace applications. When
the kernel is entered via syscalls, interrupts or exceptions, the
page tables are switched to the full "kernel" copy. When the system
switches back to user mode, the user copy is used again.

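The switching rule above can be modeled in a few lines. This is a toy sketch, not kernel code: the ``Cpu`` class, its ``enter()`` and ``leave()`` methods, and the "user"/"kernel" labels are hypothetical names chosen for illustration. It also shows that an entry that arrives while already in the kernel needs no switch.

```python
class Cpu:
    """Toy model of which page-table copy a CPU is using (not kernel code)."""

    def __init__(self):
        self.tables = "user"      # running a userspace application
        self.saved = []           # copy in use before each kernel entry
        self.switches = 0         # number of page-table switches performed

    def _switch(self, target):
        if self.tables != target:
            self.tables = target
            self.switches += 1

    def enter(self):
        """Syscall, interrupt or exception entry."""
        self.saved.append(self.tables)
        self._switch("kernel")    # no switch needed if already in the kernel

    def leave(self):
        """Return to whatever was interrupted."""
        self._switch(self.saved.pop())


cpu = Cpu()
cpu.enter()          # syscall from userspace: user -> kernel
cpu.enter()          # interrupt while in the kernel: stays on kernel tables
cpu.leave()          # back to the interrupted kernel code
cpu.leave()          # back to userspace: kernel -> user
print(cpu.tables, cpu.switches)
```
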
The userspace page tables contain only a minimal amount of kernel
data: only what is needed to enter/exit the kernel, such as the
entry/exit functions themselves and the interrupt descriptor table
(IDT). There are a few strictly unnecessary things that get mapped,
such as the first C function when entering an interrupt (see
comments in pti.c).

This approach helps to ensure that side-channel attacks leveraging
the paging structures do not function when PTI is enabled. It can be
enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time.
Once enabled at compile time, it can be disabled at boot with the
'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt).

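On a running x86 system, one way to see whether PTI ended up enabled is to look for the ``pti`` flag in /proc/cpuinfo. The following is a minimal sketch, assuming the kernel reports that flag name on the ``flags`` line; the helper name ``has_cpu_flag()`` is made up for this example.

```python
def has_cpu_flag(cpuinfo_text, flag):
    """Return True if `flag` appears on any 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags" and flag in value.split():
            return True
    return False


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""            # not Linux, or /proc is unavailable
    print("pti flag present:", has_cpu_flag(text, "pti"))
```

The same helper can be used to look for ``pcid`` and ``invpcid``, which matter for the runtime cost of PTI.
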
Page Table Management
=====================

When PTI is enabled, the kernel manages two sets of page tables.
The first set is very similar to the single set which is present in
kernels without PTI. This includes a complete mapping of userspace
that the kernel can use for things like copy_to_user().

Although *complete*, the user portion of the kernel page tables is
crippled by setting the NX bit in the top level. This ensures
that any missed kernel->user CR3 switch will immediately crash
userspace upon executing its first instruction.

The userspace page tables map only the kernel data needed to enter
and exit the kernel. This data is entirely contained in the 'struct
cpu_entry_area' structure, which is placed in the fixmap. This gives
each CPU's copy of the area a compile-time-fixed virtual address.

For new userspace mappings, the kernel makes the entries in its
page tables as usual. The only difference is when the kernel
makes entries in the top (PGD) level. In addition to setting the
entry in the main kernel PGD, a copy of the entry is made in the
userspace page tables' PGD.

This sharing at the PGD level also inherently shares all the lower
layers of the page tables. This leaves a single, shared set of
userspace page tables to manage: one PTE to lock, one set of
accessed bits, dirty bits, and so on.

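The PGD handling described in this section can be illustrated with a toy model. This is a sketch under stated assumptions, not the kernel's implementation: it assumes a 512-entry PGD whose lower half maps userspace, uses bit 63 as the NX bit as on x86-64, and the names ``PtiPgd`` and ``set_pgd()`` are illustrative Python stand-ins.

```python
PGD_ENTRIES = 512                 # assumption: 512 top-level slots
USER_SLOTS = PGD_ENTRIES // 2     # assumption: lower half maps userspace
PAGE_PRESENT = 1 << 0
PAGE_NX = 1 << 63                 # no-execute bit in an x86-64 page-table entry


class PtiPgd:
    """Toy pair of top-level tables: full kernel copy plus minimal user copy."""

    def __init__(self):
        self.kernel = [0] * PGD_ENTRIES
        self.user = [0] * PGD_ENTRIES

    def set_pgd(self, index, entry):
        if index < USER_SLOTS:
            # Userspace slot: keep the user copy in sync, unchanged.
            self.user[index] = entry
            if entry & PAGE_PRESENT:
                # Poison the kernel copy so user code cannot execute
                # if a kernel->user switch is ever missed.
                entry |= PAGE_NX
        self.kernel[index] = entry


pgd = PtiPgd()
pgd.set_pgd(3, 0x1000 | PAGE_PRESENT)      # new userspace mapping
pgd.set_pgd(300, 0x2000 | PAGE_PRESENT)    # kernel-only mapping
```

In this model both copies of slot 3 refer to the same lower-level table (address 0x1000), which is why only one set of lower-level entries has to be managed.
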
Overhead
========

Protection against side-channel attacks is important, but
this protection comes at a cost:

1. Increased Memory Use

  a. Each process now needs an order-1 PGD instead of order-0.
     (Consumes an additional 4k per process).
  b. The 'cpu_entry_area' structure must be 2MB in size and 2MB
     aligned so that it can be mapped by setting a single PMD
     entry. This consumes nearly 2MB of RAM once the kernel
     is decompressed, but no space in the kernel image itself.

2. Runtime Cost

  a. CR3 manipulation to switch between the page table copies
     must be done at interrupt, syscall, and exception entry
     and exit (although it can be skipped when the kernel itself
     is interrupted). Moves to CR3 are on the order of a hundred
     cycles, and are required at every entry and exit.
  b. A "trampoline" must be used for SYSCALL entry. This
     trampoline depends on a smaller set of resources than the
     non-PTI SYSCALL entry code, so requires mapping fewer
     things into the userspace page tables. The downside is
     that stacks must be switched at entry time.
  c. Global pages are disabled for all kernel structures not
     mapped into both kernel and userspace page tables. This
     feature of the MMU allows different processes to share TLB
     entries mapping the kernel. Losing the feature means more
     TLB misses after a context switch. The actual loss of
     performance is very small, however, never exceeding 1%.
  d. Process Context IDentifiers (PCID) is a CPU feature that
     allows us to skip flushing the entire TLB when switching page
     tables by setting a special bit in CR3 when the page tables
     are changed. This makes switching the page tables (at context
     switch, or kernel entry/exit) cheaper. But, on systems with
     PCID support, the context switch code must flush both the user
     and kernel entries out of the TLB. The user PCID TLB flush is
     deferred until the exit to userspace, minimizing the cost.
     See intel.com/sdm for the gory PCID/INVPCID details.
  e. The userspace page tables must be populated for each new
     process. Even without PTI, the shared kernel mappings
     are created by copying top-level (PGD) entries into each
     new process. But, with PTI, there are now *two* kernel
     mappings: one in the kernel page tables that maps everything
     and one for the entry/exit structures. At fork(), we need to
     copy both.
  f. In addition to the fork()-time copying, there must also
     be an update to the userspace PGD any time a set_pgd() is done
     on a PGD used to map userspace. This ensures that the kernel
     and userspace copies always map the same userspace
     memory.
  g. On systems without PCID support, each CR3 write flushes
     the entire TLB. That means that each syscall, interrupt
     or exception flushes the TLB.
  h. INVPCID is a TLB-flushing instruction which allows flushing
     of TLB entries for non-current PCIDs. Some systems support
     PCIDs, but do not support INVPCID. On these systems, addresses
     can only be flushed from the TLB for the current PCID. When
     flushing a kernel address, we need to flush all PCIDs, so a
     single kernel address flush will require a TLB-flushing CR3
     write upon the next use of every PCID.

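To make the CR3 and PCID items above more concrete, the following sketch composes the values that would be written to CR3 for the kernel and userspace copies. It relies on two architectural facts (with PCIDs enabled, CR3 bits 11:0 carry the PCID, and bit 63 of the value written requests that TLB entries be kept) and on two labeled assumptions: that the userspace PGD is the second 4k page of the order-1 allocation, and that the user PCID is the kernel PCID with bit 11 set. The function names are hypothetical.

```python
PAGE_SIZE = 4096
PCID_MASK = 0xFFF                 # CR3 bits 11:0 carry the PCID when enabled
CR3_NOFLUSH = 1 << 63             # bit 63 of the written value: keep TLB entries
USER_PCID_BIT = 1 << 11           # assumption: user PCID = kernel PCID | bit 11


def make_cr3(pgd_phys, pcid, noflush):
    """Compose a CR3 value from a page-aligned PGD address and a PCID."""
    if pgd_phys % PAGE_SIZE or not 0 <= pcid <= PCID_MASK:
        raise ValueError("PGD must be page aligned and PCID must fit in 12 bits")
    return pgd_phys | pcid | (CR3_NOFLUSH if noflush else 0)


def user_cr3(kernel_pgd_phys, kernel_pcid, noflush):
    """CR3 for the userspace copy (assumed: second page of the order-1 PGD)."""
    return make_cr3(kernel_pgd_phys + PAGE_SIZE,
                    kernel_pcid | USER_PCID_BIT, noflush)


kernel = make_cr3(0x12340000, pcid=1, noflush=True)
user = user_cr3(0x12340000, kernel_pcid=1, noflush=True)
print(hex(kernel), hex(user))
```

Without PCID support there is no no-flush bit to set, which corresponds to item g: every switch between the two values flushes the entire TLB.
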
Possible Future Work
====================
1. We can be more careful about not actually writing to CR3
   unless its value is actually changed.
2. Allow PTI to be enabled/disabled at runtime in addition to the
   boot-time switching.

Testing
========

To test stability of PTI, the following test procedure is recommended,
ideally doing all of these in parallel:

1. Set CONFIG_DEBUG_ENTRY=y
2. Run several copies of all of the tools/testing/selftests/x86/ tests
   (excluding MPX and protection_keys) in a loop on multiple CPUs for
   several minutes. These tests frequently uncover corner cases in the
   kernel entry code. In general, old kernels might cause these tests
   themselves to crash, but they should never crash the kernel.
3. Run the 'perf' tool in a mode (top or record) that generates many
   frequent performance monitoring non-maskable interrupts (see "NMI"
   in /proc/interrupts). This exercises the NMI entry/exit code which
   is known to trigger bugs in code paths that did not expect to be
   interrupted, including nested NMIs. Using "-c" boosts the rate of
   NMIs, and using two -c with separate counters encourages nested NMIs
   and less deterministic behavior.
   ::

     while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done

4. Launch a KVM virtual machine.
5. Run 32-bit binaries on systems supporting the SYSCALL instruction.
   This has been a lightly-tested code path and needs extra scrutiny.

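Steps 2 and 3 can be driven by a small helper script. The sketch below is only one possible way to do it and makes no claim about which selftest binaries exist on a given system: the caller supplies the command lines, and the function names ``run_in_loop()``, ``stress()`` and ``nmi_count()`` are invented for this example. ``nmi_count()`` assumes the "NMI:" line of /proc/interrupts lists one counter per CPU followed by a description.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def run_in_loop(cmd, seconds):
    """Run `cmd` repeatedly for about `seconds`; return (runs, failures)."""
    deadline = time.monotonic() + seconds
    runs = failures = 0
    while time.monotonic() < deadline:
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        runs += 1
        failures += result.returncode != 0
    return runs, failures


def stress(cmds, copies, seconds):
    """Run `copies` parallel loops of every command; return per-command totals."""
    jobs = [cmd for cmd in cmds for _ in range(copies)]
    with ThreadPoolExecutor(max_workers=max(1, len(jobs))) as pool:
        results = list(pool.map(lambda c: run_in_loop(c, seconds), jobs))
    totals = {}
    for cmd, (runs, failures) in zip(jobs, results):
        r, f = totals.get(tuple(cmd), (0, 0))
        totals[tuple(cmd)] = (r + runs, f + failures)
    return totals


def nmi_count(interrupts_text):
    """Sum the per-CPU counters on the 'NMI:' line of /proc/interrupts text."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            return sum(int(x) for x in fields[1:] if x.isdigit())
    return 0


if __name__ == "__main__":
    # Placeholder command; substitute the selftest binaries built locally.
    print(stress([[sys.executable, "-c", "pass"]], copies=2, seconds=1))
```

Comparing ``nmi_count()`` before and after a perf run gives a rough check that NMIs were actually being generated while the tests were looping.
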
Debugging
=========

Bugs in PTI cause a few different signatures of crashes
that are worth noting here.

 * Failures of the selftests/x86 code. Usually a bug in one of the
   more obscure corners of entry_64.S.
 * Crashes in early boot, especially around CPU bringup. Bugs
   in the trampoline code or mappings cause these.
 * Crashes at the first interrupt. Caused by bugs in entry_64.S,
   such as a botched page table switch. Also caused by
   incorrectly mapping the IRQ handler entry code.
 * Crashes at the first NMI. The NMI code is separate from main
   interrupt handlers and can have bugs that do not affect
   normal interrupts. Also caused by incorrectly mapping NMI
   code. NMIs that interrupt the entry code must be very
   careful and can be the cause of crashes that show up when
   running perf.
 * Kernel crashes at the first exit to userspace. Caused by
   entry_64.S bugs, or by failing to map some of the exit code.
 * Crashes at first interrupt that interrupts userspace. The paths
   in entry_64.S that return to userspace are sometimes separate
   from the ones that return to the kernel.
 * Double faults: overflowing the kernel stack because of page
   faults upon page faults. Caused by touching non-pti-mapped
   data in the entry code, or forgetting to switch to kernel
   CR3 before calling into C functions which are not pti-mapped.
 * Userspace segfaults early in boot, sometimes manifesting
   as mount(8) failing to mount the rootfs. These have
   tended to be TLB invalidation issues. Usually invalidating
   the wrong PCID, or otherwise missing an invalidation.

.. [1] https://gruss.cc/files/kaiser.pdf
.. [2] https://meltdownattack.com/meltdown.pdf