Commit | Line | Data |
---|---|---|
9f803664 KC |
1 | # Kernel Self-Protection |
2 | ||
3 | Kernel self-protection is the design and implementation of systems and | |
4 | structures within the Linux kernel to protect against security flaws in | |
5 | the kernel itself. This covers a wide range of issues, including removing | |
6 | entire classes of bugs, blocking security flaw exploitation methods, | |
7 | and actively detecting attack attempts. Not all topics are explored in | |
8 | this document, but it should serve as a reasonable starting point and | |
9 | answer any frequently asked questions. (Patches welcome, of course!) | |
10 | ||
11 | In the worst-case scenario, we assume an unprivileged local attacker | |
12 | has arbitrary read and write access to the kernel's memory. In many | |
13 | cases, bugs being exploited will not provide this level of access, | |
14 | but with systems in place that defend against the worst case we'll | |
15 | cover the more limited cases as well. A higher bar, and one that should | |
16 | still be kept in mind, is protecting the kernel against a _privileged_ | |
17 | local attacker, since the root user has access to a vastly increased | |
18 | attack surface. (Especially when they have the ability to load arbitrary | |
19 | kernel modules.) | |
20 | ||
21 | The goals for successful self-protection systems would be that they | |
22 | are effective, on by default, require no opt-in by developers, have no | |
23 | performance impact, do not impede kernel debugging, and have tests. It | |
24 | is uncommon that all these goals can be met, but it is worth explicitly | |
25 | mentioning them, since these aspects need to be explored, dealt with, | |
26 | and/or accepted. | |
27 | ||
28 | ||
29 | ## Attack Surface Reduction | |
30 | ||
31 | The most fundamental defense against security exploits is to reduce the | |
32 | areas of the kernel that can be used to redirect execution. This includes | |
33 | limiting the exposed APIs available to userspace, making in-kernel | |
34 | APIs hard to use incorrectly, minimizing the areas of writable kernel | |
35 | memory, etc. | |
36 | ||
37 | ### Strict kernel memory permissions | |
38 | ||
39 | When all of kernel memory is writable, it becomes trivial for attacks | |
40 | to redirect execution flow. To reduce the availability of these targets | |
41 | the kernel needs to protect its memory with a tight set of permissions. | |
42 | ||
43 | #### Executable code and read-only data must not be writable | |
44 | ||
45 | Any areas of the kernel with executable memory must not be writable. | |
46 | While this obviously includes the kernel text itself, we must consider | |
47 | all additional places too: kernel modules, JIT memory, etc. (There are | |
48 | temporary exceptions to this rule to support things like instruction | |
49 | alternatives, breakpoints, kprobes, etc. If these must exist in a | |
50 | kernel, they are implemented in a way where the memory is temporarily | |
51 | made writable during the update, and then returned to the original | |
52 | permissions.) | |
53 | ||
0f5bf6d0 LA |
54 | In support of this are CONFIG_STRICT_KERNEL_RWX and |
55 | CONFIG_STRICT_MODULE_RWX, which seek to make sure that code is not | |
9f803664 KC |
56 | writable, data is not executable, and read-only data is neither writable |
57 | nor executable. | |
58 | ||
ad21fc4f LA |
59 | Most architectures have these options on by default and not user selectable. |
60 | For some architectures like arm that wish to have these be selectable, | |
61 | the architecture Kconfig can select ARCH_OPTIONAL_KERNEL_RWX to enable | |
62 | a Kconfig prompt. CONFIG_ARCH_OPTIONAL_KERNEL_RWX_DEFAULT determines | |
63 | the default setting when ARCH_OPTIONAL_KERNEL_RWX is enabled. | |
64 | ||
9f803664 KC |
65 | #### Function pointers and sensitive variables must not be writable |
66 | ||
67 | Vast areas of kernel memory contain function pointers that are looked | |
68 | up by the kernel and used to continue execution (e.g. descriptor/vector | |
69 | tables, file/network/etc operation structures, etc). The number of these | |
70 | variables must be reduced to an absolute minimum. | |
71 | ||
72 | Many such variables can be made read-only by setting them "const" | |
73 | so that they live in the .rodata section instead of the .data section | |
74 | of the kernel, gaining the protection of the kernel's strict memory | |
75 | permissions as described above. | |
76 | ||
77 | For variables that are initialized once at __init time, these can | |
78 | be marked with the (new and under development) __ro_after_init | |
79 | attribute. | |
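
The "const" approach described above can be sketched in plain userspace C. The structure and function names here (`demo_ops`, `demo_get_ops`, and so on) are hypothetical and exist only for illustration; `__ro_after_init` is a kernel-only annotation and is not shown.

```c
#include <stddef.h>

/* Hypothetical operations table, loosely modelled on the kernel's
 * file/network operation structures. */
struct demo_ops {
    int (*open)(int id);
    int (*close)(int id);
};

static int demo_open(int id)  { return id + 1; }
static int demo_close(int id) { return id - 1; }

/* "const" lets the toolchain place the table in .rodata, so the
 * function pointers gain the protection of read-only mappings. */
static const struct demo_ops demo_table = {
    .open  = demo_open,
    .close = demo_close,
};

/* Callers only ever receive a pointer-to-const. */
const struct demo_ops *demo_get_ops(void)
{
    return &demo_table;
}
```

Handing out only a pointer-to-const also makes accidental writes a compile-time error, in addition to the run-time protection.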
80 | ||
81 | What remains are variables that are updated rarely (e.g. GDT). These | |
82 | will need another infrastructure (similar to the temporary exceptions | |
83 | made to kernel code mentioned above) that allow them to spend the rest | |
84 | of their lifetime read-only. (For example, when being updated, only the | |
85 | CPU thread performing the update would be given uninterruptible write | |
86 | access to the memory.) | |
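
A rough userspace analogue of such a "write rarely" scheme can be built with `mprotect()`. This is only a sketch under stated assumptions: the `rare_*` names are hypothetical, and unlike the per-CPU-thread scheme described above, `mprotect()` opens the write window for every thread in the process.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

static long *rare_value;   /* lives alone in its own page */
static size_t rare_len;

/* Allocate one page, store the initial value, then drop write access. */
int rare_init(long initial)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;
    rare_len = (size_t)page;
    void *p = mmap(NULL, rare_len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    rare_value = p;
    *rare_value = initial;
    return mprotect(rare_value, rare_len, PROT_READ);
}

/* Open a short write window, update, and close it again. */
int rare_update(long v)
{
    if (mprotect(rare_value, rare_len, PROT_READ | PROT_WRITE) != 0)
        return -1;
    *rare_value = v;
    return mprotect(rare_value, rare_len, PROT_READ);
}

long rare_read(void)
{
    return *rare_value;
}
```

Outside `rare_update()`, any stray write to the page faults instead of silently changing the value.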
87 | ||
88 | #### Segregation of kernel memory from userspace memory | |
89 | ||
90 | The kernel must never execute userspace memory. The kernel must also never | |
91 | access userspace memory without explicit expectation to do so. These | |
92 | rules can be enforced either by support of hardware-based restrictions | |
93 | (x86's SMEP/SMAP, ARM's PXN/PAN) or via emulation (ARM's Memory Domains). | |
94 | By blocking userspace memory in this way, execution and data parsing | |
95 | cannot be passed to trivially-controlled userspace memory, forcing | |
96 | attacks to operate entirely in kernel memory. | |
97 | ||
98 | ### Reduced access to syscalls | |
99 | ||
100 | One trivial way to eliminate many syscalls for 64-bit systems is building | |
101 | without CONFIG_COMPAT. However, this is rarely a feasible scenario. | |
102 | ||
103 | The "seccomp" system provides an opt-in feature made available to | |
104 | userspace, which provides a way to reduce the number of kernel entry | |
105 | points available to a running process. This limits the breadth of kernel | |
106 | code that can be reached, possibly reducing the availability of a given | |
107 | bug to an attack. | |
108 | ||
109 | An area of improvement would be creating viable ways to keep access to | |
110 | things like compat, user namespaces, BPF creation, and perf limited only | |
111 | to trusted processes. This would keep the scope of kernel entry points | |
112 | restricted to the more regular set normally available to unprivileged | |
113 | userspace. | |
114 | ||
115 | ### Restricting access to kernel modules | |
116 | ||
117 | The kernel should never allow an unprivileged user the ability to | |
118 | load specific kernel modules, since that would provide a facility to | |
119 | unexpectedly extend the available attack surface. (The on-demand loading | |
120 | of modules via their predefined subsystems, e.g. MODULE_ALIAS_*, is | |
121 | considered "expected" here, though additional consideration should be | |
122 | given even to these.) For example, loading a filesystem module via an | |
123 | unprivileged socket API is nonsense: only the root or physically local | |
124 | user should trigger filesystem module loading. (And even this can be up | |
125 | for debate in some scenarios.) | |
126 | ||
127 | To protect against even privileged users, systems may need to either | |
128 | disable module loading entirely (e.g. monolithic kernel builds or | |
129 | modules_disabled sysctl), or provide signed modules (e.g. | |
130 | CONFIG_MODULE_SIG_FORCE, or dm-crypt with LoadPin), to keep from having | |
131 | root load arbitrary kernel code via the module loader interface. | |
132 | ||
133 | ||
134 | ## Memory integrity | |
135 | ||
136 | There are many memory structures in the kernel that are regularly abused | |
137 | to gain execution control during an attack. By far the most commonly | |
138 | understood is that of the stack buffer overflow in which the return | |
139 | address stored on the stack is overwritten. Many other examples of this | |
140 | kind of attack exist, and protections exist to defend against them. | |
141 | ||
142 | ### Stack buffer overflow | |
143 | ||
144 | The classic stack buffer overflow involves writing past the expected end | |
145 | of a variable stored on the stack, ultimately writing a controlled value | |
146 | to the stack frame's stored return address. The most widely used defense | |
147 | is the presence of a stack canary between the stack variables and the | |
148 | return address (CONFIG_CC_STACKPROTECTOR), which is verified just before | |
149 | the function returns. Other defenses include things like shadow stacks. | |
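
Real canaries are inserted by the compiler, but the check itself can be emulated by hand. The following is a minimal sketch, assuming a hypothetical frame layout (`fake_frame`) and a fixed secret that would really come from an entropy source at startup.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical frame layout: local buffer, then the canary, then a
 * value standing in for the saved return address. */
struct fake_frame {
    char      buf[16];
    uintptr_t canary;
    uintptr_t ret_addr;
};

/* Assumption: filled from a real entropy source at startup. */
static uintptr_t frame_secret = (uintptr_t)0x5bd1e995u;

/* Copies len bytes to the start of the frame without checking them
 * against sizeof(buf) (the bug), then verifies the canary as a function
 * epilogue would. Returns 0 if intact, -1 if it was clobbered. */
int copy_and_check(const char *src, size_t len)
{
    struct fake_frame f;
    f.canary = frame_secret;
    f.ret_addr = 0;

    /* Keep the demo itself memory-safe: never write past the struct. */
    if (len > sizeof(f))
        len = sizeof(f);
    memcpy((unsigned char *)&f, src, len);

    return f.canary == frame_secret ? 0 : -1;
}
```

A copy that stays within `buf` passes the check, while a longer one must cross the canary before it can reach `ret_addr`.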
150 | ||
151 | ### Stack depth overflow | |
152 | ||
153 | A less well understood attack is using a bug that triggers the | |
154 | kernel to consume stack memory with deep function calls or large stack | |
155 | allocations. With this attack it is possible to write beyond the end of | |
156 | the kernel's preallocated stack space and into sensitive structures. Two | |
157 | important changes need to be made for better protections: moving the | |
158 | sensitive thread_info structure elsewhere, and adding a faulting memory | |
159 | hole at the bottom of the stack to catch these overflows. | |
160 | ||
161 | ### Heap memory integrity | |
162 | ||
163 | The structures used to track heap free lists can be sanity-checked during | |
164 | allocation and freeing to make sure they aren't being used to manipulate | |
165 | other memory areas. | |
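
One form of such a sanity check is verifying list linkage before unlinking an entry, similar in spirit to the kernel's list debugging. This is a minimal sketch with hypothetical names, not the allocator's actual code.

```c
#include <stdbool.h>
#include <stddef.h>

struct node {
    struct node *next;
    struct node *prev;
};

void list_init(struct node *head)
{
    head->next = head->prev = head;
}

void list_add(struct node *n, struct node *head)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/* Refuse to unlink unless both neighbours still point back at the
 * entry. A corrupted or already-removed entry is left untouched. */
bool list_del_checked(struct node *n)
{
    if (n->prev == NULL || n->next == NULL)
        return false;
    if (n->prev->next != n || n->next->prev != n)
        return false;
    n->next->prev = n->prev;
    n->prev->next = n->next;
    n->next = n->prev = NULL;
    return true;
}
```

Without the check, an attacker who controls `next` and `prev` turns the two unlink stores into a write to a location of their choosing.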
166 | ||
167 | ### Counter integrity | |
168 | ||
169 | Many places in the kernel use atomic counters to track object references | |
170 | or perform similar lifetime management. When these counters can be made | |
171 | to wrap (over or under) this traditionally exposes a use-after-free | |
172 | flaw. By trapping atomic wrapping, this class of bug vanishes. | |
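
A saturating reference counter is one way to trap wrapping. The sketch below uses C11 atomics and hypothetical `demo_ref_*` names; it illustrates the idea and is not the kernel's implementation.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define REF_SATURATED UINT_MAX

typedef struct { atomic_uint refs; } demo_ref;

void demo_ref_init(demo_ref *r, unsigned int n)
{
    atomic_init(&r->refs, n);
}

unsigned int demo_ref_read(demo_ref *r)
{
    return atomic_load(&r->refs);
}

/* Take a reference. Fails on a dead object (count 0); a counter at the
 * maximum stays pinned there instead of wrapping to 0. */
bool demo_ref_get(demo_ref *r)
{
    unsigned int old = atomic_load(&r->refs);
    do {
        if (old == 0)
            return false;          /* object already released */
        if (old == REF_SATURATED)
            return true;           /* pinned: leak rather than wrap */
    } while (!atomic_compare_exchange_weak(&r->refs, &old, old + 1));
    return true;
}

/* Drop a reference. Returns true only when the caller dropped the
 * last one; underflow attempts and pinned counters return false. */
bool demo_ref_put(demo_ref *r)
{
    unsigned int old = atomic_load(&r->refs);
    do {
        if (old == 0 || old == REF_SATURATED)
            return false;
    } while (!atomic_compare_exchange_weak(&r->refs, &old, old - 1));
    return old == 1;
}
```

Pinning trades a possible memory leak for the guarantee that the object is never freed while references may still exist.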
173 | ||
174 | ### Size calculation overflow detection | |
175 | ||
176 | Similar to counter overflow, integer overflows (usually size calculations) | |
177 | need to be detected at runtime to kill this class of bug, which | |
178 | traditionally leads to being able to write past the end of kernel buffers. | |
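
For size calculations, the check amounts to refusing any multiplication whose result would not fit. A minimal sketch in portable C, with hypothetical helper names:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Stores n * size in *out and returns true, or returns false if the
 * product would wrap around SIZE_MAX. */
bool size_mul_checked(size_t n, size_t size, size_t *out)
{
    if (size != 0 && n > SIZE_MAX / size)
        return false;
    *out = n * size;
    return true;
}

/* Array allocation that fails cleanly instead of returning a buffer
 * smaller than the caller believes it to be. */
void *alloc_array_checked(size_t n, size_t size)
{
    size_t bytes;
    if (!size_mul_checked(n, size, &bytes))
        return NULL;
    return malloc(bytes);
}
```

The failure case matters most: an unchecked `n * size` that wraps yields a small allocation followed by writes sized for the large one.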
179 | ||
180 | ||
181 | ## Statistical defenses | |
182 | ||
183 | While many protections can be considered deterministic (e.g. read-only | |
184 | memory cannot be written to), some protections provide only statistical | |
185 | defense, in that an attack must gather enough information about a | |
186 | running system to overcome the defense. While not perfect, these do | |
187 | provide meaningful defenses. | |
188 | ||
189 | ### Canaries, blinding, and other secrets | |
190 | ||
191 | It should be noted that things like the stack canary discussed earlier | |
c9de4a82 KC |
192 | are technically statistical defenses, since they rely on a secret value, |
193 | and such values may become discoverable through an information exposure | |
194 | flaw. | |
9f803664 KC |
195 | |
196 | Blinding literal values for things like JITs, where the executable | |
197 | contents may be partially under the control of userspace, needs a similar | |
198 | secret value. | |
199 | ||
200 | It is critical that the secret values used are separate (e.g. | |
201 | different canary per stack) and high entropy (e.g. is the RNG actually | |
202 | working?) in order to maximize their success. | |
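
Constant blinding can be illustrated with a simple XOR scheme. This is a sketch under assumptions: the names are hypothetical, and the key is passed in as a parameter here, whereas a real JIT would draw a fresh nonzero random key per constant.

```c
#include <stdint.h>

/* A blinded immediate: neither stored word equals the user-chosen
 * constant (for a nonzero key), but XORing them recovers it. */
struct blinded_imm {
    uint32_t masked;  /* imm ^ key, emitted in place of the constant */
    uint32_t key;     /* emitted as the operand of a following XOR */
};

struct blinded_imm blind_imm(uint32_t imm, uint32_t key)
{
    struct blinded_imm b = { imm ^ key, key };
    return b;
}

/* What the generated code computes at run time. */
uint32_t unblind_imm(struct blinded_imm b)
{
    return b.masked ^ b.key;
}
```

Because the user-supplied value never appears verbatim in executable memory, it cannot be used directly as attacker-chosen instruction bytes.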
203 | ||
204 | ### Kernel Address Space Layout Randomization (KASLR) | |
205 | ||
206 | Since the location of kernel memory is almost always instrumental in | |
207 | mounting a successful attack, making the location non-deterministic | |
208 | raises the difficulty of an exploit. (Note that this in turn makes | |
c9de4a82 KC |
209 | the value of information exposures higher, since they may be used to |
210 | discover desired memory locations.) | |
9f803664 KC |
211 | |
212 | #### Text and module base | |
213 | ||
214 | By relocating the physical and virtual base address of the kernel at | |
215 | boot-time (CONFIG_RANDOMIZE_BASE), attacks needing kernel code will be | |
216 | frustrated. Additionally, offsetting the module loading base address | |
217 | means that even systems that load the same set of modules in the same | |
218 | order every boot will not share a common base address with the rest of | |
219 | the kernel text. | |
220 | ||
221 | #### Stack base | |
222 | ||
223 | If the base address of the kernel stack is not the same between processes, | |
224 | or even not the same between syscalls, targets on or beyond the stack | |
225 | become more difficult to locate. | |
226 | ||
227 | #### Dynamic memory base | |
228 | ||
229 | Much of the kernel's dynamic memory (e.g. kmalloc, vmalloc, etc) ends up | |
230 | being relatively deterministic in layout due to the order of early-boot | |
231 | initializations. If the base address of these areas is not the same | |
c9de4a82 KC |
232 | between boots, targeting them is frustrated, requiring an information |
233 | exposure specific to the region. | |
234 | ||
235 | #### Structure layout | |
236 | ||
237 | By performing a per-build randomization of the layout of sensitive | |
238 | structures, attacks must either be tuned to known kernel builds or expose | |
239 | enough kernel memory to determine structure layouts before manipulating | |
240 | them. | |
9f803664 KC |
241 | |
242 | ||
c9de4a82 |
243 | ## Preventing Information Exposures |
9f803664 KC |
244 | |
245 | Since the locations of sensitive structures are the primary target for | |
c9de4a82 |
246 | attacks, it is important to defend against exposure of both kernel memory |
9f803664 KC |
247 | addresses and kernel memory contents (since they may contain kernel |
248 | addresses or other sensitive things like canary values). | |
249 | ||
250 | ### Unique identifiers | |
251 | ||
252 | Kernel memory addresses must never be used as identifiers exposed to | |
253 | userspace. Instead, use an atomic counter, an idr, or similar unique | |
254 | identifier. | |
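
A minimal sketch of the atomic-counter option, using C11 atomics and a hypothetical `session` object:

```c
#include <stdatomic.h>
#include <stdint.h>

struct session {
    uint64_t id;        /* safe to report to userspace */
    void    *private;   /* internal pointer, never reported */
};

static atomic_uint_fast64_t next_session_id = 1;

/* Assign the identifier from a monotonically increasing counter, so it
 * reveals nothing about where the object lives in memory. */
void session_init(struct session *s, void *private)
{
    s->id = atomic_fetch_add(&next_session_id, 1);
    s->private = private;
}
```

The identifier stays unique for the lifetime of the counter and carries no address bits that could help defeat KASLR.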
255 | ||
256 | ### Memory initialization | |
257 | ||
258 | Memory copied to userspace must always be fully initialized. If not | |
259 | explicitly memset(), this will require changes to the compiler to make | |
260 | sure structure holes are cleared. | |
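
The explicit `memset()` case can be sketched as follows. The `reply` structure is hypothetical; on typical ABIs it has three padding bytes between `kind` and `value`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct reply {
    uint8_t  kind;    /* typically followed by padding bytes */
    uint32_t value;
};

/* Fill a reply that will later be copied out as a whole. The memset
 * zeroes the padding, which assigning the fields one by one would
 * leave holding whatever was previously in that memory. */
void reply_fill(struct reply *r, uint8_t kind, uint32_t value)
{
    memset(r, 0, sizeof(*r));
    r->kind = kind;
    r->value = value;
}
```

Using a designated initializer instead is not a reliable substitute, since compilers are not required to clear padding in that case.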
261 | ||
262 | ### Memory poisoning | |
263 | ||
264 | When releasing memory, it is best to poison the contents (clear stack on | |
265 | syscall return, wipe heap memory on a free), to avoid reuse attacks that | |
266 | rely on the old contents of memory. This frustrates many uninitialized | |
c9de4a82 KC |
267 | variable attacks, stack content exposures, heap content exposures, and |
268 | use-after-free attacks. | |
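
Wiping heap memory on free can be sketched in userspace as below. The helper names and the poison value are assumptions chosen for illustration, and the caller passes the length because standard C has no portable way to query an allocation's size.

```c
#include <stddef.h>
#include <stdlib.h>

#define POISON_BYTE 0x6b   /* illustrative pattern, not a required value */

/* Overwrite through a volatile pointer so the compiler cannot discard
 * the stores as dead writes ahead of free(). */
void poison_buffer(void *p, size_t len)
{
    volatile unsigned char *b = p;
    while (len--)
        *b++ = POISON_BYTE;
}

/* Poison the contents, then release the memory. */
void free_poisoned(void *p, size_t len)
{
    if (p == NULL)
        return;
    poison_buffer(p, len);
    free(p);
}
```

A recognizable poison pattern also helps debugging, since a stale pointer that is dereferenced later yields an obviously bogus value.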
9f803664 KC |
269 | |
270 | ### Destination tracking | |
271 | ||
272 | To help kill classes of bugs that result in kernel addresses being | |
273 | written to userspace, the destination of writes needs to be tracked. If | |
274 | the buffer is destined for userspace (e.g. seq_file backed /proc files), | |
275 | it should automatically censor sensitive values. |
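
One possible shape for this is an output buffer that carries a destination tag, with the formatting helper censoring addresses when the tag says the data is bound for userspace. This is a hypothetical sketch; the `out_buf` type and `out_put_addr()` helper are not existing kernel interfaces.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct out_buf {
    char  *data;
    size_t size;
    bool   to_user;   /* destination tag, set when the buffer is created */
};

/* Format an address into the buffer, replacing it with zeros when the
 * buffer is destined for userspace. Returns snprintf()'s result. */
int out_put_addr(struct out_buf *o, const void *addr)
{
    uintptr_t shown = o->to_user ? 0 : (uintptr_t)addr;
    return snprintf(o->data, o->size, "%016llx",
                    (unsigned long long)shown);
}
```

Putting the decision in the formatting helper means individual call sites do not have to remember to censor their own output.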