# Kernel Self-Protection

Kernel self-protection is the design and implementation of systems and
structures within the Linux kernel to protect against security flaws in
the kernel itself. This covers a wide range of issues, including removing
entire classes of bugs, blocking security flaw exploitation methods,
and actively detecting attack attempts. Not all topics are explored in
this document, but it should serve as a reasonable starting point and
answer any frequently asked questions. (Patches welcome, of course!)

In the worst-case scenario, we assume an unprivileged local attacker
has arbitrary read and write access to the kernel's memory. In many
cases, bugs being exploited will not provide this level of access,
but with systems in place that defend against the worst case we'll
cover the more limited cases as well. A higher bar, and one that should
still be kept in mind, is protecting the kernel against a _privileged_
local attacker, since the root user has access to a vastly increased
attack surface. (Especially when they have the ability to load arbitrary
kernel modules.)

The goals for successful self-protection systems would be that they
are effective, on by default, require no opt-in by developers, have no
performance impact, do not impede kernel debugging, and have tests. It
is uncommon that all these goals can be met, but it is worth explicitly
mentioning them, since these aspects need to be explored, dealt with,
and/or accepted.


## Attack Surface Reduction

The most fundamental defense against security exploits is to reduce the
areas of the kernel that can be used to redirect execution. This ranges
from limiting the APIs exposed to userspace, to making in-kernel APIs
hard to use incorrectly, to minimizing the areas of writable kernel
memory.

### Strict kernel memory permissions

When all of kernel memory is writable, it becomes trivial for attacks
to redirect execution flow. To reduce the availability of these targets
the kernel needs to protect its memory with a tight set of permissions.

#### Executable code and read-only data must not be writable

Any areas of the kernel with executable memory must not be writable.
While this obviously includes the kernel text itself, we must consider
all additional places too: kernel modules, JIT memory, etc. (There are
temporary exceptions to this rule to support things like instruction
alternatives, breakpoints, kprobes, etc. If these must exist in a
kernel, they are implemented in a way where the memory is temporarily
made writable during the update, and then returned to the original
permissions.)

In support of this are CONFIG_STRICT_KERNEL_RWX and
CONFIG_STRICT_MODULE_RWX, which seek to make sure that code is not
writable, data is not executable, and read-only data is neither writable
nor executable.

On most architectures these options are on by default and are not user
selectable.
For some architectures like arm that wish to have these be selectable,
the architecture Kconfig can select ARCH_OPTIONAL_KERNEL_RWX to enable
a Kconfig prompt. CONFIG_ARCH_OPTIONAL_KERNEL_RWX_DEFAULT determines
the default setting when ARCH_OPTIONAL_KERNEL_RWX is enabled.

#### Function pointers and sensitive variables must not be writable

Vast areas of kernel memory contain function pointers that are looked
up by the kernel and used to continue execution (e.g. descriptor/vector
tables, file/network/etc operation structures, etc). The number of these
variables must be reduced to an absolute minimum.

Many such variables can be made read-only by setting them "const"
so that they live in the .rodata section instead of the .data section
of the kernel, gaining the protection of the kernel's strict memory
permissions as described above.
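
As a minimal sketch of the "const" approach, the userspace C below declares
an operations table whose function pointers are fixed at build time. The
names demo_ops, demo_fops, demo_open, demo_read, and demo_get_ops are
hypothetical and exist only for illustration; on typical toolchains a
"static const" object like this is placed in .rodata.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operations table, standing in for structures such as
 * file or network operation tables. */
struct demo_ops {
    int (*open)(void);
    size_t (*read)(char *buf, size_t len);
};

static int demo_open(void)
{
    return 0;
}

/* Fill the caller's buffer with a fixed byte and report how much was written. */
static size_t demo_read(char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL);
    for (i = 0; i < len; i++)
        buf[i] = 'x';
    return len;
}

/* "const" lets the toolchain place the table in .rodata, so the function
 * pointers cannot be overwritten once that section is mapped read-only. */
static const struct demo_ops demo_fops = {
    .open = demo_open,
    .read = demo_read,
};

/* Callers only ever receive a pointer-to-const. */
static const struct demo_ops *demo_get_ops(void)
{
    return &demo_fops;
}
```

Any code that tries to assign to demo_get_ops()->open is rejected at compile
time, and a stray runtime write would fault once the section is read-only.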

For variables that are initialized once at __init time, these can
be marked with the (new and under development) __ro_after_init
attribute.

What remains are variables that are updated rarely (e.g. GDT). These
will need another infrastructure (similar to the temporary exceptions
made to kernel code mentioned above) that allows them to spend the rest
of their lifetime read-only. (For example, when being updated, only the
CPU thread performing the update would be given uninterruptible write
access to the memory.)

#### Segregation of kernel memory from userspace memory

The kernel must never execute userspace memory. The kernel must also never
access userspace memory without explicit expectation to do so. These
rules can be enforced either by support of hardware-based restrictions
(x86's SMEP/SMAP, ARM's PXN/PAN) or via emulation (ARM's Memory Domains).
By blocking userspace memory in this way, execution and data parsing
cannot be passed to trivially-controlled userspace memory, forcing
attacks to operate entirely in kernel memory.

### Reduced access to syscalls

One trivial way to eliminate many syscalls for 64-bit systems is building
without CONFIG_COMPAT. However, this is rarely a feasible scenario.

The "seccomp" system provides an opt-in feature made available to
userspace, which provides a way to reduce the number of kernel entry
points available to a running process. This limits the breadth of kernel
code that can be reached, possibly reducing the availability of a given
bug to an attack.

An area of improvement would be creating viable ways to keep access to
things like compat, user namespaces, BPF creation, and perf limited only
to trusted processes. This would keep the scope of kernel entry points
restricted to the more regular set normally available to unprivileged
userspace.

### Restricting access to kernel modules

The kernel should never allow an unprivileged user the ability to
load specific kernel modules, since that would provide a facility to
unexpectedly extend the available attack surface. (The on-demand loading
of modules via their predefined subsystems, e.g. MODULE_ALIAS_*, is
considered "expected" here, though additional consideration should be
given even to these.) For example, loading a filesystem module via an
unprivileged socket API is nonsense: only the root or physically local
user should trigger filesystem module loading. (And even this can be up
for debate in some scenarios.)

To protect against even privileged users, systems may need to either
disable module loading entirely (e.g. monolithic kernel builds or
modules_disabled sysctl), or provide signed modules (e.g.
CONFIG_MODULE_SIG_FORCE, or dm-crypt with LoadPin), to keep from having
root load arbitrary kernel code via the module loader interface.


## Memory integrity

There are many memory structures in the kernel that are regularly abused
to gain execution control during an attack. By far the most commonly
understood is that of the stack buffer overflow in which the return
address stored on the stack is overwritten. Many other examples of this
kind of attack exist, and protections exist to defend against them.

### Stack buffer overflow

The classic stack buffer overflow involves writing past the expected end
of a variable stored on the stack, ultimately writing a controlled value
to the stack frame's stored return address. The most widely used defense
is the presence of a stack canary between the stack variables and the
return address (CONFIG_CC_STACKPROTECTOR), which is verified just before
the function returns. Other defenses include things like shadow stacks.
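
The canary check can be illustrated with a simulated frame in userspace C.
This is only a sketch under stated assumptions: struct sim_frame and
sim_call are hypothetical, the layout is spelled out by hand rather than
generated by the compiler, and the secret is passed in as a parameter
instead of coming from a per-task random value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 16

/* Simulated stack frame: the canary sits between the local buffer and
 * the saved return address. */
struct sim_frame {
    char buf[BUF_SIZE];
    unsigned long canary;
    unsigned long return_addr;
};

/* Copy len bytes into the frame starting at buf, without checking len
 * against BUF_SIZE, as a buggy function would. Returns true if the
 * canary is intact when the "function" is about to return. */
static bool sim_call(const char *input, size_t len, unsigned long secret)
{
    struct sim_frame frame;

    /* Keep the simulation itself within the frame object. */
    assert(len <= sizeof(frame));

    frame.canary = secret;      /* prologue: store the secret */
    frame.return_addr = 0x1234; /* placeholder value */

    /* The bug: anything past BUF_SIZE bytes spills into the canary. */
    memcpy(&frame, input, len);

    /* Epilogue: verify the canary before the return address is used. */
    return frame.canary == secret;
}
```

A copy of exactly BUF_SIZE bytes leaves the canary untouched, while a longer
copy has to overwrite it before it can reach return_addr, which is what the
epilogue check detects.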

### Stack depth overflow

A less well understood attack is using a bug that triggers the
kernel to consume stack memory with deep function calls or large stack
allocations. With this attack it is possible to write beyond the end of
the kernel's preallocated stack space and into sensitive structures. Two
important changes need to be made for better protections: moving the
sensitive thread_info structure elsewhere, and adding a faulting memory
hole at the bottom of the stack to catch these overflows.

### Heap memory integrity

The structures used to track heap free lists can be sanity-checked during
allocation and freeing to make sure they aren't being used to manipulate
other memory areas.
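
One way to picture such a sanity check, assuming the free list is a circular
doubly linked list, is to verify that both neighbours still point back at an
entry before unlinking it. The names node, list_init, list_add, and
checked_unlink below are hypothetical; this is a sketch of the idea, not the
allocator's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct node {
    struct node *next;
    struct node *prev;
};

/* Make head an empty circular list. */
static void list_init(struct node *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert entry right after head. */
static void list_add(struct node *entry, struct node *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/* Unlink entry only if its neighbours still point back at it. A mismatch
 * means the links were corrupted, so refuse to write through them. */
static bool checked_unlink(struct node *entry)
{
    assert(entry != NULL);
    if (entry->prev->next != entry || entry->next->prev != entry)
        return false;

    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;
    entry->next = NULL;
    entry->prev = NULL;
    return true;
}
```

Without the check, an attacker who controls entry->prev and entry->next can
turn the two unlink stores into writes to addresses of their choosing.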

### Counter integrity

Many places in the kernel use atomic counters to track object references
or perform similar lifetime management. When these counters can be made
to wrap (over or under) this traditionally exposes a use-after-free
flaw. By trapping atomic wrapping, this class of bug vanishes.
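
The sketch below shows a closely related variant in userspace C11: instead
of trapping, the counter saturates, so it can neither wrap past its maximum
nor drop below zero. The type sat_ref and its helpers are hypothetical names
chosen for illustration, and saturation is swapped in here for trapping
because it is easier to demonstrate outside the kernel.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define REF_SATURATED UINT_MAX

struct sat_ref {
    atomic_uint count;
};

static void sat_ref_init(struct sat_ref *r, unsigned int n)
{
    assert(r != NULL);
    atomic_init(&r->count, n);
}

/* Take a reference. Fails if the object is already dead (count == 0).
 * A saturated counter stays pinned rather than wrapping to zero. */
static bool sat_ref_get(struct sat_ref *r)
{
    unsigned int old = atomic_load(&r->count);

    do {
        if (old == 0)
            return false;
        if (old == REF_SATURATED)
            return true;
    } while (!atomic_compare_exchange_weak(&r->count, &old, old + 1));
    return true;
}

/* Drop a reference. Returns true only when the caller released the last
 * one. Underflow attempts and saturated counters never report "free". */
static bool sat_ref_put(struct sat_ref *r)
{
    unsigned int old = atomic_load(&r->count);

    do {
        if (old == 0 || old == REF_SATURATED)
            return false;
    } while (!atomic_compare_exchange_weak(&r->count, &old, old - 1));
    return old == 1;
}
```

The design choice is to leak a saturated object rather than risk freeing it
while references remain, trading a memory leak for the use-after-free.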

### Size calculation overflow detection

Similar to counter overflow, integer overflows (usually size calculations)
need to be detected at runtime to kill this class of bug, which
traditionally leads to being able to write past the end of kernel buffers.
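
A minimal userspace sketch of runtime detection for a size calculation
follows. It assumes a GCC or Clang toolchain, since __builtin_mul_overflow
is a compiler extension; array_bytes and alloc_array are hypothetical helper
names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Compute n * elem_size into *bytes. Returns false if the product does
 * not fit in size_t. */
static bool array_bytes(size_t n, size_t elem_size, size_t *bytes)
{
    assert(bytes != NULL);
    return !__builtin_mul_overflow(n, elem_size, bytes);
}

/* Allocate room for n elements, refusing requests whose size wraps. */
static void *alloc_array(size_t n, size_t elem_size)
{
    size_t bytes;

    if (!array_bytes(n, elem_size, &bytes))
        return NULL;
    return malloc(bytes);
}
```

With an unchecked multiplication, a wrapped product yields a buffer far
smaller than the caller believes, and later writes indexed by n run past
its end.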


## Statistical defenses

While many protections can be considered deterministic (e.g. read-only
memory cannot be written to), some protections provide only statistical
defense, in that an attack must gather enough information about a
running system to overcome the defense. While not perfect, these do
provide meaningful defenses.

### Canaries, blinding, and other secrets

It should be noted that things like the stack canary discussed earlier
are technically statistical defenses, since they rely on a secret value,
and such values may become discoverable through an information exposure
flaw.

Blinding literal values for things like JITs, where the executable
contents may be partially under the control of userspace, needs a similar
secret value.
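
To make the idea of blinding concrete, here is a hedged sketch in userspace
C. It assumes the JIT replaces a single "load immediate" with two
instructions whose operands are XORed with a secret key, so the value chosen
by userspace never appears verbatim in executable memory. The struct
blinded_load and both functions are hypothetical, and the key is passed in
as a parameter where a real JIT would draw it from the RNG.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical two-instruction sequence emitted instead of "mov reg, imm". */
struct blinded_load {
    uint32_t mov_imm; /* operand of: mov reg, mov_imm */
    uint32_t xor_imm; /* operand of: xor reg, xor_imm */
};

/* Split the userspace-supplied constant into two operands using a secret
 * key. A zero key would leave the constant unblinded, so it is rejected. */
static struct blinded_load blind_constant(uint32_t imm, uint32_t key)
{
    struct blinded_load seq;

    assert(key != 0);
    seq.mov_imm = imm ^ key;
    seq.xor_imm = key;
    return seq;
}

/* The value the register holds after both instructions have executed. */
static uint32_t run_blinded_load(struct blinded_load seq)
{
    return seq.mov_imm ^ seq.xor_imm;
}
```

The register ends up with the original constant, but an attacker who cannot
predict the key cannot choose the bytes that land in the instruction stream.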

It is critical that the secret values used be separate (e.g.
different canary per stack) and high entropy (e.g. is the RNG actually
working?) in order to maximize their success.

### Kernel Address Space Layout Randomization (KASLR)

Since the location of kernel memory is almost always instrumental in
mounting a successful attack, making the location non-deterministic
raises the difficulty of an exploit. (Note that this in turn makes
the value of information exposures higher, since they may be used to
discover desired memory locations.)

#### Text and module base

By relocating the physical and virtual base address of the kernel at
boot-time (CONFIG_RANDOMIZE_BASE), attacks needing kernel code will be
frustrated. Additionally, offsetting the module loading base address
means that even systems that load the same set of modules in the same
order every boot will not share a common base address with the rest of
the kernel text.

#### Stack base

If the base address of the kernel stack is not the same between processes,
or even not the same between syscalls, targets on or beyond the stack
become more difficult to locate.

#### Dynamic memory base

Much of the kernel's dynamic memory (e.g. kmalloc, vmalloc, etc) ends up
being relatively deterministic in layout due to the order of early-boot
initializations. If the base address of these areas is not the same
between boots, targeting them is frustrated, requiring an information
exposure specific to the region.

#### Structure layout

By performing a per-build randomization of the layout of sensitive
structures, attacks must either be tuned to known kernel builds or expose
enough kernel memory to determine structure layouts before manipulating
them.


## Preventing Information Exposures

Since the locations of sensitive structures are the primary target for
attacks, it is important to defend against exposure of both kernel memory
addresses and kernel memory contents (since they may contain kernel
addresses or other sensitive things like canary values).

### Unique identifiers

Kernel memory addresses must never be used as identifiers exposed to
userspace. Instead, use an atomic counter, an idr, or similar unique
identifier.
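
A small sketch of the atomic counter approach, written as userspace C11:
each object receives an opaque number at creation time, and only that number
is ever reported outside. The struct session, session_init, and
next_session_id are hypothetical names used for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct session {
    unsigned long id; /* safe to expose: reveals nothing about layout */
    /* ... private state would follow ... */
};

/* Monotonic source of identifiers, shared by all sessions. */
static atomic_ulong next_session_id = 1;

/* Assign the next identifier instead of reusing the object's address. */
static void session_init(struct session *s)
{
    assert(s != NULL);
    s->id = atomic_fetch_add(&next_session_id, 1);
}
```

Two sessions always get distinct identifiers, and neither value tells an
observer where the objects live in memory.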

### Memory initialization

Memory copied to userspace must always be fully initialized. If not
explicitly memset(), this will require changes to the compiler to make
sure structure holes are cleared.
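
The explicit memset() case can be sketched as follows in userspace C. The
struct reply and fill_reply are hypothetical, memcpy() stands in for the
copy to userspace, and the claim that member stores leave the cleared
padding alone reflects how GCC and Clang behave in practice rather than a
guarantee of the C standard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reply structure with a hole after "kind". */
struct reply {
    uint8_t kind;
    uint64_t value;
};

static_assert(offsetof(struct reply, value) > 1,
              "this sketch expects padding after kind");

/* Build the reply and copy its raw bytes, padding included, to dst.
 * dst must have room for sizeof(struct reply) bytes. */
static void fill_reply(void *dst, uint8_t kind, uint64_t value)
{
    struct reply r;

    assert(dst != NULL);
    memset(&r, 0, sizeof(r)); /* clears members and padding alike */
    r.kind = kind;
    r.value = value;
    memcpy(dst, &r, sizeof(r));
}
```

Without the memset(), the padding bytes would carry whatever was previously
on the stack, and the copy would hand them to the reader.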

### Memory poisoning

When releasing memory, it is best to poison the contents (clear stack on
syscall return, wipe heap memory on a free), to avoid reuse attacks that
rely on the old contents of memory. This frustrates many uninitialized
variable attacks, stack content exposures, heap content exposures, and
use-after-free attacks.
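
As an illustration of wiping on free, the userspace sketch below overwrites
an object's contents when it is released. The pool_obj type and its helpers
are hypothetical, and the poison byte 0x6b is simply a recognizable pattern
chosen for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define POOL_OBJ_SIZE 32
#define POISON_FREE   0x6b

struct pool_obj {
    unsigned char data[POOL_OBJ_SIZE];
    int in_use;
};

/* Hand the object out to a user. */
static void pool_get(struct pool_obj *o)
{
    assert(o != NULL);
    o->in_use = 1;
}

/* Release the object, overwriting whatever the previous user stored so a
 * later user (or a stale pointer) cannot read it back. */
static void pool_put(struct pool_obj *o)
{
    assert(o != NULL);
    memset(o->data, POISON_FREE, sizeof(o->data));
    o->in_use = 0;
}
```

A recognizable pattern also makes stale reads easier to spot while
debugging. Note that a compiler may drop a plain memset() on memory it can
prove is never read again, so real code needs a wipe that cannot be
optimized away.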

### Destination tracking

To help kill classes of bugs that result in kernel addresses being
written to userspace, the destination of writes needs to be tracked. If
the buffer is destined for userspace (e.g. seq_file backed /proc files),
it should automatically censor sensitive values.
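
One hypothetical shape for this is an output buffer that records where it is
headed, with the formatting helper consulting that flag before printing an
address. The names out_buf, to_user, and emit_pointer are assumptions made
for this userspace sketch and do not correspond to an existing kernel
interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Output buffer that remembers its destination. */
struct out_buf {
    char *data;
    size_t size;
    bool to_user; /* true when the contents will reach userspace */
};

/* Format a pointer into the buffer, censoring it for userspace readers.
 * Returns what snprintf() returns. */
static int emit_pointer(struct out_buf *out, const void *ptr)
{
    assert(out != NULL && out->data != NULL);
    if (out->to_user)
        return snprintf(out->data, out->size, "(censored)");
    return snprintf(out->data, out->size, "%p", ptr);
}
```

Because the decision lives in the helper rather than at each call site, a
caller that forgets about the destination still cannot leak the address.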