Microarchitectural Data Sampling (MDS) mitigation
=================================================

.. _mds:

Overview
--------

Microarchitectural Data Sampling (MDS) is a family of side channel attacks
on internal buffers in Intel CPUs. The variants are:

 - Microarchitectural Store Buffer Data Sampling (MSBDS) (CVE-2018-12126)
 - Microarchitectural Fill Buffer Data Sampling (MFBDS) (CVE-2018-12130)
 - Microarchitectural Load Port Data Sampling (MLPDS) (CVE-2018-12127)
 - Microarchitectural Data Sampling Uncacheable Memory (MDSUM) (CVE-2019-11091)

MSBDS leaks Store Buffer Entries which can be speculatively forwarded to a
dependent load (store-to-load forwarding) as an optimization. The forward
can also happen to a faulting or assisting load operation for a different
memory address, which can be exploited under certain conditions. Store
buffers are partitioned between Hyper-Threads so cross thread forwarding is
not possible. But if a thread enters or exits a sleep state the store
buffer is repartitioned which can expose data from one thread to the other.

MFBDS leaks Fill Buffer Entries. Fill buffers are used internally to manage
L1 miss situations and to hold data which is returned or sent in response
to a memory or I/O operation. Fill buffers can forward data to a load
operation and also write data to the cache. When the fill buffer is
deallocated it can retain the stale data of the preceding operations which
can then be forwarded to a faulting or assisting load operation, which can
be exploited under certain conditions. Fill buffers are shared between
Hyper-Threads so cross thread leakage is possible.

MLPDS leaks Load Port Data. Load ports are used to perform load operations
from memory or I/O. The received data is then forwarded to the register
file or a subsequent operation. In some implementations the Load Port can
contain stale data from a previous operation which can be forwarded to
faulting or assisting loads under certain conditions, which can in turn be
exploited. Load ports are shared between Hyper-Threads so cross thread
leakage is possible.

MDSUM is a special case of MSBDS, MFBDS and MLPDS. An uncacheable load from
memory that takes a fault or assist can leave data in a microarchitectural
structure that may later be observed using one of the same methods used by
MSBDS, MFBDS or MLPDS.

Exposure assumptions
--------------------

It is assumed that attack code resides in user space or in a guest, with one
exception. The rationale behind this assumption is that the code construct
needed for exploiting MDS requires:

 - to control the load to trigger a fault or assist

 - to have a disclosure gadget which exposes the speculatively accessed
   data for consumption through a side channel.

 - to control the pointer through which the disclosure gadget exposes the
   data

The existence of such a construct in the kernel cannot be excluded with
100% certainty, but the complexity involved makes it extremely unlikely.

There is one exception, which is untrusted BPF. The functionality of
untrusted BPF is limited, but it needs to be thoroughly investigated
whether it can be used to create such a construct.


Mitigation strategy
-------------------

All variants have the same mitigation strategy at least for the single CPU
thread case (SMT off): Force the CPU to clear the affected buffers.

This is achieved by using the otherwise unused and obsolete VERW
instruction in combination with a microcode update. The microcode clears
the affected CPU buffers when the VERW instruction is executed.

For virtualization there are two ways to achieve CPU buffer clearing:
either the modified VERW instruction or the L1D Flush command. The latter
is issued when L1TF mitigation is enabled, so the extra VERW can be
avoided. If the CPU is not affected by L1TF then VERW needs to be issued.
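
The choice between the two clearing methods on VM entry can be modelled
as a small decision function. This is only an illustrative sketch: the
enum, the function name and both parameters are hypothetical and do not
correspond to actual kernel symbols.

.. code-block:: c

   #include <stdbool.h>

   /* Hypothetical names; they only model the choice described above. */
   enum vmentry_clear {
       VMENTRY_CLEAR_NONE,      /* no buffer clearing needed */
       VMENTRY_CLEAR_L1D_FLUSH, /* L1TF mitigation already flushes L1D */
       VMENTRY_CLEAR_VERW,      /* issue the modified VERW instead */
   };

   static enum vmentry_clear pick_vmentry_clear(bool l1d_flush_enabled,
                                                bool mds_clear_enabled)
   {
       /* The L1D flush makes the extra VERW unnecessary. */
       if (l1d_flush_enabled)
           return VMENTRY_CLEAR_L1D_FLUSH;
       if (mds_clear_enabled)
           return VMENTRY_CLEAR_VERW;
       return VMENTRY_CLEAR_NONE;
   }

With both inputs set, the sketch picks the L1D flush; VERW is only chosen
when the L1D flush is not active.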

If the VERW instruction with the supplied segment selector argument is
executed on a CPU without the microcode update there is no side effect
other than a small number of pointlessly wasted CPU cycles.

This does not protect against cross Hyper-Thread attacks except for MSBDS,
which is only exploitable cross Hyper-Thread when one of the Hyper-Threads
enters a C-state.

The kernel provides a function to invoke the buffer clearing:

    mds_clear_cpu_buffers()
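
What such a function boils down to can be sketched in user space,
assuming GCC or Clang inline assembly on x86. The function name below is
hypothetical and this is not the kernel implementation. The sketch reads
the current DS selector into memory so that VERW gets a memory operand,
which to the best of current knowledge is the form required for the
buffer clearing side effect. Without the microcode update the instruction
only performs its legacy segment check.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   /* Hypothetical helper: returns true if a VERW was issued. */
   static inline bool verw_clear_sketch(void)
   {
   #if defined(__x86_64__) || defined(__i386__)
       uint16_t sel;

       /* Use the current data segment selector as the VERW argument. */
       __asm__ volatile("mov %%ds, %0" : "=r"(sel));
       /* Memory operand form; VERW modifies ZF, hence the "cc" clobber. */
       __asm__ volatile("verw %0" : : "m"(sel) : "cc");
       return true;
   #else
       /* VERW does not exist on other architectures. */
       return false;
   #endif
   }

Executing this from user space illustrates the instruction only. The
mitigation itself depends on the kernel issuing it at the transitions
listed below.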

The mitigation is invoked on kernel/userspace, hypervisor/guest and C-state
(idle) transitions.

As a special quirk to address virtualization scenarios where the host has
the microcode updated, but the hypervisor does not (yet) expose the
MD_CLEAR CPUID bit to guests, the kernel issues the VERW instruction in the
hope that it might actually clear the buffers. The state is reflected
accordingly.

According to current knowledge additional mitigations inside the kernel
itself are not required because the necessary gadgets to expose the leaked
data cannot be controlled in a way which allows exploitation from malicious
user space or VM guests.

Kernel internal mitigation modes
--------------------------------

 ======= ============================================================
 off      Mitigation is disabled. Either the CPU is not affected or
          mds=off is supplied on the kernel command line

 full     Mitigation is enabled. CPU is affected and MD_CLEAR is
          advertised in CPUID.

 vmwerv   Mitigation is enabled. CPU is affected and MD_CLEAR is not
          advertised in CPUID. That is mainly for virtualization
          scenarios where the host has the updated microcode but the
          hypervisor does not expose MD_CLEAR in CPUID. It's a best
          effort approach without guarantee.
 ======= ============================================================

If the CPU is affected and mds=off is not supplied on the kernel command
line then the kernel selects the appropriate mitigation mode depending on
the availability of the MD_CLEAR CPUID bit.
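
The selection rule from the table can be written down as a short
function. This is a sketch of the rule as described here, not the kernel
code; the enum, the function name and the parameters are hypothetical.

.. code-block:: c

   #include <stdbool.h>

   /* Hypothetical mirror of the three modes in the table above. */
   enum mds_mode_sketch {
       MDS_SKETCH_OFF,
       MDS_SKETCH_FULL,
       MDS_SKETCH_VMWERV,
   };

   static enum mds_mode_sketch mds_select_mode_sketch(bool cpu_affected,
                                                      bool cmdline_off,
                                                      bool md_clear)
   {
       /* Unaffected CPU or mds=off: nothing to do. */
       if (!cpu_affected || cmdline_off)
           return MDS_SKETCH_OFF;
       /* MD_CLEAR in CPUID means VERW is known to clear the buffers. */
       return md_clear ? MDS_SKETCH_FULL : MDS_SKETCH_VMWERV;
   }

For example, an affected CPU without MD_CLEAR in CPUID and without
mds=off ends up in the best effort vmwerv mode.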

Mitigation points
-----------------

1. Return to user space
^^^^^^^^^^^^^^^^^^^^^^^

   When transitioning from kernel to user space the CPU buffers are flushed
   on affected CPUs when the mitigation is not disabled on the kernel
   command line. The mitigation is enabled through the static key
   mds_user_clear.
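
   The gating can be modelled in plain C as follows. This is a simplified
   sketch: a boolean stands in for the static key, a counter stands in
   for the actual buffer clear, and all names are hypothetical.

   .. code-block:: c

      #include <stdbool.h>

      /* Stand-in for the static key mds_user_clear. */
      static bool user_clear_enabled;
      /* Counts simulated buffer clears so the effect is observable. */
      static unsigned int buffer_clears;

      /* Stand-in for the real buffer clearing function. */
      static void clear_cpu_buffers_stub(void)
      {
          buffer_clears++;
      }

      /* Models the last step before returning to user space. */
      static void exit_to_user_sketch(void)
      {
          if (user_clear_enabled)
              clear_cpu_buffers_stub();
          /* The actual return to user space would follow here. */
      }

   A real static key patches the branch at runtime, so the disabled case
   costs close to nothing. The boolean above only models the behaviour.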

   The mitigation is invoked in prepare_exit_to_usermode() which covers
   most of the kernel to user space transitions. There are a few exceptions
   which do not invoke prepare_exit_to_usermode() on return to user
   space. These exceptions use the paranoid exit code.

   - Non Maskable Interrupt (NMI):

     Access to sensitive data such as keys or credentials in the NMI
     context is mostly theoretical: The CPU can do prefetching or execute a
     misspeculated code path and thereby fetch data which might end up
     leaking through a buffer.

     But for mounting other attacks the kernel stack address of the task is
     already valuable information. So in full mitigation mode, the NMI is
     mitigated on the return from do_nmi() to provide almost complete
     coverage.

   - Double fault (#DF):

     A double fault is usually fatal, but the ESPFIX workaround, which can
     be triggered from user space through modify_ldt(2), is a recoverable
     double fault. #DF uses the paranoid exit path, so explicit mitigation
     in the double fault handler is required.

   - Machine Check Exception (#MC):

     Another corner case is a #MC which hits between the CPU buffer clear
     invocation and the actual return to user. As this still is in kernel
     space it takes the paranoid exit path which does not clear the CPU
     buffers. So the #MC handler repopulates the buffers to some
     extent. Machine checks are not reliably controllable and the window is
     extremely small so mitigation would just tick a checkbox that this
     theoretical corner case is covered. To keep the amount of special
     cases small, ignore #MC.

   - Debug Exception (#DB):

     This takes the paranoid exit path only when the INT1 breakpoint is in
     kernel space. #DB on a user space address takes the regular exit path,
     so no extra mitigation is required.


2. C-State transition
^^^^^^^^^^^^^^^^^^^^^

   When a CPU goes idle and enters a C-State the CPU buffers need to be
   cleared on affected CPUs when SMT is active. This addresses the
   repartitioning of the store buffer when one of the Hyper-Threads enters
   a C-State.

   When SMT is inactive, i.e. either the CPU does not support it or all
   sibling threads are offline, CPU buffer clearing is not required.

   The idle clearing is enabled on CPUs which are only affected by MSBDS
   and not by any other MDS variant. The other MDS variants cannot be
   protected against cross Hyper-Thread attacks because the Fill Buffer and
   the Load Ports are shared. So on CPUs affected by other variants, the
   idle clearing would be a window dressing exercise and is therefore not
   activated.

   The invocation is controlled by the static key mds_idle_clear which is
   switched depending on the chosen mitigation mode and the SMT state of
   the system.
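
   The conditions under which the idle clearing is switched on can be
   summarized in one predicate. This is a sketch of the rule described
   above with hypothetical names, not the kernel implementation.

   .. code-block:: c

      #include <stdbool.h>

      /*
       * Returns true when clearing the buffers before idle is useful:
       * the mitigation is not off, the CPU is affected by MSBDS only,
       * and SMT is active.
       */
      static bool idle_clear_wanted_sketch(bool mitigation_enabled,
                                           bool msbds_only,
                                           bool smt_active)
      {
          return mitigation_enabled && msbds_only && smt_active;
      }

   Because the SMT state can change at runtime, the predicate would have
   to be re-evaluated whenever sibling threads go online or offline.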

   The buffer clear is only invoked before entering the C-State to prevent
   stale data from the idling CPU from spilling to the Hyper-Thread
   sibling after the store buffer has been repartitioned and all entries
   are available to the non idle sibling.

   When coming out of idle the store buffer is partitioned again so each
   sibling has half of it available. The CPU coming back from idle could
   then be speculatively exposed to contents of the sibling. The buffers
   are flushed either on exit to user space or on VMENTER so malicious
   code in user space or the guest cannot speculatively access them.
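
   The resulting ordering around an idle period can be illustrated with a
   small event trace. The sketch records what happens in which order; all
   types and names are hypothetical and only model the description above.

   .. code-block:: c

      #include <stdbool.h>
      #include <stddef.h>

      enum idle_event {
          EV_CLEAR_BUFFERS,
          EV_ENTER_CSTATE,
          EV_WAKE,
      };

      struct idle_trace {
          enum idle_event ev[4];
          size_t n;
      };

      static void trace_add(struct idle_trace *t, enum idle_event e)
      {
          if (t->n < sizeof(t->ev) / sizeof(t->ev[0]))
              t->ev[t->n++] = e;
      }

      /* Models one idle period of a CPU. */
      static void idle_once_sketch(struct idle_trace *t,
                                   bool idle_clear_enabled)
      {
          /* Clear before the store buffer is handed to the sibling. */
          if (idle_clear_enabled)
              trace_add(t, EV_CLEAR_BUFFERS);
          trace_add(t, EV_ENTER_CSTATE);
          trace_add(t, EV_WAKE);
          /*
           * No clear on wakeup: that is left to the exit to user space
           * or VMENTER path.
           */
      }

   With idle clearing enabled the trace is clear, enter, wake. With it
   disabled only enter and wake are recorded.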

   The mitigation is hooked into all variants of halt()/mwait(), but does
   not cover the legacy ACPI IO-Port mechanism. The ACPI idle driver was
   superseded by the intel_idle driver around 2010, and intel_idle is
   preferred on all affected CPUs which are expected to gain the MD_CLEAR
   functionality in microcode. Aside from that, the IO-Port mechanism is a
   legacy interface which is only used on older systems which are either
   not affected or do not receive microcode updates anymore.