L1TF - L1 Terminal Fault
========================

L1 Terminal Fault is a hardware vulnerability which allows unprivileged
speculative access to data which is available in the Level 1 Data Cache
when the page table entry controlling the virtual address, which is used
for the access, has the Present bit cleared or other reserved bits set.

Affected processors
-------------------

This vulnerability affects a wide range of Intel processors. The
vulnerability is not present on:

   - Processors from AMD, Centaur and other non-Intel vendors

   - Older processor models, where the CPU family is < 6

   - A range of Intel ATOM processors (Cedarview, Cloverview, Lincroft,
     Penwell, Pineview, Silvermont, Airmont, Merrifield)

   - The Intel XEON PHI family

   - Intel processors which have the ARCH_CAP_RDCL_NO bit set in the
     IA32_ARCH_CAPABILITIES MSR. If the bit is set the CPU is not affected
     by the Meltdown vulnerability either. These CPUs should become
     available by the end of 2018.

Whether a processor is affected or not can be read out from the L1TF
vulnerability file in sysfs. See :ref:`l1tf_sys_info`.

Related CVEs
------------

The following CVE entries are related to the L1TF vulnerability:

   =============  =================  ==============================
   CVE-2018-3615  L1 Terminal Fault  SGX related aspects
   CVE-2018-3620  L1 Terminal Fault  OS, SMM related aspects
   CVE-2018-3646  L1 Terminal Fault  Virtualization related aspects
   =============  =================  ==============================

Problem
-------

If an instruction accesses a virtual address for which the relevant page
table entry (PTE) has the Present bit cleared or other reserved bits set,
then speculative execution ignores the invalid PTE and loads the referenced
data if it is present in the Level 1 Data Cache, as if the page referenced
by the address bits in the PTE was still present and accessible.

While this is a purely speculative mechanism and the instruction will raise
a page fault when it is retired eventually, the pure act of loading the
data and making it available to other speculative instructions opens up the
opportunity for side channel attacks by unprivileged malicious code,
similar to the Meltdown attack.

While Meltdown breaks the user space to kernel space protection, L1TF
allows attacking any physical memory address in the system and the attack
works across all protection domains. It allows an attack on SGX and also
works from inside virtual machines because the speculation bypasses the
extended page table (EPT) protection mechanism.


Attack scenarios
----------------

1. Malicious user space
^^^^^^^^^^^^^^^^^^^^^^^

   Operating systems store arbitrary information in the address bits of a
   PTE which is marked non-present. This allows a malicious user space
   application to attack the physical memory to which these PTEs resolve.
   In some cases user space can maliciously influence the information
   encoded in the address bits of the PTE, thus making attacks more
   deterministic and more practical.

   The Linux kernel contains a mitigation for this attack vector, PTE
   inversion, which is permanently enabled and has no performance
   impact. The kernel ensures that the address bits of PTEs, which are not
   marked present, never point to cacheable physical memory space.

   A system with an up-to-date kernel is protected against attacks from
   malicious user space applications.

2. Malicious guest in a virtual machine
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

   The fact that L1TF breaks all domain protections allows malicious guest
   OSes, which can control the PTEs directly, and malicious guest user
   space applications, which run on an unprotected guest kernel lacking the
   PTE inversion mitigation for L1TF, to attack physical host memory.

   A special aspect of L1TF in the context of virtualization is simultaneous
   multi-threading (SMT). The Intel implementation of SMT is called
   HyperThreading. The fact that Hyperthreads on the affected processors
   share the L1 Data Cache (L1D) is important for this. As the flaw only
   allows attacking data which is present in the L1D, a malicious guest
   running on one Hyperthread can attack the data which is brought into the
   L1D by the context which runs on the sibling Hyperthread of the same
   physical core. This context can be host OS, host user space or a
   different guest.

   If the processor does not support Extended Page Tables, the attack is
   only possible when the hypervisor does not sanitize the content of the
   effective (shadow) page tables.

   While solutions exist to mitigate these attack vectors fully, these
   mitigations are not enabled by default in the Linux kernel because they
   can affect performance significantly. The kernel provides several
   mechanisms which can be utilized to address the problem depending on the
   deployment scenario. The mitigations, their protection scope and impact
   are described in the next sections.

   The default mitigations and the rationale for choosing them are explained
   at the end of this document. See :ref:`default_mitigations`.

.. _l1tf_sys_info:

L1TF system information
-----------------------

The Linux kernel provides a sysfs interface to enumerate the current L1TF
status of the system: whether the system is vulnerable, and which
mitigations are active. The relevant sysfs file is:

   /sys/devices/system/cpu/vulnerabilities/l1tf

The possible values in this file are:

   ===========================  ===============================
   'Not affected'               The processor is not vulnerable
   'Mitigation: PTE Inversion'  The host protection is active
   ===========================  ===============================

If KVM/VMX is enabled and the processor is vulnerable then the following
information is appended to the 'Mitigation: PTE Inversion' part:

   - SMT status:

     =====================  ================
     'VMX: SMT vulnerable'  SMT is enabled
     'VMX: SMT disabled'    SMT is disabled
     =====================  ================

   - L1D Flush mode:

     ===============================  ====================================
     'L1D vulnerable'                 L1D flushing is disabled

     'L1D conditional cache flushes'  L1D flush is conditionally enabled

     'L1D cache flushes'              L1D flush is unconditionally enabled
     ===============================  ====================================

The resulting grade of protection is discussed in the following sections.
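As a quick check, the file can be read with standard tools. The sketch
below guards against the file being absent (older kernels without the L1TF
patches, or non-x86 systems):

```shell
# Read the L1TF vulnerability status from sysfs. The file only exists on
# kernels that carry the L1TF patches, hence the existence check.
l1tf_file=/sys/devices/system/cpu/vulnerabilities/l1tf
if [ -r "$l1tf_file" ]; then
    l1tf_status=$(cat "$l1tf_file")
else
    l1tf_status="unknown (sysfs file not present)"
fi
echo "L1TF status: $l1tf_status"
```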


Host mitigation mechanism
-------------------------

The kernel is unconditionally protected against L1TF attacks from malicious
user space running on the host.


Guest mitigation mechanisms
---------------------------

.. _l1d_flush:

1. L1D flush on VMENTER
^^^^^^^^^^^^^^^^^^^^^^^

   To make sure that a guest cannot attack data which is present in the L1D
   the hypervisor flushes the L1D before entering the guest.

   Flushing the L1D evicts not only the data which should not be accessed
   by a potentially malicious guest, it also flushes the guest
   data. Flushing the L1D has a performance impact as the processor has to
   bring the flushed guest data back into the L1D. Depending on the
   frequency of VMEXIT/VMENTER and the type of computations in the guest,
   performance degradation in the range of 1% to 50% has been observed. For
   scenarios where guest VMEXIT/VMENTER are rare the performance impact is
   minimal. Virtio and mechanisms like posted interrupts are designed to
   confine the VMEXITs to a bare minimum, but specific configurations and
   application scenarios might still suffer from a high VMEXIT rate.

   The kernel provides two L1D flush modes:

      - conditional ('cond')
      - unconditional ('always')

   The conditional mode avoids L1D flushing after VMEXITs which execute
   only audited code paths before the corresponding VMENTER. These code
   paths have been verified not to expose secrets or other interesting
   data to an attacker, but they can leak information about the address
   space layout of the hypervisor.

   Unconditional mode flushes L1D on all VMENTER invocations and provides
   maximum protection. It has a higher overhead than the conditional
   mode. The overhead cannot be quantified precisely as it depends on the
   workload scenario and the resulting number of VMEXITs.

   The general recommendation is to enable L1D flush on VMENTER. The kernel
   defaults to conditional mode on affected processors.

   **Note** that L1D flush does not prevent the SMT problem because the
   sibling thread will also bring back its data into the L1D which makes it
   attackable again.

   L1D flush can be controlled by the administrator via the kernel command
   line and sysfs control files. See :ref:`mitigation_control_command_line`
   and :ref:`mitigation_control_kvm`.

.. _guest_confinement:

2. Guest VCPU confinement to dedicated physical cores
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

   To address the SMT problem, it is possible to make a guest or a group of
   guests affine to one or more physical cores. The proper mechanism for
   that is to utilize exclusive cpusets to ensure that no other guest or
   host tasks can run on these cores.

   If only a single guest or related guests run on sibling SMT threads on
   the same physical core then they can only attack their own memory and
   restricted parts of the host memory.

   Host memory is attackable when one of the sibling SMT threads runs in
   host OS (hypervisor) context and the other in guest context. The amount
   of valuable information from the host OS context depends on the context
   which the host OS executes, i.e. interrupts, soft interrupts and kernel
   threads. The amount of valuable data from these contexts cannot be
   declared as non-interesting for an attacker without deep inspection of
   the code.

   **Note** that assigning guests to a fixed set of physical cores affects
   the ability of the scheduler to do load balancing and might have
   negative effects on CPU utilization depending on the hosting
   scenario. Disabling SMT might be a viable alternative for particular
   scenarios.

   For further information about confining guests to a single or to a group
   of cores consult the cpusets documentation:

   https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt

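The confinement step can be sketched with a cgroup-v1 cpuset. This is an
illustrative sketch only: the mount path, the cpuset name 'guest1', the
core list and the QEMU_PID variable are assumptions for the example, not
fixed interfaces, and the commands require root with the cpuset controller
mounted.

```shell
# Hypothetical sketch: reserve physical cores 2-3 exclusively for one
# guest via a cgroup-v1 cpuset. Adjust paths and names for your setup.
CPUSET=/sys/fs/cgroup/cpuset
if [ -d "$CPUSET" ] && [ -w "$CPUSET" ]; then
    mkdir -p "$CPUSET/guest1"
    echo 2-3 > "$CPUSET/guest1/cpuset.cpus"          # dedicated cores
    echo 0   > "$CPUSET/guest1/cpuset.mems"          # memory node 0
    echo 1   > "$CPUSET/guest1/cpuset.cpu_exclusive" # no other users
    # move the guest's main process (PID assumed in QEMU_PID) into the set
    echo "$QEMU_PID" > "$CPUSET/guest1/tasks"
fi
```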
.. _interrupt_isolation:

3. Interrupt affinity
^^^^^^^^^^^^^^^^^^^^^

   Interrupts can be made affine to logical CPUs. This is not universally
   true because there are types of interrupts which are truly per-CPU
   interrupts, e.g. the local timer interrupt. Aside from that, multi queue
   devices affine their interrupts to single CPUs or groups of CPUs per
   queue without allowing the administrator to control the affinities.

   Moving the interrupts which can be affinity controlled away from CPUs
   which run untrusted guests reduces the attack vector space.

   Whether the interrupts which are affine to CPUs running untrusted
   guests provide interesting data for an attacker depends on the system
   configuration and the scenarios which run on the system. While for some
   of the interrupts it can be assumed that they won't expose interesting
   information beyond exposing hints about the host OS memory layout, there
   is no way to make general assumptions.

   Interrupt affinity can be controlled by the administrator via the
   /proc/irq/$NR/smp_affinity[_list] files. Limited documentation is
   available at:

   https://www.kernel.org/doc/Documentation/IRQ-affinity.txt

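A minimal sketch of inspecting and steering one interrupt follows. The IRQ
number 24 is an arbitrary example (real numbers are listed in
/proc/interrupts), and the write shown in the comment requires root;
per-CPU and kernel-managed interrupts reject such writes.

```shell
# Show which CPUs an example IRQ is currently allowed to run on.
IRQ=24
affinity_file="/proc/irq/$IRQ/smp_affinity_list"
if [ -r "$affinity_file" ]; then
    echo "IRQ $IRQ allowed on CPUs: $(cat "$affinity_file")"
fi
# To restrict the IRQ to CPUs 0-3, away from guest CPUs 4-7 (as root):
#   echo 0-3 > /proc/irq/24/smp_affinity_list
# equivalent bitmask form:
#   echo 0f > /proc/irq/24/smp_affinity
```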
.. _smt_control:

4. SMT control
^^^^^^^^^^^^^^

   To prevent the SMT issues of L1TF it might be necessary to disable SMT
   completely. Disabling SMT can have a significant performance impact, but
   the impact depends on the hosting scenario and the type of workloads.
   The impact of disabling SMT also needs to be weighed against the impact
   of other mitigation solutions like confining guests to dedicated cores.

   The kernel provides a sysfs interface to retrieve the status of SMT and
   to control it. It also provides a kernel command line interface to
   control SMT.

   The kernel command line interface consists of the following options:

     ===========  ==========================================================
     nosmt        Affects the bring up of the secondary CPUs during boot.
                  The kernel tries to bring all present CPUs online during
                  the boot process. "nosmt" makes sure that from each
                  physical core only one - the so-called primary (hyper)
                  thread - is activated. Due to a design flaw of Intel
                  processors related to Machine Check Exceptions the
                  non-primary siblings have to be brought up at least
                  partially and are then shut down again. "nosmt" can be
                  undone via the sysfs interface.

     nosmt=force  Has the same effect as "nosmt" but it does not allow
                  undoing the SMT disable via the sysfs interface.
     ===========  ==========================================================

   The sysfs interface provides two files:

   - /sys/devices/system/cpu/smt/control
   - /sys/devices/system/cpu/smt/active

   /sys/devices/system/cpu/smt/control:

     This file allows reading out the SMT control state and provides the
     ability to disable or (re)enable SMT. The possible states are:

     ==============  ===================================================
     on              SMT is supported by the CPU and enabled. All
                     logical CPUs can be onlined and offlined without
                     restrictions.

     off             SMT is supported by the CPU and disabled. Only
                     the so-called primary SMT threads can be onlined
                     and offlined without restrictions. An attempt to
                     online a non-primary sibling is rejected.

     forceoff        Same as 'off' but the state cannot be changed.
                     Attempts to write to the control file are rejected.

     notsupported    The processor does not support SMT. It's therefore
                     not affected by the SMT implications of L1TF.
                     Attempts to write to the control file are rejected.
     ==============  ===================================================

   The possible states which can be written into this file to control SMT
   state are:

     - on
     - off
     - forceoff

   /sys/devices/system/cpu/smt/active:

     This file reports whether SMT is enabled and active, i.e. if on any
     physical core two or more sibling threads are online.

   SMT control is also possible at boot time via the l1tf kernel command
   line parameter in combination with L1D flush control. See
   :ref:`mitigation_control_command_line`.

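The sysfs interface above can be exercised as follows; the state change is
shown only as a comment because it affects the whole machine:

```shell
# Read the current SMT control state; the interface only exists on
# kernels with SMT control support, hence the guard.
smt_control=/sys/devices/system/cpu/smt/control
if [ -r "$smt_control" ]; then
    smt_state=$(cat "$smt_control")
else
    smt_state="notsupported (interface not present)"
fi
echo "SMT control state: $smt_state"
# To disable SMT at runtime (root; rejected when the state is 'forceoff'):
#   echo off > /sys/devices/system/cpu/smt/control
```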
5. Disabling EPT
^^^^^^^^^^^^^^^^

   Disabling EPT for virtual machines provides full mitigation for L1TF even
   with SMT enabled, because the effective page tables for guests are
   managed and sanitized by the hypervisor. However, disabling EPT has a
   significant performance impact, especially when the Meltdown mitigation
   KPTI is enabled.

   EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.

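The current EPT setting can be inspected through the module parameter in
sysfs; the modprobe.d file name in the comment is an example, not a fixed
convention:

```shell
# Check whether EPT is enabled in kvm_intel (prints Y or N); the
# parameter file only exists while the module is loaded.
ept_param=/sys/module/kvm_intel/parameters/ept
if [ -r "$ept_param" ]; then
    echo "EPT enabled: $(cat "$ept_param")"
else
    echo "kvm_intel not loaded"
fi
# Disable EPT at module load time (cannot be toggled at runtime):
#   modprobe kvm_intel ept=0
# or persistently:
#   echo "options kvm_intel ept=0" > /etc/modprobe.d/kvm-l1tf.conf
```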
There is ongoing research and development for new mitigation mechanisms to
address the performance impact of disabling SMT or EPT.

.. _mitigation_control_command_line:

Mitigation control on the kernel command line
---------------------------------------------

The kernel command line allows controlling the L1TF mitigations at boot
time with the option "l1tf=". The valid arguments for this option are:

  ============  =============================================================
  full          Provides all available mitigations for the L1TF
                vulnerability. Disables SMT and enables all mitigations in
                the hypervisors, i.e. unconditional L1D flushing.

                SMT control and L1D flush control via the sysfs interface
                is still possible after boot. Hypervisors will issue a
                warning when the first VM is started in a potentially
                insecure configuration, i.e. SMT enabled or L1D flush
                disabled.

  full,force    Same as 'full', but disables SMT and L1D flush runtime
                control. Implies the 'nosmt=force' command line option.
                (i.e. sysfs control of SMT is disabled.)

  flush         Leaves SMT enabled and enables the default hypervisor
                mitigation, i.e. conditional L1D flushing.

                SMT control and L1D flush control via the sysfs interface
                is still possible after boot. Hypervisors will issue a
                warning when the first VM is started in a potentially
                insecure configuration, i.e. SMT enabled or L1D flush
                disabled.

  flush,nosmt   Disables SMT and enables the default hypervisor mitigation,
                i.e. conditional L1D flushing.

                SMT control and L1D flush control via the sysfs interface
                is still possible after boot. Hypervisors will issue a
                warning when the first VM is started in a potentially
                insecure configuration, i.e. SMT enabled or L1D flush
                disabled.

  flush,nowarn  Same as 'flush', but hypervisors will not warn when a VM is
                started in a potentially insecure configuration.

  off           Disables hypervisor mitigations and doesn't emit any
                warnings.
  ============  =============================================================

The default is 'flush'. For details about L1D flushing see :ref:`l1d_flush`.

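A typical workflow for selecting a mode, sketched for a GRUB 2 based
distribution (file locations and the grub-mkconfig output path vary by
distribution):

```shell
# Hypothetical GRUB 2 workflow for booting with e.g. l1tf=full:
#   1. add "l1tf=full" to GRUB_CMDLINE_LINUX in /etc/default/grub
#   2. grub-mkconfig -o /boot/grub/grub.cfg   (path varies by distro)
#   3. reboot
# After boot, the chosen mode is visible in /proc/cmdline:
l1tf_opt=$(grep -o 'l1tf=[^ ]*' /proc/cmdline \
           || echo "l1tf not set (default 'flush')")
echo "$l1tf_opt"
```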
.. _mitigation_control_kvm:

Mitigation control for KVM - module parameter
---------------------------------------------

The KVM hypervisor mitigation mechanism, flushing the L1D cache when
entering a guest, can be controlled with a module parameter.

The option/parameter is "kvm-intel.vmentry_l1d_flush=". It takes the
following arguments:

  ============  ==============================================================
  always        L1D cache flush on every VMENTER.

  cond          Flush L1D on VMENTER only when the code between VMEXIT and
                VMENTER can leak host memory which is considered interesting
                for an attacker. This still can leak host memory which
                allows e.g. determining the host's address space layout.

  never         Disables the mitigation
  ============  ==============================================================

The parameter can be provided on the kernel command line, as a module
parameter when loading the modules, and modified at runtime via the sysfs
file:

/sys/module/kvm_intel/parameters/vmentry_l1d_flush

The default is 'cond'. If 'l1tf=full,force' is given on the kernel command
line, then 'always' is enforced and the kvm-intel.vmentry_l1d_flush
module parameter is ignored and writes to the sysfs file are rejected.

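The active flush mode can be inspected via the sysfs file named above; the
runtime change is shown only as a comment since it alters host behaviour:

```shell
# Inspect the active KVM L1D flush mode; the parameter file only exists
# while the kvm_intel module is loaded, hence the guard.
flush_param=/sys/module/kvm_intel/parameters/vmentry_l1d_flush
if [ -r "$flush_param" ]; then
    flush_mode=$(cat "$flush_param")
else
    flush_mode="unavailable (kvm_intel not loaded)"
fi
echo "vmentry_l1d_flush: $flush_mode"
# Change at runtime (root; rejected when booted with l1tf=full,force):
#   echo always > /sys/module/kvm_intel/parameters/vmentry_l1d_flush
```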
.. _mitigation_selection:

Mitigation selection guide
--------------------------

1. No virtualization in use
^^^^^^^^^^^^^^^^^^^^^^^^^^^

   The system is protected by the kernel unconditionally and no further
   action is required.

2. Virtualization with trusted guests
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

   If the guest comes from a trusted source and the guest OS kernel is
   guaranteed to have the L1TF mitigations in place the system is fully
   protected against L1TF and no further action is required.

   To avoid the overhead of the default L1D flushing on VMENTER the
   administrator can disable the flushing via the kernel command line and
   sysfs control files. See :ref:`mitigation_control_command_line` and
   :ref:`mitigation_control_kvm`.


3. Virtualization with untrusted guests
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

3.1. SMT not supported or disabled
""""""""""""""""""""""""""""""""""

   If SMT is not supported by the processor or disabled in the BIOS or by
   the kernel, it's only required to enforce L1D flushing on VMENTER.

   Conditional L1D flushing is the default behaviour and can be tuned. See
   :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.

3.2. EPT not supported or disabled
""""""""""""""""""""""""""""""""""

   If EPT is not supported by the processor or disabled in the hypervisor,
   the system is fully protected. SMT can stay enabled and L1D flushing on
   VMENTER is not required.

   EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.

3.3. SMT and EPT supported and active
"""""""""""""""""""""""""""""""""""""

   If SMT and EPT are supported and active then various degrees of
   mitigations can be employed:

   - L1D flushing on VMENTER:

     L1D flushing on VMENTER is the minimal protection requirement, but it
     is only potent in combination with other mitigation methods.

     Conditional L1D flushing is the default behaviour and can be tuned. See
     :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.

   - Guest confinement:

     Confinement of guests to a single or a group of physical cores which
     are not running any other processes can reduce the attack surface
     significantly, but interrupts, soft interrupts and kernel threads can
     still expose valuable data to a potential attacker. See
     :ref:`guest_confinement`.

   - Interrupt isolation:

     Isolating the guest CPUs from interrupts can reduce the attack surface
     further, but still allows a malicious guest to explore a limited amount
     of host physical memory. This can at least be used to gain knowledge
     about the host address space layout. The interrupts which have a fixed
     affinity to the CPUs which run the untrusted guests can, depending on
     the scenario, still trigger soft interrupts and schedule kernel threads
     which might expose valuable information. See
     :ref:`interrupt_isolation`.

The above three mitigation methods combined can provide protection to a
certain degree, but the risk of the remaining attack surface has to be
carefully analyzed. For full protection the following methods are
available:

   - Disabling SMT:

     Disabling SMT and enforcing the L1D flushing provides the maximum
     amount of protection. This mitigation does not depend on any of the
     above mitigation methods.

     SMT control and L1D flushing can be tuned by the command line
     parameters 'nosmt', 'l1tf', 'kvm-intel.vmentry_l1d_flush' and at run
     time with the matching sysfs control files. See :ref:`smt_control`,
     :ref:`mitigation_control_command_line` and
     :ref:`mitigation_control_kvm`.

   - Disabling EPT:

     Disabling EPT provides the maximum amount of protection as well. It
     does not depend on any of the above mitigation methods. SMT can stay
     enabled and L1D flushing is not required, but the performance impact is
     significant.

     EPT can be disabled in the hypervisor via the 'kvm-intel.ept'
     parameter.

3.4. Nested virtual machines
""""""""""""""""""""""""""""

When nested virtualization is in use, three operating systems are involved:
the bare metal hypervisor, the nested hypervisor and the nested virtual
machine. VMENTER operations from the nested hypervisor into the nested
guest will always be processed by the bare metal hypervisor. If KVM is the
bare metal hypervisor it will:

 - Flush the L1D cache on every switch from the nested hypervisor to the
   nested virtual machine, so that the nested hypervisor's secrets are not
   exposed to the nested virtual machine;

 - Flush the L1D cache on every switch from the nested virtual machine to
   the nested hypervisor; this is a complex operation, and flushing the L1D
   cache prevents the bare metal hypervisor's secrets from being exposed to
   the nested virtual machine;

 - Instruct the nested hypervisor to not perform any L1D cache flush. This
   is an optimization to avoid double L1D flushing.


.. _default_mitigations:

Default mitigations
-------------------

  The kernel default mitigations for vulnerable processors are:

  - PTE inversion to protect against malicious user space. This is done
    unconditionally and cannot be controlled.

  - L1D conditional flushing on VMENTER when EPT is enabled for
    a guest.

  The kernel does not by default enforce the disabling of SMT, which leaves
  SMT systems vulnerable when running untrusted guests with EPT enabled.

  The rationale for this choice is:

  - Force disabling SMT can break existing setups, especially with
    unattended updates.

  - If regular users run untrusted guests on their machine, then L1TF is
    just an add-on to other malware which might be embedded in an untrusted
    guest, e.g. spam-bots or attacks on the local network.

    There is no technical way to prevent a user from blindly running
    untrusted code on their machine.

  - It's technically extremely unlikely and, from today's knowledge, even
    impossible that L1TF can be exploited via the most popular attack
    mechanisms like JavaScript because these mechanisms have no way to
    control PTEs. If that were possible and no other mitigation were
    available, then the default might be different.

  - The administrators of cloud and hosting setups have to carefully
    analyze the risk for their scenarios and make the appropriate
    mitigation choices, which might even vary across their deployed
    machines and also result in other changes to their overall setup.
    There is no way for the kernel to provide a sensible default for this
    kind of scenario.