Arrow USB Blaster integrated on MAX1000 board uses the same vendor ID
(0x0403) and product ID (0x6010) as the "original" FTDI device.
This patch avoids picking up by ftdi_sio of the first interface of this
USB device. After that this device can be used by Arrow user-space JTAG
driver.
Add simple driver for libtransistor USB console.
This device is implemented in software:
https://github.com/reswitched/libtransistor/blob/development/lib/usb_serial.c
Signed-off-by: Collin May <collin@collinswebsite.com> Cc: stable <stable@vger.kernel.org> Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
It is incomplete and causes hangs on devices when shutting down. It
needs a much more "complete" fix in order to work properly. As that fix
has not been merged, revert this patch for now before it causes any more
problems.
Until the primary_crng is fully initialized, don't initialize the NUMA
crng nodes. Otherwise users of /dev/urandom on NUMA systems before
the CRNG is fully initialized can get very bad quality randomness. Of
course everyone should move to getrandom(2) where this won't be an
issue, but there's a lot of legacy code out there. This related to
CVE-2018-1108.
Currently in ext4_valid_block_bitmap() we expect the bitmap to be
positioned anywhere between 0 and s_blocksize clusters, but that's
wrong because the bitmap can be placed anywhere in the block group. This
causes false positives when validating bitmaps on perfectly valid file
system layouts. Fix it by checking whether the bitmap is within the group
boundary.
The problem can be reproduced using the following
mkfs -t ext3 -E stride=256 /dev/vdb1
mount /dev/vdb1 /mnt/test
cd /mnt/test
wget https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.16.3.tar.xz
tar xf linux-4.16.3.tar.xz
An privileged attacker can cause a crash by mounting a crafted ext4
image which triggers a out-of-bounds read in the function
ext4_valid_block_bitmap() in fs/ext4/balloc.c.
If ext4 tries to start a reserved handle via
jbd2_journal_start_reserved(), and the journal has been aborted, this
can result in a NULL pointer dereference. This is because the fields
h_journal and h_transaction in the handle structure share the same
memory, via a union, so jbd2_journal_start_reserved() will clear
h_journal before calling start_this_handle(). If this function fails
due to an aborted handle, h_journal will still be NULL, and the call
to jbd2_journal_free_reserved() will pass a NULL journal to
sub_reserve_credits().
This can be reproduced by running "kvm-xfstests -c dioread_nolock
generic/475".
Cc: stable@kernel.org # 3.11 Fixes: 8f7d89f36829b ("jbd2: transaction reservation support") Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Kai-Heng Feng [Fri, 22 Jun 2018 18:50:47 +0000 (02:50 +0800)]
xhci: Fix USB ports for Dell Inspiron 5775
BugLink: https://bugs.launchpad.net/bugs/1756700
The Dell Inspiron 5775 is a Raven Ridge. The Enable Slot command timed
out when a USB device gets plugged:
[ 212.156326] xhci_hcd 0000:03:00.3: Error while assigning device slot ID
[ 212.156340] xhci_hcd 0000:03:00.3: Max number of devices this xHCI host supports is 64.
[ 212.156348] usb usb2-port3: couldn't allocate usb_device
AMD suggests that a delay before xHC suspends can fix the issue.
I can confirm it fixes the issue, so use the suspend delay quirk for
Raven Ridge's xHC.
Cc: stable@vger.kernel.org Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 621faf4f6a181b6e012c1d1865213f36f4159b7f) Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Acked-by: Anthony Wong <anthony.wong@canonical.com> Acked-by: Po-Hsu Lin <po-hsu.lin@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
ixgbe/ixgbevf: Free IRQ when PCI error recovery removes the device
BugLink: http://bugs.launchpad.net/bugs/1776389
Since commit f7f37e7ff2b9 ("ixgbe: handle close/suspend race with
netif_device_detach/present") ixgbe_close_suspend is called, from
ixgbe_close, only if the device is present, i.e. if it isn't detached.
That exposed a situation where IRQs weren't freed if a PCI error
recovery system opts to remove the device. For such case the pci channel
state is set to pci_channel_io_perm_failure and ixgbe_io_error_detected
was returning PCI_ERS_RESULT_DISCONNECT before calling
ixgbe_close_suspend consequentially not freeing IRQ and crashing when
the remove handler calls pci_disable_device, hitting a BUG_ON at
free_msi_irqs, which asserts that there is no non-free IRQ associated
with the device to be removed:
BUG_ON(irq_has_action(entry->irq + i));
The issue is fixed by calling the ixgbe_close_suspend before evaluate
the pci channel state.
Reported-by: Naresh Bannoth <nbannoth@in.ibm.com> Reported-by: Abdul Haleem <abdhalee@in.ibm.com> Signed-off-by: Mauro S M Rodrigues <maurosr@linux.vnet.ibm.com> Reviewed-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit b212d815e77c72be921979119c715166cc8987b1) Signed-off-by: Mauro S. M. Rodrigues <maurosr@linux.vnet.ibm.com> Acked-by: Khalid Elmously <khalid.elmously@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Dave Carroll [Wed, 13 Jun 2018 15:21:09 +0000 (11:21 -0400)]
scsi: aacraid: Correct hba_send to include iu_type
BugLink: http://bugs.launchpad.net/bugs/1770095
commit b60710ec7d7a ("scsi: aacraid: enable sending of TMFs from
aac_hba_send()") allows aac_hba_send() to send scsi commands, and TMF
requests, but the existing code only updates the iu_type for scsi
commands. For TMF requests we are sending an unknown iu_type to
firmware, which causes a fault.
Include iu_type prior to determining the validity of the command
Reported-by: Noah Misner <nmisner@us.ibm.com> Fixes: b60710ec7d7ab ("aacraid: enable sending of TMFs from aac_hba_send()") Fixes: 423400e64d377 ("aacraid: Include HBA direct interface") Tested-by: Noah Misner <nmisner@us.ibm.com>
cc: stable@vger.kernel.org Signed-off-by: Dave Carroll <david.carroll@microsemi.com> Reviewed-by: Raghava Aditya Renukunta <RaghavaAditya.Renukunta@microsemi.com> Reviewed-by: Brian King <brking@linux.vnet.ibm.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from linux-next commit 7d3af7d96af7b9f51e1ef67b6f4725f545737da2) Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com> Acked-by: Khalid Elmously <khalid.elmously@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
s390/archrandom: Rework arch random implementation.
BugLink: http://bugs.launchpad.net/bugs/1775391
The arch_get_random_seed_long() invocation done by the random device
driver is done in interrupt context and may be invoked very very
frequently. The existing s390 arch_get_random_seed*() implementation
uses the PRNO(TRNG) instruction which produces excellent high quality
entropy but is relatively slow and thus expensive.
This fix reworks the arch_get_random_seed* implementation. It
introduces a buffer concept to decouple the delivery of random data
via arch_get_random_seed*() from the generation of new random
bytes. The buffer of random data is filled asynchronously by a
workqueue thread.
If there are enough bytes in the buffer the s390_arch_random_generate()
just delivers these bytes. Otherwise false is returned until the worker
thread refills the buffer.
The worker fills the rng buffer by pulling fresh entropy from the
high quality (but slow) true hardware random generator. This entropy
is then spread over the buffer with an pseudo random generator.
As the arch_get_random_seed_long() fetches 8 bytes and the calling
function add_interrupt_randomness() counts this as 1 bit entropy the
distribution needs to make sure there is in fact 1 bit entropy
contained in 8 bytes of the buffer. The current values pull 32 byte
entropy and scatter this into a 2048 byte buffer. So 8 byte in the
buffer will contain 1 bit of entropy.
The worker thread is rescheduled based on the charge level of the
buffer but at least with 500 ms delay to avoid too much cpu consumption.
So the max. amount of rng data delivered via arch_get_random_seed is
limited to 4Kb per second.
Signed-off-by: Harald Freudenberger <freude@de.ibm.com> Reviewed-by: Patrick Steuer <patrick.steuer@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
(cherry picked from commit 966f53e750aedc5f59f9ccae6bbfb8f671c7c842) Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com> Acked-by: Khalid Elmously <khalid.elmously@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
s390/zcrypt: Fix CCA and EP11 CPRB processing failure memory leak.
BugLink: http://bugs.launchpad.net/bugs/1775390
Tests showed, that the zcrypt device driver produces memory
leaks when a valid CCA or EP11 CPRB can't get delivered or has
a failure during processing within the zcrypt device driver.
This happens when a invalid domain or adapter number is used
or the lower level software or hardware layers produce any
kind of failure during processing of the request.
Only CPRBs send to CCA or EP11 cards can produce this memory
leak. The accelerator and the CPRBs processed by this type
of crypto card is not affected.
The two fields message and private within the ap_message struct
are allocated with pulling the function code for the CPRB but
only freed when processing of the CPRB succeeds. So for example
an invalid domain or adapter field causes the processing to
fail, leaving these two memory areas allocated forever.
Signed-off-by: Harald Freudenberger <freude@de.ibm.com> Reviewed-by: Ingo Franzki <ifranzki@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
(cherry picked from commit 89a0c0ec0d2e3ce0ee9caa00f60c0c26ccf11c21) Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com> Acked-by: Khalid Elmously <khalid.elmously@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Vaibhav Jain [Fri, 18 May 2018 09:42:23 +0000 (15:12 +0530)]
cxl: Disable prefault_mode in Radix mode
BugLink: http://bugs.launchpad.net/bugs/1774471
Currently we see a kernel-oops reported on Power-9 while attaching a
context to an AFU, with radix-mode and sysfs attr 'prefault_mode' set
to anything other than 'none'. The backtrace of the oops is of this
form:
Unable to handle kernel paging request for data at address 0x00000080
Faulting instruction address: 0xc00800000bcf3b20
cpu 0x1: Vector: 300 (Data Access) at [c00000037f003800]
pc: c00800000bcf3b20: cxl_load_segment+0x178/0x290 [cxl]
lr: c00800000bcf39f0: cxl_load_segment+0x48/0x290 [cxl]
sp: c00000037f003a80
msr: 9000000000009033
dar: 80
dsisr: 40000000
current = 0xc00000037f280000
paca = 0xc0000003ffffe600 softe: 3 irq_happened: 0x01
pid = 3529, comm = afp_no_int
<snip>
cxl_prefault+0xfc/0x248 [cxl]
process_element_entry_psl9+0xd8/0x1a0 [cxl]
cxl_attach_dedicated_process_psl9+0x44/0x130 [cxl]
native_attach_process+0xc0/0x130 [cxl]
afu_ioctl+0x3f4/0x5e0 [cxl]
do_vfs_ioctl+0xdc/0x890
ksys_ioctl+0x68/0xf0
sys_ioctl+0x40/0xa0
system_call+0x58/0x6c
The issue is caused as on Power-8 the AFU attr 'prefault_mode' was
used to improve initial storage fault performance by prefaulting
process segments. However on Power-9 with radix mode we don't have
Storage-Segments that we can prefault. Also prefaulting process Pages
will be too costly and fine-grained.
Hence, since the prefaulting mechanism doesn't makes sense of
radix-mode, this patch updates prefault_mode_store() to not allow any
other value apart from CXL_PREFAULT_NONE when radix mode is enabled.
Fixes: f24be42aab37 ("cxl: Add psl9 specific code") Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com> Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from linux-next commit b6c84ba22ff3a198eb8d5552cf9b8fda1d792e54) Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com> Acked-by: Khalid Elmously <khalid.elmously@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
cxl: Configure PSL to not use APC virtual machines
BugLink: http://bugs.launchpad.net/bugs/1774471
APC virtual machines arent used on POWER-9 chips and are already
disabled in on-chip CAPP. They also need to be disabled on the PSL via
'PSL Data Send Control Register' by setting bit(47). This forces the
PSL to send commands to CAPP with queue.id == 0.
Fixes: 5632874311db ("cxl: Add support for POWER9 DD2") Cc: stable@vger.kernel.org # v4.15+ Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com> Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Reviewed-by: Alastair D'Silva <alastair@d-silva.org> Reviewed-by: Christophe Lombard <clombard@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from linux-next commit 9a6d2022bacd8fca0be6297459a02dfd28dad6ba) Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com> Acked-by: Khalid Elmously <khalid.elmously@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
BugLink: http://bugs.launchpad.net/bugs/1774471
Failure to synchronize the tunneled operations does not prevent
the initialization of the cxl card. This patch reports the tunneled
operations status via /sys.
Signed-off-by: Philippe Bergheaud <felix@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 497a0790e2c604366b9e35dcb41310319e9bca13) Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com> Acked-by: Khalid Elmously <khalid.elmously@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
cxl: Set the PBCQ Tunnel BAR register when enabling capi mode
BugLink: http://bugs.launchpad.net/bugs/1774471
Skiboot used to set the default Tunnel BAR register value when capi
mode was enabled. This approach was ok for the cxl driver, but
prevented other drivers from choosing different values.
Skiboot versions > 5.11 will not set the default value any longer.
This patch modifies the cxl driver to set/reset the Tunnel BAR
register when entering/exiting the cxl mode, with
pnv_pci_set_tunnel_bar().
That should work with old skiboot (since we are re-writing the value
already set) and new skiboot.
mpe: The tunnel support was only merged into Linux recently, in commit d6a90bb83b50 ("powerpc/powernv: Enable tunneled operations")
(v4.17-rc1), so with new skiboot kernels between that commit and this
will not work correctly.
Fixes: d6a90bb83b50 ("powerpc/powernv: Enable tunneled operations") Signed-off-by: Philippe Bergheaud <felix@linux.ibm.com> Reviewed-by: Christophe Lombard <clombard@linux.vnet.ibm.com> Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 401dca8cbd14fc4b32d93499dcd12a1711a73ecc) Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com> Acked-by: Khalid Elmously <khalid.elmously@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Vaibhav Jain [Thu, 15 Feb 2018 06:19:36 +0000 (11:49 +0530)]
cxl: Remove function write_timebase_ctrl_psl9() for PSL9
BugLink: http://bugs.launchpad.net/bugs/1774471
For PSL9 the contents of PSL_TB_CTLSTAT register have changed in PSL9
and all of the register is now readonly. Hence we don't need an sl_ops
implementation for 'write_timebase_ctrl' for to populate this register
for PSL9.
Hence this patch removes function write_timebase_ctrl_psl9() and its
references from the code.
Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com> Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 02b63b420223db3e33e19cc0aaf346371e8efe48) Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com> Acked-by: Khalid Elmously <khalid.elmously@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Takashi Iwai [Wed, 13 Jun 2018 16:53:58 +0000 (12:53 -0400)]
Bluetooth: btusb: Apply QCA Rome patches for some ATH3012 models
BugLink: http://bugs.launchpad.net/bugs/1764645
In commit f44cb4b19ed4 ("Bluetooth: btusb: Fix quirk for Atheros
1525/QCA6174") we tried to address the non-working Atheros BT devices
by changing the quirk from BTUSB_ATH3012 to BTUSB_QCA_ROME. This made
such devices working while it turned out to break other existing chips
with the very same USB ID, hence it was reverted afterwards.
This is another attempt to tackle the issue. The essential point to
use BTUSB_QCA_ROME is to apply the btusb_setup_qca() and do RAM-
patching. And the previous attempt failed because btusb_setup_qca()
returns -ENODEV if the ROM version doesn't match with the expected
ones. For some devices that have already the "correct" ROM versions,
we may just skip the setup procedure and continue the rest.
So, the first fix we'll need is to add a check of the ROM version in
the function to skip the setup if the ROM version looks already sane,
so that it can be applied for all ath devices.
However, the world is a bit more complex than that simple solution.
Since BTUSB_ATH3012 quirk checks the bcdDevice and bails out when it's
0x0001 at the beginning of probing, so the device probe always aborts
here.
In this patch, we add another check of ROM version again, and if the
device needs patching, the probe continues. For that, a slight
refactoring of btusb_qca_send_vendor_req() was required so that the
probe function can pass usb_device pointer directly before allocating
hci_dev stuff.
Fixes: commit f44cb4b19ed4 ("Bluetooth: btusb: Fix quirk for Atheros 1525/QCA6174")
Bugzilla: http://bugzilla.opensuse.org/show_bug.cgi?id=1082504 Tested-by: Ivan Levshin <ivan.levshin@microfocus.com> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
(cherry picked from commit 803cdb8ce584198cd45825822910cac7de6378cb) Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com> Acked-by: Khalid Elmously <khalid.elmously@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Paolo Pisati [Thu, 14 Jun 2018 10:08:56 +0000 (12:08 +0200)]
UBUNTU: SAUCE: wcn36xx: read MAC from file or randomly generate one
BugLink: http://bugs.launchpad.net/bugs/1776491
By default, wcn36xx initializes itself with a dummy 00:00:00:00:00:00 MAC
address, preventing the interface from working until a valid MAC address was set.
While not an issue on Ubuntu Classic (where the user can always set it
later on the command line or via /etc/network/interfaces), it became a problem
on Ubuntu Core where the wifi interface is probed during installation,
before the user has any chance to set a new MAC address.
To overcome this scenario, the wcn36xx driver in Xenial had a couple of features:
1) during probe, if /lib/firmware/wlan/macaddr0 was present, its content was
used as the new MAC address
2) if that failed, a pseudo-random MAC addres was generated and set
and this is a port of a the corresponding Xenial code to Bionic:
see xenial/snapdragon tree,
drivers/net/wireless/ath/wcn36xx/wcn36xx-msm.c::wcn36xx_msm_get_hw_mac().
Signed-off-by: Paolo Pisati <paolo.pisati@canonical.com> Acked-by: Khalid Elmously <khalid.elmously@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
David Howells [Fri, 15 Jun 2018 02:33:29 +0000 (12:33 +1000)]
fscache: Fix hanging wait on page discarded by writeback
BugLink: https://bugs.launchpad.net/bugs/1777029
If the fscache asynchronous write operation elects to discard a page that's
pending storage to the cache because the page would be over the store limit
then it needs to wake the page as someone may be waiting on completion of
the write.
The problem is that the store limit may be updated by a different
asynchronous operation - and so may miss the write - and that the store
limit may not even get updated until later by the netfs.
Fix the kernel hang by making fscache_write_op() mark as written any pages
that are over the limit.
Signed-off-by: David Howells <dhowells@redhat.com>
(cherry picked from commit 2c98425720233ae3e135add0c7e869b32913502f) Signed-off-by: Daniel Axtens <daniel.axtens@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Khalid Elmously <khalid.elmously@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Tyler Hicks [Fri, 10 Aug 2018 14:42:27 +0000 (14:42 +0000)]
cpu: Fix per-cpu regression on ARM64
this_cpu_write() cannot be used on ARM64 until smp_prepare_boot_cpu()
has been called. boot_cpu_state_init() is called right before
smp_prepare_boot_cpu() so it can't use this_cpu_write().
CVE-2018-3620
CVE-2018-3646
Fixes: 956aab6172ee ("cpu/hotplug: Boot HT siblings at least once") Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Andi Kleen [Tue, 7 Aug 2018 22:09:39 +0000 (15:09 -0700)]
x86/mm/pat: Make set_memory_np() L1TF safe
set_memory_np() is used to mark kernel mappings not present, but it has
it's own open coded mechanism which does not have the L1TF protection of
inverting the address bits.
Replace the open coded PTE manipulation with the L1TF protecting low level
PTE routines.
Passes the CPA self test.
Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Andi Kleen [Tue, 7 Aug 2018 22:09:37 +0000 (15:09 -0700)]
x86/speculation/l1tf: Make pmd/pud_mknotpresent() invert
Some cases in THP like:
- MADV_FREE
- mprotect
- split
mark the PMD non present for temporarily to prevent races. The window for
an L1TF attack in these contexts is very small, but it wants to be fixed
for correctness sake.
Use the proper low level functions for pmd/pud_mknotpresent() to address
this.
Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Andi Kleen [Tue, 7 Aug 2018 22:09:36 +0000 (15:09 -0700)]
x86/speculation/l1tf: Invert all not present mappings
For kernel mappings PAGE_PROTNONE is not necessarily set for a non present
mapping, but the inversion logic explicitely checks for !PRESENT and
PROT_NONE.
Remove the PROT_NONE check and make the inversion unconditional for all not
present mappings.
Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Thomas Gleixner [Tue, 7 Aug 2018 06:19:57 +0000 (08:19 +0200)]
cpu/hotplug: Fix SMT supported evaluation
Josh reported that the late SMT evaluation in cpu_smt_state_init() sets
cpu_smt_control to CPU_SMT_NOT_SUPPORTED in case that 'nosmt' was supplied
on the kernel command line as it cannot differentiate between SMT disabled
by BIOS and SMT soft disable via 'nosmt'. That wreckages the state and
makes the sysfs interface unusable.
Rework this so that during bringup of the non boot CPUs the availability of
SMT is determined in cpu_smt_allowed(). If a newly booted CPU is not a
'primary' thread then set the local cpu_smt_available marker and evaluate
this explicitely right after the initial SMP bringup has finished.
SMT evaulation on x86 is a trainwreck as the firmware has all the
information _before_ booting the kernel, but there is no interface to query
it.
Fixes: 73d5e2b47264 ("cpu/hotplug: detect SMT disabled by BIOS") Reported-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Paolo Bonzini [Sun, 5 Aug 2018 14:07:47 +0000 (16:07 +0200)]
KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry
When nested virtualization is in use, VMENTER operations from the nested
hypervisor into the nested guest will always be processed by the bare metal
hypervisor, and KVM's "conditional cache flushes" mode in particular does a
flush on nested vmentry. Therefore, include the "skip L1D flush on
vmentry" bit in KVM's suggested ARCH_CAPABILITIES setting.
Add the relevant Documentation.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
CVE-2018-3620
CVE-2018-3646
[tyhicks: Adjust for the missing MSR_F10H_DECFG and MSR_IA32_UCODE_REV
feature MSRs which do not exist in 4.15] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Paolo Bonzini [Mon, 25 Jun 2018 12:04:37 +0000 (14:04 +0200)]
KVM: VMX: support MSR_IA32_ARCH_CAPABILITIES as a feature MSR
This lets userspace read the MSR_IA32_ARCH_CAPABILITIES and check that all
requested features are available on the host.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
CVE-2018-3620
CVE-2018-3646
(backported from commit cd28325249a1ca0d771557ce823e0308ad629f98)
[tyhicks: Adjust for the missing MSR_F10H_DECFG and MSR_IA32_UCODE_REV
feature MSRs which do not exist in 4.15] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Tom Lendacky [Wed, 21 Feb 2018 19:39:51 +0000 (13:39 -0600)]
KVM: x86: Add a framework for supporting MSR-based features
Provide a new KVM capability that allows bits within MSRs to be recognized
as features. Two new ioctls are added to the /dev/kvm ioctl routine to
retrieve the list of these MSRs and then retrieve their values. A kvm_x86_ops
callback is used to determine support for the listed MSR-based features.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[Tweaked documentation. - Radim] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
CVE-2018-3620
CVE-2018-3646
(backported from commit 801e459a6f3a63af9d447e6249088c76ae16efc4)
[tyhicks: Adjust context for missing mem_enc_* kvm_x86_ops] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Paolo Bonzini [Sun, 5 Aug 2018 14:07:46 +0000 (16:07 +0200)]
x86/speculation: Use ARCH_CAPABILITIES to skip L1D flush on vmentry
Bit 3 of ARCH_CAPABILITIES tells a hypervisor that L1D flush on vmentry is
not needed. Add a new value to enum vmx_l1d_flush_state, which is used
either if there is no L1TF bug at all, or if bit 3 is set in ARCH_CAPABILITIES.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
x86/irq: Let interrupt handlers set kvm_cpu_l1tf_flush_l1d
The last missing piece to having vmx_l1d_flush() take interrupts after
VMEXIT into account is to set the kvm_cpu_l1tf_flush_l1d per-cpu flag on
irq entry.
Issue calls to kvm_set_cpu_l1tf_flush_l1d() from entering_irq(),
ipi_entering_ack_irq(), smp_reschedule_interrupt() and
uv_bau_message_interrupt().
Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Nicolai Stange <nstange@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
This causes compilation errors because of the header guards becoming
effective in the second inclusion: symbols/macros that had been defined
before wouldn't be available to intermediate headers in the #include chain
anymore.
A possible workaround would be to move the definition of irq_cpustat_t
into its own header and include that from both, asm/hardirq.h and
asm/apic.h.
However, this wouldn't solve the real problem, namely asm/harirq.h
unnecessarily pulling in all the linux/irq.h cruft: nothing in
asm/hardirq.h itself requires it. Also, note that there are some other
archs, like e.g. arm64, which don't have that #include in their
asm/hardirq.h.
Remove the linux/irq.h #include from x86' asm/hardirq.h.
Fix resulting compilation errors by adding appropriate #includes to *.c
files as needed.
Note that some of these *.c files could be cleaned up a bit wrt. to their
set of #includes, but that should better be done from separate patches, if
at all.
Signed-off-by: Nicolai Stange <nstange@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
CVE-2018-3620
CVE-2018-3646
[tyhicks: Adjust context in smpboot.c, add asm/sections.h include in
kprobes/core.c, i915_pmu.c doesn't exist in 4.15,
controller/pci-hyperv.c is host/pci-hyperv.c in 4.15] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
x86/KVM/VMX: Introduce per-host-cpu analogue of l1tf_flush_l1d
Part of the L1TF mitigation for vmx includes flushing the L1D cache upon
VMENTRY.
L1D flushes are costly and two modes of operations are provided to users:
"always" and the more selective "conditional" mode.
If operating in the latter, the cache would get flushed only if a host side
code path considered unconfined had been traversed. "Unconfined" in this
context means that it might have pulled in sensitive data like user data
or kernel crypto keys.
The need for L1D flushes is tracked by means of the per-vcpu flag
l1tf_flush_l1d. KVM exit handlers considered unconfined set it. A
vmx_l1d_flush() subsequently invoked before the next VMENTER will conduct a
L1d flush based on its value and reset that flag again.
Currently, interrupts delivered "normally" while in root operation between
VMEXIT and VMENTER are not taken into account. Part of the reason is that
these don't leave any traces and thus, the vmx code is unable to tell if
any such has happened.
As proposed by Paolo Bonzini, prepare for tracking all interrupts by
introducing a new per-cpu flag, "kvm_cpu_l1tf_flush_l1d". It will be in
strong analogy to the per-vcpu ->l1tf_flush_l1d.
A later patch will make interrupt handlers set it.
For the sake of cache locality, group kvm_cpu_l1tf_flush_l1d into x86'
per-cpu irq_cpustat_t as suggested by Peter Zijlstra.
Provide the helpers kvm_set_cpu_l1tf_flush_l1d(),
kvm_clear_cpu_l1tf_flush_l1d() and kvm_get_cpu_l1tf_flush_l1d(). Make them
trivial resp. non-existent for !CONFIG_KVM_INTEL as appropriate.
Let vmx_l1d_flush() handle kvm_cpu_l1tf_flush_l1d in the same way as
l1tf_flush_l1d.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Nicolai Stange <nstange@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
x86/irq: Demote irq_cpustat_t::__softirq_pending to u16
An upcoming patch will extend KVM's L1TF mitigation in conditional mode
to also cover interrupts after VMEXITs. For tracking those, stores to a
new per-cpu flag from interrupt handlers will become necessary.
In order to improve cache locality, this new flag will be added to x86's
irq_cpustat_t.
Make some space available there by shrinking the ->softirq_pending bitfield
from 32 to 16 bits: the number of bits actually used is only NR_SOFTIRQS,
i.e. 10.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Nicolai Stange <nstange@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
x86/KVM/VMX: Replace 'vmx_l1d_flush_always' with 'vmx_l1d_flush_cond'
The vmx_l1d_flush_always static key is only ever evaluated if
vmx_l1d_should_flush is enabled. In that case however, there are only two
L1d flushing modes possible: "always" and "conditional".
The "conditional" mode's implementation tends to require more sophisticated
logic than the "always" mode.
Avoid inverted logic by replacing the 'vmx_l1d_flush_always' static key
with a 'vmx_l1d_flush_cond' one.
There is no change in functionality.
Signed-off-by: Nicolai Stange <nstange@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
If SMT is disabled in BIOS, the CPU code doesn't properly detect it.
The /sys/devices/system/cpu/smt/control file shows 'on', and the 'l1tf'
vulnerabilities file shows SMT as vulnerable.
Fix it by forcing 'cpu_smt_control' to CPU_SMT_NOT_SUPPORTED in such a
case. Unfortunately the detection can only be done after bringing all
the CPUs online, so we have to overwrite any previous writes to the
variable.
Reported-by: Joe Mario <jmario@redhat.com> Tested-by: Jiri Kosina <jkosina@suse.cz> Fixes: f048c399e0f7 ("x86/topology: Provide topology_smt_supported()") Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
x86/KVM/VMX: Initialize the vmx_l1d_flush_pages' content
The slow path in vmx_l1d_flush() reads from vmx_l1d_flush_pages in order
to evict the L1d cache.
However, these pages are never cleared and, in theory, their data could be
leaked.
More importantly, KSM could merge a nested hypervisor's vmx_l1d_flush_pages
to fewer than 1 << L1D_CACHE_ORDER host physical pages and this would break
the L1d flushing algorithm: L1D on x86_64 is tagged by physical addresses.
Fix this by initializing the individual vmx_l1d_flush_pages with a
different pattern each.
Rename the "empty_zp" asm constraint identifier in vmx_l1d_flush() to
"flush_pages" to reflect this change.
pfn_modify_allowed() and arch_has_pfn_modify_check() are outside of the
!__ASSEMBLY__ section in include/asm-generic/pgtable.h, which confuses
assembler on archs that don't have __HAVE_ARCH_PFN_MODIFY_ALLOWED (e.g.
ia64) and breaks build:
include/asm-generic/pgtable.h: Assembler messages:
include/asm-generic/pgtable.h:538: Error: Unknown opcode `static inline bool pfn_modify_allowed(unsigned long pfn,pgprot_t prot)'
include/asm-generic/pgtable.h:540: Error: Unknown opcode `return true'
include/asm-generic/pgtable.h:543: Error: Unknown opcode `static inline bool arch_has_pfn_modify_check(void)'
include/asm-generic/pgtable.h:545: Error: Unknown opcode `return false'
arch/ia64/kernel/entry.S:69: Error: `mov' does not fit into bundle
Move those two static inlines into the !__ASSEMBLY__ section so that they
don't confuse the asm build pass.
Fixes: 42e4089c7890 ("x86/speculation/l1tf: Disallow non privileged high MMIO PROT_NONE mappings") Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
x86/bugs, kvm: Introduce boot-time control of L1TF mitigations
Introduce the 'l1tf=' kernel command line option to allow for boot-time
switching of mitigation that is used on processors affected by L1TF.
The possible values are:
full
Provides all available mitigations for the L1TF vulnerability. Disables
SMT and enables all mitigations in the hypervisors. SMT control via
/sys/devices/system/cpu/smt/control is still possible after boot.
Hypervisors will issue a warning when the first VM is started in
a potentially insecure configuration, i.e. SMT enabled or L1D flush
disabled.
full,force
Same as 'full', but disables SMT control. Implies the 'nosmt=force'
command line option. sysfs control of SMT and the hypervisor flush
control is disabled.
flush
Leaves SMT enabled and enables the conditional hypervisor mitigation.
Hypervisors will issue a warning when the first VM is started in a
potentially insecure configuration, i.e. SMT enabled or L1D flush
disabled.
flush,nosmt
Disables SMT and enables the conditional hypervisor mitigation. SMT
control via /sys/devices/system/cpu/smt/control is still possible
after boot. If SMT is reenabled or flushing disabled at runtime
hypervisors will issue a warning.
flush,nowarn
Same as 'flush', but hypervisors will not warn when
a VM is started in a potentially insecure configuration.
off
Disables hypervisor mitigations and doesn't emit any warnings.
Default is 'flush'.
Let KVM adhere to these semantics, which means:
- 'lt1f=full,force' : Performe L1D flushes. No runtime control
possible.
- 'l1tf=full'
- 'l1tf-flush'
- 'l1tf=flush,nosmt' : Perform L1D flushes and warn on VM start if
SMT has been runtime enabled or L1D flushing
has been run-time enabled
- 'l1tf=flush,nowarn' : Perform L1D flushes and no warnings are emitted.
- 'l1tf=off' : L1D flushes are not performed and no warnings
are emitted.
KVM can always override the L1D flushing behavior using its 'vmentry_l1d_flush'
module parameter except when lt1f=full,force is set.
This makes KVM's private 'nosmt' option redundant, and as it is a bit
non-systematic anyway (this is something to control globally, not on
hypervisor level), remove that option.
Add the missing Documentation entry for the l1tf vulnerability sysfs file
while at it.
[tyhicks: Adjust context for changes to vmx_vm_init()] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Thomas Gleixner [Fri, 13 Jul 2018 14:23:22 +0000 (16:23 +0200)]
x86/kvm: Allow runtime control of L1D flush
All mitigation modes can be switched at run time with a static key now:
- Use sysfs_streq() instead of strcmp() to handle the trailing new line
from sysfs writes correctly.
- Make the static key management handle multiple invocations properly.
- Set the module parameter file to RW
Thomas Gleixner [Fri, 13 Jul 2018 14:23:19 +0000 (16:23 +0200)]
x86/kvm: Move l1tf setup function
In preparation of allowing run time control for L1D flushing, move the
setup code to the module parameter handler.
In case of pre module init parsing, just store the value and let vmx_init()
do the actual setup after running kvm_init() so that enable_ept is having
the correct state.
During run-time invoke it directly from the parameter setter to prepare for
run-time control.
Thomas Gleixner [Fri, 13 Jul 2018 14:23:18 +0000 (16:23 +0200)]
x86/l1tf: Handle EPT disabled state proper
If Extended Page Tables (EPT) are disabled or not supported, no L1D
flushing is required. The setup function can just avoid setting up the L1D
flush for the EPT=n case.
Invoke it after the hardware setup has be done and enable_ept has the
correct state and expose the EPT disabled state in the mitigation status as
well.
Thomas Gleixner [Fri, 13 Jul 2018 14:23:17 +0000 (16:23 +0200)]
x86/kvm: Drop L1TF MSR list approach
The VMX module parameter to control the L1D flush should become
writeable.
The MSR list is set up at VM init per guest VCPU, but the run time
switching is based on a static key which is global. Toggling the MSR list
at run time might be feasible, but for now drop this optimization and use
the regular MSR write to make run-time switching possible.
The default mitigation is the conditional flush anyway, so for extra
paranoid setups this will add some small overhead, but the extra code
executed is in the noise compared to the flush itself.
Aside of that the EPT disabled case is not handled correctly at the moment
and the MSR list magic is in the way for fixing that as well.
If it's really providing a significant advantage, then this needs to be
revisited after the code is correct and the control is writable.
Thomas Gleixner [Sat, 7 Jul 2018 09:40:18 +0000 (11:40 +0200)]
cpu/hotplug: Online siblings when SMT control is turned on
Writing 'off' to /sys/devices/system/cpu/smt/control offlines all SMT
siblings. Writing 'on' merily enables the abilify to online them, but does
not online them automatically.
Make 'on' more useful by onlining all offline siblings.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
x86/KVM/VMX: Extend add_atomic_switch_msr() to allow VMENTER only MSRs
The IA32_FLUSH_CMD MSR needs only to be written on VMENTER. Extend
add_atomic_switch_msr() with an entry_only parameter to allow storing the
MSR only in the guest (ENTRY) MSR array.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
x86/KVM/VMX: Split the VMX MSR LOAD structures to have an host/guest numbers
There is no semantic change but this change allows an unbalanced amount of
MSRs to be loaded on VMEXIT and VMENTER, i.e. the number of MSRs to save or
restore on VMEXIT or VMENTER may be different.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
CVE-2018-3620
CVE-2018-3646
[tyhicks: Backport around prepare_vmcs02_full() not yet existing] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Paolo Bonzini [Mon, 2 Jul 2018 11:07:14 +0000 (13:07 +0200)]
x86/KVM/VMX: Add L1D flush logic
Add the logic for flushing L1D on VMENTER. The flush depends on the static
key being enabled and the new l1tf_flush_l1d flag being set.
The flags is set:
- Always, if the flush module parameter is 'always'
- Conditionally at:
- Entry to vcpu_run(), i.e. after executing user space
- From the sched_in notifier, i.e. when switching to a vCPU thread.
- From vmexit handlers which are considered unsafe, i.e. where
sensitive data can be brought into L1D:
- The emulator, which could be a good target for other speculative
execution-based threats,
- The MMU, which can bring host page tables in the L1 cache.
- External interrupts
- Nested operations that require the MMU (see above). That is
vmptrld, vmptrst, vmclear,vmwrite,vmread.
- When handling invept,invvpid
[ tglx: Split out from combo patch and reduced to a single flag ]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
CVE-2018-3620
CVE-2018-3646
[tyhicks: Backported around context changes] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Paolo Bonzini [Mon, 2 Jul 2018 11:03:48 +0000 (13:03 +0200)]
x86/KVM/VMX: Add L1D MSR based flush
336996-Speculative-Execution-Side-Channel-Mitigations.pdf defines a new MSR
(IA32_FLUSH_CMD aka 0x10B) which has similar write-only semantics to other
MSRs defined in the document.
The semantics of this MSR is to allow "finer granularity invalidation of
caching structures than existing mechanisms like WBINVD. It will writeback
and invalidate the L1 data cache, including all cachelines brought in by
preceding instructions, without invalidating all caches (eg. L2 or
LLC). Some processors may also invalidate the first level level instruction
cache on a L1D_FLUSH command. The L1 data and instruction caches may be
shared across the logical processors of a core."
Use it instead of the loop based L1 flush algorithm.
A copy of this document is available at
https://bugzilla.kernel.org/show_bug.cgi?id=199511
[ tglx: Avoid allocating pages when the MSR is available ]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Paolo Bonzini [Mon, 2 Jul 2018 10:47:38 +0000 (12:47 +0200)]
x86/KVM/VMX: Add L1D flush algorithm
To mitigate the L1 Terminal Fault vulnerability it's required to flush L1D
on VMENTER to prevent rogue guests from snooping host memory.
CPUs will have a new control MSR via a microcode update to flush L1D with a
single MSR write, but in the absence of microcode a fallback to a software
based flush algorithm is required.
Add a software flush loop which is based on code from Intel.
[ tglx: Split out from combo patch ]
[ bpetkov: Polish the asm code ]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
CVE-2018-3620
CVE-2018-3646
[tyhicks: Backported around context changes] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
x86/KVM/VMX: Add module argument for L1TF mitigation
Add a mitigation mode parameter "vmentry_l1d_flush" for CVE-2018-3620, aka
L1 terminal fault. The valid arguments are:
- "always" L1D cache flush on every VMENTER.
- "cond" Conditional L1D cache flush, explained below
- "never" Disable the L1D cache flush mitigation
"cond" is trying to avoid L1D cache flushes on VMENTER if the code executed
between VMEXIT and VMENTER is considered safe, i.e. is not bringing any
interesting information into L1D which might exploited.
[ tglx: Split out from a larger patch ]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
CVE-2018-3620
CVE-2018-3646
[tyhicks: Backported around context changes] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
x86/KVM: Warn user if KVM is loaded SMT and L1TF CPU bug being present
If the L1TF CPU bug is present we allow the KVM module to be loaded as the
major of users that use Linux and KVM have trusted guests and do not want a
broken setup.
Cloud vendors are the ones that are uncomfortable with CVE 2018-3620 and as
such they are the ones that should set nosmt to one.
Setting 'nosmt' means that the system administrator also needs to disable
SMT (Hyper-threading) in the BIOS, or via the 'nosmt' command line
parameter, or via the /sys/devices/system/cpu/smt/control. See commit 05736e4ac13c ("cpu/hotplug: Provide knobs to control SMT").
Other mitigations are to use task affinity, cpu sets, interrupt binding,
etc - anything to make sure that _only_ the same guests vCPUs are running
on sibling threads.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
CVE-2018-3620
CVE-2018-3646
[tyhicks: Backport around missing commit b31c114 to add vmx_vm_init()] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Thomas Gleixner [Fri, 29 Jun 2018 14:05:48 +0000 (16:05 +0200)]
cpu/hotplug: Boot HT siblings at least once
Due to the way Machine Check Exceptions work on X86 hyperthreads it's
required to boot up _all_ logical cores at least once in order to set the
CR4.MCE bit.
So instead of ignoring the sibling threads right away, let them boot up
once so they can configure themselves. After they came out of the initial
boot stage check whether its a "secondary" sibling and cancel the operation
which puts the CPU back into offline state.
Reported-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Tony Luck <tony.luck@intel.com>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Thomas Gleixner [Fri, 29 Jun 2018 14:05:47 +0000 (16:05 +0200)]
Revert "x86/apic: Ignore secondary threads if nosmt=force"
Dave Hansen reported, that it's outright dangerous to keep SMT siblings
disabled completely so they are stuck in the BIOS and wait for SIPI.
The reason is that Machine Check Exceptions are broadcasted to siblings and
the soft disabled sibling has CR4.MCE = 0. If a MCE is delivered to a
logical core with CR4.MCE = 0, it asserts IERR#, which shuts down or
reboots the machine. The MCE chapter in the SDM contains the following
blurb:
Because the logical processors within a physical package are tightly
coupled with respect to shared hardware resources, both logical
processors are notified of machine check errors that occur within a
given physical processor. If machine-check exceptions are enabled when
a fatal error is reported, all the logical processors within a physical
package are dispatched to the machine-check exception handler. If
machine-check exceptions are disabled, the logical processors enter the
shutdown state and assert the IERR# signal. When enabling machine-check
exceptions, the MCE flag in control register CR4 should be set for each
logical processor.
Reverting the commit which ignores siblings at enumeration time solves only
half of the problem. The core cpuhotplug logic needs to be adjusted as
well.
This thoughtful engineered mechanism also turns the boot process on all
Intel HT enabled systems into a MCE lottery. MCE is enabled on the boot CPU
before the secondary CPUs are brought up. Depending on the number of
physical cores the window in which this situation can happen is smaller or
larger. On a HSW-EX it's about 750ms:
MCE is enabled on the boot CPU:
[ 0.244017] mce: CPU supports 22 MCE banks
The corresponding sibling #72 boots:
[ 1.008005] .... node #0, CPUs: #72
That means if an MCE hits on physical core 0 (logical CPUs 0 and 72)
between these two points the machine is going to shutdown. At least it's a
known safe state.
It's obvious that the early boot can be hit by an MCE as well and then runs
into the same situation because MCEs are not yet enabled on the boot CPU.
But after enabling them on the boot CPU, it does not make any sense to
prevent the kernel from recovering.
Adjust the nosmt kernel parameter documentation as well.
Reverts: 2207def700f9 ("x86/apic: Ignore secondary threads if nosmt=force") Reported-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Tony Luck <tony.luck@intel.com>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Michal Hocko [Wed, 27 Jun 2018 15:46:50 +0000 (17:46 +0200)]
x86/speculation/l1tf: Fix up pte->pfn conversion for PAE
Jan has noticed that pte_pfn and co. resp. pfn_pte are incorrect for
CONFIG_PAE because phys_addr_t is wider than unsigned long and so the
pte_val reps. shift left would get truncated. Fix this up by using proper
types.
Fixes: 6b28baca9b1f ("x86/speculation/l1tf: Protect PROT_NONE PTEs against speculation") Reported-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Vlastimil Babka <vbabka@suse.cz>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Vlastimil Babka [Fri, 22 Jun 2018 15:39:33 +0000 (17:39 +0200)]
x86/speculation/l1tf: Protect PAE swap entries against L1TF
The PAE 3-level paging code currently doesn't mitigate L1TF by flipping the
offset bits, and uses the high PTE word, thus bits 32-36 for type, 37-63 for
offset. The lower word is zeroed, thus systems with less than 4GB memory are
safe. With 4GB to 128GB the swap type selects the memory locations vulnerable
to L1TF; with even more memory, also the swap offfset influences the address.
This might be a problem with 32bit PAE guests running on large 64bit hosts.
By continuing to keep the whole swap entry in either high or low 32bit word of
PTE we would limit the swap size too much. Thus this patch uses the whole PAE
PTE with the same layout as the 64bit version does. The macros just become a
bit tricky since they assume the arch-dependent swp_entry_t to be 32bit.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Michal Hocko <mhocko@suse.com>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Borislav Petkov [Fri, 22 Jun 2018 09:34:11 +0000 (11:34 +0200)]
x86/CPU/AMD: Move TOPOEXT reenablement before reading smp_num_siblings
The TOPOEXT reenablement is a workaround for broken BIOSen which didn't
enable the CPUID bit. amd_get_topology_early(), however, relies on
that bit being set so that it can read out the CPUID leaf and set
smp_num_siblings properly.
Move the reenablement up to early_init_amd(). While at it, simplify
amd_get_topology_early().
Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
CVE-2018-3620
CVE-2018-3646
[tyhicks: Backport around missing commit 18c71ce] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Thomas Gleixner [Tue, 5 Jun 2018 12:00:11 +0000 (14:00 +0200)]
x86/apic: Ignore secondary threads if nosmt=force
nosmt on the kernel command line merely prevents the onlining of the
secondary SMT siblings.
nosmt=force makes the APIC detection code ignore the secondary SMT siblings
completely, so they even do not show up as possible CPUs. That reduces the
amount of memory allocations for per cpu variables and saves other
resources from being allocated too large.
This is not fully equivalent to disabling SMT in the BIOS because the low
level SMT enabling in the BIOS can result in partitioning of resources
between the siblings, which is not undone by just ignoring them. Some CPUs
can use the full resources when their sibling is not onlined, but this is
depending on the CPU family and model and it's not well documented whether
this applies to all partitioned resources. That means depending on the
workload disabling SMT in the BIOS might result in better performance.
Linus analysis of the Intel manual:
The intel optimization manual is not very clear on what the partitioning
rules are.
I find:
"In general, the buffers for staging instructions between major pipe
stages are partitioned. These buffers include µop queues after the
execution trace cache, the queues after the register rename stage, the
reorder buffer which stages instructions for retirement, and the load
and store buffers.
In the case of load and store buffers, partitioning also provided an
easier implementation to maintain memory ordering for each logical
processor and detect memory ordering violations"
but some of that partitioning may be relaxed if the HT thread is "not
active":
"In Intel microarchitecture code name Sandy Bridge, the micro-op queue
is statically partitioned to provide 28 entries for each logical
processor, irrespective of software executing in single thread or
multiple threads. If one logical processor is not active in Intel
microarchitecture code name Ivy Bridge, then a single thread executing
on that processor core can use the 56 entries in the micro-op queue"
but I do not know what "not active" means, and how dynamic it is. Some of
that partitioning may be entirely static and depend on the early BIOS
disabling of HT, and even if we park the cores, the resources will just be
wasted.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Thomas Gleixner [Tue, 5 Jun 2018 22:57:38 +0000 (00:57 +0200)]
x86/cpu/AMD: Evaluate smp_num_siblings early
To support force disabling of SMT it's required to know the number of
thread siblings early. amd_get_topology() cannot be called before the APIC
driver is selected, so split out the part which initializes
smp_num_siblings and invoke it from amd_early_init().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
[tyhicks: Backport around missing commit 18c71ce] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Borislav Petkov [Fri, 15 Jun 2018 18:48:39 +0000 (20:48 +0200)]
x86/CPU/AMD: Do not check CPUID max ext level before parsing SMP info
Old code used to check whether CPUID ext max level is >= 0x80000008 because
that last leaf contains the number of cores of the physical CPU. The three
functions called there now do not depend on that leaf anymore so the check
can go.
Thomas Gleixner [Tue, 5 Jun 2018 23:00:55 +0000 (01:00 +0200)]
x86/cpu/intel: Evaluate smp_num_siblings early
Make use of the new early detection function to initialize smp_num_siblings
on the boot cpu before the MP-Table or ACPI/MADT scan happens. That's
required for force disabling SMT.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Thomas Gleixner [Tue, 5 Jun 2018 22:55:39 +0000 (00:55 +0200)]
x86/cpu/topology: Provide detect_extended_topology_early()
To support force disabling of SMT it's required to know the number of
thread siblings early. detect_extended_topology() cannot be called before
the APIC driver is selected, so split out the part which initializes
smp_num_siblings.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Thomas Gleixner [Tue, 5 Jun 2018 22:53:57 +0000 (00:53 +0200)]
x86/cpu/common: Provide detect_ht_early()
To support force disabling of SMT it's required to know the number of
thread siblings early. detect_ht() cannot be called before the APIC driver
is selected, so split out the part which initializes smp_num_siblings.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Thomas Gleixner [Tue, 29 May 2018 15:48:27 +0000 (17:48 +0200)]
cpu/hotplug: Provide knobs to control SMT
Provide a command line and a sysfs knob to control SMT.
The command line options are:
'nosmt': Enumerate secondary threads, but do not online them
'nosmt=force': Ignore secondary threads completely during enumeration
via MP table and ACPI/MADT.
The sysfs control file has the following states (read/write):
'on': SMT is enabled. Secondary threads can be freely onlined
'off': SMT is disabled. Secondary threads, even if enumerated
cannot be onlined
'forceoff': SMT is permanentely disabled. Writes to the control
file are rejected.
'notsupported': SMT is not supported by the CPU
The command line option 'nosmt' sets the sysfs control to 'off'. This
can be changed to 'on' to reenable SMT during runtime.
The command line option 'nosmt=force' sets the sysfs control to
'forceoff'. This cannot be changed during runtime.
When SMT is 'on' and the control file is changed to 'off' then all online
secondary threads are offlined and attempts to online a secondary thread
later on are rejected.
When SMT is 'off' and the control file is changed to 'on' then secondary
threads can be onlined again. The 'off' -> 'on' transition does not
automatically online the secondary threads.
When the control file is set to 'forceoff', the behaviour is the same as
setting it to 'off', but the operation is irreversible and later writes to
the control file are rejected.
When the control status is 'notsupported' then writes to the control file
are rejected.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
[tyhicks: Backport around missing "select NEED_SG_DMA_LENGTH" in Kconfig] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Thomas Gleixner [Tue, 29 May 2018 17:05:25 +0000 (19:05 +0200)]
cpu/hotplug: Make bringup/teardown of smp threads symmetric
The asymmetry caused a warning to trigger if the bootup was stopped in state
CPUHP_AP_ONLINE_IDLE. The warning no longer triggers as kthread_park() can
now be invoked on already or still parked threads. But there is still no
reason to have this be asymmetric.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Thomas Gleixner [Tue, 29 May 2018 15:50:22 +0000 (17:50 +0200)]
x86/smp: Provide topology_is_primary_thread()
If the CPU is supporting SMT then the primary thread can be found by
checking the lower APIC ID bits for zero. smp_num_siblings is used to build
the mask for the APIC ID bits which need to be taken into account.
This uses the MPTABLE or ACPI/MADT supplied APIC ID, which can be different
than the initial APIC ID in CPUID. But according to AMD the lower bits have
to be consistent. Intel gave a tentative confirmation as well.
Preparatory patch to support disabling SMT at boot/runtime.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Peter Zijlstra [Tue, 29 May 2018 14:43:46 +0000 (16:43 +0200)]
sched/smt: Update sched_smt_present at runtime
The static key sched_smt_present is only updated at boot time when SMT
siblings have been detected. Booting with maxcpus=1 and bringing the
siblings online after boot rebuilds the scheduling domains correctly but
does not update the static key, so the SMT code is not enabled.
Let the key be updated in the scheduler CPU hotplug code to fix this.
Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Andi Kleen [Wed, 13 Jun 2018 22:48:28 +0000 (15:48 -0700)]
x86/speculation/l1tf: Limit swap file size to MAX_PA/2
For the L1TF workaround its necessary to limit the swap file size to below
MAX_PA/2, so that the higher bits of the swap offset inverted never point
to valid memory.
Add a mechanism for the architecture to override the swap file size check
in swapfile.c and add a x86 specific max swapfile check function that
enforces that limit.
The check is only enabled if the CPU is vulnerable to L1TF.
In VMs with 42bit MAX_PA the typical limit is 2TB now, on a native system
with 46bit PA it is 32TB. The limit is only per individual swap file, so
it's always possible to exceed these limits with multiple swap files or
partitions.
Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Dave Hansen <dave.hansen@intel.com>
CVE-2018-3620
CVE-2018-3646
[tyhicks: Backport around missing commit a06ad63] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>