Thomas Gleixner [Fri, 29 Jun 2018 14:05:47 +0000 (16:05 +0200)]
Revert "x86/apic: Ignore secondary threads if nosmt=force"
Dave Hansen reported, that it's outright dangerous to keep SMT siblings
disabled completely so they are stuck in the BIOS and wait for SIPI.
The reason is that Machine Check Exceptions are broadcasted to siblings and
the soft disabled sibling has CR4.MCE = 0. If a MCE is delivered to a
logical core with CR4.MCE = 0, it asserts IERR#, which shuts down or
reboots the machine. The MCE chapter in the SDM contains the following
blurb:
Because the logical processors within a physical package are tightly
coupled with respect to shared hardware resources, both logical
processors are notified of machine check errors that occur within a
given physical processor. If machine-check exceptions are enabled when
a fatal error is reported, all the logical processors within a physical
package are dispatched to the machine-check exception handler. If
machine-check exceptions are disabled, the logical processors enter the
shutdown state and assert the IERR# signal. When enabling machine-check
exceptions, the MCE flag in control register CR4 should be set for each
logical processor.
Reverting the commit which ignores siblings at enumeration time solves only
half of the problem. The core cpuhotplug logic needs to be adjusted as
well.
This thoughtful engineered mechanism also turns the boot process on all
Intel HT enabled systems into a MCE lottery. MCE is enabled on the boot CPU
before the secondary CPUs are brought up. Depending on the number of
physical cores the window in which this situation can happen is smaller or
larger. On a HSW-EX it's about 750ms:
MCE is enabled on the boot CPU:
[ 0.244017] mce: CPU supports 22 MCE banks
The corresponding sibling #72 boots:
[ 1.008005] .... node #0, CPUs: #72
That means if an MCE hits on physical core 0 (logical CPUs 0 and 72)
between these two points the machine is going to shutdown. At least it's a
known safe state.
It's obvious that the early boot can be hit by an MCE as well and then runs
into the same situation because MCEs are not yet enabled on the boot CPU.
But after enabling them on the boot CPU, it does not make any sense to
prevent the kernel from recovering.
Adjust the nosmt kernel parameter documentation as well.
Reverts: 2207def700f9 ("x86/apic: Ignore secondary threads if nosmt=force") Reported-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Tony Luck <tony.luck@intel.com>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Michal Hocko [Wed, 27 Jun 2018 15:46:50 +0000 (17:46 +0200)]
x86/speculation/l1tf: Fix up pte->pfn conversion for PAE
Jan has noticed that pte_pfn and co. resp. pfn_pte are incorrect for
CONFIG_PAE because phys_addr_t is wider than unsigned long and so the
pte_val reps. shift left would get truncated. Fix this up by using proper
types.
Fixes: 6b28baca9b1f ("x86/speculation/l1tf: Protect PROT_NONE PTEs against speculation") Reported-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Vlastimil Babka <vbabka@suse.cz>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Vlastimil Babka [Fri, 22 Jun 2018 15:39:33 +0000 (17:39 +0200)]
x86/speculation/l1tf: Protect PAE swap entries against L1TF
The PAE 3-level paging code currently doesn't mitigate L1TF by flipping the
offset bits, and uses the high PTE word, thus bits 32-36 for type, 37-63 for
offset. The lower word is zeroed, thus systems with less than 4GB memory are
safe. With 4GB to 128GB the swap type selects the memory locations vulnerable
to L1TF; with even more memory, also the swap offfset influences the address.
This might be a problem with 32bit PAE guests running on large 64bit hosts.
By continuing to keep the whole swap entry in either high or low 32bit word of
PTE we would limit the swap size too much. Thus this patch uses the whole PAE
PTE with the same layout as the 64bit version does. The macros just become a
bit tricky since they assume the arch-dependent swp_entry_t to be 32bit.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Michal Hocko <mhocko@suse.com>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Borislav Petkov [Fri, 22 Jun 2018 09:34:11 +0000 (11:34 +0200)]
x86/CPU/AMD: Move TOPOEXT reenablement before reading smp_num_siblings
The TOPOEXT reenablement is a workaround for broken BIOSen which didn't
enable the CPUID bit. amd_get_topology_early(), however, relies on
that bit being set so that it can read out the CPUID leaf and set
smp_num_siblings properly.
Move the reenablement up to early_init_amd(). While at it, simplify
amd_get_topology_early().
Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
CVE-2018-3620
CVE-2018-3646
[tyhicks: Backport around missing commit 18c71ce] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Thomas Gleixner [Tue, 5 Jun 2018 12:00:11 +0000 (14:00 +0200)]
x86/apic: Ignore secondary threads if nosmt=force
nosmt on the kernel command line merely prevents the onlining of the
secondary SMT siblings.
nosmt=force makes the APIC detection code ignore the secondary SMT siblings
completely, so they even do not show up as possible CPUs. That reduces the
amount of memory allocations for per cpu variables and saves other
resources from being allocated too large.
This is not fully equivalent to disabling SMT in the BIOS because the low
level SMT enabling in the BIOS can result in partitioning of resources
between the siblings, which is not undone by just ignoring them. Some CPUs
can use the full resources when their sibling is not onlined, but this is
depending on the CPU family and model and it's not well documented whether
this applies to all partitioned resources. That means depending on the
workload disabling SMT in the BIOS might result in better performance.
Linus analysis of the Intel manual:
The intel optimization manual is not very clear on what the partitioning
rules are.
I find:
"In general, the buffers for staging instructions between major pipe
stages are partitioned. These buffers include µop queues after the
execution trace cache, the queues after the register rename stage, the
reorder buffer which stages instructions for retirement, and the load
and store buffers.
In the case of load and store buffers, partitioning also provided an
easier implementation to maintain memory ordering for each logical
processor and detect memory ordering violations"
but some of that partitioning may be relaxed if the HT thread is "not
active":
"In Intel microarchitecture code name Sandy Bridge, the micro-op queue
is statically partitioned to provide 28 entries for each logical
processor, irrespective of software executing in single thread or
multiple threads. If one logical processor is not active in Intel
microarchitecture code name Ivy Bridge, then a single thread executing
on that processor core can use the 56 entries in the micro-op queue"
but I do not know what "not active" means, and how dynamic it is. Some of
that partitioning may be entirely static and depend on the early BIOS
disabling of HT, and even if we park the cores, the resources will just be
wasted.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Thomas Gleixner [Tue, 5 Jun 2018 22:57:38 +0000 (00:57 +0200)]
x86/cpu/AMD: Evaluate smp_num_siblings early
To support force disabling of SMT it's required to know the number of
thread siblings early. amd_get_topology() cannot be called before the APIC
driver is selected, so split out the part which initializes
smp_num_siblings and invoke it from amd_early_init().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
[tyhicks: Backport around missing commit 18c71ce] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Borislav Petkov [Fri, 15 Jun 2018 18:48:39 +0000 (20:48 +0200)]
x86/CPU/AMD: Do not check CPUID max ext level before parsing SMP info
Old code used to check whether CPUID ext max level is >= 0x80000008 because
that last leaf contains the number of cores of the physical CPU. The three
functions called there now do not depend on that leaf anymore so the check
can go.
Thomas Gleixner [Tue, 5 Jun 2018 23:00:55 +0000 (01:00 +0200)]
x86/cpu/intel: Evaluate smp_num_siblings early
Make use of the new early detection function to initialize smp_num_siblings
on the boot cpu before the MP-Table or ACPI/MADT scan happens. That's
required for force disabling SMT.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Thomas Gleixner [Tue, 5 Jun 2018 22:55:39 +0000 (00:55 +0200)]
x86/cpu/topology: Provide detect_extended_topology_early()
To support force disabling of SMT it's required to know the number of
thread siblings early. detect_extended_topology() cannot be called before
the APIC driver is selected, so split out the part which initializes
smp_num_siblings.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Thomas Gleixner [Tue, 5 Jun 2018 22:53:57 +0000 (00:53 +0200)]
x86/cpu/common: Provide detect_ht_early()
To support force disabling of SMT it's required to know the number of
thread siblings early. detect_ht() cannot be called before the APIC driver
is selected, so split out the part which initializes smp_num_siblings.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Thomas Gleixner [Tue, 29 May 2018 15:48:27 +0000 (17:48 +0200)]
cpu/hotplug: Provide knobs to control SMT
Provide a command line and a sysfs knob to control SMT.
The command line options are:
'nosmt': Enumerate secondary threads, but do not online them
'nosmt=force': Ignore secondary threads completely during enumeration
via MP table and ACPI/MADT.
The sysfs control file has the following states (read/write):
'on': SMT is enabled. Secondary threads can be freely onlined
'off': SMT is disabled. Secondary threads, even if enumerated
cannot be onlined
'forceoff': SMT is permanentely disabled. Writes to the control
file are rejected.
'notsupported': SMT is not supported by the CPU
The command line option 'nosmt' sets the sysfs control to 'off'. This
can be changed to 'on' to reenable SMT during runtime.
The command line option 'nosmt=force' sets the sysfs control to
'forceoff'. This cannot be changed during runtime.
When SMT is 'on' and the control file is changed to 'off' then all online
secondary threads are offlined and attempts to online a secondary thread
later on are rejected.
When SMT is 'off' and the control file is changed to 'on' then secondary
threads can be onlined again. The 'off' -> 'on' transition does not
automatically online the secondary threads.
When the control file is set to 'forceoff', the behaviour is the same as
setting it to 'off', but the operation is irreversible and later writes to
the control file are rejected.
When the control status is 'notsupported' then writes to the control file
are rejected.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
[tyhicks: Backport around missing "select NEED_SG_DMA_LENGTH" in Kconfig] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Thomas Gleixner [Tue, 29 May 2018 17:05:25 +0000 (19:05 +0200)]
cpu/hotplug: Make bringup/teardown of smp threads symmetric
The asymmetry caused a warning to trigger if the bootup was stopped in state
CPUHP_AP_ONLINE_IDLE. The warning no longer triggers as kthread_park() can
now be invoked on already or still parked threads. But there is still no
reason to have this be asymmetric.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Thomas Gleixner [Tue, 29 May 2018 15:50:22 +0000 (17:50 +0200)]
x86/smp: Provide topology_is_primary_thread()
If the CPU is supporting SMT then the primary thread can be found by
checking the lower APIC ID bits for zero. smp_num_siblings is used to build
the mask for the APIC ID bits which need to be taken into account.
This uses the MPTABLE or ACPI/MADT supplied APIC ID, which can be different
than the initial APIC ID in CPUID. But according to AMD the lower bits have
to be consistent. Intel gave a tentative confirmation as well.
Preparatory patch to support disabling SMT at boot/runtime.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Peter Zijlstra [Tue, 29 May 2018 14:43:46 +0000 (16:43 +0200)]
sched/smt: Update sched_smt_present at runtime
The static key sched_smt_present is only updated at boot time when SMT
siblings have been detected. Booting with maxcpus=1 and bringing the
siblings online after boot rebuilds the scheduling domains correctly but
does not update the static key, so the SMT code is not enabled.
Let the key be updated in the scheduler CPU hotplug code to fix this.
Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Andi Kleen [Wed, 13 Jun 2018 22:48:28 +0000 (15:48 -0700)]
x86/speculation/l1tf: Limit swap file size to MAX_PA/2
For the L1TF workaround its necessary to limit the swap file size to below
MAX_PA/2, so that the higher bits of the swap offset inverted never point
to valid memory.
Add a mechanism for the architecture to override the swap file size check
in swapfile.c and add a x86 specific max swapfile check function that
enforces that limit.
The check is only enabled if the CPU is vulnerable to L1TF.
In VMs with 42bit MAX_PA the typical limit is 2TB now, on a native system
with 46bit PA it is 32TB. The limit is only per individual swap file, so
it's always possible to exceed these limits with multiple swap files or
partitions.
Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Dave Hansen <dave.hansen@intel.com>
CVE-2018-3620
CVE-2018-3646
[tyhicks: Backport around missing commit a06ad63] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Andi Kleen [Wed, 13 Jun 2018 22:48:27 +0000 (15:48 -0700)]
x86/speculation/l1tf: Disallow non privileged high MMIO PROT_NONE mappings
For L1TF PROT_NONE mappings are protected by inverting the PFN in the page
table entry. This sets the high bits in the CPU's address space, thus
making sure to point to not point an unmapped entry to valid cached memory.
Some server system BIOSes put the MMIO mappings high up in the physical
address space. If such an high mapping was mapped to unprivileged users
they could attack low memory by setting such a mapping to PROT_NONE. This
could happen through a special device driver which is not access
protected. Normal /dev/mem is of course access protected.
To avoid this forbid PROT_NONE mappings or mprotect for high MMIO mappings.
Valid page mappings are allowed because the system is then unsafe anyways.
It's not expected that users commonly use PROT_NONE on MMIO. But to
minimize any impact this is only enforced if the mapping actually refers to
a high MMIO address (defined as the MAX_PA-1 bit being set), and also skip
the check for root.
For mmaps this is straight forward and can be handled in vm_insert_pfn and
in remap_pfn_range().
For mprotect it's a bit trickier. At the point where the actual PTEs are
accessed a lot of state has been changed and it would be difficult to undo
on an error. Since this is a uncommon case use a separate early page talk
walk pass for MMIO PROT_NONE mappings that checks for this condition
early. For non MMIO and non PROT_NONE there are no changes.
Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Dave Hansen <dave.hansen@intel.com>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Andi Kleen [Wed, 13 Jun 2018 22:48:26 +0000 (15:48 -0700)]
x86/speculation/l1tf: Add sysfs reporting for l1tf
L1TF core kernel workarounds are cheap and normally always enabled, However
they still should be reported in sysfs if the system is vulnerable or
mitigated. Add the necessary CPU feature/bug bits.
- Extend the existing checks for Meltdowns to determine if the system is
vulnerable. All CPUs which are not vulnerable to Meltdown are also not
vulnerable to L1TF
- Check for 32bit non PAE and emit a warning as there is no practical way
for mitigation due to the limited physical address bits
- If the system has more than MAX_PA/2 physical memory the invert page
workarounds don't protect the system against the L1TF attack anymore,
because an inverted physical address will also point to valid
memory. Print a warning in this case and report that the system is
vulnerable.
Add a function which returns the PFN limit for the L1TF mitigation, which
will be used in follow up patches for sanity and range checks.
[ tglx: Renamed the CPU feature bit to L1TF_PTEINV ]
Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Dave Hansen <dave.hansen@intel.com>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Andi Kleen [Wed, 13 Jun 2018 22:48:25 +0000 (15:48 -0700)]
x86/speculation/l1tf: Make sure the first page is always reserved
The L1TF workaround doesn't make any attempt to mitigate speculate accesses
to the first physical page for zeroed PTEs. Normally it only contains some
data from the early real mode BIOS.
It's not entirely clear that the first page is reserved in all
configurations, so add an extra reservation call to make sure it is really
reserved. In most configurations (e.g. with the standard reservations)
it's likely a nop.
Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Dave Hansen <dave.hansen@intel.com>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Andi Kleen [Wed, 13 Jun 2018 22:48:24 +0000 (15:48 -0700)]
x86/speculation/l1tf: Protect PROT_NONE PTEs against speculation
When PTEs are set to PROT_NONE the kernel just clears the Present bit and
preserves the PFN, which creates attack surface for L1TF speculation
speculation attacks.
This is important inside guests, because L1TF speculation bypasses physical
page remapping. While the host has its own migitations preventing leaking
data from other VMs into the guest, this would still risk leaking the wrong
page inside the current guest.
This uses the same technique as Linus' swap entry patch: while an entry is
is in PROTNONE state invert the complete PFN part part of it. This ensures
that the the highest bit will point to non existing memory.
The invert is done by pte/pmd_modify and pfn/pmd/pud_pte for PROTNONE and
pte/pmd/pud_pfn undo it.
This assume that no code path touches the PFN part of a PTE directly
without using these primitives.
This doesn't handle the case that MMIO is on the top of the CPU physical
memory. If such an MMIO region was exposed by an unpriviledged driver for
mmap it would be possible to attack some real memory. However this
situation is all rather unlikely.
For 32bit non PAE the inversion is not done because there are really not
enough bits to protect anything.
Q: Why does the guest need to be protected when the HyperVisor already has
L1TF mitigations?
A: Here's an example:
Physical pages 1 2 get mapped into a guest as
GPA 1 -> PA 2
GPA 2 -> PA 1
through EPT.
The L1TF speculation ignores the EPT remapping.
Now the guest kernel maps GPA 1 to process A and GPA 2 to process B, and
they belong to different users and should be isolated.
A sets the GPA 1 PA 2 PTE to PROT_NONE to bypass the EPT remapping and
gets read access to the underlying physical page. Which in this case
points to PA 2, so it can read process B's data, if it happened to be in
L1, so isolation inside the guest is broken.
There's nothing the hypervisor can do about this. This mitigation has to
be done in the guest itself.
[ tglx: Massaged changelog ]
Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Dave Hansen <dave.hansen@intel.com>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Linus Torvalds [Wed, 13 Jun 2018 22:48:23 +0000 (15:48 -0700)]
x86/speculation/l1tf: Protect swap entries against L1TF
With L1 terminal fault the CPU speculates into unmapped PTEs, and resulting
side effects allow to read the memory the PTE is pointing too, if its
values are still in the L1 cache.
For swapped out pages Linux uses unmapped PTEs and stores a swap entry into
them.
To protect against L1TF it must be ensured that the swap entry is not
pointing to valid memory, which requires setting higher bits (between bit
36 and bit 45) that are inside the CPUs physical address space, but outside
any real memory.
To do this invert the offset to make sure the higher bits are always set,
as long as the swap file is not too big.
Note there is no workaround for 32bit !PAE, or on systems which have more
than MAX_PA/2 worth of memory. The later case is very unlikely to happen on
real systems.
[AK: updated description and minor tweaks by. Split out from the original
patch ]
Linus Torvalds [Wed, 13 Jun 2018 22:48:22 +0000 (15:48 -0700)]
x86/speculation/l1tf: Change order of offset/type in swap entry
If pages are swapped out, the swap entry is stored in the corresponding
PTE, which has the Present bit cleared. CPUs vulnerable to L1TF speculate
on PTE entries which have the present bit set and would treat the swap
entry as phsyical address (PFN). To mitigate that the upper bits of the PTE
must be set so the PTE points to non existent memory.
The swap entry stores the type and the offset of a swapped out page in the
PTE. type is stored in bit 9-13 and offset in bit 14-63. The hardware
ignores the bits beyond the phsyical address space limit, so to make the
mitigation effective its required to start 'offset' at the lowest possible
bit so that even large swap offsets do not reach into the physical address
space limit bits.
Move offset to bit 9-58 and type to bit 59-63 which are the bits that
hardware generally doesn't care about.
That, in turn, means that if you on desktop chip with only 40 bits of
physical addressing, now that the offset starts at bit 9, there needs to be
30 bits of offset actually *in use* until bit 39 ends up being set, which
means when inverted it will again point into existing memory.
So that's 4 terabyte of swap space (because the offset is counted in pages,
so 30 bits of offset is 42 bits of actual coverage). With bigger physical
addressing, that obviously grows further, until the limit of the offset is
hit (at 50 bits of offset - 62 bits of actual swap file coverage).
This is a preparatory change for the actual swap entry inversion to protect
against L1TF.
[ AK: Updated description and minor tweaks. Split into two parts ]
[ tglx: Massaged changelog ]
L1 Terminal Fault (L1TF) is a speculation related vulnerability. The CPU
speculates on PTE entries which do not have the PRESENT bit set, if the
content of the resulting physical address is available in the L1D cache.
The OS side mitigation makes sure that a !PRESENT PTE entry points to a
physical address outside the actually existing and cachable memory
space. This is achieved by inverting the upper bits of the PTE. Due to the
address space limitations this only works for 64bit and 32bit PAE kernels,
but not for 32bit non PAE.
This mitigation applies to both host and guest kernels, but in case of a
64bit host (hypervisor) and a 32bit PAE guest, inverting the upper bits of
the PAE address space (44bit) is not enough if the host has more than 43
bits of populated memory address space, because the speculation treats the
PTE content as a physical host address bypassing EPT.
The host (hypervisor) protects itself against the guest by flushing L1D as
needed, but pages inside the guest are not protected against attacks from
other processes inside the same guest.
For the guest the inverted PTE mask has to match the host to provide the
full protection for all pages the host could possibly map into the
guest. The hosts populated address space is not known to the guest, so the
mask must cover the possible maximal host address space, i.e. 52 bit.
On 32bit PAE the maximum PTE mask is currently set to 44 bit because that
is the limit imposed by 32bit unsigned long PFNs in the VMs. This limits
the mask to be below what the host could possible use for physical pages.
The L1TF PROT_NONE protection code uses the PTE masks to determine which
bits to invert to make sure the higher bits are set for unmapped entries to
prevent L1TF speculation attacks against EPT inside guests.
In order to invert all bits that could be used by the host, increase
__PHYSICAL_PAGE_SHIFT to 52 to match 64bit.
The real limit for a 32bit PAE kernel is still 44 bits because all Linux
PTEs are created from unsigned long PFNs, so they cannot be higher than 44
bits on a 32bit kernel. So these extra PFN bits should be never set. The
only users of this macro are using it to look at PTEs, so it's safe.
[ tglx: Massaged changelog ]
Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Dave Hansen <dave.hansen@intel.com>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
x86/boot/64/clang: Use fixup_pointer() to access '__supported_pte_mask'
Clang builds with defconfig started crashing after the following
commit:
fb43d6cb91ef ("x86/mm: Do not auto-massage page protections")
This was caused by introducing a new global access in __startup_64().
Code in __startup_64() can be relocated during execution, but the compiler
doesn't have to generate PC-relative relocations when accessing globals
from that function. Clang actually does not generate them, which leads
to boot-time crashes. To work around this problem, every global pointer
must be adjusted using fixup_pointer().
Signed-off-by: Alexander Potapenko <glider@google.com> Reviewed-by: Dave Hansen <dave.hansen@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: dvyukov@google.com Cc: kirill.shutemov@linux.intel.com Cc: linux-mm@kvack.org Cc: md@google.com Cc: mka@chromium.org Fixes: fb43d6cb91ef ("x86/mm: Do not auto-massage page protections") Link: http://lkml.kernel.org/r/20180509091822.191810-1-glider@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
(backported from commit 4a09f0210c8b1221aae8afda8bd3a603fece0986) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
0day reported warnings at boot on 32-bit systems without NX support:
attempted to set unsupported pgprot: 8000000000000025 bits: 8000000000000000 supported: 7fffffffffffffff
WARNING: CPU: 0 PID: 1 at
arch/x86/include/asm/pgtable.h:540 handle_mm_fault+0xfc1/0xfe0:
check_pgprot at arch/x86/include/asm/pgtable.h:535
(inlined by) pfn_pte at arch/x86/include/asm/pgtable.h:549
(inlined by) do_anonymous_page at mm/memory.c:3169
(inlined by) handle_pte_fault at mm/memory.c:3961
(inlined by) __handle_mm_fault at mm/memory.c:4087
(inlined by) handle_mm_fault at mm/memory.c:4124
The problem is that due to the recent commit which removed auto-massaging
of page protections, filtering page permissions at PTE creation time is not
longer done, so vma->vm_page_prot is passed unfiltered to PTE creation.
Filter the page protections before they are installed in vma->vm_page_prot.
Fixes: fb43d6cb91 ("x86/mm: Do not auto-massage page protections") Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kees Cook <keescook@google.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: linux-mm@kvack.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Nadav Amit <namit@vmware.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Link: https://lkml.kernel.org/r/20180420222028.99D72858@viggo.jf.intel.com
CVE-2018-3620
CVE-2018-3646
(cherry picked from commit 316d097c4cd4e7f2ef50c40cff2db266593c4ec4) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
x86/power/64: Fix page-table setup for temporary text mapping
On a system with 4-level page-tables there is no p4d, so the pud in the pgd
should be mapped. The old code before commit fb43d6cb91ef already did that.
The change from above commit causes an invalid page-table which causes
undefined behavior. In one report it caused triple faults.
Fix it by changing the p4d back to pud.
Fixes: fb43d6cb91ef ('x86/mm: Do not auto-massage page protections') Reported-by: Borislav Petkov <bp@suse.de> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Michal Kubecek <mkubecek@suse.cz> Tested-by: Borislav Petkov <bp@suse.de> Cc: linux-pm@vger.kernel.org Cc: rjw@rjwysocki.net Cc: pavel@ucw.cz Cc: hpa@zytor.com Cc: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/1524162360-26179-1-git-send-email-joro@8bytes.org
CVE-2018-3620
CVE-2018-3646
(cherry picked from commit 05189820da23fc87ee2a7d87c20257f298af27f4) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Dave Hansen [Fri, 6 Apr 2018 20:55:14 +0000 (13:55 -0700)]
x86/mm: Do not forbid _PAGE_RW before init for __ro_after_init
__ro_after_init data gets stuck in the .rodata section. That's normally
fine because the kernel itself manages the R/W properties.
But, if we run __change_page_attr() on an area which is __ro_after_init,
the .rodata checks will trigger and force the area to be immediately
read-only, even if it is early-ish in boot. This caused problems when
trying to clear the _PAGE_GLOBAL bit for these area in the PTI code:
it cleared _PAGE_GLOBAL like I asked, but also took it up on itself
to clear _PAGE_RW. The kernel then oopses the next time it wrote to
a __ro_after_init data structure.
To fix this, add the kernel_set_to_readonly check, just like we have
for kernel text, just a few lines below in this function.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20180406205514.8D898241@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
(cherry picked from commit 639d6aafe437a7464399d2a77d006049053df06f) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Dave Hansen [Fri, 6 Apr 2018 20:55:13 +0000 (13:55 -0700)]
x86/mm: Comment _PAGE_GLOBAL mystery
I was mystified as to where the _PAGE_GLOBAL in the kernel page tables
for kernel text came from. I audited all the places I could find, but
I missed one: head_64.S.
The page tables that we create in here live for a long time, and they
also have _PAGE_GLOBAL set, despite whether the processor supports it
or not. It's harmless, and we got *lucky* that the pageattr code
accidentally clears it when we wipe it out of __supported_pte_mask and
then later try to mark kernel text read-only.
Comment some of these properties to make it easier to find and
understand in the future.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kees Cook <keescook@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20180406205513.079BB265@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
(cherry picked from commit 430d4005b8b41c19966dd3bfdb33004bdb2de01c) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Dave Hansen [Fri, 6 Apr 2018 20:55:11 +0000 (13:55 -0700)]
x86/mm: Remove extra filtering in pageattr code
The pageattr code has a mode where it can set or clear PTE bits in
existing PTEs, so the page protections of the *new* PTEs come from
one of two places:
1. The set/clear masks: cpa->mask_clr / cpa->mask_set
2. The existing PTE
We filter ->mask_set/clr for supported PTE bits at entry to
__change_page_attr() so we never need to filter them again.
The only other place permissions can come from is an existing PTE
and those already presumably have good bits. We do not need to filter
them again.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kees Cook <keescook@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20180406205511.BC072352@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
(cherry picked from commit 1a54420aeb4da1ba5b28283aa5696898220c9a27) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Dave Hansen [Fri, 6 Apr 2018 20:55:09 +0000 (13:55 -0700)]
x86/mm: Do not auto-massage page protections
A PTE is constructed from a physical address and a pgprotval_t.
__PAGE_KERNEL, for instance, is a pgprot_t and must be converted
into a pgprotval_t before it can be used to create a PTE. This is
done implicitly within functions like pfn_pte() by massage_pgprot().
However, this makes it very challenging to set bits (and keep them
set) if your bit is being filtered out by massage_pgprot().
This moves the bit filtering out of pfn_pte() and friends. For
users of PAGE_KERNEL*, filtering will be done automatically inside
those macros but for users of __PAGE_KERNEL*, they need to do their
own filtering now.
Note that we also just move pfn_pte/pmd/pud() over to check_pgprot()
instead of massage_pgprot(). This way, we still *look* for
unsupported bits and properly warn about them if we find them. This
might happen if an unfiltered __PAGE_KERNEL* value was passed in,
for instance.
- printk format warning fix from: Arnd Bergmann <arnd@arndb.de>
- boot crash fix from: Tom Lendacky <thomas.lendacky@amd.com>
- crash bisected by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reported-and-fixed-by: Arnd Bergmann <arnd@arndb.de> Fixed-by: Tom Lendacky <thomas.lendacky@amd.com> Bisected-by: Mike Galbraith <efault@gmx.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kees Cook <keescook@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20180406205509.77E1D7F6@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
(backported from commit fb43d6cb91ef57d9e58d5f69b423784ff4a4c374)
[tyhicks: Backport around missing commit 91f606a] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Dave Hansen [Fri, 6 Apr 2018 20:55:07 +0000 (13:55 -0700)]
x86/espfix: Document use of _PAGE_GLOBAL
The "normal" kernel page table creation mechanisms using
PAGE_KERNEL_* page protections will never set _PAGE_GLOBAL with PTI.
The few places in the kernel that always want _PAGE_GLOBAL must
avoid using PAGE_KERNEL_*.
Document that we want it here and its use is not accidental.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kees Cook <keescook@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20180406205507.BCF4D4F0@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
(cherry picked from commit 6baf4bec02dbc41645c3a5130ee15a8e1d62b80f) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Dave Hansen [Fri, 6 Apr 2018 20:55:06 +0000 (13:55 -0700)]
x86/mm: Introduce "default" kernel PTE mask
The __PAGE_KERNEL_* page permissions are "raw". They contain bits
that may or may not be supported on the current processor. They need
to be filtered by a mask (currently __supported_pte_mask) to turn them
into a value that we can actually set in a PTE.
These __PAGE_KERNEL_* values all contain _PAGE_GLOBAL. But, with PTI,
we want to be able to support _PAGE_GLOBAL (have the bit set in
__supported_pte_mask) but not have it appear in any of these masks by
default.
This patch creates a new mask, __default_kernel_pte_mask, and applies
it when creating all of the PAGE_KERNEL_* masks. This makes
PAGE_KERNEL_* safe to use anywhere (they only contain supported bits).
It also ensures that PAGE_KERNEL_* contains _PAGE_GLOBAL on PTI=n
kernels but clears _PAGE_GLOBAL when PTI=y.
We also make __default_kernel_pte_mask a non-GPL exported symbol
because there are plenty of driver-available interfaces that take
PAGE_KERNEL_* permissions.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kees Cook <keescook@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20180406205506.030DB6B6@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
(backported from commit 8a57f4849f4fa22ed18a941164a214083fc020a2)
[tyhicks: Backport around missing commit 076ca27] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Dave Hansen [Fri, 6 Apr 2018 20:55:04 +0000 (13:55 -0700)]
x86/mm: Undo double _PAGE_PSE clearing
When clearing _PAGE_PRESENT on a huge page, we need to be careful
to also clear _PAGE_PSE, otherwise it might still get confused
for a valid large page table entry.
We do that near the spot where we *set* _PAGE_PSE. That's fine,
but it's unnecessary. pgprot_large_2_4k() already did it.
BTW, I also noticed that pgprot_large_2_4k() and
pgprot_4k_2_large() are not symmetric. pgprot_large_2_4k() clears
_PAGE_PSE (because it is aliased to _PAGE_PAT) but
pgprot_4k_2_large() does not put _PAGE_PSE back. Bummer.
Also, add some comments and change "promote" to "move". "Promote"
seems an odd word to move when we are logically moving a bit to a
lower bit position. Also add an extra line return to make it clear
to which line the comment applies.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kees Cook <keescook@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20180406205504.9B0F44A9@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
(cherry picked from commit 606c7193d5fbf8ea3dafc8a9468f719fbf1d7160) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Dave Hansen [Fri, 6 Apr 2018 20:55:02 +0000 (13:55 -0700)]
x86/mm: Factor out pageattr _PAGE_GLOBAL setting
The pageattr code has a pattern repeated where it sets _PAGE_GLOBAL
for present PTEs but clears it for non-present PTEs. The intention
is to keep _PAGE_GLOBAL from getting confused with _PAGE_PROTNONE
since _PAGE_GLOBAL is for present PTEs and _PAGE_PROTNONE is for
non-present
But, this pattern makes no sense. Effectively, it says, if you use
the pageattr code, always set _PAGE_GLOBAL when _PAGE_PRESENT.
canon_pgprot() will clear it if unsupported (because it masks the
value with __supported_pte_mask) but we *always* set it. Even if
canon_pgprot() did not filter _PAGE_GLOBAL, it would be OK.
_PAGE_GLOBAL is ignored when CR4.PGE=0 by the hardware.
This unconditional setting of _PAGE_GLOBAL is a problem when we have
PTI and non-PTI and we want some areas to have _PAGE_GLOBAL and some
not.
This updated version of the code says:
1. Clear _PAGE_GLOBAL when !_PAGE_PRESENT
2. Never set _PAGE_GLOBAL implicitly
3. Allow _PAGE_GLOBAL to be in cpa.set_mask
4. Allow _PAGE_GLOBAL to be inherited from previous PTE
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kees Cook <keescook@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20180406205502.86E199DA@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
(cherry picked from commit d1440b23c922d845ff039f64694a32ff356e89fa) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
The current logic incorrectly calculates the LLC ID from the APIC ID.
Unless specified otherwise, the LLC ID should be calculated by removing
the Core and Thread ID bits from the least significant end of the APIC
ID. For more info, see "ApicId Enumeration Requirements" in any Fam17h
PPR document.
[ bp: Improve commit message. ]
Fixes: 68091ee7ac3c ("Calculate last level cache ID from number of sharing threads") Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1528915390-30533-1-git-send-email-suravee.suthikulpanit@amd.com
CVE-2018-3620
CVE-2018-3646
(cherry picked from commit 964d978433a4b9aa1368ff71227ca0027dd1e32f) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
David Wang [Thu, 3 May 2018 02:32:44 +0000 (10:32 +0800)]
x86/CPU: Make intel_num_cpu_cores() generic
intel_num_cpu_cores() is a static function in intel.c which can't be used
by other files. Define another function called detect_num_cpu_cores() in
common.c to replace this function so it can be reused.
Signed-off-by: David Wang <davidwang@zhaoxin.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: lukelin@viacpu.com Cc: qiyuanwang@zhaoxin.com Cc: gregkh@linuxfoundation.org Cc: brucechang@via-alliance.com Cc: timguo@zhaoxin.com Cc: cooperyan@zhaoxin.com Cc: hpa@zytor.com Cc: benjaminpan@viatech.com Link: https://lkml.kernel.org/r/1525314766-18910-2-git-send-email-davidwang@zhaoxin.com
CVE-2018-3620
CVE-2018-3646
(cherry picked from commit 2cc61be60e37b1856a97ccbdcca3e86e593bf06a) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
x86/CPU: Modify detect_extended_topology() to return result
Current implementation does not communicate whether it can successfully
detect CPUID function 0xB information. Therefore, modify the function to
return success or error codes. This will be used by subsequent patches.
x86/CPU/AMD: Calculate last level cache ID from number of sharing threads
Last Level Cache ID can be calculated from the number of threads sharing
the cache, which is available from CPUID Fn0x8000001D (Cache Properties).
This is used to left-shift the APIC ID to derive LLC ID.
Therefore, default to this method unless the APIC ID enumeration does not
follow the scheme.
perf/events/amd/uncore: Fix amd_uncore_llc ID to use pre-defined cpu_llc_id
Current logic iterates over CPUID Fn8000001d leafs (Cache Properties)
to detect the last level cache, and derive the last-level cache ID.
However, this information is already available in the cpu_llc_id.
Therefore, make use of it instead.
x86/CPU/AMD: Have smp_num_siblings and cpu_llc_id always be present
Move smp_num_siblings and cpu_llc_id to cpu/common.c so that they're
always present as symbols and not only in the CONFIG_SMP case. Then,
other code using them doesn't need ugly ifdeffery anymore. Get rid of
some ifdeffery.
David Wang [Thu, 3 May 2018 02:32:46 +0000 (10:32 +0800)]
x86/Centaur: Report correct CPU/cache topology
Centaur CPUs enumerate the cache topology in the same way as Intel CPUs,
but the function is unused so for. The Centaur init code also misses to
initialize x86_info::max_cores, so the CPU topology can't be described
correctly.
Initialize x86_info::max_cores and invoke init_cacheinfo() to make
CPU and cache topology information available and correct.
Signed-off-by: David Wang <davidwang@zhaoxin.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: lukelin@viacpu.com Cc: qiyuanwang@zhaoxin.com Cc: gregkh@linuxfoundation.org Cc: brucechang@via-alliance.com Cc: timguo@zhaoxin.com Cc: cooperyan@zhaoxin.com Cc: hpa@zytor.com Cc: benjaminpan@viatech.com Link: https://lkml.kernel.org/r/1525314766-18910-4-git-send-email-davidwang@zhaoxin.com
CVE-2018-3620
CVE-2018-3646
(cherry picked from commit a2aa578fec8c29436bce5e6c15e1e31729d539a3) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
David Wang [Fri, 20 Apr 2018 08:29:28 +0000 (16:29 +0800)]
x86/Centaur: Initialize supported CPU features properly
Centaur CPUs have some Intel compatible capabilities,including Permformance
Monitoring Counters and CPU virtualization capabilities. Initialize them in
the Centaur specific init code.
Signed-off-by: David Wang <davidwang@zhaoxin.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: lukelin@viacpu.com Cc: qiyuanwang@zhaoxin.com Cc: gregkh@linuxfoundation.org Cc: brucechang@via-alliance.com Cc: timguo@zhaoxin.com Cc: cooperyan@zhaoxin.com Cc: hpa@zytor.com Cc: benjaminpan@viatech.com Link: https://lkml.kernel.org/r/1524212968-28998-1-git-send-email-davidwang@zhaoxin.com
CVE-2018-3620
CVE-2018-3646
(cherry picked from commit 60882cc159e1416fb1d17210de60d4a3ba04e613) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Eric Dumazet [Mon, 23 Jul 2018 16:28:21 +0000 (09:28 -0700)]
tcp: add tcp_ooo_try_coalesce() helper
In case skb in out_or_order_queue is the result of
multiple skbs coalescing, we would like to get a proper gso_segs
counter tracking, so that future tcp_drop() can report an accurate
number.
I chose to not implement this tracking for skbs in receive queue,
since they are not dropped, unless socket is disconnected.
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
CVE-2018-5390
(cherry picked from commit 58152ecbbcc6a0ce7fddd5bf5f6ee535834ece0c) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Eric Dumazet [Mon, 23 Jul 2018 16:28:20 +0000 (09:28 -0700)]
tcp: call tcp_drop() from tcp_data_queue_ofo()
In order to be able to give better diagnostics and detect
malicious traffic, we need to have better sk->sk_drops tracking.
Fixes: 9f5afeae5152 ("tcp: use an RB tree for ooo receive queue") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
CVE-2018-5390
(cherry picked from commit 8541b21e781a22dce52a74fef0b9bed00404a1cd) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Eric Dumazet [Mon, 23 Jul 2018 16:28:19 +0000 (09:28 -0700)]
tcp: detect malicious patterns in tcp_collapse_ofo_queue()
In case an attacker feeds tiny packets completely out of order,
tcp_collapse_ofo_queue() might scan the whole rb-tree, performing
expensive copies, but not changing socket memory usage at all.
1) Do not attempt to collapse tiny skbs.
2) Add logic to exit early when too many tiny skbs are detected.
We prefer not doing aggressive collapsing (which copies packets)
for pathological flows, and revert to tcp_prune_ofo_queue() which
will be less expensive.
In the future, we might add the possibility of terminating flows
that are proven to be malicious.
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
CVE-2018-5390
(cherry picked from commit 3d4bf93ac12003f9b8e1e2de37fe27983deebdcf) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Eric Dumazet [Mon, 23 Jul 2018 16:28:18 +0000 (09:28 -0700)]
tcp: avoid collapses in tcp_prune_queue() if possible
Right after a TCP flow is created, receiving tiny out of order
packets allways hit the condition :
if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
tcp_clamp_window(sk);
tcp_clamp_window() increases sk_rcvbuf to match sk_rmem_alloc
(guarded by tcp_rmem[2])
Calling tcp_collapse_ofo_queue() in this case is not useful,
and offers a O(N^2) surface attack to malicious peers.
Better not attempt anything before full queue capacity is reached,
forcing attacker to spend lots of resource and allow us to more
easily detect the abuse.
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
CVE-2018-5390
(cherry picked from commit f4a3313d8e2ca9fd8d8f45e40a2903ba782607e7) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Eric Dumazet [Mon, 23 Jul 2018 16:28:17 +0000 (09:28 -0700)]
tcp: free batches of packets in tcp_prune_ofo_queue()
Juha-Matti Tilli reported that malicious peers could inject tiny
packets in out_of_order_queue, forcing very expensive calls
to tcp_collapse_ofo_queue() and tcp_prune_ofo_queue() for
every incoming packet. out_of_order_queue rb-tree can contain
thousands of nodes, iterating over all of them is not nice.
Before linux-4.9, we would have pruned all packets in ofo_queue
in one go, every XXXX packets. XXXX depends on sk_rcvbuf and skbs
truesize, but is about 7000 packets with tcp_rmem[2] default of 6 MB.
Since we plan to increase tcp_rmem[2] in the future to cope with
modern BDP, can not revert to the old behavior, without great pain.
Strategy taken in this patch is to purge ~12.5 % of the queue capacity.
Fixes: 36a6503fedda ("tcp: refine tcp_prune_ofo_queue() to not drop all packets") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Juha-Matti Tilli <juha-matti.tilli@iki.fi> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
CVE-2018-5390
(cherry picked from commit 72cd43ba64fc172a443410ce01645895850844c8) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Wei Yongjun [Tue, 5 Jun 2018 09:16:21 +0000 (09:16 +0000)]
ocxl: Fix missing unlock on error in afu_ioctl_enable_p9_wait()
BugLink: http://bugs.launchpad.net/bugs/1775786
Add the missing unlock before return from function
afu_ioctl_enable_p9_wait() in the error handling case.
Fixes: e948e06fc63a ("ocxl: Expose the thread_id needed for wait on POWER9") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Reviewed-by: Alastair D'Silva <alastair@d-silva.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 2e5c93d6bb2f7bc17eb82748943a1b9f6b068520) Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Alastair D'Silva [Fri, 11 May 2018 06:13:02 +0000 (16:13 +1000)]
ocxl: Add an IOCTL so userspace knows what OCXL features are available
BugLink: http://bugs.launchpad.net/bugs/1775786
In order for a userspace AFU driver to call the POWER9 specific
OCXL_IOCTL_ENABLE_P9_WAIT, it needs to verify that it can actually
make that call.
Signed-off-by: Alastair D'Silva <alastair@d-silva.org> Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 02a8e5bc1c06045f36423bd9632ad9f40da18d3f) Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Alastair D'Silva [Fri, 11 May 2018 06:13:01 +0000 (16:13 +1000)]
ocxl: Expose the thread_id needed for wait on POWER9
BugLink: http://bugs.launchpad.net/bugs/1775786
In order to successfully issue as_notify, an AFU needs to know the TID
to notify, which in turn means that this information should be
available in userspace so it can be communicated to the AFU.
Signed-off-by: Alastair D'Silva <alastair@d-silva.org> Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit e948e06fc63a1c1e36ec4c8e5c510b881ff19c26) Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Alastair D'Silva [Fri, 11 May 2018 06:12:59 +0000 (16:12 +1000)]
powerpc: use task_pid_nr() for TID allocation
BugLink: http://bugs.launchpad.net/bugs/1775786
The current implementation of TID allocation, using a global IDR, may
result in an errant process starving the system of available TIDs.
Instead, use task_pid_nr(), as mentioned by the original author. The
scenario described which prevented it's use is not applicable, as
set_thread_tidr can only be called after the task struct has been
populated.
In the unlikely event that 2 threads share the TID and are waiting,
all potential outcomes have been determined safe.
Signed-off-by: Alastair D'Silva <alastair@d-silva.org> Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 71cc64a85d8d99936f6851709a07f18c87a0adab) Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Alastair D'Silva [Fri, 11 May 2018 06:12:57 +0000 (16:12 +1000)]
powerpc: Add TIDR CPU feature for POWER9
BugLink: http://bugs.launchpad.net/bugs/1775786
This patch adds a CPU feature bit to show whether the CPU has
the TIDR register available, enabling as_notify/wait in userspace.
Signed-off-by: Alastair D'Silva <alastair@d-silva.org> Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(backported from commit 819844285ef2b5d15466f5b5062514135ffba06c) Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Nicholas Piggin [Fri, 22 Jun 2018 16:08:00 +0000 (18:08 +0200)]
powerpc: Fix smp_send_stop NMI IPI handling
BugLink: http://bugs.launchpad.net/bugs/1777194
The NMI IPI handler for a receiving CPU increments nmi_ipi_busy_count
over the handler function call, which causes later smp_send_nmi_ipi()
callers to spin until the call is finished.
The stop_this_cpu() function never returns, so the busy count is never
decremeted, which can cause the system to hang in some cases. For
example panic() will call smp_send_stop() early on which calls
stop_this_cpu() on other CPUs, then later in the reboot path,
pnv_restart() will call smp_send_stop() again, which hangs.
Fix this by adding a special case to the stop_this_cpu() handler to
decrement the busy count, because it will never return.
Now that the NMI/non-NMI versions of stop_this_cpu() are different,
split them out into separate functions rather than doing #ifdef tricks
to share the body between the two functions.
Fixes: 6bed3237624e3 ("powerpc: use NMI IPI for smp_send_stop") Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Split out the functions, tweak change log a bit] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit ac61c1156623455c46701654abd8c99720bceea1) Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Nicholas Piggin [Fri, 22 Jun 2018 16:08:00 +0000 (18:08 +0200)]
powerpc: use NMI IPI for smp_send_stop
BugLink: http://bugs.launchpad.net/bugs/1777194
Use the NMI IPI rather than smp_call_function for smp_send_stop.
Have stopped CPUs hard disable interrupts rather than just soft
disable.
This function is used in crash/panic/shutdown paths to bring other
CPUs down as quickly and reliably as possible, and minimizing their
potential to cause trouble.
Avoiding the Linux smp_call_function infrastructure and (if supported)
using true NMI IPIs makes this more robust.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 6bed3237624e3faad1592543952907cd01a42c83) Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Nicholas Piggin [Wed, 13 Jun 2018 15:10:08 +0000 (11:10 -0400)]
rtc: opal: Fix OPAL RTC driver OPAL_BUSY loops
BugLink: http://bugs.launchpad.net/bugs/1773964
The OPAL RTC driver does not sleep in case it gets OPAL_BUSY or
OPAL_BUSY_EVENT from firmware, which causes large scheduling
latencies, up to 50 seconds have been observed here when RTC stops
responding (BMC reboot can do it).
Fix this by converting it to the standard form OPAL_BUSY loop that
sleeps.
Fixes: 628daa8d5abf ("powerpc/powernv: Add RTC and NVRAM support plus RTAS fallbacks") Cc: stable@vger.kernel.org # v3.2+ Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 682e6b4da5cbe8e9a53f979a58c2a9d7dc997175) Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com> Acked-by: Khalid Elmously <khalid.elmously@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
UBUNTU: SAUCE: ext4: check for allocation block validity with block group locked
BugLink: https://bugs.launchpad.net/bugs/1781709
With commit 044e6e3d74a3: "ext4: don't update checksum of new
initialized bitmaps" the buffer valid bit will get set without
actually setting up the checksum for the allocation bitmap, since the
checksum will get calculated once we actually allocate an inode or
block.
If we are doing this, then we need to (re-)check the verified bit
after we take the block group lock. Otherwise, we could race with
another process reading and verifying the bitmap, which would then
complain about the checksum being invalid.
Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: dann frazier <dann.frazier@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Colin King <colin.king@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
This fix was incomplete. Will replace with the complete fix in the next commit.
Signed-off-by: dann frazier <dann.frazier@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Colin King <colin.king@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Ross Lagerwall [Thu, 21 Jun 2018 13:00:21 +0000 (14:00 +0100)]
xen-netfront: Update features after registering netdev
Update the features after calling register_netdev() otherwise the
device features are not set up correctly and it not possible to change
the MTU of the device. After this change, the features reported by
ethtool match the device's features before the commit which introduced
the issue and it is possible to change the device's MTU.
Fixes: f599c64fdf7d ("xen-netfront: Fix race between device setup and open") Reported-by: Liam Shepherd <liam@dancer.es> Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net> BugLink: https://bugs.launchpad.net/bugs/1781413
(cherry picked from commit 45c8184c1bed1ca8a7f02918552063a00b909bf5) Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Khaled Elmously <khalid.elmously@canonical.com> Acked-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Kamal Mostafa <kamal@canonical.com>
Ross Lagerwall [Thu, 21 Jun 2018 13:00:20 +0000 (14:00 +0100)]
xen-netfront: Fix mismatched rtnl_unlock
Fixes: f599c64fdf7d ("xen-netfront: Fix race between device setup and open") Reported-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net> BugLink: https://bugs.launchpad.net/bugs/1781413
(cherry picked from commit cb257783c2927b73614b20f915a91ff78aa6f3e8) Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Khaled Elmously <khalid.elmously@canonical.com> Acked-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Kamal Mostafa <kamal@canonical.com>
Regression observed on some arm64 server platforms:
EXT4-fs error (device sda1): ext4_validate_inode_bitmap:99: comm stress-ng: Corrupt inode bitmap
Reference: https://lkml.org/lkml/2018/7/7/2
Reported-by: dann frazier <dann.frazier@canonical.com> Tested-by: dann frazier <dann.frazier@canonical.com> Fixes: 044e6e3d74a3 ext4: don't update checksum of new initialized bitmaps Signed-off-by: Kamal Mostafa <kamal@canonical.com> Acked-by: Khalid Elmously <khalid.elmously@canonical.com> Acked-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Ben Hutchings [Wed, 13 Jun 2018 03:31:00 +0000 (04:31 +0100)]
random: Make getrandom() ready earlier
This effectively reverts commit 725e828 "random: fix crng_ready()
test" which was commit 43838a23a05f upstream. Unfortunately some
users of getrandom() don't expect it to block for long, and they need
to be fixed before we can allow this change into stable.
This doesn't directly revert that commit, but only weakens the ready
condition used by getrandom() when the GRND_RANDOM flag is not set.
Calls to getrandom() that return before the RNG is fully seeded will
generate warnings, just like reads from /dev/urandom.
BugLink: https://bugs.launchpad.net/bugs/1780062 BugLink: https://bugs.launchpad.net/bugs/1779827
(backported from ://salsa.debian.org/kernel-team/linux/raw/stretch/debian/patches/debian/random-make-getrandom-ready-earlier.patch)
[smb: open code waiting in getrandom directly] Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Colin King <colin.king@canonical.com> Acked-by: Khaled Elmously <khalid.elmously@canonical.com> Acked-by: Marcelo Cerri <marcelo.cerri@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Xiang Chen [Thu, 31 May 2018 12:50:48 +0000 (20:50 +0800)]
scsi: hisi_sas: Pre-allocate slot DMA buffers
BugLink: https://bugs.launchpad.net/bugs/1777727
Currently the driver spends much time allocating and freeing the slot DMA
buffer for command delivery/completion. To boost the performance,
pre-allocate the buffers for all IPTT. The downside of this approach is
that we are reallocating all buffer memory upfront, so hog memory which we
may not need.
However, the current method - DMA buffer pool - also caches all buffers and
does not free them until the pool is destroyed, so is not exactly efficient
either.
On top of this, since the slot DMA buffer is slightly bigger than a 4K
page, we need to allocate 2x4K pages per buffer (for 4K page kernel), which
is quite wasteful. For 64K page size this is not such an issue.
So, for the 4K page case, in order to make memory usage more efficient,
pre-allocating larger blocks of DMA memory for the buffers can be more
efficient.
To make DMA memory usage most efficient, we would choose a single
contiguous DMA memory block, but this could use up all the DMA memory in
the system (when CMA enabled and no IOMMU), or we may just not be able to
allocate a DMA buffer large enough when no CMA or IOMMU.
To decide the block size we use the LCM (least common multiple) of the
buffer size and the page size. We roundup(64) to ensure the LCM is not too
large, even though a little memory may be wasted per block.
So, with this, the total memory requirement is about is about 17MB for 4096
max IPTT.
Previously (for 4K pages case), it would be 32MB (for all slots
allocated).
With this change, the relative increase of IOPS for bs=4K read when
PAGE_SIZE=4K and PAGE_SIZE=64K is as follows:
IODEPTH 4K PAGE_SIZE 64K PAGE_SIZE
32 56% 47%
64 53% 44%
128 64% 43%
256 67% 45%
Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit bbb14d21a0614deefbdf24da61606f7bd7a1263a scsi) Signed-off-by: dann frazier <dann.frazier@canonical.com> Acked-by: Khalid Elmously <khalid.elmously@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Xiaofei Tan [Thu, 31 May 2018 12:50:47 +0000 (20:50 +0800)]
scsi: hisi_sas: Release all remaining resources in clear nexus ha
BugLink: https://bugs.launchpad.net/bugs/1777696
In host reset, we use TMF or soft-reset to re-init device, and if success,
we will release all LLDD resources of this device. If the init fails -
maybe because the device was removed or link has not come up - then do not
release the LLDD resources, but rather rely on SCSI EH to handle the
timeout for these resources later on.
But if clear nexus ha calls host reset, which is the last effort of SCSI
EH, we should release all LLDD remain resources. Because SCSI EH will
release all tasks after clear nexus ha.
Before release, we do I_T nexus reset to try to clear target remain IOs.
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit d213102e6d75df919444c374e3a2dee13e98644e scsi) Signed-off-by: dann frazier <dann.frazier@canonical.com> Acked-by: Khalid Elmously <khalid.elmously@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Xiaofei Tan [Thu, 31 May 2018 12:50:46 +0000 (20:50 +0800)]
scsi: hisi_sas: Add a flag to filter PHY events during reset
BugLink: https://bugs.launchpad.net/bugs/1777696
During reset, we don't want PHY events reported to libsas for PHYs which
were previously attached prior to reset.
So check hisi_hba->flags for HISI_SAS_RESET_BIT to filter PHY events during
reset.
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 3c878dc1e2d10e2403913b04b94513b626c252a8 scsi) Signed-off-by: dann frazier <dann.frazier@canonical.com> Acked-by: Khalid Elmously <khalid.elmously@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Xiaofei Tan [Thu, 31 May 2018 12:50:45 +0000 (20:50 +0800)]
scsi: hisi_sas: Adjust task reject period during host reset
BugLink: https://bugs.launchpad.net/bugs/1777696
After soft_reset() for host reset, we should not be allowed to send
commands to the HW before the PHYs have come up and the port ids have been
refreshed.
Prior to this point, any commands cannot be successfully completed.
This exclusion is achieved by grabbing the host reset semaphore.
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit b36d9c963f7eee2b949b900b477f3d7f6f0d7b76 scsi) Signed-off-by: dann frazier <dann.frazier@canonical.com> Acked-by: Khalid Elmously <khalid.elmously@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
The reason is that then the device is notified as gone, we try to clear the
ITCT, which is notified via an interrupt. The dev gone function pends on
this event with a completion, which is completed when the ITCT interrupt
occurs.
But host reset will disable all interrupts, the wait_for_completion() may
wait indefinitely.
This patch adds an semaphore to synchronise this two processes. The
semaphore is taken by the host reset as the basis of synchronising.
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit e5baf18c5e3146d4156f8a81e5da0c66bac8d037 scsi) Signed-off-by: dann frazier <dann.frazier@canonical.com> Acked-by: Khalid Elmously <khalid.elmously@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Xiaofei Tan [Thu, 31 May 2018 12:50:43 +0000 (20:50 +0800)]
scsi: hisi_sas: Only process broadcast change in phy_bcast_v3_hw()
BugLink: https://bugs.launchpad.net/bugs/1777696
There are many BROADCAST primitives generated by the host. We are only
interested in BROADCAST (CHANGE) primitives currently, so only process
this.
We have applied this processing for v2 hw before, and it is also needed for
v3 hw.
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit b65a8ae24b600af7929df79acd16bd96fae2d935 scsi) Signed-off-by: dann frazier <dann.frazier@canonical.com>
[ dannf: included as a dependency for
"scsi: hisi_sas: Add a flag to filter PHY events during reset" ] Acked-by: Khalid Elmously <khalid.elmously@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Xiang Chen [Thu, 31 May 2018 12:50:42 +0000 (20:50 +0800)]
scsi: hisi_sas: Use dmam_alloc_coherent()
BugLink: https://bugs.launchpad.net/bugs/1777727
This patch replaces the usage of dma_alloc_coherent() with the managed
version, dmam_alloc_coherent(), hereby reducing replicated code.
Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com>
Signed-off-by; John Garry <john.garry@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit d5738b82293675e35124b903131d7a42b27995e0 scsi) Signed-off-by: dann frazier <dann.frazier@canonical.com>
[ dannf: included to simplify the backporting of:
"scsi: hisi_sas: Pre-allocate slot DMA buffers" ] Acked-by: Khalid Elmously <khalid.elmously@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>