Peter Zijlstra [Tue, 29 May 2018 14:43:46 +0000 (16:43 +0200)]
sched/smt: Update sched_smt_present at runtime
The static key sched_smt_present is only updated at boot time when SMT
siblings have been detected. Booting with maxcpus=1 and bringing the
siblings online after boot rebuilds the scheduling domains correctly but
does not update the static key, so the SMT code is not enabled.
Let the key be updated in the scheduler CPU hotplug code to fix this.
Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Andi Kleen [Wed, 13 Jun 2018 22:48:28 +0000 (15:48 -0700)]
x86/speculation/l1tf: Limit swap file size to MAX_PA/2
For the L1TF workaround its necessary to limit the swap file size to below
MAX_PA/2, so that the higher bits of the swap offset inverted never point
to valid memory.
Add a mechanism for the architecture to override the swap file size check
in swapfile.c and add a x86 specific max swapfile check function that
enforces that limit.
The check is only enabled if the CPU is vulnerable to L1TF.
In VMs with 42bit MAX_PA the typical limit is 2TB now, on a native system
with 46bit PA it is 32TB. The limit is only per individual swap file, so
it's always possible to exceed these limits with multiple swap files or
partitions.
Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Dave Hansen <dave.hansen@intel.com>
CVE-2018-3620
CVE-2018-3646
[tyhicks: Backport around missing commit a06ad63] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Andi Kleen [Wed, 13 Jun 2018 22:48:27 +0000 (15:48 -0700)]
x86/speculation/l1tf: Disallow non privileged high MMIO PROT_NONE mappings
For L1TF PROT_NONE mappings are protected by inverting the PFN in the page
table entry. This sets the high bits in the CPU's address space, thus
making sure to point to not point an unmapped entry to valid cached memory.
Some server system BIOSes put the MMIO mappings high up in the physical
address space. If such an high mapping was mapped to unprivileged users
they could attack low memory by setting such a mapping to PROT_NONE. This
could happen through a special device driver which is not access
protected. Normal /dev/mem is of course access protected.
To avoid this forbid PROT_NONE mappings or mprotect for high MMIO mappings.
Valid page mappings are allowed because the system is then unsafe anyways.
It's not expected that users commonly use PROT_NONE on MMIO. But to
minimize any impact this is only enforced if the mapping actually refers to
a high MMIO address (defined as the MAX_PA-1 bit being set), and also skip
the check for root.
For mmaps this is straight forward and can be handled in vm_insert_pfn and
in remap_pfn_range().
For mprotect it's a bit trickier. At the point where the actual PTEs are
accessed a lot of state has been changed and it would be difficult to undo
on an error. Since this is a uncommon case use a separate early page talk
walk pass for MMIO PROT_NONE mappings that checks for this condition
early. For non MMIO and non PROT_NONE there are no changes.
Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Dave Hansen <dave.hansen@intel.com>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Andi Kleen [Wed, 13 Jun 2018 22:48:26 +0000 (15:48 -0700)]
x86/speculation/l1tf: Add sysfs reporting for l1tf
L1TF core kernel workarounds are cheap and normally always enabled, However
they still should be reported in sysfs if the system is vulnerable or
mitigated. Add the necessary CPU feature/bug bits.
- Extend the existing checks for Meltdowns to determine if the system is
vulnerable. All CPUs which are not vulnerable to Meltdown are also not
vulnerable to L1TF
- Check for 32bit non PAE and emit a warning as there is no practical way
for mitigation due to the limited physical address bits
- If the system has more than MAX_PA/2 physical memory the invert page
workarounds don't protect the system against the L1TF attack anymore,
because an inverted physical address will also point to valid
memory. Print a warning in this case and report that the system is
vulnerable.
Add a function which returns the PFN limit for the L1TF mitigation, which
will be used in follow up patches for sanity and range checks.
[ tglx: Renamed the CPU feature bit to L1TF_PTEINV ]
Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Dave Hansen <dave.hansen@intel.com>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Andi Kleen [Wed, 13 Jun 2018 22:48:25 +0000 (15:48 -0700)]
x86/speculation/l1tf: Make sure the first page is always reserved
The L1TF workaround doesn't make any attempt to mitigate speculate accesses
to the first physical page for zeroed PTEs. Normally it only contains some
data from the early real mode BIOS.
It's not entirely clear that the first page is reserved in all
configurations, so add an extra reservation call to make sure it is really
reserved. In most configurations (e.g. with the standard reservations)
it's likely a nop.
Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Dave Hansen <dave.hansen@intel.com>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Andi Kleen [Wed, 13 Jun 2018 22:48:24 +0000 (15:48 -0700)]
x86/speculation/l1tf: Protect PROT_NONE PTEs against speculation
When PTEs are set to PROT_NONE the kernel just clears the Present bit and
preserves the PFN, which creates attack surface for L1TF speculation
speculation attacks.
This is important inside guests, because L1TF speculation bypasses physical
page remapping. While the host has its own migitations preventing leaking
data from other VMs into the guest, this would still risk leaking the wrong
page inside the current guest.
This uses the same technique as Linus' swap entry patch: while an entry is
is in PROTNONE state invert the complete PFN part part of it. This ensures
that the the highest bit will point to non existing memory.
The invert is done by pte/pmd_modify and pfn/pmd/pud_pte for PROTNONE and
pte/pmd/pud_pfn undo it.
This assume that no code path touches the PFN part of a PTE directly
without using these primitives.
This doesn't handle the case that MMIO is on the top of the CPU physical
memory. If such an MMIO region was exposed by an unpriviledged driver for
mmap it would be possible to attack some real memory. However this
situation is all rather unlikely.
For 32bit non PAE the inversion is not done because there are really not
enough bits to protect anything.
Q: Why does the guest need to be protected when the HyperVisor already has
L1TF mitigations?
A: Here's an example:
Physical pages 1 2 get mapped into a guest as
GPA 1 -> PA 2
GPA 2 -> PA 1
through EPT.
The L1TF speculation ignores the EPT remapping.
Now the guest kernel maps GPA 1 to process A and GPA 2 to process B, and
they belong to different users and should be isolated.
A sets the GPA 1 PA 2 PTE to PROT_NONE to bypass the EPT remapping and
gets read access to the underlying physical page. Which in this case
points to PA 2, so it can read process B's data, if it happened to be in
L1, so isolation inside the guest is broken.
There's nothing the hypervisor can do about this. This mitigation has to
be done in the guest itself.
[ tglx: Massaged changelog ]
Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Dave Hansen <dave.hansen@intel.com>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Linus Torvalds [Wed, 13 Jun 2018 22:48:23 +0000 (15:48 -0700)]
x86/speculation/l1tf: Protect swap entries against L1TF
With L1 terminal fault the CPU speculates into unmapped PTEs, and resulting
side effects allow to read the memory the PTE is pointing too, if its
values are still in the L1 cache.
For swapped out pages Linux uses unmapped PTEs and stores a swap entry into
them.
To protect against L1TF it must be ensured that the swap entry is not
pointing to valid memory, which requires setting higher bits (between bit
36 and bit 45) that are inside the CPUs physical address space, but outside
any real memory.
To do this invert the offset to make sure the higher bits are always set,
as long as the swap file is not too big.
Note there is no workaround for 32bit !PAE, or on systems which have more
than MAX_PA/2 worth of memory. The later case is very unlikely to happen on
real systems.
[AK: updated description and minor tweaks by. Split out from the original
patch ]
Linus Torvalds [Wed, 13 Jun 2018 22:48:22 +0000 (15:48 -0700)]
x86/speculation/l1tf: Change order of offset/type in swap entry
If pages are swapped out, the swap entry is stored in the corresponding
PTE, which has the Present bit cleared. CPUs vulnerable to L1TF speculate
on PTE entries which have the present bit set and would treat the swap
entry as phsyical address (PFN). To mitigate that the upper bits of the PTE
must be set so the PTE points to non existent memory.
The swap entry stores the type and the offset of a swapped out page in the
PTE. type is stored in bit 9-13 and offset in bit 14-63. The hardware
ignores the bits beyond the phsyical address space limit, so to make the
mitigation effective its required to start 'offset' at the lowest possible
bit so that even large swap offsets do not reach into the physical address
space limit bits.
Move offset to bit 9-58 and type to bit 59-63 which are the bits that
hardware generally doesn't care about.
That, in turn, means that if you on desktop chip with only 40 bits of
physical addressing, now that the offset starts at bit 9, there needs to be
30 bits of offset actually *in use* until bit 39 ends up being set, which
means when inverted it will again point into existing memory.
So that's 4 terabyte of swap space (because the offset is counted in pages,
so 30 bits of offset is 42 bits of actual coverage). With bigger physical
addressing, that obviously grows further, until the limit of the offset is
hit (at 50 bits of offset - 62 bits of actual swap file coverage).
This is a preparatory change for the actual swap entry inversion to protect
against L1TF.
[ AK: Updated description and minor tweaks. Split into two parts ]
[ tglx: Massaged changelog ]
L1 Terminal Fault (L1TF) is a speculation related vulnerability. The CPU
speculates on PTE entries which do not have the PRESENT bit set, if the
content of the resulting physical address is available in the L1D cache.
The OS side mitigation makes sure that a !PRESENT PTE entry points to a
physical address outside the actually existing and cachable memory
space. This is achieved by inverting the upper bits of the PTE. Due to the
address space limitations this only works for 64bit and 32bit PAE kernels,
but not for 32bit non PAE.
This mitigation applies to both host and guest kernels, but in case of a
64bit host (hypervisor) and a 32bit PAE guest, inverting the upper bits of
the PAE address space (44bit) is not enough if the host has more than 43
bits of populated memory address space, because the speculation treats the
PTE content as a physical host address bypassing EPT.
The host (hypervisor) protects itself against the guest by flushing L1D as
needed, but pages inside the guest are not protected against attacks from
other processes inside the same guest.
For the guest the inverted PTE mask has to match the host to provide the
full protection for all pages the host could possibly map into the
guest. The hosts populated address space is not known to the guest, so the
mask must cover the possible maximal host address space, i.e. 52 bit.
On 32bit PAE the maximum PTE mask is currently set to 44 bit because that
is the limit imposed by 32bit unsigned long PFNs in the VMs. This limits
the mask to be below what the host could possible use for physical pages.
The L1TF PROT_NONE protection code uses the PTE masks to determine which
bits to invert to make sure the higher bits are set for unmapped entries to
prevent L1TF speculation attacks against EPT inside guests.
In order to invert all bits that could be used by the host, increase
__PHYSICAL_PAGE_SHIFT to 52 to match 64bit.
The real limit for a 32bit PAE kernel is still 44 bits because all Linux
PTEs are created from unsigned long PFNs, so they cannot be higher than 44
bits on a 32bit kernel. So these extra PFN bits should be never set. The
only users of this macro are using it to look at PTEs, so it's safe.
[ tglx: Massaged changelog ]
Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Dave Hansen <dave.hansen@intel.com>
CVE-2018-3620
CVE-2018-3646
Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
x86/boot/64/clang: Use fixup_pointer() to access '__supported_pte_mask'
Clang builds with defconfig started crashing after the following
commit:
fb43d6cb91ef ("x86/mm: Do not auto-massage page protections")
This was caused by introducing a new global access in __startup_64().
Code in __startup_64() can be relocated during execution, but the compiler
doesn't have to generate PC-relative relocations when accessing globals
from that function. Clang actually does not generate them, which leads
to boot-time crashes. To work around this problem, every global pointer
must be adjusted using fixup_pointer().
Signed-off-by: Alexander Potapenko <glider@google.com> Reviewed-by: Dave Hansen <dave.hansen@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: dvyukov@google.com Cc: kirill.shutemov@linux.intel.com Cc: linux-mm@kvack.org Cc: md@google.com Cc: mka@chromium.org Fixes: fb43d6cb91ef ("x86/mm: Do not auto-massage page protections") Link: http://lkml.kernel.org/r/20180509091822.191810-1-glider@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
(backported from commit 4a09f0210c8b1221aae8afda8bd3a603fece0986) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
0day reported warnings at boot on 32-bit systems without NX support:
attempted to set unsupported pgprot: 8000000000000025 bits: 8000000000000000 supported: 7fffffffffffffff
WARNING: CPU: 0 PID: 1 at
arch/x86/include/asm/pgtable.h:540 handle_mm_fault+0xfc1/0xfe0:
check_pgprot at arch/x86/include/asm/pgtable.h:535
(inlined by) pfn_pte at arch/x86/include/asm/pgtable.h:549
(inlined by) do_anonymous_page at mm/memory.c:3169
(inlined by) handle_pte_fault at mm/memory.c:3961
(inlined by) __handle_mm_fault at mm/memory.c:4087
(inlined by) handle_mm_fault at mm/memory.c:4124
The problem is that due to the recent commit which removed auto-massaging
of page protections, filtering page permissions at PTE creation time is not
longer done, so vma->vm_page_prot is passed unfiltered to PTE creation.
Filter the page protections before they are installed in vma->vm_page_prot.
Fixes: fb43d6cb91 ("x86/mm: Do not auto-massage page protections") Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kees Cook <keescook@google.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: linux-mm@kvack.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Nadav Amit <namit@vmware.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Link: https://lkml.kernel.org/r/20180420222028.99D72858@viggo.jf.intel.com
CVE-2018-3620
CVE-2018-3646
(cherry picked from commit 316d097c4cd4e7f2ef50c40cff2db266593c4ec4) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
x86/power/64: Fix page-table setup for temporary text mapping
On a system with 4-level page-tables there is no p4d, so the pud in the pgd
should be mapped. The old code before commit fb43d6cb91ef already did that.
The change from above commit causes an invalid page-table which causes
undefined behavior. In one report it caused triple faults.
Fix it by changing the p4d back to pud.
Fixes: fb43d6cb91ef ('x86/mm: Do not auto-massage page protections') Reported-by: Borislav Petkov <bp@suse.de> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Michal Kubecek <mkubecek@suse.cz> Tested-by: Borislav Petkov <bp@suse.de> Cc: linux-pm@vger.kernel.org Cc: rjw@rjwysocki.net Cc: pavel@ucw.cz Cc: hpa@zytor.com Cc: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/1524162360-26179-1-git-send-email-joro@8bytes.org
CVE-2018-3620
CVE-2018-3646
(cherry picked from commit 05189820da23fc87ee2a7d87c20257f298af27f4) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Dave Hansen [Fri, 6 Apr 2018 20:55:14 +0000 (13:55 -0700)]
x86/mm: Do not forbid _PAGE_RW before init for __ro_after_init
__ro_after_init data gets stuck in the .rodata section. That's normally
fine because the kernel itself manages the R/W properties.
But, if we run __change_page_attr() on an area which is __ro_after_init,
the .rodata checks will trigger and force the area to be immediately
read-only, even if it is early-ish in boot. This caused problems when
trying to clear the _PAGE_GLOBAL bit for these area in the PTI code:
it cleared _PAGE_GLOBAL like I asked, but also took it up on itself
to clear _PAGE_RW. The kernel then oopses the next time it wrote to
a __ro_after_init data structure.
To fix this, add the kernel_set_to_readonly check, just like we have
for kernel text, just a few lines below in this function.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20180406205514.8D898241@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
(cherry picked from commit 639d6aafe437a7464399d2a77d006049053df06f) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Dave Hansen [Fri, 6 Apr 2018 20:55:13 +0000 (13:55 -0700)]
x86/mm: Comment _PAGE_GLOBAL mystery
I was mystified as to where the _PAGE_GLOBAL in the kernel page tables
for kernel text came from. I audited all the places I could find, but
I missed one: head_64.S.
The page tables that we create in here live for a long time, and they
also have _PAGE_GLOBAL set, despite whether the processor supports it
or not. It's harmless, and we got *lucky* that the pageattr code
accidentally clears it when we wipe it out of __supported_pte_mask and
then later try to mark kernel text read-only.
Comment some of these properties to make it easier to find and
understand in the future.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kees Cook <keescook@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20180406205513.079BB265@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
(cherry picked from commit 430d4005b8b41c19966dd3bfdb33004bdb2de01c) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Dave Hansen [Fri, 6 Apr 2018 20:55:11 +0000 (13:55 -0700)]
x86/mm: Remove extra filtering in pageattr code
The pageattr code has a mode where it can set or clear PTE bits in
existing PTEs, so the page protections of the *new* PTEs come from
one of two places:
1. The set/clear masks: cpa->mask_clr / cpa->mask_set
2. The existing PTE
We filter ->mask_set/clr for supported PTE bits at entry to
__change_page_attr() so we never need to filter them again.
The only other place permissions can come from is an existing PTE
and those already presumably have good bits. We do not need to filter
them again.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kees Cook <keescook@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20180406205511.BC072352@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
(cherry picked from commit 1a54420aeb4da1ba5b28283aa5696898220c9a27) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Dave Hansen [Fri, 6 Apr 2018 20:55:09 +0000 (13:55 -0700)]
x86/mm: Do not auto-massage page protections
A PTE is constructed from a physical address and a pgprotval_t.
__PAGE_KERNEL, for instance, is a pgprot_t and must be converted
into a pgprotval_t before it can be used to create a PTE. This is
done implicitly within functions like pfn_pte() by massage_pgprot().
However, this makes it very challenging to set bits (and keep them
set) if your bit is being filtered out by massage_pgprot().
This moves the bit filtering out of pfn_pte() and friends. For
users of PAGE_KERNEL*, filtering will be done automatically inside
those macros but for users of __PAGE_KERNEL*, they need to do their
own filtering now.
Note that we also just move pfn_pte/pmd/pud() over to check_pgprot()
instead of massage_pgprot(). This way, we still *look* for
unsupported bits and properly warn about them if we find them. This
might happen if an unfiltered __PAGE_KERNEL* value was passed in,
for instance.
- printk format warning fix from: Arnd Bergmann <arnd@arndb.de>
- boot crash fix from: Tom Lendacky <thomas.lendacky@amd.com>
- crash bisected by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reported-and-fixed-by: Arnd Bergmann <arnd@arndb.de> Fixed-by: Tom Lendacky <thomas.lendacky@amd.com> Bisected-by: Mike Galbraith <efault@gmx.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kees Cook <keescook@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20180406205509.77E1D7F6@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
(backported from commit fb43d6cb91ef57d9e58d5f69b423784ff4a4c374)
[tyhicks: Backport around missing commit 91f606a] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Dave Hansen [Fri, 6 Apr 2018 20:55:07 +0000 (13:55 -0700)]
x86/espfix: Document use of _PAGE_GLOBAL
The "normal" kernel page table creation mechanisms using
PAGE_KERNEL_* page protections will never set _PAGE_GLOBAL with PTI.
The few places in the kernel that always want _PAGE_GLOBAL must
avoid using PAGE_KERNEL_*.
Document that we want it here and its use is not accidental.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kees Cook <keescook@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20180406205507.BCF4D4F0@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
(cherry picked from commit 6baf4bec02dbc41645c3a5130ee15a8e1d62b80f) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Dave Hansen [Fri, 6 Apr 2018 20:55:06 +0000 (13:55 -0700)]
x86/mm: Introduce "default" kernel PTE mask
The __PAGE_KERNEL_* page permissions are "raw". They contain bits
that may or may not be supported on the current processor. They need
to be filtered by a mask (currently __supported_pte_mask) to turn them
into a value that we can actually set in a PTE.
These __PAGE_KERNEL_* values all contain _PAGE_GLOBAL. But, with PTI,
we want to be able to support _PAGE_GLOBAL (have the bit set in
__supported_pte_mask) but not have it appear in any of these masks by
default.
This patch creates a new mask, __default_kernel_pte_mask, and applies
it when creating all of the PAGE_KERNEL_* masks. This makes
PAGE_KERNEL_* safe to use anywhere (they only contain supported bits).
It also ensures that PAGE_KERNEL_* contains _PAGE_GLOBAL on PTI=n
kernels but clears _PAGE_GLOBAL when PTI=y.
We also make __default_kernel_pte_mask a non-GPL exported symbol
because there are plenty of driver-available interfaces that take
PAGE_KERNEL_* permissions.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kees Cook <keescook@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20180406205506.030DB6B6@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
(backported from commit 8a57f4849f4fa22ed18a941164a214083fc020a2)
[tyhicks: Backport around missing commit 076ca27] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Dave Hansen [Fri, 6 Apr 2018 20:55:04 +0000 (13:55 -0700)]
x86/mm: Undo double _PAGE_PSE clearing
When clearing _PAGE_PRESENT on a huge page, we need to be careful
to also clear _PAGE_PSE, otherwise it might still get confused
for a valid large page table entry.
We do that near the spot where we *set* _PAGE_PSE. That's fine,
but it's unnecessary. pgprot_large_2_4k() already did it.
BTW, I also noticed that pgprot_large_2_4k() and
pgprot_4k_2_large() are not symmetric. pgprot_large_2_4k() clears
_PAGE_PSE (because it is aliased to _PAGE_PAT) but
pgprot_4k_2_large() does not put _PAGE_PSE back. Bummer.
Also, add some comments and change "promote" to "move". "Promote"
seems an odd word to move when we are logically moving a bit to a
lower bit position. Also add an extra line return to make it clear
to which line the comment applies.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kees Cook <keescook@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20180406205504.9B0F44A9@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
(cherry picked from commit 606c7193d5fbf8ea3dafc8a9468f719fbf1d7160) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Dave Hansen [Fri, 6 Apr 2018 20:55:02 +0000 (13:55 -0700)]
x86/mm: Factor out pageattr _PAGE_GLOBAL setting
The pageattr code has a pattern repeated where it sets _PAGE_GLOBAL
for present PTEs but clears it for non-present PTEs. The intention
is to keep _PAGE_GLOBAL from getting confused with _PAGE_PROTNONE
since _PAGE_GLOBAL is for present PTEs and _PAGE_PROTNONE is for
non-present
But, this pattern makes no sense. Effectively, it says, if you use
the pageattr code, always set _PAGE_GLOBAL when _PAGE_PRESENT.
canon_pgprot() will clear it if unsupported (because it masks the
value with __supported_pte_mask) but we *always* set it. Even if
canon_pgprot() did not filter _PAGE_GLOBAL, it would be OK.
_PAGE_GLOBAL is ignored when CR4.PGE=0 by the hardware.
This unconditional setting of _PAGE_GLOBAL is a problem when we have
PTI and non-PTI and we want some areas to have _PAGE_GLOBAL and some
not.
This updated version of the code says:
1. Clear _PAGE_GLOBAL when !_PAGE_PRESENT
2. Never set _PAGE_GLOBAL implicitly
3. Allow _PAGE_GLOBAL to be in cpa.set_mask
4. Allow _PAGE_GLOBAL to be inherited from previous PTE
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kees Cook <keescook@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20180406205502.86E199DA@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
CVE-2018-3620
CVE-2018-3646
(cherry picked from commit d1440b23c922d845ff039f64694a32ff356e89fa) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
The current logic incorrectly calculates the LLC ID from the APIC ID.
Unless specified otherwise, the LLC ID should be calculated by removing
the Core and Thread ID bits from the least significant end of the APIC
ID. For more info, see "ApicId Enumeration Requirements" in any Fam17h
PPR document.
[ bp: Improve commit message. ]
Fixes: 68091ee7ac3c ("Calculate last level cache ID from number of sharing threads") Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1528915390-30533-1-git-send-email-suravee.suthikulpanit@amd.com
CVE-2018-3620
CVE-2018-3646
(cherry picked from commit 964d978433a4b9aa1368ff71227ca0027dd1e32f) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
David Wang [Thu, 3 May 2018 02:32:44 +0000 (10:32 +0800)]
x86/CPU: Make intel_num_cpu_cores() generic
intel_num_cpu_cores() is a static function in intel.c which can't be used
by other files. Define another function called detect_num_cpu_cores() in
common.c to replace this function so it can be reused.
Signed-off-by: David Wang <davidwang@zhaoxin.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: lukelin@viacpu.com Cc: qiyuanwang@zhaoxin.com Cc: gregkh@linuxfoundation.org Cc: brucechang@via-alliance.com Cc: timguo@zhaoxin.com Cc: cooperyan@zhaoxin.com Cc: hpa@zytor.com Cc: benjaminpan@viatech.com Link: https://lkml.kernel.org/r/1525314766-18910-2-git-send-email-davidwang@zhaoxin.com
CVE-2018-3620
CVE-2018-3646
(cherry picked from commit 2cc61be60e37b1856a97ccbdcca3e86e593bf06a) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
x86/CPU: Modify detect_extended_topology() to return result
Current implementation does not communicate whether it can successfully
detect CPUID function 0xB information. Therefore, modify the function to
return success or error codes. This will be used by subsequent patches.
x86/CPU/AMD: Calculate last level cache ID from number of sharing threads
Last Level Cache ID can be calculated from the number of threads sharing
the cache, which is available from CPUID Fn0x8000001D (Cache Properties).
This is used to left-shift the APIC ID to derive LLC ID.
Therefore, default to this method unless the APIC ID enumeration does not
follow the scheme.
perf/events/amd/uncore: Fix amd_uncore_llc ID to use pre-defined cpu_llc_id
Current logic iterates over CPUID Fn8000001d leafs (Cache Properties)
to detect the last level cache, and derive the last-level cache ID.
However, this information is already available in the cpu_llc_id.
Therefore, make use of it instead.
x86/CPU/AMD: Have smp_num_siblings and cpu_llc_id always be present
Move smp_num_siblings and cpu_llc_id to cpu/common.c so that they're
always present as symbols and not only in the CONFIG_SMP case. Then,
other code using them doesn't need ugly ifdeffery anymore. Get rid of
some ifdeffery.
David Wang [Thu, 3 May 2018 02:32:46 +0000 (10:32 +0800)]
x86/Centaur: Report correct CPU/cache topology
Centaur CPUs enumerate the cache topology in the same way as Intel CPUs,
but the function is unused so for. The Centaur init code also misses to
initialize x86_info::max_cores, so the CPU topology can't be described
correctly.
Initialize x86_info::max_cores and invoke init_cacheinfo() to make
CPU and cache topology information available and correct.
Signed-off-by: David Wang <davidwang@zhaoxin.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: lukelin@viacpu.com Cc: qiyuanwang@zhaoxin.com Cc: gregkh@linuxfoundation.org Cc: brucechang@via-alliance.com Cc: timguo@zhaoxin.com Cc: cooperyan@zhaoxin.com Cc: hpa@zytor.com Cc: benjaminpan@viatech.com Link: https://lkml.kernel.org/r/1525314766-18910-4-git-send-email-davidwang@zhaoxin.com
CVE-2018-3620
CVE-2018-3646
(cherry picked from commit a2aa578fec8c29436bce5e6c15e1e31729d539a3) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
David Wang [Fri, 20 Apr 2018 08:29:28 +0000 (16:29 +0800)]
x86/Centaur: Initialize supported CPU features properly
Centaur CPUs have some Intel compatible capabilities,including Permformance
Monitoring Counters and CPU virtualization capabilities. Initialize them in
the Centaur specific init code.
Signed-off-by: David Wang <davidwang@zhaoxin.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: lukelin@viacpu.com Cc: qiyuanwang@zhaoxin.com Cc: gregkh@linuxfoundation.org Cc: brucechang@via-alliance.com Cc: timguo@zhaoxin.com Cc: cooperyan@zhaoxin.com Cc: hpa@zytor.com Cc: benjaminpan@viatech.com Link: https://lkml.kernel.org/r/1524212968-28998-1-git-send-email-davidwang@zhaoxin.com
CVE-2018-3620
CVE-2018-3646
(cherry picked from commit 60882cc159e1416fb1d17210de60d4a3ba04e613) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Eric Dumazet [Mon, 23 Jul 2018 16:28:21 +0000 (09:28 -0700)]
tcp: add tcp_ooo_try_coalesce() helper
In case skb in out_or_order_queue is the result of
multiple skbs coalescing, we would like to get a proper gso_segs
counter tracking, so that future tcp_drop() can report an accurate
number.
I chose to not implement this tracking for skbs in receive queue,
since they are not dropped, unless socket is disconnected.
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
CVE-2018-5390
(cherry picked from commit 58152ecbbcc6a0ce7fddd5bf5f6ee535834ece0c) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Eric Dumazet [Mon, 23 Jul 2018 16:28:20 +0000 (09:28 -0700)]
tcp: call tcp_drop() from tcp_data_queue_ofo()
In order to be able to give better diagnostics and detect
malicious traffic, we need to have better sk->sk_drops tracking.
Fixes: 9f5afeae5152 ("tcp: use an RB tree for ooo receive queue") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
CVE-2018-5390
(cherry picked from commit 8541b21e781a22dce52a74fef0b9bed00404a1cd) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Eric Dumazet [Mon, 23 Jul 2018 16:28:19 +0000 (09:28 -0700)]
tcp: detect malicious patterns in tcp_collapse_ofo_queue()
In case an attacker feeds tiny packets completely out of order,
tcp_collapse_ofo_queue() might scan the whole rb-tree, performing
expensive copies, but not changing socket memory usage at all.
1) Do not attempt to collapse tiny skbs.
2) Add logic to exit early when too many tiny skbs are detected.
We prefer not doing aggressive collapsing (which copies packets)
for pathological flows, and revert to tcp_prune_ofo_queue() which
will be less expensive.
In the future, we might add the possibility of terminating flows
that are proven to be malicious.
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
CVE-2018-5390
(cherry picked from commit 3d4bf93ac12003f9b8e1e2de37fe27983deebdcf) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Eric Dumazet [Mon, 23 Jul 2018 16:28:18 +0000 (09:28 -0700)]
tcp: avoid collapses in tcp_prune_queue() if possible
Right after a TCP flow is created, receiving tiny out of order
packets allways hit the condition :
if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
tcp_clamp_window(sk);
tcp_clamp_window() increases sk_rcvbuf to match sk_rmem_alloc
(guarded by tcp_rmem[2])
Calling tcp_collapse_ofo_queue() in this case is not useful,
and offers a O(N^2) surface attack to malicious peers.
Better not attempt anything before full queue capacity is reached,
forcing attacker to spend lots of resource and allow us to more
easily detect the abuse.
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
CVE-2018-5390
(cherry picked from commit f4a3313d8e2ca9fd8d8f45e40a2903ba782607e7) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Eric Dumazet [Mon, 23 Jul 2018 16:28:17 +0000 (09:28 -0700)]
tcp: free batches of packets in tcp_prune_ofo_queue()
Juha-Matti Tilli reported that malicious peers could inject tiny
packets in out_of_order_queue, forcing very expensive calls
to tcp_collapse_ofo_queue() and tcp_prune_ofo_queue() for
every incoming packet. out_of_order_queue rb-tree can contain
thousands of nodes, iterating over all of them is not nice.
Before linux-4.9, we would have pruned all packets in ofo_queue
in one go, every XXXX packets. XXXX depends on sk_rcvbuf and skbs
truesize, but is about 7000 packets with tcp_rmem[2] default of 6 MB.
Since we plan to increase tcp_rmem[2] in the future to cope with
modern BDP, can not revert to the old behavior, without great pain.
Strategy taken in this patch is to purge ~12.5 % of the queue capacity.
Fixes: 36a6503fedda ("tcp: refine tcp_prune_ofo_queue() to not drop all packets") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Juha-Matti Tilli <juha-matti.tilli@iki.fi> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
CVE-2018-5390
(cherry picked from commit 72cd43ba64fc172a443410ce01645895850844c8) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Wei Yongjun [Tue, 5 Jun 2018 09:16:21 +0000 (09:16 +0000)]
ocxl: Fix missing unlock on error in afu_ioctl_enable_p9_wait()
BugLink: http://bugs.launchpad.net/bugs/1775786
Add the missing unlock before return from function
afu_ioctl_enable_p9_wait() in the error handling case.
Fixes: e948e06fc63a ("ocxl: Expose the thread_id needed for wait on POWER9") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Reviewed-by: Alastair D'Silva <alastair@d-silva.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 2e5c93d6bb2f7bc17eb82748943a1b9f6b068520) Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Alastair D'Silva [Fri, 11 May 2018 06:13:02 +0000 (16:13 +1000)]
ocxl: Add an IOCTL so userspace knows what OCXL features are available
BugLink: http://bugs.launchpad.net/bugs/1775786
In order for a userspace AFU driver to call the POWER9 specific
OCXL_IOCTL_ENABLE_P9_WAIT, it needs to verify that it can actually
make that call.
Signed-off-by: Alastair D'Silva <alastair@d-silva.org> Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 02a8e5bc1c06045f36423bd9632ad9f40da18d3f) Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Alastair D'Silva [Fri, 11 May 2018 06:13:01 +0000 (16:13 +1000)]
ocxl: Expose the thread_id needed for wait on POWER9
BugLink: http://bugs.launchpad.net/bugs/1775786
In order to successfully issue as_notify, an AFU needs to know the TID
to notify, which in turn means that this information should be
available in userspace so it can be communicated to the AFU.
Signed-off-by: Alastair D'Silva <alastair@d-silva.org> Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit e948e06fc63a1c1e36ec4c8e5c510b881ff19c26) Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Alastair D'Silva [Fri, 11 May 2018 06:12:59 +0000 (16:12 +1000)]
powerpc: use task_pid_nr() for TID allocation
BugLink: http://bugs.launchpad.net/bugs/1775786
The current implementation of TID allocation, using a global IDR, may
result in an errant process starving the system of available TIDs.
Instead, use task_pid_nr(), as mentioned by the original author. The
scenario described which prevented it's use is not applicable, as
set_thread_tidr can only be called after the task struct has been
populated.
In the unlikely event that 2 threads share the TID and are waiting,
all potential outcomes have been determined safe.
Signed-off-by: Alastair D'Silva <alastair@d-silva.org> Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 71cc64a85d8d99936f6851709a07f18c87a0adab) Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Alastair D'Silva [Fri, 11 May 2018 06:12:57 +0000 (16:12 +1000)]
powerpc: Add TIDR CPU feature for POWER9
BugLink: http://bugs.launchpad.net/bugs/1775786
This patch adds a CPU feature bit to show whether the CPU has
the TIDR register available, enabling as_notify/wait in userspace.
Signed-off-by: Alastair D'Silva <alastair@d-silva.org> Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(backported from commit 819844285ef2b5d15466f5b5062514135ffba06c) Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Nicholas Piggin [Fri, 22 Jun 2018 16:08:00 +0000 (18:08 +0200)]
powerpc: Fix smp_send_stop NMI IPI handling
BugLink: http://bugs.launchpad.net/bugs/1777194
The NMI IPI handler for a receiving CPU increments nmi_ipi_busy_count
over the handler function call, which causes later smp_send_nmi_ipi()
callers to spin until the call is finished.
The stop_this_cpu() function never returns, so the busy count is never
decremeted, which can cause the system to hang in some cases. For
example panic() will call smp_send_stop() early on which calls
stop_this_cpu() on other CPUs, then later in the reboot path,
pnv_restart() will call smp_send_stop() again, which hangs.
Fix this by adding a special case to the stop_this_cpu() handler to
decrement the busy count, because it will never return.
Now that the NMI/non-NMI versions of stop_this_cpu() are different,
split them out into separate functions rather than doing #ifdef tricks
to share the body between the two functions.
Fixes: 6bed3237624e3 ("powerpc: use NMI IPI for smp_send_stop") Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Split out the functions, tweak change log a bit] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit ac61c1156623455c46701654abd8c99720bceea1) Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Nicholas Piggin [Fri, 22 Jun 2018 16:08:00 +0000 (18:08 +0200)]
powerpc: use NMI IPI for smp_send_stop
BugLink: http://bugs.launchpad.net/bugs/1777194
Use the NMI IPI rather than smp_call_function for smp_send_stop.
Have stopped CPUs hard disable interrupts rather than just soft
disable.
This function is used in crash/panic/shutdown paths to bring other
CPUs down as quickly and reliably as possible, and minimizing their
potential to cause trouble.
Avoiding the Linux smp_call_function infrastructure and (if supported)
using true NMI IPIs makes this more robust.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 6bed3237624e3faad1592543952907cd01a42c83) Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Nicholas Piggin [Wed, 13 Jun 2018 15:10:08 +0000 (11:10 -0400)]
rtc: opal: Fix OPAL RTC driver OPAL_BUSY loops
BugLink: http://bugs.launchpad.net/bugs/1773964
The OPAL RTC driver does not sleep in case it gets OPAL_BUSY or
OPAL_BUSY_EVENT from firmware, which causes large scheduling
latencies, up to 50 seconds have been observed here when RTC stops
responding (BMC reboot can do it).
Fix this by converting it to the standard form OPAL_BUSY loop that
sleeps.
Fixes: 628daa8d5abf ("powerpc/powernv: Add RTC and NVRAM support plus RTAS fallbacks") Cc: stable@vger.kernel.org # v3.2+ Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 682e6b4da5cbe8e9a53f979a58c2a9d7dc997175) Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com> Acked-by: Khalid Elmously <khalid.elmously@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
UBUNTU: SAUCE: ext4: check for allocation block validity with block group locked
BugLink: https://bugs.launchpad.net/bugs/1781709
With commit 044e6e3d74a3: "ext4: don't update checksum of new
initialized bitmaps" the buffer valid bit will get set without
actually setting up the checksum for the allocation bitmap, since the
checksum will get calculated once we actually allocate an inode or
block.
If we are doing this, then we need to (re-)check the verified bit
after we take the block group lock. Otherwise, we could race with
another process reading and verifying the bitmap, which would then
complain about the checksum being invalid.
Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: dann frazier <dann.frazier@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Colin King <colin.king@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
This fix was incomplete. Will replace with the complete fix in the next commit.
Signed-off-by: dann frazier <dann.frazier@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Colin King <colin.king@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Ross Lagerwall [Thu, 21 Jun 2018 13:00:21 +0000 (14:00 +0100)]
xen-netfront: Update features after registering netdev
Update the features after calling register_netdev() otherwise the
device features are not set up correctly and it not possible to change
the MTU of the device. After this change, the features reported by
ethtool match the device's features before the commit which introduced
the issue and it is possible to change the device's MTU.
Fixes: f599c64fdf7d ("xen-netfront: Fix race between device setup and open") Reported-by: Liam Shepherd <liam@dancer.es> Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net> BugLink: https://bugs.launchpad.net/bugs/1781413
(cherry picked from commit 45c8184c1bed1ca8a7f02918552063a00b909bf5) Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Khaled Elmously <khalid.elmously@canonical.com> Acked-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Kamal Mostafa <kamal@canonical.com>
Ross Lagerwall [Thu, 21 Jun 2018 13:00:20 +0000 (14:00 +0100)]
xen-netfront: Fix mismatched rtnl_unlock
Fixes: f599c64fdf7d ("xen-netfront: Fix race between device setup and open") Reported-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net> BugLink: https://bugs.launchpad.net/bugs/1781413
(cherry picked from commit cb257783c2927b73614b20f915a91ff78aa6f3e8) Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Khaled Elmously <khalid.elmously@canonical.com> Acked-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Kamal Mostafa <kamal@canonical.com>
Regression observed on some arm64 server platforms:
EXT4-fs error (device sda1): ext4_validate_inode_bitmap:99: comm stress-ng: Corrupt inode bitmap
Reference: https://lkml.org/lkml/2018/7/7/2
Reported-by: dann frazier <dann.frazier@canonical.com> Tested-by: dann frazier <dann.frazier@canonical.com> Fixes: 044e6e3d74a3 ext4: don't update checksum of new initialized bitmaps Signed-off-by: Kamal Mostafa <kamal@canonical.com> Acked-by: Khalid Elmously <khalid.elmously@canonical.com> Acked-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Ben Hutchings [Wed, 13 Jun 2018 03:31:00 +0000 (04:31 +0100)]
random: Make getrandom() ready earlier
This effectively reverts commit 725e828 "random: fix crng_ready()
test" which was commit 43838a23a05f upstream. Unfortunately some
users of getrandom() don't expect it to block for long, and they need
to be fixed before we can allow this change into stable.
This doesn't directly revert that commit, but only weakens the ready
condition used by getrandom() when the GRND_RANDOM flag is not set.
Calls to getrandom() that return before the RNG is fully seeded will
generate warnings, just like reads from /dev/urandom.
BugLink: https://bugs.launchpad.net/bugs/1780062 BugLink: https://bugs.launchpad.net/bugs/1779827
(backported from ://salsa.debian.org/kernel-team/linux/raw/stretch/debian/patches/debian/random-make-getrandom-ready-earlier.patch)
[smb: open code waiting in getrandom directly] Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Colin King <colin.king@canonical.com> Acked-by: Khaled Elmously <khalid.elmously@canonical.com> Acked-by: Marcelo Cerri <marcelo.cerri@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Xiang Chen [Thu, 31 May 2018 12:50:48 +0000 (20:50 +0800)]
scsi: hisi_sas: Pre-allocate slot DMA buffers
BugLink: https://bugs.launchpad.net/bugs/1777727
Currently the driver spends much time allocating and freeing the slot DMA
buffer for command delivery/completion. To boost the performance,
pre-allocate the buffers for all IPTT. The downside of this approach is
that we are reallocating all buffer memory upfront, so hog memory which we
may not need.
However, the current method - DMA buffer pool - also caches all buffers and
does not free them until the pool is destroyed, so is not exactly efficient
either.
On top of this, since the slot DMA buffer is slightly bigger than a 4K
page, we need to allocate 2x4K pages per buffer (for 4K page kernel), which
is quite wasteful. For 64K page size this is not such an issue.
So, for the 4K page case, in order to make memory usage more efficient,
pre-allocating larger blocks of DMA memory for the buffers can be more
efficient.
To make DMA memory usage most efficient, we would choose a single
contiguous DMA memory block, but this could use up all the DMA memory in
the system (when CMA enabled and no IOMMU), or we may just not be able to
allocate a DMA buffer large enough when no CMA or IOMMU.
To decide the block size we use the LCM (least common multiple) of the
buffer size and the page size. We roundup(64) to ensure the LCM is not too
large, even though a little memory may be wasted per block.
So, with this, the total memory requirement is about is about 17MB for 4096
max IPTT.
Previously (for 4K pages case), it would be 32MB (for all slots
allocated).
With this change, the relative increase of IOPS for bs=4K read when
PAGE_SIZE=4K and PAGE_SIZE=64K is as follows:
IODEPTH 4K PAGE_SIZE 64K PAGE_SIZE
32 56% 47%
64 53% 44%
128 64% 43%
256 67% 45%
Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit bbb14d21a0614deefbdf24da61606f7bd7a1263a scsi) Signed-off-by: dann frazier <dann.frazier@canonical.com> Acked-by: Khalid Elmously <khalid.elmously@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Xiaofei Tan [Thu, 31 May 2018 12:50:47 +0000 (20:50 +0800)]
scsi: hisi_sas: Release all remaining resources in clear nexus ha
BugLink: https://bugs.launchpad.net/bugs/1777696
In host reset, we use TMF or soft-reset to re-init device, and if success,
we will release all LLDD resources of this device. If the init fails -
maybe because the device was removed or link has not come up - then do not
release the LLDD resources, but rather rely on SCSI EH to handle the
timeout for these resources later on.
But if clear nexus ha calls host reset, which is the last effort of SCSI
EH, we should release all LLDD remain resources. Because SCSI EH will
release all tasks after clear nexus ha.
Before release, we do I_T nexus reset to try to clear target remain IOs.
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit d213102e6d75df919444c374e3a2dee13e98644e scsi) Signed-off-by: dann frazier <dann.frazier@canonical.com> Acked-by: Khalid Elmously <khalid.elmously@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Xiaofei Tan [Thu, 31 May 2018 12:50:46 +0000 (20:50 +0800)]
scsi: hisi_sas: Add a flag to filter PHY events during reset
BugLink: https://bugs.launchpad.net/bugs/1777696
During reset, we don't want PHY events reported to libsas for PHYs which
were previously attached prior to reset.
So check hisi_hba->flags for HISI_SAS_RESET_BIT to filter PHY events during
reset.
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 3c878dc1e2d10e2403913b04b94513b626c252a8 scsi) Signed-off-by: dann frazier <dann.frazier@canonical.com> Acked-by: Khalid Elmously <khalid.elmously@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Xiaofei Tan [Thu, 31 May 2018 12:50:45 +0000 (20:50 +0800)]
scsi: hisi_sas: Adjust task reject period during host reset
BugLink: https://bugs.launchpad.net/bugs/1777696
After soft_reset() for host reset, we should not be allowed to send
commands to the HW before the PHYs have come up and the port ids have been
refreshed.
Prior to this point, any commands cannot be successfully completed.
This exclusion is achieved by grabbing the host reset semaphore.
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit b36d9c963f7eee2b949b900b477f3d7f6f0d7b76 scsi) Signed-off-by: dann frazier <dann.frazier@canonical.com> Acked-by: Khalid Elmously <khalid.elmously@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
The reason is that then the device is notified as gone, we try to clear the
ITCT, which is notified via an interrupt. The dev gone function pends on
this event with a completion, which is completed when the ITCT interrupt
occurs.
But host reset will disable all interrupts, the wait_for_completion() may
wait indefinitely.
This patch adds an semaphore to synchronise this two processes. The
semaphore is taken by the host reset as the basis of synchronising.
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit e5baf18c5e3146d4156f8a81e5da0c66bac8d037 scsi) Signed-off-by: dann frazier <dann.frazier@canonical.com> Acked-by: Khalid Elmously <khalid.elmously@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Xiaofei Tan [Thu, 31 May 2018 12:50:43 +0000 (20:50 +0800)]
scsi: hisi_sas: Only process broadcast change in phy_bcast_v3_hw()
BugLink: https://bugs.launchpad.net/bugs/1777696
There are many BROADCAST primitives generated by the host. We are only
interested in BROADCAST (CHANGE) primitives currently, so only process
this.
We have applied this processing for v2 hw before, and it is also needed for
v3 hw.
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit b65a8ae24b600af7929df79acd16bd96fae2d935 scsi) Signed-off-by: dann frazier <dann.frazier@canonical.com>
[ dannf: included as a dependency for
"scsi: hisi_sas: Add a flag to filter PHY events during reset" ] Acked-by: Khalid Elmously <khalid.elmously@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Xiang Chen [Thu, 31 May 2018 12:50:42 +0000 (20:50 +0800)]
scsi: hisi_sas: Use dmam_alloc_coherent()
BugLink: https://bugs.launchpad.net/bugs/1777727
This patch replaces the usage of dma_alloc_coherent() with the managed
version, dmam_alloc_coherent(), hereby reducing replicated code.
Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com>
Signed-off-by; John Garry <john.garry@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit d5738b82293675e35124b903131d7a42b27995e0 scsi) Signed-off-by: dann frazier <dann.frazier@canonical.com>
[ dannf: included to simplify the backporting of:
"scsi: hisi_sas: Pre-allocate slot DMA buffers" ] Acked-by: Khalid Elmously <khalid.elmously@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Xiang Chen [Fri, 23 Mar 2018 16:05:11 +0000 (00:05 +0800)]
scsi: hisi_sas: use dma_zalloc_coherent()
BugLink: https://bugs.launchpad.net/bugs/1777727
This is a warning coming from Coccinelle, and need to use new interface
dma_zalloc_coherent() instead of dma_alloc_coherent()/memset().
Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 4f4e21b8ff3e706f79e1adb2a475c3f5ee6b57f9) Signed-off-by: dann frazier <dann.frazier@canonical.com>
[ dannf: included to simplify the backporting of:
"scsi: hisi_sas: Pre-allocate slot DMA buffers" ] Acked-by: Khalid Elmously <khalid.elmously@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Xiang Chen [Wed, 13 Jun 2018 20:00:55 +0000 (14:00 -0600)]
scsi: hisi_sas: make SAS address of SATA disks unique
BugLink: https://bugs.launchpad.net/bugs/1776750
When directly connected with SATA disks in different SAS cores, fill SAS
address with scsi_host's id to make it's fake SAS address unique.
Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 8b8d66531555006a18d1532546dadbea8d16df95) Signed-off-by: dann frazier <dann.frazier@canonical.com> Acked-by: Khalid Elmously <khalid.elmously@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com> Acked-by: Colin Ian King <colin.king@canonical.com> Acked-by: Kleber Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Andy Whitcroft [Fri, 22 Jun 2018 19:28:53 +0000 (20:28 +0100)]
UBUNTU: SAUCE: kvm -- increase KVM_MAX_IRQ_ROUTES to 2048 on x86
Large KVM instances with a lot of devices may hit the kernel
KVM_MAX_IRQ_ROUTES limit. Up the kernel limit to allow
these configurations to boot. This has no effect on smaller
configurations.
BugLink: http://bugs.launchpad.net/bugs/1778261 Signed-off-by: Andy Whitcroft <apw@canonical.com> Acked-by: Khalid Elmously <khalid.elmously@canonical.com> Acked-by: Po-Hsu Lin <po-hsu.lin@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Mike writes:
It seems that commit f5a26acf0162 ("pinctrl: intel: Initialize GPIO
properly when used through irqchip") can cause problems on some Skylake
systems with Sunrisepoint PCH-H. Namely on certain systems it may turn
the backlight PWM pin from native mode to GPIO which makes the screen
blank during boot.
The actual reason is that GPIO numbering used in BIOS is using "Windows"
numbers meaning that they don't match the hardware 1:1 and because of
this a wrong pin (backlight PWM) is picked and switched to GPIO mode.
There is a proper fix for this but since it has quite many dependencies
on commits that cannot be considered stable material, I suggest we
revert commit f5a26acf0162 from stable trees 4.9, 4.14 and 4.15 to
prevent the backlight issue.
Reported-by: Mika Westerberg <mika.westerberg@linux.intel.com> Fixes: f5a26acf0162 ("pinctrl: intel: Initialize GPIO properly when used through irqchip") Cc: Daniel Drake <drake@endlessm.com> Cc: Chris Chiu <chiu@endlessm.com> Cc: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Johannes Wienke [Fri, 8 Jun 2018 19:55:16 +0000 (15:55 -0400)]
UBUNTU: SAUCE: Add Lenovo V330 to the ideapad_laptop rfkill blacklist
BugLink: http://bugs.launchpad.net/bugs/1774636
This laptop doesn't have a working hardware kill switch and thus the
wifi and bluetooth might become hard blocked without an option to revert
this state at runtime.
Signed-off-by: Johannes Wienke <languitar@semipol.de> Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com> Acked-by: Kleber Souza <kleber.souza@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Hans de Goede [Fri, 8 Jun 2018 17:56:24 +0000 (13:56 -0400)]
Bluetooth: btusb: Add Dell XPS 13 9360 to btusb_needs_reset_resume_table
BugLink: http://bugs.launchpad.net/bugs/1775217
The Dell XPS 13 9360 uses a QCA Rome chip which needs to be reset
(and have its firmware reloaded) for bluetooth to work after
suspend/resume.
BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1514836 Cc: stable@vger.kernel.org Cc: Garrett LeSage <glesage@redhat.com> Reported-and-tested-by: Garrett LeSage <glesage@redhat.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
(cherry picked from commit 596b07a9a22656493726edf1739569102bd3e136) Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com> Acked-by: Khalid Elmously <khalid.elmously@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Dexuan Cui [Fri, 8 Jun 2018 18:39:10 +0000 (15:39 -0300)]
PCI: hv: Remove the bogus test in hv_eject_device_work()
BugLink: http://bugs.launchpad.net/bugs/1758378
When kernel is executing hv_eject_device_work(), hpdev->state value must
be hv_pcichild_ejecting; any other value would consist in a bug,
therefore replace the bogus check with an explicit WARN_ON() on the
condition failure detection.
Signed-off-by: Dexuan Cui <decui@microsoft.com>
[lorenzo.pieralisi@arm.com: updated commit log] Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Acked-by: Haiyang Zhang <haiyangz@microsoft.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Jack Morgenstein <jackm@mellanox.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: K. Y. Srinivasan <kys@microsoft.com>
(cherry picked from commit fca288c0153b2b97114b9081bc3c33c3735145b6) Signed-off-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Dexuan Cui [Fri, 8 Jun 2018 18:39:11 +0000 (15:39 -0300)]
PCI: hv: Only queue new work items in hv_pci_devices_present() if necessary
BugLink: http://bugs.launchpad.net/bugs/1758378
If there is pending work in hv_pci_devices_present() we just need to add
the new dr entry into the dr_list. Add a check to detect pending work
items and update the code to skip queuing work if pending work items
are detected.
Signed-off-by: Dexuan Cui <decui@microsoft.com>
[lorenzo.pieralisi@arm.com: updated commit log] Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Acked-by: Haiyang Zhang <haiyangz@microsoft.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Jack Morgenstein <jackm@mellanox.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: K. Y. Srinivasan <kys@microsoft.com>
(cherry picked from commit 948373b3ed1bcf05a237c24675b84804315aff14) Signed-off-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/1775856
WHen registering a new binfmt_misc handler, it is possible to overflow
the offset to get a negative value, which might crash the system, or
possibly leak kernel data.
Here is a crash log when 2500000000 was used as an offset:
Use kstrtoint instead of simple_strtoul. It will work as the code
already set the delimiter byte to '\0' and we only do it when the field
is not empty.
Tested with offsets -1, 2500000000, UINT_MAX and INT_MAX. Also tested
with examples documented at Documentation/admin-guide/binfmt-misc.rst
and other registrations from packages on Ubuntu.
Link: http://lkml.kernel.org/r/20180529135648.14254-1-cascardo@canonical.com Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 5cc41e099504b77014358b58567c5ea6293dd220) Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Acked-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Jann Horn [Fri, 8 Jun 2018 14:29:07 +0000 (15:29 +0100)]
compat: fix 4-byte infoleak via uninitialized struct field
Commit 3a4d44b61625 ("ntp: Move adjtimex related compat syscalls to
native counterparts") removed the memset() in compat_get_timex(). Since
then, the compat adjtimex syscall can invoke do_adjtimex() with an
uninitialized ->tai.
If do_adjtimex() doesn't write to ->tai (e.g. because the arguments are
invalid), compat_put_timex() then copies the uninitialized ->tai field
to userspace.
Fix it by adding the memset() back.
Fixes: 3a4d44b61625 ("ntp: Move adjtimex related compat syscalls to native counterparts") Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 0a0b98734479aa5b3c671d5190e86273372cab95)
CVE-2018-11508 Signed-off-by: Andy Whitcroft <apw@canonical.com> Acked-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com> Acked-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Jassi Brar [Fri, 8 Jun 2018 16:52:18 +0000 (10:52 -0600)]
net: netsec: enable tx-irq during open callback
BugLink: https://bugs.launchpad.net/bugs/1775884
Enable TX-irq as well during ndo_open() as we can not count upon
RX to arrive early enough to trigger the napi. This patch is critical
for installation over network.
Fixes: 533dd11a12f6 ("net: socionext: Add Synquacer NetSec driver") Signed-off-by: Jassi Brar <jaswinder.singh@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit c009f413b79de526a355b6eefa4f900b6c45d5f4) Signed-off-by: dann frazier <dann.frazier@canonical.com> Acked-by: Kleber Souza <kleber.souza@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Masahisa KOJIMA [Fri, 8 Jun 2018 16:52:19 +0000 (10:52 -0600)]
net: socionext: reset hardware in ndo_stop
BugLink: https://bugs.launchpad.net/bugs/1775884
When the interface is down, head/tail of the descriptor
ring address is set to 0 in netsec_netdev_stop().
But netsec hardware still keeps the previous descriptor
ring address, so there is inconsistency between driver
and hardware after interface is up at a later time.
To address this inconsistency, add netsec_reset_hardware()
when the interface is down.
In addition, to minimize the reset process,
add flag to decide whether driver loads the netsec microcode.
Even if driver resets the netsec hardware, netsec microcode
keeps resident on RAM, so it is ok we only load the microcode
at initialization.
This patch is critical for installation over network.
Signed-off-by: Masahisa KOJIMA <masahisa.kojima@linaro.org> Fixes: 533dd11a12f6 ("net: socionext: Add Synquacer NetSec driver") Signed-off-by: Jassi Brar <jaswinder.singh@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 9a00b697ce31e38c670a3042cf9f1e9cf28dabb5) Signed-off-by: dann frazier <dann.frazier@canonical.com> Acked-by: Kleber Souza <kleber.souza@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Ard Biesheuvel [Fri, 8 Jun 2018 16:52:20 +0000 (10:52 -0600)]
net: netsec: reduce DMA mask to 40 bits
BugLink: https://bugs.launchpad.net/bugs/1775884
The netsec network controller IP can drive 64 address bits for DMA, and
the DMA mask is set accordingly in the driver. However, the SynQuacer
SoC, which is the only silicon incorporating this IP at the moment,
integrates this IP in a manner that leaves address bits [63:40]
unconnected.
Up until now, this has not resulted in any problems, given that the DDR
controller doesn't decode those bits to begin with. However, recent
firmware updates for platforms incorporating this SoC allow the IOMMU
to be enabled, which does decode address bits [47:40], and allocates
top down from the IOVA space, producing DMA addresses that have bits
set that have been left unconnected.
Both the DT and ACPI (IORT) descriptions of the platform take this into
account, and only describe a DMA address space of 40 bits (using either
dma-ranges DT properties, or DMA address limits in IORT named component
nodes). However, even though our IOMMU and bus layers may take such
limitations into account by setting a narrower DMA mask when creating
the platform device, the netsec probe() entrypoint follows the common
practice of setting the DMA mask uncondionally, according to the
capabilities of the IP block itself rather than to its integration into
the chip.
It is currently unclear what the correct fix is here. We could hack around
it by only setting the DMA mask if it deviates from its default value of
DMA_BIT_MASK(32). However, this makes it impossible for the bus layer to
use DMA_BIT_MASK(32) as the bus limit, and so it appears that a more
comprehensive approach is required to take DMA limits imposed by the
SoC as a whole into account.
In the mean time, let's limit the DMA mask to 40 bits. Given that there
is currently only one SoC that incorporates this IP, this is a reasonable
approach that can be backported to -stable and buys us some time to come
up with a proper fix going forward.
Fixes: 533dd11a12f6 ("net: socionext: Add Synquacer NetSec driver") Cc: Robin Murphy <robin.murphy@arm.com> Cc: Jassi Brar <jaswinder.singh@linaro.org> Cc: Masahisa Kojima <masahisa.kojima@linaro.org> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Acked-by: Jassi Brar <jaswinder.singh@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 312564269535892cc082bc80592150cd1f5e8ec3) Signed-off-by: dann frazier <dann.frazier@canonical.com> Acked-by: Kleber Souza <kleber.souza@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Heiner Kallweit [Thu, 24 May 2018 04:35:19 +0000 (12:35 +0800)]
r8169: improve interrupt handling
BugLink: https://bugs.launchpad.net/bugs/1752772
This patch improves few aspects of interrupt handling:
- update to current interrupt allocation API
(use pci_alloc_irq_vectors() instead of deprecated pci_enable_msi())
- this implicitly will allocate a MSI-X interrupt if available
- get rid of flag RTL_FEATURE_MSI
- remove some dead code, intentionally disabling (unreliable) MSI
being partially available on old PCI chips.
The patch works fine on a RTL8168evl (chip version 34) and on a
RTL8169SB (chip version 04).
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 6c6aa15fdea52aa4a19cc0b5478884af591c9351) Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>