There is a small chance that the compiler could generate separate loads
for the dist->propbaser which could be modified from another CPU. As we
want to make sure we atomically update the entire value, and don't race
with other updates, guarantee that the cmpxchg operation compares
against the original value.
Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Christoffer Dall <cdall@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
This can be reproduced by booting L1 guest w/ 'noapic' grub parameter, which
means that tells the kernel to not make use of any IOAPICs that may be present
in the system.
Actually external_intr variable in nested_vmx_vmexit() is the req_int_win
variable passed from vcpu_enter_guest() which means that the L0's userspace
requests an irq window. I observed the scenario (!kvm_cpu_has_interrupt(vcpu) &&
L0's userspace reqeusts an irq window) is true, so there is no interrupt which
L1 requires to inject to L2, we should not attempt to emualte "Acknowledge
interrupt on exit" for the irq window requirement in this scenario.
This patch fixes it by not attempt to emulate "Acknowledge interrupt on exit"
if there is no L1 requirement to inject an interrupt to L2.
Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
[Added code comment to make it obvious that the behavior is not correct.
We should do a userspace exit with open interrupt window instead of the
nested VM exit. This patch still improves the behavior, so it was
accepted as a (temporary) workaround.] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Shawn Lin [Tue, 25 Jul 2017 01:11:28 +0000 (09:11 +0800)]
mmc: block: bypass the queue even if usage is present for hotplug
The commit 304419d8a7e9 ("mmc: core: Allocate per-request data using the
block layer core") refactored mechanism of queue handling caused
mmc_init_request() can be called just after mmc_cleanup_queue() caused null
pointer dereference.
Another commit bbdc74dc19e0 ("mmc: block: Prevent new req entering queue
after its cleanup") tried to fix the problem. However it actually miss one
corner case.
We could still reproduce the issue mentioned with these steps:
(1) insert a SD card and mount it
(2) hotplug it, so it will leave md->usage still be counted
(3) reboot the system which will sync data and umount the card
So mmc_blk_put wouldn't calling blk_cleanup_queue which actually the
QUEUE_FLAG_DYING and QUEUE_FLAG_BYPASS should stay. Block core expect
blk_queue_bypass_{start, end} internally to bypass/drain the queue before
actually dying the queue, so it didn't expose API to set the queue bypass.
I think we should set QUEUE_FLAG_BYPASS whenever queue is removed, although
the md->usage is still counted, as no dispatch queue could be found then.
Fixes: 304419d8a7e9 ("mmc: core: Allocate per-request data using the block layer core") Signed-off-by: Shawn Lin <shawn.lin@rock-chips.com> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
mmc: sdhci-of-at91: force card detect value for non removable devices
When the device is non removable, the card detect signal is often used
for another purpose i.e. muxed to another SoC peripheral or used as a
GPIO. It could lead to wrong behaviors depending the default value of
this signal if not muxed to the SDHCI controller.
Fixes: bb5f8ea4d514 ("mmc: sdhci-of-at91: introduce driver for the Atmel SDMMC") Signed-off-by: Ludovic Desroches <ludovic.desroches@microchip.com> Acked-by: Adrian Hunter <adrian.hunter@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Linus Torvalds [Thu, 3 Aug 2017 03:56:44 +0000 (20:56 -0700)]
Merge tag 'nfs-for-4.13-4' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client fixes from Anna Schumaker:
"Two fixes from Trond this time, now that he's back from his vacation.
The first is a stable fix for the EXCHANGE_ID issue on the mailing
list, and the other fixes a double-free situation that he found at the
same time.
Tero Kristo [Wed, 2 Aug 2017 18:32:13 +0000 (21:32 +0300)]
clk: keystone: sci-clk: Fix sci_clk_get
Currently a bug in the sci_clk_get implementation causes it to always
return a clock belonging to the last device in the static list of clock
data. This is due to a bug in the init code that causes the array
used by sci_clk_get to only be populated with the clocks for the last
device, as each device overwrites the entire array with its own clocks.
Fix this by calculating the actual number of clocks for the SoC, and
allocating the whole array in one go. Also, we don't need the handle
to the init data array anymore after doing this, instead we can
just compare the dev_id / clk_id against the registered clocks and
use binary search for speed.
Signed-off-by: Tero Kristo <t-kristo@ti.com> Reported-by: Dave Gerlach <d-gerlach@ti.com> Fixes: b745c0794e2f ("clk: keystone: Add sci-clk driver support") Cc: Nishanth Menon <nm@ti.com> Tested-by: Franklin Cooper <fcooper@ti.com> Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Jan Kara [Wed, 2 Aug 2017 20:32:30 +0000 (13:32 -0700)]
ocfs2: don't clear SGID when inheriting ACLs
When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
set, DIR1 is expected to have SGID bit set (and owning group equal to
the owning group of 'DIR0'). However when 'DIR0' also has some default
ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on
'DIR1' to get cleared if user is not member of the owning group.
Fix the problem by moving posix_acl_update_mode() out of ocfs2_set_acl()
into ocfs2_iop_set_acl(). That way the function will not be called when
inheriting ACLs which is what we want as it prevents SGID bit clearing
and the mode has been properly set by posix_acl_create() anyway. Also
posix_acl_chmod() that is calling ocfs2_set_acl() takes care of updating
mode itself.
Fixes: 073931017b4 ("posix_acl: Clear SGID bit when setting file permissions") Link: http://lkml.kernel.org/r/20170801141252.19675-3-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kan Liang [Wed, 2 Aug 2017 20:32:27 +0000 (13:32 -0700)]
mm: allow page_cache_get_speculative in interrupt context
Kernel panic when calling the IRQ-safe __get_user_pages_fast in NMI
handler.
The bug was introduced by commit 2947ba054a4d ("x86/mm/gup: Switch GUP
to the generic get_user_page_fast() implementation").
The original x86 __get_user_page_fast used plain get_page() or
page_ref_add(). However, the generic __get_user_page_fast uses
page_cache_get_speculative(), which has VM_BUG_ON(in_interrupt()).
There is no reason to prevent page_cache_get_speculative from using in
interrupt context. According to the author, putting a BUG_ON there is
just because the code is not verifying correctness of interrupt races.
I did some tests in interrupt context. There is no issue found.
Removing VM_BUG_ON(in_interrupt()) for page_cache_get_speculative().
Link: http://lkml.kernel.org/r/1501609146-59730-1-git-send-email-kan.liang@intel.com Fixes: 2947ba054a4d ("x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation") Signed-off-by: Kan Liang <kan.liang@intel.com> Cc: Jens Axboe <axboe@fb.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Ying Huang <ying.huang@intel.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Rapoport [Wed, 2 Aug 2017 20:32:24 +0000 (13:32 -0700)]
userfaultfd: non-cooperative: flush event_wqh at release time
There may still be threads waiting on event_wqh at the time the
userfault file descriptor is closed. Flush the events wait-queue to
prevent waiting threads from hanging.
Link: http://lkml.kernel.org/r/1501398127-30419-1-git-send-email-rppt@linux.vnet.ibm.com Fixes: 9cd75c3cd4c3d ("userfaultfd: non-cooperative: add ability to report
non-PF events from uffd descriptor") Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kees Cook [Wed, 2 Aug 2017 20:32:21 +0000 (13:32 -0700)]
ipc: add missing container_of()s for randstruct
When building with the randstruct gcc plugin, the layout of the IPC
structs will be randomized, which requires any sub-structure accesses to
use container_of(). The proc display handlers were missing the needed
container_of()s since the iterator is passing in the top-level struct
kern_ipc_perm.
This would lead to crashes when running the "lsipc" program after the
system had IPC registered (e.g. after starting up Gnome):
Dima Zavin [Wed, 2 Aug 2017 20:32:18 +0000 (13:32 -0700)]
cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()
In codepaths that use the begin/retry interface for reading
mems_allowed_seq with irqs disabled, there exists a race condition that
stalls the patch process after only modifying a subset of the
static_branch call sites.
This problem manifested itself as a deadlock in the slub allocator,
inside get_any_partial. The loop reads mems_allowed_seq value (via
read_mems_allowed_begin), performs the defrag operation, and then
verifies the consistency of mem_allowed via the read_mems_allowed_retry
and the cookie returned by xxx_begin.
The issue here is that both begin and retry first check if cpusets are
enabled via cpusets_enabled() static branch. This branch can be
rewritted dynamically (via cpuset_inc) if a new cpuset is created. The
x86 jump label code fully synchronizes across all CPUs for every entry
it rewrites. If it rewrites only one of the callsites (specifically the
one in read_mems_allowed_retry) and then waits for the
smp_call_function(do_sync_core) to complete while a CPU is inside the
begin/retry section with IRQs off and the mems_allowed value is changed,
we can hang.
This is because begin() will always return 0 (since it wasn't patched
yet) while retry() will test the 0 against the actual value of the seq
counter.
The fix is to use two different static keys: one for begin
(pre_enable_key) and one for retry (enable_key). In cpuset_inc(), we
first bump the pre_enable key to ensure that cpuset_mems_allowed_begin()
always return a valid seqcount if are enabling cpusets. Similarly, when
disabling cpusets via cpuset_dec(), we first ensure that callers of
cpuset_mems_allowed_retry() will start ignoring the seqcount value
before we let cpuset_mems_allowed_begin() return 0.
The relevant stack traces of the two stuck threads:
Mike Rapoport [Wed, 2 Aug 2017 20:32:15 +0000 (13:32 -0700)]
userfaultfd_zeropage: return -ENOSPC in case mm has gone
In the non-cooperative userfaultfd case, the process exit may race with
outstanding mcopy_atomic called by the uffd monitor. Returning -ENOSPC
instead of -EINVAL when mm is already gone will allow uffd monitor to
distinguish this case from other error conditions.
Unfortunately I overlooked userfaultfd_zeropage when updating
userfaultd_copy().
Link: http://lkml.kernel.org/r/1501136819-21857-1-git-send-email-rppt@linux.vnet.ibm.com Fixes: 96333187ab162 ("userfaultfd_copy: return -ENOSPC in case mm has gone") Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit 3f906ba23689a ("mm/memory-hotplug: switch locking to a percpu
rwsem") memory hotplug locking was changed to fix a potential deadlock.
This also switched the stop_machine() invocation within
build_all_zonelists() to stop_machine_cpuslocked() which now expects
that online cpus are locked when being called.
This assumption is not true if build_all_zonelists() is being called
from numa_zonelist_order_handler().
In order to fix this simply add a mem_hotplug_begin()/mem_hotplug_done()
pair to numa_zonelist_order_handler().
Link: http://lkml.kernel.org/r/20170726111738.38768-1-heiko.carstens@de.ibm.com Fixes: 3f906ba23689a ("mm/memory-hotplug: switch locking to a percpu rwsem") Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Reported-by: Andre Wild <wild@linux.vnet.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tetsuo Handa [Wed, 2 Aug 2017 20:32:09 +0000 (13:32 -0700)]
mm/page_io.c: fix oops during block io poll in swapin path
When a thread is OOM-killed during swap_readpage() operation, an oops
occurs because end_swap_bio_read() is calling wake_up_process() based on
an assumption that the thread which called swap_readpage() is still
alive.
After commit cf8e0fedf078 ("mm/zsmalloc: simplify zs_max_alloc_size
handling"), zram doesn't use double pointer for pool->size_class any
more in zs_create_pool so counter function zs_destroy_pool don't need to
free it, either.
Otherwise, it does kfree wrong address and then, kernel goes Oops.
Link: http://lkml.kernel.org/r/20170725062650.GA12134@bbox Fixes: cf8e0fedf078 ("mm/zsmalloc: simplify zs_max_alloc_size handling") Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Mike Galbraith <efault@gmx.de> Tested-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Arnd Bergmann [Wed, 2 Aug 2017 20:31:58 +0000 (13:31 -0700)]
kasan: avoid -Wmaybe-uninitialized warning
gcc-7 produces this warning:
mm/kasan/report.c: In function 'kasan_report':
mm/kasan/report.c:351:3: error: 'info.first_bad_addr' may be used uninitialized in this function [-Werror=maybe-uninitialized]
print_shadow_for_address(info->first_bad_addr);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/kasan/report.c:360:27: note: 'info.first_bad_addr' was declared here
The code seems fine as we only print info.first_bad_addr when there is a
shadow, and we always initialize it in that case, but this is relatively
hard for gcc to figure out after the latest rework.
Adding an intialization to the most likely value together with the other
struct members shuts up that warning.
Mike Rapoport [Wed, 2 Aug 2017 20:31:55 +0000 (13:31 -0700)]
userfaultfd: non-cooperative: notify about unmap of destination during mremap
When mremap is called with MREMAP_FIXED it unmaps memory at the
destination address without notifying userfaultfd monitor.
If the destination were registered with userfaultfd, the monitor has no
way to distinguish between the old and new ranges and to properly relate
the page faults that would occur in the destination region.
Fixes: 897ab3e0c49e ("userfaultfd: non-cooperative: add event for memory unmaps") Link: http://lkml.kernel.org/r/1500276876-3350-1-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Pavel Emelyanov <xemul@virtuozzo.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Wed, 2 Aug 2017 20:31:52 +0000 (13:31 -0700)]
mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries
Nadav Amit identified a theoritical race between page reclaim and
mprotect due to TLB flushes being batched outside of the PTL being held.
He described the race as follows:
CPU0 CPU1
---- ----
user accesses memory using RW PTE
[PTE now cached in TLB]
try_to_unmap_one()
==> ptep_get_and_clear()
==> set_tlb_ubc_flush_pending()
mprotect(addr, PROT_READ)
==> change_pte_range()
==> [ PTE non-present - no flush ]
user writes using cached RW PTE
...
try_to_unmap_flush()
The same type of race exists for reads when protecting for PROT_NONE and
also exists for operations that can leave an old TLB entry behind such
as munmap, mremap and madvise.
For some operations like mprotect, it's not necessarily a data integrity
issue but it is a correctness issue as there is a window where an
mprotect that limits access still allows access. For munmap, it's
potentially a data integrity issue although the race is massive as an
munmap, mmap and return to userspace must all complete between the
window when reclaim drops the PTL and flushes the TLB. However, it's
theoritically possible so handle this issue by flushing the mm if
reclaim is potentially currently batching TLB flushes.
Other instances where a flush is required for a present pte should be ok
as either the page lock is held preventing parallel reclaim or a page
reference count is elevated preventing a parallel free leading to
corruption. In the case of page_mkclean there isn't an obvious path
that userspace could take advantage of without using the operations that
are guarded by this patch. Other users such as gup as a race with
reclaim looks just at PTEs. huge page variants should be ok as they
don't race with reclaim. mincore only looks at PTEs. userfault also
should be ok as if a parallel reclaim takes place, it will either fault
the page back in or read some of the data before the flush occurs
triggering a fault.
Note that a variant of this patch was acked by Andy Lutomirski but this
was for the x86 parts on top of his PCID work which didn't make the 4.13
merge window as expected. His ack is dropped from this version and
there will be a follow-on patch on top of PCID that will include his
ack.
Daniel Jordan [Wed, 2 Aug 2017 20:31:47 +0000 (13:31 -0700)]
mm/hugetlb.c: __get_user_pages ignores certain follow_hugetlb_page errors
Commit 9a291a7c9428 ("mm/hugetlb: report -EHWPOISON not -EFAULT when
FOLL_HWPOISON is specified") causes __get_user_pages to ignore certain
errors from follow_hugetlb_page. After such error, __get_user_pages
subsequently calls faultin_page on the same VMA and start address that
follow_hugetlb_page failed on instead of returning the error immediately
as it should.
In follow_hugetlb_page, when hugetlb_fault returns a value covered under
VM_FAULT_ERROR, follow_hugetlb_page returns it without setting nr_pages
to 0 as __get_user_pages expects in this case, which causes the
following to happen in __get_user_pages: the "while (nr_pages)" check
succeeds, we skip the "if (!vma..." check because we got a VMA the last
time around, we find no page with follow_page_mask, and we call
faultin_page, which calls hugetlb_fault for the second time.
This issue also slightly changes how __get_user_pages works. Before, it
only returned error if it had made no progress (i = 0). But now,
follow_hugetlb_page can clobber "i" with an error code since its new
return path doesn't check for progress. So if "i" is nonzero before a
failing call to follow_hugetlb_page, that indication of progress is lost
and __get_user_pages can return error even if some pages were
successfully pinned.
To fix this, change follow_hugetlb_page so that it updates nr_pages,
allowing __get_user_pages to fail immediately and restoring the "error
only if no progress" behavior to __get_user_pages.
Tested that __get_user_pages returns when expected on error from
hugetlb_fault in follow_hugetlb_page.
Fixes: 9a291a7c9428 ("mm/hugetlb: report -EHWPOISON not -EFAULT when FOLL_HWPOISON is specified") Link: http://lkml.kernel.org/r/1500406795-58462-1-git-send-email-daniel.m.jordan@oracle.com Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com> Acked-by: Punit Agrawal <punit.agrawal@arm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: James Morse <james.morse@arm.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: zhong jiang <zhongjiang@huawei.com> Cc: <stable@vger.kernel.org> [4.12.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Matlack [Tue, 1 Aug 2017 21:00:40 +0000 (14:00 -0700)]
KVM: nVMX: mark vmcs12 pages dirty on L2 exit
The host physical addresses of L1's Virtual APIC Page and Posted
Interrupt descriptor are loaded into the VMCS02. The CPU may write
to these pages via their host physical address while L2 is running,
bypassing address-translation-based dirty tracking (e.g. EPT write
protection). Mark them dirty on every exit from L2 to prevent them
from getting out of sync with dirty tracking.
Also mark the virtual APIC page and the posted interrupt descriptor
dirty when KVM is virtualizing posted interrupt processing.
Signed-off-by: David Matlack <dmatlack@google.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
David Matlack [Tue, 1 Aug 2017 21:00:39 +0000 (14:00 -0700)]
kvm: nVMX: don't flush VMCS12 during VMXOFF or VCPU teardown
According to the Intel SDM, software cannot rely on the current VMCS to be
coherent after a VMXOFF or shutdown. So this is a valid way to handle VMCS12
flushes.
24.11.1 Software Use of Virtual-Machine Control Structures
...
If a logical processor leaves VMX operation, any VMCSs active on
that logical processor may be corrupted (see below). To prevent
such corruption of a VMCS that may be used either after a return
to VMX operation or on another logical processor, software should
execute VMCLEAR for that VMCS before executing the VMXOFF instruction
or removing power from the processor (e.g., as part of a transition
to the S3 and S4 power states).
...
This fixes a "suspicious rcu_dereference_check() usage!" warning during
kvm_vm_release() because nested_release_vmcs12() calls
kvm_vcpu_write_guest_page() without holding kvm->srcu.
Signed-off-by: David Matlack <dmatlack@google.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Paolo Bonzini [Thu, 27 Jul 2017 13:54:46 +0000 (15:54 +0200)]
KVM: nVMX: do not pin the VMCS12
Since the current implementation of VMCS12 does a memcpy in and out
of guest memory, we do not need current_vmcs12 and current_vmcs12_page
anymore. current_vmptr is enough to read and write the VMCS12.
And David Matlack noted:
This patch also fixes dirty tracking (memslot->dirty_bitmap) of the
VMCS12 page by using kvm_write_guest. nested_release_page() only marks
the struct page dirty.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com>
[Added David Matlack's note and nested_release_page_clean() fix.] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Paolo Bonzini [Wed, 2 Aug 2017 15:55:54 +0000 (17:55 +0200)]
KVM: avoid using rcu_dereference_protected
During teardown, accesses to memslots and buses are using
rcu_dereference_protected with an always-true condition because
these accesses are done outside the usual mutexes. This
is because the last reference is gone and there cannot be any
concurrent modifications, but rcu_dereference_protected is
ugly and unobvious.
Instead, check the refcount in kvm_get_bus and __kvm_memslots.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Longpeng(Mike) [Wed, 2 Aug 2017 03:20:51 +0000 (11:20 +0800)]
KVM: X86: init irq->level in kvm_pv_kick_cpu_op
'lapic_irq' is a local variable and its 'level' field isn't
initialized, so 'level' is random, it doesn't matter but
makes UBSAN unhappy:
UBSAN: Undefined behaviour in .../lapic.c:...
load of value 10 is not a valid value for type '_Bool'
...
Call Trace:
[<ffffffff81f030b6>] dump_stack+0x1e/0x20
[<ffffffff81f03173>] ubsan_epilogue+0x12/0x55
[<ffffffff81f03b96>] __ubsan_handle_load_invalid_value+0x118/0x162
[<ffffffffa1575173>] kvm_apic_set_irq+0xc3/0xf0 [kvm]
[<ffffffffa1575b20>] kvm_irq_delivery_to_apic_fast+0x450/0x910 [kvm]
[<ffffffffa15858ea>] kvm_irq_delivery_to_apic+0xfa/0x7a0 [kvm]
[<ffffffffa1517f4e>] kvm_emulate_hypercall+0x62e/0x760 [kvm]
[<ffffffffa113141a>] handle_vmcall+0x1a/0x30 [kvm_intel]
[<ffffffffa114e592>] vmx_handle_exit+0x7a2/0x1fa0 [kvm_intel]
...
Felix Kuehling [Wed, 2 Aug 2017 02:34:55 +0000 (22:34 -0400)]
drm/amdgpu: Use list_del_init in amdgpu_mn_unregister
Otherwise bo->shadow_list (which is aliased by bo->mn_list) will not
appear empty in amdgpu_ttm_bo_destroy and cause an oops when freeing
former userptr BOs.
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Jean Delvare [Sun, 30 Jul 2017 08:18:25 +0000 (10:18 +0200)]
drm/amdgpu: Fix undue fallthroughs in golden registers initialization
As I was staring at the si_init_golden_registers code, I noticed that
the Pitcairn initialization silently falls through the Cape Verde
initialization, and the Oland initialization falls through the Hainan
initialization. However there is no comment stating that this is
intentional, and the radeon driver doesn't have any such fallthrough,
so I suspect this is not supposed to happen.
Signed-off-by: Jean Delvare <jdelvare@suse.de> Fixes: 62a37553414a ("drm/amdgpu: add si implementation v10") Cc: Ken Wang <Qingqing.Wang@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: "Marek Olšák" <maraeo@gmail.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: Flora Cui <Flora.Cui@amd.com> Reviewed-by: Marek Olšák <marek.olsak@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
Stephen Boyd [Wed, 2 Aug 2017 16:11:44 +0000 (09:11 -0700)]
Merge tag 'sunxi-clk-fixes-for-4.13' of https://git.kernel.org/pub/scm/linux/kernel/git/sunxi/linux into clk-fixes
Pull one Allwinner clock fix from Chen-Yu Tsai:
One critical clock fix for sun5i (A10s/A13/R8) which enables propagation
of clock rate changes from the "cpu" clock to it's parent PLL clock.
This fixes cpufreq related crashes that have been observed on KernelCI
with the C.H.I.P. and multi_v7_defconfig.
* tag 'sunxi-clk-fixes-for-4.13' of https://git.kernel.org/pub/scm/linux/kernel/git/sunxi/linux:
clk: sunxi-ng: sun5i: Add clk_set_rate_parent to the CPU clock
Takashi Iwai [Wed, 2 Aug 2017 15:11:45 +0000 (17:11 +0200)]
Merge tag 'asoc-fix-v4.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus
ASoC: Fixes for v4.13
Quite a few fixes here that have been sent since the merge window, the
biggest one is the fix from Tony for some confusion with the device
property API which was causing issues with the of-graph card. This is
fixed with some changes in the graph API itself as it seemed very likely
to be error prone.
Gregory CLEMENT [Tue, 1 Aug 2017 16:01:35 +0000 (18:01 +0200)]
ARM64: dts: marvell: armada-37xx: Fix the number of GPIO on south bridge
The number of pins in South Bridge is 30 and not 29. There is a fix for
the driver for the pinctrl, but a fix is also need at device tree level
for the GPIO.
Fixes: afda007feda5 ("ARM64: dts: marvell: Add pinctrl nodes for Armada
3700") Cc: <stable@vger.kernel.org> Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
of_irq_to_resource() has recently been fixed to return negative error #'s
along with 0 in case of failure, however the Freescale MPC832x RDB board
code still only regards 0 as a failure indication -- fix it up.
Fixes: 7a4228bbff76 ("of: irq: use of_irq_get() in of_irq_to_resource()") Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Acked-by: Scott Wood <oss@buserror.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When more than one GPIO IRQs are triggered simultaneously,
tegra_gpio_irq_handler() called chained_irq_exit() multiple
times for one chained_irq_enter().
Fixes: 3c92db9ac0ca3eee8e46e2424b6c074e2e394ad9 Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
[Also changed the variable to a bool] Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Andy Lutomirski [Tue, 1 Aug 2017 15:37:26 +0000 (08:37 -0700)]
platform/x86: dell-wmi: Fix driver interface version query
When I converted dell-wmi to the new bus infrastructure, I left the
call to dell_wmi_check_descriptor_buffer() in dell_wmi_init(). This
could cause two problems:
- An error message when loading the driver on a system without
dell-wmi. We'd try to read the event descriptor even if the WMI
GUID wasn't there.
- A possible race if dell-wmi was loaded manually before wmi was
fully initialized.
Fix it by moving the call to the probe function where it belongs.
Fixes: bff589be59c5 ("platform/x86: dell-wmi: Convert to the WMI bus infrastructure") Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Pali Rohár <pali.rohar@gmail.com> Signed-off-by: Darren Hart (VMware) <dvhart@infradead.org>
Trond Myklebust [Tue, 1 Aug 2017 20:02:47 +0000 (16:02 -0400)]
NFSv4: Fix EXCHANGE_ID corrupt verifier issue
The verifier is allocated on the stack, but the EXCHANGE_ID RPC call was
changed to be asynchronous by commit 8d89bd70bc939. If we interrrupt
the call to rpc_wait_for_completion_task(), we can therefore end up
transmitting random stack contents in lieu of the verifier.
I encounter this when trying to stress the async page fault in L1 guest w/
L2 guests running.
Commit 9b132fbe5419 (Add rcu user eqs exception hooks for async page
fault) adds rcu_irq_enter/exit() to kvm_async_pf_task_wait() to exit cpu
idle eqs when needed, to protect the code that needs use rcu. However,
we need to call the pair even if the function calls schedule(), as seen
from the above backtrace.
This patch fixes it by informing the RCU subsystem exit/enter the irq
towards/away from idle for both n.halted and !n.halted.
Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: stable@vger.kernel.org Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Paolo Bonzini [Thu, 27 Jul 2017 10:29:32 +0000 (12:29 +0200)]
KVM: nVMX: fixes to nested virt interrupt injection
There are three issues in nested_vmx_check_exception:
1) it is not taking PFEC_MATCH/PFEC_MASK into account, as reported
by Wanpeng Li;
2) it should rebuild the interruption info and exit qualification fields
from scratch, as reported by Jim Mattson, because the values from the
L2->L0 vmexit may be invalid (e.g. if an emulated instruction causes
a page fault, the EPT misconfig's exit qualification is incorrect).
3) CR2 and DR6 should not be written for exception intercept vmexits
(CR2 only for AMD).
This patch fixes the first two and adds a comment about the last,
outlining the fix.
Cc: Jim Mattson <jmattson@google.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 27 Jul 2017 08:31:25 +0000 (10:31 +0200)]
KVM: nVMX: do not fill vm_exit_intr_error_code in prepare_vmcs12
Do this in the caller of nested_vmx_vmexit instead.
nested_vmx_check_exception was doing a vmwrite to the vmcs02's
VM_EXIT_INTR_ERROR_CODE field, so that prepare_vmcs12 would move
the field to vmcs12->vm_exit_intr_error_code. However that isn't
possible on pre-Haswell machines. Moving the vmcs12 write to the
callers fixes it.
Reported-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[Changed nested_vmx_reflect_vmexit() return type to (int)1 from (bool)1,
thanks to fengguang.wu@intel.com] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Linus Torvalds [Tue, 1 Aug 2017 20:20:24 +0000 (13:20 -0700)]
Merge branch 'parisc-4.13-4' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parsic fixes from Helge Deller:
- Our cache flushing code ran into a BUG in case context is not
current. Fix it by flushing the whole cache in such rare situations
(by Dave Anglin).
- Fix a "sleeping function called from invalid context BUG" in our
pdc_stable driver by rearranging our locks (by James Bottomley)
- The thread and irq stacks require more than 16 KB since kernel 4.11.
Increase both to 32 KB.
- Define CONFIG_CPU_BIG_ENDIAN unconditionally on parisc to avoid wrong
behaviour in qrwlock functions (by Babu Moger).
* 'parisc-4.13-4' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Define CONFIG_CPU_BIG_ENDIAN
parisc: pdc_stable: Fix locking when creating sysfs links
parisc: Increase thread and stack size to 32kb
parisc: Handle vma's whose context is not current in flush_cache_range
libceph: fallback for when there isn't a pool-specific choose_arg
There is now a fallback to a choose_arg index of -1 if there isn't
a pool-specific choose_arg set. If you create a per-pool weight-set,
that works for that pool. Otherwise we try the compat/default one. If
that doesn't exist either, then we use the normal CRUSH weights.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com>
libceph: don't call ->reencode_message() more than once per message
Reencoding an already reencoded message is a bad idea. This could
happen on Policy::stateful_server connections (!CEPH_MSG_CONNECT_LOSSY),
such as MDS sessions.
This didn't pop up in testing because currently only OSD requests are
reencoded and OSD sessions are always lossy.
libceph: make encode_request_*() work with r_mempool requests
Messages allocated out of ceph_msgpool have a fixed front length
(pool->front_len). Asserting that the entire front has been filled
while encoding is thus wrong.
Tony Lindgren [Fri, 28 Jul 2017 08:23:15 +0000 (01:23 -0700)]
device property: Fix usecount for of_graph_get_port_parent()
Fix inconsistent use of of_graph_get_port_parent() where
asoc_simple_card_parse_graph_dai() does of_node_get() before
calling it while other callers do not. We can fix this by
not trashing the node passed to of_graph_get_port_parent().
Let's also make sure the callers have correct refcounts and remove
related incorrect of_node_put() calls for of_for_each_phandle
as that's done by of_phandle_iterator_next() except when
we break out of the loop early.
Let's fix both issues with a single patch to avoid kobject
refcounts getting messed up more if two patches are merged
separately.
Otherwise strange issues can happen caused by memory corruption
caused by too many kobject_del() calls such as:
BUG: sleeping function called from invalid context at
kernel/locking/mutex.c:747
...
(___might_sleep)
(__mutex_lock)
(mutex_lock_nested)
(kernfs_remove)
(kobject_del)
(kobject_put)
(of_get_next_parent)
(of_graph_get_port_parent)
(asoc_simple_card_parse_graph_dai [snd_soc_simple_card_utils])
(asoc_graph_card_probe [snd_soc_audio_graph_card])
Fixes: 0ef472a973eb ("of_graph: add of_graph_get_port_parent()") Fixes: 2692c1c63c29 ("ASoC: add audio-graph-card support") Fixes: 1689333f8311 ("ASoC: simple-card-utils: add asoc_simple_card_parse_graph_dai()") Signed-off-by: Tony Lindgren <tony@atomide.com> Reviewed-by: Rob Herring <robh@kernel.org> Tested-by: Antonio Borneo <borneo.antonio@gmail.com> Tested-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> Signed-off-by: Mark Brown <broonie@kernel.org>
For e.g. HZ=100, timer being 430 jiffies in the future, and 32 bit
unsigned int, there is an overflow on unsigned int right-hand side
of the expression which results with wrong values being returned.
Type cast the multiplier to 64bit to avoid that issue.
Fixes: 46c8f0b077a8 ("timers: Fix get_next_timer_interrupt() computation") Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nokia.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexander Sverdlin <alexander.sverdlin@nokia.com> Cc: khilman@baylibre.com Cc: akpm@linux-foundation.org Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/a7900f04-2a21-c9fd-67be-ab334d459ee5@nokia.com
gpiolib: skip unwanted events, don't convert them to opposite edge
The previous fix for filtering out of unwatched events was not entirely
correct. Instead of skipping the events we don't want, they are now
interpreted as events with opposing edge.
In order to fix it: always read the GPIO line value on interrupt and
only emit the event if it corresponds with the event type we requested.
Jan Kiszka [Wed, 19 Jul 2017 05:31:14 +0000 (07:31 +0200)]
gpio: exar: Use correct property prefix and document bindings
The device-specific property should be prefixed with the vendor name,
not "linux,", as Linus Walleij pointed out. Change this and document the
bindings of this platform device.
We didn't ship the old binding in a release yet. So we can still change
it without breaking an official API.
Fixes: 380b1e2f3a2f ("gpio-exar/8250-exar: Make set of exported GPIOs configurable") Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Acked-by: Rob Herring <robh@kernel.org> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Marc Zyngier [Fri, 21 Jul 2017 17:15:27 +0000 (18:15 +0100)]
arm64: Use arch_timer_get_rate when trapping CNTFRQ_EL0
In an ideal world, CNTFRQ_EL0 always contains the timer frequency
for the kernel to use. Sadly, we get quite a few broken systems
where the firmware authors cannot be bothered to program that
register on all CPUs, and rely on DT to provide that frequency.
So when trapping CNTFRQ_EL0, make sure to return the actual rate
(as known by the kernel), and not CNTFRQ_EL0.
Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
Thomas Gleixner [Mon, 31 Jul 2017 20:07:09 +0000 (22:07 +0200)]
x86/hpet: Cure interface abuse in the resume path
The HPET resume path abuses irq_domain_[de]activate_irq() to restore the
MSI message in the HPET chip for the boot CPU on resume and it relies on an
implementation detail of the interrupt core code, which magically makes the
HPET unmask call invoked via a irq_disable/enable pair. This worked as long
as the irq code did unconditionally invoke the unmask() callback. With the
recent changes which keep track of the masked state to avoid expensive
hardware access, this does not longer work. As a consequence the HPET timer
interrupts are not unmasked which breaks resume as the boot CPU waits
forever that a timer interrupt arrives.
Make the restore of the MSI message explicit and invoke the unmask()
function directly. While at it get rid of the pointless affinity setting as
nothing can change the affinity of the interrupt and the vector across
suspend/resume. The restore of the MSI message reestablishes the previous
affinity setting which is the correct one.
Fixes: bf22ff45bed6 ("genirq: Avoid unnecessary low level irq function calls") Reported-and-tested-by: Tomi Sarvela <tomi.p.sarvela@intel.com> Reported-by: Martin Peres <martin.peres@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: jeffy.chen@rock-chips.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1707312158590.2287@nanos
gpio: gpio-mxc: Fix: higher 16 GPIOs usable as wake source
In the function gpio_set_wake_irq(), port->irq_high is only checked for
zero. As platform_get_irq() returns a value less then zero if no interrupt
was found, any gpio >= 16 was handled like an irq_high interrupt was
available. On iMX27 for example no high interrupt is available. This lead
to the problem that only some gpios (the lower 16) were useable as wake
sources.
Signed-off-by: Philipp Rosenberger <p.rosenberger@linutronix.de> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
pinctrl: stm32: select IRQ_DOMAIN_HIERARCHY instead of depends on
Drivers that need IRQ_DOMAIN_HIERARCHY should "select" it, but
drivers/pinctrl/stm32/Kconfig is the only exception that uses
"depends on" syntax. This prevents GPIO drivers from select'ing
IRQ_DOMAIN_HIERARCHY.
For example, if I add "select IRQ_DOMAIN_HIERARCHY" to GPIO_XGENE_SB,
I get the following recursive dependency error.
drivers/gpio/Kconfig:13:error: recursive dependency detected!
For a resolution refer to Documentation/kbuild/kconfig-language.txt
subsection "Kconfig recursive dependency limitations"
drivers/gpio/Kconfig:13: symbol GPIOLIB is selected by PINCTRL_STM32
For a resolution refer to Documentation/kbuild/kconfig-language.txt
subsection "Kconfig recursive dependency limitations"
drivers/pinctrl/stm32/Kconfig:3: symbol PINCTRL_STM32 is selected by PINCTRL_STM32F429
For a resolution refer to Documentation/kbuild/kconfig-language.txt
subsection "Kconfig recursive dependency limitations"
drivers/pinctrl/stm32/Kconfig:11: symbol PINCTRL_STM32F429 depends on IRQ_DOMAIN_HIERARCHY
For a resolution refer to Documentation/kbuild/kconfig-language.txt
subsection "Kconfig recursive dependency limitations"
kernel/irq/Kconfig:67: symbol IRQ_DOMAIN_HIERARCHY is selected by GPIO_XGENE_SB
For a resolution refer to Documentation/kbuild/kconfig-language.txt
subsection "Kconfig recursive dependency limitations"
drivers/gpio/Kconfig:502: symbol GPIO_XGENE_SB depends on GPIOLIB
William Tu [Mon, 31 Jul 2017 21:40:50 +0000 (14:40 -0700)]
samples/bpf: fix bpf tunnel cleanup
test_tunnel_bpf.sh fails to remove the vxlan11 tunnel device, causing the
next geneve tunnelling test case fails. In addition, the geneve reserved bit
in tcbpf2_kern.c should be zero, according to the RFC.
Signed-off-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Mon, 31 Jul 2017 14:52:36 +0000 (16:52 +0200)]
udp6: fix jumbogram reception
Since commit 67a51780aebb ("ipv6: udp: leverage scratch area
helpers") udp6_recvmsg() read the skb len from the scratch area,
to avoid a cache miss.
But the UDP6 rx path support RFC 2675 UDPv6 jumbograms, and their
length exceeds the 16 bits available in the scratch area. As a side
effect the length returned by recvmsg() is:
<ingress datagram len> % (1<<16)
This commit addresses the issue allocating one more bit in the
IP6CB flags field and setting it for incoming jumbograms.
Such field is still in the first cacheline, so at recvmsg()
time we can check it and fallback to access skb->len if
required, without a measurable overhead.
Fixes: 67a51780aebb ("ipv6: udp: leverage scratch area helpers") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
ppp: Fix a scheduling-while-atomic bug in del_chan
The PPTP set the pptp_sock_destruct as the sock's sk_destruct, it would
trigger this bug when __sk_free is invoked in atomic context, because of
the call path pptp_sock_destruct->del_chan->synchronize_rcu.
Now move the synchronize_rcu to pptp_release from del_chan. This is the
only one case which would free the sock and need the synchronize_rcu.
The following is the panic I met with kernel 3.3.8, but this issue should
exist in current kernel too according to the codes.
Revert "net: bcmgenet: Remove init parameter from bcmgenet_mii_config"
This reverts commit 28b45910ccda ("net: bcmgenet: Remove init parameter
from bcmgenet_mii_config") because in the process of moving from
dev_info() to dev_info_once() we essentially lost the helpful printed
messages once the second instance of the driver is loaded.
dev_info_once() does not actually print the message once per device
instance, but once period.
Fixes: 28b45910ccda ("net: bcmgenet: Remove init parameter from bcmgenet_mii_config") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Doug Berger <opendmb@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Seth Forshee noticed a performance degradation with some workloads.
This turns out to be due to packet drops. Euan Kemp noticed that this
is because we drop all packets where length exceeds the truesize, but
for some packets we add in extra memory without updating the truesize.
This in turn was kept around unchanged from ab7db91705e95 ("virtio-net:
auto-tune mergeable rx buffer size for improved performance"). That
commit had an internal reason not to account for the extra space: not
enough bits to do it. No longer true so let's account for the allocated
length exactly.
Many thanks to Seth Forshee for the report and bisecting and Euan Kemp
for debugging the issue.
Fixes: 680557cf79f8 ("virtio_net: rework mergeable buffer handling") Reported-by: Euan Kemp <euan.kemp@coreos.com> Tested-by: Euan Kemp <euan.kemp@coreos.com> Reported-by: Seth Forshee <seth.forshee@canonical.com> Tested-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sergei Shtylyov [Sat, 29 Jul 2017 19:18:41 +0000 (22:18 +0300)]
mv643xx_eth: fix of_irq_to_resource() error check
of_irq_to_resource() has recently been fixed to return negative error #'s
along with 0 in case of failure, however the Marvell MV643xx Ethernet
driver still only regards 0 as invalid IRQ -- fix it up.
Fixes: 7a4228bbff76 ("of: irq: use of_irq_get() in of_irq_to_resource()") Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Signed-off-by: David S. Miller <davem@davemloft.net>
MAINTAINERS: Add more files to the PHY LIBRARY section
Include missing files that are provided by, used, or directly maintained
within the PHY LIBRARY, this include uapi header, header files used by
Device Tree code etc.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
ipv4: fib: Fix NULL pointer deref during fib_sync_down_dev()
Michał reported a NULL pointer deref during fib_sync_down_dev() when
unregistering a netdevice. The problem is that we don't check for
'in_dev' being NULL, which can happen in very specific cases.
Usually routes are flushed upon NETDEV_DOWN sent in either the netdev or
the inetaddr notification chains. However, if an interface isn't
configured with any IP address, then it's possible for host routes to be
flushed following NETDEV_UNREGISTER, after NULLing dev->ip_ptr in
inetdev_destroy().
To reproduce:
$ ip link add type dummy
$ ip route add local 1.1.1.0/24 dev dummy0
$ ip link del dev dummy0
Fix this by checking for the presence of 'in_dev' before referencing it.
Fixes: 982acb97560c ("ipv4: fib: Notify about nexthop status changes") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reported-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Tested-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
net: phy: Correctly process PHY_HALTED in phy_stop_machine()
Marc reported that he was not getting the PHY library adjust_link()
callback function to run when calling phy_stop() + phy_disconnect()
which does not indeed happen because we set the state machine to
PHY_HALTED but we don't get to run it to process this state past that
point.
Fix this with a synchronous call to phy_state_machine() in order to have
the state machine actually act on PHY_HALTED, set the PHY device's link
down, turn the network device's carrier off and finally call the
adjust_link() function.
Reported-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com> Fixes: a390d1f379cf ("phylib: convert state_queue work to delayed_work") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Merge branch 'for-4.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
"Several cgroup bug fixes.
- cgroup core was calling a migration callback on empty migrations,
which could make cpuset crash.
- There was a very subtle bug where the controller interface files
aren't created directly when cgroup2 is mounted. Because later
operations create them, this bug didn't get noticed earlier.
- Failed writes to cgroup.subtree_control were incorrectly returning
zero"
* 'for-4.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: fix error return value from cgroup_subtree_control()
cgroup: create dfl_root files on subsys registration
cgroup: don't call migration methods if there are no tasks to migrate
Merge branch 'for-4.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fixes from Tejun Heo:
"Two notable fixes.
- While adding NUMA affinity support to unbound workqueues, the
assumption that an unbound workqueue with max_active == 1 is
ordered was broken.
The plan was to use explicit alloc_ordered_workqueue() for those
cases. Unfortunately, I forgot to update the documentation properly
and we grew a handful of use cases which depend on that assumption.
While we want to convert them to alloc_ordered_workqueue(), we
don't really lose anything by enforcing ordered execution on
unbound max_active == 1 workqueues and it doesn't make sense to
risk subtle bugs. Restore the assumption.
- Workqueue assumes that CPU <-> NUMA node mapping remains static.
This is a general assumption - we don't have any synchronization
mechanism around CPU <-> node mapping. Unfortunately, powerpc may
change the mapping dynamically leading to crashes. Michael added a
workaround so that we at least don't crash while powerpc hotplug
code gets updated"
* 'for-4.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Work around edge cases for calc of pool's cpumask
workqueue: implicit ordered attribute should be overridable
workqueue: restore WQ_UNBOUND/max_active==1 to be ordered
Merge branch 'for-4.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata
Pull libata fixes from Tejun Heo:
"Dan found a really old bug where libata hotplug code wasn't sanitizing
index value from userland and may end up indexing with a negative
number. It is scary but fortunately can only be triggered by root.
Other than that, minor fixes"
* 'for-4.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
libata: fix a couple of doc build warnings
libata: array underflow in ata_find_dev()
ata: sata_rcar: add gen[23] fallback compatibility strings
libata: remove unused rc in ata_eh_handle_port_resume
libata: Cleanup ata_read_log_page()
ata: fix gemini Kconfig dependencies
clk: samsung: exynos5420: The EPLL rate table corrections
This patch fixes values of the EPLL K coefficient and changes
the EPLL output frequency values to match exactly what is
possible to achieve with given M, P, S, K coefficients.
This allows to avoid rounding errors and unexpected frequency
being set with clk_set_rate(), due to recalc_rate returning
different values than the PLL rate specified in the
exynos5420_epll_24mhz_tbl table. E.g. this prevents a case
where two consecutive clk_set_rate() calls with same argument
result in different PLL output frequency.
The PLL output frequencies have been calculated with formula:
While working on enabling queued rwlock on SPARC, found this following
code in include/asm-generic/qrwlock.h which uses CONFIG_CPU_BIG_ENDIAN
to clear a byte.
Jonathan Corbet [Sun, 30 Jul 2017 22:16:04 +0000 (16:16 -0600)]
libata: fix a couple of doc build warnings
The kerneldoc comments for a couple of functions in drivers/ata/libata-eh.c
had fallen behind the current implementation, resulting in these doc build
warnings:
./drivers/ata/libata-eh.c:1449: warning: No description found for parameter 'link'
./drivers/ata/libata-eh.c:1449: warning: Excess function parameter 'ap' description in 'ata_eh_done'
./drivers/ata/libata-eh.c:1590: warning: No description found for parameter 'qc'
./drivers/ata/libata-eh.c:1590: warning: Excess function parameter 'dev' description in 'ata_eh_request_sense'
Update the comments and make the warnings go away.
Signed-off-by: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Tejun Heo <tj@kernel.org>
thunderbolt: icm: Ignore mailbox errors in icm_suspend()
On one of my test machines nhi_mailbox_cmd() called from icm_suspend()
times out and returnes an error which then is propagated to the
caller and causes the entire system suspend to be aborted which isn't
very useful.
Instead of aborting system suspend, print the error into the log
and continue.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> Acked-by: Michael Jamet <michael.jamet@intel.com>
Nicholas Piggin [Sat, 29 Jul 2017 12:50:27 +0000 (22:50 +1000)]
powerpc/64s: Fix stack setup in watchdog soft_nmi_common()
The watchdog soft-NMI exception stack setup loads a stack pointer
twice, which is an obvious error. It ends up using the system reset
interrupt (true-NMI) stack, which is also a bug because the watchdog
could be preempted by a system reset interrupt that overwrites the
NMI stack.
Change the soft-NMI to use the "emergency stack". The current kernel
stack is not used, because of the longer-term goal to prevent
asynchronous stack access using soft-disable.
Fixes: 2104180a5369 ("powerpc/64s: implement arch-specific hardlockup watchdog") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Since kernel 4.11 the thread and irq stacks on parisc randomly overflow
the default size of 16k. The reason why stack usage suddenly grew is yet
unknown.
This patch modifies flush_cache_range() to handle non current contexts.
In as much as this occurs infrequently, the simplest approach is to
flush the entire cache when this happens.
Signed-off-by: John David Anglin <dave.anglin@bell.net> Cc: stable@vger.kernel.org # 4.9+ Signed-off-by: Helge Deller <deller@gmx.de>
Jeff Layton [Mon, 31 Jul 2017 04:55:34 +0000 (00:55 -0400)]
ext4: convert swap_inode_data() over to use swap() on most of the fields
For some odd reason, it forces a byte-by-byte copy of each field. A
plain old swap() on most of these fields would be more efficient. We
do need to retain the memswap of i_data however as that field is an array.
Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz>
Emoly Liu [Mon, 31 Jul 2017 04:40:22 +0000 (00:40 -0400)]
ext4: error should be cleared if ea_inode isn't added to the cache
For Lustre, if ea_inode fails in hash validation but passes parent
inode and generation checks, it won't be added to the cache as well
as the error "-EFSCORRUPTED" should be cleared, otherwise it will
cause "Structure needs cleaning" when running getfattr command.
Jan Kara [Mon, 31 Jul 2017 03:33:01 +0000 (23:33 -0400)]
ext4: Don't clear SGID when inheriting ACLs
When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
set, DIR1 is expected to have SGID bit set (and owning group equal to
the owning group of 'DIR0'). However when 'DIR0' also has some default
ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on
'DIR1' to get cleared if user is not member of the owning group.
Fix the problem by moving posix_acl_update_mode() out of
__ext4_set_acl() into ext4_set_acl(). That way the function will not be
called when inheriting ACLs which is what we want as it prevents SGID
bit clearing and the mode has been properly set by posix_acl_create()
anyway.
Fixes: 073931017b49d9458aa351605b43a7e34598caef CC: stable@vger.kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
When changing a file's acl mask, __ext4_set_acl() will first set the group
bits of i_mode to the value of the mask, and only then set the actual
extended attribute representing the new acl.
If the second part fails (due to lack of space, for example) and the file
had no acl attribute to begin with, the system will from now on assume
that the mask permission bits are actual group permission bits, potentially
granting access to the wrong users.
Prevent this by only changing the inode mode after the acl has been set.
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
Eric Whitney [Mon, 31 Jul 2017 02:30:11 +0000 (22:30 -0400)]
ext4: remove unused metadata accounting variables
Two variables in ext4_inode_info, i_reserved_meta_blocks and
i_allocated_meta_blocks, are unused. Removing them saves a little
memory per in-memory inode and cleans up clutter in several tracepoints.
Adjust tracepoint output from ext4_alloc_da_blocks() for consistency
and fix a typo and whitespace near these changes.
Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>