David Matlack [Wed, 19 Jan 2022 23:07:28 +0000 (23:07 +0000)]
KVM: x86/mmu: Consolidate logic to atomically install a new TDP MMU page table
Consolidate the logic to atomically replace an SPTE with an SPTE that
points to a new page table into a single helper function. This will be
used in a follow-up commit to split huge pages, which involves replacing
each huge page SPTE with an SPTE that points to a page table.
Opportunistically drop the call to trace_kvm_mmu_get_page() in
kvm_tdp_mmu_map() since it is redundant with the identical tracepoint in
tdp_mmu_alloc_sp().
No functional change intended.
Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-8-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
David Matlack [Wed, 19 Jan 2022 23:07:27 +0000 (23:07 +0000)]
KVM: x86/mmu: Rename handle_removed_tdp_mmu_page() to handle_removed_pt()
First remove tdp_mmu_ from the name since it is redundant given that it
is a static function in tdp_mmu.c. There is a pattern of using tdp_mmu_
as a prefix in the names of static TDP MMU functions, but all of the
other handle_*() variants do not include such a prefix. So drop it
entirely.
Then change "page" to "pt" to convey that this is operating on a page
table rather than an struct page. Purposely use "pt" instead of "sp"
since this function takes the raw RCU-protected page table pointer as an
argument rather than a pointer to the struct kvm_mmu_page.
No functional change intended.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-7-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
These changed make tdp_mmu a consistent prefix before the verb in the
function name, and make it more clear that these functions deal with
kvm_mmu_page structs rather than struct pages.
One could argue that "shadow page" is the wrong term for a page table in
the TDP MMU since it never actually shadows a guest page table.
However, "shadow page" (or "sp" for short) has evolved to become the
standard term in KVM when referring to a kvm_mmu_page struct, and its
associated page table and other metadata, regardless of whether the page
table shadows a guest page table. So this commit just makes the TDP MMU
more consistent with the rest of KVM.
No functional change intended.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-6-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
David Matlack [Wed, 19 Jan 2022 23:07:25 +0000 (23:07 +0000)]
KVM: x86/mmu: Change tdp_mmu_{set,zap}_spte_atomic() to return 0/-EBUSY
tdp_mmu_set_spte_atomic() and tdp_mmu_zap_spte_atomic() return a bool
with true indicating the SPTE modification was successful and false
indicating failure. Change these functions to return an int instead
since that is the common practice.
Opportunistically fix up the kernel-doc style for the Return section
above tdp_mmu_set_spte_atomic().
No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-5-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
David Matlack [Wed, 19 Jan 2022 23:07:24 +0000 (23:07 +0000)]
KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails
Consolidate a bunch of code that was manually re-reading the spte if the
cmpxchg failed. There is no extra cost of doing this because we already
have the spte value as a result of the cmpxchg (and in fact this
eliminates re-reading the spte), and none of the call sites depend on
iter->old_spte retaining the stale spte value.
Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-4-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
David Matlack [Wed, 19 Jan 2022 23:07:23 +0000 (23:07 +0000)]
KVM: x86/mmu: Rename __rmap_write_protect() to rmap_write_protect()
The function formerly known as rmap_write_protect() has been renamed to
kvm_vcpu_write_protect_gfn(), so we can get rid of the double
underscores in front of __rmap_write_protect().
No functional change intended.
Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-3-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
David Matlack [Wed, 19 Jan 2022 23:07:22 +0000 (23:07 +0000)]
KVM: x86/mmu: Rename rmap_write_protect() to kvm_vcpu_write_protect_gfn()
rmap_write_protect() is a poor name because it also write-protects SPTEs
in the TDP MMU, not just SPTEs in the rmap. It is also confusing that
rmap_write_protect() is not a simple wrapper around
__rmap_write_protect(), since that is the common pattern for functions
with double-underscore names.
Rename rmap_write_protect() to kvm_vcpu_write_protect_gfn() to convey
that KVM is write-protecting a specific gfn in the context of a vCPU.
No functional change intended.
Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-2-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: Add checks for reserved-to-zero Hyper-V hypercall fields
Add checks for the three fields in Hyper-V's hypercall params that must
be zero. Per the TLFS, HV_STATUS_INVALID_HYPERCALL_INPUT is returned if
"A reserved bit in the specified hypercall input value is non-zero."
Note, some versions of the TLFS have an off-by-one bug for the last
reserved field, and define it as being bits 64:60. See
https://github.com/MicrosoftDocs/Virtualization-Documentation/pull/1682.
Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211207220926.718794-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: Reject fixeds-size Hyper-V hypercalls with non-zero "var_cnt"
Reject Hyper-V hypercalls if the guest specifies a non-zero variable size
header (var_cnt in KVM) for a hypercall that has a fixed header size.
Per the TLFS:
It is illegal to specify a non-zero variable header size for a
hypercall that is not explicitly documented as accepting variable sized
input headers. In such a case the hypercall will result in a return
code of HV_STATUS_INVALID_HYPERCALL_INPUT.
Note, at least some of the various DEBUG commands likely aren't allowed
to use variable size headers, but the TLFS documentation doesn't clearly
state what is/isn't allowed. Omit them for now to avoid unnecessary
breakage.
Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211207220926.718794-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: Shove vp_bitmap handling down into sparse_set_to_vcpu_mask()
Move the vp_bitmap "allocation" that's needed to handle mismatched vp_index
values down into sparse_set_to_vcpu_mask() and drop __always_inline from
said helper. The need for an intermediate vp_bitmap is a detail that's
specific to the sparse translation with mismatched VP<=>vCPU indexes and
does not need to be exposed to the caller.
Regarding the __always_inline, prior to commit f21dd494506a ("KVM: x86:
hyperv: optimize sparse VP set processing") the helper, then named
hv_vcpu_in_sparse_set(), was a tiny bit of code that effectively boiled
down to a handful of bit ops. The __always_inline was understandable, if
not justifiable. Since the aforementioned change, sparse_set_to_vcpu_mask()
is a chunky 350-450+ bytes of code without KASAN=y, and balloons to 1100+
with KASAN=y. In other words, it has no business being forcefully inlined.
Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211207220926.718794-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: Don't bother reading sparse banks that end up being ignored
When handling "sparse" VP_SET requests, don't read sparse banks that
can't possibly contain a legal VP index instead of ignoring such banks
later on in sparse_set_to_vcpu_mask(). This allows KVM to cap the size
of its sparse_banks arrays for VP_SET at KVM_HV_MAX_SPARSE_VCPU_SET_BITS.
Add a compile time assert that KVM_HV_MAX_SPARSE_VCPU_SET_BITS<=64, i.e.
that KVM_MAX_VCPUS<=4096, as the TLFS allows for at most 64 sparse banks,
and KVM will need to do _something_ to play nice with Hyper-V.
Reducing the size of sparse_banks fudges around a compilation warning
(that becomes error with KVM_WERROR=y) when CONFIG_KASAN_STACK=y, which
is selected (and can't be unselected) by CONFIG_KASAN=y when using gcc
(clang/LLVM is a stack hog in some cases so it's opt-in for clang).
KASAN_STACK adds a redzone around every stack variable, which pushes the
Hyper-V functions over the default limit of 1024.
Ideally, KVM would flat out reject such impossibilities, but the TLFS
explicitly allows providing empty banks, even if a bank can't possibly
contain a valid VP index due to its position exceeding KVM's max.
Furthermore, for a bit 1 in ValidBankMask, it is valid state for the
corresponding element in BanksContents can be all 0s, meaning no
processors are specified in this bank.
Arguably KVM should reject and not ignore the "extra" banks, but that can
be done independently and without bloating sparse_banks, e.g. by reading
each "extra" 8-byte chunk individually.
Reported-by: Ajay Garg <ajaygargnsit@gmail.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211207220926.718794-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: Add a helper to get the sparse VP_SET for IPIs and TLB flushes
Add a helper, kvm_get_sparse_vp_set(), to handle sanity checks related to
the VARHEAD field and reading the sparse banks of a VP_SET. A future
commit to reduce the memory footprint of sparse_banks will introduce more
common code to the sparse bank retrieval.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211207220926.718794-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: Refactor kvm_hv_flush_tlb() to reduce indentation
Refactor the "extended" path of kvm_hv_flush_tlb() to reduce the nesting
depth for the non-fast sparse path, and to make the code more similar to
the extended path in kvm_hv_send_ipi().
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211207220926.718794-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: Get the number of Hyper-V sparse banks from the VARHEAD field
Get the number of sparse banks from the VARHEAD field, which the guest is
required to provide as "The size of a variable header, in QWORDS.", where
the variable header is:
Variable Header Bytes = {Total Header Bytes - sizeof(Fixed Header)}
rounded up to nearest multiple of 8
Variable HeaderSize = Variable Header Bytes / 8
In other words, the VARHEAD should match the number of sparse banks.
Keep the manual count as a sanity check, but otherwise rely on the field
so as to more closely align with the logic defined in the TLFS and to
allow for future cleanups.
Tweak the tracepoint output to use "rep_cnt" instead of simply "cnt" now
that there is also "var_cnt".
Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211207220926.718794-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
David Matlack [Tue, 25 Jan 2022 23:07:23 +0000 (23:07 +0000)]
KVM: x86/mmu: Consolidate comments about {Host,MMU}-writable
Consolidate the large comment above DEFAULT_SPTE_HOST_WRITABLE with the
large comment above is_writable_pte() into one comment. This comment
explains the different reasons why an SPTE may be non-writable and KVM
keeps track of that with the {Host,MMU}-writable bits.
No functional change intended.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220125230723.1701061-1-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
David Matlack [Tue, 25 Jan 2022 23:07:13 +0000 (23:07 +0000)]
KVM: x86/mmu: Rename DEFAULT_SPTE_MMU_WRITEABLE to DEFAULT_SPTE_MMU_WRITABLE
Both "writeable" and "writable" are valid, but we should be consistent
about which we use. DEFAULT_SPTE_MMU_WRITEABLE was the odd one out in
the SPTE code, so rename it to DEFAULT_SPTE_MMU_WRITABLE.
No functional change intended.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220125230713.1700406-1-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
David Matlack [Tue, 25 Jan 2022 23:05:16 +0000 (23:05 +0000)]
KVM: x86/mmu: Move is_writable_pte() to spte.h
Move is_writable_pte() close to the other functions that check
writability information about SPTEs. While here opportunistically
replace the open-coded bit arithmetic in
check_spte_writable_invariants() with a call to is_writable_pte().
No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220125230518.1697048-4-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
David Matlack [Tue, 25 Jan 2022 23:05:15 +0000 (23:05 +0000)]
KVM: x86/mmu: Check SPTE writable invariants when setting leaf SPTEs
Check SPTE writable invariants when setting SPTEs rather than in
spte_can_locklessly_be_made_writable(). By the time KVM checks
spte_can_locklessly_be_made_writable(), the SPTE has long been since
corrupted.
Note that these invariants only apply to shadow-present leaf SPTEs (i.e.
not to MMIO SPTEs, non-leaf SPTEs, etc.). Add a comment explaining the
restriction and only instrument the code paths that set shadow-present
leaf SPTEs.
To account for access tracking, also check the SPTE writable invariants
when marking an SPTE as an access track SPTE. This also lets us remove
a redundant WARN from mark_spte_for_access_track().
Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220125230518.1697048-3-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
David Matlack [Tue, 25 Jan 2022 23:05:14 +0000 (23:05 +0000)]
KVM: x86/mmu: Move SPTE writable invariant checks to a helper function
Move the WARNs in spte_can_locklessly_be_made_writable() to a separate
helper function. This is in preparation for moving these checks to the
places where SPTEs are set.
Opportunistically add warning error messages that include the SPTE to
make future debugging of these warnings easier.
Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220125230518.1697048-2-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Wanpeng Li [Tue, 25 Jan 2022 12:08:58 +0000 (04:08 -0800)]
KVM: LAPIC: Enable timer posted-interrupt only when mwait/hlt is advertised
As commit 0c5f81dad46 ("KVM: LAPIC: Inject timer interrupt via posted
interrupt") mentioned that the host admin should well tune the guest
setup, so that vCPUs are placed on isolated pCPUs, and with several pCPUs
surplus for *busy* housekeeping. In this setup, it is preferrable to
disable mwait/hlt/pause vmexits to keep the vCPUs in non-root mode.
However, if only some guests isolated and others not, they would not
have any benefit from posted timer interrupts, and at the same time lose
VMX preemption timer fast paths because kvm_can_post_timer_interrupt()
returns true and therefore forces kvm_can_use_hv_timer() to false.
By guaranteeing that posted-interrupt timer is only used if MWAIT or
HLT are done without vmexit, KVM can make a better choice and use the
VMX preemption timer and the corresponding fast paths.
Reported-by: Aili Yao <yaoaili@kingsoft.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Cc: Aili Yao <yaoaili@kingsoft.com> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1643112538-36743-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Wanpeng Li [Tue, 25 Jan 2022 11:59:39 +0000 (03:59 -0800)]
KVM: VMX: Dont' send posted IRQ if vCPU == this vCPU and vCPU is IN_GUEST_MODE
When delivering a virtual interrupt, don't actually send a posted interrupt
if the target vCPU is also the currently running vCPU and is IN_GUEST_MODE,
in which case the interrupt is being sent from a VM-Exit fastpath and the
core run loop in vcpu_enter_guest() will manually move the interrupt from
the PIR to vmcs.GUEST_RVI. IRQs are disabled while IN_GUEST_MODE, thus
there's no possibility of the virtual interrupt being sent from anything
other than KVM, i.e. KVM won't suppress a wake event from an IRQ handler
(see commit fdba608f15e2, "KVM: VMX: Wake vCPU when delivering posted IRQ
even if vCPU == this vCPU").
Eliding the posted interrupt restores the performance provided by the
combination of commits 379a3c8ee444 ("KVM: VMX: Optimize posted-interrupt
delivery for timer fastpath") and 26efe2fd92e5 ("KVM: VMX: Handle
preemption timer fastpath").
Thanks Sean for better comments.
Suggested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1643111979-36447-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: SVM: Rename hook implementations to conform to kvm_x86_ops' names
Massage SVM's implementation names that still diverge from kvm_x86_ops to
allow for wiring up all SVM-defined functions via kvm-x86-ops.h.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-22-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: SVM: Rename SEV implemenations to conform to kvm_x86_ops hooks
Rename svm_vm_copy_asid_from() and svm_vm_migrate_from() to conform to
the names used by kvm_x86_ops, and opportunistically use "sev" instead of
"svm" to more precisely identify the role of the hooks.
svm_vm_copy_asid_from() in particular was poorly named as the function
does much more than simply copy the ASID.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-21-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: Use more verbose names for mem encrypt kvm_x86_ops hooks
Use slightly more verbose names for the so called "memory encrypt",
a.k.a. "mem enc", kvm_x86_ops hooks to bridge the gap between the current
super short kvm_x86_ops names and SVM's more verbose, but non-conforming
names. This is a step toward using kvm-x86-ops.h with KVM_X86_CVM_OP()
to fill svm_x86_ops.
Opportunistically rename mem_enc_op() to mem_enc_ioctl() to better
reflect its true nature, as it really is a full fledged ioctl() of its
own. Ideally, the hook would be named confidential_vm_ioctl() or so, as
the ioctl() is a gateway to more than just memory encryption, and because
its underlying purpose to support Confidential VMs, which can be provided
without memory encryption, e.g. if the TCB of the guest includes the host
kernel but not host userspace, or by isolation in hardware without
encrypting memory. But, diverging from KVM_MEMORY_ENCRYPT_OP even
further is undeseriable, and short of creating alises for all related
ioctl()s, which introduces a different flavor of divergence, KVM is stuck
with the nomenclature.
Defer renaming SVM's functions to a future commit as there are additional
changes needed to make SVM fully conforming and to match reality (looking
at you, svm_vm_copy_asid_from()).
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-20-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove SVM's MAX_INST_SIZE, which has long since been obsoleted by the
common MAX_INSN_SIZE. Note, the latter's "insn" is also the generally
preferred abbreviation of instruction.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-18-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: SVM: Rename svm_flush_tlb() to svm_flush_tlb_current()
Rename svm_flush_tlb() to svm_flush_tlb_current() so that at least one of
the flushing operations in svm_x86_ops can be filled via kvm-x86-ops.h,
and to document the scope of the flush (specifically that it doesn't
flush "all").
Opportunistically make svm_tlb_flush_current(), was svm_flush_tlb(),
static.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-17-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move kvm_get_cs_db_l_bits() to SVM and rename it appropriately so that
its svm_x86_ops entry can be filled via kvm-x86-ops, and to eliminate a
superfluous export from KVM x86.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-16-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: VMX: Rename VMX functions to conform to kvm_x86_ops names
Massage VMX's implementation names for kvm_x86_ops to maximize use of
kvm-x86-ops.h. Leave cpu_has_vmx_wbinvd_exit() as-is to preserve the
cpu_has_vmx_*() pattern used for querying VMCS capabilities. Keep
pi_has_pending_interrupt() as vmx_dy_apicv_has_pending_interrupt() does
a poor job of describing exactly what is being checked in VMX land.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-14-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: Use static_call() for copy/move encryption context ioctls()
Define and use static_call()s for .vm_{copy,move}_enc_context_from(),
mostly so that the op is defined in kvm-x86-ops.h. This will allow using
KVM_X86_OP in vendor code to wire up the implementation. Any performance
gains eeked out by using static_call() is a happy bonus and not the
primary motiviation.
Opportunistically refactor the code to reduce indentation and keep line
lengths reasonable, and to be consistent when wrapping versus running
a bit over the 80 char soft limit.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-12-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop the export of kvm_x86_ops now it is no longer referenced by SVM or
VMX. Disallowing access to kvm_x86_ops is very desirable as it prevents
vendor code from incorrectly modifying hooks after they have been set by
kvm_arch_hardware_setup(), and more importantly after each function's
associated static_call key has been updated.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Uninline and export Hyper-V's hv_track_root_tdp(), which is (somewhat
indirectly) the last remaining reference to kvm_x86_ops from vendor
modules, i.e. will allow unexporting kvm_x86_ops. Reloading the TDP PGD
isn't the fastest of paths, hv_track_root_tdp() isn't exactly tiny, and
disallowing vendor code from accessing kvm_x86_ops provides nice-to-have
encapsulation of common x86 code (and of Hyper-V code for that matter).
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: nVMX: Refactor PMU refresh to avoid referencing kvm_x86_ops.pmu_ops
Refactor the nested VMX PMU refresh helper to pass it a flag stating
whether or not the vCPU has PERF_GLOBAL_CTRL instead of having the nVMX
helper query the information by bouncing through kvm_x86_ops.pmu_ops.
This will allow a future patch to use static_call() for the PMU ops
without having to export any static call definitions from common x86, and
it is also a step toward unexported kvm_x86_ops.
Alternatively, nVMX could call kvm_pmu_is_valid_msr() to indirectly use
kvm_x86_ops.pmu_ops, but that would incur an extra layer of indirection
and would require exporting kvm_pmu_is_valid_msr().
Opportunistically rename the helper to keep line lengths somewhat
reasonable, and to better capture its high-level role.
No functional change intended.
Cc: Like Xu <like.xu.linux@gmail.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: xen: Use static_call() for invoking kvm_x86_ops hooks
Use static_call() for invoking kvm_x86_ops function that already have a
defined static call, mostly as a step toward having _all_ calls to
kvm_x86_ops route through a static_call() in order to simplify auditing,
e.g. via grep, that all functions have an entry in kvm-x86-ops.h, but
also because there's no reason not to use a static_call().
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: Use static_call() for .vcpu_deliver_sipi_vector()
Define and use a static_call() for kvm_x86_ops.vcpu_deliver_sipi_vector(),
mostly so that the op is defined in kvm-x86-ops.h. This will allow using
KVM_X86_OP in vendor code to wire up the implementation. Any performance
gains eeked out by using static_call() is a happy bonus and not the
primary motiviation.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: VMX: Call vmx_get_cpl() directly in handle_dr()
Use vmx_get_cpl() instead of bouncing through kvm_x86_ops.get_cpl() when
performing a CPL check on MOV DR accesses. This avoids a RETPOLINE (when
enabled), and more importantly removes a vendor reference to kvm_x86_ops
and helps pave the way for unexporting kvm_x86_ops.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename a variety of kvm_x86_op function pointers so that preferred name
for vendor implementations follows the pattern <vendor>_<function>, e.g.
rename .run() to .vcpu_run() to match {svm,vmx}_vcpu_run(). This will
allow vendor implementations to be wired up via the KVM_X86_OP macro.
In many cases, VMX and SVM "disagree" on the preferred name, though in
reality it's VMX and x86 that disagree as SVM blindly prepended _svm to
the kvm_x86_ops name. Justification for using the VMX nomenclature:
- set_{irq,nmi} => inject_{irq,nmi} because the helper is injecting an
event that has already been "set" in e.g. the vIRR. SVM's relevant
VMCB field is even named event_inj, and KVM's stat is irq_injections.
- prepare_guest_switch => prepare_switch_to_guest because the former is
ambiguous, e.g. it could mean switching between multiple guests,
switching from the guest to host, etc...
- update_pi_irte => pi_update_irte to allow for matching match the rest
of VMX's posted interrupt naming scheme, which is vmx_pi_<blah>().
- start_assignment => pi_start_assignment to again follow VMX's posted
interrupt naming scheme, and to provide context for what bit of code
might care about an otherwise undescribed "assignment".
The "tlb_flush" => "flush_tlb" creates an inconsistency with respect to
Hyper-V's "tlb_remote_flush" hooks, but Hyper-V really is the one that's
wrong. x86, VMX, and SVM all use flush_tlb, and even common KVM is on a
variant of the bandwagon with "kvm_flush_remote_tlbs", e.g. a more
appropriate name for the Hyper-V hooks would be flush_remote_tlbs. Leave
that change for another time as the Hyper-V hooks always start as NULL,
i.e. the name doesn't matter for using kvm-x86-ops.h, and changing all
names requires an astounding amount of churn.
VMX and SVM function names are intentionally left as is to minimize the
diff. Both VMX and SVM will need to rename even more functions in order
to fully utilize KVM_X86_OPS, i.e. an additional patch for each is
inevitable.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: Drop export for .tlb_flush_current() static_call key
Remove the export of kvm_x86_tlb_flush_current() as there are no longer
any users outside of common x86 code.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 28 Oct 2021 17:15:55 +0000 (13:15 -0400)]
KVM: x86: skip host CPUID call for hypervisor leaves
Hypervisor leaves are always synthesized by __do_cpuid_func; just return
zeroes and do not ask the host. Even on nested virtualization, a value
from another hypervisor would be bogus, because all hypercalls and MSRs
are processed by KVM.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Tue, 25 Jan 2022 16:11:30 +0000 (11:11 -0500)]
KVM: SVM: improve split between svm_prepare_guest_switch and sev_es_prepare_guest_switch
KVM performs the VMSAVE to the host save area for both regular and SEV-ES
guests, so hoist it up to svm_prepare_guest_switch. And because
sev_es_prepare_guest_switch does not really need to know the details
of struct svm_cpu_data *, just pass it the pointer to the host save area
inside the HSAVE page.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: Skip APICv update if APICv is disable at the module level
Bail from the APICv update paths _before_ taking apicv_update_lock if
APICv is disabled at the module level. kvm_request_apicv_update() in
particular is invoked from multiple paths that can be reached without
APICv being enabled, e.g. svm_enable_irq_window(), and taking the
rw_sem for write when APICv is disabled may introduce unnecessary
contention and stalls.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211208015236.1616697-25-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: Drop NULL check on kvm_x86_ops.check_apicv_inhibit_reasons
Drop the useless NULL check on kvm_x86_ops.check_apicv_inhibit_reasons
when handling an APICv update, both VMX and SVM unconditionally implement
the helper and leave it non-NULL even if APICv is disabled at the module
level. The latter is a moot point now that __kvm_request_apicv_update()
is called if and only if enable_apicv is true.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211208015236.1616697-26-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Unexport __kvm_request_apicv_update(), it's not used by vendor code and
should never be used by vendor code. The only reason it's exposed at all
is because Hyper-V's SynIC needs to track how many auto-EOIs are in use,
and it's convenient to use apicv_update_lock to guard that tracking.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211208015236.1616697-27-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86/mmu: Zap _all_ roots when unmapping gfn range in TDP MMU
Zap both valid and invalid roots when zapping/unmapping a gfn range, as
KVM must ensure it holds no references to the freed page after returning
from the unmap operation. Most notably, the TDP MMU doesn't zap invalid
roots in mmu_notifier callbacks. This leads to use-after-free and other
issues if the mmu_notifier runs to completion while an invalid root
zapper yields as KVM fails to honor the requirement that there must be
_no_ references to the page after the mmu_notifier returns.
The bug is most easily reproduced by hacking KVM to cause a collision
between set_nx_huge_pages() and kvm_mmu_notifier_release(), but the bug
exists between kvm_mmu_notifier_invalidate_range_start() and memslot
updates as well. Invalidating a root ensures pages aren't accessible by
the guest, and KVM won't read or write page data itself, but KVM will
trigger e.g. kvm_set_pfn_dirty() when zapping SPTEs, and thus completing
a zap of an invalid root _after_ the mmu_notifier returns is fatal.
Fixes: b7cccd397f31 ("KVM: x86/mmu: Fast invalidation for TDP MMU") Cc: stable@vger.kernel.org Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211215011557.399940-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86/mmu: Move "invalid" check out of kvm_tdp_mmu_get_root()
Move the check for an invalid root out of kvm_tdp_mmu_get_root() and into
the one place it actually matters, tdp_mmu_next_root(), as the other user
already has an implicit validity check. A future bug fix will need to
get references to invalid roots to honor mmu_notifier requests; there's
no point in forcing what will be a common path to open code getting a
reference to a root.
No functional change intended.
Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211215011557.399940-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86/mmu: Use common TDP MMU zap helper for MMU notifier unmap hook
Use the common TDP MMU zap helper when handling an MMU notifier unmap
event, the two flows are semantically identical. Consolidate the code in
preparation for a future bug fix, as both kvm_tdp_mmu_unmap_gfn_range()
and __kvm_tdp_mmu_zap_gfn_range() are guilty of not zapping SPTEs in
invalid roots.
No functional change intended.
Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211215011557.399940-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
David Woodhouse [Mon, 25 Oct 2021 13:29:01 +0000 (14:29 +0100)]
KVM: x86/xen: Fix runstate updates to be atomic when preempting vCPU
There are circumstances whem kvm_xen_update_runstate_guest() should not
sleep because it ends up being called from __schedule() when the vCPU
is preempted:
To handle this, make it use the same trick as __kvm_xen_has_interrupt(),
of using the hva from the gfn_to_hva_cache directly. Then it can use
pagefault_disable() around the accesses and just bail out if the page
is absent (which is unlikely).
I almost switched to using a gfn_to_pfn_cache here and bailing out if
kvm_map_gfn() fails, like kvm_steal_time_set_preempted() does — but on
closer inspection it looks like kvm_map_gfn() will *always* fail in
atomic context for a page in IOMEM, which means it will silently fail
to make the update every single time for such guests, AFAICT. So I
didn't do it that way after all. And will probably fix that one too.
Cc: stable@vger.kernel.org Fixes: 30b5c851af79 ("KVM: x86/xen: Add support for vCPU runstate information") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <b17a93e5ff4561e57b1238e3e7ccd0b613eb827e.camel@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Maxim Levitsky [Mon, 7 Feb 2022 15:54:25 +0000 (17:54 +0200)]
KVM: x86: lapic: don't touch irr_pending in kvm_apic_update_apicv when inhibiting it
kvm_apic_update_apicv is called when AVIC is still active, thus IRR bits
can be set by the CPU after it is called, and don't cause the irr_pending
to be set to true.
Also logic in avic_kick_target_vcpu doesn't expect a race with this
function so to make it simple, just keep irr_pending set to true and
let the next interrupt injection to the guest clear it.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220207155447.840194-9-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Maxim Levitsky [Mon, 7 Feb 2022 15:54:24 +0000 (17:54 +0200)]
KVM: x86: nSVM: deal with L1 hypervisor that intercepts interrupts but lets L2 control them
Fix a corner case in which the L1 hypervisor intercepts
interrupts (INTERCEPT_INTR) and either doesn't set
virtual interrupt masking (V_INTR_MASKING) or enters a
nested guest with EFLAGS.IF disabled prior to the entry.
In this case, despite the fact that L1 intercepts the interrupts,
KVM still needs to set up an interrupt window to wait before
injecting the INTR vmexit.
Currently the KVM instead enters an endless loop of 'req_immediate_exit'.
Exactly the same issue also happens for SMIs and NMI.
Fix this as well.
Note that on VMX, this case is impossible as there is only
'vmexit on external interrupts' execution control which either set,
in which case both host and guest's EFLAGS.IF
are ignored, or not set, in which case no VMexits are delivered.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220207155447.840194-8-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Maxim Levitsky [Mon, 7 Feb 2022 15:54:22 +0000 (17:54 +0200)]
KVM: x86: nSVM: expose clean bit support to the guest
KVM already honours few clean bits thus it makes sense
to let the nested guest know about it.
Note that KVM also doesn't check if the hardware supports
clean bits, and therefore nested KVM was
already setting clean bits and L0 KVM
was already honouring them.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220207155447.840194-6-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Maxim Levitsky [Mon, 7 Feb 2022 15:54:21 +0000 (17:54 +0200)]
KVM: x86: nSVM/nVMX: set nested_run_pending on VM entry which is a result of RSM
While RSM induced VM entries are not full VM entries,
they still need to be followed by actual VM entry to complete it,
unlike setting the nested state.
This patch fixes boot of hyperv and SMM enabled
windows VM running nested on KVM, which fail due
to this issue combined with lack of dirty bit setting.
Maxim Levitsky [Mon, 7 Feb 2022 15:54:20 +0000 (17:54 +0200)]
KVM: x86: nSVM: mark vmcb01 as dirty when restoring SMM saved state
While usually, restoring the smm state makes the KVM enter
the nested guest thus a different vmcb (vmcb02 vs vmcb01),
KVM should still mark it as dirty, since hardware
can in theory cache multiple vmcbs.
Failure to do so, combined with lack of setting the
nested_run_pending (which is fixed in the next patch),
might make KVM re-enter vmcb01, which was just exited from,
with completely different set of guest state registers
(SMM vs non SMM) and without proper dirty bits set,
which results in the CPU reusing stale IDTR pointer
which leads to a guest shutdown on any interrupt.
On the real hardware this usually doesn't happen,
but when running nested, L0's KVM does check and
honour few dirty bits, causing this issue to happen.
This patch fixes boot of hyperv and SMM enabled
windows VM running nested on KVM.
Maxim Levitsky [Mon, 7 Feb 2022 15:54:19 +0000 (17:54 +0200)]
KVM: x86: nSVM: fix potential NULL derefernce on nested migration
Turns out that due to review feedback and/or rebases
I accidentally moved the call to nested_svm_load_cr3 to be too early,
before the NPT is enabled, which is very wrong to do.
KVM can't even access guest memory at that point as nested NPT
is needed for that, and of course it won't initialize the walk_mmu,
which is main issue the patch was addressing.
Fix this for real.
Fixes: 232f75d3b4b5 ("KVM: nSVM: call nested_svm_load_cr3 on nested state load") Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220207155447.840194-3-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Maxim Levitsky [Mon, 7 Feb 2022 15:54:18 +0000 (17:54 +0200)]
KVM: x86: SVM: don't passthrough SMAP/SMEP/PKE bits in !NPT && !gCR0.PG case
When the guest doesn't enable paging, and NPT/EPT is disabled, we
use guest't paging CR3's as KVM's shadow paging pointer and
we are technically in direct mode as if we were to use NPT/EPT.
In direct mode we create SPTEs with user mode permissions
because usually in the direct mode the NPT/EPT doesn't
need to restrict access based on guest CPL
(there are MBE/GMET extenstions for that but KVM doesn't use them).
In this special "use guest paging as direct" mode however,
and if CR4.SMAP/CR4.SMEP are enabled, that will make the CPU
fault on each access and KVM will enter endless loop of page faults.
Since page protection doesn't have any meaning in !PG case,
just don't passthrough these bits.
The fix is the same as was done for VMX in commit:
commit 656ec4a4928a ("KVM: VMX: fix SMEP and SMAP without EPT")
This fixes the boot of windows 10 without NPT for good.
(Without this patch, BSP boots, but APs were stuck in endless
loop of page faults, causing the VM boot with 1 CPU)
Revert "svm: Add warning message for AVIC IPI invalid target"
Remove a WARN on an "AVIC IPI invalid target" exit, the WARN is trivial
to trigger from guest as it will fail on any destination APIC ID that
doesn't exist from the guest's perspective.
Don't bother recording anything in the kernel log, the common tracepoint
for kvm_avic_incomplete_ipi() is sufficient for debugging.
Linus Torvalds [Sun, 6 Feb 2022 18:34:45 +0000 (10:34 -0800)]
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Various bug fixes for ext4 fast commit and inline data handling.
Also fix regression introduced as part of moving to the new mount API"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
fs/ext4: fix comments mentioning i_mutex
ext4: fix incorrect type issue during replay_del_range
jbd2: fix kernel-doc descriptions for jbd2_journal_shrink_{scan,count}()
ext4: fix potential NULL pointer dereference in ext4_fill_super()
jbd2: refactor wait logic for transaction updates into a common function
jbd2: cleanup unused functions declarations from jbd2.h
ext4: fix error handling in ext4_fc_record_modified_inode()
ext4: remove redundant max inline_size check in ext4_da_write_inline_data_begin()
ext4: fix error handling in ext4_restore_inline_data()
ext4: fast commit may miss file actions
ext4: fast commit may not fallback for ineligible commit
ext4: modify the logic of ext4_mb_new_blocks_simple
ext4: prevent used blocks from being allocated during fast commit replay
Linus Torvalds [Sun, 6 Feb 2022 18:18:23 +0000 (10:18 -0800)]
Merge tag 'perf-tools-fixes-for-v5.17-2022-02-06' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
Pull perf tools fixes from Arnaldo Carvalho de Melo:
- Fix display of grouped aliased events in 'perf stat'.
- Add missing branch_sample_type to perf_event_attr__fprintf().
- Apply correct label to user/kernel symbols in branch mode.
- Fix 'perf ftrace' system_wide tracing, it has to be set before
creating the maps.
- Return error if procfs isn't mounted for PID namespaces when
synthesizing records for pre-existing processes.
- Set error stream of objdump process for 'perf annotate' TUI, to avoid
garbling the screen.
- Add missing arm64 support to perf_mmap__read_self(), the kernel part
got into 5.17.
- Check for NULL pointer before dereference writing debug info about a
sample.
- Update UAPI copies for asound, perf_event, prctl and kvm headers.
- Fix a typo in bpf_counter_cgroup.c.
* tag 'perf-tools-fixes-for-v5.17-2022-02-06' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
perf ftrace: system_wide collection is not effective by default
libperf: Add arm64 support to perf_mmap__read_self()
tools include UAPI: Sync sound/asound.h copy with the kernel sources
perf stat: Fix display of grouped aliased events
perf tools: Apply correct label to user/kernel symbols in branch mode
perf bpf: Fix a typo in bpf_counter_cgroup.c
perf synthetic-events: Return error if procfs isn't mounted for PID namespaces
perf session: Check for NULL pointer before dereference
perf annotate: Set error stream of objdump process for TUI
perf tools: Add missing branch_sample_type to perf_event_attr__fprintf()
tools headers UAPI: Sync linux/kvm.h with the kernel sources
tools headers UAPI: Sync linux/prctl.h with the kernel sources
perf beauty: Make the prctl arg regexp more strict to cope with PR_SET_VMA
tools headers cpufeatures: Sync with the kernel sources
tools headers UAPI: Sync linux/perf_event.h with the kernel sources
tools include UAPI: Sync sound/asound.h copy with the kernel sources
Linus Torvalds [Sun, 6 Feb 2022 18:11:14 +0000 (10:11 -0800)]
Merge tag 'perf_urgent_for_v5.17_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:
- Intel/PT: filters could crash the kernel
- Intel: default disable the PMU for SMM, some new-ish EFI firmware has
started using CPL3 and the PMU CPL filters don't discriminate against
SMM, meaning that CPL3 (userspace only) events now also count EFI/SMM
cycles.
- Fixup for perf_event_attr::sig_data
* tag 'perf_urgent_for_v5.17_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/pt: Fix crash with stop filters in single-range mode
perf: uapi: Document perf_event_attr::sig_data truncation on 32 bit architectures
selftests/perf_events: Test modification of perf_event_attr::sig_data
perf: Copy perf_event_attr::sig_data on modification
x86/perf: Default set FREEZE_ON_SMI for all
Linus Torvalds [Sun, 6 Feb 2022 17:57:39 +0000 (09:57 -0800)]
Merge tag 'edac_urgent_for_v5.17_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC fixes from Borislav Petkov:
"Fix altera and xgene EDAC drivers to propagate the correct error code
from platform_get_irq() so that deferred probing still works"
* tag 'edac_urgent_for_v5.17_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/xgene: Fix deferred probing
EDAC/altera: Fix deferred probing
Rob Herring [Tue, 1 Feb 2022 21:40:56 +0000 (15:40 -0600)]
libperf: Add arm64 support to perf_mmap__read_self()
Add the arm64 variants for read_perf_counter() and read_timestamp().
Unfortunately the counter number is encoded into the instruction, so the
code is a bit verbose to enumerate all possible counters.
Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Signed-off-by: Rob Herring <robh@kernel.org> Acked-by: Jiri Olsa <jolsa@redhat.com> Tested-by: John Garry <john.garry@huawei.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lore.kernel.org/r/20220201214056.702854-1-robh@kernel.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Will Deacon <will@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: linux-kernel@vger.kernel.org Cc: linux-perf-users@vger.kernel.org
Which entails no changes in the tooling side as it doesn't introduce new
SNDRV_PCM_IOCTL_ ioctls.
To silence this perf tools build warning:
Warning: Kernel ABI header at 'tools/include/uapi/sound/asound.h' differs from latest version at 'include/uapi/sound/asound.h'
diff -u tools/include/uapi/sound/asound.h include/uapi/sound/asound.h
Cc: Dmitry Osipenko <digetx@gmail.com> Cc: Mark Brown <broonie@kernel.org> Cc: Takashi Iwai <tiwai@suse.de> Link: https://lore.kernel.org/lkml/Yf+6OT+2eMrYDEeX@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Ian Rogers [Sat, 5 Feb 2022 01:09:41 +0000 (17:09 -0800)]
perf stat: Fix display of grouped aliased events
An event may have a number of uncore aliases that when added to the
evlist are consecutive.
If there are multiple uncore events in a group then
parse_events__set_leader_for_uncore_aliase will reorder the evlist so
that events on the same PMU are adjacent.
The collect_all_aliases function assumes that aliases are in blocks so
that only the first counter is printed and all others are marked merged.
The reordering for groups breaks the assumption and so all counts are
printed.
This change removes the assumption from collect_all_aliases
that the events are in blocks and instead processes the entire evlist.
Before:
```
$ perf stat -e '{UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_REMOTE,UNC_CHA_TOR_INSERTS.IA_MISS_DRD_REMOTE},duration_time' -a -A -- sleep 1
Reviewed-by: James Clark <james.clark@arm.com> Signed-off-by: German Gomez <german.gomez@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20220126105927.3411216-1-german.gomez@arm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Leo Yan [Fri, 24 Dec 2021 12:40:13 +0000 (20:40 +0800)]
perf synthetic-events: Return error if procfs isn't mounted for PID namespaces
For perf recording, it retrieves process info by iterating nodes in proc
fs. If we run perf in a non-root PID namespace with command:
# unshare --fork --pid perf record -e cycles -a -- test_program
... in this case, unshare command creates a child PID namespace and
launches perf tool in it, but the issue is the proc fs is not mounted
for the non-root PID namespace, this leads to the perf tool gathering
process info from its parent PID namespace.
We can use below command to observe the process nodes under proc fs:
So it shows many existed tasks, since unshared command has not mounted
the proc fs for the new created PID namespace, it still accesses the
proc fs of the root PID namespace. This leads to two prominent issues:
- Firstly, PID values are mismatched between thread info and samples.
The gathered thread info are coming from the proc fs of the root PID
namespace, but samples record its PID from the child PID namespace.
- The second issue is profiled program 'test_program' returns its forked
PID number from the child PID namespace, perf tool wrongly uses this
PID number to retrieve the process info via the proc fs of the root
PID namespace.
To avoid issues, we need to mount proc fs for the child PID namespace
with the option '--mount-proc' when use unshare command:
# unshare --fork --pid --mount-proc perf record -e cycles -a -- test_program
Conversely, when the proc fs of the root PID namespace is used by child
namespace, perf tool can detect the multiple PID levels and
nsinfo__is_in_root_namespace() returns false, this patch reports error
for this case:
# unshare --fork --pid perf record -e cycles -a -- test_program
Couldn't synthesize bpf events.
Perf runs in non-root PID namespace but it tries to gather process info from its parent PID namespace.
Please mount the proc file system properly, e.g. add the option '--mount-proc' for unshare command.
Reviewed-by: James Clark <james.clark@arm.com> Signed-off-by: Leo Yan <leo.yan@linaro.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: KP Singh <kpsingh@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Martin KaFai Lau <kafai@fb.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <songliubraving@fb.com> Cc: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20211224124014.2492751-1-leo.yan@linaro.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Ameer Hamza [Tue, 25 Jan 2022 12:11:41 +0000 (17:11 +0500)]
perf session: Check for NULL pointer before dereference
Move NULL pointer check before dereferencing the variable.
Addresses-Coverity: 1497622 ("Derereference before null check") Reviewed-by: James Clark <james.clark@arm.com> Signed-off-by: Ameer Hamza <amhamza.mgc@gmail.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexey Bayduraev <alexey.v.bayduraev@linux.intel.com> Cc: German Gomez <german.gomez@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Leo Yan <leo.yan@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Riccardo Mancini <rickyman7@gmail.com> Link: https://lore.kernel.org/r/20220125121141.18347-1-amhamza.mgc@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
perf tools: Add missing branch_sample_type to perf_event_attr__fprintf()
This updates branch sample type with missing PERF_SAMPLE_BRANCH_TYPE_SAVE.
Suggested-by: James Clark <james.clark@arm.com> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Jiri Olsa <jolsa@redhat.com> Cc: James Clark <james.clark@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-arm-kernel@lists.infradead.org Link: http://lore.kernel.org/lkml/1643799443-15109-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
tools headers UAPI: Sync linux/kvm.h with the kernel sources
To pick the changes in:
f6c6804c43fa18d3 ("kvm: Move KVM_GET_XSAVE2 IOCTL definition at the end of kvm.h")
That just rebuilds perf, as these patches don't add any new KVM ioctl to
be harvested for the the 'perf trace' ioctl syscall argument
beautifiers.
This is also by now used by tools/testing/selftests/kvm/, a simple test
build succeeded.
This silences this perf build warning:
Warning: Kernel ABI header at 'tools/include/uapi/linux/kvm.h' differs from latest version at 'include/uapi/linux/kvm.h'
diff -u tools/include/uapi/linux/kvm.h include/uapi/linux/kvm.h
Linus Torvalds [Sat, 5 Feb 2022 17:55:59 +0000 (09:55 -0800)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"ARM:
- A couple of fixes when handling an exception while a SError has
been delivered
- Workaround for Cortex-A510's single-step erratum
RISC-V:
- Make CY, TM, and IR counters accessible in VU mode
- Fix SBI implementation version
x86:
- Report deprecation of x87 features in supported CPUID
- Preparation for fixing an interrupt delivery race on AMD hardware
- Sparse fix
All except POWER and s390:
- Rework guest entry code to correctly mark noinstr areas and fix
vtime' accounting (for x86, this was already mostly correct but not
entirely; for ARM, MIPS and RISC-V it wasn't)"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: Use ERR_PTR_USR() to return -EFAULT as a __user pointer
KVM: x86: Report deprecated x87 features in supported CPUID
KVM: arm64: Workaround Cortex-A510's single-step and PAC trap errata
KVM: arm64: Stop handle_exit() from handling HVC twice when an SError occurs
KVM: arm64: Avoid consuming a stale esr value when SError occur
RISC-V: KVM: Fix SBI implementation version
RISC-V: KVM: make CY, TM, and IR counters accessible in VU mode
kvm/riscv: rework guest entry logic
kvm/arm64: rework guest entry logic
kvm/x86: rework guest entry logic
kvm/mips: rework guest entry logic
kvm: add guest_state_{enter,exit}_irqoff()
KVM: x86: Move delivery of non-APICv interrupt into vendor code
kvm: Move KVM_GET_XSAVE2 IOCTL definition at the end of kvm.h
Linus Torvalds [Sat, 5 Feb 2022 17:21:55 +0000 (09:21 -0800)]
Merge tag 'xfs-5.17-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"I was auditing operations in XFS that clear file privileges, and
realized that XFS' fallocate implementation drops suid/sgid but
doesn't clear file capabilities the same way that file writes and
reflink do.
There are VFS helpers that do it correctly, so refactor XFS to use
them. I also noticed that we weren't flushing the log at the correct
point in the fallocate operation, so that's fixed too.
Summary:
- Fix fallocate so that it drops all file privileges when files are
modified instead of open-coding that incompletely.
- Fix fallocate to flush the log if the caller wanted synchronous
file updates"
* tag 'xfs-5.17-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: ensure log flush at the end of a synchronous fallocate call
xfs: move xfs_update_prealloc_flags() to xfs_pnfs.c
xfs: set prealloc flag in xfs_alloc_file_space()
xfs: fallocate() should call file_modified()
xfs: remove XFS_PREALLOC_SYNC
xfs: reject crazy array sizes being fed to XFS_IOC_GETBMAP*
Linus Torvalds [Sat, 5 Feb 2022 17:13:51 +0000 (09:13 -0800)]
Merge tag 'vfs-5.17-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull vfs fixes from Darrick Wong:
"I was auditing the sync_fs code paths recently and noticed that most
callers of ->sync_fs ignore its return value (and many implementations
never return nonzero even if the fs is broken!), which means that
internal fs errors and corruption are not passed up to userspace
callers of syncfs(2) or FIFREEZE. Hence fixing the common code and
XFS, and I'll start working on the ext4/btrfs folks if this is merged.
Summary:
- Fix a bug where callers of ->sync_fs (e.g. sync_filesystem and
syncfs(2)) ignore the return value.
- Fix a bug where callers of sync_filesystem (e.g. fs freeze) ignore
the return value.
- Fix a bug in XFS where xfs_fs_sync_fs never passed back error
returns"
* tag 'vfs-5.17-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: return errors in xfs_fs_sync_fs
quota: make dquot_quota_sync return errors from ->sync_fs
vfs: make sync_filesystem return errors from ->sync_fs
vfs: make freeze_super abort when sync_filesystem returns error
Linus Torvalds [Sat, 5 Feb 2022 17:04:43 +0000 (09:04 -0800)]
Merge tag 'iomap-5.17-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull iomap fix from Darrick Wong:
"A single bugfix for iomap.
The fix should eliminate occasional complaints about stall warnings
when a lot of writeback IO completes all at once and we have to then
go clearing status on a large number of folios.
Summary:
- Limit the length of ioend chains in writeback so that we don't trip
the softlockup watchdog and to limit long tail latency on clearing
PageWriteback"
* tag 'iomap-5.17-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs, iomap: limit individual ioend chain lengths in writeback
Linus Torvalds [Fri, 4 Feb 2022 23:27:45 +0000 (15:27 -0800)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Seven fixes, six of which are fairly obvious driver fixes.
The one core change to the device budget depth is to try to ensure
that if the default depth is large (which can produce quite a sizeable
bitmap allocation per device), we give back the memory we don't need
if there's a queue size reduction in slave_configure (which happens to
a lot of devices)"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: hisi_sas: Fix setting of hisi_sas_slot.is_internal
scsi: pm8001: Fix use-after-free for aborted SSP/STP sas_task
scsi: pm8001: Fix use-after-free for aborted TMF sas_task
scsi: pm8001: Fix warning for undescribed param in process_one_iomb()
scsi: core: Reallocate device's budget map on queue depth change
scsi: bnx2fc: Make bnx2fc_recv_frame() mp safe
scsi: pm80xx: Fix double completion for SATA devices
Linus Torvalds [Fri, 4 Feb 2022 23:22:35 +0000 (15:22 -0800)]
Merge tag 'pci-v5.17-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull pci fixes from Bjorn Helgaas:
- Restructure j721e_pcie_probe() so we don't dereference a NULL pointer
(Bjorn Helgaas)
- Add a kirin_pcie_data struct to identify different Kirin variants to
fix probe failure for controllers with an internal PHY (Bjorn
Helgaas)
* tag 'pci-v5.17-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
PCI: kirin: Add dev struct for of_device_get_match_data()
PCI: j721e: Initialize pcie->cdns_pcie before using it
Bjorn Helgaas [Wed, 2 Feb 2022 15:52:41 +0000 (09:52 -0600)]
PCI: kirin: Add dev struct for of_device_get_match_data()
Bean reported that a622435fbe1a ("PCI: kirin: Prefer
of_device_get_match_data()") broke kirin_pcie_probe() because it assumed
match data of 0 was a failure when in fact, it meant the match data was
"(void *)PCIE_KIRIN_INTERNAL_PHY".
Therefore, probing of "hisilicon,kirin960-pcie" devices failed with -EINVAL
and an "OF data missing" message.
Add a struct kirin_pcie_data to encode the PHY type. Then the result of
of_device_get_match_data() should always be a non-NULL pointer to a struct
kirin_pcie_data that contains the PHY type.
Linus Torvalds [Fri, 4 Feb 2022 20:14:58 +0000 (12:14 -0800)]
Merge tag 'for-5.17-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few fixes and error handling improvements:
- fix deadlock between quota disable and qgroup rescan worker
- fix use-after-free after failure to create a snapshot
- skip warning on unmount after log cleanup failure
- don't start transaction for scrub if the fs is mounted read-only
- tree checker verifies item sizes"
* tag 'for-5.17-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: skip reserved bytes warning on unmount after log cleanup failure
btrfs: fix use of uninitialized variable at rm device ioctl
btrfs: fix use-after-free after failure to create a snapshot
btrfs: tree-checker: check item_size for dev_item
btrfs: tree-checker: check item_size for inode_item
btrfs: fix deadlock between quota disable and qgroup rescan worker
btrfs: don't start transaction for scrub if the fs is mounted read-only
Linus Torvalds [Fri, 4 Feb 2022 20:08:49 +0000 (12:08 -0800)]
Merge tag 'erofs-for-5.17-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
"Two fixes related to fsdax cleanup in this cycle and ztailpacking to
fix small compressed data inlining. There is also a trivial cleanup to
rearrange code for better reading.
Summary:
- fix fsdax partition offset misbehavior
- clean up z_erofs_decompressqueue_work() declaration
- fix up EOF lcluster inlining, especially for small compressed data"
* tag 'erofs-for-5.17-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: fix small compressed files inlining
erofs: avoid unnecessary z_erofs_decompressqueue_work() declaration
erofs: fix fsdax partition offset handling
Linus Torvalds [Fri, 4 Feb 2022 20:01:57 +0000 (12:01 -0800)]
Merge tag 'block-5.17-2022-02-04' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- NVMe pull request
- fix use-after-free in rdma and tcp controller reset (Sagi Grimberg)
- fix the state check in nvmf_ctlr_matches_baseopts (Uday Shankar)
- MD nowait null pointer fix (Song)
- blk-integrity seed advance fix (Martin)
- Fix a dio regression in this merge window (Ilya)
* tag 'block-5.17-2022-02-04' of git://git.kernel.dk/linux-block:
block: bio-integrity: Advance seed correctly for larger interval sizes
nvme-fabrics: fix state check in nvmf_ctlr_matches_baseopts()
md: fix NULL pointer deref with nowait but no mddev->queue
block: fix DIO handling regressions in blkdev_read_iter()
nvme-rdma: fix possible use-after-free in transport error_recovery work
nvme-tcp: fix possible use-after-free in transport error_recovery work
nvme: fix a possible use-after-free in controller reset during load