Some operating systems store data about the host processor at the
time of installation and, when booted on a more up-to-date CPU, try
to read MSR_EBC_FREQUENCY_ID. This has been found with XP.
KVM: SVM: Restore correct registers after sel_cr0 intercept emulation
This patch implements restoring of the correct rip, rsp, and
rax after the svm emulation in KVM injected a selective_cr0
write intercept into the guest hypervisor. The problem was
that the vmexit is emulated in the instruction emulation
which later commits the registers right after the write-cr0
instruction. So the l1 guest will continue to run with the
l2 rip, rsp and rax resulting in unpredictable behavior.
This patch is not the final word, it is just an easy patch
to fix the issue. The real fix will be done when the
instruction emulator is made aware of nested virtualization.
Until this is done this patch fixes the issue and provides
an easy way to fix this in -stable too.
This patch fixes 32 bit legacy paging with NPT enabled. The
mmu_check_root call on the top-level of the loop causes
root_gfn to take values (in the tdp_enabled path) which are
outside of guest memory. So the mmu_check_root call fails at
some point in the loop iteration, causing the guest to
triple-fault.
This patch changes the mmu_check_root calls to the places
where they are really necessary. As a side-effect it
introduces a check for the root of a pae page table too.
Alexander Graf [Tue, 31 Aug 2010 01:45:39 +0000 (03:45 +0200)]
KVM: PPC: Fix compile error in e500_tlb.c
The e500_tlb.c file didn't compile for me due to the following error:
arch/powerpc/kvm/e500_tlb.c: In function ‘kvmppc_e500_shadow_map’:
arch/powerpc/kvm/e500_tlb.c:300: error: format ‘%lx’ expects type ‘long unsigned int’, but argument 2 has type ‘gfn_t’
So let's explicitly cast the argument to make printk happy.
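A minimal user-space sketch of this kind of fix, using snprintf in place of printk. The 64-bit gfn_t typedef and the helper name format_gfn are assumptions for illustration, not the code from the patch.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumption: gfn_t is a 64-bit unsigned type, so it does not match
 * 'unsigned long' on a 32-bit target. */
typedef uint64_t gfn_t;

/* Hypothetical helper: format a gfn with an explicit cast so that the
 * argument type always matches the %lx conversion. */
static int format_gfn(char *buf, size_t len, gfn_t gfn)
{
    return snprintf(buf, len, "gfn=%lx", (unsigned long)gfn);
}
```

Note that the cast truncates the value where unsigned long is 32 bits wide, which is usually acceptable for a debug message.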
Kyle Moffett [Mon, 30 Aug 2010 15:38:39 +0000 (11:38 -0400)]
KVM: PPC: e500_tlb: Fix a minor copy-paste tracing bug
The kvmppc_e500_stlbe_invalidate() function was trying to pass too many
parameters to trace_kvm_stlb_inval(). This appears to be a bad
copy-paste from a call to trace_kvm_stlb_write().
Signed-off-by: Kyle Moffett <Kyle.D.Moffett@boeing.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Alexander Graf [Mon, 30 Aug 2010 12:03:24 +0000 (14:03 +0200)]
KVM: PPC: Implement level interrupts for BookE
BookE also wants to support level based interrupts, so let's implement
all the necessary logic there. We need to trick a bit here because the
irqprios are 1:1 assigned to architecture defined values. But since there
is some space left there, we can just pick a random one and move it later
on - it's internal anyways.
Alexander Graf [Mon, 30 Aug 2010 08:44:15 +0000 (10:44 +0200)]
KVM: PPC: Implement Level interrupts on Book3S
The current interrupt logic is just completely broken. We get a notification
from user space, telling us that an interrupt is there. But then user space
expects us to just acknowledge an interrupt once we deliver it to the
guest.
This is not how real hardware works though. On real hardware, the interrupt
controller pulls the external interrupt line until it gets notified that the
interrupt was received.
So in reality we have two events: pulling and letting go of the interrupt line.
To maintain backwards compatibility, I added a new request for the pulling
part. The letting go part was implemented earlier already.
With this in place, we can now finally start guests that do not stall or
stop working at random times.
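The two-event model described above can be sketched as follows. This is an illustrative model only; the struct and function names are hypothetical and do not correspond to the KVM request names.

```c
#include <stdbool.h>

/* Hypothetical model of a level-triggered external interrupt line. */
struct irq_line {
    bool asserted;       /* the controller is currently pulling the line */
    unsigned delivered;  /* how many times the guest took the interrupt */
};

/* User space pulls the line: it stays pending until released. */
static void irq_line_pull(struct irq_line *l)
{
    l->asserted = true;
}

/* User space lets go of the line once the device has been serviced. */
static void irq_line_release(struct irq_line *l)
{
    l->asserted = false;
}

/* Delivering the interrupt does not clear the line; only an explicit
 * release does. Returns true if an interrupt was delivered. */
static bool irq_line_deliver(struct irq_line *l)
{
    if (!l->asserted)
        return false;
    l->delivered++;
    return true;
}
```

The point of the design is that delivery and acknowledgement are decoupled: the interrupt keeps firing until the source explicitly drops the line.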
Alexander Graf [Tue, 17 Aug 2010 20:08:39 +0000 (22:08 +0200)]
KVM: PPC: Enable napping only for Book3s_64
Previously I incorrectly enabled napping for BookE as well, which would result in
needless dcache flushes. Since we only need to force enable napping on
Book3s_64 because it doesn't go into MSR_POW otherwise, we can just #ifdef
that code to this particular platform.
Reported-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Alexander Graf [Sun, 15 Aug 2010 06:04:24 +0000 (08:04 +0200)]
KVM: PPC: Implement correct SID mapping on Book3s_32
Up until now we were doing segment mappings wrong on Book3s_32. For Book3s_64
we were using a trick where we know that a single mmu_context gives us 16 bits
of context ids.
The mm system on Book3s_32 instead uses a clever algorithm to distribute VSIDs
across the available range, so a context id really only gives us 16 available
VSIDs.
To keep at least a few guest processes in the SID shadow, let's map a number of
contexts that we can use as a VSID pool. This makes the code actually correct
and shouldn't hurt performance too much.
Alexander Graf [Tue, 17 Aug 2010 09:41:44 +0000 (11:41 +0200)]
KVM: PPC: Force enable nap on KVM
There are some heuristics in the PPC power management code that try to find
out if the particular hardware we're running on supports proper power management
or just hangs the machine when going into nap mode.
Since we know that KVM is safe with nap, let's force enable it in the PV code
once we're certain that we are on a KVM VM.
Alexander Graf [Thu, 5 Aug 2010 13:44:41 +0000 (15:44 +0200)]
KVM: PPC: Make PV mtmsrd L=1 work with r30 and r31
We had an arbitrary limitation in mtmsrd L=1 that kept us from using r30 and
r31 as input registers. Let's get rid of that and get more potential speedups!
Alexander Graf [Thu, 5 Aug 2010 10:24:40 +0000 (12:24 +0200)]
KVM: PPC: Update int_pending also on dequeue
When a decrementer interrupt is pending, the dequeuing happens manually
through an mtdec instruction. This instruction simply calls dequeue on that
interrupt, so the int_pending hint doesn't get updated.
This patch enables updating the int_pending hint also on dequeue, thus
correctly enabling guests to stay in guest contexts more often.
Alexander Graf [Thu, 5 Aug 2010 09:26:04 +0000 (11:26 +0200)]
KVM: PPC: Make PV mtmsr work with r30 and r31
So far we've been restricting ourselves to r0-r29 as registers an mtmsr
instruction could use. This was bad, as there are some code paths in
Linux actually using r30.
So let's instead handle all registers gracefully and get rid of that
stupid limitation.
Alexander Graf [Tue, 3 Aug 2010 08:39:35 +0000 (10:39 +0200)]
KVM: PPC: Add mtsrin PV code
This is the guest side of the mtsr acceleration. Using this a guest can now
call mtsrin with almost no overhead as long as it ensures that it only uses
it with (MSR_IR|MSR_DR) == 0. Linux does that, so we're good.
Alexander Graf [Tue, 3 Aug 2010 00:29:27 +0000 (02:29 +0200)]
KVM: PPC: Put segment registers in shared page
Now that the actual mtsr doesn't do anything anymore, we can move the sr
contents over to the shared page, so a guest can directly read and write
its sr contents from guest context.
Alexander Graf [Mon, 2 Aug 2010 23:06:11 +0000 (01:06 +0200)]
KVM: PPC: Interpret SR registers on demand
Right now we're examining the contents of Book3s_32's segment registers when
the register is written and put the interpreted contents into a struct.
There are two reasons this is bad. For starters, the struct has worse real-time
performance, as it occupies more ram. But the more important part is that with
segment registers being interpreted from their raw values, we can put them in
the shared page, allowing guests to mess with them directly.
This patch makes the internal representation of SRs be u32s.
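A rough sketch of what interpreting a raw u32 segment register on demand could look like. The bit masks are an assumption modelled on the 32-bit PowerPC segment register layout, and the accessor names are hypothetical; treating T=0 as "valid" is likewise an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit layout (not taken from the patch). */
#define SR_T    0x80000000u  /* direct-store segment when set */
#define SR_KS   0x40000000u  /* supervisor-state protection key */
#define SR_KP   0x20000000u  /* user-state protection key */
#define SR_NX   0x10000000u  /* no-execute */
#define SR_VSID 0x00ffffffu  /* virtual segment id */

/* Each accessor decodes one field from the raw value when it is needed,
 * so only the 32-bit raw register has to be stored. */
static inline uint32_t sr_vsid(uint32_t sr_raw)  { return sr_raw & SR_VSID; }
static inline bool     sr_valid(uint32_t sr_raw) { return !(sr_raw & SR_T); }
static inline bool     sr_ks(uint32_t sr_raw)    { return (sr_raw & SR_KS) != 0; }
static inline bool     sr_kp(uint32_t sr_raw)    { return (sr_raw & SR_KP) != 0; }
static inline bool     sr_nx(uint32_t sr_raw)    { return (sr_raw & SR_NX) != 0; }
```

Because the stored form is the raw value, the same u32 can live in a page the guest reads and writes directly.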
Alexander Graf [Mon, 2 Aug 2010 21:23:04 +0000 (23:23 +0200)]
KVM: PPC: Move BAT handling code into spr handler
The current approach duplicates the spr->bat finding logic and makes it harder
to reuse the actually used variables. So let's move everything down to the spr
handler.
Alexander Graf [Mon, 2 Aug 2010 19:48:53 +0000 (21:48 +0200)]
KVM: PPC: Revert "KVM: PPC: Use kernel hash function"
It turns out the in-kernel hash function is sub-optimal for our subtle
hash inputs where every bit is significant. So let's revert to the original
hash functions.
Alexander Graf [Mon, 2 Aug 2010 19:24:48 +0000 (21:24 +0200)]
KVM: PPC: Make invalidation code more reliable
There is a race condition in the pte invalidation code path where we can't
be sure if a pte was invalidated already. So let's move the spin lock around
to get rid of the race.
Alexander Graf [Mon, 2 Aug 2010 18:11:39 +0000 (20:11 +0200)]
KVM: PPC: Don't flush PTEs on NX/RO hit
When hitting a no-execute or read-only data/inst storage interrupt we were
flushing the respective PTE so we're sure it gets properly overwritten next.
According to the spec, this is unnecessary though. The guest issues a tlbie
anyways, so we're safe to just keep the PTE around and have it manually removed
from the guest, saving us a flush.
Alexander Graf [Mon, 2 Aug 2010 14:08:22 +0000 (16:08 +0200)]
KVM: PPC: Preload magic page when in kernel mode
When the guest jumps into kernel mode and has the magic page mapped, there's a
very high chance that it will also use it. So let's detect that scenario and
map the segment accordingly.
Alexander Graf [Mon, 2 Aug 2010 11:38:18 +0000 (13:38 +0200)]
KVM: PPC: Fix sid map search after flush
After a flush the sid map contained lots of entries with 0 for their gvsid and
hvsid value. Unfortunately, 0 can be a real value the guest searches for when
looking up a vsid so it would incorrectly find the host's 0 hvsid mapping which
doesn't belong to our sid space.
So let's also check for the valid bit that indicates that the sid we're
looking at actually contains useful data.
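The lookup problem can be illustrated with a small sketch. The struct layout and function name are hypothetical; only the idea of checking a valid flag in addition to the vsid comes from the description above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sid map entry; a flushed entry is all zeroes. */
struct sid_map_entry {
    uint64_t guest_vsid;
    uint64_t host_vsid;
    bool valid;
};

/* Return the entry for gvsid, or NULL. Without the 'valid' check, a
 * search for guest vsid 0 would match any flushed (zeroed) entry. */
static const struct sid_map_entry *
sid_map_find(const struct sid_map_entry *map, size_t n, uint64_t gvsid)
{
    for (size_t i = 0; i < n; i++) {
        if (map[i].valid && map[i].guest_vsid == gvsid)
            return &map[i];
    }
    return NULL;
}
```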
Alexander Graf [Mon, 2 Aug 2010 09:06:26 +0000 (11:06 +0200)]
KVM: PPC: Move EXIT_DEBUG partially to tracepoints
We have a debug printk on every exit that is usually #ifdef'ed out. Using
tracepoints makes a lot more sense here though, as they can be dynamically
enabled.
This patch converts the most commonly used debug printks of EXIT_DEBUG to
tracepoints.
MSR_K7_CLK_CTL is a no longer documented MSR, which is only relevant
on said old AMD CPU models. This change returns the value the Linux
kernel expects, so that it avoids writing back the MSR, and ignores
all writes to the MSR.
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Xiao Guangrong [Sat, 28 Aug 2010 11:24:13 +0000 (19:24 +0800)]
KVM: MMU: rewrite audit_mappings_page() function
There is a bug in this function: we call gfn_to_pfn() and kvm_mmu_gva_to_gpa_read() in
atomic context (kvm_mmu_audit() is called under the protection of the mmu_lock spinlock).
This patch fixes it by:
- introducing gfn_to_pfn_atomic instead of gfn_to_pfn
- getting the mapping gfn from kvm_mmu_page_get_gfn()
It also adds a 'notrap' ptes check in unsync/direct sps.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Xiao Guangrong [Sat, 28 Aug 2010 11:19:42 +0000 (19:19 +0800)]
KVM: MMU: fix compile warning in audit code
fix:
arch/x86/kvm/mmu.c: In function ‘kvm_mmu_unprotect_page’:
arch/x86/kvm/mmu.c:1741: warning: format ‘%lx’ expects type ‘long unsigned int’, but argument 3 has type ‘gfn_t’
arch/x86/kvm/mmu.c:1745: warning: format ‘%lx’ expects type ‘long unsigned int’, but argument 3 has type ‘gfn_t’
arch/x86/kvm/mmu.c: In function ‘mmu_unshadow’:
arch/x86/kvm/mmu.c:1761: warning: format ‘%lx’ expects type ‘long unsigned int’, but argument 3 has type ‘gfn_t’
arch/x86/kvm/mmu.c: In function ‘set_spte’:
arch/x86/kvm/mmu.c:2005: warning: format ‘%lx’ expects type ‘long unsigned int’, but argument 3 has type ‘gfn_t’
arch/x86/kvm/mmu.c: In function ‘mmu_set_spte’:
arch/x86/kvm/mmu.c:2033: warning: format ‘%lx’ expects type ‘long unsigned int’, but argument 7 has type ‘gfn_t’
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Alexander Graf [Tue, 24 Aug 2010 13:48:52 +0000 (15:48 +0200)]
KVM: S390: Export kvm_virtio.h
As suggested by Christian, we should expose headers to user space with
information that might be valuable there. The s390 virtio interface is
one of those cases. It defines an ABI between hypervisor and guest, so
it should be exposed to user space.
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Alexander Graf [Tue, 24 Aug 2010 13:48:51 +0000 (15:48 +0200)]
KVM: S390: Add virtio hotplug add support
The one big missing feature in s390-virtio was hotplugging. This is no more.
This patch implements hotplug add support, so you can add new devices to the
guest on the fly.
Keep in mind that this needs a patch for qemu to actually leverage the
functionality.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Alexander Graf [Tue, 24 Aug 2010 13:48:50 +0000 (15:48 +0200)]
KVM: S390: take a full byte as ext_param indicator
Currently the ext_param field only distinguishes between "config change" and
"vring interrupt". We can do a lot more with it though, so let's enable a
full byte of possible values and convert the constants to #defines while at it.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Zachary Amsden [Fri, 20 Aug 2010 08:07:30 +0000 (22:07 -1000)]
KVM: x86: Fix a possible backwards warp of kvmclock
Kernel time, which advances in discrete steps, may progress much slower
than TSC. As a result, when kvmclock is adjusted to a new base, the
apparent time to the guest, which runs at a much higher, nsec scaled
rate based on the current TSC, may have already been observed to have
a larger value (kernel_ns + scaled tsc) than the value to which we are
setting it (kernel_ns + 0).
We must instead compute the clock as potentially observed by the guest
for kernel_ns to make sure it does not go backwards.
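A simplified sketch of the clamping idea: before rebasing, compute the largest time the guest could already have observed and never set the new base below it. All names and the plain kHz-based scaling are assumptions for illustration; the actual code uses the pvclock scaling parameters.

```c
#include <stdint.h>

/* Hypothetical helper: convert a TSC delta to nanoseconds for a TSC
 * rate given in kHz. The multiplication can overflow for very large
 * deltas; a real implementation would use a scaled multiply. */
static uint64_t tsc_delta_to_ns(uint64_t delta, uint64_t tsc_khz)
{
    return delta * 1000000ull / tsc_khz;
}

/* Return the kernel_ns value to use as the new clock base.
 * last_kernel_ns / base_tsc describe the previous base, and
 * last_guest_tsc is the latest TSC value the guest may have read. */
static uint64_t clamp_kernel_ns(uint64_t kernel_ns, uint64_t last_kernel_ns,
                                uint64_t base_tsc, uint64_t last_guest_tsc,
                                uint64_t tsc_khz)
{
    uint64_t max_seen = last_kernel_ns +
                        tsc_delta_to_ns(last_guest_tsc - base_tsc, tsc_khz);

    /* Never hand the guest a base earlier than what it already saw. */
    return max_seen > kernel_ns ? max_seen : kernel_ns;
}
```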
Zachary Amsden [Fri, 20 Aug 2010 08:07:28 +0000 (22:07 -1000)]
KVM: x86: Add clock sync request to hardware enable
If there are active VCPUs which are marked as belonging to
a particular hardware CPU, request a clock sync for them when
enabling hardware; the TSC could be desynchronized on a newly
arriving CPU, and we need to recompute guests' system time
relative to boot after a suspend event.
This covers both cases.
Note that it is acceptable to take the spinlock, as either
no other tasks will be running and no locks held (BSP after
resume), or other tasks will be guaranteed to drop the lock
relatively quickly (AP on CPU_STARTING).
Noting we now get clock synchronization requests for VCPUs
which are starting up (or restarting), it is tempting to
attempt to remove the arch/x86/kvm/x86.c CPU hot-notifiers
at this time, however it is not correct to do so; they are
required for systems with non-constant TSC as the frequency
may not be known immediately after the processor has started
until the cpufreq driver has had a chance to run and query
the chipset.
Updated: implement better locking semantics for hardware_enable
Removed the hack of dropping and retaking the lock by adding the
semantic that we always hold kvm_lock when hardware_enable is
called. The one place that doesn't need to worry about it is
resume, as the spinlock won't be taken when resuming a frozen CPU.
Zachary Amsden [Fri, 20 Aug 2010 08:07:26 +0000 (22:07 -1000)]
KVM: x86: Robust TSC compensation
Make the match of TSC find TSC writes that are close to each other
instead of perfectly identical; this allows the compensator to also
work in migration / suspend scenarios.
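The "close to each other" match can be sketched as a tolerance check on the absolute difference between two TSC writes. The function name and the tolerance parameter are assumptions; the commit text does not state the threshold used.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check: treat two TSC writes as an attempt to set the
 * same value if they differ by less than 'tolerance' cycles, in
 * either direction. */
static bool tsc_writes_match(uint64_t new_tsc, uint64_t last_tsc,
                             uint64_t tolerance)
{
    uint64_t diff = new_tsc > last_tsc ? new_tsc - last_tsc
                                       : last_tsc - new_tsc;
    return diff < tolerance;
}
```

An exact-equality test would be the special case of a tolerance of one cycle.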
Zachary Amsden [Fri, 20 Aug 2010 08:07:25 +0000 (22:07 -1000)]
KVM: x86: Add helper functions for time computation
Add a helper function to compute the kernel time and convert nanoseconds
back to CPU specific cycles. Note that these must not be called in preemptible
context, as that would mean the kernel could enter software suspend state,
which would cause non-atomic operation.
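The nanoseconds-to-cycles conversion can be sketched as below. This is a user-space illustration that takes the TSC rate as a parameter; the in-kernel helper reads the per-CPU rate, which is why the preemption constraint above applies to it but not to this sketch.

```c
#include <stdint.h>

/* Convert nanoseconds to TSC cycles for a rate given in kHz:
 * cycles = ns * (khz * 1000) / 1e9 = ns * khz / 1e6.
 * The parameterised signature is an assumption for illustration;
 * the multiplication can overflow for very large inputs. */
static uint64_t nsec_to_cycles(uint64_t nsec, uint64_t tsc_khz)
{
    return nsec * tsc_khz / 1000000ull;
}
```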
Also, convert the KVM_SET_CLOCK / KVM_GET_CLOCK ioctls to use the kernel
time helper; these should be boot-based as well.
Zachary Amsden [Fri, 20 Aug 2010 08:07:24 +0000 (22:07 -1000)]
KVM: x86: Fix deep C-state TSC desynchronization
When CPUs with unstable TSCs enter deep C-state, TSC may stop
running. This causes us to require resynchronization. Since
we can't tell when this may potentially happen, we assume the
worst by forcing re-compensation for it at every point the VCPU
task is descheduled.
Zachary Amsden [Fri, 20 Aug 2010 08:07:21 +0000 (22:07 -1000)]
KVM: x86: Make cpu_tsc_khz updates use local CPU
This simplifies much of the init code; we can now simply always
call tsc_khz_changed, optionally passing it a new value, or letting
it figure out the existing value (while interrupts are disabled, and
thus, by inference from the rule, not raceful against CPU hotplug or
frequency updates, which will issue IPIs to the local CPU to perform
this very same task).
Zachary Amsden [Fri, 20 Aug 2010 08:07:20 +0000 (22:07 -1000)]
KVM: x86: TSC reset compensation
Attempt to synchronize TSCs which are reset to the same value. In the
case of a reliable hardware TSC, we can just re-use the same offset, but
on non-reliable hardware, we can get closer by adjusting the offset to
match the elapsed time.
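A sketch of the two compensation paths described above, under simplifying assumptions: the function name, its parameters, and the kHz-based conversion are all hypothetical and only illustrate the offset arithmetic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Compute the TSC offset for a write that matches a previous one.
 * - Reliable TSC: reuse the offset from the earlier write, so both
 *   VCPUs see exactly the same guest TSC.
 * - Unreliable TSC: derive the offset from the current host TSC and
 *   add the cycles that elapsed since the earlier write. */
static int64_t compensated_offset(bool tsc_reliable, int64_t last_offset,
                                  uint64_t written_tsc, uint64_t host_tsc_now,
                                  uint64_t elapsed_ns, uint64_t tsc_khz)
{
    int64_t offset;

    if (tsc_reliable)
        return last_offset;

    offset = (int64_t)(written_tsc - host_tsc_now);
    return offset + (int64_t)(elapsed_ns * tsc_khz / 1000000ull);
}
```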
Zachary Amsden [Fri, 20 Aug 2010 08:07:17 +0000 (22:07 -1000)]
KVM: x86: Move TSC offset writes to common code
Also, ensure that the storing of the offset and the reading of the TSC
are never preempted by taking a spinlock. While the lock is overkill
now, it is useful later in this patch series.
Zachary Amsden [Fri, 20 Aug 2010 08:07:16 +0000 (22:07 -1000)]
KVM: x86: Convert TSC writes to TSC offset writes
Change svm / vmx to be the same internally and write TSC offset
instead of bare TSC in helper functions. Isolated as a single
patch to contain code movement.