Andre Przywara [Wed, 24 Jun 2009 13:37:05 +0000 (15:37 +0200)]
KVM: Ignore PCI ECS I/O enablement
Linux guests will try to enable access to the extended PCI config space
via the I/O ports 0xCF8/0xCFC on AMD Fam10h CPU. Since we (currently?)
don't use ECS, simply ignore write and read attempts.
Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>
This changes bus accesses to use high-level kvm_io_bus_read/kvm_io_bus_write
functions. in_range now becomes unused so it is removed from device ops in
favor of read/write callbacks performing range checks internally.
This allows aliasing (mostly for in-kernel virtio), as well as better error
handling by making it possible to pass errors up to userspace.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
Use slots_lock to protect device list on the bus. slots_lock is already
taken for read everywhere, so we only need to take it for write when
registering devices. This is in preparation to removing in_range and
kvm->lock around it.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
switch pit creation to slots_lock. slots_lock is already taken for read
everywhere, so we only need to take it for write when creating pit.
This is in preparation to removing in_range and kvm->lock around it.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
switch coalesced mmio slots_lock. slots_lock is already taken for read
everywhere, so we only need to take it for write when changing zones.
This is in preparation to removing in_range and kvm->lock around it.
[avi: fix build]
Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
Andre Przywara [Thu, 25 Jun 2009 10:36:49 +0000 (12:36 +0200)]
KVM: introduce module parameter for ignoring unknown MSRs accesses
KVM will inject a #GP into the guest if that tries to access unhandled
MSRs. This will crash many guests. Although it would be the correct
way to actually handle these MSRs, we introduce a runtime switchable
module param called "ignore_msrs" (defaults to 0). If this is Y, unknown
MSR reads will return 0, while MSR writes are simply dropped. In both cases
we print a message to dmesg to inform the user about that.
You can change the behaviour at any time by saying:
Andre Przywara [Wed, 24 Jun 2009 10:44:34 +0000 (12:44 +0200)]
KVM: ignore reads from AMDs C1E enabled MSR
If the Linux kernel detects an C1E capable AMD processor (K8 RevF and
higher), it will access a certain MSR on every attempt to go to halt.
Explicitly handle this read and return 0 to let KVM run a Linux guest
with the native AMD host CPU propagated to the guest.
Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>
Andre Przywara [Wed, 24 Jun 2009 10:44:33 +0000 (12:44 +0200)]
KVM: ignore AMDs HWCR register access to set the FFDIS bit
Linux tries to disable the flush filter on all AMD K8 CPUs. Since KVM
does not handle the needed MSR, the injected #GP will panic the Linux
kernel. Ignore setting of the HWCR.FFDIS bit in this MSR to let Linux
boot with an AMD K8 family guest CPU.
Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>
s390 uses the values of the memslot instead of doing everything in the arch
ioctl handler of the KVM_SET_USER_MEMORY_REGION. Unfortunately we missed to
set the userspace_addr of our memslot due to our s390 ifdef in
__kvm_set_memory_region.
Old s390 userspace launchers did not notice, since they started the guest at
userspace address 0.
Because of CONFIG_DEFAULT_MMAP_MIN_ADDR we now put the guest at 1M userspace,
which does not work. This patch makes sure that new.userspace_addr is set
on s390.
This fix should go in quickly. Nevertheless, looking at the code we should
clean up that ifdef in the long term. Any kernel janitors?
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>
Andre Przywara [Thu, 18 Jun 2009 10:56:02 +0000 (12:56 +0200)]
KVM: x86 emulator: Add sysexit emulation
Handle #UD intercept of the sysexit instruction in 64bit mode returning to
32bit compat mode on an AMD host.
Setup the segment descriptors for CS and SS and the EIP/ESP registers
according to the manual.
Signed-off-by: Christoph Egger <christoph.egger@amd.com> Signed-off-by: Amit Shah <amit.shah@redhat.com> Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>
Andre Przywara [Thu, 18 Jun 2009 10:56:01 +0000 (12:56 +0200)]
KVM: x86 emulator: Add sysenter emulation
Handle #UD intercept of the sysenter instruction in 32bit compat mode on
an AMD host.
Setup the segment descriptors for CS and SS and the EIP/ESP registers
according to the manual.
Signed-off-by: Christoph Egger <christoph.egger@amd.com> Signed-off-by: Amit Shah <amit.shah@redhat.com> Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>
Andre Przywara [Thu, 18 Jun 2009 10:56:00 +0000 (12:56 +0200)]
KVM: x86 emulator: add syscall emulation
Handle #UD intercept of the syscall instruction in 32bit compat mode on
an Intel host.
Setup the segment descriptors for CS and SS and the EIP/ESP registers
according to the manual. Save the RIP and EFLAGS to the correct registers.
[avi: fix build on i386 due to missing R11]
Signed-off-by: Christoph Egger <christoph.egger@amd.com> Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>
Andre Przywara [Wed, 17 Jun 2009 13:50:33 +0000 (15:50 +0200)]
KVM: x86 emulator: Prepare for emulation of syscall instructions
Add the flags needed for syscall, sysenter and sysexit to the opcode table.
Catch (but for now ignore) the opcodes in the emulation switch/case.
Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Amit Shah <amit.shah@redhat.com> Signed-off-by: Christoph Egger <christoph.egger@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>
Andre Przywara [Wed, 17 Jun 2009 13:50:32 +0000 (15:50 +0200)]
KVM: x86 emulator: Add missing EFLAGS bit definitions
Signed-off-by: Christoph Egger <christoph.egger@amd.com> Signed-off-by: Amit Shah <amit.shah@redhat.com> Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>
Andre Przywara [Wed, 17 Jun 2009 13:50:31 +0000 (15:50 +0200)]
KVM: Allow emulation of syscalls instructions on #UD
Add the opcodes for syscall, sysenter and sysexit to the list of instructions
handled by the undefined opcode handler.
Signed-off-by: Christoph Egger <christoph.egger@amd.com> Signed-off-by: Amit Shah <amit.shah@redhat.com> Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>
Jan Kiszka [Tue, 9 Jun 2009 13:37:03 +0000 (15:37 +0200)]
KVM: Drop useless atomic test from timer function
The current code tries to optimize the setting of
KVM_REQ_PENDING_TIMER but used atomic_inc_and_test - which always
returns true unless pending had the invalid value of -1 on entry. This
patch drops the test part preserving the original semantic but
expressing it less confusingly.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>
Jan Kiszka [Tue, 9 Jun 2009 13:37:01 +0000 (15:37 +0200)]
KVM: Fix racy event propagation in timer
Minor issue that likely had no practical relevance: the kvm timer
function so far incremented the pending counter and then may reset it
again to 1 in case reinjection was disabled. This opened a small racy
window with the corresponding VCPU loop that may have happened to run
on another (real) CPU and already consumed the value.
Fix it by skipping the incrementation in case pending is already > 0.
This opens a different race windows, but may only rarely cause lost
events in case we do not care about them anyway (!reinject).
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>
Gleb Natapov [Thu, 11 Jun 2009 08:06:51 +0000 (11:06 +0300)]
KVM: Optimize searching for highest IRR
Most of the time IRR is empty, so instead of scanning the whole IRR on
each VM entry keep a variable that tells us if IRR is not empty. IRR
will have to be scanned twice on each IRQ delivery, but this is much
more rare than VM entry.
Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
Andre Przywara [Fri, 12 Jun 2009 20:01:29 +0000 (22:01 +0200)]
KVM: Move performance counter MSR access interception to generic x86 path
The performance counter MSRs are different for AMD and Intel CPUs and they
are chosen mainly by the CPUID vendor string. This patch catches writes to
all addresses (regardless of VMX/SVM path) and handles them in the generic
MSR handler routine. Writing a 0 into the event select register is something
we perfectly emulate ;-), so don't print out a warning to dmesg in this
case.
This fixes booting a 64bit Windows guest with an AMD CPUID on an Intel host.
Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>
Marcelo Tosatti [Wed, 10 Jun 2009 15:27:07 +0000 (12:27 -0300)]
KVM: MMU audit: audit_mappings tweaks
- Fail early in case gfn_to_pfn returns is_error_pfn.
- For the pre pte write case, avoid spurious "gva is valid but spte is notrap"
messages (the emulation code does the guest write first, so this particular
case is OK).
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
Gleb Natapov [Tue, 9 Jun 2009 12:56:28 +0000 (15:56 +0300)]
KVM: Break dependency between vcpu index in vcpus array and vcpu_id.
Archs are free to use vcpu_id as they see fit. For x86 it is used as
vcpu's apic id. New ioctl is added to configure boot vcpu id that was
assumed to be 0 till now.
Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
Nitin A Kamble [Mon, 8 Jun 2009 18:34:16 +0000 (11:34 -0700)]
KVM: VMX: Support Unrestricted Guest feature
"Unrestricted Guest" feature is added in the VMX specification.
Intel Westmere and onwards processors will support this feature.
It allows kvm guests to run real mode and unpaged mode
code natively in the VMX mode when EPT is turned on. With the
unrestricted guest there is no need to emulate the guest real mode code
in the vm86 container or in the emulator. Also the guest big real mode
code works like native.
The attached patch enhances KVM to use the unrestricted guest feature
if available on the processor. It also adds a new kernel/module
parameter to disable the unrestricted guest feature at the boot time.
Signed-off-by: Nitin A Kamble <nitin.a.kamble@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>
Marcelo Tosatti [Thu, 4 Jun 2009 18:08:24 +0000 (15:08 -0300)]
KVM: switch irq injection/acking data structures to irq_lock
Protect irq injection/acking data structures with a separate irq_lock
mutex. This fixes the following deadlock:
CPU A CPU B
kvm_vm_ioctl_deassign_dev_irq()
mutex_lock(&kvm->lock); worker_thread()
-> kvm_deassign_irq() -> kvm_assigned_dev_interrupt_work_handler()
-> deassign_host_irq() mutex_lock(&kvm->lock);
-> cancel_work_sync() [blocked]
[gleb: fix ia64 path]
Reported-by: Alex Williamson <alex.williamson@hp.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
Avi Kivity [Thu, 4 Jun 2009 15:09:08 +0000 (18:09 +0300)]
KVM: Calculate available entries in coalesced mmio ring
Instead of checking whether we'll wrap around, calculate how many entries
are available, and check whether we have enough (just one) for the pending
mmio.
By itself, this doesn't change anything, but it paves the way for making
this function lockless.
Avi Kivity [Sun, 31 May 2009 15:41:29 +0000 (18:41 +0300)]
KVM: VMX: Simplify pdptr and cr3 management
Instead of reading the PDPTRs from memory after every exit (which is slow
and wrong, as the PDPTRs are stored on the cpu), sync the PDPTRs from
memory to the VMCS before entry, and from the VMCS to memory after exit.
Do the same for cr3.
Gregory Haskins [Mon, 1 Jun 2009 16:54:56 +0000 (12:54 -0400)]
KVM: do not register i8254 PIO regions until we are initialized
We currently publish the i8254 resources to the pio_bus before the devices
are fully initialized. Since we hold the pit_lock, its probably not
a real issue. But lets clean this up anyway.
Reported-by: Avi Kivity <avi@redhat.com> Signed-off-by: Gregory Haskins <ghaskins@novell.com> Acked-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Avi Kivity <avi@redhat.com>
Gregory Haskins [Mon, 1 Jun 2009 16:54:50 +0000 (12:54 -0400)]
KVM: cleanup io_device code
We modernize the io_device code so that we use container_of() instead of
dev->private, and move the vtable to a separate ops structure
(theoretically allows better caching for multiple instances of the same
ops structure)
Signed-off-by: Gregory Haskins <ghaskins@novell.com> Acked-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Avi Kivity <avi@redhat.com>
Gregory Haskins [Mon, 1 Jun 2009 16:54:45 +0000 (12:54 -0400)]
KVM: Clean up coalesced_mmio destruction
We invoke kfree() on a data member instead of the structure. This works today
because the kvm_io_device is the first element of the private structure, but
this could change in the future, so lets clean this up.
Signed-off-by: Gregory Haskins <ghaskins@novell.com> Acked-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Avi Kivity <avi@redhat.com>
WARNING: arch/powerpc/kvm/kvm-440.o(.text+0x1e8c): Section mismatch in reference from the function kvmppc_44x_exit() to the function .exit.text:kvmppc_booke_exit()
The function kvmppc_44x_exit() references a function in an exit section.
Often the function kvmppc_booke_exit() has valid usage outside the exit section
and the fix is to remove the __exit annotation of kvmppc_booke_exit.
Also add some __init annotations on obvious routines.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Avi Kivity <avi@redhat.com>
Changing s390 code in kvm_arch_vcpu_load/put come across this header
declarations. They are complete duplicates, not even useful forward
declarations as nothing using it is in between (maybe it was that in
the past).
This patch removes the two dispensable lines.
Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>
Sheng Yang [Thu, 30 Apr 2009 02:58:42 +0000 (10:58 +0800)]
KVM: No disable_irq for MSI/MSI-X interrupt on device assignment
Disable interrupt at interrupt handler and enable it when guest ack is for
the level triggered interrupt, to prevent reinjected interrupt. MSI/MSI-X don't
need it.
One possible problem is multiply same vector interrupt injected between irq
handler and scheduled work handler would be merged as one for MSI/MSI-X.
But AFAIK, the drivers handle it well.
The patch fixed the oplin card performance issue(MSI-X performance is half of
MSI/INTx).
Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>
Andre Przywara [Thu, 28 May 2009 09:56:31 +0000 (11:56 +0200)]
KVM: SVM: use explicit 64bit storage for sysenter values
Since AMD does not support sysenter in 64bit mode, the VMCB fields storing
the MSRs are truncated to 32bit upon VMRUN/#VMEXIT. So store the values
in a separate 64bit storage to avoid truncation.
[andre: fix amd->amd migration]
Signed-off-by: Christoph Egger <christoph.egger@amd.com> Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>
This patch relocates the variables kvm-s390 uses to track guest mem addr/size.
As discussed dropping the variables at struct kvm_arch level allows to use the
common vcpu->request based mechanism to reload guest memory if e.g. changes
via set_memory_region.
The kick mechanism introduced in this series is used to ensure running vcpus
leave guest state to catch the update.
Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>
If signal pending is true we exit without updating kvm_run, userspace
currently just does nothing and jumps to kvm_run again.
Since we did not set an exit_reason we might end up with a random one
(whatever was the last exit). Therefore it was possible to e.g. jump to
the psw position the last real interruption set.
Setting the INTR exit reason ensures that no old psw data is swapped
in on reentry.
Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>
KVM: s390: infrastructure to kick vcpus out of guest state
To ensure vcpu's come out of guest context in certain cases this patch adds a
s390 specific way to kick them out of guest context. Currently it kicks them
out to rerun the vcpu_run path in the s390 code, but the mechanism itself is
expandable and with a new flag we could also add e.g. kicks to userspace etc.
Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>
Jan Kiszka [Thu, 14 May 2009 20:42:53 +0000 (22:42 +0200)]
KVM: Allow PIT emulation without speaker port
The in-kernel speaker emulation is only a dummy and also unneeded from
the performance point of view. Rather, it takes user space support to
generate sound output on the host, e.g. console beeps.
To allow this, introduce KVM_CREATE_PIT2 which controls in-kernel
speaker port emulation via a flag passed along the new IOCTL. It also
leaves room for future extensions of the PIT configuration interface.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>
Gregory Haskins [Wed, 20 May 2009 14:30:49 +0000 (10:30 -0400)]
KVM: irqfd
KVM provides a complete virtual system environment for guests, including
support for injecting interrupts modeled after the real exception/interrupt
facilities present on the native platform (such as the IDT on x86).
Virtual interrupts can come from a variety of sources (emulated devices,
pass-through devices, etc) but all must be injected to the guest via
the KVM infrastructure. This patch adds a new mechanism to inject a specific
interrupt to a guest using a decoupled eventfd mechnanism: Any legal signal
on the irqfd (using eventfd semantics from either userspace or kernel) will
translate into an injected interrupt in the guest at the next available
interrupt window.
Signed-off-by: Gregory Haskins <ghaskins@novell.com> Signed-off-by: Avi Kivity <avi@redhat.com>
Use proper foo-y style list additions to cleanup all the conditionals,
move module selection after compound object selection and remove the
superflous comment.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Avi Kivity <avi@redhat.com>
Huang Ying [Mon, 11 May 2009 08:48:15 +0000 (16:48 +0800)]
KVM: Add MCE support
The related MSRs are emulated. MCE capability is exported via
extension KVM_CAP_MCE and ioctl KVM_X86_GET_MCE_CAP_SUPPORTED. A new
vcpu ioctl command KVM_X86_SETUP_MCE is used to setup MCE emulation
such as the mcg_cap. MCE is injected via vcpu ioctl command
KVM_X86_SET_MCE. Extended machine-check state (MCG_EXT_P) and CMCI are
not implemented.
Signed-off-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>
"This seems to generate /sys/block/$device/queue and its contents for
everyone who is using queues, not just for those queues that have a
non-NULL queue->request_fn."
Note that embedding a queue inside another object has always been
an illegal construct, since the queues are reference counted and
must persist until the last reference is dropped. So aoe was
always buggy in this respect (Jens).
Signed-off-by: Ed Cashin <ecashin@coraid.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Bruno Premont <bonbons@linux-vserver.org> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
i915: disable interrupts before tearing down GEM state
Reinette Chatre reports a frozen system (with blinking keyboard LEDs)
when switching from graphics mode to the text console, or when
suspending (which does the same thing). With netconsole, the oops
turned out to be
BUG: unable to handle kernel NULL pointer dereference at 0000000000000084
IP: [<ffffffffa03ecaab>] i915_driver_irq_handler+0x26b/0xd20 [i915]
and it's due to the i915_gem.c code doing drm_irq_uninstall() after
having done i915_gem_idle(). And the i915_gem_idle() path will do
but if an i915 interrupt comes in after this stage, it may want to
access that hw_status_page, and gets the above NULL pointer dereference.
And since the NULL pointer dereference happens from within an interrupt,
and with the screen still in graphics mode, the common end result is
simply a silently hung machine.
Fix it by simply uninstalling the irq handler before idling rather than
after. Fixes