]> git.proxmox.com Git - pve-kernel.git/blob - patches/kernel/0008-kvm-xsave-set-mask-out-PKRU-bit-in-xfeatures-if-vCPU.patch
f6d8d411abb50724eeeae7588c00ae4bdd6df179
[pve-kernel.git] / patches / kernel / 0008-kvm-xsave-set-mask-out-PKRU-bit-in-xfeatures-if-vCPU.patch
1 From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
2 From: Thomas Lamprecht <t.lamprecht@proxmox.com>
3 Date: Fri, 14 Jul 2023 18:10:32 +0200
4 Subject: [PATCH] kvm: xsave set: mask-out PKRU bit in xfeatures if vCPU has no
5 support
6 MIME-Version: 1.0
7 Content-Type: text/plain; charset=UTF-8
8 Content-Transfer-Encoding: 8bit
9
10 Fixes live-migrations & snapshot-rollback of VMs with a restricted
11 CPU type (e.g., qemu64) from our 5.15 based kernel (default Proxmox
12 VE 7.4) to the 6.2 (and future newer) of Proxmox VE 8.0.
13
14 Previous to ad856280ddea ("x86/kvm/fpu: Limit guest user_xfeatures to
15 supported bits of XCR0") the PKRU bit of the host could leak into the
16 state from the guest, which caused trouble when migrating between
17 hosts with different CPUs, i.e., where the source supported it but
18 the target did not, causing a general protection fault when the guest
19 tried to use a pkru related instruction after the migration.
20
21 But the fix, while welcome, caused a temporary out-of-sync state when
22 migrating such a VM from a kernel without the fix to a kernel with
23 the fix, as it threw of KVM when the CPUID of the guest and most of
24 the state doesn't report XSAVE and thus any xfeatures, but PKRU and
25 the related state is set as enabled, causing the vCPU to spin at 100%
26 without any progress forever.
27
28 The fix could be at two sites, either in QEMU or in the kernel, I
29 choose the kernel as we have all the info there for a targeted
30 heuristic so that we don't have to adapt QEMU and qemu-server, the
31 latter even on both sides.
32
33 Still, a short summary of the possible fixes and short drawbacks:
34 * on QEMU-side either
35 - clear the PKRU state in the migration saved state would be rather
36 complicated to implement as the vCPU is initialised way before we
37 have the saved xfeature state available to check what we'd need
38 to do, plus the user-space only gets a memory blob from ioctl
39 KVM_GET_XSAVE2 that it passes to KVM_SET_XSAVE ioctl, there are
40 no ABI guarantees, and while the struct seem stable for 5.15 to
41 6.5-rc1, that doesn't has to be for future kernels, so off the
42 table.
43 - enforce that the CPUID reports PKU support even if it normally
44 wouldn't. While this works (tested by hard-coding it as POC) it
45 is a) not really nice and b) needs some interaction from
46 qemu-server to enable this flag as otherwise we have no good info
47 to decide when it's OK to do this, which means we need to adapt
48 both PVE 7 and 8's qemu-server and also pve-qemu, workable but
49 not optimal
50
51 * on Kernel/KVM-side we can hook into the set XSAVE ioctl specific to
52 the KVM subsystem, which already reduces chance of regression for
53 all other places. There we have access to the union/struct
54 definitions of the saved state and thus can savely cast to that.
55 We also got access to the vCPU's CPUID capabilities, meaning we can
56 check if the XCR0 (first XSAVE Control Register) reports
57 that it support the PKRU feature, and if it does *NOT* but the
58 saved xfeatures register from XSAVE *DOES* report it, we can safely
59 assume that this combination is due to an migration from an older,
60 leaky kernel – and clear the bit in the xfeature register before
61 restoring it to the guest vCPU KVM state, avoiding the confusing
62 situation that made the vCPU spin at 100%.
63 This should be safe to do, as the guest vCPU CPUID never reported
64 support for the PKRU feature, and it's also a relatively niche and
65 newish feature.
66
67 If it gains us something we can drop this patch a bit in the future
68 Proxmox VE 9 major release, but we should ensure that VMs that where
69 started before PVE 8 cannot be directly live-migrated to the release
70 that includes that change; so we should rather only drop it if the
71 maintenance burden is high.
72
73 Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
74 ---
75 arch/x86/kvm/cpuid.c | 6 ++++++
76 arch/x86/kvm/cpuid.h | 2 ++
77 arch/x86/kvm/x86.c | 13 +++++++++++++
78 3 files changed, 21 insertions(+)
79
80 diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
81 index d3432687c9e6..2c20da9aa2ac 100644
82 --- a/arch/x86/kvm/cpuid.c
83 +++ b/arch/x86/kvm/cpuid.c
84 @@ -249,6 +249,12 @@ static u64 cpuid_get_supported_xcr0(struct kvm_cpuid_entry2 *entries, int nent)
85 return (best->eax | ((u64)best->edx << 32)) & kvm_caps.supported_xcr0;
86 }
87
88 +bool vcpu_supports_xsave_pkru(struct kvm_vcpu *vcpu) {
89 + u64 guest_supported_xcr0 = cpuid_get_supported_xcr0(
90 + vcpu->arch.cpuid_entries, vcpu->arch.cpuid_nent);
91 + return (guest_supported_xcr0 & XFEATURE_MASK_PKRU) != 0;
92 +}
93 +
94 static void __kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu, struct kvm_cpuid_entry2 *entries,
95 int nent)
96 {
97 diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
98 index b1658c0de847..12a02851ff57 100644
99 --- a/arch/x86/kvm/cpuid.h
100 +++ b/arch/x86/kvm/cpuid.h
101 @@ -32,6 +32,8 @@ int kvm_vcpu_ioctl_get_cpuid2(struct kvm_vcpu *vcpu,
102 bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx,
103 u32 *ecx, u32 *edx, bool exact_only);
104
105 +bool vcpu_supports_xsave_pkru(struct kvm_vcpu *vcpu);
106 +
107 u32 xstate_required_size(u64 xstate_bv, bool compacted);
108
109 int cpuid_query_maxphyaddr(struct kvm_vcpu *vcpu);
110 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
111 index c381770bcbf1..6690a3722007 100644
112 --- a/arch/x86/kvm/x86.c
113 +++ b/arch/x86/kvm/x86.c
114 @@ -5413,6 +5413,19 @@ static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
115 if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
116 return 0;
117
118 + if (!vcpu_supports_xsave_pkru(vcpu)) {
119 + void *buf = guest_xsave->region;
120 + union fpregs_state *ustate = buf;
121 + if (ustate->xsave.header.xfeatures & XFEATURE_MASK_PKRU) {
122 + printk(
123 + KERN_NOTICE "clearing PKRU xfeature bit as vCPU from PID %d"
124 + " reports no PKRU support - migration from fpu-leaky kernel?",
125 + current->pid
126 + );
127 + ustate->xsave.header.xfeatures &= ~XFEATURE_MASK_PKRU;
128 + }
129 + }
130 +
131 return fpu_copy_uabi_to_guest_fpstate(&vcpu->arch.guest_fpu,
132 guest_xsave->region,
133 kvm_caps.supported_xcr0,