From: Laszlo Ersek
Date: Thu, 19 Jan 2023 11:01:31 +0000 (+0100)
Subject: OvmfPkg/PlatformInitLib: catch QEMU's CPU hotplug reg block regression
X-Git-Tag: edk2-stable202302~100
X-Git-Url: https://git.proxmox.com/?p=mirror_edk2.git;a=commitdiff_plain;h=bf5678b5802685e07583e3c7ec56d883cbdd5da3

OvmfPkg/PlatformInitLib: catch QEMU's CPU hotplug reg block regression

In QEMU v5.1.0, the CPU hotplug register block misbehaves: the negotiation
protocol is (effectively) broken such that it suggests that switching from
the legacy interface to the modern interface works, but in reality the
switch never happens. The symptom has been witnessed when using TCG
acceleration; KVM seems to mask the issue. The issue persists with the
following (latest) stable QEMU releases: v5.2.0, v6.2.0, v7.2.0. Currently
there is no stable release that addresses the problem.

The QEMU bug confuses the Present and Possible counting in function
PlatformMaxCpuCountInitialization(), in
"OvmfPkg/Library/PlatformInitLib/Platform.c". OVMF ends up with Present=0
Possible=1. This in turn further confuses MpInitLib in UefiCpuPkg (hence
firmware-time multiprocessing will be broken). Worse, CPU hot(un)plug with
SMI will be summarily broken in OvmfPkg/CpuHotplugSmm, which (considering
the privilege level of SMM) is not that great.

Detect the issue in PlatformCpuCountBugCheck(), and print an error message
and *hang* if the issue is present.

Users willing to take risks can override the hang with the experimental
QEMU command line option

  -fw_cfg name=opt/org.tianocore/X-Cpuhp-Bugcheck-Override,string=yes

(The "-fw_cfg" QEMU option itself is not experimental; its above argument,
as far as it concerns the firmware, is experimental.)

The problem was originally reported by Ard [0]. We analyzed it at [1] and
[2]. A QEMU patch was sent at [3]; now merged as commit dab30fbef389
("acpi: cpuhp: fix guest-visible maximum access size to the legacy reg
block", 2023-01-08), to be included in QEMU v8.0.0.

[0] https://bugzilla.tianocore.org/show_bug.cgi?id=4234#c2

[1] https://bugzilla.tianocore.org/show_bug.cgi?id=4234#c3

[2] IO port write width clamping differs between TCG and KVM
    http://mid.mail-archive.com/aaedee84-d3ed-a4f9-21e7-d221a28d1683@redhat.com
    https://lists.gnu.org/archive/html/qemu-devel/2023-01/msg00199.html

[3] acpi: cpuhp: fix guest-visible maximum access size to the legacy reg block
    http://mid.mail-archive.com/20230104090138.214862-1-lersek@redhat.com
    https://lists.gnu.org/archive/html/qemu-devel/2023-01/msg00278.html

NOTE: PlatformInitLib is used in the following platform DSCs:

  OvmfPkg/AmdSev/AmdSevX64.dsc
  OvmfPkg/CloudHv/CloudHvX64.dsc
  OvmfPkg/IntelTdx/IntelTdxX64.dsc
  OvmfPkg/Microvm/MicrovmX64.dsc
  OvmfPkg/OvmfPkgIa32.dsc
  OvmfPkg/OvmfPkgIa32X64.dsc
  OvmfPkg/OvmfPkgX64.dsc

but I can only test this change with the last three platforms, running on
QEMU.

Test results:

  TCG  QEMU     OVMF     override  result
       patched  patched
  ---  -------  -------  --------  --------------------------------------
  0    0        0        0         CPU counts OK (KVM masks the QEMU bug)
  0    0        1        0         CPU counts OK (KVM masks the QEMU bug)
  0    1        0        0         CPU counts OK (QEMU fix, but KVM masks
                                   the QEMU bug anyway)
  0    1        1        0         CPU counts OK (QEMU fix, but KVM masks
                                   the QEMU bug anyway)
  1    0        0        0         boot with broken CPU counts (original
                                   QEMU bug)
  1    0        1        0         broken CPU count caught (boot hangs)
  1    0        1        1         broken CPU count caught, bug check
                                   overridden, boot continues
  1    1        0        0         CPU counts OK (QEMU fix)
  1    1        1        0         CPU counts OK (QEMU fix)

Cc: Ard Biesheuvel
Cc: Brijesh Singh
Cc: Erdem Aktas
Cc: Gerd Hoffmann
Cc: James Bottomley
Cc: Jiewen Yao
Cc: Jordan Justen
Cc: Michael Brown
Cc: Min Xu
Cc: Oliver Steffen
Cc: Sebastien Boeuf
Cc: Tom Lendacky
Bugzilla: https://bugzilla.tianocore.org/show_bug.cgi?id=4250
Signed-off-by: Laszlo Ersek
Message-Id: <20230119110131.91923-3-lersek@redhat.com>
Reviewed-by: Ard Biesheuvel
Hugely-appreciated-by: Michael Brown
Acked-by: Gerd Hoffmann
---

diff --git a/OvmfPkg/Library/PlatformInitLib/Platform.c b/OvmfPkg/Library/PlatformInitLib/Platform.c
index d1be5c2d79..9fee6e4810 100644
--- a/OvmfPkg/Library/PlatformInitLib/Platform.c
+++ b/OvmfPkg/Library/PlatformInitLib/Platform.c
@@ -36,6 +36,9 @@
 #include
 
+#define CPUHP_BUGCHECK_OVERRIDE_FWCFG_FILE \
+  "opt/org.tianocore/X-Cpuhp-Bugcheck-Override"
+
 VOID
 EFIAPI
 PlatformAddIoMemoryBaseSizeHob (
@@ -437,6 +440,87 @@ PlatformCpuCountBugCheck (
 {
   ASSERT (*BootCpuCount > 0);
 
+  //
+  // Sanity check: we need at least 1 present CPU (CPU#0 is always present).
+  //
+  // The legacy-to-modern switching of the CPU hotplug register block got broken
+  // (for TCG) in QEMU v5.1.0. Refer to "IO port write width clamping differs
+  // between TCG and KVM" at
+  //
+  // or at
+  // .
+  //
+  // QEMU received the fix in commit dab30fbef389 ("acpi: cpuhp: fix
+  // guest-visible maximum access size to the legacy reg block", 2023-01-08), to
+  // be included in QEMU v8.0.0.
+  //
+  // If we're affected by this QEMU bug, then we must not continue: it confuses
+  // the multiprocessing in UefiCpuPkg/Library/MpInitLib, and breaks CPU
+  // hot(un)plug with SMI in OvmfPkg/CpuHotplugSmm.
+  //
+  if (*Present == 0) {
+    UINTN                      Idx;
+    STATIC CONST CHAR8 *CONST  Message[] = {
+      "Broken CPU hotplug register block found. Update QEMU to version 8+, or",
+      "to a stable release with commit dab30fbef389 backported. Refer to",
+      ".",
+      "Consequences of the QEMU bug may include, but are not limited to:",
+      "- all firmware logic, dependent on the CPU hotplug register block,",
+      "  being confused, for example, multiprocessing-related logic;",
+      "- guest OS data loss, including filesystem corruption, due to crash or",
+      "  hang during ACPI S3 resume;",
+      "- SMM privilege escalation, by a malicious guest OS or 3rd party UEFI",
+      "  agent, against the platform firmware.",
+      "These symptoms need not necessarily be limited to the QEMU user",
+      "attempting to hot(un)plug a CPU.",
+      "The firmware will now stop (hang) deliberately, in order to prevent the",
+      "above symptoms.",
+      "You can forcibly override the hang, *at your own risk*, with the",
+      "following *experimental* QEMU command line option:",
+      "  -fw_cfg name=" CPUHP_BUGCHECK_OVERRIDE_FWCFG_FILE ",string=yes",
+      "Please only report such bugs that you can reproduce *without* the",
+      "override.",
+    };
+    RETURN_STATUS  ParseStatus;
+    BOOLEAN        Override;
+
+    DEBUG ((
+      DEBUG_ERROR,
+      "%a: Present=%u Possible=%u\n",
+      __FUNCTION__,
+      *Present,
+      *Possible
+      ));
+    for (Idx = 0; Idx < ARRAY_SIZE (Message); ++Idx) {
+      DEBUG ((DEBUG_ERROR, "%a: %a\n", __FUNCTION__, Message[Idx]));
+    }
+
+    ParseStatus = QemuFwCfgParseBool (
+                    CPUHP_BUGCHECK_OVERRIDE_FWCFG_FILE,
+                    &Override
+                    );
+    if (!RETURN_ERROR (ParseStatus) && Override) {
+      DEBUG ((
+        DEBUG_WARN,
+        "%a: \"%a\" active. You've been warned.\n",
+        __FUNCTION__,
+        CPUHP_BUGCHECK_OVERRIDE_FWCFG_FILE
+        ));
+      //
+      // The bug is in QEMU v5.1.0+, where we're not affected by the QEMU v2.7
+      // reset bug, so BootCpuCount from fw_cfg is reliable. Assume a fully
+      // populated topology, like when the modern CPU hotplug interface is
+      // unavailable.
+      //
+      *Present  = *BootCpuCount;
+      *Possible = *BootCpuCount;
+      return;
+    }
+
+    ASSERT (FALSE);
+    CpuDeadLoop ();
+  }
+
   //
   // Sanity check: fw_cfg and the modern CPU hotplug interface should expose the
   // same boot CPU count.
@@ -596,6 +680,9 @@ PlatformMaxCpuCountInitialization (
     } while (Selected > 0);
 
     PlatformCpuCountBugCheck (&BootCpuCount, &Present, &Possible);
+    ASSERT (Present > 0);
+    ASSERT (Present <= Possible);
+    ASSERT (BootCpuCount == Present);
 
     MaxCpuCount = Possible;
   }
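
Reviewer's note: the test matrix in the commit message and the new branch in PlatformCpuCountBugCheck() can be read as a small decision function. The following is only a minimal standalone sketch of that reading, on plain C types rather than the edk2 ones. The names BOOT_OUTCOME, ModelBootOutcome() and ModelBugCheck() are hypothetical and do not exist in edk2 or QEMU. The model also assumes that an OVMF build without this patch ignores the override, a combination the test table does not cover.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome labels, one per distinct "result" in the test table. */
typedef enum {
  OUTCOME_COUNTS_OK,    /* CPU counts are correct                        */
  OUTCOME_BROKEN_BOOT,  /* boot continues with broken CPU counts         */
  OUTCOME_HANG,         /* broken CPU count caught, boot hangs           */
  OUTCOME_OVERRIDDEN    /* broken count caught, overridden, boot goes on */
} BOOT_OUTCOME;

/* Model of the test matrix: which result each configuration produces. */
BOOT_OUTCOME
ModelBootOutcome (bool Tcg, bool QemuPatched, bool OvmfPatched, bool Override)
{
  /* Per the commit message, KVM masks the bug and the QEMU fix removes it. */
  bool BugVisible = Tcg && !QemuPatched;

  if (!BugVisible) {
    return OUTCOME_COUNTS_OK;
  }

  if (!OvmfPatched) {
    /* Assumption: unpatched firmware never looks at the override. */
    return OUTCOME_BROKEN_BOOT;
  }

  return Override ? OUTCOME_OVERRIDDEN : OUTCOME_HANG;
}

/*
 * Model of the Present == 0 branch added by the patch. Returns false where
 * the firmware would hang, true where it would carry on.
 */
bool
ModelBugCheck (
  unsigned  BootCpuCount,
  unsigned  *Present,
  unsigned  *Possible,
  bool      Override
  )
{
  assert (BootCpuCount > 0);  /* mirrors ASSERT (*BootCpuCount > 0) */

  if (*Present != 0) {
    return true;              /* this particular check does not fire */
  }

  if (Override) {
    /* Assume a fully populated topology, as the patch does. */
    *Present  = BootCpuCount;
    *Possible = BootCpuCount;
    return true;
  }

  return false;               /* the real code calls CpuDeadLoop () here */
}
```

With Present=0 and Possible=1 (the values the commit message reports for the bug), ModelBugCheck() reports a hang unless the override is set, in which case both counts are replaced with BootCpuCount.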