cherry-pick scheduler fix to avoid temporary VM freezes on NUMA hosts

author Friedrich Weber <f.weber@proxmox.com>

Wed, 17 Jan 2024 14:45:21 +0000 (15:45 +0100)

committer Thomas Lamprecht <t.lamprecht@proxmox.com>

Wed, 14 Feb 2024 10:10:25 +0000 (11:10 +0100)
diff --git a/patches/kernel/0018-sched-core-Drop-spinlocks-on-contention-iff-kernel-i.patch b/patches/kernel/0018-sched-core-Drop-spinlocks-on-contention-iff-kernel-i.patch
new file mode 100644
index 0000000..932e2f2
--- /dev/null
+++ b/patches/kernel/0018-sched-core-Drop-spinlocks-on-contention-iff-kernel-i.patch
@@ -0,0 +1,78 @@
+From 39f2bfe0177d3f56c9feac4e70424e4952949e2a Mon Sep 17 00:00:00 2001
+From: Sean Christopherson <seanjc@google.com>
+Date: Wed, 10 Jan 2024 13:47:23 -0800
+Subject: [PATCH] sched/core: Drop spinlocks on contention iff kernel is
+ preemptible
+
+Use preempt_model_preemptible() to detect a preemptible kernel when
+deciding whether or not to reschedule in order to drop a contended
+spinlock or rwlock.  Because PREEMPT_DYNAMIC selects PREEMPTION, kernels
+built with PREEMPT_DYNAMIC=y will yield contended locks even if the live
+preemption model is "none" or "voluntary".  In short, make kernels with
+dynamically selected models behave the same as kernels with statically
+selected models.
+
+Somewhat counter-intuitively, NOT yielding a lock can provide better
+latency for the relevant tasks/processes.  E.g. KVM x86's mmu_lock, a
+rwlock, is often contended between an invalidation event (takes mmu_lock
+for write) and a vCPU servicing a guest page fault (takes mmu_lock for
+read).  For _some_ setups, letting the invalidation task complete even
+if there is mmu_lock contention provides lower latency for *all* tasks,
+i.e. the invalidation completes sooner *and* the vCPU services the guest
+page fault sooner.
+
+But even KVM's mmu_lock behavior isn't uniform, e.g. the "best" behavior
+can vary depending on the host VMM, the guest workload, the number of
+vCPUs, the number of pCPUs in the host, why there is lock contention, etc.
+
+In other words, simply deleting the CONFIG_PREEMPTION guard (or doing the
+opposite and removing contention yielding entirely) needs to come with a
+big pile of data proving that changing the status quo is a net positive.
+
+Cc: Valentin Schneider <valentin.schneider@arm.com>
+Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
+Cc: Marco Elver <elver@google.com>
+Cc: Frederic Weisbecker <frederic@kernel.org>
+Cc: David Matlack <dmatlack@google.com>
+Signed-off-by: Sean Christopherson <seanjc@google.com>
+---
+ include/linux/sched.h | 14 ++++++--------
+ 1 file changed, 6 insertions(+), 8 deletions(-)
+
+diff --git a/include/linux/sched.h b/include/linux/sched.h
+index 292c31697248..a274bc85f222 100644
+--- a/include/linux/sched.h
++++ b/include/linux/sched.h
+@@ -2234,11 +2234,10 @@ static inline bool preempt_model_preemptible(void)
+  */
+ static inline int spin_needbreak(spinlock_t *lock)
+ {
+-#ifdef CONFIG_PREEMPTION
++      if (!preempt_model_preemptible())
++              return 0;
++
+       return spin_is_contended(lock);
+-#else
+-      return 0;
+-#endif
+ }
+ 
+ /*
+@@ -2251,11 +2250,10 @@ static inline int spin_needbreak(spinlock_t *lock)
+  */
+ static inline int rwlock_needbreak(rwlock_t *lock)
+ {
+-#ifdef CONFIG_PREEMPTION
++      if (!preempt_model_preemptible())
++              return 0;
++
+       return rwlock_is_contended(lock);
+-#else
+-      return 0;
+-#endif
+ }
+ 
+ static __always_inline bool need_resched(void)
+-- 
+2.39.2
+
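For readers who want to see the behavioural difference outside the kernel tree, here is a minimal userspace sketch of the before and after logic of spin_needbreak() / rwlock_needbreak(). Everything in it is a hypothetical stand-in: `struct fake_lock`, `enum preempt_model`, `needbreak_old()` and `needbreak_new()` are illustration names, not kernel APIs, and the real preempt_model_preemptible() may also cover preemption models (such as PREEMPT_RT) that this sketch leaves out.

```c
#include <stdbool.h>

/* Hypothetical model of the live preemption mode selected at boot or runtime. */
enum preempt_model { PREEMPT_NONE, PREEMPT_VOLUNTARY, PREEMPT_FULL };

/* Hypothetical lock: only tracks whether another task is waiting for it. */
struct fake_lock {
	bool contended;
};

/*
 * Before the patch: the decision is made at build time. config_preemption
 * stands in for "#ifdef CONFIG_PREEMPTION", which PREEMPT_DYNAMIC=y always
 * selects, regardless of the live preemption model.
 */
static int needbreak_old(const struct fake_lock *lock, bool config_preemption)
{
	if (!config_preemption)
		return 0;

	return lock->contended;
}

/*
 * After the patch: the decision follows the live model, so "none" and
 * "voluntary" never ask the lock holder to yield a contended lock.
 */
static int needbreak_new(const struct fake_lock *lock, enum preempt_model live)
{
	if (live != PREEMPT_FULL)
		return 0;

	return lock->contended;
}
```

Under these assumptions, a PREEMPT_DYNAMIC=y kernel running with the "voluntary" model corresponds to needbreak_old(lock, true), which reports a contended lock as needing a break, and to needbreak_new(lock, PREEMPT_VOLUNTARY), which does not. That is the case the commit message describes for mmu_lock holders on such hosts.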