Commit | Line | Data |
---|---|---|
27abe577 KC |
1 | ============================== |
2 | Running nested guests with KVM | |
3 | ============================== | |
4 | ||
5 | Nested virtualization is the ability to run a guest inside another | |
6 | guest (which can be KVM-based or a different hypervisor). The straightforward | |
7 | example is a KVM guest that in turn runs on a KVM guest (the rest of | |
8 | this document is built on this example):: | |
9 | ||
10 | .----------------. .----------------. | |
11 | | | | | | |
12 | | L2 | | L2 | | |
13 | | (Nested Guest) | | (Nested Guest) | | |
14 | | | | | | |
15 | |----------------'--'----------------| | |
16 | | | | |
17 | | L1 (Guest Hypervisor) | | |
18 | | KVM (/dev/kvm) | | |
19 | | | | |
20 | .------------------------------------------------------. | |
21 | | L0 (Host Hypervisor) | | |
22 | | KVM (/dev/kvm) | | |
23 | |------------------------------------------------------| | |
24 | | Hardware (with virtualization extensions) | | |
25 | '------------------------------------------------------' | |
26 | ||
27 | Terminology: | |
28 | ||
29 | - L0 – level-0; the bare metal host, running KVM | |
30 | ||
31 | - L1 – level-1 guest; a VM running on L0; also called the "guest | |
32 | hypervisor", as it itself is capable of running KVM. | |
33 | ||
34 | - L2 – level-2 guest; a VM running on L1, this is the "nested guest" | |
35 | ||
36 | .. note:: The above diagram is modelled after the x86 architecture; | |
37 | s390x, ppc64 and other architectures are likely to have | |
38 | a different design for nesting. | |
39 | ||
40 | For example, s390x always has an LPAR (LogicalPARtition) | |
41 | hypervisor running on bare metal, adding another layer and | |
42 | resulting in at least four levels in a nested setup — L0 (bare | |
43 | metal, running the LPAR hypervisor), L1 (host hypervisor), L2 | |
44 | (guest hypervisor), L3 (nested guest). | |
45 | ||
46 | This document will stick with the three-level terminology (L0, | |
47 | L1, and L2) for all architectures; and will largely focus on | |
48 | x86. | |
49 | ||
50 | ||
51 | Use Cases | |
52 | --------- | |
53 | ||
54 | There are several scenarios where nested KVM can be useful, to name a | |
55 | few: | |
56 | ||
57 | - As a developer, you want to test your software on different operating | |
58 | systems (OSes). Instead of renting multiple VMs from a Cloud | |
59 | Provider, using nested KVM lets you rent a large enough "guest | |
60 | hypervisor" (level-1 guest). This in turn allows you to create | |
61 | multiple nested guests (level-2 guests), running different OSes, on | |
62 | which you can develop and test your software. | |
63 | ||
64 | - Live migration of "guest hypervisors" and their nested guests, for | |
65 | load balancing, disaster recovery, etc. | |
66 | ||
67 | - VM image creation tools (e.g. ``virt-install``, etc) often run | |
68 | their own VM, and users expect these to work inside a VM. | |
69 | ||
70 | - Some OSes use virtualization internally for security (e.g. to let | |
71 | applications run safely in isolation). | |
72 | ||
73 | ||
74 | Enabling "nested" (x86) | |
75 | ----------------------- | |
76 | ||
77 | From Linux kernel v4.19 onwards, the ``nested`` KVM parameter is enabled | |
78 | by default for Intel and AMD. (Though your Linux distribution might | |
79 | override this default.) | |
80 | ||
81 | In case you are running a Linux kernel older than v4.19, to enable | |
82 | nesting, set the ``nested`` KVM module parameter to ``Y`` or ``1``. To | |
83 | persist this setting across reboots, you can add it in a config file, as | |
84 | shown below: | |
85 | ||
86 | 1. On the bare metal host (L0), list the kernel modules and ensure that | |
87 | the KVM modules are loaded:: | |
88 | ||
89 | $ lsmod | grep -i kvm | |
90 | kvm_intel 133627 0 | |
91 | kvm 435079 1 kvm_intel | |
92 | ||
93 | 2. Show information for the ``kvm_intel`` module:: | |
94 | ||
95 | $ modinfo kvm_intel | grep -i nested | |
96 | parm: nested:bool | |
97 | ||
98 | 3. For the nested KVM configuration to persist across reboots, place the | |
99 | below in ``/etc/modprobe.d/kvm_intel.conf`` (create the file if it | |
100 | doesn't exist):: | |
101 | ||
102 | $ cat /etc/modprobe.d/kvm_intel.conf | |
103 | options kvm-intel nested=y | |
104 | ||
105 | 4. Unload and re-load the KVM Intel module:: | |
106 | ||
107 | $ sudo rmmod kvm-intel | |
108 | $ sudo modprobe kvm-intel | |
109 | ||
110 | 5. Verify if the ``nested`` parameter for KVM is enabled:: | |
111 | ||
112 | $ cat /sys/module/kvm_intel/parameters/nested | |
113 | Y | |
114 | ||
115 | For AMD hosts, the process is the same as above, except that the module | |
116 | name is ``kvm-amd``. | |
117 | ||
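The verification step above can be scripted so that it works on both Intel and AMD hosts. The following is a minimal sketch, not part of the original procedure; the helper name ``nested_enabled`` is hypothetical, and it assumes an enabled parameter reads back as ``Y`` or ``1``.

```shell
#!/bin/sh
# Succeed if a KVM "nested" parameter file holds an enabled value.
# Returns 0 when enabled, 1 when disabled, 2 when the file is unreadable.
nested_enabled() {
    # $1: path to a parameter file,
    #     e.g. /sys/module/kvm_intel/parameters/nested
    [ -r "$1" ] || return 2
    case "$(cat "$1")" in
        Y|y|1) return 0 ;;
        *)     return 1 ;;
    esac
}

# Try both vendor modules; normally only the one matching the host CPU
# is loaded, so the other is silently skipped.
for mod in kvm_intel kvm_amd; do
    param="/sys/module/$mod/parameters/nested"
    if nested_enabled "$param"; then
        echo "$mod: nested enabled"
    elif [ -r "$param" ]; then
        echo "$mod: nested disabled"
    fi
done
```

If neither module is loaded, the loop prints nothing, which is itself a hint that KVM is not active on this host.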
118 | ||
119 | Additional nested-related kernel parameters (x86) | |
120 | ------------------------------------------------- | |
121 | ||
122 | If your hardware is sufficiently advanced (Intel Haswell processor or | |
123 | higher, which has newer hardware virt extensions), the following | |
124 | additional features will also be enabled by default: "Shadow VMCS | |
125 | (Virtual Machine Control Structure)", APIC Virtualization on your bare | |
126 | metal host (L0). Parameters for Intel hosts:: | |
127 | ||
128 | $ cat /sys/module/kvm_intel/parameters/enable_shadow_vmcs | |
129 | Y | |
130 | ||
131 | $ cat /sys/module/kvm_intel/parameters/enable_apicv | |
132 | Y | |
133 | ||
134 | $ cat /sys/module/kvm_intel/parameters/ept | |
135 | Y | |
136 | ||
137 | .. note:: If you suspect your L2 (i.e. nested guest) is running slower, | |
138 | ensure the above are enabled (particularly | |
139 | ``enable_shadow_vmcs`` and ``ept``). | |
140 | ||
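To check all of these parameters in one go, something like the sketch below may help. The function name ``show_kvm_params`` is hypothetical, and the default directory assumes an Intel L0 with ``kvm_intel`` loaded; parameters that are missing on a given kernel or CPU are reported rather than treated as errors.

```shell
#!/bin/sh
# Print the value of each nesting-related kvm_intel parameter.
# $1: parameters directory (defaults to the kvm_intel one on an Intel L0).
show_kvm_params() {
    dir="${1:-/sys/module/kvm_intel/parameters}"
    for p in nested enable_shadow_vmcs enable_apicv ept; do
        if [ -r "$dir/$p" ]; then
            printf '%-20s %s\n' "$p" "$(cat "$dir/$p")"
        else
            printf '%-20s %s\n' "$p" "(not available)"
        fi
    done
}

show_kvm_params
```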
141 | ||
142 | Starting a nested guest (x86) | |
143 | ----------------------------- | |
144 | ||
145 | Once your bare metal host (L0) is configured for nesting, you should be | |
146 | able to start an L1 guest with:: | |
147 | ||
148 | $ qemu-kvm -cpu host [...] | |
149 | ||
150 | The above will pass through the host CPU's capabilities as-is to the | |
151 | guest. Alternatively, for better live migration compatibility, use a named CPU | |
152 | model supported by QEMU, e.g.:: | |
153 | ||
154 | $ qemu-kvm -cpu Haswell-noTSX-IBRS,vmx=on | |
155 | ||
156 | The guest hypervisor will then be capable of running a | |
157 | nested guest with accelerated KVM. | |
158 | ||
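Once the L1 guest has booted, it can be useful to confirm from inside it that the virtualization extensions were actually exposed. The sketch below is an illustration rather than an official check; the helper name ``has_virt_flag`` is hypothetical. It looks for the ``vmx`` (Intel) or ``svm`` (AMD) CPU flag and for the ``/dev/kvm`` device node.

```shell
#!/bin/sh
# Succeed if the given cpuinfo file advertises hardware virtualization:
# "vmx" on Intel, "svm" on AMD.
# $1: cpuinfo file to inspect (defaults to /proc/cpuinfo).
has_virt_flag() {
    grep '^flags' "${1:-/proc/cpuinfo}" 2>/dev/null | grep -qwE 'vmx|svm'
}

# Run this inside the L1 guest.
if has_virt_flag && [ -e /dev/kvm ]; then
    echo "L1 exposes virtualization extensions and /dev/kvm"
else
    echo "no virtualization extensions or no /dev/kvm visible in this guest"
fi
```

If the flag is missing, revisit the ``-cpu`` option used to start L1 and the ``nested`` parameter on L0.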
159 | ||
160 | Enabling "nested" (s390x) | |
161 | ------------------------- | |
162 | ||
163 | 1. On the host hypervisor (L0), enable the ``nested`` parameter on | |
164 | s390x:: | |
165 | ||
166 | $ rmmod kvm | |
167 | $ modprobe kvm nested=1 | |
168 | ||
169 | .. note:: On s390x, the kernel parameter ``hpage`` is mutually exclusive | |
170 | with the ``nested`` parameter; i.e. to be able to enable | |
171 | ``nested``, the ``hpage`` parameter *must* be disabled. | |
172 | ||
173 | 2. The guest hypervisor (L1) must be provided with the ``sie`` CPU | |
174 | feature — with QEMU, this can be done by using "host passthrough" | |
175 | (via the command-line ``-cpu host``). | |
176 | ||
177 | 3. Now the KVM module can be loaded in the L1 (guest hypervisor):: | |
178 | ||
179 | $ modprobe kvm | |
180 | ||
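The interaction between ``hpage`` and ``nested`` can be checked on the s390x host before starting L1. This is a rough sketch under two assumptions: that both parameters are readable under ``/sys/module/kvm/parameters`` and that an enabled parameter reads back as ``1``. The function name ``check_s390x_nested`` is hypothetical.

```shell
#!/bin/sh
# Report whether nesting is usable on an s390x L0.
# $1: parameters directory; the default path is an assumption.
check_s390x_nested() {
    dir="${1:-/sys/module/kvm/parameters}"
    nested=$(cat "$dir/nested" 2>/dev/null)
    hpage=$(cat "$dir/hpage" 2>/dev/null)
    if [ "$hpage" = 1 ]; then
        # hpage and nested are mutually exclusive on s390x.
        echo "hpage is enabled; disable it before enabling nested"
        return 1
    fi
    if [ "$nested" = 1 ]; then
        echo "nested is enabled"
        return 0
    fi
    echo "nested is not enabled"
    return 1
}

check_s390x_nested || true
```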
181 | ||
182 | Live migration with nested KVM | |
183 | ------------------------------ | |
184 | ||
185 | Migrating an L1 guest, with a *live* nested guest in it, to another | |
186 | bare metal host, works as of Linux kernel 5.3 and QEMU 4.2.0 for | |
187 | Intel x86 systems, and even on older versions for s390x. | |
188 | ||
189 | On AMD systems, once an L1 guest has started an L2 guest, the L1 guest | |
190 | should no longer be migrated or saved (refer to QEMU documentation on | |
191 | "savevm"/"loadvm") until the L2 guest shuts down. Attempting to migrate | |
192 | or save-and-load an L1 guest while an L2 guest is running will result in | |
193 | undefined behavior. You might see a ``kernel BUG!`` entry in ``dmesg``, a | |
194 | kernel 'oops', or an outright kernel panic. Such a migrated or loaded L1 | |
195 | guest can no longer be considered stable or secure, and must be restarted. | |
196 | Migrating an L1 guest merely configured to support nesting, while not | |
197 | actually running L2 guests, is expected to function normally even on AMD | |
198 | systems but may fail once guests are started. | |
199 | ||
200 | Migrating an L2 guest is always expected to succeed, so all the following | |
201 | scenarios should work even on AMD systems: | |
202 | ||
203 | - Migrating a nested guest (L2) to another L1 guest on the *same* bare | |
204 | metal host. | |
205 | ||
206 | - Migrating a nested guest (L2) to another L1 guest on a *different* | |
207 | bare metal host. | |
208 | ||
209 | - Migrating a nested guest (L2) to a bare metal host. | |
210 | ||
211 | Reporting bugs from nested setups | |
212 | ----------------------------------- | |
213 | ||
214 | Debugging "nested" problems can involve sifting through log files across | |
215 | L0, L1 and L2; this can result in tedious back-n-forth between the bug | |
216 | reporter and the bug fixer. | |
217 | ||
218 | - Mention that you are in a "nested" setup. If you are running any kind | |
219 | of "nesting" at all, say so. Unfortunately, this needs to be called | |
220 | out because when reporting bugs, people tend to forget to even | |
221 | *mention* that they're using nested virtualization. | |
222 | ||
223 | - Ensure you are actually running KVM on KVM. Sometimes people do not | |
224 | have KVM enabled for their guest hypervisor (L1), which results in | |
225 | them running with pure emulation (what QEMU calls "TCG") while | |
226 | thinking they are running nested KVM. This confuses "nested virt" | |
227 | (which could also mean QEMU on KVM) with "nested KVM" (KVM on KVM). | |
228 | ||
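One quick way to tell KVM from TCG is to inspect the QEMU command line used to start the L2 guest. The sketch below is only a heuristic, and the helper name ``uses_kvm_accel`` is hypothetical. It looks for the ``-enable-kvm``, ``-accel kvm`` and ``accel=kvm`` spellings. A command line without any of these is not proof of TCG, since some distribution wrappers such as ``qemu-kvm`` may enable KVM by default.

```shell
#!/bin/sh
# Succeed if a QEMU command line (passed as a single string) explicitly
# requests KVM acceleration.
uses_kvm_accel() {
    case " $1 " in
        *" -enable-kvm "*|*"accel=kvm"*|*"-accel kvm"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example command line, made up for illustration.
cmdline="qemu-system-x86_64 -machine q35,accel=kvm -cpu host -m 2G"
if uses_kvm_accel "$cmdline"; then
    echo "KVM acceleration requested"
else
    echo "no explicit KVM acceleration; the guest may be running under TCG"
fi
```

For a running guest, the command line can be taken from the libvirt log or from the process list on L1.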
229 | Information to collect (generic) | |
230 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
231 | ||
232 | The following is not an exhaustive list, but a very good starting point: | |
233 | ||
234 | - Kernel, libvirt, and QEMU version from L0 | |
235 | ||
236 | - Kernel, libvirt and QEMU version from L1 | |
237 | ||
238 | - QEMU command-line of L1 -- when using libvirt, you'll find it here: | |
239 | ``/var/log/libvirt/qemu/instance.log`` | |
240 | ||
241 | - QEMU command-line of L2 -- as above, when using libvirt, get the | |
242 | complete libvirt-generated QEMU command-line | |
243 | ||
244 | - ``cat /proc/cpuinfo`` from L0 | |
245 | ||
246 | - ``cat /proc/cpuinfo`` from L1 | |
247 | ||
248 | - ``lscpu`` from L0 | |
249 | ||
250 | - ``lscpu`` from L1 | |
251 | ||
252 | - Full ``dmesg`` output from L0 | |
253 | ||
254 | - Full ``dmesg`` output from L1 | |
255 | ||
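Most of the generic items above can be gathered with a small script, run once on L0 and once on L1. The following is a sketch only: the function name ``collect_nested_info``, the output file names and the default directory are all hypothetical, and the QEMU binary name differs between distributions, so only binaries found in ``PATH`` are queried. The QEMU command lines still need to be copied from the libvirt logs by hand.

```shell
#!/bin/sh
# Collect generic debugging information into one directory.
# $1: output directory (the default name is arbitrary).
collect_nested_info() {
    out="${1:-nested-debug-info}"
    mkdir -p "$out" || return 1

    # Individual commands may fail (e.g. dmesg without privileges);
    # errors are captured in the output files instead of aborting.
    uname -r          > "$out/kernel-version.txt" 2>&1 || true
    cat /proc/cpuinfo > "$out/cpuinfo.txt"        2>&1 || true
    lscpu             > "$out/lscpu.txt"          2>&1 || true
    dmesg             > "$out/dmesg.txt"          2>&1 || true

    # QEMU and libvirt versions, for whichever binaries are installed.
    for bin in qemu-system-x86_64 qemu-kvm libvirtd; do
        if command -v "$bin" >/dev/null 2>&1; then
            "$bin" --version || true
        fi
    done > "$out/versions.txt" 2>&1

    return 0
}
```

Calling ``collect_nested_info l0-info`` on the host and ``collect_nested_info l1-info`` in the guest hypervisor yields two directories that can be attached to a bug report.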
256 | x86-specific info to collect | |
257 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
258 | ||
259 | Both the below commands, ``x86info`` and ``dmidecode``, should be | |
260 | available on most Linux distributions with the same name: | |
261 | ||
262 | - Output of: ``x86info -a`` from L0 | |
263 | ||
264 | - Output of: ``x86info -a`` from L1 | |
265 | ||
266 | - Output of: ``dmidecode`` from L0 | |
267 | ||
268 | - Output of: ``dmidecode`` from L1 | |
269 | ||
270 | s390x-specific info to collect | |
271 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
272 | ||
273 | Along with the earlier mentioned generic details, the below is | |
274 | also recommended: | |
275 | ||
276 | - ``/proc/sysinfo`` from L1; this will also include the info from L0 |