Multi-process QEMU
==================

.. note::

   This is the design document for multi-process QEMU. It does not
   necessarily reflect the status of the current implementation, which
   may lack features or be considerably different from what is described
   in this document. This document is still useful as a description of
   the goals and general direction of this feature.

   Please refer to the following wiki for latest details:
   https://wiki.qemu.org/Features/MultiProcessQEMU

QEMU is often used as the hypervisor for virtual machines running in the
Oracle cloud. Since one of the advantages of cloud computing is the
ability to run many VMs from different tenants in the same cloud
infrastructure, a guest that compromised its hypervisor could
potentially use the hypervisor's access privileges to access data it is
not authorized for.

QEMU can be susceptible to security attacks because it is a large,
monolithic program that provides many features to the VMs it services.
Many of these features can be configured out of QEMU, but even a reduced
configuration QEMU has a large amount of code a guest can potentially
attack. Separating QEMU into multiple processes reduces the attack
surface by helping to limit each component in the system to only the
resources it needs to perform its job.

QEMU services
-------------

QEMU can be broadly described as providing three main services. One is a
VM control point, where VMs can be created, migrated, re-configured, and
destroyed. A second is to emulate the CPU instructions within the VM,
often accelerated by HW virtualization features such as Intel's VT
extensions. Finally, it provides IO services to the VM by emulating HW
IO devices, such as disk and network devices.

A multi-process QEMU
~~~~~~~~~~~~~~~~~~~~

A multi-process QEMU involves separating QEMU services into separate
host processes. Each of these processes can be given only the privileges
it needs to provide its service, e.g., a disk service could be given
access only to the disk images it provides, and not be allowed to
access other files, or any network devices. An attacker who compromised
this service would not be able to use this exploit to access files or
devices beyond what the disk service was given access to.

A QEMU control process would remain, but in multi-process mode, it would
have no direct interfaces to the VM. During VM execution, it would still
provide the user interface to hot-plug devices or live migrate the VM.

A first step in creating a multi-process QEMU is to separate IO services
from the main QEMU program, which would continue to provide CPU
emulation; i.e., the control process would also be the CPU emulation
process. In a later phase, CPU emulation could be separated from the
control process.

Separating IO services
----------------------

Separating IO services into individual host processes is a good place to
begin for a couple of reasons. One is that the sheer number of IO devices
QEMU can emulate provides a large surface of interfaces which could
potentially be exploited, and, indeed, have been a source of exploits in
the past. Another is that the modular nature of QEMU device emulation
code provides interface points where the QEMU functions that perform
device emulation can be separated from the QEMU functions that manage
the emulation of guest CPU instructions. The devices emulated in the
separate process are referred to as remote devices.

QEMU device emulation
~~~~~~~~~~~~~~~~~~~~~

QEMU uses an object-oriented SW architecture for device emulation code.
Configured objects are all compiled into the QEMU binary, then objects
are instantiated by name when used by the guest VM. For example, the
code to emulate a device named "foo" is always present in QEMU, but its
instantiation code is only run when the device is included in the target
VM (e.g., via the QEMU command line as *-device foo*).

The object model is hierarchical, so device emulation code names its
parent object (such as "pci-device" for a PCI device) and QEMU will
instantiate a parent object before calling the device's instantiation
code.

Current separation models
~~~~~~~~~~~~~~~~~~~~~~~~~

In order to separate the device emulation code from the CPU emulation
code, the device object code must run in a different process. There are
a couple of existing QEMU features that can run emulation code
separately from the main QEMU process. These are examined below.

vhost user model
^^^^^^^^^^^^^^^^

Virtio guest device drivers can be connected to vhost user applications
in order to perform their IO operations. This model uses special virtio
device drivers in the guest and vhost user device objects in QEMU, but
once the QEMU vhost user code has configured the vhost user application,
mission-mode IO is performed by the application. The vhost user
application is a daemon process that can be contacted via a known UNIX
domain socket.

vhost socket
''''''''''''

As mentioned above, one of the tasks of the vhost device object within
QEMU is to contact the vhost application and send it configuration
information about this device instance. As part of the configuration
process, the application can also be sent other file descriptors over
the socket, which then can be used by the vhost user application in
various ways, some of which are described below.

vhost MMIO store acceleration
'''''''''''''''''''''''''''''

VMs are often run using HW virtualization features via the KVM kernel
driver. This driver allows QEMU to accelerate the emulation of guest CPU
instructions by running the guest in a virtual HW mode. When the guest
executes instructions that cannot be executed by virtual HW mode,
execution returns to the KVM driver so it can inform QEMU to emulate the
instructions in SW.

One of the events that can cause a return to QEMU is when a guest device
driver accesses an IO location. QEMU then dispatches the memory
operation to the corresponding QEMU device object. In the case of a
vhost user device, the memory operation would need to be sent over a
socket to the vhost application. This path is accelerated by the QEMU
virtio code by setting up an eventfd file descriptor over which the
vhost application can receive MMIO store notifications directly from the
KVM driver, instead of needing them to be sent to the QEMU process
first.
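
As a rough illustration of this mechanism, the hedged sketch below
registers an eventfd with KVM so that guest stores to a doorbell address
signal the eventfd directly. The ``eventfd()`` call and the
*KVM_IOEVENTFD* ioctl are real kernel interfaces; the doorbell address
and length are made-up example values::

    #include <stdint.h>
    #include <sys/eventfd.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Ask KVM to signal "efd" when the guest stores to the doorbell
     * address, instead of exiting back to the QEMU process. */
    static int setup_ioeventfd(int vm_fd)
    {
        int efd = eventfd(0, EFD_CLOEXEC);
        if (efd < 0) {
            return -1;
        }

        struct kvm_ioeventfd io = {
            .addr  = 0xfe003000, /* hypothetical MMIO doorbell address */
            .len   = 4,          /* match 32-bit guest stores */
            .fd    = efd,
            .flags = 0,          /* no datamatch: any stored value triggers */
        };
        if (ioctl(vm_fd, KVM_IOEVENTFD, &io) < 0) {
            return -1;
        }
        return efd; /* this fd is passed to the vhost user application */
    }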

vhost interrupt acceleration
''''''''''''''''''''''''''''

Another optimization used by the vhost application is the ability to
directly inject interrupts into the VM via the KVM driver, again
bypassing the need to send the interrupt back to the QEMU process first.
The QEMU virtio setup code configures the KVM driver with an eventfd
that triggers the device interrupt in the guest when the eventfd is
written. This irqfd file descriptor is then passed to the vhost user
application program.
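
A matching sketch for the irqfd setup follows; *KVM_IRQFD* and ``struct
kvm_irqfd`` are real kernel interfaces, while the GSI number would come
from the device's configured interrupt line::

    #include <stdint.h>
    #include <sys/eventfd.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Bind an eventfd to a guest interrupt line (GSI) so that writing
     * to the eventfd injects the interrupt without a round trip
     * through the QEMU process. */
    static int setup_irqfd(int vm_fd, uint32_t gsi)
    {
        int efd = eventfd(0, EFD_CLOEXEC);
        if (efd < 0) {
            return -1;
        }

        struct kvm_irqfd irqfd = {
            .fd  = efd,
            .gsi = gsi, /* guest interrupt to trigger */
        };
        if (ioctl(vm_fd, KVM_IRQFD, &irqfd) < 0) {
            return -1;
        }
        return efd; /* this fd is passed to the vhost user application */
    }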

vhost access to guest memory
''''''''''''''''''''''''''''

The vhost application is also allowed to directly access guest memory,
instead of needing to send the data as messages to QEMU. This is also
done with file descriptors sent to the vhost user application by QEMU.
These descriptors can be passed to ``mmap()`` by the vhost application
to map the guest address space into the vhost application.
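
For example, each region described in a memory table message might be
mapped as below; the size and offset come from the message, and the
descriptor itself arrives as ancillary data on the socket::

    #include <stddef.h>
    #include <sys/types.h>
    #include <sys/mman.h>

    /* Map one region of guest memory from a descriptor received over
     * the vhost user socket. */
    static void *map_guest_region(int fd, size_t size, off_t offset)
    {
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, offset);
        return p == MAP_FAILED ? NULL : p;
    }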

IOMMUs introduce another level of complexity, since the address given to
the guest virtio device to DMA to or from is not a guest physical
address. This case is handled by having vhost code within QEMU register
as a listener for IOMMU mapping changes. The vhost application maintains
a cache of IOMMU translations: sending translation requests back to
QEMU on cache misses, and in turn receiving flush requests from QEMU
when mappings are purged.

applicability to device separation
''''''''''''''''''''''''''''''''''

Much of the vhost model can be re-used by separated device emulation. In
particular, the ideas of using a socket between QEMU and the device
emulation application, using a file descriptor to inject interrupts into
the VM via KVM, and allowing the application to ``mmap()`` the guest
should be re-used.

There are, however, some notable differences between how a vhost
application works and the needs of separated device emulation. The most
basic is that vhost uses custom virtio device drivers which always
trigger IO with MMIO stores. A separated device emulation model must
work with existing IO device models and guest device drivers. MMIO loads
break vhost store acceleration since they are synchronous - guest
progress cannot continue until the load has been emulated. By contrast,
stores are asynchronous; the guest can continue after the store event
has been sent to the vhost application.

Another difference is that in the vhost user model, a single daemon can
support multiple QEMU instances. This is contrary to the security regime
desired, in which the emulation application should only be allowed to
access the files or devices the VM it's running on behalf of can access.

qemu-io model
^^^^^^^^^^^^^

``qemu-io`` is a test harness used to test changes to the QEMU block backend
object code (e.g., the code that implements disk images for disk driver
emulation). ``qemu-io`` is not a device emulation application per se, but it
does compile the QEMU block objects into a separate binary from the main
QEMU one. This could be useful for disk device emulation, since its
emulation applications will need to include the QEMU block objects.

New separation model based on proxy objects
-------------------------------------------

A different model based on proxy objects in the QEMU program
communicating with remote emulation programs could provide separation
while minimizing the changes needed to the device emulation code. The
rest of this section is a discussion of how a proxy object model would
work.

Remote emulation processes
~~~~~~~~~~~~~~~~~~~~~~~~~~

The remote emulation process will run the QEMU object hierarchy without
modification. The device emulation objects will also be based on the
QEMU code, because for anything but the simplest device, it would not be
tractable to re-implement both the object model and the many device
backends that QEMU has.

The processes will communicate with the QEMU process over UNIX domain
sockets. The processes can be executed either as standalone processes,
or be executed by QEMU. In both cases, the host backends the emulation
processes will provide are specified on their command lines, as they
would be for QEMU. For example:

::

    disk-proc -blockdev driver=file,node-name=file0,filename=disk-file0 \
        -blockdev driver=qcow2,node-name=drive0,file=file0

would indicate that process *disk-proc* uses a qcow2 emulated disk named
*drive0* (backed by the image file *disk-file0*) as its backend.

Emulation processes may emulate more than one guest controller. A common
configuration might be to put all controllers of the same device class
(e.g., disk, network, etc.) in a single process, so that all backends of
the same type can be managed by a single QMP monitor.

communication with QEMU
^^^^^^^^^^^^^^^^^^^^^^^

The first argument to the remote emulation process will be a Unix domain
socket that connects with the Proxy object. This is a required argument.

::

    disk-proc <socket number> <backend list>

remote process QMP monitor
^^^^^^^^^^^^^^^^^^^^^^^^^^

Remote emulation processes can be monitored via QMP, similar to QEMU
itself. The QMP monitor socket is specified in the same way as for a
QEMU process:

::

    disk-proc -qmp unix:/tmp/disk-mon,server

can be monitored over the UNIX socket path */tmp/disk-mon*.

QEMU command line
~~~~~~~~~~~~~~~~~

Each remote device emulated in a remote process on the host is
represented as a *-device* of type *pci-proxy-dev*. A socket
sub-option to this option specifies the Unix socket that connects
to the remote process. An *id* sub-option is required, and it should
be the same id as used in the remote process.

::

    qemu-system-x86_64 ... -device pci-proxy-dev,id=lsi0,socket=3

can be used to add a device emulated in a remote process.

QEMU management of remote processes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

QEMU is not aware of the type of the remote PCI device. It is a
pass-through device as far as QEMU is concerned.

communication with emulation process
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

primary channel
'''''''''''''''

The primary channel (referred to as com in the code) is used to bootstrap
the remote process. It is also used to pass on device-agnostic commands
like reset.

per-device channels
'''''''''''''''''''

Each remote device communicates with QEMU using a dedicated communication
channel. The proxy object sets up this channel using the primary
channel during its initialization.

QEMU device proxy objects
~~~~~~~~~~~~~~~~~~~~~~~~~

QEMU has an object model based on sub-classes inherited from the
"object" super-class. The sub-classes that are of interest here are the
"device" and "bus" sub-classes whose child sub-classes make up the
device tree of a QEMU emulated system.

The proxy object model will use device proxy objects to replace the
device emulation code within the QEMU process. These objects will live
in the same place in the object and bus hierarchies as the objects they
replace; i.e., the proxy object for an LSI SCSI controller will be a
sub-class of the "pci-device" class, and will have the same PCI bus
parent and the same SCSI bus child objects as the LSI controller object
it replaces.

It is worth noting that the same proxy object is used to mediate with
all types of remote PCI devices.

object initialization
^^^^^^^^^^^^^^^^^^^^^

The Proxy device objects are initialized in the exact same manner in
which any other QEMU device would be initialized.

In addition, the Proxy objects perform the following two tasks:

- Parse the "socket" sub-option and connect to the remote process
  using this channel
- Use the "id" sub-option to connect to the emulated device on the
  separate process

class\_init
'''''''''''

The ``class_init()`` method of a proxy object will, in general, behave
similarly to the object it replaces, including setting any static
properties and methods needed by the proxy.

instance\_init / realize
''''''''''''''''''''''''

The ``instance_init()`` and ``realize()`` functions would only need to
perform tasks related to being a proxy, such as registering its own
MMIO handlers, or creating a child bus that other proxy devices can be
attached to later.

Other tasks will be device-specific. For example, PCI device objects
will initialize the PCI config space in order to make a valid PCI device
tree within the QEMU process.

address space registration
^^^^^^^^^^^^^^^^^^^^^^^^^^

Most devices are driven by guest device driver accesses to IO addresses
or ports. The QEMU device emulation code uses QEMU's memory region
function calls (such as ``memory_region_init_io()``) to add callback
functions that QEMU will invoke when the guest accesses the device's
areas of the IO address space. When a guest driver does access the
device, the VM will exit HW virtualization mode and return to QEMU,
which will then look up and execute the corresponding callback function.

A proxy object would need to mirror the memory region calls the actual
device emulator would perform in its initialization code, but with its
own callbacks. When invoked by QEMU as a result of a guest IO operation,
they will forward the operation to the device emulation process.
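
A hedged sketch of such a mirrored region follows. The memory region and
PCI calls are QEMU's own APIs; ``ProxyDevice``, ``proxy_send_mmio()``,
the *MMIO_LOAD*/*MMIO_STORE* tags, and the BAR size are assumptions made
for illustration::

    #include "qemu/osdep.h"
    #include "exec/memory.h"
    #include "hw/pci/pci.h"

    static uint64_t proxy_bar_read(void *opaque, hwaddr addr, unsigned size)
    {
        ProxyDevice *dev = opaque;

        /* Forward the load to the emulation process and wait for the
         * reply carrying the load data. */
        return proxy_send_mmio(dev, MMIO_LOAD, addr, 0, size);
    }

    static void proxy_bar_write(void *opaque, hwaddr addr,
                                uint64_t val, unsigned size)
    {
        ProxyDevice *dev = opaque;

        /* Stores can be forwarded without waiting for a reply. */
        proxy_send_mmio(dev, MMIO_STORE, addr, val, size);
    }

    static const MemoryRegionOps proxy_bar_ops = {
        .read = proxy_bar_read,
        .write = proxy_bar_write,
        .endianness = DEVICE_NATIVE_ENDIAN,
    };

    /* In realize(): mirror the BAR the real device would register,
     * but with forwarding callbacks. */
    memory_region_init_io(&dev->bar0, OBJECT(dev), &proxy_bar_ops, dev,
                          "proxy-bar0", 0x1000 /* example BAR size */);
    pci_register_bar(&dev->pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY,
                     &dev->bar0);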

PCI config space
^^^^^^^^^^^^^^^^

PCI devices also have a configuration space that can be accessed by the
guest driver. Guest accesses to this space are not handled by the device
emulation object, but by its PCI parent object. Much of this space is
read-only, but certain registers (especially BAR and MSI-related ones)
need to be propagated to the emulation process.

PCI parent proxy
''''''''''''''''

One way to propagate guest PCI config accesses is to create a
"pci-device-proxy" class that can serve as the parent of a PCI device
proxy object. This class's parent would be "pci-device" and it would
override the PCI parent's ``config_read()`` and ``config_write()``
methods with ones that forward these operations to the emulation
program.
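
A minimal sketch of that override, assuming hypothetical
``proxy_config_read()`` and ``proxy_config_write()`` forwarding helpers
(``PCIDeviceClass`` and its ``config_read``/``config_write`` hooks are
real QEMU interfaces)::

    #include "qemu/osdep.h"
    #include "hw/pci/pci.h"

    static void pci_proxy_dev_class_init(ObjectClass *klass, void *data)
    {
        PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);

        /* Forward config space accesses to the emulation process. */
        k->config_read  = proxy_config_read;
        k->config_write = proxy_config_write;
    }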

interrupt receipt
^^^^^^^^^^^^^^^^^

A proxy for a device that generates interrupts will need to create a
socket to receive interrupt indications from the emulation process. An
incoming interrupt indication would then be sent up to its bus parent to
be injected into the guest. For example, a PCI device object may use
``pci_set_irq()``.

live migration
^^^^^^^^^^^^^^

The proxy will register to save and restore any *vmstate* it needs over
a live migration event. The device proxy does not need to manage the
remote device's *vmstate*; that will be handled by the remote process
proxy (see below).

QEMU remote device operation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Generic device operations, such as DMA, will be performed by the remote
process proxy by sending messages to the remote process.

DMA operations
^^^^^^^^^^^^^^

DMA operations would be handled much like vhost applications do. One of
the initial messages sent to the emulation process is a guest memory
table. Each entry in this table consists of a file descriptor and size
that the emulation process can ``mmap()`` to directly access guest
memory, similar to ``vhost_user_set_mem_table()``. Note that guest
memory must be backed by file descriptors, such as when QEMU is given
the *-mem-path* command line option.
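
One plausible shape for a memory table entry, modeled on the vhost user
memory table (the struct name and fields are assumptions; the file
descriptors themselves would travel as ancillary data on the socket)::

    #include <stdint.h>

    typedef struct GuestMemEntry {
        uint64_t guest_phys_addr; /* guest physical base of the region */
        uint64_t size;            /* region length in bytes */
        uint64_t fd_offset;       /* offset of the region within the fd */
    } GuestMemEntry;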

IOMMU operations
^^^^^^^^^^^^^^^^

When the emulated system includes an IOMMU, the remote process proxy in
QEMU will need to create a socket for IOMMU requests from the emulation
process. It will handle those requests with an
``address_space_get_iotlb_entry()`` call. In order to handle IOMMU
unmaps, the remote process proxy will also register as a listener on the
device's DMA address space. When an IOMMU memory region is created
within the DMA address space, an IOMMU notifier for unmaps will be added
to the memory region that will forward unmaps to the emulation process
over the IOMMU socket.
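
A hedged sketch of the request side; ``IOMMUReq`` and
``proxy_send_iommu_reply()`` are assumed message helpers, while
``address_space_get_iotlb_entry()`` and
``pci_device_iommu_address_space()`` are real QEMU calls (the former's
argument list varies slightly across QEMU versions)::

    #include "qemu/osdep.h"
    #include "exec/memory.h"
    #include "hw/pci/pci.h"

    static void handle_iommu_request(ProxyDevice *dev, IOMMUReq *req)
    {
        AddressSpace *as = pci_device_iommu_address_space(&dev->pci_dev);
        IOMMUTLBEntry entry;

        /* Translate the device address (IOVA) to a guest physical
         * address on behalf of the emulation process. */
        entry = address_space_get_iotlb_entry(as, req->iova,
                                              req->is_write,
                                              MEMTXATTRS_UNSPECIFIED);

        /* Reply with entry.translated_addr, entry.addr_mask and
         * entry.perm over the IOMMU socket. */
        proxy_send_iommu_reply(dev, &entry);
    }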

device hot-plug via QMP
^^^^^^^^^^^^^^^^^^^^^^^

A QMP "device\_add" command can add a device emulated by a remote
process. It will also have a "rid" option to the command, just as the
*-device* command line option does. The remote process may either be one
started at QEMU startup, or be one added by the "add-process" QMP
command described above. In either case, the remote process proxy will
forward the new device's JSON description to the corresponding emulation
process.
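
Such a command might look as follows; the exact argument names are
illustrative, mirroring the *-device* sub-options described earlier::

    { "execute": "device_add",
      "arguments": {
          "driver": "pci-proxy-dev",
          "id": "lsi0",
          "socket": 3
      }
    }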

live migration
^^^^^^^^^^^^^^

The remote process proxy will also register for live migration
notifications with ``vmstate_register()``. When called to save state,
the proxy will send the remote process a secondary socket file
descriptor to save the remote process's device *vmstate* over. The
incoming byte stream length and data will be saved as the proxy's
*vmstate*. When the proxy is resumed on its new host, this *vmstate*
will be extracted, and a secondary socket file descriptor will be sent
to the new remote process through which it receives the *vmstate* in
order to restore the devices there.
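
The proxy's own *vmstate* could then be described along these lines; the
``VMStateDescription`` machinery is QEMU's own, while the field names
and ``ProxyDevice`` type are assumptions::

    #include "qemu/osdep.h"
    #include "migration/vmstate.h"

    static const VMStateDescription vmstate_pci_proxy = {
        .name = "pci-proxy-dev",
        .version_id = 1,
        .minimum_version_id = 1,
        .fields = (VMStateField[]) {
            /* Length of the byte stream received from the remote
             * process, followed by the stream itself. */
            VMSTATE_UINT32(remote_state_len, ProxyDevice),
            VMSTATE_VBUFFER_ALLOC_UINT32(remote_state, ProxyDevice, 1,
                                         NULL, remote_state_len),
            VMSTATE_END_OF_LIST()
        }
    };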

device emulation in remote process
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The parts of QEMU that the emulation program will need include the
object model; the memory emulation objects; the device emulation objects
of the targeted device, and any dependent devices; and the device's
backends. It will also need code to set up the machine environment,
handle requests from the QEMU process, and route machine-level requests
(such as interrupts or IOMMU mappings) back to the QEMU process.

initialization
^^^^^^^^^^^^^^

The process initialization sequence will follow the same sequence
followed by QEMU. It will first initialize the backend objects, then
device emulation objects. The JSON descriptions sent by the QEMU process
will drive which objects need to be created.

- address spaces

Before the device objects are created, the initial address spaces and
memory regions must be configured with ``memory_map_init()``. This
creates a RAM memory region object (*system\_memory*) and an IO memory
region object (*system\_io*).

- RAM

RAM memory region creation will follow how ``pc_memory_init()`` creates
them, but must use ``memory_region_init_ram_from_fd()`` instead of
``memory_region_allocate_system_memory()``. The file descriptors needed
will be supplied by the guest memory table from above, as sketched
below. Those RAM regions would then be added to the *system\_memory*
memory region with ``memory_region_add_subregion()``.
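
A hedged sketch using the ``GuestMemEntry`` fields assumed earlier
(``ent->fd`` being the descriptor received as ancillary data); note that
the exact ``memory_region_init_ram_from_fd()`` signature has changed
across QEMU versions::

    MemoryRegion *ram = g_new0(MemoryRegion, 1);

    /* Back a RAM region with the fd received in the memory table. */
    memory_region_init_ram_from_fd(ram, NULL, "guest-ram", ent->size,
                                   RAM_SHARED, ent->fd, ent->fd_offset,
                                   &error_fatal);
    memory_region_add_subregion(get_system_memory(),
                                ent->guest_phys_addr, ram);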

- PCI

IO initialization will be driven by the JSON descriptions sent from the
QEMU process. For a PCI device, a PCI bus will need to be created with
``pci_root_bus_new()``, and a PCI memory region will need to be created
and added to the *system\_memory* memory region with
``memory_region_add_subregion_overlap()``. The overlap version is
required for architectures where PCI memory overlaps with RAM memory.

MMIO handling
^^^^^^^^^^^^^

The device emulation objects will use ``memory_region_init_io()`` to
install their MMIO handlers, and ``pci_register_bar()`` to associate
those handlers with a PCI BAR, as they do within QEMU currently.

In order to use ``address_space_rw()`` in the emulation process to
handle MMIO requests from QEMU, the PCI physical addresses must be the
same in the QEMU process and the device emulation process. In order to
accomplish that, guest BAR programming must also be forwarded from QEMU
to the emulation process.
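
Servicing a forwarded MMIO request might then look like the following
hedged sketch, where ``MMIOReq`` is an assumed message format and
``address_space_rw()`` is QEMU's own::

    #include "qemu/osdep.h"
    #include "exec/memory.h"

    static void handle_mmio_request(MMIOReq *req)
    {
        uint8_t buf[8];

        if (req->is_write) {
            memcpy(buf, &req->data, req->len);
        }

        /* BARs hold the same guest physical addresses as in the QEMU
         * process, so the address resolves to the MMIO handlers the
         * device emulation objects installed. */
        address_space_rw(&address_space_memory, req->addr,
                         MEMTXATTRS_UNSPECIFIED, buf, req->len,
                         req->is_write);

        if (!req->is_write) {
            memcpy(&req->data, buf, req->len);
            /* reply to QEMU with the load data */
        }
    }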

interrupt injection
^^^^^^^^^^^^^^^^^^^

When device emulation wants to inject an interrupt into the VM, the
request climbs the device's bus object hierarchy until the point where a
bus object knows how to signal the interrupt to the guest. The details
depend on the type of interrupt being raised.

- PCI pin interrupts

On x86 systems, there is an emulated IOAPIC object attached to the root
PCI bus object, and the root PCI object forwards interrupt requests to
it. The IOAPIC object, in turn, calls the KVM driver to inject the
corresponding interrupt into the VM. The simplest way to handle this in
an emulation process would be to set up the root PCI bus driver (via
``pci_bus_irqs()``) to send an interrupt request back to the QEMU
process, and have the device proxy object reflect it up the PCI tree
there.

- PCI MSI/X interrupts

PCI MSI/X interrupts are implemented in HW as DMA writes to a
CPU-specific PCI address. In QEMU on x86, a KVM APIC object receives
these DMA writes, then calls into the KVM driver to inject the interrupt
into the VM. A simple emulation process implementation would be to send
the MSI DMA address from QEMU as a message at initialization, then
install an address space handler at that address which forwards the MSI
message back to QEMU.

DMA operations
^^^^^^^^^^^^^^

When an emulation object wants to DMA into or out of guest memory, it
first must use ``dma_memory_map()`` to convert the DMA address to a
local virtual address. The emulation process memory region objects set
up above will be used to translate the DMA address to a local virtual
address the device emulation code can access.
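
For example (a sketch only; ``dma_as``, ``dma_addr`` and ``xfer_len``
are assumed context, and the ``dma_memory_map()`` argument list has
grown a ``MemTxAttrs`` parameter in newer QEMU versions)::

    #include "qemu/osdep.h"
    #include "sysemu/dma.h"

    /* Copy a device buffer into guest memory at "dma_addr". */
    dma_addr_t len = xfer_len;
    void *host = dma_memory_map(dma_as, dma_addr, &len,
                                DMA_DIRECTION_FROM_DEVICE,
                                MEMTXATTRS_UNSPECIFIED);
    if (host) {
        memcpy(host, src, len); /* device -> guest RAM */
        dma_memory_unmap(dma_as, host, len,
                         DMA_DIRECTION_FROM_DEVICE, len);
    }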

IOMMU
^^^^^

When an IOMMU is in use in QEMU, DMA translation uses IOMMU memory
regions to translate the DMA address to a guest physical address before
that physical address can be translated to a local virtual address. The
emulation process will need similar functionality.

- IOTLB cache

The emulation process will maintain a cache of recent IOMMU translations
(the IOTLB). When the ``translate()`` callback of an IOMMU memory region
is invoked, the IOTLB cache will be searched for an entry that will map
the DMA address to a guest PA. On a cache miss, a message will be sent
back to QEMU requesting the corresponding translation entry, which will
both be used to return a guest address and be added to the cache.

- IOTLB purge

The IOMMU emulation will also need to act on unmap requests from QEMU.
These happen when the guest IOMMU driver purges an entry from the
guest's translation table.

live migration
^^^^^^^^^^^^^^

When a remote process receives a live migration indication from QEMU, it
will set up a channel using the received file descriptor with
``qio_channel_socket_new_fd()``. This channel will be used to create a
*QEMUfile* that can be passed to ``qemu_save_device_state()`` to send
the process's device state back to QEMU. This method will be reversed on
restore - the channel will be passed to ``qemu_loadvm_state()`` to
restore the device state.
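
A save-side sketch; ``qio_channel_socket_new_fd()`` and
``qemu_save_device_state()`` are real QEMU functions, but the
channel-to-*QEMUfile* helper has been renamed across QEMU versions, so
treat ``qemu_file_new_output()`` as illustrative::

    QIOChannelSocket *sioc = qio_channel_socket_new_fd(fd, &error_fatal);
    QEMUFile *f = qemu_file_new_output(QIO_CHANNEL(sioc));

    /* Stream the device state over the received channel to QEMU. */
    qemu_save_device_state(f);
    qemu_fclose(f);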

Accelerating device emulation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The messages that are required to be sent between QEMU and the emulation
process can add considerable latency to IO operations. The optimizations
described below attempt to ameliorate this effect by allowing the
emulation process to communicate directly with the kernel KVM driver.
The KVM file descriptors created would be passed to the emulation process
via initialization messages, much like the guest memory table is done.

MMIO acceleration
^^^^^^^^^^^^^^^^^

Vhost user applications can receive guest virtio driver stores directly
from KVM. The issue with the eventfd mechanism used by vhost user is
that it does not pass any data with the event indication, so it cannot
handle guest loads or guest stores that carry store data. This concept
could, however, be expanded to cover more cases.

The expanded idea would require a new type of KVM device:
*KVM\_DEV\_TYPE\_USER*. This device has two file descriptors: a master
descriptor that QEMU can use for configuration, and a slave descriptor
that the emulation process can use to receive MMIO notifications. QEMU
would create both descriptors using the KVM driver, and pass the slave
descriptor to the emulation process via an initialization message.

data structures
^^^^^^^^^^^^^^^

- guest physical range

The guest physical range structure describes the address range that a
device will respond to. It includes the base and length of the range, as
well as which bus the range resides on (e.g., on an x86 machine, it can
specify whether the range refers to memory or IO addresses).

A device can have multiple physical address ranges it responds to (e.g.,
a PCI device can have multiple BARs), so the structure will also include
an enumerated identifier to specify which of the device's ranges is
being referred to.

+--------+----------------------------+
| Name   | Description                |
+========+============================+
| addr   | range base address         |
+--------+----------------------------+
| len    | range length               |
+--------+----------------------------+
| bus    | addr type (memory or IO)   |
+--------+----------------------------+
| id     | range ID (e.g., PCI BAR)   |
+--------+----------------------------+

- MMIO request structure

This structure describes an MMIO operation. It includes which guest
physical range the MMIO was within, the offset within that range, the
MMIO type (e.g., load or store), and its length and data. It also
includes a sequence number that can be used to reply to the MMIO, and
the CPU that issued the MMIO.

+----------+------------------------+
| Name     | Description            |
+==========+========================+
| rid      | range MMIO is within   |
+----------+------------------------+
| offset   | offset within *rid*    |
+----------+------------------------+
| type     | e.g., load or store    |
+----------+------------------------+
| len      | MMIO length            |
+----------+------------------------+
| data     | store data             |
+----------+------------------------+
| seq      | sequence ID            |
+----------+------------------------+
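
Rendered as hedged C declarations (the struct names and field widths are
assumptions based on the two tables above)::

    #include <stdint.h>

    struct user_pa_range {  /* guest physical range */
        uint64_t addr;      /* range base address */
        uint64_t len;       /* range length */
        uint32_t bus;       /* address type (memory or IO) */
        uint32_t id;        /* range ID (e.g., PCI BAR) */
    };

    struct user_mmio_req {  /* MMIO request */
        uint32_t rid;       /* range the MMIO is within */
        uint32_t type;      /* load or store */
        uint64_t offset;    /* offset within the range */
        uint32_t len;       /* access length in bytes */
        uint32_t seq;       /* sequence ID used for the reply */
        uint64_t data;      /* store data, or load result */
    };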

- MMIO request queues

MMIO request queues are FIFO arrays of MMIO request structures. There
are two queues: the pending queue is for MMIOs that haven't been read by
the emulation program, and the sent queue is for MMIOs that haven't been
acknowledged. The main use of the second queue is to validate MMIO
replies from the emulation program.

- scoreboard

Each CPU in the VM is emulated in QEMU by a separate thread, so multiple
MMIOs may be waiting to be consumed by an emulation program and multiple
threads may be waiting for MMIO replies. The scoreboard would contain a
wait queue and sequence number for the per-CPU threads, allowing them to
be individually woken when the MMIO reply is received from the emulation
program. It also tracks the number of posted MMIO stores to the device
that haven't been replied to, in order to satisfy the PCI constraint
that a load to a device will not complete until all previous stores to
that device have been completed.

- device shadow memory

Some MMIO loads do not have device side-effects. These MMIOs can be
completed without sending an MMIO request to the emulation program if
the emulation program shares a shadow image of the device's memory image
with the KVM driver.

The emulation program will ask the KVM driver to allocate memory for the
shadow image, and will then use ``mmap()`` to directly access it. The
emulation program can control KVM access to the shadow image by sending
KVM an access map telling it which areas of the image have no
side-effects (and can be completed immediately), and which require an
MMIO request to the emulation program. The access map can also inform
the KVM driver which size accesses are allowed to the image.

master descriptor
^^^^^^^^^^^^^^^^^

The master descriptor is used by QEMU to configure the new KVM device.
The descriptor would be returned by the KVM driver when QEMU issues a
*KVM\_CREATE\_DEVICE* ``ioctl()`` with a *KVM\_DEV\_TYPE\_USER* type.

KVM\_DEV\_TYPE\_USER device ops
'''''''''''''''''''''''''''''''

The *KVM\_DEV\_TYPE\_USER* operations vector will be registered by a
``kvm_register_device_ops()`` call when the KVM system is initialized by
``kvm_init()``. These device ops are called by the KVM driver when QEMU
executes certain ``ioctl()`` operations on its KVM file descriptor. They
include:

- create

This routine is called when QEMU issues a *KVM\_CREATE\_DEVICE*
``ioctl()`` on its per-VM file descriptor. It will allocate and
initialize a KVM user device specific data structure, and assign the
*kvm\_device* private field to it.

- ioctl

This routine is invoked when QEMU issues an ``ioctl()`` on the master
descriptor. The ``ioctl()`` commands supported are defined by the KVM
device type. *KVM\_DEV\_TYPE\_USER* ones will need several commands:

*KVM\_DEV\_USER\_SLAVE\_FD* creates the slave file descriptor that will
be passed to the device emulation program. Only one slave can be created
by each master descriptor. The file operations performed by this
descriptor are described below, and a usage sketch follows this list.

The *KVM\_DEV\_USER\_PA\_RANGE* command configures a guest physical
address range that the slave descriptor will receive MMIO notifications
for. The range is specified by a guest physical range structure
argument. For buses that assign addresses to devices dynamically, this
command can be executed while the guest is running, such as the case
when a guest changes a device's PCI BAR registers.

*KVM\_DEV\_USER\_PA\_RANGE* will use ``kvm_io_bus_register_dev()`` to
register *kvm\_io\_device\_ops* callbacks to be invoked when the guest
performs an MMIO operation within the range. When a range is changed,
``kvm_io_bus_unregister_dev()`` is used to remove the previous
instantiation.

*KVM\_DEV\_USER\_TIMEOUT* will configure a timeout value that specifies
how long KVM will wait for the emulation process to respond to an MMIO
indication.

- destroy

This routine is called when the VM instance is destroyed. It will need
to destroy the slave descriptor and free any memory allocated by the
driver, as well as the *kvm\_device* structure itself.
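
The sketch below shows how QEMU might drive these operations;
*KVM_CREATE_DEVICE* and ``struct kvm_create_device`` are existing kernel
interfaces, while *KVM_DEV_TYPE_USER* and *KVM_DEV_USER_SLAVE_FD* are
the proposal's additions and do not exist today::

    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Create the proposed user device and fetch the slave descriptor
     * that will be handed to the emulation process. */
    static int create_user_device(int vm_fd)
    {
        struct kvm_create_device cd = {
            .type = KVM_DEV_TYPE_USER, /* proposed device type */
        };

        if (ioctl(vm_fd, KVM_CREATE_DEVICE, &cd) < 0) {
            return -1;
        }
        /* cd.fd is the master descriptor; ask it for a slave fd. */
        return ioctl(cd.fd, KVM_DEV_USER_SLAVE_FD, 0);
    }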

slave descriptor
^^^^^^^^^^^^^^^^

The slave descriptor will have its own file operations vector, which
responds to system calls on the descriptor performed by the device
emulation program.

- read

A read returns any pending MMIO requests from the KVM driver as MMIO
request structures. Multiple structures can be returned if there are
multiple MMIO operations pending. The MMIO requests are moved from the
pending queue to the sent queue, and if there are threads waiting for
space in the pending queue to add new MMIO operations, they will be
woken here. A sketch of the emulation program's service loop over this
descriptor follows the *mmap* description below.

- write

A write also consists of a set of MMIO requests. They are compared to
the MMIO requests in the sent queue. Matches are removed from the sent
queue, and any threads waiting for the reply are woken. If a store is
removed, then the number of posted stores in the per-CPU scoreboard is
decremented. When the number is zero, and a non-side-effect load was
waiting for posted stores to complete, the load is continued.

- ioctl

There are several ioctl()s that can be performed on the slave
descriptor.

A *KVM\_DEV\_USER\_SHADOW\_SIZE* ``ioctl()`` causes the KVM driver to
allocate memory for the shadow image. This memory can later be
``mmap()``\ ed by the emulation process to share the emulation's view of
device memory with the KVM driver.

A *KVM\_DEV\_USER\_SHADOW\_CTRL* ``ioctl()`` controls access to the
shadow image. It will send the KVM driver a shadow control map, which
specifies which areas of the image can complete guest loads without
sending the load request to the emulation program. It will also specify
the size of load operations that are allowed.

- poll

An emulation program will use the ``poll()`` call with a *POLLIN* flag
to determine if there are MMIO requests waiting to be read. It will
return if the pending MMIO request queue is not empty.

- mmap

This call allows the emulation program to directly access the shadow
image allocated by the KVM driver. As device emulation updates device
memory, changes with no side-effects will be reflected in the shadow,
and the KVM driver can satisfy guest loads from the shadow image without
needing to wait for the emulation program.
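
The service loop referenced under *read* above might look like this
sketch, reusing the assumed ``struct user_mmio_req`` and the
``handle_mmio_request()`` helper; it reads one request at a time for
simplicity, and the slave descriptor semantics are the proposal's, not
an existing kernel interface::

    #include <poll.h>
    #include <unistd.h>

    static void mmio_service_loop(int slave_fd)
    {
        struct pollfd pfd = { .fd = slave_fd, .events = POLLIN };

        for (;;) {
            struct user_mmio_req req;

            poll(&pfd, 1, -1); /* wait for a pending MMIO request */
            if (read(slave_fd, &req, sizeof(req)) != sizeof(req)) {
                continue;
            }
            handle_mmio_request(&req); /* emulate the access */

            /* The reply is matched against the sent queue; stores
             * decrement the posted-store count in the scoreboard. */
            write(slave_fd, &req, sizeof(req));
        }
    }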

kvm\_io\_device ops
^^^^^^^^^^^^^^^^^^^

Each KVM per-CPU thread can handle MMIO operations on behalf of the
guest VM. KVM will use the MMIO's guest physical address to search for a
matching *kvm\_io\_device* to see if the MMIO can be handled by the KVM
driver instead of exiting back to QEMU. If a match is found, the
corresponding callback will be invoked.

- read

This callback is invoked when the guest performs a load to the device.
Loads with side-effects must be handled synchronously, with the KVM
driver putting the QEMU thread to sleep waiting for the emulation
process reply before re-starting the guest. Loads that do not have
side-effects may be optimized by satisfying them from the shadow image,
if there are no outstanding stores to the device by this CPU. PCI memory
ordering demands that a load cannot complete before all older stores to
the same device have been completed.

- write

Stores can be handled asynchronously unless the pending MMIO request
queue is full. In this case, the QEMU thread must sleep waiting for
space in the queue. Stores will increment the number of posted stores in
the per-CPU scoreboard, in order to implement the PCI ordering
constraint above.

interrupt acceleration
^^^^^^^^^^^^^^^^^^^^^^

This performance optimization would work much like a vhost user
application does, where the QEMU process sets up *eventfds* that cause
the device's corresponding interrupt to be triggered by the KVM driver.
These irq file descriptors are sent to the emulation process at
initialization, and are used when the emulation code raises a device
interrupt.

intx acceleration
'''''''''''''''''

Traditional PCI pin interrupts are level-based, so, in addition to an
irq file descriptor, a re-sampling file descriptor needs to be sent to
the emulation program. This second file descriptor allows multiple
devices sharing an irq to be notified when the interrupt has been
acknowledged by the guest, so they can re-trigger the interrupt if their
device has not de-asserted its interrupt.

intx irq descriptor
"""""""""""""""""""

The irq descriptors are created by the proxy object using
``event_notifier_init()`` to create the irq and re-sampling *eventfds*,
and ``kvm_vm_ioctl(KVM_IRQFD)`` to bind them to an interrupt. The
interrupt route can be found with ``pci_device_route_intx_to_irq()``.

intx routing changes
""""""""""""""""""""

Intx routing can be changed when the guest programs the APIC the device
pin is connected to. The proxy object in QEMU will use
``pci_device_set_intx_routing_notifier()`` to be informed of any guest
changes to the route. This handler will broadly follow the VFIO
interrupt logic to change the route: de-assigning the existing irq
descriptor from its route, then assigning it the new route (see
``vfio_intx_update()``).

MSI/X acceleration
''''''''''''''''''

MSI/X interrupts are sent as DMA transactions to the host. The interrupt
data contains a vector that is programmed by the guest. A device may
have multiple MSI interrupts associated with it, so multiple irq
descriptors may need to be sent to the emulation program.

MSI/X irq descriptor
""""""""""""""""""""

This case will also follow the VFIO example. For each MSI/X interrupt,
an *eventfd* is created, a virtual interrupt is allocated by
``kvm_irqchip_add_msi_route()``, and the virtual interrupt is bound to
the eventfd with ``kvm_irqchip_add_irqfd_notifier()``.
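
A hedged sketch of that sequence; the ``kvm_irqchip_add_msi_route()``
argument list differs between QEMU versions, and the notifier array and
vector index are assumptions::

    #include "qemu/osdep.h"
    #include "sysemu/kvm.h"

    /* Allocate a virtual MSI route for vector "nr" and bind it to an
     * eventfd that will be sent to the emulation process. */
    EventNotifier *n = &dev->msi_notifiers[nr];

    event_notifier_init(n, 0);
    int virq = kvm_irqchip_add_msi_route(kvm_state, nr, &dev->pci_dev);
    kvm_irqchip_add_irqfd_notifier_gsi(kvm_state, n, NULL, virq);
    /* event_notifier_get_fd(n) is the fd handed to the emulation
     * process. */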

MSI/X config space changes
""""""""""""""""""""""""""

The guest may dynamically update several MSI-related tables in the
device's PCI config space. These include per-MSI interrupt enables and
vector data. Additionally, MSI/X tables exist in device memory space,
not config space. Much like the BAR case above, the proxy object must
look at guest config space programming to keep the MSI interrupt state
consistent between QEMU and the emulation program.

--------------

Disaggregated CPU emulation
---------------------------

After IO services have been disaggregated, a second phase would be to
separate a process to handle CPU instruction emulation from the main
QEMU control function. There are no object separation points for this
code, so the first task would be to create one.

Host access controls
--------------------

Separating QEMU relies on the host OS's access restriction mechanisms to
enforce that the differing processes can only access the objects they
are entitled to. There are a couple of types of mechanisms usually
provided by general purpose OSs.

Discretionary access control
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Discretionary access control allows each user to control who can access
their files. In Linux, this type of control is usually too coarse for
QEMU separation, since it only provides three separate access controls:
one for the same user ID, the second for user IDs with the same group
ID, and the third for all other user IDs. Each device instance would
need a separate user ID to provide access control, which is likely to be
unwieldy for dynamically created VMs.

Mandatory access control
~~~~~~~~~~~~~~~~~~~~~~~~

Mandatory access control allows the OS to add an additional set of
controls on top of discretionary access. It also adds other attributes
to processes and files such as types, roles, and categories, and can
establish rules for how processes and files can interact.

Type enforcement
^^^^^^^^^^^^^^^^

Type enforcement assigns a *type* attribute to processes and files, and
allows rules to be written on what operations a process with a given
type can perform on a file with a given type. QEMU separation could take
advantage of type enforcement by running the emulation processes with
different types, both from the main QEMU process, and from the emulation
processes of different classes of devices.

For example, guest disk images and disk emulation processes could have
types separate from the main QEMU process and non-disk emulation
processes, and the type rules could prevent processes other than disk
emulation ones from accessing guest disk images. Similarly, network
emulation processes can have a type separate from the main QEMU process
and non-network emulation processes, and only that type can access the
host tun/tap device used to provide guest networking.

Category enforcement
^^^^^^^^^^^^^^^^^^^^

Category enforcement assigns a set of numbers within a given range to
the process or file. The process is granted access to the file if the
process's set is a superset of the file's set. This enforcement can be
used to separate multiple instances of devices in the same class.

For example, if there are multiple disk devices provided to a guest,
each device emulation process could be provisioned with a separate
category. The different device emulation processes would not be able to
access each other's backing disk images.

Alternatively, categories could be used in lieu of the type enforcement
scheme described above. In this scenario, different categories would be
used to prevent device emulation processes in different classes from
accessing resources assigned to other classes.