1 .. _vhost_user_proto:
2
3 ===================
4 Vhost-user Protocol
5 ===================
6
7 ..
8 Copyright 2014 Virtual Open Systems Sarl.
9 Copyright 2019 Intel Corporation
10 Licence: This work is licensed under the terms of the GNU GPL,
11 version 2 or later. See the COPYING file in the top-level
12 directory.
13
14 .. contents:: Table of Contents
15
16 Introduction
17 ============
18
This protocol aims to complement the ``ioctl`` interface used to
control the vhost implementation in the Linux kernel. It implements
the control plane needed to establish virtqueue sharing with a user
space process on the same host. It communicates over a Unix domain
socket, so that file descriptors can be shared in the ancillary data
of the message.
25
26 The protocol defines 2 sides of the communication, *master* and
27 *slave*. *Master* is the application that shares its virtqueues, in
28 our case QEMU. *Slave* is the consumer of the virtqueues.
29
30 In the current implementation QEMU is the *master*, and the *slave* is
31 the external process consuming the virtio queues, for example a
32 software Ethernet switch running in user space, such as Snabbswitch,
33 or a block device backend processing read & write to a virtual
34 disk. In order to facilitate interoperability between various backend
35 implementations, it is recommended to follow the :ref:`Backend program
36 conventions <backend_conventions>`.
37
38 *Master* and *slave* can be either a client (i.e. connecting) or
39 server (listening) in the socket communication.
40
41 Support for platforms other than Linux
42 --------------------------------------
43
44 While vhost-user was initially developed targeting Linux, nowadays it
45 is supported on any platform that provides the following features:
46
47 - A way for requesting shared memory represented by a file descriptor
48 so it can be passed over a UNIX domain socket and then mapped by the
49 other process.
50
51 - AF_UNIX sockets with SCM_RIGHTS, so QEMU and the other process can
52 exchange messages through it, including ancillary data when needed.
53
- Either eventfd or pipe/pipe2. On platforms where eventfd is not
  available, QEMU will automatically fall back to pipe2 or, as a last
  resort, pipe. Each file descriptor will be used for receiving or
  sending events by reading an 8-byte value from it or writing an
  8-byte value to it (respectively). The 8-byte value itself has no
  meaning and should not be interpreted.
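
As a minimal sketch of the event mechanism described above, the two
helpers below write and read the 8-byte value. The names ``ex_event_signal``
and ``ex_event_consume`` are hypothetical and are not part of QEMU; the
descriptor is assumed to be an eventfd or one end of a pipe.

.. code:: c

  #include <assert.h>
  #include <stdint.h>
  #include <unistd.h>

  /* Signal an event: write an 8-byte value whose content carries no meaning.
   * A non-zero value is used so that the same code also works on an eventfd. */
  static inline int ex_event_signal(int fd)
  {
      const uint64_t value = 1;

      assert(fd >= 0);
      return write(fd, &value, sizeof(value)) == (ssize_t)sizeof(value) ? 0 : -1;
  }

  /* Consume an event: read the 8-byte value and discard it. */
  static inline int ex_event_consume(int fd)
  {
      uint64_t value;

      assert(fd >= 0);
      return read(fd, &value, sizeof(value)) == (ssize_t)sizeof(value) ? 0 : -1;
  }

With a pipe, ``ex_event_signal()`` is called on the write end and
``ex_event_consume()`` on the read end; with an eventfd both operate on
the same descriptor.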
60
61 Message Specification
62 =====================
63
64 .. Note:: All numbers are in the machine native byte order.
65
66 A vhost-user message consists of 3 header fields and a payload.
67
68 +---------+-------+------+---------+
69 | request | flags | size | payload |
70 +---------+-------+------+---------+
71
72 Header
73 ------
74
75 :request: 32-bit type of the request
76
77 :flags: 32-bit bit field
78
79 - Lower 2 bits are the version (currently 0x01)
80 - Bit 2 is the reply flag - needs to be sent on each reply from the slave
81 - Bit 3 is the need_reply flag - see :ref:`REPLY_ACK <reply_ack>` for
82 details.
83
84 :size: 32-bit size of the payload
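
The header layout above can be sketched in C as follows. This is only an
illustration: the ``ex_``/``EX_`` names are hypothetical and not taken
from QEMU; only the field widths and bit positions come from the
description above.

.. code:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stdint.h>

  #define EX_VHOST_USER_VERSION_MASK    0x3u       /* lower 2 bits */
  #define EX_VHOST_USER_VERSION         0x1u       /* current version */
  #define EX_VHOST_USER_REPLY_MASK      (1u << 2)  /* bit 2: reply flag */
  #define EX_VHOST_USER_NEED_REPLY_MASK (1u << 3)  /* bit 3: need_reply flag */

  struct ex_vhost_user_hdr {
      uint32_t request;  /* type of the request */
      uint32_t flags;    /* version and reply bits */
      uint32_t size;     /* size of the payload that follows */
  };

  static_assert(sizeof(struct ex_vhost_user_hdr) == 12,
                "header is three 32-bit fields");

  static inline bool ex_hdr_version_ok(const struct ex_vhost_user_hdr *h)
  {
      return (h->flags & EX_VHOST_USER_VERSION_MASK) == EX_VHOST_USER_VERSION;
  }

  static inline bool ex_hdr_is_reply(const struct ex_vhost_user_hdr *h)
  {
      return (h->flags & EX_VHOST_USER_REPLY_MASK) != 0;
  }

  static inline bool ex_hdr_needs_reply(const struct ex_vhost_user_hdr *h)
  {
      return (h->flags & EX_VHOST_USER_NEED_REPLY_MASK) != 0;
  }

  /* Flags for a reply from the slave: version plus the reply flag. */
  static inline uint32_t ex_reply_flags(void)
  {
      return EX_VHOST_USER_VERSION | EX_VHOST_USER_REPLY_MASK;
  }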
85
86 Payload
87 -------
88
89 Depending on the request type, **payload** can be:
90
91 A single 64-bit integer
92 ^^^^^^^^^^^^^^^^^^^^^^^
93
94 +-----+
95 | u64 |
96 +-----+
97
98 :u64: a 64-bit unsigned integer
99
100 A vring state description
101 ^^^^^^^^^^^^^^^^^^^^^^^^^
102
103 +-------+-----+
104 | index | num |
105 +-------+-----+
106
107 :index: a 32-bit index
108
109 :num: a 32-bit number
110
111 A vring address description
112 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
113
114 +-------+-------+------+------------+------+-----------+-----+
115 | index | flags | size | descriptor | used | available | log |
116 +-------+-------+------+------------+------+-----------+-----+
117
118 :index: a 32-bit vring index
119
120 :flags: a 32-bit vring flags
121
122 :descriptor: a 64-bit ring address of the vring descriptor table
123
124 :used: a 64-bit ring address of the vring used ring
125
126 :available: a 64-bit ring address of the vring available ring
127
128 :log: a 64-bit guest address for logging
129
130 Note that a ring address is an IOVA if ``VIRTIO_F_IOMMU_PLATFORM`` has
131 been negotiated. Otherwise it is a user address.
132
133 Memory regions description
134 ^^^^^^^^^^^^^^^^^^^^^^^^^^
135
136 +-------------+---------+---------+-----+---------+
137 | num regions | padding | region0 | ... | region7 |
138 +-------------+---------+---------+-----+---------+
139
140 :num regions: a 32-bit number of regions
141
142 :padding: 32-bit
143
144 A region is:
145
146 +---------------+------+--------------+-------------+
147 | guest address | size | user address | mmap offset |
148 +---------------+------+--------------+-------------+
149
150 :guest address: a 64-bit guest address of the region
151
152 :size: a 64-bit size
153
154 :user address: a 64-bit user address
155
156 :mmap offset: 64-bit offset where region starts in the mapped memory
157
158 Single memory region description
159 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
160
161 +---------+---------------+------+--------------+-------------+
162 | padding | guest address | size | user address | mmap offset |
163 +---------+---------------+------+--------------+-------------+
164
165 :padding: 64-bit
166
167 :guest address: a 64-bit guest address of the region
168
169 :size: a 64-bit size
170
171 :user address: a 64-bit user address
172
173 :mmap offset: 64-bit offset where region starts in the mapped memory
174
175 Log description
176 ^^^^^^^^^^^^^^^
177
178 +----------+------------+
179 | log size | log offset |
180 +----------+------------+
181
182 :log size: size of area used for logging
183
184 :log offset: offset from start of supplied file descriptor where
185 logging starts (i.e. where guest address 0 would be
186 logged)
187
188 An IOTLB message
189 ^^^^^^^^^^^^^^^^
190
191 +------+------+--------------+-------------------+------+
192 | iova | size | user address | permissions flags | type |
193 +------+------+--------------+-------------------+------+
194
195 :iova: a 64-bit I/O virtual address programmed by the guest
196
197 :size: a 64-bit size
198
199 :user address: a 64-bit user address
200
201 :permissions flags: an 8-bit value:
202 - 0: No access
203 - 1: Read access
204 - 2: Write access
205 - 3: Read/Write access
206
207 :type: an 8-bit IOTLB message type:
208 - 1: IOTLB miss
209 - 2: IOTLB update
210 - 3: IOTLB invalidate
211 - 4: IOTLB access fail
212
213 Virtio device config space
214 ^^^^^^^^^^^^^^^^^^^^^^^^^^
215
216 +--------+------+-------+---------+
217 | offset | size | flags | payload |
218 +--------+------+-------+---------+
219
220 :offset: a 32-bit offset of virtio device's configuration space
221
222 :size: a 32-bit configuration space access size in bytes
223
224 :flags: a 32-bit value:
225 - 0: Vhost master messages used for writeable fields
226 - 1: Vhost master messages used for live migration
227
228 :payload: Size bytes array holding the contents of the virtio
229 device's configuration space
230
231 Vring area description
232 ^^^^^^^^^^^^^^^^^^^^^^
233
234 +-----+------+--------+
235 | u64 | size | offset |
236 +-----+------+--------+
237
:u64: a 64-bit integer containing the vring index and flags
239
240 :size: a 64-bit size of this area
241
242 :offset: a 64-bit offset of this area from the start of the
243 supplied file descriptor
244
245 Inflight description
246 ^^^^^^^^^^^^^^^^^^^^
247
248 +-----------+-------------+------------+------------+
249 | mmap size | mmap offset | num queues | queue size |
250 +-----------+-------------+------------+------------+
251
252 :mmap size: a 64-bit size of area to track inflight I/O
253
254 :mmap offset: a 64-bit offset of this area from the start
255 of the supplied file descriptor
256
257 :num queues: a 16-bit number of virtqueues
258
259 :queue size: a 16-bit size of virtqueues
260
261 C structure
262 -----------
263
264 In QEMU the vhost-user message is implemented with the following struct:
265
.. code:: c

  typedef struct VhostUserMsg {
      VhostUserRequest request;
      uint32_t flags;
      uint32_t size;
      union {
          uint64_t u64;
          struct vhost_vring_state state;
          struct vhost_vring_addr addr;
          VhostUserMemory memory;
          VhostUserLog log;
          struct vhost_iotlb_msg iotlb;
          VhostUserConfig config;
          VhostUserVringArea area;
          VhostUserInflight inflight;
      };
  } QEMU_PACKED VhostUserMsg;
284
285 Communication
286 =============
287
288 The protocol for vhost-user is based on the existing implementation of
289 vhost for the Linux Kernel. Most messages that can be sent via the
290 Unix domain socket implementing vhost-user have an equivalent ioctl to
291 the kernel implementation.
292
293 The communication consists of *master* sending message requests and
294 *slave* sending message replies. Most of the requests don't require
295 replies. Here is a list of the ones that do:
296
297 * ``VHOST_USER_GET_FEATURES``
298 * ``VHOST_USER_GET_PROTOCOL_FEATURES``
299 * ``VHOST_USER_GET_VRING_BASE``
300 * ``VHOST_USER_SET_LOG_BASE`` (if ``VHOST_USER_PROTOCOL_F_LOG_SHMFD``)
301 * ``VHOST_USER_GET_INFLIGHT_FD`` (if ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``)
302
303 .. seealso::
304
305 :ref:`REPLY_ACK <reply_ack>`
306 The section on ``REPLY_ACK`` protocol extension.
307
308 There are several messages that the master sends with file descriptors passed
309 in the ancillary data:
310
311 * ``VHOST_USER_SET_MEM_TABLE``
312 * ``VHOST_USER_SET_LOG_BASE`` (if ``VHOST_USER_PROTOCOL_F_LOG_SHMFD``)
313 * ``VHOST_USER_SET_LOG_FD``
314 * ``VHOST_USER_SET_VRING_KICK``
315 * ``VHOST_USER_SET_VRING_CALL``
316 * ``VHOST_USER_SET_VRING_ERR``
317 * ``VHOST_USER_SET_SLAVE_REQ_FD``
318 * ``VHOST_USER_SET_INFLIGHT_FD`` (if ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``)
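
On the receiving side, file descriptors passed this way arrive as
``SCM_RIGHTS`` control messages. The following is a minimal sketch using
the standard ``recvmsg()`` interface; the helper name ``ex_recv_with_fds``
and the ``EX_MAX_FDS`` bound are assumptions made for this example, not
QEMU definitions.

.. code:: c

  #include <assert.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <sys/types.h>
  #include <sys/uio.h>

  #define EX_MAX_FDS 8  /* assumption: upper bound chosen for this sketch */

  /*
   * Receive up to len bytes plus any descriptors attached as SCM_RIGHTS
   * ancillary data. Returns the number of bytes read or -1 on error;
   * *nfds is set to the number of descriptors stored in fds[].
   */
  static inline ssize_t ex_recv_with_fds(int sock, void *buf, size_t len,
                                         int fds[EX_MAX_FDS], size_t *nfds)
  {
      union {
          char buf[CMSG_SPACE(sizeof(int) * EX_MAX_FDS)];
          struct cmsghdr align;  /* forces suitable alignment */
      } control;
      struct iovec iov = { .iov_base = buf, .iov_len = len };
      struct msghdr msg = {
          .msg_iov = &iov,
          .msg_iovlen = 1,
          .msg_control = control.buf,
          .msg_controllen = sizeof(control.buf),
      };
      struct cmsghdr *cmsg;
      ssize_t ret;

      assert(buf != NULL && fds != NULL && nfds != NULL);
      *nfds = 0;

      ret = recvmsg(sock, &msg, 0);
      if (ret < 0) {
          return -1;
      }

      for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
          if (cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_RIGHTS) {
              size_t n = (cmsg->cmsg_len - CMSG_LEN(0)) / sizeof(int);

              if (n > EX_MAX_FDS) {
                  n = EX_MAX_FDS;
              }
              memcpy(fds, CMSG_DATA(cmsg), n * sizeof(int));
              *nfds = n;
          }
      }
      return ret;
  }

A real implementation would also need to handle short reads of the
message body and close any descriptors it does not expect.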
319
320 If *master* is unable to send the full message or receives a wrong
321 reply it will close the connection. An optional reconnection mechanism
322 can be implemented.
323
324 If *slave* detects some error such as incompatible features, it may also
325 close the connection. This should only happen in exceptional circumstances.
326
327 Any protocol extensions are gated by protocol feature bits, which
328 allows full backwards compatibility on both master and slave. As
329 older slaves don't support negotiating protocol features, a feature
330 bit was dedicated for this purpose::
331
332 #define VHOST_USER_F_PROTOCOL_FEATURES 30
333
334 Starting and stopping rings
335 ---------------------------
336
337 Client must only process each ring when it is started.
338
Client must only pass data between the ring and the backend when the
ring is enabled.

If a ring is started but disabled, client must process the ring without
talking to the backend.
344
345 For example, for a networking device, in the disabled state client
346 must not supply any new RX packets, but must process and discard any
347 TX packets.
348
349 If ``VHOST_USER_F_PROTOCOL_FEATURES`` has not been negotiated, the
350 ring is initialized in an enabled state.
351
352 If ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, the ring is
353 initialized in a disabled state. Client must not pass data to/from the
354 backend until ring is enabled by ``VHOST_USER_SET_VRING_ENABLE`` with
355 parameter 1, or after it has been disabled by
356 ``VHOST_USER_SET_VRING_ENABLE`` with parameter 0.
357
Each ring is initialized in a stopped state. Client must not process
it until the ring is started, or after it has been stopped.
360
361 Client must start ring upon receiving a kick (that is, detecting that
362 file descriptor is readable) on the descriptor specified by
363 ``VHOST_USER_SET_VRING_KICK`` or receiving the in-band message
364 ``VHOST_USER_VRING_KICK`` if negotiated, and stop ring upon receiving
365 ``VHOST_USER_GET_VRING_BASE``.
366
367 While processing the rings (whether they are enabled or not), client
368 must support changing some configuration aspects on the fly.
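
The started/stopped and enabled/disabled rules above can be summarized
as a small state tracker. This is a sketch only: ``struct ex_ring`` and
the ``ex_ring_*`` helpers are hypothetical names, and a real backend
tracks much more per-ring state.

.. code:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stdint.h>

  #define VHOST_USER_F_PROTOCOL_FEATURES 30

  struct ex_ring {
      bool started;  /* set by a kick, cleared by VHOST_USER_GET_VRING_BASE */
      bool enabled;  /* controlled by VHOST_USER_SET_VRING_ENABLE */
  };

  /* Rings start stopped; they start enabled only if protocol features
   * have not been negotiated. */
  static inline void ex_ring_init(struct ex_ring *r, uint64_t features)
  {
      assert(r != NULL);
      r->started = false;
      r->enabled = !((features >> VHOST_USER_F_PROTOCOL_FEATURES) & 1);
  }

  static inline void ex_ring_on_kick(struct ex_ring *r)
  {
      r->started = true;
  }

  static inline void ex_ring_on_get_vring_base(struct ex_ring *r)
  {
      r->started = false;
  }

  static inline void ex_ring_on_set_vring_enable(struct ex_ring *r, uint32_t num)
  {
      r->enabled = (num != 0);
  }

  /* The ring itself may be processed whenever it is started. */
  static inline bool ex_ring_may_process(const struct ex_ring *r)
  {
      return r->started;
  }

  /* Data may reach the backend only if the ring is started and enabled. */
  static inline bool ex_ring_may_use_backend(const struct ex_ring *r)
  {
      return r->started && r->enabled;
  }

For the networking example above, a ring for which ``ex_ring_may_process()``
is true but ``ex_ring_may_use_backend()`` is false would have its TX
packets processed and discarded.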
369
370 Multiple queue support
371 ----------------------
372
373 Many devices have a fixed number of virtqueues. In this case the master
374 already knows the number of available virtqueues without communicating with the
375 slave.
376
377 Some devices do not have a fixed number of virtqueues. Instead the maximum
378 number of virtqueues is chosen by the slave. The number can depend on host
379 resource availability or slave implementation details. Such devices are called
380 multiple queue devices.
381
382 Multiple queue support allows the slave to advertise the maximum number of
383 queues. This is treated as a protocol extension, hence the slave has to
384 implement protocol features first. The multiple queues feature is supported
385 only when the protocol feature ``VHOST_USER_PROTOCOL_F_MQ`` (bit 0) is set.
386
387 The max number of queues the slave supports can be queried with message
388 ``VHOST_USER_GET_QUEUE_NUM``. Master should stop when the number of requested
389 queues is bigger than that.
390
391 As all queues share one connection, the master uses a unique index for each
392 queue in the sent message to identify a specified queue.
393
394 The master enables queues by sending message ``VHOST_USER_SET_VRING_ENABLE``.
395 vhost-user-net has historically automatically enabled the first queue pair.
396
397 Slaves should always implement the ``VHOST_USER_PROTOCOL_F_MQ`` protocol
398 feature, even for devices with a fixed number of virtqueues, since it is simple
399 to implement and offers a degree of introspection.
400
401 Masters must not rely on the ``VHOST_USER_PROTOCOL_F_MQ`` protocol feature for
402 devices with a fixed number of virtqueues. Only true multiqueue devices
403 require this protocol feature.
404
405 Migration
406 ---------
407
During live migration, the master may need to track the modifications
the slave makes to the memory mapped regions. The client should mark
the dirty pages in a log. Once it complies with this logging, it may
declare the ``VHOST_F_LOG_ALL`` vhost feature.
412
413 To start/stop logging of data/used ring writes, server may send
414 messages ``VHOST_USER_SET_FEATURES`` with ``VHOST_F_LOG_ALL`` and
415 ``VHOST_USER_SET_VRING_ADDR`` with ``VHOST_VRING_F_LOG`` in ring's
416 flags set to 1/0, respectively.
417
All the modifications to memory pointed to by vring "descriptor" should
be marked. Modifications to "used" vring should be marked if
``VHOST_VRING_F_LOG`` is part of ring's flags.
421
422 Dirty pages are of size::
423
424 #define VHOST_LOG_PAGE 0x1000
425
426 The log memory fd is provided in the ancillary data of
427 ``VHOST_USER_SET_LOG_BASE`` message when the slave has
428 ``VHOST_USER_PROTOCOL_F_LOG_SHMFD`` protocol feature.
429
430 The size of the log is supplied as part of ``VhostUserMsg`` which
431 should be large enough to cover all known guest addresses. Log starts
432 at the supplied offset in the supplied file descriptor. The log
433 covers from address 0 to the maximum of guest regions. In pseudo-code,
434 to mark page at ``addr`` as dirty::
435
436 page = addr / VHOST_LOG_PAGE
437 log[page / 8] |= 1 << page % 8
438
439 Where ``addr`` is the guest physical address.
440
441 Use atomic operations, as the log may be concurrently manipulated.
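
A possible C11 rendering of the pseudo-code above, using an atomic OR so
that concurrent writers do not lose bits. The helper names are
hypothetical, and treating the mapped log as an array of
``_Atomic uint8_t`` is an assumption of this sketch rather than
something the protocol mandates.

.. code:: c

  #include <assert.h>
  #include <stdatomic.h>
  #include <stddef.h>
  #include <stdint.h>

  #define VHOST_LOG_PAGE 0x1000

  /* Set the dirty bit for one page number. */
  static inline void ex_log_page(_Atomic uint8_t *log, uint64_t page)
  {
      atomic_fetch_or(&log[page / 8], (uint8_t)(1u << (page % 8)));
  }

  /*
   * Mark every page touched by [addr, addr + len) as dirty, where addr is
   * a guest physical address and log_size is the log size in bytes.
   * Pages that fall outside the supplied log are ignored.
   */
  static inline void ex_log_write(_Atomic uint8_t *log, uint64_t log_size,
                                  uint64_t addr, uint64_t len)
  {
      uint64_t page, last;

      assert(log != NULL);
      if (len == 0) {
          return;
      }
      last = (addr + len - 1) / VHOST_LOG_PAGE;
      for (page = addr / VHOST_LOG_PAGE; page <= last; page++) {
          if (page / 8 >= log_size) {
              break;
          }
          ex_log_page(log, page);
      }
  }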
442
443 Note that when logging modifications to the used ring (when
444 ``VHOST_VRING_F_LOG`` is set for this ring), ``log_guest_addr`` should
445 be used to calculate the log offset: the write to first byte of the
446 used ring is logged at this offset from log start. Also note that this
447 value might be outside the legal guest physical address range
448 (i.e. does not have to be covered by the ``VhostUserMemory`` table), but
449 the bit offset of the last byte of the ring must fall within the size
450 supplied by ``VhostUserLog``.
451
``VHOST_USER_SET_LOG_FD`` is an optional message with an eventfd in
ancillary data. It may be used to inform the master that the log has
been modified.
455
456 Once the source has finished migration, rings will be stopped by the
457 source. No further update must be done before rings are restarted.
458
459 In postcopy migration the slave is started before all the memory has
460 been received from the source host, and care must be taken to avoid
461 accessing pages that have yet to be received. The slave opens a
462 'userfault'-fd and registers the memory with it; this fd is then
463 passed back over to the master. The master services requests on the
464 userfaultfd for pages that are accessed and when the page is available
465 it performs WAKE ioctl's on the userfaultfd to wake the stalled
466 slave. The client indicates support for this via the
467 ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` feature.
468
469 Memory access
470 -------------
471
472 The master sends a list of vhost memory regions to the slave using the
473 ``VHOST_USER_SET_MEM_TABLE`` message. Each region has two base
474 addresses: a guest address and a user address.
475
476 Messages contain guest addresses and/or user addresses to reference locations
477 within the shared memory. The mapping of these addresses works as follows.
478
479 User addresses map to the vhost memory region containing that user address.
480
481 When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has not been negotiated:
482
483 * Guest addresses map to the vhost memory region containing that guest
484 address.
485
486 When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has been negotiated:
487
488 * Guest addresses are also called I/O virtual addresses (IOVAs). They are
489 translated to user addresses via the IOTLB.
490
491 * The vhost memory region guest address is not used.
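
The region lookup described above might look like the following sketch.
``struct ex_mem_region`` is a hypothetical slave-side record, not a wire
format: it assumes the slave has already mapped each region's file
descriptor and stored a pointer to the start of the region in ``local``.

.. code:: c

  #include <assert.h>
  #include <stddef.h>
  #include <stdint.h>

  struct ex_mem_region {
      uint64_t guest_addr;  /* guest address of the region */
      uint64_t size;        /* size of the region in bytes */
      uint64_t user_addr;   /* user address in the master's address space */
      uint8_t *local;       /* where the region starts in this process */
  };

  /* Translate a guest address (no VIRTIO_F_IOMMU_PLATFORM) to a local pointer. */
  static inline void *ex_guest_to_local(const struct ex_mem_region *regions,
                                        size_t n, uint64_t guest_addr)
  {
      size_t i;

      assert(regions != NULL || n == 0);
      for (i = 0; i < n; i++) {
          const struct ex_mem_region *r = &regions[i];

          if (guest_addr >= r->guest_addr &&
              guest_addr - r->guest_addr < r->size) {
              return r->local + (guest_addr - r->guest_addr);
          }
      }
      return NULL;  /* not covered by any region */
  }

  /* Translate a user address to a local pointer. */
  static inline void *ex_user_to_local(const struct ex_mem_region *regions,
                                       size_t n, uint64_t user_addr)
  {
      size_t i;

      assert(regions != NULL || n == 0);
      for (i = 0; i < n; i++) {
          const struct ex_mem_region *r = &regions[i];

          if (user_addr >= r->user_addr &&
              user_addr - r->user_addr < r->size) {
              return r->local + (user_addr - r->user_addr);
          }
      }
      return NULL;
  }

When ``VIRTIO_F_IOMMU_PLATFORM`` has been negotiated, an IOVA would first
be translated to a user address via the IOTLB and then resolved with
``ex_user_to_local()``.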
492
493 IOMMU support
494 -------------
495
496 When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has been negotiated, the
497 master sends IOTLB entries update & invalidation by sending
498 ``VHOST_USER_IOTLB_MSG`` requests to the slave with a ``struct
499 vhost_iotlb_msg`` as payload. For update events, the ``iotlb`` payload
500 has to be filled with the update message type (2), the I/O virtual
501 address, the size, the user virtual address, and the permissions
502 flags. Addresses and size must be within vhost memory regions set via
503 the ``VHOST_USER_SET_MEM_TABLE`` request. For invalidation events, the
504 ``iotlb`` payload has to be filled with the invalidation message type
505 (3), the I/O virtual address and the size. On success, the slave is
506 expected to reply with a zero payload, non-zero otherwise.
507
508 The slave relies on the slave communication channel (see :ref:`Slave
509 communication <slave_communication>` section below) to send IOTLB miss
510 and access failure events, by sending ``VHOST_USER_SLAVE_IOTLB_MSG``
511 requests to the master with a ``struct vhost_iotlb_msg`` as
512 payload. For miss events, the iotlb payload has to be filled with the
513 miss message type (1), the I/O virtual address and the permissions
514 flags. For access failure event, the iotlb payload has to be filled
515 with the access failure message type (4), the I/O virtual address and
the permissions flags. For synchronization purposes, the slave may
rely on the reply-ack feature: if the reply-ack feature is negotiated
and the slave requests a reply, the master may send a reply when the
operation is completed. For miss events, a completed operation means
that either the master sent an update message for the IOTLB entry
containing the requested address and permission, or the master sent
nothing because the IOTLB miss message was invalid (invalid IOVA or
permission).
523
524 The master isn't expected to take the initiative to send IOTLB update
525 messages, as the slave sends IOTLB miss messages for the guest virtual
526 memory areas it needs to access.
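
To illustrate which payload fields each IOTLB event fills in, here is a
small sketch. It deliberately uses its own ``ex_`` types instead of the
kernel's ``struct vhost_iotlb_msg``; the numeric values are the ones
listed in the IOTLB message description, everything else is
hypothetical.

.. code:: c

  #include <assert.h>
  #include <stdint.h>
  #include <string.h>

  enum ex_iotlb_type {
      EX_IOTLB_MISS = 1,
      EX_IOTLB_UPDATE = 2,
      EX_IOTLB_INVALIDATE = 3,
      EX_IOTLB_ACCESS_FAIL = 4,
  };

  enum ex_iotlb_perm {
      EX_ACCESS_NONE = 0,
      EX_ACCESS_RO = 1,
      EX_ACCESS_WO = 2,
      EX_ACCESS_RW = 3,
  };

  struct ex_iotlb_msg {
      uint64_t iova;
      uint64_t size;
      uint64_t uaddr;
      uint8_t perm;
      uint8_t type;
  };

  static inline struct ex_iotlb_msg ex_iotlb_make(uint8_t type, uint64_t iova,
                                                  uint64_t size, uint64_t uaddr,
                                                  uint8_t perm)
  {
      struct ex_iotlb_msg m;

      assert(type >= EX_IOTLB_MISS && type <= EX_IOTLB_ACCESS_FAIL);
      memset(&m, 0, sizeof(m));  /* unused fields stay zero */
      m.type = type;
      m.iova = iova;
      m.size = size;
      m.uaddr = uaddr;
      m.perm = perm;
      return m;
  }

  /* Master to slave: update carries IOVA, size, user address, permissions. */
  static inline struct ex_iotlb_msg ex_iotlb_update(uint64_t iova, uint64_t size,
                                                    uint64_t uaddr, uint8_t perm)
  {
      return ex_iotlb_make(EX_IOTLB_UPDATE, iova, size, uaddr, perm);
  }

  /* Master to slave: invalidate carries IOVA and size only. */
  static inline struct ex_iotlb_msg ex_iotlb_invalidate(uint64_t iova,
                                                        uint64_t size)
  {
      return ex_iotlb_make(EX_IOTLB_INVALIDATE, iova, size, 0, EX_ACCESS_NONE);
  }

  /* Slave to master: miss carries IOVA and the wanted permissions. */
  static inline struct ex_iotlb_msg ex_iotlb_miss(uint64_t iova, uint8_t perm)
  {
      return ex_iotlb_make(EX_IOTLB_MISS, iova, 0, 0, perm);
  }

  /* Slave to master: access failure carries IOVA and permissions. */
  static inline struct ex_iotlb_msg ex_iotlb_access_fail(uint64_t iova,
                                                         uint8_t perm)
  {
      return ex_iotlb_make(EX_IOTLB_ACCESS_FAIL, iova, 0, 0, perm);
  }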
527
528 .. _slave_communication:
529
530 Slave communication
531 -------------------
532
533 An optional communication channel is provided if the slave declares
534 ``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` protocol feature, to allow the
535 slave to make requests to the master.
536
537 The fd is provided via ``VHOST_USER_SET_SLAVE_REQ_FD`` ancillary data.
538
539 A slave may then send ``VHOST_USER_SLAVE_*`` messages to the master
540 using this fd communication channel.
541
542 If ``VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD`` protocol feature is
543 negotiated, slave can send file descriptors (at most 8 descriptors in
544 each message) to master via ancillary data using this fd communication
545 channel.
546
547 Inflight I/O tracking
548 ---------------------
549
To support reconnecting after restart or crash, the slave may need to
resubmit inflight I/Os. If the virtqueue is processed in order, this is
easy to achieve by getting the inflight descriptors from the
descriptor table (split virtqueue) or descriptor ring (packed
virtqueue). However, this does not work when descriptors are processed
out of order, because some entries which store the information of
inflight descriptors in the available ring (split virtqueue) or
descriptor ring (packed virtqueue) might be overwritten by new entries.
To solve this problem, the slave needs to allocate an extra buffer to
store this information about inflight descriptors and share it with
the master so that it persists. ``VHOST_USER_GET_INFLIGHT_FD`` and
``VHOST_USER_SET_INFLIGHT_FD`` are used to transfer this buffer
between master and slave. The format of this buffer is described
below:
564
565 +---------------+---------------+-----+---------------+
566 | queue0 region | queue1 region | ... | queueN region |
567 +---------------+---------------+-----+---------------+
568
569 N is the number of available virtqueues. Slave could get it from num
570 queues field of ``VhostUserInflight``.
571
572 For split virtqueue, queue region can be implemented as:
573
.. code:: c

  typedef struct DescStateSplit {
      /* Indicate whether this descriptor is inflight or not.
       * Only available for head-descriptor. */
      uint8_t inflight;

      /* Padding */
      uint8_t padding[5];

      /* Maintain a list for the last batch of used descriptors.
       * Only available when batching is used for submitting */
      uint16_t next;

      /* Used to preserve the order of fetching available descriptors.
       * Only available for head-descriptor. */
      uint64_t counter;
  } DescStateSplit;

  typedef struct QueueRegionSplit {
      /* The feature flags of this region. Now it's initialized to 0. */
      uint64_t features;

      /* The version of this region. It's 1 currently.
       * Zero value indicates an uninitialized buffer */
      uint16_t version;

      /* The size of DescStateSplit array. It's equal to the virtqueue
       * size. Slave could get it from queue size field of VhostUserInflight. */
      uint16_t desc_num;

      /* The head of list that track the last batch of used descriptors. */
      uint16_t last_batch_head;

      /* Store the idx value of used ring */
      uint16_t used_idx;

      /* Used to track the state of each descriptor in descriptor table */
      DescStateSplit desc[];
  } QueueRegionSplit;
614
615 To track inflight I/O, the queue region should be processed as follows:
616
617 When receiving available buffers from the driver:
618
619 #. Get the next available head-descriptor index from available ring, ``i``
620
621 #. Set ``desc[i].counter`` to the value of global counter
622
623 #. Increase global counter by 1
624
625 #. Set ``desc[i].inflight`` to 1
626
627 When supplying used buffers to the driver:
628
629 1. Get corresponding used head-descriptor index, i
630
631 2. Set ``desc[i].next`` to ``last_batch_head``
632
633 3. Set ``last_batch_head`` to ``i``
634
635 #. Steps 1,2,3 may be performed repeatedly if batching is possible
636
637 #. Increase the ``idx`` value of used ring by the size of the batch
638
639 #. Set the ``inflight`` field of each ``DescStateSplit`` entry in the batch to 0
640
641 #. Set ``used_idx`` to the ``idx`` value of used ring
642
643 When reconnecting:
644
645 #. If the value of ``used_idx`` does not match the ``idx`` value of
646 used ring (means the inflight field of ``DescStateSplit`` entries in
647 last batch may be incorrect),
648
649 a. Subtract the value of ``used_idx`` from the ``idx`` value of
650 used ring to get last batch size of ``DescStateSplit`` entries
651
652 #. Set the ``inflight`` field of each ``DescStateSplit`` entry to 0 in last batch
653 list which starts from ``last_batch_head``
654
655 #. Set ``used_idx`` to the ``idx`` value of used ring
656
657 #. Resubmit inflight ``DescStateSplit`` entries in order of their
658 counter value
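
The split-virtqueue steps above can be sketched as follows. The two
structures are the ones defined earlier in this section, repeated
without comments; the ``ex_inflight_*`` helpers are hypothetical. Memory
barriers, the actual used ring update, and the final resubmission in
``counter`` order are left out.

.. code:: c

  #include <assert.h>
  #include <stdint.h>

  typedef struct DescStateSplit {
      uint8_t inflight;
      uint8_t padding[5];
      uint16_t next;
      uint64_t counter;
  } DescStateSplit;

  typedef struct QueueRegionSplit {
      uint64_t features;
      uint16_t version;
      uint16_t desc_num;
      uint16_t last_batch_head;
      uint16_t used_idx;
      DescStateSplit desc[];
  } QueueRegionSplit;

  /* Receiving an available buffer whose head-descriptor index is i. */
  static inline void ex_inflight_fetch(QueueRegionSplit *q, uint16_t i,
                                       uint64_t *global_counter)
  {
      assert(i < q->desc_num);
      q->desc[i].counter = (*global_counter)++;
      q->desc[i].inflight = 1;
  }

  /* Steps 1-3 when supplying a used buffer: link i into the batch list. */
  static inline void ex_inflight_batch_add(QueueRegionSplit *q, uint16_t i)
  {
      assert(i < q->desc_num);
      q->desc[i].next = q->last_batch_head;
      q->last_batch_head = i;
  }

  /* Clear inflight on the first count entries of the last-batch list. */
  static inline void ex_inflight_clear_batch(QueueRegionSplit *q, uint16_t count)
  {
      uint16_t i = q->last_batch_head;

      while (count-- > 0) {
          assert(i < q->desc_num);
          q->desc[i].inflight = 0;
          i = q->desc[i].next;
      }
  }

  /* Remaining steps, called after the used ring idx has been increased. */
  static inline void ex_inflight_batch_done(QueueRegionSplit *q, uint16_t batch,
                                            uint16_t used_ring_idx)
  {
      ex_inflight_clear_batch(q, batch);
      q->used_idx = used_ring_idx;
  }

  /* Reconnect: finish a batch that was interrupted after the used ring
   * idx was increased but before the inflight fields were cleared. */
  static inline void ex_inflight_recover(QueueRegionSplit *q,
                                         uint16_t used_ring_idx)
  {
      if (q->used_idx != used_ring_idx) {
          ex_inflight_clear_batch(q, (uint16_t)(used_ring_idx - q->used_idx));
          q->used_idx = used_ring_idx;
      }
  }

After ``ex_inflight_recover()``, every entry whose ``inflight`` field is
still 1 is a candidate for resubmission, sorted by ``counter``.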
659
660 For packed virtqueue, queue region can be implemented as:
661
.. code:: c

  typedef struct DescStatePacked {
      /* Indicate whether this descriptor is inflight or not.
       * Only available for head-descriptor. */
      uint8_t inflight;

      /* Padding */
      uint8_t padding;

      /* Link to the next free entry */
      uint16_t next;

      /* Link to the last entry of descriptor list.
       * Only available for head-descriptor. */
      uint16_t last;

      /* The length of descriptor list.
       * Only available for head-descriptor. */
      uint16_t num;

      /* Used to preserve the order of fetching available descriptors.
       * Only available for head-descriptor. */
      uint64_t counter;

      /* The buffer id */
      uint16_t id;

      /* The descriptor flags */
      uint16_t flags;

      /* The buffer length */
      uint32_t len;

      /* The buffer address */
      uint64_t addr;
  } DescStatePacked;

  typedef struct QueueRegionPacked {
      /* The feature flags of this region. Now it's initialized to 0. */
      uint64_t features;

      /* The version of this region. It's 1 currently.
       * Zero value indicates an uninitialized buffer */
      uint16_t version;

      /* The size of DescStatePacked array. It's equal to the virtqueue
       * size. Slave could get it from queue size field of VhostUserInflight. */
      uint16_t desc_num;

      /* The head of free DescStatePacked entry list */
      uint16_t free_head;

      /* The old head of free DescStatePacked entry list */
      uint16_t old_free_head;

      /* The used index of descriptor ring */
      uint16_t used_idx;

      /* The old used index of descriptor ring */
      uint16_t old_used_idx;

      /* Device ring wrap counter */
      uint8_t used_wrap_counter;

      /* The old device ring wrap counter */
      uint8_t old_used_wrap_counter;

      /* Padding */
      uint8_t padding[7];

      /* Used to track the state of each descriptor fetched from descriptor ring */
      DescStatePacked desc[];
  } QueueRegionPacked;
736
737 To track inflight I/O, the queue region should be processed as follows:
738
739 When receiving available buffers from the driver:
740
741 #. Get the next available descriptor entry from descriptor ring, ``d``
742
743 #. If ``d`` is head descriptor,
744
745 a. Set ``desc[old_free_head].num`` to 0
746
747 #. Set ``desc[old_free_head].counter`` to the value of global counter
748
749 #. Increase global counter by 1
750
751 #. Set ``desc[old_free_head].inflight`` to 1
752
753 #. If ``d`` is last descriptor, set ``desc[old_free_head].last`` to
754 ``free_head``
755
756 #. Increase ``desc[old_free_head].num`` by 1
757
758 #. Set ``desc[free_head].addr``, ``desc[free_head].len``,
759 ``desc[free_head].flags``, ``desc[free_head].id`` to ``d.addr``,
760 ``d.len``, ``d.flags``, ``d.id``
761
762 #. Set ``free_head`` to ``desc[free_head].next``
763
764 #. If ``d`` is last descriptor, set ``old_free_head`` to ``free_head``
765
766 When supplying used buffers to the driver:
767
768 1. Get corresponding used head-descriptor entry from descriptor ring,
769 ``d``
770
771 2. Get corresponding ``DescStatePacked`` entry, ``e``
772
773 3. Set ``desc[e.last].next`` to ``free_head``
774
775 4. Set ``free_head`` to the index of ``e``
776
777 #. Steps 1,2,3,4 may be performed repeatedly if batching is possible
778
779 #. Increase ``used_idx`` by the size of the batch and update
780 ``used_wrap_counter`` if needed
781
782 #. Update ``d.flags``
783
784 #. Set the ``inflight`` field of each head ``DescStatePacked`` entry
785 in the batch to 0
786
787 #. Set ``old_free_head``, ``old_used_idx``, ``old_used_wrap_counter``
788 to ``free_head``, ``used_idx``, ``used_wrap_counter``
789
790 When reconnecting:
791
792 #. If ``used_idx`` does not match ``old_used_idx`` (means the
793 ``inflight`` field of ``DescStatePacked`` entries in last batch may
794 be incorrect),
795
796 a. Get the next descriptor ring entry through ``old_used_idx``, ``d``
797
798 #. Use ``old_used_wrap_counter`` to calculate the available flags
799
   #. If ``d.flags`` is not equal to the calculated flags value (means
      slave has submitted the buffer to guest driver before crash, so
      it has to commit the in-progress update), set ``old_free_head``,
      ``old_used_idx``, ``old_used_wrap_counter`` to ``free_head``,
      ``used_idx``, ``used_wrap_counter``
805
806 #. Set ``free_head``, ``used_idx``, ``used_wrap_counter`` to
807 ``old_free_head``, ``old_used_idx``, ``old_used_wrap_counter``
808 (roll back any in-progress update)
809
810 #. Set the ``inflight`` field of each ``DescStatePacked`` entry in
811 free list to 0
812
813 #. Resubmit inflight ``DescStatePacked`` entries in order of their
814 counter value
815
816 In-band notifications
817 ---------------------
818
819 In some limited situations (e.g. for simulation) it is desirable to
820 have the kick, call and error (if used) signals done via in-band
821 messages instead of asynchronous eventfd notifications. This can be
822 done by negotiating the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS``
823 protocol feature.
824
Note that because too many messages on the sockets can cause the
sending application(s) to block, it is not advised to use this feature
unless absolutely necessary. It is also considered an error to
negotiate this feature without also negotiating
``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` and ``VHOST_USER_PROTOCOL_F_REPLY_ACK``.
The former is necessary for getting a message channel from the slave
to the master, while the latter needs to be used with the in-band
notification messages to block until they are processed, both to avoid
blocking later and for proper processing (at least in the simulation
use case). As it has no other way of signalling this error, the slave
should close the connection as a response to a
``VHOST_USER_SET_PROTOCOL_FEATURES`` message that sets the in-band
notifications feature flag without the other two.
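
The dependency described above could be checked by a slave along these
lines. The bit numbers are the protocol feature values defined in this
document; the function name ``ex_protocol_features_valid`` is
hypothetical.

.. code:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stdint.h>

  #define VHOST_USER_PROTOCOL_F_REPLY_ACK             3
  #define VHOST_USER_PROTOCOL_F_SLAVE_REQ             5
  #define VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS 14

  /*
   * Return false if the feature set from VHOST_USER_SET_PROTOCOL_FEATURES
   * enables in-band notifications without both SLAVE_REQ and REPLY_ACK.
   * In that case the slave should close the connection.
   */
  static inline bool ex_protocol_features_valid(uint64_t features)
  {
      const uint64_t inband =
          1ULL << VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS;
      const uint64_t required =
          (1ULL << VHOST_USER_PROTOCOL_F_SLAVE_REQ) |
          (1ULL << VHOST_USER_PROTOCOL_F_REPLY_ACK);

      if (!(features & inband)) {
          return true;
      }
      return (features & required) == required;
  }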
838
839 Protocol features
840 -----------------
841
842 .. code:: c
843
844 #define VHOST_USER_PROTOCOL_F_MQ 0
845 #define VHOST_USER_PROTOCOL_F_LOG_SHMFD 1
846 #define VHOST_USER_PROTOCOL_F_RARP 2
847 #define VHOST_USER_PROTOCOL_F_REPLY_ACK 3
848 #define VHOST_USER_PROTOCOL_F_MTU 4
849 #define VHOST_USER_PROTOCOL_F_SLAVE_REQ 5
850 #define VHOST_USER_PROTOCOL_F_CROSS_ENDIAN 6
851 #define VHOST_USER_PROTOCOL_F_CRYPTO_SESSION 7
852 #define VHOST_USER_PROTOCOL_F_PAGEFAULT 8
853 #define VHOST_USER_PROTOCOL_F_CONFIG 9
854 #define VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD 10
855 #define VHOST_USER_PROTOCOL_F_HOST_NOTIFIER 11
856 #define VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD 12
857 #define VHOST_USER_PROTOCOL_F_RESET_DEVICE 13
858 #define VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS 14
859 #define VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS 15
860 #define VHOST_USER_PROTOCOL_F_STATUS 16
861
862 Master message types
863 --------------------
864
865 ``VHOST_USER_GET_FEATURES``
866 :id: 1
867 :equivalent ioctl: ``VHOST_GET_FEATURES``
868 :master payload: N/A
869 :slave payload: ``u64``
870
871 Get from the underlying vhost implementation the features bitmask.
872 Feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` signals slave support
873 for ``VHOST_USER_GET_PROTOCOL_FEATURES`` and
874 ``VHOST_USER_SET_PROTOCOL_FEATURES``.
875
876 ``VHOST_USER_SET_FEATURES``
877 :id: 2
878 :equivalent ioctl: ``VHOST_SET_FEATURES``
879 :master payload: ``u64``
880
881 Enable features in the underlying vhost implementation using a
882 bitmask. Feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` signals
883 slave support for ``VHOST_USER_GET_PROTOCOL_FEATURES`` and
884 ``VHOST_USER_SET_PROTOCOL_FEATURES``.
885
886 ``VHOST_USER_GET_PROTOCOL_FEATURES``
887 :id: 15
888 :equivalent ioctl: ``VHOST_GET_FEATURES``
889 :master payload: N/A
890 :slave payload: ``u64``
891
892 Get the protocol feature bitmask from the underlying vhost
893 implementation. Only legal if feature bit
894 ``VHOST_USER_F_PROTOCOL_FEATURES`` is present in
895 ``VHOST_USER_GET_FEATURES``.
896
897 .. Note::
898 Slave that reported ``VHOST_USER_F_PROTOCOL_FEATURES`` must
899 support this message even before ``VHOST_USER_SET_FEATURES`` was
900 called.
901
902 ``VHOST_USER_SET_PROTOCOL_FEATURES``
903 :id: 16
904 :equivalent ioctl: ``VHOST_SET_FEATURES``
905 :master payload: ``u64``
906
907 Enable protocol features in the underlying vhost implementation.
908
909 Only legal if feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` is present in
910 ``VHOST_USER_GET_FEATURES``.
911
912 .. Note::
913 Slave that reported ``VHOST_USER_F_PROTOCOL_FEATURES`` must support
914 this message even before ``VHOST_USER_SET_FEATURES`` was called.
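
The ordering constraint on the four feature messages can be sketched
from the master's point of view. This is an illustration only: the
helper name is hypothetical, the value 30 for
``VHOST_USER_F_PROTOCOL_FEATURES`` is taken from the definition given
elsewhere in this specification, and enabling the intersection of what
the slave reports and what the master supports is an assumed policy.

.. code:: c

  #include <stdbool.h>
  #include <stdint.h>

  /* Assumed from the definition elsewhere in this specification. */
  #define VHOST_USER_F_PROTOCOL_FEATURES 30

  /*
   * Hypothetical helper: decide what to send in
   * VHOST_USER_SET_PROTOCOL_FEATURES.
   *
   * features         - reply to VHOST_USER_GET_FEATURES
   * slave_protocol   - reply to VHOST_USER_GET_PROTOCOL_FEATURES
   * master_supported - protocol features the master implements
   *
   * Returns false when the protocol feature messages are not legal
   * because the slave did not report VHOST_USER_F_PROTOCOL_FEATURES.
   */
  static bool select_protocol_features(uint64_t features,
                                       uint64_t slave_protocol,
                                       uint64_t master_supported,
                                       uint64_t *to_set)
  {
      if (!(features & (UINT64_C(1) << VHOST_USER_F_PROTOCOL_FEATURES))) {
          return false;
      }
      *to_set = slave_protocol & master_supported;
      return true;
  }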
915
916 ``VHOST_USER_SET_OWNER``
917 :id: 3
918 :equivalent ioctl: ``VHOST_SET_OWNER``
919 :master payload: N/A
920
921 Issued when a new connection is established. It sets the current
922 *master* as an owner of the session. This can be used on the *slave*
923 as a "session start" flag.
924
925 ``VHOST_USER_RESET_OWNER``
926 :id: 4
927 :master payload: N/A
928
929 .. admonition:: Deprecated
930
931 This is no longer used. Used to be sent to request disabling all
932 rings, but some clients interpreted it to also discard connection
933 state (this interpretation would lead to bugs). It is recommended
934 that clients either ignore this message, or use it to disable all
935 rings.
936
937 ``VHOST_USER_SET_MEM_TABLE``
938 :id: 5
939 :equivalent ioctl: ``VHOST_SET_MEM_TABLE``
940 :master payload: memory regions description
941 :slave payload: (postcopy only) memory regions description
942
943 Sets the memory map regions on the slave so it can translate the
944 vring addresses. In the ancillary data there is an array of file
945 descriptors for each memory mapped region. The size and ordering of
946 the fds matches the number and ordering of memory regions.
947
948 When ``VHOST_USER_POSTCOPY_LISTEN`` has been received,
949 ``SET_MEM_TABLE`` replies with the bases of the memory mapped
950 regions to the master. The slave must have mmap'd the regions but
951 not yet accessed them and should not yet generate a userfault
952 event.
953
954 .. Note::
955 ``NEED_REPLY_MASK`` is not set in this case. QEMU will then
956 reply back to the list of mappings with an empty
957 ``VHOST_USER_SET_MEM_TABLE`` as an acknowledgement; only upon
958 reception of this message may the guest start accessing the memory
959 and generating faults.
960
961 ``VHOST_USER_SET_LOG_BASE``
962 :id: 6
963 :equivalent ioctl: ``VHOST_SET_LOG_BASE``
964 :master payload: u64
965 :slave payload: N/A
966
967 Sets logging shared memory space.
968
When the slave has the ``VHOST_USER_PROTOCOL_F_LOG_SHMFD`` protocol
feature, the log memory fd is provided in the ancillary data of the
``VHOST_USER_SET_LOG_BASE`` message, and the size and offset of the
shared memory area are provided in the message.
973
974 ``VHOST_USER_SET_LOG_FD``
975 :id: 7
976 :equivalent ioctl: ``VHOST_SET_LOG_FD``
977 :master payload: N/A
978
979 Sets the logging file descriptor, which is passed as ancillary data.
980
981 ``VHOST_USER_SET_VRING_NUM``
982 :id: 8
983 :equivalent ioctl: ``VHOST_SET_VRING_NUM``
984 :master payload: vring state description
985
986 Set the size of the queue.
987
988 ``VHOST_USER_SET_VRING_ADDR``
989 :id: 9
990 :equivalent ioctl: ``VHOST_SET_VRING_ADDR``
991 :master payload: vring address description
992 :slave payload: N/A
993
994 Sets the addresses of the different aspects of the vring.
995
996 ``VHOST_USER_SET_VRING_BASE``
997 :id: 10
998 :equivalent ioctl: ``VHOST_SET_VRING_BASE``
999 :master payload: vring state description
1000
1001 Sets the base offset in the available vring.
1002
1003 ``VHOST_USER_GET_VRING_BASE``
1004 :id: 11
:equivalent ioctl: ``VHOST_GET_VRING_BASE``
1006 :master payload: vring state description
1007 :slave payload: vring state description
1008
1009 Get the available vring base offset.
1010
1011 ``VHOST_USER_SET_VRING_KICK``
1012 :id: 12
1013 :equivalent ioctl: ``VHOST_SET_VRING_KICK``
1014 :master payload: ``u64``
1015
1016 Set the event file descriptor for adding buffers to the vring. It is
1017 passed in the ancillary data.
1018
1019 Bits (0-7) of the payload contain the vring index. Bit 8 is the
1020 invalid FD flag. This flag is set when there is no file descriptor
1021 in the ancillary data. This signals that polling should be used
instead of waiting for the kick. Note that if the protocol feature
``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` has been negotiated,
this message isn't necessary, as the ring is also started on the
``VHOST_USER_VRING_KICK`` message. It may however still be used to
set an event file descriptor (which will be preferred over the
message) or to enable polling.
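
The ``u64`` payload layout used by ``VHOST_USER_SET_VRING_KICK`` (and,
with the same layout, by ``VHOST_USER_SET_VRING_CALL`` and
``VHOST_USER_SET_VRING_ERR``) can be decoded as sketched below. The
struct, macro and function names are hypothetical and chosen only for
this illustration.

.. code:: c

  #include <stdbool.h>
  #include <stdint.h>

  #define VRING_FD_INDEX_MASK UINT64_C(0xff)      /* bits 0-7: vring index */
  #define VRING_FD_NOFD_MASK  (UINT64_C(1) << 8)  /* bit 8: invalid FD flag */

  /* Hypothetical decoded form of the payload. */
  struct vring_fd_request {
      unsigned int index; /* which vring the message refers to */
      bool has_fd;        /* false when no fd is in the ancillary data */
  };

  static struct vring_fd_request decode_vring_fd_payload(uint64_t payload)
  {
      struct vring_fd_request req = {
          .index = (unsigned int)(payload & VRING_FD_INDEX_MASK),
          .has_fd = !(payload & VRING_FD_NOFD_MASK),
      };
      return req;
  }

A receiver would only look for a file descriptor in the ancillary data
when ``has_fd`` is true.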
1028
1029 ``VHOST_USER_SET_VRING_CALL``
1030 :id: 13
1031 :equivalent ioctl: ``VHOST_SET_VRING_CALL``
1032 :master payload: ``u64``
1033
1034 Set the event file descriptor to signal when buffers are used. It is
1035 passed in the ancillary data.
1036
1037 Bits (0-7) of the payload contain the vring index. Bit 8 is the
1038 invalid FD flag. This flag is set when there is no file descriptor
1039 in the ancillary data. This signals that polling will be used
instead of waiting for the call. Note that if the protocol features
``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` and
``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` have been negotiated, this message
isn't necessary, as the ``VHOST_USER_SLAVE_VRING_CALL`` message can be
used. It may however still be used to set an event file descriptor
or to enable polling.
1046
1047 ``VHOST_USER_SET_VRING_ERR``
1048 :id: 14
1049 :equivalent ioctl: ``VHOST_SET_VRING_ERR``
1050 :master payload: ``u64``
1051
1052 Set the event file descriptor to signal when error occurs. It is
1053 passed in the ancillary data.
1054
1055 Bits (0-7) of the payload contain the vring index. Bit 8 is the
1056 invalid FD flag. This flag is set when there is no file descriptor
in the ancillary data. Note that if the protocol features
``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` and
``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` have been negotiated, this message
isn't necessary, as the ``VHOST_USER_SLAVE_VRING_ERR`` message can be
used. It may however still be used to set an event file descriptor
(which will be preferred over the message).
1063
1064 ``VHOST_USER_GET_QUEUE_NUM``
1065 :id: 17
1066 :equivalent ioctl: N/A
1067 :master payload: N/A
1068 :slave payload: u64
1069
1070 Query how many queues the backend supports.
1071
1072 This request should be sent only when ``VHOST_USER_PROTOCOL_F_MQ``
1073 is set in queried protocol features by
1074 ``VHOST_USER_GET_PROTOCOL_FEATURES``.
1075
1076 ``VHOST_USER_SET_VRING_ENABLE``
1077 :id: 18
1078 :equivalent ioctl: N/A
1079 :master payload: vring state description
1080
1081 Signal slave to enable or disable corresponding vring.
1082
1083 This request should be sent only when
1084 ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated.
1085
1086 ``VHOST_USER_SEND_RARP``
1087 :id: 19
1088 :equivalent ioctl: N/A
1089 :master payload: ``u64``
1090
Ask the vhost-user backend to broadcast a fake RARP to announce that
the migration has terminated, for guests that do not support
GUEST_ANNOUNCE.

Only legal if feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` is
present in ``VHOST_USER_GET_FEATURES`` and protocol feature bit
``VHOST_USER_PROTOCOL_F_RARP`` is present in
``VHOST_USER_GET_PROTOCOL_FEATURES``. The first 6 bytes of the
payload contain the MAC address of the guest to allow the vhost-user
backend to construct and broadcast the fake RARP.
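
A backend could recover the MAC address from the ``u64`` payload as
sketched below. The helper name is hypothetical, and the sketch assumes
that "the first 6 bytes" means the first six bytes of the payload as it
sits in memory, which is why it copies bytes instead of shifting the
integer value.

.. code:: c

  #include <stdint.h>
  #include <string.h>

  #define RARP_MAC_LEN 6

  /*
   * Hypothetical helper: copy the guest MAC address out of the
   * VHOST_USER_SEND_RARP payload. The two remaining payload bytes
   * are ignored.
   */
  static void rarp_payload_to_mac(uint64_t payload, uint8_t mac[RARP_MAC_LEN])
  {
      memcpy(mac, &payload, RARP_MAC_LEN);
  }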
1100
1101 ``VHOST_USER_NET_SET_MTU``
1102 :id: 20
1103 :equivalent ioctl: N/A
1104 :master payload: ``u64``
1105
1106 Set host MTU value exposed to the guest.
1107
1108 This request should be sent only when ``VIRTIO_NET_F_MTU`` feature
1109 has been successfully negotiated, ``VHOST_USER_F_PROTOCOL_FEATURES``
1110 is present in ``VHOST_USER_GET_FEATURES`` and protocol feature bit
1111 ``VHOST_USER_PROTOCOL_F_NET_MTU`` is present in
1112 ``VHOST_USER_GET_PROTOCOL_FEATURES``.
1113
1114 If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, slave must
1115 respond with zero in case the specified MTU is valid, or non-zero
1116 otherwise.
1117
1118 ``VHOST_USER_SET_SLAVE_REQ_FD``
1119 :id: 21
1120 :equivalent ioctl: N/A
1121 :master payload: N/A
1122
1123 Set the socket file descriptor for slave initiated requests. It is passed
1124 in the ancillary data.
1125
1126 This request should be sent only when
1127 ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, and protocol
feature bit ``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` is present in
1129 ``VHOST_USER_GET_PROTOCOL_FEATURES``. If
1130 ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, slave must
1131 respond with zero for success, non-zero otherwise.
1132
1133 ``VHOST_USER_IOTLB_MSG``
1134 :id: 22
1135 :equivalent ioctl: N/A (equivalent to ``VHOST_IOTLB_MSG`` message type)
1136 :master payload: ``struct vhost_iotlb_msg``
1137 :slave payload: ``u64``
1138
1139 Send IOTLB messages with ``struct vhost_iotlb_msg`` as payload.
1140
Master sends such requests to update and invalidate entries in the
device IOTLB. The slave has to acknowledge the request by sending
zero as ``u64`` payload for success, non-zero otherwise.

This request should be sent only when ``VIRTIO_F_IOMMU_PLATFORM``
feature has been successfully negotiated.
1147
1148 ``VHOST_USER_SET_VRING_ENDIAN``
1149 :id: 23
1150 :equivalent ioctl: ``VHOST_SET_VRING_ENDIAN``
1151 :master payload: vring state description
1152
1153 Set the endianness of a VQ for legacy devices. Little-endian is
1154 indicated with state.num set to 0 and big-endian is indicated with
1155 state.num set to 1. Other values are invalid.
1156
1157 This request should be sent only when
1158 ``VHOST_USER_PROTOCOL_F_CROSS_ENDIAN`` has been negotiated.
1159 Backends that negotiated this feature should handle both
1160 endiannesses and expect this message once (per VQ) during device
configuration (i.e. before the master starts the VQ).
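
The encoding of ``state.num`` for ``VHOST_USER_SET_VRING_ENDIAN`` can be
sketched as follows. The struct mirrors the vring state description
(a 32-bit index followed by a 32-bit num) as an assumption about its
layout, and the enum and function names are hypothetical.

.. code:: c

  #include <stdint.h>

  /* Assumed layout of the vring state description. */
  struct vring_state {
      uint32_t index;
      uint32_t num;
  };

  enum vq_endian {
      VQ_ENDIAN_LITTLE,
      VQ_ENDIAN_BIG,
      VQ_ENDIAN_INVALID,
  };

  /* Map state.num to an endianness; anything other than 0 or 1 is invalid. */
  static enum vq_endian vring_endian_from_state(const struct vring_state *state)
  {
      switch (state->num) {
      case 0:
          return VQ_ENDIAN_LITTLE;
      case 1:
          return VQ_ENDIAN_BIG;
      default:
          return VQ_ENDIAN_INVALID;
      }
  }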
1162
1163 ``VHOST_USER_GET_CONFIG``
1164 :id: 24
1165 :equivalent ioctl: N/A
1166 :master payload: virtio device config space
1167 :slave payload: virtio device config space
1168
When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
submitted by the vhost-user master to fetch the contents of the
virtio device configuration space. The vhost-user slave's payload size
MUST match the master's request; the slave uses a zero-length
payload to indicate an error to the vhost-user master. The vhost-user
master may cache the contents to avoid repeated
``VHOST_USER_GET_CONFIG`` calls.
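
On the master side, the two reply rules for ``VHOST_USER_GET_CONFIG``
could be checked as sketched below. The names are hypothetical, and the
sizes are assumed to be the payload sizes carried in the message headers
of the request and of the reply.

.. code:: c

  #include <stdint.h>

  enum config_reply_status {
      CONFIG_REPLY_OK,
      CONFIG_REPLY_SLAVE_ERROR,   /* zero-length payload */
      CONFIG_REPLY_SIZE_MISMATCH, /* size differs from the request */
  };

  /* Hypothetical helper: classify a VHOST_USER_GET_CONFIG reply by size. */
  static enum config_reply_status
  check_get_config_reply(uint32_t request_size, uint32_t reply_size)
  {
      if (reply_size == 0) {
          return CONFIG_REPLY_SLAVE_ERROR;
      }
      if (reply_size != request_size) {
          return CONFIG_REPLY_SIZE_MISMATCH;
      }
      return CONFIG_REPLY_OK;
  }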
1176
1177 ``VHOST_USER_SET_CONFIG``
1178 :id: 25
1179 :equivalent ioctl: N/A
1180 :master payload: virtio device config space
1181 :slave payload: N/A
1182
1183 When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
1184 submitted by the vhost-user master when the Guest changes the virtio
1185 device configuration space and also can be used for live migration
1186 on the destination host. The vhost-user slave must check the flags
1187 field, and slaves MUST NOT accept SET_CONFIG for read-only
1188 configuration space fields unless the live migration bit is set.
1189
1190 ``VHOST_USER_CREATE_CRYPTO_SESSION``
1191 :id: 26
1192 :equivalent ioctl: N/A
1193 :master payload: crypto session description
1194 :slave payload: crypto session description
1195
1196 Create a session for crypto operation. The server side must return
1197 the session id, 0 or positive for success, negative for failure.
1198 This request should be sent only when
1199 ``VHOST_USER_PROTOCOL_F_CRYPTO_SESSION`` feature has been
1200 successfully negotiated. It's a required feature for crypto
1201 devices.
1202
1203 ``VHOST_USER_CLOSE_CRYPTO_SESSION``
1204 :id: 27
1205 :equivalent ioctl: N/A
1206 :master payload: ``u64``
1207
1208 Close a session for crypto operation which was previously
1209 created by ``VHOST_USER_CREATE_CRYPTO_SESSION``.
1210
1211 This request should be sent only when
1212 ``VHOST_USER_PROTOCOL_F_CRYPTO_SESSION`` feature has been
1213 successfully negotiated. It's a required feature for crypto
1214 devices.
1215
1216 ``VHOST_USER_POSTCOPY_ADVISE``
1217 :id: 28
1218 :master payload: N/A
1219 :slave payload: userfault fd
1220
When ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported, the master
advises the slave that a migration with postcopy enabled is underway;
the slave must open a userfaultfd for later use. Note that at this
stage the migration is still in precopy mode.
1225
1226 ``VHOST_USER_POSTCOPY_LISTEN``
1227 :id: 29
1228 :master payload: N/A
1229
1230 Master advises slave that a transition to postcopy mode has
1231 happened. The slave must ensure that shared memory is registered
1232 with userfaultfd to cause faulting of non-present pages.
1233
1234 This is always sent sometime after a ``VHOST_USER_POSTCOPY_ADVISE``,
1235 and thus only when ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported.
1236
1237 ``VHOST_USER_POSTCOPY_END``
1238 :id: 30
1239 :slave payload: ``u64``
1240
1241 Master advises that postcopy migration has now completed. The slave
1242 must disable the userfaultfd. The response is an acknowledgement
1243 only.
1244
1245 When ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported, this message
1246 is sent at the end of the migration, after
1247 ``VHOST_USER_POSTCOPY_LISTEN`` was previously sent.
1248
1249 The value returned is an error indication; 0 is success.
1250
1251 ``VHOST_USER_GET_INFLIGHT_FD``
1252 :id: 31
1253 :equivalent ioctl: N/A
1254 :master payload: inflight description
1255
1256 When ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD`` protocol feature has
1257 been successfully negotiated, this message is submitted by master to
1258 get a shared buffer from slave. The shared buffer will be used to
track inflight I/O by slave. QEMU should retrieve a new one when the
VM is reset.
1261
1262 ``VHOST_USER_SET_INFLIGHT_FD``
1263 :id: 32
1264 :equivalent ioctl: N/A
1265 :master payload: inflight description
1266
1267 When ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD`` protocol feature has
1268 been successfully negotiated, this message is submitted by master to
1269 send the shared inflight buffer back to slave so that slave could
1270 get inflight I/O after a crash or restart.
1271
1272 ``VHOST_USER_GPU_SET_SOCKET``
1273 :id: 33
1274 :equivalent ioctl: N/A
1275 :master payload: N/A
1276
1277 Sets the GPU protocol socket file descriptor, which is passed as
1278 ancillary data. The GPU protocol is used to inform the master of
1279 rendering state and updates. See vhost-user-gpu.rst for details.
1280
1281 ``VHOST_USER_RESET_DEVICE``
1282 :id: 34
1283 :equivalent ioctl: N/A
1284 :master payload: N/A
1285 :slave payload: N/A
1286
1287 Ask the vhost user backend to disable all rings and reset all
1288 internal device state to the initial state, ready to be
1289 reinitialized. The backend retains ownership of the device
1290 throughout the reset operation.
1291
1292 Only valid if the ``VHOST_USER_PROTOCOL_F_RESET_DEVICE`` protocol
1293 feature is set by the backend.
1294
1295 ``VHOST_USER_VRING_KICK``
1296 :id: 35
1297 :equivalent ioctl: N/A
1298 :slave payload: vring state description
1299 :master payload: N/A
1300
1301 When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol
1302 feature has been successfully negotiated, this message may be
1303 submitted by the master to indicate that a buffer was added to
1304 the vring instead of signalling it using the vring's kick file
1305 descriptor or having the slave rely on polling.
1306
1307 The state.num field is currently reserved and must be set to 0.
1308
1309 ``VHOST_USER_GET_MAX_MEM_SLOTS``
1310 :id: 36
1311 :equivalent ioctl: N/A
1312 :slave payload: u64
1313
1314 When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
1315 feature has been successfully negotiated, this message is submitted
1316 by master to the slave. The slave should return the message with a
1317 u64 payload containing the maximum number of memory slots for
1318 QEMU to expose to the guest. The value returned by the backend
1319 will be capped at the maximum number of ram slots which can be
1320 supported by the target platform.
1321
1322 ``VHOST_USER_ADD_MEM_REG``
1323 :id: 37
1324 :equivalent ioctl: N/A
1325 :slave payload: single memory region description
1326
1327 When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
1328 feature has been successfully negotiated, this message is submitted
1329 by the master to the slave. The message payload contains a memory
1330 region descriptor struct, describing a region of guest memory which
the slave device must map in. Together with the
``VHOST_USER_REM_MEM_REG`` message, this message is used to set and
update the memory tables of the slave device.
1336
1337 ``VHOST_USER_REM_MEM_REG``
1338 :id: 38
1339 :equivalent ioctl: N/A
1340 :slave payload: single memory region description
1341
1342 When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
1343 feature has been successfully negotiated, this message is submitted
1344 by the master to the slave. The message payload contains a memory
1345 region descriptor struct, describing a region of guest memory which
the slave device must unmap. Together with the
``VHOST_USER_ADD_MEM_REG`` message, this message is used to set and
update the memory tables of the slave device.
1351
1352 ``VHOST_USER_SET_STATUS``
1353 :id: 39
:equivalent ioctl: ``VHOST_VDPA_SET_STATUS``
1355 :slave payload: N/A
1356 :master payload: ``u64``
1357
1358 When the ``VHOST_USER_PROTOCOL_F_STATUS`` protocol feature has been
1359 successfully negotiated, this message is submitted by the master to
1360 notify the backend with updated device status as defined in the Virtio
1361 specification.
1362
1363 ``VHOST_USER_GET_STATUS``
1364 :id: 40
:equivalent ioctl: ``VHOST_VDPA_GET_STATUS``
1366 :slave payload: ``u64``
1367 :master payload: N/A
1368
1369 When the ``VHOST_USER_PROTOCOL_F_STATUS`` protocol feature has been
1370 successfully negotiated, this message is submitted by the master to
1371 query the backend for its device status as defined in the Virtio
1372 specification.
1373
1374
1375 Slave message types
1376 -------------------
1377
1378 ``VHOST_USER_SLAVE_IOTLB_MSG``
1379 :id: 1
1380 :equivalent ioctl: N/A (equivalent to ``VHOST_IOTLB_MSG`` message type)
1381 :slave payload: ``struct vhost_iotlb_msg``
1382 :master payload: N/A
1383
1384 Send IOTLB messages with ``struct vhost_iotlb_msg`` as payload.
1385 Slave sends such requests to notify of an IOTLB miss, or an IOTLB
1386 access failure. If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is
1387 negotiated, and slave set the ``VHOST_USER_NEED_REPLY`` flag, master
1388 must respond with zero when operation is successfully completed, or
non-zero otherwise. This request should be sent only when
1390 ``VIRTIO_F_IOMMU_PLATFORM`` feature has been successfully
1391 negotiated.
1392
1393 ``VHOST_USER_SLAVE_CONFIG_CHANGE_MSG``
1394 :id: 2
1395 :equivalent ioctl: N/A
1396 :slave payload: N/A
1397 :master payload: N/A
1398
When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, the vhost-user
slave sends such messages to notify that the virtio device's
configuration space has changed. For host devices that support this
feature, the host driver can send a ``VHOST_USER_GET_CONFIG``
message to the slave to get the latest content. If
``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, and the slave set the
``VHOST_USER_NEED_REPLY`` flag, the master must respond with zero when
the operation is successfully completed, or non-zero otherwise.
1407
1408 ``VHOST_USER_SLAVE_VRING_HOST_NOTIFIER_MSG``
1409 :id: 3
1410 :equivalent ioctl: N/A
1411 :slave payload: vring area description
1412 :master payload: N/A
1413
1414 Sets host notifier for a specified queue. The queue index is
1415 contained in the ``u64`` field of the vring area description. The
1416 host notifier is described by the file descriptor (typically it's a
1417 VFIO device fd) which is passed as ancillary data and the size
1418 (which is mmap size and should be the same as host page size) and
1419 offset (which is mmap offset) carried in the vring area
1420 description. QEMU can mmap the file descriptor based on the size and
1421 offset to get a memory range. Registering a host notifier means
1422 mapping this memory range to the VM as the specified queue's notify
1423 MMIO region. Slave sends this request to tell QEMU to de-register
1424 the existing notifier if any and register the new notifier if the
1425 request is sent with a file descriptor.
1426
1427 This request should be sent only when
1428 ``VHOST_USER_PROTOCOL_F_HOST_NOTIFIER`` protocol feature has been
1429 successfully negotiated.
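
The mapping step described for the host notifier can be sketched with a
plain ``mmap()`` call. The helper name is hypothetical, and the
protection and sharing flags are assumptions made for this illustration
and not something this specification mandates. ``size`` and ``offset``
are the values carried in the vring area description, and ``fd`` is the
descriptor received as ancillary data.

.. code:: c

  #include <stddef.h>
  #include <stdint.h>
  #include <sys/mman.h>
  #include <sys/types.h>

  /*
   * Hypothetical helper: map the notifier region described by a
   * VHOST_USER_SLAVE_VRING_HOST_NOTIFIER_MSG request.
   * Returns NULL when the mapping fails.
   */
  static void *map_host_notifier(int fd, uint64_t size, uint64_t offset)
  {
      void *addr = mmap(NULL, (size_t)size, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, (off_t)offset);

      return addr == MAP_FAILED ? NULL : addr;
  }

When the request arrives without a file descriptor, there is nothing to
map and only the de-registration of the existing notifier applies.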
1430
1431 ``VHOST_USER_SLAVE_VRING_CALL``
1432 :id: 4
1433 :equivalent ioctl: N/A
1434 :slave payload: vring state description
1435 :master payload: N/A
1436
1437 When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol
1438 feature has been successfully negotiated, this message may be
1439 submitted by the slave to indicate that a buffer was used from
1440 the vring instead of signalling this using the vring's call file
descriptor or having the master rely on polling.
1442
1443 The state.num field is currently reserved and must be set to 0.
1444
1445 ``VHOST_USER_SLAVE_VRING_ERR``
1446 :id: 5
1447 :equivalent ioctl: N/A
1448 :slave payload: vring state description
1449 :master payload: N/A
1450
1451 When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol
1452 feature has been successfully negotiated, this message may be
1453 submitted by the slave to indicate that an error occurred on the
1454 specific vring, instead of signalling the error file descriptor
1455 set by the master via ``VHOST_USER_SET_VRING_ERR``.
1456
1457 The state.num field is currently reserved and must be set to 0.
1458
1459 .. _reply_ack:
1460
1461 VHOST_USER_PROTOCOL_F_REPLY_ACK
1462 -------------------------------
1463
1464 The original vhost-user specification only demands replies for certain
1465 commands. This differs from the vhost protocol implementation where
1466 commands are sent over an ``ioctl()`` call and block until the client
1467 has completed.
1468
With this protocol extension negotiated, the sender (QEMU) can set the
``need_reply`` [Bit 3] flag on any command. This indicates that the
client MUST respond with a Payload ``VhostUserMsg`` indicating success
or failure. The payload should be set to zero on success or non-zero
on failure, unless the message already has an explicit reply body.
1474
1475 The response payload gives QEMU a deterministic indication of the result
1476 of the command. Today, QEMU is expected to terminate the main vhost-user
loop upon receiving such errors. In future, QEMU could be taught to be more
1478 resilient for selective requests.
1479
1480 For the message types that already solicit a reply from the client,
1481 the presence of ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` or need_reply bit
1482 being set brings no behavioural change. (See the Communication_
1483 section for details.)
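
The flag handling described in this section can be sketched as two
small helpers. The names are hypothetical; only the position of the
``need_reply`` flag (bit 3) and the meaning of a zero payload are taken
from the text above.

.. code:: c

  #include <stdbool.h>
  #include <stdint.h>

  #define VHOST_USER_NEED_REPLY_MASK (UINT32_C(1) << 3) /* need_reply, bit 3 */

  /*
   * Hypothetical helper: request an acknowledgement for a command, but
   * only when VHOST_USER_PROTOCOL_F_REPLY_ACK has been negotiated.
   */
  static uint32_t with_need_reply(uint32_t flags, bool reply_ack_negotiated)
  {
      return reply_ack_negotiated ? (flags | VHOST_USER_NEED_REPLY_MASK)
                                  : flags;
  }

  /* Zero in the reply payload means success, anything else is a failure. */
  static bool reply_ack_is_success(uint64_t payload)
  {
      return payload == 0;
  }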
1484
1485 .. _backend_conventions:
1486
1487 Backend program conventions
1488 ===========================
1489
1490 vhost-user backends can provide various devices & services and may
1491 need to be configured manually depending on the use case. However, it
1492 is a good idea to follow the conventions listed here when
1493 possible. Users, QEMU or libvirt, can then rely on some common
1494 behaviour to avoid heterogeneous configuration and management of the
1495 backend programs and facilitate interoperability.
1496
1497 Each backend installed on a host system should come with at least one
1498 JSON file that conforms to the vhost-user.json schema. Each file
1499 informs the management applications about the backend type, and binary
1500 location. In addition, it defines rules for management apps for
1501 picking the highest priority backend when multiple match the search
1502 criteria (see ``@VhostUserBackend`` documentation in the schema file).
1503
1504 If the backend is not capable of enabling a requested feature on the
1505 host (such as 3D acceleration with virgl), or the initialization
1506 failed, the backend should fail to start early and exit with a status
1507 != 0. It may also print a message to stderr for further details.
1508
1509 The backend program must not daemonize itself, but it may be
daemonized by the management layer. It may also have restricted
access to the system.
1512
1513 File descriptors 0, 1 and 2 will exist, and have regular
1514 stdin/stdout/stderr usage (they may have been redirected to /dev/null
1515 by the management layer, or to a log handler).
1516
1517 The backend program must end (as quickly and cleanly as possible) when
the SIGTERM signal is received. Eventually, it may receive SIGKILL from
the management layer after a few seconds.
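
One way for a backend to honour the SIGTERM convention is to record the
signal in a flag that its main loop checks, as sketched below for a
POSIX system. The function and variable names are hypothetical, and the
main loop that polls ``terminate_requested`` is left out.

.. code:: c

  #include <signal.h>
  #include <stddef.h>
  #include <string.h>

  /* Set from the signal handler, polled by the (omitted) main loop. */
  static volatile sig_atomic_t terminate_requested;

  static void on_sigterm(int signo)
  {
      (void)signo;
      terminate_requested = 1;
  }

  /* Returns 0 on success, -1 on failure (as sigaction() does). */
  static int install_sigterm_handler(void)
  {
      struct sigaction sa;

      memset(&sa, 0, sizeof(sa));
      sa.sa_handler = on_sigterm;
      sigemptyset(&sa.sa_mask);
      return sigaction(SIGTERM, &sa, NULL);
  }

Keeping the handler down to a single flag assignment leaves the actual
cleanup to normal program context, which helps the backend end quickly
and cleanly.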
1520
1521 The following command line options have an expected behaviour. They
1522 are mandatory, unless explicitly said differently:
1523
1524 --socket-path=PATH
1525
This option specifies the location of the vhost-user Unix domain socket.
1527 It is incompatible with --fd.
1528
1529 --fd=FDNUM
1530
1531 When this argument is given, the backend program is started with the
1532 vhost-user socket as file descriptor FDNUM. It is incompatible with
1533 --socket-path.
1534
1535 --print-capabilities
1536
1537 Output to stdout the backend capabilities in JSON format, and then
1538 exit successfully. Other options and arguments should be ignored, and
1539 the backend program should not perform its normal function. The
1540 capabilities can be reported dynamically depending on the host
1541 capabilities.
1542
The JSON output is described in the ``vhost-user.json`` schema, by
``@VHostUserBackendCapabilities``. Example:
1545
.. code:: json

  {
    "type": "foo",
    "features": [
      "feature-a",
      "feature-b"
    ]
  }
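
A backend written in C might build its ``--print-capabilities`` output
as sketched below before writing it to stdout and exiting successfully.
The function name is hypothetical, and the sketch assumes that the type
and feature strings contain no characters that need JSON escaping.

.. code:: c

  #include <stddef.h>
  #include <stdio.h>

  /*
   * Hypothetical helper: format a capabilities document into buf.
   * Returns the number of characters written, or -1 when buf is too
   * small or formatting fails.
   */
  static int format_capabilities(char *buf, size_t len, const char *type,
                                 const char *const *features,
                                 size_t nfeatures)
  {
      int total = snprintf(buf, len, "{\"type\": \"%s\", \"features\": [",
                           type);

      for (size_t i = 0;
           i < nfeatures && total >= 0 && (size_t)total < len; i++) {
          int n = snprintf(buf + total, len - (size_t)total, "%s\"%s\"",
                           i ? ", " : "", features[i]);
          total = n < 0 ? -1 : total + n;
      }
      if (total >= 0 && (size_t)total < len) {
          int n = snprintf(buf + total, len - (size_t)total, "]}\n");
          total = n < 0 ? -1 : total + n;
      }
      return (total >= 0 && (size_t)total < len) ? total : -1;
  }

The output is on a single line; whitespace is not significant in JSON,
so it carries the same information as the indented example.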
1555
1556 vhost-user-input
1557 ----------------
1558
1559 Command line options:
1560
1561 --evdev-path=PATH
1562
Specify the Linux input device.
1564
1565 (optional)
1566
1567 --no-grab
1568
Do not request exclusive access to the input device.
1570
1571 (optional)
1572
1573 vhost-user-gpu
1574 --------------
1575
1576 Command line options:
1577
1578 --render-node=PATH
1579
1580 Specify the GPU DRM render node.
1581
1582 (optional)
1583
1584 --virgl
1585
1586 Enable virgl rendering support.
1587
1588 (optional)
1589
1590 vhost-user-blk
1591 --------------
1592
1593 Command line options:
1594
1595 --blk-file=PATH
1596
1597 Specify block device or file path.
1598
1599 (optional)
1600
1601 --read-only
1602
1603 Enable read-only.
1604
1605 (optional)