# Virtualized I/O with Vhost-user {#vhost_processing}

# Table of Contents {#vhost_processing_toc}

- @ref vhost_processing_intro
- @ref vhost_processing_qemu
- @ref vhost_processing_init
- @ref vhost_processing_io_path
- @ref vhost_spdk_optimizations

# Introduction {#vhost_processing_intro}

This document is intended to provide an overview of how Vhost works behind the
scenes. Code snippets used in this document might have been simplified for the
sake of readability and should not be used as an API or implementation
reference.

Reading from the
[Virtio specification](http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html):

```
The purpose of virtio and [virtio] specification is that virtual environments
and guests should have a straightforward, efficient, standard and extensible
mechanism for virtual devices, rather than boutique per-environment or per-OS
mechanisms.
```

Virtio devices use virtqueues to transport data efficiently. A virtqueue is a
set of three different single-producer, single-consumer ring structures
designed to store generic scatter-gather I/O. Virtio is most commonly used in
QEMU VMs, where QEMU itself exposes a virtual PCI device and the guest OS
communicates with it using a specific Virtio PCI driver. With only Virtio
involved, it is always the QEMU process that handles all I/O traffic.
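
For context, a minimal sketch of those three rings is shown below, loosely
following the split virtqueue layout from the Virtio 1.0 specification
(`le16`/`le32`/`le64` denote little-endian fields). The descriptor structure
reappears later in the I/O path section.

```
/* Simplified split virtqueue layout (after the Virtio 1.0 spec). */

struct virtq_desc {            /* descriptor table - describes guest buffers */
    le64 addr;                 /* guest-physical buffer address */
    le32 len;                  /* buffer length */
    le16 flags;                /* e.g. NEXT (chaining), WRITE (device-writable) */
    le16 next;                 /* index of the next descriptor in a chain */
};

struct virtq_avail {           /* driver -> device ("available") ring */
    le16 flags;
    le16 idx;                  /* where the driver puts the next entry */
    le16 ring[];               /* heads of descriptor chains ready for the device */
};

struct virtq_used_elem {
    le32 id;                   /* head of the descriptor chain that completed */
    le32 len;                  /* bytes written into the chain by the device */
};

struct virtq_used {            /* device -> driver ("used") ring */
    le16 flags;
    le16 idx;                  /* where the device puts the next entry */
    struct virtq_used_elem ring[];
};
```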

Vhost is a protocol for devices accessible via inter-process communication.
It uses the same virtqueue layout as Virtio to allow Vhost devices to be mapped
directly to Virtio devices. This allows a Vhost device, exposed by an SPDK
application, to be accessed directly by a guest OS inside a QEMU process with
an existing Virtio (PCI) driver. Only the configuration, I/O submission
notification, and I/O completion interrupts are piped through QEMU.
See also @ref vhost_spdk_optimizations.

The initial vhost implementation is a part of the Linux kernel and uses an ioctl
interface to communicate with userspace applications. What makes it possible for
SPDK to expose a vhost device is the Vhost-user protocol.

The [Vhost-user specification](https://git.qemu.org/?p=qemu.git;a=blob_plain;f=docs/interop/vhost-user.txt;hb=HEAD)
describes the protocol as follows:

```
[Vhost-user protocol] is aiming to complement the ioctl interface used to
control the vhost implementation in the Linux kernel. It implements the control
plane needed to establish virtqueue sharing with a user space process on the
same host. It uses communication over a Unix domain socket to share file
descriptors in the ancillary data of the message.

The protocol defines 2 sides of the communication, master and slave. Master is
the application that shares its virtqueues, in our case QEMU. Slave is the
consumer of the virtqueues.

In the current implementation QEMU is the Master, and the Slave is intended to
be a software Ethernet switch running in user space, such as Snabbswitch.

Master and slave can be either a client (i.e. connecting) or server (listening)
in the socket communication.
```

SPDK vhost is a Vhost-user slave server. It exposes Unix domain sockets and
allows external applications to connect.
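
The file descriptor passing mentioned in the specification uses standard Unix
domain socket ancillary data (SCM_RIGHTS). As a rough illustration - not the
actual SPDK or QEMU code - a slave could receive a message with an attached
descriptor like this:

```
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Receive one message and (optionally) one file descriptor passed as
 * SCM_RIGHTS ancillary data over a connected Unix domain socket.
 * Returns the number of payload bytes read, or -1 on error. */
static ssize_t recv_msg_with_fd(int sock, void *buf, size_t len, int *fd_out)
{
    char cmsg_buf[CMSG_SPACE(sizeof(int))];
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = cmsg_buf,
        .msg_controllen = sizeof(cmsg_buf),
    };

    ssize_t n = recvmsg(sock, &msg, 0);
    if (n < 0)
        return -1;

    *fd_out = -1;
    for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c != NULL;
         c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS) {
            memcpy(fd_out, CMSG_DATA(c), sizeof(int));
            break;
        }
    }
    return n;
}
```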

# QEMU {#vhost_processing_qemu}

One of the major Vhost-user use cases is networking (DPDK) or storage (SPDK)
offload in QEMU. The following diagram presents how a QEMU-based VM
communicates with an SPDK Vhost-SCSI device.

![QEMU/SPDK vhost data flow](img/qemu_vhost_data_flow.svg)

# Device initialization {#vhost_processing_init}

All initialization and management information is exchanged using Vhost-user
messages. The connection always starts with feature negotiation: both
the Master and the Slave expose a list of their implemented features and
negotiate a common subset of them. Most of these features are
implementation-related, but some also concern e.g. multiqueue support or live
migration.
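
Conceptually, the negotiation boils down to a bitwise intersection of two
feature masks, as in the simplified sketch below (real feature bits such as
`VIRTIO_F_VERSION_1` or `VHOST_USER_F_PROTOCOL_FEATURES` are defined by the
Virtio and Vhost-user specifications):

```
#include <stdint.h>

/* Both sides advertise what they implement; only the common subset is used. */
static uint64_t negotiate_features(uint64_t master_features,
                                   uint64_t slave_features)
{
    return master_features & slave_features;
}
```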

After the negotiation, the Vhost-user driver shares its memory, so that the vhost
device (SPDK) can access it directly. The memory can be fragmented into multiple
physically-discontiguous regions, and the Vhost-user specification puts a limit on
their number - currently 8. The driver sends a single message for each region with
the following data (a slave-side view of these fields is sketched after the list):

* file descriptor - for mmap
* user address - for memory translations in Vhost-user messages (e.g.
  translating vring addresses)
* guest address - for buffer address translations in vrings (for QEMU this
  is a physical address inside the guest)
* user offset - positive offset for the mmap
* size
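
As a hedged illustration of that per-region data on the slave side - the struct
and helper below are not the actual SPDK types - a vhost target could keep a
small table of mapped regions:

```
#include <stdint.h>
#include <sys/mman.h>

/* Illustrative per-region bookkeeping for the fields listed above. */
struct mem_region {
    uint64_t guest_phys_addr;  /* "guest address" - GPA of the region start */
    uint64_t user_addr;        /* "user address" - the master's virtual address */
    uint64_t size;             /* region size in bytes */
    uint64_t mmap_offset;      /* "user offset" - offset to apply to the mmap */
    void    *mmap_base;        /* where the slave mapped the region locally */
};

/* Map one region using the file descriptor received with the message.
 * One possible approach: map the whole file and apply the offset afterwards. */
static int map_region(struct mem_region *r, int fd)
{
    void *base = mmap(NULL, r->size + r->mmap_offset,
                      PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED)
        return -1;
    r->mmap_base = (char *)base + r->mmap_offset;
    return 0;
}
```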

The Master will send new memory regions after each memory change - usually
hotplug/hotremove. The previous mappings will be removed.

Drivers may also request a device config, consisting of e.g. disk geometry.
Vhost-SCSI drivers, however, don't need to implement this functionality
as they use common SCSI I/O to query the underlying disk(s).

Afterwards, the driver requests the maximum number of supported queues and
starts sending virtqueue data, which consists of the following (see the sketch
after this list):

* unique virtqueue id
* index of the last processed vring descriptor
* vring addresses (from user address space)
* call descriptor (for interrupting the driver after I/O completions)
* kick descriptor (to listen for I/O requests - unused by SPDK)
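
Collected together, a hypothetical slave-side per-queue structure covering the
items above could look like the sketch below (names are illustrative; the
vring structures come from `<linux/virtio_ring.h>`):

```
#include <stdbool.h>
#include <stdint.h>
#include <linux/virtio_ring.h>  /* struct vring_desc, vring_avail, vring_used */

/* Illustrative per-virtqueue state assembled from the messages listed above. */
struct vq_state {
    uint16_t id;                /* unique virtqueue id */
    uint16_t last_used_idx;     /* index of the last processed vring descriptor */

    /* vring addresses, already translated into this process' address space */
    struct vring_desc  *desc;
    struct vring_avail *avail;
    struct vring_used  *used;

    int  callfd;   /* eventfd used to interrupt the guest after I/O completion */
    int  kickfd;   /* eventfd for I/O submission notifications (unused by SPDK) */
    bool enabled;  /* set once the *ENABLE* message arrives (multiqueue) */
};
```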

If the multiqueue feature has been negotiated, the driver has to send a specific
*ENABLE* message for each extra queue it wants to be polled. Other queues are
polled as soon as they're initialized.

# I/O path {#vhost_processing_io_path}

The Master sends I/O by allocating proper buffers in shared memory, filling
the request data, and putting guest addresses of those buffers into virtqueues.

A Virtio-Block request looks as follows.

```
struct virtio_blk_req {
    uint32_t type;          // READ, WRITE, FLUSH (read-only)
    uint64_t offset;        // offset in the disk (read-only)
    struct iovec buffers[]; // scatter-gather list (read/write)
    uint8_t status;         // I/O completion status (write-only)
};
```

And a Virtio-SCSI request as follows.

```
struct virtio_scsi_req_cmd {
    struct virtio_scsi_cmd_req *req;   // request data (read-only)
    struct iovec read_only_buffers[];  // scatter-gather list for WRITE I/Os
    struct virtio_scsi_cmd_resp *resp; // response data (write-only)
    struct iovec write_only_buffers[]; // scatter-gather list for READ I/Os
};
```

A virtqueue generally consists of an array of descriptors, and each I/O needs
to be converted into a chain of such descriptors. A single descriptor can be
either readable or writable, so each I/O request consists of at least two
descriptors (request + response).

```
struct virtq_desc {
    /* Address (guest-physical). */
    le64 addr;
    /* Length. */
    le32 len;

/* This marks a buffer as continuing via the next field. */
#define VIRTQ_DESC_F_NEXT   1
/* This marks a buffer as device write-only (otherwise device read-only). */
#define VIRTQ_DESC_F_WRITE  2
    /* The flags as indicated above. */
    le16 flags;
    /* Next field if flags & NEXT */
    le16 next;
};
```

Legacy Virtio implementations used the name vring alongside virtqueue, and the
name vring is still used in virtio data structures inside the code. Instead of
`struct virtq_desc`, `struct vring_desc` is much more likely to be found.
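
To make the chaining more concrete, here is an illustrative sketch - not taken
from any driver - of how a Virtio-Block read could be described as a chain of
three descriptors: a device-readable request header, a device-writable data
buffer, and a device-writable status byte. The `gpa_*` arguments are assumed
to be guest-physical addresses of those buffers.

```
#include <stdint.h>
#include <linux/virtio_ring.h>  /* struct vring_desc, VRING_DESC_F_NEXT/WRITE */

static void fill_blk_read_chain(struct vring_desc *desc,
                                uint64_t gpa_hdr, uint32_t hdr_len,
                                uint64_t gpa_data, uint32_t data_len,
                                uint64_t gpa_status)
{
    desc[0].addr  = gpa_hdr;                                 /* request header */
    desc[0].len   = hdr_len;
    desc[0].flags = VRING_DESC_F_NEXT;                       /* device-readable */
    desc[0].next  = 1;

    desc[1].addr  = gpa_data;                                /* data buffer */
    desc[1].len   = data_len;
    desc[1].flags = VRING_DESC_F_NEXT | VRING_DESC_F_WRITE;  /* device writes it */
    desc[1].next  = 2;

    desc[2].addr  = gpa_status;                              /* 1-byte status */
    desc[2].len   = 1;
    desc[2].flags = VRING_DESC_F_WRITE;                      /* device-writable */
    desc[2].next  = 0;
}
```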

After polling this descriptor chain, the device needs to translate and transform
it back into the original request struct. It needs to know the request layout
up-front, so each device backend (Vhost-Block/SCSI) has its own implementation
for polling virtqueues. For each descriptor, the device performs a lookup in
the Vhost-user memory region table and goes through a gpa_to_vva translation
(guest physical address to vhost virtual address). SPDK requires the request
and response data to be contained within a single memory region. I/O buffers
do not have such limitations and SPDK may automatically perform additional
iovec splitting and gpa_to_vva translations if required. After forming the request
structs, SPDK forwards such I/O to the underlying drive and polls for the
completion. Once the I/O completes, SPDK vhost fills the response buffer with
proper data and interrupts the guest by doing an eventfd_write on the call
descriptor for the proper virtqueue. There are multiple interrupt coalescing
features involved, but they are not discussed in this document.
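
As a rough sketch of the gpa_to_vva step described above - reusing the
illustrative `struct mem_region` table from the initialization section, not the
actual SPDK implementation - the lookup could look like this:

```
#include <stddef.h>
#include <stdint.h>

/* Translate a guest-physical address to a vhost virtual address by scanning
 * the table of mmap'ed memory regions. Returns NULL when the requested range
 * is not fully contained in a single region. */
static void *gpa_to_vva(struct mem_region *regions, size_t nregions,
                        uint64_t gpa, uint64_t len)
{
    for (size_t i = 0; i < nregions; i++) {
        struct mem_region *r = &regions[i];
        if (gpa >= r->guest_phys_addr &&
            gpa + len <= r->guest_phys_addr + r->size)
            return (char *)r->mmap_base + (gpa - r->guest_phys_addr);
    }
    return NULL;
}
```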

# SPDK optimizations {#vhost_spdk_optimizations}

Due to its poll-mode nature, SPDK vhost removes the requirement for I/O submission
notifications, drastically increasing the vhost server throughput and decreasing
the guest overhead of submitting an I/O. A couple of different solutions exist
to mitigate the I/O completion interrupt overhead (irqfd, vDPA), but those won't
be discussed in this document. For the highest performance, a poll-mode @ref virtio
can be used, as it suppresses all I/O completion interrupts, letting the I/O
path fully bypass the QEMU/KVM overhead.