..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2010-2016 Intel Corporation.

Vhost Library
=============

The vhost library implements a user space virtio net server, allowing the user
to manipulate the virtio ring directly. In other words, it allows the user
to fetch/put packets from/to the VM virtio net device. To achieve this, a
vhost library should be able to:

* Access the guest memory:

  For QEMU, this is done by using the ``-object memory-backend-file,share=on,...``
  option, which means QEMU will create a file to serve as the guest RAM.
  The ``share=on`` option allows another process to map that file, which
  means it can access the guest RAM.

* Know all the necessary information about the vring:

  Information such as where the available ring is stored. Vhost defines some
  messages (passed through a Unix domain socket file) to tell the backend all
  the information it needs to know how to manipulate the vring.
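
As a rough illustration of why ``share=on`` matters, the sketch below maps the
same file twice with ``MAP_SHARED``, so that a store through one mapping is
visible through the other. ``map_shared_file()`` and any path passed to it are
hypothetical names chosen for this example; they are not part of QEMU or of the
vhost library, and a real backend receives the file descriptor from QEMU
instead of opening a path itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Map `len` bytes of the file at `path` with MAP_SHARED, creating and sizing
 * the file if necessary. Returns NULL on failure.
 * Hypothetical helper for illustration; not a vhost library function.
 */
static void *map_shared_file(const char *path, size_t len)
{
    assert(path != NULL && len > 0);

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;

    /* Make sure the file is large enough to back the whole mapping. */
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return NULL;
    }

    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping remains valid after the descriptor is closed */

    return addr == MAP_FAILED ? NULL : addr;
}
```

Calling ``map_shared_file()`` twice on the same path yields two distinct
addresses backed by the same pages. With ``MAP_PRIVATE`` instead, each mapping
would get its own copy on write, which is why a guest RAM file that is not
shared cannot be used by the backend.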


Vhost API Overview
------------------

The following is an overview of some key Vhost API functions:

* ``rte_vhost_driver_register(path, flags)``

  This function registers a vhost driver into the system. ``path`` specifies
  the Unix domain socket file path.

  Currently supported flags are:

  - ``RTE_VHOST_USER_CLIENT``

    DPDK vhost-user will act as the client when this flag is given. See below
    for an explanation.

  - ``RTE_VHOST_USER_NO_RECONNECT``

    When DPDK vhost-user acts as the client it will keep trying to reconnect
    to the server (QEMU) until it succeeds. This is useful in two cases:

    * When QEMU is not started yet.
    * When QEMU restarts (for example due to a guest OS reboot).

    This reconnect option is enabled by default. However, it can be turned off
    by setting this flag.

  - ``RTE_VHOST_USER_DEQUEUE_ZERO_COPY``

    Dequeue zero copy will be enabled when this flag is set. It is disabled by
    default.

    There are some facts (including limitations) you might want to know before
    setting this flag:

    * Zero copy is not good for small packets (typically for packet sizes below
      512 bytes).

    * Zero copy is really good for the VM2VM case. For iperf between two VMs,
      the boost could be above 70% (when TSO is enabled).

    * For zero copy in the VM2NIC case, the guest Tx used vring may be starved
      if the PMD driver consumes the mbufs but does not release them in a
      timely manner.

      For example, the i40e driver has an optimization to maximize the NIC
      pipeline, which postpones returning transmitted mbufs until only
      ``tx_free_threshold`` free descriptors are left. The virtio Tx used ring
      will be starved if the condition
      ``num_i40e_tx_desc - num_virtio_tx_desc > tx_free_threshold`` is true,
      since i40e will not return the mbufs.

      A performance tip for tuning zero copy in the VM2NIC case is to adjust
      the frequency of mbuf free (i.e. adjust ``tx_free_threshold`` of the i40e
      driver) to balance consumer and producer.

    * Guest memory should be backed with huge pages to achieve better
      performance. Using a 1G page size is best.

      When dequeue zero copy is enabled, the mapping from guest physical
      addresses to host physical addresses has to be established. Using
      non-huge pages means far more page segments. To keep it simple, DPDK
      vhost does a linear search of those segments, so the fewer the segments,
      the quicker the mapping is found. NOTE: this may be sped up with a tree
      search in the future.

    * Zero copy currently does not work when using vfio-pci in IOMMU mode,
      because the IOMMU DMA mapping is not set up for guest memory. If you have
      to use the vfio-pci driver, please insert the vfio-pci kernel module in
      noiommu mode.

    * The consumer of zero copy mbufs should consume these mbufs as soon as
      possible, otherwise it may block the operations in vhost.

  - ``RTE_VHOST_USER_IOMMU_SUPPORT``

    IOMMU support will be enabled when this flag is set. It is disabled by
    default.

    Enabling this flag makes it possible to use the guest vIOMMU to protect
    vhost from accessing memory the virtio device isn't allowed to, when the
    feature is negotiated and an IOMMU device is declared.

    However, this feature enables vhost-user's reply-ack protocol feature,
    whose implementation is buggy in QEMU v2.7.0-v2.9.0 when doing multiqueue.
    Enabling this flag with these QEMU versions results in QEMU being blocked
    when multiple queue pairs are declared.

  - ``RTE_VHOST_USER_POSTCOPY_SUPPORT``

    Postcopy live-migration support will be enabled when this flag is set.
    It is disabled by default.

    Enabling this flag should only be done when the calling application does
    not pre-fault the guest shared memory, otherwise migration would fail.

* ``rte_vhost_driver_set_features(path, features)``

  This function sets the feature bits the vhost-user driver supports. The
  vhost-user driver could be vhost-user net, yet it could be something else,
  say, vhost-user SCSI.

* ``rte_vhost_driver_callback_register(path, vhost_device_ops)``

  This function registers a set of callbacks, to let DPDK applications take
  the appropriate action when some events happen. The following events are
  currently supported:

  * ``new_device(int vid)``

    This callback is invoked when a virtio device becomes ready. ``vid``
    is the vhost device ID.

  * ``destroy_device(int vid)``

    This callback is invoked when a virtio device is paused or shut down.

  * ``vring_state_changed(int vid, uint16_t queue_id, int enable)``

    This callback is invoked when a specific queue's state is changed, for
    example to enabled or disabled.

  * ``features_changed(int vid, uint64_t features)``

    This callback is invoked when the features are changed. For example,
    ``VHOST_F_LOG_ALL`` will be set/cleared at the start/end of live
    migration, respectively.

  * ``new_connection(int vid)``

    This callback is invoked on a new vhost-user socket connection. If DPDK
    acts as the server, the device should not be deleted before the
    ``destroy_connection`` callback is received.

  * ``destroy_connection(int vid)``

    This callback is invoked when the vhost-user socket connection is closed.
    It indicates that the device with id ``vid`` is no longer in use and can be
    safely deleted.

* ``rte_vhost_driver_disable/enable_features(path, features)``

  This function disables/enables some features. For example, it can be used to
  disable mergeable buffers and TSO features, both of which are enabled by
  default.

* ``rte_vhost_driver_start(path)``

  This function triggers the vhost-user negotiation. It should be invoked at
  the end of initializing a vhost-user driver.

* ``rte_vhost_enqueue_burst(vid, queue_id, pkts, count)``

  Transmits (enqueues) ``count`` packets from host to guest.

* ``rte_vhost_dequeue_burst(vid, queue_id, mbuf_pool, pkts, count)``

  Receives (dequeues) ``count`` packets from guest, and stores them at ``pkts``.

* ``rte_vhost_crypto_create(vid, cryptodev_id, sess_mempool, socket_id)``

  As an extension of ``new_device()``, this function adds virtio-crypto workload
  acceleration capability to the device. All crypto workload is processed by
  the DPDK cryptodev with the device ID of ``cryptodev_id``.

* ``rte_vhost_crypto_free(vid)``

  Frees the memory and vhost-user message handlers created in
  ``rte_vhost_crypto_create()``.

* ``rte_vhost_crypto_fetch_requests(vid, queue_id, ops, nb_ops)``

  Receives (dequeues) ``nb_ops`` virtio-crypto requests from guest, parses
  them to DPDK Crypto Operations, and fills the ``ops`` with parsing results.

* ``rte_vhost_crypto_finalize_requests(queue_id, ops, nb_ops)``

  After the ``ops`` are dequeued from Cryptodev, finalizes the jobs and
  notifies the guest(s).

* ``rte_vhost_crypto_set_zero_copy(vid, option)``

  Enables or disables the zero copy feature of the vhost crypto backend.
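
The dequeue zero copy notes above mention that guest physical addresses are
translated to host physical addresses by a linear search over page segments.
The following is a minimal sketch of such a lookup, not the library's actual
code: ``struct demo_guest_page`` and ``demo_gpa_to_hpa()`` are hypothetical
names, and the field layout is an assumption made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One physically contiguous piece of guest memory (hypothetical layout). */
struct demo_guest_page {
    uint64_t guest_phys_addr; /* start of the segment in guest physical space */
    uint64_t host_phys_addr;  /* start of the same segment in host physical space */
    uint64_t size;            /* length of the segment in bytes */
};

/*
 * Translate the guest physical range [gpa, gpa + len) to a host physical
 * address by scanning the segments in order. Returns 0 if the range is not
 * fully contained in a single segment.
 */
static uint64_t demo_gpa_to_hpa(const struct demo_guest_page *pages,
                                size_t nr_pages, uint64_t gpa, uint64_t len)
{
    assert(pages != NULL || nr_pages == 0);

    for (size_t i = 0; i < nr_pages; i++) {
        const struct demo_guest_page *p = &pages[i];

        /* Written without additions so the bounds check cannot overflow. */
        if (gpa >= p->guest_phys_addr &&
            len <= p->size &&
            gpa - p->guest_phys_addr <= p->size - len)
            return p->host_phys_addr + (gpa - p->guest_phys_addr);
    }
    return 0; /* no single segment covers the requested range */
}
```

The cost of each lookup grows with the number of segments, which is the reason
the text recommends 1G huge pages: the same amount of guest memory is then
described by far fewer entries.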

Vhost-user Implementations
--------------------------

Vhost-user uses Unix domain sockets for passing messages. This means the DPDK
vhost-user implementation has two options:

* DPDK vhost-user acts as the server.

  DPDK will create a Unix domain socket server file and listen for
  connections from the frontend.

  Note, this is the default mode, and the only mode before DPDK v16.07.


* DPDK vhost-user acts as the client.

  Unlike the server mode, this mode doesn't create the socket file;
  it just tries to connect to the server (which is responsible for creating
  the file instead).

  When the DPDK vhost-user application restarts, DPDK vhost-user will try to
  connect to the server again. This is how the "reconnect" feature works.

  .. Note::
     * The "reconnect" feature requires **QEMU v2.7** (or above).

     * The vhost supported features must be exactly the same before and
       after the restart. For example, if TSO is disabled and then enabled,
       nothing will work and undefined issues might happen.
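
The client-mode behaviour described above (keep trying to connect until the
server shows up) can be sketched with plain POSIX sockets as follows. This is
an illustration only: ``demo_connect_with_retry()`` is a hypothetical helper,
and the bounded attempt count and fixed delay are assumptions made so the
example terminates, whereas the text says DPDK keeps retrying until it
succeeds.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <time.h>
#include <unistd.h>

/*
 * Try to connect to the Unix domain socket at `path`, making up to
 * `max_attempts` attempts with `delay_ms` milliseconds between them.
 * Returns the connected descriptor, or -1 if every attempt failed.
 */
static int demo_connect_with_retry(const char *path, int max_attempts,
                                   long delay_ms)
{
    struct sockaddr_un addr;

    assert(path != NULL && max_attempts > 0 && delay_ms >= 0);
    if (strlen(path) >= sizeof(addr.sun_path))
        return -1; /* path does not fit into sockaddr_un */

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, path);

    for (int attempt = 0; attempt < max_attempts; attempt++) {
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
            return fd; /* the server is up */
        close(fd);     /* server not there yet, or restarting */

        if (attempt + 1 < max_attempts) {
            struct timespec ts = { delay_ms / 1000,
                                   (delay_ms % 1000) * 1000000L };
            nanosleep(&ts, NULL);
        }
    }
    return -1;
}
```

A fresh socket is created for every attempt because the state of a socket
after a failed ``connect()`` is unspecified, so reusing it is not portable.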

No matter which mode is used, once a connection is established, DPDK
vhost-user will start receiving and processing vhost messages from QEMU.

For messages with a file descriptor, the file descriptor can be used directly
in the vhost process as it is already installed by the Unix domain socket.
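
To make the descriptor passing concrete, the sketch below sends a file
descriptor over a Unix domain socket as ``SCM_RIGHTS`` ancillary data and
picks it up on the other side, where the kernel has already installed it as a
new descriptor in the receiver. ``demo_send_fd()`` and ``demo_recv_fd()`` are
hypothetical helpers written for this example; they do not reproduce the
vhost-user message layout, only the underlying socket mechanism.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send one payload byte plus `fd` as SCM_RIGHTS ancillary data. */
static int demo_send_fd(int sock, int fd)
{
    char payload = 'F';
    struct iovec iov = { .iov_base = &payload, .iov_len = 1 };
    union {
        struct cmsghdr align; /* forces correct alignment of buf */
        char buf[CMSG_SPACE(sizeof(int))];
    } control;
    struct msghdr msg;

    assert(sock >= 0 && fd >= 0);
    memset(&control, 0, sizeof(control));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof(control.buf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one message; return the descriptor it carried, or -1 if none. */
static int demo_recv_fd(int sock)
{
    char payload;
    struct iovec iov = { .iov_base = &payload, .iov_len = 1 };
    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } control;
    struct msghdr msg;
    int fd = -1;

    assert(sock >= 0);
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof(control.buf);

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    for (struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL;
         cmsg = CMSG_NXTHDR(&msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_RIGHTS) {
            memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
            break;
        }
    }
    return fd;
}
```

The number returned by ``demo_recv_fd()`` generally differs from the one the
sender used, but both refer to the same open file, so the receiver can read,
write or map it without opening anything itself.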

The supported vhost messages are:

* ``VHOST_SET_MEM_TABLE``
* ``VHOST_SET_VRING_KICK``
* ``VHOST_SET_VRING_CALL``
* ``VHOST_SET_LOG_FD``
* ``VHOST_SET_VRING_ERR``

For the ``VHOST_SET_MEM_TABLE`` message, QEMU will send information for each
memory region and its file descriptor in the ancillary data of the message.
The file descriptor is used to map that region.

``VHOST_SET_VRING_KICK`` is used as the signal to put the vhost device into
the data plane, and ``VHOST_GET_VRING_BASE`` is used as the signal to remove
the vhost device from the data plane.

When the socket connection is closed, vhost will destroy the device.

Guest memory requirement
------------------------

* Memory pre-allocation

  For non-zerocopy, guest memory pre-allocation is not a must. This can help
  save memory. If users really want the guest memory to be pre-allocated
  (e.g., for performance reasons), the ``-mem-prealloc`` option can be added
  when starting QEMU. Alternatively, all memory can be locked at the vhost
  side, which forces memory to be allocated when it is mapped at the vhost
  side; the ``--mlockall`` option in ovs-dpdk is an example.

  For zerocopy, the vhost library forces the VM memory to be pre-allocated
  when mapping the guest memory; the memory also needs to be locked to prevent
  pages from being swapped out to disk.

* Memory sharing

  Make sure the ``share=on`` QEMU option is given. vhost-user will not work
  with a QEMU version without shared memory mapping.
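
The difference between lazily allocated and pre-allocated, locked memory can
be illustrated with the Linux-specific sketch below. It is an assumption-laden
example rather than the library's code: ``demo_alloc_prefaulted()`` is a
hypothetical helper, and it uses anonymous memory to stay self-contained,
whereas guest RAM is a shared file mapping.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for MAP_ANONYMOUS and MAP_POPULATE on Linux */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Allocate `len` bytes of anonymous memory with the pages faulted in up
 * front (MAP_POPULATE), then try to pin them with mlock() so they cannot be
 * swapped out. `*locked` is set to 1 if mlock() succeeded and 0 otherwise
 * (for example when RLIMIT_MEMLOCK is too low).
 * Returns NULL if the mapping itself failed.
 */
static void *demo_alloc_prefaulted(size_t len, int *locked)
{
    assert(len > 0 && locked != NULL);
    *locked = 0;

    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (addr == MAP_FAILED)
        return NULL;

    *locked = (mlock(addr, len) == 0);
    return addr;
}
```

Without ``MAP_POPULATE`` the kernel would allocate each page on first touch,
which is the memory-saving behaviour the non-zerocopy case can rely on.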

Vhost supported vSwitch reference
---------------------------------

For more vhost details and how to support vhost in vSwitch, please refer to
the vhost example in the DPDK Sample Applications Guide.

Vhost data path acceleration (vDPA)
-----------------------------------

vDPA supports selective datapath in the vhost-user lib by enabling virtio ring
compatible devices to serve the virtio driver directly for datapath acceleration.

``rte_vhost_driver_attach_vdpa_device`` is used to configure the vhost device
with an accelerated backend.

Vhost device capabilities are also made configurable to adapt to various
devices. Such capabilities include supported features, protocol features, and
queue number.

Finally, a set of device ops is defined for device specific operations:

* ``get_queue_num``

  Called to get supported queue number of the device.

* ``get_features``

  Called to get supported features of the device.

* ``get_protocol_features``

  Called to get supported protocol features of the device.

* ``dev_conf``

  Called to configure the actual device when the virtio device becomes ready.

* ``dev_close``

  Called to close the actual device when the virtio device is stopped.

* ``set_vring_state``

  Called to change the state of the vring in the actual device when vring state
  changes.

* ``set_features``

  Called to set the negotiated features to device.

* ``migration_done``

  Called to allow the device to respond to RARP sending.

* ``get_vfio_group_fd``

  Called to get the VFIO group fd of the device.

* ``get_vfio_device_fd``

  Called to get the VFIO device fd of the device.

* ``get_notify_area``

  Called to get the notify area info of the queue.
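
The ops listed above follow the usual C pattern of a table of function
pointers that generic code calls without knowing the device type. The sketch
below models that pattern with a toy device. Everything in it is hypothetical:
``struct demo_vdpa_ops`` only borrows four of the op names, and its function
signatures are assumptions for illustration, not the library's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ops table; the real structure and signatures may differ. */
struct demo_vdpa_ops {
    int (*get_queue_num)(int did, uint32_t *queue_num);
    int (*get_features)(int did, uint64_t *features);
    int (*dev_conf)(int vid);
    int (*dev_close)(int vid);
};

static int toy_configured; /* 1 while the toy device is configured */

static int toy_get_queue_num(int did, uint32_t *queue_num)
{
    (void)did;
    *queue_num = 2;
    return 0;
}

static int toy_get_features(int did, uint64_t *features)
{
    (void)did;
    *features = (1ULL << 5) | (1ULL << 32); /* arbitrary example bits */
    return 0;
}

static int toy_dev_conf(int vid)  { (void)vid; toy_configured = 1; return 0; }
static int toy_dev_close(int vid) { (void)vid; toy_configured = 0; return 0; }

static const struct demo_vdpa_ops toy_ops = {
    .get_queue_num = toy_get_queue_num,
    .get_features  = toy_get_features,
    .dev_conf      = toy_dev_conf,
    .dev_close     = toy_dev_close,
};

/*
 * Device-independent code: keep only the wanted feature bits that the device
 * reports as supported. Returns -1 if the op is missing or fails.
 */
static int demo_negotiate_features(const struct demo_vdpa_ops *ops, int did,
                                   uint64_t wanted, uint64_t *negotiated)
{
    uint64_t supported = 0;

    assert(ops != NULL && negotiated != NULL);
    if (ops->get_features == NULL || ops->get_features(did, &supported) != 0)
        return -1;

    *negotiated = wanted & supported;
    return 0;
}
```

Checking each pointer for ``NULL`` before calling it lets a device leave out
ops it does not implement, at the cost of an error path in the generic code.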