Vhost-user Protocol
===================

Copyright (c) 2014 Virtual Open Systems Sarl.

This work is licensed under the terms of the GNU GPL, version 2 or later.
See the COPYING file in the top-level directory.
===================

This protocol aims to complement the ioctl interface used to control the
vhost implementation in the Linux kernel. It implements the control plane needed
to establish virtqueue sharing with a user space process on the same host. It
communicates over a Unix domain socket and shares file descriptors in the
ancillary data of the messages.

The protocol defines two sides of the communication, master and slave. Master is
the application that shares its virtqueues, in our case QEMU. Slave is the
consumer of the virtqueues.

In the current implementation QEMU is the Master, and the Slave is intended to
be a software Ethernet switch running in user space, such as Snabbswitch.

Master and slave can each be either a client (i.e. connecting) or a server
(listening) in the socket communication.

Message Specification
---------------------

Note that all numbers are in the machine native byte order. A vhost-user message
consists of 3 header fields and a payload:

------------------------------------
| request | flags | size | payload |
------------------------------------

 * Request: 32-bit type of the request
 * Flags: 32-bit bit field:
   - Lower 2 bits are the version (currently 0x01)
   - Bit 2 is the reply flag - needs to be sent on each reply from the slave
   - Bit 3 is the need_reply flag - see VHOST_USER_PROTOCOL_F_REPLY_ACK for
     details.
 * Size: 32-bit size of the payload
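
The flags layout above can be sketched in C as follows. This is only an
illustration: the VU_* macro names, struct vu_header and the helper functions
are hypothetical and do not come from QEMU; only the bit positions and the
version value are taken from the list above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; the bit positions come from the field list above. */
#define VU_VERSION_MASK    0x3u       /* lower 2 bits carry the version */
#define VU_VERSION         0x1u       /* current version */
#define VU_REPLY_FLAG      (1u << 2)  /* set on each reply from the slave */
#define VU_NEED_REPLY_FLAG (1u << 3)  /* see VHOST_USER_PROTOCOL_F_REPLY_ACK */

/* The three header fields that precede the payload, in native byte order. */
struct vu_header {
    uint32_t request;
    uint32_t flags;
    uint32_t size;
};

static_assert(sizeof(struct vu_header) == 12,
              "header is three 32-bit fields");

/* True if the version bits match the current protocol version. */
bool vu_version_ok(const struct vu_header *h)
{
    return (h->flags & VU_VERSION_MASK) == VU_VERSION;
}

/* True if the sender set the need_reply flag (bit 3). */
bool vu_need_reply(const struct vu_header *h)
{
    return (h->flags & VU_NEED_REPLY_FLAG) != 0;
}

/* Flags word a slave would put on a reply: version plus the reply flag. */
uint32_t vu_reply_flags(void)
{
    return VU_VERSION | VU_REPLY_FLAG;
}
```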


Depending on the request type, payload can be:

 * A single 64-bit integer
   -------
   | u64 |
   -------

   u64: a 64-bit unsigned integer

 * A vring state description
   ---------------
   | index | num |
   ---------------

   Index: a 32-bit index
   Num: a 32-bit number

 * A vring address description
   --------------------------------------------------------------
   | index | flags | size | descriptor | used | available | log |
   --------------------------------------------------------------

   Index: a 32-bit vring index
   Flags: a 32-bit vring flags
   Descriptor: a 64-bit user address of the vring descriptor table
   Used: a 64-bit user address of the vring used ring
   Available: a 64-bit user address of the vring available ring
   Log: a 64-bit guest address for logging

 * Memory regions description
   ---------------------------------------------------
   | num regions | padding | region0 | ... | region7 |
   ---------------------------------------------------

   Num regions: a 32-bit number of regions
   Padding: 32-bit

   A region is:
   -----------------------------------------------------
   | guest address | size | user address | mmap offset |
   -----------------------------------------------------

   Guest address: a 64-bit guest address of the region
   Size: a 64-bit size
   User address: a 64-bit user address
   mmap offset: 64-bit offset where region starts in the mapped memory

 * Log description
   ---------------------------
   | log size | log offset |
   ---------------------------
   log size: size of area used for logging
   log offset: offset from start of supplied file descriptor
       where logging starts (i.e. where guest address 0 would be logged)

In QEMU the vhost-user message is implemented with the following struct:

typedef struct VhostUserMsg {
    VhostUserRequest request;
    uint32_t flags;
    uint32_t size;
    union {
        uint64_t u64;
        struct vhost_vring_state state;
        struct vhost_vring_addr addr;
        VhostUserMemory memory;
        VhostUserLog log;
    };
} QEMU_PACKED VhostUserMsg;

Communication
-------------

The protocol for vhost-user is based on the existing implementation of vhost
for the Linux kernel. Most messages that can be sent via the Unix domain socket
implementing vhost-user have an equivalent ioctl to the kernel implementation.

The communication consists of master sending message requests and slave sending
message replies. Most of the requests don't require replies. Here is a list of
the ones that do:

 * VHOST_USER_GET_FEATURES
 * VHOST_USER_GET_PROTOCOL_FEATURES
 * VHOST_USER_GET_VRING_BASE
 * VHOST_USER_SET_LOG_BASE (if VHOST_USER_PROTOCOL_F_LOG_SHMFD)

[ Also see the section on REPLY_ACK protocol extension. ]

There are several messages that the master sends with file descriptors passed
in the ancillary data:

 * VHOST_USER_SET_MEM_TABLE
 * VHOST_USER_SET_LOG_BASE (if VHOST_USER_PROTOCOL_F_LOG_SHMFD)
 * VHOST_USER_SET_LOG_FD
 * VHOST_USER_SET_VRING_KICK
 * VHOST_USER_SET_VRING_CALL
 * VHOST_USER_SET_VRING_ERR
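
On the receiving side, file descriptors passed this way arrive as SCM_RIGHTS
control messages. The following is a minimal sketch of a receive helper,
assuming a Linux-style recvmsg(); vu_recv_with_fds and VU_MAX_FDS are
hypothetical names, and the limit of 8 descriptors is an assumption based on
the eight regions shown in the memory regions description.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

#define VU_MAX_FDS 8  /* assumption: one fd per memory region, region0..region7 */

/*
 * Read up to len bytes from sock into buf and collect any file descriptors
 * carried as SCM_RIGHTS ancillary data. Returns the byte count reported by
 * recvmsg(), or -1 on error; *nfds receives the number of descriptors stored.
 */
ssize_t vu_recv_with_fds(int sock, void *buf, size_t len,
                         int *fds, size_t *nfds)
{
    union {
        char buf[CMSG_SPACE(VU_MAX_FDS * sizeof(int))];
        struct cmsghdr align;  /* forces suitable alignment of the buffer */
    } control;
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg;
    ssize_t n;

    assert(buf != NULL && fds != NULL && nfds != NULL);

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof(control.buf);

    *nfds = 0;
    n = recvmsg(sock, &msg, 0);
    if (n < 0) {
        return -1;
    }

    for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c != NULL;
         c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS) {
            size_t count = (c->cmsg_len - CMSG_LEN(0)) / sizeof(int);

            if (count > VU_MAX_FDS) {
                count = VU_MAX_FDS;
            }
            memcpy(fds, CMSG_DATA(c), count * sizeof(int));
            *nfds = count;
        }
    }
    return n;
}
```

A real implementation would also loop on short reads and close descriptors it
does not expect; both are left out here to keep the sketch short.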

If Master is unable to send the full message or receives a wrong reply it will
close the connection. An optional reconnection mechanism can be implemented.

Any protocol extensions are gated by protocol feature bits,
which allows full backwards compatibility on both master
and slave.
As older slaves don't support negotiating protocol features,
a feature bit was dedicated for this purpose:
#define VHOST_USER_F_PROTOCOL_FEATURES 30
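
A short sketch of how a master might apply this gate is shown below. The
function names are hypothetical, and taking the intersection of both sides'
protocol feature masks is an assumption about a typical negotiation, not
something this document mandates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VHOST_USER_F_PROTOCOL_FEATURES 30

static_assert(VHOST_USER_F_PROTOCOL_FEATURES < 64,
              "bit must fit in the u64 feature mask");

/* True if the slave's feature mask advertises protocol feature negotiation. */
bool vu_has_protocol_features(uint64_t features)
{
    return (features & (1ULL << VHOST_USER_F_PROTOCOL_FEATURES)) != 0;
}

/*
 * Assumption: the master enables only the protocol features that both sides
 * support, and none at all when talking to an older slave.
 */
uint64_t vu_protocol_features_to_set(uint64_t slave_features,
                                     uint64_t slave_protocol_features,
                                     uint64_t master_protocol_features)
{
    if (!vu_has_protocol_features(slave_features)) {
        return 0;
    }
    return slave_protocol_features & master_protocol_features;
}
```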

Starting and stopping rings
---------------------------
Client (that is, the slave) must only process each ring when it is started.

Client must only pass data between the ring and the
backend when the ring is enabled.

If ring is started but disabled, client must process the
ring without talking to the backend.

For example, for a networking device, in the disabled state
client must not supply any new RX packets, but must process
and discard any TX packets.

If VHOST_USER_F_PROTOCOL_FEATURES has not been negotiated, the ring is
initialized in an enabled state.

If VHOST_USER_F_PROTOCOL_FEATURES has been negotiated, the ring is initialized
in a disabled state. Client must not pass data to/from the backend until ring is
enabled by VHOST_USER_SET_VRING_ENABLE with parameter 1, or after it has been
disabled by VHOST_USER_SET_VRING_ENABLE with parameter 0.

Each ring is initialized in a stopped state; client must not process it until
ring is started, or after it has been stopped.

Client must start ring upon receiving a kick (that is, detecting that file
descriptor is readable) on the descriptor specified by
VHOST_USER_SET_VRING_KICK, and stop ring upon receiving
VHOST_USER_GET_VRING_BASE.

While processing the rings (whether they are enabled or not), client must
support changing some configuration aspects on the fly.
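
The two independent ring properties described above (started/stopped and
enabled/disabled) can be modelled as a small state record. The sketch below is
illustrative only; struct vu_ring and the vu_ring_* functions are hypothetical
names, not part of any existing implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct vu_ring {
    bool started;  /* set by a kick, cleared by VHOST_USER_GET_VRING_BASE */
    bool enabled;  /* controlled by VHOST_USER_SET_VRING_ENABLE */
};

/* Rings start stopped; they start enabled only without protocol features. */
void vu_ring_init(struct vu_ring *r, bool protocol_features_negotiated)
{
    assert(r != NULL);
    r->started = false;
    r->enabled = !protocol_features_negotiated;
}

/* The kick file descriptor became readable. */
void vu_ring_on_kick(struct vu_ring *r)
{
    r->started = true;
}

/* VHOST_USER_GET_VRING_BASE was received. */
void vu_ring_on_get_vring_base(struct vu_ring *r)
{
    r->started = false;
}

/* VHOST_USER_SET_VRING_ENABLE was received with parameter 1 or 0. */
void vu_ring_set_enable(struct vu_ring *r, uint32_t num)
{
    r->enabled = (num != 0);
}

/* The ring may be processed at all only while it is started. */
bool vu_ring_may_process(const struct vu_ring *r)
{
    return r->started;
}

/* Data may reach the backend only while started and enabled. */
bool vu_ring_may_use_backend(const struct vu_ring *r)
{
    return r->started && r->enabled;
}
```

For a networking device, a started but disabled ring would take the
vu_ring_may_process() path and discard TX packets without touching the backend.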

Multiple queue support
----------------------

Multiple queue is treated as a protocol extension, hence the slave has to
implement protocol features first. The multiple queues feature is supported
only when the protocol feature VHOST_USER_PROTOCOL_F_MQ (bit 0) is set.

The max number of queues the slave supports can be queried with message
VHOST_USER_GET_QUEUE_NUM. Master should stop when the number of
requested queues is bigger than that.

As all queues share one connection, the master uses a unique index for each
queue in the sent message to identify a specified queue. One queue pair
is enabled initially. More queues are enabled dynamically, by sending
message VHOST_USER_SET_VRING_ENABLE.

Migration
---------

During live migration, the master may need to track the modifications
the slave makes to the memory mapped regions. The slave should mark
the dirty pages in a log. Once it complies with this logging, it may
declare the VHOST_F_LOG_ALL vhost feature.

To start/stop logging of data/used ring writes, master may send messages
VHOST_USER_SET_FEATURES with VHOST_F_LOG_ALL and VHOST_USER_SET_VRING_ADDR with
VHOST_VRING_F_LOG in ring's flags set to 1/0, respectively.

All the modifications to memory pointed to by vring "descriptor" should
be marked. Modifications to "used" vring should be marked if
VHOST_VRING_F_LOG is part of ring's flags.

Dirty pages are of size:
#define VHOST_LOG_PAGE 0x1000

The log memory fd is provided in the ancillary data of
VHOST_USER_SET_LOG_BASE message when the slave has
VHOST_USER_PROTOCOL_F_LOG_SHMFD protocol feature.

The size of the log is supplied as part of VhostUserMsg
which should be large enough to cover all known guest
addresses. Log starts at the supplied offset in the
supplied file descriptor.
The log covers from address 0 to the maximum of guest
regions. In pseudo-code, to mark page at "addr" as dirty:

page = addr / VHOST_LOG_PAGE
log[page / 8] |= 1 << (page % 8)

Where addr is the guest physical address.

Use atomic operations, as the log may be concurrently manipulated.
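
The pseudo-code above could be written with C11 atomics roughly as follows.
This is a sketch under assumptions: the log is taken to be already mapped and
accessible through an _Atomic uint8_t pointer, and vu_log_page and
vu_log_write are hypothetical helper names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define VHOST_LOG_PAGE 0x1000

/* Atomically set the dirty bit for the page containing guest address addr. */
void vu_log_page(_Atomic uint8_t *log, uint64_t log_size, uint64_t addr)
{
    uint64_t page = addr / VHOST_LOG_PAGE;

    assert(log != NULL);
    if (page / 8 >= log_size) {
        return;  /* outside the log area supplied by the master */
    }
    atomic_fetch_or(&log[page / 8], (uint8_t)(1u << (page % 8)));
}

/* Mark every page touched by a write of len bytes starting at addr.
 * Assumes addr + len does not wrap around. */
void vu_log_write(_Atomic uint8_t *log, uint64_t log_size,
                  uint64_t addr, uint64_t len)
{
    if (len == 0) {
        return;
    }
    uint64_t first = addr / VHOST_LOG_PAGE;
    uint64_t last = (addr + len - 1) / VHOST_LOG_PAGE;

    for (uint64_t page = first; page <= last; page++) {
        vu_log_page(log, log_size, page * VHOST_LOG_PAGE);
    }
}
```

A write that straddles a page boundary sets two bits, which is why the range
helper walks from the first to the last touched page.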

Note that when logging modifications to the used ring (when VHOST_VRING_F_LOG
is set for this ring), log_guest_addr should be used to calculate the log
offset: the write to the first byte of the used ring is logged at this offset
from log start. Also note that this value might be outside the legal guest
physical address range (i.e. does not have to be covered by the VhostUserMemory
table), but the bit offset of the last byte of the ring must fall within
the size supplied by VhostUserLog.

VHOST_USER_SET_LOG_FD is an optional message with an eventfd in
ancillary data; it may be used to inform the master that the log has
been modified.

Once the source has finished migration, rings will be stopped by
the source. No further update must be done before rings are
restarted.

Protocol features
-----------------

#define VHOST_USER_PROTOCOL_F_MQ             0
#define VHOST_USER_PROTOCOL_F_LOG_SHMFD      1
#define VHOST_USER_PROTOCOL_F_RARP           2
#define VHOST_USER_PROTOCOL_F_REPLY_ACK      3

Message types
-------------

 * VHOST_USER_GET_FEATURES

      Id: 1
      Equivalent ioctl: VHOST_GET_FEATURES
      Master payload: N/A
      Slave payload: u64

      Get from the underlying vhost implementation the features bitmask.
      Feature bit VHOST_USER_F_PROTOCOL_FEATURES signals slave support for
      VHOST_USER_GET_PROTOCOL_FEATURES and VHOST_USER_SET_PROTOCOL_FEATURES.

 * VHOST_USER_SET_FEATURES

      Id: 2
      Equivalent ioctl: VHOST_SET_FEATURES
      Master payload: u64

      Enable features in the underlying vhost implementation using a bitmask.
      Feature bit VHOST_USER_F_PROTOCOL_FEATURES signals slave support for
      VHOST_USER_GET_PROTOCOL_FEATURES and VHOST_USER_SET_PROTOCOL_FEATURES.

 * VHOST_USER_GET_PROTOCOL_FEATURES

      Id: 15
      Equivalent ioctl: VHOST_GET_FEATURES
      Master payload: N/A
      Slave payload: u64

      Get the protocol feature bitmask from the underlying vhost implementation.
      Only legal if feature bit VHOST_USER_F_PROTOCOL_FEATURES is present in
      VHOST_USER_GET_FEATURES.
      Note: slave that reported VHOST_USER_F_PROTOCOL_FEATURES must support
      this message even before VHOST_USER_SET_FEATURES was called.

 * VHOST_USER_SET_PROTOCOL_FEATURES

      Id: 16
      Equivalent ioctl: VHOST_SET_FEATURES
      Master payload: u64

      Enable protocol features in the underlying vhost implementation.
      Only legal if feature bit VHOST_USER_F_PROTOCOL_FEATURES is present in
      VHOST_USER_GET_FEATURES.
      Note: slave that reported VHOST_USER_F_PROTOCOL_FEATURES must support
      this message even before VHOST_USER_SET_FEATURES was called.

 * VHOST_USER_SET_OWNER

      Id: 3
      Equivalent ioctl: VHOST_SET_OWNER
      Master payload: N/A

      Issued when a new connection is established. It sets the current Master
      as an owner of the session. This can be used on the Slave as a
      "session start" flag.

 * VHOST_USER_RESET_OWNER

      Id: 4
      Master payload: N/A

      This is no longer used. Used to be sent to request disabling
      all rings, but some clients interpreted it to also discard
      connection state (this interpretation would lead to bugs).
      It is recommended that clients either ignore this message,
      or use it to disable all rings.

 * VHOST_USER_SET_MEM_TABLE

      Id: 5
      Equivalent ioctl: VHOST_SET_MEM_TABLE
      Master payload: memory regions description

      Sets the memory map regions on the slave so it can translate the vring
      addresses. In the ancillary data there is an array of file descriptors
      for each memory mapped region. The size and ordering of the fds matches
      the number and ordering of memory regions.
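
      To illustrate what "translate the vring addresses" can involve, the
      sketch below maps a master user address to a pointer inside the slave's
      own mapping. It assumes the slave has mmap()ed each region's fd from
      file offset 0, so that the region's data begins mmap_offset bytes into
      the mapping; struct vu_region, its mmap_base field and
      vu_user_to_local are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One memory region as described above, plus the slave's own mapping. */
struct vu_region {
    uint64_t guest_addr;   /* guest address of the region */
    uint64_t size;         /* size of the region */
    uint64_t user_addr;    /* address in the master's address space */
    uint64_t mmap_offset;  /* where the region starts in the mapped memory */
    void *mmap_base;       /* hypothetical: address returned by the slave's mmap() */
};

/* Translate a master user address to a local pointer, or NULL if unmapped. */
void *vu_user_to_local(const struct vu_region *regions, uint32_t num,
                       uint64_t user_addr)
{
    assert(regions != NULL || num == 0);

    for (uint32_t i = 0; i < num; i++) {
        const struct vu_region *r = &regions[i];

        /* Subtracting first avoids overflow in user_addr + size. */
        if (user_addr >= r->user_addr && user_addr - r->user_addr < r->size) {
            return (uint8_t *)r->mmap_base + r->mmap_offset +
                   (user_addr - r->user_addr);
        }
    }
    return NULL;
}
```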

 * VHOST_USER_SET_LOG_BASE

      Id: 6
      Equivalent ioctl: VHOST_SET_LOG_BASE
      Master payload: u64
      Slave payload: N/A

      Sets logging shared memory space.
      When slave has VHOST_USER_PROTOCOL_F_LOG_SHMFD protocol
      feature, the log memory fd is provided in the ancillary data of
      VHOST_USER_SET_LOG_BASE message; the size and offset of the shared
      memory area are provided in the message.


 * VHOST_USER_SET_LOG_FD

      Id: 7
      Equivalent ioctl: VHOST_SET_LOG_FD
      Master payload: N/A

      Sets the logging file descriptor, which is passed as ancillary data.

 * VHOST_USER_SET_VRING_NUM

      Id: 8
      Equivalent ioctl: VHOST_SET_VRING_NUM
      Master payload: vring state description

      Set the size of the queue.

 * VHOST_USER_SET_VRING_ADDR

      Id: 9
      Equivalent ioctl: VHOST_SET_VRING_ADDR
      Master payload: vring address description
      Slave payload: N/A

      Sets the addresses of the different aspects of the vring.

 * VHOST_USER_SET_VRING_BASE

      Id: 10
      Equivalent ioctl: VHOST_SET_VRING_BASE
      Master payload: vring state description

      Sets the base offset in the available vring.

 * VHOST_USER_GET_VRING_BASE

      Id: 11
      Equivalent ioctl: VHOST_GET_VRING_BASE
      Master payload: vring state description
      Slave payload: vring state description

      Get the available vring base offset.

 * VHOST_USER_SET_VRING_KICK

      Id: 12
      Equivalent ioctl: VHOST_SET_VRING_KICK
      Master payload: u64

      Set the event file descriptor for adding buffers to the vring. It
      is passed in the ancillary data.
      Bits (0-7) of the payload contain the vring index. Bit 8 is the
      invalid FD flag. This flag is set when there is no file descriptor
      in the ancillary data. This signals that polling should be used
      instead of waiting for a kick.

 * VHOST_USER_SET_VRING_CALL

      Id: 13
      Equivalent ioctl: VHOST_SET_VRING_CALL
      Master payload: u64

      Set the event file descriptor to signal when buffers are used. It
      is passed in the ancillary data.
      Bits (0-7) of the payload contain the vring index. Bit 8 is the
      invalid FD flag. This flag is set when there is no file descriptor
      in the ancillary data. This signals that polling will be used
      instead of waiting for the call.

 * VHOST_USER_SET_VRING_ERR

      Id: 14
      Equivalent ioctl: VHOST_SET_VRING_ERR
      Master payload: u64

      Set the event file descriptor to signal when error occurs. It
      is passed in the ancillary data.
      Bits (0-7) of the payload contain the vring index. Bit 8 is the
      invalid FD flag. This flag is set when there is no file descriptor
      in the ancillary data.
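
      The u64 payload shared by the KICK, CALL and ERR messages can be
      decoded as sketched below. The macro and function names are
      hypothetical; the bit layout is the one described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VU_VRING_IDX_MASK  0xffu      /* bits 0-7: vring index */
#define VU_VRING_NOFD_FLAG (1u << 8)  /* bit 8: no fd in the ancillary data */

static_assert((VU_VRING_IDX_MASK & VU_VRING_NOFD_FLAG) == 0,
              "index bits and the invalid FD flag must not overlap");

/* Extract the vring index from the u64 payload. */
unsigned int vu_vring_index(uint64_t u64)
{
    return (unsigned int)(u64 & VU_VRING_IDX_MASK);
}

/* True if a file descriptor is expected in the ancillary data. */
bool vu_vring_has_fd(uint64_t u64)
{
    return (u64 & VU_VRING_NOFD_FLAG) == 0;
}
```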

 * VHOST_USER_GET_QUEUE_NUM

      Id: 17
      Equivalent ioctl: N/A
      Master payload: N/A
      Slave payload: u64

      Query how many queues the backend supports. This request should be
      sent only when VHOST_USER_PROTOCOL_F_MQ is set in queried protocol
      features by VHOST_USER_GET_PROTOCOL_FEATURES.

 * VHOST_USER_SET_VRING_ENABLE

      Id: 18
      Equivalent ioctl: N/A
      Master payload: vring state description

      Signal slave to enable or disable corresponding vring.
      This request should be sent only when VHOST_USER_F_PROTOCOL_FEATURES
      has been negotiated.

 * VHOST_USER_SEND_RARP

      Id: 19
      Equivalent ioctl: N/A
      Master payload: u64

      Ask vhost user backend to broadcast a fake RARP to notify that the
      migration has terminated, for guests that do not support GUEST_ANNOUNCE.
      Only legal if feature bit VHOST_USER_F_PROTOCOL_FEATURES is present in
      VHOST_USER_GET_FEATURES and protocol feature bit VHOST_USER_PROTOCOL_F_RARP
      is present in VHOST_USER_GET_PROTOCOL_FEATURES.
      The first 6 bytes of the payload contain the MAC address of the guest to
      allow the vhost user backend to construct and broadcast the fake RARP.
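
      A backend might pull the MAC address out of the u64 as sketched here.
      The sketch assumes "first 6 bytes" means the first six bytes of the
      payload in memory order; vu_rarp_mac and VU_MAC_LEN are hypothetical
      names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VU_MAC_LEN 6

/* Copy the guest MAC address out of the first 6 bytes of the u64 payload. */
void vu_rarp_mac(uint64_t payload, uint8_t mac[VU_MAC_LEN])
{
    assert(mac != NULL);
    /* Assumption: byte 0 of the payload in memory is the first MAC octet. */
    memcpy(mac, &payload, VU_MAC_LEN);
}
```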

VHOST_USER_PROTOCOL_F_REPLY_ACK:
--------------------------------
The original vhost-user specification only demands replies for certain
commands. This differs from the vhost protocol implementation where commands
are sent over an ioctl() call and block until the client has completed.

With this protocol extension negotiated, the sender (QEMU) can set the
"need_reply" [Bit 3] flag on any command. This indicates that
the client MUST respond with a Payload VhostUserMsg indicating success or
failure. The payload should be set to zero on success or non-zero on failure,
unless the message already has an explicit reply body.

The response payload gives QEMU a deterministic indication of the result
of the command. Today, QEMU is expected to terminate the main vhost-user
loop upon receiving such errors. In future, QEMU could be taught to be more
resilient for selective requests.

For the message types that already solicit a reply from the client, the
presence of VHOST_USER_PROTOCOL_F_REPLY_ACK or need_reply bit being set brings
no behavioural change. (See the 'Communication' section for details.)
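
The decision a client has to make under this extension can be sketched as
follows. The function names and the has_reply_body parameter are hypothetical;
the caller is assumed to know whether the request type already carries an
explicit reply.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VHOST_USER_PROTOCOL_F_REPLY_ACK 3
#define VU_NEED_REPLY_FLAG (1u << 3)  /* hypothetical name for flags bit 3 */

static_assert(VHOST_USER_PROTOCOL_F_REPLY_ACK < 64,
              "bit must fit in the u64 protocol feature mask");

/*
 * True if the client owes the sender a u64 acknowledgement: the extension
 * was negotiated, need_reply is set, and the request has no reply body of
 * its own.
 */
bool vu_must_ack(uint64_t protocol_features, uint32_t flags,
                 bool has_reply_body)
{
    bool negotiated =
        ((protocol_features >> VHOST_USER_PROTOCOL_F_REPLY_ACK) & 1) != 0;
    bool need_reply = (flags & VU_NEED_REPLY_FLAG) != 0;

    return negotiated && need_reply && !has_reply_body;
}

/* Acknowledgement payload: zero on success, non-zero on failure. */
uint64_t vu_ack_payload(int err)
{
    return err == 0 ? 0 : 1;
}
```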