= Device Specification for Inter-VM shared memory device =

The Inter-VM shared memory device (ivshmem) is designed to share a
memory region between multiple QEMU processes running different guests
and the host. In order for all guests to be able to pick up the
shared memory area, it is modeled by QEMU as a PCI device exposing
said memory to the guest as a PCI BAR.

The device can use a shared memory object on the host directly, or it
can obtain one from an ivshmem server.

In the latter case, the device can additionally interrupt its peers, and
get interrupted by its peers.


== Configuring the ivshmem PCI device ==

There are two basic configurations:

- Just shared memory: -device ivshmem,shm=NAME,...

  This uses shared memory object NAME.

- Shared memory plus interrupts: -device ivshmem,chardev=CHR,vectors=N,...

  An ivshmem server must already be running on the host. The device
  connects to the server's UNIX domain socket via character device
  CHR.

  Each peer gets assigned a unique ID by the server. IDs must be
  between 0 and 65535.

  Interrupts are message-signaled by default (MSI-X). With msi=off
  the device has no MSI-X capability, and uses legacy INTx instead.
  vectors=N configures the number of vectors to use.

For more details on ivshmem device properties, see The QEMU Emulator
User Documentation (qemu-doc.*).
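
As an illustration, the two configurations could be assembled into QEMU
command-line arguments roughly as follows. This is only a sketch in
Python: the helper names are invented for this example, and the
"-chardev socket,path=...,id=..." option used to create character device
CHR is assumed from QEMU's general character device options, not taken
from this specification.

```python
def shm_only_args(shm_name):
    """Arguments for the 'just shared memory' configuration."""
    return ["-device", f"ivshmem,shm={shm_name}"]


def interrupt_args(chardev_id, socket_path, vectors, msi=True):
    """Arguments for the 'shared memory plus interrupts' configuration.

    socket_path is the ivshmem server's UNIX domain socket (hypothetical
    path chosen by whoever starts the server).
    """
    device = f"ivshmem,chardev={chardev_id},vectors={vectors}"
    if not msi:
        device += ",msi=off"  # no MSI-X capability, legacy INTx instead
    return [
        "-chardev", f"socket,path={socket_path},id={chardev_id}",
        "-device", device,
    ]
```

The returned lists are meant to be appended to an existing QEMU argument
list; other device properties (the "..." above) are left out.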


== The ivshmem PCI device's guest interface ==

The device has vendor ID 1af4, device ID 1110, revision 0.

=== PCI BARs ===

The ivshmem PCI device has two or three BARs:

- BAR0 holds device registers (256 Byte MMIO)
- BAR1 holds MSI-X table and PBA (only when using MSI-X)
- BAR2 maps the shared memory object

There are two ways to use this device:

- If you only need the shared memory part, BAR2 suffices. This way,
  you have access to the shared memory in the guest and can use it as
  you see fit. Memnic, for example, uses ivshmem this way from guest
  user space (see http://dpdk.org/browse/memnic).

- If you additionally need the capability for peers to interrupt each
  other, you need BAR0 and, if using MSI-X, BAR1. You will most
  likely want to write a kernel driver to handle interrupts. This
  requires the device to be configured for interrupts.

If the device is configured for interrupts, BAR2 is initially invalid.
It becomes safely accessible only after the ivshmem server has provided
the shared memory. Guest software should wait for the IVPosition
register (described below) to become non-negative before accessing
BAR2.

The device cannot tell guest software whether it is configured for
interrupts.

=== PCI device registers ===

BAR 0 contains the following registers:

    Offset  Size  Access      On reset  Function
         0     4  read/write         0  Interrupt Mask
                                        bit 0: peer interrupt
                                        bit 1..31: reserved
         4     4  read/write         0  Interrupt Status
                                        bit 0: peer interrupt
                                        bit 1..31: reserved
         8     4  read-only    0 or -1  IVPosition
        12     4  write-only       N/A  Doorbell
                                        bit 0..15: vector
                                        bit 16..31: peer ID
        16   240  none             N/A  reserved

Software should only access the registers as specified in column
"Access". Reserved bits should be ignored on read, and preserved on
write.

The Interrupt Status and Interrupt Mask registers together control the
legacy INTx interrupt when the device has no MSI-X capability: INTx is
asserted when the bit-wise AND of Status and Mask is non-zero.
Interrupt Status register bit 0 becomes 1 when an interrupt request
from a peer is received. Reading the register clears it.
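
The INTx rule above can be modelled in a few lines. The following
Python sketch is illustrative only: the class and method names are
invented for this example, and it models the register semantics as
described here, not QEMU's implementation.

```python
class IntxModel:
    """Toy model of Interrupt Mask / Interrupt Status without MSI-X."""

    PEER_INTERRUPT = 1 << 0  # bit 0 of both registers

    def __init__(self):
        self.mask = 0    # offset 0, value on reset: 0
        self.status = 0  # offset 4, value on reset: 0

    def peer_interrupt(self):
        """An interrupt request from a peer arrived: Status bit 0 is set."""
        self.status |= self.PEER_INTERRUPT

    def intx_asserted(self):
        """INTx is asserted while (Status AND Mask) is non-zero."""
        return (self.status & self.mask) != 0

    def read_status(self):
        """Reading Interrupt Status returns its value and clears it."""
        value, self.status = self.status, 0
        return value
```

With the mask left at its reset value of 0, a peer interrupt sets the
status bit but does not assert INTx; once the guest writes 1 to the
mask, INTx stays asserted until the status register is read.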

IVPosition Register: if the device is not configured for interrupts,
this is zero. Otherwise, it's -1 for a short while after reset, then
changes to the device's ID (between 0 and 65535).

There is no good way for software to find out whether the device is
configured for interrupts. A positive IVPosition means interrupts,
but zero could be either. The initial -1 cannot be reliably observed.
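
A minimal sketch of how guest software might interpret a raw IVPosition
read, assuming the 32-bit value is treated as a signed integer. The
function names and the returned labels are invented for this example.

```python
def to_signed32(raw):
    """Interpret a raw 32-bit register value as a signed integer."""
    raw &= 0xFFFFFFFF
    return raw - (1 << 32) if raw & 0x80000000 else raw


def classify_ivposition(raw):
    """Map a raw IVPosition read to what software can conclude from it."""
    value = to_signed32(raw)
    if value < 0:
        return "not ready"   # -1: ID not assigned yet, BAR2 not safe to use
    if value > 0:
        return "interrupts"  # a positive ID implies an ivshmem server
    return "ambiguous"       # 0: no interrupts, or interrupts with ID 0
```

A driver that needs BAR2 would poll until the result is no longer
"not ready"; it still cannot distinguish the two cases behind
"ambiguous".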

Doorbell Register: writing this register requests to interrupt a peer.
The written value's high 16 bits are the ID of the peer to interrupt,
and its low 16 bits select an interrupt vector.
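
The Doorbell encoding can be written down directly. The helpers below
are a sketch with invented names; they only compute the 32-bit value,
the actual MMIO write to offset 12 of BAR0 is up to the driver.

```python
def doorbell_value(peer_id, vector):
    """Build the 32-bit value to write to the Doorbell register."""
    if not 0 <= peer_id <= 0xFFFF:
        raise ValueError("peer ID must fit in 16 bits")
    if not 0 <= vector <= 0xFFFF:
        raise ValueError("vector must fit in 16 bits")
    return (peer_id << 16) | vector  # bits 16..31: peer ID, bits 0..15: vector


def split_doorbell(value):
    """Inverse of doorbell_value(): return (peer_id, vector)."""
    return (value >> 16) & 0xFFFF, value & 0xFFFF
```

For example, interrupting peer 3 on vector 1 means writing 0x00030001.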

If the device is not configured for interrupts, the write is ignored.

If the interrupt hasn't completed setup, the write is ignored. The
device cannot tell guest software whether setup is complete.
Interrupts can regress to this state on migration.

If the peer with the requested ID isn't connected, or it has too few
interrupt vectors connected for the requested vector, the write is
ignored. The device cannot tell guest software what peers are
connected, or how many interrupt vectors are connected.

If the peer doesn't use MSI-X, its Interrupt Status register is set to
1. This asserts INTx unless masked by the Interrupt Mask register.
The device cannot communicate the interrupt vector to guest software
then.

If the peer uses MSI-X, the interrupt for this vector becomes pending.
There is no way for software to clear the pending bit, and a polling
mode of operation is therefore impossible with MSI-X.

With multiple MSI-X vectors, different vectors can be used to indicate
that different events have occurred. The semantics of interrupt
vectors are left to the application.


== Interrupt infrastructure ==

When configured for interrupts, the peers share eventfd objects in
addition to shared memory. The shared resources are managed by an
ivshmem server.

=== The ivshmem server ===

The server listens on a UNIX domain socket.

For each new client that connects to the server, the server
- picks an ID,
- creates eventfd file descriptors for the interrupt vectors,
- sends the ID and the file descriptor for the shared memory to the
  new client,
- sends connect notifications for the new client to the other clients
  (these contain file descriptors for sending interrupts),
- sends connect notifications for the other clients to the new client,
  and
- sends interrupt setup messages to the new client (these contain file
  descriptors for receiving interrupts).

When a client disconnects from the server, the server sends disconnect
notifications to the other clients.
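
The fan-out performed by the server can be sketched as plain
bookkeeping. The Python below is a simplified model, not the code in
contrib/ivshmem-server/: the function names and the tuple layout
(recipient, kind, subject ID, vector) are invented for this example,
ID allocation and eventfd creation are left out, and it assumes every
client uses the same number of vectors.

```python
def on_client_connect(new_id, peers, nvectors):
    """List what the server sends when client new_id connects.

    peers holds the IDs of the clients that are already connected.
    """
    out = [(new_id, "id+shm", new_id, None)]
    for peer in peers:                  # tell the others about the newcomer
        for vector in range(nvectors):
            out.append((peer, "connect", new_id, vector))
    for peer in peers:                  # tell the newcomer about the others
        for vector in range(nvectors):
            out.append((new_id, "connect", peer, vector))
    for vector in range(nvectors):      # newcomer's own receiving side
        out.append((new_id, "interrupt-setup", new_id, vector))
    return out


def on_client_disconnect(gone_id, peers):
    """List the disconnect notifications sent to the remaining clients."""
    return [(peer, "disconnect", gone_id, None)
            for peer in peers if peer != gone_id]
```

Each "connect" and "interrupt-setup" entry stands for one message that
carries one eventfd file descriptor on the wire.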

The next section describes the protocol in detail.

If the server terminates without sending disconnect notifications for
its connected clients, the clients can elect to continue. They can
communicate with each other normally, but won't receive disconnect
notifications on disconnect, and no new clients can connect. There is
no way for the clients to connect to a restarted server. The device
cannot tell guest software whether the server is still up.

Example server code is in contrib/ivshmem-server/. It is not to be
used in production. It assumes all clients use the same number of
interrupt vectors.

A standalone client is in contrib/ivshmem-client/. It can be useful
for debugging.

=== The ivshmem Client-Server Protocol ===

An ivshmem device configured for interrupts connects to an ivshmem
server. This section details the protocol between the two.

The connection is one-way: the server sends messages to the client.
Each message consists of a single 8-byte little-endian signed number,
and may be accompanied by a file descriptor via SCM_RIGHTS. Both
client and server close the connection on error.

Note: QEMU currently doesn't close the connection immediately on
error, but only when the character device is destroyed.

On connect, the server sends the following messages in order:

1. The protocol version number, currently zero. The client should
   close the connection on receipt of versions it can't handle.

2. The client's ID. This is unique among all clients of this server.
   IDs must be between 0 and 65535, because the Doorbell register
   provides only 16 bits for them.

3. The number -1, accompanied by the file descriptor for the shared
   memory.

4. Connect notifications for existing other clients, if any. This is
   a peer ID (number between 0 and 65535 other than the client's ID),
   repeated N times. Each repetition is accompanied by one file
   descriptor. These are for interrupting the peer with that ID using
   vector 0,..,N-1, in order. If the client is configured for fewer
   vectors, it closes the extra file descriptors. If it is configured
   for more, the extra vectors remain unconnected.

5. Interrupt setup. This is the client's own ID, repeated N times.
   Each repetition is accompanied by one file descriptor. These are
   for receiving interrupts from peers using vector 0,..,N-1, in
   order. If the client is configured for fewer vectors, it closes
   the extra file descriptors. If it is configured for more, the
   extra vectors remain unconnected.

From then on, the server sends these kinds of messages:

6. Connection / disconnection notification. This is a peer ID.

  - If the number comes with a file descriptor, it's a connection
    notification, exactly like in step 4.

  - Otherwise, it's a disconnection notification for the peer with
    that ID.
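
From the client's point of view, the message sequence above can be
decoded with a small state machine. The following Python sketch uses
invented names and works on already-received (payload, fd) pairs, where
fd is None if no file descriptor accompanied the message; actually
receiving them over the UNIX domain socket (for instance with
socket.recv_fds() in Python 3.9 or later) is assumed and not shown. It
also assumes the input contains at least the three initial messages.

```python
import struct


def decode_number(payload):
    """Decode one message: an 8-byte little-endian signed number."""
    if len(payload) != 8:
        raise ValueError("expected exactly 8 bytes")
    return struct.unpack("<q", payload)[0]


def classify_messages(messages):
    """Turn (payload, fd) pairs into (kind, number, vector, fd) tuples."""
    stream = iter(messages)

    version = decode_number(next(stream)[0])      # step 1
    if version != 0:
        raise ValueError("unsupported protocol version %d" % version)

    own_id = decode_number(next(stream)[0])       # step 2
    if not 0 <= own_id <= 65535:
        raise ValueError("client ID out of range")

    payload, shm_fd = next(stream)                # step 3
    if decode_number(payload) != -1 or shm_fd is None:
        raise ValueError("expected -1 with the shared memory descriptor")

    events = [("version", version, None, None),
              ("id", own_id, None, None),
              ("shm", -1, None, shm_fd)]
    next_vector = {}                              # descriptors seen per ID
    for payload, fd in stream:                    # steps 4 to 6
        number = decode_number(payload)
        if not 0 <= number <= 65535:
            raise ValueError("peer ID out of range")
        if fd is None:                            # no descriptor: disconnect
            next_vector.pop(number, None)
            events.append(("disconnect", number, None, None))
            continue
        vector = next_vector.get(number, 0)       # vectors arrive in order
        next_vector[number] = vector + 1
        kind = "interrupt-setup" if number == own_id else "connect"
        events.append((kind, number, vector, fd))
    return events
```

The vector index is never transmitted; it is implied by how many
descriptors have already arrived for the same ID, which is why the
sketch keeps a per-ID counter. A real client would additionally close
descriptors for vectors beyond its configured count.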

Known bugs:

* The protocol changed incompatibly in QEMU 2.5. Before, messages
  were native endian long, and there was no version number.

* The protocol is poorly designed.

=== The ivshmem Client-Client Protocol ===

An ivshmem device configured for interrupts receives eventfd file
descriptors for interrupting peers and getting interrupted by peers
from the server, as explained in the previous section.

To interrupt a peer, the device writes the 8-byte integer 1 in native
byte order to the respective file descriptor.

To receive an interrupt, the device reads and discards as many 8-byte
integers as it can.
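
Both directions amount to plain reads and writes on the descriptor. The
Python sketch below uses invented function names and assumes the
receiving descriptor has been put into non-blocking mode, so that "as
many as it can" ends when a read would block.

```python
import os
import struct


def notify_peer(fd):
    """Interrupt a peer: write the 8-byte integer 1 in native byte order."""
    os.write(fd, struct.pack("=Q", 1))


def drain(fd):
    """Read and discard 8-byte integers until the read would block.

    fd must be non-blocking. Returns True if at least one integer was
    read, that is, if an interrupt was pending.
    """
    got_any = False
    while True:
        try:
            chunk = os.read(fd, 8)
        except BlockingIOError:
            break
        if len(chunk) < 8:  # end of file or short read: nothing left
            break
        got_any = True
    return got_any
```

On Linux with Python 3.10 or later, a descriptor for trying this out
can be created with os.eventfd(0, os.EFD_NONBLOCK). Because an eventfd
accumulates writes into one counter, several notifications that arrive
before the receiver runs are typically consumed by a single read.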