= Device Specification for Inter-VM shared memory device =

The Inter-VM shared memory device (ivshmem) is designed to share a
memory region between multiple QEMU processes running different guests
and the host. In order for all guests to be able to pick up the
shared memory area, it is modeled by QEMU as a PCI device exposing
said memory to the guest as a PCI BAR.

The device can use a shared memory object on the host directly, or it
can obtain one from an ivshmem server.

In the latter case, the device can additionally interrupt its peers, and
get interrupted by its peers.


== Configuring the ivshmem PCI device ==

There are two basic configurations:

- Just shared memory: -device ivshmem-plain,memdev=HMB,...

  This uses host memory backend HMB. It should have option "share"
  set.

- Shared memory plus interrupts: -device ivshmem,chardev=CHR,vectors=N,...

  An ivshmem server must already be running on the host. The device
  connects to the server's UNIX domain socket via character device
  CHR.

  Each peer gets assigned a unique ID by the server. IDs must be
  between 0 and 65535.

  Interrupts are message-signaled (MSI-X). vectors=N configures the
  number of vectors to use.

For more details on ivshmem device properties, see The QEMU Emulator
User Documentation (qemu-doc.*).
40
41 == The ivshmem PCI device's guest interface ==
42
43 The device has vendor ID 1af4, device ID 1110, revision 1. Before
44 QEMU 2.6.0, it had revision 0.
45
46 === PCI BARs ===
47
48 The ivshmem PCI device has two or three BARs:
49
50 - BAR0 holds device registers (256 Byte MMIO)
51 - BAR1 holds MSI-X table and PBA (only ivshmem-doorbell)
52 - BAR2 maps the shared memory object
53
54 There are two ways to use this device:
55
56 - If you only need the shared memory part, BAR2 suffices. This way,
57 you have access to the shared memory in the guest and can use it as
58 you see fit. Memnic, for example, uses ivshmem this way from guest
59 user space (see http://dpdk.org/browse/memnic).
60
61 - If you additionally need the capability for peers to interrupt each
62 other, you need BAR0 and BAR1. You will most likely want to write a
63 kernel driver to handle interrupts. Requires the device to be
64 configured for interrupts, obviously.
65
66 Before QEMU 2.6.0, BAR2 can initially be invalid if the device is
67 configured for interrupts. It becomes safely accessible only after
68 the ivshmem server provided the shared memory. These devices have PCI
69 revision 0 rather than 1. Guest software should wait for the
70 IVPosition register (described below) to become non-negative before
71 accessing BAR2.
72
73 Revision 0 of the device is not capable to tell guest software whether
74 it is configured for interrupts.
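
As an illustration of the BAR2-only usage from guest user space, the
following Python sketch maps a file read/write and returns the mapping.
It assumes a Linux guest, where sysfs exposes each PCI BAR as a
resourceN file; the function name and the PCI address in the usage note
are hypothetical.

```python
import mmap
import os


def map_bar(resource_path, length=0):
    """Map a PCI resource file (or any regular file) shared, read/write."""
    fd = os.open(resource_path, os.O_RDWR)
    try:
        if length == 0:
            # Default to the whole file; sysfs reports the BAR size here.
            length = os.fstat(fd).st_size
        return mmap.mmap(fd, length, mmap.MAP_SHARED,
                         mmap.PROT_READ | mmap.PROT_WRITE)
    finally:
        # The mapping keeps its own reference, so the descriptor can go.
        os.close(fd)
```

A guest might call map_bar("/sys/bus/pci/devices/0000:00:04.0/resource2")
and then read and write the returned object like a bytearray. What the
bytes mean is up to the application.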

=== PCI device registers ===

BAR 0 contains the following registers:

  Offset  Size  Access      On reset  Function
      0     4   read/write        0   Interrupt Mask
                                      bit 0: peer interrupt (rev 0)
                                             reserved (rev 1)
                                      bit 1..31: reserved
      4     4   read/write        0   Interrupt Status
                                      bit 0: peer interrupt (rev 0)
                                             reserved (rev 1)
                                      bit 1..31: reserved
      8     4   read-only   0 or ID   IVPosition
     12     4   write-only      N/A   Doorbell
                                      bit 0..15: vector
                                      bit 16..31: peer ID
     16   240   none            N/A   reserved

Software should only access the registers as specified in column
"Access". Reserved bits should be ignored on read, and preserved on
write.

In revision 0 of the device, the Interrupt Status and Interrupt Mask
registers together control the legacy INTx interrupt when the device
has no MSI-X capability: INTx is asserted when the bit-wise AND of
Status and Mask is non-zero. Interrupt Status register bit 0 becomes 1
when an interrupt request from a peer is received. Reading the
register clears it.
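
The revision 0 INTx rules can be restated as a small software model.
This is a sketch of the behavior described in the paragraph above, not
QEMU's implementation; the class and method names are hypothetical.

```python
class Rev0InterruptModel:
    """Toy model of the Interrupt Mask and Interrupt Status registers."""

    def __init__(self, has_msix=False):
        self.mask = 0        # Interrupt Mask register, 0 on reset
        self.status = 0      # Interrupt Status register, 0 on reset
        self.has_msix = has_msix

    def peer_interrupt(self):
        """A peer requested an interrupt: Status bit 0 becomes 1."""
        self.status |= 1

    def intx_asserted(self):
        """INTx is asserted iff there is no MSI-X and Status & Mask != 0."""
        return (not self.has_msix) and (self.status & self.mask) != 0

    def read_status(self):
        """Reading the Interrupt Status register clears it."""
        value = self.status
        self.status = 0
        return value
```

With mask set to 1, a call to peer_interrupt() makes intx_asserted()
return True until read_status() is called.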

IVPosition Register: if the device is not configured for interrupts,
this is zero. Else, it is the device's ID (between 0 and 65535).

Before QEMU 2.6.0, the register may read -1 for a short while after
reset. These devices have PCI revision 0 rather than 1.

There is no good way for software to find out whether the device is
configured for interrupts. A positive IVPosition means interrupts,
but zero could be either.
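
A sketch of how guest software might interpret the IVPosition register,
assuming it has already read the four register bytes from offset 8 of
BAR0 (PCI registers are little-endian). The function names are
hypothetical.

```python
import struct


def decode_ivposition(raw):
    """Interpret 4 little-endian register bytes as a signed 32-bit value."""
    (value,) = struct.unpack("<i", raw)
    return value


def shared_memory_ready(ivposition):
    """Revision 0 only: BAR2 is safe to access once this is non-negative."""
    return ivposition >= 0


def interrupts_configured(ivposition):
    """True if interrupts are certainly configured, None if unknown."""
    return True if ivposition > 0 else None
```

The None result reflects the ambiguity stated above: zero can mean
either "not configured for interrupts" or "configured, with ID 0".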

Doorbell Register: writing this register requests to interrupt a peer.
The written value's high 16 bits are the ID of the peer to interrupt,
and its low 16 bits select an interrupt vector.
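
The Doorbell value layout can be expressed as a one-line computation.
The helper below is a sketch with a hypothetical name; it only builds
the 32-bit value and leaves the actual MMIO write to the caller.

```python
def doorbell_value(peer_id, vector):
    """Build the value to write to the Doorbell register."""
    if not 0 <= peer_id <= 0xFFFF:
        raise ValueError("peer ID must fit in 16 bits")
    if not 0 <= vector <= 0xFFFF:
        raise ValueError("vector must fit in 16 bits")
    # bits 16..31: peer ID, bits 0..15: vector
    return (peer_id << 16) | vector
```

For example, interrupting peer 3 on vector 1 means writing 0x00030001.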

If the device is not configured for interrupts, the write is ignored.

If the interrupt hasn't completed setup, the write is ignored. The
device cannot tell guest software whether setup is complete.
Interrupts can regress to this state on migration.

If the peer with the requested ID isn't connected, or it has fewer
interrupt vectors connected, the write is ignored. The device cannot
tell guest software what peers are connected, or how many interrupt
vectors are connected.

The peer's interrupt for this vector then becomes pending. There is
no way for software to clear the pending bit, and a polling mode of
operation is therefore impossible.

If the peer is a revision 0 device without MSI-X capability, its
Interrupt Status register is set to 1. This asserts INTx unless
masked by the Interrupt Mask register. The device cannot communicate
the interrupt vector to guest software then.

With multiple MSI-X vectors, different vectors can be used to indicate
that different events have occurred. The semantics of interrupt
vectors are left to the application.


== Interrupt infrastructure ==

When configured for interrupts, the peers share eventfd objects in
addition to shared memory. The shared resources are managed by an
ivshmem server.

=== The ivshmem server ===

The server listens on a UNIX domain socket.

For each new client that connects to the server, the server
- picks an ID,
- creates eventfd file descriptors for the interrupt vectors,
- sends the ID and the file descriptor for the shared memory to the
  new client,
- sends connect notifications for the new client to the other clients
  (these contain file descriptors for sending interrupts),
- sends connect notifications for the other clients to the new client,
  and
- sends interrupt setup messages to the new client (these contain file
  descriptors for receiving interrupts).

When a client disconnects from the server, the server sends disconnect
notifications to the other clients.

The next section describes the protocol in detail.

If the server terminates without sending disconnect notifications for
its connected clients, the clients can elect to continue. They can
communicate with each other normally, but won't receive disconnect
notifications on disconnect, and no new clients can connect. There is
no way for the clients to connect to a restarted server. The device
cannot tell guest software whether the server is still up.

Example server code is in contrib/ivshmem-server/. Not to be used in
production. It assumes all clients use the same number of interrupt
vectors.

A standalone client is in contrib/ivshmem-client/. It can be useful
for debugging.

=== The ivshmem Client-Server Protocol ===

An ivshmem device configured for interrupts connects to an ivshmem
server. This section details the protocol between the two.

The connection is one-way: the server sends messages to the client.
Each message consists of a single 8 byte little-endian signed number,
and may be accompanied by a file descriptor via SCM_RIGHTS. Both
client and server close the connection on error.

Note: QEMU currently doesn't close the connection right on error, but
only when the character device is destroyed.
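
The message format can be sketched with Python's socket module
(socket.send_fds() and socket.recv_fds() need Python 3.9 or later and a
UNIX domain socket). The function names are hypothetical, and the
sketch assumes each 8-byte message arrives in one piece, which a robust
client should not rely on.

```python
import socket
import struct

MESSAGE = struct.Struct("<q")   # 8 byte little-endian signed number


def send_message(sock, value, fd=None):
    """Server side: send one number, optionally with a file descriptor."""
    payload = MESSAGE.pack(value)
    if fd is None:
        sock.sendall(payload)
    else:
        # The descriptor travels as SCM_RIGHTS ancillary data.
        socket.send_fds(sock, [payload], [fd])


def recv_message(sock):
    """Client side: return (value, fd), with fd None if none was sent."""
    data, fds, _flags, _addr = socket.recv_fds(sock, MESSAGE.size, 1)
    if len(data) != MESSAGE.size:
        raise ConnectionError("connection closed or short message")
    (value,) = MESSAGE.unpack(data)
    return value, (fds[0] if fds else None)
```

The receiver owns any descriptor returned by recv_message() and is
responsible for closing it.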

On connect, the server sends the following messages in order:

1. The protocol version number, currently zero. The client should
   close the connection on receipt of versions it can't handle.

2. The client's ID. This is unique among all clients of this server.
   IDs must be between 0 and 65535, because the Doorbell register
   provides only 16 bits for them.

3. The number -1, accompanied by the file descriptor for the shared
   memory.

4. Connect notifications for existing other clients, if any. This is
   a peer ID (number between 0 and 65535 other than the client's ID),
   repeated N times. Each repetition is accompanied by one file
   descriptor. These are for interrupting the peer with that ID using
   vector 0,..,N-1, in order. If the client is configured for fewer
   vectors, it closes the extra file descriptors. If it is configured
   for more, the extra vectors remain unconnected.

5. Interrupt setup. This is the client's own ID, repeated N times.
   Each repetition is accompanied by one file descriptor. These are
   for receiving interrupts from peers using vector 0,..,N-1, in
   order. If the client is configured for fewer vectors, it closes
   the extra file descriptors. If it is configured for more, the
   extra vectors remain unconnected.

From then on, the server sends these kinds of messages:

6. Connection / disconnection notification. This is a peer ID.

  - If the number comes with a file descriptor, it's a connection
    notification, exactly like in step 4.

  - Else, it's a disconnection notification for the peer with that ID.
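
The message sequence above maps onto a small client-side state machine.
The sketch below consumes already decoded (number, file descriptor)
pairs; the class and attribute names are hypothetical. Closing a
peer's descriptors on disconnect is an assumption of this sketch, not
something the protocol description states.

```python
import os
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class IvshmemClient:
    nvectors: int                        # vectors this client is configured for
    close_fd: Callable = os.close        # injectable, to ease testing
    step: int = 0                        # next expected handshake step (0..2)
    own_id: int = -1
    shm_fd: int = -1
    peers: dict = field(default_factory=dict)        # peer ID -> fds by vector
    own_vectors: list = field(default_factory=list)  # receive fds by vector

    def feed(self, value, fd=None):
        """Process one message; raise ValueError on a protocol error."""
        if self.step == 0:                           # step 1: version
            if value != 0 or fd is not None:
                raise ValueError("unsupported protocol version")
            self.step = 1
        elif self.step == 1:                         # step 2: own ID
            if fd is not None or not 0 <= value <= 0xFFFF:
                raise ValueError("bad client ID")
            self.own_id, self.step = value, 2
        elif self.step == 2:                         # step 3: shared memory
            if value != -1 or fd is None:
                raise ValueError("expected -1 with shared memory fd")
            self.shm_fd, self.step = fd, 3
        elif not 0 <= value <= 0xFFFF:
            raise ValueError("bad peer ID")
        elif fd is None:                             # step 6: disconnection
            for peer_fd in self.peers.pop(value, []):
                self.close_fd(peer_fd)
        else:                                        # steps 4, 5, 6 with fd
            if value == self.own_id:
                vectors = self.own_vectors
            else:
                vectors = self.peers.setdefault(value, [])
            if len(vectors) < self.nvectors:
                vectors.append(fd)                   # index is the vector
            else:
                self.close_fd(fd)                    # extra vector: close it
```

A caller would feed each received message to feed() and close the
connection when it raises.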

Known bugs:

* The protocol changed incompatibly in QEMU 2.5. Before, messages
  were native endian long, and there was no version number.

* The protocol is poorly designed.

=== The ivshmem Client-Client Protocol ===

An ivshmem device configured for interrupts receives eventfd file
descriptors for interrupting peers and getting interrupted by peers
from the server, as explained in the previous section.

To interrupt a peer, the device writes the 8-byte integer 1 in native
byte order to the respective file descriptor.

To receive an interrupt, the device reads and discards as many 8-byte
integers as it can.
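
Both directions can be sketched in a few lines of Python. The function
names are hypothetical, and the receiving descriptor is assumed to be
non-blocking, so that "as many as it can" ends with BlockingIOError
rather than a blocked read.

```python
import os
import struct

COUNTER = struct.Struct("=Q")   # 8-byte integer, native byte order


def interrupt_peer(fd):
    """Send one interrupt: write the 8-byte integer 1."""
    os.write(fd, COUNTER.pack(1))


def drain_interrupts(fd):
    """Read and discard 8-byte integers until none are left.

    Returns the sum of the values read, 0 if nothing was pending.
    """
    total = 0
    while True:
        try:
            data = os.read(fd, COUNTER.size)
        except BlockingIOError:
            return total
        if len(data) != COUNTER.size:
            return total
        total += COUNTER.unpack(data)[0]
```

On Linux with Python 3.10 or later, os.eventfd(0, os.EFD_NONBLOCK)
creates a suitable descriptor for experiments. An eventfd accumulates
writes into one counter, so a single read may account for several
interrupts.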