Paravirtualized RDMA Device (PVRDMA)
====================================


1. Description
==============
PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
It works with its Linux kernel driver AS IS; no special guest modifications
are needed.

While it complies with the VMware device, it can also communicate with bare
metal RDMA-enabled machines as peers.

It does not require an RDMA HCA in the host; it can work with Soft-RoCE (rxe).

It does not require the whole guest RAM to be pinned, allowing memory
over-commit; it will also make migration support possible with some HW
assistance, although that is not implemented yet.

A project presentation accompanies this document:
- http://events.linuxfoundation.org/sites/events/files/slides/lpc-2017-pvrdma-marcel-apfelbaum-yuval-shaia.pdf



2. Setup
========


2.1 Guest setup
===============
Fedora 27+ kernels work out of the box; older distributions
require updating the kernel to 4.14 to get the pvrdma driver.

However, the libpvrdma library needed by user-level software is still
not available as part of the distributions, so the rdma-core library
needs to be compiled and optionally installed.

Please follow the instructions at:
  https://github.com/linux-rdma/rdma-core.git
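
A minimal build sketch, assuming the standard rdma-core CMake-based flow
(the build.sh helper and the build/ output directory follow that project's
README; check it for the exact dependencies on your distribution):

  git clone https://github.com/linux-rdma/rdma-core.git
  cd rdma-core
  bash build.sh    # builds the userspace libraries, including the pvrdma provider, under build/

The resulting libraries can be used from the build tree or installed
system-wide.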


2.2 Host Setup
==============
The pvrdma backend is an ibdevice interface that can be exposed
either by a Soft-RoCE (rxe) device on machines with no RDMA device,
or by an HCA SR-IOV function (VF/PF).
Note that ibdevice interfaces can't be shared between pvrdma devices;
each one requires a separate instance (rxe or SR-IOV VF).


2.2.1 Soft-RoCE backend (rxe)
=============================
A stable version of rxe is required; Fedora 27+ or a Linux
kernel 4.14+ is preferred.

The rdma_rxe module is part of the Linux kernel but not loaded by default.
Install the user-level library (librxe) following the instructions from:
https://github.com/SoftRoCE/rxe-dev/wiki/rxe-dev:-Home

Associate an Ethernet interface with rxe by running:
   rxe_cfg add eth0
An rxe0 ibdevice interface will be created and can be used as the pvrdma backend.
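
To verify that the new ibdevice is up, the libibverbs utilities can be used
(this is only a suggested sanity check, not a required setup step):

  rxe_cfg status
  ibv_devinfo -d rxe0 | grep state     # should report PORT_ACTIVE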


2.2.2 RDMA device Virtual Function backend
==========================================
Nothing special is required; the pvrdma device can work not only with
Ethernet links, but also with InfiniBand links.
All that is needed is an ibdevice with an active port; for Mellanox cards
it will be something like mlx5_6, which can be used as the backend.
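
For example, the available ibdevices and their port state can be listed with
the libibverbs utilities (a suggested check only; mlx5_6 is the example
device name used above):

  ibv_devices
  ibv_devinfo -d mlx5_6 | grep -E 'state|link_layer'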


2.2.3 QEMU setup
================
Configure QEMU with the --enable-rdma flag after installing
the required RDMA libraries.
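
A minimal sketch (the development packages for libibverbs and librdmacm have
distribution-specific names and are assumed to be installed already):

  ./configure --enable-rdma
  make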



3. Usage
========


3.1 VM Memory settings
======================
Currently the device works only with memory-backed RAM,
and it must be marked as "shared":
   -m 1G \
   -object memory-backend-ram,id=mb1,size=1G,share \
   -numa node,memdev=mb1 \


3.2 MAD Multiplexer
===================
MAD Multiplexer is a service that exposes a MAD-like interface to VMs in
order to overcome the limitation where only a single entity can register with
the MAD layer to send and receive RDMA-CM MAD packets.

To build rdmacm-mux run:
# make rdmacm-mux

The application accepts 3 command line arguments and exposes a UNIX socket
to pass control and data to it.
-d rdma-device-name  Name of the RDMA device to register with
-s unix-socket-path  Path of the UNIX socket to listen on (default /var/run/rdmacm-mux)
-p rdma-device-port  Port number of the RDMA device to register with (default 1)
The final UNIX socket file name is a concatenation of the 3 arguments, so
for example for device mlx5_0 on port 2, /var/run/rdmacm-mux-mlx5_0-2
will be created.
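
For example, to serve an rxe0 backend on its default port (an invocation
sketch based on the arguments above):

# rdmacm-mux -d rxe0 -p 1

This listens on /var/run/rdmacm-mux-rxe0-1, which matches the chardev path
used in the example in section 3.6.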

pvrdma requires this service.

Please refer to contrib/rdmacm-mux for more details.


3.3 Service exposed by libvirt daemon
=====================================
Control over the RDMA device's GID table is exercised by updating the
device's Ethernet function addresses.
Usually the first GID entry is determined by the MAC address, the second by
the first IPv6 address and the third by the IPv4 address. Other entries can
be added by adding more IP addresses. The reverse also applies, i.e.
whenever an address is removed, the corresponding GID entry is removed.
The process is handled by the network and RDMA stacks. Whenever an address is
added, the ib_core driver is notified and calls the device driver's add_gid
function, which in turn updates the device.
To support this, the pvrdma device hooks into the create_bind and
destroy_bind HW commands triggered by the pvrdma driver in the guest.

Whenever a change is made to the pvrdma port's GID table, a special QMP
message is sent to be processed by libvirt, which updates the address of the
backend Ethernet device.
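
As an illustration of the mechanism described above, adding an IPv4 address
to the backend Ethernet function should result in a new GID entry, which can
be inspected through the ib_core sysfs entries (paths shown for port 1 of an
rxe0 backend, as an example only):

  # ip addr add 192.168.10.1/24 dev eth0
  # cat /sys/class/infiniband/rxe0/ports/1/gids/*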

pvrdma requires the libvirt service to be up.


3.4 PCI devices settings
========================
A RoCE device exposes two functions - an Ethernet function and an RDMA
function.
To support this, the pvrdma device is composed of two PCI functions: an
Ethernet device of type vmxnet3 on PCI slot 0 and a PVRDMA device on PCI
slot 1. The Ethernet function can be used for other Ethernet purposes such
as IP.


3.5 Device parameters
=====================
- netdev: Specifies the Ethernet device function name on the host, for
  example enp175s0f0. For a Soft-RoCE device (rxe) this is the Ethernet
  device used to create it.
- ibdev: The IB device name on the host, for example rxe0, mlx5_0 etc.
- mad-chardev: The name of the MAD multiplexer char device.
- ibport: In case of a multi-port device (such as Mellanox's HCA) this
  specifies the port to use. If not set, 1 will be used.
- dev-caps-max-mr-size: The maximum size of MR.
- dev-caps-max-qp: Maximum number of QPs.
- dev-caps-max-sge: Maximum number of SGE elements in a WR.
- dev-caps-max-cq: Maximum number of CQs.
- dev-caps-max-mr: Maximum number of MRs.
- dev-caps-max-pd: Maximum number of PDs.
- dev-caps-max-ah: Maximum number of AHs.

Notes:
- The first 3 parameters are mandatory settings, the rest have their
  defaults.
- The dev-caps-prefixed parameters define upper limits; the final values are
  adjusted by the backend device's limitations.
- netdev can be extracted from ibdev's sysfs, as shown in the example below
  (/sys/class/infiniband/<ibdev>/device/net/)
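
For example, for the mlx5_0 ibdevice mentioned above (the output is only
illustrative; the actual interface name depends on the host):

  $ ls /sys/class/infiniband/mlx5_0/device/net/
  enp175s0f0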


3.6 Example
===========
Define a bridge device with a vmxnet3 network backend:
<interface type='bridge'>
  <mac address='56:b4:44:e9:62:dc'/>
  <source bridge='bridge1'/>
  <model type='vmxnet3'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x10' function='0x0' multifunction='on'/>
</interface>

Define the pvrdma device:
<qemu:commandline>
  <qemu:arg value='-object'/>
  <qemu:arg value='memory-backend-ram,id=mb1,size=1G,share'/>
  <qemu:arg value='-numa'/>
  <qemu:arg value='node,memdev=mb1'/>
  <qemu:arg value='-chardev'/>
  <qemu:arg value='socket,path=/var/run/rdmacm-mux-rxe0-1,id=mads'/>
  <qemu:arg value='-device'/>
  <qemu:arg value='pvrdma,addr=10.1,ibdev=rxe0,netdev=bridge0,mad-chardev=mads'/>
</qemu:commandline>



4. Implementation details
=========================


4.1 Overview
============
The device acts like a proxy between the guest driver and the host
ibdevice interface.
On the configuration path:
 - For every hardware resource request (PD/QP/CQ/...) the pvrdma device
   requests a resource from the backend interface, maintaining a 1-1 mapping
   between the guest and the host.
On the data path:
 - Every post_send/receive received from the guest is converted into
   a post_send/receive for the backend. The buffer data is not touched
   or copied, resulting in near bare-metal performance for large enough
   buffers.
 - Completions from the backend interface result in completions for
   the pvrdma device.


4.2 PCI BARs
============
PCI BARs:
 BAR 0 - MSI-X
        MSI-X vectors:
        (0) Command    - used when execution of a command is completed.
        (1) Async      - not in use.
        (2) Completion - used when a completion event is placed in the
                         device's CQ ring.
 BAR 1 - Registers
        -----------------------------------------------------
        | VERSION | DSR | CTL | REQ | ERR | ICR | IMR | MAC |
        -----------------------------------------------------
        DSR - Address of driver/device shared memory used
              for the command channel, used for passing:
                  - General info such as driver version
                  - Address of 'command' and 'response'
                  - Address of async ring
                  - Address of device's CQ ring
                  - Device capabilities
        CTL - Device control operations (activate, reset etc)
        IMR - Set interrupt mask
        REQ - Command execution register
        ERR - Operation status

 BAR 2 - UAR
        -----------------------------------------------------
        | QP_NUM | SEND/RECV Flag || CQ_NUM | ARM/POLL Flag |
        -----------------------------------------------------
        - Offset 0 used for QP operations (send and recv)
        - Offset 4 used for CQ operations (arm and poll)


4.3 Major flows
===============

4.3.1 Create CQ
===============
- Guest driver
  - Allocates pages for the CQ ring
  - Creates a page directory (pdir) to hold the CQ ring's pages
  - Initializes the CQ ring
  - Initializes the 'Create CQ' command object (cqe, pdir etc)
  - Copies the command to the 'command' address
  - Writes 0 into the REQ register
- Device
  - Reads the request object from the 'command' address
  - Allocates a CQ object and initializes the CQ ring based on the pdir
  - Creates the backend CQ
  - Writes the operation status to the ERR register
  - Posts a command-interrupt to the guest
- Guest driver
  - Reads the HW response code from the ERR register

4.3.2 Create QP
===============
- Guest driver
  - Allocates pages for the send and receive rings
  - Creates a page directory (pdir) to hold the rings' pages
  - Initializes the 'Create QP' command object (max_send_wr,
    send_cq_handle, recv_cq_handle, pdir etc)
  - Copies the object to the 'command' address
  - Writes 0 into the REQ register
- Device
  - Reads the request object from the 'command' address
  - Allocates the QP object and initializes
    - The send and recv rings based on the pdir
    - The send and recv ring state
  - Creates the backend QP
  - Writes the operation status to the ERR register
  - Posts a command-interrupt to the guest
- Guest driver
  - Reads the HW response code from the ERR register

4.3.3 Post receive
==================
- Guest driver
  - Initializes a wqe and places it on the recv ring
  - Writes qpn|qp_recv_bit (31) to the QP offset in the UAR
- Device
  - Extracts the qpn from the UAR
  - Walks through the ring and does the following for each wqe
    - Prepares the backend CQE context to be used when
      receiving a completion from the backend (wr_id, op_code, emu_cq_num)
    - For each sge, prepares a backend sge
    - Calls the backend's post_recv

4.3.4 Process backend events
============================
- Done by a dedicated thread used to process backend events;
  at initialization it is attached to the device and creates
  the communication channel.
- Thread main loop:
  - Polls for completions
  - Extracts emu_cq_num, wr_id and op_code from the context
  - Writes the CQE to the CQ ring
  - Writes the CQ number to the device CQ ring
  - Sends a completion-interrupt to the guest
  - Deallocates the context
  - Acks the event to the backend



5. Limitations
==============
- The device is obviously limited by the features of the VMware device API
  that the guest Linux driver implements.
- The memory registration mechanism requires mremap for every page in the
  buffer in order to map it to a contiguous virtual address range. Since this
  is not on the data path it should not matter much. If the default max MR
  size is increased, be aware that memory registration can take up to 0.5
  seconds for 1GB of memory.
- The device requires the target page size to be the same as the host page
  size, otherwise it will fail to init.
- QEMU cannot map guest RAM from a file descriptor if a pvrdma device is
  attached, so it can't work with huge pages. This limitation will be
  addressed in the future; however, QEMU allocates guest RAM with
  MADV_HUGEPAGE, so if there are enough huge pages available, QEMU will use
  them. QEMU will fail to init if the requirements are not met.



6. Performance
==============
By design the pvrdma device causes a VM exit on each post-send/receive, so
for small buffers the performance is affected; however, for medium buffers it
gets close to bare metal, and from 1MB buffers and up it reaches bare-metal
performance.
(Tested with 2 VMs, with the pvrdma devices connected to 2 VFs of the same
device.)

All of the above assumes no memory registration is done on the data path.