Commit | Line | Data |
---|---|---|
1738cd3e NB |
1 | Linux kernel driver for Elastic Network Adapter (ENA) family: |
2 | ============================================================= | |
3 | ||
4 | Overview: | |
5 | ========= | |
6 | ENA is a networking interface designed to make good use of modern CPU | |
7 | features and system architectures. | |
8 | ||
9 | The ENA device exposes a lightweight management interface with a | |
10 | minimal set of memory-mapped registers and an extendable command set | |
11 | through an Admin Queue. | |
12 | ||
13 | The driver supports a range of ENA devices, is link-speed independent | |
14 | (i.e., the same driver is used for 10GbE, 25GbE, 40GbE, etc.), and has | |
15 | a negotiated and extendable feature set. | |
16 | ||
17 | Some ENA devices support SR-IOV. This driver is used for both the | |
18 | SR-IOV Physical Function (PF) and Virtual Function (VF) devices. | |
19 | ||
20 | ENA devices enable high speed and low overhead network traffic | |
21 | processing by providing multiple Tx/Rx queue pairs (the maximum number | |
22 | is advertised by the device via the Admin Queue), a dedicated MSI-X | |
23 | interrupt vector per Tx/Rx queue pair, adaptive interrupt moderation, | |
24 | and CPU cacheline optimized data placement. | |
25 | ||
26 | The ENA driver supports industry standard TCP/IP offload features such | |
27 | as checksum offload and TCP transmit segmentation offload (TSO). | |
28 | Receive-side scaling (RSS) is supported for multi-core scaling. | |
29 | ||
30 | The ENA driver and its corresponding devices implement health | |
31 | monitoring mechanisms such as watchdog, enabling the device and driver | |
32 | to recover in a manner transparent to the application, as well as | |
33 | debug logs. | |
34 | ||
35 | Some of the ENA devices support a working mode called Low-latency | |
36 | Queue (LLQ), which saves several more microseconds. | |
37 | ||
38 | Supported PCI vendor ID/device IDs: | |
39 | =================================== | |
40 | 1d0f:0ec2 - ENA PF | |
41 | 1d0f:1ec2 - ENA PF with LLQ support | |
42 | 1d0f:ec20 - ENA VF | |
43 | 1d0f:ec21 - ENA VF with LLQ support | |
44 | ||
45 | ENA Source Code Directory Structure: | |
46 | ==================================== | |
47 | ena_com.[ch] - Management communication layer. This layer is | |
48 | responsible for handling all the management | |
49 | (admin) communication between the device and the | |
50 | driver. | |
51 | ena_eth_com.[ch] - Tx/Rx data path. | |
52 | ena_admin_defs.h - Definition of ENA management interface. | |
53 | ena_eth_io_defs.h - Definition of ENA data path interface. | |
54 | ena_common_defs.h - Common definitions for ena_com layer. | |
55 | ena_regs_defs.h - Definition of ENA PCI memory-mapped (MMIO) registers. | |
56 | ena_netdev.[ch] - Main Linux kernel driver. | |
57 | ena_sysfs.[ch] - Sysfs files. | |
58 | ena_ethtool.c - ethtool callbacks. | |
59 | ena_pci_id_tbl.h - Supported device IDs. | |
60 | ||
61 | Management Interface: | |
62 | ===================== | |
63 | ENA management interface is exposed by means of: | |
64 | - PCIe Configuration Space | |
65 | - Device Registers | |
66 | - Admin Queue (AQ) and Admin Completion Queue (ACQ) | |
67 | - Asynchronous Event Notification Queue (AENQ) | |
68 | ||
69 | ENA device MMIO Registers are accessed only during driver | |
70 | initialization and are not involved in further normal device | |
71 | operation. | |
72 | ||
73 | AQ is used for submitting management commands, and the | |
74 | results/responses are reported asynchronously through ACQ. | |
75 | ||
76 | ENA introduces a very small set of management commands with room for | |
77 | vendor-specific extensions. Most of the management operations are | |
78 | framed in a generic Get/Set feature command. | |
79 | ||
80 | The following admin queue commands are supported: | |
81 | - Create I/O submission queue | |
82 | - Create I/O completion queue | |
83 | - Destroy I/O submission queue | |
84 | - Destroy I/O completion queue | |
85 | - Get feature | |
86 | - Set feature | |
87 | - Configure AENQ | |
88 | - Get statistics | |
89 | ||
90 | Refer to ena_admin_defs.h for the list of supported Get/Set Feature | |
91 | properties. | |
92 | ||
93 | The Asynchronous Event Notification Queue (AENQ) is a uni-directional | |
94 | queue used by the ENA device to send to the driver events that cannot | |
95 | be reported using ACQ. AENQ events are subdivided into groups. Each | |
96 | group may have multiple syndromes, as shown below. | |
97 | ||
98 | The events are: | |
99 |   Group                Syndrome | |
100 |   Link state change    - X - | |
101 |   Fatal error          - X - | |
102 |   Notification         Suspend traffic | |
103 |   Notification         Resume traffic | |
104 |   Keep-Alive           - X - | |
105 | ||
106 | ACQ and AENQ share the same MSI-X vector. | |
107 | ||
108 | Keep-Alive is a special mechanism that allows monitoring of the | |
109 | device's health. The driver maintains a watchdog (WD) handler which, | |
110 | if fired, logs the current state and statistics then resets and | |
111 | restarts the ENA device and driver. A Keep-Alive event is delivered by | |
112 | the device every second. The driver re-arms the WD upon reception of a | |
113 | Keep-Alive event. A missed Keep-Alive event causes the WD handler to | |
114 | fire. | |
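
The Keep-Alive and watchdog interaction described above can be sketched in userspace C as follows. This is a minimal illustration and not the driver's code: the names wd_state, wd_rearm() and wd_expired() are hypothetical, and the 3000 ms timeout is an assumed value (the text only states that the device delivers a Keep-Alive event every second).

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed timeout; the document does not specify the real value. */
#define KEEP_ALIVE_TIMEOUT_MS 3000u

/* Hypothetical watchdog state: time of the most recent Keep-Alive event. */
struct wd_state {
	uint64_t last_keep_alive_ms;
};

/* Called when a Keep-Alive AENQ event arrives: re-arm the watchdog. */
static void wd_rearm(struct wd_state *wd, uint64_t now_ms)
{
	wd->last_keep_alive_ms = now_ms;
}

/*
 * Called periodically. Returns true when no Keep-Alive has been seen for
 * longer than the timeout, i.e. when the driver would log its state and
 * reset the device.
 */
static bool wd_expired(const struct wd_state *wd, uint64_t now_ms)
{
	return now_ms - wd->last_keep_alive_ms > KEEP_ALIVE_TIMEOUT_MS;
}
```

With these assumptions, a Keep-Alive seen at t = 1000 ms keeps the watchdog quiet until t = 4000 ms; the first check after that reports expiry unless another Keep-Alive has re-armed it.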
115 | ||
116 | Data Path Interface: | |
117 | ==================== | |
118 | I/O operations are based on Tx and Rx Submission Queues (Tx SQ and Rx | |
119 | SQ correspondingly). Each SQ has a completion queue (CQ) associated | |
120 | with it. | |
121 | ||
122 | The SQs and CQs are implemented as descriptor rings in contiguous | |
123 | physical memory. | |
124 | ||
125 | The ENA driver supports two Queue Operation modes for Tx SQs: | |
126 | - Regular mode | |
127 | * In this mode the Tx SQs reside in the host's memory. The ENA | |
128 | device fetches the ENA Tx descriptors and packet data from host | |
129 | memory. | |
130 | - Low Latency Queue (LLQ) mode or "push-mode". | |
131 | * In this mode the driver pushes the transmit descriptors and the | |
132 | first 128 bytes of the packet directly to the ENA device memory | |
133 | space. The rest of the packet payload is fetched by the | |
134 | device. For this operation mode, the driver uses a dedicated PCI | |
135 | device memory BAR, which is mapped with write-combine capability. | |
136 | ||
137 | The Rx SQs support only the regular mode. | |
138 | ||
139 | Note: Not all ENA devices support LLQ, and this feature is negotiated | |
140 | with the device upon initialization. If the ENA device does not | |
141 | support LLQ mode, the driver falls back to the regular mode. | |
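
The difference between the two Tx modes can be illustrated by how many bytes of a packet travel with the descriptors. The sketch below is an assumption-laden model, not driver code: tx_split and llq_split() are hypothetical names, and only the 128-byte figure comes from the text above.

```c
#include <stddef.h>

/* "First 128 bytes of the packet", as stated in the LLQ description. */
#define LLQ_PUSH_HEADER_MAX 128u

/* Hypothetical result type: how a packet is divided between the two paths. */
struct tx_split {
	size_t pushed_len;  /* bytes written to device memory with the descriptors */
	size_t fetched_len; /* bytes the device fetches from host memory */
};

/* Regular mode pushes nothing; LLQ mode pushes up to the first 128 bytes. */
static struct tx_split llq_split(size_t pkt_len, int llq_enabled)
{
	struct tx_split s;

	if (!llq_enabled) {
		s.pushed_len = 0;
		s.fetched_len = pkt_len;
		return s;
	}
	s.pushed_len = pkt_len < LLQ_PUSH_HEADER_MAX ? pkt_len : LLQ_PUSH_HEADER_MAX;
	s.fetched_len = pkt_len - s.pushed_len;
	return s;
}
```

Under this model a packet that fits entirely in the pushed region needs no host memory fetch for its payload.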
142 | ||
143 | The driver supports multi-queue for both Tx and Rx. This has various | |
144 | benefits: | |
145 | - Reduced CPU/thread/process contention on a given Ethernet interface. | |
146 | - Cache miss rate on completion is reduced, particularly for data | |
147 | cache lines that hold the sk_buff structures. | |
148 | - Increased process-level parallelism when handling received packets. | |
149 | - Increased data cache hit rate, by steering kernel processing of | |
150 | packets to the CPU, where the application thread consuming the | |
151 | packet is running. | |
152 | - In hardware interrupt re-direction. | |
153 | ||
154 | Interrupt Modes: | |
155 | ================ | |
156 | The driver assigns a single MSI-X vector per queue pair (for both Tx | |
157 | and Rx directions). The driver assigns an additional dedicated MSI-X vector | |
158 | for management (for ACQ and AENQ). | |
159 | ||
160 | Management interrupt registration is performed when the Linux kernel | |
161 | probes the adapter, and it is de-registered when the adapter is | |
162 | removed. I/O queue interrupt registration is performed when the Linux | |
163 | interface of the adapter is opened, and it is de-registered when the | |
164 | interface is closed. | |
165 | ||
166 | The management interrupt is named: | |
167 | ena-mgmnt@pci:<PCI domain:bus:slot.function> | |
168 | and for each queue pair, an interrupt is named: | |
169 | <interface name>-Tx-Rx-<queue index> | |
170 | ||
171 | The ENA device operates in auto-mask and auto-clear interrupt | |
172 | modes. That is, once MSI-X is delivered to the host, its Cause bit is | |
173 | automatically cleared and the interrupt is masked. The interrupt is | |
174 | unmasked by the driver after NAPI processing is complete. | |
175 | ||
176 | Interrupt Moderation: | |
177 | ===================== | |
178 | The ENA driver and device can operate in conventional or adaptive | |
179 | interrupt moderation mode. | |
180 | ||
181 | In conventional mode the driver instructs the device to postpone | |
182 | interrupt posting according to a static interrupt delay value. The | |
183 | interrupt delay value can be configured through ethtool(8). The following | |
184 | ethtool parameters are supported by the driver: tx-usecs, rx-usecs. | |
185 | ||
186 | In adaptive interrupt moderation mode the interrupt delay value is | |
187 | updated by the driver dynamically and adjusted every NAPI cycle | |
188 | according to the nature of the traffic. | |
189 | ||
190 | By default the ENA driver applies adaptive coalescing on Rx traffic and | |
191 | conventional coalescing on Tx traffic. | |
192 | ||
193 | Adaptive coalescing can be switched on/off through the ethtool(8) | |
194 | adaptive-rx on|off parameter. | |
195 | ||
196 | The driver chooses the interrupt delay value according to the number | |
197 | of bytes and packets received between interrupt unmasking and | |
198 | interrupt posting. The driver uses an interrupt delay table that | |
199 | subdivides the range of received bytes/packets into 5 levels and | |
200 | assigns an interrupt delay value to each level. | |
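
A five-level delay table of the kind described above might be looked up as in the following sketch. Every number in the table is invented for illustration, and the structure and function names (intr_moder_entry, pick_delay()) are hypothetical; the real defaults and the exact selection rule live in the driver.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_LEVELS 5

/* Hypothetical table entry: one interrupt delay per traffic level. */
struct intr_moder_entry {
	uint32_t delay_usecs; /* interrupt delay applied at this level */
	uint32_t pkts_limit;  /* highest packet count that maps to this level */
	uint32_t bytes_limit; /* highest byte count that maps to this level */
};

/* All thresholds and delays below are made-up example values. */
static const struct intr_moder_entry intr_table[NUM_LEVELS] = {
	{   0,          4,       1024 },
	{  32,         32,      16384 },
	{  64,         64,      65536 },
	{  96,        128,     262144 },
	{ 128, UINT32_MAX, UINT32_MAX },
};

/*
 * Pick the delay of the first level whose packet and byte limits both
 * cover the traffic seen since the interrupt was last unmasked.
 */
static uint32_t pick_delay(uint32_t pkts, uint32_t bytes)
{
	for (size_t i = 0; i < NUM_LEVELS - 1; i++) {
		if (pkts <= intr_table[i].pkts_limit &&
		    bytes <= intr_table[i].bytes_limit)
			return intr_table[i].delay_usecs;
	}
	return intr_table[NUM_LEVELS - 1].delay_usecs;
}
```

In this model light traffic gets a short delay (low latency) and heavy traffic gets a long one (fewer interrupts), which is the general intent of adaptive moderation.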
201 | ||
202 | The user can enable/disable adaptive moderation, modify the interrupt | |
203 | delay table and restore its default values through sysfs. | |
204 | ||
205 | The rx_copybreak is initialized by default to ENA_DEFAULT_RX_COPYBREAK | |
206 | and can be configured by the ETHTOOL_STUNABLE command of the | |
207 | SIOCETHTOOL ioctl. | |
208 | ||
209 | SKB: | |
210 | The driver allocates an SKB for each frame received during Rx handling | |
211 | in NAPI context. The allocation method depends on the size of the packet. | |
212 | If the frame length is larger than rx_copybreak, napi_get_frags() | |
213 | is used, otherwise netdev_alloc_skb_ip_align() is used, the buffer | |
214 | content is copied (by CPU) to the SKB, and the buffer is recycled. | |
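
The rx_copybreak decision above reduces to a single comparison. The sketch below only models the choice between the two allocation paths; the enum and function names are hypothetical, and no kernel API is called.

```c
/* Hypothetical labels for the two SKB allocation paths named in the text. */
enum rx_skb_path {
	RX_COPY_TO_NEW_SKB, /* netdev_alloc_skb_ip_align() path: copy, recycle buffer */
	RX_ATTACH_AS_FRAG   /* napi_get_frags() path: hook the Rx buffer to the SKB */
};

/*
 * Frames longer than rx_copybreak keep their Rx buffer (attached as a
 * frag); all other frames are copied into a freshly allocated SKB so the
 * Rx buffer can be reused.
 */
static enum rx_skb_path rx_choose_path(unsigned int frame_len,
				       unsigned int rx_copybreak)
{
	return frame_len > rx_copybreak ? RX_ATTACH_AS_FRAG : RX_COPY_TO_NEW_SKB;
}
```

The trade-off is the usual copybreak one: copying a small frame is cheap and avoids unmapping and replacing a full Rx buffer.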
215 | ||
216 | Statistics: | |
217 | =========== | |
218 | The user can obtain ENA device and driver statistics using ethtool. | |
219 | The driver can collect regular or extended statistics (including | |
220 | per-queue stats) from the device. | |
221 | ||
222 | In addition the driver logs the stats to syslog upon device reset. | |
223 | ||
224 | MTU: | |
225 | ==== | |
226 | The driver supports an arbitrarily large MTU with a maximum that is | |
227 | negotiated with the device. The driver configures MTU using the | |
228 | SetFeature command (ENA_ADMIN_MTU property). The user can change MTU | |
229 | via ip(8) and similar legacy tools. | |
230 | ||
231 | Stateless Offloads: | |
232 | =================== | |
233 | The ENA driver supports: | |
234 | - TSO over IPv4/IPv6 | |
235 | - TSO with ECN | |
236 | - IPv4 header checksum offload | |
237 | - TCP/UDP over IPv4/IPv6 checksum offloads | |
238 | ||
239 | RSS: | |
240 | ==== | |
241 | - The ENA device supports RSS that allows flexible Rx traffic | |
242 | steering. | |
243 | - Toeplitz and CRC32 hash functions are supported. | |
244 | - Different combinations of L2/L3/L4 fields can be configured as | |
245 | inputs for hash functions. | |
246 | - The driver configures RSS settings using the AQ SetFeature command | |
247 | (ENA_ADMIN_RSS_HASH_FUNCTION, ENA_ADMIN_RSS_HASH_INPUT and | |
248 | ENA_ADMIN_RSS_REDIRECTION_TABLE_CONFIG properties). | |
249 | - If the NETIF_F_RXHASH flag is set, the 32-bit result of the hash | |
250 | function delivered in the Rx CQ descriptor is set in the received | |
251 | SKB. | |
252 | - The user can provide a hash key, hash function, and configure the | |
253 | indirection table through ethtool(8). | |
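
For readers unfamiliar with the Toeplitz function mentioned above, here is a generic userspace sketch of the hash and of an indirection table lookup. It illustrates the standard algorithm, not the device's implementation: the function names are hypothetical, and reducing the hash modulo the table size is an assumption about how a table index is derived.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Generic Toeplitz hash: for every set bit of the input (most significant
 * bit first), XOR in the 32-bit window of the key that starts at that bit
 * position. Key bytes beyond key_len are treated as zero.
 */
static uint32_t toeplitz_hash(const uint8_t *key, size_t key_len,
			      const uint8_t *data, size_t data_len)
{
	uint32_t window = 0;
	uint32_t hash = 0;

	/* Load the first 32 key bits into the sliding window. */
	for (size_t i = 0; i < 4; i++)
		window = (window << 8) | (i < key_len ? key[i] : 0u);

	for (size_t i = 0; i < data_len; i++) {
		/* Next key byte to shift into the window, one bit at a time. */
		uint8_t next = (i + 4 < key_len) ? key[i + 4] : 0u;

		for (int bit = 7; bit >= 0; bit--) {
			if (data[i] & (1u << bit))
				hash ^= window;
			window = (window << 1) | ((uint32_t)(next >> bit) & 1u);
		}
	}
	return hash;
}

/* Map a hash to an Rx queue through the indirection table (assumed modulo). */
static uint16_t rss_pick_queue(uint32_t hash, const uint16_t *indir_table,
			       size_t table_size)
{
	return indir_table[hash % table_size];
}
```

Because the hash is linear over XOR, flows that differ in any hashed field are spread across table entries, and rewriting the indirection table re-steers traffic without changing the key.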
254 | ||
255 | DATA PATH: | |
256 | ========== | |
257 | Tx: | |
258 | --- | |
259 | ena_start_xmit() is called by the stack. This function does the following: | |
260 | - Maps data buffers (skb->data and frags). | |
261 | - Populates ena_buf for the push buffer (if the driver and device are | |
262 | in push mode.) | |
263 | - Prepares ENA bufs for the remaining frags. | |
264 | - Allocates a new request ID from the empty req_id ring. The request | |
265 | ID is the index of the packet in the Tx info. This is used for | |
266 | out-of-order TX completions. | |
267 | - Adds the packet to the proper place in the Tx ring. | |
268 | - Calls ena_com_prepare_tx(), an ENA communication layer function that | |
269 | converts the ena_bufs to ENA descriptors (and adds meta ENA descriptors | |
270 | as needed.) | |
271 | * This function also copies the ENA descriptors and the push buffer | |
272 | to the Device memory space (if in push mode.) | |
273 | - Writes doorbell to the ENA device. | |
274 | - When the ENA device finishes sending the packet, a completion | |
275 | interrupt is raised. | |
276 | - The interrupt handler schedules NAPI. | |
277 | - The ena_clean_tx_irq() function is called. This function handles the | |
278 | completion descriptors generated by the ENA, with a single | |
279 | completion descriptor per completed packet. | |
280 | * req_id is retrieved from the completion descriptor. The tx_info of | |
281 | the packet is retrieved via the req_id. The data buffers are | |
282 | unmapped and req_id is returned to the empty req_id ring. | |
283 | * The function stops when the completion descriptors are completed or | |
284 | the budget is reached. | |
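
The request ID handling in the Tx flow above can be modelled as a small ring of free IDs. The sketch below is a simplified userspace illustration: the structure, field and function names are hypothetical, the ring size is arbitrary, and locking and descriptor handling are omitted.

```c
#include <stdint.h>

#define TX_RING_SIZE 4u /* arbitrary small size for illustration */

/* Hypothetical Tx ring state holding the "empty req_id ring". */
struct tx_ring {
	uint16_t free_ids[TX_RING_SIZE]; /* ring of currently unused request IDs */
	uint16_t next_to_use;            /* slot the next request ID is taken from */
	uint16_t next_to_clean;          /* slot a completed request ID returns to */
	uint16_t in_flight;              /* packets submitted but not yet completed */
};

/* Initially every request ID is free, in ascending order. */
static void tx_ring_init(struct tx_ring *r)
{
	for (uint16_t i = 0; i < TX_RING_SIZE; i++)
		r->free_ids[i] = i;
	r->next_to_use = 0;
	r->next_to_clean = 0;
	r->in_flight = 0;
}

/* Transmit side: take a free request ID, or return -1 if the ring is full. */
static int tx_alloc_req_id(struct tx_ring *r)
{
	int req_id;

	if (r->in_flight == TX_RING_SIZE)
		return -1;
	req_id = r->free_ids[r->next_to_use];
	r->next_to_use = (uint16_t)((r->next_to_use + 1u) % TX_RING_SIZE);
	r->in_flight++;
	return req_id;
}

/* Completion side: give the request ID back; completions may be out of order. */
static void tx_complete_req_id(struct tx_ring *r, uint16_t req_id)
{
	r->free_ids[r->next_to_clean] = req_id;
	r->next_to_clean = (uint16_t)((r->next_to_clean + 1u) % TX_RING_SIZE);
	r->in_flight--;
}
```

Storing the returned ID at next_to_clean, rather than assuming IDs come back in order, is what lets the ring tolerate out-of-order Tx completions.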
285 | ||
286 | Rx: | |
287 | --- | |
288 | - A packet arrives from the ENA device, which raises an interrupt. | |
289 | - The interrupt handler schedules NAPI. | |
290 | - The ena_clean_rx_irq() function is called. This function calls | |
291 | ena_com_rx_pkt(), an ENA communication layer function, which returns the | |
292 | number of descriptors used for a new unhandled packet, and zero if | |
293 | no new packet is found. | |
294 | - Then it calls the ena_rx_skb() function. | |
295 | - ena_rx_skb() checks packet length: | |
296 | * If the packet is small (len <= rx_copybreak), the driver allocates | |
297 | a SKB for the new packet, and copies the packet payload into the | |
298 | SKB data buffer. | |
299 | - In this way the original data buffer is not passed to the stack | |
300 | and is reused for future Rx packets. | |
301 | * Otherwise the function unmaps the Rx buffer, then allocates the | |
302 | new SKB structure and hooks the Rx buffer to the SKB frags. | |
303 | - The new SKB is updated with the necessary information (protocol, | |
304 | checksum hw verify result, etc.), and then passed to the network | |
305 | stack, using the NAPI interface function napi_gro_receive(). |