## @file
#
# Technical notes for the virtio-net driver.
#
# Copyright (C) 2013, Red Hat, Inc.
#
# SPDX-License-Identifier: BSD-2-Clause-Patent
#
##

Disclaimer
----------

All statements concerning standards and specifications are informative and not
normative. They are made in good faith. Corrections are most welcome on the
edk2-devel mailing list.

The following documents have been perused while writing the driver and this
document:
- Unified Extensible Firmware Interface Specification, Version 2.3.1, Errata C;
  June 27, 2012
- Driver Writer's Guide for UEFI 2.3.1, 03/08/2012, Version 1.01;
- Virtio PCI Card Specification, v0.9.5 DRAFT, 2012 May 7.


Summary
-------

The VirtioNetDxe UEFI_DRIVER implements the Simple Network Protocol for
virtio-net devices. Higher level protocols are automatically installed on top
of it by the DXE Core / the ConnectController() boot service. For virtio-net
devices this enables, for example, DHCP configuration, TCP transfers with edk2
StdLib applications, and PXE booting in OVMF.


UEFI driver structure
---------------------

A driver instance, belonging to a given virtio-net device, can be in one of
four states at any time. The states stack up as shown below. Each state
transition is labeled with the primary function that implements it, with the
function's important callees indented beneath it.

                              |   ^
                              |   |
[DriverBinding.c]             |   | [DriverBinding.c]
VirtioNetDriverBindingStart   |   | VirtioNetDriverBindingStop
  VirtioNetSnpPopulate        |   |   VirtioNetSnpEvacuate
    VirtioNetGetFeatures      |   |
                              v   |
                   +-------------------------+
                   | EfiSimpleNetworkStopped |
                   +-------------------------+
                              |   ^
[SnpStart.c]                  |   | [SnpStop.c]
VirtioNetStart                |   | VirtioNetStop
                              |   |
                              v   |
                   +-------------------------+
                   | EfiSimpleNetworkStarted |
                   +-------------------------+
                              |   ^
[SnpInitialize.c]             |   | [SnpShutdown.c]
VirtioNetInitialize           |   | VirtioNetShutdown
  VirtioNetInitRing {Rx, Tx}  |   |   VirtioNetShutdownRx [SnpSharedHelpers.c]
    VirtioRingInit            |   |     VirtIo->UnmapSharedBuffer
    VirtioRingMap             |   |     VirtIo->FreeSharedPages
  VirtioNetInitTx             |   |   VirtioNetShutdownTx [SnpSharedHelpers.c]
    VirtIo->AllocateShare...  |   |     VirtIo->UnmapSharedBuffer
    VirtioMapAllBytesInSh...  |   |     VirtIo->FreeSharedPages
  VirtioNetInitRx             |   |   VirtioNetUninitRing [SnpSharedHelpers.c]
    VirtIo->AllocateShare...  |   |                       {Tx, Rx}
    VirtioMapAllBytesInSh...  |   |     VirtIo->UnmapSharedBuffer
                              |   |     VirtioRingUninit
                              v   |
                 +-----------------------------+
                 | EfiSimpleNetworkInitialized |
                 +-----------------------------+

The state at the top means "nonexistent" and is hence unnamed on the diagram --
a driver instance actually doesn't exist at that point. The transition
functions out of and into that state implement the Driver Binding Protocol.

The lower three states characterize an existent driver instance and are all
states defined by the Simple Network Protocol. The transition functions between
them are member functions of the Simple Network Protocol.

Each transition function validates its expected source state and its
parameters. For example, VirtioNetDriverBindingStop will refuse to disconnect
from the controller unless the driver instance is in EfiSimpleNetworkStopped.
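
The source-state validation can be pictured as a small table-driven state
machine. The following is only a sketch: the enum names, the table and
apply_transition() are hypothetical and do not appear in the driver, which
tracks the state in the SNP mode structure and reports failures through
EFI_STATUS codes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names; not taken from the driver sources. */
typedef enum {
  STATE_NONEXISTENT,  /* no driver instance is bound to the device */
  STATE_STOPPED,      /* EfiSimpleNetworkStopped */
  STATE_STARTED,      /* EfiSimpleNetworkStarted */
  STATE_INITIALIZED   /* EfiSimpleNetworkInitialized */
} instance_state;

typedef enum {
  OP_BINDING_START,   /* VirtioNetDriverBindingStart */
  OP_BINDING_STOP,    /* VirtioNetDriverBindingStop */
  OP_SNP_START,       /* VirtioNetStart */
  OP_SNP_STOP,        /* VirtioNetStop */
  OP_SNP_INITIALIZE,  /* VirtioNetInitialize */
  OP_SNP_SHUTDOWN,    /* VirtioNetShutdown */
  OP_COUNT
} transition_op;

/* Performs the transition only if *state is the expected source state.
   Otherwise *state is left unchanged and false is returned. */
bool apply_transition(instance_state *state, transition_op op)
{
  static const struct {
    instance_state from;
    instance_state to;
  } table[OP_COUNT] = {
    [OP_BINDING_START]  = { STATE_NONEXISTENT, STATE_STOPPED },
    [OP_BINDING_STOP]   = { STATE_STOPPED,     STATE_NONEXISTENT },
    [OP_SNP_START]      = { STATE_STOPPED,     STATE_STARTED },
    [OP_SNP_STOP]       = { STATE_STARTED,     STATE_STOPPED },
    [OP_SNP_INITIALIZE] = { STATE_STARTED,     STATE_INITIALIZED },
    [OP_SNP_SHUTDOWN]   = { STATE_INITIALIZED, STATE_STARTED },
  };

  assert(state != NULL);
  assert(op < OP_COUNT);

  if (*state != table[op].from) {
    return false;  /* wrong source state: refuse the transition */
  }
  *state = table[op].to;
  return true;
}
```

In this model, requesting OP_BINDING_STOP while the instance is in
STATE_INITIALIZED is refused, mirroring the VirtioNetDriverBindingStop example
above.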


Driver instance states (Simple Network Protocol)
------------------------------------------------

In the EfiSimpleNetworkStopped state, the virtio-net device has been reset. No
resources are allocated for networking / traffic purposes. The MAC address and
other device attributes have been retrieved from the device (this is necessary
for completing the VirtioNetDriverBindingStart transition).

The EfiSimpleNetworkStarted state is identical to the EfiSimpleNetworkStopped
state for virtio-net, in the functional and resource-usage sense. This state is
mandated / provided by the Simple Network Protocol for flexibility that the
virtio-net driver doesn't exploit.

In particular, the EfiSimpleNetworkStarted state is the target of the Shutdown
SNP member function, and must therefore correspond to a hardware configuration
where "[it] is safe for another driver to initialize". (Clearly another UEFI
driver could not do that due to the exclusivity of the driver binding that
VirtioNetDriverBindingStart() installs, but a later OS driver might qualify.)

The EfiSimpleNetworkInitialized state is the live state of the virtio NIC / the
driver instance. Virtio and other resources required for network traffic have
been allocated, and the following SNP member functions are available (in
addition to VirtioNetShutdown, which leaves the state):

- VirtioNetReceive [SnpReceive.c]: poll the virtio NIC for an Rx packet that
  may have arrived asynchronously;

- VirtioNetTransmit [SnpTransmit.c]: queue a Tx packet for asynchronous
  transmission (meant to be used together with VirtioNetGetStatus);

- VirtioNetGetStatus [SnpGetStatus.c]: query link status and status of pending
  Tx packets;

- VirtioNetMcastIpToMac [SnpMcastIpToMac.c]: transform a multicast IPv4/IPv6
  address into a multicast MAC address;

- VirtioNetReceiveFilters [SnpReceiveFilters.c]: emulate unicast / multicast /
  broadcast filter configuration (not their actual effect -- a more liberal
  filter setting than requested is allowed by the UEFI specification).

The following SNP member functions are not supported [SnpUnsupported.c]:

- VirtioNetReset: reinitialize the virtio NIC without shutting it down (a loop
  from/to EfiSimpleNetworkInitialized);

- VirtioNetStationAddress: assign a new MAC address to the virtio NIC;

- VirtioNetStatistics: collect statistics;

- VirtioNetNvData: access non-volatile data on the virtio NIC.

Missing support for these functions is allowed by the UEFI specification and
doesn't seem to trip up higher level protocols.


Events and task priority levels
-------------------------------

The UEFI specification defines a sophisticated mechanism for asynchronous
events / callbacks (see "6.1 Event, Timer, and Task Priority Services" for
details). Such callbacks work like software interrupts, and some notion of
locking / masking is important to implement critical sections (atomic or
exclusive access to data or a device). This notion is defined as Task Priority
Levels.

The virtio-net driver for OVMF must concern itself with events for two reasons:

- The Simple Network Protocol provides its clients with a (non-optional) WAIT
  type event called WaitForPacket: it allows them to check or wait for Rx
  packets by polling or blocking on this event. (This functionality overlaps
  with the Receive member function.) The event is available to clients starting
  with EfiSimpleNetworkStopped (inclusive).

  The virtio-net driver is informed about such client polling or blocking by
  receiving an asynchronous callback (a software interrupt). In the callback
  function the driver must interrogate the driver instance state, and if it is
  EfiSimpleNetworkInitialized, access the Rx queue and see if any packets are
  available for consumption. If so, it must signal the WaitForPacket WAIT type
  event, waking the client.

  For simplicity and safety, all parts of the virtio-net driver that access any
  bit of the driver instance (data or device) run at the TPL_CALLBACK level.
  This is the highest level allowed for an SNP implementation, and all code
  protected in this manner satisfies even stricter non-blocking requirements
  than what's documented for TPL_CALLBACK.

  The task priority level for the WaitForPacket callback is also set by the
  driver; the choice is TPL_CALLBACK again. This in effect serializes the
  WaitForPacket callback (VirtioNetIsPacketAvailable [Events.c]) with "normal"
  parts of the driver.

- According to the Driver Writer's Guide, a network driver should install a
  callback function for the global EXIT_BOOT_SERVICES event (a special NOTIFY
  type event). When the ExitBootServices() boot service has cleaned up internal
  firmware state and is about to pass control to the OS, any network driver has
  to stop any in-flight DMA transfers, lest it corrupt OS memory. For this
  reason EXIT_BOOT_SERVICES is emitted and the network driver must abort
  in-flight DMA transfers.

  This callback (VirtioNetExitBoot) is synchronized with the rest of the driver
  code just the same as explained for WaitForPacket. In
  EfiSimpleNetworkInitialized state it resets the virtio NIC, halting all data
  transfer. After the callback returns, no further driver code is expected to
  be scheduled.
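
The serialization effect of running everything at TPL_CALLBACK can be
illustrated with a toy model. This is not UEFI code: raise_tpl(),
restore_tpl(), queue_notify() and the counters are hypothetical stand-ins for
the firmware's behavior, under the assumption that a notification queued at a
given TPL is dispatched only once the current TPL drops below it. Only the
numeric values of TPL_APPLICATION and TPL_CALLBACK follow the UEFI
specification.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned tpl_t;
#define TPL_APPLICATION 4u
#define TPL_CALLBACK    8u

/* Toy firmware state (hypothetical). */
tpl_t current_tpl = TPL_APPLICATION;
bool  notify_pending;
bool  in_critical_section;
int   callbacks_run;
int   callbacks_seen_inside;  /* callbacks that preempted a critical section */

/* Stand-in for the WaitForPacket notification function. */
static void wait_for_packet_callback(void)
{
  callbacks_run++;
  if (in_critical_section) {
    callbacks_seen_inside++;
  }
}

/* Runs the queued TPL_CALLBACK notification if the current TPL permits. */
static void dispatch_pending(void)
{
  if (notify_pending && current_tpl < TPL_CALLBACK) {
    tpl_t saved = current_tpl;
    notify_pending = false;
    current_tpl = TPL_CALLBACK;
    wait_for_packet_callback();
    current_tpl = saved;
  }
}

tpl_t raise_tpl(tpl_t new_tpl)
{
  tpl_t old = current_tpl;
  assert(new_tpl >= current_tpl);
  current_tpl = new_tpl;
  return old;
}

void restore_tpl(tpl_t old_tpl)
{
  current_tpl = old_tpl;
  dispatch_pending();  /* deferred notifications run when the TPL drops */
}

/* Models the firmware queuing the notification for the event. */
void queue_notify(void)
{
  notify_pending = true;
  dispatch_pending();
}

/* Shape of a driver entry point that protects the instance. */
void snp_member_function(void)
{
  tpl_t old = raise_tpl(TPL_CALLBACK);
  in_critical_section = true;
  queue_notify();               /* arrives mid-section: deferred */
  in_critical_section = false;
  restore_tpl(old);             /* the callback runs here */
}
```

In this model the callback never observes in_critical_section set, which is the
property the driver relies on.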


Virtio internals -- Rx
----------------------

Requests (Rx and Tx alike) are always submitted by the guest and processed by
the host. For Tx, processing means transmission. For Rx, processing means
filling in the request with an incoming packet. Submitted requests exist on the
"Available Ring", and answered (processed) requests show up on the "Used Ring".

Packet data includes the media (Ethernet) header: destination MAC, source MAC,
and Ethertype (14 bytes total).

The following structures implement packet reception. Most of them are defined
in the Virtio specification; the only driver-specific trait here is the static
pre-configuration of the two-part descriptor chains, in VirtioNetInitRx. The
diagram is simplified.

                    Available Index Available Index
                    last processed    incremented
                      by the host    by the guest
                           v   ------->    v
Available  +-------+-------+-------+-------+-------+
Ring       |DescIdx|DescIdx|DescIdx|DescIdx|DescIdx|
           +-------+-------+-------+-------+-------+
                              =D6     =D2

       D2         D3          D4         D5          D6         D7
Descr. +----------+----------++----------+----------++----------+----------+
Table  |Adr:Len:Nx|Adr:Len:Nx||Adr:Len:Nx|Adr:Len:Nx||Adr:Len:Nx|Adr:Len:Nx|
       +----------+----------++----------+----------++----------+----------+
        =A2    =D3 =A3         =A4    =D5 =A5         =A6    =D7 =A7


             A2       A3     A4       A5     A6       A7
Receive     +---------------+---------------+---------------+
Destination |vnet hdr:packet|vnet hdr:packet|vnet hdr:packet|
Area        +---------------+---------------+---------------+

                         Used Index       Used Index incremented
        last processed by the guest       by the host
                                v ------->  v
Used    +-----------+-----------+-----------+-----------+-----------+
Ring    |DescIdx:Len|DescIdx:Len|DescIdx:Len|DescIdx:Len|DescIdx:Len|
        +-----------+-----------+-----------+-----------+-----------+
                                     =D4

In VirtioNetInitRx, the guest allocates the fixed size Receive Destination
Area, which accommodates all packets delivered asynchronously by the host. A
slice of this area is dedicated to each packet; each slice is further
subdivided into virtio-net request header and network packet data. The
(device-physical) addresses of these sub-slices are denoted with A2, A3, A4 and
so on. Importantly, an even-subscript "A" always belongs to a virtio-net
request header, while an odd-subscript "A" always belongs to a packet
sub-slice.

Furthermore, the guest lays out a static pattern in the Descriptor Table. For
each packet that can be in-flight or already arrived from the host,
VirtioNetInitRx sets up a separate, two-part descriptor chain. For packet N,
the Nth descriptor chain is set up as follows:

- the first (=head) descriptor, with even index, points to the fixed-size
  sub-slice receiving the virtio-net request header,

- the second descriptor (with odd index) points to the fixed (1514 byte) size
  sub-slice receiving the packet data,

- a link from the first (head) descriptor in the chain is established to the
  second (tail) descriptor in the chain.
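
The static layout described above can be sketched as follows. The descriptor
structure and the two flag values follow the split virtqueue layout of the
Virtio specification; init_rx_chains(), its parameters, and the 10-byte
virtio-net request header size (the legacy header without the num_buffers
field) are assumptions made for illustration, and descriptor numbering starts
at 0 here for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VRING_DESC_F_NEXT   1u    /* chain continues at the Next field */
#define VRING_DESC_F_WRITE  2u    /* the device writes into this buffer */

#define VNET_HDR_SIZE  10u        /* assumed virtio-net request header size */
#define PACKET_SIZE    1514u      /* 14-byte media header + 1500-byte payload */

typedef struct {
  uint64_t addr;   /* device-physical address of the buffer */
  uint32_t len;    /* buffer length in bytes */
  uint16_t flags;  /* VRING_DESC_F_* */
  uint16_t next;   /* index of the next descriptor in the chain */
} vring_desc;

/* Hypothetical helper: builds num_chains two-part chains over a contiguous
   receive area starting at area_base, and lists every head descriptor on the
   Available Ring. */
void init_rx_chains(vring_desc *table, uint16_t *avail_ring,
                    uint16_t num_chains, uint64_t area_base)
{
  assert(table != NULL && avail_ring != NULL);

  for (uint16_t n = 0; n < num_chains; n++) {
    uint64_t slice = area_base + (uint64_t)n * (VNET_HDR_SIZE + PACKET_SIZE);
    uint16_t head  = (uint16_t)(2 * n);      /* even index */
    uint16_t tail  = (uint16_t)(2 * n + 1);  /* odd index */

    /* Head: receives the virtio-net request header, links to the tail. */
    table[head].addr  = slice;
    table[head].len   = VNET_HDR_SIZE;
    table[head].flags = VRING_DESC_F_WRITE | VRING_DESC_F_NEXT;
    table[head].next  = tail;

    /* Tail: receives the packet data and ends the chain. */
    table[tail].addr  = slice + VNET_HDR_SIZE;
    table[tail].len   = PACKET_SIZE;
    table[tail].flags = VRING_DESC_F_WRITE;
    table[tail].next  = 0;

    avail_ring[n] = head;
  }
}
```

Both descriptors carry the WRITE flag because, for Rx, the host fills in the
header as well as the packet data.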

Finally, the guest populates the Available Ring with the indices of the head
descriptors. All descriptor indices on both the Available Ring and the Used
Ring are even.

Packet reception occurs as follows:

- The host consumes a descriptor index off the Available Ring. This index is
  even (=2*N), and identifies the head descriptor of the chain belonging to
  packet N.

- The host reads the descriptors D(2*N) and -- following the Next link there
  -- D(2*N+1), and stores the virtio-net request header at A(2*N), and the
  packet data at A(2*N+1).

- The host places the index of the head descriptor, 2*N, onto the Used Ring,
  and sets the Len field in the same Used Ring Element to the total number of
  bytes transferred for the entire descriptor chain. This enables the guest to
  identify the length of Rx packets.

- VirtioNetReceive polls the Used Ring. If a new Used Ring Element shows up, it
  copies the data out to the caller, and recycles the index of the head
  descriptor (i.e. 2*N) to the Available Ring.

- Because the host can theoretically process (answer) Rx requests in any order,
  the order of head descriptor indices on each of the Available Ring and the
  Used Ring is virtually random. (Except right after the initial population in
  VirtioNetInitRx, when the Available Ring is full and increasing, and the Used
  Ring is empty.)

- If the Available Ring is empty, the host is forced to drop packets. If the
  Used Ring is empty, VirtioNetReceive returns EFI_NOT_READY (no packet
  available).
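
The guest side of these steps can be sketched as a single poll routine. This
is a simplified model rather than the driver's code: rx_queue, rx_poll() and
the ring size are hypothetical, the header size of 10 bytes is an assumption,
the data copy is left out, and the memory barriers and device notification
that a real virtio driver needs are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE      8u   /* illustrative queue size */
#define VNET_HDR_SIZE  10u  /* assumed virtio-net request header size */

typedef struct {
  uint32_t id;   /* head descriptor index of the processed chain */
  uint32_t len;  /* total bytes written into the chain by the host */
} used_elem;

typedef struct {
  uint16_t  avail_idx;              /* free-running, advanced by the guest */
  uint16_t  avail_ring[RING_SIZE];
  uint16_t  used_idx;               /* free-running, advanced by the host */
  used_elem used_ring[RING_SIZE];
  uint16_t  last_used;              /* guest-private consumer position */
} rx_queue;

/* Returns 0 and reports the chain head and the packet length if the host has
   answered a request; returns -1 if the Used Ring holds nothing new (the
   EFI_NOT_READY case). */
int rx_poll(rx_queue *q, uint16_t *head, uint32_t *packet_len)
{
  const used_elem *elem;

  assert(q != NULL && head != NULL && packet_len != NULL);

  if (q->last_used == q->used_idx) {
    return -1;
  }

  elem = &q->used_ring[q->last_used % RING_SIZE];
  *head = (uint16_t)elem->id;
  /* Len covers the whole chain, so the header size is subtracted. */
  *packet_len = (elem->len > VNET_HDR_SIZE) ? elem->len - VNET_HDR_SIZE : 0;
  q->last_used++;

  /* Recycle the chain: hand the same head back to the host. */
  q->avail_ring[q->avail_idx % RING_SIZE] = *head;
  q->avail_idx++;
  return 0;
}
```

Because every consumed head is put straight back on the Available Ring, the
number of chains in circulation stays constant.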


Virtio internals -- Tx
----------------------

The transmission structure erected by VirtioNetInitTx is similar; it differs
in the following:

- There is no Receive Destination Area.

- Each head descriptor, D(2*N), points to a read-only virtio-net request header
  that is shared by all of the head descriptors. This virtio-net request header
  is never modified by the host.

- Each tail descriptor is re-pointed to the device-mapped address of the
  caller-supplied packet buffer whenever VirtioNetTransmit places the
  corresponding head descriptor on the Available Ring. A reverse mapping, from
  the device-mapped address to the caller-supplied packet address, is saved in
  an associative data structure that belongs to the driver instance.

- Per spec, the caller is responsible for holding on to the unmodified packet
  buffer until it is reported transmitted by VirtioNetGetStatus.

Steps of packet transmission:

- Client code calls VirtioNetTransmit. VirtioNetTransmit tracks free descriptor
  chains by keeping the indices of their head descriptors in a stack that is
  private to the driver instance. All elements of the stack are even.

- If the stack is empty (that is, each descriptor chain, in isolation, is
  either pending transmission, or has been processed by the host but not
  yet recycled by a VirtioNetGetStatus call), then VirtioNetTransmit returns
  EFI_NOT_READY.

- Otherwise the index of a free chain's head descriptor is popped from the
  stack. The linked tail descriptor is re-pointed as discussed above. The head
  descriptor's index is pushed on the Available Ring.

- The host moves the head descriptor index from the Available Ring to the Used
  Ring when it transmits the packet.

- Client code calls VirtioNetGetStatus. In case the Used Ring is empty, the
  function reports no Tx completion. Otherwise, a head descriptor's index is
  consumed from the Used Ring and recycled to the private stack. The client
  code's original packet buffer address is calculated by fetching the
  device-mapped address from the tail descriptor (where it has been stored at
  VirtioNetTransmit time), and by looking up the device-mapped address in the
  associative data structure. The reverse-mapped packet buffer address is
  returned to the caller.

- The Len field of the Used Ring Element is not checked. The host is assumed to
  have transmitted the entire packet -- VirtioNetTransmit had limited it to at
  most 1514 bytes. The Virtio specification suggests this packet size is
  always accepted (and a lower MTU could be encountered on any later hop as
  well). Additionally, there's no good way to report a short transmit via
  VirtioNetGetStatus; EFI_DEVICE_ERROR seems too serious judging from the
  specification, and higher level protocols could interpret it as a fatal
  condition.

- The host can theoretically reorder head descriptor indices when moving them
  from the Available Ring to the Used Ring (out of order transmission). Because
  of this (and the choice of a stack over a list for free descriptor chain
  tracking) the order of head descriptor indices on either Ring is
  unpredictable.
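
The free-chain bookkeeping in the steps above amounts to a small LIFO stack of
even head descriptor indices. The sketch below models only that stack; the
names tx_state, tx_init(), tx_acquire() and tx_release() and the chain count
are hypothetical, and the ring handling and the reverse address mapping are
left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TX_CHAINS 4u  /* illustrative number of Tx descriptor chains */

typedef struct {
  uint16_t free_stack[TX_CHAINS];  /* head descriptor indices, all even */
  uint16_t free_count;             /* number of valid entries */
} tx_state;

/* Initially every chain is free: heads 0, 2, 4, ... */
void tx_init(tx_state *tx)
{
  assert(tx != NULL);
  for (uint16_t n = 0; n < TX_CHAINS; n++) {
    tx->free_stack[n] = (uint16_t)(2 * n);
  }
  tx->free_count = TX_CHAINS;
}

/* Transmit side: pops a free chain head. Returns -1 if every chain is still
   pending or not yet recycled (the EFI_NOT_READY case). */
int tx_acquire(tx_state *tx, uint16_t *head)
{
  assert(tx != NULL && head != NULL);
  if (tx->free_count == 0) {
    return -1;
  }
  *head = tx->free_stack[--tx->free_count];
  return 0;
}

/* GetStatus side: pushes a head consumed from the Used Ring back. */
void tx_release(tx_state *tx, uint16_t head)
{
  assert(tx != NULL);
  assert(tx->free_count < TX_CHAINS);
  assert(head % 2 == 0);
  tx->free_stack[tx->free_count++] = head;
}
```

With a stack, the most recently recycled chain is reused first, which is one
reason the order of indices on the rings does not stay sorted.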