## @file
#
# Technical notes for the virtio-net driver.
#
# Copyright (C) 2013, Red Hat, Inc.
#
# This program and the accompanying materials are licensed and made available
# under the terms and conditions of the BSD License which accompanies this
# distribution. The full text of the license may be found at
# http://opensource.org/licenses/bsd-license.php
#
# THE PROGRAM IS DISTRIBUTED UNDER THE BSD LICENSE ON AN "AS IS" BASIS, WITHOUT
# WARRANTIES OR REPRESENTATIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED.
#
##

Disclaimer
----------

All statements concerning standards and specifications are informative and not
normative. They are made in good faith. Corrections are most welcome on the
edk2-devel mailing list.

The following documents were consulted while writing the driver and this
document:
- Unified Extensible Firmware Interface Specification, Version 2.3.1, Errata C,
  June 27, 2012;
- Driver Writer's Guide for UEFI 2.3.1, 03/08/2012, Version 1.01;
- Virtio PCI Card Specification, v0.9.5 DRAFT, 2012 May 7.


Summary
-------

The VirtioNetDxe UEFI_DRIVER implements the Simple Network Protocol for
virtio-net devices. Higher level protocols are automatically installed on top
of it by the DXE Core / the ConnectController() boot service. For virtio-net
devices this enables, for example, DHCP configuration, TCP transfers with edk2
StdLib applications, and PXE booting in OVMF.


UEFI driver structure
---------------------

A driver instance, belonging to a given virtio-net device, can be in one of
four states at any time. The states stack up as shown below. Each state
transition is labeled with the primary function that implements it (and with
that function's important callees, faithfully indented).

                               |  ^
                               |  |
   [DriverBinding.c]           |  | [DriverBinding.c]
   VirtioNetDriverBindingStart |  | VirtioNetDriverBindingStop
     VirtioNetSnpPopulate      |  |   VirtioNetSnpEvacuate
       VirtioNetGetFeatures    |  |
                               v  |
                   +-------------------------+
                   | EfiSimpleNetworkStopped |
                   +-------------------------+
                               |  ^
   [SnpStart.c]                |  | [SnpStop.c]
   VirtioNetStart              |  | VirtioNetStop
                               |  |
                               v  |
                   +-------------------------+
                   | EfiSimpleNetworkStarted |
                   +-------------------------+
                               |  ^
   [SnpInitialize.c]           |  | [SnpShutdown.c]
   VirtioNetInitialize         |  | VirtioNetShutdown
     VirtioNetInitRing {Rx, Tx}|  |   VirtioNetShutdownRx [SnpSharedHelpers.c]
       VirtioRingInit          |  |     VirtIo->UnmapSharedBuffer
       VirtioRingMap           |  |     VirtIo->FreeSharedPages
     VirtioNetInitTx           |  |   VirtioNetShutdownTx [SnpSharedHelpers.c]
       VirtIo->AllocateShare...|  |     VirtIo->UnmapSharedBuffer
       VirtioMapAllBytesInSh...|  |     VirtIo->FreeSharedPages
     VirtioNetInitRx           |  |   VirtioNetUninitRing [SnpSharedHelpers.c]
       VirtIo->AllocateShare...|  |     {Tx, Rx}
       VirtioMapAllBytesInSh...|  |       VirtIo->UnmapSharedBuffer
                               |  |       VirtioRingUninit
                               v  |
                 +-----------------------------+
                 | EfiSimpleNetworkInitialized |
                 +-----------------------------+

The state at the top means "nonexistent" and is hence unnamed on the diagram --
a driver instance actually doesn't exist at that point. The transition
functions out of and into that state implement the Driver Binding Protocol.

The lower three states characterize an existent driver instance and are all
states defined by the Simple Network Protocol. The transition functions between
them are member functions of the Simple Network Protocol.

Each transition function validates its expected source state and its
parameters. For example, VirtioNetDriverBindingStop will refuse to disconnect
from the controller unless the driver instance is in EfiSimpleNetworkStopped.
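
The source-state check can be illustrated with a small model. This is a sketch
in Python, not the driver's C code: the table layout, the function name
apply_transition, and the use of ValueError in place of an EFI status code are
assumptions made for illustration only.

```python
from enum import Enum


class SnpState(Enum):
    STOPPED = "EfiSimpleNetworkStopped"
    STARTED = "EfiSimpleNetworkStarted"
    INITIALIZED = "EfiSimpleNetworkInitialized"


# Transition function -> (required source state, resulting state).
# None stands for the unnamed "nonexistent" state at the top of the diagram.
TRANSITIONS = {
    "VirtioNetDriverBindingStart": (None, SnpState.STOPPED),
    "VirtioNetStart": (SnpState.STOPPED, SnpState.STARTED),
    "VirtioNetInitialize": (SnpState.STARTED, SnpState.INITIALIZED),
    "VirtioNetShutdown": (SnpState.INITIALIZED, SnpState.STARTED),
    "VirtioNetStop": (SnpState.STARTED, SnpState.STOPPED),
    "VirtioNetDriverBindingStop": (SnpState.STOPPED, None),
}


def apply_transition(state, name):
    """Return the new state; refuse if the source state is not the expected one.

    ValueError is a stand-in for whatever EFI status the real function returns.
    """
    source, target = TRANSITIONS[name]
    if state is not source:
        raise ValueError(f"{name} refused in state {state}")
    return target
```

In this model, calling apply_transition(SnpState.INITIALIZED,
"VirtioNetDriverBindingStop") raises, mirroring the refusal described above.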


Driver instance states (Simple Network Protocol)
------------------------------------------------

In the EfiSimpleNetworkStopped state, the virtio-net device has been reset. No
resources are allocated for networking / traffic purposes. The MAC address and
other device attributes have been retrieved from the device (this is necessary
for completing the VirtioNetDriverBindingStart transition).

The EfiSimpleNetworkStarted state is identical to the EfiSimpleNetworkStopped
state for virtio-net, in the functional and resource-usage sense. This state is
mandated / provided by the Simple Network Protocol for flexibility that the
virtio-net driver doesn't exploit.

In particular, the EfiSimpleNetworkStarted state is the target of the Shutdown
SNP member function, and must therefore correspond to a hardware configuration
where "[it] is safe for another driver to initialize". (Clearly another UEFI
driver could not do that due to the exclusivity of the driver binding that
VirtioNetDriverBindingStart() installs, but a later OS driver might qualify.)

The EfiSimpleNetworkInitialized state is the live state of the virtio NIC / the
driver instance. Virtio and other resources required for network traffic have
been allocated, and the following SNP member functions are available (in
addition to VirtioNetShutdown which leaves the state):

- VirtioNetReceive [SnpReceive.c]: poll the virtio NIC for an Rx packet that
  may have arrived asynchronously;

- VirtioNetTransmit [SnpTransmit.c]: queue a Tx packet for asynchronous
  transmission (meant to be used together with VirtioNetGetStatus);

- VirtioNetGetStatus [SnpGetStatus.c]: query link status and status of pending
  Tx packets;

- VirtioNetMcastIpToMac [SnpMcastIpToMac.c]: transform a multicast IPv4/IPv6
  address into a multicast MAC address;

- VirtioNetReceiveFilters [SnpReceiveFilters.c]: emulate unicast / multicast /
  broadcast filter configuration (not their actual effect -- a more liberal
  filter setting than requested is allowed by the UEFI specification).
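
As an illustration of the multicast IP-to-MAC transformation, the sketch below
applies the standard mappings (RFC 1112 for IPv4, RFC 2464 for IPv6). It is a
Python model under the assumption that the driver follows these standard
rules; this document does not spell out the driver's exact code, and the
function name mcast_ip_to_mac is hypothetical.

```python
import ipaddress


def mcast_ip_to_mac(address):
    """Map a multicast IPv4/IPv6 address to its multicast MAC address."""
    ip = ipaddress.ip_address(address)
    if not ip.is_multicast:
        raise ValueError("not a multicast address")
    raw = ip.packed
    if ip.version == 4:
        # 01:00:5E prefix, followed by the low 23 bits of the IPv4 address.
        mac = bytes([0x01, 0x00, 0x5E, raw[1] & 0x7F, raw[2], raw[3]])
    else:
        # 33:33 prefix, followed by the low 32 bits of the IPv6 address.
        mac = bytes([0x33, 0x33]) + raw[12:16]
    return ":".join(f"{b:02x}" for b in mac)
```

Because the IPv4 mapping keeps only 23 bits, several multicast IPv4 addresses
share one MAC address; for example 224.0.0.251 and 224.128.0.251 both map to
01:00:5e:00:00:fb.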

The following SNP member functions are not supported [SnpUnsupported.c]:

- VirtioNetReset: reinitialize the virtio NIC without shutting it down (a loop
  from/to EfiSimpleNetworkInitialized);

- VirtioNetStationAddress: assign a new MAC address to the virtio NIC;

- VirtioNetStatistics: collect statistics;

- VirtioNetNvData: access non-volatile data on the virtio NIC.

Missing support for these functions is allowed by the UEFI specification and
doesn't seem to trip up higher level protocols.


Events and task priority levels
-------------------------------

The UEFI specification defines a sophisticated mechanism for asynchronous
events / callbacks (see "6.1 Event, Timer, and Task Priority Services" for
details). Such callbacks work like software interrupts, and some notion of
locking / masking is important to implement critical sections (atomic or
exclusive access to data or a device). This notion is defined as Task Priority
Levels.

The virtio-net driver for OVMF must concern itself with events for two reasons:

- The Simple Network Protocol provides its clients with a (non-optional) WAIT
  type event called WaitForPacket: it allows them to check or wait for Rx
  packets by polling or blocking on this event. (This functionality overlaps
  with the Receive member function.) The event is available to clients starting
  with EfiSimpleNetworkStopped (inclusive).

  The virtio-net driver is informed about such client polling or blocking by
  receiving an asynchronous callback (a software interrupt). In the callback
  function the driver must interrogate the driver instance state, and if it is
  EfiSimpleNetworkInitialized, access the Rx queue and see if any packets are
  available for consumption. If so, it must signal the WaitForPacket WAIT type
  event, waking the client.

  For simplicity and safety, all parts of the virtio-net driver that access any
  bit of the driver instance (data or device) run at the TPL_CALLBACK level.
  This is the highest level allowed for an SNP implementation, and all code
  protected in this manner satisfies even stricter non-blocking requirements
  than what's documented for TPL_CALLBACK.

  The task priority level for the WaitForPacket callback is set by the driver
  as well; the choice is TPL_CALLBACK again. This in effect serializes the
  WaitForPacket callback (VirtioNetIsPacketAvailable [Events.c]) with "normal"
  parts of the driver.

- According to the Driver Writer's Guide, a network driver should install a
  callback function for the global EXIT_BOOT_SERVICES event (a special NOTIFY
  type event). When the ExitBootServices() boot service has cleaned up internal
  firmware state and is about to pass control to the OS, any network driver has
  to stop any in-flight DMA transfers, lest it corrupt OS memory. For this
  reason EXIT_BOOT_SERVICES is emitted and the network driver must abort
  in-flight DMA transfers.

  This callback (VirtioNetExitBoot) is synchronized with the rest of the driver
  code just the same as explained for WaitForPacket. In
  EfiSimpleNetworkInitialized state it resets the virtio NIC, halting all data
  transfer. After the callback returns, no further driver code is expected to
  be scheduled.
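
The decision logic of the two callbacks can be modeled as follows. This is a
simplified Python sketch, not the driver's code: the DriverInstance fields are
hypothetical stand-ins for the real driver data and device registers, and the
TPL_CALLBACK serialization is not modeled at all (the sketch is
single-threaded).

```python
INITIALIZED = "EfiSimpleNetworkInitialized"


class DriverInstance:
    """Hypothetical, heavily reduced view of a driver instance."""

    def __init__(self):
        self.state = "EfiSimpleNetworkStopped"
        self.rx_used_idx_host = 0   # Rx Used Index as advanced by the host
        self.rx_used_idx_seen = 0   # Rx Used Index last processed by the guest
        self.wait_for_packet_signaled = False
        self.device_reset = False


def is_packet_available(dev):
    """Model of the WaitForPacket callback (VirtioNetIsPacketAvailable)."""
    if dev.state != INITIALIZED:
        return  # no Rx queue exists in the other states
    if dev.rx_used_idx_host != dev.rx_used_idx_seen:
        dev.wait_for_packet_signaled = True  # wake the client


def exit_boot(dev):
    """Model of the EXIT_BOOT_SERVICES callback (VirtioNetExitBoot)."""
    if dev.state == INITIALIZED:
        dev.device_reset = True  # resetting the NIC halts all data transfer
```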


Virtio internals -- Rx
----------------------

Requests (Rx and Tx alike) are always submitted by the guest and processed by
the host. For Tx, processing means transmission. For Rx, processing means
filling in the request with an incoming packet. Submitted requests exist on the
"Available Ring", and answered (processed) requests show up on the "Used Ring".

Packet data includes the media (Ethernet) header: destination MAC, source MAC,
and Ethertype (14 bytes total).
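
For reference, the 14-byte media header can be split off a frame as in the
following Python sketch. The helper name split_frame is hypothetical; only the
field layout (6 + 6 + 2 bytes, network byte order) comes from the Ethernet
frame format.

```python
import struct

# Destination MAC (6 bytes), source MAC (6 bytes), Ethertype (2 bytes).
ETH_HEADER = struct.Struct("!6s6sH")


def split_frame(frame):
    """Return (destination MAC, source MAC, Ethertype, payload)."""
    dst, src, ethertype = ETH_HEADER.unpack_from(frame)
    return dst, src, ethertype, frame[ETH_HEADER.size:]
```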

The following structures implement packet reception. Most of them are defined
in the Virtio specification; the only driver-specific trait here is the static
pre-configuration of the two-part descriptor chains, in VirtioNetInitRx. The
diagram is simplified.

             Available Index               Available Index
              last processed               incremented
                 by the host               by the guest
                           v   ------->    v
Available  +-------+-------+-------+-------+-------+
Ring       |DescIdx|DescIdx|DescIdx|DescIdx|DescIdx|
           +-------+-------+-------+-------+-------+
                              =D6     =D2

             D2         D3          D4         D5          D6         D7
Descr.  +----------+----------++----------+----------++----------+----------+
Table   |Adr:Len:Nx|Adr:Len:Nx||Adr:Len:Nx|Adr:Len:Nx||Adr:Len:Nx|Adr:Len:Nx|
        +----------+----------++----------+----------++----------+----------+
         =A2    =D3 =A3         =A4    =D5 =A5         =A6    =D7 =A7


              A2       A3     A4       A5     A6       A7
Receive      +---------------+---------------+---------------+
Destination  |vnet hdr:packet|vnet hdr:packet|vnet hdr:packet|
Area         +---------------+---------------+---------------+

                     Used Index           Used Index incremented
    last processed by the guest           by the host
                              v ------->  v
Used  +-----------+-----------+-----------+-----------+-----------+
Ring  |DescIdx:Len|DescIdx:Len|DescIdx:Len|DescIdx:Len|DescIdx:Len|
      +-----------+-----------+-----------+-----------+-----------+
                                   =D4

In VirtioNetInitRx, the guest allocates the fixed size Receive Destination
Area, which accommodates all packets delivered asynchronously by the host. A
slice of this area is dedicated to each packet; each slice is further
subdivided into virtio-net request header and network packet data. The
(device-physical) addresses of these sub-slices are denoted with A2, A3, A4 and
so on. Importantly, an even-subscript "A" always belongs to a virtio-net
request header, while an odd-subscript "A" always belongs to a packet
sub-slice.

Furthermore, the guest lays out a static pattern in the Descriptor Table. For
each packet that can be in-flight or already arrived from the host,
VirtioNetInitRx sets up a separate, two-part descriptor chain. For packet N,
the Nth descriptor chain is set up as follows:

- the first (=head) descriptor, with even index, points to the fixed-size
  sub-slice receiving the virtio-net request header,

- the second descriptor (with odd index) points to the fixed (1514 byte) size
  sub-slice receiving the packet data,

- a link from the first (head) descriptor in the chain is established to the
  second (tail) descriptor in the chain.

Finally, the guest populates the Available Ring with the indices of the head
descriptors. All descriptor indices on both the Available Ring and the Used
Ring are even.
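
The static layout described above can be sketched as follows. This is a Python
model, not the driver's code: the names Descriptor and init_rx are
hypothetical, the 10-byte request header size is an assumption (the legacy
virtio-net header without mergeable Rx buffers), and the slices are assumed to
be laid out back to back as in the simplified diagram.

```python
from dataclasses import dataclass
from typing import Optional

VNET_HDR_SIZE = 10   # assumption: legacy virtio-net request header size
PACKET_SIZE = 1514   # fixed packet sub-slice size, as stated in the text


@dataclass
class Descriptor:
    addr: int             # device-physical address of the sub-slice
    length: int           # size of the sub-slice in bytes
    next: Optional[int]   # index of the next descriptor in the chain, if any


def init_rx(base_addr, num_packets):
    """Build the descriptor table and the initial Available Ring contents."""
    table = []
    avail = []
    slice_size = VNET_HDR_SIZE + PACKET_SIZE
    for n in range(num_packets):
        hdr_addr = base_addr + n * slice_size
        head = 2 * n
        # Head descriptor (even index): virtio-net request header.
        table.append(Descriptor(hdr_addr, VNET_HDR_SIZE, head + 1))
        # Tail descriptor (odd index): packet data; it ends the chain.
        table.append(Descriptor(hdr_addr + VNET_HDR_SIZE, PACKET_SIZE, None))
        avail.append(head)
    return table, avail
```

With init_rx(0x1000, 3), the Available Ring starts out as [0, 2, 4], and each
head descriptor 2*N links to the tail descriptor 2*N+1.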
276 | \r | |
277 | Packet reception occurs as follows:\r | |
278 | \r | |
279 | - The host consumes a descriptor index off the Available Ring. This index is\r | |
280 | even (=2*N), and fingers the head descriptor of the chain belonging to packet\r | |
281 | N.\r | |
282 | \r | |
283 | - The host reads the descriptors D(2*N) and -- following the Next link there\r | |
284 | --- D(2*N+1), and stores the virtio-net request header at A(2*N), and the\r | |
285 | packet data at A(2*N+1).\r | |
286 | \r | |
287 | - The host places the index of the head descriptor, 2*N, onto the Used Ring,\r | |
288 | and sets the Len field in the same Used Ring Element to the total number of\r | |
289 | bytes transferred for the entire descriptor chain. This enables the guest to\r | |
290 | identify the length of Rx packets.\r | |
291 | \r | |
292 | - VirtioNetReceive polls the Used Ring. If a new Used Ring Element shows up, it\r | |
293 | copies the data out to the caller, and recycles the index of the head\r | |
294 | descriptor (ie. 2*N) to the Available Ring.\r | |
295 | \r | |
296 | - Because the host can process (answer) Rx requests in any order theoretically,\r | |
297 | the order of head descriptor indices on each of the Available Ring and the\r | |
298 | Used Ring is virtually random. (Except right after the initial population in\r | |
299 | VirtioNetInitRx, when the Available Ring is full and increasing, and the Used\r | |
300 | Ring is empty.)\r | |
301 | \r | |
302 | - If the Available Ring is empty, the host is forced to drop packets. If the\r | |
303 | Used Ring is empty, VirtioNetReceive returns EFI_NOT_READY (no packet\r | |
304 | available).\r | |
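
The reception steps above can be played through with a toy model. This is a
Python sketch, not the driver's code: the class RxModel and its methods are
hypothetical, the rings are modeled as plain queues rather than index-based
ring buffers, and the 10-byte request header size is an assumption.

```python
from collections import deque

VNET_HDR_SIZE = 10   # assumption: legacy virtio-net request header size


class RxModel:
    def __init__(self, num_packets):
        self.avail = deque(2 * n for n in range(num_packets))  # head indices
        self.used = deque()   # (head index, Len) Used Ring Elements
        self.slices = {}      # head index -> packet bytes stored by the host

    def host_deliver(self, packet):
        """Host side: fill one Rx request, or drop if none is available."""
        if not self.avail:
            return False                       # Available Ring empty: drop
        head = self.avail.popleft()
        self.slices[head] = packet
        # Len covers the whole chain: request header plus packet data.
        self.used.append((head, VNET_HDR_SIZE + len(packet)))
        return True

    def receive(self):
        """Guest side: model of VirtioNetReceive polling the Used Ring."""
        if not self.used:
            return "EFI_NOT_READY", None
        head, total_len = self.used.popleft()
        packet = self.slices.pop(head)[: total_len - VNET_HDR_SIZE]
        self.avail.append(head)                # recycle the head descriptor
        return "EFI_SUCCESS", packet
```

A model with two chains accepts two packets, drops the third, and accepts
packets again once receive() has recycled a head descriptor.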
305 | \r | |
306 | \r | |
307 | Virtio internals -- Tx\r | |
308 | ----------------------\r | |
309 | \r | |
310 | The transmission structure erected by VirtioNetInitTx is similar, it differs\r | |
311 | in the following:\r | |
312 | \r | |
313 | - There is no Receive Destination Area.\r | |
314 | \r | |
315 | - Each head descriptor, D(2*N), points to a read-only virtio-net request header\r | |
316 | that is shared by all of the head descriptors. This virtio-net request header\r | |
317 | is never modified by the host.\r | |
318 | \r | |
76ad23ca BS |
319 | - Each tail descriptor is re-pointed to the device-mapped address of the\r |
320 | caller-supplied packet buffer whenever VirtioNetTransmit places the\r | |
321 | corresponding head descriptor on the Available Ring. A reverse mapping, from\r | |
322 | the device-mapped address to the caller-supplied packet address, is saved in\r | |
323 | an associative data structure that belongs to the driver instance.\r | |
324 | \r | |
325 | - Per spec, the caller is responsible to hang on to the unmodified packet\r | |
326 | buffer until it is reported transmitted by VirtioNetGetStatus.\r | |

Steps of packet transmission:

- Client code calls VirtioNetTransmit. VirtioNetTransmit tracks free descriptor
  chains by keeping the indices of their head descriptors in a stack that is
  private to the driver instance. All elements of the stack are even.

- If the stack is empty (that is, each descriptor chain, in isolation, is
  either pending transmission, or has been processed by the host but not
  yet recycled by a VirtioNetGetStatus call), then VirtioNetTransmit returns
  EFI_NOT_READY.

- Otherwise the index of a free chain's head descriptor is popped from the
  stack. The linked tail descriptor is re-pointed as discussed above. The head
  descriptor's index is pushed on the Available Ring.

- The host moves the head descriptor index from the Available Ring to the Used
  Ring when it transmits the packet.

- Client code calls VirtioNetGetStatus. In case the Used Ring is empty, the
  function reports no Tx completion. Otherwise, a head descriptor's index is
  consumed from the Used Ring and recycled to the private stack. The client
  code's original packet buffer address is calculated by fetching the
  device-mapped address from the tail descriptor (where it has been stored at
  VirtioNetTransmit time), and by looking up the device-mapped address in the
  associative data structure. The reverse-mapped packet buffer address is
  returned to the caller.

- The Len field of the Used Ring Element is not checked. The host is assumed to
  have transmitted the entire packet -- VirtioNetTransmit limits it to at most
  1514 bytes. The Virtio specification suggests this packet size is always
  accepted (and a lower MTU could be encountered on any later hop as well).
  Additionally, there's no good way to report a short transmit via
  VirtioNetGetStatus; EFI_DEVICE_ERROR seems too serious according to the
  specification, and higher level protocols could interpret it as a fatal
  condition.

- The host can theoretically reorder head descriptor indices when moving them
  from the Available Ring to the Used Ring (out of order transmission). Because
  of this (and the choice of a stack over a list for free descriptor chain
  tracking) the order of head descriptor indices on either Ring is
  unpredictable.
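
The transmission steps can be played through with a toy model as well. This is
a Python sketch, not the driver's code: the class TxModel, the DEVICE_OFFSET
constant that fakes the device mapping, and the use of ValueError for an
oversized packet are all assumptions made for illustration; the real status
code for that case is not stated in this document.

```python
from collections import deque

MAX_PACKET = 1514          # upper packet size limit, as stated in the text
DEVICE_OFFSET = 0x8000000  # hypothetical: fakes the device address mapping


class TxModel:
    def __init__(self, num_chains):
        self.free_stack = [2 * n for n in range(num_chains)]  # free head indices
        self.tail_addr = {}     # head index -> address in its tail descriptor
        self.reverse_map = {}   # device-mapped address -> caller's address
        self.avail = deque()
        self.used = deque()

    def transmit(self, buf_addr, length):
        """Model of VirtioNetTransmit."""
        if length > MAX_PACKET:
            raise ValueError("packet too large")  # stand-in for an EFI status
        if not self.free_stack:
            return "EFI_NOT_READY"
        head = self.free_stack.pop()
        dev_addr = buf_addr + DEVICE_OFFSET       # "map" the caller's buffer
        self.reverse_map[dev_addr] = buf_addr
        self.tail_addr[head] = dev_addr           # re-point the tail descriptor
        self.avail.append(head)
        return "EFI_SUCCESS"

    def host_transmit_one(self):
        """Host side: move one head index from Available to Used."""
        self.used.append(self.avail.popleft())

    def get_status(self):
        """Model of VirtioNetGetStatus: return a completed buffer or None."""
        if not self.used:
            return None
        head = self.used.popleft()
        self.free_stack.append(head)
        dev_addr = self.tail_addr.pop(head)
        return self.reverse_map.pop(dev_addr)
```

Note that the free stack hands out the most recently recycled chain first,
which is one reason the order of head indices on the rings is unpredictable.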