# Technical notes for the virtio-net driver.
#
# Copyright (C) 2013, Red Hat, Inc.
#
# This program and the accompanying materials are licensed and made available
# under the terms and conditions of the BSD License which accompanies this
# distribution. The full text of the license may be found at
# http://opensource.org/licenses/bsd-license.php
#
# THE PROGRAM IS DISTRIBUTED UNDER THE BSD LICENSE ON AN "AS IS" BASIS, WITHOUT
# WARRANTIES OR REPRESENTATIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED.

All statements concerning standards and specifications are informative and not
normative. They are made in good faith. Corrections are most welcome on the
edk2-devel mailing list.

The following documents have been perused while writing the driver and this
document:

- Unified Extensible Firmware Interface Specification, Version 2.3.1, Errata C;
- Driver Writer's Guide for UEFI 2.3.1, 03/08/2012, Version 1.01;
- Virtio PCI Card Specification, v0.9.5 DRAFT, 2012 May 7.

The VirtioNetDxe UEFI_DRIVER implements the Simple Network Protocol for
virtio-net devices. Higher level protocols are automatically installed on top
of it by the DXE Core / the ConnectController() boot service. For virtio-net
devices this enables, for example, DHCP configuration, TCP transfers with edk2
StdLib applications, and PXE booting in OVMF.

A driver instance, belonging to a given virtio-net device, can be in one of
four states at any time. The states stack up as shown below. Each state
transition is labeled with the primary function (and its important callees,
faithfully indented) that implements the transition.

                             |   ^
[DriverBinding.c]            |   | [DriverBinding.c]
VirtioNetDriverBindingStart  |   | VirtioNetDriverBindingStop
  VirtioNetSnpPopulate       |   |   VirtioNetSnpEvacuate
    VirtioNetGetFeatures     |   |
                             v   |
                  +-------------------------+
                  | EfiSimpleNetworkStopped |
                  +-------------------------+
                             |   ^
[SnpStart.c]                 |   | [SnpStop.c]
VirtioNetStart               |   | VirtioNetStop
                             v   |
                  +-------------------------+
                  | EfiSimpleNetworkStarted |
                  +-------------------------+
                             |   ^
[SnpInitialize.c]            |   | [SnpShutdown.c]
VirtioNetInitialize          |   | VirtioNetShutdown
  VirtioNetInitRing {Rx, Tx} |   |   VirtioNetShutdownRx [SnpSharedHelpers.c]
    VirtioRingInit           |   |     VirtIo->UnmapSharedBuffer
    VirtioRingMap            |   |     VirtIo->FreeSharedPages
  VirtioNetInitTx            |   |   VirtioNetShutdownTx [SnpSharedHelpers.c]
    VirtIo->AllocateShare... |   |     VirtIo->UnmapSharedBuffer
    VirtioMapAllBytesInSh... |   |     VirtIo->FreeSharedPages
  VirtioNetInitRx            |   |   VirtioNetUninitRing [SnpSharedHelpers.c]
    VirtIo->AllocateShare... |   |                       {Tx, Rx}
    VirtioMapAllBytesInSh... |   |     VirtIo->UnmapSharedBuffer
                             v   |
                +-----------------------------+
                | EfiSimpleNetworkInitialized |
                +-----------------------------+

The state at the top means "nonexistent" and is hence unnamed on the diagram --
a driver instance actually doesn't exist at that point. The transition
functions out of and into that state implement the Driver Binding Protocol.

The lower three states characterize an existent driver instance and are all
states defined by the Simple Network Protocol. The transition functions between
them are member functions of the Simple Network Protocol.

Each transition function validates its expected source state and its
parameters. For example, VirtioNetDriverBindingStop will refuse to disconnect
from the controller unless the driver instance is in EfiSimpleNetworkStopped.
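
The source-state validation described above can be modeled as a small
transition table. This is only an illustrative sketch, not the driver's code:
the type and function names (DRIVER_STATE, TRANSITION, ApplyTransition) are
hypothetical, and parameter validation is left out.

```c
#include <stdbool.h>

/* Hypothetical model of the four driver instance states. */
typedef enum {
  StateNonexistent,   /* unnamed top state: no driver instance exists */
  StateStopped,       /* EfiSimpleNetworkStopped                      */
  StateStarted,       /* EfiSimpleNetworkStarted                      */
  StateInitialized    /* EfiSimpleNetworkInitialized                  */
} DRIVER_STATE;

/* One entry per transition function on the diagram. */
typedef enum {
  OpBindingStart, OpBindingStop,
  OpStart, OpStop,
  OpInitialize, OpShutdown
} TRANSITION;

/* Expected source state and resulting target state of each transition. */
static const struct { DRIVER_STATE From, To; } mTransitions[] = {
  [OpBindingStart] = { StateNonexistent, StateStopped     },
  [OpBindingStop]  = { StateStopped,     StateNonexistent },
  [OpStart]        = { StateStopped,     StateStarted     },
  [OpStop]         = { StateStarted,     StateStopped     },
  [OpInitialize]   = { StateStarted,     StateInitialized },
  [OpShutdown]     = { StateInitialized, StateStarted     }
};

/* Perform the transition only from the expected source state. */
static bool
ApplyTransition (DRIVER_STATE *State, TRANSITION Op)
{
  if (*State != mTransitions[Op].From) {
    return false;   /* wrong source state: refuse, leave state unchanged */
  }
  *State = mTransitions[Op].To;
  return true;
}
```

In this model, requesting OpBindingStop while the instance is in StateStarted
fails and leaves the state untouched, mirroring the refusal described for
VirtioNetDriverBindingStop.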

Driver instance states (Simple Network Protocol)
------------------------------------------------

In the EfiSimpleNetworkStopped state, the virtio-net device has been reset. No
resources are allocated for networking / traffic purposes. The MAC address and
other device attributes have been retrieved from the device (this is necessary
for completing the VirtioNetDriverBindingStart transition).

The EfiSimpleNetworkStarted state is identical to the EfiSimpleNetworkStopped
state for virtio-net, in the functional and resource-usage sense. This state is
mandated / provided by the Simple Network Protocol for flexibility that the
virtio-net driver doesn't exploit.

In particular, the EfiSimpleNetworkStarted state is the target of the Shutdown
SNP member function, and must therefore correspond to a hardware configuration
where "[it] is safe for another driver to initialize". (Clearly another UEFI
driver could not do that due to the exclusivity of the driver binding that
VirtioNetDriverBindingStart() installs, but a later OS driver might qualify.)

The EfiSimpleNetworkInitialized state is the live state of the virtio NIC / the
driver instance. Virtio and other resources required for network traffic have
been allocated, and the following SNP member functions are available (in
addition to VirtioNetShutdown which leaves the state):

- VirtioNetReceive [SnpReceive.c]: poll the virtio NIC for an Rx packet that
  may have arrived asynchronously;

- VirtioNetTransmit [SnpTransmit.c]: queue a Tx packet for asynchronous
  transmission (meant to be used together with VirtioNetGetStatus);

- VirtioNetGetStatus [SnpGetStatus.c]: query link status and status of pending
  Tx packets;

- VirtioNetMcastIpToMac [SnpMcastIpToMac.c]: transform a multicast IPv4/IPv6
  address into a multicast MAC address;

- VirtioNetReceiveFilters [SnpReceiveFilters.c]: emulate unicast / multicast /
  broadcast filter configuration (not their actual effect -- a more liberal
  filter setting than requested is allowed by the UEFI specification).
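
As an illustration of the multicast IP-to-MAC transformation named above, the
sketch below applies the standard mappings: RFC 1112 for IPv4 (01:00:5E plus
the low 23 bits of the address) and RFC 2464 for IPv6 (33:33 plus the low 32
bits). The helper names are hypothetical, and it is an assumption that
VirtioNetMcastIpToMac follows exactly these rules.

```c
#include <stdint.h>
#include <string.h>

/* IPv4 multicast (RFC 1112): 01:00:5E, then the low 23 bits of the address. */
static void
McastIp4ToMac (const uint8_t Ip[4], uint8_t Mac[6])
{
  Mac[0] = 0x01;
  Mac[1] = 0x00;
  Mac[2] = 0x5E;
  Mac[3] = Ip[1] & 0x7F;   /* the top bit of the second octet is dropped */
  Mac[4] = Ip[2];
  Mac[5] = Ip[3];
}

/* IPv6 multicast (RFC 2464): 33:33, then the low 32 bits of the address. */
static void
McastIp6ToMac (const uint8_t Ip[16], uint8_t Mac[6])
{
  Mac[0] = 0x33;
  Mac[1] = 0x33;
  memcpy (&Mac[2], &Ip[12], 4);
}
```

For example, 239.255.255.250 maps to 01:00:5E:7F:FF:FA, and ff02::1 maps to
33:33:00:00:00:01.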

The following SNP member functions are not supported [SnpUnsupported.c]:

- VirtioNetReset: reinitialize the virtio NIC without shutting it down (a loop
  from/to EfiSimpleNetworkInitialized);

- VirtioNetStationAddress: assign a new MAC address to the virtio NIC;

- VirtioNetStatistics: collect statistics;

- VirtioNetNvData: access non-volatile data on the virtio NIC.

Missing support for these functions is allowed by the UEFI specification and
doesn't seem to trip up higher level protocols.


Events and task priority levels
-------------------------------

The UEFI specification defines a sophisticated mechanism for asynchronous
events / callbacks (see "6.1 Event, Timer, and Task Priority Services" for
details). Such callbacks work like software interrupts, and some notion of
locking / masking is important to implement critical sections (atomic or
exclusive access to data or a device). This notion is defined as Task Priority
Level (TPL).

The virtio-net driver for OVMF must concern itself with events for two reasons:

- The Simple Network Protocol provides its clients with a (non-optional) WAIT
  type event called WaitForPacket: it allows them to check or wait for Rx
  packets by polling or blocking on this event. (This functionality overlaps
  with the Receive member function.) The event is available to clients starting
  with EfiSimpleNetworkStopped (inclusive).

  The virtio-net driver is informed about such client polling or blockage by
  receiving an asynchronous callback (a software interrupt). In the callback
  function the driver must interrogate the driver instance state, and if it is
  EfiSimpleNetworkInitialized, access the Rx queue and see if any packets are
  available for consumption. If so, it must signal the WaitForPacket WAIT type
  event, waking the client.

  For simplicity and safety, all parts of the virtio-net driver that access any
  bit of the driver instance (data or device) run at the TPL_CALLBACK level.
  This is the highest level allowed for an SNP implementation, and all code
  protected in this manner satisfies even stricter non-blocking requirements
  than what's documented for TPL_CALLBACK.

  The task priority level for the WaitForPacket callback is also set by the
  driver; the choice is TPL_CALLBACK again. This in effect serializes the
  WaitForPacket callback (VirtioNetIsPacketAvailable [Events.c]) with "normal"
  driver code.

- According to the Driver Writer's Guide, a network driver should install a
  callback function for the global EXIT_BOOT_SERVICES event (a special NOTIFY
  type event). When the ExitBootServices() boot service has cleaned up internal
  firmware state and is about to pass control to the OS, any network driver has
  to stop any in-flight DMA transfers, lest they corrupt OS memory. For this
  reason EXIT_BOOT_SERVICES is emitted and the network driver must abort
  in-flight DMA transfers.

  This callback (VirtioNetExitBoot) is synchronized with the rest of the driver
  code just the same as explained for WaitForPacket. In
  EfiSimpleNetworkInitialized state it resets the virtio NIC, halting all data
  transfer. After the callback returns, no further driver code is expected to
  run.
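
The decision logic of the two callbacks can be sketched as follows. This is a
simplified model under stated assumptions, not the driver's implementation:
the structure and function names are hypothetical, signaling the event and
resetting the device are reduced to boolean flags, and the TPL_CALLBACK
serialization is not modeled at all.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { SnpStopped, SnpStarted, SnpInitialized } SNP_STATE;

/* Hypothetical slice of the driver instance that the callbacks look at. */
typedef struct {
  SNP_STATE State;
  uint16_t  RxUsedIdx;            /* Used Index, incremented by the host      */
  uint16_t  RxLastUsed;           /* Used Index last processed by the guest   */
  bool      PacketEventSignaled;  /* stands in for signaling WaitForPacket    */
  bool      DeviceReset;          /* stands in for resetting the virtio NIC   */
} VNET_EVENT_MODEL;

/* WaitForPacket callback: signal only if live and an Rx packet is pending. */
static void
OnWaitForPacket (VNET_EVENT_MODEL *Dev)
{
  if (Dev->State != SnpInitialized) {
    return;                       /* no Rx queue exists in the other states */
  }
  if (Dev->RxUsedIdx != Dev->RxLastUsed) {
    Dev->PacketEventSignaled = true;
  }
}

/* EXIT_BOOT_SERVICES callback: halt all transfers if the NIC is live. */
static void
OnExitBoot (VNET_EVENT_MODEL *Dev)
{
  if (Dev->State == SnpInitialized) {
    Dev->DeviceReset = true;
  }
}
```

Both callbacks are no-ops outside EfiSimpleNetworkInitialized, which is why
the event can safely exist from EfiSimpleNetworkStopped onward.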


Virtio internals -- Rx
----------------------

Requests (Rx and Tx alike) are always submitted by the guest and processed by
the host. For Tx, processing means transmission. For Rx, processing means
filling in the request with an incoming packet. Submitted requests exist on the
"Available Ring", and answered (processed) requests show up on the "Used Ring".

Packet data includes the media (Ethernet) header: destination MAC, source MAC,
and Ethertype (14 bytes total).

The following structures implement packet reception. Most of them are defined
in the Virtio specification; the only driver-specific trait here is the static
pre-configuration of the two-part descriptor chains, in VirtioNetInitRx. The
diagram is simplified.

                    Available Index       Available Index
                    last processed          incremented
                      by the host           by the guest

Available  +-------+-------+-------+-------+-------+
Ring       |DescIdx|DescIdx|DescIdx|DescIdx|DescIdx|
           +-------+-------+-------+-------+-------+

Descr. +----------+----------++----------+----------++----------+----------+
Table  |Adr:Len:Nx|Adr:Len:Nx||Adr:Len:Nx|Adr:Len:Nx||Adr:Len:Nx|Adr:Len:Nx|
       +----------+----------++----------+----------++----------+----------+
        =A2    =D3 =A3         =A4    =D5 =A5         =A6    =D7 =A7

Receive     +---------------+---------------+---------------+
Destination |vnet hdr:packet|vnet hdr:packet|vnet hdr:packet|
Area        +---------------+---------------+---------------+

                Used Index                    Used Index incremented
        last processed by the guest                by the host

Used    +-----------+-----------+-----------+-----------+-----------+
Ring    |DescIdx:Len|DescIdx:Len|DescIdx:Len|DescIdx:Len|DescIdx:Len|
        +-----------+-----------+-----------+-----------+-----------+

In VirtioNetInitRx, the guest allocates the fixed size Receive Destination
Area, which accommodates all packets delivered asynchronously by the host. To
each packet, a slice of this area is dedicated; each slice is further
subdivided into virtio-net request header and network packet data. The
(device-physical) addresses of these sub-slices are denoted with A2, A3, A4 and
so on. Importantly, an even-subscript "A" always belongs to a virtio-net
request header, while an odd-subscript "A" always belongs to a packet
sub-slice.

Furthermore, the guest lays out a static pattern in the Descriptor Table. For
each packet that can be in-flight or already arrived from the host,
VirtioNetInitRx sets up a separate, two-part descriptor chain. For packet N,
the Nth descriptor chain is set up as follows:

- the first (=head) descriptor, with even index, points to the fixed-size
  sub-slice receiving the virtio-net request header,

- the second descriptor (with odd index) points to the fixed (1514 byte) size
  sub-slice receiving the packet data,

- a link from the first (head) descriptor in the chain is established to the
  second (tail) descriptor in the chain.

Finally, the guest populates the Available Ring with the indices of the head
descriptors. All descriptor indices on both the Available Ring and the Used
Ring are even.
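
The static Rx layout described above can be sketched in a few lines of C. This
is a model for illustration only: RX_CHAINS, RX_LAYOUT and RxLayoutInit are
hypothetical names, the 10-byte request header size is an assumption (the
virtio-net header without mergeable Rx buffers), chains are numbered from 0
for simplicity, and the descriptor flag values are those of the Virtio
specification (NEXT = 1, WRITE = 2).

```c
#include <stdint.h>

#define RX_CHAINS     4      /* hypothetical number of Rx descriptor chains  */
#define VNET_HDR_SIZE 10     /* assumed virtio-net request header size       */
#define RX_PKT_SIZE   1514   /* fixed packet sub-slice size                  */
#define DESC_F_NEXT   1      /* descriptor continues via the Next field      */
#define DESC_F_WRITE  2      /* buffer is written by the host                */

typedef struct {
  uint64_t Addr;             /* device address of the sub-slice */
  uint32_t Len;
  uint16_t Flags;
  uint16_t Next;
} RX_DESC;

typedef struct {
  RX_DESC  Desc[2 * RX_CHAINS];
  uint16_t Avail[2 * RX_CHAINS];  /* ring slots hold head descriptor indices */
  uint16_t AvailIdx;
} RX_LAYOUT;

/* Build one two-part chain per packet and publish all head indices. */
static void
RxLayoutInit (RX_LAYOUT *Rx, uint64_t AreaBase)
{
  for (uint16_t N = 0; N < RX_CHAINS; N++) {
    uint64_t Slice = AreaBase + (uint64_t)N * (VNET_HDR_SIZE + RX_PKT_SIZE);
    uint16_t Head  = (uint16_t)(2 * N);

    /* head descriptor (even index): virtio-net request header sub-slice */
    Rx->Desc[Head] = (RX_DESC){ Slice, VNET_HDR_SIZE,
                                DESC_F_WRITE | DESC_F_NEXT,
                                (uint16_t)(Head + 1) };

    /* tail descriptor (odd index): packet data sub-slice */
    Rx->Desc[Head + 1] = (RX_DESC){ Slice + VNET_HDR_SIZE, RX_PKT_SIZE,
                                    DESC_F_WRITE, 0 };

    /* only head (even) indices ever appear on the Available Ring */
    Rx->Avail[N] = Head;
  }
  Rx->AvailIdx = RX_CHAINS;
}
```

Because each chain occupies two adjacent descriptors, the head index alone is
enough to locate both sub-slices of a packet.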

Packet reception occurs as follows:

- The host consumes a descriptor index off the Available Ring. This index is
  even (=2*N), and identifies the head descriptor of the chain belonging to
  packet N.

- The host reads the descriptors D(2*N) and -- following the Next link there
  -- D(2*N+1), and stores the virtio-net request header at A(2*N), and the
  packet data at A(2*N+1).

- The host places the index of the head descriptor, 2*N, onto the Used Ring,
  and sets the Len field in the same Used Ring Element to the total number of
  bytes transferred for the entire descriptor chain. This enables the guest to
  identify the length of Rx packets.

- VirtioNetReceive polls the Used Ring. If a new Used Ring Element shows up, it
  copies the data out to the caller, and recycles the index of the head
  descriptor (i.e. 2*N) to the Available Ring.

- Because the host can, in theory, process (answer) Rx requests in any order,
  the order of head descriptor indices on each of the Available Ring and the
  Used Ring is virtually random. (Except right after the initial population in
  VirtioNetInitRx, when the Available Ring is full and increasing, and the Used
  Ring is empty.)

- If the Available Ring is empty, the host is forced to drop packets. If the
  Used Ring is empty, VirtioNetReceive returns EFI_NOT_READY (no packet
  available).
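
The reception steps can be modeled with a pair of small functions, one playing
the host and one playing the guest. This is a hedged sketch rather than the
driver's code: all names are hypothetical, the 10-byte header size is an
assumption, the data copy is omitted, and the Available Ring starts out empty
here instead of being pre-populated.

```c
#include <stdint.h>

#define RX_QSIZE     8    /* hypothetical queue size (a power of two)    */
#define RX_HDR_SIZE  10   /* assumed virtio-net request header size      */

typedef struct {
  uint32_t Id;            /* head descriptor index of the chain          */
  uint32_t Len;           /* bytes written to the whole chain            */
} USED_ELEM;

typedef struct {
  uint16_t  Avail[RX_QSIZE];
  uint16_t  AvailIdx;     /* incremented by the guest                    */
  USED_ELEM Used[RX_QSIZE];
  uint16_t  UsedIdx;      /* incremented by the host                     */
  uint16_t  LastUsed;     /* Used Index last processed by the guest      */
} RX_RINGS;

/* Host side: report a filled chain, Len covering header plus packet. */
static void
HostDeliver (RX_RINGS *R, uint16_t Head, uint32_t PacketLen)
{
  USED_ELEM *E = &R->Used[R->UsedIdx % RX_QSIZE];

  E->Id  = Head;
  E->Len = RX_HDR_SIZE + PacketLen;
  R->UsedIdx++;
}

/* Guest side: returns 0 if nothing is pending (the EFI_NOT_READY case),
   otherwise 1 with the chain's head index and the packet length. */
static int
GuestReceive (RX_RINGS *R, uint16_t *Head, uint32_t *PacketLen)
{
  if (R->LastUsed == R->UsedIdx) {
    return 0;
  }
  const USED_ELEM *E = &R->Used[R->LastUsed % RX_QSIZE];
  R->LastUsed++;

  *Head      = (uint16_t)E->Id;
  *PacketLen = E->Len - RX_HDR_SIZE;   /* strip the request header size */

  /* recycle the head descriptor index to the Available Ring */
  R->Avail[R->AvailIdx % RX_QSIZE] = *Head;
  R->AvailIdx++;
  return 1;
}
```

The packet length falls out of the Len field by subtracting the header size,
which is why the host must report the total for the entire chain.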


Virtio internals -- Tx
----------------------

The transmission structure erected by VirtioNetInitTx is similar; it differs
in the following:

- There is no Receive Destination Area.

- Each head descriptor, D(2*N), points to a read-only virtio-net request header
  that is shared by all of the head descriptors. This virtio-net request header
  is never modified by the host.

- Each tail descriptor is re-pointed to the device-mapped address of the
  caller-supplied packet buffer whenever VirtioNetTransmit places the
  corresponding head descriptor on the Available Ring. A reverse mapping, from
  the device-mapped address to the caller-supplied packet address, is saved in
  an associative data structure that belongs to the driver instance.

- Per spec, the caller is responsible for hanging on to the unmodified packet
  buffer until it is reported transmitted by VirtioNetGetStatus.

Steps of packet transmission:

- Client code calls VirtioNetTransmit. VirtioNetTransmit tracks free descriptor
  chains by keeping the indices of their head descriptors in a stack that is
  private to the driver instance. All elements of the stack are even.

- If the stack is empty (that is, each descriptor chain, in isolation, is
  either pending transmission, or has been processed by the host but not
  yet recycled by a VirtioNetGetStatus call), then VirtioNetTransmit returns
  EFI_NOT_READY.

- Otherwise the index of a free chain's head descriptor is popped from the
  stack. The linked tail descriptor is re-pointed as discussed above. The head
  descriptor's index is pushed on the Available Ring.

- The host moves the head descriptor index from the Available Ring to the Used
  Ring when it transmits the packet.

- Client code calls VirtioNetGetStatus. In case the Used Ring is empty, the
  function reports no Tx completion. Otherwise, a head descriptor's index is
  consumed from the Used Ring and recycled to the private stack. The client
  code's original packet buffer address is calculated by fetching the
  device-mapped address from the tail descriptor (where it has been stored at
  VirtioNetTransmit time), and by looking up the device-mapped address in the
  associative data structure. The reverse-mapped packet buffer address is
  returned to the caller.

- The Len field of the Used Ring Element is not checked. The host is assumed to
  have transmitted the entire packet -- VirtioNetTransmit has capped its size
  at 1514 bytes (inclusive). The Virtio specification suggests this packet size
  is always accepted (and a lower MTU could be encountered on any later hop as
  well). Additionally, there's no good way to report a short transmit via
  VirtioNetGetStatus; judging from the specification, EFI_DEVICE_ERROR seems
  too serious, and higher level protocols could interpret it as a fatal
  condition.

- The host can theoretically reorder head descriptor indices when moving them
  from the Available Ring to the Used Ring (out of order transmission). Because
  of this (and the choice of a stack over a list for free descriptor chain
  tracking) the order of head descriptor indices on either Ring is
  unpredictable.
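
The Tx bookkeeping can be illustrated with a small model of the free stack and
the reverse mapping. Everything here is an assumption made for the sketch: the
names are hypothetical, the device-mapped address is passed in by the caller
instead of being produced by a real mapping call, and a tiny linear table
stands in for the associative data structure.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TX_CHAINS 2                  /* hypothetical number of Tx chains */
#define TX_QSIZE  (2 * TX_CHAINS)

typedef struct {
  uint64_t DeviceAddr;               /* key: device-mapped address       */
  void     *Buffer;                  /* value: caller's packet buffer    */
} TX_MAPPING;

typedef struct {
  uint16_t   FreeStack[TX_CHAINS];   /* even head descriptor indices     */
  uint16_t   FreeCount;
  uint64_t   DescAddr[TX_QSIZE];     /* Addr field of each descriptor    */
  TX_MAPPING Map[TX_CHAINS];         /* stand-in associative structure   */
  uint16_t   Avail[TX_QSIZE], AvailIdx;
  uint16_t   Used[TX_QSIZE], UsedIdx, LastUsed;
} TX_MODEL;

static void
TxInit (TX_MODEL *Tx)
{
  memset (Tx, 0, sizeof *Tx);
  for (uint16_t N = 0; N < TX_CHAINS; N++) {
    Tx->FreeStack[Tx->FreeCount++] = (uint16_t)(2 * N);
  }
}

/* Returns false when no chain is free (the "not ready" case). */
static bool
TxTransmit (TX_MODEL *Tx, void *Buffer, uint64_t DeviceAddr)
{
  if (Tx->FreeCount == 0) {
    return false;
  }
  uint16_t Head = Tx->FreeStack[--Tx->FreeCount];
  Tx->DescAddr[Head + 1] = DeviceAddr;          /* re-point the tail */
  for (size_t I = 0; I < TX_CHAINS; I++) {      /* save reverse mapping */
    if (Tx->Map[I].Buffer == NULL) {
      Tx->Map[I] = (TX_MAPPING){ DeviceAddr, Buffer };
      break;
    }
  }
  Tx->Avail[Tx->AvailIdx++ % TX_QSIZE] = Head;
  return true;
}

/* Host side: move a head index to the Used Ring, in any order. */
static void
TxHostComplete (TX_MODEL *Tx, uint16_t Head)
{
  Tx->Used[Tx->UsedIdx++ % TX_QSIZE] = Head;
}

/* Returns the caller's buffer of one completed packet, or NULL if none. */
static void *
TxGetStatus (TX_MODEL *Tx)
{
  if (Tx->LastUsed == Tx->UsedIdx) {
    return NULL;
  }
  uint16_t Head       = Tx->Used[Tx->LastUsed++ % TX_QSIZE];
  uint64_t DeviceAddr = Tx->DescAddr[Head + 1];
  void     *Buffer    = NULL;

  for (size_t I = 0; I < TX_CHAINS; I++) {      /* reverse lookup */
    if (Tx->Map[I].Buffer != NULL && Tx->Map[I].DeviceAddr == DeviceAddr) {
      Buffer = Tx->Map[I].Buffer;
      Tx->Map[I].Buffer = NULL;
      break;
    }
  }
  Tx->FreeStack[Tx->FreeCount++] = Head;        /* recycle the chain */
  return Buffer;
}
```

With two chains, a third TxTransmit call fails until TxGetStatus has recycled
a completed chain; completions may be reported in a different order than the
packets were queued.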