Commit | Line | Data |
---|---|---|
32c0f0be MCC |
1 | .. SPDX-License-Identifier: GPL-2.0 |
2 | .. include:: <isonum.txt> | |
3 | ||
4 | =============================================== | |
4ceec22d SF |
5 | Ethernet switch device driver model (switchdev) |
6 | =============================================== | |
32c0f0be MCC |
7 | |
8 | Copyright |copy| 2014 Jiri Pirko <jiri@resnulli.us> | |
9 | ||
10 | Copyright |copy| 2014-2015 Scott Feldman <sfeldma@gmail.com> | |
4ceec22d SF |
11 | |
12 | ||
13 | The Ethernet switch device driver model (switchdev) is an in-kernel driver | |
14 | model for switch devices which offload the forwarding (data) plane from the | |
15 | kernel. | |
16 | ||
17 | Figure 1 is a block diagram showing the components of the switchdev model for | |
18 | an example setup using a data-center-class switch ASIC chip. Other setups | |
19 | with SR-IOV or soft switches, such as OVS, are possible. | |
20 | ||
32c0f0be | 21 | :: |
4ceec22d | 22 | |
32c0f0be MCC |
23 | |
24 | User-space tools | |
51513748 RD |
25 | |
26 | user space | | |
27 | +-------------------------------------------------------------------+ | |
28 | kernel | Netlink | |
32c0f0be MCC |
29 | | |
30 | +--------------+-------------------------------+ | |
31 | | Network stack | | |
32 | | (Linux) | | |
33 | | | | |
34 | +----------------------------------------------+ | |
35 | ||
36 | sw1p2 sw1p4 sw1p6 | |
37 | sw1p1 + sw1p3 + sw1p5 + eth1 | |
38 | + | + | + | + | |
39 | | | | | | | | | |
40 | +--+----+----+----+----+----+---+ +-----+-----+ | |
41 | | Switch driver | | mgmt | | |
42 | | (this document) | | driver | | |
43 | | | | | | |
44 | +--------------+----------------+ +-----------+ | |
45 | | | |
51513748 RD |
46 | kernel | HW bus (eg PCI) |
47 | +-------------------------------------------------------------------+ | |
48 | hardware | | |
32c0f0be MCC |
49 | +--------------+----------------+ |
50 | | Switch device (sw1) | | |
51 | | +----+ +--------+ | |
52 | | | v offloaded data path | mgmt port | |
53 | | | | | | |
54 | +--|----|----+----+----+----+---+ | |
55 | | | | | | | | |
56 | + + + + + + | |
57 | p1 p2 p3 p4 p5 p6 | |
51513748 | 58 | |
32c0f0be | 59 | front-panel ports |
d5066c46 | 60 | |
4ceec22d | 61 | |
32c0f0be | 62 | Fig 1. |
4ceec22d SF |
63 | |
64 | ||
65 | Include Files | |
66 | ------------- | |
67 | ||
32c0f0be MCC |
68 | :: |
69 | ||
70 | #include <linux/netdevice.h> | |
71 | #include <net/switchdev.h> | |
4ceec22d SF |
72 | |
73 | ||
74 | Configuration | |
75 | ------------- | |
76 | ||
77 | Use "depends on NET_SWITCHDEV" in the driver's Kconfig to ensure switchdev model | |
78 | support is built for the driver. | |
79 | ||
80 | ||
81 | Switch Ports | |
82 | ------------ | |
83 | ||
84 | On switchdev driver initialization, the driver will allocate and register a | |
85 | struct net_device (using register_netdev()) for each enumerated physical switch | |
86 | port, called the port netdev. A port netdev is the software representation of | |
87 | the physical port and provides a conduit for control traffic to/from the | |
88 | controller (the kernel) and the network, as well as an anchor point for higher | |
89 | level constructs such as bridges, bonds, VLANs, tunnels, and L3 routers. Using | |
90 | standard netdev tools (iproute2, ethtool, etc), the port netdev can also | |
91 | provide to the user access to the physical properties of the switch port such | |
92 | as PHY link state and I/O statistics. | |
93 | ||
94 | There is (currently) no higher-level kernel object for the switch beyond the | |
95 | port netdevs. All of the switchdev driver ops are netdev ops or switchdev ops. | |
96 | ||
97 | A switch management port is outside the scope of the switchdev driver model. | |
98 | Typically, the management port does not participate in the offloaded data plane | |
99 | and is handled by a different driver, such as a NIC driver, bound to the | |
100 | management port device. | |
101 | ||
75f3a101 IS |
102 | Switch ID |
103 | ^^^^^^^^^ | |
104 | ||
80d79ad2 FF |
105 | The switchdev driver must implement the net_device operation |
106 | ndo_get_port_parent_id for each port netdev, returning the same physical ID for | |
107 | each port of a switch. The ID must be unique between switches on the same | |
108 | system. The ID does not need to be unique between switches on different | |
109 | systems. | |
75f3a101 IS |
110 | |
111 | The switch ID is used to locate ports on a switch and to know if aggregated | |
112 | ports belong to the same switch. | |
113 | ||
4ceec22d SF |
114 | Port Netdev Naming |
115 | ^^^^^^^^^^^^^^^^^^ | |
116 | ||
117 | Udev rules should be used for port netdev naming, using some unique attribute | |
118 | of the port as a key, for example the port MAC address or the port PHYS name. | |
119 | Hard-coding of kernel netdev names within the driver is discouraged; let the | |
120 | kernel pick the default netdev name, and let udev set the final name based on a | |
121 | port attribute. | |
122 | ||
123 | Using port PHYS name (ndo_get_phys_port_name) for the key is particularly | |
1f5dc44c | 124 | useful for dynamically-named ports where the device names its ports based on |
4ceec22d SF |
125 | external configuration. For example, if a physical 40G port is split logically |
126 | into 4 10G ports, resulting in 4 port netdevs, the device can give a unique | |
32c0f0be | 127 | name for each port using port PHYS name. The udev rule would be:: |
4ceec22d | 128 | |
32c0f0be MCC |
129 | SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="<phys_switch_id>", \ |
130 | ATTR{phys_port_name}!="", NAME="swX$attr{phys_port_name}" | |
4ceec22d SF |
131 | |
132 | Suggested naming convention is "swXpYsZ", where X is the switch name or ID, Y | |
133 | is the port name or ID, and Z is the sub-port name or ID. For example, sw1p1s0 | |
134 | would be sub-port 0 on port 1 on switch 1. | |
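
As a rough illustration of the suggested convention, the userspace C sketch below builds a name from numeric IDs. The helper name ``format_port_name`` and the rule that a negative sub-port omits the "sZ" suffix are assumptions made for this example; they are not part of any kernel API. In practice the final name is set by udev, as described above.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: write "swXpY" or "swXpYsZ" into buf.
 * A negative subport means the port is not split (assumption).
 * Returns 0 on success, -1 if the name does not fit in len bytes. */
int format_port_name(char *buf, size_t len, int sw, int port, int subport)
{
    int n;

    assert(buf != NULL && len > 0);
    if (subport < 0)
        n = snprintf(buf, len, "sw%dp%d", sw, port);
    else
        n = snprintf(buf, len, "sw%dp%ds%d", sw, port, subport);

    /* snprintf() returns the length it wanted to write; reject truncation. */
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A caller would typically pass a 16-byte buffer, since Linux interface names are limited to IFNAMSIZ (16) bytes including the terminating NUL.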
135 | ||
4ceec22d SF |
136 | Port Features |
137 | ^^^^^^^^^^^^^ | |
138 | ||
139 | NETIF_F_NETNS_LOCAL | |
140 | ||
141 | If the switchdev driver (and device) only supports offloading of the default | |
142 | network namespace (netns), the driver should set this feature flag to prevent | |
143 | the port netdev from being moved out of the default netns. A netns-aware | |
1f5dc44c | 144 | driver/device would not set this flag and be responsible for partitioning |
4ceec22d SF |
145 | hardware to preserve netns containment. This means hardware cannot forward |
146 | traffic from a port in one namespace to another port in another namespace. | |
147 | ||
148 | Port Topology | |
149 | ^^^^^^^^^^^^^ | |
150 | ||
151 | The port netdevs representing the physical switch ports can be organized into | |
152 | higher-level switching constructs. The default construct is a standalone | |
153 | router port, used to offload L3 forwarding. Two or more ports can be bonded | |
154 | together to form a LAG. Two or more ports (or LAGs) can be bridged to bridge | |
d290f1fc | 155 | L2 networks. VLANs can be applied to sub-divide L2 networks. L2-over-L3 |
4ceec22d SF |
156 | tunnels can be built on ports. These constructs are built using standard Linux |
157 | tools such as the bridge driver, the bonding/team drivers, and netlink-based | |
158 | tools such as iproute2. | |
159 | ||
160 | The switchdev driver can know a particular port's position in the topology by | |
161 | monitoring NETDEV_CHANGEUPPER notifications. For example, a port moved into a | |
162 | bond will see its upper master change. If that bond is moved into a bridge, | |
163 | the bond's upper master will change. And so on. The driver will track such | |
164 | movements to know what position a port is in in the overall topology by | |
165 | registering for netdevice events and acting on NETDEV_CHANGEUPPER. | |
166 | ||
167 | L2 Forwarding Offload | |
168 | --------------------- | |
169 | ||
170 | The idea is to offload the L2 data forwarding (switching) path from the kernel | |
171 | to the switchdev device by mirroring bridge FDB entries down to the device. An | |
172 | FDB entry is the {port, MAC, VLAN} tuple forwarding destination. | |
173 | ||
174 | To offload L2 bridging, the switchdev driver/device should support: | |
175 | ||
176 | - Static FDB entries installed on a bridge port | |
177 | - Notification of learned/forgotten src mac/vlans from device | |
178 | - STP state changes on the port | |
179 | - VLAN flooding of multicast/broadcast and unknown unicast packets | |
180 | ||
181 | Static FDB Entries | |
182 | ^^^^^^^^^^^^^^^^^^ | |
183 | ||
184 | The switchdev driver should implement ndo_fdb_add, ndo_fdb_del and ndo_fdb_dump | |
185 | to support static FDB entries installed to the device. Static bridge FDB | |
32c0f0be | 186 | entries are installed, for example, using iproute2 bridge cmd:: |
4ceec22d SF |
187 | |
188 | bridge fdb add ADDR dev DEV [vlan VID] [self] | |
189 | ||
4b5364fb | 190 | The driver should use the helper switchdev_port_fdb_xxx ops for ndo_fdb_xxx |
57d80838 | 191 | ops, and handle add/delete/dump of SWITCHDEV_OBJ_ID_PORT_FDB object using |
4b5364fb SF |
192 | switchdev_port_obj_xxx ops. |
193 | ||
1f5dc44c SF |
194 | XXX: what should be done if offloading this rule to hardware fails (for |
195 | example, due to full capacity in hardware tables) ? | |
196 | ||
4ceec22d | 197 | Note: by default, the bridge does not filter on VLAN and only bridges untagged |
32c0f0be | 198 | traffic. To enable VLAN support, turn on VLAN filtering:: |
4ceec22d SF |
199 | |
200 | echo 1 >/sys/class/net/<bridge>/bridge/vlan_filtering | |
201 | ||
202 | Notification of Learned/Forgotten Source MAC/VLANs | |
203 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
204 | ||
205 | The switch device will learn/forget source MAC address/VLAN on ingress packets | |
206 | and notify the switch driver of the mac/vlan/port tuples. The switch driver, | |
32c0f0be | 207 | in turn, will notify the bridge driver using the switchdev notifier call:: |
4ceec22d | 208 | |
6685987c | 209 | err = call_switchdev_notifiers(val, dev, info, extack); |
4ceec22d | 210 | |
f5ed2feb SF |
211 | Where val is SWITCHDEV_FDB_ADD when learning and SWITCHDEV_FDB_DEL when |
212 | forgetting, and info points to a struct switchdev_notifier_fdb_info. On | |
213 | SWITCHDEV_FDB_ADD, the bridge driver will install the FDB entry into the | |
214 | bridge's FDB and mark the entry as NTF_EXT_LEARNED. The iproute2 bridge | |
32c0f0be | 215 | command will label these entries "offload":: |
4ceec22d SF |
216 | |
217 | $ bridge fdb | |
218 | 52:54:00:12:35:01 dev sw1p1 master br0 permanent | |
219 | 00:02:00:00:02:00 dev sw1p1 master br0 offload | |
220 | 00:02:00:00:02:00 dev sw1p1 self | |
221 | 52:54:00:12:35:02 dev sw1p2 master br0 permanent | |
222 | 00:02:00:00:03:00 dev sw1p2 master br0 offload | |
223 | 00:02:00:00:03:00 dev sw1p2 self | |
224 | 33:33:00:00:00:01 dev eth0 self permanent | |
225 | 01:00:5e:00:00:01 dev eth0 self permanent | |
226 | 33:33:ff:00:00:00 dev eth0 self permanent | |
227 | 01:80:c2:00:00:0e dev eth0 self permanent | |
228 | 33:33:00:00:00:01 dev br0 self permanent | |
229 | 01:00:5e:00:00:01 dev br0 self permanent | |
230 | 33:33:ff:12:35:01 dev br0 self permanent | |
231 | ||
32c0f0be | 232 | Learning on the port should be disabled on the bridge using the bridge command:: |
4ceec22d SF |
233 | |
234 | bridge link set dev DEV learning off | |
235 | ||
32c0f0be | 236 | Learning on the device port should be enabled, as well as learning_sync:: |
4ceec22d SF |
237 | |
238 | bridge link set dev DEV learning on self | |
239 | bridge link set dev DEV learning_sync on self | |
240 | ||
5a784498 | 241 | Learning_sync attribute enables syncing of the learned/forgotten FDB entry to |
4ceec22d SF |
242 | the bridge's FDB. It's possible, but not optimal, to enable learning on the |
243 | device port and on the bridge port, and disable learning_sync. | |
244 | ||
cc0c207a | 245 | To support learning, the driver implements switchdev op |
010c8f01 | 246 | switchdev_port_attr_set for SWITCHDEV_ATTR_PORT_ID_{PRE}_BRIDGE_FLAGS. |
4ceec22d SF |
247 | |
248 | FDB Ageing | |
249 | ^^^^^^^^^^ | |
250 | ||
45ffda75 SF |
251 | The bridge will skip ageing FDB entries marked with NTF_EXT_LEARNED and it is |
252 | the responsibility of the port driver/device to age out these entries. If the | |
253 | port device supports ageing, when the FDB entry expires, it will notify the | |
254 | driver which in turn will notify the bridge with SWITCHDEV_FDB_DEL. If the | |
255 | device does not support ageing, the driver can simulate ageing using a | |
5a784498 | 256 | garbage collection timer to monitor FDB entries. Expired entries will be |
45ffda75 SF |
257 | notified to the bridge using SWITCHDEV_FDB_DEL. See the rocker driver for |
258 | an example of a driver running an ageing timer. | |
259 | ||
260 | To keep an NTF_EXT_LEARNED entry "alive", the driver should refresh the FDB | |
261 | entry by calling call_switchdev_notifiers(SWITCHDEV_FDB_ADD, ...). The | |
4ceec22d SF |
262 | notification will reset the FDB entry's last-used time to now. The driver |
263 | should rate limit refresh notifications, for example, no more than once a | |
45ffda75 | 264 | second. (The last-used time is visible using the bridge -s fdb option). |
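
The ageing and refresh behaviour described above can be modelled in plain userspace C. This is only a sketch under stated assumptions: ``struct fdb_entry``, ``fdb_touch()`` and ``fdb_gc()`` are hypothetical names, times are whole seconds from a monotonic clock, and the places where a real driver would call call_switchdev_notifiers() are marked in comments rather than implemented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical software copy of a learned FDB entry (times in seconds). */
struct fdb_entry {
    unsigned long last_used;     /* last time traffic hit the entry */
    unsigned long last_notified; /* last refresh sent to the bridge */
    bool in_use;
};

/* Record a hit on the entry. Returns true when a refresh should be sent
 * (SWITCHDEV_FDB_ADD in a real driver); refreshes are rate limited to at
 * most one per second per entry. */
bool fdb_touch(struct fdb_entry *e, unsigned long now)
{
    assert(e != NULL && e->in_use);
    e->last_used = now;
    if (now - e->last_notified < 1)
        return false;
    e->last_notified = now;
    return true;
}

/* Garbage-collection pass: retire entries idle for longer than ageing_time.
 * A real driver would send SWITCHDEV_FDB_DEL for each retired entry.
 * Returns the number of entries retired. */
size_t fdb_gc(struct fdb_entry *tbl, size_t n, unsigned long now,
              unsigned long ageing_time)
{
    size_t expired = 0;

    for (size_t i = 0; i < n; i++) {
        if (tbl[i].in_use && now - tbl[i].last_used > ageing_time) {
            tbl[i].in_use = false;
            expired++;
        }
    }
    return expired;
}
```

In a driver, the pass corresponding to ``fdb_gc()`` would run from a periodic timer or delayed work item, with the ageing time taken from the bridge configuration rather than hard-coded.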
4ceec22d SF |
265 | |
266 | STP State Change on Port | |
267 | ^^^^^^^^^^^^^^^^^^^^^^^^ | |
268 | ||
269 | Internally or with a third-party STP protocol implementation (e.g. mstpd), the | |
270 | bridge driver maintains the STP state for ports, and will notify the switch | |
f5ed2feb | 271 | driver of STP state change on a port using the switchdev op |
1f868398 | 272 | switchdev_port_attr_set for SWITCHDEV_ATTR_PORT_ID_STP_UPDATE. |
4ceec22d SF |
273 | |
274 | State is one of BR_STATE_*. The switch driver can use STP state updates to | |
275 | update ingress packet filter list for the port. For example, if port is | |
276 | DISABLED, no packets should pass, but if port moves to BLOCKED, then STP BPDUs | |
277 | and other IEEE 01:80:c2:xx:xx:xx link-local multicast packets can pass. | |
278 | ||
279 | Note that STP BPDUs are untagged and STP state applies to all VLANs on the port | |
280 | so packet filters should be applied consistently across untagged and tagged | |
281 | VLANs on the port. | |
282 | ||
283 | Flooding L2 domain | |
284 | ^^^^^^^^^^^^^^^^^^ | |
285 | ||
286 | For a given L2 VLAN domain, the switch device should flood multicast/broadcast | |
287 | and unknown unicast packets to all ports in the domain, if allowed by the port's | |
288 | current STP state. The switch driver, knowing which ports are within which | |
371e59ad IS |
289 | vlan L2 domain, can program the switch device for flooding. The packet may |
290 | be sent to the port netdev for processing by the bridge driver. The | |
a48037e7 SF |
291 | bridge should not reflood the packet to the same ports the device flooded, |
292 | otherwise there will be duplicate packets on the wire. | |
293 | ||
6bc506b4 IS |
294 | To avoid duplicate packets, the switch driver should mark a packet as already |
295 | forwarded by setting the skb->offload_fwd_mark bit. The bridge driver will mark | |
296 | the skb using the ingress bridge port's mark and prevent it from being forwarded | |
297 | through any bridge port with the same mark. | |
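
The effect of the mark on software flooding can be sketched in userspace C as follows. This is a simplified model, not the bridge driver's code: ``struct br_port``, ``sw_should_forward()`` and the convention that a mark of 0 means "no offload" are assumptions made for the example, and the ``hw_forwarded`` argument stands in for the skb->offload_fwd_mark bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bridge port: mark identifies the switch that owns the
 * port; 0 means the port has no offload (assumption). */
struct br_port {
    int mark;
};

/* Decide whether software should forward a flooded packet from the
 * ingress port to the egress port. hw_forwarded models the
 * skb->offload_fwd_mark bit set by the switch driver. */
bool sw_should_forward(const struct br_port *ports, size_t ingress,
                       size_t egress, bool hw_forwarded)
{
    assert(ports != NULL);
    if (egress == ingress)
        return false;           /* never send back out the ingress port */
    if (hw_forwarded && ports[ingress].mark != 0 &&
        ports[egress].mark == ports[ingress].mark)
        return false;           /* the device already flooded to this port */
    return true;
}
```

With two ports on the same switch and one software-only port in a bridge, a packet the device has already flooded is forwarded in software only to the software-only port, which avoids the duplicates described above.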
4ceec22d SF |
298 | |
299 | It is possible for the switch device to not handle flooding and push the | |
300 | packets up to the bridge driver for flooding. This is not ideal as the number | |
301 | of ports in the L2 domain scales, because the device is much more efficient at | |
302 | flooding packets than software. | |
303 | ||
741af005 IS |
304 | If supported by the device, flood control can be offloaded to it, preventing |
305 | certain netdevs from flooding unicast traffic for which there is no FDB entry. | |
306 | ||
4ceec22d SF |
307 | IGMP Snooping |
308 | ^^^^^^^^^^^^^ | |
309 | ||
4f5590f8 ER |
310 | In order to support IGMP snooping, the port netdevs should trap to the bridge |
311 | driver all IGMP join and leave messages. | |
312 | The bridge multicast module will notify port netdevs of every multicast group | |
313 | change, whether the group is statically configured or dynamically joined/left. | |
314 | The hardware implementation should forward traffic for all registered multicast | |
315 | groups only to the configured ports. | |
4ceec22d | 316 | |
7616dcbb SF |
317 | L3 Routing Offload |
318 | ------------------ | |
4ceec22d SF |
319 | |
320 | Offloading L3 routing requires that device be programmed with FIB entries from | |
321 | the kernel, with the device doing the FIB lookup and forwarding. The device | |
322 | does a longest prefix match (LPM) on FIB entries matching route prefix and | |
7616dcbb SF |
323 | forwards the packet to the matching FIB entry's nexthop(s) egress ports. |
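
For readers unfamiliar with LPM, the userspace C sketch below shows the lookup rule on a small table. It is a model for illustration only: ``struct fib_route``, ``prefix_mask()`` and ``fib_lookup()`` are hypothetical names, addresses are in host byte order, and the linear scan is chosen for clarity. How a given device implements the lookup in hardware is not covered here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical FIB entry: IPv4 prefix (host byte order) and a nexthop id. */
struct fib_route {
    uint32_t dst;
    int dst_len;    /* prefix length, 0..32 */
    int nexthop;
};

/* Network mask for a prefix length; length 0 is special-cased because
 * shifting a 32-bit value by 32 is undefined in C. */
uint32_t prefix_mask(int len)
{
    assert(len >= 0 && len <= 32);
    return len == 0 ? 0 : ~UINT32_C(0) << (32 - len);
}

/* Linear-scan LPM: returns the matching entry with the longest prefix,
 * or NULL if no entry matches. */
const struct fib_route *fib_lookup(const struct fib_route *tbl, size_t n,
                                   uint32_t addr)
{
    const struct fib_route *best = NULL;

    for (size_t i = 0; i < n; i++) {
        uint32_t mask = prefix_mask(tbl[i].dst_len);

        if ((addr & mask) != (tbl[i].dst & mask))
            continue;
        if (!best || tbl[i].dst_len > best->dst_len)
            best = &tbl[i];
    }
    return best;
}
```

For example, with a default route and the prefixes 11.0.0.0/30 and 11.0.0.4/30 in the table, a lookup of 11.0.0.5 selects the 11.0.0.4/30 entry, while an address outside both /30 prefixes falls through to the default route.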
324 | ||
fd41b0ea JP |
325 | To program the device, the driver has to register a FIB notifier handler |
326 | using register_fib_notifier. The following events are available: | |
7616dcbb | 327 | |
32c0f0be MCC |
328 | =================== =================================================== |
329 | FIB_EVENT_ENTRY_ADD used for both adding a new FIB entry to the device, | |
330 | or modifying an existing entry on the device. | |
331 | FIB_EVENT_ENTRY_DEL used for removing a FIB entry | |
332 | FIB_EVENT_RULE_ADD, | |
333 | FIB_EVENT_RULE_DEL used to propagate FIB rule changes | |
334 | =================== =================================================== | |
335 | ||
336 | FIB_EVENT_ENTRY_ADD and FIB_EVENT_ENTRY_DEL events pass:: | |
7616dcbb | 337 | |
fd41b0ea JP |
338 | struct fib_entry_notifier_info { |
339 | struct fib_notifier_info info; /* must be first */ | |
7616dcbb SF |
340 | u32 dst; |
341 | int dst_len; | |
342 | struct fib_info *fi; | |
343 | u8 tos; | |
344 | u8 type; | |
7616dcbb | 345 | u32 tb_id; |
fd41b0ea JP |
346 | u32 nlflags; |
347 | }; | |
7616dcbb | 348 | |
32c0f0be MCC |
349 | to add/modify/delete IPv4 dst/dst_len prefix on table tb_id. The ``*fi`` |
350 | structure holds details on the route and route's nexthops. ``*dev`` is one | |
351 | of the port netdevs mentioned in the route's next hop list. | |
4ceec22d SF |
352 | |
353 | Routes offloaded to the device are labeled with "offload" in the ip route | |
32c0f0be | 354 | listing:: |
4ceec22d SF |
355 | |
356 | $ ip route show | |
357 | default via 192.168.0.2 dev eth0 | |
358 | 11.0.0.0/30 dev sw1p1 proto kernel scope link src 11.0.0.2 offload | |
359 | 11.0.0.4/30 via 11.0.0.1 dev sw1p1 proto zebra metric 20 offload | |
360 | 11.0.0.8/30 dev sw1p2 proto kernel scope link src 11.0.0.10 offload | |
361 | 11.0.0.12/30 via 11.0.0.9 dev sw1p2 proto zebra metric 20 offload | |
362 | 12.0.0.2 proto zebra metric 30 offload | |
363 | nexthop via 11.0.0.1 dev sw1p1 weight 1 | |
364 | nexthop via 11.0.0.9 dev sw1p2 weight 1 | |
365 | 12.0.0.3 via 11.0.0.1 dev sw1p1 proto zebra metric 20 offload | |
366 | 12.0.0.4 via 11.0.0.9 dev sw1p2 proto zebra metric 20 offload | |
367 | 192.168.0.0/24 dev eth0 proto kernel scope link src 192.168.0.15 | |
368 | ||
fd41b0ea JP |
369 | The "offload" flag is set in case at least one device offloads the FIB entry. |
370 | ||
7616dcbb | 371 | XXX: add/mod/del IPv6 FIB API |
4ceec22d SF |
372 | |
373 | Nexthop Resolution | |
374 | ^^^^^^^^^^^^^^^^^^ | |
375 | ||
376 | The FIB entry's nexthop list contains the nexthop tuple (gateway, dev), but for | |
377 | the switch device to forward the packet with the correct dst mac address, the | |
378 | nexthop gateways must be resolved to the neighbor's mac address. Neighbor mac | |
379 | address discovery comes via the ARP (or ND) process and is available via the | |
380 | arp_tbl neighbor table. To resolve the route's nexthop gateways, the driver | |
381 | should trigger the kernel's neighbor resolution process. See the rocker | |
382 | driver's rocker_port_ipv4_resolve() for an example. | |
383 | ||
384 | The driver can monitor for updates to arp_tbl using the netevent notifier | |
385 | NETEVENT_NEIGH_UPDATE. The device can be programmed with resolved nexthops | |
dd19f83d SF |
386 | for the routes as arp_tbl updates. The driver implements ndo_neigh_destroy |
387 | to know when arp_tbl neighbor entries are purged from the port. |