Commit | Line | Data |
---|---|---|
b16fdafe BP |
1 | ======================== |
2 | ovs-vswitchd Internals | |
3 | ======================== | |
4 | ||
5 | This document describes some of the internals of the ovs-vswitchd | |
6 | process. It is not complete. It tends to be updated on demand, so if | |
7 | you have questions about the vswitchd implementation, ask them and | |
8 | perhaps we'll add some appropriate documentation here. | |
9 | ||
10 | Most of the ovs-vswitchd implementation is in vswitchd/bridge.c, so | |
11 | code references below should be assumed to refer to that file except | |
12 | as otherwise specified. | |
13 | ||
14 | Bonding | |
15 | ======= | |
16 | ||
17 | Bonding allows two or more interfaces (the "slaves") to share network | |
18 | traffic. From a high-level point of view, bonded interfaces act like | |
19 | a single port, but they have the bandwidth of multiple network | |
20 | devices, e.g. two 1 Gbps physical interfaces act like a single 2 Gbps | |
21 | interface. Bonds also increase robustness: the bonded port does not | |
22 | go down as long as at least one of its slaves is up. | |
23 | ||
24 | In vswitchd, a bond always has at least two slaves (and may have | |
25 | more). If a configuration error or similar problem would leave a bond with only | |
26 | one slave, the port becomes an ordinary port, not a bonded port, and | |
27 | none of the special features of bonded ports described in this section | |
28 | apply. | |
29 | ||
1e959061 EJ |
30 | There are many forms of bonding, of which ovs-vswitchd implements only |
31 | a few. The most complex bond ovs-vswitchd implements is called | |
32 | "source load balancing" or SLB bonding. SLB bonding divides traffic | |
33 | among the slaves based on the Ethernet source address. This is useful | |
34 | only if the traffic over the bond has multiple Ethernet source | |
35 | addresses, for example if network traffic from multiple VMs is | |
36 | multiplexed over the bond. | |
b16fdafe BP |
37 | |
38 | Enabling and Disabling Slaves | |
39 | ----------------------------- | |
40 | ||
41 | When a bond is created, a slave is initially enabled or disabled based | |
42 | on whether carrier is detected on the NIC (see iface_create()). After | |
43 | that, a slave is disabled if its carrier goes down for a period of | |
44 | time longer than the downdelay, and it is enabled if carrier comes up | |
45 | for longer than the updelay (see bond_link_status_update()). There is | |
46 | one exception where the updelay is skipped: if no slaves at all are | |
47 | currently enabled, then the first slave on which carrier comes up is | |
48 | enabled immediately. | |
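
The enable/disable rules above can be sketched as a small state machine. This is only an illustration, not the code in bond_link_status_update(): the Slave class, the update_link_status() function, and all of their fields and parameters are hypothetical names, and times are assumed to be in milliseconds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slave:
    name: str
    enabled: bool = False
    carrier: bool = False
    # Time at which carrier last changed to a state that disagrees with
    # `enabled`, or None if no enable/disable is pending. (Hypothetical field.)
    pending_since_ms: Optional[int] = None


def update_link_status(slave, carrier, now_ms, updelay_ms, downdelay_ms,
                       any_enabled):
    """Feed one carrier sample to a slave; return True if `enabled` changed."""
    if carrier != slave.carrier:
        slave.carrier = carrier
        if carrier == slave.enabled:
            # Carrier flapped back before the delay ran out: cancel.
            slave.pending_since_ms = None
        elif carrier and not any_enabled:
            # Exception: no slave is enabled, so the updelay is skipped.
            slave.enabled = True
            slave.pending_since_ms = None
            return True
        else:
            slave.pending_since_ms = now_ms
    if slave.pending_since_ms is not None:
        delay_ms = updelay_ms if slave.carrier else downdelay_ms
        # "Longer than" the delay, hence the strict comparison.
        if now_ms - slave.pending_since_ms > delay_ms:
            slave.enabled = slave.carrier
            slave.pending_since_ms = None
            return True
    return False
```

With an updelay of 100 ms and another slave already enabled, a slave whose carrier comes up at time 0 stays disabled until a sample taken after time 100. With no slave enabled, the same event enables it at once.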
49 | ||
50 | The updelay should be set to a time longer than the STP forwarding | |
51 | delay of the physical switch to which the bond port is connected (if | |
52 | STP is enabled on that switch). Otherwise, the slave will be enabled, | |
53 | and load may be shifted to it, before the physical switch starts | |
54 | forwarding packets on that port, which can cause some data to be | |
55 | "blackholed" for a time. The exception for a single enabled slave | |
56 | does not cause any problem in this regard because when no slaves are | |
57 | enabled all output packets are blackholed anyway. | |
58 | ||
59 | When a slave becomes disabled, the vswitch immediately chooses a new | |
60 | output port for traffic that was destined for that slave (see | |
38f7147c EJ |
61 | bond_enable_slave()). It also sends a "gratuitous learning packet", |
62 | specifically a RARP, on the bond port (on the newly chosen slave) for | |
63 | each MAC address that the vswitch has learned on a port other than the | |
64 | bond (see bond_send_learning_packets()), to teach the physical switch | |
65 | that the new slave should be used in place of the one that is now | |
66 | disabled. (This behavior probably makes sense only for a vswitch that | |
67 | has only one port (the bond) connected to a physical switch; vswitchd | |
68 | should probably provide a way to disable or configure it in other | |
69 | scenarios.) | |
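
For illustration, a gratuitous learning packet of this kind could be laid out as below. This is a sketch, not the output of bond_send_learning_packets(): the function name is hypothetical, and the field values (broadcast destination, RARP "request reverse" opcode 3, zeroed protocol addresses) are assumptions based on the generic RARP frame layout.

```python
import struct

ETH_TYPE_RARP = 0x8035
BROADCAST = b"\xff" * 6


def build_learning_rarp(learned_mac: bytes) -> bytes:
    """Build an Ethernet + RARP frame whose source is `learned_mac`."""
    if len(learned_mac) != 6:
        raise ValueError("MAC address must be 6 bytes")
    # Ethernet header: destination, source, EtherType.
    eth = struct.pack("!6s6sH", BROADCAST, learned_mac, ETH_TYPE_RARP)
    # RARP body: hardware type 1 (Ethernet), protocol type 0x0800 (IPv4),
    # address lengths 6 and 4, opcode 3, then sender and target addresses.
    rarp = struct.pack("!HHBBH6s4s6s4s",
                       1, 0x0800, 6, 4, 3,
                       learned_mac, b"\x00" * 4,
                       learned_mac, b"\x00" * 4)
    return eth + rarp
```

The physical switch only needs the Ethernet source address to update its forwarding table, so the body carries no useful IP information in this sketch.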
b16fdafe BP |
70 | |
71 | Bond Packet Input | |
72 | ----------------- | |
73 | ||
b16fdafe BP |
74 | Bonding accepts unicast packets on any bond slave. This can |
75 | occasionally cause packet duplication for the first few packets sent | |
76 | to a given MAC, if the physical switch attached to the bond is | |
77 | flooding packets to that MAC because it has not yet learned the | |
78 | correct slave for that MAC. | |
79 | ||
80 | Bonding only accepts multicast (and broadcast) packets on a single | |
81 | bond slave (the "active slave") at any given time. Multicast packets | |
82 | received on other slaves are dropped. Otherwise, every multicast | |
83 | packet would be duplicated, once for every bond slave, because the | |
84 | physical switch attached to the bond will flood those packets. | |
85 | ||
3a55ef14 JG |
86 | Bonding also drops received packets when the vswitch has learned that |
87 | the packet's MAC is on a port other than the bond port itself. This is | |
88 | because it is likely that the vswitch itself sent the packet out the | |
89 | bond port on a different slave and is now receiving the packet back. | |
90 | This occurs when the packet is multicast or the physical switch has not | |
91 | yet learned the MAC and is flooding it. However, the vswitch makes an | |
b16fdafe BP |
92 | exception to this rule for broadcast ARP replies, which indicate that |
93 | the MAC has moved to another switch, probably due to VM migration. | |
94 | (ARP replies are normally unicast, so this exception does not match | |
95 | normal ARP replies. It will match the learning packets sent on bond | |
96 | fail-over.) | |
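
The three input rules above can be condensed into a single predicate. This is a minimal sketch with hypothetical names throughout; in particular, the order in which the checks are applied is an assumption, and "the packet's MAC" is taken to mean its Ethernet source address.

```python
from typing import Optional


def is_multicast(mac: str) -> bool:
    """True for multicast and broadcast MACs (group bit of first octet set)."""
    return int(mac.split(":")[0], 16) & 1 == 1


def accept_on_bond(dst_mac: str, src_learned_port: Optional[str],
                   in_slave: str, active_slave: str, bond_port: str,
                   is_broadcast_arp_reply: bool = False) -> bool:
    """Decide whether a packet received on a bond slave is accepted."""
    # Multicast and broadcast are accepted only on the active slave.
    if is_multicast(dst_mac) and in_slave != active_slave:
        return False
    # Drop packets whose source MAC was learned on a non-bond port, unless
    # the packet is a broadcast ARP reply (a sign that the MAC has moved).
    if (src_learned_port is not None and src_learned_port != bond_port
            and not is_broadcast_arp_reply):
        return False
    return True
```

For example, a unicast packet from an unknown source is accepted on any slave, while a broadcast packet arriving on a non-active slave is dropped.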
97 | ||
98 | The active slave is simply the first slave to be enabled after the | |
99 | bond is created (see bond_choose_active_iface()). If the active slave | |
100 | is disabled, then a new active slave is chosen among the slaves that | |
101 | remain enabled. Currently, due to the way that configuration works, | |
102 | this tends to be the remaining slave whose interface name is first | |
103 | alphabetically, but this is by no means guaranteed. | |
104 | ||
105 | Bond Packet Output | |
106 | ------------------ | |
107 | ||
108 | When a packet is sent out a bond port, the bond slave actually used is | |
e58de0e3 EJ |
109 | selected based on the packet's source MAC and VLAN tag (see |
110 | choose_output_iface()). In particular, the source MAC and VLAN tag | |
111 | are hashed into one of 256 values, and that value is looked up in a | |
112 | hash table (the "bond hash") kept in the "bond_hash" member of struct | |
113 | port. The hash table entry identifies a bond slave. If no bond slave | |
114 | has yet been chosen for that hash table entry, vswitchd chooses one | |
115 | arbitrarily. | |
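
A minimal sketch of this lookup follows. It is not choose_output_iface(): the function names are hypothetical, CRC32 stands in for whatever hash function vswitchd actually uses, and the rule for picking a slave for an empty entry is an arbitrary choice made for the example.

```python
import zlib

BOND_HASH_SIZE = 256  # number of bond hash entries, as described above


def bond_hash_index(src_mac: bytes, vlan: int) -> int:
    """Map a source MAC and VLAN tag to one of 256 bond hash entries."""
    # Assumption: CRC32 over MAC + VLAN; the real hash function may differ.
    key = src_mac + vlan.to_bytes(2, "big")
    return zlib.crc32(key) % BOND_HASH_SIZE


def choose_output_slave(bond_hash, enabled_slaves, src_mac, vlan):
    """Return the slave for this MAC+VLAN, assigning one if necessary."""
    idx = bond_hash_index(src_mac, vlan)
    slave = bond_hash[idx]
    if slave is None or slave not in enabled_slaves:
        if not enabled_slaves:
            return None  # no slave is enabled: nothing to send on
        # Arbitrary choice for an unassigned (or stale) entry.
        slave = enabled_slaves[idx % len(enabled_slaves)]
        bond_hash[idx] = slave
    return slave
```

Because the entry is stored once chosen, all later packets with the same source MAC and VLAN tag leave through the same slave until the entry is reassigned.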
b16fdafe BP |
116 | |
117 | Every 10 seconds, vswitchd rebalances the bond slaves (see | |
118 | bond_rebalance_port()). To rebalance, vswitchd examines the | |
119 | statistics for the number of bytes transmitted by each slave over | |
120 | approximately the past minute, with data sent more recently weighted | |
121 | more heavily than data sent less recently. It considers each of the | |
122 | slaves in order from most-loaded to least-loaded. If highly loaded | |
123 | slave H is significantly more heavily loaded than the least-loaded | |
124 | slave L, and slave H carries at least two hashes, then vswitchd shifts | |
5422a9e1 JG |
125 | one of H's hashes to L. However, vswitchd will only shift a hash from |
126 | H to L if it will decrease the ratio of the load between H and L by at | |
127 | least 0.1. | |
b16fdafe BP |
128 | |
129 | Currently, "significantly more loaded" means that H must carry at | |
130 | least 1 Mbps more traffic, and that traffic must be at least 3% | |
131 | greater than L's. | |
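
The thresholds above can be combined into one decision function. This is a sketch under stated assumptions, not bond_rebalance_port(): all names are hypothetical, loads are assumed to be expressed in bits per second, the "ratio of the load" is taken as the larger load divided by the smaller one, and picking the candidate that gives the best balance is a choice made for the example.

```python
MIN_DELTA_BPS = 1_000_000      # H must carry at least 1 Mbps more than L
MIN_DELTA_FRACTION = 0.03      # ...and at least 3% more than L
MIN_RATIO_IMPROVEMENT = 0.1    # a shift must cut the H:L ratio by this much


def significantly_more_loaded(h_load, l_load):
    """Apply the 1 Mbps and 3% thresholds."""
    return (h_load - l_load >= MIN_DELTA_BPS
            and h_load >= l_load * (1 + MIN_DELTA_FRACTION))


def imbalance(a, b):
    """Ratio of the larger load to the smaller one (infinite if one is 0)."""
    lo, hi = min(a, b), max(a, b)
    return float("inf") if lo == 0 else hi / lo


def pick_hash_to_shift(hash_loads, h_load, l_load):
    """Return the index of the hash to move from H to L, or None.

    `hash_loads` maps each bond hash index carried by H to its load.
    """
    if len(hash_loads) < 2 or not significantly_more_loaded(h_load, l_load):
        return None
    old = imbalance(h_load, l_load)
    best_idx, best_new = None, None
    for idx, load in hash_loads.items():
        if load == 0:
            continue  # moving an idle hash changes nothing
        new = imbalance(h_load - load, l_load + load)
        if old - new >= MIN_RATIO_IMPROVEMENT:
            if best_new is None or new < best_new:
                best_idx, best_new = idx, new
    return best_idx
```

For instance, with H at 10 Mbps (hashes of 6 and 4 Mbps) and L at 2 Mbps, moving the 4 Mbps hash leaves both slaves at 6 Mbps, so that hash is the one selected.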
b2272edb BP |
132 | |
133 | Bond Balance Modes | |
134 | ------------------ | |
135 | ||
136 | Each bond balancing mode has different considerations, described | |
137 | below. | |
138 | ||
139 | LACP Bonding | |
140 | ------------ | |
141 | ||
142 | LACP bonding requires the remote switch to implement LACP, but it is | |
143 | otherwise very simple in that, after LACP negotiation is complete, | |
144 | there is no need for special handling of received packets. | |
145 | ||
9dd165e0 RK |
146 | Several of the physical switches that support LACP block all traffic |
147 | for ports that are configured to use LACP, until LACP is negotiated with | |
148 | the host. When configuring an LACP bond on an OVS host (e.g. XenServer), | |
149 | this means that there will be an interruption of the network connectivity | |
150 | between the time the ports on the physical switch and the bond on the OVS | |
151 | host are configured. The interruption may be relatively long, if different | |
152 | people are responsible for managing the switches and the OVS host. | |
153 | ||
154 | Such a connectivity failure can be avoided by configuring LACP on the | |
155 | OVS host before configuring the physical switch, and having the OVS | |
156 | host fall back to active-backup bond mode until the physical switch's | |
157 | LACP configuration is complete. The "lacp-fallback-ab" option provides | |
158 | this behavior in Open vSwitch. | |
159 | ||
1e959061 EJ |
160 | Active Backup Bonding |
161 | --------------------- | |
162 | ||
163 | Active Backup bonds send all traffic out one "active" slave until that | |
164 | slave becomes unavailable. Since they are significantly less | |
165 | complicated than SLB bonds, they are preferred when LACP is not an | |
166 | option. Additionally, they are the only bond mode which supports | |
167 | attaching each slave to a different upstream switch. | |
168 | ||
b2272edb BP |
169 | SLB Bonding |
170 | ----------- | |
171 | ||
172 | SLB bonding allows a limited form of load balancing without the remote | |
173 | switch's knowledge or cooperation. The basics of SLB are simple. SLB | |
174 | assigns each source MAC+VLAN pair to a link and transmits all packets | |
175 | from that MAC+VLAN through that link. Learning in the remote switch | |
176 | causes it to send packets to that MAC+VLAN through the same link. | |
177 | ||
178 | SLB bonding has the following complications: | |
179 | ||
180 | 0. When the remote switch has not learned the MAC for the | |
181 | destination of a unicast packet and hence floods the packet to | |
182 | all of the links on the SLB bond, Open vSwitch will forward | |
183 | duplicate packets, one per link, to each other switch port. | |
184 | ||
185 | Open vSwitch does not solve this problem. | |
186 | ||
187 | 1. When the remote switch receives a multicast or broadcast packet | |
188 | from a port not on the SLB bond, it will forward it to all of | |
189 | the links in the SLB bond. This would cause packet duplication | |
190 | if not handled specially. | |
191 | ||
192 | Open vSwitch avoids packet duplication by accepting multicast | |
193 | and broadcast packets on only the active slave, and dropping | |
194 | multicast and broadcast packets on all other slaves. | |
195 | ||
196 | 2. When Open vSwitch forwards a multicast or broadcast packet to a | |
197 | link in the SLB bond other than the active slave, the remote | |
198 | switch will forward it to all of the other links in the SLB | |
199 | bond, including the active slave. Without special handling, | |
200 | this would mean that Open vSwitch would forward a second copy of | |
201 | the packet to each switch port (other than the bond), including | |
202 | the port that originated the packet. | |
203 | ||
204 | Open vSwitch deals with this case by dropping packets received | |
205 | on any SLB bonded link that have a source MAC+VLAN that has been | |
206 | learned on any other port. (This means that SLB as implemented | |
207 | in Open vSwitch relies critically on MAC learning. Notably, SLB | |
208 | is incompatible with the "flood_vlans" feature.) | |
209 | ||
210 | 3. Suppose that a MAC+VLAN moves to an SLB bond from another port | |
211 | (e.g. when a VM is migrated from this hypervisor to a different | |
212 | one). Without additional special handling, Open vSwitch will | |
213 | not notice until the MAC learning entry expires, up to 60 | |
214 | seconds later as a consequence of rule #2. | |
215 | ||
216 | Open vSwitch avoids a 60-second delay by listening for | |
217 | gratuitous ARPs, which VMs commonly emit upon migration. As an | |
218 | exception to rule #2, a gratuitous ARP received on an SLB bond | |
219 | is not dropped and updates the MAC learning table in the usual | |
220 | way. (If a move does not trigger a gratuitous ARP, or if the | |
221 | gratuitous ARP is lost in the network, then a 60-second delay | |
222 | still occurs.) | |
223 | ||
224 | 4. Suppose that a MAC+VLAN moves from an SLB bond to another port | |
225 | (e.g. when a VM is migrated from a different hypervisor to this | |
226 | one), that the MAC+VLAN emits a gratuitous ARP, and that Open | |
227 | vSwitch forwards that gratuitous ARP to a link in the SLB bond | |
228 | other than the active slave. The remote switch will forward the | |
229 | gratuitous ARP to all of the other links in the SLB bond, | |
230 | including the active slave. Without additional special | |
231 | handling, this would mean that Open vSwitch would learn that the | |
232 | MAC+VLAN was located on the SLB bond, as a consequence of rule | |
233 | #3. | |
234 | ||
235 | Open vSwitch avoids this problem by "locking" the MAC learning | |
236 | table entry for a MAC+VLAN from which a gratuitous ARP was | |
237 | received on a port that is not an SLB bond. For 5 seconds, a | |
238 | locked MAC learning table entry will not be updated based on a | |
239 | gratuitous ARP received on an SLB bond.
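
Rules 2 through 4 interact, so a small model of the MAC learning table may help. This is a sketch only: the Entry class and slb_receive() function are hypothetical, times are assumed to be in milliseconds, and it is an assumption that a gratuitous ARP blocked by the lock is dropped rather than merely left unlearned.

```python
from dataclasses import dataclass

GRAT_ARP_LOCK_MS = 5000  # the 5-second lock described in rule 4


@dataclass
class Entry:
    port: str
    locked_until_ms: int = 0


def slb_receive(table, mac, vlan, in_port, bond_ports, is_grat_arp, now_ms):
    """Process a packet's source MAC+VLAN; return False if it is dropped."""
    entry = table.get((mac, vlan))
    on_bond = in_port in bond_ports
    if on_bond and entry is not None and entry.port != in_port:
        # Rule 2: source learned on another port, so this is probably
        # our own packet reflected back by the remote switch.
        if not is_grat_arp:
            return False
        # Rule 3 lets gratuitous ARPs through, unless rule 4's lock holds.
        if now_ms < entry.locked_until_ms:
            return False
    if is_grat_arp and not on_bond:
        # Rule 4: a gratuitous ARP on a non-bond port locks the entry.
        locked_until_ms = now_ms + GRAT_ARP_LOCK_MS
    else:
        locked_until_ms = entry.locked_until_ms if entry else 0
    table[(mac, vlan)] = Entry(in_port, locked_until_ms)
    return True
```

In this model, a gratuitous ARP seen on a VM-facing port at time 1000 keeps the entry pinned to that port until time 6000, so the copy reflected back over the bond a moment later does not move it.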