========================
ovs-vswitchd Internals
========================

This document describes some of the internals of the ovs-vswitchd
process. It is not complete. It tends to be updated on demand, so if
you have questions about the vswitchd implementation, ask them and
perhaps we'll add some appropriate documentation here.

Most of the ovs-vswitchd implementation is in vswitchd/bridge.c, so
code references below should be assumed to refer to that file except
as otherwise specified.

Bonding
=======

Bonding allows two or more interfaces (the "slaves") to share network
traffic. From a high-level point of view, bonded interfaces act like
a single port, but they have the bandwidth of multiple network
devices, e.g. two 1 Gbps physical interfaces act like a single 2 Gbps
interface. Bonds also increase robustness: the bonded port does not
go down as long as at least one of its slaves is up.

In vswitchd, a bond always has at least two slaves (and may have
more). If a configuration error or similar problem would leave a bond
with only one slave, the port becomes an ordinary port, not a bonded
port, and none of the special features of bonded ports described in
this section apply.

There are many forms of bonding, but ovs-vswitchd currently implements
only a single kind, called "source load balancing" or SLB bonding.
SLB bonding divides traffic among the slaves based on the Ethernet
source address. This is useful only if the traffic over the bond has
multiple Ethernet source addresses, for example if network traffic
from multiple VMs is multiplexed over the bond.

Enabling and Disabling Slaves
-----------------------------

When a bond is created, a slave is initially enabled or disabled based
on whether carrier is detected on the NIC (see iface_create()). After
that, a slave is disabled if its carrier goes down for a period of
time longer than the downdelay, and it is enabled if carrier comes up
for longer than the updelay (see bond_link_status_update()). There is
one exception where the updelay is skipped: if no slaves at all are
currently enabled, then the first slave on which carrier comes up is
enabled immediately.
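The enable/disable rules above can be sketched as a small debounce
routine. This is an illustration only, not the code in
bond_link_status_update(): the struct, its fields, and the function
name are hypothetical, and the sketch assumes that a delay counts as
satisfied once the elapsed time reaches it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-slave state; these names are illustrative and are
 * not taken from vswitchd/bridge.c. */
struct slave {
    bool enabled;         /* Is the slave carrying bond traffic? */
    bool carrier;         /* Most recently observed carrier state. */
    long long change_ms;  /* Time at which 'carrier' last changed. */
};

/* Records the carrier state observed at 'now_ms' and updates
 * 's->enabled'.  'n_enabled' is the number of slaves in the bond that
 * are currently enabled. */
static void
slave_update(struct slave *s, bool carrier, long long now_ms,
             int updelay_ms, int downdelay_ms, int n_enabled)
{
    long long elapsed;

    assert(updelay_ms >= 0 && downdelay_ms >= 0);

    if (carrier != s->carrier) {
        s->carrier = carrier;
        s->change_ms = now_ms;
    }
    if (s->carrier == s->enabled) {
        return;                 /* No pending transition. */
    }

    elapsed = now_ms - s->change_ms;
    if (s->carrier) {
        /* Carrier is up but the slave is disabled.  Wait out the
         * updelay, unless no slave at all is enabled, in which case
         * the slave is enabled immediately. */
        if (n_enabled == 0 || elapsed >= updelay_ms) {
            s->enabled = true;
        }
    } else if (elapsed >= downdelay_ms) {
        /* Carrier has been down for the whole downdelay. */
        s->enabled = false;
    }
}
```

A caller would invoke slave_update() periodically for every slave.
A carrier flap shorter than the relevant delay leaves 'enabled'
unchanged, because the pending transition is cancelled as soon as the
carrier state matches the enabled state again.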

The updelay should be set to a time longer than the STP forwarding
delay of the physical switch to which the bond port is connected (if
STP is enabled on that switch). Otherwise, the slave will be enabled,
and load may be shifted to it, before the physical switch starts
forwarding packets on that port, which can cause some data to be
"blackholed" for a time. The exception that skips the updelay when no
slaves are enabled does not cause any problem in this regard, because
when no slaves are enabled all output packets are blackholed anyway.

When a slave becomes disabled, the vswitch immediately chooses a new
output port for traffic that was destined for that slave (see
bond_enable_slave()). It also sends a "gratuitous learning packet" on
the bond port (on the newly chosen slave) for each MAC address that
the vswitch has learned on a port other than the bond (see
bond_send_learning_packets()), to teach the physical switch that the
new slave should be used in place of the one that is now disabled.
(This behavior probably makes sense only for a vswitch whose bond is
its only port connected to a physical switch; vswitchd should
probably provide a way to disable or configure it in other scenarios.)

Bond Packet Input
-----------------

Bonding accepts unicast packets on any bond slave. This can
occasionally cause packet duplication for the first few packets sent
to a given MAC, if the physical switch attached to the bond is
flooding packets to that MAC because it has not yet learned the
correct slave for that MAC.

Bonding accepts multicast (and broadcast) packets on only a single
bond slave (the "active slave") at any given time. Multicast packets
received on other slaves are dropped. Otherwise, every multicast
packet would be duplicated, once for every bond slave, because the
physical switch attached to the bond will flood those packets.

Bonding also drops received packets when the vswitch has learned that
the packet's source MAC is on a port other than the bond port itself.
This is because it is likely that the vswitch itself sent the packet
out the bond port on a different slave and is now receiving the packet
back. This occurs when the packet is multicast or when the physical
switch has not yet learned the destination MAC and is flooding the
packet. However, the vswitch makes an exception to this rule for
broadcast ARP replies, which indicate that the MAC has moved to
another switch, probably due to VM migration. (ARP replies are
normally unicast, so this exception does not match normal ARP replies.
It will match the learning packets sent on bond fail-over.)
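The three input rules above can be combined into a single admission
check, sketched below. The types and the function name are
hypothetical and do not come from bridge.c, and the order in which the
checks are applied is an assumption; the text above only states which
packets are dropped.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical classification of a packet's destination address. */
enum dst_type { DST_UNICAST, DST_MULTICAST, DST_BROADCAST };

/* Hypothetical summary of a packet received on a bond. */
struct rx_packet {
    enum dst_type dst;   /* Kind of destination MAC. */
    bool is_arp_reply;   /* Is the packet an ARP reply? */
    int in_slave;        /* Index of the slave that received it. */
};

/* Returns true if 'pkt', received on a bond whose active slave has
 * index 'active_slave', should be accepted.  'src_learned_elsewhere'
 * is true if the MAC learning table places the packet's source MAC on
 * a port other than the bond. */
static bool
bond_accept_rx(const struct rx_packet *pkt, int active_slave,
               bool src_learned_elsewhere)
{
    assert(active_slave >= 0);

    /* Multicast and broadcast are accepted only on the active slave. */
    if (pkt->dst != DST_UNICAST && pkt->in_slave != active_slave) {
        return false;
    }

    /* A source MAC learned on another port usually means that the
     * vswitch is seeing its own packet reflected back.  Broadcast ARP
     * replies are the exception. */
    if (src_learned_elsewhere) {
        return pkt->dst == DST_BROADCAST && pkt->is_arp_reply;
    }

    return true;
}
```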

The active slave is simply the first slave to be enabled after the
bond is created (see bond_choose_active_iface()). If the active slave
is disabled, then a new active slave is chosen from among the slaves
that remain enabled. Currently, due to the way that configuration
works, this tends to be the remaining slave whose interface name is
first alphabetically, but this is by no means guaranteed.

Bond Packet Output
------------------

When a packet is sent out a bond port, the bond slave actually used is
selected based on the packet's source MAC and VLAN tag (see
choose_output_iface()). In particular, the source MAC and VLAN tag
are hashed into one of 256 values, and that value is looked up in a
hash table (the "bond hash") kept in the "bond_hash" member of struct
port. The hash table entry identifies a bond slave. If no bond slave
has yet been chosen for that hash table entry, vswitchd chooses one
arbitrarily.
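The lookup described above can be sketched as follows. This is not
the code in choose_output_iface(): the struct layout and all of the
names are hypothetical, the FNV-1a hash is only a stand-in for
whatever hash function vswitchd really uses, and round-robin
assignment is just one possible way to choose a slave "arbitrarily".
The sketch also ignores disabled slaves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BOND_N_HASHES 256   /* Number of bond hash entries. */
#define NO_SLAVE (-1)       /* Entry has no slave assigned yet. */

/* Hypothetical bond state, standing in for the "bond_hash" member. */
struct bond {
    int hash_to_slave[BOND_N_HASHES]; /* Slave index per hash value. */
    int n_slaves;                     /* Number of slaves. */
    int next_slave;                   /* Cursor for new assignments. */
};

static void
bond_init(struct bond *b, int n_slaves)
{
    assert(n_slaves >= 2);
    for (size_t i = 0; i < BOND_N_HASHES; i++) {
        b->hash_to_slave[i] = NO_SLAVE;
    }
    b->n_slaves = n_slaves;
    b->next_slave = 0;
}

/* Stand-in hash: FNV-1a over the MAC bytes and the VLAN ID, folded
 * down to 8 bits so that the result is one of 256 values. */
static uint8_t
bond_hash(const uint8_t mac[6], uint16_t vlan)
{
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < 6; i++) {
        h = (h ^ mac[i]) * 16777619u;
    }
    h = (h ^ (vlan & 0xff)) * 16777619u;
    h = (h ^ (vlan >> 8)) * 16777619u;
    return (uint8_t) (h ^ (h >> 8) ^ (h >> 16) ^ (h >> 24));
}

/* Returns the index of the slave to use for a packet with the given
 * source MAC and VLAN, assigning one if the entry is still empty. */
static int
choose_output_slave(struct bond *b, const uint8_t mac[6], uint16_t vlan)
{
    int *entry = &b->hash_to_slave[bond_hash(mac, vlan)];

    if (*entry == NO_SLAVE) {
        *entry = b->next_slave;
        b->next_slave = (b->next_slave + 1) % b->n_slaves;
    }
    return *entry;
}
```

Because the table is indexed by hash value rather than by MAC+VLAN
pair, every source that hashes to the same value is moved together
when an entry is reassigned.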

Every 10 seconds, vswitchd rebalances the bond slaves (see
bond_rebalance_port()). To rebalance, vswitchd examines the
statistics for the number of bytes transmitted by each slave over
approximately the past minute, with data sent more recently weighted
more heavily than data sent less recently. It considers each of the
slaves in order from most-loaded to least-loaded. If highly loaded
slave H is significantly more heavily loaded than the least-loaded
slave L, and slave H carries at least two hashes, then vswitchd shifts
one of H's hashes to L. However, vswitchd shifts a hash from H to L
only if doing so decreases the ratio of the load between H and L by at
least 0.1.

Currently, "significantly more loaded" means that H must carry at
least 1 Mbps more traffic than L, and that traffic must be at least 3%
greater than L's.
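The two rebalancing thresholds can be written out as predicates.
This is one reading of the rules above, not the code in
bond_rebalance_port(): the function names are hypothetical, loads are
expressed in bits per second, the 3% test is taken to mean that the
difference is at least 3% of L's load, and the load ratio is taken as
the larger load divided by the smaller one so that overshooting the
balance point does not count as an improvement.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true if H is "significantly more loaded" than L: at least
 * 1 Mbps more traffic, and at least 3% more than L carries. */
static bool
significantly_more_loaded(uint64_t h_bps, uint64_t l_bps)
{
    uint64_t delta;

    if (h_bps <= l_bps) {
        return false;
    }
    delta = h_bps - l_bps;
    return delta >= 1000000 && delta * 100 >= l_bps * 3;
}

/* Ratio of the larger load to the smaller one.  A zero load is
 * treated as 1 to avoid division by zero (an assumption). */
static double
load_ratio(uint64_t a, uint64_t b)
{
    uint64_t hi = a > b ? a : b;
    uint64_t lo = a > b ? b : a;

    return (double) hi / (double) (lo ? lo : 1);
}

/* Returns true if moving a hash that carries 'hash_bps' from H to L
 * reduces the load ratio between them by at least 0.1. */
static bool
shift_is_worthwhile(uint64_t h_bps, uint64_t l_bps, uint64_t hash_bps)
{
    double before, after;

    assert(hash_bps <= h_bps);
    before = load_ratio(h_bps, l_bps);
    after = load_ratio(h_bps - hash_bps, l_bps + hash_bps);
    return before - after >= 0.1;
}
```

For example, with H at 10 Mbps and L at 5 Mbps, moving a 2 Mbps hash
changes the ratio from 2.0 to 8/7, so the shift qualifies. With H at
10 Mbps and L at 9.5 Mbps, H is not significantly more loaded in the
first place, because the difference is below 1 Mbps.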

Bond Balance Modes
------------------

Each bond balancing mode has different considerations, described
below.

LACP Bonding
------------

LACP bonding requires the remote switch to implement LACP, but it is
otherwise very simple in that, after LACP negotiation is complete,
there is no need for special handling of received packets.

SLB Bonding
-----------

SLB bonding allows a limited form of load balancing without the remote
switch's knowledge or cooperation. The basics of SLB are simple. SLB
assigns each source MAC+VLAN pair to a link and transmits all packets
from that MAC+VLAN through that link. Learning in the remote switch
causes it to send packets to that MAC+VLAN through the same link.

SLB bonding has the following complications:

   0. When the remote switch has not learned the MAC for the
      destination of a unicast packet and hence floods the packet to
      all of the links on the SLB bond, Open vSwitch will forward
      duplicate packets, one per link, to each other switch port.

      Open vSwitch does not solve this problem.

   1. When the remote switch receives a multicast or broadcast packet
      from a port not on the SLB bond, it will forward it to all of
      the links in the SLB bond. This would cause packet duplication
      if not handled specially.

      Open vSwitch avoids packet duplication by accepting multicast
      and broadcast packets on only the active slave, and dropping
      multicast and broadcast packets on all other slaves.

   2. When Open vSwitch forwards a multicast or broadcast packet to a
      link in the SLB bond other than the active slave, the remote
      switch will forward it to all of the other links in the SLB
      bond, including the active slave. Without special handling,
      this would mean that Open vSwitch would forward a second copy of
      the packet to each switch port (other than the bond), including
      the port that originated the packet.

      Open vSwitch deals with this case by dropping packets received
      on any SLB bonded link that have a source MAC+VLAN that has been
      learned on any other port. (This means that SLB as implemented
      in Open vSwitch relies critically on MAC learning. Notably, SLB
      is incompatible with the "flood_vlans" feature.)

   3. Suppose that a MAC+VLAN moves to an SLB bond from another port
      (e.g. when a VM is migrated from this hypervisor to a different
      one). Without additional special handling, Open vSwitch will
      not notice until the MAC learning entry expires, up to 60
      seconds later, as a consequence of rule #2.

      Open vSwitch avoids a 60-second delay by listening for
      gratuitous ARPs, which VMs commonly emit upon migration. As an
      exception to rule #2, a gratuitous ARP received on an SLB bond
      is not dropped and updates the MAC learning table in the usual
      way. (If a move does not trigger a gratuitous ARP, or if the
      gratuitous ARP is lost in the network, then a 60-second delay
      still occurs.)

   4. Suppose that a MAC+VLAN moves from an SLB bond to another port
      (e.g. when a VM is migrated from a different hypervisor to this
      one), that the MAC+VLAN emits a gratuitous ARP, and that Open
      vSwitch forwards that gratuitous ARP to a link in the SLB bond
      other than the active slave. The remote switch will forward the
      gratuitous ARP to all of the other links in the SLB bond,
      including the active slave. Without additional special
      handling, this would mean that Open vSwitch would learn that the
      MAC+VLAN was located on the SLB bond, as a consequence of rule
      #3.

      Open vSwitch avoids this problem by "locking" the MAC learning
      table entry for a MAC+VLAN from which a gratuitous ARP was
      received on a port that is not an SLB bond. For 5 seconds, a
      locked MAC learning table entry will not be updated based on a
      gratuitous ARP received on an SLB bond.
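
Rules #2 through #4 interact, so a sketch of how they might fit
together in the MAC learning path may help. All names below are
hypothetical and do not come from the Open vSwitch sources. The
sketch assumes that a learning entry for the source MAC+VLAN already
exists, leaves out the active-slave check of rule #1, and assumes that
a gratuitous ARP that hits a locked entry is dropped as well as
ignored for learning.

```c
#include <assert.h>
#include <stdbool.h>

#define GRAT_ARP_LOCK_MS 5000   /* Lock duration from rule #4. */

/* Hypothetical MAC learning entry for one MAC+VLAN pair. */
struct mac_entry {
    int port;                 /* Port on which the pair was learned. */
    long long lock_until_ms;  /* Entry is locked before this time. */
};

/* Processes a packet whose source MAC+VLAN maps to entry 'e' and that
 * arrived on 'in_port' at time 'now_ms'.  Updates the entry where the
 * rules allow it.  Returns true if the packet is accepted, false if
 * it is dropped. */
static bool
slb_learn(struct mac_entry *e, int in_port, bool in_port_is_slb_bond,
          bool is_grat_arp, long long now_ms)
{
    assert(now_ms >= 0);

    if (!in_port_is_slb_bond) {
        /* Ordinary port: learn as usual.  A gratuitous ARP also locks
         * the entry against reflections from the bond (rule #4). */
        e->port = in_port;
        if (is_grat_arp) {
            e->lock_until_ms = now_ms + GRAT_ARP_LOCK_MS;
        }
        return true;
    }

    if (e->port == in_port) {
        return true;          /* Already learned on this bond. */
    }
    if (!is_grat_arp) {
        return false;         /* Rule #2: probably a reflection. */
    }
    if (now_ms < e->lock_until_ms) {
        return false;         /* Rule #4: entry is locked. */
    }

    e->port = in_port;        /* Rule #3: the MAC+VLAN has moved. */
    return true;
}
```

With an entry learned on port 1 and a bond on port 9, an ordinary
packet arriving on the bond is dropped, a gratuitous ARP on the bond
moves the entry to port 9, and a gratuitous ARP on port 1 moves it
back and keeps the bond from reclaiming it for the next 5 seconds.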