========================
 ovs-vswitchd Internals
========================

This document describes some of the internals of the ovs-vswitchd
process. It is not complete. It tends to be updated on demand, so if
you have questions about the vswitchd implementation, ask them and
perhaps we'll add some appropriate documentation here.

Most of the ovs-vswitchd implementation is in vswitchd/bridge.c, so
code references below should be assumed to refer to that file except
as otherwise specified.

Bonding
=======

Bonding allows two or more interfaces (the "slaves") to share network
traffic. From a high-level point of view, bonded interfaces act like
a single port, but they have the bandwidth of multiple network
devices, e.g. two 1 Gbps physical interfaces act like a single 2 Gbps
interface. Bonds also increase robustness: the bonded port does not
go down as long as at least one of its slaves is up.

In vswitchd, a bond always has at least two slaves (and may have
more). If a configuration error, etc. would cause a bond to have only
one slave, the port becomes an ordinary port, not a bonded port, and
none of the special features of bonded ports described in this section
apply.

There are many forms of bonding of which ovs-vswitchd implements only
a few. The most complex bond ovs-vswitchd implements is called
"source load balancing" or SLB bonding. SLB bonding divides traffic
among the slaves based on the Ethernet source address. This is useful
only if the traffic over the bond has multiple Ethernet source
addresses, for example if network traffic from multiple VMs is
multiplexed over the bond.

Enabling and Disabling Slaves
-----------------------------

When a bond is created, a slave is initially enabled or disabled based
on whether carrier is detected on the NIC (see iface_create()). After
that, a slave is disabled if its carrier goes down for a period of
time longer than the downdelay, and it is enabled if carrier comes up
for longer than the updelay (see bond_link_status_update()). There is
one exception where the updelay is skipped: if no slaves at all are
currently enabled, then the first slave on which carrier comes up is
enabled immediately.
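
The delay logic above can be sketched as a small state machine. This
is an illustrative Python model, not the code in bridge.c: the Slave
class, the update_link_status() name, and the millisecond timestamps
are assumptions made for the sketch, and a delay is treated as
satisfied once it has fully elapsed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Slave:
    """Hypothetical model of one bond slave (not a vswitchd structure)."""
    name: str
    carrier: bool
    enabled: bool
    # Time (ms) at which a pending enable/disable takes effect, if any.
    delay_expires: Optional[int] = None


def update_link_status(slaves: List[Slave], slave: Slave, carrier: bool,
                       now: int, updelay: int, downdelay: int) -> None:
    """Record a carrier reading for 'slave' and apply any elapsed delay."""
    slave.carrier = carrier
    if carrier == slave.enabled:
        # Carrier agrees with the current state: cancel any pending change.
        slave.delay_expires = None
        return
    if slave.delay_expires is None:
        if carrier and not any(s.enabled for s in slaves):
            # Exception: no slave is enabled, so the updelay is skipped.
            delay = 0
        else:
            delay = updelay if carrier else downdelay
        slave.delay_expires = now + delay
    if now >= slave.delay_expires:
        slave.enabled = carrier
        slave.delay_expires = None
```

In this model a carrier flap shorter than the downdelay leaves the
slave enabled, because the pending change is cancelled as soon as
carrier returns.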

The updelay should be set to a time longer than the STP forwarding
delay of the physical switch to which the bond port is connected (if
STP is enabled on that switch). Otherwise, the slave will be enabled,
and load may be shifted to it, before the physical switch starts
forwarding packets on that port, which can cause some data to be
"blackholed" for a time. The exception that skips the updelay when no
slaves are enabled does not cause any problem in this regard, because
when no slaves are enabled all output packets are blackholed anyway.

When a slave becomes disabled, the vswitch immediately chooses a new
output port for traffic that was destined for that slave (see
bond_enable_slave()). It also sends a "gratuitous learning packet",
specifically a RARP, on the bond port (on the newly chosen slave) for
each MAC address that the vswitch has learned on a port other than the
bond (see bond_send_learning_packets()), to teach the physical switch
that the new slave should be used in place of the one that is now
disabled. (This behavior probably makes sense only for a vswitch that
has only one port (the bond) connected to a physical switch; vswitchd
should probably provide a way to disable or configure it in other
scenarios.)

Bond Packet Input
-----------------

Bonding accepts unicast packets on any bond slave. This can
occasionally cause packet duplication for the first few packets sent
to a given MAC, if the physical switch attached to the bond is
flooding packets to that MAC because it has not yet learned the
correct slave for that MAC.

Bonding only accepts multicast (and broadcast) packets on a single
bond slave (the "active slave") at any given time. Multicast packets
received on other slaves are dropped. Otherwise, every multicast
packet would be duplicated, once for every bond slave, because the
physical switch attached to the bond will flood those packets.

Bonding also drops received packets when the vswitch has learned that
the packet's MAC is on a port other than the bond port itself. This is
because it is likely that the vswitch itself sent the packet out the
bond port on a different slave and is now receiving the packet back.
This occurs when the packet is multicast or the physical switch has not
yet learned the MAC and is flooding it. However, the vswitch makes an
exception to this rule for broadcast ARP replies, which indicate that
the MAC has moved to another switch, probably due to VM migration.
(ARP replies are normally unicast, so this exception does not match
normal ARP replies. It will match the learning packets sent on bond
fail-over.)

The active slave is simply the first slave to be enabled after the
bond is created (see bond_choose_active_iface()). If the active slave
is disabled, then a new active slave is chosen among the slaves that
remain enabled. Currently, due to the way that configuration works,
this tends to be the remaining slave whose interface name is first
alphabetically, but this is by no means guaranteed.
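
The input rules in this section can be summarized as one admission
check. The sketch below is a hedged Python illustration: the Packet
class, bond_accepts(), the "bond0" port name, and the dictionary
standing in for the MAC learning table are all hypothetical. The
order of the two checks (multicast filter first, then the
learned-elsewhere filter) is an assumption of the sketch; the text
above does not specify it.

```python
from dataclasses import dataclass
from typing import Dict

BOND_PORT = "bond0"  # hypothetical port name used only in this sketch


@dataclass
class Packet:
    src_mac: str
    dst_mac: str
    is_arp_reply: bool = False


def is_multicast(mac: str) -> bool:
    """True for multicast and broadcast: low bit of the first octet is set."""
    return int(mac.split(":")[0], 16) & 1 == 1


def is_broadcast(mac: str) -> bool:
    return mac.lower() == "ff:ff:ff:ff:ff:ff"


def bond_accepts(pkt: Packet, in_slave: str, active_slave: str,
                 learned: Dict[str, str]) -> bool:
    """Decide whether a packet received on bond slave 'in_slave' is accepted.

    'learned' maps a MAC address to the port on which it was learned.
    """
    # Multicast and broadcast are accepted only on the active slave.
    if is_multicast(pkt.dst_mac) and in_slave != active_slave:
        return False
    # Drop packets whose source MAC was learned on a non-bond port,
    # except broadcast ARP replies, which signal that the MAC has moved.
    learned_port = learned.get(pkt.src_mac)
    if learned_port is not None and learned_port != BOND_PORT:
        if not (pkt.is_arp_reply and is_broadcast(pkt.dst_mac)):
            return False
    return True
```

Unicast packets from unknown or bond-learned sources pass on any
slave, which matches the duplication caveat described at the start of
this section.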

Bond Packet Output
------------------

When a packet is sent out a bond port, the bond slave actually used is
selected based on the packet's source MAC and VLAN tag (see
choose_output_iface()). In particular, the source MAC and VLAN tag
are hashed into one of 256 values, and that value is looked up in a
hash table (the "bond hash") kept in the "bond_hash" member of struct
port. The hash table entry identifies a bond slave. If no bond slave
has yet been chosen for that hash table entry, vswitchd chooses one
arbitrarily.
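
A minimal Python sketch of the lookup described above follows. The
Bond class and function names are hypothetical, CRC32 is only a
stand-in for whatever hash function vswitchd actually uses, and the
"arbitrary" choice is modeled as a simple modulo over the slave list.

```python
import zlib
from typing import List, Optional

BOND_MASK = 0xFF  # 256 bond hash entries


def bond_hash_index(src_mac: str, vlan: int) -> int:
    """Hash a source MAC and VLAN tag into one of 256 values.

    CRC32 is an assumption of this sketch, not the vswitchd hash.
    """
    key = bytes.fromhex(src_mac.replace(":", "")) + vlan.to_bytes(2, "big")
    return zlib.crc32(key) & BOND_MASK


class Bond:
    def __init__(self, slaves: List[str]):
        self.slaves = list(slaves)
        # One entry per hash value; None means "no slave chosen yet".
        self.bond_hash: List[Optional[str]] = [None] * (BOND_MASK + 1)

    def choose_output_slave(self, src_mac: str, vlan: int) -> str:
        idx = bond_hash_index(src_mac, vlan)
        if self.bond_hash[idx] is None:
            # Arbitrary initial assignment (modeled here as idx modulo
            # the number of slaves).
            self.bond_hash[idx] = self.slaves[idx % len(self.slaves)]
        return self.bond_hash[idx]
```

Because the table is indexed by hash value rather than by MAC, moving
one entry to another slave moves every MAC+VLAN pair that hashes to
that entry.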

Every 10 seconds, vswitchd rebalances the bond slaves (see
bond_rebalance_port()). To rebalance, vswitchd examines the
statistics for the number of bytes transmitted by each slave over
approximately the past minute, with data sent more recently weighted
more heavily than data sent less recently. It considers each of the
slaves in order from most-loaded to least-loaded. If highly loaded
slave H is significantly more heavily loaded than the least-loaded
slave L, and slave H carries at least two hashes, then vswitchd shifts
one of H's hashes to L. However, vswitchd will only shift a hash from
H to L if it will decrease the ratio of the load between H and L by at
least 0.1.

Currently, "significantly more loaded" means that H must carry at
least 1 Mbps more traffic, and that traffic must be at least 3%
greater than L's.
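
The two rebalancing thresholds can be written out as predicates. This
Python sketch uses hypothetical function names and expresses loads in
bits per second. Reading the 3% threshold as "H's excess is at least
3% of L's load", measuring the ratio as larger load over smaller
load, and treating an idle L as an infinite ratio are assumptions of
the sketch, not statements about bond_rebalance_port().

```python
def significantly_more_loaded(h_bps: float, l_bps: float) -> bool:
    """H exceeds L by at least 1 Mbps and by at least 3% of L's load."""
    excess = h_bps - l_bps
    return excess >= 1_000_000 and excess >= 0.03 * l_bps


def load_ratio(a_bps: float, b_bps: float) -> float:
    """Ratio of the larger load to the smaller one (inf if one is idle)."""
    hi, lo = max(a_bps, b_bps), min(a_bps, b_bps)
    return float("inf") if lo == 0 else hi / lo


def shift_is_worthwhile(h_bps: float, l_bps: float, hash_bps: float) -> bool:
    """True if moving a hash carrying 'hash_bps' from H to L lowers the
    H:L load ratio by at least 0.1."""
    old = load_ratio(h_bps, l_bps)
    new = load_ratio(h_bps - hash_bps, l_bps + hash_bps)
    return old - new >= 0.1
```

For example, with H at 10 Mbps and L at 5 Mbps, moving a 2 Mbps hash
changes the ratio from 2.0 to 8/7, so the shift qualifies; moving a
5 Mbps hash merely swaps the two loads and does not.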

Bond Balance Modes
------------------

Each bond balancing mode has different considerations, described
below.

LACP Bonding
------------

LACP bonding requires the remote switch to implement LACP, but it is
otherwise very simple in that, after LACP negotiation is complete,
there is no need for special handling of received packets.

Active Backup Bonding
---------------------

Active Backup bonds send all traffic out one "active" slave until that
slave becomes unavailable. Since they are significantly less
complicated than SLB bonds, they are preferred when LACP is not an
option. Additionally, they are the only bond mode which supports
attaching each slave to a different upstream switch.

SLB Bonding
-----------

SLB bonding allows a limited form of load balancing without the remote
switch's knowledge or cooperation. The basics of SLB are simple. SLB
assigns each source MAC+VLAN pair to a link and transmits all packets
from that MAC+VLAN through that link. Learning in the remote switch
causes it to send packets to that MAC+VLAN through the same link.

SLB bonding has the following complications:

   0. When the remote switch has not learned the MAC for the
      destination of a unicast packet and hence floods the packet to
      all of the links on the SLB bond, Open vSwitch will forward
      duplicate packets, one per link, to each other switch port.

      Open vSwitch does not solve this problem.

   1. When the remote switch receives a multicast or broadcast packet
      from a port not on the SLB bond, it will forward it to all of
      the links in the SLB bond. This would cause packet duplication
      if not handled specially.

      Open vSwitch avoids packet duplication by accepting multicast
      and broadcast packets on only the active slave, and dropping
      multicast and broadcast packets on all other slaves.

   2. When Open vSwitch forwards a multicast or broadcast packet to a
      link in the SLB bond other than the active slave, the remote
      switch will forward it to all of the other links in the SLB
      bond, including the active slave. Without special handling,
      this would mean that Open vSwitch would forward a second copy of
      the packet to each switch port (other than the bond), including
      the port that originated the packet.

      Open vSwitch deals with this case by dropping packets received
      on any SLB bonded link that have a source MAC+VLAN that has been
      learned on any other port. (This means that SLB as implemented
      in Open vSwitch relies critically on MAC learning. Notably, SLB
      is incompatible with the "flood_vlans" feature.)

   3. Suppose that a MAC+VLAN moves to an SLB bond from another port
      (e.g. when a VM is migrated from this hypervisor to a different
      one). Without additional special handling, Open vSwitch will
      not notice until the MAC learning entry expires, up to 60
      seconds later, as a consequence of rule #2.

      Open vSwitch avoids a 60-second delay by listening for
      gratuitous ARPs, which VMs commonly emit upon migration. As an
      exception to rule #2, a gratuitous ARP received on an SLB bond
      is not dropped and updates the MAC learning table in the usual
      way. (If a move does not trigger a gratuitous ARP, or if the
      gratuitous ARP is lost in the network, then a 60-second delay
      still occurs.)

   4. Suppose that a MAC+VLAN moves from an SLB bond to another port
      (e.g. when a VM is migrated from a different hypervisor to this
      one), that the MAC+VLAN emits a gratuitous ARP, and that Open
      vSwitch forwards that gratuitous ARP to a link in the SLB bond
      other than the active slave. The remote switch will forward the
      gratuitous ARP to all of the other links in the SLB bond,
      including the active slave. Without additional special
      handling, this would mean that Open vSwitch would learn that the
      MAC+VLAN was located on the SLB bond, as a consequence of rule
      #3.

      Open vSwitch avoids this problem by "locking" the MAC learning
      table entry for a MAC+VLAN from which a gratuitous ARP was
      received from a non-SLB bond port. For 5 seconds, a locked MAC
      learning table entry will not be updated based on a gratuitous
      ARP received on an SLB bond.
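
Rules #2 through #4 interact, so a small model may help. The Python
sketch below is illustrative only: SlbLearning, Entry, and receive()
are hypothetical names, expiry of MAC learning entries (the 60-second
timeout) is not modeled, and dropping a gratuitous ARP that hits a
locked entry (rather than merely not learning from it) is an
assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

GRAT_ARP_LOCK_SECS = 5.0
Key = Tuple[str, int]  # (MAC, VLAN)


@dataclass
class Entry:
    port: str
    locked_until: float = 0.0


class SlbLearning:
    """Toy MAC learning table for a switch with one SLB bond port."""

    def __init__(self, bond_port: str):
        self.bond_port = bond_port
        self.table: Dict[Key, Entry] = {}

    def receive(self, key: Key, in_port: str, gratuitous_arp: bool,
                now: float) -> bool:
        """Return True if the packet is accepted, learning from it if so."""
        entry = self.table.get(key)
        on_bond = in_port == self.bond_port
        if on_bond and entry is not None and entry.port != self.bond_port:
            # Rule #2: source learned elsewhere, so this is probably our
            # own packet reflected back by the remote switch.
            if not gratuitous_arp:
                return False
            # Rule #4: the entry was locked by a recent gratuitous ARP
            # seen on a non-bond port.
            if now < entry.locked_until:
                return False
            # Rule #3: otherwise a gratuitous ARP is allowed to relearn.
        if entry is None:
            entry = self.table[key] = Entry(port=in_port)
        else:
            entry.port = in_port
        if gratuitous_arp and not on_bond:
            entry.locked_until = now + GRAT_ARP_LOCK_SECS
        return True
```

The lock only matters for gratuitous ARPs arriving on the bond;
ordinary packets with a source learned elsewhere are already dropped
by rule #2.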