========================
ovs-vswitchd Internals
========================

This document describes some of the internals of the ovs-vswitchd
process. It is not complete. It tends to be updated on demand, so if
you have questions about the vswitchd implementation, ask them and
perhaps we'll add some appropriate documentation here.

Most of the ovs-vswitchd implementation is in vswitchd/bridge.c, so
code references below should be assumed to refer to that file except
as otherwise specified.

Bonding
=======

Bonding allows two or more interfaces (the "slaves") to share network
traffic. From a high-level point of view, bonded interfaces act like
a single port, but they have the bandwidth of multiple network
devices, e.g. two 1 Gbps physical interfaces act like a single 2 Gbps
interface. Bonds also increase robustness: the bonded port does not
go down as long as at least one of its slaves is up.

In vswitchd, a bond always has at least two slaves (and may have
more). If a configuration error or similar problem would leave a bond
with only one slave, the port becomes an ordinary port, not a bonded
port, and none of the special features of bonded ports described in
this section apply.

There are many forms of bonding, but ovs-vswitchd currently implements
only a single kind, called "source load balancing" or SLB bonding.
SLB bonding divides traffic among the slaves based on the Ethernet
source address. This is useful only if the traffic over the bond has
multiple Ethernet source addresses, for example if network traffic
from multiple VMs is multiplexed over the bond.

Enabling and Disabling Slaves
-----------------------------

When a bond is created, a slave is initially enabled or disabled based
on whether carrier is detected on the NIC (see iface_create()). After
that, a slave is disabled if its carrier goes down for a period of
time longer than the downdelay, and it is enabled if carrier comes up
for longer than the updelay (see bond_link_status_update()). There is
one exception where the updelay is skipped: if no slaves at all are
currently enabled, then the first slave on which carrier comes up is
enabled immediately.

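The enable/disable rules above can be sketched as a small state
update. This is only an illustration in Python, not the code in
bridge.c: the Slave class, the update_link_status() function, and the
millisecond timestamps are all hypothetical names and units chosen for
the example.

```python
from dataclasses import dataclass


@dataclass
class Slave:
    """Hypothetical per-slave state; field names are illustrative only."""
    name: str
    enabled: bool        # set from carrier when the bond is created
    carrier: bool
    carrier_since: int = 0   # time (ms) of the last carrier change


def update_link_status(slaves, slave, carrier, now, updelay, downdelay):
    """Apply one carrier reading for `slave` at time `now` (ms)."""
    if carrier != slave.carrier:
        slave.carrier = carrier
        slave.carrier_since = now
    if carrier == slave.enabled:
        return  # carrier agrees with the current state, nothing to do
    if carrier and not any(s.enabled for s in slaves):
        # Exception: no slave is enabled, so the updelay is skipped.
        slave.enabled = True
        return
    delay = updelay if carrier else downdelay
    if now - slave.carrier_since > delay:
        slave.enabled = carrier
```

Calling update_link_status() on every carrier poll is enough for this
sketch: a slave changes state only once its carrier has disagreed with
its enabled flag for longer than the relevant delay.
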
The updelay should be set to a time longer than the STP forwarding
delay of the physical switch to which the bond port is connected (if
STP is enabled on that switch). Otherwise, the slave will be enabled,
and load may be shifted to it, before the physical switch starts
forwarding packets on that port, which can cause some data to be
"blackholed" for a time. The exception for a single enabled slave
does not cause any problem in this regard because when no slaves are
enabled all output packets are blackholed anyway.

When a slave becomes disabled, the vswitch immediately chooses a new
output port for traffic that was destined for that slave (see
bond_enable_slave()). It also sends a "gratuitous learning packet" on
the bond port (on the newly chosen slave) for each MAC address that
the vswitch has learned on a port other than the bond (see
bond_send_learning_packets()), to teach the physical switch that the
new slave should be used in place of the one that is now disabled.
(This behavior probably makes sense only for a vswitch that has only
one port (the bond) connected to a physical switch; vswitchd should
probably provide a way to disable or configure it in other scenarios.)

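To make the selection of learning packets concrete, here is a minimal
Python sketch. It assumes the MAC learning table can be viewed as a
mapping from MAC address to the port on which it was learned; the
learning_packet_macs() name and that table shape are hypothetical, not
taken from bridge.c.

```python
def learning_packet_macs(mac_table, bond_port):
    """Return the MACs that would get a gratuitous learning packet.

    `mac_table` is a hypothetical dict of MAC address -> port name on
    which the vswitch learned that MAC.  Only MACs learned on ports
    other than the bond qualify.
    """
    return sorted(mac for mac, port in mac_table.items()
                  if port != bond_port)
```

MACs learned on the bond itself are skipped because they live on the
far side of the physical switch, which does not need to be taught
about them.
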
Bond Packet Input
-----------------

Bond packet input processing takes place in process_flow().

Bonding accepts unicast packets on any bond slave. This can
occasionally cause packet duplication for the first few packets sent
to a given MAC, if the physical switch attached to the bond is
flooding packets to that MAC because it has not yet learned the
correct slave for that MAC.

Bonding accepts multicast (and broadcast) packets on only a single
bond slave (the "active slave") at any given time. Multicast packets
received on other slaves are dropped. Otherwise, every multicast
packet would be duplicated, once for every bond slave, because the
physical switch attached to the bond will flood those packets.

Bonding also drops received packets when the vswitch has learned that
the packet's source MAC is on a port other than the bond port itself.
This is because it is likely that the vswitch itself sent the packet
out the bond port on a different slave and is now receiving the packet
back. This occurs when the packet is multicast or the physical switch
has not yet learned the MAC and is flooding it. However, the vswitch
makes an exception to this rule for broadcast ARP replies, which
indicate that the MAC has moved to another switch, probably due to VM
migration. (ARP replies are normally unicast, so this exception does
not match normal ARP replies. It will match the learning packets sent
on bond fail-over.)

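The three input rules above can be condensed into one admission check.
The following Python sketch is an illustration only: the function name
and parameters are hypothetical, it assumes "the packet's MAC" means
the source MAC, and the order in which the rules are tested is an
assumption, since the text does not specify it.

```python
def accept_bond_packet(is_multicast, in_slave, active_slave,
                       learned_port, bond_port,
                       is_broadcast_arp_reply=False):
    """Decide whether a packet received on a bond slave is accepted.

    `learned_port` is the port on which the source MAC was learned,
    or None if the MAC is not in the learning table.
    """
    # Multicast and broadcast are accepted only on the active slave.
    if is_multicast and in_slave != active_slave:
        return False
    # A source MAC learned on a non-bond port suggests the vswitch is
    # seeing its own packet come back, unless this is a broadcast ARP
    # reply announcing that the MAC has moved.
    if (learned_port is not None and learned_port != bond_port
            and not is_broadcast_arp_reply):
        return False
    return True
```
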
The active slave is simply the first slave to be enabled after the
bond is created (see bond_choose_active_iface()). If the active slave
is disabled, then a new active slave is chosen among the slaves that
remain enabled. Currently, due to the way that configuration works,
this tends to be the remaining slave whose interface name is first
alphabetically, but this is by no means guaranteed.

Bond Packet Output
------------------

When a packet is sent out a bond port, the bond slave actually used is
selected based on the packet's source MAC (see choose_output_iface()).
In particular, the source MAC is hashed into one of 256 values, and
that value is looked up in a hash table (the "bond hash") kept in the
"bond_hash" member of struct port. The hash table entry identifies a
bond slave. If no bond slave has yet been chosen for that hash table
entry, vswitchd chooses one arbitrarily.

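A rough Python sketch of this lookup follows. The hash function here
(CRC-32 masked to 8 bits) and the rule for the "arbitrary" choice are
stand-ins picked for the example; the text does not say which hash or
which choice vswitchd actually uses, and the function names are
hypothetical.

```python
import zlib

BOND_MASK = 0xFF  # 256 hash values, as described above


def bond_hash(src_mac: bytes) -> int:
    """Map a source MAC to one of 256 values (stand-in hash)."""
    return zlib.crc32(src_mac) & BOND_MASK


def choose_output_slave(bond_table, src_mac, enabled_slaves):
    """Return the slave for `src_mac`, assigning one on first use.

    `bond_table` is a 256-entry list whose entries are slave names,
    or None where no slave has been chosen yet.
    """
    bucket = bond_hash(src_mac)
    if bond_table[bucket] is None:
        # "Arbitrary" choice; spreading by bucket number is an assumption.
        bond_table[bucket] = enabled_slaves[bucket % len(enabled_slaves)]
    return bond_table[bucket]
```

Once an entry is filled in, every packet whose source MAC hashes to it
leaves on the same slave until the entry is changed, for example by
rebalancing.
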
Every 10 seconds, vswitchd rebalances the bond slaves (see
bond_rebalance_port()). To rebalance, vswitchd examines the
statistics for the number of bytes transmitted by each slave over
approximately the past minute, with data sent more recently weighted
more heavily than data sent less recently. It considers each of the
slaves in order from most-loaded to least-loaded. If highly loaded
slave H is significantly more heavily loaded than the least-loaded
slave L, and slave H carries at least two hashes, then vswitchd shifts
one of H's hashes to L. However, vswitchd will only shift a hash from
H to L if it will decrease the ratio of the load between H and L by at
least 0.1.

Currently, "significantly more loaded" means that H must carry at
least 1 Mbps more traffic, and that traffic must be at least 3%
greater than L's.
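
The rebalancing thresholds can be written out as a single predicate.
This Python sketch is a reading of the rules above, not the logic of
bond_rebalance_port(): the should_shift() name is hypothetical, loads
are taken to be in bits per second, and "the ratio of the load between
H and L" is interpreted as H's load divided by L's, which the text
does not spell out.

```python
def should_shift(h_load, l_load, hash_load, h_hash_count):
    """Return True if one hash carrying `hash_load` may move from H to L.

    All loads are in bits per second (an assumption of this sketch).
    """
    if h_hash_count < 2:
        return False              # H must carry at least two hashes
    if h_load - l_load < 1_000_000:
        return False              # H must carry at least 1 Mbps more
    if h_load * 100 < l_load * 103:
        return False              # ...and at least 3% more than L
    if hash_load <= 0:
        return False              # moving an idle hash changes nothing
    old_ratio = float("inf") if l_load == 0 else h_load / l_load
    new_ratio = (h_load - hash_load) / (l_load + hash_load)
    # Only shift if the H/L load ratio drops by at least 0.1.
    return old_ratio - new_ratio >= 0.1
```

For example, with H at 10 Mbps, L at 5 Mbps, and a 2 Mbps hash, the
ratio falls from 2.0 to 8/7, so the shift is allowed; a 100 kbps hash
between 20 Mbps and 10 Mbps slaves moves the ratio by less than 0.1
and is left alone.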