.. _wecmp_linkbw:

Weighted ECMP using BGP link bandwidth
======================================

.. _features-of-wecmp-linkbw:

Overview
--------

In normal equal cost multipath (ECMP), the route to a destination has
multiple next hops and traffic is expected to be equally distributed
across these next hops. In practice, flow-based hashing is used so that
all traffic associated with a particular flow uses the same next hop,
and by extension, the same path across the network.

Weighted ECMP using BGP link bandwidth introduces support for network-wide
unequal cost multipathing (UCMP) to an IP destination. The unequal cost
load balancing is implemented by the forwarding plane based on the weights
associated with the next hops of the IP prefix. These weights are computed
from the bandwidths of the corresponding multipaths, which are encoded
in the ``BGP link bandwidth extended community`` as specified in
[Draft-IETF-idr-link-bandwidth]_. Exchange of an appropriate BGP link
bandwidth value for a prefix across the network results in network-wide
unequal cost multipathing.

One of the primary use cases of this capability is in the data center when
a service (represented by its anycast IP) has an unequal set of resources
across the regions (e.g., PODs) of the data center and the network itself
provides the load balancing function instead of an external load balancer.
Refer to [Draft-IETF-mohanty-bess-ebgp-dmz]_ and :rfc:`7938` for details
on this use case. This use case is applicable in a pure L3 network as
well as in an EVPN network.

The traditional use case for BGP link bandwidth, load balancing traffic
to the exit routers in the AS based on the bandwidth of their external
eBGP peering links, is also supported.


Design Principles
-----------------

Next hop weight computation and usage
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As described, in UCMP, there is a weight associated with each next hop of an
IP prefix, and traffic is expected to be distributed across the next hops in
proportion to their weight. The weight of a next hop is the bandwidth of the
corresponding path as a fraction of the total bandwidth of all multipaths,
mapped to the range 1 to 100. What happens if not all the paths in the
multipath set have link bandwidth associated with them? In such a case, in
adherence to [Draft-IETF-idr-link-bandwidth]_, the behavior reverts to
standard ECMP among all the multipaths, with the link bandwidth being
effectively ignored.
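
The computation described above can be sketched in a few lines of Python.
This is an illustration only, not FRR's implementation: the function name
``next_hop_weights``, the use of integer (floor) division, and the clamp to a
minimum weight of 1 are assumptions chosen to fit the stated range of 1 to 100.

.. code-block:: python

   from typing import List, Optional


   def next_hop_weights(bandwidths: List[Optional[int]]) -> List[int]:
       """Map per-path link bandwidths to next hop weights.

       A bandwidth of None means the path carried no link bandwidth.
       Equal weights of 1 for every path stand for plain ECMP.
       """
       # Default behavior: if any path lacks link bandwidth, revert to ECMP.
       if any(bw is None for bw in bandwidths):
           return [1] * len(bandwidths)
       total = sum(bandwidths)
       if total == 0:
           return [1] * len(bandwidths)
       # Assumption: floor the percentage and never go below 1.
       return [max(1, bw * 100 // total) for bw in bandwidths]

With two paths whose bandwidths are in a 2:1 ratio, this sketch yields weights
of 66 and 33; if one of the two paths had no link bandwidth, both next hops
would get the same weight.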

Note that there is no change to either the BGP best path selection algorithm
or to the multipath computation algorithm; the mapping of link bandwidth to
weight happens at the time of installation of the route in the RIB.

If data forwarding is implemented by means of the Linux kernel, the next hop’s
weight is used in the hash calculation. The kernel uses the hash-threshold
algorithm, and use of the next hop weight is built into it; next hops need
not be expanded to achieve UCMP. UCMP for IPv4 is available in older Linux
kernels too, while UCMP for IPv6 is available from the 4.16 kernel onwards.

If data forwarding is realized in hardware, common implementations expand
the next hops (i.e., they are repeated) in the ECMP container in proportion
to their weight. For example, if the weights associated with 3 next hops for
a particular route are 50, 25 and 25 and the ECMP container has a size of 16
next hops, the first next hop will be repeated 8 times and the other 2 next
hops repeated 4 times each. Other implementations are also possible.
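
The hardware expansion in the example above can be sketched as follows. This
is a simplified illustration under stated assumptions, not a description of
any particular ASIC: the function names are hypothetical, each share is
rounded down, and any slots left over by the rounding are simply not assigned.

.. code-block:: python

   from typing import List


   def expand_counts(weights: List[int], container_size: int) -> List[int]:
       """Return how many container slots each next hop occupies."""
       total = sum(weights)
       # Assumption: proportional share, rounded down.
       return [w * container_size // total for w in weights]


   def build_container(next_hops: List[str], weights: List[int],
                       container_size: int) -> List[str]:
       """Repeat each next hop according to its slot count."""
       container: List[str] = []
       for nh, count in zip(next_hops, expand_counts(weights, container_size)):
           container.extend([nh] * count)
       return container

For weights of 50, 25 and 25 and a container of 16 entries, ``expand_counts``
gives 8, 4 and 4 slots, matching the example in the text.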

Unequal cost multipath across a network
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For the use cases listed above, it is not sufficient to support UCMP on just
one router (e.g., egress router), or individually, on multiple routers; UCMP
must be deployed across the entire network. This is achieved by employing the
BGP link-bandwidth extended community.

The router that originates the BGP link bandwidth must be explicitly
configured to do so, as described below. Receiving routers use the link
bandwidth received from their downstream routers to determine the next hop
weight as described in the earlier section. Further, if the received link
bandwidth is a transitive attribute, it is propagated to eBGP peers, with the
additional change that if the next hop is set to oneself, the cumulative link
bandwidth of all downstream paths is propagated to other routers. In this
manner, the entire network will know how to distribute traffic to an anycast
service across the network.

The BGP link-bandwidth extended community is encoded in bytes per second.
In the use case where UCMP must be based on the number of paths, a reference
bandwidth of 1 Mbps per path is used. So, for example, if there are 4 equal
cost paths to an anycast IP, the encoded bandwidth in the extended community
will be 500,000 (4 Mbps expressed in bytes per second). The actual value
itself doesn’t matter as long as all routers originating the link-bandwidth
are doing it in the same way.
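
The arithmetic behind these values can be sketched as below. The helper names
and the constant ``REFERENCE_BW_BPS`` are hypothetical and exist only for this
illustration; the sketch covers the unit conversion and the summing of
downstream bandwidths, not the on-the-wire encoding of the extended community.

.. code-block:: python

   from typing import List

   # Reference bandwidth of 1 Mbps per path, in bits per second.
   REFERENCE_BW_BPS = 1_000_000


   def num_multipaths_to_link_bw(num_paths: int) -> int:
       """Link bandwidth in bytes per second for a given number of paths."""
       return num_paths * REFERENCE_BW_BPS // 8


   def cumulative_link_bw(downstream_bw: List[int]) -> int:
       """Bandwidth advertised upstream when the next hop is set to self."""
       return sum(downstream_bw)


   def link_bw_to_mbps(bytes_per_second: int) -> float:
       """Convert an encoded link bandwidth back to Mbps for display."""
       return bytes_per_second * 8 / 1_000_000

Four paths give ``num_multipaths_to_link_bw(4) == 500000``, the value quoted
in the text, and an encoded value of 125,000,000 bytes per second corresponds
to 1000 Mbps.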


Configuration Guide
-------------------

The configuration for weighted ECMP using BGP link bandwidth requires
one essential step: using a route-map to inject the link bandwidth
extended community. An additional option is provided to control the
processing of received link bandwidth.

Injecting link bandwidth into the network
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

At the "entry point" router that is injecting the prefix to which weighted
load balancing must be performed, a route-map must be configured to
attach the link bandwidth extended community.

For the use case of providing weighted load balancing for an anycast service,
this configuration will typically need to be applied at the TOR or Leaf
router that is connected to servers which provide the anycast service, and
the bandwidth would be based on the number of multipaths for the destination.

For the use case of load balancing to the exit router, the exit router should
be configured with the route-map specifying a bandwidth value that
corresponds to the bandwidth of the link connecting to its eBGP peer in the
adjoining AS. In addition, the link bandwidth extended community must be
explicitly configured to be non-transitive.

The complete syntax of the route-map set command can be found at
:ref:`bgp-extended-communities-in-route-map`.

This route-map is supported only at two attachment points:

- the outbound route-map attached to a peer or peer-group, per address-family
- the EVPN advertise route-map used to inject IPv4 or IPv6 unicast routes
  into EVPN as type-5 routes

Since the link bandwidth origination is done by using a route-map, it can
be constrained to certain prefixes (e.g., only for anycast services) or it
can be generated for all prefixes. Further, when the route-map is used in
the neighbor context, the link bandwidth usage can be constrained to certain
peers only.

A sample configuration is shown below and illustrates link bandwidth
advertisement towards the "SPINE" peer-group for anycast IPs in the
range 192.168.x.x:

.. code-block:: frr

   ip prefix-list anycast_ip seq 10 permit 192.168.0.0/16 le 32
   route-map anycast_ip permit 10
    match ip address prefix-list anycast_ip
    set extcommunity bandwidth num-multipaths
   route-map anycast_ip permit 20
   !
   router bgp 65001
    neighbor SPINE peer-group
    neighbor SPINE remote-as external
    neighbor 172.16.35.1 peer-group SPINE
    neighbor 172.16.36.1 peer-group SPINE
    !
    address-family ipv4 unicast
     network 110.0.0.1/32
     network 192.168.44.1/32
     neighbor SPINE route-map anycast_ip out
    exit-address-family
   !


Controlling link bandwidth processing on the receiver
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There is no configuration necessary to process received link bandwidth and
translate it into the weight associated with the corresponding next hop;
that happens by default. If some of the multipaths do not have the link
bandwidth extended community, the default behavior is to revert to normal
ECMP as recommended in [Draft-IETF-idr-link-bandwidth]_.

The operator can change these behaviors with the following configuration:

.. clicmd:: bgp bestpath bandwidth <ignore | skip-missing | default-weight-for-missing>

The different options imply behavior as follows:

- ignore: Ignore link bandwidth completely for route installation
  (i.e., do regular ECMP, not weighted)
- skip-missing: Skip paths without link bandwidth and do UCMP among
  the others (if at least some paths have link-bandwidth)
- default-weight-for-missing: Assign a low default weight (value 1)
  to paths not having link bandwidth
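
The effect of the three options, alongside the default behavior, can be
sketched as follows. This is an illustrative model rather than FRR code: the
function name ``weights_with_policy`` and the ``"default"`` mode label are
hypothetical, ``None`` in the result marks a path that is skipped, and equal
weights of 1 for every path stand for plain ECMP.

.. code-block:: python

   from typing import List, Optional

   MODES = {"default", "ignore", "skip-missing", "default-weight-for-missing"}


   def weights_with_policy(bandwidths: List[Optional[int]],
                           mode: str = "default") -> List[Optional[int]]:
       """Return one weight per path; None marks a skipped path."""
       if mode not in MODES:
           raise ValueError(f"unknown mode: {mode}")
       ecmp: List[Optional[int]] = [1] * len(bandwidths)
       present = [bw for bw in bandwidths if bw is not None]
       total = sum(present)
       # "ignore", or no usable link bandwidth at all: regular ECMP.
       if mode == "ignore" or total == 0:
           return ecmp
       # Default behavior: any missing link bandwidth reverts to ECMP.
       if mode == "default" and len(present) < len(bandwidths):
           return ecmp
       # Paths without link bandwidth are skipped or given weight 1.
       fill = None if mode == "skip-missing" else 1
       return [fill if bw is None else max(1, bw * 100 // total)
               for bw in bandwidths]

For three paths with bandwidths ``[100, None, 300]``, this sketch returns
``[1, 1, 1]`` by default and with ``ignore``, ``[25, None, 75]`` with
``skip-missing``, and ``[25, 1, 75]`` with ``default-weight-for-missing``.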

This configuration is per BGP instance, similar to other BGP route-selection
controls; it operates on both IPv4-unicast and IPv6-unicast routes in that
instance. In an EVPN network, this configuration (if required) should be
implemented in the tenant VRF and is again applicable for IPv4-unicast and
IPv6-unicast, including the ones sourced from EVPN type-5 routes.

A sample snippet of FRR configuration on a receiver to skip paths without
link bandwidth and do weighted ECMP among the other paths (if some of them
have link bandwidth) is shown below.

.. code-block:: frr

   router bgp 65021
    bgp bestpath as-path multipath-relax
    bgp bestpath bandwidth skip-missing
    neighbor LEAF peer-group
    neighbor LEAF remote-as external
    neighbor 172.16.35.2 peer-group LEAF
    neighbor 172.16.36.2 peer-group LEAF
    !
    address-family ipv4 unicast
     network 130.0.0.1/32
    exit-address-family
   !


Stopping the propagation of the link bandwidth outside a domain
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The link bandwidth extended community will get automatically propagated
with the prefix to eBGP peers, if it is encoded as a transitive attribute
by the originator. If this propagation has to be stopped outside of a
particular domain (e.g., stopped from being propagated to routers outside
of the data center core network), the mechanism available is to disable
the advertisement of all BGP extended communities on the specific peering(s).
In other words, the propagation cannot be blocked just for the link bandwidth
extended community. The configuration to disable all extended communities
can be applied to a peer or peer-group (per address-family).

Of course, the other common way to stop the propagation of the link bandwidth
outside the domain is to block the prefixes themselves from being advertised
and possibly, announce only an aggregate route. This would be quite common
in an EVPN network.

BGP link bandwidth and UCMP monitoring & troubleshooting
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Existing operational commands to display the BGP routing table for a specific
prefix will also show the link bandwidth extended community, if present.

An example of an IPv4-unicast route received with the link bandwidth
attribute from two peers is shown below:

.. code-block:: frr

   CLI# show bgp ipv4 unicast 192.168.10.1/32
   BGP routing table entry for 192.168.10.1/32
   Paths: (2 available, best #2, table default)
     Advertised to non peer-group peers:
     l1(swp1) l2(swp2) l3(swp3) l4(swp4)
     65002
       fe80::202:ff:fe00:1b from l2(swp2) (110.0.0.2)
       (fe80::202:ff:fe00:1b) (used)
         Origin IGP, metric 0, valid, external, multipath, bestpath-from-AS 65002
         Extended Community: LB:65002:125000000 (1000.000 Mbps)
         Last update: Thu Feb 20 18:34:16 2020

     65001
       fe80::202:ff:fe00:15 from l1(swp1) (110.0.0.1)
       (fe80::202:ff:fe00:15) (used)
         Origin IGP, metric 0, valid, external, multipath, bestpath-from-AS 65001, best (Older Path)
         Extended Community: LB:65001:62500000 (500.000 Mbps)
         Last update: Thu Feb 20 18:22:34 2020

The weights associated with the next hops of a route can be seen by querying
the RIB for a specific route.

For example, the next hop weights corresponding to the link bandwidths in the
above example are illustrated below:

.. code-block:: frr

   spine1# show ip route 192.168.10.1/32
   Routing entry for 192.168.10.1/32
     Known via "bgp", distance 20, metric 0, best
     Last update 00:00:32 ago
     * fe80::202:ff:fe00:1b, via swp2, weight 66
     * fe80::202:ff:fe00:15, via swp1, weight 33

For troubleshooting, existing debug logs ``debug bgp updates``,
``debug bgp bestpath <prefix>``, ``debug bgp zebra`` and
``debug zebra kernel`` can be used.

A debug log snippet when ``debug bgp zebra`` is enabled and a route is
installed by BGP in the RIB with next hop weights is shown below:

.. code-block:: frr

   2020-02-29T06:26:19.927754+00:00 leaf1 bgpd[5459]: bgp_zebra_announce: p=192.168.150.1/32, bgp_is_valid_label: 0
   2020-02-29T06:26:19.928096+00:00 leaf1 bgpd[5459]: Tx route add VRF 33 192.168.150.1/32 metric 0 tag 0 count 2
   2020-02-29T06:26:19.928289+00:00 leaf1 bgpd[5459]: nhop [1]: 110.0.0.6 if 35 VRF 33 wt 50 RMAC 0a:11:2f:7d:35:20
   2020-02-29T06:26:19.928479+00:00 leaf1 bgpd[5459]: nhop [2]: 110.0.0.5 if 35 VRF 33 wt 50 RMAC 32:1e:32:a3:6c:bf
   2020-02-29T06:26:19.928668+00:00 leaf1 bgpd[5459]: bgp_zebra_announce: 192.168.150.1/32: announcing to zebra (recursion NOT set)


References
----------

.. [Draft-IETF-idr-link-bandwidth] <https://tools.ietf.org/html/draft-ietf-idr-link-bandwidth>
.. [Draft-IETF-mohanty-bess-ebgp-dmz] <https://tools.ietf.org/html/draft-mohanty-bess-ebgp-dmz>