.. _wecmp_linkbw:

Weighted ECMP using BGP link bandwidth
======================================

.. _features-of-wecmp-linkbw:

Overview
--------

In normal equal cost multipath (ECMP), the route to a destination has
multiple next hops and traffic is expected to be equally distributed
across these next hops. In practice, flow-based hashing is used so that
all traffic associated with a particular flow uses the same next hop,
and by extension, the same path across the network.

Weighted ECMP using BGP link bandwidth introduces support for network-wide
unequal cost multipathing (UCMP) to an IP destination. The unequal cost
load balancing is implemented by the forwarding plane based on the weights
associated with the next hops of the IP prefix. These weights are computed
based on the bandwidths of the corresponding multipaths which are encoded
in the ``BGP link bandwidth extended community`` as specified in
[Draft-IETF-idr-link-bandwidth]_. Exchange of an appropriate BGP link
bandwidth value for a prefix across the network results in network-wide
unequal cost multipathing.

One of the primary use cases of this capability is in the data center, where
a service (represented by its anycast IP) has an unequal set of resources
across the regions (e.g., PODs) of the data center and the network itself
provides the load balancing function instead of an external load balancer.
Refer to [Draft-IETF-mohanty-bess-ebgp-dmz]_ and :rfc:`7938` for details
on this use case. This use case is applicable in a pure L3 network as
well as in an EVPN network.

The traditional use case for BGP link bandwidth, namely load balancing
traffic to the exit routers of an AS based on the bandwidth of their
external eBGP peering links, is also supported.


Design Principles
-----------------

Next hop weight computation and usage
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As described, in UCMP, there is a weight associated with each next hop of an
IP prefix, and traffic is expected to be distributed across the next hops in
proportion to their weight. The weight of a next hop is a simple factoring
of the bandwidth of the corresponding path against the total bandwidth of
all multipaths, mapped to the range 1 to 100. What happens if not all the
paths in the multipath set have link bandwidth associated with them? In such
a case, in adherence to [Draft-IETF-idr-link-bandwidth]_, the behavior
reverts to standard ECMP among all the multipaths, with the link bandwidth
being effectively ignored.
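
As an illustration of this computation, if a prefix has three multipaths
with link bandwidths of 2000 Mbps, 1000 Mbps and 1000 Mbps, the total is
4000 Mbps and the weights installed with the next hops would be roughly
50, 25 and 25 (the exact rounding is an implementation detail).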

Note that there is no change to either the BGP best path selection algorithm
or to the multipath computation algorithm; the mapping of link bandwidth to
weight happens at the time of installation of the route in the RIB.

If data forwarding is implemented by means of the Linux kernel, the next hop's
weight is used in the hash calculation. The kernel uses the hash-threshold
algorithm, and use of the next hop weight is built into it; next hops need
not be expanded to achieve UCMP. UCMP for IPv4 is available in older Linux
kernels too, while UCMP for IPv6 is available from the 4.16 kernel onwards.

If data forwarding is realized in hardware, common implementations expand
the next hops (i.e., they are repeated) in the ECMP container in proportion
to their weight. For example, if the weights associated with 3 next hops for
a particular route are 50, 25 and 25 and the ECMP container has a size of 16
next hops, the first next hop will be repeated 8 times and the other 2 next
hops repeated 4 times each. Other implementations are also possible.

Unequal cost multipath across a network
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For the use cases listed above, it is not sufficient to support UCMP on just
one router (e.g., egress router), or individually, on multiple routers; UCMP
must be deployed across the entire network. This is achieved by employing the
BGP link-bandwidth extended community.

At the router which originates the BGP link bandwidth, there has to be user
configuration to trigger it, which is described below. Receiving routers
use the received link bandwidth from their downstream routers to
determine the next hop weight as described in the earlier section. Further,
if the received link bandwidth is a transitive attribute, it is
propagated to eBGP peers, with the additional change that if the next hop
is set to oneself, the cumulative link bandwidth of all downstream paths
is propagated to other routers. In this manner, the entire network
knows how to distribute traffic to an anycast service across the network.

The BGP link-bandwidth extended community is encoded in bytes-per-second.
In the use case where UCMP must be based on the number of paths, a reference
bandwidth of 1 Mbps is used. So, for example, if there are 4 equal cost paths
to an anycast IP, the encoded bandwidth in the extended community will be
500,000 (4 Mbps expressed in bytes-per-second). The actual value itself
doesn't matter as long as all routers originating the link-bandwidth are
doing it in the same way.


Configuration Guide
-------------------

The configuration for weighted ECMP using BGP link bandwidth requires
one essential step: using a route-map to inject the link bandwidth
extended community. An additional option is provided to control the
processing of received link bandwidth.

Injecting link bandwidth into the network
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

At the "entry point" router that is injecting the prefix for which weighted
load balancing must be performed, a route-map must be configured to
attach the link bandwidth extended community.

For the use case of providing weighted load balancing for an anycast service,
this configuration will typically need to be applied at the TOR or Leaf
router that is connected to servers which provide the anycast service and
the bandwidth would be based on the number of multipaths for the destination.

For the use case of load balancing to the exit router, the exit router should
be configured with a route-map specifying a bandwidth value that
corresponds to the bandwidth of the link connecting to its eBGP peer in the
adjoining AS. In addition, the link bandwidth extended community must be
explicitly configured to be non-transitive.
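
A minimal, illustrative sketch of such an exit-router configuration is shown
below; the peer-group name, addresses, AS number and the bandwidth value
(shown as 10000, assumed to be specified in Mbps) are examples only.

.. code-block:: frr

   ! Attach a fixed, non-transitive link bandwidth that reflects the
   ! capacity of the eBGP link towards the adjoining AS, and advertise
   ! it to the internal "CORE" peers.
   route-map attach-dmz-bw permit 10
    set extcommunity bandwidth 10000 non-transitive
   !
   router bgp 65030
    neighbor CORE peer-group
    neighbor CORE remote-as internal
    neighbor 192.0.2.10 peer-group CORE
    !
    address-family ipv4 unicast
     neighbor CORE route-map attach-dmz-bw out
    exit-address-family
   !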

The complete syntax of the route-map set command can be found at
:ref:`bgp-extended-communities-in-route-map`.

This route-map is supported only at two attachment points:

(a) the outbound route-map attached to a peer or peer-group, per
    address-family
(b) the EVPN advertise route-map used to inject IPv4 or IPv6 unicast routes
    into EVPN as type-5 routes (a sketch of this attachment point follows
    this list).
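
For attachment point (b), a minimal sketch is shown below. It assumes the
``anycast_ip`` route-map from the sample configuration later in this
section; the VRF name and AS number are purely illustrative.

.. code-block:: frr

   ! Inject IPv4 unicast routes from the tenant VRF into EVPN as type-5
   ! routes, attaching the link bandwidth via the route-map.
   router bgp 65001 vrf tenant1
    address-family l2vpn evpn
     advertise ipv4 unicast route-map anycast_ip
    exit-address-family
   !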

Since the link bandwidth origination is done by using a route-map, it can
be constrained to certain prefixes (e.g., only for anycast services) or it
can be generated for all prefixes. Further, when the route-map is used in
the neighbor context, the link bandwidth usage can be constrained to certain
peers only.

A sample configuration is shown below and illustrates link bandwidth
advertisement towards the "SPINE" peer-group for anycast IPs in the
range 192.168.0.0/16.

.. code-block:: frr

   ip prefix-list anycast_ip seq 10 permit 192.168.0.0/16 le 32
   route-map anycast_ip permit 10
    match ip address prefix-list anycast_ip
    set extcommunity bandwidth num-multipaths
   route-map anycast_ip permit 20
   !
   router bgp 65001
    neighbor SPINE peer-group
    neighbor SPINE remote-as external
    neighbor 172.16.35.1 peer-group SPINE
    neighbor 172.16.36.1 peer-group SPINE
    !
    address-family ipv4 unicast
     network 110.0.0.1/32
     network 192.168.44.1/32
     neighbor SPINE route-map anycast_ip out
    exit-address-family
   !


Controlling link bandwidth processing on the receiver
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There is no configuration necessary to process received link bandwidth and
translate it into the weight associated with the corresponding next hop;
that happens by default. If some of the multipaths do not have the link
bandwidth extended community, the default behavior is to revert to normal
ECMP as recommended in [Draft-IETF-idr-link-bandwidth]_.

The operator can change these behaviors with the following configuration:

.. clicmd:: bgp bestpath bandwidth <ignore | skip-missing | default-weight-for-missing>

The different options imply behavior as follows:

- ignore: Ignore link bandwidth completely for route installation
  (i.e., do regular ECMP, not weighted)
- skip-missing: Skip paths without link bandwidth and do UCMP among
  the others (if at least some paths have link-bandwidth)
- default-weight-for-missing: Assign a low default weight (value 1)
  to paths not having link bandwidth

This configuration is per BGP instance, similar to other BGP route-selection
controls; it operates on both IPv4-unicast and IPv6-unicast routes in that
instance. In an EVPN network, this configuration (if required) should be
implemented in the tenant VRF and is again applicable to IPv4-unicast and
IPv6-unicast routes, including the ones sourced from EVPN type-5 routes.
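
A minimal sketch of this configuration inside a tenant VRF is shown below;
the VRF name and AS number are illustrative.

.. code-block:: frr

   ! Hypothetical tenant-VRF instance; apply the bandwidth bestpath
   ! behavior to routes in this VRF, including those from type-5 routes.
   router bgp 65021 vrf tenant1
    bgp bestpath bandwidth skip-missing
   !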

A sample snippet of FRR configuration on a receiver to skip paths without
link bandwidth and do weighted ECMP among the other paths (if some of them
have link bandwidth) is shown below.

.. code-block:: frr

   router bgp 65021
    bgp bestpath as-path multipath-relax
    bgp bestpath bandwidth skip-missing
    neighbor LEAF peer-group
    neighbor LEAF remote-as external
    neighbor 172.16.35.2 peer-group LEAF
    neighbor 172.16.36.2 peer-group LEAF
    !
    address-family ipv4 unicast
     network 130.0.0.1/32
    exit-address-family
   !


Stopping the propagation of the link bandwidth outside a domain
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The link bandwidth extended community is automatically propagated
with the prefix to eBGP peers, if it is encoded as a transitive attribute
by the originator. If this propagation has to be stopped outside of a
particular domain (e.g., stopped from being propagated to routers outside
of the data center core network), the mechanism available is to disable
the advertisement of all BGP extended communities on the specific peerings.
In other words, the propagation cannot be blocked just for the link bandwidth
extended community. The configuration to disable all extended communities
can be applied to a peer or peer-group (per address-family), as sketched
below.
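
A sketch of such a configuration at a border router is shown below; the
peer-group name and AS number are illustrative. Note that this suppresses
all extended communities, not just the link bandwidth, on routes advertised
to these peers.

.. code-block:: frr

   ! Do not send any extended communities (including link bandwidth)
   ! towards the EXTERNAL peers at the domain boundary.
   router bgp 65030
    neighbor EXTERNAL peer-group
    neighbor EXTERNAL remote-as external
    !
    address-family ipv4 unicast
     no neighbor EXTERNAL send-community extended
    exit-address-family
   !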

Of course, the other common way to stop the propagation of the link bandwidth
outside the domain is to block the prefixes themselves from being advertised
and possibly announce only an aggregate route. This would be quite common
in an EVPN network.

BGP link bandwidth and UCMP monitoring & troubleshooting
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Existing operational commands to display the BGP routing table for a specific
prefix will also show the link bandwidth extended community, if present.

An example of an IPv4-unicast route received with the link bandwidth
attribute from two peers is shown below:

.. code-block:: frr

   CLI# show bgp ipv4 unicast 192.168.10.1/32
   BGP routing table entry for 192.168.10.1/32
   Paths: (2 available, best #2, table default)
     Advertised to non peer-group peers:
     l1(swp1) l2(swp2) l3(swp3) l4(swp4)
     65002
       fe80::202:ff:fe00:1b from l2(swp2) (110.0.0.2)
       (fe80::202:ff:fe00:1b) (used)
         Origin IGP, metric 0, valid, external, multipath, bestpath-from-AS 65002
         Extended Community: LB:65002:125000000 (1000.000 Mbps)
         Last update: Thu Feb 20 18:34:16 2020

     65001
       fe80::202:ff:fe00:15 from l1(swp1) (110.0.0.1)
       (fe80::202:ff:fe00:15) (used)
         Origin IGP, metric 0, valid, external, multipath, bestpath-from-AS 65001, best (Older Path)
         Extended Community: LB:65001:62500000 (500.000 Mbps)
         Last update: Thu Feb 20 18:22:34 2020

The weights associated with the next hops of a route can be seen by querying
the RIB for a specific route.

For example, the next hop weights corresponding to the link bandwidths in the
above example are illustrated below:

.. code-block:: frr

   spine1# show ip route 192.168.10.1/32
   Routing entry for 192.168.10.1/32
     Known via "bgp", distance 20, metric 0, best
     Last update 00:00:32 ago
     * fe80::202:ff:fe00:1b, via swp2, weight 66
     * fe80::202:ff:fe00:15, via swp1, weight 33

For troubleshooting, existing debug logs ``debug bgp updates``,
``debug bgp bestpath <prefix>``, ``debug bgp zebra`` and
``debug zebra kernel`` can be used.

A debug log snippet when ``debug bgp zebra`` is enabled and a route is
installed by BGP in the RIB with next hop weights is shown below:

.. code-block:: frr

   2020-02-29T06:26:19.927754+00:00 leaf1 bgpd[5459]: bgp_zebra_announce: p=192.168.150.1/32, bgp_is_valid_label: 0
   2020-02-29T06:26:19.928096+00:00 leaf1 bgpd[5459]: Tx route add VRF 33 192.168.150.1/32 metric 0 tag 0 count 2
   2020-02-29T06:26:19.928289+00:00 leaf1 bgpd[5459]: nhop [1]: 110.0.0.6 if 35 VRF 33 wt 50 RMAC 0a:11:2f:7d:35:20
   2020-02-29T06:26:19.928479+00:00 leaf1 bgpd[5459]: nhop [2]: 110.0.0.5 if 35 VRF 33 wt 50 RMAC 32:1e:32:a3:6c:bf
   2020-02-29T06:26:19.928668+00:00 leaf1 bgpd[5459]: bgp_zebra_announce: 192.168.150.1/32: announcing to zebra (recursion NOT set)


References
----------

.. [Draft-IETF-idr-link-bandwidth] <https://tools.ietf.org/html/draft-ietf-idr-link-bandwidth>
.. [Draft-IETF-mohanty-bess-ebgp-dmz] <https://tools.ietf.org/html/draft-mohanty-bess-ebgp-dmz>