]>
Commit | Line | Data |
---|---|---|
b22ba015 QY |
1 | Next Hop Tracking |
2 | ================== | |
3 | ||
4 | Next hop tracking is an optimization feature that reduces the processing time | |
5 | involved in the BGP bestpath algorithm by monitoring changes to the routing | |
6 | table. | |
7 | ||
8 | Background | |
9 | ----------- | |
10 | ||
11 | Recursive routes are of the form: | |
12 | ||
13 | :: | |
14 | ||
15 | p/m --> n | |
16 | [Ex: 1.1.0.0/16 --> 2.2.2.2] | |
17 | ||
18 | where 'n' itself is resolved through another route as follows: | |
19 | ||
20 | :: | |
21 | ||
22 | p2/m --> h, interface | |
23 | [Ex: 2.2.2.0/24 --> 3.3.3.3, eth0] | |
24 | ||
25 | Usually, BGP routes are recursive in nature and BGP nexthops get resolved | |
26 | through an IGP route. IGP usually adds its routes pointing to an interface | |
27 | (these are called non-recursive routes). | |
28 | ||
29 | When BGP receives a recursive route from a peer, it needs to validate the | |
30 | nexthop. The path is marked valid or invalid based on the reachability status | |
31 | of the nexthop. Nexthop validation is also important for BGP decision process | |
32 | as the metric to reach the nexthop is a parameter to best path selection | |
33 | process. | |
34 | ||
35 | As it goes with routing, this is a dynamic process. Route to the nexthop can | |
36 | change. The nexthop can become unreachable or reachable. In the current BGP | |
37 | implementation, the nexthop validation is done periodically in the scanner run. | |
38 | The default scanner run interval is one minute. Every minute, the scanner task | |
39 | walks the entire BGP table. It checks the validity of each nexthop with Zebra | |
40 | (the routing table manager) through a request and response message exchange | |
41 | between BGP and Zebra process. BGP process is blocked for that duration. The | |
42 | mechanism has two major drawbacks: | |
43 | ||
44 | - The scanner task runs to completion. That can potentially starve the other | |
45 | tasks for long periods of time, based on the BGP table size and number of | |
46 | nexthops. | |
47 | ||
48 | - Convergence around routing changes that affect the nexthops can be long | |
49 | (around a minute with the default intervals). The interval can be shortened | |
50 | to achieve faster reaction time, but it makes the first problem worse, with | |
51 | the scanner task consuming most of the CPU resources. | |
52 | ||
53 | The next-hop tracking feature makes this process event-driven. It eliminates | |
54 | periodic nexthop validation and introduces an asynchronous communication path | |
55 | between BGP and Zebra for route change notifications that can then be acted | |
56 | upon. | |
57 | ||
58 | Goal | |
59 | ---- | |
60 | ||
61 | Stating the obvious, the main goal is to remove the two limitations we | |
62 | discussed in the previous section. The goals, in a constructive tone, | |
63 | are the following: | |
64 | ||
65 | - **Fairness**: the scanner run should not consume an unjustly high amount of | |
66 | CPU time. This should give an overall good performance and response time to | |
67 | other events (route changes, session events, IO/user interface). | |
68 | ||
69 | - **Convergence**: BGP must react to nexthop changes instantly and provide | |
70 | sub-second convergence. This may involve diverting the routes from one | |
71 | nexthop to another. | |
72 | ||
73 | Overview of changes | |
74 | ------------------------ | |
75 | ||
76 | The changes are in both BGP and Zebra modules. The short summary is | |
77 | the following: | |
78 | ||
79 | - Zebra implements a registration mechanism by which clients can | |
80 | register for next hop notification. Consequently, it maintains a | |
81 | separate table, per (VRF, AF) pair, of next hops and interested | |
82 | client-list per next hop. | |
83 | ||
84 | - When the main routing table changes in Zebra, it evaluates the next | |
85 | hop table: for each next hop, it checks if the route table | |
86 | modifications have changed its state. If so, it notifies the | |
87 | interested clients. | |
88 | ||
89 | - BGP is one such client. It registers the next hops corresponding to | |
90 | all of its received routes/paths. It also threads the paths against | |
91 | each nexthop structure. | |
92 | ||
93 | - When BGP receives a next hop notification from Zebra, it walks the | |
94 | corresponding path list. It makes them valid or invalid depending | |
95 | on the next hop notification. It then re-computes best path for the | |
96 | corresponding destination. This may result in re-announcing those | |
97 | destinations to peers. | |
98 | ||
99 | Design | |
100 | ------ | |
101 | ||
102 | Modules | |
103 | ~~~~~~~ | |
104 | ||
105 | The core design introduces an "nht" (next hop tracking) module in BGP | |
106 | and "rnh" (recursive nexthop) module in Zebra. The "nht" module | |
107 | provides the following APIs: | |
108 | ||
109 | +----------------------------+--------------------------------------------------+ | |
110 | | Function | Action | | |
111 | +============================+==================================================+ | |
112 | | bgp_find_or_add_nexthop() | find or add a nexthop in BGP nexthop table | | |
113 | +----------------------------+--------------------------------------------------+ | |
114 | | bgp_find_nexthop() | find a nexthop in BGP nexthop table | | |
115 | +----------------------------+--------------------------------------------------+ | |
116 | | bgp_parse_nexthop_update() | parse a nexthop update message coming from zebra | | |
117 | +----------------------------+--------------------------------------------------+ | |
118 | ||
119 | The "rnh" module provides the following APIs: | |
120 | ||
121 | +----------------------------+----------------------------------------------------------------------------------------------------------+ | |
122 | | Function | Action | | |
123 | +============================+==========================================================================================================+ | |
124 | | zebra_add_rnh() | add a recursive nexthop | | |
125 | +----------------------------+----------------------------------------------------------------------------------------------------------+ | |
126 | | zebra_delete_rnh() | delete a recursive nexthop | | |
127 | +----------------------------+----------------------------------------------------------------------------------------------------------+ | |
128 | | zebra_lookup_rnh() | lookup a recursive nexthop | | |
129 | +----------------------------+----------------------------------------------------------------------------------------------------------+ | |
130 | | zebra_add_rnh_client() | register a client for nexthop notifications against a recursive nexthop | | |
131 | +----------------------------+----------------------------------------------------------------------------------------------------------+ | |
132 | | zebra_remove_rnh_client() | remove the client registration for a recursive nexthop | | |
133 | +----------------------------+----------------------------------------------------------------------------------------------------------+ | |
134 | | zebra_evaluate_rnh_table() | (re)evaluate the recursive nexthop table (most probably because the main routing table has changed). | | |
135 | +----------------------------+----------------------------------------------------------------------------------------------------------+ | |
136 | | zebra_cleanup_rnh_client() | Cleanup a client from the "rnh" module data structures (most probably because the client is going away). | | |
137 | +----------------------------+----------------------------------------------------------------------------------------------------------+ | |
138 | ||
139 | 4.2. Control flow | |
140 | ||
141 | The next hop registration control flow is the following: | |
142 | ||
143 | :: | |
144 | ||
145 | <==== BGP Process ====>|<==== Zebra Process ====> | |
146 | | | |
147 | receive module nht module | zserv module rnh module | |
148 | ---------------------------------------------------------------------- | |
149 | | | | | |
150 | bgp_update_ | | | | |
151 | main() | bgp_find_or_add_ | | | |
152 | | nexthop() | | | |
153 | | | | | |
154 | | | zserv_nexthop_ | | |
155 | | | register() | | |
156 | | | | zebra_add_rnh() | |
157 | | | | | |
76bd1499 | 158 | |
b22ba015 QY |
159 | |
160 | The next hop notification control flow is the following: | |
161 | ||
162 | :: | |
163 | ||
164 | <==== Zebra Process ====>|<==== BGP Process ====> | |
165 | | | |
166 | rib module rnh module | zebra module nht module | |
167 | ---------------------------------------------------------------------- | |
168 | | | | | |
169 | meta_queue_ | | | | |
170 | process() | zebra_evaluate_ | | | |
171 | | rnh_table() | | | |
172 | | | | | |
173 | | | bgp_read_nexthop_ | | |
174 | | | update() | | |
175 | | | | bgp_parse_ | |
176 | | | | nexthop_update() | |
177 | | | | | |
178 | ||
179 | ||
180 | zclient message format | |
181 | ~~~~~~~~~~~~~~~~~~~~~~ | |
182 | ||
183 | ZEBRA_NEXTHOP_REGISTER and ZEBRA_NEXTHOP_UNREGISTER messages are | |
184 | encoded in the following way: | |
185 | ||
186 | :: | |
187 | ||
188 | . 0 1 2 3 | |
189 | 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
190 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
191 | | AF | prefix len | | |
192 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
193 | . Nexthop prefix . | |
194 | . . | |
195 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
196 | . . | |
197 | . . | |
198 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
199 | | AF | prefix len | | |
200 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
201 | . Nexthop prefix . | |
202 | . . | |
203 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
76bd1499 | 204 | |
b22ba015 QY |
205 | |
206 | ``ZEBRA_NEXTHOP_UPDATE`` message is encoded as follows: | |
207 | ||
208 | :: | |
209 | ||
210 | . 0 1 2 3 | |
211 | 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
212 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
213 | | AF | prefix len | | |
214 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
215 | . Nexthop prefix getting resolved . | |
216 | . . | |
217 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
218 | | metric | | |
219 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
220 | | #nexthops | | |
221 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
222 | | nexthop type | | |
223 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
224 | . resolving Nexthop details . | |
225 | . . | |
226 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
227 | . . | |
228 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
229 | | nexthop type | | |
230 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
231 | . resolving Nexthop details . | |
232 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
233 | ||
234 | ||
235 | BGP data structure | |
236 | ~~~~~~~~~~~~~~~~~~ | |
237 | Legend: | |
238 | ||
239 | :: | |
240 | ||
241 | /\ struct bgp_node: a BGP destination/route/prefix | |
242 | \/ | |
76bd1499 | 243 | |
b22ba015 | 244 | [ ] struct bgp_info: a BGP path (e.g. route received from a peer) |
76bd1499 | 245 | |
b22ba015 QY |
246 | _ |
247 | (_) struct bgp_nexthop_cache: a BGP nexthop | |
76bd1499 | 248 | |
b22ba015 QY |
249 | /\ NULL |
250 | \/--+ ^ | |
251 | | : | |
252 | +--[ ]--[ ]--[ ]--> NULL | |
253 | /\ : | |
254 | \/--+ : | |
255 | | : | |
256 | +--[ ]--[ ]--> NULL | |
257 | : | |
258 | _ : | |
259 | (_)........... | |
260 | ||
261 | ||
262 | Zebra data structure | |
263 | ~~~~~~~~~~~~~~~~~~~~ | |
264 | ||
265 | RNH table:: | |
266 | ||
d7c0a89a QY |
267 | . O |
268 | / \ | |
269 | O O | |
270 | / \ | |
271 | O O | |
272 | ||
273 | struct rnh | |
274 | { | |
275 | uint8_t flags; | |
276 | struct route_entry *state; | |
277 | struct list *client_list; | |
278 | struct route_node *node; | |
279 | }; | |
b22ba015 QY |
280 | |
281 | User interface changes | |
282 | ~~~~~~~~~~~~~~~~~~~~~~ | |
283 | ||
284 | :: | |
285 | ||
286 | frr# show ip nht | |
287 | 3.3.3.3 | |
288 | resolved via kernel | |
289 | via 11.0.0.6, swp1 | |
290 | Client list: bgp(fd 12) | |
291 | 11.0.0.10 | |
292 | resolved via connected | |
293 | is directly connected, swp2 | |
294 | Client list: bgp(fd 12) | |
295 | 11.0.0.18 | |
296 | resolved via connected | |
297 | is directly connected, swp4 | |
298 | Client list: bgp(fd 12) | |
299 | 11.11.11.11 | |
300 | resolved via kernel | |
301 | via 10.0.1.2, eth0 | |
302 | Client list: bgp(fd 12) | |
76bd1499 | 303 | |
b22ba015 QY |
304 | frr# show ip bgp nexthop |
305 | Current BGP nexthop cache: | |
306 | 3.3.3.3 valid [IGP metric 0], #paths 3 | |
307 | Last update: Wed Oct 16 04:43:49 2013 | |
76bd1499 | 308 | |
b22ba015 QY |
309 | 11.0.0.10 valid [IGP metric 1], #paths 1 |
310 | Last update: Wed Oct 16 04:43:51 2013 | |
76bd1499 | 311 | |
b22ba015 QY |
312 | 11.0.0.18 valid [IGP metric 1], #paths 2 |
313 | Last update: Wed Oct 16 04:43:47 2013 | |
76bd1499 | 314 | |
b22ba015 QY |
315 | 11.11.11.11 valid [IGP metric 0], #paths 1 |
316 | Last update: Wed Oct 16 04:43:47 2013 | |
76bd1499 | 317 | |
b22ba015 QY |
318 | frr# show ipv6 nht |
319 | frr# show ip bgp nexthop detail | |
76bd1499 | 320 | |
b22ba015 QY |
321 | frr# debug bgp nht |
322 | frr# debug zebra nht | |
76bd1499 | 323 | |
b22ba015 | 324 | 6. Sample test cases |
76bd1499 | 325 | |
b22ba015 QY |
326 | r2----r3 |
327 | / \ / | |
328 | r1----r4 | |
76bd1499 | 329 | |
b22ba015 QY |
330 | - Verify that a change in IGP cost triggers NHT |
331 | + shutdown the r1-r4 and r2-r4 links | |
332 | + no shut the r1-r4 and r2-r4 links and wait for OSPF to come back | |
333 | up | |
334 | + We should be back to the original nexthop via r4 now | |
335 | - Verify that a NH becoming unreachable triggers NHT | |
336 | + Shutdown all links to r4 | |
337 | - Verify that a NH becoming reachable triggers NHT | |
338 | + no shut all links to r4 | |
339 | ||
340 | Future work | |
341 | ~~~~~~~~~~~ | |
342 | ||
343 | - route-policy for next hop validation (e.g. ignore default route) | |
344 | - damping for rapid next hop changes | |
345 | - prioritized handling of nexthop changes ((un)reachability vs. metric | |
346 | changes) | |
347 | - handling recursion loop, e.g:: | |
348 | ||
349 | 11.11.11.11/32 -> 12.12.12.12 | |
350 | 12.12.12.12/32 -> 11.11.11.11 | |
351 | 11.0.0.0/8 -> <interface> | |
352 | - better statistics |