0. Introduction

This is the design specification for the next hop tracking feature in
FRR.

1. Background

Recursive routes are of the form:

      p/m --> n
      [Ex: 1.1.0.0/16 --> 2.2.2.2]

where 'n' itself is resolved through another route as follows:

      p2/m --> h, interface
      [Ex: 2.2.2.0/24 --> 3.3.3.3, eth0]

Usually, BGP routes are recursive in nature and BGP nexthops get
resolved through an IGP route. IGP usually adds its routes pointing to
an interface (these are called non-recursive routes).

When BGP receives a recursive route from a peer, it needs to validate
the nexthop. The path is marked valid or invalid based on the
reachability status of the nexthop. Nexthop validation is also
important for the BGP decision process, as the metric to reach the
nexthop is a parameter in the best path selection process.

As it goes with routing, this is a dynamic process. The route to the
nexthop can change, and the nexthop can become unreachable or
reachable. In the current BGP implementation, nexthop validation is
done periodically in the scanner run. The default scanner run
interval is one minute. Every minute, the scanner task walks the
entire BGP table and checks the validity of each nexthop with Zebra
(the routing table manager) through a request and response message
exchange between the BGP and Zebra processes. The BGP process is
blocked for that duration. The mechanism has two major drawbacks:

(1) The scanner task runs to completion. That can potentially starve
    the other tasks for long periods of time, depending on the BGP
    table size and the number of nexthops.

(2) Convergence around routing changes that affect the nexthops can be
    slow (around a minute with the default interval). The interval
    can be shortened to achieve a faster reaction time, but that makes
    the first problem worse, with the scanner task consuming most of
    the CPU resources.

The "next hop tracking" feature makes this process event-driven. It
eliminates periodic nexthop validation and introduces an asynchronous
communication path between BGP and Zebra for route change
notifications that can then be acted upon.

2. Goal

Stating the obvious, the main goal is to remove the two limitations
discussed in the previous section. The goals, in a constructive tone,
are the following:

- fairness: the scanner run should not consume an unjustly high amount
  of CPU time. This should give overall good performance and response
  times to other events (route changes, session events, IO/user
  interface).

- convergence: BGP must react to nexthop changes instantly and provide
  sub-second convergence. This may involve diverting the routes from
  one nexthop to another.

3. Overview of the changes

The changes are in both the BGP and Zebra modules. The short summary
is the following:

- Zebra implements a registration mechanism by which clients can
  register for next hop notification. Consequently, it maintains a
  separate table, per (VRF, AF) pair, of next hops and an interested
  client list per next hop.

- When the main routing table changes in Zebra, it evaluates the next
  hop table: for each next hop, it checks whether the route table
  modifications have changed its state. If so, it notifies the
  interested clients.

- BGP is one such client. It registers the next hops corresponding to
  all of its received routes/paths. It also threads the paths against
  each nexthop structure.

- When BGP receives a next hop notification from Zebra, it walks the
  corresponding path list. It marks the paths valid or invalid
  depending on the notification, then re-computes the best path for
  the corresponding destination. This may result in re-announcing
  those destinations to peers.

4. Design

4.1. Modules

The core design introduces an "nht" (next hop tracking) module in BGP
and an "rnh" (recursive nexthop) module in Zebra. The "nht" module
provides the following APIs:

bgp_find_or_add_nexthop()  : find or add a nexthop in the BGP nexthop
                             table
bgp_find_nexthop()         : find a nexthop in the BGP nexthop table
bgp_parse_nexthop_update() : parse a nexthop update message coming
                             from zebra

The "rnh" module provides the following APIs:

zebra_add_rnh()           : add a recursive nexthop
zebra_delete_rnh()        : delete a recursive nexthop
zebra_lookup_rnh()        : look up a recursive nexthop

zebra_add_rnh_client()    : register a client for nexthop
                            notifications against a recursive nexthop

zebra_remove_rnh_client() : remove the client registration for a
                            recursive nexthop

zebra_evaluate_rnh_table(): (re)evaluate the recursive nexthop table
                            (most probably because the main routing
                            table has changed)

zebra_cleanup_rnh_client(): clean up a client from the "rnh" module
                            data structures (most probably because the
                            client is going away)

4.2. Control flow

The next hop registration control flow is the following:

             <==== BGP Process ====>|<==== Zebra Process ====>
                                    |
receive module      nht module      | zserv module       rnh module
----------------------------------------------------------------------
      |                  |                  |                  |
bgp_update_              |                  |                  |
  main()       bgp_find_or_add_             |                  |
      |          nexthop()                  |                  |
      |                  |                  |                  |
      |                  |           zserv_nexthop_            |
      |                  |             register()              |
      |                  |                  |           zebra_add_rnh()
      |                  |                  |                  |


The next hop notification control flow is the following:

             <==== Zebra Process ====>|<==== BGP Process ====>
                                      |
rib module          rnh module        | zebra module       nht module
----------------------------------------------------------------------
      |                  |                  |                  |
meta_queue_              |                  |                  |
  process()    zebra_evaluate_              |                  |
      |          rnh_table()                |                  |
      |                  |                  |                  |
      |                  |           bgp_read_nexthop_         |
      |                  |             update()                |
      |                  |                  |           bgp_parse_
      |                  |                  |         nexthop_update()
      |                  |                  |                  |

4.3. zclient message format

ZEBRA_NEXTHOP_REGISTER and ZEBRA_NEXTHOP_UNREGISTER messages are
encoded in the following way:

/*
 *  0                   1                   2                   3
 *  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * |     AF        |  prefix len   |
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * .                        Nexthop prefix                         .
 * .                                                               .
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * .                                                               .
 * .                                                               .
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * |     AF        |  prefix len   |
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * .                        Nexthop prefix                         .
 * .                                                               .
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 */
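
As a sketch of how one (AF, prefix) tuple of the register body could be
serialized, the helper below writes the fields in the order drawn
above. The field widths (16-bit AF, 8-bit prefix length, prefix
rounded up to whole bytes) are read off the diagram and are
assumptions; the real code uses the zclient stream API rather than a
raw buffer, and encode_nh_register() is a made-up name:

```c
/* Hypothetical flat-buffer encoder for one nexthop registration tuple;
 * field widths are assumptions taken from the diagram above. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Append one (AF, prefix) tuple to buf; returns bytes written. */
size_t encode_nh_register(uint8_t *buf, uint16_t af, uint8_t plen,
			  const uint8_t *prefix)
{
	size_t off = 0;

	buf[off++] = (uint8_t)(af >> 8);   /* AF, network byte order */
	buf[off++] = (uint8_t)(af & 0xff);
	buf[off++] = plen;                 /* prefix length in bits */

	/* Only the significant prefix bytes follow. */
	size_t nbytes = (plen + 7) / 8;
	memcpy(buf + off, prefix, nbytes);
	return off + nbytes;
}
```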

The ZEBRA_NEXTHOP_UPDATE message is encoded as follows:

/*
 *  0                   1                   2                   3
 *  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * |     AF        |  prefix len   |
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * .               Nexthop prefix getting resolved                 .
 * .                                                               .
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * |                            metric                             |
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * |                          #nexthops                            |
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * |                         nexthop type                          |
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * .                   resolving Nexthop details                   .
 * .                                                               .
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * .                                                               .
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * |                         nexthop type                          |
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * .                   resolving Nexthop details                   .
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 */
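
On the receiving side, bgp_parse_nexthop_update() walks this layout
field by field. The sketch below decodes only the fixed header (AF,
prefix length, prefix, metric, nexthop count); field widths are
assumptions taken from the diagram, the per-type "resolving Nexthop
details" decoding is deliberately elided, and struct and function
names are invented:

```c
/* Hypothetical decoder for the fixed header of a nexthop update body;
 * widths are assumptions read off the diagram above. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct nh_update_hdr {
	uint16_t af;
	uint8_t plen;
	uint8_t prefix[16];    /* big enough for IPv6 */
	uint32_t metric;
	uint32_t num_nexthops;
};

static uint32_t get_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | p[3];
}

/* Returns bytes consumed. */
size_t decode_nh_update_hdr(const uint8_t *buf, struct nh_update_hdr *h)
{
	size_t off = 0;

	h->af = (uint16_t)((buf[0] << 8) | buf[1]);
	off = 2;
	h->plen = buf[off++];

	size_t nbytes = (h->plen + 7) / 8;
	memset(h->prefix, 0, sizeof(h->prefix));
	memcpy(h->prefix, buf + off, nbytes);
	off += nbytes;

	h->metric = get_be32(buf + off);
	off += 4;
	h->num_nexthops = get_be32(buf + off);
	off += 4;
	return off;
}
```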

4.4. BGP data structure

Legend:

  /\   struct bgp_node: a BGP destination/route/prefix
  \/

  [ ]  struct bgp_info: a BGP path (e.g. a route received from a peer)

   _
  (_)  struct bgp_nexthop_cache: a BGP nexthop


  /\        NULL
  \/--+       ^
      |       :
      +--[ ]--[ ]--[ ]--> NULL
  /\          :
  \/--+       :
      |       :
      +--[ ]--[ ]--> NULL
              :
   _          :
  (_)..........

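
The dotted line in the diagram is the key to event-driven updates:
each path sits on its destination's list and is also threaded onto its
nexthop cache entry, so one notification can reach every affected path
without scanning the table. A toy version, with simplified stand-in
names and fields (the real structs carry far more state):

```c
/* Toy model of threading paths against a nexthop cache entry and
 * walking them on a notification; names are illustrative only. */
#include <assert.h>
#include <stddef.h>

struct toy_path {
	int valid;
	struct toy_path *nh_next;  /* thread through the nexthop entry */
};

struct toy_nexthop {
	int valid;
	struct toy_path *paths;    /* dotted line in the diagram */
};

void toy_thread_path(struct toy_nexthop *nh, struct toy_path *p)
{
	p->nh_next = nh->paths;
	nh->paths = p;
}

/* What the nht module does on a notification: walk and re-flag paths;
 * returns how many paths changed (their destinations then need a new
 * best-path computation). */
int toy_nexthop_update(struct toy_nexthop *nh, int reachable)
{
	int changed = 0;

	nh->valid = reachable;
	for (struct toy_path *p = nh->paths; p; p = p->nh_next)
		if (p->valid != reachable) {
			p->valid = reachable;
			changed++;
		}
	return changed;
}
```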

4.5. Zebra data structure

rnh table:

          O
         / \
        O   O
       / \
      O   O

      struct rnh
      {
              u_char flags;
              struct route_entry *state;
              struct list *client_list;
              struct route_node *node;
      };

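In spirit, zebra_evaluate_rnh_table() re-resolves each tracked nexthop
after a routing table change and notifies the entry's client list only
when the resolved state actually changed. The sketch below models that
with a flat array and stand-in callbacks (toy_* names and the
resolve/notify hooks are invented, not real zebra APIs):

```c
/* Toy re-evaluation of an rnh-style table: compare cached vs. freshly
 * resolved state, notify registered clients only on change. */
#include <assert.h>

#define CLIENTS_MAX 4

struct toy_rnh {
	unsigned addr;              /* tracked nexthop */
	int resolved;               /* cached resolution state */
	int clients[CLIENTS_MAX];   /* client fds, 0 = empty slot */
};

/* Re-evaluate every entry; returns number of notifications sent. */
int toy_evaluate_rnh_table(struct toy_rnh *tab, int len,
			   int (*resolve)(unsigned addr),
			   void (*notify)(int client, unsigned addr,
					  int resolved))
{
	int sent = 0;

	for (int i = 0; i < len; i++) {
		int now = resolve(tab[i].addr);

		if (now == tab[i].resolved)
			continue;   /* no state change: stay quiet */
		tab[i].resolved = now;
		for (int c = 0; c < CLIENTS_MAX; c++)
			if (tab[i].clients[c]) {
				notify(tab[i].clients[c], tab[i].addr, now);
				sent++;
			}
	}
	return sent;
}
```

The change-only check is what keeps the mechanism cheap: a routing
churn that does not alter any tracked nexthop's resolution produces no
client traffic at all.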
5. User interface changes

frr# show ip nht
3.3.3.3
 resolved via kernel
 via 11.0.0.6, swp1
 Client list: bgp(fd 12)
11.0.0.10
 resolved via connected
 is directly connected, swp2
 Client list: bgp(fd 12)
11.0.0.18
 resolved via connected
 is directly connected, swp4
 Client list: bgp(fd 12)
11.11.11.11
 resolved via kernel
 via 10.0.1.2, eth0
 Client list: bgp(fd 12)

frr# show ip bgp nexthop
Current BGP nexthop cache:
 3.3.3.3 valid [IGP metric 0], #paths 3
  Last update: Wed Oct 16 04:43:49 2013

 11.0.0.10 valid [IGP metric 1], #paths 1
  Last update: Wed Oct 16 04:43:51 2013

 11.0.0.18 valid [IGP metric 1], #paths 2
  Last update: Wed Oct 16 04:43:47 2013

 11.11.11.11 valid [IGP metric 0], #paths 1
  Last update: Wed Oct 16 04:43:47 2013

frr# show ipv6 nht
frr# show ip bgp nexthop detail

frr# debug bgp nht
frr# debug zebra nht

6. Sample test cases

       r2----r3
      /  \  /
    r1----r4

- Verify that a change in IGP cost triggers NHT
  + shut down the r1-r4 and r2-r4 links
  + no shut the r1-r4 and r2-r4 links and wait for OSPF to come back
    up
  + we should be back to the original nexthop via r4 now
- Verify that a NH becoming unreachable triggers NHT
  + shut down all links to r4
- Verify that a NH becoming reachable triggers NHT
  + no shut all links to r4

7. Future work

- route-policy for next hop validation (e.g. ignore the default route)
- damping for rapid next hop changes
- prioritized handling of nexthop changes ((un)reachability vs. metric
  changes)
- handling recursion loops, e.g.
     11.11.11.11/32 -> 12.12.12.12
     12.12.12.12/32 -> 11.11.11.11
     11.0.0.0/8 -> <interface>
- better statistics