]>
Commit | Line | Data |
---|---|---|
aee11342 MCC |
1 | .. SPDX-License-Identifier: GPL-2.0 |
2 | ||
3 | ============================ | |
4 | LC-trie implementation notes | |
5 | ============================ | |
b2f57102 RO |
6 | |
7 | Node types | |
8 | ---------- | |
aee11342 | 9 | leaf |
b2f57102 RO |
10 | An end node with data. This has a copy of the relevant key, along |
11 | with 'hlist' with routing table entries sorted by prefix length. | |
12 | See struct leaf and struct leaf_info. | |
13 | ||
14 | trie node or tnode | |
15 | An internal node, holding an array of child (leaf or tnode) pointers, | |
16 | indexed through a subset of the key. See Level Compression. | |
17 | ||
18 | A few concepts explained | |
19 | ------------------------ | |
aee11342 | 20 | Bits (tnode) |
b2f57102 RO |
21 | The number of bits in the key segment used for indexing into the |
22 | child array - the "child index". See Level Compression. | |
23 | ||
24 | Pos (tnode) | |
25 | The position (in the key) of the key segment used for indexing into | |
26 | the child array. See Path Compression. | |
27 | ||
28 | Path Compression / skipped bits | |
29 | Any given tnode is linked to from the child array of its parent, using | |
aee11342 | 30 | a segment of the key specified by the parent's "pos" and "bits" |
b2f57102 RO |
31 | In certain cases, this tnode's own "pos" will not be immediately |
32 | adjacent to the parent (pos+bits), but there will be some bits | |
33 | in the key skipped over because they represent a single path with no | |
34 | deviations. These "skipped bits" constitute Path Compression. | |
35 | Note that the search algorithm will simply skip over these bits when | |
36 | searching, making it necessary to save the keys in the leaves to | |
37 | verify that they actually do match the key we are searching for. | |
38 | ||
39 | Level Compression / child arrays | |
40 | the trie is kept level balanced moving, under certain conditions, the | |
41 | children of a full child (see "full_children") up one level, so that | |
42 | instead of a pure binary tree, each internal node ("tnode") may | |
43 | contain an arbitrarily large array of links to several children. | |
44 | Conversely, a tnode with a mostly empty child array (see empty_children) | |
45 | may be "halved", having some of its children moved downwards one level, | |
46 | in order to avoid ever-increasing child arrays. | |
47 | ||
48 | empty_children | |
49 | the number of positions in the child array of a given tnode that are | |
50 | NULL. | |
51 | ||
52 | full_children | |
53 | the number of children of a given tnode that aren't path compressed. | |
54 | (in other words, they aren't NULL or leaves and their "pos" is equal | |
55 | to this tnode's "pos"+"bits"). | |
56 | ||
57 | (The word "full" here is used more in the sense of "complete" than | |
58 | as the opposite of "empty", which might be a tad confusing.) | |
59 | ||
60 | Comments | |
61 | --------- | |
62 | ||
aee11342 MCC |
63 | We have tried to keep the structure of the code as close to fib_hash as |
64 | possible to allow verification and help up reviewing. | |
b2f57102 RO |
65 | |
66 | fib_find_node() | |
67 | A good start for understanding this code. This function implements a | |
68 | straightforward trie lookup. | |
69 | ||
70 | fib_insert_node() | |
71 | Inserts a new leaf node in the trie. This is bit more complicated than | |
72 | fib_find_node(). Inserting a new node means we might have to run the | |
73 | level compression algorithm on part of the trie. | |
74 | ||
75 | trie_leaf_remove() | |
76 | Looks up a key, deletes it and runs the level compression algorithm. | |
77 | ||
78 | trie_rebalance() | |
79 | The key function for the dynamic trie after any change in the trie | |
fd223068 DJ |
80 | it is run to optimize and reorganize. It will walk the trie upwards |
81 | towards the root from a given tnode, doing a resize() at each step | |
b2f57102 RO |
82 | to implement level compression. |
83 | ||
84 | resize() | |
85 | Analyzes a tnode and optimizes the child array size by either inflating | |
a2ffd275 | 86 | or shrinking it repeatedly until it fulfills the criteria for optimal |
b2f57102 RO |
87 | level compression. This part follows the original paper pretty closely |
88 | and there may be some room for experimentation here. | |
89 | ||
90 | inflate() | |
91 | Doubles the size of the child array within a tnode. Used by resize(). | |
92 | ||
93 | halve() | |
94 | Halves the size of the child array within a tnode - the inverse of | |
95 | inflate(). Used by resize(); | |
96 | ||
97 | fn_trie_insert(), fn_trie_delete(), fn_trie_select_default() | |
98 | The route manipulation functions. Should conform pretty closely to the | |
99 | corresponding functions in fib_hash. | |
100 | ||
101 | fn_trie_flush() | |
102 | This walks the full trie (using nextleaf()) and searches for empty | |
103 | leaves which have to be removed. | |
104 | ||
105 | fn_trie_dump() | |
106 | Dumps the routing table ordered by prefix length. This is somewhat | |
107 | slower than the corresponding fib_hash function, as we have to walk the | |
108 | entire trie for each prefix length. In comparison, fib_hash is organized | |
109 | as one "zone"/hash per prefix length. | |
110 | ||
111 | Locking | |
112 | ------- | |
113 | ||
114 | fib_lock is used for an RW-lock in the same way that this is done in fib_hash. | |
115 | However, the functions are somewhat separated for other possible locking | |
116 | scenarios. It might conceivably be possible to run trie_rebalance via RCU | |
117 | to avoid read_lock in the fn_trie_lookup() function. | |
118 | ||
119 | Main lookup mechanism | |
120 | --------------------- | |
121 | fn_trie_lookup() is the main lookup function. | |
122 | ||
123 | The lookup is in its simplest form just like fib_find_node(). We descend the | |
124 | trie, key segment by key segment, until we find a leaf. check_leaf() does | |
125 | the fib_semantic_match in the leaf's sorted prefix hlist. | |
126 | ||
127 | If we find a match, we are done. | |
128 | ||
129 | If we don't find a match, we enter prefix matching mode. The prefix length, | |
130 | starting out at the same as the key length, is reduced one step at a time, | |
131 | and we backtrack upwards through the trie trying to find a longest matching | |
132 | prefix. The goal is always to reach a leaf and get a positive result from the | |
133 | fib_semantic_match mechanism. | |
134 | ||
135 | Inside each tnode, the search for longest matching prefix consists of searching | |
136 | through the child array, chopping off (zeroing) the least significant "1" of | |
137 | the child index until we find a match or the child index consists of nothing but | |
138 | zeros. | |
139 | ||
140 | At this point we backtrack (t->stats.backtrack++) up the trie, continuing to | |
141 | chop off part of the key in order to find the longest matching prefix. | |
142 | ||
143 | At this point we will repeatedly descend subtries to look for a match, and there | |
144 | are some optimizations available that can provide us with "shortcuts" to avoid | |
145 | descending into dead ends. Look for "HL_OPTIMIZE" sections in the code. | |
146 | ||
147 | To alleviate any doubts about the correctness of the route selection process, | |
148 | a new netlink operation has been added. Look for NETLINK_FIB_LOOKUP, which | |
149 | gives userland access to fib_lookup(). |