[mirror_ubuntu-jammy-kernel.git] / Documentation / networking / fib_trie.rst

.. SPDX-License-Identifier: GPL-2.0

============================
LC-trie implementation notes
============================

Node types
----------
leaf
	An end node with data. This has a copy of the relevant key, along
	with 'hlist' with routing table entries sorted by prefix length.
	See struct leaf and struct leaf_info.

trie node or tnode
	An internal node, holding an array of child (leaf or tnode) pointers,
	indexed	through a subset of the key. See Level Compression.

A few concepts explained
------------------------
Bits (tnode)
	The number of bits in the key segment used for indexing into the
	child array - the "child index". See Level Compression.

Pos (tnode)
	The position (in the key) of the key segment used for indexing into
	the child array. See Path Compression.

Path Compression / skipped bits
	Any given tnode is linked to from the child array of its parent, using
	a segment of the key specified by the parent's "pos" and "bits"
	In certain cases, this tnode's own "pos" will not be immediately
	adjacent to the parent (pos+bits), but there will be some bits
	in the key skipped over because they represent a single path with no
	deviations. These "skipped bits" constitute Path Compression.
	Note that the search algorithm will simply skip over these bits when
	searching, making it necessary to save the keys in the leaves to
	verify that they actually do match the key we are searching for.

Level Compression / child arrays
	the trie is kept level balanced moving, under certain conditions, the
	children of a full child (see "full_children") up one level, so that
	instead of a pure binary tree, each internal node ("tnode") may
	contain an arbitrarily large array of links to several children.
	Conversely, a tnode with a mostly empty	child array (see empty_children)
	may be "halved", having some of its children moved downwards one level,
	in order to avoid ever-increasing child arrays.

empty_children
	the number of positions in the child array of a given tnode that are
	NULL.

full_children
	the number of children of a given tnode that aren't path compressed.
	(in other words, they aren't NULL or leaves and their "pos" is equal
	to this	tnode's "pos"+"bits").

	(The word "full" here is used more in the sense of "complete" than
	as the opposite of "empty", which might be a tad confusing.)

Comments
---------

We have tried to keep the structure of the code as close to fib_hash as
possible to allow verification and help up reviewing.

fib_find_node()
	A good start for understanding this code. This function implements a
	straightforward trie lookup.

fib_insert_node()
	Inserts a new leaf node in the trie. This is bit more complicated than
	fib_find_node(). Inserting a new node means we might have to run the
	level compression algorithm on part of the trie.

trie_leaf_remove()
	Looks up a key, deletes it and runs the level compression algorithm.

trie_rebalance()
	The key function for the dynamic trie after any change in the trie
	it is run to optimize and reorganize. It will walk the trie upwards
	towards the root from a given tnode, doing a resize() at each step
	to implement level compression.

resize()
	Analyzes a tnode and optimizes the child array size by either inflating
	or shrinking it repeatedly until it fulfills the criteria for optimal
	level compression. This part follows the original paper pretty closely
	and there may be some room for experimentation here.

inflate()
	Doubles the size of the child array within a tnode. Used by resize().

halve()
	Halves the size of the child array within a tnode - the inverse of
	inflate(). Used by resize();

fn_trie_insert(), fn_trie_delete(), fn_trie_select_default()
	The route manipulation functions. Should conform pretty closely to the
	corresponding functions in fib_hash.

fn_trie_flush()
	This walks the full trie (using nextleaf()) and searches for empty
	leaves which have to be removed.

fn_trie_dump()
	Dumps the routing table ordered by prefix length. This is somewhat
	slower than the corresponding fib_hash function, as we have to walk the
	entire trie for each prefix length. In comparison, fib_hash is organized
	as one "zone"/hash per prefix length.

Locking
-------

fib_lock is used for an RW-lock in the same way that this is done in fib_hash.
However, the functions are somewhat separated for other possible locking
scenarios. It might conceivably be possible to run trie_rebalance via RCU
to avoid read_lock in the fn_trie_lookup() function.

Main lookup mechanism
---------------------
fn_trie_lookup() is the main lookup function.

The lookup is in its simplest form just like fib_find_node(). We descend the
trie, key segment by key segment, until we find a leaf. check_leaf() does
the fib_semantic_match in the leaf's sorted prefix hlist.

If we find a match, we are done.

If we don't find a match, we enter prefix matching mode. The prefix length,
starting out at the same as the key length, is reduced one step at a time,
and we backtrack upwards through the trie trying to find a longest matching
prefix. The goal is always to reach a leaf and get a positive result from the
fib_semantic_match mechanism.

Inside each tnode, the search for longest matching prefix consists of searching
through the child array, chopping off (zeroing) the least significant "1" of
the child index until we find a match or the child index consists of nothing but
zeros.

At this point we backtrack (t->stats.backtrack++) up the trie, continuing to
chop off part of the key in order to find the longest matching prefix.

At this point we will repeatedly descend subtries to look for a match, and there
are some optimizations available that can provide us with "shortcuts" to avoid
descending into dead ends. Look for "HL_OPTIMIZE" sections in the code.

To alleviate any doubts about the correctness of the route selection process,
a new netlink operation has been added. Look for NETLINK_FIB_LOOKUP, which
gives userland access to fib_lookup().
Commit	Line	Data
aee11342 MCC	1	.. SPDX-License-Identifier: GPL-2.0
	2
	3	============================
	4	LC-trie implementation notes
	5	============================
b2f57102 RO	6
	7	Node types
	8	----------
aee11342	9	leaf
b2f57102 RO	10	An end node with data. This has a copy of the relevant key, along
	11	with 'hlist' with routing table entries sorted by prefix length.
	12	See struct leaf and struct leaf_info.
	13
	14	trie node or tnode
	15	An internal node, holding an array of child (leaf or tnode) pointers,
	16	indexed through a subset of the key. See Level Compression.
	17
	18	A few concepts explained
	19	------------------------
aee11342	20	Bits (tnode)
b2f57102 RO	21	The number of bits in the key segment used for indexing into the
	22	child array - the "child index". See Level Compression.
	23
	24	Pos (tnode)
	25	The position (in the key) of the key segment used for indexing into
	26	the child array. See Path Compression.
	27
	28	Path Compression / skipped bits
	29	Any given tnode is linked to from the child array of its parent, using
aee11342	30	a segment of the key specified by the parent's "pos" and "bits"
b2f57102 RO	31	In certain cases, this tnode's own "pos" will not be immediately
	32	adjacent to the parent (pos+bits), but there will be some bits
	33	in the key skipped over because they represent a single path with no
	34	deviations. These "skipped bits" constitute Path Compression.
	35	Note that the search algorithm will simply skip over these bits when
	36	searching, making it necessary to save the keys in the leaves to
	37	verify that they actually do match the key we are searching for.
	38
	39	Level Compression / child arrays
	40	the trie is kept level balanced moving, under certain conditions, the
	41	children of a full child (see "full_children") up one level, so that
	42	instead of a pure binary tree, each internal node ("tnode") may
	43	contain an arbitrarily large array of links to several children.
	44	Conversely, a tnode with a mostly empty child array (see empty_children)
	45	may be "halved", having some of its children moved downwards one level,
	46	in order to avoid ever-increasing child arrays.
	47
	48	empty_children
	49	the number of positions in the child array of a given tnode that are
	50	NULL.
	51
	52	full_children
	53	the number of children of a given tnode that aren't path compressed.
	54	(in other words, they aren't NULL or leaves and their "pos" is equal
	55	to this tnode's "pos"+"bits").
	56
	57	(The word "full" here is used more in the sense of "complete" than
	58	as the opposite of "empty", which might be a tad confusing.)
	59
	60	Comments
	61	---------
	62
aee11342 MCC	63	We have tried to keep the structure of the code as close to fib_hash as
aee11342 MCC	64	possible to allow verification and help up reviewing.
b2f57102 RO	65
	66	fib_find_node()
	67	A good start for understanding this code. This function implements a
	68	straightforward trie lookup.
	69
	70	fib_insert_node()
	71	Inserts a new leaf node in the trie. This is bit more complicated than
	72	fib_find_node(). Inserting a new node means we might have to run the
	73	level compression algorithm on part of the trie.
	74
	75	trie_leaf_remove()
	76	Looks up a key, deletes it and runs the level compression algorithm.
	77
	78	trie_rebalance()
	79	The key function for the dynamic trie after any change in the trie
fd223068 DJ	80	it is run to optimize and reorganize. It will walk the trie upwards
fd223068 DJ	81	towards the root from a given tnode, doing a resize() at each step
b2f57102 RO	82	to implement level compression.
	83
	84	resize()
	85	Analyzes a tnode and optimizes the child array size by either inflating
a2ffd275	86	or shrinking it repeatedly until it fulfills the criteria for optimal
b2f57102 RO	87	level compression. This part follows the original paper pretty closely
	88	and there may be some room for experimentation here.
	89
	90	inflate()
	91	Doubles the size of the child array within a tnode. Used by resize().
	92
	93	halve()
	94	Halves the size of the child array within a tnode - the inverse of
	95	inflate(). Used by resize();
	96
	97	fn_trie_insert(), fn_trie_delete(), fn_trie_select_default()
	98	The route manipulation functions. Should conform pretty closely to the
	99	corresponding functions in fib_hash.
	100
	101	fn_trie_flush()
	102	This walks the full trie (using nextleaf()) and searches for empty
	103	leaves which have to be removed.
	104
	105	fn_trie_dump()
	106	Dumps the routing table ordered by prefix length. This is somewhat
	107	slower than the corresponding fib_hash function, as we have to walk the
	108	entire trie for each prefix length. In comparison, fib_hash is organized
	109	as one "zone"/hash per prefix length.
	110
	111	Locking
	112	-------
	113
	114	fib_lock is used for an RW-lock in the same way that this is done in fib_hash.
	115	However, the functions are somewhat separated for other possible locking
	116	scenarios. It might conceivably be possible to run trie_rebalance via RCU
	117	to avoid read_lock in the fn_trie_lookup() function.
	118
	119	Main lookup mechanism
	120	---------------------
	121	fn_trie_lookup() is the main lookup function.
	122
	123	The lookup is in its simplest form just like fib_find_node(). We descend the
	124	trie, key segment by key segment, until we find a leaf. check_leaf() does
	125	the fib_semantic_match in the leaf's sorted prefix hlist.
	126
	127	If we find a match, we are done.
	128
	129	If we don't find a match, we enter prefix matching mode. The prefix length,
	130	starting out at the same as the key length, is reduced one step at a time,
	131	and we backtrack upwards through the trie trying to find a longest matching
	132	prefix. The goal is always to reach a leaf and get a positive result from the
	133	fib_semantic_match mechanism.
	134
	135	Inside each tnode, the search for longest matching prefix consists of searching
	136	through the child array, chopping off (zeroing) the least significant "1" of
	137	the child index until we find a match or the child index consists of nothing but
	138	zeros.
	139
	140	At this point we backtrack (t->stats.backtrack++) up the trie, continuing to
	141	chop off part of the key in order to find the longest matching prefix.
	142
	143	At this point we will repeatedly descend subtries to look for a match, and there
	144	are some optimizations available that can provide us with "shortcuts" to avoid
	145	descending into dead ends. Look for "HL_OPTIMIZE" sections in the code.
	146
	147	To alleviate any doubts about the correctness of the route selection process,
	148	a new netlink operation has been added. Look for NETLINK_FIB_LOOKUP, which
	149	gives userland access to fib_lookup().