.. SPDX-License-Identifier: BSD-3-Clause
   Copyright(c) 2016-2017 Intel Corporation.

.. _Efd_Library:

Elastic Flow Distributor Library
================================

Introduction
------------

In Data Centers today, clustering and scheduling of distributed
workloads is a very common task. Many workloads require a deterministic
partitioning of a flat key space among a cluster of machines. When a
packet enters the cluster, the ingress node will direct the packet to
its handling node. For example, data centers with disaggregated storage
use storage metadata tables to forward I/O requests to the correct back
end storage cluster, stateful packet inspection matches incoming flows
against signatures in flow tables to send packets to their intended
deep packet inspection (DPI) devices, and so on.

EFD is a distributor library that uses perfect hashing to determine a
target/value for a given incoming flow key. It has the following
advantages: first, because it uses perfect hashing it does not store the
key itself, and hence lookup performance is not dependent on the key
size. Second, the target/value can be any arbitrary value, so the
system designer and/or operator can better optimize service rates and
inter-cluster network traffic. Third, since the storage requirement is
much smaller than a hash-based flow table (i.e. a better fit for the
CPU cache), EFD can scale to millions of flow keys. Finally, with the
current optimized library implementation, performance is fully scalable
with any number of CPU cores.

Flow Based Distribution
-----------------------

Computation Based Schemes
~~~~~~~~~~~~~~~~~~~~~~~~~

Flow distribution and/or load balancing can be done simply with a
stateless computation, for instance using round-robin or a simple
computation based on the flow key as an input. For example, a hash
function can be used to direct a certain flow to a target based on
the flow key (e.g. ``h(key) mod n``), where ``h(key)`` is the hash
value of the flow key and ``n`` is the number of possible targets.

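This stateless computation can be illustrated with a short Python sketch; ``zlib.crc32`` is an arbitrary stand-in for the hash function ``h``, and the flow key and target count are made up for the example:

```python
import zlib

def distribute(flow_key: bytes, num_targets: int) -> int:
    """Stateless flow distribution: target = h(key) mod n."""
    return zlib.crc32(flow_key) % num_targets

# Packets of the same flow always reach the same target, with no state
# kept at the front end node.
flow = b"10.1.0.5:4000->10.2.0.9:80/TCP"
assert distribute(flow, 8) == distribute(flow, 8)
assert 0 <= distribute(flow, 8) < 8
```
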
.. _figure_efd1:

.. figure:: img/efd_i1.*

   Load Balancing Using Front End Node

In this scheme (:numref:`figure_efd1`), the front end
server/distributor/load balancer extracts the flow key from the input
packet and applies a computation to determine where this flow should be
directed. Intuitively, this scheme is very simple and requires no state
to be kept at the front end node, and hence storage requirements are
minimal.

.. _figure_efd2:

.. figure:: img/efd_i2.*

   Consistent Hashing

A widely used flow distributor that belongs to the same category of
computation-based schemes is ``consistent hashing``, shown in
:numref:`figure_efd2`. Target destinations (shown in red) are hashed
into the same space as the flow keys (shown in blue), and keys are
mapped to the nearest target in a clockwise fashion. Dynamically adding
and removing targets with consistent hashing requires only K/n keys to
be remapped on average, where K is the number of keys, and n is the
number of targets. In contrast, in a traditional hash-based scheme, a
change in the number of targets causes nearly all keys to be remapped.

Although computation-based schemes are simple and have very small
storage requirements, they suffer from the drawback that the system
designer/operator cannot control which target a specific key is
assigned to, as this is dictated by the hash function. Deterministically
co-locating keys (for example, to minimize inter-server traffic or to
optimize for network traffic conditions, target load, etc.) is simply
not possible.

Flow-Table Based Schemes
~~~~~~~~~~~~~~~~~~~~~~~~

When using a flow-table based scheme to handle flow distribution/load
balancing, in contrast with computation-based schemes, the system
designer has the flexibility of assigning a given flow to any given
target. The flow table (e.g. the DPDK RTE Hash Library) simply stores
both the flow key and the target value.

.. _figure_efd3:

.. figure:: img/efd_i3.*

   Table Based Flow Distribution

As shown in :numref:`figure_efd3`, when doing a lookup, the flow table
is indexed with the hash of the flow key, and the keys stored at this
index (more than one is possible, because of hash collisions), together
with their corresponding values, are retrieved. The retrieved key(s)
are matched against the input flow key and, if there is a match, the
value (target id) is returned.

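The lookup just described can be sketched with a toy Python flow table, a stand-in for a real hash table such as the DPDK RTE Hash Library; note that the full keys must be stored and compared:

```python
import zlib

class FlowTable:
    """Toy flow table: each bucket stores full (key, target) pairs."""

    def __init__(self, num_buckets: int = 16):
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key: bytes) -> int:
        return zlib.crc32(key) % len(self.buckets)

    def add(self, key: bytes, target: int) -> None:
        self.buckets[self._index(key)].append((key, target))

    def lookup(self, key: bytes):
        # Several keys may collide into one index, so every stored key
        # at that index is compared against the input key.
        for stored_key, target in self.buckets[self._index(key)]:
            if stored_key == key:
                return target
        return None

table = FlowTable()
table.add(b"flow-a", 3)
assert table.lookup(b"flow-a") == 3
assert table.lookup(b"flow-z") is None   # unknown flow: no match
```

Because the keys themselves live in the table, storage grows with both the number of flows and the key length, which is exactly the drawback discussed next.
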
The drawback of using a hash table for flow distribution/load balancing
is the storage requirement, since the flow table needs to store the
keys, signatures and target values. This prevents the scheme from
scaling to millions of flow keys. Large tables will usually not fit in
the CPU cache, and hence the lookup performance is degraded by the
latency of accessing main memory.

EFD Based Scheme
~~~~~~~~~~~~~~~~

EFD combines the advantages of both flow-table based and
computation-based schemes. It doesn't require the large storage
necessary for flow-table based schemes (because EFD doesn't store the
key, as explained below), and it supports any arbitrary value for any
given key.

.. _figure_efd4:

.. figure:: img/efd_i4.*

   Searching for Perfect Hash Function

The basic idea of EFD is that, when a given key is to be inserted, a
family of hash functions is searched until a hash function that maps
the input key to the correct value is found, as shown in
:numref:`figure_efd4`. However, rather than explicitly storing all keys
and their associated values, EFD stores only the indices of the hash
functions that map keys to values, and thereby consumes much less space
than a conventional flow table. The lookup operation is very simple,
similar to a computation-based scheme: given an input key, the lookup
operation is reduced to hashing that key with the correct hash
function.

.. _figure_efd5:

.. figure:: img/efd_i5.*

   Divide and Conquer for Millions of Keys

Intuitively, finding a hash function that maps each of a large number
(millions) of input keys to the correct output value is effectively
impossible. As a result, EFD breaks the problem into smaller pieces
(divide and conquer), as shown in :numref:`figure_efd5`. EFD divides
the entire input key set into many small groups, each consisting of
approximately 20-28 keys (a configurable parameter for the library).
Then, for each small group, a brute force search is performed to find a
hash function that produces the correct output for each key in the
group.

It should be mentioned that, since the online lookup table for EFD
doesn't store the key itself, the size of the EFD table is independent
of the key size. Hence, EFD lookup performance is almost constant
irrespective of key length, which is a highly desirable feature,
especially for longer keys.

In summary, EFD is a set separation data structure that supports
millions of keys. It is used to distribute a given key to an intended
target. By itself, EFD is not a FIB data structure that performs an
exact match on the input flow key.

.. _Efd_example:

Example of EFD Library Usage
----------------------------

EFD can be used along the data path of many network functions and
middleboxes. As previously mentioned, it can be used as an index table
for <key,value> pairs, meta-data for objects, a flow-level load
balancer, etc. :numref:`figure_efd6` shows an example of using EFD as a
flow-level load balancer, where flows are received at a front end
server before being forwarded to the target back end server for
processing. The system designer would deterministically co-locate flows
together in order to minimize cross-server interaction. (For example,
flows requesting certain webpage objects are co-located together, to
minimize forwarding of common objects across servers.)

.. _figure_efd6:

.. figure:: img/efd_i6.*

   EFD as a Flow-Level Load Balancer

As shown in :numref:`figure_efd6`, the front end server will have an
EFD table that stores, for each group, the perfect hash index that
produces the correct output. Because the table size is small and fits
in cache (since keys are not stored), it sustains a large number of
flows (N*X, where N is the maximum number of flows served by each back
end server of the X possible targets).

With an input flow key, the group id is computed (for example, using
the last few bits of a CRC hash) and then the EFD table is indexed with
the group id to retrieve the corresponding hash index to use. Once the
index is retrieved, the key is hashed using this hash function and the
result will be the intended correct target where this flow is supposed
to be processed.

It should be noted that, since EFD does not match the exact key but
rather distributes flows to a target back end node based on the perfect
hash index, a key that has never been inserted before will still be
distributed to a valid target. Hence, a local table that stores the
flows served at each node is exact-matched against the input key to
rule out never-before-seen flows.

.. _Efd_api:

Library API Overview
--------------------

The EFD library API has semantics very similar to those of a hash index
or a flow table. The application creates an EFD table for a given
maximum number of flows, a function is called to insert a flow key with
a specific target value, and another function is used to retrieve
target values for a given individual flow key or a bulk of keys.

EFD Table Create
~~~~~~~~~~~~~~~~

The function ``rte_efd_create()`` is used to create and return a
pointer to an EFD table that is sized to hold up to ``num_flows`` keys.
The online version of the EFD table (the one that does not store the
keys and is used for lookups) will be allocated and created in the last
level cache (LLC) of the socket defined by the
``online_socket_bitmask``, while the offline EFD table (the one that
stores the keys and is used for key inserts and for computing the
perfect hashing) is allocated and created in the LLC of the socket
defined by the ``offline_socket_bitmask``. It should be noted that, for
highest performance, the socket id should match that of the running
thread, i.e. the online EFD lookup table should be created on the same
socket as where the lookup thread runs.

231 | EFD Insert and Update | |
232 | ~~~~~~~~~~~~~~~~~~~~~ | |
233 | ||
234 | The EFD function to insert a key or update a key to a new value is | |
235 | ``rte_efd_update()``. This function will update an existing key to | |
236 | a new value (target) if the key has already been inserted | |
237 | before, or will insert the <key,value> pair if this key has not been inserted | |
238 | before. It will return 0 upon success. It will return | |
239 | ``EFD_UPDATE_WARN_GROUP_FULL (1)`` if the operation is insert, and the | |
240 | last available space in the key's group was just used. It will return | |
241 | ``EFD_UPDATE_FAILED (2)`` when the insertion or update has failed (either it | |
242 | failed to find a suitable perfect hash or the group was full). The function | |
243 | will return ``EFD_UPDATE_NO_CHANGE (3)`` if there is no change to the EFD | |
244 | table (i.e, same value already exists). | |
245 | ||
.. Note::

   This function is not multi-thread safe and should only be called
   from one thread.

EFD Lookup
~~~~~~~~~~

To look up a certain key in an EFD table, the function
``rte_efd_lookup()`` is used to return the value associated with a
single key. As previously mentioned, if the key has been inserted, the
correct value inserted is returned; if the key has not been inserted
before, a 'random' value (based on hashing of the key) is returned. For
better performance and to decrease the overhead of function calls per
key, it is always recommended to use a bulk lookup function
(simultaneous lookup of multiple keys) instead of a single key lookup
function. ``rte_efd_lookup_bulk()`` is the bulk lookup function that
looks up ``num_keys`` simultaneously, stored in the ``key_list``, and
the corresponding return values are returned in the ``value_list``.

.. Note::

   This function is multi-thread safe, but there should not be other
   threads writing in the EFD table, unless locks are used.

EFD Delete
~~~~~~~~~~

To delete a certain key in an EFD table, the function
``rte_efd_delete()`` can be used. The function returns zero upon
success when the key has been found and deleted. ``socket_id`` is the
parameter used to look up the existing value, and should ideally be the
caller's socket id. The previous value associated with this key will be
returned in the ``prev_value`` argument.

.. Note::

   This function is not multi-thread safe and should only be called
   from one thread.

.. _Efd_internals:

Library Internals
-----------------

This section provides a brief high-level overview of the library
internals. The intent of this section is to explain the high-level
implementation of insert, lookup and group rebalancing in the EFD
library.

Insert Function Internals
~~~~~~~~~~~~~~~~~~~~~~~~~

As previously mentioned, EFD divides the whole set of keys into groups
of a manageable size (e.g. 28 keys) and then searches for the perfect
hash that satisfies the intended target value for each key. EFD stores
two versions of the <key,value> table:

- Offline Version (in memory): Only used for the insertion/update
  operation, which is less frequent than the lookup operation. In the
  offline version, the exact keys for each group are stored. When a new
  key is added, a hash function is searched that satisfies the value of
  the new key together with those of all the keys already inserted in
  this group.

- Online Version (in cache): Used for the frequent lookup operation. In
  the online version, as previously mentioned, the keys are not stored,
  but rather only the hash index for each group.

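The asymmetry between the two versions can be pictured with a small sketch; the contents below are hypothetical, for two groups:

```python
# Offline table (memory): full keys and target values per group; consulted
# only on insert/update, to re-run the perfect hash search for the group.
offline = {
    0: [(b"flow-a", 3), (b"flow-b", 1)],
    1: [(b"flow-c", 2)],
}

# Online table (cache): one hash function index per group and nothing
# else, so its size is independent of both key length and key contents.
online = {0: 0x1A2B, 1: 0x0042}

assert set(online) == set(offline)             # same groups, two views
assert all(isinstance(v, int) for v in online.values())
```
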
.. _figure_efd7:

.. figure:: img/efd_i7.*

   Group Assignment

:numref:`figure_efd7` depicts the group assignment for 7 flow keys as
an example. Given a flow key, a hash function (in this implementation,
a CRC hash) is used to get the group id. As shown in the figure, the
groups can be unbalanced. (Group rebalancing is covered further below.)

.. _figure_efd8:

.. figure:: img/efd_i8.*

   Perfect Hash Search - Assigned Keys & Target Value

Focusing on one group that has four keys, :numref:`figure_efd8` depicts
the search algorithm used to find the perfect hash function. Assuming
that the target value bits for the keys are as shown in the figure, the
online EFD table will store a 16-bit hash index and a 16-bit lookup
table per group per value bit.

.. _figure_efd9:

.. figure:: img/efd_i9.*

   Perfect Hash Search - Satisfy Target Values

For a given key (keyX), a hash function
``h(keyX, seed1) + index * h(keyX, seed2)`` is used to point to a
certain bit index in the 16-bit ``lookup_table`` value, as shown in
:numref:`figure_efd9`. The insert function will brute-force search all
possible values of the hash index until a non-conflicting
``lookup_table`` is found.

.. _figure_efd10:

.. figure:: img/efd_i10.*

   Finding Hash Index for Conflict Free lookup_table

For example, since both key3 and key7 have a target bit value of 1, it
is okay if the hash function maps both keys to the same bit in the
lookup table. A conflict occurs if a hash index maps both Key4 and Key7
to the same index in the ``lookup_table``, as shown in
:numref:`figure_efd10`, since their target value bits are not the same.
Once a hash index is found that produces a ``lookup_table`` with no
contradictions, this index is stored for this group. This procedure is
repeated for each bit of the target value.

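The search for a conflict-free ``lookup_table`` can be sketched in Python. The seeds, the CRC-based hash family and the fallback-seed loop are assumptions of this sketch (the real library would rebalance a group it cannot satisfy rather than change seeds), but the position formula mirrors the one above:

```python
import zlib

SEED1 = 0x9E37   # arbitrary first seed for this sketch

def h(key: bytes, seed: int) -> int:
    return zlib.crc32(seed.to_bytes(4, "little") + key)

def conflict_free_bits(group, index: int, seed2: int):
    """Return {position: bit} if this hash index gives every key a
    lookup_table slot consistent with its target bit, else None."""
    bits = {}
    for key, value in group:
        pos = (h(key, SEED1) + index * h(key, seed2)) % 16
        if bits.setdefault(pos, value) != value:
            return None          # two keys, opposite bits, same slot
    return bits

def search_group(group):
    """Brute-force the hash index until the 16-bit lookup_table has no
    contradictions; repeated per bit of target value in the real library."""
    for seed2 in range(64):      # sketch-only fallback seeds
        for index in range(1 << 16):
            bits = conflict_free_bits(group, index, seed2)
            if bits is not None:
                table = sum(bit << pos for pos, bit in bits.items())
                return index, seed2, table
    return None

group = [(b"key1", 0), (b"key3", 1), (b"key4", 0), (b"key7", 1)]
index, seed2, table = search_group(group)
# Online lookup: re-hash with the stored index, read one bit of the table.
for key, value in group:
    pos = (h(key, SEED1) + index * h(key, seed2)) % 16
    assert (table >> pos) & 1 == value
```

Only ``index`` and the 16-bit ``table`` need to survive into the online table; the keys themselves are discarded after the search.
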
Lookup Function Internals
~~~~~~~~~~~~~~~~~~~~~~~~~

The design principle of EFD is that lookups are much more frequent than
inserts, and hence EFD's design optimizes for lookups, which are faster
and much simpler than the slower insert procedure (inserts are slow
because of the perfect hash search, as previously discussed).

.. _figure_efd11:

.. figure:: img/efd_i11.*

   EFD Lookup Operation

:numref:`figure_efd11` depicts the lookup operation for EFD. Given an
input key, the group id is computed (using a CRC hash) and then the
hash index for this group is retrieved from the EFD table. Using the
retrieved hash index, the hash function
``h(key, seed1) + index * h(key, seed2)`` is computed, resulting in an
index into the ``lookup_table``; the bit corresponding to this index
will be the target value bit. This procedure is repeated for each bit
of the target value.

Group Rebalancing Function Internals
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When discussing EFD inserts and lookups, the discussion is simplified
by assuming that a group id is simply the result of a hash function.
However, since hashing in general is not perfect and will not always
produce a uniform output, this simplified assumption will lead to
unbalanced groups, i.e. some groups will have more keys than others.
Typically, and to minimize insert time with an increasing number of
keys, it is preferable that all groups have a balanced number of keys,
so that the brute force search for the perfect hash terminates with a
valid hash index. In order to achieve this target, groups are
rebalanced during runtime inserts, and keys are moved from busy groups
to less crowded groups as more keys are inserted.

.. _figure_efd12:

.. figure:: img/efd_i12.*

   Runtime Group Rebalancing

:numref:`figure_efd12` depicts the high-level idea of group
rebalancing. Given an input key, the hash result is split into two
parts: a chunk id and an 8-bit bin id. A chunk contains 64 different
groups and 256 bins (i.e. any given bin can map to 4 distinct groups).
When a key is inserted, the bin id is computed (for example, in
:numref:`figure_efd12`, bin_id=2). Since each bin can be mapped to one
of four different groups (2 bits of storage), the four possible
mappings are evaluated, the one that results in a balanced key
distribution across these four groups is selected, and the mapping
result is stored in these two bits.

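The rebalancing choice can be sketched as follows. The particular bin-to-group spread below is a hypothetical stand-in for the library's actual mapping; it only guarantees that each bin reaches 4 distinct groups of the 64, and for simplicity the choice is evaluated per inserted key:

```python
GROUPS_PER_CHUNK = 64
BINS_PER_CHUNK = 256

def candidate_groups(bin_id: int):
    """The 4 distinct groups this bin may map to (hypothetical spread)."""
    return [(bin_id + c * 16) % GROUPS_PER_CHUNK for c in range(4)]

def choose_mapping(bin_id: int, group_load) -> int:
    """Evaluate the four possible mappings and return the 2-bit choice
    pointing at the least loaded candidate group."""
    cands = candidate_groups(bin_id)
    return min(range(4), key=lambda c: group_load[cands[c]])

load = [0] * GROUPS_PER_CHUNK
for key_hash in range(512):                # simulate inserting 512 keys
    bin_id = key_hash % BINS_PER_CHUNK     # 8-bit bin id from the hash
    choice = choose_mapping(bin_id, load)  # 2 bits stored per bin
    load[candidate_groups(bin_id)[choice]] += 1

assert max(load) - min(load) <= 2          # groups stay nearly balanced
```

Picking the least loaded of the four candidates keeps group sizes close to the mean, which in turn keeps the per-group perfect hash search fast.
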

.. _Efd_references:

References
-----------

1- EFD is based on collaborative research work between Intel and
Carnegie Mellon University (CMU). Interested readers can refer to the
paper "Scaling Up Clustered Network Appliances with ScaleBricks", Dong
Zhou et al., SIGCOMM 2015
(http://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p241.pdf) for
more information.