[/ Copyright 2006-2008 Daniel James.
 / Distributed under the Boost Software License, Version 1.0. (See accompanying
 / file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt) ]

[def __wang__
[@http://web.archive.org/web/20121102023700/http://www.concentric.net/~Ttwang/tech/inthash.htm
Thomas Wang's article on integer hash functions]]

[section:rationale Implementation Rationale]

The intent of this library is to implement the unordered
containers in the draft standard, so the interface was fixed. But there are
still some implementation decisions to make. The priorities are
conformance to the standard and portability.

The [@http://en.wikipedia.org/wiki/Hash_table wikipedia article on hash tables]
has a good summary of the implementation issues for hash tables in general.
[h2 Data Structure]

By specifying an interface for accessing the buckets of the container, the
standard pretty much requires that the hash table uses chained addressing.
It would be conceivable to write a hash table that uses another method. For
example, it could use open addressing, and use the lookup chain to act as a
bucket, but there are some serious problems with this:

* The draft standard requires that pointers to elements aren't invalidated, so
  the elements can't be stored in one array, but will need a layer of
  indirection instead - losing the efficiency and most of the memory gain,
  the main advantages of open addressing.

* Local iterators would be very inefficient and may not be able to
  meet the complexity requirements.

* There are also the restrictions on when iterators can be invalidated. Since
  open addressing degrades badly when there is a high number of collisions,
  the restrictions could prevent a rehash when it's really needed. The maximum
  load factor could be set to a fairly low value to work around this - but the
  standard requires that it is initially set to 1.0.

* And since the standard is written with an eye towards chained
  addressing, users will be surprised if the performance doesn't reflect that.
So chained addressing is used.

[/ (Removing for now as this is out of date)

For containers with unique keys I store the buckets in a single-linked list.
There are other possible data structures (such as a double-linked list)
that allow for some operations to be faster (such as erasing and iteration)
but the possible gain seems small compared to the extra memory needed.
The most commonly used operations (insertion and lookup) would not be improved
at all.

But for containers with equivalent keys a single-linked list can degrade badly
when a large number of elements with equivalent keys are inserted. I think it's
reasonable to assume that users who choose to use `unordered_multiset` or
`unordered_multimap` do so because they are likely to insert elements with
equivalent keys. So I have used an alternative data structure that doesn't
degrade, at the expense of an extra pointer per node.

This works by storing a circular linked list for each group of equivalent
nodes in reverse order. This allows quick navigation to the end of a group (since
the first element points to the last) and can be quickly updated when elements
are inserted or erased. The main disadvantage of this approach is some hairy code
for erasing elements.
]

[/ (Starting to write up new structure, might not be ready in time)
The nodes used to be stored in a linked list for each bucket, but that
didn't meet the complexity requirements for C++11, so now the nodes
are stored in one long singly linked list. But there needs to be a way
to get the bucket from the node; to do that, a copy of the key's hash
value is stored in the node. Another possibility would be to store a
pointer to the bucket, or the bucket's index, but storing the hash value
allows some operations to be faster.
]

[h2 Number of Buckets]

There are two popular methods for choosing the number of buckets in a hash
table. One is to have a prime number of buckets, another is to use a power
of 2.

Using a prime number of buckets, and choosing a bucket by using the modulus
of the hash function's result, will usually give a good result. The downside
is that the required modulus operation is fairly expensive. This is what the
containers do in most cases.

Using a power of 2 allows for much quicker selection of the bucket
to use, but at the expense of losing the upper bits of the hash value.
For some specially designed hash functions it is possible to do this and
still get a good result, but as the containers can take arbitrary hash
functions this can't be relied on.

To avoid this, a transformation could be applied to the hash function's
result; for an example, see __wang__. Unfortunately, a transformation like
Wang's requires knowledge of the number of bits in the hash value, so it isn't
portable enough to use as a default. It can be applicable in certain cases, so
the containers have a policy-based implementation that can use this
alternative technique.

Currently this is only done on 64-bit architectures, where prime number
modulus can be expensive. This varies depending on the architecture, though,
so I should probably revisit it.

I'm also thinking of introducing a mechanism whereby a hash function can
indicate that it's safe to be used directly with power of 2 buckets, in
which case a faster plain power of 2 implementation can be used.

[endsect]