NOTES ON OPTIMIZING DICTIONARIES
================================


Principal Use Cases for Dictionaries
------------------------------------

Passing keyword arguments
    Typically, one read and one write for 1 to 3 elements.
    Occurs frequently in normal Python code.

Class method lookup
    Dictionaries vary in size, with 8 to 16 elements being common.
    Usually written once with many lookups.
    When base classes are used, there are many failed lookups
    followed by a lookup in a base class.

Instance attribute lookup and global variables
    Dictionaries vary in size.  4 to 10 elements are common.
    Both reads and writes are common.

Builtins
    Frequent reads.  Almost never written.
    Size: 126 interned strings (as of Py2.3b1).
    A few keys are accessed much more frequently than others.

Uniquification
    Dictionaries of any size.  Bulk of work is in creation.
    Repeated writes to a smaller set of keys.
    Single read of each key.
    Some use cases have two consecutive accesses to the same key.

    * Removing duplicates from a sequence:
        dict.fromkeys(seqn).keys()

    * Counting elements in a sequence:
        for e in seqn:
            d[e] = d.get(e, 0) + 1

    * Accumulating references in a dictionary of lists:

        for pagenumber, page in enumerate(pages):
            for word in page:
                d.setdefault(word, []).append(pagenumber)

Note that the second example is characterized by a get and a set to
the same key.  There are similar use cases with a __contains__ check
followed by a get, set, or del of the same key.  Part of the
justification for d.setdefault() is that it combines the two lookups
into one.
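
For instance, without setdefault(), the accumulation loop in the third
example would need a membership test plus a second lookup:

    if word in d:                      # first lookup
        d[word].append(pagenumber)     # second lookup
    else:
        d[word] = [pagenumber]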

Membership Testing
    Dictionaries of any size.  Created once and then rarely changed.
    Single write to each key.
    Many calls to __contains__() or has_key().
    Similar access patterns occur with replacement dictionaries
    such as those used with the % formatting operator.
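
    * Example of a replacement dictionary used with the % operator:
        d = {'name': 'Python', 'version': '2.3'}
        '%(name)s %(version)s' % d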

Dynamic Mappings
    Characterized by deletions interspersed with adds and replacements.
    Performance benefits greatly from the re-use of dummy entries.


Data Layout (assuming a 32-bit box with 64 bytes per cache line)
----------------------------------------------------------------

Small dicts (8 entries) are attached to the dictobject structure
and the whole group nearly fills two consecutive cache lines.

Larger dicts use the first half of the dictobject structure (one cache
line) and a separate, contiguous block of entries (at 12 bytes each
for a total of 5.333 entries per cache line).
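
As a quick check of the arithmetic (each 12-byte entry is a hash, a key
pointer, and a value pointer at 4 bytes apiece on a 32-bit build):

    CACHE_LINE = 64                     # bytes per cache line (assumed above)
    ENTRY_SIZE = 12                     # me_hash + me_key + me_value
    print CACHE_LINE / float(ENTRY_SIZE)        # 5.333 entries per line
    print CACHE_LINE / float(ENTRY_SIZE) - 1    # 4.333 neighbors fetched too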


Tunable Dictionary Parameters
-----------------------------

* PyDict_MINSIZE.  Currently set to 8.
    Must be a power of two.  New dicts have to zero-out every cell.
    Each additional 8 cells consumes 1.5 cache lines.  Increasing this
    improves the sparseness of small dictionaries but costs time to
    read in the additional cache lines if they are not already in
    cache.  That case is common when keyword arguments are passed.

* Maximum dictionary load in PyDict_SetItem.  Currently set to 2/3.
    Increasing this ratio makes dictionaries more dense, resulting
    in more collisions.  Decreasing it improves sparseness at the
    expense of spreading entries over more cache lines and at the
    cost of total memory consumed.  (A sketch of the load test
    appears after this list.)

    The load test occurs in highly time-sensitive code.  Efforts
    to make the test more complex (for example, varying the load
    for different sizes) have degraded performance.

* Growth rate upon hitting maximum load.  Currently set to *2.
    Raising this to *4 results in half the number of resizes, less
    total effort to resize, better sparseness for some (but not all)
    dict sizes, and potentially doubled memory consumption depending
    on the size of the dictionary.  Setting it to *4 eliminates every
    other resize step.  (The toy model after this list compares the
    two growth rates.)

* Maximum sparseness (minimum dictionary load).  What percentage
    of entries can be unused before the dictionary shrinks to
    free up memory and speed up iteration?  (The current CPython
    code does not represent this parameter directly.)

* Shrinkage rate upon exceeding maximum sparseness.  The current
    CPython code never even checks sparseness when deleting a
    key.  When a new key is added, it resizes based on the number
    of active keys, so the addition may trigger shrinkage rather
    than growth.
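
A toy model of the load test and the growth rates discussed above
(plain Python, not the C code; ma_fill counts active plus dummy slots
and ma_mask + 1 is the table size, as in dictobject.c):

    def exceeds_max_load(ma_fill, ma_mask):
        # The 2/3 load test, written without division to stay cheap.
        return ma_fill * 3 >= (ma_mask + 1) * 2

    def count_resizes(n_keys, growth, size=8):
        # Count resizes while inserting n_keys fresh keys (no dummies).
        resizes = 0
        for used in range(1, n_keys + 1):
            if exceeds_max_load(used, size - 1):
                size *= growth
                resizes += 1
        return resizes

    print count_resizes(10000, 2)    # *2 growth: 11 resizes
    print count_resizes(10000, 4)    # *4 growth: 6 resizes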

Tune-ups should be measured across a broad range of applications and
use cases.  A change to any parameter will help in some situations and
hurt in others.  The key is to find settings that help the most common
cases and do the least damage to the less common cases.  Results will
vary dramatically depending on the exact number of keys, whether the
keys are all strings, whether reads or writes dominate, and the exact
hash values of the keys (some sets of values have fewer collisions than
others).  Any one test or benchmark is likely to prove misleading.

While making a dictionary more sparse reduces collisions, it impairs
iteration and key listing.  Those methods loop over every potential
entry.  Doubling the size of a dictionary results in twice as many
non-overlapping memory accesses for keys(), items(), values(),
__iter__(), iterkeys(), iteritems(), itervalues(), and update().
Also, every dictionary is iterated over at least twice: once by the
memset() when it is created and once by dealloc().

Dictionary operations involving only a single key can be O(1) unless
resizing is possible.  By checking for a resize only when the
dictionary can grow (and may *require* resizing), other operations
remain O(1), and the odds of resize thrashing or memory fragmentation
are reduced.  In particular, an algorithm that empties a dictionary
by repeatedly invoking .pop will see no resizing, which might not be
necessary at all because the dictionary is eventually discarded
entirely.
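
A minimal illustration of such a draining loop (consume() is a
hypothetical stand-in for the per-item work):

    while d:
        key, value = d.popitem()    # shrinks toward empty; no resize occurs
        consume(key, value)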


Results of Cache Locality Experiments
-------------------------------------

When an entry is retrieved from memory, 4.333 adjacent entries are also
retrieved into a cache line.  Since accessing items in cache is *much*
cheaper than a cache miss, an enticing idea is to probe the adjacent
entries as a first step in collision resolution.  Unfortunately, the
introduction of any regularity into collision searches results in more
collisions than the current random chaining approach.

Exploiting cache locality at the expense of additional collisions fails
to pay off when the entries are already loaded in cache (the expense
is paid with no compensating benefit).  This occurs in small dictionaries
where the whole dictionary fits into a pair of cache lines.  It also
occurs frequently in large dictionaries which have a common access pattern
where some keys are accessed much more frequently than others.  The
more popular entries *and* their collision chains tend to remain in cache.

To exploit cache locality, change the collision resolution section
in lookdict() and lookdict_string().  Set i^=1 at the top of the
loop and move the i = (i << 2) + i + perturb + 1 step to an unrolled
version of the loop.
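
A sketch in Python of the probe sequences involved (PERTURB_SHIFT is 5
in dictobject.c; the pairing below is the variant described above, not
the shipped code):

    PERTURB_SHIFT = 5

    def probes(h, mask):
        # Current scheme: pseudo-random jumps seeded by the full hash.
        i, perturb = h & mask, h
        while True:
            yield i & mask
            i = (i << 2) + i + perturb + 1      # i = i*5 + perturb + 1
            perturb >>= PERTURB_SHIFT

    def paired_probes(h, mask):
        # Variant: also try the partner slot (i ^ 1), which usually sits
        # in the same cache line, before taking the pseudo-random jump.
        i, perturb = h & mask, h
        while True:
            yield i & mask
            yield (i ^ 1) & mask
            i = (i << 2) + i + perturb + 1
            perturb >>= PERTURB_SHIFT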

This optimization strategy can be leveraged in several ways:

* If the dictionary is kept sparse (through the tunable parameters),
  then the occurrence of additional collisions is lessened.

* If lookdict() and lookdict_string() are specialized for small dicts
  and for large dicts, then the versions for large dicts can be given
  an alternate search strategy without increasing collisions in small
  dicts, which already have the maximum benefit of cache locality.

* If the use case for a dictionary is known to have a random key
  access pattern (as opposed to a more common pattern with a Zipf's law
  distribution), then there will be more benefit for large dictionaries
  because any given key is no more likely than another to already be
  in cache.

* In use cases with paired accesses to the same key, the second access
  is always in cache and gets no benefit from efforts to further improve
  cache locality.


Optimizing the Search of Small Dictionaries
-------------------------------------------

If lookdict() and lookdict_string() are specialized for smaller dictionaries,
then a custom search approach can be implemented that exploits the small
search space and cache locality.

* The simplest example is a linear search of contiguous entries.  This is
  simple to implement, guaranteed to terminate rapidly, never searches
  the same entry twice, and precludes the need to check for dummy
  entries (a toy version appears after this list).

* A more advanced example is a self-organizing search in which the most
  frequently accessed entries get probed first.  The organization
  adapts if the access pattern changes over time.  Treaps are ideally
  suited for self-organization: the most common entries stay near the
  top of the heap, giving a rapid binary search pattern, and most
  probes end near the top of the tree where they fit in one or two
  cache lines.

* Also, small dictionaries may be made more dense, perhaps filling all
  eight cells to take the maximum advantage of two cache lines.
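
A toy version (plain Python, not the C implementation) of the linear
search, with a move-to-front step as a simple stand-in for the
self-organizing idea:

    class SmallMap:
        """Toy mapping: linear scan over parallel lists."""

        def __init__(self):
            self._keys = []
            self._values = []

        def __setitem__(self, key, value):
            for i, k in enumerate(self._keys):
                if k == key:
                    self._values[i] = value
                    return
            self._keys.append(key)
            self._values.append(value)

        def __getitem__(self, key):
            for i, k in enumerate(self._keys):
                if k == key:
                    if i:
                        # Move-to-front: hot keys drift to the start, so
                        # popular probes stay within the first cache line.
                        self._keys.insert(0, self._keys.pop(i))
                        self._values.insert(0, self._values.pop(i))
                    return self._values[0]
            raise KeyError(key)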


Strategy Pattern
----------------

Consider allowing the user to set the tunable parameters or to select a
particular search method.  Since some dictionary use cases have known
sizes and access patterns, the user may be able to provide useful hints.

1) For example, if membership testing or lookups dominate runtime and memory
   is not at a premium, the user may benefit from setting the maximum load
   ratio at 5% or 10% instead of the usual 66.7%.  This will sharply
   curtail the number of collisions but will increase iteration time.
   The builtin namespace is a prime example of a dictionary that can
   benefit from being highly sparse.

2) Dictionary creation time can be shortened in cases where the ultimate
   size of the dictionary is known in advance.  The dictionary can be
   pre-sized so that no resize operations are required during creation.
   Not only does this save resizes, but key insertion will go more
   quickly because the first half of the keys will be inserted into a
   more sparse environment than before.  The preconditions for this
   strategy arise whenever a dictionary is created from a key or item
   sequence and the number of *unique* keys is known.  (See the toy
   model after this list.)

3) If the key space is large and the access pattern is known to be random,
   then search strategies exploiting cache locality can be fruitful.
   The preconditions for this strategy arise in simulations and
   numerical analysis.

4) If the keys are fixed and the access pattern strongly favors some of
   the keys, then the entries can be stored contiguously and accessed
   with a linear search or treap.  This exploits knowledge of the data,
   cache locality, and a simplified search routine.  It also eliminates
   the need to test for dummy entries on each probe.  The preconditions
   for this strategy arise in symbol tables and in the builtin dictionary.
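
Re-using count_resizes() from the toy model in the tunable-parameters
section, the payoff of pre-sizing for 1000 unique keys can be
illustrated (a sketch; CPython exposes no pre-sizing hook at the
Python level):

    print count_resizes(1000, 2)               # default 8-slot start: 8 resizes
    print count_resizes(1000, 2, size=2048)    # pre-sized table: 0 resizes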


Readonly Dictionaries
---------------------
Some dictionary use cases pass through a build stage and then move to a
more heavily exercised lookup stage with no further changes to the
dictionary.

An idea that emerged on python-dev is to be able to convert a dictionary
to a read-only state.  This can help prevent programming errors and also
provide knowledge that can be exploited for lookup optimization.

The dictionary can be immediately rebuilt (eliminating dummy entries),
resized (to an appropriate level of sparseness), and the keys can be
jostled (to minimize collisions).  The lookdict() routine can then
eliminate the test for dummy entries (saving about 1/4 of the time
spent in the collision resolution loop).

An additional possibility is to insert links into the empty spaces
so that dictionary iteration can proceed in len(d) steps instead of
(mp->ma_mask + 1) steps.  Alternatively, a separate tuple of keys can
be kept just for iteration.
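
A sketch of the user-visible half of the idea (plain Python; the
dummy-entry and resize optimizations would live in the C lookup
routine):

    class ReadOnlyDict:
        """Wrapper that refuses writes after the build stage."""

        def __init__(self, d):
            self._d = dict(d)       # copying rebuilds the table and
                                    # drops any dummy entries
        def __getitem__(self, key):
            return self._d[key]

        def __contains__(self, key):
            return key in self._d

        def __iter__(self):
            return iter(self._d)

        def __len__(self):
            return len(self._d)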


Caching Lookups
---------------
The idea is to exploit key access patterns by anticipating future lookups
based on previous lookups.

The simplest incarnation is to save the most recently accessed entry.
This gives optimal performance for use cases where every get is followed
by a set or del to the same key.
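
A sketch of that simplest incarnation (plain Python; a real version
would cache the entry pointer inside the C lookup routine, where it
also skips the hashing and probing cost):

    class LastHitDict(dict):
        """Toy subclass that remembers the most recent successful get."""

        def __init__(self, *args, **kwds):
            dict.__init__(self, *args, **kwds)
            self._last = None                   # (key, value) pair or None

        def __getitem__(self, key):
            if self._last is not None and self._last[0] == key:
                return self._last[1]            # served from the cache
            value = dict.__getitem__(self, key)
            self._last = (key, value)
            return value

        def __setitem__(self, key, value):
            dict.__setitem__(self, key, value)
            if self._last is not None and self._last[0] == key:
                self._last = (key, value)       # keep the cache coherent

        def __delitem__(self, key):
            dict.__delitem__(self, key)
            self._last = None                   # invalidate on delete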