+++ /dev/null
-NOTES ON OPTIMIZING DICTIONARIES\r
-================================\r
-\r
-\r
-Principal Use Cases for Dictionaries\r
-------------------------------------\r
-\r
-Passing keyword arguments\r
- Typically, one read and one write for 1 to 3 elements.\r
- Occurs frequently in normal Python code.\r
-\r
-Class method lookup\r
- Dictionaries vary in size with 8 to 16 elements being common.\r
- Usually written once with many lookups.\r
- When base classes are used, there are many failed lookups\r
- followed by a lookup in a base class.\r
-\r
-Instance attribute lookup and Global variables\r
- Dictionaries vary in size. 4 to 10 elements are common.\r
- Both reads and writes are common.\r
-\r
-Builtins\r
- Frequent reads. Almost never written.\r
- Size 126 interned strings (as of Py2.3b1).\r
- A few keys are accessed much more frequently than others.\r
-\r
-Uniquification\r
- Dictionaries of any size. Bulk of work is in creation.\r
- Repeated writes to a smaller set of keys.\r
- Single read of each key.\r
- Some use cases have two consecutive accesses to the same key.\r
-\r
- * Removing duplicates from a sequence.\r
- dict.fromkeys(seqn).keys()\r
-\r
- * Counting elements in a sequence.\r
- for e in seqn:\r
- d[e] = d.get(e,0) + 1\r
-\r
- * Accumulating references in a dictionary of lists:\r
-\r
- for pagenumber, page in enumerate(pages):\r
- for word in page:\r
- d.setdefault(word, []).append(pagenumber)\r
-\r
- Note that the second example is a use case characterized by a get and set\r
- to the same key. There are similar use cases with a __contains__\r
- followed by a get, set, or del to the same key. Part of the\r
- justification for d.setdefault is combining the two lookups into one.\r
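
The three idioms above can be written as small functions. This is a sketch in modern Python 3 syntax (an assumption; the notes predate it), and the function names are illustrative only. The counting loop performs a get and a set on the same key, while setdefault folds the lookup and the insert into one call.

```python
def unique(seqn):
    # dict.fromkeys() keeps one entry per distinct element.
    return list(dict.fromkeys(seqn))

def count(seqn):
    d = {}
    for e in seqn:
        # Two consecutive accesses to the same key: a get, then a set.
        d[e] = d.get(e, 0) + 1
    return d

def index_words(pages):
    d = {}
    for pagenumber, page in enumerate(pages):
        for word in page:
            # One call finds or creates the list for this word.
            d.setdefault(word, []).append(pagenumber)
    return d

print(unique("abracadabra"))
print(count("abracadabra"))
print(index_words([["a", "b"], ["b", "c"]]))
```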
-\r
-Membership Testing\r
- Dictionaries of any size. Created once and then rarely changed.\r
- Single write to each key.\r
- Many calls to __contains__() or has_key().\r
- Similar access patterns occur with replacement dictionaries\r
- such as with the % formatting operator.\r
-\r
-Dynamic Mappings\r
- Characterized by deletions interspersed with adds and replacements.\r
- Performance benefits greatly from the re-use of dummy entries.\r
-\r
-\r
-Data Layout (assuming a 32-bit box with 64 bytes per cache line)\r
-----------------------------------------------------------------\r
-\r
-Smalldicts (8 entries) are attached to the dictobject structure\r
-and the whole group nearly fills two consecutive cache lines.\r
-\r
-Larger dicts use the first half of the dictobject structure (one cache\r
-line) and a separate, contiguous block of entries (at 12 bytes each,\r
-for a total of 5.333 entries per cache line).\r
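
The figures in this section follow from simple arithmetic, sketched below. The 64-byte cache line and 12-byte entry come from the text; the constant names are illustrative.

```python
CACHE_LINE = 64      # bytes per cache line (from the section heading)
ENTRY_SIZE = 12      # bytes per dict entry on a 32-bit build
SMALL_ENTRIES = 8    # entries embedded in a small dict

entries_per_line = CACHE_LINE / ENTRY_SIZE
small_table_bytes = SMALL_ENTRIES * ENTRY_SIZE
small_table_lines = small_table_bytes / CACHE_LINE
# Space left for the rest of the dictobject within two cache lines.
header_room = 2 * CACHE_LINE - small_table_bytes

print(round(entries_per_line, 3), small_table_lines, header_room)
```

The embedded table alone spans 1.5 cache lines, which is why a small dict nearly fills two lines once the rest of the structure is counted.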
-\r
-\r
-Tunable Dictionary Parameters\r
------------------------------\r
-\r
-* PyDict_MINSIZE. Currently set to 8.\r
- Must be a power of two. New dicts have to zero-out every cell.\r
- Each additional 8 consumes 1.5 cache lines. Increasing improves\r
- the sparseness of small dictionaries but costs time to read in\r
- the additional cache lines if they are not already in cache.\r
- That case is common when keyword arguments are passed.\r
-\r
-* Maximum dictionary load in PyDict_SetItem. Currently set to 2/3.\r
- Increasing this ratio makes dictionaries more dense resulting\r
- in more collisions. Decreasing it improves sparseness at the\r
- expense of spreading entries over more cache lines and at the\r
- cost of total memory consumed.\r
-\r
- The load test occurs in highly time sensitive code. Efforts\r
- to make the test more complex (for example, varying the load\r
- for different sizes) have degraded performance.\r
-\r
-* Growth rate upon hitting maximum load. Currently set to *2.\r
- Raising this to *4 eliminates every other resize step, halving\r
- the number of resizes and the total effort spent resizing. It\r
- improves sparseness for some (but not all) dict sizes and can\r
- double memory consumption, depending on the size of the\r
- dictionary.\r
-\r
-* Maximum sparseness (minimum dictionary load). What percentage\r
- of entries can be unused before the dictionary shrinks to\r
- free up memory and speed up iteration? (The current CPython\r
- code does not represent this parameter directly.)\r
-\r
-* Shrinkage rate upon exceeding maximum sparseness. The current\r
- CPython code never even checks sparseness when deleting a\r
- key. When a new key is added, it resizes based on the number\r
- of active keys, so that the addition may trigger shrinkage\r
- rather than growth.\r
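
The interaction between the 2/3 load limit and the growth rate can be explored with a toy model. This is a simplified sketch, not the CPython resize logic: it assumes insert-only workloads and multiplies the table size by a fixed factor, whereas the real code sizes the new table from the number of active keys. The function name is hypothetical.

```python
def simulate_growth(n_keys, growth, minsize=8):
    """Return (number_of_resizes, final_table_size) after n_keys inserts."""
    size = minsize
    resizes = 0
    for used in range(1, n_keys + 1):
        # Resize once the table is at least 2/3 full.
        if used * 3 >= size * 2:
            size *= growth
            resizes += 1
    return resizes, size

for n in (600, 1000):
    print(n, simulate_growth(n, 2), simulate_growth(n, 4))
```

In this model, 1000 keys end in the same 2048-slot table either way but with 4 resizes instead of 8, while 600 keys end in a 2048-slot table under *4 and a 1024-slot table under *2. That is the doubling of memory the text warns about for some sizes.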
-\r
-Tune-ups should be measured across a broad range of applications and\r
-use cases. A change to any parameter will help in some situations and\r
-hurt in others. The key is to find settings that help the most common\r
-cases and do the least damage to the less common cases. Results will\r
-vary dramatically depending on the exact number of keys, whether the\r
-keys are all strings, whether reads or writes dominate, and the exact\r
-hash values of the keys (some sets of values have fewer collisions than\r
-others). Any one test or benchmark is likely to prove misleading.\r
-\r
-While making a dictionary more sparse reduces collisions, it impairs\r
-iteration and key listing. Those methods loop over every potential\r
-entry. Doubling the size of a dictionary results in twice as many\r
-non-overlapping memory accesses for keys(), items(), values(),\r
-__iter__(), iterkeys(), iteritems(), itervalues(), and update().\r
-Also, every dictionary iterates at least twice, once for the memset()\r
-when it is created and once by dealloc().\r
-\r
-Dictionary operations involving only a single key can be O(1) unless\r
-resizing is possible. By checking for a resize only when the\r
-dictionary can grow (and may *require* resizing), other operations\r
-remain O(1), and the odds of resize thrashing or memory fragmentation\r
-are reduced. In particular, an algorithm that empties a dictionary\r
-by repeatedly invoking .pop will see no resizing. Shrinking would\r
-often be wasted effort in that case because the dictionary is\r
-eventually discarded entirely.\r
-\r
-\r
-Results of Cache Locality Experiments\r
--------------------------------------\r
-\r
-When an entry is retrieved from memory, 4.333 adjacent entries are also\r
-retrieved into a cache line. Since accessing items in cache is *much*\r
-cheaper than a cache miss, an enticing idea is to probe the adjacent\r
-entries as a first step in collision resolution. Unfortunately, the\r
-introduction of any regularity into collision searches results in more\r
-collisions than the current random chaining approach.\r
-\r
-Exploiting cache locality at the expense of additional collisions fails\r
-to pay off when the entries are already loaded in cache (the expense\r
-is paid with no compensating benefit). This occurs in small dictionaries\r
-where the whole dictionary fits into a pair of cache lines. It also\r
-occurs frequently in large dictionaries which have a common access pattern\r
-where some keys are accessed much more frequently than others. The\r
-more popular entries *and* their collision chains tend to remain in cache.\r
-\r
-To exploit cache locality, change the collision resolution section\r
-in lookdict() and lookdict_string(). Set i^=1 at the top of the\r
-loop and move the i = (i << 2) + i + perturb + 1 to an unrolled\r
-version of the loop.\r
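
The two probing orders can be modelled in Python. This is a sketch under assumptions: the recurrence is the one quoted above, PERTURB_SHIFT = 5 is assumed from the CPython source, hash values are treated as non-negative, and the "paired" variant is one reading of the i^=1 idea (check the neighbouring slot before each jump). The generator names are hypothetical.

```python
from itertools import islice

PERTURB_SHIFT = 5  # assumed value; not stated in these notes

def probe_sequence(h, mask):
    """Slots visited by the current collision resolution scheme."""
    i = h & mask
    yield i
    perturb = h
    while True:
        i = (i << 2) + i + perturb + 1
        yield i & mask
        perturb >>= PERTURB_SHIFT

def probe_sequence_paired(h, mask):
    """Variant that tries the adjacent slot (i ^ 1) before each jump."""
    i = h & mask
    yield i
    perturb = h
    while True:
        yield (i ^ 1) & mask  # neighbour, likely on the same cache line
        i = (i << 2) + i + perturb + 1
        yield i & mask
        perturb >>= PERTURB_SHIFT

print(list(islice(probe_sequence(0, 7), 8)))
print(list(islice(probe_sequence_paired(0, 7), 4)))
```

For hash 0 in an 8-slot table the paired variant revisits slots it has already seen, a small illustration of how the added regularity can cost extra probes.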
-\r
-This optimization strategy can be leveraged in several ways:\r
-\r
-* If the dictionary is kept sparse (through the tunable parameters),\r
-then the occurrence of additional collisions is lessened.\r
-\r
-* If lookdict() and lookdict_string() are specialized for small dicts\r
-and for large dicts, then the versions for large dicts can be given\r
-an alternate search strategy without increasing collisions in small dicts,\r
-which already have the maximum benefit of cache locality.\r
-\r
-* If the use case for a dictionary is known to have a random key\r
-access pattern (as opposed to a more common pattern with a Zipf's law\r
-distribution), then there will be more benefit for large dictionaries\r
-because any given key is no more likely than another to already be\r
-in cache.\r
-\r
-* In use cases with paired accesses to the same key, the second access\r
-is always in cache and gets no benefit from efforts to further improve\r
-cache locality.\r
-\r
-Optimizing the Search of Small Dictionaries\r
--------------------------------------------\r
-\r
-If lookdict() and lookdict_string() are specialized for smaller dictionaries,\r
-then a custom search approach can be implemented that exploits the small\r
-search space and cache locality.\r
-\r
-* The simplest example is a linear search of contiguous entries. This is\r
- simple to implement, guaranteed to terminate rapidly, never searches\r
- the same entry twice, and precludes the need to check for dummy entries.\r
-\r
-* A more advanced example is a self-organizing search so that the most\r
- frequently accessed entries get probed first. The organization\r
- adapts if the access pattern changes over time. Treaps are ideally\r
- suited for self-organization with the most common entries at the\r
- top of the heap and a rapid binary search pattern. Most probes and\r
- results land near the top of the tree, so they fit in one or two\r
- cache lines.\r
-\r
-* Also, small dictionaries may be made more dense, perhaps filling all\r
- eight cells to take the maximum advantage of two cache lines.\r
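
The linear-search option from the first bullet can be sketched as a tiny mapping. This is an illustrative model only: the class name, the capacity of eight, and the swap-with-last deletion are assumptions, and a real implementation would live in C and fall back to the hashed table when full.

```python
class LinearSmallMap:
    """Contiguous key/value storage searched front to back."""

    def __init__(self, capacity=8):
        self._keys = []
        self._values = []
        self._capacity = capacity

    def _index(self, key):
        # Each entry is examined at most once; there are no dummy entries.
        for i, k in enumerate(self._keys):
            if k is key or k == key:
                return i
        return -1

    def __getitem__(self, key):
        i = self._index(key)
        if i < 0:
            raise KeyError(key)
        return self._values[i]

    def __setitem__(self, key, value):
        i = self._index(key)
        if i >= 0:
            self._values[i] = value
        elif len(self._keys) < self._capacity:
            self._keys.append(key)
            self._values.append(value)
        else:
            raise ValueError("full: switch to a hashed table here")

    def __delitem__(self, key):
        i = self._index(key)
        if i < 0:
            raise KeyError(key)
        # Move the last entry into the hole so storage stays contiguous.
        self._keys[i] = self._keys[-1]
        self._values[i] = self._values[-1]
        self._keys.pop()
        self._values.pop()

    def __len__(self):
        return len(self._keys)

m = LinearSmallMap()
m["x"] = 1
m["y"] = 2
m["x"] = 10
del m["x"]
print(len(m), m["y"])
```

Deleting by moving the last entry into the gap is what removes the need for dummy markers: the occupied slots are always the first len(m) cells.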
-\r
-\r
-Strategy Pattern\r
-----------------\r
-\r
-Consider allowing the user to set the tunable parameters or to select a\r
-particular search method. Since some dictionary use cases have known\r
-sizes and access patterns, the user may be able to provide useful hints.\r
-\r
-1) For example, if membership testing or lookups dominate runtime and memory\r
- is not at a premium, the user may benefit from setting the maximum load\r
- ratio at 5% or 10% instead of the usual 66.7%. This will sharply\r
- curtail the number of collisions but will increase iteration time.\r
- The builtin namespace is a prime example of a dictionary that can\r
- benefit from being highly sparse.\r
-\r
-2) Dictionary creation time can be shortened in cases where the ultimate\r
- size of the dictionary is known in advance. The dictionary can be\r
- pre-sized so that no resize operations are required during creation.\r
- Not only does this save resizes, but the key insertion will go\r
- more quickly because the first half of the keys will be inserted into\r
- a more sparse environment than before. The preconditions for this\r
- strategy arise whenever a dictionary is created from a key or item\r
- sequence and the number of *unique* keys is known.\r
-\r
-3) If the key space is large and the access pattern is known to be random,\r
- then search strategies exploiting cache locality can be fruitful.\r
- The preconditions for this strategy arise in simulations and\r
- numerical analysis.\r
-\r
-4) If the keys are fixed and the access pattern strongly favors some of\r
- the keys, then the entries can be stored contiguously and accessed\r
- with a linear search or treap. This exploits knowledge of the data,\r
- cache locality, and a simplified search routine. It also eliminates\r
- the need to test for dummy entries on each probe. The preconditions\r
- for this strategy arise in symbol tables and in the builtin dictionary.\r
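
Strategies 1 and 2 both come down to choosing a table size from a key count and a target load. The helper below is a hypothetical sketch (Python exposes no such presizing call); it finds the smallest power-of-two table that keeps the load strictly below the given ratio, mirroring the shape of the 2/3 test.

```python
def presize(n_unique_keys, load_num=2, load_den=3, minsize=8):
    """Smallest power-of-two size with n_unique_keys / size < load_num / load_den."""
    size = minsize
    # Integer arithmetic avoids floating-point rounding in the load test.
    while n_unique_keys * load_den >= size * load_num:
        size *= 2
    return size

print(presize(1000))         # usual 2/3 maximum load
print(presize(126))          # a builtins-sized dict at 2/3 load
print(presize(126, 1, 10))   # the same dict held to 10% load
```

Holding a 126-key dictionary to 10% load takes a table eight times larger than the 2/3 setting, which is the iteration-time and memory cost that strategy 1 trades for fewer collisions.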
-\r
-\r
-Readonly Dictionaries\r
----------------------\r
-Some dictionary use cases pass through a build stage and then move to a\r
-more heavily exercised lookup stage with no further changes to the\r
-dictionary.\r
-\r
-An idea that emerged on python-dev is to be able to convert a dictionary\r
-to a read-only state. This can help prevent programming errors and also\r
-provide knowledge that can be exploited for lookup optimization.\r
-\r
-The dictionary can be immediately rebuilt (eliminating dummy entries),\r
-resized (to an appropriate level of sparseness), and the keys can be\r
-jostled (to minimize collisions). The lookdict() routine can then\r
-eliminate the test for dummy entries (saving about 1/4 of the time\r
-spent in the collision resolution loop).\r
-\r
-An additional possibility is to insert links into the empty spaces\r
-so that dictionary iteration can proceed in len(d) steps instead of\r
-(mp->ma_mask + 1) steps. Alternatively, a separate tuple of keys can be\r
-kept just for iteration.\r
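
The build-then-freeze pattern can be approximated in later Python versions with types.MappingProxyType, which wraps a dict in a read-only view. This sketch only shows the error-prevention half of the idea: the proxy performs none of the rebuilding, resizing, or lookup specialization discussed above, and the underlying dict stays mutable for anyone who still holds it. The function and variable names are illustrative.

```python
from types import MappingProxyType

def build_table(pairs):
    # Build stage: an ordinary, mutable dictionary.
    table = {}
    for key, value in pairs:
        table[key] = value
    # Lookup stage: hand out only a read-only view.
    return MappingProxyType(table)

symbols = build_table([("pi", 3.14159), ("e", 2.71828)])

try:
    symbols["tau"] = 6.28318
    rejected = False
except TypeError:
    # The proxy does not support item assignment.
    rejected = True

print(symbols["pi"], rejected)
```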
-\r
-\r
-Caching Lookups\r
----------------\r
-The idea is to exploit key access patterns by anticipating future lookups\r
-based on previous lookups.\r
-\r
-The simplest incarnation is to save the most recently accessed entry.\r
-This gives optimal performance for use cases where every get is followed\r
-by a set or del to the same key.\r
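
The effect of remembering the most recently accessed entry can be modelled with a wrapper that counts full lookups. This is a toy sketch, not a proposal for the C implementation: the class name is hypothetical, values are kept in one-element lists so a cached entry can be updated in place, and deletion is omitted.

```python
class LastEntryCache:
    def __init__(self):
        self._boxes = {}        # key -> [value]
        self._last_key = None
        self._last_box = None   # None means nothing is cached yet
        self.full_lookups = 0

    def _find(self, key):
        if self._last_box is not None and key == self._last_key:
            return self._last_box          # served from the cache
        self.full_lookups += 1
        box = self._boxes.get(key)
        if box is not None:
            self._last_key, self._last_box = key, box
        return box

    def get(self, key, default=None):
        box = self._find(key)
        return default if box is None else box[0]

    def set(self, key, value):
        box = self._find(key)
        if box is None:
            box = [value]
            self._boxes[key] = box
            self._last_key, self._last_box = key, box
        else:
            box[0] = value

d = LastEntryCache()
for word in "abab":
    d.set(word, d.get(word, 0) + 1)
print(d.full_lookups)
```

Four get/set pairs would cost eight lookups without the cache. In this model the two pairs that hit existing keys each save one, giving six. A version that also cached the empty slot found by a failed get would save the remaining two.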