Intro
-----
This describes an adaptive, stable, natural mergesort, modestly called
timsort (hey, I earned it <wink>). It has supernatural performance on many
kinds of partially ordered arrays (less than lg(N!) comparisons needed, and
as few as N-1), yet as fast as Python's previous highly tuned samplesort
hybrid on random arrays.

In a nutshell, the main routine marches over the array once, left to right,
alternately identifying the next run, then merging it into the previous
runs "intelligently". Everything else is complication for speed, and some
hard-won measure of memory efficiency.


Comparison with Python's Samplesort Hybrid
------------------------------------------
+ timsort can require a temp array containing as many as N//2 pointers,
  which means as many as 2*N extra bytes on 32-bit boxes. It can be
  expected to require a temp array this large when sorting random data; on
  data with significant structure, it may get away without using any extra
  heap memory. This appears to be the strongest argument against it, but
  compared to the size of an object, 2 temp bytes worst-case (also expected-
  case for random data) doesn't scare me much.

  It turns out that Perl is moving to a stable mergesort, and the code for
  that appears always to require a temp array with room for at least N
  pointers. (Note that I wouldn't want to do that even if space weren't an
  issue; I believe its efforts at memory frugality also save timsort
  significant pointer-copying costs, and allow it to have a smaller working
  set.)

+ Across about four hours of generating random arrays, and sorting them
  under both methods, samplesort required about 1.5% more comparisons
  (the program is at the end of this file).

+ In real life, this may be faster or slower on random arrays than
  samplesort was, depending on platform quirks. Since it does fewer
  comparisons on average, it can be expected to do better the more
  expensive a comparison function is. OTOH, it does more data movement
  (pointer copying) than samplesort, and that may negate its small
  comparison advantage (depending on platform quirks) unless comparison
  is very expensive.

+ On arrays with many kinds of pre-existing order, this blows samplesort out
  of the water. It's significantly faster than samplesort even on some
  cases samplesort was special-casing the snot out of. I believe that lists
  very often do have exploitable partial order in real life, and this is the
  strongest argument in favor of timsort (indeed, samplesort's special cases
  for extreme partial order are appreciated by real users, and timsort goes
  much deeper than those, in particular naturally covering every case where
  someone has suggested "and it would be cool if list.sort() had a special
  case for this too ... and for that ...").

+ Here are exact comparison counts across all the tests in sortperf.py,
  when run with arguments "15 20 1".

  Column Key:
      *sort: random data
      \sort: descending data
      /sort: ascending data
      3sort: ascending, then 3 random exchanges
      +sort: ascending, then 10 random at the end
      %sort: ascending, then randomly replace 1% of elements w/ random values
      ~sort: many duplicates
      =sort: all equal
      !sort: worst case scenario

  First the trivial cases, trivial for samplesort because it special-cased
  them, and trivial for timsort because it naturally works on runs. Within
  an "n" block, the first line gives the # of compares done by samplesort,
  the second line by timsort, and the third line is the percentage by
  which the samplesort count exceeds the timsort count:

        n   \sort   /sort   =sort
  ------- ------- ------- -------
    32768   32768   32767   32767  samplesort
            32767   32767   32767  timsort
            0.00%   0.00%   0.00%  (samplesort - timsort) / timsort

    65536   65536   65535   65535
            65535   65535   65535
            0.00%   0.00%   0.00%

   131072  131072  131071  131071
           131071  131071  131071
            0.00%   0.00%   0.00%

   262144  262144  262143  262143
           262143  262143  262143
            0.00%   0.00%   0.00%

   524288  524288  524287  524287
           524287  524287  524287
            0.00%   0.00%   0.00%

  1048576 1048576 1048575 1048575
          1048575 1048575 1048575
            0.00%   0.00%   0.00%

  The algorithms are effectively identical in these cases, except that
  timsort does one fewer compare in \sort.

  Now for the more interesting cases. Where lg(x) is the logarithm of x to
  the base 2 (e.g., lg(8)=3), lg(n!) is the information-theoretic limit for
  the best any comparison-based sorting algorithm can do on average (across
  all permutations). When a method gets significantly below that, it's
  either astronomically lucky, or is finding exploitable structure in the
  data.


        n   lg(n!)     *sort     3sort     +sort     %sort     ~sort     !sort
  ------- -------- --------- --------- --------- --------- --------- ---------
    32768   444255    453096    453614     32908    452871    130491    469141  old
                      448885     33016     33007     50426    182083     65534  new
                       0.94%  1273.92%    -0.30%   798.09%   -28.33%   615.87%  %ch from new

    65536   954037    972699    981940     65686    973104    260029   1004607
                      962991     65821     65808    101667    364341    131070
                       1.01%  1391.83%    -0.19%   857.15%   -28.63%   666.47%

   131072  2039137   2101881   2091491    131232   2092894    554790   2161379
                     2057533    131410    131361    206193    728871    262142
                       2.16%  1491.58%    -0.10%   915.02%   -23.88%   724.51%

   262144  4340409   4464460   4403233    262314   4445884   1107842   4584560
                     4377402    262437    262459    416347   1457945    524286
                       1.99%  1577.82%    -0.06%   967.83%   -24.01%   774.44%

   524288  9205096   9453356   9408463    524468   9441930   2218577   9692015
                     9278734    524580    524633    837947   2916107   1048574
                       1.88%  1693.52%    -0.03%  1026.79%   -23.92%   824.30%

  1048576 19458756  19950272  19838588   1048766  19912134   4430649  20434212
                    19606028   1048958   1048941   1694896   5832445   2097150
                       1.76%  1791.27%    -0.02%  1074.83%   -24.03%   874.38%

  Discussion of cases:

  *sort: There's no structure in random data to exploit, so the theoretical
  limit is lg(n!). Both methods get close to that, and timsort is hugging
  it (indeed, in a *marginal* sense, it's a spectacular improvement --
  there's only about 1% left before hitting the wall, and timsort knows
  darned well it's doing compares that won't pay on random data -- but so
  does the samplesort hybrid). For contrast, Hoare's original random-pivot
  quicksort does about 39% more compares than the limit, and the median-of-3
  variant about 19% more.

  3sort, %sort, and !sort: No contest; there's structure in this data, but
  not of the specific kinds samplesort special-cases. Note that structure
  in !sort wasn't put there on purpose -- it was crafted as a worst case for
  a previous quicksort implementation. That timsort nails it came as a
  surprise to me (although it's obvious in retrospect).

  +sort: samplesort special-cases this data, and does a few fewer compares
  than timsort. However, timsort runs this case significantly faster on all
  boxes we have timings for, because timsort is in the business of merging
  runs efficiently, while samplesort does much more data movement in this
  (for it) special case.

  ~sort: samplesort's special cases for large masses of equal elements are
  extremely effective on ~sort's specific data pattern, and timsort just
  isn't going to get close to that, even though it's clearly getting a
  great deal of benefit out of the duplicates (the # of compares is much
  less than lg(n!)). ~sort has a perfectly uniform distribution of just 4
  distinct values, and as the distribution gets more skewed, samplesort's
  equal-element gimmicks become less effective, while timsort's adaptive
  strategies find more to exploit; in a database supplied by Kevin Altis, a
  sort on its highly skewed "on which stock exchange does this company's
  stock trade?" field ran over twice as fast under timsort.

  However, even though timsort does many more comparisons on ~sort, and
  ~sort runs highly significantly slower under timsort on several
  platforms, on other platforms ~sort runs highly significantly faster
  under timsort. No other kind of data has shown this wild cross-platform
  behavior, and we don't have an explanation for it. The only thing I can
  think of that could transform what "should be" highly significant
  slowdowns into highly significant speedups on some boxes are catastrophic
  cache effects in samplesort.

  But timsort "should be" slower than samplesort on ~sort, so it's hard
  to count that it isn't on some boxes as a strike against it <wink>.

+ Here's the highwater mark for the number of heap-based temp slots (4
  bytes each on this box) needed by each test, again with arguments
  "15 20 1":

     2**i   *sort   \sort   /sort   3sort   +sort   %sort   ~sort   =sort   !sort
    32768   16384       0       0    6256       0   10821   12288       0   16383
    65536   32766       0       0   21652       0   31276   24576       0   32767
   131072   65534       0       0   17258       0   58112   49152       0   65535
   262144  131072       0       0   35660       0  123561   98304       0  131071
   524288  262142       0       0   31302       0  212057  196608       0  262143
  1048576  524286       0       0  312438       0  484942  393216       0  524287

  Discussion: The tests that end up doing (close to) perfectly balanced
  merges (*sort, !sort) need all N//2 temp slots (or almost all). ~sort
  also ends up doing balanced merges, but systematically benefits a lot from
  the preliminary pre-merge searches described under "Merge Memory" later.
  %sort approaches having a balanced merge at the end because the random
  selection of elements to replace is expected to produce an out-of-order
  element near the midpoint. \sort, /sort, =sort are the trivial one-run
  cases, needing no merging at all. +sort ends up having one very long run
  and one very short, and so gets all the temp space it needs from the small
  temparray member of the MergeState struct (note that the same would be
  true if the new random elements were prefixed to the sorted list instead,
  but not if they appeared "in the middle"). 3sort approaches N//3 temp
  slots twice, but the run lengths that remain after 3 random exchanges
  clearly have very high variance.


A detailed description of timsort follows.

Runs
----
count_run() returns the # of elements in the next run. A run is either
"ascending", which means non-decreasing:

    a0 <= a1 <= a2 <= ...

or "descending", which means strictly decreasing:

    a0 > a1 > a2 > ...

Note that a run is always at least 2 long, unless we start at the array's
last element.

The definition of descending is strict, because the main routine reverses
a descending run in-place, transforming a descending run into an ascending
run. Reversal is done via the obvious fast "swap elements starting at each
end, and converge at the middle" method, and that can violate stability if
the slice contains any equal elements. Using a strict definition of
descending ensures that a descending run contains distinct elements.
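
This run-detection step can be sketched in Python (the helper name and
signature here are illustrative; the real implementation is C):

```python
def count_run(a, lo, hi):
    # Return (length, descending) for the run beginning at a[lo];
    # a[lo:hi] is the not-yet-sorted remainder of the array.
    if lo + 1 == hi:            # a run is at least 2 long unless at the end
        return 1, False
    n = 2
    if a[lo + 1] < a[lo]:       # strictly descending: a0 > a1 > a2 > ...
        while lo + n < hi and a[lo + n] < a[lo + n - 1]:
            n += 1
        return n, True
    else:                       # non-decreasing: a0 <= a1 <= a2 <= ...
        while lo + n < hi and a[lo + n] >= a[lo + n - 1]:
            n += 1
        return n, False
```

A descending run is then reversed in place; because the descending test is
strict, the run holds no equal elements and the reversal cannot violate
stability.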

If an array is random, it's very unlikely we'll see long runs. If a natural
run contains less than minrun elements (see next section), the main loop
artificially boosts it to minrun elements, via a stable binary insertion sort
applied to the right number of array elements following the short natural
run. In a random array, *all* runs are likely to be minrun long as a
result. This has two primary good effects:

1. Random data strongly tends then toward perfectly balanced (both runs have
   the same length) merges, which is the most efficient way to proceed when
   data is random.

2. Because runs are never very short, the rest of the code doesn't make
   heroic efforts to shave a few cycles off per-merge overheads. For
   example, reasonable use of function calls is made, rather than trying to
   inline everything. Since there are no more than N/minrun runs to begin
   with, a few "extra" function calls per merge are barely measurable.
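
The boost-to-minrun step might look like this in Python (a sketch using the
stdlib bisect module; the function and argument names are illustrative):

```python
from bisect import bisect_right

def binary_insertion_sort(a, lo, hi, start):
    # Sort a[lo:hi] stably, assuming a[lo:start] is already sorted
    # (e.g. it is a natural run shorter than minrun).
    for i in range(start, hi):
        x = a[i]
        # bisect_right places x after any equal elements, so the
        # insertion sort is stable
        pos = bisect_right(a, x, lo, i)
        del a[i]
        a.insert(pos, x)
```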


Computing minrun
----------------
If N < 64, minrun is N. IOW, binary insertion sort is used for the whole
array then; it's hard to beat that given the overheads of trying something
fancier (see note BINSORT).

When N is a power of 2, testing on random data showed that minrun values of
16, 32, 64 and 128 worked about equally well. At 256 the data-movement cost
in binary insertion sort clearly hurt, and at 8 the increase in the number
of function calls clearly hurt. Picking *some* power of 2 is important
here, so that the merges end up perfectly balanced (see next section). We
pick 32 as a good value in the sweet range; picking a value at the low end
allows the adaptive gimmicks more opportunity to exploit shorter natural
runs.

Because sortperf.py only tries powers of 2, it took a long time to notice
that 32 isn't a good choice for the general case! Consider N=2112:

>>> divmod(2112, 32)
(66, 0)
>>>

If the data is randomly ordered, we're very likely to end up with 66 runs
each of length 32. The first 64 of these trigger a sequence of perfectly
balanced merges (see next section), leaving runs of lengths 2048 and 64 to
merge at the end. The adaptive gimmicks can do that with fewer than 2048+64
compares, but it's still more compares than necessary, and -- mergesort's
bugaboo relative to samplesort -- a lot more data movement (O(N) copies just
to get 64 elements into place).

If we take minrun=33 in this case, then we're very likely to end up with 64
runs each of length 33, and then all merges are perfectly balanced. Better!

What we want to avoid is picking minrun such that in

    q, r = divmod(N, minrun)

q is a power of 2 and r>0 (then the last merge only gets r elements into
place, and r < minrun is small compared to N), or q a little larger than a
power of 2 regardless of r (then we've got a case similar to "2112", again
leaving too little work for the last merge to do).

Instead we pick a minrun in range(32, 65) such that N/minrun is exactly a
power of 2, or if that isn't possible, is close to, but strictly less than,
a power of 2. This is easier to do than it may sound: take the first 6
bits of N, and add 1 if any of the remaining bits are set. In fact, that
rule covers every case in this section, including small N and exact powers
of 2; merge_compute_minrun() is a deceptively simple function.
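
In Python, that rule reads (this mirrors the description above; the real
function is C):

```python
def merge_compute_minrun(n):
    # Take the first 6 bits of n, and add 1 if any of the remaining
    # bits are set.  For n < 64 this returns n itself.
    r = 0                # becomes 1 if any low-order bit was set
    while n >= 64:
        r |= n & 1
        n >>= 1
    return n + r
```

For N=2112 this yields 33, giving 64 runs of length 33 and perfectly
balanced merges; for any N that's an exact power of 2 (and at least 64) it
yields 32.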


The Merge Pattern
-----------------
In order to exploit regularities in the data, we're merging on natural
run lengths, and they can become wildly unbalanced. That's a Good Thing
for this sort! It means we have to find a way to manage an assortment of
potentially very different run lengths, though.

Stability constrains permissible merging patterns. For example, if we have
3 consecutive runs of lengths

    A:10000 B:20000 C:10000

we dare not merge A with C first, because if A, B and C happen to contain
a common element, it would get out of order wrt its occurrence(s) in B. The
merging must be done as (A+B)+C or A+(B+C) instead.

So merging is always done on two consecutive runs at a time, and in-place,
although this may require some temp memory (more on that later).

When a run is identified, its base address and length are pushed on a stack
in the MergeState struct. merge_collapse() is then called to see whether it
should merge it with preceding run(s). We would like to delay merging as
long as possible in order to exploit patterns that may come up later, but we
like even more to do merging as soon as possible to exploit that the run just
found is still high in the memory hierarchy. We also can't delay merging
"too long" because it consumes memory to remember the runs that are still
unmerged, and the stack has a fixed size.

What turned out to be a good compromise maintains two invariants on the
stack entries, where A, B and C are the lengths of the three rightmost
not-yet merged slices:

1. A > B+C
2. B > C

Note that, by induction, #2 implies the lengths of pending runs form a
decreasing sequence. #1 implies that, reading the lengths right to left,
the pending-run lengths grow at least as fast as the Fibonacci numbers.
Therefore the stack can never grow larger than about log_base_phi(N) entries,
where phi = (1+sqrt(5))/2 ~= 1.618. Thus a small # of stack slots suffice
for very large arrays.

If A <= B+C, the smaller of A and C is merged with B (ties favor C, for the
freshness-in-cache reason), and the new run replaces the A,B or B,C entries;
e.g., if the last 3 entries are

    A:30 B:20 C:10

then B is merged with C, leaving

    A:30 BC:30

on the stack. Or if they were

    A:500 B:400 C:1000

then A is merged with B, leaving

    AB:900 C:1000

on the stack.

In both examples, the stack configuration after the merge still violates
invariant #2, and merge_collapse() goes on to continue merging runs until
both invariants are satisfied. As an extreme case, suppose we didn't do the
minrun gimmick, and natural runs were of lengths 128, 64, 32, 16, 8, 4, 2,
and 2. Nothing would get merged until the final 2 was seen, and that would
trigger 7 perfectly balanced merges.
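
Over run lengths alone, the invariant maintenance can be sketched as follows
(a toy model: the real merge_collapse() merges the array slices too, here a
merge is modeled by summing two adjacent lengths):

```python
def merge_collapse(runs):
    # runs is the stack of pending run lengths, oldest first.
    # Keep merging until A > B+C and B > C both hold at the top.
    while len(runs) > 1:
        if len(runs) > 2 and runs[-3] <= runs[-2] + runs[-1]:
            if runs[-3] < runs[-1]:
                # A is smaller: merge A with B; C stays on top
                runs[-3:-1] = [runs[-3] + runs[-2]]
            else:
                # merge B with C (ties favor C: fresher in cache)
                runs[-2:] = [runs[-2] + runs[-1]]
        elif runs[-2] <= runs[-1]:
            runs[-2:] = [runs[-2] + runs[-1]]
        else:
            break
    return runs
```

On the examples above, [30, 20, 10] first becomes [30, 30] and then, since
invariant #2 is still violated, collapses all the way to [60]; likewise
[500, 400, 1000] becomes [900, 1000] and then [1900].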

The thrust of these rules when they trigger merging is to balance the run
lengths as closely as possible, while keeping a low bound on the number of
runs we have to remember. This is maximally effective for random data,
where all runs are likely to be of (artificially forced) length minrun, and
then we get a sequence of perfectly balanced merges (with, perhaps, some
oddballs at the end).

OTOH, one reason this sort is so good for partly ordered data has to do
with wildly unbalanced run lengths.


Merge Memory
------------
Merging adjacent runs of lengths A and B in-place, and in linear time, is
difficult. Theoretical constructions are known that can do it, but they're
too difficult and slow for practical use. But if we have temp memory equal
to min(A, B), it's easy.

If A is smaller (function merge_lo), copy A to a temp array, leave B alone,
and then we can do the obvious merge algorithm left to right, from the temp
area and B, starting the stores into where A used to live. There's always a
free area in the original area comprising a number of elements equal to the
number not yet merged from the temp array (trivially true at the start;
proceed by induction). The only tricky bit is that if a comparison raises an
exception, we have to remember to copy the remaining elements back in from
the temp area, lest the array end up with duplicate entries from B. But
that's exactly the same thing we need to do if we reach the end of B first,
so the exit code is pleasantly common to both the normal and error cases.
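
A minimal Python sketch of this merge, without the galloping and
error-handling machinery (names and slice bounds are illustrative):

```python
def merge_lo(a, lo, mid, hi):
    # Merge the sorted runs A = a[lo:mid] and B = a[mid:hi] in place,
    # where A is the shorter run: A is copied to a temp array and the
    # merged output is stored into the space A used to occupy.
    temp = a[lo:mid]
    i, j, dest = 0, mid, lo      # i indexes temp, j indexes B
    while i < len(temp) and j < hi:
        if a[j] < temp[i]:       # strict < keeps equal elements stable
            a[dest] = a[j]
            j += 1
        else:
            a[dest] = temp[i]
            i += 1
        dest += 1
    # If B ran out first, copy the rest of temp into the free space at
    # the end; if temp ran out, the remaining B elements are already in
    # their final positions.
    a[dest:dest + len(temp) - i] = temp[i:]
```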

If B is smaller (function merge_hi, which is merge_lo's "mirror image"),
much the same, except that we need to merge right to left, copying B into a
temp array and starting the stores at the right end of where B used to live.

A refinement: When we're about to merge adjacent runs A and B, we first do
a form of binary search (more on that later) to see where B[0] should end up
in A. Elements in A preceding that point are already in their final
positions, effectively shrinking the size of A. Likewise we also search to
see where A[-1] should end up in B, and elements of B after that point can
also be ignored. This cuts the amount of temp memory needed by the same
amount.

These preliminary searches may not pay off, and can be expected *not* to
repay their cost if the data is random. But they can win huge in all of
time, copying, and memory savings when they do pay, so this is one of the
"per-merge overheads" mentioned above that we're happy to endure because
there is at most one very short run. It's generally true in this algorithm
that we're willing to gamble a little to win a lot, even though the net
expectation is negative for random data.


Merge Algorithms
----------------
merge_lo() and merge_hi() are where the bulk of the time is spent. merge_lo
deals with runs where A <= B, and merge_hi where A > B. They don't know
whether the data is clustered or uniform, but a lovely thing about merging
is that many kinds of clustering "reveal themselves" by how many times in a
row the winning merge element comes from the same run. We'll only discuss
merge_lo here; merge_hi is exactly analogous.

Merging begins in the usual, obvious way, comparing the first element of A
to the first of B, and moving B[0] to the merge area if it's less than A[0],
else moving A[0] to the merge area. Call that the "one pair at a time"
mode. The only twist here is keeping track of how many times in a row "the
winner" comes from the same run.

If that count reaches MIN_GALLOP, we switch to "galloping mode". Here
we *search* B for where A[0] belongs, and move over all the B's before
that point in one chunk to the merge area, then move A[0] to the merge
area. Then we search A for where B[0] belongs, and similarly move a
slice of A in one chunk. Then back to searching B for where A[0] belongs,
etc. We stay in galloping mode until both searches find slices to copy
fewer than MIN_GALLOP elements long, at which point we go back to
one-pair-at-a-time mode.

A refinement: The MergeState struct contains the value of min_gallop that
controls when we enter galloping mode, initialized to MIN_GALLOP.
merge_lo() and merge_hi() adjust this higher when galloping isn't paying
off, and lower when it is.


Galloping
---------
Still without loss of generality, assume A is the shorter run. In galloping
mode, we first look for A[0] in B. We do this via "galloping", comparing
A[0] in turn to B[0], B[1], B[3], B[7], ..., B[2**j - 1], ..., until finding
the k such that B[2**(k-1) - 1] < A[0] <= B[2**k - 1]. This takes at most
roughly lg(B) comparisons, and, unlike a straight binary search, favors
finding the right spot early in B (more on that later).

After finding such a k, the region of uncertainty is reduced to 2**(k-1) - 1
consecutive elements, and a straight binary search requires exactly k-1
additional comparisons to nail it (see note REGION OF UNCERTAINTY). Then we
copy all the B's up to that point in one chunk, and then copy A[0]. Note
that no matter where A[0] belongs in B, the combination of galloping + binary
search finds it in no more than about 2*lg(B) comparisons.
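
The search itself might be sketched like this (a simplified gallop for the
leftmost insertion point; the real gallop_left() is C and also starts
galloping from a caller-supplied hint index):

```python
from bisect import bisect_left

def gallop_left(x, b):
    # Find the leftmost index where x could be inserted in sorted list b,
    # by exponential probing at offsets 1, 3, 7, ..., 2**j - 1, followed
    # by a binary search over the remaining region of uncertainty.
    if not b or x <= b[0]:
        return 0
    ofs_lo, ofs_hi = 0, 1        # invariant: b[ofs_lo] < x
    while ofs_hi < len(b) and b[ofs_hi] < x:
        ofs_lo = ofs_hi
        ofs_hi = 2 * ofs_hi + 1
    ofs_hi = min(ofs_hi, len(b))
    # now b[ofs_lo] < x, and b[ofs_hi] >= x (or ofs_hi == len(b));
    # binary-search the uncertain region b[ofs_lo+1 : ofs_hi]
    return bisect_left(b, x, ofs_lo + 1, ofs_hi)
```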

If we did a straight binary search, we could find it in no more than
ceiling(lg(B+1)) comparisons -- but straight binary search takes that many
comparisons no matter where A[0] belongs. Straight binary search thus loses
to galloping unless the run is quite long, and we simply can't guess
whether it is in advance.

If data is random and runs have the same length, A[0] belongs at B[0] half
the time, at B[1] a quarter of the time, and so on: a consecutive winning
sub-run in B of length k occurs with probability 1/2**(k+1). So long
winning sub-runs are extremely unlikely in random data, and guessing that a
winning sub-run is going to be long is a dangerous game.

OTOH, if data is lopsided or lumpy or contains many duplicates, long
stretches of winning sub-runs are very likely, and cutting the number of
comparisons needed to find one from O(B) to O(log B) is a huge win.

Galloping compromises by getting out fast if there isn't a long winning
sub-run, yet finding such runs very efficiently when they do exist.

I first learned about the galloping strategy in a related context; see:

    "Adaptive Set Intersections, Unions, and Differences" (2000)