Intro
-----
This describes an adaptive, stable, natural mergesort, modestly called
timsort (hey, I earned it <wink>).  It has supernatural performance on many
kinds of partially ordered arrays (less than lg(N!) comparisons needed, and
as few as N-1), yet as fast as Python's previous highly tuned samplesort
hybrid on random arrays.

In a nutshell, the main routine marches over the array once, left to right,
alternately identifying the next run, then merging it into the previous
runs "intelligently".  Everything else is complication for speed, and some
hard-won measure of memory efficiency.


Comparison with Python's Samplesort Hybrid
------------------------------------------
+ timsort can require a temp array containing as many as N//2 pointers,
  which means as many as 2*N extra bytes on 32-bit boxes.  It can be
  expected to require a temp array this large when sorting random data; on
  data with significant structure, it may get away without using any extra
  heap memory.  This appears to be the strongest argument against it, but
  compared to the size of an object, 2 temp bytes worst-case (also expected-
  case for random data) doesn't scare me much.

  It turns out that Perl is moving to a stable mergesort, and the code for
  that appears always to require a temp array with room for at least N
  pointers.  (Note that I wouldn't want to do that even if space weren't an
  issue; I believe its efforts at memory frugality also save timsort
  significant pointer-copying costs, and allow it to have a smaller working
  set.)

+ Across about four hours of generating random arrays, and sorting them
  under both methods, samplesort required about 1.5% more comparisons
  (the program is at the end of this file).

+ In real life, this may be faster or slower on random arrays than
  samplesort was, depending on platform quirks.  Since it does fewer
  comparisons on average, it can be expected to do better the more
  expensive a comparison function is.  OTOH, it does more data movement
  (pointer copying) than samplesort, and that may negate its small
  comparison advantage (depending on platform quirks) unless comparison
  is very expensive.

+ On arrays with many kinds of pre-existing order, this blows samplesort out
  of the water.  It's significantly faster than samplesort even on some
  cases samplesort was special-casing the snot out of.  I believe that lists
  very often do have exploitable partial order in real life, and this is the
  strongest argument in favor of timsort (indeed, samplesort's special cases
  for extreme partial order are appreciated by real users, and timsort goes
  much deeper than those, in particular naturally covering every case where
  someone has suggested "and it would be cool if list.sort() had a special
  case for this too ... and for that ...").

+ Here are exact comparison counts across all the tests in sortperf.py,
  when run with arguments "15 20 1".

  Column Key:
      *sort: random data
      \sort: descending data
      /sort: ascending data
      3sort: ascending, then 3 random exchanges
      +sort: ascending, then 10 random at the end
      %sort: ascending, then randomly replace 1% of elements w/ random values
      ~sort: many duplicates
      =sort: all equal
      !sort: worst case scenario

  First the trivial cases, trivial for samplesort because it special-cased
  them, and trivial for timsort because it naturally works on runs.  Within
  an "n" block, the first line gives the # of compares done by samplesort,
  the second line by timsort, and the third line is the percentage by
  which the samplesort count exceeds the timsort count:

        n   \sort   /sort   =sort
  -------  ------  ------  ------
    32768   32768   32767   32767  samplesort
            32767   32767   32767  timsort
            0.00%   0.00%   0.00%  (samplesort - timsort) / timsort

    65536   65536   65535   65535
            65535   65535   65535
            0.00%   0.00%   0.00%

   131072  131072  131071  131071
           131071  131071  131071
            0.00%   0.00%   0.00%

   262144  262144  262143  262143
           262143  262143  262143
            0.00%   0.00%   0.00%

   524288  524288  524287  524287
           524287  524287  524287
            0.00%   0.00%   0.00%

  1048576 1048576 1048575 1048575
          1048575 1048575 1048575
            0.00%   0.00%   0.00%

  The algorithms are effectively identical in these cases, except that
  timsort does one less compare in \sort.

  Now for the more interesting cases.  Where lg(x) is the logarithm of x to
  the base 2 (e.g., lg(8)=3), lg(n!) is the information-theoretic limit for
  the best any comparison-based sorting algorithm can do on average (across
  all permutations).  When a method gets significantly below that, it's
  either astronomically lucky, or is finding exploitable structure in the
  data.

        n   lg(n!)    *sort    3sort    +sort    %sort    ~sort     !sort
  -------  -------   ------  -------  -------   ------  -------  --------
    32768   444255   453096   453614    32908   452871   130491    469141  old
                     448885    33016    33007    50426   182083     65534  new
                      0.94% 1273.92%   -0.30%  798.09%  -28.33%   615.87%  %ch from new

    65536   954037   972699   981940    65686   973104   260029   1004607
                     962991    65821    65808   101667   364341    131070
                      1.01% 1391.83%   -0.19%  857.15%  -28.63%   666.47%

   131072  2039137  2101881  2091491   131232  2092894   554790   2161379
                    2057533   131410   131361   206193   728871    262142
                      2.16% 1491.58%   -0.10%  915.02%  -23.88%   724.51%

   262144  4340409  4464460  4403233   262314  4445884  1107842   4584560
                    4377402   262437   262459   416347  1457945    524286
                      1.99% 1577.82%   -0.06%  967.83%  -24.01%   774.44%

   524288  9205096  9453356  9408463   524468  9441930  2218577   9692015
                    9278734   524580   524633   837947  2916107   1048574
                      1.88% 1693.52%   -0.03% 1026.79%  -23.92%   824.30%

  1048576 19458756 19950272 19838588  1048766 19912134  4430649  20434212
                   19606028  1048958  1048941  1694896  5832445   2097150
                      1.76% 1791.27%   -0.02% 1074.83%  -24.03%   874.38%

  Discussion of cases:

  *sort: There's no structure in random data to exploit, so the theoretical
  limit is lg(n!).  Both methods get close to that, and timsort is hugging
  it (indeed, in a *marginal* sense, it's a spectacular improvement --
  there's only about 1% left before hitting the wall, and timsort knows
  darned well it's doing compares that won't pay on random data -- but so
  does the samplesort hybrid).  For contrast, Hoare's original random-pivot
  quicksort does about 39% more compares than the limit, and the median-of-3
  variant about 19% more.

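  As a quick sanity check (not part of the original test harness), the
  information-theoretic limit quoted above can be computed exactly, since
  Python's ints are unbounded:

  ```python
  import math

  # lg(n!): the average number of comparisons any comparison-based sort
  # needs on a random permutation of n distinct elements.
  def lg_factorial(n):
      return math.log2(math.factorial(n))

  limit = lg_factorial(32768)
  print(limit)                 # agrees with the table's 444255 entry

  # timsort's *sort count from the table, as a ratio over the limit:
  # about 1.01, i.e. roughly 1% left before hitting the wall.
  print(448885 / limit)
  ```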
  3sort, %sort, and !sort: No contest; there's structure in this data, but
  not of the specific kinds samplesort special-cases.  Note that structure
  in !sort wasn't put there on purpose -- it was crafted as a worst case for
  a previous quicksort implementation.  That timsort nails it came as a
  surprise to me (although it's obvious in retrospect).

  +sort: samplesort special-cases this data, and does a few less compares
  than timsort.  However, timsort runs this case significantly faster on all
  boxes we have timings for, because timsort is in the business of merging
  runs efficiently, while samplesort does much more data movement in this
  (for it) special case.

  ~sort: samplesort's special cases for large masses of equal elements are
  extremely effective on ~sort's specific data pattern, and timsort just
  isn't going to get close to that, despite that it's clearly getting a
  great deal of benefit out of the duplicates (the # of compares is much less
  than lg(n!)).  ~sort has a perfectly uniform distribution of just 4
  distinct values, and as the distribution gets more skewed, samplesort's
  equal-element gimmicks become less effective, while timsort's adaptive
  strategies find more to exploit; in a database supplied by Kevin Altis, a
  sort on its highly skewed "on which stock exchange does this company's
  stock trade?" field ran over twice as fast under timsort.

  However, despite that timsort does many more comparisons on ~sort, and
  that on several platforms ~sort runs highly significantly slower under
  timsort, on other platforms ~sort runs highly significantly faster under
  timsort.  No other kind of data has shown this wild x-platform behavior,
  and we don't have an explanation for it.  The only thing I can think of
  that could transform what "should be" highly significant slowdowns into
  highly significant speedups on some boxes are catastrophic cache effects
  in samplesort.

  But timsort "should be" slower than samplesort on ~sort, so it's hard
  to count that it isn't on some boxes as a strike against it <wink>.

+ Here's the highwater mark for the number of heap-based temp slots (4
  bytes each on this box) needed by each test, again with arguments
  "15 20 1":

     2**i   *sort  \sort  /sort   3sort  +sort   %sort   ~sort  =sort   !sort
    32768   16384      0      0    6256      0   10821   12288      0   16383
    65536   32766      0      0   21652      0   31276   24576      0   32767
   131072   65534      0      0   17258      0   58112   49152      0   65535
   262144  131072      0      0   35660      0  123561   98304      0  131071
   524288  262142      0      0   31302      0  212057  196608      0  262143
  1048576  524286      0      0  312438      0  484942  393216      0  524287

  Discussion:  The tests that end up doing (close to) perfectly balanced
  merges (*sort, !sort) need all N//2 temp slots (or almost all).  ~sort
  also ends up doing balanced merges, but systematically benefits a lot from
  the preliminary pre-merge searches described under "Merge Memory" later.
  %sort approaches having a balanced merge at the end because the random
  selection of elements to replace is expected to produce an out-of-order
  element near the midpoint.  \sort, /sort, =sort are the trivial one-run
  cases, needing no merging at all.  +sort ends up having one very long run
  and one very short, and so gets all the temp space it needs from the small
  temparray member of the MergeState struct (note that the same would be
  true if the new random elements were prefixed to the sorted list instead,
  but not if they appeared "in the middle").  3sort approaches N//3 temp
  slots twice, but the run lengths that remain after 3 random exchanges
  clearly have very high variance.


A detailed description of timsort follows.

Runs
----
count_run() returns the # of elements in the next run.  A run is either
"ascending", which means non-decreasing:

    a0 <= a1 <= a2 <= ...

or "descending", which means strictly decreasing:

    a0 > a1 > a2 > ...

Note that a run is always at least 2 long, unless we start at the array's
last element.

The definition of descending is strict, because the main routine reverses
a descending run in-place, transforming a descending run into an ascending
run.  Reversal is done via the obvious fast "swap elements starting at each
end, and converge at the middle" method, and that can violate stability if
the slice contains any equal elements.  Using a strict definition of
descending ensures that a descending run contains distinct elements.

If an array is random, it's very unlikely we'll see long runs.  If a natural
run contains less than minrun elements (see next section), the main loop
artificially boosts it to minrun elements, via a stable binary insertion sort
applied to the right number of array elements following the short natural
run.  In a random array, *all* runs are likely to be minrun long as a
result.  This has two primary good effects:

1. Random data strongly tends then toward perfectly balanced (both runs have
   the same length) merges, which is the most efficient way to proceed when
   data is random.

2. Because runs are never very short, the rest of the code doesn't make
   heroic efforts to shave a few cycles off per-merge overheads.  For
   example, reasonable use of function calls is made, rather than trying to
   inline everything.  Since there are no more than N/minrun runs to begin
   with, a few "extra" function calls per merge is barely measurable.

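The run-detection logic can be sketched in Python (a simplified model of the
C count_run(); the in-place reversal is folded in here for illustration,
while the real code reports "descending" back to its caller):

```python
def count_run(a, lo, hi):
    """Return the length of the run starting at a[lo] within a[lo:hi],
    reversing the run in place first if it's descending (a sketch)."""
    if hi - lo <= 1:
        return hi - lo
    n = 2
    if a[lo + 1] < a[lo]:
        # Strictly descending run: a[lo] > a[lo+1] > ...
        while lo + n < hi and a[lo + n] < a[lo + n - 1]:
            n += 1
        # Strictness guarantees the elements are distinct, so this
        # in-place reversal cannot violate stability.
        a[lo:lo + n] = a[lo:lo + n][::-1]
    else:
        # Ascending (non-decreasing) run: a[lo] <= a[lo+1] <= ...
        while lo + n < hi and not (a[lo + n] < a[lo + n - 1]):
            n += 1
    return n
```

Note that only `<` is used, as in the C code, and that a run of equal
elements counts as ascending (non-decreasing), never descending.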

Computing minrun
----------------
If N < 64, minrun is N.  IOW, binary insertion sort is used for the whole
array then; it's hard to beat that given the overheads of trying something
fancier (see note BINSORT).

When N is a power of 2, testing on random data showed that minrun values of
16, 32, 64 and 128 worked about equally well.  At 256 the data-movement cost
in binary insertion sort clearly hurt, and at 8 the increase in the number
of function calls clearly hurt.  Picking *some* power of 2 is important
here, so that the merges end up perfectly balanced (see next section).  We
pick 32 as a good value in the sweet range; picking a value at the low end
allows the adaptive gimmicks more opportunity to exploit shorter natural
runs.

Because sortperf.py only tries powers of 2, it took a long time to notice
that 32 isn't a good choice for the general case!  Consider N=2112:

>>> divmod(2112, 32)
(66, 0)
>>>

If the data is randomly ordered, we're very likely to end up with 66 runs
each of length 32.  The first 64 of these trigger a sequence of perfectly
balanced merges (see next section), leaving runs of lengths 2048 and 64 to
merge at the end.  The adaptive gimmicks can do that with fewer than 2048+64
compares, but it's still more compares than necessary, and-- mergesort's
bugaboo relative to samplesort --a lot more data movement (O(N) copies just
to get 64 elements into place).

If we take minrun=33 in this case, then we're very likely to end up with 64
runs each of length 33, and then all merges are perfectly balanced.  Better!

What we want to avoid is picking minrun such that in

    q, r = divmod(N, minrun)

q is a power of 2 and r>0 (then the last merge only gets r elements into
place, and r < minrun is small compared to N), or q a little larger than a
power of 2 regardless of r (then we've got a case similar to "2112", again
leaving too little work for the last merge to do).

Instead we pick a minrun in range(32, 65) such that N/minrun is exactly a
power of 2, or if that isn't possible, is close to, but strictly less than,
a power of 2.  This is easier to do than it may sound:  take the first 6
bits of N, and add 1 if any of the remaining bits are set.  In fact, that
rule covers every case in this section, including small N and exact powers
of 2; merge_compute_minrun() is a deceptively simple function.

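The "first 6 bits of N, plus 1 if any remaining bits are set" rule
translates directly to code; here is a Python rendering of it:

```python
def merge_compute_minrun(n):
    """Compute a good value for the minimum run length, per the rule:
    keep the 6 most significant bits of n, and add 1 if any of the
    bits shifted out were set."""
    r = 0                # becomes 1 if any shifted-out bit is set
    while n >= 64:
        r |= n & 1
        n >>= 1
    return n + r

print(merge_compute_minrun(2112))   # 33 -- 64 perfectly balanced runs
```

Note how the rule subsumes the small-N and power-of-2 cases for free:
for N < 64 the loop never runs and minrun is N itself, and for N an exact
power of 2 all shifted-out bits are clear, giving minrun = 32.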

The Merge Pattern
-----------------
In order to exploit regularities in the data, we're merging on natural
run lengths, and they can become wildly unbalanced.  That's a Good Thing
for this sort!  It means we have to find a way to manage an assortment of
potentially very different run lengths, though.

Stability constrains permissible merging patterns.  For example, if we have
3 consecutive runs of lengths

    A:10000  B:20000  C:10000

we dare not merge A with C first, because if A, B and C happen to contain
a common element, it would get out of order wrt its occurrence(s) in B.  The
merging must be done as (A+B)+C or A+(B+C) instead.

So merging is always done on two consecutive runs at a time, and in-place,
although this may require some temp memory (more on that later).

When a run is identified, its base address and length are pushed on a stack
in the MergeState struct.  merge_collapse() is then called to see whether it
should merge it with preceding run(s).  We would like to delay merging as
long as possible in order to exploit patterns that may come up later, but we
like even more to do merging as soon as possible to exploit that the run just
found is still high in the memory hierarchy.  We also can't delay merging
"too long" because it consumes memory to remember the runs that are still
unmerged, and the stack has a fixed size.

What turned out to be a good compromise maintains two invariants on the
stack entries, where A, B and C are the lengths of the three rightmost
not-yet-merged slices:

1.  A > B+C
2.  B > C

Note that, by induction, #2 implies the lengths of pending runs form a
decreasing sequence.  #1 implies that, reading the lengths right to left,
the pending-run lengths grow at least as fast as the Fibonacci numbers.
Therefore the stack can never grow larger than about log_base_phi(N) entries,
where phi = (1+sqrt(5))/2 ~= 1.618.  Thus a small # of stack slots suffice
for very large arrays.

If A <= B+C, the smaller of A and C is merged with B (ties favor C, for the
freshness-in-cache reason), and the new run replaces the A,B or B,C entries;
e.g., if the last 3 entries are

    A:30  B:20  C:10

then B is merged with C, leaving

    A:30  BC:30

on the stack.  Or if they were

    A:500  B:400  C:1000

then A is merged with B, leaving

    AB:900  C:1000

on the stack.

In both examples, the stack configuration after the merge still violates
invariant #2, and merge_collapse() goes on to continue merging runs until
both invariants are satisfied.  As an extreme case, suppose we didn't do the
minrun gimmick, and natural runs were of lengths 128, 64, 32, 16, 8, 4, 2,
and 2.  Nothing would get merged until the final 2 was seen, and that would
trigger 7 perfectly balanced merges.

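The collapsing rules can be modeled with run lengths alone (a sketch; the
real merge_collapse() works on the MergeState stack of (base, length)
entries and performs actual merges, but the invariant bookkeeping is the
same):

```python
def merge_collapse(stack):
    """Model of invariant maintenance: 'stack' holds pending run
    lengths; merging two runs is modeled by summing their lengths.
    Returns the merges performed, as (left_len, right_len) pairs."""
    merges = []
    while len(stack) > 1:
        if len(stack) >= 3 and stack[-3] <= stack[-2] + stack[-1]:
            # Invariant 1 violated: merge the smaller of A and C with B
            # (ties favor C, the fresher-in-cache run).
            if stack[-3] < stack[-1]:
                merges.append((stack[-3], stack[-2]))
                stack[-3:] = [stack[-3] + stack[-2], stack[-1]]
            else:
                merges.append((stack[-2], stack[-1]))
                stack[-2:] = [stack[-2] + stack[-1]]
        elif stack[-2] <= stack[-1]:
            # Invariant 2 violated: B <= C, so merge B with C.
            merges.append((stack[-2], stack[-1]))
            stack[-2:] = [stack[-2] + stack[-1]]
        else:
            break
    return merges

# The extreme case from the text: pushing 128, 64, ..., 4, 2 merges
# nothing, and the final 2 triggers a cascade of 7 balanced merges.
stack = []
for run in [128, 64, 32, 16, 8, 4, 2, 2]:
    stack.append(run)
    merge_collapse(stack)
print(stack)   # [256]
```

(Later CPython versions strengthened merge_collapse() to check a fourth
stack entry as well; the sketch above models only the two invariants
described here.)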
The thrust of these rules when they trigger merging is to balance the run
lengths as closely as possible, while keeping a low bound on the number of
runs we have to remember.  This is maximally effective for random data,
where all runs are likely to be of (artificially forced) length minrun, and
then we get a sequence of perfectly balanced merges (with, perhaps, some
oddballs at the end).

OTOH, one reason this sort is so good for partly ordered data has to do
with wildly unbalanced run lengths.


Merge Memory
------------
Merging adjacent runs of lengths A and B in-place, and in linear time, is
difficult.  Theoretical constructions are known that can do it, but they're
too difficult and slow for practical use.  But if we have temp memory equal
to min(A, B), it's easy.

If A is smaller (function merge_lo), copy A to a temp array, leave B alone,
and then we can do the obvious merge algorithm left to right, from the temp
area and B, starting the stores into where A used to live.  There's always a
free area in the original area comprising a number of elements equal to the
number not yet merged from the temp array (trivially true at the start;
proceed by induction).  The only tricky bit is that if a comparison raises an
exception, we have to remember to copy the remaining elements back in from
the temp area, lest the array end up with duplicate entries from B.  But
that's exactly the same thing we need to do if we reach the end of B first,
so the exit code is pleasantly common to both the normal and error cases.

If B is smaller (function merge_hi, which is merge_lo's "mirror image"),
much the same, except that we need to merge right to left, copying B into a
temp array and starting the stores at the right end of where B used to live.

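The merge_lo scheme (minus galloping and the exception path) can be
sketched in Python, with the two runs as adjacent slices of one list:

```python
def merge_lo(a, lo, mid, hi):
    """Merge adjacent sorted runs a[lo:mid] and a[mid:hi], where the
    left run is the shorter one: copy it aside and merge left to
    right, storing into the space it vacates (a sketch of the idea)."""
    tmp = a[lo:mid]          # temp memory equal to len(A) == min(A, B)
    i = 0                    # next unmerged element of tmp (the old A)
    j = mid                  # next element of B, still in place
    dest = lo                # next free slot; dest <= j by induction
    while i < len(tmp) and j < hi:
        if a[j] < tmp[i]:    # strict < keeps the merge stable:
            a[dest] = a[j]   # on ties, A's element goes first
            j += 1
        else:
            a[dest] = tmp[i]
            i += 1
        dest += 1
    # If B was exhausted first, the tail of tmp is copied back here --
    # the same cleanup the text describes for the exception case.
    a[dest:dest + len(tmp) - i] = tmp[i:]
```

The induction invariant is visible in the indices: the free area
`a[dest:j]` always has exactly `len(tmp) - i` slots, so stores never
overwrite unmerged elements of B.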
A refinement:  When we're about to merge adjacent runs A and B, we first do
a form of binary search (more on that later) to see where B[0] should end up
in A.  Elements in A preceding that point are already in their final
positions, effectively shrinking the size of A.  Likewise we also search to
see where A[-1] should end up in B, and elements of B after that point can
also be ignored.  This cuts the amount of temp memory needed by the same
amount.

These preliminary searches may not pay off, and can be expected *not* to
repay their cost if the data is random.  But they can win huge in all of
time, copying, and memory savings when they do pay, so this is one of the
"per-merge overheads" mentioned above that we're happy to endure because
there is at most one very short run.  It's generally true in this algorithm
that we're willing to gamble a little to win a lot, even though the net
expectation is negative for random data.


Merge Algorithms
----------------
merge_lo() and merge_hi() are where the bulk of the time is spent.  merge_lo
deals with runs where A <= B, and merge_hi where A > B.  They don't know
whether the data is clustered or uniform, but a lovely thing about merging
is that many kinds of clustering "reveal themselves" by how many times in a
row the winning merge element comes from the same run.  We'll only discuss
merge_lo here; merge_hi is exactly analogous.

Merging begins in the usual, obvious way, comparing the first element of A
to the first of B, and moving B[0] to the merge area if it's less than A[0],
else moving A[0] to the merge area.  Call that the "one pair at a time"
mode.  The only twist here is keeping track of how many times in a row "the
winner" comes from the same run.

If that count reaches MIN_GALLOP, we switch to "galloping mode".  Here
we *search* B for where A[0] belongs, and move over all the B's before
that point in one chunk to the merge area, then move A[0] to the merge
area.  Then we search A for where B[0] belongs, and similarly move a
slice of A in one chunk.  Then back to searching B for where A[0] belongs,
etc.  We stay in galloping mode until both searches find slices to copy
less than MIN_GALLOP elements long, at which point we go back to
one-pair-at-a-time mode.

A refinement:  The MergeState struct contains the value of min_gallop that
controls when we enter galloping mode, initialized to MIN_GALLOP.
merge_lo() and merge_hi() adjust this higher when galloping isn't paying
off, and lower when it is.


Galloping
---------
Still without loss of generality, assume A is the shorter run.  In galloping
mode, we first look for A[0] in B.  We do this via "galloping", comparing
A[0] in turn to B[0], B[1], B[3], B[7], ..., B[2**j - 1], ..., until finding
the k such that B[2**(k-1) - 1] < A[0] <= B[2**k - 1].  This takes at most
roughly lg(B) comparisons, and, unlike a straight binary search, favors
finding the right spot early in B (more on that later).

After finding such a k, the region of uncertainty is reduced to 2**(k-1) - 1
consecutive elements, and a straight binary search requires exactly k-1
additional comparisons to nail it (see note REGION OF UNCERTAINTY).  Then we
copy all the B's up to that point in one chunk, and then copy A[0].  Note
that no matter where A[0] belongs in B, the combination of galloping + binary
search finds it in no more than about 2*lg(B) comparisons.

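The exponential-then-binary search can be sketched like so (a simplified
model that always starts probing from b[0]; the C gallop functions also
take a starting "hint" index and come in left/right variants for the tie
rules):

```python
import bisect

def gallop(key, b):
    """Return the number of elements of b that are < key, by galloping:
    probe b[0], b[1], b[3], b[7], ..., b[2**j - 1], then binary-search
    the narrowed region of uncertainty."""
    if not b or not (b[0] < key):
        return 0
    # Find ofs = 2**k - 1 with b[prev] < key <= b[ofs] (or run off the end).
    prev, ofs = 0, 1
    while ofs < len(b) and b[ofs] < key:
        prev = ofs
        ofs = 2 * ofs + 1        # 1, 3, 7, 15, ...
    ofs = min(ofs, len(b))
    # Binary search the remaining slice b[prev+1:ofs].
    return bisect.bisect_left(b, key, prev + 1, ofs)
```

A key landing at b[0] is found in a single comparison, while a key
belonging anywhere in b costs no more than about 2*lg(len(b)) comparisons,
matching the bound above.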
If we did a straight binary search, we could find it in no more than
ceiling(lg(B+1)) comparisons -- but straight binary search takes that many
comparisons no matter where A[0] belongs.  Straight binary search thus loses
to galloping unless the run is quite long, and we simply can't guess
whether it is in advance.

If data is random and runs have the same length, A[0] belongs at B[0] half
the time, at B[1] a quarter of the time, and so on:  a consecutive winning
sub-run in B of length k occurs with probability 1/2**(k+1).  So long
winning sub-runs are extremely unlikely in random data, and guessing that a
winning sub-run is going to be long is a dangerous game.

OTOH, if data is lopsided or lumpy or contains many duplicates, long
stretches of winning sub-runs are very likely, and cutting the number of
comparisons needed to find one from O(B) to O(log B) is a huge win.

Galloping compromises by getting out fast if there isn't a long winning
sub-run, yet finding such very efficiently when they exist.

I first learned about the galloping strategy in a related context; see:

    "Adaptive Set Intersections, Unions, and Differences" (2000)