ceph/src/boost/libs/sort/doc/sort.qbk

   1 [library Boost.Sort
   2   [quickbook 1.7]
   3   [id sort]
   4   [copyright 2014 Steven Ross]
   5   [authors [Ross, Steven]]
   6   [dirname sort]
   7   [license Distributed under the
   8     [@http://boost.org/LICENSE_1_0.txt Boost Software License,  Version 1.0].
   9   ]
  10 ]
  11
  12 [/ Some composite templates]
  13 [template super[x]'''<superscript>'''[x]'''</superscript>''']
  14 [template sub[x]'''<subscript>'''[x]'''</subscript>''']
  15 [template floor[x]'''&#x230A;'''[x]'''&#x230B;''']
  16 [template floorlr[x][lfloor][x][rfloor]]
  17 [template ceil[x] '''&#x2308;'''[x]'''&#x2309;''']
  18
  19 [/ Required for autoindexing]
  20 [import ../../../tools/auto_index/include/auto_index_helpers.qbk]
  21 [/ Must be first included file!]
  22
  23 [/Files containing quickbook snippets]
  24 [import ../example/charstringsample.cpp]
  25 [import ../example/stringfunctorsample.cpp]
  26 [import ../example/reverseintsample.cpp]
  27 [import ../example/rightshiftsample.cpp]
  28 [import ../example/int64.cpp]
  29 [import ../example/floatfunctorsample.cpp]
  30 [import ../example/generalizedstruct.cpp]
  31
  32 [import html4_symbols.qbk] [/ Provides various useful squiggles.]
  33
  34 [def __spreadsort [@http://en.wikipedia.org/wiki/Spreadsort spreadsort]]
  35 [def __introsort [@http://en.wikipedia.org/wiki/Introsort introsort]]
  36 [def __stl_sort [@http://www.cplusplus.com/reference/algorithm/sort/ STL std::sort]]
  37 [def __big_o [@http://en.wikipedia.org/wiki/Big_O_notation Big O notation]]
  38 [def __radix_sort[@http://en.wikipedia.org/wiki/Radix_sort radix sort]]
  39 [def __adaptive_left_reflex [@http://www.nik.no/2002/Maus.pdf  Arne Maus, Adaptive Left Reflex]]
  40 [def __american_flag [@http://en.wikipedia.org/wiki/American_flag_sort American flag sort]]
  41 [def __overloading [link sort.overview.overloading overloading]]
  42 [def __lookup [link sort.rationale.lookup lookup]]
  43 [def __random_access_iter [@http://en.cppreference.com/w/cpp/concept/RandomAccessIterator  RandomAccessIterator]]
  44 [def __strictweakordering [@http://en.wikipedia.org/wiki/Weak_ordering#Strict_weak_orderings strict weak ordering]]
  45
  46 [/ Links to functions for use in text]
  47 [def __integer_sort [^[funcref boost::sort::spreadsort::integer_sort  integer_sort]]]
  48 [def __float_sort [^[funcref boost::sort::spreadsort::float_sort  float_sort]]]
  49 [def __string_sort [^[funcref boost::sort::spreadsort::string_sort  string_sort]]]
  50 [def __spreadsort [^[funcref boost::sort::spreadsort::spreadsort  spreadsort]]] [/Note diff from Wiki link __spreadsort]
  51 [def __std_sort [@http://en.cppreference.com/w/cpp/algorithm/sort std::sort]]
  52
  53 [section:overview Overview]
  54
  55 [section:intro Introduction]
  56
  57 The Boost.Sort library provides a generic implementation of high-speed sorting algorithms
  58 that outperform those in the C++ standard in both average and worst case performance
  59 when there are over 1000 elements in the list to sort.
  60
  61 They fall back to __stl_sort on small data sets.
  62
  63 [warning These algorithms all only work on
  64 [@http://www.cplusplus.com/reference/iterator/RandomAccessIterator/ random access iterators].]
  65
  66 They are hybrids using both radix and comparison-based sorting,
  67 specialized to sorting common data types, such as integers, floats, and strings.
  68
  69 These algorithms are encoded in a generic fashion and accept functors,
  70 enabling them to sort any object that can be processed like these basic data types.
  71 In the case of __string_sort, this includes anything
  72 with a defined strict-weak-ordering that __std_sort can sort,
  73 but writing efficient functors for some complex key types
  74 may not be worth the additional effort relative to just using __std_sort,
  75 depending on how important speed is to your application.
  76 Sample usages are available in the example directory.
  77
  78 Unlike many radix-based algorithms,
  79 the underlying __spreadsort algorithm is designed around [*worst-case performance].
  80 It performs better on chunky data (where it is not widely distributed),
  81 so that on real data it can perform substantially better than on random data.
  82 Conceptually, __spreadsort can sort any data for which an absolute ordering can be determined,
  83 and __string_sort is sufficiently flexible that this should be possible.
  84
  85 Situations where __spreadsort is fastest relative to __std_sort:
  86
  87 # Large number of elements to sort (['N] >= 10000).
  88
  89 # Slow comparison function (such as floating-point numbers on x86 processors or strings).
  90
  91 # Large data elements (such as key + data sorted on a key).
  92
  93 # Completely sorted data when __spreadsort has an optimization to quit early in this case.
  94
  95 Situations where __spreadsort is slower than __std_sort:
  96
  97 # Data sorted in reverse order. Both __std_sort and __spreadsort are faster
  98 on reverse-ordered data than randomized data,
  99 but __std_sort speeds up more in this special case.
 100
 101 # Very small amounts of data (< 1000 elements).
 102 For this reason there is a fallback in __spreadsort to __std_sort
 103 if the input size is less than 1000,
 104 so performance is identical for small amounts of data in practice.
 105
 106 These functions are defined in `namespace boost::sort::spreadsort`.
 107
 108 [endsect] [/section Introduction]
 109
 110 [section:overloading Overloading]
 111
 112 [tip In the Boost.Sort C++ Reference section, click on the appropriate overload, for example `float_sort(RandomAccessIter, RandomAccessIter, Right_shift, Compare);` to get full details of that overload.]
 113
 114 Each of __integer_sort, __float_sort, and __string_sort have 3 main versions:
 115 The base version, which takes a first iterator and a last iterator, just like __std_sort:
 116
 117    integer_sort(array.begin(), array.end());
 118    float_sort(array.begin(), array.end());
 119    string_sort(array.begin(), array.end());
 120
 121 The version with an overridden shift functor, providing flexibility
 122 in case the `operator>>` already does something other than a bitshift.
 123 The rightshift functor takes two args, first the data type,
 124 and second a natural number of bits to shift right.
 125
 126 For __string_sort this variant is slightly different;
 127 it needs a bracket functor equivalent to `operator`\[\],
 128 taking a number corresponding to the character offset,
 129 along with a second `getlength` functor to get the length of the string in characters.
 130 In all cases, this operator must return an integer type that compares with the
 131 `operator<` to provide the intended order
 132 (integers can be negated to reverse their order).
 133
 134 In other words (aside from negative floats, which are inverted as ints):
 135
 136    rightshift(A, n) < rightshift(B, n) -> A < B
 137    A < B -> rightshift(A, 0) < rightshift(B, 0)
 138
 139 [rightshift_1]
 140 [bracket_1]
 141
 142 See [@../../example/rightshiftsample.cpp  rightshiftsample.cpp] for a working example of integer sorting with a rightshift functor.
 143
 144 And a version with a comparison functor for maximum flexibility.
 145 This functor must provide the same sorting order as the integers returned by the rightshift (aside from negative floats):
 146
 147   rightshift(A, n) < rightshift(B, n) -> compare(A, B)
 148   compare(A, B) -> rightshift(A, 0) < rightshift(B, 0)
 149
 150 [reverse_int_2]
 151
 152 Examples of functors are:
 153
 154 [lessthan_functor]
 155
 156 [bracket_functor]
 157
 158 [getsize_functor]
 159
 160 and these functors are used thus:
 161
 162 [stringsort_functors_call]
 163
 164 See [@../../example/stringfunctorsample.cpp  stringfunctorsample.cpp] for a working example of sorting strings with all functors.
 165
 166 [endsect] [/section:overloading Overloading]
 167
 168 [section:performance Performance]
 169
 170 The __spreadsort algorithm is a hybrid algorithm;
 171 when the number of elements being sorted is below a certain number,
 172 comparison-based sorting is used. Above it, radix sorting is used.
 173 The radix-based algorithm will thus cut up the problem into small pieces,
 174 and either completely sort the data based upon its radix if the data is clustered,
 175 or finish sorting the cut-down pieces with comparison-based sorting.
 176
 177 The Spreadsort algorithm dynamically chooses
 178 either comparison-based or radix-based sorting when recursing,
 179 whichever provides better worst-case performance.
 180 This way worst-case performance is guaranteed to be the better of
 181 ['[bigo](N[sdot]log2(N))] comparisons and ['[bigo](N[sdot]log2(K/S + S))] operations where
 182
 183 * ['N] is the number of elements being sorted,
 184 * ['K] is the length in bits of the key, and
 185 * ['S] is a constant.
 186
 187 This results in substantially improved performance for large [' N];
 188 __integer_sort tends to be 50% to 2X faster than __std_sort,
 189 while __float_sort and _string_sort are roughly 2X faster than __std_sort.
 190
 191 Performance graphs are provided for __integer_sort,  __float_sort, and __string_sort in their description.
 192
 193 Runtime Performance comparisons and graphs were made on a Core 2 Duo laptop
 194 running Windows Vista 64 with MSVC 8.0,
 195 and an old G4 laptop running Mac OSX with gcc.
 196 [@http://www.boost.org/build/doc/html/ Boost bjam/b2]  was used to control compilation.
 197
 198 Direct performance comparisons on a newer x86 system running Ubuntu,
 199 with the fallback to __std_sort at lower input sizes disabled are below.
 200
 201 [note The fallback to __std_sort for smaller input sizes prevents
 202 the worse performance seen on the left sides of the first two graphs.]
 203
 204 __integer_sort starts to become faster than __std_sort at about 1000 integers (4000 bytes),
 205 and __string_sort becomes faster than __std_sort at slightly fewer bytes (as few as 30 strings).
 206
 207 [note The 4-threaded graph has 4 threads doing [*separate sorts simultaneously]
 208 (not splitting up a single sort)
 209 as a test for thread cache collision and other multi-threaded performance issues.]
 210
 211 __float_sort times are very similar to __integer_sort times.
 212
 213 [/ These provide links to the images, but currently graphs are shown - see below]
 214 [/@../../doc/images/single_threaded.png  single_threaded.png] [/file:///I:/modular-boost/libs/sort/doc/images/single_threaded.png]
 215 [/@../../doc/images/4_threaded.png 4_threaded.png]
 216 [/@../../doc/images/entropy.png entropy.png]
 217 [/@../../doc/images/bits_per_byte.png bits_per_byte.png]
 218
 219 [$../images/single_threaded.png]  [/<img src="../images/single_threaded.png"> == file:///I:/modular-boost/libs/sort/doc/images/single_threaded.png]
 220 [$../images/4_threaded.png]
 221 [$../images/entropy.png]
 222 [$../images/bits_per_byte.png]
 223
 224 Histogramming with a fixed maximum number of splits is used
 225 because it reduces the number of cache misses,
 226 thus improving performance relative to the approach described in detail
 227 in the [@http://en.wikipedia.org/wiki/Spreadsort original SpreadSort publication].
 228
 229 The importance of cache-friendly histogramming is described
 230 in __adaptive_left_reflex,
 231 though without the worst-case handling described below.
 232
 233 The time taken per radix iteration is:
 234
 235 ['[bigo](N)] iterations over the data
 236
 237 ['[bigo](N)] integer-type comparisons (even for _float_sort and __string_sort)
 238
 239 ['[bigo](N)] swaps
 240
 241 ['[bigo](2[super S])] bin operations.
 242
 243 To obtain ['[bigo](N)] worst-case performance per iteration,
 244 the restriction ['S <= log2(N)] is applied, and ['[bigo](2[super S])] becomes ['[bigo](N)].
 245 For each such iteration, the number of unsorted bits log2(range)
 246 (referred to as ['K]) per element is reduced by ['S].
 247 As ['S] decreases depending upon the amount of elements being sorted,
 248 it can drop from a maximum of ['S[sub max]] to the minimum of ['S[sub min]].
 249
 250 Assumption: __std_sort is assumed to be ['[bigo](N*log2(N))],
 251 as __introsort exists and is commonly used.
 252 (If you have a quibble with this please take it up with the implementor of your __std_sort;
 253 you're welcome to replace the recursive calls to __std_sort to calls
 254 to __introsort if your __std_sort library call is poorly implemented).
 255
 256 [@http://en.wikipedia.org/wiki/Introsort Introsort] is not included with this algorithm for simplicity and
 257 because the implementor of the __std_sort call
 258 is assumed to know what they're doing.
 259
 260 To maintain a minimum value for ['S (S[sub min])],
 261 comparison-based sorting has to be used to sort when
 262 ['n <= log2(meanbinsize)], where ['log2(meanbinsize) (lbs)] is a small constant,
 263 usually between 0 and 4, used to minimize bin overhead per element.
 264 There is a small corner-case where if ['K < S[sub min]] and ['n >= 2^K],
 265 then the data can be sorted in a single radix-based iteration with an ['S = K]
 266 (this bucketsorting special case is by default only applied to __float_sort).
 267 So for the final recursion, worst-case performance is:
 268
 269 1 radix-based iteration if ['K <= S[sub min]],
 270
 271 or ['S[sub min] + lbs] comparison-based iterations if ['K > S[sub min]] but ['n <= 2[super (S[sub min] + lbs)]].
 272
 273 So for the final iteration, worst-case runtime is ['[bigo](N*(S[sub min] + lbs))] but
 274 if ['K > S[sub min]] and ['N > 2[super (S[sub min] + lbs)]]
 275 then more than 1 radix recursion will be required.
 276
 277 For the second to last iteration, ['K <= S[sub min] * 2 + 1] can be handled,
 278 (if the data is divided into ['2[super (S[sub min] + 1)]] pieces)
 279 or if ['N < 2[super (S[sub min] + lbs + 1)]],
 280 then it is faster to fallback to __std_sort.
 281
 282 In the case of a radix-based sort plus recursion, it will take
 283 ['[bigo](N*(S[sub min] + lbs)) + [bigo](N) = [bigo](N*(S[sub min] + lbs + 1))] worst-case time,
 284 as
 285 ['K_remaining = K_start - (S[sub min] + 1)], and ['K_start <= S[sub min] * 2 + 1].
 286
 287 Alternatively, comparison-based sorting is used if ['N < 2[super (S[sub min] + lbs + 1)]],
 288 which will take ['[bigo](N*(S[sub min] + lbs + 1))] time.
 289
 290 So either way ['[bigo](N*(S[sub min] + lbs + 1))] is the worst-case time for the second to last iteration,
 291 which occurs if ['K <= S[sub min] * 2 + ]1 or ['N < 2[super (S[sub min] + lbs + 1)]].
 292
 293 This continues as long as ['S[sub min] <= S <= S[sub max]],
 294 so that for ['K_m <= K_(m-1) + S[sub min] + m] where ['m]
 295 is the maximum number of iterations after this one has finished,
 296 or where ['N < 2[super (S[sub min] + lbs + m)]],
 297 then the worst-case runtime is ['[bigo](N*(S[sub min] + lbs + m))].
 298
 299 [space][space]['K_m] at ['m <= (S[sub max] - S[sub min])] works out to:
 300
 301 [space][space]['K_1 <= (S[sub min]) + S[sub min] + 1 <= 2S[sub min] + 1]
 302
 303 [space][space]['K_2 <= (2S[sub min] + 1) + S[sub min] + 2]
 304
 305 as the sum from 0 to ['m] is ['m(m + 1)/2]
 306
 307 [space][space]['K_m <= (m + 1)S[sub min] + m(m + 1)/2 <= (S[sub min] + m/2)(m + 1)]
 308
 309 substituting in S[sub max] - S[sub min] for m
 310
 311 [space][space]['K_(S[sub max] - S[sub min]) <= (S[sub min] + (S[sub max] - S[sub min])/2)*(S[sub max] - S[sub min] + 1)]
 312
 313 [space][space]['K_(S[sub max] - S[sub min]) <= (S[sub min] + S[sub max]) * (S[sub max] - S[sub min] + 1)/2]
 314
 315 Since this involves ['S[sub max] - S[sub min] + 1] iterations,
 316 this works out to dividing ['K] into an average ['(S[sub min] + S[sub max])]/2
 317 pieces per iteration.
 318
 319 To finish the problem from this point takes ['[bigo](N * (S[sub max] - S[sub min]))] for ['m] iterations,
 320 plus the worst-case of ['[bigo](N*(S[sub min] + lbs))] for the last iteration,
 321 for a total of ['[bigo](N *(S[sub max] + lbs))] time.
 322
 323 When ['m > S[sub max] - S[sub min]], the problem is divided into ['S[sub max]] pieces per iteration,
 324 or __std_sort is called if ['N < 2^(m + S[sub min] + lbs)]. For this range:
 325
 326 [space][space]['K_m <= K_(m - 1) + S[sub max]], providing runtime of
 327
 328 [space][space]['[bigo](N *((K - K_(S[sub max] - S[sub min]))/S[sub max] + S[sub max] + lbs))] if recursive,
 329
 330 or ['[bigo](N * log(2^(m + S[sub min] + lbs)))] if comparison-based,
 331
 332 which simplifies to ['[bigo](N * (m + S[sub min] + lbs))],
 333 which substitutes to ['[bigo](N * ((m - (S[sub max] - S[sub min])) + S[sub max] + lbs))],
 334 which given that ['m - (S[sub max] - S[sub min]) <= (K - K_(S[sub max] - S[sub min]))/S[sub max]]
 335 (otherwise a lesser number of radix-based iterations would be used)
 336
 337 also comes out to ['[bigo](N *((K - K_(S[sub max] - S[sub min]))/S[sub max] + S[sub max] + lbs))].
 338
 339 Asymptotically, for large ['N] and large ['K], this simplifies to:
 340
 341 [space][space]['[bigo](N * (K/S[sub max] + S[sub max] + lbs))],
 342
 343 simplifying out the constants related to the ['S[sub max] - S[sub min]] range,
 344 providing an additional ['[bigo](N * (S[sub max] + lbs))] runtime on top of the
 345 ['[bigo](N * (K/S))] performance of LSD __radix_sort,
 346 but without the ['[bigo](N)] memory overhead.
 347 For simplicity, because ['lbs] is a small constant
 348 (0 can be used, and performs reasonably),
 349 it is ignored when summarizing the performance in further discussions.
 350 By checking whether comparison-based sorting is better,
 351 Spreadsort is also ['[bigo](N*log(N))], whichever is better,
 352 and unlike LSD __radix_sort, can perform much better than the worst-case
 353 if the data is either evenly distributed or highly clustered.
 354
 355 This analysis was for __integer_sort and __float_sort.
 356 __string_sort differs in that ['S[sub min] = S[sub max] = sizeof(Char_type) * 8],
 357 ['lbs] is 0, and that __std_sort's comparison is not a constant-time operation,
 358 so strictly speaking __string_sort runtime is
 359
 360 [space][space]['[bigo](N * (K/S[sub max] + (S[sub max] comparisons)))].
 361
 362 Worst-case, this ends up being ['[bigo](N * K)]
 363 (where ['K] is the mean string length in bytes),
 364 as described for __american_flag, which is better than the
 365
 366 [space][space]['[bigo](N * K * log(N))]
 367
 368 worst-case for comparison-based sorting.
 369
 370 [endsect] [/section:performance Performance]
 371
 372 [section:tuning Tuning]
 373 __integer_sort and __float_sort have tuning constants that control
 374 how the radix-sorting portion of those algorithms work.
 375 The ideal constant values for __integer_sort and __float_sort vary depending on
 376 the platform, compiler, and data being sorted.
 377 By far the most important constant is ['max_splits],
 378 which defines how many pieces the radix-sorting portion splits
 379 the data into per iteration.
 380
 381 The ideal value of ['max_splits] depends upon the size of the L1 processor cache,
 382 and is between 10 and 13 on many systems.
 383 A default value of 11 is used. For mostly-sorted data, a much larger value is better,
 384 as swaps (and thus cache misses) are rare,
 385 but this hurts runtime severely for unsorted data, so is not recommended.
 386
 387 On some x86 systems, when the total number of elements being sorted is small
 388 ( less than 1 million or so), the ideal ['max_splits] can be substantially larger,
 389 such as 17. This is suspected to be because all the data fits into the L2 cache,
 390 and misses from L1 cache to L2 cache do not impact performance
 391 as severely as misses to main memory.
 392 Modifying tuning constants other than ['max_splits] is not recommended,
 393 as the performance improvement for changing other constants is usually minor.
 394
 395 If you can afford to let it run for a day, and have at least 1GB of free memory,
 396 the perl command: `./tune.pl -large -tune` (UNIX)
 397 or `perl tune.pl -large -tune -windows` (Windows)
 398 can be used to automatically tune these constants.
 399 This should be run from the `libs/sort directory` inside the boost home directory.
 400 This will work to identify the `ideal constants.hpp` settings for your system,
 401 testing on various distributions in a 20 million element (80MB) file,
 402 and additionally verifies that all sorting routines sort correctly
 403 across various data distributions.
 404 Alternatively, you can test with the file size you're most concerned with
 405 `./tune.pl number -tune` (UNIX) or `perl tune.pl number -tune -windows` (Windows).
 406 Substitute the number of elements you want to test with for `number`.
 407 Otherwise, just use the options it comes with, they're decent.
 408 With default settings `./tune.pl -tune` (UNIX) `perl tune.pl -tune -windows` (Windows),
 409 the script will take hours to run (less than a day),
 410 but may not pick the correct ['max_splits] if it is over 10.
 411 Alternatively, you can add the `-small` option to make it take just a few minutes,
 412 tuning for smaller vector sizes (one hundred thousand elements),
 413 but the resulting constants may not be good for large files
 414 (see above note about ['max_splits] on Windows).
 415
 416 The tuning script can also be used just to verify that sorting works correctly
 417 on your system, and see how much of a speedup it gets,
 418 by omiting the "-tune" option. This runs at the end of tuning runs.
 419 Default args will take about an hour to run and give accurate results
 420 on decent-sized test vectors. `./tune.pl -small` (UNIX) `perl tune.pl -small -windows` (Windows)
 421 is a faster option, that tests on smaller vectors and isn't as accurate.
 422
 423 If any differences are encountered during tuning, please call `tune.pl` with `-debug > log_file_name`.
 424 If the resulting log file contains compilation or permissions issues,
 425 it is likely an issue with your setup.
 426 If some other type of error is encountered (or result differences),
 427 please send them to the library author at spreadsort@gmail.com.
 428 Including the zipped `input.txt` that was being used is also helpful.
 429
 430 [endsect] [/section:tuning Tuning]
 431
 432 [endsect] [/section Overview]
 433
 434 [section:sort_hpp Spreadsort]
 435
 436 [section:header_spreadsort Header `<boost/sort/spreadsort/spreadsort.hpp>`]
 437
 438 __spreadsort checks whether the data-type provided is an integer,
 439 castable float, string, or wstring.
 440
 441 * If data-type is an integer, __integer_sort is used.
 442 * If data-type is a float, __float_sort is used.
 443 * If data-type is a string or wstring, __string_sort is used.
 444 * Sorting other data-types requires picking between
 445 __integer_sort, __float_sort  and __string_sort  directly,
 446 as __spreadsort won't accept types that don't have the appropriate type traits.
 447
 448 Overloading variants are provided that permit use of user-defined right-shift functors and comparison functors.
 449
 450 Each function is optimized for its set of arguments; default functors are not provided to avoid the risk of any reduction of performance.
 451
 452 See __overloading section.
 453
 454 [h5 Rationale:]
 455
 456 __spreadsort function provides a wrapper that calls the fastest sorting algorithm
 457 available for a data-type, enabling faster generic programming.
 458
 459 [section:spreadsort_examples Spreadsort Examples]
 460
 461 See [@../../example/ example] folder for all examples.
 462
 463 See [@../../example/sample.cpp  sample.cpp] for a simple working example.
 464
 465 For an example of 64-bit integer sorting, see [@../../example/int64.cpp  int64.cpp].
 466
 467 This example sets the element type of a vector to 64-bit integer
 468
 469 [int64bit_1]
 470
 471 and calls the sort
 472
 473 [int64bit_2]
 474
 475 For a simple example sorting `float`s,
 476
 477         vector<float> vec;
 478         vec.push_back(1.0);
 479         vec.push_back(2.3);
 480         vec.push_back(1.3);
 481         ...
 482         spreadsort(vec.begin(), vec.end());
 483         //The sorted vector contains "1.0 1.3 2.3 ..."
 484
 485 See also [@../../example/floatsample.cpp  floatsample.cpp] which checks for abnormal values.
 486
 487 [endsect] [/section:spreadsort_examples Spreadsort Examples]
 488
 489 [endsect] [/section:header_spreadsort Header `<boost/sort/spreadsort/spreadsort.hpp>`]
 490
 491 [section:integer_sort Integer Sort]
 492
 493 __integer_sort is a fast templated in-place hybrid radix/comparison algorithm,
 494 which in testing tends to be roughly 50% to 2X faster than
 495 __std_sort for large tests (>=100kB).
 496 Worst-case performance is ['[bigo](N * (log2(range)/s + s))],
 497 so __integer_sort is asymptotically faster than pure comparison-based algorithms.
 498 ['s] is ['max_splits], which defaults to 11,
 499 so its worst-case with default settings for 32-bit integers is ['[bigo](N * ((32/11)]
 500 slow radix-based iterations + 11 fast comparison-based iterations).
 501
 502 Some performance plots of runtime vs. n and log2(range) are provided below:
 503
 504 [@../../doc/graph/windows_integer_sort.htm Windows Integer Sort]
 505
 506 [@../../doc/graph/osx_integer_sort.htm  OSX integer Sort]
 507
 508 [section:integersort_examples Integer Sort Examples]
 509
 510 See [@../../example/rightshiftsample.cpp  rightshiftsample.cpp] for a working example of using rightshift, using a user-defined functor:
 511
 512 [rightshift_int_functor]
 513
 514 Other examples:
 515
 516 [@../../example/keyplusdatasample.cpp Sort structs using an integer key.]
 517
 518 [@../../example/reverseintsample.cpp Sort integers in reverse order.]
 519
 520 [@../../example/mostlysorted.cpp Simple sorting of integers; this case is a performance test for integers that are already mostly sorted.]
 521
 522 [endsect] [/section:integersort_examples Integer Sort Examples]
 523
 524 [endsect] [/section:integer_sort Integer Sort]
 525
 526 [section:float_sort Float Sort]
 527
 528 __float_sort is a fast templated in-place hybrid radix/comparison algorithm much like __integer_sort, but sorts IEEE floating-point numbers (positive, zero, NaN, and negative) into ascending order by casting them to integers.  This works because positive IEEE floating-point numbers sort like integers with the same bits, and negative IEEE floating-point numbers sort in the reverse order of integers with the same bits.  float_sort is roughly 2X as fast as std::sort.
 529
 530 -0.0 vs. 0.0 and NaN are given definitive ordered positions by the radix-based portion of this algorithm, where comparison-based sorting does not guarantee their relative position.  The included tests avoid creating NaN and -0.0 so that results match std::sort, which is not consistent in how it handles these numbers, as they compare as equal to numbers with different values.
 531
 532 float_sort checks the size of the data type and whether it is castable, picking
 533  an integer type to cast to, if a casting functor isn't provided by the user.
 534
 535 float_mem_cast casts IEEE floating-point numbers (positive, zero, NaN, and negative) into integers.  This is an essential utility for creating a custom rightshift functor for float_sort, when one is needed.  Only IEEE floating-point numbers of the same size as the integer type being cast to should be used in this specialized method call.
 536 Worst-case performance is ['[bigo](N * (log2(range)/s + s))],
 537 so __float_sort is asymptotically faster than pure comparison-based algorithms.
 538 ['s] is ['max_splits], which defaults to 11,
 539 so its worst-case with default settings for 32-bit integers is ['[bigo](N * ((32/11)]
 540 slow radix-based iterations + 11 fast comparison-based iterations).
 541
 542 Some performance plots of runtime vs. n and log2(range) are provided below:
 543
 544 [@../../doc/graph/windows_float_sort.htm Windows Float Sort]
 545
 546 [@../../doc/graph/osx_float_sort.htm  OSX Float Sort]
 547
 548 [section:floatsort_examples Float Sort Examples]
 549
 550 See [@../../example/floatfunctorsample.cpp floatfunctorsample.cpp] for a working example of how to sort structs with a float key:
 551
 552 [float_functor_types]
 553
 554 [float_functor_datatypes]
 555
 556 Right-shift functor:
 557
 558 [float_functor_rightshift]
 559
 560 Comparison lessthan `operator<` functor:
 561
 562 [float_functor_lessthan]
 563
 564 Other examples:
 565
 566 [@../../example/double.cpp Sort doubles.]
 567
 568 [@../../example/shiftfloatsample.cpp Sort floats using a rightshift functor.]
 569
 570 [endsect] [/section:floatsort_examples Float Sort Examples]
 571
 572 [endsect] [/section:float_sort Float Sort]
 573
 574 [section:string_sort String Sort]
 575 __string_sort is a hybrid radix-based/comparison-based algorithm that sorts strings of characters (or arrays of binary data) in ascending order.
 576
 577 The simplest version (no functors) sorts strings of items that can cast to an unsigned data type (such as an unsigned char), have a &lt; operator, have a size function, and have a data() function that returns a pointer to an array of characters, such as a std::string. The functor version can sort any data type that has a strict weak ordering, via templating, but requires definitions of a get_char (acts like x[offset] on a string or a byte array), get_length (returns length of the string being sorted), and a comparison functor.  Individual characters returned by get_char must support the != operator and have an unsigned value that defines their lexicographical order.
 578
 579 This algorithm is not efficient for character types larger than 2 bytes each, and is optimized for one-byte character strings.  For this reason, __std_sort will be called instead if the character type is of size > 2.
 580
 581 __string_sort has a special optimization for identical substrings.  This adds some overhead on random data, but identical substrings are common in real strings.
 582
 583 reverse_string_sort sorts strings in reverse (decending) order, but is otherwise identical.    __string_sort is sufficiently flexible that it should sort any data type that __std_sort can, assuming the user provides appropriate functors that index into a key.
 584
 585 [@../../doc/graph/windows_string_sort.htm Windows String Sort]
 586
 587 [@../../doc/graph/osx_string_sort.htm  OSX String Sort]
 588
 589
 590
 591 [section:stringsort_examples String Sort Examples]
 592
 593 See [@../../example/stringfunctorsample.cpp stringfunctorsample.cpp] for an example of how to sort structs using a string key and all functors:
 594
 595 [lessthan_functor]
 596
 597 [bracket_functor]
 598
 599 [getsize_functor]
 600
 601 and these functors are used thus:
 602
 603 [stringsort_functors_call]
 604
 605
 606 See [@../../example/generalizedstruct.cpp generalizedstruct.cpp] for a working example of a generalized approach to sort structs by a sequence of integer, float, and multiple string keys using string_sort:
 607
 608 [generalized_functors]
 609
 610 [generalized_functors_call]
 611
 612 Other examples:
 613
 614 [@../../example/stringsample.cpp String sort.]
 615
 616 [@../../example/reversestringsample.cpp Reverse string sort.]
 617
 618 [@../../example/wstringsample.cpp Wide character string sort.]
 619
 620 [@../../example/caseinsensitive.cpp Case insensitive string sort.]
 621
 622 [@../../example/charstringsample.cpp Sort structs using a string key and indexing functors.]
 623
 624 [@../../example/reversestringfunctorsample.cpp Sort structs using a string keynd all functors  in reverse order.]
 625
 626 [endsect] [/section:stringsort_examples String Sort Examples]
 627
 628 [endsect] [/section:string_sort String Sort]
 629
 630 [section:rationale Rationale]
 631
 632 [section:radix_sorting Radix Sorting]
 633 Radix-based sorting allows the data to be divided up into more than 2 pieces per iteration,
 634 and for cache-friendly versions, it normally cuts the data up into around a thousand pieces per iteration.
 635 This allows many fewer iterations to be used to complete sorting the data,
 636 enabling performance superior to the ['[bigo](N*log(N))] comparison-based sorting limit.
 637 [endsect] [/section:radix_sorting Radix Sorting]
 638
 639 [section:hybrid_radix Hybrid Radix]
 640
 641 There a two primary types of radix-based sorting:
 642
 643 Most-significant-digit Radix sorting (MSD) divides the data recursively
 644 based upon the top-most unsorted bits.
 645 This approach is efficient for even distributions that divide nicely,
 646 and can be done in-place (limited additional memory used).
 647 There is substantial constant overhead for each iteration to deal
 648 with the splitting structure.
 649 The algorithms provided here use MSD Radix Sort for their radix-sorting portion.
 650 The main disadvantage of MSD Radix sorting is that when the data is cut up into small
 651 pieces, the overhead for additional recursive calls starts to dominate runtime,
 652 and this makes worst-case behavior substantially worse than ['[bigo](N*log(N))].
 653
 654 By contrast, __integer_sort, __float_sort, and __string_sort all check to see
 655 whether Radix-based or Comparison-based sorting have better worst-case runtime,
 656 and make the appropriate recursive call.
 657 Because Comparison-based sorting algorithms are efficient on small pieces,
 658 the tendency of MSD __radix_sort to cut the problem up small is turned into
 659 an advantage by these hybrid sorts. It is hard to conceive of a common usage case
 660 where pure MSD __radix_sort would have any significant advantage
 661 over hybrid algorithms.
 662
 663 Least-significant-digit __radix_sort (LSD) sorts based upon
 664 the least-significant bits first. This requires a complete copy of the data being sorted,
 665 using substantial additional memory. The main advantage of LSD Radix Sort
 666 is that aside from some constant overhead and the memory allocation,
 667 it uses a fixed amount of time per element to sort, regardless of distribution or
 668 size of the list. This amount of time is proportional to the length of the radix.
 669 The other advantage of LSD Radix Sort is that it is a stable sorting algorithm,
 670 so elements with the same key will retain their original order.
 671
 672 One disadvantage is that LSD Radix Sort uses the same amount of time
 673 to sort "easy" sorting problems as "hard" sorting problems,
 674 and this time spent may end up being greater than an efficient ['[bigo](N*log(N))]
 675 algorithm such as __introsort spends sorting "hard" problems on large data sets,
 676 depending on the length of the datatype, and relative speed of comparisons,
 677 memory allocation, and random accesses.
 678
 679 The other main disadvantage of LSD Radix Sort is its memory overhead.
 680 It's only faster for large data sets, but large data sets use significant memory,
 681 which LSD Radix Sort needs to duplicate. LSD Radix Sort doesn't make sense for items
 682 of variable length, such as strings; it could be implemented by starting at the end
 683 of the longest element, but would be extremely inefficient.
 684
 685 All that said, there are places where LSD Radix Sort is the appropriate and
 686 fastest solution, so it would be appropriate to create a templated LSD Radix Sort
 687 similar to __integer_sort and __float_sort. This would be most appropriate in cases where
 688 comparisons are extremely slow.
 689
 690 [endsect] [/section:hybrid_radix Hybrid Radix]
 691
 692 [section:why_spreadsort  Why spreadsort?]
 693
 694 The __spreadsort algorithm used in this library is designed to provide best possible
 695 worst-case performance, while still being cache-friendly.
 696 It provides the better of ['[bigo](N*log(K/S + S))] and ['[bigo](N*log(N))] worst-case time,
 697 where ['K] is the log of the range. The log of the range is normally the length in bits
 698 of the data type; 32 for a 32-bit integer.
 699
 700 `flash_sort` (another hybrid algorithm), by comparison is ['[bigo](N)]
 701 for evenly distributed lists. The problem is, `flash_sort` is merely an MSD __radix_sort
 702 combined with ['[bigo](N*N)] insertion sort to deal with small subsets where
 703 the MSD Radix Sort is inefficient, so it is inefficient with chunks of data
 704 around the size at which it switches to `insertion_sort`, and ends up operating
 705 as an enhanced MSD Radix Sort.
 706 For uneven distributions this makes it especially inefficient.
 707
 708 __integer_sort and __float_sort use __introsort instead, which provides ['[bigo](N*log(N))]
 709 performance for these medium-sized pieces. Also, `flash_sort`'s ['[bigo](N)]
 710 performance for even distributions comes at the cost of cache misses,
 711 which on modern architectures are extremely expensive, and in testing
 712 on modern systems ends up being slower than cutting up the data in multiple,
 713 cache-friendly steps. Also worth noting is that on most modern computers,
 714 `log2(available RAM)/log2(L1 cache size)` is around 3,
 715 where a cache miss takes more than 3 times as long as an in-cache random-access,
 716 and the size of ['max_splits] is tuned to the size of the cache.
 717 On a computer where cache misses aren't this expensive, ['max_splits]
 718 could be increased to a large value, or eliminated entirely,
 719 and `integer_sort/float_sort` would have the same ['[bigo](N)] performance
 720 on even distributions.
 721
 722 Adaptive Left Radix (ALR) is similar to `flash_sort`, but more cache-friendly.
 723 It still uses insertion_sort. Because ALR uses ['[bigo](N*N)] `insertion_sort`,
 724 it isn't efficient to use the comparison-based fallback sort on large lists,
 725 and if the data is clustered in small chunks just over the fallback size
 726 with a few outliers, radix-based sorting iterates many times doing little sorting
 727 with high overhead. Asymptotically, ALR is still ['[bigo](N*log(K/S + S))],
 728 but with a very small ['S] (about 2 in the worst case),
 729 which compares unfavorably with the 11 default value of ['max_splits] for
 730 Spreadsort.
 731
 732 ALR also does not have the ['[bigo](N*log(N))] fallback, so for small lists
 733 that are not evenly distributed it is extremely inefficient.
 734 See the `alrbreaker` and `binaryalrbreaker` testcases for examples;
 735 either replace the call to sort with a call to ALR and update the ALR_THRESHOLD
 736 at the top, or as a quick comparison make `get_max_count return ALR_THRESHOLD`
 737 (20 by default based upon the paper).
 738 These small tests take 4-10 times as long with ALR as __std_sort
 739 in the author's testing, depending on the test system,
 740 because they are trying to sort a highly uneven distribution.
 741 Normal Spreadsort does much better with them, because `get_max_count`
 742 is designed around minimizing worst-case runtime.
 743
 744 `burst_sort` is an efficient hybrid algorithm for strings that
 745 uses substantial additional memory.
 746
 747 __string_sort uses minimal additional memory by comparison.
 748 Speed comparisons between the two haven't been made,
 749 but the better memory efficiency makes __string_sort more general.
 750
 751 `postal_sort` and __string_sort are similar. A direct performance comparison
 752 would be welcome, but an efficient version of `postal_sort` was not found
 753 in a search for source.
 754
 755 __string_sort is most similar to the __american_flag algorithm.
 756 The main difference is that it doesn't bother trying to optimize how empty
 757 buckets/piles are handled, instead just checking to see if all characters
 758 at the current index are equal. Other differences are using __std_sort
 759 as the fallback algorithm, and a larger fallback size (256 vs. 16),
 760 which makes empty pile handling less important.
 761
 762 Another difference is not applying the stack-size restriction.
 763 Because of the equality check in __string_sort, it would take ['m*m] memory
 764 worth of strings to force __string_sort to create a stack of depth ['m].
 765 This problem isn't a realistic one on modern systems with multi-megabyte stacksize
 766 limits, where main memory would be exhausted holding the long strings necessary
 767 to exceed the stacksize limit. __string_sort can be thought of as modernizing
 768 __american_flag to take advantage of __introsort as a fallback algorithm.
 769 In the author's testing, __american_flag (on `std::strings`) had comparable runtime
 770 to __introsort, but making a hybrid of the two allows reduced overhead and
 771 substantially superior performance.
 772
 773 [endsect] [/section:why_spreadsort]
 774
 775 [section:unstable_sort Unstable Sorting]
 776
 777 Making a __radix_sort stable requires the usage of an external copy of the data.
 778 A stable hybrid algorithm also requires a stable comparison-based algorithm,
 779 and these are generally slow. LSD __radix_sort uses an external copy of the data,
 780 and provides stability, along with likely being faster (than a stable hybrid sort),
 781 so that's probably a better way to go for integer and floating-point types.
 782 It might make sense to make a stable version of __string_sort using external memory,
 783 but for simplicity this has been left out for now.
 784
 785 [endsect] [/section:unstable_sort Unstable Sorting]
 786
 787 [section:optimization Unused X86 optimization]
 788
 789 Though the ideal ['max_splits] for `n < 1 million` (or so) on x86
 790 ['seems] to be substantially larger, enabling a roughly 15% speedup for such tests,
 791 this optimization isn't general, and doesn't apply for `n > 1 million`.
 792 A too large ['max_splits] can cause sort to take more than twice as long,
 793 so it should be set on the low end of the reasonable range, where it is right now.
 794
 795 [endsect] [/section:optimization Unused X86 optimization]
 796
 797 [section:lookup Lookup Table?]
 798
 799 The ideal way to optimize the constants would be to have a carefully-tuned
 800 lookup-table instead of the `get_max_count` function, but 4 tuning variables
 801 is simpler, `get_max_count` enforces worst-case performance minimization rules,
 802 and such a lookup table would be difficult to optimize
 803 for cross-platform performance.
 804
 805 Alternatively, `get_max_count` could be used to generate a static lookup table.
 806 This hasn't been done due to concerns about cross-platform compatibility
 807 and flexibility.
 808
 809 [endsect] [/section:lookup]
 810
 811 [endsect] [/section:rationale Rationale]
 812
 813 [endsect] [/section:sort_hpp Spreadsort]
 814
 815 [section:definitions Definitions]
 816
 817 [h4 stable sort]
 818
 819 A sorting approach that preserves pre-existing order.
 820 If there are two elements with identical keys in a list that is later stably sorted,
 821 whichever came first in the initial list will come first in a stably sorted list.
 822 The algorithms provided here provide no such guarantee; items with identical keys
 823 will have arbitrary resulting order relative to each other.
 824
 825 [endsect] [/section:definitions Definitions]
 826
 827 [section:faq Frequently Asked Questions]
 828
 829 There are no FAQs yet.
 830
 831 [endsect] [/section:faq Frequently asked Questions]
 832
 833 [section:acks Acknowledgements]
 834
 835 * The author would like to thank his wife Mary for her patience and support
 836 during the long process of converting this from a piece of C code
 837 to a template library.
 838
 839 * The author would also like to thank Phil Endecott and Frank Gennari
 840 for the improvements they've suggested and for testing.
 841 Without them this would have taken longer to develop or performed worse.
 842
 843 * `float_mem_cast` was fixed to be safe and fast thanks to Scott McMurray.
 844 That fix was critical for a high-performance cross-platform __float_sort.
 845
 846 * Thanks also for multiple helpful suggestions provided by Steven Watanabe,
 847 Edouard Alligand, and others.
 848
 849 * Initial documentation was refactored to use Quickbook by Paul A. Bristow.
 850
 851 [endsect] [/section:acknowledgements Acknowledgements]
 852
 853 [section:bibliog Bibliography]
 854
 855 [h4 Standard Template Library Sort Algorithms]
 856
 857 [@http://www.cplusplus.com/reference/algorithm/sort/ C++ STL sort algorithms].
 858
 859 [h4 Radix Sort]
 860
 861 A type of algorithm that sorts based upon distribution instead of by comparison.
 862 Wikipedia has an article about Radix Sorting.
 863 A more detailed description of various Radix Sorting algorithms is provided here:
 864
 865 Donald Knuth. The Art of Computer Programming,
 866 Volume 3: Sorting and Searching, Second Edition. Addison-Wesley, 1998.
 867 ISBN 0-201-89685-0. Section 5.2.5: Sorting by Distribution, pp.168-179.
 868
 869 [h4 Introsort]
 870
 871 A high-speed comparison-based sorting algorithm that takes ['[bigo](N * log(N))] time.
 872 See __introsort and
 873 Musser, David R. (1997). "Introspective Sorting and Selection Algorithms",
 874 Software: Practice and Experience (Wiley) 27 (8), pp 983-993,
 875 available at [@http://www.cs.rpi.edu/~musser/gp/introsort.ps  Musser Introsort].
 876
 877 [h4 American Flag Sort]
 878
 879 A high-speed hybrid string sorting algorithm that __string_sort is partially based
 880 upon.  See __american_flag and Peter M. McIlroy, Keith Bostic, M. Douglas McIlroy. Engineering Radix Sort,  Computing Systems 1993.
 881
 882 [h4 Adaptive Left Radix (ARL)]
 883
 884 ARL (Adaptive Left Radix) is a hybrid cache-friendly integer sorting algorithm
 885 with comparable speed on random data to __integer_sort,
 886 but does not have the optimizations for worst-case performance,
 887 causing it to perform poorly on certain types of unevenly distributed data.
 888
 889 Arne Maus, [@http://www.nik.no/2002/Maus.pdf ARL, a faster in-place, cache friendly sorting algorithm],
 890 presented at NIK2002, Norwegian Informatics Conference, Kongsberg, 2002. Tapir, ISBN 82-91116-45-8.
 891
 892 [h4 Original Spreadsort]
 893
 894 The algorithm that __integer_sort was originally based on.
 895 __integer_sort uses a smaller number of key bits at a time for better cache efficiency
 896 than the method described in the paper.
 897 The importance of cache efficiency grew as CPU clock speeds increased
 898 while main memory latency stagnated.
 899 See Steven J. Ross,
 900 The Spreadsort High-performance General-case Sorting Algorithm,
 901 Parallel and Distributed Processing Techniques and Applications, Volume 3, pp.1100-1106. Las Vegas Nevada. 2002. See
 902 [@../../doc/papers/original_spreadsort06_2002.pdf Steven Ross spreadsort_2002].
 903
 904 [endsect] [/section:bibliography Bibliography]
 905
 906 [section:history History]
 907
 908 * First release following review in Boost 1.58.
 909
 910 * [@http://permalink.gmane.org/gmane.comp.lib.boost.devel/255194 Review of Boost.Sort/Spreadsort library]
 911
 912 [endsect] [/section:history]
 913
 914 [xinclude autodoc.xml] [/ Using Doxygen reference documentation.]
 915
 916 [/Include the indexes (class, function and everything) ]
 917 '''
 918     <index type="function_name">
 919      <title>Function Index</title>
 920     </index>
 921
 922    <index/>
 923
 924 '''
 925
 926 [/
 927   Copyright (c) 2014 Steven Ross
 928   Distributed under the Boost Software License,
 929   Version 1.0. (See accompanying file LICENSE_1_0.txt
 930   or copy at http://boost.org/LICENSE_1_0.txt)
 931 ]
 932