[/
Copyright 2015 Paul A. Bristow.
Copyright 2015 John Maddock.
Distributed under the Boost Software License, Version 1.0.
(See accompanying file LICENSE_1_0.txt or copy at
http://www.boost.org/LICENSE_1_0.txt).
]

[section:root_comparison Comparison of Root Finding Algorithms]

[section:cbrt_comparison Comparison of Cube Root Finding Algorithms]

In the table below, the cube root of 28 was computed for three __fundamental_types floating-point types,
and one __multiprecision type __cpp_bin_float using 50 decimal digit precision, using four algorithms.

The 'exact' answer was computed using a 100 decimal digit type:

   cpp_bin_float_100 full_answer ("3.036588971875662519420809578505669635581453977248111123242141654169177268411884961770250390838097895");

Times were measured with __boost_timer using `class cpu_timer`.

* ['Its] is the number of iterations taken to find the root.
* ['Times] is the CPU time taken, in arbitrary units.
* ['Norm] is the time normalized to that of the quickest algorithm (which thus has value 1.00).
* ['Dis] is the distance, in bits, from the nearest representation of the 'exact' root.
Distance from the 'exact' answer is measured with the function __float_distance.
One or two bits distance means that all results are effectively 'correct'.
Zero means 'exact' - the nearest __representable value for the floating-point type.

The cube-root function is simple, making it a somewhat contrived example for root-finding,
but it does allow us to investigate some of the factors controlling efficiency that can be extrapolated to
more complex functions.

The program used was [@../../example/root_finding_algorithms.cpp root_finding_algorithms.cpp].
100000 evaluations of each floating-point type and algorithm were used, and the CPU times were
judged from repeat runs to have an uncertainty of 10%. Comparing MSVC for `double` and `long double`
(which are identical on this platform) may give a guide to the uncertainty of the timing.

The requested precision was set as follows:

[table
[[Function][Precision Requested]]
[[TOMS748][numeric_limits<T>::digits - 2]]
[[Newton][floor(numeric_limits<T>::digits * 0.6)]]
[[Halley][floor(numeric_limits<T>::digits * 0.4)]]
[[Schr'''&#xf6;'''der][floor(numeric_limits<T>::digits * 0.4)]]
]

* The C++ Standard cube root function [@http://en.cppreference.com/w/cpp/numeric/math/cbrt std::cbrt]
is only defined for built-in or fundamental types,
so cannot be used with any user-defined floating-point type like __multiprecision.
This, and the fact that the cube function is so well-behaved,
allows the implementer to use many tricks to achieve a fast computation.
On some platforms, `std::cbrt` appeared several times as quick as the more general `boost::math::cbrt`;
on other platforms / compiler options `boost::math::cbrt` is noticeably faster. In general, the results are highly
dependent on the code-generation / processor-architecture compiler options used. One can
assume that the standard library will have been compiled with options ['nearly] optimal for the platform
it was installed on, whereas the user has more choice over the options used for Boost.Math. Pick something
too general or conservative and performance suffers, while selecting options that use the latest
instruction-set opcodes speeds things up noticeably.

* Two compilers in optimise mode were compared: GCC 4.9.1 using the NetBeans IDE
and Microsoft Visual Studio 2013 (Update 1) on the same hardware.
The number of iterations seemed consistent, but the relative run-times were surprisingly different.

* `boost::math::cbrt` allows use with ['any user-defined floating-point type], conveniently
__multiprecision. It too can take some advantage of the good behaviour of the cube function,
compared to the more general implementation in the nth root-finding examples. For example,
it uses a polynomial approximation to generate a better guess than dividing the exponent by three,
and can avoid the complex checks in __newton required to prevent the
search going wildly off-track. For a known precision, it may also be possible to
fix the number of iterations, allowing inlining and loop unrolling. It also
algebraically simplifies the Halley steps, leading to a big reduction in the
number of floating-point operations required compared to a "black box" implementation
that calculates the derivatives separately and then combines them in the Halley code.
Typically, it was found that computation using type `double`
took a few times longer when using the various root-finding algorithms directly rather
than the hand-coded/optimized `cbrt` routine.

* The importance of getting a good guess can be seen by the iteration count for the multiprecision case:
here we "cheat" a little and use the cube root calculated to double precision as the initial guess.
The limitation of this tactic is that the range of possible (exponent) values may be less than that of the multiprecision type.

* For __fundamental_types, there was little to choose between the three derivative methods,
but for __cpp_bin_float, __newton was twice as fast. Note that the cube root is an extreme
test case: the cost of calling the functor is so cheap that the runtimes are largely
dominated by the complexity of the iteration code.

* Compiling with optimisation halved computation times, and any differences between algorithms
became nearly negligible. The optimisation speed-up of __TOMS748 was especially noticeable.

* Using a multiprecision type like `cpp_bin_float_50` for a precision of 50 decimal digits
took a lot longer, as expected, because most of the computation
uses software rather than 64-bit floating-point hardware.
Speeds are often more than 50 times slower.

* Using `cpp_bin_float_50`, __TOMS748 was much slower, showing the benefit of using derivatives.
__newton was found to be twice as quick as either of the second-derivative methods:
this is an extreme case though, as the function and its derivatives are so cheap to compute that we're
really measuring the complexity of the boilerplate root-finding code.

* For multiprecision types, only one or two extra ['iterations] are needed to get the remaining 35 digits, whatever the algorithm used.
(The time taken was of course much greater for these types.)

* Using a 100 decimal-digit type only doubled the time and required only a few more iterations,
so the cost of extra precision is mainly the underlying cost of computing more digits,
not in the way the algorithm works. This confirms previous observations using __NTL high-precision types.

[include root_comparison_tables_msvc.qbk]
[include root_comparison_tables_gcc.qbk]

[endsect] [/section:cbrt_comparison Comparison of Cube Root Finding Algorithms]

[section:root_n_comparison Comparison of Nth-root Finding Algorithms]

A second example compares four generalized nth-root finding algorithms for various n-th roots (5, 7 and 13)
of a single value 28.0, for four floating-point types: `float`, `double`,
`long double` and the __multiprecision type `cpp_bin_float_50`.
In each case the target accuracy was set using our "recommended" accuracy limits
(or at least limits that make a good starting point - likely to give
close to full accuracy without resorting to unnecessary iterations).

[table
[[Function][Precision Requested]]
[[TOMS748][numeric_limits<T>::digits - 2]]
[[Newton][floor(numeric_limits<T>::digits * 0.6)]]
[[Halley][floor(numeric_limits<T>::digits * 0.4)]]
[[Schr'''&#xf6;'''der][floor(numeric_limits<T>::digits * 0.4)]]
]
Tests used Microsoft Visual Studio 2013 (Update 1) and GCC 4.9.1 using source code
[@../../example/root_n_finding_algorithms.cpp root_n_finding_algorithms.cpp].

The timing uncertainty (especially using MSVC) is at least 5% of the normalized time 'Norm'.

The 'best' and 'worst' algorithms are highlighted in blue and red.
More than one result can be 'best' when normalized times are indistinguishable
within the uncertainty.

[/include roots_table_100_msvc.qbk]
[/include roots_table_75_msvc.qbk]

[/include roots_table_75_msvc_X86.qbk]
[/include roots_table_100_msvc_X86.qbk]

[/include roots_table_100_msvc_AVX.qbk]
[/include roots_table_75_msvc_AVX.qbk]

[/include roots_table_75_msvc_X86_SSE2.qbk]
[/include roots_table_100_msvc_X86_SSE2.qbk]

[/include roots_table_100_gcc_X64_SSE2.qbk]
[/include roots_table_75_gcc_X64_SSE2.qbk]

[/include type_info_table_100_msvc.qbk]
[/include type_info_table_75_msvc.qbk]

[include roots_table_100_msvc_X86_SSE2.qbk]
[include roots_table_100_msvc_X64_AVX.qbk]
[include roots_table_100_gcc_X64_SSE2.qbk]

Some tentative conclusions can be drawn from this limited exercise.

* Perhaps surprisingly, there is little difference between the various algorithms for __fundamental_types floating-point types.
Using the first derivative (__newton) is usually the best; while its improvement over the derivative-free
__TOMS748 is considerable in the number of iterations, the gain in execution time is small. This reflects the fact that the function
we are finding the root for is trivial to evaluate, so runtimes are dominated by the time taken by the boilerplate code
in each method.

* The extra cost of evaluating the second derivatives (__halley or __schroder) is usually too much for any net benefit:
as with the cube root, these functors are so cheap to evaluate that the runtime is largely dominated by the
complexity of the root-finding method.

* For a __multiprecision floating-point type, __newton is the clear winner, with a several-fold gain over __TOMS748,
and again no improvement from the second-derivative algorithms.

* The run-time of 50 decimal-digit __multiprecision is about 30-fold greater than that of `double`.

* The column 'dis' shows the number of bits distance from the correct result.
The Newton-Raphson algorithm shows a bit or two better accuracy than __TOMS748.

* The goodness of the 'guess' is especially crucial for __multiprecision.
Separate experiments show that evaluating the 'guess' using `double` allows
convergence to the final exact result in one or two iterations.
So in this contrived example, rather than crudely dividing the exponent by N for a 'guess',
it would be far better to use a `pow<double>` (or, if more precision is required,
a `pow<long double>`) function to estimate a 'guess'.
The limitation of this tactic is that the range of possible (exponent) values may be less than that of the multiprecision type.

* Using the floating-point extension __SSE2 gave a modest ten-percent speed-up.

* Using MSVC, there was some improvement using 64-bit, markedly so for __multiprecision.

* The GCC compiler 4.9.1 using 64-bit was at least five-fold faster than 32-bit,
apparently reflecting better optimization.

Clearly, your mileage [*will vary], but in summary, __newton seems the first choice of algorithm,
and effort to find a good 'guess' the first speed-up target, especially for __multiprecision.
And of course, compiler optimisation is crucial for speed.

[endsect] [/section:root_n_comparison Comparison of Nth-root Finding Algorithms]

[section:elliptic_comparison Comparison of Elliptic Integral Root Finding Algorithms]

A third example compares four root finding algorithms for locating
the second radius of an ellipse with first radius 28 and arc length 300,
for four floating-point types: `float`, `double`,
`long double` and the __multiprecision type `cpp_bin_float_50`.

Which is to say we're solving:

[pre 4xE(sqrt(1 - 28[super 2] / x[super 2])) - 300 = 0]

In each case the target accuracy was set using our "recommended" accuracy limits
(or at least limits that make a good starting point - likely to give
close to full accuracy without resorting to unnecessary iterations).

[table
[[Function][Precision Requested]]
[[TOMS748][numeric_limits<T>::digits - 2]]
[[Newton][floor(numeric_limits<T>::digits * 0.6)]]
[[Halley][floor(numeric_limits<T>::digits * 0.4)]]
[[Schr'''&#xf6;'''der][floor(numeric_limits<T>::digits * 0.4)]]
]
Tests used Microsoft Visual Studio 2013 (Update 1) and GCC 4.9.1 using source code
[@../../example/root_elliptic_finding.cpp root_elliptic_finding.cpp].

The timing uncertainty (especially using MSVC) is at least 5% of the normalized time 'Norm'.

The 'best' and 'worst' algorithms are highlighted in blue and red.
More than one result can be 'best' when normalized times are indistinguishable
within the uncertainty.

[include elliptic_table_100_msvc_X86_SSE2.qbk]
[include elliptic_table_100_msvc_X64_AVX.qbk]
[include elliptic_table_100_gcc_X64_SSE2.qbk]

Remarks:

* The function being solved is now moderately expensive to call, and twice as expensive to call
when obtaining the derivative as when not. Consequently there is very little improvement in moving
from a derivative-free method to Newton iteration. However, once you have calculated the first derivative
the second comes almost for free; consequently the third-order method (Halley) does much the best.
* Of the two second-derivative methods, Halley does best, as would be expected: the Schroder method offers better
guarantees of ['quadratic] convergence, while Halley relies on a smooth function with a single root to
give ['cubic] convergence. It's not entirely clear why Schroder iteration often does worse than Newton.

[endsect][/section:elliptic_comparison Comparison of Elliptic Integral Root Finding Algorithms]

[endsect] [/section:root_comparison Comparison of Root Finding Algorithms]