]>
Commit | Line | Data |
---|---|---|
7c673cae FG |
1 | [/ def names all end in distrib to avoid clashes with names of functions] |
2 | ||
3 | [def __binomial_distrib [link math_toolkit.dist_ref.dists.binomial_dist Binomial Distribution]] | |
4 | [def __chi_squared_distrib [link math_toolkit.dist_ref.dists.chi_squared_dist Chi Squared Distribution]] | |
5 | [def __normal_distrib [link math_toolkit.dist_ref.dists.normal_dist Normal Distribution]] | |
6 | [def __F_distrib [link math_toolkit.dist_ref.dists.f_dist Fisher F Distribution]] | |
7 | [def __students_t_distrib [link math_toolkit.dist_ref.dists.students_t_dist Students t Distribution]] | |
8 | ||
9 | [def __handbook [@http://www.itl.nist.gov/div898/handbook/ | |
10 | NIST/SEMATECH e-Handbook of Statistical Methods.]] | |
11 | ||
12 | [section:stat_tut Statistical Distributions Tutorial] | |
13 | This library is centred around statistical distributions, this tutorial | |
14 | will give you an overview of what they are, how they can be used, and | |
15 | provides a few worked examples of applying the library to statistical tests. | |
16 | ||
17 | [section:overview Overview of Distributions] | |
18 | ||
19 | [section:headers Headers and Namespaces] | |
20 | ||
21 | All the code in this library is inside namespace boost::math. | |
22 | ||
23 | In order to use a distribution /my_distribution/ you will need to include | |
24 | either the header <boost/math/my_distribution.hpp> or | |
25 | the "include all the distributions" header: <boost/math/distributions.hpp>. | |
26 | ||
27 | For example, to use the Students-t distribution include either | |
28 | <boost/math/students_t.hpp> or | |
29 | <boost/math/distributions.hpp> | |
30 | ||
31 | You also need to bring distribution names into scope, | |
32 | perhaps with a `using namespace boost::math;` declaration, | |
33 | ||
34 | or specific `using` declarations like `using boost::math::normal;` (*recommended*). | |
35 | ||
36 | [caution Some math function names are also used in namespace std so including <random> could cause ambiguity!] | |
37 | ||
38 | [endsect] [/ section:headers Headers and Namespaces] | |
39 | ||
40 | [section:objects Distributions are Objects] | |
41 | ||
42 | Each kind of distribution in this library is a class type - an object. | |
43 | ||
44 | [link policy Policies] provide fine-grained control | |
45 | of the behaviour of these classes, allowing the user to customise | |
46 | behaviour such as how errors are handled, or how the quantiles | |
47 | of discrete distribtions behave. | |
48 | ||
49 | [tip If you are familiar with statistics libraries using functions, | |
50 | and 'Distributions as Objects' seem alien, see | |
51 | [link math_toolkit.stat_tut.weg.nag_library the comparison to | |
52 | other statistics libraries.] | |
53 | ] [/tip] | |
54 | ||
55 | Making distributions class types does two things: | |
56 | ||
57 | * It encapsulates the kind of distribution in the C++ type system; | |
58 | so, for example, Students-t distributions are always a different C++ type from | |
59 | Chi-Squared distributions. | |
60 | * The distribution objects store any parameters associated with the | |
61 | distribution: for example, the Students-t distribution has a | |
62 | ['degrees of freedom] parameter that controls the shape of the distribution. | |
63 | This ['degrees of freedom] parameter has to be provided | |
64 | to the Students-t object when it is constructed. | |
65 | ||
66 | Although the distribution classes in this library are templates, there | |
67 | are typedefs on type /double/ that mostly take the usual name of the | |
68 | distribution | |
69 | (except where there is a clash with a function of the same name: beta and gamma, | |
70 | in which case using the default template arguments - `RealType = double` - | |
71 | is nearly as convenient). | |
72 | Probably 95% of uses are covered by these typedefs: | |
73 | ||
74 | // using namespace boost::math; // Avoid potential ambiguity with names in std <random> | |
75 | // Safer to declare specific functions with using statement(s): | |
76 | ||
77 | using boost::math::beta_distribution; | |
78 | using boost::math::binomial_distribution; | |
79 | using boost::math::students_t; | |
80 | ||
81 | // Construct a students_t distribution with 4 degrees of freedom: | |
82 | students_t d1(4); | |
83 | ||
84 | // Construct a double-precision beta distribution | |
85 | // with parameters a = 10, b = 20 | |
86 | beta_distribution<> d2(10, 20); // Note: _distribution<> suffix ! | |
87 | ||
88 | If you need to use the distributions with a type other than `double`, | |
89 | then you can instantiate the template directly: the names of the | |
90 | templates are the same as the `double` typedef but with `_distribution` | |
91 | appended, for example: __students_t_distrib or __binomial_distrib: | |
92 | ||
93 | // Construct a students_t distribution, of float type, | |
94 | // with 4 degrees of freedom: | |
95 | students_t_distribution<float> d3(4); | |
96 | ||
97 | // Construct a binomial distribution, of long double type, | |
98 | // with probability of success 0.3 | |
99 | // and 20 trials in total: | |
100 | binomial_distribution<long double> d4(20, 0.3); | |
101 | ||
102 | The parameters passed to the distributions can be accessed via getter member | |
103 | functions: | |
104 | ||
105 | d1.degrees_of_freedom(); // returns 4.0 | |
106 | ||
107 | This is all well and good, but not very useful so far. What we often want | |
108 | is to be able to calculate the /cumulative distribution functions/ and | |
109 | /quantiles/ etc for these distributions. | |
110 | ||
111 | [endsect] [/section:objects Distributions are Objects] | |
112 | ||
113 | ||
114 | [section:generic Generic operations common to all distributions are non-member functions] | |
115 | ||
116 | Want to calculate the PDF (Probability Density Function) of a distribution? | |
117 | No problem, just use: | |
118 | ||
119 | pdf(my_dist, x); // Returns PDF (density) at point x of distribution my_dist. | |
120 | ||
121 | Or how about the CDF (Cumulative Distribution Function): | |
122 | ||
123 | cdf(my_dist, x); // Returns CDF (integral from -infinity to point x) | |
124 | // of distribution my_dist. | |
125 | ||
126 | And quantiles are just the same: | |
127 | ||
128 | quantile(my_dist, p); // Returns the value of the random variable x | |
129 | // such that cdf(my_dist, x) == p. | |
130 | ||
131 | If you're wondering why these aren't member functions, it's to | |
132 | make the library more easily extensible: if you want to add additional | |
133 | generic operations - let's say the /n'th moment/ - then all you have to | |
134 | do is add the appropriate non-member functions, overloaded for each | |
135 | implemented distribution type. | |
136 | ||
137 | [tip | |
138 | ||
139 | [*Random numbers that approximate Quantiles of Distributions] | |
140 | ||
141 | If you want random numbers that are distributed in a specific way, | |
142 | for example in a uniform, normal or triangular, | |
143 | see [@http://www.boost.org/libs/random/ Boost.Random]. | |
144 | ||
145 | Whilst in principal there's nothing to prevent you from using the | |
146 | quantile function to convert a uniformly distributed random | |
147 | number to another distribution, in practice there are much more | |
148 | efficient algorithms available that are specific to random number generation. | |
149 | ] [/tip Random numbers that approximate Quantiles of Distributions] | |
150 | ||
151 | For example, the binomial distribution has two parameters: | |
152 | n (the number of trials) and p (the probability of success on any one trial). | |
153 | ||
154 | The `binomial_distribution` constructor therefore has two parameters: | |
155 | ||
156 | `binomial_distribution(RealType n, RealType p);` | |
157 | ||
158 | For this distribution the __random_variate is k: the number of successes observed. | |
159 | The probability density\/mass function (pdf) is therefore written as ['f(k; n, p)]. | |
160 | ||
161 | [note | |
162 | ||
163 | [*Random Variates and Distribution Parameters] | |
164 | ||
165 | The concept of a __random_variable is closely linked to the term __random_variate: | |
166 | a random variate is a particular value (outcome) of a random variable. | |
167 | and [@http://en.wikipedia.org/wiki/Parameter distribution parameters] | |
168 | are conventionally distinguished (for example in Wikipedia and Wolfram MathWorld) | |
169 | by placing a semi-colon or vertical bar) | |
170 | /after/ the __random_variable (whose value you 'choose'), | |
171 | to separate the variate from the parameter(s) that defines the shape of the distribution.[br] | |
172 | For example, the binomial distribution probability distribution function (PDF) is written as | |
173 | ['f(k| n, p)] = Pr(K = k|n, p) = probability of observing k successes out of n trials. | |
174 | K is the __random_variable, k is the __random_variate, | |
175 | the parameters are n (trials) and p (probability). | |
176 | ] [/tip Random Variates and Distribution Parameters] | |
177 | ||
178 | [note By convention, __random_variate are lower case, usually k is integral, x if real, and | |
179 | __random_variable are upper case, K if integral, X if real. But this implementation treats | |
180 | all as floating point values `RealType`, so if you really want an integral result, | |
181 | you must round: see note on Discrete Probability Distributions below for details.] | |
182 | ||
183 | As noted above the non-member function `pdf` has one parameter for the distribution object, | |
184 | and a second for the random variate. So taking our binomial distribution | |
185 | example, we would write: | |
186 | ||
187 | `pdf(binomial_distribution<RealType>(n, p), k);` | |
188 | ||
189 | The ranges of __random_variate values that are permitted and are supported can be | |
190 | tested by using two functions `range` and `support`. | |
191 | ||
192 | The distribution (effectively the __random_variate) is said to be 'supported' | |
193 | over a range that is | |
194 | [@http://en.wikipedia.org/wiki/Probability_distribution | |
195 | "the smallest closed set whose complement has probability zero"]. | |
196 | MathWorld uses the word 'defined' for this range. | |
197 | Non-mathematicians might say it means the 'interesting' smallest range | |
198 | of random variate x that has the cdf going from zero to unity. | |
199 | Outside are uninteresting zones where the pdf is zero, and the cdf zero or unity. | |
200 | ||
201 | For most distributions, with probability distribution functions one might describe | |
202 | as 'well-behaved', we have decided that it is most useful for the supported range | |
203 | to *exclude* random variate values like exact zero *if the end point is discontinuous*. | |
204 | For example, the Weibull (scale 1, shape 1) distribution smoothly heads for unity | |
205 | as the random variate x declines towards zero. | |
206 | But at x = zero, the value of the pdf is suddenly exactly zero, by definition. | |
207 | If you are plotting the PDF, or otherwise calculating, | |
208 | zero is not the most useful value for the lower limit of supported, as we discovered. | |
209 | So for this, and similar distributions, | |
210 | we have decided it is most numerically useful to use | |
211 | the closest value to zero, min_value, for the limit of the supported range. | |
212 | (The `range` remains from zero, so you will still get `pdf(weibull, 0) == 0`). | |
213 | (Exponential and gamma distributions have similarly discontinuous functions). | |
214 | ||
215 | Mathematically, the functions may make sense with an (+ or -) infinite value, | |
216 | but except for a few special cases (in the Normal and Cauchy distributions) | |
217 | this implementation limits random variates to finite values from the `max` | |
218 | to `min` for the `RealType`. | |
219 | (See [link math_toolkit.sf_implementation.handling_of_floating_point_infin | |
220 | Handling of Floating-Point Infinity] for rationale). | |
221 | ||
222 | ||
223 | [note | |
224 | ||
225 | [*Discrete Probability Distributions] | |
226 | ||
227 | Note that the [@http://en.wikipedia.org/wiki/Discrete_probability_distribution | |
228 | discrete distributions], including the binomial, negative binomial, Poisson & Bernoulli, | |
229 | are all mathematically defined as discrete functions: | |
230 | that is to say the functions `cdf` and `pdf` are only defined for integral values | |
231 | of the random variate. | |
232 | ||
233 | However, because the method of calculation often uses continuous functions | |
234 | it is convenient to treat them as if they were continuous functions, | |
235 | and permit non-integral values of their parameters. | |
236 | ||
237 | Users wanting to enforce a strict mathematical model may use `floor` | |
238 | or `ceil` functions on the random variate prior to calling the distribution | |
239 | function. | |
240 | ||
241 | The quantile functions for these distributions are hard to specify | |
242 | in a manner that will satisfy everyone all of the time. The default | |
243 | behaviour is to return an integer result, that has been rounded | |
244 | /outwards/: that is to say, lower quantiles - where the probablity | |
245 | is less than 0.5 are rounded down, while upper quantiles - where | |
246 | the probability is greater than 0.5 - are rounded up. This behaviour | |
247 | ensures that if an X% quantile is requested, then /at least/ the requested | |
248 | coverage will be present in the central region, and /no more than/ | |
249 | the requested coverage will be present in the tails. | |
250 | ||
251 | This behaviour can be changed so that the quantile functions are rounded | |
252 | differently, or return a real-valued result using | |
253 | [link math_toolkit.pol_overview Policies]. It is strongly | |
254 | recommended that you read the tutorial | |
255 | [link math_toolkit.pol_tutorial.understand_dis_quant | |
256 | Understanding Quantiles of Discrete Distributions] before | |
257 | using the quantile function on a discrete distribtion. The | |
258 | [link math_toolkit.pol_ref.discrete_quant_ref reference docs] | |
259 | describe how to change the rounding policy | |
260 | for these distributions. | |
261 | ||
262 | For similar reasons continuous distributions with parameters like | |
263 | "degrees of freedom" | |
264 | that might appear to be integral, are treated as real values | |
265 | (and are promoted from integer to floating-point if necessary). | |
266 | In this case however, there are a small number of situations where non-integral | |
267 | degrees of freedom do have a genuine meaning. | |
268 | ] | |
269 | ||
270 | [endsect] [/ section:generic Generic operations common to all distributions are non-member functions] | |
271 | ||
272 | [section:complements Complements are supported too - and when to use them] | |
273 | ||
274 | Often you don't want the value of the CDF, but its complement, which is | |
275 | to say `1-p` rather than `p`. It is tempting to calculate the CDF and subtract | |
276 | it from `1`, but if `p` is very close to `1` then cancellation error | |
277 | will cause you to lose accuracy, perhaps totally. | |
278 | ||
279 | [link why_complements See below ['"Why and when to use complements?"]] | |
280 | ||
281 | In this library, whenever you want to receive a complement, just wrap | |
282 | all the function arguments in a call to `complement(...)`, for example: | |
283 | ||
284 | students_t dist(5); | |
285 | cout << "CDF at t = 1 is " << cdf(dist, 1.0) << endl; | |
286 | cout << "Complement of CDF at t = 1 is " << cdf(complement(dist, 1.0)) << endl; | |
287 | ||
288 | But wait, now that we have a complement, we have to be able to use it as well. | |
289 | Any function that accepts a probability as an argument can also accept a complement | |
290 | by wrapping all of its arguments in a call to `complement(...)`, for example: | |
291 | ||
292 | students_t dist(5); | |
293 | ||
294 | for(double i = 10; i < 1e10; i *= 10) | |
295 | { | |
296 | // Calculate the quantile for a 1 in i chance: | |
297 | double t = quantile(complement(dist, 1/i)); | |
298 | // Print it out: | |
299 | cout << "Quantile of students-t with 5 degrees of freedom\n" | |
300 | "for a 1 in " << i << " chance is " << t << endl; | |
301 | } | |
302 | ||
303 | [tip | |
304 | ||
305 | [*Critical values are just quantiles] | |
306 | ||
307 | Some texts talk about quantiles, or percentiles or fractiles, | |
308 | others about critical values, the basic rule is: | |
309 | ||
310 | ['Lower critical values] are the same as the quantile. | |
311 | ||
312 | ['Upper critical values] are the same as the quantile from the complement | |
313 | of the probability. | |
314 | ||
315 | For example, suppose we have a Bernoulli process, giving rise to a binomial | |
316 | distribution with success ratio 0.1 and 100 trials in total. The | |
317 | ['lower critical value] for a probability of 0.05 is given by: | |
318 | ||
319 | `quantile(binomial(100, 0.1), 0.05)` | |
320 | ||
321 | and the ['upper critical value] is given by: | |
322 | ||
323 | `quantile(complement(binomial(100, 0.1), 0.05))` | |
324 | ||
325 | which return 4.82 and 14.63 respectively. | |
326 | ] | |
327 | ||
328 | [#why_complements] | |
329 | [tip | |
330 | ||
331 | [*Why bother with complements anyway?] | |
332 | ||
333 | It's very tempting to dispense with complements, and simply subtract | |
334 | the probability from 1 when required. However, consider what happens when | |
335 | the probability is very close to 1: let's say the probability expressed at | |
336 | float precision is `0.999999940f`, then `1 - 0.999999940f = 5.96046448e-008`, | |
337 | but the result is actually accurate to just ['one single bit]: the only | |
338 | bit that didn't cancel out! | |
339 | ||
340 | Or to look at this another way: consider that we want the risk of falsely | |
341 | rejecting the null-hypothesis in the Student's t test to be 1 in 1 billion, | |
342 | for a sample size of 10,000. | |
343 | This gives a probability of 1 - 10[super -9], which is exactly 1 when | |
344 | calculated at float precision. In this case calculating the quantile from | |
345 | the complement neatly solves the problem, so for example: | |
346 | ||
347 | `quantile(complement(students_t(10000), 1e-9))` | |
348 | ||
349 | returns the expected t-statistic `6.00336`, where as: | |
350 | ||
351 | `quantile(students_t(10000), 1-1e-9f)` | |
352 | ||
353 | raises an overflow error, since it is the same as: | |
354 | ||
355 | `quantile(students_t(10000), 1)` | |
356 | ||
357 | Which has no finite result. | |
358 | ||
359 | With all distributions, even for more reasonable probability | |
360 | (unless the value of p can be represented exactly in the floating-point type) | |
361 | the loss of accuracy quickly becomes significant if you simply calculate probability from 1 - p | |
362 | (because it will be mostly garbage digits for p ~ 1). | |
363 | ||
364 | So always avoid, for example, using a probability near to unity like 0.99999 | |
365 | ||
366 | `quantile(my_distribution, 0.99999)` | |
367 | ||
368 | and instead use | |
369 | ||
370 | `quantile(complement(my_distribution, 0.00001))` | |
371 | ||
372 | since 1 - 0.99999 is not exactly equal to 0.00001 when using floating-point arithmetic. | |
373 | ||
374 | This assumes that the 0.00001 value is either a constant, | |
375 | or can be computed by some manner other than subtracting 0.99999 from 1. | |
376 | ||
377 | ] [/ tip *Why bother with complements anyway?] | |
378 | ||
379 | [endsect] [/ section:complements Complements are supported too - and why] | |
380 | ||
381 | [section:parameters Parameters can be calculated] | |
382 | ||
383 | Sometimes it's the parameters that define the distribution that you | |
384 | need to find. Suppose, for example, you have conducted a Students-t test | |
385 | for equal means and the result is borderline. Maybe your two samples | |
386 | differ from each other, or maybe they don't; based on the result | |
387 | of the test you can't be sure. A legitimate question to ask then is | |
388 | "How many more measurements would I have to take before I would get | |
389 | an X% probability that the difference is real?" Parameter finders | |
390 | can answer questions like this, and are necessarily different for | |
391 | each distribution. They are implemented as static member functions | |
392 | of the distributions, for example: | |
393 | ||
394 | students_t::find_degrees_of_freedom( | |
395 | 1.3, // difference from true mean to detect | |
396 | 0.05, // maximum risk of falsely rejecting the null-hypothesis. | |
397 | 0.1, // maximum risk of falsely failing to reject the null-hypothesis. | |
398 | 0.13); // sample standard deviation | |
399 | ||
400 | Returns the number of degrees of freedom required to obtain a 95% | |
401 | probability that the observed differences in means is not down to | |
402 | chance alone. In the case that a borderline Students-t test result | |
403 | was previously obtained, this can be used to estimate how large the sample size | |
404 | would have to become before the observed difference was considered | |
405 | significant. It assumes, of course, that the sample mean and standard | |
406 | deviation are invariant with sample size. | |
407 | ||
408 | [endsect] [/ section:parameters Parameters can be calculated] | |
409 | ||
410 | [section:summary Summary] | |
411 | ||
412 | * Distributions are objects, which are constructed from whatever | |
413 | parameters the distribution may have. | |
414 | * Member functions allow you to retrieve the parameters of a distribution. | |
415 | * Generic non-member functions provide access to the properties that | |
416 | are common to all the distributions (PDF, CDF, quantile etc). | |
417 | * Complements of probabilities are calculated by wrapping the function's | |
418 | arguments in a call to `complement(...)`. | |
419 | * Functions that accept a probability can accept a complement of the | |
420 | probability as well, by wrapping the function's | |
421 | arguments in a call to `complement(...)`. | |
422 | * Static member functions allow the parameters of a distribution | |
423 | to be found from other information. | |
424 | ||
425 | Now that you have the basics, the next section looks at some worked examples. | |
426 | ||
427 | [endsect] [/section:summary Summary] | |
428 | [endsect] [/section:overview Overview] | |
429 | ||
430 | [section:weg Worked Examples] | |
431 | [include distribution_construction.qbk] | |
432 | [include students_t_examples.qbk] | |
433 | [include chi_squared_examples.qbk] | |
434 | [include f_dist_example.qbk] | |
435 | [include binomial_example.qbk] | |
436 | [include geometric_example.qbk] | |
437 | [include negative_binomial_example.qbk] | |
438 | [include normal_example.qbk] | |
439 | [/include inverse_gamma_example.qbk] | |
440 | [/include inverse_gaussian_example.qbk] | |
441 | [include inverse_chi_squared_eg.qbk] | |
442 | [include nc_chi_squared_example.qbk] | |
443 | [include error_handling_example.qbk] | |
444 | [include find_location_and_scale.qbk] | |
445 | [include nag_library.qbk] | |
446 | [include c_sharp.qbk] | |
447 | [endsect] [/section:weg Worked Examples] | |
448 | ||
449 | [include background.qbk] | |
450 | ||
451 | [endsect] [/ section:stat_tut Statistical Distributions Tutorial] | |
452 | ||
453 | [/ dist_tutorial.qbk | |
454 | Copyright 2006, 2010, 2011 John Maddock and Paul A. Bristow. | |
455 | Distributed under the Boost Software License, Version 1.0. | |
456 | (See accompanying file LICENSE_1_0.txt or copy at | |
457 | http://www.boost.org/LICENSE_1_0.txt). | |
458 | ] | |
459 |