]>
Commit | Line | Data |
---|---|---|
7c673cae FG |
1 | [section:hypergeometric_dist Hypergeometric Distribution] |
2 | ||
3 | ``#include <boost/math/distributions/hypergeometric.hpp>`` | |
4 | ||
5 | namespace boost{ namespace math{ | |
6 | ||
7 | template <class RealType = double, | |
8 | class ``__Policy`` = ``__policy_class`` > | |
9 | class hypergeometric_distribution; | |
10 | ||
11 | template <class RealType, class Policy> | |
12 | class hypergeometric_distribution | |
13 | { | |
14 | public: | |
15 | typedef RealType value_type; | |
16 | typedef Policy policy_type; | |
17 | // Construct: | |
18 | hypergeometric_distribution(unsigned r, unsigned n, unsigned N); | |
19 | // Accessors: | |
20 | unsigned total()const; | |
21 | unsigned defective()const; | |
22 | unsigned sample_count()const; | |
23 | }; | |
24 | ||
25 | typedef hypergeometric_distribution<> hypergeometric; | |
26 | ||
27 | }} // namespaces | |
28 | ||
29 | The hypergeometric distribution describes the number of "events" /k/ | |
30 | from a sample /n/ drawn from a total population /N/ ['without replacement]. | |
31 | ||
32 | Imagine we have a sample of /N/ objects of which /r/ are "defective" | |
33 | and N-r are "not defective" | |
34 | (the terms "success\/failure" or "red\/blue" are also used). If we sample /n/ | |
35 | items /without replacement/ then what is the probability that exactly | |
36 | /k/ items in the sample are defective? The answer is given by the pdf of the | |
37 | hypergeometric distribution `f(k; r, n, N)`, whilst the probability of | |
38 | /k/ defectives or fewer is given by F(k; r, n, N), where F(k) is the | |
39 | CDF of the hypergeometric distribution. | |
40 | ||
41 | [note Unlike almost all of the other distributions in this library, | |
42 | the hypergeometric distribution is strictly discrete: it can not be | |
43 | extended to real valued arguments of its parameters or random variable.] | |
44 | ||
45 | The following graph shows how the distribution changes as the proportion | |
46 | of "defective" items changes, while keeping the population and sample sizes | |
47 | constant: | |
48 | ||
49 | [graph hypergeometric_pdf_1] | |
50 | ||
51 | Note that since the distribution is symmetrical in parameters /n/ and /r/, if we | |
52 | change the sample size and keep the population and proportion "defective" the same | |
53 | then we obtain basically the same graphs: | |
54 | ||
55 | [graph hypergeometric_pdf_2] | |
56 | ||
57 | [h4 Member Functions] | |
58 | ||
59 | hypergeometric_distribution(unsigned r, unsigned n, unsigned N); | |
60 | ||
61 | Constructs a hypergeometric distribution with a population of /N/ objects, | |
62 | of which /r/ are defective, and from which /n/ are sampled. | |
63 | ||
64 | unsigned total()const; | |
65 | ||
66 | Returns the total number of objects /N/. | |
67 | ||
68 | unsigned defective()const; | |
69 | ||
70 | Returns the number of objects /r/ in population /N/ which are defective. | |
71 | ||
72 | unsigned sample_count()const; | |
73 | ||
74 | Returns the number of objects /n/ which are sampled from the population /N/. | |
75 | ||
76 | [h4 Non-member Accessors] | |
77 | ||
78 | All the [link math_toolkit.dist_ref.nmp usual non-member accessor functions] | |
79 | that are generic to all distributions are supported: __usual_accessors. | |
80 | ||
81 | The domain of the random variable is the unsigned integers in the range | |
82 | \[max(0, n + r - N), min(n, r)\]. A __domain_error is raised if the | |
83 | random variable is outside this range, or is not an integral value. | |
84 | ||
85 | [caution | |
86 | The quantile function will by default return an integer result that has been | |
87 | /rounded outwards/. That is to say lower quantiles (where the probability is | |
88 | less than 0.5) are rounded downward, and upper quantiles (where the probability | |
89 | is greater than 0.5) are rounded upwards. This behaviour | |
90 | ensures that if an X% quantile is requested, then /at least/ the requested | |
91 | coverage will be present in the central region, and /no more than/ | |
92 | the requested coverage will be present in the tails. | |
93 | ||
94 | This behaviour can be changed so that the quantile functions are rounded | |
95 | differently using | |
96 | [link math_toolkit.pol_overview Policies]. It is strongly | |
97 | recommended that you read the tutorial | |
98 | [link math_toolkit.pol_tutorial.understand_dis_quant | |
99 | Understanding Quantiles of Discrete Distributions] before | |
100 | using the quantile function on the Hypergeometric distribution. The | |
101 | [link math_toolkit.pol_ref.discrete_quant_ref reference docs] | |
102 | describe how to change the rounding policy | |
103 | for these distributions. | |
104 | ||
105 | However, note that the implementation method of the quantile function | |
106 | always returns an integral value, therefore attempting to use a __Policy | |
107 | that requires (or produces) a real valued result will result in a | |
108 | compile time error. | |
109 | ] [/ caution] | |
110 | ||
111 | ||
112 | [h4 Accuracy] | |
113 | ||
114 | For small N such that | |
115 | `N < boost::math::max_factorial<RealType>::value` then table based | |
116 | lookup of the results gives an accuracy to a few epsilon. | |
117 | `boost::math::max_factorial<RealType>::value` is 170 at double or long double | |
118 | precision. | |
119 | ||
120 | For larger N such that `N < boost::math::prime(boost::math::max_prime)` | |
121 | then only basic arithmetic is required for the calculation | |
122 | and the accuracy is typically < 20 epsilon. This takes care of N | |
123 | up to 104729. | |
124 | ||
125 | For `N > boost::math::prime(boost::math::max_prime)` then accuracy quickly | |
126 | degrades, with 5 or 6 decimal digits being lost for N = 110000. | |
127 | ||
128 | In general for very large N, the user should expect to lose log[sub 10]N | |
129 | decimal digits of precision during the calculation, with the results | |
130 | becoming meaningless for N >= 10[super 15]. | |
131 | ||
132 | [h4 Testing] | |
133 | ||
134 | There are three sets of tests: our implementation is tested against a table of values | |
135 | produced by Mathematica's implementation of this distribution. We also sanity check | |
136 | our implementation against some spot values computed using the online calculator | |
137 | here [@http://stattrek.com/Tables/Hypergeometric.aspx http://stattrek.com/Tables/Hypergeometric.aspx]. | |
138 | Finally we test accuracy against some high precision test data using | |
139 | this implementation and NTL::RR. | |
140 | ||
141 | [h4 Implementation] | |
142 | ||
143 | The PDF can be calculated directly using the formula: | |
144 | ||
145 | [equation hypergeometric1] | |
146 | ||
147 | However, this can only be used directly when the largest of the factorials | |
148 | is guaranteed not to overflow the floating point representation used. | |
149 | This formula is used directly when `N < max_factorial<RealType>::value` | |
150 | in which case table lookup of the factorials gives a rapid and accurate | |
151 | implementation method. | |
152 | ||
153 | For larger /N/ the method described in | |
154 | "An Accurate Computation of the Hypergeometric Distribution Function", | |
155 | Trong Wu, ACM Transactions on Mathematical Software, Vol. 19, No. 1, | |
156 | March 1993, Pages 33-43 is used. The method relies on the fact that | |
157 | there is an easy method for factorising a factorial into the product | |
158 | of prime numbers: | |
159 | ||
160 | [equation hypergeometric2] | |
161 | ||
162 | Where p[sub i] is the i'th prime number, and e[sub i] is a small | |
163 | positive integer or zero, which can be calculated via: | |
164 | ||
165 | [equation hypergeometric3] | |
166 | ||
167 | Further we can combine the factorials in the expression for the PDF | |
168 | to yield the PDF directly as the product of prime numbers: | |
169 | ||
170 | [equation hypergeometric4] | |
171 | ||
172 | With this time the exponents e[sub i] being either positive, negative | |
173 | or zero. Indeed such a degree of cancellation occurs in the calculation | |
174 | of the e[sub i] that many are zero, and typically most have a magnitude | |
175 | or no more than 1 or 2. | |
176 | ||
177 | Calculation of the product of the primes requires some care to prevent | |
178 | numerical overflow, we use a novel recursive method which splits the | |
179 | calculation into a series of sub-products, with a new sub-product | |
180 | started each time the next multiplication would cause either overflow | |
181 | or underflow. The sub-products are stored in a linked list on the | |
182 | program stack, and combined in an order that will guarantee no overflow | |
183 | or unnecessary-underflow once the last sub-product has been calculated. | |
184 | ||
185 | This method can be used as long as N is smaller than the largest prime | |
186 | number we have stored in our table of primes (currently 104729). The method | |
187 | is relatively slow (calculating the exponents requires the most time), but | |
188 | requires only a small number of arithmetic operations to | |
189 | calculate the result (indeed there is no shorter method involving only basic | |
190 | arithmetic once the exponents have been found), the method is therefore | |
191 | much more accurate than the alternatives. | |
192 | ||
193 | For much larger N, we can calculate the PDF from the factorials using | |
194 | either lgamma, or by directly combining lanczos approximations to avoid | |
195 | calculating via logarithms. We use the latter method, as it is usually | |
196 | 1 or 2 decimal digits more accurate than computing via logarithms with | |
197 | lgamma. However, in this area where N > 104729, the user should expect | |
198 | to lose around log[sub 10]N decimal digits during the calculation in | |
199 | the worst case. | |
200 | ||
201 | The CDF and its complement is calculated by directly summing the PDF's. | |
202 | We start by deciding whether the CDF, or its complement, is likely to be | |
203 | the smaller of the two and then calculate the PDF at /k/ (or /k+1/ if we're | |
204 | calculating the complement) and calculate successive PDF values via the | |
205 | recurrence relations: | |
206 | ||
207 | [equation hypergeometric5] | |
208 | ||
209 | Until we either reach the end of the distributions domain, or the next | |
210 | PDF value to be summed would be too small to affect the result. | |
211 | ||
212 | The quantile is calculated in a similar manner to the CDF: we first guess | |
213 | which end of the distribution we're nearer to, and then sum PDFs starting | |
214 | from the end of the distribution this time, until we have some value /k/ that | |
215 | gives the required CDF. | |
216 | ||
217 | The median is simply the quantile at 0.5, and the remaining properties are | |
218 | calculated via: | |
219 | ||
220 | [equation hypergeometric6] | |
221 | ||
222 | [endsect] | |
223 | ||
224 | [/ hypergeometric.qbk | |
225 | Copyright 2008 John Maddock. | |
226 | Distributed under the Boost Software License, Version 1.0. | |
227 | (See accompanying file LICENSE_1_0.txt or copy at | |
228 | http://www.boost.org/LICENSE_1_0.txt). | |
229 | ] |