[ceph.git] / ceph / src / boost / libs / math / doc / distributions / hypergeometric.qbk

[section:hypergeometric_dist Hypergeometric Distribution]

``#include <boost/math/distributions/hypergeometric.hpp>``

   namespace boost{ namespace math{

   template <class RealType = double,
             class ``__Policy``   = ``__policy_class`` >
   class hypergeometric_distribution;

   template <class RealType, class Policy>
   class hypergeometric_distribution
   {
   public:
      typedef RealType value_type;
      typedef Policy   policy_type;
      // Construct:
      hypergeometric_distribution(unsigned r, unsigned n, unsigned N);
      // Accessors:
      unsigned total()const;
      unsigned defective()const;
      unsigned sample_count()const;
   };

   typedef hypergeometric_distribution<> hypergeometric;

   }} // namespaces

The hypergeometric distribution describes the number of "events" /k/
from a sample /n/ drawn from a total population /N/ ['without replacement].

Imagine we have a sample of /N/ objects of which /r/ are "defective"
and N-r are "not defective"
(the terms "success\/failure" or "red\/blue" are also used).  If we sample /n/
items /without replacement/ then what is the probability that exactly
/k/ items in the sample are defective?  The answer is given by the pdf of the
hypergeometric distribution `f(k; r, n, N)`, whilst the probability of
/k/ defectives or fewer is given by F(k; r, n, N), where F(k) is the
CDF of the hypergeometric distribution.

[note Unlike almost all of the other distributions in this library,
the hypergeometric distribution is strictly discrete: it can not be
extended to real valued arguments of its parameters or random variable.]

The following graph shows how the distribution changes as the proportion
of "defective" items changes, while keeping the population and sample sizes
constant:

[graph hypergeometric_pdf_1]

Note that since the distribution is symmetrical in parameters /n/ and /r/, if we
change the sample size and keep the population and proportion "defective" the same
then we obtain basically the same graphs:

[graph hypergeometric_pdf_2]

[h4 Member Functions]

   hypergeometric_distribution(unsigned r, unsigned n, unsigned N);

Constructs a hypergeometric distribution with a population of /N/ objects,
of which /r/ are defective, and from which /n/ are sampled.

   unsigned total()const;

Returns the total number of objects /N/.

   unsigned defective()const;

Returns the number of objects /r/ in population /N/ which are defective.

   unsigned sample_count()const;

Returns the number of objects /n/ which are sampled from the population /N/.

[h4 Non-member Accessors]

All the [link math_toolkit.dist_ref.nmp usual non-member accessor functions]
that are generic to all distributions are supported: __usual_accessors.

The domain of the random variable is the unsigned integers in the range
\[max(0, n + r - N), min(n, r)\].  A __domain_error is raised if the
random variable is outside this range, or is not an integral value.

[caution
The quantile function will by default return an integer result that has been
/rounded outwards/.  That is to say lower quantiles (where the probability is
less than 0.5) are rounded downward, and upper quantiles (where the probability
is greater than 0.5) are rounded upwards.  This behaviour
ensures that if an X% quantile is requested, then /at least/ the requested
coverage will be present in the central region, and /no more than/
the requested coverage will be present in the tails.

This behaviour can be changed so that the quantile functions are rounded
differently using
[link math_toolkit.pol_overview Policies].  It is strongly
recommended that you read the tutorial
[link math_toolkit.pol_tutorial.understand_dis_quant
Understanding Quantiles of Discrete Distributions] before
using the quantile function on the Hypergeometric distribution.  The
[link math_toolkit.pol_ref.discrete_quant_ref reference docs]
describe how to change the rounding policy
for these distributions.

However, note that the implementation method of the quantile function
always returns an integral value, therefore attempting to use a __Policy
that requires (or produces) a real valued result will result in a
compile time error.
] [/ caution]


[h4 Accuracy]

For small N such that
`N < boost::math::max_factorial<RealType>::value` then table based
lookup of the results gives an accuracy to a few epsilon.
`boost::math::max_factorial<RealType>::value` is 170 at double or long double
precision.

For larger N such that `N < boost::math::prime(boost::math::max_prime)`
then only basic arithmetic is required for the calculation
and the accuracy is typically < 20 epsilon.  This takes care of N
up to 104729.

For `N > boost::math::prime(boost::math::max_prime)` then accuracy quickly
degrades, with 5 or 6 decimal digits being lost for N = 110000.

In general for very large N, the user should expect to lose log[sub 10]N
decimal digits of precision during the calculation, with the results
becoming meaningless for N >= 10[super 15].

[h4 Testing]

There are three sets of tests: our implementation is tested against a table of values
produced by Mathematica's implementation of this distribution. We also sanity check
our implementation against some spot values computed using the online calculator
here [@http://stattrek.com/Tables/Hypergeometric.aspx http://stattrek.com/Tables/Hypergeometric.aspx].
Finally we test accuracy against some high precision test data using
this implementation and NTL::RR.

[h4 Implementation]

The PDF can be calculated directly using the formula:

[equation hypergeometric1]

However, this can only be used directly when the largest of the factorials
is guaranteed not to overflow the floating point representation used.
This formula is used directly when `N < max_factorial<RealType>::value`
in which case table lookup of the factorials gives a rapid and accurate
implementation method.

For larger /N/ the method described in
"An Accurate Computation of the Hypergeometric Distribution Function",
Trong Wu, ACM Transactions on Mathematical Software, Vol. 19, No. 1,
March 1993, Pages 33-43 is used.  The method relies on the fact that
there is an easy method for factorising a factorial into the product
of prime numbers:

[equation hypergeometric2]

Where p[sub i] is the i'th prime number, and e[sub i] is a small
positive integer or zero, which can be calculated via:

[equation hypergeometric3]

Further we can combine the factorials in the expression for the PDF
to yield the PDF directly as the product of prime numbers:

[equation hypergeometric4]

With this time the exponents e[sub i] being either positive, negative
or zero.  Indeed such a degree of cancellation occurs in the calculation
of the e[sub i] that many are zero, and typically most have a magnitude
or no more than 1 or 2.

Calculation of the product of the primes requires some care to prevent
numerical overflow, we use a novel recursive method which splits the
calculation into a series of sub-products, with a new sub-product
started each time the next multiplication would cause either overflow
or underflow.  The sub-products are stored in a linked list on the
program stack, and combined in an order that will guarantee no overflow
or unnecessary-underflow once the last sub-product has been calculated.

This method can be used as long as N is smaller than the largest prime
number we have stored in our table of primes (currently 104729).  The method
is relatively slow (calculating the exponents requires the most time), but
requires only a small number of arithmetic operations to
calculate the result (indeed there is no shorter method involving only basic
arithmetic once the exponents have been found), the method is therefore
much more accurate than the alternatives.

For much larger N, we can calculate the PDF from the factorials using
either lgamma, or by directly combining lanczos approximations to avoid
calculating via logarithms.  We use the latter method, as it is usually
1 or 2 decimal digits more accurate than computing via logarithms with
lgamma.  However, in this area where N > 104729, the user should expect
to lose around log[sub 10]N decimal digits during the calculation in
the worst case.

The CDF and its complement is calculated by directly summing the PDF's.
We start by deciding whether the CDF, or its complement, is likely to be
the smaller of the two and then calculate the PDF at /k/ (or /k+1/ if we're
calculating the complement) and calculate successive PDF values via the
recurrence relations:

[equation hypergeometric5]

Until we either reach the end of the distributions domain, or the next
PDF value to be summed would be too small to affect the result.

The quantile is calculated in a similar manner to the CDF: we first guess
which end of the distribution we're nearer to, and then sum PDFs starting
from the end of the distribution this time, until we have some value /k/ that
gives the required CDF.

The median is simply the quantile at 0.5, and the remaining properties are
calculated via:

[equation hypergeometric6]

[endsect]

[/ hypergeometric.qbk
  Copyright 2008 John Maddock.
  Distributed under the Boost Software License, Version 1.0.
  (See accompanying file LICENSE_1_0.txt or copy at
  http://www.boost.org/LICENSE_1_0.txt).
]
Commit	Line	Data
7c673cae FG	1	[section:hypergeometric_dist Hypergeometric Distribution]
	2
	3	``#include <boost/math/distributions/hypergeometric.hpp>``
	4
	5	namespace boost{ namespace math{
	6
	7	template <class RealType = double,
	8	class ``__Policy`` = ``__policy_class`` >
	9	class hypergeometric_distribution;
	10
	11	template <class RealType, class Policy>
	12	class hypergeometric_distribution
	13	{
	14	public:
	15	typedef RealType value_type;
	16	typedef Policy policy_type;
	17	// Construct:
	18	hypergeometric_distribution(unsigned r, unsigned n, unsigned N);
	19	// Accessors:
	20	unsigned total()const;
	21	unsigned defective()const;
	22	unsigned sample_count()const;
	23	};
	24
	25	typedef hypergeometric_distribution<> hypergeometric;
	26
	27	}} // namespaces
	28
	29	The hypergeometric distribution describes the number of "events" /k/
	30	from a sample /n/ drawn from a total population /N/ ['without replacement].
	31
	32	Imagine we have a sample of /N/ objects of which /r/ are "defective"
	33	and N-r are "not defective"
	34	(the terms "success\/failure" or "red\/blue" are also used). If we sample /n/
	35	items /without replacement/ then what is the probability that exactly
	36	/k/ items in the sample are defective? The answer is given by the pdf of the
	37	hypergeometric distribution `f(k; r, n, N)`, whilst the probability of
	38	/k/ defectives or fewer is given by F(k; r, n, N), where F(k) is the
	39	CDF of the hypergeometric distribution.
	40
	41	[note Unlike almost all of the other distributions in this library,
	42	the hypergeometric distribution is strictly discrete: it can not be
	43	extended to real valued arguments of its parameters or random variable.]
	44
	45	The following graph shows how the distribution changes as the proportion
	46	of "defective" items changes, while keeping the population and sample sizes
	47	constant:
	48
	49	[graph hypergeometric_pdf_1]
	50
	51	Note that since the distribution is symmetrical in parameters /n/ and /r/, if we
	52	change the sample size and keep the population and proportion "defective" the same
	53	then we obtain basically the same graphs:
	54
	55	[graph hypergeometric_pdf_2]
	56
	57	[h4 Member Functions]
	58
	59	hypergeometric_distribution(unsigned r, unsigned n, unsigned N);
	60
	61	Constructs a hypergeometric distribution with a population of /N/ objects,
	62	of which /r/ are defective, and from which /n/ are sampled.
	63
	64	unsigned total()const;
65
66	Returns the total number of objects /N/.
67
68	unsigned defective()const;
69
70	Returns the number of objects /r/ in population /N/ which are defective.
71
72	unsigned sample_count()const;
73
74	Returns the number of objects /n/ which are sampled from the population /N/.
75
76	[h4 Non-member Accessors]
77
78	All the [link math_toolkit.dist_ref.nmp usual non-member accessor functions]
79	that are generic to all distributions are supported: __usual_accessors.
80
81	The domain of the random variable is the unsigned integers in the range
82	\[max(0, n + r - N), min(n, r)\]. A __domain_error is raised if the
83	random variable is outside this range, or is not an integral value.
84
85	[caution
86	The quantile function will by default return an integer result that has been
87	/rounded outwards/. That is to say lower quantiles (where the probability is
88	less than 0.5) are rounded downward, and upper quantiles (where the probability
89	is greater than 0.5) are rounded upwards. This behaviour
90	ensures that if an X% quantile is requested, then /at least/ the requested
91	coverage will be present in the central region, and /no more than/
92	the requested coverage will be present in the tails.
93
94	This behaviour can be changed so that the quantile functions are rounded
95	differently using
96	[link math_toolkit.pol_overview Policies]. It is strongly
97	recommended that you read the tutorial
98	[link math_toolkit.pol_tutorial.understand_dis_quant
99	Understanding Quantiles of Discrete Distributions] before
100	using the quantile function on the Hypergeometric distribution. The
101	[link math_toolkit.pol_ref.discrete_quant_ref reference docs]
102	describe how to change the rounding policy
103	for these distributions.
104
105	However, note that the implementation method of the quantile function
106	always returns an integral value, therefore attempting to use a __Policy
107	that requires (or produces) a real valued result will result in a
108	compile time error.
109	] [/ caution]
110
111
112	[h4 Accuracy]
113
114	For small N such that
115	`N < boost::math::max_factorial<RealType>::value` then table based
116	lookup of the results gives an accuracy to a few epsilon.
117	`boost::math::max_factorial<RealType>::value` is 170 at double or long double
118	precision.
119
120	For larger N such that `N < boost::math::prime(boost::math::max_prime)`
121	then only basic arithmetic is required for the calculation
122	and the accuracy is typically < 20 epsilon. This takes care of N
123	up to 104729.
124
125	For `N > boost::math::prime(boost::math::max_prime)` then accuracy quickly
126	degrades, with 5 or 6 decimal digits being lost for N = 110000.
127
128	In general for very large N, the user should expect to lose log[sub 10]N
129	decimal digits of precision during the calculation, with the results
130	becoming meaningless for N >= 10[super 15].
131
132	[h4 Testing]
133
134	There are three sets of tests: our implementation is tested against a table of values
135	produced by Mathematica's implementation of this distribution. We also sanity check
136	our implementation against some spot values computed using the online calculator
137	here [@http://stattrek.com/Tables/Hypergeometric.aspx http://stattrek.com/Tables/Hypergeometric.aspx].
138	Finally we test accuracy against some high precision test data using
139	this implementation and NTL::RR.
140
141	[h4 Implementation]
142
143	The PDF can be calculated directly using the formula:
144
145	[equation hypergeometric1]
146
147	However, this can only be used directly when the largest of the factorials
148	is guaranteed not to overflow the floating point representation used.
149	This formula is used directly when `N < max_factorial<RealType>::value`
150	in which case table lookup of the factorials gives a rapid and accurate
151	implementation method.
152
153	For larger /N/ the method described in
154	"An Accurate Computation of the Hypergeometric Distribution Function",
155	Trong Wu, ACM Transactions on Mathematical Software, Vol. 19, No. 1,
156	March 1993, Pages 33-43 is used. The method relies on the fact that
157	there is an easy method for factorising a factorial into the product
158	of prime numbers:
159
160	[equation hypergeometric2]
161
162	Where p[sub i] is the i'th prime number, and e[sub i] is a small
163	positive integer or zero, which can be calculated via:
164
165	[equation hypergeometric3]
166
167	Further we can combine the factorials in the expression for the PDF
168	to yield the PDF directly as the product of prime numbers:
169
170	[equation hypergeometric4]
171
172	With this time the exponents e[sub i] being either positive, negative
173	or zero. Indeed such a degree of cancellation occurs in the calculation
174	of the e[sub i] that many are zero, and typically most have a magnitude
175	or no more than 1 or 2.
176
177	Calculation of the product of the primes requires some care to prevent
178	numerical overflow, we use a novel recursive method which splits the
179	calculation into a series of sub-products, with a new sub-product
180	started each time the next multiplication would cause either overflow
181	or underflow. The sub-products are stored in a linked list on the
182	program stack, and combined in an order that will guarantee no overflow
183	or unnecessary-underflow once the last sub-product has been calculated.
184
185	This method can be used as long as N is smaller than the largest prime
186	number we have stored in our table of primes (currently 104729). The method
187	is relatively slow (calculating the exponents requires the most time), but
188	requires only a small number of arithmetic operations to
189	calculate the result (indeed there is no shorter method involving only basic
190	arithmetic once the exponents have been found), the method is therefore
191	much more accurate than the alternatives.
192
193	For much larger N, we can calculate the PDF from the factorials using
194	either lgamma, or by directly combining lanczos approximations to avoid
195	calculating via logarithms. We use the latter method, as it is usually
196	1 or 2 decimal digits more accurate than computing via logarithms with
197	lgamma. However, in this area where N > 104729, the user should expect
198	to lose around log[sub 10]N decimal digits during the calculation in
199	the worst case.
200
201	The CDF and its complement is calculated by directly summing the PDF's.
202	We start by deciding whether the CDF, or its complement, is likely to be
203	the smaller of the two and then calculate the PDF at /k/ (or /k+1/ if we're
204	calculating the complement) and calculate successive PDF values via the
205	recurrence relations:
206
207	[equation hypergeometric5]
208
209	Until we either reach the end of the distributions domain, or the next
210	PDF value to be summed would be too small to affect the result.
211
212	The quantile is calculated in a similar manner to the CDF: we first guess
213	which end of the distribution we're nearer to, and then sum PDFs starting
214	from the end of the distribution this time, until we have some value /k/ that
215	gives the required CDF.
216
217	The median is simply the quantile at 0.5, and the remaining properties are
218	calculated via:
219
220	[equation hypergeometric6]
221
222	[endsect]
223
224	[/ hypergeometric.qbk
225	Copyright 2008 John Maddock.
226	Distributed under the Boost Software License, Version 1.0.
227	(See accompanying file LICENSE_1_0.txt or copy at
228	http://www.boost.org/LICENSE_1_0.txt).
229	]