ceph/src/boost/libs/math/doc/distributions/hypergeometric.qbk

   1 [section:hypergeometric_dist Hypergeometric Distribution]
   2
   3 ``#include <boost/math/distributions/hypergeometric.hpp>``
   4
   5    namespace boost{ namespace math{
   6
   7    template <class RealType = double,
   8              class ``__Policy``   = ``__policy_class`` >
   9    class hypergeometric_distribution;
  10
  11    template <class RealType, class Policy>
  12    class hypergeometric_distribution
  13    {
  14    public:
  15       typedef RealType value_type;
  16       typedef Policy   policy_type;
  17       // Construct:
  18       hypergeometric_distribution(unsigned r, unsigned n, unsigned N);
  19       // Accessors:
  20       unsigned total()const;
  21       unsigned defective()const;
  22       unsigned sample_count()const;
  23    };
  24
  25    typedef hypergeometric_distribution<> hypergeometric;
  26
  27    }} // namespaces
  28
  29 The hypergeometric distribution describes the number of "events" /k/
  30 from a sample /n/ drawn from a total population /N/ ['without replacement].
  31
  32 Imagine we have a sample of /N/ objects of which /r/ are "defective"
  33 and N-r are "not defective"
  34 (the terms "success\/failure" or "red\/blue" are also used).  If we sample /n/
  35 items /without replacement/ then what is the probability that exactly
  36 /k/ items in the sample are defective?  The answer is given by the pdf of the
  37 hypergeometric distribution `f(k; r, n, N)`, whilst the probability of
  38 /k/ defectives or fewer is given by F(k; r, n, N), where F(k) is the
  39 CDF of the hypergeometric distribution.
  40
  41 [note Unlike almost all of the other distributions in this library,
  42 the hypergeometric distribution is strictly discrete: it can not be
  43 extended to real valued arguments of its parameters or random variable.]
  44
  45 The following graph shows how the distribution changes as the proportion
  46 of "defective" items changes, while keeping the population and sample sizes
  47 constant:
  48
  49 [graph hypergeometric_pdf_1]
  50
  51 Note that since the distribution is symmetrical in parameters /n/ and /r/, if we
  52 change the sample size and keep the population and proportion "defective" the same
  53 then we obtain basically the same graphs:
  54
  55 [graph hypergeometric_pdf_2]
  56
  57 [h4 Member Functions]
  58
  59    hypergeometric_distribution(unsigned r, unsigned n, unsigned N);
  60
  61 Constructs a hypergeometric distribution with a population of /N/ objects,
  62 of which /r/ are defective, and from which /n/ are sampled.
  63
  64    unsigned total()const;
  65
  66 Returns the total number of objects /N/.
  67
  68    unsigned defective()const;
  69
  70 Returns the number of objects /r/ in population /N/ which are defective.
  71
  72    unsigned sample_count()const;
  73
  74 Returns the number of objects /n/ which are sampled from the population /N/.
  75
  76 [h4 Non-member Accessors]
  77
  78 All the [link math_toolkit.dist_ref.nmp usual non-member accessor functions]
  79 that are generic to all distributions are supported: __usual_accessors.
  80
  81 The domain of the random variable is the unsigned integers in the range
  82 \[max(0, n + r - N), min(n, r)\].  A __domain_error is raised if the
  83 random variable is outside this range, or is not an integral value.
  84
  85 [caution
  86 The quantile function will by default return an integer result that has been
  87 /rounded outwards/.  That is to say lower quantiles (where the probability is
  88 less than 0.5) are rounded downward, and upper quantiles (where the probability
  89 is greater than 0.5) are rounded upwards.  This behaviour
  90 ensures that if an X% quantile is requested, then /at least/ the requested
  91 coverage will be present in the central region, and /no more than/
  92 the requested coverage will be present in the tails.
  93
  94 This behaviour can be changed so that the quantile functions are rounded
  95 differently using
  96 [link math_toolkit.pol_overview Policies].  It is strongly
  97 recommended that you read the tutorial
  98 [link math_toolkit.pol_tutorial.understand_dis_quant
  99 Understanding Quantiles of Discrete Distributions] before
 100 using the quantile function on the Hypergeometric distribution.  The
 101 [link math_toolkit.pol_ref.discrete_quant_ref reference docs]
 102 describe how to change the rounding policy
 103 for these distributions.
 104
 105 However, note that the implementation method of the quantile function
 106 always returns an integral value, therefore attempting to use a __Policy
 107 that requires (or produces) a real valued result will result in a
 108 compile time error.
 109 ] [/ caution]
 110
 111
 112 [h4 Accuracy]
 113
 114 For small N such that
 115 `N < boost::math::max_factorial<RealType>::value` then table based
 116 lookup of the results gives an accuracy to a few epsilon.
 117 `boost::math::max_factorial<RealType>::value` is 170 at double or long double
 118 precision.
 119
 120 For larger N such that `N < boost::math::prime(boost::math::max_prime)`
 121 then only basic arithmetic is required for the calculation
 122 and the accuracy is typically < 20 epsilon.  This takes care of N
 123 up to 104729.
 124
 125 For `N > boost::math::prime(boost::math::max_prime)` then accuracy quickly
 126 degrades, with 5 or 6 decimal digits being lost for N = 110000.
 127
 128 In general for very large N, the user should expect to lose log[sub 10]N
 129 decimal digits of precision during the calculation, with the results
 130 becoming meaningless for N >= 10[super 15].
 131
 132 [h4 Testing]
 133
 134 There are three sets of tests: our implementation is tested against a table of values
 135 produced by Mathematica's implementation of this distribution. We also sanity check
 136 our implementation against some spot values computed using the online calculator
 137 here [@http://stattrek.com/Tables/Hypergeometric.aspx http://stattrek.com/Tables/Hypergeometric.aspx].
 138 Finally we test accuracy against some high precision test data using
 139 this implementation and NTL::RR.
 140
 141 [h4 Implementation]
 142
 143 The PDF can be calculated directly using the formula:
 144
 145 [equation hypergeometric1]
 146
 147 However, this can only be used directly when the largest of the factorials
 148 is guaranteed not to overflow the floating point representation used.
 149 This formula is used directly when `N < max_factorial<RealType>::value`
 150 in which case table lookup of the factorials gives a rapid and accurate
 151 implementation method.
 152
 153 For larger /N/ the method described in
 154 "An Accurate Computation of the Hypergeometric Distribution Function",
 155 Trong Wu, ACM Transactions on Mathematical Software, Vol. 19, No. 1,
 156 March 1993, Pages 33-43 is used.  The method relies on the fact that
 157 there is an easy method for factorising a factorial into the product
 158 of prime numbers:
 159
 160 [equation hypergeometric2]
 161
 162 Where p[sub i] is the i'th prime number, and e[sub i] is a small
 163 positive integer or zero, which can be calculated via:
 164
 165 [equation hypergeometric3]
 166
 167 Further we can combine the factorials in the expression for the PDF
 168 to yield the PDF directly as the product of prime numbers:
 169
 170 [equation hypergeometric4]
 171
 172 With this time the exponents e[sub i] being either positive, negative
 173 or zero.  Indeed such a degree of cancellation occurs in the calculation
 174 of the e[sub i] that many are zero, and typically most have a magnitude
 175 or no more than 1 or 2.
 176
 177 Calculation of the product of the primes requires some care to prevent
 178 numerical overflow, we use a novel recursive method which splits the
 179 calculation into a series of sub-products, with a new sub-product
 180 started each time the next multiplication would cause either overflow
 181 or underflow.  The sub-products are stored in a linked list on the
 182 program stack, and combined in an order that will guarantee no overflow
 183 or unnecessary-underflow once the last sub-product has been calculated.
 184
 185 This method can be used as long as N is smaller than the largest prime
 186 number we have stored in our table of primes (currently 104729).  The method
 187 is relatively slow (calculating the exponents requires the most time), but
 188 requires only a small number of arithmetic operations to
 189 calculate the result (indeed there is no shorter method involving only basic
 190 arithmetic once the exponents have been found), the method is therefore
 191 much more accurate than the alternatives.
 192
 193 For much larger N, we can calculate the PDF from the factorials using
 194 either lgamma, or by directly combining lanczos approximations to avoid
 195 calculating via logarithms.  We use the latter method, as it is usually
 196 1 or 2 decimal digits more accurate than computing via logarithms with
 197 lgamma.  However, in this area where N > 104729, the user should expect
 198 to lose around log[sub 10]N decimal digits during the calculation in
 199 the worst case.
 200
 201 The CDF and its complement is calculated by directly summing the PDF's.
 202 We start by deciding whether the CDF, or its complement, is likely to be
 203 the smaller of the two and then calculate the PDF at /k/ (or /k+1/ if we're
 204 calculating the complement) and calculate successive PDF values via the
 205 recurrence relations:
 206
 207 [equation hypergeometric5]
 208
 209 Until we either reach the end of the distributions domain, or the next
 210 PDF value to be summed would be too small to affect the result.
 211
 212 The quantile is calculated in a similar manner to the CDF: we first guess
 213 which end of the distribution we're nearer to, and then sum PDFs starting
 214 from the end of the distribution this time, until we have some value /k/ that
 215 gives the required CDF.
 216
 217 The median is simply the quantile at 0.5, and the remaining properties are
 218 calculated via:
 219
 220 [equation hypergeometric6]
 221
 222 [endsect]
 223
 224 [/ hypergeometric.qbk
 225   Copyright 2008 John Maddock.
 226   Distributed under the Boost Software License, Version 1.0.
 227   (See accompanying file LICENSE_1_0.txt or copy at
 228   http://www.boost.org/LICENSE_1_0.txt).
 229 ]