   1 $NetBSD: softfloat.txt,v 1.2 2006/11/24 19:46:58 christos Exp $
   2
   3 SoftFloat Release 2a General Documentation
   4
   5 John R. Hauser
   6 1998 December 13
   7
   8
   9 -------------------------------------------------------------------------------
  10 Introduction
  11
  12 SoftFloat is a software implementation of floating-point that conforms to
  13 the IEC/IEEE Standard for Binary Floating-Point Arithmetic.  As many as four
  14 formats are supported:  single precision, double precision, extended double
  15 precision, and quadruple precision.  All operations required by the standard
  16 are implemented, except for conversions to and from decimal.
  17
  18 This document gives information about the types defined and the routines
  19 implemented by SoftFloat.  It does not attempt to define or explain the
  20 IEC/IEEE Floating-Point Standard.  Details about the standard are available
  21 elsewhere.
  22
  23
  24 -------------------------------------------------------------------------------
  25 Limitations
  26
  27 SoftFloat is written in C and is designed to work with other C code.  The
  28 SoftFloat header files assume an ISO/ANSI-style C compiler.  No attempt
  29 has been made to accommodate compilers that are not ISO-conformant.  In
  30 particular, the distributed header files will not be acceptable to any
  31 compiler that does not recognize function prototypes.
  32
  33 Support for the extended double-precision and quadruple-precision formats
  34 depends on a C compiler that implements 64-bit integer arithmetic.  If the
  35 largest integer format supported by the C compiler is 32 bits, SoftFloat is
  36 limited to only single and double precisions.  When that is the case, all
  37 references in this document to the extended double precision, quadruple
  38 precision, and 64-bit integers should be ignored.
  39
  40
  41 -------------------------------------------------------------------------------
  42 Contents
  43
  44     Introduction
  45     Limitations
  46     Contents
  47     Legal Notice
  48     Types and Functions
  49     Rounding Modes
  50     Extended Double-Precision Rounding Precision
  51     Exceptions and Exception Flags
  52     Function Details
  53         Conversion Functions
  54         Standard Arithmetic Functions
  55         Remainder Functions
  56         Round-to-Integer Functions
  57         Comparison Functions
  58         Signaling NaN Test Functions
  59         Raise-Exception Function
  60     Contact Information
  61
  62
  63
  64 -------------------------------------------------------------------------------
  65 Legal Notice
  66
  67 SoftFloat was written by John R. Hauser.  This work was made possible in
  68 part by the International Computer Science Institute, located at Suite 600,
  69 1947 Center Street, Berkeley, California 94704.  Funding was partially
  70 provided by the National Science Foundation under grant MIP-9311980.  The
  71 original version of this code was written as part of a project to build
  72 a fixed-point vector processor in collaboration with the University of
  73 California at Berkeley, overseen by Profs. Nelson Morgan and John Wawrzynek.
  74
  75 THIS SOFTWARE IS DISTRIBUTED AS IS, FOR FREE.  Although reasonable effort
  76 has been made to avoid it, THIS SOFTWARE MAY CONTAIN FAULTS THAT WILL AT
  77 TIMES RESULT IN INCORRECT BEHAVIOR.  USE OF THIS SOFTWARE IS RESTRICTED TO
  78 PERSONS AND ORGANIZATIONS WHO CAN AND WILL TAKE FULL RESPONSIBILITY FOR ANY
  79 AND ALL LOSSES, COSTS, OR OTHER PROBLEMS ARISING FROM ITS USE.
  80
  81
  82 -------------------------------------------------------------------------------
  83 Types and Functions
  84
  85 When 64-bit integers are supported by the compiler, the `softfloat.h' header
  86 file defines four types:  `float32' (single precision), `float64' (double
  87 precision), `floatx80' (extended double precision), and `float128'
  88 (quadruple precision).  The `float32' and `float64' types are defined in
  89 terms of 32-bit and 64-bit integer types, respectively, while the `float128'
  90 type is defined as a structure of two 64-bit integers, taking into account
  91 the byte order of the particular machine being used.  The `floatx80' type
  92 is defined as a structure containing one 16-bit and one 64-bit integer, with
  93 the machine's byte order again determining the order of the `high' and `low'
  94 fields.
  95
  96 When 64-bit integers are _not_ supported by the compiler, the `softfloat.h'
  97 header file defines only two types:  `float32' and `float64'.  Because
  98 ISO/ANSI C guarantees at least one built-in integer type of 32 bits,
  99 the `float32' type is identified with an appropriate integer type.  The
 100 `float64' type is defined as a structure of two 32-bit integers, with the
 101 machine's byte order determining the order of the fields.
 102
 103 In either case, the types in `softfloat.h' are defined such that if a system
 104 implements the usual C `float' and `double' types according to the IEC/IEEE
 105 Standard, then the `float32' and `float64' types should be indistinguishable
 106 in memory from the native `float' and `double' types.  (On the other hand,
 107 when `float32' or `float64' values are placed in processor registers by
 108 the compiler, the type of registers used may differ from those used for the
 109 native `float' and `double' types.)
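
The memory-layout claim above can be illustrated with a minimal sketch that
copies a native `float' into a 32-bit integer and back.  It assumes the host
`float' is a 32-bit IEC/IEEE single; the helper names are hypothetical and
are not part of SoftFloat.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers, not SoftFloat functions.  They assume the host
   `float' is a 32-bit IEC/IEEE single-precision value. */

/* Return the raw bit pattern of a native float. */
static uint32_t sketch_float_to_bits(float f)
{
    uint32_t bits;

    assert(sizeof bits == sizeof f);
    memcpy(&bits, &f, sizeof bits);   /* well-defined, unlike pointer casts */
    return bits;
}

/* Reinterpret a raw 32-bit pattern as a native float. */
static float sketch_bits_to_float(uint32_t bits)
{
    float f;

    assert(sizeof f == sizeof bits);
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

On such a host, sketch_float_to_bits(1.0f) yields 0x3F800000, which is the
pattern a `float32' holding 1.0 would be expected to contain.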
 110
 111 SoftFloat implements the following arithmetic operations:
 112
 113 -- Conversions among all the floating-point formats, and also between
 114    integers (32-bit and 64-bit) and any of the floating-point formats.
 115
 116 -- The usual add, subtract, multiply, divide, and square root operations
 117    for all floating-point formats.
 118
 119 -- For each format, the floating-point remainder operation defined by the
 120    IEC/IEEE Standard.
 121
 122 -- For each floating-point format, a ``round to integer'' operation that
 123    rounds to the nearest integer value in the same format.  (The floating-
 124    point formats can hold integer values, of course.)
 125
 126 -- Comparisons between two values in the same floating-point format.
 127
 128 The only functions required by the IEC/IEEE Standard that are not provided
 129 are conversions to and from decimal.
 130
 131
 132 -------------------------------------------------------------------------------
 133 Rounding Modes
 134
 135 All four rounding modes prescribed by the IEC/IEEE Standard are implemented
 136 for all operations that require rounding.  The rounding mode is selected
 137 by the global variable `float_rounding_mode'.  This variable may be set
 138 to one of the values `float_round_nearest_even', `float_round_to_zero',
 139 `float_round_down', or `float_round_up'.  The rounding mode is initialized
 140 to nearest/even.
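
To make the four modes concrete, here is a toy model that rounds an exact
quotient num/den to an integer under each mode.  It is only an illustration
of the rounding rules; the enumeration and function names are hypothetical
and this is not how SoftFloat itself is written.

```c
#include <assert.h>

/* Hypothetical stand-ins for the four modes; the real constants are the
   `float_round_*' values declared by SoftFloat. */
enum sketch_rounding {
    SKETCH_NEAREST_EVEN,
    SKETCH_TO_ZERO,
    SKETCH_DOWN,
    SKETCH_UP
};

/* Round the exact quotient num/den (den > 0) to an integer. */
static long sketch_round_quotient(long num, long den,
                                  enum sketch_rounding mode)
{
    long q, r;

    assert(den > 0);
    q = num / den;              /* C truncates toward zero */
    r = num % den;
    if (r < 0) {                /* normalize to floor: 0 <= r < den */
        q -= 1;
        r += den;
    }
    if (r == 0) return q;       /* exact: no rounding needed */

    switch (mode) {
    case SKETCH_DOWN:
        return q;                         /* toward minus infinity */
    case SKETCH_UP:
        return q + 1;                     /* toward plus infinity */
    case SKETCH_TO_ZERO:
        return (num < 0) ? q + 1 : q;
    case SKETCH_NEAREST_EVEN:
    default:
        if (2 * r > den) return q + 1;
        if (2 * r < den) return q;
        return (q & 1) ? q + 1 : q;       /* tie: pick the even neighbor */
    }
}
```

For example, -5/2 rounds to -2 under nearest/even, to-zero, and up, but to -3
under down.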
 141
 142
 143 -------------------------------------------------------------------------------
 144 Extended Double-Precision Rounding Precision
 145
 146 For extended double precision (`floatx80') only, the rounding precision
 147 of the standard arithmetic operations is controlled by the global variable
 148 `floatx80_rounding_precision'.  The operations affected are:
 149
 150    floatx80_add   floatx80_sub   floatx80_mul   floatx80_div   floatx80_sqrt
 151
 152 When `floatx80_rounding_precision' is set to its default value of 80, these
 153 operations are rounded (as usual) to the full precision of the extended
 154 double-precision format.  Setting `floatx80_rounding_precision' to 32
 155 or to 64 causes the operations listed to be rounded to reduced precision
 156 equivalent to single precision (`float32') or to double precision
 157 (`float64'), respectively.  When rounding to reduced precision, additional
 158 bits in the result significand beyond the rounding point are set to zero.
 159 The consequences of setting `floatx80_rounding_precision' to a value other
 160 than 32, 64, or 80 are not specified.  Operations other than the ones listed
 161 above are not affected by `floatx80_rounding_precision'.
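
The effect of reduced rounding precision on the significand can be sketched
as follows.  The helper is hypothetical and is not SoftFloat code; it assumes
nearest/even rounding and that the reduced precisions keep 24 significant
bits (single) or 53 significant bits (double) of the 64-bit significand.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not SoftFloat code.  Round a 64-bit significand
   (integer bit at bit 63) to `keep_bits' significant bits using
   nearest/even, clearing every bit below the rounding point.  If rounding
   carries out of bit 63, the significand is renormalized and *exp_adjust
   is set to 1 so the caller can increment the exponent. */
static uint64_t sketch_round_sig(uint64_t sig, int keep_bits, int *exp_adjust)
{
    int drop;
    uint64_t unit, rem, kept;

    assert(keep_bits >= 1 && keep_bits <= 63);
    drop = 64 - keep_bits;
    unit = (uint64_t)1 << drop;      /* weight of the last kept bit */
    rem  = sig & (unit - 1);         /* bits being discarded */
    kept = sig & ~(unit - 1);        /* low bits forced to zero */
    *exp_adjust = 0;

    if (rem > unit / 2 || (rem == unit / 2 && (kept & unit) != 0)) {
        kept += unit;                /* round up */
        if (kept == 0) {             /* carried out of bit 63 */
            kept = (uint64_t)1 << 63;
            *exp_adjust = 1;
        }
    }
    return kept;
}
```

With keep_bits equal to 53, the low 11 bits of the returned significand are
always zero; with 24, the low 40 bits are.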
 162
 163
 164 -------------------------------------------------------------------------------
 165 Exceptions and Exception Flags
 166
 167 All five exception flags required by the IEC/IEEE Standard are
 168 implemented.  Each flag is stored as a unique bit in the global variable
 169 `float_exception_flags'.  The positions of the exception flag bits within
 170 this variable are determined by the bit masks `float_flag_inexact',
 171 `float_flag_underflow', `float_flag_overflow', `float_flag_divbyzero', and
 172 `float_flag_invalid'.  The exception flags variable is initialized to all 0,
 173 meaning no exceptions.
 174
 175 An individual exception flag can be cleared with the statement
 176
 177     float_exception_flags &= ~ float_flag_<exception>;
 178
 179 where `<exception>' is the appropriate name.  To raise a floating-point
 180 exception, the SoftFloat function `float_raise' should be used (see below).
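
The sticky-flag pattern described above can be sketched in isolation.  All
names and mask values below are hypothetical stand-ins; in real code the
`float_flag_*' masks and `float_exception_flags' from `softfloat.h' would
be used, and `float_raise' may do more than set bits.

```c
#include <assert.h>

/* Hypothetical stand-ins.  The real masks are the `float_flag_*' values
   from SoftFloat; their bit positions are not assumed to match these. */
enum {
    SKETCH_FLAG_INEXACT   = 1 << 0,
    SKETCH_FLAG_UNDERFLOW = 1 << 1,
    SKETCH_FLAG_OVERFLOW  = 1 << 2,
    SKETCH_FLAG_DIVBYZERO = 1 << 3,
    SKETCH_FLAG_INVALID   = 1 << 4,
    SKETCH_FLAG_ALL       = (1 << 5) - 1
};

static unsigned sketch_exception_flags = 0;   /* all clear: no exceptions */

/* Accumulate flags; they stay set until explicitly cleared. */
static void sketch_raise(unsigned mask)
{
    assert((mask & ~(unsigned)SKETCH_FLAG_ALL) == 0);
    sketch_exception_flags |= mask;
}

/* Report whether any flag in `mask' is set, then clear those flags. */
static int sketch_test_and_clear(unsigned mask)
{
    int was_set = (sketch_exception_flags & mask) != 0;

    sketch_exception_flags &= ~mask;
    return was_set;
}
```

A typical use is to clear the flags of interest, run a computation, and then
test them once at the end instead of after every operation.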
 181
 182 In the terminology of the IEC/IEEE Standard, SoftFloat can detect tininess
 183 for underflow either before or after rounding.  The choice is made by
 184 the global variable `float_detect_tininess', which can be set to either
 185 `float_tininess_before_rounding' or `float_tininess_after_rounding'.
 186 Detecting tininess after rounding is better because it results in fewer
 187 spurious underflow signals.  The other option is provided for compatibility
 188 with some systems.  Like most systems, SoftFloat always detects loss of
 189 accuracy for underflow as an inexact result.
 190
 191
 192 -------------------------------------------------------------------------------
 193 Function Details
 194
 195 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 196 Conversion Functions
 197
 198 All conversions among the floating-point formats are supported, as are all
 199 conversions between a floating-point format and 32-bit and 64-bit signed
 200 integers.  The complete set of conversion functions is:
 201
 202    int32_to_float32      int64_to_float32
 203    int32_to_float64      int64_to_float64
 204    int32_to_floatx80     int64_to_floatx80
 205    int32_to_float128     int64_to_float128
 206
 207    float32_to_int32      float32_to_int64
 208    float64_to_int32      float64_to_int64
 209    floatx80_to_int32     floatx80_to_int64
 210    float128_to_int32     float128_to_int64
 211
 212    float32_to_float64    float32_to_floatx80   float32_to_float128
 213    float64_to_float32    float64_to_floatx80   float64_to_float128
 214    floatx80_to_float32   floatx80_to_float64   floatx80_to_float128
 215    float128_to_float32   float128_to_float64   float128_to_floatx80
 216
 217 Each conversion function takes one operand of the appropriate type and
 218 returns one result.  Conversions from a smaller to a larger floating-point
 219 format are always exact and so require no rounding.  Conversions from 32-bit
 220 integers to double precision and larger formats are also exact, and likewise
 221 for conversions from 64-bit integers to extended double and quadruple
 222 precisions.
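
The exactness claims follow from significand widths: a double-precision
significand has 53 bits, enough for any 32-bit integer but not for every
64-bit one.  The sketch below checks this with the native `double' standing
in for `float64', which assumes the host `double' is an IEC/IEEE double.
The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers.  The native `double' stands in for `float64' and
   is assumed to be an IEC/IEEE double with a 53-bit significand. */

/* Every 32-bit integer fits in 53 bits, so this always returns 1. */
static int sketch_int32_survives_double(int32_t v)
{
    double d = (double)v;

    return (int32_t)d == v;
}

/* A 64-bit integer may need more than 53 bits, so this can return 0. */
static int sketch_int64_survives_double(int64_t v)
{
    const int64_t limit = (int64_t)1 << 62;
    double d;

    assert(v > -limit && v < limit);   /* keep the cast back in range */
    d = (double)v;                     /* rounds if v needs > 53 bits */
    return (int64_t)d == v;
}
```

For instance, 2^53 survives the round trip but 2^53 + 1 does not, which is
why `int64_to_float64' may have to round.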
 223
 224 Conversions from floating-point to integer raise the invalid exception if
 225 the source value cannot be rounded to a representable integer of the desired
 226 size (32 or 64 bits).  If the floating-point operand is a NaN, the largest
 227 positive integer is returned.  Otherwise, if the conversion overflows, the
 228 largest integer with the same sign as the operand is returned.
 229
 230 On conversions to integer, if the floating-point operand is not already an
 231 integer value, the operand is rounded according to the current rounding
 232 mode as specified by `float_rounding_mode'.  Because C (and perhaps other
 233 languages) requires that conversions to integers be rounded toward zero, the
 234 following functions are provided for improved speed and convenience:
 235
 236    float32_to_int32_round_to_zero    float32_to_int64_round_to_zero
 237    float64_to_int32_round_to_zero    float64_to_int64_round_to_zero
 238    floatx80_to_int32_round_to_zero   floatx80_to_int64_round_to_zero
 239    float128_to_int32_round_to_zero   float128_to_int64_round_to_zero
 240
 241 These variant functions ignore `float_rounding_mode' and always round toward
 242 zero.
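
The result rules for NaN and overflow, combined with rounding toward zero,
can be sketched for the single-precision to 32-bit case.  This is an
illustrative reimplementation under stated assumptions, not the SoftFloat
source: the function name, flag variable, and mask value are hypothetical,
and the inexact flag is not modeled.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_CONV_INVALID 0x01u        /* hypothetical mask value */
static unsigned sketch_conv_flags = 0;   /* hypothetical flag word */

/* Convert a single-precision bit pattern to int32, rounding toward zero.
   The inexact flag is deliberately not modeled here. */
static int32_t sketch_f32_to_i32_trunc(uint32_t a)
{
    int      sign  = (int)(a >> 31);
    int      exp   = (int)((a >> 23) & 0xFF);
    uint32_t frac  = a & 0x007FFFFFu;
    int      shift = exp - 127;             /* unbiased exponent */
    uint32_t sig, mag;

    if (exp == 0xFF && frac != 0) {         /* NaN: largest positive */
        sketch_conv_flags |= SKETCH_CONV_INVALID;
        return INT32_MAX;
    }
    if (shift < 0) return 0;                /* magnitude below 1 */
    if (shift >= 31) {
        if (sign && shift == 31 && frac == 0)
            return INT32_MIN;               /* exactly -2^31 fits */
        sketch_conv_flags |= SKETCH_CONV_INVALID;  /* overflow, infinity */
        return sign ? INT32_MIN : INT32_MAX;
    }
    sig = frac | 0x00800000u;               /* restore the implicit 1 */
    mag = (shift >= 23) ? sig << (shift - 23) : sig >> (23 - shift);
    assert(mag <= (uint32_t)INT32_MAX);
    return sign ? -(int32_t)mag : (int32_t)mag;
}
```

Passing 0xC0200000 (the pattern for -2.5) returns -2, and a NaN pattern such
as 0x7FC00000 returns the largest positive integer with the invalid bit set.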
 243
 244 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 245 Standard Arithmetic Functions
 246
 247 The following standard arithmetic functions are provided:
 248
 249    float32_add    float32_sub    float32_mul    float32_div    float32_sqrt
 250    float64_add    float64_sub    float64_mul    float64_div    float64_sqrt
 251    floatx80_add   floatx80_sub   floatx80_mul   floatx80_div   floatx80_sqrt
 252    float128_add   float128_sub   float128_mul   float128_div   float128_sqrt
 253
 254 Each function takes two operands, except for `sqrt' which takes only one.
 255 The operands and result are all of the same type.
 256
 257 Rounding of the extended double-precision (`floatx80') functions is affected
 258 by the `floatx80_rounding_precision' variable, as explained above in the
 259 section _Extended_Double-Precision_Rounding_Precision_.
 260
 261 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 262 Remainder Functions
 263
 264 For each format, SoftFloat implements the remainder function according to
 265 the IEC/IEEE Standard.  The remainder functions are:
 266
 267    float32_rem
 268    float64_rem
 269    floatx80_rem
 270    float128_rem
 271
 272 Each remainder function takes two operands.  The operands and result are all
 273 of the same type.  Given operands x and y, the remainder functions return
 274 the value x - n*y, where n is the integer closest to x/y.  If x/y is exactly
 275 halfway between two integers, n is the even integer closest to x/y.  The
 276 remainder functions are always exact and so require no rounding.
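
The definition can be checked on small values with a toy model.  The
function below is hypothetical and uses native `double' arithmetic; it is
only reliable when x/y is small and the tie test on the quotient is exact,
so it is a worked illustration and not a substitute for the SoftFloat
routines.

```c
#include <assert.h>

/* Toy model of the IEC/IEEE remainder for small, exactly representable
   operands.  Not SoftFloat code. */
static double sketch_rem(double x, double y)
{
    double q, frac;
    long n;

    assert(y != 0.0);
    q = x / y;
    n = (long)q;                 /* truncates toward zero */
    frac = q - (double)n;        /* lies strictly between -1 and 1 */

    /* Move n to the nearest integer, breaking ties toward even n. */
    if (frac > 0.5 || (frac == 0.5 && (n & 1)))
        n += 1;
    else if (frac < -0.5 || (frac == -0.5 && (n & 1)))
        n -= 1;

    return x - (double)n * y;
}
```

With x = 5 and y = 2 the quotient 2.5 is a tie, so n = 2 and the result is 1.
With x = 7 and y = 2 the tie at 3.5 resolves to n = 4 and the result is -1,
which shows that the remainder can be negative for positive operands.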
 277
 278 Depending on the relative magnitudes of the operands, the remainder
 279 functions can take considerably longer to execute than the other SoftFloat
 280 functions.  This is inherent in the remainder operation itself and is not a
 281 flaw in the SoftFloat implementation.
 282
 283 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 284 Round-to-Integer Functions
 285
 286 For each format, SoftFloat implements the round-to-integer function
 287 specified by the IEC/IEEE Standard.  The functions are:
 288
 289    float32_round_to_int
 290    float64_round_to_int
 291    floatx80_round_to_int
 292    float128_round_to_int
 293
 294 Each function takes a single floating-point operand and returns a result of
 295 the same type.  (Note that the result is not an integer type.)  The operand
 296 is rounded to an exact integer according to the current rounding mode, and
 297 the resulting integer value is returned in the same floating-point format.
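
To show that the result stays in the floating-point format, the sketch below
handles one case only: single precision with rounding toward zero.  The
helper is hypothetical and not SoftFloat code; the other rounding modes and
special-value handling (such as quieting a signaling NaN) are not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not SoftFloat code.  Models only the round-toward-
   zero case: discard the fraction bits of a single-precision pattern and
   return the result as another single-precision pattern. */
static uint32_t sketch_f32_trunc_to_int(uint32_t a)
{
    int exp   = (int)((a >> 23) & 0xFF);
    int shift = exp - 127;                  /* unbiased exponent */
    uint32_t frac_mask;

    if (shift < 0) return a & 0x80000000u;  /* magnitude below 1: signed 0 */
    if (shift >= 23) return a;              /* already integral, or inf/NaN */
    assert(shift >= 0 && shift < 23);
    frac_mask = 0x007FFFFFu >> shift;       /* bits below the binary point */
    return a & ~frac_mask;
}
```

The pattern for 2.5 (0x40200000) maps to the pattern for 2.0 (0x40000000),
and the pattern for -0.75 maps to negative zero.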
 298
 299 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 300 Comparison Functions
 301
 302 The following floating-point comparison functions are provided:
 303
 304    float32_eq    float32_le    float32_lt
 305    float64_eq    float64_le    float64_lt
 306    floatx80_eq   floatx80_le   floatx80_lt
 307    float128_eq   float128_le   float128_lt
 308
 309 Each function takes two operands of the same type and returns a 1 or 0
 310 representing either _true_ or _false_.  The abbreviation `eq' stands for
 311 ``equal'' (=); `le' stands for ``less than or equal'' (<=); and `lt' stands
 312 for ``less than'' (<).
 313
 314 The standard greater-than (>), greater-than-or-equal (>=), and not-equal
 315 (!=) functions are easily obtained using the functions provided.  The
 316 not-equal function is just the logical complement of the equal function.
 317 The greater-than-or-equal function is identical to the less-than-or-equal
 318 function with the operands reversed; and the greater-than function can be
 319 obtained from the less-than function in the same way.
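
The derivations above can be written out as thin wrappers.  In this sketch
native `float' comparisons stand in for the SoftFloat `eq', `le', and `lt'
functions, and all names are hypothetical.  It also shows why greater-than
is built by swapping operands and not by negating less-than-or-equal: the
two differ when an operand is a NaN.

```c
#include <assert.h>
#include <math.h>    /* NAN macro only; no libm functions are called */

/* Hypothetical stand-ins: native `float' comparisons play the role of
   float32_eq, float32_le, and float32_lt. */
static int sketch_eq(float a, float b) { return a == b; }
static int sketch_le(float a, float b) { return a <= b; }
static int sketch_lt(float a, float b) { return a <  b; }

/* The three derived relations, built as the text describes. */
static int sketch_ne(float a, float b) { return !sketch_eq(a, b); }
static int sketch_ge(float a, float b) { return sketch_le(b, a); }
static int sketch_gt(float a, float b) { return sketch_lt(b, a); }

/* Small self-check, including the unordered (NaN) case. */
static void sketch_compare_demo(void)
{
    float not_a_number = NAN;

    assert(sketch_gt(2.0f, 1.0f) && sketch_ge(2.0f, 2.0f));
    assert(sketch_ne(1.0f, 2.0f) && !sketch_ne(1.0f, 1.0f));
    assert(sketch_ne(not_a_number, not_a_number)); /* NaN never equals */
    assert(!sketch_gt(not_a_number, 1.0f));  /* ordered tests fail on NaN */
    assert(!sketch_le(not_a_number, 1.0f));  /* so !le differs from gt */
}
```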
 320
 321 The IEC/IEEE Standard specifies that the less-than-or-equal and less-than
 322 functions raise the invalid exception if either input is any kind of NaN.
 323 The equal functions, on the other hand, are defined not to raise the invalid
 324 exception on quiet NaNs.  For completeness, SoftFloat provides the following
 325 additional functions:
 326
 327    float32_eq_signaling    float32_le_quiet    float32_lt_quiet
 328    float64_eq_signaling    float64_le_quiet    float64_lt_quiet
 329    floatx80_eq_signaling   floatx80_le_quiet   floatx80_lt_quiet
 330    float128_eq_signaling   float128_le_quiet   float128_lt_quiet
 331
 332 The `signaling' equal functions are identical to the standard functions
 333 except that the invalid exception is raised for any NaN input.  Likewise,
 334 the `quiet' comparison functions are identical to their counterparts except
 335 that the invalid exception is not raised for quiet NaNs.
 336
 337 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 338 Signaling NaN Test Functions
 339
 340 The following functions test whether a floating-point value is a signaling
 341 NaN:
 342
 343    float32_is_signaling_nan
 344    float64_is_signaling_nan
 345    floatx80_is_signaling_nan
 346    float128_is_signaling_nan
 347
 348 The functions take one operand and return 1 if the operand is a signaling
 349 NaN and 0 otherwise.
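
A signaling NaN test on a single-precision bit pattern might look like the
sketch below.  It assumes the common encoding in which the most significant
fraction bit is 0 for signaling NaNs and 1 for quiet NaNs; that convention
is an assumption here, and some targets use the opposite one.  The function
names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers, not the SoftFloat functions.  They assume that a
   set most-significant fraction bit (bit 22) marks a quiet NaN. */

/* Any NaN: exponent all ones and a nonzero fraction. */
static int sketch_f32_is_nan(uint32_t a)
{
    return (a & 0x7FFFFFFFu) > 0x7F800000u;
}

/* Signaling NaN: exponent all ones, bit 22 clear, rest of fraction nonzero. */
static int sketch_f32_is_signaling_nan(uint32_t a)
{
    int signaling = (((a >> 22) & 0x1FF) == 0x1FE)
                    && ((a & 0x003FFFFFu) != 0);

    assert(!signaling || sketch_f32_is_nan(a));  /* sanity check */
    return signaling;
}
```

Under that assumption, 0x7FA00000 tests as signaling, while the quiet NaN
0x7FC00000 and infinity 0x7F800000 do not.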
 350
 351 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 352 Raise-Exception Function
 353
 354 SoftFloat provides a function for raising floating-point exceptions:
 355
 356     float_raise
 357
 358 The function takes a mask indicating the set of exceptions to raise.  No
 359 result is returned.  In addition to setting the specified exception flags,
 360 this function may cause a trap or abort appropriate for the current system.
 361
 362 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 363
 364
 365 -------------------------------------------------------------------------------
 366 Contact Information
 367
 368 At the time of this writing, the most up-to-date information about
 369 SoftFloat and the latest release can be found at the Web page
 370 `http://HTTP.CS.Berkeley.EDU/~jhauser/arithmetic/SoftFloat.html'.
 371
 372