Commit | Line | Data |
---|---|---|
3352b62b N |
1 | $NetBSD: softfloat.txt,v 1.2 2006/11/24 19:46:58 christos Exp $\r |
2 | \r | |
3 | SoftFloat Release 2a General Documentation\r | |
4 | \r | |
5 | John R. Hauser\r | |
6 | 1998 December 13\r | |
7 | \r | |
8 | \r | |
9 | -------------------------------------------------------------------------------\r | |
10 | Introduction\r | |
11 | \r | |
12 | SoftFloat is a software implementation of floating-point that conforms to\r | |
13 | the IEC/IEEE Standard for Binary Floating-Point Arithmetic. As many as four\r | |
14 | formats are supported: single precision, double precision, extended double\r | |
15 | precision, and quadruple precision. All operations required by the standard\r | |
16 | are implemented, except for conversions to and from decimal.\r | |
17 | \r | |
18 | This document gives information about the types defined and the routines\r | |
19 | implemented by SoftFloat. It does not attempt to define or explain the\r | |
20 | IEC/IEEE Floating-Point Standard. Details about the standard are available\r | |
21 | elsewhere.\r | |
22 | \r | |
23 | \r | |
24 | -------------------------------------------------------------------------------\r | |
25 | Limitations\r | |
26 | \r | |
27 | SoftFloat is written in C and is designed to work with other C code. The\r | |
28 | SoftFloat header files assume an ISO/ANSI-style C compiler. No attempt\r | |
29 | has been made to accommodate compilers that are not ISO-conformant. In\r | |
30 | particular, the distributed header files will not be acceptable to any\r | |
31 | compiler that does not recognize function prototypes.\r | |
32 | \r | |
33 | Support for the extended double-precision and quadruple-precision formats\r | |
34 | depends on a C compiler that implements 64-bit integer arithmetic. If the\r | |
35 | largest integer format supported by the C compiler is 32 bits, SoftFloat is\r | |
36 | limited to only single and double precisions. When that is the case, all\r | |
37 | references in this document to the extended double precision, quadruple\r | |
38 | precision, and 64-bit integers should be ignored.\r | |
39 | \r | |
40 | \r | |
41 | -------------------------------------------------------------------------------\r | |
42 | Contents\r | |
43 | \r | |
44 | Introduction\r | |
45 | Limitations\r | |
46 | Contents\r | |
47 | Legal Notice\r | |
48 | Types and Functions\r | |
49 | Rounding Modes\r | |
50 | Extended Double-Precision Rounding Precision\r | |
51 | Exceptions and Exception Flags\r | |
52 | Function Details\r | |
53 | Conversion Functions\r | |
54 | Standard Arithmetic Functions\r | |
55 | Remainder Functions\r | |
56 | Round-to-Integer Functions\r | |
57 | Comparison Functions\r | |
58 | Signaling NaN Test Functions\r | |
59 | Raise-Exception Function\r | |
60 | Contact Information\r | |
61 | \r | |
62 | \r | |
63 | \r | |
64 | -------------------------------------------------------------------------------\r | |
65 | Legal Notice\r | |
66 | \r | |
67 | SoftFloat was written by John R. Hauser. This work was made possible in\r | |
68 | part by the International Computer Science Institute, located at Suite 600,\r | |
69 | 1947 Center Street, Berkeley, California 94704. Funding was partially\r | |
70 | provided by the National Science Foundation under grant MIP-9311980. The\r | |
71 | original version of this code was written as part of a project to build\r | |
72 | a fixed-point vector processor in collaboration with the University of\r | |
73 | California at Berkeley, overseen by Profs. Nelson Morgan and John Wawrzynek.\r | |
74 | \r | |
75 | THIS SOFTWARE IS DISTRIBUTED AS IS, FOR FREE. Although reasonable effort\r | |
76 | has been made to avoid it, THIS SOFTWARE MAY CONTAIN FAULTS THAT WILL AT\r | |
77 | TIMES RESULT IN INCORRECT BEHAVIOR. USE OF THIS SOFTWARE IS RESTRICTED TO\r | |
78 | PERSONS AND ORGANIZATIONS WHO CAN AND WILL TAKE FULL RESPONSIBILITY FOR ANY\r | |
79 | AND ALL LOSSES, COSTS, OR OTHER PROBLEMS ARISING FROM ITS USE.\r | |
80 | \r | |
81 | \r | |
82 | -------------------------------------------------------------------------------\r | |
83 | Types and Functions\r | |
84 | \r | |
85 | When 64-bit integers are supported by the compiler, the `softfloat.h' header\r | |
86 | file defines four types: `float32' (single precision), `float64' (double\r | |
87 | precision), `floatx80' (extended double precision), and `float128'\r | |
88 | (quadruple precision). The `float32' and `float64' types are defined in\r | |
89 | terms of 32-bit and 64-bit integer types, respectively, while the `float128'\r | |
90 | type is defined as a structure of two 64-bit integers, taking into account\r | |
91 | the byte order of the particular machine being used. The `floatx80' type\r | |
92 | is defined as a structure containing one 16-bit and one 64-bit integer, with\r | |
93 | the machine's byte order again determining the order of the `high' and `low'\r | |
94 | fields.\r | |
95 | \r | |
96 | When 64-bit integers are _not_ supported by the compiler, the `softfloat.h'\r | |
97 | header file defines only two types: `float32' and `float64'. Because\r | |
98 | ISO/ANSI C guarantees at least one built-in integer type of 32 bits,\r | |
99 | the `float32' type is identified with an appropriate integer type. The\r | |
100 | `float64' type is defined as a structure of two 32-bit integers, with the\r | |
101 | machine's byte order determining the order of the fields.\r | |
102 | \r | |
103 | In either case, the types in `softfloat.h' are defined such that if a system\r | |
104 | implements the usual C `float' and `double' types according to the IEC/IEEE\r | |
105 | Standard, then the `float32' and `float64' types should be indistinguishable\r | |
106 | in memory from the native `float' and `double' types. (On the other hand,\r | |
107 | when `float32' or `float64' values are placed in processor registers by\r | |
108 | the compiler, the type of registers used may differ from those used for the\r | |
109 | native `float' and `double' types.)\r | |
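
The memory-layout point above can be illustrated with a minimal sketch, assuming a host whose native `float' is IEC/IEEE single precision. The names `demo_float32', `demo_float32_from_native', and `demo_float32_to_native' are hypothetical stand-ins for illustration; they are not part of `softfloat.h'.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for SoftFloat's `float32': a plain 32-bit integer. */
typedef uint32_t demo_float32;

/* Copy the bits of a native float into the integer-based type.
   memcpy is used instead of a pointer cast to avoid aliasing problems. */
demo_float32 demo_float32_from_native(float f)
{
    demo_float32 bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Inverse direction: reinterpret the 32-bit pattern as a native float. */
float demo_float32_to_native(demo_float32 bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

On such a host, passing 1.0f to `demo_float32_from_native' yields the pattern 0x3F800000, which is the single-precision encoding of one.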
110 | \r | |
111 | SoftFloat implements the following arithmetic operations:\r | |
112 | \r | |
113 | -- Conversions among all the floating-point formats, and also between\r | |
114 | integers (32-bit and 64-bit) and any of the floating-point formats.\r | |
115 | \r | |
116 | -- The usual add, subtract, multiply, divide, and square root operations\r | |
117 | for all floating-point formats.\r | |
118 | \r | |
119 | -- For each format, the floating-point remainder operation defined by the\r | |
120 | IEC/IEEE Standard.\r | |
121 | \r | |
122 | -- For each floating-point format, a ``round to integer'' operation that\r | |
123 | rounds to the nearest integer value in the same format. (The floating-\r | |
124 | point formats can hold integer values, of course.)\r | |
125 | \r | |
126 | -- Comparisons between two values in the same floating-point format.\r | |
127 | \r | |
128 | The only functions required by the IEC/IEEE Standard that are not provided\r | |
129 | are conversions to and from decimal.\r | |
130 | \r | |
131 | \r | |
132 | -------------------------------------------------------------------------------\r | |
133 | Rounding Modes\r | |
134 | \r | |
135 | All four rounding modes prescribed by the IEC/IEEE Standard are implemented\r | |
136 | for all operations that require rounding. The rounding mode is selected\r | |
137 | by the global variable `float_rounding_mode'. This variable may be set\r | |
138 | to one of the values `float_round_nearest_even', `float_round_to_zero',\r | |
139 | `float_round_down', or `float_round_up'. The rounding mode is initialized\r | |
140 | to nearest/even.\r | |
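
To make the four modes concrete, here is a small self-contained sketch that rounds a fixed-point value (8 fraction bits) to an integer under each mode. It does not call SoftFloat; the enum and the function `demo_round_fixed8' are hypothetical names chosen for this illustration, and only the rounding rules themselves come from the standard.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the four rounding-mode values. */
enum demo_round_mode {
    DEMO_ROUND_NEAREST_EVEN,
    DEMO_ROUND_TO_ZERO,
    DEMO_ROUND_DOWN,
    DEMO_ROUND_UP
};

/* Round the value x/256 to an integer under the given mode. */
int32_t demo_round_fixed8(int32_t x, enum demo_round_mode mode)
{
    int32_t q = x / 256;          /* C99 division truncates toward zero */
    int32_t r = x % 256;          /* remainder has the sign of x */
    int32_t lo = q - (r < 0);     /* floor of x/256 */
    int32_t hi = q + (r > 0);     /* ceiling of x/256 */
    int32_t frac = x - lo * 256;  /* distance above the floor, 0..255 */

    switch (mode) {
    case DEMO_ROUND_TO_ZERO: return q;
    case DEMO_ROUND_DOWN:    return lo;
    case DEMO_ROUND_UP:      return hi;
    default:                      /* nearest, ties to even */
        if (frac > 128) return hi;
        if (frac < 128) return lo;
        return (lo % 2 != 0) ? hi : lo;
    }
}
```

For the input -640 (that is, -2.5), the sketch returns -2 for nearest/even, -2 for round-to-zero, -3 for round-down, and -2 for round-up.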
141 | \r | |
142 | \r | |
143 | -------------------------------------------------------------------------------\r | |
144 | Extended Double-Precision Rounding Precision\r | |
145 | \r | |
146 | For extended double precision (`floatx80') only, the rounding precision\r | |
147 | of the standard arithmetic operations is controlled by the global variable\r | |
148 | `floatx80_rounding_precision'. The operations affected are:\r | |
149 | \r | |
150 | floatx80_add floatx80_sub floatx80_mul floatx80_div floatx80_sqrt\r | |
151 | \r | |
152 | When `floatx80_rounding_precision' is set to its default value of 80, these\r | |
153 | operations are rounded (as usual) to the full precision of the extended\r | |
154 | double-precision format. Setting `floatx80_rounding_precision' to 32\r | |
155 | or to 64 causes the operations listed to be rounded to reduced precision\r | |
156 | equivalent to single precision (`float32') or to double precision\r | |
157 | (`float64'), respectively. When rounding to reduced precision, additional\r | |
158 | bits in the result significand beyond the rounding point are set to zero.\r | |
159 | The consequences of setting `floatx80_rounding_precision' to a value other\r | |
160 | than 32, 64, or 80 are not specified. Operations other than the ones listed\r | |
161 | above are not affected by `floatx80_rounding_precision'.\r | |
162 | \r | |
163 | \r | |
164 | -------------------------------------------------------------------------------\r | |
165 | Exceptions and Exception Flags\r | |
166 | \r | |
167 | All five exception flags required by the IEC/IEEE Standard are\r | |
168 | implemented. Each flag is stored as a unique bit in the global variable\r | |
169 | `float_exception_flags'. The positions of the exception flag bits within\r | |
170 | this variable are determined by the bit masks `float_flag_inexact',\r | |
171 | `float_flag_underflow', `float_flag_overflow', `float_flag_divbyzero', and\r | |
172 | `float_flag_invalid'. The exception flags variable is initialized to all 0,\r | |
173 | meaning no exceptions.\r | |
174 | \r | |
175 | An individual exception flag can be cleared with the statement\r | |
176 | \r | |
177 | float_exception_flags &= ~ float_flag_<exception>;\r | |
178 | \r | |
179 | where `<exception>' is the appropriate name. To raise a floating-point\r | |
180 | exception, the SoftFloat function `float_raise' should be used (see below).\r | |
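
The flag handling described above follows the usual bit-mask pattern. The sketch below is self-contained and does not link against SoftFloat: the variable `demo_exception_flags', the `DEMO_FLAG_*' masks, and the helper functions are hypothetical stand-ins, and the mask values are arbitrary (the real values are defined by `softfloat.h' and may differ by target).

```c
/* Hypothetical bit masks, one bit per exception. */
enum {
    DEMO_FLAG_INEXACT   = 1,
    DEMO_FLAG_UNDERFLOW = 2,
    DEMO_FLAG_OVERFLOW  = 4,
    DEMO_FLAG_DIVBYZERO = 8,
    DEMO_FLAG_INVALID   = 16
};

/* Stand-in for the global flags variable; starts at 0 (no exceptions). */
int demo_exception_flags = 0;

/* Set every flag named in the mask (flags are sticky until cleared). */
void demo_raise(int mask)
{
    demo_exception_flags |= mask;
}

/* Return 1 if any flag in the mask is currently set, 0 otherwise. */
int demo_test(int mask)
{
    return (demo_exception_flags & mask) != 0;
}

/* Clear only the flags named in the mask, leaving the others intact. */
void demo_clear(int mask)
{
    demo_exception_flags &= ~mask;
}
```

A typical use is to clear the flags of interest, run a computation, and then test which flags were raised by it.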
181 | \r | |
182 | In the terminology of the IEC/IEEE Standard, SoftFloat can detect tininess\r | |
183 | for underflow either before or after rounding. The choice is made by\r | |
184 | the global variable `float_detect_tininess', which can be set to either\r | |
185 | `float_tininess_before_rounding' or `float_tininess_after_rounding'.\r | |
186 | Detecting tininess after rounding is better because it results in fewer\r | |
187 | spurious underflow signals. The other option is provided for compatibility\r | |
188 | with some systems. Like most systems, SoftFloat always detects loss of\r | |
189 | accuracy for underflow as an inexact result.\r | |
190 | \r | |
191 | \r | |
192 | -------------------------------------------------------------------------------\r | |
193 | Function Details\r | |
194 | \r | |
195 | - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -\r | |
196 | Conversion Functions\r | |
197 | \r | |
198 | All conversions among the floating-point formats are supported, as are all\r | |
199 | conversions between a floating-point format and 32-bit and 64-bit signed\r | |
200 | integers. The complete set of conversion functions is:\r | |
201 | \r | |
202 | int32_to_float32 int64_to_float32\r | |
203 | int32_to_float64 int64_to_float64\r | |
204 | int32_to_floatx80 int64_to_floatx80\r | |
205 | int32_to_float128 int64_to_float128\r | |
206 | \r | |
207 | float32_to_int32 float32_to_int64\r | |
208 | float64_to_int32 float64_to_int64\r | |
209 | floatx80_to_int32 floatx80_to_int64\r | |
210 | float128_to_int32 float128_to_int64\r | |
211 | \r | |
212 | float32_to_float64 float32_to_floatx80 float32_to_float128\r | |
213 | float64_to_float32 float64_to_floatx80 float64_to_float128\r | |
214 | floatx80_to_float32 floatx80_to_float64 floatx80_to_float128\r | |
215 | float128_to_float32 float128_to_float64 float128_to_floatx80\r | |
216 | \r | |
217 | Each conversion function takes one operand of the appropriate type and\r | |
218 | returns one result. Conversions from a smaller to a larger floating-point\r | |
219 | format are always exact and so require no rounding. Conversions from 32-bit\r | |
220 | integers to double precision and larger formats are also exact, and likewise\r | |
221 | for conversions from 64-bit integers to extended double and quadruple\r | |
222 | precisions.\r | |
223 | \r | |
224 | Conversions from floating-point to integer raise the invalid exception if\r | |
225 | the source value cannot be rounded to a representable integer of the desired\r | |
226 | size (32 or 64 bits). If the floating-point operand is a NaN, the largest\r | |
227 | positive integer is returned. Otherwise, if the conversion overflows, the\r | |
228 | largest integer with the same sign as the operand is returned.\r | |
229 | \r | |
230 | On conversions to integer, if the floating-point operand is not already an\r | |
231 | integer value, the operand is rounded according to the current rounding\r | |
232 | mode as specified by `float_rounding_mode'. Because C (and perhaps other\r | |
233 | languages) requires that conversions to integers be rounded toward zero, the\r | |
234 | following functions are provided for improved speed and convenience:\r | |
235 | \r | |
236 | float32_to_int32_round_to_zero float32_to_int64_round_to_zero\r | |
237 | float64_to_int32_round_to_zero float64_to_int64_round_to_zero\r | |
238 | floatx80_to_int32_round_to_zero floatx80_to_int64_round_to_zero\r | |
239 | float128_to_int32_round_to_zero float128_to_int64_round_to_zero\r | |
240 | \r | |
241 | These variant functions ignore `float_rounding_mode' and always round toward\r | |
242 | zero.\r | |
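
The result rules for float-to-integer conversion (NaN gives the largest positive integer, overflow saturates with the operand's sign, rounding is toward zero) can be sketched with native types. This is an illustration under stated assumptions, not SoftFloat's code: it uses the host's `double' in place of `float64', and the function name and the `invalid' out-parameter are hypothetical (SoftFloat reports the condition through its exception flags instead).

```c
#include <stdint.h>

/* Convert x to a 32-bit integer, rounding toward zero.
   *invalid is set to 1 when the result is not representable. */
int32_t demo_double_to_int32_round_to_zero(double x, int *invalid)
{
    *invalid = 0;
    if (x != x) {                    /* a NaN compares unequal to itself */
        *invalid = 1;
        return INT32_MAX;            /* NaN: largest positive integer */
    }
    if (x >= 2147483648.0) {         /* too large after truncation */
        *invalid = 1;
        return INT32_MAX;
    }
    if (x <= -2147483649.0) {        /* too small after truncation */
        *invalid = 1;
        return INT32_MIN;
    }
    return (int32_t)x;               /* a C cast truncates toward zero */
}
```

Note that a value such as -2147483648.5 is still valid here, because truncation toward zero brings it to the smallest representable 32-bit integer.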
243 | \r | |
244 | - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -\r | |
245 | Standard Arithmetic Functions\r | |
246 | \r | |
247 | The following standard arithmetic functions are provided:\r | |
248 | \r | |
249 | float32_add float32_sub float32_mul float32_div float32_sqrt\r | |
250 | float64_add float64_sub float64_mul float64_div float64_sqrt\r | |
251 | floatx80_add floatx80_sub floatx80_mul floatx80_div floatx80_sqrt\r | |
252 | float128_add float128_sub float128_mul float128_div float128_sqrt\r | |
253 | \r | |
254 | Each function takes two operands, except for `sqrt' which takes only one.\r | |
255 | The operands and result are all of the same type.\r | |
256 | \r | |
257 | Rounding of the extended double-precision (`floatx80') functions is affected\r | |
258 | by the `floatx80_rounding_precision' variable, as explained above in the\r | |
259 | section _Extended_Double-Precision_Rounding_Precision_.\r | |
260 | \r | |
261 | - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -\r | |
262 | Remainder Functions\r | |
263 | \r | |
264 | For each format, SoftFloat implements the remainder function according to\r | |
265 | the IEC/IEEE Standard. The remainder functions are:\r | |
266 | \r | |
267 | float32_rem\r | |
268 | float64_rem\r | |
269 | floatx80_rem\r | |
270 | float128_rem\r | |
271 | \r | |
272 | Each remainder function takes two operands. The operands and result are all\r | |
273 | of the same type. Given operands x and y, the remainder functions return\r | |
274 | the value x - n*y, where n is the integer closest to x/y. If x/y is exactly\r | |
275 | halfway between two integers, n is the even integer closest to x/y. The\r | |
276 | remainder functions are always exact and so require no rounding.\r | |
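
The definition x - n*y differs from C's `%' operator, which truncates the quotient toward zero. The following sketch applies the same rule to integers so that the choice of n is easy to follow; it is an integer analogue written for this illustration (the name `demo_ieee_rem' is hypothetical), not the floating-point algorithm SoftFloat uses. It assumes y is nonzero and that the operands are small enough that doubling them does not overflow.

```c
/* Return x - n*y, where n is the integer nearest to x/y and
   ties go to the even n. Requires y != 0. */
long long demo_ieee_rem(long long x, long long y)
{
    long long q = x / y;                   /* quotient truncated toward zero */
    long long r = x % y;                   /* remainder with the sign of x */
    long long abs_r = r < 0 ? -r : r;
    long long abs_y = y < 0 ? -y : y;
    int past_half = 2 * abs_r > abs_y;     /* nearest n is one step further */
    int tie_odd = (2 * abs_r == abs_y) && (q % 2 != 0);

    if (past_half || tie_odd) {
        /* Move n one step away from zero, which shifts r by y. */
        if ((r < 0) == (y < 0))
            r -= y;
        else
            r += y;
    }
    return r;
}
```

For example, with x = 5 and y = 3 the nearest integer to 5/3 is n = 2, so the result is 5 - 6 = -1, whereas `5 % 3' in C is 2.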
277 | \r | |
278 | Depending on the relative magnitudes of the operands, the remainder\r | |
279 | functions can take considerably longer to execute than the other SoftFloat\r | |
280 | functions. This is inherent in the remainder operation itself and is not a\r | |
281 | flaw in the SoftFloat implementation.\r | |
282 | \r | |
283 | - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -\r | |
284 | Round-to-Integer Functions\r | |
285 | \r | |
286 | For each format, SoftFloat implements the round-to-integer function\r | |
287 | specified by the IEC/IEEE Standard. The functions are:\r | |
288 | \r | |
289 | float32_round_to_int\r | |
290 | float64_round_to_int\r | |
291 | floatx80_round_to_int\r | |
292 | float128_round_to_int\r | |
293 | \r | |
294 | Each function takes a single floating-point operand and returns a result of\r | |
295 | the same type. (Note that the result is not an integer type.) The operand\r | |
296 | is rounded to an exact integer according to the current rounding mode, and\r | |
297 | the resulting integer value is returned in the same floating-point format.\r | |
298 | \r | |
299 | - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -\r | |
300 | Comparison Functions\r | |
301 | \r | |
302 | The following floating-point comparison functions are provided:\r | |
303 | \r | |
304 | float32_eq float32_le float32_lt\r | |
305 | float64_eq float64_le float64_lt\r | |
306 | floatx80_eq floatx80_le floatx80_lt\r | |
307 | float128_eq float128_le float128_lt\r | |
308 | \r | |
309 | Each function takes two operands of the same type and returns a 1 or 0\r | |
310 | representing either _true_ or _false_. The abbreviation `eq' stands for\r | |
311 | ``equal'' (=); `le' stands for ``less than or equal'' (<=); and `lt' stands\r | |
312 | for ``less than'' (<).\r | |
313 | \r | |
314 | The standard greater-than (>), greater-than-or-equal (>=), and not-equal\r | |
315 | (!=) functions are easily obtained using the functions provided. The\r | |
316 | not-equal function is just the logical complement of the equal function.\r | |
317 | The greater-than-or-equal function is identical to the less-than-or-equal\r | |
318 | function with the operands reversed; and the greater-than function can be\r | |
319 | obtained from the less-than function in the same way.\r | |
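
The derivations described above can be written out directly. In this sketch the three base functions are stand-ins built on the host's native `float' comparisons, and all names are hypothetical; with SoftFloat the same wrappers would be built on the `eq', `le', and `lt' functions for the chosen format.

```c
/* Stand-ins for the three provided comparisons (native float here). */
int demo_eq(float a, float b) { return a == b; }
int demo_le(float a, float b) { return a <= b; }
int demo_lt(float a, float b) { return a < b; }

/* not-equal: logical complement of equal */
int demo_ne(float a, float b) { return !demo_eq(a, b); }

/* greater-than-or-equal: less-than-or-equal with operands reversed */
int demo_ge(float a, float b) { return demo_le(b, a); }

/* greater-than: less-than with operands reversed */
int demo_gt(float a, float b) { return demo_lt(b, a); }
```

Reversing the operands, rather than negating the opposite comparison, matters when an operand is a NaN: every ordered comparison involving a NaN is false, so greater-than is not simply the complement of less-than-or-equal.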
320 | \r | |
321 | The IEC/IEEE Standard specifies that the less-than-or-equal and less-than\r | |
322 | functions raise the invalid exception if either input is any kind of NaN.\r | |
323 | The equal functions, on the other hand, are defined not to raise the invalid\r | |
324 | exception on quiet NaNs. For completeness, SoftFloat provides the following\r | |
325 | additional functions:\r | |
326 | \r | |
327 | float32_eq_signaling float32_le_quiet float32_lt_quiet\r | |
328 | float64_eq_signaling float64_le_quiet float64_lt_quiet\r | |
329 | floatx80_eq_signaling floatx80_le_quiet floatx80_lt_quiet\r | |
330 | float128_eq_signaling float128_le_quiet float128_lt_quiet\r | |
331 | \r | |
332 | The `signaling' equal functions are identical to the standard functions\r | |
333 | except that the invalid exception is raised for any NaN input. Likewise,\r | |
334 | the `quiet' comparison functions are identical to their counterparts except\r | |
335 | that the invalid exception is not raised for quiet NaNs.\r | |
336 | \r | |
337 | - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -\r | |
338 | Signaling NaN Test Functions\r | |
339 | \r | |
340 | The following functions test whether a floating-point value is a signaling\r | |
341 | NaN:\r | |
342 | \r | |
343 | float32_is_signaling_nan\r | |
344 | float64_is_signaling_nan\r | |
345 | floatx80_is_signaling_nan\r | |
346 | float128_is_signaling_nan\r | |
347 | \r | |
348 | The functions take one operand and return 1 if the operand is a signaling\r | |
349 | NaN and 0 otherwise.\r | |
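
As a rough illustration of what such a test involves for single precision, the sketch below examines the 32-bit pattern directly. It assumes the common convention in which a NaN is quiet when the most significant fraction bit is 1 and signaling when that bit is 0; some systems use the opposite convention, so this is not necessarily the test SoftFloat performs on a given target. The function name is hypothetical.

```c
#include <stdint.h>

/* Return 1 if the single-precision bit pattern is a signaling NaN
   under the assumed convention, 0 otherwise. */
int demo_float32_is_signaling_nan(uint32_t a)
{
    uint32_t exponent  = (a >> 23) & 0xFF;   /* 8-bit exponent field */
    uint32_t fraction  = a & 0x007FFFFF;     /* 23-bit fraction field */
    uint32_t quiet_bit = a & 0x00400000;     /* top bit of the fraction */

    /* NaN: exponent all ones and fraction nonzero.
       Signaling (assumed convention): quiet bit clear. */
    return exponent == 0xFF && fraction != 0 && quiet_bit == 0;
}
```

Under this convention the pattern 0x7FA00000 is a signaling NaN, 0x7FC00000 is a quiet NaN, and 0x7F800000 is infinity rather than a NaN.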
350 | \r | |
351 | - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -\r | |
352 | Raise-Exception Function\r | |
353 | \r | |
354 | SoftFloat provides a function for raising floating-point exceptions:\r | |
355 | \r | |
356 | float_raise\r | |
357 | \r | |
358 | The function takes a mask indicating the set of exceptions to raise. No\r | |
359 | result is returned. In addition to setting the specified exception flags,\r | |
360 | this function may cause a trap or abort appropriate for the current system.\r | |
361 | \r | |
362 | - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -\r | |
363 | \r | |
364 | \r | |
365 | -------------------------------------------------------------------------------\r | |
366 | Contact Information\r | |
367 | \r | |
368 | At the time of this writing, the most up-to-date information about\r | |
369 | SoftFloat and the latest release can be found at the Web page `http://\r | |
370 | HTTP.CS.Berkeley.EDU/~jhauser/arithmetic/SoftFloat.html'.\r | |
371 | \r | |
372 | \r |