Ddoc

$(D_S $(TITLE),

$(H2 Introduction)

$(P $(I by Don Clugston))

$(P Computers were originally conceived as devices for performing mathematics. The earliest computers spent most of their time solving equations. Although the engineering and scientific community now forms only a minuscule part of the computing world, there is a fantastic legacy from those former times: almost all computers now feature superb hardware for performing mathematical calculations accurately and extremely quickly. Sadly, most programming languages make it difficult for programmers to take full advantage of this hardware. An even bigger problem is the lack of documentation; even for many mathematical programmers, aspects of floating-point arithmetic remain shrouded in mystery.)

$(P As a systems programming language, the D programming language attempts to remove all barriers between the programmer and the compiler, and between the programmer and the machine. This philosophy is particularly evident in its support for floating-point arithmetic. A personal anecdote may illustrate the importance of having an accurate understanding of the hardware.)

$(P My first floating-point nightmare occurred in a C++ program which hung once in every hundred runs or so. I eventually traced the problem to a while loop which occasionally failed to terminate. The essence of the code is shown in Listing 1.)

------
double q[8];
...
int x = 0;
while (x < 8)
{
    if (q[x] >= 0) return true;
    if (q[x] < 0) ++x;
}
return false;
------

$(P Initially, I was completely baffled as to how this harmless-looking loop could fail. But eventually, I discovered that q had not been initialized properly; q[7] contained random garbage. Occasionally, that garbage had every bit set, which meant that q[7] was a Not-a-Number (NaN), a special code which indicates that the value of the variable is nonsense.
NaNs were not mentioned in the compiler's documentation; the only information I could find about them was in Intel's assembly instruction set documentation! Any ordered comparison involving a NaN is false, so q[7] was neither >= 0 nor < 0, and the loop never terminated. Until that unwelcome discovery, I'd been unaware that NaNs even existed. I had lived in a fool's paradise, mistakenly believing that every floating-point number was either positive, negative, or zero.)

$(P My experience would have been quite different in D. The "strange" features of floating point have a higher visibility in the language, improving the education of numerical programmers. Uninitialized floating-point numbers are initialized to NaN by the compiler, so the problematic loop would fail every time, not intermittently. Numerical programmers in D will generally run their programs with the 'invalid' floating-point exception enabled; under those circumstances, as soon as the program accessed the uninitialized variable, a hardware exception would occur, summoning the debugger. Easy access to the "strange" features of floating point results in better-educated programmers, reduced confusion, faster debugging, better testing and, hopefully, more reliable and correct numerical programs. This article provides a brief overview of the support for floating point in the D programming language.)

$(H2 Demystifying Floating-Point)

$(P D guarantees that all built-in floating-point types conform to IEEE 754 arithmetic, making behaviour entirely predictable (note that this is $(I not) the same as producing identical results on all platforms). IEEE 754-2008 is the latest revision of the IEEE 754 Standard for Floating-Point Arithmetic. D is progressing towards full compliance with 754-2008.)

$(P The IEEE standard floating-point types currently supported by D are $(D float) and $(D double).
Additionally, D supports the $(D real) type, which is 'IEEE 80-bit extended' if supported by the CPU; otherwise, it is the same as $(D double). In the future, the new types from 754-2008 will be added: $(D quadruple), $(D decimal64), and $(D decimal128).)

$(P The characteristics of these types are easily accessible in the language via $(I properties). For example, $(D float.max) is the maximum value which can be stored in a $(D float); $(D float.mant_dig) is the number of digits (bits) stored in the mantissa.)

$(P To make sense of mathematics in D, it's necessary to have a basic understanding of IEEE floating-point arithmetic. Fundamentally, it is a mapping of the infinitely many real numbers onto a small number of bytes. Only about 4000 million distinct numbers are representable as an IEEE 32-bit float. Even with such a pathetically small representation space, IEEE floating-point arithmetic does a remarkably good job of maintaining the illusion that mathematical real numbers are being used; but it's important to understand when the illusion breaks down.)

$(P Most problems arise from the distribution of these representable numbers. The IEEE number line is quite different to the mathematical number line.)

---
    +      +-----------+------------+ .. + .. +----------+----------+      +      #
-infinity  -float.max  -1  -float.min_normal  0  float.min_normal  1  float.max  infinity  NaN
---

$(P Notice that half of the IEEE number line lies between -1 and +1. There are 1000 million representable floats between 0 and 0.5, but only 8 million between 0.5 and 1. This has important implications for accuracy: the effective precision is incredibly high near zero. Several examples will be presented where numbers in the range -1 to +1 are treated separately to take advantage of this.)
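$(P The density claim above can be checked directly: for positive IEEE floats, consecutive representable values have consecutive bit patterns when viewed as an unsigned integer, so integer subtraction counts the numbers in between. A minimal sketch (the pointer-cast reinterpretation is one common idiom, not code from the listing above):)

---
import std.stdio;

void main()
{
    float half = 0.5f;
    float one  = 1.0f;

    // Reinterpret the bits of each float as a 32-bit unsigned integer.
    uint halfBits = *cast(uint*) &half;
    uint oneBits  = *cast(uint*) &one;

    // Counting by bit pattern: roughly 1000 million floats below 0.5,
    // but only 8 million between 0.5 and 1.
    writefln("floats in (0, 0.5): %s", halfBits - 1);
    writefln("floats in [0.5, 1): %s", oneBits - halfBits);
}
---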
$(P Notice also the special numbers: $(PLUSMNINF); the so-called "subnormals" between $(PLUSMN)float.min_normal and 0, which are represented at reduced precision; the fact that there are TWO zeros, +0 and -0; and finally "NaN" ("Not-a-Number"), the nonsense value, which caused so much grief in Listing 1.)

$(P Why does NaN exist? It serves a valuable role: it $(I eradicates undefined behaviour) from floating-point arithmetic. This makes floating-point completely predictable. Unlike the $(D int) type, where 3/0 invokes a hardware division-by-zero trap handler, possibly ending your program, the floating-point division 3.0/0.0 results in $(INFIN). Numeric overflow (e.g., $(D real.max*2)) also creates $(INFIN). Depending on the application, $(INFIN) may be a perfectly valid result; more typically, it indicates an error. Nonsensical operations, such as $(D 0.0 / 0.0), result in NaN; but $(I your program does not lose control). At first glance, infinity and NaN may appear unnecessary -- why not just make it an error, just as in the integer case? After all, it is easy to avoid division by zero, simply by checking for a zero denominator before every division. The real difficulty comes from overflow: it is extremely difficult to determine in advance whether a multiplication will overflow.)

$(P Subnormals are necessary to prevent certain anomalies, and to preserve important relationships such as: "x - y == 0 if and only if x == y".)

$(P Since $(INFIN) can be produced by overflow, both +$(INFIN) and -$(INFIN) are required. Both +0 and -0 are required in order to preserve identities such as: if $(D x>0), then $(D 1/(1/x) > 0). In almost all other cases, however, there is no difference between +0 and -0.)

$(P It's worth noting that operations on these $(SINGLEQUOTE special values) are usually not very efficient. On x86 machines, for example, a multiplication involving a NaN, an infinity, or a subnormal can be twenty to fifty times slower than an operation on normal numbers.
If your numerical code is unexpectedly slow, it's possible that you are inadvertently creating many of these special values. Enabling floating-point exception trapping, described later, is a quick way to confirm this.)

$(P One of the biggest factors obscuring what the machine is doing is the conversion between binary and decimal. You can eliminate this by using the $(D "%a") format when displaying results. This is an invaluable debugging tool, and an enormously helpful aid when developing floating-point algorithms. The $(D 0x1.23Ap+6) hexadecimal floating-point format can also be used in source code for ensuring that your input data is $(I exactly) what you intended.)

$(H2 The Quantized Nature of Floating-Point)

$(P The fact that the possible values are limited gives access to some operations which are not possible on mathematical real numbers. Given a number x, $(D nextUp(x)) gives the next representable number which is greater than x. $(D nextDown(x)) gives the next representable number which is less than x.)

$(P Numerical analysts often describe errors in terms of "units in the last place" (ulp), a surprisingly subtle term which is often used rather carelessly. [Footnote: The most formal definition is found in [J.-M. Muller, "On the definition of ulp(x)", INRIA Technical Report 5504 (2005)]: If $(D x) is a real number that lies between two finite consecutive floating-point numbers a and b of type F, without being equal to one of them, then ulp(x) = abs(b-a); otherwise ulp(x) = $(D x*F.epsilon). Moreover, ulp(NaN) is NaN, and ulp($(PLUSMN)F.infinity) = $(PLUSMN)$(D F.max*F.epsilon).] I prefer a far simpler definition: the difference in ulps between two numbers x and y is the number of times you need to call nextUp() or nextDown() to move from x to y. [Footnote: This will not be an integer if either x or y is a real number, rather than a floating-point number.]
The D library function $(D feqrel(x, y)) gives the number of bits which are equal between x and y; it is an easy way to check for loss-of-precision problems.)

$(P The quantized nature of floating point has some interesting consequences.)

$(UL
$(LI ANY mathematical range [a,b$(RPAREN), $(LPAREN)a,b], or (a,b) can be converted into a range of the form [a,b]. (The converse does not apply: there is no (a,b) equivalent to [-$(INFIN), $(INFIN)]).)
$(LI A naive binary chop doesn't work correctly. The fact that there are hundreds or thousands of times as many representable numbers between 0 and 1 as there are between 1 and 2 is problematic for divide-and-conquer algorithms. A naive binary chop would divide the interval [0 .. 2] into [0 .. 1] and [1 .. 2]. Unfortunately, this is not a true binary chop, because the interval [0 .. 1] contains more than 99% of the representable numbers from the original interval!)
)

$(H2 Condition number)

$(P Using nextUp, it's easy to approximately calculate the condition number.)

---
real x = 0x1.1p13L;
real u = nextUp(x);
int bitslost = feqrel(x, u) - feqrel(exp(x), exp(u));
---

$(P This shows that at these huge values of x, a one-bit error in x destroys 12 bits of accuracy in exp(x)! The error has increased by roughly 6000 units in the last place. The condition number is thus 6000 at this value of x.)

$(H2 The semantics of float, double, and real)

$(P For the x86 machines which dominate the market, floating point has traditionally been performed on a descendant of the 8087 math coprocessor. These "x87" floating-point units were the first to implement IEEE 754 arithmetic. The SSE2 instruction set is an alternative for x86-64 processors, but x87 remains the only portable option for floating point on 32-bit x86 machines (no 32-bit AMD processors support SSE2).)

$(P The x87 is unusual compared to most other floating-point units. It $(I only) supports 80-bit operands, henceforth termed "real80".
All $(D double) and $(D float) operands are first converted to 80-bit, all arithmetic operations are performed at 80-bit precision, and the results are reduced to 64-bit or 32-bit precision if required. This means that the results can be significantly more accurate than on a machine which supports at most 64-bit operations. However, it also poses challenges for writing portable code. (Footnote: The x87 allows you to reduce the mantissa length to be the same as $(D double) or $(D float), but it retains the real80 exponent, which means different results are obtained with subnormal numbers. Precisely emulating $(D double) arithmetic slows down floating-point code by an order of magnitude.))

$(P Apart from the x87 family, the Motorola 68K (but not ColdFire) and Itanium processors are the only ones which support 80-bit floating point.)

$(P A similar issue relates to the FMA (fused multiply and accumulate) instruction, which is available on an increasing number of processors, including PowerPC, Itanium, Sparc, and Cell. On such processors, when evaluating expressions such as $(D x*y + z), the $(D x*y) is performed at twice the normal precision. Some calculations which would otherwise cause a total loss of precision are instead calculated exactly. The challenge for a high-level systems programming language is to create an abstraction which provides predictable behaviour on all platforms, but which nonetheless makes good use of the available hardware.)

$(P D's approach to this situation arises from the following observations:)

$(OL
$(LI It is extremely costly performance-wise to ensure identical behaviour on all processors. In particular, it is crippling for the x87.)
$(LI Many programs will only ever run on a particular processor. It would be unfortunate to deny the use of more accurate hardware for the sake of portability which would never be required.)
$(LI Portability and high precision are never required simultaneously.
If $(D double) precision is inadequate, increasing the precision on only some processors doesn't help.)
$(LI The language should not be tied to particular features of specific processors.)
)

$(P A key design goal is: it should be possible to write code such that, regardless of the processor which is used, the accuracy is never worse than would be obtained on a system which only supports the $(D double) type.)

$(P (Footnote: $(D real) is close to $(SINGLEQUOTE indigenous) in the Borneo proposal for the Java programming language [Ref Borneo]).)

$(P Consider evaluating $(D x*y + z*w), where $(D x), $(D y), $(D z) and $(D w) are $(D double).)

$(OL
$(LI $(D double r1 = x * y + z * w;))
$(LI $(D double a = x * y; double r2 = a + z * w;))
$(LI $(D real b = x * y; double r3 = b + z * w;))
)

$(P Note that during optimisation, (2) and (3) may be transformed into (1), but this is implementation-dependent. Case (2) is particularly problematic, because it introduces an additional rounding.)

$(P On a "simple" CPU, r1==r2==r3. We will call this value r0. On PowerPC, r2==r3, but r1 may be more accurate than the others, since it enables use of FMA. On x86, r1==r3, which may be more accurate than r0, though not as much as for the PowerPC case. r2, however, may be LESS accurate than r0.)

$(P By using $(D real) for intermediate values, we are guaranteed that our results are never worse than for a simple CPU which only supports $(D double).)

$(H2 Properties of the Built-in Types)

$(P The fundamental floating-point properties are $(D epsilon), $(D min_normal) and $(D max). The six integral properties are simply the log2 or log10 of these three.)
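$(P For instance, the integral properties can be derived from the fundamental ones. A minimal sketch (the specific assertions here are illustrative, using D's $(D ^^) power operator):)

---
import std.math;
import std.stdio;

void main()
{
    // epsilon follows from mant_dig:  epsilon == 2 ^^ (1 - mant_dig)
    assert(double.epsilon == 2.0 ^^ (1 - double.mant_dig));

    // min_normal follows from min_exp:  min_normal == 2 ^^ (min_exp - 1)
    assert(double.min_normal == 2.0 ^^ (double.min_exp - 1));

    // The log10-based properties give the decimal view of the same limits.
    writefln("double.dig = %s, double.max_10_exp = %s",
             double.dig, double.max_10_exp);  // 15 digits, max is ~1e308
}
---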
$(TABLE1
$(TR $(TH ) $(TH float) $(TH double) $(TH real80) $(TH quadruple) $(TH decimal64) $(TH decimal128))
$(TR $(TD epsilon) $(TD 0x1p-23) $(TD 0x1p-52) $(TD 0x1p-63) $(TD 0x1p-112) $(TD 1e-16 (1p-54)) $(TD 1e-34 (1p-113)))
$(TR $(TD [min_normal) $(TD 0x1p-126) $(TD 0x1p-1022) $(TD 0x1p-16382) $(TD 0x1p-16382) $(TD 1e-383) $(TD 1e-6143))
$(TR $(TD ..max$(RPAREN)) $(TD 0x1p+128) $(TD 0x1p+1024) $(TD 0x1p+16384) $(TD 0x1p+16384) $(TD 1e+385) $(TD 1e+6145))
$(TR