## Alternative to floating point: "supernormal" numbers and tracking uncertainties

A place to discuss the science of computers and programs, from algorithms to computability.

Formal proofs preferred.

Moderators: phlip, Moderators General, Prelates

arbiteroftruth
Posts: 461
Joined: Wed Sep 21, 2011 3:44 am UTC

### Alternative to floating point: "supernormal" numbers and tracking uncertainties

I’ve been developing an alternative floating point format (conceptually anyway; I haven’t created an actual implementation). It follows a similar structure as the IEEE standard, but is designed to have a symmetric dynamic range (maximum finite value and minimum finite value are reciprocals of each other), to keep track of uncertainties, and to cleanly represent infinities and NaN-like values as naturally occurring extreme points, rather than as explicitly special cases.

There is, in order, a sign bit, an exponent field, a significand field, and an exactness bit. We’ll get to the exactness bit later. For now, just assume that it’s a 0, indicating an exact value. The exponent field is interpreted as a biased integer like the IEEE format, but the bias value is increased by 1. So while a single-precision float in the standard format uses an exponent value of 127 to represent an exponent of 0, this format uses a value of 128 to represent an exponent of 0.

The number +1 then is represented with a sign bit of 0, the highest bit of the exponent field set to 1, and the remaining bits all set to 0. Counting down from there works exactly like the standard format, just with one less bit of precision (we had to borrow one bit from the significand for the exactness bit). You count down through all the exponent values, with an implied leading 1 in front of the significand, until the exponent field is all 0s, at which point there is no more implied leading 1, and the format becomes linear rather than exponential, gradually losing precision as the value approaches 0.

Counting up from 1 works similarly, up until the exponent field is all 1s. The standard format uses the maximum exponent only for infinities and NaNs. My proposal is that this region should mirror the subnormal numbers when the exponent was all 0s, gradually losing precision as the value naturally approaches infinity. To do this, when the exponent is all 1s, interpret the exponent as being the naturally expected value, plus the number of leading 1s in the significand. So the exponent continues to grow, while simultaneously the precision shrinks, exactly mirroring the behavior of the subnormal numbers. Each next power of 2 is reached in half as many steps, until at the extreme, a single step corresponds to traversing infinitely many powers of 2, and you have a code point naturally representing infinity.

For example, following the equivalent of the single precision format:

Code: Select all

`sign  exp         sig         exactness0 11111111 0000000000000000000000 0      = 2^1270 11111111 1000000000000000000000 0      = 2^1280 11111111 1100000000000000000000 0      = 2^1290 11111111 1111111111111000000000 0      = 2^1400 11111111 1111111111111111111111 0      = 2^1491 00000000 0000000000000000000000 0      = Inf`

This mirrors the behavior on the low end:

Code: Select all

`sign  exp         sig         exactness0 00000001 0000000000000000000000 0      = 2^-1270 00000000 1000000000000000000000 0      = 2^-1280 00000000 0100000000000000000000 0      = 2^-1290 00000000 0000000000001000000000 0      = 2^-1400 00000000 0000000000000000000001 0      = 2^-1490 00000000 0000000000000000000000 0      = 0`

Note that in this format, both 0 and Inf are unsigned values. Which brings me to negative numbers. In this format, negative numbers are defined in terms of the 2’s complement of the positive numbers. 0 and Inf naturally take the two code points that are their own negations.
But what about signed 0s, and especially signed infinities? And what about NaNs? Those are dealt with via the exactness bit.

When the exactness bit is 0, everything works as I’ve described above. When the exactness bit is 1, then the degree of uncertainty is represented by the final 0 bit in the code, together with all the subsequent 1s. Hereafter, I’ll refer to a final 0 followed by all 1s as the “uncertainty tail”.

An inexact value is interpreted as having a nominal value equal to whatever is left after discarding the uncertainty tail, and an interval of possible values found by casting the nominal value into an integer type, adding or subtracting 1 from its least significant bit, and casting back into this format. Again, an example using the equivalent of single precision:

Code: Select all

`sign  exp         sig         exactness0 00000000 0000000000000000000001 1     uncertainty tail is the final “011”0 00000000 00000000000000000000         = 0, nominal value0 00000000 00000000000000000001         = 2^-147, maximum value1 11111111 11111111111111111111         = -2^-147, minimum value0 00000000 0000000000000000000001 1     = 0 ± 2^-147`

Signed 0s and signed infinities are represented as inexact values with a maximum or minimum equal to 0 or infinity. For example, the “most precise” representation of +Inf is:
0 11111111 1111111111111111111110 1

With a nominal value of 2148, minimum value of 2147, and maximum value of Inf. There are several different positive infinities, each with a different lower bound on the value. Similarly, there are several positive 0s, each with a different upper bound. In the extreme case, these overlap as a code point that represents the entire range of non-negative numbers:
0 10111111 1111111111111111111111 1 = arbitrary non-negative number

This brings us to NaN-like values, which are the natural result of an uncertainty tail that fills all or nearly all of the bits. The above example is an arbitrary non-negative number, and its negation represents an arbitrary non-positive number. Make the uncertainty tail longer by a single bit, and all that’s left after discarding the uncertainty tail is the sign bit. It will have a nominal value of either 0 or 1, but then the calculation of the max and min will show that the other value is also possible. So these two code points represent a completely arbitrary number of unknown sign. It’s not quite a true NaN, because in principle these code points represent a valid number, we just can’t say anything about its value. I think it might be useful to use one of these points to represent unknown data or a number whose uncertainty has exploded beyond control, and use the other one to represent indeterminate forms like 0/0 or Inf - Inf. They would behave identically in terms of arithmetic, but knowing which one you have might be useful information for debugging.

Make the uncertainty tail one bit longer, and it fills the entire bit field:
0 11111111 1111111111111111111111 1

After discarding the uncertainty tail, there’s no data remaining. Therefore this naturally represents a Null value. Your program was expecting to receive data here, but no data is available. Maybe some instrument failed to report a measurement, or maybe you tried to solve an over-determined set of equations. In some contexts your computation might still recover. For example, when averaging a set of data points, if one of them is a Null, that just means that you have one less data point than you expected, but the average of the data you do have still exists and is easily computable. Perhaps the Null should set a flag in this case, but the computation can still move forward in a meaningful way.

Finally, if the entire bit field is filled with ones, then you can’t even parse the uncertainty tail itself, since there is no “final 0” signaling the start of the tail. The method of interpreting code points in this format has broken down, so this code point represents a true NaN. There is data here, but it’s not interpretable by the format. Maybe it’s an imaginary number, or maybe you tried to convert a vector data type into a scalar. One way or the other, something has gone terribly wrong and the NaN should propagate through all computations.

Some notes:
Negating any value in this format is achieved by splitting off the uncertainty tail, finding the 2s complement of the remaining bits, and reattaching the uncertainty tail. This works correctly for all values, including signed and unsigned 0s and infinities, and Nulls and NaNs.

A rough initial approximation of the function -1/x can be found by simply flipping the sign bit. It is correct to within 12.5%, except when x is Null or NaN, in which case -1/x = x. Flipping the sign bit gives the exact result of -1/x for all 0s, infinities, and powers of 2.

Although this format sacrifices one bit of precision relative to the IEEE standard, it has a greater dynamic range. The low end matches the IEEE standard, because we make up for the lost precision by increasing the exponent bias by 1. The high end exceeds the IEEE standard by assigning “supernormal” values to the code points with all 1s in the exponent field. On that subject, this format specifies an interpretation of every single code point, while still supporting a few reasonable NaN-like values as natural consequences of the format. Lastly, this format is capable of directly tracking uncertainties, with the only real sacrifice being 1 bit of precision.

I’m starting to think through the options for interpreting Boolean comparisons in this format, given that half of the possible values represent intervals rather than exact values, but the idea is sufficiently complete that I thought I’d share it. I welcome all thoughts on the matter.