I proved in 2006 that you don't need to, that in fact the size is bounded by, IIRC, something like 144 bytes, and you need several variables this size. I had correspondence with a local university HoD who put me in touch with Steele, who agreed with my analysis.
[...]
The goal of this algorithm, by the way, is that you can print any floating point number to a decimal string, and then convert the string back to a floating point number, and it will have the EXACT SAME bits as the original one, every time. And also produce in every case the shortest string for which this is true.
Interesting!
I hope you can publish at minimum an outline, because it sounds extremely useful and interesting.
I've investigated the opposite case, parsing decimal number strings, using bignums with a per-limb radix for the fractional part. For example, to exactly parse 32 fractional bits (say, Q31.32) on a 32-bit architecture, I would use five 32-bit limbs, with radixes 2^16, 2^17, 5^11, 5^11, and 5^11. Then 0.5 ULP = 2^-33 = {0,1,0,0,0}, 10^-33 = {0,0,0,0,1}, and for example 0.1 = {6553,78643,9765625,0,0}.
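To make that concrete, here is a minimal C sketch of the five-limb layout (the names and the unsigned __int128 exactness check are mine, and assume a 64-bit compiler that provides __int128): since the limb radixes multiply out to 10^33, the limb vector is simply the numerator of the fraction over 10^33.

    #include <stdint.h>
    #include <stdio.h>

    /* Five limbs, most significant first; radix product = 2^33 * 5^33 = 10^33. */
    static const uint32_t radix[5] = {
        1u << 16, 1u << 17, 48828125u, 48828125u, 48828125u   /* 5^11 */
    };

    /* Interpret the limb vector as numerator/10^33; unsigned __int128 is
       wide enough, since 10^33 < 2^128. Used here only to verify exactness. */
    static unsigned __int128 numerator(const uint32_t limb[5])
    {
        unsigned __int128 n = limb[0];
        for (int i = 1; i < 5; i++)
            n = n * radix[i] + limb[i];
        return n;
    }

    int main(void)
    {
        const uint32_t tenth[5] = {6553, 78643, 9765625, 0, 0};
        unsigned __int128 p = 1;
        for (int i = 0; i < 32; i++)
            p *= 10;                  /* 10^32, i.e. 0.1 * 10^33 */
        printf("0.1 is %s\n", numerator(tenth) == p ? "exact" : "inexact");
        return 0;
    }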
The idea is to be able to describe all nonzero decimal digits in the decimal representation of 0.5 ULP, exactly. In the case of 32 fractional bits, 2^-33 × 10^33 = 116415321826934814453125 = 5^33, exactly; and it is easiest to represent on a 32-bit architecture using three limbs of radix 5^11 each, since 5^11 ≃ 0.7×2^26, leaving enough room for decimal carry/overflow handling in each limb.
Round-towards-zero mode uses the first limb and all but the least significant bit of the second limb. Round-halfway-up mode examines the LSB of the second limb, and if it is set, increments the second limb by one; then it uses the first limb and all but the least significant bit of the second limb. The third to fifth limbs are only needed to obtain an exact representation before rounding.
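In C, that rounding step might look something like this (my own reading of the description above; the carry out of the first limb is my addition, since an all-ones fraction has to ripple into the integer part somewhere):

    /* Round-halfway-up: if the 2^-33 bit (LSB of the second limb) is set,
       increment, then take the 16+16 top bits as the Q31.32 fraction.
       *carry is set if rounding overflows into the integer part. */
    static uint32_t round_half_up(uint32_t limb[5], int *carry)
    {
        *carry = 0;
        if (limb[1] & 1u) {
            if (++limb[1] >= (1u << 17)) {   /* ripple into the first limb */
                limb[1] = 0;
                if (++limb[0] >= (1u << 16)) {
                    limb[0] = 0;
                    *carry = 1;              /* fraction wrapped to zero */
                }
            }
        }
        return (limb[0] << 16) | (limb[1] >> 1);
    }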
Adding two such values together means each limb can only carry by one, so a comparison-subtraction-increment ripple from the least significant limb to the most significant limb suffices. Adding a decimal digit involves multiply-by-up-to-ten-and-add, so a limb can overflow by up to 19×radix; but since the radix is a constant for each limb, that can be handled either by a loop or by division-by-radix-with-remainder.
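The addition part might look like this (a sketch, reusing the radix[] table from the earlier snippet, with limbs stored most significant first and each already below its radix):

    /* a += b: each limb sum stays below 2*radix, so one comparison, one
       subtraction, and a carry of at most one per limb suffice. Returns
       the carry out of the most significant limb. */
    static uint32_t add_limbs(uint32_t a[5], const uint32_t b[5])
    {
        uint32_t carry = 0;
        for (int i = 4; i >= 0; i--) {   /* least significant limb first */
            a[i] += b[i] + carry;
            carry = (a[i] >= radix[i]);
            if (carry)
                a[i] -= radix[i];
        }
        return carry;
    }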
These conversions are also exact, and match e.g. IEEE 754 rules.
Back to the topic at hand, converting floating-point or fixed-point numbers to decimal representation:
The fractional part is often implemented using a multiply-by-10^n, then extracting and subtracting the n decimal digits from the integer part. For example, 1/8 = 0.125, using n=1:
10×1/8 = 10/8 = 1 + 2/8, yielding '1'
10×2/8 = 20/8 = 2 + 4/8, yielding '2'
10×4/8 = 40/8 = 5 + 0/8, yielding '5'
Any further iterations yield '0'
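In C, the digit loop above is just a multiply, a shift, and a mask (a sketch for a fraction num/2^shift, assuming num×10 still fits in 64 bits):

    #include <stdint.h>
    #include <stdio.h>

    /* Print ndigits fractional decimal digits of num/2^shift,
       one digit per multiply-by-ten. */
    static void frac_digits(uint64_t num, unsigned shift, unsigned ndigits)
    {
        putchar('.');
        while (ndigits--) {
            num *= 10;
            putchar('0' + (int)(num >> shift));   /* integer part = next digit */
            num &= (1ull << shift) - 1;           /* keep the remainder */
        }
    }

    /* frac_digits(1, 3, 3) prints ".125", matching the 1/8 example. */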
The "problem" is that if you are to output a specific number of decimal digits, and you end up with remainder that is at least one half (say, if printing 1/8 with just two decimals), you will need to increment the decimal result by one, rippling it possibly up to the previously yielded digits, including into whole decimals.
As an example, consider printing 255/256 using just two decimals. The first step yields '9' with remainder 246, the second '9' with remainder 156. If we are to round halfway up, 156/256 ≥ 1/2, so we need to increment the previous digit. That turns it into '0', with the need to increment the one before it. That too turns into '0', so the integer part needs to be incremented by one as well.
This is not a problem at all if you can convert the fractional part into a string buffer, and do so before constructing the integer part.
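A sketch of that buffered approach, continuing the num/2^shift example (round-halfway-up, so a remainder of exactly one half rounds up here):

    /* Write ndigits fractional digits of num/2^shift into buf, then round
       halfway up, rippling any carry through the digits already written.
       Returns 1 if the carry ripples past the decimal point, i.e. the
       integer part must be incremented (as with 255/256 -> "1.00"). */
    static int frac_to_buf(char *buf, uint64_t num, unsigned shift,
                           unsigned ndigits)
    {
        for (unsigned i = 0; i < ndigits; i++) {
            num *= 10;
            buf[i] = (char)('0' + (num >> shift));
            num &= (1ull << shift) - 1;
        }
        buf[ndigits] = '\0';
        if (num < (1ull << (shift - 1)))    /* remainder < 1/2: no rounding */
            return 0;
        for (int i = (int)ndigits - 1; i >= 0; i--) {
            if (buf[i] != '9') {            /* absorb the carry here */
                buf[i]++;
                return 0;
            }
            buf[i] = '0';                   /* '9' wraps, keep rippling */
        }
        return 1;
    }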
To speed up conversion when the exponent of the floating-point mantissa is small, one can simply use a larger n. Technically, we should have a mantissa with up to 1024 bits in it (for double precision), because each multiplication by 10^n is a shift left by n bits plus a multiplication by 5^n, the latter increasing the width of the nonzero part of the mantissa by almost 2.322×n bits. (It also shifts the MSB of the mantissa up by that amount, so together with the n-bit shift each step eliminates about 3.322×n intervening zero bits.)
For example, if you start with a 53-bit mantissa, and there are say 44 zero bits between the MSB of the mantissa and the decimal point (so the LSB of the mantissa has value 2^-97), you can do n=13: multiply the 53-bit mantissa by 5^13 = 1220703125, a 31-bit unsigned number. Your mantissa is now 84 bits wide, and since we also do a logical shift left by n=13 bits, this gets rid of 44 intervening zero bits in one step. Each such step adds 31 bits to the mantissa while getting rid of 44 intervening zero bits. For double precision, there can be up to 1023 of them, so 23 rounds of this yields an exact mantissa of 766 bits, producing the 299 decimal zero digits into the output buffer.
If one uses a straightforward buffer of 1024 fractional bits, then one can track where the least and most significant set bits are, and only operate on the related limbs, implementing at least a couple of different n steps (with n=1 mandatory).
Another option is to just convert the fractional part to radix-10^n limbs, and discard/ignore the zero ones (as they represent n consecutive zeroes in the decimal representation).
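That is the same digit loop as before, just pulling nine digits per step (a sketch; shift must be at most 34 here so that num×10^9 still fits in 64 bits):

    #include <inttypes.h>
    #include <stdio.h>

    /* Extract fractional digits of num/2^shift in radix-10^9 limbs;
       an all-zero limb stands for nine consecutive decimal zeroes and
       can be skipped or merely counted. */
    static void frac_limbs9(uint64_t num, unsigned shift, unsigned nlimbs)
    {
        while (nlimbs--) {
            num *= 1000000000ull;                 /* 10^9 */
            uint32_t digits9 = (uint32_t)(num >> shift);
            if (digits9)
                printf("%09" PRIu32, digits9);    /* nine digits at once */
            /* else: note nine zeroes without printing them one by one */
            num &= (1ull << shift) - 1;
        }
    }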
Also, if you use floating-point types to implement this, do note that the obtained representation is not exact when the exponent is negative enough; i.e. when there are zero bits between the top of the mantissa and the decimal point. This is because each multiply-by-10^n generates a result that is 2.322×n bits wider than the original mantissa, and when using floating point, you'll discard those extra bits.
Of course, none of this is as smart as what Bruce described, because this only generates the desired number of decimals, and does not know how many suffice. (And an implementation based on this would treat "%f" as "%.6f".) But it does kinda-sorta prove that if you don't mind being somewhat slow, you can definitely do it for double precision floating-point with something like 256 bytes of RAM plus a char array large enough to hold the entire result.
For single precision (AKA float), something like 40 bytes of RAM plus a char array large enough should suffice.
The more ROM/Flash you can spare for power-of-ten tables and similar, the faster you can make it.
I've elaborated somewhat on related ideas in this thread I started a year ago; some of you might find it interesting. My key point there is that the *printf() interface in C is actually not that well suited for many tasks, and I've been investigating alternatives, especially those useful for systems programming and memory-constrained embedded development. In particular, I've found that I usually do have a buffer I want to put the result into, and snprintf() makes that annoying/difficult, because I have to track the fill point very carefully myself. For thread-safety and re-entrancy, any temporary memory needed should be provided by the caller, which also means the requirements must be known at compile time.