I proved in 2006 that you don't need to, that in fact the size is bounded by, IIRC, something like 144 bytes, and you need several variables this size. I had correspondence with a local university HoD who put me in touch with Steele, who agreed with my analysis.
[...]
The goal of this algorithm, by the way, is that you can print any floating point number to a decimal string, and then convert the string back to a floating point number, and it will have the EXACT SAME bits as the original one, every time. And also produce in every case the shortest string for which this is true.
Interesting!
I hope you can publish at minimum an outline, because it sounds extremely useful and interesting.
I've investigated the opposite case, parsing decimal number strings, using bignums with a per-limb radix for the fractional part. For example, to exactly parse 32 fractional bits (say, Q31.32) on a 32-bit architecture, I would use five 32-bit limbs, with radixes 2^16, 2^17, 5^11, 5^11, and 5^11. Then 0.5 ULP = 2^-33 = {0,1,0,0,0}, 10^-33 = {0,0,0,0,1}, and for example 0.1 = {6553,78643,9765625,0,0}.
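To make that concrete, here is a minimal C sketch of the five-limb layout (the names and the unsigned __int128 exactness check are mine, and assume a 64-bit compiler that provides __int128): since the limb radixes multiply out to 10^33, the limb vector is simply the numerator of the fraction over 10^33.

    #include <stdint.h>
    #include <stdio.h>

    /* Five limbs, most significant first; radix product = 2^33 * 5^33 = 10^33. */
    static const uint32_t radix[5] = {
        1u << 16, 1u << 17, 48828125u, 48828125u, 48828125u   /* 5^11 */
    };

    /* Interpret the limb vector as numerator/10^33; unsigned __int128 is
       wide enough, since 10^33 < 2^128. Used here only to verify exactness. */
    static unsigned __int128 numerator(const uint32_t limb[5])
    {
        unsigned __int128 n = limb[0];
        for (int i = 1; i < 5; i++)
            n = n * radix[i] + limb[i];
        return n;
    }

    int main(void)
    {
        const uint32_t tenth[5] = {6553, 78643, 9765625, 0, 0};
        unsigned __int128 p = 1;
        for (int i = 0; i < 32; i++)
            p *= 10;                  /* 10^32, i.e. 0.1 * 10^33 */
        printf("0.1 is %s\n", numerator(tenth) == p ? "exact" : "inexact");
        return 0;
    }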
The idea is to be able to describe all nonzero decimal digits in the decimal representation of 0.5 ULP, exactly. In the case of 32 fractional bits, 2^-33 × 10^33 = 116415321826934814453125 = 5^33, exactly; and it is easiest to represent on a 32-bit architecture using three limbs of radix 5^11 each, since 5^11 ≃ 0.7×2^26, leaving enough room for decimal carry/overflow handling in each limb.
Round-towards-zero mode uses the first limb and all but the least significant bit of the second limb. Round-halfway-up mode examines the LSB of the second limb, and if it is set, increments the second limb by one; then it uses the first limb and all but the least significant bit of the second limb. The third to fifth limbs are only needed to obtain an exact representation before rounding.
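In C, that rounding step might look something like this (my own reading of the description above; the carry out of the first limb is my addition, since an all-ones fraction has to ripple into the integer part somewhere):

    /* Round-halfway-up: if the 2^-33 bit (LSB of the second limb) is set,
       increment, then take the 16+16 top bits as the Q31.32 fraction.
       *carry is set if rounding overflows into the integer part. */
    static uint32_t round_half_up(uint32_t limb[5], int *carry)
    {
        *carry = 0;
        if (limb[1] & 1u) {
            if (++limb[1] >= (1u << 17)) {   /* ripple into the first limb */
                limb[1] = 0;
                if (++limb[0] >= (1u << 16)) {
                    limb[0] = 0;
                    *carry = 1;              /* fraction wrapped to zero */
                }
            }
        }
        return (limb[0] << 16) | (limb[1] >> 1);
    }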
Adding two such values together means each limb can only carry by one, so a comparison-subtraction-increment ripple from the least significant limb to the most significant limb suffices. Adding a decimal digit involves multiply-by-up-to-ten-and-add, so a limb can overflow by up to 19×radix; but since the radix is a constant for each limb, that can be handled either by a loop or by division-by-radix-with-remainder.
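The addition part might look like this (a sketch, reusing the radix[] table from the earlier snippet, with limbs stored most significant first and each already below its radix):

    /* a += b: each limb sum stays below 2*radix, so one comparison, one
       subtraction, and a carry of at most one per limb suffice. Returns
       the carry out of the most significant limb. */
    static uint32_t add_limbs(uint32_t a[5], const uint32_t b[5])
    {
        uint32_t carry = 0;
        for (int i = 4; i >= 0; i--) {   /* least significant limb first */
            a[i] += b[i] + carry;
            carry = (a[i] >= radix[i]);
            if (carry)
                a[i] -= radix[i];
        }
        return carry;
    }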
These conversions are also exact, and match e.g. IEEE 754 rules.
Back to the topic at hand, converting floating-point or fixed-point numbers to decimal representation:
The fractional part is often implemented using a multiply-by-10^n, then extracting and subtracting the n decimal digits from the integer part. For example, 1/8 = 0.125, using n=1:
10×1/8 = 10/8 = 1 + 2/8, yielding '1'
10×2/8 = 20/8 = 2 + 4/8, yielding '2'
10×4/8 = 40/8 = 5 + 0/8, yielding '5'
Any further iterations yield '0'
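In C, the digit loop above is just a multiply, a shift, and a mask (a sketch for a fraction num/2^shift, assuming num×10 still fits in 64 bits):

    #include <stdint.h>
    #include <stdio.h>

    /* Print ndigits fractional decimal digits of num/2^shift,
       one digit per multiply-by-ten. */
    static void frac_digits(uint64_t num, unsigned shift, unsigned ndigits)
    {
        putchar('.');
        while (ndigits--) {
            num *= 10;
            putchar('0' + (int)(num >> shift));   /* integer part = next digit */
            num &= (1ull << shift) - 1;           /* keep the remainder */
        }
    }

    /* frac_digits(1, 3, 3) prints ".125", matching the 1/8 example. */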
The "problem" is that if you are to output a specific number of decimal digits, and you end up with remainder that is at least one half (say, if printing 1/8 with just two decimals), you will need to increment the decimal result by one, rippling it possibly up to the previously yielded digits, including into whole decimals.
As an example, consider printing 255/256 using just two decimals. The first step yields '9' with remainder 246, the second '9' with remainder 156. If we are to round halfway up, 156/256 ≥ 1/2, so we need to increment the previous digit. That turns it into '0', with the need to increment the one before it. That too turns into '0', so the integer part needs to be incremented by one as well.
This is not a problem at all if you can convert the fractional part into a string buffer, and do so before constructing the integer part.
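A sketch of that buffered approach, continuing the num/2^shift example (round-halfway-up, so a remainder of exactly one half rounds up here):

    /* Write ndigits fractional digits of num/2^shift into buf, then round
       halfway up, rippling any carry through the digits already written.
       Returns 1 if the carry ripples past the decimal point, i.e. the
       integer part must be incremented (as with 255/256 -> "1.00"). */
    static int frac_to_buf(char *buf, uint64_t num, unsigned shift,
                           unsigned ndigits)
    {
        for (unsigned i = 0; i < ndigits; i++) {
            num *= 10;
            buf[i] = (char)('0' + (num >> shift));
            num &= (1ull << shift) - 1;
        }
        buf[ndigits] = '\0';
        if (num < (1ull << (shift - 1)))    /* remainder < 1/2: no rounding */
            return 0;
        for (int i = (int)ndigits - 1; i >= 0; i--) {
            if (buf[i] != '9') {            /* absorb the carry here */
                buf[i]++;
                return 0;
            }
            buf[i] = '0';                   /* '9' wraps, keep rippling */
        }
        return 1;
    }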
To speed up conversion when the exponent of the floating-point mantissa is small, one can simply use a larger n. Technically, we should have a mantissa with up to 1024 bits in it (for double precision), because each multiplication by 10^n is a shift left by n bits plus a multiplication by 5^n, the latter increasing the width of the nonzero part of the mantissa by almost 2.322×n bits. (It also shifts the MSB of the mantissa up by that amount, so together with the n-bit shift each step eliminates about 3.322×n intervening zero bits.)
For example, if you start with a 53-bit mantissa, and there are say 44 zero bits between the MSB of the mantissa and the decimal point (so the LSB of the mantissa has value 2^-97), you can do n=13: multiply the 53-bit mantissa by 5^13 = 1220703125, a 31-bit unsigned number. Your mantissa is now 84 bits wide, and since we also do a logical shift left by n=13 bits, this gets rid of 44 intervening zero bits in one step. Each such step adds 31 bits to the mantissa while getting rid of 44 intervening zero bits. For double precision, there can be up to 1023 of them, so 23 rounds of this yields an exact mantissa of 766 bits, producing the 299 decimal zero digits into the output buffer.
If one uses a straightforward buffer of 1024 fractional bits, then one can track where the least and most significant set bits are, and only operate on the related limbs, implementing at least a couple of different n steps (with n=1 mandatory).
Another option is to just convert the fractional part to radix-10^n limbs, and discard/ignore the zero ones (as they represent n consecutive zeroes in the decimal representation).
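That is the same digit loop as before, just pulling nine digits per step (a sketch; shift must be at most 34 here so that num×10^9 still fits in 64 bits):

    #include <inttypes.h>
    #include <stdio.h>

    /* Extract fractional digits of num/2^shift in radix-10^9 limbs;
       an all-zero limb stands for nine consecutive decimal zeroes and
       can be skipped or merely counted. */
    static void frac_limbs9(uint64_t num, unsigned shift, unsigned nlimbs)
    {
        while (nlimbs--) {
            num *= 1000000000ull;                 /* 10^9 */
            uint32_t digits9 = (uint32_t)(num >> shift);
            if (digits9)
                printf("%09" PRIu32, digits9);    /* nine digits at once */
            /* else: note nine zeroes without printing them one by one */
            num &= (1ull << shift) - 1;
        }
    }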
Also, if you use floating-point types to implement this, do note that the obtained representation is not exact when the exponent is negative enough; i.e. when there are zero bits between the top of the mantissa and the decimal point. This is because each multiply-by-10^n generates a result that is 2.322×n bits wider than the original mantissa, and when using floating point, you'll discard those extra bits.
Of course, none of this is as smart as what Bruce described, because this only generates the desired number of decimals, and does not know how many suffice. (And an implementation based on this would treat "%f" as "%.6f".) But it does kinda-sorta prove that if you don't mind being somewhat slow, you can definitely do it for double precision floating-point with something like 256 bytes of RAM plus a char array large enough to hold the entire result.
For single precision (AKA float), something like 40 bytes of RAM plus a char array large enough should suffice.
The more ROM/Flash you can spare for power-of-ten tables and similar, the faster you can make it.
I've elaborated somewhat on related ideas in this thread I started a year ago; some of you might find it interesting. My key point there is that the *printf() interface in C is actually not that well suited for many tasks, and I've been investigating alternatives, especially those useful for systems programming and memory-constrained embedded development. In particular, I've found that I usually do have a buffer I want to put the result into, and snprintf() makes that annoying/difficult, because I have to track the fill point very carefully myself. For thread-safety and re-entrancy, any temporary memory needed should be provided by the caller, which also means the requirements must be known at compile time.