I forgot to mention that even my example fails if the wide string contains combining diacriticals, because the wprintf() interface counts characters, not glyphs.
For example, the string L"åa" == L"\u0061\u030A\u0061" is counted as three characters, because it is three: the middle one is U+030A COMBINING RING ABOVE.
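For the curious, a minimal demo of that count (my own sketch; assumes a UTF-8 locale):

    #include <locale.h>
    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        setlocale(LC_ALL, "");
        const wchar_t *s = L"a\u030Aa";  /* "åa": 'a', U+030A COMBINING RING ABOVE, 'a' */
        /* wcslen() counts wide characters, not glyphs: this prints 3,
           even though the string occupies only two cursor positions. */
        wprintf(L"wcslen = %zu for \"%ls\"\n", wcslen(s), s);
        return 0;
    }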
But if you want to deal with UTF-8, you're out of luck. Aren't you?
What do you mean? (I suspect I agree, in that the C standard and POSIX haven't kept up, but I'm not sure that is what you refer to.)
In POSIXy systems, after the setlocale(LC_ALL, "") call (or setlocale(LC_CTYPE, "") if you are only interested in strings and character sets, and not the other locale features like date and time formatting), you can use nl_langinfo(CODESET) to obtain an immutable string naming the current character set. If it matches "UTF-8", then your multibyte strings (mbs*(), *mbs()) are in UTF-8. Otherwise, you can use iconv_open() with that string to convert to either wide characters or to UTF-8, or vice versa.
If you want to know how many characters there are in a multibyte string, just use len = mbstowcs(NULL, string, 0).
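For example (again assuming the locale has been set up as above):

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        setlocale(LC_ALL, "");
        /* "åa" as raw UTF-8 bytes: 'a', U+030A (0xCC 0x8A), 'a' */
        const char *string = "a\xCC\x8A" "a";
        size_t len = mbstowcs(NULL, string, 0);  /* count only, no output */
        if (len == (size_t)-1)
            fprintf(stderr, "Invalid multibyte sequence in this locale.\n");
        else
            printf("%zu characters\n", len);     /* 3 in a UTF-8 locale */
        return 0;
    }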
Unfortunately, there is no multibyte equivalent of printf()/wprintf() (it would be mbprintf(), I guess), so there are no standard C stream interfaces that count output or input in multibyte characters instead of chars.
Thus, the wide character output functions (wprintf(), fwprintf(), swprintf(), vswprintf() et al.) are the only ones we can use in standard C (since C99) or POSIX C to format output correctly. (For terminal applications, the ncursesw library also works.)
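As a sketch of what that looks like in practice (field widths in the wide functions are measured in wide characters):

    #include <locale.h>
    #include <wchar.h>

    int main(void)
    {
        setlocale(LC_ALL, "");
        /* The %-8ls field width counts wide characters; note that
           combining marks would still each count as one character here. */
        wprintf(L"|%-8ls|%ls|\n", L"name", L"value");
        return 0;
    }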
Now, I am also a proponent of UTF-8 Everywhere, simply because it is the one Unicode encoding that works everywhere without new or additional requirements, and is backwards-compatible with ASCII. I personally believe that when using e.g. the GNU standard C library, the source character set should always be UTF-8. (IIRC, GCC et al. already use iconv facilities to translate from the current locale to UTF-8.)
I also think it is perfectly reasonable to say that a particular program or utility only works in locales using the UTF-8 character set.
In Linux and Mac OS X, sizeof (wchar_t) == 4, and if nl_langinfo(CODESET) reports "UTF-8", then wchar_t is UTF-32 (which is UCS-4 compatible). So, in Linux and Mac OS X, it is reasonable to require a UTF-8-based locale, so that wide characters are UTF-32/UCS-4/Unicode.
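A program relying on that can verify it up front; a minimal check (my own sketch):

    #include <locale.h>
    #include <langinfo.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <wchar.h>

    int main(void)
    {
        setlocale(LC_ALL, "");
        if (sizeof (wchar_t) < 4 || strcmp(nl_langinfo(CODESET), "UTF-8") != 0) {
            fprintf(stderr, "This program requires a UTF-8 locale.\n");
            return EXIT_FAILURE;
        }
        /* From here on, each wchar_t value is a Unicode code point. */
        return EXIT_SUCCESS;
    }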
Which leads to the interfaces I implement for myself.
Conversion from UTF-8 to any other character set is mostly very straightforward. Each Unicode character is 1 to 4 bytes long in UTF-8, and the set of combining characters (those that occupy the same position as the previous character(s), i.e. do not advance the "cursor" position) is known.
Converting combining character sequences to non-Unicode character sets is annoying, because there are so darn many (and the conversion tables require quite a lot of memory, relatively speaking, although they can be and are memory-mapped on architectures with virtual memory), and I'm still debating whether one needs to do that at all. I do, however, believe that any printable character plus any following combining characters should be counted as a single glyph, and that double-wide characters should be counted as two glyphs.
This lets one do exactly what SiliconWizard implied (I think), and just treat everything as UTF-8, even with auto-conversion to/from whatever character set the user uses. Formatting patterns like "this string using at most 8 cursor positions" would then be counted correctly, even for input mixing, say, Japanese and Finnish non-ASCII characters.
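POSIX wcwidth() already encodes essentially the glyph-counting rule above (0 for combining characters, 2 for double-wide ones), so a cursor-position counter can be a simple sketch like this (the helper name is mine):

    #define _XOPEN_SOURCE 700   /* for wcwidth() on glibc */
    #include <wchar.h>

    /* Hypothetical helper: cursor positions used by wide string ws. */
    static size_t cursor_positions(const wchar_t *ws)
    {
        size_t width = 0;
        for (; *ws != L'\0'; ws++) {
            int w = wcwidth(*ws);   /* 0: combining, 2: double-wide,
                                       -1: non-printable (skipped here) */
            if (w > 0)
                width += (size_t)w;
        }
        return width;
    }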
However, GUI toolkits complicate things. GTK GStrings (gchar *) are usually UTF-8, so that's okay. Qt, however, uses UTF-16 internally.
In Linux, paths are opaque byte sequences, where 0 (NUL) terminates the path and 47 ('/') is the path component separator (no file or directory name may contain 47); otherwise they may use whatever character set the user wants. Environment variables are opaque byte sequences terminated by 0, with the first 61 ('=') being the separator between the name and the value part.
This starts reminding one a lot of the Python separation between str and bytes, doesn't it? However, because not all inputs are valid UTF-8, I personally like to use the illegal Unicode code points U+1FFF00 to U+1FFFFF (which encode to 4-byte sequences; only code points U+000000 to U+10FFFF are actually valid, although there are some holes in there too) to represent raw bytes that do not form a valid UTF-8 code sequence. This ensures I can support all possible input byte sequences, and output them as-is, without erroring out or mangling invalid sequences.
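To illustrate the escape scheme (the helper and exact mapping are my own illustration; the point is just that each raw byte gets a distinct, deliberately invalid code point that still fits the 4-byte UTF-8 bit pattern):

    #include <stddef.h>
    #include <stdint.h>

    /* Encode raw byte b as the 4-byte sequence for code point 0x1FFF00 + b.
       These code points are invalid Unicode (above U+10FFFF), so they can
       never collide with decoded text. Decoding simply reverses this. */
    static size_t escape_raw_byte(unsigned char b, unsigned char out[4])
    {
        uint32_t cp = 0x1FFF00u + b;                             /* U+1FFF00 .. U+1FFFFF */
        out[0] = (unsigned char)(0xF0u | (cp >> 18));            /* 11110xxx */
        out[1] = (unsigned char)(0x80u | ((cp >> 12) & 0x3Fu));  /* 10xxxxxx */
        out[2] = (unsigned char)(0x80u | ((cp >> 6) & 0x3Fu));
        out[3] = (unsigned char)(0x80u | (cp & 0x3Fu));
        return 4;
    }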
With UTF-8 locales, that approach has very little to no overhead; with other locales, there is additional overhead on input and output when consuming or producing text, but not when consuming or producing byte sequences. Also, there is no problem in mixing text and byte sequences; no "mode" is needed at all.
Yet, this is a new, non-standard interface. With standard C and POSIX C, we have to get by with wide characters and wide strings.