I forgot to mention that even my example fails if the wide string contains combining diacriticals, because the wprintf() interface counts characters, not glyphs.
For example, the string L"åa" == L"\u0061\u030A\u0061" is counted as three characters, because it is three: the middle one is U+030A COMBINING RING ABOVE.
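For the curious, a minimal demo of that count (my own sketch; assumes a UTF-8 locale):

    #include <locale.h>
    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        setlocale(LC_ALL, "");
        const wchar_t *s = L"a\u030Aa";  /* "åa": 'a', U+030A COMBINING RING ABOVE, 'a' */
        /* wcslen() counts wide characters, not glyphs: this prints 3,
           even though the string occupies only two cursor positions. */
        wprintf(L"wcslen = %zu for \"%ls\"\n", wcslen(s), s);
        return 0;
    }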
But if you want to deal with UTF-8, you're out of luck. Aren't you?
What do you mean? (I suspect I agree, in that the C standard and POSIX haven't kept up, but I'm not sure that is what you refer to.)
In POSIXy systems, after the setlocale(LC_ALL, "") call (or setlocale(LC_CTYPE, "") if you are only interested in strings and character sets, and not the other locale features like date and time formatting), you can use nl_langinfo(CODESET) to obtain an immutable string naming the current character set. If it matches "UTF-8", then your multibyte strings (mbs*(), *mbs()) are in UTF-8. Otherwise, you can use iconv_open() with that string to convert to either wide characters or to UTF-8, or vice versa.
If you want to know how many characters there are in a multibyte string, just use len = mbstowcs(NULL, string, 0).
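For example (again assuming the locale has been set up as above):

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        setlocale(LC_ALL, "");
        /* "åa" as raw UTF-8 bytes: 'a', U+030A (0xCC 0x8A), 'a' */
        const char *string = "a\xCC\x8A" "a";
        size_t len = mbstowcs(NULL, string, 0);  /* count only, no output */
        if (len == (size_t)-1)
            fprintf(stderr, "Invalid multibyte sequence in this locale.\n");
        else
            printf("%zu characters\n", len);     /* 3 in a UTF-8 locale */
        return 0;
    }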
Unfortunately, there is no multibyte equivalent of printf()/wprintf() (it would be mbprintf(), I guess), so there are no standard C stream interfaces that count output or input in multibyte characters instead of chars.
Thus, the wide character output functions (wprintf(), fwprintf(), swprintf(), vswprintf() et al.) are the only ones we can use in standard C (since C99) or POSIX C to format output correctly. (For terminal applications, the ncursesw library also works.)
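As a sketch of what that looks like in practice (field widths in the wide functions are measured in wide characters):

    #include <locale.h>
    #include <wchar.h>

    int main(void)
    {
        setlocale(LC_ALL, "");
        /* The %-8ls field width counts wide characters; note that
           combining marks would still each count as one character here. */
        wprintf(L"|%-8ls|%ls|\n", L"name", L"value");
        return 0;
    }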
Now, I am also a proponent of UTF-8 Everywhere, simply because it is the one Unicode encoding that works everywhere without new or additional requirements, and is backwards-compatible with ASCII. I personally believe that when using e.g. the GNU standard C library, the source character set should always be UTF-8. (IIRC, GCC et al. already use iconv facilities to translate from the current locale to UTF-8.)
I also think it is perfectly reasonable to say that a particular program or utility only works in locales using the UTF-8 character set.
In Linux and Mac OS X, sizeof (wchar_t) == 4, and if nl_langinfo(CODESET) reports "UTF-8", then wchar_t is UTF-32 (which is UCS-4 compatible). So, in Linux and Mac OS X, it is reasonable to require a UTF-8-based locale, so that wide characters are UTF-32/UCS-4/Unicode.
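A program relying on that can verify it up front; a minimal check (my own sketch):

    #include <locale.h>
    #include <langinfo.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <wchar.h>

    int main(void)
    {
        setlocale(LC_ALL, "");
        if (sizeof (wchar_t) < 4 || strcmp(nl_langinfo(CODESET), "UTF-8") != 0) {
            fprintf(stderr, "This program requires a UTF-8 locale.\n");
            return EXIT_FAILURE;
        }
        /* From here on, each wchar_t value is a Unicode code point. */
        return EXIT_SUCCESS;
    }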
Which leads to the interfaces I implement for myself.
Conversion from UTF-8 to any other character set is mostly very straightforward. Each Unicode character is 1 to 4 bytes long in UTF-8, and the set of combining characters (those that occupy the same position as the previous character(s), i.e. do not advance the "cursor" position) is known.
Converting combining character sequences to non-Unicode character sets is annoying, because there are so darn many (and the conversion tables require quite a lot of memory, relatively speaking, although they can be and are memory-mapped on architectures with virtual memory), and I'm still debating whether one needs to do that at all. I do, however, believe that any printable character plus any following combining characters should be counted as a single glyph, and that double-wide characters should be counted as two glyphs.
This lets one do exactly what SiliconWizard implied (I think), and just treat everything as UTF-8, even with auto-conversion to/from whatever character set the user uses. Formatting patterns like "this string using at most 8 cursor positions" would then be counted correctly, even for input mixing, say, Japanese and Finnish non-ASCII characters.
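POSIX wcwidth() already encodes essentially the glyph-counting rule above (0 for combining characters, 2 for double-wide ones), so a cursor-position counter can be a simple sketch like this (the helper name is mine):

    #define _XOPEN_SOURCE 700   /* for wcwidth() on glibc */
    #include <wchar.h>

    /* Hypothetical helper: cursor positions used by wide string ws. */
    static size_t cursor_positions(const wchar_t *ws)
    {
        size_t width = 0;
        for (; *ws != L'\0'; ws++) {
            int w = wcwidth(*ws);   /* 0: combining, 2: double-wide,
                                       -1: non-printable (skipped here) */
            if (w > 0)
                width += (size_t)w;
        }
        return width;
    }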
However, GUI toolkits complicate things. GTK GStrings (gchar *) are usually UTF-8, so that's okay. Qt, however, uses UTF-16 internally.
In Linux, paths are opaque byte sequences, where 0 (NUL) terminates the path and 47 ('/') is the path component separator (no file or directory name may contain 47); otherwise they may use whatever character set the user wants. Environment variables are opaque byte sequences terminated by 0, with the first 61 ('=') being the separator between the name and the value part.
This starts reminding one a lot of the Python separation between str and bytes, doesn't it? However, because not all inputs are valid UTF-8, I personally like to use the illegal Unicode code points U+1FFF00 to U+1FFFFF (which encode to 4-byte sequences; only code points U+000000 to U+10FFFF are actually valid, although there are some holes in there too) to represent raw bytes that do not form a valid UTF-8 code sequence. This ensures I can support all possible input byte sequences, and output them as-is, without erroring out or mangling invalid sequences.
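To illustrate the escape scheme (the helper and exact mapping are my own illustration; the point is just that each raw byte gets a distinct, deliberately invalid code point that still fits the 4-byte UTF-8 bit pattern):

    #include <stddef.h>
    #include <stdint.h>

    /* Encode raw byte b as the 4-byte sequence for code point 0x1FFF00 + b.
       These code points are invalid Unicode (above U+10FFFF), so they can
       never collide with decoded text. Decoding simply reverses this. */
    static size_t escape_raw_byte(unsigned char b, unsigned char out[4])
    {
        uint32_t cp = 0x1FFF00u + b;                             /* U+1FFF00 .. U+1FFFFF */
        out[0] = (unsigned char)(0xF0u | (cp >> 18));            /* 11110xxx */
        out[1] = (unsigned char)(0x80u | ((cp >> 12) & 0x3Fu));  /* 10xxxxxx */
        out[2] = (unsigned char)(0x80u | ((cp >> 6) & 0x3Fu));
        out[3] = (unsigned char)(0x80u | (cp & 0x3Fu));
        return 4;
    }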
With UTF-8 locales, that approach has very little to no overhead; with other locales, there is additional overhead on input and output when consuming or producing text, but not when consuming or producing byte sequences. Also, there is no problem in mixing text and byte sequences; no "mode" is needed at all.
Yet, this is a new, non-standard interface. With standard C and POSIX C, we have to get by with wide characters and wide strings.