As a sideways comment to those who are interested in such details -- wall of text follows:
C actually has two string types: ordinary strings, and wide strings. They are both simply unspecified-length arrays, terminated by a zero value ('\0' and L'\0', respectively).
(Because it can be unclear whether by zero one means code point zero or the zero digit character, I like to call this value nul. In comparison, the zero pointer value I call NULL, with the length of the final consonant separating them in everyday speech.)
For ordinary strings, each character in the array is of type char; note, however, that in C a character constant like 'X' has type int, not char.
For wide strings, each character in the array is of type wchar_t. The related type wint_t is simply a type that can hold any wchar_t value, plus the WEOF value (indicating end-of-stream for wide character streams).
C99 added support for specifying Unicode code points in both ordinary and wide character constants and string literals, using the universal character names \uHHHH or \UHHHHHHHH, where HHHH and HHHHHHHH are the code point in hexadecimal.
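For instance, here is a minimal sketch (assuming the execution character sets can represent these code points, as UTF-8 and UTF-16/UTF-32 ones can):

```c
#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main(void)
{
    /* U+00D6 is 'Ö' and U+20AC is the euro sign '€'. */
    const char    *s  = "\u00D6baut 2.50\u20AC";   /* ordinary string */
    const wchar_t *ws = L"\u00D6baut 2.50\u20AC";  /* wide string */

    printf("\"%s\" is %zu chars long; the wide version is %zu wchar_t long.\n",
           s, strlen(s), wcslen(ws));
    return 0;
}
```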
The exact character set used for ordinary and wide character constants and string literals is a bit of a complex issue. In practice nowadays, the ordinary character set is ASCII compatible, either UTF-8 or one of the 8-bit ASCII-compatible character sets. The character set used for wide characters is even messier, partly because the Microsoft C library used in Windows uses UTF-16, where code points outside the Basic Multilingual Plane require two wide characters (a surrogate pair); I'm not exactly clear which Windows versions and libraries actually handle that, and which are limited to the first 65536 code points of the Unicode set.
For POSIXy systems -- that means Linux, *BSDs, Mac OS, Android, and some other esoteric systems -- the C library provides the iconv conversion facilities. These can convert, at run time, between various character sets (using ordinary strings), and to/from wide character strings, using a very simple but efficient conversion interface.
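Here is a minimal sketch of the interface, converting ISO-8859-1 input to UTF-8 (the encoding names here are assumptions; check what your C library's iconv actually accepts, e.g. via `iconv --list`):

```c
#include <stdio.h>
#include <string.h>
#include <iconv.h>

int main(void)
{
    char    in[]  = "\xD6" "baut";          /* "Öbaut" in ISO-8859-1 */
    char    out[64];
    char   *inp   = in,  *outp = out;
    size_t  inlen = strlen(in), outlen = sizeof out - 1;

    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)-1) {
        perror("iconv_open");
        return 1;
    }
    /* iconv() advances the pointers and decrements the counts as it goes. */
    if (iconv(cd, &inp, &inlen, &outp, &outlen) == (size_t)-1) {
        perror("iconv");
        iconv_close(cd);
        return 1;
    }
    *outp = '\0';
    printf("Converted: %s\n", out);
    iconv_close(cd);
    return 0;
}
```

For real-world input you would call iconv() in a loop, flushing the output buffer whenever it reports E2BIG.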
Standard C also contains wide character equivalents of the typical I/O and string functions -- wprintf()/printf(), fwprintf()/fprintf(), wscanf()/scanf(), wcslen()/strlen() -- and so on. (The only thing really missing is wide character equivalents of the POSIX getline() and getdelim(); you have to roll your own for those.)
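For example, a rolled-your-own wide-character getline() could look like the following sketch; wgetline() is just a name I made up, and it assumes you have called setlocale(LC_ALL, "") first, so that fgetwc() can decode multibyte input:

```c
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <sys/types.h>  /* ssize_t (POSIX) */

/* Hypothetical wide-character analog of POSIX getline():
   reads one line (including the L'\n', if any) from 'in',
   growing *lineptr as needed. Returns the number of wide
   characters read, or -1 on EOF or error. */
ssize_t wgetline(wchar_t **lineptr, size_t *n, FILE *in)
{
    size_t len = 0;
    wint_t wc;

    if (!lineptr || !n || !in)
        return -1;

    while ((wc = fgetwc(in)) != WEOF) {
        /* Grow the buffer, keeping room for the terminating L'\0'. */
        if (len + 2 > *n) {
            size_t   newsize = (len + 2) * 2;
            wchar_t *newptr  = realloc(*lineptr, newsize * sizeof *newptr);
            if (!newptr)
                return -1;
            *lineptr = newptr;
            *n = newsize;
        }
        (*lineptr)[len++] = (wchar_t)wc;
        if (wc == L'\n')
            break;
    }

    if (len == 0)
        return -1;  /* Nothing was read: end of stream, or an error. */

    (*lineptr)[len] = L'\0';
    return (ssize_t)len;
}
```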
But to be most practical, we should just use UTF-8 everywhere. (This is most important when dealing with internet-of-things gadgets and such.)
As an example, if you write a Linux/Mac/BSD/POSIXy program that states that it only works in UTF-8 locales, and the sources use UTF-8 encoding, you can use ordinary string literals that contain non-ASCII characters like "Öbaut 2.50€", and they will work fine. What will not work, however, is single-character non-ASCII literals like '€' or 'Ö', unfortunately. This is because non-ASCII characters in UTF-8 are encoded as 2 to 4 chars (bytes). However, if you write your code to consider substrings instead of single character constants, it is not a problem at all.
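As a minimal sketch of the substring approach (assuming both the source file and the runtime locale use UTF-8):

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *price = "Öbaut 2.50€";

    /* In UTF-8, "€" is the three-byte sequence 0xE2 0x82 0xAC.
       Plain strstr() works here, because UTF-8 is self-synchronizing:
       a multibyte sequence can never match in the middle of another
       character. */
    if (strstr(price, "€"))
        printf("Found a euro sign in \"%s\".\n", price);

    return 0;
}
```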
I do personally have a bit of a chip on my shoulder about Microsoft wrt. C11 and getline() and wide-character support. If MS hadn't made the mistake of assuming early on that 65536 characters would be enough for everyone (Unicode has 1,114,112 code points), we could have proper Unicode support standardized for C wide characters by now, with widget and file system access libraries having wide character interfaces. But enough of that: the world is what it is, and it is much better to be practical and robust than to whine about what could have been. Sorry about that.
In practice, you have two robust approaches to choose from, depending on the environment your C code will run in.
- Specify the character set the code uses. For some minimal gadgets that could be ASCII, but in general UTF-8 is the choice, as it supports all Unicode code points and therefore the vast majority of written human languages. (See the locale-check sketch after this list.)
- Use the user's locale character set for I/O, and the iconv facilities to convert to and from the internal character set, typically either wide characters or UTF-8.
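As a sketch of the first approach on POSIXy systems (nl_langinfo() is POSIX, not standard C), you can verify the user's locale before relying on UTF-8; for the second approach, the same CODESET string can be passed to iconv_open():

```c
#include <stdio.h>
#include <string.h>
#include <locale.h>
#include <langinfo.h>

int main(void)
{
    /* Adopt the user's locale, then check its character set. */
    setlocale(LC_ALL, "");
    if (strcmp(nl_langinfo(CODESET), "UTF-8") != 0) {
        fprintf(stderr, "Sorry, this program requires a UTF-8 locale.\n");
        return 1;
    }
    printf("UTF-8 locale detected; non-ASCII literals are safe to use.\n");
    return 0;
}
```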
If anyone is interested enough, I'd be happy to provide some example code for the various cases; just let me know of a specific situation you'd like to see.
(Full disclosure: I first encountered this problem in the late nineties, when implementing a localized web form for course feedback reports for students, using Windows, Mac (pre-OS X), and Linux machines. Internet Explorer in particular used the character set of the current user locale for non-ASCII characters, regardless of the form data. So, I developed hidden form fields with specific detector characters, to detect the actual character set the browser used for the input fields. I have worked on character set and localization issues a lot, in other words.)