I would really like a better set of string (text) functions. Maybe more capable of dealing with Unicode than the current stuff, but ... definitely more like the support in other languages.
It is interesting to note that current Unicode limits code points to the range 0 to 0x10FFFF, inclusive (1,114,112 unique code points), which means that UTF-8 sequences are 1 to 4 bytes long. The common newline conventions are one or two bytes long. Commonly interesting escape/end sequences are two or three bytes long. And so on.
It seems to me that we really need string functions that, instead of working on single bytes, work on characters or character sequences that are 1 to 4 bytes long. This covers not only UTF-8 but other use cases as well. UTF-8 sequences are in many ways even easier, because the initial byte also encodes the sequence length; this makes them relatively easy to support automagically when globbing or implementing regular expressions.
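To illustrate the point about the initial byte, here is a minimal sketch of decoding the sequence length from a UTF-8 lead byte (the function name is mine, not from any existing library):

```c
#include <stddef.h>

/* Length in bytes of a UTF-8 sequence, determined solely from its
   initial byte. Returns 0 for bytes that cannot start a sequence
   (continuation bytes 0x80..0xBF and invalid lead bytes 0xF8..0xFF). */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;  /* 0xxxxxxx: ASCII                */
    if (lead < 0xC0) return 0;  /* 10xxxxxx: continuation byte    */
    if (lead < 0xE0) return 2;  /* 110xxxxx: two-byte sequence    */
    if (lead < 0xF0) return 3;  /* 1110xxxx: three-byte sequence  */
    if (lead < 0xF8) return 4;  /* 11110xxx: four-byte sequence   */
    return 0;                   /* invalid lead byte              */
}
```

A matcher can step through a buffer sequence by sequence with no per-character table lookups, which is what makes the automagic support cheap.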
(I've worked quite a bit with wide character strings and wide I/O, and while they solve the individual-character problem, they solve neither combining glyphs nor newline conventions nor escape sequences.)
For operations done on a limited-size buffer, the functions need to be able to report the case where the decisive sequence is cut short by the end of the buffer, so that the caller knows to resize/grow/move the buffer.
(So, the equivalent of strnstr() should be able to return "found here", "not found", or "cut short by end of buffer".)
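A minimal sketch of what such a tri-state search could look like; the enum and function names here are hypothetical, not an existing API:

```c
#include <stddef.h>
#include <string.h>

typedef enum {
    MATCH_FOUND,      /* full needle found; *where points at the match  */
    MATCH_NOT_FOUND,  /* needle does not start anywhere in the buffer   */
    MATCH_CUT_SHORT   /* a prefix of needle runs into the end of the
                         buffer; *where points at the partial match     */
} match_result;

/* Search for needle (nlen bytes) in haystack (hlen bytes).
   Like strnstr(), but distinguishes "not found" from "a match might
   continue past the end of the buffer", so the caller knows to grow
   or refill the buffer starting at *where. */
match_result buf_find(const char *haystack, size_t hlen,
                      const char *needle, size_t nlen,
                      const char **where)
{
    for (size_t i = 0; i < hlen; i++) {
        size_t avail = hlen - i;
        size_t n = (nlen < avail) ? nlen : avail;
        if (memcmp(haystack + i, needle, n) == 0) {
            *where = haystack + i;
            return (n == nlen) ? MATCH_FOUND : MATCH_CUT_SHORT;
        }
    }
    *where = NULL;
    return MATCH_NOT_FOUND;
}
```

On MATCH_CUT_SHORT the caller can move the partial match to the front of the buffer, refill the rest, and resume, which is exactly the resize/grow/move decision described above.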
However, even this is most important for those short sequences - a few characters at most; matching longer strings is a much rarer operation, relatively speaking.
Making the common operations efficient is the key. After that, the rarer operations only need to be non-silly.
I'm not sure what that would look like, exactly. One possibility is that strings could have their own garbage-collected memory management, without switching other things away from malloc/free.
Definitely an intriguing option. I also like the underlying idea of modularity.
Perhaps, instead of a monolithic base library, it should be split into a core and optional sub-libraries?
I find it a bit depressing how often people have used uint16_t when what they really should have used was uint_fast16_t. Of course, if you didn't like uint16_t because of readability or typeability issues, you'll REALLY hate uint_fast16_t :-(
That's exactly the reason I was mulling using u16 for uint_fast16_t and u16e for uint16_t.
Minimum-size fast types should be the most commonly used ones, so why not make them the easiest ones to use? The exact-sized types can then be thought of as stricter type variants (in the logical sense; computationally completely separate types), so a suffix denoting "exactly" seems logical to me.
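As a concrete sketch, the naming scheme would amount to something like this (the u16/u16e names are the proposal under discussion, not an existing header):

```c
#include <stdint.h>

/* Fast minimum-width types: the common case, so they get the short names. */
typedef uint_fast8_t   u8;
typedef uint_fast16_t  u16;
typedef uint_fast32_t  u32;
typedef uint_fast64_t  u64;

/* Exact-width types: the stricter variants, marked with an 'e' suffix
   for "exactly". Use these only where layout/width actually matters,
   e.g. file formats and wire protocols. */
typedef uint8_t        u8e;
typedef uint16_t       u16e;
typedef uint32_t       u32e;
typedef uint64_t       u64e;
```

With this scheme, a loop counter is naturally a u16 (whatever width is fastest, at least 16 bits), while a struct mirroring an on-disk record uses u16e.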