Author Topic: Does anybody learn C any more? (Read 38896 times)

rsjsouza · « **Reply #200 on:** September 12, 2019, 06:00:46 pm »

Quote from: bsfeechannel on September 12, 2019, 10:32:17 am

People outside the US usually complain that their keyboard layouts do not encompass all the printable ASCII characters. But that is not a problem with ASCII.

ASCII does not encompass the entirety of the latin characters (áéóôüñ and so on), let alone the mostly different ones. Whoever lived in other countries had to fight a constant battle with MODE CODEPAGE PREPARE and CHCP in the DOS days. The printers then? Lots of fun with neverending streams of continuous feed printer paper or changes in formatting due to non-standard characters sent to the printer.

bsfeechannel · « **Reply #201 on:** September 12, 2019, 06:29:04 pm »

Quote from: rsjsouza on September 12, 2019, 06:00:46 pm

ASCII does not encompass the entirety of the latin characters (áéóôüñ and so on), let alone the mostly different ones. Whoever lived in other countries had to fight a constant battle with MODE CODEPAGE PREPARE and CHCP in the DOS days. The printers then? Lots of fun with neverending streams of continuous feed printer paper or changes in formatting due to non-standard characters sent to the printer.

With UTF-8 this is a thing of the, now remote, past. The first 128 characters of UTF-8 are identical to ASCII. Together with the next 128 code points, you have Latin-1, the widest used "extended ASCII" set. And all the other 4 billion code points are there to code simply all the glyphs produced by humankind since the invention of writing by the ancient Sumerians.

Look at the characters I can get directly from my keyboard:

'1234567890-=qwertyuiop'[asdfghjklç~]zxcvbnm,.;\
"!@#$%"&*()_+QWERTYUIOP`{ASDFGHJKLÇ^}ZXCVBNM<>:| SHIFT
¬¹²³£¢¬{[]}\§/?€®ŧ←↓→øþ´ªæßðđŋħ ̉ĸł'~º«»©“”nµ · ̣º ALTGR
¬¡½¾¼⅜¨⅞™±°¿˛/?€®Ŧ¥↑ıØÞ`¯Æ§ÐªŊĦ &Ł˝^º<>©‘’Nµ×÷˙˘ ALTGR + SHIFT

And if I combine the dead keys with the appropriate characters above I get:

ÁÉÍÓÚáéíóúÀÈÌÒÙàèìòùÂÊÎÔÛâêîôûÃẼĨÕŨãẽĩõũṔṕÝýŚśĀĒĪŌŪāēīōūĂĔĬŎŬăĕĭŏŭṠȦĖİȮ, etc., etc., etc.

Chinese characters? No worries.

#include <stdio.h>

int main(void)
{
    printf("欢迎来到中国\n");
    return 0;
}

$ ./utf-8
欢迎来到中国
$

Welcome to the 21st century.
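
To make the ASCII compatibility concrete, here is a minimal sketch of a UTF-8 encoder in C. The function name `utf8_encode` is made up for illustration and does not come from any library. It shows that code points below 0x80 come out as the same single byte as in ASCII, that the Latin-1 range U+0080 to U+00FF takes two bytes per character once encoded, and that the encodable range stops at U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 if cp cannot be encoded
 * (a UTF-16 surrogate, or a value above U+10FFFF).
 * utf8_encode is a hypothetical helper name used only for this sketch. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                         /* ASCII: one byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                        /* includes Latin-1 U+0080..U+00FF */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)        /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                      /* rest of the BMP, e.g. CJK */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                    /* supplementary planes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                                /* beyond the Unicode range */
}
```

With this sketch, 'A' (U+0041) encodes to the single byte 41, é (U+00E9) to C3 A9, and 欢 (U+6B22) to E6 AC A2.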

SiliconWizard · « **Reply #202 on:** September 13, 2019, 02:36:39 pm »

Quote from: bsfeechannel on September 12, 2019, 06:29:04 pm

With UTF-8 this is a thing of the, now remote, past. The first 128 characters of UTF-8 are identical to ASCII. Together with the next 128 code points, you have Latin-1, the widest used "extended ASCII" set. And all the other 4 billion code points are there to code simply all the glyphs produced by humankind since the invention of writing by the ancient Sumerians.

Yup. 16-bit Unicode should have never existed to begin with. It was a disgrace, used unnecessary space and was a huge problem-maker for porting existing apps.
UTF-8 is great. You almost have nothing to do to support it, except when you need to delimit/count characters. And even that is pretty easy with just a couple rules to know and apply.
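
One of the "couple rules" for counting can be sketched as follows: every byte of the form 10xxxxxx is a continuation byte, so counting the bytes that are not continuation bytes gives the number of code points. The function name `utf8_count` is an assumption for this example, and the sketch assumes the input is already valid UTF-8.

```c
#include <stddef.h>

/* Count the code points in a NUL-terminated UTF-8 string by skipping
 * continuation bytes (bit pattern 10xxxxxx). Assumes valid UTF-8 input;
 * utf8_count is a hypothetical name, not a standard library function. */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;                 /* lead byte or plain ASCII byte */
    }
    return n;
}
```

Note that this counts code points, not what a reader would call characters: an "e" followed by a combining accent counts as two.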

coppice · « **Reply #203 on:** September 13, 2019, 02:55:41 pm »

Quote from: SiliconWizard on September 13, 2019, 02:36:39 pm

Yup. 16-bit Unicode should have never existed to begin with. It was a disgrace, used unnecessary space and was a huge problem-maker for porting existing apps.
UTF-8 is great. You almost have nothing to do to support it, except when you need to delimit/count characters. And even that is pretty easy with just a couple rules to know and apply.

Perhaps 16 bit Unicode was a trick played on Microsoft management who looked at how many characters were in a Microsoft Chinese font, instead of looking at how many Chinese characters there really are.

SiliconWizard · « **Reply #204 on:** September 13, 2019, 03:52:01 pm »

Quote from: coppice on September 13, 2019, 02:55:41 pm

Perhaps 16 bit Unicode was a trick played on Microsoft management who looked at how many characters were in a Microsoft Chinese font, instead of looking at how many Chinese characters there really are.

Well, ahah. But probably not. Both approaches can be justified. 16-bit Unicode had the merits of having fixed-size characters, so that probably appeared to be much simpler to deal with (after all, it was just a matter of changing the size of a "char"). All code could be in theory reused just by redefining a type. In practice though, this change was often more of a burden than it initially appeared.

OTOH, complex parsers, or text editors, especially if not well written, could be a lot more hassle to port to UTF-8 than Unicode.

NorthGuy · « **Reply #205 on:** September 13, 2019, 05:07:36 pm »

Quote from: SiliconWizard on September 13, 2019, 03:52:01 pm

16-bit Unicode had the merits of having fixed-size characters, so that probably appeared to be much simpler to deal with (after all, it was just a matter of changing the size of a "char").

Except for composite symbols of course

SiliconWizard · « **Reply #206 on:** September 13, 2019, 08:10:50 pm »

Quote from: NorthGuy on September 13, 2019, 05:07:36 pm

Quote from: SiliconWizard on September 13, 2019, 03:52:01 pm
16-bit Unicode had the merits of having fixed-size characters, so that probably appeared to be much simpler to deal with (after all, it was just a matter of changing the size of a "char").

Except for composite symbols of course

Well, isn't this more like UTF-16 than the original 16-bit Unicode that MS implemented? Not sure about that, just a question...

NorthGuy · « **Reply #207 on:** September 13, 2019, 08:46:44 pm »

Quote from: SiliconWizard on September 13, 2019, 08:10:50 pm

Well, isn't this more like UTF-16 than the original 16-bit Unicode that MS implemented? Not sure about that, just a question...

Unicode has code points for accented characters (for example 0x00e9 is e with "accent de gue"), but the same character may be composed, for example e (0x0065) followed by "combining" accent de gue (0x0301).

Most funny application is Mac OS, where the file names must be converted to canonical form (I think it's composed, but I don't remember exactly) before use. As a result, different UTF-8 strings may refer to the same file - cannot use strcmp().
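
The strcmp() point can be illustrated with the two spellings of é from the post. This is a minimal sketch: the array names and the `same_bytes` helper are made up for the example, and the byte values are the UTF-8 encodings of U+00E9 and of U+0065 followed by U+0301.

```c
#include <string.h>

/* U+00E9 as a single precomposed code point: UTF-8 bytes C3 A9. */
static const char e_composed[] = "\xC3\xA9";

/* U+0065 followed by U+0301 (combining acute accent): bytes 65 CC 81. */
static const char e_decomposed[] = "e\xCC\x81";

/* Byte-wise equality, which is all strcmp() can offer.
 * same_bytes is a hypothetical helper for this sketch. */
int same_bytes(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Here `same_bytes(e_composed, e_decomposed)` returns 0 even though both strings render as the same letter. Comparing them as text requires normalizing both to the same form first, which standard C does not provide, so a Unicode library would be needed for that step.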

SiliconWizard · « **Reply #208 on:** September 13, 2019, 08:55:34 pm »

Quote from: NorthGuy on September 13, 2019, 08:46:44 pm

Quote from: SiliconWizard on September 13, 2019, 08:10:50 pm
Well, isn't this more like UTF-16 than the original 16-bit Unicode that MS implemented? Not sure about that, just a question...

Unicode has code points for accented characters (for example 0x00e9 is e with "accent de gue"), but the same character may be composed, for example e (0x0065) followed by "combining" accent de gue (0x0301).

Most funny application is Mac OS, where the file names must be converted to canonical form (I think it's composed, but I don't remember exactly) before use. As a result, different UTF-8 strings may refer to the same file - cannot use strcmp().

Oh, I see! Well those combinations remind me of using Latex with no babel package, or similar.

- Just a note: "acute" is "accent aigu" in French (if that's what you were trying to spell.) -

NorthGuy · « **Reply #209 on:** September 13, 2019, 09:25:04 pm »

Quote from: SiliconWizard on September 13, 2019, 08:55:34 pm

- Just a note: "acute" is "accent aigu" in French (if that's what you were trying to spell.) -

I'm sorry about that. I felt something was wrong. I should've gone with "accent grave".

rsjsouza · « **Reply #210 on:** September 14, 2019, 02:09:01 pm »

Quote from: bsfeechannel on September 12, 2019, 06:29:04 pm

Quote from: rsjsouza on September 12, 2019, 06:00:46 pm
Quote from: bsfeechannel on September 12, 2019, 10:32:17 am
People outside the US usually complain that their keyboard layouts do not encompass all the printable ASCII characters. But that is not a problem with ASCII.
ASCII does not encompass the entirety of the latin characters (áéóôüñ and so on), let alone the mostly different ones. Whoever lived in other countries had to fight a constant battle with MODE CODEPAGE PREPARE and CHCP in the DOS days. The printers then? Lots of fun with neverending streams of continuous feed printer paper or changes in formatting due to non-standard characters sent to the printer.

With UTF-8 this is a thing of the, now remote, past.

My point exactly. It was a problem with ASCII used by computer systems of yore.

SiliconWizard · « **Reply #211 on:** September 14, 2019, 04:25:45 pm »

I still don't quite see the point.
ASCII was clearly designed with the English language in mind (so no accents) and kind of the lowest common denominator as far as latin letters and symbols go, that would fit within 7 bits of data. It was a limitation, but already a nice step forward.

bsfeechannel · « **Reply #212 on:** September 14, 2019, 06:29:01 pm »

Quote from: rsjsouza on September 14, 2019, 02:09:01 pm

Quote from: bsfeechannel on September 12, 2019, 06:29:04 pm
With UTF-8 this is a thing of the, now remote, past.
My point exactly. It was a problem with ASCII used by computer systems of YORE.

TIFIFY

bsfeechannel · « **Reply #213 on:** September 14, 2019, 06:53:51 pm »

Quote from: SiliconWizard on September 14, 2019, 04:25:45 pm

ASCII was clearly designed with the English language in mind (so no accents)

Not entirely true. Some diacritical symbols are there `, ^, ~. Other punctuation symbols can double as accents: ', ". Remember that when ASCII came about, printers were the main machine-human interface, not video terminals. On a printer you can type LETTER+BACKSPACE+ACCENT, or ACCENT+LETTER (like on typewriters) if you configure accents as dead keys.

Do you need a c-cedilla to write "Ça va? Ça va bien, merci!"? Just print C, then backspace, then comma and you're good to go.

Tre[CTRL+H]`s chic.

Quote

It was a limitation, but already a nice step forward.

No doubt.

westfw · « **Reply #214 on:** September 15, 2019, 01:05:15 am »

I've sort-of been waiting for the first language/ide to allow user-specified mark-up of the source code. (no, not just some scheme done dynamically by the IDE. Actually IN the source code.) It would be ... interesting.
Hmm. Which is worse, punctuation-heavy languages (like C), or languages with many English keywords?

techman-001 · « **Reply #215 on:** September 15, 2019, 01:52:39 am »

Quote from: westfw on September 15, 2019, 01:05:15 am

I've sort-of been waiting for the first language/ide to allow user-specified mark-up of the source code. (no, not just some scheme done dynamically by the IDE. Actually IN the source code.) It would be ... interesting.

Then you're probably waiting for "westfw-forth" ;-)

legacy · « **Reply #216 on:** September 15, 2019, 08:30:55 am »

Code: [Select]

void foo()
{
        char_UTF8_t msg1[]="欢迎来到中国";
        uint8_t msg2[]="欢迎来到中国"; /* there was a warning here, it was somehow handled as ASCII 8bit */

        uint32_t len1=sizeof(msg1)-1;
        uint32_t len2=sizeof(msg2)-1;
}

len1 = 24 bytes
len2 = 6 bytes

Houston, we have a problem

edit:
When I manually copied the piece of code, I forgot to add "-1" after each sizeof() in the example.
I have also just renamed "size" with "len".

TK · « **Reply #217 on:** September 15, 2019, 12:28:27 pm »

Quote from: legacy on September 15, 2019, 08:30:55 am

Code: [Select]
void foo()
{
        char_UTF8_t msg1[]="欢迎来到中国";
        uint8_t msg2[]="欢迎来到中国"; /* there was a warning here, it was somehow handled as ASCII 8bit */

        uint32_t size1=sizeof(msg1);
        uint32_t size2=sizeof(msg2);
}

size1 = 24 bytes
size2 = 6 bytes

Houston, we have a problem

I don't see any problem... uint8_t is 1 byte. Probably it will store the first 1.5 Chinese characters scattered through 6 bytes

magic · « **Reply #218 on:** September 15, 2019, 12:50:17 pm »

legacy has a problem with some bullshit dinosaur-era compiler, as usual

This is a UTF-8 encoded string. GCC compiles it just fine on my system and it did for many years.

If you want to be standard-compliant, since C11 you type the literal as u8"欢迎来到中国" and every compiler is supposed to handle it correctly regardless of locale or anything.

By the way, I don't know WTF is char_UTF8_t and what that compiler is doing. The string as posted here on the forum is 18 bytes long. When using C, add 1 for null-termination. That's still neither 24 nor 6.
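
The 18-byte figure can be checked with sizeof on a u8 literal. This is a small sketch that assumes a C11 (or later) compiler and a source file saved as UTF-8; the function names are made up for the example.

```c
#include <stddef.h>

/* Six CJK characters, three UTF-8 bytes each, plus the terminating NUL.
 * welcome_size and welcome_length are hypothetical names for this sketch;
 * the u8 prefix requires C11 or later. */
size_t welcome_size(void)
{
    return sizeof u8"欢迎来到中国";        /* array size, NUL included */
}

size_t welcome_length(void)
{
    return sizeof u8"欢迎来到中国" - 1;    /* payload bytes only */
}
```

On such a compiler `welcome_size()` yields 19 and `welcome_length()` yields 18, matching the count given above.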

SiliconWizard · « **Reply #219 on:** September 15, 2019, 01:43:10 pm »

Quote from: magic on September 15, 2019, 12:50:17 pm

If you want to be standard-compliant, since C11 you type the literal as u8"欢迎来到中国" and every compiler is supposed to handle it correctly regardless of locale or anything.

Yep.
Also note that you can insert characters by their Unicode code point inside string literals using the \u or \U escape prefixes.

Eg:

Code: [Select]

u8"\u03BC" which is the small "mu" greek letter.
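
As a quick check of what that literal contains, the sketch below stores it in an array and inspects the bytes. It assumes a C11 or later compiler; the names `mu` and `mu_is_ce_bc` are made up for the example. U+03BC encodes to the two bytes CE BC in UTF-8.

```c
#include <stddef.h>

/* U+03BC GREEK SMALL LETTER MU, written as a universal character name.
 * An unsigned char array can be initialized from a u8 literal, which
 * keeps the byte comparisons below free of sign issues. */
static const unsigned char mu[] = u8"\u03BC";

/* Returns 1 if the literal holds exactly the bytes CE BC plus a NUL.
 * mu_is_ce_bc is a hypothetical helper name for this sketch. */
int mu_is_ce_bc(void)
{
    return sizeof mu == 3 && mu[0] == 0xCE && mu[1] == 0xBC && mu[2] == 0x00;
}
```

The \u form takes exactly four hex digits and \U takes eight, so a code point above U+FFFF needs the \U spelling.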

legacy · « **Reply #220 on:** September 15, 2019, 03:14:47 pm »

Quote from: TK on September 15, 2019, 12:28:27 pm

I don't see any problem

The problem is that it was only a warning message: the UTF-8 string somehow passed with a "cast" even though uint8_t is not the right type.

I would have been happier to see the C compiler issue an error message so the user could fix the mistake.

legacy · « **Reply #221 on:** September 15, 2019, 03:27:56 pm »

Quote from: magic on September 15, 2019, 12:50:17 pm

legacy has a problem with some bullshit dinosaur-era compiler, as usual

Yup. My team supports EOL computers; we are on things that are 20 years old, not older than this, but usually not more modern than this. Anyway, I was considering the DDE used for RISC/OS classic, whose C compiler is *supposed* to have some modern support for UTF.

Quote from: magic on September 15, 2019, 12:50:17 pm

18 bytes long

yup, this is a second problem: the string has *somehow* been handled as UTF-32 (each char is 4 bytes, always) rather than UTF-8 (variable length).
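
The 24 versus 18 difference can be reproduced with standard C11 literals. This is only a sketch of the arithmetic, not of what the DDE compiler actually does: it assumes a C11 compiler, a UTF-8 source file, a `<uchar.h>` header, and a platform where char32_t is 4 bytes wide. The function names are made up for the example.

```c
#include <stddef.h>
#include <uchar.h>

/* Payload size of the string as UTF-32: six code units of char32_t,
 * with the terminating zero unit subtracted. */
size_t utf32_payload_bytes(void)
{
    return sizeof U"欢迎来到中国" - sizeof(char32_t);
}

/* Payload size of the same string as UTF-8: three bytes per character,
 * with the terminating NUL subtracted. */
size_t utf8_payload_bytes(void)
{
    return sizeof u8"欢迎来到中国" - 1;
}
```

With a 4-byte char32_t the first function gives 24 and the second gives 18, which is consistent with the string having been stored as fixed-width 32-bit units.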

SiliconWizard · « **Reply #222 on:** September 15, 2019, 03:28:41 pm »

Quote from: legacy on September 15, 2019, 03:14:47 pm

Quote from: TK on September 15, 2019, 12:28:27 pm
I don't see any problem

The problem is that it was only a warning message: the UTF-8 string somehow passed with a "cast" even though uint8_t is not the right type.

I would have been happier to see the C compiler issue an error message so the user could fix the mistake.

I don't quite get your point. The compiler issued a warning, if I understood correctly? That's the appropriate behavior for a C compiler faced with fishy casts/conversions. (As was mentioned, your example doesn't show the right way of handling UTF-8 with modern C, but if your compiler doesn't support this, then it doesn't officially support UTF-8 either, or apparently only in a way that's completely implementation-specific. As magic said, the char_UTF8_t is non-standard AFAIK.)

If you think a C compiler is too liberal issuing a warning here instead of an error, either use a stricter language, or set up a *zero warning* policy, which is what should be done in any serious development team (and I think is required in most "safe C" rules such as MISRA-C and many others). I personally don't tolerate ANY warning. If you don't trust yourself or others to follow that policy, many compilers have a flag to treat all warnings as errors. Enable this. If both approaches fail, stop using C.

legacy · « **Reply #223 on:** September 15, 2019, 03:40:08 pm »

Quote from: SiliconWizard on September 15, 2019, 03:28:41 pm

char_UTF8_t is non-standard

I guess it's a "typedef", defined somewhere in the DDE ecosystem. I have to investigate.

legacy · « **Reply #224 on:** September 15, 2019, 04:09:58 pm »

Quote from: SiliconWizard on September 15, 2019, 03:28:41 pm

I think is required in most "safe C" rules such as MISRA-C

In avionics, we have to pass external tools' validation.

(this funny image represents the lib_tokenizer v8)

Anyway, the HL compiler I have been designing for Arise-v2 is able to recognize a Unicode string at the token layer (as GCC does, I guess), since my lib_tokenizer is able to pass this information to the upper layers. But since I have just banned every kind of "casting", the compiler would have issued a serious error due to the type mismatch.

Code: [Select]

# evaline char_UTF32_t msg1="欢迎来到中国";
[char_UTF32_t] kind3 4:1 token_StrictAlphaNum, type21
[msg1] kind3 4:2 token_StrictAlphaNum, type21
[=] kind2 4:3 token_Assign, type39
[欢迎来到中国] kind3 4:4 token_String_UTF32, type424
[;] kind2 4:5 token_Semicolon, type92

The lib_tokenizer passes the token list to the parser, and when the parser sees "token_String_UTF" next to "char_UTF32_t" it checks if the "data-type" matches the define, and if not, it issues an error.
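
The check described here could look roughly like the sketch below. Every name in it is hypothetical: the real lib_tokenizer types and token values are not shown in the thread, so the enums and the `literal_matches` function only illustrate the idea of rejecting a literal whose encoding does not match the declared type, with no implicit conversion allowed.

```c
/* Hypothetical token kinds for string literals (not the real lib_tokenizer). */
typedef enum { STR_ASCII, STR_UTF8, STR_UTF32 } string_kind_t;

/* Hypothetical declared element types on the left-hand side. */
typedef enum { TYPE_UINT8, TYPE_CHAR_UTF8, TYPE_CHAR_UTF32 } decl_type_t;

/* Returns 1 if a literal of kind lit may initialize a variable declared
 * as decl without any conversion, 0 if the parser should report an error. */
int literal_matches(decl_type_t decl, string_kind_t lit)
{
    switch (decl) {
    case TYPE_UINT8:      return lit == STR_ASCII;
    case TYPE_CHAR_UTF8:  return lit == STR_UTF8;
    case TYPE_CHAR_UTF32: return lit == STR_UTF32;
    }
    return 0;   /* unknown declared type: treat as a mismatch */
}
```

Under these assumptions a UTF-32 literal assigned to a char_UTF32_t passes, while the same literal assigned to a uint8_t is rejected instead of producing a warning.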

