Author Topic: Does anybody learn C any more?  (Read 38193 times)


Offline rsjsouza

  • Super Contributor
  • ***
  • Posts: 6054
  • Country: us
  • Eternally curious
    • Vbe - vídeo blog eletrônico
Re: Does anybody learn C any more?
« Reply #200 on: September 12, 2019, 06:00:46 pm »
People outside the US usually complain that their keyboard layouts do not encompass all the printable ASCII characters. But that is not a problem with ASCII.
ASCII does not encompass the entirety of the Latin characters (áéóôüñ and so on), let alone the mostly different ones. Whoever lived in other countries had to fight a constant battle with MODE CODEPAGE PREPARE and CHCP in the DOS days. The printers then? Lots of fun with never-ending streams of continuous-feed printer paper or changes in formatting due to non-standard characters sent to the printer.  :-DD
Vbe - vídeo blog eletrônico http://videos.vbeletronico.com

Oh, the "whys" of the datasheets... The information is there not to be an axiomatic truth, but instead each speck of data must be slowly inhaled while carefully performing a deep search inside oneself to find the true metaphysical sense...
 

Offline bsfeechannel

  • Super Contributor
  • ***
  • Posts: 1668
  • Country: 00
Re: Does anybody learn C any more?
« Reply #201 on: September 12, 2019, 06:29:04 pm »
ASCII does not encompass the entirety of the Latin characters (áéóôüñ and so on), let alone the mostly different ones. Whoever lived in other countries had to fight a constant battle with MODE CODEPAGE PREPARE and CHCP in the DOS days. The printers then? Lots of fun with never-ending streams of continuous-feed printer paper or changes in formatting due to non-standard characters sent to the printer.  :-DD

With UTF-8 this is a thing of the, now remote, past. The first 128 code points of Unicode, which UTF-8 encodes as single bytes, are identical to ASCII. Together with the next 128 code points, you have Latin-1, the most widely used "extended ASCII" set. And the rest of the code space (a little over a million code points in all) is there to encode simply all the glyphs produced by humankind since the invention of writing by the ancient Sumerians.

Look at the characters I can get directly from my keyboard:

'1234567890-=qwertyuiop'[asdfghjklç~]zxcvbnm,.;\
"!@#$%"&*()_+QWERTYUIOP`{ASDFGHJKLÇ^}ZXCVBNM<>:|   SHIFT
¬¹²³£¢¬{[]}\§/?€®ŧ←↓→øþ´ªæßðđŋħ ̉ĸł'~º«»©“”nµ · ̣º   ALTGR
¬¡½¾¼⅜¨⅞™±°¿˛/?€®Ŧ¥↑ıØÞ`¯Æ§ÐªŊĦ &Ł˝^º<>©‘’Nµ×÷˙˘   ALTGR + SHIFT

And if I combine the dead keys with the appropriate characters above I get:

ÁÉÍÓÚáéíóúÀÈÌÒÙàèìòùÂÊÎÔÛâêîôûÃẼĨÕŨãẽĩõũṔṕÝýŚśĀĒĪŌŪāēīōūĂĔĬŎŬăĕĭŏŭṠȦĖİȮ, etc., etc., etc.

Chinese characters? No worries.

#include <stdio.h>

int main(void)
{
        /* the source file itself must be saved as UTF-8 */
        printf( "欢迎来到中国\n" );
        return 0;
}

$ ./utf-8
欢迎来到中国
$

Welcome to the 21st century.
 

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15257
  • Country: fr
Re: Does anybody learn C any more?
« Reply #202 on: September 13, 2019, 02:36:39 pm »
With UTF-8 this is a thing of the, now remote, past. The first 128 code points of Unicode, which UTF-8 encodes as single bytes, are identical to ASCII. Together with the next 128 code points, you have Latin-1, the most widely used "extended ASCII" set. And the rest of the code space (a little over a million code points in all) is there to encode simply all the glyphs produced by humankind since the invention of writing by the ancient Sumerians.

Yup. 16-bit Unicode should have never existed to begin with. It was a disgrace, used unnecessary space and was a huge problem-maker for porting existing apps.
UTF-8 is great. You almost have nothing to do to support it, except when you need to delimit/count characters. And even that is pretty easy with just a couple rules to know and apply.
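
For what it's worth, here is a minimal sketch of the "couple of rules" for counting and delimiting. It assumes the input is already valid UTF-8 and does no validation; the names utf8_count and utf8_next are made up for this example. The only rule used is that continuation bytes have the bit pattern 10xxxxxx, so any other byte starts a new code point.

```c
#include <stddef.h>

/* Count code points in a NUL-terminated UTF-8 string.
 * Every byte that is NOT of the form 10xxxxxx starts a new code point.
 * Assumption: the input is valid UTF-8 (nothing is validated here). */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

/* Return a pointer to the start of the code point that follows p.
 * At the end of the string, p itself is returned. */
const char *utf8_next(const char *p)
{
    if (*p == '\0')
        return p;
    do {
        ++p;
    } while (((unsigned char)*p & 0xC0) == 0x80);
    return p;
}
```

Note that this counts code points, which is not always the same thing as what a reader would call a character.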

 

Online coppice

  • Super Contributor
  • ***
  • Posts: 9375
  • Country: gb
Re: Does anybody learn C any more?
« Reply #203 on: September 13, 2019, 02:55:41 pm »
Yup. 16-bit Unicode should have never existed to begin with. It was a disgrace, used unnecessary space and was a huge problem-maker for porting existing apps.
UTF-8 is great. You almost have nothing to do to support it, except when you need to delimit/count characters. And even that is pretty easy with just a couple rules to know and apply.
Perhaps 16 bit Unicode was a trick played on Microsoft management who looked at how many characters were in a Microsoft Chinese font, instead of looking at how many Chinese characters there really are. :)
 

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15257
  • Country: fr
Re: Does anybody learn C any more?
« Reply #204 on: September 13, 2019, 03:52:01 pm »
Perhaps 16 bit Unicode was a trick played on Microsoft management who looked at how many characters were in a Microsoft Chinese font, instead of looking at how many Chinese characters there really are. :)

  ;D

Well, ahah. But probably not. Both approaches can be justified. 16-bit Unicode had the merits of having fixed-size characters, so that probably appeared to be much simpler to deal with (after all, it was just a matter of changing the size of a "char"). All code could be in theory reused just by redefining a type. In practice though, this change was often more of a burden than it initially appeared.

OTOH, complex parsers, or text editors, especially if not well written, could be a lot more hassle to port to UTF-8 than to 16-bit Unicode.
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3243
  • Country: ca
Re: Does anybody learn C any more?
« Reply #205 on: September 13, 2019, 05:07:36 pm »
16-bit Unicode had the merits of having fixed-size characters, so that probably appeared to be much simpler to deal with (after all, it was just a matter of changing the size of a "char").

Except for composite symbols of course :)
 

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15257
  • Country: fr
Re: Does anybody learn C any more?
« Reply #206 on: September 13, 2019, 08:10:50 pm »
16-bit Unicode had the merits of having fixed-size characters, so that probably appeared to be much simpler to deal with (after all, it was just a matter of changing the size of a "char").

Except for composite symbols of course :)

Well, isn't this more like UTF-16 than the original 16-bit Unicode that MS implemented? Not sure about that, just a question...
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3243
  • Country: ca
Re: Does anybody learn C any more?
« Reply #207 on: September 13, 2019, 08:46:44 pm »
Well, isn't this more like UTF-16 than the original 16-bit Unicode that MS implemented? Not sure about that, just a question...

Unicode has code points for accented characters (for example 0x00e9 is e with "accent de gue"), but the same character may be composed, for example e (0x0065) followed by "combining" accent de gue (0x0301).

The funniest case is Mac OS, where the file names must be converted to canonical form (I think it's composed, but I don't remember exactly) before use. As a result, different UTF-8 strings may refer to the same file - cannot use strcmp().
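
A small sketch of why strcmp() fails here. The two arrays hold the UTF-8 bytes of the two spellings mentioned above (0x00e9 alone, and 0x0065 followed by 0x0301); the names are invented for this example. A byte-wise comparison sees two different strings, so both sides would have to be normalized to the same form first, which needs a Unicode library and is not shown.

```c
#include <string.h>

/* U+00E9 (precomposed e with acute) encoded as UTF-8: 2 bytes */
static const char E_ACUTE_COMPOSED[] = "\xC3\xA9";

/* U+0065 'e' followed by U+0301 (combining acute accent): 1 + 2 bytes */
static const char E_ACUTE_DECOMPOSED[] = "e\xCC\x81";

/* Returns 1 if the two strings are identical byte for byte, 0 otherwise.
 * This is all strcmp() can tell you; it knows nothing about Unicode
 * canonical equivalence. */
int same_bytes(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Calling same_bytes(E_ACUTE_COMPOSED, E_ACUTE_DECOMPOSED) returns 0 even though both render as the same letter.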
 

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15257
  • Country: fr
Re: Does anybody learn C any more?
« Reply #208 on: September 13, 2019, 08:55:34 pm »
Well, isn't this more like UTF-16 than the original 16-bit Unicode that MS implemented? Not sure about that, just a question...

Unicode has code points for accented characters (for example 0x00e9 is e with "accent de gue"), but the same character may be composed, for example e (0x0065) followed by "combining" accent de gue (0x0301).

The funniest case is Mac OS, where the file names must be converted to canonical form (I think it's composed, but I don't remember exactly) before use. As a result, different UTF-8 strings may refer to the same file - cannot use strcmp().

Oh, I see! Well those combinations remind me of using Latex with no babel package, or similar.

- Just a note: "acute" is "accent aigu" in French (if that's what you were trying to spell.)  ;D -

 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3243
  • Country: ca
Re: Does anybody learn C any more?
« Reply #209 on: September 13, 2019, 09:25:04 pm »
- Just a note: "acute" is "accent aigu" in French (if that's what you were trying to spell.)  ;D -

:) I'm sorry about that. I felt something was wrong. I should've gone with "accent grave".
 

Offline rsjsouza

  • Super Contributor
  • ***
  • Posts: 6054
  • Country: us
  • Eternally curious
    • Vbe - vídeo blog eletrônico
Re: Does anybody learn C any more?
« Reply #210 on: September 14, 2019, 02:09:01 pm »
People outside the US usually complain that their keyboard layouts do not encompass all the printable ASCII characters. But that is not a problem with ASCII.
ASCII does not encompass the entirety of the Latin characters (áéóôüñ and so on), let alone the mostly different ones. Whoever lived in other countries had to fight a constant battle with MODE CODEPAGE PREPARE and CHCP in the DOS days. The printers then? Lots of fun with never-ending streams of continuous-feed printer paper or changes in formatting due to non-standard characters sent to the printer.  :-DD

With UTF-8 this is a thing of the, now remote, past.
My point exactly. It was a problem with ASCII used by computer systems of yore.
 

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15257
  • Country: fr
Re: Does anybody learn C any more?
« Reply #211 on: September 14, 2019, 04:25:45 pm »
I still don't quite see the point.
ASCII was clearly designed with the English language in mind (so no accents) and kind of the lowest common denominator as far as Latin letters and symbols go, that would fit within 7 bits of data. It was a limitation, but already a nice step forward.
 

Offline bsfeechannel

  • Super Contributor
  • ***
  • Posts: 1668
  • Country: 00
Re: Does anybody learn C any more?
« Reply #212 on: September 14, 2019, 06:29:01 pm »
With UTF-8 this is a thing of the, now remote, past.
My point exactly. It was a problem with ASCII  used by computer systems of YORE.

TIFIFY  ;)
 
The following users thanked this post: rsjsouza

Offline bsfeechannel

  • Super Contributor
  • ***
  • Posts: 1668
  • Country: 00
Re: Does anybody learn C any more?
« Reply #213 on: September 14, 2019, 06:53:51 pm »
ASCII was clearly designed with the English language in mind (so no accents)

Not entirely true. Some diacritical symbols are there `, ^, ~. Other punctuation symbols can double as accents: ', ". Remember that when ASCII came about, printers were the main machine-human interface, not video terminals. On a printer you can type LETTER+BACKSPACE+ACCENT, or ACCENT+LETTER (like on typewriters) if you configure accents as dead keys.

Do you need a c-cedilla to write "Ça va? Ça va bien, merci!"? Just print C, then backspace, then comma and you're good to go. 

Tre[CTRL+H]`s chic.
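
As a sketch, the overstrike trick is just three bytes: the letter, BS (0x08), and the mark. The helper name overstrike is invented for this example. On a hard-copy printer both glyphs end up on the same spot; on most video terminals the second character simply replaces the first, so do not expect a cedilla on screen.

```c
#include <stddef.h>

/* Write "letter, BACKSPACE, mark" plus a terminating NUL into out.
 * Returns the number of bytes written (not counting the NUL),
 * or -1 if the buffer is too small. */
int overstrike(char *out, size_t cap, char letter, char mark)
{
    if (cap < 4)
        return -1;
    out[0] = letter;
    out[1] = '\b';   /* BS, the same control code as CTRL+H */
    out[2] = mark;
    out[3] = '\0';
    return 3;
}
```

For the c-cedilla above that would be overstrike(buf, sizeof buf, 'C', ','), after which buf can be sent to the printer as usual.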

Quote
It was a limitation, but already a nice step forward.

No doubt.
 

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4301
  • Country: us
Re: Does anybody learn C any more?
« Reply #214 on: September 15, 2019, 01:05:15 am »
I've sort-of been waiting for the first language/ide to allow user-specified mark-up of the source code.  (no, not just some scheme done dynamically by the IDE.  Actually IN the source code.)  It would be ... interesting.
Hmm.  Which is worse, punctuation-heavy languages (like C), or languages with many English keywords?
 

Offline techman-001

  • Frequent Contributor
  • **
  • !
  • Posts: 748
  • Country: au
  • Electronics technician for the last 50 years
    • Mecrisp Stellaris Unofficial UserDoc
Re: Does anybody learn C any more?
« Reply #215 on: September 15, 2019, 01:52:39 am »
I've sort-of been waiting for the first language/ide to allow user-specified mark-up of the source code.  (no, not just some scheme done dynamically by the IDE.  Actually IN the source code.)  It would be ... interesting.

Then you're probably waiting for "westfw-forth" ;-)
 

Offline legacy

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: Does anybody learn C any more?
« Reply #216 on: September 15, 2019, 08:30:55 am »
Code:
void foo()
{
        char_UTF8_t msg1[]="欢迎来到中国";
        uint8_t msg2[]="欢迎来到中国"; /* there was a warning here; it was somehow handled as 8-bit ASCII */

        uint32_t len1=sizeof(msg1)-1;
        uint32_t len2=sizeof(msg2)-1;
}

len1 = 24 bytes
len2 = 6 bytes

Houston, we have a problem  :o


edit:
When I manually copied the piece of code, I forgot to add "-1" after each sizeof() in the example.
I have also just renamed "size" to "len".
« Last Edit: September 15, 2019, 03:54:48 pm by legacy »
 

Offline TK

  • Super Contributor
  • ***
  • Posts: 1722
  • Country: us
  • I am a Systems Analyst who plays with Electronics
Re: Does anybody learn C any more?
« Reply #217 on: September 15, 2019, 12:28:27 pm »
Code:
void foo()
{
        char_UTF8_t msg1[]="欢迎来到中国";
        uint8_t msg2[]="欢迎来到中国"; /* there was a warning here; it was somehow handled as 8-bit ASCII */

        uint32_t size1=sizeof(msg1);
        uint32_t size2=sizeof(msg2);
}

size1 = 24 bytes
size2 = 6 bytes

Houston, we have a problem  :o
I don't see any problem... uint8_t is 1 byte.  Probably it will store the first 1.5 Chinese characters scattered through 6 bytes
 

Online magic

  • Super Contributor
  • ***
  • Posts: 7169
  • Country: pl
Re: Does anybody learn C any more?
« Reply #218 on: September 15, 2019, 12:50:17 pm »
legacy has a problem with some bullshit dinosaur-era compiler, as usual :P

This is a UTF-8 encoded string. GCC compiles it just fine on my system and it did for many years.

If you want to be standard-compliant, since C11 you type the literal as u8"欢迎来到中国" and every compiler is supposed to handle it correctly regardless of locale or anything.

By the way, I don't know WTF is char_UTF8_t and what that compiler is doing. The string as posted here on the forum is 18 bytes long. When using C, add 1 for null-termination. That's still neither 24 nor 6.
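
A quick way to check the 18 + 1 figure, assuming a C11 (or later) compiler and a source file saved as UTF-8. The names welcome, welcome_len and welcome_size are just for this sketch.

```c
#include <stddef.h>

/* Six CJK characters, each 3 bytes in UTF-8, plus the terminating NUL */
static const char welcome[] = u8"欢迎来到中国";

/* Bytes of text, not counting the terminating NUL */
size_t welcome_len(void)
{
    return sizeof welcome - 1;
}

/* Bytes of storage, including the terminating NUL */
size_t welcome_size(void)
{
    return sizeof welcome;
}
```

Here welcome_len() gives 18 and welcome_size() gives 19.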
« Last Edit: September 15, 2019, 01:00:30 pm by magic »
 

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15257
  • Country: fr
Re: Does anybody learn C any more?
« Reply #219 on: September 15, 2019, 01:43:10 pm »
If you want to be standard-compliant, since C11 you type the literal as u8"欢迎来到中国" and every compiler is supposed to handle it correctly regardless of locale or anything.

Yep.
Also note that you can insert Unicode characters by code point inside string literals using the \u or \U escape prefixes.

Eg:
Code:
u8"\u03BC" which is the Greek small letter "mu".
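
To see which bytes the compiler produces for such an escape, here is a rough sketch of a UTF-8 encoder. The function name utf8_encode is made up for this example; it follows the usual UTF-8 bit layout and rejects surrogates and anything above U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 into out.
 * Returns the number of bytes written (1 to 4), or 0 if cp is a
 * surrogate (U+D800..U+DFFF) or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)     /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For 0x03BC this writes the two bytes 0xCE 0xBC, which is what the escaped literal above should contain.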
 

Offline legacy

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: Does anybody learn C any more?
« Reply #220 on: September 15, 2019, 03:14:47 pm »
I don't see any problem

The problem is that it was only a warning: the UTF-8 string somehow got through with a "cast" even though uint8_t is not the right type.

I would have been happier to see the C compiler issue an error message so the user could fix the mistake.
 

Offline legacy

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: Does anybody learn C any more?
« Reply #221 on: September 15, 2019, 03:27:56 pm »
legacy has a problem with some bullshit dinosaur-era compiler, as usual :P

Yup. My team supports EOL computers; we work on machines that are about 20 years old: not older than that, but usually not newer either. Anyway, I was considering the DDE used for classic RISC OS, whose C compiler is *supposed* to have some modern support for UTF.

18 bytes long

yup, this is a second problem: the string has *somehow* been handled as UTF-32 (each char is 4 bytes, always) rather than UTF-8 (variable length).
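
For comparison, this is roughly what a standard C11 compiler does when explicitly asked for UTF-32. It is only a sketch: the names welcome32, welcome32_units and welcome32_bytes are made up, <uchar.h> must be available, and the 24-byte figure assumes char32_t is exactly 4 bytes wide, as it is on common platforms.

```c
#include <stddef.h>
#include <uchar.h>

/* U"..." yields an array of char32_t: one element per code point,
 * plus a terminating zero element */
static const char32_t welcome32[] = U"欢迎来到中国";

/* Number of code points, not counting the terminator */
size_t welcome32_units(void)
{
    return sizeof welcome32 / sizeof welcome32[0] - 1;
}

/* Number of bytes of text, not counting the terminator */
size_t welcome32_bytes(void)
{
    return sizeof welcome32 - sizeof welcome32[0];
}
```

With a 4-byte char32_t that is 6 code points and 6 x 4 = 24 bytes, which matches the len1 reported earlier.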
 

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15257
  • Country: fr
Re: Does anybody learn C any more?
« Reply #222 on: September 15, 2019, 03:28:41 pm »
I don't see any problem

The problem is that it was only a warning: the UTF-8 string somehow got through with a "cast" even though uint8_t is not the right type.

I would have been happier to see the C compiler issue an error message so the user could fix the mistake.

I don't quite get your point. The compiler issued a warning, if I understood correctly? That's the appropriate behavior of a C compiler with fishy casts/conversions. (As was mentioned, your example doesn't show the right way of handling UTF-8 with modern C, but if your compiler doesn't support this, then it doesn't officially support UTF-8 either, or apparently only in a way that's completely implementation-specific. As magic said, the char_UTF8_t is non-standard AFAIK.)

If you think a C compiler is too liberal issuing a warning here instead of an error, either use a stricter language, or set up a *zero warning* policy, which is what should be done in any serious development team (and I think is required in most "safe C" rules such as MISRA-C and many others). I personally don't tolerate ANY warning. If you don't trust yourself or others to follow that policy, many compilers have a flag to treat all warnings as errors. Enable this. If both approaches fail, stop using C.
 

Offline legacy

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: Does anybody learn C any more?
« Reply #223 on: September 15, 2019, 03:40:08 pm »
char_UTF8_t  is non-standard

I guess it's a "typedef", defined somewhere in the DDE ecosystem. I have to investigate.
 

Offline legacy

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: Does anybody learn C any more?
« Reply #224 on: September 15, 2019, 04:09:58 pm »
I think is required in most "safe C" rules such as MISRA-C

In avionics, we have to pass external tools' validation.


(this funny image represents the lib_tokenizer v8)

Anyway, the HL compiler I have been designing for Arise-v2 is able to recognize a Unicode string at the token layer (as GCC does, I guess) since my lib_tokenizer is able to pass this information to the upper layers, but since I have just banned every kind of "casting", the compiler would have issued a serious error due to the type mismatch.

Code:
# evaline char_UTF32_t msg1="欢迎来到中国";
[char_UTF32_t] kind3 4:1 token_StrictAlphaNum, type21
[msg1] kind3 4:2 token_StrictAlphaNum, type21
[=] kind2 4:3 token_Assign, type39
[欢迎来到中国] kind3 4:4 token_String_UTF32, type424
[;] kind2 4:5 token_Semicolon, type92

The lib_tokenizer passes the token list to the parser, and when the parser sees "token_String_UTF" next to "char_UTF32_t" it checks if the "data-type" matches the define, and if not, it issues an error.
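
In plain C, that kind of check could look roughly like the sketch below. This is not the actual lib_tokenizer or parser code; the enum, its members and the function name are hypothetical and only illustrate the idea of comparing the declared type against the kind of string token.

```c
#include <string.h>

/* Hypothetical string token kinds, loosely modelled on the dump above */
typedef enum {
    TOK_STRING_UTF8,
    TOK_STRING_UTF32
} string_kind_t;

/* Returns 1 if a string token of the given kind may initialise a variable
 * declared with the given type name, 0 if the parser should report a
 * type mismatch. No implicit "casting" is allowed. */
int string_matches_type(const char *declared_type, string_kind_t kind)
{
    switch (kind) {
    case TOK_STRING_UTF8:
        return strcmp(declared_type, "char_UTF8_t") == 0;
    case TOK_STRING_UTF32:
        return strcmp(declared_type, "char_UTF32_t") == 0;
    }
    return 0;
}
```

With this rule, a UTF-32 string token next to a uint8_t declaration is rejected outright instead of merely producing a warning.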
 

