I've managed to get Antlr to recognize numeric literals in multiple bases, which can contain readability spaces (or underscores) as separators. All of these are recognized:
proc test(x)
value = DEF ABC 2DC0 6BA0:h * 100; // note, no more than a single space can be the separator.
call reboot(FACE DEFB 10C0:H + 239:H);
value = 100 123.45;
value = 1011 FC4.7F:h;
value = 1010 1010;
value = 123:d;
value = 1010 1010:b;
octal = 765:o;
end;
There are four kinds of numeric literals: binary, octal, decimal and hex. All need a colon-prefixed base designator (:b, :o, :d, :h) except decimal, for which it can be omitted.
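To make the literal shape concrete, here is a minimal Python sketch (not the Antlr grammar itself) that validates and evaluates one standalone literal. The names `LITERAL`, `BASES` and `parse_literal` are made up for illustration, and the sketch assumes the literal has already been isolated from the surrounding source, which is the hard part a real lexer has to solve.

```python
import re

# Radix for each base designator; a missing designator means decimal.
BASES = {"b": 2, "o": 8, "d": 10, "h": 16}

DIGITS = r"[0-9A-Fa-f]+(?:[ _][0-9A-Fa-f]+)*"  # groups split by ONE space or underscore
LITERAL = re.compile(
    r"(?P<int>" + DIGITS + r")"
    r"(?:\.(?P<frac>" + DIGITS + r"))?"
    r"(?::(?P<base>[bBoOdDhH]))?"
)

def parse_literal(text):
    """Return the numeric value of one literal, or raise ValueError."""
    m = LITERAL.fullmatch(text)
    if m is None:
        raise ValueError("not a numeric literal: %r" % text)
    base = BASES[(m.group("base") or "d").lower()]
    int_digits = re.sub(r"[ _]", "", m.group("int"))
    frac_digits = re.sub(r"[ _]", "", m.group("frac") or "")
    # int() rejects digits that are invalid for the chosen base, e.g. "129:b".
    value = int(int_digits, base)
    if frac_digits:
        value += int(frac_digits, base) / base ** len(frac_digits)
    return value

print(parse_literal("1010 1010:b"))
print(parse_literal("FACE DEFB 10C0:H"))
print(parse_literal("1011 FC4.7F:h"))
```

Because the designator comes last, the digit characters can only be checked against the base once the whole literal has been read, which is why the sketch matches hex digits first and lets `int()` reject the bad ones afterwards.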
Personal opinion: I do not like spaces in the literals. Underscore, for example, would be better.
From the tokenizer/parser point of view, things would be much easier if spaces were not allowed in literals. See also my comment about tools below.
Edit: There has also been some discussion on how to group the digits in numeric literals. I guess it would be possible to build a simple tool using regex rule(s) to check that the digit grouping of floating point, decimal, hex, octal and binary numbers matches a person's liking. Just run the checker as part of the build process, and it will fail the build if any numeric literal uses invalid grouping. Building a directive for digit grouping rules into the language grammar could also be possible, but I am not sure that would be a wise choice (at least not without extensive prototyping and evaluation).
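A rough idea of what such a checker could look like, as a Python sketch. The group sizes in `GROUP_SIZE` are a made-up house rule, `grouping_ok` is a hypothetical name, and the sketch assumes the literals have already been extracted from the source and only checks the integer part.

```python
import re

# Hypothetical house rule: required digits per group for each base designator.
GROUP_SIZE = {"b": 4, "o": 3, "d": 3, "h": 4}

def grouping_ok(literal):
    """True if the integer part is ungrouped or grouped per GROUP_SIZE."""
    body, _, suffix = literal.partition(":")
    n = GROUP_SIZE[(suffix or "d").lower()]
    int_part = body.split(".", 1)[0]
    if not re.search(r"[ _]", int_part):
        return True  # no separators at all is always accepted
    # The first group may be short; every later group must have exactly n digits.
    pattern = rf"[0-9A-Fa-f]{{1,{n}}}(?:[ _][0-9A-Fa-f]{{{n}}})+"
    return re.fullmatch(pattern, int_part) is not None

def check(literals):
    """Return the literals that break the rule; a build step could fail if any exist."""
    return [lit for lit in literals if not grouping_ok(lit)]

print(check(["1010 1010:b", "DEF ABC 2DC0 6BA0:h", "100 123.45"]))
```

A build script could call `check` on every literal the lexer reports and exit with a non-zero status when the returned list is not empty.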
---
At some point in this thread there was a discussion about whether to use reserved words or not. I think the history of programming languages has shown that using reserved words makes the grammar easier to parse and reduces ambiguity. Just keep the grammar as simple as possible.
About tools: All tools - such as code formatters, IDEs, refactoring tools etc. - would benefit from a simple grammar without ambiguity.
We have something like 70 years of knowledge about programming languages, so there is no need to repeat the problems of the earlier programming languages.
The (single) space is entirely optional and underscore is also permitted. There are standards bodies that recommend spaces over commas or periods. The space is actually common in several industries and reduces the risk of misunderstanding across cultures. I did research this; it's an interesting aspect of language design.
Having the base designator as a suffix (e.g. :H) rather than a prefix (e.g. 0x) is what makes these spaces possible. It takes more effort to recognize, but Antlr4 is extremely powerful.
The reserved words question gets much attention. The core motive is: never ever fail to compile code that might have identifiers that are the same as newly added language keywords. Consider C#, where they had to use "yield return" when simply "yield" was the obvious choice; their grammar could not support this and be backward compatible.
That's frankly a poor design right there, but as with most languages the designers paid too little attention to careful, methodical grammar design. Grammars must be designed, there's no escaping this point. The trend for years has been "make it C-like", and just look at the problems that's led to.
Of course avoiding or minimizing obfuscation is important too, and providing compile-time options to warn against unintended keyword clashes is easy to do and better than losing backward compatibility.
I want to stress that the grammar has no ambiguities. C++ and even C do have grammar ambiguities, you can read about these. C++ also has arguably the most complex grammar of any programming language, littered with edge cases and ifs and buts. Until C++11 this construct was illegal in C++:
Typename<Type<Other>>
Instead one was forced to write
Typename<Type<Other> >
That's frankly ridiculous (the >> was recognized as a shift operator) and can be traced back to sloppy or hasty grammar choices. I talk about this a lot, but that's because it's very important: without careful design you get all the syntax mess we see in C++, C# and so on.
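The effect is easy to reproduce with a toy longest-match lexer. This Python sketch is only an illustration of the "maximal munch" rule, not how any real C++ compiler is written; `TOKEN` and `lex` are made-up names.

```python
import re

# ">>" is listed before ">" so the lexer always prefers the longer token.
TOKEN = re.compile(r">>|[<>]|[A-Za-z_]\w*")

def lex(source):
    """Split source into tokens, silently skipping whitespace."""
    return TOKEN.findall(source)

print(lex("Typename<Type<Other>>"))   # ends with a single '>>' token
print(lex("Typename<Type<Other> >"))  # ends with two separate '>' tokens
```

With the space, the parser sees two closing angle brackets. Without it, the parser is handed one shift token and has no closing bracket for either template argument list.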
This new grammar parses robustly and easily. That's because there are only two fundamental kinds of statements: assignments and keyword statements.
An assignment (however complex lexically) can always be recognized 100% reliably irrespective of the spelling of terms. Only if a statement is not an assignment do we look for a keyword start.
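As a rough illustration of "assignment first, keyword second", here is a simplified Python sketch. It is not the actual Antlr grammar: it assumes an assignment always has the shape `name = ...` or `name(...) = ...`, and `TOKEN` and `is_assignment` are made-up names. A real grammar would need more lookahead for targets more complex than this.

```python
import re

TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\d+|[()=;,+*/-]")

def is_assignment(statement):
    """Classify by shape only: the spelling of the first word is never consulted."""
    tokens = TOKEN.findall(statement)
    if not tokens or not re.fullmatch(r"[A-Za-z_]\w*", tokens[0]):
        return False
    i = 1
    # Skip one balanced parenthesised group (subscripts or arguments), if present.
    if i < len(tokens) and tokens[i] == "(":
        depth = 0
        while i < len(tokens):
            if tokens[i] == "(":
                depth += 1
            elif tokens[i] == ")":
                depth -= 1
                if depth == 0:
                    i += 1
                    break
            i += 1
    return i < len(tokens) and tokens[i] == "="

for stmt in ["call = 5;", "end(3) = call + 1;", "call reboot(1);", "end;"]:
    print(stmt, "->", "assignment" if is_assignment(stmt) else "keyword statement")
```

Under these assumptions `call = 5;` and `end(3) = call + 1;` are assignments even though `call` and `end` are also keywords, while `call reboot(1);` and `end;` fall through to keyword handling.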
Despite first impressions, this is actually simple, not complex. Grammars should have this kind of power IMHO.