There are 8-bit architectures that have very few registers and were designed to pass function arguments on the stack, but they're quite difficult to optimize code for. Most hardware today, from AVRs to ARMs to RISC-V, has many general-purpose registers, so targeting those makes more sense to me anyway.
Not always true...
Well, I do think 12 general-purpose registers (xCORE-200 XS2 ISA (PDF)) is plenty!
But, sure, there are exceptions to any rule of thumb. xCORE is definitely one.
One optimization difficulty is how data has to be shuffled around between function calls. When memory access is as fast as register access (so it doesn't matter whether you shuffle data in registers, or between registers and memory), optimizing such shuffling becomes simple.
However, even in the xCORE XS2 ISA, arithmetic and logical operations are done in registers, with any of the 12 general-purpose registers usable as source and destination. A function that does, say, additions, multiplications, and divisions between a few values can do so directly if the values are passed in registers. If the values are passed on the stack, they have to be loaded from the stack before they can be operated on. This is the overhead one can avoid by passing by value in registers, with passing by reference implemented via pointers.
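As a rough sketch of the difference, assuming an ABI that passes the first few arguments in registers (as the usual ARM, RISC-V, and x86-64 System V calling conventions do): the function and structure names below are made up for illustration, and what the compiler actually emits depends on the target and optimization level.

```c
#include <stdint.h>

/* Hypothetical example: three values passed by value.
   With a register-based calling convention, a, b, and c typically
   arrive in registers, so the multiply and add can use them directly. */
int32_t mac_by_value(int32_t a, int32_t b, int32_t c)
{
    return a * b + c;
}

/* Hypothetical structure used to pass the same values by reference. */
struct mac_args {
    int32_t a;
    int32_t b;
    int32_t c;
};

/* Passing by reference: only the pointer is passed (typically in a
   register); each member must be loaded from memory before use,
   much like arguments that were passed on the stack. */
int32_t mac_by_reference(const struct mac_args *args)
{
    return args->a * args->b + args->c;
}
```

Both functions compute the same result; comparing the generated assembly (for example with `gcc -O2 -S`) for your target shows the extra loads in the second one.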
That was demonstrated very clearly with the Sun UltraSPARC Niagara T series processors, which had up to 128 "cores" (and used them effectively!) in 2005-2011, when x86-64 had fewer than 8 cores.
Yep. I personally like the idea of asymmetric multiprocessing a lot, and would love to play with such hardware. Alas, the only ones I have are mixed Cortex-A cores (ARM big.LITTLE).
We can see from the increasing complexity of peripherals on even cheap ARM Cortex-M microcontrollers, and things like Raspberry Pi Pico/RP2040 (and of course TI AM335x with its PRUs) that we're slowly going that way anyway.
Personally, I'd love to have tiny programmable cores with just a basic ALU (addition, subtraction, comparison, and bit operations, with just a couple of run-time registers), read and write access to the main memory, and the ability to raise an interrupt in the main core. Heck, all the buses I use could be implemented with one.
I looked at XEF216-512-TQ128-C20A (23.16€ in singles at Digikey, in stock), and the problem I have with it is that it is too powerful!
I fully understand why XMOS developed xC, a variant of C with additions for tile computing; something like this would be perfect for developing a new event-based programming language, since it already has the hardware support for it.
For now, however, I have my own sights set much lower: replacing the standard C library with something more suitable for my needs, to be used in a mixed freestanding C/C++ environment. This is well defined in the standards, and there is more than one compiler I can use it with (specifically, GCC and Clang on different architectures, including 8-bit (AVR), 32-bit (ARM, x86), and 64-bit (x86-64, AArch64/ARM64) ones).
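To give an idea of what that looks like in practice, here is a minimal sketch of two replacement routines that rely only on `<stddef.h>`, one of the headers a freestanding implementation must provide. The names `my_copy` and `my_length` are hypothetical placeholders, not part of any existing library, and the `extern "C"` guard is there only so the same source can be used from both C and C++.

```c
#include <stddef.h>  /* size_t; available even in freestanding environments */

#ifdef __cplusplus
extern "C" {
#endif

/* Hypothetical byte-wise copy, standing in for memcpy().
   The buffers must not overlap. Returns dst. */
void *my_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    while (n--)
        *d++ = *s++;

    return dst;
}

/* Hypothetical string length, standing in for strlen().
   Counts characters up to, but not including, the terminating nul. */
size_t my_length(const char *s)
{
    size_t n = 0;

    while (s[n] != '\0')
        n++;

    return n;
}

#ifdef __cplusplus
}
#endif
```

With GCC and Clang, this kind of code is normally compiled with `-ffreestanding`, which tells the compiler not to assume a hosted standard library is present.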