Also, why do you think the STM8 is better than both (that hasn't necessarily been my experience)?
In general, the 8051 seems to get a bad rap a lot of the time, and I'm not entirely sure why. Is it the split memory architecture? Or the lack of a good open-source C compiler? Yes, I know about SDCC (and use it all the time), but the code it generates really isn't all that efficient or well optimized. Still, it is good enough for high-level things, and one can always drop down to inline assembly in the places where speed and/or size efficiency actually matters.
Speaking of program size, the 8051 is actually pretty good in this regard. There are a lot of 1-byte instructions in addition to the 2-byte ones, and very few 3-byte instructions.
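For example, a quick sample of encodings straight from the instruction set:
inc a            ; 1 byte
mov a,r7         ; 1 byte
mov r7,#0x08     ; 2 bytes
mov dptr,#0x1234 ; 3 bytes (one of the few 3-byte instructions)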
Let's compare STM8 to MCS-51 then. SDCC supports both. For the comparison, I'll assume we need a few KB of RAM (i.e. large memory model for mcs51, medium memory model for stm8) and want full reentrancy as in the C standard (i.e. --stack-auto option for mcs51).
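Concretely, that means compiler invocations along these lines (a sketch; the file name is just a placeholder):
sdcc -mmcs51 --model-large --stack-auto dhry.c
sdcc -mstm8 --model-medium dhry.c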
The STM8 has good support for pointers (a flat address space, plus the x and y index registers), while the MCS-51 has to juggle memory spaces and go through dptr. The STM8 also has stack-pointer-relative addressing. And the SDCC stm8 port has fancier optimizations than the mcs51 one.
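To make the memory-space juggling concrete, here is a minimal sketch of what mcs51 code typically has to spell out (qualifiers as in the SDCC manual; the names are purely illustrative):
#include <stdint.h>
__data  uint8_t counter;      // internal RAM, direct addressing
__xdata uint8_t buffer[512];  // external RAM, reached with movx via dptr
__code  uint8_t table[16];    // program memory, reached with movc
char *generic_ptr;            // generic pointer: 3 bytes, space tag checked at runtime
__xdata char *xram_ptr;       // qualified pointer: 2 bytes, always movx
On the stm8, none of this exists: one flat address space, and a data pointer is just a 16-bit value that fits in x or y.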
Looking at a benchmark, we can see what this means (Dhrystone, STM8AF at 16 MHz vs C8051 at 24.5 MHz):
stm8 code size is half that of mcs51:
https://sourceforge.net/p/sdcc/code/HEAD/tree/trunk/sdcc-extra/historygraphs/dhrystone-stm8-size.svg
https://sourceforge.net/p/sdcc/code/HEAD/tree/trunk/sdcc-extra/historygraphs/dhrystone-mcs51-size.svg
Despite the C8051 being single-cycle and having a 50% higher clock speed, the STM8 is 85% faster:
https://sourceforge.net/p/sdcc/code/HEAD/tree/trunk/sdcc-extra/historygraphs/dhrystone-stm8-score.svg
https://sourceforge.net/p/sdcc/code/HEAD/tree/trunk/sdcc-extra/historygraphs/dhrystone-mcs51-score.svg
The graphs are from SDCC, where they are used to track code size and speed to quickly notice regressions.
Thanks for the links, although to be honest, they seem to say more about how good SDCC is for a particular architecture over time than about the architecture itself. It looks like more work is being put into STM8, so it is on an upward trajectory (smaller code size, faster execution), while MCS51 may have had some regressions introduced that speak to a downward trajectory (larger code size, slower execution).
Also, turning on --stack-auto for reentrancy, as well as using the medium or large memory models, goes against the SDCC defaults and recommendations. For the projects I have worked on, there wasn't a need for either. Those options may be contributing dramatically to the increased code size and lower performance. My observation from porting code originally written for AVR is that it usually compiles to a smaller size on the 8051. But it could also be that I am optimizing it in the process and it would be smaller regardless of the destination architecture.
I fully agree that SDCC is not particularly good at generating optimal code for the MCS51 architecture. What I'm not sure about is whether that is inherent to the 8051 architecture, or just because SDCC tries to work across so many architectures that it is hard to optimize for any one. Or is it simply that some architectures have had more interest, and therefore more optimization work done on them? To be clear, I'm not trying to pick on SDCC; I am very thankful it exists, and I can appreciate how difficult it must be to create and maintain a complex multi-architecture compiler.
I will say, however, that after looking at the generated code for MCS51, one has to wonder whether the authors have actually read through and understood the full 8051 instruction set.
Something simple like:
uint8_t i = 8;
do { /* ... */ } while (--i);
should compile down to something like this (4 bytes, 6 cycles; note that all cycle counts here and below are for the N76E003, and other MCUs do even better cycle-wise):
mov r7,#0x08 ; 2 bytes, 2 cycles
00101$:
; ...
djnz r7,00101$ ; 2 bytes, 4 cycles
Why, then, does SDCC generate this (7 bytes, 8 cycles)? Does SDCC not know about the DJNZ (decrement and jump if not zero) instruction?:
mov r7,#0x08 ; 2 bytes, 2 cycles
00101$:
; ...
mov a,r7 ; 1 byte, 1 cycle
dec a ; 1 byte, 1 cycle
mov r7,a ; 1 byte, 1 cycle
jnz 00101$ ; 2 bytes, 3 cycles
Or, consider this:
char __code lookup[] = {'0','1','2','3','4','5','6','7','8','9','A','B','C','D','E','F'};
char low_nibble_to_hex(uint8_t nibble) {
    return lookup[nibble & 0xF];
}
for which SDCC generates (22 bytes, 30 cycles):
mov r7,dpl ; 2 bytes, 2 cycles
anl ar7,#0x0f ; 3 bytes, 4 cycles
mov r6,#0x00 ; 2 bytes, 2 cycles
mov a,r7 ; 1 byte, 1 cycle
add a,#_lookup ; 2 bytes, 2 cycles
mov dpl,a ; 2 bytes, 2 cycles
mov a,r6 ; 1 byte, 1 cycle
addc a,#(_lookup >> 8) ; 2 bytes, 2 cycles
mov dph,a ; 2 bytes, 2 cycles
clr a ; 1 byte, 1 cycle
movc a,@a+dptr ; 1 byte, 4 cycles
mov dpl,a ; 2 bytes, 2 cycles
ret ; 1 byte, 5 cycles
which could be as simple as this (11 bytes, 19 cycles). Does SDCC not know about the MOV DPTR,#address instruction?:
mov a,dpl ; 2 bytes, 3 cycles
anl a,#0x0f ; 2 bytes, 2 cycles
mov dptr,#_lookup ; 3 bytes, 3 cycles
movc a,@a+dptr ; 1 byte, 4 cycles
mov dpl,a ; 2 bytes, 2 cycles
ret ; 1 byte, 5 cycles
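(When I hit cases like this, the workaround I use is to drop to inline assembly, which SDCC does support. A rough sketch of how that looks here, using a __naked function so SDCC emits no prologue; the parameter arrives in dpl and the return value goes back in dpl, exactly as in the listings above:)
char low_nibble_to_hex(uint8_t nibble) __naked {
    (void)nibble;           // already sitting in dpl, per the calling convention
    __asm
        mov   a,dpl
        anl   a,#0x0f
        mov   dptr,#_lookup
        movc  a,@a+dptr
        mov   dpl,a         ; 8-bit return value goes back in dpl
        ret
    __endasm;
}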
And, how about this:
void print(char __xdata *string) {
    char c = *string;
    while (c != 0) {
        // ...
        string++;
        c = *string;
    }
}
for which SDCC generates (23 bytes, 40 cycles, 26 cycles in the loop):
mov r6,dpl ; 2 bytes, 2 cycles
mov r7,dph ; 2 bytes, 2 cycles
movx a,@dptr ; 1 byte, 4 cycles
mov r5,a ; 1 byte, 1 cycle
00101$:
mov a,r5 ; 1 byte, 1 cycle
jz 00104$ ; 2 bytes, 3 cycles
; ...
inc r6 ; 1 byte, 3 cycles
cjne r6,#0x00,00116$ ; 3 bytes, 4 cycles
inc r7 ; 1 byte, 3 cycles
00116$:
mov dpl,r6 ; 2 bytes, 2 cycles
mov dph,r7 ; 2 bytes, 2 cycles
movx a,@dptr ; 1 byte, 4 cycles
mov r5,a ; 1 byte, 1 cycle
sjmp 00101$ ; 2 bytes, 3 cycles
00104$:
ret ; 1 byte, 5 cycles
which could be as simple as this (8 bytes, 20 cycles, 11 cycles in the loop). Does SDCC not know about the INC DPTR instruction?:
movx a,@dptr ; 1 byte, 4 cycles
00101$:
jz 00104$ ; 2 bytes, 3 cycles
; ...
inc dptr ; 1 byte, 1 cycle
movx a,@dptr ; 1 byte, 4 cycles
sjmp 00101$ ; 2 bytes, 3 cycles
00104$:
ret ; 1 byte, 5 cycles
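(Same escape hatch as before if you need this today: wrap the hand-written version in a __naked function. Sketch only; since the pointer parameter arrives in dpl/dph, which are the two halves of dptr, it is already where we want it:)
void print(char __xdata *string) __naked {
    (void)string;           // arrives in dpl/dph, i.e. already in dptr
    __asm
        movx  a,@dptr
    print_loop:
        jz    print_done
        ; ...
        inc   dptr
        movx  a,@dptr
        sjmp  print_loop
    print_done:
        ret
    __endasm;
}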
Or, how about this:
uint8_t div8(uint8_t a) {
    return a / 8;
}
for which SDCC generates this (17+ bytes, 22+ cycles):
mov r7,dpl ; 2 bytes, 2 cycles
mov r6,#0x00 ; 2 bytes, 2 cycles
mov __divsint_PARM_2,#0x08 ; 3 bytes, 3 cycles
mov (__divsint_PARM_2 + 1),r6 ; 2 bytes, 2 cycles
mov dpl,r7 ; 2 bytes, 2 cycles
mov dph,r6 ; 2 bytes, 2 cycles
ljmp __divsint ; 3 bytes, 4 cycles (plus unknown bytes/cycles inside __divsint and a final 1 byte, 5 cycles for ret)
which could be as simple as this (9 bytes, 16 cycles). Why does SDCC need to farm this out to a helper function?:
mov a, dpl ; 2 bytes, 3 cycles
mov b, #0x08 ; 3 bytes, 3 cycles
div ab ; 1 byte, 3 cycles
mov dpl, a ; 2 bytes, 2 cycles
ret ; 1 byte, 5 cycles
or even simpler, this (9 bytes, 14 cycles). Does SDCC not know that / 8 is the same as >> 3?:
mov a,dpl ; 2 bytes, 3 cycles
swap a ; 1 byte, 1 cycle
rl a ; 1 byte, 1 cycle
anl a,#0x1f ; 2 bytes, 2 cycles
mov dpl,a ; 2 bytes, 2 cycles
ret ; 1 byte, 5 cycles
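(Here, at least, there is a pure C workaround: write the shift yourself. For an unsigned value the result is identical, and a constant shift shouldn't drag in the __divsint helper, though I haven't checked what every SDCC version makes of it:)
uint8_t div8(uint8_t a) {
    return a >> 3;  // same result as a / 8 for unsigned values
}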
Sorry if some of these seem a bit contrived, but they are all reduced versions of things I have had to manually optimize around in a recent project. And they show how much impact a compiler implementation can have on program size and execution speed. I don't have experience with them, but from what I have read, the IAR and Keil compilers generate better optimized code than SDCC. Again, I'm not trying to bash SDCC, just pointing out that what a particular compiler generates isn't the be-all and end-all of a given processor architecture.
It would be an interesting exercise to take some real-world code and hand-optimize it for particular architectures to compare more realistic results.