I ran Dhrystone 2.1 (a useless benchmark) on a few MCUs that I have.
I ran the benchmark 10,000 times and then flipped a pin. By measuring the time between pin flips, I got the duration of the benchmark. The shorter the duration, the faster the execution.
No source-level optimization of any kind, for any chip: the exact same code ran on all chips.
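The measurement loop described above can be sketched roughly as follows. This is a minimal host-runnable sketch, not the code that was actually run: `run_block`, `fake_dhrystone`, and `fake_flip_pin` are hypothetical names, and on a real chip the two stubs would be replaced by the Dhrystone main loop and a GPIO toggle for that part.

```c
#include <stdint.h>

#define RUNS 10000u   /* benchmark passes per pin flip, as in the post */

/* Host-side stand-ins (assumptions): they only count calls. */
static uint32_t bench_calls = 0;
static uint32_t pin_flips = 0;

static void fake_dhrystone(void) { bench_calls++; }
static void fake_flip_pin(void)  { pin_flips++; }

/* Run the benchmark RUNS times, then flip the pin once. */
static void run_block(void (*bench)(void), void (*flip_pin)(void))
{
    uint32_t i;
    for (i = 0; i < RUNS; i++) {
        bench();
    }
    flip_pin();
}
```

Calling `run_block` in an endless loop makes the pin toggle once per 10,000 passes, so the time between two edges on a scope or logic analyzer is the duration of one block.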
[edit: data now presented as Dhrystones per second per MHz, from high to low]
PIC32MX320: 3,378, C32 2.x, optimized (-O3)
PIC32MX440: 3,067, X32 1.21, optimized (-O3), @ 20MHz
LM4F120: 2,914, MDK-ARM, optimized (-O3 + time)
STM32F4: 2,888, MDK-ARM, optimized (-O3 + time)
STM32F4: 2,525, gcc-arm, optimized (-O3)
PIC24F: 2,432, C30 2.x, optimized -O3 (speed)
PIC24F: 2,403, XC16 pro, optimized (-O3)
PIC32MX440: 2,288, X32 1.21, optimized (-O3), @ 80MHz
LPC1343: 2,087, MDK-ARM, optimized (-O3 + time)
STM32F3: 1,964, MDK-ARM, optimized (-O3 + time)
LM4F120: 1,911, IAR-ARM, optimized
STM32F4: 1,903, IAR-ARM, optimized
PIC24F: 1,901, C30 2.x, optimized -O2 (speed)
LPC1227: 1,506, IAR-ARM, optimized
LPC1343: 1,410, gcc-arm, optimized (-O3)
STM32F3: 1,362, IAR-ARM, optimized
LM4F120: 1,297, MDK-ARM
LM4F120: 1,245, IAR-ARM
PIC24F: 1,237, C30 2.x
PIC24H: 1,195, XC16, optimized (-O3)
PIC24H: 1,170, XC16 pro, optimized (-O3)
STM32F4: 1,162, IAR-ARM, optimized (-O3)
PIC32MX320: 1,151, C32 2.x
PIC24F: 1,106, [compiler?], optimized (-O3)
STM32F4: 1,053, IAR-ARM
LPC1227: 1,050, IAR-ARM
STM32F4: 1,029, MDK-ARM
PIC32MX440: 1,004, X32 1.21, @ 20MHz
PIC24F: 993, XC16 free
STM32F4: 955, MDK-ARM, optimized (-O3)
STM32F1: 921, IAR-ARM, optimized
LPC1343: 906, MDK-ARM
STM32F4: 902, gcc-arm
STM32F3: 858, MDK-ARM
STM32F3: 854, IAR-ARM
STM32F4: 806, MDK-ARM
STM32F3: 804, gcc-arm, optimized
STM32F3: 766, gcc-arm
PIC32MX440: 762, X32 1.21, @ 80MHz
STM32F1: 736, gcc-arm, optimized
MSP430F2418: 734, IAR, optimized (3)
MSP430F2370: 667, IAR, optimized (3)
LPC1343: 664, gcc-arm
STM32F1: 653, IAR-ARM
MSP430F2418: 630, IAR
STM32F030F: 619, MDK-ARM, O3 optimized
LPC1114: 614, IAR-ARM, optimized
MSP430F2370: 573, IAR
PIC24F: 555, [unknown]
STM32F030F: 552, MDK-ARM, O0
STM32F4: 489, IAR-ARM
PIC24H: 489, XC16 free
STM8S: 482, IAR-STM8, optimized
P87C51MC2: 470, Keil C51, optimized for speed
STM32F1: 453, gcc-arm
P87C51MC2: 439, Keil C51, optimized for size
STM8S: 434, IAR-STM8
LPC1114: 410, IAR-ARM
PIC18F26K20: 380, XC8 pro
PIC18F26K20: 323, XC8 free
PIC18F26K20: 322, PICC18 pro
AVR90USB1286: 237, gcc-avr
PIC18F26K20: 168, PICC18 lite
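For reference, the per-MHz normalization behind the scores above can be written as a small helper. This is a sketch under the assumption that the score is simply passes divided by elapsed seconds, divided again by the core clock in MHz. The function name `dhrystones_per_sec_per_mhz` is hypothetical, and the numbers in the note after the block are made up for illustration, not taken from any measurement in the list.

```c
/* Dhrystones per second per MHz for one timed block.
 *   runs      : benchmark passes between two pin flips
 *   seconds   : measured time between the two pin flips
 *   clock_mhz : core clock of the chip, in MHz
 */
static double dhrystones_per_sec_per_mhz(unsigned long runs,
                                         double seconds,
                                         double clock_mhz)
{
    return (double)runs / seconds / clock_mhz;
}
```

For example, 10,000 passes taking 0.5 s on a chip clocked at 20 MHz would give 10,000 / 0.5 / 20 = 1,000.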
Simulation only:
PIC32MZ: 1,173
PIC32MZ: 3,413, O3
PIC24F: 978, XC16
PIC24F: 2,404, XC16, O3
PIC24F: 1,215, C30
PIC24F: 2,433, C30, O3
I was surprised:
1) PIC24F was really fast, and STM32F1/3 sucked on a per-MHz basis.
2) AVR sucked wind.
3) STM8S did OK. Evidence of the 6502's staying power.
Didn't run on 8051 but would expect it to hold its own reasonably well.
edit: 1) added stm32f100 numbers.
2) added stm32f100 + iar-arm, vs. gcc-arm.
3) added results from IAR-ARM and re-presented the data to make it easier on the eyes.
4) added results from pic18f
5) updated PIC24F results (fat fingers) and added -O3 optimization
6) added mdk-arm numbers for STM32F3. Pretty much identical unoptimized.
7) added jaxbird's results for STM32F4 and PIC24H.
8) added PIC24F results under XC16 free/pro. Still very high.
9) added jaxbird's and hans' results for pic24h and pic24f/stm32f4, respectively.
10) added LM4F120 under IAR-ARM and MDK-ARM.
11) added LPC1343 under gcc-arm and MDK-ARM.
12) added STM32F4 under gcc-arm and iar-arm, and mdk-arm too.
13) added LPC1114 under iar-arm. The first CM0 chip in the comparison.
14) added C51 (P87C51MC2) under Keil C51. Optimized for size and speed.
15) added PIC32MX320 under C32, 2.x. (simulated)
16) added PIC32MX440F512H results, for 80MHz and 20MHz
17) added msp430. Fairly respectable scores.
18) added LPC1227 / IAR scores.
19) added MSP430F2370 - similar to the MSP430 scores obtained earlier.
20) added simulated results for PIC32MZ, and PIC24F (XC16 and C30)
21) added the results for STM32F030F, from the ghetto thread.