Pipeline differences between MCUs could also cause Dhrystone to perform very well or very badly. PIC18 does 1 instruction every 4 ticks, PIC24 does 1 every 2, and ARM 1 every tick; however, they probably run a pipeline (others could too). If one instruction modifies data that the next instruction needs, that next instruction can be stalled until the preceding one has completed.
I don't think this changes with clock speed, because the pipeline only runs at the instruction clock.
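To make the stall idea concrete, here is a toy sketch in C. It is not a model of any real PIC or ARM pipeline: the struct, the function name and the fixed stall penalty are all assumptions made up for illustration. It charges one cycle per instruction, plus a penalty whenever an instruction reads the register that the instruction directly before it wrote.

```c
#include <stddef.h>

/* One instruction in the toy model: a destination register and two
   source registers. Use -1 for "no register". (Hypothetical layout.) */
struct insn { int dst; int src1; int src2; };

/* Count cycles for a sequence: one issue cycle per instruction, plus
   stall_cycles whenever an instruction reads the register written by
   the instruction immediately before it (read-after-write hazard). */
unsigned count_cycles(const struct insn *prog, size_t n, unsigned stall_cycles)
{
    unsigned cycles = 0;
    for (size_t i = 0; i < n; i++) {
        cycles += 1;
        if (i > 0 && prog[i - 1].dst >= 0 &&
            (prog[i].src1 == prog[i - 1].dst ||
             prog[i].src2 == prog[i - 1].dst)) {
            cycles += stall_cycles;
        }
    }
    return cycles;
}
```

In this model the cycle count depends only on the instruction order, not on how fast the clock runs, which matches the point about clock speed above.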
I've now become interested to see why PIC24 is so fast, and "forked" this version of the Dhrystone:
https://github.com/rkrajnc/amber/tree/5f1fc912d06346cc3266a0ed0148f0b4272f1e43/sw/dhry (used for testing the Amber ARM FPGA softcore)
The interesting part is that the test lists the number of cycles for other processors such as the Intel i3: it says that one takes 389 cycles per Dhrystone. This means 2570 Dhrystones/MIPS (1,000,000 / 389).
For PIC24FJ64GA004 it took 1001 cycles per Dhrystone (XC16 -O0), so I got 999 Dhrystones/MIPS with XC16 -O0.
With -O3 -unroll I get 483 cycles per Dhrystone, which means 2070 Dhrystones/MIPS.
With -O3 I get 452 cycles per Dhrystone -> 2212 Dhrystones/MIPS.
With -O3 and small code model, small data model, constants in RAM I get 421 cycles per Dhrystone -> 2375 Dhrystones/MIPS.
Very much in line with what dannyf tested. But as the PIC24 executes only 0.5 instructions per clock cycle, I argue that the actual performance per MHz is half of that. So I actually think you get ~1187 Dhrystone/MHz.
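As a rough sketch, the conversion behind these figures can be written in C as below. The function name and the clocks_per_cycle parameter are made up for illustration and are not part of the Dhrystone source; the only assumption is that the score is 1,000,000 divided by the number of clocks one Dhrystone pass takes.

```c
#include <assert.h>

/* Dhrystone passes per second for each MHz of clock.
   cycles_per_run:   measured cycles for one Dhrystone pass.
   clocks_per_cycle: oscillator clocks per counted cycle
                     (1 if the count is already in clocks,
                      2 for a PIC24 instruction cycle). */
double dhrystones_per_mhz(double cycles_per_run, double clocks_per_cycle)
{
    assert(cycles_per_run > 0.0 && clocks_per_cycle > 0.0);
    return 1.0e6 / (cycles_per_run * clocks_per_cycle);
}
```

For example, dhrystones_per_mhz(421.0, 1.0) gives about 2375, and dhrystones_per_mhz(421.0, 2.0) gives about 1187, which is the halved PIC24 figure.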
Time for STM32F407. I set up timer TIM2 with prescaler 0 (1:1) and clock division 1. The input frequency of the timer is the APB1 RCC clock, 37.5MHz, while the CPU runs at 150MHz. So each timer tick corresponds to 4 CPU cycles, i.e. the timer gives 1:4 resolution.
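If the counts are read straight from the timer, they have to be scaled back to CPU cycles. A minimal sketch in C, assuming a free-running 32-bit up-counter and the two clock figures quoted above (the macro and function names are hypothetical, and reading the actual TIM2 counter register is left out):

```c
#include <stdint.h>

#define CPU_HZ   150000000u  /* core clock, as stated above */
#define TIMER_HZ  37500000u  /* timer input clock, as stated above */

/* Elapsed CPU cycles between two reads of a free-running 32-bit
   up-counter. The unsigned subtraction stays correct across a single
   wrap-around of the counter. */
uint64_t ticks_to_cpu_cycles(uint32_t start, uint32_t end)
{
    uint32_t ticks = end - start;
    return (uint64_t)ticks * (CPU_HZ / TIMER_HZ);
}
```

With these clocks the scale factor is 4, so 100 timer ticks count as 400 CPU cycles, and any single measurement is only accurate to within a few CPU cycles.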
With that setup, the figures for IAR are (FLASH / 150MHz):
2044 cycles/Dhrystone @ no optimisations -> 489 Dhrystone/MHz
1904 cycles/Dhrystone @ low -> 525 Dhrystone/MHz
1228 cycles/Dhrystone @ medium -> 814 Dhrystone/MHz
1182 cycles/Dhrystone @ high (size) -> 846 Dhrystone/MHz
1086 cycles/Dhrystone @ high (balanced) -> 920 Dhrystone/MHz
860 cycles/Dhrystone @ high (speed) -> 1162 Dhrystone/MHz
I had a plan to run the code from RAM, but if I add __ramfunc to every function, it actually gets slower. With no optimisations I get 2026 cycles/Dhrystone, but with high speed optimisation it still takes 1396 cycles/Dhrystone.
So I tried lowering the clock speed to 75MHz, and even 37.5MHz, but it makes no difference. So it certainly is not the FLASH wait states.
By the way, how did you trick the PIC18 into running Dhrystone? The version I got wants to allocate 5K bytes in 1 array, which is larger than the whole memory of the chip.