Author Topic: Dhrystone 2.1 on mcus (Read 45427 times)

westfw · « **Reply #25 on:** March 04, 2014, 03:47:31 am »

So you essentially replaced the printf code in the original benchmarks with a pin toggle for measuring the timing? And these numbers should result in the nominal DMIPS/MHz if divided by the usual magic constant (1757 for a VAX780)? That puts the PIC24 at about 1.4, which seems to be in the usual range for modern microcontrollers. (1.25 to 1.89 is what ARM quotes for CM3.)
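
The conversion asked about above can be sketched in a few lines. This is only an illustration, assuming the thread's scores are Dhrystones per second per MHz of clock; the function names are made up for the example, and 1757 is the VAX 11/780 constant mentioned in the post.

```python
# Dhrystones per second achieved by the VAX 11/780 reference machine
# (the "magic constant" from the post above).
VAX_11_780_DHRYSTONES_PER_SEC = 1757


def dmips_per_mhz(dhrystones_per_sec_per_mhz):
    """Convert a raw Dhrystones/s-per-MHz score into DMIPS/MHz."""
    return dhrystones_per_sec_per_mhz / VAX_11_780_DHRYSTONES_PER_SEC


def dhrystones_per_mhz(dmips_mhz):
    """Inverse conversion: the raw score implied by a DMIPS/MHz rating."""
    return dmips_mhz * VAX_11_780_DHRYSTONES_PER_SEC


# Ratings quoted in the post: ARM's CM3 range and the ~1.4 estimate for PIC24.
for rating in (1.25, 1.4, 1.89):
    print(f"{rating:.2f} DMIPS/MHz ~ "
          f"{dhrystones_per_mhz(rating):.0f} Dhrystones/s per MHz")
```

Read the other way round, a raw score has to be somewhere near 2460 per MHz before it reaches the 1.4 DMIPS/MHz mark.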

The interesting question is why most of the other chips' numbers are so much lower than expected.

nuhamind2 · « **Reply #26 on:** March 04, 2014, 11:50:57 am »

I doubt that's the case for AVR. AVR flash is always as fast as the core when fetching instructions. The only slowdown happens when loading a constant from flash, which takes 3 cycles compared with 2 cycles from RAM. But you only load constants from flash if you explicitly place your data in flash using some kind of modifier. Otherwise your data is copied to RAM and loaded from there.

dannyf · « **Reply #27 on:** March 04, 2014, 12:10:34 pm »

Quote

How about running the benchmark on slower frequency and disabling wait state.

Will look into that later.

dannyf · « **Reply #28 on:** March 04, 2014, 01:47:52 pm »

Played with the flash wait state (aka latency) on gcc-arm. Without optimization, the default latency setting produced a Dhrystone/MHz score of 766 on STM32F3. Setting it to 0 wait states pushes the score up by about 100, and setting it to 2 wait states pushes the score down by about 100.

Kjelt · « **Reply #29 on:** March 04, 2014, 01:56:32 pm »

Interesting results, thanks

nuhamind2 · « **Reply #30 on:** March 04, 2014, 02:32:15 pm »

Just adding some info: the STM32F3 page on Mouser states 62 DMIPS at 72 MHz (2 wait states), or 0.861 DMIPS/MHz (pretty close to the benchmark posted here), and 94 DMIPS at 72 MHz when running from CCM-RAM (0 wait states), or 1.305 DMIPS/MHz.
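
The per-MHz figures above follow directly from dividing the quoted DMIPS by the 72 MHz clock. A minimal sketch, with a made-up helper name; note that 94/72 works out to about 1.3056, which the post truncates to 1.305.

```python
def rating_per_mhz(dmips, clock_mhz):
    """DMIPS/MHz from an absolute DMIPS figure and the clock it was measured at."""
    return dmips / clock_mhz


# Figures quoted from the Mouser page in the post above.
flash_2ws = rating_per_mhz(62, 72)    # running from flash, 2 wait states
ccm_ram_0ws = rating_per_mhz(94, 72)  # running from CCM-RAM, 0 wait states

print(f"flash, 2 wait states:   {flash_2ws:.4f} DMIPS/MHz")
print(f"CCM-RAM, 0 wait states: {ccm_ram_0ws:.4f} DMIPS/MHz")
print(f"ratio CCM-RAM / flash:  {ccm_ram_0ws / flash_2ws:.2f}x")
```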

jaxbird · « **Reply #31 on:** March 04, 2014, 02:40:09 pm »

I'm surprised by the difference in results between the pic24f using C30 and pic24h using XC16. It's close to twice the score for the pic24f/C30 combination.

I had a go at a few different frequencies on the pic24h to see if that makes any difference. My original test was run at ~80MHz.

4 MHz - 1195
20 MHz - 1197
80 MHz - 1190
(All with pic24h, XC16, -O3 optimization)

Very similar, I'd put the differences down to not 100% accurate time keeping, plus using the internal 7.37MHz oscillator with feedback divisor and pre/post scaler results in frequencies slightly below or above the target.
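
To put a number on "very similar", the spread of the three scores can be checked quickly. This is just arithmetic on the figures posted above; the variable names are made up for the example.

```python
# Clock in MHz -> Dhrystone/MHz score, as posted (pic24h, XC16, -O3).
scores = {4: 1195, 20: 1197, 80: 1190}

values = list(scores.values())
mean = sum(values) / len(values)
spread_pct = (max(values) - min(values)) / mean * 100

print(f"mean score: {mean:.1f}")
print(f"max-min spread: {spread_pct:.2f}% of the mean")
```

A spread well under 1% is consistent with the explanation given: small timing and oscillator inaccuracies rather than a real dependence on clock speed.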

dannyf · « **Reply #32 on:** March 04, 2014, 05:34:18 pm »

Yeah. I am with you on that. The PIC24 numbers are simply too good to be true. My first reaction then was that the compiler may have been coded to recognize dhrystone code.

When I get some time, I am going to insert some pin flipping code in the dhrystone itself to see if all pieces of the code are actually executed.

hans · « **Reply #33 on:** March 04, 2014, 06:54:37 pm »

It could also be pipeline differences between MCUs that cause Dhrystone to perform very well or very badly. PIC18 does 1 instruction every 4 ticks, PIC24 does 1 every 2, and ARM 1 every 1; however, they probably run a pipeline (others could too). But if one instruction modifies the data used by the next instruction, then that next instruction can be stalled until the preceding instruction has completed.
I don't think this changes with clock speed, because the pipeline only runs at the instruction clock.

I've now become interested to see why PIC24 is so fast, and "forked" this version of the Dhrystone: https://github.com/rkrajnc/amber/tree/5f1fc912d06346cc3266a0ed0148f0b4272f1e43/sw/dhry (used for testing the Amber ARM FPGA softcore)

The interesting part is that the test lists the number of cycles for other processors like the Intel i3: it says that it takes 389 cycles per Dhrystone. This means 2570 Dhrystones/MIPS.
For PIC24FJ64GA004 it took 1001 cycles (XC16 -O0), so that means I got 999 Dhrystones/MIPS with XC16 -O0.
With -O3 -unroll I get 483 cycles per Dhrystone, which means 2070 Dhrystones/MIPS.
With -O3 I get 452 cycles per Dhrystone -> 2212 Dhrystones/MIPS.
With -O3 and small code model, small data model, constants in RAM I get 421 cycles per Dhrystone -> 2375 Dhrystones/MIPS.

Very much in line with what dannyf tested. But, as the PIC24 does 0.5 instruction per Hz, I argue that the actual performance is half. So I actually think you get ~1187 Dhrystone/MHz.
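
The arithmetic behind these figures can be reproduced as below. It is a sketch under the post's own assumptions: "cycles" means PIC24 instruction cycles, one instruction cycle takes two oscillator clocks, and results are truncated to whole numbers as in the post. The function names are invented for the example.

```python
CLOCKS_PER_INSTRUCTION_CYCLE = 2  # PIC24: two oscillator clocks per instruction cycle


def dhrystones_per_mips(cycles_per_dhrystone):
    """Dhrystones completed in one million instruction cycles (truncated)."""
    return 1_000_000 // cycles_per_dhrystone


def dhrystones_per_mhz(cycles_per_dhrystone):
    """Same score per million oscillator clocks instead of instruction cycles."""
    return dhrystones_per_mips(cycles_per_dhrystone) // CLOCKS_PER_INSTRUCTION_CYCLE


# Cycle counts per Dhrystone reported in the post for PIC24FJ64GA004.
pic24_runs = {
    "XC16 -O0": 1001,
    "XC16 -O3 -unroll": 483,
    "XC16 -O3": 452,
    "XC16 -O3, small code/data model, constants in RAM": 421,
}

for settings, cycles in pic24_runs.items():
    print(f"{settings}: {dhrystones_per_mips(cycles)} Dhrystones/MIPS, "
          f"{dhrystones_per_mhz(cycles)} Dhrystones/MHz")
```

The best case (421 cycles) gives 2375 per MIPS and 1187 per MHz, which is where the ~1187 figure comes from.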

Time for STM32F407. I set up timer TIM2 with prescaler 0 (1:1) and clock div 1. The input frequency of the timer is the APB1 RCC clock, 37.5MHz, where the CPU runs at 150MHz. So each timer tick we get actually has 1:4 resolution (one tick per 4 CPU cycles).

With that setup, the figures for IAR are (FLASH / 150MHz):
2044 cycles/Dhrystone @ no optimisations -> 489 Dhrystone/MHz
1904 cycles/Dhrystone @ low -> 525 Dhrystone/MHz
1228 cycles/Dhrystone @ medium -> 814 Dhrystone/MHz
1182 cycles/Dhrystone @ high (size) -> 846 Dhrystone/MHz
1086 cycles/Dhrystone @ high (balanced) -> 920 Dhrystone/MHz
860 cycles/Dhrystone @ high (speed) -> 1162 Dhrystone/MHz
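
The table above can be rechecked with the sketch below. It assumes the listed figures are CPU cycles (timer ticks already multiplied by 4) and takes the 37.5 MHz timer clock from the post at face value; if the timer were clocked differently, the ticks-to-cycles factor would change accordingly. Helper names are invented for the example.

```python
CPU_HZ = 150_000_000
TIMER_HZ = 37_500_000  # timer input clock as stated in the post
CPU_CYCLES_PER_TICK = CPU_HZ // TIMER_HZ  # 4 CPU cycles per timer tick


def cycles_from_ticks(timer_ticks):
    """Scale a raw timer reading up to CPU cycles."""
    return timer_ticks * CPU_CYCLES_PER_TICK


def dhrystones_per_mhz(cpu_cycles_per_dhrystone):
    """Dhrystones per million CPU clocks (truncated, as in the post)."""
    return 1_000_000 // cpu_cycles_per_dhrystone


# CPU cycles per Dhrystone reported for IAR, running from flash at 150 MHz.
iar_runs = {
    "no optimisations": 2044,
    "low": 1904,
    "medium": 1228,
    "high (size)": 1182,
    "high (balanced)": 1086,
    "high (speed)": 860,
}

for level, cycles in iar_runs.items():
    print(f"{level}: {cycles} cycles -> {dhrystones_per_mhz(cycles)} Dhrystone/MHz")
```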

I had a plan to run the code from RAM, but if I add __ramfunc to every function, it actually gets slower. With no optimisations I get 2026 cycles/Dhrystone, but with high speed optimisation it still takes 1396 cycles/Dhrystone.

So I tried lowering the clock speed, to 75MHz, and even 37.5MHz, but it makes no difference. So it certainly is not the FLASH wait state.

By the way, how did you trick the PIC18 into running Dhrystone? The version I got wants to allocate 5K bytes in 1 array, which is larger than the whole memory of the chip.

dannyf · « **Reply #34 on:** March 04, 2014, 07:13:13 pm »

I came across this document, produced by TI: http://www.ti.com/lit/an/slaa205c/slaa205c.pdf

The chart in the back does suggest that PIC24 has lower cycle counts for math operations and comparable cycle counts with ARM7TDMI (LPC2106 class chips) in thumb mode (which is similar to STM32F chips).

So maybe what we observed is indeed within the realm of possibility.

smashIt · « **Reply #35 on:** March 04, 2014, 08:14:28 pm »

Quote from: hans on March 04, 2014, 06:54:37 pm

But, as the PIC24 does 0.5 instruction per MHz, I argue that the actual performance is half. So I actually think you get ~1187 Dhrystone/MHz.

i think you went in the wrong direction

if the pic has 2000 dhrystone/MHz and 2 cycles per instruction, it should do 4000 dhrystone / million instructions / second

hans · « **Reply #36 on:** March 04, 2014, 09:32:06 pm »

Maybe I should have explained that with cycles I meant instructions.

As 2 Hz are 1 instruction. So 2MHz yields 1MIPS.
I understand how you could assume a cycle is 1 Hz.

westfw · « **Reply #37 on:** March 05, 2014, 08:11:34 am »

Quote

PIC24 does 1 [instruction] every 2 [cycles]

Where did you get that idea? The pic24 manuals say "up to 40 MIPS" (and 40MHz clock) and:

Quote

All instructions execute in a single cycle, with the exception of instructions that change the program flow,

(Of course, the 8-bit PICs also say something like that, and elsewhere define a "cycle" as "four oscillator clocks", but I believe that the PIC24 is really 1 instruction per clock...)

hans · « **Reply #38 on:** March 05, 2014, 09:12:13 am »

Look up the datasheet of a PIC24FJ64GB004. Even the summary page says:
"Up to 16 MIPS Operation @ 32 MHz"

PIC24 and dsPIC practically all use the same core, but with different speeds, DSP instructions, and the E/H series have some instructions removed & added.
For example: a PIC24EP128GP202 can run at 60/70 MIPS (depending on temperature range), and the PLL has a maximum output of 120 or 140MHz.
A dsPIC33FJ128GP804 can clock up to 80MHz, which yields 40MIPS.

That's why I focused on the no. of instructions PIC24 uses to complete a Dhrystone, and calculate from there.
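
The datasheet figures quoted above all follow the same rule of two oscillator clocks per instruction cycle. A small sketch, using only the clock values given in the post; the function name is made up for the example.

```python
def pic24_mips(clock_mhz):
    """Instruction rate in MIPS for a PIC24/dsPIC33 core at the given clock (2 clocks per instruction)."""
    return clock_mhz / 2


# Part -> clock in MHz, as quoted in the post.
examples = {
    "PIC24FJ64GB004": 32,
    "dsPIC33FJ128GP804": 80,
    "PIC24EP128GP202 (120 MHz PLL output)": 120,
    "PIC24EP128GP202 (140 MHz PLL output)": 140,
}

for part, clock_mhz in examples.items():
    print(f"{part}: {clock_mhz} MHz -> {pic24_mips(clock_mhz):.0f} MIPS")
```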

JTR · « **Reply #39 on:** March 05, 2014, 10:28:59 am »

Quote from: westfw on March 05, 2014, 08:11:34 am

(Of course, the 8-bit PICs also say something like that, and elsewhere define a "cycle" as "four oscillator clocks", but I believe that the PIC24 is really 1 instruction per clock...)

The PIC24 is a minimum of two clocks per instruction, gospel...

nuhamind2 · « **Reply #40 on:** March 05, 2014, 10:31:11 am »

Perhaps the better wording is that every instruction takes a multiple of 2 clock cycles.

dannyf · « **Reply #41 on:** March 05, 2014, 11:01:52 am »

Quote

Where did you get that idea?

The datasheet? It is fairly easy to pick it up there, actually.

Quote

The PIC24 is a minimum of two clocks per instruction, gospel...

Yeah. It has been well known for quite some time by now.

Quote

Perhaps the better wording is that every instruction takes a multiple of 2 clock cycles.

Yeah.

westfw · « **Reply #42 on:** March 05, 2014, 11:28:06 am »

Huh. I guess the PIC24H parts can double the (up to) 40MHz external clock, yielding an 80MHz internal clock and a 40 MIPS instruction rate... I hadn't realized that the internal clock rate had gotten so high!

legacy · « **Reply #43 on:** March 05, 2014, 12:23:18 pm »

Can I see the C code of the test? I'd like to test my board/toolchain.

dannyf · « **Reply #44 on:** March 05, 2014, 01:58:04 pm »

I did some testing. I put some pin-flipping patterns into the various Procx() routines called in the Dhrystone benchmark. Those patterns differ from each other, so by observing them on the pins I get to confirm that those routines are called.

I did in fact observe those patterns so they are indeed called by the benchmark, both in debug and release modes. That would suggest that the dhrystone numbers for PIC24F are real -> fairly remarkable I think.

Two possible shortfalls here:

1) maybe the routines are only called with those patterns inserted: instead of inserting those patterns, I did an OR 0x00 on the output port and the time I got is similar to the original score.
2) maybe the routines are called but the results are faulty: I did not investigate that.

Too bad that PIC24 is really under-marketed by Microchip and under-appreciated by the mass.

diyaudio · « **Reply #45 on:** March 05, 2014, 02:04:42 pm »

Quote from: dannyf on March 05, 2014, 01:58:04 pm

Too bad that PIC24 is really under-marketed by Microchip and under-appreciated by the mass.

I just got on board with the PIC24 a month ago (after I received my samples) as I ran out of I/O using an 18F series. Thus far it's very interesting. True that it's under-marketed: few open source projects or material on the net.

jaxbird · « **Reply #46 on:** March 06, 2014, 02:24:58 pm »

No doubt, that Microchip designed a killer 16 bit series with the pic24/dsPic33 series. It's been my favorite in this class for the last couple of years. For me it started with a $25 microstick dev board including a couple of dip package mcus to try it out.

Just want to make sure we agree that we are calculating results per MHz and not per MIPS. The results posted by Hans and my results are very similar: 1100+ from Hans with pic24f and XC16, mine at 1190 using pic24h and XC16.

Dannyf: I know you used C30, did you have a chance to run your tests with XC16 for comparison?

Just for fun I had a go at overclocking. The pic24h I tested will run at 120MHz+, not bad at all. But of course pointless as it's not guaranteed to run stable within the temperature specs at that clock speed. But you could probably run it at 100MHz without any problems as long as it's not exposed to extreme temperatures.

dannyf · « **Reply #47 on:** March 06, 2014, 06:52:51 pm »

I did. The numbers are posted earlier and updated in the first post. The XC16 produced comparable but marginally slower performance vs. C30.

JTR · « **Reply #48 on:** March 06, 2014, 10:18:21 pm »

Anyway, for all that has been said and argued, the simple fact is that there is no difference between the PIC24F and the PIC24H in terms of instructions per MHz. Both of them scale linearly and in lock step to each other. Ergo, there cannot be a difference in Dhrystones per MHz. The fact that the PIC24F is listed here as having pretty much double the performance of the PIC24H (with same compiler and settings) itself has to be a red flag that something is wrong with the calculations.

dannyf · « **Reply #49 on:** March 07, 2014, 12:48:32 am »

Quote

pretty much double the performance

"double the performance" = 2x.

I wonder what might have caused that.

