Author Topic: STM32F103C8T6 v.s CH32V203C8T6 same computing efficency? (Read 15208 times)

MT · « **on:** February 17, 2024, 04:18:06 am »

These are basically identical internally and externally (PCB). The ST runs at 72 MHz and the CH at 144 MHz, yet the CH consumes half as much current at full speed, both with an identical setup.
Is there any computing efficiency comparison chart between ARM cores vs. RISC-V cores? In this case, the M3 and the RISC-V QingKeV4? How much slower is the QingKeV4?

brucehoult · « **Reply #1 on:** February 17, 2024, 05:25:04 am »

Quote from: MT on February 17, 2024, 04:18:06 am

How much slower is the QingKeV4?

Why would it be slower than a Cortex-M3?

I wouldn't expect it to be slower at the same clock speed, let alone at twice the MHz.

I haven't tested a QingKeV4 but from the manual it looks very comparable to SiFive's original E31 core in the HiFive1 in late 2016, except the QingKeV4 has a 3-stage pipeline vs 5-stage for the E31. That has two likely effects: 1) a lower Fmax (the HiFive1 ran at 320 MHz in 180nm, while the CH32V203 is as you say running at 144 MHz), and 2) a smaller penalty for branch mispredict.

The QingKeV4 has typical branch prediction facilities including a Branch History Table, a Branch Target Buffer, and a Return Address Stack. I believe the M3 has none of these.

I have a very simple benchmark program I created in 2016 which I have run (or others have run for me) on a lot of different machines. I find it correlates well with more complex benchmarks such as Dhrystone or Coremark and with real programs.

https://hoult.org/primes.txt

I have data there for the HiFive1 and also for an M3 (BluePill), M4F (BlackPill), and M7 (Teensy 4):

     43.516 sec Teensy 4.0 Cortex M7 @ 600 MHz        228 bytes  26.1 billion clocks
    112.163 sec HiFive1 RISCV E31 @ 320 MHz           178 bytes  35.9 billion clocks
    309.251 sec BlackPill Cortex M4F @ 168 MHz        228 bytes  52.0 billion clocks
    927.547 sec BluePill Cortex M3 @ 72 MHz           228 bytes  66.8 billion clocks

If someone wants to run this on a WCH core or cores and send me the results I would appreciate that, but I expect the QingKeV4 to come in very similar to the E31 in terms of clock cycles used.
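
For anyone checking the table, the "billion clocks" column appears to be simply run time multiplied by clock frequency. A minimal sketch of that arithmetic follows; the function names `total_cycles` and `billion_clocks` are made up for illustration and are not part of the benchmark.

```c
/* Clock cycles = run time in seconds x clock frequency in Hz.
   The frequency is passed in MHz, as it is written in the table. */
double total_cycles(double seconds, double mhz)
{
    return seconds * mhz * 1e6;
}

/* The same figure in billions, the unit used in the table. */
double billion_clocks(double seconds, double mhz)
{
    return total_cycles(seconds, mhz) / 1e9;
}
```

For example, `billion_clocks(927.547, 72.0)` comes out at roughly 66.8, which matches the BluePill row, and `billion_clocks(112.163, 320.0)` at roughly 35.9 for the HiFive1.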

SiliconWizard · « **Reply #2 on:** February 17, 2024, 07:54:45 am »

Quote from: brucehoult on February 17, 2024, 05:25:04 am

If someone wants to run this on a WCH core or cores and send me the results I would appreciate that, but I expect the QingKeV4 to come in very similar to the E31 in terms of clock cycles used.

Ah, I can do that (on a CH32V307 or CH32V305.) I'll let you know when I get around to it.

Note that the CH32V203 has a Qingke V4B core, while the 30x have a V4F core. Apart from the fact the V4F has an embedded FPU, it has a higher number of HPE levels (3 vs 2) and a higher number of interrupt nesting levels (8 vs 2), but otherwise seems similar. There may be other minor differences though - in the official manual, they list the V4B and the V4C with exactly the same features, for instance, and there probably are differences. So, who knows.

For pure integer calculations, memory access and branches, though (which are the points that matter for your benchmark), it's probably almost identical.

MT · « **Reply #3 on:** February 17, 2024, 11:12:27 am »

Quote from: brucehoult on February 17, 2024, 05:25:04 am

Quote from: MT on February 17, 2024, 04:18:06 am
How much slower is the QingKeV4?

Why would it be slower than a Cortex-M3?
I wouldn't expect it to be slower at the same clock speed, let alone at twice the MHz.

I had read a bunch of tech articles, not entirely the newest, that claimed ARM was more efficient per clock cycle than RISC-V, etc.

Quote

I haven't tested a QingKeV4 but from the manual it looks very comparable to SiFive's original E31 core in the HiFive1 in late 2016, except the QingKeV4 has a 3-stage pipeline vs 5-stage for the E31. That has two likely effects: 1) a lower Fmax (the HiFive1 ran at 320 MHz in 180nm, while the CH32V203 is as you say running at 144 MHz), and 2) a smaller penalty for branch mispredict.

The QingKeV4 has typical branch prediction facilities including a Branch History Table, a Branch Target Buffer, and a Return Address Stack. I believe the M3 has none of these.

I have a very simple benchmark program I created in 2016 which I have run (or others have run for me) on a lot of different machines. I find it correlates well with more complex benchmarks such as Dhrystone or Coremark and with real programs.

https://hoult.org/primes.txt

I have data there for the HiFive1 and also for an M3 (BluePill), M4F (BlackPill), and M7 (Teensy 4):

     43.516 sec Teensy 4.0 Cortex M7 @ 600 MHz        228 bytes  26.1 billion clocks
    112.163 sec HiFive1 RISCV E31 @ 320 MHz           178 bytes  35.9 billion clocks
    309.251 sec BlackPill Cortex M4F @ 168 MHz        228 bytes  52.0 billion clocks
    927.547 sec BluePill Cortex M3 @ 72 MHz           228 bytes  66.8 billion clocks

If someone wants to run this on a WCH core or cores and send me the results I would appreciate that, but I expect the QingKeV4 to come in very similar to the E31 in terms of clock cycles used.

Interesting stats there, thank you!

SiliconWizard · « **Reply #4 on:** February 17, 2024, 11:47:43 pm »

Alright, I've done the test with Bruce's primes on the CH32V307 @144 MHz.
Here are the results:

    // 319.155 sec WCH32V307 @ 144 MHz (-O1)       bytes  46.0 billion clocks
    // 293.020 sec WCH32V307 @ 144 MHz (-O3)       bytes  42.2 billion clocks

(Normally, Bruce's primes is meant to be compiled at -O1. I tested also at -O3 to see the difference.)
Compiled with mainline GCC 13.2 (riscv64-unknown-elf).

I was always a bit surprised by the results with the Cortex M3 and M4 in this list (haven't tested primes myself on these). It's rather surprising that at least the M4 seems to do worse than a "modest" RISC-V.
Note that benchmarks are benchmarks and tell you only a small story. CoreMark, for instance, definitely shows a significant advantage for the Cortex M4. So, don't get hung up on exact numbers here - what it shows is that the Qingke V4 is definitely on par with a Cortex M3/M4.

Small note: I derived the test code (outside of Bruce's function) from the one I had written for my own RISC-V core, and it used the rdcycle and rdinstret instructions to get the number of cycles and the number of retired instructions (to compute the CPI). This has shown me that neither counter seems to be implemented on the Qingke V4; both always return 0. I had to use the SysTick timer instead, configured as an up counter with HCLK as the time base, to get the number of cycles. I didn't find anything to get the number of executed instructions. Not that it should matter for "normal" work with it, but I found this odd.
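
A rough, host-runnable sketch of the bookkeeping involved in counting cycles with a timer and computing CPI is below. It is not the test code used above. The 32-bit counter width is an assumption made for illustration (the thread does not state the SysTick width on the CH32V307), and the names `cycle_acc`, `cycle_acc_start`, `cycle_acc_sample`, and `cpi` are hypothetical. The point is that 46.0 billion cycles does not fit in 32 bits (2^32 is about 4.29 billion), so a counter that narrow would have to be sampled periodically and accumulated into a wider total.

```c
#include <stdint.h>

/* Running 64-bit cycle total built from a narrower free-running up counter.
   ASSUMPTION: the hardware counter is 32 bits wide. */
typedef struct {
    uint64_t total;  /* cycles accumulated so far */
    uint32_t last;   /* counter value at the previous sample */
} cycle_acc;

/* Start accumulating from the counter value `now`. */
void cycle_acc_start(cycle_acc *acc, uint32_t now)
{
    acc->total = 0;
    acc->last = now;
}

/* Fold in a new counter sample. Unsigned subtraction is modulo 2^32, so the
   delta is correct as long as the counter wrapped at most once since `last`. */
void cycle_acc_sample(cycle_acc *acc, uint32_t now)
{
    acc->total += (uint32_t)(now - acc->last);
    acc->last = now;
}

/* Cycles per instruction. Returns 0.0 when no instruction count is available,
   for example when the instruction counter always reads 0. */
double cpi(uint64_t cycles, uint64_t instructions)
{
    if (instructions == 0)
        return 0.0;
    return (double)cycles / (double)instructions;
}
```

On real hardware the samples would come from the timer's count register, and `cycle_acc_sample` would have to be called more often than the counter wraps.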

brucehoult · « **Reply #5 on:** February 18, 2024, 02:49:14 am »

Quote from: SiliconWizard on February 17, 2024, 11:47:43 pm

Alright, I've done the test with Bruce's primes on the CH32V307 @144 MHz.
Here are the results:

// 319.155 sec WCH32V307 @ 144 MHz (-O1) bytes 46.0 billion clocks

Thanks for that! Added to the official file. I've assigned it the 202 bytes code size GCC 13.2 gets for RV32GC on Godbolt.

Quote

Note that benchmarks are benchmarks and tell you only a small story. CoreMark, for instance, definitely shows a significant advantage for the Cortex M4. So, don't get hung up on exact numbers here - what it shows is that the Qingke V4 is definitely on par with a Cortex M3/M4.

Sure, different code will give different results. Apply maybe ±10% uncertainty for a different benchmark. My "primes" is pretty branchy and benefits a lot from RISC-V's combined compare-and-branch instructions. This isn't deliberate -- I wrote the code before I even knew RISC-V existed.
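
To illustrate what "branchy" means here, below is a small trial-division prime counter. It is a hypothetical stand-in, not the actual "primes" benchmark. Nearly every line is a comparison that feeds a branch. On RISC-V a test such as `n < limit` can typically be a single compare-and-branch instruction (for example `bltu`), whereas Thumb-2 code usually needs a `cmp` followed by a conditional branch. The exact instructions depend on the compiler and options.

```c
#include <stdbool.h>
#include <stdint.h>

/* Trial-division primality test: each step is a compare feeding a branch. */
bool is_prime(uint32_t n)
{
    if (n < 2)
        return false;
    /* d <= n / d is the same as d * d <= n, but cannot overflow. */
    for (uint32_t d = 2; d <= n / d; d++) {
        if (n % d == 0)
            return false;
    }
    return true;
}

/* Count the primes in the range [0, limit). */
uint32_t count_primes_below(uint32_t limit)
{
    uint32_t count = 0;
    for (uint32_t n = 0; n < limit; n++) {
        if (is_prime(n))
            count++;
    }
    return count;
}
```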

Note that Coremark is effectively owned by Arm and there's a bit of history there. In early RISC-V days better Coremark numbers were published, but they required adding a #define or changing a typedef so that loop counters (and therefore array indexes) were int not uint as in the official code. Arm zero-extends 32 bit values which makes uint and ulong the same bits. RISC-V sign-extends 32 bit values, which makes int, long, and ulong (aka size_t) the same bits, but uint needs explicit zero-extending to make it long or ulong to (shift and) add to a pointer for array indexing. After a bit Arm/Coremark sent a communication saying "You're not allowed to change the types -- you must use the benchmark code exactly as provided". And so RISC-V numbers got worse. Eventually, we added in the B extension instructions SH2ADD.UW, SH3ADD.UW etc which zero-extend rs1 before shifting it and this fixes the CoreMark problem (and other things).

That doesn't apply to 32 bit microcontrollers, of course.
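
A minimal sketch of the index-type difference described above, using made-up function names rather than CoreMark code. Both functions return the same sum; the difference is only in what a 64-bit RISC-V compiler may have to emit for `a[i]`. With a `uint32_t` index the value has to be zero-extended to pointer width before it is scaled and added, while an `int` index is already held sign-extended. In a loop this simple the compiler may well optimize the index away entirely, so treat it as an illustration of the types, not a benchmark.

```c
#include <stdint.h>

/* Unsigned 32-bit index: on a 64-bit RISC-V target, i may need an explicit
   zero-extension before it can be scaled and added to the pointer. */
int32_t sum_uint_index(const int32_t *a, uint32_t n)
{
    int32_t sum = 0;
    for (uint32_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Signed index: 32-bit values are kept sign-extended in registers on
   64-bit RISC-V, so i can be used for addressing directly. */
int32_t sum_int_index(const int32_t *a, int n)
{
    int32_t sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```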

SiliconWizard · « **Reply #6 on:** February 18, 2024, 06:16:27 am »

Yes, thanks for the details about the benchmarks. Yes, your "primes" has very tight loops with a lot of branches, so that makes sense with your explanation that RISC-V is a bit favored compared to ARM.
The SiFive E31 appears to have a significant performance advantage (nearly -30% cycles) compared to the Qingke V4. Possibly a more effective branch prediction and the pipeline may also be different. I haven't seen the depth of the pipeline for the Qingke V4 documented, so I can't tell what it is exactly. If anyone knows.
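
As a quick check on the size of that gap, the cycle counts posted earlier in the thread (35.9 billion for the E31, 46.0 billion for the CH32V307 at -O1) can be compared in two directions. The helper names below are made up for illustration. Note that the two runs used different compilers and code sizes, so this is only a rough comparison.

```c
/* Percent more cycles that the slower core needs, relative to the faster one. */
double percent_more(double slow_cycles, double fast_cycles)
{
    return (slow_cycles / fast_cycles - 1.0) * 100.0;
}

/* Percent fewer cycles that the faster core needs, relative to the slower one. */
double percent_fewer(double fast_cycles, double slow_cycles)
{
    return (1.0 - fast_cycles / slow_cycles) * 100.0;
}
```

With those figures, `percent_more(46.0, 35.9)` is about 28 (the Qingke V4 needs roughly 28% more cycles), while `percent_fewer(35.9, 46.0)` is about 22 (the E31 needs roughly 22% fewer).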

The datasheet talks briefly about the difference between the V4B (CH32V203) and the V4C (CH32V208) - they say the V4C has faster hardware division and improved memory protection (haven't seen yet what the exact difference with PMP is).

MT · « **Reply #7 on:** February 18, 2024, 10:06:13 pm »

Quote from: SiliconWizard on February 18, 2024, 06:16:27 am

Yes, thanks for the details about the benchmarks. Yes, your "primes" has very tight loops with a lot of branches, so that makes sense with your explanation that RISC-V is a bit favored compared to ARM.
The SiFive E31 appears to have a significant performance advantage (nearly -30% cycles) compared to the Qingke V4. Possibly a more effective branch prediction and the pipeline may also be different. I haven't seen the depth of the pipeline for the Qingke V4 documented, so I can't tell what it is exactly. If anyone knows.

The datasheet talks briefly about the difference between the V4B (CH32V203) and the V4C (CH32V208) - they say the V4C has faster hardware division and improved memory protection (haven't seen yet what the exact difference with PMP is).

The QingKeV4_Processor_Manual.PDF mentions all 4 versions have a 3 stage pipeline.
https://www.wch-ic.com/downloads/QingKeV4_Processor_Manual_PDF.html

Btw, it seems the interrupt and stack mechanisms in the QingKeV4 are quite different from the M3's 2-stack solution.

It also has something called VTF (Vat The Feck); I have no idea how it distinguishes between interrupt sources.

Quote

3.5 Vector Table Free (VTF)
The Programmable Fast Interrupt Controller (PFIC) provides two VTF channels, i.e., direct access to the
interrupt function entry without going through the interrupt vector table lookup process.

