Topic: STM32F103C8T6 vs. CH32V203C8T6: same computing efficiency?

MT (Topic starter)

  • Super Contributor
  • Posts: 1675
  • Country: aq
« on: February 17, 2024, 04:18:06 am »
These are basically identical internally and externally (PCB). The ST runs at 72 MHz and the CH at 144 MHz, yet the CH consumes half as much current at full speed, both in an identical setup.
Is there any computing-efficiency comparison chart for ARM cores vs. RISC-V cores, in this case the Cortex-M3 and the RISC-V QingKeV4? How much slower is the QingKeV4?
 

brucehoult

  • Super Contributor
  • Posts: 4533
  • Country: nz
« Reply #1 on: February 17, 2024, 05:25:04 am »
Quote
How much slower is the QingKeV4?

Why would it be slower than a Cortex-M3?

I wouldn't expect it to be slower at the same clock speed, let alone at twice the MHz.

I haven't tested a QingKeV4 but from the manual it looks very comparable to SiFive's original E31 core in the HiFive1 in late 2016, except the QingKeV4 has a 3-stage pipeline vs 5-stage for the E31. That has two likely effects: 1) a lower Fmax (the HiFive1 ran at 320 MHz in 180nm, while the CH32V203 is as you say running at 144 MHz), and 2) a smaller penalty for branch mispredict.

The QingKeV4 has typical branch prediction facilities including a Branch History Table, a Branch Target Buffer, and a Return Address Stack. I believe the M3 has none of these.

I have a very simple benchmark program I created in 2016 which I have run (or others have run for me) on a lot of different machines. I find it correlates well with more complex benchmarks such as Dhrystone or Coremark and with real programs.

https://hoult.org/primes.txt

I have data there for the HiFive1 and also for an M3 (BluePill), M4F (BlackPill), and M7 (Teensy 4):

Code:
    43.516 sec Teensy 4.0 Cortex M7 @ 600 MHz        228 bytes  26.1 billion clocks
   112.163 sec HiFive1 RISCV E31 @ 320 MHz           178 bytes  35.9 billion clocks
   309.251 sec BlackPill Cortex M4F @ 168 MHz        228 bytes  52.0 billion clocks
   927.547 sec BluePill Cortex M3 @ 72 MHz           228 bytes  66.8 billion clocks

If someone wants to run this on a WCH core or cores and send me the results I would appreciate that, but I expect the QingKeV4 to come in very similar to the E31 in terms of clock cycles used.
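For anyone wanting to check the "billion clocks" column in the table above: it is simply run time multiplied by clock frequency. A minimal Python sketch, with a made-up helper name and the figures copied from the table:

```python
# Illustrative only: reproduces the "billion clocks" column from the
# run time and clock frequency listed in the table above.

def billion_clocks(seconds: float, mhz: float) -> float:
    """Total clock cycles, in billions, for a run of `seconds` at `mhz` MHz."""
    return seconds * mhz * 1e6 / 1e9

# (name, run time in seconds, clock in MHz), as listed in primes.txt
results = [
    ("Teensy 4.0 Cortex M7", 43.516, 600),
    ("HiFive1 RISCV E31", 112.163, 320),
    ("BlackPill Cortex M4F", 309.251, 168),
    ("BluePill Cortex M3", 927.547, 72),
]

for name, seconds, mhz in results:
    print(f"{name:22s} {billion_clocks(seconds, mhz):5.1f} billion clocks")
```

Comparing cycles rather than seconds takes the clock frequency out of the picture, which is what matters for a per-clock efficiency comparison.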
« Last Edit: February 17, 2024, 05:35:22 am by brucehoult »
 
The following users thanked this post: hans

SiliconWizard

  • Super Contributor
  • Posts: 15413
  • Country: fr
« Reply #2 on: February 17, 2024, 07:54:45 am »
Quote
If someone wants to run this on a WCH core or cores and send me the results I would appreciate that, but I expect the QingKeV4 to come in very similar to the E31 in terms of clock cycles used.

Ah, I can do that (on a CH32V307 or CH32V305). I'll let you know when I get around to it.

Note that the CH32V203 has a Qingke V4B core, while the 30x have a V4F core. Apart from the V4F having an embedded FPU, it has more HPE (hardware prologue/epilogue) levels (3 vs 2) and more interrupt nesting levels (8 vs 2), but otherwise seems similar. There may be other minor differences, though: the official manual lists the V4B and the V4C with exactly the same features, for instance, and there probably are differences. So, who knows.

For pure integer calculations, memory access and branches, though (which are the points that matter for your benchmark), it's probably almost identical.
 

MT (Topic starter)
« Reply #3 on: February 17, 2024, 11:12:27 am »
Quote
How much slower is the QingKeV4?

Why would it be slower than a Cortex-M3?
I wouldn't expect it to be slower at the same clock speed, let alone at twice the MHz.

I had read a bunch of tech articles, not entirely recent, that claimed ARM was more efficient per clock cycle than RISC-V, etc.

Interesting stats there, thank you!
« Last Edit: February 18, 2024, 09:33:58 pm by MT »
 

SiliconWizard
« Reply #4 on: February 17, 2024, 11:47:43 pm »
Alright, I've done the test with Bruce's primes on the CH32V307 @ 144 MHz.
Here are the results:

// 319.155 sec WCH32V307 @ 144 MHz (-O1)       ??? bytes  46.0 billion clocks
// 293.020 sec WCH32V307 @ 144 MHz (-O3)       ??? bytes  42.2 billion clocks

(Normally, Bruce's primes is meant to be compiled at -O1. I tested also at -O3 to see the difference.)
Compiled with mainline GCC 13.2 (riscv64-unknown-elf).

I was always a bit surprised by the results with the Cortex M3 and M4 in this list (haven't tested primes myself on these). It's rather surprising that at least the M4 seems to do worse than a "modest" RISC-V.
Note that benchmarks are benchmarks and tell you only a small story. CoreMark, for instance, definitely shows a significant advantage for the Cortex M4. So, don't get hung up on exact numbers here - what it shows is that the Qingke V4 is definitely on par with a Cortex M3/M4.

Small note: I derived the test code (outside of Bruce's function) from the one I had written for my own RISC-V core, and it used the rdcycle and rdinstret instructions to get the number of cycles and number of retired instructions (to compute the CPI). This has shown me that neither counter seems to be implemented on the Qingke V4; both always return 0. I had to use the SysTick timer instead, configured as an up counter with HCLK as time base, to get the number of cycles. And I didn't find anything to get the number of executed instructions. Not that it should matter for "normal" work with it, but I found this odd.
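The measurement described above boils down to two reads of a free-running counter, a subtraction that tolerates wraparound, and (where a retired-instruction count is available) a division to get CPI. A rough host-side sketch in Python, not the firmware used for this test; the function names and the 32-bit default width are assumptions for illustration only:

```python
# Hypothetical helpers; the counter width is a parameter because the actual
# width of the timer used on the target is not stated in this thread.

def elapsed_ticks(start: int, end: int, bits: int = 32) -> int:
    """Ticks between two reads of a free-running up counter `bits` wide.
    Correct as long as the counter wrapped at most once between the reads."""
    return (end - start) % (1 << bits)

def wrap_period_s(clock_hz: float, bits: int = 32) -> float:
    """Seconds until a `bits`-wide counter clocked at `clock_hz` wraps around."""
    return (1 << bits) / clock_hz

def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction, given cycle and retired-instruction counts."""
    return cycles / instructions

HCLK_HZ = 144_000_000
print(f"32-bit counter wraps every {wrap_period_s(HCLK_HZ):.1f} s at 144 MHz")
print(f"ticks across a wrap: {elapsed_ticks(0xFFFFFF00, 0x00000100)}")
```

Since a 32-bit counter at 144 MHz wraps well within the roughly 300 s run time reported here, a measurement like this needs either a wider counter or a count of overflows on top of the simple subtraction.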
« Last Edit: February 18, 2024, 12:05:06 am by SiliconWizard »
 

brucehoult
« Reply #5 on: February 18, 2024, 02:49:14 am »
Quote
Alright, I've done the test with Bruce's primes on the CH32V307 @ 144 MHz.
Here are the results:

// 319.155 sec WCH32V307 @ 144 MHz (-O1)       ??? bytes  46.0 billion clocks

Thanks for that! Added to the official file. I've assigned it the 202 bytes code size GCC 13.2 gets for RV32GC on Godbolt.


Quote
Note that benchmarks are benchmarks and tell you only a small story. CoreMark, for instance, definitely shows a significant advantage for the Cortex M4. So, don't get hung up on exact numbers here - what it shows is that the Qingke V4 is definitely on par with a Cortex M3/M4.

Sure, different code will give different results. Apply maybe ±10% uncertainty for a different benchmark. My "primes" is pretty branchy and benefits a lot from RISC-V's combined compare-and-branch instructions. This isn't deliberate -- I wrote the code before I even knew RISC-V existed.

Note that Coremark is effectively owned by Arm and there's a bit of history there. In early RISC-V days better Coremark numbers were published, but they required adding a #define or changing a typedef so that loop counters (and therefore array indexes) were int not uint as in the official code. Arm zero-extends 32-bit values, which makes uint and ulong the same bits. RISC-V sign-extends 32-bit values, which makes int, long, and ulong (aka size_t) the same bits, but uint needs explicit zero-extending to make it long or ulong to (shift and) add to a pointer for array indexing. After a bit Arm/Coremark sent a communication saying "You're not allowed to change the types -- you must use the benchmark code exactly as provided". And so RISC-V numbers got worse. Eventually, we added in the B extension the instructions SH2ADD.UW, SH3ADD.UW, etc., which zero-extend rs1 before shifting it, and this fixes the CoreMark problem (and other things).
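To make the int vs uint point concrete, here is a small Python model of a 64-bit register holding a 32-bit value. It is only a sketch: the helper names are made up, and `sh2add_uw` models the documented effect of the instruction (rs2 plus the zero-extended low 32 bits of rs1 shifted left by 2), not any particular implementation.

```python
MASK32 = 0xFFFF_FFFF
MASK64 = 0xFFFF_FFFF_FFFF_FFFF

def sext32(x: int) -> int:
    """64-bit register contents when a 32-bit value is kept sign-extended."""
    x &= MASK32
    return (x | ~MASK32) & MASK64 if x & 0x8000_0000 else x

def zext32(x: int) -> int:
    """64-bit register contents when a 32-bit value is kept zero-extended."""
    return x & MASK32

def sh2add_uw(rs1: int, rs2: int) -> int:
    """Model of SH2ADD.UW: rs2 + (zero-extended low 32 bits of rs1 << 2)."""
    return (rs2 + (zext32(rs1) << 2)) & MASK64

base = 0x1000                 # hypothetical array base address
index = 0x8000_0000           # a uint index with bit 31 set
reg = sext32(index)           # how the 32-bit value sits in a 64-bit register

naive = (base + (reg << 2)) & MASK64   # shift-and-add without zero-extending
fixed = sh2add_uw(reg, base)           # zero-extension folded into the add

print(hex(reg), hex(naive), hex(fixed))
```

For small indexes the sign-extended and zero-extended forms are identical, but the compiler generally cannot prove the top bit is clear, so without an instruction like this it has to emit the extra zero-extension for every uint index.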

That doesn't apply to 32 bit microcontrollers, of course.
 

SiliconWizard
« Reply #6 on: February 18, 2024, 06:16:27 am »
Yes, thanks for the details about the benchmarks. Yes, your "primes" has very tight loops with a lot of branches, so that makes sense with your explanation that RISC-V is a bit favored compared to ARM.
The SiFive E31 appears to have a significant performance advantage (nearly -30% cycles) compared to the Qingke V4. Possibly a more effective branch prediction, and the pipeline may also be different. I haven't seen the depth of the pipeline for the Qingke V4 documented, so I can't tell what it is exactly. If anyone knows.
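For reference, the cycle counts quoted earlier in this thread (35.9 billion for the E31, 46.0 billion for the CH32V307 at -O1) can be compared two ways, depending on which core is taken as the baseline. A quick Python check, with made-up helper names:

```python
# Cycle counts in billions, as quoted earlier in the thread.
E31_CYCLES = 35.9
V4_CYCLES = 46.0

def pct_fewer(a: float, b: float) -> float:
    """How many percent fewer cycles `a` is than `b` (baseline: b)."""
    return (b - a) / b * 100.0

def pct_more(a: float, b: float) -> float:
    """How many percent more cycles `b` is than `a` (baseline: a)."""
    return (b - a) / a * 100.0

print(f"E31 uses {pct_fewer(E31_CYCLES, V4_CYCLES):.1f}% fewer cycles than the V4")
print(f"V4 uses {pct_more(E31_CYCLES, V4_CYCLES):.1f}% more cycles than the E31")
```

So the gap is about 22% fewer cycles for the E31, or equivalently about 28% more cycles for the Qingke V4, which is where a figure close to 30% comes from.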

The datasheet talks briefly about the difference between the V4B (CH32V203) and the V4C (CH32V208) - they say the V4C has faster hardware division and improved memory protection (haven't seen yet what the exact difference with PMP is).
 

MT (Topic starter)
« Reply #7 on: February 18, 2024, 10:06:13 pm »
The QingKeV4_Processor_Manual.PDF mentions all 4 versions have a 3 stage pipeline.
https://www.wch-ic.com/downloads/QingKeV4_Processor_Manual_PDF.html

Btw, it seems the interrupt and stack mechanisms in the QingKeV4 are quite different from the M3's two-stack solution.

It also has something called VTF (Vat The Feck); I have no idea how it distinguishes between interrupt sources.
Quote
3.5 Vector Table Free (VTF)
The Programmable Fast Interrupt Controller (PFIC) provides two VTF channels, i.e., direct access to the interrupt function entry without going through the interrupt vector table lookup process.
« Last Edit: February 19, 2024, 12:14:18 am by MT »
 

