Author Topic: A couple questions about Milk-V Duo boards  (Read 5698 times)


Online brucehoult

Re: A couple questions about Milk-V Duo boards
« Reply #50 on: July 07, 2024, 09:23:53 am »
Oh you edited while I was writing my reply.

Sure, if you have the C extension then only 2-byte alignment is guaranteed by the compiler/linker. Without C you're guaranteed 4-byte alignment.

Many people add some set of -falign-functions / -falign-loops / -falign-labels with 4 or 8 or something to their CFLAGS to make branches cause less of a stall, especially with superscalar CPUs. But even with single-issue you might get a penalty for jumping to a misaligned 4-byte instruction -- or even to a 2-byte instruction that is not on a 4-byte boundary. It depends on the bus width between the L1 icache and the fetch/decode unit. It might only be instructions (4-byte ones, of course) that straddle an 8, 16, 64, or 4096 byte boundary that need two fetches to get the whole instruction.
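
For example, something along these lines in the Makefile -- the exact values are just a starting point to experiment with on a given core:

Code: [Select]
CFLAGS += -falign-functions=8 -falign-loops=8 -falign-labels=8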
 
The following users thanked this post: SiliconWizard

Online brucehoult

Re: A couple questions about Milk-V Duo boards
« Reply #51 on: July 07, 2024, 09:46:32 am »
Some interesting data from the dav1d porting effort ... 128-bit RVV on the Kendryte K230 (the first RVV 1.0 chip on the market, back in November) is very similar to NEON on the A53 in an Odroid C2 ... and the C908 core is significantly better on the scalar versions of the code.



Best to start a bit before that, but that's the data

The K230 uses the THead C908 core, which is dual-issue but otherwise appears to be very much just an updated C906, especially in the vector unit -- updated from RVV 0.7 to 1.0 without any serious performance changes, it seems.
 
The following users thanked this post: glenenglish

Offline SiliconWizard (Topic starter)

Re: A couple questions about Milk-V Duo boards
« Reply #52 on: July 08, 2024, 02:33:15 am »
A more conclusive result, with something simple: a weighted sum, which lends itself very easily to vectorization.

I compared a pure C version:
Code: [Select]
#include <stdint.h>

double WeightedSum(double *X, double *Y, uint64_t nPts)
{
    double Sum = 0.0;

    for (uint64_t i = 0; i < nPts; i++)
        Sum += X[i] * Y[i];

    return Sum;
}
Which gets compiled as:
Code: [Select]
        fmv.d.x fa5,zero
        mv      a5,s4
        mv      a4,s0
.L708:
        fld     fa3,0(a5)
        fld     fa4,0(a4)
        addi    a5,a5,8
        addi    a4,a4,8
        fmadd.d fa5,fa3,fa4,fa5
        bne     s0,a5,.L708

With the following hand-written assembly, still using double-precision FP:
Code: [Select]
VectD_Zero:
        .double 0.0
        .double 0.0

        .global Vector_WeightedSum
        .type   Vector_WeightedSum, @function

# a0 = X, a1 = Y, a2 = nPts (must be even -- see note below)
Vector_WeightedSum:
        beq     a2, zero, Vector_WeightedSum_End

        addi    sp, sp, -16                  # scratch space for the final 2-element store

        li      t0, 2
        th.vsetvli zero, t0, e64, m1         # vl = 2, 64-bit elements

        lla     t0, VectD_Zero
        th.vle.v v1, (t0)                    # v1 = {0.0, 0.0}: the accumulator

        mv      t1, a0                       # t1 = X
        th.addsl t3, a0, a2, 3               # t3 = X + nPts*8 (end of X)
        mv      t2, a1                       # t2 = Y

Vector_WeightedSum_Loop:
        th.vle.v v2, (t1)                    # load 2 doubles from X
        th.vle.v v3, (t2)                    # load 2 doubles from Y

        addi    t1, t1, 16
        addi    t2, t2, 16

        th.vfmacc.vv v1, v2, v3              # v1 += v2 * v3, lane by lane

        bgtu    t3, t1, Vector_WeightedSum_Loop

        th.vse.v v1, 0(sp)                   # spill the two partial sums
        fld     ft0, 0(sp)
        fld     ft1, 8(sp)
        fadd.d  fa0, ft0, ft1                # final horizontal add

        addi    sp, sp, 16

Vector_WeightedSum_End:
        ret
(nPts must be even. It's not checked here, and I didn't bother adding code to adjust for odd values, but this is easy to add.)
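
For example, a minimal sketch (untested) of folding in an odd final element from the C side; the WeightedSumAny name is just for illustration, and the extern prototype is what the assembly above implies:

Code: [Select]
#include <stdint.h>

/* even nPts only, per the note above */
extern double Vector_WeightedSum(double *X, double *Y, uint64_t nPts);

double WeightedSumAny(double *X, double *Y, uint64_t nPts)
{
    uint64_t Even = nPts & ~(uint64_t)1;    /* largest even count <= nPts */
    double Sum = Even ? Vector_WeightedSum(X, Y, Even) : 0.0;

    if (nPts & 1)                           /* fold in the last, odd element */
        Sum += X[nPts - 1] * Y[nPts - 1];

    return Sum;
}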

This time, the results are more conclusive. I tested with nPts = 1024.
C version: Cycles = 6169, CPI = 1.003
Assembly (vectorized): Cycles = 3679, CPI = 1.188 (so about 1.7 times faster).

So, even with 2 64-bit elements per vector, this is still worth it.
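
(For reference, a minimal sketch of how such cycle counts can be read on RV64 -- assuming the cycle CSR is accessible from the code's execution environment, and not necessarily how the numbers above were taken:)

Code: [Select]
#include <stdint.h>

static inline uint64_t ReadCycles(void)
{
    uint64_t c;
    __asm__ volatile ("rdcycle %0" : "=r"(c));   /* csrr c, cycle */
    return c;
}

/* uint64_t Start = ReadCycles();
   double S = Vector_WeightedSum(X, Y, 1024);
   uint64_t Cycles = ReadCycles() - Start; */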

I've since seen that one could load all zeros into a vector with the following sequence, instead of loading from memory as I did in the piece above:
Code: [Select]
fmv.d.x ft0, zero
vfmv.v.f v1, ft0

Probably much faster if the loaded "zero vector" is not in cache, but possibly slower otherwise (since it's 2 instructions instead of 1). Not sure which one should be preferred.
« Last Edit: July 08, 2024, 02:47:57 am by SiliconWizard »
 
The following users thanked this post: glenenglish

Online brucehoult

Re: A couple questions about Milk-V Duo boards
« Reply #53 on: July 08, 2024, 04:05:07 am »
Floating-point zero is all 0 bits, so you can just splat an integer 0, or XOR the total register with itself, or whatever.
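
For example, assuming the XTheadVector assembler accepts the usual integer-splat and XOR forms (untested here):

Code: [Select]
th.vmv.v.i  v1, 0          # splat integer 0 -- the same bit pattern as +0.0
# or
th.vxor.vv  v1, v1, v1     # clear the register by XORing it with itself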

Instead of saving the total vector to memory and then loading the individual elements back into FP registers, you can use vfredsum.vs. This will also make the code independent of whether you're using single or double precision, different LMUL, etc. In that case you'd also want to use vsetvli properly for the loop control, to deal with any number of elements (not just even counts).

The one trick is that you'd need to set vl to VLMAX before the vfmacc to make sure you don't inadvertently zero the tail elements of the totals. Technically you only need to do that on the last iteration (if vl < VLMAX), but it does no great harm to do it every time. Again, the tail elements after the short final vector you loaded from the input arrays will be zeros, and so won't affect the totals in those positions (though there will be wasted multiplies of 0*0).
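
Putting those pieces together, a sketch of what the strip-mined loop could look like -- untested, using the same th. mnemonics as above, and assuming vsetvli grants min(AVL, VLMAX) so that a huge requested AVL returns VLMAX:

Code: [Select]
# a0 = X, a1 = Y, a2 = nPts (any value, including 0 or odd)
Vector_WeightedSum:
        li      t3, -1
        th.vsetvli t3, t3, e64, m1           # huge AVL request -> t3 = VLMAX
        fmv.d.x ft0, zero
        th.vfmv.v.f v1, ft0                  # accumulator lanes = 0.0
        th.vfmv.v.f v0, ft0                  # v0[0] = 0.0: initial value for the reduction
        beq     a2, zero, Vector_WeightedSum_Reduce

Vector_WeightedSum_Loop:
        th.vsetvli t0, a2, e64, m1           # vl = min(remaining, VLMAX)
        th.vle.v v2, (a0)                    # a short final vector leaves zeroed tails (RVV 0.7)
        th.vle.v v3, (a1)
        slli    t1, t0, 3                    # advance both pointers by vl doubles
        add     a0, a0, t1
        add     a1, a1, t1
        sub     a2, a2, t0                   # remaining -= vl
        th.vsetvli zero, t3, e64, m1         # back to VLMAX so the accumulator's tail lanes survive
        th.vfmacc.vv v1, v2, v3              # v1 += v2 * v3, lane by lane
        bne     a2, zero, Vector_WeightedSum_Loop

Vector_WeightedSum_Reduce:
        th.vfredsum.vs v2, v1, v0            # v2[0] = v0[0] + sum of all lanes of v1
        th.vfmv.f.s fa0, v2
        ret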
 

Offline SiliconWizard (Topic starter)

Re: A couple questions about Milk-V Duo boards
« Reply #54 on: July 08, 2024, 04:30:43 am »
Quote from: brucehoult
Instead of saving the total vector to memory and then loading the individual elements back into FP registers, you can use vfredsum.vs.

I had actually seen that one, but the problem is then extracting the value into an FP register afterwards anyway. How would you do that without going through memory?

Edit: got it, th.vfmv.f.s does that - I had missed that one. So the new version is this:

Code: [Select]
# a0 = X, a1 = Y, a2 = nPts (must be even, as before)
Vector_WeightedSum:
        beq     a2, zero, Vector_WeightedSum_End

        li      t0, 2
        th.vsetvli zero, t0, e64, m1         # vl = 2, 64-bit elements

        fmv.d.x ft0, zero
        th.vfmv.v.f v1, ft0                  # accumulator v1 = {0.0, 0.0}

        mv      t1, a0                       # t1 = X
        th.addsl t3, a0, a2, 3               # t3 = X + nPts*8 (end of X)
        mv      t2, a1                       # t2 = Y

Vector_WeightedSum_Loop:
        th.vle.v v2, (t1)                    # load 2 doubles from X
        th.vle.v v3, (t2)                    # load 2 doubles from Y

        addi    t1, t1, 16
        addi    t2, t2, 16

        th.vfmacc.vv v1, v2, v3              # v1 += v2 * v3, lane by lane

        bgtu    t3, t1, Vector_WeightedSum_Loop

        th.vfmv.v.f v0, ft0                  # v0[0] = 0.0: initial value for the reduction
        th.vfredsum.vs v2, v1, v0            # v2[0] = v0[0] + sum of both lanes of v1
        th.vfmv.f.s fa0, v2                  # move the result into fa0

Vector_WeightedSum_End:
        ret

Now, a surprise that I don't get: the number of cycles jumps to 3875 compared to the previous version, with a CPI of 1.253. The number of cycles for the C version remains identical, so the measurement is reproducible.
Is it a cache difference, maybe? That would be odd, since I execute a dry run before timing the second run, precisely to get cache misses out of the way. What's up...

And... what do you know: it circles back to what we discussed earlier -- instruction alignment. I added .option norvc before my assembly function and, bam, I'm back to 3679 cycles. The fun fact is that it's the exact same number of cycles, but I bet that's because everything was already in the data cache; otherwise the difference would have been much greater. And it's cleaner this way, without going through memory.

So, yes, I can confirm that instruction alignment on this CPU makes a significant difference.
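
(For reference, GNU as also has an explicit alignment directive that can be combined with .option norvc; a sketch of what the top of the file could look like:)

Code: [Select]
        .option norvc                        # no compressed instructions in this file
        .balign 8                            # and/or force the entry point onto an 8-byte boundary
        .global Vector_WeightedSum
        .type   Vector_WeightedSum, @function
Vector_WeightedSum:
        ...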

« Last Edit: July 08, 2024, 04:57:27 am by SiliconWizard »
 

Online brucehoult

Re: A couple questions about Milk-V Duo boards
« Reply #55 on: July 08, 2024, 04:58:06 am »
vfmv.f.s fd, vs2
 

Online brucehoult

Re: A couple questions about Milk-V Duo boards
« Reply #56 on: July 08, 2024, 05:03:08 am »
Gah .. your edit was 39 seconds before my reply...
 

Offline SiliconWizard (Topic starter)

Re: A couple questions about Milk-V Duo boards
« Reply #57 on: July 08, 2024, 05:07:41 am »
I'll have to learn more about the strided addressing mode.
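
(For what it's worth, a strided load reads one element every rs2 bytes. A sketch, untested, assuming the 0.7 mnemonic follows the same th. pattern as the unit-stride loads above -- e.g. pulling the real parts out of an interleaved array of complex doubles:)

Code: [Select]
        li      t1, 16                       # byte stride: skip every other double
        th.vlse.v v2, (a0), t1               # v2 = { X[0], X[2], X[4], ... }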
 

