Author Topic: A couple questions about Milk-V Duo boards  (Read 5698 times)


Online brucehoult

Re: A couple questions about Milk-V Duo boards
« Reply #50 on: July 07, 2024, 09:23:53 am »
Oh you edited while I was writing my reply.

Sure, if you have the C extension then only 2-byte alignment is guaranteed by the compiler/linker. Without C you're guaranteed 4-byte alignment.

Many people add some set of -falign-functions / -falign-loops / -falign-labels with 4 or 8 or something to their CFLAGS to make branches cause less of a stall, especially with superscalar CPUs. But even with single-issue you might get a penalty for jumping to a misaligned 4-byte instruction -- or even to a 2-byte instruction that is not on a 4-byte boundary. It depends on the bus width between the L1 icache and the fetch/decode unit. It might only be instructions (4-byte ones, of course) that straddle an 8, 16, 64, or 4096 byte boundary that need two fetches to get the whole instruction.
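
For example, something along these lines in the Makefile -- the exact values are just a starting point to experiment with on a given core:

Code: [Select]
CFLAGS += -falign-functions=8 -falign-loops=8 -falign-labels=8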
 
The following users thanked this post: SiliconWizard

Online brucehoult

Re: A couple questions about Milk-V Duo boards
« Reply #51 on: July 07, 2024, 09:46:32 am »
Some interesting data from the dav1d porting effort ... 128-bit RVV on the Kendryte K230 (the first RVV 1.0 chip on the market, back in November) is very similar to NEON on the A53 in an Odroid C2 ... and the C908 core is significantly better on the scalar versions of the code.



Best to start a bit before that, but that's the data

The K230 uses the THead C908 core, which is dual-issue but otherwise appears to be very much just an updated C906, especially in the vector unit -- updated from RVV 0.7 to 1.0 without any serious performance changes, it seems.
 
The following users thanked this post: glenenglish

Offline SiliconWizard (Topic starter)

Re: A couple questions about Milk-V Duo boards
« Reply #52 on: July 08, 2024, 02:33:15 am »
A more conclusive result, with something simple: a weighted sum, which lends itself very easily to vectorization.

I compared a pure C version:
Code: [Select]
#include <stdint.h>

double WeightedSum(double *X, double *Y, uint64_t nPts)
{
    double Sum = 0.0;

    for (uint64_t i = 0; i < nPts; i++)
        Sum += X[i] * Y[i];

    return Sum;
}
Which gets compiled as:
Code: [Select]
        fmv.d.x fa5,zero
        mv      a5,s4
        mv      a4,s0
.L708:
        fld     fa3,0(a5)
        fld     fa4,0(a4)
        addi    a5,a5,8
        addi    a4,a4,8
        fmadd.d fa5,fa3,fa4,fa5
        bne     s0,a5,.L708

With the following hand-written assembly, still using double-precision FP:
Code: [Select]
VectD_Zero:
        .double 0.0
        .double 0.0

        .global Vector_WeightedSum
        .type   Vector_WeightedSum, @function

# a0 = X, a1 = Y, a2 = nPts (must be even -- see note below)
Vector_WeightedSum:
        beq     a2, zero, Vector_WeightedSum_End

        addi    sp, sp, -16                  # scratch space for the final 2-element store

        li      t0, 2
        th.vsetvli zero, t0, e64, m1         # vl = 2, 64-bit elements

        lla     t0, VectD_Zero
        th.vle.v v1, (t0)                    # v1 = {0.0, 0.0}: the accumulator

        mv      t1, a0                       # t1 = X
        th.addsl t3, a0, a2, 3               # t3 = X + nPts*8 (end of X)
        mv      t2, a1                       # t2 = Y

Vector_WeightedSum_Loop:
        th.vle.v v2, (t1)                    # load 2 doubles from X
        th.vle.v v3, (t2)                    # load 2 doubles from Y

        addi    t1, t1, 16
        addi    t2, t2, 16

        th.vfmacc.vv v1, v2, v3              # v1 += v2 * v3, lane by lane

        bgtu    t3, t1, Vector_WeightedSum_Loop

        th.vse.v v1, 0(sp)                   # spill the two partial sums
        fld     ft0, 0(sp)
        fld     ft1, 8(sp)
        fadd.d  fa0, ft0, ft1                # final horizontal add

        addi    sp, sp, 16

Vector_WeightedSum_End:
        ret
(nPts must be even. It's not checked here, and I didn't bother adding code to adjust for odd values, but this is easy to add.)
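
For example, a minimal sketch (untested) of folding in an odd final element from the C side; the WeightedSumAny name is just for illustration, and the extern prototype is what the assembly above implies:

Code: [Select]
#include <stdint.h>

/* even nPts only, per the note above */
extern double Vector_WeightedSum(double *X, double *Y, uint64_t nPts);

double WeightedSumAny(double *X, double *Y, uint64_t nPts)
{
    uint64_t Even = nPts & ~(uint64_t)1;    /* largest even count <= nPts */
    double Sum = Even ? Vector_WeightedSum(X, Y, Even) : 0.0;

    if (nPts & 1)                           /* fold in the last, odd element */
        Sum += X[nPts - 1] * Y[nPts - 1];

    return Sum;
}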

This time, the results are more conclusive. I tested with nPts = 1024.
C version: Cycles = 6169, CPI = 1.003
Assembly (vectorized): Cycles = 3679, CPI = 1.188 (so about 1.7 times faster).

So, even with 2 64-bit elements per vector, this is still worth it.
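
(For reference, a minimal sketch of how such cycle counts can be read on RV64 -- assuming the cycle CSR is accessible from the code's execution environment, and not necessarily how the numbers above were taken:)

Code: [Select]
#include <stdint.h>

static inline uint64_t ReadCycles(void)
{
    uint64_t c;
    __asm__ volatile ("rdcycle %0" : "=r"(c));   /* csrr c, cycle */
    return c;
}

/* uint64_t Start = ReadCycles();
   double S = Vector_WeightedSum(X, Y, 1024);
   uint64_t Cycles = ReadCycles() - Start; */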

I've since seen that one could load all zeros into a vector with the following sequence, instead of loading from memory as I did in the piece above:
Code: [Select]
fmv.d.x ft0, zero
vfmv.v.f v1, ft0

Probably much faster if the loaded "zero vector" is not in cache, but possibly slower otherwise (since it's 2 instructions instead of 1). Not sure which one should be preferred.
« Last Edit: July 08, 2024, 02:47:57 am by SiliconWizard »
 
The following users thanked this post: glenenglish

Online brucehoult

Re: A couple questions about Milk-V Duo boards
« Reply #53 on: July 08, 2024, 04:05:07 am »
Floating-point zero is all 0 bits, so you can just splat an integer 0, or XOR the total register with itself, or whatever.
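
For example, assuming the XTheadVector assembler accepts the usual integer-splat and XOR forms (untested here):

Code: [Select]
th.vmv.v.i  v1, 0          # splat integer 0 -- the same bit pattern as +0.0
# or
th.vxor.vv  v1, v1, v1     # clear the register by XORing it with itself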

Instead of saving the total vector to memory and then loading the individual elements back into FP registers, you can use vfredsum.vs. This will also make the code independent of whether you're using single or double precision, different LMUL, etc. In that case you'd also want to use vsetvli properly for the loop control, to deal with any number of elements (not just even counts).

The one trick is that you'd need to set vl to VLMAX before the vfmacc to make sure you don't inadvertently zero the tail elements of the totals. Technically you only need to do that on the last iteration (if vl < VLMAX), but it does no great harm to do it every time. Again, the tail elements after the short final vector you loaded from the input arrays will be zeros, and so won't affect the totals in those positions (though there will be wasted multiplies of 0*0).
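
Putting those pieces together, a sketch of what the strip-mined loop could look like -- untested, using the same th. mnemonics as above, and assuming vsetvli grants min(AVL, VLMAX) so that a huge requested AVL returns VLMAX:

Code: [Select]
# a0 = X, a1 = Y, a2 = nPts (any value, including 0 or odd)
Vector_WeightedSum:
        li      t3, -1
        th.vsetvli t3, t3, e64, m1           # huge AVL request -> t3 = VLMAX
        fmv.d.x ft0, zero
        th.vfmv.v.f v1, ft0                  # accumulator lanes = 0.0
        th.vfmv.v.f v0, ft0                  # v0[0] = 0.0: initial value for the reduction
        beq     a2, zero, Vector_WeightedSum_Reduce

Vector_WeightedSum_Loop:
        th.vsetvli t0, a2, e64, m1           # vl = min(remaining, VLMAX)
        th.vle.v v2, (a0)                    # a short final vector leaves zeroed tails (RVV 0.7)
        th.vle.v v3, (a1)
        slli    t1, t0, 3                    # advance both pointers by vl doubles
        add     a0, a0, t1
        add     a1, a1, t1
        sub     a2, a2, t0                   # remaining -= vl
        th.vsetvli zero, t3, e64, m1         # back to VLMAX so the accumulator's tail lanes survive
        th.vfmacc.vv v1, v2, v3              # v1 += v2 * v3, lane by lane
        bne     a2, zero, Vector_WeightedSum_Loop

Vector_WeightedSum_Reduce:
        th.vfredsum.vs v2, v1, v0            # v2[0] = v0[0] + sum of all lanes of v1
        th.vfmv.f.s fa0, v2
        ret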
 

Offline SiliconWizard (Topic starter)

Re: A couple questions about Milk-V Duo boards
« Reply #54 on: July 08, 2024, 04:30:43 am »
Quote from: brucehoult
Instead of saving the total vector to memory and then loading the individual elements back into FP registers, you can use vfredsum.vs.

I had actually seen that one, but the problem is then extracting the value into an FP register afterwards anyway. How would you do that without going through memory?

Edit: got it, th.vfmv.f.s does that - I had missed that one. So the new version is this:

Code: [Select]
# a0 = X, a1 = Y, a2 = nPts (must be even, as before)
Vector_WeightedSum:
        beq     a2, zero, Vector_WeightedSum_End

        li      t0, 2
        th.vsetvli zero, t0, e64, m1         # vl = 2, 64-bit elements

        fmv.d.x ft0, zero
        th.vfmv.v.f v1, ft0                  # accumulator v1 = {0.0, 0.0}

        mv      t1, a0                       # t1 = X
        th.addsl t3, a0, a2, 3               # t3 = X + nPts*8 (end of X)
        mv      t2, a1                       # t2 = Y

Vector_WeightedSum_Loop:
        th.vle.v v2, (t1)                    # load 2 doubles from X
        th.vle.v v3, (t2)                    # load 2 doubles from Y

        addi    t1, t1, 16
        addi    t2, t2, 16

        th.vfmacc.vv v1, v2, v3              # v1 += v2 * v3, lane by lane

        bgtu    t3, t1, Vector_WeightedSum_Loop

        th.vfmv.v.f v0, ft0                  # v0[0] = 0.0: initial value for the reduction
        th.vfredsum.vs v2, v1, v0            # v2[0] = v0[0] + sum of both lanes of v1
        th.vfmv.f.s fa0, v2                  # move the result into fa0

Vector_WeightedSum_End:
        ret

Now, a surprise that I don't get: the number of cycles jumps to 3875 compared to the previous version, with a CPI of 1.253. The number of cycles for the C version remains identical, so the measurement is reproducible.
Is it a cache difference, maybe? That would be odd, since I execute a dry run before timing the second run, precisely to get cache misses out of the way. What's up...

And... what do you know: it circles back to what we discussed earlier -- instruction alignment. I added .option norvc before my assembly function and, bam, I'm back to 3679 cycles. The fun fact is that it's the exact same number of cycles, but I bet that's because everything was already in the data cache; otherwise the difference would have been much greater. And it's cleaner this way, without going through memory.

So, yes, I can confirm that instruction alignment on this CPU makes a significant difference.
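
(For reference, GNU as also has an explicit alignment directive that can be combined with .option norvc; a sketch of what the top of the file could look like:)

Code: [Select]
        .option norvc                        # no compressed instructions in this file
        .balign 8                            # and/or force the entry point onto an 8-byte boundary
        .global Vector_WeightedSum
        .type   Vector_WeightedSum, @function
Vector_WeightedSum:
        ...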

« Last Edit: July 08, 2024, 04:57:27 am by SiliconWizard »
 

Online brucehoult

Re: A couple questions about Milk-V Duo boards
« Reply #55 on: July 08, 2024, 04:58:06 am »
vfmv.f.s fd, vs2
 

Online brucehoult

Re: A couple questions about Milk-V Duo boards
« Reply #56 on: July 08, 2024, 05:03:08 am »
Gah .. your edit was 39 seconds before my reply...
 

Offline SiliconWizard (Topic starter)

Re: A couple questions about Milk-V Duo boards
« Reply #57 on: July 08, 2024, 05:07:41 am »
I'll have to learn more about the strided addressing mode.
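
(For what it's worth, a strided load reads one element every rs2 bytes. A sketch, untested, assuming the 0.7 mnemonic follows the same th. pattern as the unit-stride loads above -- e.g. pulling the real parts out of an interleaved array of complex doubles:)

Code: [Select]
        li      t1, 16                       # byte stride: skip every other double
        th.vlse.v v2, (a0), t1               # v2 = { X[0], X[2], X[4], ... }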
 

