A more conclusive result, with something simple: a weighted sum, which lends itself very easily to vectorization.
I compared a pure C version:
double WeightedSum(double *X, double *Y, uint64_t nPts)
{
    double Sum = 0.0;

    for (uint64_t i = 0; i < nPts; i++)
        Sum += X[i] * Y[i];

    return Sum;
}
The inner loop gets compiled as:
        fmv.d.x fa5,zero            # Sum = 0.0
        mv      a5,s4               # a5 walks one array
        mv      a4,s0               # a4 walks the other
.L708:
        fld     fa3,0(a5)
        fld     fa4,0(a4)
        addi    a5,a5,8
        addi    a4,a4,8
        fmadd.d fa5,fa3,fa4,fa5     # Sum += X[i] * Y[i]
        bne     s0,a5,.L708
against the following hand-written vector assembly, still using double-precision FP:
VectD_Zero:
        .double 0.0                             # {0.0, 0.0}: initial accumulator
        .double 0.0

        .global Vector_WeightedSum
        .type   Vector_WeightedSum, @function
Vector_WeightedSum:
        beq     a2, zero, Vector_WeightedSum_End # nothing to do if nPts == 0
        addi    sp, sp, -16                     # room to spill the result vector
        li      t0, 2
        th.vsetvli zero, t0, e64, m1            # VL = 2 elements, 64 bits each
        lla     t0, VectD_Zero
        th.vle.v v1, (t0)                       # v1 = {0.0, 0.0} (accumulator)
        mv      t1, a0                          # t1 = X
        th.addsl t3, a0, a2, 3                  # t3 = X + nPts * 8 (end of X)
        mv      t2, a1                          # t2 = Y
Vector_WeightedSum_Loop:
        th.vle.v v2, (t1)                       # load 2 doubles from X
        th.vle.v v3, (t2)                       # load 2 doubles from Y
        addi    t1, t1, 16
        addi    t2, t2, 16
        th.vfmacc.vv v1, v2, v3                 # v1 += v2 * v3, element-wise
        bgtu    t3, t1, Vector_WeightedSum_Loop
        th.vse.v v1, 0(sp)                      # spill the 2 partial sums
        fld     ft0, 0(sp)
        fld     ft1, 8(sp)
        fadd.d  fa0, ft0, ft1                   # horizontal add: final result
        addi    sp, sp, 16
Vector_WeightedSum_End:
        ret
(nPts must be even. It's not checked here, and I didn't bother adding code to handle odd values, but that would be easy to add.)
This time, the results are more conclusive. I tested with nPts = 1024.
C version: Cycles = 6169, CPI = 1.003
Assembly (vectorized): Cycles = 3679, CPI = 1.188 (so about 1.7 times faster).
So, even with 2 64-bit elements per vector, this is still worth it.
I've since seen that one could load all zeros into a vector with the following sequence, instead of loading them from memory as I did in the piece above:
fmv.d.x ft0, zero
vfmv.v.f v1, ft0
This is probably much faster when the "zero vector" constant is not in cache, but possibly slower otherwise (it is 2 instructions instead of 1). I'm not sure which one should be preferred.