Author Topic: A couple questions about Milk-V Duo boards (Read 7644 times)

brucehoult · « **Reply #50 on:** July 07, 2024, 09:23:53 am »

Oh you edited while I was writing my reply.

Sure, if you have the C extension then only 2-byte alignment is guaranteed by the compiler/linker. Without C you're guaranteed 4-byte alignment.

Many people add some combination of -falign-functions, -falign-loops and -falign-labels with 4 or 8 or something to their CFLAGS to make branches cause less of a stall, especially with superscalar CPUs. But even with single-issue you might get a penalty for jumping to a misaligned 4-byte instruction, or even to a 2-byte instruction that is not on a 4-byte boundary. It depends on the bus width between the L1 icache and the fetch/decode unit. It might only be instructions (4-byte ones only, of course) that straddle an 8, 16, 64, or 4096 byte boundary that need two fetches to get the whole instruction.
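To make the "straddles a boundary" case concrete, here is a minimal C sketch. The function name `straddles` and the idea of passing the fetch width as a parameter are illustrative assumptions; the actual fetch width of any given core is not stated in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true if an instruction of `len` bytes starting at `addr`
 * crosses a `boundary`-byte block boundary.
 * `boundary` must be a non-zero power of two (e.g. 4, 8, 16, 64, 4096). */
static bool straddles(uint64_t addr, unsigned len, uint64_t boundary)
{
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    assert(len > 0);
    /* The first and last byte land in different boundary-sized blocks. */
    return (addr / boundary) != ((addr + len - 1) / boundary);
}
```

For example, a 4-byte instruction at 0x1006 crosses an 8-byte boundary (bytes 0x1006..0x1009), while a 2-byte instruction at the same address does not.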

brucehoult · « **Reply #51 on:** July 07, 2024, 09:46:32 am »

Some interesting data from the dav1d porting effort ... 128 bit RVV on the Kendryte K230 (first RVV 1.0 chip on the market, back in November) is very similar to NEON on the A53 in an Odroid C2 ... and the C908 core is significantly better on scalar versions of the code.

Best to start a bit before that, but that's the data

The K230 uses the THead C908 core which is dual-issue but otherwise very much appears to be just an updated C906, especially in the vector unit -- just updated from RVV 0.7 to 1.0 without any serious performance changes, it appears.

SiliconWizard · « **Reply #52 on:** July 08, 2024, 02:33:15 am »

A more conclusive result, with something simple: a weighted sum, which lends itself very easily to vectorization.

I compared a pure C version:

Code: [Select]

#include <stdint.h>

/* Sum of X[i] * Y[i] over nPts elements (scalar reference version). */
double WeightedSum(double *X, double *Y, uint64_t nPts)
{
	double Sum = 0.0;

	for (uint64_t i = 0; i < nPts; i++)
		Sum += X[i] * Y[i];

	return Sum;
}

Which gets compiled as:

Code: [Select]

	fmv.d.x	fa5,zero
	mv	a5,s4
	mv	a4,s0
.L708:
	fld	fa3,0(a5)
	fld	fa4,0(a4)
	addi	a5,a5,8
	addi	a4,a4,8
	fmadd.d	fa5,fa3,fa4,fa5
	bne	s0,a5,.L708

With the following hand-written assembly, still using double-precision FP:

Code: [Select]

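VectD_Zero:
	.double 0.0
	.double 0.0

.global Vector_WeightedSum
.type   Vector_WeightedSum, @function

# a0 = X, a1 = Y, a2 = nPts (must be even); result in fa0
Vector_WeightedSum:
	beq a2, zero, Vector_WeightedSum_End	# nPts == 0: return (fa0 is not set on this path)
	
	addi sp, sp, -16			# scratch space for the 2-element total
	
	li t0, 2
	th.vsetvli zero, t0, e64, m1		# vl = 2 elements of 64 bits
	
	lla t0, VectD_Zero
	th.vle.v v1, (t0)			# v1 = {0.0, 0.0}, the running totals
	
	mv t1, a0				# t1 = cursor into X
	th.addsl t3, a0, a2, 3			# t3 = X + nPts*8, end of X
	mv t2, a1				# t2 = cursor into Y
	
Vector_WeightedSum_Loop:
	th.vle.v v2, (t1)
	th.vle.v v3, (t2)
	
	addi t1, t1, 16
	addi t2, t2, 16
	
	th.vfmacc.vv v1, v2, v3			# v1 += v2 * v3, per element
	
	bgtu t3, t1, Vector_WeightedSum_Loop	# loop while t1 < t3
	
	th.vse.v v1, 0(sp)			# spill both totals to the stack
	fld ft0, 0(sp)
	fld ft1, 8(sp)
	fadd.d fa0, ft0, ft1			# fa0 = total[0] + total[1]
	
	addi sp, sp, 16
	
Vector_WeightedSum_End:
	ret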

(nPts must be even. It's not checked here, and I didn't bother adding code to adjust for odd values, but this is easy to add.)
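A minimal sketch in C of what the odd-count adjustment could look like, modelling the two vector lanes as two scalar accumulators and finishing with a scalar tail. The name `WeightedSum2` is hypothetical, and this is a model of the idea rather than the code that was benchmarked.

```c
#include <stdint.h>

/* Two-lane model of the vector loop, plus a scalar tail so that an odd
 * nPts is handled. The lane totals are combined at the end, so the
 * floating-point summation order differs from the plain C loop. */
double WeightedSum2(const double *X, const double *Y, uint64_t nPts)
{
    double lane0 = 0.0, lane1 = 0.0;
    uint64_t i = 0;

    for (; i + 1 < nPts; i += 2) {      /* full pairs */
        lane0 += X[i] * Y[i];
        lane1 += X[i + 1] * Y[i + 1];
    }

    double sum = lane0 + lane1;
    if (i < nPts)                       /* leftover element when nPts is odd */
        sum += X[i] * Y[i];
    return sum;
}
```

Because the additions happen in a different order, the result can differ from the scalar version in the last bits for general inputs.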

This time, the results are more conclusive. I tested with nPts = 1024.
C version: Cycles = 6169, CPI = 1.003
Assembly (vectored): Cycles = 3679, CPI = 1.188 (so about 1.7 times faster).

So, even with 2 64-bit elements per vector, this is still worth it.
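As a rough cross-check of these figures, the cycle counts and CPI values can be turned back into instruction counts. The helper names below are hypothetical, and since the CPI values are rounded to three decimals the results are only approximate.

```c
/* Retired instructions implied by a cycle count and a CPI value. */
static double instructions(double cycles, double cpi)
{
    return cycles / cpi;
}

/* How many times faster the `fast` run is than the `slow` run. */
static double speedup(double slow_cycles, double fast_cycles)
{
    return slow_cycles / fast_cycles;
}
```

With the numbers above, `speedup(6169, 3679)` is about 1.68. `instructions(6169, 1.003)` is about 6150, close to the 6 instructions x 1024 iterations = 6144 of the compiled C loop, and `instructions(3679, 1.188)` is about 3097, close to the 6 instructions x 512 iterations = 3072 of the vector loop. The remainder would be setup code and measurement overhead.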

I've since seen that one could load all zeros to a vector with the following sequence, instead of loading from memory as I did in the above piece:

Code: [Select]

fmv.d.x ft0, zero
vfmv.v.f v1, ft0

Probably much faster if the loaded "zero vector" is not in cache, otherwise possibly slower (as there are 2 instructions). Not sure which one should be preferred.

brucehoult · « **Reply #53 on:** July 08, 2024, 04:05:07 am »

Floating point zero is all 0s, so you can just splat an integer 0, or xor the total register with itself, or whatever.
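This is easy to check from C. A minimal sketch, assuming 64-bit IEEE 754 doubles (the helper name `double_bits` is made up for illustration):

```c
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(double) == sizeof(uint64_t), "expects 64-bit doubles");

/* Returns the raw bit pattern of a double. */
static uint64_t double_bits(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);   /* well-defined way to reinterpret the bytes */
    return bits;
}
```

`double_bits(0.0)` is 0, which is why an integer zero splat gives the same register contents. Note that negative zero is different: only the sign bit is set.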

Instead of saving the total vector to memory and then loading back the individual elements to FP registers you can use vfredsum.vs. This will also make the code independent of whether you're using single or double precision, different LMUL etc. In that case you'd also want to use vsetvli properly for the loop control, to deal with any number of elements (not just even / not even).

The one trick is you'd need to set vl to VLMAX before the vfmacc to make sure you don't inadvertently zero the tail elements of the totals. Technically you only need to do that on the last iteration (if vl < VLMAX) but it's no great harm to do it every time. Again, the tail elements after the short final vector you loaded from the input arrays will be zeros, and so not affect the totals in those positions (but there will be wasted multiplies of 0*0).
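The loop structure described here can be modelled in plain C. This is only a scalar emulation of the control flow under stated assumptions (`VLMAX` fixed at 2 as in this thread, hypothetical function name), not real vector code and not the exact instruction sequence.

```c
#include <stddef.h>
#include <stdint.h>

#define VLMAX 2  /* assumed: two 64-bit elements per vector register */

double WeightedSumStripMined(const double *X, const double *Y, uint64_t nPts)
{
    double acc[VLMAX] = {0.0};          /* per-lane running totals */

    while (nPts > 0) {
        /* Models vsetvli: handle at most VLMAX elements in this pass. */
        uint64_t vl = nPts < VLMAX ? nPts : VLMAX;

        /* Lanes >= vl keep their totals. In the real code the same
         * effect comes from running vfmacc at VLMAX over zeroed tail
         * elements, which adds 0*0 to those lanes. */
        for (uint64_t lane = 0; lane < vl; lane++)
            acc[lane] += X[lane] * Y[lane];

        X += vl;
        Y += vl;
        nPts -= vl;
    }

    /* Models the final reduction across all VLMAX lanes. */
    double sum = 0.0;
    for (size_t lane = 0; lane < VLMAX; lane++)
        sum += acc[lane];
    return sum;
}
```

The point of the model is that any element count works, odd or even, and that changing `VLMAX` does not require touching the loop body.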

SiliconWizard · « **Reply #54 on:** July 08, 2024, 04:30:43 am »

Quote from: brucehoult on July 08, 2024, 04:05:07 am

Instead of saving the total vector to memory and then loading back the individual elements to FP registers you can use vfredsum.vs.

I had seen this one actually, but the problem is still extracting the value to an FP register after that. How would you do it without going through memory?

Edit: got it, th.vfmv.f.s does that - I had missed this one. So new version is this:

Code: [Select]

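# a0 = X, a1 = Y, a2 = nPts (must be even); result in fa0
Vector_WeightedSum:
	beq a2, zero, Vector_WeightedSum_End	# nPts == 0: return (fa0 is not set on this path)
	
	li t0, 2
	th.vsetvli zero, t0, e64, m1		# vl = 2 elements of 64 bits
	
	fmv.d.x ft0, zero			# ft0 = 0.0
	th.vfmv.v.f v1, ft0			# v1 = {0.0, 0.0}, the running totals
	
	mv t1, a0				# t1 = cursor into X
	th.addsl t3, a0, a2, 3			# t3 = X + nPts*8, end of X
	mv t2, a1				# t2 = cursor into Y
	
Vector_WeightedSum_Loop:
	th.vle.v v2, (t1)
	th.vle.v v3, (t2)
	
	addi t1, t1, 16
	addi t2, t2, 16
	
	th.vfmacc.vv v1, v2, v3			# v1 += v2 * v3, per element
	
	bgtu t3, t1, Vector_WeightedSum_Loop	# loop while t1 < t3
	
	th.vfmv.v.f v0, ft0			# v0 = {0.0, 0.0}; element 0 seeds the reduction
	th.vfredsum.vs v2, v1, v0		# v2[0] = v0[0] + v1[0] + v1[1]
	th.vfmv.f.s fa0, v2			# fa0 = v2[0]
	
Vector_WeightedSum_End:
	ret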

Now, a surprise that I don't get. The number of cycles jumps to 3875, up from 3679 for the previous version, with a CPI of 1.253. The number of cycles for the C version remains identical, so it's reproducible.
Is it a difference with the cache maybe? Would be odd, since I execute a dry run before timing the second run, to get cache misses out of the way. What's up...

And... what do you know: circles back to what we discussed earlier. Alignment of instructions. I added .option norvc before my assembly function. And, bam, I'm back to 3679 cycles. Fun fact is that it's the exact same number of cycles, but I bet that's because everything was already in data cache, otherwise the difference would have been much greater. And, it's cleaner this way, without passing via memory.

So, yes, I can confirm that instruction alignment on this CPU makes a significant difference.
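To illustrate why .option norvc changes the layout, here is a small C sketch that walks a sequence of instruction lengths and counts the 4-byte instructions that do not start on a 4-byte boundary. The function name and the example sequences are hypothetical; the actual mix of compressed and uncompressed instructions in the function above was not posted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Lays out instructions of the given lengths (2 or 4 bytes each) back to
 * back from `start`, and counts the 4-byte instructions that do not
 * begin on a 4-byte boundary. */
static size_t count_misaligned_wide(uint64_t start, const uint8_t *lens, size_t n)
{
    size_t count = 0;
    uint64_t addr = start;

    for (size_t i = 0; i < n; i++) {
        assert(lens[i] == 2 || lens[i] == 4);
        if (lens[i] == 4 && (addr % 4) != 0)
            count++;
        addr += lens[i];
    }
    return count;
}
```

With lengths {2, 4, 4, 2, 4} starting at 0x1000, two of the three 4-byte instructions are misaligned. With all lengths equal to 4, as .option norvc produces, and a 4-byte-aligned start, none are.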

brucehoult · « **Reply #55 on:** July 08, 2024, 04:58:06 am »

vfmv.f.s fd, vs2

brucehoult · « **Reply #56 on:** July 08, 2024, 05:03:08 am »

Gah .. your edit was 39 seconds before my reply...

SiliconWizard · « **Reply #57 on:** July 08, 2024, 05:07:41 am »

I'll have to learn more about the strided addressing mode.

SiliconWizard · « **Reply #58 on:** July 20, 2024, 07:34:16 am »

Having some questions about the CLINT and PLIC. I can't really figure out how these are mapped in memory for multiple cores, at least on the CV1800B, whose documentation says basically nothing about this. There is documentation for the (open) C906, but it's not very helpful on this point either, as it documents the CLINT and PLIC the C906 implements as if there were only one core. Thanks.

So, for the CLINT: I suppose that on SoCs that have multiple RISC-V cores, each core has its own CLINT? How is it memory-mapped? Is each CLINT (registers) mapped to a different address, or all mapped at the same address? (If the latter, that should mean each core has a special case for accessing it that doesn't go through the AXI bus?) Possibly obvious question, but it's confusing to me.

About the PLIC: I suppose that there is only one PLIC for the whole SoC. The official spec defines PLIC registers for each hart (0, 1, etc). But on the CV1800B, the fun fact is that both cores have the same hart id (0). They apparently like jokes. So I can't figure out how to configure the PLIC for the second core.

The SDK source code didn't really help. There's a TON of source code all over the place, with references to the PLIC and CLINT, but all coded in different styles with a bunch of macros that make it hard to figure out what exactly they use in actual code that runs.

I was able to configure interrupts fine on the main core, but for the second core, this is puzzling. Any ideas?

brucehoult · « **Reply #59 on:** July 20, 2024, 10:04:32 am »

Quote from: SiliconWizard on July 20, 2024, 07:34:16 am

So, for the CLINT: I suppose that on SoCs that have multiple RISC-V cores, each core has its own CLINT? How is it memory-mapped? Is each CLINT (registers) mapped to a different address, or all mapped at the same address?

If the HARTs are intended to make up a multi-CPU computer under one OS then, yeah, the registers have different addresses, and each HART can access the registers of the other HARTs. It would be pretty difficult to do a software interrupt between HARTs otherwise!

For the way things are intended to be done it's usually useful to read SiFive documentation. For example the manual for the U74-MC core complex that is at the heart of SiFive's FU740 SoC (HiFive Unmatched) and StarFive's JH7100 and JH7110 SoCs (VisionFive, VisionFive 2, Pine64 Star64, Milk-V Mars, DC-Roma laptop etc). The U54-MC in the HiFive Unleashed, PolarFire SoC FPGA, PIC64GX are basically the same.

CLINT memory map on P190.

https://sifive.cdn.prismic.io/sifive/f24c0f97-cd86-4a88-9f2d-af23e8e32a10_u74mc_core_complex_manual_21G1.pdf
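For the multi-HART case, the SiFive-style CLINT layout can be sketched in C as below. The offsets are the ones SiFive uses; the function names are made up, the base address is left as a parameter, and nothing here is confirmed for the CV1800B.

```c
#include <stdint.h>

/* SiFive-style CLINT register offsets. */
#define CLINT_MSIP_OFFSET      0x0000u  /* one 32-bit MSIP register per hart */
#define CLINT_MTIMECMP_OFFSET  0x4000u  /* one 64-bit MTIMECMP register per hart */
#define CLINT_MTIME_OFFSET     0xBFF8u  /* single 64-bit MTIME shared by all harts */

/* Address of the software-interrupt register for `hart`. */
static uint64_t clint_msip_addr(uint64_t base, unsigned hart)
{
    return base + CLINT_MSIP_OFFSET + 4u * (uint64_t)hart;
}

/* Address of the timer-compare register for `hart`. */
static uint64_t clint_mtimecmp_addr(uint64_t base, unsigned hart)
{
    return base + CLINT_MTIMECMP_OFFSET + 8u * (uint64_t)hart;
}

/* Address of the shared timer register. */
static uint64_t clint_mtime_addr(uint64_t base)
{
    return base + CLINT_MTIME_OFFSET;
}
```

In this layout one hart raises a software interrupt on another by writing 1 to the other hart's MSIP register, which only works because each hart's registers sit at a distinct address.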

Now if two cores are NOT intended to be used under one OS ... which both having HART id 0 would indicate! ... then I don't know. Maybe they're at the same address and only accessible to their own core.

Quote

About the PLIC: I suppose that there is only one PLIC for the whole SoC. The official spec defines PLIC registers for each hart (0, 1, etc). But on the CV1800B, the fun fact is that both cores have the same hart id (0).

Yeah, tricky. PLIC is normally one for the computer. But, again, if they're treating this as two computers that just happen to share RAM (etc) then I don't know! There must be a way to route pins to the right core.
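For reference, the register layout from the RISC-V PLIC specification can be sketched in C like this. The function names are hypothetical and the base address is a parameter. How "contexts" map to harts and privilege modes is platform-specific, and whether the second CV1800B core follows this layout is exactly the open question here.

```c
#include <stdint.h>

/* Priority register for interrupt source `source` (32 bits each). */
static uint64_t plic_priority_addr(uint64_t base, unsigned source)
{
    return base + 4u * (uint64_t)source;
}

/* Enable word holding the bit for `source` in interrupt context `context`. */
static uint64_t plic_enable_addr(uint64_t base, unsigned context, unsigned source)
{
    return base + 0x2000u + 0x80u * (uint64_t)context + 4u * (uint64_t)(source / 32);
}

/* Bit mask for `source` within its enable word. */
static uint32_t plic_enable_bit(unsigned source)
{
    return (uint32_t)1 << (source % 32);
}

/* Priority threshold register for `context`. */
static uint64_t plic_threshold_addr(uint64_t base, unsigned context)
{
    return base + 0x200000u + 0x1000u * (uint64_t)context;
}

/* Claim/complete register for `context`. */
static uint64_t plic_claim_addr(uint64_t base, unsigned context)
{
    return plic_threshold_addr(base, context) + 4u;
}
```

On a normal multi-hart system each hart (and privilege mode) gets its own context number, so two cores that both report hart id 0 cannot be told apart by this scheme alone.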

Maybe look at the Arduino library source code? That runs on the 2nd core, and is probably simpler than the RTOS. I assume they support attaching a handler function to a pin interrupt, same as on other Arduino.

SiliconWizard · « **Reply #60 on:** July 20, 2024, 10:29:48 pm »

Quote from: brucehoult on July 20, 2024, 10:04:32 am

Now if two cores are NOT intended to be used under one OS ... which both having HART id 0 would indicate! ... then I don't know. Maybe they're at the same address and only accessible to their own core.

Yeah. Obviously, best scenario would just be that they had documented it. But they haven't. At least not in the public documentation.

So, there is reverse-engineering. Yes I've looked at the source code for the provided FreeRTOS which runs on the second core, but as I said, it's quite confusing. There are definitions all over the place, it's hard to follow. I haven't looked at the Arduino code, if it's less confusing to decipher, I may have a look.

Anyway, from what I've gathered from the source code in the SDK and my tests, I can confirm that:

- They have assigned the same hart ID to both C906 cores. Which, yes, I understand is "ok" if both cores are not supposed to be used in an SMP context, since they can't run the same code anyway, but I still find it rather odd, and an unfortunate decision. But it's just a detail here - the more confusing part was how the CLINT and PLIC are shared.

- The CLINT registers are mapped exactly at the same addresses for both cores. But, clearly they are separate. So the SoC must be internally mapping these addresses to different actual registers for either core. That's very confusing. But that's how it works here. For the PLIC, I haven't tested it yet on the second core, but from what I've gathered in the SDK again, I suspect this is the same thing. Actually, rather than a single PLIC, I suspect that each core has its own PLIC, which would explain why the PLIC interrupt numbers for each core are different.

Now, as the RISC-V specs (and even the C906 specs) do not mandate anything specific for this kind of case, I get it that every SoC designer is free to do whatever they want. I just find it a pretty odd decision, using the same memory-mapped addresses for separate registers. I haven't looked at the source code for the OpenC906 - possibly it would explain this decision in an obvious way.

