Author Topic: The RISC-V ISA discussion (Read 19346 times)

legacy · « **Reply #75 on:** January 03, 2020, 03:40:33 pm »

Quote from: SiliconWizard on January 03, 2020, 02:44:58 pm

should be achievable with a simple multiply operator in HDL without harming the max frequency too much (at least on those FPGAs), the divides are another story

The ones implemented in my softcore Arise-v2 are a bit slow in terms of clock cycles, but they don't add any fmax penalty to the synthesis. The Mul is a 32-bit MAC(1) and costs the FSM a 13-clock-cycle wait, while the Div costs the FSM a 34-clock-cycle wait.

(1) not using any DSP slice. It's manually implemented.
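
For readers wondering where a figure like 34 cycles can come from: one common way to build a multi-cycle divider is a restoring shift-and-subtract loop that produces one quotient bit per clock, so a 32-bit divide takes 32 iterations plus a little setup. The sketch below only illustrates that technique in C, with one loop iteration standing for one hardware cycle. It is an assumption, not a description of how Arise-v2 actually divides, and the names `DivResult` and `div_restoring_u32` are made up for the example.

```c
#include <stdint.h>

/* Hypothetical model of a restoring divider: one loop iteration stands for
 * one clock cycle of a sequential hardware divider. Not taken from Arise-v2. */
typedef struct {
    uint32_t quotient;
    uint32_t remainder;
    unsigned steps;      /* number of shift/subtract iterations performed */
} DivResult;

DivResult div_restoring_u32(uint32_t dividend, uint32_t divisor)
{
    uint64_t rem = 0;    /* needs 33 bits during the trial subtraction */
    uint32_t quo = 0;
    unsigned steps = 0;

    for (int bit = 31; bit >= 0; bit--) {
        /* Shift the next dividend bit (MSB first) into the partial remainder. */
        rem = (rem << 1) | ((dividend >> bit) & 1u);
        quo <<= 1;
        /* Trial subtraction: keep it only if the result is not negative. */
        if (rem >= divisor) {
            rem -= divisor;
            quo |= 1u;
        }
        steps++;
    }

    DivResult res = { quo, (uint32_t)rem, steps };
    return res;
}
```

With a zero divisor this loop yields an all-ones quotient and returns the dividend as the remainder, which happens to match what the RISC-V M extension specifies for DIVU/REMU by zero.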

brucehoult · « **Reply #76 on:** January 03, 2020, 09:31:24 pm »

Quote from: SiliconWizard on January 03, 2020, 02:44:58 pm

Quote from: legacy on January 02, 2020, 10:58:52 pm
Do you have mul and div? If yes, how are they implemented?

As he said RV32I, no, it doesn't include MUL and DIV ops, which are part of an extension in RISC-V ('M'). Whereas the multiplies, with his non-pipelined design, should be achievable with a simple multiply operator in HDL without harming the max frequency too much (at least on those FPGAs), the divides are another story. I doubt they can be achieved without pipelining, and then it creates a whole new data hazard to handle...

If you tell gcc to compile for RV32I then it will automatically use and link in software multiply and divide routines from libgcc if they are needed.

It's possible to use RV32IM with a hardware multiply but not divide. Well, in fact you're allowed to compile code using RV32IM without having either multiply or divide instructions, as long as you ensure the "unimplemented instruction" traps go to a handler that emulates them. Saying a system supports RV32IM (or any other extension) promises that programs using multiply and divide opcodes will work, it doesn't promise that they will be fast.

So, you can implement multiply in hardware but not divide, implement a handler for divide and happily compile for RV32IM. Divide will be slower than if you called the libgcc routine directly, but possibly not all that much slower, depending on how efficiently you implement the trap handler.

There is a -mno-div flag to tell gcc not to use the divide instruction even though you're compiling for RV32IM.
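
To make the trap-handler approach above a bit more concrete, here is a minimal sketch of the emulation step for the divide/remainder half of RV32M. The field positions, the OP opcode (0x33), funct7 = 1 and the divide-by-zero/overflow results follow the RISC-V unprivileged spec as I understand it; the function name `emulate_rv32m_div` and the `regs` array (the saved x0..x31 of the trapped context) are hypothetical, and a real handler would also need the register save/restore around it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch only: emulate DIV/DIVU/REM/REMU in software, as an
 * illegal-instruction handler might. `regs` is a hypothetical array
 * holding the saved x0..x31 of the trapped context.
 * Returns false if the instruction is not one of those four. */
bool emulate_rv32m_div(uint32_t insn, uint32_t regs[32])
{
    uint32_t opcode = insn & 0x7Fu;
    uint32_t rd     = (insn >> 7)  & 0x1Fu;
    uint32_t funct3 = (insn >> 12) & 0x7u;
    uint32_t rs1    = (insn >> 15) & 0x1Fu;
    uint32_t rs2    = (insn >> 20) & 0x1Fu;
    uint32_t funct7 = insn >> 25;

    /* OP major opcode with funct7 = 0000001 selects the M extension;
     * funct3 values 4..7 are DIV, DIVU, REM, REMU. */
    if (opcode != 0x33u || funct7 != 0x01u || funct3 < 4u)
        return false;

    uint32_t a = regs[rs1], b = regs[rs2], result;
    bool overflow = (a == 0x80000000u) && (b == 0xFFFFFFFFu);

    switch (funct3) {
    case 4: /* DIV */
        if (b == 0)        result = 0xFFFFFFFFu;
        else if (overflow) result = 0x80000000u;
        else               result = (uint32_t)((int32_t)a / (int32_t)b);
        break;
    case 5: /* DIVU */
        result = (b == 0) ? 0xFFFFFFFFu : a / b;
        break;
    case 6: /* REM */
        if (b == 0)        result = a;
        else if (overflow) result = 0;
        else               result = (uint32_t)((int32_t)a % (int32_t)b);
        break;
    default: /* 7: REMU */
        result = (b == 0) ? a : a % b;
        break;
    }

    if (rd != 0)          /* x0 is hard-wired to zero */
        regs[rd] = result;
    return true;
}
```

After a successful emulation the handler would advance mepc by 4 before returning, so that execution resumes after the emulated instruction. Note that on a core without hardware divide, the `/` and `%` operators above would themselves compile to the libgcc routines.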

SiliconWizard · « **Reply #77 on:** January 04, 2020, 10:38:33 pm »

Quote from: brucehoult on January 03, 2020, 09:31:24 pm

Quote from: SiliconWizard on January 03, 2020, 02:44:58 pm
Quote from: legacy on January 02, 2020, 10:58:52 pm
Do you have mul and div? If yes, how are they implemented?

As he said RV32I, no, it doesn't include MUL and DIV ops, which are part of an extension in RISC-V ('M'). Whereas the multiplies, with his non-pipelined design, should be achievable with a simple multiply operator in HDL without harming the max frequency too much (at least on those FPGAs), the divides are another story. I doubt they can be achieved without pipelining, and then it creates a whole new data hazard to handle...

If you tell gcc to compile for RV32I then it will automatically use and link in software multiply and divide routines from libgcc if they are needed.

Well, sure! It's just software emulation, it's not like actually having multiply and divide. You can emulate anything in software, and yes, GCC can emulate integer multiplies and divides, and of course FP operations. And in this case, GCC is implementing them, not actually you. So that wouldn't really answer legacy's question.

Quote from: brucehoult on January 03, 2020, 09:31:24 pm

It's possible to use RV32IM with a hardware multiply but not divide. Well, in fact you're allowed to compile code using RV32IM without having either multiply or divide instructions, as long as you ensure the "unimplemented instruction" traps go to a handler that emulates them. Saying a system supports RV32IM (or any other extension) promises that programs using multiply and divide opcodes will work, it doesn't promise that they will be fast.

Well well. Certainly again. But this would be software emulation as well (this time through traps). Of course nothing is said about speed in the spec. But I would still consider the above approach dubious. While the spec doesn't say a given instruction has to be fast, I think it's pretty reasonable to assume that the core will not raise an exception if an instruction from an extension it's supposed to support is used. That would be a bit of a nonsense. Of course you didn't say that the core here would be RV32IM, you said the "system" would be. It certainly expresses emulated functionality, but I personally think talking about an "RV32IM system" would be a bit of a misnomer. And I certainly hope no vendor has the balls to market some RV core as supporting some extensions if said support is only via trap handlers. That would be really bad marketing if you ask me.

And getting back to the question and hamster's work, the trap thing was probably way out of the question as he said his work was still preliminary and he hasn't even set up a stack yet...

Quote from: brucehoult on January 03, 2020, 09:31:24 pm

So, you can implement multiply in hardware but not divide, implement a handler for divide and happily compile for RV32IM. Divide will be slower than if you called the libgcc routine directly, but possibly not all that much slower, depending on how efficiently you implement the trap handler.

You can, but unless 1/ you need to run unmodified object code or 2/ you absolutely need code size reduction (using traps this way would of course decrease code size as opposed to letting the compiler inline the emulated operations), I don't really see the point. No matter how efficient your trap handler is, it's going to be much slower than an operation directly executed in the ALU without disturbing the pipeline.

Of course I understand it was worth mentioning as a way to handle executing code using some extension on a core that doesn't support it.

Quote from: brucehoult on January 03, 2020, 09:31:24 pm

There is a -mno-div flag to tell gcc not to use the divide instruction even though you're compiling for RV32IM.

I was wondering about this earlier, so it's nice to know. I was indeed considering the "M" extension and how it requires divides, whereas divides are a lot more resource-hungry to implement in hardware and not always as necessary performance-wise, because in practice a lot of computational code can be rewritten to avoid divides. So having a flag that ensures that no divide instruction is used is nice to have.

But I wouldn't know what to call a core that implements only multiplies. Again, I don't think you could claim that your core supports the "M" extension in this case.

SiliconWizard · « **Reply #78 on:** January 08, 2020, 02:45:45 pm »

So... my CPU emulator/simulator is now functional!

At this point, it can simulate a 5-stage pipelined RISC-V core. The variant can be selected at init time; it currently supports RV32I/E, RV64I/E, and the M extension (RV32/RV64). The C extension is not quite finished, and I intend to implement most other standard extensions after that.

I have tested it with a few RISC-V test programs. After the first few very simple, hand-crafted assembly tests, I have written some start-up code and a linker script, so I can test C code. Current tests are emulating an environment with a 256KB instruction memory and 1MB data memory. The startup code sets up the stack pointer, initializes data, zero-fills uninitialized globals and calls 'main'; I put an 'ebreak' instruction right after that call. I've set up a stop condition in my simulator to stop execution on 'ebreak', so that's a simple setup to stop automatically once 'main' has returned.
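
For anyone who has not written startup code before, the two memory-initialization steps listed above could look roughly like this in C. This is a sketch under assumptions, not the startup code actually used here: the region bounds would normally come from linker-script symbols, the function and parameter names are hypothetical, and setting the stack pointer and placing the 'ebreak' need a few lines of assembly that are left out.

```c
#include <stdint.h>

/* Sketch of the C-level part of a bare-metal startup sequence. The region
 * bounds would normally be linker-script symbols; the parameter names used
 * here are hypothetical. */
void startup_init_memory(uint32_t *data_start, uint32_t *data_end,
                         const uint32_t *data_load,
                         uint32_t *bss_start, uint32_t *bss_end)
{
    /* Copy initialized globals from their load address to data memory. */
    for (uint32_t *dst = data_start; dst < data_end; dst++)
        *dst = *data_load++;

    /* Zero-fill uninitialized globals, as C requires before main() runs. */
    for (uint32_t *dst = bss_start; dst < bss_end; dst++)
        *dst = 0;
}
```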

Example code that simulates correctly (stripping out include/comments/...):

Code:

static char gst_szString[256];

static size_t StrLen(const char *szString)
{
	size_t nLength;
	
	if (szString == NULL)
		return 0;
	
	for (nLength = 0; *szString != '\0'; szString++)
		nLength++;
	
	return nLength;
}

static char * StrReverse(char *szStringDest, const char *szStringSrc)
{
	size_t nLength, i, j;
	
	if ((szStringDest == NULL) || (szStringSrc == NULL))
		return NULL;
	
	nLength = StrLen(szStringSrc);
	
	for (i = 0, j = nLength - 1; i < nLength; i++, j--)
		szStringDest[i] = szStringSrc[j];
	
	szStringDest[nLength] = '\0';
	
	return szStringDest;
}

int main(void)
{
	(void) StrReverse(gst_szString, "This is a test string!");
	
	return 0;
}

So, this basically stores a reversed version of some constant string in a global variable. The simulator dumps data memory content in a file once it's done, so I can check that it correctly stored "!gnirts tset a si sihT".

The simulator CLI outputs this for the above at this point (yes, as you can see it runs on Windows here, but it's pure C99 code, so it builds and runs fine on any POSIX-compliant platform):

Code:

CPUEmu: CPUEmu-RISCV (0.1.1) on CPUEmu (0.1.0)
CPU Variant: RV32IM
Binary file '..\..\Tests\RISCV_Code\GCC-Build\Test2.bin' loaded.
Running simulation...
Simulation completed in 0.000015 s.
Clock Counter = 664
Instruction Counter = 442
CPI = 1.502
STOP: Stop on Instruction (Num = 1)
Registers:
    x0        0x00000000
    x1        0x00000058
    x2        0x10100000
    x3        0x00000000
    x4        0x00000000
    x5        0x00000000
    x6        0x00000000
    x7        0x00000000
    x8        0x00000000
    x9        0x00000000
    x10       0x00000000
    x11       0x10000016
    x12       0x00000054
    x13       0x10000016
    x14       0x000000C7
    x15       0x10000016
    x16       0x00000000
    x17       0x00000000
    x18       0x00000000
    x19       0x00000000
    x20       0x00000000
    x21       0x00000000
    x22       0x00000000
    x23       0x00000000
    x24       0x00000000
    x25       0x00000000
    x26       0x00000000
    x27       0x00000000
    x28       0x00000000
    x29       0x00000000
    x30       0x00000000
    x31       0x00000000
    pc        0x0000005C

(About half of all executed instructions are in the startup code, as it zero-fills the 256-byte string global.)

The CPI at about 1.5 is not extraordinary but it's not too bad for a first pipeline version. I think branches are the culprit here, as I haven't implemented branch prediction so far, so it's just wasting 2 cycles each time a branch is taken, and this small test code is almost entirely made up of loops.

So now I'm going to write many tests for it. (I'm also going to test Bruce's benchmark shortly.) If anyone has any example code to share that I could test, that'd be cool as well.

legacy · « **Reply #79 on:** January 08, 2020, 04:30:11 pm »

Quote from: SiliconWizard on January 08, 2020, 02:45:45 pm

simulate a 5-stage pipelined RISC-V core

How do you simulate the pipeline stages?

SiliconWizard · « **Reply #80 on:** January 08, 2020, 07:13:11 pm »

Quote from: legacy on January 08, 2020, 04:30:11 pm

Quote from: SiliconWizard on January 08, 2020, 02:45:45 pm
simulate a 5-stage pipelined RISC-V core

How do you simulate the pipeline stages?

For each simulated clock cycle, each stage executes what would be executed in a purely digital design. The FETCH stage fetches the instruction from instruction memory at the current PC, the DECODE stage decodes the instruction, etc. To simulate the registering and propagation, each stage has associated "output" registers in a dedicated struct, which are read as inputs by the next stage. For instance, the FETCH stage stores the fetched instruction and PC in members of a dedicated struct; the DECODE stage reads them to decode, and also passes them on to the next stage by writing them to its own output registers. To avoid having to duplicate the registers between each stage (to make sure that each stage reads the previous stage's output registers as they were at the previous clock cycle), you just need to execute the stages from last to first (as they are obviously executed sequentially in software, and not in parallel).
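
The last-to-first trick described above can be sketched as follows. This is a deliberately stripped-down illustration, not the simulator's code: the stages do no real work and simply hand the instruction on, and all type and function names (`StageOut`, `Pipeline`, `pipeline_clock`) are invented for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Output registers of one pipeline stage (greatly simplified). */
typedef struct {
    bool     valid;   /* false while the pipeline is still filling */
    uint32_t pc;
    uint32_t insn;
} StageOut;

typedef struct {
    StageOut if_out, id_out, ex_out, ma_out;
    uint32_t pc;
    uint64_t cycles;
    uint64_t retired;
} Pipeline;

/* Simulate one clock. Stages run from last to first, so each stage still
 * sees what its predecessor produced on the PREVIOUS cycle, without any
 * second copy of the registers. */
void pipeline_clock(Pipeline *p, const uint32_t *imem)
{
    /* WB: retire whatever MA produced last cycle. */
    if (p->ma_out.valid)
        p->retired++;

    /* MA, EX, ID: real stages would do work here; this sketch only
     * forwards the instruction to the stage's own output registers. */
    p->ma_out = p->ex_out;
    p->ex_out = p->id_out;
    p->id_out = p->if_out;

    /* IF: fetch at the current PC and advance it. */
    p->if_out.valid = true;
    p->if_out.pc    = p->pc;
    p->if_out.insn  = imem[p->pc / 4];
    p->pc += 4;

    p->cycles++;
}
```

Had the stages been called first to last, the instruction fetched in a given cycle would ripple through all five stages within that same call, which is exactly what the reverse order prevents.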

For handling stalls, I have two stall registers (which are integers). Each bit represents the current stall state of each corresponding stage. The first stall register is to handle bubble-like stalls. Bubbles must be propagated through the pipeline, so this "register" is shifted left at each simulated clock cycle. The second register is to handle stalls that are not bubbles. Then both registers are AND'ed and the result defines which stages will be executed and which won't at each cycle.
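
One possible reading of the two stall registers, as a sketch: the bit polarity (1 = "stage executes this cycle"), the bit order (bit 0 = IF) and every name below are assumptions chosen so that AND-ing the two registers gives the set of active stages, as the description says. The real simulator may well encode this differently.

```c
#include <stdint.h>

#define NUM_STAGES  5u
#define STAGE_MASK  ((1u << NUM_STAGES) - 1u)   /* bit 0 = IF ... bit 4 = WB */

/* Assumed polarity: a 1 bit means "this stage executes this cycle". */
typedef struct {
    uint32_t bubble_enable;  /* a 0 bit is a bubble travelling down the pipe */
    uint32_t hold_enable;    /* a 0 bit freezes a stage (non-bubble stall)   */
} StallRegs;

/* Insert a bubble at the given stage (0 = IF). */
void stall_insert_bubble(StallRegs *s, unsigned stage)
{
    s->bubble_enable &= ~(1u << stage) & STAGE_MASK;
}

/* Stages that should execute this cycle: both registers must allow it. */
uint32_t stall_active_stages(const StallRegs *s)
{
    return s->bubble_enable & s->hold_enable & STAGE_MASK;
}

/* Advance one clock: bubbles move one stage further and drop out after WB. */
void stall_clock(StallRegs *s)
{
    s->bubble_enable = ((s->bubble_enable << 1) | 1u) & STAGE_MASK;
}
```

Starting from all stages enabled, a bubble inserted at IF disables IF in the current cycle, ID in the next, and so on, until it falls off the end after WB.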

Not getting into a lot of details as it's a bit involved in the end but hopefully you get the basics from the above.

The pipeline I implemented is a classic RISC 5-stage pipeline (IF, ID, IE, MA, WB). I simulated data bypass for data hazards, and handled the load-use hazards with a 1-cycle stall when required and the taken branches with a 2-cycle stall. So, I basically tried to make it as efficient as possible (stalling only when strictly necessary). Still pretty textbook stuff. The thing that could be improved is branches, by adding some form of branch prediction. I'll probably do this later on. As I said earlier, the whole idea is to implement it so that it's "easily" implementable in HDL later (and so that it can simulate it accurately). As I said before, doing this really helps in understanding how to design this. I now have a much better grasp of CPU pipelines (and of RISC-V of course as well), and the ease and speed with which things can be tested is a big plus as opposed to directly implementing them in HDL and going through tedious steps each time you want to test something new... so I definitely confirm that the approach of designing a simulator first is interesting. And ironing out the initial bugs I had with the pipeline, for instance, would have taken a lot more time if I had started in HDL directly.

SiliconWizard · « **Reply #81 on:** January 08, 2020, 07:17:13 pm »

I've added branch counters so for the above example, I now get:

Code:

Clock Counter = 664
Instruction Counter = 442
CPI = 1.502
Branch Counter = 114
Taken Branch Counter = 109

Definitely helps seeing what happens.
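
As a quick sanity check on these counters, the cycle count can be split into one cycle per instruction, the 2-cycle penalty per taken branch mentioned earlier, and whatever is left over. The helper names below are made up for this sketch; the penalty value is the one quoted in the thread.

```c
#include <stdint.h>

/* Cycles not explained by one-instruction-per-clock issue plus the
 * taken-branch penalty. The 2-cycle penalty is the figure quoted in the
 * thread; everything else (pipeline fill, load-use stalls) ends up in the
 * returned remainder. */
uint64_t unexplained_cycles(uint64_t clocks, uint64_t instructions,
                            uint64_t taken_branches)
{
    const uint64_t taken_branch_penalty = 2;
    return clocks - instructions - taken_branch_penalty * taken_branches;
}

/* CPI as reported by the simulator. */
double cycles_per_instruction(uint64_t clocks, uint64_t instructions)
{
    return (double)clocks / (double)instructions;
}
```

For the run above this gives 664 - 442 - 2 x 109 = 4 leftover cycles, which would be consistent with the 4 cycles needed to fill a 5-stage pipeline, although that interpretation is an assumption. In other words, taken branches account for essentially all of the gap between a CPI of 1.0 and the measured 1.502.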

SiliconWizard · « **Reply #82 on:** January 09, 2020, 12:06:01 am »

Now with Bruce's benchmark code (prime counting):
(This benchmark helped me find - and fix - a small decoding bug for the M extension.)

Code:

Clock Counter = 47886782350
Instruction Counter = 31207736416
CPI = 1.534
Branch Counter = 13783063651
Taken Branch Counter = 6628998761

And... the result is actually correct!

The number of primes found can be seen in the following register:

Code:

x29 0x0038A888
(which is indeed 3713160 dec)

So... that's ~47.9 billion clocks (simulated), which is explained by the CPI of ~1.5, and ~6.6 billion taken branches. It looks clear that a 5-stage pipeline without branch prediction is not ideal. Now I can make some interesting experiments. But already happy that I got this far.

ataradov · « **Reply #83 on:** January 09, 2020, 12:13:10 am »

Quote from: SiliconWizard on January 09, 2020, 12:06:01 am

Now with Bruce's benchmark code (prime counting):

I can't find it mentioned anywhere in this thread? Where is the code?

SiliconWizard · « **Reply #84 on:** January 09, 2020, 12:36:50 am »

Quote from: ataradov on January 09, 2020, 12:13:10 am

Quote from: SiliconWizard on January 09, 2020, 12:06:01 am
Now with Bruce's benchmark code (prime counting):
I can't find it mentioned anywhere in this thread? Where is the code?

A link was posted here: https://www.eevblog.com/forum/embedded-computing/raspberry-pi-4/msg2765598/#msg2765598

brucehoult · « **Reply #85 on:** January 09, 2020, 07:39:50 am »

Quote from: ataradov on January 09, 2020, 12:13:10 am

Quote from: SiliconWizard on January 09, 2020, 12:06:01 am
Now with Bruce's benchmark code (prime counting):
I can't find it mentioned anywhere in this thread? Where is the code?

The code lives at http://hoult.org/primes.txt I should probably put it on github :-)

Silly little benchmark with the main virtue that it's been run (and *can* run) on a metric shedload of stuff from AVR to Xeons. Only exercises cpu core and L1 / SRAM, very branchy.

SiliconWizard · « **Reply #86 on:** January 09, 2020, 03:28:52 pm »

Quote from: brucehoult on January 09, 2020, 07:39:50 am

Quote from: ataradov on January 09, 2020, 12:13:10 am
Quote from: SiliconWizard on January 09, 2020, 12:06:01 am
Now with Bruce's benchmark code (prime counting):
I can't find it mentioned anywhere in this thread? Where is the code?

The code lives at http://hoult.org/primes.txt I should probably put it on github :-)

Silly little benchmark with the main virtue that it's been run (and *can* run) on a metric shedload of stuff from AVR to Xeons. Only exercises cpu core and L1 / SRAM, very branchy.

Yes it's very simple and very branchy as you said, so it actually helps not just benchmarking but debugging CPUs... I have found a second bug in my simulator thanks to it (that didn't appear when compiled at -O1 but did at -O3).

legacy · « **Reply #87 on:** January 09, 2020, 09:58:49 pm »

Not polished, simply packaged as a "demo" to be compiled on a Linux host. It's here, you can download it as tgz. It's the simple integer calculator written in C I was talking about a few posts ago. The machine define was automatically generated and needs to be redefined for the target, as well as the console_IO.

Have fun

SiliconWizard · « **Reply #88 on:** January 10, 2020, 01:20:00 am »

Thanks. I haven't implemented any kind of emulation for a console at this point (or any peripheral), so I can't really test it as such for the time being. For now I will favor code that can be run without interaction. But I'll give it a try when I implement at least some peripheral emulation.

I've found and fixed a new tricky bug with data hazard + JALR thanks to test code I had written for a Project Euler problem. Getting there!

(Regarding branch prediction, what I said earlier about not having implemented it is technically not quite exact. It has some form of very basic static prediction, as untaken branches have no penalty. I don't consider it branch prediction per se, but the textbook definition would say that it is. That being said, if anyone has worked on branch prediction and has a *simple* yet reasonably effective approach, I'd be happy to hear about it and implement it. An interesting alternative to branch prediction that I've thought about, and that I've seen was considered by the PULP team, would be to implement an extension for "hardware loops". Another idea - not new per se - would be to add a couple of instructions to give hints about the preferred branch option for any branch instruction. That would be quite effective IMO, but would require compiler support, which may not be that trivial to do. For hand-written assembly, that would certainly work fine...)
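
As one example of a simple dynamic scheme of the kind asked about here, the textbook starting point is a small table of 2-bit saturating counters indexed by low PC bits. The sketch below is only that textbook idea written out in C; the table size, the indexing and all names are arbitrary choices for illustration, and nothing here comes from the simulator discussed in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

#define BHT_ENTRIES 64u   /* arbitrary power-of-two size for this sketch */

/* One 2-bit saturating counter per entry: 0,1 predict not taken; 2,3 taken. */
typedef struct {
    uint8_t counter[BHT_ENTRIES];
} BranchHistoryTable;

unsigned bht_index(uint32_t pc)
{
    /* Drop the low 2 bits (4-byte instructions; with the C extension one
     * would shift by 1 instead) and use the next bits as the table index. */
    return (pc >> 2) & (BHT_ENTRIES - 1u);
}

bool bht_predict_taken(const BranchHistoryTable *t, uint32_t pc)
{
    return t->counter[bht_index(pc)] >= 2u;
}

/* Called once the branch outcome is known. */
void bht_update(BranchHistoryTable *t, uint32_t pc, bool taken)
{
    uint8_t *c = &t->counter[bht_index(pc)];
    if (taken) {
        if (*c < 3u) (*c)++;
    } else {
        if (*c > 0u) (*c)--;
    }
}
```

The two-bit hysteresis means a loop branch that is taken many times and then falls through once is still predicted taken the next time the loop is entered. A direction predictor like this only saves cycles if the branch target is also available early enough, which is a separate design question.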

hamster_nz · « **Reply #89 on:** January 10, 2020, 01:50:28 am »

I'm looking at adding really simple interrupts to my design. Nothing fancy, just a single interrupt.

Has anybody found a concise discussion on how interrupts should work in RISC-V? Reading through the privileged ISA spec, it seems the following CSRs come into play....

mstatus : Machine Status register
mtvec : Machine trap-handler base address
mip : Machine Interrupt Pending
mie : Machine Interrupt Enable
mepc : Machine Exception Program Counter
mcause : Machine Cause Register

Does anybody know of a simple how-to describing it?

SiliconWizard · « **Reply #90 on:** January 10, 2020, 03:02:49 am »

Quote from: hamster_nz on January 10, 2020, 01:50:28 am

I'm looking at adding really simple interrupts to my design. Nothing fancy, just a single interrupt.

Has anybody found a concise discussion on how interrupts should work in RISC-V? Reading through the privileged ISA spec, it seems the following CSRs come into play....

Interested as well.

SiliconWizard · « **Reply #91 on:** January 12, 2020, 03:55:45 pm »

Have done more intensive tests, including testing single- and double-precision FP (GCC) emulation on simulated RV32IM. (Helped me find and fix a couple of other bugs.)
It's starting to be relatively stable. (Until the next bug, of course.)

Then I added emulation of MMIO and added simple emulation of console output on top of that (no input yet, coming!)
I can use printf() from the C lib. As I'm using my own startup code, I had to implement a few C IO functions. (Only _write() for now, all others stubs.)
Below is the code, if anyone's interested (nothing fancy, but it could save you a few minutes of looking things up).

Code:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <stdbool.h>

#include <unistd.h>
#include <errno.h>
#include <sys/stat.h>

#include "Tests_MMIO.h"

////////////////////////////////////////////////////////////////////////////////

int _read(int file, char *data, int len)
{
	return 0;
}

////////////////////////////////////////////////////////////////////////////////

int _close(int file)
{
	return 0;
}

////////////////////////////////////////////////////////////////////////////////

int _lseek(int file, int ptr, int dir)
{
	return 0;
}

////////////////////////////////////////////////////////////////////////////////

int _fstat(int file, struct stat *st)
{
	return 0;
}

////////////////////////////////////////////////////////////////////////////////

int _isatty(int file)
{
	return 0;
}

////////////////////////////////////////////////////////////////////////////////

int _write(int file, char *data, int len)
{
	if ((file != STDOUT_FILENO) && (file != STDERR_FILENO))
	{
		errno = EBADF;
		return -1;
	}
	
	for (int i = 0; i < len; i++)
		*TESTS_MMIO_CONSOLEOUT_PTR = data[i];
	
	return len;
}

////////////////////////////////////////////////////////////////////////////////

The obligatory Hello World with just

Code:

int main(void)
{
	printf("Hello World!\n");
	
	return 0;
}

gives this:

Code:

CPUEmu: CPUEmu-RISCV (0.1.6) on CPUEmu (0.1.2)
CPU Variant: RV32IM
Binary file '..\..\Tests\RISCV_Code\GCC-Build\Test8.bin' loaded.
Running simulation...
---------------------------------Emulated Console-------------------------------
Hello World!

--------------------------------------------------------------------------------
Simulation completed in 0.000419 s.
Clock Counter = 5136
Instruction Counter = 3732
CPI = 1.376
Branch Counter = 753
Taken Branch Counter = 687

(The RISC-V binary file is about 14KB.)
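
On the simulator side, an MMIO console of the kind mentioned above can be as small as an address check in the store path. The sketch below is an illustration only: the register address is a made-up value (the contents of Tests_MMIO.h were not posted), and `Console` and `mmio_store8` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MMIO_CONSOLE_OUT_ADDR 0x20000000u   /* hypothetical address */

/* Captured console output (a real simulator would more likely print it). */
typedef struct {
    char   text[256];
    size_t length;
} Console;

/* Called by the simulator for every byte store. Returns true if the store
 * hit the console register, false if it should go to ordinary data memory. */
bool mmio_store8(Console *con, uint32_t addr, uint8_t value)
{
    if (addr != MMIO_CONSOLE_OUT_ADDR)
        return false;

    if (con->length + 1 < sizeof con->text) {   /* keep room for the NUL */
        con->text[con->length++] = (char)value;
        con->text[con->length] = '\0';
    }
    return true;
}
```

On the target side, the matching pointer would then be a `volatile` byte pointer to the same address, so that each store in the `_write()` loop really reaches the simulated device.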

SiliconWizard · « **Reply #92 on:** January 14, 2020, 12:03:03 am »

@legacy:

Code:

CPUEmu: CPUEmu-RISCV (0.1.7) on CPUEmu (0.1.2)
CPU Variant: RV32IM
Binary file '..\..\Tests\RISCV_Code\GCC-Build\Test11.bin' loaded.
Running simulation...
---------------------------------Emulated Console-------------------------------

check ctypes: success

mycalc v2.1 24/04/2002, 24/05/2004, 24/06/2014, 24/10/2019
iset={@,+,-,*,/,!,=,?,<,>,a..z,A..Z,0..9}
    type '@' to exit
> a=10
10

> b=5
5

> c=-30
-30

> a*b+c
20

> d=a*c+b
-295

> d/c
9

> @
byebye


--------------------------------------------------------------------------------
Simulation completed in 0.008811 s.
Clock Counter = 46044
Instruction Counter = 33980
CPI = 1.355
Branch Counter = 6723
Taken Branch Counter = 4603

Actually thanks for this little prog, it helped me find and fix a nasty issue with load-use hazards affecting the JALR instruction. Your prog uses big switches which GCC compiles as branch tables (even at -O0), and that helped find the issue!

SiliconWizard · « **Reply #93 on:** January 15, 2020, 05:53:37 pm »

Alright - moving forward...
I got to run CoreMark with my simulator. Which was a nice achievement. (Helped me find a silly bug in the SLT instruction!)

https://www.eembc.org/coremark
You can get the code here: https://github.com/eembc/coremark

(To port it to some target, you just need to adjust the core_portme.c and core_portme.h files. It basically just requires adapting the timing stuff and some other defines. I'm using the CSR cycle register, as I just implemented the Zicsr extension.)
So for timing, it looks like this:

Code:

inline __attribute__((always_inline)) uint32_t ReadCycle32(void)
{
	uint32_t nCycle;
	
	__asm__ volatile ("rdcycle %0" : "=r" (nCycle));
	
	return nCycle;
}

#define GETMYTIME(_t)		(*_t = ReadCycle32())

I get ~2.57 CoreMark/MHz. The average CPI is ~1.3. I guess I could get a bit better score with some branch prediction, there is none yet.
I'd be curious to see what kind of score you can get with a real RISC-V CPU. So if anyone here can try this...
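
A note on why a simulator with no real clock can report CoreMark/MHz at all: the score is iterations per second, and dividing by the clock in MHz cancels the frequency, leaving iterations per million cycles. A minimal sketch of that arithmetic, with a hypothetical helper name:

```c
#include <stdint.h>

/* CoreMark score is iterations per second; dividing by the clock in MHz
 * cancels the frequency, leaving iterations per million cycles. */
double coremark_per_mhz(uint64_t iterations, uint64_t total_cycles)
{
    return (double)iterations * 1.0e6 / (double)total_cycles;
}
```

For example, 257 iterations in 100,000,000 cycles would give 2.57 CoreMark/MHz (illustrative numbers, not the actual counts of this run). One caveat when timing with a 32-bit read of the cycle counter: it wraps after 2^32 cycles, so longer runs would need the upper half of the counter as well.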

There are a few scores for RISC-V on EEMBC's website, for Andes processors and the SiFive HiFive Unleashed. For the latter, there are actually two scores (1 and 4 threads). My own test is of course on only 1 thread at this point. Interestingly, the 1-thread score for the HiFive Unleashed is 2.01, so my core (simulated, but as I said, cycle-accurate) seems to do better...

(But one thing to consider here probably: it has no memory access penalty as it simulates a 1-cycle memory, so I'm guessing the difference lies in the fact that the HiFive is using DDR RAM and caches? - although the data memory this benchmark uses is very small so everything should fit in the cache...)

Would be interesting to see more results with the FE310 for instance, and the GigaDevice MCU if anyone has a dev board...

brucehoult · « **Reply #94 on:** January 15, 2020, 06:12:17 pm »

FE310-G000 does 2.73 Coremark/MHz. The U54 in the HiFive Unleashed does 3.01 not 2.01. Maybe a typo in the page you saw?

Actually, I get a couple of percent better than the published numbers on the HiFive1 if I enable -msave-restore which uses short routines to save registers on function entry and restore them on exit instead of inline code. It uses extra instructions but makes more of Coremark fit into the 16 k L1 instruction cache.

Single-cycle memory is certainly a huge advantage. SiFive cores have 2 cycle access time for L1 or SRAM (3 cycle for subword loads).

SiliconWizard · « **Reply #95 on:** January 15, 2020, 06:27:24 pm »

Quote from: brucehoult on January 15, 2020, 06:12:17 pm

FE310-G000 does 2.73 Coremark/MHz. The U54 in the HiFive Unleashed does 3.01 not 2.01. Maybe a typo in the page you saw?

Actually, I get a couple of percent better than the published numbers on the HiFive1 if I enable -msave-restore which uses short routines to save registers on function entry and restore them on exit instead of inline code. It uses extra instructions but makes more of Coremark fit into the 16 k L1 instruction cache.

Single-cycle memory is certainly a huge advantage. SiFive cores have 2 cycle access time for L1 or SRAM (3 cycle for subword loads).

Thanks for the info!
Well the scores can be found there: https://www.eembc.org/coremark/scores.php
It's 2.01 for 1 thread (it's listed as 3020.46 @ 1500 MHz, so unless the raw score itself is wrong, it looks correct) and 8.02 for 4 threads, so it looked at least kinda consistent? But apparently the scores were evaluated with CoreMark compiled at -O2; maybe that could explain the difference, dunno if you used -O3? But it could be a typo indeed... (if so, you may want to report this, as EEMBC gives "official" scores.)

So, that the U54 does better than my core seems more logical.

I think the U54 also has a 5-stage pipeline, but with branch prediction and OOE (mine has neither so far)? So that should make quite a bit of difference.
Still, I'm happy so far to get something reasonable with a relatively "simple" design.

SiliconWizard · « **Reply #96 on:** January 16, 2020, 02:41:34 pm »

Quote from: hamster_nz on January 10, 2020, 01:50:28 am

I'm looking at adding really simple interrupts to my design. Nothing fancy, just a single interrupt.

Has anybody found a concise discussion on how interrupts should work in RISC-V? Reading through the privileged ISA spec, it seems the following CSRs come into play....

Am working on that now.
Whereas the unprivileged ISA spec is clear and easy to follow, I find the privileged one to be a lot harder to get.

This doc helps clear things up a bit:
https://sifive.cdn.prismic.io/sifive/0d163928-2128-42be-a75a-464df65e04e0_sifive-interrupt-cookbook.pdf

You can also watch this:
which was a discussion for future improvements on interrupt handling.

And now a question (for Bruce?):
is any part of the privileged architecture mandatory for any RISC-V core (even just say RV32I?) If so and I understand things correctly, is the "machine" level the minimum that must be implemented?

brucehoult · « **Reply #97 on:** January 16, 2020, 03:27:41 pm »

Quote from: SiliconWizard on January 16, 2020, 02:41:34 pm

And now a question (for Bruce?):
is any part of the privileged architecture mandatory for any RISC-V core (even just say RV32I?) If so and I understand things correctly, is the "machine" level the minimum that must be implemented?

If you want to call something "RISC-V" then it is only necessary to be able to run User ISA instructions. You are explicitly allowed to have a different (or no) privileged architecture.

Until very recently (v2.2) it was required to implement FENCE.I and the ability to read time/cycle/instructions retired CSRs, but in the ratified version of the ISA these have been made optional extensions.

If you're making actual hardware, then you need some kind of Machine level, but you might run User mode RISC-V software under some non RISC-V Machine level (x86, POWER, something custom) or in an emulator.

SiliconWizard · « **Reply #98 on:** January 16, 2020, 03:40:05 pm »

OK, thanks for this. Implementing a different privileged arch could be interesting for some particular applications.

I guess one benefit of implementing at least some of the privileged arch is to be able to take advantage of the support in the software toolchain. For instance, the "interrupt" GCC attribute that would automatically save registers used in said handler function and return with an "mret" instruction. If you don't implement this in hardware, then you're pretty much on your own for handling interrupts and exceptions in general.

But yes, as I said, I find the privileged spec to be harder to follow, and information more spread out. Dunno what's your opinion on this?

Just a tiny related remark: any part of the privileged arch requires the Zicsr extension as far as I've seen. But "Zicsr" is usually not mentioned. Most of the "non-G" CPUs (for "G" it's implicit) don't seem to mention Zicsr as part of the supported extensions, when it's usually supported. I know I'm nitpicking; maybe it's just due to the fact that this extension has only relatively recently been separated from the base ISA?

And another question now that I'm starting to get the whole thing a bit better: (I'll take the machine level as an example, but it's the same at any level)
The spec says the mepc CSR is supposed to hold the PC of the instruction that caused the exception/or that was interrupted. Then the mret instruction in the end is supposed to restore the current PC to mepc. So this would mean that returning from an exception would jump to the same instruction again. (For interrupts, it means the interrupted instruction will be executed again, meaning that if it's interrupted, it must NOT get to the execute stage!) Am I getting it right? If so, how are you supposed to return from exception handlers, if that even makes sense to do so, since it would just return to the same instruction again and again?
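
To make the mepc/mret sequence under discussion concrete, here is a minimal simulator-style sketch of trap entry and return, as I read the privileged spec. It is heavily simplified and rests on stated assumptions: machine mode only, direct (non-vectored) mtvec, no handling of the MPP field, and a made-up `Hart` struct and function names.

```c
#include <stdbool.h>
#include <stdint.h>

#define MSTATUS_MIE       (1u << 3)     /* machine interrupt enable       */
#define MSTATUS_MPIE      (1u << 7)     /* previous value of MIE          */
#define MCAUSE_INTERRUPT  0x80000000u   /* top bit set for interrupts     */

typedef struct {
    uint32_t pc, mstatus, mtvec, mepc, mcause;
} Hart;

/* Take a trap to machine mode (direct mtvec mode only). */
void trap_enter(Hart *h, uint32_t cause, bool is_interrupt)
{
    h->mepc   = h->pc;   /* faulting or interrupted instruction */
    h->mcause = cause | (is_interrupt ? MCAUSE_INTERRUPT : 0u);

    /* MPIE <- MIE, then MIE <- 0 so the handler is not re-interrupted. */
    if (h->mstatus & MSTATUS_MIE)
        h->mstatus |= MSTATUS_MPIE;
    else
        h->mstatus &= ~MSTATUS_MPIE;
    h->mstatus &= ~MSTATUS_MIE;

    h->pc = h->mtvec & ~3u;   /* low two bits of mtvec hold the mode */
}

/* MRET: MIE <- MPIE, MPIE <- 1, resume at mepc. */
void trap_return(Hart *h)
{
    if (h->mstatus & MSTATUS_MPIE)
        h->mstatus |= MSTATUS_MIE;
    else
        h->mstatus &= ~MSTATUS_MIE;
    h->mstatus |= MSTATUS_MPIE;
    h->pc = h->mepc;
}
```

Since `trap_return` resumes exactly at mepc, any "skip the instruction" behaviour has to come from the handler rewriting mepc before it executes mret.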

brucehoult · « **Reply #99 on:** January 16, 2020, 05:44:39 pm »

Quote from: SiliconWizard on January 16, 2020, 03:40:05 pm

But yes, as I said, I find the privileged spec to be harder to follow, and information more spread out. Dunno what's your opinion on this?

Maybe. As a compiler kind of person I've paid much more attention to the unprivileged ISA :-)

Quote

Just a tiny related remark: any part of the privileged arch requires the Zicsr extension as far as I've seen. But "Zicsr" is usually not mentioned. Most of the "non-G" CPUs (for "G" it's implicit) don't seem to mention Zicsr as part of the supported extensions, when it's usually supported. I know I'm nitpicking; maybe it's just due to the fact that this extension has only relatively recently been separated from the base ISA?

I think so, yes, as that's *very* recent.

Quote

The spec says the mepc CSR is supposed to hold the PC of the instruction that caused the exception/or that was interrupted. Then the mret instruction in the end is supposed to restore the current PC to mepc. So this would mean that returning from an exception would jump to the same instruction again. (For interrupts, it means the interrupted instruction will be executed again, meaning that if it's interrupted, it must NOT get to the execute stage!) Am I getting it right? If so, how are you supposed to return from exception handlers, if that even makes sense to do so, since it would just return to the same instruction again and again?

It's important to be able to try the same instruction again, especially for things such as page faults, but there are also other kinds of things that you can fix up and retry.

It's much much easier to figure out the instruction length and skip it before returning than it would be to try to work backwards to find the start of the previous instruction.

In the case of asynchronous interrupts, I believe the definition is that interrupts are checked for before the execution of the current instruction, so there is no concern of executing it twice.

If you look at for example https://github.com/riscv/riscv-pk/blob/master/machine/misaligned_ldst.c you will see that misaligned_load_trap() calculates insn_t insn = get_insn(mepc, &mstatus); uintptr_t npc = mepc + insn_len(insn); and then later write_csr(mepc, npc);

i.e. it emulates the misaligned load (if the hardware trapped on it, which some will and some won't), and then returns to the next instruction.
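
The "figure out the instruction length and skip it" step can be sketched in a few lines. This is not the riscv-pk code, just an illustration with hypothetical function names, and it only covers the standard 16-bit (compressed) and 32-bit encodings, which are told apart by the two lowest bits of the instruction.

```c
#include <stdint.h>

/* Length in bytes of a RISC-V instruction, given its lowest 16 bits.
 * Only the standard 16-bit and 32-bit encodings are handled. */
unsigned insn_length(uint16_t low_halfword)
{
    return ((low_halfword & 0x3u) == 0x3u) ? 4u : 2u;
}

/* Address an exception handler would write back to mepc to resume AFTER
 * the trapping instruction instead of retrying it. */
uint32_t next_pc_after(uint32_t mepc, uint16_t low_halfword)
{
    return mepc + insn_length(low_halfword);
}
```

On a core without the C extension every instruction is 4 bytes, so the handler can simply add 4 to mepc.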

