Topic: The RISC-V ISA discussion

SiliconWizard · « **on:** December 27, 2019, 04:46:35 pm »

(Foreword: not sure this is the appropriate section, but no other really seems more appropriate so...)

I've been seriously considering and studying the RISC-V ISA lately. I also started developing a cycle-accurate CPU "emulator" (simulator may be a better word?), meant to be useful for testing new ideas, benchmarking, etc. (and the first target is RISC-V, but it won't be limited to that.)

So while taking a closer look at the RISC-V ISA, I have a few remarks/questions... if anyone has worked with it and has experience/insight, it would be great if they could chime in. The discussion can pretty much carry on to many other aspects of it. Thought this could be interesting.

My first remarks:

1. Looks like a very nice "exercise" in simplicity. I like the "minimalist" approach. Makes implementing it pretty straightforward.
2. The minimalism looks a bit too much to me on some points. A couple examples:
2.1. The "bit manipulation" extension is not part of the base ISA. I personally think this decision is a bit too drastic. Bit manipulation can definitely be pretty useful in many cases (I'm thinking of some instruction akin to "clz" for instance... or byte swaps, bit reverse, etc.) Could be debated, but what's worse, this "B" extension is not even defined yet. I really think this is a problem at this point, because it's (in my eyes) part of basic operations and even if it's an extension, it should have been defined already IMO. As it is, core designers are likely to define their own extensions with this, and this is going to lead to useless fragmentation for something that again, seems basic to me.
2.2. The "no flag register" approach is interesting, but it makes some operations pretty clunky. For instance, working with integers wider than the native ISA width. No "add with carry" or anything like this. Would be interesting to see how you guys (experienced with RISC-V) would implement it and how much more efficient (or not?) it would be with at least additional operations with carry. You may say this could be a further extension, but again I think this is pretty basic?

I'll probably have tons of other remarks and questions later on, but I'd be interested in reading opinions on these first two to begin with.

ataradov · « **Reply #1 on:** December 27, 2019, 05:27:00 pm »

Well, what is in the main set and what is an extension is a matter of preference. If you include everything into the basic set, then you will make basic implementations of the ISA much harder. And I personally appreciate the simplicity and ease of implementation of the basic set.

There are no advanced instructions in Cortex-M0+ either. You just go to the higher end core when you need them. Same with RISC-V, you go to a core that also implements an extension.

Yes, it sucks that extensions sit undefined for years. I believe the main sticking point here is a lack of confidence that a specific implementation or set of instructions is good. Hopefully, with more and more RISC-V devices appearing, there will be more push to standardize things.

Add with carry can be somewhat efficiently implemented using SLTU and the like. It is not as efficient as far as the number of instructions goes, but if you are doing an optimized microarchitecture, it makes for a much easier implementation.
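As a rough illustration of the SLTU approach, here is a minimal C sketch of a 64-bit add built only from 32-bit operations, with the carry recovered by an unsigned compare instead of a flag. The `u64_pair` type and the function name are made up for this example.

```c
#include <stdint.h>

/* A 64-bit value held as two 32-bit halves, roughly as an RV32 core sees it.
   The type and function names are illustrative only. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} u64_pair;

/* Add two 64-bit values using only 32-bit operations and no carry flag. */
u64_pair add64_no_flags(u64_pair a, u64_pair b)
{
    u64_pair r;
    r.lo = a.lo + b.lo;              /* wraps modulo 2^32 */
    uint32_t carry = (r.lo < a.lo);  /* the SLTU step: 1 if the low add wrapped */
    r.hi = a.hi + b.hi + carry;
    return r;
}
```

The compare works because an unsigned sum is smaller than one of its operands exactly when the addition wrapped around.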

brucehoult · « **Reply #2 on:** December 27, 2019, 07:17:29 pm »

I think it's pretty amazing that despite being fairly minimalist with only 37 instructions a compiler will generate from C code (so leaving out fence, system call, debugger call, the CSR instructions), RV32I has everything necessary to efficiently support a modern software stack.

You could even make it a bit more minimalist without any great harm. I'd suggest, for example, leaving out all the "immediate" instructions except addi. Boom! You're now down to 29 instructions. And you've freed up 3% of the opcode space at a stroke. The cost? One instruction to load the desired immediate value into a register and then use the register-to-register version of the instruction instead.

Here are some instruction frequency stats I gathered from the RISC-V Debian distro with the standard packages and an assortment of extras. Format: percentage of total instructions, mnemonic, raw instruction count. I've listed the top 16 instructions in full, but only the register-immediate ones after that.

16.224593 addi 2528047
15.237536 jal 2374248
11.123998 auipc 1733294
9.981167 ld 1555223
6.658275 beq 1037464
4.305509 bne 670866
3.687067 sd 574503
3.418121 lbu 532597
3.376591 jalr 526126
2.435197 lw 379442
2.357368 lui 367315
1.800274 sb 280511
1.768576 addiw 275572
1.472592 sw 229453
1.430902 slli 222957
1.314809 andi 204868
:
0.433172 srli 67495
0.296973 ori 46273
0.244295 xori 38065
0.192047 sltiu 29924
0.131027 srai 20416
0.011071 slti 1725

addi is *the* most popular instruction. This happens on other code bases I've looked at as well. On Fedora addi comes in slightly behind jal. Part of this is that addi does triple duty as both the "move register" instruction and the "load immediate" instruction (both of which could be done by other instructions such as ori instead) but incrementing and decrementing loop variables and the stack pointer is anyway so common that addi would always be in the top instructions. This is 64 bit code, so addiw also makes a showing. If you want to think about RV32I then probably just lump addi and addiw together and call it 18%.
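To make the "triple duty" concrete, here is a small C sketch that encodes `addi` in the standard I-type layout and expresses "move register" and "load (small) immediate" in terms of it. The helper names are invented for the example, and `encode_li` only covers immediates that fit in 12 signed bits.

```c
#include <stdint.h>

/* I-type ADDI: imm[11:0] | rs1 | funct3=000 | rd | opcode=0010011 (0x13). */
uint32_t encode_addi(unsigned rd, unsigned rs1, int32_t imm)
{
    return (((uint32_t)imm & 0xFFFu) << 20)   /* 12-bit immediate  */
         | ((rs1 & 31u) << 15)                /* source register   */
         | ((rd & 31u) << 7)                  /* destination       */
         | 0x13u;                             /* OP-IMM, funct3=0  */
}

/* "mv rd, rs" is addi rd, rs, 0 */
uint32_t encode_mv(unsigned rd, unsigned rs)
{
    return encode_addi(rd, rs, 0);
}

/* "li rd, imm" for a 12-bit signed immediate is addi rd, x0, imm */
uint32_t encode_li(unsigned rd, int32_t imm12)
{
    return encode_addi(rd, 0, imm12);
}
```

For example, `encode_mv(15, 10)` gives the word for `mv a5,a0`, and `encode_addi(2, 2, -16)` the one for a typical stack adjustment `addi sp,sp,-16`.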

What about the others? slli+andi+srli+ori+xori+sltiu+srai+slti together come to 4.05% of all instructions. That's more than the 3.125% of the opcode space they take up (along with addi), but not a lot more. If you left them all out then RISC-V programs would get at most 4% bigger (less, because the same constant could often be loaded once and left in a register, of which there are usually plenty, to be used several times), and probably no more than 1% slower (because the loading of the constant could often be done outside of a loop).
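The 4.05% figure can be checked directly from the table. The sketch below just sums the listed shares; the 3.125% is computed as 1/32, on the assumption that it refers to one major opcode out of 32 in the 32-bit encoding space.

```c
#include <stddef.h>

/* Shares (percent of all instructions) copied from the table above. */
static const double imm_share[] = {
    1.430902, /* slli  */
    1.314809, /* andi  */
    0.433172, /* srli  */
    0.296973, /* ori   */
    0.244295, /* xori  */
    0.192047, /* sltiu */
    0.131027, /* srai  */
    0.011071, /* slti  */
};

/* Combined share of the non-addi register-immediate instructions. */
double immediate_share_total(void)
{
    double total = 0.0;
    for (size_t i = 0; i < sizeof imm_share / sizeof imm_share[0]; i++)
        total += imm_share[i];
    return total;
}

/* Assumed reading of "3.125% of the opcode space": 1 part in 32. */
double opimm_opcode_share(void)
{
    return 100.0 / 32.0;
}
```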

Do I seriously suggest ripping those immediate instructions out of the standard? No, of course not. The standard is ratified :-) And they are carrying their weight, collectively, even if ori, xori, sltiu, srai, slti individually are not. It would also make the hardware *more* complex to disable them, given that the ALU supports those operations, and the data path for immediates from the instruction decoder to the ALU has to exist anyway.

It is however a simple mathematical fact that an immediate instruction takes up 128x more encoding space than the corresponding instruction with two register sources. We can add hundreds and hundreds of R-type instructions in future without problems, but it's going to need a very strong justification to add more immediate instructions -- at least within the 32 bit opcode space. Future 48 bit, 64 bit or longer instructions are a different matter.
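One way to see the 128x, assuming the standard 32-bit formats (I-type: 12-bit immediate plus two 5-bit register fields; R-type: three 5-bit register fields, with the remaining 7 bits used as funct7 to select the operation):

```latex
\frac{\text{encodings per I-type instruction}}{\text{encodings per R-type instruction}}
  = \frac{2^{12}\cdot 2^{5}\cdot 2^{5}}{2^{5}\cdot 2^{5}\cdot 2^{5}}
  = 2^{7} = 128
```

Put differently, the slot taken by one immediate instruction could hold 128 distinct register-register instructions.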

I make an exception for the shift instructions. slli, srli, srai don't use the entire 12 bit immediate field, but only enough bits to encode a number up to the register size -- 5 bits for RV32, 6 for RV64. There is room to add more than 100 "shift-like" instructions in the unused all-zero bits of the slli and srli encodings. (srai already uses one of these). The proposal for the BitManip extension adds a number of "shift-like" instructions with immediate versions e.g. sloi, sroi, rori, grevi, gorci.
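A small C sketch of the RV32 shift-immediate encodings shows where the spare bits are. The helper names are invented for the example, and the slot count assumes the only encodings already taken are slli, srli and srai.

```c
#include <stdint.h>

/* RV32 shift-immediate: upper7 sits in instruction bits 31:25,
   shamt in bits 24:20, then rs1, funct3, rd and opcode 0x13. */
uint32_t encode_shift_imm_rv32(unsigned upper7, unsigned shamt,
                               unsigned rs1, unsigned funct3, unsigned rd)
{
    return ((upper7 & 0x7Fu) << 25) | ((shamt & 0x1Fu) << 20)
         | ((rs1 & 31u) << 15) | ((funct3 & 7u) << 12)
         | ((rd & 31u) << 7) | 0x13u;
}

uint32_t encode_slli(unsigned rd, unsigned rs1, unsigned shamt)
{
    return encode_shift_imm_rv32(0x00u, shamt, rs1, 1u, rd);
}

uint32_t encode_srli(unsigned rd, unsigned rs1, unsigned shamt)
{
    return encode_shift_imm_rv32(0x00u, shamt, rs1, 5u, rd);
}

/* srai shares funct3 with srli and differs only in the upper bits. */
uint32_t encode_srai(unsigned rd, unsigned rs1, unsigned shamt)
{
    return encode_shift_imm_rv32(0x20u, shamt, rs1, 5u, rd);
}

/* Values of the upper field across the two shift funct3 codes,
   minus the three assumed taken (slli, srli, srai). */
unsigned free_shift_like_slots(unsigned upper_bits)
{
    return 2u * (1u << upper_bits) - 3u;
}
```

With 6 upper bits (RV64, where shamt takes 6 bits) this counts 125 free slots, consistent with "more than 100"; with 7 upper bits (RV32) it counts 253.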

All this does I think demonstrate that while RV32I is fairly minimal, it could be made significantly more minimal without huge harm to code size or speed.

SiliconWizard · « **Reply #3 on:** December 27, 2019, 08:22:19 pm »

Quote from: ataradov on December 27, 2019, 05:27:00 pm

Well, what is in the main set and what is an extension is a matter of preference. If you include everything into the basic set, then you will make basic implementations of the ISA much harder. And I personally appreciate the simplicity and ease of implementation of the basic set.

Well of course not everyone will have the same opinion of what should be the minimal set, and it will largely depend on the kind of code they tend to work on.
The RISC-V idea is to put a minimal subset in the base set (I/E), which you can implement everything with (except for the more hardware-related extensions such as "A"). Additional extensions (except again the hardware-related ones) are for performance only. You can absolutely implement FP in software with RV32/64I for instance.

So yes it all comes down to what you consider important for performance or not. As I said, for instance I wouldn't have a problem with bit manipulation having its own extension (although I may have done it differently, but that's preferences as you said). I just think it's past time it got defined. I understand the whole idea of statistically evaluating the use of given instructions and deciding which ones to include based on that, but I also think this approach is not without flaws.

Quote from: ataradov on December 27, 2019, 05:27:00 pm

There are no advanced instructions in Cortex-M0+ either. You just go to the higher end core when you need them. Same with RISC-V, you go to a core that also implements an extension.

OK, but the difference here is not as drastic. Cortex M0 has (I don't know the difference with M0+? is the IS smaller than in the M0?) "add with carry" instructions and clz (I think), for instance, which were in question here.

Quote from: ataradov on December 27, 2019, 05:27:00 pm

Yes, it sucks that extensions sit undefined for years. I believe the main sticking point here is a lack of confidence that a specific implementation or set of instructions is good. Hopefully, with more and more RISC-V devices appearing, there will be more push to standardize things.

Certainly. I don't quite know how priorities at the RISC-V Foundation level are defined though. I'd be interested in understanding what drives them. I'd suspect that they are largely influenced by the "main" big members.

Quote from: ataradov on December 27, 2019, 05:27:00 pm

Add with carry can be somewhat efficiently implemented using SLTU and the like. It is not as efficient as far as the number of instructions goes, but if you are doing an optimized microarchitecture, it makes for a much easier implementation.

Well, sure it would use 'sltu'. As to much easier to implement... this seems slightly exaggerated. Handling a carry flag is pretty cheap IMO. You get it for almost no added cost with pretty much any multi-bit adder. Adding a couple instructions (which would be derivatives of normal add anyway) wouldn't massively hurt anything either.

Just a small example.
Consider the very simple code below, compiled for a 32-bit target:

uint64_t Add64(uint64_t n1, uint64_t n2)
{
	return n1 + n2;
}

RV32I:

	mv	a5,a0
	add	a0,a0,a2
	sltu	a5,a0,a5
	add	a1,a1,a3
	add	a1,a5,a1
	ret

NanoMIPS (you can see that it's almost exactly the same as with RV32I):

	addu	$a2,$a0,$a2
	addu	$a1,$a1,$a3
	sltu	$a4,$a2,$a0
	move	$a0,$a2
	addu	$a1,$a4,$a1
	jrc	$ra

ARM Cortex-M4:

	adds	r0, r0, r2
	adc	r1, r3, r1
	bx	lr

ARM Cortex-M0 (don't know the difference between adc and adcs, but it seems pretty equivalent to -M4):

	adds	r0, r0, r2
	adcs	r1, r1, r3
	bx	lr

It's basically 5 instructions (not counting ret) for RV32I (and interestingly NanoMIPS, which looks pretty close anyway - not that surprising), and 2 for Cortex-M0 and -M4.
No matter how efficient your implementation is, it's hard to beat that. If you're using a lot of large integer operations in some code, it'll make a pretty significant difference.

Not to mention that beyond code size (which can be mitigated using compressed instructions), you potentially get additional performance issues if you need more instructions to do the same operation. Data hazards are a lot more likely to occur between successive instructions and may not all be solvable without stalling the pipeline...

ataradov · « **Reply #4 on:** December 27, 2019, 08:29:47 pm »

Adding carry and other flags has significant implications on the hardware design.

Having separate flags introduces additional pipeline hazards, which may make efficient implementation very hard.

The idea here is that the microarchitecture will take care of multiple instructions that can be fused together and executed as one.

Who cares how many instructions there are if they take the same amount of time to execute?

Of course, simplest implementations won't do any of this and will suffer a bit. But you shouldn't design modern architectures for simplest implementations.

brucehoult · « **Reply #5 on:** December 27, 2019, 08:34:52 pm »

Quote from: SiliconWizard on December 27, 2019, 04:46:35 pm

2.1. The "bit manipulation" extension is not part of the base ISA. I personally think this decision is a bit too drastic. Bit manipulation can definitely be pretty useful in many cases (I'm thinking of some instruction akin to "clz" for instance... or byte swaps, bit reverse, etc.) Could be debated, but what's worse, this "B" extension is not even defined yet. I really think this is a problem at this point, because it's (in my eyes) part of basic operations and even if it's an extension, it should have been defined already IMO. As it is, core designers are likely to define their own extensions with this, and this is going to lead to useless fragmentation for something that again, seems basic to me.

Unfortunately we don't have time machines. Would it have been good to have bitmanip instructions ready to go in 2015? Sure, of course. Should the ISA announcement, formation of the Foundation etc have been delayed to 2019 or 2020 to allow time for bitmanip and vectors to be designed and added? HELL NO.

There is an element of things taking longer now because it's not just Krste, Andrew and Yunsup sitting around a table and deciding by fiat what is in and what is out. The ISA is owned by a community consisting of dozens (hundreds) of organisations now, and it's necessary to get input from a lot of people as to what they'd like to see in there for their applications, evaluate how useful each thing is, how synergistic different things are, and vote on inclusion and how to organize into various sub-extensions.

When it *was* just Krste, Andrew, and Yunsup they took the time to propose something, implement it in actual chips, add support to gcc, and compile and run software. And throw that away and try something else.

Even now, the strong preference is to actually implement proposed extensions in real chips (preferably multiple independent implementations) and gain experience with it before ratifying it. It's not just throwing a spec over the wall and hoping the designers were sufficiently prescient.

I think also there is an element of people simply not realizing how long things take even in a closed-doors and unannounced effort at Intel or ARM.

I've heard from people who previously worked for ARM that the Aarch64 project was started in 2001, soon after the AMD64 specification was published and well before the first Opteron or Athlon64 processors were released in 2003. This was successfully kept secret until the ARMv8.0 spec was published in October 2011 -- after the RISC-V project was started. There were no Aarch64 chips until Apple shocked everyone with the iPhone 5s in September 2013. The other phone makers were all simultaneously saying both "Why on earth would you need 64 bits in a phone? It's just a marketing gimmick." and "We'll have one in six months". Actually, it was 19 months until April 2015 with the Galaxy S6.

brucehoult · « **Reply #6 on:** December 27, 2019, 08:57:01 pm »

Quote from: SiliconWizard on December 27, 2019, 08:22:19 pm

Quote from: ataradov on December 27, 2019, 05:27:00 pm
Add with carry can be somewhat efficiently implemented using SLTU and the like. It is not as efficient as far as the number of instructions goes, but if you are doing an optimized microarchitecture, it makes for a much easier implementation.

Well, sure it would use 'sltu'. As to much easier to implement... this seems slightly exaggerated. Handling a carry flag is pretty cheap IMO. You get it for almost no added cost with pretty much any multi-bit adder. Adding a couple instructions (which would be derivatives of normal add anyway) wouldn't massively hurt anything either.

It's not the extra instructions, it's having to add a flags register as an extra result of instructions, and extra source for some instructions. That's expensive, a huge cost and bottleneck especially once you go superscalar or out of order, and simply not useful very often.

Yes, it takes four instructions on RISC-V or NanoMIPS (which btw exists as precisely one licensed RTL core at the moment with as far as I know exactly one user: Mediatek -- you can't buy a chip or a board with a chip on it) vs two instructions on ARM. But how often do you need it? (the "mov" is an artifact of the calling convention, and will disappear if the function is inlined)

Sure, I was doing add with carry all the time on 8 bit CPUs, but on 32 bit or 64 bit CPUs it's a rarity.

The only time it would come close to being performance-critical is for bignum libraries, and in that case it's going to be dominated by the loads and stores even if the data is coming from L1 cache (or SRAM). So now if you have a carry flag it's four instructions per word instead of one. And then there's a couple of instructions of loop overhead (you can unroll, but there's still overhead). RISC-V keeps that constant 2 instructions per word extra.
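For reference, a flag-free multi-word add loop might look like the C sketch below. It is only an illustration of the shape of the inner loop (two loads, one store, plus the carry bookkeeping), not a tuned bignum routine; the function name and the 32-bit limb size are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* r = a + b over n 32-bit limbs, least significant limb first.
   Returns the carry out of the top limb (0 or 1). */
uint32_t bignum_add(uint32_t *r, const uint32_t *a, const uint32_t *b, size_t n)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t s  = a[i] + b[i];   /* two loads and an add         */
        uint32_t c1 = (s < a[i]);    /* did a[i] + b[i] wrap?        */
        uint32_t t  = s + carry;     /* fold in the incoming carry   */
        uint32_t c2 = (t < s);       /* did adding the carry wrap?   */
        r[i] = t;                    /* one store                    */
        carry = c1 | c2;             /* at most one of them is set   */
    }
    return carry;
}
```

On a machine with a carry flag the two compares and the OR disappear into a single add-with-carry, while the loads, the store and the loop overhead stay the same.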

If you're doing a specialized embedded CPU and multi-precision arithmetic is a dominant part of your workload, then you can add custom instructions for it.

emece67 · « **Reply #7 on:** December 27, 2019, 09:33:43 pm »

I wanted a rude username · « **Reply #8 on:** December 27, 2019, 09:47:44 pm »

It is instructive to reflect that the legendary Alpha/AXP, HPC king of the 1990s, did not even have an integer DIV instruction.

"Perfection is attained not when there is no longer anything to add, but when there is no longer anything to take away." (Antoine de Saint-Exupéry)

ataradov · « **Reply #9 on:** December 27, 2019, 09:49:51 pm »

They did not do it because of some "perfection", but because they could not implement it economically.

Cortex-M0+ also does not have an integer divide instruction and it sucks.

SiliconWizard · « **Reply #10 on:** December 27, 2019, 09:51:04 pm »

Quote from: brucehoult on December 27, 2019, 08:57:01 pm

Quote from: SiliconWizard on December 27, 2019, 08:22:19 pm
Quote from: ataradov on December 27, 2019, 05:27:00 pm
Add with carry can be somewhat efficiently implemented using SLTU and the like. It is not as efficient as far as the number of instructions goes, but if you are doing an optimized microarchitecture, it makes for a much easier implementation.

Well, sure it would use 'sltu'. As to much easier to implement... this seems slightly exaggerated. Handling a carry flag is pretty cheap IMO. You get it for almost no added cost with pretty much any multi-bit adder. Adding a couple instructions (which would be derivatives of normal add anyway) wouldn't massively hurt anything either.

It's not the extra instructions, it's having to add a flags register as an extra result of instructions, and extra source for some instructions. That's expensive, a huge cost and bottleneck especially once you go superscalar or out of order, and simply not useful very often.

I understand that. Yes that would be instructions with the equivalent of 3 sources instead of 2. There are such instructions in the FP extensions by the way (fused multiply add), so if you're implementing FP extensions, you'll need the logic to handle 3 sources anyway. Given that 3-source instructions are part of extensions already, it would make sense to put instructions using carry in an extension as well, I concede that.

As to which approach would yield the most efficient execution is really not this trivial, it would depend on a number of factors, but I would guess that in simpler architectures with in-order execution, that may be more efficient. The simulator I'm working on is meant to help figuring this out on real code.

And as to the "rarity" of needing that... I don't know. There are certainly a number of applications where using 64-bit integers, for instance, in 32-bit code is pretty common. Whether this would be performance critical largely depends on the application.

Again I agree that the third source is added pain to handle, but I'll point out again that this is already the case if you implement the F extension?

brucehoult · « **Reply #11 on:** December 27, 2019, 09:52:09 pm »

By the way, the RISC-V Vector extension proposal includes add-with-carry, using the mask input as carry-in instead of as a mask:

	# vd = vs2 + vs1 + v0.LSB
	vadc.vvm vd, vs2, vs1, v0 # in the base vector encoding with 32 bit opcodes, the mask can only come from v0

	# vd = carry_out(vs2 + vs1 + v0.LSB)
	vmadc.vvm vd, vs2, vs1, v0 # produces the carry out into vd (has to be v0 or moved to v0 to use it later)

Note that this enables doing a number of multi-precision adds in parallel, not one huge multi-precision add across the vector.
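A scalar C model of what the two comment lines above describe, element by element, may help. This is only a sketch of the stated semantics, not taken from the specification text; the function names, the 32-bit element width and the one-byte-per-element mask representation are all assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Models "vd = vs2 + vs1 + v0.LSB": each element adds its own carry-in bit. */
void model_vadc(uint32_t *vd, const uint32_t *vs2, const uint32_t *vs1,
                const uint8_t *carry_in, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        vd[i] = vs2[i] + vs1[i] + (carry_in[i] & 1u);
}

/* Models "vd = carry_out(vs2 + vs1 + v0.LSB)": one carry-out bit per element. */
void model_vmadc(uint8_t *carry_out, const uint32_t *vs2, const uint32_t *vs1,
                 const uint8_t *carry_in, size_t vl)
{
    for (size_t i = 0; i < vl; i++) {
        uint64_t wide = (uint64_t)vs2[i] + vs1[i] + (carry_in[i] & 1u);
        carry_out[i] = (uint8_t)(wide >> 32);   /* bit 32 is the carry */
    }
}
```

Each element's carry stays within that element, so element i of successive calls can hold successive limbs of the i-th number: several independent multi-precision adds advance in lockstep, rather than one long carry chain running across the vector.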

brucehoult · « **Reply #12 on:** December 27, 2019, 10:06:05 pm »

Quote from: SiliconWizard on December 27, 2019, 09:51:04 pm

I understand that. Yes that would be instructions with the equivalent of 3 sources instead of 2. There are such instructions in the FP extensions by the way (fused multiply add), so if you're implementing FP extensions, you'll need the logic to handle 3 sources anyway. Given that 3-source instructions are part of extensions already, it would make sense to put instructions using carry in an extension as well, I concede that.

That's a different register file and a different ALU.

Supporting 3 input operands is expensive, but FMA is maybe *the* most common FP operation, so it's extremely important to support it efficiently. ADC on the other hand is a rarity in most code.

The Bitmanip extension proposal includes some operations that need 3 integer register inputs: cmix (conditional mix), cmov (conditional move) and funnel shifts (fsl, fsr, fsri). Some people may find it worthwhile to support those, but I think they're *extremely* unlikely to find their way into the standard B extension that is what general purpose operating systems such as Linux will support.

SiliconWizard · « **Reply #13 on:** December 27, 2019, 10:32:28 pm »

Quote from: brucehoult on December 27, 2019, 10:06:05 pm

Quote from: SiliconWizard on December 27, 2019, 09:51:04 pm
I understand that. Yes that would be instructions with the equivalent of 3 sources instead of 2. There are such instructions in the FP extensions by the way (fused multiply add), so if you're implementing FP extensions, you'll need the logic to handle 3 sources anyway. Given that 3-source instructions are part of extensions already, it would make sense to put instructions using carry in an extension as well, I concede that.

That's a different register file and a different ALU.

Oh. Right. So you'd have to duplicate it, but I guess the structure would be pretty similar. I still don't have a precise idea of how much area/LEs it takes to implement that. I'm currently working on a typical 5-stage pipeline with 2 data sources and 1 destination as per the base RISC-V ISA. But at this point, even for that I have no idea exactly how much it would take in hardware.

I have implemented pipelines in the past but none with data hazards (or very simple ones) or branch hazards, so I'm still wrapping my head around that.

Quote from: brucehoult on December 27, 2019, 10:06:05 pm

Supporting 3 input operands is expensive, but FMA is maybe *the* most common FP operation, so it's extremely important to support it efficiently. ADC on the other hand is a rarity in most code.

Yep, so it was judged that doing this would indeed improve performance even with the added complexity.

Quote from: brucehoult on December 27, 2019, 10:06:05 pm

The Bitmanip extension proposal includes some operations that need 3 integer register inputs: cmix (conditional mix), cmov (conditional move) and funnel shifts (fsl, fsr, fsri). Some people may find it worthwhile to support those, but I think they're *extremely* unlikely to find their way into the standard B extension that is what general purpose operating systems such as Linux will support.

Interesting, I have only read the ratified documents so far, I don't know anything about the current proposals.
From what you say about the Bitmanip extension, it looks like the proposal includes a lot of stuff (probably way beyond what I had in mind) and that may be one of the reasons it takes time to finalize...

I wanted a rude username · « **Reply #14 on:** December 27, 2019, 10:38:00 pm »

Quote from: ataradov on December 27, 2019, 09:49:51 pm

They did not do it because of some "perfection", but because they could not implement it economically.

Just because they couldn't change the laws of physics doesn't mean their design was bad.

Obviously if the cost had been the same as MUL, they would have included it. But it isn't, not just in space but also time, which would have flow-on effects for the pipeline. And anyway, in server land floating point division is more useful.

brucehoult · « **Reply #15 on:** December 27, 2019, 10:53:16 pm »

Quote from: SiliconWizard on December 27, 2019, 09:51:04 pm

As to which approach would yield the most efficient execution is really not this trivial, it would depend on a number of factors, but I would guess that in simpler architectures with in-order execution, that may be more efficient. The simulator I'm working on is meant to help figuring this out on real code.

A simulator is of course specific to a particular microarchitecture.

What is best is really not trivial at all. Just counting instructions or clock cycles is not enough. Adding hardware for extra instructions can result in a slower maximum MHz. Decreasing the number of clock cycles by 1% is not useful if you then have to run the clock 1% slower (or more). For battery powered things, the additional gates also add to the energy use, whether you are using those instructions or not. No one does clock gating on just the "clz" circuit :-) The extra die area also adds to the cost of each chip.
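The cycles-versus-clock trade-off can be written out with the usual execution-time relation, where $N$ is the instruction count, $\mathrm{CPI}$ the average cycles per instruction and $f$ the clock frequency (symbols introduced here for illustration):

```latex
T = \frac{N \cdot \mathrm{CPI}}{f},
\qquad
T' = \frac{0.99\,(N \cdot \mathrm{CPI})}{0.99\,f} = T
```

So a 1% cut in cycles bought with a clock that is 1% slower leaves the run time unchanged, and any larger clock penalty makes it worse.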

For applications processors in current mobile phones and up, both Intel and ARM have decided to take a kitchen sink approach, and mandate that all CPUs have every instruction anyone ever thought of. Everyone gets SIMD. Everyone gets clz and popcount and sha and aes and ....

Are they right? Maybe. Maybe not.

SiliconWizard · « **Reply #16 on:** December 27, 2019, 11:25:21 pm »

Quote from: brucehoult on December 27, 2019, 10:53:16 pm

Quote from: SiliconWizard on December 27, 2019, 09:51:04 pm
As to which approach would yield the most efficient execution is really not this trivial, it would depend on a number of factors, but I would guess that in simpler architectures with in-order execution, that may be more efficient. The simulator I'm working on is meant to help figuring this out on real code.

A simulator is of course specific to a particular microarchitecture.

Well, the whole idea is to be able to try various IS but also various microarchitectures. Implementing just one microarchitecture would serve limited purpose as comparing different IS this way would inevitably be biased.

Quote from: brucehoult on December 27, 2019, 10:53:16 pm

What is best is really not trivial at all. Just counting instructions or clock cycles is not enough. Adding hardware for extra instructions can result in a slower maximum MHz. Decreasing the number of clock cycles by 1% is not useful if you then have to run the clock 1% slower (or more). For battery powered things, the additional gates also add to the energy use, whether you are using those instructions or not. No one does clock gating on just the "clz" circuit :-) The extra die area also adds to the cost of each chip.

A lot of factors for sure. Decreasing the CPI (and the number of required instructions, as long as it doesn't adversely increase the CPI) can be beneficial even if it won't clock as fast, as you can do the same amount of work at a lower frequency. For that to be interesting, of course you need applications in which it makes a difference, and power consumption wise, it could indeed not be beneficial... or it could be. (But running at lower frequencies could have other benefits anyway.) So yeah it so much... depends!

Quote from: brucehoult on December 27, 2019, 10:53:16 pm

For applications processors in current mobile phones and up, both Intel and ARM have decided to take a kitchen sink approach, and mandate that all CPUs have every instruction anyone ever thought of. Everyone gets SIMD. Everyone gets clz and popcount and sha and aes and ....

Are they right? Maybe. Maybe not.

As we said above, there are just too many factors to consider, so they just go for the general-purpose approach that will get them the most customers, and that is relatively easy to handle, not requiring a huge number of potential variants (which can be a problematic point with RISC-V.)

But we are talking about completely different things... RISC-V is just an ISA. Not actual chips. Intel sells mostly chips. ARM, IPs, but still too high a level of customization would probably be a nightmare to handle for them. Having completely "modular" instruction sets is pretty neat, but very tough to handle when you sell something IMO.

At SiFive, you are in an interesting stage, as you are actually selling stuff out of RISC-V, and you probably encounter the issues I'm talking about above... picking just the right set of extensions, managing queries from customers that ask for non-standard stuff... By the way, I know you have a number of "off-the-shelf" base cores, but they don't include ALL extensions of course. How do you handle it if some customer asks for a specific extension that your cores don't support? Is it a no-no, or is it yes (and you have most ratified extensions ready), or do you even design custom extensions in some cases?

westfw · « **Reply #17 on:** December 28, 2019, 06:53:34 am »

Its limitations are not so different from those of other RISC architectures.

MIPS lacks a carry bit and multi-precision math is a bit weird.

Cortex-M0 and M0+ are missing a painful number of "expected" instructions (everyone notices division, but bit-tests are pretty painful, too). Neither the sort of "and with immediate" nor the "shift register till the desired bit is in carry or sign position" methods common on other ARM architectures is there.

brucehoult · « **Reply #18 on:** December 28, 2019, 07:41:57 am »

Quote from: westfw on December 28, 2019, 06:53:34 am

Cortex-M0 and M0+ are missing a painful number of "expected" instructions (everyone notices division, but bit-tests are pretty painful, too). Neither the sort of "and with immediate" nor the "shift register till the desired bit is in carry or sign position" methods common on other ARM architectures is there.

That doesn't sound right. As with any Thumb1 implementation, there is the "LSL Rd, Rm, #bits" instruction ("MOVS Rd, Rs, LSL #bits" in unified syntax), which sets the N and Z flags as you would expect. Opcode 00000bbbbbsssddd. You can then follow it with a BMI or BPL as expected.

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0432c/CHDCICDF.html
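As a sketch of the bit-test idiom being described, the C below models "shift the wanted bit into the sign position and look at N", and also packs the 16-bit opcode from the layout quoted above (00000bbbbbsssddd). The function names are invented for the example.

```c
#include <stdint.h>

/* Thumb-1 LSL (immediate): 00000 imm5 Rm Rd, low registers r0-r7 only. */
uint16_t encode_thumb_lsl_imm(unsigned rd, unsigned rm, unsigned imm5)
{
    return (uint16_t)(((imm5 & 31u) << 6) | ((rm & 7u) << 3) | (rd & 7u));
}

/* Models the N flag after shifting left by (31 - bit): the tested bit
   lands in bit 31, so a following BMI/BPL branches on it. */
int bit_set_via_shift(uint32_t value, unsigned bit)
{
    uint32_t shifted = value << (31u - (bit & 31u));
    return (int)(shifted >> 31);
}
```

For instance, testing bit 4 corresponds to a shift by 27 followed by BMI (bit set) or BPL (bit clear).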

legacy · « **Reply #19 on:** December 28, 2019, 07:52:10 am »

Quote from: ataradov on December 27, 2019, 08:29:47 pm

Adding carry and other flags has significant implications on the hardware design.
Having separate flags introduces additional pipeline hazards, which may make efficient implementation very hard.

Yup, precisely.

magic · « **Reply #20 on:** December 28, 2019, 09:56:29 am »

Quote from: ataradov on December 27, 2019, 08:29:47 pm

The idea here is that the microarchitecture will take care of multiple instructions that can be fused together and executed as one.

That's gonna be tricky for ADC, as the decoder needs to remember which register holds the SLTU-computed carry bit of which addition and then find the ADD instructions that consume this register, which may come in a few permutations, possibly partly before the SLTU itself, and possibly spread out and interleaved with unrelated code if the compiler targets a superscalar core. Sounds fun, I'm not sure if even x86 performs such complex fusions.
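For reference, here is a minimal C sketch of the flag-less carry chain under discussion. It assumes 32-bit limbs stored least-significant first. The function name `add_limbs` is made up for illustration, and the ADD/SLTU/OR comments describe what a compiler would plausibly emit on RV32, not actual compiler output.

```c
#include <stdint.h>

/* Hypothetical helper: a += b over n 32-bit limbs (least significant first),
   without any carry flag. Returns the carry out of the top limb. */
uint32_t add_limbs(uint32_t *a, const uint32_t *b, int n)
{
    uint32_t carry = 0;
    for (int i = 0; i < n; i++) {
        uint32_t sum = a[i] + b[i];   /* ADD                                */
        uint32_t c1  = sum < a[i];    /* SLTU: did a[i] + b[i] wrap?        */
        uint32_t res = sum + carry;   /* ADD the incoming carry             */
        uint32_t c2  = res < sum;     /* SLTU: did adding the carry wrap?   */
        a[i]  = res;
        carry = c1 | c2;              /* OR: both cannot be set at once     */
    }
    return carry;
}
```

Each limb costs two ADDs, two SLTUs and an OR, and every SLTU depends on the ADD just before it. Those are the pairs a fusing decoder would have to spot and track.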

westfw · « **Reply #21 on:** December 28, 2019, 10:14:39 am »

Quote

there is the "LSL Rd, Rm, #bits" instruction ("MOVS Rd, Rs, LSL #bits" in unified syntax), which sets the N and Z flags as you would expect.

Hmm. You're correct. I wonder what I was thinking of? :-(

SiliconWizard · « **Reply #22 on:** December 28, 2019, 02:52:28 pm »

Quote from: magic on December 28, 2019, 09:56:29 am

Quote from: ataradov on December 27, 2019, 08:29:47 pm
The idea here is that the microarchitecture will take care of multiple instructions that can be fused together and executed as one.
That's gonna be tricky for ADC, as the decoder needs to remember which register holds the SLTU-computed carry bit of which addition and then find the ADD instructions that consume this register, which may come in a few permutations, possibly partly before the SLTU itself, and possibly spread out and interleaved with unrelated code if the compiler targets a superscalar core. Sounds fun, I'm not sure if even x86 performs such complex fusions.

I've heard/read about proposals to design RISC-V cores with instruction fusion like this, but have never seen one actually work. It sure sounds pretty complex to implement correctly, and I'm frankly not convinced it would end up being simpler than just handling 3-source instructions (which, again, already exist in some RISC-V extensions anyway...)

Anyway, it's all an exercise of making a good compromise between simplicity and performance. A pretty tough endeavor that's obviously bound not to please everyone.

I completely understand the whole idea of having a simple ISA (RISC-V in that regard is very much in the RISC spirit of the early days, whereas most RISC processors have now become monsters and we can question what the "R" even means anymore). But putting all the work for performance on the microarchitecture's shoulders is debatable as well. One point of RISC-V is to make it very easy and lightweight to implement, but if we then need to design complex microarchitectures to really make it efficient, is the compromise really always worth it? At the least, it certainly doesn't look as easy as what we may hear here and there...

SiliconWizard · « **Reply #23 on:** December 28, 2019, 02:59:03 pm »

Quote from: legacy on December 28, 2019, 07:52:10 am

Quote from: ataradov on December 27, 2019, 08:29:47 pm
Adding carry and other flags has significant implications on the hardware design.
Having separate flags introduces , which may make efficient implementation very hard.

Yup, precisely.

As we said, it's just basically handling 3 sources instead of 2. Sure it adds complexity, but "very hard" is a bit much here. Certainly though if you're looking to design very small cores, that would be something to avoid.

I was thinking of another way to implement ADC not requiring a separate flag register. Not ultra efficient, but simpler?
The idea would just be to make all integer registers 1 bit wider (i.e. 33 bits for RV32). The destination of any ADD would naturally receive the carry in its MSB (bit 32). Thus a further ADC using this register as a source would not require handling a third source. This extra bit could maybe be used for other purposes as well? I know it sounds a bit wasteful, but it looks much simpler to implement. And yes, for those who consider ADC to be a rarity these days, that would probably make them cringe.
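To make this concrete, here is a small C model of one possible reading of the idea. Everything in it is an assumption for illustration: the `reg33` type, the `add33`/`addc33`/`add64` names, and in particular the exact semantics of the two-source "ADDC" (add bit 32 of the second source to the first). None of it is an existing RISC-V instruction.

```c
#include <stdint.h>

#define MASK32 0xFFFFFFFFull

/* Hypothetical 33-bit register: bits 0..31 hold the value, bit 32 the carry. */
typedef uint64_t reg33;

/* ADD: add the low 32 bits of both sources; the carry-out lands in bit 32. */
reg33 add33(reg33 rs1, reg33 rs2)
{
    return (rs1 & MASK32) + (rs2 & MASK32);
}

/* Assumed ADDC: add the carry bit (bit 32) of rs2 to rs1, keeping rs1's own
   bit 32. The result still fits in 33 bits, because rs1 comes from add33:
   if its bit 32 is set, its low part is at most 0xFFFFFFFE. */
reg33 addc33(reg33 rs1, reg33 rs2)
{
    return rs1 + ((rs2 >> 32) & 1);
}

/* 64-bit addition built from 32-bit halves; returns the final carry. */
uint32_t add64(uint32_t a_lo, uint32_t a_hi, uint32_t b_lo, uint32_t b_hi,
               uint32_t *r_lo, uint32_t *r_hi)
{
    reg33 lo = add33(a_lo, b_lo);                 /* carry of low half in bit 32 */
    reg33 hi = addc33(add33(a_hi, b_hi), lo);     /* two sources per operation   */
    *r_lo = (uint32_t)lo;
    *r_hi = (uint32_t)hi;
    return (uint32_t)(hi >> 32);
}
```

Under this reading, the high half takes an ADD followed by an ADDC, each with only two source registers, and no separate compare is needed to recover the carry.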

legacy · « **Reply #24 on:** December 28, 2019, 04:27:50 pm »

Exactly what's your interest?
Building an HDL cpu-core with pipeline?
Writing a cycle-accurate pipeline simulator?
Designing an HL compiler, from HL to machine-code?

Each of these fields has its trade-offs.

but talking about "architecture", I wish I had "see RISC-V run" (The Book) under my Xmas tree.
Has it already been written? Let's write it! I want it under the tree.

