At the outset here, I want to point out that my previous post was describing what characteristics are generally associated with the term "RISC", not whether they are individually or collectively good or bad.
I think the collective goodness or badness is sufficiently addressed by the simple fact that since 1980 there has been no (successful) new CISC instruction set, and the ones that have survived have done so by translating to a more RISCy form internally.
CISC doesn't even win for compactness of programs in the 32 bit and 64 bit eras. RISC architectures with a simple mix of 2-byte and 4-byte instructions do. VAX and x86 both allow instructions with lengths from 1 byte to 15 or 16 bytes, in 1 byte increments. VAX can go even bigger in some extreme instructions, but let's stick to common things such as ADDL3. In the 64 bit era, x86's 15 byte limit is by fiat -- the encoding easily allows longer instructions.
- strict separation of computation from data transfer (load/store)
On the other hand, allowing ALU instructions to have one memory operand acts as a type of instruction set compression, lowers register pressure, and seems to have little disadvantage when out-of-order execution allows long load-to-use latencies from cache.
Lowers register pressure yes. "Little disadvantage" with OoO: yes. But a significant disadvantage with simpler in-order implementations, such as microcontrollers.
Program compression: it seems logical, but is not borne out by empirical data.
It's logical that not having to mention a temporary register name (twice!) in a load and subsequent arithmetic instruction could make for smaller programs. But no one seems to have been able to achieve this in practice.
I think the problem is that in all of the PDP11, VAX, 68k, and x86, each operand that can potentially be in memory is accompanied by an addressing mode field, which is *wasted* in the more common case (if you have enough registers) when the operand is actually in a register. PDP11 and 68k have, in a 16 bit opcode, two 3 bit register fields plus two 3 bit addressing mode fields. VAX has a 4 bit register field plus a 4 bit addressing mode for each operand. x86 limits one operand to be a register but still effectively wastes three bits out of 16 for reg-reg operations (1 bit in the opcode to select reg-mem or mem-reg, plus 2 bits in the modr/m byte).
If not for those often-wasted addressing mode fields, all of PDP11, 68k and x86 could perhaps have fit 3-address instructions into a 16 bit opcode instead (as Thumb does for add & sub), saving a lot of the extra instructions needed to first copy one operand to the register where the result is required.
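A rough sketch of the arithmetic (field widths as above; the opcode keeps whatever is left of the 16 bits):

    two-address, PDP11-style:    16 = opcode(4) + [mode(3) reg(3)] + [mode(3) reg(3)]
    three-address, Thumb-style:  16 = opcode(7) + reg(3) + reg(3) + reg(3)

In the first layout a reg-reg operation spends 6 of its 16 bits describing addressing modes it isn't using; the second spends those bits on a third register field and a larger opcode space instead.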
- enough registers that you don't touch memory much. Arguments for most functions fit in registers, and the return address too (the otherwise RISC AVR8 violates this).
But if the register set is too large, it means more state to save. There are other solutions for this though.
Sure. It's possible to go overboard. Four registers doesn't seem to be enough. 128 is too many. Sixteen seems to be a good number if you have complex addressing modes (big immediates and offsets, register plus scaled register), and thirty two if you don't. Eight isn't really enough, even with complex addressing modes, but two sets of eight seems workable, e.g. 8 address plus 8 data (68k), or 8 general purpose "lo" registers plus 8 more limited "hi" registers (Thumb). Or 8 directly addressable, plus 8 that need an extra prefix byte to address (x86_64).
- no instruction can cause more than one cache or TLB miss, or two adjacent lines/pages if unaligned access is supported (and this case might be trapped and emulated)
The lack of hardware support for unaligned access always seems to end up being a performance problem once a processor gets deployed into the real world.
Weak memory ordering, which seems like it should be an advantage, also becomes a liability.
Unaligned access, I agree.
Weak memory ordering ... I think languages, compilers, and programmers have got that under control now. It took a while. I don't think TSO scales well to 100+ cores ... or even 50. We'll really start to see this bite (or not) in the next five years.
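As a concrete example of what "under control" means: since C11/C++11 the ordering requirement is stated once in the source with acquire/release atomics, and the compiler emits whatever the target's memory model needs -- essentially nothing extra on a TSO machine, explicit ordering instructions or fences on a weakly ordered one. A minimal sketch:

    #include <stdatomic.h>
    #include <stdbool.h>

    int payload;
    _Atomic bool ready = false;

    /* Producer: write the data, then publish it with a release store. */
    void publish(int value) {
        payload = value;
        atomic_store_explicit(&ready, true, memory_order_release);
    }

    /* Consumer: an acquire load that sees ready==true also sees payload. */
    bool try_consume(int *out) {
        if (atomic_load_explicit(&ready, memory_order_acquire)) {
            *out = payload;
            return true;
        }
        return false;
    }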
- each instruction modifies at most one register.
That is pretty standard but how then do you handle integer multiplies and divides? Break them up into two instructions?
Yes. The high part of multiplies and the remainder from divisions are very rarely needed. Better to define separate instructions for them. Higher-end processors can notice both are being calculated and combine them, if that's profitable.
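A small C illustration of the (uncommon) case where both results are wanted. On a 64 bit RISC-V the compiler emits a separate DIV and a separate REM, each writing one register, and my understanding is the M extension spec explicitly encourages fusing that adjacent pair back into a single divide where it's profitable:

    #include <stdint.h>

    /* Quotient and remainder both wanted: two instructions (DIV then REM),
       which a higher-end core may notice and combine into one divide. */
    void divmod(int64_t a, int64_t b, int64_t *q, int64_t *r) {
        *q = a / b;
        *r = a % b;
    }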
- integer instructions read at most two registers. This is ultra-purist :-) A number of RISC ISAs break it in order to have e.g. a register plus (scaled) register addressing mode, or conditional select. But no more than three!
Internally it seems like this sort of thing and modifying more than one register should be broken up into separate micro-operations so that the register file has a lower number of read and write ports. The alternative is having to decode more instructions which clog up the front end once an out-of-order implementation is desired.
Breaking complex instructions up into several microops for execution is a valid thing to do, especially on a lower end implementation. Intel obviously does this at large scale, but even ARM does it quite a lot, including in Aarch64.
Recognising adjacent simple operations and dynamically combining them into a more powerful single operation is also a valid thing to do. Modern x86 does this, for example, when there is a compare followed by a conditional branch. Future high end RISC-V CPUs are also expected to do this heavily.
The former puts a burden onto low end implementations. The latter keeps low end implementations as simple as possible, while putting a burden onto high end implementations, which can perhaps more easily afford it.
On the other hand, this means discarding the performance advantages of the FMA instruction.
I was explicitly talking about *integer* instructions. Most floating point code uses FMA very heavily (if -ffast-math is enabled) and IEEE 754-2008 mandates it. It's a big performance advantage to have three read ports on the FP register file, worth the expense. It would be wasted most of the time on the integer register file.
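For example, the inner loop of a dot product is nothing but multiply-adds; with contraction enabled (-ffast-math, or -ffp-contract=fast) each iteration becomes a single FMA reading three FP registers:

    #include <stddef.h>

    /* With FP contraction the multiply and add fuse into one FMA per element. */
    double dot(const double *a, const double *b, size_t n) {
        double acc = 0.0;
        for (size_t i = 0; i < n; i++)
            acc += a[i] * b[i];
        return acc;
    }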
- no microcode or hardware sequencing. Each instruction executes in a small and fixed number of clock cycles (usually one). Load/Store multiple are the main offenders in both ARM and PowerPC. They help with code size, but it's interesting that ARM didn't put them in Aarch64 and is deprecating them in 32 bit as well, providing the much less offensive load/store pair.
Maybe more interesting is why ARM even included them in the first place. Load and store multiple took advantage of fast-page-mode DRAM back when ARM's instruction pipeline was closely linked with DRAM access.
Yes. Modern caches achieve the same effect with individual stores.
Load/store multiple do make programs significantly smaller, especially in function prologue and epilogue.
RISC-V gcc has an option "-msave-restore" (this might get included in -Os later) that emits, as the first instruction in a function, a call to one of several special library routines that create the stack frame and save the return address plus s0-s2, s0-s6, s0-s10, or s0-s11. Function return is then done by tail-calling the corresponding restore routine.
This replaces anything up to 29 instructions (116 bytes with rv32i or rv64i, 58 bytes with the C extension) with two instructions (8 bytes with rv32i or rv64i, 6 bytes with the C extension, though I have a plan to reduce that to 4). Of course the average case is a lot less than that -- most functions that create a stack frame at all use three or fewer callee-save registers, so you're usually replacing 5/7/9/11 instructions with 2.
The time cost is three extra jump instructions, plus sometimes a couple of unneeded loads and stores because not every size is provided in order to keep the library size down.
In the current version, the total size of the save/restore library functions is 96 bytes with RVC enabled.
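If anyone wants to try it, the flag just goes on the normal compile line, something like the following (the exact toolchain name and -march/-mabi values will depend on your setup):

    riscv64-unknown-elf-gcc -Os -march=rv32imac -mabi=ilp32 -msave-restore -c foo.c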
I ran a couple of tests on a HiFive1 board with gcc 7.2.0.
Dhrystone w/rv32im, -msave-restore made the program 4 bytes smaller, and 5% slower (1.66 vs 1.58).
CoreMark w/rv32imac, -msave-restore made the program 252 bytes (0.35%) smaller, and 0.4% FASTER (2.676 vs 2.687, with +/- 0.001 variance on different runs).
I attribute the faster speed for CoreMark to greater effectiveness of the 16 KB instruction cache.
Both of these are very small programs of course (especially Dhrystone). Bigger programs will show more difference.
Should stack instructions be broken up as well?
Yes. This is done almost universally in RISC ISAs. A function starts with decrementing the stack pointer (once) and ends with incrementing it. Each register (or sometimes register pair) is saved and restored with an individual instruction -- and on a superscalar processor several of those might run in each clock cycle.
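For example, a typical RISC-V prologue and epilogue for a function saving the return address and two callee-save registers looks roughly like this (frame size, offsets, and register choice illustrative):

    addi sp, sp, -16    # one stack pointer adjustment for the whole frame
    sw   ra, 12(sp)     # each register saved with an ordinary store...
    sw   s0, 8(sp)
    sw   s1, 4(sp)      # ...and these stores can issue in parallel
    ...
    lw   ra, 12(sp)     # epilogue: ordinary loads
    lw   s0, 8(sp)
    lw   s1, 4(sp)
    addi sp, sp, 16     # one adjustment back
    ret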
The x86's push/pop instructions are very hard on OoO machinery, and recent implementations have special push/pop handling (keeping track of the stack pointer) IN THE DECODER, translating each push or pop in a series into a store or load at an offset from the stack pointer as it was at the start of the sequence.
What a huge number of instructions *does* do is make very small low end implementations impossible. And it puts a big burden of work on every hardware and emulator implementer.
I do not know about that. Multiple physical designs covering a wide performance range are possible with the same ISA. Microcode is convenient to handle seldom used instructions. Vector operations can be broken up into instructions which fit the micro-architecture's ALU width while allowing support for the same vector instruction set across a wide range of implementations.
Or you can use instruction set extensions every time you want to support a different vector length. How many FPU ISAs has ARM gone through now?
Seldom-used operations are handled just as easily by library routines as by microcode -- especially ones which are inherently slow. This is commonly done in every CPU family for floating point operations. Sometimes the compiler generates a binary with instructions replaced by library calls. Sometimes the instruction is generated but on execution traps into the operating system which then calls the appropriate library function.
Microcode made sense when ROM within the processor was faster than RAM, but since about 1980 SRAM has actually been faster than ROM. You could copy the microcode into SRAM at boot -- and even allow the user to write some custom microcode, as the VAX did. Or you can use that SRAM as an icache and create instruction sets that don't need microcode. Then real functions are just as fast to call as microcode ones.
When Dave Patterson (who coined the terms "RISC" and later "RAID" and is currently involved with RISC-V and Google's TPU) was on sabbatical at DEC, he discovered that even on the VAX, using a series of simple instructions was faster than using the complex microcoded instructions such as CALLS and RET (which automatically saved and restored registers for you, according to a bitmask at the start of the function).
John Cocke at IBM independently discovered the same fact about the IBM 370 at about the same time (in fact a couple of years earlier).
I agree with you about vectors. ARM has gone through several vector/multimedia instruction sets, and Intel is up to at least number 4 (depending on how you count the different iterations of SSE).
The RISC-V vector instruction set which is being finalised at the moment (I'm on the Working Group) -- and which has been in development and testing in various iterations for at least ten years (it's the main reason the RISC-V project was started in the first place) -- is vector length agnostic. The same code will run on hardware with any vector length.
ARM's new SVE is of course similar.
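"Vector length agnostic" in practice means the strip-mining loop asks the hardware how many elements it will handle this iteration rather than baking a width into the code. Here's a conceptual C sketch -- vl_for() and vadd_slice() are made-up scalar stand-ins for the real vector instructions, not any actual RISC-V V or SVE API:

    #include <stddef.h>

    /* Stand-in for "set vector length": real hardware would return however
       many elements its vector registers hold, capped at the work remaining. */
    static size_t vl_for(size_t n) { return n < 4 ? n : 4; }

    /* Stand-in for one vector add of vl elements. */
    static void vadd_slice(double *a, const double *b, size_t vl) {
        for (size_t i = 0; i < vl; i++) a[i] += b[i];
    }

    /* The same loop runs unchanged whatever vector length the hardware has. */
    void vadd(double *a, const double *b, size_t n) {
        while (n > 0) {
            size_t vl = vl_for(n);
            vadd_slice(a, b, vl);
            a += vl; b += vl; n -= vl;
        }
    }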