At the outset here, I want to point out that my previous post was describing what characteristics are generally associated with the term "RISC", not whether they are individually or collectively good or bad.
I think the collective goodness or badness is sufficiently addressed by the simple fact that since 1980 there has been no (successful) new CISC instruction set, and the ones that have survived have done so by translating to a more RISCy form internally.
CISC doesn't even win for compactness of programs in the 32 bit and 64 bit eras. RISC architectures with a simple mix of 2-byte and 4-byte instructions do. VAX and x86 both allow instructions with lengths from 1 byte to 15 or 16 bytes, in 1 byte increments. VAX can go even bigger in some extreme instructions, but let's stick to common things such as ADDL3. In the 64 bit era, x86's 15 byte limit is by fiat -- the encoding easily allows longer instructions.
- strict separation of computation from data transfer (load/store)
On the other hand, allowing ALU instructions to have one memory operand acts as a type of instruction set compression, lowers register pressure, and seems to have little disadvantage when out-of-order execution allows long load-to-use latencies from cache.
Lowers register pressure yes. "Little disadvantage" with OoO: yes. But a significant disadvantage with simpler in-order implementations, such as microcontrollers.
Program compression: it seems logical, but is not borne out by empirical data.
It's logical that not having to mention a temporary register name (twice!) in a load and subsequent arithmetic instruction could make for smaller programs. But no one seems to have been able to achieve this in practice.
I think the problem is that in all of the PDP11, VAX, 68k, and x86, each operand that can potentially be in memory is accompanied by an addressing mode field, which is *wasted* in the more common case (if you have enough registers) when the operand is actually in a register. PDP11 and 68k have, in a 16 bit opcode, two 3 bit register fields plus two 3 bit addressing mode fields. VAX has a 4 bit register field plus a 4 bit addressing mode for each operand. x86 limits one operand to be a register but still effectively wastes three bits out of 16 for reg-reg operations (1 bit in the opcode to select reg-mem or mem-reg, plus 2 bits in the modr/m byte).
If not for those often-wasted addressing mode fields, all of PDP11, 68k and x86 could perhaps have fit 3-address instructions into a 16 bit opcode instead (as Thumb does for add & sub), saving a lot of the extra instructions needed to first copy one operand to the register where the result is required.
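A rough sketch of the arithmetic (field widths as above; the opcode keeps whatever is left of the 16 bits):

    two-address, PDP11-style:    16 = opcode(4) + [mode(3) reg(3)] + [mode(3) reg(3)]
    three-address, Thumb-style:  16 = opcode(7) + reg(3) + reg(3) + reg(3)

In the first layout a reg-reg operation spends 6 of its 16 bits describing addressing modes it isn't using; the second spends those bits on a third register field and a larger opcode space instead.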
- enough registers that you don't touch memory much. Arguments for most functions fit in registers, and the return address too (the otherwise RISC AVR8 violates this).
But if the register set is too large, it means more state to save. There are other solutions for this though.
Sure. It's possible to go overboard. Four registers doesn't seem to be enough. 128 is too many. Sixteen seems to be a good number if you have complex addressing modes (big immediates and offsets, register plus scaled register), and thirty two if you don't. Eight isn't really enough, even with complex addressing modes, but two sets of eight seems workable, e.g. 8 address plus 8 data (68k), or 8 general purpose "lo" registers plus 8 more limited "hi" registers (Thumb). Or 8 directly addressable, plus 8 that need an extra prefix byte to address (x86_64).
- no instruction can cause more than one cache or TLB miss, or two adjacent lines/pages if unaligned access is supported (and this case might be trapped and emulated)
The lack of hardware support for unaligned access always seems to end up being a performance problem once a processor gets deployed into the real world.
Weak memory ordering, which seems like it should be an advantage, also becomes a liability.
Unaligned access, I agree.
Weak memory ordering ... I think languages, compilers, and programmers have got that under control now. It took a while. I don't think TSO scales well to 100+ cores ... or even 50. We'll really start to see this bite (or not) in the next five years.
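As a concrete example of what "under control" means: since C11/C++11 the ordering requirement is stated once in the source with acquire/release atomics, and the compiler emits whatever the target's memory model needs -- essentially nothing extra on a TSO machine, explicit ordering instructions or fences on a weakly ordered one. A minimal sketch:

    #include <stdatomic.h>
    #include <stdbool.h>

    int payload;
    _Atomic bool ready = false;

    /* Producer: write the data, then publish it with a release store. */
    void publish(int value) {
        payload = value;
        atomic_store_explicit(&ready, true, memory_order_release);
    }

    /* Consumer: an acquire load that sees ready==true also sees payload. */
    bool try_consume(int *out) {
        if (atomic_load_explicit(&ready, memory_order_acquire)) {
            *out = payload;
            return true;
        }
        return false;
    }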
- each instruction modifies at most one register.
That is pretty standard but how then do you handle integer multiplies and divides? Break them up into two instructions?
Yes. The high part of multiplies and the remainder from divisions are very rarely needed. Better to define separate instructions for them. Higher-end processors can notice both are being calculated and combine them, if that's profitable.
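A small C illustration of the (uncommon) case where both results are wanted. On a 64 bit RISC-V the compiler emits a separate DIV and a separate REM, each writing one register, and my understanding is the M extension spec explicitly encourages fusing that adjacent pair back into a single divide where it's profitable:

    #include <stdint.h>

    /* Quotient and remainder both wanted: two instructions (DIV then REM),
       which a higher-end core may notice and combine into one divide. */
    void divmod(int64_t a, int64_t b, int64_t *q, int64_t *r) {
        *q = a / b;
        *r = a % b;
    }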
- integer instructions read at most two registers. This is ultra-purist :-) A number of RISC ISAs break it in order to have e.g. a register plus (scaled) register addressing mode, or conditional select. But no more than three!
Internally it seems like this sort of thing and modifying more than one register should be broken up into separate micro-operations so that the register file has a lower number of read and write ports. The alternative is having to decode more instructions which clog up the front end once an out-of-order implementation is desired.
Breaking complex instructions up into several microops for execution is a valid thing to do, especially on a lower end implementation. Intel obviously does this at large scale, but even ARM does it quite a lot, including in Aarch64.
Recognising adjacent simple operations and dynamically combining them into a more powerful single operation is also a valid thing to do. Modern x86 does this, for example, when there is a compare followed by a conditional branch. Future high end RISC-V CPUs are also expected to do this heavily.
The former puts a burden onto low end implementations. The latter keeps low end implementations as simple as possible, while putting a burden onto high end implementations, which can perhaps more easily afford it.
On the other hand, this means discarding the performance advantages of the FMA instruction.
I was explicitly talking about *integer* instructions. Most floating point code uses FMA very heavily (if -ffast-math is enabled) and IEEE 754-2008 mandates it. It's a big performance advantage to have three read ports on the FP register file, worth the expense. It would be wasted most of the time on the integer register file.
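For example, the inner loop of a dot product is nothing but multiply-adds; with contraction enabled (-ffast-math, or -ffp-contract=fast) each iteration becomes a single FMA reading three FP registers:

    #include <stddef.h>

    /* With FP contraction the multiply and add fuse into one FMA per element. */
    double dot(const double *a, const double *b, size_t n) {
        double acc = 0.0;
        for (size_t i = 0; i < n; i++)
            acc += a[i] * b[i];
        return acc;
    }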
- no microcode or hardware sequencing. Each instruction executes in a small and fixed number of clock cycles (usually one). Load/Store multiple are the main offenders in both ARM and PowerPC. They help with code size, but it's interesting that ARM didn't put them in Aarch64 and is deprecating them in 32 bit as well, providing the much less offensive load/store pair.
Maybe more interesting is why ARM even included them in the first place. Load and store multiple took advantage of fast-page-mode DRAM back when ARM's instruction pipeline was closely linked with DRAM access.
Yes. Modern caches achieve the same effect with individual stores.
Load/store multiple do make programs significantly smaller, especially in function prologue and epilogue.
RISC-V gcc has an option "-msave-restore" (this might get included in -Os later) that emits, as the first instruction in a function, a call to one of several special library routines that create the stack frame and save the return address plus s0-s2, s0-s6, s0-s10, or s0-s11. Function return is then done by tail-calling the corresponding restore routine.
This replaces anything up to 29 instructions (116 bytes with rv32i or rv64i, 58 bytes with the C extension) with two instructions (8 bytes with rv32i or rv64i, 6 bytes with the C extension, though I have a plan to reduce that to 4). Of course the average case is a lot less than that -- most functions that create a stack frame at all use three or fewer callee-save registers, so you're usually replacing 5/7/9/11 instructions with 2.
The time cost is three extra jump instructions, plus sometimes a couple of unneeded loads and stores because not every size is provided in order to keep the library size down.
In the current version, the total size of the save/restore library functions is 96 bytes with RVC enabled.
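If anyone wants to try it, the flag just goes on the normal compile line, something like the following (the exact toolchain name and -march/-mabi values will depend on your setup):

    riscv64-unknown-elf-gcc -Os -march=rv32imac -mabi=ilp32 -msave-restore -c foo.c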
I ran a couple of tests on a HiFive1 board with gcc 7.2.0.
Dhrystone w/rv32im, -msave-restore made the program 4 bytes smaller, and 5% slower (1.66 vs 1.58).
CoreMark w/rv32imac, -msave-restore made the program 252 bytes (0.35%) smaller, and 0.4% FASTER (2.676 vs 2.687, with +/- 0.001 variance on different runs).
I attribute the faster speed for CoreMark to greater effectiveness of the 16 KB instruction cache.
Both of these are very small programs of course (especially Dhrystone). Bigger programs will show more difference.
Should stack instructions be broken up as well?
Yes. This is done almost universally in RISC ISAs. A function starts with decrementing the stack pointer (once) and ends with incrementing it. Each register (or sometimes register pair) is saved and restored with an individual instruction -- and on a superscalar processor several of those might run in each clock cycle.
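For example, a typical RISC-V prologue and epilogue for a function saving the return address and two callee-save registers looks roughly like this (frame size, offsets, and register choice illustrative):

    addi sp, sp, -16    # one stack pointer adjustment for the whole frame
    sw   ra, 12(sp)     # each register saved with an ordinary store...
    sw   s0, 8(sp)
    sw   s1, 4(sp)      # ...and these stores can issue in parallel
    ...
    lw   ra, 12(sp)     # epilogue: ordinary loads
    lw   s0, 8(sp)
    lw   s1, 4(sp)
    addi sp, sp, 16     # one adjustment back
    ret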
The x86's push/pop instructions are very hard on OoO machinery, and recent implementations have special push/pop handling (keeping track of the stack pointer) IN THE DECODER, translating each push or pop in a series into a store or load at an offset from the stack pointer as it was at the start of the sequence.
What a huge number of instructions *does* do is make very small low end implementations impossible. And it puts a big burden of work on every hardware and emulator implementer.
I do not know about that. Multiple physical designs covering a wide performance range are possible with the same ISA. Microcode is convenient to handle seldom used instructions. Vector operations can be broken up into instructions which fit the micro-architecture's ALU width while allowing support for the same vector instruction set across a wide range of implementations.
Or you can use instruction set extensions every time you want to support a different vector length. How many FPU ISAs has ARM gone through now?
Seldom-used operations are handled just as easily by library routines as by microcode -- especially ones which are inherently slow. This is commonly done in every CPU family for floating point operations. Sometimes the compiler generates a binary with instructions replaced by library calls. Sometimes the instruction is generated but on execution traps into the operating system which then calls the appropriate library function.
Microcode made sense when ROM within the processor was faster than RAM, but since about 1980 SRAM has actually been faster than ROM. You could copy the microcode into SRAM at boot -- and even allow the user to write some custom microcode, as the VAX did. Or you can use that SRAM as an icache and create instruction sets that don't need microcode. Then real functions are just as fast to call as microcode ones.
When Dave Patterson (who coined the terms "RISC" and later "RAID" and is currently involved with RISC-V and Google's TPU) was on sabbatical at DEC, he discovered that even on the VAX, using a series of simple instructions was faster than using the complex microcoded instructions such as CALLS and RET (which automatically saved and restored registers for you, according to a bitmask at the start of the function).
John Cocke at IBM independently discovered the same fact about the IBM 370 at about the same time (in fact a couple of years earlier).
I agree with you about vectors. ARM has gone through several vector/multimedia instruction sets, and Intel is up to at least number 4 (depending on how you count the different iterations of SSE).
The RISC-V vector instruction set which is being finalised at the moment (I'm on the Working Group) -- and which has been in development and testing in various iterations for at least ten years (it's the main reason the RISC-V project was started in the first place) -- is vector length agnostic. The same code will run on hardware with any vector length.
ARM's new SVE is of course similar.
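"Vector length agnostic" in practice means the strip-mining loop asks the hardware how many elements it will handle this iteration rather than baking a width into the code. Here's a conceptual C sketch -- vl_for() and vadd_slice() are made-up scalar stand-ins for the real vector instructions, not any actual RISC-V V or SVE API:

    #include <stddef.h>

    /* Stand-in for "set vector length": real hardware would return however
       many elements its vector registers hold, capped at the work remaining. */
    static size_t vl_for(size_t n) { return n < 4 ? n : 4; }

    /* Stand-in for one vector add of vl elements. */
    static void vadd_slice(double *a, const double *b, size_t vl) {
        for (size_t i = 0; i < vl; i++) a[i] += b[i];
    }

    /* The same loop runs unchanged whatever vector length the hardware has. */
    void vadd(double *a, const double *b, size_t n) {
        while (n > 0) {
            size_t vl = vl_for(n);
            vadd_slice(a, b, vl);
            a += vl; b += vl; n -= vl;
        }
    }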