Author Topic: The RISC-V ISA discussion  (Read 19344 times)


Online SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 15094
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #125 on: February 09, 2020, 02:55:46 pm »
I found it pretty interesting to think about how the two-bit saturating counter is 'good enough' for branch prediction.

If you have any loop (e.g. iterating over a list, or "for(i=0;i<10;i++) ..."), after one or two passes it quickly learns the flow through the code, and when the loop finishes, the branch-prediction history doesn't get completely trashed by the exit.

Exiting the loop also removes half of the history (going from 11 to 10, or from 00 to 01), so next time the entry in the table is used it will quickly adapt to the new pattern.  :-+

Yes, it's really just a compromise between "latency" (the number of branches it takes for the prediction to adapt) and accuracy. One bit is too little, two bits is the sweet spot (confirmed by many studies and by my own tests.) I've tried 3 bits and more; it consistently performs worse on average. (Of course you can always devise very specific examples for which more bits perform better, but the main point is to devise something that works well enough in most situations.) Maybe a little more "intelligence" in the predictor could use some kind of adaptive counter depth. (That might have been done already; I have read quite a few papers but certainly not all.)
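To make the state transitions concrete, here's a minimal C sketch of such a counter (my illustration, not lifted from any particular core):

Code: [Select]
#include <stdbool.h>

/* Two-bit saturating counter: 0 = strongly not-taken,
   1 = weakly not-taken, 2 = weakly taken, 3 = strongly taken. */
typedef unsigned char ctr2;

static bool predict_taken(ctr2 c) { return c >= 2; }

/* Saturating update: a single loop exit only moves "strongly
   taken" (3) to "weakly taken" (2), so the next visit to the
   loop still predicts taken, as discussed above. */
static ctr2 update(ctr2 c, bool taken)
{
    if (taken) return c < 3 ? c + 1 : 3;
    else       return c > 0 ? c - 1 : 0;
}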

Real "loops" can be predictable per se at compile time and even with good predictors, you always have the overhead of the branch instruction itself (which will take one cycle even with the best predicition.) To further optimize loops without requiring complex predictors (which would not be better for other cases anyway), you can also implement some kind of hardware loop extension. That's what they did with PULP (you can take a look at the documents to see what they did.) Loops will execute N times without the extra branch instruction.
« Last Edit: February 09, 2020, 02:58:11 pm by SiliconWizard »
 

Online SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 15094
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #126 on: February 15, 2020, 03:13:35 pm »
OK, now the time has come to start implementing this in HDL.

And now I'm facing something new. While I'm currently interested in implementing core(s) that are relatively simple and, for instance, would not deal with any kind of memory cache (just directly coupled SRAM blocks), my prototyping will obviously be done on FPGAs. Problem is, unless I go for a very large FPGA (expensive shit), the amount of embedded block RAM is relatively limited. I can certainly test things, but I will quickly be limited when running even moderately large programs.

And of course many dev boards (I already have quite a few) have a nice amount of SDRAM or DDR RAM. Problem is, it's impossible to use them without implementing caches, unless you're ready to have your core run at a very low frequency.

So now my next task is to implement caches.

There would be another possible approach - designing my own dev board with some FPGA + several fast SRAM chips. Not cheap and would require a very large number of IOs in order to be able to access the different SRAM chips concurrently...

And while working on caches, I now fully see why OoO execution would be particularly helpful. It can help a lot with keeping the core busy when there are cache misses. Dang, looks like I'm opening a can of worms.

Have any of you already implemented simple CPU cores with memory caches? If so, what approach did you take, and did you evaluate the average penalty (compared to a simple, tightly-coupled memory system)?
« Last Edit: February 15, 2020, 03:17:07 pm by SiliconWizard »
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2812
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #127 on: February 16, 2020, 02:23:37 am »
And while working on caches, I now fully see why OoO execution would be particularly helpful. It can help a lot with keeping the core busy when there are cache misses. Dang, looks like I'm opening a can of worms.

To me Out of Order ('OoO') execution seems to be helpful only for hiding the hit from cache misses, as it can only hide a few cycles of latency. If memory accesses can raise exceptions, then it's a whole new level of complexity to make it all work properly.

I've been playing around with implementing the Tomasulo Algorithm to allow OoO, but decided that it is only super-useful if either your instructions take multiple cycles, or you can issue multiple instructions per cycle... The classical implementation limits performance to generating at most one result each cycle.
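For reference, the textbook shape of a reservation station entry (conventional field names, not from any particular implementation):

Code: [Select]
/* One Tomasulo reservation-station entry. An instruction issues
   when a station is free, executes once both Q tags are clear,
   and broadcasts its result on the common data bus (CDB). The
   single CDB is what limits the classic scheme to one result
   per cycle. */
typedef struct {
    int      busy;    /* station in use                        */
    int      op;      /* operation to perform                  */
    unsigned vj, vk;  /* operand values, once available        */
    int      qj, qk;  /* tags of producing stations; 0 = ready */
} reservation_station;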

Also, the cache doesn't have to be that big to be useful - 80486 had 8KB or 16KB of L1 cache...
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Online SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 15094
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #128 on: February 16, 2020, 03:21:00 am »
I've been playing around with implementing the Tomasulo Algorithm to allow OoO, but decided that it is only super-useful if either your instructions take multiple cycles, or you can issue multiple instructions per cycle... The classical implementation limits performance to generating at most one result each cycle.

Yeah, I've studied that a bit, but I definitely don't want to mess with OoO at this point. I'm just working through all of this mentally, and I'm starting to see more clearly when things are useful and why they are commonly used in today's high-performance CPUs...

Also, the cache doesn't have to be that big to be useful - 80486 had 8KB or 16KB of L1 cache...

I'm going to run some more cache experiments with my simulator first to get an idea, especially to select appropriate cache sizes and try simple replacement policies. Yes, I'm basically thinking of implementing a 32KB instruction cache and a 32KB or 64KB data cache; the rest of the EBR will be used for register files and branch prediction (this should fit in the FPGAs I'm targeting right now).

While working on memory access, I'm also thinking of a possible extension for dealing with memory copies. It's a very frequent operation in actual code, and doing it with the base instruction set and loops with branches looks rather inefficient - that looks like an opportunity for a specific extension. I've thought of just making it part of a more general DMA coprocessor, but typical DMA would require significant setup overhead for starting every copy/transfer (and would have to be much more generic than just being able to move memory blocks...), typically with a set of memory-mapped registers... So, specifically for memory-to-memory transfers, I think an extension with just a couple of specific and well-thought-out instructions could improve things drastically.
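To illustrate the overhead in question, here's a plain word-copy loop using only the base ISA (a generic sketch; the instruction counts are approximate and compiler-dependent):

Code: [Select]
#include <stdint.h>
#include <stddef.h>

/* Word-by-word copy: per iteration the base ISA spends roughly
   a load, a store, one or two pointer/index updates and a
   branch, i.e. on the order of 5 instructions per 4 bytes
   moved. This is the overhead a dedicated memory-copy
   extension could remove. */
void copy_words(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}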

 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4343
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #129 on: February 16, 2020, 08:02:24 am »
And while working on caches, I now fully see why OoO execution would be particularly helpful. It can help a lot with keeping the core busy when there are cache misses. Dang, looks like I'm opening a can of worms.

To me Out of Order ('OoO') execution seems to be helpful only for hiding the hit from cache misses, as it can only hide a few cycles of latency.

Or long-latency instructions. And it can be quite a lot of clock cycles on a cache miss if you have to go all the way past L1, L2, and maybe L3 cache to RAM.

In 1996 the Alpha 21264 could have 80 instructions in the OoO engine, the largest up to that point. The Pentium Pro (1995) had a 40 entry Reorder Buffer.

Intel has gradually been increasing the size of the Reorder Buffer:

Nehalem: 128 uops
Sandy Bridge: 168
Haswell: 192
Skylake: 224

That can hide a *lot* of latency (224 uops at, say, 4 per cycle is over 50 cycles' worth).

Quote
Also, the cache doesn't have to be that big to be useful - 80486 had 8KB or 16KB of L1 cache...

The 68020 and 68030 got useful speedups from a 256-byte icache; the 68030 added a 256-byte dcache. The 68040 increased both to 4K, which made a really big difference.

The PowerPC 603 had 8k each of icache and dcache, which worked well for normal code, but it performed badly with the m68000 emulator, which used a huge 512 KB switch statement containing two PPC instructions for each of the 65536 possible m68k opcodes. In some cases the 1st PPC instruction completely emulated the m68k instruction and the 2nd jumped back to fetch the next m68k instruction. In other cases the 1st instruction just set up some information about the m68k instruction with a load immediate and the 2nd jumped to some common handler code.

See https://groups.google.com/d/msg/comp.sys.powerpc/jfSUDOGuNNM/BHeYAVoT2NIJ

The PowerPC 603e increased the L1 caches to 16k each, which fixed this problem.
 
The following users thanked this post: SiliconWizard, I wanted a rude username

Online SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 15094
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #130 on: March 04, 2020, 06:26:32 pm »
Alright, I've implemented memory caches. So far I have added only an instruction cache (the memory cache mechanism itself was easy, but properly inserting it into the pipeline was not... correctly stalling the pipeline in ALL cases, while staying as efficient as possible, with the added instruction cache took a while to get right.) So the next step will be to add a data cache, now that this is ironed out.
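To give an idea of the stall rule in question, here's a toy C model (illustrative names, not the actual pipeline code):

Code: [Select]
#include <stdbool.h>

/* Fetch-stage behaviour on an icache miss: hold the PC and
   feed a bubble downstream so later stages can drain; if a
   later stage stalls, hold everything. Illustrative only. */
typedef struct { bool valid; unsigned instr, pc; } stage_reg;

void fetch_step(bool icache_hit, unsigned fetched, unsigned pc,
                bool downstream_stall, stage_reg *if_id)
{
    if (downstream_stall) return;  /* hold current contents     */
    if (!icache_hit) {
        if_id->valid = false;      /* bubble; the PC holds too  */
        return;
    }
    if_id->valid = true;           /* normal advance            */
    if_id->instr = fetched;
    if_id->pc    = pc;
}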

I've tested the instruction cache with the following parameters:  4-way set associative, 64 bytes per line, 32KB, and a PLRUm replacement policy.

In most tests I have done, the penalty is negligible (of course thanks to reasonable code locality on average). There is only a marginal difference with CoreMark, linpack and Bruce's primes benchmark, and the miss rate is consistently less than 0.01%.

I also tested a 2-way, 16KB cache, still with 64 bytes/line. The difference compared to 4-way, 32KB was negligible with the above benchmarks, but obviously more significant on code with a lot of calls to functions in farther parts of the object code, where, not surprisingly, the miss rate almost doubled on average (e.g. 0.71% vs. 0.38% in one of my tests).
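For reference, PLRUm boils down to one MRU bit per way; here's a minimal C sketch of the idea (my simplified illustration, 4 ways):

Code: [Select]
#include <stdint.h>

#define WAYS 4

/* PLRUm: one "recently used" bit per way in each set. The
   victim is the first way whose bit is clear; when marking an
   access would set all the bits, clear the others so the bits
   stay useful. */
typedef struct { uint8_t mru; } plrum_set;

static void plrum_touch(plrum_set *s, int way)
{
    s->mru |= 1u << way;
    if (s->mru == (1u << WAYS) - 1)   /* all ways marked: reset */
        s->mru = 1u << way;
}

static int plrum_victim(const plrum_set *s)
{
    for (int w = 0; w < WAYS; w++)
        if (!(s->mru & (1u << w)))
            return w;
    return 0;  /* not reached: touch() never leaves all bits set */
}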
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4343
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #131 on: March 04, 2020, 07:35:54 pm »
Alright, I've implemented memory caches. So far I have added only an instruction cache (the memory cache mechanism itself was easy, but properly inserting it into the pipeline was not... correctly stalling the pipeline in ALL cases, while staying as efficient as possible, with the added instruction cache took a while to get right.) So the next step will be to add a data cache, now that this is ironed out.

I've tested the instruction cache with the following parameters:  4-way set associative, 64 bytes per line, 32KB, and a PLRUm replacement policy.

In most tests I have done, the penalty is negligible (of course thanks to reasonable code locality on average). There is only a marginal difference with CoreMark, linpack and Bruce's primes benchmark, and the miss rate is consistently less than 0.01%.

I also tested a 2-way, 16KB cache, still with 64 bytes/line. The difference compared to 4-way, 32KB was negligible with the above benchmarks, but obviously more significant on code with a lot of calls to functions in farther parts of the object code, where, not surprisingly, the miss rate almost doubled on average (e.g. 0.71% vs. 0.38% in one of my tests).

My primes benchmark compiles to around 200 bytes of code, so any instruction cache at all is going to work :-)  It uses 8 KB of data, so doesn't need a lot of cache there either.

I've noticed experimentally that CoreMark runs well with a 16 KB instruction cache if you enable the C extension, but considerably less well without it. If I recall correctly, on the HiFive1 it's around a factor of two in execution time. That's an extreme case, because icache misses have to go to SPI flash, which is very slow. Turning on -msave-restore (which uses runtime functions to save registers in function prologues and restore them in function epilogues) also gives a couple of percent speedup despite the extra instructions executed, because a bit more of the hot code fits into the instruction cache.
 

Online SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 15094
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #132 on: March 04, 2020, 08:44:00 pm »
My primes benchmark compiles to around 200 bytes of code, so any instruction cache at all is going to work :-)  It uses 8 KB of data, so doesn't need a lot of cache there either.

Yup obviously. ;D

I've noticed experimentally that CoreMark runs well with a 16 KB instruction cache if you enable the C extension, but considerably less well without it.

That's interesting. It may depend on your compilation options a bit - dunno.
I haven't implemented the C extension yet, so it's purely RV32IM here, but even with a 16KB cache I see little penalty compared to 32KB. I get a negligible miss rate, so even a high per-miss penalty would not matter much (unless of course the penalty was HUGE.)

What kind of icache is implemented in the CPU you're mentioning? Is it n-way set associative? If so, what is "n", and what's the replacement policy?

Also, what core frequency do you run it at? Because of course, with very slow flash and a very high core frequency, miss penalties will have a huge impact. (In my simulator I have set up a typical penalty for accessing DDR RAM.)

What would be interesting is to know the icache miss rate you get with CoreMark (I don't know if this CPU has instrumentation registers that allow figuring this out...?)
« Last Edit: March 04, 2020, 08:46:16 pm by SiliconWizard »
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4343
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #133 on: March 05, 2020, 12:56:07 am »
FE310-G000 in a HiFive1, running at 256 MHz or 320 MHz depending on my whim. The icache is 16 KB, 2-way set associative, with 32-byte cache lines.

It takes around 1 us to do a 1/2/4-byte data load from the SPI. An icache line fill will be slower, but not hugely, because it will do a burst transfer. Either way it's several hundred clock cycles (1 us is 256 to 320 cycles at those clock speeds), much like missing all the way to DRAM on a modern x86.

Sadly, the hardware performance monitor has only clock cycles and instructions retired counters.

https://sifive.cdn.prismic.io/sifive%2F500a69f8-af3a-4fd9-927f-10ca77077532_fe310-g000.pdf

I could try on the FU540 in the HiFive Unleashed, which has a much more comprehensive performance monitor. And L2 cache.
 

Online SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 15094
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #134 on: March 05, 2020, 02:52:00 pm »
In my tests, enough of CoreMark's "core" code seems to fit within 16KB that I still get an extremely low miss rate even with just a 16KB cache. I'm not sure what would explain the difference with the FE310 here. The only thing that may differ is the replacement policy, but with only 2 ways, that shouldn't make much difference, if any. The other difference is the line size (I use 64 bytes), but I don't think it would matter here either. If not all the code fits within 16KB, larger lines could actually be more of a problem than a benefit?
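For concreteness, here's how the two geometries being compared slice an address (simple arithmetic, assuming the usual offset/index/tag split):

Code: [Select]
#include <stdint.h>

/* FE310-style icache: 16 KB, 2-way, 32-byte lines
     16384 / (32 * 2) = 256 sets: 5 offset bits, 8 index bits.
   My config: 32 KB, 4-way, 64-byte lines
     32768 / (64 * 4) = 128 sets: 6 offset bits, 7 index bits. */
enum { LINE = 32, WAYS = 2, SIZE = 16 * 1024,
       SETS = SIZE / (LINE * WAYS) };

static uint32_t set_index(uint32_t addr) { return (addr / LINE) % SETS; }
static uint32_t tag_bits(uint32_t addr)  { return addr / (LINE * SETS); }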

Of course the other difference could be the compiled code. I am currently using GCC 9.2 (a custom build); I could try SiFive's toolchain (currently based on GCC 8.3 IIRC?), although I don't expect much of a difference. (But I could very well be "lucky" here, and it could be a matter of just a few dozen fewer bytes!)
 

Online SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 15094
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #135 on: March 14, 2020, 06:15:36 pm »
The data cache is now also implemented. I ran all my tests with both instruction and data caches enabled. Both are 32KB, 4-way set associative.

For CoreMark, I get a negligible penalty compared to tightly-coupled memory.

One of the most taxing benchmarks was linpack with N=1000 and FP emulation (it uses over 8MB of data memory; heavy matrix computation stuff with FP.) About a 2.2% slowdown compared to tightly-coupled memory. Average CPI = 1.054. The miss rate on the instruction cache is negligible (everything fits within 32KB with no problem, even though FP emulation is used.) The miss rate on the data cache is ~0.357%. Probably not the best possible, but I'm pretty happy with the results given that I used a pretty simple replacement policy (PLRUm - I read a few papers, and it looked like the sweet spot for a very simple yet effective policy.)

If anyone knows of typical benchmark code for data caches, I'll take it!
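(For reference, the kind of kernel I have in mind is a strided-read sweep, in the spirit of lmbench's lat_mem_rd; a generic sketch, not from a specific suite:)

Code: [Select]
#include <stdint.h>
#include <stddef.h>

#define LEN (8u << 20)          /* 8 MiB: well past the dcache */
static uint8_t buf[LEN];

/* Sweeping `stride` past the line size and `len` past each
   cache level exposes the miss behaviour at every point of
   the hierarchy. */
uint64_t sweep(size_t len, size_t stride)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < len; i += stride)
        sum += buf[i];          /* one read per potential miss */
    return sum;                 /* consumed, so not elided     */
}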
 

Online SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 15094
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #136 on: March 19, 2020, 06:05:40 pm »
OK, I've just devised some code that works like memory-testing code (writing patterns and reading them back, in linear order or otherwise).

On large buffers (4 MB) I get an average miss rate of ~7%. Not hugely surprising.
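That's in line with simple arithmetic for the linear passes. A sketch of the kind of test I mean (illustrative, not my exact code):

Code: [Select]
#include <stdint.h>
#include <stddef.h>

/* Fill-then-verify over a buffer much larger than the cache.
   With 64-byte lines and 32-bit accesses, each pass misses
   once per line: 4/64 = 6.25% per access, close to the ~7%
   observed (assuming write-allocate on the write pass). */
int pattern_test(volatile uint32_t *buf, size_t n_words)
{
    for (size_t i = 0; i < n_words; i++)
        buf[i] = (uint32_t)(i * 2654435761u);   /* write pattern */
    for (size_t i = 0; i < n_words; i++)
        if (buf[i] != (uint32_t)(i * 2654435761u))
            return 0;                           /* mismatch      */
    return 1;
}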

I have also benchmarked memcpy() for various buffer sizes from 1KB to 4MB. Got the following results:

Code: [Select]
* memcpy(): 1 KiB, Exec. cycles = 1261 (1.23 cycles/byte)
* memcpy(): 2 KiB, Exec. cycles = 1738 (0.85 cycles/byte)
* memcpy(): 4 KiB, Exec. cycles = 3412 (0.83 cycles/byte)
* memcpy(): 8 KiB, Exec. cycles = 6876 (0.84 cycles/byte)
* memcpy(): 16 KiB, Exec. cycles = 23188 (1.42 cycles/byte)
* memcpy(): 32 KiB, Exec. cycles = 59268 (1.81 cycles/byte)
* memcpy(): 64 KiB, Exec. cycles = 130742 (1.99 cycles/byte)
* memcpy(): 128 KiB, Exec. cycles = 261112 (1.99 cycles/byte)
* memcpy(): 256 KiB, Exec. cycles = 521858 (1.99 cycles/byte)
* memcpy(): 512 KiB, Exec. cycles = 1043440 (1.99 cycles/byte)
* memcpy(): 1024 KiB, Exec. cycles = 2086538 (1.99 cycles/byte)
* memcpy(): 2048 KiB, Exec. cycles = 4172772 (1.99 cycles/byte)
* memcpy(): 4096 KiB, Exec. cycles = 8345270 (1.99 cycles/byte)

I'd be curious to compare that with what you get on an FU540...
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4343
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #137 on: March 20, 2020, 08:03:21 am »
Ok, on FU540 at 1.45GHz:

Code: [Select]
* memcpy(): 1 KiB, Exec. cycles = 549 (0.54 cycles/byte)
* memcpy(): 2 KiB, Exec. cycles = 1018 (0.50 cycles/byte)
* memcpy(): 4 KiB, Exec. cycles = 1953 (0.48 cycles/byte)
* memcpy(): 8 KiB, Exec. cycles = 3812 (0.47 cycles/byte)
* memcpy(): 16 KiB, Exec. cycles = 11051 (0.67 cycles/byte)
* memcpy(): 32 KiB, Exec. cycles = 18994 (0.58 cycles/byte)
* memcpy(): 64 KiB, Exec. cycles = 91698 (1.40 cycles/byte)
* memcpy(): 128 KiB, Exec. cycles = 425028 (3.24 cycles/byte)
* memcpy(): 256 KiB, Exec. cycles = 1780168 (6.79 cycles/byte)
* memcpy(): 512 KiB, Exec. cycles = 3308322 (6.31 cycles/byte)
* memcpy(): 1024 KiB, Exec. cycles = 5550824 (5.29 cycles/byte)
* memcpy(): 2048 KiB, Exec. cycles = 10328072 (4.92 cycles/byte)
* memcpy(): 4096 KiB, Exec. cycles = 20878807 (4.98 cycles/byte)
* memcpy(): 8192 KiB, Exec. cycles = 40870825 (4.87 cycles/byte)
* memcpy(): 16384 KiB, Exec. cycles = 81417119 (4.85 cycles/byte)
* memcpy(): 32768 KiB, Exec. cycles = 162796040 (4.85 cycles/byte)
* memcpy(): 65536 KiB, Exec. cycles = 325120247 (4.84 cycles/byte)
* memcpy(): 131072 KiB, Exec. cycles = 649763946 (4.84 cycles/byte)
* memcpy(): 262144 KiB, Exec. cycles = 1298570608 (4.84 cycles/byte)
* memcpy(): 524288 KiB, Exec. cycles = 2600203293 (4.84 cycles/byte)
* memcpy(): 1048576 KiB, Exec. cycles = 5197559587 (4.84 cycles/byte)

So memcpy() in DRAM is 285 MB/sec (1.45 GHz / 4.84 cycles per byte).


Here's the benchmark code:

Code: [Select]
#include <stdio.h>
#include <string.h>

typedef unsigned long ulong;

/* Read the RISC-V cycle-counter CSR. */
ulong read_cycles() {
    ulong cycles;
    asm volatile ("rdcycle %0" : "=r" (cycles));
    return cycles;
}

typedef void test_proc(ulong arg);

/* Run the test 10 times and keep the best (minimum) cycle count. */
ulong measure_time(test_proc p, ulong arg) {
  ulong min = -1;  /* wraps to ULONG_MAX */
  for (int i=0; i<10; ++i) {
    ulong start = read_cycles();
    p(arg);
    ulong cycles = read_cycles() - start;
    if (min > cycles) min = cycles;
  }
  return min;
}

#define MAX_SZ (1<<30)
char buf_a[MAX_SZ], buf_b[MAX_SZ];

void empty(ulong arg) {}
void do_memcpy(ulong sz) {memcpy(buf_a, buf_b, sz);}

int main() {
  ulong empty_time = measure_time(empty, 0);  /* timing overhead (unused here) */
  for (ulong kb=1; kb<=(1<<20); kb*=2) {
    ulong t = measure_time(do_memcpy, 1024*kb);
    printf("* memcpy(): %lu KiB, Exec. cycles = %lu (%4.2f cycles/byte)\n",
           kb, t, t/(1024.0*kb));
  }
}


 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4343
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #138 on: March 20, 2020, 01:49:11 pm »
Incidentally, if I drop the FU540 down to 37.75 MHz (the slowest clock the board can generate) then I get:

Code: [Select]
* memcpy(): 1 KiB, Exec. cycles = 554 (0.54 cycles/byte)
* memcpy(): 2 KiB, Exec. cycles = 1016 (0.50 cycles/byte)
* memcpy(): 4 KiB, Exec. cycles = 1969 (0.48 cycles/byte)
* memcpy(): 8 KiB, Exec. cycles = 3906 (0.48 cycles/byte)
* memcpy(): 16 KiB, Exec. cycles = 9683 (0.59 cycles/byte)
* memcpy(): 32 KiB, Exec. cycles = 19035 (0.58 cycles/byte)
* memcpy(): 64 KiB, Exec. cycles = 115153 (1.76 cycles/byte)
* memcpy(): 128 KiB, Exec. cycles = 257464 (1.96 cycles/byte)
* memcpy(): 256 KiB, Exec. cycles = 486513 (1.86 cycles/byte)
* memcpy(): 512 KiB, Exec. cycles = 943371 (1.80 cycles/byte)
* memcpy(): 1024 KiB, Exec. cycles = 1751138 (1.67 cycles/byte)
* memcpy(): 2048 KiB, Exec. cycles = 3342729 (1.59 cycles/byte)
* memcpy(): 4096 KiB, Exec. cycles = 6900410 (1.65 cycles/byte)
* memcpy(): 8192 KiB, Exec. cycles = 14309351 (1.71 cycles/byte)
* memcpy(): 16384 KiB, Exec. cycles = 28623709 (1.71 cycles/byte)
* memcpy(): 32768 KiB, Exec. cycles = 57386550 (1.71 cycles/byte)
* memcpy(): 65536 KiB, Exec. cycles = 114705607 (1.71 cycles/byte)

So slowing the CPU down doesn't slow down the RAM by as much. (I hit ^C before the program completed.)

I can get 2.00 cycles per byte at 182 MHz:

Code: [Select]
* memcpy(): 1 KiB, Exec. cycles = 544 (0.53 cycles/byte)
* memcpy(): 2 KiB, Exec. cycles = 1013 (0.49 cycles/byte)
* memcpy(): 4 KiB, Exec. cycles = 1937 (0.47 cycles/byte)
* memcpy(): 8 KiB, Exec. cycles = 3859 (0.47 cycles/byte)
* memcpy(): 16 KiB, Exec. cycles = 9441 (0.58 cycles/byte)
* memcpy(): 32 KiB, Exec. cycles = 33845 (1.03 cycles/byte)
* memcpy(): 64 KiB, Exec. cycles = 70787 (1.08 cycles/byte)
* memcpy(): 128 KiB, Exec. cycles = 160051 (1.22 cycles/byte)
* memcpy(): 256 KiB, Exec. cycles = 447387 (1.71 cycles/byte)
* memcpy(): 512 KiB, Exec. cycles = 990994 (1.89 cycles/byte)
* memcpy(): 1024 KiB, Exec. cycles = 1994527 (1.90 cycles/byte)
* memcpy(): 2048 KiB, Exec. cycles = 3930281 (1.87 cycles/byte)
* memcpy(): 4096 KiB, Exec. cycles = 7860005 (1.87 cycles/byte)
* memcpy(): 8192 KiB, Exec. cycles = 16408431 (1.96 cycles/byte)
* memcpy(): 16384 KiB, Exec. cycles = 33264848 (1.98 cycles/byte)
* memcpy(): 32768 KiB, Exec. cycles = 66778139 (1.99 cycles/byte)
* memcpy(): 65536 KiB, Exec. cycles = 133847271 (1.99 cycles/byte)
* memcpy(): 131072 KiB, Exec. cycles = 267943099 (2.00 cycles/byte)
* memcpy(): 262144 KiB, Exec. cycles = 536411759 (2.00 cycles/byte)

Which works out to about 87 MB/sec at that clock speed (182 MHz / 2.00 cycles per byte), or 21 MB/sec at 37.75 MHz.
 

Online SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 15094
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #139 on: March 20, 2020, 05:26:04 pm »
Thanks! Interesting results, and they look consistent with what I get. Of course, the CPU clock relative to DRAM throughput (and latency) will influence the average number of cycles per byte.

Another interesting point is the difference (from my own tests) regarding the buffer size at which the "knee" appears. I think the FU540 also has a 32KB L1 cache? But does it have an L2 cache?
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4343
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #140 on: March 20, 2020, 05:38:54 pm »
32 KB 8-way L1, yes. And 2 MB 16-way L2 shared between the 4 cores.

Weird that the speed is actually worse when working in L2 than once it gets to RAM! I don't have an explanation for that.
 

Online SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 15094
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #141 on: March 20, 2020, 05:55:11 pm »
32 KB 8-way L1, yes. And 2 MB 16-way L2 shared between the 4 cores.

L2 cache mostly explains why you get the knee much later (meaning with bigger buffer sizes) on average.

Weird that the speed is actually worse when working in L2 than once it gets to RAM! I don't have an explanation for that.

For both points, do you have any means of disabling just the L2 cache to see what you get?
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4343
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #142 on: March 20, 2020, 06:20:25 pm »
Weird that the speed is actually worse when working in L2 than once it gets to RAM! I don't have an explanation for that.

For both points, do you have any means of disabling just the L2 cache to see what you get?

You can't disable L2 entirely. Way 0 is enabled at reset, and software can enable more ways, but once enabled they cannot be disabled except by reset.

You can mask ways from being allocated into by a particular CPU (or DMA) by setting bits in the appropriate WayMask register, but it seems each CPU must have at least one way unmasked.

See page 69 and following in https://static.dev.sifive.com/FU540-C000-v1.0.pdf
 

Online SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 15094
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #143 on: March 20, 2020, 06:33:52 pm »
Weird that the speed is actually worse when working in L2 than once it gets to RAM! I don't have an explanation for that.

For both points, do you have any means of disabling just the L2 cache to see what you get?

You can't disable L2 entirely. Way 0 is enabled at reset, and software can enable more ways, but once enabled they cannot be disabled except by reset.

You can mask ways from being allocated into by a particular CPU (or DMA) by setting bits in the appropriate WayMask register, but it seems each CPU must have at least one way unmasked.

See page 69 and following in https://static.dev.sifive.com/FU540-C000-v1.0.pdf

OK, thanks. You could at least try disabling all ways besides way 0 and see if it makes any difference...
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4343
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #144 on: March 20, 2020, 06:53:43 pm »
I think that would require running bare metal code rather than running under Debian :-)

I've never done that with this board, though it's one of the projects I've been considering playing with during funemployment after my move from SF to New Zealand at the end of the month (hopefully...).
 

Online SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 15094
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #145 on: March 20, 2020, 07:12:40 pm »
Yep, that would require going bare metal... Well, if that was already one of your projects anyway :)

 

Online SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 15094
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #146 on: April 20, 2020, 04:49:38 pm »
Just some news: I've worked on a VHDL implementation for a while. Writing a simulator first (cycle-accurate, for a pipelined core) definitely helped a lot. Some things were still tougher than others to properly "translate", but it wasn't too bad. I would have wasted a lot of time ironing out bugs if I had gone for an HDL version first. (The main reason being that it's a 5-stage pipelined core; a non-pipelined core, or one with fewer stages, would have been much easier to implement and get right.)

So, it's basically done. Right now I'm still having a small issue with the branch prediction logic, which makes the synthesizer infer block RAM for it only partially, so a lot of LUTs are wasted and speed suffers a bit as well. Currently, without branch prediction, RV32IM+Zicsr (w/ single-cycle multiply) takes about 2000 LUTs (5-stage pipeline, fully bypassed) and can run at up to ~100MHz on a Spartan-6 LX25. With branch prediction enabled, that's about 3500 LUTs, with max frequency down to ~85MHz. But I should be able to get it back to at least 100MHz once I fix the branch prediction implementation, and down to about 2100 or 2200 LUTs max.

The test project currently uses block RAM for instruction and data memory, but I intend to use DDR RAM later on, with memory caches. I've implemented a memory cache in VHDL and verified it in simulation; now I'll need to test it for real on an FPGA. But let's fix this branch prediction thing first...
 
The following users thanked this post: hamster_nz, emece67, I wanted a rude username

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4343
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #147 on: April 20, 2020, 08:19:16 pm »
Nice work!
 

Online SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 15094
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #148 on: April 20, 2020, 11:52:57 pm »
Thanks. I now understand the "issue" with the branch predictor, but I still need to find an appropriate "fix" that doesn't harm its performance.

The main reason is that it must be able to predict the PC at each cycle for the next cycle, but the prediction itself can lead to a branch instruction that will require the right prediction for the subsequent cycle... There is thus a small part of my implementation that is combinatorial, which is not that great. I need to give it some more thought.
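To show the structure of the problem (hypothetical names, modeled in C rather than VHDL): a block-RAM table has a synchronous read, so the entry addressed in cycle t only comes out in cycle t+1, yet the next lookup address depends on that very output:

Code: [Select]
#include <stdint.h>
#include <stdbool.h>

#define BTB_SETS 256

/* Toy branch-target-buffer entry (illustrative only). */
typedef struct { uint32_t target; bool valid, taken; } btb_entry;
btb_entry btb[BTB_SETS];

uint32_t next_pc(uint32_t pc)
{
    /* With synchronous-read block RAM, this value is only
       available at the NEXT clock edge...                     */
    btb_entry e = btb[(pc >> 2) % BTB_SETS];

    /* ...but the address of the next lookup is computed from
       it, and if e.target is itself a branch, its prediction
       is needed immediately: the combinatorial path I mention. */
    return (e.valid && e.taken) ? e.target : pc + 4;
}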
 

Online SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 15094
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #149 on: April 22, 2020, 04:04:08 pm »
Haven't had a lot of time to rework this, but I'm kind of stuck...

The issue basically is:
- The current PC is either the predicted PC or the "next" PC;
- The predicted PC is determined based on the previous PC;
- That kind of makes prediction not fully synchronous, which is the cause of my "issues" here.

Does anyone have an idea how to solve this? Do branch predictors actually have a 1-cycle latency in some cases, to make the whole prediction fully synchronous? (Which may be the answer here, but it would hinder performance.) Any ideas welcome.

I think the problematic case would be the one I spotted above: a branch prediction leading to a branch instruction (because to handle this case, prediction must be able to work at EVERY clock cycle) - otherwise I think the latency can be worked around. If anyone sees what I mean and has any tips...
« Last Edit: April 22, 2020, 04:12:12 pm by SiliconWizard »
 

