I was interested to see how hard it is to implement a rudimentary virtual memory trap... Good news is my CPU pipeline worries of yesterday were unfounded.
I had to disable write buffering (only works on Cortex-M3/M4 by the way; the M7 has a different store mechanism). This inevitably slows down all store instructions, since they become synchronous. It did, however, turn the IMPRECISERR bus faults that occurred on stores into PRECISERR faults.
Attached vmem.c and main.c
The benchmark code (GCC -O3):
#include "stm32f4xx.h" // CMSIS device header (DWT, CoreDebug)

uint32_t variable = 0;

uint32_t bm_mem(volatile uint32_t* ptr) {
    uint32_t cyc0_vmem = DWT->CYCCNT;
    uint32_t cnt = 0;
    for (auto i = 0; i < 100000; i++) {
        *ptr = cnt * 2;   // store (trapped when ptr is in the vmem region)
        cnt = *ptr + 1;   // load
    }
    uint32_t cyc1_vmem = DWT->CYCCNT;
    return cyc1_vmem - cyc0_vmem;
}

int main(void) {
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk; // trace must be on for CYCCNT to count
    DWT->CTRL |= 1; // enable CYCCNT
    volatile uint32_t* ACTLR = reinterpret_cast<volatile uint32_t*>(0xE000E008);
    *ACTLR |= 1 << 1; // set DISDEFWBUF, https://interrupt.memfault.com/blog/cortex-m-fault-debug
    volatile uint32_t* ptr_vmem = reinterpret_cast<volatile uint32_t*>(0x30000000);
    volatile uint32_t* ptr_sp   = reinterpret_cast<volatile uint32_t*>(&variable);
    volatile uint32_t t_vmem = bm_mem(ptr_vmem);
    volatile uint32_t t_sp   = bm_mem(ptr_sp);
    for (;;);
}
It seems to work.. but I only tested the rudimentary 16-bit ldr/str instructions with immediate offsets (and admittedly only for offset 0, though BFAR should give the correct address either way), and only on r3 (not sure if r4-r7 work; they aren't part of the stacked exception frame). The opcode for register offset is also in there, but untested. No post-increment/decrement either... (I think that's a 32-bit Thumb opcode.)
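For reference, the core decode step looks roughly like this (a simplified sketch, not the attached vmem.c verbatim; vmem_read32/vmem_write32 are stand-ins for the SPI-backed store, and it assumes the faulting code runs on MSP):

#include "stm32f4xx.h" // CMSIS: SCB; frame layout per ARMv7-M exception entry

struct ExcFrame { uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr; };

extern uint32_t vmem_read32(uint32_t addr);              // stand-in for the backing store
extern void     vmem_write32(uint32_t addr, uint32_t v); // stand-in for the backing store

extern "C" __attribute__((naked)) void HardFault_Handler(void) {
    __asm volatile("mrs r0, msp \n"   // pass the exception frame pointer
                   "b   HardFault_C");
}

extern "C" void HardFault_C(ExcFrame* f) {
    uint16_t op   = *reinterpret_cast<uint16_t*>(f->pc); // the faulting opcode
    uint32_t addr = SCB->BFAR;       // precise thanks to DISDEFWBUF
    uint32_t rt   = op & 0x7;        // Rt field, bits [2:0]
    if (rt > 3)                      // r4-r7 aren't in the stacked frame
        for (;;);
    uint32_t* reg = &f->r0 + rt;     // r0..r3 sit at the top of the frame
    switch (op >> 11) {
        case 0b01100: vmem_write32(addr, *reg); break; // STR Rt, [Rn, #imm5*4]
        case 0b01101: *reg = vmem_read32(addr); break; // LDR Rt, [Rn, #imm5*4]
        default:      for (;;);                        // unhandled encoding
    }
    f->pc += 2;                      // skip the emulated 16-bit instruction
}

(GCC saves/restores lr across HardFault_C, so the EXC_RETURN value left by the trampoline still triggers the exception return at the end.)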
In terms of performance (this is without actual external memory yet), 100k load/stores:
t_sp = 1400003 (14 cycles/iteration)
t_vmem = 31500003 (315 cycles/iteration)
t_sp decreases to 12 cycles/iteration if I run it before bm_mem(ptr_vmem);
So it's about 26-27x slower (incl. the toy calculations) for this 16-bit Thumb opcode. It would probably be slower still if 32-bit Thumb opcodes were also handled, since those require more decoding and operations.
This example alone is IMO not too bad.. (168 MHz / 27 ≈ 6.2 MHz effective, and if the use case is really sporadic, then it's OK). The problem is that an actual SPI access will probably kill whatever speed is left.
E.g. 14 cycles for 1R+1W take 83 ns @ 168 MHz. Blow that up to 315 cycles (1875 ns) plus 2x SPI 40-bit transactions @ 40 MHz (2000-3000 ns).. and a slowdown on the order of 100x doesn't seem unrealistic. The question is whether that is *good enough*. The code I posted is only a proof of concept..
I also don't know how comfortable I'd be running application code inside HardFault_Handler, though.
(For a more permanent solution, the F446 has a QSPI memory interface with memory-mapped mode, but unfortunately no integrated Ethernet.)
edit: Had a thought about speed..
Normally a memory load/store should take 1 cycle. The benchmark does two per iteration, so the 12-cycle case leaves 10 cycles of overhead for the loop and toy calculation. At 4 bytes/cycle and 168 MHz, that's 672 MB/s of theoretical throughput.
The 315 cycles for the memory trap routine then mean an overhead of almost 150 cycles per load/store instruction. So by that measure it's already ~150x slower (about 1 us @ 168 MHz).
If a single SPI memory access then takes 1 us on top (40 bits @ 40 MHz), that would roughly double the access time.. making it approx 320x slower than regular SRAM, or about 2.1 MB/s.
The FSMC can easily reach over 100 MB/s (16-bit @ 60 MHz = 120 MB/s). QuadSPI manages a few dozen MB/s (4-bit SDR @ 50 MHz = 25 MB/s). So in that respect QuadSPI is already quite a big slowdown..
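To put the arithmetic above in one place (same assumptions as before: 168 MHz core, 1 cycle per native access, ~150 cycles trap overhead, ~1 us per SPI access):

// Back-of-envelope throughput figures, all in bytes/s.
constexpr double f_cpu   = 168e6;                          // core clock [Hz]
constexpr double native  = 4.0 * f_cpu;                    // 1 cycle/word      -> 672 MB/s
constexpr double spi_cyc = 1e-6 * f_cpu;                   // 1 us SPI access   -> ~168 cycles
constexpr double vmem    = 4.0 * f_cpu / (150 + spi_cyc);  // trap + SPI        -> ~2.1 MB/s
constexpr double fsmc    = 2.0 * 60e6;                     // 16 bits @ 60 MHz  -> 120 MB/s
constexpr double quadspi = 0.5 * 50e6;                     // 4 bits @ 50 MHz   -> 25 MB/s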