I was interested to see how hard it is to implement a rudimentary virtual memory trap... Good news is my CPU pipeline worries of yesterday were unfounded.
I had to disable write buffering (only works on Cortex-M3/M4 by the way; the M7 has a different store mechanism). This inevitably slows down all store instructions, since they become synchronous. It did, however, turn the IMPRECISERR bus faults that occurred on stores into PRECISERR faults.
Attached vmem.c and main.c
The benchmark code (GCC -O3):
#include "stm32f4xx.h" // CMSIS device header (DWT, CoreDebug)

uint32_t variable = 0;

uint32_t bm_mem(volatile uint32_t* ptr) {
    uint32_t cyc0_vmem = DWT->CYCCNT;
    uint32_t cnt = 0;
    for (auto i = 0; i < 100000; i++) {
        *ptr = cnt * 2;   // store (trapped when ptr is in the vmem region)
        cnt = *ptr + 1;   // load
    }
    uint32_t cyc1_vmem = DWT->CYCCNT;
    return cyc1_vmem - cyc0_vmem;
}

int main(void) {
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk; // trace must be on for CYCCNT to count
    DWT->CTRL |= 1; // enable CYCCNT
    volatile uint32_t* ACTLR = reinterpret_cast<volatile uint32_t*>(0xE000E008);
    *ACTLR |= 1 << 1; // set DISDEFWBUF, https://interrupt.memfault.com/blog/cortex-m-fault-debug
    volatile uint32_t* ptr_vmem = reinterpret_cast<volatile uint32_t*>(0x30000000);
    volatile uint32_t* ptr_sp   = reinterpret_cast<volatile uint32_t*>(&variable);
    volatile uint32_t t_vmem = bm_mem(ptr_vmem);
    volatile uint32_t t_sp   = bm_mem(ptr_sp);
    for (;;);
}
It seems to work.. but I only tested the rudimentary 16-bit ldr/str instructions with immediate offsets (and admittedly only for offset 0, though BFAR should give the correct address either way), and only on r3 (not sure if r4-r7 work; they aren't part of the stacked exception frame). The opcode for register offset is also in there, but untested. No post-increment/decrement either... (I think that's a 32-bit Thumb opcode.)
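For reference, the core decode step looks roughly like this (a simplified sketch, not the attached vmem.c verbatim; vmem_read32/vmem_write32 are stand-ins for the SPI-backed store, and it assumes the faulting code runs on MSP):

#include "stm32f4xx.h" // CMSIS: SCB; frame layout per ARMv7-M exception entry

struct ExcFrame { uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr; };

extern uint32_t vmem_read32(uint32_t addr);              // stand-in for the backing store
extern void     vmem_write32(uint32_t addr, uint32_t v); // stand-in for the backing store

extern "C" __attribute__((naked)) void HardFault_Handler(void) {
    __asm volatile("mrs r0, msp \n"   // pass the exception frame pointer
                   "b   HardFault_C");
}

extern "C" void HardFault_C(ExcFrame* f) {
    uint16_t op   = *reinterpret_cast<uint16_t*>(f->pc); // the faulting opcode
    uint32_t addr = SCB->BFAR;       // precise thanks to DISDEFWBUF
    uint32_t rt   = op & 0x7;        // Rt field, bits [2:0]
    if (rt > 3)                      // r4-r7 aren't in the stacked frame
        for (;;);
    uint32_t* reg = &f->r0 + rt;     // r0..r3 sit at the top of the frame
    switch (op >> 11) {
        case 0b01100: vmem_write32(addr, *reg); break; // STR Rt, [Rn, #imm5*4]
        case 0b01101: *reg = vmem_read32(addr); break; // LDR Rt, [Rn, #imm5*4]
        default:      for (;;);                        // unhandled encoding
    }
    f->pc += 2;                      // skip the emulated 16-bit instruction
}

(GCC saves/restores lr across HardFault_C, so the EXC_RETURN value left by the trampoline still triggers the exception return at the end.)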
In terms of performance (this is without actual external memory yet), 100k load/stores:
t_sp = 1400003 (14 cycles/iteration)
t_vmem = 31500003 (315 cycles/iteration)
t_sp decreases to 12 cycles/iteration if I run it before bm_mem(ptr_vmem);
So it's about 26-27x slower (incl. the toy calculations) for this 16-bit Thumb opcode. It would probably be slower still if 32-bit Thumb opcodes were also handled, since those require more decoding and operations.
This example alone is IMO not too bad.. (168 MHz / 27 ≈ 6.2 MHz effective, and if the use case is really sporadic, then it's OK). The problem is that an actual SPI access will probably kill whatever speed is left.
E.g. 14 cycles for 1R+1W take 83 ns @ 168 MHz. Blow that up to 315 cycles (1875 ns) plus 2x SPI 40-bit transactions @ 40 MHz (2000-3000 ns).. and a slowdown on the order of 100x doesn't seem unrealistic. The question is whether that is *good enough*. The code I posted is only a proof of concept..
I also don't know how comfortable I'd be running application code inside HardFault_Handler, though.
(For a more permanent solution, the F446 has a QSPI memory interface with memory-mapped mode, but unfortunately no integrated Ethernet.)
edit: Had a thought about speed..
Normally a memory load/store should take 1 cycle. The benchmark does two per iteration, so the 12-cycle case leaves 10 cycles of overhead for the loop and toy calculation. At 4 bytes/cycle and 168 MHz, that's 672 MB/s of theoretical throughput.
The 315 cycles for the memory trap routine then mean an overhead of almost 150 cycles per load/store instruction. So by that measure it's already ~150x slower (about 1 us @ 168 MHz).
If a single SPI memory access then takes 1 us on top (40 bits @ 40 MHz), that would roughly double the access time.. making it approx 320x slower than regular SRAM, or about 2.1 MB/s.
The FSMC can easily reach over 100 MB/s (16-bit @ 60 MHz = 120 MB/s). QuadSPI manages a few dozen MB/s (4-bit SDR @ 50 MHz = 25 MB/s). So in that respect QuadSPI is already quite a big slowdown..
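To put the arithmetic above in one place (same assumptions as before: 168 MHz core, 1 cycle per native access, ~150 cycles trap overhead, ~1 us per SPI access):

// Back-of-envelope throughput figures, all in bytes/s.
constexpr double f_cpu   = 168e6;                          // core clock [Hz]
constexpr double native  = 4.0 * f_cpu;                    // 1 cycle/word      -> 672 MB/s
constexpr double spi_cyc = 1e-6 * f_cpu;                   // 1 us SPI access   -> ~168 cycles
constexpr double vmem    = 4.0 * f_cpu / (150 + spi_cyc);  // trap + SPI        -> ~2.1 MB/s
constexpr double fsmc    = 2.0 * 60e6;                     // 16 bits @ 60 MHz  -> 120 MB/s
constexpr double quadspi = 0.5 * 50e6;                     // 4 bits @ 50 MHz   -> 25 MB/s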