Author Topic: ST 32F417: Any way to make an SPI SRAM to look like normal RAM? (Read 8067 times)

peter-h · « **Reply #25 on:** April 03, 2022, 07:32:51 pm »

I've not found one.

Fuse-programming different features to create a huge range of chips is standard practice for marketing differentiation, and always has been (while causing huge sourcing problems, because you end up running around trying to find one part among 50 very slightly different ones). But if the 417 were no more than a reconfigured 437, that would be a whole new ballgame.

I asked here: https://community.st.com/s/question/0D53W00001TGJ54SAH/are-there-design-change-guides-eg-32f417-to-32f437

SiliconWizard · « **Reply #26 on:** April 03, 2022, 07:52:59 pm »

Quote from: peter-h on April 03, 2022, 06:33:33 pm

Someone else (working 1 day a week) set up the Cube IDE project ~ 3 years ago. I took it over almost full-time 1.5 years ago and had to re-learn C, plus a huge curve on the 32F417, Cube, etc.

HAL functions were used quite a lot but I replaced a lot of them. They work but sometimes they are way too slow. They are still used for the alternate pin function (AFx) setups. The AF mapping must change, due to more peripherals. I rarely use HAL funcs now - too horrible.
(...)

Yep I see. This is clearly not your ideal situation, as your project looks like a stack of ad-hoc stuff for a number of reasons, so porting it to another target, even one that is pretty close, is going to be tough.
I'd highly suggest trying to design something in a more "portable" way in the future, that's always what I try to do. But this time you are stuck with this.

Unfortunately, I do not know low-level ARM well enough to help here. I'd have no problem doing the same with RISC-V. Keep in mind that if you manage to do this, the access will be pretty slow. I know you said that would be ok, but make sure of that.

An alternative approach would be to define a few smaller intermediate buffers in main RAM, and pass those to your library functions. Then you can hand-implement some simple caching scheme to update from/to those buffers with your external RAM. That would require some hand management of memory, and would probably depend on how the libraries you use handle memory to begin with. But caching memory like this would also be more efficient than intercepting each access in an exception. Now if all libraries you use require fixed, static memory buffers of a "large" size, then it's not going to be workable this way.
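A minimal sketch of that buffering idea, assuming a single write-back line held in internal RAM. `spi_ram_read`/`spi_ram_write` are hypothetical driver hooks (backed here by a plain array so the logic can be tried on a PC), and the 256-byte line size and the `xram_get`/`xram_put` names are arbitrary choices, not anything from the project:

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define EXT_SIZE  4096u   /* simulated external RAM size (assumption) */
#define LINE_SIZE 256u    /* cache line size, arbitrary power of two */

static uint8_t ext_ram[EXT_SIZE];   /* stand-in for the SPI SRAM */

/* Hypothetical driver hooks; on target these would be SPI transfers. */
static void spi_ram_read(uint32_t addr, uint8_t *dst, uint32_t len)
{
    memcpy(dst, &ext_ram[addr], len);
}

static void spi_ram_write(uint32_t addr, const uint8_t *src, uint32_t len)
{
    memcpy(&ext_ram[addr], src, len);
}

static uint8_t  line[LINE_SIZE];    /* the one cached line, in main RAM */
static uint32_t line_base;
static bool     line_valid, line_dirty;

/* Write the cached line back if it has been modified. */
void cache_flush(void)
{
    if (line_valid && line_dirty) {
        spi_ram_write(line_base, line, LINE_SIZE);
        line_dirty = false;
    }
}

/* Return a pointer into the cached line, reloading it on a miss. */
static uint8_t *cache_ptr(uint32_t addr)
{
    uint32_t base = addr & ~(LINE_SIZE - 1u);
    if (!line_valid || base != line_base) {
        cache_flush();
        spi_ram_read(base, line, LINE_SIZE);
        line_base  = base;
        line_valid = true;
    }
    return &line[addr - base];
}

uint8_t xram_get(uint32_t addr)            { return *cache_ptr(addr); }
void    xram_put(uint32_t addr, uint8_t v) { *cache_ptr(addr) = v; line_dirty = true; }
```

This only helps if the library can be made to go through accessors like these (or be handed the line buffer directly), which is exactly the dependency on how the libraries handle memory.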

wek · « **Reply #27 on:** April 03, 2022, 08:08:54 pm »

Quote from: peter-h on April 03, 2022, 05:11:25 pm

Surprisingly, google doesn't find anything in the way of a "design change guide".

AN4547

Quote

It would take many hours just to compare the 300 page data sheets, let alone compare the RMs.

The RM is identical, RM0090.

The '42x/43x are almost perfectly upwards compatible to '40x/41x. I actually do use this combination. The only factual exception I know of, is the TEMP/VBAT arrangement in ADC, both a different MUX input and different VBAT divider. And then there's an annoyance, as FSMC is renamed to FMC, and it has a difference in some very nuanced details of its behaviour (have ranted on STM32 forum on this, am lazy to look it up at the moment).

Quote

Is there any possibility that a 417 is just a factory-configured 437?

No, they are two different dies.

JW

peter-h · « **Reply #28 on:** April 03, 2022, 08:25:21 pm »

I got a little worried about this in AN4547 but fortunately it is a typo

It should be 192k, not 128k

hans · « **Reply #29 on:** April 03, 2022, 08:51:34 pm »

AFAIK the STM32F42x has a bunch of new peripherals like LCD controller and DMA2D, but also just more of the same peripherals. So perhaps the pinmux is slightly different. Backporting a F42x project to F40x sounds tougher if you're unlucky.

Quote from: peter-h on April 03, 2022, 11:07:16 am

That's awesome work, Hans. Thank you.

Quote
Also don't know how comfortable I would be running an application with application code in HardFault_Handler though

Can it not be emulated transparently, so in effect the invalid address access generates an interrupt which emulates a real memory access?

Yes it can.. that's what the code is basically already doing. Stepping over the ptr accesses doesn't trigger any breakpoint or hint that it's handled by a HardFault handler. In principle, some hardfaults are recoverable, so perhaps that's what the ARM CPU designers had in mind when they wrote that.
The issue is that this has a very dramatic speed vs robustness trade off. If you want to add more sanity checks to the routine, then that will always cost more cycles. For example, the code I posted naively assumes that all hardfaults are the memory redirects. What if a genuine fault is generated (e.g. div by zero)? More checks would be needed to check the relevant hardfault/usage registers (and if it's not a memory redirect, then write debug trace, lock up, reset the program, etc.). More checks = bigger slowdown, even on top of what it already is.
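A rough sketch of what such a check could look like, written as a pure function so it can be tried off-target. The CFSR bit positions are the ARMv7-M ones; on the chip the arguments would come from CMSIS `SCB->CFSR` and `SCB->BFAR`. `VMEM_BASE`/`VMEM_SIZE` are a hypothetical window for the emulated RAM, not the addresses used in the code I posted:

```c
#include <stdint.h>
#include <stdbool.h>

/* Configurable Fault Status Register bits (ARMv7-M). */
#define CFSR_PRECISERR   (1u << 9)    /* precise data bus error     */
#define CFSR_IMPRECISERR (1u << 10)   /* imprecise data bus error   */
#define CFSR_BFARVALID   (1u << 15)   /* BFAR holds the fault addr  */
#define CFSR_DIVBYZERO   (1u << 25)   /* usage fault: divide by 0   */

/* Hypothetical address window reserved for the emulated RAM. */
#define VMEM_BASE 0x60000000u
#define VMEM_SIZE 0x00020000u

typedef enum { FAULT_VMEM_ACCESS, FAULT_GENUINE } fault_kind_t;

/* Treat the fault as a memory redirect only if it is a precise bus
 * fault, nothing else is flagged, and the address is in the window. */
fault_kind_t classify_fault(uint32_t cfsr, uint32_t bfar)
{
    bool precise = (cfsr & CFSR_PRECISERR) != 0 && (cfsr & CFSR_BFARVALID) != 0;
    bool other   = (cfsr & ~(CFSR_PRECISERR | CFSR_BFARVALID)) != 0;
    bool in_win  = (bfar - VMEM_BASE) < VMEM_SIZE;

    return (precise && !other && in_win) ? FAULT_VMEM_ACCESS : FAULT_GENUINE;
}
```

Anything classified as `FAULT_GENUINE` would go to the usual trace/lock-up/reset path. Even this small test is a few extra loads and compares on every emulated access, which is the slowdown I mean.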

Also the code I posted naively assumes all accesses are 32-bit wide. It sounds like that was fine, but what if it's not? Also, unaligned/aligned memory access is something to think about (Cortex M3/M4/M7 supports this, but M0+ does not, and AFAIK on RISC-V it's not mandatory). For a SPI RAM that's byte addressable it's probably fine, but the way I address the 32-bit vmem array, it won't be... Not all register indices are tested. Unfortunately, due to the way exception entry/exit works, the r0-r3 registers are in a different place than the r4-r11 registers. I hope they are in the right place, with the right arithmetic, and with the right handling of the stack push/pop.. It's all quite fiddly IMO, but I was just happy to get something working this morning to give an impression of how fast/slow it is.

In terms of "generating" an interrupt: you could use a full software-triggered interrupt for that, such as an SVCall; however, that may be in use by some RTOSes. In this case you may also be fine using a C function callback instead. After all, the hardfault context is already an interrupt..

Quote

Re the suggestions on paging, yes, I was doing that in the Z80/Z180 days, but there you had say 1MB physical RAM and you selected a 4k bank (typically the top 4k of the 64k address space) with an 8-bit bank # register. One never actually copied data. IAR implemented this in their Z180 compiler, "large model", but for code only, and you could not have a function bigger than the bank size. This cannot be done in a 32F4 because you don't have the address lines available, and if you did (because you aren't using all the pins for GPIO, ETH, USB, etc) then this is all moot. What might work ok is if you had multiple RTOS threads and every time the RTOS switches to Thread X it invokes a DMA transfer to save/restore the memory contents from the SPI RAM. You would need to hook into the RTOS...

That sounds like a complex solution: modifying the RTOS to change its behaviour for this special "task". I assume you'd swap out the application RAM data with the 'proprietary' protocol data, and then proceed to run the protocol code like normal.
Then if that task is done, swap the application data back in.
I imagine you would also need to block the application tasks while you're swapping things around, which doesn't sound trivial nor optimal.
Also the swapping is a write/read transaction for a pretty large block of memory. Say you would be (re)storing 4K of memory: that's 32768 bits to transfer twice, taking 1.64ms. Hoping also that any processor time for that time is continuous, as switching back is very costly.
A context switch on this STM chip is probably in the order of single-digit microseconds
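For anyone wanting to plug in their own numbers, the arithmetic is just bits divided by clock rate. A small sketch, ignoring command/address bytes and any gaps between bytes (so a lower bound); the function name is made up for illustration. Note that 32768 bits comes out at about 1.64 ms at a 20 MHz clock, which is an inference since the clock rate is not stated above:

```c
#include <stdint.h>

/* Time in microseconds to clock n_bytes over SPI at spi_hz.
 * Assumes one bit per clock and no protocol overhead. */
double spi_transfer_us(uint32_t n_bytes, uint32_t spi_hz)
{
    return (double)n_bytes * 8.0 * 1e6 / (double)spi_hz;
}
```

With this, `spi_transfer_us(4096, 20000000)` gives roughly 1638 us for one direction, so a save plus a restore costs about twice that, hundreds of times more than a context switch.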

peter-h · « **Reply #30 on:** April 04, 2022, 07:00:28 am »

Decoding instructions is probably not that slow once one goes about it fully. It tends to be table driven, using an index. Many emulators have been written over the years and the performance loss tends to be around 10x. That's what PHP/Java does

One can make it a lot less by coding the OS and its API calls to run natively.

I think it would work ok because it would get used only for accessing large buffers, so perhaps only one line of code out of hundreds would get the hit.

Then somebody could get more clever and do a proper VM implementation, which have existed since the 1960s, where you swap memory to disk. In this case you would have a 2 megabyte/sec disk which is far faster than the old days. In 1991 I worked on a Sun Sparcstation which had state of the art SCSI disks doing 1 megabyte/sec

hans · « **Reply #31 on:** April 04, 2022, 07:35:28 am »

Oh for sure a table driven approach is better.. then it only needs a 32 and 128 entry function pointer table to decode the opcode group/instruction type. Those 640 bytes of LUT is totally worth it for performance.
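Roughly what I have in mind for the first-level table, as a sketch only. The handler names and the `op_kind_t` result type are invented for illustration, only two of the 32 slots are filled, and the second 128-entry table (indexed by the top 7 bits) is left out:

```c
#include <stdint.h>
#include <stddef.h>

typedef enum { OP_UNHANDLED, OP_LOAD_WORD, OP_STORE_WORD } op_kind_t;
typedef op_kind_t (*op_handler_t)(uint16_t opcode);

/* Placeholder handlers; real ones would emulate the access. */
static op_kind_t h_ldr_imm(uint16_t opcode)   { (void)opcode; return OP_LOAD_WORD; }
static op_kind_t h_str_imm(uint16_t opcode)   { (void)opcode; return OP_STORE_WORD; }
static op_kind_t h_unhandled(uint16_t opcode) { (void)opcode; return OP_UNHANDLED; }

/* Table indexed by the top 5 bits of a 16-bit Thumb opcode.
 * Unlisted slots are NULL and fall through to h_unhandled. */
static const op_handler_t opc5_table[32] = {
    [0x0C] = h_str_imm,   /* 0b01100: STR Rt, [Rn, #imm5*4] */
    [0x0D] = h_ldr_imm,   /* 0b01101: LDR Rt, [Rn, #imm5*4] */
};

op_kind_t dispatch(uint16_t opcode)
{
    op_handler_t h = opc5_table[opcode >> 11];
    return h != NULL ? h(opcode) : h_unhandled(opcode);
}
```

One shift and one indirect call replaces the chain of compares, so adding more instruction types does not lengthen the common path.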

The trouble I was talking about, was 32-bit instructions like conditional execution, post inc/dec (then you also need to read and modify the address register/immediate), etc.

I think in particular conditional execution is harder, since in ARM and Thumb instructions it's handled differently. AFAIK 16-bit Thumb uses the If-Then-Else instruction to encode conditionality of upcoming opcodes. I think the ARM instruction set uses conditional flags at the front of the instruction, which would be easier to decode on the fly. In other words, in Thumb mode the CPU has state and I don't know if the HardFault handler can access that state easily (I suppose it needs to be able to, otherwise context switching is hard).

Nonetheless, more fiddly-ness...

peter-h · « **Reply #32 on:** April 04, 2022, 08:12:41 am »

Isn't STM 32F all thumb?

hans · « **Reply #33 on:** April 04, 2022, 02:49:44 pm »

Yes.. ARMv7 uses Thumb2.. The first 5 bits of the opcode indicate whether it's a 16- or 32-bit instruction.
Older ARM MCUs like ARM7TDMI used an older Thumb instruction set, alongside ARM instructions, which required interworking code to switch.
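For reference, that length check is tiny. In Thumb-2 the first halfword starts a 32-bit instruction when its top five bits are 0b11101, 0b11110 or 0b11111; everything else is a 16-bit instruction. The function name below is just for illustration:

```c
#include <stdint.h>
#include <stdbool.h>

/* True if this halfword is the first half of a 32-bit Thumb-2
 * instruction, false if it is a complete 16-bit instruction. */
bool thumb_is_32bit(uint16_t first_halfword)
{
    uint16_t top5 = first_halfword >> 11;
    return top5 == 0x1D || top5 == 0x1E || top5 == 0x1F;
}
```

A fault handler needs this to know whether to advance the stacked PC by 2 or by 4 after emulating the access.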

Nonetheless, conditional execution may require some attention. Maybe some of the opcode decoding can be optimized with some kind of Just-In-Time "emulation", where instruction decode results are cached in a hashmap (with PC as key), so previously seen instructions only need to be analyzed once... but that also requires (a lot) more RAM to do so.

abyrvalg · « **Reply #34 on:** April 04, 2022, 11:50:23 pm »

Conditional execution shouldn’t be a problem: if an instruction triggers a fault, the condition was true. But there are many other kinds of fun: LDRD/STRD, LDM/STM with 4 modes of pre/post increment/decrement and reglists, and the whole zoo of DSP/FPU instructions loading/storing their own sets of registers.
Writing code to handle all this, only to get a solution orders of magnitude slower than normal, instead of adapting the code to use SPI RAM directly looks like masochism. If the OP needs help implementing this mess anyway, why not just ask for help adapting the library instead?

peter-h · « **Reply #35 on:** April 05, 2022, 09:58:26 am »

I don't think it is quite as bad, because you need to support only the instructions which are going to be used to access this memory area, and all other fault reasons can be trapped collectively (and should not happen in a running system, because if they do, it crashes / the watchdog trips).

In many/most situations where somebody wants to do this, the performance hit would be small. Admittedly if somebody does a strstr() on the said memory area, and ends up searching 100k bytes, that isn't so good. But you can check the source if it does that.

abyrvalg · « **Reply #36 on:** April 05, 2022, 11:50:46 am »

I haven’t seen much DSP/FPU Cortex-M code in the wild, but integer code actively uses all flavors of load/store (e.g. just a single optimized memcpy/memset does the bulk part with STMIA and finishes the tail with STR, STRH, STRB). Are you going to do some code coverage analysis to find out the exact instructions accessing your virtual buffer? What if that changes after the next compilation? Or after a compiler upgrade?

What’s the typical pattern of calling the library that needs more memory? Implementing an overlay can be trivial in cases like calling a few top level functions of mem-hungry lib - just wrap those functions and add swapping to/from external RAM before/after the real call.
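A sketch of such a wrapper, under the assumption that the library can work out of one fixed internal buffer. `lib_process` is a made-up stand-in for a top-level library call, and the external RAM is simulated with an array where the real thing would use SPI transfers:

```c
#include <stdint.h>
#include <string.h>

#define WORK_SIZE 1024u

static uint8_t work_buf[WORK_SIZE];     /* internal RAM shared by app and lib */
static uint8_t ext_ram[2u * WORK_SIZE]; /* stand-in for the SPI SRAM */

#define EXT_APP_SLOT 0u          /* where the app data is parked */
#define EXT_LIB_SLOT WORK_SIZE   /* where the lib state lives between calls */

/* Hypothetical library entry point: bumps a counter in its state. */
static int lib_process(uint8_t *buf, uint32_t len)
{
    buf[0]++;
    return (int)len;
}

/* Overlay wrapper: swap before the real call, swap back after it. */
int lib_process_wrapped(void)
{
    memcpy(&ext_ram[EXT_APP_SLOT], work_buf, WORK_SIZE);  /* app data out  */
    memcpy(work_buf, &ext_ram[EXT_LIB_SLOT], WORK_SIZE);  /* lib state in  */

    int rc = lib_process(work_buf, WORK_SIZE);

    memcpy(&ext_ram[EXT_LIB_SLOT], work_buf, WORK_SIZE);  /* lib state out */
    memcpy(work_buf, &ext_ram[EXT_APP_SLOT], WORK_SIZE);  /* app data back */
    return rc;
}
```

The application must not touch `work_buf` from another thread or ISR while the wrapper runs, so in an RTOS this would need a mutex or similar around it.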

bson · « **Reply #37 on:** April 06, 2022, 02:12:44 am »

Quote from: hans on April 03, 2022, 10:28:44 am

I was interested to see how hard it is implement to a rudimentary virtual memory trap... Good news is my CPU pipeline worries of yesterday were unfounded.

I had to disable write buffering (only works on Cortex-M3/M4 by the way - the M7 has a different store mechanism). This inevitably slows down all store instructions, as they have become synchronous. This however did resolve the IMPRECISERR to PRECISERR busfaults, which occurred on stores.

Nice!

Try my suggestion as well: if you get an IMPRECISERR busfault, disable buffering and return from the exception. This will execute the faulting instruction again, this time without buffering. Then proceed as normal and reenable buffering when done.

ataradov · « **Reply #38 on:** April 06, 2022, 02:18:14 am »

Quote from: bson on April 06, 2022, 02:12:44 am

Try my suggestion as well: if you get an IMPRECISERR busfault, disable buffering and return from the exception. This will execute the faulting instruction again, this time without buffering. Then proceed as normal and reenable buffering when done.

You can't do this. Once you get an imprecise exception, the only valid way to recover from it is a reset.

The whole point of it being imprecise is that your code runs past the faulting place, sometimes quite a few instructions. In case of cache issues, it may be a lot of instructions. Imprecise exception is not even necessarily caused by an instruction.

hans · « **Reply #39 on:** April 06, 2022, 08:25:40 am »

I agree. The PC points to instructions that are 2 ahead in my observations, but obviously it depends on the pipeline and if any other stalls occur. If there is a hazard like a write-before-write (instead of after, since PC has incremented already), then we would reexecute the same opcode with different register contents.
For example:

Code: [Select]

STR R3, [R7]      @ R7 = virtual memory address
ADD R3, R3, #1    @ increment R3
LSL R3, R3, #1    @ R3 << 1
NOP               @ ..etc..

If the STR triggers an IMPRECISERR, then the PC may have advanced by an arbitrary amount. We could disable buffering and retry, but in the meantime the contents of R3 have changed.

@abyrvalg: Yes you're right, I was thinking too much from a RTL design again, and not the software perspective. It doesn't make sense that the trap would get triggered if the conditional code says skip execution.

In terms of DSP code: modern GCC with O3 is not incapable of producing DSP/SIMD code by itself. I've seen it happen with certain algorithms where it would eventually resort to vector instructions. It can sometimes also do that for memcpy/memset procedures. I was very surprised..

In terms of code profiling to support only the right instructions.. that sounds like a very dangerous path to walk. I think all types of load/store instructions should be handled for robustness. In the proof-of-concept, the 32-bit word LDR/STR are checked using:

Code: [Select]

        bool isLDR16 = (opc7 == 0b0101100 || opc5 == 0b01101);
        bool isSTR16 = (opc7 == 0b0101000 || opc5 == 0b01100);

These are 16-bit Thumb opcodes for LDR/STR 32-bit word with register and immediate address offsets, respectively.
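To show what comes after the match, here is a sketch that also pulls the register and offset fields out of those four encodings. The struct and function names are illustrative and this is not the proof-of-concept code itself; the field positions are the 16-bit Thumb T1 encodings (Rt in bits 2:0, Rn in 5:3, Rm or imm5 from bit 6 up):

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool    valid;       /* one of the four word LDR/STR encodings */
    bool    is_load;     /* LDR if true, STR if false */
    bool    reg_offset;  /* address is Rn + Rm, else Rn + imm */
    uint8_t rt, rn, rm;  /* register numbers r0..r7 */
    uint8_t imm;         /* byte offset (imm5 * 4), immediate form only */
} ldst16_t;

ldst16_t decode_ldst16_word(uint16_t opc)
{
    ldst16_t d = {0};
    uint16_t opc7 = opc >> 9;    /* top 7 bits */
    uint16_t opc5 = opc >> 11;   /* top 5 bits */

    d.rt = opc & 7u;
    d.rn = (opc >> 3) & 7u;

    if (opc7 == 0x2C || opc7 == 0x28) {          /* 0b0101100 / 0b0101000 */
        d.valid      = true;
        d.reg_offset = true;
        d.is_load    = (opc7 == 0x2C);
        d.rm         = (opc >> 6) & 7u;
    } else if (opc5 == 0x0D || opc5 == 0x0C) {   /* 0b01101 / 0b01100 */
        d.valid   = true;
        d.is_load = (opc5 == 0x0D);
        d.imm     = (uint8_t)(((opc >> 6) & 0x1Fu) << 2);
    }
    return d;
}
```

For example `0x6848` is `LDR r0, [r1, #4]` and `0x603B` is `STR r3, [r7]`. Since only r0-r7 can appear here, the handler still has to map those numbers onto the two different stack locations mentioned earlier.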

The halfword, byte and signed variants are easily supported by looking up the other opcode groups. If the opcode handling is implemented using a LUT, then that should all be possible with minimal extra run-time.
I may actually look into implementing that later this week. Sounds like a fun addition. I'm personally very interested to see how far such a software implementation can be pushed, even if it's only out of academic interest (and less for a practical high-performance one).

wek · « **Reply #40 on:** April 06, 2022, 01:19:58 pm »

Quote from: hans on April 06, 2022, 08:25:40 am

@abyrvalg: Yes you're right, I was thinking too much from a RTL design again, and not the software perspective. It doesn't make sense that the trap would get triggered if the conditional code says skip execution.

But if it does not skip, you still would have to manually advance the if-then status bits. And it would be exceptional fun, given they are saved in a discontiguous set of bits in EPSR.
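To illustrate the bit shuffling, a sketch operating on the stacked xPSR value. In ARMv7-M, ITSTATE[1:0] sits in xPSR[26:25] and ITSTATE[7:2] in xPSR[15:10]; the advance step follows the architecture's ITAdvance() pseudocode. The function names are mine:

```c
#include <stdint.h>

/* Gather the 8-bit ITSTATE from its two fields in xPSR. */
uint8_t it_get(uint32_t xpsr)
{
    return (uint8_t)((((xpsr >> 10) & 0x3Fu) << 2) | ((xpsr >> 25) & 0x3u));
}

/* Scatter an 8-bit ITSTATE back into xPSR, leaving other bits alone. */
uint32_t it_set(uint32_t xpsr, uint8_t it)
{
    xpsr &= ~((0x3Fu << 10) | (0x3u << 25));
    xpsr |= (((uint32_t)it >> 2) & 0x3Fu) << 10;
    xpsr |= ((uint32_t)it & 0x3u) << 25;
    return xpsr;
}

/* Step ITSTATE past one instruction of the IT block. */
uint8_t it_advance(uint8_t it)
{
    if ((it & 0x07u) == 0) {
        return 0;                       /* that was the last instruction */
    }
    return (uint8_t)((it & 0xE0u) | ((it << 1) & 0x1Fu));
}
```

An emulating handler would do `stacked_xpsr = it_set(stacked_xpsr, it_advance(it_get(stacked_xpsr)))` before returning, on top of bumping the stacked PC.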

The pre/post-incremented address registers would be loads of fun, too.

And then there are the unprivileged loads/stores, the exclusives, double and multiple (already mentioned); and if one would want to be extra pedantic/complete, then also stacks/unstacks and code/vector fetches. I may have forgotten some of the more perverse ones.

The way ARM envisaged this whole thing is that in the exception you are supposed to remove the *cause* of the exception, and then on return the processor automatically replays the (now no longer offending) instruction.

JW

peter-h · « **Reply #41 on:** April 06, 2022, 03:36:44 pm »

Getting back to basics, how is virtual memory implemented on say 80386+ ? Don't they set up an address trap?

nctnico · « **Reply #42 on:** April 06, 2022, 04:06:56 pm »

Quote from: peter-h on April 06, 2022, 03:36:44 pm

Getting back to basics, how is virtual memory implemented on say 80386+ ? Don't they set up an address trap?

They do but the result is that the virtual memory area gets mapped to real memory. The 32F417 seems to have an MPU that may be used to implement something similar.

abyrvalg · « **Reply #43 on:** April 06, 2022, 04:35:49 pm »

Checked F417's MPU already - no fun, it doesn't support remapping, only trapping physical address accesses.
The idea of 80386's & friends VM is to have different virtual addresses backed by same physical addresses (trap access to non-mapped VA, swap some other physical page content to disk, detach that page from old VA, attach to new VA, resume the sw and let it access new VA).

nctnico · « **Reply #44 on:** April 06, 2022, 05:04:38 pm »

Quote from: abyrvalg on April 06, 2022, 04:35:49 pm

Checked F417's MPU already - no fun, it doesn't support remapping, only trapping physical address accesses.

Yes, but I would theorise that you can replace the contents of the physical address space with something else before exiting from the trap. I think this should allow one piece of memory to be used for multiple purposes, where an OS takes care of setting up the MPU depending on the task it is executing. It will take a large amount of work to implement though.

Basically:
- task switch -> protect memory area
- trap -> swap memory contents, unprotect and continue
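The two steps can be modelled on a PC to check the bookkeeping. This is a simulation only: the `win_protected` flag stands in for an MPU no-access region, `window_access` stands in for the trapped instruction being replayed, and the per-task backing arrays stand in for the SPI RAM. All names are hypothetical:

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define WIN_SIZE  512u
#define NUM_TASKS 2u

static uint8_t  window[WIN_SIZE];              /* shared internal RAM region */
static uint8_t  backing[NUM_TASKS][WIN_SIZE];  /* stand-in for SPI RAM copies */
static unsigned owner;                         /* task whose data is in window */
static bool     win_protected;                 /* models the MPU region state */

/* Step 1: on every task switch, protect the memory area. */
void on_task_switch(unsigned next_task)
{
    (void)next_task;
    win_protected = true;
}

/* Step 2: on the trap, swap contents if needed, unprotect, continue. */
void on_mpu_trap(unsigned current_task)
{
    if (owner != current_task) {
        memcpy(backing[owner], window, WIN_SIZE);         /* old owner out */
        memcpy(window, backing[current_task], WIN_SIZE);  /* new owner in  */
        owner = current_task;
    }
    win_protected = false;
}

/* Simulated access: a protected window traps first, then proceeds. */
uint8_t *window_access(unsigned current_task, uint32_t offset)
{
    if (win_protected) {
        on_mpu_trap(current_task);
    }
    return &window[offset];
}
```

The swap cost is only paid by a task that actually touches the window, and only when the previous user was a different task.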

peter-h · « **Reply #45 on:** April 06, 2022, 08:47:46 pm »

I don't think there is any way on the 32F417 to access memory which does not physically exist on the chip.

But I never examined that part of it.

nctnico · « **Reply #46 on:** April 06, 2022, 08:50:54 pm »

Quote from: peter-h on April 06, 2022, 08:47:46 pm

I don't think there is any way on the 32F417 to access memory which does not physically exist on the chip.

True. But you can reserve a piece of memory for use by multiple processes and swap it out when accessed using the MPU. But it requires some form of management because it must be very clear which process 'owns' the memory at any given time. It could be an interesting way to have a better simultaneous usage of memory in a microcontroller with the SPI ram as an overflow to still have enough memory for the worst case situation.

peter-h · « **Reply #47 on:** April 06, 2022, 09:12:31 pm »

OK; that was my drift when I talked about hooking into the RTOS so that, when it switches to a thread, you copy a load of data from the SPI RAM to some RAM buffer, and then write it back when it switches away from that thread.

But as someone pointed out, this is not exactly quick, unless you are switching tasks relatively slowly. It takes "ages" to move say 64k via 21MHz SPI.

I have used this method with a simple RTOS I wrote, with bank switching, which is "instant" because it is just a value in a register. Each of about 10 threads had its own 32k RAM, on a Z80-family chip.

peter-h · « **Reply #48 on:** May 06, 2022, 10:15:03 am »

The funny thing is that I have just done a little PCB for another job relating to this project and was going to put an SPI SRAM chip on it, to save birdsnesting it later on.

Except... there aren't any :-) Well there are tiny stocks of 128kbyte ones e.g. https://uk.farnell.com/on-semiconductor/n01s830bat22i/serial-sram-1mb-spi-tssop-8/dp/2627903. Nothing much bigger. Either they are used in huge numbers and are hoarded like so much else (which I doubt, because very few CPUs can use them directly) or they are going obsolete.

woofy · « **Reply #49 on:** May 06, 2022, 11:10:39 am »

Mouser has stocks of 4 Mbit chips.
https://www.mouser.co.uk/ProductDetail/ISSI/IS62WVS5128FBLL-20NLI-TR?qs=byeeYqUIh0MVWOLDS41N2g%3D%3D

