Author Topic: 32F417 SPI running at one third the speed it should (Read 9736 times)

peter-h · « **on:** January 14, 2022, 05:00:54 pm »

I posted about this here
https://community.st.com/s/question/0D53W00001J9VXrSAN/32f417-spi-running-at-half-the-speed-it-should?t=1641986845261
but that's a very high throughput forum which almost nobody reads. One guy there spent some time on identifying what might be the explanation but has not offered a solution. I am posting it here in the hope that someone might have come across this before.

Basically the code is using polling and is stuffing bytes into an SPI master "uart" and receiving what comes back. The SPI is running at 21MHz so the limit is just over 2 megabytes/sec. It appears that even a 32F4 running at 168MHz can't cope with this!

This is the code

Code: [Select]

 
    while ((hspi->TxXferCount > 0U) || (hspi->RxXferCount > 0U))
    {
      /* Check TXE flag */
      if ((__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_TXE)) && (hspi->TxXferCount > 0U) && (txallowed == 1U))
      {
        *(__IO uint8_t *)&hspi->Instance->DR = (*hspi->pTxBuffPtr);
        hspi->pTxBuffPtr++;
        hspi->TxXferCount--;
        /* Next Data is a reception (Rx). Tx not allowed */
        txallowed = 0U;
      }
 
      /* Wait until RXNE flag is reset */
      if ((__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_RXNE)) && (hspi->RxXferCount > 0U))
      {
        (*(uint8_t *)hspi->pRxBuffPtr) = hspi->Instance->DR;
        hspi->pRxBuffPtr++;
        hspi->RxXferCount--;
        /* Next Data is a Transmission (Tx). Tx is allowed */
        txallowed = 1U;
      }
    }

The issue seems to have two parts:

- The existing code (from an ST library) is a bit dumb in that it blocks loading a TX byte unless an RX byte has been retrieved. This results in the double-buffered TX being allowed to run out of data, and with SPI if you aren't sending stuff out you won't be getting stuff back because it is the action of sending out that generates the clock. This seems reasonable as a quick hack, but it ignores the fact that you can load two bytes into a TX; one propagates through to the shift register and the other ends up in the TX buffer. Whereas the RX, while having the same buffering technically, is less usable because once you detect there is data available you don't know how much time you have to get that byte out.

- The SPI runs with its own clock which is much slower than the 168MHz of the CPU. It is the APB clock, which in my case is 42MHz (and I can't change that for various reasons; there is an 84MHz option but it uses another SPI channel which I can't access). This means that say a read of the "tx buffer empty" bit actually takes a lot more than the ~7ns CPU instruction speed. It's a stupid design where they use an ARM core and "asynchronously" attached a pile of peripherals to it, so there are various oddball delays, some requiring multiple APB clocks to prevent metastability.

The answer should be DMA but that is quite tricky to get working.

I am sure it can be done with polling. Many years ago I did a floppy disk controller with a Z80 running at 2MHz and it had the same issue, which was solved by a cunning loop structure. In this case I suspect a similar trick might work, whereby the TX channel is kept full, but the RX channel is checked often enough.
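A host-side toy model may make the "keep TX one frame ahead" idea easier to see. This is only a sketch under loud assumptions: `sim_spi_t`, `sim_tick` and `sim_transfer` are made-up names, the model simply loops MOSI back to MISO, and it pretends the CPU polls exactly once per frame time, which the real loop does not. It is not the ST code and not a model of the real peripheral timing.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy loopback model, NOT the real peripheral: one TX buffer, one shift
 * register, one RX buffer. Each sim_tick() advances by one frame time. */
typedef struct {
    bool tx_full, shifting, rx_full, overrun;
    uint8_t tx_buf, shift, rx_buf;
} sim_spi_t;

static void sim_tick(sim_spi_t *s)
{
    if (s->shifting) {                     /* frame done: MOSI looped to MISO */
        if (s->rx_full) s->overrun = true; /* previous byte was never read */
        s->rx_buf = s->shift;
        s->rx_full = true;
        s->shifting = false;
    }
    if (s->tx_full) {                      /* TX buffer drains into shifter */
        s->shift = s->tx_buf;
        s->tx_full = false;
        s->shifting = true;
    }
}

/* Full-duplex polled transfer; returns the number of frame times used.
 * max_in_flight = 1 mimics the txallowed ping-pong, 2 lets TX run one ahead. */
static unsigned sim_transfer(sim_spi_t *s, const uint8_t *tx, uint8_t *rx,
                             size_t len, size_t max_in_flight)
{
    size_t sent = 0, recv = 0;
    unsigned frames = 0;
    while (recv < len) {
        if (s->rx_full) {                  /* "RXNE": drain RX first */
            rx[recv++] = s->rx_buf;
            s->rx_full = false;
        }
        if (sent < len && !s->tx_full && (sent - recv) < max_in_flight) {
            s->tx_buf = tx[sent++];        /* "TXE": refill TX */
            s->tx_full = true;
        }
        sim_tick(s);
        frames++;
    }
    return frames;
}
```

In this model a 4-byte transfer takes 9 frame times with `max_in_flight = 1` and 6 with `max_in_flight = 2`. On real hardware the second setting only helps if RX is always read within one frame time of RXNE going high; otherwise the peripheral reports an overrun.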

wek · « **Reply #1 on:** January 14, 2022, 05:07:59 pm »

Quote from: peter-h on January 14, 2022, 05:00:54 pm

One guy there spent some time on identifying what might be the explanation but has not offered a solution.

Beg your pardon. You've been told several times, use DMA. That *is* the solution. That you refuse it, is your problem.

JW

peter-h · « **Reply #2 on:** January 14, 2022, 05:50:17 pm »

I am happy to try it if you can post some code to get me started. It takes hours of reading just to find out which of the 135 DMA channels is the one to use

DavidAlfa · « **Reply #3 on:** January 14, 2022, 06:29:47 pm »

At best, you only have 64 clocks/byte at 168MHz core & 21MHz SPI. There are quite a lot of checks and branches; add a few flash cache misses (another 4-7 clocks)... looks pretty tight there!
I suggest setting up the DMA in CubeMX, because it really takes close to no effort. Then start increasing the frequency until you start losing bytes/clocks.
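The 64 clocks/byte figure can be checked with a couple of lines of arithmetic. The helper names below are made up for illustration, and the result is an upper bound that assumes the SPI clock never pauses between frames.

```c
#include <stdint.h>

/* CPU clock cycles available per SPI frame when frames run back to back. */
static uint32_t cycles_per_frame(uint32_t cpu_hz, uint32_t spi_hz, uint32_t bits)
{
    return (uint32_t)(((uint64_t)cpu_hz * bits) / spi_hz);
}

/* Upper bound on payload throughput in bytes per second. */
static uint32_t max_bytes_per_sec(uint32_t spi_hz)
{
    return spi_hz / 8u;
}
```

With 168MHz and 21MHz this gives 64 cycles per 8-bit frame, 128 per 16-bit frame, and a ceiling of 2625000 bytes/sec.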

peter-h · « **Reply #4 on:** January 14, 2022, 06:39:48 pm »

One thing I wondered was whether stuff like hspi->TxXferCount is wasting time. It seems to be indexing into a structure and possibly doing so at runtime.

Variables like the byte counts should be in registers.

Currently I am using -Og optimisation level. I have tried various others and -O3 seems to break some code, and apparently (there was a thread on this) this is to be expected since it is an experimental thing. But I could try it, or use some directive to create register variables.

I've never used Cube MX but know somebody who has.

Presumably one would replace just that one loop with DMA, plus a mysterious bit of code before it which handles a single-byte case

Code: [Select]

   // The need for this initial byte is unknown
    if (initial_TxXferCount == 0x01U)
    {
      *((__IO uint8_t *)&hspi->Instance->DR) = (*hspi->pTxBuffPtr);
      hspi->pTxBuffPtr++;
      hspi->TxXferCount--;
    }

Siwastaja · « **Reply #5 on:** January 14, 2022, 07:28:43 pm »

Quote from: peter-h on January 14, 2022, 05:50:17 pm

I am happy to try it if you can post some code to get me started. It takes hours of reading just to find out which of the 135 DMA channels is the one to use

Geez, just follow the manual. Off the top of my head: set the DMA channel control register, which will also have the channel mapping field. The channel mapping table is in the reference manual. Set the memory address in M0AR, and the SPI data register address in PAR. Write the number of data items to NDTR. Enable DMA in the SPI peripheral. Clear the DMA error flags (IFCR, if I recall correctly). Enable the DMA channel (CR |= 1;).

SiliconWizard · « **Reply #6 on:** January 14, 2022, 07:33:06 pm »

If you don't want to use the HAL or can't use it because it's too inefficient, you can still look at the source code to speed up your development. I've done this a lot. It will at least help you make sense of the reference manuals and get you there faster. Source code is provided - take advantage of it. Still read the manual to figure out what is strictly necessary in your case and what isn't. But I find those vendor-libraries, provided with source code, a better source of information than they are good for direct use. =)

peter-h · « **Reply #7 on:** January 14, 2022, 09:13:34 pm »

How does DMA take care of sending out a byte and then waiting for one to come back? It must involve two channels.

SiliconWizard · « **Reply #8 on:** January 14, 2022, 09:59:10 pm »

Are you not using SPI? =)

wek · « **Reply #9 on:** January 14, 2022, 10:23:25 pm »

Quote

How does DMA take care of sending out a byte and then waiting for one to come back? It must involve two channels.

Yes, of course. Except that in 'F4, the individual DMA elements are called *streams*. And you want the stream handling Rx to be of higher priority than any other stream in that DMA.

Quote

It takes hours of reading just to find out which of the 135 DMA channels is the one to use

And you expect somebody else to spend that time for you? How do you intend to program without reading the fine manual?
Here you are (attachment). And remember, the DMA element in question is *stream*; *channel* means input of the request mux (i.e. the number which goes into DMA_SxCR.CHSEL).

Except for that, everything is exactly as Siwastaja said above, just follow his description together with reading the registers part of the DMA chapter. And maybe this could help you to fill in the control register (this is for something else so you have to go through the individual fields and correct them yourself):

Code: [Select]

          
#define OR |  // because I hate C

#define DMA_SxCR_xBURST_INCR1                0       // only in FIFO mode (not Direct mode)
#define DMA_SxCR_xBURST_INCR4                1
#define DMA_SxCR_xBURST_INCR8                2
#define DMA_SxCR_xBURST_INCR16               3

#define DMA_SxCR_CT_MEMORY0                  0       // currently targeted memory, only in double-buffer mode
#define DMA_SxCR_CT_MEMORY1                  1

#define DMA_SxCR_PL_PRIORITY_LOW             0
#define DMA_SxCR_PL_PRIORITY_MEDIUM          1
#define DMA_SxCR_PL_PRIORITY_HIGH            2
#define DMA_SxCR_PL_PRIORITY_VERY_HIGH       3

#define DMA_SxCR_PINCOS_IS_PSIZE             0       // peripheral increment offset size, only if PINC=1, PBURST=0 and FIFO mode (not Direct mode)
#define DMA_SxCR_PINCOS_IS_4                 1

#define DMA_SxCR_xSIZE_BYTE                  0       // in direct (non-FIFO) mode, MSIZE is ignored and PSIZE used throughout
#define DMA_SxCR_xSIZE_HALFWORD              1
#define DMA_SxCR_xSIZE_WORD                  2

#define DMA_SxCR_DIR_P2M                     0       // direction (peripheral = P, memory = M)
#define DMA_SxCR_DIR_M2P                     1
#define DMA_SxCR_DIR_M2M                     2


#define DMA_SxFCR_FTH__1_4   0  // FIFO threshold = 1/4
#define DMA_SxFCR_FTH__1_2   1  // FIFO threshold = 1/2
#define DMA_SxFCR_FTH__3_4   2  // FIFO threshold = 3/4
#define DMA_SxFCR_FTH__FULL  3  // FIFO threshold = full

#define DMA_SxFCR_FS__1_4    0  // 0   <  FIFO status < 1/4
#define DMA_SxFCR_FS__2_4    1  // 1/4 <= FIFO status < 1/2
#define DMA_SxFCR_FS__3_4    2  // 1/2 <= FIFO status < 3/4
#define DMA_SxFCR_FS__4_4    3  // 3/4 <= FIFO status < full
#define DMA_SxFCR_FS__EMPTY  4
#define DMA_SxFCR_FS__FULL   5



disp_DMAStream->CR = 0
            OR (disp_DMAChannel          * DMA_SxCR_CHSEL_0   )  // channel select
            OR (DMA_SxCR_xBURST_INCR1    * DMA_SxCR_MBURST_0  )  // memory burst (only in FIFO mode)
            OR (DMA_SxCR_xBURST_INCR1    * DMA_SxCR_PBURST_0  )  // peripheral burst (only in FIFO mode)
            OR (0                        * DMA_SxCR_ACK       )  // "reserved" (says manual)
            OR (0                        * DMA_SxCR_CT        )  // current target (only in double-buffer mode)
            OR (0                        * DMA_SxCR_DBM       )  // double-buffer mode
            OR (DMA_SxCR_PL_PRIORITY_LOW * DMA_SxCR_PL_0      )  // priority level
            OR (0                        * DMA_SxCR_PINCOS    )  // peripheral increment offset size (only if peripheral address increments, FIFO mode and PBURST is 0)
            OR (DMA_SxCR_xSIZE_WORD      * DMA_SxCR_MSIZE_0   )  // memory data size; in direct mode forced to the same value as PSIZE
            OR (DMA_SxCR_xSIZE_HALFWORD      * DMA_SxCR_PSIZE_0   )  // peripheral data size
            OR (1                        * DMA_SxCR_MINC      )  // memory address increments
            OR (1                        * DMA_SxCR_PINC      )  // peripheral address increments
            OR (0                        * DMA_SxCR_CIRC      )  // circular mode (forced to 1 if double-buffer mode, forced to 0 if flow control is peripheral)
            OR (DMA_SxCR_DIR_M2M         * DMA_SxCR_DIR_0     )  // data transfer direction
            OR (0                        * DMA_SxCR_PFCTRL    )  // peripheral is the flow controller (i.e. who determines end of transfer) - only for SDIO
            OR (1                        * DMA_SxCR_TCIE      )  // transfer complete interrupt enable
            OR (0                        * DMA_SxCR_HTIE      )  // half transfer interrupt enable
            OR (0                        * DMA_SxCR_TEIE      )  // transfer error interrupt enable
            OR (0                        * DMA_SxCR_DMEIE     )  // direct mode error interrupt enable
            OR (1                        * DMA_SxCR_EN        )  // stream enable
          ;

Don't forget clearing the transfer-complete flag, as Siwastaja said.

You may want to use FIFO to assemble bytes into words to reduce traffic on the memory side, but start without it (i.e. leave DMA_SxFCR at its reset state) until you gain confidence. With FIFO off, MSIZE field is ignored.
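As a rough illustration of adapting the template to an SPI stream in direct mode, the control word could be composed as below. The enum and function names are illustrative only (they are not CMSIS identifiers), the field positions are those of DMA_SxCR on the 'F4, and the channel number is left as a parameter because it has to come from the request mapping table.

```c
#include <stdint.h>

/* Illustrative names, not CMSIS identifiers. */
enum { DIR_P2M = 0, DIR_M2P = 1 };
enum { SIZE_BYTE = 0, SIZE_HALFWORD = 1 };
enum { PRIO_LOW = 0, PRIO_MEDIUM = 1, PRIO_HIGH = 2, PRIO_VERY_HIGH = 3 };

/* Compose DMA_SxCR for one SPI stream; EN (bit 0) is left clear on purpose. */
static uint32_t dma_sxcr_for_spi(uint32_t channel, uint32_t prio,
                                 uint32_t size, uint32_t dir, int tc_irq)
{
    return (channel << 25)                 /* CHSEL: request channel 0..7 */
         | (prio    << 16)                 /* PL: priority level */
         | (size    << 13)                 /* MSIZE (ignored in direct mode) */
         | (size    << 11)                 /* PSIZE */
         | (1UL     << 10)                 /* MINC: walk through the buffer */
         | (0UL     <<  9)                 /* PINC off: SPI DR address is fixed */
         | (dir     <<  6)                 /* DIR */
         | ((tc_irq ? 1UL : 0UL) << 4);    /* TCIE */
}
```

For example, a byte-wide RX stream on channel 0 at the highest priority with the transfer-complete interrupt enabled comes out as 0x00030410; setting bit 0 afterwards starts the stream.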

Btw. you want to use 16-bit transfers on the SPI side.

Btw. you would want to do that in the polled implementation, too - that's the simplest possible optimization and it would probably result in noticeable improvement.

JW

SiliconWizard · « **Reply #10 on:** January 14, 2022, 11:11:54 pm »

Would the HAL_SPI_TransmitReceive_DMA() function from the HAL not be good enough in your case?

And if not, you can have a look at its source code as I suggested (it does a lot including configuring the DMA transfers, which you might want to do only once, and repeatedly restart them rather than reconfiguring the channels from scratch every time...) But the code is likely to at least guide you.

peter-h · « **Reply #11 on:** January 15, 2022, 06:12:21 am »

My code was based on the non-DMA HAL function, with unwanted stuff removed. Yes that's a very good point; they have an interrupt version and a DMA version. I am away right now so can't test anything but will look this up when I get back.

How can one use SPI in a 16-bit mode? Do you mean actually setting SPI to 16 bits so that it is shifting out, and shifting in, a 16 bit value at a time? I am doing that for an ADC which needs that (ADS1118). I can see that would work, in the case of a len value which is a multiple of two.

"And you expect somebody else to spend that time for you?"

No; I sent you a PM asking if you do consultancy, since you mentioned that somewhere, but got no reply

Development in this area can be very slow because getting it wrong trashes the FLASH chip so it has to be re-initialised, so you need two sets of w/r functions: the old and the one you are testing. One can spend days on this unless one already knows how.

DavidAlfa · « **Reply #12 on:** January 15, 2022, 08:42:05 am »

Have you considered using libopencm3? It's very much like the old ST libraries.
Setting up stuff is harder, you must take care of everything, but it's still better than doing it from scratch.

Edit: Easier said than done. Tried it, errors everywhere. Again, not many instructions other than "make"..
I will never hate these Linux things enough. Quick setup? Keep dreaming.
First, you have to master the operating system, then spend 10 hours on Google until you find a lucky post where a guy posts some Matrix codes that do the fix.
Everything in that universe is sick, stuck in 80s complexity. And then they criticise HAL for being bulky and buggy?
With HAL I can start developing in 30 seconds, and fight only against the stm32 problems.
With the free hippie Linux stuff I just want to throw it all away!

Siwastaja · « **Reply #13 on:** January 15, 2022, 12:51:32 pm »

Quote from: peter-h on January 14, 2022, 09:13:34 pm

How does DMA take care of sending out a byte and then waiting for one to come back? It must involve two channels.

Yes, you configure two streams with different memory addresses. Look at the channel mapping, there are SPIx_RX and SPIx_TX separately. And yeah, in STM32 terminology, streams are something that run in parallel, and channels are those fixed connections, like SPI5_RX is some channel number and say UART2_TX is another channel number. You configure the channel number in the control register of the DMA Stream of your choice. Note that not all streams support all channels, so you have to pick the right combo, but this isn't too hard, it's in the manual.

Again, just read the DMA section in the reference manual, it's quite simple if you ignore advanced operating modes (FIFOs, packing modes).

And indeed, using SPI in 16-bit mode halves the processing overhead (number of CPU interruptions in non-DMA solution, or number of DMA transfers in the simple non-FIFO case), but if your SPI device actually works with 8-bit granularity, then only even number of bytes is supported. This might or might not be a problem. How to use it? Well, just set it to 16-bit mode and process two bytes at a time in software. The only possible issue is the order of bytes, and having only two options, swapping the bytes is quite easy to do if it does not work properly. But then, if the slave expects, say, 7 byte transaction and misbehaves if it gets 8, you can't go this way.

Siwastaja · « **Reply #14 on:** January 15, 2022, 01:09:32 pm »

DMA channel mapping list is on pages 307-308 of RM0090. This also shows a total classic documentation error, ST calls this fixed mapping "an example", as if you could modify it. You can't, this is the only existing mapping, not an example of anything.

In any case, here is some random code of DMA configuration for SPI, I pulled out from some random old project:

Init:

Code: [Select]

	MC1_CS1();

	// DMA2 STREAM 0 ch 3 = motcon RX
	DMA2_Stream0->M0AR = (uint32_t)&motcon_rx[0];
	DMA2_Stream0->PAR = (uint32_t)&(SPI1->DR);
	DMA2_Stream0->NDTR = MOTCON_DATAGRAM_LEN;
	DMA2_Stream0->CR = 3UL<<25 /*Channel*/ | 0b10UL<<16 /*high prio*/ | 0b01UL<<13 /*16-bit mem*/ | 0b01UL<<11 /*16-bit periph*/ |
	                   1UL<<10 /*mem increment*/ | 1UL<<4 /*transfer complete interrupt*/;

	// DMA2 STREAM 3 ch 3 = motcon TX
	DMA2_Stream3->M0AR = (uint32_t)&motcon_tx[0];
	DMA2_Stream3->PAR = (uint32_t)&(SPI1->DR);
	DMA2_Stream3->NDTR = MOTCON_DATAGRAM_LEN;
	DMA2_Stream3->CR = 3UL<<25 /*Channel*/ | 0b10UL<<16 /*high prio*/ | 0b01UL<<13 /*16-bit mem*/ | 0b01UL<<11 /*16-bit periph*/ |
	                   1UL<<10 /*mem increment*/ | 0b01<<6 /*mem->periph*/;


	// SPI1 @ APB2 = 60 MHz
	SPI1->CR1 = 1UL<<11 /*16-bit frame*/ | 1UL<<9 /*Software slave management*/ | 1UL<<8 /*SSI bit must be high*/ |
		0b010UL<<3 /*div 8 = 7.5 MHz*/ | 1UL<<2 /*Master*/;

	SPI1->CR2 = 1UL<<1 /* TX DMA enable */ | 1UL<<0 /* RX DMA enable*/;

	SPI1->CR1 |= 1UL<<6; // Enable SPI

	NVIC_SetPriority(DMA2_Stream0_IRQn, 0b0000); // Priority is the most urgent; keep the ISR short
	NVIC_EnableIRQ(DMA2_Stream0_IRQn);

Start of transfer:

Code: [Select]

	MC1_CS0();
	DMA2->LIFCR = 0b111101UL<<0;  DMA2_Stream0->CR |= 1UL; // Enable RX DMA
	DMA2->LIFCR = 0b111101UL<<22; DMA2_Stream3->CR |= 1UL; // Enable TX DMA

End of transfer:

Code: [Select]

void motcon_rx_done_inthandler()
{
	DMA2->LIFCR = 0b111101UL<<0;
	MC1_CS1();
}

Maybe not your exact case but I'm sure having references won't hurt?

peter-h · « **Reply #15 on:** January 15, 2022, 02:14:38 pm »

Yes; thank you all. This is very good stuff. I will get onto it as soon as I get back.

If using DMA, am I right that the special case of count=1 doesn't need to be specially handled?

IIRC, the blocksize is always 512. The various functions take length as a parameter, and even support any 16 bit blocksize, but when I was reading the flash data sheet it was too ambiguous, so I avoided non-512 sizes. And when Windows goes in via USB, it always does whole sectors (512) anyway.

IIRC, the flash supports any size read simply by clocking it but I am not using that. It isn't useful because there isn't enough RAM to make use of it.

If I used polling with 16 bit SPI mode, given the buffer is defined as uint8_t *buf, presumably I would need to convert that into a uint16_t *buf. Otherwise I would have to extract two bytes at a time, pack them into a 16 bit int, and stuff that into the SPI, which wastes a bit of time. I am still trying to get my head around this stuff

Siwastaja · « **Reply #16 on:** January 15, 2022, 02:34:47 pm »

Yes, DMA can handle the size of 1. You could improve performance of these single-byte accesses by not using DMA on them, but not much, because DMA channel configuration isn't too many operations, especially if you don't use HAL for it. If performance is important:
* Only write the DMA control fields which need change (maybe M0AR, probably NDTR)
* Instead of read-modify-write on CR, do full writes. DMA1_Stream0->CR = config; DMA1_Stream0->CR = config | 1UL; is faster than DMA1_Stream0->CR = config; DMA1_Stream0->CR |= 1UL;

If DMA config doesn't change except for NDTR, re-enabling would be just three writes: NDTR, IFCR, CR once for enabling the channel.
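A sketch of that three-write restart, written so it can be compiled on a host: `fake_stream_t` and `fake_dma_t` are hypothetical stand-ins for the real register blocks, and `dma_restart` is a made-up name. The flag mask is the 0b111101 pattern for stream 0 in LIFCR; other streams use other bit positions, and the stream is assumed to have finished (EN already clear) before the restart.

```c
#include <stdint.h>

/* Host-side stand-ins (assumption): on the target these would be the
 * device-header stream and controller register blocks. */
typedef struct { volatile uint32_t CR, NDTR; } fake_stream_t;
typedef struct { volatile uint32_t LIFCR; } fake_dma_t;

#define STREAM0_ALL_FLAGS 0x3DUL   /* 0b111101: stream 0 flag-clear bits */
#define SXCR_EN           1UL      /* stream enable, bit 0 of CR */

/* Restart a stream whose addresses and CR configuration are unchanged. */
static void dma_restart(fake_dma_t *dma, fake_stream_t *st,
                        uint32_t cr_config, uint16_t count)
{
    st->NDTR   = count;                 /* 1: number of items to transfer */
    dma->LIFCR = STREAM0_ALL_FLAGS;     /* 2: clear stale event flags */
    st->CR     = cr_config | SXCR_EN;   /* 3: full write, no read-modify-write */
}
```

Keeping `cr_config` in a variable is what makes the last step a single store rather than a load, an OR and a store on a slow peripheral bus.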

Time consumed is the same whether you read from 8-bit SPI and write a byte to memory, or read from 16-bit SPI and write a halfword to memory. Hence, the 16-bit version will double the performance. It may happen that you need to swap the bytes. I don't remember if the M4 has a single instruction for this, but assuming you have optimizations enabled, it's not a big operation; you still save a lot of time by halving all other overhead (think about the interrupt entry latency of 12 cycles alone!), even if you need to spend 1-3 CPU cycles to swap the bytes.

You can make the buffer uint16_t, or you can keep it uint8_t and cast the pointer to uint16_t, or you can create two different access types through union, whatever. In some solutions (like pointer casting), you need to add align attribute to the definition, because uint16_t needs to be aligned by 2, but uint8_t can be arbitrarily aligned. Or, you can just write higher level C code where you do two writes to the uint8_t[] table, and let the compiler optimize it. But I'm not sure if the compiler understands if the table is aligned or not, even if you have aligned attribute, so it's not necessarily as fast.
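The union option might look like the sketch below. All names (`sector_buf_t`, `sector`, the two accessors) are hypothetical, and it only helps when the driver owns the buffer; it does nothing for a pointer handed in by an arbitrary caller.

```c
#include <stdint.h>

#define SECTOR_SIZE 512u

/* One 512-byte sector viewable as bytes or as 16-bit SPI frames. The
 * uint16_t member forces 2-byte alignment of the whole object. */
typedef union {
    uint8_t  bytes[SECTOR_SIZE];
    uint16_t frames[SECTOR_SIZE / 2u];
} sector_buf_t;

static sector_buf_t sector;   /* hypothetical buffer owned by the driver */

/* Byte view for callers that think in bytes. */
static uint8_t *sector_bytes(void)   { return sector.bytes; }

/* Halfword view for a 16-bit transfer loop or a 16-bit DMA stream. */
static uint16_t *sector_frames(void) { return sector.frames; }
```

Both views start at the same address, so no pointer cast and no alignment attribute is needed.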

peter-h · « **Reply #17 on:** January 15, 2022, 05:16:36 pm »

Just been reading up the RM on SPI. There is a bit ordering config but no byte ordering config and I can't immediately find out a doc specifying which byte (in 16 bit mode) gets shifted out first.

One would assume that if 16 bit mode and MSB-first was selected then the high byte of a 16 bit value would come out first, because "MSB first" ought to mean bit 15

On the 32F4 that is the byte at the higher memory address (of a 16 bit int).

The flash serial interface shifts MSB first so it looks like 16 bit mode could be used but would need a byte swap.
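One way to sidestep both the byte swap and the alignment question is to build each frame from two byte accesses. This sketch assumes, as reasoned above but not confirmed from the RM, that 16-bit MSB-first mode shifts bit 15 out first; the function names are made up.

```c
#include <stdint.h>

/* Build the 16-bit frame that puts p[0] on the wire first, assuming bit 15
 * is shifted out first (assumption, see lead-in). Works at any address. */
static uint16_t frame_from_bytes(const uint8_t *p)
{
    return (uint16_t)(((uint16_t)p[0] << 8) | p[1]);
}

/* Store a received frame so the first byte on the wire lands at p[0]. */
static void bytes_from_frame(uint8_t *p, uint16_t frame)
{
    p[0] = (uint8_t)(frame >> 8);
    p[1] = (uint8_t)(frame & 0xFFu);
}
```

If the assumption about bit order turns out to be wrong, only these two functions need their byte indices exchanged.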

However, with DMA, there won't be any need for 16-bit mode anyway.

Coming back to polling mode, and 16 bit SPI mode, what I am still not sure about is whether that length=1 hack is needed. It isn't possible anyway; the minimum length will be 2. I don't understand the need for that hack, however (I vaguely recall reading somewhere it was needed in case an interrupt came in at the wrong moment, and would bugger up the SPI) and it may be moot since as I said, I can limit DMA (or 16-bit polled mode) to the length=512 case.

Contrary to what I said before, the length=1 case is used for things like command bytes. But obviously no speed optimisation is needed there. Only the full block w/r of 512 needs optimising.

BTW, FWIW, SPI2 is used only for the serial flash. I am driving various other chips (ADCs, displays, etc) using SPI3 but for that I am using the HAL SPI functions unmodified. Those all work fine and don't need tweaking since SPI speeds are low; from ~500k to ~5mbps.

nctnico · « **Reply #18 on:** January 15, 2022, 05:20:25 pm »

Quote from: Siwastaja on January 15, 2022, 02:34:47 pm

You can make the buffer uint16_t, or you can keep it uint8_t and cast the pointer to uint16_t,

I strongly advise against casting pointers that way. You never know whether an uint8_t array is aligned or not (there is no guarantee at all) and unaligned accesses are not supported on most ARM platforms. It may even fail silently (been there, done that).

peter-h · « **Reply #19 on:** January 15, 2022, 05:25:05 pm »

An excellent point; I cannot tell if the buffer is aligned because it is supplied by the caller, which could be anything.

It thus looks like the only way is to use DMA, in 8-bit mode. The 16-bit mode will need special handling on the 1st/last byte.
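The first/last byte handling could be planned up front, roughly as follows. `xfer_plan_t` and `plan_transfer` are hypothetical names, and the sketch only computes the split; switching the SPI between 8-bit and 16-bit frames for each part is left out.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    size_t head;    /* 0 or 1 leading byte sent in 8-bit mode */
    size_t frames;  /* number of 16-bit frames starting at an even address */
    size_t tail;    /* 0 or 1 trailing byte sent in 8-bit mode */
} xfer_plan_t;

/* Split a caller-supplied buffer into an optional odd byte, an aligned
 * run of halfwords, and an optional leftover byte. */
static xfer_plan_t plan_transfer(uintptr_t addr, size_t len)
{
    xfer_plan_t p = {0, 0, 0};
    if (len == 0) return p;
    p.head   = (addr & 1u) ? 1u : 0u;   /* odd start address: one byte first */
    p.frames = (len - p.head) / 2u;
    p.tail   = (len - p.head) % 2u;
    return p;
}
```

An aligned 512-byte buffer gives 0 + 256 + 0, while the same length at an odd address gives 1 + 255 + 1, so the fast path is mostly preserved either way.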

Siwastaja · « **Reply #20 on:** January 15, 2022, 05:50:47 pm »

Quote from: nctnico on January 15, 2022, 05:20:25 pm

Quote from: Siwastaja on January 15, 2022, 02:34:47 pm
You can make the buffer uint16_t, or you can keep it uint8_t and cast the pointer to uint16_t,
I strongly advise against casting pointers that way. You never know whether an uint8_t array is aligned or not (there is no guarantee at all)

Of course you know if you align it. All decent compilers support this (see attributes), and if your MCU projects absolutely must be portable outside of the two major ARM compilers (gcc and clang) which follow the same syntax, you can create a wrapper header.

Perfect portability and adherence to only C standard utilities is impossible in MCU projects anyway; usually you need to add alignment attributes when using DMA, for the DMA.

You are of course right, if you can, avoid the pointer casting, except casting to char* or void* which is always safe. And in this case, using uint16_t [] in definition and casting to uint8_t* (or char*, or void*) wherever needed as single bytes is always safe.

And despite the fact that you have encountered a CPU where unaligned access caused weird issues instead of the expected crash, STM32F417 does not behave that way, but gives UsageFault as documented. And if you google UsageFault, the first result shows "unaligned memory access" and you don't even need to open that link to see that. So it's not usually a huge problem even if you mess it up accidentally. One of the easiest bugs to track down, if we exclude your weird experience on non-Cortex-M4.

peter-h · « **Reply #21 on:** January 17, 2022, 06:28:21 pm »

There is another angle on this. Not worth a lot but maybe something.

To receive say 512 bytes, you merely need to stuff 512 bytes (or 256 words if using 16-bit SPI mode) into the TX. It doesn't matter what they are, you are interested only in generating the clock pulses, so no need to read the value out of a buffer, increment a pointer, etc. They can all be 0x00 or 0x0000

And likewise to transmit, you don't need to put the RX data anywhere, because it will be junk anyway.
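In code, that observation amounts to making either buffer optional. The sketch below is hardware-independent: `spi_exchange_fn` is a hypothetical hook for "clock one byte out, return the byte clocked in", and `fake_exchange` is a host-side stand-in for the flash used only to exercise the routine.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hook: clock one byte out and return the byte clocked in. */
typedef uint8_t (*spi_exchange_fn)(uint8_t out);

/* tx == NULL: send 0x00 filler just to generate clocks (block read).
 * rx == NULL: discard whatever comes back (block write). */
static void spi_block(spi_exchange_fn xchg, const uint8_t *tx, uint8_t *rx,
                      size_t len)
{
    for (size_t i = 0; i < len; i++) {
        uint8_t in = xchg(tx ? tx[i] : 0x00u);
        if (rx) rx[i] = in;
    }
}

/* Host-side stand-in for the device: answers every byte with its
 * complement and counts how many bytes were clocked. */
static unsigned fake_clocks;
static uint8_t fake_exchange(uint8_t out)
{
    fake_clocks++;
    return (uint8_t)~out;
}
```

With DMA the same idea is usually expressed by leaving memory increment off on the unused direction, so that stream keeps reading or writing one dummy location.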

That "transmit-receive" function was written by ST to be general-purpose, but in reality whether you are having to transmit the 512, or receive the 512, will depend on preceding commands (mostly byte values) sent to the FLASH. You are unlikely to be wanting to TX and RX valid data. The only case I recall where both need to be valid data was with a TI ADS1118 which shifts out 16 bits and shifts in 16 bits at the same time. It's a weird device...

Re DMA, note that DMA can't access CCM, so the calling code needs to watch that. In my case it is reasonably easy to take care of this, but if you use CCM for RTOS stacks (which is probably very tempting) then it could bite you.
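A guard for this could be as small as an address range check before choosing the DMA path. The sketch assumes the 64 KB CCM block at 0x10000000 of the 32F405/407/415/417 parts; `touches_ccm` is a made-up name and the constants should be checked against the memory map of the actual device.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed CCM data RAM window: 64 KB at 0x10000000. */
#define CCM_BASE 0x10000000u
#define CCM_SIZE 0x00010000u

/* True if any byte of [addr, addr + len) lies in CCM, i.e. cannot be
 * reached by the DMA controllers. */
static bool touches_ccm(uintptr_t addr, size_t len)
{
    if (len == 0) return false;
    uintptr_t last = addr + (len - 1u);
    return addr < (uintptr_t)CCM_BASE + CCM_SIZE && last >= CCM_BASE;
}
```

A caller would fall back to the polled routine, or bounce through a buffer in main RAM, whenever this returns true.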

betocool · « **Reply #22 on:** January 19, 2022, 09:41:58 am »

Hey, for SPI transfers, DMA is the way to go IMO...

Here's some code setting up an I2S transfer (which uses the SPI interface) for sending data on an STM32H7:

Code: [Select]

/* Set up DMA TX */
    RCC->AHB1ENR |= RCC_AHB1ENR_DMA2EN;   // DMA2 clock enable;

	DMA2_Stream3->CR &= ~DMA_SxCR_EN;
	while(DMA2_Stream3->CR & DMA_SxCR_EN)
	{

	}

	aux_register = 0;	// start from a clean value before OR-ing in the fields
	aux_register |= DMA_SxCR_MSIZE_1;
	aux_register |= DMA_SxCR_PSIZE_1;
	aux_register |= DMA_SxCR_MINC;
	aux_register |= DMA_SxCR_CIRC;
	aux_register |= DMA_SxCR_PL_1;
	aux_register |= DMA_SxCR_DIR_0;
	aux_register |= DMA_SxCR_TCIE;
	aux_register |= DMA_SxCR_HTIE;

    HAL_NVIC_SetPriority(DMA2_Stream3_IRQn, 6, 0);
    HAL_NVIC_EnableIRQ(DMA2_Stream3_IRQn);

	DMA2_Stream3->CR = aux_register;
	DMA2_Stream3->FCR &= ~(DMA_SxFCR_DMDIS);

	DMA2_Stream3->PAR = (uint32_t)&(SPI3->TXDR);

	DMAMUX1_Channel11->CCR = 0x3E;

	/* Set up I2S3 TX */
	SPI3->CR1 &= ~SPI_CR1_SPE;
	while(SPI3->CR1 & SPI_CR1_SPE)	// wait until SPE actually reads back as 0
	{

	}

	aux_register = 0;
	aux_register |= SPI_CFG1_TXDMAEN;
	SPI3->CFG1 |= aux_register;

	//Chip Enable pin
	HAL_GPIO_WritePin(GPIOD, GPIO_PIN_11, GPIO_PIN_SET);

	DMA2_Stream3->M0AR = (uint32_t)&(controller_output_buffer[0].left);
	DMA2_Stream3->NDTR = 2 * OUTPUT_BUF_SIZE_STEREO_SAMPLES;

	/* Clear interrupt flags */
	DMA2->LIFCR |= DMA_LIFCR_CTCIF3 | DMA_LIFCR_CHTIF3;
	DMA2_Stream3->CR |= DMA_SxCR_EN;
	SPI3->CR1 |= SPI_CR1_SPE;
	SPI3->CR1 |= SPI_CR1_CSTART;

You'll see it's not at all complex once you read the datasheet. Basically set the direction, peripheral, number of transfers and an interrupt. If you want, set up a double buffer, and receive interrupts at transfer completion and half transfer (what I use).

You'll have to set up an additional SPI Rx DMA channel, main difference being the direction! You will receive whatever data the slave is ready to send back to you. If you only want to receive, as someone mentioned before... send zeros out on the Tx line. I'm pretty sure you can set up an RX only SPI channel with clocking and DMA... I must see if I can find something like that, I think I used that on an STM32L4.

Cheers,

Alberto

betocool · « **Reply #23 on:** January 19, 2022, 09:45:14 am »

Here's part of another example. This one runs on an L4...

Code: [Select]

	/*
	 * Prepare SPI 3 for DMA transfers in both directions
	 * This functionality gets activated using low level
	 * register access
	 * SPI 3 TX -> DMA2 Channel 2 / DIR: 1 /
	 * SPI 3 RX -> DMA2 Channel 1 / DIR: 0
	 */

	/* Configure PC3 input as interrupt, rising edge */
	GPIO_InitStruct.Pin = GPIO_PIN_3;
	GPIO_InitStruct.Mode = GPIO_MODE_IT_RISING;
	GPIO_InitStruct.Pull = GPIO_PULLDOWN;
	HAL_GPIO_Init(GPIOC, &GPIO_InitStruct);

	/* Enable DMA2 */
	RCC->AHB1ENR |= RCC_AHB1ENR_DMA2EN;

	/* Select channels */
	DMA2_CSELR->CSELR |= (0x3 << DMA_CSELR_C1S_Pos); ///< SPI3RX Ch1
	DMA2_CSELR->CSELR |= (0x3 << DMA_CSELR_C2S_Pos); ///< SPI3TX Ch2
	/*
	 * SPI TX
	 */
	DMA2_Channel2->CCR |= 	DMA_CCR_MSIZE_0 | 				///< 16 Bits Mem Width
							DMA_CCR_PSIZE_0 | 				///< 16 Bits Peripheral Width
							DMA_CCR_MINC    |				///< Increment memory
							DMA_CCR_DIR;					///< Direction (TX)
	/*
	 * SPI RX
	 */
	DMA2_Channel1->CCR |= 	DMA_CCR_MSIZE_0 | 				///< 16 Bits Mem Width
							DMA_CCR_PSIZE_0 | 				///< 16 Bits Peripheral Width
							DMA_CCR_MINC    ;				///< Increment memory
	DMA2_Channel1->CCR |= 	DMA_CCR_TCIE;					///< Enable DMA RX ready interrupt
	/* Clear interrupt flags just in case */
	DMA2->IFCR |= DMA_IFCR_CTCIF1;
	DMA2->IFCR |= DMA_IFCR_CTCIF2;

	HAL_NVIC_SetPriority(DMA2_Channel1_IRQn, 7, 0);			///< Set the interrupt priority and enable
	HAL_NVIC_EnableIRQ(DMA2_Channel1_IRQn);


	// Disable SPI
	bool loop = true;
	while(loop)
	{
		if((SPI3->SR & SPI_SR_BSY) == 0)
		{
			if((SPI3->SR & SPI_SR_FRLVL) != 0)
			{
				(void)(uint32_t)SPI3->DR;
			}
			else
			{
				loop = false;
			}
		}
	}
	SPI3->CR1 &= ~SPI_CR1_SPE;

peter-h · « **Reply #24 on:** January 19, 2022, 05:22:10 pm »

One issue is that DMA cannot write to the CCM, so I still need a version of this which does not use DMA. So I spent a few hours today on that, since it should be simple

I implemented 16 bit mode only if blocksize=512 which is the only time-critical scenario.

Well, it works:

Code: [Select]

  if (Size==512)
  {

	  // Set 16 bit SPI2 mode

	  SPI2->CR1 &= ~(0x1 << 6);			// disable SPI2
	  SPI2->CR1 |= (0x1 << 11);			// set 16 bit mode
	  SPI2->CR1 |= (0x1 << 6);			// enable SPI2

	  // Loop, sending out TX buffer while receiving bytes into RXbuffer
	  // Generally, both counts must be the same, and if you are just receiving then TX data can be anything

	  uint16_t val16,val16b;

	  while ((TxXferCount > 0U) || (RxXferCount > 0U))
	  {
	      /* Do a transmit if TX empty */
	      if ((__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_TXE)) && (TxXferCount > 0U) && txallowed )
	      {
	    	  val16 = *(const uint16_t *)hspi->pTxBuffPtr;  // read two bytes; plain *pTxBuffPtr reads only one
	    	  val16b = (val16>>8) | (val16<<8);  // byte swap
	    	   *(__IO uint16_t *)&SPI2->DR = val16b;
	    	  hspi->pTxBuffPtr+=2;
	    	  TxXferCount-=2;
	    	  /* Next Data is a reception (Rx). Tx not allowed */
	    	  txallowed = false;
	      }

	      /* Do a receive if RX not empty */
	      if ((__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_RXNE)) && (RxXferCount > 0U))
	      {
	    	  val16 = SPI2->DR;
	    	  (*(uint16_t *)hspi->pRxBuffPtr) = (val16>>8) | (val16<<8);  // byte swap
	    	  hspi->pRxBuffPtr+=2;
	    	  RxXferCount-=2;
	    	  /* Next Data is a Transmission (Tx). Tx is allowed */
	    	  txallowed = true;
	      }

	  }

	  // Set back to 8 bit SPI2 mode
	  SPI2->CR1 &= ~(0x1 << 6);			// disable SPI2
	  SPI2->CR1 &= ~(0x1 << 11);		// set 8 bit mode back
	  SPI2->CR1 |= (0x1 << 6);			// enable SPI2


  }

Running a tight loop for reading the flash and timing 100000 512 byte page reads, I get this

original 8 bit SPI mode = 50 secs = 1024kbytes/sec
16 bit SPI mode = 40 secs = 1280kbytes/sec (-Og optimisation)
16 bit SPI mode = 37 secs = 1383kbytes/sec (-O3 optimisation)

The optimisation is done with __attribute__((optimize("O3"))) i.e. on that one function only.

So the help with 16 bit SPI mode is only 20%.
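The figures above can be reproduced, and compared with the 21MHz bit rate, with a little integer arithmetic. The helper names are made up, and 1 kbyte is taken as 1000 bytes to match the numbers quoted.

```c
#include <stdint.h>

/* Payload throughput in kbytes/sec (1 kbyte = 1000 bytes) for `reads`
 * blocks of `block` bytes completed in `seconds`. */
static uint32_t kbytes_per_sec(uint32_t reads, uint32_t block, uint32_t seconds)
{
    return (uint32_t)(((uint64_t)reads * block) / seconds / 1000u);
}

/* Percentage of the raw SPI bit rate that is actually carrying data. */
static uint32_t bus_utilisation_pct(uint32_t kbytes_s, uint32_t spi_hz)
{
    return (uint32_t)(((uint64_t)kbytes_s * 8000u * 100u) / spi_hz);
}
```

For 50, 40 and 37 seconds this gives 1024, 1280 and 1383 kbytes/sec, i.e. roughly 39%, 48% and 52% of what a 21MHz clock could carry.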

It seems obvious that the main issue must be the txallowed flag. Its purpose is to prevent loading the TX unless a value has been immediately previously extracted from RX. That loop just keeps going around, loading the TX if there is space, and reading the RX if there is data. The complication is that this is not a normal UART because you need TX to generate the clocks to get RX data to come in. So there is a deadlock, which is needed, but it has the effect of under-utilising the TX. I wonder if anyone has any ideas on how to fix this.

I then changed the loop to test txallowed and check the TXE bit only if txallowed is true, because TXE is a slow read. I also changed the test of either counter being nonzero to ORing them together. That knocked off 3 seconds - same as -O3, and now -O3 has no effect. Curious!

Code: [Select]

	  uint16_t val16,val16b;

	  while ( (TxXferCount|RxXferCount) != 0 )	// while ((TxXferCount > 0U) || (RxXferCount > 0U))
	  {

	      /* Do a transmit if TX empty */
		  if (txallowed && (TxXferCount > 0U))
		  {
		      if ( __HAL_SPI_GET_FLAG(hspi, SPI_FLAG_TXE) )
		      {
		    	  val16 = *(const uint16_t *)hspi->pTxBuffPtr;	// read two bytes; plain *pTxBuffPtr reads only one
		    	  val16b = (val16>>8) | (val16<<8);		// byte swap
		    	  /* *(__IO uint16_t *)& */ SPI2->DR = val16b;
		    	  hspi->pTxBuffPtr+=2;
		    	  TxXferCount-=2;
		    	  /* Next Data is a reception (Rx). Tx not allowed */
		    	  txallowed = false;
		      }
		  }

	      /* Do a receive if RX not empty */
	      if ((__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_RXNE)) && (RxXferCount > 0U))
	      {
	    	  val16 = SPI2->DR;
	    	  (*(uint16_t *)hspi->pRxBuffPtr) = (val16>>8) | (val16<<8);	// byte swap
	    	  hspi->pRxBuffPtr+=2;
	    	  RxXferCount-=2;
	    	  /* Next Data is a Transmission (Tx). Tx is allowed */
	    	  txallowed = true;
	      }

	  }

No matter what I do I can't get under 37 secs, which is 50% of the SPI speed. Well, better than 30%

Scope trace below showing the 16 clocks and then a gap.

I will try DMA next. Maybe DMA will be faster even with CCM, after copying the 512 bytes to/from a temp buffer in main RAM...

