Author Topic: Order Of Magnitude Expected Refresh Rate, STM32, Small TFT Display (Read 4029 times)

Etesla · « **on:** August 02, 2022, 04:28:25 pm »

Hi All.

I'm thinking of embarking on a project to use an STM32F0 MCU and a small display (240x240 pixel ish) using the gc9a01 driver.

I want to know what kind of update rates I might expect in the average case.

The kind of displays I'm looking at need to be written to using an SPI interface on a pixel-by-pixel basis. I'm aware that there are many options for driving a display (different controllers, different interfaces, different firmware libraries, etc.), but I'm asking about the 4-wire SPI pixel-by-pixel type of interface setup.

Off the top of my head, say the SPI clock is at 16 MHz (no idea if that's reasonable, can't find a max frequency spec in most driver datasheets), I send 24 bits per pixel, and it's a 240x240 ish display (57600 pixels). Then refreshing the entire screen might take on the order of 80 milliseconds (1/(16e6)*24*57600 ≈ 86 ms) if I use the DMA and probably store images in some kind of memory or heap.

Is that math half reasonable with the 80 milliseconds? Am I missing any big steps? Does anyone have a gut feel for what update rate I can expect for a project like this?
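For anyone who wants to replay the estimate, here is a minimal Python sketch of the raw transfer time. It assumes back-to-back SPI clocks with no gaps and ignores command overhead and rendering time; the function name `frame_time_s` is made up for illustration.

```python
def frame_time_s(width, height, bits_per_pixel, spi_hz):
    """Raw time to clock one full frame out over SPI, ignoring all overhead."""
    total_bits = width * height * bits_per_pixel
    return total_bits / spi_hz


t = frame_time_s(240, 240, 24, 16e6)
print(f"{t * 1e3:.1f} ms per frame, {1 / t:.1f} frames per second")
# prints "86.4 ms per frame, 11.6 frames per second"
```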

MasterT · « **Reply #1 on:** August 02, 2022, 04:52:55 pm »

Forget 80 msec, more like 4 seconds to refill the whole screen.
Have a look, 24-28 seconds in this video:

Etesla · « **Reply #2 on:** August 02, 2022, 04:57:10 pm »

Hmmm. I wonder what SPI clock frequency they are using and what method they use for storing images and sending them over SPI. I'm a little skeptical of arduino projects using the best possible methods for the fastest possible times...

MasterT · « **Reply #3 on:** August 02, 2022, 05:03:05 pm »

Default arduino UNO SPI clock 4 MHz. Max 8 MHz. I think F0 would be very close to atmega328 (16MHz) in TFT application. Search more video on youtube, see if there is an example with F0

hans · « **Reply #4 on:** August 02, 2022, 09:18:21 pm »

I think that video shows live graphics rendering on a ATMEGA328. That's of course a very slow task to do given the limited capabilities of an AVR for rich graphics. It may even need to read back graphics memory data for complicated stuff, because the ATMEGA doesn't hold enough RAM for a proper frame buffer. I've no clue how well written the code is; for all we know they are using floating point for circular coordinates and then converting at the last possible moment.

In terms of frame buffer transfer rate (from e.g. an SD card), the STM32F0 should be at least twice as quick. The max display clock for that driver is 100MHz in write (lower in read). So 16MHz SPI clock (max of F0) should be fine, so OP's estimate of 80ms should give ~12.5fps. Especially given that DMA can be used, so there will be virtually zero gaps between SPI transfers. That's probably a whole different story on an AVR, where it's easy to write code that has >100% timing overhead, probably even a lot worse with default Arduino compiler settings (so a 8MHz clock rate is still <4MHz effective).

MasterT · « **Reply #5 on:** August 02, 2022, 09:44:46 pm »

Quote from: hans on August 02, 2022, 09:18:21 pm

I think that video shows live graphics rendering on a ATMEGA328.

So 16MHz SPI clock (max of F0) should be fine, so OP's estimate of 80ms should give ~12.5fps. Especially given that DMA can be used, so there will be virtually zero gaps between SPI transfers. That's probably a whole different story on an AVR, where it's easy to write code that has >100% timing overhead, probably even a lot worse with default Arduino compiler settings (so a 8MHz clock rate is still <4MHz effective).

I pointed at 24-28 sec. in video, dumb simple single colour FillScreen - sure uCPU just pushing data into SPI- DR, no fancy rendering involved.

F0 likely doesn't support 24-bits SPI transfer, 16 probably at the max, so DMA is a little help here, FO has to drive CS pin "manually" using uCPU - not much differ compare to arduino.

I have a benchmark for a 240x320 TFT LCD (mcuFriends). Unfortunately it's 8-bit parallel, so not directly comparable with an SPI-driven display. Nevertheless, see FillScreen (STM32F767Zi):

Quote

Serial took 0ms to start
ID = 0x9486
MCUFRIEND 2.99 UNO
FillScreen 156615
Text 8420
Lines 156954
Horiz/Vert Lines 13015
Rectangles (outline) 7559
Rectangles (filled) 382163
Circles (filled) 69338
Circles (outline) 67380
Triangles (outline) 43409
Triangles (filled) 132859
Rounded rects (outline) 23750
Rounded rects (filled) 420259
Total:1.48sec
ID: 0x9486
F_CPU:216.00MHz

DavidAlfa · « **Reply #6 on:** August 02, 2022, 09:59:41 pm »

Yeah, if storing the image somewhere and using DMA, you can get the theoretical framerate, e.g. 16MHz / (240*240*24) ≈ 11.6fps.

I've done that using external SPI flash, connecting together SDI and SDO in both screen and flash.

- Select the flash, send the read CMD & address, unselect.
Most SPI memories allow this, but only once: after a full CMD is sent, instead of resetting its state the flash holds the command and waits for the next chip select. They call it "transfer pause" or similar.

- Select the screen, send write CMD, keep selected.

- Disable MCU data pin, select the flash, now send n clocks as master using either DMA or polling.

- Now SPI flash is outputting data, and the screen is receiving it.

- Deselect both devices, enable MCU data pin.
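
The steps above, written out as a rough Python sketch of the ordering only. Nothing here touches real hardware: `RecordingBus` is a made-up stand-in that just logs each action, and the method names are hypothetical. The flash read opcode `0x03` with a 24-bit address is an assumption (typical for SPI NOR flash, check your part), `0x2C` is the display memory write command, and D/C pin handling is left out.

```python
class RecordingBus:
    """Stand-in for real GPIO/SPI access: it only logs what would be done."""

    def __init__(self):
        self.log = []

    def select(self, device):
        self.log.append(("select", device))

    def deselect(self, device):
        self.log.append(("deselect", device))

    def send(self, data):
        self.log.append(("send", bytes(data)))

    def mcu_data_pin(self, enabled):
        self.log.append(("mcu_data_pin", enabled))

    def clock_only(self, n_bytes):
        self.log.append(("clock_only", n_bytes))


def stream_flash_to_display(bus, flash_read_cmd, address, display_write_cmd, n_bytes):
    # Arm the flash: read command plus 24-bit address, then release it.
    bus.select("flash")
    bus.send(bytes([flash_read_cmd]) + address.to_bytes(3, "big"))
    bus.deselect("flash")
    # Put the display into memory-write mode and keep it selected.
    bus.select("display")
    bus.send(bytes([display_write_cmd]))
    # Release the MCU data pin so the flash output drives the display input.
    bus.mcu_data_pin(False)
    bus.select("flash")
    # The MCU now only generates clocks (DMA or polling); flash feeds the display.
    bus.clock_only(n_bytes)
    # Clean up: deselect both devices and take the data pin back.
    bus.deselect("flash")
    bus.deselect("display")
    bus.mcu_data_pin(True)


bus = RecordingBus()
stream_flash_to_display(bus, 0x03, 0x000000, 0x2C, 240 * 240 * 2)
for step in bus.log:
    print(step)
```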

This will absolutely saturate the max SPI bandwidth, but drawing operations (circle, text...) will be much slower.

The worst thing is when you have to clear the LCD and write over again, that will make a terrible tearing/blinking effect.

mikeselectricstuff · « **Reply #7 on:** August 02, 2022, 10:04:09 pm »

Bear in mind the max SPI clock rate is not only dependent on the controller, but also on how much flex it has to go through, as this will add inductance, capacitance, ground bounce etc. I'd be surprised if you could get over 16MHz in practice, and you might be limited to half that.

hans · « **Reply #8 on:** August 02, 2022, 10:11:06 pm »

Quote from: MasterT on August 02, 2022, 09:44:46 pm

I pointed at 24-28 sec. in video, dumb simple single colour FillScreen - sure uCPU just pushing data into SPI- DR, no fancy rendering involved.

F0 likely doesn't support 24-bits SPI transfer, 16 probably at the max, so DMA is a little help here, FO has to drive CS pin "manually" using uCPU - not much differ compare to arduino.

You can still do 8-bit transfers in HW, and use timer compares + DMA to glue it all together for 24-bit transfer if you really need to wiggle the chip select between pixel writes.
It should certainly be possible to have very little idle time on the SPI bus with that. A software-only solution without hardware FIFO cannot achieve that kind of bus load, and as I said, will probably have overhead in the order of 100%+ (e.g. a regular SpiTxRxByte function) unless you squeeze the absolute guts out of your low-level SPI transfers and compiler settings.

That's why I don't think a 80ms refresh time is unreasonable.

In fact, here is a nice animation running on a similar display:

https://youtu.be/Y0BGnHFuYBU?t=469

DavidAlfa · « **Reply #9 on:** August 02, 2022, 10:16:10 pm »

You can also just use the display in 16-bit mode (rgb565) if not requiring the higher color mode, will save a lot of bandwidth and processing.
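
A minimal sketch of the 24-bit to 16-bit conversion, in Python for readability. The function names are made up; sending the high byte first is an assumption that should be checked against the display datasheet.

```python
def rgb888_to_rgb565(r, g, b):
    """Keep the top 5 bits of red, 6 of green and 5 of blue."""
    return ((r & 0xF8) << 8) | ((g & 0xFC) << 3) | (b >> 3)


def rgb565_bytes(r, g, b):
    """Two bytes per pixel, high byte first (assumed wire order)."""
    return rgb888_to_rgb565(r, g, b).to_bytes(2, "big")


print(hex(rgb888_to_rgb565(255, 0, 0)))  # prints 0xf800
print(rgb565_bytes(0, 255, 0).hex())     # prints 07e0
```

Per frame this is 2 bytes per pixel instead of 3, so a third less data to push over SPI.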

Siwastaja · « **Reply #10 on:** August 07, 2022, 03:12:41 pm »

Quote from: DavidAlfa on August 02, 2022, 10:16:10 pm

You can also just use the display in 16-bit mode (rgb565) if not requiring the higher color mode, will save a lot of bandwidth and processing.

Those cheap LCDs do not look that great anyway and internally use 6-bit color at most, and probably don't implement any dithering, so end user would likely see no difference at all between RGB565 or 888. Some specifically crafted gradient scale might look a tad worse side-by-side. 16-bit makes everything so much easier, as you can now modify two pixels at once, so the whole graphics code will be significantly faster.

Quote from: MasterT on August 02, 2022, 09:44:46 pm

F0 likely doesn't support 24-bits SPI transfer, 16 probably at the max, so DMA is a little help here, FO has to drive CS pin "manually" using uCPU - not much differ compare to arduino.

I have no idea what FO and uCPU mean, but a quick glance at RM0091 shows even the F0 series has hardware nSS management on the SPI peripheral and 32-bit FIFOs allowing saving on memory bus cycles with DMA.

Of course this should be tested before assuming anything as hardware nSS management on STM32 can be notoriously iffy, but this should be a relatively simple SPI master use case so I see no reason why it would not work.

Specifically if the OP can choose a part with enough memory to hold double buffer of the whole display, software can modify one buffer and DMA can refresh the screen from the another.

wek · « **Reply #11 on:** August 07, 2022, 05:21:52 pm »

Quote from: Siwastaja on August 07, 2022, 03:12:41 pm

[...] a quick glance at RM0091 shows even the F0 series has hardware nSS management on the SPI peripheral and 32-bit FIFOs allowing saving on memory bus cycles with DMA

What exactly would you expect from "hardware nSS management" and how would that allow "saving on memory bus cycles with DMA"?

The STM32 NSS probably does not do what you expect (maybe in 'H7, which has an exorbitantly complex SPI I am not familiar with), but I am still curious what would be your expectation, exactly - maybe comparing to other mcus' SPI framing implementation.

JW

langwadt · « **Reply #12 on:** August 07, 2022, 06:33:24 pm »

Quote from: wek on August 07, 2022, 05:21:52 pm

Quote from: Siwastaja on August 07, 2022, 03:12:41 pm
[...] a quick glance at RM0091 shows even the F0 series has hardware nSS management on the SPI peripheral and 32-bit FIFOs allowing saving on memory bus cycles with DMA
What exactly would you expect from "hardware nSS management" and how would that allow "saving on memory bus cycles with DMA"?

The STM32 NSS probably does not do what you expect (maybe in 'H7, which has an exorbitantly complex SPI I am not familiar with), but I am still curious what would be your expectation, exactly - maybe comparing to other mcus' SPI framing implementation.

JW

afaict the F0 SPI with pulse mode on does what you'd expect, NSS going high between each 8 or 16 bit frame

F4 hardware NSS is useless because it just pulls NSS low when the SPI is enabled.

wek · « **Reply #13 on:** August 07, 2022, 07:37:19 pm »

Quote from: langwadt on August 07, 2022, 06:33:24 pm

afaict the F0 SPI with pulse mode on does what you'd expect, NSS going high between each 8 or 16 bit frame

I personally wouldn't expect that. In fact, in many regards it's a hindrance: it slows down the transmission, causes discontinuous SCK, and at the end of the day it does not provide multibyte framing which is needed for many practical usages of the signal, anyway.

As I've said, the 'H7 SPI probably does provide a multi-frame NSS, AFAIK there's a counter to support that there. I am not going to check, I don't care.

In my ideal world, the SPI clock would be simply connected internally to a counter (bidirectionally, allowing SPI being set as slave or master), which then (together with other linked timers) could generate any clock/framing signals mixture as needed. Of course, the same can always be achieved by connecting pins externally, but pins are precious asset in mcus.

In context of the original post, IMO raw refresh rates close to theoretically calculated ones are achievable, with carefully crafted software, if the display is simple (e.g. single-color, or plain copy of a precalculated bitmap). Calculating what has to be displayed often takes much longer than the transmission. As is the normal case in mcus, generic methods (wonderful 24-bit fully shaded/rendered graphics, for example) are usually very inefficient. Things have to be designed with concrete hardware and application in mind and tradeoffs have to be made.

JW

langwadt · « **Reply #14 on:** August 07, 2022, 07:55:02 pm »

Quote from: wek on August 07, 2022, 07:37:19 pm

Quote from: langwadt on August 07, 2022, 06:33:24 pm
afaict the F0 SPI with pulse mode on does what you'd expect, NSS going high between each 8 or 16 bit frame
I personally wouldn't expect that. In fact, in many regards it's a hindrance: it slows down the transmission, causes discontinuous SCK, and at the end of the day it does not provide multibyte framing which is needed for many practical usages of the signal, anyway.

"slowing down" transmission and discontinuous SCK is what is required for SPI when the receiving end acts like a shift register with NSS as latch enable,
like many displays. NSS going low at the start of a transmission and high once you are done with however many frames you need is how it works in non-pulse mode, and how many flash chips work.

wek · « **Reply #15 on:** August 07, 2022, 11:22:25 pm »

> [...] when the receiving end acts like a shift register with NSS as latch enable,
> like many displays.

The gc9a01 driver mentioned by OP does not appear to behave in this way. Can you point me to such display/driver, please?

JW

langwadt · « **Reply #16 on:** August 08, 2022, 07:06:52 am »

Quote from: wek on August 07, 2022, 11:22:25 pm

> [...] when the receiving end acts like a shift register with NSS as latch enable,
> like many displays.

The gc9a01 driver mentioned by OP does not appear to behave in this way. Can you point me to such display/driver, please?

JW

RA8875

Nominal Animal · « **Reply #17 on:** August 08, 2022, 08:01:33 am »

I think OP is too worried about the SPI data transfer rate, and not worried enough about exactly what they want to display, and especially how to draw/construct that in the first place.

STM32F0 series does not have much RAM at all, up to 32k or so. That is not enough for any kind of 240×240 framebuffer, except maybe monochrome (7200 bytes). Unless they intend to stream the data somehow, I suspect they will end up having to use pre-defined tiles to construct the image. (Pixel art!)

According to the GC9A01A datasheet I found at BuyDisplay (they use it for 240x240 round modules), the 4-wire SPI interface is quite forgiving. Command bytes will need special handling (of the separate Data/Command bit), but you can just DMA your pixel data. In fact, if you always transfer full frames, you only need to do a few commands to setup the display, and then just keep sending data.

When using the four-wire interface, the GC9A01A supports R4G4B4 (12-bit), R5G6B5 (16-bit), and R6G6B6 (18-bit, but with 2 unused bits per component, in 24-bit per pixel) color formats.

If you use the 12-bit format, each set of three bytes describes two pixels (with 4096 possible colors), and you need 86400 bytes for a 240x240 framebuffer. The first byte contains the red and green components of the first pixel, the second byte contains the blue component of the first pixel and the red component of the second pixel, and the third byte contains the green and blue components of the second pixel. In other words, you'll have to treat odd and even pixels differently in your framebuffer, making pixmap blitting operations a bit complicated, unless you align them at even pixel boundaries. But, at 16 MHz SPI with DMA, you can get over 23 FPS.
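
As a rough illustration of that byte layout, here is a small Python sketch that packs two 12-bit pixels into three bytes. The function name is made up, and putting the first-listed component in the high nibble of each byte is my assumption; verify it against the datasheet before relying on it.

```python
def pack_rgb444_pair(p0, p1):
    """Pack two (r, g, b) pixels with 4-bit components into three bytes."""
    r0, g0, b0 = p0
    r1, g1, b1 = p1
    return bytes([
        (r0 << 4) | g0,  # byte 0: red and green of the first pixel
        (b0 << 4) | r1,  # byte 1: blue of the first, red of the second
        (g1 << 4) | b1,  # byte 2: green and blue of the second pixel
    ])


print(pack_rgb444_pair((0xA, 0xB, 0xC), (0x1, 0x2, 0x3)).hex())  # prints abc123
print(240 * 240 * 3 // 2)  # framebuffer size in bytes: prints 86400
```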

You can also save memory, and use an indexed color framebuffer. An 8-bit/256-color one needs 57600 bytes (with 512 or 768 bytes for the color lookup).
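
A sketch of what the lookup step could look like, in Python. The names are made up, and it assumes a 256-entry palette of 16-bit R5G6B5 values (512 bytes) sent high byte first.

```python
def expand_indexed(indices, palette565):
    """Turn 8-bit palette indices into the 16-bit pixel bytes to be sent."""
    out = bytearray()
    for index in indices:
        out += palette565[index].to_bytes(2, "big")
    return bytes(out)


# Toy palette: entry 0 is black, entry 1 is white, the rest are unused here.
palette = [0x0000, 0xFFFF] + [0x0000] * 254
print(expand_indexed([0, 1, 1], palette).hex())  # prints 0000ffffffff
```

On the real part you would expand one row (or a few rows) at a time into a small buffer, so the full 16-bit frame never has to exist in RAM.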

If you use the 16-bit format, each 16-bit unit describes a pixel (or each 32-bit unit two pixels), and you need 115200 bytes for a 240x240 framebuffer. At 16 MHz SPI with DMA, you should be able to reach 17 FPS.

With the R6G6B6 format, at 16 MHz SPI clock, you should be able to reach about 11 FPS.
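
The three frame rates can be reproduced with a few lines of Python. This is only the raw SPI limit, assuming continuous clocking at 16 MHz with no gaps and no command overhead; the constant and function names are made up.

```python
SPI_HZ = 16_000_000
PIXELS = 240 * 240

# Bits actually sent per pixel in each format (R6G6B6 travels as 24 bits).
FORMATS = {"R4G4B4": 12, "R5G6B5": 16, "R6G6B6": 24}


def max_fps(bits_per_pixel):
    return SPI_HZ / (PIXELS * bits_per_pixel)


for name, bits in FORMATS.items():
    print(f"{name}: {PIXELS * bits // 8} bytes per frame, {max_fps(bits):.1f} FPS")
# prints:
# R4G4B4: 86400 bytes per frame, 23.1 FPS
# R5G6B5: 115200 bytes per frame, 17.4 FPS
# R6G6B6: 172800 bytes per frame, 11.6 FPS
```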

Note that GC9A01A does support rectangular updates using three commands: A column address command (0x2A) with four data bytes to define the column range to be updated, a row address command (0x2B) with four data bytes to define the row range to be updated, followed by a memory write command (0x2C), followed by the pixel data. This means eleven bytes overhead per rectangular area. And you will have to redefine all pixels in that area, there is no way to leave some unmodified.
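
To make the eleven bytes concrete, here is a Python sketch that builds the three commands for a rectangle. The constant names are just labels I picked, and the layout of the four data bytes (16-bit start then 16-bit end, high byte first, end inclusive) is an assumption to confirm in the datasheet. Each command byte still needs the D/C line handled separately from its data bytes.

```python
CASET = 0x2A  # column address command
RASET = 0x2B  # row address command
RAMWR = 0x2C  # memory write command


def window_commands(x0, y0, x1, y1):
    """Return (command, data) pairs that select the rectangle x0..x1, y0..y1."""
    return [
        (CASET, x0.to_bytes(2, "big") + x1.to_bytes(2, "big")),
        (RASET, y0.to_bytes(2, "big") + y1.to_bytes(2, "big")),
        (RAMWR, b""),  # pixel data follows this command
    ]


cmds = window_commands(16, 32, 23, 39)  # one 8x8 area
overhead = sum(1 + len(data) for _, data in cmds)
print(overhead)  # prints 11
```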

If you use 8×8 tiles, you get a map of 30×30 tiles, typically with 256 possible tiles. This map only takes 900 bytes of RAM, and the tile data (which you can put in Flash) up to 24576 bytes (12-bit color), 32768 bytes (16-bit color), or 49152 bytes (18-bit color). If you intend to display text, you'll probably want to reserve most of the tiles for possible letters.
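
The memory figures above come out of this small Python calculation. It assumes one byte per map entry (hence at most 256 distinct tiles) and square tiles; the function name is made up.

```python
def tile_budget(display=240, tile=8, n_tiles=256, bits_per_pixel=16):
    """Return (map bytes in RAM, tile data bytes in Flash)."""
    tiles_per_side = display // tile
    map_bytes = tiles_per_side * tiles_per_side  # one index byte per tile
    tile_bytes = n_tiles * tile * tile * bits_per_pixel // 8
    return map_bytes, tile_bytes


for bits in (12, 16, 24):
    print(bits, tile_budget(bits_per_pixel=bits))
# prints:
# 12 (900, 24576)
# 16 (900, 32768)
# 24 (900, 49152)
```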

If you use 16-bit color, and you have enough Flash to store each possible pixmap or tile or glyph you want to display, you can implement a rasterizer that regenerates the entire frame whenever needed, with only the location of each tile or glyph (plus size, and a pointer to the data in Flash) in RAM. You really only need a single row/column buffer, to which you draw the glyphs/tiles. That is 480 bytes; double or triple that if you want to use DMA, and you do, so you can render and transfer the data at the same time. You probably have enough time (while DMA'ing the previous scan line) to use even indexed-color pixmaps/tiles, reducing the amount of Flash needed. In that case, using fixed-width tiles, say 8 pixels wide, will make things much easier.

A useful trick is to extend the scan buffer by that width before and after, so you can display partial tiles as well. Note that in this case, the tiles (or "sprites") can be transparent. Partial transparency is possible, but likely too slow on STM32F0, especially if it has a 32-cycle (instead of 1-cycle) multiplication. After all, there are only about 11520 cycles at 48 MHz to construct each scan line, if using 16 bit color and 16 MHz SPI clock. (Of course, if some scan line takes longer to construct, it just slows down the display update from the optimum.)
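
A bare-bones Python sketch of the scan line idea, assuming a 30×30 map of 8×8 tiles that are aligned to the tile grid (no sprites, no transparency, no partial tiles). All names are made up; on the MCU the tile data would sit in Flash and the line buffer would be handed to DMA.

```python
TILE = 8
WIDTH = 240


def render_scanline(tile_map, tiles, y):
    """Build the 240 pixel values of row y from grid-aligned 8x8 tiles."""
    map_row = tile_map[y // TILE]
    row_in_tile = y % TILE
    line = []
    for tile_index in map_row:
        line.extend(tiles[tile_index][row_in_tile])
    return line


def cycles_per_scanline(cpu_hz=48_000_000, spi_hz=16_000_000, bits_per_pixel=16):
    """CPU cycles available while one scan line is being clocked out."""
    return cpu_hz * WIDTH * bits_per_pixel // spi_hz


black = [[0x0000] * TILE for _ in range(TILE)]
white = [[0xFFFF] * TILE for _ in range(TILE)]
tiles = [black, white]
# Checkerboard: alternate the two tiles across and down the map.
tile_map = [[(col + row) % 2 for col in range(WIDTH // TILE)]
            for row in range(WIDTH // TILE)]

line = render_scanline(tile_map, tiles, 0)
print(len(line), cycles_per_scanline())  # prints 240 11520
```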

Quote from: langwadt on August 08, 2022, 07:06:52 am

RA8875

When using the parallel interface, the /CS pulse is only needed after each command, not between data. Not sure about the SPI or I²C interfaces, though. (See datasheet figures 6-30 and 6-31 on pages 72 and 73.)

Even on the GC9A01A the command bytes do seem to need special care: the D/C (Data or /Command) line is recommended to be pulled high during the second-to-last bit in the command byte when data follows, although any falling edge of the clock during the first seven bits of the command should also work (as the D/C is supposed to be sampled by the GC9A01A during the falling edge of the first clock; the data line being sampled at the rising edges of the clock line). The data, however, can be streamed indefinitely, as long as it is paused (with /CS high) only between bytes/pixels; jitter (or delays between bytes) in the clock should not affect anything. I'm not absolutely certain of this, though, because I don't currently have any RA8875 or GC9A01A display modules to check.

Siwastaja · « **Reply #18 on:** August 08, 2022, 12:35:34 pm »

Quote from: wek on August 07, 2022, 05:21:52 pm

Quote from: Siwastaja on August 07, 2022, 03:12:41 pm
[...] a quick glance at RM0091 shows even the F0 series has hardware nSS management on the SPI peripheral and 32-bit FIFOs allowing saving on memory bus cycles with DMA
What exactly would you expect from "hardware nSS management" and how would that allow "saving on memory bus cycles with DMA"?

The STM32 NSS probably does not do what you expect (maybe in 'H7, which has an exorbitantly complex SPI I am not familiar with), but I am still curious what would be your expectation, exactly - maybe comparing to other mcus' SPI framing implementation.

JW

I expect "hardware nSS" to do what the "new feature" link on that page says - toggle the nCS pin to delimit 8- or 16-bit transactions. This is assuming user needs to delimit 8- or 16-bit transactions. If no, and if the display accepts longer packets, then toggling nCS in software is only minor overhead anyway; definitely not an "order of magnitude" thing.

Hardware nSS management has nothing to do with saving DMA bus cycles. Please read more carefully before replying.

32-bit FIFO, assuming it actually works and is not only a datasheet decoration, saves on DMA bus cycles because the DMA peripheral is no magic: it accesses the SPI peripheral through the same AHB/APB bus as the CPU would, and there are other resources on the same bus, notably RAM. If the DMA is able to do 32-bit transfers, then it needs only a quarter of the transfers compared to doing 8-bit transfers. Yet the bus cycle is fully "wasted" even if you only transfer 8 bits at once. This is a problem if and when the CPU also has to access RAM on the same bus: the CPU and/or DMA will see stall cycles, slowing down either or both.

DMA transfers that match the bus width hence save time as the resources do not need to arbitrate and wait for their turns nearly as much. But this requires buffering in the peripheral - called FIFO - because the actual SPI data register is narrower than the memory bus width.
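
To put numbers on it, a quick Python count of DMA transfers for one 16-bit 240x240 frame (115200 bytes) at each width. This is pure arithmetic; whether a given width is actually usable on the F0 SPI/DMA combination is exactly the thing that needs testing. The function name is made up.

```python
def dma_transfers(frame_bytes, transfer_bits):
    """Number of DMA bus accesses needed to move one frame."""
    bytes_per_transfer = transfer_bits // 8
    return -(-frame_bytes // bytes_per_transfer)  # ceiling division


frame_bytes = 240 * 240 * 2
for width in (8, 16, 32):
    print(f"{width}-bit: {dma_transfers(frame_bytes, width)} transfers per frame")
# prints:
# 8-bit: 115200 transfers per frame
# 16-bit: 57600 transfers per frame
# 32-bit: 28800 transfers per frame
```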

Again, whether any of this works needs to be tested. It is always a bad idea to expect STM32 devices to be designed in the way that seems logical and right from reading the datasheet or marketing material. Prepare for the worst case, which means toggling nCS in software and only using 8- or 16-bit DMA transfers.

Hope this helps.

wek · « **Reply #19 on:** August 08, 2022, 09:22:00 pm »

Quote from: Siwastaja

I expect "hardware nSS" to do what the "new feature" link on that page says - toggle the nCS pin to delimit 8- or 16-bit transactions.

OK. As I've said above, I don't think this is very useful with display drivers. I can imagine other scenarios where this would be useful but never came across such.

Lack of multi-frame framing or any hardware mechanism supporting it is annoying not because it would decrease performance or whatever, but because it needs to be written - and it involves both checking DMA and SPI being finished, and making sure all timing is right (which is not entirely trivial).

Quote from: Siwastaja

Hardware nSS management has nothing to do with saving DMA bus cycles. Please read more carefully before replying.

OK, I misunderstood that.

Actually, it's a tad bit more complicated with the 'F0. The DMA in it cannot pack/unpack, and the SPI module is in fact 16-bit (vast majority of IPs in the earlier STM32 are - after all, those millions of transistors are not THAT cheap) so even if the FIFO is 32-bit, you cannot fill it in 32-bit chunks. It still helps, though.

JW

wek · « **Reply #20 on:** August 08, 2022, 09:32:47 pm »

Quote from: langwadt on August 08, 2022, 07:06:52 am

Quote from: wek on August 07, 2022, 11:22:25 pm
> [...] when the receiving end acts like a shift register with NSS as latch enable,
> like many displays.

The gc9a01 driver mentioned by OP does not appear to behave in this way. Can you point me to such display/driver, please?

JW

RA8875

Thanks.

While the DS indicates framing, given all transfers are exactly 16 bits (at least according to the description), I am willing to bet that it is enough to have one negative edge before the very first transfer, and then just go on without touching that signal ever.

(Although I may be quite well losing this bet. I already had an encounter with the RA8875: in parallel setting, read does not set the WAIT signal, a pretty annoying thing as it forces one to go for the worst possible timing in each and every read cycle).

Nonetheless, the SPI protocol in this chip is so idiotic (wasting 6 clocks in every transaction, and - if the description is correct - lacking continuous data modes) that one or two extra cycles spent to framing is almost negligible, especially given that this is intended for mid-range displays (VGA/WVGA).

JW

mikeselectricstuff · « **Reply #21 on:** August 09, 2022, 09:42:07 am »

Quote from: DavidAlfa on August 02, 2022, 10:16:10 pm

You can also just use the display in 16-bit mode (rgb565) if not requiring the higher color mode, will save a lot of bandwidth and processing.

Depending on how you are generating the images, it may be possible to use dithering to convert 24-bit images to 565. For colour photo type content, a dithered 565 image is almost indistinguishable from true 24-bit. You may also find that even with a 24-bit interface the LCD controller may only actually be displaying 18 bits.
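
One simple way to do this is ordered (Bayer) dithering, sketched below in Python. This is just one option among several (error diffusion is another) and not necessarily the best for every image; the names are made up. Each pixel gets a position-dependent bump before the low bits are cut off, so over a 4x4 block the average lands closer to the original 8-bit value.

```python
BAYER4 = [
    [0, 8, 2, 10],
    [12, 4, 14, 6],
    [3, 11, 1, 9],
    [15, 7, 13, 5],
]


def _quantize(value, drop_bits, threshold):
    """Add a 0..step-1 bump scaled from the threshold, then drop the low bits."""
    step = 1 << drop_bits
    bumped = min(255, value + (threshold * step) // 16)
    return bumped >> drop_bits


def dither_to_rgb565(r, g, b, x, y):
    """Convert one 8-bit-per-channel pixel at (x, y) to a dithered R5G6B5 value."""
    t = BAYER4[y & 3][x & 3]
    r5 = _quantize(r, 3, t)
    g6 = _quantize(g, 2, t)
    b5 = _quantize(b, 3, t)
    return (r5 << 11) | (g6 << 5) | b5


# Red level 4 sits halfway between the 5-bit levels 0 and 1, so about half
# of the pixels in a 4x4 block should round up.
reds = [dither_to_rgb565(4, 0, 0, x, y) >> 11 for y in range(4) for x in range(4)]
print(sum(reds), "of", len(reds), "pixels rounded up")
```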

langwadt · « **Reply #22 on:** August 10, 2022, 10:41:48 pm »

Quote from: wek on August 08, 2022, 09:32:47 pm

Quote from: langwadt on August 08, 2022, 07:06:52 am
Quote from: wek on August 07, 2022, 11:22:25 pm
> [...] when the receiving end acts like a shift register with NSS as latch enable,
> like many displays.

The gc9a01 driver mentioned by OP does not appear to behave in this way. Can you point me to such display/driver, please?

JW

RA8875
Thanks.

While the DS indicates framing, given all transfers are exactly 16 bits (at least according to the description), I am willing to bet that it is enough to have one negative edge before the very first transfer, and then just go on without touching that signal ever.

(Although I may be quite well losing this bet. I already had an encounter with the RA8875: in parallel setting, read does not set the WAIT signal, a pretty annoying thing as it forces one to go for the worst possible timing in each and every read cycle).

Nonetheless, the SPI protocol in this chip is so idiotic (wasting 6 clocks in every transaction, and - if the description is correct - lacking continuous data modes) that one or two extra cycles spent to framing is almost negligible, especially given that this is intended for mid-range displays (VGA/WVGA).

JW

I was about to write that you lost the bet, because I had tried it without success some time ago, but then I decided to try again, and wouldn't you know it,
it does work once you find the quirks.

can't use 16 bit spi (which seems like the obvious choice reading the datasheet) because the first _byte_ of a stream has to be zero or the data will be ignored, and the chip gets picky about spi clock speed

once that was sorted and the clock speed not too close to the limit, it works, even with DMA

wek · « **Reply #23 on:** August 11, 2022, 06:18:17 am »

Thanks for all this work and sharing the findings.

JW

