I think OP is too worried about the SPI data transfer rate, and not worried enough about exactly what they want to display, and especially how to draw/construct that in the first place.
STM32F0 series does not have much RAM at all, up to 32k or so. That is not enough for any kind of 240×240 framebuffer, except maybe monochrome (7200 bytes). Unless they intend to stream the data somehow, I suspect they will end up having to use pre-defined tiles to construct the image. (Pixel art!)
According to the GC9A01A datasheet I found at BuyDisplay (they use it for 240x240 round modules), the 4-wire SPI interface is quite forgiving. Command bytes will need special handling (of the separate Data/Command line), but you can just DMA your pixel data. In fact, if you always transfer full frames, you only need to send a few commands to set up the display, and then just keep sending data.
When using the four-wire interface, the GC9A01A supports R4G4B4 (12-bit), R5G6B5 (16-bit), and R6G6B6 (18-bit, but with 2 unused bits per component, in 24-bit per pixel) color formats.
If you use the 12-bit format, each set of three bytes describes two pixels (with 4096 possible colors), and you need 86400 bytes for a 240x240 framebuffer. The first byte contains the red and green components of the first pixel, the second byte contains the blue component of the first pixel and the red component of the second pixel, and the third byte contains the green and blue components of the second pixel. In other words, you'll have to treat odd and even pixels differently in your framebuffer, making pixmap blitting operations a bit complicated, unless you align them at even pixel boundaries. But, at 16 MHz SPI with DMA, you can get over 23 FPS.
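To illustrate the odd/even handling, here is a minimal sketch of a pixel store for the packed 12-bit layout described above. The names (fb444_set, FB_WIDTH, and so on) are made up for this example, and I am assuming the first-listed component sits in the high nibble of each byte. A full 86400-byte buffer will not fit in STM32F0 RAM, so the function takes a pointer and works just as well on a smaller band of rows.

```c
#include <stdint.h>
#include <stddef.h>

#define FB_WIDTH   240
#define FB_HEIGHT  240

/* Two 12-bit pixels share three bytes (RG, BR, GB): 86400 bytes per frame. */
#define FB_BYTES   ((size_t)FB_WIDTH * FB_HEIGHT * 3 / 2)

/* Store one pixel; r, g, b are 4-bit values (0..15).
   Assumption: the first-listed component is in the high nibble. */
static void fb444_set(uint8_t *fb, unsigned x, unsigned y,
                      uint8_t r, uint8_t g, uint8_t b)
{
    size_t   index = (size_t)y * FB_WIDTH + x;
    uint8_t *p     = fb + (index / 2) * 3;   /* start of the pixel pair */

    if (index & 1) {
        /* Odd pixel: low nibble of byte 1, then all of byte 2. */
        p[1] = (uint8_t)((p[1] & 0xF0) | (r & 0x0F));
        p[2] = (uint8_t)(((g & 0x0F) << 4) | (b & 0x0F));
    } else {
        /* Even pixel: all of byte 0, then high nibble of byte 1. */
        p[0] = (uint8_t)(((r & 0x0F) << 4) | (g & 0x0F));
        p[1] = (uint8_t)(((b & 0x0F) << 4) | (p[1] & 0x0F));
    }
}
```

Note the read-modify-write on the shared middle byte; that is the part that makes blitting at odd pixel boundaries slower.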
You can also save memory, and use an indexed color framebuffer. An 8-bit/256-color one needs 57600 bytes (with 512 or 768 bytes for the color lookup).
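With an indexed framebuffer, each line has to be expanded through the lookup table before it is sent. A minimal sketch, assuming a 256-entry table of 16-bit colors (the 512-byte case); the function name is made up for this example:

```c
#include <stdint.h>
#include <stddef.h>

/* Expand `count` 8-bit palette indices into 16-bit colors for transfer. */
static void expand_indexed_line(const uint8_t *indices,
                                const uint16_t *palette,   /* 256 entries */
                                uint16_t *out, size_t count)
{
    for (size_t i = 0; i < count; i++)
        out[i] = palette[indices[i]];
}
```

If the SPI peripheral sends bytes rather than 16-bit words, the table entries may need to be stored byte-swapped so that the high byte goes out first; check that against the datasheet and your SPI configuration.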
If you use the 16-bit format, each 16-bit unit describes a pixel (or each 32-bit unit two pixels), and you need 115200 bytes for a 240x240 framebuffer. At 16 MHz SPI with DMA, you should be able to reach 17 FPS.
With the R6G6B6 format, at 16 MHz SPI clock, you should be able to reach about 11 FPS.
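The frame rates above are just the SPI clock divided by the number of bits per frame. A small sketch of that arithmetic (fps_x100 is a made-up helper; it ignores command overhead and any gaps between DMA transfers, so real numbers will be slightly lower):

```c
#include <stdint.h>

/* Upper bound on frames per second, times 100, for a given SPI clock. */
static uint32_t fps_x100(uint32_t spi_hz, uint32_t pixels,
                         uint32_t bits_per_pixel)
{
    uint64_t bits_per_frame = (uint64_t)pixels * bits_per_pixel;
    return (uint32_t)(((uint64_t)spi_hz * 100u) / bits_per_frame);
}
```

For 240×240 pixels at 16 MHz this gives 2314 (23.14 FPS) at 12 bits per pixel, 1736 (17.36 FPS) at 16 bits, and 1157 (11.57 FPS) at 24 bits on the wire.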
Note that GC9A01A does support rectangular updates using three commands: A column address command (0x2A) with four data bytes to define the column range to be updated, a row address command (0x2B) with four data bytes to define the row range to be updated, followed by a memory write command (0x2C), followed by the pixel data. This means eleven bytes overhead per rectangular area. And you will have to redefine all pixels in that area, there is no way to leave some unmodified.
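Here is a sketch of the eleven overhead bytes for one rectangle. The helper name and the is_cmd flag array are made up for this example (the flags just tell your SPI code when to drive D/C low). I am assuming the start and end coordinates are inclusive and sent high byte first, which is how I read the datasheet.

```c
#include <stdint.h>
#include <stddef.h>

enum {
    CMD_COLUMN_ADDR  = 0x2A,
    CMD_ROW_ADDR     = 0x2B,
    CMD_MEMORY_WRITE = 0x2C
};

/* Fill out[] with the 11 bytes that select the rectangle x0..x1, y0..y1.
   is_cmd[i] is 1 for command bytes, 0 for parameter bytes. Returns the
   number of bytes written. Pixel data follows the final 0x2C. */
static size_t build_window(uint8_t out[11], uint8_t is_cmd[11],
                           uint16_t x0, uint16_t x1,
                           uint16_t y0, uint16_t y1)
{
    const uint8_t bytes[11] = {
        CMD_COLUMN_ADDR, (uint8_t)(x0 >> 8), (uint8_t)x0,
                         (uint8_t)(x1 >> 8), (uint8_t)x1,
        CMD_ROW_ADDR,    (uint8_t)(y0 >> 8), (uint8_t)y0,
                         (uint8_t)(y1 >> 8), (uint8_t)y1,
        CMD_MEMORY_WRITE
    };

    for (size_t i = 0; i < 11; i++) {
        out[i]    = bytes[i];
        is_cmd[i] = (uint8_t)(i == 0 || i == 5 || i == 10);
    }
    return 11;
}
```

For an 8×8 tile in 16-bit color that is 11 bytes of overhead for 128 bytes of pixel data, so small rectangular updates are still cheap compared to a full frame.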
If you use 8×8 tiles, you get a map of 30×30 tiles, typically with 256 possible tiles. This map only takes 900 bytes of RAM, and the tile data (which you can put in Flash) up to 24576 bytes (12-bit color), 32768 bytes (16-bit color), or 49152 bytes (18-bit color). If you intend to display text, you'll probably want to reserve most of the tiles for possible letters.
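As a sketch of how such a tile map is read, assuming 16-bit color tiles stored row by row (all names here are made up for the example): the 900-byte map lives in RAM, and the tile data can be a const array in Flash.

```c
#include <stdint.h>
#include <stddef.h>

#define TILE_SIZE    8
#define TILE_PIXELS  (TILE_SIZE * TILE_SIZE)
#define MAP_COLS     30                      /* 240 / 8 */
#define MAP_ROWS     30

#define MAP_BYTES    (MAP_COLS * MAP_ROWS)   /* 900 bytes of RAM */
#define TILES_BYTES  (256 * TILE_PIXELS * 2) /* 32768 bytes of Flash */

/* Color of screen pixel (x, y): look up the tile index in the map,
   then the pixel inside that tile. */
static uint16_t tile_pixel(const uint8_t *map, const uint16_t *tiles,
                           unsigned x, unsigned y)
{
    uint8_t tile = map[(y / TILE_SIZE) * MAP_COLS + (x / TILE_SIZE)];
    const uint16_t *data = tiles + (size_t)tile * TILE_PIXELS;

    return data[(y % TILE_SIZE) * TILE_SIZE + (x % TILE_SIZE)];
}
```

In practice you would copy whole 8-pixel tile rows instead of looking up single pixels, but the indexing is the same.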
If you use 16-bit color, and you have enough Flash to store each possible pixmap or tile or glyph you want to display, you can implement a rasterizer that regenerates the entire frame whenever needed, with only the location of each tile or glyph (plus size, and a pointer to the data in Flash) in RAM. You really only need a single row/column buffer (480 bytes; double or triple that if you want to use DMA – and you do, so you can render and transfer the data at the same time), to which you draw the glyphs/tiles. You probably have enough time (while DMA'ing the previous scan line) to use even indexed-color pixmaps/tiles, reducing the amount of Flash needed. In that case, using fixed-width tiles, say 8 pixels wide, will make things much easier. A useful trick is to extend the scan buffer by that width before and after, so you can display partial tiles as well. Note that in this case, the tiles (or "sprites") can be transparent. Partial transparency is possible, but likely too slow on STM32F0, especially if it has a 32-cycle (instead of 1-cycle) multiplication. After all, there are only about 11520 cycles at 48 MHz to construct each scan line, if using 16-bit color and 16 MHz SPI clock. (Of course, if some scan line takes longer to construct, it just slows down the display update from the optimum.)
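A minimal sketch of such a scan line rasterizer, with 8-pixel-wide indexed-color sprites and the padded line buffer. The sprite_t layout, the names, and the choice of palette index 0 as "transparent" are my assumptions for this example, not anything the display requires.

```c
#include <stdint.h>
#include <stddef.h>

#define LINE_WIDTH  240
#define SPRITE_W    8
#define LINE_PAD    SPRITE_W   /* slack before and after the visible part */
#define LINE_TOTAL  (LINE_PAD + LINE_WIDTH + LINE_PAD)

typedef struct {
    int16_t        x, y;     /* top left corner; may be partly off-screen */
    uint8_t        height;   /* rows */
    const uint8_t *pixels;   /* height * SPRITE_W indices; 0 = transparent */
} sprite_t;

/* Render scan line y into line[LINE_TOTAL]. Only the LINE_WIDTH pixels
   starting at line + LINE_PAD are sent to the display. */
static void render_line(uint16_t *line, int y, uint16_t background,
                        const sprite_t *sprites, size_t count,
                        const uint16_t *palette)
{
    for (size_t i = 0; i < LINE_TOTAL; i++)
        line[i] = background;

    for (size_t s = 0; s < count; s++) {
        const sprite_t *sp  = &sprites[s];
        int             row = y - sp->y;

        if (row < 0 || row >= sp->height)
            continue;   /* sprite does not touch this scan line */
        if (sp->x <= -SPRITE_W || sp->x >= LINE_WIDTH)
            continue;   /* entirely off-screen horizontally */

        /* Thanks to the padding, no per-pixel clipping is needed. */
        const uint8_t *src = sp->pixels + (size_t)row * SPRITE_W;
        uint16_t      *dst = line + LINE_PAD + sp->x;

        for (int i = 0; i < SPRITE_W; i++)
            if (src[i])
                dst[i] = palette[src[i]];
    }
}
```

With two or three such buffers, you render into one while DMA sends line + LINE_PAD of another. Later sprites in the array overwrite earlier ones, so array order doubles as draw order.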
RA8875
When using the parallel interface, the /CS pulse is only needed after each command, not between data. Not sure about the SPI or I²C interfaces, though. (See datasheet figures 6-30 and 6-31 on pages 72 and 73.)
Even on the GC9A01A the command bytes do seem to need special care: the D/C (Data or /Command) line is recommended to be pulled high during the second-to-last bit in the command byte when data follows, although any falling edge of the clock during the first seven bits of the command should also work (as the D/C is supposed to be sampled by the GC9A01A during the falling edge of the first clock; the data line being sampled at the rising edges of the clock line). The data, however, can be streamed indefinitely, as long as it is paused (with /CS high) only between bytes/pixels; jitter (or delays between bytes) in the clock should not affect anything. I'm not absolutely certain of this, though, because I don't currently have any RA8875 or GC9A01A display modules to check.