Yes, DMA can handle a transfer size of 1. You could speed up these single-byte accesses by not using DMA for them, but not by much, because configuring a DMA channel doesn't take many operations, especially if you don't use HAL for it. If performance is important:
* Only write the DMA control fields that need to change (maybe M0AR, probably NDTR)
* Instead of a read-modify-write on CR, do full writes. `DMA1_Stream0->CR = config; DMA1_Stream0->CR = config | 1UL;` is faster than `DMA1_Stream0->CR = config; DMA1_Stream0->CR |= 1UL;` because `|=` on a volatile register forces an extra read of the peripheral register before the write.
If the DMA config doesn't change except for NDTR, re-enabling takes just three writes: NDTR, IFCR, and a single write to CR to enable the channel.
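The three-write re-arm could look roughly like the sketch below. To keep it compilable off-target, it uses a hypothetical `dma_stream_regs_t` struct and a plain pointer for the flag-clear register; on real hardware you would use `DMA1_Stream0` and the matching flag-clear register from the CMSIS device header instead. The flag mask `0x3D` assumes the stream 0 bit layout, so check it against your reference manual. It also assumes the stream is already disabled, for example because the previous transfer has completed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the stream register block (assumption).
   On target, use DMA1_Stream0 from the CMSIS device header. */
typedef struct {
    volatile uint32_t CR;    /* stream configuration, bit 0 = EN */
    volatile uint32_t NDTR;  /* number of data items to transfer */
} dma_stream_regs_t;

#define DMA_CR_EN          (1UL << 0)
/* Clear bits for the stream 0 flags (FEIF, DMEIF, TEIF, HTIF, TCIF).
   Assumed layout; verify against the reference manual. */
#define DMA_STREAM0_FLAGS  0x3DUL

/* Re-arm a stream whose configuration is unchanged: three plain writes,
   no read-modify-write. 'config' is the cached CR value with EN cleared. */
static inline void dma_rearm(dma_stream_regs_t *stream,
                             volatile uint32_t *ifcr,
                             uint32_t config, uint16_t count)
{
    assert((config & DMA_CR_EN) == 0);  /* cached value must not include EN */
    assert(count > 0);

    stream->NDTR = count;               /* 1: transfer length */
    *ifcr = DMA_STREAM0_FLAGS;          /* 2: clear stale event flags */
    stream->CR = config | DMA_CR_EN;    /* 3: full write that sets EN */
}
```

Keeping `config` in a variable is what makes the full write possible: the CPU never has to read CR back from the peripheral.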
The time consumed is the same whether you read from 8-bit SPI and write a byte to memory, or read from 16-bit SPI and write a halfword to memory. Hence, the 16-bit version roughly doubles the performance. You may need to swap the bytes. I don't remember if M7 has a single instruction for this, but assuming you have optimizations enabled, it's not a big operation. Even if you spend 1-3 CPU cycles swapping the bytes, you still save a lot of time by halving all the other overhead (think about the interrupt entry latency of 12 cycles alone!).
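For the byte swap, a portable sketch is shown below. ARMv7-M (which Cortex-M7 implements) does have a `REV16` instruction, and CMSIS exposes it as `__REV16`; an optimizing compiler will usually recognize the shift-and-or pattern as well, though that is worth confirming in the disassembly. The names `swap16` and `swap16_buffer` are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Swap the two bytes of a halfword, e.g. 0x1234 -> 0x3412. */
static inline uint16_t swap16(uint16_t x)
{
    return (uint16_t)((x << 8) | (x >> 8));
}

/* Swap every halfword of a receive buffer in place. */
static inline void swap16_buffer(uint16_t *buf, size_t halfwords)
{
    assert(buf != NULL);
    for (size_t i = 0; i < halfwords; i++) {
        buf[i] = swap16(buf[i]);
    }
}
```

Whether you need the swap at all depends on the SPI bit order and on how the consumer interprets the data, so it may be possible to skip it entirely.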
You can make the buffer `uint16_t`, keep it `uint8_t` and cast the pointer to `uint16_t *`, or create two different access types through a union, whatever you prefer. In some solutions (like pointer casting), you need to add an alignment attribute to the definition, because `uint16_t` must be aligned to 2 bytes, while `uint8_t` can be arbitrarily aligned. Alternatively, you can write higher-level C code that does two writes to the `uint8_t[]` table and let the compiler optimize it. I'm not sure the compiler can tell whether the table is aligned, even with the aligned attribute, so this is not necessarily as fast.
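Two of these buffer layouts might be declared as in the following sketch, which assumes a C11 compiler. The buffer size and all names (`RX_BYTES`, `rx_buffer_t`, `rx_union`, `rx_bytes`, `is_halfword_aligned`) are hypothetical. The union gets halfword alignment automatically from its `uint16_t` member, while the plain byte array needs an explicit `alignas`.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

#define RX_BYTES 64u  /* hypothetical buffer size, must be even */

/* Option 1: union giving byte and halfword views of the same storage. */
typedef union {
    uint8_t  bytes[RX_BYTES];
    uint16_t halfwords[RX_BYTES / 2u];
} rx_buffer_t;

/* The union inherits the alignment of its strictest member. */
static_assert(alignof(rx_buffer_t) >= alignof(uint16_t),
              "union must be halfword aligned");

static rx_buffer_t rx_union;

/* Option 2: plain byte array forced to halfword alignment, so that its
   address is safe to hand to a DMA stream configured for 16-bit writes. */
static alignas(uint16_t) uint8_t rx_bytes[RX_BYTES];

/* Helper to check an address at run time, e.g. in a debug build. */
static inline int is_halfword_aligned(const void *p)
{
    return ((uintptr_t)p % alignof(uint16_t)) == 0;
}
```

The union is the safer choice if the CPU also reads the data as halfwords, because accessing a `uint8_t` array through a cast `uint16_t *` can run into strict-aliasing problems; for the DMA side only the address and its alignment matter.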