Author Topic: Any 32-bit microcontroller that can unpack 12-bit to 16-bit during DMA?  (Read 3491 times)


Offline axonometricTopic starter

  • Newbie
  • Posts: 3
  • Country: us
STM32F4, and probably others I haven't looked at, can unpack SPI data during DMA transfer. That is, each 8-bit chunk is expanded to a 16-bit or 32-bit word in memory (and similarly 16-bit SPI data to 32-bit words), via the DMA_SxCR.PSIZE and DMA_SxCR.MSIZE bits.

Are there any 32-bit micros that can take 12-bit chunks from the SPI data stream and unpack them to 16-bit chunks in memory, without hogging the CPU?

Non-DMA solutions are acceptable too. I suspect the RP2040 may be able to achieve this via its programmable I/O (PIO) blocks?

Thanks,
Ken
 
 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4376
  • Country: nz
"unpack" in what way? Zero extend? Sign extend? Convert groups of three bits into nybbles/hex digits? Something else?

If you want to zero extend then an M3 (or greater) can expand blocks of 12 bytes (8 values) to 16 bytes in at worst 23 instructions plus 3 for loop control. There might be a better way to formulate this, but I did it two different ways and the compiler optimises both to the same code (the second function just jumps to the first one):

https://godbolt.org/z/18McPe

Depending on the data rate, and the size of the data, that might not be hogging the CPU too much.
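For anyone who doesn't want to click through, a plain-C sketch of that zero-extend unpack, working on 3 bytes (two values) at a time. The MSB-first bit order is an assumption -- flip the shifts if your silicon packs LSB-first:

```c
#include <stdint.h>
#include <stddef.h>

/* Unpack pairs of 12-bit values (3 bytes per pair) into zero-extended
 * 16-bit words. Assumes the first value occupies the high bits of the
 * stream (MSB-first packing). */
static void unpack12(const uint8_t *src, uint16_t *dst, size_t pairs)
{
    for (size_t i = 0; i < pairs; i++) {
        uint8_t b0 = src[0], b1 = src[1], b2 = src[2];
        dst[0] = (uint16_t)((b0 << 4) | (b1 >> 4));
        dst[1] = (uint16_t)(((b1 & 0x0F) << 8) | b2);
        src += 3;
        dst += 2;
    }
}
```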

« Last Edit: March 02, 2021, 11:28:31 am by brucehoult »
 
The following users thanked this post: I wanted a rude username, harerod

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15127
  • Country: fr
Yeah, the OP needs to state exactly how they want their 12-bit chunks to be "unpacked".
They also need to explain exactly how the data is streamed over SPI. Is it, for instance, 12-bit samples from some ADC? If so, most SPI ADCs with a sample width that isn't a multiple of a byte have a built-in way of handling this, as long as you use 16 clock pulses to clock the data in. (But not many MCUs support SPI transfers in 12-bit words anyway, so you don't really have a choice?)

We definitely need more details.
 

Offline axonometricTopic starter

  • Newbie
  • Posts: 3
  • Country: us
Thanks for the answers.

Zero extend and sign extend are both acceptable.

I'm dealing with custom silicon that outputs 4800-bit packets, i.e. 400 12-bit chunks per packet, at ~50 Mbit/s. The packing is done to maximize data rate.

I agree with you that there probably isn't an MCU capable of this, but I wanted to ask in case there's one I'm not aware of. I imagine I'll end up with a solution similar to the one brucehoult suggested. It would just be nice if this could be done with DMA, as it would reduce the demand on the CPU.
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15127
  • Country: fr
Zero extend and sign extend are both acceptable.

I'm dealing with custom silicon that outputs 4800-bit packets, i.e. 400 12-bit chunks per packet, at ~50 Mbit/s. The packing is done to maximize data rate.

I see. Just going to suggest something: transfer the whole packet through DMA as-is. 400 12-bit words => 600 bytes, or 150 32-bit words. Of course the data will be "packed" in the buffer. Unpack the whole buffer when needed (using whatever method is efficient for that). This way you can still use DMA transfers, which is much more efficient than transferring word by word and unpacking on the fly.

Note that the smallest number of 12-bit words that fits in an integer number of 16-bit words is 4:
four 12-bit words -> three 16-bit words. So your algorithm can work on blocks of three 16-bit words from your DMA buffer and expand them into four zero-extended 16-bit words.
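A sketch of that block expansion in C, under the assumption that the first sample occupies the high bits of the first packed 16-bit word (adjust for your device's actual bit order):

```c
#include <stdint.h>
#include <stddef.h>

/* Expand groups of three packed 16-bit words (48 bits = four 12-bit
 * samples) into four zero-extended 16-bit words. MSB-first packing
 * order is an assumption. */
static void unpack_block(const uint16_t *src, uint16_t *dst, size_t groups)
{
    for (size_t i = 0; i < groups; i++) {
        uint16_t w0 = src[0], w1 = src[1], w2 = src[2];
        dst[0] = (uint16_t)(w0 >> 4);
        dst[1] = (uint16_t)(((w0 & 0x000F) << 8) | (w1 >> 8));
        dst[2] = (uint16_t)(((w1 & 0x00FF) << 4) | (w2 >> 12));
        dst[3] = (uint16_t)(w2 & 0x0FFF);
        src += 3;
        dst += 4;
    }
}
```

For the OP's 400-sample packet, `groups` would be 100 per buffer.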


 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4376
  • Country: nz
I see. Just going to suggest something: transfer the whole packet through DMA as-is. 400 12-bit words => 600 bytes, or 150 32-bit words. Of course the data will be "packed" in the buffer. Unpack the whole buffer when needed (using whatever method is efficient for that). This way you can still use DMA transfers, which is much more efficient than transferring word by word and unpacking on the fly.

Note that the smallest number of 12-bit words that fits in an integer number of 16-bit words is 4:
four 12-bit words -> three 16-bit words. So your algorithm can work on blocks of three 16-bit words from your DMA buffer and expand them into four zero-extended 16-bit words.

I'm guessing you didn't follow my godbolt link :-)
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3237
  • Country: ca
I don't understand why you would need this. Just read the whole 4800-bit as if it was 150 32-bit numbers.

When you start processing, it'll take only 3 instructions (read, shift and mask) to get each 16-bit number from memory. Whatever you're going to do with the numbers, the processing will take much longer than the extraction.
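That read/shift/mask extraction might look like this in C (LSB-first packing assumed here; byte-wise reads sidestep alignment issues):

```c
#include <stdint.h>
#include <stddef.h>

/* Fetch the i-th 12-bit value from a packed byte buffer. Each value
 * starts at bit offset 12*i; LSB-first bit order within the stream is
 * an assumption. The two bytes read always cover the 12 bits, since a
 * value never spans more than two bytes. */
static uint16_t get12(const uint8_t *buf, size_t i)
{
    size_t bit = 12 * i;
    size_t byte = bit / 8;
    unsigned shift = bit % 8;  /* always 0 or 4 */
    uint32_t w = (uint32_t)buf[byte] | ((uint32_t)buf[byte + 1] << 8);
    return (uint16_t)((w >> shift) & 0x0FFF);
}
```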
 

Offline errorprone

  • Contributor
  • Posts: 39
The LPC54000 series lets you select SPI transfers of 12 bits, but I think it can only reach 50 MHz as a master. Another option is a couple of shift registers, such as the SN74AHC594PWR, if you can generate the right strobes and load them through GPIO or a memory controller.
 
The following users thanked this post: axonometric

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4376
  • Country: nz
When you start processing, it'll take only 3 instructions (read, shift and mask) to get each 16-bit number from memory.

If you're doing them individually, 2 out of every 8 will require two loads, two shifts, an OR and a mask.

Unless you're using a processor that supports unaligned accesses.

It's a pity ARM doesn't have a shift or rotate instruction with two source registers. Actually, I can't think of anything that does, in the integer instruction set. PowerPC has one in the SIMD ISA. We're thinking about adding one in the RISC-V BitManip extension.
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3237
  • Country: ca
When you start processing, it'll take only 3 instructions (read, shift and mask) to get each 16-bit number from memory.

If you're doing them individually, 2 out of every 8 will require two loads, two shifts, an OR and a mask.

Yes, but in return you don't need to store it to memory, and you don't need to load it back when processing. That's a much larger saving. Besides, you save some shifts too.

Unless you're using a processor that supports unaligned accesses.

With unaligned access you only need shifts on every second pass, although avoiding the shifts might be more effort than doing them.

It's a pity ARM doesn't have a shift or rotate instruction with two source registers. Actually, I can't think of anything that does, in the integer instruction set. PowerPC has one in the SIMD ISA. We're thinking about adding one in the RISC-V BitManip extension.

x86 obviously.
 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4376
  • Country: nz
It's a pity ARM doesn't have a shift or rotate instruction with two source registers. Actually, I can't think of anything that does, in the integer instruction set. PowerPC has one in the SIMD ISA. We're thinking about adding one in the RISC-V BitManip extension.

x86 obviously.

Really? I'm not all that familiar with x86. I tried compiling my function(s) for x86 and nothing that looked like a double-wide shift or rotate showed up in the assembly language.

What's the instruction called?
 

Offline newbrain

  • Super Contributor
  • ***
  • Posts: 1754
  • Country: se
The SPI in STM32F0 and STM32F7 series (and probably others, but not the F4) can be programmed for any transfer size from 4 to 16 bits.
Data will be right aligned in the DR (data register) and of course DMA is available.

See, for example, chapter 32 of the STM32F7x6 Reference Manual

As said, this is also possible on the F0 series, but given the speeds involved it won't cut it.
Maximum clock frequency for the SPI in STM32F7x6 is 54 MHz.
Nandemo wa shiranai wa yo, shitteru koto dake.
 
The following users thanked this post: axonometric

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4376
  • Country: nz
The SPI in STM32F0 and STM32F7 series (and probably others, but not the F4) can be programmed for any transfer size from 4 to 16 bits.
Data will be right aligned in the DR (data register) and of course DMA is available.

Is that with multiple values back to back in the same data packet?
 

Offline newbrain

  • Super Contributor
  • ***
  • Posts: 1754
  • Country: se
Is that with multiple values back to back in the same data packet?
That's my understanding (and I seem to remember using it for an oddball peripheral), see chapter 32.5.8, specifically the Data Packing and the Communication diagrams sections.

Any other behaviour would be (IMO) pretty much pointless, but I would not be overly surprised...
Nandemo wa shiranai wa yo, shitteru koto dake.
 

Offline Psi

  • Super Contributor
  • ***
  • Posts: 10180
  • Country: nz
You could have a look at the nRF52; I've heard the DMA on that is pretty advanced, so it can likely do what you need.
And you get Bluetooth thrown in :)
« Last Edit: March 03, 2021, 11:34:31 am by Psi »
Greek letter 'Psi' (not Pounds per Square Inch)
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3237
  • Country: ca
What's the instruction called?

SHLD and SHRD.
 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4376
  • Country: nz
What's the instruction called?

SHLD and SHRD.

Aha.

According to Agner Fog's instruction tables, SHLD and SHRD with reg/reg/imm have 3 or 4 cycles of latency all the way through to Coffee Lake (and as much as 6 on some older CPUs). For some unknown reason, SHRD takes one cycle more than SHLD on many CPUs.

Since you can in the worst case do SHL;SHR;OR in 3 clock cycles -- and usually 2 on anything since the original Pentium -- there is no reason to ever use SHLD or SHRD if you are optimising for speed, and so compilers don't generate them.
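That SHL;SHR;OR composition in C, for reference -- a "funnel shift" pulling a 32-bit window out of a register pair:

```c
#include <stdint.h>

/* Extract bits [n .. n+31] of the 64-bit value hi:lo -- the effect of
 * x86 SHRD, built from two plain shifts and an OR. Requires 0 < n < 32
 * (shifting a 32-bit value by 32 is undefined in C). */
static uint32_t funnel_shr(uint32_t hi, uint32_t lo, unsigned n)
{
    return (lo >> n) | (hi << (32 - n));
}
```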
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3237
  • Country: ca
Since you can in the worst case do SHL;SHR;OR in 3 clock cycles -- and usually 2 on anything since the original Pentium -- there is no reason to ever use SHLD or SHRD if you are optimising for speed, and so compilers don't generate them.

If you wanted to do this efficiently, you would probably do it with SSE2 anyway. SHLD and SHRD are remnants of the past.
 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4376
  • Country: nz
Since you can in the worst case do SHL;SHR;OR in 3 clock cycles -- and usually 2 on anything since the original Pentium -- there is no reason to ever use SHLD or SHRD if you are optimising for speed, and so compilers don't generate them.

If you wanted to do this efficiently, you would probably do it with SSE2 anyway. SHLD and SHRD are remnants of the past.

We were talking about efficiently unpacking stuff on a microcontroller, so I've been assuming something like Cortex M3. You don't get things like SSE or NEON on those.
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15127
  • Country: fr
I see. Just going to suggest something: transfer the whole packet through DMA as-is. 400 12-bit words => 600 bytes, or 150 32-bit words. Of course the data will be "packed" in the buffer. Unpack the whole buffer when needed (using whatever method is efficient for that). This way you can still use DMA transfers, which is much more efficient than transferring word by word and unpacking on the fly.

Note that the smallest number of 12-bit words that fits in an integer number of 16-bit words is 4:
four 12-bit words -> three 16-bit words. So your algorithm can work on blocks of three 16-bit words from your DMA buffer and expand them into four zero-extended 16-bit words.

I'm guessing you didn't follow my godbolt link :-)

I didn't (and just did now), but I was assuming it was one efficient way of unpacking (which it is).

I wasn't suggesting any particular way of unpacking -- I just mentioned the minimum number of integer chunks you can act on. Which, by the way, can always be determined using the LCM (probably obvious to many, maybe not so much to those who aren't very math-inclined). For instance, here lcm(12, 16) = 48, so 48 bits is the smallest number of bits containing both an integer number (4) of 12-bit chunks and an integer number (3) of 16-bit chunks.

I merely mentioned that you can still use DMA for the SPI transfers instead of unpacking on the fly (which I assume is what the OP had in mind, since they said they didn't see a way of using DMA). Whatever method you use for unpacking, that is always going to be more efficient -- unless it causes latency issues, which I doubt, since the OP was willing to transfer through DMA in the first place.
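The LCM arithmetic above as a quick helper, nothing MCU-specific:

```c
/* lcm via gcd (Euclid): the smallest bit count holding a whole number
 * of both 12-bit and 16-bit chunks is lcm(12, 16) = 48. */
static unsigned gcd(unsigned a, unsigned b)
{
    while (b) { unsigned t = a % b; a = b; b = t; }
    return a;
}

static unsigned lcm(unsigned a, unsigned b)
{
    return a / gcd(a, b) * b;
}
```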
« Last Edit: March 03, 2021, 05:20:31 pm by SiliconWizard »
 

Offline mikeselectricstuff

  • Super Contributor
  • ***
  • Posts: 13914
  • Country: gb
    • Mike's Electric Stuff
I'm dealing with custom silicon that outputs 4800-bit packets, i.e. 400 12-bit chunks per packet, at ~50 Mbit/s. The packing is done to maximize data rate.
How frequently does the packet come in?
If it's not too frequent, software unpacking afterwards wouldn't be a big deal. If speed is crucial, a little assembly using LDM/STM, plenty of registers, and some shifts to do a chunk at a time and optimise memory access could be significantly faster than doing it in C.
Youtube channel:Taking wierd stuff apart. Very apart.
Mike's Electric Stuff: High voltage, vintage electronics etc.
Day Job: Mostly LEDs
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3237
  • Country: ca
We were talking about efficiently unpacking stuff on a microcontroller, so I've been assuming something like Cortex M3.

It certainly didn't look that way:

According to Agner Fog's instruction tables, SHLD and SHRD with reg/reg/imm have 3 or 4 cycles of latency all the way through to Coffee Lake (and as much as 6 on some older CPUs). For some unknown reason, SHRD takes one cycle more than SHLD on many CPUs.

Anyway, speaking of MCUs, the unpacking overhead is small, regardless of the CPU.
 

Online Siwastaja

  • Super Contributor
  • ***
  • Posts: 8608
  • Country: fi
The big question is, where and how is the data finally used?

Most of the time, doing all the processing "at once" is most efficient, i.e. where you are looping through all the items anyway, do everything there. Once you have loaded the data from memory into CPU registers, it usually doesn't matter whether your processing then takes 10, 11 or 15 instructions. Loads and stores are slow in comparison; if you have slower memory interfaces and need caches, the difference is even bigger.

"On-the-fly" processing makes sense when you are able to significantly "pack" the data, i.e., decrease the memory footprint. Unpacking on the fly sounds like the opposite, don't do it.

To me, it sounds like a good idea to keep the efficiently packed 12-bit data as long as possible and "unpack" it at the point of use.
 

Offline axonometricTopic starter

  • Newbie
  • Posts: 3
  • Country: us
Thanks for all the responses, especially errorprone and newbrain for the suggestions of LPC54000 and STM32F7, both of which support the unpacking I was looking for. Configurable SPI "data width" or "data frame width" seems to be the most common terminology used to describe this feature.

To address a few other points:

You could have a look at the nRF52

I checked the datasheet. Doesn't look like nRF52 supports configurable SPI data width.


How frequently does the packet come in?

One pretty much right after the other. Close to a constant 50 Mbit/s.


To me, it sounds a good idea to keep the efficiently packed 12-bit data as long as possible and "unpack" at use.

In most cases, I'd agree with you. Here, though, I need this MCU to sit between the data source, receiving packed 12-bit data via SPI, and the data processor, offloading unpacked 16-bit data via USB -- and neither of those can be changed. The MCU needs to do some other tasks too, which is why DMA is preferable (though not essential, given the clock speeds current MCUs can run at).
 

Offline mikeselectricstuff

  • Super Contributor
  • ***
  • Posts: 13914
  • Country: gb
    • Mike's Electric Stuff
Quote

How frequently does the packet come in?

One pretty much right after the other. Close to a constant 50 Mbit/s.

OK, in that case you use double buffering to DMA it into a pair of ping-pong buffers at about 4 Mwords/s.
You then have 96 µs to unpack the first buffer while the second is filling.
The above-mentioned STM32F7 runs at 216 MHz, so that's about 20,000 clock cycles to unpack 400 words.
Doesn't seem like that would be much of a problem.
It might be more efficient to DMA it in as 32-bit words, which should reduce the memory bandwidth load due to fewer accesses.
 
You probably could do it with interrupts as long as the SPI peripheral has a decent-sized FIFO; otherwise interrupt entry/exit overhead would chew up a lot of cycles. But DMA would be a lot more efficient, and most DMA peripherals support double buffering.
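A skeleton of the ping-pong logic: the DMA-complete interrupt flips the active buffer and flags the just-filled one for unpacking in the main loop. The HAL call to retarget the DMA stream is hypothetical -- substitute whatever your part's DMA controller actually uses:

```c
#include <stdint.h>
#include <stdbool.h>

#define PKT_BYTES 600  /* 4800 bits per packet, still packed */

static uint8_t buf[2][PKT_BYTES];
static volatile int  filling = 0;     /* buffer DMA is writing into */
static volatile bool ready   = false; /* other buffer is full       */

/* Called from the DMA transfer-complete interrupt. */
void dma_complete_isr(void)
{
    filling ^= 1;  /* swap: DMA now fills the other buffer */
    /* dma_set_target(buf[filling], PKT_BYTES);  <- hypothetical HAL call */
    ready = true;
}

/* Index of the buffer the main loop may safely unpack. */
int idle_buffer(void)
{
    return filling ^ 1;
}
```

The main loop then checks `ready`, unpacks `buf[idle_buffer()]` within the 96 µs window, and clears the flag.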
« Last Edit: March 04, 2021, 01:18:38 am by mikeselectricstuff »
Youtube channel:Taking wierd stuff apart. Very apart.
Mike's Electric Stuff: High voltage, vintage electronics etc.
Day Job: Mostly LEDs
 

