Author Topic: Any 32-bit microcontroller that can unpack 12-bit to 16-bit during DMA?  (Read 3491 times)


Offline axonometricTopic starter

  • Newbie
  • Posts: 3
  • Country: us
STM32F4, and probably others I haven't looked at, can unpack SPI data during DMA transfer. That is, each 8-bit chunk is expanded to a 16-bit or 32-bit word in memory (and similarly 16-bit SPI data to 32-bit words), via the DMA_SxCR.PSIZE and DMA_SxCR.MSIZE bits.

Are there any 32-bit micros that can take 12-bit chunks from the SPI data stream and unpack them to 16-bit chunks in memory, without hogging the CPU?

Non-DMA solutions are acceptable too. I suspect the RP2040 may be able to achieve this via its programmable I/O (PIO) blocks?

Thanks,
Ken
 
 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4376
  • Country: nz
"unpack" in what way? Zero extend? Sign extend? Convert groups of three bits into nybbles/hex digits? Something else?

If you want to zero extend then an M3 (or greater) can expand blocks of 12 bytes (8 values) to 16 bytes in at worst 23 instructions plus 3 for loop control. There might be a better way to formulate this, but I did it two different ways and the compiler optimises both to the same code (the second function just jumps to the first one):

https://godbolt.org/z/18McPe

Depending on the data rate, and the size of the data, that might not be hogging the CPU too much.
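For anyone who doesn't want to click through, a plain-C sketch of that zero-extend unpack, working on 3 bytes (two values) at a time. The MSB-first bit order is an assumption -- flip the shifts if your silicon packs LSB-first:

```c
#include <stdint.h>
#include <stddef.h>

/* Unpack pairs of 12-bit values (3 bytes per pair) into zero-extended
 * 16-bit words. Assumes the first value occupies the high bits of the
 * stream (MSB-first packing). */
static void unpack12(const uint8_t *src, uint16_t *dst, size_t pairs)
{
    for (size_t i = 0; i < pairs; i++) {
        uint8_t b0 = src[0], b1 = src[1], b2 = src[2];
        dst[0] = (uint16_t)((b0 << 4) | (b1 >> 4));
        dst[1] = (uint16_t)(((b1 & 0x0F) << 8) | b2);
        src += 3;
        dst += 2;
    }
}
```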

« Last Edit: March 02, 2021, 11:28:31 am by brucehoult »
 
The following users thanked this post: I wanted a rude username, harerod

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15127
  • Country: fr
Yeah, the OP needs to state exactly how they want their 12-bit chunks to be "unpacked".
They also need to explain exactly how the data is streamed over SPI. Is it, for instance, 12-bit samples from some ADC? If so, most SPI ADCs with a sample width that isn't a multiple of a byte have a built-in way of handling this, as long as you use 16 clock pulses to clock the data in. (But not many MCUs support SPI transfers in 12-bit words anyway, so you don't really have a choice?)

We definitely need more details.
 

Offline axonometricTopic starter

  • Newbie
  • Posts: 3
  • Country: us
Thanks for the answers.

Zero extend and sign extend are both acceptable.

I'm dealing with custom silicon that outputs 4800-bit packets, i.e. 400 12-bit chunks per packet, at ~50 Mbit/s. The packing is done to maximize data rate.

I agree with you that there probably isn't an MCU capable of this, but I wanted to ask in case there's one I'm not aware of. I imagine I'll end up with a solution similar to the one brucehoult suggested. It would just be nice if this could be done with DMA, as it would reduce the demand on the CPU.
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15127
  • Country: fr
Zero extend and sign extend are both acceptable.

I'm dealing with custom silicon that outputs 4800-bit packets, i.e. 400 12-bit chunks per packet, at ~50 Mbit/s. The packing is done to maximize data rate.

I see. Just going to suggest something: transfer the whole packet through DMA as-is. 400 12-bit words => 600 bytes, or 150 32-bit words. Of course the data will be "packed" in the buffer. Unpack the whole buffer when needed (using whatever method is efficient for that). This way you can still use DMA transfers, which is much more efficient than transferring word by word and unpacking on the fly.

Note that the smallest number of 12-bit words that fits in an integer number of 16-bit words is 4:
four 12-bit words -> three 16-bit words. So your algorithm can work on blocks of three 16-bit words from your DMA buffer and expand them into four zero-extended 16-bit words.
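A sketch of that block expansion in C, under the assumption that the first sample occupies the high bits of the first packed 16-bit word (adjust for your device's actual bit order):

```c
#include <stdint.h>
#include <stddef.h>

/* Expand groups of three packed 16-bit words (48 bits = four 12-bit
 * samples) into four zero-extended 16-bit words. MSB-first packing
 * order is an assumption. */
static void unpack_block(const uint16_t *src, uint16_t *dst, size_t groups)
{
    for (size_t i = 0; i < groups; i++) {
        uint16_t w0 = src[0], w1 = src[1], w2 = src[2];
        dst[0] = (uint16_t)(w0 >> 4);
        dst[1] = (uint16_t)(((w0 & 0x000F) << 8) | (w1 >> 8));
        dst[2] = (uint16_t)(((w1 & 0x00FF) << 4) | (w2 >> 12));
        dst[3] = (uint16_t)(w2 & 0x0FFF);
        src += 3;
        dst += 4;
    }
}
```

For the OP's 400-sample packet, `groups` would be 100 per buffer.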


 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4376
  • Country: nz
I see. Just going to suggest something: transfer the whole packet through DMA as-is. 400 12-bit words => 600 bytes, or 150 32-bit words. Of course the data will be "packed" in the buffer. Unpack the whole buffer when needed (using whatever method is efficient for that). This way you can still use DMA transfers, which is much more efficient than transferring word by word and unpacking on the fly.

Note that the smallest number of 12-bit words that fits in an integer number of 16-bit words is 4:
four 12-bit words -> three 16-bit words. So your algorithm can work on blocks of three 16-bit words from your DMA buffer and expand them into four zero-extended 16-bit words.

I'm guessing you didn't follow my godbolt link :-)
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3237
  • Country: ca
I don't understand why you would need this. Just read the whole 4800-bit as if it was 150 32-bit numbers.

When you start processing, it'll take only 3 instructions (read, shift and mask) to get each 16-bit number from memory. Whatever you're going to do with the numbers, the processing will take much longer than the extraction.
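That read/shift/mask extraction might look like this in C (LSB-first packing assumed here; byte-wise reads sidestep alignment issues):

```c
#include <stdint.h>
#include <stddef.h>

/* Fetch the i-th 12-bit value from a packed byte buffer. Each value
 * starts at bit offset 12*i; LSB-first bit order within the stream is
 * an assumption. The two bytes read always cover the 12 bits, since a
 * value never spans more than two bytes. */
static uint16_t get12(const uint8_t *buf, size_t i)
{
    size_t bit = 12 * i;
    size_t byte = bit / 8;
    unsigned shift = bit % 8;  /* always 0 or 4 */
    uint32_t w = (uint32_t)buf[byte] | ((uint32_t)buf[byte + 1] << 8);
    return (uint16_t)((w >> shift) & 0x0FFF);
}
```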
 

Offline errorprone

  • Contributor
  • Posts: 39
The LPC54000 series lets you select SPI transfers of 12 bits, but I think it can only reach 50 MHz as a master. Another option is a couple of shift registers, such as the SN74AHC594PWR, if you can generate the right strobes and load them through GPIO or a memory controller.
 
The following users thanked this post: axonometric

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4376
  • Country: nz
When you start processing, it'll take only 3 instructions (read, shift and mask) to get each 16-bit number from memory.

If you're doing them individually, 2 out of every 8 will require two loads, two shifts, an OR and a mask.

Unless you're using a processor that supports unaligned accesses.

It's a pity ARM doesn't have a shift or rotate instruction with two source registers. Actually, I can't think of anything that does, in the integer instruction set. PowerPC has one in the SIMD ISA. We're thinking about adding one in the RISC-V BitManip extension.
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3237
  • Country: ca
When you start processing, it'll take only 3 instructions (read, shift and mask) to get each 16-bit number from memory.

If you're doing them individually, 2 out of every 8 will require two loads, two shifts, an OR and a mask.

Yes, but in return you don't need to store it to memory, and you don't need to load it back when processing. That's a much larger saving. Besides, you save some shifts too.

Unless you're using a processor that supports unaligned accesses.

With unaligned access you only need shifts on every second pass, although avoiding the shifts might be more effort than doing them.

It's a pity ARM doesn't have a shift or rotate instruction with two source registers. Actually, I can't think of anything that does, in the integer instruction set. PowerPC has one in the SIMD ISA. We're thinking about adding one in the RISC-V BitManip extension.

x86 obviously.
 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4376
  • Country: nz
It's a pity ARM doesn't have a shift or rotate instruction with two source registers. Actually, I can't think of anything that does, in the integer instruction set. PowerPC has one in the SIMD ISA. We're thinking about adding one in the RISC-V BitManip extension.

x86 obviously.

Really? I'm not all that familiar with x86. I tried compiling my function(s) for x86 and nothing that looked like a double-wide shift or rotate showed up in the assembly language.

What's the instruction called?
 

Offline newbrain

  • Super Contributor
  • ***
  • Posts: 1754
  • Country: se
The SPI in STM32F0 and STM32F7 series (and probably others, but not the F4) can be programmed for any transfer size from 4 to 16 bits.
Data will be right aligned in the DR (data register) and of course DMA is available.

See, for example, chapter 32 of the STM32F7x6 Reference Manual

As said, this is also possible on the F0 series, but given the speeds involved it won't cut it.
Maximum clock frequency for the SPI in STM32F7x6 is 54 MHz.
Nandemo wa shiranai wa yo, shitteru koto dake.
 
The following users thanked this post: axonometric

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4376
  • Country: nz
The SPI in STM32F0 and STM32F7 series (and probably others, but not the F4) can be programmed for any transfer size from 4 to 16 bits.
Data will be right aligned in the DR (data register) and of course DMA is available.

Is that with multiple values back to back in the same data packet?
 

Offline newbrain

  • Super Contributor
  • ***
  • Posts: 1754
  • Country: se
Is that with multiple values back to back in the same data packet?
That's my understanding (and I seem to remember using it for an oddball peripheral), see chapter 32.5.8, specifically the Data Packing and the Communication diagrams sections.

Any other behaviour would be (IMO) pretty much pointless, but I would not be overly surprised...
Nandemo wa shiranai wa yo, shitteru koto dake.
 

Offline Psi

  • Super Contributor
  • ***
  • Posts: 10180
  • Country: nz
You could have a look at the nRF52; I've heard the DMA on that is pretty advanced, so it can likely do what you need.
And you get Bluetooth thrown in :)
« Last Edit: March 03, 2021, 11:34:31 am by Psi »
Greek letter 'Psi' (not Pounds per Square Inch)
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3237
  • Country: ca
What's the instruction called?

SHLD and SHRD.
 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4376
  • Country: nz
What's the instruction called?

SHLD and SHRD.

Aha.

According to Agner Fog's instruction tables, SHLD and SHRD with reg/reg/imm have 3 or 4 cycles of latency all the way through to Coffee Lake (and as much as 6 on some older CPUs). For some unknown reason, SHRD takes one cycle more than SHLD on many CPUs.

Since you can in the worst case do SHL;SHR;OR in 3 clock cycles -- and usually 2 on anything since the original Pentium -- there is no reason to ever use SHLD or SHRD if you are optimising for speed, and so compilers don't generate them.
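That SHL;SHR;OR composition in C, for reference -- a "funnel shift" pulling a 32-bit window out of a register pair:

```c
#include <stdint.h>

/* Extract bits [n .. n+31] of the 64-bit value hi:lo -- the effect of
 * x86 SHRD, built from two plain shifts and an OR. Requires 0 < n < 32
 * (shifting a 32-bit value by 32 is undefined in C). */
static uint32_t funnel_shr(uint32_t hi, uint32_t lo, unsigned n)
{
    return (lo >> n) | (hi << (32 - n));
}
```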
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3237
  • Country: ca
Since you can in the worst case do SHL;SHR;OR in 3 clock cycles -- and usually 2 on anything since the original Pentium -- there is no reason to ever use SHLD or SHRD if you are optimising for speed, and so compilers don't generate them.

If you wanted to do this efficiently, you would probably do it with SSE2 anyway. SHLD and SHRD are remnants of the past.
 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4376
  • Country: nz
Since you can in the worst case do SHL;SHR;OR in 3 clock cycles -- and usually 2 on anything since the original Pentium -- there is no reason to ever use SHLD or SHRD if you are optimising for speed, and so compilers don't generate them.

If you wanted to do this efficiently, you would probably do it with SSE2 anyway. SHLD and SHRD are remnants of the past.

We were talking about efficiently unpacking stuff on a microcontroller, so I've been assuming something like Cortex M3. You don't get things like SSE or NEON on those.
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15127
  • Country: fr
I see. Just going to suggest something: transfer the whole packet through DMA as-is. 400 12-bit words => 600 bytes, or 150 32-bit words. Of course the data will be "packed" in the buffer. Unpack the whole buffer when needed (using whatever method is efficient for that). This way you can still use DMA transfers, which is much more efficient than transferring word by word and unpacking on the fly.

Note that the smallest number of 12-bit words that fits in an integer number of 16-bit words is 4:
four 12-bit words -> three 16-bit words. So your algorithm can work on blocks of three 16-bit words from your DMA buffer and expand them into four zero-extended 16-bit words.

I'm guessing you didn't follow my godbolt link :-)

I didn't (and just did now), but I was assuming it was one efficient way of unpacking (which it is).

I wasn't suggesting any particular way of unpacking -- I just mentioned the minimum number of integer chunks you can act on. Which, by the way, can always be determined using the LCM (probably obvious to many, maybe not so much to those who aren't very math-inclined). For instance, here lcm(12, 16) = 48, so 48 bits is the smallest number of bits containing both an integer number (4) of 12-bit chunks and an integer number (3) of 16-bit chunks.

I merely mentioned that you can still use DMA for the SPI transfers instead of unpacking on the fly (which I assume is what the OP had in mind, since they said they didn't see a way of using DMA). Whatever method you use for unpacking, that is always going to be more efficient -- unless it causes latency issues, which I doubt, since the OP was willing to transfer through DMA in the first place.
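The LCM arithmetic above as a quick helper, nothing MCU-specific:

```c
/* lcm via gcd (Euclid): the smallest bit count holding a whole number
 * of both 12-bit and 16-bit chunks is lcm(12, 16) = 48. */
static unsigned gcd(unsigned a, unsigned b)
{
    while (b) { unsigned t = a % b; a = b; b = t; }
    return a;
}

static unsigned lcm(unsigned a, unsigned b)
{
    return a / gcd(a, b) * b;
}
```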
« Last Edit: March 03, 2021, 05:20:31 pm by SiliconWizard »
 

Offline mikeselectricstuff

  • Super Contributor
  • ***
  • Posts: 13914
  • Country: gb
    • Mike's Electric Stuff
I'm dealing with custom silicon that outputs 4800-bit packets, i.e. 400 12-bit chunks per packet, at ~50 Mbit/s. The packing is done to maximize data rate.
How frequently does the packet come in?
If it's not too frequent, software unpacking afterwards wouldn't be a big deal. If speed is crucial, a little assembly using LDM/STM, plenty of registers, and some shifts to do a chunk at a time and optimise memory access could be significantly faster than doing it in C.
Youtube channel:Taking wierd stuff apart. Very apart.
Mike's Electric Stuff: High voltage, vintage electronics etc.
Day Job: Mostly LEDs
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3237
  • Country: ca
We were talking about efficiently unpacking stuff on a microcontroller, so I've been assuming something like Cortex M3.

It certainly didn't look that way:

According to Agner Fog's instruction tables, SHLD and SHRD with reg/reg/imm have 3 or 4 cycles of latency all the way through to Coffee Lake (and as much as 6 on some older CPUs). For some unknown reason, SHRD takes one cycle more than SHLD on many CPUs.

Anyway, speaking of MCUs, the unpacking overhead is small, regardless of the CPU.
 

Online Siwastaja

  • Super Contributor
  • ***
  • Posts: 8608
  • Country: fi
The big question is, where and how is the data finally used?

Most of the time, doing all the processing "at once" is most efficient, i.e. where you are looping through all the items anyway, do everything there. Once you have loaded the data from memory into CPU registers, it usually doesn't matter whether your processing then takes 10, 11 or 15 instructions. Loads and stores are slow in comparison; if you have slower memory interfaces and need caches, the difference is even bigger.

"On-the-fly" processing makes sense when you are able to significantly "pack" the data, i.e., decrease the memory footprint. Unpacking on the fly sounds like the opposite, don't do it.

To me, it sounds like a good idea to keep the efficiently packed 12-bit data as long as possible and "unpack" it at the point of use.
 

Offline axonometricTopic starter

  • Newbie
  • Posts: 3
  • Country: us
Thanks for all the responses, especially errorprone and newbrain for the suggestions of LPC54000 and STM32F7, both of which support the unpacking I was looking for. Configurable SPI "data width" or "data frame width" seems to be the most common terminology used to describe this feature.

To address a few other points:

You could have a look at the nRF52

I checked the datasheet. Doesn't look like nRF52 supports configurable SPI data width.


How frequently does the packet come in?

One pretty much right after the other. Close to a constant 50 Mbit/s.


To me, it sounds a good idea to keep the efficiently packed 12-bit data as long as possible and "unpack" at use.

In most cases, I'd agree with you. Here, though, I need this MCU to sit between the data source, receiving packed 12-bit data via SPI, and the data processor, offloading unpacked 16-bit data via USB -- and neither of those can be changed. The MCU needs to do some other tasks too, which is why DMA is preferable (though not essential, given the clock speeds current MCUs can run at).
 

Offline mikeselectricstuff

  • Super Contributor
  • ***
  • Posts: 13914
  • Country: gb
    • Mike's Electric Stuff
Quote

How frequently does the packet come in?

One pretty much right after the other. Close to a constant 50 Mbit/s.

OK, in that case you use double buffering to DMA it into a pair of ping-pong buffers at about 4 Mwords/s.
You then have 96 µs to unpack the first buffer while the second is filling.
The above-mentioned STM32F7 runs at 216 MHz, so that's about 20,000 clock cycles to unpack 400 words.
Doesn't seem like that would be much of a problem.
It might be more efficient to DMA it in as 32-bit words, which should reduce the memory bandwidth load due to fewer accesses.
 
You probably could do it with interrupts as long as the SPI peripheral has a decent-sized FIFO; otherwise interrupt entry/exit overhead would chew up a lot of cycles. But DMA would be a lot more efficient, and most DMA peripherals support double buffering.
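A skeleton of the ping-pong logic: the DMA-complete interrupt flips the active buffer and flags the just-filled one for unpacking in the main loop. The HAL call to retarget the DMA stream is hypothetical -- substitute whatever your part's DMA controller actually uses:

```c
#include <stdint.h>
#include <stdbool.h>

#define PKT_BYTES 600  /* 4800 bits per packet, still packed */

static uint8_t buf[2][PKT_BYTES];
static volatile int  filling = 0;     /* buffer DMA is writing into */
static volatile bool ready   = false; /* other buffer is full       */

/* Called from the DMA transfer-complete interrupt. */
void dma_complete_isr(void)
{
    filling ^= 1;  /* swap: DMA now fills the other buffer */
    /* dma_set_target(buf[filling], PKT_BYTES);  <- hypothetical HAL call */
    ready = true;
}

/* Index of the buffer the main loop may safely unpack. */
int idle_buffer(void)
{
    return filling ^ 1;
}
```

The main loop then checks `ready`, unpacks `buf[idle_buffer()]` within the 96 µs window, and clears the flag.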
« Last Edit: March 04, 2021, 01:18:38 am by mikeselectricstuff »
Youtube channel:Taking wierd stuff apart. Very apart.
Mike's Electric Stuff: High voltage, vintage electronics etc.
Day Job: Mostly LEDs
 

