Most of that went over my head.
Sorry!
To summarise: if you have arrays of 24-bit samples, three bytes per sample, I showed that it is best to handle them in groups of four samples (12 bytes), and sign-extend them to 32-bit very efficiently. The rest was details, plus two different approaches you can choose from, one of which has better performance.
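For the curious, here is a minimal sketch of one such expansion; it assumes little-endian byte order and a 4-byte-aligned source buffer, the function name is mine, and it is not necessarily identical to either approach from the earlier post:

#include <stdint.h>
#include <string.h>

/* Sign-extend four packed little-endian 24-bit samples (12 bytes)
   into four int32_t values, using three 32-bit loads. */
static inline void expand4_s24_to_s32(const uint8_t *src, int32_t *dst)
{
    uint32_t w0, w1, w2;
    memcpy(&w0, src + 0, 4);  /* bytes 0..3  */
    memcpy(&w1, src + 4, 4);  /* bytes 4..7  */
    memcpy(&w2, src + 8, 4);  /* bytes 8..11 */
    /* Shift each sample into the top 24 bits, then arithmetic-shift
       right by 8 to sign-extend (shifts of signed values are
       arithmetic on ARM with both GCC and Clang). */
    dst[0] = (int32_t)(w0 << 8) >> 8;
    dst[1] = (int32_t)(((w0 >> 24) << 8) | (w1 << 16)) >> 8;
    dst[2] = (int32_t)(((w1 >> 16) << 8) | (w2 << 24)) >> 8;
    dst[3] = (int32_t)w2 >> 8;
}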
On the optimisation... my assumption was that the C compiler would include things like hardware DP FPU and DSP extensions if they existed with no optimisations.
The STM32F4 series has an ARM Cortex-M4 core, and thus implements the ARMv7E-M architecture.
ST's AN4841: Digital signal processing for STM32 microcontrollers using CMSIS describes its DSP features (and also covers some of the STM32F7 series). Since GCC and Clang try to implement the ARM C Language Extensions, the ACLE specification (IHI 0053) is also useful.
In general, C is not an easy language to automatically SIMD-vectorize. For STM32F4, there are instructions that add, subtract, and/or multiply using pairs of signed or unsigned 16-bit values in each register, or quartets of 8-bit signed or unsigned values in each register.
As of right now, you do need to use the ACLE intrinsics to SIMD-vectorize your C code.
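To give a flavour of these intrinsics, here is a minimal sketch (function name mine) using __sadd16, which maps to the SADD16 instruction and adds the two signed 16-bit halves of its operands independently:

#include <arm_acle.h>   /* defines int16x2_t and __sadd16 when __ARM_FEATURE_SIMD32 is set */

/* Add two pairs of packed signed 16-bit values in one instruction;
   the two halves do not carry into each other. */
int16x2_t add_two_pairs(int16x2_t a, int16x2_t b)
{
    return __sadd16(a, b);
}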
For example, let's say you have an array of 16-bit signed integer samples data, another array of 16-bit signed integer coefficients coeff, both arrays having count elements (with count even, i.e. a multiple of two), and both arrays 32-bit aligned (__attribute__((aligned(4)))). You can then obtain the 64-bit sum of the products, which cannot overflow since each product is between -0x3FFF8000 and +0x40000000, i.e. 31 bits:
#include <stdint.h>
#include <arm_acle.h>

/* Access a 32-bit word as packed 16-bit or 8-bit lanes; the
   int16x2_t and friends are the SIMD types <arm_acle.h> defines. */
typedef struct {
    union {
        uint32_t   u32;
        int32_t    i32;
        uint16_t   u16[2];
        int16_t    i16[2];
        uint16x2_t u16x2;
        int16x2_t  i16x2;
        uint8_t    u8[4];
        int8_t     i8[4];
        uint8x4_t  u8x4;
        int8x4_t   i8x4;
        char       c[4];
    };
} word32;
int64_t sum_i16xi16(const int16_t *data, const int16_t *coeff, const uint32_t count)
{
    const word32 *dz = (const word32 *)data + (count / 2);  /* end of data */
    const word32 *d = (const word32 *)data;
    const word32 *c = (const word32 *)coeff;
    int64_t result = 0;
    /* Each SMLALD consumes two samples: both 16-bit products are
       accumulated into the 64-bit result in a single instruction. */
    while (d < dz)
        result = __smlald((*(d++)).i16x2, (*(c++)).i16x2, result);
    return result;
}
when compiled using -O2 -march=armv7e-m+fp -mtune=cortex-m4 -mthumb (with GCC; use -Os instead of -O2 with Clang).
The inner loop will compile to
.L3:
    ldr    r0, [r3], #4     ; 2 cycles
    ldr    r2, [r1], #4     ; 2 cycles
    cmp    ip, r3           ; 1 cycle
    smlald r4, r5, r2, r0   ; 1 cycle
    bhi    .L3              ; 2-5 cycles when taken
which by my count should yield 8 cycles per iteration, i.e. 4 cycles per sample, or equivalently 0.25 samples per cycle.
Furthermore, the result is trivial to shift and then clamp to the proper range, so adjusting for the fractional bits in the samples and/or coefficients costs only a few additional cycles per buffer.
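For instance, here is a minimal sketch of that adjustment, assuming Q15 coefficients (15 fractional bits); the FRACT_BITS constant and the function name are mine:

#include <stdint.h>

#define FRACT_BITS 15   /* assumed: Q15 coefficient scaling */

/* Drop the fractional bits, then clamp to the 16-bit sample range.
   GCC and Clang compile the signed right shift arithmetically on ARM. */
static inline int16_t scale_and_clamp(int64_t sum)
{
    sum >>= FRACT_BITS;
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}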
Me Like Dis.

If count is odd, the last sample and coefficient will be ignored. I suggest you ensure your array sizes are even, and set the final element of both arrays to zero, so you can round count up to the next even number without affecting the result.
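Rounding up is then a one-liner (assuming both arrays really were allocated with that extra zeroed element):

count = (count + 1u) & ~1u;   /* odd count -> next even number; even count unchanged */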
The inner loop is equivalent to
for (uint32_t i = 0; i < count/2; i++)
    result = __smlald(d[i].i16x2, c[i].i16x2, result);
but is written in a form that both GCC and Clang can optimize to the above code or equivalent.
Each loop iteration computes two 16-bit products, summing both into result.
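Finally, a hypothetical usage sketch (array contents and sizes are mine, and it must be built for the Cortex-M4 target, since __smlald does not exist on a desktop host):

#include <stdint.h>

int64_t sum_i16xi16(const int16_t *data, const int16_t *coeff, const uint32_t count);

/* Both arrays 32-bit aligned, even element count, final element zero. */
static int16_t data[8]  __attribute__((aligned(4))) = { 100, -200, 300, -400, 500, -600, 700, 0 };
static int16_t coeff[8] __attribute__((aligned(4))) = {   1,    2,   3,    4,   5,    6,   7, 0 };

void example(void)
{
    int64_t sum = sum_i16xi16(data, coeff, 8);  /* = 2800 with these values */
    (void)sum;
}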