Hmm, I might change my mind - slightly - on this. I've just spent the last two days optimising some DSP code on an LPC4370 (ARM M4F), and there are rare occasions when it can make sense.
I have two pieces of code I'm trying to squeeze the last cycle out of: one is a polyphase decimator (the ARM CMSIS-DSP decimator is _not_ polyphase but the interpolator is, go figure) and the other is a quadrature NCO and mixer.
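For anyone unfamiliar with the term, a polyphase decimator splits the FIR taps into M sub-filters and only does arithmetic at the output rate. What follows is a minimal portable C sketch of that idea, not the optimised routine discussed here. The function names, the zero-filled history before the first sample, and the requirement that the tap count and input length are multiples of M are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Reference: filter at the full input rate, then keep every M-th result.
 * Samples before x[0] are treated as zero (an assumption of this sketch). */
static void fir_then_downsample(const float *h, size_t ntaps, size_t M,
                                const float *x, size_t nin, float *y)
{
    for (size_t i = 0; i < nin; i++) {
        float acc = 0.0f;
        for (size_t j = 0; j < ntaps && j <= i; j++)
            acc += h[j] * x[i - j];
        if (i % M == 0)
            y[i / M] = acc;          /* the other M-1 results are thrown away */
    }
}

/* Polyphase form: sub-filter k owns taps h[k], h[k+M], h[k+2M], ...
 * and work is only done for the outputs that are actually kept. */
static void polyphase_decimate(const float *h, size_t ntaps, size_t M,
                               const float *x, size_t nin, float *y)
{
    assert(M > 0 && ntaps % M == 0 && nin % M == 0);
    size_t per_phase = ntaps / M;
    for (size_t m = 0; m < nin / M; m++) {
        float acc = 0.0f;
        for (size_t k = 0; k < M; k++) {              /* phase index */
            for (size_t n = 0; n < per_phase; n++) {
                size_t j = n * M + k;                 /* tap index in h */
                if (m * M >= j)                       /* skip pre-start samples */
                    acc += h[j] * x[m * M - j];
            }
        }
        y[m] = acc;
    }
}
```

Both functions produce the same nin/M outputs, but the polyphase version does roughly 1/M of the multiplies, which is where the saving comes from.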
I first spent some time trying to make the C code reasonably fast. Keep in mind that I was already more than 200% over my CPU cycle budget when I started looking at this on Friday, so short of overclocking the 200MHz part to 600MHz, something had to give. With carefully crafted idioms and some data restructuring, I got that down to about 50% over budget: an improvement, but not enough. The environment is LPCXpresso 7.7.2, so gcc is the compiler. I have now recoded both bottlenecks in inline assembler, and after several hours have managed to take the polyphase decimator from 3.3Msps (input samples) in optimised C to 6.25Msps. The quadrature oscillator and mixer went from 8.5Msps to 12.5Msps. For both, I am pretty sure I can squeeze out another 5-10%, as there are some stalls that I'm as yet unable to explain away.
The approach on the C side was to move from flash to RAM, then carefully code up compiler idioms, and examine what's generated. There was at this point a fair bit of loop unrolling, and some restructuring of data to allow the use of LDM/STM/VLDM/VSTM multiple register loads and stores.
Then I rolled up my sleeves and embarked on the most assembler I've written for probably a couple of decades.
On the assembler side, the human has the advantage of knowing the nature of the parameters and how the functions are called. The compiler just doesn't know that. With this knowledge, the code can be carefully crafted to avoid pipeline stalls by interleaving operations so that adjacent instructions don't depend on each other's registers, by combining previously separate processes into one, by avoiding excessive loads and stores, and by keeping as much data as possible in registers (especially on processors like ARM). Again, adjusting the data structures and unrolling loops to make use of the LDM/STM/VLDM/VSTM instructions and minimise loop overhead helped.
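To show what interleaving independent operations means in plain C terms, here is a minimal sketch using a dot product, which is not one of the routines discussed here. The function names and the multiple-of-4 restriction are assumptions for illustration, and whether it actually removes stalls depends on the core and on what the compiler generates.

```c
#include <assert.h>
#include <stddef.h>

/* One accumulator: every multiply-add has to wait for the previous one. */
static float dot_serial(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

/* Four accumulators: adjacent statements touch different variables, so
 * there is no dependency between neighbouring operations.
 * n must be a multiple of 4 in this sketch. */
static float dot_interleaved4(const float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    for (size_t i = 0; i < n; i += 4) {
        acc0 += a[i]     * b[i];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }
    return (acc0 + acc1) + (acc2 + acc3);
}
```

The partial sums are added in a different order from the serial version, so for general floating-point data the two results can differ in the last bits.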
In addition, for the NCO, there was some register moving going on for a delay line (it's a pair of IIR filters), and by unrolling the loop and recoding it, the register moving could be avoided. Combining the quadrature mixer and the quadrature NCO into a single process also made some savings.
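The register-move saving can be sketched in portable C. This is an illustration under assumptions, not the routine itself: the struct, the function names and the multiple-of-3 restriction are hypothetical, and each oscillator is taken to be the two-term recurrence y[n] = a1*y[n-1] + a2*y[n-2]. The first function shifts the delay line on every sample; the second unrolls by three so the variables simply take turns holding the newest value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state: last two outputs of the I and Q oscillators. */
typedef struct {
    float i1, q1;   /* y[n-1] for I and Q */
    float i2, q2;   /* y[n-2] for I and Q */
    float a1, a2;   /* y[n] = a1*y[n-1] + a2*y[n-2] */
} qnco_t;

/* One sample per pass, shifting the delay line each time. */
static void mix_shift(qnco_t *s, const float *in,
                      float *out_i, float *out_q, size_t n)
{
    for (size_t k = 0; k < n; k++) {
        float i0 = s->a1 * s->i1 + s->a2 * s->i2;
        float q0 = s->a1 * s->q1 + s->a2 * s->q2;
        s->i2 = s->i1;  s->q2 = s->q1;   /* delay-line moves: pure overhead */
        s->i1 = i0;     s->q1 = q0;
        out_i[k] = i0 * in[k];           /* mixer folded into the same loop */
        out_q[k] = q0 * in[k];
    }
}

/* Unrolled by three: a, b and c rotate roles, so nothing is copied.
 * n must be a multiple of 3 in this sketch. */
static void mix_rotate3(qnco_t *s, const float *in,
                        float *out_i, float *out_q, size_t n)
{
    assert(n % 3 == 0);
    const float a1 = s->a1, a2 = s->a2;
    float ia = s->i1, qa = s->q1;        /* newest on entry */
    float ib = s->i2, qb = s->q2;        /* one sample back on entry */
    for (size_t k = 0; k < n; k += 3) {
        float ic = a1 * ia + a2 * ib, qc = a1 * qa + a2 * qb;  /* c newest */
        out_i[k] = ic * in[k];          out_q[k] = qc * in[k];
        ib = a1 * ic + a2 * ia;         qb = a1 * qc + a2 * qa; /* b newest */
        out_i[k + 1] = ib * in[k + 1];  out_q[k + 1] = qb * in[k + 1];
        ia = a1 * ib + a2 * ic;         qa = a1 * qb + a2 * qc; /* a newest */
        out_i[k + 2] = ia * in[k + 2];  out_q[k + 2] = qa * in[k + 2];
    }
    s->i1 = ia;  s->q1 = qa;             /* a is newest, b is one back */
    s->i2 = ib;  s->q2 = qb;
}
```

A version like mix_shift is also handy as a reference to check an assembler rewrite against, sample by sample.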
So in short, you can sometimes improve things fairly significantly with assembler, but it's a bit of a last resort, as maintenance is almost always going to be a head scratcher. Even if you never write any assembler, to make headway on a performance problem you may well find that you need to understand how the CPU works, at least well enough to roughly follow the disassembly in the debugger.
Don't try this at home, kids...
__asm__ __volatile__
(
"\n\t"
"vldm %[NCO],{s0-s7} \n\t" // Initialise NCOs into registers
"mov r6,#3 \n\t" // Divide by 3
"udiv r5,%[NUMSAMPLES],r6 \n\t" // Result in R5 // ***Check this udiv, takes 160ns, needs work
"mls r6,r6,r5,%[NUMSAMPLES] \n\t" // Remainder in R6
"cbz r5,loopexit1 \n\t"
// Calculate NCOs, three at a time to avoid shuffling registers about
"\nloop1: \n\t"
"vldm %[IN]!,{s10-s12} \n\t" // Get next three samples
"vmul.f32 s4,s6,s0 \n\t" // Iy0=A1*Iy1 // s4=Iy0, s0=Iy1, s2=Iy2 // NCO
"vmul.f32 s5,s6,s1 \n\t" // Qy0=A1*Qy1
"vmul.f32 s8,s7,s2 \n\t" // s8=A2*Iy2
"vmul.f32 s9,s7,s3 \n\t" // s9=A2*Qy2
"vadd.f32 s4,s4,s8 \n\t" // Iy0=A1*Iy1 + A2*Iy2
"vadd.f32 s5,s5,s9 \n\t" // Qy0=A1*Qy1 + A2*Qy2
"vmul.f32 s13,s4,s10 \n\t" // s13=Iy0*In[0] // Mixer
"vmul.f32 s16,s5,s10 \n\t" // s16=Qy0*In[0]
"vmul.f32 s2,s6,s4 \n\t" // Iy0=A1*Iy1 // s2=Iy0, s4=Iy1, s0=Iy2 // NCO
"vmul.f32 s3,s6,s5 \n\t" // Qy0=A1*Qy1
"vmul.f32 s8,s7,s0 \n\t" // s8=A2*Iy2
"vmul.f32 s9,s7,s1 \n\t" // s9=A2*Qy2
"vadd.f32 s2,s2,s8 \n\t" // Iy0=A1*Iy1 + A2*Iy2
"vadd.f32 s3,s3,s9 \n\t" // Qy0=A1*Qy1 + A2*Qy2
"vmul.f32 s14,s2,s11 \n\t" // s14=Iy0*In[1] // Mixer
"vmul.f32 s17,s3,s11 \n\t" // s17=Qy0*In[1]
"vmul.f32 s0,s6,s2 \n\t" // Iy0=A1*Iy1 // s0=Iy0, s2=Iy1, s4=Iy2 // NCO
"vmul.f32 s1,s6,s3 \n\t" // Qy0=A1*Qy1
"vmul.f32 s8,s7,s4 \n\t" // s8=A2*Iy2
"vmul.f32 s9,s7,s5 \n\t" // s9=A2*Qy2
"vadd.f32 s0,s0,s8 \n\t" // Iy0=A1*Iy1 + A2*Iy2
"vadd.f32 s1,s1,s9 \n\t" // Qy0=A1*Qy1 + A2*Qy2
"vmul.f32 s15,s0,s12 \n\t" // s15=Iy0*In[2] // Mixer
"vmul.f32 s18,s1,s12 \n\t" // s18=Qy0*In[2]
"vstm %[OUTI]!,{s13-s15} \n\t" // Write the complex downconverted sample out
"vstm %[OUTQ]!,{s16-s18} \n\t"
"subs r5,#1 \n\t" // Loop counter
"bne loop1 \n\t"
"\nloopexit1: \n\t"
"cbz r6,loopexit2 \n\t"
"\nloop2: \n\t" // This is the non-unrolled version for stragglers when num samples isn't divisible by 3
"vldm %[IN]!,{s10} \n\t" // Get next sample
"vmov.f32 s4,s2 \n\t" // Iy2=Iy1 // Interleave I and Q instructions to prevent stalling // NCO
"vmov.f32 s5,s3 \n\t" // Qy2=Qy1
"vmov.f32 s2,s0 \n\t" // Iy1=Iy0
"vmov.f32 s3,s1 \n\t" // Qy1=Qy0
"vmul.f32 s0,s6,s2 \n\t" // Iy0=A1*Iy1
"vmul.f32 s1,s6,s3 \n\t" // Qy0=A1*Qy1
"vmul.f32 s8,s7,s4 \n\t" // s8=A2*Iy2
"vmul.f32 s9,s7,s5 \n\t" // s9=A2*Qy2
"vadd.f32 s0,s0,s8 \n\t" // Iy0=A1*Iy1 + A2*Iy2
"vadd.f32 s1,s1,s9 \n\t" // Qy0=A1*Qy1 + A2*Qy2
"vmul.f32 s8,s0,s10 \n\t" // s8=Iy0*In[0] // Mixer
"vmul.f32 s9,s1,s10 \n\t" // s9=Qy0*In[0]
"vstm %[OUTI]!,{s8} \n\t" // Write the complex downconverted sample out
"vstm %[OUTQ]!,{s9} \n\t"
"subs r6,#1 \n\t" // Loop counter
"bne loop2 \n\t"
"\nloopexit2: \n\t"
"vstm %[NCO],{s0-s5} \n\t" // Store the NCO variables back
: [OUTI]"+r" (pstOutI), [OUTQ]"+r" (pstOutQ), [IN]"+r" (pstIn)
: [NCO]"r" (pncos), [NUMSAMPLES]"r" (nNumSamples)
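: "r5","r6","s0", "s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9","s10","s11","s12","s13","s14", "s15", "s16", "s17", "s18", "cc", "memory" // cc: subs changes the flags; memory: the vldm/vstm read and write through the pointers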
);