I like to approach optimization from the compiler's side, and go from there. Work in stages:
1. Any high-level optimizations (to the algorithm) have already been done. (Start there first!!)
2. You can write your ASM function to drop in, so the compiler knows how to use it and you can go back to #1 if a new strategy comes up. (You may end up discarding that function in the process, wasting effort -- hence the emphasis on the front end.)
3. While you still need to comply with the compiler's contract for call/return conventions (the ABI), you have complete freedom to (ab)use the hardware within the function. If you can make better use of it, there you go, that's clock cycles saved!
4. If it looks like #1 is basically done, you can go farther and expand into other functions, optimizing neighboring ones, inlining them in a longer or higher-level function, etc. (The compiler does some of this already, but it won't discover many optimizations it couldn't have already made in the base functions.)
Case study: a simple DSP (digital signal processing) operation. The fundamental operation of DSP is the multiply-accumulate (MAC) instruction, A = A + B * C, used in a loop to convolve two arrays. (Convolution is a big word for a simple process: step through the two arrays together, multiply the elements pairwise, then sum up the results. If you know linear algebra, it's the dot product of two vectors. This isn't a full definition of convolution, more a functional example.)
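As a concrete sketch (the function and variable names here are illustrative, not from the original), the MAC-in-a-loop dot product looks like:

```c
#include <stdint.h>

/* Minimal sketch: convolve two n-element arrays by repeated
 * multiply-accumulate -- the dot product of x and h. */
static int32_t dot_mac(const int16_t *x, const int16_t *h, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)x[i] * h[i];   /* one MAC per element pair */
    return acc;
}
```

Each loop iteration is exactly one A = A + B * C step; a hardware MAC instruction does the whole body in one cycle.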
For example, if we convolve a sequence of digital samples (the latest sample, and the N previous samples), with a sequence of samples representing the impulse response of a filter (also of length N), the result is the next sample in the series of that signal as if it were passed through that filter. This is a FIR (finite impulse response) filter: an impulse input (one sample nonzero, the rest zero) gets multiplied by each element of the filter in turn, all of which can be nonzero, up to length N where the magnitude implicitly drops to zero, because, well, that's the size of the array.
FIR filters are great because you can control the impulse response, well, exactly; that's all the filter array is! And by extension the frequency response, and it's very easy to get certain minimum-phase or equal-delay properties from them. The downside is, if your desired filter has a very long time constant (e.g., a low frequency, narrow bandpass/stop, or long time delay or reverberation), you need as long of an array. And as much sample history. And you need to compute the convolution every sample.
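A hypothetical C sketch of one FIR step, host-testable (history[], fir_step and N are illustrative names, not from the original). history[0] is the newest sample; an impulse fed in gets multiplied by each coefficient in turn as it ages through the array:

```c
#include <stdint.h>

#define N 4  /* filter length; real filters are far longer */

static int16_t history[N];  /* sample history, newest first */

static int32_t fir_step(int16_t in, const int16_t coef[N]) {
    for (int i = N - 1; i > 0; i--)
        history[i] = history[i - 1];       /* age the sample history */
    history[0] = in;
    int32_t acc = 0;
    for (int i = 0; i < N; i++)
        acc += (int32_t)history[i] * coef[i];  /* convolve */
    return acc;
}
```

Feeding an impulse (a 1, then zeros) returns the coefficients one by one, then zero forever after: the finite impulse response is literally the coefficient array.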
So, DSP machines need to crank a lot of data, and tend to be very powerful indeed, if relatively specialized for this one task. (Example: one might use a 200MHz core capable of delivering one MAC per cycle, so it could process around 4500 words per 44.1kHz audio sample; a FIR filter could have a minimum cutoff on the order of 20Hz.)
If we use a history of input and output samples, we can employ feedback to useful effect. We have to be careful: obviously we don't want that accumulating to gibberish, with exponential growth; it has to be stable. And being a feedback process, we expect it to have an exponential (well, discrete time, so technically, geometric) decay: an infinite impulse response (IIR). Whereas the FIR filter coefficients can take any value, the IIR filter coefficients have to be computed carefully. Fortunately, that analysis has been done, so we can design filters using tools, without having to get too in-depth with the underlying theory. (Which revolves around the Z transform. It happens to map to the Fourier transform -- everything the EE knows about analog signals already applies to DSP, if in a somewhat odd way.)
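The simplest possible IIR filter can be sketched in a few lines (names and the use of floating point are illustrative, chosen for clarity rather than matching the fixed-point discussion below): y[n] = a*y[n-1] + x[n]. With |a| < 1 the feedback is stable, and an impulse decays geometrically: 1, a, a^2, a^3, ... forever.

```c
/* One-pole IIR sketch: each output feeds back into the next. */
static double y_prev = 0.0;

static double iir_step(double x, double a) {
    y_prev = a * y_prev + x;   /* feedback from the previous output */
    return y_prev;
}
```

Note the contrast with the FIR case: one coefficient and one state variable give an infinitely long (if ever-shrinking) response, where a FIR filter would need an array as long as the desired decay.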
Anyway, that explains the basic operation. How do we compute it?
Here's a basic MAC operation, taking as parameters a 32-bit "accumulator", a 16-bit sample, and an 8-bit coefficient. (This wouldn't be nearly enough bits for a proper filter, but works well enough for a simple volume control -- we might convolve arrays of live samples with gain coefficients, to create a mixing board. We can fill the arrays however we like, after all; the convolution doesn't have to be across time, it just is when we are filtering a signal.) The format is integer, signed, presumably in fixed point. (A typical case would be both sample and coefficient in fractional format (0.16 and 0.8), so that the result is 8.24, and the top 8 bits are simply discarded, after testing for overflow of course.)
int32_t mac32r16p8(int32_t acc, int16_t r, const int8_t* p) {
return acc + (int32_t)r * *p;
}
In avr-gcc 4.5.4, this compiles to: (comments added inline to help with those unfamiliar with the instruction set)
mac32r16p8:
push r14
push r15
push r16
push r17 ; save no-clobber registers
mov r14,r22
mov r15,r23
mov r16,r24
mov r17,r25 ; r14:r15:r16:r17 = acc
mov r30,r18
mov r31,r19 ; Z = p
ld r22,Z ; r22 = *p
clr r23
sbrc r22,7 ; skip following instruction if bit in register set
com r23 ; bit 7 = sign bit; com = complement -- ah, this is a sign extend operation
mov r24,r23
mov r25,r23 ; sign extend to 32 bits (r22:r23:r24:r25 = (int32_t)*p)
mov r30,r20
mov r31,r21 ; [1]
mov r18,r30
mov r19,r31 ; ?!?!
clr r20
sbrc r19,7 ; oh, sign extending r20:r21 = r...
com r20 ; probably the compiler allocated r18:r19:r20:r21 at [1],
mov r21,r20 ; so it had to move to a temporary register (r30:r31) first. sloppy.
rcall __mulsi3 ; r22-r25 * r18-r21 = r22-r25; 37 cycles
mov r18,r22
mov r19,r23
mov r20,r24
mov r21,r25 ; ?!
add r18,r14
adc r19,r15
adc r20,r16
adc r21,r17 ; acc + product
mov r22,r18
mov r23,r19
mov r24,r20
mov r25,r21 ; return result
pop r17 ; restore no-clobber registers
pop r16
pop r15
pop r14
ret
- Note the surrounding push and pops: the ABI says, don't clobber r2-r17 and r28-r29. This function uses a lot of registers (8 bytes passed in, 4 passed out), so that might happen. Push and pop cost a couple cycles each (most instructions are 1 cycle, but most memory accesses add a 1-2 cycle penalty), so they're a priority to get rid of.
- Most of the instructions are moves. Hmm, that's not a good sign. Why can't we get the registers in the right places, to begin with? Well, calling conventions are fixed by the ABI, nothing we can do about that at this stage, but there's still more shuffling going on than that.
- Though, it looks like acc starts in r22-r25, and that's also where our output goes. Hmmmmm...
- Also, the compiler has already made obvious boners: there is a MOVW instruction which copies registers pairwise; instead of mov r14,r22; mov r15,r23 it should've used movw r14,r22 (and the rest).
- It looks like everything is just setup to use the library wide-multiply call __mulsi3, a 32 x 32 = 32 bit multiply. This sounds terribly inefficient. I mean, for what it does, the library call isn't bad, 37 cycles for that much work -- but we were supposed to be doing 8x16 here. This is ridiculous!
But this is standard practice when the compiler encounters operations that would be too difficult to reason about. GCC will emit MUL instructions for byte multiplies, but anything larger uses library calls.
Naturally, they don't have libraries for every possible combination, only the most common -- signed-signed, signed-unsigned and unsigned-unsigned at 16 and 32 bits each, I think.
Library calls also use a custom ABI, are never inlined, and never pruned (even with -flto). So the unused parts (sign extension and correction) waste code space, too.
So let's have a go at this, eh? What can we get it down to? Here's what I have in my project:
mac32r16p8:
movw r30, r18
ld r18, Z+ ; get multiplier
; r25:r24:r23:r22 += mul_signed(r21:r20, r18)
eor r19, r19 ; use r19 as zero reg
mulsu r18, r20 ; p*lo (result in r0:r1)
sbc r24, r19
sbc r25, r19 ; sign extend
add r22, r0
adc r23, r1
adc r24, r19
adc r25, r19
muls r18, r21 ; p*hi
sbc r25, r19
add r23, r0
adc r24, r1
adc r25, r19
eor r1, r1
ret
- Yup, the push/pop can be removed!
- Wait, how is this even so short? Yeah the compiler is wasteful, but heck! Well, inlining the multiply and stripping it down to the required 8x16 bit operation saves a hell of a lot of effort. The two partial products need to be added together, offset by one byte (the same way a 1-digit x 2-digit multiplication by hand gives a 3-digit result). The MULS(U) instructions return carry set if the result needs to be corrected for signedness; in essence we sign-extend the result, hence the sbc rX, r19 (subtract-with-carry against the zero register) instructions into the accumulator. It's a lot more addition than the 2-3 add/adc needed for an unsigned operation, but it's still better than adjusting for sign after the fact (i.e. using an unsigned multiply).
- Clearly, the semantics of doing all this is more involved than just one instruction. Extra registers need to be allocated (MUL uses r0:r1 as implicit destination; and I've used r19 to hold zero). Probably there's no equivalent in GCC's internal representation, either (where most of the optimization is done). So they give up and call a library.
- Conveniently, acc is passed in the same place the result is returned, so we can just accumulate directly into those registers. (This depends on parameter order -- swapping parameters alone may yield optimization!)
- Only one instruction is needed to maintain compiler compatibility: r1 is used as the zero register, so needs to be zeroed after use.
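The partial-product arithmetic can be modeled in portable C for checking on a host (mac32r16p8_model is an illustrative name; the real routine does the same thing in registers, with MULSU and MULS producing the two partial products):

```c
#include <stdint.h>

/* Reference model: split the 16-bit sample into an unsigned low byte
 * and a signed high byte, form the two partial products, and combine
 * them offset by one byte -- long multiplication, one digit at a time. */
static int32_t mac32r16p8_model(int32_t acc, int16_t r, int8_t p) {
    uint8_t lo = (uint8_t)(r & 0xFF);         /* unsigned low byte of r */
    int8_t  hi = (int8_t)((uint16_t)r >> 8);  /* signed high byte of r  */
    int32_t p_lo = (int32_t)p * lo;           /* what MULSU computes    */
    int32_t p_hi = (int32_t)p * hi;           /* what MULS computes     */
    return acc + p_lo + p_hi * 256;           /* high part shifted one byte */
}
```

Because r equals hi*256 + lo (with lo unsigned and hi signed), the sum of partial products is exactly r * p, matching the plain C version.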
I further inlined this into the looping function. Notice the ld r18, Z+ postincrement instruction; r30:r31 isn't read after this, it's discarded instead. (There's no cycle penalty for ld Z vs. ld Z+ so it doesn't matter that I left it in there.) I can abuse this clobbered value of Z to just run this function in a loop, loading and accumulating all the products right there.
But further, I can inline it, saving a few more cycles (loop overhead being less than call overhead + unrolled loop).
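In C terms, the surrounding loop might look like this (mix and its parameter names are hypothetical, not from the original project); the hand-written version keeps the coefficient pointer live in Z across iterations, advanced for free by the postincrement load:

```c
#include <stdint.h>

/* The original C MAC, repeated so this sketch is self-contained. */
static int32_t mac32r16p8(int32_t acc, int16_t r, const int8_t *p) {
    return acc + (int32_t)r * *p;
}

/* Hypothetical mixer loop: accumulate sample * gain over n channels.
 * Each call advances the coefficient pointer by one -- exactly the
 * ld r18, Z+ behavior when the routine is run back-to-back or inlined. */
static int32_t mix(const int16_t *samples, const int8_t *gains, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc = mac32r16p8(acc, samples[i], &gains[i]);
    return acc;
}
```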
Overall, the optimizations on this project yielded almost a tenfold improvement -- on the meager platform, it went from hardly worth it (two mixer channels, ~50k MACs/s) to reasonably featureful (two biquad filter stages and 8 mixer channels, ~500k MACs/s).
Tim