Can anyone give me an explanation of why that jambalaya with FFT will give me inter-mic delays? I understand that frequency domain graph will give me a sort of instant signature of the sound but I don't understand how that'll detect that one sound arrived a few ms on this mic vs that mic...
A short answer...
If you multiply the samples in one signal by the matching samples in the other, then add up (integrate) the products over a number
of samples, you get a measure of how 'alike' the two signals are.
If you want to match signals at different offsets, you have to do this at each offset.
The result is a time series that shows how much the signals match at different alignments.
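To make the 'hard method' concrete, here is a minimal sketch in plain Python. The function names and the little test 'click' are made up for illustration; the idea is just multiply, add up, and repeat at every offset.

```python
def cross_correlate(a, b):
    """Return (lag, score) pairs for every lag at which a and b overlap.

    score(lag) = sum over n of a[n] * b[n + lag]
    """
    results = []
    for lag in range(-(len(a) - 1), len(b)):
        total = 0.0
        for n, sample in enumerate(a):
            m = n + lag
            if 0 <= m < len(b):          # ignore samples that fall off the end
                total += sample * b[m]
        results.append((lag, total))
    return results


def best_lag(a, b):
    """Lag (in samples) at which b lines up best with a."""
    return max(cross_correlate(a, b), key=lambda pair: pair[1])[0]


# Hypothetical data: the same short click, arriving 3 samples later on mic B.
click = [0.0, 1.0, -2.0, 3.0, -1.0, 0.5]
mic_a = click + [0.0] * 10
mic_b = [0.0] * 3 + click + [0.0] * 7

print(best_lag(mic_a, mic_b))  # 3
```

Divide the lag by the sample rate to get the delay in seconds. At an assumed 44.1 kHz, 3 samples is about 0.068 ms.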
Testing a 16k sample signature in a set of 16k samples at each alignment will require
268,435,456 multiply-add operations. That is a lot of math. Or O(N^2) if you like big-O complexity notation.
The FFT is an efficient way to get the same result with much less work.
You split the signature and test data into their frequency and phase components using FFT.
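You multiply each component of one signal by the complex conjugate of the matching component of the other (the conjugate is what makes this a correlation rather than a convolution). If a frequency
is only in one of them, it will be removed (as zero times anything = zero). If a frequency is in both, it will be enhanced, and its phase becomes the phase difference between the two signals.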
You are then left with a set of amplitudes that indicate which frequencies are common to both signals, along with some phase information (remember, phase information = timing information).
Now run the inverse FFT on this result. At alignments where things are 'in sync' the various components add up; where they are 'out of sync' they cancel out on average.
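The result is a time series that shows how much the signals match at different alignments,
exactly the same numbers (except rounding!) as found by the hard method, provided the signals are zero-padded so that the FFT's wrap-around does not mix the two ends together.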
However, because the FFT and IFFT are each O(N*logN) in complexity, this is far more practical for large values of N.
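The FFT route can be sketched in plain Python too. This is only an illustration under a few assumptions: the radix-2 `fft` below is hand-rolled to keep the example self-contained (in practice you would use a library FFT), the signals are zero-padded to a power of two so wrap-around does not mix the ends, and one spectrum is conjugated so the product gives a correlation rather than a convolution. The test data is invented.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unscaled); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def fft_correlate(a, b):
    """Map each lag to sum over n of a[n] * b[n + lag], computed via FFT."""
    size = 1
    while size < len(a) + len(b) - 1:    # pad so circular == linear correlation
        size *= 2
    fa = fft([complex(v) for v in a] + [0j] * (size - len(a)))
    fb = fft([complex(v) for v in b] + [0j] * (size - len(b)))
    product = [x.conjugate() * y for x, y in zip(fa, fb)]
    corr = [v.real / size for v in fft(product, inverse=True)]
    # Index k holds lag k; negative lags wrap around to the end of the list.
    return {lag: corr[lag % size] for lag in range(-(len(a) - 1), len(b))}


def direct_score(a, b, lag):
    """The 'hard method' score at one lag, for comparison."""
    return sum(a[n] * b[n + lag] for n in range(len(a)) if 0 <= n + lag < len(b))


# Hypothetical data: the same click, 2 samples later on mic B.
mic_a = [0.0, 1.0, -2.0, 3.0, -1.0, 0.5, 0.0, 0.0]
mic_b = [0.0, 0.0, 0.0, 1.0, -2.0, 3.0, -1.0, 0.5]

scores = fft_correlate(mic_a, mic_b)
best = max(scores, key=scores.get)
worst_error = max(abs(scores[lag] - direct_score(mic_a, mic_b, lag)) for lag in scores)
print(best)                 # 2
print(worst_error < 1e-9)   # True: same numbers as the hard method, except rounding
```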
So for 16k of samples, the hard way takes 268,435,456 units of work. The FFT way takes about 3*16k*14 = 688,128 units of work: two forward FFTs and one inverse, each costing N*log2(N), with log2(16k) = 14 (however, the size of a unit of work may be different between the two processes, so you can't just say it will be 390 times faster).
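The exact figure depends on how you count. As a rough check, here is the arithmetic under one assumed convention: base-2 logarithms, N*log2(N) units per transform, and three transforms in total (two forward, one inverse). Constant factors and zero-padding are ignored, so treat the ratio as an order of magnitude only.

```python
import math

n = 16 * 1024                           # 16k samples
direct_ops = n * n                      # one multiply-add per sample per alignment
per_transform = n * int(math.log2(n))   # N * log2(N), with log2(16384) = 14
fft_ops = 3 * per_transform             # two forward FFTs plus one inverse

print(direct_ops)                       # 268435456
print(per_transform)                    # 229376
print(fft_ops)                          # 688128
print(round(direct_ops / fft_ops))      # 390
```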
This also brings up another point. To get good timing information from such a process, your signature must have a good mix of frequencies in it. A sine wave will match another sine wave of the same frequency whenever they are in phase, which happens once every cycle, giving ambiguous results. The signature must also have structure to it:
one bit of pure random noise is much like any other bit of random noise.
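The sine-wave ambiguity is easy to see in a toy example. The sketch below (invented signals, circular correlation for simplicity, hypothetical helper names) delays a pure sine and a short click by the same 5 samples and lists every lag that ties for the best match.

```python
import math


def circular_scores(a, b):
    """Score at each lag, treating both signals as repeating (circular)."""
    n = len(a)
    return [sum(a[i] * b[(i + lag) % n] for i in range(n)) for lag in range(n)]


def near_best_lags(scores, tolerance=1e-9):
    """All lags whose score ties with the best one, up to rounding."""
    top = max(scores)
    return [lag for lag, s in enumerate(scores) if s >= top - tolerance * abs(top)]


def delay(signal, samples):
    """Rotate the signal so everything arrives `samples` later."""
    return signal[-samples:] + signal[:-samples]


n = 64
sine = [math.sin(2 * math.pi * i / 16) for i in range(n)]   # period of 16 samples
click = [1.0, -2.0, 3.0, -1.0] + [0.0] * (n - 4)            # short, structured burst

sine_lags = near_best_lags(circular_scores(sine, delay(sine, 5)))
click_lags = near_best_lags(circular_scores(click, delay(click, 5)))

print(sine_lags)    # [5, 21, 37, 53]  one candidate per cycle: ambiguous
print(click_lags)   # [5]              a single clear answer
```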
This is one of the reasons that dolphins use 'clicks' for echolocation and radar uses 'chirps'. This technique will not let you accurately pinpoint the location of that annoying hum unless you have lots of microphones to remove the ambiguity.