Wow, this thread blew up! Well, a lot of it is Greek to me, but I'm trying to learn.
Regarding what I'm doing:
I'm a mechanical/electrical engineer; the algorithm in question is essentially a filter for vibration analysis/control.
I'm also a bit of a geek, and I'm eager to learn to use FPGAs, which I feel are my next foray into electrical engineering. So my goals are twofold here: to learn to use FPGAs and to actually get this algorithm working. The algorithm is something I've been conceptualizing for a few years; it's simple signal processing theory, but very computationally intense.
I'm used to embedded systems, hence my natural gravitation towards DSPs and FPGAs. My first goal is to do a proof of concept for a 1D or 2D case (as long as I know it can expand to 6+!).
Regarding the algorithm, delays, and data dependencies:
The algorithm is simple multiplies and sums. Just a lot of them. The foundation of signal processing.
For the multiplies, there is no data dependency: as soon as we get a sample (at 200 kHz) from the ADC, we have all the coefficients we need to multiply.
There is a data dependency in the summation: the results of all the multiplications need to be summed together.
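To make that structure concrete, here's a minimal C sketch of one filter tick (N, coeff, and delay_line are illustrative names I've picked, and the sequential loops stand in for what would be parallel multipliers and an adder tree on an FPGA):

```c
#include <stddef.h>

#define N 64 /* number of taps; purely an example figure */

/* One tick of the multiply-then-sum structure described above,
 * assuming an FIR-style filter: shift in one new sample, then
 * form the dot product of coefficients and delay line. */
float mac_step(const float coeff[N], float delay_line[N], float new_sample)
{
    /* Shift the new sample into the delay line. */
    for (size_t i = N - 1; i > 0; i--)
        delay_line[i] = delay_line[i - 1];
    delay_line[0] = new_sample;

    /* The multiplies are all independent; only this accumulation
     * chains results together. In hardware the sum becomes a
     * log2(N)-deep adder tree rather than a sequential loop. */
    float acc = 0.0f;
    for (size_t i = 0; i < N; i++)
        acc += coeff[i] * delay_line[i];
    return acc;
}
```

The nice property for an FPGA is that the multiplies can all fire in the same clock cycle in DSP slices, and the adder tree's latency then grows only logarithmically with the number of taps.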
A few cycles of delay at MHz sample rates is no problem, but a few cycles of delay at kHz rates can lead to mechanical instability.
I know delays are inevitable in a digital system, and the more pipeline delay I have, the lower the frequency at which the algorithm will be stable. I've already got some delay from the sigma-delta ADCs I'm planning on using. If in the end the algorithm is only good to 50 Hz, then so be it, but ideally I would like it to be useful to at least a few kHz.
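To put rough numbers on the latency budget: k samples of delay at sample rate Fs costs 360*f*k/Fs degrees of phase at frequency f. A quick back-of-the-envelope calculation (the 20-sample pipeline depth is an assumed example, not a measured figure):

```c
#include <stdio.h>

int main(void)
{
    const double fs = 200e3; /* ADC sample rate, Hz */
    const double k  = 20.0;  /* assumed total pipeline depth, samples */
    const double f  = 1e3;   /* frequency of interest, Hz */

    double delay_s = k / fs;              /* 100 us of group delay */
    double lag_deg = 360.0 * f * delay_s; /* 36 degrees at 1 kHz */

    printf("delay = %.1f us, phase lag at %.0f Hz = %.1f deg\n",
           delay_s * 1e6, f, lag_deg);
    return 0;
}
```

That same 100 us is only a fraction of a degree at 10 Hz, which is why the delay budget bites at the high end of the band first.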
Regarding using a GPU/CPU/other hardware:
I'll admit first that this is probably waay beyond my area of comfort here.
Admittedly, I think getting a proof of concept would be impossible for me on anything other than a DSP or FPGA. Using multiple DSPs would be... tough: coding the algorithm would be simple, but the data routing would be a nightmare, and the number of DSPs required wouldn't scale well with the algorithm's O(N^2) growth (if every input channel couples to every output, 6 channels already means 36 filter paths).
If it is only possible (or a lot easier for an actual hardware engineer) to implement on another device - be it a GPU, CPU, or anything else - that advice is also welcome. If that's the case I probably wouldn't take it on myself, but I would still love to hear it if you think so.
The other reason I would strongly prefer a hardware solution is that it must keep up in real time. Keeping up in real time and minimizing pipeline delay are the name of the game here, which is why I instantly thought FPGA: I can interface with the ADC/DAC and perform the computations on the same chip.