There is limited potential for parallel operations in an FFT, so extremely low latency is a problem. There is a lot of potential for pipelining, so getting high throughput is straightforward.
This is true. Each output term will eventually map back to all the input terms, so there’s a lot of data path to think about.
FFTs also require you to have all of the inputs before you can begin, so there’s always the latency of collecting your data samples. Might it be possible to begin the FFT as your samples come in? I did an FIR filter bank this way once; it precomputed as much as possible and then waited for the final sample to come in.
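To make the "begin as your samples come in" idea concrete, here is a minimal sketch. It is a direct DFT accumulator, not an FFT: each arriving sample's contribution is added to every output bin, so only the last sample's work is left once the block is complete. The class name `StreamingDFT` and its methods are hypothetical, chosen for illustration, and the total cost is O(N^2) rather than O(N log N).

```python
import cmath


class StreamingDFT:
    """Accumulates an N-point DFT one sample at a time (hypothetical helper).

    This is a direct DFT, not an FFT: it spends O(N) work per sample so
    that nothing but the final sample's contribution remains at the end.
    """

    def __init__(self, n):
        if n < 1:
            raise ValueError("n must be at least 1")
        self.n = n
        self.count = 0
        self.bins = [0j] * n
        # Precomputed twiddle factors exp(-2*pi*i*m/n) for m = 0..n-1.
        self.twiddles = [cmath.exp(-2j * cmath.pi * m / n) for m in range(n)]

    def push(self, sample):
        """Fold one new time-domain sample into every frequency bin."""
        if self.count >= self.n:
            raise ValueError("block is already full")
        index = self.count
        for k in range(self.n):
            # Contribution of x[index] to X[k] is x[index] * W^(k*index).
            self.bins[k] += sample * self.twiddles[(k * index) % self.n]
        self.count += 1

    def result(self):
        """Return the finished spectrum; only valid once all n samples are in."""
        if self.count != self.n:
            raise ValueError("block is not complete yet")
        return list(self.bins)


dft = StreamingDFT(4)
for value in [1.0, 2.0, 3.0, 4.0]:
    dft.push(value)
spectrum = dft.result()
```

In hardware terms this trades multipliers for latency: it needs one multiply-accumulate per bin per sample, which is only attractive for small N or when just a few bins are needed.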
Finally, you mentioned variable size. I expect that this would, in hardware terms, involve more MUXing, which is difficult for place and route of the FPGA. I suspect the way forward there is to fix the transform length at a power of two (2^n, i.e. a radix-2 FFT) and use software to zero-pad any input shorter than 2^n.
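A rough software-side sketch of that zero-padding step, with a small recursive radix-2 FFT standing in for the fixed-size hardware core. The function names `next_pow2`, `zero_pad`, and `fft_radix2` are made up for this example, and the recursive FFT is only a reference model, not how the FPGA pipeline would be built.

```python
import cmath


def next_pow2(n):
    """Smallest power of two that is >= n (n must be at least 1)."""
    if n < 1:
        raise ValueError("need at least one sample")
    return 1 << (n - 1).bit_length()


def zero_pad(samples):
    """Append zeros so the block length becomes a power of two."""
    size = next_pow2(len(samples))
    return list(samples) + [0.0] * (size - len(samples))


def fft_radix2(x):
    """Recursive decimation-in-time radix-2 FFT (reference model only)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Butterfly: combine the two half-size transforms.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


padded = zero_pad([1.0, 2.0, 3.0])   # three samples padded to length 4
spectrum = fft_radix2(padded)
```

One thing to keep in mind: the padded transform has 2^n bins spaced at (sample rate)/2^n, so the result is an interpolated spectrum of the short input rather than the bins a shorter transform would have produced.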