Wow, this thread blew up! Well, a lot of it is Greek to me, but I'm trying to learn.
Regarding what I'm doing:
I'm a mechanical/electrical engineer; the algorithm in question is essentially a filter for vibration analysis/control.
I'm also a bit of a geek, and I'm eager to learn to use FPGAs, which I feel are my next foray into electrical engineering. So my goals are twofold here: to learn to use FPGAs and to actually get this algorithm working. The algorithm is something I've been conceptualizing for a few years; it's simple signal processing theory, but very computationally intense.
I'm used to embedded systems, hence my natural gravitation towards DSPs and FPGAs. My first goal is to do a proof of concept for a 1D or 2D case (as long as I know it can expand to 6+!).
Regarding the algorithm, delays, and data dependencies:
The algorithm is simple multiplies and sums. Just a lot of them. The foundation of signal processing.
For the multiplies, there is no data dependency: as soon as we get a sample (at 200 kHz) from the ADC, we have all the coefficients we need to multiply.
There is a data dependency in the summation: the results of all the multiplications need to be summed together.
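To make that structure concrete, here's a minimal C sketch of one filter tick (N, coeff, and delay_line are illustrative names I've picked, and the sequential loops stand in for what would be parallel multipliers and an adder tree on an FPGA):

```c
#include <stddef.h>

#define N 64 /* number of taps; purely an example figure */

/* One tick of the multiply-then-sum structure described above,
 * assuming an FIR-style filter: shift in one new sample, then
 * form the dot product of coefficients and delay line. */
float mac_step(const float coeff[N], float delay_line[N], float new_sample)
{
    /* Shift the new sample into the delay line. */
    for (size_t i = N - 1; i > 0; i--)
        delay_line[i] = delay_line[i - 1];
    delay_line[0] = new_sample;

    /* The multiplies are all independent; only this accumulation
     * chains results together. In hardware the sum becomes a
     * log2(N)-deep adder tree rather than a sequential loop. */
    float acc = 0.0f;
    for (size_t i = 0; i < N; i++)
        acc += coeff[i] * delay_line[i];
    return acc;
}
```

The nice property for an FPGA is that the multiplies can all fire in the same clock cycle in DSP slices, and the adder tree's latency then grows only logarithmically with the number of taps.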
A few cycles of delay at MHz sample rates is no problem, but a few cycles of delay at kHz rates can lead to mechanical instability.
I know delays are inevitable in a digital system, and the more pipeline delay I have, the lower the frequency at which the algorithm will be stable. I've already got some delay from the sigma-delta ADCs I'm planning on using. If in the end the algorithm is only good to 50 Hz, then so be it, but ideally I would like it to be useful to at least a few kHz.
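To put rough numbers on the latency budget: k samples of delay at sample rate Fs costs 360*f*k/Fs degrees of phase at frequency f. A quick back-of-the-envelope calculation (the 20-sample pipeline depth is an assumed example, not a measured figure):

```c
#include <stdio.h>

int main(void)
{
    const double fs = 200e3; /* ADC sample rate, Hz */
    const double k  = 20.0;  /* assumed total pipeline depth, samples */
    const double f  = 1e3;   /* frequency of interest, Hz */

    double delay_s = k / fs;              /* 100 us of group delay */
    double lag_deg = 360.0 * f * delay_s; /* 36 degrees at 1 kHz */

    printf("delay = %.1f us, phase lag at %.0f Hz = %.1f deg\n",
           delay_s * 1e6, f, lag_deg);
    return 0;
}
```

That same 100 us is only a fraction of a degree at 10 Hz, which is why the delay budget bites at the high end of the band first.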
Regarding using a GPU/CPU/other hardware:
I'll admit first that this is probably waay beyond my area of comfort here.
Admittedly, I think getting a proof of concept would be impossible for me on anything other than a DSP or FPGA. Using multiple DSPs would be... tough: coding the algorithm would be simple, but the data routing would be a nightmare, and the number of DSPs required wouldn't scale well with the algorithm's O(N^2) growth (if every input channel couples to every output, 6 channels already means 36 filter paths).
If it is only possible (or a lot easier for an actual hardware engineer) to implement on another device - be it a GPU, CPU, or anything else - that advice is also welcome. If that's the case I probably wouldn't take it on myself, but I would still love to hear it if you think so.
The other reason I would strongly prefer a hardware solution is that it must keep up in real time. Keeping up in real time and minimizing pipeline delay are the name of the game here, which is why I instantly thought FPGA: I can interface with the ADC/DAC and perform the computations on the same chip.