Author Topic: FPGA Beginner question...would an FPGA have enough gates to perform this?  (Read 5096 times)


Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3237
  • Country: ca
Regarding the memory...
Would I need external memory for this many operations?

10k is an awfully long FIR filter.

You can easily calculate this, but you need to know how many bits you need per coefficient and how many bits you need per accumulator.

For each FIR filter, you need to maintain 10k accumulators. If an accumulator is 20 bits, that's 200k bits, or 6 BRAM blocks. Every ADC clock you need to fetch every one of these, multiply the most recent ADC reading by the appropriate coefficient, add it to the value you have read, and store the result into the next slot. If you don't have enough BRAM you will need to organize a pipeline which reads from external RAM and writes back to it, but if you have 5 DSPs working on this, that's 100 bits read and 100 bits written every DSP cycle, which you probably won't be able to achieve with external RAM.
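To make the accumulator scheme concrete, here is a minimal Python sketch of the per-sample update described above (a transposed-form FIR), checked against a plain convolution. The function names and the test data are made up for illustration, and a real design would of course be written in HDL with fixed-width arithmetic.

```python
def transposed_fir(samples, coeffs):
    """One accumulator per tap; every input sample updates all of them."""
    n = len(coeffs)
    acc = [0] * n                      # models the accumulator memory
    out = []
    for x in samples:                  # one iteration per ADC clock
        new_acc = [0] * n
        for k in range(n):
            prev = acc[k - 1] if k > 0 else 0          # value read from the previous slot
            new_acc[k] = prev + coeffs[n - 1 - k] * x  # add product, store in the next slot
        acc = new_acc
        out.append(acc[n - 1])         # the last slot holds the finished output
    return out


def direct_fir(samples, coeffs):
    """Reference: y[t] = sum over j of coeffs[j] * samples[t - j]."""
    out = []
    for t in range(len(samples)):
        y = 0
        for j, h in enumerate(coeffs):
            if t - j >= 0:
                y += h * samples[t - j]
        out.append(y)
    return out
```

Both functions produce the same output sequence; the transposed form just spreads the work so that each new sample touches every accumulator once, which is where the read/write bandwidth requirement comes from.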

You will also need space for the coefficients. For example, 10 bits x 10k = 100k bits, which is another 3 BRAM blocks.
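The block counts above can be reproduced with a few lines of Python. This is only a rough sketch: it assumes a 36 Kb block (36 x 1024 bits, as in Xilinx 7-series parts), takes "10k" as 10,000 entries, and ignores the width/depth configurations a real BRAM imposes.

```python
import math

BRAM_BITS = 36 * 1024  # assumed block size: one 36 Kb BRAM


def bram_blocks(entries, bits_per_entry, bram_bits=BRAM_BITS):
    """Number of whole BRAM blocks needed to hold `entries` words."""
    return math.ceil(entries * bits_per_entry / bram_bits)


accumulators = bram_blocks(10_000, 20)   # 20-bit accumulators
coefficients = bram_blocks(10_000, 10)   # 10-bit coefficients
print(accumulators, coefficients, accumulators + coefficients)
```

With these assumptions the sketch gives 6 blocks for the accumulators and 3 for the coefficients, 9 per filter in total.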

What is easy about the FPGA is that once you build one FIR filter, all of the others will be the same. So, an N x N matrix is just N^2 identical filters which use N^2 identical sets of resources.

If one FIR filter takes 5 DSPs and 9 BRAM blocks, then for 10 x 10 = 100 FIR filters you'll need 500 DSPs and 900 BRAM blocks. These are very approximate estimates though.

As FPGAs are built, BRAM will be more of a problem than DSP.

Using a big FPGA to fit all filters may not be a good idea. Say you want 100 FIR filters: the XC7K480T has 955 BRAM blocks and 1920 DSPs and costs a few $k.

It may be more beneficial to use a number of smaller FPGAs. Say, XC7A200T has 365 BRAM blocks and 740 DSPs, which might be enough for 40 FIR filters.
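As a quick sanity check on these numbers, the following Python sketch scales the per-filter estimate (5 DSPs and 9 BRAM blocks, taken from the rough figures above) and works out how many filters each of the two parts could hold. The helper names are made up, and routing, timing and control logic are ignored entirely.

```python
DSP_PER_FILTER = 5    # rough per-filter estimate from the post
BRAM_PER_FILTER = 9   # rough per-filter estimate from the post

# BRAM and DSP counts as quoted in the post
DEVICES = {
    "XC7K480T": {"bram": 955, "dsp": 1920},
    "XC7A200T": {"bram": 365, "dsp": 740},
}


def resources_for(n_filters):
    """(DSPs, BRAM blocks) needed for n identical filters."""
    return n_filters * DSP_PER_FILTER, n_filters * BRAM_PER_FILTER


def max_filters(bram, dsp):
    """Filters that fit, limited by whichever resource runs out first."""
    return min(bram // BRAM_PER_FILTER, dsp // DSP_PER_FILTER)


for name, res in DEVICES.items():
    print(name, max_filters(**res))
```

In both cases BRAM is the limiting resource, which is why the XC7A200T lands at about 40 filters despite having DSPs to spare.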
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3237
  • Country: ca
Perhaps you can make better use of external memory if you process your filter multiple taps at a time. Say you accumulate 200 taps, then update the data in your external memory. Such an approach uses less internal memory and requires less external memory bandwidth.
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2812
  • Country: nz
I quickly sketched a design on paper. The big problem will be the storage for the FIR filter's kernel, as I have to assume that each filter uses a unique set of coefficients.

Most FPGAs have a block memory that is tightly coupled to the DSP block and is somewhere around 1k entries in size. This makes filters of up to this length very fast and efficient, but longer ones much less so.

A DSP block operates at about 200 MHz or so without extra-careful design. That will allow a single DSP block to process 200 kS/s with a 1000-tap filter kernel. To make longer filters you need to use 'n' partial filters in parallel, each with 1/nth of the kernel, then total the outputs outside of the DSP block.
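The partial-filter idea can be checked numerically. Below is a minimal Python sketch, with made-up function names and small test sizes, that splits a kernel into n chunks, runs each chunk as its own FIR on a suitably delayed input, and sums the partial outputs. It also includes the throughput arithmetic, assuming one multiply-accumulate per DSP clock.

```python
def fir(samples, coeffs, delay=0):
    """Direct-form FIR; `delay` shifts the input by that many samples."""
    out = []
    for t in range(len(samples)):
        y = 0
        for j, h in enumerate(coeffs):
            i = t - delay - j
            if i >= 0:
                y += h * samples[i]
        out.append(y)
    return out


def split_fir(samples, coeffs, n_parts):
    """Same filter built from n_parts partial filters whose outputs are summed."""
    chunk = -(-len(coeffs) // n_parts)          # ceiling division
    total = [0] * len(samples)
    for p in range(n_parts):
        part = coeffs[p * chunk:(p + 1) * chunk]
        partial = fir(samples, part, delay=p * chunk)
        total = [a + b for a, b in zip(total, partial)]
    return total


def samples_per_second(dsp_clock_hz, taps_per_dsp):
    """Sample rate one DSP block sustains at one MAC per clock."""
    return dsp_clock_hz // taps_per_dsp
```

For example, samples_per_second(200_000_000, 1000) gives 200,000, matching the 200 kS/s figure, and split_fir returns the same sequence as fir for any number of parts.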

For thirty-six 10k-long FIR filters, you will need 360 DSP blocks to process 200k samples per second. And of course, because you can only process the data after it has been acquired, you will have a latency of a little over 5000 samples (1/40th of a second). This assumes you are using FIR filters with '5000 values in the past, and 5000 values in the future'.

In Xilinx world, an XC7A100T with 240 DSP slices is too small, and an XC7A200T with 740 DSP slices is about twice what you need. However, if you could halve your filter length to 5k, then the XC7A100T would be plenty, and you would halve the overall latency too.
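The arithmetic behind these figures, as a small Python sketch. It assumes roughly 1000 taps per DSP block at 200 kS/s and a symmetric kernel centred on the current sample, as described in the post; the function names are just for illustration.

```python
import math


def dsp_blocks(n_filters, filter_len, taps_per_dsp=1000):
    """DSP blocks needed when each block handles up to taps_per_dsp taps."""
    return n_filters * math.ceil(filter_len / taps_per_dsp)


def latency_seconds(filter_len, sample_rate):
    """Delay of a centred kernel: half the filter length, in seconds."""
    return (filter_len // 2) / sample_rate


print(dsp_blocks(36, 10_000))             # DSP blocks for 36 filters of 10k taps
print(latency_seconds(10_000, 200_000))   # latency in seconds at 200 kS/s
print(dsp_blocks(36, 5_000))              # same, with the kernel halved to 5k taps
```

This gives 360 DSP blocks and 0.025 s for the 10k case, and 180 DSP blocks if the kernel is halved to 5k taps.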

I would want to write the core of the design and work out the achievable timing before selection of part or even thinking of the board design. It is far cheaper to change a dropdown in the EDA tools than get a new FPGA board.
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9921
  • Country: us
I would want to write the core of the design and work out the achievable timing before selection of part or even thinking of the board design. It is far cheaper to change a dropdown in the EDA tools than get a new FPGA board.

Right!  Pick a large device and start writing code.  Once the code is complete, reduce the device size until it no longer fits.

I haven't played with the DSP tiles but I checked with Google and apparently the simulator works.  So the project can be tested up through working simulation before selecting the final chip.

Search: "xilinx vivado simulate dsp devices"
 

Offline ali_asadzadeh

  • Super Contributor
  • ***
  • Posts: 1929
  • Country: ca
Quote
Also the VERY EXPENSIVE parts are not so expensive if you have quantity and know the market. If gray market parts are accepted, I can get a fully functioning XCVU9P accelerator card for less than $1200.
I love gray market >:D what part numbers of Ultrascales are popular there? ^-^
ASiDesigner, Stands for Application specific intelligent devices
I'm a Digital Expert from 8-bits to 64-bits
 


Offline ali_asadzadeh

  • Super Contributor
  • ***
  • Posts: 1929
  • Country: ca
Quote
There are $150 ZU3EG boards out there, so presumably with some quantity you can get the chips for at least no more than that.

The chip has quad A53 cores, dual R5F cores, 154k logic cells, 7.6Mb of RAM, and 360 DSP slices
Thanks for sharing
ASiDesigner, Stands for Application specific intelligent devices
I'm a Digital Expert from 8-bits to 64-bits
 

Offline gnuarm

  • Super Contributor
  • ***
  • Posts: 2247
  • Country: pr
Are you sure that's true?  I recognize that you can indeed do addition with fabric or DSPs, but I'm not convinced that the fabric is as fast as the DSPs are.  Isn't the whole point of DSPs to offload MACC operations from the fabric so they can be done at extremely high speed & throughput? What evidence is there to demonstrate that the fabric is just as fast as the DSPs are?

If you do MAC, the DSP block will do both addition and multiplication for you. So, you don't save anything by moving addition to the fabric for MAC.

However, if you have some other pattern, you can do addition in the fabric. DSP roughly works at 400 MHz. Create a project which does addition in the fabric. See if it can run at 400 MHz. This will be the evidence you're seeking.

Flexibility.  Using the adders in the MAC limits you to the configuration provided.  Using the adders from the fabric removes restrictions and makes porting to other architectures much easier.
Rick C.  --  Puerto Rico is not a country... It's part of the USA
 

