Author Topic: Which FGPA/tool for this project? (Read 16938 times)

PartialDischarge · « **on:** September 21, 2017, 05:09:33 pm »

A long time ago I programmed a somewhat limited algorithm to locate the position of a magnet in space. Now I'd like to go further and port it to an FPGA.
I'm currently using a PSoC 5LP running at 67 MHz (Cortex-M3).
The algorithm makes heavy use of 16-bit fixed-point multiplications and sums for speed, because I have multiple 4x4 matrix multiplications, Jacobians, determinants, cosines, sines... For example, trigonometric functions are calculated by Taylor series to avoid slow floating-point functions.

- Everything I have is programmed in C.
- The project hardware has only one LCD and 4 I2C magnetometers (although it could be 8).
- I had a little experience with Quartus a long time ago and didn't like it; besides that, I have no idea of the FPGA world.
- I don't care about the cost of the FPGA or hardware dev tools, but I would like to keep the software tools free or to a minimum, and easy to learn. I need a fast architecture and easy programming, not lots of features.
- I need something that has about 5-10 times the processing speed of this PSoC 5.
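
For readers unfamiliar with the approach described above, here is a minimal sketch of 16-bit fixed-point multiplication and a Taylor-series sine. It is not the poster's code: the Q15/Q13 formats, the 7th-order series, and the names `q15_mul` and `sin_taylor_q15` are all assumptions chosen for illustration.

```c
#include <stdint.h>

typedef int16_t q15_t;   /* Q15: value = raw / 32768, range [-1, 1) */

/* Q15 x Q15 -> Q15 with rounding and saturation.
 * Assumes >> on a negative int32_t is an arithmetic shift, which is
 * implementation-defined in C but holds for GCC and Clang. */
static q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;  /* Q30 product */
    p = (p + (1 << 14)) >> 15;            /* round, rescale to Q15 */
    if (p > INT16_MAX) p = INT16_MAX;     /* only (-1) * (-1) can overflow */
    return (q15_t)p;
}

/* sin(x) for |x| <= pi/2, with x given in Q13 radians (raw = x * 8192).
 * 7th-order Taylor series in Horner form:
 *   sin x = x * (1 - x^2/6 * (1 - x^2/20 * (1 - x^2/42)))
 * Returns the result in Q15. */
static q15_t sin_taylor_q15(int16_t x_q13)
{
    const int32_t one = 1 << 13;          /* 1.0 in Q13 */
    int32_t x  = x_q13;
    int32_t ax = (x < 0) ? -x : x;        /* sine is odd: work on |x| */
    int32_t x2 = (ax * ax) >> 13;         /* x^2 in Q13 */
    int32_t t  = one - x2 / 42;
    t = one - ((x2 * t) >> 13) / 20;
    t = one - ((x2 * t) >> 13) / 6;
    int32_t y = ((ax * t) >> 13) * 4;     /* Q13 -> Q15 */
    if (y > INT16_MAX) y = INT16_MAX;     /* clamp just below 1.0 */
    return (q15_t)((x < 0) ? -y : y);
}
```

With these formats the sine is accurate to a few Q15 counts over the quarter wave; a real implementation would add range reduction for angles outside [-pi/2, pi/2].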

legacy · « **Reply #1 on:** September 21, 2017, 06:53:09 pm »

PIC32, or an MPU with CORDIC in hardware, may accelerate this better than a softcore in an FPGA, especially on the software side.

PartialDischarge · « **Reply #2 on:** September 22, 2017, 05:49:34 am »

Quote from: legacy on September 21, 2017, 06:53:09 pm

PIC32, or an MPU with CORDIC in hardware, may accelerate this better than a softcore in an FPGA, especially on the software side.

Sorry for not being specific enough. Since I want fast execution, I'll have to implement the algorithm in hard-core. The easiest would of course be to use a very fast MPU, but I'm not sure that would be enough. Even 300-400 MHz could turn out to be slow.

PartialDischarge · « **Reply #3 on:** September 22, 2017, 06:10:04 am »

Quote from: blueskull on September 22, 2017, 05:53:56 am

How about BlackFin+? BF702, at $6.80 each at 1kpcs, it packs 400MHz of speed (dual 16 bit MAC), or 1.6GFLOPS.

What is the cost of the software dev tools for the Blackfin family?

Ice-Tea · « **Reply #4 on:** September 22, 2017, 06:52:23 am »

By processing speed you mean CPU power? And you want to keep the FPGA fabric around?

In that case, perhaps a SmartFusion2 may do the trick (I know, I mentioned these before, yes I like them a lot).

PartialDischarge · « **Reply #5 on:** September 22, 2017, 07:01:00 am »

Quote from: Ice-Tea on September 22, 2017, 06:52:23 am

In that case, perhaps a SmartFusion2 may do the trick (I know, I mentioned these before, yes I like them a lot).

Suggestion appreciated, didn't know about this platform. And the cost of the software tools is...?

Ice-Tea · « **Reply #6 on:** September 22, 2017, 07:07:12 am »

They changed it recently, but if I understand correctly: as with many vendors, there's a free version that will work fine unless you get the big a** devices.

Ice-Tea · « **Reply #7 on:** September 22, 2017, 07:11:35 am »

Sure? Because I got some quotes for SmartFusion 1 a few years ago and they were pretty damn low.

PartialDischarge · « **Reply #8 on:** September 22, 2017, 07:12:11 am »

Quote from: blueskull on September 22, 2017, 07:05:40 am

Libero is free for most uses. Beware that AFS/A2F chips are not cheap.

The application that I'm targeting is not cost sensitive. If it does the job, 30€ apiece is OK.

rs20 · « **Reply #9 on:** September 22, 2017, 07:14:50 am »

Sounds like you could take your complicated equations and fit some approximations that are off by 1ppm but require dramatically fewer computations?

PartialDischarge · « **Reply #10 on:** September 22, 2017, 07:21:05 am »

Quote from: rs20 on September 22, 2017, 07:14:50 am

Sounds like you could take your complicated equations and fit some approximations that are off by 1ppm but require dramatically fewer computations?

Nope, it's not an equation that gives an exact result. It is a math-intensive iterative algorithm that minimizes the error of the output. Right now I'm simplifying things by restricting the magnet to be perpendicular to the plane.

Marco · « **Reply #11 on:** September 22, 2017, 08:14:58 am »

A Cortex-M7 should have plenty of power if you switch to floating point.

technix · « **Reply #12 on:** September 22, 2017, 11:30:10 am »

If space and power allow, a Raspberry Pi 3? That thing packs a quad-core Cortex-A53 and a GPU.

NorthGuy · « **Reply #13 on:** September 22, 2017, 01:30:44 pm »

If your algorithm performs a long consecutive chain of calculations, FPGA is not the best choice. FPGA is good at doing things in parallel. For example, it may have a number of DSP blocks which can all do multiplication at the same time - say 100 multiplications at once. This gives an FPGA tremendous speed compared to a CPU. For consecutive algorithms, people often build their own CPUs (so-called softcores) in FPGA fabric and then use them for processing. Such an approach doesn't take full advantage of FPGA parallelism. If that's what you want to do, you will probably be better off with a real CPU.

daqq · « **Reply #14 on:** September 22, 2017, 01:46:51 pm »

I too believe an FPGA is not the right choice for this kind of thing. There are various ARMs out there that have LOADs of computational power, even fairly small ones can have DSP instructions and you can tweak your algorithms for more efficient usage. There are also fun ways to get around the nasty functions - you mentioned you are replacing your trigonometry with a Taylor series - with 16-bit math you can easily replace any such calculation with a lookup table (sin and cos can share one table, as can asin and acos), essentially removing any waiting. 16-bit input x 16-bit result is 128k of RAM/flash - less if you look at it properly (sine being symmetrical).

You mentioned the number crunching is pretty complex - as a rough benchmark, could it run at sufficient speed on a Raspberry Pi?
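
To make the symmetry remark concrete, here is a minimal sketch of a quarter-wave sine table. The 17-entry table, the linear interpolation between entries, the 16-bit phase format (65536 counts per full turn), and the names `sin_lut` and `cos_lut` are illustrative assumptions, not something specified in the post. A full 65536-entry table of 16-bit values would be 128 KiB; storing only one quadrant cuts that to a quarter, and a coarse table with interpolation cuts it much further at the cost of accuracy.

```c
#include <stdint.h>

/* sin(k * pi/32) * 32767, rounded, for k = 0..16: one quarter wave. */
static const int16_t quarter_sine[17] = {
        0,  3212,  6393,  9512, 12539, 15446, 18204, 20787, 23170,
    25329, 27245, 28898, 30273, 31356, 32137, 32609, 32767
};

/* phase: 0..65535 maps to 0..2*pi. Returns sin(phase) in Q15. */
static int16_t sin_lut(uint16_t phase)
{
    uint16_t quadrant = phase >> 14;        /* top 2 bits: 0..3 */
    uint16_t pos = phase & 0x3FFF;          /* position inside the quadrant */
    if (quadrant & 1)
        pos = (uint16_t)(0x4000 - pos);     /* quadrants 1 and 3 are mirrored */
    uint16_t idx  = pos >> 10;              /* table index, 0..16 */
    uint16_t frac = pos & 0x3FF;            /* 10-bit interpolation weight */
    int32_t y = quarter_sine[idx];
    if (idx < 16)                           /* linear interpolation */
        y += ((quarter_sine[idx + 1] - y) * frac) >> 10;
    return (int16_t)((quadrant & 2) ? -y : y);  /* lower half is negated */
}

/* cos shares the same table: it is sin shifted by a quarter turn. */
static int16_t cos_lut(uint16_t phase)
{
    return sin_lut((uint16_t)(phase + 0x4000));
}
```

For example, `sin_lut(0x2000)` (45 degrees) lands exactly on a table entry and returns 23170, while values between entries are interpolated and come out slightly low.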

kakabouras · « **Reply #15 on:** September 22, 2017, 01:49:19 pm »

A possible option could be the Texas Instruments TMS320F2837xD series MCUs.

These beasts feature dual-core processors running at 200 MHz, plus:

- Dual floating point units.
- Dual trigonometric math units.
- Dual Viterbi/complex math units.
- Dual processors optimized for implementing control loops.

All the extra processing units run independently and concurrently with the main processors.

Of course, such an approach requires extra care in the software design in order to take advantage of the MCU features.

Prices for the chips fall within the range of 17 to 25 USD ea for quantities of 1k.
Development boards are around 200 USD (if my memory serves me well).
The development software tools are free.

However it should be noted that the learning curve is very steep due to the complexity of the chips.

Scrts · « **Reply #16 on:** September 22, 2017, 03:00:44 pm »

Won't I2C be the bottleneck if a fast DSP is used here?

PartialDischarge · « **Reply #17 on:** September 22, 2017, 04:11:18 pm »

Quote from: Scrts on September 22, 2017, 03:00:44 pm

Won't I2C be the bottleneck if a fast DSP is used here?

The ultimate bottleneck is the magnetic sensors. They are slow, and the faster they go the more noise you get. One of the models I use is the LSM303D; 25-50 Hz is the fastest for usable data. This was a few years back, maybe there is something radically better now.

Quote from: kakabouras on September 22, 2017, 01:49:19 pm

A possible option could be the Texas Instruments TMS320F2837xD series MCUs.

I know that kind of solution. As you mention, it takes effort to cram everything nicely in there. I made that kind of effort when I was younger, and I don't see the point now that MHz are widely available and cheap.

Quote from: daqq on September 22, 2017, 01:46:51 pm

You mentioned the number crunching is pretty complex - as a rough benchmark, could it run at sufficient speed on a Raspberry Pi?

I don't know yet. The algorithm I have programmed is a simplified version. The general algorithm considers the magnet at any position and orientation; I use Euler angles to make rotations, and it gets computationally intensive when you have to search the whole space.

Quote from: NorthGuy on September 22, 2017, 01:30:44 pm

If your algorithm performs a long consecutive chain of calculations, FPGA is not the best choice. FPGA is good at doing things in parallel.

True. But it's part sequential, part parallel. The iterations for minimization of course have to be done in series. But 4x4 matrix inverses, Jacobians or sines could be hardwired, and that amounts to a LOT of 16-bit multiplies.

I'll research the Blackfin and the SmartFusion solutions a bit more.
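
As a rough illustration of where the multiplies come from, here is a sketch of a 4x4 fixed-point matrix product. It is an assumption-laden example, not the poster's code: the Q15 format, the row-major layout, and the name `mat4_mul_q15` are chosen for illustration. One product takes 64 multiplies, and the 16 output elements are independent of each other, which is the part an FPGA could compute in parallel.

```c
#include <stdint.h>

typedef int16_t q15_t;   /* Q15: value = raw / 32768 */

/* out = a * b for 4x4 row-major Q15 matrices (64 multiply-accumulates).
 * out must not alias a or b. Assumes arithmetic right shift of negative
 * values, as provided by GCC and Clang. */
static void mat4_mul_q15(const q15_t a[16], const q15_t b[16], q15_t out[16])
{
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            int64_t acc = 0;                    /* wide accumulator, Q30 */
            for (int k = 0; k < 4; ++k)
                acc += (int32_t)a[4 * i + k] * (int32_t)b[4 * k + j];
            acc = (acc + (1 << 14)) >> 15;      /* round, rescale to Q15 */
            if (acc > INT16_MAX) acc = INT16_MAX;   /* saturate */
            if (acc < INT16_MIN) acc = INT16_MIN;
            out[4 * i + j] = (q15_t)acc;
        }
    }
}
```

On a CPU the three loops run one multiply-accumulate at a time; in FPGA fabric each of the 16 dot products could map onto its own group of DSP blocks.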

rstofer · « **Reply #18 on:** September 22, 2017, 04:26:44 pm »

Quote from: Scrts on September 22, 2017, 03:00:44 pm

Won't I2C be the bottleneck if a fast DSP is used here?

You would think...

5 times a 67 MHz chip is not hard to do. The Raspberry Pi 3 has a quad core running at 1.2 GHz. But how do you get down to the hardware?

The last couple of days I have been looking at Ultibo (www.ultibo.org) as it is based on FreePascal and provides the tools to talk directly to the hardware without a formal OS getting in the middle. In addition, the thread unit allows for manually scheduling the 4 cores. Ultibo is free... The slope of the learning curve will approach infinity, I think. There is just so much stuff to learn.

There are probably other ways to do this with the RPi. Basically, you want to get Linux out of the way (if it is necessary) and write straight to hardware.

There are other processors with similar features that aren't locked into the RPi design. This may make more SPI channels available or perhaps DMA would be easier to use, etc. The Broadcom chips seem to be locked in secrecy.

Any time you are doing multiply-add, you should be thinking of DSP. The Blackfin 537 is a 600 MHz chip designed specifically for DSP, as is the SHARC. I was using the GCC toolchain quite effectively and it's free. uClinux runs on the chip as it doesn't require an MMU.

http://www.analog.com/en/design-center/evaluation-hardware-and-software/evaluation-boards-kits/bf537-ezlite.html#eb-overview

If you can arrange your algorithm as parallel operations or some kind of pipelined arrangement, the FPGA can be VERY fast. The problem then becomes latency. Do the pipelined stages, with their associated delay, impact the speed requirements? You get one result each clock but it may be the result of a sample taken 10 clocks before. OTOH, the clock might turn out to be pretty fast. Many FPGAs can run at 200 MHz and the high dollar chips can run MUCH faster.
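
The throughput versus latency trade-off can be modelled in a few lines of C. This is only a toy delay-line model under assumed numbers: the 10-stage depth, the `pipe_t` type, and the function names are hypothetical, and the stages here just pass values along rather than doing any arithmetic.

```c
#include <stdint.h>

#define STAGES 10   /* assumed pipeline depth */

typedef struct { int32_t reg[STAGES]; } pipe_t;

/* One clock tick: a new sample enters, the oldest value falls out.
 * A value fed in on tick t comes out on tick t + STAGES. */
static int32_t pipe_clock(pipe_t *p, int32_t in)
{
    int32_t out = p->reg[STAGES - 1];
    for (int i = STAGES - 1; i > 0; --i)
        p->reg[i] = p->reg[i - 1];      /* every stage advances by one */
    p->reg[0] = in;
    return out;
}

/* Latency in nanoseconds for a given depth and clock (integer MHz). */
static uint32_t pipeline_latency_ns(uint32_t stages, uint32_t clock_mhz)
{
    return stages * 1000u / clock_mhz;
}
```

With 10 stages at 200 MHz the latency is 50 ns while a new result still appears every 5 ns, so for sensors sampled at tens of hertz the pipeline delay is negligible.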

On the FPGA, you read all sensors in parallel. You can have as many SPI channels as your package pins permit. And they can be FAST! It's simple to add a FIFO buffer for each sensor channel.

In terms of FPGAs, I can only discuss what I know and that's Xilinx and 'very little'. I am coming to terms with Vivado and the Artix 7 chips are pretty nice. Digilent's Nexys 4 DDR board (expensive) uses a HUGE chip with gobs of BlockRam. The Arty board uses a much smaller chip but it's still pretty useful.

If you have a need for bulk memory, the Nexys 4 DDR board is pretty nice because Digilent supplies a component to make the DDR look like static RAM. This eliminates the need to endlessly study timing diagrams or use a Xilinx IP core. I haven't used this component. I have used the Xilinx IP with their MicroBlaze softcore and it works well.

technix · « **Reply #19 on:** September 22, 2017, 05:30:34 pm »

If you can tolerate some latency you can use an ESP8266/ESP32/Raspberry Pi Zero W to stream the data off the system to a Wi-Fi network, to be processed using a laptop, a desktop PC, a gaming PC with one or two high-end GPUs, a server with a lot of high power x86 cores, or some cloud service like Amazon EC2. I don't think your calculations can hog up an 8-core Ryzen 7 and two GeForce GTX 1080 Ti's.

PartialDischarge · « **Reply #20 on:** September 22, 2017, 05:51:11 pm »

Quote from: technix on September 22, 2017, 05:30:34 pm

If you can tolerate some latency you can use an ESP8266/ESP32/Raspberry Pi Zero W to stream the data off the system to a Wi-Fi network, to be processed using a laptop, a desktop PC, a gaming PC with one or two high-end GPUs, a server with a lot of high power x86 cores, or some cloud service like Amazon EC2. I don't think your calculations can hog up an 8-core Ryzen 7 and two GeForce GTX 1080 Ti's.

Latency I can tolerate. However, the system must be light, portable, self-contained and battery operated. Sorry, I can't disclose the application I've thought of.

Sal Ammoniac · « **Reply #21 on:** September 22, 2017, 05:57:37 pm »

Quote from: rstofer on September 22, 2017, 04:26:44 pm

Digilent's Nexys 4 DDR board (expensive) uses a HUGE chip with gobs of BlockRam.

That's actually a pretty small FPGA. Take a look at this truly huge one... And don't choke on the price (yes, it actually is $76,000 for one chip.)

https://www.digikey.com/product-detail/en/xilinx-inc/XCVU440-3FLGA2892E/XCVU440-3FLGA2892E-ND/7604556

PartialDischarge · « **Reply #22 on:** September 22, 2017, 06:01:46 pm »

This is another setup I made using 8 magnetometer boards. I placed 2 perpendicular 3D magnetometers (Honeywell IIRC) at each corner.
Just something I wanted to try.

mikeselectricstuff · « **Reply #23 on:** September 22, 2017, 06:23:11 pm »

How fast does it actually need to be? I'd imagine it's limited by mechanics and the speed of the sensors.
Unless you want to do it as a learning exercise, it's going to be a lot easier to do in software on an MCU or DSP than on an FPGA, and you should explore that before assuming an FPGA will be necessary.

rstofer · « **Reply #24 on:** September 22, 2017, 06:45:47 pm »

Just working up floating point in an FPGA could be a major undertaking even if a core could be found. Verification comes to mind. The GCC libraries have been around for years and years being used on millions of applications with no apparent issues.

Yes, if an MCU has the capability, that is absolutely the way to go. A specialized DSP-type MCU like the Blackfin or SHARC would probably be better than a general-purpose MCU, but it may not be required. Megaflops are pretty easy to buy.

If the app needs to be portable and battery powered, I might consider a laptop or notebook. I imagine my Microsoft Surface Book has enough horsepower: Dual core, 4 threads, 2.6 GHz - that should cover it. Even with Win 10 in the way. The sensors are SLOW!

