Author Topic: Is there any standard reference designs for a high speed inter-FPGA bridge? (Read 7753 times)

technix · « **on:** June 05, 2019, 04:29:49 am »

For example, I have a board with a Zynq-7020 and a Cyclone IV connected using some traces. Now I want to implement some kind of high speed bridge that exposes an AMBA AHB interface from the Zynq to the Cyclone IV. Is there any standard implementation for that, which can encapsulate AMBA AHB transactions into some form of external signaling format and shoot it over the traces, and decapsulate it on the other end?

I want to avoid SerDes at all costs (since my chips don't have any to begin with) but differential serial transmission is preferred (so length matching between pairs can be less critical.)

Berni · « **Reply #1 on:** June 05, 2019, 05:53:39 am »

That is a bit of an odd design to join two vendors of FPGAs like that.

But standardization in FPGAs is pretty bad to say the least. Each vendor pushes their own kind of memory or streaming bus that does the same thing, but ever so slightly differently so it's not directly compatible.

In principle you can just run the internal bus onto the pins and run it right to the other chip. Perhaps through some sort of memory bus bridge to transform it into a variant of the bus with an 8 or 16bit wide bus with perhaps address latching modes to cut down the number of pins required. Some extra latency can also be set for the bus to make sure the timings still work out when they make it to the other chip through the PCB. Another way is to have the master chip act as a synchronous SRAM controller and the slave chip act like its SRAM memory.

To use diff pairs you can just run the whole bus into a softcore serdes block, those will generally give you about 300 to 1000 Mbit/s on most FPGAs. On the other side you simply get your big parallel bus out of the serdes block again. Tho some trace length matching might still be required because you will not tend to have individual clock recovery for each pair to be able to tune out the delays. But due to the lower speeds it will tolerate a lot more trace length mismatch.

technix · « **Reply #2 on:** June 05, 2019, 06:20:14 am »

Quote from: Berni on June 05, 2019, 05:53:39 am

That is a bit of an odd design to join two vendors of FPGAs like that.

But standardization in FPGAs is pretty bad to say the least. Each vendor pushes their own kind of memory or streaming bus that does the same thing, but ever so slightly differently so it's not directly compatible.

In principle you can just run the internal bus onto the pins and run it right to the other chip. Perhaps through some sort of memory bus bridge to transform it into a variant of the bus with an 8 or 16bit wide bus with perhaps address latching modes to cut down the number of pins required. Some extra latency can also be set for the bus to make sure the timings still work out when they make it to the other chip through the PCB. Another way is to have the master chip act as a synchronous SRAM controller and the slave chip act like its SRAM memory.

To use diff pairs you can just run the whole bus into a softcore serdes block, those will generally give you about 300 to 1000 Mbit/s on most FPGAs. On the other side you simply get your big parallel bus out of the serdes block again. Tho some trace length matching might still be required because you will not tend to have individual clock recovery for each pair to be able to tune out the delays. But due to the lower speeds it will tolerate a lot more trace length mismatch.

So for using diff pairs, use one soft core LVDS serdes on each side, somehow have AHB directly connected to it on one side, and have AMBA AHB come out on the other side? Is it going to be transparent regardless of latency? As for clock synchronization, is it a good idea to run an LVDS clock pair along with the two unidirectional LVDS data pairs? And is 1.8V LVDS a good option for that?

As for poking the parallel bus out, are there any transparent solutions that can pack a 32-bit bus into something that is 8-bit wide and restore the bus on the other end?

Berni · « **Reply #3 on:** June 05, 2019, 07:56:12 am »

Well of course you have to make sure your timings and latencies work out for the bus you are running across. Buses are usually configured for certain timing settings, so it's just a matter of setting up the bus correctly.

LVDS obviously adds extra latency to the whole thing because the process of serialization takes time, but the time it takes is always constant so the bus can be configured for the correct latency. Clock + unidirectional LVDS pairs is the way to go, tho you may want more than one pair going each way if you want to have lots of speed and low latency.

As for narrowing the bus down to 16bit or 8bit, this is done by the same memory bridge block that you will likely need to sort out the bus timings (internal buses likely run on timings that are too fast). Various memory bus bridges and adapters are often used inside FPGAs for connecting modules with the wrong bus type onto the given main memory bus. So if you connect a module with an 8bit interface to a 32bit bus there will have to be a bus converter in between to make it work. Vendors often provide such bus converters, or provide a tool that builds your bus system automatically, inserting these converters wherever needed like magic (like Altera SOPC Builder and whatever Xilinx's equivalent tool is called).
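To make the width conversion concrete, here is a minimal behavioural sketch in Python of what such a converter does to the data: one 32-bit word goes out as several narrower beats and is reassembled on the other side. The names `narrow` and `widen` and the least-significant-beat-first ordering are assumptions for illustration, not any vendor's converter.

```python
def narrow(word, beat_bits=8, word_bits=32):
    """Split one wide bus word into narrow beats, least significant beat first."""
    if not 0 <= word < (1 << word_bits):
        raise ValueError("word does not fit in word_bits")
    mask = (1 << beat_bits) - 1
    return [(word >> shift) & mask for shift in range(0, word_bits, beat_bits)]


def widen(beats, beat_bits=8):
    """Reassemble the wide word from beats received in the same order."""
    word = 0
    for index, beat in enumerate(beats):
        word |= beat << (index * beat_bits)
    return word


beats = narrow(0xDEADBEEF)        # four 8-bit beats for an 8-bit external bus
restored = widen(beats)           # what the far side puts back on its 32-bit bus
```

In real logic this is a small shift register plus a beat counter on each side; the point is only that the conversion itself is mechanical, and the timing and handshake are where the actual work goes.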

Even so, these bus width/timing adapters are not hard to make yourself as long as you don't try to support all the advanced functionality of a complex memory bus configuration (like transaction interleaving, reordering, buffering, special DMA related stuff etc) that in most cases you will not use or need. So in the end all you need to handle is "Please write X to location Y" and "Please tell me what is at location Y" along with perhaps a busy signal to pause the bus if something clogs up. It's your choice how to get that information across depending on your speed and performance needs. Heck you could do it all over a simple SPI bus if high bandwidth is not required.
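A rough Python model of that minimal protocol might look like the following. The frame layout (one opcode byte, a 32-bit big-endian address, and a 32-bit data word for writes) and the names `OP_WRITE`, `OP_READ` and `SlaveBridge` are made up for this sketch, and the busy signal is left out.

```python
import struct

OP_WRITE = 0x01  # hypothetical opcode: "please write X to location Y"
OP_READ = 0x02   # hypothetical opcode: "please tell me what is at location Y"


def encode_write(addr, data):
    """Opcode byte, 32-bit address, 32-bit data, all big-endian."""
    return struct.pack(">BII", OP_WRITE, addr, data)


def encode_read(addr):
    """Opcode byte and 32-bit address."""
    return struct.pack(">BI", OP_READ, addr)


class SlaveBridge:
    """Stands in for the far-side adapter; a dict models the slave bus."""

    def __init__(self):
        self.mem = {}

    def handle(self, frame):
        op = frame[0]
        if op == OP_WRITE:
            _, addr, data = struct.unpack(">BII", frame)
            self.mem[addr] = data
            return b""                      # writes return no payload here
        if op == OP_READ:
            _, addr = struct.unpack(">BI", frame)
            return struct.pack(">I", self.mem.get(addr, 0))
        raise ValueError("unknown opcode")


slave = SlaveBridge()
slave.handle(encode_write(0x40000010, 0x12345678))
(value,) = struct.unpack(">I", slave.handle(encode_read(0x40000010)))
```

The same byte stream could go over LVDS, a narrow parallel bus or SPI; only the transport underneath changes.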

tszaboo · « **Reply #4 on:** June 05, 2019, 08:37:10 am »

You know that big companies like Intel and AMD are struggling to connect their chips together transparently, with memory coherency. AMBA, AHB and others are on-chip buses, meaning that they are designed to stay on one chip.

technix · « **Reply #5 on:** June 05, 2019, 10:39:46 am »

Quote from: NANDBlog on June 05, 2019, 08:37:10 am

You know that big companies like Intel and AMD are struggling to connect their chips together transparently, with memory coherency. AMBA, AHB and others are on-chip buses, meaning that they are designed to stay on one chip.

I have some FPGA overflow but it just isn't worth it to step the main FPGA chip up from the CLG400 package to the FBG484 package, which would require me to add PCB layers to accommodate that increased package size. So while staying on the original layer count I had the idea of offloading some modules into a separate, smaller and cheaper 144-pin FPGA and bridging the two.

asmi · « **Reply #6 on:** June 05, 2019, 11:23:46 am »

There is a free AXI Chip2Chip IP provided by Xilinx for that purpose. It can use either parallel or MGT serial physical connection. Check Xilinx document PG067.

tszaboo · « **Reply #7 on:** June 05, 2019, 11:30:02 am »

Usually unless you need a lot of high speed interfaces, you typically just put a bunch of 74AC164 and 165s on your board.

technix · « **Reply #8 on:** June 05, 2019, 12:07:19 pm »

Quote from: NANDBlog on June 05, 2019, 11:30:02 am

Usually unless you need a lot of high speed interfaces, you typically just put a bunch of 74AC164 and 165s on your board.

These are not just simple GPIO blocks.

Quote from: asmi on June 05, 2019, 11:23:46 am

There is a free AXI Chip2Chip IP provided by Xilinx for that purpose. It can use either parallel or MGT serial physical connection. Check Xilinx document PG067.

Can it use soft-core serdes though, and can it be used on the Intel chip?

asmi · « **Reply #9 on:** June 05, 2019, 12:33:53 pm »

Quote from: technix on June 05, 2019, 12:07:19 pm

Can it use soft-core serdes though, and can it be used on the Intel chip?

I don't know what a "soft-core serdes" is, and the latter is likely a "no".

asmi · « **Reply #10 on:** June 05, 2019, 12:40:41 pm »

Quote from: technix on June 05, 2019, 10:39:46 am

I have some FPGA overflow but it just isn't worth it to step the main FPGA chip up from the CLG400 package to the FBG484 package, which would require me to add PCB layers to accommodate that increased package size. So while staying on the original layer count I had the idea of offloading some modules into a separate, smaller and cheaper 144-pin FPGA and bridging the two.

This is a bad idea. Increasing the layer count is very likely to end up significantly cheaper than adding another chip (with all the components it will require, like power DC-DCs, QSPI flash, JTAG, etc.).
There is also the CLG484/485 package if you want to stick to 0.8 mm pitch and not go to the much larger 1.0 mm one, but the latter will obviously be much easier to route.

NorthGuy · « **Reply #11 on:** June 05, 2019, 01:19:18 pm »

How fast do you need it to be?

technix · « **Reply #12 on:** June 05, 2019, 02:40:54 pm »

Quote from: asmi on June 05, 2019, 12:40:41 pm

Quote from: technix on June 05, 2019, 10:39:46 am
I have some FPGA overflow but it just isn't worth it to step the main FPGA chip up from the CLG400 package to the FBG484 package, which would require me to add PCB layers to accommodate that increased package size. So while staying on the original layer count I had the idea of offloading some modules into a separate, smaller and cheaper 144-pin FPGA and bridging the two.
This is a bad idea. Increasing the layer count is very likely to end up significantly cheaper than adding another chip (with all the components it will require, like power DC-DCs, QSPI flash, JTAG, etc.).
There is also the CLG484/485 package if you want to stick to 0.8 mm pitch and not go to the much larger 1.0 mm one, but the latter will obviously be much easier to route.

The 144-pin one shares all its power rails with the CLG400 one, so no additional DC-DC is needed. No QSPI is used either, as the interconnect traces do include provision for the ARM cores in the Zynq to load the bitstream into that satellite FPGA. And JTAG is daisy chained. I am talking TQG144 too, which is a package doable even on a dual-layer board.

OwO · « **Reply #13 on:** June 05, 2019, 03:18:59 pm »

More pins != harder to route, because you don't need to route out all the IO pins. It's usually very much the opposite, since the used IO density is lower meaning you have more space between vias to route signals through. Also going to 1.0mm pitch helps a lot. How many layers are you currently using?

Berni · « **Reply #14 on:** June 05, 2019, 05:40:40 pm »

Well who knows, maybe the design makes sense to use a 2nd smaller FPGA. You do sometimes see an FPGA and a small CPLD on the same board, so why not a big FPGA+SOC and a smaller FPGA on the same board.

Oh and a so called "softcore serdes" is a serdes that is implemented inside of the FPGA fabric rather than being a real physical serdes block. As a result it's much slower in terms of max bitrate, but can often be used on any IO pin and can be more flexible in terms of data formats. Often this serdes implementation uses the help of some extra hardware inside the IO pin such as DDR flip flops in order to achieve higher speeds than are generally possible inside the FPGA fabric. Despite being a lot slower they are often still fast enough to run a lot of LVDS interfaces you might run into.

Since you can shove the memory write/read requests down any kind of "pipe" by writing your own simple bus adapter on each side, the choice of the bus is mainly driven by how much bandwidth you need. On the top end you could have an LVDS link going 8 pairs one way and 8 pairs back, costing you 18 pins and getting about 1GB/s in both directions. Cutting it down to a single pair gets your IO cost down to 6 pins to get you about 100MB/s. Going for an 8bit data bus will cost about 20 normal general purpose pins and get you about 300 MB/s but with nice latency due to no serdes, or cutting it down to a simple SPI bus with 4 pins could get you up to 300Mbit/s or about 35MB/s. Heck even UART running at a ridiculous 100Mbit+ baud rate could be used, tho that's not a great idea for FPGAs.
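For anyone wanting to redo these estimates with their own numbers, the raw figure is just data lanes times per-lane bit rate divided by 8. The short Python sketch below does that; the per-lane rates are back-calculated guesses that roughly reproduce the figures above (they are not stated in the post), and protocol overhead is ignored.

```python
def raw_mbytes_per_s(lanes, mbit_per_lane):
    """Raw one-direction bandwidth in MB/s for a given number of data lanes."""
    return lanes * mbit_per_lane / 8


# Per-lane rates below are assumptions chosen to match the rough estimates.
options = {
    "LVDS, 8 pairs each way": raw_mbytes_per_s(8, 1000),
    "LVDS, 1 pair each way": raw_mbytes_per_s(1, 800),
    "8-bit parallel bus at 300 MHz": raw_mbytes_per_s(8, 300),
    "SPI at 300 Mbit/s": raw_mbytes_per_s(1, 300),
}

for name, mbytes in options.items():
    print(f"{name}: {mbytes:.1f} MB/s")
```

Usable throughput will be lower once address and control bits share the same lanes.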

technix · « **Reply #15 on:** June 05, 2019, 05:42:23 pm »

Quote from: OwO on June 05, 2019, 03:18:59 pm

More pins != harder to route, because you don't need to route out all the IO pins. It's usually very much the opposite, since the used IO density is lower meaning you have more space between vias to route signals through. Also going to 1.0mm pitch helps a lot. How many layers are you currently using?

I am on 6 and seriously considering going to 4 and moving the Zynq to a COTS SoM. And since I am talking Zynq, the memory bus is a routing bitch, forcing the layer count up, as on that side almost all pins must be routed out.

Quote from: Berni on June 05, 2019, 05:40:40 pm

Since you can shove the memory write/read requests down any kind of "pipe" by writing your own simple bus adapter on each side, the choice of the bus is mainly driven by how much bandwidth you need. On the top end you could have an LVDS link going 8 pairs one way and 8 pairs back, costing you 18 pins and getting about 1GB/s in both directions. Cutting it down to a single pair gets your IO cost down to 6 pins to get you about 100MB/s. Going for an 8bit data bus will cost about 20 normal general purpose pins and get you about 300 MB/s but with nice latency due to no serdes, or cutting it down to a simple SPI bus with 4 pins could get you up to 300Mbit/s or about 35MB/s. Heck even UART running at a ridiculous 100Mbit+ baud rate could be used, tho that's not a great idea for FPGAs.

My processor maxes out at 667MHz anyway so the latency of a serdes can be just a few CPU cycles, so I would take the serdes route. Also I am offloading the slower peripherals here, which have a maximum bus frequency lower than 50MHz, so having something like 800MB/s should pose no more latency than an internal implementation.

asmi · « **Reply #16 on:** June 05, 2019, 06:04:49 pm »

Quote from: Berni on June 05, 2019, 05:40:40 pm

Oh and a so called "softcore serdes" is a serdes that is implemented inside of the FPGA fabric rather than being a real physical serdes block. As a result it's much slower in terms of max bitrate, but can often be used on any IO pin and can be more flexible in terms of data formats. Often this serdes implementation uses the help of some extra hardware inside the IO pin such as DDR flip flops in order to achieve higher speeds than are generally possible inside the FPGA fabric. Despite being a lot slower they are often still fast enough to run a lot of LVDS interfaces you might run into.

The question is "what's the point?" 7 series FPGAs and Zynqs have a pair of ISERDES/OSERDES for each and every IO pin so you will never run out of them unless you're doing something really cheesy like using SERDES cascading with single-ended IO.

OwO · « **Reply #17 on:** June 05, 2019, 06:12:38 pm »

Quote from: technix on June 05, 2019, 05:42:23 pm

I am on 6 and seriously considering going to 4 and moving the Zynq to a COTS SoM. And since I am talking Zynq, the memory bus is a routing bitch, forcing the layer count up, as on that side almost all pins must be routed out.

I have a reference design for Zynq-7010/7020 400 pin BGA on 4 layers: https://github.com/gabriel-tenma-white/sdr5#zynq_som_2
The DDR3 memory runs at full speed with no errors, 2 units prototyped so far.

OwO · « **Reply #18 on:** June 05, 2019, 06:19:28 pm »

Cost of a Zynq 7010: $15 (new). Cost of a commercial SoM with DDR3: over $100. Similar story for 7020. I wouldn't use a SoM especially when you have already spent the effort designing the DDR3 layout.

Berni · « **Reply #19 on:** June 05, 2019, 06:31:51 pm »

Well since internal memory buses are often 32 bits wide, that would mean a 50MHz bus would only transfer 200MB/s. So you would probably need about 3 to 4 diff pairs in each direction to make it work when counting in overhead for control signals and that you might not want to run the pairs at max speed to get some extra timing slack and reliability.

In theory the 4 pairs will need 8 serial clock cycles to transfer 32 bits and likely a 9th for control signals. So 450Mbit/s per pair would do nicely (9 bits per pair in each 50MHz bus cycle); this is something most FPGAs can easily do with soft serdes. So this transfers the 32 bits in the time of a single 50MHz bus clock cycle. You will also need to get your address across tho. So it's 1 cycle to move the address, 1 cycle to move the data, 1 cycle to execute the operation on the slave bus, 1 cycle to receive the data response, all together 4 cycles to execute a read in the slave FPGA. This is not too bad, but you could also have support for burst transfers to get more speed for large chunks. There you would need the 4 cycles to get the first read done, but each next read is only 1 cycle, thus getting you the 200MB/s theoretical max speed until the burst transfer finishes.
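The arithmetic above is easy to play with in a few lines of Python. This is only a sketch using the numbers from this post (50MHz bus, 32-bit words, 4 pairs, 1 control bit time, 4 cycles for the first read); the function names are made up.

```python
def lane_rate_mbit(bus_mhz, word_bits, pairs, control_bit_times=1):
    """Per-pair bit rate needed to move one word plus control per bus cycle."""
    bit_times = word_bits // pairs + control_bit_times
    return bus_mhz * bit_times


def burst_mbytes_per_s(bus_mhz, bytes_per_word, burst_len, first_read_cycles=4):
    """Average read throughput when only the first word pays the full latency."""
    cycles = first_read_cycles + (burst_len - 1)
    return bus_mhz * bytes_per_word * burst_len / cycles


rate = lane_rate_mbit(50, 32, 4)            # 8 data + 1 control bit times
single = burst_mbytes_per_s(50, 4, 1)       # isolated reads
burst16 = burst_mbytes_per_s(50, 4, 16)     # 16-word burst
```

With these assumptions a single pair needs 450 Mbit/s, isolated reads manage 50 MB/s, and longer bursts climb toward (but never quite reach) the 200 MB/s ceiling.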

You might need to add an extra clock cycle here or there if some extra time is needed to make a decision in there. Bus transfers always have some sort of latency associated with them so you will never get to 0 cycle latency. This was one of the motivations for Intel in x86 to get rid of the north bridge and bring the memory controller onto the CPU die.

EDIT: By the way OwO where can you buy a Zynq 7010 for $15?

asmi · « **Reply #20 on:** June 05, 2019, 06:42:37 pm »

Quote from: OwO on June 05, 2019, 06:19:28 pm

Cost of a Zynq 7010: $15 (new).

Not everyone is willing to put up with dodgy Chinese sellers selling you random parts while trying to fool you into believing it's something else. Some like to only buy through official channels. This is especially so for commercial products when you need to be sure the seller you bought chips from will still be there in 5 years from now. I tried it personally and the seller sent me an SG1 device instead of the SG2 I ordered and somehow thought that it was OK to do so.

I got my money back (and still got a chip, though I haven't soldered it yet to see if it works at all), but if it had been a commercial project, I'd have been totally screwed, as for some time I'd have had no money nor chips.
Oh, and please stop calling this crap "new". It's not. It might be from old stocks, or it may be desoldered reballed junk that may or may not work, or (my case) it might be a different chip altogether from the one you ordered.

asmi · « **Reply #21 on:** June 05, 2019, 06:43:47 pm »

Quote from: Berni on June 05, 2019, 06:31:51 pm

EDIT: By the way OwO where can you buy a Zynq 7010 for $15?

Go to aliexpress and go for it - if you're willing to play Russian roulette with what you will actually get (if at all).

NorthGuy · « **Reply #22 on:** June 05, 2019, 07:12:56 pm »

Quote from: asmi on June 05, 2019, 06:43:47 pm

Go to aliexpress and go for it - if you're willing to play Russian roulette with what you will actually get (if at all).

How about the chips you bought a few months ago? Did they work OK?

asmi · « **Reply #23 on:** June 05, 2019, 07:16:56 pm »

Quote from: NorthGuy on June 05, 2019, 07:12:56 pm

How about the chips you bought a few months ago? Did they work OK?

Don't know yet. The board I was going to put it on was really designed for an SG2 device (DDR3L and all that), and so are most of my projects, so such a board won't be very useful to me. Which is why I don't feel like spending hours assembling the board to check it out.

brucehoult · « **Reply #24 on:** June 06, 2019, 08:17:20 am »

At SiFive, we used TileLink, originally developed at UC Berkeley, for all on-chip communications. Some of our customers prefer AXI so we have fully-featured bridges between them:

https://www.sifive.com/documentation/tilelink/tilelink-spec/
https://static.dev.sifive.com/AXI2TL.pdf
https://static.dev.sifive.com/TL2AXI.pdf

For inter-chip or inter-board, we have "ChipLink". For example this is used to send coherent memory traffic between the HiFive Unleashed board and either a Xilinx VC707 FPGA board, or MicroSemi "HiFive Unleashed Expansion Board" over an FMC connector using 2x35 pins @200 MHz. The boards bridge this to PCIe (and other things) and we run video cards, SSDs etc there.

There's no official spec of ChipLink yet. The source for the interface is here https://github.com/sifive/sifive-blocks/tree/master/src/main/scala/devices/chiplink and a presentation here
https://content.riscv.org/wp-content/uploads/2017/12/Wed-1154-TileLink-Wesley-Terpstra.pdf

This will all be proposed as a standard once it settles down a bit.

Western Digital has also been working on a TileLink-over-Ethernet implementation https://github.com/westerndigitalcorporation/omnixtend

We also very much like NVMe as a physical layer, so I think you'll see some work around that emerge later.

A quote from Wes (ChipLink designer):

Quote

There is no PHY between ChipLink.scala and the IO pins. The only trickery is that we use PLLs to shift the data relative to the source synchronous clock to close timing. Here are the timing constraints: https://github.com/sifive/fpga-shells/blob/650b45dbf7fe03ff606a92715ae2277e9fde2c28/src/main/scala/shell/xilinx/ChipLinkOverlay.scala#L37

And one from Krste Asanovic (SiFive founder):

Quote

We are keen to have TileLink be an open standard, and we also want to have the TileLink protocol be mapped on top of various physical chip-chip links, forming various kinds of open ChipLink standards - though we’re not sure we want to use the ChipLink name instead of TileLink-over-X for example. We’re still sorting out the best way of layering the specifications, but are actively working towards this. We are also working to create a better venue to have open discussions/development around this standard. As Bruce said, the principals are very busy doing many other wonderful things at SiFive so it’s lack of bandwidth, not intent, that has meant the spec is not better documented.

Both these from https://forums.sifive.com/t/chiplink-isnt-an-open-spec-apparently-or-can-i-see-it-somewhere/1479

