Author Topic: FPGA: More elegant (and less timing violating) way of doing simple register map? (Read 6757 times)

daqq · « **on:** December 05, 2017, 07:27:07 am »

Hi guys,

I'm doing a block in an FPGA project that is pretty much a configuration register map - if NReset is set to zero, all of the registers gain their default values (a fixed parameter, but different for different registers), otherwise assuming that DataValid is one, the appropriate register (32 bit), defined by Address (7 bit) will be set to DataIn. These are then connected outside of the module to control whatever blocks are needed.

This approach synthesized well enough, but after adding more registers, it generates timing violations...

First of all I would like to ask whether there is any more elegant way to do something like this? Since verilog can't use an array of registers as a module port I can't just use something like

Code: [Select]

output reg [31:0] Registers [31:0];

As such I have to write out a LOT of lines.

Second: Is there any more elegant way to do this, that does not create an unholy synthesized mass of logic? The system runs @ 250MHz internally, so I can understand why it creates problems, since the combinational logic has a very high fanout (at least to my understanding, since I'm very new at this). It does not have to work in a single clock cycle - the fastest the control SPI can feed this is once every several thousand clock cycles.

Code: [Select]

always @(posedge Clock) begin
	if(NReset == 0) begin
		Reg00 <= Reg00_DefaultValue;
		Reg01 <= Reg01_DefaultValue; 
		Reg02 <= Reg02_DefaultValue; 
		Reg03 <= Reg03_DefaultValue;
		Reg04 <= Reg04_DefaultValue; 
		Reg05 <= Reg05_DefaultValue; 
		Reg06 <= Reg06_DefaultValue; 
		Reg07 <= Reg07_DefaultValue; 
		Reg08 <= Reg08_DefaultValue; 
		Reg09 <= Reg09_DefaultValue; 
		Reg0A <= Reg0A_DefaultValue; 
		Reg0B <= Reg0B_DefaultValue; 
		Reg0C <= Reg0C_DefaultValue; 
		Reg0D <= Reg0D_DefaultValue; 
		Reg0E <= Reg0E_DefaultValue; 
		Reg0F <= Reg0F_DefaultValue;
		
		Reg10 <= Reg10_DefaultValue;
		Reg11 <= Reg11_DefaultValue; 
		Reg12 <= Reg12_DefaultValue; 
		Reg13 <= Reg13_DefaultValue;
		Reg14 <= Reg14_DefaultValue; 
		Reg15 <= Reg15_DefaultValue; 
		Reg16 <= Reg16_DefaultValue; 
		Reg17 <= Reg17_DefaultValue; 
		Reg18 <= Reg18_DefaultValue; 
		Reg19 <= Reg19_DefaultValue; 
		Reg1A <= Reg1A_DefaultValue; 
		Reg1B <= Reg1B_DefaultValue; 
		Reg1C <= Reg1C_DefaultValue; 
		Reg1D <= Reg1D_DefaultValue; 
		Reg1E <= Reg1E_DefaultValue; 
		Reg1F <= Reg1F_DefaultValue;
	end else if(DataValid == 1'b1) begin 
		case (Address)
			7'h00: Reg00 <= DataIn;
			7'h01: Reg01 <= DataIn;
			7'h02: Reg02 <= DataIn;
			7'h03: Reg03 <= DataIn;
			7'h04: Reg04 <= DataIn;
			7'h05: Reg05 <= DataIn;
			7'h06: Reg06 <= DataIn;
			7'h07: Reg07 <= DataIn;
			7'h08: Reg08 <= DataIn;
			7'h09: Reg09 <= DataIn;
			7'h0A: Reg0A <= DataIn;
			7'h0B: Reg0B <= DataIn;
			7'h0C: Reg0C <= DataIn;
			7'h0D: Reg0D <= DataIn;
			7'h0E: Reg0E <= DataIn;
			7'h0F: Reg0F <= DataIn;
			
			7'h10: Reg10 <= DataIn;
			7'h11: Reg11 <= DataIn;
			7'h12: Reg12 <= DataIn;
			7'h13: Reg13 <= DataIn;
			7'h14: Reg14 <= DataIn;
			7'h15: Reg15 <= DataIn;
			7'h16: Reg16 <= DataIn;
			7'h17: Reg17 <= DataIn;
			7'h18: Reg18 <= DataIn;
			7'h19: Reg19 <= DataIn;
			7'h1A: Reg1A <= DataIn;
			7'h1B: Reg1B <= DataIn;
			7'h1C: Reg1C <= DataIn;
			7'h1D: Reg1D <= DataIn;
			7'h1E: Reg1E <= DataIn;
			7'h1F: Reg1F <= DataIn;
		endcase
	end
end

Best regards,

David

ataradov · « **Reply #1 on:** December 05, 2017, 08:02:06 am »

The only reason this may not meet timings is the case statement doing something funky. Otherwise, this is a pretty straightforward thing where write enable is generated by a combinatorial logic based on the Address. There is no reason why this would get significantly slower with number of registers.

Try something like this:

Code: [Select]

reg [31:0] Reg [31:0];

always @(posedge Clock) begin
 if(NReset == 0) begin
   Reg[0] <= Reg00_DefaultValue;
   ........
   Reg[31] <= Reg1F_DefaultValue;
 end else if(DataValid == 1'b1) begin 
   Reg[Address] <= DataIn;
 end
end

assign Reg00_o = Reg[0];
........
assign Reg1F_o = Reg[31];

Also, SystemVerilog supports arrays as ports.
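
If SystemVerilog is not an option, one way to avoid writing out every port in plain Verilog-2001 is to flatten the register array into a single wide bus. This is only a sketch under that assumption: the module name `RegMapFlat`, the `RegsFlat` port and the `DEFAULTS` parameter are made up for illustration and are not from the original design.

```verilog
// Sketch only: RegMapFlat, RegsFlat and DEFAULTS are hypothetical names.
module RegMapFlat #(
    parameter NUM_REGS = 32,
    parameter ADDR_W   = 7,
    // Reset values packed into one vector, register 0 in the low 32 bits.
    parameter [NUM_REGS*32-1:0] DEFAULTS = {NUM_REGS{32'h0000_0000}}
) (
    input  wire                   Clock,
    input  wire                   NReset,
    input  wire                   DataValid,
    input  wire [ADDR_W-1:0]      Address,
    input  wire [31:0]            DataIn,
    output reg  [NUM_REGS*32-1:0] RegsFlat
);
    integer i;

    always @(posedge Clock) begin
        if (NReset == 1'b0) begin
            // Synchronous reset: load every register's default at once.
            RegsFlat <= DEFAULTS;
        end else if (DataValid) begin
            // Comparing against the loop index gives each register its own
            // write enable; addresses >= NUM_REGS match nothing and are ignored.
            for (i = 0; i < NUM_REGS; i = i + 1)
                if (Address == i)
                    RegsFlat[i*32 +: 32] <= DataIn;
        end
    end
endmodule
```

A consumer can then slice out its own register with something like `wire [31:0] Reg03 = RegsFlat[3*32 +: 32];`. The synthesized logic should be much the same as the written-out version, this only saves typing.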

hamster_nz · « **Reply #2 on:** December 05, 2017, 08:13:18 am »

It all depends on your needs. For example, imagine a 256x8 write only register file, 8 bits address, 8 bits data, 1 bit write enable, and 256 8-bit outputs.

If all of the outputs are going to be used directly, and you want them to update instantly you need 2048 flip-flops, lots of decoding logic, and very high fan outs on the data signal. Basically the mess you already have.

The only option is to increase the latency to allow you to manage things better.

Let's take this idea to the extreme. If you only need one update every 20 cycles you can do the following:

When the data come in, write a 1, the address and the data to a 17 bit shift register.

Shift those bits out, starting with the 1. Into a single wire.

Where you want your register, receive those bits into an 8 bit shift register, and add an FSM triggered by the leading one. This FSM inspects the address, and if it matches a constant address it then captures the next 8 bits into registers, then assigns them to the register's outputs.

You can then chain these stages together so your write-only register file can be implemented in very small parts all over the FPGA die, with very low fan-outs, so running at very high speed. It can also use pipelining as required to meet timing.

Basically it becomes a simple network on a chip, the cost being maybe 2x as many registers, and latency between the write and the register changing.

Taking this to the extreme, routing this network can be implemented as an overlay, where you chain these blocks together after the main place and route, using whatever leftover resources you have to hand.

Oh, the other option for things like a 3x3 matrix used in colour conversion is to rather than having 9 registers, have a 9-deep shift register and just have the CPU stream the values in. This gets rid of a lot of the decode logic.
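
One node of the serial chain described above could be sketched roughly as follows. Treat it as an illustration, not a tested design. The assumptions are mine: the frame is a start bit of 1, then 8 address bits, then 8 data bits, all MSB first, and the line idles low between frames. For brevity the node collects address and data in one 15 bit shift register instead of a separate 8 bit shift register plus capture stage. The names `SerialRegNode`, `MY_ADDR` and `RESET_VALUE` are hypothetical.

```verilog
// Sketch only: one register node on a single-wire configuration chain.
module SerialRegNode #(
    parameter [7:0] MY_ADDR     = 8'h00,  // address this node answers to
    parameter [7:0] RESET_VALUE = 8'h00
) (
    input  wire       Clock,
    input  wire       NReset,
    input  wire       SerialIn,
    output reg        SerialOut,  // feeds the next node in the chain
    output reg  [7:0] RegOut
);
    reg        busy;    // high while the 16 payload bits are being received
    reg [3:0]  count;   // payload bits still to come, minus one
    reg [14:0] shift;   // first 15 payload bits, oldest bit at the top

    always @(posedge Clock) begin
        // Each node adds one pipeline stage, so fan-out stays at two loads.
        SerialOut <= SerialIn;

        if (NReset == 1'b0) begin
            busy   <= 1'b0;
            count  <= 4'd0;
            RegOut <= RESET_VALUE;
        end else if (!busy) begin
            if (SerialIn) begin           // leading 1 marks start of frame
                busy  <= 1'b1;
                count <= 4'd15;
            end
        end else begin
            shift <= {shift[13:0], SerialIn};
            count <= count - 4'd1;
            if (count == 4'd0) begin
                busy <= 1'b0;
                // shift[14:7] is the address; shift[6:0] plus the bit
                // arriving right now (SerialIn) is the data byte.
                if (shift[14:7] == MY_ADDR)
                    RegOut <= {shift[6:0], SerialIn};
            end
        end
    end
endmodule
```

Chaining is then just wiring `SerialOut` of one node to `SerialIn` of the next, with the 17 bit sender shift register at the head of the chain.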

ataradov · « **Reply #3 on:** December 05, 2017, 08:18:55 am »

Quote from: hamster_nz on December 05, 2017, 08:13:18 am

and very high fan outs on the data signal. Basically the mess you already have.

Yeah, I totally missed data line fan-out.

AndyC_772 · « **Reply #4 on:** December 05, 2017, 08:28:58 am »

What exactly are (say) the top three timing violations?

A very common cause of timing errors in a register map is having registers which are updated in one clock domain but read back in another. If the two clocks are not synchronised, a situation can exist where a register is read at a time when it's not yet stabilised from the previous write. The existence of the problem doesn't depend on the frequency of the two clocks.

Possible fixes include:

- treat SCK as an asynchronous signal, and sample it in the same (fast) clock domain as the one in which the registers are read.

- use a dual clock FIFO. SPI writes go into the FIFO under control of the 'write' clock, and are read out from the other side (into your register file) under control of a 'read' clock

hamster_nz · « **Reply #5 on:** December 05, 2017, 08:45:32 am »

Quote from: AndyC_772 on December 05, 2017, 08:28:58 am

What exactly are (say) the top three timing violations?

A very common cause of timing errors in a register map is having registers which are updated in one clock domain but read back in another. If the two clocks are not synchronised, a situation can exist where a register is read at a time when it's not yet stabilised from the previous write. The existence of the problem doesn't depend on the frequency of the two clocks.

Possible fixes include:

- treat SCK as an asynchronous signal, and sample it in the same (fast) clock domain as the one in which the registers are read.

- use a dual clock FIFO. SPI writes go into the FIFO under control of the 'write' clock, and are read out from the other side (into your register file) under control of a 'read' clock

Even if it is all in one clock domain (which would be a good thing, IMO), the high fanouts will be killing this - esp at 200+MHz. Some pipelining might help a little (as it allows high-fanout registers to get duplicated), but it will end up being a pretty large mesh of wires. The placer will most likely try and place this all in one spot (as it is so interconnected with high fanouts) so running the register outputs out across the die to the functional blocks of the design will be problematic. Upshot is that the "command and control" aspect will be very constraining on design (which at 250MHz is shooting pretty high anyway).

Another option is to run the register file at an integer fraction of the main design's clock (e.g. 50.0MHz or 25MHz). That way you have plenty of slack for place and route, and don't get any timing violations because the clock domains are synchronous. Actually that sounds like quite a nice idea. I should charge for this quality of advice!

daqq · « **Reply #6 on:** December 05, 2017, 09:12:52 am »

First of all: Thanks guys for the advice! I have already eliminated the problem - it was in one of the controlled "peripherals" that connected to one of the control registers and compared it - basically did a horrible amount of combinatorial logic (comparison, then mux, then comparison) before the next register stage. I've added intermediate registers between the two areas and it works OK. So it's all a problem of me not identifying the offending source properly.

Quote

The only reason this may not meet timings is the case statement doing something funky. Otherwise, this is a pretty straightforward thing where write enable is generated by a combinatorial logic based on the Address. There is no reason why this would get significantly slower with number of registers.

I've tried synthesizing both (before finding out the bug), both gave roughly the same result and there is no noteworthy difference in how the actual logic was created. Attached are both outputs of both cases, they seem pretty similar, even when looked at close.

Quote

Try something like this:

Thanks! Looks great!

hamster_nz: Those are some very interesting ideas. For now I'll stay clear of them, but thanks! The asynchronous bit might work, but at the moment the SPI is sampled and processed - the receiving system does not run on the SPI SCK clock, but rather on the internal 250MHz clock and the signals are sampled. Converting the SPI into a slower system could lessen the problems in the future, I'll keep that door open.

AndyC_772: The SPI is treated asynchronously and is sampled - the SCK does not provide any real clocking, just moving forward in the receiving/transmitting FSM.

So, problem solved... for now...

Someone · « **Reply #7 on:** December 05, 2017, 09:14:56 am »

Quote from: hamster_nz on December 05, 2017, 08:45:32 am

Even if it is all in one clock domain (which would be a good thing, IMO), the high fanouts will be killing this - esp at 200+MHz... (which at 250MHz is shooting pretty high anyway).

Speed is relative to the part being used: this would breeze it in on a modern node but be fiendishly difficult on a 20 year old part. The tools should be properly handling fan out in every stage with appropriate techniques for the part; unless there is already very high utilisation, replication of the nets is routine in even basic tools.

hamster_nz · « **Reply #8 on:** December 05, 2017, 09:50:42 am »

Quote from: Someone on December 05, 2017, 09:14:56 am

Quote from: hamster_nz on December 05, 2017, 08:45:32 am
Even if it is all in one clock domain (which would be a good thing, IMO), the high fanouts will be killing this - esp at 200+MHz... (which at 250MHz is shooting pretty high anyway).
Speed is relative to the part being used, this would breeze it in on a modern node but be fiendishly difficult on a 20 year old part. The tools should be properly handling fan out in every stage with appropriate techniques for the part, unless there is already very high utilisation then replication of the nets is routine in even basic tools.

When doing commercial video work - (3D LUTs, focus aids, peaking, zooming and so on), we found that 148.5MHz was quite achievable without too many issues in lower-end Zynqs and Cyclone V SoCs, but going to 4k at 295MHz was almost asking too much for the parts, unless we paid very, very careful attention to every detail (and then there were the power issues with lots of logic running at that speed...).

A former pet project of mine, a Mandelbrot fractal generator on a Kintex, started getting tricky around 230MHz when I implemented complex multiplication in a mix of DSP48s and LUT logic.

nctnico · « **Reply #9 on:** December 05, 2017, 10:37:09 am »

Using a 250MHz clock inside an FPGA for generic logic is a very bad idea and this is the root cause of the problem. It is better to have 2 or 3 clock domains with related clocks (like 250MHz, 125MHz and 31.25MHz) using the FPGA's internal clock generator(s). The lowest clock frequency can be used for house keeping stuff like configuration registers. The advantage is that only the parts which need to be fast cause routing problems, whereas the other (slower) stuff can be placed anywhere and it still meets timing.

daqq · « **Reply #10 on:** December 05, 2017, 10:58:35 am »

Quote

The lowest clock frequency can be used for house keeping stuff like configuration registers.

Wouldn't this cause a lot of extra registers for syncing between the two, or issues when reading out the counterpart of the configuration registers - the status registers? The counterpart for the provided code is a status register muxing system.

nctnico · « **Reply #11 on:** December 05, 2017, 11:21:31 am »

Quote from: daqq on December 05, 2017, 10:58:35 am

Quote
The lowest clock frequency can be used for house keeping stuff like configuration registers.
Wouldn't this cause a lot of extra registers for syncing between the two, or issues when reading out the counterpart of the configuration registers - the status registers? The counterpart for the provided code is a status register muxing system.

No, because when the clocks are related their edges are still aligned, which in turn makes it very easy to go from a domain with a lower frequency to a domain with a higher frequency. Vice versa is a bit more tricky because you need to keep the signal stable for the clock cycle duration of the lower frequency, but you can still use the fact that the clock edges are aligned.

daqq · « **Reply #12 on:** December 05, 2017, 01:10:48 pm »

Hmmm... I'll think about it - it certainly would solve some problems, since all of the configuration/status reading is slow compared to the main logic.

The configuration registers (writing them) would be fairly trivial (slow clock domain feeding a fast clock domain), but the status registers readout would cause a lot of objections from the compiler unless handled with a lot of nastiness...

nctnico · « **Reply #13 on:** December 05, 2017, 01:35:52 pm »

Quote from: daqq on December 05, 2017, 01:10:48 pm

Hmmm... I'll think about it - it certainly would solve some problems, since all of the configuration/status reading is slow compared to the main logic.

The configuration registers (writing them) would be fairly trivial (slow clock domain feeding a fast clock domain), but the status registers readout would cause a lot of objections from the compiler unless handled with a lot of nastiness...

Not if you clock the readout signals into flipflops with a lower frequency clock OR you specifically tell the compiler the signals have more relaxed timing constraints.

NorthGuy · « **Reply #14 on:** December 05, 2017, 03:09:27 pm »

Quote from: daqq on December 05, 2017, 10:58:35 am

The counterpart for the provided code is a status register muxing system.

If you're muxing your registers afterwards (that is do not need all the outputs at the same time), then the whole thing looks very similar to a RAM block. Your FPGA may have built-in RAM blocks which might be able to replace both of your systems, but it all depends on the details.
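
For reference, a register file that is only ever read one entry at a time can be written in the usual inferred-RAM style. This is a generic sketch, not the original design: the module and port names are hypothetical, and whether the tools map it to block RAM or to LUT/distributed RAM depends on the part and the settings. One detail that matters here: RAM contents cannot be returned to per-register defaults by NReset, only initialised at configuration time, so the reset-to-default behaviour from the first post would need separate handling.

```verilog
// Sketch only: simple dual-port RAM standing in for the register file.
module RegFileRam (
    input  wire        Clock,
    input  wire        WriteEnable,
    input  wire [4:0]  WriteAddress,
    input  wire [31:0] WriteData,
    input  wire [4:0]  ReadAddress,
    output reg  [31:0] ReadData
);
    reg [31:0] mem [0:31];
    integer i;

    // Power-up contents only; a runtime reset cannot restore these.
    initial begin
        for (i = 0; i < 32; i = i + 1)
            mem[i] = 32'h0000_0000;
    end

    always @(posedge Clock) begin
        if (WriteEnable)
            mem[WriteAddress] <= WriteData;
        // Registered read: data appears one clock after the address.
        ReadData <= mem[ReadAddress];
    end
endmodule
```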

aandrew · « **Reply #15 on:** December 05, 2017, 05:48:04 pm »

Quote from: nctnico on December 05, 2017, 10:37:09 am

Using a 250MHz clock inside an FPGA for generic logic is a very bad idea and this is the root cause of the problem. It is better to have 2 or 3 clock domains with related clocks (like 250MHz, 125MHz and 31.25MHz) using the FPGA's internal clock generator(s).

I disagree; run all/most of the core logic at your main rate (250MHz) and use the built in clock enables to create slower/lower speed logic. There's no need to use DCMs/PLLs to generate integer fractions of your main rate, so long as it doesn't have to be 50% duty cycle.

And not a comment directly for you, but for God's sake, don't use FFs to manually divide a clock down and feed that into other logic! I see this so often and it's such a bad practise.
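
To make the clock enable idea concrete, here is a minimal sketch. Everything stays on the one 250MHz clock, and the slow logic only updates when a one-cycle enable is high. The names (`HousekeepingEnable`, `DIVIDE`, `StatusIn`, `StatusSlow`) are invented for the example, and `DIVIDE` is assumed to be between 1 and 256 so it fits the 8 bit counter.

```verilog
// Sketch only: slow "domain" built from a clock enable, not a second clock.
module HousekeepingEnable #(
    parameter DIVIDE = 5            // 250 MHz / 5 = 50 MHz update rate
) (
    input  wire        Clock,
    input  wire        NReset,
    input  wire [31:0] StatusIn,
    output reg  [31:0] StatusSlow
);
    reg [7:0] count;
    reg       tick;   // high for one Clock cycle every DIVIDE cycles

    always @(posedge Clock) begin
        if (NReset == 1'b0) begin
            count <= 8'd0;
            tick  <= 1'b0;
        end else if (count == DIVIDE - 1) begin
            count <= 8'd0;
            tick  <= 1'b1;
        end else begin
            count <= count + 8'd1;
            tick  <= 1'b0;
        end
    end

    // Same clock as everything else; tick drives the flip-flop enables.
    always @(posedge Clock) begin
        if (NReset == 1'b0)
            StatusSlow <= 32'h0000_0000;
        else if (tick)
            StatusSlow <= StatusIn;
    end
endmodule
```

Note that the tools will still time paths between enabled registers at the full 250MHz unless they are told otherwise with multicycle constraints, so the enable alone does not relax timing.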

nctnico · « **Reply #16 on:** December 05, 2017, 08:30:51 pm »

Quote from: aandrew on December 05, 2017, 05:48:04 pm

Quote from: nctnico on December 05, 2017, 10:37:09 am
Using a 250MHz clock inside an FPGA for generic logic is a very bad idea and this is the root cause of the problem. It is better to have 2 or 3 clock domains with related clocks (like 250MHz, 125MHz and 31.25MHz) using the FPGA's internal clock generator(s).
I disagree; run all/most of the core logic at your main rate (250MHz) and use the built in clock enables to create slower/lower speed logic. There's no need to use DCMs/PLLs to generate integer fractions of your main rate, so long as it doesn't have to be 50% duty cycle.

If the synthesizer is able to deal with that and detects the clock rate reduction properly then it can be a solution as well. Otherwise you'd need a mess of timing constraints and really think about what you are doing. Having multiple clock domains with clocks as slow as possible has been serving me well to get to short place & route times and excellent use of FPGA resources. Of course you'd need enough clock distribution nets and PLLs available, which is why I wrote to have 2 or 3 different clocks.

Quote

And not a comment directly for you, but for God's sake, don't use FFs to manually divide a clock down and feed that into other logic! I see this so often and it's such a bad practise.

I've seen people do that as well and it is awful indeed.

hamster_nz · « **Reply #17 on:** December 05, 2017, 08:48:20 pm »

Quote from: aandrew on December 05, 2017, 05:48:04 pm

And not a comment directly for you, but for God's sake, don't use FFs to manually divide a clock down and feed that into other logic! I see this so often and it's such a bad practise.

Amen!

daqq · « **Reply #18 on:** December 05, 2017, 09:24:54 pm »

Quote

I disagree; run all/most of the core logic at your main rate (250MHz) and use the built in clock enables to create slower/lower speed logic. There's no need to use DCMs/PLLs to generate integer fractions of your main rate, so long as it doesn't have to be 50% duty cycle.

And not a comment directly for you, but for God's sake, don't use FFs to manually divide a clock down and feed that into other logic! I see this so often and it's such a bad practise.

If not flip flops or DCMs/PLLs then how? If the clocks must be related to one another (be direct multiples of one another) then how do I generate such a clock besides those two options?

Could you give me some keywords that I can feed into google for this kind of thing? Let's say that I want to go with the dual/multiple clock domains, one for the high speed nasty (250MHz) and one, say, 50MHz for housekeeping (SPI interface with config/status registers, misc.). Is there any example? In particular the clock crossing and what constraints should I look for?

hamster_nz · « **Reply #19 on:** December 05, 2017, 10:30:18 pm »

Quote from: daqq on December 05, 2017, 09:24:54 pm

Quote
I disagree; run all/most of the core logic at your main rate (250MHz) and use the built in clock enables to create slower/lower speed logic. There's no need to use DCMs/PLLs to generate integer fractions of your main rate, so long as it doesn't have to be 50% duty cycle.

And not a comment directly for you, but for God's sake, don't use FFs to manually divide a clock down and feed that into other logic! I see this so often and it's such a bad practise.
If not flip flops or DCMs/PLLs then how? If the clocks must be related to one another (be direct multiples of one another) then how do I generate such a clock besides those two options?

Could you give me some keywords that I can feed into google for this kind of thing? Let's say that I want to go with the dual/multiple clock domains, one for the high speed nasty (250MHz) and one, say, 50MHz for housekeeping (SPI interface with config/status registers, misc.). Is there any example? In particular the clock crossing and what constraints should I look for?

I think the suggested design is to use the same DCM or PLL to generate both the fast and slow clocks.

At least for Xilinx, there is no need to synchronise going from slow to fast: just have a value registered in the slow domain and consume it in the fast domain - the derived clock constraints cover it and make it happen magically. Just be aware that any control signals (e.g. a write enable) will have stretched pulses when used in the fast domain.

Going the other way can be trickier; the best bet is to use a FIFO for any data streams, or to act on control signals sourced from the slow domain only once every 'n' cycles.
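
One common way to deal with the stretched pulses (an alternative to counting 'n' cycles, and not something taken from the original design) is a rising-edge detector in the fast domain. The sketch assumes both clocks come from the same PLL/DCM, so no synchroniser is needed; `SlowToFastPulse` and its port names are made up for the example.

```verilog
// Sketch only: turn a strobe registered in the slow domain into a single
// fast-clock pulse. Assumes the two clocks are synchronous (same PLL/DCM).
module SlowToFastPulse (
    input  wire FastClock,
    input  wire SlowStrobe,   // registered on the slow clock
    output reg  FastPulse     // high for exactly one FastClock cycle
);
    reg prev;

    // Known power-up state so no false pulse appears at start-up.
    initial begin
        prev      = 1'b0;
        FastPulse = 1'b0;
    end

    always @(posedge FastClock) begin
        prev      <= SlowStrobe;
        FastPulse <= SlowStrobe & ~prev;   // rising edge only
    end
endmodule
```

The catch is that two strobes in back-to-back slow cycles look like one long pulse and give only one `FastPulse`, so the slow side has to drop the strobe for at least one slow cycle between events.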

NorthGuy · « **Reply #20 on:** December 05, 2017, 10:44:21 pm »

Quote from: hamster_nz on December 05, 2017, 10:30:18 pm

Going the other way can be trickier

But if you derive slow clock domain using clock enable (e.g. on BUFG) then your fast clock domain will always have a counter which tells you the phase of your slow clock. This makes any form of synchronization very easy.

hamster_nz · « **Reply #21 on:** December 06, 2017, 12:50:31 am »

Quote from: NorthGuy on December 05, 2017, 10:44:21 pm

Quote from: hamster_nz on December 05, 2017, 10:30:18 pm
Going the other way can be trickier

But if you derive slow clock domain using clock enable (e.g. on BUFG) then your fast clock domain will always have a counter which tells you the phase of your slow clock. This makes any form of synchronization very easy.

When needed, I tended to sample a flipflop toggling in the slow domain to allow the relative phase to be deduced locally... but using a BUFGCE is a nice idea too.

Bassman59 · « **Reply #22 on:** December 06, 2017, 06:43:05 am »

Quote from: aandrew on December 05, 2017, 05:48:04 pm

And not a comment directly for you, but for God's sake, don't use FFs to manually divide a clock down and feed that into other logic! I see this so often and it's such a bad practise.

Except, if you're using an FPGA which does not have a PLL, or for Real Good Reasons you can’t use one that it might have, then you have little choice but to divide the clock using flip-flops. This is the boat I’m in now.

In this case, you have to ensure that the divided clock ends up on a global net, and realize that the divided clock is asynchronous to its source clock. The divided clock will always switch after its source, so you have to treat registers and signals generated in the source domain carefully. That means double- or triple-flip-flop synchronizers, asserting strobes for a “long” time, and all of the metastability hardening you don’t need with Xilinx parts.
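
For anyone searching for the term, the double-flip-flop synchronizer for a single-bit signal looks roughly like this. It is a generic textbook sketch with hypothetical names, not code from my design, and it only suits single bits or slow-changing flags. Multi-bit values need a handshake or a FIFO, because the individual bits can arrive in different cycles.

```verilog
// Sketch only: two-flop synchronizer for one asynchronous bit.
module Sync2FF (
    input  wire DestClock,   // clock of the receiving domain
    input  wire AsyncIn,     // signal from the other domain
    output wire SyncOut
);
    // The first flop may go metastable; the second gives it a full
    // DestClock period to settle before the rest of the logic sees it.
    reg [1:0] sync = 2'b00;

    always @(posedge DestClock)
        sync <= {sync[0], AsyncIn};

    assign SyncOut = sync[1];
endmodule
```

The source signal has to stay stable for longer than one `DestClock` period or it can be missed altogether, which is why strobes need to be asserted for a “long” time.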

daqq · « **Reply #23 on:** December 06, 2017, 12:05:35 pm »

Thanks guys for the tips. I've done the switchover to a slow (50MHz) housekeeping domain and a fast (250MHz) number crunchy domain. It seems to be working so far.

There seems to be no problem so far, even without extra syncing registers between the two domains in both ways.

Dubbie · « **Reply #24 on:** December 06, 2017, 12:19:39 pm »

Quote from: nctnico on December 05, 2017, 08:30:51 pm

Quote from: aandrew on December 05, 2017, 05:48:04 pm
And not a comment directly for you, but for God's sake, don't use FFs to manually divide a clock down and feed that into other logic! I see this so often and it's such a bad practise.
I've seen people do that as well and it is awful indeed.

Have you guys been reading my Verilog files?

Ah well, you learn something new every day!

