Nice work!
Must play havoc with your timing closure though?
Only if you use the FIFO wired directly to and from IO pins. If the FIFO is internal, the compiler treats the 'WIRING' from its input port to its output port as just more logic gates, or mux selection logic, within your design. I still get an fmax in the 350MHz range on the slowest CycloneIV. This is the reason for making the FIFO exactly 4 words deep. Any width up to 36 bits will give me that theoretical 350MHz range. In other words, inserting such a FIFO anywhere inside my existing designs won't lower my design's current fmax.
However, wired to IO pins, you need to take into account the input's tsu and the un-latched data outputs, which behave like a mux or gates rather than clocked registers. FPGAs don't handle this well, especially on the output side. Quartus gave me a 'restricted FMAX' of 250MHz, but the data is mush, or rather, the data is only valid for the last few picoseconds. To be useful like this, if you want a valid data output window of, say, 10ns, you could not run this FIFO faster than around 75MHz. I have not considered smaller PLD devices like the MAX3000/MAX7000 series. With optimally chosen IOs, they might actually perform better, as they were designed for glue logic / gate-driven IO pins and the fabric is tiny compared to an FPGA.
Doesn't this cause trouble when zero_latency is 1, by linking the timing paths leading up to "data_in" with those downstream from "data_out", with a couple of levels of logic added in?
assign data_out = ( fifo_size == 0 ) && (zero_latency ) ? data_in : fifo_data_reg[fifo_rd_pos[1:0]] ;
As I count it, a bit of data_out depends on 9 bits of input (or maybe fewer, depending on whether fifo_data_reg is inferred as RAM)...
With zero_latency set to 0, the downstream logic will always be a 4 position mux selecting 1 of the 4 fifo_data_regs using the 2 bit address fifo_rd_pos.
With zero_latency set to 1, we now have a 3 bit address. The MSB of the address is tied to the output of a 3 input OR gate whose inputs are tied to the 3 bit counter fifo_size. Of the inputs of what is now an 8:1 mux, the first 4 are tied in parallel to the data coming from data_in (basically a register from somewhere in your existing design), while the top 4 are tied to the 4 fifo_data_regs.
Since you will be feeding the data in from another register in your FPGA anyway, just like the 4 fifo_data_reg registers, the penalty of going from zero_latency off to on is equivalent to switching from a 4:1 mux on a register bank to an 8:1 mux on a register bank, with the first 4 inputs tied together to your data_in (again, still another reg in your FPGA design) and the MSB address of the mux selector tied to the output of a 3 input OR gate whose inputs come from the 3 bit counter fifo_size.
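To make the 8:1 mux picture above concrete, here is a minimal sketch of just the output path, written so the mux is explicit. This is not the actual FIFO source: the module name fifo_out_mux, the BITS parameter and the flattened fifo_data_flat port are assumptions made only so the example stands alone. Only data_in, data_out, fifo_rd_pos and fifo_size come from the line quoted earlier.

```verilog
// Sketch only: module name, BITS and fifo_data_flat are hypothetical,
// not taken from the real FIFO code discussed in this thread.
module fifo_out_mux #(
    parameter BITS = 8                        // assumed data width
)(
    input  wire [BITS-1:0]   data_in,         // fed by a register upstream
    input  wire [BITS*4-1:0] fifo_data_flat,  // the 4 fifo_data_reg words, flattened
    input  wire [1:0]        fifo_rd_pos,     // 2 bit read address
    input  wire [2:0]        fifo_size,       // 3 bit fill counter
    output wire [BITS-1:0]   data_out
);
    // MSB of the mux address: the 3 input OR of the fifo_size bits.
    wire       not_empty = |fifo_size;
    wire [2:0] sel       = {not_empty, fifo_rd_pos};

    // 8:1 mux inputs: 0..3 all carry data_in, 4..7 carry the 4 registers.
    wire [BITS-1:0] mux_in [0:7];
    genvar i;
    generate
        for (i = 0; i < 4; i = i + 1) begin : g_mux
            assign mux_in[i]     = data_in;
            assign mux_in[i + 4] = fifo_data_flat[i*BITS +: BITS];
        end
    endgenerate

    assign data_out = mux_in[sel];
endmodule
```

With fifo_size at 0 the select MSB is 0, so whatever fifo_rd_pos holds, data_out follows data_in. With any other fill level the MSB is 1 and fifo_rd_pos picks one of the 4 stored words, which should match the behaviour of the original assign when zero_latency is 1.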
Damn it hamster_nz (I don't know why I didn't realize it myself), I just thought of a way to eliminate that 3 input OR gate by making it a single register bit, just like the 2 bits of fifo_rd_pos, hence shaving off a nanosecond. As fast as the current design is, it would still be an improvement for slower FPGAs or old PLDs.
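One possible way to do that, sketched under assumptions and not necessarily what the V2 code will look like: compute the next value of fifo_size ahead of the clock edge and register its OR alongside the counter. The OR gate then sits on the input side of a flip-flop instead of in the data_out path. The module name fifo_empty_flag and the wr_ena, rd_ena and reset signals are hypothetical, and the sketch assumes the surrounding logic never reads an empty FIFO without a simultaneous write and never writes a full one.

```verilog
// Sketch only: fifo_empty_flag, wr_ena, rd_ena and reset are hypothetical names.
// Assumes no read without data available and no write when full.
module fifo_empty_flag (
    input  wire       clk,
    input  wire       reset,           // synchronous reset (assumed)
    input  wire       wr_ena,          // a word is written this cycle
    input  wire       rd_ena,          // a word is read this cycle
    output reg  [2:0] fifo_size,       // 3 bit fill counter
    output reg        fifo_not_empty   // registered equivalent of |fifo_size
);
    // Next fill level, computed before the clock edge.
    wire [2:0] size_next = fifo_size + {2'b00, wr_ena} - {2'b00, rd_ena};

    always @(posedge clk) begin
        if (reset) begin
            fifo_size      <= 3'd0;
            fifo_not_empty <= 1'b0;
        end else begin
            fifo_size      <= size_next;
            // The OR is taken on the next value, so after every edge this bit
            // equals |fifo_size while coming straight out of a register.
            fifo_not_empty <= |size_next;
        end
    end
endmodule
```

fifo_not_empty can then drive the MSB of the mux address directly, so all 3 address bits come from registers.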
Updated V2 code coming, with a timing comparison if possible (the current code is already so fast that a difference may be hard to measure).
Note that I will code it so that the 8:1 mux becomes obvious, and I will adapt my simulation test bench to reflect the FIFO sitting inside a design where it is fed by a set of registers and its outputs are latched by another set of registers. The compiler will then provide a proper, valid FMAX reading and the penalty between zero_latency on and off when the FIFO is buried inside a design in this manner.
Note: zero_latency doesn't count as an input as it's a fixed parameter removed at compile time.