After 2 months of digging into this issue, I have almost completed an answer to my original post.
To start with, the adder was only one of 5 inferred circuits that caused timing errors. To break the problem down in more detail, allow me to explain my approach to constructing a UART, then the problems and solutions I encountered.
Transmitter
- acquire data to send
- format the word: assign packet = {1'b1, data[7:0], 1'b0};
- send word
Very simple, right? But what about a variable width or word size? How about parity?
Next I tried making a shift register, where I counted how many unused bits (maximum word size minus current word size) there were and shifted the data around to build the packet. It worked, but I wanted something cleaner.
Then a mux came to mind. Actually two of them.
assign w_data = {data,1'b0}[current_bit];
assign tx_pin = {1'b1, parity, w_data, 1'b1}[current_state];
A little more logic to track the current state and bit number, as well as a clock counter to keep track of time, and I was up and running at around 160 MHz. I could have stopped there. But no. Why can't it go faster? Digging into the timing report and RTL viewer showed that an 8-bit word with a start bit appended (9 bits total) needed 4 levels of LUTs to complete.
After some reading, the term retiming came up a lot. As a matter of fact, every post and article I found when searching for a solution mentioned it. Too bad it's not in GOWIN's documentation. I tried adding a register, hoping the tool would infer what I wanted to do. NOPE... not gonna get off that easy. After about 20 hours spread across 3 weeks, I had a way to build a couple of data structures. By themselves they are not much, but when used to connect a few LUTs to a few registers, I built a variable mux, register-eating machine. Trading throughput for timing, the FMax quickly took on the shape of an RC time constant curve, as more latency provided diminishing returns.
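To illustrate the general idea of splitting a wide mux across registers by hand, here is a minimal sketch of a 16:1 mux broken into two registered stages. It is a generic example under assumed names and a fixed width, not the parameterized structure from the repository.

```verilog
// Sketch only: hypothetical module, fixed 16:1 width, two clocks of latency.
module mux_pipelined (
    input  wire        clk,
    input  wire [15:0] d,      // 16 candidate bits
    input  wire [3:0]  sel,    // which bit of d to pick
    output reg         q       // d[sel] as seen two clock edges earlier
);
    reg [3:0] stage1;          // one winner per group of four inputs
    reg [1:0] sel_hi;          // upper select bits, delayed to line up with stage1
    integer i;

    always @(posedge clk) begin
        // Stage 1: four 4:1 muxes in parallel, each a shallow cone of logic.
        for (i = 0; i < 4; i = i + 1)
            stage1[i] <= d[i*4 + sel[1:0]];
        sel_hi <= sel[3:2];

        // Stage 2: one more 4:1 mux picks between the registered winners.
        q <= stage1[sel_hi];
    end
endmodule
```

Each extra stage shortens the longest combinational path but adds a clock of latency, which is the trade described above.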
But now I have a new critical path: not the actual work being done, but the status of the FSM. Simple if/then/else statements were crippling my speed. A case statement was also too slow, requiring at least 2 LUTs in series. After a few days of brainstorming, I realized that only one state is time critical: the start state, where input data is valid for only one clock cycle. So I amended my ALU code to include comparisons and gate reduction operations.
Now changes in state take more than one clock cycle to complete. This is acceptable because the output of the module is dependent on a clock enable. As long as any multicycle operations finish before the next clock enable, it is the equivalent of a circuit built with a latency parameter of zero in a slower clock domain. The inputs to the circuit all operate in the fast domain. When accessing the FIFOs, 100% throughput in the fast domain was desired, and has been achieved, without having to cross clock domains.
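A minimal sketch of that idea, assuming a comparison that is spread over two fast clocks and only sampled on the clock enable. The module name, signal names, and widths are hypothetical, and the sketch assumes a and b are held stable for at least two fast clocks before each ce pulse.

```verilog
// Sketch only: hypothetical names; assumes a and b are stable for at
// least two clk cycles before every ce pulse.
module ce_multicycle (
    input  wire       clk,       // fast clock
    input  wire       ce,        // slow clock enable
    input  wire [7:0] a, b,      // values to compare
    output reg        match      // updated only when ce is high
);
    reg [7:0] diff;
    reg       all_zero;

    always @(posedge clk) begin
        diff     <= a ^ b;       // first clock: bitwise difference
        all_zero <= ~|diff;      // second clock: reduce to a single bit
        if (ce)
            match <= all_zero;   // seen from outside, the result changes only on ce
    end
endmodule
```

From the point of view of anything gated by ce, the comparison appears to have zero latency, even though it took two fast clocks internally.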
By not inferring these circuits (adders, mux, dmux, &{}, and |{}) but instead using equivalent modules and named wires, I believe that I can more accurately write multi-cycle path constraints.
For the receiver, I went with a demultiplexer, simply taking the reverse of a mux and OR'ing its output with the data that has already been collected, with a little oversampling sprinkled on top for flavor.
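The demux-and-OR step can be sketched roughly as follows. The names, widths, and the clear/sample_en handshake are assumptions for illustration, and the oversampling logic that would produce rx_bit is left out.

```verilog
// Sketch only: hypothetical names; WORD_MAX assumed between 2 and 15.
module uart_rx_demux #(
    parameter WORD_MAX = 8
) (
    input  wire                clk,
    input  wire                clear,       // start of a new word
    input  wire                sample_en,   // one pulse per received data bit
    input  wire                rx_bit,      // bit value after oversampling
    input  wire [3:0]          current_bit, // position this sample belongs to
    output reg  [WORD_MAX-1:0] data
);
    // Demux: route rx_bit to the position named by current_bit, zeros elsewhere.
    wire [WORD_MAX-1:0] routed = {{(WORD_MAX-1){1'b0}}, rx_bit} << current_bit;

    always @(posedge clk) begin
        if (clear)
            data <= {WORD_MAX{1'b0}};
        else if (sample_en)
            data <= data | routed;          // OR with the bits already collected
    end
endmodule
```

Because every position starts at zero and each one is written once per word, OR'ing is enough and no per-bit write enable is needed.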
Please take a look and tell me what you think.
https://github.com/Adivinedude/FPGA-toolbox#uartv
I have not found any good suggestions/reading material for coding style. Is there a standard or just company/school specific?