Tutorial for Nockieboy on improving FMAX. Part 1.
Ok, for the new 'ellipse_generator.sv', we have a problem meeting our required 125 MHz when running the core at 32 bits. Right now, our limit is 117 MHz. Looking at the timing report, we see that a lot of signals (From Node) p[ # ],px[ # ],ry2[ # ],px[ # ] do not make it to the register p[ # ] (To Node) in time. The worst signal arrives 0.54 ns late (the Slack column shows -xxx ns). See:
(Unfortunately, in Quartus Prime when compiling for CV, I think they only provide the 1 single worst case timing signal. I'm sure there is a way to increase the size of the timing report so you get a better overview.)
Ok, so we need to look at the code to see what 'p' is equal to, why the signals feeding it come in too late, and how we may be able to improve the situation.
Here is how I begin to approach the problem. The technique I used here is middle of the road and there are other ways, but this was my first approach while trying to maintain the current structure. First, look for everywhere I make register 'p' equal to something. Here:
When sub_function == 3
p <= (alu_mult_y + 2) >> 2 ;
When sub_function == 6
p <= p + ry2 - alu_mult_y ;
When sub_function == 7 && (px <= py) && (p <= 0)
p <= p + ry2 + (px + (ry2<<1)) ;
When sub_function == 7 && (px <= py) && !(p <= 0)
p <= p + ry2 + (px + (ry2<<1)) - (py - (rx2<<1)) ;
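For reference, the four assignments above could sit together in one clocked block roughly like the sketch below. This is not the actual 'ellipse_generator.sv' source: the module name, the port list, the 32 bit widths on every port and the 'signed' declarations are all assumptions made only so the fragment stands on its own.

```systemverilog
// Sketch only: module name, ports, widths and signedness are assumptions,
// not taken from the real ellipse_generator.sv.
module p_update_original (
    input  logic               clk,
    input  logic        [3:0]  sub_function,
    input  logic signed [31:0] alu_mult_y,
    input  logic signed [31:0] rx2, ry2, px, py,
    output logic signed [31:0] p
);
    always_ff @(posedge clk) begin
        case (sub_function)
            4'd3: p <= (alu_mult_y + 2) >> 2;
            4'd6: p <= p + ry2 - alu_mult_y;
            4'd7: begin
                if (px <= py) begin
                    if (p <= 0) p <= p + ry2 + (px + (ry2 << 1));
                    else        p <= p + ry2 + (px + (ry2 << 1)) - (py - (rx2 << 1));
                end
            end
            default: ;  // every other step leaves p unchanged
        endcase
    end
endmodule
```

Even though the source reads like four separate instructions, there is only one 'p' register, so every right-hand side above ends up feeding the same 32 D inputs.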
Below, I'm showing you how the compiler constructs the logic for calculating 'p' (approximately). Remember, the FPGA is not a CPU passing memory variables to and from a single ALU. All the above instructions need to be combined into a single set of gates which makes the 32 bit register 'p' equal the following function at the core clock of 125 MHz.
(Yes, I tried to get this right, so analyze it...)
p <= ( p * (sub_function != 3) ) +
     ( ((alu_mult_y + 2) >> 2) * (sub_function == 3) ) -
     ( alu_mult_y * (sub_function == 6) ) +
     ( ry2 * ((sub_function == 6) || ((sub_function == 7) && (px <= py))) ) +
     ( (px + (ry2<<1)) * ((sub_function == 7) && (px <= py)) ) -
     ( (py - (rx2<<1)) * ((sub_function == 7) && (px <= py) && !(p <= 0)) ) ;
YES, all that shit... Though the compiler will simplify the algebra as much as possible, this is the mess that the 32 bit register 'p' must equal, with all those other 32 bit variables feeding a mass of gates which computes the 32 bit D-flipflop data input. Apparently, that entire mass of gates cannot be guaranteed to settle on the correct result in time when register 'p' is clocked (with everything else, of course) above 117 MHz.
'p' is dependent on the 4 bit sub_function number, the comparisons (px <= py) and !(p <= 0), the 32 bit register 'p' itself since it is being added to itself, alu_mult_y twice (once with 2 added and shifted, once as-is), plus rx2, ry2, px and py.
sub_function = 4 bits
(px<=py) = 32+32 = 64 bits
(p<=0) = 32 bits
'p' = 32 bits
alu_mult_y = 32 bits * 2 = 64 bits (shifted and non-shifted)
rx2, ry2, px, py = 32 bits * 4 = 128 bits
TOTAL: 324 bits / 324 wires/signals to generate the result 'p'.
Here is a test I performed. What I did was assign 'p <= 0' at sub_function==3 and get rid of the sub_function==6 assignment, which changes the equation to:
p <= ( p * (sub_function !=3)) +
(( ry2 + (px + (ry2<<1))) * ((sub_function == 7) && (px <= py)) ) -
(( (py - (rx2<<1)) * ((sub_function == 7) && (px <= py) && !(p <= 0)) )) ;
We got rid of both uses of alu_mult_y (shifted and non-shifted), cutting 64 wires from the 324 needed by the first equation.
Now the compiler gives us an FMAX of 132 MHz, and looking at the worst case timing paths, 'p' (To Node) is actually third down on the list, which means there is something else above it which is limiting the system to 132 MHz.
So, we have a goal, how to incorporate these 2 setup actions:
p <= (alu_mult_y + 2) >> 2 ;
p <= p + ry2 - alu_mult_y ;
And not add any complexity/dependencies to the above 132 MHz FMAX test equation.
The trick I decided to use was to temporarily store '(alu_mult_y + 2) >> 2' in ry2 and then just use the beginning of what already exists in the equation:
(the tail end was highlighted in red in the original post; remove it and just keep the beginning, 'p <= p + ry2'...)
p <= p + ry2
+ (px + (ry2<<1)) - (py - (rx2<<1)) ;
Ok, trick 1: reusing ry2.
In the above sub_function==3, I made 'ry2 <= (alu_mult_y + 2) >> 2'. Since ry2 otherwise only has 1 assignment, which is also from alu_mult_y, adding this here doesn't really slow down that register.
Now during sub_function==4, I made 'p <= ry2'. Since 'ry2' is added to 'p' everywhere else, this doesn't add additional signal dependencies to the master 'p <= 'blahhh blahhh blahhh' ' equation.
Sub_function==5, since ry2 will now have been updated to the next value, I just added:
p <= p + ry2;
Again, no new dependencies to calculate the master equation 'p'.
Sub_function==6, ok, there was no choice, I had to make it:
p <= p - alu_mult_y;
This added 32 new bits of dependence. Let's try a compile and see the results.
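Put together, the reworked sequence for sub_function 3 to 6 might look something like the sketch below. The module name, the ports and the signedness are assumptions. So is the reload of ry2 during sub_function==4: the post only says ry2 has been updated to its next value by sub_function==5, not where that happens.

```systemverilog
// Sketch only: names, widths and signedness are assumptions.
module p_update_trick1 (
    input  logic               clk,
    input  logic        [3:0]  sub_function,
    input  logic signed [31:0] alu_mult_y,
    input  logic signed [31:0] rx2, px, py,
    output logic signed [31:0] ry2,
    output logic signed [31:0] p
);
    always_ff @(posedge clk) begin
        case (sub_function)
            // Park the setup value in ry2 instead of p.
            4'd3: ry2 <= (alu_mult_y + 2) >> 2;
            4'd4: begin
                p   <= ry2;         // reads the value parked during step 3
                ry2 <= alu_mult_y;  // ASSUMPTION: ry2 gets its real value here
            end
            4'd5: p <= p + ry2;         // 'p + ry2' already exists in step 7
            4'd6: p <= p - alu_mult_y;  // the one new 32 bit dependency
            4'd7: begin
                if (px <= py) begin
                    if (p <= 0) p <= p + ry2 + (px + (ry2 << 1));
                    else        p <= p + ry2 + (px + (ry2 << 1)) - (py - (rx2 << 1));
                end
            end
            default: ;  // p and ry2 hold their values
        endcase
    end
endmodule
```

Because non-blocking assignments all read the old register values, 'p <= ry2' in step 4 still sees the parked value even if ry2 is reloaded on the same clock edge.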
As you can see, the new FMAX is 121MHz and there are only 4 signals too slow to make the cut.
Now, we know getting rid of that '- alu_mult_y' will allow us to clear the hurdle in spades, but without doing back-flips, let's try 1 thing first.
The rx2 & ry2 are the squares of the Xr & Yr 12 bit numbers, which I have forced to 32 bits.
Since Xr & Yr are only positive integers from 0 to 2047, the square is at most 2047 * 2047 = 4,190,209, which always fits in an unsigned 22 bit number (2^22 = 4,194,304). Let's see what happens if I force these to unsigned 22 bits, since rx2 & ry2 are used so often everywhere in that gigantic ' p <= blahhh blahhh blahhh '.
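As a quick sanity check on the 22 bit figure, the width can be derived from the maximum radius instead of being typed in by hand. This is only a sketch: the module, its parameters and the direct 'r * r' multiply are hypothetical and are not how the real core produces rx2 & ry2.

```systemverilog
// Sketch only: module, parameter names and the direct multiply are hypothetical.
module square_width_demo #(
    parameter int unsigned MAX_R = 2047                 // largest radius value
) (
    input  logic [11:0]                          r,     // assumed to stay <= MAX_R
    output logic [$clog2(MAX_R * MAX_R + 1)-1:0] r2     // 22 bits when MAX_R = 2047
);
    localparam int unsigned MAX_SQ = MAX_R * MAX_R;     // 4_190_209 for MAX_R = 2047
    localparam int          SQ_W   = $clog2(MAX_SQ + 1);

    // Multiply is evaluated at the width of r2; inputs above MAX_R would truncate.
    assign r2 = r * r;

    initial begin
        // Elaboration-time sanity check for the default parameter value.
        if (MAX_R == 2047 && SQ_W != 22)
            $error("expected 22 bits for 2047 squared, got %0d", SQ_W);
    end
endmodule
```

One thing to watch when narrowing: if a 22 bit unsigned value is mixed with 32 bit signed registers in one expression, Verilog evaluates the whole expression as unsigned. That should be harmless for the adds and subtracts into 'p', but any signed comparison sharing that expression would need a second look.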
Ok, talk about just clearing the hurdle, 126MHz...
We also went from 926 Logic elements to 888.
Question: can we do better?
Another solution may be making a temporary register hold:
p <= (alu_mult_y + 2) >> 2 ;
p <= p + ry2 - alu_mult_y ;
Then, at the last step, make 'p <=' that 1 register.
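A rough sketch of that idea follows. The register name 'p_setup' and the 'load_p' strobe are hypothetical, since the post does not say which step counts as the last one. The sub_function==7 update of 'p' is left out to keep the fragment short.

```systemverilog
// Sketch only: p_setup and load_p are hypothetical names.
module p_update_temp (
    input  logic               clk,
    input  logic        [3:0]  sub_function,
    input  logic               load_p,       // hypothetical: high on the last setup step
    input  logic signed [31:0] alu_mult_y,
    input  logic signed [31:0] ry2,
    output logic signed [31:0] p
);
    logic signed [31:0] p_setup;  // temporary register holding the setup maths

    // The alu_mult_y adders now feed p_setup, not p.
    always_ff @(posedge clk) begin
        case (sub_function)
            4'd3: p_setup <= (alu_mult_y + 2) >> 2;
            4'd6: p_setup <= p_setup + ry2 - alu_mult_y;
            default: ;  // hold
        endcase
    end

    // p only sees p_setup as a plain register-to-register copy.
    // load_p must come at least one clock after step 6 so p_setup is final.
    always_ff @(posedge clk) begin
        if (load_p) p <= p_setup;
        // the sub_function 7 update of p would go here, unchanged
    end
endmodule
```

The copy still adds 32 inputs to the logic in front of 'p', but they would arrive straight from a register rather than through an adder chain, which is the reason to expect an improvement. Whether it actually beats the 126 MHz result would need a compile to confirm.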
Optimization attempt #2, as well as testing the original code with only Rx2 & Ry2 made 22 bits, is for tomorrow.
Test V9 attached spaghetti code.
Snapshots not necessary unless there are errors...
(I also found out I'm uselessly doing 2 identical sub_functions twice, the correction will be done next.)