Author Topic: Learning FPGAs: wrong approach?  (Read 55120 times)


Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3139
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #175 on: June 26, 2017, 08:13:31 pm »
It would be helpful if you didn't use 'speed' for both 'latency' and 'throughput', what you're trying to say would be much clearer if you used the two separate terms.

Sorry for the confusion. I'll try to re-state it in your terms.

When running in pure combinatorial form (in one clock cycle), a design with more combinatorial layers will require a longer clock period and hence a lower clock frequency. Thus, it will have lower throughput and longer latency. The latency will be equal to one clock period.

If fully pipelined, any design will have the same maximum clock frequency and the same throughput. However, a design with more combinatorial layers will have longer latency. Its latency will be equal to the number of combinatorial layers multiplied by the clock period.
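To put hypothetical numbers on the two cases (a Python sketch; the 5 ns-per-layer delay figure is invented purely for illustration):

```python
# Hypothetical timing model: each combinatorial layer adds 5 ns of delay.
LAYER_DELAY_NS = 5.0
layers = 4

# Pure combinatorial (one clock cycle): the period must cover all layers.
comb_period_ns = layers * LAYER_DELAY_NS   # 20 ns
comb_fmax_mhz = 1000.0 / comb_period_ns    # 50 MHz
comb_latency_ns = comb_period_ns           # latency = one (long) period

# Fully pipelined: the period only has to cover one layer.
pipe_period_ns = LAYER_DELAY_NS            # 5 ns
pipe_fmax_mhz = 1000.0 / pipe_period_ns    # 200 MHz
pipe_latency_ns = layers * pipe_period_ns  # 4 cycles x 5 ns = 20 ns

print(comb_fmax_mhz, pipe_fmax_mhz)        # throughput: 50.0 vs 200.0 MHz
print(comb_latency_ns, pipe_latency_ns)    # latency: 20.0 ns either way
```

In this idealised model the latency comes out the same either way; pipelining buys throughput, not latency.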

Is this more understandable?
 

Offline AndyC_772

  • Super Contributor
  • ***
  • Posts: 4223
  • Country: gb
  • Professional design engineer
    • Cawte Engineering | Reliable Electronics
Re: Learning FPGAs: wrong approach?
« Reply #176 on: June 26, 2017, 09:04:07 pm »
What stops the values in temp, var_array(0) and var_array(1) from continuously cycling around until the comparator changes state?
It's worth taking a moment to re-iterate: VHDL code is *not* a sequence of instructions that are executed one at a time in order by the target device.

When you write an algorithm using variables, think of it as saying "Hey, compiler! I want you to synthesize some logic for me. Here's a method which describes what outputs I want for a given set of inputs. Now *you* go off and work out the best set of logic gates to give me the outputs I want, OK?"

All that this nested loop is doing, is providing a way - some way - of determining what the outputs should be for a given set of inputs. The compiler executes the loops, works out what the eventual relationship will be between input and output on a given clock edge, then programs this into look-up tables.

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: Learning FPGAs: wrong approach?
« Reply #177 on: June 26, 2017, 09:16:53 pm »
Actually my earlier LUT number is for an Artix7. Somehow ISE didn't catch I wanted to use a Spartan 6! The other numbers (speed) are for the Spartan6 design. The Spartan 6 design uses 79 Slice LUTs and occupies 33 slices (optimised for speed). I think your reasoning goes off the trail because the synthesizer turns the problem into logic equations which are then minimized keeping the architecture of the FPGA in mind. This means that some of the hardware you describe is probably combined in a way you can't see when designing 'in hardware'. I think it is very similar to a C compiler optimising for pre-fetching and caching.
Can you post/PM me the code you are using?

The code I saw on Stack Exchange only made one pass over the items per clock cycle, so to sort four items would take three cycles.

It would also explain the low LUT usage.
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Online nctnico

  • Super Contributor
  • ***
  • Posts: 26891
  • Country: nl
    • NCT Developments
Re: Learning FPGAs: wrong approach?
« Reply #178 on: June 26, 2017, 09:30:41 pm »
Actually my earlier LUT number is for an Artix7. Somehow ISE didn't catch I wanted to use a Spartan 6! The other numbers (speed) are for the Spartan6 design. The Spartan 6 design uses 79 Slice LUTs and occupies 33 slices (optimised for speed). I think your reasoning goes off the trail because the synthesizer turns the problem into logic equations which are then minimized keeping the architecture of the FPGA in mind. This means that some of the hardware you describe is probably combined in a way you can't see when designing 'in hardware'. I think it is very similar to a C compiler optimising for pre-fetching and caching.
Can you post/PM me the code you are using?

The code I saw on Stack Exchange only made one pass over the items per clock cycle, so to sort four items would take three cycles.

It would also explain the low LUT usage.

Here it is, but it uses a nested loop, so it seems to me it is doing a full bubble sort.

Code: [Select]
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use ieee.numeric_std.all;

package array_type is
    type bubble is array (0 to 3) of unsigned(7 downto 0);
end package;

library ieee;
use ieee.std_logic_1164.all;
use work.array_type.all;

entity bubblesort is
    port (
        signal clk:             in  std_logic;
        signal reset:           in  std_logic;
        signal in_array_in:        in  bubble;
        signal sorted_array_out:    out bubble
    );
end entity;


architecture foo of bubblesort is
    use ieee.numeric_std.all;

--signals to allow optimal routing
    signal in_array: bubble;
    signal sorted_array: bubble;

begin


BSORT:
    process (clk)
        variable temp:      unsigned (7 downto 0);
        variable var_array:     bubble;       
    begin

--move inside if rising_edge... to catch the whole thing inside the clock constraint and
--not depend on routing delays between input & output pads.
in_array<=in_array_in;
sorted_array_out <=sorted_array;
--

        var_array := in_array;
        if rising_edge(clk) then

            for j in bubble'LEFT to bubble'RIGHT - 1 loop
                for i in bubble'LEFT to bubble'RIGHT - 1 - j loop
                    if var_array(i) > var_array(i + 1) then
                        temp := var_array(i);
                        var_array(i) := var_array(i + 1);
                        var_array(i + 1) := temp;
                    end if;
                end loop;
            end loop;
            sorted_array <= var_array;
        end if;
    end process;
end architecture foo;
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline BrianHG

  • Super Contributor
  • ***
  • Posts: 7725
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #179 on: June 26, 2017, 09:46:57 pm »
Am I wrong (I'm a Verilog guy, not VHDL), but isn't this bubble sort sorting 4 variables 'array (0 to 3)' of only 3 bits 'of unsigned(7 downto 0)'?
 

Online nctnico

  • Super Contributor
  • ***
  • Posts: 26891
  • Country: nl
    • NCT Developments
Re: Learning FPGAs: wrong approach?
« Reply #180 on: June 26, 2017, 09:50:11 pm »
Am I wrong (I'm a Verilog guy, not VHDL), but isn't this bubble sort sorting 4 variables 'array (0 to 3)' of only 3 bits 'of unsigned(7 downto 0)'?
No, it is sorting an array with 4 elements where each element is an 8 bit unsigned int.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 
The following users thanked this post: BrianHG

Offline BrianHG

  • Super Contributor
  • ***
  • Posts: 7725
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #181 on: June 27, 2017, 12:38:57 am »
Am I wrong (I'm a Verilog guy, not VHDL), but isn't this bubble sort sorting 4 variables 'array (0 to 3)' of only 3 bits 'of unsigned(7 downto 0)'?
No, it is sorting an array with 4 elements where each element is an 8 bit unsigned int.
Sorry, I'm used to seeing 255 downto 0, as in my default internal RAM bus size. It's been over 5 years since I've done any HDL.
Or in Verilog I skimp out and just use wire[255:0] or reg[255:0] for a single 256-bit word/bus, or just 'integer' and let the compiler work out how many bits it needs to be to finalize the logic...
Back when I started in 2004, Quartus' internal compiler was very poor at decoding a bus, and would crash with anything too complex, so at the time they recommended using a third-party HDL compiler with Quartus, or learning AHDL, Altera's Hardware Description Language. This is why my coding style leans more towards 'assembly' than letting the compiler work for you as in 'C' coding.
« Last Edit: June 27, 2017, 12:46:43 am by BrianHG »
 

Offline Bruce Abbott

  • Frequent Contributor
  • **
  • Posts: 627
  • Country: nz
    • Bruce Abbott's R/C Models and Electronics
Re: Learning FPGAs: wrong approach?
« Reply #182 on: June 27, 2017, 02:37:22 am »
It's worth taking a moment to re-iterate: VHDL code is *not* a sequence of instructions that are executed one at a time in order by the target device.
Yes, and that's where I was confused, because there seemed to be a circular assignment. Now I can see that while ':=' assignments are immediate, the statements using them are evaluated sequentially by the compiler as it builds up the logical relationships between them.

Quote
All that this nested loop is doing, is providing a way - some way - of determining what the outputs should be for a given set of inputs. The compiler executes the loops, works out what the eventual relationship will be between input and output on a given clock edge, then programs this into look-up tables.
Right. So after examining all the statements sequentially it figures out what logic is required to bubble sort the entire array in a single clock cycle?
 

Offline AndyC_772

  • Super Contributor
  • ***
  • Posts: 4223
  • Country: gb
  • Professional design engineer
    • Cawte Engineering | Reliable Electronics
Re: Learning FPGAs: wrong approach?
« Reply #183 on: June 27, 2017, 06:26:43 am »
Right. So after examining all the statements sequentially it figures out what logic is required to bubble sort the entire array in a single clock cycle?

Almost. It figures out what logic is required to *sort* the entire array in a single clock - not necessarily *bubble* sort.

The details of how the outputs were derived from each possible set of inputs are lost. The compiler executes the loop, builds up a table of outputs vs inputs, then uses that table to generate the necessary logic.

That's not to say it can't make use of the original code to get some hints as to how the logic might work, but it doesn't have to.

There might even be an interesting exercise here. For example, a bubble sort algorithm to sort 'n' elements has to do (n-1) compare/swap operations on the first pass, then (n-2) on the second, and (n-3) on the third, and so on. In a software implementation, shortening each subsequent pass by one element is a trivial and worthwhile optimisation. In VHDL, though, it really shouldn't make any difference at all whether this is done, because the outcome of the nested loop is exactly the same whether each subsequent pass gets shortened or not.
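That equivalence is easy to check in software (a Python sketch of the argument, not HDL): shortened or not, the nested loops define the same input-to-output relationship.

```python
def bubble_sort(a, shorten_passes):
    """Bubble sort; optionally shorten each pass, the classic
    software optimisation discussed above."""
    a = list(a)
    n = len(a)
    for j in range(n - 1):
        # Shortened: (n-1-j) compare/swaps on pass j; full: always (n-1).
        limit = (n - 1 - j) if shorten_passes else (n - 1)
        for i in range(limit):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

# Both variants produce identical outputs for every input, so a synthesizer
# unrolling either one should arrive at equivalent logic.
assert bubble_sort([3, 1, 4, 2], True) == bubble_sort([3, 1, 4, 2], False) == [1, 2, 3, 4]
```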

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: Learning FPGAs: wrong approach?
« Reply #184 on: June 27, 2017, 09:35:45 am »
Actually my earlier LUT number is for an Artix7. Somehow ISE didn't catch I wanted to use a Spartan 6! The other numbers (speed) are for the Spartan6 design. The Spartan 6 design uses 79 Slice LUTs and occupies 33 slices (optimised for speed). I think your reasoning goes off the trail because the synthesizer turns the problem into logic equations which are then minimized keeping the architecture of the FPGA in mind. This means that some of the hardware you describe is probably combined in a way you can't see when designing 'in hardware'. I think it is very similar to a C compiler optimising for pre-fetching and caching.
Can you post/PM me the code you are using?

The code I saw on Stack Exchange only made one pass over the items per clock cycle, so to sort four items would take three cycles.

It would also explain the low LUT usage.

Here it is, but it uses a nested loop, so it seems to me it is doing a full bubble sort.

Code: [Select]
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use ieee.numeric_std.all;

package array_type is
    type bubble is array (0 to 3) of unsigned(7 downto 0);
end package;

library ieee;
use ieee.std_logic_1164.all;
use work.array_type.all;

entity bubblesort is
    port (
        signal clk:             in  std_logic;
        signal reset:           in  std_logic;
        signal in_array_in:        in  bubble;
        signal sorted_array_out:    out bubble
    );
end entity;


architecture foo of bubblesort is
    use ieee.numeric_std.all;

--signals to allow optimal routing
    signal in_array: bubble;
    signal sorted_array: bubble;

begin


BSORT:
    process (clk)
        variable temp:      unsigned (7 downto 0);
        variable var_array:     bubble;       
    begin

--move inside if rising_edge... to catch the whole thing inside the clock constraint and
--not depend on routing delays between input & output pads.
in_array<=in_array_in;
sorted_array_out <=sorted_array;
--

        var_array := in_array;
        if rising_edge(clk) then

            for j in bubble'LEFT to bubble'RIGHT - 1 loop
                for i in bubble'LEFT to bubble'RIGHT - 1 - j loop
                    if var_array(i) > var_array(i + 1) then
                        temp := var_array(i);
                        var_array(i) := var_array(i + 1);
                        var_array(i + 1) := temp;
                    end if;
                end loop;
            end loop;
            sorted_array <= var_array;
        end if;
    end process;
end architecture foo;

Got back home and tested it, using the same testbench - a 32-bit counter feeding the inputs, outputting to pins.

With the Vivado default strategy & PerfOptimized_High:
Code: [Select]
1. Utilization by Hierarchy
---------------------------

+----------------+----------------+------------+------------+---------+------+-----+--------+--------+--------------+
|    Instance    |     Module     | Total LUTs | Logic LUTs | LUTRAMs | SRLs | FFs | RAMB36 | RAMB18 | DSP48 Blocks |
+----------------+----------------+------------+------------+---------+------+-----+--------+--------+--------------+
| top_sort_4     |          (top) |        125 |        125 |       0 |    0 |  96 |      0 |      0 |            0 |
|   (top_sort_4) |          (top) |          1 |          1 |       0 |    0 |  64 |      0 |      0 |            0 |
|   uut          | sort_4_wrapper |        124 |        124 |       0 |    0 |  32 |      0 |      0 |            0 |
|     uut        |     bubblesort |        124 |        124 |       0 |    0 |  32 |      0 |      0 |            0 |
+----------------+----------------+------------+------------+---------+------+-----+--------+--------+--------------+
Timing is 10.563ns / 94.67 MHz (when constrained for 200MHz)

With the Vivado "Area Optimized" Strategy:
Code: [Select]
+----------------+----------------+------------+------------+---------+------+-----+--------+--------+--------------+
|    Instance    |     Module     | Total LUTs | Logic LUTs | LUTRAMs | SRLs | FFs | RAMB36 | RAMB18 | DSP48 Blocks |
+----------------+----------------+------------+------------+---------+------+-----+--------+--------+--------------+
| top_sort_4     |          (top) |        109 |        109 |       0 |    0 |  96 |      0 |      0 |            0 |
|   (top_sort_4) |          (top) |          1 |          1 |       0 |    0 |  64 |      0 |      0 |            0 |
|   uut          | sort_4_wrapper |        108 |        108 |       0 |    0 |  32 |      0 |      0 |            0 |
|     uut        |     bubblesort |        108 |        108 |       0 |    0 |  32 |      0 |      0 |            0 |
+----------------+----------------+------------+------------+---------+------+-----+--------+--------+--------------+
Interestingly, timing is a slightly faster 10.205 ns / 97.99 MHz (when constrained for 200MHz)

"sort_4_wrapper.vhd" just makes the interface compatible with the interface I used at the top level:

Code: [Select]
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

use work.array_type.all;


entity sort_4_wrapper is
    Port ( clk   : in STD_LOGIC;
           a_in  : in STD_LOGIC_VECTOR (7 downto 0);
           b_in  : in STD_LOGIC_VECTOR (7 downto 0);
           c_in  : in STD_LOGIC_VECTOR (7 downto 0);
           d_in  : in STD_LOGIC_VECTOR (7 downto 0);
           a_out : out STD_LOGIC_VECTOR (7 downto 0);
           b_out : out STD_LOGIC_VECTOR (7 downto 0);
           c_out : out STD_LOGIC_VECTOR (7 downto 0);
           d_out : out STD_LOGIC_VECTOR (7 downto 0));
end sort_4_wrapper;

architecture Behavioral of sort_4_wrapper is
    signal in_array_in : bubble;
    component bubblesort is
    port (
        signal clk              : in  std_logic;
        signal reset            : in  std_logic;
        signal in_array_in      : in  bubble;
        signal sorted_array_out : out bubble
    );
    end component;
    signal sorted_array_out : bubble;
begin

    in_array_in(0)   <= unsigned(a_in);
    in_array_in(1)   <= unsigned(b_in);
    in_array_in(2)   <= unsigned(c_in);
    in_array_in(3)   <= unsigned(d_in);
   
uut: bubblesort port map (
        clk              => clk,
        reset            => '0',
        in_array_in      => in_array_in,
        sorted_array_out => sorted_array_out);
    a_out  <= std_logic_vector(sorted_array_out(0));
    b_out  <= std_logic_vector(sorted_array_out(1));
    c_out  <= std_logic_vector(sorted_array_out(2));
    d_out  <= std_logic_vector(sorted_array_out(3));
end Behavioral;

So now I am at a loss - I can't recreate your results. I get exactly what my unrolled BubbleSort does (which is what I expected). Do you have any hints of what I have missed?

What are you using as the 'source' for your inputs? As mentioned, I'm just using a 32-bit counter, and the outputs just go to external pins. It was the easiest way to ensure that nothing can get optimized away...

Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Online nctnico

  • Super Contributor
  • ***
  • Posts: 26891
  • Country: nl
    • NCT Developments
Re: Learning FPGAs: wrong approach?
« Reply #185 on: June 27, 2017, 09:54:29 am »
I have I/O pins at the inputs & outputs. It might be that the synthesizer from ISE 14.7 is better than the one from Vivado. The last news I heard is that Vivado's synthesizer isn't quite there yet. Besides that, I also placed & routed the design, which gives some extra logic optimisation.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: Learning FPGAs: wrong approach?
« Reply #186 on: June 27, 2017, 10:58:33 am »
I have I/O pins at the inputs & outputs. It might be that the synthesizer from ISE 14.7 is better than the one from Vivado. The last news I heard is that Vivado's synthesizer isn't quite there yet. Besides that, I also placed & routed the design, which gives some extra logic optimisation.

Hmm, fully placed & routed under ISE with defaults, for a Spartan 6 LX4, I get 255 LUTs (see attached image).

One thing I do see as "odd" is that as written you need to add two more signals to the process's sensitivity list - it should be:

    process (clk, in_array_in, sorted_array)

I haven't made the change, but I wonder if that is the cause of our differences?
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Online nctnico

  • Super Contributor
  • ***
  • Posts: 26891
  • Country: nl
    • NCT Developments
Re: Learning FPGAs: wrong approach?
« Reply #187 on: June 27, 2017, 11:12:57 am »
No, not to the sensitivity list, but inside the 'if rising_edge...' clause, so the logic is caught by the clock constraint. Otherwise the input-to-logic and flipflop-to-output routing would add extra delays.
Did you enable optimise across hierarchy / keep hierarchy? AFAIK it is off by default, which produces lesser results, but enabling it would clutter your numbers because they would include the counter.
« Last Edit: June 27, 2017, 11:16:01 am by nctnico »
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline Bruce Abbott

  • Frequent Contributor
  • **
  • Posts: 627
  • Country: nz
    • Bruce Abbott's R/C Models and Electronics
Re: Learning FPGAs: wrong approach?
« Reply #188 on: June 27, 2017, 06:05:22 pm »
Almost. It figures out what logic is required to *sort* the entire array in a single clock - not necessarily *bubble* sort.
Yes, I understand that. The compiler creates logic that performs the requested function, but it decides how to do that.  The source algorithm is a bubble sort, but the logic doesn't have to look anything like a bubble sort - it just has to produce the same output as a bubble sort.

In software we think of an array as a block of memory with numbers stored in it, and sorting the array changes the order of its contents. An advantage of Bubble Sort over other algorithms is that since elements are swapped 'in place' its memory footprint can be very low. I had incorrectly assumed that the VHDL code was also sorting the array 'in place'. However it actually takes an array as input and then fills another array with the sorted data. IOW the output is a (separate) sorted version of the input. If the array data came from some memory (registers or RAM) then it could be stored back into that memory if desired, or it could be used elsewhere without affecting the original array's contents.
 
The disadvantage of Bubble Sort is that processing time grows quadratically with array size. In a hardware implementation this is not necessarily true, because an entire array can be sorted in one clock cycle, but (I presume) in real hardware the number of gates required will grow quadratically instead, which may increase latency and reduce the maximum permitted clock frequency. Also, the largest array that can be sorted in 1 clock cycle is limited by the maximum number of bits that can be operated on in parallel. Large arrays would have to be stored in RAM and sorted in several passes, just like in software.
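A rough count makes the quadratic growth concrete: a fully unrolled bubble sort of n elements needs (n-1) + (n-2) + ... + 1 = n(n-1)/2 compare/swap units (a back-of-envelope sketch that ignores whatever the synthesizer merges away):

```python
def comparators(n):
    # Compare/swap units in a fully unrolled bubble sort of n elements.
    return n * (n - 1) // 2

for n in (4, 8, 16, 32):
    print(n, comparators(n))  # 4->6, 8->28, 16->120, 32->496
# Doubling the array size roughly quadruples the amount of sorting logic.
```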
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3139
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #189 on: June 27, 2017, 06:34:57 pm »
Also the largest array that can be sorted in 1 clock cycle is limited by the maximum number of bits that can be operated on in parallel. Large arrays would have to be stored in RAM and sorted in several passes just like in software.

Exactly. If you wanted it to be scalable, or if you wanted to sort arrays of variable size, it would be much better to store the array in BRAM and sort it in place. You could create a specialized soft core for this (or a state machine, if you will), which would do it sequentially. But it would be much, much slower.

If you wanted to make it faster, you could use a more advanced software sorting algorithm, such as quicksort. Or you could come up with an FPGA-friendly algorithm which manages to establish a pipeline and makes bubble sort faster. But such an approach would be much more complex than sorting a small array with combinatorial logic.
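One candidate for such an FPGA-friendly algorithm (a sketch of the general idea, not anyone's actual implementation here) is odd-even transposition sort: each of its n stages performs a fixed, independent set of compare/swaps, so each stage can become one pipeline level, accepting a new array every clock.

```python
def odd_even_transposition_sort(a):
    """Sort in n stages; stage k compare/swaps adjacent pairs starting
    at index k % 2. All compare/swaps within a stage are independent,
    so in hardware each stage maps onto one layer of comparators plus
    one layer of pipeline registers."""
    a = list(a)
    n = len(a)
    for stage in range(n):
        for i in range(stage % 2, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

assert odd_even_transposition_sort([4, 3, 2, 1]) == [1, 2, 3, 4]
```

Pipelined this way, throughput is one array per clock, at the cost of n cycles of latency.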
 

Online nctnico

  • Super Contributor
  • ***
  • Posts: 26891
  • Country: nl
    • NCT Developments
Re: Learning FPGAs: wrong approach?
« Reply #190 on: June 27, 2017, 07:33:04 pm »
Also the largest array that can be sorted in 1 clock cycle is limited by the maximum number of bits that can be operated on in parallel. Large arrays would have to be stored in RAM and sorted in several passes just like in software.

Exactly. If you wanted it to be scalable, or if you wanted to sort arrays of variable size, it would be much better to store the array in BRAM and sort it in place. You could create a specialized soft core for this (or a state machine, if you will), which would do it sequentially. But it would be much, much slower.

If you wanted to make it faster, you could use a more advanced software sorting algorithm, such as quicksort. Or you could come up with an FPGA-friendly algorithm which manages to establish a pipeline and makes bubble sort faster. But such an approach would be much more complex than sorting a small array with combinatorial logic.
One of the things I did to the sorting example I posted was adding extra buffer (flipflop) stages between the input & output. The synthesizer can use that to do the pipelining for you. IOW you can let the tools do a lot of work for you before you need to resort to getting into the nitty-gritty bits of an FPGA.

However sorting large amounts of data is better done using an iterative approach.
« Last Edit: June 27, 2017, 07:36:39 pm by nctnico »
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline Mattjd

  • Regular Contributor
  • *
  • Posts: 230
  • Country: us
Re: Learning FPGAs: wrong approach?
« Reply #191 on: June 27, 2017, 08:09:12 pm »
How are you going about doing the pipelining? Are you running all your in/outs through always blocks that represent a register or what?
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3139
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #192 on: June 27, 2017, 08:49:03 pm »
How are you going about doing the pipelining? Are you running all your in/outs through always blocks that represent a register or what?

I don't use Verilog. In VHDL it is very simple.

Combinatorial within one clock cycle:

Code: [Select]
process(clk)
begin
  if rising_edge(clk) then
    x <= (a+b)+c;
  end if;
end process;

Pipelined:

Code: [Select]
process(clk)
begin
  if rising_edge(clk) then
    -- stage 1
    a_plus_b <= a+b;
    c_stage_2 <= c;

    -- stage 2
    x <= a_plus_b + c_stage_2;
  end if;
end process;

Signals a_plus_b and c_stage_2 are added flip-flops.
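The two-stage behaviour can be modelled cycle-by-cycle (a Python sketch of the VHDL above; it assumes the registers start at zero, and all of them update together on each rising edge):

```python
def simulate(samples):
    """Model of the two-stage pipeline: x lags the inputs by two edges."""
    a_plus_b = c_stage_2 = x = 0  # assumed register reset values
    outputs = []
    for a, b, c in samples:
        # One rising edge: every register loads from the *old* values at once.
        x, a_plus_b, c_stage_2 = a_plus_b + c_stage_2, a + b, c
        outputs.append(x)
    return outputs

# (1+2)+3 = 6 reaches x on the second clock edge after it is applied.
print(simulate([(1, 2, 3), (0, 0, 0), (0, 0, 0)]))  # [0, 6, 0]
```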
« Last Edit: June 27, 2017, 09:02:37 pm by NorthGuy »
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: Learning FPGAs: wrong approach?
« Reply #193 on: June 28, 2017, 12:01:04 am »
No, not to the sensitivity list, but inside the 'if rising_edge...' clause, so the logic is caught by the clock constraint. Otherwise the input-to-logic and flipflop-to-output routing would add extra delays.
Did you enable optimise across hierarchy / keep hierarchy? AFAIK it is off by default, which produces lesser results, but enabling it would clutter your numbers because they would include the counter.
So... mystery deepens. I see a learning experience ahead for me.

I'm now using ISE. I've promoted your module to be the top level, so all 32 inputs are on pins, and the outputs are registered before the pins - usage is now 76 LUTs  :o :scared:

As everything except the output registers is async, the timing looks to be > 20ns for the inputs to the output registers to be valid. Not quite sure how to read the numbers in the timing report... However, registering the inputs makes the LUT count go up dramatically to 222, with an Fmax of 89.8 MHz.

The "designed with hardware in mind" version is 76 LUTs (with an Fmax of > 213MHz), so without trawling through the technology schematic, it seems the inputs-unregistered bubble sort version has the freedom to be optimized down, but the restrictions placed on it by having registers on the inputs prevent this from occurring. (Wonder why that would be?...)

So the "designed with hardware in mind" can be almost 3x smaller, and > 2x faster than the "simple bubble sort" version. It also delivers consistent performance and usage no matter how it is used.

Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline BrianHG

  • Super Contributor
  • ***
  • Posts: 7725
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #194 on: June 28, 2017, 01:53:06 am »
Though not with this code: in Altera's Quartus II v9 and above, just adding an extra stage of DFFs, without any additional logic or deliberate pipelining, before or after such a piece of HDL code can have the same effect, potentially slimming the LUT count and doubling the Fmax. This may just have been the way I was coding at the time, but there are features in the compiler to decompose and reconstruct logic to achieve the best possible Fmax, in both the compiler stage and the fitting/physical synthesis stage.

Darn, if I had Quartus installed on one of my PCs today, I would have played with this VHDL code already and posted the results...
« Last Edit: June 28, 2017, 01:59:32 am by BrianHG »
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: Learning FPGAs: wrong approach?
« Reply #195 on: June 28, 2017, 03:22:15 am »
Though not with this code: in Altera's Quartus II v9 and above, just adding an extra stage of DFFs, without any additional logic or deliberate pipelining, before or after such a piece of HDL code can have the same effect, potentially slimming the LUT count and doubling the Fmax. This may just have been the way I was coding at the time, but there are features in the compiler to decompose and reconstruct logic to achieve the best possible Fmax, in both the compiler stage and the fitting/physical synthesis stage.

Darn, if I had Quartus installed on one of my PCs today, I would have played with this VHDL code already and posted the results...

There are a lot of undocumented tricks for getting great results with inference - how to cast your code 'just right' so a DSP block is inferred with all the right pipeline registers, or so it uses block RAM, or so LUTs become shift registers rather than a chain of FFs.

The thing that annoys me is that the patterns that work are not well defined. For example, in a clocked process

 data <= memory(to_integer(unsigned(address)));

should infer a block RAM if memory is big enough, but as a general rule, anything with an expression for the array index won't:

  data <= memory(to_integer(unsigned(address)+1));

Will only infer LUTs and flip-flops. It leaves you with 'land mines' in your code:

  addr_temp := unsigned(address)+1;  -- assign address to variable
  data <= memory(to_integer(addr_temp)); -- look up address

They have big flags waving away saying "Who wrote this junk! Make me shiny!", and when you touch them your design blows up.

(the example is somewhat contrived, I would have to test to find an exact case when I can prove this to be the case, but you get the idea)
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline BrianHG

  • Super Contributor
  • ***
  • Posts: 7725
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #196 on: June 28, 2017, 03:48:02 am »
There are a lot of undocumented tricks for getting great results with inference - how to cast your code 'just right' so a DSP block is inferred with all the right pipeline registers, or so it uses block RAM, or so LUTs become shift registers rather than a chain of FFs.
That pretty much sums up what's been going on...
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3139
  • Country: ca
Re: Learning FPGAs: wrong approach?
« Reply #197 on: June 28, 2017, 04:04:45 am »
The thing that annoys me is that the patterns that work are not well defined. For example, in a clocked process

 data <= memory(to_integer(unsigned(address)));

should infer a block RAM if memory is big enough, but as a general rule, anything with an expression for the array index won't:

  data <= memory(to_integer(unsigned(address)+1));

On Xilinx BRAM must be clocked. You cannot calculate the address, give it to BRAM and get a combinatorial result. Therefore, when you try to do that (as in your second expression above), you will never get BRAM. You may get "distributed memory" instead, because the distributed memory can get you the combinatorial result you want.

If "address" is registered, then you may get the BRAM by removing "+1".
 

Online nctnico

  • Super Contributor
  • ***
  • Posts: 26891
  • Country: nl
    • NCT Developments
Re: Learning FPGAs: wrong approach?
« Reply #198 on: June 28, 2017, 06:01:04 am »
No, not to the sensitivity list, but inside the 'if rising_edge...' clause, so the logic is caught by the clock constraint. Otherwise the input-to-logic and flipflop-to-output routing would add extra delays.
Did you enable optimise across hierarchy / keep hierarchy? AFAIK it is off by default, which produces lesser results, but enabling it would clutter your numbers because they would include the counter.
So... mystery deepens. I see a learning experience ahead for me.

I'm now using ISE. I've promoted your module to be the top level, so all 32 inputs are on pins, and the outputs are registered before the pins - usage is now 76 LUTs  :o :scared:

As everything except the output registers is async, the timing looks to be > 20ns for the inputs to the output registers to be valid. Not quite sure how to read the numbers in the timing report... However, registering the  inputs makes the LUT count go up dramatically to 222, with a Fmax of 89.8 MHz.

The "designed with hardware in mind" version is 76 LUTs (with an Fmax of > 213MHz), so without trawling through the technology schematic, it seems the inputs-unregistered bubble sort version has the freedom to be optimized down, but the restrictions placed on it by having registers on the inputs prevent this from occurring. (Wonder why that would be?...)

So the "designed with hardware in mind" can be almost 3x smaller, and > 2x faster than the "simple bubble sort" version. It also delivers consistent performance and usage no matter how it is used.
That is the wrong conclusion. By adding the registers you add more logic to the design. Also: did you P&R the design? There is an extra logic optimisation stage in there as well.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: Learning FPGAs: wrong approach?
« Reply #199 on: June 28, 2017, 08:03:47 am »
That is the wrong conclusion. By adding the registers you add more logic to the design. Also: did you P&R the design? There is an extra logic optimisation stage in there as well.

I checked the Technology Schematic - Inputs go straight into a FF from the IBUF, outputs straight from the FF to OBUF.

And then I checked the FPGA editor, and all 32 inputs run into a slice, and in that slice they run directly into a FF's D input (the input MUX on the FF that acts as CE is set to a fixed value).

As for the 32 outputs, they are all directly from the output of a flipflop to the output buffer.

So that is all 64 FFs accounted for. All the logic is between these two sets of FFs, and no retiming has occurred.

As far as I have seen for Xilinx, the P+R optimization makes zero difference to what generic primitives are actually used - only where they are placed on the die and how they are connected (hence the name place and route). The logic of the design at that point is fixed by the Implementation step.

(And of course the timing of a design depends on how well it is P+Red, since that determines how well routing delays are minimised, and there are a few little corner cases like route-throughs, which do consume additional LUTs by running signals through them like a buffer to tweak timing - but resource usage can only go up during P+R, and the logical design is not transformed at all.)
« Last Edit: June 28, 2017, 08:31:50 am by hamster_nz »
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

