Author Topic: Most efficient way of going through a std_logic_vector  (Read 1730 times)

Offline AtomilloTopic starter

  • Regular Contributor
  • *
  • Posts: 205
  • Country: es
Most efficient way of going through a std_logic_vector
« on: December 20, 2022, 03:47:35 pm »
Hello!

I have a VHDL description in which I want to go through every element of a std_logic_vector (of 8 bits to be specific). I have two ways of doing this, either have a counter and then do:

out<=vector(counter)

Or with shift registers:

out<=vector(0);
vector(6 downto 0) <= vector(7 downto 1);

Obviously for this second solution I still need a counter in order to know when to stop, but which solution uses fewer LUTs? Is there a way to know beforehand?

If I understood it correctly, when doing out <= vector(counter) it will synthesize a MUX with all elements of the vector at the input, out at the output and counter as the select signal.

In comparison, I would expect the second solution to only need as many registers as elements of vector. However, after doing a test the second solution uses more LUTs!

Any tips?

Thanks!
 

Offline ale500

  • Frequent Contributor
  • **
  • Posts: 415
Re: Most efficient way of going through a std_logic_vector
« Reply #1 on: December 20, 2022, 04:22:10 pm »
Is your design already so constrained that you are counting LUTs?
Both approaches are quite different in nature: if you use the MUX you do not need a clock but you may add an output register that is clocked... the shift register needs a clock.
 

Offline AtomilloTopic starter

  • Regular Contributor
  • *
  • Posts: 205
  • Country: es
Re: Most efficient way of going through a std_logic_vector
« Reply #2 on: December 20, 2022, 04:24:44 pm »
It's not really a design necessity, more like an exercise in trying to find the best way to do it, since it's a problem that seems to occur relatively often.
What puzzled me is that I got the opposite of the result I expected!
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9915
  • Country: us
Re: Most efficient way of going through a std_logic_vector
« Reply #3 on: December 20, 2022, 08:15:04 pm »
Can you look at the generated logic?  You can with the Xilinx tools but I don't know about others.

I have found that counters are implemented with a register and an adder and all the logic it entails.  I had never really stopped to think about how a counter would be implemented.  I guess I thought it would look like a TTL counter, a register with steering logic around the outside.
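
To make the "register and an adder" picture concrete, here is a minimal sketch of a free-running 3-bit counter. The entity name `counter_sketch` and its ports are hypothetical, not taken from anyone's design in this thread. The `+ 1` is what synthesis typically turns into adder logic in front of the register.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity: a free-running 3-bit counter.
entity counter_sketch is
    port ( clk   : in  std_logic;
           count : out std_logic_vector(2 downto 0) );
end entity counter_sketch;

architecture rtl of counter_sketch is
    -- The register that holds the current count.
    signal count_reg : unsigned(2 downto 0) := (others => '0');
begin

    process(clk)
    begin
        if rising_edge(clk) then
            -- Adder feeding the register; an unsigned(2 downto 0)
            -- wraps from 7 back to 0 on overflow.
            count_reg <= count_reg + 1;
        end if;
    end process;

    count <= std_logic_vector(count_reg);

end architecture rtl;
```

How the adder is actually mapped (plain LUTs, carry chain, or something else) depends on the tool and the target, so the schematic view is the place to check.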
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 27435
  • Country: nl
    • NCT Developments
Re: Most efficient way of going through a std_logic_vector
« Reply #4 on: December 20, 2022, 08:19:16 pm »
The real answer is: it depends.
With a counter you have a few flipflops for the counter and muxes to bring out the bits. With a shift register, you'll have a flipflop for each bit. The shift register version does allow higher clock speeds.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline AtomilloTopic starter

  • Regular Contributor
  • *
  • Posts: 205
  • Country: es
Re: Most efficient way of going through a std_logic_vector
« Reply #5 on: December 20, 2022, 08:27:27 pm »
I'm using Vivado, never tried to look at the generated logic but sounds fun! Will try tomorrow at work and report what the result is.
Yes but the counter is in both versions, since even with the shift registers I still need to know when to stop and move on to the next state.
 

Offline AtomilloTopic starter

  • Regular Contributor
  • *
  • Posts: 205
  • Country: es
Re: Most efficient way of going through a std_logic_vector
« Reply #6 on: December 20, 2022, 08:29:28 pm »
So in principle the shift register would just use FFs and not any LUTs?

Btw, does anyone have any good readings on what the machine actually synthesizes when doing VHDL?

I've actually never done this at uni and I'm just now realizing how important it is.
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2806
  • Country: nz
Re: Most efficient way of going through a std_logic_vector
« Reply #7 on: December 20, 2022, 09:14:37 pm »
I've coded the problem four different ways:

First, using a counter and a MUX:

Code: [Select]
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity serial_test_1 is
    Port ( clk      : in  STD_LOGIC;
           parallel : in  STD_LOGIC_VECTOR (7 downto 0);
           serial   : out STD_LOGIC := '0');
end serial_test_1;

architecture Behavioral of serial_test_1 is
    signal counter : unsigned(2 downto 0) := (others => '1');
    signal data    : std_logic_vector(7 downto 0) := (others => '0');
begin

    serial <= data(to_integer(counter));

process(clk)
    begin
        if rising_edge(clk) then
            if counter = 7 then
                data <= parallel;
            end if;
            counter <= counter + 1;
        end if;
    end process;

end Behavioral;

Second, using a counter and a shift register:

Code: [Select]
architecture Behavioral of serial_test_2 is
    signal counter : unsigned(2 downto 0) := (others => '0');
    signal data_sr : std_logic_vector(7 downto 0);
begin

    serial <= data_sr(0);

process(clk)
    begin
        if rising_edge(clk) then
            if counter = 0 then
                data_sr <= parallel;
            else
                data_sr <= '0' & data_sr(7 downto 1);
            end if;
            counter <= counter + 1;
        end if;
    end process;

end Behavioral;

Third, using two shift registers and recycling the 'hot bit' in the restart shift register:

Code: [Select]
architecture Behavioral of serial_test_3 is
    signal data_sr    : std_logic_vector(7 downto 0);
    signal restart_sr : std_logic_vector(7 downto 0) := x"01";
begin

    serial <= data_sr(0);
   
process(clk)
    begin
        if rising_edge(clk) then
            if restart_sr(0) = '1' then
                data_sr <= parallel;
            else
                data_sr <= '0' & data_sr(7 downto 1);
            end if;
            restart_sr <= restart_sr(0) & restart_sr(7 downto 1); 
        end if;
    end process;
end Behavioral;

Last, using two shift registers, but not recycling the hot bit:

Code: [Select]
architecture Behavioral of serial_test_4 is
    signal data_sr    : std_logic_vector(7 downto 0);
    signal restart_sr : std_logic_vector(6 downto 0) := (others => '1');
begin

    serial <= data_sr(0);

process(clk)
    begin
        if rising_edge(clk) then
            if restart_sr(0) = '1' then
                data_sr <= parallel;
                restart_sr <= (others => '0');
            else
                data_sr    <= "0" & data_sr(7 downto 1);
                restart_sr <= "1" & restart_sr(6 downto 1); 
            end if;
        end if;
    end process;
end Behavioral;

All give identical waveform results in simulation.
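
One way to check the "identical waveforms" claim is a small self-checking testbench. The sketch below compares the first and third versions. It assumes that `serial_test_3` has an entity with the same ports as `serial_test_1` (only the architecture is shown above), and the testbench name `serial_compare_tb` and the stimulus pattern are made up for illustration. The check starts after the first rising edge because, in simulation, `data_sr` has no initial value while the first version starts at '0'.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical testbench; assumes serial_test_1 and serial_test_3 are
-- compiled into library "work" with identical port lists.
entity serial_compare_tb is
end entity serial_compare_tb;

architecture sim of serial_compare_tb is
    signal clk      : std_logic := '0';
    signal parallel : std_logic_vector(7 downto 0) := x"A5";
    signal serial_1 : std_logic;
    signal serial_3 : std_logic;
    signal done     : boolean := false;
begin

    -- 100 MHz clock that stops once the stimulus is finished.
    clk <= not clk after 5 ns when not done else '0';

    dut_1 : entity work.serial_test_1
        port map (clk => clk, parallel => parallel, serial => serial_1);

    dut_3 : entity work.serial_test_3
        port map (clk => clk, parallel => parallel, serial => serial_3);

    -- Present a new (arbitrary) word every 8 clock cycles.
    stimulus : process
    begin
        for word in 0 to 31 loop
            parallel <= std_logic_vector(to_unsigned((word * 37) mod 256, 8));
            for i in 1 to 8 loop
                wait until rising_edge(clk);
            end loop;
        end loop;
        done <= true;
        report "comparison finished";
        wait;
    end process;

    -- Compare both outputs on every falling edge, once both versions
    -- have loaded their first word.
    check : process
    begin
        wait until rising_edge(clk);
        loop
            wait until falling_edge(clk);
            exit when done;
            assert serial_1 = serial_3
                report "serial outputs differ" severity error;
        end loop;
        wait;
    end process;

end architecture sim;
```

The same pattern extends to the second and fourth versions by adding two more instances and assertions.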

Resource usage:
Code: [Select]
+-------------------+---------------+------------+------------+---------+------+-----+
|      Instance     |     Module    | Total LUTs | Logic LUTs | LUTRAMs | SRLs | FFs |
+-------------------+---------------+------------+------------+---------+------+-----+
|   i_serial_test_1 | serial_test_1 |          5 |          5 |       0 |    0 |  11 |
|   i_serial_test_2 | serial_test_2 |          8 |          8 |       0 |    0 |  11 |
|   i_serial_test_3 | serial_test_3 |          5 |          4 |       0 |    1 |   9 |
|   i_serial_test_4 | serial_test_4 |          5 |          5 |       0 |    0 |  15 |
+-------------------+---------------+------------+------------+---------+------+-----+

The first method (using the MUX) has an unregistered output, which is not ideal.

The second uses the most LUTs (but still OK as it is much less than one LUT per FF, which is a good rule of thumb).

The third method (recycling a single '1' bit in a shift register rather than a short counter) allows it to be optimized into an 'SRL' primitive - an 8-bit shift register implemented in a LUT.

The last method requires that every FF either gets reset or loaded, preventing optimization.

With only 8 data bits the difference isn't that obvious. Here's what resource usage looks like with 16 parallel input bits:

Code: [Select]
+-------------------+---------------+------------+------------+---------+------+-----+--------+--------+------------+
|      Instance     |     Module    | Total LUTs | Logic LUTs | LUTRAMs | SRLs | FFs | RAMB36 | RAMB18 | DSP Blocks |
+-------------------+---------------+------------+------------+---------+------+-----+--------+--------+------------+
|   i_serial_test_1 | serial_test_1 |          7 |          7 |       0 |    0 |  20 |      0 |      0 |          0 |
|   i_serial_test_2 | serial_test_2 |         18 |         18 |       0 |    0 |  20 |      0 |      0 |          0 |
|   i_serial_test_3 | serial_test_3 |          9 |          8 |       0 |    1 |  17 |      0 |      0 |          0 |
|   i_serial_test_4 | serial_test_4 |          9 |          9 |       0 |    0 |  31 |      0 |      0 |          0 |
+-------------------+---------------+------------+------------+---------+------+-----+--------+--------+------------+

If resources were important I would go for "i_serial_test_3", based on it having the best timing (due to registered output) and the lowest resource count. It is the only one that gives the tools any way to optimize the design.
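
For trying other widths, such as the 16-bit case above, the third version can be parameterized. This is a sketch only: the entity name `serial_test_3_generic` and the generic `WIDTH` are hypothetical, and whether the restart register still ends up in an SRL at a given width depends on the tool and the target.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical generic-width variant of serial_test_3.
entity serial_test_3_generic is
    generic ( WIDTH : positive := 8 );
    port ( clk      : in  std_logic;
           parallel : in  std_logic_vector(WIDTH-1 downto 0);
           serial   : out std_logic );
end entity serial_test_3_generic;

architecture rtl of serial_test_3_generic is
    signal data_sr    : std_logic_vector(WIDTH-1 downto 0);
    -- One-hot restart register: a single '1' that circulates forever.
    signal restart_sr : std_logic_vector(WIDTH-1 downto 0)
                        := (0 => '1', others => '0');
begin

    serial <= data_sr(0);

    process(clk)
    begin
        if rising_edge(clk) then
            if restart_sr(0) = '1' then
                data_sr <= parallel;                         -- reload every WIDTH cycles
            else
                data_sr <= '0' & data_sr(WIDTH-1 downto 1);  -- shift right
            end if;
            -- Recycle the hot bit: no reset and no parallel load here.
            restart_sr <= restart_sr(0) & restart_sr(WIDTH-1 downto 1);
        end if;
    end process;

end architecture rtl;
```

Instantiating it with `generic map (WIDTH => 16)` and re-running the utilization report is one way to reproduce a comparison like the table above.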
« Last Edit: December 20, 2022, 09:17:17 pm by hamster_nz »
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 
The following users thanked this post: jancumps, FenTiger, Atomillo

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14967
  • Country: fr
Re: Most efficient way of going through a std_logic_vector
« Reply #8 on: December 20, 2022, 09:33:24 pm »
As you can see (thanks to hamster_nz), it's not as obvious as one may think, and the answer is, it depends.

One common misconception is to think expressing a shift register in HDL would be the most efficient, and this is not necessarily the case.

It will largely depend on the tool and target. What you'll get with some FPGA may be different from what you get with another, and what it would yield in synthesis for silicon would be yet something else.

 
The following users thanked this post: Atomillo

Offline AtomilloTopic starter

  • Regular Contributor
  • *
  • Posts: 205
  • Country: es
Re: Most efficient way of going through a std_logic_vector
« Reply #9 on: December 21, 2022, 07:37:44 am »
This is amazing, many thanks hamster_nz!! Some questions that are perhaps very basic, but would help me make sure I understood everything correctly:

1.- In the first implementation, could we register the output if we did it inside the process? Or by doing something like:

serial <= data(to_integer(counter));
serial2 <= serial;

With serial being a signal and serial2 being the output, we would now have serial -> FF -> serial2 right?

2.- In the third implementation, if I understood it right, the synthesizer uses an SRL primitive for the restart one because we specifically tell it to recycle the hot bit. Is there any documentation on the available primitives and how to make sure you are actually using them?

Once again many thanks for your help! While I've received formal training on VHDL this is my first "hands-on" job with them.
 

Offline AtomilloTopic starter

  • Regular Contributor
  • *
  • Posts: 205
  • Country: es
Re: Most efficient way of going through a std_logic_vector
« Reply #10 on: December 21, 2022, 03:16:50 pm »
I just found out why the design with the counter + shift register is so inefficient! Look at how the following VHDL was synthesized.

So just to include the initial condition of the vector it needs a multiplexer with number of inputs = 2 * length of the vector... So that's a lesson about how important it is to use SRL primitives when needed.

EDIT: forgot to add the VHDL
« Last Edit: December 21, 2022, 05:21:44 pm by Atomillo »
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2806
  • Country: nz
Re: Most efficient way of going through a std_logic_vector
« Reply #11 on: December 21, 2022, 08:03:10 pm »
Quote
1.- In the first implementation, could we register the output if we did it inside the process?

We could, but the extra cycle of latency would put it out of step with the other implementations.
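
As a rough sketch of what "inside the process" could look like, here is the first version with the MUX output registered. The entity name `serial_test_1_reg` is hypothetical. Note that a concurrent `serial2 <= serial;` outside a clocked process is only a wire and does not create a flip-flop; the assignment has to sit under `rising_edge(clk)`.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical variant of serial_test_1 with a registered output.
entity serial_test_1_reg is
    port ( clk      : in  std_logic;
           parallel : in  std_logic_vector(7 downto 0);
           serial   : out std_logic := '0' );
end entity serial_test_1_reg;

architecture rtl of serial_test_1_reg is
    signal counter : unsigned(2 downto 0) := (others => '1');
    signal data    : std_logic_vector(7 downto 0) := (others => '0');
begin

    process(clk)
    begin
        if rising_edge(clk) then
            if counter = 7 then
                data <= parallel;
            end if;
            counter <= counter + 1;
            -- Assigned under rising_edge, so 'serial' becomes a flip-flop.
            -- It samples the values of 'data' and 'counter' from before
            -- this edge, which is the extra cycle of latency.
            serial <= data(to_integer(counter));
        end if;
    end process;

end architecture rtl;
```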

Quote
2.- In the third implementation, if I understood it right, the synthesizer uses an SRL primitive for the restart one because we specifically tell it to recycle the hot bit. Is there any documentation on the available primitives and how to make sure you are actually using them?

It's able to use the SRL primitive because the shift register has only one output (at the top bit), one input, and no resets.

Here's an instance of the actual primitive, from the "Language templates" in Vivado:
Quote
   SRL16E_inst : SRL16E
   generic map (
      INIT => X"0000")
   port map (
      Q => Q,       -- SRL data output
      A0 => A0,     -- Select[0] input
      A1 => A1,     -- Select[1] input
      A2 => A2,     -- Select[2] input
      A3 => A3,     -- Select[3] input
      CE => CE,     -- Clock enable input
      CLK => CLK,   -- Clock input
      D => D        -- SRL data input
   );

In the case above it might be configured as:

Quote
   SRL16E_inst : SRL16E
   generic map (
      INIT => X"0080")
   port map (
      Q => feedback,
      A0 => '1',
      A1 => '1',
      A2 => '1',
      A3 => '0',
      CE => '1',
      CLK => CLK,
      D => feedback
   );

(This is most likely not correct - the 'A' value might be off by one, or the 'INIT' value needs shifting a bit)

IMO the best thing to do is read your FPGA's User Guide, not to memorize the exact details of each primitive, but to get the gist of what the logic blocks can do, and then look up the finer details when needed. The User Guides are usually pretty readable and often have very useful tidbits, like that the internal registers of DSP slices do not have resets.
 
The following users thanked this post: Atomillo

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14967
  • Country: fr
Re: Most efficient way of going through a std_logic_vector
« Reply #12 on: December 21, 2022, 08:25:06 pm »
Yep, read the docs, that will give you definite hints regarding how to write HDL in order to make reasonably sure synthesis will infer given primitives. For memory blocks, they usually give code examples.
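
As an illustration of that kind of coding style, here is a minimal sketch of a shift register written with the properties mentioned earlier in the thread: one serial input, one tap at the end, a clock enable, and no reset or parallel load. The entity name `delay_line` and the generic `DEPTH` are hypothetical. Whether a given tool actually maps it onto an SRL depends on the tool, the target, and the settings, so the utilization report is the thing to check.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example: a simple single-bit delay line.
entity delay_line is
    generic ( DEPTH : positive range 2 to 16 := 8 );
    port ( clk : in  std_logic;
           ce  : in  std_logic;   -- clock enable
           d   : in  std_logic;   -- serial data in
           q   : out std_logic ); -- serial data out, DEPTH enabled cycles later
end entity delay_line;

architecture rtl of delay_line is
    signal sr : std_logic_vector(DEPTH-1 downto 0) := (others => '0');
begin

    process(clk)
    begin
        if rising_edge(clk) then
            if ce = '1' then
                -- Pure shift: no reset and no parallel load, and only
                -- the last bit is read outside this process.
                sr <= sr(DEPTH-2 downto 0) & d;
            end if;
        end if;
    end process;

    q <= sr(DEPTH-1);

end architecture rtl;
```

Adding a reset, a parallel load, or extra taps on the intermediate bits generally forces individual flip-flops, which is what the fourth version above ran into.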

Now of course, while the question is interesting, it should be put into context, and we don't know what the OP's context is.

Unless you're dealing with hundreds of shift registers and/or are targeting very high clock rates (in which case you probably should instantiate relevant primitives anyway), the minute details of the implementation will not matter much.
 
The following users thanked this post: Atomillo

Offline AtomilloTopic starter

  • Regular Contributor
  • *
  • Posts: 205
  • Country: es
Re: Most efficient way of going through a std_logic_vector
« Reply #13 on: December 21, 2022, 10:36:09 pm »
Hamster_nz and SiliconWizard many thanks for your help! I will definitely take a look at the FPGA User Guide. It seems quite a helpful doc.

You are both right, there's no need to optimize this, it was just an exercise to learn more. And I'm happy to say I've accomplished that!
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3219
  • Country: ca
Re: Most efficient way of going through a std_logic_vector
« Reply #14 on: December 31, 2022, 01:08:23 am »
Xilinx also has a built-in entity called OSERDES which is a shift register with parallel load, but it can be used only for driving external pins.
 

