Author Topic: Most efficient way of going through a std_logic_vector (Read 1730 times)

Atomillo · « **on:** December 20, 2022, 03:47:35 pm »

Hello!

I have a VHDL description in which I want to go through every element of a std_logic_vector (8 bits, to be specific). I have two ways of doing this: either use a counter and then do:

out<=vector(counter);

Or with shift registers:

out<=vector(0);
vector(6 downto 0) <= vector(7 downto 1);

Obviously for this second solution I still need a counter in order to know when to stop, but which solution uses fewer LUTs? Is there a way to know beforehand?

If I understood it correctly, out <= vector(counter) will synthesize to a MUX with all elements of the vector as inputs, counter as the select signal, and out as the output.

In comparison, I would expect the second solution to only need as many registers as there are elements in the vector. However, after doing a test, the second solution uses more LUTs!

Any tips?

Thanks!

ale500 · « **Reply #1 on:** December 20, 2022, 04:22:10 pm »

Is your design already so constrained that you are counting LUTs?
The two approaches are quite different in nature: the MUX does not need a clock (though you may add a clocked output register), whereas the shift register does need a clock.

Atomillo · « **Reply #2 on:** December 20, 2022, 04:24:44 pm »

It's not really a design necessity, more like an exercise in trying to find the best way to do it, since it's a problem that seems to occur relatively often.
What puzzled me is that I got the opposite of the result I expected!

rstofer · « **Reply #3 on:** December 20, 2022, 08:15:04 pm »

Can you look at the generated logic? You can with the Xilinx tools but I don't know about others.

I have found that counters are implemented with a register and an adder and all the logic it entails. I had never really stopped to think about how a counter would be implemented. I guess I thought it would look like a TTL counter, a register with steering logic around the outside.

nctnico · « **Reply #4 on:** December 20, 2022, 08:19:16 pm »

The real answer is: it depends.
With a counter you have a few flipflops for the counter and muxes to bring out the bits. With a shift register, you'll have a flipflop for each bit. The shift register version does allow higher clock speeds.

Atomillo · « **Reply #5 on:** December 20, 2022, 08:27:27 pm »

Quote from: rstofer on December 20, 2022, 08:15:04 pm

Can you look at the generated logic? You can with the Xilinx tools but I don't know about others.

I have found that counters are implemented with a register and an adder and all the logic it entails. I had never really stopped to think about how a counter would be implemented. I guess I thought it would look like a TTL counter, a register with steering logic around the outside.

I'm using Vivado. I've never tried to look at the generated logic, but it sounds fun! I'll try tomorrow at work and report the result.
Yes, but the counter is in both versions, since even with the shift register I still need to know when to stop and move on to the next state.

Atomillo · « **Reply #6 on:** December 20, 2022, 08:29:28 pm »

Quote from: nctnico on December 20, 2022, 08:19:16 pm

The real answer is: it depends.
With a counter you have a few flipflops for the counter and muxes to bring out the bits. With a shift register, you'll have a flipflop for each bit. The shift register version does allow higher clock speeds.

So in principle the shift register would just use FFs and not any LUTs?

Btw, does anyone have any good reading on what the tools actually synthesize from VHDL?

I've actually never done this at uni and I'm just now realizing how important it is.

hamster_nz · « **Reply #7 on:** December 20, 2022, 09:14:37 pm »

I've coded the problem four different ways:

First, using a counter and a MUX:

Code: [Select]

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity serial_test_1 is
    Port ( clk      : in  STD_LOGIC;
           parallel : in  STD_LOGIC_VECTOR (7 downto 0);
           serial   : out STD_LOGIC := '0');
end serial_test_1;

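architecture Behavioral of serial_test_1 is
    -- Counter starts at 7 so that 'parallel' is captured on the first clock edge
    signal counter : unsigned(2 downto 0) := (others => '1');
    signal data    : std_logic_vector(7 downto 0) := (others => '0');
begin

    -- 8:1 MUX: the counter selects which captured bit drives the output (combinational)
    serial <= data(to_integer(counter));

    process(clk)
    begin
        if rising_edge(clk) then
            -- Capture a new byte as the counter wraps from 7 to 0
            if counter = 7 then
                data <= parallel;
            end if;
            counter <= counter + 1;
        end if;
    end process;

end Behavioral;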

Second, using a counter and a shift register:

Code: [Select]

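-- Entity declaration not shown; the ports (clk, parallel, serial) match serial_test_1.
architecture Behavioral of serial_test_2 is
    signal counter : unsigned(2 downto 0) := (others => '0');
    signal data_sr : std_logic_vector(7 downto 0);
begin

    -- The output is always the LSB of the shift register
    serial <= data_sr(0);

    process(clk)
    begin
        if rising_edge(clk) then
            if counter = 0 then
                -- Parallel load once every 8 clocks
                data_sr <= parallel;
            else
                -- Shift right, filling with '0' from the top
                data_sr <= '0' & data_sr(7 downto 1);
            end if;
            counter <= counter + 1;
        end if;
    end process;

end Behavioral;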

Third, using two shift registers and recycling the 'hot bit' in the restart shift register:

Code: [Select]

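-- Entity declaration not shown; the ports (clk, parallel, serial) match serial_test_1.
architecture Behavioral of serial_test_3 is
    signal data_sr    : std_logic_vector(7 downto 0);
    -- One-hot ring: a single '1' circulates and marks the reload cycle
    signal restart_sr : std_logic_vector(7 downto 0) := x"01";
begin

    serial <= data_sr(0);

    process(clk)
    begin
        if rising_edge(clk) then
            if restart_sr(0) = '1' then
                data_sr <= parallel;
            else
                data_sr <= '0' & data_sr(7 downto 1);
            end if;
            -- Rotate right: bit 0 is fed back into the top bit, no reset or load
            restart_sr <= restart_sr(0) & restart_sr(7 downto 1);
        end if;
    end process;

end Behavioral;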

Last, using two shift registers, but not recycling the hot bit:

Code: [Select]

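-- Entity declaration not shown; the ports (clk, parallel, serial) match serial_test_1.
architecture Behavioral of serial_test_4 is
    signal data_sr    : std_logic_vector(7 downto 0);
    -- Starts as all '1's so that 'parallel' is loaded on the first clock edge
    signal restart_sr : std_logic_vector(6 downto 0) := (others => '1');
begin

    serial <= data_sr(0);

    process(clk)
    begin
        if rising_edge(clk) then
            if restart_sr(0) = '1' then
                data_sr    <= parallel;
                -- Clear the whole restart register instead of recycling the hot bit
                restart_sr <= (others => '0');
            else
                data_sr    <= "0" & data_sr(7 downto 1);
                -- Shift in a fresh '1'; it reaches bit 0 after 7 shifts
                restart_sr <= "1" & restart_sr(6 downto 1);
            end if;
        end if;
    end process;

end Behavioral;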

All four give identical waveforms in simulation.

Resource usage:

Code: [Select]

+-------------------+---------------+------------+------------+---------+------+-----+
|      Instance     |     Module    | Total LUTs | Logic LUTs | LUTRAMs | SRLs | FFs |
+-------------------+---------------+------------+------------+---------+------+-----+
|   i_serial_test_1 | serial_test_1 |          5 |          5 |       0 |    0 |  11 |
|   i_serial_test_2 | serial_test_2 |          8 |          8 |       0 |    0 |  11 |
|   i_serial_test_3 | serial_test_3 |          5 |          4 |       0 |    1 |   9 |
|   i_serial_test_4 | serial_test_4 |          5 |          5 |       0 |    0 |  15 |
+-------------------+---------------+------------+------------+---------+------+-----+

The first method (using the MUX) has an unregistered output, which is not ideal.

The second uses the most LUTs (but that is still OK, as it is well under one LUT per FF, which is a good rule of thumb).

The third method (recycling a single '1' bit in a shift register rather than using a short counter) allows the restart logic to be optimized into an 'SRL' primitive: an 8-bit shift register implemented in a LUT.

The last method requires that every FF either gets reset or loaded, preventing optimization.

With only 8 data bits the difference isn't that obvious. Here's what resource usage looks like with 16 parallel input bits:

Code: [Select]

+-------------------+---------------+------------+------------+---------+------+-----+--------+--------+------------+
|      Instance     |     Module    | Total LUTs | Logic LUTs | LUTRAMs | SRLs | FFs | RAMB36 | RAMB18 | DSP Blocks |
+-------------------+---------------+------------+------------+---------+------+-----+--------+--------+------------+
|   i_serial_test_1 | serial_test_1 |          7 |          7 |       0 |    0 |  20 |      0 |      0 |          0 |
|   i_serial_test_2 | serial_test_2 |         18 |         18 |       0 |    0 |  20 |      0 |      0 |          0 |
|   i_serial_test_3 | serial_test_3 |          9 |          8 |       0 |    1 |  17 |      0 |      0 |          0 |
|   i_serial_test_4 | serial_test_4 |          9 |          9 |       0 |    0 |  31 |      0 |      0 |          0 |
+-------------------+---------------+------------+------------+---------+------+-----+--------+--------+------------+

If resources were important I would go for "i_serial_test_3", based on it having the best timing (due to registered output) and the lowest resource count. It is the only one that gives the tools any way to optimize the design.
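The 16-bit source was not posted, so the following is only a sketch of how the third method might be parameterised by width. The entity name serial_test_3_generic and the generic WIDTH are hypothetical names chosen for illustration; the logic is the same hot-bit ring as serial_test_3, intended for WIDTH of 2 or more. Resource counts for it would need to be checked in the utilization report.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

-- Hypothetical width-generic variant of serial_test_3 (not from the original post)
entity serial_test_3_generic is
    generic ( WIDTH : positive := 16 );
    port ( clk      : in  std_logic;
           parallel : in  std_logic_vector(WIDTH-1 downto 0);
           serial   : out std_logic );
end serial_test_3_generic;

architecture Behavioral of serial_test_3_generic is
    signal data_sr    : std_logic_vector(WIDTH-1 downto 0);
    -- One-hot ring: only bit 0 is set at power-up
    signal restart_sr : std_logic_vector(WIDTH-1 downto 0) := (0 => '1', others => '0');
begin

    serial <= data_sr(0);

    process(clk)
    begin
        if rising_edge(clk) then
            if restart_sr(0) = '1' then
                -- Reload once every WIDTH clocks
                data_sr <= parallel;
            else
                data_sr <= '0' & data_sr(WIDTH-1 downto 1);
            end if;
            -- Recycle the hot bit: no reset, one input, one output
            restart_sr <= restart_sr(0) & restart_sr(WIDTH-1 downto 1);
        end if;
    end process;

end Behavioral;
```

Setting WIDTH to 8 should reproduce the behaviour of the 8-bit serial_test_3 above.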

SiliconWizard · « **Reply #8 on:** December 20, 2022, 09:33:24 pm »

As you can see (thanks to hamster_nz), it's not as obvious as one may think, and the answer is, it depends.

One common misconception is to think expressing a shift register in HDL would be the most efficient, and this is not necessarily the case.

It will largely depend on the tool and target. What you'll get with some FPGA may be different from what you get with another, and what it would yield in synthesis for silicon would be yet something else.

Atomillo · « **Reply #9 on:** December 21, 2022, 07:37:44 am »

Quote from: hamster_nz on December 20, 2022, 09:14:37 pm

This is amazing, many thanks hamster_nz!! Some questions that are perhaps very basic, but would help me make sure I understood everything correctly:

1.- In the first implementation, could we register the output if we did it inside the process? Or by doing something like:

serial <= data(to_integer(counter));
serial2 <= serial;

With serial being a signal and serial2 being the output, we would now have serial -> FF -> serial2, right?

2.- In the third implementation, if I understood it right, the synthesizer uses an SRL primitive for the restart register because we specifically tell it to recycle the hot bit. Is there any documentation on the available primitives and how to make sure you are actually using them?

Once again many thanks for your help! While I've received formal training on VHDL this is my first "hands-on" job with them.

Atomillo · « **Reply #10 on:** December 21, 2022, 03:16:50 pm »

I just found out why the design with the counter + shift register is so inefficient! Look at how the following VHDL was synthesized.

So just to include the initial condition of the vector it needs a multiplexer with number of inputs = 2 * length of the vector... So that's a lesson about how important it is to use SRL primitives when needed.

EDIT: forgot to add the VHDL

hamster_nz · « **Reply #11 on:** December 21, 2022, 08:03:10 pm »

Quote from: Atomillo on December 21, 2022, 07:37:44 am

1.- In the first implementation, could we register the output if we did it inside the process?

We could, but the extra cycle of latency would put it out of step with the other implementations.
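As a rough sketch of what "inside the process" could look like: moving the MUX assignment into the clocked process makes synthesis place a flip-flop after the MUX. Note that a plain concurrent assignment such as serial2 <= serial; outside a clocked process is only a wire and does not add a register. The entity name serial_test_1_reg is a hypothetical name for this illustration; everything else follows serial_test_1.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

-- Hypothetical registered-output variant of serial_test_1 (not from the original post)
entity serial_test_1_reg is
    port ( clk      : in  std_logic;
           parallel : in  std_logic_vector(7 downto 0);
           serial   : out std_logic := '0' );
end serial_test_1_reg;

architecture Behavioral of serial_test_1_reg is
    signal counter : unsigned(2 downto 0) := (others => '1');
    signal data    : std_logic_vector(7 downto 0) := (others => '0');
begin

    process(clk)
    begin
        if rising_edge(clk) then
            -- Assigned on the clock edge, so the MUX output goes through a flip-flop
            serial <= data(to_integer(counter));
            if counter = 7 then
                data <= parallel;
            end if;
            counter <= counter + 1;
        end if;
    end process;

end Behavioral;
```

The output now appears one clock later than in serial_test_1, which is the extra cycle of latency mentioned above.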

Quote from: Atomillo on December 21, 2022, 07:37:44 am

2.-In the third implementation, if I understood it right, the synthetizer uses a SRL primitive for the restart one because we specifically tell it to recycle the hot bit, is there any documentation on the available primitives and how to make sure you are actually using them?

It's able to use the SRL primitive because the shift register has only one output (a single tap at the end of the chain), one input, and no resets.

Here's an instance of the actual primitive, from the "Language templates" in Vivado:

Quote

SRL16E_inst : SRL16E
generic map (
    INIT => X"0000")
port map (
    Q   => Q,    -- SRL data output
    A0  => A0,   -- Select[0] input
    A1  => A1,   -- Select[1] input
    A2  => A2,   -- Select[2] input
    A3  => A3,   -- Select[3] input
    CE  => CE,   -- Clock enable input
    CLK => CLK,  -- Clock input
    D   => D     -- SRL data input
);

In the case above it might be configured as:

Quote

SRL16E_inst : SRL16E
generic map (
    INIT => X"0080")
port map (
    Q   => feedback,
    A0  => '1',
    A1  => '1',
    A2  => '1',
    A3  => '0',
    CE  => '1',
    CLK => CLK,
    D   => feedback
);

(This is most likely not correct - the 'A' value might be off by one, or the 'INIT' value needs shifting a bit)
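If you would rather let synthesis infer the SRL than instantiate it by hand, one option is to describe the ring behaviourally and add a synthesis attribute as a hint. The sketch below assumes Vivado synthesis, which documents a SHREG_EXTRACT attribute for this purpose; the entity name hot_bit_ring and its restart port are hypothetical names for illustration. Whether an SRL is actually used still depends on the tool version, target and settings, so check the utilization report.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

-- Hypothetical stand-alone hot-bit ring (not from the original post)
entity hot_bit_ring is
    port ( clk     : in  std_logic;
           restart : out std_logic );
end hot_bit_ring;

architecture Behavioral of hot_bit_ring is
    signal restart_sr : std_logic_vector(7 downto 0) := x"01";

    -- Hint to Vivado synthesis that this register may be mapped to an SRL
    attribute shreg_extract : string;
    attribute shreg_extract of restart_sr : signal is "yes";
begin

    -- Single tap at the end of the chain
    restart <= restart_sr(0);

    process(clk)
    begin
        if rising_edge(clk) then
            -- No reset and no parallel load, only a rotate
            restart_sr <= restart_sr(0) & restart_sr(7 downto 1);
        end if;
    end process;

end Behavioral;
```

Setting the attribute to "no" instead should keep the ring in ordinary flip-flops, which can be handy when comparing the two results.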

IMO the best thing to do is read your FPGA's User Guide, and not try to remember the exact details of each primitive, but get the gist of what the logic blocks can do and then look up the finer details when needed. The User Guides are usually pretty readable and often have very useful tidbits, like the fact that the internal registers of DSP slices do not have resets.

SiliconWizard · « **Reply #12 on:** December 21, 2022, 08:25:06 pm »

Yep, read the docs, that will give you definite hints regarding how to write HDL in order to make reasonably sure synthesis will infer given primitives. For memory blocks, they usually give code examples.

Now of course, while the question is interesting, it should be put into context, and we don't know what the OP's context is.

Unless you're dealing with hundreds of shift registers and/or are targeting very high clock rates (in which case you probably should instantiate relevant primitives anyway), the minute details of the implementation will not matter much.

Atomillo · « **Reply #13 on:** December 21, 2022, 10:36:09 pm »

Hamster_nz and SiliconWizard many thanks for your help! I will definitely take a look at the FPGA User Guide. It seems quite a helpful doc.

You are both right, there's no need to optimize this, it was just an exercise to learn more. And I'm happy to say I've accomplished that!

NorthGuy · « **Reply #14 on:** December 31, 2022, 01:08:23 am »

Xilinx also has a built-in entity called OSERDES which is a shift register with parallel load, but it can be used only for driving external pins.

