I've coded the problem four different ways:
First, using a counter and a MUX:
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity serial_test_1 is
    Port ( clk      : in  STD_LOGIC;
           parallel : in  STD_LOGIC_VECTOR (7 downto 0);
           serial   : out STD_LOGIC := '0');
end serial_test_1;

architecture Behavioral of serial_test_1 is
    signal counter : unsigned(2 downto 0) := (others => '1');
    signal data    : std_logic_vector(7 downto 0) := (others => '0');
begin
    serial <= data(to_integer(counter));

    process(clk)
    begin
        if rising_edge(clk) then
            if counter = 7 then
                data <= parallel;
            end if;
            counter <= counter + 1;
        end if;
    end process;
end Behavioral;
Second, using a counter and a shift register:
architecture Behavioral of serial_test_2 is
    signal counter : unsigned(2 downto 0) := (others => '0');
    signal data_sr : std_logic_vector(7 downto 0);
begin
    serial <= data_sr(0);

    process(clk)
    begin
        if rising_edge(clk) then
            if counter = 0 then
                data_sr <= parallel;
            else
                data_sr <= '0' & data_sr(7 downto 1);
            end if;
            counter <= counter + 1;
        end if;
    end process;
end Behavioral;
Third, using two shift registers and recycling the 'hot bit' in the restart shift register:
architecture Behavioral of serial_test_3 is
    signal data_sr    : std_logic_vector(7 downto 0);
    signal restart_sr : std_logic_vector(7 downto 0) := x"01";
begin
    serial <= data_sr(0);

    process(clk)
    begin
        if rising_edge(clk) then
            if restart_sr(0) = '1' then
                data_sr <= parallel;
            else
                data_sr <= '0' & data_sr(7 downto 1);
            end if;
            restart_sr <= restart_sr(0) & restart_sr(7 downto 1);
        end if;
    end process;
end Behavioral;
Last, using two shift registers, but not recycling the hot bit:
architecture Behavioral of serial_test_4 is
    signal data_sr    : std_logic_vector(7 downto 0);
    signal restart_sr : std_logic_vector(6 downto 0) := (others => '1');
begin
    serial <= data_sr(0);

    process(clk)
    begin
        if rising_edge(clk) then
            if restart_sr(0) = '1' then
                data_sr    <= parallel;
                restart_sr <= (others => '0');
            else
                data_sr    <= "0" & data_sr(7 downto 1);
                restart_sr <= "1" & restart_sr(6 downto 1);
            end if;
        end if;
    end process;
end Behavioral;
All four give identical waveforms in simulation.
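A comparison like this can be automated with a small self-checking testbench. The following is only a sketch: the testbench name, the stimulus pattern and the clock period are illustrative choices, and it assumes that serial_test_2 to serial_test_4 have the same entity ports as serial_test_1 (only their architectures are shown above). The check starts after the first rising edge, because before that the outputs only reflect each design's initial signal values.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

-- Hypothetical testbench name; not part of the original designs.
entity tb_serial_compare is
end tb_serial_compare;

architecture sim of tb_serial_compare is
    signal clk      : std_logic := '0';
    signal parallel : std_logic_vector(7 downto 0) := x"A5";
    signal s1, s2, s3, s4 : std_logic;
    signal done     : boolean := false;
begin
    -- Free-running clock (10 ns period, arbitrary) that stops once checking is done.
    clock : process
    begin
        while not done loop
            clk <= '0'; wait for 5 ns;
            clk <= '1'; wait for 5 ns;
        end loop;
        wait;
    end process;

    -- Assumption: all four entities share the port list of serial_test_1.
    u1 : entity work.serial_test_1 port map (clk => clk, parallel => parallel, serial => s1);
    u2 : entity work.serial_test_2 port map (clk => clk, parallel => parallel, serial => s2);
    u3 : entity work.serial_test_3 port map (clk => clk, parallel => parallel, serial => s3);
    u4 : entity work.serial_test_4 port map (clk => clk, parallel => parallel, serial => s4);

    check : process
    begin
        for i in 0 to 63 loop
            wait until rising_edge(clk);
            wait for 1 ns;  -- sample after the outputs have settled
            assert s1 = s2 and s2 = s3 and s3 = s4
                report "serial outputs differ at cycle " & integer'image(i)
                severity error;
            -- Change the input every cycle so the capture timing matters.
            parallel <= std_logic_vector(unsigned(parallel) + 37);
        end loop;
        report "comparison finished";
        done <= true;
        wait;
    end process;
end sim;
```

Changing the parallel input on every cycle is deliberate: with a constant input, a design that captured the word one cycle early or late would still appear to match.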
Resource usage:
+-------------------+---------------+------------+------------+---------+------+-----+
| Instance          | Module        | Total LUTs | Logic LUTs | LUTRAMs | SRLs | FFs |
+-------------------+---------------+------------+------------+---------+------+-----+
| i_serial_test_1   | serial_test_1 |          5 |          5 |       0 |    0 |  11 |
| i_serial_test_2   | serial_test_2 |          8 |          8 |       0 |    0 |  11 |
| i_serial_test_3   | serial_test_3 |          5 |          4 |       0 |    1 |   9 |
| i_serial_test_4   | serial_test_4 |          5 |          5 |       0 |    0 |  15 |
+-------------------+---------------+------------+------------+---------+------+-----+
The first method (using the MUX) has an unregistered output, which is not ideal.
The second uses the most LUTs, but that is still OK: it stays below one LUT per FF, which is a good rule of thumb.
The third method (recycling a single '1' bit in a shift register rather than using a short counter) allows the restart shift register to be optimized into an 'SRL' primitive, an 8-bit shift register implemented in a LUT.
The last method requires every FF to be either reset or loaded, which prevents this optimization: an SRL primitive has no reset or parallel-load input.
With only 8 data bits the difference isn't that obvious. Here's what resource usage looks like with 16 parallel input bits:
+-------------------+---------------+------------+------------+---------+------+-----+--------+--------+------------+
| Instance          | Module        | Total LUTs | Logic LUTs | LUTRAMs | SRLs | FFs | RAMB36 | RAMB18 | DSP Blocks |
+-------------------+---------------+------------+------------+---------+------+-----+--------+--------+------------+
| i_serial_test_1   | serial_test_1 |          7 |          7 |       0 |    0 |  20 |      0 |      0 |          0 |
| i_serial_test_2   | serial_test_2 |         18 |         18 |       0 |    0 |  20 |      0 |      0 |          0 |
| i_serial_test_3   | serial_test_3 |          9 |          8 |       0 |    1 |  17 |      0 |      0 |          0 |
| i_serial_test_4   | serial_test_4 |          9 |          9 |       0 |    0 |  31 |      0 |      0 |          0 |
+-------------------+---------------+------------+------------+---------+------+-----+--------+--------+------------+
If resources were important I would go for "i_serial_test_3", based on it having the best timing (due to registered output) and the lowest resource count. It is the only one that gives the tools any way to optimize the design.
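For anyone who wants to try other widths, the third method can be written with the width as a generic. This is a minimal sketch, not the code that produced the numbers above: the entity name serial_test_3_generic and the generic WIDTH are illustrative names, and it has not been run through synthesis here, so whether the tools still infer an SRL for a given width is something to check in your own utilization report.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

-- Hypothetical, width-parameterised variant of serial_test_3.
entity serial_test_3_generic is
    generic ( WIDTH : positive := 8 );  -- illustrative generic, number of parallel bits
    port ( clk      : in  std_logic;
           parallel : in  std_logic_vector(WIDTH-1 downto 0);
           serial   : out std_logic );
end serial_test_3_generic;

architecture Behavioral of serial_test_3_generic is
    signal data_sr    : std_logic_vector(WIDTH-1 downto 0);
    -- One-hot ring: only bit 0 is set at start-up, so the first clock edge loads.
    -- No reset is applied to this register, as in the third method above.
    signal restart_sr : std_logic_vector(WIDTH-1 downto 0) := (0 => '1', others => '0');
begin
    serial <= data_sr(0);  -- driven directly from a flip-flop

    process(clk)
    begin
        if rising_edge(clk) then
            if restart_sr(0) = '1' then
                data_sr <= parallel;                         -- capture a new word
            else
                data_sr <= '0' & data_sr(WIDTH-1 downto 1);  -- shift out, LSB first
            end if;
            -- Recycle the hot bit: it returns to bit 0 every WIDTH clocks.
            restart_sr <= restart_sr(0) & restart_sr(WIDTH-1 downto 1);
        end if;
    end process;
end Behavioral;
```

Instantiating it with `generic map (WIDTH => 16)` should give the 16-bit behaviour; the key point is the same as before, that the restart register is never reset or loaded in parallel.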