Don't fret, I am deliberately pushing you to take big steps, and you are doing fine for such a complex first project.
You are doing great.
The bulk of your code is perfect, though there are 2 little mistakes (I hope I'm not making one myself...):
----------------------------------------
into data_pipe [5*8+7:5*8] <= pc_ena 0 pixel 0
into data_pipe [4*8+7:4*8] <= pc_ena 1 pixel 0
into data_pipe [3*8+7:3*8] <= pc_ena 2 pixel 0
into data_pipe [2*8+7:2*8] <= pc_ena 3 pixel 0
into data_pipe [1*8+7:1*8] <= pc_ena 4 pixel 0
into data_pipe [0*8+7:0*8] <= pc_ena 0 pixel 1
inside ram 2 <= pc_ena 1 pixel 1
inside ram 1 <= pc_ena 2 pixel 1
addr to addr_mux <= pc_ena 3 pixel 1
pc_ena pos 0 <= pc_ena 4 pixel 1
------------------------------------------------------
You need to push all the pixels on the right up by 1 more, since 'pc_ena pos 0' is the moment the 'IF (pc_ena pos == 0)' statement executes, not a register. At the beginning of 'pc_ena pos == 0', the output of 'addr to addr_mux' should be holding the last 'pc_ena 4 pixel 1', and 'pc_ena pos == 0' should be ready to take the next pixel, 'pc_ena 0 pixel 2'. This also adds 1 whole pixel to the pixel pipe delay.
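To make the shift concrete, this is how I read the table once every entry in the right-hand column moves up by one row. The top row 'data_pipe [6*8+7:6*8]' is my own extension of your table, needed so that 'pc_ena 0 pixel 0' still has a slot:
----------------------------------------
into data_pipe [6*8+7:6*8] <= pc_ena 0 pixel 0
into data_pipe [5*8+7:5*8] <= pc_ena 1 pixel 0
into data_pipe [4*8+7:4*8] <= pc_ena 2 pixel 0
into data_pipe [3*8+7:3*8] <= pc_ena 3 pixel 0
into data_pipe [2*8+7:2*8] <= pc_ena 4 pixel 0
into data_pipe [1*8+7:1*8] <= pc_ena 0 pixel 1
into data_pipe [0*8+7:0*8] <= pc_ena 1 pixel 1
inside ram 2 <= pc_ena 2 pixel 1
inside ram 1 <= pc_ena 3 pixel 1
addr to addr_mux <= pc_ena 4 pixel 1
pc_ena pos 0 <= pc_ena 0 pixel 2
------------------------------------------------------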
Next: (I corrected your offset figures)
localparam MUX_0_POS = 6;
localparam MUX_1_POS = 5;
localparam MUX_2_POS = 4;
localparam MUX_3_POS = 3;
localparam MUX_4_POS = 2;
This is functional; however, there is a smarter way to fill these out, and it includes my earlier hint.
Here is hint #1:
------------------------------------
localparam CLK_CYCLES_MUX = 1; // Adjust this parameter to the number of 'clk' cycles it takes to select 1 of 5 muxed inputs
localparam CLK_CYCLES_RAM = 2; // Adjust this figure to the number of clock cycles the DP_ram takes to return valid data after the read address goes in.
localparam CLK_CYCLES_PCENA = 5; // Adjust this figure to the number of clock cycles per pixel.
-------------------------------------
Currently, in your code, your first mux takes 1 clock and the INTEL altsyncram megafunction takes 2 clocks:
----------------------------
into data_pipe[0*8+7:0*8]
inside ram 2          <- 2 clock cycles here for INTEL's altsyncram function.
inside ram 1
addr to addr-mux      <- 1 clock cycle here for your current MUX code.
PC_ENA pos0
-------------------------------
Now, also knowing that PC_ENA has 5 positions per pixel, and using those 3 reference 'CLK_CYCLES_xxxx' parameters, which describe the number of clocks at each step in your delay pipe, write a formula which fills in the numbers for all 5 'localparam MUX_#_POS' entries.
Next, test your formula against a slower addr-mux algorithm which takes 2 clocks instead of 1 clock.
Step 1: For example, set 'localparam CLK_CYCLES_MUX = 2'.
Step 2: Change the table on page 1 so addr-mux has 2 clock steps:
-----------------------------
into data_pipe[2*8+7:2*8]
into data_pipe[1*8+7:1*8]
into data_pipe[0*8+7:0*8]
inside ram 2
inside ram 1
addr to addr-mux step #2
addr to addr-mux step #1
PC_ENA pos0
-------------------------------------------
Now verify that your formula generating the 5 'localparam MUX_#_POS' values gives valid pipe positions.
Do this a few more times with 'CLK_CYCLES_MUX = 3', 'CLK_CYCLES_MUX = 4', and 'CLK_CYCLES_RAM = 3'.
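If you get stuck, here is a rough sketch of one formula that passes those checks. Treat it as a possibility, not the official answer. It assumes all 5 bytes are read out of data_pipe together at 'pc_ena pos 0', and that each row of the delay table is exactly 1 clock. 'PIXELS_BACK', 'PIPE_BASE' and the module name 'mux_pos_sketch' are names I made up for the illustration; they are not in your code.

```verilog
// Sketch only: PIXELS_BACK, PIPE_BASE and mux_pos_sketch are hypothetical names.
module mux_pos_sketch;

    localparam CLK_CYCLES_MUX   = 1; // clocks through the address mux
    localparam CLK_CYCLES_RAM   = 2; // clocks through the altsyncram read
    localparam CLK_CYCLES_PCENA = 5; // clocks (pc_ena positions) per pixel

    // Assumption: how many pixels back we must look, at pc_ena pos 0, to find
    // a pixel whose 5 bytes have ALL arrived in data_pipe.
    // Integer ceiling of (PCENA + MUX + RAM) / PCENA.
    localparam PIXELS_BACK = (CLK_CYCLES_MUX + CLK_CYCLES_RAM + 2*CLK_CYCLES_PCENA - 1)
                             / CLK_CYCLES_PCENA;

    // data_pipe byte position of the 'pc_ena 0' byte of that pixel.
    localparam PIPE_BASE = PIXELS_BACK*CLK_CYCLES_PCENA
                           - CLK_CYCLES_MUX - CLK_CYCLES_RAM - 1;

    // Each later pc_ena position entered the pipe 1 clock later,
    // so it sits 1 byte position lower.
    localparam MUX_0_POS = PIPE_BASE - 0;
    localparam MUX_1_POS = PIPE_BASE - 1;
    localparam MUX_2_POS = PIPE_BASE - 2;
    localparam MUX_3_POS = PIPE_BASE - 3;
    localparam MUX_4_POS = PIPE_BASE - 4;

    // Simulation-only check: print the 5 positions.
    initial begin
        $display("MUX_0..4_POS = %0d %0d %0d %0d %0d",
                 MUX_0_POS, MUX_1_POS, MUX_2_POS, MUX_3_POS, MUX_4_POS);
    end

endmodule
```

With the values shown (MUX = 1, RAM = 2, PCENA = 5) this works out to 6, 5, 4, 3, 2, the same as the corrected offsets. The ceiling in 'PIXELS_BACK' is there for the slower cases: with a fixed 2 pixels back and 'CLK_CYCLES_MUX = 4', MUX_4_POS would come out as -1, which is not a valid pipe position.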
Everything else you wrote looks correct.
Also, you should realize that properly unmuxing the data stream coming out of the ram could never be done any other way without pure luck. And even with luck, if you ever had to increase the number of clock steps in, for example, the addr-mux stage or the FPGA's altsyncram dual-port ram function, everything would fall apart, and you would need luck all over again to get the right outputs lined up in parallel.