On the read clock, each time a full BL8 128-bit word has been latched, toggle a DFF.
Still on the read clock, delay that toggle DFF's output by 1 clock.
Remember, the read clock is 400MHz and a BL8 burst is 8 beats at 2 beats per clock, so during an uninterrupted burst, that toggle DFF will invert every 4 clocks.
On your 400MHz ck_0 clock, latch the 128-bit word when that 1-delayed-toggle has flipped.
Also, on the ck_0 clock, DFF latch that 1-delayed-toggle as well.
Now, feed all that ck_0 bulk into your slower clock domain and, on the slower domain's clock, capture the data when it sees that the ck_0 toggle output has flipped.
Also generate a BL8_read_data_ready when that capture has been done.
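
If it helps to see the whole chain in one place, here is a rough behavioral sketch in Python. It is not HDL and it does not model metastability or setup/hold timing; it only checks the ordering of the toggle handshake. The function name, signal names, the clock phases, and the 200MHz slow clock (5000 ps period) are all assumptions I made up for the illustration, and each 128-bit word is stood in for by a plain integer.

```python
def simulate(num_words, rd_period=2500, ck0_phase=600,
             slow_period=5000, slow_phase=1100):
    """Behavioral model of the toggle handshake. Times are in ps (2500 ps = 400MHz).

    Phases and the slow clock period are hypothetical values chosen so that
    no two clock edges coincide.
    """
    end = (num_words * 4 + 8) * rd_period
    events = [(t, "rd") for t in range(0, end, rd_period)]
    events += [(t, "ck0") for t in range(ck0_phase, end, rd_period)]
    events += [(t, "slow") for t in range(slow_phase, end, slow_period)]
    events.sort()

    rd_count = 0        # read-clock cycle counter
    words_sent = 0
    rd_data = None      # latched BL8 word (read clock domain)
    rd_toggle = 0       # flips once per latched word
    rd_toggle_d = 0     # rd_toggle delayed by 1 read clock
    ck0_data = None     # word re-latched on ck_0
    ck0_toggle = 0      # 1-delayed-toggle registered on ck_0
    slow_toggle = 0     # ck_0 toggle registered on the slow clock
    captured = []       # words seen with BL8_read_data_ready asserted

    for _, domain in events:
        if domain == "rd":
            rd_toggle_d = rd_toggle          # uses the pre-edge value
            if words_sent < num_words and rd_count % 4 == 0:
                rd_data = words_sent         # a new word every 4 clocks
                rd_toggle ^= 1
                words_sent += 1
            rd_count += 1
        elif domain == "ck0":
            if rd_toggle_d != ck0_toggle:    # 1-delayed-toggle has flipped
                ck0_data = rd_data
            ck0_toggle = rd_toggle_d
        else:  # slow clock domain
            ready = ck0_toggle != slow_toggle    # BL8_read_data_ready
            if ready:
                captured.append(ck0_data)
            slow_toggle = ck0_toggle
    return captured


print(simulate(5))  # [0, 1, 2, 3, 4]
```

Each word shows up exactly once on the slow side. Note that in this model the slow clock still has to sample at least once per toggle flip, so during a continuous burst its period cannot be longer than 4 read clocks.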