It all depends on your needs. For example, imagine a 256x8 write-only register file: 8 bits of address, 8 bits of data, a 1-bit write enable, and 256 8-bit outputs.
If all of the outputs are going to be used directly, and you want them to update instantly, you need 2048 flip-flops (256 x 8), lots of decoding logic, and very high fan-out on the data signals. Basically the mess you already have.
The only real way out is to accept more latency, which gives you room to manage things better.
Let's take this idea to the extreme. If you only need one update every 20 cycles you can do the following:
When the data comes in, write a 1, the address and the data into a 17-bit shift register.
Shift those bits out, starting with the 1, onto a single wire.
Where you want your register, receive those bits into an 8-bit shift register, and add an FSM triggered by the leading 1. This FSM inspects the address, and if it matches a constant address it captures the next 8 bits and then assigns them to the register's outputs.
You can then chain these stages together, so your write-only register file can be implemented in very small parts all over the FPGA die, with very low fan-out, and so run at very high speed. You can also add pipelining as required to meet timing.
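To make the scheme concrete, here is a rough bit-level model in Python (not HDL) of the serial frame and a chain of register nodes. Everything named here (make_frame, RegisterNode, send, write) is made up for illustration, and it assumes MSB-first ordering, one pipeline flop per node, and a 20-cycle slot per write. Treat it as a sketch of the idea rather than a reference design.

```python
def make_frame(addr, data):
    """17 bits on the wire: start bit, 8 address bits, 8 data bits (MSB first, assumed)."""
    bits = [1]
    bits += [(addr >> i) & 1 for i in range(7, -1, -1)]
    bits += [(data >> i) & 1 for i in range(7, -1, -1)]
    return bits


class RegisterNode:
    """One remote register: an 8-bit shift register plus a small FSM."""

    def __init__(self, my_addr):
        self.my_addr = my_addr  # constant address this node answers to
        self.value = 0          # the register's output
        self.shift = 0          # 8-bit receive shift register
        self.state = "IDLE"     # IDLE -> ADDR -> DATA (match) or SKIP (no match)
        self.count = 0          # bits received in the current field
        self.pipe = 0           # pipeline flop driving the next node in the chain

    def clock(self, bit_in):
        """Advance one clock cycle; return the bit forwarded to the next node."""
        bit_out = self.pipe
        self.pipe = bit_in
        if self.state == "IDLE":
            if bit_in == 1:  # leading 1 triggers the FSM
                self.state, self.count = "ADDR", 0
            return bit_out
        self.shift = ((self.shift << 1) | bit_in) & 0xFF
        self.count += 1
        if self.count == 8:
            self.count = 0
            if self.state == "ADDR":
                self.state = "DATA" if self.shift == self.my_addr else "SKIP"
            else:
                if self.state == "DATA":
                    self.value = self.shift  # update the register output
                self.state = "IDLE"
        return bit_out


def send(nodes, bits):
    """Clock the bits through the chain, one bit per cycle."""
    for bit in bits:
        for node in nodes:
            bit = node.clock(bit)


def write(nodes, addr, data, slot=20):
    """Send one frame, padded with idle zeros to fill a 20-cycle slot."""
    frame = make_frame(addr, data)
    send(nodes, frame + [0] * (slot - len(frame)))


nodes = [RegisterNode(a) for a in (0x10, 0x20, 0x30)]
write(nodes, 0x20, 0xA5)
write(nodes, 0x30, 0x81)
send(nodes, [0] * len(nodes))  # flush the per-node pipeline flops
print([hex(n.value) for n in nodes])  # ['0x0', '0xa5', '0x81']
```

Each node only ever sees one input bit and drives one output bit, which is where the low fan-out comes from. The price is the extra shift register and FSM state per node, plus one cycle of delay for every node the frame passes through.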
Basically it becomes a simple network on a chip, the cost being maybe 2x as many registers, and latency between the write and the register changing.
Taking this even further, the routing of this network can be implemented as an overlay, where you chain these blocks together after the main place and route, using whatever leftover resources you have to hand.
Oh, the other option for things like a 3x3 matrix used in colour conversion: rather than having 9 separately addressed registers, have a 9-deep shift register and just have the CPU stream the values in. This gets rid of a lot of the decode logic.
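A quick Python model of that shift-register idea, again just a sketch: the class name, the shift direction and the row-major ordering are my assumptions, and the coefficients are placeholder numbers rather than a real colour matrix.

```python
class CoefficientShiftRegister:
    """9-deep shift register holding a 3x3 matrix; one write port, no address decode."""

    def __init__(self, depth=9):
        self.taps = [0] * depth

    def push(self, value):
        """One CPU write: every tap moves along by one and the new value enters at the end."""
        self.taps = self.taps[1:] + [value]

    def as_matrix(self):
        """View the nine taps as a 3x3 matrix in row-major order (assumed ordering)."""
        return [self.taps[0:3], self.taps[3:6], self.taps[6:9]]


csr = CoefficientShiftRegister()
for coeff in [1, 2, 3, 4, 5, 6, 7, 8, 9]:  # placeholder values streamed by the CPU
    csr.push(coeff)
print(csr.as_matrix())  # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

One thing to keep in mind is that the matrix holds a mix of old and new values while the nine writes are in flight, so you would want to load it while the output is not being used.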