I remember that in my implementation, immediate decoding was second on the critical path, after the pipeline hazard handling. I have not looked into the details, but I'd guess most of it comes from decoding the instruction to pick the control signals, not from the muxing itself.
If you take the approach of early microprocessors such as the 6502 and Z80, which are designed to assume they are fed valid code and don't try to diagnose invalid instructions, then you can minimize instruction decoding a lot. SeRV falls into that category, and I'd suggest doing the same if you're designing an embedded processor for minimal size that will only ever run a limited set of predetermined code.
In that case, RV32I instruction formats are entirely determined by bits 6:2, i.e. 5 bits. Bits 1:0 are always 11. Funct3 (bits 14:12) determines a specific function within an opcode class but not the instruction format. Funct7 (bits 31:25) is all 0s when it isn't part of a literal, except that bit 30 distinguishes sub from add (it's effectively the carry in, and indicates whether to invert rs2) and logical right shifts from arithmetic right shifts (i.e. it determines whether the high bits of the result get 0 or copies of the original bit 31). So bits 31:25 play no part in determining the instruction format or the format of any literal. The rd, rs1, and rs2 fields similarly play no part in determining the instruction format or literal decoding (except as being part of a literal).
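To make that concrete, here's a rough C illustration (purely software, not HDL, and the names are mine) of how the literal format falls directly out of bits 6:2, and where each literal bit then comes from:

    #include <stdint.h>

    /* Rough C illustration (not HDL): the RV32I literal format is fully
       determined by instruction bits 6:2; bits 1:0 are always 11. */
    enum imm_fmt { FMT_NONE, FMT_I, FMT_S, FMT_B, FMT_U, FMT_J };

    static enum imm_fmt imm_format(uint32_t insn)
    {
        switch ((insn >> 2) & 0x1f) {      /* bits 6:2 */
        case 0x00:                         /* LOAD   */
        case 0x04:                         /* OP-IMM */
        case 0x19: return FMT_I;           /* JALR   */
        case 0x08: return FMT_S;           /* STORE  */
        case 0x18: return FMT_B;           /* BRANCH */
        case 0x05:                         /* AUIPC  */
        case 0x0d: return FMT_U;           /* LUI    */
        case 0x1b: return FMT_J;           /* JAL    */
        default:   return FMT_NONE;        /* OP (R-type) and anything else
                                              we don't decode a literal for */
        }
    }

    /* Which instruction bits land in which literal bits, per format.
       Assumes arithmetic >> on signed ints for the sign extension. */
    static int32_t decode_imm(uint32_t insn)
    {
        switch (imm_format(insn)) {
        case FMT_I: return (int32_t)insn >> 20;
        case FMT_S: return (((int32_t)insn >> 20) & ~0x1f)
                         | ((insn >> 7) & 0x1f);
        case FMT_B: return (((int32_t)insn >> 19) & ~0xfff)
                         | ((insn << 4) & 0x800)
                         | ((insn >> 20) & 0x7e0)
                         | ((insn >> 7) & 0x1e);
        case FMT_U: return (int32_t)(insn & 0xfffff000);
        case FMT_J: return (((int32_t)insn >> 11) & ~0xfffff)
                         | (insn & 0xff000)
                         | ((insn >> 9) & 0x800)
                         | ((insn >> 20) & 0x7fe);
        default:    return 0;
        }
    }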
So any signal that depends only on the instruction format can be generated from the original instruction bits (bits 6:2) using a single LUT.
E.g. suppose you want a signal for "should bit 11 of the instruction's literal come from instruction bit 31, bit 20, bit 7, or always be 0?" That is a single Xilinx LUT outputting two select signals from bits 6:2 of the instruction as inputs. Those two signals can then feed into a second LUT that also takes bits 31, 20, and 7 as inputs and outputs bit 11 of the 32-bit constant.
So that's two LUTs and two LUT delays to get bit 11 of the constant/offset.
And that's the WORST case in decoding RV32I literals.
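In rough C terms, that two-level structure looks something like the sketch below (a software mirror only, not HDL; the function and enum names are mine):

    #include <stdint.h>

    /* First "LUT": 5 inputs (bits 6:2) -> a 2-bit select. */
    enum src11 { SRC_ZERO, SRC_BIT31, SRC_BIT20, SRC_BIT7 };

    static enum src11 sel_bit11(uint32_t insn)
    {
        switch ((insn >> 2) & 0x1f) {
        case 0x18: return SRC_BIT7;   /* BRANCH (B-type)      */
        case 0x1b: return SRC_BIT20;  /* JAL (J-type)         */
        case 0x05:
        case 0x0d: return SRC_ZERO;   /* AUIPC, LUI (U-type)  */
        default:   return SRC_BIT31;  /* I/S-type: sign bit   */
        }
    }

    /* Second "LUT": 5 inputs (2-bit select + insn bits 31, 20, 7),
       1 output: literal bit 11. */
    static uint32_t imm_bit11(uint32_t insn)
    {
        switch (sel_bit11(insn)) {
        case SRC_BIT31: return (insn >> 31) & 1;
        case SRC_BIT20: return (insn >> 20) & 1;
        case SRC_BIT7:  return (insn >> 7)  & 1;
        default:        return 0;
        }
    }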
Bits 10:5 of the literal are either constant 0 or else come from bits 30:25 of the instruction. So to generate those 6 bits, each bit needs a single LUT that inputs bits 6:2 plus the corresponding bit from 30:25.
That's assuming no simplification at all from some of bits 6:2 of the instruction perhaps being "don't care" for some of the literal fields. The fields whose bits come from either two places, or from two places or constant 0, could have their bits generated directly from the instruction bits by a single LUT per bit *if* it turns out you only need 4 (or fewer) of bits 6:2, since that leaves room in a 6-input LUT for the two data bits.
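Bits 10:5 are actually a case where that simplification happens. The "zero the field" condition only fires for LUI (01101 in bits 6:2) and AUIPC (00101), which differ only in bit 5, so the select genuinely needs just 4 of the 5 bits. A rough C mirror of one such per-bit LUT (not HDL, names mine, and assuming invalid opcodes are don't-cares as above):

    #include <stdint.h>
    #include <stdbool.h>

    /* "Zero literal bits 10:5" is true only for LUI/AUIPC; bit 5 of
       the opcode is a don't care, so only 4 opcode bits are needed. */
    static bool imm_10_5_is_zero(uint32_t insn)
    {
        uint32_t op = (insn >> 2) & 0x1f;    /* bits 6:2 */
        return (op & 0x17) == 0x05;          /* bits 6,4,3,2 == 0,1,0,1 */
    }

    /* One "LUT" per bit i in 0..5: inputs are those 4 opcode bits plus
       insn[25+i]; output is literal bit 5+i. */
    static uint32_t imm_bit_10_5(uint32_t insn, int i)
    {
        return imm_10_5_is_zero(insn) ? 0 : (insn >> (25 + i)) & 1;
    }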
That's for minimum-delay decoding. Minimum-LUT-count decoding would say to use 1 LUT for, say, decoded literal bits 31:20 that outputs an "instruction bits 31:20, or replicated bit 31" select signal, and then 6 dual-output LUTs to generate output bit pairs 31:30, 29:28, 27:26, 25:24, 23:22, and 21:20. (Bit 31 can of course be done with 0 LUTs, but whatever.) So 7 LUTs for 12 bits, with a 2-LUT delay. Half of the first LUT is available to generate some other useful signal.
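A C mirror of that sharing (again not HDL; the packing into dual-output LUTs is only implied by the comments, and the names are mine):

    #include <stdint.h>
    #include <stdbool.h>

    /* Shared select LUT (from bits 6:2): take insn[31:20] directly only
       for U-type (LUI, AUIPC).  J-type puts insn[31] into literal bit 20
       anyway, so the sign-replication branch already covers it. */
    static bool take_upper_bits(uint32_t insn)
    {
        uint32_t op = (insn >> 2) & 0x1f;
        return op == 0x0d || op == 0x05;     /* LUI, AUIPC */
    }

    /* One dual-output LUT per bit pair n+1:n (n = 20, 22, ..., 30);
       inputs: the shared select, insn[31], insn[n], insn[n+1]. */
    static uint32_t imm_bit_hi(uint32_t insn, int n)  /* literal bit n, 20..31 */
    {
        return take_upper_bits(insn) ? (insn >> n) & 1 : (insn >> 31) & 1;
    }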
Actually ... regarding invalid opcode detection: that calculation can be done in parallel with the minimised generation of the possible literal (and other things), and its delay doesn't really matter as it only has to be valid in time to prevent register writeback or a memory write. So it's purely a size thing.
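If you do keep the check, the idea in rough C (not HDL, and only the major-opcode part; a real check would also look at funct3/funct7 and the shift-immediate encodings):

    #include <stdint.h>
    #include <stdbool.h>

    /* Partial legality check, computed in parallel with literal decode;
       its result only needs to arrive in time to squash the register
       writeback / memory write strobes. */
    static bool major_opcode_known(uint32_t insn)
    {
        if ((insn & 3) != 3)                 /* bits 1:0 must be 11 in RV32I */
            return false;
        switch ((insn >> 2) & 0x1f) {        /* bits 6:2 */
        case 0x00: case 0x03: case 0x04: case 0x05:
        case 0x08: case 0x0c: case 0x0d:
        case 0x18: case 0x19: case 0x1b: case 0x1c:
            return true;                     /* LOAD, MISC-MEM, OP-IMM, AUIPC,
                                                STORE, OP, LUI, BRANCH, JALR,
                                                JAL, SYSTEM */
        default:
            return false;
        }
    }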