I'm thinking you should split the command structure into 2 byte, 3 byte, 4 byte and 5 byte commands.
Commands 0-127 would send 2 byte controls like now.
Commands 128-191 would be 3 byte commands which directly change a 16 bit integer setting, directly feed any control, or set the x/y[ # ] instead of piping everything through the XY regs. The XY regs, with room for up to 64 of them, would now be 16 bits each instead of 12 bits. (IE Z80 sends 3 bytes total.)
Commands 192-223 would send 24 bit integer commands. (IE Z80 sends 4 bytes total.)
Commands 224-255 would send 32 bit integer commands. (IE Z80 sends 5 bytes total.)
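To make the idea concrete, here is a rough Python model of how the receiving side could work out a command's length from its first byte. This is only a sketch: the function names are made up, and the boundaries for the 24 bit and 32 bit blocks (192-223 and 224-255) are my assumption, chosen so each block gets 32 commands.

```python
def command_length(opcode):
    """Total bytes the Z80 sends for one command, opcode byte included."""
    if not 0 <= opcode <= 255:
        raise ValueError("opcode must fit in one byte")
    if opcode < 128:
        return 2   # 0-127: 2 byte controls, like now
    if opcode < 192:
        return 3   # 128-191: opcode + 16 bit value
    if opcode < 224:
        return 4   # assumed 192-223: opcode + 24 bit value
    return 5       # assumed 224-255: opcode + 32 bit value


def split_commands(stream):
    """Cut a raw byte stream into individual commands."""
    commands = []
    pos = 0
    while pos < len(stream):
        length = command_length(stream[pos])
        if pos + length > len(stream):
            raise ValueError("truncated command at offset %d" % pos)
        commands.append(bytes(stream[pos:pos + length]))
        pos += length
    return commands
```

In hardware this would just be a decode of the top 2 or 3 bits of the first byte, which is why I like keeping the ranges on power-of-2 boundaries.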
You will no longer need to split each 16 bit value into a 12 bit part and then shift the remainder into the command's LSBs, or into the next Y register for the 24 bit ints.
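As an illustration of how simple the sending side becomes, here is a sketch of building one 3 byte register write. The mapping "opcode = 128 + register number" and the low-byte-first order are assumptions on my part (low byte first just matches how the Z80 stores 16 bit values), and encode_reg16 is a made-up name.

```python
def encode_reg16(reg, value):
    """Build a 3 byte command that loads a 16 bit register directly.

    Assumed layout: [128 + reg, value low byte, value high byte].
    """
    if not 0 <= reg < 64:
        raise ValueError("only 64 registers fit in commands 128-191")
    if not -32768 <= value <= 65535:
        raise ValueError("value does not fit in 16 bits")
    raw = value & 0xFFFF  # two's complement for negative values
    return bytes([128 + reg, raw & 0xFF, raw >> 8])
```

No masking to 12 bits and no leftover bits to carry into another register: the whole 16 bit value goes out as two plain bytes.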
This way, you can widen all the 12 bit regs to 16 bit, so your new XY coordinates go from -32768 to 32767, and the scaler gets a 16 bit scale of N:65536, 16x finer than the current 4096 division steps.
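The numbers can be checked with a few lines of Python. This is only a model of the arithmetic, not of the actual scaler logic: the helper names are invented, and rounding down with a plain shift is an assumption.

```python
def to_signed16(raw):
    """Interpret a 16 bit register value as a signed coordinate."""
    raw &= 0xFFFF
    return raw - 0x10000 if raw & 0x8000 else raw


def apply_scale(coord, n, bits=16):
    """Scale a coordinate by n : 2**bits (N:65536 when bits is 16)."""
    return (coord * n) >> bits  # assumed: shift rounds toward minus infinity


# Ratio of 16 bit scale steps to the current 12 bit scale steps
STEP_RATIO = (1 << 16) // (1 << 12)
```

For example, a half-size scale is n = 32768 with the 16 bit scaler and n = 2048 with the current 12 bit one; both give the same result, but the 16 bit version has 16 times as many steps between scale factors.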
Also, you will now have access to many more x/y regs than just 4 if you need them.
With room for 64x 16 bit integer regs and even 32x 32 bit integers, it is now feasible to implement a 32 bit ALU with full multiply/divide and add/sub between the regs. There are so many spare commands in the first 128 that some can direct things like holding offset and scale factors for the line drawing engines, making high quality accelerated geometry graphics possible.
The 32x 32 bit integers will allow for true floating point accelerated geometry. We are getting into the realm of how a Z80 can feed all of this, though you can also feed this command pipe from the memory contents of the GPU ram itself.
Maybe just use 2 byte and 4 byte command modes, since the FIFO pipe and GPU ram are organized in 16 bits.
To send a 32 bit word, you would then need to transmit 2x 4 byte commands, or we can scrap the 32 bit ALU and just use 24 bit, limiting use to integers in the +/- 8 million range.
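To put numbers on that trade-off, here is a small sketch of both options. The exact 24 bit limits follow from two's complement. The way a 32 bit word is cut into two 16 bit halves (low half first) is purely an assumption about how the two 4 byte commands might carry it, and the function names are invented.

```python
INT24_MIN = -(1 << 23)     # -8388608
INT24_MAX = (1 << 23) - 1  # 8388607


def fits_int24(value):
    """True if value survives a 24 bit signed register."""
    return INT24_MIN <= value <= INT24_MAX


def split32(value):
    """Cut a 32 bit value into (low, high) 16 bit halves, one per command."""
    raw = value & 0xFFFFFFFF
    return raw & 0xFFFF, raw >> 16


def join32(low, high):
    """Rebuild the signed 32 bit value on the receiving side."""
    raw = (high << 16) | low
    return raw - (1 << 32) if raw & 0x80000000 else raw
```

So the 24 bit route costs one command per value but caps you at about 8.4 million either way, while the 32 bit route costs two commands and needs the receiver to hold the first half until the second one arrives.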