Update on progress - below are my musings, not really seeking help unless anyone can point me to a fast transistor to replace a 2N3904?
The GPU is able to pull WAIT low in response to memory accesses by the host Z80. It's not rock-solid yet, though - I can adjust the delay for each memory access by setting a value in a GPU register accessible via an IO port, and turn the test on or off via that same register (bit 0 controls whether or not to apply WAIT states).
Most of the time it works fine - but every now and again it locks up with frenetic activity on the WAIT line. I need to spend some more time looking into the specific timings of what's going on, but here's what I've found so far:
What you see above is a nanosecond-accurate timing chart for a typical memory read operation by the host Z80 (in the uCOM's case, a 10MHz Z80 running at 8MHz), with the FPGA's clock (asynchronous to the Z80's) at the bottom. Pay special attention to the red bar - this denotes the amount of time the FPGA has to detect a memory read op and set the WAIT signal high to trigger the 2N3904 transistor to pull the Z80_WAIT line low.
TsWAIT(Cf) is the critical timing: the time WAIT must be stable LOW before the Z80 samples it on the falling edge of its T2 clock cycle. On the above diagram it has been extended to include the 2N3904's Td (delay time), so instead of the 20ns quoted in the Z80's specifications for the 10MHz variant, it is 55ns, because of the additional 35ns the 2N3904 needs to pull the WAIT line low. This is an approximation - I'm no expert at reading or understanding datasheets, so this is my best guess and likely a ballpark figure at best.
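To put rough numbers on that window, here is a minimal Python sketch of the arithmetic. It assumes the usual Z80 memory-cycle picture (MREQ falls some delay after the falling edge of T1, and WAIT is sampled on the falling edge of T2, one clock period later). The 20ns and 35ns figures are the ones quoted above; the function name and the MREQ delay used in the example are placeholders of mine, not datasheet values.

```python
def wait_budget_ns(z80_clock_hz, ts_wait_ns, transistor_delay_ns, mreq_delay_ns):
    """Time left for the FPGA to react, in nanoseconds.

    Assumption: MREQ falls mreq_delay_ns after the falling edge of T1,
    and WAIT is sampled on the falling edge of T2, one period later.
    """
    period_ns = 1e9 / z80_clock_hz
    # Setup time seen from the FPGA pin: Z80 setup plus transistor delay.
    effective_setup_ns = ts_wait_ns + transistor_delay_ns
    return period_ns - mreq_delay_ns - effective_setup_ns


if __name__ == "__main__":
    # 8MHz clock, 20ns TsWAIT(Cf), 35ns 2N3904 delay (figures from the post).
    # The 50ns MREQ delay is a made-up placeholder; check the datasheet.
    print(wait_budget_ns(8_000_000, 20, 35, 50))
```

Whatever comes out of this still has to cover the FPGA's own reaction time, which depends on its clock and on how the asynchronous MREQ signal is sampled, so the real margin is smaller than the figure printed.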
So this explains some of the crashes I was experiencing when I first tried inserting WAIT states: I was waiting to decode a memory read, which requires RD to go low, and as you can see in the chart above that doesn't leave enough time to get WAIT pulled low. Instead, the FPGA now asserts WAIT in the same cycle it sees MREQ go low with the address lines matching the GPU RAM address range. It's a lot more stable now, but I'm still getting occasional crashes - or guaranteed crashes if I do a memory op with LDIR, one of the Z80's fast block-memory op commands.
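As a behavioural illustration of that decode rule (not the actual FPGA logic), here it is as a small Python model. The address range and all names are hypothetical placeholders; only the "bit 0 enables WAIT states" detail comes from the description above.

```python
GPU_RAM_START = 0x8000   # hypothetical start of the GPU RAM window
GPU_RAM_END = 0xBFFF     # hypothetical end of the GPU RAM window
WAIT_ENABLE_BIT = 0x01   # bit 0 of the GPU control register


def wait_request(mreq_n, address, control_reg):
    """Return True when the FPGA should drive its WAIT output high.

    mreq_n is the active-low MREQ level (0 = asserted). RD is deliberately
    not part of the decision, since waiting for it costs too much time.
    """
    if not (control_reg & WAIT_ENABLE_BIT):
        return False
    mreq_active = (mreq_n == 0)
    in_gpu_range = GPU_RAM_START <= address <= GPU_RAM_END
    return mreq_active and in_gpu_range
```

A True result here corresponds to the FPGA output going high, which turns on the 2N3904 and pulls Z80_WAIT low.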
So timing is still an issue. It just surprises me that it is an issue at all, because (to me) the FPGA is fast, but I guess not as fast as dedicated, discrete chips decoding the memory op and address range (or a chip select, more likely) in old peripheral equipment back in the day.
Is there anything faster than a 2N3904, I wonder? That 35ns delay is a killer.