I guess my motivations may be hard to understand for some
I understand it and respect it, even if I don't personally feel the attraction (anymore).
To paraphrase an old proverb, maybe Daoist, "The man who knows no stories is a fool. The man who knows many stories is wise. The man who knows one story is dangerous." The Old Ones have bequeathed many useful stories to us through their artifacts. I urge you to look at what other designers were doing in the 1980s when tasked with building a display engine. Even a close reading of the programmer's manual for some of these older systems will give you some flavor of their concerns and the clever tricks they used to get (then) high-performance video out of (today) low-performance hardware while holding to an accessible price point.
But you have to know your medium before you can create a useful artifact. Digital design is not a qualitative discipline. There is no substitute for looking at timing diagrams and characteristics tables, establishing cause and effect, looking up propagation delays, and doing the sums. For example,
no noticeable slow-down if I were to shut down memory access for ~73% of the time while the FPGA is accessing the frame buffer to draw the visible area of the screen
isn't a very useful question. The better question is, how much slowdown would there be if...? To reckon that, we have to look at the Timing section of the Z80 CPU User Manual, UM0080. For example, if you look at the instruction fetch cycle you'll see that the last 2T of the M1 cycle are spent "refreshing", where the Z80 doesn't care what's on the bus but still drives some of the address pins as a convenience to users of then-new DRAM, and that the Z80's proper business is done at the end of T2. We also know that our memory system can service instruction reads in 2T, and that, if we are using SRAM or a DRAM controller that handles its own refresh, the Z80's refresh cycle is wasted time. What if we simply disconnect the Z80 from the bus at T3, hold the read value in a latch for the convenience of the Z80, and let other hardware access the bus instead? You just found a £10 note in the sofa cushions, depending on what kind of deal your company was able to score on DRAM.
Back to cases. Having written out some ins and outs, you might then look at I/O read and write machine cycles, and see that they are 4T in length. For I/O writes, we see that all address, control, and data of interest are available in T2. If all our devices are fast enough to have completed the write by the end of T2, we can just disconnect the Z80 from the bus and leave the system bus free for other business during TW* (third beat) and T3 (fourth beat). For I/O reads, we see that all the address and control are once again available in T2, but the CPU doesn't read the data until the second half of T3. So, again assuming that our devices are fast enough, we will have our read data by the end of T2 and need only hold it for the CPU through TW* and T3. We can then unhook the Z80 from the bus at the end of T2 and go about our other business. Cool, another £10 in the cushions!
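As a rough illustration of the read case above, here is a minimal behavioral sketch in Python (not HDL, and not a timing-accurate model). The names `SLOT_OWNERS`, `shared_read_cycle`, and `hold_latch` are made up for this example, and it assumes every device answers by the end of T2.

```python
from typing import Dict, Tuple

# Bus owner for each beat of a 4T cycle (T1, T2, TW*, T3), per the scheme above.
SLOT_OWNERS = ("CPU", "CPU", "OTHER", "OTHER")


def shared_read_cycle(devices: Dict[int, int], cpu_addr: int,
                      other_addr: int) -> Tuple[int, int]:
    """Model one 4T read in which the CPU leaves the bus after T2.

    Returns (byte seen by the CPU, byte fetched by the other bus master).
    """
    hold_latch = None  # hypothetical latch between the system bus and the CPU
    other_data = None
    cpu_data = None
    for beat, owner in enumerate(SLOT_OWNERS, start=1):
        if owner == "CPU" and beat == 2:
            # End of T2: the device has answered, so capture its data.
            hold_latch = devices[cpu_addr]
        elif owner == "OTHER" and beat == 3:
            # TW*: the bus now belongs to the other master (e.g. a video fetch).
            other_data = devices[other_addr]
        elif beat == 4:
            # T3: the CPU samples its data pins, which the latch is driving.
            cpu_data = hold_latch
    return cpu_data, other_data
```

Calling `shared_read_cycle({0x10: 0xAA, 0x20: 0x55}, 0x10, 0x20)` hands the CPU the byte at 0x10 and the other master the byte at 0x20, both within the same 4T.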
Now we look at loads and stores. We see that memory write cycles are 3T in length, and that our snappy little jig has become instruction-dependent math rock. But wait! The Z80, like most other processors of the time, allows bus cycles to be stretched to accommodate slow hardware. We can insert one TW to lengthen the cycle to 4T and keep the rhythm, taking note of the penalty in order to answer our original question. Having done that, we look again at the memory write cycle and see that data and address are valid by the end of T1, but the !WR signal isn't valid until the end of T2. We can assume by the end of T1 that, if !MREQ is asserted and !RD is not, !WR will be asserted by the end of T2, and prepare accordingly. The rest of the write cycle follows that of the I/O write cycle, with the exception that here we inserted the wait state ourselves, and we can likewise disconnect the CPU from the bus for TW and T3 without the Z80 being any the wiser. £10! As for loads, we see that they too are 3T in length, so to keep the rhythm let's add the wait state and mark the penalty. Looking at our extended read cycle, we once again see that data isn't sampled until the end of T3 (fourth beat, as TW was inserted as the third beat). If our memories are fast enough, we can sample the data at the end of T2 just as we did for the I/O read and hold it for the CPU, while we unhook the CPU from the bus and use the last 2T for our own business. £10!
Finally let's look at the interrupt request/acknowledge cycle, which is at minimum 5T long, and now we've gone to playing experimental jazz. But we will know what kind of cycle it is by the end of T1, and know it is an interrupt if we see !M1 asserted and neither !MREQ nor !IORQ asserted. In our system design we have a decision to make: do we want to pass the vector cycle through to the system bus, or handle it off the system bus? If you pass it through, you can treat it much the same as any other memory read cycle but generate the !WAIT signal and hold the received vector for 5T longer than otherwise, which we will mark into the penalty column. If you prefer to handle it off the system bus, you do the same with the !WAIT signal but disconnect the CPU from the bus and supply the vector by your choice of means. Your call. In either case, to keep the rhythm we have to stretch the interrupt cycle out to 8T. A 50p coin is better than nothing.
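The end-of-T1 guesses described so far can be written down as a small decoder. This is only a sketch: the function name and return strings are invented for illustration, the two rules marked "from the text" are the ones given above, and the opcode-fetch and memory-read cases are my own additions based on !M1 distinguishing a fetch from an ordinary read. Anything else (I/O, for instance) is left undetermined here.

```python
def classify_at_end_of_t1(m1_n: int, mreq_n: int, iorq_n: int, rd_n: int) -> str:
    """Guess the Z80 machine cycle type from control pins sampled at end of T1.

    All inputs are active-low pin levels: 0 means asserted, 1 means idle.
    """
    m1, mreq, iorq, rd = (level == 0 for level in (m1_n, mreq_n, iorq_n, rd_n))

    if m1 and not mreq and not iorq:
        return "int_ack"      # from the text: !M1 alone means interrupt acknowledge
    if mreq and not rd:
        return "mem_write"    # from the text: expect !WR by the end of T2
    if mreq and rd and m1:
        return "opcode_fetch"  # added assumption: a read with !M1 is a fetch
    if mreq and rd:
        return "mem_read"     # added assumption: a read without !M1 is a load
    return "undetermined"     # not decided at T1 in this sketch


if __name__ == "__main__":
    print(classify_at_end_of_t1(m1_n=0, mreq_n=1, iorq_n=1, rd_n=1))
    print(classify_at_end_of_t1(m1_n=1, mreq_n=0, iorq_n=1, rd_n=1))
```

In real hardware this would be a handful of gates feeding the wait-state and bus-isolation logic; the Python form is just a way to check the truth table before committing it to an FPGA.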
The NMI cycle is just a dummy instruction fetch with a runt pulse on !MREQ in T3 which we can ignore because the CPU would be off the bus anyway. £1!
The overall effect, then, is that we have introduced some ancillary logic to the Z80 so that it can vacate the system bus for 2T out of every 4T, at the cost of memory access by the CPU taking 14.3% longer than theoretically possible, and a modest hit to interrupt latency which in practice may not mean all that much. In doing so we have saved millions of RAM chips and tens of millions of pounds. Going a little bit out of brief, while the video system is not actively fetching frame buffer data, its 2T cycles could be borrowed to service other peripherals, for example, to buffer up sprite data, service disk controllers, or output PCM audio. Going very far afield, it may be evident that, instead of display DMA, we could place a second Z80 with much the same ancillary logic, sync them up, and have them both run at nearly full speed out of the same memory and I/O space, with the usual caveats about multiprocessing. If you were feeling especially naughty, you could pull the wool over the second CPU's data lines and feed it NOP instructions while exploiting its program counter as an address generator for the video output (beware, people have been knighted for doing this sort of thing).
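The penalty column can be totted up mechanically. The sketch below assumes every machine cycle is padded up to the next multiple of 4T (3T becomes 4T, 5T becomes 8T, as in the walk-through); the function names are invented for this example. One way to land on a figure like 14.3% is an instruction made of a 4T opcode fetch plus one 3T memory access, such as `LD A,(HL)` at 7T, though that is my reading and not necessarily how the figure above was derived. Real code will vary with its instruction mix.

```python
import math
from typing import Sequence


def padded_length(t_states: int, slot: int = 4) -> int:
    """Length of a machine cycle after stretching it to a whole number of slots."""
    return math.ceil(t_states / slot) * slot


def slowdown(machine_cycles: Sequence[int], slot: int = 4) -> float:
    """Fractional extra time for an instruction given its machine cycle lengths."""
    native = sum(machine_cycles)
    stretched = sum(padded_length(c, slot) for c in machine_cycles)
    return stretched / native - 1.0


if __name__ == "__main__":
    examples = {
        "NOP (4T fetch only)": [4],
        "LD A,(HL) (4T fetch + 3T read)": [4, 3],
    }
    for name, cycles in examples.items():
        print(f"{name}: {slowdown(cycles):.1%} longer")
```

Feeding it the machine cycle lists for a representative chunk of your own code, taken from the instruction tables in the manual, gives a more honest answer to "how much slowdown?" than any single percentage.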
Part of the reason for this post was to discuss the best way to interface the FPGA with the computer - frame buffer in computer RAM, or gated via the FPGA, for example. Both have positives and negatives, I just don't have the knowledge or experience to understand their weight and balance the options properly, if that makes sense?
The word salad I wrote above is a walk through the sort of thinking the Old Ones used in many of the 1980s home computer designs, starting with the need to compete vigorously on both BOM cost and performance. The IBM PC, coming from its mainframe background and the culture of modularity, could not engage in this level of coupling. Either approach is certainly feasible. Which one is more desirable is a systems-level decision that depends, in part, on the desired display size and depth, and in turn the desired pixel rate, but also on cost, code size, flexibility, programmer convenience, and so on. "Better than the Amstrad" is an open-ended brief that could encompass anything from the tightly-coupled home computer systems as napkin-designed here, to an ISA bus bridge to a CGA/VGA/EGA/Hercules card rescued from the scrap pile. In any case, there will be side effects which you also have to pursue down the line and decide to live with for your application.
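To put numbers on the display size, depth, and pixel rate question, a back-of-envelope budget along these lines may help. Everything here is an assumption for illustration: the function names, the 4 MHz clock, the one-byte-per-4T-slot access rate, and the example display modes. It also compares only averages over a whole frame; the peak rate during the visible part of a scanline is higher, so treat a "fits" as necessary rather than sufficient.

```python
def slot_bandwidth(cpu_clock_hz: float, slot_t: int = 4,
                   bytes_per_access: int = 1) -> float:
    """Bytes per second available to the non-CPU half of each slot."""
    return cpu_clock_hz / slot_t * bytes_per_access


def display_demand(width: int, height: int, bits_per_pixel: int,
                   frame_rate_hz: float) -> float:
    """Average frame buffer bytes per second needed to refresh the display."""
    return width * height * bits_per_pixel / 8 * frame_rate_hz


if __name__ == "__main__":
    available = slot_bandwidth(4_000_000)  # hypothetical 4 MHz Z80
    for bpp in (2, 4):                     # hypothetical 320x200 modes at 50 Hz
        needed = display_demand(320, 200, bpp, 50)
        verdict = "fits" if needed <= available else "does not fit"
        print(f"320x200 at {bpp} bpp: {needed:.0f} of {available:.0f} B/s, {verdict}")
```

If the mode you want does not fit, the levers are the usual ones: a wider memory path for the video side, a faster memory so more than one fetch fits in the 2T window, or a separate frame buffer gated through the FPGA.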
Oh, if you want to play USB host, the SL811HS is a classic choice, which incidentally can also be configured as a device should you wish to link your system to your PC. Be advised that USB HID can be a bit of a hairball.