Ohhh boy, Ohhh boy...
Ok, currently, your pixel writer and geometry unit run at 100MHz. Let's call this a limit of the lower-end FPGAs and the coding style / architecture for now. Right now, because we support less than 8bpp, every time we want to write a pixel, we must first read a byte (or in our case, 32 bits), edit the bits we want to draw on, then write the newly edited byte back.
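To make that read-edit-write cycle concrete, here is a rough software model of it in Python. This is only a sketch, not the actual HDL: the function name is made up, and it assumes pixel 0 sits in the least significant bits of the 32-bit word, which may not match the real memory layout.

```python
def write_pixel_rmw(word32, pixel_index, bpp, color):
    """Return the 32-bit word with one sub-byte pixel replaced.

    Hypothetical helper. Assumes pixel 0 occupies the least significant
    bits of the word (an assumption, the real layout may differ).
    """
    assert bpp in (1, 2, 4)
    pixels_per_word = 32 // bpp
    assert 0 <= pixel_index < pixels_per_word
    mask = (1 << bpp) - 1              # bpp ones, e.g. 0xF for 4bpp
    shift = pixel_index * bpp          # bit position of the target pixel
    # Step 1 is the memory read that produced word32.
    # Step 2: clear the old pixel bits, then merge in the new color.
    cleared = word32 & ~(mask << shift) & 0xFFFFFFFF
    # Step 3: the returned value is what gets written back to memory.
    return cleared | ((color & mask) << shift)
```

The point is that the write cannot be formed until the read data is in hand, which is why every sub-8bpp pixel costs a full memory round trip.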
Cheap optimization option #1: pipeline the pixel read through one of my DDR3 read ports, stuffing the write pixel command inside my 'read vector' feature, then take that vector and the returned read data to generate the new write data on a new DDR3 write channel. Piping the write-pixel command this way means our write pixel module isn't sitting idle waiting for a read to come back before it can edit the data and write it out. This would probably increase pixel write speed by 2-3x while maintaining 100% backwards compatibility and still supporting the existing write pixel collision counter.
Cheap optimization option #2: get rid of 1bpp, 2bpp and 4bpp support altogether for writing pixels. This means we lose the ability to paint pictures on anything other than 8bpp, 16bpp or 32bpp screens. This should increase pixel write speed by 3-5x, but we will hit a hard limit of 100 million pixels a second and generally achieve only around 50-75 million pixels a second. Since we no longer pre-read the memory address where we are painting pixels, we no longer have a write pixel collision counter.
Proper optimization method #1: Our DDR3 right now runs at full speed only at 128 bits wide in the 100MHz clock domain. In each 128 bits, we have 16 8bpp pixels, 8 16bpp pixels, or 4 32bpp pixels. This means that to get the highest pixel writing speed in 8bpp mode, we need to fill 16 sequential pixels every 100MHz clock. This means our geometry unit needs to change how it works. The best way to describe this: when we request something like a triangle, we first need to generate a rectangular bounding box address area which the triangle fits inside, with its rows and columns padded out to 128-bit boundaries. Next, on every 100MHz clock, we step through one 128-bit chunk, and with 16 pixel shaders running in parallel, we decide which pixels get filled and which do not. Compared to our old drawing routine, we get the ~4x of cheap optimization #2 multiplied by 16 pixels, meaning roughly a 64x speed increase over your current pixel writer in 8bpp mode. We are beginning to enter the realm of a simple 3D accelerator, and with an added texture reader ahead of this writer, proper design, and maybe a second DDR3 chip for a 256-bit wide bus & 128x speed, we will pass the first Sony PlayStation in rendering capability.
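The chunk arithmetic above can be sketched in a few lines of Python. This is just a model for counting, not a design: the function names are hypothetical, and it only pads the box horizontally to 128-bit boundaries, assuming each screen row starts on a 128-bit boundary.

```python
BUS_BITS = 128  # DDR3 data width in the 100MHz domain, per the text above

def pixels_per_chunk(bpp):
    """Pixels carried by one 128-bit word: 16 at 8bpp, 8 at 16bpp, 4 at 32bpp."""
    return BUS_BITS // bpp

def padded_bbox(vertices, bpp):
    """Bounding box of the shape with x snapped outward to chunk boundaries.

    Returns (x0, y0, x1, y1) where x1 is exclusive and y1 is inclusive.
    Assumes every screen row starts on a 128-bit boundary (an assumption).
    """
    n = pixels_per_chunk(bpp)
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    x0 = (min(xs) // n) * n          # round left edge down to a chunk start
    x1 = (max(xs) // n + 1) * n      # round right edge up past the last pixel
    return x0, min(ys), x1, max(ys)

def chunk_count(vertices, bpp):
    """Number of 128-bit words, i.e. 100MHz clocks at one word per clock."""
    x0, y0, x1, y1 = padded_bbox(vertices, bpp)
    return ((x1 - x0) // pixels_per_chunk(bpp)) * (y1 - y0 + 1)
```

For example, a triangle with corners (5,3), (40,10), (20,30) at 8bpp pads out to x = 0..47, which is 3 chunks per row over 28 rows, so 84 clocks if one chunk is processed per clock.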
Now, as you can guess, though we might use the same geometry coordinates to initiate the drawings for this optimization, the core guts of the geometry unit will look completely different for this proper optimization method #1 & #2. It is more akin to us having a bounding box to draw inside, and at each 8 or 16 pixel chunk, our 8 or 16 pixel shaders, all in parallel, will be answering the same question: At this point on the screen, does my pixel fall on or inside the line's / triangle's / rectangle's coordinates? (Yes/No) This will produce 8-16 parallel sequential pixels in one shot, every 100MHz clock. (Note that if we are painting a 1 pixel thin vertical line, yes, 7 of the 8 pixel writers will say no, but the DDR3 cannot draw a vertical line any faster anyway. And if we want antialiasing, then the question becomes how much coverage, and if it isn't 0% or 100%, then we need to do a pixel read and a pixel blend.)
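Here is a small Python model of that yes/no question for one chunk of a triangle. It is only a sketch under my own assumptions: it uses the common edge-function inside test, which is one way to answer the question and not necessarily what the final HDL would use, and the function names are made up. In hardware, each pass of the loop would be its own shader running in the same clock.

```python
def edge(ax, ay, bx, by, px, py):
    """Signed area test: which side of the edge a->b the point p lies on."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def inside_triangle(tri, px, py):
    """True if (px, py) is on or inside the triangle, for either winding."""
    (ax, ay), (bx, by), (cx, cy) = tri
    e0 = edge(ax, ay, bx, by, px, py)
    e1 = edge(bx, by, cx, cy, px, py)
    e2 = edge(cx, cy, ax, ay, px, py)
    return (e0 >= 0 and e1 >= 0 and e2 >= 0) or (e0 <= 0 and e1 <= 0 and e2 <= 0)

def chunk_mask(tri, chunk_x, y, lanes=16):
    """One write-enable bit per pixel in the chunk starting at chunk_x.

    Bit 0 is the leftmost pixel (an assumed bit order). Each loop pass
    models one of the parallel pixel shaders.
    """
    mask = 0
    for lane in range(lanes):
        if inside_triangle(tri, chunk_x + lane, y):
            mask |= 1 << lane
    return mask
```

For the triangle (0,0), (15,0), (0,15), the chunk at x=0 gives all 16 bits set on row 0, the low 8 bits on row 8, and a single bit on row 15. The resulting mask maps naturally onto per-byte write enables in 8bpp mode, so no pre-read is needed for fully opaque pixels.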
There are other ways to handle the vertical 1 pixel wide line issue, such as rendering multiple objects by cached square screen blocks, but a lot of what is needed for that still technically requires the engineering step of designing multiple parallel pixel shaders.