Topic: any simple blitter around? (looking for FPGA design)

legacy · « **on:** August 21, 2018, 03:51:43 pm »

The name "blitter" comes from the bit blit operation of the 1973 Xerox Alto, and refers to a coprocessor dedicated to the rapid movement and modification of data within a computer's memory. In fact, a blitter can do a lot of useful tasks in parallel: it can draw patterned lines with a variety of textures, and can even draw them in a special way for simple area fill.

I am interested in these two features:
- to draw a line
- to fill an area

ummmm, maybe the "Amiga Minimig" project is a reference: it includes the Amiga 500's blitter, the code is open source, and it's written in Verilog.

any other blitter around?

hamster_nz · « **Reply #1 on:** August 21, 2018, 09:28:01 pm »

This had me thinking: when was the big shift from bit-plane graphics (which really need a blitter to work) to the current "pixel per word" graphics?

I guess that happened when memory chip bandwidth exceeded the required display bandwidth, and 32-bit address spaces became the norm, rather than the likes of a 64k window at memory segment 0xA000...

legacy · « **Reply #2 on:** August 21, 2018, 10:24:54 pm »

Quote from: hamster_nz on August 21, 2018, 09:28:01 pm

64k

I'd like to add more features to my VDU (video display unit). Currently it's only a text display (able to scroll), and I am not willing to implement an external DDR1 video memory controller: DRAM is large and cheap, but very complex to handle. Therefore I'd like to use an external static RAM (asynchronous bus) for both the text memory and the video memory, and the chip happens to be limited to 128 Kbyte with a granularity of 8 bits.

Besides, I'd like to have a hardware unit that draws lines and fills areas.

Kleinstein · « **Reply #3 on:** August 22, 2018, 06:02:58 am »

Quote from: hamster_nz on August 21, 2018, 09:28:01 pm

This had me thinking: when was the big shift from bit-plane graphics (which really need a blitter to work) to the current "pixel per word" graphics?

I guess that happened when memory chip bandwidth exceeded the required display bandwidth, and 32-bit address spaces became the norm, rather than the likes of a 64k window at memory segment 0xA000...

Those separate bit planes were not very popular, more like a specialty of the Amiga. There might have been a few really old ones before 1985. Other computers sometimes had something like a 4-color mode with 4 pixels packed in 1 byte. The main point at which bit planes were given up was when memory got cheap enough to have 1 byte per pixel.

BrianHG · « **Reply #4 on:** August 22, 2018, 06:25:25 am »

Atari 8-bit computers, Commodore 64, Apple II, Macintosh, NeXT. Any computer before the age of a minimum of 256 colors per pixel. This even includes the PC's 16-color and 2-color graphics modes, which all had multiple pixels compacted into each byte.

The newer full color NeXT computers had a special graphics processor which allowed mixed color bit plane modes and non-paletted byte/pixel 8/16/24 bit modes in each individual window, allowing backwards compatibility with the first NeXT B&W grey modes alongside newer color applications.

I would say, for simplicity's sake, just do 32 bits per pixel and use the first 24 bits, or a 16 bit per pixel blitter.

As for drawing, I don't know how smart your RAM controller is, but when drawing, writing single bytes wastes clock cycles on the memory unless your RAM controller is very, very smart and has a large write cache. Reading memory for drawing the display is easier, as the reads are in straight lines, and read bursts of 32, 64 or 128 pixels in a single shot will be super efficient for the RAM. Drawing a vertical line will be the worst, as you need to address a separate byte every time you draw 1 new pixel on the line and, if your RAM controller can't do write masking, this may include a read-modify-write cycle.

Ignore all the ram timing junk I just stated if you are working at 320x240, or, 640x480 as in these slow ass modes, you can get away with writing at any slow speed.

Drawing lines and fills is as simple as x & y loop counters with a start, stop & increment size for each counter, operating in either true floating point or as an m.n integer (fixed-point) counter. (This is not needed for horizontal or vertical lines, as you only need to increment by one for those; the m.n counters are for lines at any angle!) Create a second A-to-B coordinate line counter running within the first, one which resets to its beginning after every pixel is drawn by the first A-B line counter. Drawing a line from the first line engine to the second A-B line engine then allows you to render any filled 4-sided polygon, or a 3-sided one if one of the lines has a length of 0.
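To make the m.n counter idea concrete, here is a minimal C sketch, assuming a 16.16 fixed-point split (the thread does not fix m or n). The names `fx_line`, `put_pixel` and `fb`, and the 64x64 framebuffer, are made up for illustration. Coordinates are assumed non-negative and lines shorter than 32768 pixels.

```c
#include <stdint.h>
#include <stdlib.h>

#define FB_W 64        /* hypothetical framebuffer size */
#define FB_H 64
#define FRAC_BITS 16   /* assumed 16.16 split: pixel bits above, fraction below */

uint8_t fb[FB_H][FB_W];

void put_pixel(int32_t x, int32_t y)
{
    if (x >= 0 && x < FB_W && y >= 0 && y < FB_H)
        fb[y][x] = 1;
}

/* Draw a line with two fixed-point counters, one per axis. */
void fx_line(int32_t x0, int32_t y0, int32_t x1, int32_t y1)
{
    int32_t dx = x1 - x0;
    int32_t dy = y1 - y0;
    int32_t steps = abs(dx) > abs(dy) ? abs(dx) : abs(dy);
    int64_t one = (int64_t)1 << FRAC_BITS;

    /* Start half a pixel in, so dropping the fraction rounds to nearest. */
    int64_t x = x0 * one + one / 2;
    int64_t y = y0 * one + one / 2;

    /* Increment is exactly +/-1.0 on the long axis, a fraction on the other. */
    int64_t x_inc = steps ? dx * one / steps : 0;
    int64_t y_inc = steps ? dy * one / steps : 0;

    for (int32_t i = 0; i <= steps; i++) {
        /* Only the upper (integer) bits address the pixel. */
        put_pixel((int32_t)(x >> FRAC_BITS), (int32_t)(y >> FRAC_BITS));
        x += x_inc;
        y += y_inc;
    }
}
```

In hardware the two shifts cost nothing, since they are just a choice of which counter bits are wired to the address logic. The divide used to set up the increments is the expensive part and would normally be done once per line.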

Creating circles and ovals gets more complicated fast, but using the m.n counters adapting the m.n values every increment with some simple clever integer math located elsewhere on this forum, you can draw perfect circles and ovals just as fast as rectangles with some effort. (The magic of FPGA and hard coded functions wired to a task...)

I think these might go way beyond the Amiga's Agnus's blitter's capabilities as it was primarily a simple rectangle memory copy/fill device with a transparency stencil. It's been too long since I've done development with an Amiga, but, to replicate just those rectangle functions in a 16bit/32bit per pixel environment, it would be nothing more than a page or 2 of simple Verilog coding. It's everything else on how it's wired to your ram and how you feed it instructions which makes the difference. You are talking about a source counter with an X1 start position, and an X2 destination counter with a Z size to copy. If you want, you can also implement a Y loop counter, and for every Y count, you have an X1 source increment and a X2 destination increment which allows a 1 rectangle shape from a source bitmap being rendered into a destination bitmap with a different X number of pixels in width.
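The rectangle copy with separate source and destination increments can be sketched in C as follows. This is only an illustration of the counter structure described above, assuming 16-bit pixels; `blit_rect` and its parameter names are invented here and are not taken from the Amiga hardware or from Minimig.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy a w x h pixel rectangle between two bitmaps of different widths.
 * src_stride and dst_stride are the row pitches in pixels. */
void blit_rect(const uint16_t *src, size_t src_stride,
               uint16_t *dst, size_t dst_stride,
               size_t w, size_t h)
{
    for (size_t y = 0; y < h; y++) {      /* Y loop counter */
        const uint16_t *s = src;          /* source counter for this row */
        uint16_t *d = dst;                /* destination counter for this row */
        for (size_t x = 0; x < w; x++)
            *d++ = *s++;
        src += src_stride;                /* per-row source increment */
        dst += dst_stride;                /* per-row destination increment */
    }
}
```

A fill is the same loop with a constant color in place of the source read, and a transparency stencil would add a per-pixel test before the write.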

T3sl4co1l · « **Reply #5 on:** August 22, 2018, 09:36:09 am »

Quote from: Kleinstein on August 22, 2018, 06:02:58 am

Those separate bit planes were not very popular, more like a specialty of the Amiga. There might have been a few really old ones before 1985. Other computers sometimes had something like a 4-color mode with 4 pixels packed in 1 byte. The main point at which bit planes were given up was when memory got cheap enough to have 1 byte per pixel.

EGA and VGA (except for mode 0x13) did bit planes as well, 16 color palette from 64 or 262k possible choices. No hardware blitting, aside from bytewise access with logic operations (for masking and such).

Other crazy ideas: compositing layers, sprites, transforms (e.g., SNES mode 7), etc.

Tim

legacy · « **Reply #6 on:** August 22, 2018, 10:00:10 am »

Quote from: BrianHG on August 22, 2018, 06:25:25 am

Drawing lines and fills are as simple as an x&y loop counters with a start, stop & increment size for each counter operating in either true floating point, or, a m.n integer counter

this is the prototype in C to draw a line between two points on the display, just to see how the algorithm goes

Code: [Select]

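#include <stdint.h>

/*
 * helper definitions this prototype relies on; they are not shown in
 * the post, so these are assumptions that make it compile as plain C
 */
typedef int32_t sint32_t;
#define shiftLeft    <<
#define shiftRight   >>
#define isNotEqualTo !=

/* plots one pixel; assumed to be provided by the display code */
void screen_pixel_xy(uint32_t x, uint32_t y);

static uint32_t quanto = 1;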

/*
 * draw a line between two points on the display
 */
void DDA_doline
(
    uint32_t x1,
    uint32_t y1,
    uint32_t x2,
    uint32_t y2
)
{
    sint32_t dx;
    sint32_t dy;
    sint32_t stepx;
    sint32_t stepy;
    sint32_t fraction;
    uint32_t screen_x;
    uint32_t screen_y;
    uint32_t response;

    /*
     * calculate differential form
     *
     *  dy   y2 - y1
     *  -- = -------
     *  dx   x2 - x1
     *
     */

    /*
     * take differences
     */
    dy       = y2 - y1;
    dx       = x2 - x1;

    screen_x = x1;
    screen_y = y1;

    /*
     * dy is negative
     */
    if (dy < 0)
    {
        dy    = -dy;
        stepy = -1;
    }
    else
    {
        stepy = 1;
    }

    /*
     * dx is negative
     */
    if (dx < 0)
    {
        dx    = -dx;
        stepx = -1;
    }
    else
    {
        stepx = 1;
    }

    dx = dx shiftLeft quanto;
    dy = dy shiftLeft quanto;

    /*
     * draw initial position
     */
    screen_pixel_xy(screen_x, screen_y);

    /*
     * draw next positions until end
     */
    if (dx > dy)
    {
        /*
         * take fraction
         */
        fraction = dy - (dx shiftRight quanto);
        while (screen_x isNotEqualTo x2)
        {
            if (fraction >= 0)
            {
                screen_y += stepy;
                fraction -= dx;
            }
            screen_x += stepx;
            fraction += dy;

            /*
             * draw calculated point
             */
            screen_pixel_xy(screen_x, screen_y);
        }
    }
    else
    {
        /*
         * take fraction
         */
        fraction = dx - (dy shiftRight quanto);
        while (screen_y isNotEqualTo y2)
        {
            if (fraction >= 0)
            {
                screen_x += stepx;
                fraction -= dy;
            }
            screen_y += stepy;
            fraction += dx;

            /*
             * draw calculated point
             */
            screen_pixel_xy(screen_x, screen_y);
        }
    }
}

Code: [Select]

test1, screen size = 39 x 39
------------------[screen]------------------
|......................................*|
|.....................................*.|
|....................................*..|
|...................................*...|
|..................................*....|
|.................................*.....|
|................................*......|
|...............................*.......|
|..............................*........|
|.............................*.........|
|............................*..........|
|...........................*...........|
|..........................*............|
|.........................*.............|
|........................*..............|
|.......................*...............|
|......................*................|
|.....................*.................|
|....................*..................|
|...................*...................|
|..................*....................|
|.................*.....................|
|................*......................|
|...............*.......................|
|..............*........................|
|.............*.........................|
|............*..........................|
|...........*...........................|
|..........*............................|
|.........*.............................|
|........*..............................|
|.......*...............................|
|......*................................|
|.....*.................................|
|....*..................................|
|...*...................................|
|..*....................................|
|.*.....................................|
|*......................................|

dmills · « **Reply #7 on:** August 22, 2018, 12:52:36 pm »

That's Bresenham's line drawing algorithm; there is also a circle drawing variant and an improved line drawing one that exploits the symmetry.

"Graphics Gems" by Glassner (Academic Press) is IMHO the bible for CPU based graphics stuff, dated by todays standards but very, very good stuff when you lack a modern graphics processor.

Regards, Dan.

BrianHG · « **Reply #8 on:** August 22, 2018, 08:20:58 pm »

Quote from: legacy on August 22, 2018, 10:00:10 am

Quote from: BrianHG on August 22, 2018, 06:25:25 am
Drawing lines and fills are as simple as an x&y loop counters with a start, stop & increment size for each counter operating in either true floating point, or, a m.n integer counter

this is the prototype in C to draw a line between two points on the display, just to see how the algorithm goes

You got the basics of it. In fact, in Verilog/VHDL, just use a huge single 32 or 48 bit x and y counter. Use the upper 10-12 bits to point to your pixel position and the lower bits for the fractional position. The screen pointers to those top 10-12 bits are nothing more than wires into the right place. And yes, that same chunk of C code will be about the same size in Verilog.

For the Verilog code, I would have an input instruction & data port which allows you to set each variable register, an output port to point to the RAM address, and output wires for busy and ready. (Leave instruction space to select fill colors and source/destination memory type fills as you expand your blitter; I figure 8 bits of address space for 256 commands.) This will allow your onboard CPU to instruct your blitter. Also, with busy and ready lines out, you may add a simple FIFO and cache blitter instructions, freeing up the CPU to queue a bunch of line commands in advance while the blitter is busy drawing. Don't forget the fill color register in the command list, and the output color.

(Further future optimization) A more sophisticated technique may be to have a 2-stage parallel bank of setup registers inside your blitter, so that while you are drawing, the FIFO can still fill all of the next set of source line registers. Then, once the current line is drawn, that next prepared bank of line drawing registers will all be transferred to the active ones in a single clock and drawing starts immediately, with no wait state. This also allows you to edit only some of the registers in the setup command instructions for your blitter, retaining the unchanged registers as you instruct new line geometries, which makes repetitious line feature updates even faster.
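A rough C model of the two-bank register scheme, as a sketch only. The register layout (`prim_regs_t`), the `setup_valid`/`busy` flags and the function names are assumptions made for illustration. In an HDL the whole-struct copy would be a single registered assignment.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register set for one primitive. */
typedef struct {
    uint32_t x[4], y[4];
    uint32_t color;
    bool     filled;
} prim_regs_t;

typedef struct {
    prim_regs_t setup;        /* bank written by the command port */
    prim_regs_t active;       /* bank read by the drawing engine */
    bool        setup_valid;  /* a complete primitive is waiting in setup */
    bool        busy;         /* the engine is drawing from active */
} blitter_t;

/* Command port side: refused only when the setup bank is still occupied. */
bool blitter_submit(blitter_t *b, const prim_regs_t *p)
{
    if (b->setup_valid)
        return false;
    b->setup = *p;
    b->setup_valid = true;
    return true;
}

/* One engine clock. draw_done says the current primitive just finished. */
void blitter_tick(blitter_t *b, bool draw_done)
{
    if (b->busy && draw_done)
        b->busy = false;
    if (!b->busy && b->setup_valid) {
        b->active = b->setup;     /* whole bank moves in one step */
        b->setup_valid = false;
        b->busy = true;
    }
}
```

Because `setup` keeps its old contents after the transfer, a follow-up command only has to rewrite the registers that differ, which is the partial-update benefit mentioned above.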

Another thing: a blitter like this can be made to decompress, in hardware, a prefabricated RLE-compressed 16/24-bit color image. Any non-picture-based graphics with wide repetitious lines and some random pixels all over the place will decode at near video playback speeds with this. At 640x480, perhaps full 16-bit playback at video speeds.
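As an illustration of how little logic RLE expansion needs, here is a small C sketch. The stream format (pairs of 16-bit words: run length, then color) is an assumption chosen for simplicity, not a format named in this thread, and `rle_decode` is a made-up name.

```c
#include <stddef.h>
#include <stdint.h>

/* Expand (run_length, color) word pairs into dst.
 * Returns the number of pixels written; never writes past dst_len. */
size_t rle_decode(const uint16_t *src, size_t src_words,
                  uint16_t *dst, size_t dst_len)
{
    size_t out = 0;

    for (size_t i = 0; i + 1 < src_words; i += 2) {
        uint16_t run   = src[i];
        uint16_t color = src[i + 1];

        /* The inner loop is what the fill engine does: same color, next address. */
        while (run > 0 && out < dst_len) {
            dst[out++] = color;
            run--;
        }
    }
    return out;
}
```

Each run maps directly onto a horizontal fill command, so the decoder is mostly a front end that feeds the existing fill counters.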

legacy · « **Reply #9 on:** August 23, 2018, 03:08:20 pm »

how to draw & fill a triangle? hints?

dave j · « **Reply #10 on:** August 23, 2018, 04:00:27 pm »

Quote from: legacy on August 23, 2018, 03:08:20 pm

how to draw & fill a triangle? hints?

Drawing a triangle is just drawing three lines.

Search for "triangle scan conversion". There are many ways to do it and it's something that is covered on computer graphics courses so there are loads of resources describing them. I don't know much about FPGAs so can't advise which method would be most suitable for you (try searching with FPGA as well).

legacy · « **Reply #11 on:** August 23, 2018, 07:15:21 pm »

Quote from: dave j on August 23, 2018, 04:00:27 pm

Drawing a triangle is just drawing three lines.

drawing is easy, but the problem is how to fill it with an efficient hardware algorithm.

BrianHG · « **Reply #12 on:** August 23, 2018, 08:39:35 pm »

Quote from: legacy on August 23, 2018, 07:15:21 pm

drawing is easy, but the problem is how to fill it with an efficient hardware algorithm.

That's the magic thing.
Here is an easy unoptimized approach to create your own:
Can you expand your line code to hold 3 coordinates?

In an outer loop, work out the coordinates for a line from coordinates A to C, step 1 pixel.
Inside that line algorithm, draw a line beginning at the current pixel coordinates of the outer line generator to position B. (Don't erase the position for the outer line algorithm)
Loop back to continue to the next pixel in line A to C, then draw the next inner line from the next coordinates to position B.

This will draw a filled triangle. It will, however, be 1 color, or you can create a simple 3-coordinate smooth fill blend color algorithm with 3 separate RGB values at each end point A, B & C in the triangle. This, however, won't fill with a graphic texture. That feature is beyond my knowledge and the size of your FPGA.
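The nested-line approach can be prototyped in C like this. It is a sketch under assumptions: a small made-up framebuffer (`fb`, `put_pixel`) and a standard integer Bresenham stepper used for both the outer A-to-C walk and the inner lines to B. The function names are invented for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

#define FB_W 40   /* hypothetical test framebuffer */
#define FB_H 40

uint8_t fb[FB_H][FB_W];

void put_pixel(int x, int y)
{
    if (x >= 0 && x < FB_W && y >= 0 && y < FB_H)
        fb[y][x] = 1;
}

/* Integer Bresenham line, valid in all octants. */
void draw_line(int x0, int y0, int x1, int y1)
{
    int dx = abs(x1 - x0), sx = x0 < x1 ? 1 : -1;
    int dy = -abs(y1 - y0), sy = y0 < y1 ? 1 : -1;
    int err = dx + dy;

    for (;;) {
        put_pixel(x0, y0);
        if (x0 == x1 && y0 == y1)
            break;
        int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; }
        if (e2 <= dx) { err += dx; y0 += sy; }
    }
}

/* Outer generator walks A -> C; each outer pixel spawns an inner line to B. */
void fan_fill_triangle(int ax, int ay, int bx, int by, int cx, int cy)
{
    int dx = abs(cx - ax), sx = ax < cx ? 1 : -1;
    int dy = -abs(cy - ay), sy = ay < cy ? 1 : -1;
    int err = dx + dy;

    for (;;) {
        draw_line(ax, ay, bx, by);   /* inner line; outer state is preserved */
        if (ax == cx && ay == cy)
            break;
        int e2 = 2 * err;
        if (e2 >= dy) { err += dy; ax += sx; }
        if (e2 <= dx) { err += dx; ay += sy; }
    }
}
```

Pixels near B get written many times, and depending on the geometry a few interior pixels may be missed, so treat this as a starting point for experiments rather than a finished fill.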

I prefer holding 4 coordinates & drawing 4-sided filled polygons instead of triangles. It only requires 1 more coordinate register and 1 more line-generating counter in the outer loop algorithm, & you can now draw rectangles at any angle, or make coordinates C & D the same point and you have a triangle.

Though inefficient for triangles, (if you use 4 coordinates polygons, rectangles are completely efficient) this hardware triangle rendering will be done so fast that it would run circles around anything you can program the 68000 to do manually by software, especially if you include RGB faded transitioning.

Example, a system clock of 200Mhz means your algorithm will fill or draw 200 million pixels a second if your ram controller can keep up.
I suspect you will get 25-50 million RGB pixels per second.

Actually, if you always make the beginning of both lines in the 4-coordinate, 2-line filled polygon the center corner in your triangle, the triangle should be filled pretty efficiently. I haven't played with filling algorithms in years, so you may need some fiddling around to ensure there are no empty pixels inside the fill when doing the 4-sided polygon trick.

hamster_nz · « **Reply #13 on:** August 23, 2018, 09:55:52 pm »

Clipping to the visible screen is also a pain.

For filled objects I used to have one primitive shape, and to oversimplify it, each shape was defined by six numbers - x_left_start, x_left_end, x_right_start, x_right_end, y_top, y_bottom. Any triangle can be decomposed into at most two of these objects - Just cut across horizontally from the middle point in the Y value.

To un-simplify it a bit..

In the filling routines the x values were actually composite values, containing the four values needed to define the edge, including any DDA error values - the x_value, x_whole_step, x_error_step, x_accumulated_error, max_error. The variables x_left_end and x_right_end are not helpful when actually drawing.
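A compact C sketch of this decomposition, under assumptions: 16.16 fixed-point edge steppers stand in for the composite x values (whole step, error step and so on), coordinates are non-negative, and the names `fill_trapezoid`, `fill_triangle`, `hspan` and `pt_t` are invented here. Zero-height triangles are not treated specially.

```c
#include <stdint.h>

#define FB_W 40       /* hypothetical test framebuffer */
#define FB_H 40
#define FX_ONE 65536  /* assumed 16.16 fixed point for the edge steppers */

uint8_t fb[FB_H][FB_W];

typedef struct { int x, y; } pt_t;

/* Fill one row between two x positions (inclusive, either order), clipped. */
void hspan(int xa, int xb, int y)
{
    if (y < 0 || y >= FB_H) return;
    if (xa > xb) { int t = xa; xa = xb; xb = t; }
    if (xa < 0) xa = 0;
    if (xb >= FB_W) xb = FB_W - 1;
    for (int x = xa; x <= xb; x++)
        fb[y][x] = 1;
}

/* The six-number primitive: two edges stepped together from y_top to y_bottom. */
void fill_trapezoid(int x_left_start, int x_left_end,
                    int x_right_start, int x_right_end,
                    int y_top, int y_bottom)
{
    int rows = y_bottom - y_top;
    int64_t xl = (int64_t)x_left_start * FX_ONE + FX_ONE / 2;
    int64_t xr = (int64_t)x_right_start * FX_ONE + FX_ONE / 2;
    int64_t xl_step = rows ? (int64_t)(x_left_end - x_left_start) * FX_ONE / rows : 0;
    int64_t xr_step = rows ? (int64_t)(x_right_end - x_right_start) * FX_ONE / rows : 0;

    for (int y = y_top; y <= y_bottom; y++) {
        hspan((int)(xl / FX_ONE), (int)(xr / FX_ONE), y);
        xl += xl_step;
        xr += xr_step;
    }
}

/* Cut the triangle horizontally at its middle vertex: at most two primitives. */
void fill_triangle(pt_t a, pt_t b, pt_t c)
{
    pt_t t;
    if (a.y > b.y) { t = a; a = b; b = t; }   /* sort so a.y <= b.y <= c.y */
    if (b.y > c.y) { t = b; b = c; c = t; }
    if (a.y > b.y) { t = a; a = b; b = t; }

    /* x on the long edge a-c at the height of the middle vertex b */
    int xm = (c.y == a.y) ? c.x
           : a.x + (c.x - a.x) * (b.y - a.y) / (c.y - a.y);

    fill_trapezoid(a.x, b.x, a.x, xm, a.y, b.y);   /* upper part */
    fill_trapezoid(b.x, c.x, xm, c.x, b.y, c.y);   /* lower part */
}
```

The row at the middle vertex is drawn by both halves here, which is harmless for a solid fill but would matter for XOR or blended fills.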

https://www.tutorialspoint.com/computer_graphics/line_generation_algorithm.htm

Also, one important thing is to decide "where" you are sampling your shapes to get your pixels, and how your filling starts and ends at the edges. This may seem like a very stupid statement, but back in the '80s, when writing 3D graphics involved assembler, it was very, very important to get this right before you coded anything.

For example, using an image from the link above, the circles may be the sampling points used to get pixel values, and are aligned with the integer coordinates:

If your pixels are aligned with your drawing coordinates (as in this picture), then drawing 1-pixel wide lines is easy. The toughest it gets is deciding if you draw the first or last pixel on a line. But if you fill shapes, and two shapes of different colours share the same vertical or horizontal boundary, then the shape that is drawn last will 'own' the pixels on the boundary, causing all sorts of weird effects. A common solution was to offset your sampling by half a pixel.

Think about it like sketching a bitmap icon on graph paper - you take the colour at the center of the square not the average colour at where the lines cross, which might be the intersection of up to four different colours - you are offsetting the sampling point by (+0.5, +0.5) from the integer coordinates.
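The half-pixel rule can be written down as a tiny span calculation. This is a sketch, assuming 16.16 fixed-point edge positions and the convention that a pixel belongs to a span when its centre (x + 0.5) lies in the half-open range from x_left up to, but not including, x_right. The names `span_pixels`, `span_t` and `fx_ceil` are made up.

```c
#include <stdint.h>

#define FX_ONE  65536   /* assumed 16.16 fixed point */
#define FX_HALF 32768

/* Inclusive pixel range; the span is empty when last < first. */
typedef struct { int first, last; } span_t;

/* Ceiling of a 16.16 value; valid for v >= -FX_HALF. */
int fx_ceil(int32_t v)
{
    return (int)((v + (FX_ONE - 1)) / FX_ONE);
}

/* Pixels whose centres fall inside [x_left, x_right). */
span_t span_pixels(int32_t x_left, int32_t x_right)
{
    span_t s;
    s.first = fx_ceil(x_left - FX_HALF);
    s.last  = fx_ceil(x_right - FX_HALF) - 1;
    return s;
}
```

With this rule, two shapes that share an edge at x = 10.0 get pixels up to 9 and from 10 respectively, so no pixel is drawn twice and none is left out, whichever shape is drawn last.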

BrianHG · « **Reply #14 on:** August 24, 2018, 02:22:31 am »

Quote from: hamster_nz on August 23, 2018, 09:55:52 pm

https://www.tutorialspoint.com/computer_graphics/line_generation_algorithm.htm

It's been so long, but for efficient fast single color, single angled line generation, I remember playing with something like this on my Atari 800 way, way, way back. This is still something you can easily implement in Verilog with registers, adders and look ahead carry if statements and get 1 new pixel out every system clock cycle.

Now we just need to bug 'Legacy' to shove 10 of these home made modules into his FPGA and optimize it to 200Mhz and he will have a line rendering algorithm which can render 2 billion pixels a second.

How 'Legacy' decides to pipe those to his video memory now becomes the miracle...
How 'Legacy' gets his onboard 68000 to keep filling the registers of those 10 line generators will be another miracle in itself...
He will probably need to wire the command ports to his onboard dram, prepare in advance an entire string of line commands in sequence in that dram, and DMA in all of them to keep the line generators continuously active.

legacy · « **Reply #15 on:** August 24, 2018, 10:19:39 am »

Quote from: BrianHG on August 24, 2018, 02:22:31 am

in Verilog

VHDL

Quote from: BrianHG on August 24, 2018, 02:22:31 am

200Mhz ... 68000

Spartan3 @ 50Mhz at the moment, it will be switched to a Spartan6 @ 100Mhz
but it's not for 68000, it's for my own made Softcore Arise-v2 (sort of RISC-design without pipeline)

The above pic with 68SEC000 & FPGA was used just to quote a project that shows lines of Verilog about a documented (by Amiga) blitter.

Quote from: BrianHG on August 24, 2018, 02:22:31 am

need to wire the command ports to his onboard dram, prepare in advance an entire string of line commands in sequence in that dram, and DMA in all of them to keep the line generators continuously active.

currently, this is a big design problem. I have two FPGAs on my project, and they are not on the same PCB.

- one FPGA is used for the softcore Arise-v2, the debugger Otaku-v3, the PSX mouse and matrix keyboard controller(1) and basic devices (UART, timer, COPs, etc.) ... COP0 is for exceptions, COP1 is for DMA, COP3 is for CORDIC, COP4 is for DSP fixed-point saturated math ... this FPGA is 89% full
- one FPGA is used for the VDU, the PHY (VGA, it will be replaced by LVDS for LCDs), font ROM, video text RAM (in BRAM), the interface to the external static RAM for the framebuffer (up to 4 chips of 512 Kbyte means 2 Mbyte, but this is the max I can do), and here is where the blitter & copper stuff will be implemented

(1) I can free up to 1.4% of resources by moving the PlayStation 1 pad/mouse interface (it uses a proprietary protocol made by Sony, basically like SPI) and the matrix keyboard controller into an external CPLD, or they could be reimplemented in firmware on a little MPU, like a PIC/AVR8, then connected by serial port to the FPGA. Well, PC keyboards and mice are connected via serial PS/2, so ...

Code: [Select]

 ____________             ____________
|            |           |            |_____
|            |  link1    |            |     |
|            |==========>|            | PHY |==VGA== LCD
|    SoC     |  link2    |    VDU     |_____|
|  Arise-v2  |<=========>|            |
|            |  IRQ      |            |
|            |<----------|            |
|____________|           |____________|
   |      |                 |      |
   | dbug |                 | SRAM |
   |______|                 |______|

Therefore the problem is the isolation between FPGA1 (the SoC) and FPGA2 (the VDU). Currently, these two FPGAs communicate by two serial links that are "fast", but ... not so fast for video stuff.

The links are synchronous, full-duplex serial, 32-bit data size, 5 Mbps, Manchester-encoded.

typedef struct
{
    color_t  color;
    uint32_t x;
    uint32_t y;
} point_t;

Via the first link, the SoC sends commands like
- outchar(char,row,col,color)
- scroll_x
- scroll_y
- draw_line(Point1,Point2)
- draw_rectangle(Point1, Point2, Point3, Point4, is_filled)

These commands are put into a queue and served as soon as the engine is able to do it.
And an interrupt can be issued to tell the CPU the queue has become empty, thus ready again to process new commands.
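A minimal C model of such a command queue, as a sketch. The opcode names mirror the command list above, but the payload layout (`arg[4]`), the queue depth of 16 and the `irq_empty` flag standing in for the interrupt line are all assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_DEPTH 16   /* assumed depth */

typedef enum {
    CMD_OUTCHAR, CMD_SCROLL_X, CMD_SCROLL_Y, CMD_DRAW_LINE, CMD_DRAW_RECTANGLE
} cmd_op_t;

typedef struct {
    cmd_op_t op;
    uint32_t arg[4];     /* hypothetical payload words */
} cmd_t;

typedef struct {
    cmd_t    slot[QUEUE_DEPTH];
    uint32_t head;       /* free-running read index */
    uint32_t tail;       /* free-running write index; tail - head = fill level */
    bool     irq_empty;  /* models the "queue has become empty" interrupt */
} cmd_queue_t;

/* Link side: store one received command; false means the queue is full. */
bool queue_push(cmd_queue_t *q, cmd_t c)
{
    if (q->tail - q->head == QUEUE_DEPTH)
        return false;
    q->slot[q->tail % QUEUE_DEPTH] = c;
    q->tail++;
    q->irq_empty = false;
    return true;
}

/* Engine side: fetch the next command; false means nothing is pending. */
bool queue_pop(cmd_queue_t *q, cmd_t *out)
{
    if (q->tail == q->head)
        return false;
    *out = q->slot[q->head % QUEUE_DEPTH];
    q->head++;
    if (q->tail == q->head)
        q->irq_empty = true;   /* last command taken: raise the interrupt */
    return true;
}
```

Raising the interrupt only when the queue drains completely keeps the interrupt rate low, at the cost of a short idle gap before the CPU refills it. A "half empty" threshold would be one way to hide that gap.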

The second link is used by the CPU to directly access the video RAM. The video RAM is thus not on the CPU's bus: everything needs to go through the serial line, which is n times slower than the local bus (say, approximately 5 Mbps vs 500 Mbps, so at least 100 times slower).

I have a few problems to solve, i.e. how to connect the framebuffer ram to the CPU in a better way, this implies how to put the two FPGAs on the same PCB, in order to use a parallel-bus approach but it's pin-resources consuming .... 32bit for the data_in, 32bit for the data_out, and 22 bit for the address, 4 bit for the control line ... not so good, but it would be faster than a serial line

legacy · « **Reply #16 on:** August 24, 2018, 10:24:23 am »

Quote from: BrianHG on August 23, 2018, 08:39:35 pm

Can you expand your line code to hold 3 coordinates?

done, implemented

I am testing it on both the C prototype and the VHDL, and I am going to test it on the RTL simulator.
There are now four mini-blitters in parallel, for the trick you suggested (thanks!)

Code: [Select]

------------------[screen]------------------
|.......................................*|
|.....................................**.|
|...................................**.*.|
|.................................**..*..|
|...............................**....*..|
|.............................**.....*...|
|...........................**.......*...|
|.........................**........*....|
|.......................**..........*....|
|.....................**...........*.....|
|...................**.............*.....|
|.................**..............*......|
|...............**................*......|
|.............**.................*.......|
|...........**...................*.......|
|.........**....................*........|
|.......**......................*........|
|.....**.......................*.........|
|...**.........................*.........|
|.**..........................*..........|
|*............................*..........|
|.*..........................*...........|
|..*.........................*...........|
|...*.......................*............|
|....*......................*............|
|.....*....................*.............|
|......*...................*.............|
|.......*.................*..............|
|........*................*..............|
|.........*..............*...............|
|..........*.............*...............|
|...........*...........*................|
|............*..........*................|
|.............*........*.................|
|..............*.......*.................|
|...............*.....*..................|
|................*....*..................|
|.................*..*...................|
|..................*.*...................|
|...................*....................|

(the solid filling part is not yet implemented, probably I will use a copper to do it)

asmi · « **Reply #17 on:** August 24, 2018, 04:40:49 pm »

Quote from: legacy on August 24, 2018, 10:19:39 am

The links are synchronous, full-duplex serial, 32-bit data size, 5 Mbps, Manchester-encoded.

Why so slow? Most relatively modern FPGAs can do 600+ Mbps LVDS per diff pair, and they scale quite nicely onto multiple parallel lanes (though PCB routing becomes progressively harder as you add more lanes). If that isn't fast enough, a lot of FPGAs have versions with MGTs which can do 3-6Gbit/s per lane. Both options are trivial to implement if you have control over both sides.

legacy · « **Reply #18 on:** August 24, 2018, 05:27:14 pm »

Quote from: asmi on August 24, 2018, 04:40:49 pm

Why so slow?

the FPGA is clocked @ 50 Mhz. Also, the PCB and the debugger stuff (i.e. LA) are problematic for me.

legacy · « **Reply #19 on:** August 24, 2018, 05:34:38 pm »

Code: [Select]

------------------[screen]------------------
|........................................|
|........................................|
|........................................|
|..+++++++++++++++++++++++++++++++++++*..|
|..++++++++++++++++++++++++++++++++++*+..|
|..++++++++++++++++++++++++++++++++***+..|
|..+++++++++++++++++++++++++++++++*.*++..|
|..++++++++++++++++++++++++++++++*..*++..|
|..+++++++++++++++++++++++++++++*..*+++..|
|..++++++++++++++++++++++++++++*...*+++..|
|..++++++++++++++++++++++++++**...*++++..|
|..+++++++++++++++++++++++++*.....*++++..|
|..++++++++++++++++++++++++*.....*+++++..|
|..+++++++++++++++++++++++*......*+++++..|
|..++++++++++++++++++++++*......*++++++..|
|..++++++++++++++++++++**.......*++++++..|
|..+++++++++++++++++++*........*+++++++..|
|..++++++++++++++++++*.........*+++++++..|
|..+++++++++++++++++*.........*++++++++..|
|..++++++++++++++++*..........*++++++++..|
|..++++++++++++++**..........*+++++++++..|
|..+++++++++++++*............*+++++++++..|
|..++++++++++++*............*++++++++++..|
|..+++++++++++*.............*++++++++++..|
|..++++++++++*.............*+++++++++++..|
|..++++++++**..............*+++++++++++..|
|..+++++++*...............*++++++++++++..|
|..++++++*................*++++++++++++..|
|..+++++*................*+++++++++++++..|
|..++++*.................*+++++++++++++..|
|..++**.................*++++++++++++++..|
|..+*...................*++++++++++++++..|
|..**..................*+++++++++++++++..|
|..++***...............*+++++++++++++++..|
|..+++++***...........*++++++++++++++++..|
|..++++++++**.........*++++++++++++++++..|
|..++++++++++***.....*+++++++++++++++++..|
|..+++++++++++++***..*+++++++++++++++++..|
|..++++++++++++++++**++++++++++++++++++..|
|........................................|

so, using the copper unit to fill the triangle/rectangle is not as simple and efficient as it seems ...
I need to rethink it

asmi · « **Reply #20 on:** August 24, 2018, 06:18:42 pm »

Quote from: legacy on August 24, 2018, 05:27:14 pm

the FPGA is clocked @ 50 MHz.

That doesn't really matter - that's what the PLL/MMCM is for. For MGTs you would need a dedicated low-jitter clock source, but for LVDS it's not important.

Quote from: legacy on August 24, 2018, 05:27:14 pm

Also, the PCB and the debugger stuff (i.e. LA) are problematic for me.

You can use the ILA IP core for debugging. As for PCB traces - differential traces are quite forgiving, thanks to termination, as long as you stay below roughly 1 Gbps. Just about any 4+ layer stackup will do just fine, even without controlled impedance.

Gribo · « **Reply #21 on:** August 24, 2018, 07:07:10 pm »

Your fill rate is probably limited by the VDU's clock. You can replicate the rasterizer block 3 times (the LVDS/VGA PHY is the 4th) and divide the work between them at the expense of a more complicated SRAM interface. Have the PHY access the SRAM devices in a round-robin manner, and have the rasterizers use the other devices.
Older VGA cards usually used dual-port RAM for this, but it is very expensive.

BrianHG · « **Reply #22 on:** August 24, 2018, 08:22:21 pm »

Quote from: legacy on August 24, 2018, 10:19:39 am

currently, this is a big design problem. I have two FPGAs on my project, and they are not on the same PCB.

one FPGA is used for the VDU, the PHY (VGA, it will be replaced by LVDS for LCDs), font-ROM, video text-ram (by BRAM), interface to the external static-RAM for the framebuffer ram (up to 4 chips of 512Kbyte means 2Mbyte, but this is the max I can do) and here is where the blitter & copper stuff will be implemented

The entire command port cache (say, a 32-bit serial command port with a FIFO as well), the parallel buffer structure (4x2x3 x color x fill), and the active buffer (4x2x3 + color + filled, around 410 bits) are all on the same chip as the triangle filling engine. The transfer between them is nothing more than: if (ready to draw next triangle & go command), then (active triangle structure, 410 bits) = (triangle structure filled from the serial port, 410 bits). So, yes, you can cache and buffer the geometry engine like this: while a slow filled triangle is being drawn, you are still able to load the next triangle's structure. Once filled, it takes 1 clock to shift over those 410 bits and continue immediately. Only the command port FIFO, which is optional on the GPU FPGA, may be whatever size you like, since it might take multiple words to fill and instruct that 410-bit structure. It might be easier to skip the FIFO and keep just a single prior 410-bit structure which you actively fill, then shift along when ready. The idea is that the CPU shouldn't have to wait for a triangle to be filled as it sends out commands and then processes the next triangle.

Like you said, filling is slow. You need to paint each pixel. There should be no problem rendering a line at a time if you draw 2 sides of the triangle simultaneously, but this gets really complicated when rendering the bottom line of the triangle. I doubt you will be able to fit this in a small FPGA while achieving full clock speeds. This exceeds my knowledge of optimal triangle engines.
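To make the "walk 2 sides at once, one line at a time" idea concrete, here is a rough software sketch. It is only a model, not a hardware design: the function names are made up for illustration, integer vertex coordinates are assumed, and exact fractions stand in for the DDA/Bresenham-style incremental steppers a real engine would probably use.

```python
from fractions import Fraction
from math import ceil, floor

def edge_x(xa, ya, xb, yb, y):
    """Exact x position of the edge (xa,ya)-(xb,yb) on row y."""
    if ya == yb:  # horizontal edge: just report its first endpoint
        return Fraction(xa)
    return xa + Fraction((xb - xa) * (y - ya), yb - ya)

def scanline_spans(v0, v1, v2):
    """Return one (y, x_left, x_right) inclusive span per row of the triangle.

    Zero-height (fully degenerate) triangles are not handled specially.
    """
    # Sort vertices top to bottom so the long edge is always top -> bottom.
    (x0, y0), (x1, y1), (x2, y2) = sorted((v0, v1, v2), key=lambda v: v[1])
    spans = []
    for y in range(y0, y2 + 1):
        # Side A: the long edge from the top vertex to the bottom vertex.
        xa = edge_x(x0, y0, x2, y2, y)
        # Side B: the upper short edge, then the lower short edge.
        if y < y1:
            xb = edge_x(x0, y0, x1, y1, y)
        else:
            xb = edge_x(x1, y1, x2, y2, y)
        left, right = ceil(min(xa, xb)), floor(max(xa, xb))
        if left <= right:  # skip rows where no pixel centre is covered
            spans.append((y, left, right))
    return spans

spans = scanline_spans((2, 0), (0, 4), (4, 4))
```

Each span is a single horizontal run, so the memory side only has to do a burst write per row. The awkward part mentioned above shows up in the switch from the upper short edge to the lower one, and in flat top/bottom rows.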

Another possible optimal way may be to do a paint-style fill, but you need to examine the display memory contents, and it won't work right if there are other items on the screen.

BrianHG · « **Reply #23 on:** August 24, 2018, 09:10:04 pm »

I've searched around, and a different, easier approach may be a 3D tracing renderer. Basically, you scan a rectangular box covering the area of the screen where any part of the triangle you want to fill is located (using the outer coordinates of the triangle), and as you scan that rectangle, for each pixel, all you are trying to solve is whether the current X,Y coordinate is located inside the triangle. If so, draw the pixel; otherwise don't. The advantage here is that in the future, if you want to go 3D, it is easier to have triangles larger than your screen coordinates, and you don't need to scan outside of the screen's active area when filling. The waste here is that a slim, huge 45-degree triangle wastes a lot of time scanning the large rectangle, which has mostly empty pixels.
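A minimal sketch of that bounding-box scan, assuming integer vertex coordinates and using the sign of three "edge functions" (2D cross products) as the inside test. The names `edge` and `fill_triangle_bbox` and the `width`/`height` parameters are made up for illustration; this is one common way to do the test, not necessarily what any particular core does.

```python
def edge(ax, ay, bx, by, px, py):
    """Signed area term: which side of the edge A->B the point P lies on."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def fill_triangle_bbox(v0, v1, v2, width, height):
    """Return the list of (x, y) pixels inside the triangle, clipped to the screen."""
    # Bounding box from the outer coordinates, clipped to the active area,
    # so a triangle larger than the screen never scans off-screen pixels.
    xmin = max(min(v0[0], v1[0], v2[0]), 0)
    xmax = min(max(v0[0], v1[0], v2[0]), width - 1)
    ymin = max(min(v0[1], v1[1], v2[1]), 0)
    ymax = min(max(v0[1], v1[1], v2[1]), height - 1)

    area = edge(*v0, *v1, *v2)
    if area == 0:  # degenerate triangle, nothing to fill
        return []

    pixels = []
    for y in range(ymin, ymax + 1):
        for x in range(xmin, xmax + 1):
            w0 = edge(*v1, *v2, x, y)
            w1 = edge(*v2, *v0, x, y)
            w2 = edge(*v0, *v1, x, y)
            # Inside when all three terms share the sign of the total area
            # (this accepts either vertex winding order).
            if area > 0:
                inside = w0 >= 0 and w1 >= 0 and w2 >= 0
            else:
                inside = w0 <= 0 and w1 <= 0 and w2 <= 0
            if inside:
                pixels.append((x, y))
    return pixels

pixels = fill_triangle_bbox((0, 0), (4, 0), (0, 4), 8, 8)
```

Each edge term is linear in x and y, so stepping one pixel right or down only adds a constant. That means the per-pixel test can be done with adders and sign checks, with no multiplier in the inner loop.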

legacy · « **Reply #24 on:** August 27, 2018, 10:28:54 am »

Quote from: BrianHG on August 24, 2018, 09:10:04 pm

Basically, you scan a rectangle box of the area on the screen where any part of the triangle you want to fill is located (using the outer coordinates of the triangle)

Amiga does a trick with the blitter by a simple circuit that does some bit manipulation when it copies the bit-plane into the DMA-buffer ready to go to the video memory, but this trick is ... not so good and full of defects. E.g. it's not able to fill a triangle if one of the borders has an even number of points

I have recently bought "Amiga System Programmer's Guide", it shows this kind of problem and tricks to mitigate it, but I believe the problem needs a radically different approach.
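To show why this kind of fill is fragile, here is a simplified model of a parity-toggle row fill: scan one row of outline bits and flip an "inside" flag at every set bit. This is only a sketch of the general idea. It is not claimed to match the real Amiga blitter bit for bit, and `parity_fill_row` is a made-up name.

```python
def parity_fill_row(row):
    """Fill one row of 0/1 outline bits by toggling an 'inside' flag.

    Border pixels are always kept set (an inclusive-style fill).
    """
    out = []
    inside = False
    for bit in row:
        if bit:
            inside = not inside   # every outline pixel flips the state
            out.append(1)
        else:
            out.append(1 if inside else 0)
    return out

# Exactly one outline pixel per edge on this row: fills correctly.
good = parity_fill_row([0, 1, 0, 0, 1, 0])

# A lone vertex pixel: the flag never flips back, so the fill leaks right.
leak = parity_fill_row([0, 1, 0, 0, 0])

# One edge is two pixels wide on this row: the parity is thrown off.
wide = parity_fill_row([0, 1, 1, 0, 1, 0])
```

So the fill only works when every row of the outline has exactly one pixel per edge crossing, which an ordinary line drawer does not guarantee for shallow edges or at vertices.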

i.e. one needs to think about what "fill a triangle" means, which, well ... for a given triangle with vertices { V1(x,y),V2(x,y),V3(x,y) } and a point P(x,y), it's almost like finding "sort-of-weights" (what?) that tell us how much of P's X coordinate is made of { V1, V2, V3 }, and the same for P's Y coordinate:

function( P(x,y), V1(x,y),V2(x,y),V3(x,y) ) ----> { W1, W2, W3 }

Three numbers to tell:

W1: how much of P(x,y)'s coordinates is made of V1(x,y) ?
W2: how much of P(x,y)'s coordinates is made of V2(x,y) ?
W3: how much of P(x,y)'s coordinates is made of V3(x,y) ?
(with "sort-of-weights" I mean this)

I don't know if it can be generalized for a generic polygon, but for a triangle it's like solving a linear system of three equations, whose denominators of the first two equations are the same, whose expression of the third equation is the sum of W1,W2,W3, and whose solution { W1,W2,W3 } has an intriguing property:

if P(x,y) is actually inside of the triangle, then
range( W1 ) is { 0 .. 1 }
range( W2 ) is { 0 .. 1 }
range( W3 ) is { 0 .. 1 }

if P(x,y) is actually outside of the triangle, then at least one of { W1, W2, W3 } will be negative!

This can be used as an easy check to see if a point lies inside or outside of a triangle

All good, and exciting, except ...
1) computing { W1,W2,W3 } requires six dot-multiplications(1), and two divisions
2) this must be applied to all the points for the rectangle that contains the triangle

life is complex

(1) e.g. dot(v1, v2) = v1.x * v2.x + v1.y * v2.y
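A small sketch of the weights described above (usually called barycentric coordinates), under the assumption of integer vertex coordinates so that exact fractions can be used. The helper names are made up for illustration, and this is just one standard dot-product formulation, not necessarily the cheapest one for an FPGA.

```python
from fractions import Fraction

def dot(a, b):
    """2D dot product."""
    return a[0] * b[0] + a[1] * b[1]

def weights(p, v1, v2, v3):
    """Return (W1, W2, W3) such that P = W1*V1 + W2*V2 + W3*V3 and W1+W2+W3 = 1."""
    # Work relative to V1: P - V1 = W2*(V2 - V1) + W3*(V3 - V1).
    e2 = (v2[0] - v1[0], v2[1] - v1[1])
    e3 = (v3[0] - v1[0], v3[1] - v1[1])
    ep = (p[0] - v1[0], p[1] - v1[1])

    d22, d23, d33 = dot(e2, e2), dot(e2, e3), dot(e3, e3)
    d2p, d3p = dot(e2, ep), dot(e3, ep)

    denom = d22 * d33 - d23 * d23   # shared denominator of W2 and W3
    if denom == 0:
        raise ValueError("degenerate triangle")

    w2 = Fraction(d33 * d2p - d23 * d3p, denom)
    w3 = Fraction(d22 * d3p - d23 * d2p, denom)
    w1 = 1 - w2 - w3                # third equation: the weights sum to 1
    return w1, w2, w3

def inside(p, v1, v2, v3):
    """P is inside (or on the border) when no weight is negative."""
    return all(w >= 0 for w in weights(p, v1, v2, v3))

w_in = weights((1, 1), (0, 0), (4, 0), (0, 4))
w_out = weights((5, 5), (0, 0), (4, 0), (0, 4))
```

For a pure inside/outside test only the signs of the numerators and of the shared denominator matter, so the two divisions can be dropped. They are only needed if the actual weight values are wanted, e.g. for interpolating colors.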

