Author Topic: FPGA VGA Controller for 8-bit computer (Read 511026 times)

nockieboy · « **Reply #3575 on:** August 25, 2022, 09:48:58 pm »

Quote from: BrianHG on August 25, 2022, 05:11:38 pm

The code you attached would be slow as molasses.
It only has 4 opcodes.
It doesn't do INT <> FP conversion for you.
It has no integer 32bit math.

I would begin by making an MMU blitter for the DDR3 first, then make that MMU support 32bit INT & FP math within.

Okie dokie - but err... what's an MMU blitter?

BrianHG · « **Reply #3576 on:** August 25, 2022, 10:44:49 pm »

A nice one for you would contain these control registers:

functions:

Loop iterations, (number of cycles),
Source (A) bits (8/16/32/64/128),
source (B) bits (8/16/32/64/128),
dest bits (8/16/32/64/128),
function -, +, *, /, int>float, float>int, stencil mask output, bit shift results by #.

source (A) begin address.
source (A) inc step size integer.
source (A) inc step size fractional.
source (A) address loop limiter.

source (B) begin address.
source (B) inc step size integer.
source (B) inc step size fractional.
source (B) address loop limiter.

dest begin address.
dest inc step size integer.
dest inc step size fractional.
dest address loop limiter.

What do you think you can do with that?
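
As a rough sketch only, the register set above could be modelled in software like this before any HDL is written. Every class and field name here is made up for illustration; none of them come from an existing module.

```python
from dataclasses import dataclass, field
from fractions import Fraction

VALID_WIDTHS = (8, 16, 32, 64, 128)


@dataclass
class Channel:
    """One address channel (source A, source B or dest). Hypothetical names."""
    begin_address: int = 0
    inc_step_int: int = 0
    inc_step_frac: Fraction = Fraction(0)
    loop_limit: int = 0
    bits: int = 8

    def __post_init__(self):
        # Only the widths listed in the post are accepted.
        if self.bits not in VALID_WIDTHS:
            raise ValueError(f"unsupported width: {self.bits}")


@dataclass
class BlitterRegs:
    """The whole control block: loop count, function select and 3 channels."""
    iterations: int = 0
    function: str = "+"      # e.g. "-", "+", "*", "/", "int>float", "float>int"
    result_shift: int = 0    # "bit shift results by #"
    src_a: Channel = field(default_factory=Channel)
    src_b: Channel = field(default_factory=Channel)
    dest: Channel = field(default_factory=Channel)


regs = BlitterRegs(iterations=4, function="*")
```

Modelling it this way first makes it cheap to try out register combinations on a PC before committing anything to the FPGA.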

nockieboy · « **Reply #3577 on:** August 26, 2022, 08:09:41 am »

Well, I guess fast block operations on bulk memory? Taking two sources, performing a mathematical operation on them and putting the result somewhere else implies some form of image manipulation, merging/blending? Sprite blitting? Plus an FPU as well...

EDIT: Throw in a third source and you could use it as a mask for colour-independent transparency in source A/source B...?

DiTBho · « **Reply #3578 on:** August 26, 2022, 11:26:52 am »

can you define and list the mathematical operations you need for image-manipulations?

BrianHG · « **Reply #3579 on:** August 26, 2022, 01:05:02 pm »

Well, using what I have listed above, say in channel B we generate an 8+8+8+8 bit ramp from 0-255.
In channel A, we have a line of a bitmap 1024 pixels wide x 1024 pixels tall.

We can loop B to a limit of 1024, and increment A normally with a total 1048576 iterations, multiply 8x8 mode, result right shifted by 8. With this, we would have created a gradient of our source bitmap, left dark to right bright.

Changing the source (B) inc step size integer to 0 and source (B) inc step size fractional to 1/1024, the same operation would vertically generate a gradient from the top of the image dark to the bottom bright.
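
A small software sketch of the two gradient cases, assuming the integer and fractional step are treated as one fixed-point increment per iteration (that is only one reading of the description). The function names and the tiny 8 x 8 stand-in bitmap are invented for the example.

```python
from fractions import Fraction


def make_ramp(length):
    """8-bit ramp from 0 (dark) to 255 (bright) across `length` entries."""
    return [(i * 255) // (length - 1) for i in range(length)]


def gradient(pixels, ramp, b_step, b_limit):
    """Multiply every pixel by a ramp value, then shift the result right by 8.

    b_step  : how far source B advances per iteration, in ramp entries
    b_limit : source B wraps back to its start after this many entries
    """
    out = []
    b_pos = Fraction(0)
    for p in pixels:
        out.append((p * ramp[int(b_pos)]) >> 8)
        b_pos += b_step
        if b_pos >= b_limit:
            b_pos -= b_limit
    return out


WIDTH, HEIGHT = 8, 8                  # stand-in for the 1024 x 1024 bitmap
image = [255] * (WIDTH * HEIGHT)      # flat white bitmap, stored row after row
ramp = make_ramp(WIDTH)

# B steps once per pixel and loops every line: dark on the left, bright on the right.
horizontal = gradient(image, ramp, Fraction(1), WIDTH)
# B steps 1/WIDTH per pixel, so it moves on once per line: dark top, bright bottom.
vertical = gradient(image, ramp, Fraction(1, WIDTH), HEIGHT)
```

Note that a full-brightness pixel comes out as 254 rather than 255, because shifting right by 8 divides by 256, not 255.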

Or, do not use function B, change the source (A) inc step size integer to 0 and the source (A) inc step size fractional to 1/3, and change the total iterations to 1048576*3, and we would stretch the image by 3. Since this is linear, you could also change the bit depths to 16 and say we resampled an audio sample to 3x as long. Changing the source (A) inc step size integer to 3, we can shrink the sample to 1/3 of its length.
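
To make the stretch/shrink idea concrete, here is a minimal sketch that again assumes the read position simply advances by a fixed-point step each iteration. `resample` is a hypothetical helper, not part of any existing code.

```python
from fractions import Fraction


def resample(samples, step, iterations):
    """Nearest-sample copy: the read position advances by `step` each iteration."""
    out = []
    pos = Fraction(0)
    for _ in range(iterations):
        out.append(samples[int(pos)])
        pos += step
    return out


source = [10, 20, 30, 40]
# Step of 1/3 with 3x the iterations: every sample is repeated 3 times.
stretched = resample(source, Fraction(1, 3), len(source) * 3)
# Step of 3 with 1/3 of the iterations: only every third sample is kept.
shrunk = resample(stretched, Fraction(3), len(stretched) // 3)
```

This is plain sample repetition and decimation; any smoothing between samples would need the extra passes described next.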

Performing such a task 3 times, each with a different starting offset, with A being the just-resampled data beginning at an offset of 1 and B being the previously computed data, then summing them together and left-shifting the results by 3, would interpolate the results.

You can mix and multiply a table period of sine waves, loop processing the results with floats to perform complex filters on large data of sums as input B can be circular tables or even circular matrices for convolution filtering. Since we can mix floating and integer sources and destinations, you can do processing like FFTs on source data, even 2D ones with multiple passes.

I know modern DSPs can do much of this in 1 pass as they have lots of cache for 2D matrices each clock cycle, but we only have so much room left in this FPGA, so multipass with 1D, 2 point matrices will be what we are stuck with.

BrianHG · « **Reply #3580 on:** August 26, 2022, 01:19:26 pm »

Note that the function list would include logical and/or/xor/nor/nand/xnor ...
Also note that the write destination will have a read/modify/write function for superimposing stenciled graphics, or, 1 final summation/multiplication allowing for 2 functions to take place in 1 copy at the expense of slowing down to below half speed.

nockieboy · « **Reply #3581 on:** August 26, 2022, 09:00:49 pm »

Quote from: BrianHG on August 26, 2022, 01:05:02 pm

You can mix and multiply a table period of sine waves, loop processing the results with floats to perform complex filters on large data of sums as input B can be circular tables or even circular matrices for convolution filtering. Since we can mix floating and integer sources and destinations, you can do processing like FFTs on source data, even 2D ones with multiple passes.

I'm having a little trouble picturing what this could all be used for, beyond the required FPU functions to speed up floating-point calculations, multiplication and division (both integer and FP) for the host. Convolution filters will allow effects like blurs, edge detection etc., right?

Where do I make a start with this? Is there anything out there I could use as a base to start from, or can you walk me through this?

BrianHG · « **Reply #3582 on:** August 26, 2022, 09:47:25 pm »

Try X & Y scaling of geometric objects for the geometry processor.
Even X,Y,Z scaling. How about a geometry rotation matrix.
Well it can do that as well as 1-D sample and 2-D bitmap scaling and rotation.

I can't imagine you getting anywhere if you want the Z80 to perform a floating-point X & Y scale on a few thousand 32-bit coordinates in RAM. How fast will the Z80 be just to read a 32-bit X, read a 32-bit X scale, get a 32-bit result and copy it into the destination (96 bits in total), then do the same again for the Y axis, then loop this around a thousand points to render a geometric image?

You asked for an FPU. This unit just allows you to process huge chunks of data in the DDR3.

(Actually you need to remind me about how a rotation is done, I think we need to add a square-root to our available math operations.)

DiTBho · « **Reply #3583 on:** August 27, 2022, 01:58:49 am »

can you list the mathematical functions you are talking about so I can see *what* you need to compute stuff?

BrianHG · « **Reply #3584 on:** August 30, 2022, 02:34:20 am »

Quote from: DiTBho on August 27, 2022, 01:58:49 am

can you list the mathematical functions you are talking about so I can see *what* you need to compute stuff?

Hello Nockieboy? I believe this question is for you.

nockieboy · « **Reply #3585 on:** August 30, 2022, 04:00:18 pm »

Quote from: BrianHG on August 30, 2022, 02:34:20 am

Quote from: DiTBho on August 27, 2022, 01:58:49 am
can you list the mathematical functions you are talking about so I can see *what* you need to compute stuff?
Hello Nockieboy? I believe this question is for you.

Oh, thought it was for you for some reason!

Quote from: DiTBho on August 27, 2022, 01:58:49 am

can you list the mathematical functions you are talking about so I can see *what* you need to compute stuff?

Okay, well I'm no 3D graphics programmer so I'd be more than happy to be corrected by someone who knows more about this sort of stuff (or maths in general!), but I imagine that some of (if not all of) the following functions would be required for some sort of 3D graphics to be displayed, and thus would benefit from hardware acceleration and floating-point accuracy (this list is NOT exhaustive):

Linear/affine transformations
Translation
Rotation
Uniform/non-uniform scaling
Orthographic projection
Perspective projection
Reflection
Shearing

...all applicable to 3x3 or 4x4 matrices. Would like to steer clear of quaternions if at all possible - in 4D quaternion space, everyone, everywhere, everywhen can hear you scream!

Joking aside, having written the above list it's looking highly ambitious to my untrained eye. All I really want is an FPU that will allow me to add/sub/div/mult 32-bit floating-point numbers, simply and quickly. If all this additional stuff (block RAM operations etc) isn't going to cause big issues, then great, but I don't want to overstay BrianHG's patience or willingness to help.

FenTiger · « **Reply #3586 on:** August 30, 2022, 04:55:23 pm »

A lot of these are just 4x4 matrix multiplies. A hardware accelerated "multiply this list of 4-vectors by this matrix" could get you a long way.

For rotations you might want sin and cos operations too.
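
As an illustration of "multiply this list of 4-vectors by this matrix", here is a plain software sketch of a rotation about the Z axis. The function names are invented, and it uses Python floats rather than whatever 32-bit format the hardware would end up with.

```python
import math


def rotation_z(angle_rad):
    """4x4 matrix rotating points about the Z axis by angle_rad."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [
        [c, -s, 0.0, 0.0],
        [s, c, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]


def transform_all(matrix, vectors):
    """Multiply every [x, y, z, w] vector in the list by the same matrix."""
    return [
        [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        for vec in vectors
    ]


points = [[1.0, 0.0, 0.0, 1.0], [0.0, 2.0, 5.0, 1.0]]
rotated = transform_all(rotation_z(math.pi / 2), points)
```

Building the matrix needs one sin and one cos per angle; after that every vertex costs 16 multiplies and 12 adds, with no square root involved.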

nockieboy · « **Reply #3587 on:** August 30, 2022, 05:30:26 pm »

Quote from: FenTiger on August 30, 2022, 04:55:23 pm

A lot of these are just 4x4 matrix multiplies. A hardware accelerated "multiply this list of 4-vectors by this matrix" could get you a long way.

For rotations you might want sin and cos operations too.

Absolutely, it's 4D matrix multiplications, and fast sin/cos functions would be essential.

All of this requires an FPU that can perform 32-bit float multiplication very quickly. The bottleneck is going to be the 8-bit host, I guess - for any of this to be even vaguely taxed in terms of performance, it's going to need to be able to work flat-out on data loaded into the GPU's DDR RAM, which the host can populate in slow-time during 'loading', but the FPU/GPU can work on very quickly once it's all set up. All the host would need to do is update the position/rotation of the viewport camera and any actors in the scene - the rest could be handled by the GPU, I guess.

I could get a headache very quickly thinking about all this - I think it's all a step (or a few steps) on from creating the FPU, too.

DiTBho · « **Reply #3588 on:** August 31, 2022, 12:44:23 pm »

playstation-z80

BrianHG · « **Reply #3589 on:** August 31, 2022, 12:52:13 pm »

Quote from: DiTBho on August 31, 2022, 12:44:23 pm

playstation-z80

No, wasn't the Playstation using integer math, hence those chunky jumping square pixels?
Nockieboy wants floating point.

DiTBho · « **Reply #3590 on:** August 31, 2022, 02:31:20 pm »

Quote from: BrianHG on August 31, 2022, 12:52:13 pm

Quote from: DiTBho on August 31, 2022, 12:44:23 pm
playstation-z80
No, wasn't the Playstation using integer math, hence those chunky jumping square pixels?
Nockieboy wants floating point.

That's exactly * THE * point.
Fixed point was/is enough for Sony's PSX GTE.
What do you really want to achieve on the software side?

BrianHG · « **Reply #3591 on:** September 02, 2022, 06:10:24 am »

Quote from: DiTBho on August 31, 2022, 02:31:20 pm

Quote from: BrianHG on August 31, 2022, 12:52:13 pm
Quote from: DiTBho on August 31, 2022, 12:44:23 pm
playstation-z80
No, wasn't the Playstation using integer math, hence those chunky jumping square pixels?
Nockieboy wants floating point.
That's exactly * THE * point.
Fixed point was/is enough for Sony's PSX GTE.
What do you really want to achieve on the software side?

Speed wise, the FPGA can do 32bit int or float at 200MHz per multiplier and we are currently using ~5% of the available DSP blocks. Remember, the Z80 is running at 2 mips max.

Quote from: nockieboy on August 30, 2022, 05:30:26 pm

Absolutely, it's 4D matrix multiplications, and fast sin/cos functions would be essential.

Ok Nockieboy, ( with a note that you have been quiet for 3 days on this... ) First tell me what and from where you would be feeding into this so-called 4D matrix multiplication and what will be the outputs. It is dumb to make an 8-bit 2 MIPS Z80 fill a 100MHz 32bit x 32bit multiplier multiple times to generate a 4D matrix (4*4+4*4=32 bytes input per computation, not counting where to put the output bytes). Just moving the data through the Z80 alone will be a waste of significant time.

Start by showing me how you would use 1 multiplier to make a 4x4d matrix and what would be the factors.

nockieboy · « **Reply #3592 on:** September 02, 2022, 10:27:42 am »

Quote from: BrianHG on September 02, 2022, 06:10:24 am

Ok Nockieboy, ( with a note that you have been quiet for 3 days on this... ) First tell me what and from where you would be feeding into this so-called 4D matrix multiplication and what will be the outputs. It is dumb to make an 8-bit 2 MIPS Z80 fill a 100MHz 32bit x 32bit multiplier multiple times to generate a 4D matrix (4*4+4*4=32 bytes input per computation, not counting where to put the output bytes). Just moving the data through the Z80 alone will be a waste of significant time.

Start by showing me how you would use 1 multiplier to make a 4x4d matrix and what would be the factors.

I'm working away at the moment, so this is all even more vague than I'd usually be when confronted with technical questions like this on areas I have little knowledge about.

Just to set the scene, at this point I'd just like an FPU that can take two 32-bit floating point numbers and a 2-bit control signal and perform one of the basic mathematical operations on those two numbers to produce a 32-bit floating point result. This isn't beyond my own ability using an existing FPU module available online (although as you've pointed out with the previous FPU module I suggested, I might need a little help in picking the best one for the job). This is (initially, at least) purely to speed up integer and floating-point calculations on the host for BBCBASIC and general maths.

As usual though, I'm failing to see the full potential of what I'm asking for and it's clear that it could open up the possibility of some form of 3D graphics acceleration, which is the direction this conversation has taken and (in my opinion at least) is now several steps further on from my original idea of having an FPU as specified in the previous paragraph.

The 8-bit host will be the bottleneck in performance.
The only way I can see to bypass this is to pre-load the actors/camera etc. into GPU RAM before the 3D processing starts and leave the 3D engine to deal with the data, with the host making changes to the actor/camera data as required based on user/program input, via the 3D engine. In other words, once it has loaded the objects into the GPU's RAM, the Z80's need to manipulate 3D data would be minimised to sending commands to the 3D engine to rotate/scale/translate objects or the camera as required, including enabling/disabling the drawing of specific objects.

But already the language has changed from 'FPU' to '3D engine', with some big implications; there's a hell of a lot more to this than a basic FPU. It can't be impossible, though, because I remember playing Starglider, Driller and Total Eclipse, and those games ran on 8-bit hardware with less than 64KB to play with and no hardware acceleration at all; it was all done in software. Admittedly, the bar was very low at around 1-3 fps with <100 vertices on screen.

BrianHG · « **Reply #3593 on:** September 02, 2022, 05:58:12 pm »

Ok. When you've got the time, you can properly answer my question.

nockieboy · « **Reply #3594 on:** September 02, 2022, 10:00:22 pm »

Okay, so a typical operation on a 3D vertex would go something like this:

[Image: The maths required for an input vector and transformation matrix. The values in the rows would be added together to create a final 4D vector output.]

Each transformation matrix (on the left of the image above, comprising values a-p) would be pre-defined in the HDL and would define the translate, scale, rotation etc. operation. This would be chosen by the host (or the engine itself) by specifying the operation to be performed on the provided vertex's 4D vector.

[Image: An example of a translation matrix to move a vertex in 3D space.]

The 4D vector is the X,Y,Z and W values for the current vertex to be manipulated. These values could be up to 32-bit floats or integers, I guess, depending on the complexity of the engine. They'd be stored in GPU RAM, as part of an object's 3D model description, and be copied to the input vector location in RAM for the engine to read at the appropriate time.

The HDL would simply read the 4D vector and multiply its X,Y,Z,W values against the appropriate row/column values in the selected transform matrix, then reduce the resultant matrix down to a 4D vector again to produce the output which would be written to the output location in GPU RAM - the contents of the 3D space being rendered.
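
In software terms, the operation described above would look something like the following sketch. It uses the standard translation matrix layout with the offsets in the last column; the function names are my own and integers are used only to keep the numbers easy to follow.

```python
def translation(tx, ty, tz):
    """4x4 matrix that moves a vertex by (tx, ty, tz)."""
    return [
        [1, 0, 0, tx],
        [0, 1, 0, ty],
        [0, 0, 1, tz],
        [0, 0, 0, 1],
    ]


def transform(matrix, vertex):
    """Multiply each matrix row by [x, y, z, w] and sum it to one output value."""
    x, y, z, w = vertex
    return [row[0] * x + row[1] * y + row[2] * z + row[3] * w for row in matrix]


# Vertex at (1, 2, 3) with W = 1, moved by (5, -2, 10).
moved = transform(translation(5, -2, 10), [1, 2, 3, 1])
```

Here `moved` comes out as `[6, 0, 13, 1]`. Having W set to 1 is what lets the last column of the matrix act as a plain offset.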

I'll have a think about the matrix multiplication and see if I can post some pseudo-HDL (or maybe even proper SystemVerilog) by the end of the weekend.

BrianHG · « **Reply #3595 on:** September 02, 2022, 11:22:20 pm »

Quote from: BrianHG on August 25, 2022, 10:44:49 pm

A nice one for you would contain these control registers:

functions:

Loop iterations, (number of cycles),
Source (A) bits (8/16/32/64/128),
source (B) bits (8/16/32/64/128),
dest bits (8/16/32/64/128),
function -, +, *, /, int>float, float>int, stencil mask output, bit shift results by #.

source (A) begin address.
source (A) inc step size integer.
source (A) inc step size fractional.
source (A) address loop limiter.

source (B) begin address.
source (B) inc step size integer.
source (B) inc step size fractional.
source (B) address loop limiter.

dest begin address.
dest inc step size integer.
dest inc step size fractional.
dest address loop limiter.

What do you think you can do with that?

Ok, now take a look at my function above and add 1 new function when rendering the 'destination', that being accumulate the results until each 'inc step size fractional' carries into the next integer.
Do you think you can make this work to perform your 4D matrix function?

DiTBho · « **Reply #3596 on:** September 03, 2022, 08:43:00 am »

Why don't you first write the software side for PC, and see how it will be?

nockieboy · « **Reply #3597 on:** September 04, 2022, 10:14:28 am »

Quote from: BrianHG on September 02, 2022, 11:22:20 pm

Ok, now take a look at my function above and add 1 new function when rendering the 'destination', that being accumulate the results until each 'inc step size fractional' carries into the next integer.
Do you think you can make this work to perform your 4D matrix function?

Maybe. Well, I'm guessing 'yes' as you've asked the question.

So let's make some assumptions and work on 8-bit integers for simplicity:

Source A is the 3D vertex being worked on, presented as a 4D vector [X, Y, Z, W], with W being 1 and X, Y and Z being its position in 3D space.
Source B is the appropriate 4D transform matrix to use for the selected operation on the supplied 3D vertex.
Source A's start address is X in the 4D vector.
Source B's start address is the first element in the selected 4D transformation matrix, a.
'Accumulated'-Destination address is simply pointing to the X location in the 4D vector output location.

At this point I'm not too sure about the step size (integer or fractional), I'm assuming this increments to the next address for the next value in each factor? Fractional values would allow bit operations?

The multiplier performs the matrix multiplication as per the first image in my previous post, multiplying the values in each column of row 1 of Source B by the vector values in Source A, accumulating each result until the end of the loop (4 iterations) before writing the result to the Destination Address.
The Destination Address is incremented (by its step size) to the next output value (Y), then the previous step is repeated for row 2, then row 3, then row 4 in Source B.
Once row 4 has been completed, the last value is written to the Destination Address (W) and the operation is complete.
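
Written out as a software model, the sequence above might look like this. It is only a sketch of my own understanding: memory is a flat list of integers, the addresses are arbitrary, and overflow of 8-bit values is not modelled.

```python
def mac_transform(mem, a_addr, b_addr, dest_addr):
    """One multiplier plus one accumulator: 16 multiplies, 4 writes.

    a_addr    : address of X in the input vector (source A)
    b_addr    : address of element a in the transform matrix (source B)
    dest_addr : address of X in the output vector
    """
    for row in range(4):
        acc = 0
        for col in range(4):
            acc += mem[b_addr + row * 4 + col] * mem[a_addr + col]
        mem[dest_addr + row] = acc   # write after 4 accumulations, then move on
    return mem


mem = [0] * 64
mem[0:4] = [1, 2, 3, 1]              # source A: vertex [X, Y, Z, W]
mem[16:32] = [2, 0, 0, 0,            # source B: uniform scale by 2
              0, 2, 0, 0,
              0, 0, 2, 0,
              0, 0, 0, 1]
mac_transform(mem, a_addr=0, b_addr=16, dest_addr=48)
```

After the call, `mem[48:52]` holds `[2, 4, 6, 1]`.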

I'm obviously missing something with the 'step size fractional' - looks like you're intending it to be used to step the loop through each column of the transform matrix?

EDIT: In fact, if we're only ever doing a 4D vector x 4D matrix multiplication (fixed-dimension maths), is there any need to step at all? Couldn't we just parallelise the entire system and have them feed the multiplied values into the accumulator simultaneously?

DiTBho · « **Reply #3598 on:** September 04, 2022, 01:22:13 pm »

Quote from: nockieboy on September 04, 2022, 10:14:28 am

is there any need to step at all? Couldn't we just parallelise the entire system and have them feed the multiplied values into the accumulator simultaneously?

DSP slices are not magically 32bit, and they are not magically floating point, and even if you do fixed point you have larger infra logic.

Usually DSP slices are 16 bit sized

Can a 32x32 multiply be done with 16x16 multiplies?

Z = UMUL(X, Y)

Let us now split each of X and Y into two subwords of k bits each
X = 2^k*X1 + X0
Y = 2^k*Y1 + Y0

X1 is the integer formed by the k most significant bits of X
X0 is made of the k least significant bits of X.

The product Z=UMUL(X,Y) may be written

Z = 2^(2k)*X1*Y1 + 2^k*(X1*Y0 + X0*Y1) + X0*Y0 (1)

X0*Y0 ----> DSP slice
X0*Y1 ----> DSP slice
X1*Y0 ----> DSP slice
X1*Y1 ----> DSP slice
2^k* ----> shifter
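
A quick way to check the decomposition is to model it in software. This sketch assumes k = 16 and unsigned 32-bit inputs; `umul32` is just an illustrative name. The X1*Y1 term is shifted by 2k bits and the two cross terms by k bits.

```python
def umul32(x, y, k=16):
    """Unsigned 32x32 -> 64-bit multiply built from four k x k multiplies."""
    mask = (1 << k) - 1
    x1, x0 = x >> k, x & mask      # high and low halves of X
    y1, y0 = y >> k, y & mask      # high and low halves of Y

    p00 = x0 * y0                  # each of these would map to one DSP slice
    p01 = x0 * y1
    p10 = x1 * y0
    p11 = x1 * y1

    # The shifts are free wiring in hardware; the additions are the real cost.
    return (p11 << (2 * k)) + ((p10 + p01) << k) + p00


product = umul32(0x12345678, 0x9ABCDEF0)
```

The four partial products are independent, so they can run in parallel, but the adders that combine them are extra logic on top of the DSP slices.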

BrianHG · « **Reply #3599 on:** September 04, 2022, 07:51:29 pm »

Quote from: nockieboy on September 04, 2022, 10:14:28 am

Maybe. Well, I'm guessing 'yes' as you've asked the question.

So let's make some assumptions and work on 8-bit integers for simplicity:
Source A is the 3D vertex being worked on, presented as a 4D vector [X, Y, Z, W], with W being 1 and X, Y and Z being its position in 3D space.
Source B is the appropriate 4D transform matrix to use for the selected operation on the supplied 3D vertex.
Source A's start address is X in the 4D vector.
Source B's start address is the first element in the selected 4D transformation matrix, a.
'Accumulated'-Destination address is simply pointing to the X location in the 4D vector output location.
At this point I'm not too sure about the step size (integer or fractional), I'm assuming this increments to the next address for the next value in each factor? Fractional values would allow bit operations?
The multiplier performs the matrix multiplication as per the first image in my previous post, multiplying the values in each column of row 1 of Source B by the vector values in Source A, accumulating each result until the end of the loop (4 iterations) before writing the result to the Destination Address.
The Destination Address is incremented (by its step size) to the next output value (Y), then the previous step is repeated for row 2, then row 3, then row 4 in Source B.
Once row 4 has been completed, the last value is written to the Destination Address (W) and the operation is complete.
I'm obviously missing something with the 'step size fractional' - looks like you're intending it to be used to step the loop through each column of the transform matrix?

EDIT: In fact, if we're only ever doing a 4D vector x 4D matrix multiplication (fixed-dimension maths), is there any need to step at all? Couldn't we just parallelise the entire system and have them feed the multiplied values into the accumulator simultaneously?

Ok, well you are warming up to the idea.
A fractional step is used to re-iterate a source or destination value multiple times before going onto the next term.
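
One way to picture this is a small address generator model. This is a sketch under assumptions: the fraction is added to a carry counter every iteration, the integer step is applied only when that counter reaches 1, and the loop limiter counts those integer steps before wrapping back to the begin address. The function name is made up.

```python
from fractions import Fraction


def addresses(begin, inc_int, inc_frac, loop_limit, iterations):
    """Return the address used on each iteration of the loop."""
    out = []
    addr, carry, steps = begin, Fraction(0), 0
    for _ in range(iterations):
        out.append(addr)
        carry += inc_frac
        if carry >= 1:                 # fraction carried into the next integer
            carry -= 1
            addr += inc_int
            steps += 1
            if steps == loop_limit:    # loop limiter: wrap back to the start
                addr, steps = begin, 0
    return out


# Fraction of 1: step every iteration, wrapping after 4 reads.
looping = addresses(1000, 4, Fraction(1), 4, 8)
# Fraction of 1/4: hold each address for 4 iterations, then jump by 16.
held = addresses(200000, 16, Fraction(1, 4), 2_000_000_000, 8)
```

With these settings `looping` is 1000, 1004, 1008, 1012 twice over, and `held` is 200000 four times followed by 200016 four times.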

Maybe this will help you better see things: where do you store everything?
Assuming everything is a 32bit float...

For example, place [a] in address 1000, [b] in address 1004, [c] in address 1008... [p] in address 1060.

For [x,y,z,w], let's place them at addresses 0200000 to 0200012, with a group of 1000000 of these coordinates meaning addresses 0200000 -> 1799996.

And we want the output to begin at 8000000, consisting of the top row [ax+by+cz+dw] and so forth...

One method may be:

Code:

set source A-B function to 32x32bit floating multiply and output accumulator = on.
set loop iterations to 1000000x4.  (we need 4x the number of iterations since every B needs to be multiplied 4 times the number of A inputs)

// Process all the [x]
A   begin address =    1000, A   inc size int =  4,   A inc size fraction=1   , A   size loop limiter = 4.         (every 4 reads, loop back to begin address, A multiplier's input rotates around the read data [a,b,c,d])
B   begin address = 0200000, B   inc size int = 16,   B inc size fraction=0.25, B   size loop limiter = 2billion.  (every 4 loops, hop from [x] to the next [x])
Out begin address = 8000000, Out inc size int = 16, Out inc size fraction=0.25, Out loop size limiter = 2billion.  (accumulate every 4 A*B together into 1 address, then jump 4 to the next output address.)
Run

// Process all the [y]
A   begin address =    1016, A   inc size int =  4,   A inc size fraction=1   , A   size loop limiter = 4.         (every 4 reads, loop back to begin address, A multiplier's input rotates around the read data [e,f,g,h])
B   begin address = 0200004, B   inc size int = 16,   B inc size fraction=0.25, B   size loop limiter = 2billion.  (every 4 loops, hop from [y] to the next [y])
Out begin address = 8000004, Out inc size int = 16, Out inc size fraction=0.25, Out loop size limiter = 2billion.  (accumulate every 4 A*B together into 1 address, then jump 4 to the next output address.)
Run

// Process all the [z]
A   begin address =    1032, A   inc size int =  4,   A inc size fraction=1   , A   size loop limiter = 4.         (every 4 reads, loop back to begin address, A multiplier's input rotates around the read data [i,j,k,l])
B   begin address = 0200008, B   inc size int = 16,   B inc size fraction=0.25, B   size loop limiter = 2billion.  (every 4 loops, hop from [z] to the next [z])
Out begin address = 8000008, Out inc size int = 16, Out inc size fraction=0.25, Out loop size limiter = 2billion.  (accumulate every 4 A*B together into 1 address, then jump 4 to the next output address.)
Run

// Process all the [w]
A   begin address =    1048, A   inc size int =  4,   A inc size fraction=1   , A   size loop limiter = 4.         (every 4 reads, loop back to begin address, A multiplier's input rotates around the read data [m,n,o,p])
B   begin address = 0200012, B   inc size int = 16,   B inc size fraction=0.25, B   size loop limiter = 2billion.  (every 4 loops, hop from [w] to the next [w])
Out begin address = 8000012, Out inc size int = 16, Out inc size fraction=0.25, Out loop size limiter = 2billion.  (accumulate every 4 A*B together into 1 address, then jump 4 to the next output address.)
Run

Since we are using 1 multiplier and 1 adder to do the 4x4 matrix, I ran the function 4 times to illustrate what was going on. In actuality, this can be combined down to:

Code:

set source A-B function to 32x32bit floating multiply and output accumulator = on.
set loop iterations to 1000000x16.  (we need 16x to process everything...)

// Process all the [x,y,z,w] in a single shot.
A   begin address =    1000, A   inc size int =  4,   A inc size fraction=1   , A   size loop limiter = 16.        (every 16 reads, loop back to begin address, A multiplier's input rotates around the read data [a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p])
B   begin address = 0200000, B   inc size int =  4,   B inc size fraction=0.25, B   size loop limiter = 2billion.  (every 4 loops, hop from [x,y,z,w], to next [x,y,z,w])
Out begin address = 8000000, Out inc size int =  4, Out inc size fraction=0.25, Out loop size limiter = 2billion.  (accumulate every 4 A*B together into 1 address, then jump 4 to the next output address.)
Run

To run as fast as possible, you will need to choose base addresses which sit on 128bit boundaries as the 16 bytes used in 4x32bit words fit inside my DDR3 controller's cache. My first solution only requires the [a...p] multipliers sitting on 128bit boundaries while the object's coordinates [x..w] only need to be located on 32bit boundaries.

To make things >20x faster, we would need 4 ALU sections, each running my first option, all in parallel, taking in 128 bits at a time and outputting 128 bits at a time, with the table [a...p] pre-loaded from RAM into a cache. But this would be a dedicated ALU module just for this instead of a generic any-copy-and-compute module.

