Author Topic: Int to float


Offline BrianHG

  • Super Contributor
  • ***
  • Posts: 7868
  • Country: ca
Re: Int to float
« Reply #25 on: January 28, 2021, 03:53:22 pm »
Doing it the manual way:

The divides become multiplies by a factor left-shifted by 24 or 32 bits; then, after the integer multiply, the result is right-shifted by 24 or 32 bits.

IE, your fractional factors are multiplied by 16777216 or 4294967296, then the result after the multiplication is right-shifted (or just use the upper bits of the result).

IE: 32bit X 32bit actually has a 64bit result.  You are just using the top 32 bits of the result as your new bottom 32 bits.

Unless you want a 64bit accumulation in your core instead of a 32bit one.  That would actually have even more precision than summing floats, which only have 24 bits of precision.
 

Offline BrianHG

  • Super Contributor
  • ***
  • Posts: 7868
  • Country: ca
Re: Int to float
« Reply #26 on: January 28, 2021, 04:04:29 pm »

result = 4.5453* data + 0.0034128;


Example:
result = (   (4.5453*16777216) * data + (0.0034128*16777216)   )  >> 24;

...2^24=16777216
or, use 2^32=4294967296 in place of 16777216 to get extra precision.

However, when doing this in C, make sure you use 64-bit ints, as 32-bit will shave off the MSBs.
This is not a problem in an FPGA...
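
In C that looks like the sketch below (a minimal illustration with made-up names; BrianHG posted the formula, not this code):

#include <stdint.h>

/* result = 4.5453*data + 0.0034128 in fixed point: the constants are
   pre-scaled by 2^SH, and one right shift at the end recovers the
   integer. All intermediates must be 64-bit or the MSBs are lost. */
#define SH 24   /* use 32 for extra fraction precision, if headroom allows */

int64_t affine_fixed(int64_t data)
{
    const int64_t a = (int64_t)(4.5453    * (1LL << SH));
    const int64_t b = (int64_t)(0.0034128 * (1LL << SH));
    return (a * data + b) >> SH;   /* truncates the fraction bits */
}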

 

Offline ali_asadzadehTopic starter

  • Super Contributor
  • ***
  • Posts: 1914
  • Country: ca
Re: Int to float
« Reply #27 on: January 28, 2021, 08:15:26 pm »
BrianHG thanks for the clarification,

So let's consider a real example.
Suppose the data is 1950345, and we use 32 bits, i.e. 4294967296.

So you say I do it like this:

( (4.5453*4294967296) * 1950345 + (0.0034128*4294967296) ) >> 32?

So the final result is like this:
19521914850 * 1950345 + 14657864 = 38074469032781114, then >> 32
so the final result using fixed point is 8864903.

And if I use floating point I get this:
4.5453*1950345 + 0.0034128 = 8864903.1319128

That's good enough for this large input.
Now a lower input, for example Data = 123.

Float version:
4.5453*123 + 0.0034128 = 559.0753128
Fixed-point version:
( (4.5453*4294967296) * 123 + (0.0034128*4294967296) ) >> 32
equals 19521914850 * 123 + 14657864 = 2401210184414, then >> 32 = 559
Still not bad.

And finally something lower:
4.5453*23 + 0.0034128 = 104.5453128
Fixed-point version:
( (4.5453*4294967296) * 23 + (0.0034128*4294967296) ) >> 32
equals 19521914850 * 23 + 14657864 = 449018699414, then >> 32 = 104

If I feed in lower input data, the final result may lose some information. Is there a workaround?
Also, regarding resource usage, is it lower than the float version? I have found an FPU MUL unit on OpenCores that uses around 210 LUTs and 4 DSP slices.


How about calculating the ratio for (second harmonic) / (first harmonic),
like these sample numbers from the CORDIC core:
45338/1950345 = 0.0232461436
How should I do this calculation?

 


 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3216
  • Country: ca
Re: Int to float
« Reply #28 on: January 28, 2021, 10:46:41 pm »
When you do the calculation, you don't need to always shift by 32 (or 24). Say, if you multiply two 32-bit numbers, you can get higher precision if you shift right by 24 instead of 32 (but the result will be longer - 40 bits instead of 32). So, you write down all your formulae, decide for every item in the calculation how much it is shifted, and by selecting the shifts you make sure you get the desired precision.

If you go to floats, you basically give up your privilege of selecting which shift is used. Instead, for every number you store, you also store another number (called the exponent) which shows how much the number is shifted. That's what floats are - a pair of numbers, mantissa and exponent. This doesn't guarantee any accuracy, and you lose control of the shifts, so achieving the desired accuracy becomes more difficult. What you get in return, on a CPU, is a hardware FPU which can process floating-point numbers very quickly.

There's no FPU in an FPGA, so there's no benefit in using floating-point numbers - only overhead. For example, to add two numbers you need one (or two) barrel shifters to align the operands before you can add. That is unnecessary overhead; a fixed shift costs nothing in an FPGA.

Division requires lots of resources and will be very slow - it is slow in MCUs too - whether you use floating point or fixed integers. Technically your approach is the same as for multiplication.

A/B = A*4294967296/B >> 32.
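
As a hedged C sketch of that identity (illustrative names; assumes a non-zero divisor):

#include <stdint.h>

/* A/B with 32 fraction bits: pre-scale the numerator by 2^32 so the
   quotient keeps a fraction. Sketch only; assumes b != 0. */
uint64_t div_q32(uint32_t a, uint32_t b)
{
    return ((uint64_t)a << 32) / b;   /* result is a/b in Q32.32 */
}

/* e.g. div_q32(45338, 1950345) gives about 99841500,
   i.e. 0.0232461436 * 2^32 - the thread's harmonic-ratio example. */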
 
The following users thanked this post: nfmax, BrianHG

Offline BrianHG

  • Super Contributor
  • ***
  • Posts: 7868
  • Country: ca
Re: Int to float
« Reply #29 on: January 28, 2021, 11:08:54 pm »
I'm assuming:
4.5453* #

that the ' # ' is the 24bit ADC.  Remember, since you have chosen 32/64 bits, you can always left-shift the ADC by 8 bits (making the ADC source 32bit), IE multiply by 256.  This means the data out steps 0, 256, 512, 768...  And if 123 was still not bad, you now have more than double that minimum precision on the output: an ADC step of +/-1 becomes +/-256, more than twice your 123.  If, after all the combined filtering, you want 24 bits out, you can just divide the result by 256.

This is just another roundabout way of saying what NorthGuy posted just above: when choosing the bits of precision, you control the shift.
« Last Edit: January 28, 2021, 11:12:58 pm by BrianHG »
 

Offline BrianHG

  • Super Contributor
  • ***
  • Posts: 7868
  • Country: ca
Re: Int to float
« Reply #30 on: January 29, 2021, 02:39:17 am »

A/B = A*4294967296/B >> 32.

Well, since we are obviously talking about bit shifting.....

A/B = ((A<<32)/B) >>32.


I think that to get a reasonable fmax, a 64bit int divided by a 32bit int would have something like an 11-clock latency at least, if not double.  However, this is not bad if your code is piping in an uninterrupted stream of divides.
« Last Edit: January 29, 2021, 03:13:40 am by BrianHG »
 

Offline gnuarm

  • Super Contributor
  • ***
  • Posts: 2247
  • Country: pr
Re: Int to float
« Reply #31 on: January 29, 2021, 06:55:29 am »
In floating point the mantissa is left justified and in IEEE the most significant 1 is assumed.  So the shift is variable depending on where the most significant 1 bit is.  That also impacts the exponent. 

Yes, this is the correct way to assure the maximum definition.
I could have sworn that fixing the exponent, so that it is a fixed +23-bit integer offset, and filling the mantissa with a 23-bit integer would still work, so long as you do not go any larger than 23 bits.  Otherwise the hidden bit 24 would always be assumed to be 1 no matter what, and 0 would be impossible unless there is a specific pattern for 0, which I completely forgot about.  Maybe it had something to do with the old assembly PIC16 floating-point math routines I used 20 years back, where certain exceptions were accommodated.

There's nothing to say you couldn't do this for an arbitrary floating point format.  In IEEE format the hidden '1' will muck up the value.
 

Offline gnuarm

  • Super Contributor
  • ***
  • Posts: 2247
  • Country: pr
Re: Int to float
« Reply #32 on: January 29, 2021, 07:51:16 am »
When you do the calculation, you don't need to always shift by 32 (or 24). Say, if you multiply two 32-bit numbers, you can get higher precision if you shift right by 24 instead of 32 (but the result will be longer - 40 bits instead of 32). So, you write down all your formulae, decide for every item in the calculation how much it is shifted, and by selecting the shifts you make sure you get the desired precision.

If you go to floats, you basically give up your privilege of selecting which shift is used. Instead, for every number you store, you also store another number (called the exponent) which shows how much the number is shifted. That's what floats are - a pair of numbers, mantissa and exponent. This doesn't guarantee any accuracy, and you lose control of the shifts, so achieving the desired accuracy becomes more difficult. What you get in return, on a CPU, is a hardware FPU which can process floating-point numbers very quickly.

I can't follow that reasoning.  You don't give up any control; the shifting is simply done automatically to preserve the maximum resolution possible rather than storing a bunch of leading zeros.  Since the exponent takes bits, the question is whether it is better to use some of the bits for an exponent, extending the range, or to use them as more significant bits.  I think the examples shown indicate the preference is for the exponent.


Quote
There's no FPU in an FPGA, so there's no benefit in using floating-point numbers - only overhead. For example, to add two numbers you need one (or two) barrel shifters to align the operands before you can add. That is unnecessary overhead; a fixed shift costs nothing in an FPGA.

Again, I don't follow.  Floating point takes more hardware, but if that is what is required, that's what is required.  A shift is also known as a multiply.  I'm working on a design now that uses a single 36 x 18 multiplier for a non-IEEE math unit.  It will iteratively do the denormalize, the operation, and the normalize.  For a multiply there is no denormalize, of course, but all three steps use the same multiplier.  I have to add an optional shift by 18 at the output to handle the full range of possible shifts.  For math, the two mantissas are fed into the multiplier.  For shifting, one of the inputs is a shift parameter with a single 1 and the rest zeros.

If you need to do this without iteration it needs three multipliers.  Yes, more hardware, but most multipliers in FPGAs go unused anyway.  If the processing is not every clock cycle, then iterative conversions are practical. 


Quote
Division requires lots of resources and will be very slow - it is slow in MCUs too - whether you use floating point or fixed integers. Technically your approach is the same as for multiplication.

A/B = A*4294967296/B >> 32.

I think the hard part of this is the divide.  I'm using a Newton-Raphson iteration.  In my case I only need 18 bits of resolution.  A single block RAM will give me a seed accurate to at least 9 bits, and one turn of the NR crank gives me 18 bits.  IEEE 32-bit resolution requires a second turn of the crank.  The number of significant bits doubles on each iteration, so it converges quickly.  These iterations are not on full floating-point words, just the mantissas - or the integers, if you want fixed point.

Oh, almost forgot: NR calculates the inverse of D, so you still need to multiply N by this inverse, but again just the mantissa, not a full floating-point multiply.

I believe for NR you need to start with values between 0.5 and 1, or 1 and 2, depending on how you look at it.  So a sort of denormalization and normalization has to happen in fixed point as well, I expect.  I haven't really considered the fixed-point option very much.
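
For reference, a toy C model of that iteration - not gnuarm's design: the seed here is a crude linear fit rather than a 9-bit block-RAM table, so it needs about three turns instead of one:

#include <stdint.h>

/* Toy model of the Newton-Raphson reciprocal: x' = x*(2 - d*x)
   doubles the number of good bits per turn. Assumes d has already
   been normalized into [1.0, 2.0), held in Q2.30. The linear seed
   below is only good to ~3.5 bits, hence three turns; a 9-bit table
   seed would reach ~18 bits in a single turn. */
#define QF  30
#define ONE ((int64_t)1 << QF)

int64_t nr_recip_q30(int64_t d)                       /* d in [ONE, 2*ONE) */
{
    int64_t x = (int64_t)(1.45711 * ONE) - (d >> 1);  /* minimax line */
    for (int i = 0; i < 3; i++) {
        int64_t dx = (d * x) >> QF;                   /* d*x,       Q2.30 */
        x = (x * (2 * ONE - dx)) >> QF;               /* x*(2-d*x), Q2.30 */
    }
    return x;                                         /* ~ 1/d in Q2.30 */
}

Multiplying N's mantissa by the returned reciprocal then completes the divide, as noted above.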
 

Offline ali_asadzadehTopic starter

  • Super Contributor
  • ***
  • Posts: 1914
  • Country: ca
Re: Int to float
« Reply #33 on: January 29, 2021, 10:08:02 am »
Thanks for the points BrianHG, do you recommend any DIV algorithm?

gnuarm thanks for your feedback. Is your FPU open? Can we take a look at it? ^-^
 

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14939
  • Country: fr
Re: Int to float
« Reply #34 on: January 29, 2021, 02:58:03 pm »
Thanks for the points BrianHG, do you recommend any DIV algorithm?

If a latency of N clocks (one quotient bit per clock) is OK, you can implement a basic restoring or non-restoring division: https://en.wikipedia.org/wiki/Division_algorithm
They are pretty simple to implement in HDL.
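
A C model of the restoring flavor, as a sketch of what each HDL stage would do (unsigned, and assuming a non-zero divisor):

#include <stdint.h>

/* Bit-serial restoring division: one quotient bit per step, which is
   the loop an HDL version unrolls into N stages (hence the latency of
   one clock per bit). Unsigned sketch; assumes d != 0. */
void div_restoring(uint32_t n, uint32_t d, uint32_t *q, uint32_t *r)
{
    uint64_t rem = 0;
    uint32_t quo = 0;
    for (int i = 31; i >= 0; i--) {
        rem = (rem << 1) | ((n >> i) & 1);  /* bring down next bit     */
        if (rem >= d) {                     /* trial subtract fits...  */
            rem -= d;
            quo |= 1u << i;                 /* ...quotient bit is 1    */
        }                                   /* else restore (keep rem) */
    }
    *q = quo;
    *r = (uint32_t)rem;
}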
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3216
  • Country: ca
Re: Int to float
« Reply #35 on: January 29, 2021, 07:33:27 pm »
I can't follow that reasoning.  You don't give up any control; the shifting is simply done automatically to preserve the maximum resolution possible rather than storing a bunch of leading zeros.  Since the exponent takes bits, the question is whether it is better to use some of the bits for an exponent, extending the range, or to use them as more significant bits.  I think the examples shown indicate the preference is for the exponent.

Floating point has some fixed mantissa size, say 23 bits for singles. With scaled integers you have full control over the size of the numbers. For example, you can do multiple MACs accumulating the result; you get better precision if you use a longer accumulator and then chop it off at the end.

You certainly can do the same with floats - say, use 64-bit IEEE floats for the accumulator. But then you have to convert the arguments from 32-bit floats to 64-bit ones before doing the MAC. That is a lot of unnecessary work; if you try to simplify it, in the process of simplification you'll arrive at a simpler solution - scaled integers.
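
As a minimal sketch of the scaled-integer version (my illustration; the Q15 coefficients and widths are assumptions, not from the post):

#include <stdint.h>
#include <stddef.h>

/* MAC with a wide accumulator: 16-bit samples, Q15 coefficients,
   exact 32-bit products summed in 64 bits, so nothing is rounded
   until the single shift at the end. */
int32_t fir_mac(const int16_t *x, const int16_t *h, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)x[i] * h[i];   /* product fits in 32 bits */
    return (int32_t)(acc >> 15);       /* chop once, at the end   */
}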

Quote
There's no FPU in an FPGA, so there's no benefit in using floating-point numbers - only overhead. For example, to add two numbers you need one (or two) barrel shifters to align the operands before you can add. That is unnecessary overhead; a fixed shift costs nothing in an FPGA.

Again, I don't follow.  Floating point takes more hardware, but if that is what is required, that's what is required.

Required? By whom? Sure, there may be situations where floats are extremely beneficial, such as inverting nearly singular matrices. But such situations are rather rare. The majority of DSP tasks can be done without floats, faster and with fewer resources.
« Last Edit: January 29, 2021, 07:35:51 pm by NorthGuy »
 
The following users thanked this post: BrianHG

Offline gnuarm

  • Super Contributor
  • ***
  • Posts: 2247
  • Country: pr
Re: Int to float
« Reply #36 on: January 30, 2021, 12:28:55 am »
Thanks for the points BrianHG, do you recommend any DIV algorithm?

gnuarm thanks for your feedback, is your FPU open? can we take a look at it? ^-^

Yeah, it's part of an open-source project to build a ventilator, but it's not code yet.  It's still block diagrams and Forth evaluation code.
 

Offline gnuarm

  • Super Contributor
  • ***
  • Posts: 2247
  • Country: pr
Re: Int to float
« Reply #37 on: January 30, 2021, 12:48:35 am »
I can't follow that reasoning.  You don't give up any control; the shifting is simply done automatically to preserve the maximum resolution possible rather than storing a bunch of leading zeros.  Since the exponent takes bits, the question is whether it is better to use some of the bits for an exponent, extending the range, or to use them as more significant bits.  I think the examples shown indicate the preference is for the exponent.

Floating point has some fixed mantissa size, say 23 bits for singles. With scaled integers you have full control over the size of the numbers. For example, you can do multiple MACs accumulating the result; you get better precision if you use a longer accumulator and then chop it off at the end.

You certainly can do the same with floats - say, use 64-bit IEEE floats for the accumulator. But then you have to convert the arguments from 32-bit floats to 64-bit ones before doing the MAC. That is a lot of unnecessary work; if you try to simplify it, in the process of simplification you'll arrive at a simpler solution - scaled integers.

Sorry, I have no idea what you are talking about.  You seem to be constructing examples to prove a point, but the examples are not part of any problem anyone is trying to solve. 


Quote
Quote
There's no FPU in an FPGA, so there's no benefit in using floating-point numbers - only overhead. For example, to add two numbers you need one (or two) barrel shifters to align the operands before you can add. That is unnecessary overhead; a fixed shift costs nothing in an FPGA.

Again, I don't follow.  Floating point takes more hardware, but if that is what is required, that's what is required.

Required? By whom? Sure, there may be situations where floats are extremely beneficial, such as inverting nearly singular matrices. But such situations are rather rare. The majority of DSP tasks can be done without floats, faster and with fewer resources.

The OP has provided an example where the final result does not have enough resolution when using integers.  I don't want to put words in your mouth, but I expect you might say he could extend the precision of the intermediate fixed-point computations to get more resolution in the result... or, I suggest, he can use floating point.  Either way uses more hardware than a simple fixed-point approach.  Which is better?  Depends on the details.

I chose float because I need to handle a wide range of data using the same hardware.  Some of the computations are specified with constants of A*10^-6 and such, so I just got tired of trying to track the durn scale factors.  It was going to be harder to manage the tracking in fixed point than to just build the durn floating-point hardware.  The division algorithm requires normalization and keeping track of a real-time scale factor, which is essentially the same as floating point, so it is not really so large a leap to just use the durn floating point and give up on manual scaling.

Here is an example of the calculations I've been asked to perform.
 

Offline gnuarm

  • Super Contributor
  • ***
  • Posts: 2247
  • Country: pr
Re: Int to float
« Reply #38 on: January 30, 2021, 12:55:44 am »
Image didn't load correctly, let's try again.

Freaking Google Drive.  They can't just give you a link to a file with a file name and an extension.
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3216
  • Country: ca
Re: Int to float
« Reply #39 on: January 30, 2021, 02:43:14 am »
The OP has provided an example where the final result does not have enough resolution when using integers ... I suggest he can use floating point.

It doesn't work that way. Say the OP has 24-bit ADC readings - a 24-bit integer. What happens to the precision when you convert it to 32-bit floats, which have a 23-bit mantissa? For the upper half of the range (8388608 to 16777215), you lose one bit of resolution: 24 bits become 23. For the lower half of the range, the resolution is increased; say, for an ADC reading of 1, the floating-point representation adds 23 zeroes, which are totally meaningless.

When you perform a linear transformation, such as a FIR or FFT, it stays that way. Floating-point numbers provide excess resolution for smaller values and at the same time truncate resolution for larger ones. Moreover, there's rounding error, which accumulates: the more calculations you do, the more resolution you lose. The end result - you do more calculations and get less resolution.

In contrast, integers have the same resolution across the scale - if the ADC measures to 1 mV, integers will give you your 1 mV resolution whether you measure 10 mV or 1000 V.

I chose float because I need to handle a wide range of data using the same hardware.  Some of the computations are specified with constants of A*10^-6 and such, so I just got tired of trying to track the durn scale factors.  It was going to be harder to manage the tracking in fixed point than to just build the durn floating-point hardware.  The division algorithm requires normalization and keeping track of a real-time scale factor, which is essentially the same as floating point, so it is not really so large a leap to just use the durn floating point and give up on manual scaling.

I don't think this is an example of a wide range of data. Wide range would be if the flow changed between fractions of a ml and millions of liters. Even then, 32-bit integers give you a very large range - you can represent up to roughly 4000 l/s with 1 ul/s resolution. This is way beyond what is practically needed.

Most of the constants have very modest effective resolution, probably somewhere around 16 bits, which means the required accuracy of the final calculations is somewhere in the same range. So you wouldn't need big integers anyway.

There are a lot of other things that can be done to simplify the formulae. For example, the majority of the divisions are divisions by a constant, which can be replaced with multiplications.

Of course, you can just grab floating-point libraries and it all works off the bat, especially if you don't need fast calculations and use an FPGA with tons of DSPs.
« Last Edit: January 30, 2021, 02:45:43 am by NorthGuy »
 

Offline gnuarm

  • Super Contributor
  • ***
  • Posts: 2247
  • Country: pr
Re: Int to float
« Reply #40 on: January 30, 2021, 03:09:39 am »
The OP has provided an example where the final result does not have enough resolution when using integers ... I suggest he can use floating point.

It doesn't work that way. Say the OP has 24-bit ADC readings - a 24-bit integer. What happens to the precision when you convert it to 32-bit floats, which have a 23-bit mantissa? For the upper half of the range (8388608 to 16777215), you lose one bit of resolution: 24 bits become 23. For the lower half of the range, the resolution is increased; say, for an ADC reading of 1, the floating-point representation adds 23 zeroes, which are totally meaningless.

Hmmm... you start off with a fallacy.  IEEE 32-bit floating point has 24 significant bits in the significand (23 stored, plus the hidden leading 1).
https://en.wikipedia.org/wiki/Single-precision_floating-point_format#IEEE_754_single-precision_binary_floating-point_format:_binary32

So I guess you'll need to invent a 25-bit input to make this argument.  But it's not terribly relevant: in an FPGA the significand can be 25 bits, or 20 bits, or any number you want.  This is not a CPU where you are stuck with whatever the CPU gives you.


Quote
When you perform a linear transformation, such as a FIR or FFT, it stays that way. Floating-point numbers provide excess resolution for smaller values and at the same time truncate resolution for larger ones. Moreover, there's rounding error, which accumulates: the more calculations you do, the more resolution you lose. The end result - you do more calculations and get less resolution.

When you do math, the number of bits can grow.  Addition can produce what fixed point would call an overflow, but in floating point it is still just an addition - the exponent absorbs the growth.


Quote
In contrast, integers have the same resolution across the scale - if the ADC measures to 1 mV, integers will give you your 1 mV resolution whether you measure 10 mV or 1000 V.

Until you do an operation that exceeds the range, or a divide that requires fractions which get truncated.  What is 2,000,000 divided by 3,000,000 in fixed point?  Zero.  Oh, adjust the radix point in your head and it's... 0.666,667 - but now we have to track the moved radix point outside the real-time calculation.  And the same point in the calculations might elsewhere produce 0.000,007, where we have lost significant resolution.


Quote
Quote
I chose float because I need to handle a wide range of data using the same hardware.  Some of the computations are specified with constants of A*10^-6 and such, so I just got tired of trying to track the durn scale factors.  It was going to be harder to manage the tracking in fixed point than to just build the durn floating-point hardware.  The division algorithm requires normalization and keeping track of a real-time scale factor, which is essentially the same as floating point, so it is not really so large a leap to just use the durn floating point and give up on manual scaling.

I don't think this is an example of a wide range of data. Wide range would be if the flow changed between fractions of a ml and millions of liters. Even then, 32-bit integers give you a very large range - you can represent up to roughly 4000 l/s with 1 ul/s resolution. This is way beyond what is practically needed.

Please show me how to do these calculations in fixed point.  I didn't say it could not be done.  I said it was too big a PITA to bother with.  It is much easier to design the floating-point hardware (which really isn't that big a deal) than to manage the scale factors through all the calculations.  This is just one of several formulas; it is by far the most complex, but others have divides, and subtractions that produce the classic small difference between two large numbers.

One that isn't shown is a quadratic correction for temperature: T in kelvin has to be squared and multiplied by a coefficient in the ppm range.  Rather ugly in fixed point.  Each one of these calculations requires shifting to manage the radix point, which you seemed to think was a downside of floating point.  The reality is that floating point does exactly the same thing as fixed point; it just does it automatically, without my being involved in planning it.


Quote
Most of the constants have very modest effective resolution, probably somewhere around 16 bits, which means the required accuracy of the final calculations is somewhere in the same range. So you wouldn't need big integers anyway.

There's a lot of other things that can be done to simplify the formulae. For example, the majority of the divisions are divisions by a constant which can be replaced with multiplications.

If they were constant divisions I would not have mentioned them.  Off the top of my head, I know division is used in scaling sensor readings for the calibration that is done periodically.  The machine either has to do the division when it calculates the scale factor or when it uses the scale factor - pick one.  Other divides are done because that's the algorithm; the equations I showed you are full of them.

Quote
Of course, you can just grab floating point libraries and it all works off the bat, especially if you don't need fast calculations and use an FPGA with tons of DSPs.

There are libraries?  Oh, you mean for CPUs!  Never mind.
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2806
  • Country: nz
Re: Int to float
« Reply #41 on: January 30, 2021, 04:29:16 am »
Maybe somebody (i.e. the OP) should write some code and evaluate its performance?

Not HDL - just C or Python or something... they will most likely need to do this to validate their design anyway.

Maybe if they give us a file of representative data, we can write our own and report back.
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3216
  • Country: ca
Re: Int to float
« Reply #42 on: January 30, 2021, 05:47:24 am »
If they were constant divisions I would not have mentioned them.  Off the top of my head, I know division is used in scaling sensor readings for the calibration that is done periodically.  The machine either has to do the division when it calculates the scale factor or when it uses the scale factor - pick one.

I don't know what's at the top of your head. I looked at the formulae you posted - things like "Percentage of Oxygen in Dry air" or "Molecular mass of oxygen"...
 

Offline BrianHG

  • Super Contributor
  • ***
  • Posts: 7868
  • Country: ca
Re: Int to float
« Reply #43 on: January 30, 2021, 12:23:57 pm »

Division requires lots of resources and will be very slow - it is slow in MCUs too - whether you use floating point or fixed integers. Technically your approach is the same as for multiplication.

A/B = A*4294967296/B >> 32.
:palm:  How can I be so stupid as to forget.  When using a divider in HDL, do not shift bits around like that - it wastes gates and clock cycles/speed to divide a 64bit by a 32bit.  Just do a 32bit divide by a 32bit: the output you get is the quotient, BUT, don't forget, you also get a separate 32bit 'remainder' as well, without any extra gates...  This is where your precision lies.

Ok, early morning lemon thought... But yes, the remainder is useful.
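
In C terms, a sketch of that idea with illustrative widths (the divider core would produce quotient and remainder together; here / and % stand in for it):

#include <stdint.h>

/* One 32/32 divide yields quotient and remainder together; the
   remainder then becomes fraction bits via a second, narrower step.
   Widths are illustrative. Assumes b != 0. */
void div_with_frac(uint32_t a, uint32_t b, uint32_t *quot, uint32_t *frac16)
{
    *quot = a / b;                     /* integer part    */
    uint32_t rem = a % b;              /* rem < b, always */
    *frac16 = (uint32_t)(((uint64_t)rem << 16) / b);   /* 16 fraction bits */
}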
« Last Edit: January 30, 2021, 12:48:29 pm by BrianHG »
 

Offline ali_asadzadehTopic starter

  • Super Contributor
  • ***
  • Posts: 1914
  • Country: ca
Re: Int to float
« Reply #44 on: January 30, 2021, 02:49:05 pm »
BrianHG thanks for the update, but the example numbers produce 0.0232461436.
How can this be done in fixed point? By multiplying the first number by 4294967296?

Then, after all this, how should I tell the Cortex-M7 to treat the numbers efficiently, or accept them as float? Especially when the customer has about 500 floating-point parameters in the CM7.
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3216
  • Country: ca
Re: Int to float
« Reply #45 on: January 30, 2021, 03:20:31 pm »
Until you do an operation that exceeds the range, or a divide that requires fractions which get truncated.  What is 2,000,000 divided by 3,000,000 in fixed point?  Zero.  Oh, adjust the radix point in your head and it's... 0.666,667 - but now we have to track the moved radix point outside the real-time calculation.  And the same point in the calculations might elsewhere produce 0.000,007, where we have lost significant resolution.

Ok. I'll try to explain. Math is done with values; what we are discussing are different representations of the same value. It's all binary in hardware, but I'll use decimals for clarity.

Say, I have 20 dollars.

Floating point: 2.0E+1
Fixed point: 20.00

For scaled integers, I shift the position of the decimal point to the right by two. Put another way, I measure in cents, not dollars:

Scaled integers: 2000

Same thing for 0.01 dollars

Floating point: 1.0E-2
Fixed point: 0.01
Scaled integers: 1

Let's now do some math.

Addition:

Floating point: 2.0E+1 + 1.0E-2 = 2000.0E-2 + 1.0E-2 = 2001.0E-2 = 2.001E+1
Fixed point: 20.00 + 0.01 = 20.01
Scaled integers: 2000 + 1 = 2001

Note how floating point requires two extra steps - a shift to align the mantissas, and normalization at the end.

Multiplication:

I buy something for $205.40 and I need to pay 5.0% tax on it. How much is that?

Representation of the amount:

Floating point: 2.054E+2
Fixed point: 205.40
Scaled integers: 20540

Representation of the tax rate:

Floating point: 5.0E-2
Fixed point: 0.05
Scaled integers: 50 - I have chosen a shift of 3 decimal digits; one unit is 0.1%

Multiplication proceeds as follows

Floating point: 2.054E+2 x 5.0E-2 = (2.054*5.0)E(+2-2) = 10.27E0 = 1.027E+1
Fixed point: 205.40 x 0.05 = 10.27
Scaled integers: 20540 x 50 >>> 3 = 1027000 >>> 3 = 1027

Here ">>> 3" means shift right by 3 decimal digits.

Note that floating-point multiplication is easy, but there's still overhead - normalization at the end.

Scaled integers require a shift. It is decimal in the example, but binary in hardware. The amount of the shift is the same every time; thus, in an FPGA, it's nothing more than using a different set of wires - no overhead.

Division:

I buy something for $205.40 and I pay $10.27 in taxes. What is the tax rate?

Floating point: 1.027E+1 / 2.054E+2 = (1.027 / 2.054)E(+1-2) = 0.5E-1 = 5.0E-2
Fixed point: 10.27/205.40 = 0.05 (5%)
Scaled integers: 1027 <<< 3 / 20540 = 1027000 / 20540 = 50

Here "<<< 3" means shift left by 3 decimal digits.

Of course, here everything is dominated by the division, which will take a long time. Otherwise, it's the same as multiplication - normalization overhead for floats, no overhead for scaled integers.

In short - you can do any math with scaled integers, and it takes fewer operations than with floats, especially in an FPGA.

Dynamic range:

If you have numbers which differ in magnitude, you need very long integers to represent them. For example, the US national debt is

Floating point: 2.785208840732531E+13
Fixed point: 27852088407325.31
Scaled integers: 2785208840732531 (not really that big - only 52 bits)

If we want to deal with smaller numbers, we can use floats and chop off precision. Say, we chop it to 6 digits after the decimal point:

Floating point: 2.785209E+13
Fixed point: 27852088407325.31
Scaled integers: 2785208840732531

Now we can get away with smaller numbers (for example, 32-bit floats instead of 52-bit integers), but we did this by deliberately removing resolution, so the resolution will be less (not more). A floating-point number is an approximation.

This can lead to trouble in calculations. Say, we record the debt 10 minutes later:

Floating point: 2.785213E+13
Fixed point: 27852126073450.61
Scaled integers: 2785212607345061

How much money was borrowed during those 10 minutes?

Floating point: 2.785213E+13 - 2.785209E+13 = 4.000000E+7
Fixed point: 27852126073450.61 - 27852088407325.31 = 37666125.30
Scaled integers: 2785212607345061 - 2785208840732531 = 3766612530

See what happens: most of the digits produced by the floating-point calculation are totally bogus. If you just apply floating-point calculations without thinking about precision, the result may be less accurate than you think, if not outright wrong.
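
The tax multiplication above, written out as a C sketch (my code, mirroring the decimal shifts one-for-one):

#include <stdio.h>

/* Amounts in cents (shift 2), rate in 0.1% units (shift 3). The
   product carries shift 5, so one division by 10^3 brings it back
   to cents. */
int main(void)
{
    long long amount_cents = 20540;   /* $205.40          */
    long long rate_permil  = 50;      /* 5.0% = 50 x 0.1% */

    long long tax_cents = amount_cents * rate_permil / 1000;         /* ">>> 3" */
    printf("tax = %lld.%02lld\n", tax_cents / 100, tax_cents % 100); /* 10.27   */
    return 0;
}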
 
The following users thanked this post: ali_asadzadeh, SiliconWizard

Offline ali_asadzadehTopic starter

  • Super Contributor
  • ***
  • Posts: 1914
  • Country: ca
Re: Int to float
« Reply #46 on: January 30, 2021, 07:08:37 pm »
Thanks for the great example,  :-+ :-+
I get most of it, except how you decided to choose 3 for the >> and << in MUL and DIV?
 

Offline BrianHG

  • Super Contributor
  • ***
  • Posts: 7868
  • Country: ca
Re: Int to float
« Reply #47 on: January 30, 2021, 08:12:49 pm »
Thanks for the great example,  :-+ :-+
I get most of it, except how you decided to choose 3 for the >> and << in MUL and DIV?
I'm curious, do you know what M.N math is?

Say you want a 24-bit integer range with a 24-bit fraction range.

When you add, subtract, multiply, or divide, the first 24 bits to the left of the '.' are your integer portion, and the 24 bits on the right are your fraction.  The difference is that all the 0/1s on the left are powers of 2, IE 1, 2, 4, 8, 16..., while the bits to the right of the decimal place, the .N, are fractional powers of 2, IE 1/2, 1/4, 1/8, 1/16...

When doing addition or subtraction, you just treat the 24bit M, 24bit N as one huge 48-bit integer.  When getting the result from a multiply, you need to right-shift the result by 24 bits (the fraction width) to recenter the decimal point.  With divide, you need to left-shift the dividend by 24 bits first.  Addition and subtraction need no shift.

It is no different from decimal math.  It's just that you have decided to do everything in integers, but your integer has (decimal example) 16 digits, and you imagine a dummy decimal place in the middle: 8 digits of integer, 8 digits after the decimal point.

Authentic IEEE 754 floating point does much the same, but has only the 24-bit N (the mantissa) plus an 8-bit exponent which tells you how far to shift that number left or right in binary.  Adding or multiplying two small fractions in IEEE 754 gives a precise result.  But when adding/subtracting one huge number and one tiny fraction, that fraction may completely disappear, since the two exponents may be more than 24 bits apart.  A binary exponent just left- or right-shifts the bits; like in M.N math, it just moves the binary decimal place (a binary exponent is 2^exp, whereas in decimal the exponent is 10^exp - the 'E#' on your calculator in base-10 math).  When multiplying/dividing, the exponents of the huge number and the small number are first summed (or differenced), then a smaller 24-bit multiply/divide is done, while the new exponent is just the sum/difference of the 8-bit exponents.

Back to your earlier question about what to shift: working with a 48-bit integer may be good enough for your app if you aim to keep that binary decimal point in the middle.
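
A minimal C sketch of that Q24.24 (M.N = 24.24) arithmetic - illustrative only, with no overflow or rounding handling (a full version would keep the wider product, as the FPGA hardware does):

#include <stdint.h>

/* Q24.24: 24 integer bits, 24 fraction bits, held in a 64-bit int.
   A multiply yields 48 fraction bits, so shift right by the 24-bit
   fraction width; a divide pre-shifts the dividend left by 24 for
   the same reason. */
#define FBITS 24
typedef int64_t q24_24;

#define Q(x)  ((q24_24)((x) * (double)(1LL << FBITS)))   /* from a constant */

static q24_24 q_mul(q24_24 a, q24_24 b) { return (a * b) >> FBITS; }
static q24_24 q_div(q24_24 a, q24_24 b) { return (a << FBITS) / b; }

/* e.g. q_mul(Q(4.5453), Q(123.0)) is approximately Q(559.0753) */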

My issue is: where do you get a divide for GOWIN?  All the math routines, floating point and fixed point, are available for Altera FPGAs - including sine/cos/tan, square root, and complex-plane multiply - but I have never had a vendor limit my ability to do basic math, or int<>float conversion, unless we are talking over 15 years ago.


Just googling around, here is an integer and M.N fixed-point divider, but it cannot be pipelined.  You basically send it your numbers and wait an unknown number of clock cycles for the result.
It does support dividing a smaller number by a larger number (the second, M.N 'FIXED' point version on the page).  You would need to implement your own negative-number support, but that's a simple task.

https://projectf.io/posts/division-in-verilog/
(Sorry, everything today is in SystemVerilog.)
div_int is integer divide only.
div supports a total WIDTH for your binary numbers and a FIXED location FBITS which sets the position of the '.' in your M.N number within that WIDTH.  A WIDTH of 48 and an FBITS of 24 give you an M.N of 24 integer bits and 24 fraction bits.  (IE: you do not need to shift the dividend; it is designed to deal with the fixed decimal point of M.N-style math.)

Solved with the above divider in FIXED-point mode, so long as you set FBITS significantly above 0:
How about calculating the ratio division for finding the (second harmonic) / (first harmonic )
like these sample numbers from the CORDIC core
45338/1950345  =  0.0232461436
How should I do this calculation?


Since you already have a multiply, using the fixed-point M.N style you can run your floating-point-style code.
Using a parameter to set the overall WIDTH, and FBITS to set your fixed decimal point, you can do all the math in SystemVerilog with the above divide - except for negative numbers in the divide function.  To fix that, you need an absolute-value function, keep track of the two source numbers' polarities, and flip the polarity of the output number if needed.
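
A sketch of that sign handling in C (udiv_q24 here is a hypothetical stand-in for the unsigned divider core, not a real function from the linked page):

#include <stdint.h>

/* Sign handling around an unsigned divider: take absolute values,
   divide, and flip the result if the operand signs differ. */
extern uint64_t udiv_q24(uint64_t a, uint64_t b);   /* hypothetical core */

int64_t sdiv_q24(int64_t a, int64_t b)
{
    int neg = (a < 0) ^ (b < 0);              /* result polarity */
    uint64_t ua = (uint64_t)(a < 0 ? -a : a); /* absolute values */
    uint64_t ub = (uint64_t)(b < 0 ? -b : b);
    uint64_t uq = udiv_q24(ua, ub);
    return neg ? -(int64_t)uq : (int64_t)uq;
}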

As for FMAX and gate count, well, I never used that function, so, I do not know.

Also, if the gate count is small, you can run many in parallel and simulate a fixed-delay pipelined version to get continuous divides on every clock cycle - but you may not need such speed.
« Last Edit: January 30, 2021, 10:58:30 pm by BrianHG »
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3216
  • Country: ca
Re: Int to float
« Reply #48 on: January 30, 2021, 09:13:34 pm »
Thanks for the great example,  :-+ :-+
I get most of it, except how you decided to choose 3 for the >> and << in MUL and DIV?

Say, you calculate

C = A x B

The shift of C (Sc) is equal to shift of A (Sa) plus shift of B (Sb), which is, of course, the same as with floats.

In the example above

Sa = 2
Sb = 3

Therefore

Sc = Sa + Sb = 2 + 3 = 5

but we wanted the shift of the result to be 2, not 5. Hence we had to shift right by 3 to transform the shift from 5 into 2.

Similarly, with the division,

Sc = Sa - Sb

where Sa = 2 and Sb = 2

If we didn't apply a shift to A, the result would have a shift of zero (Sc = Sa - Sb = 2 - 2 = 0). But we want it to be 3 (one unit equal to 0.001, or 0.1%). Hence we had to increase Sa to 5; then Sc = Sa - Sb = 5 - 2 = 3. A already had shift Sa = 2, so to make it 5 we had to shift A left by 3.

That's the same as the M.N notation BrianHG is talking about: N is the shift (only binary, not decimal), and M + N is the total length of the integer you use; it is usually preceded by a Q. CPUs are limited to numbers of fixed size, so the most common format is Q1.15 (or simply Q15), but in an FPGA you're not restricted: you can use any M and N you see fit, choosing a separate M and N for each number to get exactly the resolution you need. The efficiency of multiplications will depend on the size of the DSP blocks, but additions can be done in fabric without restriction.
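
The division from the example, as a C sketch with the shift bookkeeping made explicit (my code, not NorthGuy's):

#include <stdio.h>

/* To land on Sc = 3 (0.1% units) with Sb = 2, raise Sa from 2 to 5
   by pre-multiplying the numerator by 10^3. */
int main(void)
{
    long long tax  = 1027;                 /* $10.27,  Sa = 2    */
    long long amt  = 20540;                /* $205.40, Sb = 2    */
    long long rate = tax * 1000 / amt;     /* Sc = (2+3) - 2 = 3 */
    printf("rate = %lld x 0.1%%\n", rate); /* prints 50, i.e. 5.0% */
    return 0;
}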

 

Offline BrianHG

  • Super Contributor
  • ***
  • Posts: 7868
  • Country: ca
Re: Int to float
« Reply #49 on: January 30, 2021, 09:21:59 pm »
BTW, I used 48 bits in my example above since the ADC is a 24-bit integer value, and when doing FFTs, 24 fraction bits should achieve a good output definition.  Using an M.N of 24.32 would retain perfect rounding definition if you are summing up to 256 factors together, IE 24+32 = 56-bit fixed-point precision.  (In a number of cases this renders better results than 32-bit floats - wide-dynamic-range source factors, as NorthGuy was trying to get across.)  (Weird - has someone else mentioned that 56-bit number...)
« Last Edit: January 30, 2021, 09:25:42 pm by BrianHG »
 

