Author Topic: Int to float (Read 14091 times)

ali_asadzadeh · « **on:** January 26, 2021, 08:26:17 am »

Hi,
Does anyone know of open-source HDL code for integer-to-float conversion?

Daixiwen · « **Reply #1 on:** January 26, 2021, 08:54:36 am »

Opencores has an FPU project (https://opencores.org/projects/fpu) that includes int to float and float to int conversions, in addition to floating point operations

gnuarm · « **Reply #2 on:** January 26, 2021, 09:13:39 am »

If you have hardware that is working on floating point numbers, the conversion between ints and floats can be done using that same hardware. To convert an int to float you just need to find the highest bit set to 1 and ensure that bit is shifted to the MSB of the mantissa, which is actually a hidden bit in the IEEE format. The exponent is set according to this shift count. This is the same operation that has to be done at the end of every floating point addition: renormalizing the result.

Converting from float to int is the initial step of a floating point addition, denormalizing the mantissa to match the exponents. In the case of a float to int conversion the idea is to denormalize the mantissa to achieve an exponent of zero.
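
As an illustration of the float-to-int direction described above, here is a minimal Verilog sketch. It is not taken from any of the projects mentioned in this thread; the module name and the choices to truncate toward zero and to saturate on overflow (including Inf and NaN) are assumptions made for the example.

```verilog
// Hypothetical module: IEEE 754 single precision -> signed 32-bit integer.
// Assumptions: truncate toward zero, saturate when out of range.
module float_to_int_sketch (
    input  wire [31:0] float_in,  // IEEE 754 single precision
    output wire [31:0] int_out    // two's complement
);
    wire        sign = float_in[31];
    wire [7:0]  exp  = float_in[30:23];
    wire [23:0] mant = {1'b1, float_in[22:0]};  // restore the hidden leading 1

    wire too_small = (exp < 8'd127);  // |x| < 1, including zero and denormals
    wire too_big   = (exp > 8'd157);  // |x| >= 2^31, including Inf and NaN

    // mant holds 1.f scaled by 2^23, so for exp in 127..157 the integer
    // part is (mant << (exp - 127)) >> 23.
    wire [7:0]  e    = exp - 8'd127;
    wire [53:0] wide = {30'b0, mant} << e[4:0];
    wire [31:0] mag  = too_small ? 32'd0 : {1'b0, wide[53:23]};

    // sign-magnitude back to two's complement
    wire [31:0] twos = sign ? (~mag + 32'd1) : mag;
    wire [31:0] sat  = sign ? 32'h8000_0000 : 32'h7FFF_FFFF;
    assign int_out   = too_big ? sat : twos;
endmodule
```

The variable shift here plays the role of the denormalizing shifter in an adder's front end; a design that already has one could reuse it instead of instantiating a second shifter.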

ali_asadzadeh · « **Reply #3 on:** January 26, 2021, 09:32:19 am »

Quote

Opencores has an FPU project (https://opencores.org/projects/fpu) that includes int to float and float to int conversions, in addition to floating point operations

Thanks, I have seen this, but the code lacks the int-to-float and float-to-int conversions; they are only mentioned in the description.

BrianHG · « **Reply #4 on:** January 26, 2021, 11:19:50 am »

If you are using Quartus, there is an LPM function for it, in both directions.

ali_asadzadeh · « **Reply #5 on:** January 26, 2021, 11:57:49 am »

Quote

If you are using Quartus, there is an LPM function for it, in both directions.

Thanks, I'm using Gowin :-\
Is there any good open-source code out there?

BrianHG · « **Reply #6 on:** January 26, 2021, 04:45:08 pm »

How many bits?
Which float format?

Wiljan · « **Reply #7 on:** January 26, 2021, 08:25:10 pm »

Intel/Altera has a cookbook with examples, and it contains a lot of nice code:

https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/manual/stx_cookbook.pdf

and the code
https://github.com/thomasrussellmurphy/stx_cookbook

The one you might be looking for is "fixed_to_float.v":

Code: [Select]

module fixed_to_float (
	fixed_sign,
	fixed_mag,
	float_out
);

parameter FIXED_WIDTH = 8; // must not be > 32
parameter FIXED_FRACTIONAL = 4;

input fixed_sign;
input [FIXED_WIDTH-1:0] fixed_mag;
output [31:0] float_out;

	wire [7:0] exponent;
	wire [31:0] unscaled_mantissa = fixed_mag << (32-FIXED_WIDTH);
	wire [31:0] scaled_mantissa;
	wire [4:0] scale_distance;

	scale_up sc (.in(unscaled_mantissa),
				.out(scaled_mantissa),
				.distance(scale_distance));
		defparam sc .WIDTH = 32;
		defparam sc .WIDTH_DIST = 5;

	assign exponent = 8'd127 + (FIXED_WIDTH-FIXED_FRACTIONAL) - 1
						- scale_distance;

	// Zero is special and gets an exponent of 0, not "something very small"
	assign float_out = &scale_distance ? {fixed_sign,31'h0} :
					{fixed_sign, exponent, scaled_mantissa[30:8]};
	
endmodule

BrianHG · « **Reply #8 on:** January 26, 2021, 08:52:25 pm »

Quote from: Wiljan on January 26, 2021, 08:25:10 pm

The one you might look for "fixed_to_float.v"

The source code for module 'scale_up' is missing in your example.

Wiljan · « **Reply #9 on:** January 26, 2021, 10:28:29 pm »

Code: [Select]

module scale_up (in,out,distance);

parameter WIDTH = 16;
parameter WIDTH_DIST = 4;

input [WIDTH-1:0] in;
output [WIDTH-1:0] out;
output [WIDTH_DIST-1:0] distance;

wire [(WIDTH_DIST+1) * WIDTH-1:0] shift_layers;
assign shift_layers [WIDTH-1:0] = in;

genvar i;
generate
	for (i=0;i<WIDTH_DIST;i=i+1)
	begin : shft
		wire [WIDTH-1:0] layer_in;
		wire [WIDTH-1:0] shifted_out;
		wire [WIDTH-1:0] layer_out;
		
		assign layer_in = shift_layers[(i+1)*WIDTH-1:i*WIDTH];

		// are there ones in the upper part?
		wire shift_desired = ~|(layer_in[WIDTH-1:WIDTH-(1 << (WIDTH_DIST-1-i))]);
		assign distance[(WIDTH_DIST-1-i)] = shift_desired;

		// barrel shifter
		assign shifted_out = layer_in << (1 << (WIDTH_DIST-1-i));
		assign layer_out = shift_desired ? shifted_out : layer_in;
		
		assign shift_layers[(i+2)*WIDTH-1:(i+1)*WIDTH] = layer_out;
								
	end
endgenerate

assign out = shift_layers[(WIDTH_DIST+1)*WIDTH-1 : WIDTH_DIST*WIDTH];

endmodule

SiliconWizard · « **Reply #10 on:** January 27, 2021, 12:05:02 am »

Not necessarily what you're looking for, but it's interesting to note that VHDL-2008 comes with packages which support fixed point and floating point, both supposed (from the standard) to be synthesizable!

For instance, there is the 'float_generic_pkg', which supports many FP operations, including a series of 'to_float' functions.

I don't know how well this is supported by FPGA tools though, knowing that full VHDL-2008 support is not always that great to begin with. I'll have to try that one of these days.

BrianHG · « **Reply #11 on:** January 27, 2021, 01:19:09 am »

Remember my question...

Quote from: BrianHG on January 26, 2021, 04:45:08 pm

How many bits?
Which float format?

If you have a <24-bit int going to a 32-bit IEEE 754 float, it's just direct wiring + a fixed, hard-wired exponent + the polarity bit.
Same goes if you have a <53-bit int going to a 64-bit float. (I.e., no converter needed.)

If your int is signed, you just need to negate the 2's complement value when it is negative and feed the polarity bit.
(If I remember correctly, 754 uses absolute value + the polarity bit, not 2's complement. This also means 1 extra bit on the int side. I.e., a 24-bit ADC feeds a 32-bit float without loss, almost directly wired.)

This is why I hate it when the OP doesn't specify anything at all...

gnuarm · « **Reply #12 on:** January 27, 2021, 02:43:49 am »

Quote from: BrianHG on January 27, 2021, 01:19:09 am

Remember my question...
Quote from: BrianHG on January 26, 2021, 04:45:08 pm
How many bits?
Which float format?
If you have a <24bit int to 32bit 754 float, it's just a direct wiring + fixed wired exponent + the polarity bit.
Same goes if you have a <53 bit int going to a 64bit float. (IE, no converter needed.)

If your int is signed, you just need to invert the 2's complement when negative and feed the polarity bit.
(If I remember correctly, 754 uses absolute value + the polarity, not 2's compliment. This also means 1 extra bit on the int side. IE, a 24bit ADC feeds a 32 bit float without loss, almost directly wired.)

This is why I hate it when the OP doesn't specify anything at all...

Yes, IEEE floating point is sign magnitude with a biased exponent. You can't just directly connect integer-format data to the mantissa of the floating point number. In floating point the mantissa is left justified, and in IEEE the most significant 1 is assumed. So the shift is variable, depending on where the most significant 1 bit is. That also impacts the exponent.

I am currently designing a floating point math unit, and converting to and from int is the easy part: just treat it like the appropriate portion of the ADD instruction, which has to denormalize the mantissa before the ADD and normalize the result after it.

BrianHG · « **Reply #13 on:** January 27, 2021, 03:51:06 am »

Quote from: gnuarm on January 27, 2021, 02:43:49 am

In floating point the mantissa is left justified and in IEEE the most significant 1 is assumed. So the shift is variable depending on where the most significant 1 bit is. That also impacts the exponent.

Yes, this is the correct way to assure the maximum definition.
I could have sworn that fixing the exponent so that it is a fixed +23 bit integer offset, and filling in the mantissa with a 23-bit integer number, would still work so long as you do not go any larger than 23 bits. Otherwise a hidden bit 24 would always be assumed to be 1 no matter what, and 0 would be impossible unless there is a specific pattern for 0 which I completely forgot about. Maybe it had something to do with the old assembly PIC16 floating point math routines I used 20 years back, where certain exceptions were accommodated.

ejeffrey · « **Reply #14 on:** January 27, 2021, 05:06:29 am »

Quote from: BrianHG on January 27, 2021, 03:51:06 am

Quote from: gnuarm on January 27, 2021, 02:43:49 am
In floating point the mantissa is left justified and in IEEE the most significant 1 is assumed. So the shift is variable depending on where the most significant 1 bit is. That also impacts the exponent.

Yes, this is the correct way to assure the maximum definition.
I could have sworn that fixing the exponent so that it is a fixed +23 bit integer offset, and filling in the mantissa with a 23-bit integer number, would still work so long as you do not go any larger than 23 bits. Otherwise a hidden bit 24 would always be assumed to be 1 no matter what, and 0 would be impossible unless there is a specific pattern for 0 which I completely forgot about.

In IEEE floating point there aren't multiple representations of the same value (except for NaNs) although -0 is distinct but equal to +0. If the exponent is non-zero a leading 1 is implicit. If the exponent is all zeros then you have a denormal number which doesn't have an implicit leading 1. Thus conveniently the all-bits-zero pattern is also floating point zero.

Non-zero integers are never denormal, so there is no need to worry about that; just handle zero specifically. If the number is zero, just output zero. Otherwise find the first one, shift it into position 24 (the implicit leading 1 that will be dropped when packed into the output register), and set the exponent based on how far you had to shift.
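
The recipe above can be sketched in Verilog for the 24-bit case, where the magnitude always fits in the 24-bit significand and no rounding is needed. This is an illustrative sketch only; the module name and port names are hypothetical, and the input is assumed to be two's complement.

```verilog
// Hypothetical module: signed 24-bit integer -> IEEE 754 single precision.
// Exact for every input, since the magnitude never exceeds 24 bits.
module int24_to_float_sketch (
    input  wire [23:0] int_in,    // two's complement
    output wire [31:0] float_out
);
    wire        sign = int_in[23];
    // magnitude; -2^23 maps to 24'h800000, which still fits in 24 bits
    wire [23:0] mag  = sign ? (~int_in + 24'd1) : int_in;

    // priority encoder: number of leading zeros (24 when mag is zero)
    reg [4:0] lz;
    integer i;
    always @* begin
        lz = 5'd24;
        for (i = 0; i < 24; i = i + 1)
            if (mag[i]) lz = 23 - i;  // the highest set bit wins
    end

    wire [23:0] norm = mag << lz;            // leading 1 now sits in bit 23
    wire [7:0]  exp  = 8'd150 - {3'b0, lz};  // bias 127 + 23 - shift count

    // zero is handled specially; otherwise drop the implicit leading 1
    assign float_out = (mag == 24'd0) ? 32'd0 : {sign, exp, norm[22:0]};
endmodule
```

For example, an input of 1 gives a shift count of 23 and an exponent of 127, which packs to 0x3F800000 (1.0).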

BrianHG · « **Reply #15 on:** January 27, 2021, 05:55:42 am »

Quote from: ejeffrey on January 27, 2021, 05:06:29 am

Non-zero integers are never denormal, so there is no need to worry about that; just handle zero specifically. If the number is zero, just output zero. Otherwise find the first one, shift it into position 24 (the implicit leading 1 that will be dropped when packed into the output register), and set the exponent based on how far you had to shift.

Thanks. Forgot that there was an implied hidden 1 always there.

ali_asadzadeh · « **Reply #16 on:** January 27, 2021, 07:18:27 am »

Thanks for all the feedback,
Actually, I have two scenarios in my design. The first one is easier: I get 24-bit data from an ADC and want to convert it to float (32-bit).
BrianHG, I'm not sure a direct connection from 24 bits (23 bits + 1 sign bit) to float is possible. Do you have any sample code?
I also have a nice CORDIC design from ZipCPU that produces 32-bit results, and I want to convert the output of this module to float (32-bit) too. This CORDIC core from ZipCPU has a maximum frequency of around 100 MHz on Gowin; I wonder if there are cores with better speed.

I like open-source designs. Gowin has a CORDIC core, but I prefer open source.

gnuarm · « **Reply #17 on:** January 27, 2021, 08:13:56 am »

So you want to convert values to float, but not do any math on them? Is this data going to a CPU with floating point capability? Why not let the CPU do the conversion?

I guess my point in talking about how to do the conversion is that this is not hard code to write. It consists of a priority encoder and a barrel shifter. An integer multiplier can be used as the barrel shifter. The exponent is just the priority encoder output subtracted from a constant. As someone else has mentioned the mantissa must be positive. So if your CORDIC output is signed, it needs to be complemented for negative values. Do you think you can't write this easily enough?
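
The remark above about using an integer multiplier as the barrel shifter relies on the identity that shifting left by n equals multiplying by 2^n. A minimal sketch, with a hypothetical module name and a 24-bit width chosen only for illustration:

```verilog
// Hypothetical module: left shift implemented as a multiply by a power of two.
// Whether the product maps to DSP blocks depends on the tool and the device.
module mult_as_shifter (
    input  wire [23:0] in,
    input  wire [4:0]  n,    // shift distance
    output wire [23:0] out   // same as in << n, bits above bit 23 are dropped
);
    wire [23:0] pow2 = 24'd1 << n;  // one-hot decode of the shift distance
    wire [47:0] prod = in * pow2;   // full-width product
    assign out = prod[23:0];
endmodule
```

The one-hot decode is cheap in LUTs, so the bulk of the shifter moves into the multiplier, which can be attractive when DSP blocks are otherwise idle.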

ali_asadzadeh · « **Reply #18 on:** January 27, 2021, 08:25:44 am »

I have written some DSP blocks in float inside the FPGA and they are working as expected. The end result should also go to a Cortex-M7 for further processing. Because this algorithm is so math intensive, I decided to do some parts of it inside the FPGA, since it has enough room and is relatively cheap: I can get a Gowin part with 20K LUTs, 48 DSPs, 48 block RAMs and 32MB of internal SDRAM for under $5, so why not.

I have written a test bench for Wiljan's code sample,

but it seems to give wrong results for both positive and negative numbers.

Sample test bench:

Code: [Select]

module fixed_to_float_tb ();

parameter MAIN_CLK_DELAY = 20;  // half period in time units; 25 MHz assuming a 1 ns time unit

reg r_Rst_L     = 1'b0;
reg r_Clk       = 1'b0;
// Clock Generators:
always #(MAIN_CLK_DELAY) r_Clk = ~r_Clk;


reg [31:0] r_a;
wire [31:0] w_Out;

fixed_to_float DUT(
.fixed_sign(r_a[31]),
.fixed_mag(r_a[30:0]),
.float_out(w_Out)
);


initial
begin

	r_Rst_L  = 1'b1;
	repeat(1) @(posedge r_Clk);
	r_Rst_L  = 1'b0;
	repeat(3) @(posedge r_Clk);
	r_a = 32'd1250302788;
	repeat(1) @(posedge r_Clk);
	r_a = 32'd8388608;
	repeat(1) @(posedge r_Clk);

	r_a = 32'd536870911;//32'h1fffffff;
	repeat(1) @(posedge r_Clk);
	r_a = 32'h00000001;
	repeat(1) @(posedge r_Clk);
	r_a = 32'd2147483520;//32'h7fffff80;
	repeat(1) @(posedge r_Clk);
	r_a = 32'd2147483584;//32'h7fffffc0;
	repeat(1) @(posedge r_Clk);
	r_a = 32'h80000000; //-2147483648
	repeat(1) @(posedge r_Clk);
	r_a = 32'h80000040;//-2147483584
	repeat(1) @(posedge r_Clk);
	r_a = 32'hffffffff;//-1
	repeat(1) @(posedge r_Clk);
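	r_a = 32'h00000000;
	repeat(1) @(posedge r_Clk);
	$finish; // end the run; the free-running clock would otherwise keep the simulation alive

end // initial begin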
	
endmodule

Parameters used in fixed_to_float for this test:

Code: [Select]

parameter FIXED_WIDTH = 31; // must not be > 32
parameter FIXED_FRACTIONAL = 0;

For example, the input 1250302788 should produce 0x4e950c37, but it produces 0x4e950c36. I know it's a small difference, but I want a perfect int-to-float conversion, since I will do lots of float ADD and MUL operations and they would increase the error.

I check my data and its results with this online calculator:

https://www.h-schmidt.net/FloatConverter/IEEE754.html
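
Two things in the test above are worth separating. First, fixed_to_float takes sign and magnitude, while the test bench passes the raw two's complement bits r_a[30:0] as the magnitude, so negative inputs would need to be negated first. Second, fixed_to_float keeps scaled_mantissa[30:8] and drops the lower bits, which is truncation; the expected 0x4e950c37 is what round-to-nearest gives for 1250302788. The following is a sketch, not the cookbook's code, of a 32-bit converter that handles both; the module name is hypothetical and round-to-nearest-even is assumed as the rounding mode.

```verilog
// Hypothetical module: signed 32-bit integer -> IEEE 754 single precision,
// with round-to-nearest-even (assumed rounding mode).
module int32_to_float_rne (
    input  wire [31:0] int_in,    // two's complement
    output wire [31:0] float_out
);
    wire        sign = int_in[31];
    // magnitude; 32'h80000000 negates to itself, which is 2^31 as unsigned
    wire [31:0] mag  = sign ? (~int_in + 32'd1) : int_in;

    // priority encoder: number of leading zeros (32 when mag is zero)
    reg [5:0] lz;
    integer i;
    always @* begin
        lz = 6'd32;
        for (i = 0; i < 32; i = i + 1)
            if (mag[i]) lz = 31 - i;  // the highest set bit wins
    end

    wire [31:0] norm   = mag << lz;   // leading 1 now sits in bit 31
    wire [23:0] kept   = norm[31:8];  // hidden bit + 23 fraction bits
    wire        guard  = norm[7];     // first discarded bit
    wire        sticky = |norm[6:0];  // OR of the remaining discarded bits

    // round up above the halfway point, or at halfway when kept is odd
    wire        round_up = guard & (sticky | kept[0]);
    wire [24:0] rounded  = {1'b0, kept} + {24'b0, round_up};

    // a carry out of the significand means it became 10.000..., so bump the exponent
    wire [7:0]  exp  = 8'd158 - {2'b0, lz} + {7'b0, rounded[24]};  // 127 + 31 - lz
    wire [22:0] frac = rounded[24] ? 23'd0 : rounded[22:0];

    assign float_out = (mag == 32'd0) ? 32'd0 : {sign, exp, frac};
endmodule
```

For 1250302788 (0x4A861B44) the shift count is 1, the kept bits are 0x950C36, and the discarded bits 0x88 are above the halfway point, so the significand rounds up to 0x950C37 and the packed result is 0x4e950c37.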

BrianHG · « **Reply #19 on:** January 27, 2021, 12:09:36 pm »

Quote from: ali_asadzadeh on January 27, 2021, 08:25:44 am

For example, the input 1250302788 should produce 0x4e950c37, but it produces 0x4e950c36. I know it's a small difference, but I want a perfect int-to-float conversion, since I will do lots of float ADD and MUL operations and they would increase the error.

I check my data and its results with this online calculator:

https://www.h-schmidt.net/FloatConverter/IEEE754.html

Not possible unless your integer is at most 24 bits (~16 million), or you are using a 64-bit double.

A 32-bit float only stores a 24-bit significand (23 stored bits plus the hidden bit).

It gets worse if you add a fraction to a full-scale (~16 million) 24-bit integer. That fraction will usually disappear.

NorthGuy · « **Reply #20 on:** January 27, 2021, 04:35:38 pm »

Quote from: ali_asadzadeh on January 27, 2021, 07:18:27 am

I got 24 bit data from ADC, and I want to convert it to Float (32bit)

I don't see any reason for converting to floats. All you describe can be done in (scaled) integers. Will work faster and will take less logic.

24-bit ADC cannot be very fast, so you should be able to keep up with it even if you do floats.

ali_asadzadeh · « **Reply #21 on:** January 28, 2021, 01:06:16 pm »

Quote

I don't see any reason for converting to floats. All you describe can be done in (scaled) integers. Will work faster and will take less logic.

24-bit ADC cannot be very fast, so you should be able to keep up with it even if you do floats.

My algorithm is like this: I have 8 channels of 24-bit simultaneous-sampling ADC at 16 ksps each. I need to do a 256-point FFT on each ADC channel, compute phase and amplitude from the FFT using CORDIC, and compute the RMS over a 256-sample window on each channel. I also need a division to find the ratio between the first and second harmonics of each channel from the FFT. Finally, I pass the phase and amplitude to a 16-point calibration unit (it uses a float MUL and ADD, something like m*x + a), which basically curve-fits (interpolates) to a line, so the final result is calibrated.

Please note that all these calculations must be repeated at 16 ksps, for each new sample that comes in, since the Cortex-M7 has to make fast decisions based on the results of all these calculations, and all ADC channels are sampled at the same time.

Your points are correct, but I do not know how to convert my whole algorithm to fixed point, since the final results should be in floating point: the customer/user will use these values to set the parameters of the CM7 algorithm, and there are a lot of them (on the order of 500 variables).

SiliconWizard · « **Reply #22 on:** January 28, 2021, 02:55:27 pm »

All of this can be done with fixed point indeed.
Off-loading the fixed-point to floating-point conversion to the "client" MCU would make more sense. It would add only marginal complexity on the MCU's side.

Not sure how you intended to make all of this work, but if providing end values in FP is the goal and you can't have the customer make the above modification, then doing everything in fixed point in the FPGA and only converting the end values to FP, instead of computing everything in FP, would take fewer resources on the FPGA.

To begin with, surely you didn't intend on implementing a FP FFT? So your FFT outputs are NOT FP? Or did you?

ali_asadzadeh · « **Reply #23 on:** January 28, 2021, 03:30:34 pm »

The FFT and CORDIC are fixed point. The CORDIC result should be converted to float, and the harmonics ratio division and the curve fitting should be done in float. But if I knew how it could be done in fixed point, it would help a lot, since my floating-point coefficients for curve fitting look like this.

These are real numbers that I have measured.

For example, suppose the input to the calibrate function is called data and I use a line formula for curve fitting. data is a 32-bit integer from the CORDIC (its magnitude). I calculate the result with this formula:

result = 4.5453* data + 0.0034128;

Or the first harmonic magnitude from the CORDIC is 1950345 and the second harmonic is 45338, so the ratio is
45338/1950345 = 0.0232461436

How can these two formulas (the first a MUL and an ADD, the second only a DIV) be done in fixed point?
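
One common way to approach the two formulas above in fixed point is to scale the constants by a power of two and keep track of where the binary point sits. The sketch below assumes 16 fractional bits (Q16), an unsigned data input, and hypothetical module and port names; the plain `/` operator is used for clarity, and a real design would more likely use a pipelined or sequential divider.

```verilog
// Hypothetical module: m*x + a and a ratio, both in Q16 fixed point.
// Assumption: 16 fractional bits; more bits would give a more accurate offset.
module fixed_point_sketch (
    input  wire [31:0] data,        // unsigned CORDIC magnitude
    input  wire [31:0] h1,          // first harmonic magnitude
    input  wire [31:0] h2,          // second harmonic magnitude
    output wire [63:0] result_q16,  // (4.5453*data + 0.0034128) * 2^16
    output wire [47:0] ratio_q16    // (h2/h1) * 2^16
);
    localparam [31:0] M_Q16 = 32'd297881;  // round(4.5453    * 65536)
    localparam [31:0] A_Q16 = 32'd224;     // round(0.0034128 * 65536)

    // integer * Q16 gives Q16, so the Q16 offset can be added directly
    assign result_q16 = data * M_Q16 + A_Q16;

    // pre-scale the numerator by 2^16 so the quotient keeps 16 fractional bits
    wire [47:0] num = {h2, 16'b0};
    assign ratio_q16 = (h1 == 32'd0) ? 48'd0 : num / h1;
endmodule
```

With the harmonic values quoted above (h2 = 45338, h1 = 1950345) the quotient is 1523, which read as Q16 is 1523/65536, or about 0.02324. Converting either output to float afterwards only requires subtracting 16 from the exponent that an integer-to-float converter would produce.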

ejeffrey · « **Reply #24 on:** January 28, 2021, 03:51:26 pm »

If I understand you correctly my first approach here would be to do up through the FFT and the 256 sample mean square in fixed point on the FPGA and then pass the results to the MCU that does the float conversion and final analysis. It sounds like you are only using a few points out of the FFT like a fundamental and one or more harmonics?

With a bigger FFT I would prefer to do floating point but for only 256 points it's not a big deal. Quantization noise will be in the ~29th bit. In principle a floating point FFT would let you get that much dynamic range while still using 24 bit multipliers but chances are your noise floor is nowhere near there anyway. If you were limited to smaller multipliers than 24 bit a floating point algorithm might let you conserve dynamic range and multiplier resources at the cost of additional complexity and LUT logic.

Doing 255 point overlap is usually overkill. You could probably get away with doing the FFT only after every 64 new samples, but if your customer wants to redo the full fft every sample you may not have a choice.

I didn't see a windowing step, you probably need one.

