Production code?
No, just wrote it offhand and failed. I tried to think of a good enough test case without digging through my archives. Production code uses 8 or 10 bits per component and 64-bit arithmetic (or two pairs of 32-bit values). Correct code would be
#include <stdint.h>

uint_fast16_t blend555(uint_fast16_t r5g5b5_0, uint_fast16_t r5g5b5_1, int_fast8_t phase)
{
    /* Expand each RGB555 value into a 32-bit "vector" with the 5-bit
       components at bits 0, 10, and 20, leaving 5 empty bits above
       each component so the products cannot overflow into each other. */
    const uint_fast32_t rgb0 = (r5g5b5_0 & 0x001F)
                             | ((r5g5b5_0 & 0x03E0) << 5)
                             | ((r5g5b5_0 & 0x7C00) << 10);
    const uint_fast32_t rgb1 = (r5g5b5_1 & 0x001F)
                             | ((r5g5b5_1 & 0x03E0) << 5)
                             | ((r5g5b5_1 & 0x7C00) << 10);

    /* Clamp the blend factor to 0..32. */
    if (phase < 0)
        phase = 0;
    if (phase > 32)
        phase = 32;

    const uint_fast32_t rgb = (32 - phase) * rgb0 + phase * rgb1 // Blending
                            + UINT32_C(0x01004010);              // Rounding: +16 ("0.5") per component

    /* Divide each component by 32 and pack back into RGB555. */
    return ((rgb >> 5) & 0x001F)
         | ((rgb >> 10) & 0x03E0)
         | ((rgb >> 15) & 0x7C00);
}
as you surmised. (This one I actually verified, you see. It produces the same results as using floating-point arithmetic for red, green, and blue separately, then rounding the results in the normal fashion.)
The 33 is actually a funky detail: \$(2^n + 1)(2^n - 1) = 2^{2 n} - 1\$. It means that scaling a value from \$n\$ bits to \$2 n\$ bits is done via multiplication by a factor between \$0\$ and \$2^n + 1\$, inclusive. Here we do a weighted sum instead of that kind of scaling, so adding "0.5" per component gives proper "rounding".
How does this relate to FPU?
Not directly: it was a sidetrack about the fundamental difference in paradigm between the Cortex-M and RISC-V architectures. It happened because I mis-chose the link to Compiler Explorer, and then suggested examining pretty bad code there.
The intent behind it was to show that although RISC-V is easy to generate code for, compiler optimizations are not as mature as for e.g. Cortex-M. You can see this if you simply compare the code generated by different versions of the same compiler; I suggest using -O2 -Wall -march=armv7e-m -mtune=cortex-m4 -mthumb for Cortex-M4. Between GCC 10.2.1 and 14.2.0 (none-eabi), the same assembly is produced for Cortex-M4, with -O2 and -Os having only trivial differences.
However, there was a tie. Let me explain (and apologies for the long post):
I picked the blend function as an example because it is something that is often done with floating-point arithmetic instead. The corrected version produces exactly the same values as the standard RGB blending, including rounding. The underlying idea is SIMD: expanding the argument(s) into a single integer "vector", with room between components so that they can never overflow into each other. Indeed, only two multiplications are needed using this scheme, one for each coefficient ("phase" and "inverse phase"). In practice, most of the operation cost is in expanding the "vector" from the source values, and packing the result "vector" back into a returnable value.
The technique can be used with up to 10 bits per color component using 64-bit arithmetic, but at some point, depending on the architecture, saving on the number of multiplications is just not worth the effort of unpacking and re-packing the "vector". It also does not work optimally for e.g. the RGB565 format, where the components have different sizes. For example, with 8 bits per color component, you can save multiplication operations using this method, but the packing and unpacking operations have a fixed "cost" (depending only on the number of components, not the size of each component). It may be less costly to do at least some of the multiplications separately, and avoid some or most of the packing and unpacking work.
This is analogous to FPU use in an IRQ with lazy stacking enabled. When lazy FPU stacking is enabled, any use of the FPU in an IRQ has a fixed cost compared to no FPU use at all, but additional FPU use beyond that incurs no extra cost. So, one must weigh the cost of implementing the calculation in fixed-point or other integer arithmetic against how much faster or easier to maintain the code would be if the FPU were used – including all possible uses in the IRQ, not just the key part that triggered the consideration. The details of how the lazy stacking of FPU registers (or any other registers) works do not matter at all, because we do not have a kernel-userspace type of interface at hand: it will just work.