Author Topic: STM 32F4 FPU registers and main() gotcha (Read 6807 times)

Nominal Animal · « **Reply #50 on:** July 26, 2024, 06:37:48 am »

Quote from: brucehoult on July 25, 2024, 12:06:01 pm

Using an unsigned type as a counter / loop index can make it tricky getting the termination condition right when counting down, as you usually want to (for both simple code and efficiency) update the counter before the exit test, not after, so you want to stop not when the counter is zero before updating it, but when it is negative after updating it.

For for loops, yes; but that's why I prefer the idiom
size_t i = number of elements;
while (i-->0) {
operate on element at index i
}
instead. I find the i-->0 expression easy to read and interpret and funny here.
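For anyone who wants to try it, here is a minimal compilable sketch of that idiom. The function name visit_backward and the order array are made up for illustration. The comparison uses the old value of i and the decrement happens afterwards, so the body sees n-1 down to 0 and the unsigned index never has to be "negative" inside the loop (after the loop i has wrapped around, which is well defined for unsigned types).

```c
#include <stddef.h>

/* Records the indices in the order the loop visits them and returns
   how many were visited. visit_backward is a hypothetical name. */
size_t visit_backward(size_t n, size_t *order)
{
    size_t visited = 0;
    size_t i = n;

    while (i-- > 0) {           /* test the old i against 0, then decrement */
        order[visited++] = i;   /* i is n-1, n-2, ..., 0 here */
    }
    return visited;
}
```

With n = 3 the recorded order is 2, 1, 0, and with n = 0 the body never runs.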

In many cases, however, when scanning through an array, the forward-scanning idiom
type *const end = array + number of elements;
for (type *ptr = array; ptr < end; ptr++) {
operate on element at *ptr;
}
can generate better code, although this does depend on the compiler and options used. Note that one can include the definition of end in the for loop start condition, although it does make the line quite long.
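A concrete version of the forward-scanning idiom, as a sketch only: the element type is assumed to be double and sum_forward is a made-up name.

```c
#include <stddef.h>

/* Sums count elements starting at array, scanning forward by pointer.
   sum_forward is a hypothetical name used for illustration. */
double sum_forward(const double *array, size_t count)
{
    const double *const end = array + count;   /* one past the last element */
    double sum = 0.0;

    for (const double *ptr = array; ptr < end; ptr++) {
        sum += *ptr;
    }
    return sum;
}
```

Whether this beats the indexed version depends on the compiler and options, so it is worth checking the generated code on your own target.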

The corresponding backward-scanning idiom
for (type *ptr = array + number of elements - 1; ptr >= array; ptr--) {
operate on element at *ptr;
}
suffers from a variant of the issue brucehoult mentioned; it requires the address of array to be nonzero.
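One way around that, as far as I understand the C standard (forming a pointer to before the start of an array is undefined, while one past the end is fine), is to start at the end and decrement before use. A sketch, with reverse_copy being a made-up name:

```c
#include <stddef.h>

/* Copies src[count-1] .. src[0] into dst[0] .. dst[count-1].
   reverse_copy is a hypothetical name used for illustration. */
void reverse_copy(int *dst, const int *src, size_t count)
{
    const int *ptr = src + count;   /* one past the end: valid to form */

    while (ptr > src) {
        --ptr;                      /* now points at a real element */
        *dst++ = *ptr;
    }
}
```

Here ptr never goes below src, so the comparison stays within the array (plus one past the end) regardless of where the array lives.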

SiliconWizard · « **Reply #51 on:** July 26, 2024, 08:39:38 am »

Yes. Alternatively, if you want to use a for loop, you can offset your index by 1. I guarantee you that it won't make a single difference in terms of performance on most targets. (Interestingly, for GCC on x86_64, the descending version looks faster than the ascending version - at least it has 1 fewer instruction in the compiled loop body - which admittedly may not necessarily mean that it's faster.)

Like:

Code: [Select]

double Foo2(size_t n, double A[n])
{
	double S = 0.0;

	for (size_t i = n; i > 0; i--)
		S += A[i - 1];
	
	return S;
}

Of course you may find this version potentially more confusing to read, with this offset. Pick your favorite.

Nominal Animal · « **Reply #52 on:** July 26, 2024, 08:53:24 am »

The loop idiom that I miss in C, especially for microcontrollers, is FORALL with Fortran 95/2003/later semantics. Essentially, it works like a for loop, except that the compiler can choose the iteration order freely. (Thus, the parameter block should only define exactly one scalar or vector iterator variable, and the range of values and step sizes in each dimension.)

bson · « **Reply #53 on:** July 29, 2024, 08:24:41 pm »

You could perhaps put your enable code in a function and give it the constructor attribute. This way it will run with the other static initializers from __main (in libgcc).

Code: [Select]

__attribute__((constructor))
static void enable_fpu(void) {
	__asm volatile
	(
		"	ldr.w r0, =0xE000ED88		\n" /* The FPU enable bits are in the CPACR. */
		"	ldr r1, [r0]				\n"
		"								\n"
		"	orr r1, r1, #( 0xf << 20 )	\n" /* Enable CP10 and CP11 coprocessors, then save back. */
		"	str r1, [r0]				\n"
		: /* no outputs */
		: /* no inputs */
		: "r0", "r1", "memory"			/* registers the asm overwrites */
	);
}

https://gcc.gnu.org/onlinedocs/gccint/Initialization.html

Edit: oh, and add r0 and r1 to the asm clobber list (now done above).
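To see the constructor mechanism without hardware, here is a PC-side sketch. Everything in it is an assumption made for illustration: fake_cpacr is an ordinary variable standing in for the real register at 0xE000ED88, and enable_fpu_sketch and fpu_access_enabled are made-up names. Only the mask (0xF << 20, i.e. 0x00F00000) is taken from the asm above.

```c
#include <stdint.h>

/* Same mask as in the asm: full access for CP10 and CP11. */
#define CPACR_CP10_CP11_FULL  (0xFu << 20)

/* Stand-in for the CPACR register so the sketch runs on a PC. */
static volatile uint32_t fake_cpacr = 0;

/* With GCC or Clang this runs before main(), provided the startup
   code actually walks .init_array. */
__attribute__((constructor))
static void enable_fpu_sketch(void)
{
    fake_cpacr |= CPACR_CP10_CP11_FULL;
}

/* Returns nonzero if both coprocessor fields are set to full access. */
int fpu_access_enabled(void)
{
    return (fake_cpacr & CPACR_CP10_CP11_FULL) == CPACR_CP10_CP11_FULL;
}
```

On the real part you would write to the register itself, and as far as I know ARM's own example follows the write with DSB and ISB barriers before the first FP instruction.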

wek · « **Reply #54 on:** July 29, 2024, 10:14:47 pm »

Quote from: bson on July 29, 2024, 08:24:41 pm

You could perhaps put your enable code in a function and give it the constructor attribute. This way it will run with the other static initializers from __main (in libgcc).

Unless you've removed __libc_init_array() from the startup code, for some of the good reasons... :-)

JW

peter-h · « **Reply #55 on:** July 31, 2024, 12:30:33 pm »

I was reading your article again, wek, and one could spend a week down that rabbit hole

I just put this in the linkfile

Code: [Select]


  /* this is for __libc_init_array() */

  .preinit_array :
  {
    PROVIDE_HIDDEN (__preinit_array_start = .);
    KEEP (*(.preinit_array*))
    PROVIDE_HIDDEN (__preinit_array_end = .);
  } >FLASH_APP
  .init_array :
  {
    PROVIDE_HIDDEN (__init_array_start = .);
    KEEP (*(SORT(.init_array.*)))
    KEEP (*(.init_array*))
    PROVIDE_HIDDEN (__init_array_end = .);
  } >FLASH_APP
  .fini_array :
  {
    PROVIDE_HIDDEN (__fini_array_start = .);
    /* sorted (prioritised) entries first, same order as .init_array */
    KEEP (*(SORT(.fini_array.*)))
    KEEP (*(.fini_array*))
    PROVIDE_HIDDEN (__fini_array_end = .);
  } >FLASH_APP

with some comments e.g. malloc etc may not work otherwise...
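For context, what consumes those start/end symbols is just a loop over an array of function pointers. In upstream newlib, as far as I can tell, __libc_init_array() walks .preinit_array, calls _init(), and then walks .init_array; whether ST's libc.a does exactly that is an assumption. The sketch below is a PC-side imitation: run_init_array, ctor_a, ctor_b and the demo helpers are all made-up names, and the table is a local array instead of a linker section.

```c
#include <stddef.h>

typedef void (*init_fn)(void);

/* Calls every function pointer in [start, end), in order. On the
   target, start and end would be __init_array_start and
   __init_array_end from the linker script. */
void run_init_array(init_fn const *start, init_fn const *end)
{
    for (init_fn const *fn = start; fn < end; fn++) {
        (*fn)();
    }
}

/* Two made-up constructors that record the order they ran in. */
static int call_log[4];
static size_t call_count = 0;

static void ctor_a(void) { call_log[call_count++] = 1; }
static void ctor_b(void) { call_log[call_count++] = 2; }

/* Runs both constructors through the walker; returns how many ran. */
size_t demo_init_walk(void)
{
    static init_fn const table[] = { ctor_a, ctor_b };

    call_count = 0;
    run_init_array(table, table + 2);
    return call_count;
}

/* Returns the id of the k-th constructor that ran (k < 4). */
int demo_call(size_t k)
{
    return call_log[k];
}
```

If the startup code never calls the walker, none of the constructors run, which is why removing __libc_init_array() silently disables anything that relies on them.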

bson · « **Reply #56 on:** August 02, 2024, 11:20:51 pm »

Quote from: wek on July 29, 2024, 10:14:47 pm

Quote from: bson on July 29, 2024, 08:24:41 pm
You could perhaps put your enable code in a function and give it the constructor attribute. This way it will run with the other static initializers from __main (in libgcc).
Unless you've removed __libc_init_array() from the startup code, for some of the good reasons... :-)

JW

I wouldn't... unless you implement everything from scratch including, as peter-h mentioned, malloc. While somewhat rare, it IS a used feature in many libraries and it's easy to trip over a deeply buried constructor function somewhere.

I also use C++, and while I don't rely on static initializers for non-POD types, it could still happen with composite POD types, especially with bloated debug builds with minimal compiler optimizations and lacking aggressive linker gc. I'd rather address such things as optimizations at some point than as crashes during init. Mostly, just copying the .data initializers from flash to RAM and zeroing the rest of memory should be enough, but it's so easy to trip over.

However, I actually agree it's better to put it in a separate hardware _init function called separately prior to _main. The reason is to get a predictable initialization order. If the code is built with hard fp, then other constructor functions could very easily try to save/restore FP registers, although it is admittedly a bit of a stretch. Still, better to stay on the right side of the odds...

wek · « **Reply #57 on:** August 05, 2024, 08:53:56 am »

Quote from: bson on August 02, 2024, 11:20:51 pm

Quote from: wek on July 29, 2024, 10:14:47 pm
Quote from: bson on July 29, 2024, 08:24:41 pm
You could perhaps put your enable code in a function and give it the constructor attribute. This way it will run with the other static initializers from __main (in libgcc).
Unless you've removed __libc_init_array() from the startup code, for some of the good reasons... :-)

JW
I wouldn't... unless you implement everything from scratch including, as peter-h mentioned, malloc. While somewhat rare, it IS a used feature in many libraries and it's easy to trip over a deeply buried constructor function somewhere.

Define *everything*

Given __libc_init_array() and related is poorly defined (see rabbit hole), that's not a simple thing to do.

But that's the overarching theme here. Would the toolmakers' attitude be "here's a description of each step of startup process, together with explanation and reasoning behind the individual steps" instead of "we, the almighty and powerful, have performed all the magic needed to make it work, don't you dare to question it", we wouldn't have this discussion at all.

JW

westfw · « **Reply #58 on:** August 05, 2024, 08:04:47 pm »

Quote

Would the toolmakers' attitude be "here's a description of each step of startup process, together with explanation and reasoning behind the individual steps"

avr-gcc's documentation of the nine (!) different initialization steps that happen before main is pretty nice.
https://www.nongnu.org/avr-libc/user-manual/mem_sections.html

(and another 9 sections that happen after exit(), though most are unused.)

peter-h · « **Reply #59 on:** August 06, 2024, 07:59:22 am »

The problem with __libc_init_array() etc is that we don't actually know the definitive sources for the newlib libc.a. There was a thread here a couple of years ago on this and the likely source code was located, and I confirmed it by looking at some functions, and disassembly, but nobody is 100% sure, and AFAIK ST are not telling. And then the actual library changes according to your float options selector in the Cube IDE config - something I spent months working out. ST built dozens of versions of libc.a and the one you get is based on that selector. The thing has loads of empty stubs for calling mutexes (because some of the code is not thread-safe) and I had to implement those, but the library was not built with weak symbols for these! Anyway I posted it all here; in the end I had to use objcopy to weaken the whole libc.a so it could be done properly. A massive amount of time wasted and I don't ever want to go into that again.

wek · « **Reply #60 on:** August 06, 2024, 09:08:02 am »

Quote from: westfw on August 05, 2024, 08:04:47 pm

Quote
Would the toolmakers' attitude be "here's a description of each step of startup process, together with explanation and reasoning behind the individual steps"

avr-gcc's documentation of the nine (!) different initialization steps that happen before main is pretty nice.

Indeed it is. But that's the exception, result of enthusiasm and also a certain amount of luck.

I believe you can name all major contributors to the avr-gcc/avr-libc project. This and the state of documentation are no coincidence: they put their names behind their work, and they were willing to publicly discuss it. That's not the norm in the world of "professionally" made tools (and make no mistake, gcc as such is *not* an enthusiasts-driven project).

JW

Nominal Animal · « **Reply #61 on:** August 06, 2024, 10:01:57 am »

Give me a specific version of newlib and a gcc on a specific architecture compatible with those, and I can describe the entire C runtime and initialization scheme. Especially so if you use an official ARM GNU Toolchain. I've found that most work is mapping all the debugging options and their effects; for a single version, single set of compilation options, single target architecture, it isn't too painful.

I fully understand wek's issues when trying to do that in a wholly version-agnostic way, because there is so much configurable stuff in the target machine definition compiled to gcc, and all of it interacts with newlib – just see the macros controlling gcc/libgcc/crtstuff.c (which ends up compiled to most of the C runtime). I've considered sketching it out using Graphviz as a directed graph starting from the hardware reset vector, but it is a lot of work; way too much for a generic directed graph!

For a specific version of the ARM GNU toolchain on a specific architecture (say, ARMv7e-M for Cortex-M4 and -M7), it's not too bad. The toolchain release notes even include the build instructions (using Linaro ABE manifest.txt files).

What simply fucks all that up is vendors like STMicroelectronics, who package their own toolchains without documenting how to rebuild them (so that the result matches the distributed binaries, differing only in timestamps and such). Users are told "it's gcc and newlib", but that's less precise than saying "Euler angles" (there being only 12 different incompatible ways of defining Euler angles; or 24, if you consider intrinsic and extrinsic rotations separately).

langwadt · « **Reply #62 on:** August 06, 2024, 12:53:52 pm »

Quote from: Nominal Animal on August 06, 2024, 10:01:57 am

Give me a specific version of newlib and a gcc on a specific architecture compatible with those, and I can describe the entire C runtime and initialization scheme. Especially so if you use an official ARM GNU Toolchain. I've found that most work is mapping all the debugging options and their effects; for a single version, single set of compilation options, single target architecture, it isn't too painful.

I fully understand wek's issues when trying to do that in a wholly version-agnostic way, because there is so much configurable stuff in the target machine definition compiled to gcc, and all of it interacts with newlib – just see the macros controlling gcc/libgcc/crtstuff.c (which ends up compiled to most of the C runtime). I've considered sketching it out using Graphviz as a directed graph starting from the hardware reset vector, but it is a lot of work; way too much for a generic directed graph!

For a specific version of the ARM GNU toolchain on a specific architecture (say, ARMv7e-M for Cortex-M4 and -M7), it's not too bad. The toolchain release notes even include the build instructions (using Linaro ABE manifest.txt files).

What simply fucks all that up is vendors like STMicroelectronics, who package their own toolchains without documenting how to rebuild them (so that the result matches the distributed binaries, differing only in timestamps and such). Users are told "it's gcc and newlib", but that's less precise than saying "Euler angles" (there being only 12 different incompatible ways of defining Euler angles; or 24, if you consider intrinsic and extrinsic rotations separately).

https://github.com/STMicroelectronics/gnu-tools-for-stm32 ?

Nominal Animal · « **Reply #63 on:** August 06, 2024, 02:03:45 pm »

Quote from: langwadt on August 06, 2024, 12:53:52 pm

https://github.com/STMicroelectronics/gnu-tools-for-stm32 ?

Having build scripts is nice, but not nearly as nice as a description of how to reproduce the distributed binaries (ignoring irrelevant differences like timestamps, function order, and so on).

peter-h · « **Reply #64 on:** September 07, 2024, 07:06:45 pm »

Getting back to this FPU topic, and referring to this post
https://www.eevblog.com/forum/microcontrollers/for-more-memory-check-this-out/msg5633683/#msg5633683

I am a bit confused. That implies that you need to do a float operation before interrupts are enabled.

I am enabling the FPU in startup.s

Code: [Select]

/* Start FPU, to avoid this problem */
/* [url]https://www.eevblog.com/forum/microcontrollers/stm-32f4-fpu-registers-and-main()-gotcha/[/url] */

	ldr.w r0, =0xE000ED88       	/* The FPU enable bits are in the CPACR. */
	ldr r1, [r0]
	orr r1, r1, #( 0xf << 20 )   	/* Enable CP10 and CP11 coprocessors, then save back. */
	str r1, [r0]

but I am not doing a float operation before FreeRTOS is started near the end of main(). After that various RTOS tasks do float ops, but I wonder if I should do one early in main(). There is currently a printf there which goes to the SWV ITM console output (which is harmless if a debugger is not connected) and I could output a float via that, which would not get optimised-out.

Or perhaps the other way is to enable the lazy stacking bit (somewhere??) right after the FPU is enabled. I read that FreeRTOS handles the FPU issues internally when switching tasks and maybe it sets that bit.
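For reference, the lazy stacking control sits in the FPCCR register (0xE000EF34): ASPEN is bit 31 and LSPEN is bit 30. As far as I know both are already set after reset on the Cortex-M4F, so nothing extra should be needed, but check the reference manual for your part. A small PC-side sketch for decoding a value read back from that register; the function name is made up:

```c
#include <stdint.h>

/* Top two bits of FPCCR (register address 0xE000EF34 on Cortex-M4F/M7). */
#define FPCCR_ASPEN  (1u << 31)   /* automatic FP state preservation on exception entry */
#define FPCCR_LSPEN  (1u << 30)   /* lazy stacking of the FP registers */

/* Returns nonzero if the given FPCCR value has lazy stacking active,
   i.e. both ASPEN and LSPEN set. Hypothetical helper name. */
int fpccr_lazy_stacking_enabled(uint32_t fpccr)
{
    const uint32_t both = FPCCR_ASPEN | FPCCR_LSPEN;

    return (fpccr & both) == both;
}
```

Reading the register in a debugger and feeding the value to this check is a quick way to confirm what the startup code and FreeRTOS port have left there.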

brucehoult · « **Reply #65 on:** September 08, 2024, 01:37:14 am »

Quote from: peter-h on September 07, 2024, 07:06:45 pm

Getting back to this FPU topic, and referring to this post
https://www.eevblog.com/forum/microcontrollers/for-more-memory-check-this-out/msg5633683/#msg5633683

I am a bit confused. That implies that you need to do a float operation before interrupts are enabled.

How so?

ejeffrey · « **Reply #66 on:** September 08, 2024, 02:04:55 am »

Quote from: peter-h on September 07, 2024, 07:06:45 pm

but I am not doing a float operation before FreeRTOS is started near the end of main(). After that various RTOS tasks do float ops, but I wonder if I should do one early in main(). There is currently a printf there which goes to the SWV ITM console output (which is harmless if a debugger is not connected) and I could output a float via that, which would not get optimised-out.

What would that accomplish? As long as you have enabled the FPU before using it, isn't that all that matters?

peter-h · « **Reply #67 on:** September 08, 2024, 05:48:15 am »

I would agree, but Siwastaja wrote:

ARM Cortex-M implements lazy stacking and stacks the float registers when the first float operation is encountered. Therefore you can simply use floats everywhere in your code, including most interrupts, and choose not to use them in those most timing-critical interrupts.

The implication seems to be that you need to perform a float operation before interrupts are enabled.

brucehoult · « **Reply #68 on:** September 08, 2024, 06:04:19 am »

Quote from: peter-h on September 08, 2024, 05:48:15 am

I would agree, but Siwastaja wrote:

ARM Cortex-M implements lazy stacking and stacks the float registers when the first float operation is encountered. Therefore you can simply use floats everywhere in your code, including most interrupts, and choose not to use them in those most timing-critical interrupts.

The implication seems to be that you need to perform a float operation before interrupts are enabled.

Then you completely misunderstood what he wrote.

ARM Cortex-M implements lazy stacking and stacks the float registers IN AN INTERRUPT HANDLER ONLY IF AND when the first float operation IN THAT INTERRUPT HANDLER is encountered, AND RESTORES THE FLOAT REGISTERS ON RETURN FROM THE INTERRUPT IF AND ONLY IF THEY WERE SAVED IN THAT INTERRUPT HANDLER.

gf · « **Reply #69 on:** September 08, 2024, 10:09:10 am »

Quote from: brucehoult on September 08, 2024, 06:04:19 am

ARM Cortex-M implements lazy stacking and stacks the float registers IN AN INTERRUPT HANDLER ONLY IF AND when the first float operation IN THAT INTERRUPT HANDLER is encountered, AND RESTORES THE FLOAT REGISTERS ON RETURN FROM THE INTERRUPT IF AND ONLY IF THEY WERE SAVED IN THAT INTERRUPT HANDLER.

And what happens if the interrupt handler function does not do any float operations itself, but calls functions which do float operations?

Siwastaja · « **Reply #70 on:** September 08, 2024, 11:19:22 am »

Quote from: gf on September 08, 2024, 10:09:10 am

Quote from: brucehoult on September 08, 2024, 06:04:19 am
ARM Cortex-M implements lazy stacking and stacks the float registers IN AN INTERRUPT HANDLER ONLY IF AND when the first float operation IN THAT INTERRUPT HANDLER is encountered, AND RESTORES THE FLOAT REGISTERS ON RETURN FROM THE INTERRUPT IF AND ONLY IF THEY WERE SAVED IN THAT INTERRUPT HANDLER.

And what happens if the interrupt handler function does not do any float operations itself, but calls functions which do float operations?

It just works. The Cortex-M CPU core is aware that it's executing in ISR mode (Handler mode): the first floating point operation in ISR mode triggers stacking of the float registers, and when the original handler function returns and ISR mode thus exits, unstacking happens.

The whole Cortex-M is designed such that handlers can be normal C functions and no special considerations of any kind are needed; this extends to being able to call arbitrary functions from such functions. This only causes problems to people who have historical experience and assume things must be difficult when they aren't.

Now if you do stuff like manipulate or analyze stack or registers in assembly (e.g. for debug logging), then you need to be aware of all these details.

peter-h · « **Reply #71 on:** September 08, 2024, 11:28:11 am »

Quote

And what happens if the interrupt handler function does not do any float operations itself, but calls functions which do float operations?

https://www.eevblog.com/forum/microcontrollers/how-does-st-32f4-know-when-an-isr-has-finished/msg4164745/#msg4164745

On an interrupt, the Cortex-M loads a magic number (EXC_RETURN, which has its high bits set and cannot be a valid address) into LR, and when that value is later loaded into PC, whether by BX LR or by popping it from the stack, it knows the ISR has ended, regardless of how much stuff was in between. That is how it avoids the need for the classic "RETI" and how one can write ISRs in straight C without any keyword or attribute like "interrupt".
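The magic numbers also tie in with the FPU discussion: on ARMv7-M with the FP extension, bit 4 of EXC_RETURN says whether the stacked frame has room for the FP registers. A PC-side sketch for decoding them, under the assumption of ARMv7-M (Cortex-M4F/M7); the function names are made up, and the frame sizes ignore the optional 4-byte alignment padding.

```c
#include <stdint.h>

/* Bit 4 clear: extended frame (space for S0-S15 and FPSCR was reserved). */
int exc_return_has_fp_frame(uint32_t exc_return)
{
    return (exc_return & (1u << 4)) == 0;
}

/* Bit 3 set: return to Thread mode; clear: return to Handler mode. */
int exc_return_to_thread_mode(uint32_t exc_return)
{
    return (exc_return & (1u << 3)) != 0;
}

/* Bit 2 set: the frame is on the process stack (PSP); clear: main stack (MSP). */
int exc_return_uses_psp(uint32_t exc_return)
{
    return (exc_return & (1u << 2)) != 0;
}

/* Basic frame: 8 words. Extended frame: 8 + 16 FP registers + FPSCR
   + 1 reserved word = 26 words. Alignment padding is not counted. */
unsigned exc_frame_bytes(uint32_t exc_return)
{
    return exc_return_has_fp_frame(exc_return) ? 26u * 4u : 8u * 4u;
}
```

For example, 0xFFFFFFFD decodes as Thread mode, PSP, basic frame (32 bytes), while 0xFFFFFFED is the same with an extended frame (104 bytes), which is what a FreeRTOS task that has used the FPU would typically show.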

On the earlier "debate", Siwastaja's words "when the first float operation is encountered" should have been "when the first float operation is encountered in the ISR"

wek · « **Reply #72 on:** September 08, 2024, 11:55:24 am »

The painful details here, start with chapter 2.

JW

gf · « **Reply #73 on:** September 08, 2024, 11:57:40 am »

Quote from: wek on September 08, 2024, 11:55:24 am

The painful details here, start with chapter 2.

Seems to be pretty tricky

Siwastaja · « **Reply #74 on:** September 08, 2024, 12:30:15 pm »

Quote from: peter-h on September 08, 2024, 11:28:11 am

On the earlier "debate", Siwastaja's words "when the first float operation is encountered" should have been "when the first float operation is encountered in the ISR"

This is similar to a customer coming into a supermarket and going to the fruit section, then asking "where are the apples". When given the answer "on the left", the customer runs out of the store and tries to look left out of the door, shouting "there are no apples here, you should have said left of where you currently stand in the fruit section".

I thought it is completely obvious that in context of discussing ISRs, stacking the registers must happen after the interrupt is triggered. How could you even imagine somehow stacking the CPU state before the interrupt happens? If you know what stacking means and why it is done, that is. And if you don't, then Cortex-M really makes life easy for you, see below.

You are a genius in making simple things sound complex and misunderstanding anything.

Quote from: gf on September 08, 2024, 11:57:40 am

Seems to be pretty tricky

Classic issue. Most people find it simple if you solve their problems and give them something which just works, even if it is complex behind the curtains. Then again, there is always someone who wants to understand the internals completely, will disagree with some details, and want to micromanage it. They will complain.

I have programmed these CPUs for over a decade and never needed to think about these details at all. I have only read about them out of pure interest and learned about it on this very forum.

