Author Topic: GCC compiler optimisation  (Read 42907 times)


Offline ataradov

  • Super Contributor
  • ***
  • Posts: 11630
  • Country: us
    • Personal site
Re: GCC compiler optimisation
« Reply #225 on: August 22, 2021, 09:41:28 pm »
Well, I did not say that you should do that, I just said that it does not hurt in this case. Of course, correctly written code would not use a single volatile here. And yes, the compiler would complain, and generally it is a very good idea to listen to those complaints.

If you read something into a variable, this variable does not need to be volatile. This is perfectly valid code:
Code: [Select]
uint32_t variable = *((volatile uint32_t *)0x0800000);
« Last Edit: August 22, 2021, 09:43:09 pm by ataradov »
Alex
 

Online peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3953
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #226 on: August 22, 2021, 10:11:48 pm »
OK; you declared the source as volatile. But nothing else should need it, so long as the origin of the data is declared thus.

How would you totally avoid "volatile" in this scenario?
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline ataradov

  • Super Contributor
  • ***
  • Posts: 11630
  • Country: us
    • Personal site
Re: GCC compiler optimisation
« Reply #227 on: August 22, 2021, 10:18:45 pm »
You can't. The source must be volatile.
Alex
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6746
  • Country: fi
    • My home page and email address
Re: GCC compiler optimisation
« Reply #228 on: August 22, 2021, 10:31:31 pm »
Sure it does not hurt but then why not just make everything "volatile".
You could, but then the compiler could not optimize the code much.  The code would work as intended, but be slower than necessary.

It makes no sense to do it as a precaution, just in case.
No, but a lot of "programmers" throw stuff at the wall and see what sticks.  The honest ones will tell you "this is how I got it to (seem to) work, but I don't know how or why".  I consider them to be to software engineers what alchemists are to chemists.

AIUI, if you read location 0x0800000 into some variable then that variable must be "volatile", but code which subsequently accesses that variable does not have to be.
I'd prefer to put it a slightly different way (since I'm not sure whether that sentence describes the situation correctly or not; my English fails me here).

You declare a variable (or an object like an array) volatile, when you do not want the compiler to infer its value in any way, and want the compiler to access the actual storage (memory) of it whenever the variable or object is accessed.

If you look at ataradov's example above, you'll see the pattern for making a single access volatile: you construct a pointer to volatile data, pointing at the object or address you want, and then dereference the pointer.  The value is stored in a non-volatile variable.  To aid us human programmers, I like to explicitly declare such "locally cached values" as const –– which is just a promise to the compiler that we only read the value and will not try to modify it, making it easier for the compiler to optimize code using the const values.  (GCC and Clang are pretty darned good at inferring constness on their own, though, so in practice it really is more a reminder to us humans that this value will stay constant in this scope.)
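A self-contained sketch of that pattern (with a plain variable standing in for the hardware location, since a fixed flash address can't be dereferenced off-target):

```c
#include <stdint.h>

/* Stand-in for a memory-mapped location; on the real target you would
   write (volatile uint32_t *)0x08000000 or similar instead of &fake_mmio. */
static uint32_t fake_mmio = 0xDEADBEEFu;

uint32_t read_once(void)
{
    /* The access itself is volatile; the locally cached copy is a plain
       const, so the compiler is free to optimize all later uses of it. */
    const uint32_t value = *(volatile uint32_t *)&fake_mmio;
    return value;
}
```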

volatile is not the only way to tell the compiler that its assumptions about variables or objects are no longer valid.  In GCC, on all architectures, asm volatile ("" : : : "memory"); acts as a compiler memory barrier (it generates no machine code itself!) that tells the compiler that any assumptions about memory contents across that statement are invalid.  However, it does not affect local variables and objects whose address has not been taken, since the compiler can keep those in registers.

Another way is to call a function whose prototype is known but whose implementation is unknown – for example, compiled in a separate unit (C source file) – and which takes a pointer to the non-const memory range containing one or more variables.  For a specific variable or object (including a dereferenced pointer), you can also use asm volatile ("" : "+m" (object)); which tells the compiler that any assumptions about the value of object become invalid at that point.

How would you totally avoid "volatile" in this scenario?
For example,
        asm volatile ("" : "+m" (*(uint32_t *)0x0800000));
        uint32_t  variable =     *(uint32_t *)0x0800000;
        asm volatile ("" : "+m" (*(uint32_t *)0x0800000));
does the exact same thing.

In pure C, without inline assembly, there is no exact equivalent to volatile whose behaviour the standard guarantees.  In practice, if you bracket the access with a call to a function, say foo((uint32_t *)0x0800000);, whose implementation is not visible to the compiler (but whose prototype, e.g. void foo(uint32_t *);, is), the compiler has to load the 32-bit unsigned integer at that point in the code.
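A sketch of that bracketing idea (hypothetical names; note that in real use opaque_touch() would be compiled in a separate C file so its body really is invisible – it is defined in the same file here only to keep the example self-contained):

```c
#include <stdint.h>

static uint32_t word = 123;

/* In real use this lives in another translation unit; the compiler, seeing
   only the prototype, must assume the call may modify *p. */
void opaque_touch(uint32_t *p)
{
    (void)p;
}

uint32_t read_bracketed(void)
{
    opaque_touch(&word);   /* invalidates assumptions about word */
    uint32_t v = word;     /* so this must be a real load (in the 2-file case) */
    opaque_touch(&word);
    return v;
}
```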
« Last Edit: August 22, 2021, 10:34:43 pm by Nominal Animal »
 
The following users thanked this post: newbrain, DiTBho

Online peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3953
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #229 on: August 23, 2021, 06:26:31 am »
Doesn't declaring a variable as extern also do it? The compiler cannot "see" across multiple source files.
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline ataradov

  • Super Contributor
  • ***
  • Posts: 11630
  • Country: us
    • Personal site
Re: GCC compiler optimisation
« Reply #230 on: August 23, 2021, 06:35:10 am »
Not necessarily, especially if you use LTO, which is a good idea.

Also, even the foo((uint32_t *)0x0800000); example is not entirely correct. If foo() writes to the flash, then a subsequent call to foo() may use the old cached value instead of reading the new one again. Or something else may write to the flash and you want foo() to notice the change.

There is no need to invent new ways to trick the compiler. There are well defined ways to communicate what you want.
Alex
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6746
  • Country: fi
    • My home page and email address
Re: GCC compiler optimisation
« Reply #231 on: August 23, 2021, 04:48:20 pm »
If foo() writes to the flash, then sequential call to foo(); may use old value cached instead of reading the new one again. Or if something else writes to the flash and you want foo() to notice the change.
True.

There is no need to invent new ways to trick the compiler. There are well defined ways to communicate what you want.
Absolutely.  I use volatile rather often (in POSIX userspace applications, the most typical one is an interrupt flag, a volatile sig_atomic_t flag), and compiler memory barriers and such constructs extremely rarely; only when doing odd stuff with memory caching controls and such.

As a mental model for a human programmer, however, I do believe my examples of "the compiler cannot assume the value if ..." are useful.  (That is the way and reason I included them, at least; not as practical examples.)

They explain why you do not usually see volatile pointers in function prototypes (except in static inline accessor functions compiled in the same unit/source file –– for GCC and Clang these are just as fast as macros, with their function bodies inlined at the call site, but unlike macros they also provide compile-time type checking), since the qualifier would really only affect the function implementation.  They should also help show one way C compilers optimize expressions –– by tracking accesses to variables and avoiding superfluous ones –– what that means for the generated code, and how to avoid problems arising from such optimizations.
(Which, if I've understood correctly, is the entire reason for this thread: having C code behave in unexpected/unintended ways when compiling with optimizations enabled, and trying to understand why and how that happens.)
 

Online peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3953
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #232 on: August 24, 2021, 10:00:48 am »
OK here goes another dumb Q:



The two g_ flags are picked up by another RTOS task. They are not referenced in the current .c file.

Should they be "volatile"? They are not declared as extern in the current .c file (but are in the other one, obviously) but are set to zero when defined.

Amazingly this code runs as-is even with -Og, and the compiler is not complaining about an unused variable.
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Online harerod

  • Frequent Contributor
  • **
  • Posts: 469
  • Country: de
  • ee - digital & analog
    • My services:
Re: GCC compiler optimisation
« Reply #233 on: August 24, 2021, 12:11:11 pm »
Have you considered using inter-task communication mechanisms, to make results less volatile?

https://freertos.org/Embedded-RTOS-Binary-Semaphores.html
 

Online peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3953
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #234 on: August 24, 2021, 12:58:40 pm »
Yes but I believe in simplicity :)

I take it the answer to my question is Yes :) But maybe not?
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline gf

  • Super Contributor
  • ***
  • Posts: 1308
  • Country: de
Re: GCC compiler optimisation
« Reply #235 on: August 24, 2021, 05:05:00 pm »
Volatile can only provide limited memory ordering guarantees anyway. Memory barriers are IMO the better instrument to provide the memory ordering guarantees typically required in multi-threaded environments. In a single-processor environment, asm volatile("" ::: "memory") can act as compiler barrier, but in multi-processor environments even hardware memory barriers are frequently unavoidable (and volatile alone would no longer suffice then at all, even if all objects were volatile).

Various thread synchronization functions provided by the OS happen to be implicit memory barriers. For instance, if you protect access to shared objects (i.e. objects shared between threads) with mutexes, then you get implied memory barriers at the points where you acquire and release the mutex. So if you use the OS-provided thread synchronization primitives, you don't need to care about the low-level stuff, and in most cases you can forgo explicit memory barriers, atomic operations, or volatile qualifiers for shared objects.
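For example (a POSIX sketch with hypothetical names; a FreeRTOS mutex serves the same purpose), a counter shared between two threads needs no volatile at all when every access happens inside the lock:

```c
#include <pthread.h>
#include <stdint.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static uint32_t shared_count = 0;   /* note: not volatile */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* acquire: implied memory barrier */
        shared_count++;
        pthread_mutex_unlock(&lock);  /* release: implied memory barrier */
    }
    return NULL;
}

uint32_t run_two_workers(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return shared_count;
}
```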

 
The following users thanked this post: DiTBho

Offline ataradov

  • Super Contributor
  • ***
  • Posts: 11630
  • Country: us
    • Personal site
Re: GCC compiler optimisation
« Reply #236 on: August 24, 2021, 05:14:47 pm »
The two g_ flags are picked up by another RTOS task. They are not referenced in the current .c file.
Should they be "volatile"? They are not declared as extern in the current .c file (but are in the other one, obviously) but are set to zero when defined.

From the compiler's point of view tasks are just functions, and as long as there is a write and a read, the access can't be optimized out. Volatile here is useless.

But you need to be very careful how you use shared variables like this. It is safe to use simple types like booleans in some cases, but you really need to know what you are doing.

And if you start to accumulate a lot of those shared variables, then you should expect something to break.
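One well-defined way to keep a simple shared flag safe is C11 <stdatomic.h> (a sketch with hypothetical names; one task sets the flag, another polls it):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* _Atomic gives defined visibility and ordering between tasks/threads;
   a plain (or merely volatile) global does not guarantee either. */
static atomic_bool g_flag = false;

void signal_done(void)  { atomic_store(&g_flag, true); }
bool poll_done(void)    { return atomic_load(&g_flag); }
```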
Alex
 

Offline gf

  • Super Contributor
  • ***
  • Posts: 1308
  • Country: de
Re: GCC compiler optimisation
« Reply #237 on: August 24, 2021, 05:56:12 pm »
The two g_ flags are picked up by another RTOS task. They are not referenced in the current .c file.
Should they be "volatile"? They are not declared as extern in the current .c file (but are in the other one, obviously) but are set to zero when defined.

From the compiler's point of view tasks are just functions, and as long as there is a write and a read, the access can't be optimized out. Volatile here is useless.

The actual point in this code snippet is whether the store to the g_ variables can be re-ordered to after the call to xTaskCreate(), or not.
If they are global variables and the compiler cannot see what xTaskCreate() does, then it must assume that they might be accessed in xTaskCreate(), so re-ordering past the call can't happen.
But if the compiler can inline xTaskCreate(), or if LTO is used, then it depends...
[ Basically xTaskCreate() is one of those functions which IMO should act as an implied memory barrier, as it spawns a new thread. ]
 
The following users thanked this post: DiTBho

Online peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3953
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #238 on: August 24, 2021, 06:22:07 pm »
" It is safe to use simple types like booleans in some cases, but you really need to know what you are doing."

I am well aware that passing a uint64_t between RTOS tasks is possibly not safe, because it won't be written in one operation, and of course strings are worse. So if doing that, one needs an atomic flag to indicate when the data is ready, etc.

I use the FreeRTOS mutexes around things like set/read RTC, because I am setting the RTC from the GPS in one thread and reading it from other(s). And one could do the same around simple variables. I've done it around all SPI3 ops because SPI3 is shared between different devices (and yes amazingly it does work, with some precautions).

I've measured the exec time of the mutex call and it is very fast - 1us IIRC.

AFAIK on the ARM 32F4 booleans are a single byte, but even 32-bit vars are atomically written and read.
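Right: on a Cortex-M4 an aligned 32-bit store is a single instruction, but a uint64_t takes two, so a reader can observe a torn value. One portable fix (a C11 sketch; names are hypothetical) is to wrap the wide variable in _Atomic and let the compiler emit whatever the target needs:

```c
#include <stdatomic.h>
#include <stdint.h>

/* A 64-bit value shared between tasks. _Atomic prevents torn reads/writes;
   on targets without a lock-free 64-bit access the compiler/runtime falls
   back to a small internal lock. */
static _Atomic uint64_t g_timestamp;

void set_timestamp(uint64_t t) { atomic_store(&g_timestamp, t); }
uint64_t get_timestamp(void)   { return atomic_load(&g_timestamp); }
```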
« Last Edit: August 24, 2021, 06:24:34 pm by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Online peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3953
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #239 on: August 30, 2022, 06:54:25 am »
I posted this in a thread which appears to have gone dead, and this is more on-topic.

Can I be sure that this loop will not be replaced with memcpy, due to the use of "volatile"?



It isn't getting replaced (I am using -Og), so the answer is probably affirmative.

It actually could be a memcpy (because the CPU FLASH is not going to change), and I have a local version of memcpy which has optimisation turned off; it would perhaps be better maintenance-wise to use that, but I am not sure of the syntax :)

addr starts at 0x8000000 - the base of FLASH.
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline ataradov

  • Super Contributor
  • ***
  • Posts: 11630
  • Country: us
    • Personal site
Re: GCC compiler optimisation
« Reply #240 on: August 30, 2022, 07:04:46 am »
It depends. With the current compiler -Og is close to -O1 with minor modifications. But this can change in the future.

The loop replacement is controlled by -ftree-loop-distribute-patterns flag, which is included by default only at -O3. EDIT: It looks like in recent versions it is also enabled in -O2. So, it already changed at least once.

But you can just be explicit and pass "-fno-tree-loop-distribute-patterns" to the compiler and it will not do the replacement at any optimization level.
« Last Edit: August 30, 2022, 07:15:27 am by ataradov »
Alex
 

Online westfw

  • Super Contributor
  • ***
  • Posts: 4269
  • Country: us
Re: GCC compiler optimisation
« Reply #241 on: August 30, 2022, 07:45:53 am »
Quote
you can just be explicit and pass "-fno-tree-loop-distribute-patterns" to the compiler
Or stick it in a pragma, for just that function?
 

Online peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3953
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #242 on: August 30, 2022, 07:49:38 am »
Quote
The loop replacement is controlled by -ftree-loop-distribute-patterns flag, which is included by default only at -O3. EDIT: It looks like in recent versions it is also enabled in -O2. So, it already changed at least once.

I have found memcpy replacement with -Og, IIRC. I experimented with levels other than -Og only briefly. We had a thread about it (maybe this one) way back. -O3 caused some problems, for no benefit.

Quote
But you can just be explicit and pass "-fno-tree-loop-distribute-patterns" to the compiler and it will not do the replacement at any optimization level.

I will do that - thanks.

Quote
Or stick it in a pragma, for just that function?

Do you have an example? Is it like the __attribute specification?
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline ataradov

  • Super Contributor
  • ***
  • Posts: 11630
  • Country: us
    • Personal site
Re: GCC compiler optimisation
« Reply #243 on: August 30, 2022, 07:50:08 am »
Yes, sure, if this only applies to one function. But given that -Og is used, it would not matter all that much. No need to micro-manage if you don't even macro-manage.

If -O3 causes "problems", you are very screwed and sitting on a ticking time bomb. All such cases should be investigated immediately, not just ignored.
Alex
 
The following users thanked this post: newbrain

Offline ataradov

  • Super Contributor
  • ***
  • Posts: 11630
  • Country: us
    • Personal site
Re: GCC compiler optimisation
« Reply #244 on: August 30, 2022, 07:53:23 am »
Do you have an example? Is it like the __attribute specification?
Just like any other attribute:

Code: [Select]
__attribute__ ((optimize ("no-tree-loop-distribute-patterns"))) void foo(void)
{
}
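For completeness, the pragma form westfw asked about would look like this (a GCC-specific sketch; copy_words is a hypothetical name):

```c
#include <stdint.h>

#pragma GCC push_options
#pragma GCC optimize ("no-tree-loop-distribute-patterns")
/* This loop keeps its loop shape instead of being turned into a memcpy() call. */
void copy_words(const uint32_t *src, uint32_t *dst, uint32_t n)
{
    for (uint32_t i = 0; i < n; i++)
        dst[i] = src[i];
}
#pragma GCC pop_options
```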
Alex
 
The following users thanked this post: peter-h

Online peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3953
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #245 on: August 30, 2022, 08:21:24 am »
Quote
If -O3 causes "problems", you are very screwed and sitting on ticking time bomb. All such cases should be investigated immediately, not just ignored.

IIRC this was stuff to do with memcpy etc substitution in the boot block which didn't have access to stdlib.

The reason I don't bother with it is because tests showed negligible code size or speed differences. As one would expect, the stuff got increasingly esoteric. Also some mentioned that -O3 ought to be considered "experimental" which is not really what I want to be doing.

The wider issue is that whenever something like this is changed, you have to do regression testing, which is impossible unless the product is trivial.

As I mention below, there are various places where one is relying on execution time to meet some hardware specs. How one does this varies. If one needs to achieve say 1us (which is a looong time on a 168MHz 32F4) then a proper delay is necessary, and it needs to be scope-checked. It has to be a code loop (in asm, or in C with optimisation turned right off, or using CYCCNT). It can't be a "tick" because you would need a multi-MHz interrupt :)

If one needs to achieve say 10ms, then osDelay(10) is usually the right way. These two are immune to global optimisation settings. But if you need say 50ns, then what? You will probably get that with a dozen lines of code, and it absolutely must be scope-checked. And that one is vulnerable to -O settings being changed, unless the code is in a function with the -O0 attribute (which is what I have usually done).

So I don't agree that code which breaks with -O3 is necessarily bad code.

I am writing up a "design document" as I go along and these are my notes on this topic

Much time has been spent on this. It is a can of worms, because e.g. optimisation level -O3 replaces this loop
for (uint32_t i=0; i<length; i++)
{
 buf[offset+i]=data;
}
with memcpy(), which, while "functional", will crash the system if you have this loop in the boot loader and memcpy() is located at some address outside it (which it will be, being part of the standard library!). Selected boot block functions therefore have the attribute to force -O0 (zero optimisation) in case somebody tries -O3.

The basic -O0 (no optimisation) works fine, is easily fast enough for the job, and gives informative single-step debugging, but it produces about 30% more code. The next one up, -Og, is the best one to use and doesn't seem to do any risky stuff like the above.

Arguably, one should develop with optimisation ON (say -Og) and then you will pick up problems as you go along. Then switch to -O0 for single stepping if needed to track something down. Whereas if you write the whole thing with -O0 and only change to something else later, you have an impossible amount of stuff to regression-test.

The problems tend to be really subtle, especially if timing issues arise. For example the 20ns min CS high time for the 45DB serial FLASH can be violated (by two function calls in succession) with a 168MHz CPU. A lot of ST HAL code works only because it is so bloated.

These figures show the relative benefits, at a particular stage of the project

-O0 produces 230k
-Og produces 160k
-O2 produces 160k*
-O3 produces 180k*
-Os produces 146k

The ones marked * are risky. -Os is pointless unless you are pushing against the FLASH size limit. Others, not listed above, have not been tested.

A compiler command line switch -fno-tree-loop-distribute-patterns has been added to prevent memcpy etc substitutions globally.


So yes probably -Og does not substitute stdlib functions.
« Last Edit: August 30, 2022, 09:58:41 am by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline DiTBho

  • Super Contributor
  • ***
  • Posts: 4206
  • Country: gb
Re: GCC compiler optimisation
« Reply #246 on: August 30, 2022, 08:33:49 am »
compiler memory barriers and such constructs extremely rarely

that's one of the major reasons why I implemented myC: on tr-memory machines C doesn't offer *ANY* good memory barrier, and "volatile" is not only futile but also a keyword that 90% of programmers don't understand; they simply throw stuff at the wall and see what sticks.
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline DiTBho

  • Super Contributor
  • ***
  • Posts: 4206
  • Country: gb
Re: GCC compiler optimisation
« Reply #247 on: August 30, 2022, 08:38:16 am »
Do you have an example? Is it like the __attribute specification?
Just like any other attribute:

Code: [Select]
__attribute__ ((optimize ("no-tree-loop-distribute-patterns"))) void foo(void)
{
}

And then you have these solutions full of "black voodoo magic" (oh magic attribute, oh, what? see the manual, oh what? see the manual of YOUR C compiler, Oh SHT it's not supported, now what?) rather than neat stuff  :palm:

The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline DiTBho

  • Super Contributor
  • ***
  • Posts: 4206
  • Country: gb
Re: GCC compiler optimisation
« Reply #248 on: August 30, 2022, 08:49:29 am »
if you protect the access to shared objects [..] with mutexes, then you get implied memory barriers at the points where you acquire and release the mutex

Yes, XINU/R18200 (MIPS4+, multi core) has mutex and tr-memory primitives implemented in assembly in a dedicated module, to avoid problems with the C compiler and because you need special pipeline instructions for these operations.

Called "critical code". Small portion of assembly. Good compromise.
« Last Edit: August 30, 2022, 10:25:01 am by DiTBho »
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline gf

  • Super Contributor
  • ***
  • Posts: 1308
  • Country: de
Re: GCC compiler optimisation
« Reply #249 on: August 30, 2022, 09:57:42 am »

that's one of the major reasons why I implemented myC: with tr-memory machines C doesn't offer *ANY* good memory barrier, and "Volatile" not only is futile but also a keyword that 90% of programmers don't understand and simply throw stuff at the wall, and see what sticks.

C basically does define an abstract, machine-independent memory ordering model, implemented via the stuff in stdatomic.h. Programs which strictly adhere to this abstract model (even if the current target CPU's requirements are not that strict) are even supposed to be portable to different CPUs wrt. this functionality.
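A minimal sketch of that model, using explicit release/acquire ordering to publish plain data through an atomic flag (hypothetical names):

```c
#include <stdatomic.h>
#include <stdint.h>

static uint32_t payload;         /* plain, non-atomic data */
static atomic_int ready = 0;     /* synchronization flag */

void publish(uint32_t v)
{
    payload = v;
    /* release: all stores above become visible to any thread whose
       acquire-load of `ready` observes the value 1 */
    atomic_store_explicit(&ready, 1, memory_order_release);
}

int try_consume(uint32_t *out)
{
    if (atomic_load_explicit(&ready, memory_order_acquire)) {
        *out = payload;          /* guaranteed to see the published value */
        return 1;
    }
    return 0;
}
```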
 
