Author Topic: GCC compiler optimisation  (Read 42891 times)


Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15103
  • Country: fr
Re: GCC compiler optimisation
« Reply #150 on: August 14, 2021, 05:14:55 pm »
Quote
The fact that it's on the stack means the compiler sees its entire lifespan, from function entry to exit, and knows it's never used.
Hmm.  An interesting theory.
If I change the buffer to "static volatile", then gcc will produce code to do the copy.  But clang doesn't. :-)
I like the "memcpy() doesn't know about volatile" explanation better, technically.

As some of us said, you can't expect any particular behavior for those cases because they are entirely implementation-dependent. And yes, you can see some pretty weird stuff when playing with this.

When *writing* to objects on the stack that are never *read* afterwards before they get out of scope, the compiler is free to optimize out those writes entirely. From a purely functional POV, those writes would have absolutely ZERO effect. The compiler assumes that the local stack is under its full control - so in some cases, even with a volatile qualifier, the code may be optimized out if the local variables in question are never read.
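A minimal sketch of that point (the function and values are made up for illustration): at -O2, GCC will typically delete the final memset, because `key` is a local whose entire lifetime is visible and nothing reads it again before it goes out of scope. This is exactly the situation that memset_s / explicit_bzero were invented for.

```c
#include <string.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical example: the final memset is a dead store. The compiler
 * sees the whole lifetime of `key` and knows nothing reads it after the
 * memset, so at -O2 the "wipe" is typically optimized out entirely. */
uint32_t checksum_and_wipe(const char *msg)
{
    char key[32] = {0};
    strncpy(key, msg, sizeof key - 1);

    uint32_t sum = 0;
    for (size_t i = 0; key[i]; i++)
        sum += (unsigned char)key[i];

    memset(key, 0, sizeof key);  /* dead store: free to be removed */
    return sum;
}
```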

Now if you're willing to "play" with the stack on a very low-level (for instance for writing stack sentinels or something), your best bet is probably to do that directly in assembly.
 

Offline Doc Daneeka

  • Contributor
  • Posts: 36
  • Country: it
Re: GCC compiler optimisation
« Reply #151 on: August 15, 2021, 05:03:48 am »
Quote
When *writing* to objects on the stack that are never *read* afterwards before they get out of scope, the compiler is free to optimize out those writes entirely. From a purely functional POV, those writes would have absolutely ZERO effect. The compiler assumes that the local stack is under its full control - so in some cases, even with a volatile qualifier, the code may be optimized out if the local variables in question are never read

That's not right: the purpose of volatile is to define what is observable behaviour of the program. The concept of a stack is implementation; it doesn't matter what the scope of the variable is. If it's volatile, all accesses to it have to be made strictly per the semantics of C. As far as the C language is concerned, there is no concept of a stack, just objects that can be accessed (read or write). Although it's implementation-defined what it means for an access to be 'observable', the compiler cannot 'optimise away' a volatile access: by declaring it volatile you are telling the compiler it is observable.
 

Online peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3953
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #152 on: August 15, 2021, 06:26:15 am »
Does "static" have any effect on whether unused storage is optimised away?

What concerns me is what looks like cases where say you have a 512 byte array and you use bytes 0,1,2 for flags and never read (or explicitly write, a byte at a time) the rest; the compiler might optimise away operations which write the whole array.

That could screw up say a serial FLASH which is expecting to always get 512 bytes transferred.
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8560
  • Country: fi
Re: GCC compiler optimisation
« Reply #153 on: August 15, 2021, 06:47:31 am »
If you have a memory-mapped hardware device expecting 512 bytes to be transferred, then this is the most obvious textbook example where volatile qualifier is required.
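A hedged sketch of the textbook case (not the actual HAL code; the register is passed in as a plain parameter so the idea can be shown in isolation): because the destination is volatile-qualified, every one of the stores must be performed, in order, even though the compiler never sees the value read back.

```c
#include <stdint.h>

/* Hypothetical sketch: `dr` stands for a memory-mapped data register.
 * Because it is volatile-qualified, all `len` stores must be emitted;
 * the compiler may not merge or drop them, even though the written
 * values are never read back by this code. */
static void spi_send(volatile uint8_t *dr, const uint8_t *buf, uint16_t len)
{
    while (len--)
        *dr = *buf++;   /* volatile store: one bus write per byte */
}
```

On real hardware `dr` would be something like `(volatile uint8_t *)&SPIx->DR`; here it is an ordinary pointer so the effect of the qualifier stands on its own.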
 

Online peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3953
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #154 on: August 15, 2021, 08:10:34 am »
OK; let's look at whether "pData" or "Size" could get optimised out here

Code: [Select]

HAL_StatusTypeDef B_HAL_SPI_Transmit(SPI_HandleTypeDef *hspi, uint8_t *pData, uint16_t Size)
{
//  uint32_t tickstart;
  HAL_StatusTypeDef errorcode = HAL_OK;
  uint16_t initial_TxXferCount;

  initial_TxXferCount = Size;

  if (hspi->State != HAL_SPI_STATE_READY)
  {
    errorcode = HAL_BUSY;
    goto error;
  }

  if ((pData == NULL) || (Size == 0U))
  {
    errorcode = HAL_ERROR;
    goto error;
  }

  /* Set the transaction information */
  hspi->State       = HAL_SPI_STATE_BUSY_TX;
  hspi->ErrorCode   = HAL_SPI_ERROR_NONE;
  hspi->pTxBuffPtr  = (uint8_t *)pData;
  hspi->TxXferSize  = Size;
  hspi->TxXferCount = Size;

  /*Init field not used in handle to zero */
  hspi->pRxBuffPtr  = (uint8_t *)NULL;
  hspi->RxXferSize  = 0U;
  hspi->RxXferCount = 0U;
  hspi->TxISR       = NULL;
  hspi->RxISR       = NULL;

  /* Configure communication direction : 1Line */
  if (hspi->Init.Direction == SPI_DIRECTION_1LINE)
  {
    SPI_1LINE_TX(hspi);
  }

  /* Check if the SPI is already enabled */
  if ((hspi->Instance->CR1 & SPI_CR1_SPE) != SPI_CR1_SPE)
  {
    /* Enable SPI peripheral */
    __HAL_SPI_ENABLE(hspi);
  }

  /* Transmit data in 8 Bit mode */

    if ((hspi->Init.Mode == SPI_MODE_SLAVE) || (initial_TxXferCount == 0x01U))
    {
      *((__IO uint8_t *)&hspi->Instance->DR) = (*hspi->pTxBuffPtr);
      hspi->pTxBuffPtr += sizeof(uint8_t);
      hspi->TxXferCount--;
    }
    while (hspi->TxXferCount > 0U)
    {
      /* Wait until TXE flag is set to send data */
      if (__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_TXE))
      {
        *((__IO uint8_t *)&hspi->Instance->DR) = (*hspi->pTxBuffPtr);
        hspi->pTxBuffPtr += sizeof(uint8_t);
        hspi->TxXferCount--;
      }
    }


  /* Clear overrun flag in 2 Lines communication mode because received is not read */
  if (hspi->Init.Direction == SPI_DIRECTION_2LINES)
  {
    __HAL_SPI_CLEAR_OVRFLAG(hspi);
  }

  if (hspi->ErrorCode != HAL_SPI_ERROR_NONE)
  {
    errorcode = HAL_ERROR;
  }

error:
  hspi->State = HAL_SPI_STATE_READY;
  return errorcode;
}


I just can't see it. It would need to be tracking the code calling this function, and realising that of the 512 bytes which are always transferred, most (or even all) have not changed.

Should I use this function prototype

Code: [Select]
HAL_StatusTypeDef B_HAL_SPI_Transmit(SPI_HandleTypeDef *hspi, volatile uint8_t *pData, volatile uint16_t Size);
The most narrowed-down failure I am getting is that Windows cannot format the drive if -Og is used.

EDIT: this code uses 2k buffers for the USB transfers. I tried to make them volatile but it didn't change anything.

EDIT2: compiling just the low level 45dbxx stuff with -O0 makes formatting work fine. So one "optimisation vulnerability" was definitely in there. Can't find it though. Shall I offer 50 quid (Paypal) to anybody who can find it? :)
« Last Edit: August 15, 2021, 09:33:10 am by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline gf

  • Super Contributor
  • ***
  • Posts: 1308
  • Country: de
Re: GCC compiler optimisation
« Reply #155 on: August 15, 2021, 09:57:17 am »
Should I use this function prototype
Code: [Select]
HAL_StatusTypeDef B_HAL_SPI_Transmit(SPI_HandleTypeDef *hspi, volatile uint8_t *pData, volatile uint16_t Size);

pData is not dereferenced in the given code snippet. Therefore it makes no difference whether it is declared uint8_t* or volatile uint8_t*.

The interesting statement  where volatile matters is rather this one:
Code: [Select]
      *((__IO uint8_t *)&hspi->Instance->DR) = (*hspi->pTxBuffPtr);

If __IO is a macro which expands to volatile, then the assignment to *((__IO uint8_t *)&hspi->Instance->DR) becomes a "visible side effect" and cannot be "optimized out". Consequently the value which is assigned also needs to be calculated (i.e. fetched from *hspi->pTxBuffPtr).
« Last Edit: August 15, 2021, 10:11:19 am by gf »
 

Online peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3953
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #156 on: August 15, 2021, 10:23:18 am »
Yes __IO is volatile.

DR is the SPI "UART" data register.

Code: [Select]
  /* Set the transaction information */
  hspi->State       = HAL_SPI_STATE_BUSY_TX;
  hspi->ErrorCode   = HAL_SPI_ERROR_NONE;
  hspi->pTxBuffPtr  = (uint8_t *)pData;
  hspi->TxXferSize  = Size;
  hspi->TxXferCount = Size;
« Last Edit: August 15, 2021, 10:25:29 am by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline cv007

  • Frequent Contributor
  • **
  • Posts: 850
Re: GCC compiler optimisation
« Reply #157 on: August 15, 2021, 02:42:29 pm »
>Could be a CS timing issue but I doubt it because the device is very fast; much faster than the above code.

Your device will have a datasheet giving the timing requirements for CS, so it's probably not a bad idea to figure out whether those requirements are being met. Probably not important when you are running a 24MHz mcu (you see how low the ns timing requirement is, and know you cannot possibly fail), but you are into higher cpu speeds where you may no longer be able to 'eyeball' it. There is also OSPEEDR for a gpio pin which may also come into play (the default is LOW SPEED). I would assume you have a way to measure this, so measure it and see what you are getting - if it's good at any optimization level then check it off your list and look elsewhere.

I always use -Os from start to finish, so I see the compiler-generated asm in a consistent way and also get to deal with problems as I create them due to optimization. This also eliminates subtle timing problems that can show up even when your code is 'correct' and compiles at any optimization level (it should). Ideally, timing is also handled correctly no matter which optimization level, but that is easy to forget when you are doing something that a peripheral is not taking care of (like CS).

You can change optimization within a file (gcc), though it's probably not something you want to use except in special circumstances:

#pragma GCC push_options
#pragma GCC optimize ("-Os")
//code here
#pragma GCC pop_options


Also, when working in -Os it can sometimes get a little difficult to pick out the generated asm code of interest, so surrounding the code with nops is one way to highlight it, as mentioned before. I will sometimes make a function 'noinline' temporarily when I want a clearer view of it; then when I'm satisfied with what I see, it reverts back to what it was and I can be confident I'm getting the same thing, although it is now not as clear. Using an online compiler like godbolt.org is also a quick/good way to create/test code to see how the compiler acts, even though your mcu headers are not available.
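The 'noinline' trick above can be sketched in one line (function name invented): the attribute keeps the function as a distinct symbol in the disassembly even at -Os, so its generated code is easy to find; remove the attribute once you are done inspecting it.

```c
/* Hypothetical example of the temporary-noinline trick: GCC will not
 * inline scale3() into its callers, so its asm stays in one place in
 * the listing instead of being merged into surrounding code. */
__attribute__((noinline))
static int scale3(int x)
{
    return x * 3;
}
```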

 

Offline Doctorandus_P

  • Super Contributor
  • ***
  • Posts: 3732
  • Country: nl
Re: GCC compiler optimisation
« Reply #158 on: August 15, 2021, 02:47:35 pm »
7 pages in 2 weeks. Quite impressive, but did not read it all.
Quote
This is news to me; I thought that compilers didn't change the order of functions in a .c file :) Why should they?
A lot of processors have a limited range for relative addressing, or can use smaller pointers for shorter jumps. This is one good reason to re-order the functions, and it can avoid "trampolines".
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15103
  • Country: fr
Re: GCC compiler optimisation
« Reply #159 on: August 15, 2021, 04:40:07 pm »
Quote
When *writing* to objects on the stack that are never *read* afterwards before they get out of scope, the compiler is free to optimize out those writes entirely. From a purely functional POV, those writes would have absolutely ZERO effect. The compiler assumes that the local stack is under its full control - so in some cases, even with a volatile qualifier, the code may be optimized out if the local variables in question are never read

That's not right: the purpose of volatile is to define what is observable behaviour of the program. The concept of a stack is implementation; it doesn't matter what the scope of the variable is. If it's volatile, all accesses to it have to be made strictly per the semantics of C. As far as the C language is concerned, there is no concept of a stack, just objects that can be accessed (read or write). Although it's implementation-defined what it means for an access to be 'observable', the compiler cannot 'optimise away' a volatile access: by declaring it volatile you are telling the compiler it is observable.

Sorry, but... no... on almost all points you make here, except for the stack. The standard doesn't even talk about stacks, and any implementation is indeed free to implement local variables (and "return addresses") any way it sees fit. It just so happens that the most common way, by far, is to use stacks (the only C compiler I vaguely remember that didn't use a stack was an old one for a tiny programmable chip, and it was not really C-compliant anyway), which is why I used this as an example. But yeah, let's remove the "stack" term here, no problem with that. It was too specific.

The idea still remains: any *local* object that is not qualified static ceases to exist once it gets out of scope, so anything happening on such an object AFTER it goes out of scope just doesn't exist per the definition, wherever this object is actually stored (stack, registers, or whatever else the implementation does.)
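That lifetime point fits in a few lines (names invented): a plain local ceases to exist at the closing brace, so returning its address is the classic dangling-pointer bug GCC warns about, whereas a static-qualified local survives because it has static storage duration.

```c
/* Hypothetical illustration: `counter` is qualified static, so it has
 * static storage duration and the returned pointer stays valid across
 * calls. Drop the `static` and the object ceases to exist on return,
 * making the returned address indeterminate (GCC warns about it). */
static int *bump(void)
{
    static int counter = 0;
    counter++;
    return &counter;
}
```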

As to volatile, it's - unfortunately - more subtle than what you said.
1. To begin with: you talk about "observable". The only sentence actually using this term (in C99 at least) in the std is in this part:

Quote
An object that is accessed through a restrict-qualified pointer has a special association with that pointer. This association, defined in 6.7.3.1 below, requires that all accesses to that object use, directly or indirectly, the value of that particular pointer. The intended use of the restrict qualifier (like the register storage class) is to promote optimization, and deleting all instances of the qualifier from all preprocessing translation units composing a conforming program does not change its meaning (i.e., observable behavior).

From what I understand here, this defines an "observable behavior" as the *meaning* of a program. Problem here is: what is the meaning of a program? The way I get this is the same as what I meant by the "functional POV", so anything volatile-related, when it may have unknown side-effects, but no analyzable effect, is NOT observable behavior. I may be wrong here and I admit we are really nitpicking on terms. I could not find the definition of the "meaning of a program" in the std.

For instance, taking the typical "delay loop" example, is a "delay loop", doing absolutely nothing apart from taking CPU cycles, part of the meaning of the program? If you can answer this one by a resounding "yes", without a blink, and backing it up with solid arguments, you are better than I am.
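For what it's worth, the usual way to make such a delay loop survive the optimizer is to make the loop counter itself volatile. A sketch (the count is of course not calibrated to any real time):

```c
/* Hedged sketch of a crude busy-wait: without `volatile`, GCC at -O2
 * deletes the whole loop (empty body, otherwise-unused counter). With
 * it, every increment and comparison of `i` is a volatile access that
 * must be performed, so the loop actually burns cycles. */
static unsigned delay_loops(unsigned n)
{
    volatile unsigned i;
    for (i = 0; i < n; i++)
        ;   /* nothing to do: the accesses to i are the point */
    return i;
}
```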

2. More importantly, about the volatile qualifier: it's unfortunately more subtle than it looks. Let's again quote C99 for the relevant parts:

Quote
An object that has volatile-qualified type may be modified in ways unknown to the implementation or have other unknown side effects. Therefore any expression referring to such an object shall be evaluated strictly according to the rules of the abstract machine, as described in 5.1.2.3.

So far so good. Looks like "volatile" will guarantee that such a qualified object is evaluated in all cases, right?
But we need to refer to the "rules of the abstract machine" it mentions. So, again, relevant parts:

Quote
Accessing a volatile object, modifying an object, modifying a file, or calling a function that does any of those operations are all side effects, which are changes in the state of the execution environment. Evaluation of an expression may produce side effects. At certain specified points in the execution sequence called sequence points, all side effects of previous evaluations shall be complete and no side effects of subsequent evaluations shall have taken place. (A summary of the sequence points is given in annex C.)

Still looks, at this point, like the volatile object will be evaluated no matter what. But the following paragraph kind of ruins it all:

Quote
In the abstract machine, all expressions are evaluated as specified by the semantics. An actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no needed side effects are produced (including any caused by calling a function or accessing a volatile object).

So, it looks a bit like what I said earlier. Doesn't it? (See the part in bold.)
« Last Edit: August 15, 2021, 04:43:37 pm by SiliconWizard »
 

Offline ataradov

  • Super Contributor
  • ***
  • Posts: 11630
  • Country: us
    • Personal site
Re: GCC compiler optimisation
« Reply #160 on: August 15, 2021, 05:00:06 pm »
OK; let's look at whether "pData" or  "len" could get optimised out here
I think you are also confusing "optimized out" with code generated in a way that the values are no longer traceable by the debugger.

If this code is optimized, then you would not see any transfers. If you can put a logic analyzer on the bus and see that there is a transfer and it is 512 bytes, then nothing was optimized.

Relying on the debugger for everything you do is a bad idea.
Alex
 
The following users thanked this post: newbrain

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15103
  • Country: fr
Re: GCC compiler optimisation
« Reply #161 on: August 15, 2021, 05:10:47 pm »
Relying on the debugger for everything you do is a bad idea.

Definitely.
 

Online peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3953
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #162 on: August 15, 2021, 07:16:38 pm »
Fixed the 45dbxx issue with -Og.

There was a command to bring the device out of deep powerdown, but it was never put into deep powerdown in the first place (the command to do that was never used) although the device needed 35us if you did use that command. It worked with -O0 but marginally.

I also found a number of other timing issues where -Og shortened some delays considerably. This was quite interesting. I did some of that stuff months ago, not realising the optimisation issues. A good way to achieve supposedly reliable "minimum" timing is by reading an IO port. It is defined as volatile so the IO read itself can't get optimised out, but the compiler probably removed everything else, and possibly inlined the IO reads :) The chip made it obvious: it is a display controller and the display got corrupted. It is also an exceptionally slow chip (max SPI clock is just 1MHz, with pretty long CS setup and hold times).

Whether this is "broken code" is debatable because these are standard practices in embedded devt, for decades. They just don't work on these chips and with a modern compiler.

So in the end it wasn't anything to do with volatile declarations because I still can't get my head around some of the issues where you might try to fill a 512 byte buffer but it doesn't actually happen because the compiler has worked out that the last 90% of it was never accessed afterwards :)

Ataradov's microsecond delay function has been very useful :)
« Last Edit: August 15, 2021, 07:21:54 pm by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline ataradov

  • Super Contributor
  • ***
  • Posts: 11630
  • Country: us
    • Personal site
Re: GCC compiler optimisation
« Reply #163 on: August 15, 2021, 07:31:42 pm »
Whether this is "broken code" is debatable because these are standard practices in embedded devt, for decades. They just don't work on these chips and with a modern compiler.
Simply because you have been doing something for decades, when the tools were too shit to really optimize anything, does not make it right.

Even your reliance on blocking loops for delays will break in the future once the hardware gets better.

I still can't get my head around some of the issues where you might try to fill a 512 byte buffer but it doesn't actually happen because the compiler has worked out that the last 90% of it was never accessed afterwards :)
This is a very rare case in full and complete applications. It happens sometimes when you comment out some code for debugging, and the compiler finds a way to eliminate more than you expected.
Alex
 
The following users thanked this post: newbrain

Offline ejeffrey

  • Super Contributor
  • ***
  • Posts: 3859
  • Country: us
Re: GCC compiler optimisation
« Reply #164 on: August 15, 2021, 07:48:31 pm »

Quote
In the abstract machine, all expressions are evaluated as specified by the semantics. An actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no needed side effects are produced (including any caused by calling a function or accessing a volatile object).

So, it looks a bit like what I said earlier. Doesn't it? (See the part in bold.)

I think you are misreading that.  Calling a function or accessing a volatile object are examples of side effects that would prevent the implementation from omitting an expression.
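That reading can be shown in a couple of lines (names invented): even though the loaded value is discarded, the load itself is a "needed side effect" in the standard's wording, so it cannot be deduced away the way an ordinary unused read could be.

```c
/* Sketch: `tmp` is never used, but the read of *reg is a volatile
 * access, i.e. a needed side effect, so the load must still be
 * performed even though the value it produces is thrown away. */
static int poll_once(volatile const int *reg)
{
    int tmp = *reg;   /* load must happen */
    (void)tmp;        /* value itself may be discarded */
    return 0;
}
```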
 

Offline lucazader

  • Regular Contributor
  • *
  • Posts: 221
  • Country: au
Re: GCC compiler optimisation
« Reply #165 on: August 15, 2021, 08:12:00 pm »
Whether this is "broken code" is debatable because these are standard practices in embedded devt, for decades. They just don't work on these chips and with a modern compiler.

This is just completely not true. These "standard" embedded practices I see all the time result in hard-to-read code that is equally hard to maintain. It always seems to be held together by a shoestring and falls apart in a light breeze.
I really don't get why the embedded world has been so slow to learn from the rest of software development and adopt modern tools, compilers and best practices.

We use the latest gcc compiler at work with our STM32 based devices. Never have any issues with code breaking or getting optimised out, unless it was a mistake made by us in the code.
Especially don't have any issues with timing. But I guess that's because we don't implement our timers as blocking loops based on a certain number of instructions.
The systick, and the HAL_Delay that is derived from it, are a super consistent way to get good reliable timing.
If I need anything with more resolution than systick, the most reliable way for me is to start up a hardware timer (eg TIM6) at a 1us tick rate.
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15103
  • Country: fr
Re: GCC compiler optimisation
« Reply #166 on: August 15, 2021, 08:52:17 pm »

Quote
In the abstract machine, all expressions are evaluated as specified by the semantics. An
actual implementation  need  not  evaluate part of an expression if it can deduce that its
value is not used and that no needed side effects are produced (including any caused by
calling a function or accessing a volatile object)
.

So, it looks a bit like what I said earlier. Doesn't it? (See the part in bold.)

I think you are misreading that.

After re-reading it, possibly. The phrasing, admittedly, is not the best here. It's borderline confusing.

Now, try this piece of code. It's interesting:

Code: [Select]
#include <stddef.h>  /* for size_t */

void my_memcpy(volatile void *dst, volatile void *src, size_t n)
{
    while (n--)
        *(char *)dst++ = *(char *)src++;
}

void f()
{
    volatile char buffer[512];
    my_memcpy(buffer, (volatile char *)0x08000000, 512);
}

In your opinion, should it or should it not do anything?
 

Offline gf

  • Super Contributor
  • ***
  • Posts: 1308
  • Country: de
Re: GCC compiler optimisation
« Reply #167 on: August 15, 2021, 09:40:15 pm »
A bit questionable is certainly the cast from volatile void* to char*, as it drops volatile. I would rather not consider the memory access to *(char*)dst a volatile access any more.
 

Online peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3953
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #168 on: August 15, 2021, 10:11:02 pm »
Should debugging be at all possible with anything other than -O0 or -Og?

I have tried it with -O3 and while stepping does work, sort of, it does bizarre things.

I wonder what the explanation for this is. Stepping works by temporarily inserting a special opcode in that place, although the 32F has a small number of dedicated hardware breakpoints which, when hit, substitute the instruction in place of the opcode fetch so that works on code in FLASH also.

"It always seems to be held together by a shoestring and falls apart in a light breeze."

You are missing the point. If you want to achieve a delay of say 100ns (min) from CS=0 to something happening, you aren't going to make a call to the RTOS :) This is done by a short delay made up of non-removable instructions, and I don't think there is any other way to do it, short of external hardware which inserts wait states (which would be ridiculous in this case).

« Last Edit: August 15, 2021, 10:14:25 pm by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15103
  • Country: fr
Re: GCC compiler optimisation
« Reply #169 on: August 15, 2021, 10:11:25 pm »
A bit questionable is certainly the cast from volatile void* to char*, as it drops volatile. I would rather not consider the memory access to *(char*)dest a volatile access any more.

Alright, that's well spotted. Changing this to:
Code: [Select]
*(volatile char *)dst++ = *(volatile char *)src++;
does generate code for the copy.
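Putting gf's fix together, the whole function would look something like this (renamed here to avoid clashing with the earlier snippet, and with the source made const as well):

```c
#include <stddef.h>

/* Corrected version of the copy loop discussed above: volatile is kept
 * on both sides of the assignment, so every load and store is a
 * volatile access and the loop cannot be optimized out. */
static void my_memcpy_v(volatile void *dst, const volatile void *src, size_t n)
{
    volatile char *d = (volatile char *)dst;
    const volatile char *s = (const volatile char *)src;
    while (n--)
        *d++ = *s++;
}
```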
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15103
  • Country: fr
Re: GCC compiler optimisation
« Reply #170 on: August 15, 2021, 10:17:39 pm »
Should debugging be at all possible with anything other than -O0 or -Og?

I have tried it with -O3 and while stepping does work, sort of, it does bizzare things.

With optimizations, code may get simplified in ways you don't expect. Apart from code possibly being optimized out entirely, and local variables put in registers, one thing that frequently happens too is that code can get reordered compared to what you wrote in C, so that matching between source code and assembly code statement by statement is not guaranteed. So the debugger may appear to be jumping all over the place when single stepping. You often have to switch to assembly code view in this case to figure out what happens.
 

Offline ataradov

  • Super Contributor
  • ***
  • Posts: 11630
  • Country: us
    • Personal site
Re: GCC compiler optimisation
« Reply #171 on: August 15, 2021, 10:23:45 pm »
Should debugging be at all possible with anything other than -O0 or -Og?
Yes, you just need to get used to not seeing everything and just filling the gaps in your head. Why do you need to see absolutely everything? If you really need that, then stepping through assembly is a better option anyway.


Stepping works by temporarily inserting a special opcode in that place, although the 32F has a small number of dedicated hardware breakpoints which, when hit, substitute the instruction in place of the opcode fetch so that works on code in FLASH also.

Hardware breakpoints do not need any substitutions. They just stop execution when address comparator tells so. And placing transparently soft breakpoints in the flash is the worst debugger feature ever invented. Thankfully most sane debuggers let you disable that behaviour.
Alex
 

Offline lucazader

  • Regular Contributor
  • *
  • Posts: 221
  • Country: au
Re: GCC compiler optimisation
« Reply #172 on: August 15, 2021, 11:06:46 pm »
Should debugging be at all possible with anything other than -O0 or -Og?

I almost exclusively use -Os most of the time, including in debug builds.
As others have said you just have to get used to single stepping jumping around a bit.
I find it helps to also have logs as well as the debugger to decode what is going on.
 

Offline ejeffrey

  • Super Contributor
  • ***
  • Posts: 3859
  • Country: us
Re: GCC compiler optimisation
« Reply #173 on: August 16, 2021, 05:47:47 am »
Should debugging be at all possible with anything other than -O0 or -Og?

I have tried it with -O3 and while stepping does work, sort of, it does bizarre things.

I wonder what the explanation for this is. Stepping works by temporarily inserting a special opcode in that place, although the 32F has a small number of dedicated hardware breakpoints which, when hit, substitute the instruction in place of the opcode fetch so that works on code in FLASH also.

Single stepping doesn't use explicit breakpoints at all, hardware or software.  It is a separate operation mode of the CPU that causes it to halt before every instruction.  The debugger then reads the PC and uses the debug symbols to map that address back to the source line of code that generated it.

The reason it jumps around is simply that the compiler re-ordered the code.

 

Offline DavidAlfa

  • Super Contributor
  • ***
  • Posts: 6166
  • Country: es
Re: GCC compiler optimisation
« Reply #174 on: August 16, 2021, 07:57:08 am »
You should debug with -O0 unless you want weird things happening.
The compiler will optimize the code, re-use parts from other functions, update the variables in a different order, or omit them entirely... you will find yourself looking at nonsensical behaviour when debugging.
Hantek DSO2x1x            Drive        FAQ          DON'T BUY HANTEK! (Aka HALF-MADE)
Stm32 Soldering FW      Forum      Github      Donate
 

