For the last twenty years, I've written a
lot of C code, the vast majority compiled with GCC, some with Clang, Intel CC, Pathscale, or Portland Group compilers. Always with -O2. My bug rate is lower than average, too; I do claim to
know this stuff.
(I'm also proficient in several other programming languages from Fortran to Python, and have a pretty wide-ranging background in development, so I'm not "stuck" in C, or in imperative programming languages, either. By this I mean I have years of experience in software development in very different fields, from web development to microcontrollers; I'm not a one-niche guy who assumes his experience in one niche extends everywhere else. I've just found that in all the niches where I've used C,
-O2 has been the proper optimization choice.)
On x86 and x86-64, I do a lot of parallel and distributed processing. For efficiency, I often use lockless structures via compiler-provided atomic built-ins. A decade ago, I used to write a lot of extended inline assembly for SIMD operations, but nowadays the x86 intrinsics are so well integrated into the compilers that it is no longer necessary. So, I do claim I know quite a bit about the complex interactions between threads, atomicity, and manipulating volatile data (pun intended).
The first MCU I started developing on was a Teensy 2.0++, an Atmel AT90USB1286, on top of a set of header files, avr-libc, and avr-gcc.
I have a few dev boards using ATtinys (digispark clones), ATmega32u4 (pro micro clones), and ATmega328 (pro mini clones) that I've programmed the same way, on bare metal. For more interesting stuff, I now use ARM microcontrollers, in Arduino or PlatformIO environments. (I particularly like the Teensy LC, 3.2, 4.0, and 4.1, of which I have at least one each. I do have about a dozen others, from various manufacturers, some still in their original packaging.)
On the electronics side, I'm an utter ham-handed hobbyist. I do have a physics background, with theory courses covering electronics (up to op-amps and digital logic), as my "core field" is computational materials physics, specifically simulator software development, but I have only used this "in anger" in the last few years, mostly using EasyEDA and JLCPCB (because it's so darned easy there). So, I'm still learning myself, and not "stuck" believing I know everything I need to know; I know I don't know enough, and am
very interested in learning, and not at all afraid of admitting publicly when I'm wrong. I'm deliberately very blunt that way. It's cathartic, too.
GCC
atomic built-ins are available for ARM architectures, and are also provided by LLVM Clang (i.e. if you use Clang to compile for ARM targets). Do not let the C++11 reference mislead you; all it means is that the six
__ATOMIC_ memory order constraints use the memory model definitions from the C++11 standard, that's all. It is perfectly acceptable to use these in plain C code, or in embedded C++. I typically end up using
__ATOMIC_SEQ_CST anyway. The "trick" is to always use the atomic built-in when accessing a variable, and not mix non-atomic accesses with atomic ones, unless the non-atomic accesses are allowed to occasionally be garbled (like in a compare-and-swap loop).
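For example, a counter shared between an interrupt handler and the main loop might be accessed like this. This is only a sketch: the names are mine, and it assumes a target where such 32-bit atomic operations are available.

```c
#include <stdint.h>

/* Shared between an interrupt handler and the main loop.
   The name is illustrative.  All accesses go through the
   __atomic built-ins, except the seed read in consume_events(),
   which may be stale; the compare-and-swap loop tolerates that. */
static volatile uint32_t event_count = 0;

/* From the interrupt handler (or another thread): count an event. */
void event_occurred(void)
{
    __atomic_add_fetch(&event_count, 1, __ATOMIC_SEQ_CST);
}

/* From the main loop: how many events so far? */
uint32_t events_so_far(void)
{
    return __atomic_load_n(&event_count, __ATOMIC_SEQ_CST);
}

/* From the main loop: atomically fetch and reset the count. */
uint32_t consume_events(void)
{
    uint32_t old = event_count;   /* plain read; stale is fine here */
    while (!__atomic_compare_exchange_n(&event_count, &old, 0,
                                        0, /* strong variant */
                                        __ATOMIC_SEQ_CST,
                                        __ATOMIC_SEQ_CST))
        ;   /* on failure, 'old' was updated to the current value */
    return old;
}
```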
The C standard is defined in terms of an abstract machine. Mostly, C compilers strictly follow this standard. GCC (and other compilers) have options that diverge or relax some rules. Generally,
-O2 does not include any of those. You can check by examining the output of
gcc -c -Q -O2 --help=optimizers.
The C standard does leave many things "implementation defined". (When you use freestanding C++, for example in the Arduino environment, almost everything is "implementation defined" per the C++ standard; only when you know a feature is available can you reasonably consult the C++ standard to see how it is supposed to behave. The corresponding freestanding C environment is much more tightly "defined", so the environment normally used for microcontroller development is really quite a complicated subset of C and C++.)
The most relevant terms here are
immutable,
constant,
const,
volatile, and
atomic.
Atomic is the most complicated one, because it really refers to several things that attempt to achieve the same result. On some hardware architectures, basic accesses to basic types are inherently atomic. Later C and C++ standards define
atomic types corresponding to these. GCC and LLVM Clang provide the aforementioned built-ins, which implement such atomic accesses where the hardware makes it possible; if the target type is inherently atomic on that hardware, these built-ins compile down to plain accesses, and are thus very efficient ways to implement atomic accesses. In both cases, the problem is that not all hardware architectures implement the full complement of atomic operations (in particular, most provide either compare-and-exchange or load-linked/store-conditional primitives), so one kinda-sorta needs to check the compiler-generated assembly or machine code to see whether the constructs you need generate sane-looking code. If you see things like "disable interrupts", you know that operation isn't really atomic on that architecture, and the compiler is trying to work around the hardware; a different code pattern is then needed on that hardware.
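One way to do that check is to compile a tiny test function to assembly and read the result; the file name and flags below are just an example of the approach, substitute the cross-compiler and target options you actually use.

```c
/* atomic_probe.c: compile to assembly and inspect it, for example with
 *     arm-none-eabi-gcc -mcpu=cortex-m4 -O2 -S atomic_probe.c
 * On a Cortex-M4 you would expect an LDREX/STREX retry loop here;
 * a call into a helper library, or code that masks interrupts,
 * tells you the hardware cannot do this operation atomically. */
#include <stdint.h>

uint32_t probe_fetch_add(volatile uint32_t *counter)
{
    return __atomic_fetch_add(counter, 1, __ATOMIC_SEQ_CST);
}
```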
Constant !=
const. The C and C++ standards use specific definitions for terms like "literal constant" and "constant"; they are not necessarily what one might think they mean. So, for
constant, be careful to check the context in which it is used. (Also, if you find your compiler does not do something that the standard says it should, it means either that the compiler has a bug, or that the compiler developers and you read that passage in the standard differently. I always say that
reality trumps theory, because it does. Instead of railing against it, it is more effective to report it (but accept that it likely will be ignored) and work around it, because what matters is that the generated code works as required in all situations in real life; whether it is exactly according to rules drawn up by a committee is always a secondary concern, something for the business and people staff to discuss in their endless meetings.)
Immutable is used in the sense of "this is not allowed to be modified". From the C programmer's view, an immutable object or variable resides in read-only memory, and an attempt to modify it causes "undefined behaviour", something that depends on the hardware and the environment used. In userspace code running under a full operating system, it usually leads to a segmentation violation and a crash of that process. On a microcontroller, the attempt may be silently ignored, the MCU may reset, or an interrupt may fire. It varies.
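As a hypothetical example of what that means in practice:

```c
/* On ARM Cortex-M toolchains a const array like this typically stays
   in flash; on AVR it is copied to RAM unless PROGMEM is used.  Either
   way, writing to it through a cast is undefined behaviour. */
static const char greeting[6] = "hello";

void corrupt_greeting(void)
{
    char *p = (char *)greeting;   /* casting away const */
    p[0] = 'H';                   /* undefined behaviour: may fault,
                                     be silently ignored, or worse */
}
```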
This leaves the two C keywords,
const and
volatile.
const is a promise from the programmer to the compiler that the code does not try to modify the object or variable so denoted.
volatile is the inverse: it means the compiler is not allowed to make any assumptions whatsoever about the object or variable so denoted.
This means that constructs such as
const volatile int foo; are perfectly valid and useful. The
const keyword is a promise to the compiler that the code in this scope will not try to modify
foo, and the
volatile keyword tells the compiler that whenever
foo is used, it must read its value from memory, because it may be modified by something unknown; even by hardware, another thread, whatever.
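The classic use is a read-only, hardware-updated register. The macro name and address below are made up for illustration; real ones come from your device headers.

```c
#include <stdint.h>

/* Hypothetical read-only, hardware-updated status register.
   'const': this code never writes it.  'volatile': every use is a
   real load from that address, never a cached or assumed value. */
#define STATUS_REG (*(const volatile uint32_t *)0x40001000u)

int device_ready(void)
{
    return (STATUS_REG & 0x1u) != 0;   /* fresh read on every call */
}
```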
Trick is,
const and
volatile work exactly that way in an expression as well. Even if you have an object or variable not declared
volatile, you can take its address, cast that to a pointer to a volatile-qualified version of its type, and dereference the cast; such an access is then equivalent to one where the variable or object was declared
volatile in the first place. I do not recommend this as a general pattern, because it means the type of that variable must be duplicated in every such cast.
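As a sketch, such a cast-and-dereference might look like this (the variable name is mine; as said, I don't recommend it as a general pattern):

```c
#include <stdint.h>

static uint32_t shared_flag;   /* deliberately not declared volatile */

uint32_t read_flag_as_volatile(void)
{
    /* Take the address, cast it to pointer-to-volatile of the same
       type, dereference: this one access behaves as if 'shared_flag'
       had been declared volatile.  Note the type repeated in the cast,
       which is exactly why this is error-prone as a general pattern. */
    return *(volatile uint32_t *)&shared_flag;
}
```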
A much better pattern is to have the variable or object declared
volatile, but in any scope where a snapshot of that suffices, just copy its value into a local
const one (non-volatile).
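The snapshot pattern might look like this sketch. The names are illustrative, and it assumes a target where a 32-bit read is a single access; otherwise you need the atomic built-ins or a brief interrupt guard around the read.

```c
#include <stdint.h>

static volatile uint32_t tick_count;   /* updated by a timer interrupt */

void slow_work(void)
{
    /* One volatile read, then a plain const snapshot.  Within this
       scope the value cannot change under us, and the compiler may
       optimize uses of 'now' freely.  (On an 8-bit MCU a 32-bit read
       is not a single access, so a torn read is possible there.) */
    const uint32_t now = tick_count;

    if ((now % 1000u) == 0u) {
        /* ... do once-per-1000-ticks work using 'now' ... */
    }
}
```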
There are a couple of additional details in C that are useful when dealing with numerical expressions (important when you get into stuff like Kahan summation), without going into the details of how that abstract machine works and what
side effects and
sequence points are. First is that in C99 and later, casts of numeric types, both reals and integers, limit the range and precision to that of the cast type. The second is that unsigned integer arithmetic is modulo arithmetic (wraps around), and since C99, there are exact-width unsigned binary types
uintN_t, and binary two's complement types
intN_t (and the corresponding minimum-width and fastest types) provided by the compiler in
<stdint.h> even in freestanding environments (i.e.,
always). As of 2021-08-16, the fixed-point support in GCC is not good enough to really use in my opinion. Using the
integer overflow built-ins you can do multi-limb (multi-byte/word) counters trivially. (Making one atomic really needs two generation counters and a retry loop, though, for both reading and modifying.)
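For example, here is a sketch of such a multi-limb counter built from the overflow built-ins; this is the single-context version only, and the generation-counter scheme mentioned above would be needed to make it safely readable from elsewhere.

```c
#include <stdint.h>

/* A 64-bit event counter made of two 32-bit limbs.  Illustrative
   only, and only safe when read and incremented from one context. */
static uint32_t count_lo, count_hi;

void count_event(void)
{
    /* __builtin_add_overflow() returns nonzero when the addition
       wrapped around; on wrap, carry into the high limb. */
    if (__builtin_add_overflow(count_lo, 1u, &count_lo))
        count_hi++;
}
```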
Finally,
compiler barriers, in particular
__asm__ __volatile__("": : :"memory"); , can be used to ensure all memory accesses in preceding code are done prior to this barrier, and all memory accesses done in succeeding code are done after this barrier. Basically, it makes sure the compiler does not move memory accesses across this barrier. (It does this by basically telling the compiler that everything it knows about memory contents at this exact point becomes invalid.)
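Wrapped in a macro, and used to make sure a data buffer is fully written before a "ready" flag is raised, it looks like the sketch below. The names are mine; note that this constrains only the compiler, it emits no instructions and is not a hardware memory fence.

```c
/* Compiler barrier: the compiler may not move memory accesses across
   this point.  It emits no instructions and is not a hardware fence. */
#define barrier()  __asm__ __volatile__("" : : : "memory")

static unsigned char buffer[64];
static volatile unsigned char buffer_ready;

void publish(const unsigned char *data, unsigned int len)
{
    unsigned int i;

    for (i = 0; i < len && i < sizeof buffer; i++)
        buffer[i] = data[i];

    barrier();          /* stores to 'buffer' are emitted before this */

    buffer_ready = 1;   /* only now raise the flag */
}
```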
I really, really do not understand why you'd find
-O0 necessary. I suspect it is because you haven't really yet
grokked how C compilers and the C language work, deep inside the nitty-gritty details. This is not an insult; not all C programmers need that kind of deep understanding to effectively wield C in anger, but since you found you need to disable optimizations to get the code you want, I suspect you do need to know. I warmly recommend reading the standard; specifically, starting with the C99 version, because it is the most widely supported one (except by the Microsoft C++ compiler, as Microsoft still refuses to fully support C99, even after contributing significantly to the later C11 version). The final draft, with the three corrigenda included, is publicly available as
n1256.pdf at open-std.org. (For C11, the final draft is
n1570.pdf, and for C18, archived as
n2176.pdf.) The actual standards can be bought from ISO, but I haven't bothered; too expensive for what they are.