Author Topic: GCC compiler optimisation (Read 45829 times)

ataradov · « **Reply #25 on:** August 02, 2021, 09:24:07 pm »

Quote from: peter-h on August 02, 2021, 09:17:33 pm

And interrupts are not enabled at all in the loader. That complicates things too much because you have to switch the ISRs to RAM as well.

That is fine. Where is your application vector table located?

Quote from: peter-h on August 02, 2021, 09:17:33 pm

Unless this CPU is doing something totally weird I can't see why the SP should be affected. It gets initialised as the first instruction in the .s startup.

Why? You are doing something really weird, but whatever, it is up to you what you do with your code. Why even have assembly startup? Designers of Cortex-Mx cores worked hard to make it possible to program the whole thing in C. Why go back to stone age?

Quote from: peter-h on August 02, 2021, 09:17:33 pm

But perhaps you meant using the normal SRAM for this; I would not risk that not getting corrupted by a reset, and in any case it gets wiped by the startup .s code (zeroing BSS etc).

There is no risk; SRAM is guaranteed to retain its values over resets as long as the MCU is powered. It is the bootloader code that would be looking at that value, so just don't erase it in the code. There is no problem here.

I don't understand. Do you erase the bootloader on the first update? Why even have it in the first place then?

lucazader · « **Reply #26 on:** August 02, 2021, 09:28:00 pm »

Quote from: ataradov on August 02, 2021, 09:19:34 pm

What do you mean? The software requested MCU reset in my scenario would reset all the peripherals to their default values.
There will never be any issues, unless reset of the MCU does not fully reset the peripherals, but that would be stupid and I don't think this ever happens for practical devices.

You mean using something like:

Code: [Select]

NVIC_SystemReset()
I hadn't thought to use that, will it boot to the application at 8008000?
If so, that sounds much easier! I'll give it a shot, thanks!

ataradov · « **Reply #27 on:** August 02, 2021, 09:33:15 pm »

Quote from: lucazader on August 02, 2021, 09:28:00 pm

I hadn't thought to use that, will it boot to the application at 8008000?

It will boot as normal, but you can do the check first thing in the bootloader before anything is initialized.

To request application run:

Code: [Select]

static uint32_t *ram = (uint32_t *)HMCRAMC0_ADDR;

    ram[0] = 0x...; // Any random values
    ram[1] = 0x...;
    ram[2] = 0x...;
    ram[3] = 0x...;
    NVIC_SystemReset();

And then the first thing in the bootloader main() (or even startup code if you want):

Code: [Select]

  if (0x... == ram[0] && 0x... == ram[1] &&  0x... == ram[2] && 0x... == ram[3]) // same values as before
  {
    // jump to the application
  }

I've used this method many times without any issues.

Furthermore, it is a very convenient way to request a bootloader from the application. The application writes a different set of values and resets the MCU. The bootloader runs, checks the values and understands that it needs to stick around.

There is obviously a chance that SRAM will randomly have the same set of values, but probability is very low, and a power cycle will solve the issue if that ever happens.
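A minimal sketch of the handshake described above, written so that it also builds on a PC. The magic numbers and the function names are placeholders, not values from the post; on the target, `ram` would point at a fixed SRAM address that the startup code does not zero. Wiping the marker once it has been read is an addition here, so that a later unrelated reset does not see a stale request.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical magic values; any four unlikely words will do. */
static const uint32_t run_app_magic[4] = {
    0x1234ABCDu, 0x89EF5670u, 0x0F1E2D3Cu, 0xA5A55A5Au
};

/* Requesting side: leave the marker in RAM.
   On the target, NVIC_SystemReset() would follow this call. */
static void request_app_run(volatile uint32_t *ram)
{
    for (int i = 0; i < 4; i++) {
        ram[i] = run_app_magic[i];
    }
}

/* Bootloader side, first thing after reset: returns true if the marker
   is present, and wipes it either way so it is only honoured once. */
static bool take_app_run_request(volatile uint32_t *ram)
{
    bool match = true;
    for (int i = 0; i < 4; i++) {
        if (ram[i] != run_app_magic[i]) {
            match = false;
        }
        ram[i] = 0;
    }
    return match;
}
```

The bootloader would call `take_app_run_request()` before touching any peripheral and jump to the application if it returns true.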

gf · « **Reply #28 on:** August 02, 2021, 09:53:19 pm »

Or vice versa, jump to the application by default, unless ram[0...3] indicate that an update needs to be done.

Code: [Select]

  if ((0x... == ram[0] && 0x... == ram[1] &&  0x... == ram[2] && 0x... == ram[3]) || !app_flash_checksum_is_valid())
  {
    // write new application to flash

    // clear update marker
    ram[0] = ram[1] = ram[2] = ram[3] = 0;
  }

  // jump to the application

Edit: Btw, if the bootloader does not overwrite itself when flashing the new application, then there is also no need to run it from RAM.
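The snippet above calls app_flash_checksum_is_valid() without showing it. One possible shape for such a check, purely as a sketch: a plain bitwise CRC-32 over the application image, compared with an expected value. The function names and the idea that the expected CRC is handed in by the caller (for example read from a word the build appends after the image) are assumptions, not something stated in the thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected, polynomial 0xEDB88320); small rather than fast. */
static uint32_t crc32_calc(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            /* Shift right; XOR in the polynomial when the dropped bit was 1. */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

/* Hypothetical check: image and len describe the application area,
   expected is the CRC stored alongside it. */
static bool app_image_crc_ok(const uint8_t *image, size_t len, uint32_t expected)
{
    return crc32_calc(image, len) == expected;
}
```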

peter-h · « **Reply #29 on:** August 03, 2021, 06:09:59 am »

"Btw, if the bootloader does not overwrite itself when flashing the new application, then there is also no need to run it from RAM."

That is an interesting angle. It has been claimed in various places online that the 32F4 does not crash if you program the CPU FLASH with code running out of CPU FLASH. You just get a wait state which persists for the duration of the programming cycle.

I wonder if anyone has actually tested this?

Notably, the ST functions for CPU FLASH programming run out of RAM. Maybe they do it as a general case so the CPU can reprogram all its FLASH, or maybe they know something...

NVIC_SystemReset() should boot to the bottom reset vector, which is in the table at 0x08000000. I never change that table. The discussion mentioning 0x08008000 (base+32k) is merely to do with my application where I retain the bottom 32k across CPU flashing (in most cases) while loading app code at base+32k whose entry point is required to be at 0x08008000. This thread started with an example where this was failing because the compiler was sneaking some other code in before that, pushing the real entry point down a hundred bytes or so.

"Where is your application vector table located?"

In the bottom; never changes. The application runs under the RTOS.

"Why even have assembly startup? Designers of Cortex-Mx cores worked hard to make it possible to program the whole thing in C. Why go back to stone age?"

That is what ST supply, with their development board which we (like most people, I am sure) used as a starting point. I know assembler so am perfectly comfortable with it. It also ensures one doesn't get the very thing which started this thread.

A .s file will never get its modules re-ordered by the assembler.

JOEBOBSICLE · « **Reply #30 on:** August 03, 2021, 06:27:36 am »

You don't need to copy the functions to RAM, you can definitely program other parts of flash using the normal hal functions.

peter-h · « **Reply #31 on:** August 03, 2021, 06:42:53 am »

OK, so it is only for programming the same block (16k or whatever size) that you need to move the programming code elsewhere. This (that someone has actually tested it) would have been useful to know when I was posting this
https://www.eevblog.com/forum/microcontrollers/how-to-create-elf-file-which-contains-the-normal-prog-plus-a-relocatable-block/

I did know that the 2MB version of the 32F4 can program one 1MB bank while executing from the other 1MB bank, but we aren't using that version.

And the RAM resident flash code also covers reprogramming the whole FLASH.

I wonder how ST deal with stuff like timer ticks for timeouts (which imply ISR usage) when getting a few milliseconds' worth of wait states?

ataradov · « **Reply #32 on:** August 03, 2021, 06:44:05 am »

If you execute write operations while running from the same flash bank, code execution will simply stall. This is a well documented behaviour. The reason to run from SRAM is if you want to still have interrupt vectors running (for example to receive next block of data while current one is written). ST code is just generic to address this scenario, there is no conspiracy, they don't "know something".

Your explanation of the memory layout makes no sense. Why is it required to be at 0x08008000? There is no reason for it. You seem to completely misunderstand how Cortex-Mx applications with bootloaders are supposed to be structured.

peter-h · « **Reply #33 on:** August 03, 2021, 06:47:00 am »

" code execution will simply stall."

That's good to know, but it doesn't happen with anything I have used previously. It would just crash, because the FLASH was not readable during a programming cycle, so the CPU fetched a duff opcode and crashed.

"Why is it required to be at 0x08008000? "

It's a complicated explanation, to do with loading an application module "somewhere".

I solved this function reordering business by using a stub containing just one function, and no include files, and linking this in the linkfile at 0x08008000, so there is now no chance of something else getting before that.
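For anyone wanting to do the same, a sketch of what such a stub could look like with GCC. The section name `.app_entry`, the function names and the `FLASH_APP` memory region are made up for illustration; the placeholder `app_main()` is only there so the sketch builds on its own and would normally live in another file.

```c
/* app_entry.c: hypothetical stub file, kept free of includes so nothing
   else can end up ahead of the entry function in this object. */

/* Placeholder so the sketch builds on its own; in a real project this
   would be the application's start routine, defined elsewhere. */
static int app_started;
void app_main(void)
{
    app_started = 1;
}

/* The only function in the named section; the linker script pins it. */
__attribute__((used, section(".app_entry")))
void app_entry(void)
{
    app_main();
}

/* Matching linker script fragment (GNU ld), region name hypothetical:
 *
 *   .app_entry 0x08008000 :
 *   {
 *       KEEP(*(.app_entry))
 *   } > FLASH_APP
 */
```

Because the section holds nothing but this one function and is placed at a fixed address, the compiler's ordering of other functions no longer matters for the entry point.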

peter-h · « **Reply #34 on:** August 08, 2021, 06:05:52 am »

This is an interesting compiler optimisation Q.

This is from the ST libs for flashing the 32F4 CPU FLASH

Code: [Select]

/**
  * @brief  Program word (32-bit) at a specified address.
  * @note   This function must be used when the device voltage range is from
  *         2.7V to 3.6V.
  *
  * @note   If an erase and a program operations are requested simultaneously,
  *         the erase operation is performed before the program one.
  *
  * @param  Address specifies the address to be programmed.
  * @param  Data specifies the data to be programmed.
  * @retval None
  * Waits for previous operation to finish
  *
  */

static void B_FLASH_Program_Word(uint32_t Address, uint32_t Data)
{
	// wait for any previous op to finish
	while(__HAL_FLASH_GET_FLAG(FLASH_FLAG_BSY) != RESET);
	// clear program size bits
	CLEAR_BIT(FLASH->CR, FLASH_CR_PSIZE);
	// reload program size bits
	FLASH->CR |= FLASH_PSIZE_WORD;
	// enable programming
	FLASH->CR |= FLASH_CR_PG;
	// write the data in
	*(volatile uint32_t*)Address = Data;
}

It is the use of "volatile". They actually use a macro (being ST) called __IO, which one can understand for writing (not reading, surely) IO pins. But for writing FLASH memory? How can the compiler know that location might not be read back much later?

That func is a stripped down version of the real one, which is full of error condition code which basically cannot happen unless the silicon is defective. In my use I anyway do a verify and then potentially 1 more attempt. The above is working code.

The code is still stupidly convoluted e.g. the use of CLEAR_BIT but that's another story. I wonder about the mentality of the people who write this auto generated code.

abyrvalg · « **Reply #35 on:** August 08, 2021, 07:33:01 am »

volatile is there to force the order of operations I think. Not using volatile would mean the result of the assignment will not be seen by other code until return, so the optimizer could place it in any part of the function.

Siwastaja · « **Reply #36 on:** August 08, 2021, 07:56:13 am »

Quote from: abyrvalg on August 08, 2021, 07:33:01 am

volatile is there to force the order of operations I think. Not using volatile would mean the result of the assignment will not be seen by other code until return, so the optimizer could place it in any part of the function.

This - for the same reason, all the above control registers are already qualified volatile in the header files.

It's very inefficient to clear and set the PSIZE bits for each word, though. A function writing a whole buffer of words would be much more sensible than calling this function in a loop.
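A rough sketch of such a buffer-writing function. To keep it buildable on a PC, the flash controller is a small stand-in struct passed by pointer rather than the real FLASH peripheral; the struct, the function name and the `SK_` constants are hypothetical, and the bit positions are given as recalled for the STM32F4 (BSY = SR bit 16, PG = CR bit 0, PSIZE = CR bits 9:8), so check them against the reference manual.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the flash controller registers (hypothetical layout). */
typedef struct {
    volatile uint32_t SR;
    volatile uint32_t CR;
} flash_regs_t;

#define SK_SR_BSY      (1u << 16)  /* operation in progress   */
#define SK_CR_PG       (1u << 0)   /* programming enabled     */
#define SK_CR_PSIZE    (3u << 8)   /* program size field      */
#define SK_CR_PSIZE_32 (2u << 8)   /* 32-bit program size     */

static void flash_program_words(flash_regs_t *flash, volatile uint32_t *dst,
                                const uint32_t *src, size_t count)
{
    /* Wait for any previous operation to finish. */
    while (flash->SR & SK_SR_BSY) { }

    /* One read-modify-write for the whole buffer, via a temporary. */
    uint32_t cr = flash->CR;
    cr &= ~SK_CR_PSIZE;
    cr |= SK_CR_PSIZE_32 | SK_CR_PG;
    flash->CR = cr;

    for (size_t i = 0; i < count; i++) {
        dst[i] = src[i];                    /* volatile store, kept in order */
        while (flash->SR & SK_SR_BSY) { }   /* wait for this word to finish  */
    }

    /* Leave programming mode again. */
    flash->CR &= ~SK_CR_PG;
}
```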

DavidAlfa · « **Reply #37 on:** August 08, 2021, 08:07:20 am »

volatile means that the data can change at any time regardless of the code, so avoid any assumptions when optimizing.
So the compiler doesn't optimize the code based on its value.

For example, this will translate as a while(1) and never work, no matter the value of flag, because the compiler only sees flag=1:

Code: [Select]


uint8_t flag;

void something(){
    flag=1;
    while(flag);
}

void interrupt(){
  flag =0;
}

In this case, as flag is changed by an interrupt, it breaks program workflow, so it must be declared as volatile. Now the compiler will check it in every loop.
Same with anything related to peripherals, ports...

That's one of the first things you learn when programming. When you see that a variable is 0 but the code is still stuck in the loop... you blame everything! Must be this hell of a compiler, full of bugs!
It happens only once in life and you never forget it. I still remember that moment when I started fiddling with C 12 years ago. Lost some hair before discovering it.
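For completeness, a sketch of the corrected version with the flag declared volatile. The `background` hook is not part of the original example; it only stands in for the hardware raising the interrupt, so the loop can be exercised on a PC. The function names are made up.

```c
#include <stdint.h>

/* volatile: every test of flag in the loop is a real read from memory. */
static volatile uint8_t flag;

/* Stands in for the interrupt handler. */
static void on_interrupt(void)
{
    flag = 0;
}

/* Sets the flag and spins until something clears it.
   Returns the number of passes through the loop. */
static unsigned wait_until_cleared(void (*background)(void))
{
    unsigned passes = 0;
    flag = 1;
    while (flag) {
        passes++;
        if (background) {
            background();   /* on real hardware the ISR fires by itself */
        }
    }
    return passes;
}
```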

Siwastaja · « **Reply #38 on:** August 08, 2021, 08:26:38 am »

We know that, but in this example the reason for volatile is different: it's used in writes only, and here it simply enforces the order of those writes.

(Doing the read-modify-writes with a volatile-qualified control register like in that code is crappy. ST usually does better, if you look at their functions usually they use a temporary variable to avoid multiple unnecessary read-modify-write operations. Personally, I prefer to avoid read-modify-writes of peripheral registers and do only writes, with two advantages: setting the full peripheral state to a known value at once, and with better performance, too. But sometimes this isn't an option as it would create new dependencies between different pieces of code.)

DavidAlfa · « **Reply #39 on:** August 08, 2021, 10:34:25 am »

Yup, sorry if I seemed condescending.
What I meant is volatile is for everything that behaves differently from obvious.
Anyways, these peripherals requiring specific patterns (e.g. flash write unlock) usually have a macro or function in assembly.

Siwastaja · « **Reply #40 on:** August 08, 2021, 10:45:03 am »

Didn't seem condescending, no probs.

You definitely don't need to write anything in assembly to write the flash, and (static inline) functions are really cleaner than macros.

Just a few simple control register writes like with any peripheral, and then normal memory write operations (qualified volatile to ensure right order of operations), i.e., assignments.

Flash peripherals vary from device to device as usual with ST, but basically the sequence is,
* Write magic numbers to unlock register,
* Possibly set some options like programming parallelism size, allowing higher speed if you have enough supply voltage available
* Write erase command bit to '1' in control register to erase, poll some completion flag
* Write write enable bit to '1' in control register
* Do a normal memory write access, poll for some completion/busy flag.
* Repeat the previous item until finished.
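The unlock and erase steps above can be sketched roughly as follows. This is written against a stand-in struct so that it builds on a PC; the struct, the function names and the `FK_` constants are hypothetical. The key values and bit positions are as recalled for the STM32F4 (keys 0x45670123 and 0xCDEF89AB, SER = CR bit 1, SNB = CR bits 6:3, STRT = CR bit 16, BSY = SR bit 16) and should be checked against the reference manual for the actual part.

```c
#include <stdint.h>

/* Stand-in for the flash controller registers (hypothetical layout). */
typedef struct {
    volatile uint32_t KEYR;
    volatile uint32_t SR;
    volatile uint32_t CR;
} flash_ctrl_t;

#define FK_KEY1        0x45670123u
#define FK_KEY2        0xCDEF89ABu
#define FK_SR_BSY      (1u << 16)
#define FK_CR_SER      (1u << 1)                       /* sector erase  */
#define FK_CR_SNB(n)   (((uint32_t)(n) & 0xFu) << 3)   /* sector number */
#define FK_CR_PSIZE    (3u << 8)
#define FK_CR_PSIZE_32 (2u << 8)
#define FK_CR_STRT     (1u << 16)                      /* start erase   */

/* Step 1: write the two magic numbers to the key register. */
static void flash_unlock(flash_ctrl_t *f)
{
    f->KEYR = FK_KEY1;
    f->KEYR = FK_KEY2;
}

/* Steps 2 and 3: set parallelism, select the sector, start, poll. */
static void flash_erase_sector(flash_ctrl_t *f, unsigned sector)
{
    while (f->SR & FK_SR_BSY) { }

    uint32_t cr = f->CR;
    cr &= ~(FK_CR_PSIZE | FK_CR_SNB(0xFu));
    cr |= FK_CR_PSIZE_32 | FK_CR_SER | FK_CR_SNB(sector);
    f->CR = cr;
    f->CR |= FK_CR_STRT;

    while (f->SR & FK_SR_BSY) { }

    /* Deselect erase mode before any programming. */
    f->CR &= ~(FK_CR_SER | FK_CR_SNB(0xFu));
}
```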

abyrvalg · « **Reply #41 on:** August 08, 2021, 12:54:46 pm »

BTW, regarding the read-modify-write, why does nobody use the Cortex-M’s bit banding (accessing individual SRAM/IO bits via aliased 22xxxxxx/42xxxxxx memory regions)? Seen it only once I guess - one particular PVD bit accessed via hardcoded 42xxxxxx address in F103 SPL. It would be cool to have it supported by compilers (i.e. you declare a reg/var as a struct with bit fields and the compiler uses the best access method), perhaps that’s why nobody uses it - no obvious ways to use, no examples. I’ve defined a BB(reg, bit) macro doing the necessary conversion and use it sometimes. One important aspect of bit banding is the guaranteed atomic write (locked AHB cycle, even DMA couldn’t steal the bus between read and write).
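For reference, the address conversion behind such a macro could look like the sketch below. It assumes the usual Cortex-M3/M4 layout (1 MB bit-band regions at 0x20000000 and 0x40000000, with alias regions 0x02000000 above them); the function name and this particular form of BB() are guesses for illustration, not the macro mentioned above.

```c
#include <stdint.h>

/* Alias-word address for one bit of a byte or word in a bit-band region.
   Returns 0 if addr lies outside both bit-band regions or bit is too large. */
static uint32_t bitband_alias(uint32_t addr, unsigned bit)
{
    uint32_t region = addr & 0xF0000000u;   /* 0x20000000 or 0x40000000 */
    uint32_t offset = addr & 0x0FFFFFFFu;   /* byte offset in the region */

    if ((region != 0x20000000u && region != 0x40000000u) ||
        offset >= 0x00100000u || bit > 31u) {
        return 0u;
    }
    /* Each bit gets its own 32-bit alias word: 32 bytes per source byte. */
    return region + 0x02000000u + offset * 32u + bit * 4u;
}

/* On the target the alias is then accessed as an ordinary volatile word. */
#define BB(addr, bit) (*(volatile uint32_t *)(uintptr_t)bitband_alias((addr), (bit)))
```

On a part with bit banding, writing 1 or 0 through `BB(address_of_register, n)` would then set or clear bit n alone.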

Siwastaja · « **Reply #42 on:** August 08, 2021, 01:25:27 pm »

Maybe because that would be most useful with GPIO, and vendors already offer atomic set/clear registers such as BSRR on STM32, so bit banding is just a duplicated facility for the same.

In peripherals, it's quite rare that you actually need RMW for the control registers; you are in the command of the peripheral so you can just write the register fully with the state you want, no read involved. Sometimes RMW is just for convenience to preserve modularity in your program, think about enabling clocks on RCC. If you want to enable SPI5, you don't want to accidentally turn SPI4 off, so you do RMW just turning the thing you need on. In such cases, performance and code size penalty is completely negligible.

peter-h · « **Reply #43 on:** August 09, 2021, 06:59:53 am »

Would something like this get optimised (a read of CPU FLASH)

uint8_t* p = (uint8_t*) 0x08000000;
ch=*p++;

on the assumption that the compiler has not seen any code writing to that location? That would be bizarre, surely?

ataradov · « **Reply #44 on:** August 09, 2021, 07:17:14 am »

It depends on the compiler. Most existing compilers will not optimize direct address casts like this. But you are setting yourself up for failure in the future when a new version of a compiler does this optimization. There is nothing really stopping them from doing it, apart from non-zero amount of poorly written code that will break. Don't be a part of the problem.

Also, there is another possible optimization here - if the compiler knows what would be placed at that address, it may use a fixed value known at compile/link time. And it could just remove the whole section of the code. This would break anything that was added after the compilation (like image size or CRC). Again, this is the case of things changing outside of the compiler control.
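One way to stay on the safe side, as far as I can tell, is to do such reads through a volatile-qualified pointer, so the compiler has to emit a real load and cannot substitute a value it believes it knows at build time. A minimal sketch, with a made-up helper name:

```c
#include <stdint.h>

/* Reads one byte from an absolute address. The volatile qualifier makes
   the access an observable side effect, so it may not be folded away. */
static inline uint8_t read_u8(uintptr_t addr)
{
    return *(const volatile uint8_t *)addr;
}
```

On the target this would be used as `ch = read_u8(0x08000000u);` for the first byte of flash.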

peter-h · « **Reply #45 on:** August 09, 2021, 08:25:36 am »

Interesting...

It does make me wonder whether these optimisations have any impact whatsoever on system performance. I have decades' experience of assembler, and all the tricks people used to do (including self modifying code, which I avoided), so I understand this stuff at the machine level. And in most systems some 1% of the code is speed critical, and one generally gains far more there by sitting down and thinking about doing that job differently, than by rewriting it in the slickest assembler possible.

DavidAlfa · « **Reply #46 on:** August 09, 2021, 09:05:05 am »

For sure you use a lot of rmw in peripherals.
Just to enable it, for example.
You usually set up the registers, then you set the enable bit.
So you need to read, modify, and write back.
Unless you actually use writes for the whole process, writing the whole value each time with the bit changes.
There are tons of examples. Enabling RX interrupt, setting direction in half duplex mode, clearing a flag... They are all masking operations requiring a read first.

peter-h · « **Reply #47 on:** August 09, 2021, 02:22:26 pm »

Just spent hours on this one. I created a second project in Cube (yeah - a complicated job!) and copied the 1st one to it.

Then I did a trivial change to make it flash some LED, to make sure Cube was really building the 2nd project. Well, it did flash the LED but it was massively smaller!

Spot the difference:

Code size about 250k:

Code: [Select]

	
for (uint32_t i=0; i<0xffffffff; i++)
	{
		LED_On(KDE_LED3);
		hang_around(200);
		LED_Off(KDE_LED3);
		hang_around(200);
	}
	main();

The 30k project:

Code: [Select]

	
for (;;)
	{
		LED_On(KDE_LED3);
		hang_around(200);
		LED_Off(KDE_LED3);
		hang_around(200);
	}
	main();

Obviously, the compiler is realising main() is never entered so it chucks out everything after that.

DavidAlfa · « **Reply #48 on:** August 09, 2021, 02:47:21 pm »

Yeah, you have that option in the linker settings, "remove unused sections".

ataradov · « **Reply #49 on:** August 09, 2021, 04:14:16 pm »

Quote from: peter-h on August 09, 2021, 08:25:36 am

It does make me wonder whether these optimisations have any impact whatsoever on system performance.

They do. Remember, the compiler you are using is not specifically for embedded. It is the same compiler that works on PCs, and most of those optimizations happen before target-specific code is generated.

Any optimization has impact on performance, it would not be optimization otherwise.

