Author Topic: GCC compiler optimisation  (Read 45813 times)


Online ataradov

  • Super Contributor
  • ***
  • Posts: 11780
  • Country: us
    • Personal site
Re: GCC compiler optimisation
« Reply #75 on: August 10, 2021, 05:16:32 pm »
So yes this is a good example!
And just to be clear, the only reason it works is that memcpy() is not a real function call anymore; it is just an indication of intent to the compiler, handled internally. Many standard functions are handled this way. If you call printf() with just a plain string and no format arguments, it will be substituted with puts().

If you substitute your own version that just copies things in a loop, then it will not be optimized this way.

Which is why you need to have your own implementation for ordered access anyway. Standard functions do not guarantee order of writes or reads.
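As a minimal sketch of such an ordered-access implementation (the function name is mine, not from this thread): the volatile qualifiers force the compiler to perform every byte access, in program order, instead of recognising the pattern and substituting memcpy() or reordering/vectorising the loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Copies n bytes one at a time; the volatile qualifiers make each
   access observable, so the compiler must emit them all, in order. */
static void copy_ordered(volatile uint8_t *dst,
                         const volatile uint8_t *src, size_t n)
{
    while (n--)
        *dst++ = *src++;
}
```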
« Last Edit: August 10, 2021, 05:19:16 pm by ataradov »
Alex
 

Online peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 4162
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #76 on: August 10, 2021, 05:27:49 pm »
Like this :)

Code: [Select]
eeread(0, buf);  // read 512 bytes into a buffer, from a serial EEPROM
    uint32_t fsize = buf[4]|(buf[5]<<8)|(buf[6]<<16)|(buf[7]<<24);

although I am sure there is a slicker way, e.g. overlaying a packed struct onto buf, but that would assume endianness. And with the 32-bit barrel shifter the above will still be really fast.
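For what it's worth, the shift-and-OR idiom above can be wrapped in a small helper (read_u32le is my name for it, nothing standard). It stays correct regardless of host endianness, and modern gcc/clang usually recognise the pattern and compile it down to a single 32-bit load on little-endian targets anyway:

```c
#include <stdint.h>

/* Assemble a little-endian 32-bit value from 4 bytes, independent of
   the host's byte order. */
static uint32_t read_u32le(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```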

This is quite a learning experience! But fortunately I think all my code will work fine.
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline gf

  • Super Contributor
  • ***
  • Posts: 1353
  • Country: de
Re: GCC compiler optimisation
« Reply #77 on: August 10, 2021, 05:39:34 pm »
If you substitute your own version that just copies things in a loop, then it will not be optimized this way.

Unbelievable - even then it is optimized out :-DD https://godbolt.org/z/eE9761jns
(clang obviously still recognizes what copy() does - at least if it can be inlined)
 

Online peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 4162
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #78 on: August 10, 2021, 05:46:18 pm »
What about 32F4 ARM GCC? I seem to be on 9.3.1.

I would expect x86 compilers to be clever.

GCC produces

Code: [Select]
f:
        sub     sp, sp, #512
        mov     x2, -513
        add     x0, sp, 512
        mov     x3, 512
        movk    x2, 0xf7ff, lsl 16
        movk    x3, 0x800, lsl 16
        add     x2, x0, x2
        mov     x0, 134217728
.L2:
        mov     x1, x0
        add     x0, x0, 1
        cmp     x0, x3
        ldrb    w1, [x1]
        strb    w1, [x0, x2]
        bne     .L2
        ldrb    w0, [sp]
        add     sp, sp, 512
        ret

Not as clever!

With -O0, it does it literally

Code: [Select]
copy:
        sub     sp, sp, #48
        str     x0, [sp, 24]
        str     x1, [sp, 16]
        str     x2, [sp, 8]
        ldr     x1, [sp, 16]
        ldr     x0, [sp, 8]
        add     x0, x1, x0
        str     x0, [sp, 40]
        b       .L2
.L3:
        ldr     x1, [sp, 16]
        add     x0, x1, 1
        str     x0, [sp, 16]
        ldr     x0, [sp, 24]
        add     x2, x0, 1
        str     x2, [sp, 24]
        ldrb    w1, [x1]
        strb    w1, [x0]
.L2:
        ldr     x1, [sp, 16]
        ldr     x0, [sp, 40]
        cmp     x1, x0
        bne     .L3
        nop
        nop
        add     sp, sp, 48
        ret
f:
        sub     sp, sp, #544
        stp     x29, x30, [sp]
        mov     x29, sp
        add     x0, sp, 24
        mov     x2, 512
        mov     x1, 134217728
        bl      copy
        ldrb    w0, [sp, 24]
        strb    w0, [sp, 543]
        ldrb    w0, [sp, 543]
        ldp     x29, x30, [sp]
        add     sp, sp, 544
        ret

In most applications the extra time will not be relevant, but the doubling of code size may well be.
« Last Edit: August 10, 2021, 05:55:44 pm by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15444
  • Country: fr
Re: GCC compiler optimisation
« Reply #79 on: August 10, 2021, 06:11:04 pm »
The rules are actually pretty simple. What makes them hard to apply for us humans is that we inherently have a very hard time thinking that any single statement of code we write could be actually useless. That's probably because we are just so full of ourselves. ;D

One thing that this kind of topic teaches you is that C is definitely NOTHING LIKE A PORTABLE ASSEMBLER, despite what many uninformed people keep saying. C statements are absolutely not guaranteed to be translated verbatim to machine code.

This does not mean that it's impossible to write "working C code". Most issues related to the kind of optimizations we're talking about in this thread come either from objects that the compiler doesn't know about, or from expectations like the one in the paragraph above: expecting some statement to yield verbatim machine code when that statement doesn't actually change any result of the program's execution.

A typical example, often discussed in forums but not yet in this thread, is the famous delay loop. You write an empty 'for' loop in the hope that it will yield machine code that executes for some time. Optimizers will just prune such loops, because they have NO effect from a "functional" POV: they do not change the result of anything.

Code: [Select]
for (int i = 0; i < xxx; i++) {}
The way to force this generating an actual loop is either to declare the counter variable volatile, or do something in the loop body that the compiler can't prune. So this would be either:

Code: [Select]
for (volatile int i = 0; i < xxx; i++) {}
or:

Code: [Select]
for (int i = 0; i < xxx; i++) { Nop(); }
Nop() here being for instance some macro that expands to some assembly code (whatever it is, it must be something the compiler can't assume has no effect). The drawback of the second version is that it's not portable. The benefit is that the loop will usually be more "efficient", because the loop counter can be kept in a register, whereas in the first form the 'i' counter will usually be put on the stack, and a read, an increment and a write will occur at each iteration. That's usually what happens. Again, there is no guarantee that a given compiler will compile this in any specific way, but there is a guarantee that either of the two above forms will be compiled as an actual loop.
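On GCC and clang, one common way to write such a Nop() macro (a sketch, not anything from this thread) is an inline-asm statement; the volatile tells the compiler the statement has an effect it cannot see, so the loop survives optimization:

```c
/* The volatile asm is a black box to the optimizer, so the loop body
   cannot be proven useless and the loop cannot be pruned. */
#define Nop() __asm__ volatile ("nop")

static void delay_loops(int xxx)
{
    for (int i = 0; i < xxx; i++) { Nop(); }
}
```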

Dealing with "outside objects" (such as any register/buffer accessed through pointers to absolute addresses) is indeed not that simple in C, even though C is said to be a very low-level language good at exactly this. Many developers actually use C only for embedded dev these days... for which this kind of issue is ubiquitous. It doesn't take being an "expert", but I admit it takes more knowledge than is usually assumed. C is too often thought of as a very simple language. It's not quite.
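The usual idiom for such an outside object is a volatile-qualified pointer to a fixed address, e.g. `#define UART_STATUS (*(volatile uint32_t *)0x40011000u)` (address made up for illustration). In the sketch below a plain volatile variable stands in for the register so the code can run anywhere; the point is the same: volatile forces a genuine re-read on every call, so a busy-wait polling loop cannot be hoisted into a single read.

```c
#include <stdint.h>

/* Stand-in for a memory-mapped status register; in real code this
   would be a volatile pointer to the peripheral's absolute address. */
static volatile uint32_t uart_status;

/* volatile guarantees the register is re-read on every call, instead
   of the compiler caching one read in a register. */
static int tx_empty(void)
{
    return (uart_status & 0x80u) != 0;
}
```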

Now what could be nice, in order to help developers with this, would be for compilers to give a list of every piece of code that was pruned during optimization. The problem is that in a typical program the list is likely to be pretty large, and it would take you a lot of time to go through everything in hopes of catching something getting pruned that you actually wanted to execute...
« Last Edit: August 10, 2021, 06:25:29 pm by SiliconWizard »
 

Online peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 4162
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #80 on: August 10, 2021, 06:19:12 pm »
One could still issue warnings in some cases - like that ludicrous removal of most of my program above :)

I have just tried a bigger piece of real code. This one is not right; it was hacked since this compiler doesn't know uint32_t and some other stuff...

Code: [Select]

void L_HAL_FLASH_Unlock(void);
void L_HAL_FLASH_Lock(void);
void L_FLASH_Program_Word(int addr, int data);
int AT45dbxx_ReadPage(char* buf, int count1, int count2);

static char buffer1[32768];

void test(void)
{
    int page, pagebase;
    int error = 0;
    int cpubase = 0x0800000;
    int cust_blocks = 444;

    for (int block32k = 1; block32k < cust_blocks; block32k++) // 1-31
    {
        // Read each 32k block into buffer1
        int buffer1idx = 0;
        for (int page = pagebase; page < (pagebase + 64); page++)
        {
            AT45dbxx_ReadPage(&buffer1[buffer1idx], 512, page * 512);
            buffer1idx += 512;
        }

        // Program 32k block
        L_HAL_FLASH_Unlock();
        for (int i = 0; i < (32 * 1024); i += 4)
        {
            int data = buffer1[i] | (buffer1[i+1] << 8) | (buffer1[i+2] << 16) | (buffer1[i+3] << 24);
            L_FLASH_Program_Word(i + cpubase, data);
        }
        L_HAL_FLASH_Lock();

        // Verify 32k block against buffer1
        for (int i = 0; i < (32 * 1024); i += 4)
        {
            int data = buffer1[i] | (buffer1[i+1] << 8) | (buffer1[i+2] << 16) | (buffer1[i+3] << 24);
            if ((*(volatile int*)(i + cpubase)) != data) error++;
        }

        cpubase += (32 * 1024);
        pagebase += 64;
        block32k++;
    }
}


With -O0 I get 161 lines. With -O1 I get 67, and after that not much changes. It is interesting because most of the difference is just using different instructions. I don't see it doing anything cunning.
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Online ataradov

  • Super Contributor
  • ***
  • Posts: 11780
  • Country: us
    • Personal site
Re: GCC compiler optimisation
« Reply #81 on: August 10, 2021, 07:04:06 pm »
Unbelievable - even then it is optimized out :-DD https://godbolt.org/z/eE9761jns
(clang obvioulsy still recognizes what the copy() does - at least if it can be inlined)
Wow. This is pretty cool.

So yeah, volatiles where needed is a must. Anything else is just a trouble waiting to happen.
Alex
 

Online peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 4162
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #82 on: August 10, 2021, 08:04:20 pm »
That x86 compiler must be building a table for every element of every array and every variable, to keep track of each element accessed by the program.

So if e.g. you had

uint8_t buf[512];
uint32_t i;

for (i = 0; i < 256; i++)
{
  buf[i] = 'a';
}

uint8_t fred = 6;
i++;
buf[i] = fred;

and if buf[257-511] was never used after that, it would throw away that code. Well, correctly so, because it would not do anything.

It must be almost "running" the code to do this.

I knew a guy many years ago who wrote a bunch of C compilers. He spent a lot of time looking for special cases, e.g. integer x10 is much faster (on the old CPUs) if you do x2 and x8 and add them - and that transformation is always 100% safe. I don't recall anybody using "volatile" in the 1980s; C just did what you expected :) I supervised some quite big developments done in IAR C, and I wrote asm portions to speed them up (in some cases by 100x to 1000x) relative to an admittedly dumb use of sscanf.
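That x10 trick, spelled out as a sketch: x*10 == (x<<1) + (x<<3), i.e. 2x + 8x. Compilers still emit exactly this strength reduction on cores with slow multipliers, and it is exact for unsigned arithmetic because wraparound behaves identically in both forms.

```c
#include <stdint.h>

/* x*10 rewritten as (x*2) + (x*8) using shifts instead of a multiply. */
static uint32_t mul10(uint32_t x)
{
    return (x << 1) + (x << 3);
}
```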
« Last Edit: August 10, 2021, 08:16:14 pm by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline gf

  • Super Contributor
  • ***
  • Posts: 1353
  • Country: de
Re: GCC compiler optimisation
« Reply #83 on: August 10, 2021, 09:10:06 pm »
What about 32F4 ARM GCC

ARM Cortex M4 clang: https://godbolt.org/z/4Eh1faYr9
ARM Cortex M4 gcc: https://godbolt.org/z/KdvhdW4zM
x64 clang: https://godbolt.org/z/rd1MTx5bG
x64 gcc: https://godbolt.org/z/39hdTE8Ph

clang optimizes both memcpy() and copy(), for both processors.
gcc actually does a better job for ARM than for x64, in this particular case.
But you can't generalize that to other code. It really depends.
 

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4316
  • Country: us
Re: GCC compiler optimisation
« Reply #84 on: August 10, 2021, 09:35:04 pm »
Quote
because memcpy() is not a real function anymore, it is just an indication of intent to the compiler that is handled internally.  Many standard functions are handled this way.
Is there a command-line switch to stop this interpretation of "standard" functions?  (globally, or on a per-function basis?)

I find it annoying; one of the "nice" things about C was that functions (including all of IO) were functions, and not behavior hard-wired into the language. (also one of the reasons that C is comparatively easy to port to new platforms.)

For a language often derided as "a high level assembly language", the compiler folk seem very keen on doing things to it that will annoy people who are used to assembly language.  :-(

For a less controversial example, consider something like:
Quote
    *flashPtr = x;  // load up flash write buffer

    // Manipulate NVMCTL to actually write the flash buffer to flash
    //  :

    if (*flashPtr != x) {   // verify
         // flash write error!
    }
if *flashPtr is not "volatile", even a relatively dumb compiler would have no reason to re-read it; it has no way to know that manipulating the NVMCTL stuff might change the contents of memory...
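A runnable sketch of that verify pattern (the function name is mine; a volatile variable stands in for the flash cell so the example can run anywhere): with the volatile qualifier on the pointee, the compare genuinely re-reads the cell after the commit sequence instead of assuming it still holds x.

```c
#include <stdint.h>

static volatile uint8_t flash_cell;   /* stands in for a real flash word */

/* Returns 0 if the verify read matches. Because flashPtr points to
   volatile, the compiler must re-read it for the comparison rather
   than reusing the value it just wrote. */
static int flash_write_verify(volatile uint8_t *flashPtr, uint8_t x)
{
    *flashPtr = x;              /* load up the write buffer */
    /* ... manipulate NVMCTL here to actually commit the write ... */
    return (*flashPtr != x);    /* verify: a genuine re-read */
}
```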
 

Online peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 4162
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #85 on: August 10, 2021, 09:58:02 pm »
"gcc does even a better job for ARM than for x64, in this particular case."

That's v10, which seems quite different to v9.x.
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Online ataradov

  • Super Contributor
  • ***
  • Posts: 11780
  • Country: us
    • Personal site
Re: GCC compiler optimisation
« Reply #86 on: August 10, 2021, 10:04:00 pm »
Is there a command-line switch to stop this interpretation of "standard" functions?  (globally, or on a per-function basis?)
Sure. "-fno-builtin" for all of them, and then there are flags like "-fno-builtin-memcpy".

Alex
 
The following users thanked this post: newbrain

Offline emece67

  • Frequent Contributor
  • **
  • !
  • Posts: 614
  • Country: 00
Re: GCC compiler optimisation
« Reply #87 on: August 10, 2021, 10:34:53 pm »
.
« Last Edit: August 19, 2022, 04:38:07 pm by emece67 »
 

Online ataradov

  • Super Contributor
  • ***
  • Posts: 11780
  • Country: us
    • Personal site
Re: GCC compiler optimisation
« Reply #88 on: August 10, 2021, 10:42:36 pm »
Declaring either of to or from as a pointer to volatile results in copy() being compiled as a loop.
Yes, sure. But the fact that it can track this stuff across function calls is great.

If you add "-fno-builtin", it will also generate the full loop. So it recognizes the copy semantics in that function, and it has logic to optimize that specifically.

This optimization is not generic. Replacing the body of the loop with "*to++ = *from++ * 2" breaks it, since it is no longer a simple copy.
Alex
 

Offline ejeffrey

  • Super Contributor
  • ***
  • Posts: 3937
  • Country: us
Re: GCC compiler optimisation
« Reply #89 on: August 10, 2021, 11:34:49 pm »
Quote
because memcpy() is not a real function anymore, it is just an indication of intent to the compiler that is handled internally.  Many standard functions are handled this way.
Is there a command-line switch to stop this interpretation of "standard" functions?  (globally, or on a per-function basis?)

I find it annoying; one of the "nice" things about C was that functions (including all of IO) were functions, and not behavior hard-wired into the language. (also one of the reasons that C is comparatively easy to port to new platforms.)

I think -fno-builtin will do it?  There are a number of built-in functions and options to control their use; I don't remember all of them. Mostly they are useful for writing the C library or other platform implementations.

It's not really true that memcpy isn't a real function.  There is absolutely a version in libc that will be called if necessary (for instance if using function pointers).  But the compiler has the option to replace it with equivalent code if possible, and that choice can be platform and context sensitive, such as if the compiler can prove alignment conditions or a compile time length.  It's not really any different from inline functions.
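To illustrate the function-pointer case (a sketch, not from the thread): taking memcpy's address forces a call through the real libc symbol, whereas a direct memcpy() call on the same data would likely be expanded inline by the compiler.

```c
#include <string.h>

/* Calls through this pointer cannot be replaced by the builtin
   expansion; the real libc memcpy symbol must exist and be called. */
static void *(*copy_fn)(void *, const void *, size_t) = memcpy;
```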
 

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15444
  • Country: fr
Re: GCC compiler optimisation
« Reply #90 on: August 10, 2021, 11:42:32 pm »
It's not really true that memcpy isn't a real function.  There is absolutely a version in libc that will be called if necessary (for instance if using function pointers).  But the compiler has the option to replace it with equivalent code if possible, and that choice can be platform and context sensitive, such as if the compiler can prove alignment conditions or a compile time length.  It's not really any different from inline functions.

Indeed, the compiler can do similar optimizations with user-defined functions. But it can just do better with std functions because it knows what they are supposed to achieve exactly.

But any kind of pruning - it can absolutely do the same. Hence the importance, *when relevant*, of qualifying pointer parameters with volatile.
 

Offline cfbsoftware

  • Regular Contributor
  • *
  • Posts: 124
  • Country: au
    • Astrobe: Oberon IDE for Cortex-M and FPGA Development
Re: GCC compiler optimisation
« Reply #91 on: August 10, 2021, 11:58:05 pm »
One thing that this kind of topic teaches you is that C is definitely NOTHING LIKE A PORTABLE ASSEMBLER, despite what many uninformed people keep saying. C statements are absolutely not guaranteed to be translated verbatim to machine code.
I would not interpret the statement that 'C is like a portable assembler' as 'C is a portable assembler'. I see it as a simplistic way of saying that C is best suited to writing software you might otherwise have to use assembler for.

Chris Burrows
CFB Software
https://www.astrobe.com
 

Offline Bassman59

  • Super Contributor
  • ***
  • Posts: 2501
  • Country: us
  • Yes, I do this for a living
Re: GCC compiler optimisation
« Reply #92 on: August 11, 2021, 03:44:53 am »
I don't recall anybody using "volatile" in the 1980s. C just did what you expected.

I'm in that category! I don't remember using "volatile" in Turbo C or really any personal-computer C programming. The first time I came across it was when I started using C for the 8051 back in, oh, 1997, I think. And the reason for "volatile" was the usual one: to tell the compiler not to optimize accesses to global or file-scope variables that could be changed in an ISR.

And now that I think about it, I don't recall writing interrupt handlers in C for the PC under DOS, and I know I never wrote any Windows programs that used interrupts. Of course, for embedded stuff on (what we now call) bare-metal 8051, and later for me 68k, interrupts were necessary.
 

Offline ejeffrey

  • Super Contributor
  • ***
  • Posts: 3937
  • Country: us
Re: GCC compiler optimisation
« Reply #93 on: August 11, 2021, 04:04:42 am »
I knew a guy many years ago who wrote a bunch of C compilers. He spent a lot of time looking for special cases e.g. integer x 10 is much faster (on the old CPUs) if you do x2, x8, and add. But you can always do that, 100% safely. I don't recall anybody using "volatile" in the 1980s.

A lot of people wrote incorrect code in the 80s; what's your point?  More to the point, ANSI C wasn't fully standardized until 1989; until then you weren't targeting a standard, you were targeting an implementation. That is also one reason performance code was usually written in Fortran instead of C - Fortran had much better optimizers available.

For user applications you rarely need volatile anyway. You only need it - or should use it - for memory-mapped IO and (carefully) for signal/interrupt handlers. Non-kernel code written for the UNIX platforms of the day wouldn't have needed volatile for much. DOS on x86 did require user applications to access hardware directly, although x86 has IO instructions for port access, to which none of this applies.

In the 80s and 90s inlining was not well standardized and was generally only performed when specifically requested. Link-time optimization across translation units basically didn't exist. So if you put your IO in functions in a library and didn't mark them inline, many of these optimizations could not kick in, because the compiler couldn't look across function-call barriers.

Quote
I supervised some quite big developments done in IAR C; I did asm portions to speed them up (in some cases by 100x to 1000x) relative to an admittedly dumb use of sscanf.

Re-writing in assembly because sscanf is too slow seems a bit overkill, but whatever. Just rewriting it in C would probably get you almost all of the benefit. sscanf and friends are extremely slow; they are designed to be flexible and simple to use, not high-performance.
 
The following users thanked this post: newbrain

Offline newbrain

  • Super Contributor
  • ***
  • Posts: 1773
  • Country: se
Re: GCC compiler optimisation
« Reply #94 on: August 11, 2021, 06:37:56 am »
That is also one reason performance code was usually written in Fortran instead of C -- Fortran had much better optimizers available.
Just one nitpick:
Fortran had some advantages over C89 due to different aliasing constraints that allowed for better optimizations.
This was (partially?) solved when C99 introduced the 'restrict' type qualifier.
In fact, memcpy's signature now takes restrict-qualified pointers, so the behavior is undefined if the objects overlap: this was implicit (documented) pre-C99, but is now an explicit part of the contract in the declaration.

C++ does not support restrict, though most compilers do as an extension.
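A sketch of what restrict buys the optimizer (function name is mine): with the qualifier, the compiler may assume the three pointers never alias, so it can reorder and vectorise the loop freely. Calling it with overlapping arrays would be undefined behaviour, exactly as with memcpy.

```c
/* restrict promises the compiler that out, a and b do not overlap, so
   stores through out cannot change what a[i] or b[i] will read next. */
static void add_arrays(int * restrict out,
                       const int * restrict a,
                       const int * restrict b, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}
```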
Nandemo wa shiranai wa yo, shitteru koto dake.
 

Online peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 4162
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #95 on: August 11, 2021, 02:37:12 pm »
Could any of you experienced chaps suggest which of these are worth enabling?

I have two configs: Debug (-O0 and max debug) and Release (-O3 and no debug).



-O3 is picking up some interesting warnings. One of them is:

Code: [Select]
memcpy(inbuf,inbuf+1,INBUF_SEARCH_LEN); // Shuffle search area one left (its last byte is now garbage)
and I thought this is relevant in light of the memcpy discussion above. memcpy is supposed to copy overlapping buffers correctly, no?? It's been running for months.

Inbuf is 1000 bytes long, and the search length is 7 bytes. It is a primitive way of testing the first 7 bytes for one of about a hundred different strings (to do with GPS satellite vehicle IDs etc). I then do

Code: [Select]
if ( memcmp(inbuf,pubx_00_header,7) == 0)
« Last Edit: August 11, 2021, 02:46:19 pm by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline oPossum

  • Super Contributor
  • ***
  • Posts: 1453
  • Country: us
  • Very dangerous - may attack at any time
Re: GCC compiler optimisation
« Reply #96 on: August 11, 2021, 02:48:38 pm »
Memcpy is supposed to copy overlapping buffers correctly, no?? It's been running for months.

No!  memmove() allows overlapping buffers.
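For the shuffle-left-by-one case above, memmove is the drop-in fix; a sketch using the buffer and the 7-byte search length from the post:

```c
#include <string.h>

#define INBUF_SEARCH_LEN 7   /* search length, per the post */

static char inbuf[1000];

/* Shift the search window one byte left. The regions overlap, so
   memmove is required; memcpy would be undefined behaviour here. */
static void shuffle_left(void)
{
    memmove(inbuf, inbuf + 1, INBUF_SEARCH_LEN);
}
```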
 

Offline newbrain

  • Super Contributor
  • ***
  • Posts: 1773
  • Country: se
Re: GCC compiler optimisation
« Reply #97 on: August 11, 2021, 02:52:25 pm »
Could any of you experienced chaps suggest on which of these are worth enabling?

I have two configs: Debug (-O0 and max debug and Release (-O3 and no debug).
[...]
Memcpy is supposed to copy overlapping buffers correctly, no??

Inbuf is 1000 bytes long, and the search length is 7 bytes. It is a primitive way of testing the first 8 bytes for one of about a hundred different strings (to do with GPS satellite vehicle IDs etc). I then do

Code: [Select]
if ( memcmp(inbuf,pubx_00_header,7) == 0)
You might have noticed I'm a bit of a pedant - only with C, I swear!
So my code usually goes with -pedantic -Wextra -Wall -Wswitch-default -Wswitch-enum (plus -std=c11 usually).
Of course, these might be  relaxed for library (not mine) code.

No, memcpy has undefined behaviour if the objects overlap.
As previously noted, look at the signature (C99):
Code: [Select]
void* memcpy( void *restrict dest, const void *restrict src, size_t count );

The 'restrict' type qualifier makes it clear that the compiler does not expect the memory pointed to by the arguments to overlap.
So, it's UB if they do.
In C89, this was in the documentation - restrict (added in C99) makes that explicit.

To copy overlapping buffers, you need memmove.

As a habit, I use -Og for debug compiles: the code remains eminently debuggable, but a lot of pointless memory/register shuffling and pushing/popping is removed.

ETA: -Wconversion is a bit chatty for my taste. Helpful if you think you might have gotten some conversions wrong.
« Last Edit: August 11, 2021, 02:57:58 pm by newbrain »
Nandemo wa shiranai wa yo, shitteru koto dake.
 
The following users thanked this post: lucazader

Offline newbrain

  • Super Contributor
  • ***
  • Posts: 1773
  • Country: se
Re: GCC compiler optimisation
« Reply #98 on: August 11, 2021, 03:03:17 pm »
It is a primitive way of testing the first 7 bytes for one of about a hundred different strings (to do with GPS satellite vehicle IDs etc). I then do
Wait, what was the problem with using another pointer (e.g. char *search_ptr = inbuf + 1;) instead of moving bytes around?

Code: [Select]
if ( memcmp(search_ptr, pubx_00_header, 7) == 0)
Nandemo wa shiranai wa yo, shitteru koto dake.
 

Online peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 4162
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #99 on: August 11, 2021, 03:37:33 pm »
I googled this, because I "asked somebody" before using memcpy for that, and just found this:

87. WHAT IS THE DIFFERENCE BETWEEN MEMCPY() & MEMMOVE() FUNCTIONS IN C?
memcpy()  function is is used to copy a specified number of bytes from one memory to another.
memmove() function is used to copy a specified number of bytes from one memory to another or to overlap on same memory.
Difference between memmove() and memcpy() is, overlap can happen on memmove(). Whereas, memory overlap won’t happen in memcpy() and it should be done in non-destructive way.


It is bad grammar, but it seems to be saying that if you want to shuffle a 7-byte buffer one place, you need to use memmove.

Re the shuffling, I was just being "simple" :) I am looking for a pattern in a data stream arriving on a serial port, so one can't just keep incrementing a pointer because when it gets to end of buffer minus 7, the compare string will run off the end. And anyway, upon match, one needs to fetch the rest. I know there are super slick ways of doing this stuff...
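One pointer-based way to do that scan without shuffling any data (a sketch; the function name and signature are mine): bound the window pointer 7 bytes before the end of the buffer, so the 7-byte compare can never run off it.

```c
#include <stddef.h>
#include <string.h>

/* Scan buf for a 7-byte header without moving any data. Returns the
   offset of the first match, or -1 if none. The loop bound guarantees
   the memcmp never reads past buf + len. */
static ptrdiff_t find_header(const char *buf, size_t len, const char *hdr7)
{
    if (len < 7)
        return -1;
    for (const char *p = buf; p <= buf + len - 7; p++)
        if (memcmp(p, hdr7, 7) == 0)
            return p - buf;
    return -1;
}
```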

Another thing I am noticing: if one switches builds, one needs to do a Clean Project, otherwise it doesn't rebuild the whole project (or any of it, actually) because it finds all the .o files already in place.

I am using -Og now, and I get 160k (down from 225k with -O0 and max debug) in the Debug build, versus 179k in the Release build with -O3. How is that possible?

« Last Edit: August 11, 2021, 04:18:39 pm by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

