Author Topic: GCC compiler optimisation  (Read 42912 times)


Online peter-h (Topic starter)

  • Super Contributor
  • ***
  • Posts: 3954
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #325 on: February 13, 2023, 01:57:35 pm »
Someone pointed out that a memcpy implemented as 4 byte moves will fail if the gap between the two buffers is less than 4 bytes, which is why, supposedly, memcpy is normally done in byte mode.

Can anybody relate to this? I don't get it at all.

Normal memcpy doesn't support overlapping blocks. For that you use memmove.

Code: [Select]
#include <stddef.h>
#include <stdint.h>

void *memcpy_fast (void *__restrict dst0, const void *__restrict src0, size_t len0)
{
    char *dst = dst0;
    const char *src = src0;
    uint32_t *aligned_dst;
    const uint32_t *aligned_src;

    // If the size is >= 4 then do 32-bit moves, until exhausted.
    // Note: the word-pointer casts below assume both buffers are word-aligned
    // (or that the target tolerates unaligned 32-bit accesses, as a Cortex-M4
    // does by default).

    if (len0 >= 4)
    {
        aligned_dst = (uint32_t *)dst;
        aligned_src = (const uint32_t *)src;

        while (len0 >= 4)
        {
            *aligned_dst++ = *aligned_src++;
            len0 -= 4;
        }

        dst = (char *)aligned_dst;
        src = (const char *)aligned_src;
    }

    // Finish with any single-byte moves

    while (len0--)
        *dst++ = *src++;

    return dst0;
}
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline eutectique

  • Frequent Contributor
  • **
  • Posts: 432
  • Country: be
Re: GCC compiler optimisation
« Reply #326 on: February 13, 2023, 02:15:12 pm »
Suppose two memory blocks, 0x10 bytes in size, src starts at address 0x100, dst starts at address 0x108 -- that's the definition of overlapping blocks.

A memcpy() that uses src++ and dst++ will fail on that, byte-wide or word-wide; it doesn't matter.
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3237
  • Country: ca
Re: GCC compiler optimisation
« Reply #327 on: February 13, 2023, 04:31:40 pm »
Someone pointed out that a memcpy implemented as 4 byte moves will fail if the gap between the two buffers is less than 4 bytes

If you move in consecutive blocks starting from the lower end, then regardless of the block size, memcpy will fail if both of these conditions are met:

Code: [Select]
src < dst < (src+len)
len > block_size

and succeed otherwise.
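
To make it concrete, here is a minimal sketch (the buffer layout is made up for the example: dst = src + 2 and len = 8 with a 4-byte block, so both conditions above hold):

Code: [Select]
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned char buf[16] = "ABCDEFGHIJ";
    unsigned char *src = buf;        /* src < dst < src + len */
    unsigned char *dst = buf + 2;

    /* Forward copy in 4-byte blocks: read one block, then write it. */
    for (size_t i = 0; i < 8; i += 4) {
        uint32_t w;
        memcpy(&w, src + i, 4);
        memcpy(dst + i, &w, 4);
    }

    /* An overlap-safe copy (memmove) would leave "ABABCDEFGH";
       the forward block copy produces "ABABCDCDGH" instead. */
    printf("%.10s\n", (char *)buf);
    return 0;
}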
 

Online ejeffrey

  • Super Contributor
  • ***
  • Posts: 3859
  • Country: us
Re: GCC compiler optimisation
« Reply #328 on: February 13, 2023, 05:29:47 pm »
Someone pointed out that a memcpy implemented as 4 byte moves will fail if the gap between the two buffers is less than 4 bytes, which is why, supposedly, memcpy is normally done in byte mode.

Can anybody relate to this? I don't get it at all.

I don't get it either.  A properly working memcpy implementation should neither read bytes outside the source buffer nor write bytes outside the destination.  If this requirement is satisfied, it should not fail as long as the buffers don't overlap (which as you say is not supported, and requires memmove).  Some implementations do allow reading past the end of the source, as long as no architectural boundaries are crossed (i.e., the over-read can't cause a fault), typically to the end of an aligned word.  But since the extra bytes won't be written, it doesn't matter whether they are new or old data.

If the source and destination buffers are disjoint, but share the same 4 byte word, then they can't both be 4 byte aligned.  The reason simple memcpy implementations work in byte mode is entirely for alignment purposes.  memcpy doesn't place any alignment requirements on the buffers or the sizes.  So memcpy implementations that want to use larger word sizes need to handle buffers that are not aligned, lengths that are not a multiple of the word size, different alignment between source and destination, and do so without causing faults or severe misaligned access penalties.

Optimized memcpy implementations obviously *do* check the sizes and alignments and try to use larger chunks when possible.
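
For illustration, a rough sketch of the shape such an implementation usually takes (not any particular library's code, just the alignment handling described above):

Code: [Select]
#include <stddef.h>
#include <stdint.h>

/* Sketch only: byte-copy until dst is word-aligned, use word copies only if
   src ended up aligned too, then finish the 0-3 byte tail. Real library
   versions add unrolling, unaligned-load paths where the ISA allows, etc. */
void *memcpy_sketch(void *restrict dst0, const void *restrict src0, size_t n)
{
    unsigned char *d = dst0;
    const unsigned char *s = src0;

    while (n && ((uintptr_t)d & 3u)) {   /* align the destination */
        *d++ = *s++;
        n--;
    }

    if (((uintptr_t)s & 3u) == 0) {      /* word copies only if src is now aligned too */
        uint32_t *dw = (uint32_t *)(void *)d;
        const uint32_t *sw = (const uint32_t *)(const void *)s;
        while (n >= 4) {
            *dw++ = *sw++;
            n -= 4;
        }
        d = (unsigned char *)dw;
        s = (const unsigned char *)sw;
    }

    while (n--)                          /* tail, or everything if src stayed misaligned */
        *d++ = *s++;

    return dst0;
}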

 

Online peter-h (Topic starter)

  • Super Contributor
  • ***
  • Posts: 3954
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #329 on: February 13, 2023, 07:19:03 pm »
The code I posted above uses 32 bit moves until there are 0-3 bytes left, so it can't be overrunning either buffer.
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline eutectique

  • Frequent Contributor
  • **
  • Posts: 432
  • Country: be
Re: GCC compiler optimisation
« Reply #330 on: February 13, 2023, 08:24:54 pm »
The code I posted above uses 32 bit moves until there are 0-3 bytes left, so it can't be overrunning either buffer.

Try this:

Code: [Select]
src [1][2][3]
dst       [?][?][?]
 

Offline eutectique

  • Frequent Contributor
  • **
  • Posts: 432
  • Country: be
Re: GCC compiler optimisation
« Reply #331 on: February 13, 2023, 08:53:29 pm »
Or better this:

Code: [Select]
src [1][2][3][4][5]
dst             [ ][ ][ ][ ][ ]
 

Online peter-h (Topic starter)

  • Super Contributor
  • ***
  • Posts: 3954
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #332 on: February 13, 2023, 09:20:27 pm »
Those are overlapping buffers.

I wrote

Quote
Someone pointed out that a memcpy implemented as 4 byte moves will fail if the gap between the two buffers is less than 4 bytes
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4352
  • Country: nz
Re: GCC compiler optimisation
« Reply #333 on: February 14, 2023, 08:55:52 am »
Those are overlapping buffers.

I wrote

Quote
Someone pointed out that a memcpy implemented as 4 byte moves will fail if the gap between the two buffers is less than 4 bytes

That's not failing. The memcpy() spec says the results are undefined if src and dst overlap.

If you give memcpy() overlapping buffers then it is you that has failed to honour your side of the contract, not memcpy().
 

Online peter-h (Topic starter)

  • Super Contributor
  • ***
  • Posts: 3954
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #334 on: February 14, 2023, 09:04:32 am »
OK we got wires crossed.

The issue I described above (buffers with a gap between them of < 4 bytes) doesn't exist AFAICT.
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Online peter-h (Topic starter)

  • Super Contributor
  • ***
  • Posts: 3954
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #335 on: May 25, 2023, 08:30:32 pm »
I've just watched this video



and it convinces me more than ever that a huge amount of work has gone into optimisations which yield no useful runtime benefits whatsoever. For example how time critical can swapping the four bytes of a uint32_t possibly be?
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline voltsandjolts

  • Supporter
  • ****
  • Posts: 2377
  • Country: gb
Re: GCC compiler optimisation
« Reply #336 on: May 25, 2023, 08:51:30 pm »
how time critical can swapping the four bytes of a uint32_t possibly be?

If you need to do it 1e6 times per second, then pretty darn critical.
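
Back-of-the-envelope, assuming something like a 168 MHz STM32F4: 10^6 swaps per second at half a dozen cycles each for a shift-and-mask sequence is a few percent of the entire CPU, versus well under 1% for a single REV instruction.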
 
The following users thanked this post: Siwastaja, MK14

Online peter-h (Topic starter)

  • Super Contributor
  • ***
  • Posts: 3954
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #337 on: May 25, 2023, 09:18:13 pm »
There is a lot of coding-style dependency involved in triggering the pattern recognition.
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Online ataradov

  • Super Contributor
  • ***
  • Posts: 11630
  • Country: us
    • Personal site
Re: GCC compiler optimisation
« Reply #338 on: May 25, 2023, 09:28:27 pm »
There is a lot of coding-style dependency involved in triggering the pattern recognition.
So? Hardware design tools are all based on very strict patterns for the tool to recognize your intent and generate an optimized design. And hardware people deal with that.

If you are doing embedded stuff where nothing matters, then you may not appreciate all that optimization.  It does not mean it is useless.

Networking code involves a lot of byte swaps, and recognizing that pattern may lead to much faster code. Recognizing code that can be automatically vectorized results in a huge performance increase on platforms that have vector hardware.
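
For example, a loop shaped like the sketch below is a classic auto-vectorization candidate; whether it actually gets vectorized depends on the target, the flags (e.g. -O3, or -ftree-vectorize) and the compiler version, so check the generated code:

Code: [Select]
/* With restrict-qualified pointers the compiler can prove there is no
   aliasing, which is what allows it to vectorize the loop. */
void add_arrays(float *restrict dst, const float *restrict a,
                const float *restrict b, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}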

Feel free to use TCC or something like that if you don't want optimizations and don't like them even being in the compiler.
« Last Edit: May 25, 2023, 09:30:39 pm by ataradov »
Alex
 
The following users thanked this post: newbrain, MK14, lucazader

Online peter-h (Topic starter)

  • Super Contributor
  • ***
  • Posts: 3954
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #339 on: May 25, 2023, 09:34:37 pm »
Quote
feel free to use TCC or something like that if you don't want optimizations and don't like them even being in the compiler.

WOW.
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline hans

  • Super Contributor
  • ***
  • Posts: 1670
  • Country: nl
Re: GCC compiler optimisation
« Reply #340 on: May 26, 2023, 12:05:27 pm »
Which is easier to understand:
Quote
const int wrap = 8;
int idx = x % wrap;
Or:
Quote
const int wrap = 8;
static_assert((wrap & (wrap - 1)) == 0, "Wrap is not a power of 2");
int idx = x & (wrap - 1);
Maybe we as embedded programmers feel more comfortable with the 2nd one, but that's pretty domain-specific knowledge. The average programmer will understand the first better. However, if a compiler is able to make the step to implementation #2 as long as wrap is a power of 2 (which is put down as an implementation constraint of #2), then that's great. After all, almost no compiler is going to naively compile the modulo using a REM instruction (integer division unit), as it's super slow, so I'm happy that compilers are smarter than me at finding bit and integer manipulation tricks. Some of these tricks, like the whitespace check function, can also work on any platform that has a barrel shifter. These kinds of optimizations can benefit x86, RISC-V, ARM and MIPS all in one go.
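
One caveat worth adding: the % to AND mapping is only exact when the operand is unsigned (or provably non-negative); for a signed x the two snippets above give different results for negative values, so the compiler has to emit a slightly longer correction sequence. A quick way to compare on your own target (sketch, function names made up):

Code: [Select]
#include <stdint.h>

/* With an unsigned operand, gcc/clang at -O1 and above typically turn the
   modulo by a power of two into a single AND; with a signed operand the
   result must differ for negative x, so a longer sequence is emitted. */
uint32_t wrap_unsigned(uint32_t x) { return x % 8u; }
int      wrap_signed(int x)        { return x % 8; }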

Vector code is all the hype these days. Unfortunately GCC doesn't generate SIMD instructions for Cortex-M4 cores yet. These instructions are a must for DSP applications. The other day I optimized an algorithm from 2M+ cycles/second down to only a few hundred kcyc/s. It would be amazing if the compiler could do this by itself, as now I have to check for consistency between a naive MATLAB implementation (including several Fourier convolutions and complex number manipulations) and its SIMD-optimized version. The problem with the SIMD version is that I can't easily unit test it.
« Last Edit: May 26, 2023, 12:07:51 pm by hans »
 

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15104
  • Country: fr
Re: GCC compiler optimisation
« Reply #341 on: May 26, 2023, 07:20:32 pm »
No doubt that compiler optimizations are a big plus in a whole range of cases.

The ability to map to specific target instructions of course all depends on the target and the current state of the compiler. GCC currently does a better job for x86, which is not really surprising.

Since some talked about byte swapping, here goes:

Code: [Select]
#include <stdint.h>

uint32_t ByteSwap32( uint32_t val )
{
    val = ((val << 8) & 0xFF00FF00 ) | ((val >> 8) & 0xFF00FF );
    return (val << 16) | (val >> 16);
}

gcc -O2:
Code: [Select]
ByteSwap32(unsigned int):
        mov     eax, edi
        bswap   eax
        ret

gcc -O1:
Code: [Select]
ByteSwap32(unsigned int):
        mov     eax, edi
        sal     eax, 8
        and     eax, -16711936
        shr     edi, 8
        and     edi, 16711935
        or      eax, edi
        rol     eax, 16
        ret

gcc -O0:
Code: [Select]
ByteSwap32(unsigned int):
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-4], edi
        mov     eax, DWORD PTR [rbp-4]
        sal     eax, 8
        and     eax, -16711936
        mov     edx, eax
        mov     eax, DWORD PTR [rbp-4]
        shr     eax, 8
        and     eax, 16711935
        or      eax, edx
        mov     DWORD PTR [rbp-4], eax
        mov     eax, DWORD PTR [rbp-4]
        rol     eax, 16
        pop     rbp
        ret

For ARM (32), not too bad either:

gcc -O2:
Code: [Select]
ByteSwap32(unsigned int):
        rev     r0, r0
        bx      lr

gcc -O1:
Code: [Select]
ByteSwap32(unsigned int):
        lsls    r3, r0, #8
        and     r3, r3, #-16711936
        lsrs    r0, r0, #8
        and     r0, r0, #16711935
        orrs    r0, r0, r3
        ror     r0, r0, #16
        bx      lr

Which is faster?

While it was maybe a bit tongue-in-cheek, TCC is actually not that bad if you don't care about optimizations. It's ultra-fast to compile and produces correct code as far as I tested it.

Now of course you can tell the difference in terms of optimization. Here is TCC's ARM output for the same function:

Code: [Select]
00000000 <ByteSwap32>:
   0:   e1a0c00d        mov     ip, sp
   4:   e92d0003        push    {r0, r1}
   8:   e92d5800        push    {fp, ip, lr}
   c:   e1a0b00d        mov     fp, sp
  10:   e1a00000        nop                     ; (mov r0, r0)
  14:   e59b000c        ldr     r0, [fp, #12]
  18:   e1a00400        lsl     r0, r0, #8
  1c:   e59f1000        ldr     r1, [pc]        ; 24 <ByteSwap32+0x24>
  20:   ea000000        b       28 <ByteSwap32+0x28>
  24:   ff00ff00                        ; <UNDEFINED> instruction: 0xff00ff00
  28:   e0000001        and     r0, r0, r1
  2c:   e59b100c        ldr     r1, [fp, #12]
  30:   e1a01421        lsr     r1, r1, #8
  34:   e59f2000        ldr     r2, [pc]        ; 3c <ByteSwap32+0x3c>
  38:   ea000000        b       40 <ByteSwap32+0x40>
  3c:   00ff00ff        ldrshteq        r0, [pc], #15
  40:   e0011002        and     r1, r1, r2
  44:   e1800001        orr     r0, r0, r1
  48:   e58b000c        str     r0, [fp, #12]
  4c:   e59b000c        ldr     r0, [fp, #12]
  50:   e1a00800        lsl     r0, r0, #16
  54:   e59b100c        ldr     r1, [fp, #12]
  58:   e1a01821        lsr     r1, r1, #16
  5c:   e1800001        orr     r0, r0, r1
  60:   e89ba800        ldm     fp, {fp, sp, pc}
;D
 

Online peter-h (Topic starter)

  • Super Contributor
  • ***
  • Posts: 3954
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #342 on: May 26, 2023, 08:12:12 pm »
The various GCC levels just trim the code more and more, so why not?

Your last example is what the H8/300 GNU compiler produced in 1995 :) I could have used that for a customer-programmable product back then, but it was awful. So I chose the Hitech one, which was vastly better but had to be sold for £450.

My earlier comments were on a different thing.
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Online ataradov

  • Super Contributor
  • ***
  • Posts: 11630
  • Country: us
    • Personal site
Re: GCC compiler optimisation
« Reply #343 on: May 26, 2023, 08:34:02 pm »
To be clear, I did not flame TCC; it is an excellent compiler. It is just a good example of a totally non-optimizing compiler, which is something you don't want at all for general-purpose programming.
Alex
 

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15104
  • Country: fr
Re: GCC compiler optimisation
« Reply #344 on: May 26, 2023, 08:42:05 pm »
To be clear, I did not flame TCC; it is an excellent compiler. It is just a good example of a totally non-optimizing compiler, which is something you don't want at all for general-purpose programming.

Yep. Well, TCC tends to do a bit worse than GCC -O0, but not a whole lot worse. So, for someone that would consistently compile their code at -O0, TCC would certainly be an alternative.

(The only thing is that it would take a bit of work to use in your toolchain compared to using a vendor-supplied GCC with everything already set up.)
 

Offline bson

  • Supporter
  • ****
  • Posts: 2385
  • Country: us
Re: GCC compiler optimisation
« Reply #345 on: May 26, 2023, 09:25:37 pm »
Why bother with this "systick" interrupt when you're at a low level stage?
Just use the DWT cycle counter, as is often suggested in this forum. It just needs to be enabled. I use this for enabling it in C:
Code: [Select]
void DWT_Init(void)
{
    if (!(CoreDebug->DEMCR & CoreDebug_DEMCR_TRCENA_Msk))
        CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;

    DWT->CYCCNT = 0;
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;
}

Enabling it in assembly is also just a couple instructions if you need to do this in the startup code or something.

Then you can read the DWT->CYCCNT register. Resolution is the system clock period. Doesn't require any interrupt or any peripheral.
Why check DEMCR & TRCENA before setting TRCENA?  Is there something in the STM32 (F) implementation of the DWT or ITM modules that necessitates this?
I've never checked it and haven't noticed any problems, though I use ITM (for SWO) and not really DWT.  The DWT, though, has useful optimization data, so I may start using it!
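
For anyone following along, a minimal usage sketch (assumptions: a CMSIS device header such as stm32f4xx.h is available, do_work() is a made-up function under test, and DWT_Init() from the snippet above has already been called):

Code: [Select]
#include <stdint.h>
#include "stm32f4xx.h"   /* assumption: CMSIS device header defining DWT */

extern void do_work(void);   /* hypothetical function being timed */

void measure_something(void)
{
    uint32_t start = DWT->CYCCNT;

    do_work();

    uint32_t elapsed = DWT->CYCCNT - start;   /* unsigned subtraction handles wrap */
    (void)elapsed;                            /* log it, or inspect it in a debugger */
}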

 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4352
  • Country: nz
Re: GCC compiler optimisation
« Reply #346 on: May 27, 2023, 01:58:10 am »
Since some talked about byte swapping here goes:

riscv32, if you include "_zbb" in the arch string. gcc needs -O2 or -Os; with clang, just -O1 is enough:

Code: [Select]
ByteSwap32:
        rev8    a0,a0
        ret
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3237
  • Country: ca
Re: GCC compiler optimisation
« Reply #347 on: May 27, 2023, 01:50:15 pm »
Since some talked about byte swapping here goes:

These kinds of optimizations are needed because C doesn't have operators for byte swaps, rotations, bit setting, etc. Therefore, the compiler needs to recognize the common patterns you would use for these and then emit the corresponding single instructions.
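
For example, the usual rotate idiom is one such pattern: gcc and clang recognize the sketch below and typically emit a single rotate instruction on targets that have one (and for byte swaps GCC also provides __builtin_bswap32() directly, so you don't have to rely on pattern matching at all):

Code: [Select]
#include <stdint.h>

/* The standard UB-free rotate-left idiom; the (-n & 31) form keeps both
   shift counts in range even when n == 0. */
static inline uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << (n & 31)) | (x >> (-n & 31));
}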
 

Online peter-h (Topic starter)

  • Super Contributor
  • ***
  • Posts: 3954
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #348 on: May 27, 2023, 06:00:57 pm »
Quote
To be clear,

To be clear

Quote
Therefore, the compiler needs to recognize common patterns

that's exactly the point I was trying to make. This code-replacement stuff is coding-style dependent.

And if your code depends on the speed, then you need to do a lot of regression testing (especially timing stuff with a scope / logic analyser) after each compiler version change. And examine the asm generated to make sure it isn't working by accident.

And if your code doesn't depend on the speed, well, then the time invested by the compiler writer was a bit pointless ;) Delivering a human-readable linkfile syntax would save somebody a lot more time, for example.

If I were writing code which actually needed the speed, I would write an asm function for it.

Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Online ataradov

  • Super Contributor
  • ***
  • Posts: 11630
  • Country: us
    • Personal site
Re: GCC compiler optimisation
« Reply #349 on: May 27, 2023, 06:17:51 pm »
And if your code depends on the speed, then you need to do a lot of regression testing (especially timing stuff with a scope / logic analyser) after each compiler version change.
Real projects do that anyway. This is why nobody changes the versions of the tools mid-project. And a version change on a big project usually takes weeks of evaluation.

For small projects you can do whatever you want, compiler makers don't target you.

And if your code doesn't depend on the speed, well, then the time invested by the compiler writer was a bit pointless
Even if something does not break when it is a bit slower, that does not mean it does not benefit from the added performance.
Alex
 

