Author Topic: GCC compiler optimisation  (Read 42912 times)


Online peter-h (Topic starter)

  • Super Contributor
  • ***
  • Posts: 3954
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #325 on: February 13, 2023, 01:57:35 pm »
Someone pointed out that a memcpy implemented as 4 byte moves will fail if the gap between the two buffers is less than 4 bytes, which is why, supposedly, memcpy is normally done in byte mode.

Can anybody relate to this? I don't get it at all.

Normal memcpy doesn't support overlapping blocks. For that you use memmove.

Code: [Select]
#include <stddef.h>
#include <stdint.h>

void *memcpy_fast (void *__restrict dst0, const void *__restrict src0, size_t len0)
{
    char *dst = dst0;
    const char *src = src0;
    uint32_t *aligned_dst;
    const uint32_t *aligned_src;

    // If the size is >= 4 then do 32-bit moves, until exhausted.
    // Note: the word-pointer casts below assume both buffers are word-aligned
    // (or that the target tolerates unaligned 32-bit accesses, as a Cortex-M4
    // does by default).

    if (len0 >= 4)
    {
        aligned_dst = (uint32_t *)dst;
        aligned_src = (const uint32_t *)src;

        while (len0 >= 4)
        {
            *aligned_dst++ = *aligned_src++;
            len0 -= 4;
        }

        dst = (char *)aligned_dst;
        src = (const char *)aligned_src;
    }

    // Finish with any single-byte moves

    while (len0--)
        *dst++ = *src++;

    return dst0;
}
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline eutectique

  • Frequent Contributor
  • **
  • Posts: 432
  • Country: be
Re: GCC compiler optimisation
« Reply #326 on: February 13, 2023, 02:15:12 pm »
Suppose two memory blocks, 0x10 bytes in size, src starts at address 0x100, dst starts at address 0x108 -- that's the definition of overlapping blocks.

A memcpy() that uses src++ and dst++ will fail on that, byte-wide or word-wide; it doesn't matter.
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3237
  • Country: ca
Re: GCC compiler optimisation
« Reply #327 on: February 13, 2023, 04:31:40 pm »
Someone pointed out that a memcpy implemented as 4 byte moves will fail if the gap between the two buffers is less than 4 bytes

If you move in consecutive blocks starting from the lower end, then regardless of the block size, memcpy will fail if both of these conditions are met:

Code: [Select]
src < dst < (src+len)
len > block_size

and succeed otherwise.
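
To make it concrete, here is a minimal sketch (the buffer layout is made up for the example: dst = src + 2 and len = 8 with a 4-byte block, so both conditions above hold):

Code: [Select]
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned char buf[16] = "ABCDEFGHIJ";
    unsigned char *src = buf;        /* src < dst < src + len */
    unsigned char *dst = buf + 2;

    /* Forward copy in 4-byte blocks: read one block, then write it. */
    for (size_t i = 0; i < 8; i += 4) {
        uint32_t w;
        memcpy(&w, src + i, 4);
        memcpy(dst + i, &w, 4);
    }

    /* An overlap-safe copy (memmove) would leave "ABABCDEFGH";
       the forward block copy produces "ABABCDCDGH" instead. */
    printf("%.10s\n", (char *)buf);
    return 0;
}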
 

Online ejeffrey

  • Super Contributor
  • ***
  • Posts: 3859
  • Country: us
Re: GCC compiler optimisation
« Reply #328 on: February 13, 2023, 05:29:47 pm »
Someone pointed out that a memcpy implemented as 4 byte moves will fail if the gap between the two buffers is less than 4 bytes, which is why, supposedly, memcpy is normally done in byte mode.

Can anybody relate to this? I don't get it at all.

I don't get it either.  A properly working memcpy implementation should neither read bytes outside the source buffer nor write bytes outside the destination.  If this requirement is satisfied, it should not fail as long as the buffers don't overlap (which as you say is not supported, and requires memmove).  Some implementations do allow reading past the end of the source, as long as no architectural boundaries are crossed (i.e., the over-read can't cause a fault), typically to the end of an aligned word.  But since the extra bytes won't be written, it doesn't matter whether they are new or old data.

If the source and destination buffers are disjoint, but share the same 4 byte word, then they can't both be 4 byte aligned.  The reason simple memcpy implementations work in byte mode is entirely for alignment purposes.  memcpy doesn't place any alignment requirements on the buffers or the sizes.  So memcpy implementations that want to use larger word sizes need to handle buffers that are not aligned, lengths that are not a multiple of the word size, different alignment between source and destination, and do so without causing faults or severe misaligned access penalties.

Optimized memcpy implementations obviously *do* check the sizes and alignments and try to use larger chunks when possible.
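
For illustration, a rough sketch of the shape such an implementation usually takes (not any particular library's code, just the alignment handling described above):

Code: [Select]
#include <stddef.h>
#include <stdint.h>

/* Sketch only: byte-copy until dst is word-aligned, use word copies only if
   src ended up aligned too, then finish the 0-3 byte tail. Real library
   versions add unrolling, unaligned-load paths where the ISA allows, etc. */
void *memcpy_sketch(void *restrict dst0, const void *restrict src0, size_t n)
{
    unsigned char *d = dst0;
    const unsigned char *s = src0;

    while (n && ((uintptr_t)d & 3u)) {   /* align the destination */
        *d++ = *s++;
        n--;
    }

    if (((uintptr_t)s & 3u) == 0) {      /* word copies only if src is now aligned too */
        uint32_t *dw = (uint32_t *)(void *)d;
        const uint32_t *sw = (const uint32_t *)(const void *)s;
        while (n >= 4) {
            *dw++ = *sw++;
            n -= 4;
        }
        d = (unsigned char *)dw;
        s = (const unsigned char *)sw;
    }

    while (n--)                          /* tail, or everything if src stayed misaligned */
        *d++ = *s++;

    return dst0;
}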

 

Online peter-h (Topic starter)

  • Super Contributor
  • ***
  • Posts: 3954
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #329 on: February 13, 2023, 07:19:03 pm »
The code I posted above uses 32 bit moves until there are 0-3 bytes left, so it can't be overrunning either buffer.
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline eutectique

  • Frequent Contributor
  • **
  • Posts: 432
  • Country: be
Re: GCC compiler optimisation
« Reply #330 on: February 13, 2023, 08:24:54 pm »
The code I posted above uses 32 bit moves until there are 0-3 bytes left, so it can't be overrunning either buffer.

Try this:

Code: [Select]
src [1][2][3]
dst       [?][?][?]
 

Offline eutectique

  • Frequent Contributor
  • **
  • Posts: 432
  • Country: be
Re: GCC compiler optimisation
« Reply #331 on: February 13, 2023, 08:53:29 pm »
Or better this:

Code: [Select]
src [1][2][3][4][5]
dst             [ ][ ][ ][ ][ ]
 

Online peter-h (Topic starter)

  • Super Contributor
  • ***
  • Posts: 3954
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #332 on: February 13, 2023, 09:20:27 pm »
Those are overlapping buffers.

I wrote

Quote
Someone pointed out that a memcpy implemented as 4 byte moves will fail if the gap between the two buffers is less than 4 bytes
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4352
  • Country: nz
Re: GCC compiler optimisation
« Reply #333 on: February 14, 2023, 08:55:52 am »
Those are overlapping buffers.

I wrote

Quote
Someone pointed out that a memcpy implemented as 4 byte moves will fail if the gap between the two buffers is less than 4 bytes

That's not failing. The memcpy() spec says the results are undefined if src and dst overlap.

If you give memcpy() overlapping buffers then it is you that has failed to honour your side of the contract, not memcpy().
 

Online peter-h (Topic starter)

  • Super Contributor
  • ***
  • Posts: 3954
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #334 on: February 14, 2023, 09:04:32 am »
OK we got wires crossed.

The issue I described above (buffers with a gap between them of < 4 bytes) doesn't exist AFAICT.
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Online peter-h (Topic starter)

  • Super Contributor
  • ***
  • Posts: 3954
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #335 on: May 25, 2023, 08:30:32 pm »
I've just watched this video



and it convinces me more than ever that a huge amount of work has gone into optimisations which yield no useful runtime benefits whatsoever. For example how time critical can swapping the four bytes of a uint32_t possibly be?
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline voltsandjolts

  • Supporter
  • ****
  • Posts: 2377
  • Country: gb
Re: GCC compiler optimisation
« Reply #336 on: May 25, 2023, 08:51:30 pm »
how time critical can swapping the four bytes of a uint32_t possibly be?

If you need to do it 1e6 times per second, then pretty darn critical.
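
Back-of-the-envelope, assuming something like a 168 MHz STM32F4: 10^6 swaps per second at half a dozen cycles each for a shift-and-mask sequence is a few percent of the entire CPU, versus well under 1% for a single REV instruction.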
 
The following users thanked this post: Siwastaja, MK14

Online peter-h (Topic starter)

  • Super Contributor
  • ***
  • Posts: 3954
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #337 on: May 25, 2023, 09:18:13 pm »
There is a lot of coding-style dependency involved in triggering the pattern recognition.
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Online ataradov

  • Super Contributor
  • ***
  • Posts: 11630
  • Country: us
    • Personal site
Re: GCC compiler optimisation
« Reply #338 on: May 25, 2023, 09:28:27 pm »
There is a lot of coding-style dependency involved in triggering the pattern recognition.
So? Hardware design tools are all based on very strict patterns for the tool to recognize your intent and generate an optimized design. And hardware people deal with that.

If you are doing embedded stuff where nothing matters, then you may not appreciate all that optimization.  It does not mean it is useless.

Networking code involves a lot of byte swaps, and recognizing that pattern may lead to much faster code. Recognizing code that can be automatically vectorized results in a huge performance increase on platforms that have vector hardware.
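
For example, a loop shaped like the sketch below is a classic auto-vectorization candidate; whether it actually gets vectorized depends on the target, the flags (e.g. -O3, or -ftree-vectorize) and the compiler version, so check the generated code:

Code: [Select]
/* With restrict-qualified pointers the compiler can prove there is no
   aliasing, which is what allows it to vectorize the loop. */
void add_arrays(float *restrict dst, const float *restrict a,
                const float *restrict b, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}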

Feel free to use TCC or something like that if you don't want optimizations and don't like them even being in the compiler.
« Last Edit: May 25, 2023, 09:30:39 pm by ataradov »
Alex
 
The following users thanked this post: newbrain, MK14, lucazader

Online peter-h (Topic starter)

  • Super Contributor
  • ***
  • Posts: 3954
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #339 on: May 25, 2023, 09:34:37 pm »
Quote
feel free to use TCC or something like that if you don't want optimizations and don't like them even being in the compiler.

WOW.
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline hans

  • Super Contributor
  • ***
  • Posts: 1670
  • Country: nl
Re: GCC compiler optimisation
« Reply #340 on: May 26, 2023, 12:05:27 pm »
Which is easier to understand:
Quote
const int wrap = 8;
int idx = x % wrap;
Or:
Quote
const int wrap = 8;
static_assert((wrap & (wrap - 1)) == 0, "Wrap is not a power of 2");
int idx = x & (wrap - 1);
Maybe we as embedded programmers feel more comfortable with the 2nd one, but that's pretty domain-specific knowledge. The average programmer will understand the first better. However, if a compiler is able to make the step to implementation #2 as long as wrap is a power of 2 (which is put down as an implementation constraint of #2), then that's great. After all, almost no compiler is going to naively compile the modulo using a REM instruction (integer division unit), as it's super slow, so I'm happy that compilers are smarter than me at finding bit and integer manipulation tricks. Some of these tricks, like the whitespace check function, can also work on any platform that has a barrel shifter. These kinds of optimizations can benefit x86, RISC-V, ARM and MIPS all in one go.
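
One caveat worth adding: the % to AND mapping is only exact when the operand is unsigned (or provably non-negative); for a signed x the two snippets above give different results for negative values, so the compiler has to emit a slightly longer correction sequence. A quick way to compare on your own target (sketch, function names made up):

Code: [Select]
#include <stdint.h>

/* With an unsigned operand, gcc/clang at -O1 and above typically turn the
   modulo by a power of two into a single AND; with a signed operand the
   result must differ for negative x, so a longer sequence is emitted. */
uint32_t wrap_unsigned(uint32_t x) { return x % 8u; }
int      wrap_signed(int x)        { return x % 8; }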

Vector code is all the hype these days. Unfortunately GCC doesn't generate SIMD instructions for Cortex-M4 cores yet. These instructions are a must for DSP applications. The other day I optimized an algorithm from 2M+ cycles/second down to only a few hundred kcyc/s. It would be amazing if the compiler could do this by itself, as now I have to check for consistency between a naive MATLAB implementation (including several Fourier convolutions and complex number manipulations) and its SIMD-optimized version. The problem with the SIMD version is that I can't easily unit test it.
« Last Edit: May 26, 2023, 12:07:51 pm by hans »
 

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15104
  • Country: fr
Re: GCC compiler optimisation
« Reply #341 on: May 26, 2023, 07:20:32 pm »
No doubt that compiler optimizations are a big plus in a whole range of cases.

The ability to map to specific target instructions of course all depends on the target and the current state of the compiler. GCC currently does a better job for x86, which is not really surprising.

Since some talked about byte swapping, here goes:

Code: [Select]
#include <stdint.h>

uint32_t ByteSwap32( uint32_t val )
{
    val = ((val << 8) & 0xFF00FF00 ) | ((val >> 8) & 0xFF00FF );
    return (val << 16) | (val >> 16);
}

gcc -O2:
Code: [Select]
ByteSwap32(unsigned int):
        mov     eax, edi
        bswap   eax
        ret

gcc -O1:
Code: [Select]
ByteSwap32(unsigned int):
        mov     eax, edi
        sal     eax, 8
        and     eax, -16711936
        shr     edi, 8
        and     edi, 16711935
        or      eax, edi
        rol     eax, 16
        ret

gcc -O0:
Code: [Select]
ByteSwap32(unsigned int):
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-4], edi
        mov     eax, DWORD PTR [rbp-4]
        sal     eax, 8
        and     eax, -16711936
        mov     edx, eax
        mov     eax, DWORD PTR [rbp-4]
        shr     eax, 8
        and     eax, 16711935
        or      eax, edx
        mov     DWORD PTR [rbp-4], eax
        mov     eax, DWORD PTR [rbp-4]
        rol     eax, 16
        pop     rbp
        ret

For ARM (32), not too bad either:

gcc -O2:
Code: [Select]
ByteSwap32(unsigned int):
        rev     r0, r0
        bx      lr

gcc -O1:
Code: [Select]
ByteSwap32(unsigned int):
        lsls    r3, r0, #8
        and     r3, r3, #-16711936
        lsrs    r0, r0, #8
        and     r0, r0, #16711935
        orrs    r0, r0, r3
        ror     r0, r0, #16
        bx      lr

Which is faster?

While it was maybe a bit tongue-in-cheek, TCC is actually not that bad if you don't care about optimizations. It's ultra-fast to compile and produces correct code as far as I tested it.

Now of course you can tell the difference in terms of optimization. Here is TCC's ARM output for the same function:

Code: [Select]
00000000 <ByteSwap32>:
   0:   e1a0c00d        mov     ip, sp
   4:   e92d0003        push    {r0, r1}
   8:   e92d5800        push    {fp, ip, lr}
   c:   e1a0b00d        mov     fp, sp
  10:   e1a00000        nop                     ; (mov r0, r0)
  14:   e59b000c        ldr     r0, [fp, #12]
  18:   e1a00400        lsl     r0, r0, #8
  1c:   e59f1000        ldr     r1, [pc]        ; 24 <ByteSwap32+0x24>
  20:   ea000000        b       28 <ByteSwap32+0x28>
  24:   ff00ff00                        ; <UNDEFINED> instruction: 0xff00ff00
  28:   e0000001        and     r0, r0, r1
  2c:   e59b100c        ldr     r1, [fp, #12]
  30:   e1a01421        lsr     r1, r1, #8
  34:   e59f2000        ldr     r2, [pc]        ; 3c <ByteSwap32+0x3c>
  38:   ea000000        b       40 <ByteSwap32+0x40>
  3c:   00ff00ff        ldrshteq        r0, [pc], #15
  40:   e0011002        and     r1, r1, r2
  44:   e1800001        orr     r0, r0, r1
  48:   e58b000c        str     r0, [fp, #12]
  4c:   e59b000c        ldr     r0, [fp, #12]
  50:   e1a00800        lsl     r0, r0, #16
  54:   e59b100c        ldr     r1, [fp, #12]
  58:   e1a01821        lsr     r1, r1, #16
  5c:   e1800001        orr     r0, r0, r1
  60:   e89ba800        ldm     fp, {fp, sp, pc}
;D
 

Online peter-h (Topic starter)

  • Super Contributor
  • ***
  • Posts: 3954
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #342 on: May 26, 2023, 08:12:12 pm »
The various GCC levels just trim the code more and more, so why not?

Your last example is what the H8/300 GNU compiler produced in 1995 :) I could have used that for a customer-programmable product back then, but it was awful. So I chose the Hitech one, which was vastly better but had to be sold for £450.

My earlier comments were on a different thing.
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Online ataradov

  • Super Contributor
  • ***
  • Posts: 11630
  • Country: us
    • Personal site
Re: GCC compiler optimisation
« Reply #343 on: May 26, 2023, 08:34:02 pm »
To be clear, I did not flame TCC; it is an excellent compiler. It is just a good example of a totally non-optimizing compiler, which is something you don't want at all for general-purpose programming.
Alex
 

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15104
  • Country: fr
Re: GCC compiler optimisation
« Reply #344 on: May 26, 2023, 08:42:05 pm »
To be clear, I did not flame TCC; it is an excellent compiler. It is just a good example of a totally non-optimizing compiler, which is something you don't want at all for general-purpose programming.

Yep. Well, TCC tends to do a bit worse than GCC -O0, but not a whole lot worse. So, for someone that would consistently compile their code at -O0, TCC would certainly be an alternative.

(The only thing is that it would take a bit of work to use in your toolchain compared to using a vendor-supplied GCC with everything already set up.)
 

Offline bson

  • Supporter
  • ****
  • Posts: 2385
  • Country: us
Re: GCC compiler optimisation
« Reply #345 on: May 26, 2023, 09:25:37 pm »
Why bother with this "systick" interrupt when you're at a low level stage?
Just use the DWT cycle counter, as is often suggested in this forum. It just needs to be enabled. I use this for enabling it in C:
Code: [Select]
void DWT_Init(void)
{
    if (!(CoreDebug->DEMCR & CoreDebug_DEMCR_TRCENA_Msk))
        CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;

    DWT->CYCCNT = 0;
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;
}

Enabling it in assembly is also just a couple instructions if you need to do this in the startup code or something.

Then you can read the DWT->CYCCNT register. Resolution is the system clock period. Doesn't require any interrupt or any peripheral.
Why check DEMCR & TRCENA before setting TRCENA?  Is there something in the STM32 (F) implementation of the DWT or ITM modules that necessitates this?
I've never checked it and haven't noticed any problems, though I use ITM (for SWO) and not really DWT.  The DWT, though, has useful optimization data, so I may start using it!
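
For anyone following along, a minimal usage sketch (assumptions: a CMSIS device header such as stm32f4xx.h is available, do_work() is a made-up function under test, and DWT_Init() from the snippet above has already been called):

Code: [Select]
#include <stdint.h>
#include "stm32f4xx.h"   /* assumption: CMSIS device header defining DWT */

extern void do_work(void);   /* hypothetical function being timed */

void measure_something(void)
{
    uint32_t start = DWT->CYCCNT;

    do_work();

    uint32_t elapsed = DWT->CYCCNT - start;   /* unsigned subtraction handles wrap */
    (void)elapsed;                            /* log it, or inspect it in a debugger */
}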

 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4352
  • Country: nz
Re: GCC compiler optimisation
« Reply #346 on: May 27, 2023, 01:58:10 am »
Since some talked about byte swapping here goes:

riscv32, if you include "_zbb" in the arch string. gcc needs -O2 or -Os; with clang, just -O1 is enough:

Code: [Select]
ByteSwap32:
        rev8    a0,a0
        ret
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3237
  • Country: ca
Re: GCC compiler optimisation
« Reply #347 on: May 27, 2023, 01:50:15 pm »
Since some talked about byte swapping here goes:

These kinds of optimizations are needed because C doesn't have operators for byte swaps, rotations, bit setting, etc. Therefore, the compiler needs to recognize the common patterns you would use for these and then emit the corresponding single instructions.
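
For example, the usual rotate idiom is one such pattern: gcc and clang recognize the sketch below and typically emit a single rotate instruction on targets that have one (and for byte swaps GCC also provides __builtin_bswap32() directly, so you don't have to rely on pattern matching at all):

Code: [Select]
#include <stdint.h>

/* The standard UB-free rotate-left idiom; the (-n & 31) form keeps both
   shift counts in range even when n == 0. */
static inline uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << (n & 31)) | (x >> (-n & 31));
}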
 

Online peter-h (Topic starter)

  • Super Contributor
  • ***
  • Posts: 3954
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #348 on: May 27, 2023, 06:00:57 pm »
Quote
To be clear,

To be clear

Quote
Therefore, the compiler needs to recognize common patterns

that's exactly the point I was trying to make. This code-replacement stuff is coding-style dependent.

And if your code depends on the speed, then you need to do a lot of regression testing (especially timing stuff with a scope / logic analyser) after each compiler version change. And examine the asm generated to make sure it isn't working by accident.

And if your code doesn't depend on the speed, well, then the time invested by the compiler writer was a bit pointless ;) Delivering a human-readable linkfile syntax would save somebody a lot more time, for example.

If I were writing code which actually needed the speed, I would write an asm function for it.

Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Online ataradov

  • Super Contributor
  • ***
  • Posts: 11630
  • Country: us
    • Personal site
Re: GCC compiler optimisation
« Reply #349 on: May 27, 2023, 06:17:51 pm »
And if your code depends on the speed, then you need to do a lot of regression testing (especially timing stuff with a scope / logic analyser) after each compiler version change.
Real projects do that anyway. This is why nobody changes the versions of the tools mid-project. And a version change on a big project usually takes weeks of evaluation.

For small projects you can do whatever you want, compiler makers don't target you.

And if your code doesn't depend on the speed, well, then the time invested by the compiler writer was a bit pointless
Even if something does not break when it is a bit slower, that does not mean it does not benefit from the added performance.
Alex
 

