Author Topic: GCC compiler optimisation (Read 45807 times)

DiTBho · « **Reply #250 on:** August 30, 2022, 10:42:08 am »

Quote from: gf on August 30, 2022, 09:57:42 am

C basically does define an abstract, machine-independent memory ordering model, implemented via the stuff in stdatomic.h. Programs which strictly adhere to this abstract model (even if the current target CPU's requirements are not that strict) are even supposed to be portable to different CPUs, wrt. this functionality.

Yeah, the concurrency support library, basically the _Atomic typedefs (include/stdatomic.h)

DiTBho · « **Reply #251 on:** August 30, 2022, 11:26:53 am »

support for this

Code: [Select]

enum memory_order
{
    memory_order_relaxed,
    memory_order_consume,
    memory_order_acquire,
    memory_order_release,
    memory_order_acq_rel,
    memory_order_seq_cst
};

is what is available since C11.

It's what I have always worked around by segregating it into assembly, for projects that have to be compiled with older C89 compilers.
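For reference, a minimal sketch of how these orderings are typically used in C11, assuming a single producer and a single consumer. The names payload, ready, publish and try_consume are made up for illustration and are not from any post above.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared state: a payload plus a flag that publishes it. */
static int payload;
static atomic_bool ready;   /* static storage: starts out false */

/* Producer: write the payload, then publish with release ordering so the
   payload write cannot be reordered after the flag store. */
static void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer: acquire ordering on the flag load makes the payload write
   visible once the flag is seen as true. Returns false if nothing has
   been published yet. */
static bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```

On x86 both calls compile to plain moves; on a weakly-ordered target the compiler adds whatever barrier the ISA needs, which is the part that otherwise ends up in hand-written assembly.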

gf · « **Reply #252 on:** August 30, 2022, 12:10:10 pm »

Quote from: DiTBho on August 30, 2022, 11:26:53 am

is what is available since C11.

It's what I have always worked around by segregating it into assembly, for projects that have to be compiled with older C89 compilers.

Prior to that, gcc already supported the __sync_...() built-in functions. I think they were proprietary extensions, and most of them were full barriers.

Nominal Animal · « **Reply #253 on:** August 30, 2022, 12:26:37 pm »

Quote from: gf on August 30, 2022, 12:10:10 pm

Quote from: DiTBho on August 30, 2022, 11:26:53 am

is what is available since C11.

It's what I have always worked around by segregating it into assembly, for projects that have to be compiled with older C89 compilers.

Prior to that, gcc already supported the __sync_...() built-in functions. I think they were proprietary extensions, and most of them were full barriers.

They were originally introduced by Intel, in the Intel Itanium ABI, I believe.

Gcc has provided the __atomic_...() built-in functions with the C++11 memory model since version 4.7 (2012), possibly earlier.
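A small side-by-side sketch of the two builtin families, assuming GCC or Clang. The counter variables and function names are hypothetical; the builtins themselves are the documented GCC ones.

```c
#include <stdint.h>

static uint32_t counter_sync;
static uint32_t counter_atomic;

/* Legacy builtin: acts as a full barrier. Returns the value before the add. */
static uint32_t bump_sync(void)
{
    return __sync_fetch_and_add(&counter_sync, 1u);
}

/* Newer builtin: same operation, but the caller picks the memory order. */
static uint32_t bump_atomic(void)
{
    return __atomic_fetch_add(&counter_atomic, 1u, __ATOMIC_RELAXED);
}

/* Plain atomic load with acquire ordering. */
static uint32_t read_atomic(void)
{
    return __atomic_load_n(&counter_atomic, __ATOMIC_ACQUIRE);
}
```

The __atomic form lets you ask for less than a full barrier where that is enough, which the older __sync form does not.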

ataradov · « **Reply #254 on:** August 30, 2022, 03:05:59 pm »

Quote from: DiTBho on August 30, 2022, 08:38:16 am

And then you have these solutions full of "black voodoo magic"

Optimizations are by definition compiler-specific. And there are like 2 useful modern compilers anyway. And you don't have to do it through the attribute, you can just use a global flag. I hope you are not upset that different compilers have different command line flags?

SiliconWizard · « **Reply #255 on:** August 30, 2022, 06:30:21 pm »

If you have a problem with optimizations, it's 99.99% of the time because you are writing non-compliant code (consciously or not.)
My default optimization level with GCC and C is "-O3" for most embedded development I do, unless there is a problem with binary size, for which I'll usually switch to "-Os".
Never had a single issue related to this.

But I only use relatively recent targets, so nothing old, exotic or only supported by old or possibly unofficial compiler versions.

DiTBho · « **Reply #256 on:** August 30, 2022, 07:01:41 pm »

Quote from: Nominal Animal on August 30, 2022, 12:26:37 pm

Gcc has provided the __atomic_...() built-in functions with the C++11 memory model since version 4.7 (2012), possibly earlier.

I think that even IBM has interests in supporting that stuff for their POWER10 and POWER11 and their tr-mem.

Gcc-v11 is very promising!

But I don't know, C++11&C look like ballet dancers; "Consume" has been discouraged since C++17 because essentially nobody has been able to implement it in any way that's better than "acquire".

The first model of tr-mem was based on "Consume", now they say you can think of "Consume" as a restricted version of "Acquire", and "Relaxed" imposes no memory order at all.

You get the point

DiTBho · « **Reply #257 on:** August 30, 2022, 07:20:57 pm »

(
ok, to tell you the true story, they said that the C++ design for "Consume" was "impractical" for Gcc to implement, so they gave up and strengthened it to "Acquire", which requires a memory barrier on most weakly-ordered ISAs.

In real life, it's not a problem for x86 and ARM64, but it is a problem for PowerPC and MIPS4 (and more than horrible on MIPS4+): to get that juicy performance, devs actually need to use "Relaxed", but *very* carefully, more carefully than on x86 and ARM64, hoping it won't get optimized into something unsafe. On a weakly-ordered ISA, "Relaxed" really is weak, and therefore prone to catastrophic failures without the right memory barriers enforced.

So, you can see why I don't trust the C/C++ compiler and prefer the old school of segregating "critical code" into assembly modules.
)

ejeffrey · « **Reply #258 on:** August 30, 2022, 08:39:53 pm »

Quote from: peter-h on August 30, 2022, 08:21:24 am

The reason I don't bother with it is because tests showed negligible code size or speed differences. As one would expect, the stuff got increasingly esoteric. Also some mentioned that -O3 ought to be considered "experimental" which is not really what I want to be doing.

That isn't correct. -O3 is not experimental. All valid code should compile fine with -O3. Of course compilers do have bugs but that is very rare. Traditionally -O3 enables optimizations which have a reasonable chance to make performance worse: frequently ones that can cause a large increase in compiled code size (which can reduce pipeline stalls but increase cache pressure).

If your code breaks with -O3 then it is almost always non-compliant.
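A typical example of code that "breaks with -O3" is type punning through a pointer cast. A minimal sketch of the non-compliant pattern (in the comment) and a compliant replacement, assuming a 32-bit IEEE 754 float; float_bits is a made-up name:

```c
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(float) == sizeof(uint32_t), "assumes a 32-bit float");

/* Non-compliant version, for contrast only:
 *
 *     uint32_t bits = *(uint32_t *)&f;
 *
 * It reads a float object through a uint32_t lvalue, which violates the
 * aliasing rules, so the optimizer is allowed to reorder or drop accesses. */

/* Compliant version: copy the object representation. With optimization
   enabled, compilers typically turn the fixed-size memcpy into one move. */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```

The cast version often appears to work at -O0 and then misbehaves at higher levels, which is why the optimizer tends to get the blame.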

Some compilers have other options that relax constraints and can generate non-conformant behavior but perform even faster if you know this is OK for your application. But those options should never be turned on by -O?.

gf · « **Reply #259 on:** August 30, 2022, 09:35:50 pm »

Quote

ok, to tell you the true story, they said that the C++ design for "Consume" was "impractical" for Gcc to implement, so they gave up and strengthened it to "Acquire", which requires a memory barrier on most weakly-ordered ISAs.

In real life, it's not a problem for x86 and ARM64, but it is a problem for PowerPC and MIPS4 (and more than horrible on MIPS4+): to get that juicy performance, devs actually need to use "Relaxed", but *very* carefully, more carefully than on x86 and ARM64, hoping it won't get optimized into something unsafe. On a weakly-ordered ISA, "Relaxed" really is weak, and therefore prone to catastrophic failures without the right memory barriers enforced.

It is of course always safe to implement Consume as Acquire. My feeling is that Consume was "invented" in order to have a faster alternative for some special cases of Acquire, on those processors which need a (slow) memory barrier instruction to implement Acquire semantics.

On x86, regular loads have Acquire semantics per se, so no extra instructions are necessary in order to obtain Consume or Acquire semantics, and load-Consume and load-Acquire degrade to compiler-only barriers which only prevent instruction re-ordering by the compiler.
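The "special case" is a load whose result is then dereferenced, so the dependent read is ordered by the data dependency itself. A hedged sketch in C11, with a hypothetical struct config and made-up function names:

```c
#include <stdatomic.h>
#include <stddef.h>

struct config { int baud; };

static struct config current;
static _Atomic(struct config *) published;   /* NULL until published */

/* Writer: fill in the object, then publish the pointer with release. */
static void publish_config(int baud)
{
    current.baud = baud;
    atomic_store_explicit(&published, &current, memory_order_release);
}

/* Reader: the read of p->baud depends on the loaded pointer value, which
   is the dependency case Consume was meant for. As discussed above, GCC
   simply treats this load as Acquire. */
static int read_baud(int fallback)
{
    struct config *p = atomic_load_explicit(&published, memory_order_consume);
    return p ? p->baud : fallback;
}
```

Writing memory_order_acquire instead is always correct and is what you get in practice anyway.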

EDIT:
Btw, here's an interesting article about Consume:
https://preshing.com/20140709/the-purpose-of-memory_order_consume-in-cpp11/

peter-h · « **Reply #260 on:** February 09, 2023, 05:08:50 pm »

I've been digging around memcpy lately (because it is used in the ST 32F4 ETH PHY low_level_* code) and having found the default Newlib memcpy does just one byte at a time, I switched to this one, which is also from Newlib but optimised for speed

Code: [Select]


#define BIGBLOCKSIZE    (sizeof (long) << 2)
#define LITTLEBLOCKSIZE (sizeof (long))
#define TOO_SMALL(LEN)  ((LEN) < BIGBLOCKSIZE)	/* Threshold for punting to the byte copier.  */
/* Nonzero if either X or Y is not aligned on a "long" boundary.  */
#define UNALIGNED(X, Y) (((long)(X) & (sizeof (long) - 1)) | ((long)(Y) & (sizeof (long) - 1)))

void * memcpy (void *__restrict dst0, const void *__restrict src0, size_t len0)
{

	char *dst = dst0;
	const char *src = src0;
	long *aligned_dst;
	const long *aligned_src;

	/* If the size is small, or either SRC or DST is unaligned,
	   then punt into the byte copy loop.  This should be rare.  */
	if (!TOO_SMALL(len0) && !UNALIGNED (src, dst))
	{
		aligned_dst = (long*)dst;
		aligned_src = (long*)src;

		/* Copy 4X long words at a time if possible.  */
		while (len0 >= BIGBLOCKSIZE)
		{
			*aligned_dst++ = *aligned_src++;
			*aligned_dst++ = *aligned_src++;
			*aligned_dst++ = *aligned_src++;
			*aligned_dst++ = *aligned_src++;
			len0 -= BIGBLOCKSIZE;
		}

		/* Copy one long word at a time if possible.  */
		while (len0 >= LITTLEBLOCKSIZE)
		{
			*aligned_dst++ = *aligned_src++;
			len0 -= LITTLEBLOCKSIZE;
		}

		/* Pick up any residual with a byte copier.  */
		dst = (char*)aligned_dst;
		src = (char*)aligned_src;
	}

	while (len0--)
		*dst++ = *src++;

	return dst0;

}

I am seeing a speedup of about 20% (time across the whole function, on packets of a few hundred bytes) which is a lot less than I would expect. Is that code really meaningful for an arm32? I can see what it is doing, I think. The typical exec time is 15us which is probably not worth optimising further (e.g. with DMA).

Could anyone also advise why the "const" is there in
const long *aligned_src;

Thank you.

NorthGuy · « **Reply #261 on:** February 09, 2023, 05:24:03 pm »

Quote from: peter-h on February 09, 2023, 05:08:50 pm

optimised for speed

Not by much. Agner Fog wrote a very good book on x86 optimizations, which includes lots of interesting ideas on memory/string routine optimizations. x86 is an out-of-order CPU, so it's not completely transferable to small ARM, but you can read the book, see what's involved, and write optimized routines if you have the desire and the time for it.

Siwastaja · « **Reply #262 on:** February 09, 2023, 05:29:36 pm »

Quote from: peter-h on February 09, 2023, 05:08:50 pm

Could anyone also advise why the "const" is there in
const long *aligned_src;

As the name suggests, const in C language communicates (both to the compiler, and human reader) that the thing is constant - it does not change.

Pretty obviously, given the purpose of memcpy, the source memory buffer is not changed, only destination is. The idea is to use the const keyword, so that if compiler detects someone is assigning to a const qualified object, it will be illegal so the compiler will raise an error and catch a dangerous bug.

Improve your code quality and make it a habit to always qualify things that are not supposed to change as const.

Note that C declarations are read from right to left. Here, the actual memory is qualified const. Pointer is not const, so the pointer can be changed (to point somewhere else, like the next word). If one wants to declare a const pointer to const memory, that would be:
const long * const aligned_src; or equivalently:
long const * const aligned_src;
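A short sketch of the difference, with the lines the compiler would reject left as comments. The function names are made up for illustration.

```c
#include <stddef.h>

/* Pointer to const data: the pointer may move, but the data may not be
   written through it. */
static long sum_longs(const long *p, size_t n)
{
    long total = 0;
    while (n--)
        total += *p++;      /* fine: changes the pointer, only reads data */
    /* *p = 0; */           /* would not compile: writes through const */
    return total;
}

/* Const pointer to const data: neither the pointer nor the data may change. */
static long first_long(const long *const p)
{
    /* p++; */              /* would not compile: p itself is const */
    return *p;
}
```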

peter-h · « **Reply #263 on:** February 09, 2023, 05:31:54 pm »

The biggest optimisation should be moving 32 bits at a time, which should be simply 4x faster, on a decent size block. The 32F4 has no data cache.

And I am using a -O3 attribute.

Well, unless this function is trying to do 32 bit unaligned moves, which is dumb, even though the 32F4 does support unaligned access.

So maybe my src or dest are unaligned, in which case the 32 bit loop is skipped, but that would be dumb code. The smart way is to move unaligned bytes first, then move 32 bits at a time, and then clean up with byte moves.

I ought to look at this
https://cboard.cprogramming.com/c-programming/154333-fast-memcpy-alternative-32-bit-embedded-processor-posted-just-fyi-fwiw.html
but would have expected a lot less code.
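The "bytes first, then words, then bytes" idea does not need much code. A minimal sketch, assuming non-overlapping buffers; copy_align_first is a made-up name, and the word move is written as a fixed-size memcpy so that the sketch stays well-defined C on any host (an optimizing compiler normally turns it into a single load/store pair):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static void *copy_align_first(void *dst0, const void *src0, size_t len)
{
    unsigned char *dst = dst0;
    const unsigned char *src = src0;

    /* Word moves only pay off if both pointers reach a 4-byte boundary
       at the same time, i.e. their low two address bits are equal. */
    if ((((uintptr_t)src ^ (uintptr_t)dst) & 3u) == 0) {
        /* Head: byte moves until src (and therefore dst) is aligned. */
        while (len > 0 && ((uintptr_t)src & 3u) != 0) {
            *dst++ = *src++;
            len--;
        }
        /* Body: one 32-bit word per iteration. */
        while (len >= 4) {
            memcpy(dst, src, 4);
            dst += 4;
            src += 4;
            len -= 4;
        }
    }

    /* Tail, or the whole buffer if the alignments differ: byte moves. */
    while (len--)
        *dst++ = *src++;

    return dst0;
}
```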

brucehoult · « **Reply #264 on:** February 09, 2023, 08:12:13 pm »

Quote from: NorthGuy on February 09, 2023, 05:24:03 pm

x86 is an out-of-order CPU, so it's not completely transferable to small ARM

x86 is an instruction set. Some implementations of x86 are out of order (ok, most these days), and some are not. This one, for example, is in-order:

https://www.digikey.co.nz/en/products/detail/rochester-electronics-llc/N80186/12122323

ARM is an instruction set (3 or 4 different ones, actually). Some implementations of ARM are out of order (A15 and A57+), and some are not.

OoO gives you a bit more latitude in scheduling your instructions, and especially in not having to unroll loops, but the biggest issue in optimisation is issue/execute width. There are 3-wide in-order CPUs (including from ARM). There are 2-wide microcontrollers.

peter-h · « **Reply #265 on:** February 09, 2023, 08:22:59 pm »

I've just realised that this isn't possible to do fully, because if you start with the two buffers being unaligned, but unaligned differently, then no number of single byte transfers is going to get you to a point where you could continue with 32 bit transfers. So if e.g. one starts at 0x10000001 and the other at 0x10000002, you have to do the whole copy in the byte mode.

One option might be to use the 32 bit mode anyway, because the 32F4 supports unaligned 32 bit transfers, until there are just 0-3 bytes left, and then continue in single byte mode. Then you automatically get aligned transfers whenever both buffers are aligned.

I also found that the four consecutive 32 bit moves were always getting optimised out, and only one was executed, and the compiler was decrementing the loop counter appropriately.

So I now have

Code: [Select]


#define BLOCKSIZE    (sizeof (long))
#define TOO_SMALL(LEN)  ((LEN) < BLOCKSIZE)	/* Threshold for punting to the byte copier.  */
/* Nonzero if either X or Y is not aligned on a "long" boundary.  */
#define UNALIGNED(X, Y) (((long)(X) & (sizeof (long) - 1)) | ((long)(Y) & (sizeof (long) - 1)))

__attribute__((optimize("O2")))
void * memcpy_fast (void *__restrict dst0, const void *__restrict src0, size_t len0)
{

	char *dst = dst0;
	const char *src = src0;
	long *aligned_dst;
	const long *aligned_src;

	/* If the size is small, or either SRC or DST is unaligned,
	   then punt into the byte copy loop.  This should be rare.  */
	if (!TOO_SMALL(len0) && !UNALIGNED (src, dst))
	{
		aligned_dst = (long*)dst;
		aligned_src = (long*)src;

		/* Copy long words if possible.  */
		while (len0 >= BLOCKSIZE)
		{
			*aligned_dst++ = *aligned_src++;
			len0 -= BLOCKSIZE;
		}

		/* Pick up any residual with a byte copier.  */
		dst = (char*)aligned_dst;
		src = (char*)aligned_src;
	}

	while (len0--)
		*dst++ = *src++;

	return dst0;

}

Can't get my head around how to do the pointer notation for my earlier proposal (always 32 bit until < 4 bytes left).

Nominal Animal · « **Reply #266 on:** February 09, 2023, 08:39:15 pm »

The proper test is actually ((((uintptr_t)src) ^ ((uintptr_t)dst)) & 3). It is zero whenever 32-bit aligned access is possible, and nonzero when the copy is inherently unaligned (i.e. whenever src is aligned, dst is unaligned, and vice versa).

When the above expression is zero, (((uintptr_t)src) & 3) tells you the misalignment: if it is nonzero, 4 minus that value is the number of leading bytes that need to be transferred before an aligned transfer is possible. If your MCU supports unaligned accesses, then doing a single unaligned 32-bit copy, followed by aligning src and dst to the next 32-bit boundary (so between one and three bytes will be transferred again), is likely to be more efficient than e.g. a Duff's Device bytewise copy of the initial bytes.

Similarly, if there are any trailing bytes, doing a single unaligned 32-bit transfer to copy the trailing bytes, is probably more efficient than Duff's Device bytewise copy or a trailing copy loop, whenever unaligned accesses are supported (because the extra cost tends to be single digit cycles).

Of course, if the unaligned access cost is low enough, then just explicitly aligning one (the destination, assuming that it will be more likely to be accessed next rather than the source, caching-wise) and letting the other be unaligned, is even better because the test/selection logic before the copy is done has a significant cost compared to typical memcpy() sizes.

For explicit calls where you know the pointers and the data are 32-bit aligned, you can shave off the extra test/selection logic cost, for example with an inline wrapper around your copy function that selects via
if (_Alignof(*src) >= 4 && _Alignof(*dst) >= 4 && !(n & 3))
memcpy32p((uint32_t *)dst, (const uint32_t *)src, (uint32_t *)dst + ((uint32_t)n / 4));
else
memcpy8p((unsigned char *)dst, (const unsigned char *)src, (unsigned char *)dst + n);
where the third parameter is the limit to dst, and not the number of units to copy.
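To make the head and tail trick concrete, here is a hedged sketch. All three function names are made up, buffers must not overlap, and the 4-byte moves are written as fixed-size memcpy calls so the code is valid on any host; on a core with unaligned access support they should compile down to plain word loads and stores.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero when src and dst can never be 32-bit aligned at the same time. */
static int inherently_unaligned(const void *dst, const void *src)
{
    return (((uintptr_t)src ^ (uintptr_t)dst) & 3u) != 0;
}

/* Number of bytes (0..3) before p reaches the next 4-byte boundary. */
static size_t leading_bytes(const void *p)
{
    return (size_t)(-(uintptr_t)p & 3u);
}

static void *copy_overlapping_words(void *dst0, const void *src0, size_t n)
{
    unsigned char *dst = dst0;
    const unsigned char *src = src0;

    if (n < 4 || inherently_unaligned(dst, src)) {
        while (n--)                         /* plain byte fallback */
            *dst++ = *src++;
        return dst0;
    }

    unsigned char *dst_end = dst + n;
    const unsigned char *src_end = src + n;

    /* Head: one possibly unaligned word, then step to the next boundary.
       Up to three of its bytes get copied again by the body. */
    size_t step = leading_bytes(src);
    if (step == 0)
        step = 4;
    memcpy(dst, src, 4);
    dst += step;
    src += step;

    /* Body: aligned words. */
    while ((size_t)(src_end - src) >= 4) {
        memcpy(dst, src, 4);
        dst += 4;
        src += 4;
    }

    /* Tail: one possibly unaligned word that ends exactly at the end. */
    if (src < src_end)
        memcpy(dst_end - 4, src_end - 4, 4);

    return dst0;
}
```

Rewriting a few bytes twice is harmless here because the source is never modified and the buffers do not overlap.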

eutectique · « **Reply #267 on:** February 09, 2023, 08:44:30 pm »

You can ensure 4-byte alignment of memory regions with __attribute__ ((aligned (4))).
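For example (buffer names and sizes are hypothetical), with the C11 spelling next to the GCC attribute:

```c
#include <stdint.h>

/* GCC/Clang attribute form. */
static uint8_t rx_buf[61] __attribute__((aligned(4)));

/* C11 standard spelling of the same request. */
static _Alignas(4) uint8_t tx_buf[61];

/* Returns nonzero if p sits on a 4-byte boundary. */
static int is_aligned4(const void *p)
{
    return ((uintptr_t)p & 3u) == 0;
}
```

This only fixes the start address of the object itself; a pointer into the middle of the buffer can still have any alignment.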

peter-h · « **Reply #268 on:** February 09, 2023, 09:07:21 pm »

Interesting.

Indeed I take care to align buffers, but in some cases (ones inside LWIP) I can't possibly know what the addresses will be if one is picking up some buffer part way through.

Quote

then just explicitly aligning one

I didn't think of that... one being aligned is better than neither.

But not checking for alignment seems to automatically yield the optimal solution - on the 32F4:

Code: [Select]


// Optimised memcpy. This is based on one in Newlib but specifically for the 32F4
// which does unaligned 32 bit transfers transparently. This avoids having to check
// buffer alignment, and 4-aligned buffers are automatically optimised internally.

__attribute__((optimize("O2")))
void * memcpy_fast (void *__restrict dst0, const void *__restrict src0, size_t len0)
{

	char *dst = dst0;
	const char *src = src0;
	uint32_t *aligned_dst;
	const uint32_t *aligned_src;

	// If the size is >=4 then do 32 bit moves, until exhausted

	if ( len0 >=4 )
	{
		aligned_dst = (uint32_t*)dst;
		aligned_src = (uint32_t*)src;

		while (len0 >= 4)
		{
			*aligned_dst++ = *aligned_src++;
			len0 -= 4;
		}

		dst = (char*)aligned_dst;
		src = (char*)aligned_src;
	}

	// Finish with any single byte moves

	while (len0--)
		*dst++ = *src++;

	return dst0;

}

I am just not 100% sure the above code cannot run off the end.
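On paper it cannot: 4 is only subtracted while len0 >= 4, so the count never wraps, and the byte loop then runs 0 to 3 times. A brute-force check with guard bytes is cheap to write anyway. The sketch below is an assumption-laden stand-in, not the code above: copy_words_then_bytes (made-up name) has the same loop structure, but does each word move through a fixed-size memcpy so it is valid C on any host. That form may also be the safer one on the target, since the direct uint32_t casts rely on the compiler only ever emitting instructions that tolerate unaligned addresses.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same loop structure as memcpy_fast, with portable word moves. */
static void *copy_words_then_bytes(void *dst0, const void *src0, size_t len)
{
    unsigned char *dst = dst0;
    const unsigned char *src = src0;

    while (len >= 4) {
        memcpy(dst, src, 4);
        dst += 4;
        src += 4;
        len -= 4;
    }
    while (len--)
        *dst++ = *src++;
    return dst0;
}

/* Returns 1 if every length 0..max_len at every start offset 0..3 copies
   exactly len bytes and leaves all other destination bytes untouched. */
static int copy_stays_in_bounds(size_t max_len)
{
    enum { GUARD = 8, CAP = 256 };
    unsigned char src[CAP + 2 * GUARD + 4];
    unsigned char dst[CAP + 2 * GUARD + 4];

    if (max_len > CAP)
        max_len = CAP;
    for (size_t i = 0; i < sizeof src; i++)
        src[i] = (unsigned char)(i & 0x7F);     /* never equals 0xA5 */

    for (size_t off = 0; off < 4; off++) {
        for (size_t len = 0; len <= max_len; len++) {
            memset(dst, 0xA5, sizeof dst);      /* guard pattern */
            copy_words_then_bytes(dst + GUARD + off, src + GUARD + off, len);
            if (memcmp(dst + GUARD + off, src + GUARD + off, len) != 0)
                return 0;
            for (size_t i = 0; i < sizeof dst; i++) {
                int inside = i >= GUARD + off && i < GUARD + off + len;
                if (!inside && dst[i] != 0xA5)
                    return 0;
            }
        }
    }
    return 1;
}
```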

I am now getting a 2x speedup over the byte version, on average.

I started on a DMA version but quickly remembered that it would be no good due to lack of CCM access. A lot of the time I can explicitly control this (like using DMA for all SPI transfers) but with LWIP, who knows. I have the RTOS stacks in CCM.

ataradov · « **Reply #269 on:** February 09, 2023, 09:41:51 pm »

Quote from: peter-h on February 09, 2023, 09:07:21 pm

I started on a DMA version but quickly remembered that it would be no good due to lack of CCM access.

You can check pointer ranges and figure out if both pointers are in a DMA region, then use DMA, otherwise use slower version.
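A sketch of such a range check. The CCM base and size below are assumptions for an STM32F407-class part and should be checked against the reference manual for the actual device; the function names are made up.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed memory map: 64 KB of CCM starting at 0x10000000. */
#define CCM_BASE 0x10000000u
#define CCM_SIZE 0x00010000u

/* True if [p, p+len) overlaps the CCM region, which the DMA cannot reach.
   Ignores wrap-around at the very top of the address space. */
static bool touches_ccm(const void *p, size_t len)
{
    uintptr_t start = (uintptr_t)p;
    uintptr_t end = start + len;                /* one past the last byte */
    return len != 0
        && start < (uintptr_t)CCM_BASE + CCM_SIZE
        && end > (uintptr_t)CCM_BASE;
}

/* Pick the DMA path only when neither buffer touches CCM. */
static bool dma_copy_possible(const void *dst, const void *src, size_t len)
{
    return !touches_ccm(dst, len) && !touches_ccm(src, len);
}
```

The check is a handful of compares, so it costs little next to the copy itself for packets of a few hundred bytes.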

brucehoult · « **Reply #270 on:** February 09, 2023, 09:51:09 pm »

Quote from: Nominal Animal on February 09, 2023, 08:39:15 pm

The proper test is actually ((((uintptr_t)src) ^ ((uintptr_t)dst)) & 3). It is zero whenever 32-bit aligned access is possible, and nonzero when the copy is inherently unaligned (i.e. whenever src is aligned, dst is unaligned, and vice versa).

When the above expression is zero, (((uintptr_t)src) & 3) tells you the misalignment: if it is nonzero, 4 minus that value is the number of leading bytes that need to be transferred before an aligned transfer is possible. If your MCU supports unaligned accesses, then doing a single unaligned 32-bit copy, followed by aligning src and dst to the next 32-bit boundary (so between one and three bytes will be transferred again), is likely to be more efficient than e.g. a Duff's Device bytewise copy of the initial bytes.

Even if your CPU supports unaligned accesses, using that capability might be slower than using a loop like the following (plus setup and cleanup of a few bytes at start and end). byteOffset is 0,1,2,3. Assuming little-endian.

Code: [Select]

uint32_t copyLoop(uint32_t *src, uint32_t *dst, int byteOffset, uint32_t *srcLimit, uint32_t dstBytes){
    int bitOffset = byteOffset<<3;
    while (src < srcLimit){
        uint32_t srcBytes = *src++;
        dstBytes |= srcBytes<<bitOffset;
        *dst++ = dstBytes;
        dstBytes = srcBytes >> (32-bitOffset);
    }
    return dstBytes;
}

On ARM the body of the loop is:

Code: [Select]

        rsb     rightShift, bitOffset, #32
loop:
        ldr     srcBytes, [src], #4
        orr     dstBytes, dstBytes, srcBytes, lsl bitOffset
        str     dstBytes, [dst], #4
        lsr     dstBytes, srcBytes, rightShift
        cmp     srcLimit, src
        bhi     loop

Of course you can unroll / Duff this as you wish, but it probably goes at close to full cache speed anyway.

NB: taking this literally is UB in C for the right shift with byteOffset=0. It works fine on ARM (with register source for the shift amount) but not on others such as x86 and RISC-V where a shift of 32 is a shift of 0. You can make a separate simple *dst++ = *src++ loop for the byteOffset==0 case.
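A sketch of the same loop with the byteOffset==0 case split out, so that no shift by 32 is ever evaluated. shift_copy is a made-up name; little-endian is assumed, as above.

```c
#include <stdint.h>

/* Copies aligned words from src to dst, shifting the byte stream by
   byteOffset (0..3) bytes. dstBytes carries the low bytes left over from
   before the loop; the return value is the carry left over after it. */
static uint32_t shift_copy(const uint32_t *src, uint32_t *dst,
                           unsigned byteOffset, const uint32_t *srcLimit,
                           uint32_t dstBytes)
{
    if (byteOffset == 0) {
        /* No shifting needed, and no shift by 32 to worry about. */
        while (src < srcLimit)
            *dst++ = *src++;
        return 0;                               /* nothing carried over */
    }

    unsigned bitOffset = byteOffset << 3;       /* 8, 16 or 24 */
    while (src < srcLimit) {
        uint32_t srcBytes = *src++;
        *dst++ = dstBytes | (srcBytes << bitOffset);
        dstBytes = srcBytes >> (32 - bitOffset);
    }
    return dstBytes;
}
```

With byteOffset restricted to 1..3 in the second loop, both shift amounts stay in the range 8..24, which is defined on every target.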

peter-h · « **Reply #271 on:** February 09, 2023, 10:13:46 pm »

Yes; I did that in some other code. Use DMA if not in CCM. 2 questions:

1) Would DMA beat aligned 32 bit memcpy? It might do because one can do the same "auto unaligned support" hack on DMA. One would still need to finish off with software copy unless blocksize is a multiple of 4.

2) Any gotchas for DMA stream for memory-memory? I remember reading that most of the DMA streams cannot be used.

It's also a good point that even in the case of both buffers being unaligned, and even if they are unaligned differently, it will be faster to copy single bytes until one of the buffers becomes aligned, and then 32 bit moves will run faster.

ataradov · « **Reply #272 on:** February 09, 2023, 11:19:19 pm »

DMA will only do aligned transfers. But DMA may be fast enough to just transfer bytes and not worry about further optimization. I think there are also smart features that would pack multiple bytes into a word, but I don't know the details, and don't know if this is something useful here.

No idea about limitations of the specific device, but it should not be too hard to figure out.

peter-h · « **Reply #273 on:** February 10, 2023, 07:15:48 am »

If DMA does aligned only (I didn't know that) it will struggle to beat a software copy which will benefit from alignment a lot of the time.

ataradov · « **Reply #274 on:** February 10, 2023, 07:18:07 am »

It has independent access to two interfaces. There is no scenario where the most optimized aligned version doing word transfers in the firmware beats simple byte transfers using DMA.

