Author Topic: GCC compiler optimisation (Read 42905 times)

ejeffrey · « **Reply #125 on:** August 13, 2021, 03:44:11 pm »

One more thing to look for: you say ISRs get removed if you add an infinite loop to main. If you are talking about ISR functions installed through a runtime function that makes sense. If you are talking about the ISRs in the default flash interrupt table, that shouldn't happen. The default ISR table is supposed to be marked as "keep" in the linker script so that it serves as a root for the --gc-sections option.

peter-h · « **Reply #126 on:** August 13, 2021, 03:53:02 pm »

What I meant is that the remainder of main.c would be stripped out.

But perhaps this is wrong. The compiler should strip out only the remainder of the function main().

However, that is not what happened. My program lost about 90% of its size

ataradov · « **Reply #127 on:** August 13, 2021, 04:29:41 pm »

I agree with everyone else, forget that -O0 exists. It is never useful in real life. This way when something "breaks" you will have a limited scope of changes since the last working build to review.

If you are already facing a project that is broken like this, then you need to start looking at the disassembly and check what was optimized.

Siwastaja · « **Reply #128 on:** August 13, 2021, 04:46:19 pm »

Debugging code made by others, including external "libraries" (I'm assuming you don't get object files + headers but actual source code because you talk about compiling it), is no different at all from debugging your own code. Use the same strategies.

If the compiler completely removes some code, this is only because the code does nothing. Look at the code, figure out what it is supposed to do, then fix it so that it does that.

Yes, this is tedious work (writing code is), but it is the only way. You can't fix it by trying different compiler settings, looking at disassembly and posting about the stupidity of all this on the forum.

Sometimes it's faster to write completely from scratch than to decipher and fix non-working codebases that are basically untested.

peter-h · « **Reply #129 on:** August 13, 2021, 07:10:35 pm »

I have been very carefully working through it. The bit which isn't working has been narrowed down but is still a lot of code, from ST mainly, and with lots of pointers and such. How do I select -O0 on a particular .c file only? Is there some attribute one can put in the start of the file?

ejeffrey · « **Reply #130 on:** August 13, 2021, 08:03:51 pm »

Quote from: peter-h on August 13, 2021, 03:53:02 pm

What I meant is that the remainder of main.c would be stripped out.

But perhaps this is wrong. The compiler should strip out only the remainder of the function main().

However, that is not what happened. My program lost about 90% of its size

If you use -ffunction-sections and --gc-sections (passed to the linker as -Wl,--gc-sections when linking through gcc) then each function will be placed in its own section, and the linker (not compiler!) will do a garbage collection pass to remove unreferenced functions. If you do this (and it is very standard on microcontroller builds) then adding an infinite loop in main will cause everything except the init code to be removed. Which is not a problem at all since that code will never be called. Looking at what code is removed is counterproductive. You should only look at where your code *behaves* improperly. Fix that, and either the rest of your code will "come back", or it isn't needed. Fixating on the "missing" code is 99% of the time a red herring.
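To make the compiler/linker split concrete, here is a minimal sketch. The function names are made up for illustration, and the build lines in the comment assume GCC with the GNU linker; check the map file of your own build rather than taking this as a rule.

```c
/* Build sketch, assuming GCC + GNU ld (file names are hypothetical):
 *   gcc -Os -ffunction-sections -fdata-sections -c demo.c
 *   final link: add -Wl,--gc-sections -Wl,-Map=firmware.map
 * With -ffunction-sections each function lands in its own .text.<name>
 * section, so the linker can discard sections that nothing refers to.
 * The map file then lists what was discarded.
 */
#include <stdint.h>

/* Reachable from process_sample() below, so its section is kept. */
uint32_t scale_reading(uint32_t raw)
{
    return raw * 3u + 1u;
}

/* Referenced by nothing in the firmware: a candidate for removal at link time. */
uint32_t unused_helper(uint32_t raw)
{
    return raw ^ 0xA5A5A5A5u;
}

/* Hypothetical caller standing in for code reachable from the entry point. */
uint32_t process_sample(uint32_t raw)
{
    return scale_reading(raw);
}
```

If main() ends in an infinite loop before anything calls process_sample(), then process_sample() and scale_reading() become unreferenced too, which is how a large part of an image can legitimately disappear.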

langwadt · « **Reply #131 on:** August 13, 2021, 08:14:37 pm »

if code written by hammering the keyboard and "fixing it" with chewing gum and duct tape doesn't work it is obviously the libraries or the compilers fault

peter-h · « **Reply #132 on:** August 13, 2021, 08:49:38 pm »

I narrowed it down to a couple of files, and setting both to -O0 makes the whole thing run ok. One of them is FatFS and the other is a load of mainly ST code for serial FLASH and SPI. I don't think the root issue is in FatFS - partly because lots of people use it, and partly because the problem also shows up when trying to format the block device from Windows and that doesn't use FatFS (which merely enables internal code to see the filesystem); it uses just the latter stuff (ST mostly).

Unfortunately, in ST Cube at least, -Og makes it very hard to do debugging, because literally half the variables cannot be viewed - because they have been moved to registers. One can switch to register mode and by reference to the disassembly listing it is usually possible to see the value of the variable, but it's quite clumsy. I wonder if this is debugger dependent? I am using STLINK V2 & V3, and have a Segger Edu kicking around somewhere. Unfortunately there are also many cases where I pass the address of a buffer to a function, and cannot view its contents because it also says "optimised out".

On the plus side, it looks like all the code I have written myself is working fine with -Og

In Cube, one can select an -O level on a per-file basis, simply by right-clicking on that file and going to Properties. A tiny symbol appears next to the file, which disappears if you build the whole project with the same -O level as that file (so you have to be fairly careful there).
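Outside the IDE, GCC also has a per-function control, which covers the earlier question about an attribute. A minimal sketch, assuming GCC; the macro and function names are made up, and other compilers simply get an empty macro here. The GCC manual describes the optimize attribute as a debugging aid rather than something for production code.

```c
#include <stdint.h>

/* NO_OPT is a hypothetical helper macro: on GCC it asks for -O0 on one
 * function, elsewhere it expands to nothing. */
#if defined(__GNUC__) && !defined(__clang__)
#define NO_OPT __attribute__((optimize("O0")))
#else
#define NO_OPT
#endif

/* Compiled at -O0 under GCC, whatever -O level the rest of the file uses,
 * so locals such as sum and i stay visible in the debugger. */
NO_OPT
uint32_t checksum_unoptimised(const uint8_t *data, uint32_t len)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < len; i++) {
        sum += data[i];
    }
    return sum;
}

/* For a whole file, GCC also accepts this line near the top of the .c file:
 *   #pragma GCC optimize ("O0")
 * It applies to the functions defined after it. */
```

This narrows the search the same way the per-file setting does, just at function granularity.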

It's been a great learning experience - thank you all

westfw · « **Reply #133 on:** August 13, 2021, 10:03:40 pm »

Quote

I have a load of code, most of it 3rd party libs, no support on any of it of course... :
The bit which isn't working has been narrowed down but is still a lot of code, from ST mainly

Surely code from ST has SOME support?

Quote

demonstrate GCC optimising out a straight read of 0x08000000
Yes: https://godbolt.org/z/he6avac4E

That example has:

Code: [Select]

void f()
{
    char buffer[512];
    memcpy(buffer, (char*)0x08000000, 512);
}

If I change that code to have "volatile char buffer[512];", the copy should no longer be optimized away, right?
But it looks like it is (using gcc 11 and higher. gcc 10 creates actual code...) ARM gcc behaves similarly...

SiliconWizard · « **Reply #134 on:** August 13, 2021, 10:18:05 pm »

Quote from: westfw on August 13, 2021, 10:03:40 pm

Code: [Select]

void f()
{
    char buffer[512];
    memcpy(buffer, (char*)0x08000000, 512);
}

If I change that code to have "volatile char buffer[512];", the copy should no longer be optimized away, right?
But it looks like it is (using gcc 11 and higher. gcc 10 creates actual code...) ARM gcc behaves similarly...

Well, no.
As we talked about way earlier in the thread, the memcpy() function itself doesn't have volatile-qualified parameters.
So when you're passing a volatile [] to memcpy(), the parameter is converted to a pointer to non-volatile. So it doesn't make a difference. That's why I said earlier that the only way of solving this would be to write your own memcpy() function.

If you assign stuff to the local buffer array though within this function, it will make the difference you expect.
As in:

Code: [Select]

void f()
{
    volatile char buffer[512];
    memcpy(buffer, (char*)0x08000000, 512); // still pruned
    buffer[0] = 1; // not pruned
}

Interestingly, if you swap the destination and the source, then the memcpy() doesn't get pruned, even with no volatile qualifier:

Code: [Select]

void f()
{
    char buffer[512];
    memcpy((char*)0x08000000, buffer, 512);
}

The compiler probably allows itself to make more assumptions for objects that it knows about than for objects that it doesn't (here, an absolute address). Playing a bit with this will exhibit even funnier stuff...

Of course don't take all this as rules - this is just implementation-dependent behavior. Use volatile as needed.
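The "write your own memcpy()" route can be sketched like this. The function names are made up; the point is only that the parameters keep the volatile qualifier, so each access inside the loop is a volatile access.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy n bytes into a volatile destination. Every store goes through a
 * volatile-qualified lvalue, so the compiler has to emit each one. */
void copy_to_volatile(volatile uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];
    }
}

/* Copy n bytes out of a volatile source, e.g. a memory-mapped region.
 * Here every load is a volatile access. */
void copy_from_volatile(uint8_t *dst, const volatile uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];
    }
}
```

With the volatile buffer from the example, the call would be copy_to_volatile(buffer, (const uint8_t *)0x08000000, 512). It is slower than the built-in memcpy(), since the compiler can no longer merge byte accesses into wider ones.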

westfw · « **Reply #135 on:** August 13, 2021, 11:40:15 pm »

BTW, I've completely lost track of what we're talking about.
The OP seemed to be about bootloaders and whether bootloaded code should contain startup code (yes, it should! My philosophy is that code that is bootloaded should look exactly like code that is used without a bootloader, except for its position. It should still have the initial SP and PC and the rest of the vectors, and still have all of the stuff that happens before main() is called. (You're not still having the bootloader go directly to your application main() without doing the initialized variable copy and stuff, are you? That could explain some issues!))

Then we diverted to a discussion of compiler optimization and behavior of "volatile."
There is code that is "90% smaller with optimization" and code which "doesn't work", but it's not clear that they're the same code...

Quote

-Og makes it very hard to do debugging, because literally half the variables cannot be viewed - because they have been moved to registers.

If you're looking for code that has been completely omitted, you'd be debugging code flow rather than variable content, wouldn't you?
If what you think are global ram variables cannot be viewed, that's a pretty substantial clue right there...

Quote

wanted 500/month for ongoing "support" which is way too much for what will be needed (maybe 1hr/month).

That actually doesn't seem all that unreasonable. 1hr of actual work in a month probably means another hour or two worth of re-familiarizing yourself with code and "interfacing with customer", and per-hour charges in the $300 range aren't uncommon (although, you said Pounds, I think...)
https://www.fullstacklabs.co/blog/software-development-price-guide-hourly-rate-comparison

westfw · « **Reply #136 on:** August 13, 2021, 11:58:33 pm »

Quote

As we talked about way earlier in the thread, the memcpy() function itself doesn't have volatile-qualified parameters.

Hmm. But it's only because the compiler has internal knowledge of what memcpy() is supposed to do. Normally it would just be an external function with unknown side-effects, and it would be called with the given arguments. If the compiler is going to treat it like a "language feature" instead of a library function, it should take into account the additional semantics that it is (should be) aware of. (and it did so up until gcc 11, apparently, even when it used inline code for the copy...)
(I guess this is a fine example of "I don't like it, so I think it's wrong.")

SiliconWizard · « **Reply #137 on:** August 14, 2021, 12:19:22 am »

Quote from: westfw on August 13, 2021, 11:58:33 pm

Quote
As we talked about way earlier in the thread, the memcpy() function itself doesn't have volatile-qualified parameters.
Hmm. But it's only because the compiler has internal knowledge of what memcpy() is supposed to do.

Of course, technically this is why. This knowledge allows compilers to generate pretty efficient inline code for those functions, tailored to the task at hand.

Quote from: westfw on August 13, 2021, 11:58:33 pm

Normally it would just be an external function with unknown side-effects, and it would be called with the given arguments.

Well, if the compiler has no knowledge of the possible side-effects of a given function, of course it must call it. As it happens though, modern compilers have internal knowledge of the most common (if not all? I don't know for sure) functions from the C std library, which allows them to implement more efficient code. There are tons of examples. One with the printf() function: called with a string without format specifiers (when the compiler can statically see this), GCC (and probably Clang) will just call puts() instead.

I admit this can be confusing to many. The C std library becomes an integral part of the compiler. While this seems reasonable, this can lead to misconceptions and actual issues, especially on systems for which the C std lib is implemented in dynamic libraries, in which case, for a given C std lib function call, you may end up either with an inline, compiler-dependent version, or with a call to an export in a dynamic library, possibly implementing the same function in a slightly different way...

bson · « **Reply #138 on:** August 14, 2021, 12:59:35 am »

Quote from: westfw on August 13, 2021, 10:03:40 pm

That example has:

Code: [Select]

void f()
{
    char buffer[512];
    memcpy(buffer, (char*)0x08000000, 512);
}

If I change that code to have "volatile char buffer[512];", the copy should no longer be optimized away, right?
But it looks like it is (using gcc 11 and higher. gcc 10 creates actual code...) ARM gcc behaves similarly...

Probably some confusion over auto variables vs volatile. It knows buffer is never used, so it and any code that is used to compute or initialize it can be removed. The fact that it's on the stack means the compiler sees its entire lifespan, from function entry to exit, and knows it's never used. Volatile should override that and make it non-removable, but it's such an odd thing to have local volatile variables that it's not surprising if there's debate over whether it can be optimized out of existence or not; it's really more about literal implementation of the language spec.

peter-h · « **Reply #139 on:** August 14, 2021, 06:25:21 am »

I think there are multiple things going on in my project, and probably in a couple of files. I did some very careful debugging. Compiled it all with -O0, formatted the disk, and placed a file on it, and verified it. It has a CRC on the end. Then from the inside (via FatFS) read it to check the CRC. All good. For those who like to see code, this is the CRC func

Code: [Select]



/*
*
* CRC-32/JAMCRC. This is a "rolling" algorithm. Invert the *final* result for ISO-HDLC.
* Returns 0x340bc6d9 from "123456789" (9 bytes). Inverting this (at the end) gives 0xcbf43926.
* Polynomial is 0x04c11db7 and it holds it backwards (0xEDB88320).
* This version accepts one byte at a time, and maintains the CRC in crcvalue. This makes it suitable
* for calculating a CRC across a number of data blocks.
* Speed is approx 400kbytes/sec.
* crcvalue must be initialised to 0xffffffff by caller, and holds the accumulated CRC as you go along
* See e.g. [url]https://crccalc.com/[/url] [url]https://www.lammertbies.nl/comm/info/crc-calculation[/url]
* [url]https://reveng.sourceforge.io/crc-catalogue/17plus.htm#crc.cat-bits.32[/url]
*
*/


void crc32(uint8_t input_byte, uint32_t *crcvalue)
{
      for (uint32_t j=0; j<8; j++)
      {
    	  uint32_t mask = (input_byte^*crcvalue) & 1;
    	  *crcvalue>>=1;
    	  if(mask)
    		  *crcvalue=*crcvalue^0xEDB88320;
    	  input_byte>>=1;
      }
}



/*
 *
 * Check CRC, stored in last 4 bytes of a file.
 * For speed, reads into a 512 byte buffer.
 * This code is tricky, due to CRC potentially split across buffer boundaries.
 * crcinit is normally initialised to 0xffffffff by the caller.
 * No upper limit on file size. Minimum size = 5 bytes.
 * Returns true if CRC is good, false otherwise (including if file doesn't exist)
 * Exec time for 1MB: 15s.
 *
 */

bool filecrc(char * filename, uint32_t crcinit)
{

	FILINFO fno;

	uint32_t crc=crcinit;
	uint32_t filecrc=0;
	uint32_t offset=0;
	int32_t bytesleft;
	uint8_t pagebuf[512];
	uint32_t numread=0;

	if ( KDE_get_file_properties ( filename, &fno ) == false ) return (false);

	bytesleft=fno.fsize;

	if ( bytesleft<5 ) return (false);

	do
	{
		// Read up to 512 bytes into pagebuf
		if ( KDE_file_read( filename, offset, 512, pagebuf, &numread) == false )
			return (false);
		offset+=512;
		bytesleft-=numread;

		if ((numread==512) && (bytesleft>=4))
		// Most common case: 512 bytes read, not the last page, and no CRC bytes yet
		{
			for (int i=0; i<512; i++)
			{
				crc32(pagebuf[i], &crc);
			}
		}
		else
		{
			if ((numread<=512) && (bytesleft==0))
			// Last page, and contains entire CRC
			{
				for (int i=0; i<(numread-4); i++)
				{
					crc32(pagebuf[i], &crc);
				}
				filecrc=pagebuf[numread-4]|(pagebuf[numread-3]<<8)|(pagebuf[numread-2]<<16)|(pagebuf[numread-1]<<24);
			}
			else
			{
				// CRC is split (buffer contains only some of it). Calc CRC over all data in buffer
				for (int i=0; i<(512+bytesleft-4); i++)
				{
					crc32(pagebuf[i], &crc);
				}
				// read just CRC bytes
				if ( KDE_file_read( filename, offset+bytesleft-4, 512, pagebuf, &numread) == false )
					return (false);
				filecrc=pagebuf[0]|(pagebuf[1]<<8)|(pagebuf[2]<<16)|(pagebuf[3]<<24);
				bytesleft=0;	// ensure end of loop
			}
		}
	}
	while ( bytesleft > 0 );

	crc=~crc;  // invert JAMCRC to HDLC CRC which the file has on the end
	if ( crc==filecrc ) return (true);

	return (false);
}

Then I compiled the whole thing with -Og. The CRC calculation fails. There is nothing obviously wrong with the data, and ideally I could spend time repeating this with a small file, say 30 bytes, to narrow it down (the test file is 1MB). So definitely one problem here.
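One way to rule the CRC routine itself in or out is to run it on a PC against the standard check string. The sketch below repeats the bitwise routine from above under a different name (crc32_update and jamcrc are made-up names) so it can be built on its own. The constants to compare against are the catalogue check values for "123456789": 0x340BC6D9 for CRC-32/JAMCRC, and its bitwise inverse 0xCBF43926 for CRC-32/ISO-HDLC.

```c
#include <stddef.h>
#include <stdint.h>

/* Same bitwise, reflected CRC-32 update as the routine above: one input
 * byte per call, polynomial 0xEDB88320, running value held in *crcvalue. */
void crc32_update(uint8_t input_byte, uint32_t *crcvalue)
{
    for (uint32_t j = 0; j < 8; j++) {
        uint32_t mask = (input_byte ^ *crcvalue) & 1u;
        *crcvalue >>= 1;
        if (mask) {
            *crcvalue ^= 0xEDB88320u;
        }
        input_byte >>= 1;
    }
}

/* Run the update over a buffer, starting from 0xFFFFFFFF, with no final
 * inversion (that is the JAMCRC variant). */
uint32_t jamcrc(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc32_update(data[i], &crc);
    }
    return crc;
}
```

Building this at both -O0 and -Og and comparing the results would show whether the CRC function is affected on its own, or whether the bad CRC comes from the data handed to it.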

But even before the CRC fails, if FatFS is also compiled with -Og, FatFS cannot find the file on the disk! So I stepped through the code, and narrowed it down to somewhere deep in FatFS returning false - around here

Code: [Select]


/*-----------------------------------------------------------------------*/
/* Directory handling - Find an object in the directory                  */
/*-----------------------------------------------------------------------*/

static
FRESULT dir_find (	/* FR_OK(0):succeeded, !=0:error */
	DIR* dp			/* Pointer to the directory object with the file name */
)
{
	FRESULT res;
	FATFS *fs = dp->obj.fs;
	BYTE c;
#if _USE_LFN != 0
	BYTE a, ord, sum;
#endif

	res = dir_sdi(dp, 0);			/* Rewind directory object */
	if (res != FR_OK) return res;
#if _FS_EXFAT
	if (fs->fs_type == FS_EXFAT) {	/* On the exFAT volume */
		BYTE nc;
		UINT di, ni;
		WORD hash = xname_sum(fs->lfnbuf);		/* Hash value of the name to find */

		while ((res = dir_read(dp, 0)) == FR_OK) {	/* Read an item */
#if _MAX_LFN < 255
			if (fs->dirbuf[XDIR_NumName] > _MAX_LFN) continue;			/* Skip comparison if inaccessible object name */
#endif
			if (ld_word(fs->dirbuf + XDIR_NameHash) != hash) continue;	/* Skip comparison if hash mismatched */
			for (nc = fs->dirbuf[XDIR_NumName], di = SZDIRE * 2, ni = 0; nc; nc--, di += 2, ni++) {	/* Compare the name */
				if ((di % SZDIRE) == 0) di += 2;
				if (ff_wtoupper(ld_word(fs->dirbuf + di)) != ff_wtoupper(fs->lfnbuf[ni])) break;
			}
			if (nc == 0 && !fs->lfnbuf[ni]) break;	/* Name matched? */
		}
		return res;
	}
#endif
	/* On the FAT12/16/32 volume */
#if _USE_LFN != 0
	ord = sum = 0xFF; dp->blk_ofs = 0xFFFFFFFF;	/* Reset LFN sequence */
#endif
	do {
		res = move_window(fs, dp->sect);
		if (res != FR_OK) break;
		c = dp->dir[DIR_Name];
		if (c == 0) { res = FR_NO_FILE; break; }	/* Reached to end of table */
#if _USE_LFN != 0	/* LFN configuration */
		dp->obj.attr = a = dp->dir[DIR_Attr] & AM_MASK;
		if (c == DDEM || ((a & AM_VOL) && a != AM_LFN)) {	/* An entry without valid data */
			ord = 0xFF; dp->blk_ofs = 0xFFFFFFFF;	/* Reset LFN sequence */
		} else {
			if (a == AM_LFN) {			/* An LFN entry is found */
				if (!(dp->fn[NSFLAG] & NS_NOLFN)) {
					if (c & LLEF) {		/* Is it start of LFN sequence? */
						sum = dp->dir[LDIR_Chksum];
						c &= (BYTE)~LLEF; ord = c;	/* LFN start order */
						dp->blk_ofs = dp->dptr;	/* Start offset of LFN */
					}
					/* Check validity of the LFN entry and compare it with given name */
					ord = (c == ord && sum == dp->dir[LDIR_Chksum] && cmp_lfn(fs->lfnbuf, dp->dir)) ? ord - 1 : 0xFF;
				}
			} else {					/* An SFN entry is found */
				if (!ord && sum == sum_sfn(dp->dir)) break;	/* LFN matched? */
				if (!(dp->fn[NSFLAG] & NS_LOSS) && !mem_cmp(dp->dir, dp->fn, 11)) break;	/* SFN matched? */
				ord = 0xFF; dp->blk_ofs = 0xFFFFFFFF;	/* Reset LFN sequence */
			}
		}
#else		/* Non LFN configuration */
		dp->obj.attr = dp->dir[DIR_Attr] & AM_MASK;
		if (!(dp->dir[DIR_Attr] & AM_VOL) && !mem_cmp(dp->dir, dp->fn, 11)) break;	/* Is it a valid entry? */
#endif
		res = dir_next(dp, 0);	/* Next entry */
	} while (res == FR_OK);

	return res;
}

Unfortunately with -Og half the variables are "optimised out" so difficult to work out what is what.

The "unable to format" from Windows is a different issue however. This is nothing to do with FatFS. It is probably low down in the ST stuff like this

Code: [Select]


HAL_StatusTypeDef B_HAL_SPI_Transmit(SPI_HandleTypeDef *hspi, uint8_t *pData, uint16_t Size, uint32_t Timeout)
{
//  uint32_t tickstart;
  HAL_StatusTypeDef errorcode = HAL_OK;
  uint16_t initial_TxXferCount;

  initial_TxXferCount = Size;

  if (hspi->State != HAL_SPI_STATE_READY)
  {
    errorcode = HAL_BUSY;
    goto error;
  }

  if ((pData == NULL) || (Size == 0U))
  {
    errorcode = HAL_ERROR;
    goto error;
  }

  /* Set the transaction information */
  hspi->State       = HAL_SPI_STATE_BUSY_TX;
  hspi->ErrorCode   = HAL_SPI_ERROR_NONE;
  hspi->pTxBuffPtr  = (uint8_t *)pData;
  hspi->TxXferSize  = Size;
  hspi->TxXferCount = Size;

  /*Init field not used in handle to zero */
  hspi->pRxBuffPtr  = (uint8_t *)NULL;
  hspi->RxXferSize  = 0U;
  hspi->RxXferCount = 0U;
  hspi->TxISR       = NULL;
  hspi->RxISR       = NULL;

  /* Configure communication direction : 1Line */
  if (hspi->Init.Direction == SPI_DIRECTION_1LINE)
  {
    SPI_1LINE_TX(hspi);
  }

  /* Check if the SPI is already enabled */
  if ((hspi->Instance->CR1 & SPI_CR1_SPE) != SPI_CR1_SPE)
  {
    /* Enable SPI peripheral */
    __HAL_SPI_ENABLE(hspi);
  }

  /* Transmit data in 8 Bit mode */

    if ((hspi->Init.Mode == SPI_MODE_SLAVE) || (initial_TxXferCount == 0x01U))
    {
      *((__IO uint8_t *)&hspi->Instance->DR) = (*hspi->pTxBuffPtr);
      hspi->pTxBuffPtr += sizeof(uint8_t);
      hspi->TxXferCount--;
    }
    while (hspi->TxXferCount > 0U)
    {
      /* Wait until TXE flag is set to send data */
      if (__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_TXE))
      {
        *((__IO uint8_t *)&hspi->Instance->DR) = (*hspi->pTxBuffPtr);
        hspi->pTxBuffPtr += sizeof(uint8_t);
        hspi->TxXferCount--;
      }
    }


  /* Clear overrun flag in 2 Lines communication mode because received is not read */
  if (hspi->Init.Direction == SPI_DIRECTION_2LINES)
  {
    __HAL_SPI_CLEAR_OVRFLAG(hspi);
  }

  if (hspi->ErrorCode != HAL_SPI_ERROR_NONE)
  {
    errorcode = HAL_ERROR;
  }

error:
  hspi->State = HAL_SPI_STATE_READY;
  return errorcode;
}

but there is something interesting in there which may be a clue, although understanding the code is above my pay grade. It is this

Code: [Select]

*((__IO uint8_t *)&hspi->Instance->DR) = (*hspi->pTxBuffPtr);
__IO is #defined as volatile, but DR itself is also declared that way in the ST .h files. They appear to be creating a pointer to the hspi structure, whose member DR is the SPI data register. Yet the vast majority of references to hspi are not "volatile" qualified, and I don't know why they should be. OTOH the compiler sees hspi whole and may decide to prune unreferenced members (of which there are plenty; ST love a structure for absolutely everything). It could then re-pack it, but that should still work, eh?
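A note on that line: the cast appears to be there to force a byte-wide store to the data register, and volatile on the pointed-to type is what stops the store being dropped or merged. The handle itself is ordinary RAM. The compiler also does not remove or re-pack struct members, because the layout is fixed by the ABI. A host-runnable sketch with a made-up register block (fake_spi_regs_t is not an ST type, and the helper names are hypothetical):

```c
#include <stddef.h>
#include <stdint.h>

/* Made-up register block for illustration; not the real ST SPI_TypeDef. */
typedef struct {
    volatile uint32_t CR1;
    volatile uint32_t SR;
    volatile uint32_t DR;
} fake_spi_regs_t;

/* Byte-wide volatile store to the data register, in the style of the HAL line. */
void write_data_byte(fake_spi_regs_t *regs, uint8_t value)
{
    *((volatile uint8_t *)&regs->DR) = value;
}

/* Byte-wide volatile load from the same address. */
uint8_t read_data_byte(fake_spi_regs_t *regs)
{
    return *((volatile uint8_t *)&regs->DR);
}

/* Member offsets are compile-time constants, whether or not a member is used. */
size_t data_register_offset(void)
{
    return offsetof(fake_spi_regs_t, DR);
}
```

On real hardware the register block sits at a fixed peripheral address, so any re-packing would break every access; that alone is a reason to look elsewhere for the bug.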

Funnily enough FatFS has its own private memcpy. They call it mem_cpy. And others. Maybe they know something:

Code: [Select]

/*-----------------------------------------------------------------------*/
/* String functions                                                      */
/*-----------------------------------------------------------------------*/

/* Copy memory to memory */
static
void mem_cpy (void* dst, const void* src, UINT cnt) {
	BYTE *d = (BYTE*)dst;
	const BYTE *s = (const BYTE*)src;

	if (cnt) {
		do {
			*d++ = *s++;
		} while (--cnt);
	}
}

/* Fill memory block */
static
void mem_set (void* dst, int val, UINT cnt) {
	BYTE *d = (BYTE*)dst;

	do {
		*d++ = (BYTE)val;
	} while (--cnt);
}

/* Compare memory block */
static
int mem_cmp (const void* dst, const void* src, UINT cnt) {	/* ZR:same, NZ:different */
	const BYTE *d = (const BYTE *)dst, *s = (const BYTE *)src;
	int r = 0;

	do {
		r = *d++ - *s++;
	} while (--cnt && r == 0);

	return r;
}

/* Check if chr is contained in the string */
static
int chk_chr (const char* str, int chr) {	/* NZ:contained, ZR:not contained */
	while (*str && *str != chr) str++;
	return *str;
}

gf · « **Reply #140 on:** August 14, 2021, 07:07:16 am »

Quote from: bson on August 14, 2021, 12:59:35 am

Quote from: westfw on August 13, 2021, 10:03:40 pm
That example has:

Code: [Select]

void f()
{
    char buffer[512];
    memcpy(buffer, (char*)0x08000000, 512);
}

If I change that code to have "volatile char buffer[512];", the copy should no longer be optimized away, right?
But it looks like it is (using gcc 11 and higher. gcc 10 creates actual code...) ARM gcc behaves similarly...
Probably some confusion over auto variables vs volatile. It knows buffer is never used, so it and any code that is used to compute or initialize it can be removed. The fact that it's on the stack means the compiler sees its entire lifespan, from function entry to exit, and knows it's never used. Volatile should override that and make it non-removable, but it's such an odd thing to have local volatile variables that it's not surprising if there's debate over whether it can be optimized out of existence or not; it's really more about literal implementation of the language spec.

The volatile qualifier has virtually no meaning for the variable itself.
It only has an effect when a memory location (i.e. either the address of a variable, or a pointer) is dereferenced in the code flow.
If the dereferenced address (pointer) refers to a volatile memory location, then the load/store to the memory location cannot be optimized out.

Examples:

Code: [Select]

volatile int a;
a = 5;         // <== volatile memory store, because &a is a pointer to a volatile memory location

Code: [Select]

int a;
volatile int *__tmp = &a;
*__tmp = 5;    // <== volatile memory store, because __tmp is a pointer to a volatile memory location

Code: [Select]

volatile int a;
int *p = (int*)&a;
*p = 5;        // <== NOT a volatile memory store, although it assigns 5 to a, and a was declared volatile

The effect of volatile is only that the instructions for a = 5 and *__tmp = 5 cannot be eliminated by the optimizer ¹⁾.
Retaining the variable a is just the consequence of not eliminating a = 5 in the first place, because a is used, then.
But if a were not used, it still can be eliminated, regardless whether qualified volatile or not.

¹⁾ In addition, the standard also prohibits re-ordering - for details please refer directly to the standard.

ataradov · « **Reply #141 on:** August 14, 2021, 07:12:15 am »

Don't look at FatFS, it is used by so many people that it is guaranteed to work with any level of optimizations. Look closer at the difference in behaviour of low level functions that actually read and write sectors.

westfw · « **Reply #142 on:** August 14, 2021, 08:10:56 am »

Quote

The fact that it's on the stack means the compiler sees its entire lifespan, from function entry to exit, and knows it's never used.

Hmm. An interesting theory.
If I change the buffer to "static volatile", then gcc will produce code to do the copy. But clang doesn't. :-)
I like the "memcpy() doesn't know about volatile" explanation better, technically.

It's nice that I can turn off the "builtin" functions; can I do that on a per-call basis?
I guess I can use "__builtin_memcpy()" if I want to use the "optimized, built-in" version and have builtins otherwise turned off. What about the other way around? (Hmm. Clang seems to have a function attribute __attribute__((no_builtin("memcpy"))) that is supposed to last for the scope of the function with the attribute, but ... it doesn't seem to work on the example we're using :-( ) It must be deciding that buffer is useless even if it IS volatile and static.
https://clang.llvm.org/docs/AttributeReference.html#no-builtin
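For the per-call question, one approach sometimes used is to call through a volatile function pointer, so the compiler cannot assume what is being called at that site and has to emit a real call. A sketch with made-up names (memcpy_ptr, memcpy_no_builtin); whether it does what you want in a given build is something to confirm in the generated code.

```c
#include <stddef.h>
#include <string.h>

/* The pointer object is volatile, so it must be loaded at run time and the
 * compiler cannot substitute its built-in expansion of memcpy for the call. */
static void *(*const volatile memcpy_ptr)(void *, const void *, size_t) = memcpy;

/* Wrapper for the call sites where a real library call is wanted;
 * ordinary memcpy() calls elsewhere keep their built-in treatment. */
void *memcpy_no_builtin(void *dst, const void *src, size_t n)
{
    return memcpy_ptr(dst, src, n);
}
```

The cost is an indirect call at each such site, so it only makes sense for the handful of copies that must not be elided.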

peter-h · « **Reply #143 on:** August 14, 2021, 08:17:17 am »

"Look closer at the difference in behaviour of low level functions that actually read and write sectors."

Indeed, and the disk format failure is a clue (FatFS is not used). I posted the ST code for the SPI stuff above. But debugging the optimised version is just too hard.

ataradov · « **Reply #144 on:** August 14, 2021, 08:47:38 am »

Quote from: peter-h on August 14, 2021, 08:17:17 am

I posted the ST code for the SPI stuff above.

That is just the transmit part. And it is way too low level. You need to look at the whole sector read, including setting of the address.

And then you need to look at how your chip select is formed for SPI. Your optimized code may be too fast and it breaks setup time for the CS before data transfer. And your SPI code may be too fast that your memory can't handle it.

Find the call that FatFS does to read a sector and start from that, not just the lowest level code.

peter-h · « **Reply #145 on:** August 14, 2021, 09:28:36 am »

These are the sector I/O

Code: [Select]

// Write a number of bytes to a single page through the buffer with built-in erase.

bool AT45dbxx_WritePage(
	const uint8_t *data,	// In	Data to write
	uint16_t len,			// In	Length of data to write (in bytes)
	uint32_t page			// In	linear address, a multiple of 512
) {

	HAL_StatusTypeDef status = HAL_OK;

	if (len==0) return (status == HAL_OK);

	page = page << AT45dbxx.Shift;
	at45dbxx_resume();
	at45dbxx_wait_busy();
	B_HAL_GPIO_WritePin(_45DBXX_CS_GPIO, _45DBXX_CS_PIN, GPIO_PIN_RESET);
	at45dbxx_tx_rx_byte(AT45DB_MNTHRUBF1);
	at45dbxx_tx_rx_byte((page >> 16) & 0xff);
	at45dbxx_tx_rx_byte((page >> 8) & 0xff);
	at45dbxx_tx_rx_byte(page & 0xff);
	status = B_HAL_SPI_Transmit(&_45DBXX_SPI, (uint8_t *) data, len);
	B_HAL_GPIO_WritePin(_45DBXX_CS_GPIO, _45DBXX_CS_PIN, GPIO_PIN_SET);
	at45dbxx_wait_busy();

	return (status == HAL_OK);
}


// Read a number of bytes from a single page
// 1/8/21 The last parm is actually a linear address within the device.

bool AT45dbxx_ReadPage(
	uint8_t *data,		// Out	Buffer to read data to
	uint16_t len,		// In	Length of data to read (in bytes)
	uint32_t page		// In	linear address, a multiple of 512
) {
	HAL_StatusTypeDef status = HAL_OK;

	if (len==0) return (status == HAL_OK);

	page = page << AT45dbxx.Shift;

	// Round down length to the page size
	if (len > AT45dbxx.PageSize) {
		len = AT45dbxx.PageSize;
	}

	at45dbxx_resume();
	at45dbxx_wait_busy();
	B_HAL_GPIO_WritePin(_45DBXX_CS_GPIO, _45DBXX_CS_PIN, GPIO_PIN_RESET);
	at45dbxx_tx_rx_byte(AT45DB_RDARRAYHF);
	at45dbxx_tx_rx_byte((page >> 16) & 0xff);
	at45dbxx_tx_rx_byte((page >> 8) & 0xff);
	at45dbxx_tx_rx_byte(page & 0xff);
	at45dbxx_tx_rx_byte(0);
	status = B_HAL_SPI_Receive(&_45DBXX_SPI, data, len);
	B_HAL_GPIO_WritePin(_45DBXX_CS_GPIO, _45DBXX_CS_PIN, GPIO_PIN_SET);

	return (status == HAL_OK);
}

Could be a CS timing issue but I doubt it because the device is very fast; much faster than the above code.

gf · « **Reply #146 on:** August 14, 2021, 10:00:52 am »

Quote from: peter-h

Quote
The fact that it's on the stack means the compiler sees its entire lifespan, from function entry to exit, and knows it's never used.
Hmm. An interesting theory.

The primary purpose of a program is not to load values from memory locations, or to store values to memory locations. In the end, only the "visible side effects" count. How the visible effects are calculated does not matter. If the compiler can calculate the same visible side effects w/o accessing any memory, then it does not need to generate any load/store instructions. And if it can prove that a variable is completely unused (after eliminating the load/store instructions), then it ~~can eliminate~~ does not need to allocate any storage for the variable either.

Only volatile load/store operations are considered "visible side effects", i.e. to have a purpose in themselves, therefore they cannot be optimized out.

Mentally, it may be better if you do not consider a variable a memory location in the first place, but rather consider it a name for a value (wherever the value happens to be stored, if stored at all).
It depends on the circumstances whether the compiler eventually allocates a memory location for a variable, or not.
It may also help to think in terms of SSA (in particular regarding local variables inside functions) which is the basis for virtually all optimizers today.

Quote

I like the "memcpy() doesn't know about volatile" explanation better, technically.

Of course. Even if memcpy() is a compiler intrinsic, the compiler still has to handle an explicit call to memcpy() as if an inline function with the signature

Code: [Select]

void* memcpy(void * restrict dest, const void * restrict src, size_t count);

were called, and inlined into the code. I.e. any volatile qualifiers are lost when the arguments are passed to the function parameters dest and src.
And inside memcpy(), the load/store from/to source and destination buffer are NOT volatile operations, so they are allowed to be eliminated if they are needless for calculating the actual visible side effects.

Siwastaja · « **Reply #147 on:** August 14, 2021, 11:51:27 am »

Quote from: gf on August 14, 2021, 10:00:52 am

Mentally, it may be better if you do not consider a variable a memory location in the first place, but rather consider it a name for a value (wherever the value happens to be stored, if stored at all).

This, this and this. You need to get rid of the "portable assembler" mindset and accept that C is a high level language. You describe the expected outcome of the program by sequential statements, then the compiler is free to do anything to achieve the end-result your statements define. Because machine language is typically also sequential operations, you may incorrectly think these two sequences map directly or nearly directly to each other, but they don't (except by accident). It's also an incorrect way of thinking that there "should" be a direct mapping and it's just the "optimizer" that messes this up.

C does have some low-level features (like the volatile qualifier) that make it usable for near-hardware system programming, but the whole language doesn't work that way like people tend to expect.

DavidAlfa · « **Reply #148 on:** August 14, 2021, 01:26:59 pm »

B_HAL_GPIO_WritePin(_45DBXX_CS_GPIO, _45DBXX_CS_PIN, GPIO_PIN_RESET);

Probably you'll need to put a small delay after that. I always start with slow SPI clocks and adding big delays when toggling pins.
Then, when it's already working, I start optimizing things.
Otherwise you add a lot of factors that could be causing the problem. Too fast clock? Too fast CS? Who knows!

If you don't have a logic analyzer yet, get one! You can find cheap 8-ch 24MHz analyzers on eBay.
They're amazingly handy when troubleshooting digital communications.
You'll quickly find out if there's something wrong in the bus.
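If a delay after asserting CS is tried as an experiment, it has to be written so the optimiser cannot delete it; an empty counting loop on a plain variable can be removed entirely. A crude sketch using a volatile counter (spin_delay is a made-up helper, and the iteration count is not calibrated to any time unit):

```c
#include <stdint.h>

/* Busy-wait for a number of loop iterations. The counter is volatile, so
 * every increment and comparison is a real access the compiler must keep.
 * Returns the final count, which equals the requested iterations. */
uint32_t spin_delay(uint32_t iterations)
{
    volatile uint32_t count = 0;
    while (count < iterations) {
        count++;
    }
    return count;
}
```

It would go between the B_HAL_GPIO_WritePin(..., GPIO_PIN_RESET) call and the first at45dbxx_tx_rx_byte(). If the fault disappears with a generous delay, that points at timing; a hardware timer would then be the better tool for a proper fix.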

peter-h · « **Reply #149 on:** August 14, 2021, 04:07:09 pm »

I wish it was that... Tcss etc are a mere 5ns

I have a 500MHz DSO and this was checked.

The serial FLASH runs off SPI2 which runs at 21mbps (the max possible on 32F4 SPI2/3). The FLASH can do 85MHz.

A timing issue does nevertheless remain, and it would be a bastard to find.

