Well, I've found my first case where the C compiler(s) produce better code than I wrote in assembler.
You might remember this simple loop from the clock initialization:
clklp:  ldr     r1, [r0, #RCC_CR]
        tst     r1, #RCC_CR_HSERDY      /* wait for clock ready */
        beq.n   clklp
That seems obvious enough. Read the register, check the relevant bit, loop if it's not set.
But the C compiler produces code like:
lp:     ldr     r2, [r1, #0]
        lsls    r2, r2, #14
        bpl.n   lp
And guess what? The "tst" instruction with the immediate bit value turns out to be a 32-bit instruction, while the "lsls" instruction, which shifts the relevant bit into the sign position (it took me a while to figure out that that's what it's doing!), is only 16 bits. The "bpl" then loops for as long as the N flag is clear, which means the bit is not set yet. Same behavior, two bytes shorter. So the C code is better.
Of course, now that I know this "trick", I can theoretically do the same thing. But it's pretty inconvenient to convert that bitmask symbol value that I have into the required shift value... :-(
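
For what it's worth, the conversion itself is just arithmetic: for a mask with a single bit set at position n, the shift count is 31 - n. Here is a minimal C sketch of that calculation, meant to be run on the host to find the constant for the "lsls". The helper name shift_to_sign is made up for illustration, and the value of RCC_CR_HSERDY is an assumption inferred from the #14 in the compiler output above (bit 17), not taken from a header.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value: the "lsls r2, r2, #14" above implies HSERDY is bit 17. */
#define RCC_CR_HSERDY 0x00020000u

/*
 * Hypothetical helper: return the left-shift count that moves the single
 * set bit of `mask` into bit 31 (the sign position).
 */
static unsigned shift_to_sign(uint32_t mask)
{
    unsigned shift = 0;

    /* Only meaningful for a mask with exactly one bit set. */
    assert(mask != 0 && (mask & (mask - 1)) == 0);

    /* Shift left until the set bit reaches bit 31, counting the steps. */
    while ((mask & 0x80000000u) == 0) {
        mask <<= 1;
        shift++;
    }
    return shift;
}
```

Calling shift_to_sign(RCC_CR_HSERDY) with the assumed mask gives the same shift the compiler chose. This only does the sum on the host; it doesn't solve the real annoyance of getting the assembler to derive the shift from the mask symbol by itself.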