Looks like it has everything to do with 8-byte stack alignment which, as you said, is part of the ABI. If the function bodies get larger, GCC eventually switches to emitting a push {lr} followed by sub sp, sp, #xx.
For xx=4, it's more space efficient for the compiler to generate push {rx, lr}, where rx is r3 in this case, so that the stack stays 8-byte aligned. Pushing one extra register is about as fast as a separate subtraction, so folding everything into a single push saves an instruction.
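To make the xx=4 case concrete, here is a small sketch of the arithmetic. The function name frame_padding is mine, not anything from GCC, and it only models the rounding, not how GCC actually picks registers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bytes of padding needed to round frame_bytes up to a
   multiple of align. align must be a power of two (8 for the AAPCS rule
   discussed here). */
static uint32_t frame_padding(uint32_t frame_bytes, uint32_t align)
{
    assert(align != 0u && (align & (align - 1u)) == 0u);
    return (align - (frame_bytes & (align - 1u))) & (align - 1u);
}
```

With push {lr} alone the frame is 4 bytes, so frame_padding(4, 8) is 4, which is exactly one extra 32-bit register in the push list. With push {r4, lr} the frame is already 8 bytes and no padding is needed.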
As discussed in the GCC ticket, it looks like more modern Cortex-M3 cores don't really care about stack alignment and can have the STKALIGN bit cleared, in which case this r3 push/pop is unnecessary. On the M0 and M7 it is necessary, as the STKALIGN bit is fixed high.
I read that on the M7 it is necessary because the AXI bus is 64 bits wide.
One thing that does come to mind, and which I don't have an answer to: what happens when you use this 'interrupt' prologue/epilogue while running any kind of generic C code? Say that code runs on the main sp, where in the middle of a function body it is not required to keep an 8-byte stack alignment. Then any IRQ can occur while the instantaneous sp is not 8-byte aligned. Is this going to cause UB or faults? Or is that something that will be handled/guaranteed by the hardware or microcode upon entering the interrupt/exception context?
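On that question: as far as I can tell from the ARMv7-M architecture manual, when STKALIGN is set the core itself realigns sp to 8 bytes while stacking the exception frame, and records whether it did so in bit 9 of the stacked xPSR so that the return can undo it. Below is a rough C model of that behaviour for the basic 8-word frame (no FPU state). The type and function names are made up for illustration.

```c
#include <stdint.h>

#define BASIC_FRAME_BYTES 32u /* r0-r3, r12, lr, pc, xPSR */

/* Hypothetical model of what the hardware leaves behind on exception entry. */
typedef struct {
    uint32_t frame_sp;  /* sp after the frame has been stacked */
    uint32_t xpsr_bit9; /* 1 if a 4-byte alignment pad was inserted */
} entry_result;

static entry_result exception_entry(uint32_t sp, int stkalign)
{
    entry_result r;
    uint32_t forcealign = stkalign ? 1u : 0u;

    /* Remember bit 2 of the old sp, but only when alignment is enforced. */
    r.xpsr_bit9 = ((sp >> 2) & 1u) & forcealign;
    /* Reserve the frame, then clear bit 2 of sp if alignment is enforced. */
    r.frame_sp = (sp - BASIC_FRAME_BYTES) & ~(forcealign << 2);
    return r;
}

static uint32_t exception_return(entry_result r)
{
    /* Drop the frame and put the pad back if one was inserted. */
    return (r.frame_sp + BASIC_FRAME_BYTES) | (r.xpsr_bit9 << 2);
}
```

If this model is right, an IRQ arriving while sp is only 4-byte aligned still enters the handler with an 8-byte aligned sp, and the interrupted code gets its original sp back on return.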
In the linked godbolt example, this is what the prologue code seems to be doing, regardless of whether any stack alignment is needed further down in the function.
Edit: oof, GCC is getting confused:
https://godbolt.org/z/T511Kxch8
TIM7_IrqHandler():
mov r0, sp
ldr r3, .L3
movs r2, #0
bic r1, r0, #7
ldr r3, [r3]
mov sp, r1
push {r0}
pop {r0}
str r2, [r3]
mov sp, r0
bx lr
.L3:
.word .LANCHOR0
sr:
.word -889323520
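For reference, the sp manipulation in that listing can be traced with a small C model. This only follows sp through the four instructions that change it. The struct and function names are hypothetical, and the memory store of the push is left out because the value is popped straight back.

```c
#include <stdint.h>

/* sp after each sp-changing instruction of the listing, in order. */
typedef struct {
    uint32_t after_align;   /* bic r1, r0, #7 ; mov sp, r1 */
    uint32_t after_push;    /* push {r0} */
    uint32_t after_pop;     /* pop {r0} */
    uint32_t after_restore; /* mov sp, r0 */
} sp_trace;

static sp_trace trace_prologue(uint32_t sp_on_entry)
{
    sp_trace t;
    uint32_t r0 = sp_on_entry;       /* mov r0, sp */
    uint32_t sp = r0 & ~(uint32_t)7; /* align down to 8 bytes */

    t.after_align = sp;
    sp -= 4u;                        /* one word pushed */
    t.after_push = sp;
    sp += 4u;                        /* same word popped back into r0 */
    t.after_pop = sp;
    sp = r0;                         /* original sp restored */
    t.after_restore = sp;
    return t;
}
```

Starting from an sp of 0x20000FFC this gives 0x20000FF8, 0x20000FF4, 0x20000FF8 and finally 0x20000FFC again. The align, push and pop all complete before the str, so none of that sp juggling is still in effect when the store runs.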
Edit2: @newbrain Thanks for the explanation. Yes, it seems the interrupt attribute is unnecessary then.