Running at 600 MHz, they verify that the interrupt latency is 11 cycles (18 ns), but they measure a 75 ns interval between the input and output pulses on the scope. The majority of this time - 34 cycles (57 ns) - is consumed by the execution of 3 (presumably uncached) instructions
No, I think the authors misinterpret the results; the "majority" of the time is not spent fetching and executing the instructions. The majority of the delay is:
* Synchronization of the input into the IPG clock domain and generation of the interrupt signal into the CPU clock domain - this part is completely missing from the analysis - and then,
* The STR instruction's write crossing from the CPU clock domain back into the IPG clock domain.
They seem to give the impression that the STR instruction "stalls" until the result is seen on the scope screen, but this is not the case.
Instructions run fast; the IO clock domain runs slower. That's exactly why we have data barrier instructions: for the cases where we actually need to "slow down" the instruction stream and wait for a write to complete.
And this is where the assumptions go wrong. You probe with a scope like they did in the paper and calculate that it took 34 cycles. But it really didn't: the CPU had long since moved on to executing the following code; the IO is just slower than the CPU! That's also why counting CPU cycles isn't fruitful: they are not the bottleneck.
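A minimal sketch of what I mean, assuming a GCC-style toolchain; the register address and pin mask are placeholders, not the real i.MX RT GPIO block:

```c
#include <stdint.h>

/* Hypothetical memory-mapped GPIO set register living in the slower IPG/IO
 * clock domain; the address is a placeholder for illustration only. */
#define GPIO_OUT_SET  (*(volatile uint32_t *)0x40000000u)

void raise_pin(void)
{
    /* This STR is posted into the bus fabric: the CPU continues with the
     * next instructions right away, while the write still has to cross
     * into the IO clock domain before the pin actually changes. */
    GPIO_OUT_SET = (1u << 3);

    /* Only when the code must not run ahead of the write do we deliberately
     * "slow down" with a barrier; how far downstream its guarantee reaches
     * still depends on the bus fabric. */
    __asm volatile ("dsb" ::: "memory");
}
```

So the pin-to-pin interval on the scope measures the clock-domain crossings, not how long the CPU spent executing those three instructions.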
Not 100% sure, but I'd guess most M7 MCUs would prefetch the first ISR instructions from flash in parallel with stacking the registers, so there should be no difference in the first few linear instructions whether running from flash or ITCM. This could be proven, too, if necessary, but it's needless extra work when you can just place the routine in ITCM (or main RAM); that's what it exists for. The benefit of ITCM over main RAM is that DMA transfers cannot affect the timing.
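Placing the routine there is straightforward; here's a minimal sketch, assuming a GCC toolchain whose linker script defines an .itcm output section (the section name and handler name are placeholders) and whose startup code copies that section from FLASH into ITCM at boot:

```c
/* Hypothetical IRQ handler located in ITCM so instruction fetches never
 * go to FLASH and are not disturbed by DMA traffic hitting main RAM.
 * The ".itcm" section name must match your linker script; many vendor
 * SDKs provide an equivalent "ramfunc"-style macro instead. */
__attribute__((section(".itcm"), used))
void GPIO_IRQHandler(void)
{
    /* ...raise the output pin, clear the interrupt flag... */
}
```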
With IPG set to 150 MHz, IO operations are way slower than the 600 MHz CPU clock might suggest to a casual reader. As you say, it's not stated where the ISR resides; moving it to ITCM might shave off a few clock cycles, but OTOH it might not, because the parallel flash prefetch may already make it as fast as possible from the CPU's point of view.
The STM32H7 series, which I have used in quite a few projects since its introduction, for example supports an IO clock domain (the D3 domain; AHB3 or AHB4, IIRC) of up to 240 MHz. With an fCPU of "only" 480 MHz, the IO performance would still be better than the i.MX running at IPG = 150 MHz. The revision history of that PDF indicates they accidentally ran at IPG = 300 MHz at first, which is beyond the allowed range, but those results would have been interesting to see.
But I don't actually live in my lab so I can't measure anything now.
The point is, the clock domain speed is the defining factor. Instruction counting is secondary, especially because the CPU core tends to run at a higher clock speed than the IO, so even if you have 1-2 cycles of jitter there, it's just... insignificant. Hence the separation of the compute timing from everything else. Instruction timing mattered a lot more back when everything ran at 8 MHz in a CPU with lousy peripherals.
But I'm sure I know nothing because I dare to challenge a "well respected" member.
EDIT: Even with FLASH prefetch going on in parallel with stacking, that LDR-MOV-STR sequence does have one disadvantage: the LDR (which loads the IO port address, I guess) likely accesses a literal pool outside the FLASH prefetch range, so it incurs the full FLASH wait states anyway. So it pays off to put the ISR in ITCM (or even just main RAM).
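To make that concrete, the three instructions presumably come from something like this (a sketch; the register address is a placeholder), with the approximate compiler output in the comment:

```c
#include <stdint.h>

#define GPIO_OUT_SET  (*(volatile uint32_t *)0x40000000u)  /* placeholder address */

void GPIO_IRQHandler(void)
{
    /* Roughly compiles to:
     *   LDR  r3, =0x40000000   ; port address fetched from a literal pool,
     *                          ; likely outside the prefetched FLASH lines
     *   MOV  r2, #(1 << 3)
     *   STR  r2, [r3]          ; posted write towards the IPG clock domain
     * When that literal pool sits in FLASH, the LDR eats the full wait states
     * even if the instruction fetches themselves were prefetched. */
    GPIO_OUT_SET = (1u << 3);
    /* ...acknowledge the interrupt source... */
}
```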