Interrupt latency on a Cortex-M7 at 400 MHz is about 30 ns (12 cycles), and that includes the automatic stacking of registers and the vector fetch. With code in ITCM, the only source of jitter is clock-domain synchronization of the incoming signal, something that can't be avoided; call it one clock cycle. So: 30 ns average with 2.5 ns of jitter. With an interrupt priority system that easy to use, heck, you can reach for interrupts without blinking an eye! No ITCM/DTCM available, and the RAM shared with heavy DMA transfers? Just assume 100% DMA utilization: the bus arbiter gives the CPU half the cycles, so budget each load/store at double the time. Still blazing fast. So you have an if(condition) at the start of that critical ISR? OK, it might run a clock cycle or two faster on the second invocation thanks to branch prediction, so your jitter is now up by 5 ns. There is no magical way branch prediction can make it take longer than the first, unpredicted round, or cause a large difference. It's very fast even with the miss, and the miss isn't some super special occurrence: you'll see it on the scope screen the first time you try it.
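To make that concrete, here is a minimal bare-metal sketch of such an ISR in C. The vector name, the GPIO register address, and the ".itcm" section are all assumptions: the name must match your startup file's vector table, the address your device header, and the section your linker script.

```c
#include <stdint.h>

/* Hypothetical GPIO output-data register; substitute the address from
 * your part's CMSIS device header. */
#define GPIO_ODR (*(volatile uint32_t *)0x48000014u)

/* Placing the handler in ITCM keeps fetches off the flash and the AXI bus;
 * the ".itcm" section name must exist in your linker script. */
__attribute__((section(".itcm")))
void TIMER_IRQHandler(void)
{
    /* Hardware has already stacked the registers and fetched the vector,
     * so useful work starts ~12 cycles (30 ns at 400 MHz) after the event. */
    GPIO_ODR ^= (1u << 5);   /* toggle pin 5 */
    /* ...clear the peripheral's interrupt flag here (device-specific)... */
}
```

Setting the priority is a one-liner with CMSIS (NVIC_SetPriority()), which is what makes "just use an interrupt" such a cheap decision.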
Now let's compare these numbers to the "simple" good old systems. A polling loop on an 8-bit AVR at 8 MHz, alternating an I/O read with a compare-and-branch, would be something like 5 clock cycles long, with maybe 3 clock cycles of jitter, plus the same one unavoidable cycle of synchronization. So something like ~400 ns average with another ~400 ns of jitter. As you can see, I had a hard time coming up with an exact jitter number, so it's not even easy to analyze. You thought you could write "cycle-accurate" code easily on a simple instruction set with predictable instruction timing, but in reality you couldn't, because you still had to interface with the external world. And this applies to any xCORE as well: it can't magically predict when the external signal comes in.
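For contrast, here is what that polling loop looks like in AVR C; the pins are arbitrary examples, and the cycle counts in the comments assume the compiler emits the classic skip-and-branch pair, which is worth confirming in the disassembly.

```c
#include <avr/io.h>

/* Busy-wait for a rising edge on PD2, then drive PB5 high. */
void wait_and_respond(void)
{
    /* Typically compiles to an sbis/rjmp pair, so the loop body is a few
     * cycles long. The edge can land anywhere inside that window; that
     * window, plus one cycle of input synchronization, is the jitter. */
    while (!(PIND & _BV(PIND2)))
        ;
    PORTB |= _BV(PORTB5);   /* sbi: 2 cycles, 250 ns at 8 MHz */
}
```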
Simple isn't always better. Identify what you really need. This sounds obvious, but if you need fast, then fast is fast. And MICROCONTROLLER cores, compared to APPLICATION cores, offer a fast worst case. They do it by having on-chip SRAM, with the higher-end ones partitioning that RAM across multiple interfaces, some accessible only by the CPU (with a predictable, exact 1-cycle latency). This is all blatantly obvious to everybody except guerilla marketing shills. Some uncertainty comes from the fact that the Cortex-M7 is sometimes even faster thanks to branch prediction. The result is a very fast worst case and a slightly better average case, with a few clock cycles of jitter. That is a big deal only if you really need to bit-bang an interface actually accurate to a few nanoseconds. Which is extremely rare, given the sheer performance, which lets you do the job much more easily if you can accept, say, 10-20 ns of uncertainty. And you easily should be able to, if you were happy with the 8-bitters: their synchronization jitter alone was of the same order!
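A sketch of what pinning the hot path into those CPU-only memories looks like with a GNU toolchain; the section names are assumptions that must match your linker script, and the startup code has to copy/zero them.

```c
#include <stdint.h>

/* ".itcm" and ".dtcm" are assumed section names from the linker script. */
#define ITCM_CODE __attribute__((section(".itcm"), noinline))
#define DTCM_DATA __attribute__((section(".dtcm")))

DTCM_DATA volatile uint32_t sample_buf[256];   /* CPU-only, 1-cycle access */

ITCM_CODE void hot_path(uint32_t x)
{
    static uint32_t i;
    sample_buf[i++ & 0xFFu] = x;   /* no flash wait states, no DMA contention */
}
```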
The actual reason we almost never count individual cycles on modern high-end microcontrollers (I have done it once in a project, and it's fully possible!) is not the difficulty of doing so, but the fact that there is no need. When the complexity went up, so did the performance: if the average case went up 20x, the worst case still went up maybe 10x. Gone are the days of having to resort to counting instruction cycles to bit-bang an interface. Because interrupt entry now takes the equivalent of what half a clock cycle was on an AVR or PIC, you can just have a timer generate interrupts, bit-bang the protocol in the handler, and let everything else run in parallel. Oh, the joy of programming it.
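A sketch of that structure, with hypothetical names (the flag-clearing and the STM32-style set/reset register are assumptions for your particular part): a timer fires once per bit period, and the handler shifts out one bit while main() keeps doing other work.

```c
#include <stdint.h>

/* Hypothetical set/reset register (STM32-style: low half sets, high half
 * clears); substitute your part's GPIO registers. */
#define GPIO_BSRR (*(volatile uint32_t *)0x48000018u)
#define TX_PIN    3u

static volatile uint32_t tx_shift;   /* bits to send, LSB first */
static volatile uint8_t  tx_count;   /* bits remaining */

/* Fires once per bit period; the vector name is an assumption. */
void BITBANG_TIMER_IRQHandler(void)
{
    /* ...clear the timer's interrupt flag here (device-specific)... */
    if (tx_count) {
        if (tx_shift & 1u)
            GPIO_BSRR = 1u << TX_PIN;            /* drive the line high */
        else
            GPIO_BSRR = 1u << (TX_PIN + 16u);    /* drive the line low */
        tx_shift >>= 1;
        tx_count--;
    }
}
```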
And the differentiating factor between a Cortex-A running Linux and a Cortex-M running bare metal is exactly this: the predictability of worst-case timing. Claims that caches make any relevant difference here are made-up BS arguments, shown wrong countless times. Small sources of jitter do exist (M7 branch prediction; DMA and the CPU arbitrating RAM access; clock-domain synchronization), but these are all just a few clock cycles, easily understood and dealt with. Most importantly, even if you don't understand every detail, the total uncertainty is still so small that it is trivially swamped by a very modest safety margin: a few cycles at 400 MHz is on the order of 10 ns, so even a 100 ns margin buries it.
Compared to this, what really slows down beginners' attempts at timing-predictable code is overly complex software libraries and layers. Needless to say, for tight control of timing you need to be in control of the code, and there is no limit to how slow and bloated it can get. A perfect example, one that has nothing to do with MCUs getting more complex and everything to do with software-writing practices getting bloated, is the horribly slow timing of Arduino's digitalWrite() and friends. You can totally get slow and uncertain timing from bloated libraries, even on the simple and "predictable" MCUs.
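To see the difference yourself, contrast the two paths in an Arduino sketch on a classic AVR board (pin 13 = PB5 is Uno-specific, and the timings are ballpark figures, easy to check on a scope):

```c
#include <Arduino.h>

void toggle_both_ways(void)
{
    /* Library path: pin-to-port table lookups, a PWM check and interrupt
     * masking add up to a few microseconds per call at 16 MHz. */
    digitalWrite(13, HIGH);
    digitalWrite(13, LOW);

    /* Direct path: each write compiles to a 2-cycle sbi/cbi instruction,
     * 125 ns at 16 MHz, on the very same pin. */
    PORTB |= _BV(PORTB5);
    PORTB &= ~_BV(PORTB5);
}
```

Same MCU, same pin, well over an order of magnitude of difference, all of it from the library.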