That's easy and sufficient iff you can prove the worst-case time. How do you prove the worst-case time (a) without caches?
Cortex-M interrupt latency is reliably specified, guaranteed by ARM.
The vector table can be relocated to ITCM or RAM, and the ISR itself can be placed in ITCM. The resulting access patterns are simple.
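As an illustration, a minimal sketch of relocating the vector table into a RAM/ITCM copy with CMSIS; the section name, alignment, vector count and the g_pfnVectors symbol are assumptions that depend on the exact part, linker script and startup file:

    #include <stdint.h>
    #include <string.h>
    #include "stm32h7xx.h"                 /* assumption: CMSIS device header */

    #define VECTOR_COUNT (16 + 150)        /* assumption: core + device vectors */

    /* Destination table; ".itcm_vectors" must exist in the linker script. */
    static uint32_t ram_vectors[VECTOR_COUNT]
        __attribute__((aligned(512), section(".itcm_vectors")));

    extern uint32_t g_pfnVectors[];        /* flash table symbol from the startup file */

    void relocate_vector_table(void)
    {
        memcpy(ram_vectors, g_pfnVectors, sizeof ram_vectors);
        __disable_irq();
        SCB->VTOR = (uint32_t)ram_vectors; /* point the core at the new table */
        __DSB();
        __enable_irq();
    }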
Output an assembly listing and count the clock cycles of the instructions leading up to the important operation. The instruction set section of the technical reference manual gives the range of clock cycles a particular instruction takes.
Additionally, go through all critical sections that run with interrupts disabled and count their clock cycles. This is the worst-case delay added before the interrupt latency even begins.
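For example, a short critical section written so its cycle count stays trivially small and easy to verify from the listing; a sketch using CMSIS intrinsics, with the shared counter only as a placeholder:

    #include <stdint.h>
    #include "stm32h7xx.h"        /* assumption: provides __get_PRIMASK() etc. */

    volatile uint32_t shared_counter;

    void bump_shared_counter(void)
    {
        uint32_t primask = __get_PRIMASK();  /* remember the current mask state */
        __disable_irq();
        shared_counter++;                    /* keep this region to a handful of
                                                instructions: every cycle here adds
                                                to the worst-case interrupt latency */
        __set_PRIMASK(primask);              /* restore; re-enables only if it was enabled */
    }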
If a higher-priority interrupt can pre-empt a less urgent but still timing-critical interrupt, add the maximum duration of the higher-priority interrupt. This strongly motivates keeping the highest-priority ISRs short, because counting cycles from an assembly listing of a complex state horror with a lot of branches will be a nightmare. If the high-priority ISR needs to be complex, you can split it in two by demoting the rest of the work: from the high-priority ISR, generate a software interrupt request at a lower priority. This way you do the absolutely timing-critical thing first, then continue the more complex code at lower priority so it can be pre-empted by the "medium-importance" stuff.
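A sketch of that split, assuming an STM32H7-style part with CMSIS; TIM1 as the urgent source, PA0 as the pin, FLASH_IRQn as the borrowed "spare" interrupt line, and the priority numbers are all placeholder assumptions:

    #include "stm32h7xx.h"   /* assumption: CMSIS device header */

    /* Highest priority: only the cycle-critical work, then hand off. */
    void TIM1_UP_IRQHandler(void)
    {
        TIM1->SR = ~TIM_SR_UIF;          /* clear the update flag */
        GPIOA->BSRR = (1u << 0);         /* e.g. drive the pin that must not be late */
        NVIC_SetPendingIRQ(FLASH_IRQn);  /* demote the rest to lower priority */
    }

    /* Lower-priority "second half": complex follow-up work, freely pre-emptible. */
    void FLASH_IRQHandler(void)
    {
        /* state machine, buffer handling, etc.; off the critical path now */
    }

    void isr_split_init(void)
    {
        NVIC_SetPriority(TIM1_UP_IRQn, 0);   /* most urgent */
        NVIC_SetPriority(FLASH_IRQn, 5);     /* less urgent */
        NVIC_EnableIRQ(FLASH_IRQn);
        NVIC_EnableIRQ(TIM1_UP_IRQn);
    }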
Run all such code from ITCM. Running from flash is OK too, just slower and more jittery; you can still guarantee a worst-case bound by assuming the configured number of wait states is incurred on every read; real-world average performance will only be better.
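For completeness, placing a function in ITCM with GCC might look like the following sketch; the ".itcm_text" section name, and the linker-script plus startup-copy plumbing behind it, are assumptions:

    /* Assumption: the linker script defines an .itcm_text output section in
       ITCM and startup code copies it from flash before main(). */
    __attribute__((section(".itcm_text"), noinline))
    void timing_critical_step(volatile uint32_t *reg, uint32_t value)
    {
        *reg = value;    /* runs from zero-wait-state ITCM: no flash wait states */
    }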
Yes, this is some manual work, but usually you don't need to go this far very often, because in real-world projects not every signal and not every logical operation is this sensitive to timing.
All of this manual work carries a risk of human error. Measurement can be used as a verification step, as you note. Margins are important in any case, to cover small errors. If you write low-level C the way it should be written, don't use random bloated libraries, and finally check the assembly listing, there just is no way the expected 20 cycles becomes 200 cycles.
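As a verification step, the DWT cycle counter gives an exact core-cycle measurement to compare against the hand count; a sketch assuming a Cortex-M7 part (e.g. STM32F7/H7) with CMSIS headers:

    #include <stdint.h>
    #include "stm32h7xx.h"    /* assumption: CMSIS device header */

    /* Enable the DWT cycle counter once at startup. */
    void cyccnt_init(void)
    {
        CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  /* enable the trace block */
        DWT->LAR = 0xC5ACCE55;                           /* unlock; needed on Cortex-M7 */
        DWT->CYCCNT = 0;
        DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;             /* start counting core cycles */
    }

    /* Measure the code under test and compare against the hand-counted
       worst case plus margin. */
    uint32_t measure_cycles(void (*fn)(void))
    {
        uint32_t start = DWT->CYCCNT;
        fn();
        return DWT->CYCCNT - start;
    }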
Everything in the real world has a margin for error. You don't pilot an airliner to millimeter accuracy.
(b) with caches?
Assume a cache miss for every read. The actual average (and possibly even the actual worst case, but you can't trust that) is way, way better, but hey, you get your worst-case guarantee.
This being said, I don't understand why you are bringing up caches so eagerly. Cycle-accurate, timing-critical microcontroller code just isn't run from a slow memory through a cache; it makes no sense. Indeed, none of my projects on STM32F7 or H7 enable the caches; they are wasted silicon. The fact that these top-of-the-line micros include caches which are "wasted" silicon if unused is a moot point, because such top-tier micros have dozens and dozens of features that go unused. For example, the H7 has two CAN peripherals which take up quite some die area because they have dedicated message RAMs of roughly the size a cache could be, yet you don't always need CAN. The same can be said about the area used by Ethernet, USB, and so on. A custom ASIC for everything would be the technically optimal solution but is obviously impossible.