For example, the worst interrupt latency of a 400 MHz part won't be 100 times less than the interrupt latency of a 4 MHz part.
Except that it (nearly, roughly) does. For example, staying within the ARM Cortex-M universe: take the lowest-cost M0 part with its 16-cycle interrupt latency and run it at 4 MHz, then compare it to an M7 part with 12-cycle interrupt latency running code from ITCM at 400 MHz, and you get better than a 100x improvement. If you consider the worst-case difference in the presence of multiple pending interrupt sources, the M7's tail-chaining of back-to-back interrupts increases the advantage even further.
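To put concrete numbers on the base case (architectural latencies only; real parts add flash wait states and bus delays on top):

    M0:  16 cycles / 4 MHz   = 4000 ns
    M7:  12 cycles / 400 MHz =   30 ns
    4000 ns / 30 ns ≈ 133x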
Now granted, if doing I/O right at the beginning of the ISR is critical, as it often is, then that 400 MHz part will lose a few cycles synchronizing the write into the GPIO peripheral's slower bus-clock domain, and that pretty much eats the few cycles we gained in interrupt latency over the M0; but oh well, it's still pretty close to a 100x difference, give or take.
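For concreteness, a minimal sketch of the kind of ISR in question; the vector name and the GPIO register address are purely illustrative, not from any specific part:

    #include <stdint.h>

    /* Illustrative GPIO bit-set register; the real address and the
       bus-domain synchronization penalty depend on the actual part. */
    #define GPIO_BSRR (*(volatile uint32_t *)0x48000018u)

    void TIM1_IRQHandler(void)
    {
        GPIO_BSRR = (1u << 5);  /* first useful instruction: the store
                                   still stalls on the peripheral-bus sync */
        /* ...rest of the time-critical work... */
    }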
On the other hand, if you compare that 12-cycle latency at 400 MHz to an AVR's 4-cycle latency at 4 MHz, the difference isn't 100x, but still about 33x (1 µs versus 30 ns). Except that the AVR does not do hardware stacking, so only in rare cases can you actually execute something meaningful in the first instruction of the ISR. So in reality, where you have to stack something first, it's close to 100x again; or, if you just write your ISRs in C rather than hand-crafted assembly, the stacking is done first anyway, so the M7 at 400 MHz is probably again at least 100x as fast.
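A minimal avr-gcc sketch of why, assuming an ATmega-class part: even a trivial C handler gets a compiler-generated prologue before its first useful instruction, because the hardware only stacks the return address:

    #include <stdint.h>
    #include <avr/interrupt.h>

    volatile uint8_t tick;

    ISR(TIMER1_COMPA_vect)   /* vector name from an ATmega-class part */
    {
        tick++;  /* avr-gcc emits register pushes and an SREG save before
                    this line: software stacking that Cortex-M hardware
                    does for you during those 12 cycles */
    }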
So it all depends, but in any case, rules of thumb originating from desktop computing are utterly useless here, and that is exactly where these market shills try to divert the discussion: by mentioning stuff such as caches, TLBs and whatnot, which is utterly irrelevant to the topic, because every Cortex-M7 runs timing-critical code from ITCM if you so wish, bypassing the caches completely and running with predictable memory performance equivalent to what you would get with guaranteed zero cache misses; and the M7 doesn't even have the concept of a TLB.
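A sketch of what "running from ITCM" looks like in practice; the section name and the copy-at-startup step are assumptions that depend on your linker script and vendor startup code:

    /* Place the handler in the ITCM section; the startup code is assumed
       to copy this section from flash into ITCM before main(). */
    __attribute__((section(".itcm"), used))
    void TIM1_IRQHandler(void)
    {
        /* fetched from single-cycle ITCM: no flash wait states, no
           cache involved, deterministic instruction timing */
    }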
Also, their pipelining is simple enough that we are discussing a jitter of a few cycles. Say a loop usually takes 12 cycles but sometimes takes 11. And these few cycles are exactly what this thread is all about: not caches, not large differences, not timing guarantees, not gross underutilization, but cycle accuracy and getting rid of even small jitter in those few cases where it does matter.
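That kind of jitter is directly measurable with the ARMv7-M cycle counter; a sketch using the standard DWT/DEMCR register addresses (CMSIS exposes the same registers as DWT->CYCCNT):

    #include <stdint.h>

    #define DEMCR      (*(volatile uint32_t *)0xE000EDFCu)
    #define DWT_CTRL   (*(volatile uint32_t *)0xE0001000u)
    #define DWT_CYCCNT (*(volatile uint32_t *)0xE0001004u)

    static void cyccnt_init(void)
    {
        DEMCR      |= (1u << 24);  /* TRCENA: power up the DWT unit */
        DWT_CYCCNT  = 0;
        DWT_CTRL   |= 1u;          /* CYCCNTENA: start counting cycles */
    }

    static uint32_t cycles(void (*fn)(void))
    {
        uint32_t t0 = DWT_CYCCNT;
        fn();
        return DWT_CYCCNT - t0;    /* run repeatedly: 11 vs 12 shows up here */
    }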
It is unsurprising that this particular market shill then tries to divert the discussion yet again to a totally different off-topic subject, using the same claims they never stop using, despite those claims having been proved irrelevant again and again. It is nearly comical that they have the guts to complain about others going off-topic when those others are simply replying to their off-topic posts. Which probably was a mistake; the good old saying "don't feed the troll" applies pretty well to this master troll, who has been doing this nonstop for years.
Caches in microcontrollers are some of the most widely misunderstood features. Just like a JPEG decoder peripheral, they are extras that cost only a little die area (and maybe a bit of static power consumption) if not used, but otherwise do not hinder performance. And because timing-critical control using interrupts is a colossally basic use case for any microcontroller, those caches have absolutely nothing to do with it: such code reaches decent performance running directly out of flash, and excellent performance running out of core-coupled instruction RAM, which is a basic feature of nearly every MCU that has a cache, too.
And when you discuss further with this particular troll, you will see what happens. Despite the forum section saying microcontrollers, and the discussion being about microcontrollers, suddenly application processors like Cortex-A pop up as example cases. No wonder the misconception that "high-end microcontrollers" are somehow not proper microcontrollers timing-wise is so widespread. Some people actually have a day job of confusing the field for others.
But in actual microcontrollers, just like some applications need SPI and others need I2C, some applications need ITCM and others need cache. Say, a UI application drawing animations or playing MP3 files from an SD card could enable the caches and greatly benefit from them, while a DC/DC converter or motor controller would enable ITCM and not bother enabling the caches at all, because there they are irrelevant.
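As a sketch of that split, using the standard CMSIS-Core cache functions for Cortex-M7 (the device header name and the init function names are made up for illustration):

    #include "device.h"   /* hypothetical device header pulling in CMSIS core_cm7.h */

    void ui_app_init(void)         /* GUI/MP3-style workload: caches help */
    {
        SCB_EnableICache();
        SCB_EnableDCache();
    }

    void motor_ctrl_init(void)     /* hard real-time loop: caches stay off
                                      (they are disabled out of reset on
                                      Cortex-M7); run the code from ITCM */
    {
    }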
In such a case, a disabled cache is simply irrelevant; all it does is increase the chip cost by a few cents. Except that having fewer different parts (and instead more capable parts with many options) improves the economies of scale, so in the end, selling parts that have a cache, even if most users run them with it disabled, is still the cheapest option.
But as a result, there is underutilization. Which is a big thing in MCUs: nearly every MCU is 90% underutilized, if not more, whether you count CPU cycles, RAM, or flash, let alone peripheral utilization. Yet an XCORE with 9 of its 16 cores in use, executing NOPs 99% of the time to emulate hardware peripherals, is clearly not underutilization. (And I like the concept and the idea, just not their marketing presence here.)