There are probably some pathological cases where running from SRAM would be slower, but it is not generally the case. Most instructions are fetched 2 at a time, and on fast devices with more than 1 WS, running from SRAM would almost always win.
And on CM4/CM7 it is not an issue at all, since they have two buses and usually require a ton of wait states.
Add CM3, and add one bus. In CM7 the picture is further blurred by all the other stuff, including the massive conventional L1 caches on the AXI, so the timing picture is very complex there, with results all over the place depending on particularities. So let's just ignore CM7 for now.
What David talks about is running from SRAM on the S bus, vs. FLASH on both I and D buses (for fetches/constant data reads) while user data are still in SRAM on S bus.
The S bus is architecturally made so that it imposes one extra cycle on *every* read. The reason is that it is heavily loaded by all the slave buses the processor can access (except FLASH), whereas I and D are loaded only by the FLASH (and in some cases by SRAM intended for fast code execution, which is then called CCM, at least in STM32s). Also, as David said, when code is fetched through the same port through which data are read and written, there are collisions; and writing to and especially reading from peripherals may be surprisingly expensive. Sure, there's prefetch in the processor, and that helps a bit, but the damage is already done.
OTOH, the latency of FLASH is mitigated by the FLASH being 128/256 bits wide and by automatic prefetch from FLASH (although that's a slightly two-edged thing, as pessimal jump patterns do exist). In STM32F2/F4 (and 'F7, although I said I won't talk about that :-)) even nonlinear code is treated through the jumpcache, and constant data (which usually means the address pool, i.e. a contiguous space) through a small data cache (ART is the marketing name for this combo). There may be similar jumpcaches or conventional caches in other manufacturers' implementations; I don't follow those.
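To put rough numbers on the "wide FLASH plus prefetch" argument, here is a back-of-the-envelope sketch in C. It is a toy model under loudly stated assumptions, not a description of any real FLASH interface: straight-line code only, one cycle per instruction, no branches, no data accesses, no caches. The function names and the overlap assumption for prefetch are mine, purely for illustration.

```c
#include <assert.h>

/* Toy model only: straight-line code, one cycle per instruction,
 * no branches, no data accesses, no caches. All names are hypothetical. */

/* Instructions delivered by one read of a FLASH line. */
static int insns_per_line(int line_bits, int insn_bits)
{
    assert(insn_bits > 0 && line_bits >= insn_bits);
    return line_bits / insn_bits;
}

/* Without prefetch: every new line stalls the core for wait_states cycles. */
static double cpi_no_prefetch(int line_bits, int insn_bits, int wait_states)
{
    int n = insns_per_line(line_bits, insn_bits);
    return (double)(n + wait_states) / n;
}

/* With prefetch: the read of the next line (1 + wait_states cycles) is
 * assumed to overlap the n cycles spent executing the current line. */
static double cpi_prefetch(int line_bits, int insn_bits, int wait_states)
{
    int n = insns_per_line(line_bits, insn_bits);
    int stall = (1 + wait_states) - n;
    if (stall < 0)
        stall = 0;
    return (double)(n + stall) / n;
}
```

For a 128-bit line, 16-bit instructions and 5 WS, this gives 13/8 = 1.625 cycles per instruction without prefetch and 1.0 with it. A taken branch breaks the overlap assumption, and that is the case the jumpcache is there for.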
So yes, depending on circumstances, execution from FLASH may win over execution from SRAM. As part of the marketing push for the then-new 'F2/'F4, ST produced a comparison (unfortunately without publishing the source code), see
https://community.st.com/sfc/servlet.shepherd/document/download/0690X0000060I7OQAU pp36-45.
JW