Also, the M7 is dual-issue, i.e., it can run two instructions in parallel if they do not depend on each other. And obviously, since NOP does nothing, it does not depend on anything around it, so a NOP can always run in parallel with any other instruction, or with another NOP.
I noticed this behavior when hand-adjusting timing-critical M7 assembly code: adding a single NOP may or may not add a delay of 1 CPU cycle, depending on what's going on around the NOP, but adding two NOPs back-to-back always added at least one CPU cycle of delay.
Use ataradov's code. If you need a super-small delay of just a single CPU cycle, then try NOP:
__asm__ __volatile__ ("nop");
Unless it's an M7, this is very likely to add one CPU cycle of delay.
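If you want to check what a NOP actually costs on your core, measuring beats guessing. Here is a minimal sketch, assuming a free-running cycle counter is available (on M3/M4/M7 this is typically DWT->CYCCNT from CMSIS, which has to be enabled first; M0/M0+ generally don't have it). The reader callback, the function names and the host-side fake counter are all made up for illustration:

```c
#include <stdint.h>

/* Callback that returns a free-running cycle counter value.
   Assumption: on a real target this would read something like DWT->CYCCNT. */
typedef uint32_t (*cycle_reader_fn)(void);

/* Hypothetical host-side stand-in so the sketch compiles off-target:
   the "counter" advances by a fixed step on every read, and starts
   close to the 32-bit limit to exercise wraparound. */
static uint32_t fake_cycles = 0xFFFFFFF0u;
static uint32_t fake_cycle_reader(void)
{
    fake_cycles += 12u;
    return fake_cycles;
}

/* Baseline: cost of two counter reads with nothing in between. */
static uint32_t cycles_for_nothing(cycle_reader_fn read_cycles)
{
    uint32_t start = read_cycles();
    uint32_t end = read_cycles();
    return end - start;   /* unsigned subtraction survives counter wraparound */
}

/* Same measurement with two back-to-back NOPs in between. */
static uint32_t cycles_for_two_nops(cycle_reader_fn read_cycles)
{
    uint32_t start = read_cycles();
    /* "memory" clobber is a precaution to discourage the compiler from
       moving the NOPs around; check the disassembly to be sure. */
    __asm__ __volatile__ ("nop\n\tnop" ::: "memory");
    uint32_t end = read_cycles();
    return end - start;
}
```

On target you would pass a function that returns the real counter and compare cycles_for_two_nops() against cycles_for_nothing(), so that the cost of reading the counter cancels out.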
But indeed, what's the use case? If you, for example, want to add a tiny dead time between high and low side transistors, just writing it as two separate IO operations already gives you a longer delay than 1 CPU cycle (depending on the bus clock to the IO port). Typical would be something like 2-5 cycles, with possibly a cycle or two of jitter. If this is not long enough, then I fail to see how adding just one more cycle would change anything, so just use ataradov's code even though it does have a minimum delay of 4 cycles, plus the function call overhead (a few cycles, don't remember offhand how many).
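The "two separate IO operations" approach could look roughly like this. It's only a sketch: the register layout (separate set/clear registers), the pin masks, the active-high drive polarity and all the names are assumptions, not any particular vendor's header:

```c
#include <stdint.h>

/* Hypothetical GPIO block with separate set and clear registers,
   similar in spirit to what many Cortex-M parts offer. */
typedef struct {
    volatile uint32_t set;   /* writing 1 bits drives those pins high */
    volatile uint32_t clr;   /* writing 1 bits drives those pins low  */
} gpio_regs_t;

#define HIGH_SIDE_PIN (1u << 0)   /* hypothetical pin assignment */
#define LOW_SIDE_PIN  (1u << 1)   /* hypothetical pin assignment */

/* Turn the high side off, then the low side on (active-high drive assumed).
   The gap between the two bus writes is the dead time. */
static inline void switch_high_to_low(gpio_regs_t *gpio)
{
    gpio->clr = HIGH_SIDE_PIN;    /* first IO write: high side off */
    /* If more dead time is needed, a cycle-counted delay goes here. */
    gpio->set = LOW_SIDE_PIN;     /* second IO write: low side on */
}
```

Because both registers are volatile, the compiler keeps the two writes separate and in order; how many cycles actually pass between them depends on the bus, so measure it on a scope.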
Also remember, as others have mentioned, higher-end ARM CPUs can't access the flash at the CPU core frequency, so they fetch several instructions per flash access; that way linear code can still run at full CPU speed. This means jumps can take longer, as the CPU has to wait for the flash access at the new address.