Author Topic: ARM NOP (Read 11307 times)

Simon · « **on:** January 12, 2022, 08:15:53 pm »

I need some rather small delays that will likely not be worth setting up counters and interrupts for. I think there is some sort of NOP instruction? How would I use that with the ARM-GCC compiler (Microchip Studio)?

westfw · « **Reply #1 on:** January 12, 2022, 08:30:07 pm »

In theory, __NOP() is defined by CMSIS, and should exist…

Note that a nop may not consume a cpu cycle…

Simon · « **Reply #2 on:** January 12, 2022, 08:41:28 pm »

OK, how would it not consume a CPU cycle?

snarkysparky · « **Reply #3 on:** January 12, 2022, 09:04:29 pm »

I use this to eat cycles

asm volatile ("nop");
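A minimal sketch of wrapping that line in a macro, assuming GCC or Clang. On a Cortex-M target the CMSIS device header normally provides `__NOP()` already; `NOP4`, `NOPS_PER_PAUSE` and `short_pause` are made-up names for illustration. As the rest of the thread discusses, how many cycles each NOP costs depends on the core.

```c
/* Fallback for builds where no CMSIS header has defined __NOP() as a macro.
   Assumption: GCC/Clang extended asm; "nop" is a valid mnemonic on ARM/Thumb.
   If your CMSIS version defines __NOP as a function, drop this fallback. */
#ifndef __NOP
#define __NOP() __asm__ volatile ("nop")
#endif

/* Hypothetical helper: four back-to-back NOPs, unrolled so there is no loop
   counter or branch whose cost would also have to be accounted for. */
#define NOP4() do { __NOP(); __NOP(); __NOP(); __NOP(); } while (0)

enum { NOPS_PER_PAUSE = 4 };

static inline void short_pause(void)
{
    NOP4();  /* "volatile" keeps the compiler from deleting or merging these */
}
```

The `volatile` only stops the compiler from removing the instructions; it says nothing about what the pipeline does with them.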

SiliconWizard · « **Reply #4 on:** January 12, 2022, 09:05:46 pm »

Because it might not even be executed: https://developer.arm.com/documentation/dui0473/m/arm-and-thumb-instructions/nop?lang=en

SiliconWizard · « **Reply #5 on:** January 12, 2022, 09:08:50 pm »

In C, a way to consume cycles for sure would be to use a small loop with a volatile-qualified counter, such as:

Code:

for (volatile int i = 0; i < xxx; i++) {}

Of course, though, while it WILL consume cycles, it's hard to predict how many.

ataradov · « **Reply #6 on:** January 12, 2022, 09:17:32 pm »

Here is my standard way of generating small blocking delays:

Code:

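#include <stdint.h>

// Placed in SRAM (the linker script must provide ".ramfunc") and never inlined,
// so the loop timing does not depend on the call site.
__attribute__((noinline, section(".ramfunc")))
void delay_cycles(uint32_t cycles)
{
  cycles /= 4;  // one loop pass is budgeted at 4 cycles; pass at least 4, or the counter wraps

  asm volatile (
    // GCC's default (divided) Thumb-1 syntax on Cortex-M0/M0+: "sub" sets the flags.
    // With unified syntax (always used for Thumb-2 cores) this has to be "subs".
    "1: sub %[cycles], %[cycles], #1 \n"
    "   nop \n"
    "   bne 1b \n"
    : [cycles] "+l"(cycles)
    :
    : "cc"  // the loop modifies the condition flags
  );
}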

Code is located in SRAM so that execution time is not subject to flash wait states.

This is better than a plain C loop, since it does not depend on the compiler and optimization settings.

Simon · « **Reply #7 on:** January 12, 2022, 10:26:45 pm »

The only other thing I can think of is to just increment a variable, something like "dummy_counter++;". I don't know how many clocks that takes, but surely that is predictable. I may only need 4 cycles of delay, so even if that takes 2 cycles it is fine.

Benta · « **Reply #8 on:** January 12, 2022, 10:43:35 pm »

Welcome to the world of assembler programming

@ataradov's way seems to me to be how to do it.

ataradov · « **Reply #9 on:** January 12, 2022, 11:26:34 pm »

Quote from: Simon on January 12, 2022, 10:26:45 pm

The only other thing I can think of is to just increment a variable, something like "dummy_counter++;". I don't know how many clocks that takes, but surely that is predictable. I may only need 4 cycles of delay, so even if that takes 2 cycles it is fine.

The issue here is that this is not predictable. And on some devices the execution time would depend on the location of the code in the flash.

For example, on SAM V7x, if your whole loop fits into a 64-bit aligned block, it will be very fast, as it will be executed from the prefetch buffer. But as soon as it moves and crosses the boundary between two 64-bit blocks, it will take way longer. Counting cycles is pointless if you are running from the flash. Caches (specifically the I-cache) help a lot, but are not a full solution.

SiliconWizard · « **Reply #10 on:** January 13, 2022, 12:04:59 am »

And, with all that said, if you absolutely need sorta accurate delays down to a few cycles on modern 32-bit MCUs, you're probably doing something wrong. Very short software delays were fine on good old, fully predictable cores, often for bit-banging some IOs to emulate some peripherals. On any modern MCU, this is riddled with potential pitfalls, and there's usually another way of achieving the same thing using the right peripheral. Of course, that's a general thought; there may be a very good reason for doing that, but certainly this would often be the last resort.

Bassman59 · « **Reply #11 on:** January 13, 2022, 12:53:09 am »

Quote from: Simon on January 12, 2022, 08:15:53 pm

I need some rather small delays that will likely not be worth setting up counters and interrupts for. I think there is some sort of NOP instruction? How would I use that with the ARM-GCC compiler (Microchip Studio)?

What's your use case?

westfw · « **Reply #12 on:** January 13, 2022, 02:14:27 am »

Older ARM architectures apparently didn't have a separate NOP instruction, so the assembler/compiler would generate a "mov r8, r8" instruction, or maybe "mov r0, r0"
There are of course MANY instructions like this that don't actually do anything, and many more that "only" affect the flags.

The ARMv6-M ARM (Architecture Reference Manual) says:

Quote

The timing effects of including a NOP instruction in code are not guaranteed. It can increase execution time, leave it unchanged, or even reduce it. NOP instructions are therefore not suitable for timing loops.

Sadly, there aren't any clear indications of what WOULD be "suitable for timing loops."

ataradov · « **Reply #13 on:** January 13, 2022, 02:43:52 am »

None of ARM architectures have a dedicated NOP. It is always assembled from an instruction without side effects.

While in theory it is not guaranteed, the behaviour is fixed for a given core, and for all Cortex-Mx cores it would take at least one cycle to execute. And for Cortex-Ax you have a whole new set of problems anyway.

Nominal Animal · « **Reply #14 on:** January 13, 2022, 02:53:09 am »

I like using the DWT cycle counter on hardware that has it (ARM_DWT_CYCCNT), although it does mean DWT needs to be enabled, and I don't know how that affects power consumption.
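A rough sketch of the DWT approach, assuming an ARMv7-M core (Cortex-M3/M4/M7) and the architectural register addresses; ARMv6-M parts such as the Cortex-M0+ do not implement CYCCNT. The macro and function names are made up for illustration; with CMSIS you would use `CoreDebug->DEMCR`, `DWT->CTRL` and `DWT->CYCCNT` instead.

```c
#include <stdint.h>

/* Architectural ARMv7-M addresses (assumption: no vendor header in use). */
#define DEMCR       (*(volatile uint32_t *)(uintptr_t)0xE000EDFCu)
#define DWT_CTRL    (*(volatile uint32_t *)(uintptr_t)0xE0001000u)
#define DWT_CYCCNT  (*(volatile uint32_t *)(uintptr_t)0xE0001004u)

#define DEMCR_TRCENA        (1u << 24)  /* enables the DWT block */
#define DWT_CTRL_CYCCNTENA  (1u << 0)   /* starts the cycle counter */

/* Hypothetical helper: call once at start-up. */
static inline void cyccnt_enable(void)
{
    DEMCR      |= DEMCR_TRCENA;
    DWT_CYCCNT  = 0;
    DWT_CTRL   |= DWT_CTRL_CYCCNTENA;
}

/* Unsigned subtraction stays correct across one 32-bit wrap of the counter. */
static inline uint32_t cycles_since(uint32_t start, uint32_t now)
{
    return now - start;
}

/* Busy-wait until at least `cycles` core clocks have passed. */
static inline void cyccnt_delay(uint32_t cycles)
{
    uint32_t start = DWT_CYCCNT;
    while (cycles_since(start, DWT_CYCCNT) < cycles) {
        /* spin */
    }
}
```

The delay is a lower bound: the polling loop and any interrupts add to it, so this suits "at least N cycles" waits rather than exact ones.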

westfw · « **Reply #15 on:** January 13, 2022, 03:36:16 am »

Quote

None of ARM architectures have a dedicated NOP.

Are you sure?
ARMv7m and ARMv6m ARMs both say that NOP should encode to 0b1011 1111 0000 0000, and that doesn't seem to match any other instruction encoding?
(leading 1011 means "miscellaneous", the second 1111 means "hints", and the zeros make it a no-op...)

ataradov · « **Reply #16 on:** January 13, 2022, 03:45:17 am »

Quote from: westfw on January 13, 2022, 03:36:16 am

Are you sure?

Ok, I stand corrected.

My statement mostly comes from looking at assembly all day long, and GCC encodes "nop" as "mov r8, r8" for all platforms. I think IAR does the same, as I'm so used to the encoding 0x46c0. So this avoids the timing issue, but this dummy move could be optimized too, of course.
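To make the two encodings concrete, here is a small decoding sketch for 16-bit Thumb opcodes. It only covers the two instruction forms mentioned in this exchange, and the function names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hint space: 1011 1111 opA 0000. opA selects the hint, and 0 is NOP. */
static inline bool is_hint(uint16_t insn)
{
    return (insn & 0xFF0Fu) == 0xBF00u;
}

static inline unsigned hint_op(uint16_t insn)
{
    return (insn >> 4) & 0xFu;
}

/* MOV (register), high-register form: 0100 0110 D Rm[3:0] Rd[2:0],
   where the destination register number is D:Rd. */
static inline bool is_mov_high(uint16_t insn)
{
    return (insn & 0xFF00u) == 0x4600u;
}

static inline unsigned mov_rd(uint16_t insn)
{
    return ((insn >> 4) & 0x8u) | (insn & 0x7u);
}

static inline unsigned mov_rm(uint16_t insn)
{
    return (insn >> 3) & 0xFu;
}

/* A register moved onto itself: the traditional pseudo-NOP. */
static inline bool is_self_move(uint16_t insn)
{
    return is_mov_high(insn) && mov_rd(insn) == mov_rm(insn);
}
```

With these, 0xBF00 decodes as hint 0 (the architectural NOP) and 0x46C0 decodes as a MOV with both register fields equal to 8, that is "mov r8, r8".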

westfw · « **Reply #17 on:** January 13, 2022, 03:59:49 am »

(I think the chances that anything in the Cortex-M class would actually have a pipeline smart enough to discard "mov r8,r8" instructions and similar without executing them would be about zero. But then I don't understand why the "architecture" allows the "real nop" to have such non-deterministic timing, either. Surely it would be easier to omit that possibility?)

+1 on using DWT if it's available.
You can also use systick.
https://github.com/WestfW/Duino-hacks/blob/master/systick_delay/systick_delay.ino
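For reference, a sketch of a SysTick-based busy-wait. This is not taken from the linked sketch; it assumes SysTick is already running from the core clock with some reload value (for example a 1 ms tick), uses the architectural register addresses, and all names are made up for illustration. SysTick counts down and is only 24 bits wide, so the elapsed-time arithmetic differs from the DWT case.

```c
#include <stdint.h>

/* Architectural SysTick registers; CMSIS calls them SysTick->LOAD and SysTick->VAL. */
#define SYST_RVR  (*(volatile uint32_t *)(uintptr_t)0xE000E014u)  /* reload value  */
#define SYST_CVR  (*(volatile uint32_t *)(uintptr_t)0xE000E018u)  /* current value */

/* Ticks elapsed between two readings of a counter that counts DOWN from
   `reload` to 0 and then reloads. Correct across at most one reload. */
static inline uint32_t systick_elapsed(uint32_t start, uint32_t now, uint32_t reload)
{
    return (start >= now) ? (start - now) : (start + reload + 1u - now);
}

/* Busy-wait for at least `ticks` SysTick counts; ticks must not exceed the reload value. */
static inline void systick_delay_ticks(uint32_t ticks)
{
    uint32_t reload = SYST_RVR & 0x00FFFFFFu;
    uint32_t start  = SYST_CVR & 0x00FFFFFFu;
    while (systick_elapsed(start, SYST_CVR & 0x00FFFFFFu, reload) < ticks) {
        /* spin */
    }
}
```

Unlike the DWT counter, SysTick also exists on most Cortex-M0/M0+ parts, which is why it is a reasonable fallback there.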

emece67 · « **Reply #18 on:** January 13, 2022, 08:42:37 am »

Simon · « **Reply #19 on:** January 13, 2022, 08:56:10 am »

Quote from: SiliconWizard on January 13, 2022, 12:04:59 am

And, with all that said, if you absolutely need sorta accurate delays down to a few cycles on modern 32-bit MCUs, you're probably doing something wrong. Very short software delays were fine on good old, fully predictable cores, often for bit-banging some IOs to emulate some peripherals. On any modern MCU, this is riddled with potential pitfalls, and there's usually another way of achieving the same thing using the right peripheral. Of course, that's a general thought; there may be a very good reason for doing that, but certainly this would often be the last resort.

You hit the nail on the head. I am controlling a good old Hitachi style character display (ST7066U) with a parallel port...... because.... all the serial port ones sold out.

It's not imperative here. The minimum delay between checks of the busy signal is 80µs. At 48MHz one clock cycle is 20.83ns, so that is 3840 cycles (nearly 4000), which is worth a bit of delay code or a timer with an interrupt.

This is a SAMC (Cortex-M0+). I doubt anyone would want to use an old character display on a whizz-bang anything above that; this is just the easy way until I have time to go graphic and find serial stock.
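The arithmetic behind that cycle count can be sketched as a small helper. The function name is made up; it rounds up so that the computed delay is never shorter than the requested time.

```c
#include <stdint.h>

/* Convert a time in nanoseconds to core clock cycles, rounding up.
   The 64-bit intermediate avoids overflow for realistic clock rates. */
static inline uint32_t ns_to_cycles(uint32_t ns, uint32_t clock_hz)
{
    return (uint32_t)(((uint64_t)ns * clock_hz + 999999999u) / 1000000000u);
}
```

For the numbers above, `ns_to_cycles(80000, 48000000)` gives 3840.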

wek · « **Reply #20 on:** January 13, 2022, 09:10:03 am »

The architectural NOP is optional, and may not be configured in a given incarnation of the Cortex-M core, see SCB ID_ISAR3.TrueNOP_instrs.

I've just checked, in STM32F427 it is set.

[EDIT] Upon second reading of ID_ISAR3.TrueNOP_instrs description, it says
0 None supported, ARMv7-M reserved.
which means that this value can't occur in ARMv7-M. In other words, architectural-NOP is always supported in Cortex-M3/4/7.

I did not know about the architectural-NOP being different from the traditional mov rx,rx. Thanks westfw for pointing this out.
[/EDIT]
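For anyone who wants to check their own part, a sketch of reading that field. It assumes the ARMv7-M layout, with ID_ISAR3 at 0xE000ED6C and TrueNOP_instrs in bits [27:24]; the macro and function names are made up (CMSIS exposes the register as `SCB->ISAR[3]` on ARMv7-M cores).

```c
#include <stdint.h>

/* Assumption: architectural ARMv7-M address of ID_ISAR3. */
#define SCB_ID_ISAR3  (*(volatile uint32_t *)(uintptr_t)0xE000ED6Cu)

/* Extract the TrueNOP_instrs field, bits [27:24]. Non-zero means the
   architectural NOP is implemented. */
static inline unsigned isar3_true_nop(uint32_t id_isar3)
{
    return (id_isar3 >> 24) & 0xFu;
}
```

On target this would be called as `isar3_true_nop(SCB_ID_ISAR3)`. The value 0x01111110, which I believe is what the Cortex-M3/M4 documentation lists for this register (treat that as an assumption), decodes to 1.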

BTW it might be an interesting idea to build a table of how these optional features are configured in various Cortex-M incarnations. Anybody in?

JW

Siwastaja · « **Reply #21 on:** January 13, 2022, 09:18:03 am »

Also M7 has dual issue, i.e., can run two instructions in parallel if they do not depend on each other. And obviously, since NOP does nothing, it does not depend on anything around it, so NOP can always run in parallel with any other instruction, or another NOP.

I noticed this behavior when hand-adjusting timing-critical M7 assembly code: adding a single NOP may or may not add a delay of 1 CPU cycle, depending on what's going on around the NOP, but adding two NOPs back-to-back always added at least one CPU cycle of delay.

Use ataradov's code. If you need super-small delay of just a single CPU cycle, then try NOP:
__asm__ __volatile__ ("nop");

Unless it's M7, this is very very likely to add one CPU cycle of delay.

But indeed, what's the use case? If you, for example, want to add a tiny dead time between high and low side transistors, just writing as two separate IO operations already gives you longer delay than 1 CPU cycle (depending on the bus clock to the IO port). Typical would be like 2-5, with possibly a cycle or two of jitter. If this is not long enough, then I fail to see how adding just one cycle more would change anything, so just use ataradov's code even though it does have a minimum delay of 4 cycles, plus the function call overhead (a few cycles, don't remember offhand how many).

Also remember, as others have mentioned, higher-end ARM CPUs can't access the flash at the CPU core frequency, so they fetch several instructions at once to let linear code run at full CPU speed. This can make jumps take longer, as the CPU has to wait for the flash access.

Simon · « **Reply #22 on:** January 13, 2022, 09:19:37 am »

Having looked at the datasheet for the display controller, it looks like the smallest time I need to wait is 80µs, or about 3840 cycles at 48MHz. For this it makes sense to use a counter. Now I need to work out how to run my display updating with interrupts.

I think it will come down to a single array variable that holds the display contents, which things can write data into, with the display code residing in an interrupt routine that just moves through the process of updating the characters. The 80µs is the minimum time I must allow between checks of the ready pin, but it looks like most operations take 37µs to complete in the display. However, the data read setup time is 320ns, which I assume applies when I read the busy flag; at 48MHz that is about 15 cycles, so still timeable with a counter.

So it looks like my code will establish where it got to in the process of sending a character, set the time value to the wait required for the next step, zero the counter and then arm the interrupt again.
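One way that plan could be sketched, assuming a 16x2 HD44780-style command set (Set DDRAM Address is 0x80 | address, second line at 0x40) and a fixed 80µs wait after every transfer instead of busy-flag polling. Every name here is hypothetical. The step function only decides what to send next; the timer interrupt would put the byte on the bus and re-arm itself with `wait_us`.

```c
#include <stdint.h>

#define LCD_COLS        16
#define LCD_ROWS        2
#define LCD_OP_WAIT_US  80   /* conservative wait taken from the thread */

enum lcd_phase { LCD_SET_ADDR, LCD_WRITE_CHAR };

/* Shadow buffer that the rest of the program writes into, plus progress state. */
struct lcd_updater {
    char shadow[LCD_ROWS][LCD_COLS];
    enum lcd_phase phase;
    uint8_t row, col;
};

/* What the interrupt handler should do next. */
struct lcd_action {
    uint8_t  is_data;   /* 0 = command (RS low), 1 = character data (RS high) */
    uint8_t  byte;
    uint16_t wait_us;   /* time to wait before the next step */
};

/* Advance the state machine by one bus transfer. */
static inline struct lcd_action lcd_next_action(struct lcd_updater *u)
{
    struct lcd_action a;

    if (u->phase == LCD_SET_ADDR) {
        /* Move the cursor to the start of the current row. */
        a.is_data = 0;
        a.byte = (uint8_t)(0x80u | (u->row ? 0x40u : 0x00u));
        u->col = 0;
        u->phase = LCD_WRITE_CHAR;
    } else {
        /* Send the next character of the row from the shadow buffer. */
        a.is_data = 1;
        a.byte = (uint8_t)u->shadow[u->row][u->col];
        if (++u->col == LCD_COLS) {
            u->row = (uint8_t)((u->row + 1u) % LCD_ROWS);
            u->phase = LCD_SET_ADDR;
        }
    }
    a.wait_us = LCD_OP_WAIT_US;
    return a;
}
```

Keeping the bus access out of the step function means the sequencing can be exercised on a PC before it goes anywhere near the hardware.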

brucehoult · « **Reply #23 on:** January 13, 2022, 09:36:06 am »

Quote from: westfw on January 13, 2022, 02:14:27 am

Older ARM architectures apparently didn't have a separate NOP instruction, so the assembler/compiler would generate a "mov r8, r8" instruction, or maybe "mov r0, r0"
There are of course MANY instructions like this that don't actually do anything, and many more that "only" affect the flags.

The ARMv6-M ARM (Architecture Reference Manual) says:

Quote
The timing effects of including a NOP instruction in code are not guaranteed. It can increase execution time, leave it unchanged, or even reduce it. NOP instructions are therefore not suitable for timing loops.

Sadly, there aren't any clear indications of what WOULD be "suitable for timing loops."

I don't know what plans they had for ARMv6-M when they wrote that manual, but it's used in Cortex-M0, M0+ and M1 *only*, none of which are ever going to optimise out a NOP.

Some cores executing ARMv6 were more sophisticated, with the 1.0 GHz ARM11 in the BCM2835 (Pi Zero and Zero W) at the top end. But that is still a simple single-issue in-order core that is never going to optimise out NOPs.

I don't think you have to worry for anything below a Cortex-A15 or A57.

Obviously, if you have a dual-issue core such as a Cortex-M7 or A7/A8/A9/A53/A55 then a NOP will occupy only one pipeline and some other instruction will execute in the other pipeline. If you put a lot of NOPs in a row then they will be executed two per cycle.

That's why it might be better to use something like ADD #1 or SUB #1 for a short delay. If you put a few dozen of them in a row then any CPU no matter how sophisticated has no choice but to execute them sequentially at one per clock cycle.

But write that in assembly language not C, if you don't want 100 a++'s to be optimised to a+=100 -- or even removed completely if it doesn't look like the result is used.
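A sketch of that idea with GCC inline assembly: eight ADDs that each depend on the previous result. The function name is made up. The mnemonic follows the same convention as the delay loop earlier in the thread (plain `add`, which assembles under GCC's default divided Thumb-1 syntax and under unified Thumb-2 syntax); the non-ARM branch exists only so the sketch builds off-target and has no timing meaning.

```c
#include <stdint.h>

/* Returns x + 8 after a chain of eight dependent ADD instructions. */
static inline uint32_t dependent_add_chain8(uint32_t x)
{
#if defined(__arm__) || defined(__thumb__)
    __asm__ volatile (
        ".rept 8            \n"   /* assembler repeats the body eight times   */
        "   add %0, %0, #1  \n"   /* each ADD needs the result of the previous */
        ".endr              \n"
        : "+l"(x)                 /* low register, read and written            */
        :
        : "cc"                    /* Thumb-1 encodes this as a flag-setting ADD */
    );
#else
    x += 8u;                      /* host fallback: same result, no delay intent */
#endif
    return x;
}
```

Because the asm is `volatile` and the chain is written out by the assembler, the compiler cannot fold the eight increments into a single add.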

tggzzz · « **Reply #24 on:** January 13, 2022, 10:08:55 am »

Quote from: SiliconWizard on January 13, 2022, 12:04:59 am

And, with all that said, if you absolutely need sorta accurate delays down to a few cycles on modern 32-bit MCUs, you're probably doing something wrong. Very short software delays were fine on good old, fully predictable cores, often for bit-banging some IOs to emulate some peripherals. On any modern MCU, this is riddled with potential pitfalls, and there's usually another way of achieving the same thing using the right peripheral. Of course, that's a general thought; there may be a very good reason for doing that, but certainly this would often be the last resort.

... except for the XMOS xCORE devices, where the hardware/software ecosystem precisely defines timing during design. None of this "measure it and hope" nonsense.

Bonus: lots of parallel cores, so up to 4000MIPS/chip

