Author Topic: ARM NOP  (Read 11026 times)


Offline SimonTopic starter

  • Global Moderator
  • *****
  • Posts: 18017
  • Country: gb
  • Did that just blow up? No? might work after all !!
    • Simon's Electronics
Re: ARM NOP
« Reply #25 on: January 13, 2022, 10:16:08 am »
Looking at the display driver datasheet, it almost looks like the enable/clock really is meant to be a clock signal with a period of at least 80µs. That helps a lot: I can basically use it as my base timer, generated by a hardware PWM that will also clock any other software counters.
 

Offline Kleinstein

  • Super Contributor
  • ***
  • Posts: 14736
  • Country: de
Re: ARM NOP
« Reply #26 on: January 13, 2022, 10:20:49 am »
And, with all that said: if you absolutely need sorta-accurate delays down to a few cycles on modern 32-bit MCUs, you're probably doing something wrong. Very short software delays were fine on the good old, fully predictable cores, often for bit-banging some IOs to emulate a peripheral. On any modern MCU this is riddled with potential pitfalls, and there's usually another way of achieving the same thing, using the right peripheral. Of course, that's a general thought; there may be a very good reason for doing it, but it would often be the last resort.

With bit banging one may still need short delays, though often with no need for accuracy: more like "at least some 100 ns" to give the external HW time to recognize a pulse / HS signal.
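For such short waits on a Cortex-M3 or later, the DWT cycle counter is handy. A minimal, untested sketch, assuming CMSIS headers and a made-up CPU_HZ constant for the core clock:

Code: [Select]
#include "stm32f4xx.h"            /* any CMSIS device header will do */

#define CPU_HZ  168000000UL       /* assumed core clock */

static void dwt_init(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  /* enable the DWT unit  */
    DWT->CYCCNT = 0;
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;            /* start cycle counting */
}

static inline void delay_ns(uint32_t ns)
{
    /* "At least" semantics: a few cycles of call overhead on top,
       which is exactly what bit-banged handshakes want. */
    uint32_t cycles = (uint32_t)(((uint64_t)ns * CPU_HZ) / 1000000000UL);
    uint32_t start  = DWT->CYCCNT;
    while ((DWT->CYCCNT - start) < cycles) { /* spin */ }
}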
 
The following users thanked this post: harerod

Online Siwastaja

  • Super Contributor
  • ***
  • Posts: 8754
  • Country: fi
Re: ARM NOP
« Reply #27 on: January 13, 2022, 10:23:43 am »
Yes, display designers have a weird tradition of calling the signal that every other industry (even academia) calls "clock" an "enable". Then, whenever they have an actual enable, they have to come up with some other name for it. This causes confusion for everyone the first time.
 

Offline SimonTopic starter

  • Global Moderator
  • *****
  • Posts: 18017
  • Country: gb
  • Did that just blow up? No? might work after all !!
    • Simon's Electronics
Re: ARM NOP
« Reply #28 on: January 13, 2022, 10:42:51 am »
Well, they do go all out to make it confusing: having declared the minimum low time to be 80µs when checking the busy flag, they then state that the clock cycle is to be 1.2µs. So it's also a two-speed clock, depending on what you are doing.
 

Offline wek

  • Frequent Contributor
  • **
  • Posts: 529
  • Country: sk
Re: ARM NOP
« Reply #29 on: January 13, 2022, 10:46:12 am »
Yes, display designers have a weird tradition of calling the signal that every other industry (even academia) calls "clock" an "enable". Then, whenever they have an actual enable, they have to come up with some other name for it. This causes confusion for everyone the first time.
That's a Motorola 68xx legacy. Its bus was clocked by a signal they called Enable. Hitachi (like perhaps all Japanese chipmakers) took over the basic idea of the 68xx architecture, and with it the bus style and terminology.

JW
 
The following users thanked this post: Bassman59, Siwastaja

Offline SimonTopic starter

  • Global Moderator
  • *****
  • Posts: 18017
  • Country: gb
  • Did that just blow up? No? might work after all !!
    • Simon's Electronics
Re: ARM NOP
« Reply #30 on: January 13, 2022, 02:29:19 pm »
https://www.newhavendisplay.com/app_notes/ST7066U.pdf

So it looks like I may as well just run a PWM output at 6 kHz or less into the enable, and every time the interrupt for the signal going high/low fires, do whatever the next step is. That way I have a single timing to worry about.

What confuses me is the 1.2µs cycle time for the enable, when it also wants an 80µs minimum low pulse while testing for the busy flag on D7. So really I run on a 160µs cycle time, or has someone got their ns and µs mixed up here?
 

Offline SimonTopic starter

  • Global Moderator
  • *****
  • Posts: 18017
  • Country: gb
  • Did that just blow up? No? might work after all !!
    • Simon's Electronics
Re: ARM NOP
« Reply #31 on: January 13, 2022, 02:40:25 pm »
This is all further complicated by all instructions taking 37µs to execute, except the ones that take over 1.5ms, so I may as well run a clock at 25-50kHz, not the 800kHz that the 1.2µs cycle time suggests.
 

Offline wek

  • Frequent Contributor
  • **
  • Posts: 529
  • Country: sk
Re: ARM NOP
« Reply #32 on: January 13, 2022, 03:07:45 pm »
You can also reprogram the timer on the fly, so that it always provides the optimum delay until the next operation.
 

Online nctnico

  • Super Contributor
  • ***
  • Posts: 27858
  • Country: nl
    • NCT Developments
Re: ARM NOP
« Reply #33 on: January 13, 2022, 03:38:25 pm »
You can also reprogram the timer on the fly, so that it always provides the optimum delay until the next operation.
That is how I create small delays on ARM controllers as well. I use a hardware timer that stops counting when it reaches the programmed number of counts. With the prescaler set so that it counts in microseconds, I just load it with the number of microseconds and wait until the counter stops.
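In code it is only a few lines. A rough sketch with STM32-style registers (basic timer TIM6 in one-pulse mode; the kernel clock constant is an assumption, check your part):

Code: [Select]
#include "stm32f4xx.h"                     /* any CMSIS device header */

#define TIM_CLK_HZ 84000000UL              /* assumed timer kernel clock */

void delay_timer_init(void)
{
    RCC->APB1ENR |= RCC_APB1ENR_TIM6EN;          /* clock the timer      */
    TIM6->PSC = (TIM_CLK_HZ / 1000000UL) - 1;    /* 1 count = 1 µs       */
    TIM6->EGR = TIM_EGR_UG;                      /* latch the prescaler  */
}

void delay_us(uint16_t us)
{
    TIM6->ARR = us;                              /* number of counts     */
    TIM6->CNT = 0;
    TIM6->CR1 = TIM_CR1_OPM | TIM_CR1_CEN;       /* one-pulse mode: CEN clears itself at update */
    while (TIM6->CR1 & TIM_CR1_CEN) { }          /* wait until the counter stops */
}

The one-pulse mode bit is what makes the hardware stop by itself, so the wait is a single flag poll with no interrupt needed.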
« Last Edit: January 13, 2022, 03:44:08 pm by nctnico »
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Online Siwastaja

  • Super Contributor
  • ***
  • Posts: 8754
  • Country: fi
Re: ARM NOP
« Reply #34 on: January 13, 2022, 03:39:08 pm »
You can also reprogram the timer on the fly, so that it always provides the optimum delay until the next operation.

This. Timers are not fixed-frequency generators or PWM generators that can only be adjusted at program startup. They are, after all, timers. Use them to create the interrupts and state changes you need, exactly when you need them.

Whenever the delay is significantly longer than the interrupt latency, it's fruitful to have the timer trigger an interrupt, so other code can run during the wait. If the delay is, say, under 50 clock cycles (or if you just have nothing else to do anyway), then you can simply block until the timer completes, making it a plain flag poll.

Use interrupt priorities to your advantage.
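As a hedged sketch of both patterns (STM32-ish registers again; TIM7 assumed free, and NVIC_EnableIRQ(TIM7_IRQn) assumed done at init):

Code: [Select]
#include "stm32f4xx.h"

volatile uint8_t step_due;              /* set by the ISR, consumed by main loop */

/* Short wait: just block on the update flag. */
void wait_short(uint16_t ticks)
{
    TIM7->ARR = ticks;
    TIM7->CNT = 0;
    TIM7->SR  = 0;                           /* clear a stale update flag */
    TIM7->CR1 = TIM_CR1_OPM | TIM_CR1_CEN;
    while (!(TIM7->SR & TIM_SR_UIF)) { }     /* simple flag poll          */
}

/* Long wait: arm a one-shot interrupt and return; other code keeps running. */
void arm_long(uint16_t ticks)
{
    TIM7->ARR  = ticks;
    TIM7->CNT  = 0;
    TIM7->SR   = 0;
    TIM7->DIER = TIM_DIER_UIE;               /* update interrupt enable   */
    TIM7->CR1  = TIM_CR1_OPM | TIM_CR1_CEN;
}

void TIM7_IRQHandler(void)
{
    TIM7->SR = 0;                            /* acknowledge               */
    step_due = 1;           /* the state machine advances in main context */
}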
« Last Edit: January 13, 2022, 03:40:49 pm by Siwastaja »
 

Offline SimonTopic starter

  • Global Moderator
  • *****
  • Posts: 18017
  • Country: gb
  • Did that just blow up? No? might work after all !!
    • Simon's Electronics
Re: ARM NOP
« Reply #35 on: January 13, 2022, 04:11:20 pm »
Yes, timings can be changed on the fly. But here I am doubting the display driver datasheet for mentioning something that did not need mentioning. Given the update speed I need, 1kHz will probably do; really I just need to go fast enough that the user does not see half of an old display and half of a new one, and that threshold is something like 100ms. If the display takes 100ms to update, no one will notice the characters changing one at a time.

With the clock slow enough, the whole thing can just sit there updating the screen all the time and running the rest of the program in between. With a display clock of less than 10kHz and a CPU clock in the MHz range, it's not even worth worrying about when to start a screen refresh: just do it perpetually.
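Something like this, as a sketch of the perpetual refresh. The lcd_send_command()/lcd_send_data() helpers are hypothetical placeholders for the actual pin wiggling; 0x80 is the real ST7066U set-DDRAM-address command, and a 16x2 module is assumed:

Code: [Select]
#include <stdint.h>

#define LCD_CHARS          32            /* 16x2 character module        */
#define LCD_SET_DDRAM_ADDR 0x80          /* ST7066U set-address command  */

extern void lcd_send_command(uint8_t cmd);  /* hypothetical helpers that  */
extern void lcd_send_data(uint8_t data);    /* do the actual pin wiggling */

volatile uint8_t frame[LCD_CHARS];       /* application writes text here */

/* Hook to the slow (1-10 kHz) timer/PWM interrupt: one bus transaction
   per tick, forever. */
void lcd_tick_isr(void)
{
    static uint8_t step = 0;             /* 2 commands + 32 characters   */

    if (step == 0)
        lcd_send_command(LCD_SET_DDRAM_ADDR | 0x00);  /* row 0           */
    else if (step == 17)
        lcd_send_command(LCD_SET_DDRAM_ADDR | 0x40);  /* row 1 is at 0x40 */
    else if (step < 17)
        lcd_send_data(frame[step - 1]);               /* chars 0..15     */
    else
        lcd_send_data(frame[step - 2]);               /* chars 16..31    */

    step = (step + 1) % 34;
    /* At 1 kHz the whole display redraws every 34 ms, well under the
       ~100 ms where anyone would notice the characters changing. */
}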
« Last Edit: January 13, 2022, 04:12:59 pm by Simon »
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6836
  • Country: fi
    • My home page and email address
Re: ARM NOP
« Reply #36 on: January 13, 2022, 10:58:07 pm »
You can also reprogram the timer on the fly, so that it always provides the optimum delay until the next operation.
Funnily enough, I use exactly the same approach in application development on Linux, to create timeouts and such.

Essentially, I have a binary min-heap of the timeout times (and an associated linear array so that timeouts can be canceled etc.), with the root element containing the time of the next timeout event.  It works well for any number of timeouts, and once one works out the corner cases (like multiple timeouts elapsing at the same time, or trying to set a timeout that is already in the past at the moment the timer is re-armed, et cetera), it is quite robust, too.
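A compressed sketch of the heap part, with the cancellation bookkeeping left out and the deadline source assumed to be some monotonic microsecond clock:

Code: [Select]
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t deadline;            /* absolute time, monotonic clock */
    void   (*fn)(void *);         /* callback to run on expiry      */
    void    *arg;
} timeout_t;

static timeout_t heap[64];        /* min-heap ordered by deadline   */
static size_t    used;

static void sift_up(size_t i)
{
    while (i && heap[(i - 1) / 2].deadline > heap[i].deadline) {
        timeout_t tmp = heap[i];
        heap[i] = heap[(i - 1) / 2];
        heap[(i - 1) / 2] = tmp;
        i = (i - 1) / 2;
    }
}

static void sift_down(size_t i)
{
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, m = i;
        if (l < used && heap[l].deadline < heap[m].deadline) m = l;
        if (r < used && heap[r].deadline < heap[m].deadline) m = r;
        if (m == i) return;
        timeout_t tmp = heap[i]; heap[i] = heap[m]; heap[m] = tmp;
        i = m;
    }
}

int timeout_add(uint64_t deadline, void (*fn)(void *), void *arg)
{
    if (used == sizeof heap / sizeof heap[0]) return -1;
    heap[used] = (timeout_t){ deadline, fn, arg };
    sift_up(used++);
    return 0;
}

/* Fire everything already expired; the root then holds the next event,
   so the caller re-arms the one hardware timer to heap[0].deadline. */
void timeout_poll(uint64_t now)
{
    while (used && heap[0].deadline <= now) {
        timeout_t t = heap[0];
        heap[0] = heap[--used];
        sift_down(0);
        t.fn(t.arg);
    }
}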
 
The following users thanked this post: emece67

Offline emece67

  • Frequent Contributor
  • **
  • !
  • Posts: 614
  • Country: 00
Re: ARM NOP
« Reply #37 on: January 14, 2022, 12:07:10 am »
.
« Last Edit: August 19, 2022, 05:05:27 pm by emece67 »
 

Offline DavidAlfa

  • Super Contributor
  • ***
  • Posts: 6222
  • Country: es
Re: ARM NOP
« Reply #38 on: January 14, 2022, 01:04:59 am »
Code is located in SRAM so that execution time is not subject to flash wait states.

I've tested SRAM execution before and noticed it was slower than flash in most cases.
There are no flash wait states, but due to bus sharing (the SRAM bus has to be shared between instruction fetches and data accesses) you can get much the same effect.
If the device has CCM RAM, any code there will execute at 100% core speed (unless you also access the CCM RAM as data), as it has its own dedicated bus, so no collisions with flash, SRAM or other buses can happen.
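For reference, placing a function there with GCC looks something like this; the function and its arguments are made up for illustration, the ".ramfunc" section name must match your linker script, the startup code has to copy the section out of flash at boot, and note the CCM is only code-capable on some families (F3 yes; on F4 it is data-only):

Code: [Select]
#include <stdint.h>

/* Hot function forced into RAM; everything it calls should live there
   too, or the benefit is lost on the flash round-trips. */
__attribute__((section(".ramfunc"), noinline))
void i2c_clock_burst(volatile uint32_t *odr, uint32_t mask, int pulses)
{
    while (pulses--) {
        *odr |=  mask;       /* SCL high */
        *odr &= ~mask;       /* SCL low  */
    }
}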
« Last Edit: January 14, 2022, 06:19:22 pm by DavidAlfa »
Hantek DSO2x1x            Drive        FAQ          DON'T BUY HANTEK! (Aka HALF-MADE)
Stm32 Soldering FW      Forum      Github      Donate
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6836
  • Country: fi
    • My home page and email address
Re: ARM NOP
« Reply #39 on: January 14, 2022, 01:22:11 am »
Essentially, I have a binary min-heap of the timeout times (and an associated linear array so that timeouts can be canceled etc.), with the root element containing the time of the next timeout event.  It works well for any number of timeouts, and once one works out the corner cases (like multiple timeouts elapsing at the same time, or trying to set a timeout that is already in the past at the moment the timer is re-armed, et cetera), it is quite robust, too.
Isn't this called "virtual timers"? That is, using a single HW counter to behave as n software timers. I also use it (but, as my programming skills suck, I use a plain doubly-linked list of timeouts).
Sure, that's one name you can use.

In practice, I usually use a dedicated POSIX thread for this, blocking in pthread_cond_timedwait() until either the next event or a signal on the condition variable indicating that the event set has changed.  This means there usually aren't any userspace timers involved at all; it is just the kernel waking up the blocking thread at a suitable time.
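The dedicated-thread pattern, boiled down to a sketch with a single deadline (a real version pops the min-heap instead; times are absolute CLOCK_REALTIME unless the condvar is configured otherwise):

Code: [Select]
#include <pthread.h>
#include <stdbool.h>
#include <errno.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static struct timespec next_deadline = { .tv_sec = 0x7fffffff };  /* "never" */
static bool            changed;

static void *timer_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    for (;;) {
        int err = pthread_cond_timedwait(&cond, &lock, &next_deadline);
        if (err == ETIMEDOUT) {
            /* Deadline reached: dispatch the timeout event here. */
        } else if (changed) {
            changed = false;   /* event set was modified; loop around and
                                  wait on the updated next_deadline */
        }
    }
    return NULL;               /* not reached */
}

/* Callers update the deadline under the lock, then wake the thread. */
void reschedule(struct timespec when)
{
    pthread_mutex_lock(&lock);
    next_deadline = when;
    changed = true;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}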

I've also used POSIX signals and a simple linear array.  Signals (if installed without the SA_RESTART flag) have the benefit of interrupting a blocking call (with errno==EINTR) in the thread used to run the userspace signal handler function; but in a multithreaded program the dedicated-thread approach is still more efficient (and can optionally dispatch the interrupting signal via e.g. pthread_kill()).

Many ways to solve the same problem.  If there are fewer than, say, a dozen concurrent timeouts, a linear array will do just fine; the search (to find an unused slot, and to find the next one to elapse) is not too slow, and the code will be simpler and much more maintainable.

What I have never tried is using a timer and a DMA channel (set up to copy a single byte or word when the timer elapses) for automagic timeout flags.  This would not necessarily interrupt anything (except to set up the next elapsing timeout flag), but would not cause (measurable) latencies either, and would work even with interrupts disabled, so it might work well for co-operative timeouts on hardware that is capable of it and has an otherwise unused timer and a DMA channel.
 

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4300
  • Country: us
Re: ARM NOP
« Reply #40 on: January 14, 2022, 01:26:41 am »
Quote
I've tested SRAM execution before and noticed it was slower than flash in most cases.
I've been wondering about that.  Which CPUs did you test?

Presumably, the sort of cycle-counting busy loop we're talking about here would not have any data accesses.

Trying to get deterministic timing out of most ARM chips, by cycle-counting your instructions, is difficult.

IO on these sorts of timeframes (80us, 3000+ cycles) is annoying.  That's too slow to feel great about busy-looping, and ... somewhat fast for interrupts.  (although, a 115200bps UART interrupts about every 90us, and no one bats an eye at using interrupts for that, even on slower CPUs.)  I suppose a lot depends on whether you have anything else to do during that 80us...
 

Online ataradov

  • Super Contributor
  • ***
  • Posts: 11720
  • Country: us
    • Personal site
Re: ARM NOP
« Reply #41 on: January 14, 2022, 01:35:19 am »
There are probably some pathological cases where running from SRAM would be slower, but that is not generally the case. Most instructions are fetched two at a time, and on fast devices with more than 1 WS, running from SRAM would almost always win. It may be a toss-up on a typical 48 MHz Cortex-M0+ with 1 WS flash.

Also, if instructions are fetched from the flash anyway, then you have to share the bus no matter what, so I really don't see how running from the flash can ever be faster.

And on CM4/CM7 it is not an issue at all, since they have two buses and usually require a ton of wait states.
« Last Edit: January 14, 2022, 01:39:36 am by ataradov »
Alex
 

Offline DavidAlfa

  • Super Contributor
  • ***
  • Posts: 6222
  • Country: es
Re: ARM NOP
« Reply #42 on: January 14, 2022, 08:49:26 am »
Yep, the last time I tried it was on a basic ARM core, an STM32F103.
There was a small function clocking out I2C data (software mode); using the .ramfunc section was actually slower than normal flash execution.
Probably beefier devices (F4, F7 series) perform better.
I didn't investigate further, as the speed had to be much lower anyway; I just made a small test and compared the waveforms between flash and RAM execution.
« Last Edit: January 14, 2022, 08:52:16 am by DavidAlfa »
Hantek DSO2x1x            Drive        FAQ          DON'T BUY HANTEK! (Aka HALF-MADE)
Stm32 Soldering FW      Forum      Github      Donate
 

Online Siwastaja

  • Super Contributor
  • ***
  • Posts: 8754
  • Country: fi
Re: ARM NOP
« Reply #43 on: January 14, 2022, 09:33:52 am »
Are there MCUs where sequential instructions all incur the flash wait state penalty?

Probably yes, but I don't think I have seen any. Up to some point, the CPU and flash run at the same frequency with no wait states. Beyond that point the CPU runs faster and the flash needs wait states, but at the same time the flash usually transitions to a word wider than a single instruction, allowing prefetch. The simplest implementations have no cache or "acceleration" of any kind, yet linear code still runs at full speed. The average speed is definitely not 1 / WAIT_STATES.

I have never been a huge fan of running from normal SRAM (if core-coupled memory isn't available), because while it might improve performance a bit, it does not make it predictable; the bus is still shared with DMA, for example.

Whereas running from ITCM is the greatest thing since sliced bread, with no real compromises. M7 devices are not the only ones with ITCM; for example, the STM32F334 comes with CCM (the same thing under a different name).
 

Offline wek

  • Frequent Contributor
  • **
  • Posts: 529
  • Country: sk
Re: ARM NOP
« Reply #44 on: January 14, 2022, 10:17:18 am »
There are probably some pathological cases where running from SRAM would be slower, but that is not generally the case. Most instructions are fetched two at a time, and on fast devices with more than 1 WS, running from SRAM would almost always win.
And on CM4/CM7 it is not an issue at all, since they have two buses and usually require a ton of wait states.
Add CM3, and add one bus. In CM7 the picture is further blurred by... all the stuff... including the massive conventional L1 caches on the AXI bus, so the timing picture is very complex there, with results all over the place depending on particularities. So let's just ignore CM7 for now.

What David is talking about is running from SRAM on the S bus, versus from flash on both the I and D buses (for fetches / constant-data reads) while user data stay in SRAM on the S bus.

The S bus is architecturally made so that it imposes one extra cycle on *every* read; the reason is that it is heavily loaded by all the slave buses the processor can access (except flash), whereas I and D are loaded only by the flash (and in some cases by SRAM intended for fast code execution, which is then called CCM, at least in STM32s). Also, as David said, when code is fetched through the same port through which data are read and written, there are collisions; and writing to, and especially reading from, peripherals may be surprisingly expensive. Sure, there's prefetch in the processor, which helps a bit, but the damage is already done.

OTOH, the latency of the flash is mitigated by the flash being 128/256 bits wide and by automatic prefetch (although that's a slightly two-edged thing, as pessimal jump patterns do exist). In the STM32F2/F4 (and F7, although I said I wouldn't talk about that :-) ), even nonlinear code is served through the jump cache, and constant data (which usually means the address pool, i.e. a contiguous space) through a small data cache (ART is the marketing name for this combo). There may be similar jump caches or conventional caches in other manufacturers' implementations; I don't follow them.

So yes, depending on circumstances, execution from flash may win over execution from SRAM. As part of the marketing push for the then-new F2/F4, ST produced a comparison (unfortunately without publishing the source code), see https://community.st.com/sfc/servlet.shepherd/document/download/0690X0000060I7OQAU pp. 36-45.

JW
 

Online Siwastaja

  • Super Contributor
  • ***
  • Posts: 8754
  • Country: fi
Re: ARM NOP
« Reply #45 on: January 14, 2022, 10:56:21 am »
CM7 has been easiest for me, timing-wise, really!

Why? Because they come with ITCM and DTCM*, enabling predictable code execution under all conditions, and predictable, fast variable access (stack or anything else), independent of everything else going on (DMAs, for example).

*) Well, not always; those are optional features of the core, but the STM32F7 and H7 series, which I do use, have fairly large ITCM and DTCM.

Also, just the raw power can compensate quite a bit. I don't have to worry about an interrupt latency of 12 cycles at 400MHz. At 20MHz, it might matter.

Caches are uninteresting. I have never enabled them. They are not meant for anything where timing is important; they are the "easy" way to improve the average case whenever the worst case doesn't matter: UIs, for example.

It's those mid-range devices where timing is iffy. You have a limited CPU clock, a limited bus clock, DMAs consuming a significant share of bus cycles, and usually no core-coupled memories.
 

Offline wek

  • Frequent Contributor
  • **
  • Posts: 529
  • Country: sk
Re: ARM NOP
« Reply #46 on: January 14, 2022, 01:00:30 pm »
CM7 has been easiest for me, timing-wise, really!

Why? Because they come with ITCM and DTCM
[...]
It's those mid-range devices where timing is iffy.
I see your points, I really do, but beg to differ.

You claim that this provides predictable timing, but timing of what? On that STM32H7xx, a single write to a GPIO traverses three* bus matrices and the buffers/bridges between them, and *none* of those is documented, timing-wise.

I'd formulate it differently: no matter what the raw clock and architecture of the processor are, the techniques designers use to achieve high clock rates result in roughly the same outward signal timing/jitter/latency/whatever. Faster chips introduce more uncertainties. You can do better in the higher-end chips, but it requires extra care, and depending on particularities, sometimes you simply can't. For better timing, always go for hardware.

2 eurocents.

JW

[EDIT] * two, I had a look and there is a D1-to-D3 interconnect; however, that means one of those matrices is AXIM...
« Last Edit: January 14, 2022, 02:58:34 pm by wek »
 

Online MK14

  • Super Contributor
  • ***
  • Posts: 4901
  • Country: gb
Re: ARM NOP
« Reply #47 on: January 14, 2022, 05:12:36 pm »
Well, they do go all out to make it confusing: having declared the minimum low time to be 80µs when checking the busy flag, they then state that the clock cycle is to be 1.2µs. So it's also a two-speed clock, depending on what you are doing.

I agree it seems confusing. I've only had a relatively quick look at the datasheet, so I could easily be confused/wrong myself. But I'll attempt to explain why there are fast elements, around 1MHz/1µs, and much slower 80µs/1ms/hundreds-of-ms timings.

The 'fast' timings (if you can call 1MHz/1µs fast these days!) belong to the actual hardware interface itself, which seems to prefer a regularly active clock, i.e. some kind of low-level serial/parallel peripheral hardware.

The much slower timings, such as the 80µs for the busy flag to be active, are probably the worst-case times (with suitably active clock(s) and enable signals) that the display's internal MCU and/or graphics hardware takes (via software/microcode/hardwired character functions) to perform the requested tasks: e.g. 80µs to set/reset the busy flag, or many milliseconds (and much, much longer) to perform more complicated display functions.
« Last Edit: January 14, 2022, 05:16:25 pm by MK14 »
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 20511
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: ARM NOP
« Reply #48 on: January 14, 2022, 05:30:59 pm »
CM7 has been easiest for me, timing-wise, really!

Why? Because they come with ITCM and DTCM
[...]
It's those mid-range devices where timing is iffy.
I see your points, I really do, but beg to differ.

You claim that this provides predictable timing, but timing of what? On that STM32H7xx, a single write to a GPIO traverses three* bus matrices and the buffers/bridges between them, and *none* of those is documented, timing-wise.

I'd formulate it differently: no matter what the raw clock and architecture of the processor are, the techniques designers use to achieve high clock rates result in roughly the same outward signal timing/jitter/latency/whatever. Faster chips introduce more uncertainties. You can do better in the higher-end chips, but it requires extra care, and depending on particularities, sometimes you simply can't. For better timing, always go for hardware.

2 eurocents.

JW

[EDIT] * two, I had a look and there is a D1-to-D3 interconnect; however, that means one of those matrices is AXIM...

Consider a separate timer directly attached to each and every I/O port, connected to several cores via a connection matrix. With that, the cores can tell each port the clock cycle on which to perform output (or input), or record the clock cycle on which an input occurred. That ensures the time taken to transit the connection matrix is irrelevant.

Consider each core acting independently, with no caches and no interrupts. That removes timing jitter, so the exact number of clock cycles for an instruction sequence can be determined in advance, without measuring and hoping you've stumbled on the worst-case timing. You might as well use the same connection matrix between the cores.

Consider having a toolset that automates that, and a simple language designed for parallelism.

No, it isn't a dream of the future! It has been a reality since 2007. 

You can buy XMOS xCORE processors at DigiKey, program them in xC, and use the free Eclipse-based IDE.

Oh, there aren't pages and pages of errata for the software and hardware - it just works as advertised.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Online Siwastaja

  • Super Contributor
  • ***
  • Posts: 8754
  • Country: fi
Re: ARM NOP
« Reply #49 on: January 14, 2022, 07:18:44 pm »
You claim that this provides predictable timing, but timing of what? On that STM32H7xx, a single write to a GPIO traverses three* bus matrices and the buffers/bridges between them, and *none* of those is documented, timing-wise.

I'd formulate it differently: no matter what the raw clock and architecture of the processor are, the techniques designers use to achieve high clock rates result in roughly the same outward signal timing/jitter/latency/whatever. Faster chips introduce more uncertainties. You can do better in the higher-end chips, but it requires extra care, and depending on particularities, sometimes you simply can't. For better timing, always go for hardware.

Yeah, there is non-zero jitter in clock cycles, but it is quite tight in actual time, because of the high clock frequencies.

Obviously, if you need zero-jitter IO with completely predictable timing, a general-purpose modern MCU simply isn't the right tool. You need a PLD, or a special-purpose MCU like tggzzz's favorite. Maybe in some very limited cases, where a coarse time unit and low performance are acceptable and predictability and low jitter are the main goals, an old 8-bitter with handwritten assembly does the trick.

But there are many, many cases where timing accuracy within ~100-200ns is acceptable. In such cases, an M7 running just bog-standard C code, without special considerations, using interrupts and interrupt priorities to structure the application, is the easiest solution, and good enough.

When counting clock cycles and proving cycle accuracy, we often fail to see the forest for the trees: actual time, in seconds rather than clock cycles, is all that matters, and in the physical world jitter and uncertainty are always there; we just need to define what we can accept. Even XMOS has jitter, just maybe ~2-3 orders of magnitude less than a bare-metal Cortex-M7 application, which again has maybe ~2-3 orders of magnitude less jitter than one written by an inexperienced trainee using example code and libraries.
« Last Edit: January 14, 2022, 07:23:07 pm by Siwastaja »
 

