For example, the worst interrupt latency of a 400 MHz part won't be 100 times less than the interrupt latency of a 4 MHz part.
Except that it (nearly, roughly) does. For example, staying within the ARM Cortex-M universe: take the lowest-cost M0 part with its 16-cycle interrupt latency and run it at 4 MHz, then compare it to an M7 part with 12-cycle interrupt latency running code from ITCM at 400 MHz, and you get better than a 100x improvement. If you consider the worst-case difference in the presence of multiple pending interrupt sources, the M7's tail-chaining of back-to-back interrupts increases the advantage even further.
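To put concrete numbers on the base case (architectural latencies only; real parts add flash wait states and bus delays on top):

    M0:  16 cycles / 4 MHz   = 4000 ns
    M7:  12 cycles / 400 MHz =   30 ns
    4000 ns / 30 ns ≈ 133x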
Now granted, if doing I/O right at the beginning of the ISR is critical, as it often is, then that 400 MHz part will lose a few cycles synchronizing the write into the GPIO peripheral's slower bus-clock domain, and that pretty much eats the few cycles we gained in interrupt latency over the M0; but oh well, it's still pretty close to a 100x difference, give or take.
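For concreteness, a minimal sketch of the kind of ISR in question; the vector name and the GPIO register address are purely illustrative, not from any specific part:

    #include <stdint.h>

    /* Illustrative GPIO bit-set register; the real address and the
       bus-domain synchronization penalty depend on the actual part. */
    #define GPIO_BSRR (*(volatile uint32_t *)0x48000018u)

    void TIM1_IRQHandler(void)
    {
        GPIO_BSRR = (1u << 5);  /* first useful instruction: the store
                                   still stalls on the peripheral-bus sync */
        /* ...rest of the time-critical work... */
    }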
On the other hand, if you compare that 12-cycle latency at 400 MHz to an AVR's 4-cycle latency at 4 MHz, the difference isn't 100x, but still about 33x (1 µs versus 30 ns). Except that the AVR does not do hardware stacking, so only in rare cases can you actually execute something meaningful in the first instruction of the ISR. So in reality, where you have to stack something first, it's close to 100x again; or, if you just write your ISRs in C rather than hand-crafted assembly, the stacking is done first anyway, so the M7 at 400 MHz is probably again at least 100x as fast.
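A minimal avr-gcc sketch of why, assuming an ATmega-class part: even a trivial C handler gets a compiler-generated prologue before its first useful instruction, because the hardware only stacks the return address:

    #include <stdint.h>
    #include <avr/interrupt.h>

    volatile uint8_t tick;

    ISR(TIMER1_COMPA_vect)   /* vector name from an ATmega-class part */
    {
        tick++;  /* avr-gcc emits register pushes and an SREG save before
                    this line: software stacking that Cortex-M hardware
                    does for you during those 12 cycles */
    }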
So it all depends, but in any case, rules of thumb originating from desktop computing are utterly useless here, and that is exactly where these market shills try to divert the discussion: by mentioning stuff such as caches, TLBs and whatnot, which is utterly irrelevant to the topic, because every Cortex-M7 runs timing-critical code from ITCM if you so wish, bypassing the caches completely and running with predictable memory performance equivalent to what you would get with guaranteed zero cache misses; and the M7 doesn't even have the concept of a TLB.
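A sketch of what "running from ITCM" looks like in practice; the section name and the copy-at-startup step are assumptions that depend on your linker script and vendor startup code:

    /* Place the handler in the ITCM section; the startup code is assumed
       to copy this section from flash into ITCM before main(). */
    __attribute__((section(".itcm"), used))
    void TIM1_IRQHandler(void)
    {
        /* fetched from single-cycle ITCM: no flash wait states, no
           cache involved, deterministic instruction timing */
    }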
Also, their pipelining is simple enough that we are discussing a jitter of a few cycles. Say a loop usually takes 12 cycles but sometimes takes 11. And these few cycles are exactly what this thread is all about: not caches, not large differences, not timing guarantees, not gross underutilization, but cycle accuracy and getting rid of even small jitter in those few cases where it does matter.
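That kind of jitter is directly measurable with the ARMv7-M cycle counter; a sketch using the standard DWT/DEMCR register addresses (CMSIS exposes the same registers as DWT->CYCCNT):

    #include <stdint.h>

    #define DEMCR      (*(volatile uint32_t *)0xE000EDFCu)
    #define DWT_CTRL   (*(volatile uint32_t *)0xE0001000u)
    #define DWT_CYCCNT (*(volatile uint32_t *)0xE0001004u)

    static void cyccnt_init(void)
    {
        DEMCR      |= (1u << 24);  /* TRCENA: power up the DWT unit */
        DWT_CYCCNT  = 0;
        DWT_CTRL   |= 1u;          /* CYCCNTENA: start counting cycles */
    }

    static uint32_t cycles(void (*fn)(void))
    {
        uint32_t t0 = DWT_CYCCNT;
        fn();
        return DWT_CYCCNT - t0;    /* run repeatedly: 11 vs 12 shows up here */
    }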
It is unsurprising that this particular market shill then tries to divert the discussion yet again to a totally different off-topic subject, using the same claims they never stop using, despite those claims having been proved irrelevant again and again. It is nearly comical that they have the guts to complain about others going off-topic when those others are simply replying to their off-topic posts. Which probably was a mistake; the good old saying "don't feed the troll" applies pretty well to this master troll, who has been doing this nonstop for years.
Caches in microcontrollers are some of the most widely misunderstood features. Just like a JPEG decoder peripheral, they are extras that cost only a little die area (and maybe a bit of static power consumption) if not used, but otherwise do not hinder performance. And because timing-critical control using interrupts is a colossally basic use case for any microcontroller, those caches have absolutely nothing to do with it: such code reaches decent performance running directly out of flash, and excellent performance running out of core-coupled instruction RAM, which is a basic feature of nearly every MCU that has a cache, too.
And when you discuss further with this particular troll, you will see what happens. Despite the forum section saying microcontrollers, and the discussion being about microcontrollers, suddenly application processors like Cortex-A pop up as example cases. No wonder the misconception that "high-end microcontrollers" are somehow not proper microcontrollers timing-wise is so widespread. Some people actually have a day job of confusing the field for others.
But in actual microcontrollers, just like some applications need SPI and others need I2C, some applications need ITCM and others need cache. Say, a UI application drawing animations or playing MP3 files from an SD card could enable the caches and greatly benefit from them, while a DC/DC converter or motor controller would enable ITCM and not bother enabling the caches at all, because there they are irrelevant.
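As a sketch of that split, using the standard CMSIS-Core cache functions for Cortex-M7 (the device header name and the init function names are made up for illustration):

    #include "device.h"   /* hypothetical device header pulling in CMSIS core_cm7.h */

    void ui_app_init(void)         /* GUI/MP3-style workload: caches help */
    {
        SCB_EnableICache();
        SCB_EnableDCache();
    }

    void motor_ctrl_init(void)     /* hard real-time loop: caches stay off
                                      (they are disabled out of reset on
                                      Cortex-M7); run the code from ITCM */
    {
    }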
In such a case, a disabled cache is simply irrelevant; all it does is increase the chip cost by a few cents. Except that having fewer different parts (and instead more capable parts with many options) improves the economies of scale, so in the end, selling parts that have a cache, even if most users run them with it disabled, is still the cheapest option.
But as a result, there is underutilization. Which is a big thing in MCUs: nearly every MCU is 90% underutilized, if not more, whether you count CPU cycles, RAM, or flash, let alone peripheral utilization. Yet an XCORE with 9 of its 16 cores in use, executing NOPs 99% of the time to emulate hardware peripherals, is clearly not underutilization. (And I like the concept and the idea, just not their marketing presence here.)